RFM analysis is a popular data-driven technique for segmenting customers by their purchasing behavior. Each customer is assigned a separate recency (R), frequency (F) and monetary (M) score, each on a scale from 1 to 5. We get the RFM score by writing the recency (R), frequency (F) and monetary (M) scores side by side as a single code. However, the interpretation of this score may vary with the line of business the company is in.
For example, if you sell cars, you expect your customers to make high-priced purchases, but their frequency and recency scores will be low. In such cases, interpreting the individual scores separately gives more accurate results.

Recency — (R)
Frequency — (F)
Monetary — (M)

* Recency: How recently the customer made a purchase
* Frequency: How often the customer makes a purchase
* Monetary: How much money the customer spends on purchases
    
> RFM analysis numerically scores a customer on a scale of 1 to 5 in each of these three categories (the higher the number, the better the result). The “best” customer gets the highest score in each category (555).
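
As a minimal sketch of how three 1 to 5 scores become a single code such as 555, the snippet below concatenates them as text. The helper name `combine_scores` is made up for illustration and is not part of any library.

```python
def combine_scores(r: int, f: int, m: int) -> str:
    """Concatenate three 1-5 scores into one RFM code such as '555'."""
    for name, value in (("R", r), ("F", f), ("M", m)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score must be between 1 and 5, got {value}")
    # The scores are joined as text, not added: 5, 5, 5 gives '555', not 15.
    return f"{r}{f}{m}"


best = combine_scores(5, 5, 5)   # highest score in every category
worst = combine_scores(1, 1, 1)  # lowest score in every category
```

Because the code is a string, each digit keeps its meaning: the first is always recency, the second frequency, the third monetary.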

![](https://d35fo82fjcw0y8.cloudfront.net/2018/03/01013508/Incontent_image.png)

# Step by Step RFM Analysis

# 1) Let's Explore the Dataset:

We will use the “online_retail_II” dataset in our RFM analysis. It contains the sales data of a UK-based online store that sells souvenirs, covering the period from 01/12/2009 to 09/12/2011.
The original Excel file has 2 sheets. We will use sheet 2, which is “Year 2010–2011”.

* #Importing Libraries

In [None]:
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

* #Reading the Dataset

In [None]:
df = pd.read_csv("../input/online-retail-ii-uci/online_retail_II.csv")

After reading the dataset, we can take a first look at its rows.

In [None]:
df.head()

**Variables**

*Invoice:* Invoice number, unique to each invoice. If the code starts with C, the invoice is a cancellation (return).

*StockCode:* Product code. It is unique for each product.

*Description:* Product name

*Quantity:* Number of units of the product sold on the invoice. On invoices starting with C, the quantity is negative.

*InvoiceDate:* Invoice date and time

*Price:* Unit price of the product

*Customer ID:* Customer number. It is a unique number for each customer

*Country:* The name of the country. Indicates the country where the customer lives.

In [None]:
df.describe().T

Are there any missing observations in the dataset? If so, how many are there in each variable?

In [None]:
df.isnull().sum()

We remove the rows with missing observations from the dataset. We also pass the 'inplace=True' parameter, so that `df` itself is modified instead of a new DataFrame being returned.

In [None]:
df.dropna(inplace=True)
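
To make the effect of dropping missing rows concrete, here is a small sketch on a made-up three-row frame (the invoice numbers and products are invented for illustration). It also shows an alternative that is an assumption on my part, not what the notebook does: restricting the drop to rows without a `Customer ID`, which is the only column RFM strictly needs to be present.

```python
import numpy as np
import pandas as pd

# Toy data: one row lacks a description, another lacks a customer number.
toy = pd.DataFrame({
    "Invoice": ["489434", "489435", "489436"],
    "Description": ["MUG", None, "CANDLE"],
    "Customer ID": [13085.0, 13085.0, np.nan],
})

# Count of missing values per column, as df.isnull().sum() does above.
missing_per_column = toy.isnull().sum()

# Without inplace=True, dropna returns a new frame and leaves `toy` untouched.
cleaned = toy.dropna()

# Narrower alternative: only drop rows that have no customer number.
with_customer = toy.dropna(subset=["Customer ID"])
```

Dropping on every column is the stricter choice; the `subset` form keeps rows whose only gap is in a column the analysis never reads.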

Let's see how many unique products there are.

In [None]:
df["Description"].nunique()

How often each product appears in the dataset can also give us useful insight.

In [None]:
df["Description"].value_counts().head()

We can list the 10 most ordered products, from most to least.

In [None]:
df.groupby("Description").agg({"Quantity": "sum"}).sort_values(by="Quantity", ascending=False).head(10)

A 'C' in the invoice number marks a canceled transaction. We need to remove canceled transactions from the dataset to make our analysis more robust. Also, just in case, we remove any rows with zero or negative values in the Quantity and Price columns.

In [None]:
df = df[~df["Invoice"].str.contains("C", na=False)]
df = df[(df['Quantity'] > 0)]
df = df[(df['Price'] > 0)]
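
The three filters can be seen at work on a made-up four-row frame (all values invented for illustration). This sketch uses `str.startswith("C")` rather than `str.contains("C")`; that is a swapped-in, stricter variant, assuming cancellations are always marked by a leading C as the variable description says.

```python
import pandas as pd

# Toy invoices: one cancellation, one zero price, one negative quantity.
toy = pd.DataFrame({
    "Invoice": ["536365", "C536379", "536380", "536381"],
    "Quantity": [6, -1, 3, -2],
    "Price": [2.55, 27.50, 0.0, 1.25],
})

# True only where the invoice code begins with the cancellation prefix.
is_cancelled = toy["Invoice"].astype(str).str.startswith("C")

# Keep rows that are not cancelled and have positive quantity and price.
valid = toy[~is_cancelled & (toy["Quantity"] > 0) & (toy["Price"] > 0)]
```

Combining the conditions in one boolean mask gives the same rows as filtering three times in a row.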

For use in our calculations, we create a variable called 'TotalPrice', which is the total revenue of each invoice line (quantity times unit price).

In [None]:
df["TotalPrice"] = df["Quantity"] * df["Price"]

# 2) Calculating RFM Metrics

We take the latest date in the dataset and add 2 days to it. We will use this as the reference date ("today") when calculating recency, that is, the number of days since each customer's last purchase.

In [None]:
today_date = pd.to_datetime(df['InvoiceDate'].max()) + pd.DateOffset(days=2)

In [None]:
rfm = df.groupby('Customer ID').agg({'InvoiceDate': lambda InvoiceDate: (today_date - pd.to_datetime(InvoiceDate).max()).days,
                                     'Invoice': lambda Invoice: Invoice.nunique(),
                                     'TotalPrice': lambda TotalPrice: TotalPrice.sum()})
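
To check the three metrics by hand, the sketch below runs the same idea on a made-up frame with two customers (IDs, invoice codes, dates and amounts are invented). It uses pandas named aggregation, which is a stylistic choice of this sketch and not the notebook's own form; it yields the final column names directly.

```python
import pandas as pd

# Toy data: customer 1 has two invoices (A1 has two lines), customer 2 has one.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-08", "2011-11-30"]
    ),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5],
})

# Reference date: two days after the latest invoice, i.e. 2011-12-10.
today_date = toy["InvoiceDate"].max() + pd.DateOffset(days=2)

toy_rfm = toy.groupby("Customer ID").agg(
    # Days between the reference date and the customer's last purchase.
    recency=("InvoiceDate", lambda d: (today_date - d.max()).days),
    # Number of distinct invoices, so multi-line invoices count once.
    frequency=("Invoice", "nunique"),
    # Total amount spent across all invoice lines.
    monetary=("TotalPrice", "sum"),
)
```

Customer 1 last bought on 2011-12-08, so recency is 2 days, with 2 invoices and 35.0 spent; customer 2 last bought on 2011-11-30, so recency is 10 days.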

In [None]:
rfm.head()

We rename the columns to match the 3 basic metrics we introduced at the beginning of the article. In addition, we exclude customers whose monetary value is not greater than zero.

In [None]:
rfm.columns = ['recency', 'frequency', 'monetary']

rfm = rfm[(rfm['monetary'] > 0)]

# 3) Generating and Converting RFM Scores to a Single Variable

In [None]:
rfm["recency_score"] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])

rfm["frequency_score"] = pd.qcut(rfm['frequency'].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])

rfm["monetary_score"] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])
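
Two details of the scoring cell are worth a closer look. The recency labels run from 5 down to 1 because a smaller recency (a more recent purchase) is better. The frequency column is ranked with `method="first"` before binning because many customers share the same frequency, and `pd.qcut` cannot build five distinct quantile edges from heavily tied values. The sketch below shows this on a made-up series of ten frequencies.

```python
import pandas as pd

# Toy frequencies: six customers bought once, so the values are heavily tied.
frequency = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

# Direct binning fails: several quantile edges fall on the same value.
try:
    pd.qcut(frequency, 5, labels=[1, 2, 3, 4, 5])
    direct_qcut_failed = False
except ValueError:
    direct_qcut_failed = True

# Ranking with method="first" gives every row a unique rank (ties are
# broken by row order), so five equal-sized bins can always be formed.
scores = pd.qcut(frequency.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
```

The price of this workaround is that customers with the same frequency can land in different score bins, depending only on their position in the table.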

To get the final score, we concatenate the recency and frequency scores we have obtained. Since the monetary score is mostly similar to the frequency score, we will not use it here.

In [None]:
rfm["RFM_SCORE"] = (rfm['recency_score'].astype(str) +
                    rfm['frequency_score'].astype(str))

In [None]:
rfm.head()

In [None]:
rfm.describe().T

# Classification of RFM Scores by Segments

We will segment the customers according to their RF scores. The map below shows which scores correspond to which segment. We will do this using regular expressions in Python.

![](https://miro.medium.com/max/1234/0*JJBP4ToZiaw0HVPN.png)

In [None]:
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}

In [None]:
rfm['segment'] = rfm['RFM_SCORE'].replace(seg_map, regex=True)
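
To see how the regular expressions carve up the score grid, the sketch below matches each two-digit score against the same patterns with Python's `re` module. The helper `segment_for` is hypothetical and is only an illustration of the lookup, not the pandas code path used above. It also checks that each of the 25 possible scores falls into exactly one segment.

```python
import re

seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}


def segment_for(score: str) -> str:
    """Return the segment whose pattern matches the whole two-digit score."""
    for pattern, segment in seg_map.items():
        if re.fullmatch(pattern, score):
            return segment
    raise ValueError(f"no segment defined for score {score!r}")


# For every recency/frequency pair, count how many patterns match it.
all_scores = [f"{r}{f}" for r in range(1, 6) for f in range(1, 6)]
matches_per_score = {
    s: sum(bool(re.fullmatch(p, s)) for p in seg_map) for s in all_scores
}
```

The first digit is the recency score and the second is the frequency score, so `5[4-5]` reads as "bought very recently and buys often".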
rfm.head()

We now have an RFM score and a segment for every customer. Using these, we can both comment on customer behavior and decide on a strategy for the future.
Customers with similar RFM scores share behavioral characteristics, so they can be grouped together.

With the customers segmented in the final version of our table, let's comment on the points we find important, both for deciding on actions and for understanding the structure of the segments (mean RFM values).

Let's look at the worst and the best customers first.

In [None]:
rfm[rfm["RFM_SCORE"] == "11"].head()

In [None]:
rfm[rfm["RFM_SCORE"] == "55"].head()

Or we can summarize the monetary value of the customers in the Loyal Customers category.

In [None]:
loyal_customers = rfm[rfm['segment'] == 'loyal_customers']
loyal_customers = loyal_customers.groupby("segment").agg({"monetary": ["mean", "sum"]})
loyal_customers.head()

To see how many customers we have in each segment, we can use the code below.

In [None]:
rfm["segment"].value_counts()

To compare the monetary value of the segments, we can group by segment and examine the mean and the total.

In [None]:
rfm_monetary = rfm.groupby("segment").agg({"monetary": ["mean", "sum"]})

# Flatten the two-level column index to plain 'mean' and 'sum'
rfm_monetary.columns = rfm_monetary.columns.droplevel(0)
# Most valuable segments (by mean spend per customer) first
rfm_monetary.sort_values(by="mean", ascending=False)
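
As an alternative sketch, pandas named aggregation can produce flat column names in one step, so no column level needs to be dropped afterwards. The frame below is made up for illustration, and the `customers` column is an addition of this sketch.

```python
import pandas as pd

# Toy segmented table with invented monetary values.
toy_rfm = pd.DataFrame({
    "segment": ["champions", "champions", "at_Risk", "hibernating"],
    "monetary": [1000.0, 3000.0, 500.0, 100.0],
})

summary = (
    toy_rfm.groupby("segment")
    .agg(
        mean=("monetary", "mean"),        # average spend per customer
        sum=("monetary", "sum"),          # total spend of the segment
        customers=("monetary", "count"),  # number of customers in the segment
    )
    .sort_values(by="mean", ascending=False)
)
```

Seeing the customer count next to the mean helps when a segment has a high average but very few members.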

In [None]:
rfm[['segment','recency','frequency','monetary']].groupby("segment").agg(['mean','count'])

# CONCLUSION

We can examine some segments in detail.

Champions [R(5), F(4–5)]:

* There are 837 customers in this group.
* On average, their last purchase was 9.5 days ago.
* They have shopped 19.33 times on average.
* The company has earned an average of 10901.12 pounds from each of them.

At Risk [R(1–2), F(3–4)]:

* There are 753 customers in this group.
* On average, their last purchase was 373.6 days ago.
* They have shopped 3 times on average.
* The company has earned an average of 1382 pounds from each of them.

Need Attention [R(3), F(3)]:

* There are 266 customers in this group.
* On average, their last purchase was 114.3 days ago.
* They have shopped 3.15 times on average.
* The company has earned an average of 1286.65 pounds from each of them.

Can't Lose [R(1–2), F(5)]:

* There are 71 customers in this group.
* On average, their last purchase was 332.2 days ago.
* They have shopped 15 times on average.
* The company has earned an average of 8355.67 pounds from each of them.

Loyal Customers [R(3–4), F(4–5)]:

* There are 1161 customers in this group.
* On average, their last purchase was 68.06 days ago.
* They have shopped 9.83 times on average.
* The company has earned an average of 4211.83 pounds from each of them.

Since the customers in the "At Risk" category are both high in monetary value and large in number, I would recommend working on those customers first. Moreover, it should be relatively easy to move them to Loyal Customers status. For the same reason, the "Need Attention" group also stands out as a segment to focus on. The "Loyal Customers" segment is likewise very important and profitable for the company because of the monetary value it brings. "Can't Lose" customers, although highly profitable, are very few in number. As a general strategy, we may aim to move customers in the Can't Lose, At Risk and Need Attention segments into Loyal Customers.