In this notebook, I walk through an implementation of RFM (Recency, Frequency and Monetary) analysis, a customer-segmentation technique popularised in the 1960s and 1970s. It may be old, but Wharton's Customer Analytics course on Coursera teaches it pretty well, so if you feel like checking it out, here's the [link](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=video&cd=&cad=rja&uact=8&ved=2ahUKEwjj4ZeB3bvrAhU16nMBHRuRCJ8QtwIwAXoECAEQAQ&url=https%3A%2F%2Fwww.coursera.org%2Flecture%2Fwharton-customer-analytics%2Fbeyond-period-2-7fduC&usg=AOvVaw3-I-Zc9QAF2OrUWPaAAZGW).

# Importing and Reading Data:

Importing pandas for working with the data:

In [None]:
import pandas as pd

In [None]:
# Reading the data:
data = pd.read_csv("../input/retailtransactiondata/Retail_Data_Transactions.csv")

In [None]:
data.head()

# Playing with the Data:

In [None]:
# Recency needs only two of the three columns; .copy() avoids a SettingWithCopyWarning later:
recency = data[['trans_date', 'customer_id']].copy()

Checking the data for the number of unique values:

In [None]:
recency.apply(pd.Series.nunique)

Let's take a look at the dimensions of our data:

In [None]:
recency.shape

Since there are more rows than unique customer IDs, a lot of customers bought from us more than once.
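
To make that reasoning concrete, here is a minimal sketch on a hypothetical toy table (the customer IDs and dates below are made up, not taken from the dataset). It compares the row count with the number of unique customers and lists who bought more than once:

```python
import pandas as pd

# Hypothetical toy transactions, only for illustration.
toy = pd.DataFrame({
    "customer_id": ["CS1", "CS2", "CS1", "CS3", "CS1", "CS2"],
    "trans_date": ["01-Jan-15", "03-Jan-15", "10-Jan-15",
                   "11-Jan-15", "20-Jan-15", "25-Jan-15"],
})

n_rows = len(toy)                             # number of transactions
n_customers = toy["customer_id"].nunique()    # number of distinct customers

# Transactions per customer; anyone with a count above 1 is a repeat buyer.
purchases_per_customer = toy["customer_id"].value_counts()
repeat_buyers = purchases_per_customer[purchases_per_customer > 1]

print(n_rows, n_customers)
print(repeat_buyers)
```

The same `value_counts` call on `data['customer_id']` should show how purchases are spread across the real customers.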

Time to work with dates:

In [None]:
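# Parsing the date strings into datetimes; assign returns a new frame, so there is no chained-assignment warning:
recency = recency.assign(trans_date=pd.to_datetime(recency['trans_date']))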

In [None]:
# `now` is the latest date available in the data; recency is measured relative to it:
now = recency['trans_date'].max()

As we saw above, a single customer can appear in many rows. The groupby function in pandas groups the rows that share a customer_id, so taking the max of each group gives every customer's most recent purchase date:

In [None]:
recency = recency.groupby(['customer_id']).max()

Okay, so recency refers to the time since the last purchase. Let's find out the number of days:

In [None]:
recency_days = now - recency['trans_date']

In [None]:
recency_days = pd.DataFrame(recency_days)
recency_days.head()

Monetary refers to the total money spent by a customer over time:

In [None]:
monetary = data[['customer_id', 'tran_amount']]

Using groupby here, I sum up all the transactions of each customer:

In [None]:
monetary = monetary.groupby(['customer_id']).sum()

In [None]:
monetary.head()

Frequency refers to the number of times a customer has made purchases:

In [None]:
frequency = data[['customer_id', 'trans_date']]

Using groupby with count gives the number of purchases each customer has made:

In [None]:
frequency = frequency.groupby(['customer_id']).count()

In [None]:
frequency.head()

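Before concatenating the dataframes, the recency timedeltas need to become whole days:

In [None]:
# .dt.days extracts the day count; astype('timedelta64[D]') raises an error in recent pandas versions
recency = pd.DataFrame(recency_days['trans_date'].dt.days)
recency.columns = ['recency']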

Taking a look at this beauty:

In [None]:
recency.head()


Now that all three key elements are plain numbers, let's concatenate them:

In [None]:
rfm = pd.concat([recency, frequency, monetary], axis=1)

In [None]:
# Defining the columns:
rfm.columns=['recency', 'frequency', 'monetary']

rfm.head()
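
As an aside, the three tables can also be built in a single groupby pass. The sketch below uses a hypothetical toy table with the same column names as the dataset (the values are made up) and pandas named aggregation; it is one possible shortcut, not a replacement for the step-by-step cells above:

```python
import pandas as pd

# Hypothetical toy transactions with the dataset's column names.
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "trans_date": pd.to_datetime([
        "2015-01-01", "2015-03-01", "2015-02-15",
        "2015-03-10", "2015-03-12", "2015-03-16",
    ]),
    "tran_amount": [50, 70, 30, 20, 40, 60],
})

# Reference date: the latest transaction in the data, as in the notebook.
snapshot = toy["trans_date"].max()

toy_rfm = toy.groupby("customer_id").agg(
    last_purchase=("trans_date", "max"),   # most recent purchase date
    frequency=("trans_date", "count"),     # number of purchases
    monetary=("tran_amount", "sum"),       # total amount spent
)

# Recency in whole days between the reference date and the last purchase.
toy_rfm["recency"] = (snapshot - toy_rfm["last_purchase"]).dt.days
toy_rfm = toy_rfm[["recency", "frequency", "monetary"]]

print(toy_rfm)
```

Because everything comes from one groupby, the three columns share the same customer index and no concatenation is needed.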

# RFM Analysis:

I think we should take a look at the distribution because plotting can give a really good idea of what is up with the customers:

In [None]:
# Plotting the number of days since each customer's last purchase:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,8))
sns.set_context("poster")
sns.set_palette(['skyblue'])
# histplot with kde=True stands in for the deprecated distplot
sns.histplot(rfm['recency'], kde=True, stat='density')
plt.xlabel('Days since last purchase')

In [None]:
# Plotting the number of times the customer has made a purchase:

plt.figure(figsize=(8,8))
sns.set_context("poster")
sns.set_palette(['pink'])
# histplot with kde=True stands in for the deprecated distplot
sns.histplot(rfm['frequency'], kde=True, stat='density')


In [None]:
# Plotting the total revenue that the particular customer brought in to the shop:

plt.figure(figsize=(8,8))
sns.set_context("poster")
sns.set_palette(['orange'])
# histplot with kde=True stands in for the deprecated distplot
sns.histplot(rfm['monetary'], kde=True, stat='density')
plt.xlabel('Dollars')

Now that we have a basic idea of the distribution of the three indicators, quantiles can tell us more about where the bulk of the customers sit.
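
A minimal sketch of such a quantile check, on a hypothetical toy RFM table (the numbers are made up for illustration). On the real data, the same call would be `rfm.quantile([0.25, 0.5, 0.75])`:

```python
import pandas as pd

# Hypothetical toy RFM table with the notebook's column names.
toy_rfm = pd.DataFrame(
    {
        "recency": [0, 10, 20, 30, 40],
        "frequency": [1, 2, 3, 4, 5],
        "monetary": [100, 200, 300, 400, 500],
    },
    index=pd.Index(["A", "B", "C", "D", "E"], name="customer_id"),
)

# One row per quantile, one column per indicator.
toy_quartiles = toy_rfm.quantile([0.25, 0.5, 0.75])
print(toy_quartiles)
```

Each row of the result is a cut-off: a quarter of the customers fall at or below the 0.25 row, half at or below the 0.5 row, and so on.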

# KMeans Algorithm:

For a while now I have had the idea of combining the KMeans algorithm with RFM analysis. The steps to identify the superfans of any business are the following:

1) Preprocessing the data to find the three key elements needed:
   - Recency
   - Frequency
   - Monetary Value

2) Using the KMeans algorithm to find the cluster of customers with a high average monetary value and a high purchase frequency.

3) Putting that cluster through RFM analysis.

4) Treating the customers with the highest aggregate RFM scores as the actual superfans.

In [None]:
# Making a copy so that I don't lose my temper over messing up a previously perfect dataframe.
RFM = rfm.copy()

# Importing KMeans and finding clusters:
from sklearn.cluster import KMeans

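# random_state keeps the cluster numbering stable between runs; the label numbers themselves are arbitrary
km = KMeans(n_clusters = 4, init = 'k-means++', n_init = 10, random_state = 42)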

nclusters = km.fit_predict(RFM)

clusters = pd.DataFrame(nclusters, columns = ['clusters'], index = RFM.index)

# Concatenating the clusters with the RFM dataframe:
rfmK= pd.concat([RFM, clusters], axis=1)

Let's take a look at the clusters:


In [None]:
plt.figure(figsize=(8,8))
sns.scatterplot(data = rfmK, x='frequency', y='monetary', hue='clusters')

In this run, the cluster labelled 2 clearly has the highest frequency and monetary value, so let's use it. Cluster 1 holds the thrifty customers, who neither visit the shop often nor spend much. Clusters 0 and 3 sit between the two extremes. KMeans assigns the label numbers arbitrarily, so they may differ when the notebook is rerun.
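
Instead of reading the winning cluster off the plot, it can also be picked programmatically. The sketch below is one way to do that, assuming a frame shaped like `rfmK`; the toy values and the names `toy_rfmK`, `profile` and `best_cluster` are hypothetical:

```python
import pandas as pd

# Hypothetical toy frame shaped like rfmK (values are made up).
toy_rfmK = pd.DataFrame({
    "recency": [5, 8, 60, 70, 30, 35],
    "frequency": [28, 30, 8, 6, 18, 20],
    "monetary": [2000, 2200, 400, 300, 1200, 1300],
    "clusters": [2, 2, 1, 1, 0, 0],
})

# Average frequency and spend per cluster.
profile = toy_rfmK.groupby("clusters")[["frequency", "monetary"]].mean()

# The cluster with the highest average spend is the superfan candidate.
best_cluster = profile["monetary"].idxmax()

print(profile)
print("Best cluster:", best_cluster)
```

Using the looked-up label in place of a hard-coded number keeps the later cells correct even when KMeans numbers the clusters differently.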


In [None]:
RFMKMEANS = rfmK.copy()

# Naming the clusters as described above (2 = highest frequency and spend, 1 = lowest):
cluster_names = {2: 'SuperFans', 3: 'UsualCustomers', 0: 'FrequentCustomers', 1: 'Thrifters'}
RFMKMEANS['clusters'] = RFMKMEANS['clusters'].map(cluster_names)

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("poster", font_scale=0.7)
sns.set_palette('twilight')
sns.countplot(x='clusters', data=RFMKMEANS)

Now that we have the cluster, it's time to separate it from the dataframe:

In [None]:
superfans_df = rfmK.loc[rfmK['clusters'] == 2]
superfans_df.head()

In [None]:
# Dropping the clusters column because it holds a single value; reassigning avoids a SettingWithCopyWarning:
superfans_df = superfans_df.drop(columns=['clusters'])

In [None]:
# Let's take a look at the quantiles:
superfans_df.quantile([.25, .5, .75, 1], axis=0)

In [None]:
# Copying the RFM dataset so that it isn't affected by the changes:
RFMscores = superfans_df.copy()

# RFM SCORE CARD:

Below, each column is converted into an RFM score between 1 and 4. These scores will come in handy when calculating the aggregate RFM score. The cut-offs come from the quantiles above.

> '4' being great and '1' being poor.

Note that the columns are scored in different directions. The higher the monetary value, the more likely it is to score 4, since more revenue is never a bad thing. Frequency works the same way: the higher the frequency, the more beneficial it is for business. Recency is reversed: the smaller the value, the more likely it is to score 4, since a more recent customer is more likely to come back.

In [None]:
# Converting Recency:

RFMscores['recency'] = [4 if x <= 14 else x for x in RFMscores['recency']]
RFMscores['recency'] = [3 if 14 < x <= 37 else x for x in RFMscores['recency']]
RFMscores['recency'] = [2 if 37 < x <= 70 else x for x in RFMscores['recency']]
RFMscores['recency'] = [1 if x > 70 else x for x in RFMscores['recency']]

In [None]:
# Converting Frequency:

RFMscores['frequency'] = [1 if a <= 24 else a for a in RFMscores['frequency']]
RFMscores['frequency'] = [2 if a == 25 else a for a in RFMscores['frequency']]
RFMscores['frequency'] = [3 if 27 > a > 25 else a for a in RFMscores['frequency']]
RFMscores['frequency'] = [4 if a >= 27 else a for a in RFMscores['frequency']]

In [None]:
# Converting Monetary:

RFMscores['monetary'] = [1 if x < 1721 else x for x in RFMscores['monetary']]
RFMscores['monetary'] = [2 if 1721 <= x < 1808 else x for x in RFMscores['monetary']]
RFMscores['monetary'] = [3 if 1808 <= x < 1954 else x for x in RFMscores['monetary']]
RFMscores['monetary'] = [4 if 1954 <= x else x for x in RFMscores['monetary']]
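
The chained list comprehensions above work because no score value (1 to 4) falls inside a later raw-value range, but that makes them sensitive to ordering. As an alternative sketch, `pd.cut` can bin a column in one step. The thresholds below are copied from the recency and monetary cells above; the toy values are hypothetical:

```python
import pandas as pd

# Hypothetical toy values, chosen to hit each bin and the bin edges.
toy = pd.DataFrame({
    "recency": [3, 14, 20, 50, 90],
    "monetary": [1500, 1721, 1900, 1954, 2500],
})

inf = float("inf")

# Recency: lower is better, so the labels run from 4 down to 1.
# Bins are right-closed: (-inf, 14], (14, 37], (37, 70], (70, inf).
r_score = pd.cut(
    toy["recency"], bins=[-inf, 14, 37, 70, inf], labels=[4, 3, 2, 1]
).astype(int)

# Monetary: higher is better. right=False makes the bins left-closed:
# [-inf, 1721), [1721, 1808), [1808, 1954), [1954, inf).
m_score = pd.cut(
    toy["monetary"], bins=[-inf, 1721, 1808, 1954, inf],
    labels=[1, 2, 3, 4], right=False,
).astype(int)

print(r_score.tolist())
print(m_score.tolist())
```

Keeping the raw column untouched and writing the scores to a new Series also makes it easier to re-run a cell without scoring already-scored values.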

Let's check if we got it right by taking a look at the number of unique values in each column:

In [None]:
RFMscores.apply(pd.Series.nunique)

In [None]:
RFMscores.head()

# Aggregate RFM Score:

The aggregate RFM Score is basically the average of the RFM values in the RFM Scores dataset:

In [None]:
score = pd.DataFrame((RFMscores['recency'] + RFMscores['frequency'] + RFMscores['monetary'])/3, columns=['AggrScore'])


In [None]:
# Concatenating the two:
RFMscores = pd.concat([RFMscores, score], axis = 1)

In [None]:
RFMscores.head()
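
For a quick sanity check of the averaging, here is a small worked example on hypothetical scores (three made-up customers). It uses `mean(axis=1)`, which gives the same result as adding the three columns and dividing by 3:

```python
import pandas as pd

# Hypothetical RFM scores for three customers.
toy_scores = pd.DataFrame(
    {"recency": [4, 2, 1], "frequency": [4, 3, 1], "monetary": [4, 1, 1]},
    index=["A", "B", "C"],
)

# Row-wise mean of the three score columns.
toy_scores["AggrScore"] = toy_scores[["recency", "frequency", "monetary"]].mean(axis=1)

print(toy_scores)
```

Customer A scores (4 + 4 + 4) / 3 = 4, customer B scores (2 + 3 + 1) / 3 = 2, and customer C scores (1 + 1 + 1) / 3 = 1.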

# Most Valuable Customers (MVCs):

In [None]:
# Using Quantiles we find the limit for our top 25% customers:
RFMscores.quantile([.75], axis=0)

The top customers are the ones whose aggregate RFM score falls in the top 25%. The cell below uses a score of 3 as that cut-off:

In [None]:
topcustomers = RFMscores.loc[RFMscores['AggrScore'] >= 3, 'AggrScore']
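
The cut-off of 3 is typed in by hand after reading the quantile table. A sketch of computing it directly instead, on hypothetical aggregate scores (the names `toy_scores`, `cutoff` and `top` are made up for this example):

```python
import pandas as pd

# Hypothetical aggregate scores for eight customers.
toy_scores = pd.Series(
    [4.0, 3.0, 2.0, 1.0, 3.0, 2.0, 1.0, 2.0],
    index=list("ABCDEFGH"),
    name="AggrScore",
)

# 75th percentile of the scores: customers at or above it are the top group.
cutoff = toy_scores.quantile(0.75)
top = toy_scores[toy_scores >= cutoff]

print(cutoff)
print(top)
```

Note that ties at the cut-off can make the selected group larger than exactly 25% of the customers.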

In [None]:
MVCs = pd.DataFrame(topcustomers, columns = ['AggrScore'])

In [None]:
MostValuableCustomers = list(MVCs.index)

In [None]:
MostValuableCustomers

These are the rows of customers who are the most valuable.

In [None]:
rfmK.loc[MVCs.index]

And thus, we have the IDs of the customers who love the shop! If you liked the notebook, an upvote would be much appreciated. :-)