## RFM Analysis & Association Rules For Successful Customer Segmentation
*“RFM is a method used for analyzing customer value”.*
* It groups customers based on their transaction history:
  * Recency        — How many days ago was their last purchase?
  * Frequency      — How many times has the customer purchased from our store?
  * Monetary Value — How much do they spend?


<hr>

# Introduction
<br>
<font color = 'blue'>
<b>Content: </b>

1. [Load Libraries](#1)
1. [Load Dataset](#2)
1. [Spending of Countries](#3)
1. [How RFM Analysis Works](#4)
1. [Find Recency, Monetary and Frequency](#5)
    * [RFM Dataframe](#6)
    * [RFM Segmentation](#7)
    * [RFM Segmentation Readily Answers These Questions For Business](#8)
        * [Who are my best customers?](#9)
        * [Which customers are on the verge of churning?](#10)
        * [Who are the lost customers?](#11)
        * [Who are the loyal customers?](#12)
    * [Summing the RFM Score](#13)
        * [Segmenting Customers Based On The RFM Sum Score](#14)
        * [RFM Segmentation Visualization](#15)
1. [K-Means Segmentation](#16)
    * [Plot RFM Distributions](#17)
    * [Normalization](#18)
    * [Elbow Method](#19)
    * [KMeans Clustering with 4 Clusters](#20)
    * [Which Cluster Contains Our Best Customers](#21)
1. [Association Rules](#22)
    * [Support & Confidence Values](#23)

<a id = "1"></a><br>
## Load Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt
import squarify
from scipy.stats import norm
from sklearn.preprocessing import MinMaxScaler
#
from sklearn.cluster import KMeans
#
import plotly.offline as pyo 
import plotly.graph_objs as go
import plotly.figure_factory as ff
#
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


<a id = "2"></a><br>
## Load Dataset

In [None]:
data = pd.read_csv("/kaggle/input/ecommerce-data/data.csv",encoding = 'unicode_escape')
data.head()

In [None]:
# Count of Countries 
data["Country"].value_counts()

In [None]:
# Check missing values
data.isnull().sum()

In [None]:
#Total Price
data['TotalPrice'] = data['UnitPrice'] * data['Quantity']
data.head()

In [None]:
# Total Spending of Countries
data_country = data.groupby("Country").agg({'TotalPrice': lambda x: x.sum()})

In [None]:
# Drop Unnecessary Countries for Visualization 
data_country.drop(["RSA","Unspecified","EIRE","European Community","Channel Islands"],axis=0,inplace=True)
data_country.head()

<a id = "3"></a><br>
## Spending of Countries

In [None]:
# data_country is indexed by country name, so read the totals as a plain list
# instead of indexing the Series with integer positions
price = data_country["TotalPrice"].tolist()

# Row labels are ISO-3 codes, listed in the same alphabetical order as data_country
country_price = pd.DataFrame(index=["AUS","AUT","BHR","BEL","BRA","CAN","CYP","CZE","DNK","FIN","FRA","DEU","GRC","HKG","ISL","ISR",
                                    "ITA","JPN","LBN","LTU","MLT","NLD","NOR","POL","PRT","SAU","SGP","ESP","SWE","CHE","USA",
                                    "ARE","GBR"],columns=["TotalPrice","country"])
country_price["country"] = data_country.index
country_price["TotalPrice"] = price
country_price.head()

In [None]:
# Choropleth keyed by country name; the deprecated colorbar option `autotick` is dropped
worldmap = [dict(type = 'choropleth', locations = country_price['country'], locationmode = 'country names',
                 z = country_price['TotalPrice'], autocolorscale = True, reversescale = False, 
                 marker = dict(line = dict(color = 'rgb(180,180,180)', width = 0.5)), 
                 colorbar = dict(title = 'Total Price'))]

# Plotly projection names are lowercase
layout = dict(title = 'Total Price For Each Country', geo = dict(showframe = False, showcoastlines = True, 
                                                                projection = dict(type = 'mercator')))

fig = dict(data=worldmap, layout=layout)
pyo.iplot(fig, validate=False)

<a id = "4"></a><br>
## How RFM Analysis Works

* Choose the event that signifies activity for your application, and set the date range you want to analyze. An ecommerce app might track purchases, while a media app might monitor content viewed or rated.

* For every user who has performed the defined event, the Analysis will calculate:

  * How many times the event has occurred
  * The last time a user performed the event

View a complete analysis of your user base on a Recency and Frequency Grid, broken down into segments.

![](https://d35fo82fjcw0y8.cloudfront.net/2017/12/06085307/In-content-screen-shot-1.png)
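The two per-user quantities described above (how many times the event occurred, and when it last occurred) can be sketched with a pandas `groupby`. This is a minimal illustration on a made-up event log; the column names `user` and `when`, the sample rows, and the `snapshot` date are assumptions, not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical event log: one row per purchase event (illustrative data only)
events = pd.DataFrame({
    "user": ["a", "a", "b", "c", "c", "c"],
    "when": pd.to_datetime(["2011-11-01", "2011-12-01", "2011-06-15",
                            "2011-10-10", "2011-11-20", "2011-12-05"]),
})
snapshot = pd.Timestamp("2012-01-01")  # assumed end of the analysis window

per_user = events.groupby("user")["when"].agg(
    frequency="count",   # how many times the event occurred
    last_event="max",    # the last time the user performed the event
)
# Recency expressed as whole days between the last event and the snapshot date
per_user["recency_days"] = (snapshot - per_user["last_event"]).dt.days
print(per_user)
```

The same pattern scales to the real transaction table later in the notebook, where the event is an invoice and the user is a `CustomerID`.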

### Recency and Frequency Grid
* A low recency and frequency score (bottom left) represents hibernating users who haven’t been active recently or frequently. A high recency and frequency score (top right) represents users who have been active recently and frequently, indicating your app’s champions.

* Users are then ranked by percentile. For example, the user who performed the activity most recently falls in the 100th percentile. Each user is then assigned a score from 1 to 5 based on that percentile, with 5 being the highest.
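A rough sketch of the percentile-to-score step, assuming ten made-up recency values and equal-width percentile bands of 20 points each. The band boundaries are an assumption for illustration; the notebook itself uses quantile-based tiles further down.

```python
import numpy as np
import pandas as pd

# Hypothetical days-since-last-activity for ten users (illustrative values)
recency_days = pd.Series([3, 10, 25, 40, 60, 90, 120, 200, 300, 365])

# Rank so that the most recent user (fewest days) gets the highest rank
rank = recency_days.rank(ascending=False)
percentile = 100 * rank / len(rank)

# Map ranks to scores 1..5: the bottom 20% score 1, the top 20% score 5
score = np.ceil(rank * 5 / len(rank)).astype(int)

print(pd.DataFrame({"recency_days": recency_days,
                    "percentile": percentile,
                    "score": score}))
```

Note that recency is ranked in descending order: fewer days since the last activity means a higher percentile and therefore a higher score.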

### The Recency and Frequency Grid breaks your user base down into:

* Champions [R(4 – 5), F(4 – 5)]
* Loyal Customers [R(3 – 4), F(4 – 5)]
* Potential Loyalists [R(4 – 5), F(2 – 3)]
* Promising [R(3 – 4), F(0 – 1)]
* Can’t Lose Them [R(1 – 2), F(4 – 5)]
* At Risk [R(1 – 2), F(3 – 4)]
* About to Sleep [R(2 – 3), F(1 – 2)]
* Hibernating [R(1 – 2), F(1 – 2)]
* New Customers [R(4 – 5), F(0 – 1)]
* Need Attention [R(2 – 3), F(2 – 3)]
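The grid above can be read as a lookup table from a pair of scores to a segment name. The sketch below is one possible encoding, not the method used later in this notebook. Several of the listed ranges overlap, so it assumes that the first matching row in the listed order wins; the helper name `grid_segment` and the `"Unassigned"` fallback are hypothetical.

```python
# (segment name, recency score range, frequency score range), in the order listed above.
# Ranges are inclusive and copied as written; overlaps are resolved by first match (an assumption).
SEGMENTS = [
    ("Champions",           (4, 5), (4, 5)),
    ("Loyal Customers",     (3, 4), (4, 5)),
    ("Potential Loyalists", (4, 5), (2, 3)),
    ("Promising",           (3, 4), (0, 1)),
    ("Can't Lose Them",     (1, 2), (4, 5)),
    ("At Risk",             (1, 2), (3, 4)),
    ("About to Sleep",      (2, 3), (1, 2)),
    ("Hibernating",         (1, 2), (1, 2)),
    ("New Customers",       (4, 5), (0, 1)),
    ("Need Attention",      (2, 3), (2, 3)),
]

def grid_segment(r, f):
    """Return the first segment whose recency and frequency ranges both contain the scores."""
    for name, (r_lo, r_hi), (f_lo, f_hi) in SEGMENTS:
        if r_lo <= r <= r_hi and f_lo <= f <= f_hi:
            return name
    return "Unassigned"  # hypothetical fallback for pairs no row covers

for r, f in [(5, 5), (1, 5), (3, 3), (1, 1)]:
    print((r, f), grid_segment(r, f))
```

A different tie-breaking order would assign boundary pairs such as (4, 4) to a different segment, so the ordering should be treated as a business decision rather than a fixed rule.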

In [None]:
data.head()

In [None]:
data.shape

In [None]:
# Change Data Type:
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

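# Reference date for recency: a fixed "today" shortly after the last invoice in the data
today = dt.datetime(2012,1,1)
print(today)

# Keep only positive quantities and totals (drops returns and cancellations), UK customers only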
data = data[data['Quantity'] > 0]
data = data[data['TotalPrice'] > 0]
data = data[data["Country"] == "United Kingdom"]
data.shape

In [None]:
data.info()

<a id = "5"></a><br>
## Find Recency, Monetary and Frequency

In [None]:
# Recency and Monetary 
data_x = data.groupby('CustomerID').agg({'TotalPrice': lambda x: x.sum(),
                                        'InvoiceDate': lambda x: (today - x.max()).days})
data_x.head()

In [None]:
# Each row is one StockCode line of an invoice, so first total the lines per invoice
data_y = data.groupby(['CustomerID','InvoiceNo']).agg({'TotalPrice': lambda x: x.sum()})
data_y.head(20)

In [None]:
# Find Frequency
data_z = data_y.groupby('CustomerID').agg({'TotalPrice': lambda x: len(x)})
data_z.head()

<a id = "6"></a><br>
## RFM Dataframe

In [None]:
# RFM Dataframe
rfm_table= pd.merge(data_x,data_z, on='CustomerID')

# Change Column Name
rfm_table.rename(columns= {'InvoiceDate': 'Recency',
                          'TotalPrice_y': 'Frequency',
                          'TotalPrice_x': 'Monetary'}, inplace= True)
rfm_table.head()

<a id = "7"></a><br>
## RFM Segmentation

* Customers with the lowest recency and the highest frequency and monetary amounts are considered top customers.

In [None]:
# Find the frequency score from fixed quantile thresholds
def FScore(x,p,d):
    if x <= d[p][0.20]:
        return 0
    elif x <= d[p][0.40]:
        return 1
    elif x <= d[p][0.60]: 
        return 2
    elif x <= d[p][0.80]:
        return 3
    else:
        return 4

quantiles = rfm_table.quantile(q=[0.20,0.40,0.60,0.80])
quantiles = quantiles.to_dict()
rfm_table['Freq_Tile'] = rfm_table['Frequency'].apply(FScore, args=('Frequency',quantiles,))

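#Recency 
# Fewer days since the last purchase is better, so reverse the quintile index:
# the most recent 20% of customers get the highest tile
rfm_table = rfm_table.sort_values('Recency',ascending=True)
rfm_table['Rec_Tile'] = 4 - pd.qcut(rfm_table['Recency'],5,labels=False)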

#Monetary 
rfm_table['Mone_Tile'] = pd.qcut(rfm_table['Monetary'],5,labels=False)

# Shift the tiles from 0-4 to 1-5
rfm_table['Rec_Tile'] = rfm_table['Rec_Tile'] + 1
rfm_table['Freq_Tile'] = rfm_table['Freq_Tile'] + 1
rfm_table['Mone_Tile'] = rfm_table['Mone_Tile'] + 1

# Add to dataframe
rfm_table['RFM Score'] = rfm_table['Rec_Tile'].map(str) + rfm_table['Freq_Tile'].map(str) + rfm_table['Mone_Tile'].map(str)
rfm_table.head()

<a id = "8"></a><br>
## RFM Segmentation Readily Answers These Questions For Business

<a id = "9"></a><br>
## Who are my best customers?

In [None]:
rfm_table[rfm_table['RFM Score'] == '555'].sort_values('Monetary', ascending=False).head()

<a id = "10"></a><br>
## Which customers are on the verge of churning?

In [None]:
# Customers whose recency score is low (they have not purchased recently)
rfm_table[rfm_table['Rec_Tile'] <= 2 ].sort_values('Monetary', ascending=False).head()

<a id = "11"></a><br>
## Who are the lost customers?

In [None]:
# Customers whose recency, frequency and monetary scores are all low
rfm_table[rfm_table['RFM Score'] == '111'].sort_values('Recency',ascending=False).head()

<a id = "12"></a><br>
## Who are the loyal customers?

In [None]:
#Customers with high frequency value

rfm_table[rfm_table['Freq_Tile'] >= 3 ].sort_values('Monetary', ascending=False).head()

<a id = "13"></a><br>
## Summing the RFM Score
* One of the most straightforward methods is to sum our scores to a single number and define RFM levels for each score range.
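As a small illustration of turning a summed score into levels, the sketch below bins the sum with `pd.cut`. The tile values, the bin edges, and the level names `Low`, `Mid` and `High` are all assumptions chosen for the example; they are not the ranges used in the cells that follow.

```python
import pandas as pd

# Hypothetical tile scores (1-5 each) for five customers
scores = pd.DataFrame({"R": [5, 4, 3, 2, 1],
                       "F": [5, 3, 3, 1, 1],
                       "M": [5, 4, 2, 2, 1]})

# The sum of three 1-5 scores ranges from 3 to 15
scores["RFM_Sum"] = scores[["R", "F", "M"]].sum(axis=1)

# Assumed score ranges and level names, for illustration only
bins = [2, 6, 10, 15]            # intervals (2, 6], (6, 10], (10, 15]
labels = ["Low", "Mid", "High"]
scores["Level"] = pd.cut(scores["RFM_Sum"], bins=bins, labels=labels)
print(scores)
```

Binning with `pd.cut` keeps the ranges in one place, which makes it easy to experiment with different cut-offs.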

In [None]:
# Calculate RFM_Sum: the sum of the three tile scores (3 to 15)
rfm_table['RFM_Sum'] = rfm_table[['Freq_Tile','Rec_Tile','Mone_Tile']].sum(axis=1)
rfm_table.head()

<a id = "14"></a><br>
## Segmenting Customers Based On The RFM Sum Score

* We can get creative and hypothesize about what each score range entails, but for this exercise I will take inspiration from some common segment names.

In [None]:
# Define rfm_level function
def rfm_level(df):
    # Branches are checked from the highest sum down, so each needs only a lower bound
    if df['RFM_Sum'] >= 9:
        return 'Can\'t Lose Them'
    elif df['RFM_Sum'] >= 8:
        return 'Champions'
    elif df['RFM_Sum'] >= 7:
        return 'Loyal'
    elif df['RFM_Sum'] >= 6:
        return 'Potential'
    elif df['RFM_Sum'] >= 5:
        return 'Promising'
    elif df['RFM_Sum'] >= 4:
        return 'Needs Attention'
    else:
        return 'Require Activation'
# Create a new variable RFM_Level
rfm_table['RFM_Level'] = rfm_table.apply(rfm_level, axis=1)
# Print the header with top 5 rows to the console
rfm_table.head()

In [None]:
rfm_table["RFM_Level"].value_counts()

In [None]:
# Calculate average values for each RFM_Level, and return a size of each segment 
rfm_level_agg = rfm_table.groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': ['mean', 'count']}).round(1)
# Print the aggregated dataset
print(rfm_level_agg)

<a id = "15"></a><br>
## RFM Segmentation Visualization

In [None]:
rfm_level_agg.columns = rfm_level_agg.columns.droplevel()
rfm_level_agg.columns = ['RecencyMean','FrequencyMean','MonetaryMean', 'Count']
#Create our plot and resize it.
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 9)
# Take the labels from the aggregated index so they always line up with the sizes
squarify.plot(sizes=rfm_level_agg['Count'], 
              label=rfm_level_agg.index.tolist(), alpha=.6 )
plt.title("RFM Segments",fontsize=18,fontweight="bold")
plt.axis('off')
plt.show()

<a id = "16"></a><br>
## K-Means Segmentation

<a id = "17"></a><br>
## Plot RFM distributions

In [None]:
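fig, axes = plt.subplots(3, 1, figsize=(12,10))
# Plot the distributions of R, F and M, one panel each
for ax, col in zip(axes, ['Recency', 'Frequency', 'Monetary']):
    # sns.distplot is deprecated in recent seaborn releases; histplot with a KDE replaces it
    sns.histplot(rfm_table[col], kde=True, stat='density', ax=ax)
    # Overlay a fitted normal curve, as distplot's fit=norm did
    xs = np.linspace(rfm_table[col].min(), rfm_table[col].max(), 200)
    ax.plot(xs, norm.pdf(xs, *norm.fit(rfm_table[col])), color='black')
# Show the plot
plt.show()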

<a id = "18"></a><br>
## Normalization

In [None]:
clustering_fm = rfm_table[['Recency',"Frequency","Monetary"]].copy()
clustering_fm.head()

In [None]:
min_max_scaler = MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(clustering_fm)
data_scaled2 = pd.DataFrame(x_scaled)

In [None]:
data_scaled2.head()

<a id = "19"></a><br>
## Elbow Method

In [None]:
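wcss = []  # within-cluster sum of squares (inertia) for each k
for i in range(1,15): 
    kmeans = KMeans(n_clusters=i, init="k-means++", n_init=10, random_state=0)
    kmeans.fit(data_scaled2)
    wcss.append(kmeans.inertia_)  

plt.plot(range(1,15),wcss,marker="*",c="black")
plt.title("Elbow plot for optimal number of clusters")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.show()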

<a id = "20"></a><br>
## KMeans clustering with 4 clusters

In [None]:
# Fix random_state so the cluster labels are reproducible between runs
kmeans = KMeans(n_clusters = 4, init='k-means++', n_init =10,max_iter = 300, random_state=0)
kmeans.fit(data_scaled2)
pred = kmeans.predict(data_scaled2)

In [None]:
np.unique(kmeans.labels_)

In [None]:
from sklearn.metrics import silhouette_score
score = silhouette_score (data_scaled2, kmeans.labels_)
print("Score = ", score)

In [None]:
y_kmeans = kmeans.predict(data_scaled2)

In [None]:
y_kmeans[:4]

In [None]:
# Count of Clusters
# Work on a copy so the cluster column is not added to clustering_fm as a side effect
d_frame = clustering_fm.copy()
d_frame['cluster'] = y_kmeans
d_frame['cluster'].value_counts()

In [None]:
d_frame.head()

<a id = "21"></a><br>
## Which Cluster Contains Our Best Customers

In [None]:
d_frame.groupby('cluster').mean()

<hr>

<a id = "22"></a><br>
## Association Rules
* Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
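The level-wise idea described above (find frequent single items, then extend them to larger item sets while they stay frequent) can be sketched in plain Python. This is a teaching sketch on five made-up baskets, not the `mlxtend` implementation used below; the function name `frequent_itemsets` and the sample data are hypothetical.

```python
from itertools import combinations

# Hypothetical baskets (illustrative only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]

def frequent_itemsets(transactions, min_support):
    """Level-wise search: keep the frequent k-itemsets, then extend them to size k+1."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        kept = {s: support(s) for s in level}
        kept = {s: v for s, v in kept.items() if v >= min_support}
        result.update(kept)
        k = len(next(iter(kept))) if kept else 0
        # Candidates: unions of two frequent k-itemsets that contain exactly k+1 items
        candidates = {a | b for a, b in combinations(kept, 2) if len(a | b) == k + 1}
        # Pruning: every k-subset of a candidate must itself be frequent
        level = [c for c in candidates
                 if all(frozenset(sub) in kept for sub in combinations(c, k))]
    return result

found = frequent_itemsets(transactions, min_support=0.6)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 2))
```

The pruning step is what keeps the search tractable: an item set can only be frequent if all of its subsets are frequent, so most large candidates are never counted.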

In [None]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
data_apriori = data[data['Country']=='United Kingdom']
data_apriori.head()

In [None]:
data_apriori["Description"].nunique()

In [None]:
# Which Product and Their Count 
data_apr = data_apriori.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')
data_apr.head()

In [None]:
# Convert quantities to presence flags: True if the invoice contains the product.
# mlxtend's apriori and fpgrowth expect a one-hot (boolean) basket table.
basket_new = data_apr > 0
basket_new.head()

In [None]:
from mlxtend.frequent_patterns import fpgrowth
rule_fp = fpgrowth(basket_new, min_support=0.02, use_colnames=True)
rule_fp

In [None]:
items = apriori(basket_new, min_support=0.02, use_colnames=True)
items

<a id = "23"></a><br>
## Support & Confidence Values

* Support
<br>

![](https://miro.medium.com/max/1400/1*bqdq-z4Ec7Uac3TT3H_1Gg.png)

* Confidence
<br>

![](https://miro.medium.com/max/1400/1*E3mNKHcudWzHySGMvo_vPg.png)
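To make the two measures concrete, the sketch below computes support and confidence, plus lift (the metric used to filter rules in the next cell), using their standard definitions on five made-up baskets. The helper names and the sample data are hypothetical and independent of `mlxtend`.

```python
# Hypothetical baskets (illustrative only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(A and C) / support(A): how often C appears when A does."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """confidence(A -> C) / support(C); above 1 means A and C co-occur more than by chance."""
    return confidence(antecedent, consequent) / support(consequent)

a, c = {"bread"}, {"milk"}
print("support   :", round(support(a | c), 4))
print("confidence:", round(confidence(a, c), 4))
print("lift      :", round(lift(a, c), 4))
```

In this toy data, bread and milk appear together in three of five baskets, yet the lift is below 1 because both items are so common on their own that the pairing carries no extra signal.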

In [None]:
rule = association_rules(items, metric="lift", min_threshold=1)
rule