
## EDA, Customer Segmentation using RFM and KMeans

Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately.

This notebook performs EDA and customer segmentation on the Online Retail II data set, which contains all transactions for a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware, and many of its customers are wholesalers.

## Types of Segmentation factors:

* Demographic (Age, Gender, Income, Location, Education, Ethnicity)
* Psychographic (Interests, Lifestyles, Priorities, Motivation, Influence)
* Behavioural (Purchasing habits, Spending habits, User status, Brand interactions)
* Geographic (zip code, city, country, climate)

The main purposes of customer segmentation are testing pricing options, focusing on profitable customers, and communicating targeted marketing messages.

## Methodology

This dataset only contains behavioural features (purchasing habits and spending habits), so we segment customers with RFM modelling followed by KMeans clustering.

## Libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import datetime as dt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import math  # standard-library math; numpy.math is not available in recent NumPy releases

# Loading Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
path='/content/drive/MyDrive/Onlineretail_csvfile/OnlineRetail.csv'
df=pd.read_csv(path)
df


In [None]:
#shape of dataset
df.shape

In [None]:
#checking the datatypes and null values in the dataset
df.info()

In [None]:
#checking for null values
df.isna().sum()

## Observations
* The datatype of InvoiceDate is object; it needs to be converted to datetime.
* There are null values in Customer ID and Description.

Customer ID is our identification feature, and Description holds the product description.

We cannot do RFM analysis or KMeans clustering without Customer ID values.

Hence, we drop the rows with a missing Customer ID.

In [None]:
df.dropna(subset=['Customer ID'],inplace=True)


In [None]:
df.isnull().sum()

# Dataset Summary

In [None]:
df.describe()

In [None]:
(df[['Quantity','Price']] < 0).sum()

The Quantity column has negative values, so we also check Price for negative values. Let's explore these entries.



In [None]:
# Rows with a negative Quantity or a negative Price
df[(df['Quantity']<0) | (df['Price']<0)]

In [None]:
# Keep only rows with a positive Quantity and a positive Price
df=df[df.Quantity>0]
df=df[df.Price>0]
df

Invoice numbers that start with C are cancellations according to the data description, so we drop these entries.



In [None]:
# Keep only invoices whose number does not start with "C" (cancellations)
df = df[~df["Invoice"].astype(str).str.startswith("C")]
df

In [None]:
df.shape

In [None]:
df.describe()

## Feature Engineering

In [None]:

# Converting InvoiceDate to datetime
df['InvoiceDate']=pd.to_datetime(df['InvoiceDate'])
df

In [None]:
# Extracting month name and day name from invoice date
df['Month']=df['InvoiceDate'].dt.month_name()
df['Day']=df['InvoiceDate'].dt.day_name()
df

In [None]:
# Creating Total Amount column by multiplying  Quantity with Price
df['Total Amount']=df['Quantity']*df['Price']
df

## Exploratory Data Analysis

In [None]:
df.columns

## Top 10 highest-selling products of the store


In [None]:
# Total units sold per product, highest first (summing only Quantity avoids errors on non-numeric columns)
high_sale=df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).reset_index()
top_product_sale=high_sale.head(10)
top_product_sale

In [None]:
#plot top 10 highest selling products
plt.figure(figsize=(12,6))
ax=sns.barplot(x=top_product_sale['Quantity'],y=top_product_sale['Description'])
ax.bar_label(ax.containers[0])  # show the value at the end of each bar
plt.title("Top 10 highest selling products")


**Observations**
* WORLD WAR 2 GLIDERS ASSTD DESIGNS was the highest selling product
* HANGING HEART T-LIGHT HOLDER was the second highest selling product

## 10 Least selling products

In [None]:
least_product_sale=high_sale[['Description','Quantity']]
least_product_sale.tail(10)

**These are the least selling products of the store with only 1 unit sold of each product**




## Top 10 highest spending customers

In [None]:
Top_spending=df.groupby('Customer ID')['Total Amount'].sum().reset_index().sort_values('Total Amount',ascending=False).head(10)
Top_spending

In [None]:
#visualize top 10 spending customers
plt.figure(figsize=(12,6))
ax=sns.barplot(x=Top_spending['Customer ID'],y=Top_spending['Total Amount'])
ax.bar_label(ax.containers[0])
plt.title("Top 10 spending customers")

# Top 10 Frequent customers

In [None]:
# Number of rows per customer, most frequent first; naming the columns explicitly works across pandas versions
top_frequent=df['Customer ID'].value_counts().rename_axis('Customer ID').reset_index(name='Frequency')
top_frequent_new=top_frequent.head(10)
top_frequent_new

We observe that the two lists have 3 Customer IDs in common, suggesting that the most frequent customers tend to be among the highest-spending customers.
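One way to check the overlap between two top-10 lists is a set intersection. The sketch below uses made-up customer IDs, not values from the dataset; with the notebook's frames the inputs would be `Top_spending['Customer ID']` and `top_frequent_new['Customer ID']`.

```python
# Hypothetical customer IDs, for illustration only
spender_ids = [1001, 1002, 1003, 1004]
frequent_ids = [1004, 1005, 1001, 1006]

# Customers that appear in both lists
common_ids = sorted(set(spender_ids) & set(frequent_ids))
print(common_ids)  # [1001, 1004]
```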

## Top 10 customers by average amount per invoice line

In [None]:
avg_amount=df.groupby('Customer ID')['Total Amount'].mean().round(2).sort_values(ascending=False).reset_index().rename(columns={'Total Amount':'Avg_amt_per_cust'}).head(10)
avg_amount


In [None]:
#visualize top 10 customers by average amount per invoice line
plt.figure(figsize=(12,6))
sns.barplot(x=avg_amount['Customer ID'],y=avg_amount['Avg_amt_per_cust'])
plt.title('Average amount spent by each Customer')

# Top countries contributing highest revenue to the store

In [None]:
top_countries=df.groupby('Country')['Total Amount'].sum().round(2).sort_values(ascending=False).head(5).reset_index()
top_countries

In [None]:

pal_ = list(sns.color_palette(palette='plasma_r',
                              n_colors=len(top_countries['Country'])).as_hex())
#plot a pie chart
plt.figure(figsize=(6, 6))
plt.rcParams.update({'font.size': 9})
plt.pie(top_countries['Total Amount'],
        labels= top_countries.Country,
        colors=pal_, autopct='%1.1f%%',
        pctdistance=0.9)
plt.legend(bbox_to_anchor=(1, 1), loc=2, frameon=False)
plt.show()

* **The UK contributes the most revenue to the store**
* **European countries such as Germany, France, the Netherlands and EIRE (Ireland) contribute significant revenue to the store**



In [None]:
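# Five countries contributing the least revenue (computed from df, since top_countries only holds the top 5)
least_countries=df.groupby('Country')['Total Amount'].sum().round(2).sort_values().head(5).reset_index()
plt.figure(figsize=(15,6))
sns.barplot(x=least_countries['Country'],y=least_countries['Total Amount'])
plt.title('5 countries contributing the least revenue')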

**The countries contributing the least to store revenue are non-European countries**


### Sales in different months.


In [None]:
# Total sales per month, highest first
Sales_by_Month=df.groupby('Month')['Total Amount'].sum().sort_values(ascending=False).reset_index()
Sales_by_Month

In [None]:
plt.figure(figsize=(20,6))
sns.barplot(x=Sales_by_Month['Month'],y=Sales_by_Month['Total Amount'])
plt.title('Sales in different Months ')

**The highest sales occurred in November (the eve of the holiday season), while the lowest sales occurred in February**



## Model Building

# RFM Model Analysis


RFM is a method used to analyze customer value. RFM stands for Recency, Frequency, and Monetary.

* Recency: How recently did the customer visit our website, or how recently did they purchase?

* Frequency: How often do they visit, or how often do they purchase?

* Monetary: How much revenue do we get from their visits, or how much do they spend when they purchase?

RFM analysis helps businesses segment their customer base into homogeneous groups so that they can engage each group with a targeted marketing strategy.
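To make the three definitions concrete, here is a minimal sketch in plain Python that computes Recency, Frequency and Monetary for a handful of made-up transactions. The customer names, dates and amounts are hypothetical, and `rfm_table` is an illustrative helper, not part of the notebook; Recency is measured against the latest date in the data, as in the cells that follow.

```python
from datetime import date

# Hypothetical transactions: (customer_id, invoice_date, amount)
transactions = [
    ("A", date(2011, 12, 1), 50.0),
    ("A", date(2011, 12, 8), 30.0),
    ("B", date(2011, 6, 15), 200.0),
]

def rfm_table(rows):
    """Return {customer: (recency_days, frequency, monetary)}."""
    snapshot = max(day for _, day, _ in rows)  # latest date in the data
    table = {}
    for cust, day, amount in rows:
        last, freq, total = table.get(cust, (None, 0, 0.0))
        last = day if last is None or day > last else last
        table[cust] = (last, freq + 1, total + amount)
    # Recency is the gap in days between the snapshot and the last purchase
    return {c: ((snapshot - last).days, f, m) for c, (last, f, m) in table.items()}

toy_rfm = rfm_table(transactions)
print(toy_rfm)  # {'A': (0, 2, 80.0), 'B': (176, 1, 200.0)}
```

Customer A bought recently and often but spent less in total; customer B spent more in a single purchase long ago. RFM keeps these three signals separate so that both patterns stay visible.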

In [None]:
# Recency = latest date in the data - customer's last invoice date
# Frequency = number of invoice rows per customer, Monetary = sum of Total Amount per customer
# InvoiceDate was already converted to datetime during feature engineering
df['Recency']=df['InvoiceDate'].max()-df['InvoiceDate']
recency=df.groupby('Customer ID')['Recency'].min().dt.days.reset_index()
recency

In [None]:
frequency=df.groupby('Customer ID')['Invoice'].count().reset_index().rename(columns={'Invoice':'Frequency'})
frequency

In [None]:
# Total amount spent by each customer
monetary=df.groupby('Customer ID')['Total Amount'].sum().reset_index().rename(columns={'Total Amount':'Monetary'})
monetary

In [None]:
rfm_cust=pd.merge(recency,frequency,on='Customer ID',how='inner')
rfm=pd.merge(rfm_cust,monetary,on='Customer ID',how='inner')
rfm.head()

## Descriptive Summary and distribution of Recency



In [None]:
rfm.Recency.describe()

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(rfm['Recency'],kde=True)  # histplot replaces the deprecated distplot
plt.title("Distribution of Recency")

The Recency distribution is right-skewed.

**Descriptive summary and distribution of frequency**

In [None]:
rfm['Frequency'].describe()

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(rfm['Frequency'],kde=True)  # histplot replaces the deprecated distplot
plt.title("Distribution of Frequency")

The Frequency distribution is extremely right-skewed.

**Descriptive summary and distribution of Monetary**

In [None]:
rfm['Monetary'].describe()

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(rfm['Monetary'],kde=True)  # histplot replaces the deprecated distplot
plt.title('Distribution of Monetary')


The Monetary distribution is also extremely right-skewed.

**Splitting data into four sections using quantile**

In [None]:
# Quartile cut points (25%, 50%, 75%) for the three RFM columns
quantile=rfm[['Recency','Frequency','Monetary']].quantile(q=[0.25,0.50,0.75])
quantile=quantile.to_dict()

In [None]:
quantile

In [None]:
# Arguments: x = value, p = column name ('Recency', 'Frequency' or 'Monetary'), d = quartiles dict
# A good customer has low Recency, high Frequency and high Monetary; score 1 is best, 4 is worst
# Function for scoring recency (lower values are better)
def Rscoring(x,p,d):
  if x<=d[p][0.25]:
    return 1
  elif x<=d[p][0.50]:
    return 2
  elif x<=d[p][0.75]:
    return 3
  else:
    return 4

# Function for scoring frequency and monetary (higher values are better)
def FnMscoring(x,p,d):
  if x<=d[p][0.25]:
    return 4
  elif x<=d[p][0.50]:
    return 3
  elif x<=d[p][0.75]:
    return 2
  else:
    return 1




Calculating R,F and M values and adding to dataframe



In [None]:
rfm["R"]=rfm['Recency'].apply(Rscoring,args=('Recency',quantile))
rfm["F"]=rfm['Frequency'].apply(FnMscoring,args=('Frequency',quantile))
rfm["M"]=rfm['Monetary'].apply(FnMscoring,args=('Monetary',quantile))
rfm.head()
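The quartile scoring can be checked by hand on a small list. The sketch below is a compact restatement of the same rule in plain Python, assuming made-up recency values; `quartile_score` is a hypothetical helper that folds both scoring directions into one function, and `statistics.quantiles` with `method="inclusive"` is used here to produce the three cut points.

```python
from statistics import quantiles

def quartile_score(value, cuts, low_is_good):
    """Score 1 (best) to 4 (worst) from three quartile cut points."""
    rank = 1 + sum(value > c for c in cuts)  # 1..4: one plus the cuts exceeded
    return rank if low_is_good else 5 - rank

recency_days = [1, 5, 20, 40, 90, 200, 300, 365]  # hypothetical values
cuts = quantiles(recency_days, n=4, method="inclusive")  # Q1, median, Q3
r_scores = [quartile_score(v, cuts, low_is_good=True) for v in recency_days]

print(cuts)      # [16.25, 65.0, 225.0]
print(r_scores)  # [1, 1, 2, 2, 3, 3, 4, 4]
```

With `low_is_good=False` the same value ranks are flipped, which is what frequency and monetary need: a value above the third quartile gets score 1.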

Adding combined RFM value to the dataset

In [None]:
rfm["RFM_Group"]=rfm["R"].map(str)+" "+rfm["F"].map(str)+" "+rfm["M"].map(str)
rfm

Creating RFM score column by adding R,F and M values

In [None]:
rfm['RFM_Score']=rfm[['R','F','M']].sum(axis=1)
rfm

In [None]:
rfm.info()

In [None]:
rfm['RFM_Score'].unique()

Assigning a loyalty level to each customer

In [None]:
# A lower RFM_Score is better, so the lowest-score quartile is labelled Platinum
loyalty_level=['Platinum','Gold','Silver','Bronze']
rfm['RFM_Loyalty_Level']=pd.qcut(rfm['RFM_Score'],q=4,labels=loyalty_level)
rfm

Checking data for RFM_Group '1 1 1' (best score on all three measures)

In [None]:
rfm[rfm['RFM_Group']=='1 1 1'].sort_values('Monetary',ascending=False).reset_index().head(10)

**Segmentation based on RFM**

In [None]:
segmentation_rfm=rfm[['Recency','Frequency','Monetary','RFM_Loyalty_Level']]
segmentation_rfm

In [None]:
segmentation_rfm.groupby('RFM_Loyalty_Level').agg({'Recency':['mean','min','max'],'Frequency':['mean','min','max'],'Monetary':['mean','min','max']})

In [None]:
def handle_neg_zero(num):
  # log is undefined for values <= 0, so clamp them to 1 (log(1) = 0)
  if num<=0:
    return 1
  else:
    return num
#Apply handle_neg_zero to the Recency and Monetary columns
rfm['Recency']=[handle_neg_zero(x) for x in rfm.Recency]
rfm['Monetary']=[handle_neg_zero(x) for x in rfm.Monetary]

In [None]:
#Perform log transformation to reduce skew and bring the data closer to a normal distribution
log_rfm_df=np.log(rfm[['Recency','Frequency','Monetary']]).round(3)
log_rfm_df
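The effect of the log transformation on skew can be seen on a toy column. This is a sketch in plain Python with made-up amounts; `skewness` is a hypothetical helper that computes population skewness as the mean of cubed z-scores, and values are clamped to 1 before taking the log, mirroring `handle_neg_zero`.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: mean of cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)

monetary = [10, 100, 1000, 10000, 100000]         # hypothetical, heavy right tail
logged = [math.log(max(v, 1)) for v in monetary]  # clamp to 1 so log is defined

raw_skew = skewness(monetary)
log_skew = skewness(logged)
print(round(raw_skew, 2), round(log_skew, 2))
```

The raw values have a clearly positive skew, while their logs are evenly spaced and therefore essentially symmetric. Real data will rarely be this clean, but the direction of the effect is the same.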


**Now let's visualize the distributions of the log-transformed Recency, Frequency and Monetary.**



In [None]:
plt.figure(figsize=(12,6))
sns.histplot(log_rfm_df['Recency'],kde=True)  # histplot replaces the deprecated distplot
plt.title('Distribution of log(Recency)')

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(log_rfm_df['Frequency'],kde=True)  # histplot replaces the deprecated distplot
plt.title('Distribution of log(Frequency)')

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(log_rfm_df['Monetary'],kde=True)  # histplot replaces the deprecated distplot
plt.title('Distribution of log(Monetary)')

In [None]:
# Vectorised natural log of each RFM column
rfm['Recency_Log']=np.log(rfm['Recency'])
rfm['Frequency_Log']=np.log(rfm['Frequency'])
rfm['Monetary_Log']=np.log(rfm['Monetary'])
rfm

## KMeans Clustering

**Applying elbow method on Recency and Monetary**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


In [None]:
Recency_Monetary=rfm[['Recency_Log','Monetary_Log']].copy()
# taking only values of recency and monetary in Recency_Monetary
Recency_Monetary


In [None]:
#Standardising the data
scaler=StandardScaler()
Recency_Monetary=scaler.fit_transform(Recency_Monetary)
#Applying Elbow Method: record the within-cluster sum of squares (inertia) for each k
wcss={}
for k in range(1,15):
  km=KMeans(n_clusters=k,init='k-means++',max_iter=1000,random_state=1)
  km.fit(Recency_Monetary)
  wcss[k]=km.inertia_
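Standardising matters because KMeans is distance-based, so a feature with a larger numeric range would otherwise dominate. The sketch below shows the z-score calculation by hand on made-up values; `standardize` is a hypothetical helper that uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score a column: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

recency_log = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical log-recency values
z_recency = standardize(recency_log)
print([round(z, 3) for z in z_recency])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After the transformation the column has mean 0 and standard deviation 1, so Recency and Monetary contribute on the same scale.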

In [None]:
#Plot the graph for the sum of square distance values and Number of Clusters
plt.figure(figsize=(12,6))
sns.pointplot(x=list(wcss.keys()),y=list(wcss.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()
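To see what the elbow curve measures, the sketch below runs a tiny one-dimensional Lloyd's algorithm in plain Python on made-up points with three obvious groups. It is an illustration only, not scikit-learn's `KMeans`: `kmeans_1d` is a hypothetical helper with a simple deterministic start, and the data are invented.

```python
from statistics import mean

def kmeans_1d(points, k, iters=20):
    """Tiny Lloyd's algorithm on 1-D data; returns (centers, wcss)."""
    pts = sorted(points)
    # Deterministic start: k values spread evenly across the sorted data
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    # Within-cluster sum of squares: squared distance to the nearest center
    wcss = sum(min((p - c) ** 2 for c in centers) for p in pts)
    return centers, wcss

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]  # three obvious groups
wcss_by_k = {k: kmeans_1d(data, k)[1] for k in range(1, 6)}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

The sum of squares falls steeply up to k = 3 and barely changes afterwards; that bend is the "elbow" one looks for in the plot above.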

**Silhouette Score**

In [None]:
# Recency_Monetary is already standardised above, so it is used as-is
range_n_clusters=range(2,16)
for n_clusters in range_n_clusters:
  cluster=KMeans(n_clusters=n_clusters,random_state=1)
  pred=cluster.fit_predict(Recency_Monetary)
  score=silhouette_score(Recency_Monetary,pred)
  print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
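The silhouette score compares, for each point, the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes it by hand for made-up one-dimensional points; `silhouette_1d` is a hypothetical helper for illustration, not the scikit-learn implementation, and it assumes at least two clusters and no cluster made of identical points only.

```python
def silhouette_1d(points, labels):
    """Mean silhouette coefficient for 1-D points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(abs(p - q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]  # hypothetical values, two tight groups
good = silhouette_1d(points, [0, 0, 0, 1, 1, 1])  # labels follow the groups
bad = silhouette_1d(points, [0, 1, 0, 1, 0, 1])   # labels mix the groups
print(round(good, 3), round(bad, 3))
```

A labelling that follows the real groups scores close to 1, while a labelling that mixes them scores near or below 0, which is why a higher silhouette score indicates a better choice of k.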


In [None]:
# applying Kmeans_clustering algorithm
kmeans_rec_mon = KMeans(n_clusters=2, random_state=1)  # fixed seed so the cluster labels are reproducible
kmeans_rec_mon.fit(Recency_Monetary)
y_kmeans= kmeans_rec_mon.predict(Recency_Monetary)

In [None]:
# Find the clusters for the observation given in the dataset
rfm['Cluster_based_rec_mon'] = kmeans_rec_mon.labels_
rfm.head(10)

In [None]:
# Centers of the clusters
centers = kmeans_rec_mon.cluster_centers_
centers

In [1]:
# visualizing the clusters
plt.figure(figsize=(15,10))
plt.title('Customer segmentation based on Recency and Monetary')
plt.scatter(Recency_Monetary[:, 0], Recency_Monetary[:, 1], c=y_kmeans, s=50, cmap='winter')

# mark the cluster centers in red
centers = kmeans_rec_mon.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=300, alpha=0.8)

In [None]:
# Use the rfm frame and the column names created earlier in this notebook
data_process_normalized2=rfm[['Recency','Frequency','Monetary','Recency_Log','Frequency_Log','Monetary_Log','RFM_Loyalty_Level','Cluster_based_rec_mon']]

In [None]:
data_process_normalized2.groupby('Cluster_based_rec_mon').agg({
    'Recency': ['mean', 'min', 'max'],
    'Frequency': ['mean', 'min', 'max'],
    'Monetary': ['mean', 'min', 'max','count']
})
