# **Customer Segmentation using RFM Analysis and K-means Clustering Methods:**

**Comparison between RFM and K-Means (Data set: ONLINE RETAIL II- 2010_2011)**

**Description of the data set:**
This Online Retail II data set contains all the transactions of a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many customers of the company are wholesalers.


**Attribute Information:**

InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. 

If this code starts with the letter 'C', it indicates a cancellation.

StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.

Description: Product (item) name. Nominal.

Quantity: The quantities of each product (item) per transaction. Numeric.

InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.

UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).

CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.

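Country: Country name. Nominal. The name of the country where a customer resides.

Note: in the CSV file loaded below, three of these columns appear under slightly different names: Invoice, Price and Customer ID.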



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

df_ = pd.read_csv("../input/online-retail-ii-csv-new/online_retail_II.csv")

In [None]:
df = df_.copy()

**UNDERSTANDING DATA:**

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.head()

In [None]:
df[["Quantity", "Price"]].describe().T

In [None]:
df[["Quantity", "Price"]].quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T

**DATA PREPROCESSING**

In [None]:
# Missing Values
df.isnull().sum()

In [None]:
# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions.
df["Description"] = df["Description"].fillna("Missing")

# Now the only missing values are in the Customer ID column. Customer ID is essential
# for segmenting customers, so we drop the rows where it is missing.
df.dropna(inplace=True)

# Invoices whose number contains the letter C are cancellations; we remove them.
# The .copy() avoids a SettingWithCopyWarning when TotalPrice is added below.
df = df[~df["Invoice"].str.contains("C", na=False)].copy()

# Price holds the unit price of each product,
# so the total for each invoice line is price times quantity.
df["TotalPrice"] = df["Price"] * df["Quantity"]

**CUSTOMER SEGMENTATION USING RFM ANALYSIS:**

I focus only on the recency and frequency values, so the monetary component of RFM is left out.

Let's calculate the metrics we need by grouping the data by Customer ID:
1. Customer's Recency:
Today's date (or a date shortly after the latest date in the data, if the data is old) minus the customer's last purchase date
2. Customer's Frequency:
Total number of unique invoices
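
As a quick check of these two definitions, here is a minimal sketch on a made-up table. The toy rows, the `reference_date` name and the `toy_rf` frame are assumptions for illustration only; the column names follow the CSV used in this notebook.

```python
import datetime as dt
import pandas as pd

# Hypothetical toy transactions: customer 1 has two invoices, customer 2 has one.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-05", "2011-11-11"]
    ),
})

# Reference date placed shortly after the last purchase in the toy data.
reference_date = dt.datetime(2011, 12, 11)

toy_rf = toy.groupby("Customer ID").agg(
    # Recency: days between the reference date and the last purchase.
    Recency=("InvoiceDate", lambda d: (reference_date - d.max()).days),
    # Frequency: number of distinct invoices, not number of rows.
    Frequency=("Invoice", "nunique"),
)
print(toy_rf)
```

Customer 1 has three rows but only two distinct invoices, which is why `nunique` is used rather than a row count.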


In [None]:
# The data is old, so I set today's date to 2011-12-11 to get reasonable recency values.
today_date = dt.datetime(2011, 12, 11)

In [None]:
# The InvoiceDate column has dtype object, so I convert it to datetime:
df["InvoiceDate"]= pd.to_datetime(df["InvoiceDate"])

In [None]:
df_rf = df.groupby("Customer ID").agg({ "InvoiceDate" : lambda date : (today_date - date.max()).days ,
                                "Invoice" : lambda invoice : invoice.nunique() })

In [None]:
df_rf.columns = ["Recency", "Frequency"]
df_rf.head()

In [None]:
df_rf["Recency_Score"] = pd.qcut(df_rf["Recency"], 5, labels= [5,4,3,2,1])
df_rf["Frequency_Score"] = pd.qcut(df_rf["Frequency"].rank(method = "first"), 5 , labels= [1,2,3,4,5])

df_rf["RF_Score"]= (df_rf["Recency_Score"].astype(str) + df_rf["Frequency_Score"].astype(str))
df_rf.head()
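
The frequency score above is computed on ranks rather than on the raw counts. A likely reason, sketched here on made-up numbers, is that many customers share the same frequency, so the quantile edges collapse and `pd.qcut` cannot build five distinct bins. The `freq` series below is hypothetical.

```python
import pandas as pd

# Hypothetical frequencies: most customers bought exactly once.
freq = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

try:
    pd.qcut(freq, 5, labels=[1, 2, 3, 4, 5])
    direct_ok = True
except ValueError:
    # Several quantile edges fall on the value 1, so the bin edges are not unique.
    direct_ok = False

# Ranking with method="first" breaks ties by order of appearance,
# giving ten distinct ranks that split into five equally sized bins.
scores = pd.qcut(freq.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])

print(direct_ok)
print(scores.astype(int).tolist())
```

One side effect worth keeping in mind: customers with the same frequency can receive different scores, because ties are split by position in the data.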

In [None]:
# RF naming:
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}

# Segmentation according to RF scores
df_rf["RF_Segment"] = df_rf["RF_Score"].replace(seg_map, regex= True)

df_rf.head()
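
To see how the patterns in `seg_map` read, the sketch below looks up single scores with the standard `re` module instead of pandas. The first digit is the recency score and the second the frequency score. The helper `segment_for` is a hypothetical name introduced only for this illustration.

```python
import re

# Same patterns as seg_map above, repeated so the example is self-contained.
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}

def segment_for(rf_score):
    """Hypothetical helper: return the segment whose pattern matches the whole score."""
    for pattern, name in seg_map.items():
        if re.fullmatch(pattern, rf_score):
            return name
    return None

for score in ["55", "25", "33", "41"]:
    print(score, segment_for(score))

# Every one of the 25 possible scores is covered by some pattern.
all_scores = [f"{r}{f}" for r in range(1, 6) for f in range(1, 6)]
print(all(segment_for(s) is not None for s in all_scores))
```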

We now have the RF segments. To have something to compare them with, we build K-means segments as well.

**CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING METHOD:**

In [None]:
# Scale both features to [0, 1] so that neither one dominates the distances k-means computes
df_to_be_scaled = df_rf[["Recency", "Frequency"]]
sc = MinMaxScaler((0, 1))
df_rf_scaled = sc.fit_transform(df_to_be_scaled)
df_rf_scaled[0:5]

In [None]:
# Creating the Clusters
kmeans = KMeans(n_clusters= 10, random_state= 41).fit(df_rf_scaled)
kmeans.cluster_centers_

clusters = kmeans.labels_

clusters
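
The cluster centres printed above are in scaled units. A minimal sketch, on made-up data, of mapping them back to days and invoice counts with the scaler's `inverse_transform`; the names `X`, `scaler`, `km` and `centers` are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical [Recency, Frequency] rows forming two obvious groups.
X = np.array([[5.0, 20.0], [8.0, 18.0], [300.0, 1.0], [320.0, 2.0]])

scaler = MinMaxScaler((0, 1))
X_scaled = scaler.fit_transform(X)

# n_init is set explicitly because its default differs between scikit-learn versions.
km = KMeans(n_clusters=2, n_init=10, random_state=41).fit(X_scaled)

# Centres come back in scaled units; convert them to the original units.
centers = scaler.inverse_transform(km.cluster_centers_)
print(np.round(centers, 1))
```

In the notebook the equivalent call would be `sc.inverse_transform(kmeans.cluster_centers_)`, which can make the clusters easier to describe in plain terms.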

In [None]:
pd.DataFrame({"Customer ID": df_rf.index, "Segments": clusters})

In [None]:
df_rf["Kmeans_Segment"] = clusters
# in order to start segments from 1 not 0:
df_rf["Kmeans_Segment"] = df_rf["Kmeans_Segment"] + 1
df_rf.head()

**COMPARING THE METHODS**

In [None]:
df_rf["Kmeans_Segment"].value_counts()

In [None]:
df_rf["RF_Segment"].value_counts()
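
Beyond the separate counts, a cross-tabulation shows directly how the two labelings overlap. This is a sketch on made-up labels, not output from the real data; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical labels for six customers.
toy = pd.DataFrame({
    "RF_Segment": ["champions", "champions", "hibernating",
                   "hibernating", "hibernating", "at_Risk"],
    "Kmeans_Segment": [1, 1, 2, 2, 3, 3],
})

# Number of customers in each (RF segment, K-means cluster) pair.
counts = pd.crosstab(toy["RF_Segment"], toy["Kmeans_Segment"])

# Share of each K-means cluster (column) taken up by each RF segment.
share_of_cluster = pd.crosstab(
    toy["RF_Segment"], toy["Kmeans_Segment"], normalize="columns"
)

print(counts)
print(share_of_cluster)
```

On the notebook's data the same call would be `pd.crosstab(df_rf["RF_Segment"], df_rf["Kmeans_Segment"])`. Passing `normalize="index"` instead gives the share of each RF segment that falls into each cluster.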

In [None]:
# The scores are categorical; convert them to integers so that their means can be computed.
df_rf["Recency_Score"] = df_rf["Recency_Score"].astype(int)
df_rf["Frequency_Score"] = df_rf["Frequency_Score"].astype(int)

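# RF_Segment holds strings, which stats.mode no longer accepts in recent SciPy releases,
# so its most frequent value is taken with pandas' own Series.mode.
# The scipy import is kept because stats.mode is still used on a numeric column further down.
from scipy import stats
df_rf.groupby("Kmeans_Segment").agg({"Recency_Score" : ["min", "max","mean","count"], 
                                     "Frequency_Score": ["min", "max","mean","count"],
                                     "RF_Segment" : lambda s: s.mode().iat[0] })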

Considering the mean values and the most frequent RF segment in each cluster, we can say that the clusters correspond to the RF segments as follows:

1  : 'champions'

2  : 'hibernating'

3  : 'hibernating'

4  : 'need_attention', 'loyal_customers'

5  : 'hibernating'

6  : 'at_Risk'

7  : 'need_attention', 'loyal_customers'

8  : 'hibernating'

9  : 'champions'

10 : 'hibernating'

In [None]:
df_rf.groupby("RF_Segment").agg({"Recency_Score" : ["min", "max","mean","count"], 
                                     "Frequency_Score": ["min", "max","mean","count"], "Kmeans_Segment" : stats.mode })

Summary:
We performed customer segmentation using K-means and RFM analysis.
We used 10 segments in both cases.
Based on the Recency and Frequency values, customers with similar purchasing
behavior mostly end up in similar segments, but the two methods also differ in many respects.

For example, the hibernating segment, the largest one with 1524 customers, is scattered across 5 separate K-means clusters. Most of it falls in cluster 2, but that cluster also includes customers with a frequency score of 5. Clusters 3 and 5 are more homogeneous than cluster 2.

The can't-lose (`cant_loose`), new customers and promising segments, which have the fewest customers in the RF analysis, do not stand out as clusters of their own. The can't-lose segment is mostly in cluster 6; customers in the new customers and promising segments are mostly in cluster 1.

Customers in the need attention and loyal customers segments are scattered across clusters 4 and 7.

The champions segment is mostly observed in clusters 1 (98%) and 9 (43%).

To sum up, we created 10 segments with each of the two methods and looked at where they agree and where they differ.