**About the dataset**

Online Retail is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. 
The company mainly sells unique all-occasion gifts. 
Many customers of the company are wholesalers.

# 1. Import Libraries

In [None]:
# import required libraries for data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

# import required libraries for clustering
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score



In [None]:
# use pandas' read_csv to load the data to be analysed (a CSV file)
retail = pd.read_csv("../input/online-retail-customer-clustering/OnlineRetail.csv",sep=",",encoding="ISO-8859-1")
retail.head()
                     

# 2. Raw Data Overview

In [None]:
#shape of the dataframe (df)
retail.shape

In [None]:
# info of the df
retail.info()

In [None]:
# description of the df
retail.describe()

# 3. Data Cleaning

In [None]:
# percentage of missing values in each column of the df
df_null = round(100*(retail.isnull().sum())/len(retail), 2)
df_null

In [None]:
# show the rows that contain at least one missing value (each row listed once)
retail[retail.isnull().any(axis=1)]


In [None]:
# drop the rows that have missing values
retail = retail.dropna()
retail.shape

In [None]:
# change the CustomerID data type; astype() casts a pandas object to the specified dtype

retail['CustomerID'] = retail['CustomerID'].astype(str)

In [None]:
# check the missing values again after dropping the incomplete rows
df_null_after = round(100*(retail.isnull().sum())/len(retail), 2)
df_null_after

# 4. Prepare the Data with the RFM Model

The RFM model is used to evaluate customer value: customers are divided into segments, and a separate marketing strategy is then created for each segment to increase its value.

**RFM stands for the three dimensions:**
* **Recency** – How recently did each customer visit the website within the period?
* **Frequency** – How often did each customer come back within the period?
* **Monetary Value** – How much did each customer spend within the period?

**New attribute: Recency**
* group the rows by Customer ID;
* compute the time between the last date/time of the period and each customer's last visit date/time.



In [None]:
# convert datetime
retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'],format='%d-%m-%Y %H:%M')
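
The explicit `format` string matters because the dates are written day first. As a small check, the sketch below parses two invented strings that follow the assumed `%d-%m-%Y %H:%M` layout (they are not copied from the data set) and confirms that the middle field is read as the month.

```python
import pandas as pd

# Invented sample strings in the assumed day-month-year layout
sample = pd.Series(["01-12-2010 08:26", "09-12-2011 12:50"])

# With an explicit format there is no day/month ambiguity
parsed = pd.to_datetime(sample, format="%d-%m-%Y %H:%M")

print(parsed.dt.day.tolist())    # day of month for each string
print(parsed.dt.month.tolist())  # month for each string
```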

In [None]:

# compute the last date/time of the period
max_date = retail['InvoiceDate'].max()
max_date


In [None]:
# dir() lists the attributes and methods available on an object
dir(max_date)

In [None]:
# compute the recency
retail['Recency'] = max_date - retail['InvoiceDate']
retail.head()

In [None]:
# recency per customer: time since the customer's most recent transaction
rfm_r = retail.groupby("CustomerID")["Recency"].min().reset_index()
rfm_r.head()

In [None]:
# extract only the days
rfm_r["Recency"] = rfm_r["Recency"].dt.days
rfm_r.head()
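
The recency steps can be checked by hand on a tiny table. The sketch below uses made-up customer IDs and dates (not rows from the data set) and assumes the same column names as the notebook; the `toy_` names are hypothetical.

```python
import pandas as pd

# Toy transactions; IDs and dates are invented for illustration only
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "C"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01 10:00", "2011-12-08 09:00",
        "2011-11-09 12:00", "2011-12-09 12:00",
    ]),
})

# Reference point: the last date/time in the period
toy_max_date = toy["InvoiceDate"].max()

# Whole days between the reference point and each customer's latest invoice
toy_recency = (
    (toy_max_date - toy.groupby("CustomerID")["InvoiceDate"].max())
    .dt.days
    .rename("Recency")
    .reset_index()
)
print(toy_recency)
```

Customer A bought twice, but only the later purchase counts, which is why the notebook takes the minimum of the per-row recency values.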

**New attribute: Frequency**

* group the rows by Customer ID
* count the invoice entries of each Customer ID (every row is counted, so an invoice that spans several rows is counted once per row)

In [None]:
# count the InvoiceNo entries (rows) of each CustomerID
rfm_f = retail.groupby('CustomerID')['InvoiceNo'].count().reset_index()
# name the columns of the new data set
rfm_f.columns = ['CustomerID', 'Frequency']
# head() shows the first 5 rows by default
rfm_f.head()
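
`count()` counts every row in a group, so an invoice with several rows contributes several times. If the number of distinct invoices per customer is wanted instead, `nunique()` is one alternative. A minimal sketch on invented data, assuming the notebook's column names:

```python
import pandas as pd

# Invented rows: customer A has two invoices, one of them with two rows
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B"],
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
})

grouped = toy.groupby("CustomerID")["InvoiceNo"]
rows_per_customer = grouped.count()        # what the cell above computes
invoices_per_customer = grouped.nunique()  # distinct invoice numbers

print(pd.DataFrame({
    "rows": rows_per_customer,
    "distinct_invoices": invoices_per_customer,
}))
```

Which definition is appropriate depends on the analysis; the rest of the notebook keeps the row count.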

**New attribute: Monetary**

* group the rows by Customer ID
* sum quantity × unit price

In [None]:
retail["Amount"] = retail['Quantity']*retail['UnitPrice']
rfm_m = retail.groupby("CustomerID")["Amount"].sum().reset_index()
rfm_m.head()

In [None]:
# merge R & F & M
rfm = pd.merge(rfm_r, rfm_f, on='CustomerID', how='inner')
rfm = pd.merge(rfm, rfm_m, on='CustomerID', how='inner')
rfm.head()
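
The three group-bys and two merges can also be written as a single `groupby().agg()` call. The sketch below is one possible shortcut, shown on invented rows with the notebook's column names; `toy`, `toy_max_date`, `rfm_toy` and `LastPurchase` are hypothetical names.

```python
import pandas as pd

# Invented transactions for illustration only
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B"],
    "InvoiceNo": ["1001", "1001", "1002"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01 09:00", "2011-12-01 09:00", "2011-12-09 09:00"]
    ),
    "Quantity": [2, 1, 5],
    "UnitPrice": [1.5, 4.0, 2.0],
})
toy["Amount"] = toy["Quantity"] * toy["UnitPrice"]
toy_max_date = toy["InvoiceDate"].max()

# One pass over the groups: latest purchase, row count, total spend
rfm_toy = toy.groupby("CustomerID").agg(
    LastPurchase=("InvoiceDate", "max"),
    Frequency=("InvoiceNo", "count"),
    Amount=("Amount", "sum"),
).reset_index()

# Recency in whole days, measured from the last date in the period
rfm_toy["Recency"] = (toy_max_date - rfm_toy["LastPurchase"]).dt.days
rfm_toy = rfm_toy[["CustomerID", "Recency", "Frequency", "Amount"]]
print(rfm_toy)
```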

In [None]:
rfm.shape

In [None]:
# use the IQR method to detect outliers: any value outside the range [Q1 - 1.5 x IQR, Q3 + 1.5 x IQR] is treated as an outlier
def iqr_outliers(df, field):
    q1 = df[field].quantile(0.25)
    q3 = df[field].quantile(0.75)
    iqr = q3-q1
    lower_tail = q1 - 1.5 * iqr
    upper_tail = q3 + 1.5 * iqr
    
    df = df[(df[field] >= lower_tail) & (df[field] <= upper_tail)]    
    return df

rfm.head()

rfm_copy = iqr_outliers(rfm, 'Recency')
rfm_copy = iqr_outliers(rfm_copy, 'Frequency')
rfm_copy = iqr_outliers(rfm_copy, 'Amount')

#rfm_copy.head()
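
To see what the IQR rule keeps and drops, here is a worked example on a made-up column. The numbers are invented so that the fences are easy to verify by hand: Q1 = 13, Q3 = 19, IQR = 6, so the fences are 13 - 9 = 4 and 19 + 9 = 28.

```python
import pandas as pd

# Invented amounts with one obvious extreme value
toy = pd.DataFrame({"Amount": [10, 12, 14, 16, 18, 20, 500]})

q1 = toy["Amount"].quantile(0.25)
q3 = toy["Amount"].quantile(0.75)
iqr = q3 - q1
lower_tail = q1 - 1.5 * iqr
upper_tail = q3 + 1.5 * iqr

# Keep only the rows inside the two fences (both ends inclusive)
kept = toy[toy["Amount"].between(lower_tail, upper_tail)]

print(lower_tail, upper_tail)
print(kept["Amount"].tolist())
```

Because the filter is applied one column at a time, each later call works on the rows that survived the earlier ones, so the quartiles for `Frequency` and `Amount` are computed on an already reduced table.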

In [None]:
# check the column names before extracting columns
rfm_copy.columns

In [None]:
# extract columns for rescaling the attributes
rfm_copy[['Amount','Frequency','Recency' ]]

In [None]:
rfm_rescale = rfm_copy[['Amount','Frequency','Recency' ]]
rfm_rescale.head()

In [None]:
# rescale the attributes with standardisation (zero mean, unit variance per column)

scaler = StandardScaler()

rfm_rescale_rescale = scaler.fit_transform(rfm_rescale)
rfm_rescale_rescale.shape

In [None]:
rfm_rescale_rescale = pd.DataFrame(rfm_rescale_rescale)
rfm_rescale_rescale.columns = ['Amount', 'Frequency', 'Recency']
rfm_rescale_rescale.head()
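
As a sanity check on what standardisation does, the sketch below compares `StandardScaler` with the manual formula (x - mean) / std on a small invented array. It assumes nothing about the real RFM values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented two-column data on very different scales
X = np.array([[100.0, 1.0], [200.0, 3.0], [300.0, 5.0], [400.0, 7.0]])

X_scaled = StandardScaler().fit_transform(X)

# Manual version: subtract the column mean, divide by the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so no single attribute dominates the Euclidean distances used by K-means.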

# 5. K-Means Clustering

**Use the Silhouette Method to find the best K**

The silhouette score of data point $i$ is $S_i = \dfrac{b_i - a_i}{\max(a_i, b_i)}$, where

$b_i$ is the mean distance from point $i$ to the points in the nearest cluster that it is not a part of;

$a_i$ is the mean distance from point $i$ to all the other points in its own cluster.

The silhouette score ranges from -1 to 1.

A score close to 1: data point $i$ is very similar to the other data points in its cluster;

A score close to -1: data point $i$ is not similar to the data points in its cluster;

A score close to 0: data point $i$ is on the edge between two clusters.
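
The definition can be verified by hand on four invented one-dimensional points. The sketch below computes the intra-cluster and nearest-cluster mean distances directly and compares the result with scikit-learn's `silhouette_samples`; the helper `silhouette_of` is a hypothetical name written for this example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Invented points: two tight pairs that are far apart
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of(i):
    """Silhouette score of point i, computed from the definition."""
    dists = np.abs(X[:, 0] - X[i, 0])
    own = labels == labels[i]
    own[i] = False  # the point itself is excluded from its own cluster mean
    a_i = dists[own].mean()
    b_i = min(
        dists[labels == k].mean()
        for k in np.unique(labels) if k != labels[i]
    )
    return (b_i - a_i) / max(a_i, b_i)

manual = np.array([silhouette_of(i) for i in range(len(X))])
print(manual)
print(np.allclose(manual, silhouette_samples(X, labels)))
```

For the first point the own-cluster distance is 1 and the mean distance to the other cluster is (10 + 11) / 2 = 10.5, giving (10.5 - 1) / 10.5, which is close to 1 as expected for well-separated clusters.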


In [None]:
range_n_clusters = [2, 3, 4, 5, 6, 7, 8,9,10]

for num_clusters in range_n_clusters:
    
    # initialise kmeans (a fixed random_state keeps the result the same between runs)
    kmeans = KMeans(n_clusters=num_clusters, max_iter=80, random_state=42)
    kmeans.fit(rfm_rescale_rescale)
    
    cluster_labels = kmeans.labels_
    
    # silhouette score
    silhouette_avg = silhouette_score(rfm_rescale_rescale, cluster_labels)
    print("For n_clusters={0}, the silhouette score is {1}".format(num_clusters, silhouette_avg))
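
Instead of reading the best K off the printed lines, the scores can be stored and the maximum picked programmatically. The sketch below does this on synthetic blobs from `make_blobs`, used here as a stand-in for the scaled RFM table; the data, `scores` and `best_k` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three tight, well-separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, max_iter=80, random_state=0)
    scores[k] = silhouette_score(X, model.fit_predict(X))

# The candidate with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real data the maximum is often less clear-cut, so it is worth looking at the whole set of scores rather than only the winner.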
    
    

In [None]:
# the final K is 3, because its silhouette score is the closest to 1
kmeans = KMeans(n_clusters=3, max_iter=80, random_state=42)
kmeans.fit(rfm_rescale_rescale)

In [None]:
kmeans.labels_


In [None]:

rfm_copy['Cluster_ID'] = kmeans.labels_

rfm_copy.head()

**Box plots to visualize Cluster_ID against RFM**

In [None]:
# ClusterID vs Recency
sns.boxplot(x='Cluster_ID', y='Recency', data=rfm_copy)

In [None]:
# ClusterID vs Frequency
sns.boxplot(x='Cluster_ID', y='Frequency', data=rfm_copy)

In [None]:
# ClusterID vs Amount
sns.boxplot(x='Cluster_ID', y='Amount', data=rfm_copy)
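
The box plots can be complemented with a small numeric profile per cluster. The sketch below shows the idea on an invented table that uses the notebook's column names; the values are made up and do not come from the clustering result.

```python
import pandas as pd

# Invented customers with an assumed Cluster_ID column
toy = pd.DataFrame({
    "Cluster_ID": [0, 0, 1, 1, 2, 2],
    "Recency":    [10, 20, 200, 250, 15, 25],
    "Frequency":  [90, 110, 5, 15, 20, 30],
    "Amount":     [1500.0, 2000.0, 100.0, 200.0, 300.0, 400.0],
})

# Median of each RFM attribute per cluster, plus the cluster size
profile = toy.groupby("Cluster_ID")[["Recency", "Frequency", "Amount"]].median()
profile["Customers"] = toy.groupby("Cluster_ID").size()
print(profile)
```

Applied to `rfm_copy`, a table like this makes it easier to decide which cluster ID corresponds to which customer type, since K-means assigns the IDs arbitrarily.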

# 6. Analysis

RFM analysis is used to organize customer databases into specific segments. Basing these segments on customers' purchase behavior gives crucial insight into the type of marketing that should be used to target them and to move them between segments.

Frequency and monetary values are core to calculating the customer lifetime value (CLV); meanwhile, recency gives insight into customer engagement and indicates the likelihood of that customer coming back again.

The Pareto principle states that for many outcomes roughly 80% of consequences come from 20% of the causes, so 80% of the revenue could be attributable to 20% of the customers. This means that it is important to retain the customers who purchase the most and are the most loyal.
Acquiring new customers is also important; however, it costs more to win a new customer than to retain an established good one.
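
Whether the 80/20 pattern holds for a given customer base can be checked directly from the per-customer totals. The sketch below uses invented amounts chosen so that the arithmetic is easy to follow (the top 2 of 10 customers hold 8,000 of 10,000); it says nothing about the actual data set.

```python
import pandas as pd

# Invented per-customer revenue totals
toy = pd.DataFrame({
    "CustomerID": list("ABCDEFGHIJ"),
    "Amount": [5000, 3000, 500, 400, 300, 250, 200, 150, 120, 80],
})

# Rank customers by revenue and take the top fifth (20%)
ranked = toy.sort_values("Amount", ascending=False)
top_n = max(1, len(ranked) // 5)
top_share = ranked["Amount"].head(top_n).sum() / ranked["Amount"].sum()

print(top_n, top_share)
```

The same calculation on the `Amount` column of the `rfm` table would show how concentrated the revenue is in this store.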


The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers, so this online store is seasonal: its peak seasons follow different festivals. For example, customers purchase and stock up on large amounts of product before Christmas.


**In this online retail data set, K-means clustering has produced 3 Cluster_IDs:**

* 0-Best customer (highR/highF/highM): customers in this segment show high loyalty. They visited the online store most recently, purchased frequently, and purchased a high amount of goods, so this segment brings the highest value to the company. "Recency" shows that customers in this segment last purchased 1-2 months before December, with a monetary amount of £1,300-2,200, so the online store could start preparing and sending marketing information before October to catch their attention, and the company should give most of the marketing budget to this segment. The company could announce new products through email marketing, provide a VIP service to increase customer satisfaction, and send questionnaires to learn which products are needed and prepare enough stock. Furthermore, the CustomerID can be traced back to learn this segment's demographic, geographic, psychographic, etc. information in order to predict buying behavior, and combined with the "Frequency" and "Monetary" data to predict the customer life cycle in this segment.


* 1-Passive customer (lowR/lowF/lowM): customers last visited the online store 5-9 months ago, with a very low purchase amount and a very low buying frequency. The company should reduce or withdraw the marketing and service budget for this segment.


* 2-Potential customer (highR/lowF/lowM): customers have purchased recently, but only a few times and with small amounts of goods. This segment should be segmented further to find the reasons for the low frequency and low purchase amount. These could be new customers who purchased for the first time through this online store, or existing customers who generally purchase small amounts. The company should build its marketing strategy on an understanding of each sub-segment's pain points and needs, in order to move these customers into the Best customer segment.


**Conclusion**: 

This company mainly serves the B2B market, and there are three customer segments. The company should focus mainly on increasing the stickiness, loyalty and satisfaction of Cluster 0 customers; understand the different sub-segments of Cluster 2 customers and create personalised marketing strategies to move them into the Best customer segment; and reduce or withdraw the budget for Cluster 1 customers.