## Project Overview
Online Retail is a Kaggle data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered online-only retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

## Our Goal
We will conduct an RFM analysis of the company's customers to better understand their buying preferences.

#### The steps are broadly divided into:

1. [Step 1: Reading and Understanding Data](#1)
1. [Step 2: Data Cleaning](#2)
1. [Step 3: Data Preparation](#3)
1. [Step 4: Building K-Means Model](#4)
1. [Step 5: Business Insight](#5)

This notebook is based on Manish Kumar's [Kaggle kernel](https://www.kaggle.com/hellbuoy/online-retail-k-means-hierarchical-clustering). He did a great job organizing the whole kernel.

If this Notebook helps you get a better understanding of K-Means clustering and unsupervised learning, please give me a <font color="orange"><b>UPVOTE</b></font>. 

<a id="1"></a> <br>
## Step 1 : Reading and Understanding Data

In [None]:
# import required libraries for dataframe and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

# import required libraries for clustering
import sklearn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [None]:
# Read data
data = pd.read_csv('../input/online-retail-customer-clustering/OnlineRetail.csv',encoding="ISO-8859-1")
print(data.head())

In [None]:
# shape
print(data.shape)

In [None]:
# data description
print(data.describe())

<a id="2"></a> <br>
## Step 2 : Data Cleaning

In [None]:
# Calculate missing values % in original data
data_null = round(100 * (data.isnull().sum()) / len(data), 2)
print(data_null)

In [None]:
# Drop rows with missing values
data = data.dropna()
print(data.shape)

In [None]:
# Change data type of Customer Id; they are not numeric in essence
data['CustomerID'] = data['CustomerID'].astype(str)

<a id="3"></a> <br>
## Step 3 : Data Preparation

#### We are going to analyze the customers based on the following three factors:
- R (Recency): Number of days since last purchase
- F (Frequency): Number of transactions
- M (Monetary): Total spending by customers (revenue)
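
To make these three definitions concrete before we compute them on the real data, here is a minimal sketch on a made-up table. The `toy` frame, the `snapshot` variable, and the `LastPurchase` column are hypothetical names used only for illustration; the column names mirror the data set, and Frequency counts rows per customer, as the cells below do.

```python
import pandas as pd

# Hypothetical toy transactions (illustrative values, not taken from the data set)
toy = pd.DataFrame({
    'CustomerID': ['A', 'A', 'B', 'B', 'B', 'C'],
    'InvoiceNo': ['1', '2', '3', '3', '4', '5'],
    'InvoiceDate': pd.to_datetime([
        '2011-12-01', '2011-12-09', '2011-11-01',
        '2011-11-01', '2011-11-20', '2011-10-10']),
    'Quantity': [2, 1, 5, 3, 1, 10],
    'UnitPrice': [3.0, 10.0, 1.0, 2.0, 4.0, 0.5],
})
toy['Amount'] = toy['Quantity'] * toy['UnitPrice']

# Reference date: the most recent transaction in the table
snapshot = toy['InvoiceDate'].max()

# One pass over the customers with named aggregation
rfm_toy = toy.groupby('CustomerID').agg(
    LastPurchase=('InvoiceDate', 'max'),
    Frequency=('InvoiceNo', 'count'),  # number of rows per customer
    Amount=('Amount', 'sum'),          # total spending per customer
).reset_index()

# Recency: whole days between the reference date and the last purchase
rfm_toy['Recency'] = (snapshot - rfm_toy['LastPurchase']).dt.days
print(rfm_toy[['CustomerID', 'Recency', 'Frequency', 'Amount']])
```

Grouping once keeps the three attributes attached to their `CustomerID`, so the result does not rely on separate tables sharing the same row order.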

Calculate recency

In [None]:
# New Attribute : Recency
# Reformat datetime
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], format='%d-%m-%Y %H:%M')
print(data['InvoiceDate'])

In [None]:
# Compute the difference between most recent date and transaction date
data['Diff'] = max(data['InvoiceDate']) - data['InvoiceDate']
print(data.head())

In [None]:
# Compute last transaction date to get the recency of customers
rfm_r = data.groupby('CustomerID', as_index=False)['Diff'].min()

In [None]:
# Extract number of days only
rfm_r['Diff'] = rfm_r['Diff'].dt.days
rfm_r.columns = ['CustomerID', 'Recency']
print(rfm_r.head())

Calculate frequency

In [None]:
# New Attribute : Frequency
rfm_f = data.groupby('CustomerID', as_index=False)['InvoiceNo'].count()
rfm_f.columns = ['CustomerID', 'Frequency']
print(rfm_f.head())

Calculate amount (monetary)

In [None]:
# New Attribute : Monetary
data['Amount'] = data['Quantity'] * data['UnitPrice']
rfm_m = data.groupby('CustomerID', as_index=False)['Amount'].sum()
rfm_m.columns = ['CustomerID', 'Amount']
print(rfm_m.head())

In [None]:
# Combine R, F, M
rfm = pd.concat((rfm_r['Recency'], rfm_f['Frequency'], rfm_m['Amount']), axis=1)
print(rfm.head())

In [None]:
# Keep only customers with a positive total amount (drops customers whose refunds offset or exceed their purchases)
rfm = rfm[rfm.Amount > 0]

In [None]:
# Standardize data
from sklearn.preprocessing import StandardScaler

# Store original column names and data
keys = rfm.keys()
rfm_unscaled = rfm.copy()

# Scale rfm
scaler = StandardScaler()
rfm = scaler.fit_transform(rfm)
rfm = pd.DataFrame(rfm, columns=keys)
print(rfm.head())

We remove outliers that fall outside the 3-sigma range as part of data preprocessing.

In [None]:
# Remove values outside mean +/- 3 std range
for key in rfm.keys():
    mean = np.mean(rfm[key])
    std = np.std(rfm[key])
    rfm = rfm[np.abs(rfm[key] - mean) / std <= 3]

<a id="4"></a> <br>
## Step 4 : Building K-Means Model

### K-Means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.<br>

The algorithm works as follows:

1. Randomly initialize k points as the cluster centers ("centroids").
2. Assign each data point to its closest centroid.
3. For each centroid, update its coordinates to the average of all points assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing or a given number of iterations is reached. Because the result depends on the random initialization, the procedure is usually run several times, and the run with the lowest error (the sum of squared distances from each data point to its cluster centroid) is kept as the final output.
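
The steps above can be sketched in plain NumPy. This is only a minimal, single-run illustration on made-up 2D points, not what scikit-learn's `KMeans` does internally; the function name `kmeans_sketch` and the toy data are our own assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Single run of the basic k-means loop (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop early once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and error (sum of squared distances to the centroids)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia

# Made-up data: three loose blobs of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(50, 2)),
])
labels, centroids, inertia = kmeans_sketch(X, k=3)
print(centroids)
print(inertia)
```

A single run like this can get stuck in a poor solution if the initial centroids are unlucky, which is why repeating it with different seeds and keeping the lowest error is common practice.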

### Finding the Optimal Number of Clusters

#### Elbow Curve to get the right number of Clusters
A fundamental step in K-means clustering is to choose the number of clusters, k, into which the data will be grouped. The Elbow Method is one of the most popular ways to pick this value: we plot the error (inertia) against k and look for the point where the curve bends and further increases in k bring only small improvements.

In [None]:
# Draw elbow curve to find optimal K value
# 2 <= i <= 10
inertia = {}
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, max_iter=1000)
    kmeans.fit(rfm)
    inertia[i] = kmeans.inertia_

for k, v in inertia.items():
    print(str(k), ': ', str(v))

In [None]:
# Plot inertia against each K value
plt.subplots()
plt.plot(list(inertia.keys()), list(inertia.values()), 'b+-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()

The elbow curve is close to a linear function after K value of 3, which indicates that adding more centroids may not result in better clustering outcomes. Therefore, K should be set to 3 according to the elbow curve.
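
One rough way to put numbers on "the curve flattens" is to look at how much the inertia falls each time K increases by one. The sketch below uses made-up inertia values (the `inertia_demo` dictionary is hypothetical and is not the output of this notebook); it is a simple heuristic, not a formal test.

```python
# Hypothetical inertia values for K = 2..7 (illustrative only)
inertia_demo = {2: 9000.0, 3: 5500.0, 4: 4600.0, 5: 3900.0, 6: 3300.0, 7: 2800.0}

ks = sorted(inertia_demo)
# Relative drop in inertia gained by going from K-1 to K clusters
rel_drop = {
    k: (inertia_demo[k - 1] - inertia_demo[k]) / inertia_demo[k - 1]
    for k in ks[1:]
}
for k, drop in rel_drop.items():
    print(f"K={k}: inertia falls by {drop:.1%}")
```

In this made-up example the drop from K=2 to K=3 is much larger than any later drop, which is the pattern an elbow at K=3 would produce.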

However, looking at the elbow curve alone gives only an imprecise estimate of K. To gain more confidence in the value of 3 that we have selected, we can do a Silhouette analysis.

### Silhouette Analysis

$$\text{silhouette score}=\frac{p-q}{\max(p,q)}$$

$p$ is the mean distance from the data point to the points in the nearest cluster that it is not a part of.

$q$ is the mean distance from the data point to all the other points in its own cluster.

* The silhouette score ranges from -1 to 1.

* A score closer to 1 indicates that the data point is very similar to the other data points in its cluster.

* A score closer to -1 indicates that the data point is not similar to the data points in its cluster.

**Rule: Higher score is better.**
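
To make the formula concrete, here is a rough sketch that computes the score point by point with NumPy on four made-up points. The helper name `silhouette_samples_sketch` is hypothetical, Euclidean distance is assumed, at least two clusters are required, and a point that is alone in its cluster is given a score of 0, which is a common convention.

```python
import numpy as np

def silhouette_samples_sketch(X, labels):
    """Per-point silhouette (p - q) / max(p, q) with Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # a point alone in its cluster keeps a score of 0
        # q: mean distance to the other points in the same cluster
        q = dist[i, own].sum() / (own.sum() - 1)
        # p: smallest mean distance to the points of any other cluster
        p = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (p - q) / max(p, q)
    return scores

# Made-up points: two tight pairs that are far apart
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_samples_sketch(points, [0, 0, 1, 1])  # pairs grouped together
bad = silhouette_samples_sketch(points, [0, 1, 0, 1])   # pairs split apart
print(good.mean(), bad.mean())
```

Grouping the nearby points together gives an average score close to 1, while splitting each pair across clusters pushes the average below 0. The `silhouette_score` function we imported reports this average over all points.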

In [None]:
# Silhouette analysis
for num_clusters in range(2, 10):
    kmeans = KMeans(n_clusters=num_clusters, max_iter=1000)
    kmeans.fit(rfm)
    cluster_labels = kmeans.labels_

    silhouette_avg = silhouette_score(rfm, cluster_labels)
    print("For n_clusters={0}, the silhouette score is {1:.4f}".format(num_clusters, silhouette_avg))

Although K=2 yields the highest silhouette score, we continue to use K=3 because three segments are easier to interpret from a business perspective.
You will see how K=3 works out in the later sections of this notebook.

In [None]:
# Final model with k=3
kmeans = KMeans(n_clusters=3, max_iter=1000)
kmeans.fit(rfm)
print(kmeans.labels_)

In [None]:
# Assign label
rfm['Cluster_Id'] = kmeans.labels_
print(rfm.head())
print(rfm.shape)

These box plots show that the clustering works quite well: there is only a small overlap between clusters, which suggests that none of the three clusters is redundant.

We can also draw scatter plots with colored clusters.

In [None]:
# Scatter plot
plt.subplots()
plt.scatter(x=rfm['Recency'], y=rfm['Amount'], c=rfm['Cluster_Id'], alpha=0.4)
plt.xlabel('Recency')
plt.ylabel('Amount')
plt.title("Clustering: Recency vs Amount")

plt.subplots()
plt.scatter(x=rfm['Frequency'], y=rfm['Amount'], c=rfm['Cluster_Id'], alpha=0.4)
plt.xlabel('Frequency')
plt.ylabel('Amount')
plt.title("Clustering: Frequency vs Amount")

plt.subplots()
plt.scatter(x=rfm['Frequency'], y=rfm['Recency'], c=rfm['Cluster_Id'], alpha=0.4)
plt.xlabel('Frequency')
plt.ylabel('Recency')
plt.title("Clustering: Frequency vs Recency")

Well, these 2D plots may not be intuitive enough for you. So why not draw a 3D plot to see which group each customer belongs to, now that we have computed the three variables R, F, and M?

In [None]:
# Create 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xs=rfm['Recency'], ys=rfm['Frequency'], zs=rfm['Amount'], c=rfm['Cluster_Id'])

plt.title('RFM Clustering')
ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Amount')

### Qualitative Analysis of the Clustering Result
This 3D scatter graph clearly shows that our K-Means clustering algorithm has successfully segmented all customers into three distinct categories:

* Category 1:
  - Color: green
  - Number: quite a lot
  - Purchase frequency: low
  - Amount of spending: low
  - Recency: varies

* Category 2:
  - Color: purple
  - Number: many
  - Purchase frequency: low to mid
  - Amount of spending: low to mid
  - Recency: low

* Category 3:
  - Color: yellow
  - Number: few
  - Purchase frequency: varies
  - Amount of spending: high
  - Recency: low


<a id="5"></a> <br>
## Step 5 : Business Insight

We have already learned that a small number of large customers contribute a great portion of sales revenue, so it is worth our time and some further coding to explore our customer structure in terms of the three clusters.

In [None]:
# Restore rfm
rfm = rfm_unscaled
print(rfm.head())



In [None]:
# Remove values outside mean +/- 3 std range
for key in rfm.keys():
    mean = np.mean(rfm[key])
    std = np.std(rfm[key])
    rfm = rfm[np.abs(rfm[key] - mean) / std <= 3]


In [None]:
# Refit model with k=3 to unscaled rfm
kmeans = KMeans(n_clusters=3, max_iter=1000)
kmeans.fit(rfm)
print(kmeans.labels_)

In [None]:
# Assign label
rfm['Cluster_Id'] = kmeans.labels_
print(rfm.head())
print(rfm.shape)

In [None]:
# Create a new data frame for cluster analysis
cluster_sale = rfm.groupby('Cluster_Id', as_index=False)['Amount'].sum()
cluster_sale.columns = ['ClusterId', 'Revenue']

In [None]:
# Count the number of customers by their cluster id
cluster_sale['CustomerCount'] = rfm['Cluster_Id'].value_counts()
cluster_sale['Customer%'] = cluster_sale['CustomerCount'] / sum(cluster_sale['CustomerCount'])

In [None]:
# Calculate the average recency of customers
cluster_sale['MeanRecency'] = rfm.groupby('Cluster_Id', as_index=False)['Recency'].mean()['Recency']

In [None]:
# Calculate the number of transactions done by clusters
cluster_sale['TransactionCount'] = rfm.groupby('Cluster_Id', as_index=False)['Frequency'].sum()['Frequency']
cluster_sale['Transaction%'] = cluster_sale['TransactionCount'] / sum(cluster_sale['TransactionCount'])

In [None]:
# Calculate the average number of times customers do shopping
cluster_sale['MeanFrequency'] = cluster_sale['TransactionCount'] / cluster_sale['CustomerCount']

In [None]:
# Calculate the percentage of total revenue by clusters
cluster_sale['Revenue%'] = cluster_sale['Revenue'] / sum(cluster_sale['Revenue'])

In [None]:
# Calculate average revenue per transaction
cluster_sale['ARPT'] = cluster_sale['Revenue'] / cluster_sale['TransactionCount']

In [None]:
# Calculate average revenue per customer
cluster_sale['ARPC'] = cluster_sale['Revenue'] / cluster_sale['CustomerCount']

In [None]:
# Reorganize columns for easier reading
cluster_sale = cluster_sale[
    ['ClusterId', 'CustomerCount', 'Customer%', 'MeanRecency', 'MeanFrequency', 'Revenue', 'Revenue%',
     'TransactionCount', 'Transaction%', 'ARPT', 'ARPC']]

In [None]:
print(cluster_sale.head())

### Interpretation of Results

**Cluster 0 are mostly one-time, infrequent customers**

1. They are the great majority of customers (83%) but contribute only slightly more than 40% of total revenue. Their transaction amount per invoice is quite low, at 13 dollars.

2. On average, they last shopped online about 102 days ago.

3. They shop only occasionally: during the period studied, a typical cluster-0 customer placed about 50 orders in total.

**Cluster 1 are general customers**

1. They are about 14% of all customers and the second-largest source of income (40%).

2. Their average recency is about 30 days.

3. They shop often, with an average of 210 orders per person.

**Cluster 2 are frequent, high-value customers/wholesalers**

1. There are only 93 cluster-2 customers, who make up just 2% of total customers and 9% of total transactions, but more than 18% of total revenue comes from them.

2. They also love shopping and spend much more in total. Their average purchase amount per transaction is about 36 dollars, which is the highest among all three customer groups. Their mean total spending amount is more than 11,500 dollars.

3. They buy quite often, with a mean recency of 17 days. They shop almost every day: an average cluster-2 customer shopped 321 times in the past year.
