# Clustering with K-means for eCommerce
## [RFM Analysis](https://medium.com/mlearning-ai/crm-analytics-customer-segmentation-customer-lifetime-value-prediction-1163fa6e4ae9)
Understand customers' buying patterns using 3 'customer lifetime value' metrics: 
- Recency (how many days ago was their last purchase?)
- Frequency (how many times did they purchase?) 
- Monetary (how much money did they spend?)


## 1. Import modules and data

In [None]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from sklearn import preprocessing, metrics, cluster


In [None]:
# load csv
dataset = pd.read_csv('data.csv', header = 0 , encoding = 'unicode_escape')

df = dataset.copy() #backup copy!

df.head()

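# transform into a df like this: | Customer ID | Recency | Frequency | Monetary |
# read_csv loads InvoiceDate as text unless parse_dates is set, so parse it before date arithmetic
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
# NOTE: assumes df already has a TotalPrice column; if the data only has quantity and
# unit price columns, compute TotalPrice from them first

today_date = dt.datetime(2011,12,11)
#TODO: change to .today day

rfm = df.groupby('CustomerID').agg({'InvoiceDate': lambda invoice_date: (today_date - invoice_date.max()).days,
                                    'InvoiceNo': lambda invoice: invoice.nunique(),
                                    'TotalPrice': lambda total_price: total_price.sum()})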

rfm.columns = ['recency','frequency','monetary']
rfm = rfm[(rfm['monetary'] > 0)]
rfm = rfm.reset_index()
rfm.head()

## 2. Clean/preprocess data

### View descriptive statistics and compare to model requirements

In [None]:
# describe: dtypes, missing values per column, summary statistics
print(df.dtypes)
print(df.isnull().sum())
# distribution
print(df.describe())

We can see that:

- There are XXX variables with missing values.
- There are XXX negative values
- The dtypes are XXcorrect
- The distribution is XXX

KMeans requires:
- XXX

### Outliers

In [None]:
#use IQR
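
One common way to apply the IQR rule is to drop rows where any feature falls outside 1.5 × IQR beyond the first or third quartile. The sketch below is one possible version, not necessarily the approach this notebook will end up using; the helper name `remove_iqr_outliers`, the 1.5 multiplier, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def remove_iqr_outliers(frame, columns, factor=1.5):
    """Keep rows whose values in `columns` all lie within the IQR fences."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    # A row is kept only if every selected column is inside its own fences
    within = ((frame[columns] >= lower) & (frame[columns] <= upper)).all(axis=1)
    return frame[within]

# Hypothetical toy data with one extreme monetary value
toy_rfm = pd.DataFrame({
    'recency':   [10, 12, 9, 11, 10, 13],
    'frequency': [2, 3, 2, 3, 2, 3],
    'monetary':  [100.0, 120.0, 90.0, 110.0, 105.0, 5000.0],
})
toy_clean = remove_iqr_outliers(toy_rfm, ['recency', 'frequency', 'monetary'])
print(len(toy_rfm), '->', len(toy_clean))
```

Whether to drop, cap, or keep such rows is a judgment call: in RFM data the extreme spenders may be exactly the customers of interest.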

### Feature scaling: Z-Score Standardisation
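Standardisation puts features on a comparable scale: each feature is centred so that its mean becomes 0 and rescaled so that its variance becomes 1. This matters for K-means because it relies on Euclidean distances, so a feature with a larger range (e.g. monetary) would otherwise dominate the clustering.

StandardScaler() does not require normally distributed data, but the mean and standard deviation it uses are sensitive to outliers, so heavily skewed features are worth handling before scaling.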
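
As a small illustration on toy numbers (not the notebook's data), the z-score can be computed by hand as (x − mean) / standard deviation and compared with `preprocessing.StandardScaler`, which uses the population standard deviation (`ddof=0`, also NumPy's default).

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical single-feature toy data
toy_values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation
manual_z = (toy_values - toy_values.mean(axis=0)) / toy_values.std(axis=0)

# Same transformation via scikit-learn
scaled_z = preprocessing.StandardScaler().fit_transform(toy_values)

print(np.allclose(manual_z, scaled_z))
print(scaled_z.mean(), scaled_z.std())
```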

Missing values:

In [None]:
df.dropna(inplace=True)

print('Missing Values: {}'.format(df.isnull().sum().sum()))

In [None]:
# Rescaling the attributes (column names as defined when rfm was built)
rfm_df = rfm[['monetary', 'frequency', 'recency']]

# Instantiate (StandardScaler lives in the imported preprocessing module)
scaler = preprocessing.StandardScaler()

# fit_transform returns a NumPy array, so restore the column names and index
rfm_df_scaled = pd.DataFrame(scaler.fit_transform(rfm_df), columns=rfm_df.columns, index=rfm_df.index)
rfm_df_scaled.head()

### Later

#### Pick out better scalers here?
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py

#### Principal component analysis (PCA)
If using more dimensions.

#### More steps
Polynomial features (preprocessing): https://scikit-learn.org/stable/modules/preprocessing.html#polynomial-features



## 3. Initialise K-Means model and clusters

In [None]:
kmeans = cluster.KMeans(n_clusters=4, max_iter=50)
#TODO: compare init methods in optimisation phase (init='k-means++' is already the scikit-learn default)

kmeans.fit(rfm_df_scaled)

In [None]:

from collections import Counter

# Cluster label assigned to each data point
label_list = kmeans.labels_
# Count how many data points fall in each cluster
sorted(Counter(label_list).items())

## 4. Optimise the number of clusters (k)
K-means is sensitive both to where the initial centroids are placed and to the number of clusters chosen. The Elbow Method and the Silhouette Score can help us pick a k that fits the data well without overfitting it by splitting it into too many clusters.

### Elbow method

In [None]:
import plotly.graph_objects as go
from plotly.offline import iplot

# Cluster on the scaled RFM features from the preprocessing step
kmeans_data = rfm_df_scaled

inertia = []

k = list(range(1, 10))

for i in k:
    # inertia_ is the within-cluster sum of squared distances for this k
    kmean = cluster.KMeans(n_clusters = i, random_state = 42)
    kmean.fit(kmeans_data)
    inertia.append(kmean.inertia_)

data = go.Scatter(x = k, y = inertia, mode = 'lines+markers', marker = dict(size= 10))

layout = go.Layout(title = {'text' : 'Elbow Method',
                            'y' : 0.9,
                            'x' : 0.5,
                            'xanchor' : 'center',
                            'yanchor' : 'top'},
                   width = 650,
                   height = 470,
                   xaxis = dict(title = 'Number Of Clusters'),
                   yaxis = dict(title = 'Sum of Squared Distance'),
                   template = 'plotly_white')

fig = go.Figure(data = data, layout = layout)
iplot(fig)

### Silhouette score

In [None]:
kmeans = cluster.KMeans(n_clusters = 3, random_state = 42)
kmeans.fit(kmeans_data)
print('Silhouette Score : {}'.format(round(metrics.silhouette_score(kmeans_data, kmeans.labels_), 3)))
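
A single silhouette score is hard to interpret on its own, so one option is to compute it for several values of k and compare. The sketch below does this on synthetic blobs from `make_blobs` rather than on the notebook's data; the cluster centres, `random_state` values, and the range of k are assumptions chosen for illustration.

```python
from sklearn import cluster, metrics
from sklearn.datasets import make_blobs

# Hypothetical toy data: three well-separated groups in two dimensions
toy_points, _ = make_blobs(n_samples=300,
                           centers=[[0, 0], [10, 10], [-10, 10]],
                           cluster_std=1.0,
                           random_state=42)

silhouette_by_k = {}
for n_clusters in range(2, 7):  # the silhouette score needs at least 2 clusters
    model = cluster.KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = model.fit_predict(toy_points)
    silhouette_by_k[n_clusters] = metrics.silhouette_score(toy_points, labels)

# The k with the highest average silhouette is the candidate to keep
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for n_clusters, score in silhouette_by_k.items():
    print(n_clusters, round(score, 3))
print('Best k by silhouette:', best_k)
```

On real RFM data the peak is usually less clear-cut than on these blobs, so it is worth reading the scores alongside the elbow plot.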

### Compare performance of base model to model using `k-means++` 

K-means is sensitive not only to the number of clusters but also to the initial placement of the centroids. One safeguard is to repeat the clustering from several different starting points and keep the best run, which scikit-learn controls with the `n_init` parameter. Another is the `init` parameter: `init='k-means++'`, which is scikit-learn's default, picks initial centroids that tend to be spread out, whereas `init='random'` picks them uniformly at random from the data. The base model for this comparison therefore needs `init='random'` set explicitly.

As the silhouette score measures how well separated the clusters are, we can use it to measure the impact of spreading out the initial centroids.
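
One way to set up this comparison is to fit two models that differ only in `init` and score both. The sketch below uses synthetic data from `make_blobs`, not the notebook's RFM table; the sample size, number of centres, and `random_state` are assumptions for illustration.

```python
from sklearn import cluster, metrics
from sklearn.datasets import make_blobs

# Hypothetical toy data with four groups
toy_points, _ = make_blobs(n_samples=300, centers=4, random_state=0)

init_scores = {}
for init_method in ['random', 'k-means++']:
    # n_init=1 gives each model a single start, so the initialisation actually matters
    model = cluster.KMeans(n_clusters=4, init=init_method, n_init=1, random_state=0)
    labels = model.fit_predict(toy_points)
    init_scores[init_method] = {
        'inertia': model.inertia_,
        'silhouette': metrics.silhouette_score(toy_points, labels),
    }

for init_method, scores in init_scores.items():
    print(init_method, round(scores['inertia'], 1), round(scores['silhouette'], 3))
```

A single seed can favour either method, so a fairer comparison would repeat this over several `random_state` values and compare the average scores.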

## 5. Run k-optimised K-Means model and visualise results

In [None]:
#run
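
A minimal sketch of what this step could look like: scale the features, fit K-Means with the chosen k, attach the labels as `Cluster_Id`, and summarise each cluster by its mean RFM values. The toy table, `chosen_k = 2`, and the variable names are hypothetical stand-ins, not results from this notebook's data.

```python
import pandas as pd
from sklearn import cluster, preprocessing

# Hypothetical toy RFM table standing in for the notebook's `rfm`
toy_rfm = pd.DataFrame({
    'recency':   [5, 7, 6, 200, 210, 190],
    'frequency': [20, 18, 22, 1, 2, 1],
    'monetary':  [900.0, 850.0, 1000.0, 30.0, 45.0, 25.0],
})
features = ['recency', 'frequency', 'monetary']

# Scale first so that no single feature dominates the distance calculation
toy_scaled = preprocessing.StandardScaler().fit_transform(toy_rfm[features])

chosen_k = 2  # placeholder: use the k suggested by the elbow and silhouette steps
final_model = cluster.KMeans(n_clusters=chosen_k, n_init=10, random_state=42)
toy_rfm['Cluster_Id'] = final_model.fit_predict(toy_scaled)

# Profile each cluster in the original (unscaled) units
cluster_profile = toy_rfm.groupby('Cluster_Id')[features].mean()
print(cluster_profile)
```

Profiling on the unscaled values keeps the summary readable (days, orders, currency) even though the model was fitted on scaled data.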

In [None]:
import seaborn as sns

# Attach each customer's cluster label (assumes kmeans was fitted on the rows of rfm, in order)
rfm['Cluster_Id'] = kmeans.labels_

# One box plot per RFM metric, grouped by cluster
for column in ['monetary', 'frequency', 'recency']:
    sns.boxplot(x='Cluster_Id', y=column, data=rfm)
    plt.title('Cluster Id vs {}'.format(column))
    plt.show()

## Last: also try K-Medoids
https://medium.com/@ali.soleymani.co/beyond-scikit-learn-is-it-time-to-retire-k-means-and-use-this-method-instead-b8eb9ca9079a