## **In this notebook I will try to do an exploratory data analysis and clustering for the Online Retail II dataset, found in the UCI machine learning repository.**



### RFM is a method used for analyzing customer value. It is commonly used in database marketing and direct marketing and has received particular attention in retail and professional services industries.

### RFM stands for the three dimensions:

### * Recency – How recently did the customer purchase?
### * Frequency – How often do they purchase?
### * Monetary Value – How much do they spend?
### Customer purchases may be represented by a table with columns for the customer name, date of purchase and purchase value.
### One approach to RFM is to assign a score for each dimension on a scale from 1 to 10 but in this notebook we are just going to standardize the values of each column prior to clustering.

Note that because clustering is exploratory by nature, there is no objectively correct algorithm.

<h1 id="Exploratory_Data_Analysis">
1. Exploratory Data Analysis
</h1>

In [None]:
import numpy as np
import pandas as pd
import os


data = pd.read_csv('../input/online-retail-ii-uci/online_retail_II.csv')

data

In [None]:
data.isnull().sum()

In [None]:
data.info()

In [None]:
pd.set_option("display.max_rows", 101)
pd.set_option("display.max_columns", 10)


In [None]:
# number of rows whose price is negative
negprice = (data['Price'] < 0).sum()

# number of rows whose quantity is negative
negquantity = (data['Quantity'] < 0).sum()


In [None]:
np.set_printoptions(threshold=np.inf)
#df = pd.DataFrame(data)
data['Description'].unique()

We observe that the descriptions that indicate faulty items are all written in lowercase letters. Let's look at only the lowercase words that appear in the descriptions.

In [None]:
# Collect every distinct all-lowercase word that appears in a description
descriptions = data['Description'].dropna()

mylist = sorted({word for text in descriptions for word in text.split() if word.islower()})
mylist

We will remove the items whose description implies they are faulty

In [None]:
# Drop the rows whose description is exactly one of the lowercase words collected above
data = data[~data['Description'].isin(mylist)]

In [None]:
#replace null customer values with 99999
data[['Customer ID']] =data[['Customer ID']].fillna(99999)
#replace null description values with 'Unknown'
data[['Description']] =data[['Description']].fillna('Unknown')

#leave only positive quantities and known customer IDs
data = data[data['Quantity'] > 0]
data = data[data['Customer ID'] != 99999]

# Removing returned products (Invoice numbers starting with C) from the data set
data = data[~data["Invoice"].str.startswith("C", na = False)]

data.info()

In [None]:
data.shape

In [None]:
data = data.drop_duplicates()

In [None]:
data.shape

# RFM Implementation

We will build three dataframes keyed by Customer ID: one with the amount each customer spent, one with the frequency of purchases, and one with the recency of the last purchase. In the end we will merge them.
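As a compact illustration of the same idea, the three RFM columns can also be computed in a single `groupby` pass. This is only a sketch on made-up toy transactions: the `tx` table, the `snapshot` reference date and the choice to count distinct invoices for Frequency are assumptions for the example, not the notebook's data or exact method.

```python
import pandas as pd

# Toy transactions (hypothetical values); column names mirror the Online Retail II file.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 2, 3],
    "Invoice": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-05", "2011-11-20",
        "2011-11-20", "2011-12-08", "2011-10-01",
    ]),
    "Quantity": [2, 1, 5, 1, 3, 10],
    "Price": [3.0, 10.0, 1.0, 4.0, 2.0, 0.5],
})
tx["Amount"] = tx["Quantity"] * tx["Price"]

# Reference date: one day after the last invoice in the toy data.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("Customer ID").agg(
    Frequency=("Invoice", "nunique"),   # distinct invoices per customer
    Amount=("Amount", "sum"),           # total spend per customer
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
).reset_index()
print(rfm)
```

The step-by-step version below builds the same kind of table from three separate dataframes, which makes each intermediate result easier to inspect.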

In [None]:
# Amount spent per invoice line, indexed like `data`
amount = (data['Quantity'] * data['Price']).to_frame('Amount')

amount

In [None]:
# Pair each customer ID with the amount of the same row (both share data's index)
data_cust = pd.concat([data[['Customer ID']], amount], axis = 1)

# Total amount spent per customer
monetary = data_cust.groupby(by = ["Customer ID"]).Amount.sum()

monetary = monetary.reset_index()
monetary = monetary[monetary['Customer ID'] != 99999]
monetary

In [None]:
# Frequency here is the number of invoice lines per customer
# (.nunique() would count distinct invoices instead)
frequency_df = (
    data.groupby("Customer ID")["Invoice"]
    .count()
    .reset_index(name = "Frequency")
)
frequency_df.head()

In [None]:
import datetime as dt

data["InvoiceDate"].max() # Last invoice date

In [None]:
today_date = dt.datetime(2011,12,10) # the day after the last invoice date (2011-12-09), so that recency is never negative

In [None]:
# The type of Customer ID variable needs to be turned into an integer for following commands.
data["Customer ID"] = data["Customer ID"].astype(int) 

In [None]:
# The type of InvoiceDate variable needs to be turned into datetime for following commands.
data["InvoiceDate"] = pd.to_datetime(data["InvoiceDate"])

In [None]:
# Grouping the last invoice dates according to the Customer ID variable, subtracting them from today_date, and assigning them as recency
recency = (today_date - data.groupby("Customer ID").agg({"InvoiceDate":"max"}))
# Rename column name as Recency
recency.rename(columns = {"InvoiceDate":"Recency"}, inplace = True)
# Change the values to day format
recency_df = recency["Recency"].apply(lambda x: x.days)
recency_df.head()

In [None]:
RFM = frequency_df.merge(monetary, on = "Customer ID")
RFM = RFM.merge(recency_df, on = "Customer ID")
RFM

## Outlier Detection

In [None]:
import plotly.express as px

fig = px.box(RFM, y = [ 'Frequency', 'Amount'] )
fig.show()

In [None]:
# Outlier treatment: drop rows outside the Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]

Q1 = RFM.Amount.quantile(0.25)
Q3 = RFM.Amount.quantile(0.75)
IQR = Q3 - Q1
RFM = RFM[(RFM.Amount >= (Q1 - 1.5 * IQR)) & (RFM.Amount <= (Q3 + 1.5 * IQR))]

In [None]:
# Outlier treatment: drop rows outside the Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]

Q1 = RFM.Frequency.quantile(0.25)
Q3 = RFM.Frequency.quantile(0.75)
IQR = Q3 - Q1
RFM = RFM[(RFM.Frequency >= (Q1 - 1.5 * IQR)) & (RFM.Frequency <= (Q3 + 1.5 * IQR))]
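The IQR rule used in the two cells above can be wrapped in a small helper. This is a sketch: `tukey_filter` and the toy `Amount` values are hypothetical names and data introduced for the example.

```python
import pandas as pd

def tukey_filter(df, column, k=1.5):
    """Keep rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)]

# Toy data with one obvious outlier
toy = pd.DataFrame({"Amount": [10, 12, 13, 15, 16, 18, 20, 500]})
kept = tukey_filter(toy, "Amount")
print(kept["Amount"].tolist())
```

Note that the upper fence is measured from Q3, not Q1; only the value 500 falls outside the fences in this toy example.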

In [None]:
fig = px.box(RFM, y = [ 'Recency'] )
fig.show()

## Scaling (Standardization)

We will standardize the data (zero mean and unit variance per column) so that no single RFM dimension dominates the distance computations of the clustering algorithms that follow.
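For intuition, standardization is just a per-column z-score. The sketch below does it by hand with NumPy on a made-up array `X`; `StandardScaler` applies the same formula using the population standard deviation (`ddof=0`).

```python
import numpy as np

# Two columns on very different scales (hypothetical values)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)          # population standard deviation (ddof=0)
Z = (X - mean) / std         # z-scores: each column now has mean 0 and std 1
print(Z)
```

After the transformation both columns contain the same values, which shows that the original difference in scale no longer affects distances.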

In [None]:
from sklearn.preprocessing import normalize, StandardScaler

In [None]:
RFM = RFM.drop(['Customer ID'], axis=1)

In [None]:
standard_scaler = StandardScaler()
RFM = standard_scaler.fit_transform(RFM)

#RFM = pd.DataFrame(RFM)
RFM = pd.DataFrame(data = RFM, columns = ['Frequency', 'Amount', 'Recency'])

RFM

<h1 id="Clustering">
2. Clustering
</h1>

Let's assess the clusterability of the dataset using the Hopkins statistic. As implemented in the pyclustertend library, the score ranges from 0 to 1, and the lower the score, the better the clusterability of the dataset.
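To make the statistic less of a black box, here is a from-scratch sketch that follows the same "lower is better" convention. It is not pyclustertend's implementation: the function name `hopkins_sketch`, the brute-force distance computation and the toy datasets are assumptions made for the illustration.

```python
import numpy as np

def hopkins_sketch(X, m, seed=0):
    """Hopkins-style statistic: near 0 for clustered data, near 0.5 for uniform data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # m real points, sampled without replacement
    idx = rng.choice(n, size=m, replace=False)
    # m synthetic points drawn uniformly from the bounding box of X
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    # w: distance from each sampled real point to its nearest *other* real point
    d_real = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    d_real[np.arange(m), idx] = np.inf      # ignore the point itself
    w = d_real.min(axis=1)

    # u: distance from each uniform point to its nearest real point
    d_unif = np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2)
    u = d_unif.min(axis=1)

    return w.sum() / (w.sum() + u.sum())

rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0, 0.05, (100, 2)), rng.normal(5, 0.05, (100, 2))])
uniform = rng.uniform(0, 1, (200, 2))

h_blobs = hopkins_sketch(blobs, m=50)
h_uniform = hopkins_sketch(uniform, m=50)
print(h_blobs, h_uniform)
```

In clustered data the real points sit much closer to each other than random points sit to the data, so the ratio is pulled toward 0.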

In [None]:
!pip3 install pyclustertend
from pyclustertend import hopkins 

hopkins(RFM, RFM.shape[0])

This score looks quite promising

<h1 id="agglomerative">
2.1 Agglomerative Clustering
</h1>

We begin the clustering process with agglomerative clustering for a simple reason: it is a hierarchical algorithm, so we can inspect the full merge tree before committing to a number of clusters. Hierarchical clustering does not remove the need to choose that number, however. It simply constructs a tree spanning all samples, which shows which samples (and, later on, which clusters) merge to create a bigger cluster. Merging continues until only two clusters remain, which are finally merged into the whole dataset (scikit-learn's `AgglomerativeClustering` defaults to `n_clusters=2`).

First, we are going to determine which linkage method to use. To do that we will calculate the cophenetic correlation coefficient, which measures the correlation between the distances of points in feature space and their distances on the dendrogram (the height at which two points are first merged). If points that are far apart in feature space are also merged late in the dendrogram, the coefficient is close to 1. So, values closer to 1 indicate a linkage method that preserves the original distances better.
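A minimal sketch of that computation with SciPy, on two synthetic blobs rather than the RFM table (the data, the `scores` dictionary and the subset of linkage methods are assumptions for the example). The key point is that the condensed distances passed to `cophenet` use the same Euclidean metric that `linkage` uses by default.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

dists = pdist(X)  # condensed pairwise Euclidean distances

scores = {}
for method in ["single", "complete", "average", "ward"]:
    Z = hierarchy.linkage(X, method)
    # cophenet returns (correlation, cophenetic distances) when given the original distances
    c, _ = hierarchy.cophenet(Z, dists)
    scores[method] = c

for method, c in scores.items():
    print(method, round(float(c), 3))
```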

In [None]:
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import pdist
import plotly.graph_objects as go

# Pairwise Euclidean distances, the same metric that linkage() uses by default
dists = pdist(RFM, 'euclidean')

methods = ['ward', 'average', 'single', 'complete', 'weighted', 'centroid', 'median']

# Cophenetic correlation coefficient for each linkage method
coph_scores = []
for method in methods:
    Z = hierarchy.linkage(RFM, method)
    c, coph_dists = hierarchy.cophenet(Z, dists)
    coph_scores.append(c)

fig = go.Figure([go.Bar(x=methods, y=coph_scores)])
fig.show()



We clearly see that the complete linkage method is the preferred one.

Let's calculate some useful metrics that will help us decide the number of clusters. Since the ground truth labels are not known we will use such metrics as the silhouette coefficient, the Davies-Bouldin score and the Calinski-Harabasz Index.
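As a reminder of what the silhouette coefficient measures, here is a small NumPy sketch of the definition: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`. The function name `silhouette_sketch` and the four toy points are assumptions for the example; the notebook itself uses scikit-learn's `silhouette_score`.

```python
import numpy as np

def silhouette_sketch(X, labels):
    """Mean silhouette coefficient computed directly from the definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        if not same.any():
            scores.append(0.0)                # singleton cluster: score 0 by convention
            continue
        a = D[i, same].mean()                 # mean intra-cluster distance
        b = min(D[i, labels == c].mean()      # mean distance to the nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X_toy = [[0, 0], [0, 1], [10, 0], [10, 1]]
score_toy = silhouette_sketch(X_toy, [0, 0, 1, 1])
print(score_toy)
```

Two tight, well-separated pairs give a value close to 1, while overlapping clusters push the score toward 0 or below.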

In [None]:
from sklearn.metrics import silhouette_score
for n_clusters in range(2,15):
    clusterer = AgglomerativeClustering (n_clusters=n_clusters, distance_threshold = None)
    preds = clusterer.fit_predict(RFM)
    

    score = silhouette_score (RFM, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

From the silhouette scores we can see that 2 clusters is the optimal number.

Let's check the Davies-Bouldin score. We want a value as close to 0 as possible.

In [None]:
from sklearn.metrics import davies_bouldin_score

for n_clusters in range(2,10):
    clusterer = AgglomerativeClustering (n_clusters=n_clusters, distance_threshold = None)
    preds = clusterer.fit_predict(RFM)
    
    score = davies_bouldin_score (RFM , preds)
    print("For n_clusters = {}, the Davies-Bouldin score is {}".format(n_clusters, score))

3 clusters have the lowest score from a reasonable range of clusters.

Next is the Calinski-Harabasz index. We want as high a score as possible.

In [None]:
from sklearn.metrics import calinski_harabasz_score

for n_clusters in range(2,10):
    clusterer = AgglomerativeClustering (n_clusters=n_clusters, distance_threshold = None)
    preds = clusterer.fit_predict(RFM)
    
    score = calinski_harabasz_score(RFM, preds)
    print("For n_clusters = {}, the Calinski-Harabasz score is {}".format(n_clusters, score))

The Calinski-Harabasz index is highest for 2 clusters.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Build the complete-linkage tree directly from the data and draw it
Z = hierarchy.linkage(RFM, 'complete')

plt.figure(figsize=(20,10))

dn = hierarchy.dendrogram(Z)

We will cut the dendrogram at the height that leaves 3 clusters.

Now, let's plot our data using the labels that the algorithm generated. We are going to make a 3D scatter plot.
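Cutting a dendrogram into a fixed number of clusters can also be done directly on the linkage matrix with SciPy's `fcluster`. The sketch below uses three synthetic blobs (an assumption for the example, not the RFM data) and asks for at most 3 flat clusters.

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)
# Three well-separated blobs of 15 points each, centred at (0, 0), (5, 5) and (10, 10)
X = np.vstack([rng.normal(c, 0.2, (15, 2)) for c in (0, 5, 10)])

Z = hierarchy.linkage(X, "complete")
# criterion="maxclust" cuts the tree so that at most t flat clusters remain
labels = hierarchy.fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels, return_counts=True))
```

The cell below obtains the same kind of labels through scikit-learn's `AgglomerativeClustering` with `n_clusters = 3`.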

In [None]:
import seaborn as sns

model = AgglomerativeClustering(distance_threshold=None, n_clusters = 3)
# fit_predict fits the model and returns the cluster labels in one step
y_agg = model.fit_predict(RFM)



In [None]:
RFM_agg = RFM.copy()

RFM_agg['Agg_Labels'] = y_agg

fig = px.scatter_3d(RFM_agg, x='Frequency', y='Amount', z='Recency',
              color= 'Agg_Labels')
fig.show()

<h1 id="DBSCAN">
2.2 DBSCAN
</h1>

Next, we apply the DBSCAN algorithm to our dataset. DBSCAN works by running a connected components algorithm across the different core points. If two core points share border points, or a core point is a border point in another core point’s neighborhood, then they’re part of the same connected component, which forms a cluster.
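The notions of core, border and noise points can be illustrated with a few lines of NumPy. This is a sketch of the point classification only, not of the full algorithm: `classify_points` is a hypothetical helper, and it follows scikit-learn's convention that a point counts as its own neighbor when comparing against `min_samples`.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point as 'core', 'border' or 'noise'."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = D <= eps                              # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_samples
    # Border: not core, but within eps of at least one core point
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Four points in a chain plus one isolated point (toy data)
X_toy = [[0.0], [0.1], [0.2], [0.3], [5.0]]
kinds = classify_points(X_toy, eps=0.15, min_samples=3)
print(kinds.tolist())
```

The two middle points of the chain have three neighbors each and become core points, the two ends are border points, and the isolated point is noise.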

A low min_samples parameter means it will build more clusters from noise, so we shouldn't choose it too small.
The DBSCAN paper suggests choosing minPts based on the dimensionality, and eps based on the elbow in the k-distance graph.
For eps, we can plot the sorted k-nearest-neighbor distances and choose a "knee" there, but there might be no visible knee, or several.

In the more recent publication

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN.
ACM Transactions on Database Systems (TODS), 42(3), 19.


The authors suggest using a larger minPts for large and noisy datasets, and adjusting epsilon depending on whether the clusters come out too large (decrease epsilon) or there is too much noise (increase epsilon). Clustering requires iteration.


Let's apply the knee method to the nearest-neighbor distances to find a suitable eps value for our model. KNN is usually presented as a supervised learning algorithm, but here we only need the unsupervised neighbor search, since our data is not labeled. For the number of neighbors we apply a general rule of thumb popularized by the "Pattern Classification" book by Duda et al.: a good K is usually around the square root of N, where N is the total number of samples.
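For reference, a k-distance curve can be sketched with NumPy alone. This sketch assumes the curve is built from each point's distance to its k-th nearest neighbor, with k given by the square-root rule; `k_distance_curve` and the random data are hypothetical and only illustrate the shape of the computation.

```python
import numpy as np

def k_distance_curve(X, k):
    """Sorted distances from every point to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D.sort(axis=1)              # each row ascending; column 0 is the point itself (distance 0)
    return np.sort(D[:, k])     # k-th neighbor distance, sorted for plotting

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
k = int(np.sqrt(len(X_toy)))    # square-root rule of thumb
curve = k_distance_curve(X_toy, k)
print(k, curve[:3], curve[-3:])
```

Plotting `curve` and looking for the point where it bends sharply upward gives a candidate eps.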

In [None]:
from sklearn.neighbors import NearestNeighbors

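k = int(np.sqrt(len(RFM)))  # square root of the number of samples (63 for 4082 samples)
nearest_neighbors = NearestNeighbors(n_neighbors=k)
nearest_neighbors.fit(RFM)
distances, indices = nearest_neighbors.kneighbors(RFM)
# Column 0 is each point's distance to itself; column 1 is the distance to its nearest neighbor
distances = np.sort(distances, axis=0)[:, 1]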
#print(distances)
plt.figure(figsize=(20,10))
plt.plot(distances)
plt.show()

The optimal value for eps, where a 'knee' is formed, is 0.3.

A general rule of thumb is min_samples ≥ dimensionality + 1.
After some trial and error, min_samples = 7 produces the lowest amount of noise possible


In [None]:
from sklearn.cluster import DBSCAN

model2 = DBSCAN(eps = 0.3, min_samples = 13)

model2 = model2.fit(RFM)

y_db=model2.labels_

RFM_DB = RFM.copy()
RFM_DB["DBLabels"] = y_db

RFM_DB["DBLabels"].value_counts(0)

Let's check the silhouette score.

In [None]:
# Score on the features only; RFM_DB also contains the label column
silhouette_score(RFM, y_db)

In [None]:
fig = px.scatter_3d(RFM_DB, x='Frequency', y='Amount', z='Recency',
              color= 'DBLabels')
fig.show()

In [None]:
# Keep only the rows that DBSCAN did not label as noise (DBLabels == -1)
RFM_DB_filtered = RFM_DB[RFM_DB['DBLabels'] != -1].copy()

RFM_DB_filtered['DBLabels'].unique()

In [None]:
#plot clustered data without noise

fig = px.scatter_3d(RFM_DB_filtered, x='Frequency', y='Amount', z='Recency',
              color= 'DBLabels')
fig.show()

<h1 id="EM">
2.3 EM using GMM
</h1>

At its simplest, GMM is also a type of clustering algorithm. As its name implies, each cluster is modelled according to a different Gaussian distribution. This flexible and probabilistic approach to modelling the data means that rather than having hard assignments into clusters like k-means, we have soft assignments. This means that each data point could have been generated by any of the distributions with a corresponding probability.
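The soft assignments mentioned above are often called responsibilities: the posterior probability that each component generated a given point. Here is a one-dimensional sketch with two Gaussians; the mixture parameters and the `responsibilities` helper are made up for the illustration and are not fitted to the RFM data.

```python
import numpy as np

def responsibilities(x, weights, means, stds):
    """Posterior probability of each 1-D Gaussian component for each point in x."""
    x = np.asarray(x, dtype=float)[:, None]           # shape (n_points, 1)
    weights, means, stds = (np.asarray(v, dtype=float) for v in (weights, means, stds))
    pdf = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    weighted = weights * pdf                           # shape (n_points, n_components)
    return weighted / weighted.sum(axis=1, keepdims=True)

r = responsibilities([0.0, 2.0, 4.0],
                     weights=[0.5, 0.5], means=[0.0, 4.0], stds=[1.0, 1.0])
print(r.round(4))
```

The point halfway between the two means is shared equally by both components, while the points at the means belong almost entirely to one of them. A hard assignment, as in k-means, would simply take the argmax of each row.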

Let's calculate the silhouette score for a range of n_components values.

In [None]:
from sklearn.mixture import GaussianMixture as GMM

for n_components in range(2,15):
    clusterer = GMM(n_components=n_components)
    preds = clusterer.fit_predict(RFM)
    

    score = silhouette_score (RFM, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_components, score))

Let's calculate the BIC score. This criterion estimates how well the GMM explains the data we actually have. The lower the BIC, the better the model describes the data and, by extension, the true, unknown distribution. To avoid overfitting, the criterion penalizes models with a large number of components.
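The penalty can be made concrete with the usual definition, BIC = k·ln(n) − 2·ln(L), where k is the number of free parameters, n the number of samples and L the maximized likelihood. The sketch below counts the free parameters of a GMM under the assumption of full covariance matrices; the helper names and the example numbers are hypothetical.

```python
import math

def gmm_full_n_params(n_components, n_features):
    """Free parameters of a GMM with one full covariance matrix per component."""
    means = n_components * n_features
    covs = n_components * n_features * (n_features + 1) // 2   # symmetric matrices
    weights = n_components - 1                                  # weights sum to 1
    return means + covs + weights

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

# With 3 features (Frequency, Amount, Recency), more components mean many more parameters
for k in (1, 6, 11):
    print(k, gmm_full_n_params(k, 3))
```

For the same log-likelihood, a model with more components gets a higher (worse) BIC, so extra components have to earn their place by improving the fit.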

In [None]:
import numpy as np
import itertools

import matplotlib.pyplot as plt

lowest_bic = np.inf
bic = []
n_components_range = range(1, 9)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = GMM(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(RFM)
        bic.append(gmm.bic(RFM))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

# Plot the BIC scores
plt.figure(figsize=(20, 10))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)

Judging by where the BIC stops dropping sharply, 6 clusters with a full covariance type seems to be the best option.

In [None]:
n_components = np.arange(1, 21)
models = [GMM(n, covariance_type='full', random_state=0).fit(RFM)
          for n in n_components]

plt.plot(n_components, [m.bic(RFM) for m in models], label='BIC')
plt.plot(n_components, [m.aic(RFM) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('n_components');

The number of components chosen this way measures how well the GMM works as a density estimator, not how well it works as a clustering algorithm. The plot above suggests that the optimal number of components is 11 (where the curve stops decreasing). We will nonetheless choose 6 clusters, since they are easier to interpret than 11.

In [None]:
RFM_gmm = RFM.copy()

RFM_gmm['gmm'] = GMM(n_components=6, random_state=42).fit_predict(RFM)

In [None]:
fig = px.scatter_3d(RFM_gmm, x='Frequency', y='Amount', z='Recency',
              color= 'gmm')
fig.show()

<h1 id="kmeans">
2.4 KMeans
</h1>

The KMeans algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.
It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
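The iteration described above (assign each point to its nearest centroid, then move each centroid to the mean of its points) is commonly known as Lloyd's algorithm. Here is a bare-bones NumPy sketch; `lloyd_kmeans`, the explicit initial centroids and the six toy points are assumptions for the example, and scikit-learn's `KMeans` adds smarter initialization and multiple restarts on top of this.

```python
import numpy as np

def lloyd_kmeans(X, init_centroids, n_iter=100):
    """Plain Lloyd iterations starting from the given centroids."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances to the assigned centroid)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia

X_toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels_toy, centroids_toy, inertia_toy = lloyd_kmeans(X_toy, init_centroids=[X_toy[0], X_toy[3]])
print(labels_toy, centroids_toy, inertia_toy)
```

The inertia returned here is the same quantity that the elbow method later in this section tracks as K grows.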

In [None]:
from sklearn.cluster import KMeans

from sklearn.metrics import silhouette_score
for n_clusters in range(2,15):
    clusterer = KMeans (n_clusters=n_clusters)
    preds = clusterer.fit_predict(RFM)
    

    score = silhouette_score (RFM, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

2 clusters seems to be the preferable option. Let's also check using the elbow method.

In [None]:
from scipy.spatial.distance import cdist

distortions = []  # mean distance from each point to its nearest centroid
inertias = []     # within-cluster sum of squared distances
K = range(1,10)

for k in K:
    # Build and fit the model once per value of k
    kmeanModel = KMeans(n_clusters=k).fit(RFM)

    distortions.append(np.min(cdist(RFM, kmeanModel.cluster_centers_,
                       'euclidean'), axis=1).mean())
    inertias.append(kmeanModel.inertia_)

for k, inertia in zip(K, inertias):
    print(str(k) + ' : ' + str(inertia))
    
plt.plot(K, inertias, 'bx-') 
plt.xlabel('Values of K') 
plt.ylabel('Inertia') 
plt.title('The Elbow Method using Inertia') 
plt.show()     

We will opt for 3 clusters

In [None]:
model4 = KMeans(n_clusters = 3)

model4 = model4.fit(RFM)

np.unique(model4.labels_)

RFM_kmeans = RFM.copy()

# Reuse the labels of the model fitted above instead of fitting a second KMeans
RFM_kmeans['kmeans'] = model4.labels_


fig = px.scatter_3d(RFM_kmeans, x='Frequency', y='Amount', z='Recency',
              color= 'kmeans')
fig.show()