
# Lesson 111: K-Means Clustering - Outliers Removal and Customer Segmentation

---

### Teacher-Student Activities

In the previous class, we learned RFM analysis for analysing customers based on three factors: Recency, Frequency, and Monetary Value.

In this class, we will remove outliers from the dataset and apply K-Means clustering to create clusters of customers exhibiting similar purchase behaviour.

Let's quickly run the code cells, go through the problem statement covered in the previous lesson, and then begin this lesson from **Activity 1: Data Analysis**.


---

#### Customer Segmentation Problem Statement


We have a transactional dataset that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

The company wants to segment its customers and determine marketing strategies according to these segments.

The dataset consists of the following attributes:

- `InvoiceNo`: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.

- `StockCode`: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

- `Description`: Product (item) name. Nominal.

- `Quantity`: The quantities of each product (item) per transaction. Numeric.

- `InvoiceDate`: Invoice date and time. Numeric, the day and time when each transaction was generated. The date-time format used here is `yyyy-mm-dd hh:mm:ss`.

- `UnitPrice`: Unit price. Numeric, product price per unit in pound sterling, also known as GBP (Great Britain Pound).

- `CustomerID`: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

- `Country`: Country name. Nominal, the name of the country where each customer resides.



**Dataset Credits:** https://archive.ics.uci.edu/ml/datasets/online+retail

**Citation:** Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.


---

#### Recap

#### Loading the Dataset





**Dataset Link:** https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/online-retail-customers.xlsx



In [None]:
# Read the dataset and create a Pandas DataFrame.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

file_path = "https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/online-retail-customers.xlsx"
df = pd.read_excel(file_path)
df.head()

In [None]:
# Get the total number of rows and columns, data types of columns and missing values (if exist) in the dataset.
df.info()

---

#### Removing the Cancelled Orders



In [None]:
# Check the data type of 'InvoiceNo' field
type(df['InvoiceNo'][0])

In [None]:
# Convert 'InvoiceNo' field to string and verify whether the data type is converted or not.
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
type(df['InvoiceNo'][0])

In [None]:
# Use regex to find 'C' in the 'InvoiceNo' field
import re
df[df['InvoiceNo'].str.contains(pat = 'C', flags = re.IGNORECASE)]

In [None]:
# Check total number of orders including cancelled orders.
df['InvoiceNo'].shape[0]

In [None]:
# Remove cancelled invoices from the dataset
df = df[~(df['InvoiceNo'].str.contains('C', flags = re.IGNORECASE, regex = True))]
df

---

#### Removing Missing Values

In [None]:
# Obtain the number of missing or null values in df
df.isnull().sum()

In [None]:
# Determine the percentage of null values in each column.
df.isnull().sum() * 100 / df.shape[0]

In [None]:
# Remove the null valued rows.
print(f"Before removing null values:\nNumber of rows = {df.shape[0]}")
df.dropna(inplace = True)
print(f"After removing null values:\nNumber of rows = {df.shape[0]}")

In [None]:
# Again obtain the number of null values in df.
df.isnull().sum()

In [None]:
# Check the data type of CustomerID column.
df['CustomerID'].dtype

In [None]:
# Convert 'CustomerID' field to integer based categorical column.
df['CustomerID'] = df['CustomerID'].astype('int64').astype('category')
df['CustomerID'].dtype

---

#### RFM Analysis



In [None]:
# Check the first 5 samples of the DataFrame
df.head()

In [None]:
# Obtain the total purchase amount for the customers
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
df.head()

In [None]:
# Obtain the number of unique customers
df['CustomerID'].nunique()

In [None]:
# Obtain the Monetary information from the DataFrame
monetary_df = df[['CustomerID', 'TotalPrice']].groupby('CustomerID', as_index = False).sum()
monetary_df.rename(columns = {'TotalPrice' : 'Monetary'}, inplace = True)
monetary_df

In [None]:
# Obtain the Frequency information from the DataFrame
frequency_df =  df[['CustomerID', 'InvoiceNo']].groupby('CustomerID', as_index = False).count()
frequency_df.rename(columns = {'InvoiceNo': 'Frequency'}, inplace = True)
frequency_df

---

#### Merging DataFrames



In [None]:
# Merge 'monetary_df' and 'frequency_df' DataFrames.
rfm_df = pd.merge(monetary_df, frequency_df, on='CustomerID', how = 'inner')
rfm_df.head()

---

#### Calculating Recency


In [None]:
# Obtain the last purchase date for each customer
recency_df = df[['CustomerID', 'InvoiceDate']].groupby('CustomerID', as_index = False).max()
recency_df.rename(columns = {'InvoiceDate': 'LastPurchaseDate'}, inplace = True)
recency_df

In [None]:
# Obtain the last invoice date in the dataset.
df['InvoiceDate'].max()

In [None]:
# Obtain the present date i.e. the last invoice date + 1 day
present_date = df['InvoiceDate'].max() + pd.Timedelta("1 day")
present_date

In [None]:
# Obtain the days since last purchase made by a customer
days_last_purchase = present_date - recency_df['LastPurchaseDate']
days_last_purchase

In [None]:
# Extract days from datetime using 'dt.days' attribute
recency_days = days_last_purchase.dt.days
recency_days

In [None]:
# Add 'recency_days' as column to the merged DataFrame 'rfm_df'.
rfm_df['Recency'] = recency_days
rfm_df

We now have a DataFrame for RFM analysis consisting of the necessary fields to carry out the customer segmentation.
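
As a compact cross-check of the recap above, the three RFM columns can also be built in a single `groupby()` pass. The sketch below uses a made-up toy table (`toy_df` and its values are hypothetical, not taken from the retail dataset) and pandas named aggregation; it is one possible way to arrive at the same kind of table, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical toy transactions; the values are made up for illustration.
toy_df = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 2, 3],
    'InvoiceNo': ['A1', 'A2', 'B1', 'B2', 'B3', 'C1'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-05', '2011-11-01',
                                   '2011-11-15', '2011-12-08', '2011-10-01']),
    'TotalPrice': [10.0, 20.0, 5.0, 5.0, 15.0, 100.0],
})

# Reference date: one day after the last invoice in the data.
present_date = toy_df['InvoiceDate'].max() + pd.Timedelta("1 day")

# Named aggregation builds Monetary, Frequency and the last purchase date together.
toy_rfm = toy_df.groupby('CustomerID').agg(
    Monetary = ('TotalPrice', 'sum'),
    Frequency = ('InvoiceNo', 'count'),
    LastPurchaseDate = ('InvoiceDate', 'max'),
).reset_index()

# Recency is the number of whole days between the reference date and the last purchase.
toy_rfm['Recency'] = (present_date - toy_rfm['LastPurchaseDate']).dt.days
toy_rfm = toy_rfm.drop(columns = 'LastPurchaseDate')
print(toy_rfm)
```

Because every column comes from the same grouped object, the rows stay aligned by `CustomerID` without a separate merge step.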

Let us now analyse the `rfm_df` DataFrame obtained after RFM analysis and prepare it for K-Means clustering.



---

#### Activity 1: Data Analysis

For clustering, the `CustomerID` field is not required; hence, it can be dropped from the `rfm_df` DataFrame.

In [None]:
# S1.1: Dropping the 'CustomerID' column
rfm_df.drop(columns = 'CustomerID', inplace = True)
rfm_df

Let's create histograms and boxplots to understand the distribution of the `Monetary`, `Frequency`, and `Recency` columns.

Use the `subplots()` function of the `matplotlib.pyplot` module to display the three histograms in the first row and the three boxplots in the second row.

Follow the steps given below to create this subplot:
1. Call the `subplots()` function of `matplotlib.pyplot` and unpack the returned figure and axes objects into two different variables, say `fig` and `axis`. Inside the `subplots()` function, pass:

  - `nrows = 2` and `ncols = 3` parameters to create a figure having 2 rows and 3 columns.

  - `figsize = (15, 5)` parameter to create a figure 15 inches wide and 5 inches high.

  - `dpi = 100` parameter to set the resolution of the figure in dots per inch; a higher value renders a larger, sharper image.

2. Construct a histogram to visualise the distribution of the `Monetary` column using the first row, first column subplot's axes i.e. `axis[0, 0]`.

3. Construct a boxplot to visualise the distribution of the `Monetary` column using the second row, first column subplot's axes i.e. `axis[1, 0]`.

4. Call the `set_title()` function on the `axis[0, 0]` object to set a common title for the histogram and the boxplot below it.

5. Similarly, construct histograms and boxplots for the `Frequency` and `Recency` columns using the respective subplots' axes.

6. Call the `show()` function on the `matplotlib.pyplot` object.


In [None]:
# S1.2: Obtain the histogram and boxplots
fig, axis = plt.subplots(nrows = 2, ncols = 3, figsize = (15, 5), dpi = 100)

# Construct Histogram and Boxplot for 'Monetary'
axis[0, 0].hist(rfm_df['Monetary'], bins = 'sturges', facecolor = 'red', edgecolor = 'black')
sns.boxplot(x = 'Monetary', data = rfm_df, ax = axis[1, 0], color = 'red')
axis[0, 0].set_title("Histogram and Boxplot for Monetary")

# Construct Histogram and Boxplot for 'Frequency'
axis[0, 1].hist(rfm_df['Frequency'], bins = 'sturges', facecolor = 'green', edgecolor = 'black')
sns.boxplot(x = 'Frequency', data = rfm_df, ax = axis[1, 1], color = 'green')
axis[0, 1].set_title("Histogram and Boxplot for Frequency")

# Construct Histogram and Boxplot for 'Recency'
axis[0, 2].hist(rfm_df['Recency'], bins = 'sturges', facecolor = 'purple', edgecolor = 'black')
sns.boxplot(x = 'Recency', data = rfm_df, ax = axis[1, 2], color = 'purple')
axis[0, 2].set_title("Histogram and Boxplot for Recency")

plt.show()

From the above plots, it is clear that there are a lot of outliers in the `Monetary` and `Frequency` fields.

These outliers will affect the model because K-Means clustering is based on the distance of data points from the cluster centroids. Outliers shift the cluster centroids away from their intended positions, thereby generating inaccurate clusters. To avoid this, we need to remove the outliers.
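
To see why a single extreme value matters, recall that a K-Means centroid is the mean of the points assigned to it. The following minimal sketch uses made-up spend values (the `group` array is hypothetical) to show how far one outlier can pull that mean.

```python
import numpy as np

# Hypothetical spend values for one tight group of customers.
group = np.array([100.0, 110.0, 90.0, 105.0, 95.0])
centroid_without = group.mean()

# Add a single extreme spender to the same group.
group_with_outlier = np.append(group, 5000.0)
centroid_with = group_with_outlier.mean()

print(centroid_without)
print(centroid_with)
```

With these numbers the centroid moves from `100.0` to roughly `916.7`, a position where none of the ordinary customers actually lie.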


---

#### Activity 2: Removing Outliers

We have already learned how boxplots help in identifying outliers in column data in one of the previous lessons (*Lesson: Meteorite Landings - Box Plots*). Let us recall that.

**What are outliers?**
- An outlier is a value in a data series that is either very small or very large compared with the rest of the values.
- Outliers are abnormal values that can distort the overall observation because of their extreme magnitude.
- Hence, they should be removed from the actual data.

A convenient way to detect outliers is to create a boxplot. It plots the minimum, first quartile, second quartile, third quartile, and maximum values in the form of a box with whiskers. Any value beyond the minimum or maximum limit is considered an outlier.


<center>
<img src= "https://s3-whjr-v2-prod-bucket.whjr.online/fc916def-1fd4-4a16-8a7f-caadecafdecc.jpg" height = 350 /></center>


- **Median or Second quartile ($Q2$):** The middle value of the dataset. Also known as $50^\text{th}$ percentile.

- **First quartile ($Q1$):** The middle value between the smallest value (not the "minimum") and the median of the dataset. Also known as the $25^\text{th}$ percentile, which means that $25\%$ of the data lies between the smallest value and $Q1$.

- **Third quartile ($Q3$):** The middle value between the median and the highest value (not the "maximum") of the dataset. Also known as the $75^\text{th}$ percentile, which means that $75\%$ of the data lies between the smallest value and $Q3$.

- **InterQuartile Range ($IQR$):** The range from the $25^\text{th}$ to the $75^\text{th}$ percentile. $IQR$ tells how spread out the middle $50\%$ of the values are. It is defined as:

\begin{align}
IQR = Q3 - Q1
\end{align}

- **Minimum or Lower Bound:** $Q1 -1.5 \times IQR$

- **Maximum or Upper Bound:** $Q3 + 1.5 \times IQR$

- **Outliers:** These are the points that lie beyond the "Minimum" and "Maximum" values. So any value greater than the upper bound or less than the lower bound is considered an outlier.
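
The definitions above can be traced on a small made-up series. This is only an illustrative sketch: the `values` series is hypothetical, and `quantile()` is used with its default linear interpolation.

```python
import pandas as pd

# Hypothetical data: eight ordinary values and one extreme value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)       # first quartile
q3 = values.quantile(0.75)       # third quartile
iqr = q3 - q1                    # interquartile range
lower_bound = q1 - 1.5 * iqr     # "minimum" of the boxplot
upper_bound = q3 + 1.5 * iqr     # "maximum" of the boxplot

# Values outside [lower_bound, upper_bound] are flagged as outliers.
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(q1, q3, iqr, lower_bound, upper_bound)
print(outliers.tolist())
```

Here $Q1 = 3$, $Q3 = 7$ and $IQR = 4$, so the bounds are $-3$ and $13$, and only the value `100` falls outside them.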

Let's define a `remove_outliers()` function which removes outliers from a column and returns an outlier-free DataFrame. This function takes two parameters as input:
 - `df`: The DataFrame which consists of columns containing outliers.
 - `col`: The column of DataFrame `df` from which the outliers need to be removed.

Inside this function,

1. Calculate $Q1$, the $25^\text{th}$ percentile, for the `col` column using the `quantile()` function of `pandas` and store it in a `q1` variable. Pass `0.25` as input to the `quantile()` function.

  **Syntax of `quantile()` function:** `DataFrame.quantile(q)` where `q` is the quantile to be computed. By default, `q = 0.5` ($50\%$ quantile). The same method is available on a single column, i.e. on a `Series`.

2. Calculate $Q3$, the $75^\text{th}$ percentile, for the `col` column using the `quantile()` function and store it in a `q3` variable. Pass `0.75` as input to the `quantile()` function.

3. Calculate $IQR$ by subtracting `q1` from `q3` and store it in an `iqr` variable.

4. Calculate lower bound and upper bound using the following formula and store it in `lower_bound` and `upper_bound` variables respectively.

$$\text{Lower Bound}=Q1 - 1.5 \times IQR$$
$$\text{Upper Bound}=Q3 + 1.5 \times IQR$$

5. Keep only those rows of the `df` DataFrame which satisfy the following condition:

    `(df[col] >= lower_bound) & (df[col] <= upper_bound)`

  This condition selects the rows whose `col` value lies between the lower bound and the upper bound (both inclusive).

6. Return the filtered DataFrame.

**Note:** Here, the terms **quartile** and **quantile** are being used interchangeably. Strictly speaking, quantiles are cut points that divide the dataset into equal-sized parts. The quantiles at 0.25, 0.5, and 0.75, which divide the dataset into 4 equal parts, are called quartiles. Thus, a quartile is a type of quantile.


In [None]:
# S2.1: Create a function for removing the outliers.
def remove_outliers(df, col):
  q1 = df[col].quantile(0.25)  # Q1
  q3 = df[col].quantile(0.75)  # Q3
  iqr = q3 - q1                # IQR = Q3 - Q1
  lower_bound =  q1 - 1.5 * iqr  # lower_bound = Q1 − 1.5 * IQR
  upper_bound = q3 + 1.5 * iqr  # upper_bound = Q3 + 1.5 * IQR
  new_df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

  return new_df

Now that we have created a function for removing outliers, we can easily remove outliers from `rfm_df` DataFrame.

To remove the outliers from the `Monetary` field:

1. Call the `remove_outliers()` function and pass `rfm_df` and  `'Monetary'` as input to this function. Save the returned DataFrame in a `m_clean_df` variable.

2. Reset the index of the `m_clean_df` DataFrame using the `reset_index(drop = True)` function. This discards the old index and assigns a fresh index starting from `0` to the new DataFrame.

In [None]:
# S2.2: Removing outliers from 'Monetary' field
m_clean_df = remove_outliers(rfm_df, 'Monetary')
m_clean_df = m_clean_df.reset_index(drop = True)
m_clean_df

The `rfm_df` had 4339 rows and after removal of outliers `m_clean_df` has 3912 rows which means there were $4339 - 3912 = 427$ outliers in the `Monetary` field.

Let us again create boxplots for the `Recency`, `Monetary`, and `Frequency` fields and observe whether there is any improvement in the data distribution. Use subplots to create these multiple plots.

In [None]:
# S2.3: Obtain the boxplots
fig, axis = plt.subplots(nrows = 1, ncols = 3, figsize = (15, 5), dpi = 100)

# Construct Boxplot for 'Monetary'
sns.boxplot(x = 'Monetary', data = m_clean_df, ax = axis[0], color = 'red')
axis[0].set_title("Boxplot for Monetary")

# Construct Boxplot for 'Frequency'
sns.boxplot(x = 'Frequency', data = m_clean_df, ax = axis[1], color = 'green')
axis[1].set_title("Boxplot for Frequency")

# Construct Boxplot for 'Recency'
sns.boxplot(x = 'Recency', data = m_clean_df, ax = axis[2], color = 'purple')
axis[2].set_title("Boxplot for Recency")

plt.show()


We can observe that a lot of outliers have been removed from the `Monetary` and `Frequency` columns.
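
The `Frequency` column improves even though only `Monetary` was filtered, because the filter drops whole rows: a customer who is extreme in `Monetary` takes their `Frequency` value with them. The sketch below illustrates this on a made-up table (`toy_df` is hypothetical) with a copy of the IQR rule defined earlier.

```python
import pandas as pd

def remove_outliers(df, col):
    # Same IQR rule as the function defined earlier in this lesson.
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# Hypothetical toy data: the last customer is extreme in both columns.
toy_df = pd.DataFrame({
    'Monetary': [100, 120, 90, 110, 105, 95, 115, 100, 9000],
    'Frequency': [10, 12, 9, 11, 10, 9, 12, 10, 700],
})

# Filter on 'Monetary' only, then give the result a fresh 0-based index.
clean_df = remove_outliers(toy_df, 'Monetary').reset_index(drop = True)

print(toy_df.shape[0], clean_df.shape[0])
print(toy_df['Frequency'].max(), clean_df['Frequency'].max())
```

Only one row is removed, yet the largest `Frequency` value disappears with it. Outliers that are extreme in `Frequency` alone would still remain after this step.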

Next, let us standardise the DataFrame so that every column has a mean equal to `0` and a standard deviation equal to `1`. For this,
1. Create an object of the `StandardScaler()` class of the `sklearn.preprocessing` module.
2. Apply the `fit_transform()` function on the `m_clean_df` DataFrame and store the scaled values in a new DataFrame `scaled_df`.


In [None]:
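# S2.4: Normalise the RFM parameters
# Import the StandardScaler class from sklearn
from sklearn.preprocessing import StandardScaler

# Make an object of 'StandardScaler()'
standard_scaler = StandardScaler()

# Perform fit and transform operation using 'fit_transform()'
scaled_values = standard_scaler.fit_transform(m_clean_df)

# Make a new DataFrame
scaled_df = pd.DataFrame(scaled_values, columns = m_clean_df.columns)
scaled_df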

Let us again create histograms for `Monetary`, `Frequency`, and `Recency` columns to check whether all of them have similar mean and variance after standardisation.

In [None]:
# S2.5: Obtain the histograms.
fig, axis = plt.subplots(nrows = 1, ncols = 3, figsize = (15, 5), dpi = 100)

# Construct Histogram for 'Monetary'
axis[0].hist(scaled_df['Monetary'], bins = 'sturges', facecolor = 'red', edgecolor = 'black')
axis[0].set_title("Histogram for Monetary")

# Construct Histogram for 'Frequency'
axis[1].hist(scaled_df['Frequency'], bins = 'sturges', facecolor = 'green', edgecolor = 'black')
axis[1].set_title("Histogram for Frequency")

# Construct Histogram for 'Recency'
axis[2].hist(scaled_df['Recency'], bins = 'sturges', facecolor = 'purple', edgecolor = 'black')
axis[2].set_title("Histogram for Recency")

plt.show()

You may note that all the columns now have the same mean (`0`) and variance (`1`). Now our DataFrame is ready for K-Means clustering.
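
Histograms show the shape of each column, but the mean and standard deviation are easier to confirm numerically. The sketch below does this on a small made-up table (`toy_df` is hypothetical) and also compares the scaler's output with the formula $z = (x - \text{mean}) / \text{std}$, assuming the population standard deviation (`ddof = 0`), which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM-like values on very different scales.
toy_df = pd.DataFrame({
    'Monetary': [100.0, 250.0, 400.0, 50.0, 700.0],
    'Frequency': [5.0, 20.0, 35.0, 2.0, 60.0],
    'Recency': [10.0, 40.0, 3.0, 200.0, 1.0],
})

scaler = StandardScaler()
toy_scaled = pd.DataFrame(scaler.fit_transform(toy_df), columns = toy_df.columns)

# Manual z-score with the population standard deviation (ddof = 0).
manual = (toy_df - toy_df.mean()) / toy_df.std(ddof = 0)

print(toy_scaled.mean().round(6))            # each column close to 0
print(toy_scaled.std(ddof = 0).round(6))     # each column close to 1
print(np.allclose(toy_scaled, manual))       # scaler matches the manual formula
```

Note that pandas' `std()` defaults to `ddof = 1`, so calling it without arguments on the scaled data gives values slightly above `1`.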

---

#### Activity 3: Applying K-Means Clustering

We start by finding the optimal number of clusters for the K-Means algorithm. We will use the Elbow method.

Recall the steps for Elbow method:
1. Compute K-Means clustering for different values of `K` by varying `K` from `1` to `10` clusters.
2. For each `K`, calculate the total within-cluster sum of squares (WCSS) using the `inertia_` attribute of the `KMeans` object.
3. Plot the curve of WCSS vs the number of clusters `K`.
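
To make the meaning of WCSS concrete, the sketch below recomputes it by hand and compares it with `inertia_`. It runs on synthetic data from `make_blobs` rather than on the retail dataset, and the names `X`, `own_centroids`, and `manual_wcss` are only illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 groups (an assumption for illustration only).
X, _ = make_blobs(n_samples = 300, centers = 3, random_state = 10)

kmeans = KMeans(n_clusters = 3, n_init = 10, random_state = 10)
kmeans.fit(X)

# For every point, look up the centroid of the cluster it was assigned to.
own_centroids = kmeans.cluster_centers_[kmeans.labels_]

# WCSS: sum of squared distances between each point and its own centroid.
manual_wcss = ((X - own_centroids) ** 2).sum()

print(manual_wcss)
print(kmeans.inertia_)
```

The two printed numbers should agree up to floating-point rounding, which is why `inertia_` can be used directly as the WCSS value in the Elbow plot.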


In [None]:
# S3.1: Determine 'K' using Elbow method.

from sklearn.cluster import KMeans
wcss = []

clusters = range(1, 11)
# Initiate a for loop that ranges from 1 to 10.
for k in clusters:
    # Inside for loop, perform K-means clustering for current value of K. Use 'fit()' to train the model.
    kmeans = KMeans(n_clusters = k, random_state = 10)
    kmeans.fit(scaled_df)
    # Find wcss for current K value using 'inertia_' attribute and append it to the empty list.
    wcss.append(kmeans.inertia_)

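# Plot WCSS vs number of clusters.
plt.figure(figsize = (15, 5), dpi = 100)
plt.plot(clusters, wcss, marker = 'o')
plt.title("WCSS vs Number of Clusters")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.xticks(clusters)
plt.show()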

From the above plot, it looks like the decrease in WCSS starts to slow down between `K = 3` and `K = 5`, so you can choose any number of clusters from 3 to 5. Let us use 4 clusters to perform K-Means clustering.

Now, perform K-Means clustering with `n_clusters = 4` parameter and determine the cluster labels. Also, count the number of customers in each cluster.

In [None]:
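# S3.2: Clustering the dataset for K = 4

# Perform K-Means clustering with n_clusters = 4 and random_state = 10
kmeans = KMeans(n_clusters = 4, random_state = 10)

# Fit the model to the scaled_df
kmeans.fit(scaled_df)

# Make a series using predictions by K-Means
cluster_labels = pd.Series(kmeans.predict(scaled_df), name = 'Cluster Label')
cluster_labels.value_counts()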

As you can see here, the data is divided into $4$ clusters labelled from `0` to `3`. For cluster visualisation, we will not plot the normalised DataFrame in the scatter plot, as its axes would not convey any meaningful information.
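
The same concern applies to the cluster centroids, which K-Means reports in standardised units. One way to read them in GBP, purchases, and days is the scaler's `inverse_transform()` method. The sketch below is an optional illustration on made-up data (`toy_df` is hypothetical) and is not a step used in the rest of this lesson.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy RFM values forming two obvious groups.
toy_df = pd.DataFrame({
    'Monetary': [100.0, 120.0, 110.0, 900.0, 950.0, 1000.0],
    'Frequency': [5.0, 6.0, 4.0, 50.0, 55.0, 60.0],
    'Recency': [200.0, 220.0, 210.0, 5.0, 3.0, 8.0],
})

scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy_df)

kmeans = KMeans(n_clusters = 2, n_init = 10, random_state = 10)
labels = kmeans.fit_predict(toy_scaled)

# Map the centroids from standardised units back to the original units.
centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                         columns = toy_df.columns)
print(centroids.round(1))
```

On this toy data, the back-transformed centroids coincide with the per-cluster means of the original columns, which is the same kind of summary a `groupby()` on the cluster label produces.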

Let's make a new DataFrame `km_df` by concatenating `m_clean_df` and `cluster_labels` using the `concat()` function of the `pandas` module.

In [None]:
# S3.4: Create a DataFrame with cluster labels for cluster visualisation
km_df = pd.concat([m_clean_df, cluster_labels.rename('Cluster Label')], axis = 1)
km_df.head()

Now that the cluster labels for all the data points are obtained, let's display the clusters using the `scatter_3d()` function from the `plotly.express` module.

In [None]:
# S3.5: Visualising the clusters for customer segmentation
import plotly.express as px
plotly_fig = px.scatter_3d(km_df, x = 'Monetary', y = 'Frequency', z = 'Recency', color = 'Cluster Label')
plotly_fig.show()

**Summarising clusters:**

Let us calculate the mean recency, frequency, and monetary values of all the clusters by applying `agg()` function on `km_df` DataFrame.

In [None]:
# S3.6: Understanding the Cluster Distribution
mean_df = km_df.groupby(['Cluster Label']).agg({'Recency':['mean'],
                                              'Frequency':['mean'],
                                              'Monetary':['mean','count']}).round(0)
mean_df

The above DataFrame summarises each cluster. Let us understand what each cluster represents.



- The <b><font color = blue>first cluster</font></b> (label 3) belongs to the "Promising Customers" segment as:
    - They purchased recently (`R = 52 days`).
    - Average purchase frequency is very low (`F = 39 purchases`).
    - They spend little (`M = 573 GBP`).

- The <b><font color= purple>second cluster</font></b> (label 0) belongs to the "Almost Lost Customers" segment as:
    - Their last purchase was long ago (`R = 255 days`).
    - Average purchase frequency is very low (`F = 24 purchases`).
    - They spend very little (`M = 380 GBP`).

- The "best customers" are in the <b><font color = orange>third cluster</font></b> (label 2) and the <b><mark>fourth cluster</mark></b> (label 1). They spent the greatest amount of money, made many purchases, and made their last purchase only a few days ago.

Hence, we can see that using K-Means clustering we divided customers into clusters. Customers in each cluster have similar buying behaviours, so we can use them to personalise marketing offers.

But there are certain challenges with K-means. Let us discuss them one by one.

1. **Clustering outliers:** Cluster centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Hence, outliers have to be removed, as K-Means clustering is highly sensitive to them.

2. **`K` has to be chosen manually:** If the number of clusters is unknown, you have to use the "WCSS vs Clusters" plot to find a suitable value of `K`.

3. **Cluster shape:** The K-Means algorithm is good at capturing the structure of the data if the clusters have a spherical-like shape. If the clusters have complicated geometric shapes, K-Means does a poor job of clustering the data.
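
The third limitation can be observed on a standard synthetic example. The sketch below uses `make_moons` from `sklearn.datasets` (two interleaving half-moons, unrelated to the retail dataset) and compares the K-Means labels with the true grouping using the adjusted Rand index. The variable names are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: clearly two groups, but not spherical.
X, true_labels = make_moons(n_samples = 400, noise = 0.05, random_state = 10)

kmeans = KMeans(n_clusters = 2, n_init = 10, random_state = 10)
predicted_labels = kmeans.fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari = adjusted_rand_score(true_labels, predicted_labels)
print(round(ari, 2))
```

A score well below `1.0` indicates that K-Means splits the plane with a straight boundary and cuts across both moons instead of following their curved shapes.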

To handle the above limitations of K-Means, we can use **Agglomerative/Hierarchical Clustering** or **PCA (Principal Component Analysis)**, which we will explore in the upcoming lessons.



---

### **Project**
You can now attempt the **Applied Tech Project 111 - KMeans Clustering V** on your own.

**Applied Tech Project 111 - KMeans Clustering V** : https://colab.research.google.com/drive/1BHChmfN2JTvB-Zh_2RfEgo8_npdqD66c

---