## Overview
<a href="https://archive.ics.uci.edu/ml/datasets/online+retail">Online retail</a> is a transnational dataset which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

## Source
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/online+retail

## Business Goal
To segment the Customers based on RFM so that the company can target its customers efficiently.

## Methodology

1. [Reading and Understanding the Data](#1) <br>
    a. Creating a Data Dictionary
2. [Data Cleaning](#2)
3. [Data Preparation](#3) <br>
    a. Scaling Variables
4. [Model Building](#4) <br>
    a. K-means Clustering <br>
    b. Finding the Optimal K
5. [Final Analysis](#5)

<a id="1"></a> <br>
### 1 : Data Preprocessing

In [None]:
# import required libraries for dataframe and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import plotly as py 
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# import required libraries for clustering
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [None]:
# Reading the data on which analysis needs to be done
retail = pd.read_csv('dataset/OnlineRetail.csv', encoding='ISO-8859-1')
# Display first 10 rows
retail.head(10)

#### Data Dictionary 

Attribute  | Definition    |  Description  | Data Type
------------- | ------------- | ------------- | -------------
InvoiceNo  | Invoice number | A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation. | Nominal
StockCode | Product (item) code | A 5-digit integral number uniquely assigned to each distinct product. | Nominal
Description | Product (item) name | Name of Product | Nominal
Quantity | Quantity | The quantities of each product (item) per transaction | Numeric
InvoiceDate | Invoice Date and time | The day and time when each transaction was generated. | Numeric
UnitPrice | Unit price | Product price per unit in sterling. | Numeric
CustomerID | Customer number | A 5-digit integral number uniquely assigned to each customer. | Nominal
Country | Country name | The name of the country where each customer resides. | Nominal

### Missing Values

In [None]:
missing_values = retail.isnull().sum()

# Filter columns with missing values (optional, can remove if you want all columns)
missing_values = missing_values[missing_values > 0]

# Plot the missing values
plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar', color='skyblue')

# Add labels and title
plt.title('Missing Values per Column', fontsize=16)
plt.xlabel('Columns', fontsize=12)
plt.ylabel('Number of Missing Values', fontsize=12)

# Show the plot
plt.show()

As the plot shows, most of the missing values are in CustomerID. Because CustomerID identifies who made each purchase, we cannot impute it without making unsupported assumptions about customers, which could introduce inconsistencies and misrepresent the sales data. We therefore drop the rows where it is missing.

In [29]:
# Dropping the null customer id rows
retail_cleaned = retail.copy()
retail_cleaned = retail_cleaned.dropna(subset=['CustomerID'])

Next, we try to fill in missing item descriptions by looking up the description recorded for the same Stock Code in other rows.

In [30]:
stock_code_to_description = retail_cleaned.dropna(subset=['Description']).set_index('StockCode')['Description'].to_dict()

# Fill missing descriptions using the dictionary
retail_cleaned['Description'] = retail_cleaned.apply(
    lambda row: stock_code_to_description.get(row['StockCode']) if pd.isnull(row['Description']) else row['Description'],
    axis=1
)

# Drop rows where the description is still missing (reassign so later cells use the cleaned data)
retail_cleaned = retail_cleaned.dropna(subset=['Description'])
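
The row-wise `apply` above can also be written as a vectorized lookup. The following is a minimal sketch on made-up rows (the `toy` frame and its values are hypothetical, not taken from the dataset). It keeps the first description seen per stock code, which is a choice made for this sketch rather than something the original cell specifies.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from the real dataset)
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "B2", "C3"],
    "Description": ["RED MUG", None, "BLUE PLATE", None],
})

# One known description per stock code, built from the non-missing rows
known = toy.dropna(subset=["Description"]).drop_duplicates("StockCode")
code_to_desc = known.set_index("StockCode")["Description"]

# Fill gaps by looking up each row's stock code;
# codes with no known description (here C3) stay missing
toy["Description"] = toy["Description"].fillna(toy["StockCode"].map(code_to_desc))

# Drop rows that still have no description, and keep the result
toy = toy.dropna(subset=["Description"]).reset_index(drop=True)
print(toy)
```

On large frames this form is usually much faster than `apply(..., axis=1)`, because the lookup runs column-wise instead of calling a Python function once per row.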

Check that no missing values remain.

In [None]:
retail_cleaned.info()

In [None]:
# DF Description
retail.describe()

<a id="2"></a> <br>
### 2 : Data Cleaning

In [None]:
# Calculating the Missing Values % contribution in DF
df_null = round(100*(retail.isnull().sum())/len(retail), 2)
df_null

In [None]:
# Dropping rows having missing values
retail = retail.dropna()
retail.shape

In [None]:
# Changing the datatype of Customer Id as per Business understanding
retail['CustomerID'] = retail['CustomerID'].astype(str)
retail.info()

<a id="3"></a> <br>
### 3 : Data Preparation

#### Customers will be analyzed based on 3 factors:
- R (Recency): Number of days since last purchase
- F (Frequency): Number of transactions
- M (Monetary): Total amount of transactions (revenue contributed)
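
As a small worked example of these three definitions, the sketch below computes R, F, and M for two made-up customers. The transactions, the `Amount` column, and the `snapshot` variable are assumptions introduced for illustration; recency is measured from the most recent transaction in the toy data.

```python
import pandas as pd

# Hypothetical transactions for two customers (illustrative values only)
tx = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2"],
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-05", "2011-11-25"]
    ),
    "Quantity": [2, 1, 3, 10],
    "UnitPrice": [5.0, 10.0, 2.0, 1.5],
})

# Revenue per line item
tx["Amount"] = tx["Quantity"] * tx["UnitPrice"]

# Reference date: the most recent transaction in the data
snapshot = tx["InvoiceDate"].max()

rfm_toy = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # distinct invoices
    Monetary=("Amount", "sum"),                                    # total revenue
).reset_index()
print(rfm_toy)
```

Customer C1 has two invoices totalling 26.0 and bought on the snapshot date (recency 0); customer C2 has one invoice of 15.0 placed 10 days earlier.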

In [None]:
# Step 1: Parse InvoiceDate as datetime (values that do not match the format become NaT)
retail_cleaned['InvoiceDate'] = pd.to_datetime(retail_cleaned['InvoiceDate'], format="%d-%m-%Y %H:%M", errors='coerce')

# Step 2: Compute the maximum date to determine the last transaction date
last_transaction_date = retail_cleaned['InvoiceDate'].max()
print(f"Last transaction date: {last_transaction_date}")

# Step 3: Compute Recency, Frequency, and Monetary for each customer
rfm = retail_cleaned.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (last_transaction_date - x.max()).days,  # Recency: Days since last purchase
    'InvoiceNo': 'nunique',                                           # Frequency: Number of unique transactions
    'UnitPrice': lambda x: np.sum(x * retail_cleaned.loc[x.index, 'Quantity'])    # Monetary: Total spending per customer
}).reset_index()

# Step 4: Rename the columns to represent RFM
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']

# Step 5: Display the resulting RFM table
print(rfm)

In [None]:
# Merging the two dfs
retail_final = pd.merge(retail_cleaned, rfm, on='CustomerID', how='inner')  # You can change 'how' to 'outer', 'left', or 'right' as needed

# Display the first few rows of the merged DataFrame
print(retail_final.head())


### Exploratory Data Analysis

In [None]:
# Histogram #1: Number of Transactions per Customer

transactions_per_customer = retail.groupby('CustomerID').size()

plt.figure(figsize=(10, 6))
plt.hist(transactions_per_customer, bins=50, edgecolor='black')

plt.title('Number of Transactions per Customer', fontsize=16)
plt.xlabel('Number of Transactions', fontsize=14)
plt.ylabel('Number of Customers', fontsize=14)

plt.show()

The histogram is right-skewed: most customers have made only a few transactions. Note that this count is the number of invoice line items per customer, not the number of distinct invoices.

In [None]:
# Bar chart: the 15 most frequently purchased items
item_counts = retail['Description'].value_counts().sort_values(ascending=False).iloc[0:15]
plt.figure(figsize=(18,6))
sns.barplot(x=item_counts.index, y=item_counts.values, palette=sns.cubehelix_palette(15))
plt.ylabel("Counts")
plt.title("Which items were bought more often?")
plt.xticks(rotation=90)

The bar chart above shows the 15 most frequently purchased items. Apart from the fairly steep drop from the third to the fourth item, there is no clear pattern that explains why these items are popular, and no obvious relationship among them.

In [None]:
retail['TotalSales'] = retail['Quantity'] * retail['UnitPrice']

# Group by StockCode and calculate the total sales per product
sales_by_stockcode = retail.groupby('StockCode')['TotalSales'].sum()

# Filter out non-positive sales
sales_by_stockcode = sales_by_stockcode[sales_by_stockcode > 0]

# Create the boxplot
plt.figure(figsize=(12, 6))

plt.boxplot(sales_by_stockcode, vert=False, patch_artist=True, boxprops=dict(facecolor='skyblue', color='black'))
plt.title('Boxplot of Total Sales Value by StockCode', fontsize=16)
plt.xlabel('Total Sales Value', fontsize=14)

plt.tight_layout()
plt.show()

The boxplot of "Total Sales Value" reveals a right-skewed distribution. This means there are a few outliers that pull the mean to the right, while the majority of sales values are concentrated on the lower end.

In [None]:
# Correlation heatmap

# Select the relevant numeric columns
data = retail_final[['Quantity', 'UnitPrice', 'Recency', 'Frequency', 'Monetary']]

# Calculate the correlation matrix
correlation_matrix = data.corr()

# Set the size of the heatmap
plt.figure(figsize=(8, 6))

# Create the heatmap using Seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, square=True, linewidths=0.5)

# Add titles and labels
plt.title('Correlation Heatmap of Quantity, UnitPrice, and RFM Features', fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(rotation=45, fontsize=12)

# Show the heatmap
plt.tight_layout()
plt.show()

In the heatmap, Frequency and Monetary have the strongest correlation (0.59, a moderate positive relationship), which suggests that customers who order more often also tend to spend more in total. Quantity and Monetary have a correlation of only 0.036, so the quantity on an individual line item says almost nothing about a customer's total spend. Recency and Frequency have a weak negative correlation of -0.24: customers who purchase more frequently tend to have purchased more recently.

#### Rescaling the Attributes

It is extremely important to rescale the variables so that they have a comparable scale.<br>
There are two common ways of rescaling:

1. Min-Max scaling 
2. Standardization (mean 0, standard deviation 1)

Here we use Min-Max scaling.
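
To make the difference between the two rescaling methods concrete, here is a minimal sketch that applies both formulas by hand to a made-up array (the values in `x` are illustrative only). The standardization line uses the population standard deviation, which is what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Illustrative values with one large entry
x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# Min-max scaling: map the smallest value to 0 and the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the (population) standard deviation
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std)
```

Min-max scaling fixes the range of the output, while standardization fixes its mean and spread; neither changes the shape of the distribution.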

In [17]:
# Rescaling the attributes
from sklearn.preprocessing import MinMaxScaler
features = ['Monetary', 'Frequency', 'Recency']  
X = retail_final[features]


## <span style="color: red;">Execute MinMax Scaling in the next box</span> 

In [18]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled)
X_scaled.columns = features

In [None]:
# Scatter plots before and after scaling
plt.figure(figsize=(12, 6))

# Scatter plot for original data
plt.subplot(1, 2, 1)
plt.scatter(retail_final['Frequency'], retail_final['Monetary'], alpha=0.5)
plt.title('Original RFM Data')
plt.xlabel('Frequency')
plt.ylabel('Amount')
plt.grid(True)

# Scatter plot for scaled data
plt.subplot(1, 2, 2)
plt.scatter(X_scaled['Frequency'], X_scaled['Monetary'], alpha=0.5, color='orange')
plt.title('Scaled RFM Data')
plt.xlabel('Frequency (Scaled)')
plt.ylabel('Amount (Scaled)')
plt.grid(True)

plt.tight_layout()
plt.show()


We apply min-max scaling so that Monetary, Frequency, and Recency all lie in the range [0, 1] before running K-Means. K-Means assigns points by Euclidean distance, so without rescaling the feature with the largest numeric range would dominate the distance calculation; after rescaling, each feature contributes on a comparable scale. Note that min-max scaling does not reduce the influence of outliers: a single extreme value compresses the remaining values into a narrow part of the range.
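
To illustrate why rescaling matters for Euclidean distance, the sketch below compares two made-up customers before and after min-max scaling. The customer values and the feature ranges in `mins` and `maxs` are assumptions chosen for illustration, not statistics from the dataset.

```python
import numpy as np

# Hypothetical customers: [Monetary, Frequency, Recency]
a = np.array([500.0, 2.0, 10.0])
b = np.array([4500.0, 40.0, 300.0])

# Distance on the raw features
raw_dist = np.linalg.norm(a - b)

# Assumed feature ranges used for min-max scaling
mins = np.array([0.0, 1.0, 0.0])
maxs = np.array([10000.0, 50.0, 365.0])
a_s = (a - mins) / (maxs - mins)
b_s = (b - mins) / (maxs - mins)
scaled_dist = np.linalg.norm(a_s - b_s)

# Share of the squared distance contributed by each feature
raw_share = (a - b) ** 2 / raw_dist ** 2
scaled_share = (a_s - b_s) ** 2 / scaled_dist ** 2

print(raw_share.round(3))     # Monetary accounts for almost all of the raw distance
print(scaled_share.round(3))  # after scaling, all three features contribute
```

On the raw values the Monetary difference swamps the other two features, even though the customers also differ substantially in Frequency and Recency.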

<a id="4"></a> <br>
### 4 : Building the Model

### K-Means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.<br>

The algorithm works as follows:

- First we initialize k points, called means, randomly.
- We categorize each item to its closest mean and we update the mean’s coordinates, which are the averages of the items categorized in that mean so far.
- We repeat the process for a given number of iterations and at the end, we have our clusters.
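
The steps above can be sketched in a few lines of NumPy. This is an illustrative batch version (often called Lloyd's algorithm) that reassigns all points and then recomputes all means in each iteration; it is not the implementation scikit-learn uses, and the function name `lloyd_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize the k means by picking k distinct data points at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest mean (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each mean to the average of its assigned points;
        # a mean with no assigned points keeps its previous position
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the means have stopped moving
        centers = new_centers
    return labels, centers

# Two well-separated toy groups
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = lloyd_kmeans(X_toy, k=2)
print(labels)
print(centers)
```

The sketch stops early once the means no longer move, which is a common convergence check in addition to the fixed iteration limit.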

In [None]:
# k-means with some arbitrary k
Kmeans = KMeans(n_clusters=4, max_iter=50)
Kmeans.fit(X_scaled)

## <span style="color: red;">Finding the Optimal Number of Clusters</span> 

#### Elbow Curve to get the right number of Clusters
A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
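
For reference, the quantity plotted on an elbow curve is the within-cluster sum of squares (WCSS), which scikit-learn exposes as the `inertia_` attribute of a fitted `KMeans` model. Using notation introduced here for illustration ($C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is that cluster's centroid):

$$
\mathrm{WCSS}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

WCSS tends to keep falling as $k$ grows, so the curve has no minimum to look for; the elbow is the value of $k$ after which adding another cluster yields only a small further drop.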

In [None]:
wcss = []

for i in range (1,8):
    kmeans = KMeans(n_clusters=i, init = 'k-means++', max_iter=300, n_init=7, random_state=0)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

plt.plot(range(1,8), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Based on the elbow in the curve above, we take k = 3 as the number of clusters.

In [22]:
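# Fit k-means with n clusters and return the cluster labels and centroids
def K_Mean(X, n):
    # Min-max scale the features (effectively a no-op if X is already scaled to [0, 1])
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)
    # Fix random_state so the cluster assignment is reproducible
    model = KMeans(n_clusters=n, random_state=0)
    model.fit(X)
    clusters = model.predict(X)
    cent = model.cluster_centers_
    return (clusters, cent)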

In [None]:
clusters, cent = K_Mean(X_scaled, 3)
# Attach each row's cluster label as a new 'kmeans' column
retail_final['kmeans'] = clusters
retail_final.head(5)

In [None]:
# Plot your clusters
def Plot3dClustering(n, X, type_c):
    data = []
    clusters = []
    colors = ['rgb(228,26,28)', 'rgb(55,126,184)', 'rgb(77,175,74)']
    
    for i in range(n):
        name = i
        color = colors[i]
        x = X[X[type_c] == i]['Monetary']
        y = X[X[type_c] == i]['Frequency']
        z = X[X[type_c] == i]['Recency']
        
        trace = dict(
            name = name,
            x = x, y = y, z = z,
            type = 'scatter3d',
            mode = 'markers',
            marker = dict(size = 3, color = color, line = dict(width = 0)))
        data.append(trace)
        
        cluster = dict(
            color = color,
            opacity = 0.1,
            type = 'mesh3d',
            alphahull = 7,
            name = "y",
            x = x, y = y, z = z)
        data.append(cluster)
        
    layout = dict(
        width = 800,
        height = 550,
        autosize = False,
        title = '3D Clustering Plot',
        scene = dict(
            xaxis = dict(
                gridcolor = 'rgb(255, 255, 255)',
                zerolinecolor = 'rgb(255, 255, 255)',
                showbackground = True,
                title='Amount',
                backgroundcolor = 'rgb(230, 230, 230)',
                ),
            yaxis = dict(
                gridcolor = 'rgb(255, 255, 255)',
                zerolinecolor = 'rgb(255, 255, 255)',
                showbackground = True,
                title='Frequency',
                backgroundcolor = 'rgb(230, 230, 230)',
                ),
            zaxis = dict(
                gridcolor = 'rgb(255, 255, 255)',
                zerolinecolor = 'rgb(255, 255, 255)',
                showbackground = True,
                title='Recency',
                backgroundcolor = 'rgb(230, 230, 230)',
                ),
            aspectratio = dict(x=1, y=1, z=0.7),
            aspectmode = 'manual'
        ),
    )
        
    fig = dict(data = data, layout = layout)
    iplot(fig, filename='3d-scatter-colorscale', validate=False)

Plot3dClustering(n=3, X=retail_final, type_c='kmeans')

## Silhouette Score and Cluster Evaluation
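
As a starting point for this section, the sketch below computes the mean silhouette coefficient by hand on made-up points. For each point, `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the nearest other cluster; the score is `(b - a) / max(a, b)`. The helper `silhouette_mean` and the toy data are hypothetical; the `silhouette_score` function imported from scikit-learn at the top of the notebook computes the same quantity.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distance matrix (fine for small data; O(n^2) memory)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster gets score 0 by convention
        # a: mean distance to the other members of the same cluster
        a = dists[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
good = silhouette_mean(X_toy, [0, 0, 1, 1])  # labels match the two groups
bad = silhouette_mean(X_toy, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 or below indicate overlapping or misassigned points. Because the pairwise distance matrix grows quadratically, on a large table it is common to evaluate the score on a random sample of rows.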

<a id="5"></a> <br>
### 5 : Final Analysis

## <span style="color: red;">Findings</span> 

#### Student Name: