## Overview

<a href="https://archive.ics.uci.edu/ml/datasets/online+retail">Online Retail</a> is a transnational dataset that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

## Source

UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/online+retail

## Business Goal

To segment customers by RFM (Recency, Frequency, Monetary) so that the company can target them efficiently.


## Methodology

1. [Reading and Understanding the Data](#1) <br>
   a. Creating a Data Dictionary
2. [Data Cleaning](#2)
3. [Data Preparation](#3) <br>
   a. Scaling Variables
4. [Model Building](#4) <br>
   a. K-means Clustering <br>
   b. Finding the Optimal K
5. [Final Analysis](#5)


<a id="1"></a> <br>

### 1 : Data Preprocessing


In [None]:
# import required libraries for dataframe and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import plotly as py 
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# import required libraries for clustering
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [None]:
# Reading the data on which analysis needs to be done
retail = pd.read_csv('dataset/OnlineRetail.csv', encoding='utf-8', encoding_errors='ignore')
# Display first 10 rows
retail.head(10)

#### Data Dictionary

| Variable     | Definition            | Description                                                                                                                        | Data Type |
| ------------ | --------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | --------- |
| InvoiceNo    | Invoice number        | A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation. | Nominal   |
| StockCode    | Product (item) code   | A 5-digit integral number uniquely assigned to each distinct product.                                                              | Nominal   |
| Description  | Product (item) name   | Name of Product                                                                                                                    | Nominal   |
| Quantity     | Quantity              | The quantities of each product (item) per transaction                                                                              | Numeric   |
| InvoiceDate  | Invoice Date and time | The day and time when each transaction was generated.                                                                              | Numeric   |
| UnitPrice    | Unit price            | Product price per unit in sterling.                                                                                                | Numeric   |
| CustomerID   | Customer number       | A 5-digit integral number uniquely assigned to each customer.                                                                      | Nominal   |
| Country      | Country name          | The name of the country where each customer resides.                                                                               | Nominal   |


#### Missing Values


In [None]:
missing_values = retail.isnull().sum()

# Filter columns with missing values (optional, can remove if you want all columns)
missing_values = missing_values[missing_values > 0]

# Plot the missing values
plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar', color='skyblue')

# Add labels and title
plt.title('Missing Values per Column', fontsize=16)
plt.xlabel('Columns', fontsize=12)
plt.ylabel('Number of Missing Values', fontsize=12)

# Show the plot
plt.show()

As the plot shows, most of the missing values are in CustomerID. Because CustomerID identifies the customer behind each transaction, we cannot impute it without making unjustified assumptions about who bought what, which could introduce inconsistencies and misrepresent the sales data. We therefore drop the rows where it is missing.


In [91]:
# Dropping the null customer id rows
retail_cleaned = retail.copy()
retail_cleaned = retail_cleaned.dropna(subset=['CustomerID'])

Next, we use StockCode to recover missing item descriptions: wherever a Description is missing, we fill it with the description recorded for the same StockCode elsewhere in the data.


In [92]:
stock_code_to_description = retail_cleaned.dropna(subset=['Description']).set_index('StockCode')['Description'].to_dict()

# Fill missing descriptions using the dictionary
retail_cleaned['Description'] = retail_cleaned.apply(
    lambda row: stock_code_to_description.get(row['StockCode']) if pd.isnull(row['Description']) else row['Description'],
    axis=1
)

# Drop rows where the description is still missing
# (assign back to retail_cleaned, which is the frame used in all later steps)
retail_cleaned = retail_cleaned.dropna(subset=['Description'])

Verify that no missing values remain.


In [None]:
retail_cleaned.info()

<a id="2"></a> <br>

### 2 : Data Cleaning


In [None]:
# Calculating the Missing Values % contribution in DF
total_rows = len(retail)

# Calculate the number of missing values per column
missing_values = retail.isnull().sum()

# Calculate the percentage of missing values per column
missing_percentage = (missing_values / total_rows) * 100

# Filter only columns with missing values (optional, can remove if you want all columns)
missing_percentage = missing_percentage[missing_percentage > 0]

# Display the missing percentage for each column
missing_percentage = missing_percentage.sort_values(ascending=False)
print(missing_percentage)


In [95]:
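# Changing the datatype of Customer Id as per Business understanding
# (customer IDs are labels, not quantities, so they are stored as strings)
retail_cleaned['CustomerID'] = retail_cleaned['CustomerID'].astype(str)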


<a id="3"></a> <br>

### 3 : Data Preparation


#### Customers will be analyzed based on 3 factors:

- R (Recency): Number of days since last purchase
- F (Frequency): Number of transactions
- M (Monetary): Total amount of transactions (revenue contributed)
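
To make the three definitions concrete, here is a minimal sketch that computes R, F, and M on a tiny made-up table. The `toy` DataFrame and its values are hypothetical and only reuse the column names from the data dictionary. As an assumption, recency is measured against the latest invoice date in the data.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the Online Retail data)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-10", "2011-01-03", "2011-01-05"]
    ),
    "Quantity": [2, 1, 3, 5, 1],
    "UnitPrice": [1.5, 4.0, 2.0, 1.0, 10.0],
})

# Revenue of each line item
toy["Amount"] = toy["Quantity"] * toy["UnitPrice"]

# Reference date for recency: the latest invoice date in the data (assumption)
snapshot = toy["InvoiceDate"].max()

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # distinct invoices
    Monetary=("Amount", "sum"),                                    # total revenue
).reset_index()

print(toy_rfm)
```

Customer 1 bought most recently (recency 0) and customer 2 five days earlier. Both have two distinct invoices, even though customer 1 has three line items.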


In [None]:
# Step 1: Parse InvoiceDate as datetime (values that do not match the format become NaT)
retail_cleaned['InvoiceDate'] = pd.to_datetime(retail_cleaned['InvoiceDate'], format="%d-%m-%Y %H:%M", errors='coerce')

# Step 2: Compute the maximum date to determine the last transaction date
last_transaction_date = retail_cleaned['InvoiceDate'].max()
print(f"Last transaction date: {last_transaction_date}")

# Step 3: Compute Recency, Frequency, and Monetary for each customer
rfm = retail_cleaned.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (last_transaction_date - x.max()).days,  # Recency: Days since last purchase
    'InvoiceNo': 'nunique',                                           # Frequency: Number of unique transactions
    'UnitPrice': lambda x: np.sum(x * retail_cleaned.loc[x.index, 'Quantity'])    # Monetary: Total spending per customer
}).reset_index()

# Step 4: Rename the columns to represent RFM
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']

# Step 5: Display the resulting RFM table
print(rfm)

In [None]:
# Merging the two dfs
retail_final = pd.merge(retail_cleaned, rfm, on='CustomerID', how='inner')  # You can change 'how' to 'outer', 'left', or 'right' as needed

# Display the first few rows of the merged DataFrame
print(retail_final.head())


#### Rescaling the Attributes

Rescaling is important here because clustering relies on distances, so a variable with a larger range would otherwise dominate the result.<br>
There are two common ways of rescaling:

1. Min-Max scaling (maps each variable to the range [0, 1])
2. Standardization (mean 0, standard deviation 1)

Here we apply Min-Max scaling.
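
The difference between the two options can be seen in a small sketch that writes the formulas out directly in NumPy. The array `X_toy` and its values are hypothetical. The standardization line uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical RFM-like values; columns: Monetary, Frequency, Recency
X_toy = np.array([
    [100.0, 1.0, 30.0],
    [500.0, 5.0, 10.0],
    [900.0, 9.0, 0.0],
])

# Min-Max scaling: (x - min) / (max - min), column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_minmax = (X_toy - col_min) / (col_max - col_min)

# Standardization: (x - mean) / std, column by column
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_minmax)
print(X_std)
```

After Min-Max scaling every column spans exactly [0, 1]. After standardization every column has mean 0 and standard deviation 1, but its range is not bounded.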


In [98]:
# Rescaling the attributes
from sklearn.preprocessing import MinMaxScaler
features = ['Monetary', 'Frequency', 'Recency']  
X = retail_final[features]


## <span style="color: red;">Execute MinMax Scaling in the next box</span>


In [99]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled)
X_scaled.columns = features

<a id="4"></a> <br>

### 4 : Clustering Analysis


#### K-Means Clustering


K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.<br>

The algorithm works as follows:

- First we initialize k points, called means, randomly.
- We categorize each item to its closest mean and we update the mean’s coordinates, which are the averages of the items categorized in that mean so far.
- We repeat the process for a given number of iterations and at the end, we have our clusters.
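
The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the scikit-learn implementation used in the rest of the notebook. It uses the batch variant, which recomputes every mean after all points have been assigned. The function name `kmeans_sketch` and the data in `X_demo` are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=50, seed=0):
    """Illustrative batch k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize the k means with k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()

    for _ in range(n_iter):
        # Distance from every point to every mean, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Move each mean to the average of the points assigned to it
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave a mean in place if it has no points
                new_centers[j] = members.mean(axis=0)

        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break

    # Final assignment against the final means
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

# Two well-separated groups of made-up points
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
labels, centers = kmeans_sketch(X_demo, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.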


In [None]:
# k-means with some arbitrary k (random_state fixed so the run is reproducible)
kmeans = KMeans(n_clusters=4, max_iter=50, n_init=10, random_state=0)
kmeans.fit(X_scaled)

#### Optimal Number of Clusters

In [None]:
wcss = []

for i in range(1, 8):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=7, random_state=0)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

plt.plot(range(1,8), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [102]:
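# create a K_means function here
def K_Mean(X, n):
    """Min-max scale X, fit k-means with n clusters, and return (labels, centroids)."""
    # Scaling already-scaled data again is harmless: min-max scaling leaves [0, 1] data unchanged
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)
    # Fixed random_state keeps the cluster assignments reproducible between runs
    model = KMeans(n_clusters=n, n_init=10, random_state=0)
    clusters = model.fit_predict(X)
    cent = model.cluster_centers_
    return (clusters, cent)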

In [None]:
# Cluster the scaled RFM features into 3 groups
clusters, cent = K_Mean(X_scaled, 3)
# X_scaled keeps the row order of retail_final, so the labels can be assigned directly
retail_final['kmeans'] = clusters
retail_final.head(5)

In [None]:
#plot your clusters
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Use the cluster column to color the points
scatter = ax.scatter(retail_final[features[0]], retail_final[features[1]], retail_final[features[2]], 
                     c=retail_final['kmeans'], cmap='viridis', marker='o', edgecolor='k')

# Set labels for the axes
ax.set_xlabel(features[0])
ax.set_ylabel(features[1])
ax.set_zlabel(features[2])

# Add a legend that maps each color to its cluster label
legend1 = ax.legend(*scatter.legend_elements(), title="Clusters")
ax.add_artist(legend1)

# Add a title
plt.title('3D Scatter Plot of K-means Clusters')

# Show the plot
plt.show()

## <span style="color: red;">Box Plots of Clusters created</span>


<a id="5"></a> <br>

### 5 : Final Analysis


## <span style="color: red;">Findings</span>


#### Student Name:
