<a href="https://colab.research.google.com/github/sreeproject/AI-/blob/main/ShopperSample_ML_Submission_Template_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Shopper Spectrum: Customer Segmentation and Product Recommendations in E-Commerce





**Project Type**    - Clustering/Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

The global e-commerce industry generates vast amounts of transaction data daily, offering valuable insights into customer purchasing behaviors. Analyzing this data is essential for identifying meaningful customer segments and recommending relevant products to enhance customer experience and drive business growth. This project aims to examine transaction data from an online retail business to uncover patterns in customer purchase behavior, segment customers based on Recency, Frequency, and Monetary (RFM) analysis, and develop a product recommendation system using collaborative filtering techniques.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The global e-commerce industry produces massive transaction data daily, revealing key insights into customer purchasing behavior. This project analyzes online retail data to identify customer segments using Recency, Frequency, and Monetary (RFM) analysis. It also implements a collaborative filtering-based product recommendation system to enhance customer experience and boost business growth.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
data_df=pd.read_csv("online_retail.csv")
data_df.head()


### Dataset First View

In [None]:
data_df

### Dataset Rows & Columns count

In [None]:
data_df.shape

### Dataset Information

In [None]:
data_df.columns

In [None]:
data_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data_df.duplicated().sum()

There are 5,268 duplicate rows in the dataset.


In [None]:
#Drop duplicate values
data_df.drop_duplicates(inplace=True)

In [None]:
data_df.shape

The row count drops after the duplicate rows are removed.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data_df.isnull().sum()

Missing values are present in CustomerID (135,037) and Description (1,454).

In [None]:
# Visualizing the missing values
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Count missing values
missing_counts = data_df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]  # Filter only columns with missing values

# Plot bar chart
plt.figure(figsize=(8, 5))
missing_counts.sort_values().plot(kind='barh', color='salmon')
plt.title('Count of Missing Values by Column')
plt.xlabel('Number of Missing Values')
plt.ylabel('Column')
plt.tight_layout()
plt.show()

In [None]:
#Handle Missing Values

# Drop rows with missing CustomerID
data_df.dropna(subset=['CustomerID'], inplace=True)

# Optional: Fill missing 'Description' with placeholder
data_df['Description'] =data_df['Description'].fillna('Unknown')

Rows with a missing CustomerID are removed, and missing Description values are filled with the placeholder 'Unknown' using fillna('Unknown').

In [None]:
data_df.shape

The dataset shape changes after the rows with a missing CustomerID are dropped.

### What did you know about your dataset?

The dataset contains 8 fields: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. Five of them are object type, one is int type, and two are float type. After removing rows with missing values, duplicate entries, and unusual entries, the dataset has more than 3 lakh (300,000) rows and 8 fields.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data_df.columns

In [None]:
data_df.dtypes

In [None]:
# Dataset Describe
data_df.describe()

### Variables Description

InvoiceNo - Transaction number

StockCode - Unique product/item code

Description - Name of the product

Quantity -  Number of products purchased

InvoiceDate - Date and time of transaction (2022–2023)

UnitPrice - Price per product

CustomerID - Unique identifier for each customer

Country - Country where the customer is based


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data_df.nunique()

# Unusual records.

In [None]:
# Non-positive quantities or unit prices (possible returns or data errors)
unusual = data_df[(data_df['Quantity'] <= 0) | (data_df['UnitPrice'] <= 0)]
print(unusual.head(10))

In [None]:
# Keep only rows with strictly positive quantities and unit prices
data_df = data_df[(data_df['Quantity'] > 0) & (data_df['UnitPrice'] > 0)]

In [None]:
data_df.shape

In [None]:
data_df.head()

In [None]:
data_df.shape

Exclude cancelled invoices (InvoiceNo starting with 'C')

In [None]:
data_df['InvoiceNo'].astype(str).str.startswith('C').sum()

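No InvoiceNo values start with 'C'. Cancelled invoices typically carry negative quantities, so the earlier Quantity > 0 filter has probably removed them already.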

In [None]:
# Remove cancelled invoices (those with InvoiceNo starting with 'C')
#data_df = data_df[~data_df['InvoiceNo'].astype(str).str.startswith('C')]

In [None]:
data_df.shape

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

Data wrangling covers data loading, data cleaning, and handling missing and duplicate values.

In [None]:
# Parse InvoiceDate once, then derive the date parts from the parsed values
invoice_dt = pd.to_datetime(data_df['InvoiceDate'])
data_df['InvoiceDay'] = invoice_dt.dt.date
data_df['InvoiceMonth'] = invoice_dt.dt.month
data_df['InvoiceYear'] = invoice_dt.dt.year

In [None]:
data_df['InvoiceHour'] = pd.to_datetime(data_df['InvoiceDate']).dt.hour

In [None]:
# Create a new column for total price
data_df['TotalPrice'] = data_df['Quantity'] * data_df['UnitPrice']


In [None]:
data_df

## EDA

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code


#Analyze transaction volume by country


In [None]:
# Analyze transaction volume by country
transaction_volume = data_df.groupby('Country')['InvoiceNo'].nunique().sort_values(ascending=False)

# Plot the top 20 countries by transaction volume
plt.figure(figsize=(12, 6))
transaction_volume.head(20).plot(kind='bar', color='skyblue')
plt.title('Top 20 Countries by Transaction Volume')
plt.xlabel('Country')
plt.ylabel('Number of Unique Invoices')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Show the top 20 countries in the console
print(transaction_volume.head(20))

 What is/are the insight(s) found from the chart?

The chart highlights the top 20 countries by their number of unique invoice transactions. The United Kingdom leads with an exceptionally high volume, indicating a dominant market presence. Germany, France, and Ireland follow, while countries like Japan and Australia show modest engagement. Overall, the data suggests opportunities for expansion in underrepresented regions.

#### Chart - 2

#Identify top-selling products

In [None]:
# Group by product description and sum the revenue per product
top_products = data_df.groupby('Description')['TotalPrice'].sum().sort_values(ascending=False) #TotalPrice = Quantity * UnitPrice

# Plot the top 15 best-selling products by revenue
plt.figure(figsize=(12, 6))
top_products.head(15).plot(kind='bar', color='orange')
plt.title('Top 15 Best-Selling Products by Revenue')
plt.xlabel('Product Description')
plt.ylabel('Total Sales (£)')
plt.xticks(rotation=75)
plt.tight_layout()
plt.show()

# Display top 15 products in the console
print(top_products.head(15))

 What is/are the insight(s) found from the chart?

The graph shows the top 15 products ranked by revenue, based on total sales in British pounds. "PAPER CRAFT , LITTLE BIRDIE" generates the highest revenue, followed by the "REGENCY CAKESTAND 3 TIER" and "WHITE HANGING HEART T-LIGHT HOLDER." Each product's sales are illustrated with orange bars, where taller bars indicate higher revenue. The chart helps identify the most financially successful items in the product lineup.

#### Chart - 3

#Identify Top 10 Customers

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

data_df['TotalPrice'] = data_df['Quantity'] * data_df['UnitPrice']
top_customers = data_df.groupby('CustomerID')['TotalPrice'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10,6))
# CustomerID is numeric, so cast it to string to get one horizontal bar per customer
sns.barplot(x=top_customers.values, y=top_customers.index.astype(str), palette='viridis')
plt.title("Top 10 Most Buying Customers (by Spend)")
plt.xlabel("Total Spend")
plt.ylabel("CustomerID")
plt.tight_layout()
plt.show()

top_customers.head(10)

The graph ranks the top 10 customers by total spend in GBP, with each bar representing a different CustomerID. The leading customer spent around £280,000, significantly more than the others. Spending decreases steadily down the list, with the tenth customer spending just above £77,000. This visualization helps identify key revenue-driving clients who may benefit from personalized engagement or loyalty initiatives.

#### Chart - 4

#Visualize purchase trends over time


In [None]:
# Chart - 4 visualization code

# Convert InvoiceDate to datetime
data_df['InvoiceDate'] = pd.to_datetime(data_df['InvoiceDate'])

# Extract only the date part (ignore time)
data_df['InvoiceDay'] = data_df['InvoiceDate'].dt.date

# Group by day and sum total price
daily_sales = data_df.groupby('InvoiceDay')['TotalPrice'].sum()

# Plot purchase trend over time
plt.figure(figsize=(14, 6))
daily_sales.plot(color='green')
plt.title('Daily Purchase Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales (£)')
plt.grid(True)
plt.tight_layout()
plt.show()



The graph shows how much money people spent each day in 2023. Sometimes sales went up and down, but overall, they kept growing over time. Toward the end of the year, people spent a lot more—over £175,000 in a day. This likely happened because of sales or holidays when people buy more.

#### Chart - 5

#Inspect monetary distribution per transaction and customer

In [None]:
#Total sales per transaction (InvoiceNo)
invoice_totals = data_df.groupby('InvoiceNo')['TotalPrice'].sum()

plt.figure(figsize=(12, 5))
sns.histplot(invoice_totals, bins=100, kde=True, color='blue')
plt.title('Monetary Distribution per Transaction')
plt.xlabel('Total Value per Invoice (£)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

This graph, a histogram with a kernel density estimate (KDE), shows the distribution of monetary values for each transaction. The vast majority of transactions have a very low total value, as indicated by the tall bar and high peak on the far left. As the transaction value increases, the frequency of those transactions drops off sharply. The long tail to the right shows that while very high-value transactions are rare, they do occur.

#### Chart - 6

#which day of the week had the most product sales (by quantity)

In [None]:
# Extract the weekday name (Monday ... Sunday)
data_df['Weekday'] = data_df['InvoiceDate'].dt.day_name()

# Total quantity sold per day
sales_by_day = data_df.groupby('Weekday')['Quantity'].sum().sort_values(ascending=False)

# Order days properly
ordered_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sales_by_day = sales_by_day.reindex(ordered_days)

plt.figure(figsize=(10,6))
sns.barplot(x=sales_by_day.index, y=sales_by_day.values, palette='mako')
plt.title('Total Product Quantity Sold by Day of the Week')
plt.ylabel('Total Quantity Sold')
plt.xlabel('Day of the Week')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

sales_by_day.head(7)

This chart shows which days of the week had the most product sales. Friday is the busiest day, with over a million items sold. Monday is the slowest, with only about half a million sold. Most other days like Wednesday and Thursday also had high sales, while Sunday isn’t shown at all.

#### Chart - 7

#RFM distributions

In [None]:
# Chart - 7 visualization code

# Set reference date for Recency (usually one day after the last transaction)
reference_date = data_df['InvoiceDate'].max() + pd.Timedelta(days=1)

# Group by Customer and compute RFM
rfm = data_df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,  # Recency
    'InvoiceNo': 'nunique',                                    # Frequency
    'TotalPrice': 'sum'                                        # Monetary
}).reset_index()

# Rename columns
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']

# ---- Visualize RFM Distributions ----
plt.figure(figsize=(16, 4))

# Recency
plt.subplot(1, 3, 1)
sns.histplot(rfm['Recency'], bins=50, kde=True, color='green')
plt.title('Recency Distribution')
plt.xlabel('Days Since Last Purchase')

# Frequency
plt.subplot(1, 3, 2)
sns.histplot(rfm['Frequency'], bins=50, kde=True, color='blue')
plt.title('Frequency Distribution')
plt.xlabel('Number of Transactions')

# Monetary
plt.subplot(1, 3, 3)
sns.histplot(rfm['Monetary'], bins=50, kde=True, color='purple')
plt.title('Monetary Distribution')
plt.xlabel('Total Spend (£)')

plt.tight_layout()
plt.show()


Recency Distribution: This shows how recently customers made their last purchase. Most customers bought something within the past 50 days, and the count falls off toward 400 days. It helps identify how engaged the customer base is; more recent purchases often mean stronger loyalty.

Frequency Distribution: This tracks how often customers make purchases. Most shoppers made fewer than 10 transactions, while very few made over 100. It highlights that the business has many occasional buyers and a small group of repeat customers.

Monetary Distribution: This displays how much money each customer spent in total. Most people spent relatively little, and only a few spent tens of thousands of pounds. It suggests the presence of high-value customers alongside many low-spend ones.

RFM analysis is a marketing technique that evaluates customers based on:

*   Recency – How recently a customer purchased.
*   Frequency – How often they purchased.
*   Monetary value – How much money they spent.

RFM analysis is used to segment customers, understand their value, and build targeted campaigns to improve retention, loyalty, and revenue.

Recency

Definition: Number of days since the customer's last transaction.

Logic: More recent customers are more likely to buy again.

Formula:

Recency = Reference Date − Last Purchase Date




Frequency

Definition: Number of distinct purchases the customer made.

Logic: Frequent customers are more loyal.

Formula:

Frequency = Number of Unique Orders (Invoices)


Monetary

Definition: Total money the customer has spent.

Logic: High-spending customers bring more value.

Formula:

Monetary = ∑ (Quantity × Unit Price)
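To make the three definitions concrete, here is a minimal sketch on a made-up five-row transaction table. The `toy` frame, its invoice numbers, and its dates are illustrative assumptions, not values from the dataset. It follows the same convention as the notebook, with the reference date set one day after the latest invoice.

```python
import pandas as pd

# Hypothetical transactions for two customers (illustrative values only)
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1', 'B1'],
    'InvoiceDate': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-10',
                                   '2023-01-05', '2023-01-05']),
    'Quantity': [2, 1, 3, 5, 1],
    'UnitPrice': [1.5, 4.0, 2.0, 1.0, 10.0],
})
toy['TotalPrice'] = toy['Quantity'] * toy['UnitPrice']

# Reference date: one day after the latest invoice in the toy data
reference_date = toy['InvoiceDate'].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda d: (reference_date - d.max()).days),  # days since last purchase
    Frequency=('InvoiceNo', 'nunique'),                                   # distinct invoices
    Monetary=('TotalPrice', 'sum'),                                       # total spend
)
print(toy_rfm)
```

In this toy table, customer 1 has Recency 1, Frequency 2, and Monetary 13.0, while customer 2 has Recency 6, Frequency 1, and Monetary 15.0.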



 # Handling Outliers

In [None]:
def remove_outliers_iqr(df, columns):
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df

In [None]:
rfm_cleaned = remove_outliers_iqr(rfm, ['Recency', 'Frequency', 'Monetary'])


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 4))
for i, col in enumerate(['Recency', 'Frequency', 'Monetary']):
    plt.subplot(1, 3, i + 1)
    sns.boxplot(y=rfm[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

Use a log transformation to handle these outliers, because the RFM values are highly right-skewed.
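As a rough illustration of why a log transform suits right-skewed values, the sketch below measures skewness before and after `np.log1p`. The `spend` series is synthetic (drawn from a log-normal distribution) and only stands in for a column such as Monetary; the exact numbers depend on the random seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "spend" values; an assumed stand-in for rfm['Monetary']
spend = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=5000))

skew_raw = spend.skew()            # strongly positive for right-skewed data
skew_log = np.log1p(spend).skew()  # much closer to zero after the transform

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness close to zero after the transform means the few very large values no longer dominate the distance calculations that KMeans relies on.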

In [None]:
#Log Transformation

# Select the columns to apply log transformation
rfm_cols= ['Recency', 'Frequency', 'Monetary']

# Create a copy to avoid modifying the original rfm DataFrame directly if needed later
rfm_log = rfm[rfm_cols].copy()

# Apply log transformation using np.log1p (log(1+x) to handle zero values)
rfm_log['Recency'] = np.log1p(rfm_log['Recency'])
rfm_log['Frequency'] = np.log1p(rfm_log['Frequency'])
rfm_log['Monetary'] = np.log1p(rfm_log['Monetary'])

# Display the first few rows of the transformed data to check
print(rfm_log.head())

#Standardize/Normalize the RFM values code

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
rfm_scaled1 = scaler.fit_transform(rfm_log[['Recency', 'Frequency', 'Monetary']])

# Convert back to DataFrame
rfm_scaled_df = pd.DataFrame(rfm_scaled1, columns=['Recency', 'Frequency', 'Monetary'], index=rfm_log.index)

#### Chart - 8

#Elbow curve for cluster selection

In [None]:
# Chart - 8 visualization code
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assuming you already have the RFM dataframe ready
# Columns: Recency, Frequency, Monetary

# Step 1: Normalize RFM values
#scaler = StandardScaler()
#rfm_scaled = scaler.fit_transform(rfm[['Recency', 'Frequency', 'Monetary']])

# Step 2: Calculate WCSS (within-cluster sum of squares) for different k
wcss = []
K = range(1, 11)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(rfm_scaled1)
    wcss.append(kmeans.inertia_)

# Step 3: Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(K, wcss, marker='o')
plt.title('Elbow Curve for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('WCSS')
plt.xticks(K)
plt.grid(True)
plt.show()


The Elbow Curve helps determine the optimal number of clusters (k) for segmentation. As k increases, the Within-Cluster Sum of Squares (WCSS) decreases sharply at first, indicating better fit. Around k=3, the curve begins to flatten, forming the “elbow”—this suggests adding more clusters yields diminishing returns. So, k=3 is a likely sweet spot where you balance granularity with simplicity.
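Reading the elbow by eye can be subjective, so one possible cross-check is to look for the largest second difference of the WCSS values. The sketch below does this on synthetic data from `make_blobs` with three well-separated groups; the data, the `elbow_k` name, and the second-difference heuristic are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (a stand-in for scaled RFM)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)

ks = list(range(1, 9))
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Second difference of WCSS: largest where the curve bends most sharply
second_diff = np.diff(wcss, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]  # +1 because diff index i refers to ks[i + 1]

print("WCSS:", np.round(wcss, 1))
print("Suggested elbow:", elbow_k)
```

On real RFM data the bend is usually less pronounced than in this toy case, so the heuristic is best treated as a second opinion alongside the plot.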

#### Chart - 9

#Customer cluster profiles

In [None]:
# Chart - 9 visualization code
# Apply KMeans with the chosen number of clusters (example: 4)
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled1)
rfm['Cluster']

In [None]:
#Profile Each Cluster
# Calculate average RFM values per cluster
cluster_profile = rfm.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': 'mean',
    'CustomerID': 'count'
}).rename(columns={'CustomerID': 'Num_Customers'}).round(1).reset_index()

print(cluster_profile)

In [None]:
 #2D plot of Frequency vs Recency
plt.figure(figsize=(8, 6))
sns.scatterplot(data=rfm, x='Recency', y='Frequency', hue='Cluster', palette='Set2')
plt.title('Customer Segments: Recency vs Frequency')
plt.grid(True)
plt.show()

This scatter plot segments customers by their recent activity and shopping frequency. Cluster 1 (orange) highlights loyal buyers who shop frequently and have purchased recently. Clusters 2 and 3 (blue and pink) represent moderate or occasional customers with varying engagement. Cluster 0 (green) includes mostly inactive users who haven’t purchased recently and do so infrequently.

#### Chart - 10

#Product recommendation heatmap / similarity matrix


In [None]:
# Chart - 10 visualization code
#Create a Customer-Product Matrix

from sklearn.metrics.pairwise import cosine_similarity

# Create a pivot table of customers and products with quantity as values
product_matrix = data_df.pivot_table(index='CustomerID', columns='Description', values='Quantity', aggfunc='sum', fill_value=0)

In [None]:
#Compute Product Similarity

# Transpose: we want product-by-product similarity
product_similarity = cosine_similarity(product_matrix.T)

# Create a DataFrame for better readability
product_similarity_df = pd.DataFrame(product_similarity,
                                     index=product_matrix.columns,
                                     columns=product_matrix.columns)

In [None]:
#Plot Heatmap

import seaborn as sns
import matplotlib.pyplot as plt

# Select a subset of top products for visualization
top_products = product_matrix.sum().sort_values(ascending=False).head(10).index
subset_sim = product_similarity_df.loc[top_products, top_products]

plt.figure(figsize=(10, 8))
sns.heatmap(subset_sim, cmap='YlGnBu', annot=True, fmt=".2f")
plt.title("Product Similarity Heatmap (Top 10 Products)")
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

This heatmap shows the cosine similarity between the top 10 products by quantity sold, computed from the customer-by-product quantity matrix. Darker blue cells represent high similarity, meaning those items tend to be bought by the same customers. Lighter yellow cells indicate low similarity, meaning those products rarely co-occur in a customer's purchases. Businesses can use these scores to recommend items, optimize product placement, or design bundled offerings.
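A similarity matrix like this can drive item-based recommendations. The following is a minimal sketch on a hypothetical four-customer, four-product quantity matrix; the product names, the `toy_matrix` values, and the `recommend_products` helper are assumptions made for illustration, not functions defined elsewhere in this notebook.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical customer x product quantity matrix (rows are customers)
toy_matrix = pd.DataFrame(
    {'MUG': [2, 1, 0, 0], 'TEAPOT': [1, 2, 0, 0],
     'CANDLE': [0, 0, 3, 1], 'LANTERN': [0, 1, 2, 2]},
    index=[101, 102, 103, 104],
)

# Product-by-product cosine similarity, as in the notebook's similarity step
sim = cosine_similarity(toy_matrix.T)
sim_df = pd.DataFrame(sim, index=toy_matrix.columns, columns=toy_matrix.columns)

def recommend_products(product, sim_df, top_n=2):
    """Return the top_n products most similar to `product`, excluding itself."""
    if product not in sim_df.index:
        raise KeyError(f"Unknown product: {product}")
    scores = sim_df[product].drop(labels=product)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

print(recommend_products('MUG', sim_df))
```

With these toy quantities, MUG is closest to TEAPOT (cosine similarity 0.8) and then LANTERN, so the call returns `['TEAPOT', 'LANTERN']`. Applied to the notebook's data, the same helper could be pointed at `product_similarity_df`.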

####  Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
features = ['Quantity', 'UnitPrice', 'TotalPrice', 'InvoiceNo' ,'CustomerID' ]
correlation_matrix = data_df[features].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

This correlation heatmap shows relationships between variables like Quantity, UnitPrice, TotalPrice, InvoiceNo, and CustomerID. Strong red areas indicate high positive correlations—most notably between Quantity and TotalPrice (0.91), meaning buying more items tends to raise the total price. Blue regions suggest weak or negative correlations, such as InvoiceNo or CustomerID having little relevance to pricing. These insights help pinpoint which features strongly influence revenue and which are mostly identifiers.

####  Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(rfm, vars=['Recency', 'Frequency', 'Monetary'], hue='Cluster', palette='Set2')
plt.suptitle("Pair Plot of RFM Features by Cluster", y=1.02)
plt.show()


This pair plot shows how customers are distributed across clusters based on RFM—Recency, Frequency, and Monetary values. Diagonal panels display the feature distributions per cluster, revealing patterns like Cluster 1 (blue) having low recency and high frequency/spending, indicating loyal customers. Off-diagonal scatter plots show relationships between features, such as how higher frequency often aligns with higher monetary value. The color-coded clusters highlight distinct customer types, making it easier to tailor engagement strategies for each group.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Missing values were handled by removing the 135,037 rows with a missing CustomerID using .dropna(), and by filling the 1,454 missing entries in the Description column with 'Unknown' using .fillna('Unknown').

In [None]:
rfm_scaled_df.head()

### 4. Feature Manipulation & Selection

To support customer segmentation and product recommendation, we engineered several new features. First, we extracted detailed time-based features from the InvoiceDate column, such as year, month, day, and hour.

These features help in analyzing customer behavior trends over different time periods. Next, we created a new feature called TotalPrice by multiplying Quantity and UnitPrice, giving the total transaction value for each order line. This allows us to quantify how much customers spend in each transaction.

This new feature helps in understanding customer spending behavior and plays a crucial role in RFM (Recency, Frequency, Monetary) analysis and revenue-based segmentation.

 # ML Model Implementation

 # Clustering Algorithms

#1.KMeans + Silhouette Score

In [None]:
#KMeans + Silhouette Score

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# rfm_scaled1 holds the standardized, log-transformed RFM data
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(rfm_scaled1)
rfm['Kmean_Cluster+Silhouette Score'] = labels

# Calculate silhouette score
score = silhouette_score(rfm_scaled1, labels)

print(f"Silhouette Score for k=4: {score:.4f}")

In [None]:
#To Find Optimal k Using Silhouette Score

scores = []
k_values = range(2, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(rfm_scaled1)
    score = silhouette_score(rfm_scaled1, labels)
    scores.append(score)
    print(f"k={k}, Silhouette Score={score:.4f}")

In [None]:
#Plot it

plt.plot(k_values, scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for different k')
plt.grid(True)
plt.show()

This graph displays the Silhouette Score for different cluster counts (k), which helps evaluate how well the data points are grouped.
k=3 was the initial choice for clustering. The graph, however, shows that k=2 gives a higher Silhouette Score, suggesting tighter and more distinct clusters. Sticking with k=3 may still be justified if it aligns better with domain-specific needs, such as separating subtle customer segments. The score therefore helps choose the cluster count by balancing simplicity against segmentation detail.
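The selection step can also be done programmatically by taking the k with the highest score. The sketch below shows this on synthetic `make_blobs` data with four well-separated groups; the data and the `best_k` name are assumptions for illustration, and on real RFM data the highest-scoring k is only one input alongside business needs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (a stand-in for scaled RFM)
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # ranges from -1 (poor) to 1 (well separated)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()})
print("Best k by silhouette:", best_k)
```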

#2.KMeans+Elbow Method

In [None]:
#Elbow Method Code

# Inertia values list
inertia = []
K = range(1, 11)  # Trying k from 1 to 10

for k in K:
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(rfm_scaled1)
    inertia.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares

    print(f"k={k}, inertia = {inertia[-1]:.4f}")

In [None]:
rfm.head()

In [None]:
#Plot Elbow Curve
plt.figure(figsize=(8, 4))
plt.plot(K, inertia, marker='o')  # K holds the candidate cluster counts
plt.title('Elbow Method - Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (SSE)')
plt.grid(True)
plt.show()

This graph represents the Elbow Method for selecting the optimal number of clusters (k) in a dataset using k-means. As the number of clusters increases, the inertia (sum of squared errors) decreases, indicating tighter grouping. However, after a certain point—around k=3 or k=4—the rate of improvement drops, forming an “elbow” in the curve. This elbow marks the balance point where adding more clusters yields minimal benefit in reducing inertia. Choosing k at this elbow ensures efficient clustering without unnecessarily increasing model complexity.

# DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

# eps and min_samples are hyperparameters you can tune
dbscan = DBSCAN(eps=0.8, min_samples=5)
rfm['DBSCAN_Cluster'] = dbscan.fit_predict(rfm_scaled1)

print(rfm['DBSCAN_Cluster'].value_counts())

In [None]:
#Visualize DBSCAN Clusters

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(
    x=rfm_scaled1[:, 0], y=rfm_scaled1[:, 1],
    hue=rfm['DBSCAN_Cluster'],
    palette='Set1'
)
plt.title("DBSCAN Clusters")
plt.xlabel("Recency (scaled)")
plt.ylabel("Frequency (scaled)")
plt.show()

This scatter plot displays the DBSCAN clustering results based on scaled Recency and Frequency values. The blue points (Cluster 0) represent the main group of customers with varying purchase frequencies and recent activity. The red points (Cluster -1) are classified as outliers, indicating unusual purchasing patterns compared to the main cluster. This visualization helps identify customer segments as well as exceptions that may require special marketing or investigation.
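DBSCAN results depend strongly on `eps`, and a common way to pick it is to inspect the distance from each point to its `min_samples`-th nearest neighbour. The sketch below applies that idea to synthetic data with three injected outliers. The data, the outlier coordinates, and the choice of the 95th percentile as `eps` are all assumptions for illustration; in practice the value is usually read off a sorted k-distance plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Two dense synthetic groups plus three far-away points that should become noise
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8]],
                  cluster_std=0.6, random_state=0)
outliers = np.array([[20.0, -10.0], [-15.0, 15.0], [25.0, 25.0]])
X = np.vstack([X, outliers])

min_samples = 5
# Distance to the min_samples-th neighbour; the query point itself counts as the first
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Heuristic assumption: use a high percentile of the k-distances as eps
eps = float(np.percentile(k_dist, 95))
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

n_clusters = len(set(labels) - {-1})
print(f"eps={eps:.2f}, clusters={n_clusters}, noise points={(labels == -1).sum()}")
```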

# Hierarchical Clustering

In [None]:
#Compute Linkage and Plot Dendrogram

import scipy.cluster.hierarchy as sch

plt.figure(figsize=(10, 6))
dendrogram = sch.dendrogram(sch.linkage(rfm_scaled1, method='ward'))
plt.title("Customer Dendrogram")
plt.xlabel("Customer Index")
plt.ylabel("Euclidean Distance")
plt.show()

This graph is a dendrogram created from hierarchical clustering of customer data, often used for segmentation analysis.

It visually shows how individual customers are grouped based on similarity, using Euclidean distance as a measure.

As you move up the vertical axis, the lines joining data points represent clusters being merged—shorter merges indicate more similar customers.

The “branches” and colored clusters reveal how customers naturally form distinct groups, which can help in designing targeted marketing strategies.

By choosing a horizontal cut (e.g., where the biggest vertical gaps occur), you can decide how many final clusters to create, such as 3 or 4 meaningful groups.
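The horizontal cut can be expressed in code with SciPy's `fcluster`, which turns a linkage matrix into flat cluster labels. The sketch below uses synthetic `make_blobs` data with three groups as an assumed stand-in for the scaled RFM matrix and asks for at most three clusters.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (a stand-in for scaled RFM)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)

# Ward linkage, the same method used for the dendrogram
Z = sch.linkage(X, method='ward')

# Cut the tree so that at most 3 flat clusters remain (labels start at 1)
labels = sch.fcluster(Z, t=3, criterion='maxclust')

# Heights of the last few merges: a large jump suggests a natural place to cut
last_heights = Z[-4:, 2]
print("Cluster sizes:", np.bincount(labels)[1:])
print("Last merge heights:", np.round(last_heights, 2))
```

Passing `criterion='distance'` with a height threshold instead would correspond to drawing the cut line at a fixed value on the dendrogram's vertical axis.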

In [None]:
#Apply Agglomerative Clustering

from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward') #metric='euclidean' is used for distance calculation.
                                                                               #linkage='ward' only works with Euclidean distance, so make sure metric='euclidean'
rfm['Hierarchical_Cluster'] = hc.fit_predict(rfm_scaled1)

print(rfm['Hierarchical_Cluster'].value_counts())


# Label the clusters by interpreting their RFM averages:

Recency – How recently a customer purchased.

Frequency – How often they purchased.

Monetary value – How much money they spent.

| Cluster profile | Characteristics | Segment Label |
|---|---|---|
| Low Recency (recent), High F, High M | Regular, frequent, recent, and big spenders | High-Value |
| Medium F, Medium M | Steady purchasers but not premium | Regular |
| Low F, Low M, older Recency | Rare, occasional purchases | Occasional |
| High Recency (long ago), Low F, Low M | Haven’t purchased in a long time | At-Risk |
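KMeans cluster numbers are arbitrary, so a hard-coded number-to-label mapping can silently become wrong when the data or the seed changes. One possible alternative, sketched here, derives the labels from the cluster averages themselves. The `profile` values are hypothetical, and the rank-sum score is an assumed heuristic rather than a standard rule.

```python
import pandas as pd

# Hypothetical per-cluster averages (illustrative numbers, not notebook output)
profile = pd.DataFrame(
    {'Recency': [20.0, 250.0, 60.0, 120.0],
     'Frequency': [15.0, 1.2, 4.0, 1.8],
     'Monetary': [6000.0, 300.0, 1500.0, 450.0]},
    index=pd.Index([0, 1, 2, 3], name='Cluster'),
)

# Rank-sum score: recent (low Recency), frequent, high-spend clusters score highest
score = (profile['Recency'].rank(ascending=False)
         + profile['Frequency'].rank()
         + profile['Monetary'].rank())

# Best-scoring cluster gets the first label, and so on down the list
ordered = score.sort_values(ascending=False).index
names = ['High-Value', 'Regular', 'Occasional', 'At-Risk']
segment_labels = {int(cluster): name for cluster, name in zip(ordered, names)}
print(segment_labels)
```

For these toy averages the result is `{0: 'High-Value', 2: 'Regular', 3: 'Occasional', 1: 'At-Risk'}`, and the dictionary can be passed to `.map()` in the same way as a hand-written mapping.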



In [None]:
#Group by Cluster and Calculate Mean RFM Values

rfm_cluster_avg = rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean().round(1)
print(rfm_cluster_avg)

In [None]:
# Original segment labels
segment_labels = {
    0: 'Regular',
    1: 'Occasional',
    2: 'High-Value',
    3: 'At-Risk',
}


# First map the known cluster numbers
rfm['Segment'] = rfm['Cluster'].map(segment_labels)


#KMeans – Most commonly used for RFM segmentation

In [None]:
print(rfm[['CustomerID', 'Cluster', 'Segment']].head(10))

In [None]:
print(rfm['Segment'].value_counts())
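
KMeans cluster numbers are arbitrary and can change when the data or the random seed changes, so a hardcoded mapping from cluster number to segment name may silently mislabel segments. One possible safeguard is to derive the mapping from the cluster averages. The sketch below uses a toy table (`rfm_demo`) and a hypothetical rule (`label_clusters`), both of which are assumptions for illustration and assume exactly four clusters.

```python
import pandas as pd

# Toy per-customer table standing in for `rfm` (values are illustrative only).
rfm_demo = pd.DataFrame({
    "Recency":   [5, 8, 40, 45, 120, 130, 300, 320],
    "Frequency": [30, 28, 10, 9, 2, 3, 1, 1],
    "Monetary":  [9000, 8500, 2000, 1800, 300, 350, 100, 120],
    "Cluster":   [2, 2, 0, 0, 1, 1, 3, 3],
})

def label_clusters(df):
    """Map cluster ids to segment names from mean RFM values (hypothetical rule)."""
    avg = df.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
    labels = {}
    # Highest average spend -> High-Value.
    high_value = avg["Monetary"].idxmax()
    labels[int(high_value)] = "High-Value"
    rest = avg.drop(index=high_value)
    # Longest time since last purchase among the rest -> At-Risk.
    at_risk = rest["Recency"].idxmax()
    labels[int(at_risk)] = "At-Risk"
    rest = rest.drop(index=at_risk)
    # The two remaining clusters are ordered by purchase frequency.
    ordered = rest["Frequency"].sort_values(ascending=False).index
    for cluster_id, name in zip(ordered, ["Regular", "Occasional"]):
        labels[int(cluster_id)] = name
    return labels

segment_labels_demo = label_clusters(rfm_demo)
rfm_demo["Segment"] = rfm_demo["Cluster"].map(segment_labels_demo)
print(segment_labels_demo)
```

The rule itself is a design choice; any ordering of the RFM averages that matches the business definition of each segment would serve the same purpose.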

In [None]:
rfm.head(10)

Low Recency, high Frequency, and high Monetary value characterize the most valuable customers.

In [None]:
# 3D plot comparison: Elbow clusters vs. Silhouette clusters
from mpl_toolkits.mplot3d import Axes3D
# Elbow 3D
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(121, projection='3d')
sc = ax.scatter(rfm['Recency'], rfm['Frequency'], rfm['Monetary'],
                c=rfm['Cluster'], cmap='Set2', s=50)
ax.set_title("Elbow Clusters")
ax.set_xlabel("Recency")
ax.set_ylabel("Frequency")
ax.set_zlabel("Monetary")

# Silhouette 3D
ax = fig.add_subplot(122, projection='3d')
sc = ax.scatter(rfm['Recency'], rfm['Frequency'], rfm['Monetary'],
                c=rfm['Kmean_Cluster+Silhouette Score'], cmap='Set1', s=50)
ax.set_title("Silhouette Clusters")
ax.set_xlabel("Recency")
ax.set_ylabel("Frequency")
ax.set_zlabel("Monetary")

plt.tight_layout()
plt.show()

These 3D scatter plots compare two KMeans segmentations: one with the number of clusters chosen by the Elbow method and one with the number chosen by the Silhouette score. Both group customers by Recency, Frequency, and Monetary values, with each color representing a distinct cluster. Viewing them side by side helps judge how well separated and how meaningful the clusters are for targeted analysis.
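
The two selection criteria can be sketched as follows. This is an illustrative example on synthetic data rather than the notebook's actual run: `X` stands in for the scaled RFM matrix, and the four synthetic groups and the range of `k` values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix (assumption: four separated groups).
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0], [6, 6, 0], [0, 6, 6], [6, 0, 6]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

inertias, silhouettes = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_                          # Elbow: within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, km.labels_)   # Silhouette: cohesion vs. separation

# Silhouette picks the k with the highest score; the Elbow is read off the inertia curve.
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

The Elbow method needs a visual or heuristic judgment of where the inertia curve flattens, while the Silhouette score gives a single number per `k`, which is why the two can disagree on real data.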

In [None]:
rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean()

In [None]:
rfm.groupby(rfm['Kmean_Cluster+Silhouette Score'])[['Recency', 'Frequency', 'Monetary']].mean()

In [None]:
#Plot 3D Clusters

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Use the scaled RFM features for the x, y, z axes
x = rfm_scaled1[:, 0]  # Recency
y = rfm_scaled1[:, 1]  # Frequency
z = rfm_scaled1[:, 2]  # Monetary

# Plot
scatter = ax.scatter(x, y, z, c=rfm['DBSCAN_Cluster'], cmap='Set1', s=60)

# Axis labels
ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')

# Legend
legend1 = ax.legend(*scatter.legend_elements(), title="Clusters")
ax.add_artist(legend1)

plt.title("3D DBSCAN Clusters on RFM")
plt.show()


This 3D plot shows customer segmentation results using DBSCAN applied to RFM (Recency, Frequency, Monetary) metrics. Cluster 0 (gray) includes customers with similar purchasing behavior, forming a dense core group. Cluster -1 (red) represents outliers—customers whose patterns differ significantly from the main segments. DBSCAN is particularly useful here because it detects clusters of varying shapes and isolates noise without requiring the number of clusters beforehand.
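
How DBSCAN marks outliers with the label `-1` can be seen in a small sketch. The data, `eps`, and `min_samples` below are assumptions chosen for illustration and are not the values used in this notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in: one dense group plus three far-away points (assumption).
rng = np.random.default_rng(1)
core = rng.normal(scale=0.3, size=(200, 3))
outliers = np.array([[8.0, 8.0, 8.0], [-9.0, 7.0, 0.0], [10.0, -10.0, 5.0]])
X = np.vstack([core, outliers])

# eps and min_samples are illustrative; on real data they need tuning.
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_

# Noise points get the label -1; every other label is a cluster id.
n_noise = int((labels == -1).sum())
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", n_noise)
```

Because the number of clusters falls out of `eps` and `min_samples`, a single dense cluster plus noise, as in the plot above, is a common outcome when most customers behave similarly.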

In [None]:
rfm.groupby(rfm['DBSCAN_Cluster'])[['Recency', 'Frequency', 'Monetary']].mean()

In [None]:
# Extract scaled RFM features
x = rfm_scaled1[:, 0]  # Recency
y = rfm_scaled1[:, 1]  # Frequency
z = rfm_scaled1[:, 2]  # Monetary

# Plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(x, y, z, c=rfm['Hierarchical_Cluster'], cmap='tab10', s=60)

ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')
ax.set_title('Hierarchical Clustering - 3D View')

# Legend
legend1 = ax.legend(*scatter.legend_elements(), title="Clusters")
ax.add_artist(legend1)

plt.show()

This 3D plot displays customer segmentation using hierarchical clustering based on Recency, Frequency, and Monetary metrics. Each color represents a distinct group with similar purchasing behavior and engagement patterns. The spatial separation highlights differences, such as high spenders clustered away from occasional buyers. This segmentation helps businesses personalize marketing and improve retention strategies.

In [None]:
rfm.groupby(rfm['Hierarchical_Cluster'])[['Recency', 'Frequency', 'Monetary']].mean()

 # Which ML model did you choose from the above created models as your final prediction model and why?

KMeans clustering was chosen as the final model for customer segmentation. It is efficient, scalable, and well suited to large transactional datasets. The algorithm works effectively with RFM (Recency, Frequency, Monetary) features after normalization. KMeans provides clear and interpretable customer clusters that support business decision-making. The number of clusters was determined using the Elbow Method to balance cluster compactness against simplicity. The model grouped customers into four meaningful segments: High-Value, Regular, Occasional, and At-Risk.
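
One way to back such a choice with numbers is to score each model's labels with the same metric. The sketch below compares silhouette scores for the three algorithms on synthetic data; the data and all hyperparameters are assumptions, and the scores say nothing about the results on the actual retail dataset.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix (assumption: four separated groups).
rng = np.random.default_rng(7)
centers = np.array([[0, 0, 0], [6, 6, 0], [0, 6, 6], [6, 0, 6]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 3)) for c in centers])

candidates = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X),
    "Hierarchical": AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X),
    "DBSCAN": DBSCAN(eps=1.5, min_samples=5).fit_predict(X),
}

scores = {}
for name, labels in candidates.items():
    mask = labels != -1  # drop DBSCAN noise points before scoring
    # The silhouette score is only defined for two or more clusters.
    if len(set(labels[mask])) >= 2:
        scores[name] = silhouette_score(X[mask], labels[mask])

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A higher silhouette score indicates tighter, better-separated clusters, but interpretability and segment sizes still matter when choosing the model for business use.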