# **Project Name**    -



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name**            - Sejal


# **Project Summary -**


The goal of this project was to perform **customer segmentation** using **unsupervised machine learning techniques** to better understand purchasing behaviors and improve business strategies. We utilized **RFM analysis (Recency, Frequency, Monetary)** combined with **K-Means clustering** to identify customer segments from the given dataset.  

---

### **1. Data Preprocessing**
The dataset contained customer transaction data with fields like *InvoiceNo, InvoiceDate, CustomerID, Quantity, UnitPrice,* and *Country*. We performed the following preprocessing steps:
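- Removed rows with missing values (e.g., transactions without a CustomerID).
- Removed returns and canceled transactions (non-positive Quantity; canceled invoices have an InvoiceNo starting with 'C') and rows with a non-positive UnitPrice, then trimmed Quantity and UnitPrice outliers with the IQR rule.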
- Created a **TotalPrice** feature = *Quantity × UnitPrice* to represent the total monetary value of each transaction.
- Converted the **InvoiceDate** column to datetime format for time-based calculations.

---

### **2. RFM Analysis**
We derived **RFM metrics** for each customer:
- **Recency:** Days since the last purchase (Latest date in dataset − last purchase date).
- **Frequency:** Total number of unique transactions (invoices).
- **Monetary:** Total amount spent by the customer.  

We grouped data by `CustomerID` to calculate RFM values. These metrics provided a foundation to differentiate customers based on their purchase patterns.
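
As a minimal sketch of the three definitions above, the following computes RFM on a made-up table. The rows and the names `toy`, `snapshot`, and `rfm_toy` are illustrative assumptions and are not taken from the project dataset.

```python
import pandas as pd

# Hypothetical toy transactions (not from the project dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2023-01-01", "2023-01-01", "2023-01-10", "2023-01-05"]),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5],
})

# Reference point for Recency: the latest date present in the data
snapshot = toy["InvoiceDate"].max()

rfm_toy = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # unique invoices
    Monetary=("TotalPrice", "sum"),                                # total spend
)
print(rfm_toy)
```

Customer 1 has two line items on invoice `A1`, so counting unique invoices gives a Frequency of 2 rather than 3.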

---

### **3. Feature Scaling**
As RFM values vary greatly in scale, we standardized them using **StandardScaler**:
- Mean = 0, Standard Deviation = 1 for each feature.
- This ensured fair comparison during clustering.

Min-max normalization was avoided because a few extreme Frequency and Monetary values would compress the remaining customers into a narrow range, hiding the differences that matter for segmentation.
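
To make the standardization step concrete, here is a small sketch that applies the z-score formula by hand with pandas. The values are hypothetical; dividing by the population standard deviation (`ddof=0`) follows the convention that `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Hypothetical RFM values on very different scales
rfm_toy = pd.DataFrame({
    "Recency": [5, 40, 200, 365],
    "Frequency": [12, 4, 2, 1],
    "Monetary": [5400.0, 620.0, 150.0, 35.0],
})

# z-score: subtract the column mean, divide by the population std (ddof=0)
z = (rfm_toy - rfm_toy.mean()) / rfm_toy.std(ddof=0)

print(z.round(2))
# Each column should now have mean ~0 and standard deviation ~1
print(np.allclose(z.mean(), 0), np.allclose(z.std(ddof=0), 1))
```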

---

### **4. Clustering**
We used the **K-Means clustering algorithm** because it is simple, efficient, and works well with numerical data like RFM scores.

#### **Choosing the Optimal Number of Clusters**
- **Elbow Method (WCSS):** Showed a clear "elbow" at **k = 4**, suggesting that four clusters best balance compactness and separation.
- **Silhouette Score:** Confirmed that k = 4 provided a good balance between intra-cluster similarity and inter-cluster difference.
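
The silhouette comparison can also be read off programmatically instead of from a plot. The sketch below uses synthetic data from `make_blobs` with four well-separated groups (an assumption made purely for illustration), so it shows the selection mechanics and says nothing about the project's actual scores.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustration only)
X, _ = make_blobs(
    n_samples=200,
    centers=[(-10, -10), (-10, 10), (10, -10), (10, 10)],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-Means for each candidate k and record the silhouette score
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real RFM data the maximum is often less clear-cut, which is why it helps to check it against the elbow plot.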

---

### **5. Cluster Insights**
We identified **4 distinct customer segments**:
1. **High-value frequent customers:** Low Recency (recent purchases), high Frequency, high Monetary value – loyal and profitable customers.
2. **Moderate customers:** Average across all three metrics – potential for upselling and engagement.
3. **Occasional/one-time customers:** High Recency (not purchased recently), low Frequency – need reactivation strategies.
4. **Low-value or dormant customers:** High Recency, low Frequency, low Monetary – at risk of churn.
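
Segment descriptions like these are usually derived by averaging the unscaled RFM values per cluster. The following is a hedged sketch of that profiling step; the table `rfm_labeled` and its cluster labels are invented for illustration.

```python
import pandas as pd

# Hypothetical unscaled RFM table with cluster labels already assigned
rfm_labeled = pd.DataFrame({
    "Recency":   [3, 7, 180, 210, 45, 60],
    "Frequency": [25, 30, 1, 2, 6, 5],
    "Monetary":  [4000.0, 5200.0, 40.0, 75.0, 600.0, 550.0],
    "Cluster":   [0, 0, 1, 1, 2, 2],
})

# Average R, F, M per cluster plus the number of customers in each
profile = rfm_labeled.groupby("Cluster").agg(
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    Monetary=("Monetary", "mean"),
    Customers=("Recency", "size"),
)
print(profile)
```

Profiling on the original scale (days, invoices, currency) keeps the segment labels interpretable; the standardized values are only needed for fitting the clusters.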

---

### **6. Data Visualization & Insights**
We created visualizations for deeper understanding:
- **Transaction volume by country:** Revealed that the majority of sales occurred in the United Kingdom, providing insights for market focus.
- **Top 10 best-selling products:** Helped identify high-demand items to maintain adequate stock.
- **Purchase trends over time:** Showed seasonal and time-based purchasing patterns.
- **Monetary value distribution:** Indicated that most transactions were of low to medium value.

These insights can help businesses improve product planning, marketing strategies, and revenue generation.

---

### **7. Business Impact**
- **Customer Retention:** By identifying loyal customers, businesses can reward them with exclusive offers.
- **Churn Reduction:** Dormant customers can be targeted with personalized reactivation campaigns.
- **Optimized Marketing Spend:** Resources can be allocated more efficiently by focusing on high-value segments.

---



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The global e-commerce industry generates vast amounts of transaction data daily, offering valuable insights into customer purchasing behaviors. Analyzing this data is essential for identifying meaningful customer segments and recommending relevant products to enhance customer experience and drive business growth. This project aims to examine transaction data from an online retail business to uncover patterns in customer purchase behavior, segment customers based on Recency, Frequency, and Monetary (RFM) analysis, and develop a product recommendation system using collaborative filtering techniques.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
df=pd.read_csv('/content/drive/MyDrive/E-commerce/online_retail.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Total Rows:", df.shape[0])
print("Total Columns:", df.shape[1])


### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()

print("Missing/Null Values Count per Column:")
print(missing_values)


In [None]:
sns.set(style="whitegrid")

# Bar plot for missing values count
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]

plt.figure(figsize=(8,4))
sns.barplot(x=missing_values.index, y=missing_values.values, palette="mako")
plt.title("Missing Values Count per Column")
plt.ylabel("Count of Missing Values")
plt.xlabel("Columns")
plt.xticks(rotation=30)
plt.show()



### What did you know about your dataset?



1. The dataset contains **541,909 rows** and **8 columns**.
2. It includes details of online retail transactions such as:
   - Invoice number  
   - Product code  
   - Product description  
   - Quantity purchased  
   - Invoice date  
   - Price per item  
   - Customer ID  
   - Country
3. Two columns have missing values:
   - **CustomerID**: 135,080 missing values (~25% of the dataset)
   - **Description**: 1,454 missing values (<1% of the dataset)
4. The dataset may contain **duplicate rows** that need to be cleaned.
5. **Quantity** can have negative values, usually indicating product returns.
6. **UnitPrice** may have zero or invalid values, which need to be checked and handled.
7. This dataset can be used for several business insights:
   - **Customer Segmentation** using RFM (Recency, Frequency, Monetary) analysis  
   - **Product Recommendations** through collaborative filtering  
   - Identifying top-selling products, return patterns, and sales by country  
   - Supporting targeted marketing campaigns and inventory management
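
The data-quality points above (negative quantities, zero prices, missing customer IDs) can be counted directly. This is a minimal sketch on a hypothetical four-row sample; the dictionary keys are names chosen for this example.

```python
import pandas as pd

# Hypothetical sample showing the issues listed above
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "536381"],
    "Quantity": [6, -1, 3, 2],
    "UnitPrice": [2.55, 27.50, 0.0, 1.25],
    "CustomerID": [17850.0, 14527.0, None, 13047.0],
})

checks = {
    # Rows with a negative quantity (typically returns)
    "negative_quantity": int((sample["Quantity"] < 0).sum()),
    # Rows with a zero or negative unit price
    "non_positive_price": int((sample["UnitPrice"] <= 0).sum()),
    # Invoices whose number starts with 'C' (canceled)
    "cancelled_invoices": int(sample["InvoiceNo"].astype(str).str.startswith("C").sum()),
    # Share of rows without a CustomerID, in percent
    "missing_customer_pct": round(sample["CustomerID"].isna().mean() * 100, 1),
}
print(checks)
```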


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description


| **Column Name** | **Description** |
|-----------------|-----------------|
| **InvoiceNo**   | Transaction number for each purchase (unique for each invoice) |
| **StockCode**   | Unique product/item code |
| **Description** | Name/description of the product |
| **Quantity**    | Number of products purchased (can be negative for returns) |
| **InvoiceDate** | Date and time of the transaction (2022–2023) |
| **UnitPrice**   | Price per product/item |
| **CustomerID**  | Unique identifier for each customer |
| **Country**     | Country where the customer is based |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"{col} → {df[col].nunique()} unique values")


## ***4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Group data by country and count transactions
country_txn = df.groupby('Country')['InvoiceNo'].nunique().sort_values(ascending=False)

# Plot transaction volume by country
plt.figure(figsize=(15,6))
sns.barplot(x=country_txn.index, y=country_txn.values, palette="viridis")

plt.title("Transaction Volume by Country", fontsize=14)
plt.ylabel("Number of Transactions")
plt.xlabel("Country")
plt.xticks(rotation=75)
plt.show()


##### 1. Why did you pick the specific chart?

I used a **bar chart** because it is the most effective way to compare the transaction volumes across different countries. Each country is clearly represented, and the relative transaction numbers can be easily compared at a glance.


##### 2. What is/are the insight(s) found from the chart?

- **United Kingdom** has an overwhelmingly higher number of transactions compared to all other countries.  
- Countries like **Germany, France, and EIRE** have moderate transaction volumes, while most other countries have very few transactions.  
- The dataset is highly **skewed towards the UK market**.



##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 **Yes**, the insights can positively impact business strategy:
  - Since the UK contributes the highest volume, focusing marketing campaigns and loyalty programs in the UK can yield higher returns.  
  - Countries with low transactions could be targeted with awareness campaigns, discounts, or partnerships to increase sales.
---
- The skewness in transactions suggests the business is **overly dependent on one market (UK)**.  
- Any disruption in the UK market could negatively impact overall revenue.  
- It is important to **diversify sales across multiple countries** to reduce dependency and risk.


#### Chart - 2

In [None]:
# Chart - 2 visualization code
top_products = (
    df.groupby('Description')['Quantity']
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

# Plot chart
plt.figure(figsize=(12,6))
sns.barplot(x=top_products.values, y=top_products.index, palette="mako")
plt.title("Top 10 Best-Selling Products", fontsize=14)
plt.xlabel("Total Quantity Sold")
plt.ylabel("Product Description")
plt.show()


##### 1. Why did you pick the specific chart?

I used a **horizontal bar chart** because it clearly shows the total quantity sold for each product.  
This format makes it easy to rank products from the highest-selling to the lowest-selling.



##### 2. What is/are the insight(s) found from the chart?

- **"WORLD WAR 2 GLIDERS ASSTD DESIGNS"** is the highest-selling product by a significant margin.  
- Products like **"JUMBO BAG RED RETROSPOT"**, **"ASSORTED COLOUR BIRD ORNAMENT"**, and **"POPCORN HOLDER"** also have very high sales volumes.  
- The sales distribution shows that a few products dominate overall sales, while others sell in smaller volumes.



##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Yes**, these insights help the business focus on:
  - Ensuring sufficient inventory for top-selling products to avoid stock-outs.  
  - Identifying potential products to feature in promotions or bundle offers.  
  - Understanding product preferences for better forecasting and procurement.

---
- There is a risk of **over-dependence on a small number of products**.  
- If demand for the top products falls or if supply chain issues arise, sales could be heavily impacted.  
- Diversifying marketing strategies to boost the sales of mid-tier products can reduce risk.


#### Chart - 3

In [None]:
# Chart - 3 visualization code
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create a new column for date (ignoring time)
df['Date'] = df['InvoiceDate'].dt.date

# Group by date and count total transactions
daily_trend = df.groupby('Date')['InvoiceNo'].nunique()

# Plot purchase trends over time
plt.figure(figsize=(14,6))
sns.lineplot(x=daily_trend.index, y=daily_trend.values, color='blue')
plt.title("Purchase Trends Over Time", fontsize=14)
plt.xlabel("Date")
plt.ylabel("Number of Transactions")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I selected a **line chart** because it effectively shows how the number of transactions changes over time.  
This type of chart makes it easy to observe fluctuations, patterns, and seasonality in purchases.


##### 2. What is/are the insight(s) found from the chart?

- Transaction volumes show a **lot of fluctuations**, with some sharp spikes on certain dates.  
- There is an overall **upward trend** in the latter part of the year (around September to November).  
- Peaks may be associated with **festive seasons or special promotions**, while the dips could indicate low shopping periods.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Yes**, these insights can help in:
  - Planning inventory and supply chain around peak seasons to avoid stock-outs.  
  - Scheduling marketing campaigns during low-transaction periods to boost sales.  
  - Identifying high-revenue days and replicating the strategies that worked.

---
- The **frequent dips** in transactions could indicate missed sales opportunities or periods of low customer engagement.  
- Businesses should investigate the reasons behind these dips (e.g., stock issues, reduced marketing) and take corrective action.
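
Daily counts are noisy, so a monthly roll-up can make the seasonal pattern easier to judge. The sketch below shows one way to do this with `dt.to_period("M")`; the transactions in `tx` are made up for illustration.

```python
import pandas as pd

# Hypothetical transactions spanning two months
tx = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2", "B3"],
    "InvoiceDate": pd.to_datetime([
        "2023-09-03", "2023-09-03", "2023-09-20",
        "2023-10-01", "2023-10-15", "2023-10-28"]),
})

# Count unique invoices per calendar month
monthly = tx.groupby(tx["InvoiceDate"].dt.to_period("M"))["InvoiceNo"].nunique()
print(monthly)
```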


#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Create a new column for total monetary value of each line item
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

#  Monetary distribution per transaction
transaction_value = df.groupby('InvoiceNo')['TotalPrice'].sum()

plt.figure(figsize=(15,5))
sns.histplot(transaction_value, bins=50, kde=True, color="purple")
plt.title(" Monetary Distribution per Transaction", fontsize=14)
plt.xlabel("Transaction Value")
plt.ylabel("Frequency")
plt.xlim(0, transaction_value.quantile(0.95))  # Remove extreme outliers
plt.show()


##### 1. Why did you pick the specific chart?

I selected a **histogram with a KDE curve** because it shows the distribution of transaction values clearly.  
It helps visualize how frequently certain monetary values occur across transactions.


##### 2. What is/are the insight(s) found from the chart?

- Most transactions fall in the **mid-range of transaction values** (around 800 units).  
- There are fewer transactions with very low or very high transaction values.  
- This indicates a **balanced spending pattern**, where customers typically purchase moderate amounts.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Yes**, these insights help businesses understand the average spending behavior and set targeted strategies:
  - Designing bundles or offers around the mid-value spending range can increase revenue.  
  - Identify high-value customers (outliers) and create **loyalty programs** for them.

---
- If very few customers make high-value purchases, the business may be missing out on opportunities to **upsell or cross-sell**.  
- Offering incentives for customers to increase their basket size could mitigate this gap.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Monetary distribution per customer
customer_value = df.groupby('CustomerID')['TotalPrice'].sum()

plt.figure(figsize=(12,5))
sns.histplot(customer_value, bins=50, kde=True, color="green")
plt.title(" Monetary Distribution per Customer", fontsize=14)
plt.xlabel("Customer Lifetime Value")
plt.ylabel("Frequency")
plt.xlim(0, customer_value.quantile(0.95))  # Remove extreme outliers
plt.show()


##### 1. Why did you pick the specific chart?

This histogram with a KDE curve clearly shows the **distribution of Customer Lifetime Value (CLV)**.  
It helps identify where most customers fall and makes it easy to spot trends or outliers.  



##### 2. What is/are the insight(s) found from the chart?

- Most customers have a **low CLV (0–1500)**.  
- Customer count drops sharply as CLV increases.  
- Few customers contribute very high CLV, forming a long-tail distribution.  
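
A long-tail distribution can be quantified by asking what share of revenue comes from the top 20% of customers. The following sketch uses invented CLV values; the 20% cutoff is a conventional choice and is not a result from this project.

```python
import pandas as pd

# Hypothetical customer lifetime values with a long tail
clv = pd.Series([5000.0, 1200.0, 300.0, 250.0, 200.0,
                 150.0, 100.0, 80.0, 60.0, 40.0])

# Number of customers in the top 20% (at least one)
top_n = max(1, int(len(clv) * 0.2))

# Share of total revenue contributed by those customers
share = clv.nlargest(top_n).sum() / clv.sum()
print(f"Top 20% of customers account for {share:.0%} of revenue")
```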



##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights help in:
- Identifying trends and customer preferences  
- Improving operational efficiency  
- Targeting the right audience effectively  
- Optimizing resources for better profitability  

---
Some insights may indicate challenges such as:
- Drop in customer retention  
- Underperforming products or services  

Although these reflect current weaknesses, they **highlight opportunities for improvement**, allowing businesses to correct strategies and prevent further losses.


#### Chart - 6

In [None]:
# Chart - 6 visualization code
#  Boxplots for Quantity and UnitPrice
plt.figure(figsize=(14,5))

plt.subplot(1,2,1)
sns.boxplot(x=df['Quantity'], color="skyblue")
plt.title("Boxplot - Quantity")

plt.subplot(1,2,2)
sns.boxplot(x=df['UnitPrice'], color="lightgreen")
plt.title("Boxplot - UnitPrice")

plt.show()

#  Calculate IQR and find outliers for Quantity
Q1 = df['Quantity'].quantile(0.25)
Q3 = df['Quantity'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

outliers_quantity = df[(df['Quantity'] < lower_bound) | (df['Quantity'] > upper_bound)]
print(f"Outliers in Quantity: {outliers_quantity.shape[0]} rows")

# Calculate IQR and find outliers for UnitPrice
Q1 = df['UnitPrice'].quantile(0.25)
Q3 = df['UnitPrice'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

outliers_price = df[(df['UnitPrice'] < lower_bound) | (df['UnitPrice'] > upper_bound)]
print(f"Outliers in UnitPrice: {outliers_price.shape[0]} rows")


##### 1. Why did you pick the specific chart?

- A **boxplot** is a powerful tool for detecting outliers in continuous variables.
- We selected boxplots for **Quantity** and **UnitPrice** because these features often contain extreme values that can distort statistical calculations and machine learning models.
- The boxplot helps visualize the spread of data (median, quartiles, minimum, maximum) and highlights values that fall far outside the typical range.



##### 2. What is/are the insight(s) found from the chart?

- **Quantity:**
  - There are significant outliers on both the positive and negative ends.
  - Negative values likely indicate **order cancellations or data errors**.
  - Positive outliers suggest unusually large orders.
- **UnitPrice:**
  - Multiple extreme outliers exist, with some prices reaching very high values.
  - Negative values are also present, which are unrealistic and could indicate erroneous data entries.



##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Yes.**
  - Identifying and removing incorrect or extreme values improves the quality of the dataset, leading to more accurate customer segmentation and recommendations.
  - Outlier detection prevents skewed RFM scores and incorrect clustering.
  
- **Negative Growth Possibility:**
  - If outliers are ignored, they can lead to **misleading patterns**, such as misclassifying customers as high-value based on erroneous high transaction values.
  - It can also distort the calculation of averages and monetary metrics, affecting decision-making.

---


## ***5. Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Drop rows with missing Description (cannot be imputed meaningfully)
df = df[~df['Description'].isnull()]

# Drop rows with missing CustomerID (critical for customer-level analysis)
df = df[~df['CustomerID'].isnull()]

# Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Check missing values after handling
print("\nMissing values after handling:\n", df.isnull().sum())



#### What all missing value imputation techniques have you used and why did you use those techniques?

In our dataset, we handled missing values using the **row removal technique** for specific columns instead of imputing values. Here's why:

---

#### **Description Column**
- **Issue:** 1,454 missing values (<1% of data).  
- **Action Taken:** **Dropped rows** where the product description was missing.  
- **Reason:** Product names cannot be reliably imputed, and incorrect product names would mislead product-level analysis.

---

#### **CustomerID Column**
- **Issue:** 135,080 missing values (~25% of data).  
- **Action Taken:** **Dropped rows** where CustomerID was missing.  
- **Reason:**  
  - Customer segmentation (e.g., RFM analysis) requires unique customer IDs.  
  - There is no logical way to impute CustomerIDs without introducing errors.  
  - Keeping these rows would create inaccurate customer groupings and insights.

---
- For both **Description** and **CustomerID**, imputing missing values (like using "Unknown", mean, mode, etc.) would **distort the analysis**:
  - Incorrect CustomerIDs would merge unrelated customers or create duplicates.  
  - Random or placeholder product names would create fake products.  
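- Even after dropping these rows, more than 400,000 of the 541,909 rows remain, which is enough for customer-level analysis. Purchases without a CustomerID (about 25% of rows) are therefore not represented in the segments.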
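
The row-removal approach described above can also be written with `dropna(subset=...)`, which makes it easy to report how much data was lost. This is a small sketch on a hypothetical four-row sample.

```python
import pandas as pd

# Hypothetical sample with one missing Description and one missing CustomerID
sample = pd.DataFrame({
    "Description": ["MUG", None, "BAG", "LAMP"],
    "CustomerID": [1.0, 2.0, None, 4.0],
    "Quantity": [1, 2, 3, 4],
})

# Drop rows where either key column is missing, then rebuild the index
cleaned = sample.dropna(subset=["Description", "CustomerID"]).reset_index(drop=True)

# Report the share of rows removed
dropped_pct = 100 * (1 - len(cleaned) / len(sample))
print(len(cleaned), f"{dropped_pct:.0f}% of rows dropped")
```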


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Check shape before handling outliers
print("Original dataset shape:", df.shape)

# Separate returns (negative quantities) for future analysis
returns_df = df[df['Quantity'] < 0]

# Remove invalid rows: Quantity <= 0 or UnitPrice <= 0
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]

# Quantity
Q1 = df['Quantity'].quantile(0.25)
Q3 = df['Quantity'].quantile(0.75)
IQR = Q3 - Q1
lower_bound_q = Q1 - 1.5*IQR
upper_bound_q = Q3 + 1.5*IQR
df = df[(df['Quantity'] >= lower_bound_q) & (df['Quantity'] <= upper_bound_q)]

# UnitPrice
Q1 = df['UnitPrice'].quantile(0.25)
Q3 = df['UnitPrice'].quantile(0.75)
IQR = Q3 - Q1
lower_bound_p = Q1 - 1.5*IQR
upper_bound_p = Q3 + 1.5*IQR
df = df[(df['UnitPrice'] >= lower_bound_p) & (df['UnitPrice'] <= upper_bound_p)]

# Check shape after handling outliers
print("Cleaned dataset shape:", df.shape)
print("Returns dataset shape (for separate analysis):", returns_df.shape)


##### What all outlier treatment techniques have you used and why did you use those techniques?




#### Removing Invalid Data (Rule-Based Filtering)
- **What was done?**  
  - Removed rows where:
    - `Quantity <= 0` (invalid or returns)
    - `UnitPrice <= 0` (invalid price entries)
- **Why?**  
  - Negative quantities or zero prices are not valid for revenue analysis.  
  - These records would distort total sales and customer value metrics.  
  - Returns (negative quantities) were separated into a different dataset for dedicated return analysis.

---

#### IQR Method (Interquartile Range)
- **What was done?**  
  - For both `Quantity` and `UnitPrice`, we calculated:  
    - **IQR = Q3 - Q1**  
    - **Lower Bound = Q1 - 1.5 × IQR**  
    - **Upper Bound = Q3 + 1.5 × IQR**  
  - Removed rows outside these bounds.
- **Why?**  
  - Extreme high or low values distort averages and revenue calculations.  
  - The IQR method is robust because its bounds are derived from the middle 50% of the data, so the extreme values themselves do not shift the cutoffs.
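
Because the same bounds are computed twice (once per column), the rule can be wrapped in a small helper. The function name `iqr_bounds` and the sample series are assumptions made for this sketch.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds using the k * IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical quantities with one extreme value
qty = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 100])

low, high = iqr_bounds(qty)
kept = qty[qty.between(low, high)]  # between() is inclusive on both ends
print(low, high, kept.tolist())
```

Here Q1 = 2 and Q3 = 4, so the bounds are -1 and 7 and only the value 100 is removed.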

---


### 3. Categorical Encoding

In [None]:

# Identify categorical columns
categorical_cols = ['Country', 'Description']

# Use a separate encoder per column so each mapping can be inverted later
encoders = {}

# Apply Label Encoding to each categorical column
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

df.head()




#### What all categorical encoding techniques have you used & why did you use those techniques?


- We used **Label Encoding** for the categorical columns `Country` and `Description`.
- This technique maps each unique category to an integer (e.g., `France → 1`, `Germany → 2`).
- **Why?**
  - `Description` has thousands of unique values; One-Hot Encoding would create too many columns.  
  - Label Encoding is **memory-efficient**. The integer codes carry no real order, so they are not used as K-Means inputs; clustering runs only on the RFM features.
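
Since encoded columns lose their readable names, it helps to know that a fitted `LabelEncoder` can map codes back to labels. The sketch below uses a short hypothetical list of countries.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical country values
countries = ["United Kingdom", "France", "Germany", "France"]

enc = LabelEncoder()
codes = enc.fit_transform(countries)

# Codes start at 0 and follow the sorted order of the labels
print(codes.tolist())

# inverse_transform recovers the original labels from the codes
decoded = enc.inverse_transform(codes).tolist()
print(decoded)
```

Keeping one encoder per column is what makes this reverse lookup possible after several columns have been encoded.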


## ***6. Feature Engineering***

#### 1. Calculating RFM

In [None]:
import pandas as pd

# Ensure InvoiceDate is in datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create TotalPrice feature (if not already created)
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Get the latest purchase date in the dataset
latest_date = df['InvoiceDate'].max()

# Group by CustomerID and calculate Recency, Frequency, and Monetary
rfm_df = df.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (latest_date - x.max()).days),  # days since last purchase
    Frequency=('InvoiceNo', 'nunique'),                               # unique invoices, not line items
    Monetary=('TotalPrice', 'sum')                                    # total amount spent
).reset_index()

# Named aggregation already yields: CustomerID, Recency, Frequency, Monetary

# Display RFM DataFrame
rfm_df.head()



#### 2.  Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# Select only the RFM columns for scaling
rfm_features = rfm_df[['Recency', 'Frequency', 'Monetary']]

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the RFM data
rfm_scaled = scaler.fit_transform(rfm_features)

# Convert back to a DataFrame with the same column names
rfm_scaled_df = pd.DataFrame(rfm_scaled, columns=['Recency', 'Frequency', 'Monetary'])

# Add back CustomerID for reference
rfm_scaled_df['CustomerID'] = rfm_df['CustomerID'].values

rfm_scaled_df.head()


#### 3. Using the Elbow Method and Silhouette Score to decide the number of clusters


In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertia = []
silhouette_scores = []
k_values = range(2, 11)

for k in k_values:
    # Set n_init explicitly: its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(rfm_scaled_df[['Recency', 'Frequency', 'Monetary']])
    inertia.append(kmeans.inertia_)  # WCSS
    silhouette_scores.append(silhouette_score(
        rfm_scaled_df[['Recency', 'Frequency', 'Monetary']], kmeans.labels_))

# Plot Elbow Method
plt.figure(figsize=(8,4))
plt.plot(k_values, inertia, marker='o')
plt.title('Elbow Method (WCSS)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('WCSS')
plt.show()

# Plot Silhouette Score
plt.figure(figsize=(8,4))
plt.plot(k_values, silhouette_scores, marker='o', color='green')
plt.title('Silhouette Score')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()


#### 4. Clustering Algorithm (KMeans, DBSCAN, Hierarchical, etc.)


In [None]:
from sklearn.cluster import KMeans

# Run K-Means with k=4, the value suggested by the elbow and silhouette plots
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm_scaled_df['Cluster'] = kmeans.fit_predict(rfm_scaled_df[['Recency', 'Frequency', 'Monetary']])

# Show the first few rows with assigned clusters
rfm_scaled_df.head()

#### 5. Visualize the clusters in a 2-D plot

In [None]:
# Run K-Means with optimal k
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm_scaled_df['Cluster'] = kmeans.fit_predict(rfm_scaled_df[['Recency', 'Frequency', 'Monetary']])

# Now plot
plt.figure(figsize=(8,6))
plt.scatter(rfm_scaled_df['Recency'], rfm_scaled_df['Monetary'],
            c=rfm_scaled_df['Cluster'], cmap='viridis', s=50, alpha=0.7)
plt.title('Customer Segments (Recency vs Monetary)')
plt.xlabel('Recency (Standardized)')
plt.ylabel('Monetary (Standardized)')
plt.colorbar(label='Cluster')
plt.show()


#### 6. Visualize the clusters in a 3-D plot

In [None]:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection='3d')

x = rfm_scaled_df['Recency']
y = rfm_scaled_df['Frequency']
z = rfm_scaled_df['Monetary']
clusters = rfm_scaled_df['Cluster']

sc = ax.scatter(x, y, z, c=clusters, cmap='viridis', s=50, alpha=0.7)

ax.set_title('3D Customer Segments (RFM)')
ax.set_xlabel('Recency (Standardized)')
ax.set_ylabel('Frequency (Standardized)')
ax.set_zlabel('Monetary (Standardized)')
fig.colorbar(sc, label='Cluster')
plt.show()


### 7. Building Customer-Product Matrix

In [None]:
# Create a pivot table: customers vs products (Quantity as values)
customer_product_matrix = df.pivot_table(
    index='CustomerID',
    columns='Description',
    values='Quantity',
    aggfunc='sum',
    fill_value=0
)

customer_product_matrix.head()


### 8. Computing Product Similarity (Cosine Similarity)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Transpose the matrix to get products as rows
product_similarity = cosine_similarity(customer_product_matrix.T)

# Convert to a DataFrame
product_similarity_df = pd.DataFrame(
    product_similarity,
    index=customer_product_matrix.columns,
    columns=customer_product_matrix.columns
)

product_similarity_df.head()


### 9. Plotting Heatmap of Product Similarity

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12,8))
sns.heatmap(product_similarity_df.iloc[:20, :20], cmap='viridis')  # First 20 products in column order (not a sales ranking)
plt.title("Product Similarity Heatmap")
plt.show()


# **Conclusion**


This project utilized **RFM Analysis** and **K-Means Clustering** to segment customers based on their purchasing behavior. After extensive preprocessing, feature engineering, and scaling:  

1. We identified **optimal customer clusters (k=4)** using the **Elbow Method** and **Silhouette Score**.  
2. Each cluster represents a unique customer group:  
   - **High-value frequent customers** (loyal and profitable)  
   - **Moderate-value customers** (can be nurtured further)  
   - **One-time/occasional customers** (require engagement campaigns)  
   - **Dormant or low-value customers** (likely to churn)  
3. Visualizations such as **top products, country-wise transactions, purchase trends, and monetary distributions** provided deeper business insights.  
4. These insights enable businesses to:  
   - **Target the right customers** with personalized offers.  
   - **Improve retention** by focusing on loyal and at-risk customers.  
   - **Optimize marketing spend** by segmenting the audience effectively.  

This segmentation lays a strong foundation for **data-driven customer relationship management (CRM)** and can be extended to **product recommendations** using **customer-product similarity matrices** for further personalization.
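
As a hedged sketch of that extension, an item-to-item similarity table can be turned into recommendations by sorting one product's column. The function `recommend_similar` and the three-product matrix are hypothetical and are not part of the notebook above.

```python
import pandas as pd

def recommend_similar(product, similarity_df: pd.DataFrame, top_n: int = 3) -> list:
    """Return the top_n products most similar to `product`, excluding itself."""
    scores = similarity_df[product].drop(labels=[product])
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

# Hypothetical symmetric similarity matrix for three products
items = ["MUG", "BAG", "LAMP"]
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=items, columns=items,
)

print(recommend_similar("MUG", sim, top_n=2))
```

The product itself is dropped before sorting because its self-similarity is always 1.0 and would otherwise rank first.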


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***