<a href="https://colab.research.google.com/github/shreeya09/customer-segmentation/blob/main/Unsupervised_ML_Myntra_Online_Retail_Customer_Segmentation_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Myntra Online Retail Customer Segmentation



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**

Myntra is a leading Indian fashion e-commerce company known for its diverse collection of clothing, accessories, and lifestyle products. While predominantly associated with fashion retail in India, this project focuses on a dataset from Myntra Gifts Ltd., a UK-based division that deals in unique all-occasion giftware. The dataset includes detailed records of online retail transactions made through the company’s non-store platform between December 1, 2009, and December 9, 2011. It offers a rich snapshot of international retail activity, capturing customer purchases, geographic information, product details, and pricing.

The main objective of this project is to perform customer segmentation using unsupervised machine learning techniques, helping the business uncover patterns and optimize strategies across various domains. With no predefined labels or categories, clustering algorithms such as K-Means are used to group customers based on their purchasing behavior, frequency, monetary value, and more. The result is a deeper understanding of customer profiles, which can guide marketing, inventory, and pricing decisions.

Key Goals of the Project:

1. Identifying Purchasing Trends:
By analyzing the time and frequency of purchases, this project aims to uncover seasonal patterns, popular months, and preferred product types. These insights help in planning marketing campaigns and aligning inventory with demand cycles.

2. Evaluating Product Performance:
The data allows for a detailed analysis of which products are selling well and which are underperforming. Understanding top-selling categories and SKUs enables smarter inventory stocking and targeted promotions.

3. Understanding Customer Behavior:
Segmenting customers based on metrics like Recency, Frequency, and Monetary value (RFM) helps identify loyal customers, occasional buyers, and high-value clients. This allows for better personalization and customer retention strategies.

4. Optimizing Pricing Strategies:
Exploring the link between unit prices and sales volumes helps the business find ideal pricing points that maximize revenue without losing competitiveness.

5. Streamlining Inventory Management:
With insights into sales trends and customer demand, the company can reduce overstock and stockouts, ensuring better inventory turnover and improved customer satisfaction.

# **GitHub Link -**

Github link:

# **Problem Statement**


Myntra Gifts Ltd., a UK-based division of Myntra specializing in all-occasion giftware, has accumulated a large volume of online retail transaction data from 2009 to 2011. However, the company lacks a structured understanding of its customer base, purchasing patterns, and product performance. Without clear segmentation, it is challenging to implement targeted marketing, optimize inventory, or personalize customer experiences.

The goal of this project is to apply unsupervised machine learning techniques to segment customers based on their transactional behavior. By identifying distinct customer groups, the company can develop data-driven strategies to improve customer retention, streamline operations, and enhance overall business performance.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an Evaluation Metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and clustering
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Warnings
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
# Read the Excel workbook directly from GitHub (pandas uses the openpyxl engine for .xlsx files)
file_path = 'https://raw.githubusercontent.com/shreeya09/customer-segmentation/main/Online%20Retail.xlsx'
df = pd.read_excel(file_path)

### Dataset First View

In [None]:
# Dataset First Look
# Display the first few rows
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Shape of the dataset
rows, columns = df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print("Missing/Null values in each column:\n")
print(missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The dataset contains 541,909 rows and 8 columns.

It includes transaction-level retail data such as:

**-InvoiceNo**

**-StockCode**

**-Description**

**-Quantity**

**-InvoiceDate**

**-UnitPrice**

**-CustomerID**

**-Country**

**Key Observations:**

**-Duplicate Rows:**

There are some duplicate entries, which may need removal to avoid skewing analysis.

**-Missing Values:**

Significant missing values exist in columns like CustomerID and Description, commonly seen in e-commerce transaction data.

These must be handled carefully: depending on the context, the affected rows are either dropped or the missing values are imputed.

**-Data Types:**

Numeric: Quantity, UnitPrice

Categorical: InvoiceNo, StockCode, Description, Country

DateTime: InvoiceDate

Identifier: CustomerID (contains missing values)

**-Transaction Granularity:**

Each row represents a line item in a transaction (not full orders).

**-Country:**

Though it’s from Myntra Gifts Ltd. in the UK, it contains transactions from multiple countries.




## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:\n")
print(df.columns.tolist())

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**InvoiceNo**:	Unique identifier for each transaction. If it starts with "C", it indicates a cancellation.

**StockCode**:	Unique code for each product/item.

**Description**:	Name/description of the product.

**Quantity**:	Number of items purchased per transaction line. Can be negative for returns.

**InvoiceDate**:	Timestamp of the transaction (date and time).

**UnitPrice**:	Price per unit of the product (in GBP).

**CustomerID**:	Unique identifier for the customer. Missing values may indicate guest or unregistered purchases.

**Country**:	Country where the customer is located.
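
The cancellation rule described for InvoiceNo can be turned into an explicit flag. The following is a minimal sketch on invented rows, not the project's data; the column names `IsCancelled` and `NegativeQty` are hypothetical helpers introduced only for illustration.

```python
import pandas as pd

# Toy rows standing in for the real transaction table (all values are made up)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "C536383"],
    "Quantity": [6, -1, 12, -3],
    "UnitPrice": [2.55, 27.50, 1.25, 4.65],
})

# InvoiceNo may be read from Excel as int or str, so cast to str before matching
toy["IsCancelled"] = toy["InvoiceNo"].astype(str).str.startswith("C")

# In this toy data, cancelled lines also carry negative quantities;
# comparing the two flags is a quick consistency check
toy["NegativeQty"] = toy["Quantity"] < 0
n_cancelled = int(toy["IsCancelled"].sum())
signals_agree = bool((toy["IsCancelled"] == toy["NegativeQty"]).all())

print(f"Cancelled lines: {n_cancelled}, flags agree: {signals_agree}")
```

On the real dataset the two flags may not agree on every row, which is itself worth inspecting before filtering.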

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
print("Unique values in each column:\n")
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# --- Initial shape ---
print(f"Initial dataset shape: {df.shape}")

# --- Remove duplicate rows ---
df.drop_duplicates(inplace=True)
print(f"After removing duplicates: {df.shape}")

# --- Handle missing values ---
# We'll drop rows with missing CustomerID, since customer-level clustering needs an identifier
df = df.dropna(subset=['CustomerID'])

# For Description, fill missing values with 'Unknown' so the rows are retained
df['Description'] = df['Description'].fillna('Unknown')

# --- Filter invalid transactions ---
# Remove rows with negative or zero Quantity or UnitPrice
# .copy() returns an independent DataFrame, so the column assignments below do not act on a view
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# --- Convert date columns ---
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# --- Add TotalPrice column ---
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# --- Reset index ---
df.reset_index(drop=True, inplace=True)

# --- Data types check ---
print("\nData types after wrangling:")
print(df.dtypes)

# --- Basic stats ---
print("\nCleaned dataset shape:", df.shape)
print("\nMissing values (after cleaning):")
print(df.isnull().sum())




### What all manipulations have you done and insights you found?

-Removed duplicate rows

-Duplicates can mislead clustering algorithms and inflate customer behavior metrics.

-Handled missing values

-Dropped rows with missing CustomerID: These cannot be used in customer segmentation.

-Filled missing Description with 'Unknown': Allows retention of the data while acknowledging the lack of item detail.

-Filtered invalid transactions

-Removed rows with non-positive Quantity or UnitPrice: Such entries are often returns, errors, or test data that can distort clustering.

-Converted InvoiceDate to datetime format for future time-based analysis (e.g., recency).

-Created a TotalPrice column. Helps in computing monetary value per transaction/customer. TotalPrice = Quantity × UnitPrice

-Reset index after all row deletions and changes for a clean dataframe structure.
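
The steps listed above can also be collected into one reusable function, which makes the cleaning easy to re-run and to test. This is a sketch under assumptions: `clean_transactions` is a hypothetical helper name, and the toy rows are invented to exercise each rule once.

```python
import pandas as pd

def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a transaction table (hypothetical helper)."""
    out = raw.drop_duplicates()                      # remove exact duplicate rows
    out = out.dropna(subset=["CustomerID"])          # customer-level analysis needs an ID
    out = out[(out["Quantity"] > 0) & (out["UnitPrice"] > 0)].copy()  # drop returns/invalid prices
    out["Description"] = out["Description"].fillna("Unknown")
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out["TotalPrice"] = out["Quantity"] * out["UnitPrice"]
    return out.reset_index(drop=True)

# Invented rows: one duplicate, one missing description, one return, one missing customer
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3", "4"],
    "Description": ["MUG", "MUG", None, "BAG", "PEN"],
    "Quantity": [2, 2, 5, -1, 3],
    "InvoiceDate": ["2011-01-05", "2011-01-05", "2011-02-10", "2011-03-01", "2011-03-02"],
    "UnitPrice": [1.5, 1.5, 2.0, 4.0, 0.5],
    "CustomerID": [100.0, 100.0, 101.0, 102.0, None],
})

cleaned = clean_transactions(toy)
print(cleaned[["InvoiceNo", "Description", "TotalPrice"]])
```

Because the function returns a new frame instead of modifying its input, the raw data stays available for comparison.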

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1- Top Countries by Transaction Volume (Barplot)

In [None]:
# Chart - 1 visualization code
# (matplotlib and seaborn are already imported in the Import Libraries cell)

top_countries = df['Country'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='viridis')
plt.title('Top 10 Countries by Number of Transactions')
plt.xlabel('Transaction Count')
plt.ylabel('Country')
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart is best for ranked comparisons. We’re examining categorical values (countries), and bar charts clearly show volumes across distinct groups.

##### 2. What is/are the insight(s) found from the chart?



United Kingdom dominates the transaction volume.

Most other transactions are concentrated in a few EU countries (Netherlands, Germany, France, etc.).

Some countries have very few transactions, indicating sparse or one-time activity.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Positive: Helps target marketing and resource allocation in high-engagement countries.

Country-specific promotions can be tailored for top markets.


**Negative Growth Insight:**
If a country has high returns or low spending, even with many transactions, it may signal low profitability.

Some countries may only make bulk purchases rarely — not ideal for customer retention.



#### Chart - 2- Top 10 Products by Quantity Sold (Bar Chart)

In [None]:
# Chart - 2 visualization code
top_products = df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_products.values, y=top_products.index, palette='rocket')
plt.title('Top 10 Products by Quantity Sold')
plt.xlabel('Total Quantity Sold')
plt.ylabel('Product Description')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart ranks products by popularity. Quantity sold is a volume measure, and this chart shows bestsellers clearly.

##### 2. What is/are the insight(s) found from the chart?

Some items dominate the quantity sold (e.g., party accessories, small home goods).

These might be low-margin but high-volume items.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**

Positive: Helps in optimizing inventory and predicting demand.

Can focus on bundling, upselling these products.


**Negative Growth Insight:**

If these top-selling products have low profit margins, they may not contribute much to revenue.

Could indicate over-reliance on a few products — diversification may be necessary.





#### Chart - 3- Monthly Sales Trend(Line Chart)

In [None]:
# Chart - 3 visualization code
df['Month'] = df['InvoiceDate'].dt.to_period('M').astype(str)
monthly_sales = df.groupby('Month')['TotalPrice'].sum().reset_index()

plt.figure(figsize=(14, 6))
sns.lineplot(data=monthly_sales, x='Month', y='TotalPrice', marker='o')
plt.title('Monthly Sales Trend')
plt.xticks(rotation=45)
plt.ylabel('Total Sales (£)')
plt.xlabel('Month')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Line charts are ideal for time series data to show trends over time — here, monthly revenue (TotalPrice).

##### 2. What is/are the insight(s) found from the chart?

Seasonality is clear: sales increase toward November–December (holiday season).

Possible dip in summer months or early year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Positive: Plan stock, staffing, and campaigns for peak seasons.

Helps forecast demand and optimize logistics.

**Negative Growth Insight:**
Large off-season dips in revenue indicate overdependence on festive sales.

Suggests need for year-round engagement strategies.

#### Chart - 4- Invoice Size Distribution (Histogram)

In [None]:
# Chart - 4 visualization code
invoice_value = df.groupby('InvoiceNo')['TotalPrice'].sum()

plt.figure(figsize=(10, 6))
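# Plot only invoices up to £2,000 so that the 100 bins cover the visible range
# (binning the full range and then zooming with xlim can leave only a few very wide bars)
typical_invoices = invoice_value[invoice_value <= 2000]
sns.histplot(typical_invoices, bins=100, kde=True, color='skyblue')
plt.title('Distribution of Invoice Value (invoices up to £2,000)')
plt.xlabel('Total Invoice Value (£)')
plt.ylabel('Frequency')
plt.xlim(0, 2000)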
plt.show()


##### 1. Why did you pick the specific chart?

A histogram with KDE (kernel density) shows the distribution of numerical values (Total invoice value).

##### 2. What is/are the insight(s) found from the chart?

Most invoices are clustered below £500.

Few high-value transactions (long tail) — might be wholesale buyers or anomalies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Positive: Understand what’s a “typical” order size.

Enables pricing strategy or tiered offers (e.g., free shipping over £200).

**Negative Growth Insight:**
Over-reliance on small purchases may limit revenue growth.

Need to incentivize higher cart values (bundles, discounts).

#### Chart - 5- Unit Price vs Quantity (Scatterplot)

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df[df['Quantity'] < 1000], x='UnitPrice', y='Quantity', alpha=0.5)
plt.title('Scatter: Unit Price vs Quantity')
plt.xlabel('Unit Price (£)')
plt.ylabel('Quantity')
plt.xscale('log')  # log scale to compress outliers
plt.yscale('log')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

Scatter plots help reveal relationships and outliers between two continuous variables (price vs quantity).

##### 2. What is/are the insight(s) found from the chart?

High quantity usually correlates with low price (bulk buying behavior).

Outliers like very high-priced items with low quantities also exist.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business Impact:**
Positive: Helps identify bulk buyers or products that sell well at scale.

Segment users based on high-volume, low-price behavior.

**Negative Growth Insight:**
If only cheap items sell in volume, high-value products may be underperforming.

Indicates a price sensitivity problem in customer base or poor luxury positioning.

#### Chart - 6 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Select only numerical columns for correlation
num_cols = ['Quantity', 'UnitPrice', 'TotalPrice']

# Compute correlation matrix
corr_matrix = df[num_cols].corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap visually displays the strength of linear relationships between numerical features. It's fast, compact, and immediately shows which variables move together.

##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap of the retail dataset reveals important linear relationships between key numerical variables. Most notably, there is a strong positive correlation between Quantity and TotalPrice, indicating that as more items are purchased, the total transaction value increases — a natural but crucial validation for revenue generation. There is also a moderate positive correlation between UnitPrice and TotalPrice, suggesting that higher-priced products contribute to larger invoice values, although not as strongly as quantity does. Interestingly, the correlation between Quantity and UnitPrice appears to be weak or slightly negative, which implies that bulk purchases are typically associated with lower-priced items. These insights highlight that the business’s revenue is more heavily driven by volume than pricing, and while this supports strong sales figures, it also raises potential concerns about over-dependence on low-margin products. Overall, the heatmap provides foundational understanding for feature selection and sets the stage for more advanced segmentation and clustering.

#### Chart - 7 - Pair Plot

In [None]:
# Pair Plot visualization code

# Sample a subset to avoid overplotting (optional but recommended for performance)
df_sampled = df[['Quantity', 'UnitPrice', 'TotalPrice']].sample(1000, random_state=42)

# Create pair plot
sns.pairplot(df_sampled, diag_kind='kde', corner=True)
plt.suptitle('Pair Plot of Numerical Features (Sampled)', y=1.02)
plt.show()




##### 1. Why did you pick the specific chart?

A pair plot is chosen to:

-Visually inspect relationships between numerical features

-See the distribution of each feature (histograms/KDE)

-Detect non-linear patterns, outliers, or natural clusters before applying unsupervised learning

##### 2. What is/are the insight(s) found from the chart?

-Clear positive relationship between Quantity & TotalPrice (visible upward trend)

-Spread in UnitPrice shows some skew — most items are low-cost, a few are premium-priced

-Clusters or groupings may appear in TotalPrice distribution → hinting at segments



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothesis 1**: Customers from the UK spend more per transaction than customers from other countries.
Why: UK dominates transaction volume — but do they also bring higher value?
Test: Two-sample t-test (UK vs Non-UK on average TotalPrice)

**Hypothesis 2**:
High unit price items are purchased in smaller quantities than low unit price items.
Why: Visuals show bulk items are low-cost; let’s test if high prices suppress quantity.
Test: Pearson correlation test between UnitPrice and Quantity

**Hypothesis 3**:
The average invoice value is significantly higher during the holiday season (Nov–Dec) compared to other months.
Why: Monthly sales chart suggests seasonal spikes — but is the invoice size also higher?
Test: Two-sample t-test (Invoice value in Nov–Dec vs. other months)

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): Average TotalPrice of UK transactions ≤ Non-UK transactions

Alternate Hypothesis (H₁): Average TotalPrice of UK transactions > Non-UK transactions

Test Used: Independent t-test (one-tailed)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import ttest_ind, pearsonr

# ---------------------------------------------------
# Hypothesis 1: UK vs Non-UK TotalPrice comparison
# ---------------------------------------------------
uk_total = df[df['Country'] == 'United Kingdom']['TotalPrice']
non_uk_total = df[df['Country'] != 'United Kingdom']['TotalPrice']

t_stat_1, p_val_1 = ttest_ind(uk_total, non_uk_total, equal_var=False, alternative='greater')

# Print results
print("Hypothesis 1 (UK > Non-UK Spending):")
print(f"T-statistic = {t_stat_1:.2f}, P-value = {p_val_1:.4f}\n")



##### Which statistical test have you done to obtain P-Value?

Statistical Test Used: Independent Two-Sample t-test (one-tailed)

##### Why did you choose the specific statistical test?

**Why this test was chosen:**

We are comparing the means of a numerical variable (TotalPrice) across two independent groups: UK vs Non-UK.

The two-sample t-test is appropriate when:

You want to compare means between two groups

The groups are independent (not related)

The variable is continuous and approximately normally distributed (or large sample size for CLT to hold)

A one-tailed test was used because we’re specifically testing if UK > Non-UK in spending.
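
To make the decision rule explicit, here is a minimal sketch of a one-tailed Welch t-test on synthetic data. The two groups, their means, and the significance level `alpha = 0.05` are assumptions chosen for illustration; they are not results from the retail dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic spending values for two independent groups (illustrative only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=25.0, scale=5.0, size=500)
group_b = rng.normal(loc=20.0, scale=8.0, size=300)

# equal_var=False selects Welch's t-test, which does not assume equal variances;
# alternative="greater" tests H1: mean(group_a) > mean(group_b)
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print(f"t = {t_stat:.2f}, p = {p_value:.4g}, reject H0: {reject_null}")
```

A small p-value only says the difference in means is unlikely under H₀; with samples as large as the transaction table, even a tiny difference can be significant, so the size of the difference should be reported alongside it.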

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): No correlation between UnitPrice and Quantity

Alternate Hypothesis (H₁): Significant negative correlation exists

Test Used: Pearson correlation

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# ---------------------------------------------------
# Hypothesis 2: Correlation between UnitPrice and Quantity
# ---------------------------------------------------
unit_price = df['UnitPrice']
quantity = df['Quantity']

# pearsonr returns a two-sided p-value by default; the sign of the coefficient gives the direction
corr_coef_2, p_val_2 = pearsonr(unit_price, quantity)

print("Hypothesis 2 (UnitPrice vs Quantity Correlation):")
print(f"Correlation Coefficient = {corr_coef_2:.4f}, P-value = {p_val_2:.4f}\n")



##### Which statistical test have you done to obtain P-Value?

Statistical Test Used: Pearson Correlation Coefficient Test

##### Why did you choose the specific statistical test?

**Why this test was chosen:**

We’re testing for a linear relationship between two continuous numeric variables: UnitPrice and Quantity.

The Pearson correlation quantifies the strength and direction of the linear association.

It also gives a p-value to determine if the correlation is statistically significant.
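
Because Pearson measures only linear association, it can understate a relationship that is monotone but curved, as a price and quantity relationship may be. The sketch below compares Pearson with Spearman's rank correlation on invented values; using Spearman here is a suggested cross-check, not part of the original analysis.

```python
from scipy.stats import pearsonr, spearmanr

# Invented values: quantity roughly halves each time the price doubles
unit_price = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
quantity = [120, 60, 30, 15, 8, 4]

r, p_r = pearsonr(unit_price, quantity)        # linear association
rho, p_rho = spearmanr(unit_price, quantity)   # monotone (rank-based) association

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

In this toy case the ranks are perfectly reversed, so Spearman reports a perfect negative relationship while Pearson is weaker because the curve is not a straight line.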

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): Holiday season invoice value ≤ Rest of the year

Alternate Hypothesis (H₁): Holiday season invoice value > Rest of the year

Test Used: Independent t-test (one-tailed)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# ---------------------------------------------------
# Hypothesis 3: Holiday (Nov–Dec) vs other months
# ---------------------------------------------------
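# Use a separate Series for the calendar month so the 'Month' column from Chart 3 is not overwritten
invoice_month = df['InvoiceDate'].dt.month
holiday_df = df[invoice_month.isin([11, 12])]['TotalPrice']
non_holiday_df = df[~invoice_month.isin([11, 12])]['TotalPrice']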

t_stat_3, p_val_3 = ttest_ind(holiday_df, non_holiday_df, equal_var=False, alternative='greater')

print("Hypothesis 3 (Holiday Season Spending):")
print(f"T-statistic = {t_stat_3:.2f}, P-value = {p_val_3:.4f}")


##### Which statistical test have you done to obtain P-Value?

Statistical Test Used: Independent Two-Sample t-test (one-tailed)

##### Why did you choose the specific statistical test?

**Why this test was chosen:**

Similar to Hypothesis 1, we’re comparing means of TotalPrice between two groups:

Transactions during Nov–Dec (holiday)

Transactions in the rest of the year (non-holiday)

These two groups are independent, and the variable is continuous.

A one-tailed t-test was chosen because the hypothesis specifically predicts higher values during the holiday season.
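
The hypothesis is phrased in terms of invoice value, while the test cell compares line-item `TotalPrice`. If an invoice-level comparison is wanted, the line items can be summed per invoice first. The following sketch shows that aggregation on invented rows; the names `invoices`, `InvoiceValue`, and `IsHoliday` are hypothetical.

```python
import pandas as pd

# Invented line items: two holiday invoices (A1, A2) and two others (B1, B2)
lines = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1", "B2"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-03", "2011-11-03", "2011-12-01",
        "2011-03-10", "2011-03-10", "2011-06-21",
    ]),
    "TotalPrice": [10.0, 15.0, 40.0, 5.0, 7.5, 20.0],
})

# One row per invoice: total value and the invoice date
invoices = (
    lines.groupby("InvoiceNo")
    .agg(InvoiceValue=("TotalPrice", "sum"), InvoiceDate=("InvoiceDate", "min"))
    .reset_index()
)

# Flag invoices issued in November or December
invoices["IsHoliday"] = invoices["InvoiceDate"].dt.month.isin([11, 12])

holiday_mean = invoices.loc[invoices["IsHoliday"], "InvoiceValue"].mean()
other_mean = invoices.loc[~invoices["IsHoliday"], "InvoiceValue"].mean()

print(f"Holiday mean: {holiday_mean}, other months mean: {other_mean}")
```

The two `InvoiceValue` groups could then be passed to the same one-tailed t-test in place of the line-item values.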

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Rows with missing CustomerID were already dropped during data wrangling;
# the call is repeated here so the step is explicit and safe to re-run
df = df.dropna(subset=['CustomerID']).copy()


#### What all missing value imputation techniques have you used and why did you use those techniques?

We dropped the rows with missing values instead of imputing them. Since the segmentation is customer-based, CustomerID is essential, and rows without it cannot be assigned to any customer.

### 2. Handling Outliers

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot boxplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.boxplot(x=df['Quantity'], ax=axes[0])
axes[0].set_title('Boxplot of Quantity')

sns.boxplot(x=df['UnitPrice'], ax=axes[1])
axes[1].set_title('Boxplot of UnitPrice')

plt.tight_layout()
plt.show()

# Remove negative or zero values
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# Create total price column
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Trim extreme values: drop rows above the 99th percentile of Quantity or UnitPrice
# (this removes the rows; it does not cap their values)
q99_quantity = df['Quantity'].quantile(0.99)
q99_unitprice = df['UnitPrice'].quantile(0.99)

df = df[(df['Quantity'] <= q99_quantity) & (df['UnitPrice'] <= q99_unitprice)].copy()




##### What all outlier treatment techniques have you used and why did you use those techniques?

We addressed outliers using a combination of domain-driven and statistical techniques. We began by removing invalid entries, such as rows with zero or negative values in the `Quantity` and `UnitPrice` columns, as these typically represent returns, cancellations, or data entry errors and do not reflect actual purchases. To reduce the impact of extreme values, we then trimmed the data at the 99th percentile: rows whose `Quantity` or `UnitPrice` exceeds that threshold are removed. Strictly speaking this is trimming, not capping, because the rows are dropped and their values are not replaced by the threshold. It keeps the large majority of the data while limiting skew from unusually large transactions. Additionally, we created a new feature, `TotalPrice`, by multiplying `Quantity` and `UnitPrice`, which is needed to calculate monetary value for customer segmentation.

### 3. Categorical Encoding

In [None]:
# One-Hot Encode the 'Country' column
df_encoded = pd.get_dummies(df, columns=['Country'], drop_first=True)

# Display the first few rows to confirm encoding
df_encoded.head()


#### What all categorical encoding techniques have you used & why did you use those techniques?

One-Hot Encoding

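Applied To: Country (the main categorical feature in the dataset). The encoded frame `df_encoded` is kept for reference; the clustering features built later are the RFM metrics only.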

What it does:
Converts each unique category into a separate binary column (0/1).

Why this technique?

Suitable for non-ordinal categorical variables.

Ensures no false assumptions about order or distance between categories.

Works well with distance-based unsupervised learning algorithms (e.g., K-Means, Hierarchical Clustering).
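
One detail worth checking when one-hot columns feed a distance-based algorithm is the effect of `drop_first=True`. The sketch below, on an invented four-row frame, compares Euclidean distances between countries with and without the dropped column; the helper `dist` is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Country": ["United Kingdom", "France", "Germany", "France"]})

full = pd.get_dummies(toy, columns=["Country"])                      # one column per country
reduced = pd.get_dummies(toy, columns=["Country"], drop_first=True)  # first category dropped

def dist(frame: pd.DataFrame, i: int, j: int) -> float:
    """Euclidean distance between rows i and j of an encoded frame."""
    a = frame.iloc[i].to_numpy(dtype=float)
    b = frame.iloc[j].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

print(list(full.columns))
print(list(reduced.columns))
print("full:    France-Germany", dist(full, 1, 2), "| UK-Germany", dist(full, 0, 2))
print("reduced: France-Germany", dist(reduced, 1, 2), "| UK-Germany", dist(reduced, 0, 2))
```

With the full encoding every pair of different countries is equally far apart. With `drop_first=True` the dropped category (France here) becomes the all-zeros row and sits closer to every other country, so the full encoding may be the safer choice if these columns are ever used for clustering.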

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords if needed
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

#Convert to lowercase
df['Description_clean'] = df['Description'].astype(str).str.lower()




#### 3. Removing Punctuations

In [None]:
#Remove punctuation and special characters
df['Description_clean'] = df['Description_clean'].apply(lambda x: re.sub(r'[^\w\s]', '', x))



#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits
# (tokens containing digits are removed in the Tokenization cell)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
df['Description_clean'] = df['Description_clean'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [None]:
# Remove White spaces
df['Description_clean'] = df['Description_clean'].str.strip()

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
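# Tokenization
# Remove words/tokens containing digits, then collapse the extra spaces left behind
df['Description_clean'] = df['Description_clean'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
df['Description_clean'] = df['Description_clean'].str.replace(r'\s+', ' ', regex=True).str.strip()

# Preview cleaned descriptions (kept as the last expression so the notebook displays it)
df[['Description', 'Description_clean']].head()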
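
The cell above cleans the description text but does not split it into tokens. If token lists are needed, a whitespace split is the simplest option. This is a minimal sketch on invented descriptions; the real notebook column would be `Description_clean`, and a library tokenizer could be substituted.

```python
import pandas as pd

# Invented cleaned descriptions, including stray and repeated spaces
desc = pd.Series([
    "white hanging heart tlight holder",
    "set  babushka nesting boxes",
    "  glass star frosted  ",
])

# str.split() with no pattern splits on runs of whitespace and ignores leading/trailing spaces
tokens = desc.str.split()
token_counts = tokens.str.len()

print(tokens.tolist())
print(token_counts.tolist())
```

Splitting on whitespace is enough here because punctuation and digits have already been removed in earlier steps.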

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Create TotalPrice
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Set reference date for Recency calculation (one day after the last invoice in the data)
reference_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)

# RFM Feature Creation
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,  # Recency
    'InvoiceNo': 'nunique',                                     # Frequency
    'TotalPrice': 'sum'                                         # Monetary
}).reset_index()

# Rename columns
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']



#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Keep only the features needed for clustering
rfm_selected = rfm[['Recency', 'Frequency', 'Monetary']]

# Preview the selected features (kept as the last expression so the notebook displays it)
rfm_selected.head()



##### Which feature selection methods have you used and why?

We used manual feature selection based on domain knowledge rather than statistical or automated methods. Since the project focuses on customer segmentation using RFM (Recency, Frequency, Monetary) analysis, we specifically selected the features that are known to be directly relevant for measuring customer behavior. This approach ensures that the clustering model is built on meaningful and interpretable customer metrics, making the segmentation results more actionable for business decisions.

##### Which features did you find important and why?

The most important features identified for this analysis were:

Recency: Measures the number of days since a customer’s last purchase. It helps identify how recently a customer engaged with the business.

Frequency: Indicates how often a customer has purchased. This is useful for identifying loyal or repeat customers.

Monetary: Represents the total spending of each customer. It helps in recognizing high-value customers who contribute more revenue.

These three features effectively summarize customer behavior and are essential for building meaningful customer segments. Other columns such as StockCode, Description, and Country were not used as clustering inputs, as they do not directly describe purchasing patterns at the customer level; InvoiceNo and InvoiceDate enter only indirectly, through Frequency and Recency.
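
To make the three definitions concrete, here is a minimal sketch that computes Recency, Frequency, and Monetary on a small, hypothetical set of transactions (the rows below are made up for illustration and are not taken from the project dataset). It follows the same logic as the feature-creation cell above, with the reference date set to one day after the latest invoice.

```python
import pandas as pd

# Hypothetical toy transactions (illustrative only, not from the project data)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-08", "2011-11-01",
        "2011-11-01", "2011-11-20", "2011-12-09",
    ]),
    "Quantity": [2, 1, 5, 1, 3, 10],
    "UnitPrice": [3.0, 4.0, 1.0, 2.0, 2.0, 0.5],
})
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Reference date: one day after the last transaction in the data
reference_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

grouped = toy.groupby("CustomerID")
toy_rfm = pd.DataFrame({
    # Days between the reference date and the customer's latest invoice
    "Recency": (reference_date - grouped["InvoiceDate"].max()).dt.days,
    # Number of distinct invoices (two lines on invoice B1 count once)
    "Frequency": grouped["InvoiceNo"].nunique(),
    # Total amount spent
    "Monetary": grouped["TotalPrice"].sum(),
})
print(toy_rfm)
# Customer 1: Recency 2,  Frequency 2, Monetary 10.0
# Customer 2: Recency 20, Frequency 2, Monetary 13.0
# Customer 3: Recency 1,  Frequency 1, Monetary 5.0
```

Customer 2 illustrates why Frequency counts unique invoices rather than rows: the two line items on invoice B1 represent a single purchase.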

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

### 6. Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize scaler
scaler = StandardScaler()

# Scale the selected RFM features
rfm_scaled = scaler.fit_transform(rfm_selected)

# Convert to DataFrame for readability
rfm_scaled_df = pd.DataFrame(rfm_scaled, columns=rfm_selected.columns)

# Display first few rows
rfm_scaled_df.head()


##### Which method have you used to scale your data and why?

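We used StandardScaler to standardize the RFM features. It subtracts each feature's mean and divides by its standard deviation, so every feature ends up with mean 0 and standard deviation 1. This is a linear rescaling: it does not change the shape of a distribution, so skewed features stay skewed. Standardization suits distance-based clustering because it puts all features on a comparable scale, so no single feature dominates the distance calculations simply because of its units.

Standardization was chosen over min-max normalization because a single extreme value does not compress the remaining observations into a narrow range, which is relevant here since Monetary and Frequency often have skewed distributions. StandardScaler is still influenced by outliers, however, because they affect both the mean and the standard deviation.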
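
The effect of standardization on a skewed feature can be checked with a short sketch. The data below is synthetic (a log-normal sample standing in for a Monetary-like column, which is an assumption for illustration): after scaling, the mean is 0 and the standard deviation is 1, but the skewness is unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic, right-skewed stand-in for a Monetary-like feature (assumption)
rng = np.random.default_rng(0)
monetary = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=1000))

# StandardScaler expects a 2D input, so pass a one-column DataFrame
scaled = pd.Series(StandardScaler().fit_transform(monetary.to_frame()).ravel())

print("mean after scaling:", round(scaled.mean(), 6))
print("std after scaling: ", round(scaled.std(ddof=0), 6))
print("skew before:", round(monetary.skew(), 3))
print("skew after: ", round(scaled.skew(), 3))
```

If the skew itself is a concern for clustering, a transformation step applied before scaling would be needed; scaling alone does not address it.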

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is the process of reducing the number of input variables (features) while retaining as much information as possible. It helps in:

- Visualizing high-dimensional data (especially for clustering)
- Reducing noise and redundancy
- Improving computational efficiency
- Avoiding the curse of dimensionality

In our case, we’re working with only 3 features (Recency, Frequency, Monetary), so dimensionality reduction isn’t necessary for modeling, but it’s very useful for 2D visualization of clusters.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Apply PCA to reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
rfm_pca = pca.fit_transform(rfm_scaled_df)

# Convert to DataFrame
rfm_pca_df = pd.DataFrame(rfm_pca, columns=['PCA1', 'PCA2'])

# Preview the reduced data
rfm_pca_df.head()


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

We used Principal Component Analysis (PCA) as the dimensionality reduction technique. PCA was chosen because it is one of the most widely used and efficient methods for reducing high-dimensional data into fewer dimensions while retaining most of the variance (information) in the data.

Although our dataset only has three features (Recency, Frequency, and Monetary), we applied PCA to reduce it to two dimensions for visualization purposes. This helps us easily interpret and visualize customer clusters in a 2D scatter plot, making the segmentation results more intuitive and understandable. PCA also ensures that the axes (principal components) are orthogonal and capture the maximum variance in the data, which makes it ideal for this task.
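
How much information the 2D projection keeps can be read from the explained variance ratio of the fitted PCA. The sketch below uses synthetic stand-in data (an assumption, with Frequency and Monetary deliberately correlated); in the notebook, the same attribute could be inspected on the `pca` object fitted above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for RFM data (assumption): Monetary grows with Frequency
rng = np.random.default_rng(42)
frequency = rng.poisson(5, 500).astype(float)
monetary = frequency * 20 + rng.normal(0, 10, 500)
recency = rng.integers(1, 365, 500).astype(float)

X_demo = StandardScaler().fit_transform(
    np.column_stack([recency, frequency, monetary])
)

pca_demo = PCA(n_components=2).fit(X_demo)

# Share of total variance captured by each component, and by both together
ratios = pca_demo.explained_variance_ratio_
print("per component:", ratios.round(3))
print("retained in 2D:", round(ratios.sum(), 3))
```

A retained share close to 1 means the 2D scatter plot is a faithful picture of the clusters; a low share means clusters that overlap in the plot may still be separated in the full feature space.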

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation: K-Means Clustering

from sklearn.cluster import KMeans

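# Fit the algorithm on the three scaled RFM features only
feature_cols = ['Recency', 'Frequency', 'Monetary']
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(rfm_scaled_df[feature_cols])

# Store the cluster label assigned to each customer
rfm_scaled_df['Cluster'] = kmeans.labels_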

# Display the first few rows with cluster labels
rfm_scaled_df.head()


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

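# Find the optimal number of clusters using the silhouette score.
# Use only the scaled RFM features, so the 'Cluster' label column
# added earlier is not treated as an input feature.
X = rfm_scaled_df[['Recency', 'Frequency', 'Monetary']]
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)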

# Plot silhouette scores
plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='--')
plt.title("Silhouette Score vs Number of Clusters (KMeans)")
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.grid(True)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Try different values of init method and max_iter
params = {
    "init": ["k-means++", "random"],
    "max_iter": [100, 300, 500]
}

# Cluster on the scaled RFM features only (exclude the 'Cluster' label column)
X = rfm_scaled_df[['Recency', 'Frequency', 'Monetary']]

best_score = -1
best_params = {}

for init in params["init"]:
    for max_iter in params["max_iter"]:
        kmeans = KMeans(n_clusters=4, init=init, max_iter=max_iter, random_state=42)
        labels = kmeans.fit_predict(X)
        score = silhouette_score(X, labels)

        if score > best_score:
            best_score = score
            best_params = {"init": init, "max_iter": max_iter}

print("Best Params:", best_params)
print("Improved Silhouette Score:", round(best_score, 3))


##### Which hyperparameter optimization technique have you used and why?

Grid Search (manual) was used to tune init and max_iter because the search space is small and intuitive for KMeans.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, after tuning, the silhouette score improved by about 0.02 points — indicating more cohesive clusters.

Silhouette Score: Indicates how well-separated clusters are. A higher score shows clear customer segments.

Business Impact: Enables Myntra to target clusters differently — for example, identifying high-value vs low-frequency buyers and personalizing campaigns accordingly.



### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# Cluster on the scaled RFM features only (exclude the 'Cluster' label column)
X = rfm_scaled_df[['Recency', 'Frequency', 'Monetary']]

# Dendrogram to visualize optimal number of clusters
plt.figure(figsize=(10, 6))
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title("Dendrogram (to determine optimal clusters)")
plt.xlabel("Customers")
plt.ylabel("Euclidean distances")
plt.show()

# Evaluate model performance for different numbers of clusters
silhouette_scores = []
for k in range(2, 7):
    hc = AgglomerativeClustering(n_clusters=k, linkage='ward', metric='euclidean')
    labels = hc.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)

# Plot silhouette scores
plt.figure(figsize=(8, 5))
plt.plot(range(2, 7), silhouette_scores, marker='o', linestyle='--', color='purple')
plt.title('Silhouette Score vs Number of Clusters (Hierarchical Clustering)')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
linkage_methods = ['ward', 'complete', 'average', 'single']
k = 4  # Assuming 4 is optimal based on Step 1

# Cluster on the scaled RFM features only (exclude the 'Cluster' label column)
X = rfm_scaled_df[['Recency', 'Frequency', 'Monetary']]

print("Linkage Method Tuning:")
for method in linkage_methods:
    # Euclidean distance is used for every linkage; 'ward' supports no other metric
    hc = AgglomerativeClustering(n_clusters=k, linkage=method, metric='euclidean')
    labels = hc.fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"Linkage: {method:8} → Silhouette Score: {score:.3f}")


##### Which hyperparameter optimization technique have you used and why?

For Hierarchical Clustering, we performed a manual grid search over two parameters:

- n_clusters (number of clusters)
- linkage method (ward, complete, average, single)

We did this because:

- Hierarchical clustering doesn’t require initialization or iterations like KMeans.
- Different linkage methods can significantly change the cluster structure, so visualizing dendrograms and measuring silhouette scores helped pick the best fit.
- Manual tuning was sufficient given the small hyperparameter space.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes. After tuning:

Choosing linkage='ward' with n_clusters=4 gave the highest silhouette score, showing clear and balanced clusters.

Improvement was noticeable from around 0.31 to 0.38 in silhouette score.

This indicates that the final clusters were better separated and more meaningful.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

**Silhouette Score**

Measures how close a point is to its own cluster vs other clusters.

Values range from -1 to 1; higher is better.

In business:
A high silhouette score indicates strong customer segmentation — making it easier to build targeted marketing strategies for each cluster.
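
For a single point, the silhouette value compares the mean distance to the other points in its own cluster with the mean distance to the points in the nearest other cluster; the score reported here is the average over all points. A minimal sketch on synthetic blobs (made-up data, not the customer data) shows how the score responds to cluster tightness:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two synthetic groups with the same centers but different spread (assumption)
centers = [(-5, -5), (5, 5)]
X_tight, y_tight = make_blobs(
    n_samples=200, centers=centers, cluster_std=0.5, random_state=0
)
X_loose, y_loose = make_blobs(
    n_samples=200, centers=centers, cluster_std=5.0, random_state=0
)

# Compact, well-separated groups score close to 1;
# overlapping groups score noticeably lower
score_tight = silhouette_score(X_tight, y_tight)
score_loose = silhouette_score(X_loose, y_loose)
print("tight clusters:", round(score_tight, 3))
print("loose clusters:", round(score_loose, 3))
```

The same comparison applies to customer segments: the more the segments overlap in RFM space, the lower the score and the harder it is to treat them as distinct audiences.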

**Business Impact**

Agglomerative clustering is particularly good for visualizing hierarchical relationships between customers.

Myntra can use this to:

Identify nested segments (e.g., frequent buyers within a high-value group).

Tailor loyalty programs or personalized campaigns.

Strategize tiered marketing efforts based on how customers are related in the purchase hierarchy.



### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

eps_values = [0.5, 1.0, 1.5, 2.0]
min_samples_values = [3, 5, 7]

heatmap_scores = []

# Cluster on the scaled RFM features only (exclude the 'Cluster' label column)
X = rfm_scaled_df[['Recency', 'Frequency', 'Monetary']]

for min_samples in min_samples_values:
    row_scores = []
    for eps in eps_values:
        db = DBSCAN(eps=eps, min_samples=min_samples)
        labels = db.fit_predict(X)

        # Score only the non-noise points, and only if at least two real clusters exist
        mask = labels != -1
        if len(set(labels[mask])) >= 2:
            score = silhouette_score(X[mask], labels[mask])
        else:
            score = -1  # Placeholder: fewer than two real clusters, score undefined
        row_scores.append(score)
    heatmap_scores.append(row_scores)

# Convert to DataFrame
heatmap_df = pd.DataFrame(heatmap_scores, index=min_samples_values, columns=eps_values)

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(heatmap_df, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("DBSCAN Silhouette Score Heatmap")
plt.xlabel("eps values")
plt.ylabel("min_samples values")
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
import pandas as pd
import numpy as np
from itertools import product
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Define parameter grids
eps_values = [0.5, 1.0, 1.5, 2.0]
min_samples_values = [3, 5, 7]

# Track best params and score
best_score = -1
best_params = {}

# Cluster on the scaled RFM features only (exclude the 'Cluster' label column)
X = rfm_scaled_df[['Recency', 'Frequency', 'Monetary']]

# Grid search for best DBSCAN params
for eps, min_samples in product(eps_values, min_samples_values):
    db = DBSCAN(eps=eps, min_samples=min_samples)
    labels = db.fit_predict(X)

    # Score only the non-noise points, and only if at least two real clusters exist.
    # Requiring zero noise points would reject every result that flags outliers.
    mask = labels != -1
    if len(set(labels[mask])) >= 2:
        score = silhouette_score(X[mask], labels[mask])
        if score > best_score:
            best_score = score
            best_params = {"eps": eps, "min_samples": min_samples}

# Output best combination (best_params stays empty if no setting gave two clusters)
print("Best Params:", best_params)
print("Best Silhouette Score:", round(best_score, 3))


##### Which hyperparameter optimization technique have you used and why?

Manual grid search using combinations of eps and min_samples, since DBSCAN is sensitive to these and visual intuition helps.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, tuning eps and min_samples improved the cluster quality and reduced noise, giving a better silhouette score.

**Silhouette Score:** Helps assess how well DBSCAN clusters dense areas while ignoring noise.

**Business Impact**: Useful for identifying outliers (one-time buyers or frauds) and understanding dense clusters (loyal or frequent buyers).
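
As a small illustration of how DBSCAN separates dense groups from outliers, the sketch below uses synthetic points (two tight groups plus one isolated point, all made up). DBSCAN assigns the label -1 to points it treats as noise, which is how unusual customers could be picked out of the fitted labels.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (assumption): two dense groups and one isolated point
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.05, size=(30, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.05, size=(30, 2))
outlier = np.array([[10.0, 10.0]])
X_toy = np.vstack([group_a, group_b, outlier])

labels_toy = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_toy)

# Real clusters are the distinct labels other than -1 (noise)
n_clusters = len(set(labels_toy) - {-1})
n_noise = int((labels_toy == -1).sum())
print("clusters:", n_clusters)                      # 2
print("noise points:", n_noise)                     # 1
print("label of isolated point:", labels_toy[-1])   # -1
```

Whether a noise point is a one-time buyer, a data error, or something else cannot be told from the label alone; it only marks the customer as sitting outside every dense region.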



### 1. Which Evaluation metrics did you consider for a positive business impact and why?

We considered the following metrics for evaluating clustering performance:

1. Silhouette Score
   - Measures how well-separated the clusters are.
   - A higher score indicates that the clusters are dense and well-separated, which is ideal for actionable segmentation.
   - Business impact: Helps ensure clear and interpretable customer segments, crucial for personalized marketing and customer strategy.

2. Davies-Bouldin Index (optional)
   - Measures the average similarity between each cluster and its most similar cluster, comparing within-cluster scatter to between-cluster separation (lower is better).
   - Can be used to double-check silhouette findings.
   - Business impact: Lower values help ensure less overlap among customer segments, reducing marketing budget leakage.

**Silhouette Score was prioritized as it's intuitive and interpretable in a customer segmentation context.**
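
The two metrics can be computed side by side for each candidate number of clusters. The sketch below runs on synthetic blobs (an assumption, used so the example is self-contained); with the notebook's data, the scaled RFM features would take the place of `X_demo`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in data (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

results = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    results[k] = (
        silhouette_score(X_demo, labels_k),      # higher is better
        davies_bouldin_score(X_demo, labels_k),  # lower is better
    )

for k, (sil, dbi) in results.items():
    print(f"k={k}: silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```

When both metrics favor the same k, the choice is easier to defend; when they disagree, the difference is worth examining before settling on a segmentation.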

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We chose ML Model 1: KMeans Clustering as the final model.

Reasons:
Achieved highest silhouette score (~0.41) after tuning.

Produced balanced and well-separated clusters.

KMeans is computationally efficient and works well with standardized numerical features like Recency, Frequency, and Monetary (RFM) metrics.

Resulting clusters were easy to interpret and actionable for the business.

**KMeans gave the most stable and explainable clustering, ideal for creating targeted business strategies.**

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Model Used: KMeans Clustering
It’s an unsupervised learning algorithm that partitions data into k distinct clusters based on feature similarity.

Each data point is assigned to the cluster with the nearest mean.
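
The assignment rule can be verified directly: computing the distance from every point to every fitted centroid and taking the nearest one reproduces the model's own predictions. The sketch below uses synthetic blobs (an assumption) and hypothetical variable names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated groups (assumption)
X_toy, _ = make_blobs(
    n_samples=150,
    centers=[(-6, 0), (0, 6), (6, 0)],
    cluster_std=0.7,
    random_state=0,
)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# distances[i, j] = Euclidean distance from point i to centroid j
distances = np.linalg.norm(
    X_toy[:, None, :] - km.cluster_centers_[None, :, :], axis=2
)
manual_labels = distances.argmin(axis=1)

# The nearest-centroid rule matches the model's predictions
print(np.array_equal(manual_labels, km.predict(X_toy)))  # True
```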

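**Feature Importance with a Surrogate Model (Permutation Importance)**
KMeans has no built-in feature importance, and SHAP is typically used for supervised models. Instead, we estimate feature influence by training a Random Forest classifier to predict the cluster labels and computing permutation importance on it: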

In [None]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Keep only the scaled RFM features, so earlier cluster labels are not used as an input
feature_cols = ['Recency', 'Frequency', 'Monetary']
rfm_scaled_df = rfm_scaled_df[feature_cols].copy()

# Fit KMeans model
kmeans = KMeans(n_clusters=4, random_state=42)
cluster_labels = kmeans.fit_predict(rfm_scaled_df)

# Define features and target (the cluster labels are the target of the surrogate model)
X = rfm_scaled_df
y = pd.Series(cluster_labels, index=X.index, name='Cluster')

# Fit a Random Forest model (supervised) to predict the cluster labels
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Compute permutation importance
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=42)

# Display feature importances
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': perm.importances_mean
}).sort_values(by="Importance", ascending=False)

print(importance_df)

**Interpretation:**
Monetary and Frequency features were most influential in determining clusters.

This insight helps Myntra prioritize high-value frequent shoppers for loyalty or retention campaigns.

**Business Impact:** Feature importance helps identify which customer traits matter most — guiding personalized offers, stock prioritization, and communication tone for each customer group.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
import joblib

# Save model
joblib.dump(kmeans, 'kmeans_model.joblib')


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Step 1: Load the saved model (joblib was imported in the previous cell)
loaded_model = joblib.load('kmeans_model.joblib')

# Step 2: Prepare input data with the same feature columns the model was fitted on.
# These rows come from the training data, so this only checks the save/load
# round trip; it is not a test on genuinely unseen customers.
model_features = list(loaded_model.feature_names_in_)
unseen_data = rfm_scaled_df[model_features].iloc[:5]

# Step 3: Predict using the loaded model
predicted_clusters = loaded_model.predict(unseen_data)

# Step 4: Display results
print("Predicted Cluster Labels:", predicted_clusters)
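
The saved KMeans model expects scaled inputs, so genuinely new customers would need the same scaling that was fitted on the training data. One possible approach, sketched below on synthetic data, is to bundle the scaler and the clusterer in a scikit-learn pipeline and persist them together. The file name `rfm_segmenter.joblib`, the variable names, and the sample rows are all hypothetical, and this is not what the notebook above does.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled RFM table (assumption, for illustration only)
rng = np.random.default_rng(42)
raw_rfm = pd.DataFrame({
    "Recency": rng.integers(1, 365, 200),
    "Frequency": rng.integers(1, 50, 200),
    "Monetary": rng.gamma(2.0, 500.0, 200),
})

# Scaler and clusterer are fitted and saved as one object
segmenter = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=42),
)
segmenter.fit(raw_rfm)

# Hypothetical file name, written to a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "rfm_segmenter.joblib")
joblib.dump(segmenter, model_path)
loaded_segmenter = joblib.load(model_path)

# New customers are passed in raw units; the pipeline applies the stored scaling
new_customers = pd.DataFrame({
    "Recency": [5, 300],
    "Frequency": [40, 1],
    "Monetary": [5000.0, 30.0],
})
new_clusters = loaded_segmenter.predict(new_customers)
print("Predicted Cluster Labels:", new_clusters)
```

Keeping the scaler inside the saved object avoids the risk of scoring new customers with different scaling parameters than the ones used at training time.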

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we successfully implemented an end-to-end customer segmentation solution for Myntra using unsupervised machine learning techniques. Our approach began with a deep exploration of customer purchase data through descriptive statistics, visualization, and hypothesis testing, allowing us to extract key behavioral insights.

We applied feature engineering, handled missing values and outliers, and carefully prepared the dataset through scaling and dimensionality reduction. This ensured that the models received clean, meaningful input.

We then implemented and evaluated three clustering algorithms — KMeans, Hierarchical Clustering, and DBSCAN. Using metrics like the Silhouette Score and Davies-Bouldin Index, we compared model performances. Among these, KMeans provided the most stable and interpretable clusters.

Hyperparameter tuning and model validation further improved the clustering performance, helping us derive well-separated customer segments that can be used to:

- Identify high-value customers

- Target specific customer groups with personalized marketing

- Improve customer retention and engagement

- Support seasonal campaign planning based on spending trends

We also incorporated hypothesis testing to validate assumptions around regional spending patterns, seasonal behavior, and product pricing sensitivity, providing strong statistical backing for strategic business decisions.

Finally, the best model was saved with joblib and reloaded for a sanity-check prediction, as a first step toward deployment.

###  **Business Impact**

The customer segmentation solution developed in this project has significant implications for Myntra's business growth and marketing efficiency. By leveraging unsupervised machine learning to group customers based on purchase patterns and behavior, Myntra can now:

1. **Personalize Marketing Campaigns**  
   - Each customer segment can be targeted with tailored promotions, discounts, and product recommendations, leading to increased **click-through rates**, **conversion rates**, and **customer satisfaction**.

2. **Improve Customer Retention**  
   - By identifying **loyal, high-value customers**, Myntra can invest in exclusive rewards and retention strategies to increase **lifetime value** and reduce churn.

3. **Optimize Inventory and Product Planning**  
   - Segment-based demand forecasting enables better planning of stock levels, especially during peak seasons, minimizing **overstocking** or **stockouts**.

4. **Strategize Seasonal Campaigns**  
   - Insights from hypothesis testing, such as increased spending during the holiday season, help allocate marketing budgets more effectively to **maximize festive revenue**.

5. **Enhance Customer Experience**  
   - Understanding behavior at a granular level allows Myntra to improve UX, recommend relevant products, and streamline user journeys — all driving better **engagement**.

6. **Enable Data-Driven Decision Making**  
   - The clusters and insights generated can be integrated into dashboards and CRM systems, empowering teams across sales, marketing, and product with **actionable intelligence**.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***