An unsupervised machine learning project that segments mall customers into meaningful groups using K-Means and DBSCAN clustering, helping businesses identify high-value customers and design targeted marketing strategies.
Businesses struggle to understand who their customers really are. Without customer segmentation:
- Marketing campaigns target everyone, which wastes budget
- No distinction between premium and budget customers
- Missed opportunity to retain high-value customers
Solution: Use clustering algorithms to automatically group customers by income, spending behavior, age, and gender. No labels are needed.
| Feature | Description |
|---|---|
| CustomerID | Unique customer identifier |
| Gender | Male / Female |
| Age | Customer age |
| Annual Income (k$) | Yearly income in thousands of dollars |
| Spending Score (1-100) | Mall-assigned score based on behavior |
Source: Mall Customers Dataset (Kaggle)
Raw Data (Mall_Customers.csv)
↓
Data Preprocessing
├── Handle missing values
├── Encode Gender (Label Encoding)
└── Feature Scaling (StandardScaler)
↓
Finding Optimal K
├── Elbow Method (WCSS)
└── Silhouette Score
↓
Clustering
├── K-Means → spherical clusters
└── DBSCAN → density-based + outlier detection
↓
Dimensionality Reduction
└── PCA (2D projection for visualization)
↓
Results
├── Cluster labels saved → mall_customers_with_clusters.csv
└── Business insights per segment
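The preprocessing stage of the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the sample rows are made up to match the column layout of `Mall_Customers.csv`, and dropping incomplete rows is only one assumed way to handle missing values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical rows in the shape of Mall_Customers.csv (not real data).
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [19, 21, 35, 45, 52, 64],
    "Annual Income (k$)": [15, 16, 40, 60, 88, 120],
    "Spending Score (1-100)": [39, 81, 50, 42, 90, 12],
})

# Handle missing values: here we simply drop incomplete rows (an assumption).
df = df.dropna()

# Encode Gender as integers. LabelEncoder sorts classes, so Female=0, Male=1.
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

# Scale the clustering features to zero mean and unit variance.
# CustomerID is an identifier, so it is left out of the feature matrix.
features = ["Gender", "Age", "Annual Income (k$)", "Spending Score (1-100)"]
X_scaled = StandardScaler().fit_transform(df[features])

print(X_scaled.shape)
```

Scaling matters here because K-Means and DBSCAN both rely on distances, so an unscaled income column would otherwise dominate the other features.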
| Cluster | Profile | Strategy |
|---|---|---|
| High Income, High Spending | Premium customers | VIP loyalty programs |
| Low Income, High Spending | Impulsive spenders | EMI offers, deals |
| High Income, Low Spending | Untapped potential | Targeted campaigns |
| Middle Income, Average Spending | Regular customers | Retention discounts |
| Older, Conservative | Low engagement | Senior programs |
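Segment profiles like the ones in the table are typically read off per-cluster averages. The sketch below shows one way to compute them with pandas; the rows are invented for illustration and the `Cluster` column name is an assumption about the layout of `mall_customers_with_clusters.csv`.

```python
import pandas as pd

# Hypothetical clustered output (made-up values; "Cluster" is an assumed name).
clustered = pd.DataFrame({
    "Age": [23, 31, 45, 52, 28, 60],
    "Annual Income (k$)": [85, 90, 20, 25, 88, 22],
    "Spending Score (1-100)": [82, 90, 15, 20, 12, 18],
    "Cluster": [0, 0, 1, 1, 2, 1],
})

# Summarize each cluster by size and average age, income, and spending.
profile = (
    clustered.groupby("Cluster")
    .agg(
        customers=("Age", "size"),
        mean_age=("Age", "mean"),
        mean_income=("Annual Income (k$)", "mean"),
        mean_spending=("Spending Score (1-100)", "mean"),
    )
    .round(1)
)

print(profile)
```

Comparing each row of `profile` against the overall averages is what turns a numeric cluster ID into a label such as "High Income, Low Spending".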
from sklearn.cluster import KMeans

# Elbow method: record WCSS (inertia) for K = 1..10
# n_init is set explicitly so results stay consistent across scikit-learn versions
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    wcss.append(km.inertia_)

# Final model with K = 5
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

from sklearn.cluster import DBSCAN
# Density-based: detects non-spherical clusters and outliers
db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X_scaled)
# A label of -1 means the point is noise (an outlier)

from sklearn.decomposition import PCA
# Project the multi-dimensional features to 2D for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

- Optimal K = 5 clusters (Elbow + Silhouette Score)
- K-Means works better for spherical, well-separated clusters
- DBSCAN identifies outlier customers automatically
- PCA confirms clear cluster separation in 2D space
- Results saved to mall_customers_with_clusters.csv
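The findings above cite the silhouette score alongside the elbow method. A minimal sketch of that check is shown below; it uses synthetic blobs from `make_blobs` as a stand-in for the scaled customer features, so the value it picks is not a result from this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features (an assumption).
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Silhouette needs at least 2 clusters, so K starts at 2 rather than 1.
scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, cluster_labels)

# The K with the highest mean silhouette is the candidate to compare
# against the elbow in the WCSS curve.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Unlike WCSS, which always decreases as K grows, the silhouette score peaks at a specific K, which makes it a useful cross-check when the elbow is ambiguous.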
| Layer | Technology |
|---|---|
| Language | Python 3.x |
| Data Processing | Pandas, NumPy |
| ML Algorithms | Scikit-learn (K-Means, DBSCAN, PCA) |
| Visualization | Matplotlib, Seaborn |
| Notebook | Jupyter Notebook |
# Clone the repo
git clone https://github.com/tashfeen786/CustomerSegmentation.git
cd CustomerSegmentation
# Install dependencies
pip install pandas numpy matplotlib seaborn scikit-learn jupyter
# Run notebook
jupyter notebook Task_02_Mall_Customers_Clustering_Project.ipynb

CustomerSegmentation/
│
├── Task_02_Mall_Customers_Clustering_Project.ipynb   # Main notebook
├── Task_02_Mall_Customers_Clustering_Project.pdf     # PDF export
├── Mall_Customers.csv                                # Raw dataset
├── mall_customers_with_clusters.csv                  # Clustered output
└── README.md
- Hierarchical Clustering: dendrogram visualization
- Plotly: interactive 3D cluster plots
- Streamlit dashboard: interactive segmentation tool
- RFM Analysis: Recency, Frequency, Monetary segmentation
- Real e-commerce dataset: more complex features
Tashfeen Aziz, AI/ML Engineer & Python Developer
If you found this project helpful, please give it a star!