# Capstone Project 3: Customer Segmentation (Clustering)

---

## Learning Objectives

By completing this project you will be able to:

- Frame a business problem as an unsupervised learning task
- Apply feature scaling as a prerequisite for distance-based algorithms
- Use PCA for dimensionality reduction and visualization
- Determine the optimal number of clusters using the elbow method and silhouette scores
- Implement and compare KMeans, Hierarchical Clustering, and DBSCAN
- Profile and interpret clusters for actionable business recommendations

## Prerequisites

- Python 3.8+
- Libraries: numpy, pandas, matplotlib, seaborn, scikit-learn, scipy
- Familiarity with clustering concepts and distance metrics

## Table of Contents

1. [Problem Statement & Business Context](#1)
2. [Data Generation](#2)
3. [Exploratory Data Analysis](#3)
4. [Preprocessing](#4)
5. [Dimensionality Reduction (PCA)](#5)
6. [KMeans Clustering](#6)
7. [Cluster Visualization](#7)
8. [Cluster Profiling](#8)
9. [Hierarchical Clustering](#9)
10. [DBSCAN](#10)
11. [Business Interpretation](#11)
12. [Conclusions and Marketing Recommendations](#12)

<a id="1"></a>
## 1. Problem Statement & Business Context

**Scenario:** An e-commerce company wants to move away from one-size-fits-all marketing. The marketing team believes different customer segments exist and wants data-driven groups to target with tailored campaigns, personalized offers, and differentiated communication strategies.

**Goal:** Identify natural customer segments based on purchasing behavior and demographics. For each segment, provide a profile that the marketing team can use to design targeted campaigns.

**Why Clustering?** There are no predefined labels. We do not know how many groups exist or what they look like. This is a classic unsupervised learning problem where the algorithm must discover structure in the data.

<a id="2"></a>
## 2. Data Generation

We create a synthetic dataset of 500 customers with 7 behavioral and demographic features. The data is designed to have natural cluster structure.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

np.random.seed(42)

# We create 4 natural clusters by mixing distributions
# Cluster 1: Young, high income, high spending ("Affluent Millennials")
c1 = 130
# Cluster 2: Middle-aged, moderate income, moderate spending ("Steady Middle Class")
c2 = 150
# Cluster 3: Older, high income, low frequency ("Wealthy Infrequent")
c3 = 100
# Cluster 4: Young, low income, budget shoppers ("Budget Conscious")
c4 = 120

def make_cluster(n, age_mu, age_std, income_mu, income_std, spend_mu, spend_std,
                 freq_mu, freq_std, recency_mu, recency_std, products_mu, products_std,
                 clv_mu, clv_std):
    return pd.DataFrame({
        "age": np.random.normal(age_mu, age_std, n).clip(18, 80).astype(int),
        "annual_income": np.random.normal(income_mu, income_std, n).clip(15000, 250000).round(0),
        "spending_score": np.random.normal(spend_mu, spend_std, n).clip(1, 100).round(0),
        "avg_purchase_frequency": np.random.normal(freq_mu, freq_std, n).clip(0.5, 30).round(1),
        "days_since_last_purchase": np.random.normal(recency_mu, recency_std, n).clip(1, 365).astype(int),
        "num_products_bought": np.random.normal(products_mu, products_std, n).clip(1, 100).astype(int),
        "customer_lifetime_value": np.random.normal(clv_mu, clv_std, n).clip(100, 50000).round(0),
    })

#                          age     income     spend     freq     recency   products   CLV
df1 = make_cluster(c1,    28, 5,  85000, 15000,  80, 10,   12, 3,   15, 10,   40, 12,   8000, 2000)
df2 = make_cluster(c2,    42, 8,  55000, 12000,  50, 12,    6, 2,   30, 15,   20, 8,    4000, 1500)
df3 = make_cluster(c3,    58, 7,  95000, 20000,  30, 10,    2, 1,  100, 40,   10, 5,    6000, 2500)
df4 = make_cluster(c4,    25, 5,  30000, 8000,   65, 12,    8, 3,   20, 10,   25, 8,    1500, 600)

df = pd.concat([df1, df2, df3, df4], ignore_index=True)
# Shuffle so clusters are not in order
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Dataset shape: {df.shape}")
df.head(10)

In [None]:
df.describe().round(1)

In [None]:
df.info()

<a id="3"></a>
## 3. Exploratory Data Analysis

In [None]:
# Feature distributions
fig, axes = plt.subplots(2, 4, figsize=(20, 8))
features = df.columns.tolist()

for ax, col in zip(axes.ravel(), features):
    ax.hist(df[col], bins=30, edgecolor="black", alpha=0.7, color="teal")
    ax.set_title(col, fontsize=11)

# Hide the extra subplot
axes[1, 3].set_visible(False)

plt.suptitle("Feature Distributions", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(9, 7))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm", center=0, square=True)
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.show()

In [None]:
# Pairplot of four key features (all 500 rows; no subsampling is needed at this size)
key_features = ["annual_income", "spending_score", "avg_purchase_frequency", "customer_lifetime_value"]
sns.pairplot(df[key_features], diag_kind="hist", plot_kws={"alpha": 0.4, "s": 15})
plt.suptitle("Pairplot of Key Features", y=1.02)
plt.show()

<a id="4"></a>
## 4. Preprocessing

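Clustering algorithms based on distance (KMeans, Hierarchical, DBSCAN) are sensitive to feature scales: without scaling, `annual_income` (tens of thousands) would dominate features such as `spending_score` (1 to 100) in every distance computation. We standardize all features to zero mean and unit variance.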

In [None]:
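from sklearn.preprocessing import StandardScaler

# Scale only the original features, so re-running this cell after cluster
# label columns have been added to df does not pull them into X_scaled
label_cols = ["kmeans_cluster", "hierarchical_cluster", "dbscan_cluster", "segment_name"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop(columns=label_cols, errors="ignore"))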

print(f"Scaled data shape: {X_scaled.shape}")
print(f"Mean of each feature (should be ~0): {X_scaled.mean(axis=0).round(4)}")
print(f"Std of each feature (should be ~1):  {X_scaled.std(axis=0).round(4)}")
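
To see why scaling matters for distance-based methods, here is a minimal sketch on made-up numbers (the five customers and the helper `dist` are illustrative assumptions, not part of the project data). On raw values, customer A looks closest to B, who has a similar income but a very different age. After standardization, A is closest to C, who is similar on both features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, annual_income]
toy = np.array([
    [25.0, 50_000.0],  # A
    [60.0, 50_500.0],  # B: very different age, almost the same income
    [26.0, 52_000.0],  # C: almost the same age, slightly higher income
    [45.0, 90_000.0],  # D
    [35.0, 20_000.0],  # E
])

def dist(x, i, j):
    """Euclidean distance between rows i and j of x."""
    return float(np.linalg.norm(x[i] - x[j]))

# Raw distances are dominated by income, the feature with the largest numbers
raw_ab, raw_ac = dist(toy, 0, 1), dist(toy, 0, 2)

# After standardization both features contribute on a comparable scale
toy_scaled = StandardScaler().fit_transform(toy)
scaled_ab, scaled_ac = dist(toy_scaled, 0, 1), dist(toy_scaled, 0, 2)

print(f"raw:    d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")
print(f"scaled: d(A,B)={scaled_ab:.3f}  d(A,C)={scaled_ac:.3f}")
```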

<a id="5"></a>
## 5. Dimensionality Reduction (PCA)

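We reduce the 7 features to 2 principal components for visualization. PCA finds the orthogonal directions of maximum variance, so the 2D plot keeps as much of the spread in the data as two axes can. The clustering in the following sections still runs on all 7 scaled features; PCA is used only for plotting.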

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance ratio: {pca.explained_variance_ratio_.round(4)}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# Full PCA for cumulative variance
pca_full = PCA(random_state=42).fit(X_scaled)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scree plot
axes[0].bar(range(1, len(pca_full.explained_variance_ratio_) + 1),
            pca_full.explained_variance_ratio_, color="steelblue", edgecolor="black")
axes[0].set_xlabel("Principal Component")
axes[0].set_ylabel("Explained Variance Ratio")
axes[0].set_title("Scree Plot")

# Cumulative variance
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
axes[1].plot(range(1, len(cumvar) + 1), cumvar, "o-", color="steelblue")
axes[1].axhline(y=0.90, color="red", linestyle="--", label="90% threshold")
axes[1].set_xlabel("Number of Components")
axes[1].set_ylabel("Cumulative Explained Variance")
axes[1].set_title("Cumulative Explained Variance")
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# 2D scatter plot before clustering
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5, s=20, color="gray")
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
plt.title("2D PCA Projection (No Clusters Yet)")
plt.tight_layout()
plt.show()
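
The principal components are easier to interpret once you look at their loadings, that is, the weight each original feature receives in each component. The sketch below uses a small made-up dataset (the three columns and their distributions are assumptions chosen for illustration) in which income and lifetime value are strongly correlated, so they should share the first component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Hypothetical data: CLV is built from income, age is independent
income = rng.normal(60_000, 15_000, n)
demo = pd.DataFrame({
    "annual_income": income,
    "customer_lifetime_value": 0.1 * income + rng.normal(0, 500, n),
    "age": rng.normal(40, 10, n),
})

demo_scaled = StandardScaler().fit_transform(demo)
demo_pca = PCA(n_components=2).fit(demo_scaled)

# Rows = original features, columns = components; each column is a unit vector
loadings = pd.DataFrame(
    demo_pca.components_.T,
    index=demo.columns,
    columns=["PC1", "PC2"],
)
print(loadings.round(2))
print(demo_pca.explained_variance_ratio_.round(3))
```

The same `pd.DataFrame(pca.components_.T, ...)` pattern can be applied to the `pca` object fitted above to see which customer features drive PC1 and PC2.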

<a id="6"></a>
## 6. KMeans Clustering

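We use the elbow method and silhouette scores to determine the optimal number of clusters. Inertia always decreases as k grows, so we look for the elbow where the improvement levels off. The silhouette score ranges from -1 to 1 (higher means tighter, better-separated clusters), and we select the k that maximizes it.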

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_range = range(2, 11)
inertias = []
silhouette_scores = []

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow plot
axes[0].plot(k_range, inertias, "o-", color="steelblue", linewidth=2)
axes[0].set_xlabel("Number of Clusters (k)")
axes[0].set_ylabel("Inertia (Within-Cluster Sum of Squares)")
axes[0].set_title("Elbow Method")
axes[0].set_xticks(list(k_range))

# Silhouette plot
axes[1].plot(k_range, silhouette_scores, "o-", color="coral", linewidth=2)
axes[1].set_xlabel("Number of Clusters (k)")
axes[1].set_ylabel("Silhouette Score")
axes[1].set_title("Silhouette Score vs k")
axes[1].set_xticks(list(k_range))

plt.tight_layout()
plt.show()

best_k = k_range[np.argmax(silhouette_scores)]
print(f"Best k by silhouette score: {best_k} (score = {max(silhouette_scores):.4f})")
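
As a sanity check on the method, the following sketch runs the same silhouette-based selection on toy data where the true number of groups is known (four 2D blobs from `make_blobs`; the blob centers are an assumption for illustration). It also uses `silhouette_samples` to average the score within each cluster, which can reveal a single weak cluster that the overall mean hides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data (not the customer dataset): four well-separated 2D blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-6, -6], [-6, 6], [6, -6], [6, 6]],
    cluster_std=1.0,
    random_state=0,
)

# Mean silhouette for each candidate k
demo_scores = {}
for k in range(2, 8):
    demo_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    demo_scores[k] = silhouette_score(X_demo, demo_labels)
demo_best_k = max(demo_scores, key=demo_scores.get)

# Per-sample silhouettes at the chosen k, averaged within each cluster
demo_labels = KMeans(n_clusters=demo_best_k, n_init=10, random_state=0).fit_predict(X_demo)
per_sample = silhouette_samples(X_demo, demo_labels)
per_cluster = {
    int(c): float(per_sample[demo_labels == c].mean())
    for c in np.unique(demo_labels)
}

print(f"Selected k: {demo_best_k}")
print(per_cluster)
```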

In [None]:
# Fit final KMeans with optimal k
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)

df["kmeans_cluster"] = kmeans_labels

print(f"KMeans cluster sizes (k={best_k}):")
print(df["kmeans_cluster"].value_counts().sort_index())

<a id="7"></a>
## 7. Cluster Visualization

In [None]:
# 2D PCA scatter colored by KMeans cluster
fig, ax = plt.subplots(figsize=(10, 7))

scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels,
                     cmap="viridis", alpha=0.6, s=30, edgecolors="k", linewidths=0.3)

# Plot cluster centers projected onto PCA space
centers_pca = pca.transform(kmeans.cluster_centers_)
ax.scatter(centers_pca[:, 0], centers_pca[:, 1], c="red", marker="X",
           s=200, edgecolors="black", linewidths=2, label="Centroids")

ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
ax.set_title(f"KMeans Clusters (k={best_k}) in PCA Space")
ax.legend()
plt.colorbar(scatter, label="Cluster")
plt.tight_layout()
plt.show()

In [None]:
# Feature-space scatter matrices colored by cluster
plot_features = ["annual_income", "spending_score", "avg_purchase_frequency", "customer_lifetime_value"]

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

for ax, (i, j) in zip(axes.ravel(), pairs):
    for cluster_id in range(best_k):
        mask = kmeans_labels == cluster_id
        ax.scatter(df.loc[mask, plot_features[i]], df.loc[mask, plot_features[j]],
                   alpha=0.5, s=20, label=f"Cluster {cluster_id}")
    ax.set_xlabel(plot_features[i])
    ax.set_ylabel(plot_features[j])
    ax.legend(fontsize=7)

plt.suptitle("Cluster Assignments in Feature Space", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

<a id="8"></a>
## 8. Cluster Profiling

We compute the mean of each feature per cluster to understand what makes each segment unique.

In [None]:
feature_cols = ["age", "annual_income", "spending_score", "avg_purchase_frequency",
                "days_since_last_purchase", "num_products_bought", "customer_lifetime_value"]

cluster_profile = df.groupby("kmeans_cluster")[feature_cols].mean().round(1)
cluster_profile["count"] = df.groupby("kmeans_cluster").size()
cluster_profile
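
Raw cluster means are hard to compare across features with different units. One option, sketched below on a tiny made-up table (the four rows and the `cluster` column are illustrative assumptions), is to express each cluster mean as a z-score relative to the whole population, so that positive values mean "above average" and negative values mean "below average".

```python
import pandas as pd

# Hypothetical customers already assigned to two clusters
toy = pd.DataFrame({
    "annual_income":  [30_000, 32_000, 90_000, 95_000],
    "spending_score": [70, 60, 20, 30],
    "cluster":        [0, 0, 1, 1],
})
cols = ["annual_income", "spending_score"]

# Mean of each feature per cluster, in original units
cluster_means = toy.groupby("cluster")[cols].mean()

# Same means expressed in population standard deviations from the overall mean
relative = (cluster_means - toy[cols].mean()) / toy[cols].std(ddof=0)

print(cluster_means)
print(relative.round(2))
```

Here cluster 0 comes out below average on income and above average on spending, which is the kind of contrast that makes a segment easy to name.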

In [None]:
# Heatmap of min-max normalized cluster profiles
profile_normalized = cluster_profile[feature_cols].copy()
for col in feature_cols:
    col_min = profile_normalized[col].min()
    col_max = profile_normalized[col].max()
    if col_max > col_min:
        profile_normalized[col] = (profile_normalized[col] - col_min) / (col_max - col_min)
    else:
        profile_normalized[col] = 0.5

plt.figure(figsize=(12, 5))
sns.heatmap(profile_normalized, annot=cluster_profile[feature_cols].values, fmt=".1f",
            cmap="YlOrRd", xticklabels=feature_cols,
            yticklabels=[f"Cluster {i}" for i in range(best_k)])
plt.title("Cluster Profiles (color = normalized, numbers = actual means)")
plt.tight_layout()
plt.show()

In [None]:
# Box plots by cluster for each feature
fig, axes = plt.subplots(2, 4, figsize=(20, 8))
for ax, col in zip(axes.ravel(), feature_cols):
    df.boxplot(column=col, by="kmeans_cluster", ax=ax)
    ax.set_title(col)
    ax.set_xlabel("Cluster")

axes[1, 3].set_visible(False)
plt.suptitle("Feature Distributions by Cluster", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

<a id="9"></a>
## 9. Hierarchical Clustering

We compare KMeans with Agglomerative (Hierarchical) Clustering. A dendrogram helps visualize how clusters merge.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Compute linkage matrix on a subsample for a cleaner dendrogram
subsample_idx = np.random.choice(len(X_scaled), size=100, replace=False)
linkage_matrix = linkage(X_scaled[subsample_idx], method="ward")

plt.figure(figsize=(16, 6))
dendrogram(linkage_matrix, truncate_mode="level", p=5, leaf_font_size=8, color_threshold=10)
plt.title("Dendrogram (Ward Linkage, 100-sample subset)")
plt.xlabel("Sample Index or (Cluster Size)")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
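
A dendrogram is a picture of the linkage matrix, and the same matrix can be cut into flat cluster labels with SciPy's `fcluster`. The sketch below uses three synthetic 2D groups (an assumption for illustration, not the customer data) and shows that a large jump in merge height marks the point where distinct groups start being merged.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Toy data: three tight groups of 20 points each
pts = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(20, 2)),
])

# Each row of Z records one merge: [cluster_a, cluster_b, height, new_size]
Z = linkage(pts, method="ward")

# Cut the tree so that at most 3 flat clusters remain
flat = fcluster(Z, t=3, criterion="maxclust")

# The last two merges join different groups, so their heights are much larger
# than the final merge inside a group
heights = Z[:, 2]
print("Last three merge heights:", heights[-3:].round(2))
print("Flat labels:", sorted(set(flat.tolist())))
```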

In [None]:
# Fit Agglomerative Clustering with same k as KMeans
agg = AgglomerativeClustering(n_clusters=best_k, linkage="ward")
agg_labels = agg.fit_predict(X_scaled)

df["hierarchical_cluster"] = agg_labels

# Compare with KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

ari = adjusted_rand_score(kmeans_labels, agg_labels)
nmi = normalized_mutual_info_score(kmeans_labels, agg_labels)
sil_agg = silhouette_score(X_scaled, agg_labels)

print(f"Hierarchical Clustering (k={best_k}):")
print(f"  Silhouette Score: {sil_agg:.4f}")
print(f"\nAgreement with KMeans:")
print(f"  Adjusted Rand Index: {ari:.4f}")
print(f"  Normalized Mutual Information: {nmi:.4f}")
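
The Adjusted Rand Index compares partitions rather than label values, which matters here because KMeans and Agglomerative Clustering number their clusters arbitrarily. A minimal sketch with hand-written labels (the three lists are made up for illustration):

```python
from sklearn.metrics import adjusted_rand_score

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]  # same grouping as labels_a, different cluster ids
labels_c = [0, 1, 0, 1, 0, 1]  # splits every group of labels_a in half

# Identical partitions score 1.0 even though no label value matches
ari_same = adjusted_rand_score(labels_a, labels_b)

# Partitions that agree less than chance would can score below 0
ari_diff = adjusted_rand_score(labels_a, labels_c)

print(f"Same partition, renamed ids: {ari_same:.3f}")
print(f"Conflicting partition:       {ari_diff:.3f}")
```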

In [None]:
# Side-by-side PCA visualizations
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for ax, (labels, title) in zip(axes, [(kmeans_labels, "KMeans"), (agg_labels, "Hierarchical")]):
    scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap="viridis",
                         alpha=0.6, s=25, edgecolors="k", linewidths=0.3)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_title(f"{title} Clusters")
    plt.colorbar(scatter, ax=ax, label="Cluster")

plt.tight_layout()
plt.show()

<a id="10"></a>
## 10. DBSCAN

DBSCAN is a density-based algorithm that can find arbitrarily shaped clusters and automatically identifies outliers (label = -1). It does not require specifying k, but needs `eps` (neighborhood radius) and `min_samples` parameters.

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

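# Use a k-distance plot to estimate eps.
# Querying the training points returns each point as its own nearest neighbor
# (distance 0), so the last column is the distance to the (k_neighbors - 1)-th
# other point. This matches DBSCAN's min_samples, which also counts the point itself.
k_neighbors = 10
nn = NearestNeighbors(n_neighbors=k_neighbors)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])

plt.figure(figsize=(10, 5))
plt.plot(k_distances, color="steelblue")
plt.xlabel("Points (sorted by distance)")
plt.ylabel(f"Distance to {k_neighbors}-th Neighbor (incl. self)")
plt.title(f"k-Distance Plot (k={k_neighbors}) for eps Estimation")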
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Fit DBSCAN with eps read off the elbow region of the k-distance plot.
# The value below was chosen for this dataset; re-check the plot if the data change.
eps_value = 1.8
min_samples_value = 10

dbscan = DBSCAN(eps=eps_value, min_samples=min_samples_value)
dbscan_labels = dbscan.fit_predict(X_scaled)

df["dbscan_cluster"] = dbscan_labels

n_clusters_db = len(set(dbscan_labels) - {-1})
n_noise = (dbscan_labels == -1).sum()

print(f"DBSCAN Results (eps={eps_value}, min_samples={min_samples_value}):")
print(f"  Clusters found: {n_clusters_db}")
print(f"  Noise points:   {n_noise} ({n_noise/len(df):.1%})")
print(f"\nCluster sizes:")
print(df["dbscan_cluster"].value_counts().sort_index())

if n_clusters_db >= 2:
    non_noise_mask = dbscan_labels != -1
    sil_db = silhouette_score(X_scaled[non_noise_mask], dbscan_labels[non_noise_mask])
    print(f"\n  Silhouette Score (excl. noise): {sil_db:.4f}")
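
To illustrate what "arbitrarily shaped clusters" and "noise" mean in practice, here is a sketch on the two-moons toy dataset with one added outlier (the dataset, the outlier position, and the `eps`/`min_samples` values are assumptions for illustration and do not transfer to the customer data). KMeans cuts the moons with a straight boundary, while DBSCAN follows the density and labels the isolated point as -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles plus one point far away from both
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)
X_demo = np.vstack([X_moons, [[3.0, 3.0]]])

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_demo)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

n_db_clusters = len(set(db_labels.tolist()) - {-1})
outlier_label = int(db_labels[-1])

# Agreement with the true moon labels (outlier excluded; for DBSCAN,
# points it marked as noise are excluded as well)
kept = db_labels[:-1] != -1
ari_db = adjusted_rand_score(y_moons[kept], db_labels[:-1][kept])
ari_km = adjusted_rand_score(y_moons, km_labels[:-1])

print(f"DBSCAN clusters: {n_db_clusters}, outlier label: {outlier_label}")
print(f"ARI vs. true moons: DBSCAN={ari_db:.3f}, KMeans={ari_km:.3f}")
```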

In [None]:
# Visualize DBSCAN results
plt.figure(figsize=(10, 7))

unique_labels = set(dbscan_labels)
colors = plt.cm.viridis(np.linspace(0, 1, len(unique_labels)))

for label, color in zip(sorted(unique_labels), colors):
    mask = dbscan_labels == label
    if label == -1:
        plt.scatter(X_pca[mask, 0], X_pca[mask, 1], c="gray", alpha=0.3,
                    s=15, marker="x", label="Noise")
    else:
        plt.scatter(X_pca[mask, 0], X_pca[mask, 1], c=[color], alpha=0.6,
                    s=25, edgecolors="k", linewidths=0.3, label=f"Cluster {label}")

plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title(f"DBSCAN Clusters (eps={eps_value}, min_samples={min_samples_value})")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
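# Comparison summary. DBSCAN's silhouette is computed without noise points,
# so it is not strictly comparable to the other two scores.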
sil_kmeans = silhouette_score(X_scaled, kmeans_labels)
sil_hier = silhouette_score(X_scaled, agg_labels)

comparison = pd.DataFrame({
    "Algorithm": ["KMeans", "Hierarchical", "DBSCAN"],
    "Clusters Found": [best_k, best_k, n_clusters_db],
    "Noise Points": [0, 0, n_noise],
    "Silhouette Score": [sil_kmeans, sil_hier,
                         sil_db if n_clusters_db >= 2 else float("nan")],
})
comparison

<a id="11"></a>
## 11. Business Interpretation

Based on the KMeans cluster profiles (our primary segmentation), we assign descriptive segment names.

In [None]:
# Review cluster profiles and assign names
print("Cluster Profiles for Naming:")
print("=" * 80)
for cluster_id in range(best_k):
    profile = cluster_profile.loc[cluster_id]
    print(f"\nCluster {cluster_id} (n={int(profile['count'])}):")
    print(f"  Age:              {profile['age']:.0f}")
    print(f"  Annual Income:    ${profile['annual_income']:,.0f}")
    print(f"  Spending Score:   {profile['spending_score']:.0f}")
    print(f"  Purchase Freq:    {profile['avg_purchase_frequency']:.1f}/month")
    print(f"  Days Since Last:  {profile['days_since_last_purchase']:.0f}")
    print(f"  Products Bought:  {profile['num_products_bought']:.0f}")
    print(f"  Lifetime Value:   ${profile['customer_lifetime_value']:,.0f}")

In [None]:
# Name segments with simple threshold rules on each cluster's mean profile.
# The thresholds assume the four-group structure of the synthetic data; a
# cluster that matches no rule falls back to a generic "Segment <id>" label.
segment_profiles = cluster_profile[feature_cols].copy()

segment_names = {}

for cluster_id in range(best_k):
    p = segment_profiles.loc[cluster_id]
    if p["spending_score"] >= 70 and p["annual_income"] >= 70000:
        segment_names[cluster_id] = "High-Value Loyalists"
    elif p["spending_score"] >= 55 and p["annual_income"] < 40000:
        segment_names[cluster_id] = "Budget-Conscious Shoppers"
    elif p["days_since_last_purchase"] >= 60 and p["annual_income"] >= 70000:
        segment_names[cluster_id] = "Dormant High-Earners"
    elif p["avg_purchase_frequency"] >= 5 and p["spending_score"] >= 40:
        segment_names[cluster_id] = "Steady Regulars"
    else:
        segment_names[cluster_id] = f"Segment {cluster_id}"

df["segment_name"] = df["kmeans_cluster"].map(segment_names)

print("Segment Names:")
for cid, name in sorted(segment_names.items()):
    count = (df["kmeans_cluster"] == cid).sum()
    print(f"  Cluster {cid}: {name} (n={count})")

In [None]:
# Final visualization with segment names
plt.figure(figsize=(10, 7))
for cluster_id in range(best_k):
    mask = kmeans_labels == cluster_id
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], alpha=0.6, s=30,
                edgecolors="k", linewidths=0.3, label=segment_names[cluster_id])

plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
plt.title("Customer Segments in PCA Space")
plt.legend(fontsize=10, loc="best")
plt.tight_layout()
plt.show()

<a id="12"></a>
## 12. Conclusions and Marketing Recommendations

### Summary of Segments

| Segment | Profile | Recommended Strategy |
|---------|---------|---------------------|
| **High-Value Loyalists** | High income, high spending, frequent purchases | VIP programs, exclusive early access, loyalty rewards |
| **Steady Regulars** | Middle income, moderate spending, consistent activity | Cross-sell and upsell, bundle offers, referral incentives |
| **Dormant High-Earners** | High income but infrequent purchases and a long time since the last purchase | Win-back campaigns, personalized re-engagement emails, premium product showcases |
| **Budget-Conscious Shoppers** | Lower income, decent spending score, price-sensitive | Discount campaigns, value bundles, flash sales, free shipping offers |

### Key Findings

1. **KMeans** was the most effective algorithm for this dataset, producing clean, well-separated clusters with the highest silhouette score in the comparison table (the DBSCAN score excludes noise points, so it is not strictly comparable).
2. **Hierarchical Clustering** produced very similar results, validating the KMeans solution.
3. **DBSCAN** identified some outlier customers who may warrant individual attention.
4. The optimal number of segments by silhouette score is 4, matching the four groups built into the synthetic data, each with distinct behavioral patterns.

### Next Steps

1. **A/B test** segment-specific marketing campaigns to measure uplift.
2. **Build a scoring pipeline** to assign new customers to segments in real time.
3. **Monitor segment drift** over time and re-cluster quarterly.
4. **Enrich with additional data** (website behavior, email engagement, social media) for finer segmentation.
5. **Combine with predictive models** (e.g., churn prediction per segment) for prioritized outreach.
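
The scoring pipeline mentioned in the next steps could look roughly like the sketch below. It is one possible design under stated assumptions, not a production implementation: the training data are made up, only two features and two segments are used, and the names `segmenter` and `new_customers` are hypothetical. Wrapping the scaler and KMeans in one scikit-learn pipeline ensures new customers are scaled with the statistics learned at training time.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical training data: [annual_income, spending_score] for two groups
train = np.vstack([
    rng.normal([30_000, 65], [3_000, 5], size=(50, 2)),
    rng.normal([90_000, 30], [3_000, 5], size=(50, 2)),
])

# Scaling and clustering are fitted together and reused together
segmenter = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
segmenter.fit(train)

# Assign previously unseen customers to the nearest learned centroid
new_customers = np.array([
    [28_000, 70],  # resembles the first training group
    [95_000, 25],  # resembles the second training group
])
new_segments = segmenter.predict(new_customers)
print(new_segments)
```

KMeans suits this step because it exposes `predict` for unseen rows; scikit-learn's `AgglomerativeClustering` and `DBSCAN` do not, so they would need an extra classifier or nearest-neighbor lookup to label new customers.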