# Week 8 - Unsupervised Learning: Clustering

**Course:** Applied ML Foundations for SaaS Analytics  
**Week Focus:** Discover natural customer segments without predefined labels.

---

## 🎯 Learning Objectives

- Supervised vs Unsupervised Learning
- K-Means clustering and optimal K selection
- Cluster profiling and business interpretation
- Compare K-Means, DBSCAN, and hierarchical approaches

In [None]:
import pandas as pd, numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Load data
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date', 'churn_date'])
feature_usage = pd.read_csv('../data/feature_usage.csv')
user_events = pd.read_csv('../data/user_events.csv')

print(f"Customers: {len(subs):,} | Features: {len(feature_usage):,} | Events: {len(user_events):,}")

## Part 1: Feature Engineering for Clustering

In [None]:
# Aggregate engagement metrics
engagement = feature_usage.groupby('user_id').agg({
    'usage_count': 'sum',
    'feature_name': 'nunique'
}).rename(columns={'usage_count': 'total_usage', 'feature_name': 'features_adopted'}).reset_index()

events = user_events.groupby('user_id').size().reset_index(name='total_events')

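# Merge features (keep churn_date: the churn analysis in Part 4 needs it)
df = subs[['user_id', 'tenure_days', 'mrr', 'plan_tier', 'churn_date']].merge(engagement, on='user_id', how='left')
df = df.merge(events, on='user_id', how='left')
# Fill only numeric columns with 0 so churn_date stays NaT for active customers
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(0)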

# Create derived features
df['usage_per_day'] = df['total_usage'] / (df['tenure_days'] + 1)
df['events_per_day'] = df['total_events'] / (df['tenure_days'] + 1)

clustering_features = ['tenure_days', 'mrr', 'total_usage', 'features_adopted', 'total_events', 'usage_per_day', 'events_per_day']
print(f"Features: {clustering_features}")
print(f"\nFeature summary:\n{df[clustering_features].describe()}")
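
The left merges above leave missing values for customers who have no usage rows. A minimal sketch of that behavior on made-up toy tables (the `*_demo` names and all values are hypothetical): it fills only the count columns, so a date column such as `churn_date` keeps its missing values, and it uses pandas' `validate` argument to check that the merge stays one row per customer.

```python
import pandas as pd

# Toy stand-ins for the course tables (values are made up for illustration).
subs_demo = pd.DataFrame({
    'user_id': [1, 2, 3],
    'tenure_days': [400, 30, 90],
    'churn_date': pd.to_datetime(['2024-01-15', None, None]),
})
usage_demo = pd.DataFrame({
    'user_id': [1, 1, 2],
    'feature_name': ['export', 'search', 'search'],
    'usage_count': [5, 7, 2],
})

# Aggregate to one row per user before merging.
agg_demo = (usage_demo.groupby('user_id')
            .agg(total_usage=('usage_count', 'sum'),
                 features_adopted=('feature_name', 'nunique'))
            .reset_index())

# validate='one_to_one' raises MergeError if either side has duplicate user_ids.
merged = subs_demo.merge(agg_demo, on='user_id', how='left', validate='one_to_one')

# Fill only the count columns; user 3 has no usage rows and gets 0.
count_cols = ['total_usage', 'features_adopted']
merged[count_cols] = merged[count_cols].fillna(0)

print(merged)
```

Filling the whole frame with 0 would also overwrite missing dates, which matters as soon as a missing `churn_date` is used to mean "still active".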

## Part 2: Feature Scaling & K-Means Clustering

**💡 Depth Note:** Feature scaling is critical for distance-based algorithms. Without it, features with larger numeric ranges dominate the distance calculation. `StandardScaler` rescales each feature to mean 0 and standard deviation 1.

**Explore further:** Try `MinMaxScaler` instead of `StandardScaler`. Do the cluster assignments change?
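
To make the scaling point concrete, here is a small sketch on synthetic data (not the course dataset). It assumes two true groups that differ only in a small-range feature, plus a large-range feature that is pure noise, and measures how well K-Means recovers the groups with no scaling, `StandardScaler`, and `MinMaxScaler`. All variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
n = 200  # points per group

# Two true groups that differ only in the small-range feature.
small = np.concatenate([rng.normal(0, 0.5, n), rng.normal(10, 0.5, n)])
# A large-range feature that carries no group information.
large = rng.normal(0, 1000, 2 * n)
X_demo = np.column_stack([small, large])
true_groups = np.repeat([0, 1], n)

def recovery(X):
    """Adjusted Rand index between K-Means labels and the true groups."""
    labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
    return adjusted_rand_score(true_groups, labels)

ari_raw = recovery(X_demo)
ari_standard = recovery(StandardScaler().fit_transform(X_demo))
ari_minmax = recovery(MinMaxScaler().fit_transform(X_demo))

print(f"Unscaled:       ARI = {ari_raw:.2f}")
print(f"StandardScaler: ARI = {ari_standard:.2f}")
print(f"MinMaxScaler:   ARI = {ari_minmax:.2f}")
```

An adjusted Rand index near 1 means the groups were recovered and a value near 0 means the split is no better than chance. On real customer data the two scalers can disagree, especially when outliers compress the `MinMaxScaler` range.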

In [None]:
# Scale features
X = df[clustering_features].copy()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("✅ Features scaled to mean=0, std=1")

# Scan K = 2..10: inertia supports the elbow method, silhouette picks the recommended K
print("\n" + "="*60)
print("FINDING OPTIMAL K")
print("="*60)

results = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    silhouette = silhouette_score(X_scaled, kmeans.labels_)
    results.append({'K': k, 'Inertia': kmeans.inertia_, 'Silhouette': silhouette})
    print(f"K={k} | Inertia: {kmeans.inertia_:>10,.0f} | Silhouette: {silhouette:.4f}")

results_df = pd.DataFrame(results)
best_k = int(results_df.loc[results_df['Silhouette'].idxmax(), 'K'])  # plain int for n_clusters and range()
print(f"\n✅ Recommended K: {best_k} (silhouette: {results_df['Silhouette'].max():.4f})")
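
Inertia always falls as K grows, so it cannot be minimized directly. The elbow method looks for the K after which the improvement collapses. The sketch below shows both signals side by side on synthetic data with four known groups. The data, the `Inertia_drop_pct` column, and the variable names are illustrative assumptions, not part of the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustrative only).
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
X_demo, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

rows = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_demo)
    rows.append({'K': k, 'Inertia': km.inertia_,
                 'Silhouette': silhouette_score(X_demo, km.labels_)})
scan = pd.DataFrame(rows)

# Relative inertia improvement from K-1 to K; the elbow is where this collapses.
scan['Inertia_drop_pct'] = -100 * scan['Inertia'].pct_change()

best_k_demo = int(scan.loc[scan['Silhouette'].idxmax(), 'K'])
print(scan.round(3).to_string(index=False))
print(f"Silhouette-selected K: {best_k_demo}")
```

When the two signals disagree on real data, it is usually worth profiling the candidate K values and choosing the one whose segments are easiest to act on.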

## Part 3: Train Final K-Means & Profile Clusters

In [None]:
# Final K-Means model
kmeans_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
df['cluster'] = kmeans_final.fit_predict(X_scaled)

print(f"Cluster distribution:")
print(df['cluster'].value_counts().sort_index())

# Profile each cluster
print("\n" + "="*60)
print("CLUSTER PROFILES")
print("="*60)

profile = df.groupby('cluster')[clustering_features + ['plan_tier']].agg({
    'tenure_days': 'mean',
    'mrr': 'mean',
    'total_usage': 'mean',
    'features_adopted': 'mean',
    'total_events': 'mean',
    'usage_per_day': 'mean',
    'events_per_day': 'mean',
    'plan_tier': lambda x: x.mode()[0] if len(x.mode()) > 0 else 'Unknown'
}).round(2)

print(profile)

# Assign cluster names based on characteristics
cluster_names = {}
for cluster_id in range(best_k):
    c_profile = df[df['cluster'] == cluster_id]
    avg_mrr = c_profile['mrr'].mean()
    avg_usage = c_profile['usage_per_day'].mean()
    
    if avg_mrr > df['mrr'].quantile(0.75) and avg_usage > df['usage_per_day'].quantile(0.75):
        cluster_names[cluster_id] = "Power Users"
    elif avg_mrr > df['mrr'].quantile(0.75):
        cluster_names[cluster_id] = "High Payers"
    elif avg_usage > df['usage_per_day'].quantile(0.75):
        cluster_names[cluster_id] = "Active Users"
    elif c_profile['tenure_days'].mean() < df['tenure_days'].quantile(0.25):
        cluster_names[cluster_id] = "New Customers"
    elif avg_mrr < df['mrr'].quantile(0.25):
        cluster_names[cluster_id] = "Low Value"
    else:
        cluster_names[cluster_id] = "Steady Users"

df['segment'] = df['cluster'].map(cluster_names)

print("\n" + "="*60)
print("SEGMENT NAMES & BUSINESS INTERPRETATION")
print("="*60)

for cluster_id in sorted(df['cluster'].unique()):
    segment = cluster_names[cluster_id]
    segment_size = (df['cluster'] == cluster_id).sum()
    avg_mrr = df[df['cluster'] == cluster_id]['mrr'].mean()
    print(f"\n{segment} (n={segment_size:,}): ${avg_mrr:.0f} avg MRR")
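
Cluster profiles can also be read straight from the fitted model: mapping the K-Means centroids back through the scaler gives each cluster's center in original units. A minimal sketch on synthetic data, where the column names `mrr` and `features_adopted` are borrowed from this notebook but the values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Three synthetic groups, stretched so the two columns have different units.
blobs, _ = make_blobs(n_samples=300, centers=[(-3, -3), (3, -3), (0, 3)],
                      cluster_std=0.5, random_state=0)
raw = pd.DataFrame(blobs * [100.0, 5.0] + [300.0, 20.0],
                   columns=['mrr', 'features_adopted'])

scaler_demo = StandardScaler()
X_s = scaler_demo.fit_transform(raw)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_s)

# Centroids live in scaled space; inverse_transform returns them to original units.
centroids = pd.DataFrame(scaler_demo.inverse_transform(km.cluster_centers_),
                         columns=raw.columns)

# For comparison: per-cluster means computed directly on the raw data.
means = raw.groupby(km.labels_).mean()

print(centroids.round(1))
print(means.round(1))
```

The two tables should agree, because standardization is linear and each converged centroid is the mean of its members. A `groupby` profile is still needed for non-numeric columns such as `plan_tier`.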

## Part 4: Business Actions by Segment

**💡 Depth Note:** This is where clustering becomes valuable. Design segment-specific strategies for:
- Marketing campaigns
- Pricing/upsell opportunities
- Churn prevention
- Product roadmap prioritization

**Explore further:** Build a churn model with 'segment' as a feature. Does it improve predictions?
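
One way to set up that comparison is sketched below on synthetic data. It assumes churn risk really does differ by segment (the rates are made up) and compares cross-validated AUC for a logistic regression with and without one-hot segment columns. Names such as `usage_z` and `cv_auc` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000

# Synthetic customers: churn risk is assumed to differ by segment (made-up rates).
segment = rng.choice(['Power Users', 'Steady Users', 'Low Value'], size=n)
base_rate = pd.Series(segment).map(
    {'Power Users': 0.1, 'Steady Users': 0.3, 'Low Value': 0.6}).to_numpy()
usage_z = rng.normal(size=n)  # standardized usage with a weak protective effect
logit = np.log(base_rate / (1 - base_rate)) - 0.3 * usage_z
churned = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_base = pd.DataFrame({'usage_z': usage_z})
X_seg = pd.concat(
    [X_base, pd.get_dummies(pd.Series(segment), prefix='segment', dtype=float)],
    axis=1)

def cv_auc(X):
    """Mean 5-fold ROC AUC for a logistic regression on feature matrix X."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, churned, cv=5, scoring='roc_auc').mean()

auc_without = cv_auc(X_base)
auc_with = cv_auc(X_seg)
print(f"AUC without segment: {auc_without:.3f}")
print(f"AUC with segment:    {auc_with:.3f}")
```

On real data the gain may be much smaller, since segments are derived from features the churn model may already see. Fitting the clustering inside each training fold also avoids leaking information from the validation rows.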

In [None]:
# Segment-specific metrics
print("=" * 60)
print("SEGMENT CHARACTERISTICS & RECOMMENDED ACTIONS")
print("=" * 60)

for segment in df['segment'].unique():
    segment_df = df[df['segment'] == segment]
    churned = segment_df['churn_date'].notna().sum()
    churn_rate = 100 * churned / len(segment_df)
    
    print(f"\n{segment}:")
    print(f"  Size: {len(segment_df):,} ({100*len(segment_df)/len(df):.1f}%)")
    print(f"  Avg MRR: ${segment_df['mrr'].mean():.0f}")
    print(f"  Avg Tenure: {segment_df['tenure_days'].mean():.0f} days")
    print(f"  Churn Rate: {churn_rate:.1f}%")
    print(f"  Avg Features Adopted: {segment_df['features_adopted'].mean():.1f}")
    
    if 'Power User' in segment:
        print(f"  Action: Assign account managers, prevent churn")
    elif 'High Payer' in segment:
        print(f"  Action: Engagement is low relative to spend; invest in onboarding and training")
    elif 'Active User' in segment:
        print(f"  Action: Upsell opportunity, feature adoption")
    elif 'New' in segment:
        print(f"  Action: Optimize onboarding, reduce time-to-value")
    elif 'Low Value' in segment:
        print(f"  Action: Re-engagement or downgrade to free tier")
    else:
        print(f"  Action: Standard nurture campaigns")

## Part 5: Alternative Clustering Algorithms

**💡 Depth Note:** Compare different algorithms on the same data:
- **DBSCAN**: Finds arbitrary-shaped clusters, identifies outliers
- **Hierarchical**: Creates dendrogram, shows cluster relationships
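
As a rough illustration of the dendrogram idea, the sketch below builds a Ward linkage tree with SciPy on a small synthetic sample and cuts it into three flat clusters. SciPy is used here as an assumption, because scikit-learn's `AgglomerativeClustering` does not draw the tree itself; the data is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

# Small synthetic sample so the tree stays readable.
X_demo, _ = make_blobs(n_samples=60, centers=[(0, 0), (8, 0), (4, 8)],
                       cluster_std=0.7, random_state=0)

# Ward linkage matrix: each row records one merge
# (cluster a, cluster b, merge distance, size of the merged cluster).
Z = linkage(X_demo, method='ward')

# Cut the tree into 3 flat clusters (labels start at 1).
labels = fcluster(Z, t=3, criterion='maxclust')

# no_plot=True returns the tree layout without needing matplotlib;
# drop it to draw the dendrogram.
tree = dendrogram(Z, no_plot=True)

print("Last three merge heights:", np.round(Z[-3:, 2], 2))
print("Flat cluster sizes:", np.bincount(labels)[1:])
```

A large jump between consecutive merge heights suggests a natural place to cut the tree, which offers a second opinion on the K chosen by silhouette.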

**Explore further:** 
- Plot dendrograms to visualize cluster hierarchy
- Analyze DBSCAN outliers (noise points): are they actual anomalies?
- Compare runtime on larger datasets
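
DBSCAN results depend heavily on `eps`. A common heuristic is to inspect the distance from each point to its `min_samples`-th nearest neighbor and pick `eps` near the knee of that sorted curve. The sketch below uses synthetic data and, as a stated simplification, takes the 95th percentile instead of reading the knee from a plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X_demo, _ = make_blobs(n_samples=900, centers=[(0, 0), (10, 0), (5, 10)],
                       cluster_std=0.5, random_state=0)

min_samples = 10
# Passing X_demo to kneighbors counts each point as its own nearest neighbor,
# which matches DBSCAN's min_samples (it also counts the point itself).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])

# Assumed shortcut: a high percentile stands in for the knee of the k-distance curve.
eps_demo = float(np.percentile(k_dist, 95))

labels = DBSCAN(eps=eps_demo, min_samples=min_samples).fit_predict(X_demo)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_frac = float(np.mean(labels == -1))

print(f"eps from k-distance: {eps_demo:.3f}")
print(f"Clusters: {n_clusters} | Noise: {100 * noise_frac:.1f}%")
```

Plotting `k_dist` and choosing the knee by eye is the more usual workflow. The percentile is only a starting value to tune from.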

In [None]:
# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=10)
df['cluster_dbscan'] = dbscan.fit_predict(X_scaled)

n_clusters_dbscan = len(set(df['cluster_dbscan'])) - (1 if -1 in df['cluster_dbscan'].values else 0)
n_outliers = (df['cluster_dbscan'] == -1).sum()

print(f"DBSCAN Results:")
print(f"  Clusters: {n_clusters_dbscan}")
print(f"  Outliers (noise): {n_outliers:,} ({100*n_outliers/len(df):.1f}%)")

# Hierarchical
hierarchical = AgglomerativeClustering(n_clusters=best_k, linkage='ward')
df['cluster_hierarchical'] = hierarchical.fit_predict(X_scaled)

print(f"\nHierarchical Results:")
print(f"  Clusters: {best_k}")
print(f"  Silhouette: {silhouette_score(X_scaled, df['cluster_hierarchical']):.4f}")

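# Comparison (DBSCAN silhouette covers non-noise points only,
# so it is not strictly comparable to the other two scores)
not_noise = (df['cluster_dbscan'] != -1).to_numpy()
dbscan_silhouette = (
    silhouette_score(X_scaled[not_noise], df.loc[not_noise, 'cluster_dbscan'])
    if n_clusters_dbscan > 1 else np.nan  # undefined with fewer than 2 clusters
)
comparison = pd.DataFrame({
    'Algorithm': ['K-Means', 'DBSCAN', 'Hierarchical'],
    'Silhouette': [
        silhouette_score(X_scaled, df['cluster']),
        dbscan_silhouette,
        silhouette_score(X_scaled, df['cluster_hierarchical'])
    ]
})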

print("\n" + "="*60)
print("ALGORITHM COMPARISON")
print("="*60)
print(comparison.to_string(index=False))
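
Silhouette scores say how compact each solution is, but not whether two algorithms found the same segments. A short sketch of checking agreement with the adjusted Rand index and a contingency table, on synthetic data with illustrative names:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X_demo, _ = make_blobs(n_samples=300, centers=[(0, 0), (10, 0), (5, 10)],
                       cluster_std=1.0, random_state=0)

km_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X_demo)

# ARI ignores how clusters are numbered: 1.0 means identical partitions,
# values near 0 mean agreement no better than chance.
ari = adjusted_rand_score(km_labels, hc_labels)

# Rows: K-Means clusters, columns: hierarchical clusters.
overlap = pd.crosstab(km_labels, hc_labels,
                      rownames=['kmeans'], colnames=['hierarchical'])

print(f"Adjusted Rand index: {ari:.3f}")
print(overlap)
```

Low agreement on real data is a hint that the segment boundaries are soft, which is worth knowing before naming personas.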

## Hands-On Exercises

### Exercise 1: Feature Selection Impact
Try clustering with different feature sets:
1. Engagement only: `total_usage`, `features_adopted`, `total_events`
2. Financial only: `mrr`, `tenure_days`
3. All features (current)

Compare silhouette scores and cluster interpretability.

In [None]:
# TODO: Test different feature sets
# engagement_features = ['total_usage', 'features_adopted', 'total_events']
# financial_features = ['mrr', 'tenure_days']
# TODO: Train K-Means on each, compare results

### Exercise 2: Churn Analysis by Segment
1. Calculate churn rate for each cluster
2. Which segments have the highest churn?
3. Use segment as a feature in a churn prediction model

In [None]:
# TODO: Analyze churn by segment
# TODO: Build churn model with and without segment feature
# TODO: Compare AUC scores

### Exercise 3: Cohort Analysis
Cluster customers separately by signup cohort (monthly). Do segment characteristics change over time? What does this tell us about our product?

In [None]:
# TODO: Extract signup month from subs['signup_date']
# TODO: Cluster within each cohort
# TODO: Compare profiles across time

## Assignment: Customer Persona Development

**Deliverables:**
1. Cluster analysis report (3-5 segments)
2. Persona card for each segment (name, characteristics, pain points, use cases)
3. Strategic recommendations (prioritization, product, marketing)

**Bonus:** Build a segment predictor for new customers at signup
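
One possible starting point for the bonus, assuming the new customer has the same features the clustering was trained on: wrap the scaler and K-Means in a single pipeline so new rows are scaled with the training statistics before assignment. The data and column meanings below are made up. At true signup time many usage features will not exist yet, in which case a supervised classifier trained on signup-time attributes, with the segment as its target, may be the better design.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data: columns stand in for (mrr, features_adopted).
blobs, blob_id = make_blobs(n_samples=300, centers=[(-3, 0), (3, 0), (0, 4)],
                            cluster_std=0.5, random_state=0)
X_train = blobs * [50.0, 2.0] + [200.0, 6.0]

# Fitting both steps together keeps the scaling consistent at prediction time.
segmenter = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, random_state=42, n_init=10),
)
segmenter.fit(X_train)
train_labels = segmenter.predict(X_train)

# Two hypothetical new customers, in original (unscaled) units.
new_customers = np.array([[50.0, 6.0], [350.0, 6.0]])
new_labels = segmenter.predict(new_customers)

print("Assigned clusters:", new_labels)
```

Refitting the scaler on new customers alone would shift the feature space and produce inconsistent assignments, which is why the fitted pipeline is reused as a unit.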

## Key Takeaways

✅ Supervised vs Unsupervised Learning distinction  
✅ Feature scaling importance for distance-based algorithms  
✅ Elbow method & silhouette analysis for optimal K  
✅ Cluster profiling & business interpretation  
✅ Algorithm comparison (K-Means, DBSCAN, Hierarchical)  
✅ Translating clusters into actionable business strategies  

## 🔜 Next Week: Dimensionality Reduction (PCA)

Reduce hundreds of features → 2-3 dimensions for visualization and noise reduction