# Unsupervised Learning & Clustering - Hands-On Lab

📚 **Objective**: Learn unsupervised learning through 3 practical examples

By completing this notebook, you'll understand how clustering algorithms work in practice and when to use each one.

## Introduction

Unsupervised learning discovers hidden patterns in unlabeled data. This notebook demonstrates **three core clustering approaches**:

1. **K-means** - Fast partitional clustering for customer segmentation
2. **DBSCAN** - Density-based clustering for anomaly detection
3. **Comparison** - Evaluating clustering quality with metrics

Each example is self-contained and produces visual output to demonstrate the concepts.

---
## Example 1: K-means for Customer Segmentation

**Goal**: Group customers by purchasing behavior (Recency, Frequency, Monetary)

**Key Concepts**:
- Feature scaling before clustering
- Elbow method to find optimal K
- Silhouette score for evaluation
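Before calling scikit-learn's `KMeans`, it can help to see the two alternating steps behind it. The following is a minimal sketch of Lloyd's algorithm in plain Python, not scikit-learn's actual implementation. The function name `lloyd_kmeans`, the toy points, and the fixed starting centroids are all assumptions made for illustration.

```python
import math

def lloyd_kmeans(points, centroids, n_iter=10):
    """Minimal K-means (Lloyd's algorithm) on a list of equal-length tuples."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points.
toy_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
toy_labels, toy_centroids = lloyd_kmeans(
    toy_points, centroids=[(0.0, 0.0), (10.0, 10.0)]
)
print(toy_labels)
print(toy_centroids)
```

The library version adds smarter initialization and multiple restarts (`n_init`), which is why the cells below rely on it rather than on this sketch.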

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Generate synthetic customer data
np.random.seed(42)
n_customers = 300

data = pd.DataFrame({
    'recency': np.random.randint(1, 365, n_customers),      # Days since last purchase
    'frequency': np.random.randint(1, 50, n_customers),     # Number of purchases  
    'monetary': np.random.randint(10, 1000, n_customers)    # Average order value
})

print("📊 Customer Data Sample:")
print(data.head())
print(f"\nShape: {data.shape}")
print(f"\nStatistics:\n{data.describe()}")

### Step 1: Feature Scaling

⚠️ **Critical**: K-means uses Euclidean distance, so features must be on the same scale!
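To see why scale matters, here is a small sketch with three hypothetical customers described by (frequency, monetary). The customer values and the per-feature standard deviations in `stds` are made up for illustration. Dividing by the standard deviation is enough to show the effect, since subtracting the mean does not change distances.

```python
import math

# Hypothetical customers: (frequency in purchases, monetary in dollars).
a = (5, 200)
b = (45, 230)   # very different frequency, similar spend
c = (6, 600)    # similar frequency, very different spend

raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)
print(f"raw:    a-b = {raw_ab:.2f}, a-c = {raw_ac:.2f}")

# Assumed per-feature standard deviations, for illustration only.
stds = (14.0, 285.0)

def scale(p):
    """Divide each feature by its (assumed) standard deviation."""
    return tuple(value / std for value, std in zip(p, stds))

scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))
print(f"scaled: a-b = {scaled_ab:.2f}, a-c = {scaled_ac:.2f}")
```

On the raw values the dollar column dominates, so `c` looks far from `a` and the large frequency gap to `b` barely registers. After scaling, both features contribute on comparable terms and `b` becomes the more distant customer.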

In [None]:
# Standardize features (mean=0, std=1)
scaler = StandardScaler()
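# Select the three RFM columns explicitly so that re-running this cell
# after the 'cluster' column has been added does not scale the labels too
X_scaled = scaler.fit_transform(data[['recency', 'frequency', 'monetary']])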

print("✅ Scaling verification:")
print(f"Mean: {X_scaled.mean(axis=0).round(2)}")  # Should be ~[0, 0, 0]
print(f"Std:  {X_scaled.std(axis=0).round(2)}")   # Should be ~[1, 1, 1]

### Step 2: Find Optimal K (Elbow Method)

We'll try K from 2 to 10 and plot **inertia** (within-cluster sum of squares) and **silhouette scores**.
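As a rough illustration of what inertia measures, the sketch below computes the within-cluster sum of squared distances by hand on four made-up points. The helper name `inertia` and the toy data are assumptions; scikit-learn exposes the same quantity as `kmeans.inertia_`.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

# Hypothetical toy data: two tight pairs, far apart.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = inertia(pts, [0, 0, 0, 0], [(5.0, 1.0)])
two_clusters = inertia(pts, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])
print(one_cluster, two_clusters)  # 104.0 4.0
```

Inertia keeps shrinking as K grows (it reaches zero when every point is its own cluster), which is why the elbow method looks for a bend in the curve rather than its minimum.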

In [None]:
# Try different K values
K_range = range(2, 11)
inertias = []
silhouette_scores = []

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Number of Clusters (K)', fontsize=12)
ax1.set_ylabel('Inertia', fontsize=12)
ax1.set_title('Elbow Method', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

ax2.plot(K_range, silhouette_scores, 'ro-', linewidth=2, markersize=8)
ax2.set_xlabel('Number of Clusters (K)', fontsize=12)
ax2.set_ylabel('Silhouette Score', fontsize=12)
ax2.set_title('Silhouette Analysis', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n📈 Best K by silhouette: {K_range[np.argmax(silhouette_scores)]}")

### Step 3: Apply K-means with Optimal K

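Based on the elbow plot, we choose **K=4**: look for the "elbow" where adding another cluster stops reducing inertia by much. The synthetic data here is uniformly random, so the elbow may be faint and the silhouette-optimal K printed above may differ.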
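Reading the elbow by eye is subjective. One possible way to make it repeatable is to pick the K where the curve bends most sharply, measured by the discrete second difference. The helper `elbow_by_second_difference` is hypothetical, the inertia values are made up, and this is only one of several elbow heuristics; it can be unstable on noisy curves.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the K with the largest second difference of the inertia curve."""
    best_k, best_bend = None, float("-inf")
    # The first and last K have no neighbor on one side, so skip them.
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Made-up inertia values, for illustration only.
ks = [2, 3, 4, 5, 6]
made_up_inertias = [900.0, 600.0, 350.0, 320.0, 300.0]
elbow_k = elbow_by_second_difference(ks, made_up_inertias)
print(elbow_k)  # 4
```

With the notebook's own variables, the call would be `elbow_by_second_difference(list(K_range), inertias)`. Treat the result as a suggestion to check against the plot and the silhouette curve.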

In [None]:
# Fit final model
optimal_k = 4
kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans_final.fit_predict(X_scaled)

# Add clusters to original data
data['cluster'] = clusters

# Evaluate
final_silhouette = silhouette_score(X_scaled, clusters)
print(f"✅ Silhouette Score (K={optimal_k}): {final_silhouette:.3f}")
print(f"   Range: [-1, 1], Higher is better\n")

# Analyze cluster profiles
cluster_summary = data.groupby('cluster').mean()
print("📊 Cluster Profiles (Original Scale):")
print(cluster_summary.round(0))
print("\n💡 Interpretation:")
print("  - Look for patterns in recency, frequency, and monetary values")
print("  - Low recency (few days since last purchase) = Recent buyers")
print("  - High frequency = Loyal customers")
print("  - High monetary = Big spenders")

### Step 4: Visualize Customer Segments

In [None]:
# 2D visualization (Frequency vs Monetary)
plt.figure(figsize=(10, 6))
scatter = plt.scatter(data['frequency'], data['monetary'], 
                     c=data['cluster'], cmap='viridis', 
                     s=80, alpha=0.6, edgecolors='black', linewidth=0.5)

plt.xlabel('Purchase Frequency', fontsize=12)
plt.ylabel('Average Order Value ($)', fontsize=12)
plt.title(f'Customer Segments (K-means, K={optimal_k})', fontsize=14, fontweight='bold')
plt.colorbar(scatter, label='Cluster ID')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✅ Customers successfully segmented! Marketing can now target each group differently.")

---
## Example 2: DBSCAN for Anomaly Detection

**Goal**: Detect unusual network traffic patterns (anomalies)

**Key Concepts**:
- Density-based clustering
- ε (epsilon) and MinPts parameters
- Automatic outlier detection

In [None]:
from sklearn.cluster import DBSCAN

# Generate data: normal traffic + anomalies
np.random.seed(42)

# Two normal traffic clusters
normal1 = np.random.randn(100, 2) * 0.5 + [2, 2]
normal2 = np.random.randn(100, 2) * 0.5 + [8, 8]
normal = np.vstack([normal1, normal2])

# Scattered anomalies
anomalies = np.random.uniform(0, 10, (20, 2))

# Combine
X = np.vstack([normal, anomalies])

print("📡 Network Traffic Data:")
print(f"  Total points: {len(X)}")
print(f"  Normal patterns: 200")
print(f"  True anomalies: 20")

### Apply DBSCAN

**Parameters**:
- `eps` (ε): Maximum distance between neighbors
- `min_samples`: Minimum points to form a dense region
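The two parameters interact: a point is a *core* point when at least `min_samples` points (counting itself, as in scikit-learn) lie within `eps` of it. The sketch below classifies points as core, border, or noise under that rule. The helper `classify_points` and the toy coordinates are assumptions, and the step that links core points into clusters is left out.

```python
import math

def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' under DBSCAN's density rule."""
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(neigh) >= min_samples for neigh in neighbors]
    kinds = []
    for i, neigh in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neigh):
            kinds.append("border")  # not dense itself, but within eps of a core point
        else:
            kinds.append("noise")   # reachable from no core point
    return kinds

# Hypothetical toy data: a tight square, one hanger-on, one isolated point.
toy = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.2, 0.5), (5.0, 5.0)]
kinds = classify_points(toy, eps=0.8, min_samples=4)
print(kinds)
```

Here the four corners of the square are core points, `(1.2, 0.5)` is a border point because it sits within `eps` of one corner, and `(5.0, 5.0)` is noise. Raising `eps` or lowering `min_samples` makes the algorithm more permissive, so fewer points end up as noise.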

In [None]:
# Apply DBSCAN
epsilon = 0.8
min_samples = 4

dbscan = DBSCAN(eps=epsilon, min_samples=min_samples)
labels = dbscan.fit_predict(X)

# Analyze results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

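# The last 20 rows of X are the injected anomalies; count how many were flagged
true_anomaly_hits = int((labels[-20:] == -1).sum())

print("🔍 DBSCAN Results:")
print(f"  ε (epsilon): {epsilon}")
print(f"  MinPts: {min_samples}")
print(f"  Clusters found: {n_clusters}")
print(f"  Noise points (flagged as anomalies): {n_noise}")
print(f"  Detection rate: {true_anomaly_hits/20*100:.0f}% of true anomalies")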

### Visualize Clusters and Anomalies

Noise points (label=-1) are marked with **red X** symbols.

In [None]:
plt.figure(figsize=(10, 7))

# Plot normal clusters and anomalies
for label in set(labels):
    if label == -1:
        # Anomalies (noise)
        mask = (labels == label)
        plt.scatter(X[mask, 0], X[mask, 1], c='red', marker='x', 
                   s=150, linewidths=3, label='Anomaly', zorder=3)
    else:
        # Normal clusters
        mask = (labels == label)
        plt.scatter(X[mask, 0], X[mask, 1], s=60, alpha=0.7, 
                   label=f'Cluster {label}', edgecolors='black', linewidth=0.5)

plt.xlabel('Feature 1 (Packet Size)', fontsize=12)
plt.ylabel('Feature 2 (Duration)', fontsize=12)
plt.title(f'DBSCAN: Anomaly Detection (ε={epsilon}, MinPts={min_samples})',
         fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✅ Anomaly detection complete! Red X marks indicate suspicious traffic.")

---
## Example 3: Comparing Clustering Algorithms

**Goal**: Compare K-means, DBSCAN, and Hierarchical clustering on the same dataset

**Metrics**:
- Silhouette Score (higher = better)
- Execution time
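For intuition about what the silhouette score rewards, the sketch below computes it by hand: for each point, `a` is the mean distance to its own cluster and `b` is the mean distance to the nearest other cluster. The helper `silhouette_samples_naive` and the toy points are assumptions; in practice `sklearn.metrics.silhouette_score` does this far more efficiently.

```python
import math

def silhouette_samples_naive(points, labels):
    """Per-point silhouette values; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None:
            scores.append(0.0)  # singleton cluster: silhouette is taken as 0
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

# Hypothetical toy data: two tight pairs, far apart.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_samples_naive(pts, [0, 0, 1, 1])  # pairs grouped together
bad = silhouette_samples_naive(pts, [0, 1, 0, 1])   # pairs split apart
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(f"good: {mean_good:.3f}, bad: {mean_bad:.3f}")
```

The sensible grouping scores close to 1, while the labeling that splits each pair scores below 0, which signals that points sit nearer to another cluster than to their own.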

In [None]:
from sklearn.cluster import AgglomerativeClustering
import time

# Use the scaled customer data
X_comparison = X_scaled

# Store results
results = {}

# --- K-means ---
start = time.time()
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels_kmeans = kmeans.fit_predict(X_comparison)
time_kmeans = time.time() - start

results['K-means'] = {
    'Silhouette': silhouette_score(X_comparison, labels_kmeans),
    'Time (s)': time_kmeans,
    'Clusters': 4
}

# --- DBSCAN ---
start = time.time()
dbscan_comp = DBSCAN(eps=0.5, min_samples=5)
labels_dbscan_comp = dbscan_comp.fit_predict(X_comparison)
time_dbscan = time.time() - start

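# Score only the clustered points: noise (-1) is not a cluster
clustered = labels_dbscan_comp != -1
n_clusters_dbscan = len(set(labels_dbscan_comp[clustered]))
if n_clusters_dbscan > 1:
    sil_db = silhouette_score(X_comparison[clustered], labels_dbscan_comp[clustered])
else:
    sil_db = float('nan')  # silhouette is undefined with fewer than 2 clusters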

results['DBSCAN'] = {
    'Silhouette': sil_db,
    'Time (s)': time_dbscan,
    'Clusters': n_clusters_dbscan
}

# --- Hierarchical ---
start = time.time()
hierarchical = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels_hierarchical = hierarchical.fit_predict(X_comparison)
time_hierarchical = time.time() - start

results['Hierarchical'] = {
    'Silhouette': silhouette_score(X_comparison, labels_hierarchical),
    'Time (s)': time_hierarchical,
    'Clusters': 4
}

# Display comparison
comparison_df = pd.DataFrame(results).T
print("📊 Algorithm Comparison:")
print(comparison_df.round(4))
print("\n💡 Key Insights:")
print("  - Higher Silhouette = Better clustering quality")
print("  - K-means: Fast, good for spherical clusters")
print("  - DBSCAN: Finds arbitrary shapes, identifies outliers")
print("  - Hierarchical: Builds a merge tree that can be cut at any K, slower for large datasets")

---
## 📚 Summary & Key Takeaways

### What You Learned:

1. **K-means Clustering**
   - Requires K (number of clusters) upfront
   - Use elbow method & silhouette score to find optimal K
   - **Always scale features** before applying!
   - Best for: Spherical, similarly-sized clusters

2. **DBSCAN Clustering**
   - No K required - discovers clusters automatically
   - Identifies **outliers** as noise points
   - Parameters: ε (neighborhood radius) and MinPts
   - Best for: Arbitrary shapes, anomaly detection

3. **Evaluation Metrics**
   - **Silhouette Score**: [-1, 1], higher is better
   - Compare multiple algorithms on same data
   - No single "correct" clustering - depends on goal!

### 🚀 Next Steps:
- Try Gaussian Mixture Models (GMM) for soft clustering
- Explore hierarchical clustering dendrograms
- Apply to your own datasets
- Learn dimensionality reduction (PCA, t-SNE) for high-dimensional data

### ⚠️ Remember:
- Clustering is **exploratory** - validate results with domain knowledge
- Different algorithms reveal different patterns  
- Preprocessing (scaling, handling outliers) is critical!