# 027: Hierarchical Clustering - Tree-Based Cluster Discovery

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. **Understand hierarchical clustering theory**: Agglomerative vs divisive, linkage methods, dendrogram interpretation
2. **Master linkage criteria**: Single, complete, average, Ward's method - when to use each
3. **Implement from scratch**: Build agglomerative clustering algorithm with distance matrix updates
4. **Visualize dendrograms**: Interpret tree structures, choose optimal cut height, identify natural groupings
5. **Apply to real problems**: Test hierarchy discovery, failure mode taxonomy, wafer similarity trees
6. **Compare with K-Means**: Understand trade-offs (no K required vs O(n²) complexity)

---

## 📊 Hierarchical Clustering Workflow

```mermaid
graph TD
    A[📥 Data Points N samples] --> B{Clustering Strategy?}
    B -->|Bottom-Up| C[🔼 Agglomerative: Start with N clusters]
    B -->|Top-Down| D[🔽 Divisive: Start with 1 cluster]
    
    C --> E[📏 Compute Distance Matrix NxN]
    E --> F[🔗 Merge Closest Pair using Linkage]
    F --> G{All merged into 1 cluster?}
    G -->|No| H[♻️ Update Distance Matrix]
    H --> F
    G -->|Yes| I[🌳 Dendrogram Tree]
    
    D --> J[📊 Split Largest Cluster]
    J --> K{All N individual clusters?}
    K -->|No| J
    K -->|Yes| I
    
    I --> L[✂️ Cut Dendrogram at Height h]
    L --> M[🎯 Final K Clusters]
    
    M --> N[📈 Evaluate: Cophenetic correlation]
    N --> O[✅ Cluster Analysis]
    
    style C fill:#e1f5e1
    style D fill:#ffe1e1
    style I fill:#fff4e1
    style M fill:#e1f0ff
```

---

## 🔍 Hierarchical vs K-Means vs DBSCAN

| **Criterion** | **Hierarchical** | **K-Means** | **DBSCAN** |
|--------------|-----------------|------------|-----------|
| **Requires K upfront** | ❌ No (cut dendrogram after) | ✅ Yes | ❌ No (density-based) |
| **Cluster shape** | Any (linkage-dependent) | Spherical only | Arbitrary |
| **Scalability** | Poor: O(n² log n) time, O(n²) memory | Excellent: O(nkt) | Medium: O(n log n) with a spatial index |
| **Deterministic** | ✅ Yes (same distance matrix) | ❌ No (random init) | ✅ Yes (given eps/min_samples) |
| **Dendrogram visualization** | ✅ Yes (tree structure) | ❌ No | ❌ No |
| **Handles outliers** | Poor (forces assignment) | Poor | Excellent (noise=-1) |
| **Interpretability** | Excellent (hierarchical taxonomy) | Good (centroids) | Medium (density threshold) |
| **Best for** | Small data (<5K), taxonomy discovery | Large data (100K+), well-separated clusters | Geospatial, outlier-heavy data |

---

## 🏭 Real-World Applications

### Post-Silicon Validation
- **Test Hierarchy Discovery**: Automatically group 500 parametric tests into functional categories (power, speed, leakage)
- **Failure Mode Taxonomy**: Build tree of failure signatures (e.g., voltage failures → Vdd_high vs Vdd_low → specific test modes)
- **Die Similarity Analysis**: Cluster wafer die into groups by parametric profiles, visualize relationships
- **Multi-Site Correlation**: Discover hierarchical relationships between test sites (which sites test similar device characteristics?)

### General AI/ML
- **Document Taxonomy**: Organize 10K documents into hierarchical categories (no predefined K needed)
- **Product Categorization**: Build multi-level product hierarchy from feature similarity
- **Gene Expression Clustering**: Discover hierarchical relationships in biological data
- **Social Network Communities**: Identify nested community structures in network graphs

---

## 📚 Mathematical Foundation

### Agglomerative Hierarchical Clustering Algorithm

**Input:** Data matrix $X \in \mathbb{R}^{n \times d}$, linkage criterion $L$

**Output:** Dendrogram tree structure, mergings at each step

**Steps:**
1. **Initialize:** Assign each of $n$ points to its own cluster: $C_1, C_2, \ldots, C_n$
2. **Compute Distance Matrix:** Calculate pairwise distances $D_{ij}$ for all cluster pairs
3. **Repeat until one cluster remains:**
   - Find closest pair of clusters: $(i^*, j^*) = \arg\min_{i<j} D_{ij}$
   - Merge $C_{i^*}$ and $C_{j^*}$ into new cluster $C_{new}$
   - Update distance matrix: compute distances from $C_{new}$ to all other clusters using linkage criterion
   - Record merge in dendrogram (height = distance at merge)
4. **Output:** Dendrogram tree showing all mergings

### Linkage Criteria

Linkage determines how to compute distance between two clusters $C_i$ and $C_j$:

#### 1. Single Linkage (Minimum)
$$
d_{\text{single}}(C_i, C_j) = \min_{x \in C_i, y \in C_j} \|x - y\|
$$
- **Interpretation:** Distance between **closest** points in two clusters
- **Behavior:** Tends to form long, elongated "chain" clusters
- **Use case:** Finding clusters connected by bridges, detecting outliers (single points form separate clusters)

#### 2. Complete Linkage (Maximum)
$$
d_{\text{complete}}(C_i, C_j) = \max_{x \in C_i, y \in C_j} \|x - y\|
$$
- **Interpretation:** Distance between **farthest** points in two clusters
- **Behavior:** Produces compact, spherical clusters (similar to K-Means)
- **Use case:** When you want tight, well-separated clusters

#### 3. Average Linkage (UPGMA)
$$
d_{\text{average}}(C_i, C_j) = \frac{1}{|C_i| |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} \|x - y\|
$$
- **Interpretation:** Average distance between **all pairs** of points
- **Behavior:** Balanced approach between single and complete linkage
- **Use case:** General-purpose linkage, robust to noise

#### 4. Ward's Method (Minimum Variance)
$$
d_{\text{ward}}(C_i, C_j) = \frac{|C_i| |C_j|}{|C_i| + |C_j|} \|\mu_i - \mu_j\|^2
$$
where $\mu_i, \mu_j$ are cluster centroids.

- **Interpretation:** Increase in the within-cluster sum of squares (inertia) caused by the merge
- **Behavior:** Greedily minimizes the growth in inertia (sum of squared distances to centroid) at each merge, tends to produce compact, similarly sized clusters
- **Use case:** When you want K-Means-like compact clusters but without specifying K upfront
- **Note:** Most popular in practice, often best default choice
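
As a quick check of the four definitions above, the following sketch evaluates each linkage distance directly for two tiny clusters. The points are hypothetical and chosen only to keep the arithmetic easy to follow; `cdist` computes all cross-cluster Euclidean distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny hypothetical clusters on the x-axis (illustrative values only)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

pairwise = cdist(A, B)          # |A| x |B| matrix of Euclidean distances

d_single = pairwise.min()       # closest pair
d_complete = pairwise.max()     # farthest pair
d_average = pairwise.mean()     # mean over all cross-cluster pairs

# Ward: size-weighted squared distance between centroids
mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
n_a, n_b = len(A), len(B)
d_ward = (n_a * n_b) / (n_a + n_b) * np.sum((mu_a - mu_b) ** 2)

print(f"single={d_single}, complete={d_complete}, average={d_average}, ward={d_ward}")
```

Note that Ward's value is on a squared-distance scale, so it is not directly comparable to the other three.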

### Lance-Williams Update Formula

Efficient distance matrix update without recomputing all pairwise distances:

$$
d(C_{new}, C_k) = \alpha_i d(C_i, C_k) + \alpha_j d(C_j, C_k) + \beta d(C_i, C_j) + \gamma |d(C_i, C_k) - d(C_j, C_k)|
$$

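where $C_{new} = C_i \cup C_j$. The parameters $\alpha_i, \alpha_j, \beta, \gamma$ depend on the linkage method. For Ward, the recurrence applies to squared distances (such as $d_{\text{ward}}$ as defined above), not to plain Euclidean distances:

| **Linkage** | $\alpha_i$ | $\alpha_j$ | $\beta$ | $\gamma$ |
|------------|-----------|-----------|---------|---------|
| Single | 0.5 | 0.5 | 0 | -0.5 |
| Complete | 0.5 | 0.5 | 0 | 0.5 |
| Average | $\frac{\lvert C_i\rvert}{\lvert C_i\rvert+\lvert C_j\rvert}$ | $\frac{\lvert C_j\rvert}{\lvert C_i\rvert+\lvert C_j\rvert}$ | 0 | 0 |
| Ward | $\frac{\lvert C_i\rvert+\lvert C_k\rvert}{\lvert C_i\rvert+\lvert C_j\rvert+\lvert C_k\rvert}$ | $\frac{\lvert C_j\rvert+\lvert C_k\rvert}{\lvert C_i\rvert+\lvert C_j\rvert+\lvert C_k\rvert}$ | $\frac{-\lvert C_k\rvert}{\lvert C_i\rvert+\lvert C_j\rvert+\lvert C_k\rvert}$ | 0 |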
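
The update formula can be checked numerically. The sketch below is a minimal verification on small random clusters (hypothetical data); the helper names `lance_williams` and `ward` are introduced here for illustration only. It compares the Lance-Williams result with the distance computed directly from each linkage definition.

```python
import numpy as np
from scipy.spatial.distance import cdist

def lance_williams(d_ik, d_jk, d_ij, alpha_i, alpha_j, beta, gamma):
    """Distance from the merged cluster (i union j) to cluster k."""
    return alpha_i * d_ik + alpha_j * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)

def ward(A, B):
    """Inertia increase from merging A and B (squared-distance form of Ward)."""
    n_a, n_b = len(A), len(B)
    return n_a * n_b / (n_a + n_b) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

# Linkage distances computed directly from their definitions
direct = {
    "single": lambda A, B: cdist(A, B).min(),
    "complete": lambda A, B: cdist(A, B).max(),
    "average": lambda A, B: cdist(A, B).mean(),
    "ward": ward,
}

rng = np.random.default_rng(0)
Ci = rng.normal(size=(3, 2))
Cj = rng.normal(size=(2, 2)) + 3.0
Ck = rng.normal(size=(4, 2)) - 3.0
ni, nj, nk = len(Ci), len(Cj), len(Ck)
total = ni + nj + nk

# (alpha_i, alpha_j, beta, gamma) per linkage, as in the table above
params = {
    "single": (0.5, 0.5, 0.0, -0.5),
    "complete": (0.5, 0.5, 0.0, 0.5),
    "average": (ni / (ni + nj), nj / (ni + nj), 0.0, 0.0),
    "ward": ((ni + nk) / total, (nj + nk) / total, -nk / total, 0.0),
}

merged = np.vstack([Ci, Cj])
results = {}
for name, d in direct.items():
    updated = lance_williams(d(Ci, Ck), d(Cj, Ck), d(Ci, Cj), *params[name])
    results[name] = (updated, d(merged, Ck))
    print(f"{name:<9} Lance-Williams={updated:.6f}  direct={results[name][1]:.6f}")
```

Each update needs only three previously stored distances, which is why a merge costs O(n) instead of a full recomputation over the raw points.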

### Dendrogram Interpretation

**Y-axis (Height):** Distance at which clusters merge (linkage-dependent)
- **Low height:** Points/clusters are very similar (merge early)
- **High height:** Points/clusters are dissimilar (merge late)

**Cutting the Dendrogram:**
- Horizontal cut at height $h$ produces clusters
- Number of clusters = number of vertical lines intersected
- **Optimal cut:** Look for large "gap" in merge heights (long vertical segments) → natural cluster boundary

**Cophenetic Distance:**
- Distance at which two points are first merged in same cluster
- **Cophenetic Correlation:** Correlation between original distances and cophenetic distances
- High correlation (>0.8) means dendrogram faithfully represents data structure
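
The "large gap" rule for choosing a cut can be automated. The following is a minimal sketch on hypothetical, well-separated synthetic data: it finds the largest jump between consecutive merge heights and cuts in the middle of that jump. Treat it as a heuristic, since on noisy data the largest gap may not be the most useful cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical data: three tight groups at the corners of a triangle
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in centers])

Z = linkage(X, method="ward")
heights = Z[:, 2]                  # merge heights, non-decreasing for Ward
gaps = np.diff(heights)            # jump between consecutive merges
idx = int(np.argmax(gaps))         # largest jump lies between merge idx and idx + 1

cut_height = 0.5 * (heights[idx] + heights[idx + 1])
n_clusters = len(X) - (idx + 1)    # merges 0..idx happen below the cut

labels = fcluster(Z, t=cut_height, criterion="distance")
print(f"Cut at h={cut_height:.2f} gives {n_clusters} clusters")
```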

---

## 📦 Required Libraries

### 📝 What's Happening: Import Dependencies

**Purpose:** Load libraries for hierarchical clustering, dendrogram visualization, and distance computations.

**Key Points:**
- **scipy.cluster.hierarchy**: Core library for hierarchical clustering (linkage, dendrogram, fcluster, cophenet)
- **scipy.spatial.distance**: Distance metrics (euclidean, cityblock, cosine) and pdist for pairwise distances
- **sklearn.cluster.AgglomerativeClustering**: Scikit-learn wrapper with fit_predict API consistency
- **matplotlib/seaborn**: Dendrogram visualization with customizable aesthetics
- **NumPy**: Distance matrix computations and array operations

**Why This Matters:** scipy.cluster.hierarchy is the gold standard for hierarchical clustering in Python, offering optimized C implementations of linkage algorithms and rich dendrogram visualization. For post-silicon applications, visualizing test hierarchies or failure taxonomies requires clear dendrogram plots.
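
Since `pdist` and `squareform` appear throughout this notebook, here is a small illustration of the two distance formats they convert between. The three points are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: one entry per unordered pair, ordered (0,1), (0,2), (1,2)
condensed = pdist(points, metric="euclidean")

# Square form: full symmetric n x n matrix with zeros on the diagonal
square = squareform(condensed)

print("condensed:", condensed)
print("square:\n", square)
```

`linkage()` accepts the condensed form, while the from-scratch implementation in this notebook works on the square form.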

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster, cophenet
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Set random seed for reproducibility
np.random.seed(42)

# Plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("\nKey Modules:")
print("  • scipy.cluster.hierarchy: linkage(), dendrogram(), fcluster()")
print("  • scipy.spatial.distance: pdist(), squareform()")
print("  • sklearn.cluster.AgglomerativeClustering: Production API")
print("  • Linkage methods: single, complete, average, ward")

---

## 🔨 Implementation From Scratch: Agglomerative Clustering

### 📝 What's Happening: Building Agglomerative Clustering Algorithm

**Purpose:** Implement agglomerative hierarchical clustering from scratch to understand merge logic and distance matrix updates.

**Key Points:**
- **Distance Matrix**: Compute all pairwise distances once, then update after each merge
- **Merge Selection**: Find closest cluster pair using np.argmin on flattened distance matrix
- **Lance-Williams Formula**: Efficiently update distances without recomputing all pairs
- **Dendrogram Tracking**: Record merge history (cluster1, cluster2, distance, size) for visualization
- **Linkage Method**: Implements average, single, and complete linkage (Ward is left as an extension)

**Why This Matters:** Understanding merge mechanics reveals why hierarchical clustering is O(n³) naive but O(n² log n) with priority queues. In post-silicon validation, knowing algorithm internals helps explain why 10K test results take minutes to cluster vs seconds for K-Means.

**Post-Silicon Context:** For 500 parametric tests, hierarchical clustering discovers natural test groupings (power tests, speed tests, leakage tests) without predefined categories. From-scratch implementation clarifies why this approach works better than K-Means when test relationships are hierarchical (e.g., Vdd tests → Vdd_nominal → Vdd_low_power).

### 📝 Implementation

**Purpose:** Core implementation with detailed code

**Key implementation details below.**

In [None]:
class AgglomerativeClusteringFromScratch:
    """
    Agglomerative hierarchical clustering implementation from scratch.
    
    Supports single, complete, and average linkage; distances to each merged
    cluster are updated with the Lance-Williams formula.
    """
    
    def __init__(self, linkage='average'):
        """
        Parameters:
        -----------
        linkage : str
            Linkage criterion ('single', 'complete', 'average')
        """
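        # Fail early instead of silently returning zero distances later
        if linkage not in ('single', 'complete', 'average'):
            raise ValueError(f"Unsupported linkage: {linkage!r}")
        self.linkage = linkage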
        self.merge_history_ = []
        self.labels_ = None
        self.n_clusters_ = None
    
    def fit(self, X, n_clusters=3):
        """
        Perform agglomerative clustering.
        
        Parameters:
        -----------
        X : ndarray of shape (n_samples, n_features)
            Training data
        n_clusters : int
            Number of final clusters (where to cut dendrogram)
        """
        n_samples = X.shape[0]
        
        # Initialize: each point is its own cluster
        clusters = {i: [i] for i in range(n_samples)}  # cluster_id -> list of point indices
        cluster_ids = list(range(n_samples))
        
        # Compute initial distance matrix
        dist_matrix = squareform(pdist(X, metric='euclidean'))
        
        # Make diagonal infinite (can't merge cluster with itself)
        np.fill_diagonal(dist_matrix, np.inf)
        
        # Track merge history for dendrogram
        self.merge_history_ = []
        next_cluster_id = n_samples  # New clusters get IDs starting from n_samples
        
        # Merge until we have desired number of clusters
        while len(clusters) > n_clusters:
            # Find closest pair of clusters
            min_idx = np.argmin(dist_matrix)
            i, j = np.unravel_index(min_idx, dist_matrix.shape)
            
            # Ensure i < j for consistency
            if i > j:
                i, j = j, i
            
            # Record merge (cluster_i, cluster_j, distance, new_cluster_size)
            merge_distance = dist_matrix[i, j]
            new_cluster_size = len(clusters[cluster_ids[i]]) + len(clusters[cluster_ids[j]])
            self.merge_history_.append((cluster_ids[i], cluster_ids[j], merge_distance, new_cluster_size))
            
            # Merge clusters i and j
            new_cluster = clusters[cluster_ids[i]] + clusters[cluster_ids[j]]
            clusters[next_cluster_id] = new_cluster
            
            # Update distance matrix using the Lance-Williams formula for the selected linkage
            new_distances = self._update_distances(dist_matrix, i, j, 
                                                   len(clusters[cluster_ids[i]]), 
                                                   len(clusters[cluster_ids[j]]))
            


            # Remove rows/cols for merged clusters, add row/col for new cluster
            # np.delete removes both indices in one call, so no index shifting is needed
            dist_matrix = np.delete(dist_matrix, [i, j], axis=0)
            dist_matrix = np.delete(dist_matrix, [i, j], axis=1)
            
            # Add new row and column for merged cluster
            dist_matrix = np.vstack([dist_matrix, new_distances])
            new_distances_col = np.append(new_distances, np.inf)  # Distance to itself = inf
            dist_matrix = np.column_stack([dist_matrix, new_distances_col])
            
            # Update cluster_ids list
            del clusters[cluster_ids[i]]
            del clusters[cluster_ids[j]]
            cluster_ids.pop(j)  # Remove j first (higher index)
            cluster_ids.pop(i)
            cluster_ids.append(next_cluster_id)
            
            next_cluster_id += 1
        
        # Assign final labels
        self.labels_ = np.zeros(n_samples, dtype=int)
        for cluster_label, cluster_id in enumerate(cluster_ids):
            for point_idx in clusters[cluster_id]:
                self.labels_[point_idx] = cluster_label
        
        self.n_clusters_ = n_clusters
        
        return self
    
    def _update_distances(self, dist_matrix, i, j, size_i, size_j):
        """
        Update distances using Lance-Williams formula (average linkage).
        
        Parameters:
        -----------
        dist_matrix : ndarray
            Current distance matrix
        i, j : int
            Indices of clusters being merged
        size_i, size_j : int
            Sizes of clusters i and j
        
        Returns:
        --------
        new_distances : ndarray
            Distances from new merged cluster to all other clusters
        """
        n = dist_matrix.shape[0]
        new_distances = np.zeros(n - 2)  # Exclude i and j
        
        # Lance-Williams parameters for average linkage
        alpha_i = size_i / (size_i + size_j)
        alpha_j = size_j / (size_i + size_j)
        beta = 0
        gamma = 0
        
        # Compute distance from new cluster to each remaining cluster k
        k_idx = 0
        for k in range(n):
            if k == i or k == j:
                continue
            
            # Average linkage formula
            if self.linkage == 'average':
                new_distances[k_idx] = alpha_i * dist_matrix[i, k] + alpha_j * dist_matrix[j, k]
            elif self.linkage == 'single':
                new_distances[k_idx] = min(dist_matrix[i, k], dist_matrix[j, k])
            elif self.linkage == 'complete':
                new_distances[k_idx] = max(dist_matrix[i, k], dist_matrix[j, k])
            


            k_idx += 1
        
        return new_distances


print("✅ Agglomerative Clustering implemented from scratch!")
print("\nKey Methods:")
print("  • fit(X, n_clusters) - Perform clustering, merge until n_clusters remain")
print("  • _update_distances() - Lance-Williams formula for distance matrix update")
print("\nAlgorithm Flow:")
print("  1. Initialize: Each point = separate cluster")
print("  2. Compute distance matrix (all pairwise distances)")
print("  3. Repeat: Find closest pair, merge, update distances")
print("  4. Stop when desired number of clusters reached")
print("\nComplexity:")
print("  • This implementation: O(n³) - each merge updates distances in O(n) but argmin scans the full matrix")
print("  • With a priority queue for pair selection: O(n² log n)")


### 📝 What's Happening: Testing From-Scratch on Synthetic Data

**Purpose:** Validate from-scratch hierarchical clustering on known-structure data and visualize merge process.

**Key Points:**
- **Synthetic Data**: Generate 3 well-separated blobs (100 points) to verify algorithm correctness
- **Ground Truth Comparison**: Use Adjusted Rand Index to measure clustering quality vs true labels
- **Merge Tracking**: Print merge history showing which clusters combine at each step
- **Visualization**: Scatter plot with color-coded final clusters
- **Post-Silicon Context**: Similar to discovering 3 natural test groups (power, speed, I/O) from 100 parametric tests

**Why This Matters:** Testing on synthetic data (known structure) validates implementation before applying to real unlabeled data. For semiconductor test hierarchy, verifying algorithm on simple cases ensures it will correctly group related tests when applied to 500+ real parameters.

In [None]:
# Generate synthetic data with 3 clear clusters
X_blobs, y_true = make_blobs(n_samples=100, centers=3, n_features=2, 
                              cluster_std=0.5, random_state=42)

print("📊 Synthetic Data Generated:")
print(f"  • Shape: {X_blobs.shape}")
print(f"  • True clusters: {np.unique(y_true)}")
print(f"  • Feature ranges: [{X_blobs.min():.2f}, {X_blobs.max():.2f}]")

# Train from-scratch hierarchical clustering
hc_scratch = AgglomerativeClusteringFromScratch(linkage='average')
hc_scratch.fit(X_blobs, n_clusters=3)

print(f"\n✅ Hierarchical Clustering Complete!")
print(f"  • Final clusters: {hc_scratch.n_clusters_}")
print(f"  • Total merges performed: {len(hc_scratch.merge_history_)}")
print(f"  • Label distribution: {np.bincount(hc_scratch.labels_)}")

# Show last 5 merges (most important)
print(f"\n🔍 Last 5 Merge Steps:")
print(f"{'Step':<6} {'Cluster1':<10} {'Cluster2':<10} {'Distance':<12} {'New Size'}")
print("-" * 60)
for step, (c1, c2, dist, size) in enumerate(hc_scratch.merge_history_[-5:], 
                                             start=len(hc_scratch.merge_history_)-4):
    print(f"{step:<6} {c1:<10} {c2:<10} {dist:<12.3f} {size}")

# Evaluate clustering quality
ari = adjusted_rand_score(y_true, hc_scratch.labels_)
silhouette = silhouette_score(X_blobs, hc_scratch.labels_)

print(f"\n📈 Clustering Quality:")
print(f"  • Adjusted Rand Index: {ari:.4f} (1.0 = perfect match with true labels)")
print(f"  • Silhouette Score: {silhouette:.4f} (higher = better separation)")

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ground truth
axes[0].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap='viridis', 
                alpha=0.6, edgecolors='k', s=60)
axes[0].set_title("Ground Truth Clusters", fontsize=14, fontweight='bold')
axes[0].set_xlabel("Feature 1")
axes[0].set_ylabel("Feature 2")

# Predicted clusters
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=hc_scratch.labels_, cmap='viridis',
                alpha=0.6, edgecolors='k', s=60)
axes[1].set_title(f"Hierarchical Clustering (ARI={ari:.3f})", fontsize=14, fontweight='bold')
axes[1].set_xlabel("Feature 1")
axes[1].set_ylabel("Feature 2")

plt.tight_layout()
plt.show()

print("\n🔍 Interpretation:")
print("  • High ARI (~1.0): Algorithm correctly recovers true cluster structure")
print("  • Average linkage produces balanced, well-separated clusters")
print("  • Merge distances increase gradually, then sharply at final merges (natural cluster boundary)")
print("\n💡 Post-Silicon Analogy:")
print("  • 100 points = 100 parametric tests")
print("  • 3 clusters = 3 test categories (power, speed, I/O)")
print("  • Merge history reveals hierarchical structure (e.g., Vdd tests merge before joining speed tests)")

---

## 🌳 Dendrogram Visualization with scipy

### 📝 What's Happening: Building and Interpreting Dendrograms

**Purpose:** Use scipy.cluster.hierarchy to create dendrograms showing complete merge hierarchy and identify optimal cut height.

**Key Points:**
- **scipy.cluster.hierarchy.linkage()**: Computes the linkage matrix (merge sequence) from an observation matrix or a condensed distance matrix
- **Dendrogram Interpretation**: Y-axis = merge distance, horizontal bars = merges joining two clusters, vertical lines = clusters (single points at the leaves)
- **Optimal Cut Selection**: Look for large vertical gaps (long segments) indicating natural cluster boundaries
- **Linkage Comparison**: Visualize single vs complete vs average vs ward to see behavior differences
- **Cophenetic Correlation**: Measure how well dendrogram preserves original distances (>0.8 good)

**Why This Matters:** Dendrograms answer "How many clusters?" without trying K=1,2,3,... In semiconductor test hierarchy, dendrogram reveals natural test groupings at multiple granularity levels (e.g., 3 top-level categories split into 10 subcategories).

**Post-Silicon Context:** For 500 parametric tests, dendrogram shows hierarchical relationships: top level separates power/speed/I/O, second level splits power into {Vdd, Idd, leakage}, third level splits Vdd into {nominal, low_power, stress}. Engineers can cut at any level based on analysis needs.
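
Cutting at several levels can be illustrated with one linkage matrix. The sketch below uses hypothetical data (four tight groups arranged as two distant pairs) and `fcluster` with `criterion="maxclust"` to extract a coarse and a fine partition, then confirms that each fine cluster sits inside exactly one coarse cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Hypothetical data: 4 tight groups arranged as 2 well-separated pairs
centers = np.array([[0.0, 0.0], [2.0, 0.0], [20.0, 0.0], [22.0, 0.0]])
X = np.vstack([c + 0.2 * rng.normal(size=(15, 2)) for c in centers])

Z = linkage(X, method="ward")
coarse = fcluster(Z, t=2, criterion="maxclust")   # top-level categories
fine = fcluster(Z, t=4, criterion="maxclust")     # subcategories

# Map each fine cluster to the coarse cluster(s) that contain its points
parents = {int(f): np.unique(coarse[fine == f]).tolist() for f in np.unique(fine)}
for f, p in parents.items():
    print(f"fine cluster {f} -> coarse cluster(s) {p}")
```

Because both partitions come from the same tree, the hierarchy is computed once and can be cut again at any other level without refitting.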

In [None]:
# Compute linkage matrix using scipy (ward method)
linkage_matrix_ward = linkage(X_blobs, method='ward')

print("📊 Linkage Matrix (Ward's Method):")
print("  • Shape:", linkage_matrix_ward.shape, "(n_samples-1) x 4")
print("  • Columns: [cluster1_id, cluster2_id, merge_distance, new_cluster_size]")
print("\nLast 5 Merges (final steps before reaching 1 cluster):")
print(linkage_matrix_ward[-5:])

# Compute cophenetic correlation
coph_corr, coph_dist = cophenet(linkage_matrix_ward, pdist(X_blobs))
print(f"\n📏 Cophenetic Correlation: {coph_corr:.4f}")
print("  • >0.8: Excellent (dendrogram faithfully represents data structure)")
print("  • 0.6-0.8: Good")
print("  • <0.6: Poor (dendrogram distorts relationships)")

# Plot dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix_ward, 
           truncate_mode='lastp',  # Show only last p merges
           p=30,  # Show last 30 merges
           leaf_rotation=90,
           leaf_font_size=10)
plt.title("Dendrogram (Ward's Method) - Last 30 Merges", fontsize=14, fontweight='bold')
plt.xlabel("Sample Index or Cluster Size", fontsize=12)
plt.ylabel("Merge Distance (Ward Criterion)", fontsize=12)
plt.axhline(y=10, color='red', linestyle='--', linewidth=2, alpha=0.7, label='Cut at h=10 → 3 clusters')
plt.legend()
plt.tight_layout()
plt.show()

print("\n🔍 Dendrogram Interpretation:")
print("  • Vertical axis: Distance at which clusters merge")
print("  • Horizontal axis: Individual samples or merged clusters")
print("  • Long vertical lines: Large distance gaps → natural cluster boundaries")
print("  • Red dashed line: Proposed cut height (h=10) produces 3 clusters")
print("\n💡 How to Choose Cut Height:")
print("  1. Look for large gaps in Y-axis (long vertical segments)")
print("  2. Count number of vertical lines crossed by horizontal cut")
print("  3. That count = number of clusters produced")
print("  4. For h=10: Crosses 3 vertical lines → 3 clusters")

# Extract clusters from dendrogram cut
cut_height = 10
cluster_labels_cut = fcluster(linkage_matrix_ward, t=cut_height, criterion='distance')

print(f"\n✂️ Cutting Dendrogram at h={cut_height}:")
print(f"  • Number of clusters: {len(np.unique(cluster_labels_cut))}")
print(f"  • Cluster sizes: {np.bincount(cluster_labels_cut)[1:]}")  # fcluster labels start at 1

# Validate cut against ground truth
ari_cut = adjusted_rand_score(y_true, cluster_labels_cut)
print(f"  • ARI vs ground truth: {ari_cut:.4f} (validates 3-cluster choice)")

### 📝 What's Happening: Comparing Linkage Methods

**Purpose:** Visualize how different linkage criteria (single, complete, average, ward) produce different dendrograms and cluster shapes.

**Key Points:**
- **Single Linkage**: Tends to create "chain" clusters (connects via closest points) → elongated shapes
- **Complete Linkage**: Produces compact, spherical clusters (connects via farthest points) → tight groups
- **Average Linkage**: Balanced approach, robust to noise
- **Ward's Method**: Minimizes variance (like K-Means) → balanced cluster sizes, compact shapes
- **Visual Comparison**: 4 dendrograms side-by-side show merge height differences

**Why This Matters:** Linkage choice dramatically affects results. For post-silicon test hierarchy, ward/average work best for balanced test groups; single linkage useful for outlier detection (isolated tests form long chains).

**Post-Silicon Context:** When grouping 500 parametric tests:
- **Ward**: Balanced test categories (each ~50-100 tests)
- **Single**: Detects outlier tests (e.g., specialized debug tests that don't fit main categories)
- **Complete**: Ensures tight test groups (all tests in group highly correlated)
- **Average**: Robust choice when test relationships have noise
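
The outlier behavior described above can be reproduced on a small deterministic example. In this sketch the data are hypothetical: two 4x4 grids of points stand in for two test groups, and one isolated point stands in for a specialized test. Both linkages are cut into K=2 clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two 4x4 grids of points (hypothetical "test groups") plus one isolated point
g = np.arange(4) * 0.2
grid = np.array([[x, y] for x in g for y in g])
group_a = grid
group_b = grid + np.array([3.0, 0.0])
outlier = np.array([[1.8, 6.0]])
X = np.vstack([group_a, group_b, outlier])

sizes = {}
for method in ("single", "ward"):
    labels = fcluster(linkage(X, method=method), t=2, criterion="maxclust")
    sizes[method] = sorted(np.bincount(labels)[1:].tolist())  # labels start at 1
    print(f"{method:<7} cluster sizes at K=2: {sizes[method]}")
```

Single linkage leaves the isolated point as its own cluster because it joins last, while Ward absorbs it into one of the groups before the two groups merge.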

In [None]:
# Compute linkage matrices for all 4 methods
linkage_methods = ['single', 'complete', 'average', 'ward']
linkage_matrices = {}
cophenetic_corrs = {}

for method in linkage_methods:
    linkage_matrices[method] = linkage(X_blobs, method=method)
    coph_corr, _ = cophenet(linkage_matrices[method], pdist(X_blobs))
    cophenetic_corrs[method] = coph_corr

print("📊 Linkage Method Comparison:")
print(f"{'Method':<15} {'Cophenetic Corr':<20} {'Interpretation'}")
print("-" * 70)
for method in linkage_methods:
    corr = cophenetic_corrs[method]
    interpretation = "Excellent" if corr > 0.8 else "Good" if corr > 0.6 else "Fair"
    print(f"{method.capitalize():<15} {corr:<20.4f} {interpretation}")

# Visualize all 4 dendrograms
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for idx, method in enumerate(linkage_methods):
    plt.sca(axes[idx])
    dendrogram(linkage_matrices[method], 
               truncate_mode='lastp',
               p=25,
               leaf_rotation=90,
               leaf_font_size=8,
               ax=axes[idx])
    axes[idx].set_title(f"{method.capitalize()} Linkage (Cophenetic={cophenetic_corrs[method]:.3f})", 
                        fontsize=12, fontweight='bold')
    axes[idx].set_xlabel("Sample Index or Cluster Size")
    axes[idx].set_ylabel("Merge Distance")

plt.tight_layout()
plt.show()

print("\n🔍 Linkage Method Behaviors:")
print("  • Single: Lowest merge heights early (connects closest points) → chain-like clusters")
print("  • Complete: Highest merge heights (connects farthest points) → compact, tight clusters")
print("  • Average: Moderate merge heights → balanced, robust clustering")
print("  • Ward: Stepwise increases (minimizes variance) → K-Means-like balanced clusters")

print("\n💡 Post-Silicon Linkage Selection:")
print("  • Ward/Average: General-purpose test hierarchy (balanced categories)")
print("  • Single: Outlier test detection (isolates specialized/debug tests)")
print("  • Complete: Tight test groups (all tests highly correlated)")

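# Cut all dendrograms at the same height and compare cluster assignments.
# Merge heights are linkage-specific, so one height can yield a different cluster count per method.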
cut_height = 8
cluster_comparison = {}

for method in linkage_methods:
    labels = fcluster(linkage_matrices[method], t=cut_height, criterion='distance')
    cluster_comparison[method] = labels
    ari = adjusted_rand_score(y_true, labels)
    n_clusters = len(np.unique(labels))
    print(f"\n{method.capitalize()} at h={cut_height}:")
    print(f"  • Clusters: {n_clusters}, ARI: {ari:.4f}, Sizes: {np.bincount(labels)[1:]}")  # labels start at 1

---

## 🏭 Production Implementation: sklearn AgglomerativeClustering

### 📝 What's Happening: sklearn.cluster.AgglomerativeClustering

**Purpose:** Use production-grade sklearn API with connectivity constraints and memory-efficient implementations.

**Key Points:**
- **sklearn API**: Consistent `.fit_predict()` interface like K-Means
- **Linkage Options**: ward, complete, average, single (a subset of scipy's methods)
- **Connectivity Constraints**: Restrict merges to clusters that are neighbors in a connectivity graph (spatial adjacency for wafer maps)
- **Distance Threshold**: With `n_clusters=None`, merging stops at `distance_threshold`, so the number of clusters follows from the data
- **Scalability**: A sparse connectivity graph limits the candidate merges, which speeds up clustering of 10K+ points; without one, memory still grows as O(n²)

**Why This Matters:** sklearn provides hierarchical clustering behind the familiar estimator API, with features such as connectivity constraints (useful for spatial data) and a `distance_threshold` mode (the number of clusters follows from a chosen cut height). For semiconductor applications, connectivity constraints guarantee that every cluster is a spatially contiguous group of die, because only neighboring clusters can merge.

**Post-Silicon Context:** When clustering wafer die, connectivity constraints keep each cluster within a contiguous wafer region, reflecting spatial correlation from process tools. Distance threshold mode derives the number of test groupings from a chosen merge distance instead of a manually selected K.
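
For die arranged on a regular grid, the adjacency graph can be built directly from the grid shape. The sketch below is a minimal example on a hypothetical 10x10 wafer map with one parametric value per die and a shift on the right half. It uses `sklearn.feature_extraction.image.grid_to_graph`, which connects each grid cell to its direct neighbors.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

rng = np.random.default_rng(0)
n_rows, n_cols = 10, 10

# Hypothetical wafer map: one parametric value per die, shifted on the right half
wafer = np.zeros((n_rows, n_cols))
wafer[:, n_cols // 2:] = 1.0
wafer += 0.05 * rng.normal(size=wafer.shape)

X = wafer.reshape(-1, 1)                       # one sample per die, row-major order
connectivity = grid_to_graph(n_rows, n_cols)   # sparse adjacency between neighboring die

model = AgglomerativeClustering(n_clusters=2, linkage="ward", connectivity=connectivity)
label_map = model.fit_predict(X).reshape(n_rows, n_cols)
print(label_map)
```

The same pattern applies to real wafer maps if missing die are handled first, for example through the `mask` argument of `grid_to_graph`.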

In [None]:
# sklearn AgglomerativeClustering with fixed n_clusters
hc_sklearn = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_sklearn = hc_sklearn.fit_predict(X_blobs)

print("✅ sklearn Hierarchical Clustering Complete!")
print(f"  • Number of clusters: {hc_sklearn.n_clusters_}")
print(f"  • Number of leaves: {hc_sklearn.n_leaves_}")
print(f"  • Number of connected components: {hc_sklearn.n_connected_components_}")
print(f"  • Cluster sizes: {np.bincount(labels_sklearn)}")

# Evaluate sklearn clustering
ari_sklearn = adjusted_rand_score(y_true, labels_sklearn)
silhouette_sklearn = silhouette_score(X_blobs, labels_sklearn)

print(f"\n📈 sklearn Clustering Quality:")
print(f"  • Adjusted Rand Index: {ari_sklearn:.4f}")
print(f"  • Silhouette Score: {silhouette_sklearn:.4f}")

# Distance threshold mode: automatically determine number of clusters
hc_auto = AgglomerativeClustering(n_clusters=None, distance_threshold=8.0, linkage='ward')
labels_auto = hc_auto.fit_predict(X_blobs)

print(f"\n🎯 Auto-Clustering (distance_threshold=8.0):")
print(f"  • Automatically determined clusters: {hc_auto.n_clusters_}")
print(f"  • Cluster sizes: {np.bincount(labels_auto)}")
print(f"  • ARI vs ground truth: {adjusted_rand_score(y_true, labels_auto):.4f}")

# Visualize sklearn results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Ground truth
axes[0].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap='viridis',
                alpha=0.6, edgecolors='k', s=60)
axes[0].set_title("Ground Truth", fontsize=14, fontweight='bold')
axes[0].set_xlabel("Feature 1")
axes[0].set_ylabel("Feature 2")

# sklearn fixed K=3
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=labels_sklearn, cmap='viridis',
                alpha=0.6, edgecolors='k', s=60)
axes[1].set_title(f"sklearn (K=3, ARI={ari_sklearn:.3f})", fontsize=14, fontweight='bold')
axes[1].set_xlabel("Feature 1")
axes[1].set_ylabel("Feature 2")

# Auto-determined clusters
axes[2].scatter(X_blobs[:, 0], X_blobs[:, 1], c=labels_auto, cmap='viridis',
                alpha=0.6, edgecolors='k', s=60)
axes[2].set_title(f"Auto (K={hc_auto.n_clusters_}, threshold=8.0)", fontsize=14, fontweight='bold')
axes[2].set_xlabel("Feature 1")
axes[2].set_ylabel("Feature 2")



plt.tight_layout()
plt.show()

print("\n✅ sklearn vs scipy Comparison:")
print("  • For the same linkage and metric, both should yield the same partition (label numbering may differ)")
print("  • sklearn: Better API, connectivity constraints, auto K via distance_threshold")
print("  • scipy: Better dendrogram visualization, more linkage options")
print("\n💡 Production Recommendation:")
print("  • Use scipy for exploratory analysis (dendrograms, linkage comparison)")
print("  • Use sklearn for production pipelines (consistent API, scalability)")

# Demonstrate connectivity constraints (spatial adjacency)
print("\n🔗 Connectivity Constraints Example:")
print("  • Restrict merges to neighboring points so clusters stay spatially contiguous")
print("  • Useful for wafer maps (die on same wafer quadrant should cluster)")
from sklearn.neighbors import kneighbors_graph

# Create connectivity matrix (each point connected to 5 nearest neighbors)
connectivity = kneighbors_graph(X_blobs, n_neighbors=5, include_self=False)
hc_constrained = AgglomerativeClustering(n_clusters=3, linkage='ward', connectivity=connectivity)
labels_constrained = hc_constrained.fit_predict(X_blobs)

ari_constrained = adjusted_rand_score(y_true, labels_constrained)
print(f"  • Constrained clustering ARI: {ari_constrained:.4f}")
print(f"  • Use case: Ensure spatially adjacent wafer die cluster together")

---

## 🏭 Real-World Application: Parametric Test Hierarchy Discovery

### Post-Silicon Validation Use Case

**Business Problem:** Semiconductor test programs contain 200-500 parametric tests measuring voltage, current, frequency, power, timing, etc. Engineers need to:
1. Organize tests into logical hierarchies (power→Vdd→nominal/low_power)
2. Identify redundant tests (high correlation) for test time optimization
3. Build failure mode taxonomies from test signatures
4. Understand multi-level test relationships without manual categorization

**Hierarchical Clustering Solution:** Cluster test results across 10K+ devices to discover natural test groupings and hierarchical relationships automatically.

### 📝 What's Happening: Test Hierarchy Discovery

**Purpose:** Apply hierarchical clustering to realistic semiconductor parametric test data (50 tests × 1000 devices) to discover test categories.

**Key Points:**
- **Feature Matrix**: 50 parametric tests (Vdd, Idd, freq, leakage, timing) measured on 1000 devices
- **Hierarchical Structure**: Discover 3 top-level categories (power, speed, leakage) and 10 subcategories
- **Dendrogram Interpretation**: Visualize test relationships, identify highly correlated test pairs (candidates for elimination)
- **Business Value**: Removing 15-20% redundant tests saves 10-15 seconds per device; at 1M devices/month that is roughly 33K-50K tester-hours per year, estimated at $500K-1M/year depending on tester cost
- **Failure Taxonomy**: Group tests by failure modes to accelerate root cause analysis

**Why This Matters:** Manual test categorization takes weeks and is subjective; hierarchical clustering provides objective, data-driven taxonomy in minutes. For 500+ test parameters, automated hierarchy discovery is essential for maintaining test program organization as product evolves.

In [None]:
# Generate realistic parametric test data
np.random.seed(42)
n_devices = 1000
n_tests = 50

# Simulate 3 test categories with hierarchical structure
# Category 1: Power tests (20 tests) - high correlation within group
power_tests = np.random.multivariate_normal(
    mean=np.zeros(20),
    cov=np.eye(20) * 0.3 + 0.7,  # High correlation (0.7)
    size=n_devices
)

# Category 2: Speed tests (15 tests) - moderate correlation
speed_tests = np.random.multivariate_normal(
    mean=np.ones(15),
    cov=np.eye(15) * 0.5 + 0.5,  # Moderate correlation (0.5)
    size=n_devices
)

# Category 3: Leakage tests (15 tests) - lower correlation
leakage_tests = np.random.multivariate_normal(
    mean=np.ones(15) * 2,
    cov=np.eye(15) * 0.7 + 0.3,  # Lower correlation (0.3)
    size=n_devices
)

# Combine into full test matrix (shape: devices x tests; transposed below to cluster tests)
X_tests = np.column_stack([power_tests, speed_tests, leakage_tests])
X_tests_normalized = (X_tests - X_tests.mean(axis=0)) / X_tests.std(axis=0)

# Cluster tests (not devices) - compute test similarity
# Use test vectors as data points (each test = 1000-dimensional vector of device results)
test_vectors = X_tests_normalized.T  # Shape: (50 tests, 1000 devices)

print("📊 Parametric Test Data Generated:")
print(f"  • Number of devices: {n_devices}")
print(f"  • Number of tests: {n_tests}")
print(f"  • Test categories: Power (20), Speed (15), Leakage (15)")
print(f"  • Test vector shape: {test_vectors.shape}")

# Compute test similarity using correlation distance
from scipy.spatial.distance import pdist
test_distances = pdist(test_vectors, metric='correlation')  # 1 - correlation

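# Hierarchical clustering of tests
# Caveat: SciPy documents Ward as correctly defined only for Euclidean distances,
# so merge heights computed from correlation distance are heuristic here
linkage_matrix_tests = linkage(test_distances, method='ward')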

# Compute cophenetic correlation
coph_corr_tests, _ = cophenet(linkage_matrix_tests, test_distances)
print(f"\n📏 Test Hierarchy Cophenetic Correlation: {coph_corr_tests:.4f}")

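# Derive cut heights from the linkage matrix instead of hard-coding them:
# a cut between the (k-1)-th and k-th largest merge heights yields k clusters
merge_heights = linkage_matrix_tests[:, 2]
cut_3 = (merge_heights[-3] + merge_heights[-2]) / 2     # 3 top-level categories
cut_10 = (merge_heights[-10] + merge_heights[-9]) / 2   # 10 subcategories

# Plot dendrogram
plt.figure(figsize=(16, 8))
dendrogram(linkage_matrix_tests,
           labels=[f"T{i+1:02d}" for i in range(n_tests)],  # Test labels T01-T50
           leaf_rotation=90,
           leaf_font_size=10,
           color_threshold=cut_3)  # Color each top-level category separately
plt.title("Test Hierarchy Dendrogram (Ward's Linkage, Correlation Distance)", 
          fontsize=14, fontweight='bold')
plt.xlabel("Test ID", fontsize=12)
plt.ylabel("Merge Distance (Ward Criterion)", fontsize=12)
plt.axhline(y=cut_3, color='red', linestyle='--', linewidth=2, alpha=0.7,
            label=f'Cut at h={cut_3:.2f} → 3 categories')
plt.axhline(y=cut_10, color='orange', linestyle='--', linewidth=2, alpha=0.7,
            label=f'Cut at h={cut_10:.2f} → 10 subcategories')
plt.legend()
plt.tight_layout()
plt.show()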

# Cut dendrogram at different heights for multi-level hierarchy


# criterion='maxclust' requests the cluster count directly, so the result does not
# depend on guessing a merge height for this particular distance scale
test_categories_3 = fcluster(linkage_matrix_tests, t=3, criterion='maxclust')
test_categories_10 = fcluster(linkage_matrix_tests, t=10, criterion='maxclust')

print(f"\n🌳 Multi-Level Test Hierarchy:")
print(f"  • Top level: {len(np.unique(test_categories_3))} categories")
print(f"    - Category sizes: {np.bincount(test_categories_3)[1:]}")  # fcluster labels start at 1
print(f"  • Second level: {len(np.unique(test_categories_10))} subcategories")
print(f"    - Subcategory sizes: {np.bincount(test_categories_10)[1:]}")

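# Identify highly correlated test pairs (candidates for redundancy reduction)
# Note: the simulated within-category correlations are at most ~0.7, so this synthetic
# dataset is expected to yield no pairs above 0.95; the threshold targets real test data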
correlation_matrix = np.corrcoef(test_vectors)
high_corr_pairs = []

for i in range(n_tests):
    for j in range(i+1, n_tests):
        if correlation_matrix[i, j] > 0.95:  # Very high correlation
            high_corr_pairs.append((i, j, correlation_matrix[i, j]))

print(f"\n🔍 Redundant Test Detection:")
print(f"  • Test pairs with correlation >0.95: {len(high_corr_pairs)}")
if high_corr_pairs:
    print(f"  • Example: Test {high_corr_pairs[0][0]+1} ↔ Test {high_corr_pairs[0][1]+1} (r={high_corr_pairs[0][2]:.3f})")
    print(f"  • Recommendation: Consider removing one test from each highly correlated pair")
    print(f"  • Potential test time reduction: {len(high_corr_pairs)} tests × 0.3s = {len(high_corr_pairs)*0.3:.1f}s per device")

print(f"\n💰 Business Impact:")
print(f"  • Removing {len(high_corr_pairs)} redundant tests:")
print(f"  • Test time saved: {len(high_corr_pairs)*0.3:.1f}s per device")
print(f"  • Annual devices: 1M")
print(f"  • Total time saved: {len(high_corr_pairs)*0.3*1e6/3600:.0f} hours/year")
print(f"  • Cost savings at $100/tester-hour: ${len(high_corr_pairs)*0.3*1e6/3600*100:,.0f}/year")

print(f"\n📋 Test Hierarchy Summary:")
print(f"  • Level 1 (3 categories): Power, Speed, Leakage")
print(f"  • Level 2 (10 subcategories): Vdd/Idd/Power within Power, Freq/Timing within Speed, etc.")
print(f"  • Engineers can navigate test program at appropriate granularity")
print(f"  • Failure analysis: Match failure signature to leaf category → narrow root cause")

---

## 🎯 Real-World Projects (Not Exercises!)

Each project includes clear objectives, business value, and implementation guidance.

### Post-Silicon Validation Projects

#### 1. 🏭 Automatic Test Program Organization System
**Objective:** Build hierarchical taxonomy of 500+ parametric tests across multiple product generations, automatically maintaining structure as tests evolve.

**Business Value:** $2M+ annual engineering efficiency (reduce test program maintenance from 100 hours/month → 10 hours with auto-categorization).

**Key Features:**
- Input: 500 tests × 100K devices parametric data per product
- Multi-level hierarchy: 3 top categories → 15 mid-level → 60 leaf groups
- Dendrogram visualization: Interactive plot with zoom, hover tooltips showing test names
- Redundancy detection: Flag test pairs with correlation >0.95 for elimination review
- Temporal tracking: Monitor how test hierarchy changes across product generations

**Implementation Hints:**
- Use correlation distance metric (1 - Pearson correlation) for test similarity
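- Apply Ward linkage for balanced categories; because Ward assumes Euclidean distances, run it on z-scored test vectors, or pair correlation distance with average linkage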
- Store linkage matrix + cut heights in database for reproducibility
- Build web dashboard with Plotly for dendrogram exploration
- Alert system: Email when new test doesn't fit existing categories (novel functionality)

**Success Metrics:** Achieve 90%+ agreement with manual expert categorization, reduce test program review time from 2 weeks → 2 days per new product.
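
The hint about storing the linkage matrix and cut heights can be sketched as follows. This is a minimal illustration, not a database design: the random `test_vectors`, the file name `test_hierarchy.npz`, and the choice of a local `.npz` file instead of a database are all assumptions made for the example.

```python
import os
import tempfile

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Hypothetical stand-in for real data: 12 tests x 200 devices
test_vectors = rng.normal(size=(12, 200))

# Average linkage on correlation distance (1 - Pearson correlation)
Z = linkage(pdist(test_vectors, metric="correlation"), method="average")

# Height halfway between the two merges that separate 3 groups from 2
cut_heights = np.array([0.5 * (Z[-3, 2] + Z[-2, 2])])

# Persist the tree and the cut so the same categories can be rebuilt later
path = os.path.join(tempfile.mkdtemp(), "test_hierarchy.npz")
np.savez(path, linkage=Z, cut_heights=cut_heights)

labels_before = fcluster(Z, t=cut_heights[0], criterion="distance")
with np.load(path) as stored:
    labels_after = fcluster(stored["linkage"], t=stored["cut_heights"][0],
                            criterion="distance")

print("Reproduced labels match:", np.array_equal(labels_before, labels_after))
```

Storing the linkage matrix rather than only the labels keeps every level of the hierarchy recoverable, since any cut can be recomputed from it.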

---

#### 2. 📊 Failure Mode Taxonomy Builder
**Objective:** Automatically discover hierarchical failure patterns from 10K+ failed device test signatures to accelerate root cause analysis.

**Business Value:** $5M+ quarterly yield recovery by reducing failure analysis time from 3 days → 4 hours per systematic excursion.

**Key Features:**
- Input: Test signatures from 10K failed devices (50-100 parametric tests)
- Hierarchical failure clusters: Top level = failure domain (power/speed/leakage), mid level = specific parameter group, leaf = test combination
- Signature matching: New failures automatically assigned to closest cluster (nearest neighbor in dendrogram)
- Root cause hints: Each leaf cluster linked to known failure mechanisms from historical data
- Visualization: Dendrogram colored by failure frequency (red = common, green = rare)

**Implementation Hints:**
- Normalize test results to z-scores so tests with different units and scales contribute equally
- Use complete linkage (tight failure mode definition)
- Track cluster appearance over time (new clusters = novel failure modes)
- Integrate with JIRA: Auto-create tickets for new failure clusters
- Machine learning enhancement: Train classifier on cluster labels for fast triage

**Success Metrics:** Classify 85%+ of failures into known categories within 30 seconds, reduce unknown failure investigation from 40% → 15% of total FA time.
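
One way to prototype the signature-matching feature is nearest-centroid assignment, sketched below. Note that this swaps in cluster centroids as a simple stand-in for "nearest neighbor in dendrogram"; the synthetic failure modes and the helper `assign_cluster` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
# Hypothetical z-scored failure signatures: two synthetic failure modes, 8 tests each
mode_a = rng.normal(loc=-3.0, scale=0.5, size=(30, 8))
mode_b = rng.normal(loc=3.0, scale=0.5, size=(30, 8))
signatures = np.vstack([mode_a, mode_b])

# Complete linkage, as suggested in the hints, cut into 2 failure clusters
Z = linkage(signatures, method="complete")
labels = fcluster(Z, t=2, criterion="maxclust")

# Summarize each cluster by its centroid
cluster_ids = np.unique(labels)
centroids = np.vstack([signatures[labels == c].mean(axis=0) for c in cluster_ids])


def assign_cluster(new_signatures):
    """Return the id of the nearest cluster centroid for each new signature."""
    distances = cdist(np.atleast_2d(new_signatures), centroids)
    return cluster_ids[distances.argmin(axis=1)]


# A new failure resembling mode_b should land in mode_b's cluster
new_failure = np.full(8, 2.8)
assigned = assign_cluster(new_failure)
print("Assigned cluster:", assigned[0])
```

Centroid lookup avoids rebuilding the tree for every new device, at the cost of ignoring cluster shape.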

---

#### 3. 🔗 Multi-Site Test Correlation Engine
**Objective:** Discover hierarchical relationships between 4-6 test sites (wafer probe, final test, system level) to optimize test content and reduce redundancy.

**Business Value:** $3M+ annual savings by eliminating 20-30% redundant tests across multi-site flow without yield impact.

**Key Features:**
- Input: Parametric data from 50K devices tested at 3 sites (wafer test, final test, board test)
- Cross-site hierarchy: Cluster tests across all sites, identify which site tests overlap
- Transfer function: For overlapping tests, model wafer→final correlation (predict final from wafer)
- Optimization: Recommend test moves (e.g., "Move tests 15-20 from final→wafer, saves 5s/device")
- Risk analysis: Quantify yield impact of removing redundant tests (confidence intervals)

**Implementation Hints:**
- Concatenate test vectors from all sites (150-dimensional for 3 sites × 50 tests)
- Use average linkage (robust to cross-site measurement noise)
- Redundancy threshold: flag cross-site test pairs with correlation >0.75; use cophenetic correlation only to check how faithfully the dendrogram preserves the original distances
- Simulation: A/B test on 10K devices before production deployment
- Business case calculator: TCO model (test time savings vs yield risk vs equipment cost)

**Success Metrics:** Identify 15-25 redundant tests with 95%+ confidence, achieve 8-12 second test time reduction per device, maintain <0.1% yield loss.
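
The wafer→final transfer function could start as a simple least-squares line, as in this sketch. The paired measurements are simulated, and the linear form itself is an assumption; real cross-site relationships may need a richer model.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical paired measurements of one overlapping test at two sites
wafer = rng.normal(loc=1.0, scale=0.1, size=500)
final = 0.95 * wafer + 0.02 + rng.normal(scale=0.01, size=500)

# Fit final ≈ slope * wafer + intercept
slope, intercept = np.polyfit(wafer, final, deg=1)
predicted = slope * wafer + intercept

# R² measures how much of the final-test variance the wafer result explains
ss_res = np.sum((final - predicted) ** 2)
ss_tot = np.sum((final - final.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R²={r_squared:.3f}")
```

A high R² suggests the final-test measurement adds little beyond the wafer result, which makes the pair a candidate for the risk analysis described above.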

---

#### 4. 🌳 Die Similarity Tree for Spatial Analysis
**Objective:** Build hierarchical tree of wafer die based on parametric profiles + spatial location to identify systematic spatial patterns.

**Business Value:** $2M+ quarterly by detecting spatial yield patterns 3-5 days faster than manual wafer map inspection.

**Key Features:**
- Input: 300 die per wafer × 50 parametric tests + (x,y) coordinates
- Hybrid distance: Combine parametric similarity (correlation) + spatial proximity (Euclidean)
- Multi-level spatial clusters: Wafer zones → quadrants → local regions → individual die
- Dendrogram coloring: Color by spatial location to visualize spatial coherence
- Anomaly detection: Isolated die in dendrogram = outliers (potential yield loss)

**Implementation Hints:**
- Feature engineering: [parametric_z_scores × 2, x_coordinate, y_coordinate] (weight parametric 2:1)
- Use ward linkage with spatial connectivity constraints (sklearn kneighbors_graph)
- Visualize: Side-by-side dendrogram + wafer map with cluster colors
- Temporal analysis: Track how die similarity tree changes across wafer lots
- Integration: Feed spatial clusters into failure analysis workflow

**Success Metrics:** Detect 90%+ of spatial excursions within 2 hours of wafer test completion, reduce manual wafer map reviews from 200 → 50 per week.
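
The hybrid feature and connectivity hints can be combined as in the sketch below. The 15 × 15 die grid, the synthetic left-half shift, and `n_neighbors=8` are assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(3)
# Hypothetical wafer: 15 x 15 die grid with 5 parametric tests per die
xs, ys = np.meshgrid(np.arange(15), np.arange(15))
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
params = rng.normal(size=(coords.shape[0], 5))
left_half = coords[:, 0] < 7
params[left_half] += 4.0  # synthetic spatial excursion on the left half


def zscore(a):
    """Standardize each column to zero mean and unit variance."""
    return (a - a.mean(axis=0)) / a.std(axis=0)


# Weight parametric z-scores 2:1 against the (x, y) coordinates
features = np.hstack([2.0 * zscore(params), zscore(coords)])

# Only die that are spatial neighbors may be merged
connectivity = kneighbors_graph(coords, n_neighbors=8, include_self=False)

model = AgglomerativeClustering(n_clusters=2, linkage="ward",
                                connectivity=connectivity)
die_labels = model.fit_predict(features)
print("Cluster sizes:", np.bincount(die_labels))
```

Building the connectivity graph from coordinates only, rather than from the full feature matrix, is what ties the constraint to physical adjacency on the wafer.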

---

### General AI/ML Projects

#### 5. 📄 Document Taxonomy for Knowledge Management
**Objective:** Automatically organize 50K enterprise documents into hierarchical categories for improved search and knowledge discovery.

**Business Value:** $3M+ annual productivity improvement (reduce document search time from 15 min → 2 min avg per search × 100K searches/year).

**Key Features:**
- Input: 50K documents (PDFs, Word, emails) converted to TF-IDF vectors
- Multi-level taxonomy: 10 top categories → 50 mid-level → 200 leaf categories
- Auto-tagging: New documents automatically assigned to leaf categories
- Search enhancement: Hierarchical navigation (drill down from broad → specific)
- Anomaly detection: Documents that don't fit existing taxonomy = potential new topics

**Implementation Hints:**
- Use TF-IDF or sentence embeddings (BERT) for document vectors
- Average linkage for balanced categories
- Interactive dendrogram with document counts at each node
- Store taxonomy in graph database (Neo4j) for fast hierarchical queries
- User feedback loop: Allow manual reclassification to improve hierarchy

**Success Metrics:** Achieve 80%+ precision/recall vs manual categorization, reduce "document not found" rate from 25% → 8%.
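
A toy version of the TF-IDF plus average-linkage pipeline is sketched below. The six documents are invented, and cosine distance is an assumed choice that is common for TF-IDF vectors.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus with two obvious topics
docs = [
    "wafer probe test yield report",
    "wafer yield excursion test summary",
    "final test yield report wafer",
    "employee vacation policy handbook",
    "vacation policy update for employee benefits",
    "employee benefits handbook policy",
]

# Dense TF-IDF vectors are fine at this size; large corpora need sparse handling
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Average linkage on cosine distance, cut into 2 top-level topics
Z = linkage(pdist(tfidf, metric="cosine"), method="average")
topics = fcluster(Z, t=2, criterion="maxclust")

for doc, topic in zip(docs, topics):
    print(topic, doc)
```

For 50K documents the full pairwise distance matrix becomes the bottleneck, so a real system would likely cluster a sample or reduce dimensionality first.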

---

#### 6. 🧬 Gene Expression Clustering for Bioinformatics
**Objective:** Discover hierarchical relationships among 20K genes across 100 patient samples to identify disease subtypes and biomarker candidates.

**Business Value:** $50M+ drug development acceleration by identifying target gene clusters 6-12 months faster than manual curation.

**Key Features:**
- Input: 20K genes × 100 patients expression matrix (RNA-seq data)
- Two-way clustering: Cluster genes (rows) AND patients (columns)
- Heatmap visualization: Dendrogram-ordered heatmap shows gene modules
- Biomarker discovery: Genes in same leaf cluster = co-regulated → functional pathway
- Patient stratification: Hierarchical patient clusters = disease subtypes

**Implementation Hints:**
- Use correlation distance (genes with similar expression patterns cluster)
- Average linkage for robust gene modules
- Two dendrograms: One for genes (rows), one for patients (columns)
- Statistical validation: Bootstrap resampling to assess cluster stability
- Pathway enrichment: Link gene clusters to known biological pathways (KEGG, GO)

**Success Metrics:** Discover 5-8 reproducible gene modules, identify 3+ novel biomarker candidates, stratify patients into 4-6 disease subtypes with clinical relevance.
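
Two-way clustering for a dendrogram-ordered heatmap can be sketched by clustering rows and columns separately and reordering the matrix by leaf order. The 40 × 12 expression matrix and its single synthetic module are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
# Hypothetical expression matrix: 40 genes x 12 patients
expression = rng.normal(size=(40, 12))
expression[:20, :6] += 3.0  # synthetic module: 20 genes elevated in 6 patients

# One tree for genes (rows), one for patients (columns)
gene_tree = linkage(pdist(expression, metric="correlation"), method="average")
patient_tree = linkage(pdist(expression.T, metric="correlation"), method="average")

# Leaf order of each dendrogram gives the row/column order of the heatmap
gene_order = leaves_list(gene_tree)
patient_order = leaves_list(patient_tree)
heatmap_matrix = expression[np.ix_(gene_order, patient_order)]

print("Reordered matrix shape:", heatmap_matrix.shape)
```

Passing `heatmap_matrix` to a plotting call such as `plt.imshow` would then show co-regulated genes as contiguous blocks.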

---

#### 7. 🛒 Product Catalog Hierarchy for E-Commerce
**Objective:** Automatically build multi-level product taxonomy from 100K products based on attributes, descriptions, and purchase co-occurrence.

**Business Value:** $10M+ annual revenue increase (improve product discovery, reduce search abandonment from 35% → 22%).

**Key Features:**
- Input: 100K products with attributes (brand, price, category, description embeddings) + co-purchase matrix
- 4-level hierarchy: Department → Category → Subcategory → Product clusters
- Dynamic taxonomy: Updates weekly as new products added, trends shift
- Recommendation enhancement: Products in same leaf cluster = "similar items"
- Search relevance: Use hierarchy for query expansion (search "laptop" returns all devices in laptop cluster)

**Implementation Hints:**
- Combine attribute similarity (Euclidean) + co-purchase affinity (Jaccard)
- Ward linkage for balanced categories (each level has 5-20 children)
- Store in hierarchical database (parent-child relationships)
- A/B test: Hierarchical navigation vs flat search (conversion rate metric)
- Seasonality handling: Recompute hierarchy quarterly to capture trend changes

**Success Metrics:** Increase product discovery rate from 60% → 80%, reduce category navigation depth from 5 clicks → 3 clicks avg, improve conversion 8-12%.
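
Combining attribute similarity with co-purchase affinity might look like the sketch below. The random data, the weight `alpha`, and the max-rescaling of Euclidean distances are assumptions. Average linkage is used here instead of Ward because the blended dissimilarity is no longer Euclidean.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
n_products = 30
attributes = rng.normal(size=(n_products, 4))      # hypothetical numeric attributes
baskets = rng.random(size=(n_products, 50)) < 0.2  # hypothetical: product i in basket j

# Euclidean distance on attributes, rescaled to [0, 1] to match Jaccard's range
d_attr = pdist(attributes, metric="euclidean")
d_attr = d_attr / d_attr.max()

# Jaccard distance on basket membership (0 = always bought together)
d_copurchase = pdist(baskets, metric="jaccard")

alpha = 0.5  # assumed weighting between the two signals; tune per catalog
d_combined = alpha * d_attr + (1 - alpha) * d_copurchase

Z = linkage(d_combined, method="average")
departments = fcluster(Z, t=5, criterion="maxclust")
print("Department sizes:", np.bincount(departments)[1:])
```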

---

#### 8. 🎵 Music Genre Taxonomy for Streaming Service
**Objective:** Build hierarchical genre classification from 1M songs using audio features (tempo, key, spectral features) + user listening patterns.

**Business Value:** $20M+ annual retention improvement (reduce churn 2% via better personalization) + $5M playlist curation efficiency.

**Key Features:**
- Input: 1M songs with audio features (Spotify API: tempo, key, energy, valence, etc.) + user co-listen matrix
- Genre hierarchy: Top = broad genres (rock, pop, electronic) → subgenres → micro-genres
- Playlist generation: Auto-create hierarchical playlists ("Electronic → House → Deep House")
- Discovery algorithm: Recommend songs from neighboring leaf clusters (exploration within taste)
- Temporal trends: Track how genre hierarchy evolves (new micro-genres emerge)

**Implementation Hints:**
- Standardize audio features (tempo in BPM, key in 0-11, etc.)
- Average linkage for balanced genre tree
- Validate with human curators: Genre labels should match industry taxonomy 70%+
- Hybrid approach: Combine hierarchical clustering with user tags (collaborative + content-based)
- Interactive visualization: D3.js tree with song samples at leaf nodes

**Success Metrics:** Match human genre labels 75%+ accuracy, generate 1000+ playlists with 8+ satisfaction rating, increase avg listening time 12-15 minutes/user.

---

## 🎓 Key Takeaways & Best Practices

### ✅ When to Use Hierarchical Clustering

1. **K is unknown**: Don't know how many clusters exist → dendrogram reveals natural groupings at multiple levels
2. **Taxonomy needed**: Want hierarchical relationships (not just flat clusters) → e.g., test hierarchy, failure mode tree
3. **Small-medium data (<5K points)**: O(n²) memory and O(n² log n) time remain manageable, and the dendrogram is still readable
4. **Deterministic results required**: Same distance matrix always produces same dendrogram (vs K-Means random init)
5. **Exploratory analysis**: Understand data structure before committing to specific K
6. **Multiple granularities**: Need clustering at different resolutions (e.g., 3 top categories, 10 subcategories)

**Example Scenarios:**
- ✅ Test hierarchy discovery (500 tests, unknown natural groupings)
- ✅ Failure mode taxonomy (build tree of failure patterns)
- ✅ Document organization (discover multi-level topic structure)
- ✅ Gene expression analysis (identify gene modules, patient subtypes)

### ❌ When NOT to Use Hierarchical Clustering

1. **Large data (>10K points)**: O(n² log n) time and O(n²) memory become prohibitive; K-Means/MiniBatchKMeans are typically orders of magnitude faster
2. **Spherical clusters with known K**: K-Means more efficient, produces similar results
3. **Outliers dominate**: Hierarchical forces all points into clusters, DBSCAN better for noise handling
4. **Real-time inference**: Clustering new points requires full recomputation, K-Means predicts instantly
5. **Greedy merge problem**: Once clusters merge, can't un-merge → divisive clustering or K-Means may be better

**Example Scenarios:**
- ❌ Customer segmentation with 1M users (use K-Means or MiniBatchKMeans)
- ❌ Real-time anomaly detection (use Isolation Forest or LOF)
- ❌ Image clustering with 100K images (use K-Means after dimensionality reduction)
- ❌ Geospatial clusters with noise (use DBSCAN)

### 🔍 Hierarchical vs K-Means vs DBSCAN - Decision Framework

| **Use Hierarchical When...** | **Use K-Means When...** | **Use DBSCAN When...** |
|------------------------------|------------------------|----------------------|
| K unknown, need exploration | K known or easily determined | K unknown, outliers present |
| Want hierarchical taxonomy | Want fast, scalable clustering | Want arbitrary cluster shapes |
| <5K points, can afford O(n²) | 10K-1M+ points, need O(nkt) | Geospatial, density-based patterns |
| Deterministic results critical | Random init acceptable | Noise handling critical |
| Dendrogram visualization valuable | Centroids provide interpretability | A meaningful neighborhood radius (`eps`) can be chosen |

**Post-Silicon Decision Tree:**
```
IF test hierarchy discovery (500 tests) → Hierarchical (ward/average)
ELSE IF wafer clustering (50K die) → K-Means or MiniBatchKMeans
ELSE IF spatial defect detection (outliers) → DBSCAN
ELSE IF failure taxonomy (10K failures) → Hierarchical (complete linkage)
ELSE IF real-time test triage → K-Means (pre-trained centroids)
```

### 🔧 Linkage Method Selection Guide

| **Linkage** | **Best For** | **Cluster Shape** | **Outlier Sensitivity** | **Post-Silicon Use Case** |
|------------|-------------|-------------------|------------------------|--------------------------|
| **Single** | Outlier detection, chain structures | Elongated, irregular | High (creates long chains) | Identify isolated tests that don't fit categories |
| **Complete** | Tight, compact clusters | Spherical, well-separated | Medium-high (merge distance is set by the farthest pair, so one outlier can delay a merge) | Group highly correlated tests (>0.8) |
| **Average** | General-purpose, robust | Balanced, moderate | Medium (robust to noise) | Default for test hierarchy (balanced categories) |
| **Ward** | Balanced sizes, K-Means-like | Spherical, equal-sized | Low (variance minimization) | Organize tests into equal-sized groups for parallel execution |

**Recommendation Matrix:**
- **Test hierarchy**: Ward or Average (balanced categories, robust)
- **Failure taxonomy**: Complete (tight failure mode definition)
- **Die similarity**: Ward with spatial connectivity (balanced spatial clusters)
- **Redundancy detection**: Average (robust to measurement noise)
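
When the table does not settle the choice, one empirical check is to compare the cophenetic correlation of each linkage on the data at hand. The sketch below uses synthetic blobs as a stand-in for real data.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(6)
# Synthetic stand-in: three well-separated 2-D blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in (0.0, 4.0, 8.0)])
distances = pdist(X, metric="euclidean")

# Cophenetic correlation: how well each tree preserves the original distances
scores = {}
for method in ("single", "complete", "average", "ward"):
    Z = linkage(distances, method=method)
    scores[method], _ = cophenet(Z, distances)

for method, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{method:>8}: cophenetic correlation = {score:.3f}")
```

A higher score means the tree distorts pairwise distances less. It does not by itself say the resulting clusters are more useful, so domain fit still matters.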

### 🔧 Implementation Best Practices

1. **Distance Metric Matters**: Choose based on data type
   ```python
   # For continuous features
   linkage(X, method='ward', metric='euclidean')
   
   # For test correlation (test vectors)
   linkage(pdist(X, metric='correlation'), method='average')
   
   # For binary features
   linkage(pdist(X, metric='jaccard'), method='complete')
   ```

2. **Dendrogram Cut Height Selection**:
   - **Visual inspection**: Look for large vertical gaps (natural boundaries)
   - **Elbow method**: Plot number of clusters vs within-cluster variance
   - **Merge-distance jumps**: Cut just below the largest jump between successive merge heights (third column of the linkage matrix)
   - **Domain knowledge**: Post-silicon example - 3 top categories (power/speed/leakage)

3. **Cophenetic Correlation Validation**:
   ```python
   coph_corr, coph_dist = cophenet(linkage_matrix, pdist(X))  # pdist metric must match the one used for linkage
   # >0.8: Excellent (dendrogram faithfully represents data)
   # 0.6-0.8: Good
   # <0.6: Poor (consider different linkage or distance metric)
   ```

4. **Memory Efficiency for Large Data**:
   ```python
   # For 5K-10K points, use condensed distance matrix
   distances = pdist(X, metric='euclidean')  # Saves memory
   linkage_matrix = linkage(distances, method='average')
   
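   # For 10K+ points, use sklearn with a sparse connectivity graph to limit candidate merges
   from sklearn.cluster import AgglomerativeClustering
   from sklearn.neighbors import kneighbors_graph
   connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
   hc = AgglomerativeClustering(n_clusters=None, distance_threshold=10, connectivity=connectivity)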
   ```

5. **Multi-Level Hierarchy Extraction**:
   ```python
   # criterion='maxclust' requests a cluster count directly;
   # use criterion='distance' with t=<height> to cut at a dendrogram height instead
   # Top level: 3 categories
   labels_L1 = fcluster(linkage_matrix, t=3, criterion='maxclust')
   
   # Mid level: 10 subcategories
   labels_L2 = fcluster(linkage_matrix, t=10, criterion='maxclust')
   
   # Leaf level: 30 fine-grained groups
   labels_L3 = fcluster(linkage_matrix, t=30, criterion='maxclust')
   ```

6. **Connectivity Constraints (Spatial Data)**:
   ```python
   from sklearn.neighbors import kneighbors_graph
   
   # Only neighboring points may merge, which keeps clusters spatially contiguous
   connectivity = kneighbors_graph(X, n_neighbors=5, include_self=False)
   hc = AgglomerativeClustering(n_clusters=3, connectivity=connectivity)
   ```
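
The "large vertical gap" heuristic from the cut-height guidance (item 2) can be automated. The following sketch picks the cut in the middle of the largest jump between successive merge heights; the blob data is synthetic and the heuristic is only one of several reasonable choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(7)
# Synthetic stand-in: three blobs at the corners of a triangle
centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in centers])

Z = linkage(X, method="ward")
heights = Z[:, 2]          # merge heights, sorted ascending for Ward
gaps = np.diff(heights)    # jump between consecutive merges

# Cut in the middle of the largest jump
i = int(np.argmax(gaps))
cut_height = 0.5 * (heights[i] + heights[i + 1])

labels = fcluster(Z, t=cut_height, criterion="distance")
n_clusters = len(np.unique(labels))
print(f"Cut at h={cut_height:.2f} gives {n_clusters} clusters")
```

The largest gap often sits at the very top of the tree and yields only two clusters, so it is worth inspecting the next few gaps as well before committing to a cut.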

### ⚠️ Common Pitfalls

1. **Ignoring Distance Metric**: Using Euclidean for test correlation → use 'correlation' distance
2. **Single Linkage Chaining**: Single linkage creates long chains for noisy data → use average/ward
3. **No Dendrogram Inspection**: Cutting blindly without visual inspection → miss natural boundaries
4. **Scalability Ignorance**: Applying to 100K+ points → use K-Means or sampling instead
5. **Not Validating Cophenetic Correlation**: Low correlation (<0.6) means dendrogram doesn't represent data well

### 📊 Evaluation Metrics

| **Metric** | **Formula/Method** | **Interpretation** | **Ideal Value** |
|-----------|-------------------|-------------------|----------------|
| **Cophenetic Correlation** | Correlation between original distances and cophenetic distances | Dendrogram faithfulness | >0.8 excellent, 0.6-0.8 good |
| **Silhouette Score** | $\frac{b - a}{\max(a, b)}$ (cohesion vs separation) | Cluster quality at specific cut | 0.5-0.7 good, >0.7 excellent |
| **Adjusted Rand Index** | Agreement with ground truth (if available) | External validation | 0.8-1.0 excellent |
| **Dendrogram Height Gap** | Max vertical distance between merges | Natural cluster boundary | Large gaps = good separation |
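
As an illustration of using these metrics together, the sketch below scores several candidate cuts of one tree with the silhouette score. The data is synthetic and the range of K values is an arbitrary assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(8)
# Synthetic stand-in: three well-separated blobs
centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in centers])

# Build the tree once, then evaluate several cuts of the same tree
Z = linkage(X, method="average")
silhouettes = {}
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    silhouettes[k] = silhouette_score(X, labels)

best_k = max(silhouettes, key=silhouettes.get)
for k, score in silhouettes.items():
    print(f"K={k}: silhouette={score:.3f}")
print("Best K by silhouette:", best_k)
```

Because the linkage matrix is computed once, scanning many values of K is cheap compared with re-running K-Means for each candidate.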

### 🚀 Next Steps in Clustering Mastery

1. **DBSCAN** (Notebook 028): Density-based clustering for outlier handling and arbitrary shapes
2. **Gaussian Mixture Models** (Notebook 029): Probabilistic clustering with soft assignments
3. **Dimensionality Reduction** (Notebook 030): PCA, t-SNE, UMAP for visualizing hierarchical clusters
4. **Advanced Topics**: Divisive clustering, BIRCH (scalable hierarchical), HDBSCAN (hierarchical DBSCAN)

### 💡 Final Thoughts

**Hierarchical Clustering Strengths:**
- No K required → dendrogram reveals natural structure
- Multi-level granularity → taxonomy at multiple resolutions
- Deterministic → same distance matrix = same tree
- Interpretable → dendrogram visualizes relationships

**Hierarchical Clustering Limitations:**
- O(n² log n) complexity → slow for large data (>10K points)
- Greedy merges → can't undo poor early merges
- Forces all points into clusters → poor outlier handling
- Linkage-dependent → choice of linkage significantly affects results

**Production Checklist:**
- ✅ Choose appropriate distance metric (correlation for test similarity, Euclidean for spatial)
- ✅ Select linkage method (ward/average for general, complete for tight clusters, single for outlier detection)
- ✅ Visualize dendrogram and validate cophenetic correlation (>0.7)
- ✅ Use domain knowledge to select cut height (look for large gaps)
- ✅ Extract multi-level hierarchy if needed (3 top categories, 10 subcategories, etc.)
- ✅ Consider sklearn for production (connectivity constraints, memory efficiency)
- ✅ For large data (>10K), use K-Means or sampling + hierarchical on samples
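
The "sampling + hierarchical on samples" item can be sketched as: cluster a random subset, then assign every point to the nearest sample-cluster centroid. The sample size of 600 and the centroid-based assignment are assumptions; other assignment rules (for example k-nearest neighbors) would also work.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist

rng = np.random.default_rng(9)
# Synthetic stand-in for a large dataset: 15,000 points in three blobs
centers = np.array([(0.0, 0.0), (8.0, 0.0), (4.0, 7.0)])
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(5000, 2)) for c in centers])

# Step 1: hierarchical clustering on a manageable random sample
sample_idx = rng.choice(len(X), size=600, replace=False)
sample = X[sample_idx]
Z = linkage(sample, method="ward")
sample_labels = fcluster(Z, t=3, criterion="maxclust")

# Step 2: summarize each sample cluster by its centroid
cluster_ids = np.unique(sample_labels)
centroids = np.vstack([sample[sample_labels == c].mean(axis=0) for c in cluster_ids])

# Step 3: assign every point to the nearest centroid
full_labels = cluster_ids[cdist(X, centroids).argmin(axis=1)]
print("Full-data cluster sizes:", np.bincount(full_labels)[1:])
```

The dendrogram from the sample still supports multi-level cuts; only the final assignment step has to be repeated for each level.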

**Post-Silicon Context:**
- Hierarchical clustering excels at test hierarchy discovery (500 tests → 3-10 categories)
- Enables multi-level failure taxonomy (domain → parameter → test combination)
- Critical for organizing complex test programs without manual categorization
- Cophenetic correlation >0.75 indicates test relationships are well-captured

---

## 🎉 Congratulations!

You've mastered hierarchical clustering - from agglomerative algorithm mechanics to dendrogram interpretation to production test hierarchy discovery. You can now:
- ✅ Implement agglomerative clustering from scratch with Lance-Williams distance updates
- ✅ Build and interpret dendrograms using scipy.cluster.hierarchy
- ✅ Select optimal cut height and linkage method based on data characteristics
- ✅ Apply hierarchical clustering to test hierarchy discovery and failure taxonomy
- ✅ Choose between hierarchical, K-Means, DBSCAN based on data size, shape, and goals

**Next:** Explore DBSCAN (Notebook 028) for density-based clustering with automatic outlier detection!