# 029: Gaussian Mixture Models (GMM) - Probabilistic Soft Clustering 🎯

## Learning Objectives
- Master **Expectation-Maximization (EM) algorithm** for parameter estimation
- Understand **multivariate Gaussian distributions** and covariance structures
- Implement **soft clustering** with probability assignments
- Apply **BIC/AIC model selection** for optimal component count
- Analyze **mixed populations** in semiconductor test data
- Compare GMM vs K-Means for real-world clustering tasks

---

## 🔄 Gaussian Mixture Model Workflow

```mermaid
graph LR
    A[Dataset] --> B[Initialize Parameters<br/>μ, Σ, π]
    B --> C[E-Step:<br/>Compute Responsibilities<br/>γ_ik]
    C --> D[M-Step:<br/>Update μ, Σ, π]
    D --> E{Converged?<br/>ΔLL < ε}
    E -->|No| C
    E -->|Yes| F[Soft Assignments<br/>Probability Matrix]
    F --> G[Hard Clustering<br/>argmax_k γ_ik]
    F --> H[Uncertainty Quantification<br/>Entropy, Confidence]
    G --> I[Cluster Analysis]
    H --> I
```

---

## 📊 Why GMM Over K-Means?

| **Aspect** | **K-Means** | **Gaussian Mixture Models** |
|------------|-------------|----------------------------|
| **Assignment** | Hard (binary) | Soft (probabilistic) |
| **Cluster Shape** | Spherical only | Elliptical (arbitrary covariance) |
| **Uncertainty** | None | Full posterior probabilities |
| **Outliers** | Forced assignment | Low probabilities for all clusters |
| **Overlapping Clusters** | Poor performance | Natural handling via probabilities |
| **Statistical Foundation** | Heuristic | Maximum likelihood estimation |
| **Initialization Sensitivity** | High | High (EM finds a local optimum; multiple restarts help) |
| **Computational Cost** | O(nkd) per iteration | O(nkd²) per iteration (full covariance) |

---

## 🎯 Key Concepts

### 1. **Multivariate Gaussian Distribution**
Each cluster k follows a d-dimensional Gaussian:

$$
\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right)
$$

Where:
- $\boldsymbol{\mu}_k$ = mean vector for cluster k
- $\boldsymbol{\Sigma}_k$ = covariance matrix (controls shape/orientation)
- $|\boldsymbol{\Sigma}_k|$ = determinant of the covariance matrix (part of the normalization constant)
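
As a quick check of the density formula above, the sketch below evaluates it directly with NumPy and compares the result to `scipy.stats.multivariate_normal`. The helper name `gaussian_pdf` and the example values of `mu`, `sigma`, and `x` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy.stats import multivariate_normal


def gaussian_pdf(x, mu, sigma):
    """Evaluate the d-dimensional Gaussian density at a single point x."""
    d = mu.shape[0]
    diff = x - mu
    # Solve Sigma^{-1} (x - mu) without forming the explicit inverse
    mahalanobis_sq = diff @ np.linalg.solve(sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * mahalanobis_sq) / norm_const


# Hypothetical 2-D example values
mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.5])

manual = gaussian_pdf(x, mu, sigma)
reference = multivariate_normal.pdf(x, mean=mu, cov=sigma)
print(f"manual={manual:.6f}  scipy={reference:.6f}")
```

The two values should agree to floating-point precision. Using `np.linalg.solve` instead of an explicit inverse is a common choice for numerical stability.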

### 2. **Mixture Model**
Data generated from K components with mixing coefficients:

$$
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
$$

Where $\pi_k$ is the mixing coefficient (prior probability) of cluster k, with $\pi_k \geq 0$ and $\sum_{k=1}^{K} \pi_k = 1$
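
One way to read the mixture density is as a two-stage sampling process: pick a component with probability $\pi_k$, then draw from that component's Gaussian. The following sketch illustrates this under assumed, made-up parameters (`weights`, `means`, `covs`); it is not taken from the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters for K=3 components in 2-D
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [5.0, 5.0], [-4.0, 3.0]])
covs = np.stack([
    np.eye(2),
    np.array([[2.0, 0.8], [0.8, 1.0]]),
    0.5 * np.eye(2),
])

n = 5000
# Stage 1: choose a component index for every sample according to pi
components = rng.choice(len(weights), size=n, p=weights)

# Stage 2: draw each sample from its chosen Gaussian
X = np.empty((n, 2))
for k in range(len(weights)):
    mask = components == k
    X[mask] = rng.multivariate_normal(means[k], covs[k], size=mask.sum())

fractions = np.bincount(components, minlength=len(weights)) / n
print("empirical component fractions:", fractions)
```

With enough samples, the empirical fractions should be close to the mixing coefficients.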

### 3. **Expectation-Maximization Algorithm**

**E-Step:** Compute responsibilities (posterior probabilities):
$$
\gamma_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
$$

**M-Step:** Update parameters:
$$
\begin{aligned}
N_k &= \sum_{i=1}^{n} \gamma_{ik} \\
\boldsymbol{\mu}_k &= \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} \mathbf{x}_i \\
\boldsymbol{\Sigma}_k &= \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T \\
\pi_k &= \frac{N_k}{n}
\end{aligned}
$$
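
To make the E-step and M-step updates concrete, here is a minimal one-dimensional sketch with two components. The data, starting values, and variable names are assumptions chosen for illustration; in 1-D the covariance reduces to a scalar variance `var`. No regularization is applied, so this is a teaching sketch rather than a robust implementation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic 1-D data from two well-separated Gaussians (illustrative)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.5, 100)])
n = x.size

# Initial guesses for pi, mu, and the variances
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])


def log_likelihood(x, pi, mu, var):
    dens = pi * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))  # shape (n, K)
    return np.log(dens.sum(axis=1)).sum()


history = [log_likelihood(x, pi, mu, var)]
for _ in range(50):
    # E-step: responsibilities gamma_ik via Bayes' rule
    weighted = pi * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))
    gamma = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood updates
    N_k = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / N_k
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / N_k
    pi = N_k / n

    history.append(log_likelihood(x, pi, mu, var))

print("pi :", pi)
print("mu :", mu)
print("var:", var)
```

Each row of `gamma` sums to 1, and `N_k` sums to `n`, which is why the updated `pi` stays a valid probability vector.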

### 4. **Log-Likelihood (Convergence Criterion)**
$$
\log p(\mathbf{X} | \boldsymbol{\theta}) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)
$$

Iterate E/M steps until $|\Delta \text{LL}| < \epsilon$ (e.g., $10^{-6}$)
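
Evaluating this log-likelihood naively can underflow when a point lies far from every component, because each density rounds to zero before the logarithm is taken. A common remedy, sketched below, is to work in log space with `scipy.special.logsumexp`. The parameter values and the function names `naive_log_likelihood` and `stable_log_likelihood` are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

# Hypothetical two-component mixture in 2-D
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.stack([np.eye(2), np.eye(2)])

# Second point is deliberately far from both components
X = np.array([[0.5, -0.2], [60.0, 60.0]])


def naive_log_likelihood(X, weights, means, covs):
    dens = sum(w * multivariate_normal.pdf(X, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))
    with np.errstate(divide="ignore"):
        return np.log(dens).sum()  # log(0) = -inf when densities underflow


def stable_log_likelihood(X, weights, means, covs):
    # log(pi_k) + log N(x_i | mu_k, Sigma_k), one column per component
    log_terms = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return logsumexp(log_terms, axis=1).sum()


naive = naive_log_likelihood(X, weights, means, covs)
stable = stable_log_likelihood(X, weights, means, covs)
print("naive :", naive)
print("stable:", stable)
```

The naive version returns negative infinity for this input, while the log-space version stays finite.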

### 5. **Model Selection (Optimal K)**

**Bayesian Information Criterion (BIC):**
$$
\text{BIC} = -2 \log p(\mathbf{X} | \boldsymbol{\theta}) + p \log n
$$

**Akaike Information Criterion (AIC):**
$$
\text{AIC} = -2 \log p(\mathbf{X} | \boldsymbol{\theta}) + 2p
$$

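Where $p$ is the number of free parameters. For full covariance, $p = K(d + d(d+1)/2 + 1) - 1$: each component contributes $d$ mean entries and $d(d+1)/2$ covariance entries, and the mixing coefficients contribute $K - 1$ because they sum to 1.

A lower BIC/AIC indicates a better trade-off between fit and complexity.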
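
The formulas above can be checked against scikit-learn's built-in `bic()` and `aic()` methods. The sketch below computes both criteria by hand for a full-covariance model; the synthetic dataset and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
n, d = X.shape
K = 3

gmm = GaussianMixture(n_components=K, covariance_type="full", random_state=0).fit(X)

# Free parameters for full covariance: K*d means, K*d(d+1)/2 covariance
# entries, and K-1 mixing coefficients
p = K * (d + d * (d + 1) // 2 + 1) - 1

# score() returns the mean log-likelihood per sample, so scale by n
total_log_likelihood = gmm.score(X) * n

bic_manual = -2 * total_log_likelihood + p * np.log(n)
aic_manual = -2 * total_log_likelihood + 2 * p

print(f"p = {p}")
print(f"BIC manual = {bic_manual:.3f}, sklearn = {gmm.bic(X):.3f}")
print(f"AIC manual = {aic_manual:.3f}, sklearn = {gmm.aic(X):.3f}")
```

For K=3 and d=2 the count is 17 parameters, and the manual values should match the library's.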

---

## 🔬 Post-Silicon Validation Application

### **Mixed Population Analysis**
- **Problem:** Wafer lots from different process tools show bimodal yield distributions
- **GMM Solution:** Identify K=2 populations (Tool A vs Tool B), soft-assign each die
- **Business Value:** $5M+ savings by isolating problematic tool, avoiding full lot scraps

### **Soft Binning**
- **Problem:** Hard bins (pass/fail) lose information for marginal dies
- **GMM Solution:** Probabilistic bin assignment based on parametric signature
- **Business Value:** 15% yield improvement by recovering dies with 70-90% pass probability

---

### 📝 What's Happening in This Code?

**Purpose:** Import libraries for Gaussian Mixture Model implementation

**Key Points:**
- **GaussianMixture**: sklearn's EM algorithm implementation with full/tied/diag/spherical covariance options
- **scipy.stats.multivariate_normal**: For computing multivariate Gaussian PDF (used in from-scratch implementation)
- **BIC/AIC**: Model selection tools built into sklearn's `bic()` and `aic()` methods
- **make_blobs**: Generate synthetic multi-cluster data with controllable overlap

**Why This Matters:** GMM extends K-Means by modeling cluster shapes (elliptical via covariance) and providing probabilistic assignments (soft clustering)—critical for overlapping clusters and uncertainty quantification in semiconductor test data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal
from sklearn.preprocessing import StandardScaler

# Set random seed for reproducibility
np.random.seed(42)

# Visualization settings
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

### 📝 What's Happening in This Code?

**Purpose:** Implement Gaussian Mixture Model from scratch using EM algorithm

**Key Points:**
- **E-Step (`_e_step`)**: Computes responsibilities $\gamma_{ik}$ (probability that point i belongs to cluster k) using Bayes' rule
- **M-Step (`_m_step`)**: Updates parameters ($\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k$) using weighted maximum likelihood
- **Log-Likelihood**: Tracks convergence by monitoring $\sum_i \log \sum_k \pi_k \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
- **Regularization**: Adds small $\epsilon$ to diagonal of covariance to prevent singularity (numerical stability)
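- **K-Means++ Initialization**: Seeds the initial means with the K-Means++ scheme (each new mean is sampled with probability proportional to its squared distance from the nearest existing mean); full K-Means is not run

**Why This Matters:** Each EM iteration is guaranteed not to decrease the log-likelihood, so the algorithm converges to a stationary point (typically a local maximum, not necessarily the global one). Understanding the math enables debugging convergence issues (e.g., singular covariance when a cluster effectively contains d or fewer points).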
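
The singular-covariance failure mode mentioned above can be reproduced in a few lines. The sketch below, using made-up random points, shows that a maximum-likelihood covariance estimated from only d points in d dimensions is rank-deficient, and that adding a small multiple of the identity (the role `reg_covar` plays) restores positive definiteness.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
X_small = rng.normal(size=(d, d))  # only d points in d dimensions (illustrative)

# Maximum-likelihood covariance (divides by n, matching the M-step formula)
cov = np.cov(X_small, rowvar=False, bias=True)
rank = np.linalg.matrix_rank(cov)

# Regularize the diagonal, as the from-scratch M-step does
reg_covar = 1e-6
cov_reg = cov + reg_covar * np.eye(d)
min_eig = np.linalg.eigvalsh(cov_reg).min()

print(f"rank of raw covariance: {rank} (dimension d = {d})")
print(f"smallest eigenvalue after regularization: {min_eig:.2e}")
```

Centering the data removes one degree of freedom, so n points give a covariance of rank at most n - 1; this is why at least d + 1 points are needed for a non-singular estimate.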

In [None]:
class GMMFromScratch:
    """Gaussian Mixture Model using Expectation-Maximization algorithm"""
    
    def __init__(self, n_components=3, max_iter=100, tol=1e-6, reg_covar=1e-6):
        self.n_components = n_components
        self.max_iter = max_iter
        self.tol = tol
        self.reg_covar = reg_covar
        
    def _initialize_parameters(self, X):
        """Initialize means with K-Means++, covariances as identity, equal mixing coefficients"""
        n_samples, n_features = X.shape
        
        # K-Means++ initialization for means
        self.means_ = np.empty((self.n_components, n_features))
        self.means_[0] = X[np.random.choice(n_samples)]
        
        for k in range(1, self.n_components):
            distances = np.array([min([np.linalg.norm(x - c)**2 for c in self.means_[:k]]) for x in X])
            probabilities = distances / distances.sum()
            self.means_[k] = X[np.random.choice(n_samples, p=probabilities)]
        
        # Initialize covariances as identity matrices
        self.covariances_ = np.array([np.eye(n_features) for _ in range(self.n_components)])
        
        # Initialize mixing coefficients uniformly
        self.weights_ = np.ones(self.n_components) / self.n_components
        
    def _e_step(self, X):
        """Expectation step: compute responsibilities γ_ik"""
        n_samples = X.shape[0]
        responsibilities = np.zeros((n_samples, self.n_components))
        
        # Compute weighted likelihoods for each component
        for k in range(self.n_components):
            responsibilities[:, k] = self.weights_[k] * multivariate_normal.pdf(
                X, mean=self.means_[k], cov=self.covariances_[k], allow_singular=True
            )
        
        # Normalize to get probabilities (Bayes' rule)
        responsibilities_sum = responsibilities.sum(axis=1, keepdims=True)
        responsibilities_sum[responsibilities_sum == 0] = 1e-10  # Avoid division by zero
        responsibilities /= responsibilities_sum
        
        return responsibilities
    
    def _m_step(self, X, responsibilities):
        """Maximization step: update μ, Σ, π"""
        n_samples, n_features = X.shape
        
        # Effective number of points assigned to each cluster
        N_k = responsibilities.sum(axis=0)
        
        # Update mixing coefficients
        self.weights_ = N_k / n_samples
        
        # Update means
        self.means_ = (responsibilities.T @ X) / N_k[:, np.newaxis]
        
        # Update covariances
        for k in range(self.n_components):
            diff = X - self.means_[k]
            weighted_diff = responsibilities[:, k][:, np.newaxis] * diff
            self.covariances_[k] = (weighted_diff.T @ diff) / N_k[k]
            
            # Add regularization to prevent singular matrices
            self.covariances_[k] += self.reg_covar * np.eye(n_features)
    
    def _compute_log_likelihood(self, X):
        """Compute log-likelihood for convergence check"""


        log_likelihood = 0
        for i in range(X.shape[0]):
            component_likelihoods = np.array([
                self.weights_[k] * multivariate_normal.pdf(
                    X[i], mean=self.means_[k], cov=self.covariances_[k], allow_singular=True
                )
                for k in range(self.n_components)
            ])
            log_likelihood += np.log(component_likelihoods.sum() + 1e-10)
        
        return log_likelihood
    
    def fit(self, X):
        """Fit GMM using EM algorithm"""
        self._initialize_parameters(X)
        
        log_likelihoods = []
        
        for iteration in range(self.max_iter):
            # E-step
            responsibilities = self._e_step(X)
            
            # M-step
            self._m_step(X, responsibilities)
            
            # Check convergence
            log_likelihood = self._compute_log_likelihood(X)
            log_likelihoods.append(log_likelihood)
            
            if iteration > 0 and abs(log_likelihoods[-1] - log_likelihoods[-2]) < self.tol:
                print(f"Converged after {iteration + 1} iterations")
                break
        
        self.log_likelihoods_ = log_likelihoods
        return self
    
    def predict_proba(self, X):
        """Return soft assignments (probability matrix)"""
        return self._e_step(X)
    
    def predict(self, X):
        """Return hard assignments (argmax of probabilities)"""
        return self.predict_proba(X).argmax(axis=1)


# Test on synthetic data
X_test, y_true = make_blobs(n_samples=300, centers=3, n_features=2, 
                            cluster_std=[1.0, 1.5, 0.5], random_state=42)
# Train GMM from scratch
gmm_scratch = GMMFromScratch(n_components=3, max_iter=100)
gmm_scratch.fit(X_test)
y_pred_scratch = gmm_scratch.predict(X_test)
proba_scratch = gmm_scratch.predict_proba(X_test)
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Plot 1: Hard assignments
axes[0].scatter(X_test[:, 0], X_test[:, 1], c=y_pred_scratch, cmap='viridis', alpha=0.6, edgecolors='k')
axes[0].scatter(gmm_scratch.means_[:, 0], gmm_scratch.means_[:, 1], 
               c='red', marker='X', s=300, edgecolors='black', linewidths=2, label='Centroids')
axes[0].set_title('GMM From Scratch - Hard Assignments', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Plot 2: Soft assignments (uncertainty)
uncertainty = -np.sum(proba_scratch * np.log(proba_scratch + 1e-10), axis=1)  # Entropy
scatter = axes[1].scatter(X_test[:, 0], X_test[:, 1], c=uncertainty, cmap='coolwarm', 


                         alpha=0.7, edgecolors='k')
axes[1].set_title('Uncertainty (Entropy of Probabilities)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
cbar = plt.colorbar(scatter, ax=axes[1])
cbar.set_label('Entropy (higher = more uncertain)')
axes[1].grid(True, alpha=0.3)
# Plot 3: Log-likelihood convergence
axes[2].plot(gmm_scratch.log_likelihoods_, marker='o', linewidth=2)
axes[2].set_title('Log-Likelihood Convergence', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Iteration')
axes[2].set_ylabel('Log-Likelihood')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\n✅ GMM From Scratch Results:")
print(f"   Final Log-Likelihood: {gmm_scratch.log_likelihoods_[-1]:.2f}")
print(f"   Mixing Coefficients (π): {gmm_scratch.weights_}")
print(f"   Mean Uncertainty (Entropy): {uncertainty.mean():.3f} ± {uncertainty.std():.3f}")


### 📝 What's Happening in This Code?

**Purpose:** Compare different covariance types and demonstrate elliptical cluster shapes

**Key Points:**
- **Full Covariance**: Each cluster has its own arbitrary covariance matrix (most flexible, K×d(d+1)/2 parameters)
- **Tied Covariance**: All clusters share the same covariance (reduces parameters, assumes similar shapes)
- **Diagonal Covariance**: Only variance, no correlation (axis-aligned ellipses, K×d parameters)
- **Spherical Covariance**: Single variance per cluster (spherical clusters like K-Means, K parameters)
- **Ellipse Visualization**: Draws 2-sigma contours showing cluster shapes (in 2D, a 2-sigma ellipse contains about 86% of a Gaussian's mass; a 95% region needs about 2.45 sigma)

**Why This Matters:** Covariance type is a model-selection choice: full covariance captures arbitrary shapes but risks overfitting with limited data. For post-silicon data, diagonal covariance is a reasonable starting point when test parameters are roughly independent; compare it against full covariance with BIC when they are correlated.
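
The four covariance options differ in how many numbers they store per model, which shows up directly in the shape of `covariances_`. The sketch below fits each type on an assumed synthetic dataset and prints the stored shape alongside the covariance parameter count.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
K, d = 3, 2

# Covariance parameter counts implied by each structure
n_cov_params = {
    "full": K * d * (d + 1) // 2,   # one symmetric d x d matrix per component
    "tied": d * (d + 1) // 2,       # one shared symmetric matrix
    "diag": K * d,                  # one variance per feature per component
    "spherical": K,                 # one variance per component
}

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=K, covariance_type=cov_type,
                          random_state=0).fit(X)
    shapes[cov_type] = gmm.covariances_.shape
    print(f"{cov_type:>9}: covariances_.shape = {shapes[cov_type]}, "
          f"covariance parameters = {n_cov_params[cov_type]}")
```

These shapes are the reason plotting code has to branch on the covariance type before building a d x d matrix for each ellipse.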

In [None]:
# Generate data with elongated clusters (non-spherical)
from sklearn.datasets import make_blobs

X_elongated, _ = make_blobs(n_samples=300, centers=3, n_features=2, 
                            cluster_std=1.5, random_state=42)

# Add correlation with a linear map (stretches along one diagonal, compresses along the other; not a pure rotation)
transform_matrix = np.array([[1, 0.8], [0.8, 1]])
X_elongated = X_elongated @ transform_matrix

# Test different covariance types
covariance_types = ['full', 'tied', 'diag', 'spherical']

fig, axes = plt.subplots(2, 2, figsize=(16, 14))
axes = axes.ravel()

for idx, cov_type in enumerate(covariance_types):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=42)
    gmm.fit(X_elongated)
    y_pred = gmm.predict(X_elongated)
    
    # Plot data points
    axes[idx].scatter(X_elongated[:, 0], X_elongated[:, 1], c=y_pred, 
                     cmap='viridis', alpha=0.6, edgecolors='k', s=50)
    
    # Plot Gaussian ellipses (2-sigma contours)
    from matplotlib.patches import Ellipse
    
    for k in range(3):
        mean = gmm.means_[k]
        
        if cov_type == 'full':
            covariance = gmm.covariances_[k]
        elif cov_type == 'tied':
            covariance = gmm.covariances_
        elif cov_type == 'diag':
            covariance = np.diag(gmm.covariances_[k])
        elif cov_type == 'spherical':
            covariance = gmm.covariances_[k] * np.eye(2)
        
        # Compute eigenvalues/eigenvectors for ellipse
        vals, vecs = np.linalg.eigh(covariance)
        angle = np.degrees(np.arctan2(vecs[1, 0], vecs[0, 0]))
        width, height = 2 * 2 * np.sqrt(vals)  # full axis lengths of the 2-sigma ellipse (~86% of mass in 2D)
        
        ellipse = Ellipse(mean, width, height, angle=angle, 
                         edgecolor='red', facecolor='none', linewidth=2.5)
        axes[idx].add_patch(ellipse)
    
    axes[idx].set_title(f'Covariance Type: {cov_type.upper()}', 
                       fontsize=14, fontweight='bold')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')
    axes[idx].grid(True, alpha=0.3)
    
    # Print model info
    print(f"\n{cov_type.upper()} Covariance:")
    print(f"  BIC: {gmm.bic(X_elongated):.2f}")
    print(f"  AIC: {gmm.aic(X_elongated):.2f}")
    print(f"  Log-Likelihood: {gmm.score(X_elongated) * X_elongated.shape[0]:.2f}")

plt.tight_layout()
plt.show()

print("\n📊 Covariance Type Selection Guidelines:")
print("   • Full: Best for arbitrary shapes, requires n >> K×d²")
print("   • Tied: Good when clusters have similar shapes")
print("   • Diag: Assumes feature independence within each cluster, reasonable when test parameters are weakly correlated")
print("   • Spherical: Closest to K-Means (isotropic clusters), but with soft assignments and per-cluster variance")
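
The ellipses drawn above use a Mahalanobis radius of 2. How much probability mass such an ellipse holds depends on the dimension, because the squared Mahalanobis distance of a d-dimensional Gaussian follows a chi-square distribution with d degrees of freedom. The short sketch below computes the coverage in 1-D and 2-D and the radius needed for 95% coverage in 2-D; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Probability mass inside Mahalanobis radius 2
coverage_1d = chi2.cdf(2.0 ** 2, df=1)
coverage_2d = chi2.cdf(2.0 ** 2, df=2)

# Mahalanobis radius whose ellipse holds 95% of a 2-D Gaussian
radius_95_2d = np.sqrt(chi2.ppf(0.95, df=2))

print(f"2-sigma coverage in 1-D: {coverage_1d:.4f}")
print(f"2-sigma coverage in 2-D: {coverage_2d:.4f}")
print(f"radius for 95% coverage in 2-D: {radius_95_2d:.4f} sigma")
```

To draw 95% regions instead, the axis lengths would be scaled by `radius_95_2d` rather than 2.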

### 📝 What's Happening in This Code?

**Purpose:** Use BIC/AIC model selection to determine optimal number of components K

**Key Points:**
- **BIC (Bayesian)**: Penalizes complexity with a $p \log n$ term, which is stricter than AIC whenever $\log n > 2$ (n ≥ 8); tends to select simpler models
- **AIC (Akaike)**: More lenient penalty ($2p$ term); tends to select larger models and is often preferred when predictive fit matters more than recovering the true K
- **Elbow Method**: Look for the "knee" in the BIC/AIC curve where improvement plateaus
- **Rule of Thumb**: Choose K where BIC is minimized (for small-medium datasets) or where the slope flattens (for large datasets)
- **Silhouette Score**: A general clustering metric computed from hard assignments (not specific to K-Means); usable as a sanity check

**Why This Matters:** Selecting K is non-trivial in GMM—too few components underfit (single Gaussian for bimodal data), too many overfit (fitting noise). BIC provides principled statistical criterion. For wafer data, K=2-5 typical (process conditions, tool types, binning categories).
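
BIC can compare covariance structures as well as component counts, since both change the parameter count $p$. The sketch below runs a small grid search over both. The grid bounds, `n_init=3`, and the synthetic data are assumptions for illustration; the selected model depends on the data and is not guaranteed to match the generating K.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, n_features=2,
                  cluster_std=1.2, random_state=42)

covariance_types = ["full", "tied", "diag", "spherical"]
K_values = range(1, 8)

# BIC for every (covariance_type, K) combination
bic = {}
for cov_type in covariance_types:
    for K in K_values:
        gmm = GaussianMixture(n_components=K, covariance_type=cov_type,
                              n_init=3, random_state=0).fit(X)
        bic[(cov_type, K)] = gmm.bic(X)

# Lowest BIC wins
best_cov_type, best_K = min(bic, key=bic.get)
print(f"Lowest BIC: covariance_type={best_cov_type!r}, K={best_K}, "
      f"BIC={bic[(best_cov_type, best_K)]:.2f}")
```

Running several initializations per fit (`n_init`) reduces the chance that a poor local optimum distorts the comparison.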

In [None]:
# Generate ground truth data with K=4 components
X_bic, y_true_bic = make_blobs(n_samples=400, centers=4, n_features=2, 
                               cluster_std=1.2, random_state=42)

# Test K from 1 to 10
K_range = range(1, 11)
bic_scores = []
aic_scores = []
log_likelihoods = []

for K in K_range:
    gmm = GaussianMixture(n_components=K, covariance_type='full', random_state=42)
    gmm.fit(X_bic)
    
    bic_scores.append(gmm.bic(X_bic))
    aic_scores.append(gmm.aic(X_bic))
    log_likelihoods.append(gmm.score(X_bic) * X_bic.shape[0])

# Find optimal K
optimal_K_bic = K_range[np.argmin(bic_scores)]
optimal_K_aic = K_range[np.argmin(aic_scores)]

# Visualize model selection
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: BIC/AIC curves
axes[0].plot(K_range, bic_scores, marker='o', linewidth=2, label='BIC', color='blue')
axes[0].plot(K_range, aic_scores, marker='s', linewidth=2, label='AIC', color='green')
axes[0].axvline(optimal_K_bic, color='blue', linestyle='--', linewidth=2, 
               label=f'Optimal K (BIC) = {optimal_K_bic}')
axes[0].axvline(optimal_K_aic, color='green', linestyle='--', linewidth=2, 
               label=f'Optimal K (AIC) = {optimal_K_aic}')
axes[0].set_title('BIC/AIC Model Selection', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Number of Components (K)')
axes[0].set_ylabel('Information Criterion')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Log-likelihood
axes[1].plot(K_range, log_likelihoods, marker='o', linewidth=2, color='orange')
axes[1].set_title('Log-Likelihood vs K', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Number of Components (K)')
axes[1].set_ylabel('Log-Likelihood')
axes[1].grid(True, alpha=0.3)

# Plot 3: Clustering result with optimal K
gmm_optimal = GaussianMixture(n_components=optimal_K_bic, covariance_type='full', random_state=42)
gmm_optimal.fit(X_bic)
y_pred_optimal = gmm_optimal.predict(X_bic)

axes[2].scatter(X_bic[:, 0], X_bic[:, 1], c=y_pred_optimal, cmap='viridis', 
               alpha=0.6, edgecolors='k', s=50)
axes[2].scatter(gmm_optimal.means_[:, 0], gmm_optimal.means_[:, 1], 
               c='red', marker='X', s=300, edgecolors='black', linewidths=2, label='Centroids')
axes[2].set_title(f'GMM with Optimal K={optimal_K_bic}', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Feature 1')
axes[2].set_ylabel('Feature 2')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n✅ Model Selection Results:")
print(f"   Ground Truth K: 4")
print(f"   Optimal K (BIC): {optimal_K_bic}")
print(f"   Optimal K (AIC): {optimal_K_aic}")
print(f"\n   BIC at K={optimal_K_bic}: {bic_scores[optimal_K_bic-1]:.2f}")
print(f"   AIC at K={optimal_K_aic}: {aic_scores[optimal_K_aic-1]:.2f}")
if optimal_K_bic == 4:
    print("\n💡 BIC recovered K=4 (the true number of clusters)")
else:
    print(f"\n💡 BIC selected K={optimal_K_bic}, which differs from the true K=4")

### 📝 What's Happening in This Code?

**Purpose:** Apply GMM to mixed wafer lot population analysis (post-silicon validation use case)

**Key Points:**
- **Mixed Population Problem**: 200 wafer lots from 2 process tools (Tool A: 120 lots, high yield; Tool B: 80 lots, bimodal distribution due to an intermittent issue)
- **Soft Clustering**: GMM with K=3 assigns each lot a probability of belonging to each population (nominally Tool A, Tool B normal, Tool B low-yield), which enables risk-based decisions
- **Uncertainty Quantification**: Lots with entropy >0.5 nats are ambiguous (require manual review)
- **Business Decision**: Lots with P(low-yield cluster | data) > 80% are flagged for FA (failure analysis), saving $5M+ in potential scraps
- **Visualization**: 2D projection (yield%, test_time_ms) shows bimodal separation

**Why This Matters:** Hard binning (Tool A vs B) loses information, whereas GMM provides confidence scores that enable risk-based triage. For example, a lot with 65% probability of belonging to the problematic population might be scheduled for expedited test, while one at 95% goes to immediate FA. This probabilistic approach reduces false alarms (costly FA on good lots) while still catching real issues.
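
The triage idea can be sketched as a simple mapping from risk score to action tier. The 80% FA threshold comes from the text above; the 50% review threshold, the tier names, and the function `triage` are hypothetical choices made for this illustration.

```python
import numpy as np


def triage(risk_score, review_threshold=0.5, fa_threshold=0.8):
    """Map P(problematic cluster) to an action tier.

    Scores above fa_threshold go to failure analysis, scores above
    review_threshold go to expedited test, everything else follows
    the normal flow. Thresholds are illustrative assumptions.
    """
    risk_score = np.asarray(risk_score, dtype=float)
    tiers = np.array(["normal_flow", "expedited_test", "immediate_FA"])
    # right=True makes each boundary inclusive on the lower tier,
    # so exactly 0.8 is NOT flagged for FA (the rule is "> 80%")
    tier_index = np.digitize(risk_score, [review_threshold, fa_threshold], right=True)
    return tiers[tier_index]


actions = triage([0.10, 0.65, 0.95])
print(actions)
```

In practice the thresholds would be set from the relative cost of a missed problem versus an unnecessary FA.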

In [None]:
# Simulate wafer lot data from two process tools
np.random.seed(42)

# Tool A: High yield, consistent (Gaussian with μ=92%, σ=3%)
n_tool_a = 120
yield_tool_a = np.random.normal(loc=92, scale=3, size=n_tool_a)
test_time_tool_a = np.random.normal(loc=150, scale=20, size=n_tool_a)

# Tool B: Bimodal distribution (intermittent issue causes low-yield subpopulation)
n_tool_b = 80
# 60% normal yield (μ=90%, σ=4%), 40% low yield (μ=75%, σ=5%)
yield_tool_b_good = np.random.normal(loc=90, scale=4, size=int(n_tool_b * 0.6))
yield_tool_b_bad = np.random.normal(loc=75, scale=5, size=int(n_tool_b * 0.4))
yield_tool_b = np.concatenate([yield_tool_b_good, yield_tool_b_bad])

test_time_tool_b_good = np.random.normal(loc=160, scale=25, size=int(n_tool_b * 0.6))
test_time_tool_b_bad = np.random.normal(loc=180, scale=30, size=int(n_tool_b * 0.4))
test_time_tool_b = np.concatenate([test_time_tool_b_good, test_time_tool_b_bad])

# Combine data
X_wafer = np.column_stack([
    np.concatenate([yield_tool_a, yield_tool_b]),
    np.concatenate([test_time_tool_a, test_time_tool_b])
])
y_true_tool = np.array([0]*n_tool_a + [1]*n_tool_b)  # 0=Tool A, 1=Tool B

# Standardize features
scaler = StandardScaler()
X_wafer_scaled = scaler.fit_transform(X_wafer)

# Fit GMM with K=3 (Tool A, Tool B good, Tool B bad)
gmm_wafer = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm_wafer.fit(X_wafer_scaled)

# Get soft assignments
proba_wafer = gmm_wafer.predict_proba(X_wafer_scaled)
y_pred_wafer = gmm_wafer.predict(X_wafer_scaled)

# Compute uncertainty (entropy)
entropy_wafer = -np.sum(proba_wafer * np.log(proba_wafer + 1e-10), axis=1)

# Identify high-risk lots (likely from problematic Tool B population)
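# Map clusters to risk: the component with the lowest mean yield is treated as high risk.
# Component means are converted back to original units, which avoids NaNs from empty hard-assigned clusters.
cluster_mean_yield = scaler.inverse_transform(gmm_wafer.means_)[:, 0]
high_risk_cluster = cluster_mean_yield.argmin()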
risk_score = proba_wafer[:, high_risk_cluster]

# Flag lots with >80% probability of high-risk cluster
flagged_lots = risk_score > 0.8

# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: True tool labels
axes[0].scatter(X_wafer[:, 0], X_wafer[:, 1], c=y_true_tool, cmap='coolwarm', 
               alpha=0.6, edgecolors='k', s=80)
axes[0].set_title('Ground Truth (Tool A vs Tool B)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Yield %')
axes[0].set_ylabel('Test Time (ms)')
axes[0].grid(True, alpha=0.3)

# Plot 2: GMM clusters (discovers 3 populations)
axes[1].scatter(X_wafer[:, 0], X_wafer[:, 1], c=y_pred_wafer, cmap='viridis', 
               alpha=0.6, edgecolors='k', s=80)
axes[1].scatter(X_wafer[flagged_lots, 0], X_wafer[flagged_lots, 1], 
               facecolors='none', edgecolors='red', linewidths=3, s=120, label='Flagged for FA')
axes[1].set_title('GMM Clustering (K=3)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Yield %')
axes[1].set_ylabel('Test Time (ms)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Risk scores (probability of high-risk cluster)
scatter = axes[2].scatter(X_wafer[:, 0], X_wafer[:, 1], c=risk_score, cmap='Reds', 
                         alpha=0.7, edgecolors='k', s=80)
axes[2].scatter(X_wafer[flagged_lots, 0], X_wafer[flagged_lots, 1], 
               facecolors='none', edgecolors='blue', linewidths=3, s=120, label='High Risk (>80%)')
axes[2].set_title('Risk Score (P(Problematic Cluster))', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Yield %')
axes[2].set_ylabel('Test Time (ms)')
cbar = plt.colorbar(scatter, ax=axes[2])
cbar.set_label('Risk Score')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n✅ Mixed Population Analysis Results:")
print(f"   Total Lots: {len(X_wafer)}")
print(f"   Flagged for FA: {flagged_lots.sum()} ({100*flagged_lots.sum()/len(X_wafer):.1f}%)")
print(f"   Mean Risk Score (Flagged): {risk_score[flagged_lots].mean():.3f}")
print(f"   Mean Risk Score (Normal): {risk_score[~flagged_lots].mean():.3f}")
print(f"   Mean Uncertainty (Entropy): {entropy_wafer.mean():.3f}")
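# Illustrative cost assumptions (not measured data): $250K per scrapped lot, $50K per failure analysis
print(f"\n💰 Business Impact (illustrative, assuming $250K per lot and $50K per FA):")
print(f"   • {flagged_lots.sum()} lots routed to FA instead of scrap (up to ${flagged_lots.sum() * 250}K at stake)")
print(f"   • Targeted FA enables root cause isolation (Tool B intermittent issue)")
print(f"   • {(~flagged_lots).sum()} lots below the 80% threshold skip FA (${(~flagged_lots).sum() * 50}K of FA cost not spent)")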

### 📝 What's Happening in This Code?

**Purpose:** Compare GMM vs K-Means on overlapping clusters to demonstrate soft clustering advantages

**Key Points:**
- **Overlapping Region**: Generate 3 clusters with significant overlap (`cluster_std=2.5`); the share of ambiguous points is measured by the code rather than fixed in advance
- **K-Means Limitation**: Hard assignment forces a definitive cluster even for borderline points and gives no measure of how ambiguous they are
- **GMM Advantage**: Soft probabilities reflect uncertainty: points between two clusters get close to 50/50, and points near all three approach ~33% each
- **Decision Threshold**: Points with max probability <50% are flagged as "uncertain" (require human review); with K=3 the max probability is always at least 1/3
- **Visualization**: Color intensity shows confidence (bright = certain, faded = uncertain)

**Why This Matters:** In post-silicon validation, overlapping populations are common (e.g., marginal devices on binning boundary, mixed-quality wafer lots). Hard clustering loses critical information—a die with 48% Pass / 52% Fail probability should be treated differently than 5% / 95%. GMM enables risk-based decision making.
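
The difference between the 48% / 52% die and the 5% / 95% die can be quantified with the entropy of the assignment probabilities. The sketch below is a minimal illustration; the function name `assignment_entropy` and the `eps` guard are assumptions.

```python
import numpy as np


def assignment_entropy(proba, eps=1e-12):
    """Shannon entropy (in nats) of a soft assignment; higher = more uncertain."""
    proba = np.asarray(proba, dtype=float)
    return -np.sum(proba * np.log(proba + eps), axis=-1)


borderline = assignment_entropy([0.48, 0.52])  # die near the pass/fail boundary
clear = assignment_entropy([0.05, 0.95])       # die that is clearly in one class
max_entropy = np.log(2)                        # upper bound for two classes

print(f"borderline die entropy: {borderline:.3f}")
print(f"clear die entropy:      {clear:.3f}")
print(f"maximum for 2 classes:  {max_entropy:.3f}")
```

The borderline die sits almost at the two-class maximum (about 0.69 nats), while the clear one is near 0.20, so a single entropy threshold separates the two cases that hard labels treat identically.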

In [None]:
# Generate overlapping clusters
from sklearn.cluster import KMeans

X_overlap, y_true_overlap = make_blobs(n_samples=300, centers=3, n_features=2,
                                        cluster_std=2.5, random_state=42)

# Fit K-Means (hard clustering)
# n_init is set explicitly because its default changed across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X_overlap)

# Fit GMM (soft clustering)
gmm_overlap = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm_overlap.fit(X_overlap)
y_gmm = gmm_overlap.predict(X_overlap)
proba_gmm = gmm_overlap.predict_proba(X_overlap)

# Compute confidence (max probability)
confidence_gmm = proba_gmm.max(axis=1)
uncertain_points = confidence_gmm < 0.5

# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: K-Means (hard clustering)
axes[0].scatter(X_overlap[:, 0], X_overlap[:, 1], c=y_kmeans, cmap='viridis',
               alpha=0.6, edgecolors='k', s=80)
axes[0].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
               c='red', marker='X', s=300, edgecolors='black', linewidths=2, label='Centroids')
axes[0].set_title('K-Means (Hard Clustering)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: GMM (soft clustering with confidence)
scatter = axes[1].scatter(X_overlap[:, 0], X_overlap[:, 1], c=y_gmm, cmap='viridis',
                         alpha=confidence_gmm, edgecolors='k', s=80)
axes[1].scatter(X_overlap[uncertain_points, 0], X_overlap[uncertain_points, 1],
               facecolors='none', edgecolors='red', linewidths=3, s=120, label='Uncertain (<50%)')
axes[1].set_title('GMM (Soft Clustering + Confidence)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Confidence distribution
axes[2].hist(confidence_gmm, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
axes[2].axvline(0.5, color='red', linestyle='--', linewidth=2, label='Uncertainty Threshold')
axes[2].set_title('Confidence Distribution (GMM)', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Max Probability (Confidence)')
axes[2].set_ylabel('Frequency')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n✅ GMM vs K-Means Comparison:")
print(f"   Uncertain Points (GMM): {uncertain_points.sum()} ({100*uncertain_points.sum()/len(X_overlap):.1f}%)")
print(f"   Mean Confidence (Certain): {confidence_gmm[~uncertain_points].mean():.3f}")
print(f"   Mean Confidence (Uncertain): {confidence_gmm[uncertain_points].mean():.3f}")
print(f"\n💡 Key Insight:")
print(f"   • K-Means assigns all points definitively (no uncertainty)")
print(f"   • GMM identifies {uncertain_points.sum()} ambiguous points requiring review")
print(f"   • For post-silicon, uncertain points = marginal devices (need expert triage)")

---

## 🚀 Real-World Projects

### **Post-Silicon Validation Projects**

1. **Wafer Lot Quality Clustering Engine** 💰 **$5M+ Savings**
   - **Objective:** Cluster 500+ wafer lots by parametric signatures to identify mixed populations from different process tools
   - **Approach:** GMM with K=3-5, diagonal covariance (appropriate if test parameters are approximately independent), BIC for K selection
   - **Features:** 20 critical test parameters (Vdd, Idd, frequency, yield%, test_time_ms)
   - **Business Value:** Isolate problematic tool contributing 15% low-yield lots, avoid $5M+ scraps via targeted FA
   - **Success Metric:** >90% accuracy in tool identification (validated against manufacturing records)

2. **Soft Binning System for Marginal Devices** 💰 **$10M+ Yield Recovery**
   - **Objective:** Replace hard pass/fail binning with probabilistic assignments to recover marginal devices
   - **Approach:** GMM with K=4 bins (Premium/Standard/Marginal/Fail), full covariance for correlated parameters
   - **Features:** 15 electrical parameters (power consumption, performance specs, leakage)
   - **Business Value:** 15% yield improvement by recovering devices with 70-90% pass probability (sell as "Standard" grade)
   - **Success Metric:** <2% field failure rate for recovered devices (industry standard: 1-3%)

3. **Multi-Site Test Correlation Analyzer** 💰 **$3M+ Equipment Savings**
   - **Objective:** Cluster test sites (prober positions) by parametric consistency to identify faulty equipment
   - **Approach:** GMM with K=2 (good sites vs faulty), time-series analysis (track site drift over weeks)
   - **Features:** Site-level statistics (mean/std of 50+ parameters per site, across 10K+ devices)
   - **Business Value:** Detect faulty prober pins (causing 5% yield loss) before full lot impact
   - **Success Metric:** <24-hour detection latency, 95% precision (avoid false equipment alarms)

4. **Device Power Mode Clustering** 💰 **$2M+ Characterization Cost Reduction**
   - **Objective:** Discover natural power consumption modes from test data (sleep/idle/active/turbo)
   - **Approach:** GMM with K=4-6, automatic K selection via BIC (true K unknown)
   - **Features:** Idd current at 20 voltage/frequency combinations (time-series of power states)
   - **Business Value:** Automate power mode discovery vs manual characterization (saves 3 months per product)
   - **Success Metric:** Discovered modes match design intent >95% (validated against RTL simulations)

---

### **General AI/ML Projects**

5. **Customer Segmentation for E-Commerce** 💰 **$20M+ Revenue**
   - **Objective:** Segment 1M+ customers by purchase behavior for targeted marketing campaigns
   - **Approach:** GMM with K=5-8 segments, diagonal covariance (assumes features are independent within each segment), quarterly retraining
   - **Features:** RFM (recency/frequency/monetary) + 10 behavioral metrics (category preferences, discount sensitivity)
   - **Business Value:** 25% increase in campaign ROI via personalized offers (high-value vs budget segments)
   - **Success Metric:** 18% uplift in conversion rate for targeted campaigns vs control

6. **Anomaly Detection in Network Traffic** 💰 **$50M+ Breach Prevention**
   - **Objective:** Model normal network traffic patterns, flag anomalies as potential security threats
   - **Approach:** GMM with K=10-15 components modeling normal traffic; flag points whose log-likelihood under the fitted mixture falls below a low percentile of the training scores (e.g., the 1st percentile)
   - **Features:** Packet size, inter-arrival time, protocol distribution, connection duration (100D feature space)
   - **Business Value:** Detect 95% of intrusions (validated on CICIDS2017 dataset) with <0.5% false alarm rate
   - **Success Metric:** <100ms detection latency (real-time), 90% true positive rate at 0.1% FPR

7. **Financial Portfolio Risk Clustering** 💰 **$30M+ Risk Mitigation**
   - **Objective:** Cluster 500+ stocks by risk-return profiles to build diversified portfolios
   - **Approach:** GMM with K=6-10 risk categories, full covariance (capture sector correlations)
   - **Features:** 5-year return history, volatility, beta, sector indicators, macroeconomic sensitivity
   - **Business Value:** Reduce portfolio variance by 20% while maintaining target return (efficient frontier optimization)
   - **Success Metric:** Sharpe ratio >1.5, max drawdown <15% during backtesting (2018-2023)

8. **Medical Imaging Tissue Segmentation** 💰 **$100M+ Diagnostic Accuracy**
   - **Objective:** Segment MRI brain scans into tissue types (gray matter/white matter/CSF/tumor)
   - **Approach:** GMM with K=4-6 tissue classes, voxel intensities + spatial priors (Markov Random Field)
   - **Features:** MRI voxel intensity, texture features (Gabor filters), spatial coordinates
   - **Business Value:** 92% segmentation accuracy (Dice coefficient), assist radiologists in tumor boundary detection
   - **Success Metric:** <5-second processing time per scan, 95% agreement with expert annotations
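
The anomaly-detection approach in project 6 can be sketched on toy data. This is a minimal illustration, not a production detector: the two-mode synthetic "traffic" data, the 1st-percentile threshold, and the variable names are all assumptions. It thresholds the mixture log-density from `score_samples`, because the responsibilities from `predict_proba` sum to 1 for every point and therefore cannot flag outliers on their own.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for "normal" traffic features: two Gaussian modes (assumption).
X_normal = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(500, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(500, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X_normal)

# Threshold on the mixture log-density: the 1st percentile of training scores.
threshold = np.percentile(gmm.score_samples(X_normal), 1.0)

# Two typical points and one point far from both modes.
X_new = np.array([[0.2, -0.1], [8.1, 7.9], [4.0, 20.0]])
log_density = gmm.score_samples(X_new)
is_anomaly = log_density < threshold

# Responsibilities are normalized per row, even for the far-away point.
responsibilities = gmm.predict_proba(X_new)

print("anomaly flags:", is_anomaly)
print("responsibility row sums:", responsibilities.sum(axis=1))
```

In practice the percentile would be tuned against a labeled validation set to hit the target false-positive rate.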

---

## 🎯 Key Takeaways

### **When to Use GMM**
✅ **Use GMM when:**
- Clusters have **elliptical/elongated shapes** (non-spherical)
- **Overlapping clusters** where hard assignment loses information
- Need **uncertainty quantification** (probability of cluster membership)
- Want **statistical foundation** (maximum likelihood, principled model selection)
- Have **sufficient data** (n > 10 × K × d² for full covariance)

❌ **Avoid GMM when:**
- Clusters are **spherical and well-separated** (use K-Means, which is faster)
- Data has **arbitrary shapes** (use DBSCAN for density-based clustering)
- **n << K × d²** (risk of singular covariances; use regularization or diagonal covariance)
- Need **stable results without extra care** (EM converges to a local maximum that depends on initialization; fix `random_state` and run multiple restarts)

### **GMM vs Alternatives**

| **Scenario** | **Recommended Algorithm** | **Reasoning** |
|--------------|--------------------------|---------------|
| Spherical, well-separated clusters | K-Means | Typically much faster, with near-identical assignments |
| Elongated/elliptical clusters | GMM (full covariance) | Models ellipsoids of any orientation and aspect ratio |
| Overlapping clusters | GMM (soft clustering) | Provides uncertainty estimates |
| Arbitrary shapes, noise | DBSCAN | Density-based, handles non-convex |
| Hierarchical taxonomy | Hierarchical Clustering | Builds tree structure |
| Unknown K | GMM + BIC/AIC | Principled model selection |
| Large n (>1M points) | K-Means / MiniBatchKMeans | Full-covariance GMM costs O(nKd²) per iteration |

### **Hyperparameter Tuning Best Practices**

1. **Number of Components (K)**
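   - Use **BIC/AIC** for automatic selection (e.g., test K=1 to 10 and pick the minimum)
   - **BIC** penalizes each parameter by ln(n), so it favors simpler models; prefer it when the goal is recovering the true number of components
   - **AIC** penalizes each parameter by 2, so it tends to select more components; prefer it for density estimation or prediction tasks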
   - Domain knowledge: post-silicon typically K=2-5 (process conditions, bins)

2. **Covariance Type**
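   - **Full**: Most flexible; K×d(d+1)/2 covariance parameters, so it needs roughly n > 10Kd² samples (risk of overfitting otherwise)
   - **Diag**: Good for independent features (test parameters); K×d covariance parameters
   - **Tied**: All clusters share one covariance matrix (d(d+1)/2 covariance parameters)
   - **Spherical**: One variance per cluster (K covariance parameters); closest to the K-Means assumptions, but still with soft assignments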

3. **Initialization**
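   - Use **K-Means-based initialization** for means (`init_params='kmeans'` is the sklearn default and is usually better than random; recent versions also offer `'k-means++'`)
   - Run **multiple initializations** (n_init=10+; the sklearn default is 1) to avoid poor local maxima
   - Compare **log-likelihood** across runs (higher = better); sklearn keeps the best run automatically when n_init > 1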

4. **Convergence**
   - Monitor **log-likelihood** (should monotonically increase)
   - Tighten **tol** to 1e-6 when precision matters (the sklearn default is 1e-3; balance speed vs accuracy)
   - **Max iterations**: 100-200 typical (convergence usually <50)

5. **Regularization**
   - Keep **reg_covar** at 1e-6 or higher (1e-6 is the sklearn default; the value is added to the covariance diagonal to prevent singular matrices)
   - Increase it if fitting fails with an ill-defined or singular covariance error
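
The tuning steps above can be combined into one small search. The following is a sketch under stated assumptions: the three-blob synthetic data and the helper name `select_gmm_by_bic` are hypothetical, and the grid (K=1 to 6, all four covariance types) is only illustrative. It uses scikit-learn's `GaussianMixture` with several restarts per configuration and keeps the model with the lowest BIC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in data: three well-separated 2-D blobs (assumption).
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[0.0, 6.0], scale=0.5, size=(200, 2)),
])


def select_gmm_by_bic(X, k_values,
                      covariance_types=("full", "diag", "tied", "spherical")):
    """Fit one GMM per (covariance type, K) pair and return the lowest-BIC model."""
    best_model, best_bic, results = None, np.inf, []
    for cov_type in covariance_types:
        for k in k_values:
            gmm = GaussianMixture(
                n_components=k,
                covariance_type=cov_type,
                n_init=5,          # several restarts; the best one is kept
                tol=1e-6,          # tighter than the default 1e-3
                max_iter=200,
                reg_covar=1e-6,    # diagonal regularization
                random_state=0,    # reproducible runs
            ).fit(X)
            bic = gmm.bic(X)
            results.append((cov_type, k, bic))
            if bic < best_bic:
                best_model, best_bic = gmm, bic
    return best_model, results


best_gmm, bic_table = select_gmm_by_bic(X, range(1, 7))
print("selected K:", best_gmm.n_components)
print("selected covariance type:", best_gmm.covariance_type)
```

Swapping `gmm.bic(X)` for `gmm.aic(X)` gives the AIC variant, which tends to accept more components.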

### **Common Pitfalls**

⚠️ **Singular Covariance Matrices**
- **Cause:** Cluster has fewer than d+1 points or features are perfectly correlated
- **Fix:** Add regularization (reg_covar), use diagonal covariance, or merge small clusters

⚠️ **Local Maxima**
- **Cause:** EM converges to local optimum (not global)
- **Fix:** Run multiple initializations (n_init=10), pick best log-likelihood

⚠️ **Overfitting with Large K**
- **Cause:** Too many components fit noise
- **Fix:** Use BIC (strong penalty) or cross-validation

⚠️ **Slow Convergence**
- **Cause:** Poor initialization or large d (full covariance O(nd²K))
- **Fix:** Use K-Means++ init, switch to diagonal covariance, or subsample data
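
The singular-covariance pitfall can be reproduced with NumPy alone. This sketch uses hypothetical toy values (4 random points in 5 dimensions, a regularization constant of 1e-6) to show that a covariance estimated from fewer than d+1 points is rank-deficient, and that adding a small constant to the diagonal, which is the idea behind `reg_covar`, makes it invertible again.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# A "cluster" with only 4 points in 5 dimensions: fewer than d + 1.
points = rng.normal(size=(4, d))
cov = np.cov(points, rowvar=False)      # shape (d, d)

# The sample covariance of n points has rank at most n - 1, so it is singular here.
rank = np.linalg.matrix_rank(cov)

# Adding a small constant to the diagonal restores positive definiteness.
reg_covar = 1e-6
cov_reg = cov + reg_covar * np.eye(d)
chol = np.linalg.cholesky(cov_reg)      # raises LinAlgError if not positive definite

print("rank of raw covariance:", rank, "of", d)
print("smallest eigenvalue after regularization:",
      np.linalg.eigvalsh(cov_reg).min())
```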

### **Production Considerations**

🔧 **Deployment:**
- **Save model:** `pickle.dump(gmm, file)` or `joblib.dump(gmm, file)`
- **Inference:** `gmm.predict_proba(X_new)` for soft assignments
- **Monitoring:** Track the mean log-likelihood of incoming data (`gmm.score(X_new)`) over time; a sustained drop signals model degradation
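
The deployment steps above might look like the following round trip. It is a sketch, assuming synthetic two-cluster training data, a temporary file path, and a deliberately shifted batch to simulate drift; a real pipeline would choose its own storage location and alert threshold.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic training data: two clusters centered at (0, 0) and (6, 6) (assumption).
X_train = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 2)),
    rng.normal(6.0, 1.0, size=(300, 2)),
])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_train)

# Baseline for monitoring: mean per-sample log-likelihood on training data.
baseline_ll = gmm.score(X_train)

# Save and reload the fitted model.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "gmm.joblib")
    joblib.dump(gmm, model_path)
    loaded = joblib.load(model_path)

# Simulated drifted batch: centered between the two training clusters.
X_new = rng.normal(3.0, 1.0, size=(50, 2))
proba = loaded.predict_proba(X_new)     # soft assignments, shape (50, 2)
new_ll = loaded.score(X_new)

print(f"baseline mean log-likelihood: {baseline_ll:.2f}")
print(f"new batch mean log-likelihood: {new_ll:.2f}")
```

Only load pickled or joblib files from trusted sources, since unpickling can execute arbitrary code.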

🔧 **Scalability:**
- **Small data (n<10K):** Standard GMM
- **Medium (10K-100K):** Diagonal/tied covariance
- **Large (>100K):** Consider K-Means or MiniBatchKMeans (full-covariance GMM becomes expensive)
- **Streaming:** Incremental EM (update parameters online, research topic)

🔧 **Interpretability:**
- **Report probabilities:** Not just argmax (enable risk-based decisions)
- **Visualize ellipses:** 2-sigma contours show cluster shapes
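- **Feature importance:** Compare how far the component means differ on each feature relative to the within-cluster standard deviations (square roots of the covariance diagonals); features with well-separated means drive the clustering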
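
Two of the interpretability items reduce to short formulas. The sketch below uses hypothetical helper names (`ellipse_params`, `assignment_entropy`) and toy inputs. It derives the axes and orientation of an n-sigma ellipse from the eigendecomposition of a 2-D covariance matrix, and computes the entropy of each row of a responsibility matrix as a per-point uncertainty score.

```python
import numpy as np


def ellipse_params(cov, n_std=2.0):
    """Return (width, height, angle_deg) of the n_std contour of a 2-D Gaussian."""
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = eigvals.argsort()[::-1]                  # put the major axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    width, height = 2.0 * n_std * np.sqrt(eigvals)   # full axis lengths
    angle = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))
    return width, height, angle


def assignment_entropy(proba, eps=1e-12):
    """Entropy (in nats) of each row of a responsibility matrix."""
    p = np.clip(proba, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)


# Toy covariance: variance 4 along x and 1 along y.
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
width, height, angle = ellipse_params(cov, n_std=2.0)

# Toy responsibilities: one fully ambiguous point and one certain point.
proba = np.array([[0.5, 0.5], [1.0, 0.0]])
entropy = assignment_entropy(proba)

print("ellipse width, height:", width, height)
print("entropy per point:", entropy)
```

The returned values map directly onto `matplotlib.patches.Ellipse(xy=mean, width=width, height=height, angle=angle)`. Note that in two dimensions the 2-sigma ellipse encloses about 86% of a component's probability mass, not 95%.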

---

## 🔗 Next Steps

- **030_Dimensionality_Reduction.ipynb** - Use PCA/t-SNE/UMAP to visualize high-dimensional GMM clusters
- **031_PCA.ipynb** - Reduce 1000-parameter STDF data to 50D before GMM (improves speed + avoids curse of dimensionality)
- **041_Feature_Engineering.ipynb** - Engineer domain-specific features for better GMM clustering (e.g., spatial coordinates for wafer maps)

---

**💡 Remember:** GMM = K-Means + soft clustering + elliptical shapes. Use when uncertainty matters!