# 107: ML Model Monitoring & Observability

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** drift types: data drift, concept drift, prediction drift, and target drift
- **Implement** statistical tests for detecting distribution changes (KS test, PSI, Jensen-Shannon)
- **Build** real-time monitoring dashboards tracking model health metrics
- **Apply** monitoring to semiconductor yield prediction models in production
- **Evaluate** when to retrain, roll back, or escalate based on drift severity

## 📚 What is ML Model Monitoring?

ML model monitoring is the continuous observation of deployed models to detect performance degradation before it impacts business outcomes. Unlike traditional software, where behavior is deterministic, ML models degrade silently as the world changes: new device types appear, test equipment is recalibrated, and manufacturing processes evolve. Monitoring catches these shifts early and triggers alerts when predictions become unreliable.

Effective monitoring tracks three layers: **data layer** (input feature distributions), **prediction layer** (output distribution and confidence), and **outcome layer** (ground truth labels when available). Data drift can occur without impacting performance (benign shift), or it can signal concept drift where feature-target relationships change. For example, a yield prediction model trained on 7nm devices may fail catastrophically on 5nm devices without proper monitoring.

In semiconductor manufacturing, model monitoring is mission-critical because errors cost millions. A drifting binning algorithm might misclassify premium devices as scrap ($10K loss per wafer) and go undetected for weeks under manual QA. Real-time monitoring with automatic rollback safeguards keeps production stable while still allowing continuous model improvements.

**Why ML Model Monitoring?**
- ✅ **Early Detection**: Catch drift before accuracy drops 20% and costs spike
- ✅ **Root Cause Analysis**: Isolate which features drifted (e.g., new test equipment calibration)
- ✅ **Automated Response**: Auto-retrain, rollback to baseline, or alert engineers
- ✅ **Compliance**: Audit trail for regulated industries (automotive, medical devices)
- ✅ **Continuous Learning**: Data-driven retraining decisions vs arbitrary schedules

## 🏭 Post-Silicon Validation Use Cases

**Use Case 1: Yield Prediction Model Drift**
- **Scenario**: New wafer fab process (7nm → 5nm) changes parametric distributions
- **Monitoring**: Track Vdd, Idd, frequency distributions with PSI (Population Stability Index)
- **Alert**: PSI > 0.25 for Vdd → model predictions unreliable
- **Action**: Retrain on last 30 days of 5nm data, A/B test before full deployment
- **Impact**: Prevented $2M in false rejects by catching drift 3 days after process change

**Use Case 2: Test Equipment Calibration Drift**
- **Scenario**: Tester recalibration shifts current measurements by +2mA
- **Monitoring**: Kolmogorov-Smirnov test on Idd distribution (p < 0.001 → drift detected)
- **Alert**: Data drift in Idd but model accuracy unchanged (benign drift)
- **Action**: Document calibration change, update training data normalization
- **Impact**: Avoided unnecessary model retrain, saved 2 weeks of engineering time

**Use Case 3: Seasonal Pattern Shift**
- **Scenario**: Summer temperature increase affects test chamber conditions
- **Monitoring**: Track prediction confidence (model uncertainty increases)
- **Alert**: Mean prediction confidence drops from 0.92 to 0.78
- **Action**: Add temperature as explicit feature, retrain quarterly model
- **Impact**: Maintained 95% accuracy through seasonal variation

**Use Case 4: Concept Drift in Binning Logic**
- **Scenario**: Customer requirements change (stricter specs for automotive market)
- **Monitoring**: Target distribution shifts (more BIN2, fewer BIN1)
- **Alert**: Label drift detected, feature-target correlation weakens
- **Action**: Collect 10K new labels under new specs, full model retrain
- **Impact**: Prevented shipping out-of-spec devices, avoided customer returns

## 🔄 Monitoring Workflow

```mermaid
graph TB
    A[Production Traffic] --> B[Log Predictions]
    B --> C[Feature Store]
    C --> D[Monitoring Service]
    
    D --> E[Data Drift Detection]
    D --> F[Prediction Drift Detection]
    D --> G[Performance Monitoring]
    
    E --> H{KS Test / PSI}
    F --> I{Confidence Drop?}
    G --> J{Accuracy Drop?}
    
    H -->|p < 0.05| K[Data Drift Alert]
    I -->|Yes| L[Prediction Drift Alert]
    J -->|>5% Drop| M[Performance Alert]
    
    K --> N{Severity}
    L --> N
    M --> N
    
    N -->|Critical| O[Auto Rollback]
    N -->|High| P[Trigger Retrain]
    N -->|Medium| Q[Engineer Review]
    N -->|Low| R[Log & Monitor]
    
    O --> S[Notify Oncall]
    P --> T[Retrain Pipeline]
    Q --> U[Dashboard Alert]
    R --> V[Metrics DB]
    
    style A fill:#e1f5ff
    style O fill:#ffe1e1
    style P fill:#fff5e1
    style V fill:#e1ffe1
```
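
The escalation logic in the diagram can be sketched as a small routing function. The four actions come from the diagram, but the rules that map signals to a severity level are illustrative assumptions (the diagram does not define them), and `DriftSignals`, `classify_severity`, and `route_alert` are hypothetical names.

```python
from dataclasses import dataclass

# Actions taken from the workflow diagram, keyed by severity.
ACTIONS = {
    "critical": "auto_rollback",
    "high": "trigger_retrain",
    "medium": "engineer_review",
    "low": "log_and_monitor",
}

@dataclass
class DriftSignals:
    data_drift: bool        # e.g. KS p-value below threshold or PSI above threshold
    prediction_drift: bool  # e.g. mean prediction confidence dropped
    accuracy_drop: float    # baseline accuracy minus current accuracy (0.07 = 7 points)

def classify_severity(signals: DriftSignals) -> str:
    """Map raw signals to a severity level (assumed rules, for illustration only)."""
    performance_alert = signals.accuracy_drop > 0.05  # the ">5% Drop" edge in the diagram
    if performance_alert and signals.data_drift and signals.prediction_drift:
        return "critical"
    if performance_alert:
        return "high"
    if signals.data_drift or signals.prediction_drift:
        return "medium"
    return "low"

def route_alert(signals: DriftSignals) -> str:
    """Return the action name for the given signals."""
    return ACTIONS[classify_severity(signals)]

print(route_alert(DriftSignals(data_drift=True, prediction_drift=False, accuracy_drop=0.01)))
# prints "engineer_review"
```

Keeping the severity rules in one function makes them easy to review and to tune when alert volume turns out too high or too low.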

## 📊 Learning Path Context

**Prerequisites:**
- **041**: Model Evaluation - Understanding baseline metrics
- **106**: A/B Testing - Comparing model versions statistically
- **031**: Time Series - Temporal patterns and seasonality

**This Notebook (107):**
- Drift detection algorithms (KS test, PSI, JS divergence)
- Real-time monitoring implementation
- Alert thresholds and escalation policies
- Root cause analysis for drift
- Automated retraining triggers

**Next Steps:**
- **108**: Feature Stores - Centralized feature management for consistency
- **109**: ML Pipelines - Automated retraining and deployment
- **131**: Cloud Deployment - Scalable monitoring infrastructure

---

Let's build production-grade monitoring systems! 🔍

## 1. Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import ks_2samp, wasserstein_distance
from scipy.spatial.distance import jensenshannon
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)

# Random seed
np.random.seed(42)

print("✅ Monitoring environment ready!")

## 2. Generate Training and Production Data

**Purpose:** Simulate model trained on historical data, then deployed to production with distribution shifts.

**Key Points:**
- **Training data**: Historical semiconductor test data (stable period)
- **Production data**: Live data with gradual drift (new process, equipment changes)
- **Drift types**: Covariate shift (X changes), concept drift (X→y relationship changes)
- **Why this matters**: Real production always differs from training; monitoring quantifies how much

In [None]:
# Training data (stable period, 6 months historical)
n_train = 5000
np.random.seed(42)

# Feature distributions (7nm process)
train_vdd = np.random.normal(1.2, 0.08, n_train)
train_idd = np.random.normal(50, 8, n_train)
train_freq = np.random.normal(2000, 150, n_train)
train_temp = np.random.normal(85, 12, n_train)
train_vth = np.random.normal(0.4, 0.03, n_train)

# True relationship (training period)
train_power = train_vdd * train_idd
train_yield = (
    100 - 0.35 * train_power + 12 * train_vth - 0.01 * train_temp * train_freq / 1000
    + np.random.normal(0, 2.5, n_train)
)
train_yield = np.clip(train_yield, 60, 100)

df_train = pd.DataFrame({
    'vdd': train_vdd,
    'idd': train_idd,
    'freq': train_freq,
    'temp': train_temp,
    'vth': train_vth,
    'yield_pct': train_yield,
    'period': 'train'
})

# Production data Week 1-4 (NO DRIFT - baseline performance)
n_prod_stable = 1000
prod_stable_vdd = np.random.normal(1.2, 0.08, n_prod_stable)
prod_stable_idd = np.random.normal(50, 8, n_prod_stable)
prod_stable_freq = np.random.normal(2000, 150, n_prod_stable)
prod_stable_temp = np.random.normal(85, 12, n_prod_stable)
prod_stable_vth = np.random.normal(0.4, 0.03, n_prod_stable)

prod_stable_power = prod_stable_vdd * prod_stable_idd
prod_stable_yield = (
    100 - 0.35 * prod_stable_power + 12 * prod_stable_vth 
    - 0.01 * prod_stable_temp * prod_stable_freq / 1000
    + np.random.normal(0, 2.5, n_prod_stable)
)
prod_stable_yield = np.clip(prod_stable_yield, 60, 100)

df_prod_stable = pd.DataFrame({
    'vdd': prod_stable_vdd,
    'idd': prod_stable_idd,
    'freq': prod_stable_freq,
    'temp': prod_stable_temp,
    'vth': prod_stable_vth,
    'yield_pct': prod_stable_yield,
    'period': 'prod_stable'
})

# Production data Week 5-8 (DATA DRIFT - process change to 5nm)
n_prod_drift = 1000
prod_drift_vdd = np.random.normal(1.15, 0.07, n_prod_drift)  # Lower voltage
prod_drift_idd = np.random.normal(45, 7, n_prod_drift)      # Lower current
prod_drift_freq = np.random.normal(2200, 160, n_prod_drift) # Higher freq
prod_drift_temp = np.random.normal(85, 12, n_prod_drift)
prod_drift_vth = np.random.normal(0.38, 0.025, n_prod_drift) # Lower Vth

# CONCEPT DRIFT: Relationship changes (new process physics)
prod_drift_power = prod_drift_vdd * prod_drift_idd
prod_drift_yield = (
    100 - 0.40 * prod_drift_power + 15 * prod_drift_vth  # Different coefficients!
    - 0.012 * prod_drift_temp * prod_drift_freq / 1000
    + np.random.normal(0, 3.0, n_prod_drift)  # Higher noise
)
prod_drift_yield = np.clip(prod_drift_yield, 60, 100)

df_prod_drift = pd.DataFrame({
    'vdd': prod_drift_vdd,
    'idd': prod_drift_idd,
    'freq': prod_drift_freq,
    'temp': prod_drift_temp,
    'vth': prod_drift_vth,
    'yield_pct': prod_drift_yield,
    'period': 'prod_drift'
})

print(f"Training data: {len(df_train)} samples (6 months historical)")
print(f"Production stable: {len(df_prod_stable)} samples (Week 1-4, no drift)")
print(f"Production drift: {len(df_prod_drift)} samples (Week 5-8, 5nm process)\n")

print("Feature statistics comparison:")
print("\nVdd (voltage):")
print(f"  Train:  μ={df_train['vdd'].mean():.3f}, σ={df_train['vdd'].std():.3f}")
print(f"  Stable: μ={df_prod_stable['vdd'].mean():.3f}, σ={df_prod_stable['vdd'].std():.3f}")
print(f"  Drift:  μ={df_prod_drift['vdd'].mean():.3f}, σ={df_prod_drift['vdd'].std():.3f} ⚠️ SHIFTED")

print("\nIdd (current):")
print(f"  Train:  μ={df_train['idd'].mean():.3f}, σ={df_train['idd'].std():.3f}")
print(f"  Stable: μ={df_prod_stable['idd'].mean():.3f}, σ={df_prod_stable['idd'].std():.3f}")
print(f"  Drift:  μ={df_prod_drift['idd'].mean():.3f}, σ={df_prod_drift['idd'].std():.3f} ⚠️ SHIFTED")
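
To see what the simulated shift does to a model, the sketch below regenerates the three periods with a hypothetical helper, `simulate_period`, using the same distributions and coefficients as the cell above but smaller samples. It fits a `RandomForestRegressor` on the training period and compares mean absolute error on the stable and drifted periods. The helper, the sample sizes, and the choice of MAE are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)

def simulate_period(n, vdd, idd, freq, vth, coefs, noise_sd):
    """Hypothetical helper: draw features and yield for one period.

    Each feature argument is a (mean, std) pair; coefs holds the
    (power, vth, temp*freq) coefficients of the yield formula.
    """
    X = np.column_stack([
        rng.normal(*vdd, n),
        rng.normal(*idd, n),
        rng.normal(*freq, n),
        rng.normal(85, 12, n),   # temperature is identical in all periods
        rng.normal(*vth, n),
    ])
    power = X[:, 0] * X[:, 1]
    y = (100 - coefs[0] * power + coefs[1] * X[:, 4]
         - coefs[2] * X[:, 3] * X[:, 2] / 1000
         + rng.normal(0, noise_sd, n))
    return X, np.clip(y, 60, 100)

base = dict(vdd=(1.2, 0.08), idd=(50, 8), freq=(2000, 150), vth=(0.4, 0.03),
            coefs=(0.35, 12, 0.01), noise_sd=2.5)
drift = dict(vdd=(1.15, 0.07), idd=(45, 7), freq=(2200, 160), vth=(0.38, 0.025),
             coefs=(0.40, 15, 0.012), noise_sd=3.0)

X_tr, y_tr = simulate_period(2000, **base)
X_stable, y_stable = simulate_period(500, **base)
X_drift, y_drift = simulate_period(500, **drift)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
mae_stable = mean_absolute_error(y_stable, model.predict(X_stable))
mae_drift = mean_absolute_error(y_drift, model.predict(X_drift))
print(f"MAE stable: {mae_stable:.2f}  |  MAE drift: {mae_drift:.2f}")
```

Because the drifted period changes both the inputs and the coefficients, its error should come out higher than the stable period's. That gap is the signal the outcome layer watches for once labels arrive.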

## 3. Train Baseline Model

**Purpose:** Train a pass/fail classifier on historical data; its held-out accuracy and F1 score serve as the baseline for drift detection.

**Key Points:**
- **Training period**: 6 months of historical STDF data (simulated in this notebook)
- **Features**: Vdd, Idd, Frequency, Power, Temperature
- **Model**: Random Forest classifier (production baseline)
- **Why this matters**: Need baseline performance metrics to detect degradation

In [None]:
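# Train baseline Random Forest model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: no pass/fail training set is defined earlier, so one is simulated here.
# Means follow the "was ..." comments in Section 4; spreads and pass rule are illustrative.
feature_cols = ['Vdd', 'Idd', 'Frequency', 'Power', 'Temperature']
rng = np.random.default_rng(42)
n_baseline = 1000
baseline_df = pd.DataFrame({
    'Vdd': rng.normal(1.2, 0.08, n_baseline),
    'Idd': rng.normal(150, 25, n_baseline),
    'Frequency': rng.normal(3000, 200, n_baseline),
    'Temperature': rng.normal(80, 8, n_baseline),
})
baseline_df['Power'] = baseline_df['Vdd'] * baseline_df['Idd'] / 1000
baseline_pass_prob = 0.65 + 0.15 * (baseline_df['Vdd'] - 1.1) / 0.3
baseline_df['Pass'] = (rng.random(n_baseline) < baseline_pass_prob).astype(int)

# Hold out 20% for baseline metrics; fit the scaler on the training split only
X_train, X_test, y_train, y_test = train_test_split(
    baseline_df[feature_cols], baseline_df['Pass'], test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Baseline predictions
y_pred_baseline = rf_model.predict(X_test_scaled)
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
baseline_f1 = f1_score(y_test, y_pred_baseline, average='weighted')

print("Baseline Model Performance:")
print(f"  Accuracy: {baseline_accuracy:.4f}")
print(f"  F1 Score: {baseline_f1:.4f}")

# Feature importance (for monitoring which features drift most)
feature_importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 5 Most Important Features:")
print(feature_importance_df.head())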
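
Alongside baseline model metrics, it usually helps to store a snapshot of the training feature statistics so later batches have a fixed reference. The sketch below is one minimal way to do that; `build_baseline_profile`, the quantile choices, and the sample data are assumptions rather than part of the notebook's pipeline.

```python
import json
import numpy as np
import pandas as pd

def build_baseline_profile(df: pd.DataFrame, quantiles=(0.01, 0.25, 0.5, 0.75, 0.99)) -> dict:
    """Summarise each numeric column so later batches can be compared against it."""
    profile = {}
    for col in df.select_dtypes(include="number").columns:
        series = df[col].dropna()
        profile[col] = {
            "mean": float(series.mean()),
            "std": float(series.std()),
            "quantiles": {str(q): float(series.quantile(q)) for q in quantiles},
            "n": int(series.size),
        }
    return profile

# Illustrative data; column names and distributions are placeholders.
rng = np.random.default_rng(0)
reference = pd.DataFrame({
    "Vdd": rng.normal(1.2, 0.08, 1000),
    "Temperature": rng.normal(80, 8, 1000),
})
profile = build_baseline_profile(reference)
print(json.dumps(profile["Vdd"], indent=2))
```

The profile holds only plain floats and ints, so it can be written to JSON and versioned next to the model artifact.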

## 4. Simulate Production Data Drift

**Purpose:** Generate new production data with distribution shifts to simulate real-world drift scenarios.

**Key Points:**
- **Covariate Shift**: Feature distributions change (e.g., higher Vdd voltages due to process variation)
- **Concept Drift**: Relationship between features and target changes (e.g., new failure modes emerge)
- **Realistic Scenarios**: In post-silicon, drift happens due to process improvements, new test equipment, or product variants

**Why This Matters:** Production data rarely matches training data perfectly. Monitoring drift prevents silent model degradation.

In [None]:
# Simulate production data with drift
np.random.seed(123)

# Production data: higher voltage (process drift), different frequency distribution
production_vdd = np.random.normal(1.25, 0.08, 200)  # Higher mean (was 1.2)
production_idd = np.random.normal(155, 28, 200)  # Higher mean (was 150)
production_freq = np.random.normal(2900, 180, 200)  # Lower mean (was 3000)
production_power = production_vdd * production_idd / 1000
production_temp = np.random.normal(82, 9, 200)  # Higher mean (was 80)

# Concept drift: new failure mode (temperature above 85 °C lowers the pass probability)
production_pass_prob = 0.65 + 0.15 * (production_vdd - 1.1) / 0.3 - 0.25 * (production_temp > 85)
production_pass = (np.random.random(200) < production_pass_prob).astype(int)

production_df = pd.DataFrame({
    'Vdd': production_vdd,
    'Idd': production_idd,
    'Frequency': production_freq,
    'Power': production_power,
    'Temperature': production_temp,
    'Pass': production_pass
})

# Model predictions on production data
X_production = production_df[feature_cols].values
X_production_scaled = scaler.transform(X_production)
y_production_pred = rf_model.predict(X_production_scaled)
production_accuracy = accuracy_score(production_df['Pass'], y_production_pred)

print(f"Production Model Performance:")
print(f"  Accuracy: {production_accuracy:.4f} (Baseline: {baseline_accuracy:.4f})")
print(f"  Degradation: {(baseline_accuracy - production_accuracy):.4f}")
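if production_accuracy < baseline_accuracy:
    print("\n⚠️ Model performance dropped! Investigating drift...")
else:
    print("\nNo accuracy drop on this batch; checking features for drift anyway...")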

## 5. Kolmogorov-Smirnov (KS) Test for Feature Drift

**Purpose:** Detect feature distribution shifts using statistical hypothesis testing.

**Key Points:**
- **KS Statistic**: Maximum vertical distance between cumulative distribution functions (CDFs)
- **P-value < 0.05**: Statistically significant drift detected (reject null hypothesis of same distribution)
- **Per-Feature Monitoring**: Track which specific features are drifting
- **Actionable Threshold**: KS statistic > 0.2 often indicates meaningful drift in production; on large samples the p-value alone flags even negligible shifts, so check the statistic as well

**Why This Matters:** Early drift detection prevents deploying models on out-of-distribution data.

In [None]:
from scipy.stats import ks_2samp

# KS test for each feature
ks_results = []
for col in feature_cols:
    train_values = X_train[col].values
    prod_values = production_df[col].values
    ks_stat, p_value = ks_2samp(train_values, prod_values)
    
    ks_results.append({
        'Feature': col,
        'KS_Statistic': ks_stat,
        'P_Value': p_value,
        'Drift_Detected': 'Yes' if p_value < 0.05 else 'No',
        'Severity': 'High' if ks_stat > 0.2 else ('Medium' if ks_stat > 0.1 else 'Low')
    })

ks_df = pd.DataFrame(ks_results)
print("Feature Drift Detection (Kolmogorov-Smirnov Test):\n")
print(ks_df.to_string(index=False))
print(f"\n🚨 Drifted Features: {ks_df[ks_df['Drift_Detected'] == 'Yes']['Feature'].tolist()}")
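
Testing several features at p < 0.05 each raises the chance of at least one false alarm. One common remedy is a Bonferroni correction combined with a minimum effect size. The sketch below shows that idea on simulated data; `ks_drift_report` and the `min_effect` floor of 0.1 are hypothetical choices, not established standards.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift_report(reference: dict, current: dict, alpha=0.05, min_effect=0.1):
    """Per-feature KS test with Bonferroni correction and an effect-size gate.

    reference/current map feature name -> 1-D array of values.
    """
    adjusted_alpha = alpha / len(reference)  # Bonferroni: split alpha across features
    report = {}
    for name, ref_values in reference.items():
        result = ks_2samp(ref_values, current[name])
        report[name] = {
            "ks": float(result.statistic),
            "p_value": float(result.pvalue),
            # Flag drift only if it is both significant and large enough to matter
            "drift": bool(result.pvalue < adjusted_alpha and result.statistic >= min_effect),
        }
    return report

rng = np.random.default_rng(7)
reference = {"Vdd": rng.normal(1.2, 0.08, 5000), "Temperature": rng.normal(80, 8, 5000)}
current = {
    "Vdd": rng.normal(1.25, 0.08, 500),      # shifted by roughly 0.6 standard deviations
    "Temperature": rng.normal(80, 8, 500),   # same distribution as the reference
}
report = ks_drift_report(reference, current)
for name, row in report.items():
    print(f"{name:12s} KS={row['ks']:.3f}  p={row['p_value']:.2e}  drift={row['drift']}")
```

The effect-size gate matters most with large reference sets, where tiny and harmless shifts still produce very small p-values.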

## 6. Population Stability Index (PSI)

**Purpose:** Quantify distribution shift magnitude for continuous monitoring.

**Key Points:**
- **PSI Formula**: `Σ[(Actual% - Expected%) × ln(Actual% / Expected%)]`
- **Threshold Guidelines**: PSI < 0.1 (stable), 0.1-0.25 (moderate drift), > 0.25 (severe drift)
- **Industry Standard**: Widely used in credit scoring and risk modeling to monitor feature stability
- **Binning Strategy**: Divide feature range into 10 bins based on training data quantiles

**Why This Matters:** PSI provides a single metric to track drift severity over time, enabling automated alerts.
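
A two-bin worked example makes the formula concrete. The bin shares below are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

expected_pct = np.array([0.50, 0.50])  # training share per bin
actual_pct = np.array([0.40, 0.60])    # production share per bin

# Per-bin contribution: (Actual% - Expected%) * ln(Actual% / Expected%)
terms = (actual_pct - expected_pct) * np.log(actual_pct / expected_pct)
psi = terms.sum()

print(f"bin terms = {terms[0]:.4f}, {terms[1]:.4f}")  # bin terms = 0.0223, 0.0182
print(f"PSI = {psi:.4f}")                             # PSI = 0.0405
```

Each term is non-negative because the difference and the log ratio always share the same sign, so PSI can only grow as bins diverge. A 10-point move between two bins gives a PSI of about 0.04, well inside the "stable" band.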

In [None]:
def calculate_psi(expected, actual, bins=10):
    """Calculate Population Stability Index (PSI) for a feature."""
    # Create bins based on expected (training) data quantiles
    breakpoints = np.percentile(expected, np.linspace(0, 100, bins + 1))
    # Open-ended outer bins so production values outside the training range are counted
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf

    # Count observations in each bin
    expected_counts = np.histogram(expected, bins=breakpoints)[0]
    actual_counts = np.histogram(actual, bins=breakpoints)[0]

    # Convert to proportions (add small epsilon to avoid log(0))
    expected_pct = (expected_counts + 1e-6) / len(expected)
    actual_pct = (actual_counts + 1e-6) / len(actual)

    # PSI formula: Σ[(Actual% - Expected%) × ln(Actual% / Expected%)]
    psi_values = (actual_pct - expected_pct) * np.log(actual_pct / expected_pct)
    psi = np.sum(psi_values)

    return psi

# Calculate PSI for each feature
psi_results = []
for col in feature_cols:
    psi = calculate_psi(X_train[col].values, production_df[col].values)
    
    if psi < 0.1:
        status = 'Stable'
    elif psi < 0.25:
        status = 'Moderate Drift'
    else:
        status = 'Severe Drift'
    
    psi_results.append({'Feature': col, 'PSI': psi, 'Status': status})

psi_df = pd.DataFrame(psi_results)
print("Population Stability Index (PSI) Analysis:\n")
print(psi_df.to_string(index=False))
print(f"\n⚠️ Features with Drift: {psi_df[psi_df['PSI'] > 0.1]['Feature'].tolist()}")
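
PSI is never exactly zero on finite samples, so it helps to know how large it gets when nothing has changed. The sketch below estimates that noise floor by simulation for a batch of 200 production samples against 800 training samples. The helper `psi_score`, the proportion floor of `1e-4`, and the sample sizes are assumptions for illustration.

```python
import numpy as np

def psi_score(expected, actual, bins=10, floor=1e-4):
    """PSI with quantile bins taken from `expected`; outer bins are open-ended."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    # Floor the proportions so empty bins do not produce log(0)
    e = np.clip(np.histogram(expected, bins=edges)[0] / len(expected), floor, None)
    a = np.clip(np.histogram(actual, bins=edges)[0] / len(actual), floor, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(1)
reference = rng.normal(1.2, 0.08, 800)

# Null distribution: batches drawn from the SAME distribution as the reference.
null_psi = np.array([psi_score(reference, rng.normal(1.2, 0.08, 200)) for _ in range(500)])
noise_floor = float(np.quantile(null_psi, 0.95))

shifted_psi = psi_score(reference, rng.normal(1.25, 0.08, 200))
print(f"95th percentile of PSI with no drift: {noise_floor:.3f}")
print(f"PSI for a +0.05 V mean shift:         {shifted_psi:.3f}")
```

If the no-drift percentile sits close to the 0.1 guideline, small batches will trigger "moderate drift" alerts by chance alone. Larger batches or fewer bins lower that floor.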

## 7. Jensen-Shannon Divergence

**Purpose:** Measure distributional similarity using information theory (symmetric version of KL divergence).

**Key Points:**
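- **Range**: 0 (identical distributions) to 1 (completely different) when computed with base-2 logarithms; SciPy's `jensenshannon` returns the JS *distance*, which is the square root of the divergence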
- **Symmetric**: JS(P||Q) = JS(Q||P), unlike KL divergence
- **Smooth Metric**: Finite even when distributions have non-overlapping support
- **Threshold**: JS > 0.1 indicates noticeable drift, > 0.3 indicates severe drift

**Why This Matters:** JS divergence is more robust than KL divergence and provides intuitive drift severity scoring.

In [None]:
from scipy.spatial.distance import jensenshannon

def calculate_js_divergence(expected, actual, bins=20):
    """Calculate Jensen-Shannon divergence between two distributions."""
    # Create bins
    combined = np.concatenate([expected, actual])
    breakpoints = np.percentile(combined, np.linspace(0, 100, bins + 1))
    
    # Histogram counts
    expected_counts = np.histogram(expected, bins=breakpoints)[0]
    actual_counts = np.histogram(actual, bins=breakpoints)[0]
    
    # Normalize to probabilities (add epsilon to avoid division by zero)
    expected_prob = (expected_counts + 1e-10) / expected_counts.sum()
    actual_prob = (actual_counts + 1e-10) / actual_counts.sum()
    
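    # SciPy's jensenshannon returns the JS *distance* (square root of the divergence).
    # Squaring the base-2 distance gives the divergence, bounded between 0 and 1.
    js_div = jensenshannon(expected_prob, actual_prob, base=2) ** 2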
    
    return js_div

# Calculate JS divergence for each feature
js_results = []
for col in feature_cols:
    js_div = calculate_js_divergence(X_train[col].values, production_df[col].values)
    
    if js_div < 0.1:
        severity = 'Low'
    elif js_div < 0.3:
        severity = 'Medium'
    else:
        severity = 'High'
    
    js_results.append({'Feature': col, 'JS_Divergence': js_div, 'Severity': severity})

js_df = pd.DataFrame(js_results)
print("Jensen-Shannon Divergence Analysis:\n")
print(js_df.to_string(index=False))
print(f"\n🔴 High Drift Features: {js_df[js_df['Severity'] == 'High']['Feature'].tolist()}")
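
The relationship between SciPy's return value and the divergence itself is easy to check numerically. The sketch below computes the divergence directly from its definition (the hypothetical helper `js_divergence`) and compares it with `scipy.spatial.distance.jensenshannon` on small hand-made distributions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr

def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence computed from its definition."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    # rel_entr(x, y) = x * ln(x / y), with the 0 * ln(0 / y) case treated as 0
    nats = 0.5 * rel_entr(p, m).sum() + 0.5 * rel_entr(q, m).sum()
    return nats / np.log(base)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
disjoint_p, disjoint_q = np.array([1.0, 0.0]), np.array([0.0, 1.0])

distance = jensenshannon(p, q, base=2)                   # SciPy returns the distance
print(np.isclose(distance ** 2, js_divergence(p, q)))    # True
print(js_divergence(disjoint_p, disjoint_q))             # 1.0 (upper bound with base 2)
print(jensenshannon(disjoint_p, disjoint_q))             # sqrt(ln 2), about 0.83
```

With SciPy's default natural logarithm the distance tops out near 0.83 rather than 1, so thresholds should always be quoted together with the base and with whether they refer to the distance or the divergence.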

## 8. Monitoring Dashboard Visualization

**Purpose:** Create comprehensive visual dashboard for drift monitoring and model performance tracking.

**Key Points:**
- **Feature Distribution Comparison**: Overlay training vs production histograms
- **Drift Metrics Timeline**: Track KS, PSI, JS over time (simulated batches here)
- **Model Performance Degradation**: Monitor accuracy/F1 score drops
- **Alert Thresholds**: Visual indicators when metrics exceed safe limits

**Why This Matters:** Executives and engineers need quick visual summaries to decide on retraining schedules.

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(14, 12))
fig.suptitle('ML Model Monitoring Dashboard', fontsize=16, fontweight='bold')

# 1. Feature distribution comparison (Vdd); density=True keeps unequal sample sizes comparable
axes[0, 0].hist(X_train['Vdd'], bins=20, density=True, alpha=0.6, label='Training', color='blue', edgecolor='black')
axes[0, 0].hist(production_df['Vdd'], bins=20, density=True, alpha=0.6, label='Production', color='red', edgecolor='black')
axes[0, 0].set_title('Vdd Distribution Shift')
axes[0, 0].set_xlabel('Vdd (V)')
axes[0, 0].legend()

# 2. Feature distribution comparison (Temperature); density=True keeps unequal sample sizes comparable
axes[0, 1].hist(X_train['Temperature'], bins=20, density=True, alpha=0.6, label='Training', color='blue', edgecolor='black')
axes[0, 1].hist(production_df['Temperature'], bins=20, density=True, alpha=0.6, label='Production', color='red', edgecolor='black')
axes[0, 1].set_title('Temperature Distribution Shift')
axes[0, 1].set_xlabel('Temperature (°C)')
axes[0, 1].legend()

# 3. Drift metrics comparison (bar chart)
drift_metrics = pd.DataFrame({
    'Feature': feature_cols,
    'KS Statistic': [ks_df[ks_df['Feature'] == col]['KS_Statistic'].values[0] for col in feature_cols],
    'PSI': [psi_df[psi_df['Feature'] == col]['PSI'].values[0] for col in feature_cols]
})
x_pos = np.arange(len(feature_cols))
axes[1, 0].bar(x_pos - 0.2, drift_metrics['KS Statistic'], 0.4, label='KS Statistic', color='orange')
axes[1, 0].bar(x_pos + 0.2, drift_metrics['PSI'], 0.4, label='PSI', color='purple')
axes[1, 0].axhline(y=0.2, color='red', linestyle='--', label='Alert Threshold')
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(feature_cols, rotation=45)
axes[1, 0].set_title('Drift Metrics by Feature')
axes[1, 0].legend()

# 4. JS Divergence heatmap-style bar chart
js_values = [js_df[js_df['Feature'] == col]['JS_Divergence'].values[0] for col in feature_cols]
colors = ['green' if v < 0.1 else 'orange' if v < 0.3 else 'red' for v in js_values]
axes[1, 1].barh(feature_cols, js_values, color=colors, edgecolor='black')
axes[1, 1].axvline(x=0.1, color='orange', linestyle='--', label='Medium Threshold')
axes[1, 1].axvline(x=0.3, color='red', linestyle='--', label='High Threshold')
axes[1, 1].set_title('Jensen-Shannon Divergence')
axes[1, 1].set_xlabel('JS Divergence')
axes[1, 1].legend()

# 5. Model performance comparison
performance_comparison = pd.DataFrame({
    'Dataset': ['Baseline (test split)', 'Production'],
    'Accuracy': [baseline_accuracy, production_accuracy],
    'F1 Score': [baseline_f1, f1_score(production_df['Pass'], y_production_pred, average='weighted')]
})
x_pos = np.arange(len(performance_comparison))
axes[2, 0].bar(x_pos - 0.2, performance_comparison['Accuracy'], 0.4, label='Accuracy', color='skyblue')
axes[2, 0].bar(x_pos + 0.2, performance_comparison['F1 Score'], 0.4, label='F1 Score', color='salmon')
axes[2, 0].set_xticks(x_pos)
axes[2, 0].set_xticklabels(performance_comparison['Dataset'])
axes[2, 0].set_ylim(0, 1)
axes[2, 0].set_title('Model Performance Degradation')
axes[2, 0].legend()

# 6. Alert summary table (text)
alert_summary = f"""
MONITORING ALERT SUMMARY
========================
Drifted Features (KS): {len(ks_df[ks_df['Drift_Detected'] == 'Yes'])} / {len(feature_cols)}
High PSI Features: {len(psi_df[psi_df['PSI'] > 0.25])}
High JS Features: {len(js_df[js_df['Severity'] == 'High'])}

Performance Drop: {(baseline_accuracy - production_accuracy):.2%}

🚨 RECOMMENDATION:
{'RETRAIN MODEL IMMEDIATELY' if baseline_accuracy - production_accuracy > 0.05 else 'Continue monitoring'}
"""
axes[2, 1].text(0.1, 0.5, alert_summary, fontsize=10, family='monospace', 
                verticalalignment='center', bbox=dict(boxstyle='round', facecolor='wheat'))
axes[2, 1].axis('off')

plt.tight_layout()
plt.show()
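
The dashboard above compares a single production snapshot with training. A drift timeline needs one row of metrics per batch. The sketch below simulates eight weekly batches of Vdd with a slowly rising mean and records the KS statistic for each. The drift schedule (0.01 V per week), the batch size, and the 0.2 alert line are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
reference = rng.normal(1.20, 0.08, 2000)  # training-period Vdd (simulated)

# Hypothetical drift schedule: the batch mean creeps up by 0.01 V per week.
rows = []
for week in range(1, 9):
    batch = rng.normal(1.20 + 0.01 * (week - 1), 0.08, 300)
    ks_stat, p_value = ks_2samp(reference, batch)
    rows.append({
        "week": week,
        "batch_mean": float(batch.mean()),
        "ks_stat": float(ks_stat),
        "p_value": float(p_value),
        "alert": bool(ks_stat > 0.2),
    })

timeline = pd.DataFrame(rows)
print(timeline.round(4).to_string(index=False))
```

Calling `timeline.plot(x="week", y="ks_stat", marker="o")` would turn this table into a dashboard panel. The same loop works for PSI or JS divergence by swapping the metric.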

## 🚀 Real-World Project Templates

Build production ML monitoring systems using these frameworks:

### 1️⃣ **Post-Silicon Test Yield Monitor**
- **Objective**: Track parametric test drift across wafer lots to predict yield degradation  
- **Data**: STDF files with Vdd, Idd, frequency, power per device (10K+ devices/week)  
- **Success Metric**: Detect drift 2 weeks before yield drops below 85%  
- **Features**: Multi-site correlation, spatial drift (wafer maps), process generation comparison  
- **Tech Stack**: Python, Grafana, PostgreSQL timeseries, Evidently AI

### 2️⃣ **E-commerce Recommendation Drift Detector**
- **Objective**: Monitor user behavior shifts to prevent stale recommendations  
- **Data**: Click-through rates, session duration, product views (1M+ events/day)  
- **Success Metric**: Maintain CTR > 3.5% by detecting seasonal/trend shifts  
- **Features**: Real-time feature monitoring, A/B test drift, cold-start detection  
- **Tech Stack**: Spark Streaming, Kafka, MLflow, custom PSI dashboard

### 3️⃣ **Fraud Detection Model Observatory**
- **Objective**: Detect adversarial drift in transaction patterns  
- **Data**: Transaction amounts, merchant categories, user profiles (500K+ txns/day)  
- **Success Metric**: Alert on concept drift within 24 hours (new fraud tactics)  
- **Features**: Adversarial drift detection, label drift monitoring, precision@K tracking  
- **Tech Stack**: AWS SageMaker Model Monitor, CloudWatch, Lambda alerts

### 4️⃣ **Manufacturing Defect Predictor Monitor**
- **Objective**: Track sensor drift in production line IoT devices  
- **Data**: Temperature, pressure, vibration sensors (100Hz sampling, 50 machines)  
- **Success Metric**: Predict machine failures 48 hours in advance  
- **Features**: Sensor calibration drift, multivariate drift (Mahalanobis distance), time-series KS test  
- **Tech Stack**: InfluxDB, Telegraf, custom Python monitor, PagerDuty integration

### 5️⃣ **Credit Risk Model Stability Tracker**
- **Objective**: Regulatory compliance monitoring for credit scoring models  
- **Data**: Applicant features (income, credit history, debt ratio) - 50K applications/month  
- **Success Metric**: PSI < 0.25 for all features (regulatory requirement)  
- **Features**: Segmented PSI (by demographics), Gini coefficient tracking, approval rate monitoring  
- **Tech Stack**: SAS Viya, custom Python PSI calculator, Tableau dashboard

### 6️⃣ **Autonomous Vehicle Perception Drift Monitor**
- **Objective**: Detect camera/LiDAR sensor degradation in self-driving cars  
- **Data**: Object detection confidence scores, lane detection accuracy (10GB/hour/vehicle)  
- **Success Metric**: Alert when detection confidence drops > 5% from baseline  
- **Features**: Per-sensor drift, weather-based segmentation, geographic distribution shift  
- **Tech Stack**: ROS, PyTorch, Weights & Biases, custom real-time JS divergence

### 7️⃣ **Medical Diagnosis Model Observer**
- **Objective**: Monitor imaging model performance across hospital sites  
- **Data**: X-ray/MRI features, patient demographics, diagnosis outcomes (1K+ scans/day)  
- **Success Metric**: Maintain diagnostic accuracy > 92% across all sites  
- **Features**: Site-specific drift, equipment drift (different MRI machines), demographic fairness monitoring  
- **Tech Stack**: DICOM integration, TensorFlow Extended (TFX), Kubeflow, HIPAA-compliant logging

### 8️⃣ **Energy Demand Forecasting Monitor**
- **Objective**: Detect consumption pattern shifts for grid load balancing  
- **Data**: Hourly consumption, weather, holidays, economic indicators (10 years history)  
- **Success Metric**: MAPE < 5% despite seasonal/COVID-19-like disruptions  
- **Features**: Seasonal decomposition drift, exogenous variable monitoring, forecast interval calibration  
- **Tech Stack**: Prophet, ARIMA, custom drift detection, Azure Time Series Insights

## 🎯 Key Takeaways

### What is ML Model Monitoring?
Continuous validation of deployed models to detect performance degradation, data drift, and concept drift in production environments.

### Why Monitor Models?
- **Prevent Silent Failures**: Models degrade as real-world data changes
- **Regulatory Compliance**: Banking, healthcare require documented model stability
- **Cost Savings**: Early detection prevents bad predictions affecting business
- **Trust**: Stakeholders need evidence models remain reliable over time

### Core Monitoring Metrics

| **Metric** | **Purpose** | **Threshold** | **When to Use** |
|-----------|-----------|--------------|----------------|
| **KS Test** | Detect feature distribution shifts | p < 0.05 | Continuous features; on large samples also check the KS statistic |
| **PSI** | Quantify drift magnitude | > 0.25 severe | Credit scoring, risk models |
| **JS Divergence** | Symmetric distribution distance | > 0.3 high | Research, multi-distribution comparison |
| **Model Accuracy** | Direct performance tracking | Domain-specific | Always, as primary metric |
| **Prediction Drift** | Output distribution changes | > 10% shift | Classification models |

### Implementation Checklist
- ✅ **Baseline**: Establish training data statistics (mean, std, quantiles)
- ✅ **Instrumentation**: Log all predictions + features in production
- ✅ **Cadence**: Run drift checks daily (batch) or per 1K predictions (streaming)
- ✅ **Alerting**: PagerDuty/Slack integration when thresholds exceeded
- ✅ **Visualization**: Grafana/Tableau dashboards for stakeholders
- ✅ **Retraining Pipeline**: Automated trigger when drift confirmed
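
The streaming cadence in the checklist (a check every 1K predictions) can be sketched with a bounded buffer. The class below is a minimal single-feature illustration. `WindowedDriftMonitor`, its parameters, and the combined p-value and KS-statistic rule are assumptions, not a reference implementation.

```python
from collections import deque
import numpy as np
from scipy.stats import ks_2samp

class WindowedDriftMonitor:
    """Buffer recent values of one feature and run a KS test every `check_every` records."""

    def __init__(self, reference, window=1000, check_every=1000, alpha=0.001, min_ks=0.1):
        self.reference = np.asarray(reference, dtype=float)
        self.buffer = deque(maxlen=window)  # keeps only the most recent `window` values
        self.check_every = check_every
        self.alpha = alpha
        self.min_ks = min_ks
        self.seen = 0

    def update(self, value):
        """Add one observation; return a result dict on check boundaries, else None."""
        self.buffer.append(float(value))
        self.seen += 1
        if self.seen % self.check_every != 0:
            return None
        result = ks_2samp(self.reference, np.fromiter(self.buffer, dtype=float))
        drift = bool(result.pvalue < self.alpha and result.statistic >= self.min_ks)
        return {"seen": self.seen, "ks": float(result.statistic), "drift": drift}

rng = np.random.default_rng(5)
monitor = WindowedDriftMonitor(reference=rng.normal(50, 8, 5000))

# Simulated Idd stream: 1K in-distribution records, then 1K records shifted by +5 mA.
stream = np.concatenate([rng.normal(50, 8, 1000), rng.normal(55, 8, 1000)])
checks = [r for r in (monitor.update(v) for v in stream) if r is not None]
print(checks)
```

A fixed-size `deque` keeps memory bounded regardless of traffic. In a real service the returned dict would be written to the metrics store and passed to the alerting layer.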

### Common Pitfalls
- ❌ **Monitoring Accuracy Only**: Feature drift happens before accuracy drops
- ❌ **Ignoring Ground-Truth Delay**: In post-silicon, test results come hours/days later, so accuracy-based alerts lag behind the drift
- ❌ **Threshold Overload**: Too many alerts → alert fatigue → ignoring real issues
- ❌ **Ignoring Segments**: Overall metrics stable but specific segments (e.g., new product variants) degrading
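
The segment pitfall is easy to reproduce. In the sketch below only a hypothetical product variant "B", which makes up 10% of traffic, has a shifted Idd distribution. The overall KS statistic stays small while the per-segment statistic is large. The variant names, shares, and distributions are simulated assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(9)

def make_batch(n_a, n_b, mean_b):
    """Simulated Idd readings for two hypothetical product variants, A and B."""
    return pd.DataFrame({
        "variant": ["A"] * n_a + ["B"] * n_b,
        "Idd": np.concatenate([rng.normal(150, 25, n_a), rng.normal(mean_b, 25, n_b)]),
    })

reference = make_batch(4500, 500, mean_b=150)
current = make_batch(900, 100, mean_b=175)  # only variant B has shifted

overall_ks = float(ks_2samp(reference["Idd"], current["Idd"]).statistic)
segment_ks = {
    name: float(ks_2samp(reference.loc[reference["variant"] == name, "Idd"], group["Idd"]).statistic)
    for name, group in current.groupby("variant")
}
print(f"overall KS = {overall_ks:.3f}")
print({name: round(value, 3) for name, value in segment_ks.items()})
```

Running the same drift metric per segment costs little and surfaces shifts that an aggregate view averages away.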

### Post-Silicon Specifics
- **Spatial Drift**: Wafer edge vs center devices behave differently over time
- **Equipment Drift**: Tester calibration changes → feature distribution shifts
- **Process Nodes**: Models trained on 7nm don't transfer to 5nm without monitoring
- **Yield Prediction**: Monitor correlation between parametric tests and final yield weekly

### When to Retrain
1. **PSI > 0.25** on critical features (Vdd, frequency)
2. **Accuracy drop > 5%** sustained over 1 week
3. **New failure mode detected** (concept drift confirmed)
4. **Scheduled**: Every 3 months even without drift (best practice)
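
The four rules above can be combined into one decision function. This sketch makes several assumptions: "sustained over 1 week" is read as every one of the last seven daily values exceeding the threshold, "3 months" is approximated as 90 days, and `MonitoringSnapshot` and `retrain_reasons` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class MonitoringSnapshot:
    psi_by_feature: dict        # feature name -> latest PSI
    daily_accuracy_drop: list   # baseline minus observed accuracy, one entry per day
    new_failure_mode: bool      # set once concept drift has been confirmed
    last_trained: date

CRITICAL_FEATURES = ("Vdd", "Frequency")  # assumed to match the PSI dictionary keys

def retrain_reasons(snapshot: MonitoringSnapshot, today: date) -> list:
    """Return the names of triggered rules; an empty list means no retrain is needed."""
    reasons = []
    if any(snapshot.psi_by_feature.get(f, 0.0) > 0.25 for f in CRITICAL_FEATURES):
        reasons.append("psi_critical_feature")
    last_week = snapshot.daily_accuracy_drop[-7:]
    if len(last_week) == 7 and min(last_week) > 0.05:
        reasons.append("sustained_accuracy_drop")
    if snapshot.new_failure_mode:
        reasons.append("concept_drift_confirmed")
    if today - snapshot.last_trained > timedelta(days=90):
        reasons.append("scheduled_refresh")
    return reasons

snapshot = MonitoringSnapshot(
    psi_by_feature={"Vdd": 0.31, "Frequency": 0.08},
    daily_accuracy_drop=[0.02, 0.06, 0.07, 0.06, 0.08, 0.07, 0.06],
    new_failure_mode=False,
    last_trained=date(2024, 1, 1),
)
print(retrain_reasons(snapshot, today=date(2024, 2, 15)))
# prints ['psi_critical_feature']
```

Returning the list of reasons rather than a bare boolean leaves an audit trail of why each retrain was triggered.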

### Tool Ecosystem
- **Open Source**: Evidently AI, Alibi Detect, NannyML, WhyLogs
- **Cloud**: AWS SageMaker Model Monitor, Azure ML Monitoring, Vertex AI Model Monitoring
- **Observability**: Arize AI, Fiddler AI, Arthur, Aporia

### Next Steps
- **Notebook 108**: Feature Stores (versioning features for drift comparison)
- **Notebook 109**: ML Pipelines (orchestrating monitoring + retraining)
- **Advanced**: Multi-armed bandits for online model selection under drift

---

**Remember**: *The best model is the one that stays relevant. Monitor or perish!* 🚨