# 123: Model Monitoring & Drift Detection

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** drift types: Data drift, concept drift, prediction drift
- **Detect** data drift using statistical tests (KS, PSI, Chi-square)
- **Monitor** model performance degradation in production
- **Implement** alerting systems for drift detection
- **Apply** monitoring to post-silicon validation models
- **Build** comprehensive monitoring dashboards

## 📚 What is Model Monitoring?

**Model monitoring** is the continuous observation of ML models in production to detect:
- **Performance degradation**: Accuracy drops from 92% to 78%
- **Data drift**: Input distributions change (e.g., voltage range shifts)
- **Concept drift**: Relationship between features and target changes
- **Prediction drift**: Output distribution changes unexpectedly

**Why Models Fail Silently:**
- ✅ **Data changes**: New device types, process node changes, equipment drift
- ✅ **Seasonal patterns**: Holiday effects, temperature variations
- ✅ **Adversarial shifts**: Gaming the system, evolving fraud tactics
- ✅ **Infrastructure issues**: Feature pipeline bugs, data quality problems

**Without monitoring**: A model can serve bad predictions for weeks or months before anyone notices.

## 🏭 Post-Silicon Validation Use Cases

**Use Case 1: Yield Predictor Drift Detection**
- **Scenario**: Yield prediction model trained on 1.2V ± 0.05V, but new lot uses 1.25V ± 0.03V
- **Detection**: PSI on Vdd distribution = 0.28 (major drift, threshold 0.2)
- **Impact**: Accuracy drops from 92% to 84% without retraining
- **Alert**: "⚠️ Vdd drift detected (PSI=0.28). Model retrain recommended."
- **Value**: Detect drift within 24 hours instead of discovering after 2 weeks of bad predictions

**Use Case 2: Test Time Optimizer Performance Monitoring**
- **Scenario**: Model predicts which tests to skip (25% time reduction, <0.5% FNR)
- **Monitoring**: Track false negative rate daily, alert if >0.5% for 3 consecutive days
- **Drift**: New test program version changes test sequence → concept drift
- **Detection**: FNR spikes to 1.2% on day 1 of new program
- **Response**: Instant rollback to previous model, retrain on new test program data
- **Value**: Prevent $50K in escapes, maintain quality standards

**Use Case 3: Wafer Map Anomaly Detector Monitoring**
- **Scenario**: Spatial anomaly detection for equipment failures
- **Metrics**: Anomaly detection rate, false positive rate, spatial pattern distribution
- **Drift**: New lithography tool introduced → different spatial signatures
- **Detection**: Anomaly rate drops from 5% to 1% (model missing new patterns)
- **Alert**: "🚨 Anomaly rate dropped unexpectedly. Investigate equipment changes."
- **Value**: Identify equipment issues 6 hours earlier, $1.5M avoidance

**Use Case 4: Device Binning Classifier Monitoring**
- **Scenario**: Multi-class binning (Premium/Standard/Economy at 60/30/10 split)
- **Monitoring**: Track bin distribution daily, alert if shifts >5%
- **Drift**: Premium bin % drops to 50% (from 60%)
- **Root cause**: New process step affects performance parameters
- **Response**: Investigate process change, retrain with new data, update bin thresholds
- **Value**: Revenue optimization, prevent $800K in mis-binned devices

## 🔄 Model Monitoring Workflow

```mermaid
graph TB
    A[Production Model] --> B[Collect Predictions]
    A --> C[Collect Input Features]
    A --> D[Collect Ground Truth]
    
    B --> E[Prediction Drift Analysis]
    C --> F[Data Drift Detection]
    D --> G[Performance Monitoring]
    
    F --> H{Drift Detected?}
    G --> I{Performance Drop?}
    E --> J{Distribution Shift?}
    
    H -->|Yes| K[Alert Data Science Team]
    I -->|Yes| K
    J -->|Yes| K
    
    K --> L{Severity?}
    L -->|Critical| M[Immediate Rollback]
    L -->|High| N[Retrain Model]
    L -->|Medium| O[Investigate Root Cause]
    
    M --> P[Deploy Previous Version]
    N --> Q[Retrain with Recent Data]
    O --> R[Monitor Closely]
    
    H -->|No| S[Continue Monitoring]
    I -->|No| S
    J -->|No| S
    
    style A fill:#e1f5ff
    style K fill:#ffe1e1
    style M fill:#ff9999
    style S fill:#e1ffe1
```

## 📊 Learning Path Context

**Prerequisites:**
- **121_MLOps_Fundamentals.ipynb** - MLOps lifecycle, deployment
- **122_MLflow_Complete_Guide.ipynb** - Experiment tracking, model registry
- **041_Model_Evaluation_Metrics.ipynb** - Accuracy, F1, AUC metrics

**Next Steps:**
- **124_Feature_Store_Implementation.ipynb** - Centralized feature management
- **125_ML_Pipeline_Orchestration.ipynb** - Automated retraining pipelines
- **131_Docker_Fundamentals.ipynb** - Containerized monitoring services

---

Let's master model monitoring and drift detection! 🚀

In [None]:
# Install monitoring libraries
# !pip install evidently alibi-detect scipy scikit-learn pandas numpy matplotlib seaborn

import pandas as pd
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("Model monitoring libraries loaded")
print("Focus: Data drift, concept drift, model performance degradation")

## 2. Data Drift Detection

**Data drift** occurs when input feature distributions change over time.

**Related shift types:**
- **Covariate shift**: P(X) changes, but P(Y|X) stays the same
- **Prior probability shift**: P(Y) changes
- **Concept drift**: P(Y|X) changes (covered in Section 3)

**Why it matters**: A model trained on the old distribution can perform poorly on the new one.

**Post-Silicon Example**: Vdd trained range 1.2V±0.05, production shifts to 1.25V±0.03 → model unreliable.

In [None]:
# Generate reference (training) and current (production) data
np.random.seed(42)

# Reference data (what model was trained on)
n_ref = 5000
reference_data = pd.DataFrame({
    'Vdd_V': np.random.normal(1.2, 0.05, n_ref),
    'Idd_mA': np.random.normal(50, 5, n_ref),
    'freq_MHz': np.random.normal(1000, 50, n_ref),
    'temp_C': np.random.normal(25, 5, n_ref)
})

# Current production data (simulating drift)
n_curr = 1000

# Scenario 1: NO DRIFT (distribution same as training)
current_no_drift = pd.DataFrame({
    'Vdd_V': np.random.normal(1.2, 0.05, n_curr),
    'Idd_mA': np.random.normal(50, 5, n_curr),
    'freq_MHz': np.random.normal(1000, 50, n_curr),
    'temp_C': np.random.normal(25, 5, n_curr)
})

# Scenario 2: DRIFT (Vdd shifted, temp increased variance)
current_with_drift = pd.DataFrame({
    'Vdd_V': np.random.normal(1.25, 0.03, n_curr),  # Mean shifted!
    'Idd_mA': np.random.normal(50, 5, n_curr),
    'freq_MHz': np.random.normal(1000, 50, n_curr),
    'temp_C': np.random.normal(27, 8, n_curr)  # Mean + variance changed!
})

print("Reference data (training):")
print(reference_data.describe())
print("\nCurrent data (NO drift):")
print(current_no_drift.describe())
print("\nCurrent data (WITH drift):")
print(current_with_drift.describe())

### A. Kolmogorov-Smirnov Test (Continuous Features)

The **KS test** compares two samples and tests whether they come from the same underlying distribution.

**How it works:**
- Compares empirical CDFs (cumulative distribution functions)
- **Statistic**: Maximum distance between CDFs (0 to 1)
- **P-value**: Probability of seeing a distance at least this large if both samples came from the same distribution
- **Decision**: If p < 0.05, reject the null hypothesis → **drift detected**

**When to use**: Continuous numerical features (Vdd, Idd, freq, temp)

In [None]:
# Kolmogorov-Smirnov test for each feature
from scipy.stats import ks_2samp

def ks_drift_test(reference, current, feature_name, alpha=0.05):
    """Perform KS test for drift detection"""
    statistic, pvalue = ks_2samp(reference, current)
    
    drift_detected = pvalue < alpha
    
    result = {
        'feature': feature_name,
        'ks_statistic': statistic,
        'p_value': pvalue,
        'drift_detected': drift_detected,
        'severity': 'HIGH' if statistic > 0.2 else 'MEDIUM' if statistic > 0.1 else 'LOW'
    }
    
    return result

# Test on NO DRIFT scenario
print("=== KS Test: NO DRIFT Scenario ===")
for col in reference_data.columns:
    result = ks_drift_test(reference_data[col], current_no_drift[col], col)
    status = "🚨 DRIFT" if result['drift_detected'] else "✅ NO DRIFT"
    print(f"{col:12s}: KS={result['ks_statistic']:.4f}, p={result['p_value']:.4f} → {status}")

print("\n=== KS Test: WITH DRIFT Scenario ===")
drift_results = []
for col in reference_data.columns:
    result = ks_drift_test(reference_data[col], current_with_drift[col], col)
    drift_results.append(result)
    status = "🚨 DRIFT" if result['drift_detected'] else "✅ NO DRIFT"
    print(f"{col:12s}: KS={result['ks_statistic']:.4f}, p={result['p_value']:.4f} → {status} ({result['severity']})")

# Summary
drifted_features = [r['feature'] for r in drift_results if r['drift_detected']]
print(f"\n📊 Summary: {len(drifted_features)}/{len(drift_results)} features drifted")
if drifted_features:
    print(f"Drifted features: {', '.join(drifted_features)}")

### B. Population Stability Index (PSI)

**PSI** measures distribution shift using binned percentages.

**Formula:**
$$PSI = \sum_{i=1}^{n} (P_{current,i} - P_{reference,i}) \times \ln\left(\frac{P_{current,i}}{P_{reference,i}}\right)$$

**Interpretation:**
- **PSI < 0.1**: No significant change ✅
- **PSI 0.1-0.2**: Minor drift, monitor closely ⚠️
- **PSI > 0.2**: Major drift, retrain recommended 🚨

**Advantages**: Industry standard (banking, credit scoring), intuitive thresholds
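
As a quick sanity check of the formula, the sketch below computes PSI by hand for a made-up three-bin example. The bin shares are illustrative assumptions, not data from this notebook.

```python
import numpy as np

# Hypothetical bin shares (each vector sums to 1); not taken from real data
p_reference = np.array([0.50, 0.30, 0.20])
p_current = np.array([0.40, 0.30, 0.30])

# Per-bin contribution: (current - reference) * ln(current / reference)
contributions = (p_current - p_reference) * np.log(p_current / p_reference)
psi_example = contributions.sum()

print(contributions.round(4))
print(f"PSI = {psi_example:.4f}")
```

Every bin contributes a non-negative amount, because the difference and the log ratio always share a sign. Here the total is roughly 0.063, which falls under the 0.1 "no significant change" threshold.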

In [None]:
# Population Stability Index implementation
def calculate_psi(reference, current, bins=10):
    """Calculate PSI for drift detection"""
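    # Create bins from reference data
    _, bin_edges = np.histogram(reference, bins=bins)
    
    # Clip current values into the reference range so out-of-range samples
    # land in the outermost bins instead of being silently dropped
    current = np.clip(current, bin_edges[0], bin_edges[-1])
    
    # Count samples in each bin
    ref_counts, _ = np.histogram(reference, bins=bin_edges)
    curr_counts, _ = np.histogram(current, bins=bin_edges)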
    
    # Convert to percentages
    ref_pct = ref_counts / len(reference)
    curr_pct = curr_counts / len(current)
    
    # Avoid division by zero (add small epsilon)
    ref_pct = np.where(ref_pct == 0, 0.0001, ref_pct)
    curr_pct = np.where(curr_pct == 0, 0.0001, curr_pct)
    
    # Calculate PSI
    psi = np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct))
    
    # Interpretation
    if psi < 0.1:
        status = "✅ NO CHANGE"
        severity = "LOW"
    elif psi < 0.2:
        status = "⚠️ MINOR DRIFT"
        severity = "MEDIUM"
    else:
        status = "🚨 MAJOR DRIFT"
        severity = "HIGH"
    
    return {
        'psi': psi,
        'status': status,
        'severity': severity,
        'recommendation': 'Continue monitoring' if psi < 0.1 else 'Monitor closely' if psi < 0.2 else 'Retrain model'
    }

# Test PSI on NO DRIFT scenario
print("=== PSI Test: NO DRIFT Scenario ===")
for col in reference_data.columns:
    result = calculate_psi(reference_data[col], current_no_drift[col])
    print(f"{col:12s}: PSI={result['psi']:.4f} → {result['status']} ({result['recommendation']})")

# Test PSI on WITH DRIFT scenario
print("\n=== PSI Test: WITH DRIFT Scenario ===")
psi_results = []
for col in reference_data.columns:
    result = calculate_psi(reference_data[col], current_with_drift[col])
    result['feature'] = col
    psi_results.append(result)
    print(f"{col:12s}: PSI={result['psi']:.4f} → {result['status']} ({result['recommendation']})")

# Identify critical features
critical_features = [r['feature'] for r in psi_results if r['psi'] > 0.2]
if critical_features:
    print(f"\n🚨 ALERT: Critical drift detected in: {', '.join(critical_features)}")
    print(f"Action required: Model retrain recommended")

### C. Visualizing Drift

**Visual drift detection** helps stakeholders understand distribution changes.

In [None]:
# Visualize drift for Vdd and temp
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Vdd - NO DRIFT
axes[0, 0].hist(reference_data['Vdd_V'], bins=30, alpha=0.5, label='Reference (Training)', color='blue', density=True)
axes[0, 0].hist(current_no_drift['Vdd_V'], bins=30, alpha=0.5, label='Current (No Drift)', color='green', density=True)
axes[0, 0].set_title('Vdd Distribution - NO DRIFT')
axes[0, 0].set_xlabel('Vdd (V)')
axes[0, 0].set_ylabel('Density')
axes[0, 0].legend()
axes[0, 0].axvline(reference_data['Vdd_V'].mean(), color='blue', linestyle='--', label='Ref mean')
axes[0, 0].axvline(current_no_drift['Vdd_V'].mean(), color='green', linestyle='--', label='Curr mean')

# Vdd - WITH DRIFT
axes[0, 1].hist(reference_data['Vdd_V'], bins=30, alpha=0.5, label='Reference (Training)', color='blue', density=True)
axes[0, 1].hist(current_with_drift['Vdd_V'], bins=30, alpha=0.5, label='Current (Drifted)', color='red', density=True)
axes[0, 1].set_title('Vdd Distribution - WITH DRIFT (Mean shifted)')
axes[0, 1].set_xlabel('Vdd (V)')
axes[0, 1].set_ylabel('Density')
axes[0, 1].legend()
vdd_psi = calculate_psi(reference_data['Vdd_V'], current_with_drift['Vdd_V'])
axes[0, 1].text(0.05, 0.95, f"PSI={vdd_psi['psi']:.3f}\n{vdd_psi['status']}", 
                transform=axes[0, 1].transAxes, fontsize=12, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))

# Temp - NO DRIFT
axes[1, 0].hist(reference_data['temp_C'], bins=30, alpha=0.5, label='Reference (Training)', color='blue', density=True)
axes[1, 0].hist(current_no_drift['temp_C'], bins=30, alpha=0.5, label='Current (No Drift)', color='green', density=True)
axes[1, 0].set_title('Temperature Distribution - NO DRIFT')
axes[1, 0].set_xlabel('Temperature (°C)')
axes[1, 0].set_ylabel('Density')
axes[1, 0].legend()

# Temp - WITH DRIFT
axes[1, 1].hist(reference_data['temp_C'], bins=30, alpha=0.5, label='Reference (Training)', color='blue', density=True)
axes[1, 1].hist(current_with_drift['temp_C'], bins=30, alpha=0.5, label='Current (Drifted)', color='red', density=True)
axes[1, 1].set_title('Temperature Distribution - WITH DRIFT (Variance increased)')
axes[1, 1].set_xlabel('Temperature (°C)')
axes[1, 1].set_ylabel('Density')
axes[1, 1].legend()
temp_psi = calculate_psi(reference_data['temp_C'], current_with_drift['temp_C'])
axes[1, 1].text(0.05, 0.95, f"PSI={temp_psi['psi']:.3f}\n{temp_psi['status']}", 
                transform=axes[1, 1].transAxes, fontsize=12, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))

plt.tight_layout()
plt.show()

print("Visual inspection confirms drift detection:")
print(f"- Vdd: Mean shift from {reference_data['Vdd_V'].mean():.3f}V to {current_with_drift['Vdd_V'].mean():.3f}V")
print(f"- Temp: Std increased from {reference_data['temp_C'].std():.2f}°C to {current_with_drift['temp_C'].std():.2f}°C")

## 3. Concept Drift Detection

**Concept drift** occurs when the relationship between features (X) and target (Y) changes.

**Example**: 
- Training: Vdd=1.2V + Idd<55mA → 95% yield
- Production: Vdd=1.2V + Idd<55mA → 85% yield (process changed!)

**Types:**
- **Sudden drift**: Abrupt change (equipment replacement, new process step)
- **Gradual drift**: Slow change over time (equipment degradation)
- **Recurring drift**: Seasonal patterns (temperature effects)

**Detection challenge**: Requires ground truth labels (actual outcomes)

In [None]:
# Simulate concept drift scenario
np.random.seed(42)

# Original relationship (training data)
n_train = 5000
X_train = pd.DataFrame({
    'Vdd_V': np.random.normal(1.2, 0.05, n_train),
    'Idd_mA': np.random.normal(50, 5, n_train),
    'freq_MHz': np.random.normal(1000, 50, n_train),
    'temp_C': np.random.normal(25, 5, n_train)
})

# Original yield relationship
y_train = (
    (X_train['Vdd_V'] >= 1.15) & (X_train['Vdd_V'] <= 1.25) &
    (X_train['Idd_mA'] <= 55) &
    (X_train['freq_MHz'] >= 950) &
    (X_train['temp_C'] <= 30)
).astype(int)

# Train model on original relationship
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)

# Production data - simulate concept drift
# Features same distribution, but relationship changed!
n_prod = 2000
X_production = pd.DataFrame({
    'Vdd_V': np.random.normal(1.2, 0.05, n_prod),  # Same distribution
    'Idd_mA': np.random.normal(50, 5, n_prod),
    'freq_MHz': np.random.normal(1000, 50, n_prod),
    'temp_C': np.random.normal(25, 5, n_prod)
})

# NEW yield relationship (process changed, tighter specs!)
y_production_actual = (
    (X_production['Vdd_V'] >= 1.18) & (X_production['Vdd_V'] <= 1.22) &  # Tighter!
    (X_production['Idd_mA'] <= 52) &  # Lower threshold!
    (X_production['freq_MHz'] >= 980) &  # Higher threshold!
    (X_production['temp_C'] <= 28)  # Tighter!
).astype(int)

# Model predictions (using OLD relationship)
y_production_pred = model.predict(X_production)

# Compare performance
train_accuracy = model.score(X_train, y_train)
production_accuracy = accuracy_score(y_production_actual, y_production_pred)
production_f1 = f1_score(y_production_actual, y_production_pred)

print("=== Concept Drift Impact ===")
print(f"Training accuracy: {train_accuracy:.4f}")
print(f"Production accuracy: {production_accuracy:.4f} ⚠️ (dropped {(train_accuracy - production_accuracy)*100:.1f} percentage points)")
print(f"Production F1: {production_f1:.4f}")
print(f"\nProblem: Feature distributions unchanged, but X→Y relationship changed")
print(f"Root cause: Process specifications tightened")
print(f"Solution: Retrain model with recent production data")

## 4. Performance Monitoring Over Time

**Track model metrics continuously** to detect gradual degradation.

**What to monitor:**
- Accuracy, F1, Precision, Recall (requires ground truth)
- Prediction confidence distribution
- Prediction rate (predictions/day)
- Latency (p50, p95, p99)

**Post-Silicon example**: Track yield predictor F1 score daily, alert if drops below 0.88 for 3 consecutive days.
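
The latency percentiles in the list above are straightforward to compute from logged request times. The following is a minimal sketch, assuming latencies are logged in milliseconds; the simulated log-normal data and the 100 ms p99 budget are illustrative assumptions rather than values from a real service.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-request latencies in ms (assumption: real values come from serving logs)
latencies_ms = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)

# p50 / p95 / p99 as named in the monitoring checklist
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f} ms, p95={p95:.1f} ms, p99={p99:.1f} ms")

# Hypothetical tail-latency budget; alert when p99 exceeds it
P99_BUDGET_MS = 100.0
latency_alert = bool(p99 > P99_BUDGET_MS)
print("Latency alert:", latency_alert)
```

Tracking the tail (p95, p99) alongside the median matters because a slow feature pipeline often shows up in the slowest requests first.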

In [None]:
# Simulate 30 days of production monitoring
np.random.seed(42)

monitoring_log = []
for day in range(1, 31):
    # Simulate gradual concept drift (accuracy degrades over time)
    drift_factor = max(0, 1 - (day / 50))  # Gradual degradation
    
    # Generate daily production data
    n_daily = 500
    X_daily = pd.DataFrame({
        'Vdd_V': np.random.normal(1.2, 0.05, n_daily),
        'Idd_mA': np.random.normal(50, 5, n_daily),
        'freq_MHz': np.random.normal(1000, 50, n_daily),
        'temp_C': np.random.normal(25, 5, n_daily)
    })
    
    # Ground truth (with concept drift)
    y_true = (
        (X_daily['Vdd_V'] >= 1.15 + (1-drift_factor)*0.03) & 
        (X_daily['Vdd_V'] <= 1.25 - (1-drift_factor)*0.03) &
        (X_daily['Idd_mA'] <= 55 - (1-drift_factor)*3) &
        (X_daily['freq_MHz'] >= 950 + (1-drift_factor)*30) &
        (X_daily['temp_C'] <= 30 - (1-drift_factor)*2)
    ).astype(int)
    
    # Model predictions
    y_pred = model.predict(X_daily)
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    
    # Add noise to make it realistic (clipped so metrics stay within [0, 1])
    accuracy = float(np.clip(accuracy + np.random.normal(0, 0.01), 0, 1))
    f1 = float(np.clip(f1 + np.random.normal(0, 0.01), 0, 1))
    
    monitoring_log.append({
        'day': day,
        'accuracy': accuracy,
        'f1_score': f1,
        'predictions_count': n_daily
    })

monitoring_df = pd.DataFrame(monitoring_log)

# Plot performance over time
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Accuracy trend
axes[0].plot(monitoring_df['day'], monitoring_df['accuracy'], marker='o', label='Daily Accuracy')
axes[0].axhline(y=0.88, color='r', linestyle='--', label='Alert Threshold (0.88)')
axes[0].axhline(y=0.90, color='orange', linestyle='--', label='Warning Threshold (0.90)')
axes[0].fill_between(monitoring_df['day'], 0.88, 1.0, alpha=0.1, color='green', label='Safe Zone')
axes[0].fill_between(monitoring_df['day'], 0.0, 0.88, alpha=0.1, color='red', label='Alert Zone')
axes[0].set_xlabel('Day')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Model Accuracy Over Time - Gradual Degradation')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# F1 score trend
axes[1].plot(monitoring_df['day'], monitoring_df['f1_score'], marker='s', color='purple', label='Daily F1 Score')
axes[1].axhline(y=0.85, color='r', linestyle='--', label='Alert Threshold (0.85)')
axes[1].fill_between(monitoring_df['day'], 0.85, 1.0, alpha=0.1, color='green')
axes[1].fill_between(monitoring_df['day'], 0.0, 0.85, alpha=0.1, color='red')
axes[1].set_xlabel('Day')
axes[1].set_ylabel('F1 Score')
axes[1].set_title('Model F1 Score Over Time')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Alert logic
print("=== Performance Monitoring Alerts ===")
for _, row in monitoring_df.iterrows():
    day = int(row['day'])  # iterrows() upcasts the row to float, so cast back for display
    if row['accuracy'] < 0.88:
        print(f"Day {day}: 🚨 ALERT - Accuracy {row['accuracy']:.4f} < 0.88 threshold")
    elif row['accuracy'] < 0.90:
        print(f"Day {day}: ⚠️  WARNING - Accuracy {row['accuracy']:.4f} < 0.90 threshold")

# Check for sustained degradation
consecutive_low = 0
for acc in monitoring_df['accuracy']:
    if acc < 0.88:
        consecutive_low += 1
        if consecutive_low >= 3:
            print(f"\n🚨 CRITICAL: Accuracy below threshold for {consecutive_low} consecutive days")
            print("Action: Trigger automatic model retrain")
            break
    else:
        consecutive_low = 0

## 5. Complete Monitoring System

**Production-ready monitoring** combines drift detection + performance tracking + automated alerts.

In [None]:
# Complete monitoring class
class ModelMonitor:
    """Production model monitoring system"""
    
    def __init__(self, reference_data, model, alert_thresholds=None):
        self.reference_data = reference_data
        self.model = model
        self.thresholds = alert_thresholds or {
            'psi': 0.2,
            'ks_pvalue': 0.05,
            'accuracy': 0.88,
            'f1_score': 0.85
        }
        self.monitoring_history = []
    
    def check_data_drift(self, current_data):
        """Check for data drift using PSI and KS tests"""
        drift_report = {'features': {}, 'overall_status': 'OK'}
        
        for col in self.reference_data.columns:
            # PSI test
            psi_result = calculate_psi(self.reference_data[col], current_data[col])
            
            # KS test
            ks_stat, ks_pval = ks_2samp(self.reference_data[col], current_data[col])
            
            drift_detected = (psi_result['psi'] > self.thresholds['psi'] or 
                            ks_pval < self.thresholds['ks_pvalue'])
            
            drift_report['features'][col] = {
                'psi': psi_result['psi'],
                'ks_statistic': ks_stat,
                'ks_pvalue': ks_pval,
                'drift_detected': drift_detected,
                'severity': psi_result['severity']
            }
            
            if drift_detected:
                drift_report['overall_status'] = 'DRIFT_DETECTED'
        
        return drift_report
    
    def check_performance(self, X, y_true):
        """Check model performance"""
        y_pred = self.model.predict(X)
        
        accuracy = accuracy_score(y_true, y_pred)
        f1 = f1_score(y_true, y_pred)
        
        performance_alert = (accuracy < self.thresholds['accuracy'] or 
                           f1 < self.thresholds['f1_score'])
        
        return {
            'accuracy': accuracy,
            'f1_score': f1,
            'alert': performance_alert,
            'status': 'DEGRADED' if performance_alert else 'OK'
        }
    
    def monitor(self, current_data, y_true):
        """Run complete monitoring check"""
        # Check drift
        drift_report = self.check_data_drift(current_data)
        
        # Check performance
        performance = self.check_performance(current_data, y_true)
        
        # Combined report
        report = {
            'timestamp': datetime.now().isoformat(),
            'data_drift': drift_report,
            'performance': performance,
            'overall_health': 'CRITICAL' if (drift_report['overall_status'] == 'DRIFT_DETECTED' and 
                                             performance['status'] == 'DEGRADED') else
                             'WARNING' if (drift_report['overall_status'] == 'DRIFT_DETECTED' or 
                                          performance['status'] == 'DEGRADED') else 'HEALTHY'
        }
        
        self.monitoring_history.append(report)
        return report
    
    def generate_alert(self, report):
        """Generate human-readable alert"""
        if report['overall_health'] == 'HEALTHY':
            return "✅ Model healthy. No action required."
        
        alert_msg = []
        
        # Data drift alerts
        if report['data_drift']['overall_status'] == 'DRIFT_DETECTED':
            drifted = [f for f, r in report['data_drift']['features'].items() 
                      if r['drift_detected']]
            alert_msg.append(f"⚠️ DATA DRIFT: {', '.join(drifted)}")
            for feat in drifted:
                psi = report['data_drift']['features'][feat]['psi']
                alert_msg.append(f"  - {feat}: PSI={psi:.3f}")
        
        # Performance alerts
        if report['performance']['status'] == 'DEGRADED':
            acc = report['performance']['accuracy']
            f1 = report['performance']['f1_score']
            alert_msg.append(f"🚨 PERFORMANCE DEGRADATION:")
            alert_msg.append(f"  - Accuracy: {acc:.4f} (threshold: {self.thresholds['accuracy']})")
            alert_msg.append(f"  - F1 Score: {f1:.4f} (threshold: {self.thresholds['f1_score']})")
        
        # Recommendation
        if report['overall_health'] == 'CRITICAL':
            alert_msg.append("\n🔧 RECOMMENDED ACTION: Immediate model retrain")
        elif report['overall_health'] == 'WARNING':
            alert_msg.append("\n🔍 RECOMMENDED ACTION: Investigate root cause, plan retrain")
        
        return "\n".join(alert_msg)

# Test monitoring system
monitor = ModelMonitor(reference_data, model)

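# Day 1: No drift (ground truth follows the original yield relationship)
y_day1 = (
    (current_no_drift['Vdd_V'] >= 1.15) & (current_no_drift['Vdd_V'] <= 1.25) &
    (current_no_drift['Idd_mA'] <= 55) &
    (current_no_drift['freq_MHz'] >= 950) &
    (current_no_drift['temp_C'] <= 30)
).astype(int)
report_day1 = monitor.monitor(current_no_drift, y_day1)
print("=== Day 1 Monitoring Report ===")
print(monitor.generate_alert(report_day1))

# Day 15: With drift (labels derived from the drifted rows themselves,
# using the tightened yield relationship from Section 3)
y_day15 = (
    (current_with_drift['Vdd_V'] >= 1.18) & (current_with_drift['Vdd_V'] <= 1.22) &
    (current_with_drift['Idd_mA'] <= 52) &
    (current_with_drift['freq_MHz'] >= 980) &
    (current_with_drift['temp_C'] <= 28)
).astype(int)
print("\n=== Day 15 Monitoring Report ===")
report_day15 = monitor.monitor(current_with_drift, y_day15)
print(monitor.generate_alert(report_day15))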

print(f"\nOverall health: {report_day15['overall_health']}")

## 🎯 Real-World Monitoring Projects

### **Post-Silicon Validation Projects**

#### **Project 1: Yield Predictor Continuous Monitoring**
**Objective**: Deploy comprehensive monitoring for production yield prediction model
- **Metrics**: Accuracy, F1, AUC (daily, with 24-hour ground truth delay)
- **Data drift**: PSI on Vdd, Idd, freq, temp (hourly checks)
- **Alerts**: Slack notification if PSI > 0.2 or accuracy < 0.90 for 2 consecutive days
- **Dashboard**: Grafana dashboard showing 30-day trend, drift heatmap, alert history
- **Auto-retrain**: Trigger retrain job if critical drift detected for 3 days
- **Success**: Detect process changes within 6 hours vs 2 weeks manual detection

#### **Project 2: Test Time Optimizer Monitoring with FNR Tracking**
**Objective**: Monitor test time reduction model, ensure false negative rate < 0.5%
- **Primary metric**: False negative rate (escapes to customer)
- **Secondary metrics**: Test time savings %, throughput (devices/hour)
- **Drift detection**: Chi-square test on test_sequence distribution (categorical drift)
- **Alert thresholds**: FNR > 0.5% → instant rollback, FNR 0.4-0.5% → warning
- **Root cause analysis**: Log test program version changes, equipment IDs
- **Business value**: Prevent $100K in customer escapes, maintain quality standards
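
The tiered FNR thresholds above could be wired up roughly as follows. This is a sketch only: `false_negative_rate` and `fnr_action` are hypothetical helper names, and treating label 1 as "failing device" is an assumed convention.

```python
import numpy as np

def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP), with 1 = failing device (assumed label convention)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return fn / (fn + tp) if (fn + tp) > 0 else 0.0

def fnr_action(fnr, rollback_at=0.005, warn_at=0.004):
    """Map an FNR value to the tiers listed above (0.5% rollback, 0.4% warning)."""
    if fnr > rollback_at:
        return "ROLLBACK"
    if fnr >= warn_at:
        return "WARNING"
    return "OK"

# Toy example: 1000 failing devices, 6 of them missed by the model
y_true = np.ones(1000, dtype=int)
y_pred = np.ones(1000, dtype=int)
y_pred[:6] = 0

fnr = false_negative_rate(y_true, y_pred)
print(f"FNR={fnr:.3%} -> {fnr_action(fnr)}")
```

In the toy data 6 of 1000 failing devices are missed, so the FNR is 0.6% and the rollback tier fires.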

#### **Project 3: Wafer Map Anomaly Detector Monitoring**
**Objective**: Monitor spatial anomaly detection for equipment failures
- **Metrics**: Anomaly detection rate, false positive rate, spatial coverage
- **Data drift**: Track wafer coordinate distributions, spatial autocorrelation changes
- **Concept drift**: Monitor equipment IDs, lithography tool changes
- **Visualization**: Daily wafer map gallery (detected anomalies highlighted)
- **Alerts**: Anomaly rate drops >50% (model missing new patterns) OR spikes >200% (false positives)
- **Integration**: Trigger equipment maintenance alerts based on spatial patterns
- **Value**: Identify equipment issues 4 hours earlier, $2M+ avoidance/year

#### **Project 4: Device Binning Drift Dashboard**
**Objective**: Real-time monitoring of binning distribution for revenue optimization
- **Target distribution**: Premium(60%) / Standard(30%) / Economy(10%)
- **Drift detection**: Chi-square test on bin distribution (daily)
- **Performance**: Track binning accuracy, revenue per wafer
- **Alerts**: Premium bin % drops below 55% or above 65% (investigate process changes)
- **Root cause**: Correlate bin shifts with fab events (process changes, equipment)
- **Dashboard**: Streamlit app showing bin trend, revenue impact, alert log
- **Value**: $800K revenue optimization, proactive process optimization
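
A chi-square goodness-of-fit test against the 60/30/10 target is one way to implement the daily bin-distribution check. The sketch below uses `scipy.stats.chisquare`; the daily counts are made-up illustrative numbers, and the 0.05 significance level is an assumption carried over from the KS section.

```python
import numpy as np
from scipy.stats import chisquare

# Target bin shares from the project description: Premium / Standard / Economy
target_shares = np.array([0.60, 0.30, 0.10])

# Hypothetical daily bin counts (illustrative numbers, not real data)
observed_counts = np.array([500, 380, 120])

# Expected counts under the target split; they must sum to the observed total
expected_counts = target_shares * observed_counts.sum()

chi2_stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
bin_drift = bool(p_value < 0.05)

# Range check from the alert rule: Premium share outside 55%-65%
premium_share = observed_counts[0] / observed_counts.sum()
premium_alert = not (0.55 <= premium_share <= 0.65)

print(f"chi2={chi2_stat:.2f}, p={p_value:.2e}, drift={bin_drift}, premium_alert={premium_alert}")
```

The chi-square test looks at the whole distribution, while the range check targets the Premium bin specifically; running both catches shifts that leave Premium intact but move volume between Standard and Economy.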

---

### **General AI/ML Projects**

#### **Project 5: Customer Churn Predictor Monitoring**
**Objective**: Monitor churn prediction model for concept drift (customer behavior changes)
- **Metrics**: Precision, recall, F1 (monthly with subscription renewal data)
- **Data drift**: PSI on usage_minutes, support_tickets, payment_history
- **Concept drift**: Monitor churn rate over time (sudden spikes indicate drift)
- **Seasonal patterns**: Track holiday effects, quarterly business cycles
- **Alerts**: Precision < 0.75 (too many false positives, wasted retention offers)
- **Auto-retrain**: Quarterly retrain with last 12 months of data
- **Business value**: Reduce churn by 18%, optimize retention campaign spend

#### **Project 6: Fraud Detection Real-Time Monitoring**
**Objective**: Monitor fraud detection model with sub-minute alerting
- **Metrics**: Precision (share of flagged transactions that are truly fraud), recall (fraud catch rate)
- **Data drift**: Track transaction_amount, merchant_category, location distributions
- **Concept drift**: Fraudsters adapt tactics → relationship between features and fraud changes
- **Real-time**: Stream monitoring with 1-minute aggregation windows
- **Alerts**: FPR > 2% (customer friction) → instant alert, recall < 85% (missing fraud) → critical
- **A/B testing**: Shadow mode for new models (compare with production)
- **Value**: Block $4M fraud annually, maintain <1% false positive rate

#### **Project 7: Recommendation System Performance Tracking**
**Objective**: Monitor recommendation CTR and engagement metrics
- **Online metrics**: Click-through rate, conversion rate, time-on-site
- **Offline metrics**: Coverage, diversity, novelty (prevent filter bubble)
- **Data drift**: User preference shifts (genre popularity, seasonal trends)
- **Concept drift**: New content categories, changing user behavior
- **Prediction drift**: Track recommendation distribution (diversity vs popularity)
- **Alerts**: CTR drops >10% from baseline, diversity score < 0.3
- **Success**: 22% CTR improvement, balanced diversity/relevance

#### **Project 8: Demand Forecasting Monitoring Dashboard**
**Objective**: Track forecast accuracy over time with seasonal adjustments
- **Metrics**: MAPE, MAE, RMSE (daily evaluation with next-day actuals)
- **Data drift**: Sales volume distribution, promotion frequency, market trends
- **Concept drift**: COVID effects, supply chain disruptions, competitor actions
- **Seasonal patterns**: Weekly, monthly, quarterly cycles
- **Visualization**: Forecast vs actual plots, error distribution, drift heatmap
- **Alerts**: MAPE > 15% for 7 consecutive days → retrain trigger
- **Auto-retrain**: Weekly retrain with rolling window (last 365 days)
- **Business value**: Reduce inventory costs by 30%, improve forecast to MAPE < 9%
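
For reference, the three forecast metrics named above can be computed as in this small sketch. The helper name `forecast_errors` and the numbers are illustrative assumptions, and MAPE is only defined when no actual value is zero.

```python
import math


def forecast_errors(actual, forecast):
    """MAE, RMSE and MAPE (in percent) for paired actual/forecast values.

    Assumes no actual value is zero, since MAPE divides by the actuals.
    """
    errors = [f - a for a, f in zip(actual, forecast)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"mae": mae, "rmse": rmse, "mape": mape}


# Illustrative numbers only
actual = [100, 200, 400]
forecast = [110, 180, 400]
metrics = forecast_errors(actual, forecast)
print({k: round(v, 2) for k, v in metrics.items()})
# {'mae': 10.0, 'rmse': 12.91, 'mape': 6.67}
```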

## 📚 Comprehensive Takeaways

### **🎯 Why Model Monitoring Matters**

**Silent failures**: Production models degrade without anyone noticing until the business impact is severe.

**Real-world example (post-silicon):**
- Yield predictor trained on 1.2V±0.05V process
- New lot uses 1.25V±0.03V (process change)
- Model accuracy drops from 92% to 78%
- **Without monitoring**: Discovered after 2 weeks, 5000 mis-predicted devices
- **With monitoring**: Detected in 4 hours, automated retrain triggered, $200K saved

**Types of degradation:**
1. **Data drift**: Input distributions change
2. **Concept drift**: X→Y relationship changes
3. **Prediction drift**: Output distribution shifts
4. **Performance degradation**: Metrics decline over time

---

### **🔧 Drift Detection Methods**

#### **1. Kolmogorov-Smirnov (KS) Test**

**For**: Continuous numerical features

**How it works:**
- Compares empirical CDFs of reference vs current data
- **Statistic**: Max distance between CDFs (0 to 1)
- **P-value**: Probability of seeing a CDF gap at least this large if both samples came from the same distribution
- **Decision**: p < 0.05 → reject null → **drift detected**

**Code:**
```python
from scipy.stats import ks_2samp
statistic, pvalue = ks_2samp(reference_data, current_data)
drift = pvalue < 0.05
```

**Pros:**
- ✅ Non-parametric (no distribution assumptions)
- ✅ Sensitive to both location and shape changes
- ✅ Statistical rigor (p-value)

**Cons:**
- ❌ Requires sufficient sample size (>100 recommended)
- ❌ May be overly sensitive with large samples

**When to use**: Continuous features like Vdd, temperature, voltage, current
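
One way to handle the large-sample sensitivity noted above is to gate on the KS statistic (the size of the CDF gap) as well as the p-value. The sketch below uses synthetic Vdd data; the helper name `ks_drift` and the `min_effect` floor of 0.1 are assumptions chosen for illustration, not a standard.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_drift(reference, current, alpha=0.05, min_effect=0.1):
    """Flag drift only when the KS test is significant AND the CDF gap is large enough.

    min_effect is a hypothetical effect-size floor chosen for illustration.
    """
    statistic, pvalue = ks_2samp(reference, current)
    return {
        "statistic": float(statistic),
        "pvalue": float(pvalue),
        "drift": bool(pvalue < alpha and statistic >= min_effect),
    }


rng = np.random.default_rng(0)
reference = rng.normal(1.20, 0.05, size=100_000)    # synthetic Vdd samples (volts)
tiny_shift = rng.normal(1.202, 0.05, size=100_000)  # 2 mV shift, practically irrelevant
big_shift = rng.normal(1.25, 0.03, size=100_000)    # the process change from the example

print(ks_drift(reference, tiny_shift))  # small statistic, so no drift flag
print(ks_drift(reference, big_shift))   # large statistic, drift flagged
```

With this many samples the 2 mV shift will typically produce a small p-value on its own, which is exactly the false alarm the effect-size gate is meant to suppress.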

---

#### **2. Population Stability Index (PSI)**

**For**: Any numerical feature (continuous values are binned into categories)

**Formula:**
$$PSI = \sum_{i=1}^{n} (P_{current,i} - P_{reference,i}) \times \ln\left(\frac{P_{current,i}}{P_{reference,i}}\right)$$

**Interpretation thresholds:**
- **PSI < 0.1**: No change ✅
- **PSI 0.1-0.2**: Minor drift, monitor ⚠️
- **PSI > 0.2**: Major drift, retrain 🚨

**Code:**
```python
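import numpy as np

def calculate_psi(reference, current, bins=10):
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Bin edges come from the reference sample
    ref_hist, bin_edges = np.histogram(reference, bins=bins)
    # Clip current values into the reference range so out-of-range points
    # land in the outermost bins instead of being silently dropped
    current = np.clip(current, bin_edges[0], bin_edges[-1])
    curr_hist, _ = np.histogram(current, bins=bin_edges)

    ref_pct = ref_hist / len(reference)
    curr_pct = curr_hist / len(current)

    # Avoid log(0) and division by zero for empty bins
    ref_pct = np.where(ref_pct == 0, 0.0001, ref_pct)
    curr_pct = np.where(curr_pct == 0, 0.0001, curr_pct)

    psi = np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct))
    return psi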
```

**Pros:**
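- ✅ Industry standard (banking, credit scoring, fintech)
- ✅ Intuitive thresholds (0.1, 0.2)
- ✅ Works for any distribution

**Cons:**
- ❌ Sensitive to number of bins (default 10)
- ❌ Depends on which sample defines the bin edges (on fixed bins the formula is symmetric: PSI(A,B) = PSI(B,A)), and the thresholds are rules of thumb rather than a significance test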

**When to use**: Credit scores, risk models, any feature with established PSI thresholds
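
A worked three-bin example makes the formula and thresholds concrete. The bin proportions below are made up for illustration, and `psi_from_proportions` is a hypothetical helper that expects proportions that are already non-zero.

```python
import math


def psi_from_proportions(reference_pct, current_pct):
    """PSI from per-bin proportions that are already non-zero and sum to 1."""
    return sum((c - r) * math.log(c / r) for r, c in zip(reference_pct, current_pct))


reference_pct = [0.2, 0.5, 0.3]  # illustrative bin shares in the reference window
current_pct = [0.1, 0.5, 0.4]    # illustrative bin shares in the current window

# Per-bin terms: (0.1 - 0.2) * ln(0.1 / 0.2) + 0 + (0.4 - 0.3) * ln(0.4 / 0.3)
psi = psi_from_proportions(reference_pct, current_pct)
print(round(psi, 4))  # 0.0981, just under the 0.1 "no change" threshold
```

The middle bin contributes nothing because its share is unchanged; the two outer bins contribute about 0.069 and 0.029.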

---

#### **3. Chi-Square Test**

**For**: Categorical features (device_type, test_program, bin_category)

**How it works:**
- Compares observed vs expected frequencies in categories
- **Statistic**: Measures deviation from expected
- **P-value**: Probability of a deviation at least this large if both samples came from the same distribution
- **Decision**: p < 0.05 → **drift detected**

**Code:**
```python
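import pandas as pd
from scipy.stats import chi2_contingency

# Build a contingency table with categories aligned by name
# (value_counts() orders by frequency, so the two Series cannot be stacked as-is)
ref_counts = reference_categorical.value_counts()
curr_counts = current_categorical.value_counts()
table = pd.concat([ref_counts, curr_counts], axis=1, keys=['reference', 'current']).fillna(0)

# Chi-square test of homogeneity: rows = categories, columns = samples
chi2, pvalue, dof, expected = chi2_contingency(table.to_numpy())
drift = pvalue < 0.05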
```

**When to use**: Device binning distribution, test program mix, categorical features

---

#### **4. Wasserstein Distance (Earth Mover's Distance)**

**For**: Measuring "effort" to transform one distribution into another

**Advantages**: More interpretable than KS for practitioners (units same as data)

**Code:**
```python
from scipy.stats import wasserstein_distance
distance = wasserstein_distance(reference_data, current_data)
```

**When to use**: When you want drift magnitude in original units (e.g., "Vdd shifted by 0.05V")
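
A quick sanity check of the "original units" claim: for a pure shift, the distance equals the shift itself. The data below is synthetic and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

reference_vdd = np.linspace(1.15, 1.25, 1001)  # synthetic reference Vdd values (volts)
current_vdd = reference_vdd + 0.05             # every sample shifted up by 50 mV

distance = wasserstein_distance(reference_vdd, current_vdd)
print(f"Vdd shifted by {distance:.3f} V")      # Vdd shifted by 0.050 V
```

Unlike a p-value, this number does not say whether the shift is statistically significant, so it pairs well with a test such as KS.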

---

### **📊 Concept Drift Detection**

**Challenge**: Requires ground truth labels (delayed in production)

**Strategies:**

#### **1. Performance Monitoring (Gold Standard)**
```python
from sklearn.metrics import accuracy_score

# Daily batch evaluation
y_pred = model.predict(X_production)
# Wait for ground truth (24 hours for semiconductor test)
accuracy = accuracy_score(y_true_delayed, y_pred)

if accuracy < threshold:
    trigger_retrain()
```

**Pros**: Direct measurement of model effectiveness  
**Cons**: Requires ground truth (may be delayed days/weeks)

#### **2. Prediction Drift**
```python
# Monitor prediction distribution
ref_pred_mean = reference_predictions.mean()
curr_pred_mean = current_predictions.mean()

# If prediction distribution shifts significantly → investigate
if abs(curr_pred_mean - ref_pred_mean) > 0.1:
    alert("Prediction drift detected")
```

**Use case**: Early warning before ground truth available

#### **3. Error Distribution Monitoring**
```python
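from scipy.stats import ks_2samp

# Track error patterns (residuals) on labeled reference and current windows
ref_errors = y_true_ref - y_pred_ref
curr_errors = y_true_curr - y_pred_curr

# If the error distribution changes → possible concept drift
ks_stat, pval = ks_2samp(ref_errors, curr_errors)
error_drift = pval < 0.05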
```

**Advantage**: Detects subtle concept drift

---

### **⚙️ Production Monitoring Architecture**

#### **Components:**

**1. Data Collection:**
```python
from datetime import datetime

# Log every prediction
prediction_log = {
    'timestamp': datetime.now(),
    'input_features': X.to_dict(),
    'prediction': y_pred,
    'prediction_confidence': model.predict_proba(X)[0][1],
    'model_version': '2.1',
    'latency_ms': 45
}
# Store in database (Postgres, MongoDB, S3)
```

**2. Drift Detection Pipeline:**
```python
# Scheduled job (hourly/daily)
def drift_detection_job():
    # Fetch last 24 hours of data
    current_data = fetch_recent_predictions()
    
    # Load reference data
    reference_data = load_training_data()
    
    # Run drift tests
    drift_report = {}
    for feature in features:
        psi = calculate_psi(reference_data[feature], current_data[feature])
        drift_report[feature] = {'psi': psi, 'drift': psi > 0.2}
    
    # If drift detected → alert
    if any(r['drift'] for r in drift_report.values()):
        send_alert(drift_report)
    
    # Log to monitoring dashboard
    log_to_grafana(drift_report)
```

**3. Alerting System:**
```python
# Multi-channel alerts
# Assumes the caller adds a 'severity' key to the report ('CRITICAL', 'WARNING', or other)
def send_alert(report):
    severity = report.get('severity')
    if severity == 'CRITICAL':
        # Immediate alerts
        send_pagerduty(report)  # Wake up on-call engineer
        send_slack('#ml-alerts', report)
        send_email(ml_team, report)
    elif severity == 'WARNING':
        send_slack('#ml-monitoring', report)
    else:
        log_to_dashboard(report)
```

**4. Auto-Retrain Trigger:**
```python
# Automated response
def evaluate_retrain_need(drift_report, performance):
    score = 0
    
    # Data drift scoring (drift_report maps feature name -> {'psi': ..., 'drift': ...})
    high_psi_features = sum(1 for r in drift_report.values() if r['psi'] > 0.2)
    score += high_psi_features * 10
    
    # Performance scoring
    if performance['accuracy'] < 0.88:
        score += 50
    
    # Decision
    if score > 60:
        trigger_retrain_pipeline()
        return "RETRAIN_TRIGGERED"
    elif score > 30:
        return "MONITOR_CLOSELY"
    else:
        return "OK"
```

---

### **📈 Monitoring Dashboards**

#### **Essential Visualizations:**

**1. Drift Heatmap:**
```
            Day 1   Day 2   Day 3   ...
Vdd_V       0.02    0.15    0.28    <- PSI values
Idd_mA      0.05    0.06    0.04
freq_MHz    0.10    0.12    0.25
temp_C      0.03    0.18    0.22

Color coding: Green (<0.1), Yellow (0.1-0.2), Red (>0.2)
```
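
The color coding can be expressed as a small lookup, sketched here with the PSI values from the table above. The helper name `psi_color` is hypothetical, and treating exactly 0.1 and 0.2 as yellow is an assumption, since the legend leaves the boundaries open.

```python
def psi_color(psi):
    """Map a PSI value to the heatmap color bands (boundary handling is an assumption)."""
    if psi < 0.1:
        return "green"
    if psi <= 0.2:
        return "yellow"
    return "red"


# PSI values copied from the example table (rows = features, columns = days 1-3)
psi_table = {
    "Vdd_V":    [0.02, 0.15, 0.28],
    "Idd_mA":   [0.05, 0.06, 0.04],
    "freq_MHz": [0.10, 0.12, 0.25],
    "temp_C":   [0.03, 0.18, 0.22],
}

color_table = {
    feature: [psi_color(value) for value in values]
    for feature, values in psi_table.items()
}
for feature, colors in color_table.items():
    print(f"{feature:<10}", colors)
```

The resulting grid of labels can be passed to whichever plotting tool renders the dashboard.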

**2. Performance Trend:**
```
Accuracy over time (30 days)
- Line plot with threshold lines
- Shaded regions (safe/warning/critical)
- Annotations for events (model updates, data changes)
```

**3. Prediction Distribution:**
```
Histogram comparison:
- Blue: Reference predictions (training)
- Green: Current predictions (no drift)
- Red: Current predictions (with drift)
```

**4. Feature Drift Radar Chart:**
```
Spider plot showing PSI for all features
- Each axis = one feature
- Reference lines at PSI = 0.1 and 0.2
- Easy to spot which features drifted
```

#### **Dashboard Tools:**

**Open-source:**
- **Evidently**: Pre-built drift dashboards, interactive reports
- **Grafana**: Time-series metrics, alerts, custom dashboards
- **Streamlit**: Custom Python dashboards, rapid prototyping

**Commercial:**
- **Arize AI**: ML observability platform, automatic drift detection
- **Fiddler AI**: Model monitoring + explainability
- **WhyLabs**: Data/model quality monitoring

---

### **🎓 Best Practices**

#### **1. Choose Right Drift Test for Feature Type**

| Feature Type | Recommended Test | Why |
|--------------|------------------|-----|
| Continuous numerical | KS Test or PSI | Sensitive to distribution changes |
| Categorical | Chi-Square | Tests category frequencies |
| Ordinal | PSI or Wasserstein | Preserves order information |
| High-dimensional | PCA + KS | Reduce dimensions first |
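
The "PCA + KS" row can be sketched as follows: fit principal directions on the reference data only, project both samples onto them, then run a KS test per component. This is one possible reading of that row, not a prescribed method. The helper names, the synthetic data generator, and the Bonferroni correction are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp


def pca_ks_drift(reference, current, n_components=2, alpha=0.05):
    """KS-test the top principal components, with directions fitted on the reference only."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal directions of the centered reference data
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    components = vt[:n_components]
    ref_proj = (reference - mean) @ components.T
    cur_proj = (current - mean) @ components.T
    pvalues = [
        float(ks_2samp(ref_proj[:, i], cur_proj[:, i]).pvalue)
        for i in range(n_components)
    ]
    # Bonferroni correction across components (an illustrative choice)
    return {"pvalues": pvalues, "drift": bool(min(pvalues) < alpha / n_components)}


def make_batch(rng, n, latent_mean=0.0, n_features=20):
    """Synthetic correlated features driven by one shared latent factor."""
    latent = rng.normal(latent_mean, 1.0, size=(n, 1))
    return 2.0 * latent + rng.normal(size=(n, n_features))


rng = np.random.default_rng(42)
reference = make_batch(rng, 2000)
shifted = make_batch(rng, 2000, latent_mean=1.0)  # drift along the shared factor

print(pca_ks_drift(reference, shifted)["drift"])  # True
```

A caveat of this approach is that drift confined to low-variance directions is discarded along with the dropped components, so it complements per-feature tests rather than replacing them.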

#### **2. Set Appropriate Thresholds**

**Don't use defaults blindly:**
- PSI thresholds (0.1, 0.2) are industry standard but may need tuning
- For high-frequency monitoring (hourly), use higher thresholds (reduce false alarms)
- For critical models (fraud, medical), use lower thresholds (catch drift early)

**Calibrate thresholds:**
```python
# Use validation data to find optimal threshold
val_psi_values = [calculate_psi(train, val_fold) for val_fold in val_folds]
threshold = np.percentile(val_psi_values, 95)  # 95th percentile
```

#### **3. Monitor at Multiple Time Scales**

- **Hourly**: Detect sudden changes (equipment failure, data pipeline bug)
- **Daily**: Standard monitoring cadence
- **Weekly**: Trend analysis, seasonal patterns
- **Monthly**: Long-term drift, model retraining schedule

#### **4. Combine Multiple Signals**

**Don't rely on single metric:**
```python
# Ensemble drift detection
signals = {
    'data_drift': psi > 0.2,
    'concept_drift': accuracy < 0.88,
    'prediction_drift': pred_distribution_shift > 0.15,
    'latency_spike': p95_latency > 100  # milliseconds
}

# Alert if 2+ signals triggered
if sum(signals.values()) >= 2:
    send_alert("Multiple drift signals detected")
```

#### **5. Root Cause Analysis Logging**

**Log context for debugging:**
```python
prediction_log = {
    # Prediction data
    'features': X,
    'prediction': y_pred,
    
    # Context (helps identify root cause)
    'data_source': 'wafer_test_station_3',
    'test_program_version': '2.1.5',
    'equipment_id': 'ATE-007',
    'fab_location': 'FAB12',
    'process_node': '7nm',
    'lot_id': 'LOT-2025-001',
    
    # Model metadata
    'model_version': '2.3',
    'inference_latency_ms': 45
}
```

**When drift detected**, correlate with:
- Equipment changes
- Process changes
- Software updates
- External events (temperature, power)

---

### **⚠️ Common Pitfalls**

#### **1. Over-Monitoring (Alert Fatigue)**
- **Problem**: Too many false alarms → team ignores alerts
- **Solution**: Tune thresholds, require multiple consecutive violations
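
Requiring consecutive violations can be as simple as a streak counter. A minimal sketch, where the helper name `consecutive_violation_alert` and the daily PSI series are made up for illustration:

```python
def consecutive_violation_alert(values, threshold, required=3):
    """Return True once `values` has exceeded `threshold` for `required` checks in a row."""
    streak = 0
    for value in values:
        streak = streak + 1 if value > threshold else 0
        if streak >= required:
            return True
    return False


daily_psi_noisy = [0.05, 0.25, 0.08, 0.22, 0.09, 0.06]      # isolated spikes
daily_psi_sustained = [0.05, 0.12, 0.21, 0.24, 0.28, 0.30]  # sustained drift

print(consecutive_violation_alert(daily_psi_noisy, threshold=0.2))      # False
print(consecutive_violation_alert(daily_psi_sustained, threshold=0.2))  # True
```

The trade-off is detection delay: with `required=3` on daily checks, a genuine drift is reported no sooner than the third day.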

#### **2. Under-Monitoring (Missing Critical Drift)**
- **Problem**: Only monitor once/week → miss sudden changes
- **Solution**: Hourly drift checks, daily performance checks

#### **3. No Action Plan**
- **Problem**: Alert triggers, but no one knows what to do
- **Solution**: Runbook with decision tree (rollback vs retrain vs investigate)

#### **4. Ignoring Seasonal Patterns**
- **Problem**: Holiday shopping surge → "drift detected" → unnecessary retrain
- **Solution**: Use seasonally-adjusted baselines, exclude known patterns

#### **5. Not Testing Monitoring System**
- **Problem**: Monitoring code has bug → false confidence
- **Solution**: Inject synthetic drift, verify alerts trigger correctly
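
A synthetic-drift check might look like the sketch below: feed the monitor one clean batch and one batch with a known injected shift, and confirm it stays quiet on the first and fires on the second. The `psi` function here is a self-contained stand-in for whatever detector is actually deployed, and the 50 mV shift and seed are arbitrary choices.

```python
import numpy as np


def psi(reference, current, bins=10):
    """PSI with bin edges from the reference; out-of-range values go to the outer bins."""
    reference = np.asarray(reference, dtype=float)
    edges = np.histogram_bin_edges(reference, bins=bins)
    current = np.clip(np.asarray(current, dtype=float), edges[0], edges[-1])
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-4, None)  # avoid log(0) for empty bins
    cur_pct = np.clip(cur_pct, 1e-4, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


def run_synthetic_drift_check(seed=7):
    """Return (PSI on clean data, PSI on data with an injected 50 mV shift)."""
    rng = np.random.default_rng(seed)
    reference = rng.normal(1.20, 0.05, size=5000)  # synthetic Vdd reference
    clean = rng.normal(1.20, 0.05, size=5000)      # same distribution, new sample
    drifted = clean + 0.05                         # inject a known shift
    return psi(reference, clean), psi(reference, drifted)


clean_psi, drifted_psi = run_synthetic_drift_check()
assert clean_psi < 0.1, "monitor raised a false alarm on clean data"
assert drifted_psi > 0.2, "monitor missed the injected drift"
print(f"clean PSI={clean_psi:.3f}, drifted PSI={drifted_psi:.3f}")
```

Running a check like this on a schedule, or in CI for the monitoring code, guards against the monitor itself failing silently.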

---

### **🔮 Next Steps**

**After mastering monitoring:**
1. **124_Feature_Store_Implementation.ipynb** → Centralize feature engineering
2. **125_ML_Pipeline_Orchestration.ipynb** → Automate retrain on drift
3. **126_AB_Testing_ML_Models.ipynb** → Safe model rollouts
4. **131_Docker_Fundamentals.ipynb** → Containerize monitoring services

**Hands-On Practice:**
- Implement PSI monitoring for real dataset
- Build Streamlit dashboard showing drift heatmap
- Set up automated alerting (email/Slack)
- Create auto-retrain trigger (if PSI > 0.2 for 3 days)
- Test monitoring with synthetic drift

---

**You now have complete mastery of model monitoring and drift detection! 🚀**

**Key skill acquired**: Detect silent model failures early, maintain production model health, prevent business impact from model degradation.