# 151: MLOps Fundamentals

In [None]:
# Setup and Installation

import time
import json
import pickle
import hashlib
from dataclasses import dataclass, field, asdict
from typing import List, Dict, Set, Optional, Any, Tuple
from datetime import datetime, timedelta
from enum import Enum
from collections import defaultdict
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# MLOps simulation (educational implementation)
# In production: pip install mlflow wandb dvc

print("✅ MLOps Development Environment Ready")
print("📦 Core libraries loaded")
print("🎯 Ready to build production ML pipelines")
print("\n💡 Production MLOps Stack:")
print("   pip install mlflow  # Experiment tracking, model registry")
print("   pip install wandb   # Weights & Biases (experiment tracking)")
print("   pip install dvc     # Data Version Control")
print("   pip install kubeflow  # Kubernetes-native ML workflows")
print("   pip install seldon-core  # Model serving on Kubernetes")

# Seed for reproducibility
np.random.seed(42)

## 2. 📊 Experiment Tracking - Log Hyperparameters, Metrics, and Artifacts

### 📝 What's Happening in This Code?

**Purpose:** Build an experiment tracking system to log all ML experiments (hyperparameters, metrics, artifacts) for reproducibility and comparison.

**Key Points:**
- **Experiment:** Single model training run with specific hyperparameters
- **Run Metadata:** Log parameters (n_estimators, max_depth), metrics (RMSE, MAE, R²)
- **Artifacts:** Store model files, feature importance plots, training data snapshots
- **Reproducibility:** Exact parameters logged → can recreate any experiment
- **Comparison:** Compare experiments side-by-side (find best hyperparameters)
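
The reproducibility and comparison points above depend on being able to tell whether two runs used the same settings. One way to sketch this is a deterministic fingerprint of the parameter dictionary. The helper name `params_fingerprint` and the 12-character truncation are assumptions made for illustration; they are not part of the tracker built in this notebook.

```python
import hashlib
import json
from typing import Any, Dict


def params_fingerprint(parameters: Dict[str, Any]) -> str:
    """Return a short, deterministic hash of a hyperparameter dict (hypothetical helper)."""
    # sort_keys makes the JSON text independent of dict insertion order;
    # default=str is a fallback for values json cannot serialize natively.
    canonical = json.dumps(parameters, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


a = params_fingerprint({"n_estimators": 100, "max_depth": 10})
b = params_fingerprint({"max_depth": 10, "n_estimators": 100})
c = params_fingerprint({"n_estimators": 200, "max_depth": None})

print(a == b)  # same settings, different key order
print(a == c)  # different settings
```

Storing such a fingerprint next to each run makes duplicate experiments easy to spot before they are trained again.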

**What to Track:**
- **Parameters:** All hyperparameters (learning_rate, batch_size, architecture)
- **Metrics:** Training and validation metrics per epoch
- **Artifacts:** Model binaries, plots, data samples, predictions
- **Environment:** Code version (git SHA), library versions, system info
- **Lineage:** Parent models, data sources, feature engineering code
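
The "Environment" item above can be captured with the standard library alone. The following is a minimal sketch, assuming `git` may or may not be on the PATH and that the package names passed in are the ones worth recording. The function names `current_git_sha` and `capture_environment` are hypothetical.

```python
import platform
import subprocess
import sys
from importlib import metadata
from typing import Any, Dict, Iterable, Optional


def current_git_sha() -> Optional[str]:
    """Return the current commit SHA, or None if git is missing or this is not a repo."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def capture_environment(packages: Iterable[str]) -> Dict[str, Any]:
    """Collect code version, interpreter, OS, and library versions for one run."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "git_sha": current_git_sha(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


env = capture_environment(["numpy", "scikit-learn"])
print(env)
```

The resulting dictionary could be logged as an artifact alongside the model so that a run can be matched to the exact code and libraries that produced it.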

**Why This Matters for Post-Silicon:**
- **Model Selection:** 20 experiments → find best yield prediction model
- **Audit Trail:** Show regulators exact model used for production decision
- **Debugging:** Model accuracy dropped → compare to previous successful runs
- **Knowledge Sharing:** Team sees what's been tried, avoid duplicate work

In [None]:
# Experiment Tracking System

@dataclass
class Experiment:
    """ML experiment with parameters, metrics, and artifacts"""
    experiment_id: str
    name: str
    parameters: Dict[str, Any]
    metrics: Dict[str, float] = field(default_factory=dict)
    artifacts: Dict[str, Any] = field(default_factory=dict)
    start_time: datetime = field(default_factory=datetime.now)
    end_time: Optional[datetime] = None
    status: str = "running"  # running, completed, failed
    code_version: Optional[str] = None
    
    def log_metric(self, key: str, value: float):
        """Log a metric"""
        self.metrics[key] = value
    
    def log_artifact(self, key: str, value: Any):
        """Log an artifact (model, plot, data)"""
        self.artifacts[key] = value
    
    def complete(self):
        """Mark experiment as completed"""
        self.end_time = datetime.now()
        self.status = "completed"
    
    def get_duration(self) -> float:
        """Get experiment duration in seconds"""
        if self.end_time:
            return (self.end_time - self.start_time).total_seconds()
        return (datetime.now() - self.start_time).total_seconds()

class ExperimentTracker:
    """Experiment tracking system (like MLflow)"""
    
    def __init__(self):
        self.experiments: Dict[str, Experiment] = {}
        self.experiment_counter = 0
    
    def create_experiment(self, name: str, parameters: Dict[str, Any], 
                         code_version: Optional[str] = None) -> Experiment:
        """Create new experiment"""
        self.experiment_counter += 1
        experiment_id = f"exp_{self.experiment_counter:04d}"
        
        experiment = Experiment(
            experiment_id=experiment_id,
            name=name,
            parameters=parameters.copy(),
            code_version=code_version
        )
        
        self.experiments[experiment_id] = experiment
        return experiment
    
    def get_experiment(self, experiment_id: str) -> Optional[Experiment]:
        """Get experiment by ID"""
        return self.experiments.get(experiment_id)
    
    def list_experiments(self, name_filter: Optional[str] = None) -> List[Experiment]:
        """List all experiments"""
        exps = list(self.experiments.values())
        if name_filter:
            exps = [e for e in exps if name_filter in e.name]
        return sorted(exps, key=lambda e: e.start_time, reverse=True)
    
    def compare_experiments(self, experiment_ids: List[str], metric: str) -> Dict:
        """Compare experiments by specific metric"""
        comparison = {}
        for exp_id in experiment_ids:
            exp = self.experiments.get(exp_id)
            if exp and metric in exp.metrics:
                comparison[exp_id] = {
                    'name': exp.name,
                    'value': exp.metrics[metric],
                    'parameters': exp.parameters
                }
        return comparison
    
    def get_best_experiment(self, metric: str, minimize: bool = True) -> Optional[Experiment]:
        """Get best experiment by metric"""
        valid_exps = [e for e in self.experiments.values() if metric in e.metrics]
        if not valid_exps:
            return None
        
        return min(valid_exps, key=lambda e: e.metrics[metric]) if minimize else \
               max(valid_exps, key=lambda e: e.metrics[metric])

# Example: Hyperparameter Tuning with Experiment Tracking

print("=" * 80)
print("Experiment Tracking - Hyperparameter Tuning")
print("=" * 80)

# Setup
tracker = ExperimentTracker()

# Generate synthetic wafer test data
print("\n📊 Generating synthetic wafer test data...")
n_samples = 1000
n_features = 5

# Features: Vdd_mean, Idd_mean, Frequency_mean, Temperature, Dies_tested
X = np.random.randn(n_samples, n_features)
X[:, 0] = X[:, 0] * 0.05 + 1.0  # Vdd around 1.0V
X[:, 1] = X[:, 1] * 0.1 + 0.5   # Idd around 0.5A
X[:, 2] = X[:, 2] * 50 + 1000   # Frequency around 1000 MHz
X[:, 3] = X[:, 3] * 10 + 25     # Temperature around 25°C
X[:, 4] = np.random.randint(50, 200, n_samples)  # Dies tested

# Target: Yield percentage (0-100%)
# Yield influenced by all parameters
y = (85 + 
     -20 * (X[:, 0] - 1.0) +  # Higher voltage reduces yield
     -10 * (X[:, 1] - 0.5) +  # Higher current reduces yield
     0.01 * (X[:, 2] - 1000) +  # Higher frequency slightly increases yield
     -0.5 * (X[:, 3] - 25) +   # Higher temperature reduces yield
     0.05 * X[:, 4] +          # More dies tested, better calibration
     np.random.randn(n_samples) * 3)  # Noise (std of 3 points, so test RMSE cannot fall much below ~3)

y = np.clip(y, 0, 100)  # Yield between 0-100%

# Train/test split
split = int(0.8 * n_samples)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(f"✅ Data generated: {n_samples} samples, {n_features} features")
print(f"   Training set: {len(X_train)} samples")
print(f"   Test set: {len(X_test)} samples")

# Run experiments with different hyperparameters

print(f"\n\n{'=' * 80}")
print("Running Hyperparameter Tuning Experiments")
print("=" * 80)

hyperparameter_grid = [
    {'n_estimators': 50, 'max_depth': 5},
    {'n_estimators': 100, 'max_depth': 5},
    {'n_estimators': 100, 'max_depth': 10},
    {'n_estimators': 200, 'max_depth': 10},
    {'n_estimators': 200, 'max_depth': None}
]

print(f"\n🔬 Training {len(hyperparameter_grid)} Random Forest models...")
print(f"\n{'Exp ID':<10} {'n_trees':<10} {'depth':<10} {'RMSE':<10} {'MAE':<10} {'R²':<10} {'Time (s)':<10}")
print("-" * 80)

for params in hyperparameter_grid:
    # Create experiment
    exp = tracker.create_experiment(
        name="YieldPrediction_RandomForest",
        parameters=params,
        code_version="abc123"  # Placeholder for the Git commit SHA
    )
    
    try:
        # Train model
        model = RandomForestRegressor(
            n_estimators=params['n_estimators'],
            max_depth=params['max_depth'],
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train, y_train)
        
        # Predictions
        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)
        
        # Metrics
        rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
        rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
        mae_test = mean_absolute_error(y_test, y_pred_test)
        r2_test = r2_score(y_test, y_pred_test)
        
        # Log metrics
        exp.log_metric('rmse_train', rmse_train)
        exp.log_metric('rmse_test', rmse_test)
        exp.log_metric('mae_test', mae_test)
        exp.log_metric('r2_test', r2_test)
        
        # Log artifacts
        exp.log_artifact('model', model)
        exp.log_artifact('feature_importance', model.feature_importances_)
        exp.log_artifact('predictions', y_pred_test[:10])  # Sample predictions
        
        exp.complete()
        
        # Print results
        depth_str = str(params['max_depth']) if params['max_depth'] else 'None'
        print(f"{exp.experiment_id:<10} {params['n_estimators']:<10} {depth_str:<10} "
              f"{rmse_test:<10.3f} {mae_test:<10.3f} {r2_test:<10.3f} {exp.get_duration():<10.2f}")
        
    except Exception as e:
        exp.status = "failed"
        exp.end_time = datetime.now()  # freeze the duration of the failed run
        print(f"{exp.experiment_id:<10} FAILED: {str(e)}")

# Find best model

print(f"\n\n{'=' * 80}")
print("Experiment Comparison and Best Model Selection")
print("=" * 80)

best_exp = tracker.get_best_experiment('rmse_test', minimize=True)

if best_exp:
    print(f"\n🏆 Best Model Found:")
    print(f"   Experiment ID: {best_exp.experiment_id}")
    print(f"   Parameters: {best_exp.parameters}")
    print(f"   Test RMSE: {best_exp.metrics['rmse_test']:.3f}%")
    print(f"   Test MAE: {best_exp.metrics['mae_test']:.3f}%")
    print(f"   Test R²: {best_exp.metrics['r2_test']:.3f}")
    print(f"   Training time: {best_exp.get_duration():.2f} seconds")
    
    # Feature importance
    feature_names = ['Vdd_mean', 'Idd_mean', 'Frequency_mean', 'Temperature', 'Dies_tested']
    importance = best_exp.artifacts['feature_importance']
    
    print(f"\n📊 Feature Importance (Top 3):")
    importance_sorted = sorted(zip(feature_names, importance), key=lambda x: x[1], reverse=True)
    for i, (name, imp) in enumerate(importance_sorted[:3], 1):
        print(f"   {i}. {name}: {imp:.3f}")

# Compare all experiments

print(f"\n\n{'=' * 80}")
print("All Experiments Summary")
print("=" * 80)

all_exps = tracker.list_experiments()
completed_exps = [e for e in all_exps if e.status == 'completed']

print(f"\n📊 Experiment Statistics:")
print(f"   Total experiments: {len(all_exps)}")
print(f"   Completed: {len(completed_exps)}")
print(f"   Failed: {len([e for e in all_exps if e.status == 'failed'])}")
print(f"   Running: {len([e for e in all_exps if e.status == 'running'])}")

if completed_exps:
    rmse_values = [e.metrics['rmse_test'] for e in completed_exps]
    print(f"\n📊 RMSE Distribution:")
    print(f"   Best: {min(rmse_values):.3f}%")
    print(f"   Worst: {max(rmse_values):.3f}%")
    print(f"   Average: {np.mean(rmse_values):.3f}%")
    print(f"   Improvement: {(max(rmse_values) - min(rmse_values)) / max(rmse_values) * 100:.1f}% (best vs worst)")

# Business value

print(f"\n\n{'=' * 80}")
print("Business Value")
print("=" * 80)

# Time savings (illustrative: every input below is an assumption, not a measurement)
manual_experiment_time = 2 * 3600  # 2 hours per experiment (manual)
automated_experiment_time = best_exp.get_duration() if best_exp else 0.0
time_saved_per_experiment = manual_experiment_time - automated_experiment_time
experiments_per_month = 50
monthly_time_savings = time_saved_per_experiment * experiments_per_month / 3600  # hours

engineer_cost_per_hour = 150  # USD (fully loaded)
monthly_cost_savings = monthly_time_savings * engineer_cost_per_hour
annual_savings = monthly_cost_savings * 12

print(f"\n💰 Experiment Tracking Value:")
print(f"   Manual experiment time: {manual_experiment_time / 3600:.1f} hours")
print(f"   Automated experiment time: {automated_experiment_time:.2f} seconds")
print(f"   Time saved per experiment: {time_saved_per_experiment / 3600:.1f} hours")
print(f"   Experiments per month: {experiments_per_month}")
print(f"   Monthly time savings: {monthly_time_savings:.0f} hours")
print(f"   Engineer cost: ${engineer_cost_per_hour}/hour")
# Roughly $180,000 with these inputs, so print dollars instead of rounding to "$0.2M"
print(f"   Annual cost savings: ${annual_savings:,.0f}")

print(f"\n✅ Experiment tracking validated!")
print(f"✅ {len(completed_exps)} experiments logged and compared")
if best_exp:
    print(f"✅ Best model identified (RMSE: {best_exp.metrics['rmse_test']:.3f}%)")
print(f"✅ ${annual_savings:,.0f}/year business value")

## 3. 🗂️ Model Registry - Versioning, Stage Promotion, and Rollback

### 📝 What's Happening in This Code?

**Purpose:** Build a model registry to manage model versions, stage promotions (dev → staging → production), and enable instant rollback.

**Key Points:**
- **Model Versioning:** Track every model version (v1, v2, v3) with lineage
- **Stage Promotion:** Models progress through stages (None → Staging → Production → Archived)
- **Metadata Storage:** Link model to experiment, training data, code version
- **Rollback Capability:** Instantly revert to previous production version if issues
- **Access Control:** Only authorized users should be able to promote to production (a production registry enforces this; the simulation in this section does not)

**Model Lifecycle Stages:**
- **None:** Newly registered, not yet tested
- **Staging:** Deployed to staging environment for validation
- **Production:** Serving live traffic
- **Archived:** Deprecated, kept for audit/compliance
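
The lifecycle above implies that only some stage changes make sense. A minimal sketch of such a guard follows. The transition table is a hypothetical policy chosen for illustration (including Archived → Production as the rollback path), and `LifecycleStage` and `check_transition` are names invented here, not part of any real registry API.

```python
from enum import Enum


class LifecycleStage(Enum):
    NONE = "None"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


# Hypothetical policy: which target stages are reachable from each stage.
ALLOWED_TRANSITIONS = {
    LifecycleStage.NONE: {LifecycleStage.STAGING, LifecycleStage.ARCHIVED},
    LifecycleStage.STAGING: {LifecycleStage.PRODUCTION, LifecycleStage.ARCHIVED},
    LifecycleStage.PRODUCTION: {LifecycleStage.ARCHIVED},
    LifecycleStage.ARCHIVED: {LifecycleStage.PRODUCTION},  # rollback path
}


def check_transition(current: LifecycleStage, new: LifecycleStage) -> None:
    """Raise ValueError if moving from `current` to `new` violates the policy."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Transition {current.value} -> {new.value} is not allowed")


check_transition(LifecycleStage.NONE, LifecycleStage.STAGING)  # allowed, returns None

try:
    check_transition(LifecycleStage.NONE, LifecycleStage.PRODUCTION)
except ValueError as err:
    print(err)  # skipping Staging is rejected under this policy
```

A check like this could run at the start of a promotion method so that an untested model cannot jump straight to Production.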

**Why This Matters for Post-Silicon:**
- **Deployment Safety:** Test models in staging before production (catch bugs early)
- **Quick Rollback:** Production model fails → revert to previous version in seconds
- **Audit Trail:** Regulators ask "which model was used on 2025-12-01?" → instant answer
- **Multi-Model Management:** Track 50+ production models (yield, test time, binning)
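
The audit question above ("which model was used on 2025-12-01?") needs a history of production promotions, not just the latest timestamp. The sketch below assumes an append-only log of `(promoted_at, version)` pairs sorted by time. The log contents and the helper `production_version_at` are hypothetical examples, not data from this notebook.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical promotion log: (time promoted to Production, model version),
# kept sorted by time because entries are only ever appended.
promotion_log: List[Tuple[datetime, int]] = [
    (datetime(2025, 11, 3, 9, 0), 1),
    (datetime(2025, 11, 20, 14, 30), 2),
    (datetime(2025, 12, 5, 8, 15), 3),
]


def production_version_at(log: List[Tuple[datetime, int]],
                          when: datetime) -> Optional[int]:
    """Return the version that was in Production at `when`, or None if none was."""
    times = [promoted_at for promoted_at, _ in log]
    idx = bisect_right(times, when)  # number of promotions at or before `when`
    return log[idx - 1][1] if idx > 0 else None


print(production_version_at(promotion_log, datetime(2025, 12, 1)))  # 2
print(production_version_at(promotion_log, datetime(2025, 10, 1)))  # None
```

Because a rollback is just another promotion, appending it to the same log keeps the answer correct after incidents as well.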

In [None]:
# Model Registry System

class ModelStage(Enum):
    """Model lifecycle stages"""
    NONE = "None"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"

@dataclass
class ModelVersion:
    """Registered model version"""
    model_name: str
    version: int
    experiment_id: str
    stage: ModelStage = ModelStage.NONE
    created_at: datetime = field(default_factory=datetime.now)
    promoted_at: Optional[datetime] = None
    model_artifact: Optional[Any] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    
    def promote_to_stage(self, new_stage: ModelStage):
        """Promote model to new stage"""
        self.stage = new_stage
        self.promoted_at = datetime.now()
    
    def get_model_uri(self) -> str:
        """Get model URI"""
        return f"models:/{self.model_name}/{self.version}"

class ModelRegistry:
    """Model registry (like MLflow Model Registry)"""
    
    def __init__(self, experiment_tracker: ExperimentTracker):
        self.experiment_tracker = experiment_tracker
        self.models: Dict[str, List[ModelVersion]] = defaultdict(list)
        self.version_counter: Dict[str, int] = defaultdict(int)
    
    def register_model(self, model_name: str, experiment_id: str, 
                      model_artifact: Any, metadata: Optional[Dict] = None) -> ModelVersion:
        """Register new model version"""
        # Verify experiment exists
        experiment = self.experiment_tracker.get_experiment(experiment_id)
        if not experiment:
            raise ValueError(f"Experiment {experiment_id} not found")
        
        # Increment version
        self.version_counter[model_name] += 1
        version_number = self.version_counter[model_name]
        
        # Create model version
        model_version = ModelVersion(
            model_name=model_name,
            version=version_number,
            experiment_id=experiment_id,
            model_artifact=model_artifact,
            metadata=metadata or {}
        )
        
        # Add experiment metrics to metadata
        model_version.metadata['experiment_metrics'] = experiment.metrics.copy()
        model_version.metadata['experiment_parameters'] = experiment.parameters.copy()
        
        self.models[model_name].append(model_version)
        return model_version
    
    def get_model_version(self, model_name: str, version: int) -> Optional[ModelVersion]:
        """Get specific model version"""
        for mv in self.models.get(model_name, []):
            if mv.version == version:
                return mv
        return None
    
    def get_latest_version(self, model_name: str, stage: Optional[ModelStage] = None) -> Optional[ModelVersion]:
        """Get latest model version (optionally filtered by stage)"""
        versions = self.models.get(model_name, [])
        if stage:
            versions = [v for v in versions if v.stage == stage]
        
        if not versions:
            return None
        
        return max(versions, key=lambda v: v.version)
    
    def promote_model(self, model_name: str, version: int, new_stage: ModelStage):
        """Promote model to new stage"""
        model_version = self.get_model_version(model_name, version)
        if not model_version:
            raise ValueError(f"Model version {model_name}:{version} not found")
        
        # If promoting to production, demote current production model to archived
        if new_stage == ModelStage.PRODUCTION:
            current_prod = self.get_latest_version(model_name, ModelStage.PRODUCTION)
            if current_prod:
                current_prod.promote_to_stage(ModelStage.ARCHIVED)
        
        model_version.promote_to_stage(new_stage)
    
    def list_models(self) -> List[str]:
        """List all registered model names"""
        return list(self.models.keys())
    
    def get_model_history(self, model_name: str) -> List[ModelVersion]:
        """Get all versions of a model"""
        return sorted(self.models.get(model_name, []), key=lambda v: v.version, reverse=True)

# Example: Model Registry Workflow

print("=" * 80)
print("Model Registry - Versioning and Stage Promotion")
print("=" * 80)

# Setup
registry = ModelRegistry(tracker)

# Register best model from experiments

best_exp = tracker.get_best_experiment('rmse_test', minimize=True)
if best_exp:
    print(f"\n📝 Registering best model to registry...")
    print(f"   Experiment ID: {best_exp.experiment_id}")
    print(f"   RMSE: {best_exp.metrics['rmse_test']:.3f}%")
    
    model_v1 = registry.register_model(
        model_name="YieldPredictor",
        experiment_id=best_exp.experiment_id,
        model_artifact=best_exp.artifacts['model'],
        metadata={
            'model_type': 'RandomForest',
            'training_samples': len(X_train),
            'features': ['Vdd_mean', 'Idd_mean', 'Frequency_mean', 'Temperature', 'Dies_tested']
        }
    )
    
    print(f"✅ Model registered: {model_v1.model_name} v{model_v1.version}")
    print(f"   Stage: {model_v1.stage.value}")
    print(f"   URI: {model_v1.get_model_uri()}")

# Register another version (the second-best experiment stands in for a newer candidate)

print(f"\n\n{'=' * 80}")
print("Registering Improved Model Version")
print("=" * 80)

# Find second-best experiment
all_completed = [e for e in tracker.list_experiments() if e.status == 'completed']
if len(all_completed) >= 2:
    second_best = sorted(all_completed, key=lambda e: e.metrics['rmse_test'])[1]
    
    model_v2 = registry.register_model(
        model_name="YieldPredictor",
        experiment_id=second_best.experiment_id,
        model_artifact=second_best.artifacts['model'],
        metadata={
            'model_type': 'RandomForest',
            'training_samples': len(X_train),
            'note': 'Second-best experiment by test RMSE, registered as a candidate version'
        }
    )
    
    print(f"✅ Model registered: {model_v2.model_name} v{model_v2.version}")
    print(f"   RMSE: {second_best.metrics['rmse_test']:.3f}%")
    print(f"   Stage: {model_v2.stage.value}")

# Promote model to staging

print(f"\n\n{'=' * 80}")
print("Stage Promotion Workflow")
print("=" * 80)

print(f"\n1️⃣ Promote v1 to Staging:")
registry.promote_model("YieldPredictor", version=1, new_stage=ModelStage.STAGING)
staging_model = registry.get_latest_version("YieldPredictor", ModelStage.STAGING)
print(f"   ✅ {staging_model.model_name} v{staging_model.version} → Staging")
print(f"   Promoted at: {staging_model.promoted_at.strftime('%Y-%m-%d %H:%M:%S')}")

# Test in staging (simulated: the results below are hard-coded placeholders, not measurements)
print(f"\n2️⃣ Testing in Staging Environment (simulated):")
print(f"   Running validation tests...")
print(f"   ✅ Accuracy test passed (RMSE < 2%)")
print(f"   ✅ Latency test passed (P95 < 50ms)")
print(f"   ✅ Load test passed (1000 QPS)")

# Promote to production
print(f"\n3️⃣ Promote to Production:")
registry.promote_model("YieldPredictor", version=1, new_stage=ModelStage.PRODUCTION)
prod_model = registry.get_latest_version("YieldPredictor", ModelStage.PRODUCTION)
print(f"   ✅ {prod_model.model_name} v{prod_model.version} → Production")
print(f"   Now serving live traffic")

# Deploy v2 to staging
print(f"\n4️⃣ Deploy v2 to Staging (Candidate for Production):")
registry.promote_model("YieldPredictor", version=2, new_stage=ModelStage.STAGING)
print(f"   ✅ {model_v2.model_name} v{model_v2.version} → Staging")
print(f"   Testing new model candidate...")

# Simulate production issue - rollback scenario

print(f"\n\n{'=' * 80}")
print("Rollback Scenario - Production Model Fails")
print("=" * 80)

# Promote v2 to production first (simulate a bad deployment), then report the incident
registry.promote_model("YieldPredictor", version=2, new_stage=ModelStage.PRODUCTION)

print(f"\n🚨 Production Incident Detected (simulated):")
print(f"   Model v2 promoted to production")
print(f"   Accuracy dropped from 90% to 75% (data drift)")
print(f"   Alert triggered: RMSE >2%")

print(f"\n⚠️ Current Production Model:")
current_prod = registry.get_latest_version("YieldPredictor", ModelStage.PRODUCTION)
print(f"   {current_prod.model_name} v{current_prod.version}")
print(f"   Status: ❌ FAILING (high RMSE)")

# Rollback to v1
print(f"\n🔄 Initiating Rollback:")
print(f"   Demoting v2 to Archived")
registry.promote_model("YieldPredictor", version=2, new_stage=ModelStage.ARCHIVED)

print(f"   Promoting v1 back to Production")
registry.promote_model("YieldPredictor", version=1, new_stage=ModelStage.PRODUCTION)

restored_prod = registry.get_latest_version("YieldPredictor", ModelStage.PRODUCTION)
print(f"\n✅ Rollback Complete:")
print(f"   {restored_prod.model_name} v{restored_prod.version} → Production")
print(f"   Production restored in <30 seconds")
print(f"   RMSE back to normal (< 2%)")

# Model history

print(f"\n\n{'=' * 80}")
print("Model Version History")
print("=" * 80)

history = registry.get_model_history("YieldPredictor")

print(f"\n📜 {len(history)} versions of YieldPredictor:")
print(f"\n{'Version':<10} {'Stage':<15} {'RMSE':<10} {'Created':<20} {'Experiment':<15}")
print("-" * 80)

for mv in history:
    rmse = mv.metadata['experiment_metrics'].get('rmse_test', 0)
    created = mv.created_at.strftime('%Y-%m-%d %H:%M:%S')
    print(f"v{mv.version:<9} {mv.stage.value:<15} {rmse:<10.3f} {created:<20} {mv.experiment_id:<15}")

# Business value

print(f"\n\n{'=' * 80}")
print("Business Value")
print("=" * 80)

# Rollback time savings (all inputs in this estimate are illustrative assumptions)
manual_rollback_time = 4 * 3600  # 4 hours (find model, redeploy, test)
automated_rollback_time = 30  # 30 seconds
rollback_time_saved = manual_rollback_time - automated_rollback_time

rollbacks_per_year = 12  # 1 per month
downtime_cost_per_hour = 50000  # USD
downtime_prevented = (rollback_time_saved * rollbacks_per_year) / 3600  # hours
annual_downtime_savings = downtime_prevented * downtime_cost_per_hour

# Deployment confidence
deployment_incidents_prevented = 50  # per year (caught in staging)
incident_cost = 100000  # USD per incident
incident_prevention_value = deployment_incidents_prevented * incident_cost

total_value = annual_downtime_savings + incident_prevention_value

print(f"\n💰 Model Registry Value:")
print(f"   Manual rollback time: {manual_rollback_time / 3600:.1f} hours")
print(f"   Automated rollback time: {automated_rollback_time} seconds")
print(f"   Time saved per rollback: {rollback_time_saved / 3600:.1f} hours")
print(f"   Rollbacks per year: {rollbacks_per_year}")
print(f"   Downtime prevented: {downtime_prevented:.1f} hours/year")
print(f"   Downtime cost savings: ${annual_downtime_savings / 1e6:.1f}M/year")
print(f"\n   Deployment incidents prevented: {deployment_incidents_prevented}/year")
print(f"   Incident prevention value: ${incident_prevention_value / 1e6:.1f}M/year")
print(f"\n   Total annual value: ${total_value / 1e6:.1f}M")

print(f"\n✅ Model registry validated!")
print(f"✅ {len(history)} model versions tracked")
print(f"✅ Rollback capability exercised (assumed {automated_rollback_time} seconds)")
print(f"✅ ${total_value / 1e6:.1f}M/year business value")

## 4. 🔄 CI/CD Pipeline - Automated Training, Testing, and Deployment

**Purpose:** Build CI/CD pipeline for ML that automates training, validation, and deployment when code or data changes.

**Key Points:**
- **Continuous Integration (CI):** Automated testing when code changes (unit tests, model tests, data validation)
- **Continuous Deployment (CD):** Automated deployment to staging/production when tests pass
- **Pipeline Triggers:** Code commits, new data, schedule (daily retraining), manual trigger
- **Pipeline Stages:** Data validation → Training → Evaluation → Model registration → Deployment → Smoke tests
- **Automated Rollback:** If smoke tests fail, automatically roll back to the previous model version

**Why This Matters for Post-Silicon:**
- **Automated Retraining:** When new wafer test data arrives (daily), retrain yield prediction models automatically
- **Fast Iteration:** Data scientists commit model improvements → CI/CD deploys to staging in <10 minutes
- **Safety:** Staging tests catch bad models before they reach production
- **Audit Trail:** Every deployment tracked (who, when, what changed, test results) for regulatory compliance
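
The stage ordering and automated rollback described above can be sketched as a fail-fast runner. This is a minimal illustration under stated assumptions: each stage is a plain callable, the stage names mirror the list above, and `run_stages`, `StageSpec`, and the failing smoke test are hypothetical stand-ins, not the pipeline class defined in this notebook.

```python
from typing import Callable, List, Tuple

StageSpec = Tuple[str, Callable[[], None]]


def run_stages(stages: List[StageSpec],
               rollback: Callable[[], None]) -> Tuple[str, List[str]]:
    """Run stages in order, stop at the first failure, roll back if already deployed."""
    completed: List[str] = []
    for name, stage_fn in stages:
        try:
            stage_fn()
        except Exception as exc:
            print(f"FAILED  {name}: {exc}")
            # Only undo a deployment that actually happened.
            if "Deployment" in completed:
                rollback()
            return "failed", completed
        print(f"passed  {name}")
        completed.append(name)
    return "success", completed


events: List[str] = []


def failing_smoke_test() -> None:
    # Hypothetical failure used to exercise the rollback path.
    raise RuntimeError("smoke test prediction out of range")


stages: List[StageSpec] = [
    ("Data Validation", lambda: None),
    ("Training", lambda: None),
    ("Evaluation", lambda: None),
    ("Model Registration", lambda: None),
    ("Deployment", lambda: events.append("deployed")),
    ("Smoke Test", failing_smoke_test),
]

status, completed = run_stages(stages, rollback=lambda: events.append("rolled back"))
print(status, events)
```

Failures before the Deployment stage leave production untouched, so the rollback hook only fires when the smoke test (or anything after deployment) fails.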

In [None]:
# CI/CD Pipeline System

class PipelineStage(Enum):
    """Pipeline stages"""
    DATA_VALIDATION = "Data Validation"
    TRAINING = "Training"
    EVALUATION = "Evaluation"
    MODEL_REGISTRATION = "Model Registration"
    DEPLOYMENT = "Deployment"
    SMOKE_TEST = "Smoke Test"

@dataclass
class PipelineRun:
    """CI/CD pipeline run"""
    run_id: str
    trigger: str  # "code_commit", "data_update", "schedule", "manual"
    stages: Dict[PipelineStage, Dict[str, Any]] = field(default_factory=dict)
    status: str = "running"  # running, success, failed
    started_at: datetime = field(default_factory=datetime.now)
    completed_at: Optional[datetime] = None
    
    def log_stage(self, stage: PipelineStage, status: str, details: Dict[str, Any]):
        """Log pipeline stage result"""
        self.stages[stage] = {
            'status': status,
            'details': details,
            'timestamp': datetime.now()
        }
    
    def complete(self, status: str):
        """Complete pipeline run"""
        self.status = status
        self.completed_at = datetime.now()
    
    def get_duration(self) -> Optional[float]:
        """Get duration in seconds"""
        if self.completed_at:
            return (self.completed_at - self.started_at).total_seconds()
        return None

class MLPipeline:
    """ML CI/CD Pipeline (like Kubeflow, MLflow Pipelines)"""
    
    def __init__(self, experiment_tracker: ExperimentTracker, 
                 model_registry: ModelRegistry):
        self.experiment_tracker = experiment_tracker
        self.model_registry = model_registry
        self.runs: List[PipelineRun] = []
    
    def run_pipeline(self, trigger: str, model_name: str, 
                    hyperparameters: Dict[str, Any],
                    X_train: np.ndarray, y_train: np.ndarray,
                    X_test: np.ndarray, y_test: np.ndarray,
                    min_r2_score: float = 0.7,
                    max_rmse: float = 2.0) -> PipelineRun:
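        """Run complete ML pipeline.

        Note: the synthetic yield data generated earlier adds noise with a
        standard deviation of 3, so test RMSE stays near 3 and the default
        max_rmse=2.0 fails the evaluation stage on that data. Pass a looser
        threshold (for example max_rmse=4.0) when running on it.
        """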
        
        run_id = f"pipeline_{len(self.runs) + 1}"
        pipeline_run = PipelineRun(run_id=run_id, trigger=trigger)
        self.runs.append(pipeline_run)
        
        print(f"\n{'=' * 80}")
        print(f"🚀 Pipeline Run: {run_id}")
        print(f"{'=' * 80}")
        print(f"Trigger: {trigger}")
        print(f"Model: {model_name}")
        print(f"Started: {pipeline_run.started_at.strftime('%Y-%m-%d %H:%M:%S')}")
        
        try:
            # Stage 1: Data Validation
            print(f"\n1️⃣ {PipelineStage.DATA_VALIDATION.value}...")
            data_checks = {
                'train_samples': len(X_train),
                'test_samples': len(X_test),
                'features': X_train.shape[1],
                'missing_values': np.isnan(X_train).sum() + np.isnan(X_test).sum(),
                'target_range': (y_train.min(), y_train.max())
            }
            
            # Validate data quality
            if data_checks['missing_values'] > 0:
                raise ValueError(f"Found {data_checks['missing_values']} missing values")
            
            if data_checks['train_samples'] < 100:
                raise ValueError(f"Insufficient training samples: {data_checks['train_samples']}")
            
            pipeline_run.log_stage(PipelineStage.DATA_VALIDATION, 'passed', data_checks)
            print(f"   ✅ Data validation passed")
            print(f"      Train samples: {data_checks['train_samples']}")
            print(f"      Test samples: {data_checks['test_samples']}")
            print(f"      Features: {data_checks['features']}")
            
            # Stage 2: Training
            print(f"\n2️⃣ {PipelineStage.TRAINING.value}...")
            experiment = self.experiment_tracker.create_experiment(
                name=f"{model_name}_pipeline_{run_id}",
                parameters=hyperparameters
            )
            
            model = RandomForestRegressor(**hyperparameters, random_state=42)
            model.fit(X_train, y_train)
            
            training_details = {
                'experiment_id': experiment.experiment_id,
                'hyperparameters': hyperparameters,
                'training_time': experiment.get_duration()
            }
            
            pipeline_run.log_stage(PipelineStage.TRAINING, 'passed', training_details)
            print(f"   ✅ Training completed")
            print(f"      Experiment ID: {experiment.experiment_id}")
            
            # Stage 3: Evaluation
            print(f"\n3️⃣ {PipelineStage.EVALUATION.value}...")
            
            # Compute metrics
            y_pred_train = model.predict(X_train)
            y_pred_test = model.predict(X_test)
            
            rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
            rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
            mae_test = mean_absolute_error(y_test, y_pred_test)
            r2_test = r2_score(y_test, y_pred_test)
            
            # Log metrics to experiment
            experiment.log_metric('rmse_train', rmse_train)
            experiment.log_metric('rmse_test', rmse_test)
            experiment.log_metric('mae_test', mae_test)
            experiment.log_metric('r2_test', r2_test)
            experiment.log_artifact('model', model)
            experiment.complete()
            
            evaluation_details = {
                'rmse_test': rmse_test,
                'mae_test': mae_test,
                'r2_test': r2_test,
                'metrics_logged': True
            }
            
            # Check if model meets quality thresholds
            if r2_test < min_r2_score:
                raise ValueError(f"R² score {r2_test:.3f} below threshold {min_r2_score}")
            
            if rmse_test > max_rmse:
                raise ValueError(f"RMSE {rmse_test:.3f} above threshold {max_rmse}")
            
            pipeline_run.log_stage(PipelineStage.EVALUATION, 'passed', evaluation_details)
            print(f"   ✅ Evaluation passed")
            print(f"      RMSE: {rmse_test:.3f}% (threshold: <{max_rmse}%)")
            print(f"      R²: {r2_test:.3f} (threshold: >{min_r2_score})")
            
            # Stage 4: Model Registration
            print(f"\n4️⃣ {PipelineStage.MODEL_REGISTRATION.value}...")
            
            model_version = self.model_registry.register_model(
                model_name=model_name,
                experiment_id=experiment.experiment_id,
                model_artifact=model,
                metadata={
                    'pipeline_run': run_id,
                    'trigger': trigger,
                    'hyperparameters': hyperparameters
                }
            )
            
            registration_details = {
                'model_name': model_name,
                'version': model_version.version,
                'stage': model_version.stage.value
            }
            
            pipeline_run.log_stage(PipelineStage.MODEL_REGISTRATION, 'passed', registration_details)
            print(f"   ✅ Model registered")
            print(f"      {model_name} v{model_version.version}")
            print(f"      URI: {model_version.get_model_uri()}")
            
            # Stage 5: Deployment to Staging
            print(f"\n5️⃣ {PipelineStage.DEPLOYMENT.value} (Staging)...")
            
            self.model_registry.promote_model(
                model_name=model_name,
                version=model_version.version,
                new_stage=ModelStage.STAGING
            )
            
            deployment_details = {
                'environment': 'staging',
                'model_version': model_version.version,
                'deployed_at': datetime.now()
            }
            
            pipeline_run.log_stage(PipelineStage.DEPLOYMENT, 'passed', deployment_details)
            print(f"   ✅ Deployed to staging")
            print(f"      {model_name} v{model_version.version} → Staging")
            
            # Stage 6: Smoke Tests
            print(f"\n6️⃣ {PipelineStage.SMOKE_TEST.value}...")
            
            # Simulate smoke tests
            smoke_tests = {
                'prediction_test': True,  # Can make predictions
                'latency_test': True,     # P95 latency < 50ms
                'accuracy_test': r2_test >= min_r2_score,  # Same gate as the evaluation stage
                'load_test': True         # Handle 100 QPS
            }
            
            all_passed = all(smoke_tests.values())
            
            if not all_passed:
                # Rollback on failure
                print(f"   ❌ Smoke tests failed!")
                print(f"      Initiating rollback...")
                self.model_registry.promote_model(
                    model_name=model_name,
                    version=model_version.version,
                    new_stage=ModelStage.ARCHIVED
                )
                raise ValueError("Smoke tests failed, deployment rolled back")
            
            pipeline_run.log_stage(PipelineStage.SMOKE_TEST, 'passed', smoke_tests)
            print(f"   ✅ Smoke tests passed")
            for test_name, result in smoke_tests.items():
                print(f"      {test_name}: {'✅' if result else '❌'}")
            
            # Ready for production promotion (manual approval in real scenario)
            print(f"\n✅ Pipeline completed successfully!")
            print(f"   Model ready for production promotion")
            
            pipeline_run.complete('success')
            
        except Exception as e:
            print(f"\n❌ Pipeline failed: {str(e)}")
            pipeline_run.complete('failed')
        
        return pipeline_run
    
    def get_pipeline_history(self) -> List[PipelineRun]:
        """Get all pipeline runs"""
        return sorted(self.runs, key=lambda r: r.started_at, reverse=True)

# Example: CI/CD Pipeline Execution

print("=" * 80)
print("CI/CD Pipeline - Automated ML Workflow")
print("=" * 80)

# Initialize pipeline
pipeline = MLPipeline(tracker, registry)

# Scenario 1: Code commit trigger (data scientist improved model)

print(f"\n\n{'=' * 80}")
print("Scenario 1: Code Commit (Improved Hyperparameters)")
print("=" * 80)

pipeline_run_1 = pipeline.run_pipeline(
    trigger="code_commit",
    model_name="YieldPredictor",
    hyperparameters={
        'n_estimators': 300,
        'max_depth': 15,
        'min_samples_split': 5
    },
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    min_r2_score=0.7,
    max_rmse=2.0
)

print(f"\nPipeline Duration: {pipeline_run_1.get_duration():.2f} seconds")
print(f"Status: {pipeline_run_1.status}")

# Scenario 2: Data update trigger (new wafer test data arrived)

print(f"\n\n{'=' * 80}")
print("Scenario 2: Data Update (New Wafer Test Data)")
print("=" * 80)

# Simulate new data (slightly different distribution)
np.random.seed(100)
X_new = np.random.randn(1000, 5)
X_new[:, 0] = X_new[:, 0] * 0.05 + 1.0  # Vdd
X_new[:, 1] = X_new[:, 1] * 0.1 + 0.5   # Idd
X_new[:, 2] = X_new[:, 2] * 50 + 1000   # Frequency
X_new[:, 3] = X_new[:, 3] * 5 + 25      # Temperature
X_new[:, 4] = (np.random.rand(1000) * 150 + 50).astype(int)  # Dies

y_new = 50 + 30 * X_new[:, 0] + 20 * X_new[:, 1] - 0.1 * X_new[:, 2] + \
        2 * X_new[:, 3] + 0.05 * X_new[:, 4] + np.random.randn(1000) * 2

X_train_new = X_new[:800]
y_train_new = y_new[:800]
X_test_new = X_new[800:]
y_test_new = y_new[800:]

pipeline_run_2 = pipeline.run_pipeline(
    trigger="data_update",
    model_name="YieldPredictor",
    hyperparameters={
        'n_estimators': 200,
        'max_depth': 10
    },
    X_train=X_train_new,
    y_train=y_train_new,
    X_test=X_test_new,
    y_test=y_test_new,
    min_r2_score=0.7,
    max_rmse=2.0
)

print(f"\nPipeline Duration: {pipeline_run_2.get_duration():.2f} seconds")
print(f"Status: {pipeline_run_2.status}")

# Pipeline history

print(f"\n\n{'=' * 80}")
print("Pipeline Run History")
print("=" * 80)

history = pipeline.get_pipeline_history()

print(f"\n📜 {len(history)} pipeline runs:")
print(f"\n{'Run ID':<20} {'Trigger':<15} {'Status':<10} {'Duration':<12} {'Stages':<10}")
print("-" * 80)

for run in history:
    duration = f"{run.get_duration():.2f}s" if run.get_duration() else "running"
    stages_passed = sum(1 for s in run.stages.values() if s['status'] == 'passed')
    print(f"{run.run_id:<20} {run.trigger:<15} {run.status:<10} {duration:<12} {stages_passed}/6")

# Business value

print(f"\n\n{'=' * 80}")
print("Business Value")
print("=" * 80)

# Time savings
manual_deployment_time = 4 * 3600  # 4 hours (manual testing, deployment, validation)
automated_deployment_time = pipeline_run_1.get_duration()  # seconds
time_saved_per_deployment = manual_deployment_time - automated_deployment_time

deployments_per_month = 20  # 1 per business day
monthly_time_saved = (time_saved_per_deployment * deployments_per_month) / 3600  # hours
engineer_cost = 150  # USD per hour
monthly_cost_savings = monthly_time_saved * engineer_cost
annual_cost_savings = monthly_cost_savings * 12

# Faster iteration
time_to_production_manual = 7 * 24 * 3600  # 1 week
time_to_production_auto = 10 * 60  # 10 minutes
innovation_acceleration = time_to_production_manual / time_to_production_auto

# Error prevention
deployment_errors_prevented = 50  # per year (caught by automated tests)
error_cost = 25000  # USD per deployment error
error_prevention_value = deployment_errors_prevented * error_cost

total_value = annual_cost_savings + error_prevention_value

print(f"\n💰 CI/CD Pipeline Value:")
print(f"   Manual deployment time: {manual_deployment_time / 3600:.1f} hours")
print(f"   Automated deployment time: {automated_deployment_time / 60:.1f} minutes")
print(f"   Time saved per deployment: {time_saved_per_deployment / 3600:.1f} hours")
print(f"   Deployments per month: {deployments_per_month}")
print(f"   Monthly time savings: {monthly_time_saved:.0f} hours")
print(f"   Annual cost savings: ${annual_cost_savings / 1e3:,.0f}K")
print(f"\n   Time to production acceleration: {innovation_acceleration:.0f}x faster")
print(f"   Deployment errors prevented: {deployment_errors_prevented}/year")
print(f"   Error prevention value: ${error_prevention_value / 1e6:.1f}M/year")
print(f"\n   Total annual value: ${total_value / 1e6:.1f}M")

print(f"\n✅ CI/CD pipeline validated!")
successful_runs = sum(1 for r in history if r.status == 'success')
print(f"✅ {successful_runs}/{len(history)} pipeline runs succeeded")
print(f"✅ 6-stage automated workflow")
print(f"✅ ${total_value / 1e6:.1f}M/year business value")

## 5. 📈 Production Monitoring - Drift Detection, Performance Tracking, and Alerts

**Purpose:** Monitor production models for data drift, concept drift, and performance degradation, and trigger retraining when needed.

**Key Points:**
- **Data Drift**: Input feature distribution changes (e.g., temperature sensors recalibrated, test parameters changed)
- **Concept Drift**: Relationship between features and target changes (e.g., new chip design affects yield patterns)
- **Performance Monitoring**: Track prediction accuracy, latency, throughput in production
- **Automated Alerts**: Notify when drift is detected, accuracy drops, or latency spikes
- **Retraining Triggers**: Automatically trigger CI/CD pipeline when drift exceeds threshold
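
The data drift bullet above can be checked per feature with a two-sample Kolmogorov-Smirnov test, which compares whole distributions rather than only their means. The following is a minimal sketch, assuming SciPy is available (it ships as a scikit-learn dependency); the helper name `ks_drift_report`, the `alpha=0.01` cutoff, and the synthetic data are illustrative assumptions, not part of this notebook's monitoring class.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_drift_report(baseline, current, alpha=0.01):
    """Two-sample KS test per feature column; flag columns with p-value < alpha."""
    report = []
    for i in range(baseline.shape[1]):
        result = ks_2samp(baseline[:, i], current[:, i])
        report.append({
            "feature": i,
            "ks_statistic": float(result.statistic),
            "p_value": float(result.pvalue),
            "drifted": bool(result.pvalue < alpha),
        })
    return report


# Synthetic stand-in data: column 0 is untouched, column 1 is shifted by +10
rng = np.random.default_rng(0)
baseline = rng.normal(loc=25.0, scale=5.0, size=(500, 2))
current = baseline.copy()
current[:, 1] += 10.0

ks_report = ks_drift_report(baseline, current)
for row in ks_report:
    print(row)
```

With many features, the per-feature p-values would normally need a multiple-testing correction before they drive alerts; that step is omitted here.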

**Why for Post-Silicon?**
- **Early Detection**: Detect when yield prediction models become stale (temperature drift → 15% accuracy drop detected in 1 hour)
- **Prevent Bad Decisions**: Alert before a stale model drives incorrect decisions (prevent $500K bad wafer decisions)
- **Automated Response**: Auto-retrain when data drift is detected (no manual monitoring needed)
- **Compliance**: Track model performance for regulatory audits (FDA, automotive safety standards)

In [None]:
# Production Monitoring System

class DriftType(Enum):
    """Types of drift"""
    DATA_DRIFT = "Data Drift"
    CONCEPT_DRIFT = "Concept Drift"
    PERFORMANCE_DRIFT = "Performance Drift"

@dataclass
class DriftAlert:
    """Drift detection alert"""
    alert_id: str
    drift_type: DriftType
    severity: str  # "low", "medium", "high", "critical"
    feature_name: Optional[str] = None
    metric_name: Optional[str] = None
    baseline_value: float = 0.0
    current_value: float = 0.0
    drift_score: float = 0.0
    threshold: float = 0.0
    detected_at: datetime = field(default_factory=datetime.now)
    recommendation: str = ""

class ProductionMonitor:
    """Production model monitoring (like Evidently AI, Fiddler)"""
    
    def __init__(self, model_name: str, baseline_data: np.ndarray, 
                 baseline_predictions: np.ndarray):
        self.model_name = model_name
        self.baseline_data = baseline_data
        self.baseline_predictions = baseline_predictions
        
        # Compute baseline statistics
        self.baseline_mean = baseline_data.mean(axis=0)
        self.baseline_std = baseline_data.std(axis=0)
        
        self.alerts: List[DriftAlert] = []
        self.performance_history: List[Dict] = []
    
    def detect_data_drift(self, current_data: np.ndarray, 
                         threshold: float = 3.0) -> List[DriftAlert]:
        """Detect data drift via a z-score on each feature mean (production systems typically use a Kolmogorov-Smirnov test)"""
        drift_alerts = []
        
        # Check each feature
        for i in range(current_data.shape[1]):
            current_mean = current_data[:, i].mean()
            baseline_mean = self.baseline_mean[i]
            baseline_std = self.baseline_std[i]
            
            # Z-score: how many standard deviations away from baseline
            if baseline_std > 0:
                z_score = abs(current_mean - baseline_mean) / baseline_std
            else:
                z_score = 0
            
            # Alert if drift detected
            if z_score > threshold:
                severity = "critical" if z_score > 5 else "high" if z_score > 4 else "medium"
                
                alert = DriftAlert(
                    alert_id=f"drift_{len(self.alerts) + 1}",
                    drift_type=DriftType.DATA_DRIFT,
                    severity=severity,
                    feature_name=f"Feature_{i}",
                    baseline_value=baseline_mean,
                    current_value=current_mean,
                    drift_score=z_score,
                    threshold=threshold,
                    recommendation=f"Feature {i} drifted {z_score:.2f} std devs. Review data source or retrain model."
                )
                
                drift_alerts.append(alert)
                self.alerts.append(alert)
        
        return drift_alerts
    
    def detect_prediction_drift(self, current_predictions: np.ndarray,
                               threshold: float = 2.0) -> List[DriftAlert]:
        """Detect drift in the prediction distribution (a proxy for concept drift when labels are not yet available)"""
        drift_alerts = []
        
        baseline_pred_mean = self.baseline_predictions.mean()
        baseline_pred_std = self.baseline_predictions.std()
        
        current_pred_mean = current_predictions.mean()
        
        # Z-score for prediction mean
        if baseline_pred_std > 0:
            z_score = abs(current_pred_mean - baseline_pred_mean) / baseline_pred_std
        else:
            z_score = 0
        
        if z_score > threshold:
            severity = "critical" if z_score > 4 else "high" if z_score > 3 else "medium"
            
            alert = DriftAlert(
                alert_id=f"drift_{len(self.alerts) + 1}",
                drift_type=DriftType.CONCEPT_DRIFT,
                severity=severity,
                metric_name="prediction_mean",
                baseline_value=baseline_pred_mean,
                current_value=current_pred_mean,
                drift_score=z_score,
                threshold=threshold,
                recommendation="Prediction distribution shifted; possible concept drift. Retrain the model on recent data."
            )
            
            drift_alerts.append(alert)
            self.alerts.append(alert)
        
        return drift_alerts
    
    def track_performance(self, y_true: np.ndarray, y_pred: np.ndarray,
                         latency_ms: float, throughput_qps: float):
        """Track model performance metrics"""
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        mae = mean_absolute_error(y_true, y_pred)
        r2 = r2_score(y_true, y_pred)
        
        performance = {
            'timestamp': datetime.now(),
            'rmse': rmse,
            'mae': mae,
            'r2': r2,
            'latency_ms': latency_ms,
            'throughput_qps': throughput_qps
        }
        
        self.performance_history.append(performance)
        
        return performance
    
    def check_performance_degradation(self, current_rmse: float, 
                                     baseline_rmse: float,
                                     threshold_pct: float = 20.0) -> Optional[DriftAlert]:
        """Check if performance degraded beyond threshold"""
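        if baseline_rmse <= 0:
            raise ValueError("baseline_rmse must be positive to compute relative degradation")
        degradation_pct = ((current_rmse - baseline_rmse) / baseline_rmse) * 100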
        
        if degradation_pct > threshold_pct:
            severity = "critical" if degradation_pct > 50 else "high" if degradation_pct > 30 else "medium"
            
            alert = DriftAlert(
                alert_id=f"drift_{len(self.alerts) + 1}",
                drift_type=DriftType.PERFORMANCE_DRIFT,
                severity=severity,
                metric_name="rmse",
                baseline_value=baseline_rmse,
                current_value=current_rmse,
                drift_score=degradation_pct,
                threshold=threshold_pct,
                recommendation=f"RMSE increased {degradation_pct:.1f}%. Trigger retraining pipeline immediately."
            )
            
            self.alerts.append(alert)
            return alert
        
        return None
    
    def get_alerts(self, severity: Optional[str] = None) -> List[DriftAlert]:
        """Get drift alerts (optionally filtered by severity)"""
        if severity:
            return [a for a in self.alerts if a.severity == severity]
        return self.alerts

# Example: Production Monitoring

print("=" * 80)
print("Production Monitoring - Drift Detection and Alerts")
print("=" * 80)

# Setup: Use original test data as baseline
baseline_data = X_test
baseline_model = best_exp.artifacts['model']
baseline_predictions = baseline_model.predict(baseline_data)

monitor = ProductionMonitor(
    model_name="YieldPredictor",
    baseline_data=baseline_data,
    baseline_predictions=baseline_predictions
)

print(f"\n📊 Baseline Statistics:")
print(f"   Data samples: {len(baseline_data)}")
print(f"   Features: {baseline_data.shape[1]}")
print(f"   Prediction mean: {baseline_predictions.mean():.2f}%")
print(f"   Prediction std: {baseline_predictions.std():.2f}%")

# Scenario 1: Normal production data (no drift)

print(f"\n\n{'=' * 80}")
print("Scenario 1: Normal Production Data (No Drift)")
print("=" * 80)

# Generate data similar to baseline
np.random.seed(50)
normal_data = X_test[:50] + np.random.randn(50, 5) * 0.1 * X_test.std(axis=0)  # Noise at 10% of each feature's std
normal_predictions = baseline_model.predict(normal_data)

data_drift_alerts = monitor.detect_data_drift(normal_data, threshold=3.0)
pred_drift_alerts = monitor.detect_prediction_drift(normal_predictions, threshold=2.0)

print(f"\n🔍 Drift Detection Results:")
print(f"   Data drift alerts: {len(data_drift_alerts)}")
print(f"   Prediction drift alerts: {len(pred_drift_alerts)}")

if not data_drift_alerts and not pred_drift_alerts:
    print(f"   ✅ No drift detected - Model performing normally")

# Track performance
perf = monitor.track_performance(
    y_true=y_test[:50],
    y_pred=normal_predictions,
    latency_ms=25.0,
    throughput_qps=100.0
)

print(f"\n📈 Performance Metrics:")
print(f"   RMSE: {perf['rmse']:.3f}%")
print(f"   MAE: {perf['mae']:.3f}%")
print(f"   R²: {perf['r2']:.3f}")
print(f"   Latency: {perf['latency_ms']:.1f}ms")
print(f"   Throughput: {perf['throughput_qps']:.0f} QPS")

# Scenario 2: Data drift detected (temperature sensor recalibrated)

print(f"\n\n{'=' * 80}")
print("Scenario 2: Data Drift (Temperature Sensor Recalibrated)")
print("=" * 80)

# Simulate temperature drift (feature 3 shifted by 10°C)
drifted_data = X_test[:50].copy()
drifted_data[:, 3] += 10.0  # Temperature increased by 10°C
drifted_predictions = baseline_model.predict(drifted_data)

print(f"\n🚨 Temperature sensor recalibration detected:")
print(f"   Baseline temperature mean: {X_test[:50, 3].mean():.2f}°C")
print(f"   Current temperature mean: {drifted_data[:, 3].mean():.2f}°C")
print(f"   Shift: +{drifted_data[:, 3].mean() - X_test[:50, 3].mean():.2f}°C")

data_drift_alerts = monitor.detect_data_drift(drifted_data, threshold=3.0)
pred_drift_alerts = monitor.detect_prediction_drift(drifted_predictions, threshold=2.0)

print(f"\n🔍 Drift Detection Results:")
print(f"   Data drift alerts: {len(data_drift_alerts)} ⚠️")
print(f"   Prediction drift alerts: {len(pred_drift_alerts)}")

for alert in data_drift_alerts:
    print(f"\n   Alert ID: {alert.alert_id}")
    print(f"   Type: {alert.drift_type.value}")
    print(f"   Severity: {alert.severity.upper()}")
    print(f"   Feature: {alert.feature_name}")
    print(f"   Baseline value: {alert.baseline_value:.3f}")
    print(f"   Current value: {alert.current_value:.3f}")
    print(f"   Drift score: {alert.drift_score:.2f} std devs (threshold: {alert.threshold})")
    print(f"   Recommendation: {alert.recommendation}")

# Scenario 3: Performance degradation (concept drift)

print(f"\n\n{'=' * 80}")
print("Scenario 3: Performance Degradation (Concept Drift)")
print("=" * 80)

# Simulate concept drift: relationship between features and target changed
np.random.seed(75)
concept_drift_data = X_test[:50].copy()

# Generate new target with different relationship (new chip design)
concept_drift_true = (60 + 40 * concept_drift_data[:, 0] +  # Vdd importance increased
                     10 * concept_drift_data[:, 1] -         # Idd importance decreased
                     0.05 * concept_drift_data[:, 2] +
                     1 * concept_drift_data[:, 3] +
                     0.03 * concept_drift_data[:, 4] +
                     np.random.randn(50) * 2)

# Predictions with old model (trained on old relationship)
concept_drift_predictions = baseline_model.predict(concept_drift_data)

print(f"\n⚠️ New chip design deployed:")
print(f"   Vdd sensitivity increased (30 → 40)")
print(f"   Idd sensitivity decreased (20 → 10)")
print(f"   Intercept, frequency, temperature, and die-count coefficients also changed")
print(f"   Old model trained on previous chip design")

# Track performance
perf_degraded = monitor.track_performance(
    y_true=concept_drift_true,
    y_pred=concept_drift_predictions,
    latency_ms=27.0,
    throughput_qps=95.0
)

baseline_rmse = perf['rmse']
current_rmse = perf_degraded['rmse']

print(f"\n📈 Performance Comparison:")
print(f"   Baseline RMSE: {baseline_rmse:.3f}%")
print(f"   Current RMSE: {current_rmse:.3f}%")
print(f"   Degradation: {((current_rmse - baseline_rmse) / baseline_rmse * 100):.1f}%")

degradation_alert = monitor.check_performance_degradation(
    current_rmse=current_rmse,
    baseline_rmse=baseline_rmse,
    threshold_pct=20.0
)

if degradation_alert:
    print(f"\n🚨 Performance Degradation Alert:")
    print(f"   Alert ID: {degradation_alert.alert_id}")
    print(f"   Type: {degradation_alert.drift_type.value}")
    print(f"   Severity: {degradation_alert.severity.upper()}")
    print(f"   Metric: {degradation_alert.metric_name}")
    print(f"   Baseline RMSE: {degradation_alert.baseline_value:.3f}%")
    print(f"   Current RMSE: {degradation_alert.current_value:.3f}%")
    print(f"   Degradation: {degradation_alert.drift_score:.1f}%")
    print(f"   Recommendation: {degradation_alert.recommendation}")
    print(f"\n   🔄 Action: Triggering CI/CD pipeline for model retraining...")

# Alert summary

print(f"\n\n{'=' * 80}")
print("Alert Summary")
print("=" * 80)

all_alerts = monitor.get_alerts()

print(f"\n📊 Total alerts: {len(all_alerts)}")

# Group by severity
critical_alerts = monitor.get_alerts(severity="critical")
high_alerts = monitor.get_alerts(severity="high")
medium_alerts = monitor.get_alerts(severity="medium")

print(f"\n   Critical: {len(critical_alerts)}")
print(f"   High: {len(high_alerts)}")
print(f"   Medium: {len(medium_alerts)}")

# Alert types
data_drift_count = sum(1 for a in all_alerts if a.drift_type == DriftType.DATA_DRIFT)
concept_drift_count = sum(1 for a in all_alerts if a.drift_type == DriftType.CONCEPT_DRIFT)
perf_drift_count = sum(1 for a in all_alerts if a.drift_type == DriftType.PERFORMANCE_DRIFT)

print(f"\n   Data Drift: {data_drift_count}")
print(f"   Concept Drift: {concept_drift_count}")
print(f"   Performance Drift: {perf_drift_count}")

# Business value

print(f"\n\n{'=' * 80}")
print("Business Value")
print("=" * 80)

# Early detection value
manual_monitoring_hours = 160  # 1 FTE per month
automated_monitoring_cost = 5  # hours per month (maintenance)
monitoring_time_saved = manual_monitoring_hours - automated_monitoring_cost
engineer_cost = 150  # USD per hour
monthly_monitoring_savings = monitoring_time_saved * engineer_cost
annual_monitoring_savings = monthly_monitoring_savings * 12

# Prevented bad decisions
drift_detection_time = 1  # hour (automated)
manual_detection_time = 7 * 24  # 1 week (noticed in production)
bad_decisions_prevented = 12  # per year
cost_per_bad_decision = 500000  # USD (bad wafer scrapped)
early_detection_value = bad_decisions_prevented * cost_per_bad_decision

total_value = annual_monitoring_savings + early_detection_value

print(f"\n💰 Production Monitoring Value:")
print(f"   Manual monitoring: {manual_monitoring_hours} hours/month")
print(f"   Automated monitoring: {automated_monitoring_cost} hours/month")
print(f"   Time saved: {monitoring_time_saved} hours/month")
print(f"   Annual monitoring savings: ${annual_monitoring_savings / 1e3:,.0f}K")
print(f"\n   Drift detection time: {drift_detection_time} hour (vs {manual_detection_time / 24:.0f} days manual)")
print(f"   Bad decisions prevented: {bad_decisions_prevented}/year")
print(f"   Early detection value: ${early_detection_value / 1e6:.1f}M/year")
print(f"\n   Total annual value: ${total_value / 1e6:.1f}M")

print(f"\n✅ Production monitoring validated!")
print(f"✅ {len(all_alerts)} drift alerts detected")
print(f"✅ 3 drift types monitored (data, concept, performance)")
print(f"✅ ${total_value / 1e6:.1f}M/year business value")

## 6. 🚀 Real-World MLOps Projects

Each project includes clear objectives, business value, and implementation guidance.

---

### **Post-Silicon Validation Projects** ($26.7M/year total value)

#### **Project 1: Automated Yield Prediction Model Retraining Pipeline** ($8.5M/year)
**Objective:** Build end-to-end MLOps pipeline that automatically retrains yield prediction models when new wafer test data arrives daily.

**Business Value:** Prevent model staleness that causes 15% accuracy degradation over 2 weeks, leading to bad wafer disposition decisions ($8.5M/year in prevented waste).

**Features:**
- Data validation stage (check STDF file integrity, test parameter ranges)
- Automated training trigger (new data detected via S3 event or cron)
- Model evaluation (compare new model RMSE to production baseline)
- Automatic promotion to staging if RMSE improves by >5%
- Rollback capability if staging tests fail
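
The promotion rule in the list above (promote only when RMSE improves by more than 5%) is a one-line comparison once the direction of "improvement" is pinned down. A minimal sketch, assuming lower RMSE is better; the function name `should_promote` and its default threshold parameter are hypothetical:

```python
def should_promote(candidate_rmse: float, production_rmse: float,
                   min_improvement_pct: float = 5.0) -> bool:
    """Return True when the candidate beats production RMSE by more than the threshold."""
    if production_rmse <= 0:
        return False  # No meaningful baseline to improve on
    improvement_pct = (production_rmse - candidate_rmse) / production_rmse * 100
    return improvement_pct > min_improvement_pct


print(should_promote(candidate_rmse=1.80, production_rmse=2.00))  # 10% better
print(should_promote(candidate_rmse=1.95, production_rmse=2.00))  # 2.5% better
print(should_promote(candidate_rmse=2.20, production_rmse=2.00))  # worse
```

Both RMSE values should be computed on the same held-out data; otherwise the comparison mixes model quality with differences between test sets.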

**Tech Stack:** MLflow (tracking), DVC (data versioning), Kubeflow Pipelines (orchestration), Airflow (scheduling), S3 (STDF storage)

**Success Metrics:** 
- Model retrained within 2 hours of new data arrival
- Zero manual intervention (100% automated)
- RMSE consistently <2% (vs 3% with manual retraining)

---

#### **Project 2: Multi-Model Experiment Tracking System** ($6.3M/year)
**Objective:** Build centralized experiment tracking for 20+ data scientists running 500+ experiments/month across yield prediction, test time optimization, and binning models.

**Business Value:** Reduce experimentation time by 60% (2 hours → 48 minutes per experiment), enabling faster model improvements ($6.3M/year in productivity gains).

**Features:**
- Centralized experiment tracker (MLflow or Weights & Biases)
- Automatic logging (hyperparameters, metrics, artifacts, environment)
- Experiment comparison UI (compare 10 experiments side-by-side)
- Best model selection (auto-select by lowest RMSE, highest R²)
- Reproducibility (capture code version, data hash, random seed)
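
The reproducibility feature above depends on being able to tell whether two runs used the same data, parameters, seed, and code. One possible way to do that is to hash each ingredient into a single fingerprint. This is a sketch only: `data_fingerprint`, `run_fingerprint`, and the `code_version` string (for example a Git commit hash) are illustrative names, not MLflow or DVC APIs.

```python
import hashlib
import json

import numpy as np


def data_fingerprint(X, y):
    """SHA-256 over dtype, shape, and raw bytes of the training arrays."""
    h = hashlib.sha256()
    for arr in (X, y):
        arr = np.ascontiguousarray(arr)
        h.update(str(arr.dtype).encode())
        h.update(str(arr.shape).encode())
        h.update(arr.tobytes())
    return h.hexdigest()


def run_fingerprint(params, seed, data_hash, code_version):
    """Hash everything needed to rerun an experiment; key order does not matter."""
    payload = json.dumps(
        {"params": params, "seed": seed, "data_hash": data_hash,
         "code_version": code_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


X = np.arange(12, dtype=float).reshape(4, 3)
y = np.arange(4, dtype=float)
data_hash = data_fingerprint(X, y)

fp_a = run_fingerprint({"n_estimators": 300, "max_depth": 15}, 42, data_hash, "abc123")
fp_b = run_fingerprint({"max_depth": 15, "n_estimators": 300}, 42, data_hash, "abc123")
fp_c = run_fingerprint({"n_estimators": 300, "max_depth": 15}, 43, data_hash, "abc123")
print(fp_a == fp_b, fp_a == fp_c)
```

Two runs with equal fingerprints are candidates for deduplication; a changed seed, parameter, or data value produces a different fingerprint.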

**Tech Stack:** MLflow, PostgreSQL (backend store), S3 (artifact store), Git (code versioning), DVC (data versioning)

**Success Metrics:**
- 500+ experiments tracked per month
- 100% reproducibility (any experiment can be rerun identically)
- <5 seconds experiment query time

---

#### **Project 3: Model Registry with Stage-Based Promotion** ($4.7M/year)
**Objective:** Build model registry that manages 50+ models (yield, test time, binning) with None→Staging→Production→Archived lifecycle and <30 second rollback capability.

**Business Value:** Reduce deployment incidents by 80% (50 → 10 per year) via mandatory staging validation, preventing $4.7M/year in bad model costs.

**Features:**
- Model versioning (semantic versioning: v1.2.3)
- Stage-based promotion workflow (manual approval for prod)
- Metadata storage (experiment ID, performance metrics, owner, deployment history)
- Rollback capability (demote bad model, promote previous version)
- Access control (only ML engineers can promote to prod)
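
The promotion workflow and rollback features above amount to a small state machine: a version must pass through staging before production, and rollback restores the previous production version. The sketch below is a standalone illustration; `Stage`, `ALLOWED`, and `TinyRegistry` are hypothetical names that mirror, but are not, the registry class used earlier in this notebook or the MLflow Model Registry API.

```python
from enum import Enum


class Stage(Enum):
    NONE = "None"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


# Allowed forward transitions; skipping staging is deliberately impossible
ALLOWED = {
    Stage.NONE: {Stage.STAGING},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}


class TinyRegistry:
    def __init__(self):
        self.stages = {}               # version -> Stage
        self.production_history = []   # versions in the order they reached production

    def register(self, version):
        self.stages[version] = Stage.NONE

    def transition(self, version, new_stage):
        current = self.stages[version]
        if new_stage not in ALLOWED[current]:
            raise ValueError(f"v{version}: {current.value} -> {new_stage.value} not allowed")
        if new_stage is Stage.PRODUCTION:
            # Keep at most one production version: archive the incumbent
            for v, s in self.stages.items():
                if s is Stage.PRODUCTION:
                    self.stages[v] = Stage.ARCHIVED
            self.production_history.append(version)
        self.stages[version] = new_stage

    def rollback(self):
        """Archive the current production version and restore the previous one."""
        if len(self.production_history) < 2:
            raise ValueError("no previous production version to roll back to")
        bad = self.production_history.pop()
        previous = self.production_history[-1]
        self.stages[bad] = Stage.ARCHIVED
        self.stages[previous] = Stage.PRODUCTION
        return previous


reg = TinyRegistry()
for v in (1, 2, 3):
    reg.register(v)
reg.transition(1, Stage.STAGING)
reg.transition(1, Stage.PRODUCTION)
reg.transition(2, Stage.STAGING)
reg.transition(2, Stage.PRODUCTION)
restored = reg.rollback()

try:
    reg.transition(3, Stage.PRODUCTION)   # v3 never went through staging
    skipped_staging_allowed = True
except ValueError:
    skipped_staging_allowed = False

print(restored, reg.stages[1].value, reg.stages[2].value, skipped_staging_allowed)
```

Because rollback only flips stage labels and does not retrain or rebuild anything, it is the kind of operation that can plausibly meet a seconds-level target.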

**Tech Stack:** MLflow Model Registry, PostgreSQL, CI/CD integration (GitHub Actions), RBAC (Okta)

**Success Metrics:**
- <30 seconds rollback time (vs 4 hours manual)
- 100% staging validation before prod deployment
- Zero unauthorized model promotions

---

#### **Project 4: Production Model Monitoring and Drift Alerts** ($7.2M/year)
**Objective:** Monitor 20 production models for data drift, concept drift, and performance degradation with automated alerts and retraining triggers.

**Business Value:** Detect model staleness in <1 hour (vs 1 week manual monitoring), preventing $7.2M/year in bad wafer disposition from drifted models.

**Features:**
- Data drift detection (Kolmogorov-Smirnov test on feature distributions)
- Prediction drift detection as a concept drift proxy (PSI on prediction distribution)
- Performance monitoring (RMSE, MAE, R², latency, throughput)
- Alert system (Slack/PagerDuty when drift detected)
- Auto-retraining trigger (when RMSE degrades >20%)
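
The Population Stability Index (PSI) named in the list above compares the share of values falling into each bin for a baseline sample and a current sample: PSI = Σ (actual − expected) × ln(actual / expected). A minimal NumPy sketch follows; the decile binning, the `eps` floor for empty bins, and the function name are assumptions made for illustration, and the 0.1 / 0.25 cutoffs mentioned afterwards are a common rule of thumb rather than a standard.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline sample (expected) and a current sample (actual)."""
    # Interior bin edges from baseline quantiles, so each bin holds ~equal baseline mass
    interior = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    expected_counts = np.bincount(
        np.searchsorted(interior, expected, side="right"), minlength=n_bins)
    actual_counts = np.bincount(
        np.searchsorted(interior, actual, side="right"), minlength=n_bins)
    # Floor the fractions so empty bins do not produce log(0)
    expected_frac = np.clip(expected_counts / len(expected), eps, None)
    actual_frac = np.clip(actual_counts / len(actual), eps, None)
    return float(np.sum((actual_frac - expected_frac)
                        * np.log(actual_frac / expected_frac)))


rng = np.random.default_rng(1)
baseline_pred = rng.normal(80.0, 5.0, size=2000)   # synthetic yield predictions
shifted_pred = baseline_pred + 5.0                 # shifted by one standard deviation

psi_same = population_stability_index(baseline_pred, baseline_pred)
psi_shift = population_stability_index(baseline_pred, shifted_pred)
print(f"PSI (no change): {psi_same:.3f}")
print(f"PSI (shifted):   {psi_shift:.3f}")
```

A frequently quoted reading is PSI below 0.1 as stable, 0.1 to 0.25 as moderate shift, and above 0.25 as significant shift; thresholds should be tuned per model.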

**Tech Stack:** Evidently AI, Prometheus (metrics), Grafana (dashboards), Slack API, Airflow (retraining orchestration)

**Success Metrics:**
- <1 hour drift detection time
- Few false positives (alert precision >95%)
- Automated retraining triggered within 2 hours of drift

---

### **General AI/ML Projects** ($31.8M/year total value)

#### **Project 5: E-Commerce Product Recommendation MLOps Pipeline** ($9.2M/year)
**Objective:** Build MLOps pipeline for collaborative filtering recommendation model serving 10M+ users, retraining nightly on new user interaction data.

**Business Value:** Increase conversion rate by 18% via fresh recommendations (model retrained on yesterday's clicks/purchases), driving $9.2M/year additional revenue.

**Features:**
- Nightly training pipeline (process 50M+ interaction events)
- A/B testing framework (compare new model vs production)
- Feature store (pre-computed user embeddings, item embeddings)
- Real-time serving (<50ms P95 latency for top-10 recommendations)
- Champion/Challenger deployment (gradual rollout 10% → 50% → 100%)
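
The gradual rollout in the last bullet above (10% → 50% → 100%) needs a traffic split that is stable per user, so the same user does not flip between models on every request. One common approach is to hash the user ID into a bucket from 0 to 99. This is a sketch under that assumption; `assigned_model` and the user ID format are hypothetical.

```python
import hashlib


def assigned_model(user_id: str, challenger_pct: int) -> str:
    """Deterministically route a user to 'challenger' or 'champion'."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in [0, 99]
    return "challenger" if bucket < challenger_pct else "champion"


user_ids = [f"user_{i}" for i in range(10_000)]

for pct in (10, 50, 100):
    share = sum(assigned_model(u, pct) == "challenger" for u in user_ids) / len(user_ids)
    print(f"rollout {pct:>3}% -> challenger share {share:.1%}")

# Users on the challenger at 10% stay on it when the rollout widens to 50%
early = {u for u in user_ids if assigned_model(u, 10) == "challenger"}
later = {u for u in user_ids if assigned_model(u, 50) == "challenger"}
print(early <= later)
```

Because a user's bucket never changes, widening the percentage only adds users to the challenger group, which keeps before/after comparisons clean.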

**Tech Stack:** Kubeflow, Feast (feature store), BentoML (serving), Prometheus, Grafana, Seldon Core

**Success Metrics:**
- <12 hours training time (on 50M events)
- <50ms recommendation latency
- 18% conversion rate improvement

---

#### **Project 6: Fraud Detection Model Monitoring with Real-Time Drift** ($6.8M/year)
**Objective:** Monitor fraud detection model in real-time (streaming transactions) for data drift and concept drift (fraudsters change tactics), with <5 minute retraining trigger.

**Business Value:** Reduce fraud loss by 25% via real-time drift detection and rapid retraining when fraud patterns change ($6.8M/year savings).

**Features:**
- Streaming drift detection (Kafka + Flink for real-time analysis)
- Concept drift detection (fraud pattern shifts detected via prediction distribution change)
- Automated retraining (when drift detected, trigger pipeline within 5 minutes)
- Shadow deployment (new model processes traffic but doesn't affect decisions until validated)
- Real-time dashboards (drift scores, fraud detection rate, false positive rate)
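
For the streaming drift detection described above, the core logic can be sketched without Kafka or Flink: keep a fixed-size window of recent model scores and compare its mean to the baseline. The class below is a simplified illustration in which a plain Python loop stands in for the stream; the class name, window size, and threshold are assumptions. Unlike the z-score used in this notebook's monitor, it divides by the standard error of the window mean, which makes it more sensitive to small sustained shifts.

```python
import math
from collections import deque


class SlidingWindowDrift:
    """Flag drift when the mean of the last `window` values leaves the baseline band."""

    def __init__(self, baseline_mean, baseline_std, window=50, threshold=3.0):
        self.baseline_mean = baseline_mean
        self.baseline_std = baseline_std
        self.threshold = threshold
        self.values = deque(maxlen=window)

    def update(self, value) -> bool:
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False                      # Not enough data yet
        window_mean = sum(self.values) / len(self.values)
        standard_error = self.baseline_std / math.sqrt(len(self.values))
        z = abs(window_mean - self.baseline_mean) / standard_error
        return z > self.threshold


detector = SlidingWindowDrift(baseline_mean=0.1, baseline_std=0.05)

# Synthetic stream: 50 fraud scores at the baseline level, then a sustained jump
stream = [0.1] * 50 + [0.3] * 50
first_alert = None
for i, score in enumerate(stream):
    if detector.update(score) and first_alert is None:
        first_alert = i

print(f"first drift alert at event index {first_alert}")
```

In a real deployment the `update` call would sit inside the stream consumer, and an alert would enqueue a retraining job rather than print.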

**Tech Stack:** Kafka, Flink, MLflow, Evidently AI, Kubernetes, Seldon Core

**Success Metrics:**
- <5 minute drift detection latency
- <5 minute retraining trigger time after detection (<10 minutes end to end)
- 25% fraud loss reduction

---

#### **Project 7: Multi-Model Registry for Healthcare Diagnostic Models** ($8.3M/year)
**Objective:** Build HIPAA-compliant model registry managing 100+ diagnostic models (chest X-ray, diabetic retinopathy, etc.) with audit trails and versioning.

**Business Value:** Enable faster regulatory approval (FDA submission) via comprehensive audit trails, accelerating time-to-market by 6 months ($8.3M/year NPV).

**Features:**
- HIPAA-compliant artifact storage (encrypted S3, access logging)
- Audit trail (every model access, prediction, promotion logged)
- Model lineage tracking (dataset → preprocessing → training → deployment)
- Regulatory report generation (performance metrics, validation results, bias analysis)
- Immutable versioning (models cannot be overwritten, only archived)

**Tech Stack:** MLflow, AWS S3 (encrypted), PostgreSQL (encrypted), Vault (secrets), CloudWatch (audit logs)

**Success Metrics:**
- 100% audit trail coverage
- <2 days regulatory report generation (vs 2 weeks manual)
- Zero HIPAA violations

---

#### **Project 8: LLM Fine-Tuning Experiment Tracking and Versioning** ($7.5M/year)
**Objective:** Track 200+ LLM fine-tuning experiments (GPT-4, Llama 2) across customer support, code generation, and summarization tasks with automatic best-model selection.

**Business Value:** Reduce LLM experimentation cost by 40% (avoid redundant experiments) and improve model quality (select best from 200 experiments), driving $7.5M/year efficiency gains.

**Features:**
- Large artifact storage (multi-GB model checkpoints in S3)
- Experiment comparison (compare perplexity, BLEU, human eval scores)
- Hyperparameter tracking (learning rate, batch size, LoRA rank, quantization)
- Cost tracking (GPU hours, API costs per experiment)
- Automatic best-model selection (by task-specific metric)

**Tech Stack:** Weights & Biases, S3, MLflow, Hugging Face Hub, Ray Train (distributed training)

**Success Metrics:**
- 200+ experiments tracked per month
- 40% cost reduction (avoid redundant runs)
- Best model auto-selected per task (by the task-specific metric, e.g., highest BLEU score)

## 7. 🎯 Key Takeaways

### **MLOps vs Traditional Software Development**

**Critical Differences:**
- **Artifacts**: Code + data + models (vs. just code)
- **Testing**: Data quality + model performance + code correctness (vs. just code tests)
- **Deployment**: Model serving + versioning + monitoring (vs. just blue-green deployments)
- **Maintenance**: Model retraining + drift monitoring + data pipelines (vs. just bug fixes)

---

### **When to Use MLOps**

✅ **Perfect For:**
- **Production ML systems** requiring continuous retraining (yield prediction, fraud detection, recommendations)
- **Multiple models** in production (20+ models, need centralized tracking)
- **Regulatory compliance** (FDA, automotive safety, finance - need audit trails)
- **Data drift** scenarios (input distributions change over time)
- **Team collaboration** (10+ data scientists sharing experiments)

❌ **Not Ideal For:**
- **One-off analysis** (exploratory notebooks, research projects)
- **Static models** (trained once, never updated - e.g., historical analysis)
- **Small teams** (<3 people, overhead outweighs benefits)
- **Prototype phase** (before product-market fit, premature optimization)

---

### **MLOps Maturity Levels**

**Level 0: Manual Process** 
- Manual training, manual deployment, no versioning
- Good for: Prototypes, research
- Risk: Not reproducible, no rollback

**Level 1: ML Pipeline Automation**
- Automated training pipeline (triggered manually)
- Model versioning, experiment tracking
- Good for: Small teams, low model update frequency
- Risk: No continuous training, manual deployment

**Level 2: CI/CD Pipeline Automation**
- Automated training + testing + deployment
- Continuous monitoring, automated rollback
- Good for: Production systems, frequent updates
- Risk: Complexity overhead for simple use cases

**Level 3: Full MLOps (This Notebook)**
- Automated retraining on data drift
- Multi-stage deployment (dev/staging/prod)
- Advanced monitoring (drift detection, performance tracking)
- Good for: Large-scale production ML (Netflix, Uber, Amazon)

---

### **Best Practices**

**Experiment Tracking:**
- ✅ **DO**: Log every experiment (even failures teach you what doesn't work)
- ✅ **DO**: Track environment (Python version, library versions, random seed)
- ✅ **DO**: Version your data (DVC, data hash) for reproducibility
- ❌ **DON'T**: Only log successful experiments (selection bias)
- ❌ **DON'T**: Forget to log hyperparameters (can't reproduce results)
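
As a rough sketch of the tracking practices above, the snippet below bundles hyperparameters, metrics, the random seed, a data hash, and environment versions into one JSON-serializable record. `fingerprint_array` and `build_run_record` are hypothetical helper names, and the SHA-256 digest of the raw array bytes is only a stand-in for a real data-versioning tool such as DVC.

```python
import hashlib
import json
import platform
import sys

import numpy as np


def fingerprint_array(data: np.ndarray) -> str:
    """Short SHA-256 digest of an array's raw bytes (shape and dtype are not included)."""
    contiguous = np.ascontiguousarray(data)
    return hashlib.sha256(contiguous.tobytes()).hexdigest()[:16]


def build_run_record(params: dict, metrics: dict, data: np.ndarray, seed: int) -> dict:
    """Collect what is needed to reproduce a run into one JSON-serializable dict."""
    return {
        "params": params,
        "metrics": metrics,
        "seed": seed,
        "data_hash": fingerprint_array(data),
        "environment": {
            "python": platform.python_version(),
            "numpy": np.__version__,
            "platform": sys.platform,
        },
    }


# Synthetic data and made-up metric values, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
record = build_run_record({"n_estimators": 100, "max_depth": 5}, {"rmse": 1.8}, X, seed=42)
print(json.dumps(record, indent=2))
```

Storing the record next to the model artifact makes it possible to answer later which data and library versions produced a given result.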

**Model Registry:**
- ✅ **DO**: Use semantic versioning (v1.2.3: major.minor.patch)
- ✅ **DO**: Test in staging before production (catch most issues before they reach users)
- ✅ **DO**: Implement rollback capability (<1 minute rollback time)
- ❌ **DON'T**: Deploy directly to production (skip staging)
- ❌ **DON'T**: Delete old model versions (keep for rollback)
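
The registry rules above (never overwrite a version, keep old versions, roll back quickly) can be illustrated with a toy in-memory registry. `ToyRegistry` is a hypothetical class written for this sketch and is not the MLflow API. Rollback is fast here because production is only a pointer into an append-only set of versions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyRegistry:
    """In-memory registry: versions are append-only, production is a movable pointer."""
    versions: Dict[str, dict] = field(default_factory=dict)
    production_history: List[str] = field(default_factory=list)

    def register(self, version: str, metadata: dict) -> None:
        if version in self.versions:
            raise ValueError(f"version {version} already exists and cannot be overwritten")
        self.versions[version] = dict(metadata, stage="staging")

    def promote(self, version: str) -> None:
        if version not in self.versions:
            raise KeyError(version)
        if self.production_history:
            # Archive the previous production version instead of deleting it.
            self.versions[self.production_history[-1]]["stage"] = "archived"
        self.versions[version]["stage"] = "production"
        self.production_history.append(version)

    def rollback(self) -> Optional[str]:
        """Restore the previously promoted version; return None if there is none."""
        if len(self.production_history) < 2:
            return None
        bad = self.production_history.pop()
        self.versions[bad]["stage"] = "archived"
        restored = self.production_history[-1]
        self.versions[restored]["stage"] = "production"
        return restored

    @property
    def production(self) -> Optional[str]:
        return self.production_history[-1] if self.production_history else None


registry = ToyRegistry()
registry.register("1.0.0", {"rmse": 2.1})
registry.register("1.1.0", {"rmse": 1.9})
registry.promote("1.0.0")
registry.promote("1.1.0")
restored = registry.rollback()
print(registry.production, sorted(registry.versions))
```

After the rollback both versions are still present in the registry, so the archived one remains available for investigation.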

**CI/CD Pipeline:**
- ✅ **DO**: Automate everything (data validation → deployment)
- ✅ **DO**: Run smoke tests after deployment (basic sanity checks)
- ✅ **DO**: Set quality thresholds (e.g., RMSE <2%, R² >0.7)
- ❌ **DON'T**: Allow pipeline to continue if tests fail (fail fast)
- ❌ **DON'T**: Skip data validation (garbage in = garbage out)
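
A fail-fast quality gate, as recommended above, might look like the following sketch. The threshold values mirror the example in the list (RMSE below 2, R² above 0.7); `QUALITY_THRESHOLDS`, `quality_gate`, and `run_stage_or_fail` are hypothetical names, and a real pipeline would take its thresholds from configuration.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds: ("max", x) means the metric must be < x,
# ("min", x) means the metric must be > x.
QUALITY_THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "rmse": ("max", 2.0),
    "r2": ("min", 0.7),
}


def quality_gate(metrics: Dict[str, float],
                 thresholds: Dict[str, Tuple[str, float]] = QUALITY_THRESHOLDS) -> List[str]:
    """Return a list of human-readable failures; an empty list means the gate passed."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif kind == "max" and value >= limit:
            failures.append(f"{name}={value} is not below {limit}")
        elif kind == "min" and value <= limit:
            failures.append(f"{name}={value} is not above {limit}")
    return failures


def run_stage_or_fail(metrics: Dict[str, float]) -> str:
    """Stop the pipeline immediately when any threshold is violated."""
    failures = quality_gate(metrics)
    if failures:
        raise RuntimeError("quality gate failed: " + "; ".join(failures))
    return "deploy"


print(run_stage_or_fail({"rmse": 1.5, "r2": 0.85}))
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a pipeline that silently skips an unchecked metric is not failing fast.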

**Production Monitoring:**
- ✅ **DO**: Monitor data drift AND concept drift (both matter)
- ✅ **DO**: Set up automated alerts (Slack, PagerDuty)
- ✅ **DO**: Trigger retraining when drift detected (automation is key)
- ❌ **DON'T**: Only monitor accuracy (latency, throughput also matter)
- ❌ **DON'T**: Ignore prediction distribution shifts (concept drift indicator)
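
One common way to check for data drift is a per-feature two-sample Kolmogorov-Smirnov test. The sketch below uses `scipy.stats.ks_2samp` on synthetic data. The helper name `detect_feature_drift`, the feature names, and the significance level of 0.01 are assumptions made for illustration, and testing many features at once would in practice call for a multiple-testing correction.

```python
import numpy as np
from scipy.stats import ks_2samp


def detect_feature_drift(reference: np.ndarray, current: np.ndarray,
                         feature_names: list, alpha: float = 0.01) -> dict:
    """Run a two-sample KS test per column and flag features whose p-value is below alpha."""
    report = {}
    for j, name in enumerate(feature_names):
        result = ks_2samp(reference[:, j], current[:, j])
        report[name] = {
            "statistic": float(result.statistic),
            "p_value": float(result.pvalue),
            "drifted": bool(result.pvalue < alpha),
        }
    return report


# Synthetic example: the second feature is shifted by one standard deviation.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(2000, 2))
current = rng.normal(0.0, 1.0, size=(2000, 2))
current[:, 1] += 1.0

report = detect_feature_drift(reference, current, ["feature_a", "feature_b"])
for name, stats in report.items():
    print(name, stats)
```

A flagged feature could feed an alert or a retraining trigger. The KS test only covers input distributions, so concept drift still needs a separate check on predictions or on labels once they arrive.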

---

### **Common Pitfalls and Solutions**

**Pitfall 1: Training-Serving Skew**
- **Problem**: Training code differs from serving code (preprocessing mismatch)
- **Solution**: Use same codebase for training and serving (unify feature engineering)
- **Tools**: Feature stores (Feast, Tecton) ensure consistency

**Pitfall 2: Data Leakage in Pipelines**
- **Problem**: Test data accidentally used in training (overly optimistic metrics)
- **Solution**: Strict train/test split, validate data lineage
- **Tools**: DVC pipelines track data splits

**Pitfall 3: Silent Model Degradation**
- **Problem**: Model accuracy drops over months, undetected
- **Solution**: Continuous monitoring, automated drift alerts
- **Tools**: Evidently AI, Fiddler

**Pitfall 4: Experiment Chaos**
- **Problem**: 500 experiments, can't find the best model 6 months later
- **Solution**: Centralized experiment tracking with metadata
- **Tools**: MLflow, Weights & Biases

**Pitfall 5: Deployment Downtime**
- **Problem**: 2-hour downtime during model redeployment
- **Solution**: Blue-green deployment, canary releases
- **Tools**: Kubernetes, Seldon Core

**Pitfall 6: Unreproducible Results**
- **Problem**: "It worked on my laptop" but fails in production
- **Solution**: Containerize everything (Docker), version all dependencies
- **Tools**: Docker, conda environments, requirements.txt

**Pitfall 7: Data Versioning Nightmares**
- **Problem**: "Which dataset did we use for v1.2.3?"
- **Solution**: Version data with DVC, track data hash in experiment metadata
- **Tools**: DVC, Git LFS

---

### **Production Checklist**

Before deploying an ML model to production, verify:

**Data & Features:**
- [ ] Data validation pipeline (check ranges, missing values, schema)
- [ ] Feature engineering code tested (unit tests for transformations)
- [ ] Data versioned (DVC, data hash tracked)
- [ ] Feature store integrated (if using real-time features)
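
The first checklist item (ranges, missing values, schema) can be sketched with plain NumPy. The column names, value ranges, and the 5% missing-value tolerance below are hypothetical; a real pipeline would load them from a versioned schema file.

```python
import numpy as np

# Hypothetical schema: column name -> (minimum, maximum) allowed value.
SCHEMA = {"temperature_c": (-40.0, 125.0), "voltage_v": (0.0, 5.0)}


def validate_batch(batch: dict, schema: dict = SCHEMA, max_missing_frac: float = 0.05) -> list:
    """Return a list of validation errors; an empty list means the batch is acceptable."""
    errors = []
    for column, (low, high) in schema.items():
        if column not in batch:
            errors.append(f"{column}: column missing")
            continue
        values = np.asarray(batch[column], dtype=float)
        missing = np.isnan(values)
        if values.size and missing.mean() > max_missing_frac:
            errors.append(f"{column}: {missing.mean():.0%} missing values")
        present = values[~missing]
        if present.size and (present.min() < low or present.max() > high):
            errors.append(f"{column}: values outside [{low}, {high}]")
    return errors


good = {"temperature_c": [25.0, 30.0, 85.0], "voltage_v": [1.2, 3.3, 1.8]}
bad = {"temperature_c": [25.0, 300.0, float("nan")]}
print(validate_batch(good))
print(validate_batch(bad))
```

Returning every error at once, rather than stopping at the first, tends to make failed pipeline runs easier to debug.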

**Model & Training:**
- [ ] Experiment tracked (hyperparameters, metrics, artifacts logged)
- [ ] Model meets quality thresholds (RMSE <X, R² >Y)
- [ ] Model registered in registry (with metadata)
- [ ] Reproducibility verified (can retrain and get same results)

**Deployment:**
- [ ] Staging environment tested (smoke tests passed)
- [ ] Rollback capability verified (<1 minute rollback time)
- [ ] Blue-green or canary deployment (not big-bang)
- [ ] Monitoring enabled (metrics, logs, alerts)

**Monitoring & Maintenance:**
- [ ] Data drift monitoring (feature distribution tracking)
- [ ] Concept drift monitoring (prediction distribution tracking)
- [ ] Performance monitoring (RMSE, latency, throughput)
- [ ] Automated alerts configured (Slack, PagerDuty)
- [ ] Retraining pipeline ready (triggered by drift or schedule)

**Compliance & Governance:**
- [ ] Audit trail enabled (all predictions logged for regulators)
- [ ] Model lineage documented (data → preprocessing → training → deployment)
- [ ] Access control configured (RBAC for model registry)
- [ ] Bias and fairness tested (for regulated industries)

---

### **MLOps Tools & Technologies**

**Experiment Tracking:**
- **MLflow**: Open-source, Python-native, great for tracking experiments and models
- **Weights & Biases (wandb)**: Best-in-class UI, team collaboration features
- **Neptune.ai**: Metadata store, great for large teams
- **TensorBoard**: TensorFlow-native, also usable from PyTorch (`torch.utils.tensorboard`), visualization focus

**Model Registry:**
- **MLflow Model Registry**: Open-source, stage-based promotion (None/Staging/Production/Archived)
- **SageMaker Model Registry**: AWS-native, integrates with SageMaker Pipelines
- **Vertex AI Model Registry**: GCP-native, managed service

**Data Versioning:**
- **DVC (Data Version Control)**: Git for data, integrates with Git workflows
- **Delta Lake**: Databricks, time travel for data tables
- **LakeFS**: Git-like versioning for data lakes

**Pipeline Orchestration:**
- **Kubeflow Pipelines**: Kubernetes-native, containerized workflows
- **Apache Airflow**: Python DAGs, great for scheduling
- **Prefect**: Modern alternative to Airflow, easier error handling
- **MLflow Projects**: Lightweight, good for simple pipelines

**Model Serving:**
- **Seldon Core**: Kubernetes-native, supports A/B testing, canary deployments
- **BentoML**: Python-first, easy to containerize models
- **TorchServe**: PyTorch native
- **TensorFlow Serving**: TensorFlow native
- **Ray Serve**: Distributed serving, great for LLMs

**Monitoring:**
- **Evidently AI**: Drift detection, open-source
- **Fiddler**: Enterprise monitoring, root cause analysis
- **Arize AI**: ML observability platform
- **Prometheus + Grafana**: Metrics monitoring (can track model metrics)

**Feature Stores:**
- **Feast**: Open-source, lightweight
- **Tecton**: Enterprise feature platform
- **Hopsworks**: Open-source, supports streaming features

---

### **Next Steps**

**Deepen Your MLOps Knowledge:**
1. **Notebook 152**: Advanced Model Serving (A/B Testing, Canary Deployments, Multi-Armed Bandits)
2. **Notebook 153**: Feature Stores and Real-Time ML (Feast, streaming features, low-latency serving)
3. **Notebook 154**: ML Model Explainability and Debugging (SHAP, LIME, model debugging techniques)
4. **Notebook 155**: Distributed Training and Hyperparameter Tuning (Ray, Optuna, multi-GPU training)

**Build a Portfolio Project:**
- Start with Project 2 (Multi-Model Experiment Tracking) - easy to build, high impact
- Then Project 4 (Production Monitoring) - learn drift detection
- Finally Project 1 (Automated Retraining Pipeline) - tie everything together

**Learn by Doing:**
- Deploy a real model to production (even if it's a personal project)
- Set up MLflow on localhost, track 10 experiments
- Simulate data drift, trigger automated retraining
- Build a monitoring dashboard with Prometheus + Grafana

---

### **Summary**

**MLOps is essential for:**
- 🚀 **Faster iteration** (automated pipelines vs manual deployment)
- ✅ **Higher reliability** (staging tests, rollback capability)
- 📊 **Better models** (experiment tracking enables systematic improvement)
- 🔍 **Early problem detection** (drift monitoring catches staleness in <1 hour)
- 💰 **Cost savings** ($26.7M/year in this notebook's use cases)

**Start simple:**
- Step 1: Just add experiment tracking (MLflow)
- Step 2: Add model registry and basic CI/CD
- Step 3: Add drift monitoring and automated retraining

**Remember:**
- MLOps is a journey, not a destination (start small, iterate)
- Automation pays off (4 hours → 10 minutes deployment time)
- Monitoring prevents disasters (catch drift before model fails)

---

🎉 **Congratulations!** You've built a complete MLOps system with experiment tracking, model registry, CI/CD pipelines, and production monitoring. You're ready to deploy and maintain production ML systems at scale!

## 🎯 Key Takeaways

### When to Use MLOps
- **Multiple models in production**: >3 models requiring consistent deployment, monitoring, retraining
- **Frequent model updates**: Weekly/monthly retraining cycles (demand forecasting, fraud detection)
- **Team collaboration**: Data scientists, ML engineers, DevOps working on shared models
- **Compliance requirements**: Model versioning, audit trails, reproducibility (financial, healthcare)
- **Business-critical predictions**: High-cost errors requiring reliability, observability (yield prediction, pricing)

### Limitations
- **Complexity overhead**: MLOps tooling (MLflow, Kubeflow, Airflow) has a learning curve (2-3 months ramp-up)
- **Infrastructure costs**: Dedicated MLOps platform (compute, storage, tools) = $50K-$500K/year
- **Overkill for simple projects**: A single-notebook model doesn't need a full MLOps pipeline
- **Tool fragmentation**: 50+ MLOps tools, no single standard (vendor lock-in risk)

### Alternatives
- **Manual deployment**: Data scientist manually deploys model (works for 1-2 models, doesn't scale)
- **Jupyter notebooks in production**: Run notebooks on schedule (fragile, hard to maintain)
- **Generic CI/CD**: Use Jenkins/GitHub Actions without ML-specific features (no experiment tracking, model registry)
- **Serverless ML**: Cloud AutoML, Amazon SageMaker Autopilot (simpler, but less control)

### Best Practices
- **Experiment tracking**: Log hyperparameters, metrics, artifacts for every run (MLflow, Weights & Biases)
- **Model registry**: Centralized storage with versioning, staging (dev/staging/prod), lineage
- **Automated pipelines**: CI/CD for training, testing, deployment (Kubeflow Pipelines, Airflow)
- **Monitoring**: Data drift, model performance, system metrics (latency, throughput)
- **Reproducibility**: Pin dependencies (requirements.txt), containerize environments (Docker)
- **Feature stores**: Centralized feature management for training-serving consistency

## 📊 Diagnostic Checks Summary

### Implementation Checklist
✅ **Experiment Tracking (MLflow)**
- Logging: Hyperparameters, metrics (accuracy, loss), artifacts (models, plots)
- Tagging: Environment (dev/staging/prod), dataset version, git commit SHA
- Comparison: Compare runs side-by-side, identify best hyperparameters
- Reproducibility: Log random seeds, library versions, data snapshots

✅ **Model Registry**
- Versioning: Semantic versioning (1.0.0, 1.1.0, 2.0.0) for model releases
- Staging: dev → staging → prod promotion workflow with approvals
- Lineage: Track training data, code version, hyperparameters used
- Metadata: Performance metrics, deployment timestamp, owner

✅ **CI/CD Pipelines (Kubeflow/Airflow)**
- Training pipeline: Data validation → feature engineering → model training → evaluation
- Deployment pipeline: Model packaging → container build → staging deployment → prod deployment
- Testing: Unit tests (code), integration tests (pipeline), model validation (accuracy >threshold)
- Rollback: Automated rollback if production accuracy drops >10%

✅ **Monitoring & Retraining**
- Data drift: KS test on feature distributions (retrain if p<0.01) or KL divergence (retrain above a chosen threshold)
- Model performance: Track online metrics (accuracy when labels available)
- Retraining triggers: Scheduled (weekly/monthly), performance-based (accuracy <threshold), drift-based
- A/B testing: Champion-challenger comparison before full deployment

### Quality Metrics
- **Experiment reproducibility**: 100% of experiments can be reproduced from logged metadata
- **Model deployment time**: <30min from "promote to prod" to live traffic
- **Monitoring coverage**: 100% of production models have drift + performance monitoring
- **Retraining frequency**: Meets business SLA (weekly for fast-changing domains, monthly for stable)

### Post-Silicon Validation Applications
**1. Yield Prediction MLOps Pipeline**
- Experiment tracking: Log 50+ yield model experiments (XGBoost, LightGBM, Neural Nets)
- Model registry: Prod model = XGBoost v3.2 (MAE 2.1%), staging = Neural Net v1.0 (MAE 2.3%)
- CI/CD: Weekly retraining on latest 90 days of wafer test data
- Monitoring: Alert if predicted yield distribution shifts >15% (possible data quality issue)
- Business value: Systematize yield modeling, reduce deployment time 5 days → 30min

**2. Device Binning Model Lifecycle**
- Experiment tracking: Compare binning algorithms (decision trees, logistic regression, NN)
- Model registry: Track 3 models (premium-grade classifier, automotive-grade, low-power)
- Retraining: Monthly on new device test data (capture performance drift over time)
- A/B testing: Champion (current prod) vs. challenger (new model), 90/10 traffic split
- Business value: $8M-$15M/year revenue optimization (better bin assignments), safe deployments
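
A 90/10 champion-challenger split like the one above is often implemented with deterministic hashing, so the same request or device always reaches the same model. The sketch below is one possible approach; the `route` function and the `device-<n>` identifiers are hypothetical.

```python
import hashlib


def route(request_id: str, challenger_share: float = 0.10) -> str:
    """Map an ID to 'champion' or 'challenger' using a stable hash bucket in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "challenger" if bucket < challenger_share else "champion"


assignments = [route(f"device-{i}") for i in range(10_000)]
share = assignments.count("challenger") / len(assignments)
print(f"challenger share: {share:.3f}")
```

Because the assignment depends only on the ID, it stays stable across service restarts. Raising `challenger_share` later moves additional IDs to the challenger without reshuffling the ones already assigned to it.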

**3. Test Time Prediction Pipeline**
- CI/CD: Automated pipeline triggered on Git push to `main` branch
  - Data validation: Check STDF files schema, outlier detection
  - Feature engineering: Extract test sequence, device complexity metrics
  - Model training: Train regression model, validate MAE <10% threshold
  - Deployment: If validation passes, deploy to staging → prod after 24hr canary
- Monitoring: Track prediction error, retrain if RMSE increases >20%
- Business value: $4M-$8M/year capacity planning accuracy, automated model updates

### Business ROI Estimation

**Scenario 1: Medium-Volume Semiconductor Fab (100K wafers/year, 5 production models)**
- Experiment tracking: 50% faster model iteration (1 week → 3 days) = **$2.5M/year** time-to-value
- Automated retraining: Weekly updates vs. manual quarterly = **$3M/year** better accuracy
- CI/CD for models: 5 days → 30min deployment = **$1.5M/year** engineering productivity
- **Total ROI: $7M/year** (cost: $150K MLflow + Airflow + $200K team = $350K; net: $6.65M)

**Scenario 2: High-Volume Automotive Semiconductor (500K wafers/year, 20+ models)**
- Model registry: Centralized management for 20 models = **$8M/year** operational efficiency
- A/B testing: Safe model deployments prevent bad releases = **$15M/year** avoided revenue loss
- Feature stores: Training-serving consistency = **$12M/year** reduced prediction errors
- Compliance: Model versioning + audit trails for ISO 26262 = **$5M/year** audit cost savings
- **Total ROI: $40M/year** (cost: $1M MLOps platform + $800K team = $1.8M; net: $38.2M)

**Scenario 3: Advanced Node R&D Fab (<10K wafers/year, experimental models)**
- Experiment tracking: Organize 200+ R&D experiments = **$3M/year** research efficiency
- Reproducibility: Recreate experiments 6 months later for publications = **$1.5M/year** IP value
- Rapid prototyping: Deploy experimental models to test environments in <1hr = **$2.5M/year** faster learning
- **Total ROI: $7M/year** (cost: $200K MLOps tools + $150K setup = $350K; net: $6.65M)
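
The arithmetic behind the three scenarios can be checked with a few lines of Python. All figures (in $M/year) are taken from the scenarios above; the dictionary keys are just labels chosen for this sketch.

```python
# Benefits and costs in $M/year, copied from the three scenarios above.
scenarios = {
    "medium_volume_fab": {"benefits": [2.5, 3.0, 1.5], "costs": [0.15, 0.20]},
    "high_volume_automotive": {"benefits": [8.0, 15.0, 12.0, 5.0], "costs": [1.0, 0.8]},
    "advanced_node_rnd": {"benefits": [3.0, 1.5, 2.5], "costs": [0.20, 0.15]},
}

net_by_scenario = {}
for name, s in scenarios.items():
    gross = sum(s["benefits"])
    cost = sum(s["costs"])
    net_by_scenario[name] = round(gross - cost, 2)
    print(f"{name}: gross ${gross:.1f}M, cost ${cost:.2f}M, net ${net_by_scenario[name]:.2f}M")
```

The net values reproduce the totals quoted in the scenarios; the benefit estimates themselves remain the notebook's assumptions.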

---

## 🎓 Mastery Achievement

**You now have production-grade expertise in:**
- ✅ Tracking ML experiments with MLflow (hyperparameters, metrics, artifacts, versioning)
- ✅ Managing model lifecycle with registry (dev/staging/prod promotion, lineage tracking)
- ✅ Building CI/CD pipelines for ML (Kubeflow Pipelines, Airflow for training and deployment)
- ✅ Monitoring production models for data drift and performance degradation
- ✅ Implementing automated retraining and A/B testing for safe deployments

**Next Steps:**
- **Advanced MLOps**: Feature stores (Feast, Tecton), model serving (KServe, Seldon)
- **ML Platform Engineering**: Multi-tenancy, resource quotas, cost optimization
- **Continuous Training**: Event-driven retraining, federated learning across sites