# 121: MLOps Fundamentals

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** MLOps: lifecycle management, CI/CD for ML, production challenges
- **Master** experiment tracking: logging metrics, parameters, artifacts with MLflow
- **Build** ML pipelines: data → train → evaluate → deploy automated workflows
- **Implement** model versioning: registry, staging, production promotion
- **Deploy** models to production: REST APIs, batch inference, edge deployment
- **Monitor** ML systems: performance tracking, drift detection, alerting

## 📚 What is MLOps?

**MLOps (Machine Learning Operations)** applies DevOps principles to ML systems, automating the end-to-end ML lifecycle from data preparation through production deployment and monitoring. Unlike traditional software, ML systems require managing data, models, and experiments alongside code.

**Core concepts:**
- **Reproducibility**: Track experiments (code, data, hyperparameters, metrics)
- **Automation**: CI/CD pipelines for training, testing, deployment
- **Monitoring**: Track model performance, data drift, system health
- **Governance**: Model versioning, lineage, compliance, security

**Why MLOps?**
- ✅ **Faster deployment**: Days to production (vs months with manual processes)
- ✅ **Reliability**: Automated testing prevents bad models from reaching production
- ✅ **Scalability**: Deploy 100+ models without linear team growth
- ✅ **Collaboration**: Data scientists, ML engineers, DevOps work seamlessly

## 🏭 Post-Silicon Validation Use Cases

**Yield Prediction Model Pipeline**
- Input: STDF data (wafer test, final test), device parameters (Vdd, Idd, freq), historical yield (100K devices)
- Pipeline: Data validation → Feature engineering → Model training (Random Forest, XGBoost) → Evaluation (F1, AUC) → Registry → REST API deployment
- Output: Yield prediction API (95% accuracy), real-time inference (<50ms), daily retraining (automated)
- Value: Identify failing devices 2 days earlier, reduce scrap 12%, improve yield 3%

**Test Time Optimization Model**
- Input: Test execution data (1M devices × 100 tests), test correlations, coverage matrix
- Pipeline: Correlation analysis → ML ranking (Gradient Boosting) → Test selection → A/B testing → Production rollout
- Output: Optimized test suite (25% time reduction, <1% coverage loss), confidence intervals, ROI dashboard
- Value: Save $500K/year per product, automated monthly retraining, shadow mode validation

**Anomaly Detection System**
- Input: Parametric test results (real-time stream, 1000 devices/hour), control limits, historical distributions
- Pipeline: Feature extraction → Isolation Forest training → Model registry → Edge deployment (test floor) → Alert system
- Output: Real-time anomaly detection (<1s latency), email/SMS alerts, 24/7 monitoring
- Value: Detect excursions within 5 min (vs 2 hrs manual), reduce false positives 40%

**Device Binning Classifier**
- Input: Parametric measurements (Vdd, Idd, freq, temp), spec limits, bin definitions (PASS, FAIL_VDD, etc.)
- Pipeline: Multi-class classification (SVM, Neural Network) → SHAP explanations → Model validation → Production API
- Output: Intelligent binning with confidence scores, feature importance, bin prediction accuracy 98%
- Value: Reduce test escapes 60%, improve bin accuracy, explain predictions to stakeholders

## 🔄 MLOps Workflow

```mermaid
graph TB
    A[Data Collection] --> B[Data Validation]
    B --> C[Feature Engineering]
    C --> D[Model Training]
    D --> E[Experiment Tracking]
    E --> F{Performance OK?}
    F -->|No| D
    F -->|Yes| G[Model Registry]
    G --> H[Staging Environment]
    H --> I[A/B Testing]
    I --> J{Pass Tests?}
    J -->|No| D
    J -->|Yes| K[Production Deployment]
    K --> L[Monitoring & Logging]
    L --> M{Drift Detected?}
    M -->|Yes| A
    M -->|No| L
    
    N[CI/CD Pipeline] -.-> D
    N -.-> K
    O[Model Governance] -.-> G
    
    style A fill:#e1f5ff
    style K fill:#e1ffe1
    style M fill:#fffacd
```

## 📊 Learning Path Context

**Prerequisites:**
- 010: Linear Regression (ML model basics)
- 041: Model Evaluation (metrics, validation)
- 051: Deep Learning Basics (neural networks)

**Next Steps:**
- 122: MLflow Complete Guide (experiment tracking platform)
- 123: Model Monitoring & Drift Detection (production health)
- 131: Docker Fundamentals (containerization for deployment)

---

Let's build production ML systems! 🚀

## 1. Setup & Installation

**Note**: MLOps tools integrate with existing ML workflows. We'll install MLflow for experiment tracking and model registry.

In [None]:
# Install MLOps packages
import subprocess
import sys

# Map pip package names to import names (they differ for scikit-learn -> sklearn)
packages = {
    'mlflow': 'mlflow',            # Experiment tracking and model registry
    'scikit-learn': 'sklearn',     # ML models
    'pandas': 'pandas',            # Data processing
    'numpy': 'numpy',              # Numerical operations
    'matplotlib': 'matplotlib',    # Visualization
    'seaborn': 'seaborn',          # Statistical plots
    'requests': 'requests',        # API calls
    'flask': 'flask',              # REST API server
}

for package, module_name in packages.items():
    try:
        __import__(module_name)
        print(f"✓ {package} already installed")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package, '-q'])

# Imports
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("\n✅ All packages ready!")
print(f"MLflow version: {mlflow.__version__}")
print("\nTo start MLflow UI:")
print("  mlflow ui --port 5000")
print("  Open browser: http://127.0.0.1:5000")

In [None]:
# Experiment tracking example: Yield prediction model
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
import pandas as pd
import numpy as np

# Generate synthetic STDF data
np.random.seed(42)
n_devices = 5000

df = pd.DataFrame({
    'Vdd_V': np.random.normal(1.2, 0.05, n_devices),
    'Idd_mA': np.random.normal(50, 5, n_devices),
    'freq_MHz': np.random.normal(1000, 50, n_devices),
    'temp_C': np.random.normal(25, 5, n_devices),
})

# Create yield target (devices fail if params out of spec)
df['yield'] = ((df['Vdd_V'] > 1.15) & (df['Vdd_V'] < 1.25) &
               (df['Idd_mA'] < 60) & (df['freq_MHz'] > 950)).astype(int)

# Split data
X = df.drop('yield', axis=1)
y = df['yield']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set MLflow experiment
mlflow.set_experiment("yield_prediction")

# Train with experiment tracking
with mlflow.start_run(run_name="rf_baseline"):

    # Log parameters
    n_estimators = 100
    max_depth = 10
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("data_size", len(df))
    
    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    
    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
    mlflow.log_metric("auc", auc)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Log feature importance plot
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(figsize=(8, 6))
    feature_importance = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    ax.barh(feature_importance['feature'], feature_importance['importance'])
    ax.set_xlabel('Importance')
    ax.set_title('Feature Importance')
    plt.tight_layout()
    mlflow.log_figure(fig, "feature_importance.png")
    plt.close()
    
    # Log confusion matrix
    from sklearn.metrics import confusion_matrix
    import seaborn as sns
    
    cm = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots(figsize=(6, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title('Confusion Matrix')
    mlflow.log_figure(fig, "confusion_matrix.png")
    plt.close()
    
    print(f"✅ Experiment logged successfully!")
    print(f"Accuracy: {accuracy:.3f}, F1: {f1:.3f}, AUC: {auc:.3f}")
    print(f"\nView results: mlflow ui --port 5000")

## 3. Hyperparameter Tuning with Tracking

### 📝 Systematic Experimentation

MLflow enables comparing 100+ hyperparameter combinations:
- Grid search / Random search / Bayesian optimization
- Log all trials automatically
- Sort by metric to find best configuration
- Visualize parameter effects in UI

In [None]:
# Hyperparameter tuning with MLflow
from sklearn.model_selection import ParameterGrid

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

# Grid search with tracking
best_auc = 0
best_params = None

for params in ParameterGrid(param_grid):
    with mlflow.start_run(run_name=f"rf_{params['n_estimators']}_{params['max_depth']}"):
        
        # Log all parameters
        mlflow.log_params(params)
        
        # Train model
        model = RandomForestClassifier(**params, random_state=42)
        model.fit(X_train, y_train)
        
        # Evaluate
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_pred_proba)
        
        # Log metrics
        mlflow.log_metric("auc", auc)
        
        # Track best model
        if auc > best_auc:
            best_auc = auc
            best_params = params
            mlflow.sklearn.log_model(model, "best_model")

print(f"\n✅ Tested {len(list(ParameterGrid(param_grid)))} configurations")
print(f"Best AUC: {best_auc:.4f}")
print(f"Best params: {best_params}")


## 2. Experiment Tracking with MLflow

### 📝 Why Experiment Tracking?

**The problem**: Data scientists run 100+ experiments, forget what worked, can't reproduce results

**MLflow tracking** logs:
- **Parameters**: Hyperparameters (learning_rate, n_estimators, etc.)
- **Metrics**: accuracy, F1, AUC, loss over epochs
- **Artifacts**: Models, plots, datasets, code snapshots
- **Metadata**: Environment, Git commit, runtime

**Benefits**:
- Compare 100+ experiments in UI (sort by metric, filter by parameter)
- Reproduce any experiment (exact code, data, hyperparameters)
- Share results with team (URL to experiment, not screenshots)
- Track lineage (which data/model produced which result)

## 3. Model Registry & Versioning

**The Challenge**: Multiple model versions in production, which one is current?

**MLflow Model Registry** provides:
- **Centralized storage** for all models
- **Version control** with lineage tracking
- **Stage transitions** (Staging → Production)
- **Model metadata** (who, when, why promoted)

**Post-Silicon Example**: Yield predictor model lifecycle
- v1.0: Initial Random Forest (85% accuracy) → Staging
- v1.1: Tuned hyperparameters (88% accuracy) → Production
- v2.0: XGBoost with feature engineering (92% accuracy) → Production
- v1.1: Archived (replaced by v2.0)
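
The lifecycle above can be sketched with a toy in-memory registry. This is not the MLflow API: `ModelVersion`, `ToyRegistry`, `register`, `promote`, and `current` are hypothetical names, used only to illustrate that promoting a new version to Production archives the previous holder.

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    version: str
    description: str
    accuracy: float
    stage: str = "None"

@dataclass
class ToyRegistry:
    """Hypothetical in-memory registry; illustrates stage transitions only."""
    versions: dict = field(default_factory=dict)

    def register(self, version, description, accuracy):
        self.versions[version] = ModelVersion(version, description, accuracy)

    def promote(self, version, stage):
        # Only one version may hold Production; archive the previous holder
        if stage == "Production":
            for mv in self.versions.values():
                if mv.stage == "Production":
                    mv.stage = "Archived"
        self.versions[version].stage = stage

    def current(self, stage="Production"):
        return [mv.version for mv in self.versions.values() if mv.stage == stage]

registry = ToyRegistry()
registry.register("v1.0", "Initial Random Forest", 0.85)
registry.register("v1.1", "Tuned hyperparameters", 0.88)
registry.register("v2.0", "XGBoost with feature engineering", 0.92)
registry.promote("v1.0", "Staging")
registry.promote("v1.1", "Production")
registry.promote("v2.0", "Production")  # v1.1 is archived as a side effect
print(registry.current())
```

A real registry adds persistence, access control, and an audit trail of who promoted what and why; the sketch only captures the state machine.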

## 4. Model Deployment Patterns

**Three Common Deployment Patterns:**

### **A. REST API (Real-time Inference)**
- **Use Case**: Web app needs yield prediction for single device
- **Latency**: <100ms
- **Tool**: MLflow Models + Flask/FastAPI

### **B. Batch Inference**
- **Use Case**: Daily analysis of 10,000 wafers from test floor
- **Latency**: Minutes to hours acceptable
- **Tool**: MLflow Models + Apache Spark

### **C. Edge Deployment**
- **Use Case**: Real-time inference on ATE (Automated Test Equipment)
- **Latency**: <10ms
- **Tool**: ONNX Runtime or TensorFlow Lite

**Post-Silicon Example**: Test floor yield predictor
- **Input**: Device parameters from STDF (Vdd, Idd, freq, temp)
- **Output**: Yield probability + risk score
- **SLA**: 99.9% uptime, <50ms latency

## 5. Monitoring & Drift Detection

**What to Monitor in Production:**

### **A. Model Performance Metrics**
- Accuracy, F1, AUC (requires ground truth labels)
- Prediction confidence distribution
- Error rate trends

### **B. Data Drift**
- **Feature drift**: Input distributions change (e.g., Vdd range shifts from 1.2V±0.05 to 1.25V±0.03)
- **Concept drift**: Relationship between features and target changes
- **Detection**: Statistical tests (KS test, PSI - Population Stability Index)

### **C. System Health**
- Latency (p50, p95, p99)
- Throughput (predictions/sec)
- Resource usage (CPU, memory)
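
Tail latencies are easy to miss when only averages are tracked. A minimal sketch of computing p50/p95/p99 with the standard library follows; the latency values are synthetic and `latency_percentiles` is a hypothetical helper, not part of any monitoring tool.

```python
import random
import statistics

def latency_percentiles(latencies_ms):
    """Return p50, p95 and p99 of a list of request latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

random.seed(0)
# Synthetic latencies: mostly fast requests plus a small slow tail (illustrative only)
latencies = ([random.gauss(20, 3) for _ in range(990)]
             + [random.uniform(80, 120) for _ in range(10)])
stats = latency_percentiles(latencies)
print({name: round(value, 1) for name, value in stats.items()})

# Check against the <50ms latency SLA from the test floor example
sla_ok = stats["p99"] < 50
print("p99 within SLA:", sla_ok)
```

In this synthetic sample the median stays near 20 ms while the 1% slow tail pushes p99 past the SLA, which is why the SLA is usually stated on a high percentile rather than on the mean.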

**Post-Silicon Alert Example**: 
"⚠️ Vdd distribution shifted by 2 standard deviations. Model retrain recommended. Current accuracy estimate: 82% (down from 92%)."

## 6. CI/CD for ML Pipelines

**Traditional CI/CD** (software engineering):
- Code → Build → Test → Deploy

**ML CI/CD** (additional steps):
- Data validation → Feature engineering → Model training → Model validation → A/B testing → Gradual rollout

**Key Differences:**
- **Data is code**: Data changes require testing
- **Model testing**: Beyond unit tests (accuracy, fairness, robustness)
- **Gradual rollout**: Canary deployments (5% traffic → 50% → 100%)
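
One common way to implement a gradual rollout is deterministic hash-based routing, so a given request or device always sees the same model while the canary share grows. The sketch below assumes string IDs; `route_to_canary` is a hypothetical helper, not a library function.

```python
import hashlib

def route_to_canary(request_id: str, canary_percent: float) -> bool:
    """Deterministically assign a request (or device/station ID) to the canary model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in [0, 9999]
    return bucket < canary_percent * 100   # e.g. 5% -> buckets 0..499

ids = [f"device-{i}" for i in range(20_000)]
for percent in (5, 50, 100):
    share = sum(route_to_canary(i, percent) for i in ids) / len(ids)
    print(f"target {percent}% -> observed {share:.1%}")
```

Because the bucket depends only on the ID, every ID routed to the canary at 5% stays on the canary at 50% and 100%, which keeps per-device behavior consistent during the rollout.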

**Post-Silicon CI/CD Pipeline:**
1. **Trigger**: New STDF data arrives (daily at 2 AM)
2. **Data validation**: Check schema, ranges, missing values
3. **Feature engineering**: Calculate derived metrics
4. **Model training**: Train on last 30 days of data
5. **Model validation**: Accuracy > 90% threshold?
6. **Model registry**: Register as new version
7. **A/B test**: Deploy to 10% of test stations
8. **Monitoring**: Compare metrics vs baseline
9. **Promote**: If A/B successful, promote to 100%
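
Step 2 of the pipeline (schema, range, and missing-value checks) might look like the following sketch. The column names match the synthetic data used earlier in this notebook, but the range limits and the `validate_batch` helper are assumptions for illustration.

```python
def validate_batch(rows, required_columns, ranges, max_missing_fraction=0.05):
    """Return a list of human-readable validation errors (empty list = pass)."""
    if not rows:
        return ["batch is empty"]
    errors = []
    for column in required_columns:
        # Schema check: the column must be present in every row
        if any(column not in row for row in rows):
            errors.append(f"schema: column '{column}' missing from some rows")
            continue
        values = [row[column] for row in rows]
        # Missing-value check against the allowed fraction
        missing = sum(v is None for v in values) / len(values)
        if missing > max_missing_fraction:
            errors.append(f"missing: '{column}' has {missing:.0%} nulls")
        # Range check on the non-null values
        if column in ranges:
            lo, hi = ranges[column]
            out = [v for v in values if v is not None and not lo <= v <= hi]
            if out:
                errors.append(f"range: '{column}' has {len(out)} values outside [{lo}, {hi}]")
    return errors

# Hypothetical limits for illustration; real limits come from the product spec
RANGES = {"Vdd_V": (1.0, 1.4), "Idd_mA": (0, 100)}
COLUMNS = ["Vdd_V", "Idd_mA"]
good = [{"Vdd_V": 1.21, "Idd_mA": 48.0}, {"Vdd_V": 1.19, "Idd_mA": 52.5}]
bad = [{"Vdd_V": 1.21, "Idd_mA": None}, {"Vdd_V": 3.30, "Idd_mA": 51.0}]
print(validate_batch(good, COLUMNS, RANGES))
print(validate_batch(bad, COLUMNS, RANGES))
```

Returning a list of errors rather than raising on the first one lets the pipeline report every problem in a batch before deciding whether to stop.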

## 🎯 Real-World MLOps Projects

### **Post-Silicon Validation Projects**

#### **Project 1: Automated Yield Prediction Pipeline**
**Objective**: Build end-to-end MLOps pipeline for wafer yield prediction
- **Data**: STDF files from test floor (daily refresh)
- **Model**: Gradient Boosting (track experiments with MLflow)
- **Deployment**: REST API for real-time predictions (<50ms latency)
- **Monitoring**: Data drift detection (PSI < 0.2), accuracy tracking
- **CI/CD**: Daily retrain at 2 AM, auto-promote if accuracy > 92%
- **Business Value**: Predict yield 24 hours earlier, reduce scrap by 15%

#### **Project 2: Test Time Optimization Model Lifecycle**
**Objective**: Deploy ML model to reduce ATE test time while maintaining quality
- **Features**: Device parameters, test sequence, historical pass/fail
- **Model**: XGBoost (log to MLflow with test time reduction metric)
- **Registry**: Track versions (v1.0: 10% reduction → v2.0: 25% reduction)
- **Deployment**: Batch inference on nightly test results
- **Monitoring**: False negative rate < 0.5%, test time savings
- **Success Metric**: $500K annual savings, 0% quality degradation

#### **Project 3: Anomaly Detection System with MLOps**
**Objective**: Real-time anomaly detection on test floor with full MLOps lifecycle
- **Data**: Streaming STDF data (parametric measurements)
- **Model**: Isolation Forest (experiment tracking for contamination parameter)
- **Deployment**: Edge deployment on test controllers (<10ms inference)
- **Monitoring**: Alert when anomaly rate > 5%, drift in normal behavior baseline
- **CI/CD**: Weekly retrain with new normal behavior patterns
- **Business Value**: Detect equipment failures 4 hours earlier, $2M avoidance/year

#### **Project 4: Device Binning Classifier with Explainability**
**Objective**: Automated device binning with SHAP explainability and MLOps tracking
- **Data**: Final test STDF (performance parameters)
- **Model**: Random Forest (track feature importance over time in MLflow)
- **Registry**: Version models with business rules (bin definitions change quarterly)
- **Deployment**: REST API with SHAP explanations ("Device binned as Premium because Vdd stability = 98%")
- **Monitoring**: Bin distribution drift (expected 60/30/10 split)
- **Success Metric**: 98% binning accuracy, full auditability

---

### **General AI/ML Projects**

#### **Project 5: Customer Churn Prediction MLOps Pipeline**
**Objective**: Production-ready churn prediction with complete MLOps lifecycle
- **Data**: Customer usage metrics (refresh weekly)
- **Model**: LightGBM (experiment with feature engineering strategies)
- **Deployment**: Batch predictions, integrate with CRM via API
- **Monitoring**: Concept drift (customer behavior changes), precision/recall tracking
- **CI/CD**: Retrain when drift detected or monthly
- **Business Value**: Reduce churn by 20%, proactive retention campaigns

#### **Project 6: Recommendation System with A/B Testing**
**Objective**: Deploy recommendation model with rigorous A/B testing
- **Data**: User interactions, product catalog
- **Model**: Collaborative filtering (log user engagement metrics to MLflow)
- **Registry**: Track candidate models (v1: content-based, v2: collaborative, v3: hybrid)
- **Deployment**: Canary release (5% → 25% → 100% traffic)
- **Monitoring**: Click-through rate, conversion rate, user satisfaction
- **Success Metric**: 15% increase in conversion rate

#### **Project 7: Fraud Detection Real-Time Inference**
**Objective**: Low-latency fraud detection with model monitoring
- **Data**: Transaction data (streaming)
- **Model**: Neural network (track precision/recall tradeoff experiments)
- **Deployment**: REST API (<100ms SLA for payment processing)
- **Monitoring**: False positive rate (customer friction), model drift daily
- **CI/CD**: Blue-green deployment (instant rollback if FPR spikes)
- **Business Value**: Block $5M fraud annually, <1% false positive rate

#### **Project 8: Demand Forecasting with MLOps Governance**
**Objective**: Enterprise demand forecasting with model governance
- **Data**: Sales history, seasonality, promotions (daily updates)
- **Model**: Prophet + XGBoost ensemble (track component contributions)
- **Registry**: Maintain model lineage (data version + code version + hyperparameters)
- **Deployment**: Scheduled batch predictions (nightly forecasts for next 30 days)
- **Monitoring**: MAPE tracking, alert when > 15%, automatic retrain trigger
- **Success Metric**: Reduce inventory costs by 25%, improve forecast accuracy to MAPE < 10%

## 📚 Comprehensive Takeaways

### **🎯 What is MLOps?**

**MLOps** = Machine Learning + DevOps = Systematic approach to deploying, monitoring, and managing ML models in production

**Core Principles:**
1. **Reproducibility**: Every experiment, model, and prediction must be traceable
2. **Automation**: Manual steps = errors; automate data validation → training → deployment
3. **Monitoring**: Models degrade over time; continuous monitoring is non-negotiable
4. **Collaboration**: Data scientists, ML engineers, DevOps work from same platform
5. **Governance**: Model lineage, approvals, audit trails for regulated industries

---

### **🔧 MLOps Lifecycle Stages**

#### **1. Experiment Tracking**
**Tools**: MLflow, Weights & Biases, Neptune.ai

**What to Track:**
- **Parameters**: Hyperparameters, feature engineering choices, data version
- **Metrics**: Accuracy, F1, AUC, business metrics (revenue impact, latency)
- **Artifacts**: Trained models, plots, feature importance, confusion matrices
- **Metadata**: Git commit hash, dataset hash, training duration

**Best Practices:**
```python
with mlflow.start_run(run_name="descriptive_name"):
    mlflow.log_param("learning_rate", 0.01)  # Log ALL hyperparameters
    mlflow.log_metric("val_accuracy", 0.92)  # Log validation metrics
    mlflow.sklearn.log_model(model, "model")  # Log model artifact
    mlflow.log_artifact("feature_importance.png")  # Log visualizations
```

**Post-Silicon Tip**: Track semiconductor-specific metrics (yield improvement %, test time reduction, false negative rate)

---

#### **2. Model Registry & Versioning**

**Why Versioning Matters:**
- Production model degrades → need to rollback to v1.2 (stable version)
- A/B test v2.0 vs v1.5 → need both versions deployed
- Regulatory audit → "Show me the exact model used on 2024-03-15"

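**MLflow Registry Stages** (deprecated since MLflow 2.9 in favor of model version aliases, but still common in existing deployments):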
- **None**: Experimental models, not production-ready
- **Staging**: Validated models undergoing A/B testing
- **Production**: Live models serving predictions
- **Archived**: Retired models (keep for audit trails)

**Best Practices:**
- **Semantic versioning**: v1.0.0 (major.minor.patch)
- **Model cards**: Document model purpose, training data, limitations, fairness
- **Automated promotion**: If staging model outperforms production by >5%, auto-promote

**Post-Silicon Example:**
```
Yield Predictor Registry:
- v1.0: Random Forest, 85% accuracy → Archived
- v1.5: Tuned RF, 88% accuracy → Archived
- v2.0: XGBoost, 92% accuracy → Production
- v2.1: XGBoost + new features, 93% accuracy → Staging (A/B testing)
```

---

#### **3. Model Deployment Patterns**

**A. REST API (Real-Time Inference)**

**When to Use:**
- Low latency required (<100ms)
- Single predictions (web app, mobile app)
- Synchronous workflows

**Tools:** Flask, FastAPI, MLflow Models, TensorFlow Serving

**Code Pattern:**
```python
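from flask import Flask, request, jsonify
import mlflow.pyfunc
import pandas as pd

app = Flask(__name__)
model = mlflow.pyfunc.load_model("models:/yield_predictor/Production")  # illustrative registry URI

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json  # one device as a JSON object of feature values
    prediction = model.predict(pd.DataFrame([data]))
    return jsonify({'prediction': prediction[0].item()})  # NumPy scalar -> native Python type for JSON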

**SLA Considerations:**
- **Latency**: p99 < 100ms
- **Throughput**: 1000 requests/sec
- **Availability**: 99.9% uptime (load balancing, health checks)

**Post-Silicon Use Case**: Test floor yield prediction API for real-time device routing

---

**B. Batch Inference**

**When to Use:**
- Large volumes (millions of predictions)
- Latency not critical (minutes/hours acceptable)
- Daily/weekly prediction jobs

**Tools:** Apache Spark, Dask, Airflow scheduling

**Code Pattern:**
```python
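import mlflow.pyfunc

# Load production model
model = mlflow.pyfunc.load_model("models:/yield_predictor/Production")

# Load batch data (e.g., 1M devices from today's test floor)
# `spark` is assumed to be an existing SparkSession
devices_df = spark.read.parquet("s3://stdf-data/2024-12-13/")

# Batch predict (toPandas() collects to the driver, so the batch must fit in memory)
devices_pdf = devices_df.toPandas()
devices_pdf["prediction"] = model.predict(devices_pdf)

# Save results
predictions_df = spark.createDataFrame(devices_pdf)
predictions_df.write.parquet("s3://predictions/2024-12-13/")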
```

**Post-Silicon Use Case**: Nightly batch processing of 10,000 wafers

---

**C. Edge Deployment**

**When to Use:**
- Ultra-low latency (<10ms)
- Offline inference (no internet)
- Resource-constrained devices (ATE, IoT)

**Tools:** ONNX Runtime, TensorFlow Lite, TensorRT

**Optimization Techniques:**
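- **Model quantization**: Float32 → Int8 (4x smaller weights; speedup depends on Int8 support in the target hardware and runtime)
- **Pruning**: Remove low-importance weights; some networks tolerate up to ~90% sparsity with <1% accuracy loss, but the achievable ratio is model-dependent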
- **Knowledge distillation**: Compress large model into small student model
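
To make the quantization bullet concrete, here is a minimal sketch of symmetric linear quantization in plain Python. It is a simplification: production toolchains such as ONNX Runtime or TensorFlow Lite use calibrated, often per-channel schemes, and the function names and weight values here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.82, -0.41, 0.003, -1.27, 0.56]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max_error={max_error:.4f}")
```

Each weight goes from 4 bytes (Float32) to 1 byte (Int8) plus one shared scale, which is where the 4x size reduction comes from; the rounding error per weight is bounded by half the scale.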

**Post-Silicon Use Case**: Real-time inference on ATE during device testing

---

#### **4. Monitoring & Drift Detection**

**What to Monitor:**

**A. Model Performance**
- **Online metrics**: Prediction latency, throughput, error rate
- **Offline metrics**: Accuracy, F1 (requires ground truth labels)

**B. Data Drift**
- **Feature drift**: Input distributions shift (e.g., temperature range changes)
- **Concept drift**: Feature-target relationships change (e.g., new process node)

**Detection Methods:**

**Kolmogorov-Smirnov Test:**
```python
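import numpy as np
from scipy.stats import ks_2samp
# Illustrative samples; in practice use the training and recent production Vdd values
training_vdd = np.random.normal(1.20, 0.05, 5000)
production_vdd = np.random.normal(1.25, 0.03, 5000)
statistic, pvalue = ks_2samp(training_vdd, production_vdd)
if pvalue < 0.05:
    print("Drift detected! Retrain model.")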
```

**Population Stability Index (PSI):**
- PSI < 0.1: No drift
- PSI 0.1-0.2: Minor drift, monitor
- PSI > 0.2: Major drift, retrain immediately
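
PSI compares the binned distribution of a reference sample with a current sample: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the expected and actual fractions in bin i. The sketch below assumes equal-width bins over the reference range and a small epsilon for empty bins; both are common choices but not the only ones, and `psi` and `classify_psi` are hypothetical helpers.

```python
import math
import random

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins
            idx = int((v - lo) / width) if width > 0 else 0
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def classify_psi(value):
    """Map a PSI value to the thresholds listed above."""
    if value < 0.1:
        return "no drift"
    if value <= 0.2:
        return "minor drift"
    return "major drift"

random.seed(42)
training_vdd = [random.gauss(1.20, 0.05) for _ in range(5000)]    # synthetic reference
production_vdd = [random.gauss(1.25, 0.03) for _ in range(5000)]  # synthetic shifted sample
print(classify_psi(psi(training_vdd, training_vdd)))
print(classify_psi(psi(training_vdd, production_vdd)))
```

The result depends on the binning, so the same bin edges should be reused every time a feature is monitored.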

**C. System Health**
- CPU/memory usage
- Prediction latency (p50, p95, p99)
- API error rates

**Alerting Strategy:**
- **Critical**: Model accuracy drops >10% → page on-call engineer
- **Warning**: PSI > 0.15 → email data science team
- **Info**: New model version deployed → Slack notification

**Post-Silicon Example:**
"🚨 ALERT: Vdd drift detected (PSI=0.24). Yield predictor accuracy estimated at 82% (down from 92%). Auto-retrain triggered."

---

#### **5. CI/CD for ML**

**Traditional CI/CD vs ML CI/CD:**

| Stage | Software Engineering | Machine Learning |
|-------|---------------------|------------------|
| **Build** | Compile code | Train model |
| **Test** | Unit tests, integration tests | Data validation, model validation, bias tests |
| **Deploy** | Blue-green, canary | A/B testing, shadow mode, gradual rollout |
| **Monitor** | Latency, errors | Drift, accuracy, fairness |

**ML Pipeline Stages:**

1. **Data Validation**
   - Schema check: Expected columns present?
   - Range check: Values within historical bounds?
   - Missing values: <5% threshold?

2. **Feature Engineering**
   - Deterministic transforms (same code = same features)
   - Version feature engineering code

3. **Model Training**
   - Automated hyperparameter tuning
   - Cross-validation for robustness
   - Track all experiments in MLflow

4. **Model Validation**
   - **Accuracy gate**: Accuracy > 90% threshold?
   - **Fairness gate**: No bias across subgroups?
   - **Robustness gate**: Performance stable on OOD data?

5. **Model Registration**
   - Register as new version in model registry
   - Tag with metadata (data version, code commit, accuracy)

6. **Staging Deployment**
   - Deploy to staging environment
   - Run integration tests (API response format correct?)

7. **A/B Testing**
   - Deploy to 5% of traffic
   - Monitor metrics vs baseline (champion vs challenger)
   - Statistical significance test (p < 0.05)

8. **Production Promotion**
   - If challenger wins A/B test, promote to 100%
   - Archive previous production model (keep for rollback)
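
For the A/B testing stage, one simple way to check statistical significance is a two-proportion z-test on accuracy. The sketch below assumes independent samples, counts large enough for the normal approximation, and a pooled proportion strictly between 0 and 1; the counts are made up for illustration.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions (b minus a)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative counts: correct predictions out of labeled devices
champion_correct, champion_total = 8_740, 9_500    # 95% of traffic
challenger_correct, challenger_total = 473, 500    # 5% of traffic
z, p_value = two_proportion_z_test(champion_correct, champion_total,
                                   challenger_correct, challenger_total)
promote = p_value < 0.05 and z > 0  # significant AND in the challenger's favor
print(f"z={z:.2f}, p={p_value:.4f}, promote={promote}")
```

Checking the sign of `z` matters: a significant difference in the wrong direction should never trigger promotion.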

**Orchestration Tools:**
- **Airflow**: Complex DAGs, scheduling
- **Kubeflow**: Kubernetes-native ML pipelines
- **MLflow Projects**: Reproducible runs with conda/docker
- **GitHub Actions**: Simple CI/CD for small teams

**Post-Silicon Pipeline:**
```
Daily at 2 AM:
1. Fetch yesterday's STDF files → validate schema
2. Engineer features → calculate derived metrics
3. Train yield predictor → log to MLflow
4. Validate accuracy > 90% → gate
5. Register model → transition to Staging
6. Deploy to 10% of test stations → A/B test
7. Monitor for 24 hours → compare metrics
8. If accuracy stable, promote to Production → all stations
```

---

### **⚙️ MLOps Tools Ecosystem**

#### **Experiment Tracking**
- **MLflow** (open-source): Lightweight, Python-friendly, self-hosted
- **Weights & Biases**: Managed service, beautiful dashboards, team collaboration
- **Neptune.ai**: Enterprise features, model registry, integrations

#### **Model Registry**
- **MLflow Registry**: Built into MLflow, stage transitions
- **ModelDB**: Open-source, Spark integration
- **Vertex AI Model Registry**: Google Cloud managed

#### **Deployment**
- **MLflow Models**: Multi-framework support (sklearn, TensorFlow, PyTorch)
- **TensorFlow Serving**: High-performance TensorFlow deployment
- **TorchServe**: PyTorch models as REST API
- **Seldon Core**: Kubernetes-native, advanced deployment patterns

#### **Monitoring**
- **Evidently AI**: Drift detection, model quality reports
- **Fiddler AI**: Enterprise monitoring, explainability
- **Arize AI**: ML observability platform
- **WhyLabs**: Data/model monitoring, anomaly detection

#### **Orchestration**
- **Airflow**: Workflow scheduling, complex DAGs
- **Kubeflow**: End-to-end ML on Kubernetes
- **Metaflow**: Netflix's human-centric ML framework
- **Prefect**: Modern workflow orchestration, Python-first

---

### **🚀 When to Use MLOps**

**You NEED MLOps if:**
- ✅ Models deployed to production (not just notebooks)
- ✅ Multiple data scientists experimenting (need to compare 100+ runs)
- ✅ Models retrained regularly (weekly/monthly)
- ✅ Regulatory requirements (audit trails, reproducibility)
- ✅ Business-critical predictions (downtime = revenue loss)

**You DON'T need MLOps if:**
- ❌ One-off analysis (quick insight, never reused)
- ❌ Static models (trained once, never updated)
- ❌ Prototype stage (MVP, validating idea)

**Post-Silicon Context:**
- Test floor models (yield prediction, binning) → NEED MLOps (retrain weekly, regulatory audits)
- Exploratory analysis (one-time wafer map investigation) → DON'T need MLOps

---

### **🎓 Best Practices**

#### **1. Start Simple, Scale Gradually**
- **Week 1**: Log experiments to CSV files
- **Week 2**: Adopt MLflow for experiment tracking
- **Month 1**: Set up model registry, manual deployments
- **Month 2**: Automate deployments with CI/CD
- **Month 3**: Add drift monitoring, alerting
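
The "Week 1" step needs nothing beyond the standard library. A minimal sketch of logging runs to a CSV file and picking the best one follows; the field names mirror the hyperparameters used earlier in this notebook, the metric values are made up, and `log_run` / `best_run` are hypothetical helpers.

```python
import csv
import os
import tempfile

FIELDS = ["run_name", "n_estimators", "max_depth", "auc"]

def log_run(path, **run):
    """Append one experiment record to a CSV file, writing the header once."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(run)

def best_run(path, metric="auc"):
    """Return the logged run with the highest value of `metric`."""
    with open(path, newline="") as f:
        return max(csv.DictReader(f), key=lambda row: float(row[metric]))

log_path = os.path.join(tempfile.mkdtemp(), "experiments.csv")
# Metric values below are invented for illustration
log_run(log_path, run_name="rf_50_5", n_estimators=50, max_depth=5, auc=0.981)
log_run(log_path, run_name="rf_100_10", n_estimators=100, max_depth=10, auc=0.994)
print(best_run(log_path)["run_name"])
```

This works for one person on one machine; it has no artifact storage, no UI, and no concurrency safety, which is what motivates the move to MLflow in Week 2.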

#### **2. Automate Everything**
- Manual deployment = 2 hours + human error risk
- Automated pipeline = 10 minutes + reproducible

#### **3. Monitor from Day 1**
- "Model deployed successfully! 🎉" → 6 months later → "Why is accuracy 60%?"
- Deploy monitoring BEFORE production launch

#### **4. Version EVERYTHING**
- Data version (hash, timestamp, location)
- Code version (Git commit SHA)
- Model version (registry version number)
- Environment version (requirements.txt, Docker image)
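
The four versions can be captured together in a small manifest stored next to each trained model. This is a sketch under assumptions: `build_manifest` and its field names are hypothetical, the Git lookup falls back to "unknown" outside a repository, and the Python version stands in for a full environment description such as a Docker image tag.

```python
import hashlib
import json
import platform
import subprocess

def sha256_of_bytes(data: bytes) -> str:
    """Content hash used as the data version."""
    return hashlib.sha256(data).hexdigest()

def git_commit() -> str:
    """Current commit SHA, or 'unknown' when Git or a repository is unavailable."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL, text=True
        ).strip()
    except (subprocess.CalledProcessError, OSError):
        return "unknown"

def build_manifest(dataset_bytes: bytes, model_version: str) -> dict:
    return {
        "data_sha256": sha256_of_bytes(dataset_bytes),
        "code_commit": git_commit(),
        "model_version": model_version,
        "python_version": platform.python_version(),
    }

# In practice dataset_bytes would be the contents of the training file
sample_data = b"Vdd_V,Idd_mA\n1.21,48.0\n"
manifest = build_manifest(sample_data, model_version="3")
print(json.dumps(manifest, indent=2))
```

Hashing the data content rather than relying on file names or timestamps means two runs on identical data always share the same data version.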

#### **5. Build Rollback Mechanisms**
- Production model crashes → instant rollback to v1.5 (last stable)
- Blue-green deployment: Keep old version running until new version validated

#### **6. Document Model Decisions**
- Model card: "Why Random Forest? Explainability > 2% accuracy gain from deep learning"
- Experiment notes: "Tried feature X, no improvement, wasted 3 days"

---

### **⚠️ Common Pitfalls**

#### **1. Over-Engineering Too Early**
- **Mistake**: Spend 3 months building Kubernetes MLOps platform before training first model
- **Fix**: Start with MLflow + Flask API, scale when needed

#### **2. Ignoring Data Quality**
- **Mistake**: "Model accuracy dropped from 92% to 60%... oh, data pipeline broke 3 weeks ago"
- **Fix**: Data validation in CI/CD pipeline (check schema, ranges, nulls)

#### **3. No Monitoring = Silent Failures**
- **Mistake**: Model serves predictions for 6 months, accuracy unknown
- **Fix**: Log ground truth labels (delayed), calculate offline metrics weekly

#### **4. Training/Serving Skew**
- **Mistake**: Training uses pandas, production uses Java → feature calculations differ → accuracy drops
- **Fix**: Same feature engineering code for training AND serving (use a feature store or a shared library)
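
The shared-library fix can be as simple as one function imported by both the training job and the serving code. In this sketch the derived features are hypothetical examples built from the columns used earlier in this notebook.

```python
def engineer_features(record: dict) -> dict:
    """Single source of truth for derived features (hypothetical examples)."""
    return {
        "power_mW": record["Vdd_V"] * record["Idd_mA"],
        "freq_per_mA": record["freq_MHz"] / record["Idd_mA"],
        "temp_delta_C": record["temp_C"] - 25.0,  # offset from a nominal 25 C
    }

def build_training_rows(raw_records):
    # Training path: apply the shared function to historical records
    return [engineer_features(r) for r in raw_records]

def handle_request(payload: dict) -> dict:
    # Serving path: apply the same function to a live request
    return engineer_features(payload)

device = {"Vdd_V": 1.2, "Idd_mA": 50.0, "freq_MHz": 1000.0, "temp_C": 30.0}
training_row = build_training_rows([device])[0]
serving_row = handle_request(device)
print(serving_row)
print("identical features:", training_row == serving_row)
```

A parity check like the last line is worth keeping as a CI test: it fails as soon as one path starts computing a feature differently.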

#### **5. Forgetting Model Governance**
- **Mistake**: "Which model version was used for this prediction?" → No audit trail
- **Fix**: Log model version, input features, prediction, timestamp for every request
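
A per-request audit record could be a single JSON line, as in this sketch. The field names and the `audit_record` helper are assumptions; where the line is written (file, message queue, log collector) depends on the deployment.

```python
import json
from datetime import datetime, timezone

def audit_record(model_name, model_version, features, prediction):
    """Build one JSON line describing a single prediction request."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_name": model_name,
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }, sort_keys=True)

line = audit_record(
    model_name="yield_predictor",
    model_version="2.0",
    features={"Vdd_V": 1.21, "Idd_mA": 48.0, "freq_MHz": 1012.0, "temp_C": 24.5},
    prediction=1,
)
print(line)  # append to a log file or send to a log collector
```

With records like this, "which model version produced this prediction?" becomes a log query rather than guesswork.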

---

### **🔮 Next Steps**

**After mastering this notebook:**
1. **122_MLflow_Complete_Guide.ipynb** → Deep dive into MLflow tracking, registry, projects
2. **123_Model_Monitoring_Drift_Detection.ipynb** → Advanced drift detection, alerting strategies
3. **124_ML_CI_CD_Pipelines.ipynb** → Airflow, GitHub Actions, automated retraining
4. **131_Docker_Fundamentals.ipynb** → Containerize ML models for reproducible deployments

**Hands-On Practice:**
- Deploy Notebook 121 experiment tracking example locally
- Set up MLflow UI (port 5000), explore experiments
- Build REST API for yield predictor (Flask + MLflow)
- Simulate drift detection with synthetic STDF data

---

### **📊 MLOps Maturity Model**

**Level 0: Manual Process**
- Notebooks on laptops
- Manual model training
- Email models to deployment team

**Level 1: Experiment Tracking**
- MLflow logging
- Centralized metric comparison
- Manual deployment with scripts

**Level 2: Automated Training**
- CI/CD pipeline trains models
- Automated validation gates
- Model registry with staging

**Level 3: Automated Deployment**
- A/B testing automated
- Gradual rollout (canary)
- Monitoring dashboards

**Level 4: Full MLOps**
- Continuous training (CT)
- Automatic drift detection → retrain
- Self-healing pipelines
- Comprehensive governance

**Most post-silicon teams**: Level 1-2  
**Target for production systems**: Level 3-4

---

**You now have the MLOps foundation to deploy, monitor, and manage ML models in production! 🚀**

In [None]:
# Simple CI/CD pipeline script (conceptual)
import mlflow
import mlflow.sklearn  # needed for mlflow.sklearn.log_model below
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def ml_pipeline():
    """Automated ML pipeline for daily model retraining"""
    
    # 1. Data validation
    print("Step 1: Validating new STDF data...")
    # Load new data (placeholder)
    # validate_data_schema(new_data)
    # validate_data_quality(new_data)
    
    # 2. Feature engineering
    print("Step 2: Engineering features...")
    # features = engineer_features(new_data)
    
    # 3. Model training
    print("Step 3: Training model...")
    with mlflow.start_run(run_name="automated_retrain"):
        # Train model (using previous synthetic data for demo)
        model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
        model.fit(X_train, y_train)
        
        # 4. Model validation
        print("Step 4: Validating model performance...")
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        
        mlflow.log_metric("accuracy", accuracy)
        
        # Accuracy threshold gate
        ACCURACY_THRESHOLD = 0.85
        if accuracy < ACCURACY_THRESHOLD:
            print(f"❌ Model failed validation: {accuracy:.4f} < {ACCURACY_THRESHOLD}")
            print("Pipeline aborted. Alert data science team.")
            return False
        
        print(f"✅ Model passed validation: {accuracy:.4f}")
        
        # 5. Register model
        print("Step 5: Registering model...")
        mlflow.sklearn.log_model(model, "model")
        model_uri = f"runs:/{mlflow.active_run().info.run_id}/model"
        mlflow.register_model(model_uri, "yield_predictor")
        
        # 6. Transition to staging
        print("Step 6: Promoting to Staging for A/B test...")
        # client.transition_model_version_stage(...)
        
        print("✅ Pipeline completed successfully")
        return True

# Run pipeline
# In production, this would be triggered by cron job or Airflow DAG
print("Simulating automated ML pipeline...")
# ml_pipeline()
print("Pipeline would run daily at 2 AM via cron: 0 2 * * * python ml_pipeline.py")

In [None]:
# Data drift detection
from scipy.stats import ks_2samp
import numpy as np
import mlflow  # used at the end of this cell to log drift metrics

# Reference data (training distribution)
reference_vdd = np.random.normal(1.2, 0.05, 1000)

# Current production data (potentially drifted)
current_vdd = np.random.normal(1.25, 0.03, 500)  # Mean shifted!

# Kolmogorov-Smirnov test
statistic, pvalue = ks_2samp(reference_vdd, current_vdd)

print(f"KS Statistic: {statistic:.4f}")
print(f"P-value: {pvalue:.4f}")

if pvalue < 0.05:
    print("⚠️ DRIFT DETECTED: Vdd distribution has significantly changed")
    print("Action: Retrain model with recent data")
else:
    print("✅ NO DRIFT: Distribution stable")

# Population Stability Index (PSI)
def calculate_psi(reference, current, bins=10):
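    """Calculate the Population Stability Index of `current` against `reference`"""
    # Bin edges come from the reference sample; current values outside that
    # range fall into no bin and are not counted by np.histogram.
    ref_hist, bin_edges = np.histogram(reference, bins=bins)
    cur_hist, _ = np.histogram(current, bins=bin_edges)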
    
    # Convert to percentages
    ref_pct = ref_hist / len(reference)
    cur_pct = cur_hist / len(current)
    
    # Avoid division by zero
    ref_pct = np.where(ref_pct == 0, 0.0001, ref_pct)
    cur_pct = np.where(cur_pct == 0, 0.0001, cur_pct)
    
    # PSI formula
    psi = np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))
    
    return psi

psi = calculate_psi(reference_vdd, current_vdd)
print(f"\nPSI: {psi:.4f}")

if psi < 0.1:
    print("✅ PSI < 0.1: No significant change")
elif psi < 0.2:
    print("⚠️ PSI 0.1-0.2: Minor drift detected, monitor closely")
else:
    print("🚨 PSI >= 0.2: Major drift! Model retrain required")

# Log drift metrics to MLflow
mlflow.log_metric("vdd_ks_statistic", statistic)
mlflow.log_metric("vdd_psi", psi)
mlflow.log_metric("drift_detected", 1 if pvalue < 0.05 else 0)

In [None]:
# REST API deployment with Flask
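from flask import Flask, request, jsonify
import mlflow.sklearn
import pandas as pd

app = Flask(__name__)

# Load model once at startup.
# The sklearn flavor is loaded (rather than pyfunc) so that predict_proba is
# available; pyfunc's predict() on a scikit-learn classifier returns class labels.
model = mlflow.sklearn.load_model("models:/yield_predictor/Production")

@app.route('/predict', methods=['POST'])
def predict():
    """
    Predict yield for device parameters
    
    Request body:
    {
        "Vdd_V": 1.2,
        "Idd_mA": 48.5,
        "freq_MHz": 1050,
        "temp_C": 27
    }
    """
    data = request.get_json()
    
    # Convert to a single-row DataFrame (model expects this format)
    input_df = pd.DataFrame([data])
    
    # Predict the probability of the positive class
    # (assumes label 1 means "passes" and is the second entry of model.classes_)
    probability = model.predict_proba(input_df)[0, 1]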
    
    # Return result
    return jsonify({
        'yield_probability': float(probability),
        'risk_level': 'LOW' if probability > 0.9 else 'MEDIUM' if probability > 0.7 else 'HIGH',
        'recommendation': 'PASS' if probability > 0.8 else 'RETEST'
    })

@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint for monitoring"""
    return jsonify({'status': 'healthy', 'model': 'yield_predictor', 'version': 'production'})

# Run server
# app.run(host='0.0.0.0', port=5001)
print("REST API ready. Start with: app.run(host='0.0.0.0', port=5001)")
print("Test with: curl -X POST http://localhost:5001/predict -H 'Content-Type: application/json' -d '{\"Vdd_V\": 1.2, \"Idd_mA\": 50, \"freq_MHz\": 1000, \"temp_C\": 25}'")

In [None]:
# Model registry workflow
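import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the best model
model_name = "yield_predictor"
# The grid-search run has already ended, so mlflow.active_run() would return None here.
# last_active_run() returns the most recently finished run (assumed to be the grid search).
run_id = mlflow.last_active_run().info.run_id
model_uri = f"runs:/{run_id}/best_model"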

# Register model
model_version = mlflow.register_model(model_uri, model_name)

print(f"Model {model_name} version {model_version.version} registered")

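# Transition to staging
# (newer MLflow releases deprecate registry stages in favor of model version aliases;
# check the documentation for the installed version before relying on this call)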
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging",
    archive_existing_versions=False
)

print("Model transitioned to Staging")

# After validation, promote to production
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Production",
    archive_existing_versions=True  # Archive previous production model
)

print("Model promoted to Production")

# Load production model for inference
production_model = mlflow.pyfunc.load_model(f"models:/{model_name}/Production")
print("Production model loaded and ready for inference")

In [None]:
# Hyperparameter tuning with grid search
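import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

# X_train, X_test, y_train, y_test come from the data-preparation cells earlier in the notebook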

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

# Grid search with MLflow tracking
with mlflow.start_run(run_name="grid_search_rf"):
    grid_search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=5,
        scoring='f1',
        n_jobs=-1
    )
    
    grid_search.fit(X_train, y_train)
    
    # Log best parameters
    for param, value in grid_search.best_params_.items():
        mlflow.log_param(f"best_{param}", value)
    
    # Log best score
    mlflow.log_metric("best_cv_f1", grid_search.best_score_)
    
    # Test set evaluation
    y_pred = grid_search.predict(X_test)
    test_f1 = f1_score(y_test, y_pred)
    mlflow.log_metric("test_f1", test_f1)
    
    # Log best model
    mlflow.sklearn.log_model(grid_search.best_estimator_, "best_model")
    
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV F1: {grid_search.best_score_:.4f}")
    print(f"Test F1: {test_f1:.4f}")