# 122: MLflow Complete Guide

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Master** MLflow Tracking: Log parameters, metrics, artifacts systematically
- **Build** MLflow Projects: Reproducible ML workflows with packaging
- **Deploy** MLflow Models: Serve models via REST API, batch, cloud
- **Manage** MLflow Registry: Version control, stage transitions, model governance
- **Integrate** MLflow with production: CI/CD, monitoring, scaling strategies
- **Apply** advanced patterns: Nested runs, autologging, custom metrics

## 📚 What is MLflow?

**MLflow** is an open-source platform for managing the complete machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

**Four Core Components:**
- ✅ **MLflow Tracking**: Record and query experiments (parameters, metrics, artifacts)
- ✅ **MLflow Projects**: Package ML code in reproducible format (conda, Docker)
- ✅ **MLflow Models**: Deploy models to diverse serving environments (REST, batch, Spark)
- ✅ **MLflow Registry**: Central model store with versioning and stage management

**Why MLflow?**
- **Open-source**: No vendor lock-in, active community (20K+ GitHub stars)
- **Framework-agnostic**: Works with sklearn, TensorFlow, PyTorch, XGBoost, LightGBM
- **Production-ready**: Used by Uber, Databricks, Microsoft, Netflix
- **Simple API**: 3 lines to start tracking: `mlflow.start_run()`, `mlflow.log_metric()`, `mlflow.end_run()`

## 🏭 Post-Silicon Validation Use Cases

**Use Case 1: Multi-Algorithm Yield Prediction Comparison**
- **Input**: STDF parametric data (Vdd, Idd, freq, temp) for 50,000 devices
- **Models**: Random Forest, XGBoost, Gradient Boosting, Neural Network
- **Tracking**: Log all hyperparameters, cross-validation scores, training time
- **Output**: Best model (XGBoost, 93.5% accuracy) promoted to production
- **Value**: Systematic comparison eliminates guesswork, 8% accuracy improvement

**Use Case 2: Test Time Optimization Model Registry**
- **Scenario**: 5 model versions for test time reduction (v1.0: 10% → v2.5: 28% reduction)
- **Registry**: Track each version with metadata (data range, accuracy, test time savings)
- **Deployment**: Stage v2.5 on 10% of ATE, monitor false negative rate
- **Promotion**: If FNR < 0.5%, promote to Production, archive v2.0
- **Value**: $800K annual savings, full audit trail for quality compliance
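The promotion rule in Use Case 2 can be sketched as a small gating function. This is a minimal illustration, not a prescribed implementation: it assumes label 1 is the positive class and defines the false negative rate as FN / (FN + TP). The names `false_negative_rate` and `should_promote` are hypothetical; only the 0.5% threshold comes from the use case.

```python
from typing import Sequence

FNR_THRESHOLD = 0.005  # 0.5%, taken from the promotion rule above


def false_negative_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return FN / (FN + TP), treating label 1 as the positive class (assumption)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    if fn + tp == 0:
        raise ValueError("no positive samples; FNR is undefined")
    return fn / (fn + tp)


def should_promote(y_true: Sequence[int], y_pred: Sequence[int],
                   threshold: float = FNR_THRESHOLD) -> bool:
    """Promote only when the observed FNR is strictly below the threshold."""
    return false_negative_rate(y_true, y_pred) < threshold
```

A gate like this would typically run on the Staging traffic sample before any registry stage transition is requested.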

**Use Case 3: Wafer Map Anomaly Detection Experiments**
- **Challenge**: Tested 47 different feature engineering strategies over 3 months
- **Tracking**: Log spatial features, PCA components, contamination parameters, F1 scores
- **Best**: Spatial autocorrelation + PCA(20) → F1 = 0.89
- **Artifacts**: Save wafer map visualizations, feature importance plots
- **Value**: Without MLflow, would have lost track of experiments, wasted 50+ hours recreating results

**Use Case 4: Device Binning Model Deployment**
- **Models**: 3 binning strategies (performance-based, power-based, hybrid)
- **Projects**: Package each strategy as MLflow Project with dependencies
- **Deployment**: Serve via REST API (<30ms latency) to binning automation system
- **Monitoring**: Track bin distribution drift, alert if Premium bin % drops >5%
- **Value**: 98.5% binning accuracy, automated deployment eliminates manual errors
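The monitoring rule in Use Case 4 can be sketched as a simple share comparison. This sketch assumes "drops >5%" means a drop of more than 5 percentage points against a baseline share; the function names and the `baseline_share` input are hypothetical.

```python
from collections import Counter
from typing import Iterable


def bin_share(bins: Iterable[str], label: str) -> float:
    """Fraction of devices in this window that were assigned to `label`."""
    counts = Counter(bins)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no devices in this window")
    return counts[label] / total


def premium_drop_alert(baseline_share: float, current_bins: Iterable[str],
                       label: str = "Premium", max_drop_points: float = 5.0) -> bool:
    """True when the share of `label` fell by more than `max_drop_points`
    percentage points relative to the baseline (assumed interpretation)."""
    drop_points = (baseline_share - bin_share(current_bins, label)) * 100
    return drop_points > max_drop_points
```

The boolean result could be logged as a tag or metric on a monitoring run so that alerts stay traceable next to the model version that produced them.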

## 🔄 MLflow Workflow

```mermaid
graph TB
    A[Data Preparation] --> B[Experiment Tracking]
    B --> C[Log Parameters]
    B --> D[Log Metrics]
    B --> E[Log Artifacts]
    C --> F[Compare Experiments]
    D --> F
    E --> F
    F --> G{Best Model?}
    G -->|Yes| H[Register Model]
    G -->|No| B
    H --> I[Staging]
    I --> J[Validation Tests]
    J --> K{Pass?}
    K -->|Yes| L[Production]
    K -->|No| I
    L --> M[Serve Predictions]
    M --> N[Monitor Performance]
    N --> O{Drift Detected?}
    O -->|Yes| A
    O -->|No| M
    
    style A fill:#e1f5ff
    style L fill:#e1ffe1
    style O fill:#fff4e1
```

## 📊 Learning Path Context

**Prerequisites:**
- **121_MLOps_Fundamentals.ipynb** - MLOps lifecycle, deployment patterns
- **010_Linear_Regression.ipynb** - ML model basics
- **041_Model_Evaluation_Metrics.ipynb** - Evaluation techniques

**Next Steps:**
- **123_Model_Monitoring_Drift_Detection.ipynb** - Monitor deployed models
- **124_Feature_Store_Implementation.ipynb** - Centralized feature management
- **125_ML_Pipeline_Orchestration.ipynb** - Airflow, Kubeflow integration

---

Let's master MLflow for production ML! 🚀

In [None]:
# Install MLflow and dependencies
# !pip install mlflow scikit-learn pandas numpy matplotlib seaborn xgboost

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from mlflow.models.signature import infer_signature
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

print(f"MLflow version: {mlflow.__version__}")
print("MLflow tracking URI:", mlflow.get_tracking_uri())
print("Start MLflow UI: mlflow ui --port 5000")

## 2. MLflow Tracking: Complete Guide

**MLflow Tracking** is a logging API for recording:
- **Parameters**: Hyperparameters, config values (immutable)
- **Metrics**: Performance scores, loss values (can update over iterations)
- **Artifacts**: Files (models, plots, data, any file)
- **Tags**: Metadata (model type, data version, environment)

**Key Concepts:**
- **Experiment**: Collection of runs (e.g., "Yield Prediction Experiments")
- **Run**: Single execution of ML code (e.g., "RandomForest_run_42")
- **Tracking URI**: Where data is stored (local file, database, remote server)

**Post-Silicon Example**: Compare 5 algorithms for yield prediction, track all experiments in one place.

In [None]:
# Part 1: Basic tracking example
# Generate synthetic STDF data
np.random.seed(42)
n_devices = 5000

data = pd.DataFrame({
    'Vdd_V': np.random.normal(1.2, 0.05, n_devices),
    'Idd_mA': np.random.normal(50, 5, n_devices),
    'freq_MHz': np.random.normal(1000, 50, n_devices),
    'temp_C': np.random.normal(25, 5, n_devices)
})

# Create yield target
data['yield'] = (
    (data['Vdd_V'] >= 1.15) & (data['Vdd_V'] <= 1.25) &
    (data['Idd_mA'] <= 55) &
    (data['freq_MHz'] >= 950) &
    (data['temp_C'] <= 30)
).astype(int)

X = data.drop('yield', axis=1)
y = data['yield']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Dataset: {len(data)} devices, {y.sum()} passing ({y.mean()*100:.1f}%)")
print(f"Training: {len(X_train)}, Test: {len(X_test)}")

### A. Basic Tracking API

**Core tracking functions:**
- `mlflow.set_experiment(name)` - Create or set active experiment
- `mlflow.start_run(run_name=...)` - Start new run (returns context manager); pass `run_name` by keyword, because the first positional argument is `run_id`
- `mlflow.log_param(key, value)` - Log single parameter
- `mlflow.log_params(dict)` - Log multiple parameters
- `mlflow.log_metric(key, value, step)` - Log metric (can log multiple times with different steps)
- `mlflow.log_artifact(path)` - Log file artifact
- `mlflow.end_run()` - End current run (automatic with context manager)

In [None]:
# Comprehensive tracking example
mlflow.set_experiment("Yield_Prediction_Complete")

with mlflow.start_run(run_name="RandomForest_Comprehensive") as run:
    # 1. Log parameters (hyperparameters, config)
    params = {
        'model_type': 'RandomForest',
        'n_estimators': 100,
        'max_depth': 10,
        'min_samples_split': 5,
        'random_state': 42,
        'data_size': len(X_train)
    }
    mlflow.log_params(params)
    
    # 2. Log tags (metadata)
    mlflow.set_tags({
        'data_source': 'STDF_synthetic',
        'environment': 'development',
        'ml_engineer': 'data_science_team',
        'use_case': 'yield_prediction'
    })
    
    # 3. Train model
    model = RandomForestClassifier(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        min_samples_split=params['min_samples_split'],
        random_state=params['random_state']
    )
    model.fit(X_train, y_train)
    
    # 4. Log metrics
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'f1_score': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_pred_proba)
    }
    mlflow.log_metrics(metrics)
    
    # 5. Log artifacts (plots)
    # Feature importance plot
    fig, ax = plt.subplots(figsize=(8, 5))
    importance = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    sns.barplot(data=importance, x='importance', y='feature', ax=ax)
    ax.set_title('Feature Importance')
    plt.tight_layout()
    plt.savefig('feature_importance.png')
    mlflow.log_artifact('feature_importance.png')
    plt.close()
    
    # Confusion matrix
    fig, ax = plt.subplots(figsize=(6, 5))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title('Confusion Matrix')
    plt.tight_layout()
    plt.savefig('confusion_matrix.png')
    mlflow.log_artifact('confusion_matrix.png')
    plt.close()
    
    # 6. Log model with signature
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(model, "model", signature=signature)
    
    print(f"Run ID: {run.info.run_id}")
    print(f"Metrics: Accuracy={metrics['accuracy']:.4f}, F1={metrics['f1_score']:.4f}, AUC={metrics['roc_auc']:.4f}")
    print(f"View in UI: http://127.0.0.1:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")

### B. Autologging (Automatic Tracking)

**Autologging** automatically logs parameters, metrics, and models for supported frameworks.

**Supported frameworks:**
- `mlflow.sklearn.autolog()` - scikit-learn
- `mlflow.tensorflow.autolog()` - TensorFlow/Keras
- `mlflow.pytorch.autolog()` - PyTorch
- `mlflow.xgboost.autolog()` - XGBoost
- `mlflow.lightgbm.autolog()` - LightGBM

**What gets logged automatically:**
- All hyperparameters
- Training/validation metrics
- Model artifact
- Model signature
- Training dataset info

**Trade-off**: Less control over what gets logged, in exchange for much less setup code.

In [None]:
# Autologging example
mlflow.sklearn.autolog()

with mlflow.start_run(run_name="GradientBoosting_Autolog"):
    # Just train the model - MLflow logs everything automatically!
    gb_model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )
    gb_model.fit(X_train, y_train)
    
    # MLflow automatically logged:
    # - All parameters (n_estimators, learning_rate, etc.)
    # - Training score
    # - Model artifact
    # - Model signature
    
    y_pred = gb_model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred)
    
    # Log test accuracy explicitly under a stable metric name
    mlflow.log_metric("test_accuracy", test_accuracy)
    
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print("Check MLflow UI - all parameters and model logged automatically!")

# Turn off autologging
mlflow.sklearn.autolog(disable=True)

### C. Comparing Multiple Models

**Real-world scenario**: Test 5 algorithms, pick the best one.

**Strategy**: Run all experiments in same MLflow Experiment, compare in UI or programmatically.

In [None]:
# Compare multiple models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
import xgboost as xgb

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'DecisionTree': DecisionTreeClassifier(max_depth=10, random_state=42),
    'RandomForest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, max_depth=5, random_state=42, eval_metric='logloss')
}

mlflow.set_experiment("Model_Comparison_Yield")

results = []
for model_name, model in models.items():
    with mlflow.start_run(run_name=f"{model_name}_comparison"):
        # Train
        model.fit(X_train, y_train)
        
        # Predict
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else y_pred
        
        # Metrics
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        auc = roc_auc_score(y_test, y_pred_proba) if hasattr(model, 'predict_proba') else acc
        
        # Log
        mlflow.log_params({'model_type': model_name, 'data_size': len(X_train)})
        mlflow.log_metrics({'accuracy': acc, 'f1_score': f1, 'auc': auc})
        mlflow.sklearn.log_model(model, "model")
        
        results.append({'Model': model_name, 'Accuracy': acc, 'F1': f1, 'AUC': auc})
        print(f"{model_name}: Acc={acc:.4f}, F1={f1:.4f}, AUC={auc:.4f}")

# Compare results
results_df = pd.DataFrame(results).sort_values('F1', ascending=False)
print("\n=== Model Ranking (by F1 Score) ===")
print(results_df.to_string(index=False))
print(f"\nBest model: {results_df.iloc[0]['Model']} (F1={results_df.iloc[0]['F1']:.4f})")

## 3. MLflow Models: Deployment

**MLflow Models** standardize model packaging for deployment across platforms.

**Key Features:**
- **Multi-framework support**: sklearn, TensorFlow, PyTorch, ONNX, custom Python
- **Model signature**: Input/output schema validation
- **Flavor system**: Multiple representations of same model (sklearn flavor + python_function flavor)
- **Deployment targets**: REST API, batch, Spark UDF, cloud (AWS SageMaker, Azure ML)

**Model URI formats:**
- `runs:/<run_id>/model` - Model from specific run
- `models:/<model_name>/<version>` - Model from registry by version
- `models:/<model_name>/<stage>` - Model from registry by stage (Staging/Production)
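The URI formats above can be assembled with two small helpers. This is only a convenience sketch; `run_model_uri` and `registry_model_uri` are hypothetical names, not part of the MLflow API.

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build a URI for a model logged under `artifact_path` in a given run."""
    if not run_id:
        raise ValueError("run_id must be non-empty")
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(model_name: str, version_or_stage) -> str:
    """Build a registry URI from a model name and a version number or stage name."""
    if not model_name:
        raise ValueError("model_name must be non-empty")
    return f"models:/{model_name}/{version_or_stage}"
```

Either string can then be passed to `mlflow.pyfunc.load_model(...)`.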

In [None]:
# Load model for inference
# Option 1: Load from run
run_id = mlflow.search_runs(
    experiment_names=["Model_Comparison_Yield"],
    order_by=["metrics.f1_score DESC"],
    max_results=1
).iloc[0]['run_id']

model_uri = f"runs:/{run_id}/model"
loaded_model = mlflow.pyfunc.load_model(model_uri)

# Make predictions
sample_device = pd.DataFrame({
    'Vdd_V': [1.21],
    'Idd_mA': [49.5],
    'freq_MHz': [1025],
    'temp_C': [26.5]
})

prediction = loaded_model.predict(sample_device)
print(f"Sample device prediction: {'PASS' if prediction[0] == 1 else 'FAIL'}")
print(f"Device parameters: Vdd={sample_device['Vdd_V'][0]}V, Idd={sample_device['Idd_mA'][0]}mA")

# Batch prediction
batch_devices = X_test.head(10)
batch_predictions = loaded_model.predict(batch_devices)
print(f"\nBatch prediction (10 devices): {batch_predictions}")
print(f"Pass rate: {batch_predictions.mean()*100:.1f}%")

## 4. MLflow Model Registry

**Model Registry** provides centralized model store with:
- **Versioning**: Automatic version numbering (v1, v2, v3...)
- **Stage management**: None → Staging → Production → Archived (stages are deprecated since MLflow 2.9 in favor of model version aliases)
- **Model lineage**: Track which run/experiment produced the model
- **Annotations**: Add descriptions, tags, comments to versions
- **Access control**: (Enterprise) Role-based permissions

**Typical workflow:**
1. Train model → log to MLflow Tracking
2. Register model → creates version in registry
3. Transition to Staging → testing/validation
4. Transition to Production → live serving
5. New model arrives → archive old Production model

In [None]:
# Complete Model Registry workflow
client = MlflowClient()

# 1. Register best model from comparison
model_name = "yield_predictor_production"

# Get best run
best_run = mlflow.search_runs(
    experiment_names=["Model_Comparison_Yield"],
    order_by=["metrics.f1_score DESC"],
    max_results=1
).iloc[0]

model_uri = f"runs:/{best_run['run_id']}/model"

# Register model
try:
    model_version = mlflow.register_model(model_uri, model_name)
    print(f"Registered {model_name} version {model_version.version}")
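except Exception as e:
    # register_model adds a new version when the name already exists,
    # so this branch only runs if registration itself failed
    print(f"Registration failed: {e}")
    # Fall back to the highest existing version number
    model_version = max(
        client.search_model_versions(f"name='{model_name}'"),
        key=lambda mv: int(mv.version)
    )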

# 2. Add description and tags
client.update_model_version(
    name=model_name,
    version=model_version.version,
    description=f"Yield predictor trained on {best_run['params.data_size']} devices. "
                f"F1 score: {best_run['metrics.f1_score']:.4f}. "
                f"Model type: {best_run['params.model_type']}. "
                f"Production-ready after validation."
)

client.set_model_version_tag(
    name=model_name,
    version=model_version.version,
    key="validation_status",
    value="passed"
)

client.set_model_version_tag(
    name=model_name,
    version=model_version.version,
    key="deployment_date",
    value="2025-12-13"
)

# 3. Transition to Staging
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging",
    archive_existing_versions=False
)
print(f"Model transitioned to Staging")

# 4. Simulate validation in staging
print("Validating model in Staging environment...")
staging_model = mlflow.pyfunc.load_model(f"models:/{model_name}/Staging")
staging_predictions = staging_model.predict(X_test)
staging_accuracy = accuracy_score(y_test, staging_predictions)
print(f"Staging validation accuracy: {staging_accuracy:.4f}")

# 5. Promote to Production (if validation passes)
if staging_accuracy > 0.85:
    client.transition_model_version_stage(
        name=model_name,
        version=model_version.version,
        stage="Production",
        archive_existing_versions=True  # Archive previous production model
    )
    print(f"✅ Model promoted to Production (accuracy {staging_accuracy:.4f} > threshold 0.85)")
else:
    print(f"❌ Model failed validation (accuracy {staging_accuracy:.4f} <= threshold 0.85)")

# 6. List all versions
print(f"\n=== All versions of {model_name} ===")
for mv in client.search_model_versions(f"name='{model_name}'"):
    print(f"Version {mv.version}: {mv.current_stage} (Run: {mv.run_id[:8]}...)")

## 5. Advanced MLflow Patterns

### A. Nested Runs (Parent-Child Hierarchy)

**Use case**: Hyperparameter tuning - parent run for grid search, child runs for each configuration.

**Benefits:**
- Organize related experiments hierarchically
- Compare child runs within parent context
- Track overall best result in parent run

In [None]:
# Nested runs for hyperparameter tuning
mlflow.set_experiment("Nested_Runs_Example")

param_grid = {
    'max_depth': [5, 10, 15],
    'n_estimators': [50, 100, 200]
}

with mlflow.start_run(run_name="GridSearch_Parent") as parent_run:
    best_f1 = 0
    best_params = None
    
    for max_depth in param_grid['max_depth']:
        for n_estimators in param_grid['n_estimators']:
            with mlflow.start_run(run_name=f"depth{max_depth}_trees{n_estimators}", nested=True):
                # Train model
                model = RandomForestClassifier(
                    max_depth=max_depth,
                    n_estimators=n_estimators,
                    random_state=42
                )
                model.fit(X_train, y_train)
                
                # Evaluate
                y_pred = model.predict(X_test)
                f1 = f1_score(y_test, y_pred)
                
                # Log to child run
                mlflow.log_params({'max_depth': max_depth, 'n_estimators': n_estimators})
                mlflow.log_metric('f1_score', f1)
                
                # Track best
                if f1 > best_f1:
                    best_f1 = f1
                    best_params = {'max_depth': max_depth, 'n_estimators': n_estimators}
                
                print(f"depth={max_depth}, trees={n_estimators}: F1={f1:.4f}")
    
    # Log best result to parent run
    mlflow.log_params(best_params)
    mlflow.log_metric('best_f1', best_f1)
    print(f"\nBest configuration: {best_params}, F1={best_f1:.4f}")

### B. Custom Metrics (Logging Over Time)

**Use case**: Track training loss per epoch, validation accuracy per iteration.

**Pattern**: Use `step` parameter in `log_metric()` to create time-series data.

In [None]:
# Log metrics over time (simulating training epochs)
with mlflow.start_run(run_name="Training_Progress_Tracking"):
    # Simulate 20 epochs of training
    for epoch in range(1, 21):
        # Simulate improving accuracy
        train_acc = 0.5 + (epoch / 20) * 0.4 + np.random.uniform(-0.02, 0.02)
        val_acc = 0.48 + (epoch / 20) * 0.38 + np.random.uniform(-0.03, 0.03)
        
        # Log at each epoch
        mlflow.log_metric("train_accuracy", train_acc, step=epoch)
        mlflow.log_metric("val_accuracy", val_acc, step=epoch)
        
        if epoch % 5 == 0:
            print(f"Epoch {epoch}: Train={train_acc:.4f}, Val={val_acc:.4f}")
    
    # Log final metrics
    mlflow.log_metric("final_train_accuracy", train_acc)
    mlflow.log_metric("final_val_accuracy", val_acc)
    
    print("Check MLflow UI - metrics plotted over epochs!")

## 🎯 Real-World MLflow Projects

### **Post-Silicon Validation Projects**

#### **Project 1: Automated Model Comparison Pipeline**
**Objective**: Systematically compare 10+ algorithms for yield prediction, track all experiments
- **Algorithms**: RF, XGBoost, LightGBM, CatBoost, Neural Network, SVM, Logistic Regression, KNN, Gradient Boosting, AdaBoost
- **Tracking**: Log all hyperparameters, cross-validation scores (5-fold), training time, inference latency
- **Experiments**: Create separate MLflow experiment for each use case (yield, test_time, binning)
- **Comparison**: Use MLflow UI to visualize accuracy vs training time scatter plot
- **Output**: Best model automatically promoted to registry Staging stage
- **Success Metric**: Find optimal model in 2 hours instead of 2 weeks manual testing
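The timing and cross-validation metrics listed for Project 1 can be collected with a couple of standard-library helpers before they are handed to MLflow. This is a sketch under assumptions: `timed` and `summarize_cv` are hypothetical helper names, and the metric keys are illustrative.

```python
import statistics
import time
from contextlib import contextmanager
from typing import Dict, Iterator, Sequence


@contextmanager
def timed(metrics: Dict[str, float], key: str) -> Iterator[None]:
    """Store the elapsed wall-clock seconds of the with-block under metrics[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics[key] = time.perf_counter() - start


def summarize_cv(scores: Sequence[float]) -> Dict[str, float]:
    """Mean and population standard deviation of per-fold scores."""
    return {
        "cv_mean": statistics.mean(scores),
        "cv_std": statistics.pstdev(scores),
    }
```

Inside a run, the training call would sit in `with timed(metrics, "training_time_s"):`, and the combined dictionary could be passed to `mlflow.log_metrics(metrics)`.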

#### **Project 2: Test Time Optimization Model Registry**
**Objective**: Manage multiple versions of test time reduction model with full lineage
- **Challenge**: 8 model versions over 6 months (v1.0: 8% reduction → v2.3: 32% reduction)
- **Registry**: Track each version with tags (data_range, accuracy, test_time_savings, deployed_stations)
- **Stages**: None → Staging (10% ATE) → Production (100% ATE) → Archived
- **Artifacts**: Save test sequence optimization recommendations as JSON per version
- **Monitoring**: Log false negative rate, test time distribution per version
- **Rollback**: Instant rollback to v2.1 if v2.3 increases FNR above threshold
- **Compliance**: Full audit trail for ISO 9001 quality requirements
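The rollback step in Project 2 can be sketched as follows. The sketch assumes registry versions are the integer version numbers MLflow assigns (labels such as "v2.1" would live in tags), and that `client` behaves like `mlflow.tracking.MlflowClient`. The helper names `previous_version` and `rollback` are hypothetical.

```python
from typing import Iterable, Optional


def previous_version(versions: Iterable[int], current: int) -> Optional[int]:
    """Highest version number strictly below `current`, or None if there is none."""
    older = [int(v) for v in versions if int(v) < int(current)]
    return max(older) if older else None


def rollback(client, model_name: str, current_version: int,
             available_versions: Iterable[int]) -> int:
    """Move the previous version back to Production.

    `client` is expected to expose transition_model_version_stage(),
    as MlflowClient does.
    """
    target = previous_version(available_versions, current_version)
    if target is None:
        raise ValueError("no earlier version to roll back to")
    client.transition_model_version_stage(
        name=model_name,
        version=str(target),
        stage="Production",
        archive_existing_versions=True,  # archives whatever is in Production now
    )
    return target
```

Passing the client in as an argument keeps the decision logic testable without a running registry.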

#### **Project 3: Wafer Map Classifier with Nested Runs**
**Objective**: Hyperparameter tuning for spatial anomaly detection with organized tracking
- **Parent run**: "SpatialFeature_GridSearch" (tracks overall best configuration)
- **Child runs**: 120 combinations (8 PCA components × 5 contamination values × 3 spatial kernels)
- **Metrics**: F1 score, precision, recall, inference time per wafer map
- **Artifacts**: Save best/worst wafer map examples, confusion matrices
- **Best config**: PCA(15) + contamination(0.05) + Gaussian kernel → F1=0.91
- **Production**: Deploy best model via MLflow Models REST API (<200ms per wafer)
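The 120 child runs in Project 3 come from a Cartesian product of three parameter lists. The sketch below enumerates them with `itertools.product`. The concrete values are illustrative assumptions (only the counts 8, 5, and 3 and the best configuration come from the project description), and `grid_configs` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, Iterator

# Illustrative values; only the list lengths (8, 5, 3) come from the text above
PCA_COMPONENTS = [5, 10, 15, 20, 25, 30, 40, 50]
CONTAMINATION = [0.01, 0.02, 0.05, 0.08, 0.10]
SPATIAL_KERNELS = ["gaussian", "uniform", "laplacian"]


def grid_configs() -> Iterator[Dict]:
    """Yield one run name and parameter dict per grid combination."""
    for n_pca, contam, kernel in product(PCA_COMPONENTS, CONTAMINATION, SPATIAL_KERNELS):
        yield {
            "run_name": f"pca{n_pca}_cont{contam}_{kernel}",
            "params": {
                "n_pca_components": n_pca,
                "contamination": contam,
                "spatial_kernel": kernel,
            },
        }


configs = list(grid_configs())
```

Each entry maps to one child run, for example `mlflow.start_run(run_name=cfg["run_name"], nested=True)` followed by `mlflow.log_params(cfg["params"])`.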

#### **Project 4: Device Binning Model Deployment**
**Objective**: Deploy binning classifier with MLflow Models to production ATE
- **Model**: Multi-class classifier (Premium/Standard/Economy bins)
- **Signature**: Define input schema (15 electrical parameters) and output (bin_label + confidence)
- **Deployment**: Serve via `mlflow models serve` REST API
- **Integration**: ATE calls API with device parameters, receives bin recommendation
- **Monitoring**: Track bin distribution drift (expected 60/30/10 split)
- **Update process**: Retrain monthly, register new version, A/B test on 5% of devices
- **Value**: 98.7% binning accuracy, $1.2M revenue optimization annually
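One way to quantify the bin distribution drift in Project 4 is the population stability index (PSI) against the expected 60/30/10 split. PSI is a swapped-in technique chosen for illustration; the project description does not name a drift metric. The function name and the `eps` floor are assumptions.

```python
import math
from typing import Dict

EXPECTED_SPLIT = {"Premium": 0.60, "Standard": 0.30, "Economy": 0.10}


def population_stability_index(expected: Dict[str, float],
                               observed_counts: Dict[str, int],
                               eps: float = 1e-6) -> float:
    """PSI = sum over bins of (observed - expected) * ln(observed / expected).

    Shares are floored at `eps` so that an empty bin does not produce log(0).
    """
    total = sum(observed_counts.values())
    if total <= 0:
        raise ValueError("observed_counts must contain at least one device")
    psi = 0.0
    for label, exp_share in expected.items():
        obs_share = observed_counts.get(label, 0) / total
        e = max(exp_share, eps)
        o = max(obs_share, eps)
        psi += (o - e) * math.log(o / e)
    return psi
```

The resulting value could be logged per monitoring window with `mlflow.log_metric("bin_psi", value, step=window_index)`; the alert threshold is a policy decision that this sketch leaves open.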

---

### **General AI/ML Projects**

#### **Project 5: Customer Churn Prediction with Autologging**
**Objective**: Rapid experimentation for churn prediction using MLflow autologging
- **Framework**: XGBoost, LightGBM (both support autologging)
- **Data**: Customer usage metrics (100K users, 50 features)
- **Tracking**: Enable autologging → train 20 models in 30 minutes with full tracking
- **Comparison**: Compare all runs in MLflow UI (AUC vs training time)
- **Best model**: LightGBM with class weights → AUC=0.87, 2min training
- **Deployment**: Register to model registry, serve via Flask API
- **Business impact**: Identify 15% of high-risk customers, reduce churn by 22%

#### **Project 6: Recommendation System Model Registry**
**Objective**: Manage lifecycle of recommendation models with staged rollouts
- **Models**: Content-based (v1.0), Collaborative filtering (v1.5), Hybrid (v2.0)
- **Registry**: Track each approach with metadata (training_data_size, cold_start_performance)
- **A/B testing**: Deploy v2.0 to Staging → serve to 10% of users → measure CTR improvement
- **Promotion**: If CTR improves by >5%, promote to Production, archive v1.5
- **Artifacts**: Save user embeddings, item embeddings, similarity matrices per version
- **Monitoring**: Track CTR, conversion rate, user engagement metrics
- **Success**: Hybrid model (v2.0) → 18% CTR improvement, promoted to 100% traffic
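The promotion rule in Project 6 can be sketched as a relative-lift check. This assumes "improves by >5%" means relative CTR lift over the control group and deliberately ignores statistical significance, which a real A/B test would also need. All function names are hypothetical.

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate as a fraction."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def relative_lift(control_ctr: float, treatment_ctr: float) -> float:
    """Relative improvement of treatment over control, e.g. 0.18 for +18%."""
    if control_ctr <= 0:
        raise ValueError("control CTR must be positive")
    return (treatment_ctr - control_ctr) / control_ctr


def promote_on_ctr(control_ctr: float, treatment_ctr: float,
                   min_lift: float = 0.05) -> bool:
    """True when the relative lift exceeds `min_lift` (assumed reading of '>5%')."""
    return relative_lift(control_ctr, treatment_ctr) > min_lift
```

Logging the measured lift as a model version tag keeps the promotion decision auditable in the registry.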

#### **Project 7: Fraud Detection with Nested Experiments**
**Objective**: Optimize fraud detection with hierarchical experiment tracking
- **Parent runs**: Different feature engineering strategies (4 approaches)
- **Child runs**: Model variations within each strategy (5 models × 4 strategies = 20 runs)
- **Metrics**: Precision, recall, F1, false positive rate, inference latency
- **Best**: Feature strategy "temporal_patterns" + LightGBM → F1=0.89, FPR=0.8%
- **Production**: Deploy via MLflow Models with <50ms latency requirement
- **Monitoring**: Track prediction distribution, alert if fraud rate changes >2%
- **Value**: Block $3.5M fraud annually, maintain customer satisfaction (low FPR)

#### **Project 8: Time Series Forecasting with Metric Tracking**
**Objective**: Track model performance over time for demand forecasting
- **Model**: Prophet + XGBoost ensemble
- **Training**: Retrain weekly on a rolling window (last 365 days)
- **Tracking**: Log MAPE, MAE, RMSE at each retrain (52 runs per year)
- **Metrics over time**: Use step parameter to track performance degradation
- **Alerting**: If MAPE increases >15% for 2 consecutive weeks, trigger investigation
- **Seasonality**: Track model performance across seasons (holiday vs normal periods)
- **Registry**: Version models by quarter (Q1_2025, Q2_2025, etc.)
- **Business value**: Reduce inventory costs by 28%, improve forecast accuracy to MAPE < 8%
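The alerting rule in Project 8 can be sketched as a streak counter over weekly MAPE values. This assumes "increases >15%" is measured relative to a fixed baseline MAPE; the `baseline` argument and both function names are hypothetical.

```python
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def degradation_alert(weekly_mape: Sequence[float], baseline: float,
                      rel_increase: float = 0.15, consecutive: int = 2) -> bool:
    """True if MAPE exceeded baseline * (1 + rel_increase) for `consecutive` weeks in a row."""
    limit = baseline * (1 + rel_increase)
    streak = 0
    for value in weekly_mape:
        if value > limit:
            streak += 1
            if streak >= consecutive:
                return True
        else:
            streak = 0  # a healthy week resets the streak
    return False
```

The weekly values themselves can be read back from the tracking server, since each retrain logs MAPE as a metric.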

## 📚 Comprehensive Takeaways

### **🎯 MLflow Complete Overview**

**MLflow** is an open-source platform for managing the ML lifecycle from experimentation to production deployment.

**Four Pillars:**
1. **Tracking**: Log experiments (parameters, metrics, artifacts)
2. **Projects**: Package ML code for reproducibility
3. **Models**: Deploy models to multiple targets
4. **Registry**: Version and manage production models

**Why MLflow Wins:**
- ✅ **Open-source**: No vendor lock-in, free forever
- ✅ **Framework-agnostic**: sklearn, TensorFlow, PyTorch, XGBoost, custom
- ✅ **Simple API**: 3 lines to start tracking experiments
- ✅ **Production-ready**: Used by Uber, Databricks, Microsoft, Netflix
- ✅ **Active community**: 20K+ GitHub stars, frequent updates

---

### **🔧 MLflow Tracking Deep Dive**

#### **1. Core Tracking API**

**Hierarchy:**
- **Experiment**: Collection of related runs (e.g., "Yield_Prediction_2025")
- **Run**: Single execution (e.g., "RandomForest_v1_20250113")
- **Metrics, Parameters, Artifacts**: Data logged within run

**Essential functions:**
```python
# Setup
mlflow.set_tracking_uri("file:./mlruns")  # Local or "http://server:5000"; set before selecting the experiment
mlflow.set_experiment("experiment_name")  # Create/select experiment

# Tracking
with mlflow.start_run(run_name="descriptive_name"):
    mlflow.log_param("learning_rate", 0.01)        # Single parameter
    mlflow.log_params({"n_estimators": 100, ...})  # Multiple parameters
    
    mlflow.log_metric("accuracy", 0.92)            # Single metric
    mlflow.log_metrics({"f1": 0.89, "auc": 0.94})  # Multiple metrics
    mlflow.log_metric("loss", 0.15, step=10)       # Metric at epoch 10
    
    mlflow.log_artifact("plot.png")                # Log file
    mlflow.sklearn.log_model(model, "model")       # Log model
    mlflow.set_tag("environment", "production")    # Add metadata
```

#### **2. What to Log**

**Parameters (immutable):**
- Hyperparameters: `n_estimators`, `learning_rate`, `max_depth`
- Config: `batch_size`, `random_seed`, `optimizer`
- Data info: `train_size`, `test_size`, `data_version`

**Metrics (can update):**
- Performance: `accuracy`, `f1_score`, `auc`, `precision`, `recall`
- Loss: `train_loss`, `val_loss` (logged per epoch with `step`)
- Business metrics: `revenue_impact`, `cost_savings`, `latency_ms`

**Artifacts (files):**
- Models: `model.pkl`, `model.h5`
- Plots: `confusion_matrix.png`, `feature_importance.png`, `roc_curve.png`
- Data: `predictions.csv`, `feature_stats.json`
- Reports: `model_card.md`, `validation_report.pdf`

**Tags (metadata):**
- `environment`: dev/staging/production
- `ml_engineer`: team_member_name
- `data_version`: v2.3
- `use_case`: yield_prediction

#### **3. Autologging**

**Instant setup** for supported frameworks:
```python
# Enable autologging
mlflow.sklearn.autolog()   # scikit-learn
mlflow.xgboost.autolog()   # XGBoost
mlflow.tensorflow.autolog()  # TensorFlow/Keras
mlflow.pytorch.autolog()   # PyTorch

# Train model - MLflow logs everything automatically!
model.fit(X_train, y_train)

# Disable when done
mlflow.sklearn.autolog(disable=True)
```

**What gets auto-logged:**
- All model hyperparameters
- Training/validation metrics
- Model artifact (serialized model)
- Model signature (input/output schema)
- Training dataset metadata

**When to use:**
- ✅ Rapid prototyping (test 10 models in 10 minutes)
- ✅ Standard workflows (no custom metrics needed)
- ❌ Complex pipelines (need fine-grained control)
- ❌ Custom metrics (autologging won't capture them)

#### **4. Comparing Experiments**

**Programmatic comparison:**
```python
runs = mlflow.search_runs(
    experiment_names=["Yield_Prediction"],
    filter_string="metrics.accuracy > 0.9",
    order_by=["metrics.f1_score DESC"],
    max_results=10
)

print(runs[['run_id', 'params.model_type', 'metrics.accuracy', 'metrics.f1_score']])
```

**UI comparison:**
- MLflow UI: http://127.0.0.1:5000
- Select multiple runs → "Compare" button
- Visualize: Parallel coordinates plot, scatter plot (accuracy vs training_time)

---

### **🚀 MLflow Models**

**Model packaging** for deployment across platforms.

#### **1. Model Flavors**

**Flavor** = representation of model in specific format.

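**Models logged with MLflow's built-in flavors typically carry:**
- **python_function flavor** (universal): Generic `predict()` interface that serving and batch tools load, without framework-specific features
- **Native flavor** (framework-specific): sklearn, tensorflow, pytorch; loads the original model object with its full API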

**Example:**
```python
# Log model with automatic flavors
mlflow.sklearn.log_model(model, "model", signature=signature)

# Logged as:
# - sklearn flavor (for sklearn-native loading)
# - python_function flavor (for framework-agnostic loading)
```

#### **2. Model Signature**

**Signature** defines input/output schema for validation.

```python
from mlflow.models.signature import ModelSignature, infer_signature

# Infer from data
signature = infer_signature(X_train, model.predict(X_train))

# Manual definition
from mlflow.types.schema import Schema, ColSpec
input_schema = Schema([
    ColSpec("double", "Vdd_V"),
    ColSpec("double", "Idd_mA"),
    ColSpec("double", "freq_MHz"),
    ColSpec("double", "temp_C")
])
output_schema = Schema([ColSpec("long")])
signature = ModelSignature(inputs=input_schema, outputs=output_schema)

# Log with signature
mlflow.sklearn.log_model(model, "model", signature=signature)
```

**Benefits:**
- Validates input data at prediction time
- Documents expected input format
- Prevents errors in production

#### **3. Loading Models**

**Three URI formats:**
```python
# From specific run
model = mlflow.pyfunc.load_model("runs:/<run_id>/model")

# From registry by version
model = mlflow.pyfunc.load_model("models:/<model_name>/1")

# From registry by stage
model = mlflow.pyfunc.load_model("models:/<model_name>/Production")
```

**Making predictions:**
```python
# Single prediction
prediction = model.predict(pd.DataFrame([{...}]))

# Batch prediction
predictions = model.predict(df)
```

#### **4. Deployment Options**

**A. REST API (local)**
```bash
mlflow models serve -m runs:/<run_id>/model -p 5001

# Test
curl -X POST http://127.0.0.1:5001/invocations \
  -H 'Content-Type: application/json' \
  -d '{"dataframe_records": [{"Vdd_V": 1.2, "Idd_mA": 50, ...}]}'
```
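
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server from the command above is listening on port 5001; `build_payload` and `predict_via_rest` are hypothetical helper names, and the feature values are made up for illustration:

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_payload(records: List[Dict[str, Any]]) -> bytes:
    """Encode rows in the `dataframe_records` format used by the curl example."""
    return json.dumps({"dataframe_records": records}).encode("utf-8")


def predict_via_rest(records: List[Dict[str, Any]],
                     url: str = "http://127.0.0.1:5001/invocations",
                     timeout: float = 10.0) -> Any:
    """POST rows to a running `mlflow models serve` endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=build_payload(records),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the payload needs no server; the values below are illustrative only
payload = build_payload(
    [{"Vdd_V": 1.2, "Idd_mA": 50.0, "freq_MHz": 2000.0, "temp_C": 25.0}]
)
print(payload.decode("utf-8"))
```

Calling `predict_via_rest([...])` only works while the model server is running; keeping payload construction separate makes the request format easy to check offline.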

**B. Batch (Spark UDF)**
```python
# Load as Spark UDF
predict_udf = mlflow.pyfunc.spark_udf(spark, "runs:/<run_id>/model")

# Apply to Spark DataFrame
df = df.withColumn("prediction", predict_udf(*df.columns))
```

**C. Cloud deployment**
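- **AWS SageMaker**: `mlflow.sagemaker.deploy()` in MLflow 1.x; MLflow 2.x moved this to the `mlflow.deployments` API (target `sagemaker`)
- **Azure ML**: `mlflow.azureml.deploy()` in MLflow 1.x; newer releases rely on the `azureml-mlflow` plugin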
- **Google Cloud**: Use `gcloud` CLI with MLflow model artifact

**D. Docker container**
```bash
# Build Docker image
mlflow models build-docker -m runs:/<run_id>/model -n my_model

# Run container
docker run -p 5001:8080 my_model
```

---

### **📦 MLflow Model Registry**

**Centralized model store** with versioning, stages, and governance.

#### **1. Registry Workflow**

```python
from mlflow.tracking import MlflowClient
client = MlflowClient()

# Register model
model_version = mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name="yield_predictor"
)
# Creates version 1, 2, 3... automatically

# Add description
client.update_model_version(
    name="yield_predictor",
    version=1,
    description="Trained on 50K devices, F1=0.92"
)

# Add tags
client.set_model_version_tag(
    name="yield_predictor",
    version=1,
    key="validation_status",
    value="passed"
)

# Transition stages
client.transition_model_version_stage(
    name="yield_predictor",
    version=1,
    stage="Staging"  # None → Staging → Production → Archived
)

# Promote to production
client.transition_model_version_stage(
    name="yield_predictor",
    version=1,
    stage="Production",
    archive_existing_versions=True  # Demote previous production model
)
```

#### **2. Stage Management**

**Four stages:**
- **None**: Newly registered, not yet validated
- **Staging**: Testing/validation in progress
- **Production**: Live model serving predictions
- **Archived**: Retired models (kept for audit/rollback)

**Typical flow:**
1. Register model → None
2. Validate accuracy/fairness → Transition to Staging
3. A/B test in staging → Monitor metrics
4. If successful → Transition to Production (archive old)
5. If new model arrives → Archive current Production
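
The flow above is a team convention: the registry call takes a target stage directly, so a guard like the following can be placed in front of `transition_model_version_stage`. It is a sketch under stated assumptions; `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical names, and letting an archived version return to service is an assumption made to support rollback:

```python
from typing import Dict, Set

# Team policy derived from the typical flow; adjust to your own governance rules
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Staging", "Production"},  # assumption: rollback may restore an archived version
}


def can_transition(current: str, target: str) -> bool:
    """True if the policy allows moving a model version from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


for current, target in [("None", "Staging"), ("None", "Production"), ("Staging", "Production")]:
    print(f"{current} -> {target}: {can_transition(current, target)}")
```

Checking `can_transition(mv.current_stage, "Production")` before promoting prevents a freshly registered version from skipping Staging.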

#### **3. Model Lineage**

**Every registered model tracks:**
- Source run ID (which experiment/run produced it)
- Training parameters (from run)
- Training metrics (from run)
- Artifacts (plots, data from run)
- Stage history (when transitioned, by whom)

**Query lineage:**
```python
# Get model version details
mv = client.get_model_version("yield_predictor", 1)
print(f"Run ID: {mv.run_id}")
print(f"Stage: {mv.current_stage}")
print(f"Description: {mv.description}")

# Get source run details
run = client.get_run(mv.run_id)
print(f"Parameters: {run.data.params}")
print(f"Metrics: {run.data.metrics}")
```

**Audit trail:**
- "Show me which model version was in production on 2024-03-15"
- "Which experiment produced the current production model?"
- "What were the training parameters for version 3?"

---

### **🎓 Advanced Patterns**

#### **1. Nested Runs**

**Use case**: Organize hyperparameter tuning experiments.

```python
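best_f1, best_params = -1.0, None

with mlflow.start_run(run_name="GridSearch_Parent"):
    for param1 in [10, 20, 30]:
        for param2 in [0.01, 0.1]:
            with mlflow.start_run(run_name=f"p1_{param1}_p2_{param2}", nested=True):
                params = {"param1": param1, "param2": param2}
                mlflow.log_params(params)
                # train_and_score is a placeholder for your own training/evaluation code
                f1 = train_and_score(**params)
                mlflow.log_metric("f1_score", f1)  # Logged to the child run
                if f1 > best_f1:
                    best_f1, best_params = f1, params

    # Log best config to parent run
    mlflow.log_params(best_params)
    mlflow.log_metric("best_f1", best_f1)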
```

**Benefits:**
- Parent run shows overall best result
- Child runs show individual configurations
- Hierarchical organization in UI

#### **2. Metric Tracking Over Time**

**Use case**: Track loss per epoch, validation accuracy over iterations.

```python
with mlflow.start_run():
    for epoch in range(100):
        train_loss = train_one_epoch()
        val_accuracy = validate()
        
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)
```

**Result**: MLflow UI plots metrics as time-series graphs.

#### **3. Custom Python Models**

**Use case**: Deploy complex pipelines (preprocessing + model + postprocessing).

```python
import mlflow.pyfunc

class YieldPredictorPipeline(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Load preprocessor and model from the artifact paths MLflow resolved
        import pickle
        with open(context.artifacts["preprocessor"], "rb") as f:
            self.preprocessor = pickle.load(f)
        with open(context.artifacts["model"], "rb") as f:
            self.model = pickle.load(f)

    def predict(self, context, model_input):
        # Preprocess
        X = self.preprocessor.transform(model_input)
        # Predict
        predictions = self.model.predict(X)
        # Postprocess (0.95 is a placeholder; derive a real confidence, e.g. from predict_proba)
        return {"prediction": predictions.tolist(), "confidence": 0.95}

# Log custom model (both pickle files must already exist on disk)
artifacts = {"preprocessor": "preprocessor.pkl", "model": "model.pkl"}
mlflow.pyfunc.log_model("custom_model", python_model=YieldPredictorPipeline(), artifacts=artifacts)
```

---

### **⚙️ MLflow Configuration**

#### **Tracking URI Options**

**Local file system:**
```python
mlflow.set_tracking_uri("file:./mlruns")  # Default
```

**SQLite database:**
```python
mlflow.set_tracking_uri("sqlite:///mlflow.db")
```

**PostgreSQL (production):**
```python
mlflow.set_tracking_uri("postgresql://user:pass@host:5432/mlflow_db")
```

**Remote server:**
```python
mlflow.set_tracking_uri("http://mlflow-server:5000")
```

**Databricks:**
```python
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/user@example.com/experiments/yield_prediction")
```

#### **Artifact Storage**

**Local:**
```python
# Stored in ./mlruns/<experiment_id>/<run_id>/artifacts/
```

**S3:**
```python
import os

mlflow.set_tracking_uri("http://mlflow-server:5000")
# Only needed when pointing at a non-default or S3-compatible endpoint
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "https://s3.amazonaws.com"
# Artifacts stored in s3://bucket/path/
```

**Azure Blob:**
```python
import os

os.environ["AZURE_STORAGE_CONNECTION_STRING"] = "..."
# Artifacts stored in wasbs://container@account/path/
```

---

### **🚀 Production Best Practices**

#### **1. Experiment Organization**

**Strategy**: One experiment per use case/project.

```
Experiments:
- Yield_Prediction_2025
  - RandomForest_baseline
  - XGBoost_v1
  - XGBoost_v2_optimized
- Test_Time_Optimization
  - Sequential_Model
  - Parallel_Model
- Wafer_Map_Clustering
  - KMeans
  - DBSCAN
  - Hierarchical
```

**Don't**: Create one experiment per run (clutters UI).

#### **2. Naming Conventions**

**Run names**: `<model_type>_<variant>_<date>`
- Example: `RandomForest_tuned_20250113`

**Tags**: Use consistent keys
- `environment`: dev/staging/production
- `data_version`: v1.0, v2.1
- `use_case`: yield/binning/test_time

**Model names** (registry): `<use_case>_<model_type>`
- Example: `yield_predictor_xgboost`
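
Generating names in code keeps the conventions above consistent across a team. A minimal sketch; `build_run_name` and `build_model_name` are hypothetical helper names, and lowercasing registry names is an assumption taken from the example above:

```python
from datetime import date
from typing import Optional


def build_run_name(model_type: str, variant: str, run_date: Optional[date] = None) -> str:
    """Format a run name as <model_type>_<variant>_<date> with a YYYYMMDD date."""
    run_date = run_date or date.today()
    return f"{model_type}_{variant}_{run_date:%Y%m%d}"


def build_model_name(use_case: str, model_type: str) -> str:
    """Format a registry name as <use_case>_<model_type>, lowercased."""
    return f"{use_case}_{model_type}".lower()


print(build_run_name("RandomForest", "tuned", date(2025, 1, 13)))
print(build_model_name("yield_predictor", "XGBoost"))
```

The result can be passed straight to `mlflow.start_run(run_name=...)` or used as the `name` argument when registering a model.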

#### **3. Logging Strategy**

**Always log:**
- ✅ All hyperparameters (even defaults)
- ✅ Data size (train/test split sizes)
- ✅ Random seed (for reproducibility)
- ✅ Training time (for cost analysis)
- ✅ Model artifact (for deployment)

**Post-silicon specific:**
- ✅ STDF data date range
- ✅ Test floor location/equipment
- ✅ Device type/process node
- ✅ Pass/fail thresholds
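
One way to enforce such a checklist is to verify the parameter dictionary before it is logged. The sketch below is illustrative: the key names in `REQUIRED_PARAM_KEYS` are hypothetical and cover only part of the lists above, and the example values are invented:

```python
from typing import Any, Dict, List

# Hypothetical key names; adapt them to your team's conventions
REQUIRED_PARAM_KEYS = (
    "random_seed",
    "n_train",
    "n_test",
    "stdf_date_start",
    "stdf_date_end",
    "device_type",
)


def missing_param_keys(params: Dict[str, Any]) -> List[str]:
    """Return the required keys that are absent from the params about to be logged."""
    return [key for key in REQUIRED_PARAM_KEYS if key not in params]


# Illustrative parameter set that forgets the STDF date range
params = {
    "n_estimators": 100,
    "random_seed": 42,
    "n_train": 40000,
    "n_test": 10000,
    "device_type": "example_device",
}
print(missing_param_keys(params))
```

Raising an error when the returned list is non-empty, just before `mlflow.log_params(params)`, stops incomplete runs from entering the experiment.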

#### **4. Model Registry Governance**

**Policy**: Only models meeting criteria can reach Production.

**Validation gates:**
1. **Accuracy gate**: Accuracy > 0.90
2. **Fairness gate**: No bias across device types
3. **Latency gate**: Inference < 50ms (for real-time use)
4. **A/B test**: Staging model performs ≥ baseline

**Implementation:**
```python
def promote_to_production(model_name, version, client):
    # Get model metrics
    mv = client.get_model_version(model_name, version)
    run = client.get_run(mv.run_id)
    accuracy = float(run.data.metrics["accuracy"])
    
    # Validation gates
    if accuracy < 0.90:
        print(f"❌ Failed accuracy gate: {accuracy:.4f} < 0.90")
        return False
    
    # Promote
    client.transition_model_version_stage(
        name=model_name,
        version=version,
        stage="Production",
        archive_existing_versions=True
    )
    print("✅ Promoted to Production")
    return True
```
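
The function above checks only the accuracy gate. All four gates could be evaluated together before promotion, as in this sketch. It rests on assumptions: the fairness gate is approximated as a maximum accuracy gap across device types, the 0.05 tolerance is invented for illustration, and `evaluate_gates` is a hypothetical helper:

```python
from typing import Dict, List


def evaluate_gates(
    accuracy: float,
    p95_latency_ms: float,
    baseline_accuracy: float,
    accuracy_by_device: Dict[str, float],
    min_accuracy: float = 0.90,
    max_latency_ms: float = 50.0,
    max_device_gap: float = 0.05,  # assumed fairness tolerance, not from the source
) -> List[str]:
    """Return the names of failed gates; an empty list means the model may be promoted."""
    failures: List[str] = []
    if accuracy < min_accuracy:
        failures.append("accuracy")
    if accuracy_by_device:
        gap = max(accuracy_by_device.values()) - min(accuracy_by_device.values())
        if gap > max_device_gap:
            failures.append("fairness")
    if p95_latency_ms >= max_latency_ms:
        failures.append("latency")
    if accuracy < baseline_accuracy:
        failures.append("ab_test")
    return failures


# Illustrative numbers only
print(evaluate_gates(0.93, 45.0, 0.91, {"type_a": 0.94, "type_b": 0.92}))
print(evaluate_gates(0.88, 60.0, 0.91, {"type_a": 0.95, "type_b": 0.80}))
```

Returning the list of failed gates, instead of a single boolean, makes it straightforward to record the reason for a rejection as a model version tag.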

#### **5. Monitoring Production Models**

**What to track:**
- Prediction volume (predictions/day)
- Latency (p50, p95, p99)
- Prediction distribution (drift detection)
- Error rate (exceptions, timeouts)

**Implementation**:
```python
from datetime import date

# Log production metrics daily (values shown are illustrative)
with mlflow.start_run(run_name=f"production_monitor_{date.today():%Y%m%d}"):
    mlflow.log_metric("daily_predictions", 100000)
    mlflow.log_metric("p95_latency_ms", 45)
    mlflow.log_metric("error_rate", 0.002)
    mlflow.log_artifact("prediction_distribution.png")  # File must already exist
```
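
The logged values have to be computed from raw serving data first. The sketch below shows one possible approach using only the standard library: latency percentiles via `statistics.quantiles`, and the Population Stability Index (PSI) as a drift score. PSI is a choice made here for illustration, since the text does not name a specific drift measure, and both function names are hypothetical:

```python
import math
from statistics import quantiles
from typing import Dict, List, Sequence


def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """p50/p95/p99 from raw request latencies (needs at least two samples)."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points; cuts[i] is percentile i + 1
    return {
        "p50_latency_ms": cuts[49],
        "p95_latency_ms": cuts[94],
        "p99_latency_ms": cuts[98],
    }


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference sample and a recent sample, binned on the reference range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1  # clamp out-of-range values
        return [max(count / len(values), eps) for count in counts]

    exp_p, act_p = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(act_p, exp_p))


# Synthetic data for illustration
latencies = [float(ms) for ms in range(1, 101)]
summary = latency_summary(latencies)
reference = [0.1 * i for i in range(100)]
shifted = [value + 3.0 for value in reference]
print(summary)
print(population_stability_index(reference, shifted))
```

Each entry of `summary` and the PSI value can then be passed to `mlflow.log_metric` inside the daily monitoring run.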

---

### **⚠️ Common Pitfalls**

#### **1. Logging Too Much**
- **Problem**: Logging every intermediate step → 1000 metrics per run → slow UI
- **Solution**: Log only essential metrics, aggregate intermediate results
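
As a rough sketch of that aggregation, per-batch losses can be reduced to one value per epoch before anything is logged. `log_epoch_summary` is a hypothetical helper, and `log_fn` stands in for `mlflow.log_metric`:

```python
from statistics import mean
from typing import Callable, Iterable, List, Optional, Tuple


def log_epoch_summary(batch_losses: Iterable[float], epoch: int,
                      log_fn: Callable[..., None]) -> float:
    """Log one aggregated loss per epoch instead of one metric point per batch."""
    epoch_loss = mean(list(batch_losses))
    log_fn("train_loss", epoch_loss, step=epoch)
    return epoch_loss


# A list-backed logger keeps the sketch runnable without a tracking server
logged: List[Tuple[str, float, Optional[int]]] = []


def record(key: str, value: float, step: Optional[int] = None) -> None:
    logged.append((key, value, step))


log_epoch_summary([0.25, 0.5, 0.75], epoch=0, log_fn=record)
print(logged)
```

With `log_fn=mlflow.log_metric`, a 100-epoch run produces 100 points for `train_loss` regardless of how many batches each epoch contains.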

#### **2. Not Logging Enough**
- **Problem**: Forgot to log random seed → can't reproduce result
- **Solution**: Log ALL parameters that affect outcome (hyperparameters, seeds, data versions)

#### **3. Inconsistent Naming**
- **Problem**: Runs named "test1", "test2", "final", "final_v2" → can't find anything
- **Solution**: Use consistent naming: `<model>_<variant>_<date>`

#### **4. No Model Signatures**
- **Problem**: Production API receives wrong input format → crashes
- **Solution**: Always use signatures for validation

#### **5. Cluttered Experiments**
- **Problem**: 500 runs in a single catch-all experiment named "Experiments" → impossible to navigate
- **Solution**: One experiment per project/use case

#### **6. No Rollback Plan**
- **Problem**: Production model fails, previous version deleted
- **Solution**: Use `archive_existing_versions=True` (keeps old models), test rollback procedure
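
A rollback procedure might look like the sketch below. It assumes that the archived version with the highest version number is the one to restore, which holds only if versions were promoted in order. `rollback_production` is a hypothetical helper, and `FakeRegistryClient` is a stand-in that mimics the two `MlflowClient` methods used, so the logic can be dry-run without a registry:

```python
from types import SimpleNamespace
from typing import Any, Optional


def rollback_production(client: Any, model_name: str) -> Optional[int]:
    """Restore the highest-numbered archived version to Production; None if there is none."""
    versions = client.search_model_versions(f"name='{model_name}'")
    archived = [mv for mv in versions if mv.current_stage == "Archived"]
    if not archived:
        return None
    previous = max(archived, key=lambda mv: int(mv.version))
    client.transition_model_version_stage(
        name=model_name,
        version=previous.version,
        stage="Production",
        archive_existing_versions=True,  # the failing Production version gets archived
    )
    return int(previous.version)


class FakeRegistryClient:
    """Stand-in for MlflowClient, for testing the rollback logic only."""

    def __init__(self) -> None:
        self.versions = [
            SimpleNamespace(version="1", current_stage="Archived"),
            SimpleNamespace(version="2", current_stage="Archived"),
            SimpleNamespace(version="3", current_stage="Production"),
        ]

    def search_model_versions(self, filter_string: str):
        return list(self.versions)

    def transition_model_version_stage(self, name, version, stage,
                                       archive_existing_versions=False):
        for mv in self.versions:
            if archive_existing_versions and mv.current_stage == stage:
                mv.current_stage = "Archived"
        for mv in self.versions:
            if mv.version == version:
                mv.current_stage = stage


fake = FakeRegistryClient()
restored = rollback_production(fake, "yield_predictor")
print(restored, [(mv.version, mv.current_stage) for mv in fake.versions])
```

Against a real registry, the same function would be called with an `MlflowClient()` instance; rehearsing it on a non-critical model first is the "test rollback procedure" step.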

---

### **🔮 Next Steps**

**After mastering MLflow:**
1. **123_Model_Monitoring_Drift_Detection.ipynb** → Monitor deployed models for drift
2. **124_Feature_Store_Implementation.ipynb** → Centralize feature engineering
3. **125_ML_Pipeline_Orchestration.ipynb** → Automate with Airflow/Kubeflow
4. **131_Docker_Fundamentals.ipynb** → Containerize MLflow deployments

**Hands-On Practice:**
- Set up MLflow tracking server (shared team server)
- Track 10 experiments for real use case
- Register model, test stage transitions
- Deploy model via REST API, test with curl
- Implement nested runs for hyperparameter tuning
- Create custom Python model for complex pipeline

---

**You now have complete mastery of MLflow for production ML! 🚀**

**Key skill acquired**: Systematic experiment tracking, model versioning, and deployment - the foundation of professional ML engineering.