# Chapter 11: AutoML in Production - MLOps Integration

This notebook accompanies Chapter 11 of the O'Reilly AutoML book. It demonstrates practical MLOps integration patterns for AutoML systems, including:

1. MLflow experiment tracking with AutoGluon
2. Hierarchical (parent/child) experiment organization
3. Model registry and versioning
4. Kubeflow pipeline components
5. Model monitoring and drift detection
6. Automated retraining workflows

**Prerequisites:**
- Python 3.10+
- AutoGluon
- MLflow
- scikit-learn
- pandas, numpy, scipy

---
## Section 11.1: Environment Setup

In [None]:
# Install required packages (uncomment if needed)
# !pip install autogluon mlflow scikit-learn pandas numpy scipy

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import time
from datetime import datetime
import json
import os

# MLflow
import mlflow
from mlflow.tracking import MlflowClient

# AutoGluon
from autogluon.tabular import TabularPredictor

# Scikit-learn utilities
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

print(f"MLflow version: {mlflow.__version__}")
print("AutoGluon ready for MLOps integration")

---
## Section 11.2: Load Sample Dataset

We'll use the California Housing dataset to demonstrate MLOps patterns. This is a regression task suitable for demonstrating model tracking, versioning, and monitoring.

In [None]:
# Load California Housing dataset
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Rename target column
df = df.rename(columns={'MedHouseVal': 'target'})

# Split data: train/validation/test
train_df, temp_df = train_test_split(df, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")
print(f"\nFeatures: {list(df.columns[:-1])}")

---
## Section 11.3: MLflow Experiment Tracking with AutoGluon

### Snippet 11-1: Basic MLflow Integration

This demonstrates the fundamental pattern for tracking AutoGluon experiments with MLflow.

In [None]:
# Snippet 11-1: Basic MLflow Integration with AutoGluon

# Set up MLflow experiment
mlflow.set_experiment("automl-housing-experiment")

with mlflow.start_run(run_name="autogluon-baseline") as run:
    # Log training parameters
    mlflow.log_param("time_limit", 120)
    mlflow.log_param("eval_metric", "root_mean_squared_error")
    mlflow.log_param("presets", "medium_quality")
    
    # Train AutoGluon model
    predictor = TabularPredictor(
        label='target',
        eval_metric='root_mean_squared_error',
        path='./ag_models_basic'
    ).fit(
        train_data=train_df,
        time_limit=120,
        presets='medium_quality'
    )
    
    # Get predictions and calculate metrics
    val_predictions = predictor.predict(val_df.drop(columns=['target']))
    rmse = np.sqrt(mean_squared_error(val_df['target'], val_predictions))
    r2 = r2_score(val_df['target'], val_predictions)
    
    # Log metrics
    mlflow.log_metric("val_rmse", rmse)
    mlflow.log_metric("val_r2", r2)
    
    # Log model leaderboard
    leaderboard = predictor.leaderboard(silent=True)
    leaderboard.to_csv("leaderboard.csv", index=False)
    mlflow.log_artifact("leaderboard.csv")
    
    print(f"Run ID: {run.info.run_id}")
    print(f"Validation RMSE: {rmse:.4f}")
    print(f"Validation R²: {r2:.4f}")
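
To make the two logged metrics concrete, here is a minimal dependency-free sketch of what RMSE and R² compute. The helper names `rmse` and `r_squared` and the toy values are illustrative assumptions, not part of the notebook; for real work the scikit-learn functions used above remain the better choice.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical, not taken from the housing data)
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]
print(f"RMSE: {rmse(y_true, y_pred):.4f}")
print(f"R²: {r_squared(y_true, y_pred):.4f}")
```

RMSE is expressed in the units of the target, so lower is better, while R² is unitless and higher is better.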

---
## Section 11.4: Hierarchical Experiment Tracking

### Snippet 11-2: Nested MLflow Runs

AutoML trains many candidate models in a single `fit` call. Nesting MLflow runs keeps them organized: the parent run captures experiment-level metadata, and each child run records the metrics of one model.

In [None]:
# Snippet 11-2: Hierarchical MLflow Tracking for AutoML

def train_with_nested_tracking(train_df, val_df, experiment_name, time_limit=180):
    """Train AutoGluon with hierarchical MLflow tracking."""
    
    mlflow.set_experiment(experiment_name)
    
    # Parent run captures overall experiment
    with mlflow.start_run(run_name="automl-parent") as parent_run:
        # Log parent-level parameters
        mlflow.log_param("framework", "autogluon")
        mlflow.log_param("time_limit", time_limit)
        mlflow.log_param("n_train_samples", len(train_df))
        mlflow.log_param("n_features", len(train_df.columns) - 1)
        
        # Train AutoGluon
        predictor = TabularPredictor(
            label='target',
            eval_metric='root_mean_squared_error',
            path='./ag_models_nested'
        ).fit(
            train_data=train_df,
            time_limit=time_limit,
            presets='best_quality'
        )
        
        # Get leaderboard
        leaderboard = predictor.leaderboard(silent=True)
        
        # Create child runs for each model
        for idx, row in leaderboard.iterrows():
            model_name = row['model']
            
            with mlflow.start_run(run_name=f"model-{model_name}", nested=True):
                mlflow.log_param("model_type", model_name)
                mlflow.log_metric("score_val", row['score_val'])
                mlflow.log_metric("pred_time_val", row['pred_time_val'])
                mlflow.log_metric("fit_time", row['fit_time'])
                
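                # Approximate inference latency per sample. pred_time_val is timed on
                # AutoGluon's internal validation data, so dividing by len(val_df) is
                # only a rough proxy unless the two sets have the same number of rows.
                latency_per_sample = row['pred_time_val'] / len(val_df) * 1000
                mlflow.log_metric("latency_ms_per_sample", latency_per_sample)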
        
        # Log best model metrics at parent level
        best_model = leaderboard.iloc[0]
        mlflow.log_metric("best_score", best_model['score_val'])
        mlflow.log_param("best_model", best_model['model'])
        
        # Calculate and log validation metrics
        val_predictions = predictor.predict(val_df.drop(columns=['target']))
        val_rmse = np.sqrt(mean_squared_error(val_df['target'], val_predictions))
        mlflow.log_metric("final_val_rmse", val_rmse)
        
        print(f"Parent Run ID: {parent_run.info.run_id}")
        print(f"Best Model: {best_model['model']}")
        print(f"Best Score: {best_model['score_val']:.4f}")
        print(f"Models trained: {len(leaderboard)}")
        
        return predictor, parent_run.info.run_id

# Run hierarchical tracking
predictor_nested, parent_run_id = train_with_nested_tracking(
    train_df, val_df, 
    experiment_name="automl-hierarchical-tracking",
    time_limit=180
)
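
The per-sample latency logged in each child run is a unit conversion from a batch timing. The sketch below isolates that arithmetic; the helper name `latency_ms_per_sample` and the numbers are hypothetical. The row count should be the number of rows the timing was actually measured on.

```python
def latency_ms_per_sample(total_seconds, n_rows):
    """Convert a batch timing in seconds to milliseconds per row."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return total_seconds / n_rows * 1000.0

# Hypothetical numbers: 0.45 s to score 3,000 rows
print(f"{latency_ms_per_sample(0.45, 3000):.3f} ms per sample")
```

Because the figure is amortized over a batch, it can understate the latency of a single-row request, where fixed per-call overhead dominates.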

---
## Section 11.5: Model Registry and Versioning

### Snippet 11-3: Model Registration Strategies

The MLflow model registry gives each registered predictor a version number, and an alias such as `production` marks which version downstream systems should load.

In [None]:
# Snippet 11-3: Model Registration Strategy

def register_automl_model(predictor, run_id, model_name, alias="production"):
    """Register AutoGluon model and assign an alias for lifecycle management."""
    
    client = MlflowClient()
    
    # Log model as MLflow artifact
    with mlflow.start_run(run_id=run_id):
        # Save predictor directory to artifacts
        predictor_path = predictor.path
        mlflow.log_artifacts(predictor_path, artifact_path="autogluon_predictor")
        
        # Get artifact URI
        artifact_uri = mlflow.get_artifact_uri("autogluon_predictor")
    
    # Register model
    try:
        # Create registered model if it doesn't exist
        client.create_registered_model(
            name=model_name,
            description="AutoGluon predictor for housing price prediction"
        )
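    except mlflow.exceptions.MlflowException as exc:
        # RestException subclasses MlflowException; local stores raise the base class.
        # Ignore only the "already exists" case and re-raise anything else.
        if exc.error_code != "RESOURCE_ALREADY_EXISTS":
            raise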
    
    # Create new version
    model_version = client.create_model_version(
        name=model_name,
        source=artifact_uri,
        run_id=run_id,
        description=f"AutoGluon model registered at {datetime.now()}"
    )
    
    # Assign alias for lifecycle management
    client.set_registered_model_alias(
        name=model_name,
        alias=alias.lower(),
        version=model_version.version
    )
    
    print(f"Registered model: {model_name}")
    print(f"Version: {model_version.version}")
    print(f"Alias: {alias.lower()}")
    
    return model_version

# Uncomment below to run (requires MLflow tracking server):
# model_version = register_automl_model(
#     predictor_nested, 
#     parent_run_id, 
#     "housing-price-automl"
# )

---
## Section 11.6: Model Validation Pipeline

### Snippet 11-4: Validation Checks Before Deployment

In [None]:
# Snippet 11-4: Model Validation Pipeline

class ModelValidator:
    """Comprehensive model validation before deployment."""
    
    def __init__(self, predictor, baseline_metrics=None):
        self.predictor = predictor
        self.baseline_metrics = baseline_metrics or {}
        self.validation_results = {}
    
    def validate_performance(self, test_df, target_col='target', threshold=0.95):
        """Check if model meets performance thresholds."""
        predictions = self.predictor.predict(test_df.drop(columns=[target_col]))
        rmse = np.sqrt(mean_squared_error(test_df[target_col], predictions))
        r2 = r2_score(test_df[target_col], predictions)
        
        baseline_r2 = self.baseline_metrics.get('r2', 0.0)
        performance_ratio = r2 / baseline_r2 if baseline_r2 > 0 else 1.0
        
        # Cast to a plain bool so the report serializes as JSON true/false
        passed = bool(performance_ratio >= threshold)
        
        self.validation_results['performance'] = {
            'passed': passed,
            'rmse': rmse,
            'r2': r2,
            'baseline_r2': baseline_r2,
            'ratio': performance_ratio
        }
        
        return passed
    
    def validate_latency(self, test_df, max_latency_ms=100):
        """Check inference latency requirements."""
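        features = test_df.drop(columns=['target'])
        
        # Warm-up call so one-time setup costs are excluded from the timing
        _ = self.predictor.predict(features.head(10))
        
        # Time one batch prediction with a monotonic, high-resolution clock
        # (uses the module-level `time` import from the setup cell)
        start = time.perf_counter()
        _ = self.predictor.predict(features)
        elapsed = time.perf_counter() - start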
        
        latency_per_sample = (elapsed / len(test_df)) * 1000  # ms
        passed = latency_per_sample <= max_latency_ms
        
        self.validation_results['latency'] = {
            'passed': passed,
            'latency_ms': latency_per_sample,
            'max_allowed_ms': max_latency_ms
        }
        
        return passed
    
    def validate_all(self, test_df):
        """Run all validation checks."""
        perf_passed = self.validate_performance(test_df)
        latency_passed = self.validate_latency(test_df)
        
        all_passed = perf_passed and latency_passed
        
        return {
            'all_passed': all_passed,
            'results': self.validation_results
        }

# Run validation
validator = ModelValidator(
    predictor_nested, 
    baseline_metrics={'r2': 0.80}  # Baseline from previous model
)

validation_report = validator.validate_all(test_df)
print(json.dumps(validation_report, indent=2, default=str))
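
The performance check above is a ratio test against a baseline. As a worked sketch, the standalone function below mirrors that rule so the threshold can be explored without a trained predictor. The name `passes_performance_gate` and the candidate scores are hypothetical.

```python
def passes_performance_gate(r2, baseline_r2, threshold=0.95):
    """Mirror of the ratio check: pass when r2 / baseline_r2 >= threshold."""
    ratio = r2 / baseline_r2 if baseline_r2 > 0 else 1.0
    return ratio >= threshold, ratio

baseline = 0.80
# Smallest R² that still passes with the default threshold
min_r2 = 0.95 * baseline
print(f"Minimum passing R²: {min_r2:.2f}")

# Hypothetical candidate scores
for candidate in (0.82, 0.78, 0.70):
    ok, ratio = passes_performance_gate(candidate, baseline)
    print(f"r2={candidate:.2f} ratio={ratio:.3f} passed={ok}")
```

Note that when the baseline is zero or missing, the ratio falls back to 1.0 and the check always passes, so supplying a real baseline matters.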

---
## Section 11.7: Model Monitoring and Drift Detection

### Snippet 11-5: Data Drift Detection

In [None]:
# Snippet 11-5: Data Drift Detection

from scipy import stats

class DriftDetector:
    """Detect data and prediction drift in AutoML systems."""
    
    def __init__(self, reference_data, feature_columns):
        self.reference_data = reference_data
        self.feature_columns = feature_columns
        self.reference_stats = self._compute_stats(reference_data)
    
    def _compute_stats(self, data):
        """Compute reference statistics."""
        stats_dict = {}
        for col in self.feature_columns:
            stats_dict[col] = {
                'mean': data[col].mean(),
                'std': data[col].std(),
                'min': data[col].min(),
                'max': data[col].max(),
                'quantiles': data[col].quantile([0.25, 0.5, 0.75]).values
            }
        return stats_dict
    
    def calculate_psi(self, reference, current, bins=10):
        """Calculate Population Stability Index."""
        # Build shared bin edges that span both reference and current data
        min_val = min(reference.min(), current.min())
        max_val = max(reference.max(), current.max())
        bins_edges = np.linspace(min_val, max_val, bins + 1)
        
        ref_counts, _ = np.histogram(reference, bins=bins_edges)
        curr_counts, _ = np.histogram(current, bins=bins_edges)
        
        # Add-one (Laplace) smoothing keeps every bin non-empty, avoiding log(0)
        ref_pct = (ref_counts + 1) / (len(reference) + bins)
        curr_pct = (curr_counts + 1) / (len(current) + bins)
        
        # PSI = sum over bins of (current - reference) * ln(current / reference)
        psi = np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct))
        return float(psi)
    
    def detect_drift(self, current_data, psi_threshold=0.2):
        """Detect drift in current data vs reference."""
        drift_report = {
            'drift_detected': False,
            'features_with_drift': [],
            'feature_psi': {}
        }
        
        for col in self.feature_columns:
            psi = self.calculate_psi(
                self.reference_data[col],
                current_data[col]
            )
            drift_report['feature_psi'][col] = psi
            
            if psi > psi_threshold:
                drift_report['drift_detected'] = True
                drift_report['features_with_drift'].append({
                    'feature': col,
                    'psi': psi,
                    'severity': 'high' if psi > 0.25 else 'medium'
                })
        
        return drift_report

# Initialize drift detector with training data
feature_cols = [c for c in train_df.columns if c != 'target']
drift_detector = DriftDetector(train_df, feature_cols)

# Check for drift in test data (simulating production data)
drift_report = drift_detector.detect_drift(test_df)
print("Drift Detection Report:")
print(json.dumps(drift_report, indent=2, default=str))
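
To build intuition for the PSI values in the report, the following sketch computes PSI directly from bin proportions using only the standard library. The function name `psi_from_proportions` and the four-bin distributions are made-up illustrations, not output from the housing data.

```python
import math

def psi_from_proportions(ref_pct, cur_pct):
    """PSI = sum over bins of (cur - ref) * ln(cur / ref)."""
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_pct, cur_pct))

# Hypothetical bin proportions; each list sums to 1
reference = [0.25, 0.25, 0.25, 0.25]
no_shift = [0.25, 0.25, 0.25, 0.25]
shifted = [0.10, 0.20, 0.30, 0.40]

print(f"PSI, identical: {psi_from_proportions(reference, no_shift):.4f}")
print(f"PSI, shifted:   {psi_from_proportions(reference, shifted):.4f}")
```

Identical distributions give a PSI of zero. The shifted example lands between the 0.2 threshold and the 0.25 cutoff used in `detect_drift`, so it would be reported as drift with medium severity. Every term in the sum is non-negative, which is why PSI only grows as the distributions diverge.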

---
## Section 11.8: Regression Testing

### Snippet 11-6: Model Regression Testing

In [None]:
# Snippet 11-6: Model Regression Testing

class ModelRegressionTester:
    """Test new models against production baselines."""
    
    def __init__(self, test_cases):
        self.test_cases = test_cases  # DataFrame with inputs and expected outputs
        
    def run_regression_tests(self, new_predictor, old_predictor, 
                            tolerance=0.1, max_regression_pct=0.05):
        """Compare new model against old model on test cases."""
        
        features = self.test_cases.drop(columns=['target'])
        expected = self.test_cases['target']
        
        old_predictions = old_predictor.predict(features)
        new_predictions = new_predictor.predict(features)
        
        # Calculate metrics for both
        old_rmse = np.sqrt(mean_squared_error(expected, old_predictions))
        new_rmse = np.sqrt(mean_squared_error(expected, new_predictions))
        
        # Check for regression
        regression_pct = (new_rmse - old_rmse) / old_rmse if old_rmse > 0 else 0
        
        # Count significant prediction differences
        diff = np.abs(new_predictions - old_predictions)
        significant_changes = (diff > tolerance * np.abs(old_predictions)).sum()
        
        # Cast NumPy scalars to built-in types so the report serializes cleanly
        results = {
            'passed': bool(regression_pct <= max_regression_pct),
            'old_rmse': old_rmse,
            'new_rmse': new_rmse,
            'regression_pct': regression_pct,
            'significant_prediction_changes': int(significant_changes),
            'total_test_cases': len(self.test_cases)
        }
        
        return results

# Create regression test suite using a sample of test data
regression_tester = ModelRegressionTester(test_df.head(100))

# Note: In practice, you'd compare new model vs deployed production model
# Here we compare the same model as a demonstration
regression_results = regression_tester.run_regression_tests(
    predictor_nested, 
    predictor_nested  # Would be different in production
)

print("Regression Test Results:")
print(json.dumps(regression_results, indent=2, default=str))
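
Comparing a predictor with itself always yields a regression of zero, so it helps to see the pass/fail rule on different numbers. This sketch reproduces the relative RMSE change on hypothetical values; `relative_rmse_change` is an illustrative helper, not part of the class above.

```python
def relative_rmse_change(old_rmse, new_rmse):
    """Positive values mean the new model is worse than the old one."""
    return (new_rmse - old_rmse) / old_rmse if old_rmse > 0 else 0.0

max_regression_pct = 0.05  # same default as run_regression_tests

# Hypothetical (old, new) RMSE pairs
for old, new in [(0.50, 0.52), (0.50, 0.56), (0.50, 0.45)]:
    change = relative_rmse_change(old, new)
    passed = change <= max_regression_pct
    print(f"old={old:.2f} new={new:.2f} change={change:+.1%} passed={passed}")
```

An improvement produces a negative change and always passes, so the check guards only against the new model getting worse.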

---
## Section 11.9: Automated Retraining Workflow

### Snippet 11-7: Retraining Trigger Logic

In [None]:
# Snippet 11-7: Automated Retraining Trigger

class RetrainingOrchestrator:
    """Orchestrate automated model retraining based on triggers."""
    
    def __init__(self, config):
        self.config = config
        self.drift_threshold = config.get('drift_psi_threshold', 0.2)
        self.performance_threshold = config.get('performance_threshold', 0.95)
        
    def evaluate_triggers(self, drift_report, performance_metrics):
        """Determine if retraining should be triggered."""
        triggers = []
        
        # Check drift trigger
        if drift_report.get('drift_detected', False):
            triggers.append({
                'type': 'data_drift',
                'reason': f"Drift detected in {len(drift_report.get('features_with_drift', []))} features",
                'severity': 'high'
            })
        
        # Check performance degradation
        current_perf = performance_metrics.get('current_r2', 1.0)
        baseline_perf = performance_metrics.get('baseline_r2', 1.0)
        perf_ratio = current_perf / baseline_perf if baseline_perf > 0 else 1.0
        
        if perf_ratio < self.performance_threshold:
            triggers.append({
                'type': 'performance_degradation',
                'reason': f"Performance dropped to {perf_ratio:.2%} of baseline",
                'severity': 'high' if perf_ratio < 0.9 else 'medium'
            })
        
        return {
            'should_retrain': len(triggers) > 0,
            'triggers': triggers,
            'timestamp': datetime.now().isoformat()
        }
    
    def execute_retraining(self, train_df, retrain_config):
        """Execute retraining pipeline."""
        print(f"Starting retraining at {datetime.now()}")
        
        mlflow.set_experiment("automl-retraining")
        
        with mlflow.start_run(run_name=f"retrain-{datetime.now().strftime('%Y%m%d-%H%M%S')}"):
            mlflow.log_param("retrain_reason", retrain_config.get('reason', 'scheduled'))
            mlflow.log_param("time_limit", retrain_config.get('time_limit', 300))
            
            predictor = TabularPredictor(
                label='target',
                eval_metric='root_mean_squared_error',
                path=f'./ag_models_retrain_{datetime.now().strftime("%Y%m%d_%H%M%S")}'
            ).fit(
                train_data=train_df,
                time_limit=retrain_config.get('time_limit', 300),
                presets=retrain_config.get('presets', 'best_quality')
            )
            
            return predictor

# Initialize orchestrator
orchestrator = RetrainingOrchestrator({
    'drift_psi_threshold': 0.2,
    'performance_threshold': 0.95
})

# Evaluate triggers
trigger_result = orchestrator.evaluate_triggers(
    drift_report=drift_report,
    performance_metrics={'current_r2': 0.85, 'baseline_r2': 0.90}
)

print("Retraining Trigger Evaluation:")
print(json.dumps(trigger_result, indent=2))

# Execute retraining if triggered (commented out to save time)
# if trigger_result['should_retrain']:
#     new_predictor = orchestrator.execute_retraining(
#         train_df,
#         {'time_limit': 180, 'reason': trigger_result['triggers'][0]['type']}
#     )
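
As a quick trace of the performance trigger, the sketch below replays the arithmetic for the metrics passed to `evaluate_triggers` above, using the same threshold and severity cutoff as the class. It is a standalone illustration and does not call the orchestrator.

```python
current_r2, baseline_r2 = 0.85, 0.90  # values passed to evaluate_triggers above
performance_threshold = 0.95

perf_ratio = current_r2 / baseline_r2
degraded = perf_ratio < performance_threshold
severity = "high" if perf_ratio < 0.9 else "medium"

print(f"ratio={perf_ratio:.4f} degraded={degraded} severity={severity}")
```

The ratio is roughly 0.944, which is below the 0.95 threshold but above the 0.9 cutoff, so the trigger fires with medium severity even if no drift is detected.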

---
## Section 11.10: Kubeflow Pipeline Component Example

This cell shows how to define a Kubeflow pipeline component for AutoML training. Note: executing it requires the `kfp` package and a Kubeflow cluster.

In [None]:
# Kubeflow Pipeline Component Definition (reference implementation)

AUTOML_COMPONENT_YAML = """
name: AutoGluon Training Component
description: Train AutoGluon model with MLflow tracking

inputs:
  - name: train_data_path
    type: String
    description: Path to training data
  - name: target_column
    type: String
    description: Name of target column
  - name: time_limit
    type: Integer
    default: 300
  - name: presets
    type: String
    default: best_quality
  - name: mlflow_tracking_uri
    type: String

outputs:
  - name: model_path
    type: String
  - name: metrics
    type: JsonObject

implementation:
  container:
    image: autogluon-mlops:latest
    command:
      - python
      - train_component.py
    args:
      - --train-data
      - {inputValue: train_data_path}
      - --target-column
      - {inputValue: target_column}
      - --time-limit
      - {inputValue: time_limit}
      - --presets
      - {inputValue: presets}
      - --mlflow-uri
      - {inputValue: mlflow_tracking_uri}
      - --output-path
      - {outputPath: model_path}
"""

print("Kubeflow Component Definition:")
print(AUTOML_COMPONENT_YAML)
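
The component definition hands its inputs to `train_component.py` as command-line flags. That script is not shown in this notebook, so the following is only a sketch of the argument parsing it might start with, assuming it uses `argparse`. The flag names come from the YAML; the example paths and URI are hypothetical.

```python
import argparse

def build_parser():
    """Hypothetical CLI for train_component.py, matching the flags in the YAML."""
    parser = argparse.ArgumentParser(description="AutoGluon training component")
    parser.add_argument("--train-data", required=True, help="Path to training data")
    parser.add_argument("--target-column", required=True, help="Name of target column")
    parser.add_argument("--time-limit", type=int, default=300)
    parser.add_argument("--presets", default="best_quality")
    parser.add_argument("--mlflow-uri", required=True)
    parser.add_argument("--output-path", required=True)
    return parser

# Parse an example argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args([
    "--train-data=/data/train.csv",      # hypothetical path
    "--target-column=target",
    "--mlflow-uri=http://mlflow:5000",   # hypothetical URI
    "--output-path=/tmp/model",          # hypothetical path
])
print(args.train_data, args.target_column, args.time_limit, args.presets)
```

Keeping the defaults in the parser aligned with the defaults in the component definition avoids surprises when the script is run outside the pipeline.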

---
## Section 11.11: Summary

This notebook demonstrated key MLOps patterns for AutoML systems:

1. **MLflow Integration** - Tracking experiments, parameters, and metrics
2. **Hierarchical Tracking** - Parent/child runs for multi-model experiments
3. **Model Registry** - Version management and aliases
4. **Validation Pipeline** - Performance and latency checks
5. **Drift Detection** - PSI-based data drift monitoring
6. **Regression Testing** - Ensuring model quality over time
7. **Automated Retraining** - Trigger-based model updates

These patterns enable reliable, reproducible AutoML deployments in production environments.

In [None]:
# Cleanup temporary files (optional)
import shutil

# Uncomment to clean up:
# for path in ['./ag_models_basic', './ag_models_nested']:
#     if os.path.exists(path):
#         shutil.rmtree(path)
#         print(f"Removed {path}")