# 126: Continuous Training Pipelines - Automated Model Retraining and Drift Detection

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** the continuous training paradigm for adaptive ML systems (models auto-retrain on new data)
- **Implement** retraining triggers based on schedules, drift detection, and performance degradation
- **Build** automated validation gates to prevent bad models from deploying
- **Deploy** continuous training pipelines with Airflow orchestration
- **Apply** drift detection to post-silicon test data (detect when device characteristics change)
- **Monitor** model performance decay and trigger timely retraining

## 📚 What is Continuous Training?

**Continuous training** is the MLOps practice of **automatically retraining models** as new data arrives, concept drift occurs, or performance degrades. Unlike static models (trained once, deployed forever), continuous training keeps models fresh and accurate.

**Why Continuous Training?**
- ✅ **Adapt to change**: Models degrade over time as real-world patterns shift (customer behavior, device characteristics, market trends)
- ✅ **Improve with data**: New data can improve model accuracy (more representative training samples generally help generalization)
- ✅ **Detect drift**: Statistical tests detect data drift (input distribution changes) or concept drift (relationship between X and Y changes)
- ✅ **Reduce manual work**: Automate retraining instead of manual model updates every quarter
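
To make the difference between data drift and concept drift concrete, here is a minimal sketch on synthetic data. The names (`x_ref`, `rule`) and the 0.5 cut-off are illustrative assumptions, and the production inputs deliberately reuse the reference inputs so that the input distribution is unchanged by construction.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference period: label is 1 when the (hypothetical) feature is positive
x_ref = rng.normal(0.0, 1.0, 1000)
y_ref = (x_ref > 0.0).astype(int)

# Production period: same inputs (no data drift by construction),
# but the relationship between X and Y has moved (concept drift)
x_prod = x_ref.copy()
y_prod = (x_prod > 0.5).astype(int)

def rule(x):
    """Stand-in for a model fitted on the reference period."""
    return (x > 0.0).astype(int)

ks_stat, p_value = ks_2samp(x_ref, x_prod)
acc_ref = float((rule(x_ref) == y_ref).mean())
acc_prod = float((rule(x_prod) == y_prod).mean())

print(f"KS statistic on inputs: {ks_stat:.3f} (p={p_value:.3f})")
print(f"Accuracy on reference labels:  {acc_ref:.3f}")
print(f"Accuracy on production labels: {acc_prod:.3f}")
```

In this setup the KS test on the inputs reports no difference while accuracy still drops, which is one reason input drift tests are usually paired with performance monitoring.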

**Continuous Training vs Batch Retraining:**
- **Batch**: Retrain monthly/quarterly on schedule (simple but ignores drift, may miss critical changes)
- **Continuous**: Retrain when drift detected or performance drops (adaptive but requires monitoring infrastructure)

## 🏭 Post-Silicon Validation Use Cases

### **Use Case 1: Yield Prediction Model Continuous Retraining**
**Input:** Yield prediction model trained on Q1 wafer test data, deployed to production  
**Problem:** Q2 shows 15% accuracy drop (new test equipment, process changes, different lot characteristics)  
**Output:** Drift detection triggers automatic retraining on Q2 data, model accuracy recovers to baseline  
**Value:** $2.8M/year from preventing false-positive/false-negative yield predictions (early detection of failing devices)

### **Use Case 2: Parametric Test Outlier Detection with Drift Monitoring**
**Input:** Outlier detection model for flagging suspicious test results (voltage, current, frequency out of spec)  
**Problem:** New device generation has different parametric ranges, model flags normal devices as outliers (false alarms)  
**Output:** Data drift detected via KS-test on parametric distributions, model retrains on new generation data  
**Value:** $1.9M/year from reduced false alarms (engineering time saved investigating non-issues)

### **Use Case 3: Test Time Prediction with Performance-Based Retraining**
**Input:** Regression model predicting test duration for capacity planning  
**Problem:** Model accuracy degrades from 95% to 78% (MAPE increases) after test flow changes  
**Output:** Performance monitoring triggers retraining when MAPE exceeds 10% threshold  
**Value:** $1.5M/year from accurate test scheduling (optimize tester utilization, reduce idle time)
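
The MAPE-based trigger in this use case can be sketched in a few lines. The durations below are made-up illustrative values, and `mape` and `MAPE_THRESHOLD` are hypothetical names; only the 10% threshold comes from the use case itself.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

MAPE_THRESHOLD = 10.0  # percent, taken from the use case above

# Hypothetical test durations in seconds (illustrative values only)
actual_s = [100.0, 200.0, 400.0, 500.0]
predicted_s = [90.0, 230.0, 340.0, 450.0]

current_mape = mape(actual_s, predicted_s)
needs_retrain = current_mape > MAPE_THRESHOLD

print(f"MAPE = {current_mape:.2f}% -> retrain: {needs_retrain}")
```

In a real system the actual durations would arrive with some delay, so the check would typically run over a rolling window of completed tests.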

### **Use Case 4: Wafer Map Pattern Classification Auto-Retraining**
**Input:** CNN classifying wafer map defect patterns (edge fail, center fail, scratch, random)  
**Problem:** New defect pattern appears (not in training data), model defaults to "unknown" class  
**Output:** Schedule-based weekly retraining incorporates new labeled wafer maps, model learns new patterns  
**Value:** $1.2M/year from faster root cause analysis (classify new defect patterns immediately)

**Total Post-Silicon Value:** $2.8M + $1.9M + $1.5M + $1.2M = **$7.4M/year**

## 🔄 Continuous Training Workflow

```mermaid
graph LR
    A[📊 New Data Arrives] --> B[🔍 Drift Detection]
    B --> C{Drift Detected?}
    C -->|Yes| D[⏰ Trigger Retraining]
    C -->|No| E[📅 Check Schedule]
    E --> F{Scheduled Retrain?}
    F -->|Yes| D
    F -->|No| G[✅ Monitor Performance]
    
    D --> H[🏋️ Train New Model]
    H --> I[✅ Validation Gates]
    I --> J{Pass Gates?}
    J -->|No| K[❌ Reject Model]
    J -->|Yes| L[🚀 Deploy to Production]
    L --> M[📈 Monitor Metrics]
    M --> A
    
    K --> N[📧 Alert Team]
    N --> A
    
    style A fill:#e1f5ff
    style L fill:#e1ffe1
    style K fill:#ffe1e1
    style J fill:#fff4e1
```

## 📊 Learning Path Context

**Prerequisites:**
- **Notebook 123: Model Monitoring & Drift Detection** - Drift detection techniques (KS-test, PSI, concept drift)
- **Notebook 109: ML Pipelines with Airflow** - Orchestration for continuous training workflows

**Next Steps:**
- **Notebook 127: Model Governance & Compliance** - Governance for auto-deployed models
- **Notebook 128: Shadow Mode Deployment** - Validate retrained models before full rollout

---

Let's build adaptive ML systems with continuous training! 🚀

## 1. Setup & Installation

**Note**: Continuous training requires orchestration tools (Airflow) and monitoring libraries.

In [None]:
# Install continuous training libraries
# !pip install scikit-learn pandas numpy mlflow schedule scipy

import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from scipy.stats import ks_2samp
import warnings
warnings.filterwarnings('ignore')

print("Continuous training libraries loaded")
print("Focus: Automated retraining, drift detection, validation gates")

## 2. Retraining Trigger Mechanisms

**Purpose:** Implement trigger logic that determines WHEN to retrain models.

**Key Points:**
- **Schedule-based triggers**: Time-based (daily, weekly, monthly) - simplest approach
- **Drift-based triggers**: Detect data drift or concept drift using statistical tests
- **Performance-based triggers**: Monitor production metrics (accuracy, F1, RMSE drops below threshold)
- **Hybrid triggers**: Combine multiple conditions (e.g., drift AND schedule)

**Why This Matters:** Well-chosen triggers prevent both unnecessary retraining (wasted compute) and delayed retraining (degraded accuracy).
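
As a small illustration of the hybrid idea (drift AND schedule), the sketch below only fires when drift is detected and a minimum interval since the last retrain has elapsed. The function name and the `min_days_between_retrains` parameter are hypothetical, chosen only for this example.

```python
def hybrid_trigger(drift_detected, days_since_retrain, min_days_between_retrains=3):
    """Fire only if drift is detected AND the cool-down interval has elapsed."""
    cooled_down = days_since_retrain >= min_days_between_retrains
    return drift_detected and cooled_down

print(hybrid_trigger(drift_detected=True, days_since_retrain=1))   # drift, but too soon
print(hybrid_trigger(drift_detected=True, days_since_retrain=5))   # drift and cooled down
print(hybrid_trigger(drift_detected=False, days_since_retrain=5))  # no drift
```

AND logic like this trades responsiveness for stability: it avoids back-to-back retrains on noisy drift signals, at the cost of reacting later to genuine shifts.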

In [None]:
class RetrainingTrigger:
    """
    Manages multiple trigger conditions for automated model retraining.
    
    Supports:
    - Schedule-based triggers (time intervals)
    - Data drift triggers (KS test for feature distribution changes)
    - Performance degradation triggers (metric thresholds)
    """
    
    def __init__(self, schedule_days=7, drift_threshold=0.05, perf_threshold=0.80):
        """
        Args:
            schedule_days: Retrain every N days regardless of drift/performance
            drift_threshold: p-value threshold for KS test (reject if p < threshold)
            perf_threshold: Minimum acceptable accuracy (retrain if below)
        """
        self.schedule_days = schedule_days
        self.drift_threshold = drift_threshold
        self.perf_threshold = perf_threshold
        self.last_retrain_date = datetime.now()
        
    def check_schedule_trigger(self):
        """Check if scheduled retraining interval has passed."""
        days_since_retrain = (datetime.now() - self.last_retrain_date).days
        triggered = days_since_retrain >= self.schedule_days
        
        return {
            'triggered': triggered,
            'reason': f'Scheduled retrain (every {self.schedule_days} days)',
            'days_since_retrain': days_since_retrain
        }
    
    def check_drift_trigger(self, train_data, production_data, feature_cols):
        """
        Detect data drift using Kolmogorov-Smirnov (KS) test.
        
        KS test compares distributions:
        - Null hypothesis: Training and production data from same distribution
        - p-value < threshold → reject null → drift detected → retrain
        """
        drift_detected = False
        drift_features = []
        
        for feature in feature_cols:
            # KS test for each feature
            statistic, p_value = ks_2samp(
                train_data[feature], 
                production_data[feature]
            )
            
            if p_value < self.drift_threshold:
                drift_detected = True
                drift_features.append((feature, p_value, statistic))
        
        return {
            'triggered': drift_detected,
            'reason': 'Data drift detected',
            'drift_features': drift_features,
            'num_drifted': len(drift_features)
        }
    
    def check_performance_trigger(self, current_accuracy):
        """Check if production model performance dropped below threshold."""
        triggered = current_accuracy < self.perf_threshold
        
        return {
            'triggered': triggered,
            'reason': f'Performance degradation (accuracy={current_accuracy:.3f} < {self.perf_threshold})',
            'current_accuracy': current_accuracy
        }
    
    def should_retrain(self, train_data=None, production_data=None, 
                       feature_cols=None, current_accuracy=None):
        """
        Master decision function: Check all trigger conditions.
        
        Returns True if ANY trigger condition met (OR logic).
        """
        triggers = {}
        
        # Check schedule
        schedule_result = self.check_schedule_trigger()
        triggers['schedule'] = schedule_result
        
        # Check drift (if data provided)
        if train_data is not None and production_data is not None and feature_cols is not None:
            drift_result = self.check_drift_trigger(train_data, production_data, feature_cols)
            triggers['drift'] = drift_result
        
        # Check performance (if accuracy provided)
        if current_accuracy is not None:
            perf_result = self.check_performance_trigger(current_accuracy)
            triggers['performance'] = perf_result
        
        # Decision: Retrain if ANY trigger fired
        should_retrain = any(t['triggered'] for t in triggers.values())
        
        return {
            'should_retrain': should_retrain,
            'triggers': triggers,
            'timestamp': datetime.now()
        }

# Example: Initialize trigger system
trigger = RetrainingTrigger(
    schedule_days=7,        # Weekly retraining
    drift_threshold=0.05,   # 5% significance level for KS test
    perf_threshold=0.80     # Retrain if accuracy drops below 80%
)

print("✅ Retraining trigger system initialized")
print(f"Schedule: Every {trigger.schedule_days} days")
print(f"Drift threshold: p-value < {trigger.drift_threshold}")
print(f"Performance threshold: accuracy >= {trigger.perf_threshold}")

## 3. Automated Retraining Pipeline

**Purpose:** Build end-to-end pipeline that automatically retrains, validates, and deploys models.

**Key Points:**
- **Data fetching**: Pull latest data from production (feature store, database)
- **Feature engineering**: Consistent transformations (use feature store definitions)
- **Model training**: Same hyperparameters OR hyperparameter tuning
- **Validation gates**: Multiple checks before deployment (accuracy, fairness, business rules)
- **Model versioning**: Track each retrained model with metadata (timestamp, trigger reason, metrics)

**Why This Matters:** Automation ensures consistency and reduces human error in production retraining.
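
One possible shape for the model-versioning metadata mentioned above is sketched here. `ModelVersion`, `register`, and the metric values are hypothetical and used purely for illustration; a real setup would more likely store this in a model registry.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical metadata record for one retrained model."""
    version: int
    trigger_reason: str
    metrics: dict
    trained_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

history = []

def register(trigger_reason, metrics):
    """Append a new version record; version numbers start at 1."""
    record = ModelVersion(version=len(history) + 1,
                          trigger_reason=trigger_reason,
                          metrics=dict(metrics))
    history.append(record)
    return record

# Illustrative metric values, not measured results
v1 = register("Initial training", {"accuracy": 0.90, "f1": 0.86})
v2 = register("Data drift detected", {"accuracy": 0.91, "f1": 0.87})
print(asdict(v2))
```

Keeping the trigger reason next to the metrics makes it possible to ask later which kinds of triggers actually led to better models.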

In [None]:
class ContinuousTrainingPipeline:
    """
    End-to-end automated retraining pipeline with validation gates.
    
    Workflow:
    1. Fetch latest data
    2. Engineer features
    3. Train new model
    4. Validate against production model
    5. Deploy if passes all gates
    """
    
    def __init__(self, model_class=RandomForestClassifier, validation_holdout=0.2):
        self.model_class = model_class
        self.validation_holdout = validation_holdout
        self.production_model = None
        self.production_metrics = {}
        self.retrain_history = []
        
    def fetch_data(self, start_date, end_date):
        """Simulate fetching production data (replace with real data source)."""
        # In production: Query database, feature store, or data warehouse
        # Example: SELECT * FROM stdf_tests WHERE test_date BETWEEN start_date AND end_date
        
        # NOTE: the fixed seed makes every call return the same synthetic sample,
        # so start_date/end_date are ignored in this simulation.
        np.random.seed(42)
        n_samples = 1000
        
        data = pd.DataFrame({
            'vdd': np.random.normal(1.2, 0.05, n_samples),
            'idd': np.random.normal(50, 5, n_samples),
            'frequency': np.random.normal(2400, 100, n_samples),
            'temperature': np.random.normal(25, 5, n_samples),
            # Labels are drawn independently of the features (about 90% pass),
            # so a model cannot learn a real signal from this synthetic data.
            'yield': np.random.choice([0, 1], n_samples, p=[0.1, 0.9])
        })
        
        return data
    
    def engineer_features(self, data):
        """Feature engineering (consistent with training pipeline)."""
        features = data.copy()
        
        # Derived features
        features['power'] = features['vdd'] * features['idd']
        features['power_efficiency'] = features['frequency'] / features['power']
        
        return features
    
    def train_model(self, X_train, y_train):
        """Train new model candidate."""
        model = self.model_class(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        return model
    
    def validate_model(self, model, X_val, y_val):
        """
        Validation gates: Check if new model meets quality thresholds.
        
        Gates:
        1. Accuracy >= 0.80 (absolute threshold)
        2. At least as accurate as the production model (relative threshold)
        3. Weighted F1 score >= 0.75 (class balance check)
        """
        y_pred = model.predict(X_val)
        
        accuracy = accuracy_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred, average='weighted')
        
        gates_passed = []
        gates_failed = []
        
        # Gate 1: Absolute accuracy threshold
        if accuracy >= 0.80:
            gates_passed.append(f"✅ Accuracy gate ({accuracy:.3f} >= 0.80)")
        else:
            gates_failed.append(f"❌ Accuracy gate ({accuracy:.3f} < 0.80)")
        
        # Gate 2: Better than production (if production model exists)
        if self.production_model is not None:
            prod_accuracy = self.production_metrics.get('accuracy', 0)
            if accuracy >= prod_accuracy:
                gates_passed.append(f"✅ Production comparison ({accuracy:.3f} >= {prod_accuracy:.3f})")
            else:
                gates_failed.append(f"❌ Production comparison ({accuracy:.3f} < {prod_accuracy:.3f})")
        
        # Gate 3: F1 score threshold
        if f1 >= 0.75:
            gates_passed.append(f"✅ F1 score gate ({f1:.3f} >= 0.75)")
        else:
            gates_failed.append(f"❌ F1 score gate ({f1:.3f} < 0.75)")
        
        passed = len(gates_failed) == 0
        
        return {
            'passed': passed,
            'accuracy': accuracy,
            'f1': f1,
            'gates_passed': gates_passed,
            'gates_failed': gates_failed
        }
    
    def deploy_model(self, model, metrics, reason):
        """Deploy new model to production."""
        self.production_model = model
        self.production_metrics = metrics
        
        # Record deployment
        self.retrain_history.append({
            'timestamp': datetime.now(),
            'reason': reason,
            'accuracy': metrics['accuracy'],
            'f1': metrics['f1']
        })
        
        print(f"🚀 Model deployed to production")
        print(f"   Reason: {reason}")
        print(f"   Accuracy: {metrics['accuracy']:.3f}")
        print(f"   F1 Score: {metrics['f1']:.3f}")
    
    def run_retraining(self, start_date, end_date, trigger_reason):
        """
        Execute full retraining workflow.
        
        Returns:
            dict: Retraining result (deployed, metrics, gates)
        """
        print(f"🔄 Starting retraining pipeline...")
        print(f"   Trigger: {trigger_reason}")
        
        # Step 1: Fetch data
        data = self.fetch_data(start_date, end_date)
        print(f"✅ Fetched {len(data)} samples")
        
        # Step 2: Engineer features
        features = self.engineer_features(data)
        feature_cols = ['vdd', 'idd', 'frequency', 'temperature', 'power', 'power_efficiency']
        X = features[feature_cols]
        y = features['yield']
        
        # Step 3: Split data (use holdout for validation)
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=self.validation_holdout, random_state=42
        )
        print(f"✅ Train: {len(X_train)}, Validation: {len(X_val)}")
        
        # Step 4: Train new model
        model = self.train_model(X_train, y_train)
        print(f"✅ Model trained")
        
        # Step 5: Validate model
        validation_result = self.validate_model(model, X_val, y_val)
        
        print(f"\n📊 Validation Results:")
        for gate in validation_result['gates_passed']:
            print(f"   {gate}")
        for gate in validation_result['gates_failed']:
            print(f"   {gate}")
        
        # Step 6: Deploy if passed
        if validation_result['passed']:
            self.deploy_model(model, validation_result, trigger_reason)
            return {
                'deployed': True,
                'metrics': validation_result,
                'reason': trigger_reason
            }
        else:
            print(f"❌ Model rejected (failed validation gates)")
            return {
                'deployed': False,
                'metrics': validation_result,
                'reason': trigger_reason
            }

# Example: Initialize and run pipeline
pipeline = ContinuousTrainingPipeline(
    model_class=RandomForestClassifier,
    validation_holdout=0.2
)

# Simulate retraining
result = pipeline.run_retraining(
    start_date=datetime.now() - timedelta(days=30),
    end_date=datetime.now(),
    trigger_reason="Initial training"
)

print(f"\n{'='*60}")
print(f"Deployment status: {result['deployed']}")
print(f"Accuracy: {result['metrics']['accuracy']:.3f}")
print(f"F1 Score: {result['metrics']['f1']:.3f}")

## 4. End-to-End Continuous Training System (Post-Silicon Example)

**Purpose:** Demonstrate complete CT system for yield prediction with trigger detection and automated retraining.

**Key Points:**
- **Trigger monitoring**: Check schedule, drift, and performance continuously
- **Automatic execution**: Run retraining pipeline when triggered
- **Model versioning**: Track all models (production + candidates)
- **Rollback capability**: Revert to previous version if new model fails in production

**Why This Matters:** Real-world CT requires orchestration of triggers, retraining, validation, and deployment.
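
Rollback capability can be sketched with a tiny in-memory registry that keeps every deployed version. `ModelRegistry` is a hypothetical class for illustration only, and plain strings stand in for fitted model objects.

```python
class ModelRegistry:
    """Hypothetical in-memory registry that keeps every deployed version."""

    def __init__(self):
        self._versions = []  # deployed models, oldest first

    def deploy(self, model):
        self._versions.append(model)
        return len(self._versions)  # 1-based version number

    @property
    def production(self):
        return self._versions[-1] if self._versions else None

    def rollback(self):
        """Discard the current production model and return the restored one."""
        if len(self._versions) < 2:
            raise RuntimeError("No previous version to roll back to")
        self._versions.pop()
        return self.production

registry = ModelRegistry()
registry.deploy("model_v1")  # strings stand in for fitted model objects
registry.deploy("model_v2")
restored = registry.rollback()
print(restored)
```

A production registry would persist versions outside the process and keep the discarded model for later analysis. The core idea is the same: never overwrite the previous production model on deploy.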

In [None]:
# Simulate production environment over time
print("🏭 Simulating Production Environment for Yield Prediction CT System\n")

# Initial production model training
pipeline = ContinuousTrainingPipeline(model_class=RandomForestClassifier)
trigger = RetrainingTrigger(schedule_days=7, drift_threshold=0.05, perf_threshold=0.85)

# Train initial model
initial_result = pipeline.run_retraining(
    start_date=datetime.now() - timedelta(days=60),
    end_date=datetime.now() - timedelta(days=30),
    trigger_reason="Initial model deployment"
)

# Simulate that the last retrain happened 10 days ago (past the 7-day schedule)
trigger.last_retrain_date = datetime.now() - timedelta(days=10)

print(f"\n{'='*60}")
print("📊 Production Monitoring - Day 10")
print(f"{'='*60}\n")

# Simulate new production data (with drift)
train_data_original = pipeline.fetch_data(
    datetime.now() - timedelta(days=60),
    datetime.now() - timedelta(days=30)
)

production_data_drifted = pipeline.fetch_data(
    datetime.now() - timedelta(days=7),
    datetime.now()
)

# Introduce drift (Vdd voltage shifted due to process change)
production_data_drifted['vdd'] = production_data_drifted['vdd'] + 0.08  # +80mV shift

# Simulate performance degradation
production_accuracy = 0.82  # Dropped from initial 0.90

# Check if retraining should trigger
feature_cols = ['vdd', 'idd', 'frequency', 'temperature']
trigger_decision = trigger.should_retrain(
    train_data=train_data_original,
    production_data=production_data_drifted,
    feature_cols=feature_cols,
    current_accuracy=production_accuracy
)

print("🔍 Trigger Analysis:")
print(f"   Schedule trigger: {trigger_decision['triggers']['schedule']['triggered']}")
print(f"   Days since retrain: {trigger_decision['triggers']['schedule']['days_since_retrain']}")

if 'drift' in trigger_decision['triggers']:
    drift_info = trigger_decision['triggers']['drift']
    print(f"   Drift trigger: {drift_info['triggered']}")
    if drift_info['triggered']:
        print(f"   Drifted features: {drift_info['num_drifted']}")
        for feature, p_value, statistic in drift_info['drift_features']:
            print(f"      - {feature}: p={p_value:.4f}, KS-stat={statistic:.3f}")

if 'performance' in trigger_decision['triggers']:
    perf_info = trigger_decision['triggers']['performance']
    print(f"   Performance trigger: {perf_info['triggered']}")
    print(f"   Current accuracy: {perf_info['current_accuracy']:.3f}")

print(f"\n⚠️  DECISION: {'RETRAIN REQUIRED' if trigger_decision['should_retrain'] else 'NO RETRAIN NEEDED'}")

# Execute retraining if triggered
if trigger_decision['should_retrain']:
    print(f"\n{'='*60}")
    print("Executing Automated Retraining")
    print(f"{'='*60}\n")
    
    # Determine primary trigger reason
    reasons = []
    if trigger_decision['triggers']['schedule']['triggered']:
        reasons.append("Scheduled retrain")
    if trigger_decision['triggers'].get('drift', {}).get('triggered', False):
        reasons.append("Data drift detected")
    if trigger_decision['triggers'].get('performance', {}).get('triggered', False):
        reasons.append("Performance degradation")
    
    trigger_reason = " + ".join(reasons)
    
    # Run retraining pipeline
    retrain_result = pipeline.run_retraining(
        start_date=datetime.now() - timedelta(days=30),
        end_date=datetime.now(),
        trigger_reason=trigger_reason
    )
    
    if retrain_result['deployed']:
        trigger.last_retrain_date = datetime.now()  # Update last retrain timestamp
        print(f"\n✅ Continuous training cycle completed successfully")
        print(f"   Model version: {len(pipeline.retrain_history)}")
        print(f"   Previous accuracy: {production_accuracy:.3f}")
        print(f"   New accuracy: {retrain_result['metrics']['accuracy']:.3f}")
        print(f"   Improvement: {retrain_result['metrics']['accuracy'] - production_accuracy:.3f}")
    else:
        print(f"\n⚠️  Model rejected - production model retained")

# Show retraining history
print(f"\n{'='*60}")
print("📈 Retraining History")
print(f"{'='*60}\n")

for i, record in enumerate(pipeline.retrain_history, 1):
    print(f"Version {i}:")
    print(f"   Timestamp: {record['timestamp'].strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"   Reason: {record['reason']}")
    print(f"   Accuracy: {record['accuracy']:.3f}")
    print(f"   F1 Score: {record['f1']:.3f}")
    print()

## 5. Pipeline Orchestration with Airflow

**Purpose:** Integrate continuous training into production workflows using Apache Airflow DAGs.

**Key Points:**
- **DAGs (Directed Acyclic Graphs)**: Define task dependencies (data fetch → train → validate → deploy)
- **Scheduling**: Cron expressions for regular checks (daily, weekly)
- **Retries**: Automatic retry on failure (network issues, data availability)
- **Monitoring**: Track task status, execution time, failures
- **Alerting**: Notify team when retraining fails or model degrades

**Why This Matters:** Airflow is a widely used open-source orchestrator for data and ML pipelines (it originated at Airbnb and is now an Apache project).
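
To show what a weekly cron expression such as `0 2 * * 0` (Sunday at 02:00) means in practice, the sketch below computes its next firing time with the standard library only. It is not how Airflow evaluates schedules internally; `next_sunday_2am` is a hypothetical helper, and time zones are ignored (naive datetimes).

```python
from datetime import datetime, timedelta

def next_sunday_2am(now):
    """Next firing time of the cron expression '0 2 * * 0' strictly after `now`."""
    # datetime.weekday(): Monday=0 ... Sunday=6
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=2, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate

print(next_sunday_2am(datetime(2024, 1, 3, 12, 0)))  # a Wednesday
print(next_sunday_2am(datetime(2024, 1, 7, 2, 0)))   # exactly Sunday 02:00
```

A weekly schedule like this sets the upper bound on how stale a model can get when no drift or performance trigger fires in between.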

In [None]:
# Airflow DAG for Continuous Training (Conceptual Example)
# Note: This is simplified pseudocode - real Airflow requires installation and setup

airflow_dag_code = """
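from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
import pandas as pd

# Assumed to be importable from your own project modules (not defined in this file):
# RetrainingTrigger, ContinuousTrainingPipeline, fetch_train_data, fetch_production_data,
# get_production_accuracy, fetch_stdf_data, send_notification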

# Define default arguments
default_args = {
    'owner': 'mlops-team',
    'depends_on_past': False,
    'email': ['mlops@company.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

# Create DAG
dag = DAG(
    'yield_prediction_continuous_training',
    default_args=default_args,
    description='Automated yield prediction model retraining',
    schedule_interval='0 2 * * 0',  # Every Sunday at 2 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['ml', 'continuous-training', 'yield-prediction']
)

# Task 1: Check triggers
def check_triggers_task(**context):
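    trigger = RetrainingTrigger(schedule_days=7, drift_threshold=0.05, perf_threshold=0.85)
    # NOTE: a new RetrainingTrigger starts with last_retrain_date = now, so the
    # schedule check never fires here; restore the real date from persistent storage.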
    
    # Fetch data and check triggers
    decision = trigger.should_retrain(
        train_data=fetch_train_data(),
        production_data=fetch_production_data(),
        feature_cols=['vdd', 'idd', 'frequency', 'temperature'],
        current_accuracy=get_production_accuracy()
    )
    
    # Push decision to XCom (Airflow's inter-task communication)
    context['ti'].xcom_push(key='should_retrain', value=decision['should_retrain'])
    context['ti'].xcom_push(key='trigger_reason', value=str(decision['triggers']))
    
    return decision['should_retrain']

check_triggers = PythonOperator(
    task_id='check_retraining_triggers',
    python_callable=check_triggers_task,  # Airflow 2 passes the context automatically
    dag=dag
)

# Task 2: Fetch latest data
def fetch_data_task(**context):
    # Pull trigger decision
    should_retrain = context['ti'].xcom_pull(task_ids='check_retraining_triggers', key='should_retrain')
    
    if not should_retrain:
        return "Skipping - no retrain needed"
    
    # Fetch from data warehouse
    data = fetch_stdf_data(days=30)
    
    # Save to temporary location
    data.to_parquet('/tmp/retrain_data.parquet')
    
    return f"Fetched {len(data)} samples"

fetch_data = PythonOperator(
    task_id='fetch_training_data',
    python_callable=fetch_data_task,
    dag=dag
)

# Task 3: Train model
def train_model_task(**context):
    should_retrain = context['ti'].xcom_pull(task_ids='check_retraining_triggers', key='should_retrain')
    
    if not should_retrain:
        return "Skipping - no retrain needed"
    
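    # Load data saved by the fetch task
    # (unused in this simplified sketch: run_retraining() fetches its own data)
    data = pd.read_parquet('/tmp/retrain_data.parquet')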
    
    # Train model
    pipeline = ContinuousTrainingPipeline()
    result = pipeline.run_retraining(
        start_date=datetime.now() - timedelta(days=30),
        end_date=datetime.now(),
        trigger_reason=context['ti'].xcom_pull(task_ids='check_retraining_triggers', key='trigger_reason')
    )
    
    # Push result
    context['ti'].xcom_push(key='training_result', value=result)
    
    return result

train_model = PythonOperator(
    task_id='train_new_model',
    python_callable=train_model_task,
    dag=dag
)

# Task 4: Validate model
def validate_model_task(**context):
    result = context['ti'].xcom_pull(task_ids='train_new_model', key='training_result')
    
    # No result means the train task skipped (no retrain was needed)
    if result is None:
        return "Skipping - no retrain needed"
    
    if result['deployed']:
        return "Model passed validation gates"
    else:
        raise ValueError("Model failed validation - check gates")

validate_model = PythonOperator(
    task_id='validate_new_model',
    python_callable=validate_model_task,
    dag=dag
)

# Task 5: Deploy to production
def deploy_model_task(**context):
    result = context['ti'].xcom_pull(task_ids='train_new_model', key='training_result')
    
    # No result means the train task skipped (no retrain was needed)
    if result is None:
        return "Skipping - no retrain needed"
    
    if result['deployed']:
        # Update model registry
        # Copy model to production serving location
        # Update feature store references
        return "Model deployed to production"
    else:
        return "No deployment - validation failed"

deploy_model = PythonOperator(
    task_id='deploy_to_production',
    python_callable=deploy_model_task,
    dag=dag
)

# Task 6: Send notification
def notify_task(**context):
    result = context['ti'].xcom_pull(task_ids='train_new_model', key='training_result')
    
    # No result means the train task skipped (no retrain was needed)
    if result is None:
        return "Skipping - nothing to report"
    
    # Send Slack/email notification
    send_notification(
        subject="Yield Prediction Model Retrained",
        body=f"Deployed: {result['deployed']}\\nAccuracy: {result['metrics']['accuracy']:.3f}"
    )
    
    return "Notification sent"

notify = PythonOperator(
    task_id='send_notification',
    python_callable=notify_task,
    dag=dag
)

# Define task dependencies (execution order)
check_triggers >> fetch_data >> train_model >> validate_model >> deploy_model >> notify
"""

print("📋 Airflow DAG Structure for Continuous Training:")
print("="*60)
print(airflow_dag_code)
print("\n" + "="*60)
print("✅ DAG defines 6 tasks:")
print("   1. Check triggers (schedule, drift, performance)")
print("   2. Fetch latest data (from data warehouse)")
print("   3. Train new model (retraining pipeline)")
print("   4. Validate model (validation gates)")
print("   5. Deploy to production (if passed)")
print("   6. Send notification (Slack/email)")
print("\nSchedule: Every Sunday at 2 AM (weekly retraining)")
print("Retries: 2 attempts with 5-minute delay")
print("Alerts: Email on failure")

## 6. Real-World Project Templates

**Purpose:** 8 production-ready continuous training projects (4 post-silicon validation + 4 general AI/ML).

**Pattern:** Each project includes objectives, triggers, validation gates, and success criteria.

In [None]:
projects = {
    "post_silicon": [
        {
            "name": "Yield Prediction CT with Multi-Fab Drift Detection",
            "objective": "Continuous training system for wafer yield prediction across multiple fabs with fab-specific drift detection",
            "triggers": [
                "Schedule: Weekly retrain on Sunday 2 AM",
                "Data drift: KS test p-value < 0.05 for Vdd, Idd, frequency (fab-specific baselines)",
                "Performance: Accuracy drops below 88% on rolling 7-day validation",
                "Manual: Triggered when process change implemented (e.g., new litho tool)"
            ],
            "validation_gates": [
                "Accuracy >= 90% on holdout test set (1000 wafers)",
                "F1 score >= 0.85 (prevent class imbalance issues)",
                "Better than production model by >= 1%",
                "Fairness: Accuracy variance across fabs < 5%",
                "Business rule: Reject if predicts >15% yield loss (unrealistic)"
            ],
            "features": [
                "STDF parametric data: Vdd, Idd, frequency, power, temperature",
                "Derived features: power_efficiency, voltage_margin, test_coverage",
                "Fab metadata: fab_id, tool_id, operator_shift",
                "Temporal: day_of_week, lot_age, time_since_last_calibration",
                "Rolling aggregates: 7-day mean Vdd, std Idd per tool"
            ],
            "orchestration": "Airflow DAG with 7 tasks (trigger check → fetch data → feature engineering → train → validate → deploy → notify)",
            "success_criteria": "Maintain 90%+ accuracy for 6 months, <2% accuracy degradation between retrains, zero false rejections of good models",
            "value": "Save $2M/year by maintaining yield prediction accuracy (prevent scrap, optimize test coverage)"
        },
        {
            "name": "Test Time Prediction Retraining on Program Changes",
            "objective": "Auto-retrain test time model when test programs updated (new tests added, sequences changed)",
            "triggers": [
                "Event-driven: Test program version change detected in config management",
                "Performance: MAPE increases from baseline 3% to >5%",
                "Schedule: Nightly if any test program changed in last 24 hours",
                "Data drift: Test execution time distribution shifts (KS test)"
            ],
            "validation_gates": [
                "MAPE <= 4% on validation set (100 lots)",
                "Per-test prediction error < 10% for 95% of tests",
                "Prediction time < 50ms (real-time constraint)",
                "Better than production model by >= 0.5% MAPE"
            ],
            "features": [
                "Test program: test_name, test_category, parallelization_factor",
                "Historical: test_time_mean, test_time_std (last 1000 runs)",
                "Device context: device_type, package, temperature",
                "Load context: tester_utilization, time_of_day, concurrent_jobs"
            ],
            "orchestration": "Kubernetes CronJob + event-driven trigger (listen to test program Git commits)",
            "success_criteria": "Maintain MAPE <4% across all test programs, <1 hour retrain latency from program change, 100% uptime",
            "value": "Optimize test floor capacity planning (prevent 20% underutilization = $500K/year)"
        },
        {
            "name": "Binning Model CT for Dynamic Spec Adaptation",
            "objective": "Retrain binning model when product specs change (voltage limits, frequency targets, customer requirements)",
            "triggers": [
                "Manual: Triggered by product engineer when specs updated",
                "Schedule: Monthly review of binning accuracy",
                "Performance: Bin mismatch rate >3% (predicted vs actual customer bin)",
                "Data drift: Test limit distribution changes (new voltage corners)"
            ],
            "validation_gates": [
                "Bin accuracy >= 97% (predicted bin matches customer bin)",
                "Zero misclassifications of fails as premium bins (revenue risk)",
                "Bin distribution matches expected mix (prevent yield loss)",
                "Fairness: Bin accuracy consistent across device types"
            ],
            "features": [
                "Parametric test results: Vdd_min, Vdd_max, Fmax, leakage_current",
                "Spec margins: distance_to_spec_limit (for each parameter)",
                "Correlation features: Vdd_Fmax_ratio, power_at_nominal",
                "Historical: device_family_avg_bin (population prior)"
            ],
            "orchestration": "Manual trigger via API + monthly Airflow DAG for validation",
            "success_criteria": "97%+ binning accuracy, <0.1% premium bin false positives, <24 hour latency from spec change to deployment",
            "value": "Ensure revenue optimization ($50M/year for flagship product), prevent warranty claims from misbinning"
        },
        {
            "name": "Wafer Map Anomaly Detection Auto-Retraining",
            "objective": "Continuous training for spatial pattern anomaly detection (catch new defect signatures within 30 days)",
            "triggers": [
                "Schedule: Monthly retrain (capture seasonal process variations)",
                "Data drift: Spatial autocorrelation changes (Moran's I statistic)",
                "Performance: Precision drops below 80% (too many false anomalies)",
                "Manual: After new defect type discovered (root cause analysis)"
            ],
            "validation_gates": [
                "Precision >= 85% (anomalies are real defects, not noise)",
                "Recall >= 75% (catch majority of anomalies)",
                "Spatial coverage: Detect anomalies in all wafer regions (edge, center)",
                "False positive rate < 5% (prevent fab disruption)"
            ],
            "features": [
                "Spatial: die_x, die_y, distance_to_wafer_center, radial_zone",
                "Parametric: Vdd, Idd, frequency (per die)",
                "Derived: local_yield (3x3 neighborhood), spatial_gradient",
                "Contextual: wafer_id, lot_id, fab_tool_id, process_step"
            ],
            "orchestration": "Airflow DAG (monthly) + Kubeflow pipeline for model training (GPU for autoencoder)",
            "success_criteria": "80%+ precision and recall for 6 months, detect new defect type within 30 days, <5% false positive rate",
            "value": "Accelerate defect detection (save $1M/year in scrap), enable proactive yield improvement"
        }
    ],
    "general_ml": [
        {
            "name": "E-Commerce Recommendation CT with Seasonal Adaptation",
            "objective": "Retrain recommendation model to adapt to seasonal trends (holidays, back-to-school, prime day)",
            "triggers": [
                "Schedule: Weekly retrain during high-traffic seasons, monthly otherwise",
                "Performance: CTR drops below 3.5% (rolling 7-day average)",
                "Data drift: User behavior distribution shifts (KS test on session duration, categories)",
                "Event-driven: Major catalog updates (>10% new products)"
            ],
            "validation_gates": [
                "CTR >= 4.0% on holdout users (last 3 days)",
                "Diversity: Recommend products from >= 15 categories (prevent filter bubble)",
                "Novelty: 20% recommendations are products user hasn't seen",
                "Better than production by >= 0.3% CTR"
            ],
            "features": [
                "User: purchase_history, browsing_history, demographics, lifetime_value",
                "Product: category, price, ratings, inventory_status",
                "Context: time_of_day, device_type, session_duration",
                "Temporal: days_to_holiday, trending_score (last 24 hours)"
            ],
            "orchestration": "Kubeflow pipeline with dynamic scheduling (weekly/monthly based on traffic)",
            "success_criteria": "Maintain 4%+ CTR year-round, <12 hour retrain latency, zero downtime deployments",
            "value": "Increase revenue $5M/year (4% CTR vs 3.5% baseline = ~14% more clicks)"
        },
        {
            "name": "Fraud Detection CT with Adversarial Drift Handling",
            "objective": "Retrain fraud model to adapt to evolving fraud patterns (fraudsters change tactics to evade detection)",
            "triggers": [
                "Schedule: Daily retrain (fraud evolves rapidly)",
                "Performance: Precision drops below 90% (too many false positives = customer friction)",
                "Data drift: Transaction amount distribution changes, new merchant categories",
                "Manual: After fraud ring discovered (update labels for past transactions)"
            ],
            "validation_gates": [
                "Precision >= 92% (minimize false positives)",
                "Recall >= 75% (catch majority of fraud)",
                "Better than production by >= 1% F1 score",
                "Fairness: False positive rate consistent across demographics",
                "Latency: Inference < 100ms (real-time transaction approval)"
            ],
            "features": [
                "Transaction: amount, merchant_category, time, location",
                "User behavior: avg_transaction_amount, transaction_frequency, device_fingerprint",
                "Network: merchant_fraud_rate (last 30 days), peer_group_behavior",
                "Temporal: time_since_last_transaction, velocity (transactions/hour)"
            ],
            "orchestration": "Airflow DAG (daily) with A/B testing for 24 hours before full deployment",
            "success_criteria": "92%+ precision and 75%+ recall for 3 months, <6 hour retrain latency, <$50K false positive cost/month",
            "value": "Prevent $20M/year fraud losses, reduce false positive friction (save 10K customer complaints/month)"
        },
        {
            "name": "Churn Prediction CT with Feature Drift Monitoring",
            "objective": "Continuous training for customer churn prediction with proactive feature drift detection",
            "triggers": [
                "Schedule: Biweekly retrain (churn patterns shift gradually)",
                "Data drift: User engagement distribution changes (sessions, purchases)",
                "Performance: AUC-ROC drops below 0.82",
                "Feature drift: >20% of features show drift (KS test p < 0.05)"
            ],
            "validation_gates": [
                "AUC-ROC >= 0.85 on holdout users (last 60 days)",
                "Top 10% predicted churners have >= 50% actual churn rate (targeting efficiency)",
                "Better than production by >= 0.02 AUC-ROC",
                "Calibration: Predicted probabilities match observed frequencies (reliability diagram)"
            ],
            "features": [
                "User activity: login_frequency, session_duration, feature_usage",
                "Transactions: purchase_frequency, avg_order_value, days_since_last_purchase",
                "Engagement: support_tickets, app_ratings, email_open_rate",
                "Cohort: tenure, acquisition_channel, subscription_tier"
            ],
            "orchestration": "Airflow DAG (biweekly) with feature drift dashboard (Evidently AI)",
            "success_criteria": "Maintain 0.85+ AUC-ROC for 6 months, top 10% churners have 50%+ actual churn, <3 day retrain latency",
            "value": "Reduce churn 15% (save $3M/year), improve retention campaign ROI 2x"
        },
        {
            "name": "Demand Forecasting CT with External Signal Integration",
            "objective": "Retrain demand forecast model with external signals (weather, events, economic indicators)",
            "triggers": [
                "Schedule: Weekly retrain (capture weekly seasonality)",
                "Performance: MAPE increases from 8% to >12%",
                "Data drift: Sales distribution shifts (new product launch, competitor action)",
                "External event: Major holiday, weather event (hurricane), economic shock"
            ],
            "validation_gates": [
                "MAPE <= 10% on next 4 weeks forecast",
                "Bias < 5% (prevent systematic over/under-forecasting)",
                "Better than production by >= 1% MAPE",
                "Coverage: 80% prediction interval captures actual demand 80% of time"
            ],
            "features": [
                "Time series: sales_lag_1w, sales_lag_4w, sales_lag_52w (year-over-year)",
                "Trend: linear_trend, seasonal_decomposition (STL)",
                "External: weather_forecast, local_events, holiday_indicator",
                "Product: promotions, price_changes, inventory_level"
            ],
            "orchestration": "Airflow DAG (weekly) with external API calls (weather, events data)",
            "success_criteria": "Maintain 10% MAPE for 6 months, <1 day forecast latency, 80% prediction interval coverage",
            "value": "Optimize inventory $2M/year (reduce stockouts 30%, overstock 25%)"
        }
    ]
}

print("🎯 8 Continuous Training Project Templates")
print("="*80)

# Both project groups share one layout, so print them with a single loop
sections = [
    ("📦 POST-SILICON VALIDATION PROJECTS (4)", "post_silicon"),
    ("🌐 GENERAL AI/ML PROJECTS (4)", "general_ml"),
]

for title, key in sections:
    print(f"\n{title}\n")
    for i, project in enumerate(projects[key], 1):
        # Trigger type is the text before the first colon (e.g. "Schedule")
        trigger_types = ', '.join(t.split(':')[0] for t in project['triggers'])
        print(f"{i}. {project['name']}")
        print(f"   Objective: {project['objective']}")
        print(f"   Triggers: {len(project['triggers'])} types ({trigger_types})")
        print(f"   Validation Gates: {len(project['validation_gates'])} checks")
        print(f"   Orchestration: {project['orchestration']}")
        print(f"   Success: {project['success_criteria']}")
        print(f"   💰 Value: {project['value']}")
        print()

print("="*80)
print("✅ All projects include: Multi-trigger logic, validation gates, orchestration, success metrics")

## 7. 🎓 Key Takeaways & Best Practices

### 📌 Core Concepts

**1. Continuous Training Fundamentals**
- **Definition**: Automated model retraining when performance degrades, data drifts, or schedule triggers
- **Purpose**: Maintain model accuracy in production as data distributions change
- **Components**: Triggers (when to retrain), pipeline (how to retrain), gates (when to deploy), monitoring (track performance)
- **CT vs CI/CD**: CT focuses on model updates (data-driven), CI/CD on code updates (developer-driven) - both needed for MLOps

**2. Why Models Degrade Over Time**
- **Data drift**: Feature distributions change (e.g., Vdd voltage shifted from 1.2V to 1.28V due to process change)
- **Concept drift**: Relationship between features and target changes (e.g., yield formula changes with new test coverage)
- **Upstream changes**: Data pipeline modifications, sensor calibration, measurement errors
- **Population shift**: Production data differs from training data (sampling bias, temporal effects)

**Without CT**: Models become stale (for example, accuracy falling from 92% to 78% over 6 months)  
**With CT**: Models stay fresh (accuracy can be held at 90%+ as long as retraining keeps pace with drift)

---

### ⚙️ Trigger Mechanisms

**3. Schedule-Based Triggers**
- **How it works**: Retrain at fixed intervals (daily, weekly, monthly)
- **Pros**: Simple, predictable, ensures models always incorporate latest data
- **Cons**: May retrain unnecessarily (waste compute), may miss urgent degradation between intervals
- **When to use**: Baseline strategy for all models, sufficient when drift is gradual

**Example cron schedules**:
- Daily: `0 2 * * *` (2 AM every day - low traffic time)
- Weekly: `0 2 * * 0` (Sunday 2 AM)
- Monthly: `0 2 1 * *` (1st of month 2 AM)

**4. Data Drift Triggers**
- **How it works**: Statistical tests detect distribution changes (KS test, Chi-square, PSI)
- **Kolmogorov-Smirnov (KS) test**: Compare training vs production distributions (p-value < 0.05 → drift)
- **Population Stability Index (PSI)**: Measure distribution shift (PSI > 0.2 = significant drift)
- **Per-feature drift**: Test each feature independently, trigger if >=20% features drift
- **Pros**: Data-driven (retrain only when needed), catch drift before performance degrades
- **Cons**: Requires production data logging, threshold tuning (too sensitive = false alarms)

**Implementation**:
```python
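from scipy.stats import ks_2samp

# train_data / prod_data: DataFrames sharing the columns in feature_cols
drifted = []
for feature in feature_cols:
    statistic, p_value = ks_2samp(train_data[feature], prod_data[feature])
    if p_value < 0.05:  # Reject null hypothesis (distributions differ)
        drifted.append(feature)

# Trigger once, and only if >=20% of features drifted (per-feature rule above)
if len(drifted) >= 0.2 * len(feature_cols):
    trigger_retrain(reason=f"Drift in {', '.join(drifted)}")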
```
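
The PSI threshold mentioned above can be computed in a few lines. The following is a minimal sketch, not a reference implementation: the function name `population_stability_index`, the quantile-based binning, the `eps` floor for empty bins, and the synthetic Vdd samples are all assumptions made for illustration.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a new sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from quantiles of the reference sample
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))
    # Use interior edges only, so values outside the reference range
    # fall into the first or last bin instead of being dropped
    exp_idx = np.searchsorted(edges[1:-1], expected, side="right")
    act_idx = np.searchsorted(edges[1:-1], actual, side="right")
    exp_frac = np.bincount(exp_idx, minlength=n_bins) / expected.size
    act_frac = np.bincount(act_idx, minlength=n_bins) / actual.size
    # eps avoids log(0) and division by zero for empty bins
    exp_frac = np.clip(exp_frac, eps, None)
    act_frac = np.clip(act_frac, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

rng = np.random.default_rng(0)
train_vdd = rng.normal(1.20, 0.02, 5000)    # synthetic reference Vdd
same_vdd = rng.normal(1.20, 0.02, 5000)     # same process, new sample
shifted_vdd = rng.normal(1.28, 0.02, 5000)  # process shift to 1.28V

psi_same = population_stability_index(train_vdd, same_vdd)
psi_shifted = population_stability_index(train_vdd, shifted_vdd)
print(f"PSI (no shift): {psi_same:.4f}, PSI (shifted): {psi_shifted:.4f}")
```

Unlike a KS p-value, PSI does not shrink toward zero simply because the sample is large, which can make a fixed threshold such as 0.2 easier to reason about on high-volume test data.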

**5. Performance-Based Triggers**
- **How it works**: Monitor production metrics (accuracy, F1, MAPE), retrain if below threshold
- **Requires**: Ground truth labels in production (e.g., yield labels after test completion)
- **Metrics to track**: Accuracy, precision, recall, F1, AUC-ROC, MAPE, RMSE (depends on problem)
- **Threshold setting**: Historical baseline minus 5 percentage points (e.g., a 90% baseline triggers at 85%)
- **Pros**: Directly tied to business impact, few false alarms (the trigger fires on measured degradation, not on a proxy)
- **Cons**: Reactive (waits for degradation), requires labeled production data (not always available)

**Rolling window validation**: Evaluate on last 7 days of production data to detect gradual degradation
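
A rolling-window performance trigger along these lines might look as follows. This is a sketch under stated assumptions: the helper names `rolling_accuracy` and `performance_trigger`, the `(timestamp, y_true, y_pred)` record layout, and the sample data are hypothetical, and the threshold follows the "baseline minus 5 points" rule from item 5.

```python
from datetime import datetime, timedelta

def rolling_accuracy(records, now, window_days=7):
    """Accuracy over labeled records from the last `window_days`.

    `records` is an iterable of (timestamp, y_true, y_pred) tuples.
    Returns None when the window holds no labeled records.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [(t, p) for ts, t, p in records if cutoff < ts <= now]
    if not recent:
        return None
    return sum(t == p for t, p in recent) / len(recent)

def performance_trigger(records, now, baseline=0.90, max_drop=0.05):
    """True when rolling accuracy falls below baseline - max_drop."""
    acc = rolling_accuracy(records, now)
    return acc is not None and acc < baseline - max_drop

now = datetime(2024, 1, 15)
# 50 old, correct predictions that lie outside the 7-day window
records = [(now - timedelta(days=20), 1, 1)] * 50
# 10 recent predictions, 8 of them correct
records += [
    (now - timedelta(days=d % 7, hours=1), 1, 1 if d < 8 else 0)
    for d in range(10)
]

recent_acc = rolling_accuracy(records, now)
fire = performance_trigger(records, now)
print(recent_acc, fire)
```

Returning `None` for an empty window keeps a gap in label delivery from being mistaken for a performance drop.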

**6. Hybrid Trigger Strategies**
- **Combine multiple triggers**: Schedule OR (drift AND performance) OR manual override - the schedule guarantees regular updates, while drift confirmed by a performance drop adds urgency
- **Priority levels**: Performance drop (urgent, retrain immediately) > Data drift (important, retrain next cycle) > Schedule (routine)
- **Override mechanisms**: Manual trigger for emergencies (spec change, major bug fix)

**Example logic**:
```python
# The trigger objects and manual_override are assumed to be defined elsewhere
should_retrain = (
    schedule_trigger.triggered()
    or (drift_trigger.triggered() and performance_trigger.degraded())
    or manual_override
)
```
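
The priority levels listed above can be made explicit in code. The sketch below is illustrative only: the `Priority` enum, the `retrain_priority` function, and the choice to treat a manual override as urgent are assumptions, not part of the notebook's pipeline.

```python
from enum import IntEnum

class Priority(IntEnum):
    NONE = 0       # nothing fired, skip retraining
    ROUTINE = 1    # schedule only
    IMPORTANT = 2  # data drift: retrain in the next cycle
    URGENT = 3     # performance drop or manual override: retrain now

def retrain_priority(schedule_due, drift_detected,
                     performance_degraded, manual_override=False):
    """Return the highest priority among the triggers that fired."""
    if performance_degraded or manual_override:
        return Priority.URGENT
    if drift_detected:
        return Priority.IMPORTANT
    if schedule_due:
        return Priority.ROUTINE
    return Priority.NONE

print(retrain_priority(schedule_due=True, drift_detected=True,
                       performance_degraded=False).name)
```

Because `IntEnum` members compare as integers, a scheduler could sort pending retrain requests by priority directly.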

---

### 🔄 Retraining Pipeline Architecture

**7. Pipeline Stages**
1. **Data fetching**: Pull latest data from production (last 30-90 days typically)
2. **Feature engineering**: Apply same transformations as training (use feature store for consistency)
3. **Model training**: Train with same hyperparameters OR run hyperparameter tuning
4. **Validation**: Check quality on holdout set (last 20% of data or time-based split)
5. **Deployment**: If passed gates, replace production model (with rollback capability)
6. **Monitoring**: Track new model performance in production (A/B test, shadow mode)
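
For the validation stage, a time-based split keeps the newest records as the holdout so the model is judged on data that postdates everything it trained on. A minimal sketch, assuming each row is a dict with a hypothetical `timestamp` key:

```python
from datetime import datetime, timedelta

def time_based_split(rows, timestamp_key="timestamp", holdout_fraction=0.2):
    """Split rows chronologically: oldest for training, newest for holdout."""
    ordered = sorted(rows, key=lambda r: r[timestamp_key])
    n_holdout = max(1, int(len(ordered) * holdout_fraction))
    return ordered[:-n_holdout], ordered[-n_holdout:]

start = datetime(2024, 1, 1)
# Ten synthetic rows, deliberately supplied newest first
rows = [{"timestamp": start + timedelta(days=d), "lot": d}
        for d in reversed(range(10))]

train_rows, holdout_rows = time_based_split(rows)
print(len(train_rows), len(holdout_rows))
```

A random split would leak future process conditions into training, which tends to overstate accuracy when drift is present.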

**8. Validation Gates (Quality Checks)**
- **Absolute threshold**: Accuracy >= 85% (minimum acceptable performance)
- **Relative threshold**: New model >= production model + 1% (ensure improvement)
- **Business rules**: No predictions of >15% yield loss (unrealistic, likely bug)
- **Fairness checks**: Accuracy variance across demographics/fabs < 5%
- **Latency constraints**: Inference time < 100ms (real-time applications)
- **All gates must pass**: If any gate fails, reject model (keep production model)

**Example validation**:
```python
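from sklearn.metrics import accuracy_score, f1_score

def validate_model(new_model, prod_model, X_val, y_val):
    # Predict once per model and reuse the predictions for every metric
    new_pred = new_model.predict(X_val)
    prod_pred = prod_model.predict(X_val)
    new_acc = accuracy_score(y_val, new_pred)
    prod_acc = accuracy_score(y_val, prod_pred)

    gates = {
        'absolute': new_acc >= 0.85,
        'relative': new_acc >= prod_acc + 0.01,
        'f1': f1_score(y_val, new_pred) >= 0.75
    }

    # Deploy only if every gate passes
    return all(gates.values())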
```
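
The fairness gate from the list above can be added in the same spirit. This sketch interprets "accuracy variance across fabs < 5%" as the gap between the best and worst per-fab accuracy, which is an assumption; the function names and sample labels are hypothetical.

```python
from collections import defaultdict

def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each group label (e.g. fab_id)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

def fairness_gate(y_true, y_pred, groups, max_gap=0.05):
    """Pass when best and worst group accuracy differ by less than max_gap."""
    acc = per_group_accuracy(y_true, y_pred, groups)
    gap = max(acc.values()) - min(acc.values())
    return gap < max_gap, acc

# Fab A: 10/10 correct, Fab B: 9/10 correct -> gap of 0.10 fails the gate
y_true = [1] * 20
y_pred = [1] * 19 + [0]
fabs = ["A"] * 10 + ["B"] * 10

passed, accuracy_by_fab = fairness_gate(y_true, y_pred, fabs)
print(passed, accuracy_by_fab)
```

Returning the per-group accuracies alongside the verdict makes it easier to log which fab caused a rejection.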

**9. Model Versioning & Rollback**
- **Version tracking**: Store model with metadata (timestamp, trigger reason, metrics, Git commit)
- **Registry**: Use MLflow, W&B, or custom registry to track all versions
- **Rollback**: If new model fails in production (detected by monitoring), revert to previous version
- **Retention**: Keep last 5-10 versions for quick rollback, archive older versions

**Metadata example**:
```python
{
    'model_id': 'yield_pred_v23',
    'timestamp': '2024-01-15T02:30:00Z',
    'trigger': 'Data drift in Vdd + scheduled retrain',
    'accuracy': 0.923,
    'f1': 0.897,
    'training_data': 'STDF 2023-12-15 to 2024-01-15',
    'git_commit': 'a3f7b2c',
    'deployed': True
}
```
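
Retention and rollback can be illustrated with a small in-memory registry. This is a toy sketch, not the MLflow or W&B API: the `ModelRegistry` class and its methods are hypothetical, and a real registry would persist artifacts and metadata.

```python
class ModelRegistry:
    """Keeps the most recent model versions; the last one is in production."""

    def __init__(self, max_versions=5):
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self._versions = []  # oldest first

    def deploy(self, model_id, metadata):
        self._versions.append({"model_id": model_id, **metadata})
        # Retention: drop everything older than the newest max_versions
        del self._versions[:-self.max_versions]

    @property
    def production(self):
        return self._versions[-1]["model_id"] if self._versions else None

    def rollback(self):
        """Discard the current version and return the restored model id."""
        if len(self._versions) < 2:
            raise RuntimeError("No previous version to roll back to")
        self._versions.pop()
        return self.production

registry = ModelRegistry(max_versions=2)
registry.deploy("yield_pred_v21", {"accuracy": 0.910})
registry.deploy("yield_pred_v22", {"accuracy": 0.918})
registry.deploy("yield_pred_v23", {"accuracy": 0.923})  # v21 is pruned
restored = registry.rollback()
print(restored)
```

The retention limit bounds how far back a rollback can go, so it should be chosen with the retraining frequency in mind.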

---

### 🎯 Orchestration with Airflow

**10. Airflow DAG Basics**
- **DAG (Directed Acyclic Graph)**: Defines task dependencies (A → B → C)
- **Tasks**: Individual operations (fetch_data, train_model, validate, deploy)
- **Operators**: PythonOperator (run Python function), BashOperator (run shell command), custom operators
- **Scheduling**: Cron expressions (`0 2 * * 0` = weekly), or dynamic triggers
- **XCom**: Inter-task communication (pass data between tasks)

**11. CT-Specific DAG Design**
- **Branching**: If trigger not fired, skip retraining tasks (save compute)
- **Retries**: Automatic retry on transient failures (network issues, database timeout)
- **Alerting**: Send notification on success/failure (Slack, email, PagerDuty)
- **Dependencies**: Ensure tasks run in order (can't deploy before validation)

**DAG structure for CT**:
```
check_triggers → fetch_data → engineer_features → train_model
                                                       ↓
                                            validate_model
                                                       ↓
                                            deploy_to_production
                                                       ↓
                                            send_notification
```
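
The control flow this DAG encodes (skip when no trigger fired, retry transient failures, notify on every outcome) can be sketched without any orchestrator. The code below is plain Python, not the Airflow API; `run_with_retries` and `run_ct_pipeline` are hypothetical helpers. In Airflow itself, branching is commonly expressed with `BranchPythonOperator` and retries with the task-level `retries` argument.

```python
import time

def run_with_retries(task, retries=2, delay_s=0.0):
    """Call task(); on an exception, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay_s)

def run_ct_pipeline(check_triggers, stages, notify):
    """Run stages in order, feeding each stage the previous stage's output.

    check_triggers returns a reason string, or None when nothing fired.
    stages is an ordered list of (name, callable) pairs.
    """
    reason = check_triggers()
    if reason is None:
        notify("skipped: no trigger fired")
        return "skipped"
    payload = reason
    for name, stage in stages:
        try:
            payload = run_with_retries(lambda: stage(payload))
        except Exception as exc:
            notify(f"failed at {name}: {exc}")
            return "failed"
    notify("success")
    return "success"

log = []
status = run_ct_pipeline(
    check_triggers=lambda: "Drift in Vdd",
    stages=[("fetch_data", lambda reason: [1, 2, 3]),
            ("train_model", lambda data: f"model trained on {len(data)} rows")],
    notify=log.append,
)
print(status, log)
```

Stopping at the first failed stage mirrors the dependency rule above: a model that never passed validation can never reach deployment.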

**12. Alternative Orchestration Tools**
- **Kubeflow**: Kubernetes-native ML pipelines (better for complex workflows, GPU training)
- **Prefect**: Modern Python-first orchestrator (easier than Airflow, better developer experience)
- **AWS Step Functions**: Serverless orchestration (AWS-specific)
- **Argo Workflows**: Kubernetes-native (popular for MLOps on K8s)
- **Metaflow**: Netflix's framework (great for data science workflows)

**Airflow vs Kubeflow**:
- Airflow: General-purpose, mature, great for batch workflows
- Kubeflow: ML-specific, native Kubernetes, better for distributed training

---

### 🏭 Post-Silicon Validation Applications

**13. Yield Prediction CT**
- **Challenge**: Process changes (new litho tool, voltage adjustments) cause data drift
- **Solution**: Weekly scheduled retrain + KS test drift detection on Vdd, Idd, frequency
- **Validation**: Accuracy >= 90%, fairness across fabs (variance < 5%)
- **Value**: Maintain yield prediction accuracy → prevent $2M/year scrap losses

**14. Test Time Prediction CT**
- **Challenge**: Test programs change frequently (new tests, sequence updates)
- **Solution**: Event-driven trigger (Git commit to test program) + nightly retrain
- **Validation**: MAPE <= 4%, prediction time < 50ms
- **Value**: Accurate capacity planning → prevent 20% test floor underutilization ($500K/year)

**15. Binning Model CT**
- **Challenge**: Product specs change (voltage limits, frequency targets)
- **Solution**: Manual trigger when specs updated + monthly validation
- **Validation**: Bin accuracy >= 97%, zero false positives for premium bins
- **Value**: Revenue optimization ($50M/year), prevent warranty claims

**16. Wafer Map Anomaly Detection CT**
- **Challenge**: New defect patterns emerge over time (need model to learn them)
- **Solution**: Monthly retrain + spatial drift detection (Moran's I statistic)
- **Validation**: Precision >= 85%, recall >= 75%, false positive rate < 5%
- **Value**: Detect defects within 30 days → accelerate root cause analysis ($1M/year savings)

---

### ⚠️ Common Pitfalls

**17. Retraining Too Frequently**
- **Problem**: Daily retraining when drift is slow → waste compute, increase complexity
- **Solution**: Start with weekly/monthly schedule, add drift triggers only if needed
- **Cost**: Daily retraining can cost $1000+/month in cloud compute (vs $100/month weekly)

**18. Insufficient Validation**
- **Problem**: Deploy new model without proper testing → production failures
- **Solution**: Multiple validation gates (accuracy, fairness, business rules), A/B testing before full deployment
- **Example failure**: New model predicts 50% yield loss (bug) → deploys → fab halts production

**19. Ignoring Data Quality**
- **Problem**: Retrain on corrupted data (sensor failure, pipeline bug) → model worse than before
- **Solution**: Data quality checks before training (missing values, outliers, schema validation)
- **Example**: Vdd sensor malfunction reports 0V → model learns incorrect pattern
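
A pre-training data quality check for this pitfall could be sketched as below. The function name `data_quality_issues`, the 1% missing-value tolerance, and the Vdd range of 1.0V to 1.4V are illustrative assumptions; real limits would come from the product spec.

```python
import pandas as pd

def data_quality_issues(df, required_columns, valid_ranges, max_missing_frac=0.01):
    """Return a list of human-readable problems; an empty list means OK."""
    issues = []
    # Schema check: every required column must be present
    missing_cols = [c for c in required_columns if c not in df.columns]
    if missing_cols:
        issues.append(f"missing columns: {missing_cols}")
    for col, (low, high) in valid_ranges.items():
        if col not in df.columns:
            continue
        # Missing-value check
        missing_frac = df[col].isna().mean()
        if missing_frac > max_missing_frac:
            issues.append(f"{col}: {missing_frac:.1%} missing values")
        # Range check (NaN compares as False, so it is not double counted)
        out_of_range = int(((df[col] < low) | (df[col] > high)).sum())
        if out_of_range:
            issues.append(f"{col}: {out_of_range} values outside [{low}, {high}]")
    return issues

# One Vdd reading of 0V mimics the sensor malfunction described above
batch = pd.DataFrame({"Vdd": [1.20, 1.21, 0.0, 1.19],
                      "Idd": [0.50, 0.60, 0.55, 0.52]})
problems = data_quality_issues(
    batch,
    required_columns=["Vdd", "Idd", "frequency"],
    valid_ranges={"Vdd": (1.0, 1.4)},
)
print(problems)
```

If the list is non-empty, the pipeline can skip training and alert the team instead of learning from the corrupted batch.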

**20. No Rollback Plan**
- **Problem**: New model degrades in production, no way to revert quickly
- **Solution**: Keep last 3-5 model versions, automated rollback on performance drop
- **Detection**: Monitor rolling 24-hour accuracy, roll back if it drops by more than 3 percentage points

**21. Training-Serving Skew**
- **Problem**: Features computed differently in training vs production → model fails
- **Solution**: Use feature store (consistent feature definitions), validate feature distributions
- **Example**: Training uses offline batch features, production uses real-time features → different values

**22. Overfitting to Recent Data**
- **Problem**: Retrain on last 7 days only → model forgets long-term patterns
- **Solution**: Use last 30-90 days of data, balance recent (high weight) + historical (low weight)
- **Weighting**: Recent data 2x weight, older data 1x weight (time-based importance)
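
The 2x/1x weighting can be expressed as per-sample weights. In this sketch the function name `recency_weights` and the 30-day cutoff between "recent" and "older" are assumptions chosen for illustration.

```python
import numpy as np

def recency_weights(ages_days, recent_days=30,
                    recent_weight=2.0, older_weight=1.0):
    """Weight samples newer than `recent_days` more heavily than older ones."""
    ages = np.asarray(ages_days, dtype=float)
    return np.where(ages <= recent_days, recent_weight, older_weight)

# Sample ages in days relative to the retraining date
weights = recency_weights([1, 30, 31, 90])
print(weights)
```

Many scikit-learn estimators accept such an array through the `sample_weight` argument of `fit`, so the full 30-90 day window stays in the training set while recent lots count more.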

---

### ✅ Best Practices

**23. Start Simple, Iterate**
- **Phase 1**: Manual retraining (understand drift patterns)
- **Phase 2**: Scheduled retraining (weekly/monthly)
- **Phase 3**: Add drift detection triggers
- **Phase 4**: Add performance triggers + A/B testing
- **Don't**: Build complex multi-trigger system on day 1

**24. Monitor Everything**
- **Training metrics**: Accuracy, F1, training time, data volume
- **Deployment metrics**: Models deployed, rejected, rolled back
- **Production metrics**: Inference latency, throughput, error rate
- **Business metrics**: Revenue impact, cost savings (tie CT to ROI)

**25. Automate Validation Reports**
- **Generate**: Comparison report (new model vs production model)
- **Include**: Metrics, feature importances, prediction distributions, validation gate results
- **Share**: Email to team after each retrain (transparency)

**26. Handle Edge Cases**
- **Insufficient data**: If <1000 samples in retrain window, skip retrain (wait for more data)
- **Training failures**: Retry 2x, if still fails, alert team (don't deploy broken model)
- **Validation failures**: Log reason (which gate failed), investigate (data quality? threshold too strict?)

**27. Document Trigger Decisions**
- **Log**: Every trigger check (timestamp, trigger type, decision, reason)
- **Dashboard**: Show trigger history (when did we retrain? why?)
- **Analyze**: Review trigger patterns monthly (are we retraining too much? too little?)

**28. Use A/B Testing Before Full Deployment**
- **Shadow mode**: Run new model in parallel with production (log predictions, don't serve)
- **A/B test**: Serve new model to 10% of traffic for 24 hours
- **Full deployment**: If A/B test passes, deploy to 100%
- **Safety**: Limits blast radius if new model has issues
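
Routing roughly 10% of traffic to the new model can be done with a deterministic hash, so the same unit always sees the same model for the duration of the test. A minimal sketch; the function name `assign_variant`, the salt string, and the ID format are hypothetical.

```python
import hashlib

def assign_variant(unit_id, treatment_fraction=0.10, salt="ct-ab-test"):
    """Deterministically map an ID to 'new_model' or 'production_model'."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    # First 8 hex digits give a 32-bit integer; scale it to [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "new_model" if bucket < treatment_fraction else "production_model"

assignments = [assign_variant(f"lot-{i}") for i in range(10_000)]
share = assignments.count("new_model") / len(assignments)
print(f"Share routed to new model: {share:.3f}")
```

Changing the salt between experiments reshuffles the assignment, which keeps the same units from always landing in the treatment group.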

---

### 🚀 Production Checklist

**Before deploying CT system**:
- [ ] Trigger logic implemented and tested (schedule, drift, performance)
- [ ] Validation gates defined with business stakeholders (thresholds agreed)
- [ ] Orchestration pipeline tested end-to-end (Airflow DAG runs successfully)
- [ ] Model versioning and rollback tested (can revert to previous version)
- [ ] Monitoring dashboards created (trigger checks, retraining status, model metrics)
- [ ] Alerting configured (Slack/email on failures)
- [ ] Data quality checks added (schema validation, outlier detection)
- [ ] Feature store integration (consistent features in training and production)
- [ ] A/B testing infrastructure ready (traffic splitting, metric tracking)
- [ ] Documentation written (runbook for failures, trigger tuning guide)
- [ ] Team training complete (know how to debug, override, rollback)
- [ ] Cost estimated (compute budget for retraining)

---

### 🎯 When to Use Continuous Training

**✅ Use CT when**:
- Data distributions change over time (seasonality, trends, process changes)
- Model performance degrades in production (drift, concept shift)
- New data arrives regularly (daily/weekly production data)
- Business requires up-to-date models (fraud, recommendations, demand forecasting)
- Cost of stale model is high (lost revenue, safety risk)

**❌ Don't use CT when**:
- Data is static (no new data after initial training)
- Model performance is stable (no degradation over 6+ months)
- Retraining cost > benefit (expensive training, minimal accuracy gain)
- Business doesn't require freshness (historical analysis, one-time prediction)

**Alternatives**:
- **Manual retraining**: Retrain when performance drops (alerts trigger manual investigation)
- **Online learning**: Update model incrementally with each new sample (no batch retraining)
- **Ensemble with new models**: Keep old model, add new model, ensemble predictions

---

### 📚 Next Steps

**After mastering continuous training**:
1. **Model Governance (Notebook 127)**: Model cards, audit trails, compliance for regulated industries
2. **Production Monitoring (Notebook 128)**: Real-time monitoring, alerting, incident response for deployed models
3. **CI/CD for ML (Notebook 129)**: Automate code + model deployment with GitHub Actions, Jenkins
4. **Advanced MLOps (Notebook 130)**: Multi-model pipelines, model ensembles, AutoML integration

**Recommended resources**:
- Book: "Machine Learning Design Patterns" (Lakshmanan et al.) - Chapter on continuous training
- Paper: "The ML Test Score" (Google) - Validation framework for production ML systems
- Course: "Full Stack Deep Learning" - Production ML best practices
- Tool docs: Airflow documentation, Kubeflow pipelines, MLflow model registry

---

**🎯 Remember**: For production models exposed to drift, continuous training is the difference between models that stay accurate (save millions) and models that degrade silently (cost millions). Start simple, monitor everything, iterate based on production feedback.

## 🔑 Key Takeaways

**When to Use Continuous Training:**
- Model degrades over time (concept drift, data shift)
- Frequent new data available (daily/weekly batches)
- Business requires up-to-date predictions
- Manual retraining too slow or error-prone

**Limitations:**
- Infrastructure complexity (orchestration, monitoring, rollback)
- Training costs accumulate (compute, storage)
- Risk of catastrophic forgetting (new data replaces old patterns)
- Regulatory challenges (model versioning, auditability)

**Alternatives:**
- Periodic batch retraining (weekly/monthly schedule)
- Online learning (real-time updates per sample)
- Trigger-based retraining (drift threshold exceeded)
- Ensemble with new + old models

**Best Practices:**
- Monitor drift metrics continuously (PSI, KL divergence)
- Implement automated rollback on quality degradation
- Version all artifacts (data, code, models, configs)
- Use canary/shadow deployments for validation
- Document retraining triggers and thresholds

**Next Steps:**
- 127: Model Governance & Compliance (audit continuous training)
- 154: Model Monitoring & Observability (drift detection)
- 156: ML Pipeline Orchestration (advanced workflows)