# 109: ML Pipelines with Apache Airflow

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** DAG (Directed Acyclic Graph) design for ML workflows
- **Implement** an end-to-end ML pipeline: data extraction → training → validation → deployment
- **Build** automated retraining schedules with dependency management
- **Apply** Airflow to semiconductor test data pipelines (STDF → features → models)
- **Evaluate** pipeline monitoring, failure recovery, and backfilling strategies

## 📚 What are ML Pipelines?

ML pipelines orchestrate the sequence of steps from raw data to deployed model predictions. Unlike one-off notebook experiments, production ML requires repeatable, monitored workflows that handle failures gracefully. Apache Airflow represents pipelines as DAGs where nodes are tasks (Python functions, SQL queries, model training) and edges are dependencies ("train model only after data validation passes").
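To make the DAG idea concrete, here is a minimal sketch that uses Python's standard-library `graphlib` (Python 3.9+) rather than Airflow itself. The task names are illustrative assumptions; the point is only that dependencies determine a valid run order and that a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of upstream tasks it waits for.
# Task names are illustrative, not taken from a real Airflow deployment.
dependencies = {
    "extract_stdf": set(),
    "validate_data": {"extract_stdf"},
    "engineer_features": {"validate_data"},
    "train_model": {"engineer_features"},
    "validate_model": {"train_model"},
    "deploy_model": {"validate_model"},
}

def run_order(graph):
    """Return one valid execution order, or raise CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())

order = run_order(dependencies)
print(" -> ".join(order))

# An edge from deploy_model back to extract_stdf creates a cycle,
# which is what the "acyclic" in DAG forbids.
cyclic = {**dependencies, "extract_stdf": {"deploy_model"}}
try:
    run_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

Airflow performs the equivalent ordering internally; you only declare the edges.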

Airflow's key features are **scheduling** (daily retraining at 2 AM), **dependency management** (skip deployment if accuracy < threshold), **retries** (network glitches don't break pipelines), **monitoring** (SLA alerts if a pipeline exceeds 4 hours), and **backfilling** (reprocess the last 30 days after a bug fix). Tasks can also run in isolated environments (for example, Docker containers launched via `DockerOperator` or `KubernetesPodOperator`), which enables polyglot pipelines (Python preprocessing → Spark training → R validation).

In semiconductor manufacturing, Airflow can orchestrate nightly STDF data ingestion (extract from test servers → parse to DataFrames → quality checks → feature engineering → model retraining → deploy if improved → notify engineers). When a tester goes offline, the pipeline detects the missing data, skips dependent tasks, and alerts the on-call engineer. Manual steps (such as approving a deployment) can be integrated via sensors.

**Why ML Pipelines with Airflow?**
- ✅ **Automation**: Zero-touch retraining, no manual notebook runs
- ✅ **Reliability**: Automatic retries, failure notifications, SLA monitoring
- ✅ **Scalability**: Distribute tasks across workers (Kubernetes, Celery)
- ✅ **Observability**: Web UI shows every run, logs, task durations
- ✅ **Version Control**: DAGs as code, track changes in Git

## 🏭 Post-Silicon Validation Use Cases

**Use Case 1: Daily Yield Model Retraining**
- **Pipeline**: STDF extraction → data quality checks → feature engineering → model training → A/B test → deploy if wins
- **Schedule**: 2 AM daily (after all test data uploaded)
- **Trigger**: 10K+ new devices tested (skip if insufficient data)
- **Monitoring**: SLA = 4 hours, alert if accuracy < 90%
- **Impact**: Model adapts to process drift within 24 hours (vs weekly manual retrains)
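The trigger and deployment conditions in this use case amount to a branching decision. The sketch below is one hypothetical way to express it in plain Python; the function name, branch names, and the champion comparison are assumptions, while the 10K-device and 90% thresholds come from the bullets above.

```python
MIN_NEW_DEVICES = 10_000   # from the use case: skip below 10K new devices
MIN_ACCURACY = 0.90        # from the use case: alert below 90% accuracy

def choose_next_step(new_devices, candidate_accuracy=None, champion_accuracy=None):
    """Return the name of the branch to follow for this run (hypothetical branch names)."""
    if new_devices < MIN_NEW_DEVICES:
        return "skip_run"          # not enough fresh data to retrain
    if candidate_accuracy is None:
        return "train_model"       # data is sufficient, training has not run yet
    if candidate_accuracy < MIN_ACCURACY:
        return "alert_team"        # below the alerting threshold
    if champion_accuracy is not None and candidate_accuracy <= champion_accuracy:
        return "keep_champion"     # candidate did not win the A/B comparison
    return "deploy_model"

print(choose_next_step(4_200))
print(choose_next_step(15_000, candidate_accuracy=0.93, champion_accuracy=0.91))
```

In Airflow, a function like this would typically serve as the callable of a `BranchPythonOperator`, which follows the task ID that the callable returns.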

**Use Case 2: Multi-Fab Data Aggregation**
- **Pipeline**: Fab1 STDF + Fab2 STDF + Fab3 STDF → merge → normalize → feature store update
- **Schedule**: Hourly (streaming-like batch processing)
- **Dependencies**: Wait for all 3 fabs, timeout after 2 hours
- **Backfill**: Reprocess 90 days when Fab2 fixes timestamp bug
- **Impact**: Unified feature store across global manufacturing network
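The "wait for all 3 fabs, timeout after 2 hours" dependency behaves like a sensor that polls until a condition holds. The following is a rough stand-alone sketch of that polling loop, not Airflow's sensor implementation; `wait_for_all`, the fake clock, and the arrival times are all made up for illustration.

```python
import time

def wait_for_all(sources, is_ready, timeout_s, poke_interval_s,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll until every source is ready or the timeout expires.

    Returns (True, []) on success, or (False, missing_sources) on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        missing = [s for s in sources if not is_ready(s)]
        if not missing:
            return True, []
        if clock() >= deadline:
            return False, missing
        sleep(poke_interval_s)

class FakeClock:
    """Simulated clock so the demo finishes instantly instead of waiting two hours."""
    def __init__(self):
        self.now = 0.0
    def time(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds

# Hypothetical upload times in seconds after the run starts; fab3 is late.
arrival_s = {"fab1": 600, "fab2": 5400, "fab3": 9000}
clock = FakeClock()

ok, missing = wait_for_all(
    list(arrival_s),
    is_ready=lambda fab: clock.time() >= arrival_s[fab],
    timeout_s=7200,          # 2-hour timeout from the use case
    poke_interval_s=300,     # check every 5 minutes (assumed)
    clock=clock.time,
    sleep=clock.sleep,
)
print(ok, missing)
```

Airflow sensors expose the same two knobs as `timeout` and `poke_interval`.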

**Use Case 3: Automated Model Validation Pipeline**
- **Pipeline**: Candidate model → offline metrics → simulation → champion/challenger A/B test → gradual rollout
- **Trigger**: New model registered in MLflow
- **Human-in-loop**: Approval sensor before production deployment
- **Rollback**: Auto-rollback if production accuracy drops >5%
- **Impact**: 10 model deployments/month (vs 2/month manual)

**Use Case 4: STDF Quality Monitoring Pipeline**
- **Pipeline**: Ingest STDF → schema validation → statistical checks → alert on anomalies
- **Schedule**: Every 15 minutes (near real-time)
- **Checks**: Missing parameters, out-of-range values, duplicate records
- **Action**: Quarantine bad batches, notify data engineering
- **Impact**: Catch data quality issues before model training (prevents garbage-in-garbage-out)
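The three checks listed above (missing parameters, out-of-range values, duplicate records) can be sketched in a few lines of plain Python. The schema, the valid ranges, and the record layout below are assumptions chosen for illustration, not a real STDF schema.

```python
REQUIRED_PARAMS = ("device_id", "vdd", "frequency_mhz")           # assumed schema
VALID_RANGES = {"vdd": (0.5, 1.5), "frequency_mhz": (100, 5000)}  # assumed limits

def check_batch(records):
    """Return a dict of issue lists; an empty dict means the batch passed."""
    issues = {"missing": [], "out_of_range": [], "duplicate": []}
    seen_ids = set()
    for i, rec in enumerate(records):
        absent = [p for p in REQUIRED_PARAMS if rec.get(p) is None]
        if absent:
            issues["missing"].append((i, absent))
        for param, (lo, hi) in VALID_RANGES.items():
            value = rec.get(param)
            if value is not None and not (lo <= value <= hi):
                issues["out_of_range"].append((i, param, value))
        dev = rec.get("device_id")
        if dev is not None:
            if dev in seen_ids:
                issues["duplicate"].append((i, dev))
            seen_ids.add(dev)
    return {kind: found for kind, found in issues.items() if found}

batch = [
    {"device_id": "D1", "vdd": 1.0, "frequency_mhz": 2400},
    {"device_id": "D2", "vdd": 1.9, "frequency_mhz": 2400},   # vdd out of range
    {"device_id": "D1", "vdd": 1.1, "frequency_mhz": 2300},   # duplicate device_id
    {"device_id": "D3", "vdd": None, "frequency_mhz": 2500},  # missing vdd
]
report = check_batch(batch)
action = "quarantine" if report else "accept"
print(action, report)
```

Returning a structured report rather than a boolean lets the alerting task say which check failed and on which record.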

## 🔄 Airflow Pipeline Architecture

```mermaid
graph TB
    A[Scheduler] --> B[DAG Definition]
    B --> C[Task Queue]
    
    C --> D[Worker 1]
    C --> E[Worker 2]
    C --> F[Worker N]
    
    D --> G[Extract STDF]
    E --> H[Train Model]
    F --> I[Deploy Model]
    
    G --> J[Data Quality Check]
    J --> K{Checks Pass?}
    
    K -->|Yes| L[Feature Engineering]
    K -->|No| M[Alert & Skip]
    
    L --> H
    H --> N[Model Validation]
    N --> O{Accuracy OK?}
    
    O -->|Yes| I
    O -->|No| P[Notify Team]
    
    I --> Q[Production Serving]
    
    B --> R[Metadata DB]
    R --> S[Web UI]
    R --> T[Logs]
    
    style A fill:#e1f5ff
    style Q fill:#e1ffe1
    style M fill:#ffe1e1
    style P fill:#ffe1e1
```

## 📊 Learning Path Context

**Prerequisites:**
- **091**: SQL Advanced - Data extraction queries
- **107**: Model Monitoring - Detecting when to retrain
- **108**: Feature Stores - Centralized feature management

**This Notebook (109):**
- Airflow DAG creation and task definition
- Task dependencies and branching logic
- Scheduling (cron expressions, triggers)
- Failure handling and retries
- Pipeline monitoring and SLAs

**Next Steps:**
- **131**: Cloud Deployment - Airflow on Kubernetes
- **132**: CI/CD for ML - Automated testing and deployment

---

Let's automate ML workflows end-to-end! 🔄

## 1. Setup and Airflow Concepts

**Note:** This notebook teaches Airflow concepts. Full deployment requires:
- `pip install apache-airflow`
- Metadata database (PostgreSQL recommended)
- Executor (LocalExecutor, CeleryExecutor, KubernetesExecutor)

We'll demonstrate pipeline design with simulated task execution.

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import time
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)

print("✅ Pipeline simulation environment ready!")
print("\n📝 Airflow Installation:")
print("   pip install 'apache-airflow==2.7.3' --constraint 'https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.10.txt'")
print("   airflow db init")
print("   airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email admin@example.com")
print("   airflow webserver -p 8080")
print("   airflow scheduler")

## 2. DAG Design: Yield Model Retraining Pipeline

**Purpose:** Design end-to-end pipeline from data ingestion to model deployment.

**Key Points:**
- **Task 1**: Extract STDF data from test servers
- **Task 2**: Data quality validation (schema, ranges, completeness)
- **Task 3**: Feature engineering (aggregations, transformations)
- **Task 4**: Model training (RandomForest on engineered features)
- **Task 5**: Model validation (compare to baseline)
- **Task 6**: Deploy if improved (otherwise skip)
- **Why this matters**: Dependencies ensure data quality before expensive training

In [None]:
# Simulated Airflow DAG structure (actual DAG would be in airflow/dags/ folder)

class PipelineTask:
    """Simulated Airflow task for demonstration."""
    def __init__(self, task_id, dependencies=None):
        self.task_id = task_id
        self.dependencies = dependencies or []
        self.status = 'pending'
        self.start_time = None
        self.end_time = None
        self.result = None
    
    def execute(self, context=None):
        """Simulate task execution."""
        self.status = 'running'
        self.start_time = datetime.now()
        print(f"[{self.start_time.strftime('%H:%M:%S')}] ▶️  {self.task_id} started")
        
        # Simulate work
        time.sleep(0.5)
        
        self.end_time = datetime.now()
        self.status = 'success'
        duration = (self.end_time - self.start_time).total_seconds()
        print(f"[{self.end_time.strftime('%H:%M:%S')}] ✅ {self.task_id} completed ({duration:.2f}s)")
        
        return self.result

# Define pipeline tasks
task_extract_stdf = PipelineTask('extract_stdf_data')
task_validate_data = PipelineTask('validate_data_quality', dependencies=[task_extract_stdf])
task_engineer_features = PipelineTask('engineer_features', dependencies=[task_validate_data])
task_train_model = PipelineTask('train_yield_model', dependencies=[task_engineer_features])
task_validate_model = PipelineTask('validate_model_accuracy', dependencies=[task_train_model])
task_deploy_model = PipelineTask('deploy_to_production', dependencies=[task_validate_model])

# Pipeline metadata
pipeline_config = {
    'dag_id': 'yield_model_retrain_daily',
    'schedule': '0 2 * * *',  # 2 AM daily (cron expression)
    'start_date': datetime(2025, 12, 1),
    'catchup': False,  # Don't backfill missed runs
    'max_active_runs': 1,  # One run at a time
    'sla_minutes': 240,  # 4 hour SLA
    'tasks': [
        task_extract_stdf,
        task_validate_data,
        task_engineer_features,
        task_train_model,
        task_validate_model,
        task_deploy_model
    ]
}

print("Pipeline DAG Design:")
print(f"  DAG ID: {pipeline_config['dag_id']}")
print(f"  Schedule: {pipeline_config['schedule']} (daily at 2 AM)")
print(f"  SLA: {pipeline_config['sla_minutes']} minutes")
print(f"\nTask Dependency Graph:")
print("  extract_stdf_data")
print("    ↓")
print("  validate_data_quality")
print("    ↓")
print("  engineer_features")
print("    ↓")
print("  train_yield_model")
print("    ↓")
print("  validate_model_accuracy")
print("    ↓")
print("  deploy_to_production")

## 3. DAG Execution Simulation

**Purpose:** Run the training pipeline DAG to demonstrate task execution flow.

**Key Points:**
- **Sequential Execution**: Tasks run in dependency order (extract → transform → train → evaluate)
- **State Management**: Track task status (queued → running → success/failed)
- **Idempotency**: Re-running DAG produces same results (critical for debugging)
- **Logging**: Capture stdout/stderr for each task for troubleshooting

**Why This Matters:** In production, Airflow scheduler executes DAGs automatically on schedule. Understanding execution flow is critical for debugging failures.

In [None]:
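# Simulate DAG execution on a simplified four-task training DAG
# (defined here so that this cell runs on its own)
ml_training_dag = {
    'dag_id': 'ml_training_pipeline',
    'schedule': '0 2 * * *',  # 2 AM daily (cron expression)
    'tasks': {
        'extract_data': {'description': 'Extract STDF data from test servers', 'depends_on': []},
        'transform_features': {'description': 'Validate data and engineer features', 'depends_on': ['extract_data']},
        'train_model': {'description': 'Train the yield prediction model', 'depends_on': ['transform_features']},
        'evaluate_model': {'description': 'Compare the new model to the baseline', 'depends_on': ['train_model']},
    },
}

def execute_dag(dag):
    """Simulate Airflow DAG execution with task dependencies."""
    print(f"🚀 Starting DAG: {dag['dag_id']}")
    print(f"Schedule: {dag['schedule']}\n")

    # Topologically sort tasks: a task is ready once all of its upstream tasks are ordered
    task_order = []
    remaining = dict(dag['tasks'])
    while remaining:
        ready = [name for name, spec in remaining.items()
                 if all(dep in task_order for dep in spec['depends_on'])]
        if not ready:
            raise ValueError("Cycle detected in task dependencies")
        task_order.extend(ready)
        for name in ready:
            del remaining[name]

    task_states = {}
    for task_name in task_order:
        task = dag['tasks'][task_name]
        print(f"▶ Executing task: {task_name}")
        print(f"  Description: {task['description']}")

        # Simulate task execution time
        time.sleep(0.5)

        # Simulate success (in real Airflow, could be success/failed/retry)
        task_states[task_name] = 'success'
        print(f"  ✅ Status: {task_states[task_name]}\n")

    print("🎉 DAG Execution Complete!")
    print(f"  Total Tasks: {len(task_order)}")
    print(f"  Successful: {sum(1 for s in task_states.values() if s == 'success')}")

    return task_states

# Execute the DAG
execution_states = execute_dag(ml_training_dag)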

# Visualize task execution timeline
execution_times = {
    'extract_data': 45,  # seconds
    'transform_features': 120,
    'train_model': 300,
    'evaluate_model': 30
}

plt.figure(figsize=(10, 5))
tasks = list(execution_times.keys())
times = list(execution_times.values())
colors = ['green' if execution_states[t] == 'success' else 'red' for t in tasks]

plt.barh(tasks, times, color=colors, edgecolor='black')
plt.xlabel('Execution Time (seconds)')
plt.title('DAG Task Execution Timeline')
plt.axvline(x=600, color='red', linestyle='--', label='SLA: 10 minutes')
plt.legend()
plt.tight_layout()
plt.show()

print(f"\nTotal Pipeline Time: {sum(times)} seconds ({sum(times)/60:.1f} minutes)")

## 4. Task Failure Handling & Retries

**Purpose:** Configure retry logic and failure notifications for robust pipelines.

**Key Points:**
- **Retries**: Automatically retry failed tasks (e.g., network timeouts) up to N times
- **Exponential Backoff**: Wait 2^retry_number minutes between retries (prevent overwhelming systems)
- **Alerts**: Send Slack/PagerDuty notifications on permanent failures
- **Circuit Breaker**: Stop downstream tasks if critical task fails (e.g., data extraction)

**Why This Matters:** Transient failures (network timeouts, busy databases, unavailable test servers) are routine in production data pipelines. Proper retry logic handles them without manual intervention.
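The circuit-breaker point can be illustrated with a small graph walk: when one task fails permanently, everything downstream of it is marked as blocked rather than run. This is a simplified sketch with invented task names; it mimics the effect of Airflow's default `all_success` trigger rule, under which such tasks end up in the `upstream_failed` state.

```python
from collections import deque

# task -> list of direct downstream tasks (illustrative names)
downstream = {
    "extract_data": ["validate_data"],
    "validate_data": ["engineer_features"],
    "engineer_features": ["train_model"],
    "train_model": ["evaluate_model"],
    "evaluate_model": [],
    "refresh_dashboard": [],   # independent task, unaffected by the failure
}

def mark_upstream_failed(graph, failed_task):
    """Return final task states after `failed_task` fails permanently."""
    states = {task: "success" for task in graph}   # assume all other tasks would succeed
    states[failed_task] = "failed"
    queue = deque(graph[failed_task])
    while queue:
        task = queue.popleft()
        if states[task] != "upstream_failed":
            states[task] = "upstream_failed"
            queue.extend(graph[task])
    return states

states = mark_upstream_failed(downstream, "validate_data")
for task, state in states.items():
    print(f"{task:20s} {state}")
```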

In [None]:
# Simulate task retry logic
def execute_task_with_retries(task_name, max_retries=3, failure_rate=0.3, verbose=True):
    """Simulate task execution with retry logic.

    Each attempt fails independently with probability `failure_rate`.
    Here `max_retries` counts total attempts; Airflow's `retries` counts
    the retries that follow the first attempt.
    """
    for attempt in range(1, max_retries + 1):
        # Simulate a random failure with probability `failure_rate`
        success = np.random.random() > failure_rate

        if verbose:
            print(f"  Attempt {attempt}/{max_retries}: ", end="")
        if success:
            if verbose:
                print("✅ SUCCESS")
            return 'success'
        if verbose:
            print("❌ FAILED (network timeout)")
        if attempt < max_retries:
            backoff_seconds = 2 ** attempt
            if verbose:
                print(f"    ⏳ Retrying in {backoff_seconds} seconds...")
                time.sleep(0.2)  # Simulate backoff (shortened for demo)

    if verbose:
        print(f"  🚨 Task {task_name} failed after {max_retries} attempts!")
    return 'failed'

# Example: Retry flaky extract_data task
print("Testing Retry Logic for 'extract_data' task:\n")
np.random.seed(42)  # For reproducibility
result = execute_task_with_retries('extract_data', max_retries=3, failure_rate=0.5)

# Estimate retry success rates (per-attempt output silenced to keep the log readable)
retry_scenarios = []
for failure_rate in [0.1, 0.3, 0.5, 0.7]:
    successes = sum(
        execute_task_with_retries('test_task', max_retries=3, failure_rate=failure_rate, verbose=False) == 'success'
        for _ in range(100)
    )
    retry_scenarios.append({'Failure Rate': f'{failure_rate*100:.0f}%', 'Success Rate (%)': successes})

retry_df = pd.DataFrame(retry_scenarios)
print(f"\n\nRetry Strategy Effectiveness (100 trials each):")
print(retry_df)

# Alert configuration (keys mirror Airflow arguments; `sla_miss_callback` is set on the DAG, the others on tasks)
alert_config = {
    'on_failure_callback': 'send_slack_alert',
    'sla_miss_callback': 'send_pagerduty_alert',
    'email_on_failure': True,
    'email': 'ml-ops-team@company.com'
}

print(f"\n\n📧 Alert Configuration:")
for key, value in alert_config.items():
    print(f"  {key}: {value}")

## 5. DAG Monitoring Dashboard

**Purpose:** Visualize pipeline health metrics for operational oversight.

**Key Points:**
- **DAG Run History**: Track success/failure rates over time
- **Task Duration Trends**: Identify tasks getting slower (data volume growth)
- **SLA Violations**: Alert when pipelines exceed time budgets
- **Resource Usage**: Monitor CPU/memory per task for optimization

**Why This Matters:** Data engineers need dashboards to proactively fix bottlenecks before stakeholders complain about delayed models.
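Two of the signals above, SLA violations and tasks getting slower, reduce to simple arithmetic on run durations. The sketch below is one possible formulation; the 7-run window and the 20% slowdown ratio are arbitrary assumptions, as are the sample durations.

```python
from statistics import mean

def duration_alerts(durations_s, sla_s, window=7, slowdown_ratio=1.2):
    """Flag SLA misses and a slowdown of the most recent runs versus the earlier baseline."""
    sla_misses = [i for i, d in enumerate(durations_s) if d > sla_s]
    slowing = False
    if len(durations_s) >= 2 * window:
        baseline = mean(durations_s[:-window])   # all runs before the recent window
        recent = mean(durations_s[-window:])     # the last `window` runs
        slowing = recent > slowdown_ratio * baseline
    return {"sla_misses": sla_misses, "slowing_down": slowing}

# Made-up history: two stable weeks, then a week of steadily growing run times.
durations = [400] * 14 + [520, 540, 560, 580, 600, 620, 640]
alerts = duration_alerts(durations, sla_s=600)
print(alerts)
```

Comparing a recent window against a baseline catches gradual drift before any single run breaches the SLA.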

In [None]:
# Simulate DAG run history (30 days)
np.random.seed(100)
dates = pd.date_range(end=pd.Timestamp.now(), periods=30, freq='D')
dag_runs = []

for date in dates:
    # Simulate success/failure (90% success rate)
    status = 'success' if np.random.random() > 0.1 else 'failed'
    duration = np.random.normal(480, 60) if status == 'success' else np.random.normal(200, 50)  # seconds

    # A failed run has at least one failed task; succeeded + failed always equals 4
    tasks_failed = 0 if status == 'success' else np.random.randint(1, 5)

    dag_runs.append({
        'date': date,
        'status': status,
        'duration_seconds': max(duration, 0),
        'tasks_succeeded': 4 - tasks_failed,
        'tasks_failed': tasks_failed
    })

dag_history_df = pd.DataFrame(dag_runs)

# Visualization Dashboard
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Airflow DAG Monitoring Dashboard', fontsize=16, fontweight='bold')

# 1. DAG Run Success Rate Over Time
success_rate = dag_history_df.groupby(dag_history_df['date'].dt.date)['status'].apply(
    lambda x: (x == 'success').sum() / len(x) * 100
)
axes[0, 0].plot(success_rate.index, success_rate.values, marker='o', color='green', linewidth=2)
axes[0, 0].axhline(y=95, color='red', linestyle='--', label='Target: 95%')
axes[0, 0].set_title('DAG Success Rate (30 Days)')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Success Rate (%)')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# 2. Task Duration Trend
axes[0, 1].plot(dag_history_df['date'], dag_history_df['duration_seconds'], color='blue', alpha=0.6)
axes[0, 1].axhline(y=600, color='red', linestyle='--', label='SLA: 10 minutes')
axes[0, 1].set_title('Pipeline Duration Trend')
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Duration (seconds)')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

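# 3. DAG Run Success vs Failure Count
status_counts = dag_history_df['status'].value_counts()
bar_colors = ['green' if s == 'success' else 'red' for s in status_counts.index]  # color by status, not by position
axes[1, 0].bar(status_counts.index, status_counts.values, color=bar_colors, edgecolor='black')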
axes[1, 0].set_title('DAG Run Outcomes (30 Days)')
axes[1, 0].set_ylabel('Count')

# 4. SLA Violations
sla_violations = dag_history_df[dag_history_df['duration_seconds'] > 600]
axes[1, 1].text(0.1, 0.6, f"""
SLA MONITORING SUMMARY
======================
Total Runs: {len(dag_history_df)}
Successful: {(dag_history_df['status'] == 'success').sum()}
Failed: {(dag_history_df['status'] == 'failed').sum()}

SLA Violations: {len(sla_violations)} runs > 10 min
Average Duration: {dag_history_df['duration_seconds'].mean():.0f}s

🟢 Uptime: {(dag_history_df['status'] == 'success').mean() * 100:.1f}%
""", fontsize=11, family='monospace', verticalalignment='center',
                bbox=dict(boxstyle='round', facecolor='lightblue'))
axes[1, 1].axis('off')

plt.tight_layout()
plt.show()

print(f"\n📊 DAG Health Metrics (Last 30 Days):")
print(f"  Success Rate: {(dag_history_df['status'] == 'success').mean() * 100:.1f}%")
print(f"  Average Duration: {dag_history_df['duration_seconds'].mean():.0f} seconds")
print(f"  SLA Violations: {len(sla_violations)} / {len(dag_history_df)} runs")

## 6. Backfilling Historical Data

**Purpose:** Re-run DAG for past dates to fill data gaps or fix pipeline bugs.

**Key Points:**
- **Use Case**: Bug in feature engineering found → need to regenerate training data for last 90 days
- **Backfill Command**: `airflow dags backfill --start-date 2024-01-01 --end-date 2024-03-31 yield_model_retrain_daily` (the DAG ID is a required argument)
- **Idempotency**: Tasks check if output already exists → skip recomputation (save costs)
- **Parallelism**: Run multiple backfill instances simultaneously (date partitioning)

**Why This Matters:** Data bugs are common. Backfilling prevents manual data fixes and ensures reproducibility.
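Idempotency is what makes a backfill safe to repeat. A common pattern is to write one output partition per date and skip dates whose partition already exists. The sketch below shows that pattern with the standard library only; the file naming, the placeholder payload, and the helper names are assumptions.

```python
import json
import tempfile
from datetime import date, timedelta
from pathlib import Path

def build_features(run_date, out_dir):
    """Write features for one date; return 'skipped' if the partition already exists."""
    target = Path(out_dir) / f"features_{run_date.isoformat()}.json"
    if target.exists():
        return "skipped"
    # Placeholder computation; a real task would process the STDF data for run_date.
    payload = {"date": run_date.isoformat(), "n_rows": 0}
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload))
    tmp.replace(target)   # rename into place so a crash does not leave a half-written partition
    return "written"

def backfill(start, end, out_dir):
    """Run build_features for every date in [start, end], inclusive."""
    results = {}
    day = start
    while day <= end:
        results[day.isoformat()] = build_features(day, out_dir)
        day += timedelta(days=1)
    return results

with tempfile.TemporaryDirectory() as out_dir:
    first = backfill(date(2024, 1, 1), date(2024, 1, 3), out_dir)
    second = backfill(date(2024, 1, 1), date(2024, 1, 5), out_dir)   # overlaps the first range

print(first)
print(second)
```

The second call only does work for the two new dates, so re-running an overlapping range costs almost nothing.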

In [None]:
# Simulate backfill scenario
backfill_dates = pd.date_range(start='2024-01-01', end='2024-01-10', freq='D')

print("🔄 Backfilling DAG for date range: 2024-01-01 to 2024-01-10")
print(f"Total runs to execute: {len(backfill_dates)}\n")

backfill_results = []
for date in backfill_dates:
    # Simulate backfill execution (check if data exists, skip if yes)
    data_exists = np.random.random() > 0.7  # 30% already exist
    
    if data_exists:
        print(f"  {date.date()}: ⏩ SKIPPED (data already exists)")
        status = 'skipped'
        duration = 0
    else:
        print(f"  {date.date()}: ▶ RUNNING backfill...")
        time.sleep(0.1)
        status = 'success'
        duration = np.random.normal(480, 60)
        print(f"  {date.date()}: ✅ COMPLETED ({duration:.0f}s)")
    
    backfill_results.append({
        'date': date,
        'status': status,
        'duration_seconds': duration
    })

backfill_df = pd.DataFrame(backfill_results)

print(f"\n\n📈 Backfill Summary:")
print(f"  Total Dates: {len(backfill_dates)}")
print(f"  Skipped (already existed): {(backfill_df['status'] == 'skipped').sum()}")
print(f"  Executed: {(backfill_df['status'] == 'success').sum()}")
print(f"  Total Time: {backfill_df['duration_seconds'].sum():.0f} seconds ({backfill_df['duration_seconds'].sum()/60:.1f} minutes)")

# Visualize backfill progress
from matplotlib.patches import Patch

plt.figure(figsize=(10, 5))
colors = ['gray' if s == 'skipped' else 'green' for s in backfill_df['status']]
plt.bar(range(len(backfill_df)), backfill_df['duration_seconds'], color=colors, edgecolor='black')
plt.xticks(range(len(backfill_df)), [d.strftime('%m-%d') for d in backfill_df['date']], rotation=45)
plt.xlabel('Date')
plt.ylabel('Duration (seconds)')
plt.title('Backfill Execution Timeline')
# Explicit legend handles: a single bar container cannot supply one legend entry per color
plt.legend(handles=[Patch(facecolor='gray', edgecolor='black', label='Skipped'),
                    Patch(facecolor='green', edgecolor='black', label='Executed')], loc='upper right')
plt.tight_layout()
plt.show()

## 🚀 Real-World Project Templates

Build production ML pipelines using these architectures:

### 1️⃣ **Post-Silicon Yield Prediction Pipeline**
- **Objective**: Daily retraining pipeline for wafer yield forecasting models  
- **DAG Tasks**: Extract STDF files → Parse parametric data → Feature engineering → Train RF model → Validate → Deploy  
- **Success Metric**: < 30 min end-to-end, 99% uptime  
- **Features**: Parallel wafer processing, incremental training, automated A/B testing  
- **Tech Stack**: Airflow, Spark (STDF parsing), MLflow (model registry), Kubernetes (training jobs)

### 2️⃣ **E-Commerce Recommendation Retraining**
- **Objective**: Weekly collaborative filtering model update with fresh user interactions  
- **DAG Tasks**: S3 clickstream → Feature aggregation → Matrix factorization → Evaluate top-K → Deploy to Redis  
- **Success Metric**: Maintain CTR > 3.5% with weekly updates  
- **Features**: Cold-start user handling, popularity bias correction, diversity constraints  
- **Tech Stack**: Airflow, Spark (ALS), Feast (feature store), SageMaker endpoints

### 3️⃣ **Fraud Detection Real-Time Pipeline**
- **Objective**: Hourly feature refresh + model retraining on new fraud patterns  
- **DAG Tasks**: Kafka → Feature extraction → Streaming aggregations → GBDT training → Deploy via Docker  
- **Success Metric**: Detect new fraud tactics within 24 hours  
- **Features**: Streaming features (transaction velocity), adversarial validation, explainability logging  
- **Tech Stack**: Airflow, Flink (streaming), XGBoost, SHAP, PostgreSQL

### 4️⃣ **Autonomous Vehicle Model Pipeline**
- **Objective**: Nightly perception model retraining from fleet data  
- **DAG Tasks**: S3 sensor logs → Video labeling → Image augmentation → CNN training → TensorRT optimization → OTA deploy  
- **Success Metric**: mAP > 0.85 for object detection, < 50ms inference  
- **Features**: Active learning (select hard examples), multi-GPU training, quantization  
- **Tech Stack**: Airflow, PyTorch, TensorRT, CVAT (labeling), Weights & Biases

### 5️⃣ **Healthcare Readmission Risk Pipeline**
- **Objective**: Monthly HIPAA-compliant model updates for patient risk scoring  
- **DAG Tasks**: EMR extraction → De-identification → Feature engineering → Logistic regression → Explainability → Audit log  
- **Success Metric**: AUROC > 0.80, full audit trail for compliance  
- **Features**: Temporal cross-validation, fairness metrics (demographic parity), LIME explanations  
- **Tech Stack**: Airflow (on-prem), Snowflake (encrypted), scikit-learn, Aequitas (fairness)

### 6️⃣ **Financial Trading Signal Pipeline**
- **Objective**: Every 5 minutes: retrain short-term momentum models  
- **DAG Tasks**: Market data API → Technical indicators → Ensemble model → Backtesting → Deploy to low-latency C++ engine  
- **Success Metric**: Sharpe ratio > 1.5, < 1ms inference latency  
- **Features**: Walk-forward validation, transaction cost simulation, market regime detection  
- **Tech Stack**: Airflow, kdb+/q, LightGBM, custom C++ inference

### 7️⃣ **Supply Chain Demand Forecasting**
- **Objective**: Daily SKU-level demand forecasts for 10K products  
- **DAG Tasks**: POS sales → Promotional calendar → Weather API → Prophet forecasting → Inventory optimizer → ERP upload  
- **Success Metric**: < 10% MAPE for 90% of SKUs  
- **Features**: Hierarchical forecasting, promotional lift modeling, safety stock calculation  
- **Tech Stack**: Airflow, Spark, Prophet/Chronos, Snowflake, SAP integration

### 8️⃣ **Smart Grid Load Forecasting Pipeline**
- **Objective**: Hourly regional electricity demand forecasts  
- **DAG Tasks**: Smart meter aggregation → Weather forecast → Holiday calendar → LSTM training → Grid balancing API  
- **Success Metric**: < 3% MAPE for 24-hour ahead, < 5 min pipeline time  
- **Features**: Multi-horizon forecasting, weather scenario ensembles, renewable energy integration  
- **Tech Stack**: Airflow, TensorFlow, InfluxDB (timeseries), Grafana alerts

## 🎯 Key Takeaways

### What are ML Pipelines?
Automated workflows that orchestrate data ingestion, feature engineering, model training, evaluation, and deployment with dependency management and scheduling.

### Why Airflow for ML?
- **Python-Native**: Define workflows as code (DAGs) using Python
- **Rich Ecosystem**: 200+ operators for AWS, GCP, Spark, Kubernetes, databases
- **Dependency Management**: Enforces task execution order automatically
- **Scalability**: Distribute tasks across worker nodes (Celery/Kubernetes executors)
- **Monitoring**: Built-in UI for tracking DAG runs, logs, and failures

### Core Airflow Concepts

| **Concept** | **Definition** | **Example** |
|------------|---------------|------------|
| **DAG** | Directed Acyclic Graph (workflow definition) | `ml_training_dag = DAG(dag_id="train_model", schedule="@daily")` |
| **Task** | Single unit of work | `extract_data = PythonOperator(task_id="extract", python_callable=extract_fn)` |
| **Operator** | Task template | `BashOperator`, `PythonOperator`, `SparkSubmitOperator` |
| **Executor** | Task runner | `LocalExecutor` (dev), `CeleryExecutor` (prod), `KubernetesExecutor` (cloud) |
| **Schedule** | Cron or preset | `@daily`, `@hourly`, `0 2 * * *` (2 AM daily) |
| **Sensor** | Waits for condition | `S3KeySensor` (wait for file), `ExternalTaskSensor` |
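One scheduling detail that often surprises newcomers: in Airflow 2.x, a scheduled run is triggered at the end of the data interval it covers, so the run labelled with a given logical date starts one schedule period later. The sketch below is a deliberately simplified model of that behaviour for a daily 2 AM schedule with catchup enabled; it ignores time zones, and `daily_runs` is a hypothetical helper, not an Airflow API.

```python
from datetime import datetime, timedelta

def daily_runs(start_date, now, hour=2):
    """List (logical_date, triggered_at) pairs for a daily schedule at `hour`:00.

    Simplified model: each run covers [logical_date, logical_date + 1 day)
    and is triggered once that interval has ended.
    """
    first = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    if first < start_date:
        first += timedelta(days=1)   # first scheduled time on or after start_date
    runs = []
    logical = first
    while logical + timedelta(days=1) <= now:
        runs.append((logical, logical + timedelta(days=1)))
        logical += timedelta(days=1)
    return runs

runs = daily_runs(datetime(2025, 12, 1), now=datetime(2025, 12, 4, 12, 0))
for logical, triggered in runs:
    print(f"logical_date={logical:%Y-%m-%d %H:%M}  triggered_at={triggered:%Y-%m-%d %H:%M}")
```

This is why a task templated with the run's date processes the previous day's data when it executes at 2 AM.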

### DAG Design Best Practices

**✅ Good DAG Design:**
```python
extract_data >> transform_features >> train_model >> evaluate_model >> deploy_model
# Clear linear dependency, easy to debug
```

**❌ Bad DAG Design:**
```python
task1 >> [task2, task3, task4] >> task5
task2 >> task6
# Generic task names and dependencies spread over several statements make failures hard to troubleshoot
```

**Principles:**
- **Idempotency**: Re-running same DAG produces same result (no side effects)
- **Atomicity**: Each task should be independently retry-able
- **Modularity**: Break complex tasks into smaller testable units
- **Observability**: Log inputs/outputs, metrics, and errors explicitly

### Retry Configuration

```python
default_args = {
    'retries': 3,  # Retry up to 3 times
    'retry_delay': timedelta(minutes=5),  # Wait 5 min between retries
    'retry_exponential_backoff': True,  # 5min → 10min → 20min
    'execution_timeout': timedelta(hours=2),  # Kill if > 2 hours
    'on_failure_callback': send_slack_alert,  # Custom alert function
}
```
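The doubling pattern behind exponential backoff can be computed directly. This is an idealised sketch: `backoff_schedule` is a hypothetical helper, and Airflow's own calculation also adds jitter, so real delays will not match these values exactly. The optional cap corresponds to the `max_retry_delay` task argument.

```python
from datetime import timedelta

def backoff_schedule(retries, retry_delay, max_retry_delay=None):
    """Idealised delay before each retry: retry_delay * 2**(n-1), optionally capped."""
    delays = []
    for n in range(1, retries + 1):
        delay = retry_delay * (2 ** (n - 1))
        if max_retry_delay is not None:
            delay = min(delay, max_retry_delay)
        delays.append(delay)
    return delays

delays = backoff_schedule(retries=3, retry_delay=timedelta(minutes=5))
print([int(d.total_seconds() // 60) for d in delays])   # [5, 10, 20]

capped = backoff_schedule(retries=3, retry_delay=timedelta(minutes=5),
                          max_retry_delay=timedelta(minutes=15))
print([int(d.total_seconds() // 60) for d in capped])   # [5, 10, 15]
```

Capping the delay keeps a long retry chain from pushing a task past its SLA.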

### Airflow vs Alternatives

| **Tool** | **Best For** | **Strengths** | **Weaknesses** |
|---------|-------------|--------------|----------------|
| **Airflow** | Python-centric, batch workflows | Mature, huge community, flexible | Complex setup, not for streaming |
| **Prefect** | Modern Python orchestration | Better UX, dynamic DAGs | Smaller ecosystem |
| **Kubeflow Pipelines** | Kubernetes-native ML | Native K8s integration, versioning | Steep learning curve |
| **Luigi** | Simple pipelines | Lightweight, no database | Limited features, less active |
| **Dagster** | Data pipelines with types | Type safety, great testing | Newer, fewer integrations |

### Common Pitfalls

- ❌ **Heavy Work at DAG Parse Time**: Don't query databases in top-level DAG code (it runs on every parse) → do the work inside tasks, pass small values via XCom, and keep large data in external storage
- ❌ **No SLA Monitoring**: Pipelines silently slow down → set `sla` parameter
- ❌ **Hardcoded Dates**: Use Airflow macros: `{{ ds }}` (the run's logical date, formerly called execution date), `{{ prev_ds }}`
- ❌ **Too Many Small Tasks**: 100 tasks/DAG → scheduler overhead. Aim for 10-30 tasks.
- ❌ **Ignoring Backpressure**: Don't queue 1000 DAG runs → use `max_active_runs=3`

### Post-Silicon Pipeline Patterns

**Wafer Test Data Pipeline:**
1. **Extract**: Download STDF files from test equipment servers (daily)
2. **Parse**: Convert binary STDF → Parquet (Spark job)
3. **Aggregate**: Compute wafer-level statistics (yield%, test time)
4. **Feature Engineering**: Spatial features (die neighbors), temporal trends
5. **Train**: Random Forest for yield prediction
6. **Validate**: Backtest on last 30 days
7. **Deploy**: Update model in production API

**Monitoring Triggers:**
- Retrain if PSI > 0.25 on Vdd or Frequency features
- Alert if wafer yield < 85% (business threshold)
- Backfill last 90 days if feature engineering bug found
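The PSI trigger above refers to the Population Stability Index, which compares a feature's binned distribution at training time with its current distribution: $\mathrm{PSI} = \sum_i (a_i - e_i)\,\ln(a_i / e_i)$, where $e_i$ and $a_i$ are the expected and actual proportions in bin $i$. Below is a minimal standard-library sketch; the bin proportions are invented for illustration, and only the 0.25 threshold comes from the text.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions (lists of proportions)."""
    if len(expected) != len(actual):
        raise ValueError("Both distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)   # guard against log(0) and division by zero for empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical Vdd histograms over the same five bins (proportions sum to 1).
baseline_vdd = [0.10, 0.20, 0.40, 0.20, 0.10]
current_vdd = [0.02, 0.08, 0.30, 0.30, 0.30]

score = psi(baseline_vdd, current_vdd)
should_retrain = score > 0.25   # threshold from the monitoring trigger
print(f"PSI={score:.3f} retrain={should_retrain}")
```

A task computing this score can feed a branching step that decides whether the retraining path runs.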

### Production Deployment

**Typical Airflow Stack:**
- **Scheduler**: 1-3 instances (HA setup with heartbeat)
- **Web Server**: 2+ instances (behind load balancer)
- **Workers**: 10-100 Celery workers (depending on task parallelism)
- **Database**: PostgreSQL (metadata storage, not for large data!)
- **Message Broker**: Redis or RabbitMQ (for Celery executor)
- **Storage**: S3/GCS for logs, model artifacts, intermediate data

**Cost Optimization:**
- Use `KubernetesPodOperator` for auto-scaling training jobs
- Set `pool` limits to prevent resource exhaustion
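- Clean up old metadata (DAG runs, task instances): `airflow db clean --clean-before-timestamp '2024-01-01'` (note that `airflow dags delete` removes all metadata for a DAG, not just old runs)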

### Performance Benchmarks (Typical)

- **DAG Parse Time**: < 5 seconds (for 30 tasks DAG)
- **Scheduler Latency**: < 10 seconds (from schedule time to task start)
- **Task Overhead**: ~1-2 seconds per task (setup + teardown)
- **Max Throughput**: 10K+ tasks/hour (with KubernetesExecutor)

### Next Steps
- **Advanced**: Dynamic DAG generation, TaskGroups (the replacement for the deprecated SubDAGs)
- **Integration**: Airflow + MLflow + Feast (complete MLOps stack)
- **Scaling**: Multi-cluster Airflow, cross-region DAGs

---

**Remember**: *Orchestration is infrastructure. Invest when pipelines become unmanageable manually!* 🛠️