# Orchestration — Overview

## Purpose
Data pipeline orchestration is the automated coordination, scheduling, and management of complex data workflows. It ensures that data tasks execute in the correct order, handle failures gracefully, and provide visibility into pipeline health.

## Key Questions
- What is orchestration and why is it essential for data engineering?
- How do popular orchestration tools (Airflow, Dagster, Prefect) compare?
- What are DAGs and how do task dependencies work?
- How do we schedule, monitor, and maintain data pipelines?
- What are best practices for production-grade orchestration?

---
## What is Orchestration and Why It Matters

### Definition
**Orchestration** is the automated arrangement, coordination, and management of complex data workflows. It acts as the "conductor" that ensures all components of a data pipeline work together harmoniously.

### Why Orchestration Matters

| Challenge | Without Orchestration | With Orchestration |
|-----------|----------------------|--------------------|
| **Dependency Management** | Manual tracking of task order | Automatic dependency resolution |
| **Failure Handling** | Silent failures, data corruption | Retries, alerts, and rollback |
| **Scheduling** | Cron jobs scattered across systems | Centralized scheduling with UI |
| **Visibility** | No insight into pipeline status | Real-time monitoring dashboards |
| **Scalability** | Hard to manage growing pipelines | Designed for complex workflows |

### Core Responsibilities of an Orchestrator

1. **Task Scheduling** — Execute tasks at specified times or intervals
2. **Dependency Management** — Ensure tasks run in the correct order
3. **Resource Allocation** — Manage compute resources efficiently
4. **Error Handling** — Retry failed tasks, send alerts, handle exceptions
5. **Monitoring & Logging** — Provide visibility into pipeline execution
6. **Backfilling** — Re-run historical data processing when logic changes
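
Backfilling is easiest to picture as re-running one task per historical date partition. The following is a minimal sketch, not any particular orchestrator's API: `daily_partitions`, `backfill`, and `process_partition` are hypothetical names, and the task body is assumed to be safe to re-run for a given day.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List


def daily_partitions(start: date, end: date) -> Iterator[date]:
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(start: date, end: date, run: Callable[[date], None]) -> List[date]:
    """Re-run `run` once per daily partition and return the dates processed."""
    processed = []
    for day in daily_partitions(start, end):
        run(day)
        processed.append(day)
    return processed


def process_partition(day: date) -> None:
    # Hypothetical task body; a real task would overwrite that day's output.
    print(f"Reprocessing partition {day.isoformat()}")


done = backfill(date(2024, 1, 1), date(2024, 1, 3), process_partition)
print(f"Backfilled {len(done)} partitions")
```

Real orchestrators usually add concurrency limits and per-partition status tracking on top of this loop.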

---
## DAGs and Task Dependencies

### What is a DAG?

A **Directed Acyclic Graph (DAG)** is the fundamental structure for defining workflows:

- **Directed** — Tasks have a clear direction (upstream → downstream)
- **Acyclic** — No circular dependencies (prevents infinite loops)
- **Graph** — Tasks (nodes) connected by dependencies (edges)

```
    [Extract A]    [Extract B]
         \            /
          \          /
           [Transform]
               |
           [Validate]
               |
            [Load]
```
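
To make the "acyclic" property concrete, the diagram above can be written as a dependency mapping and ordered with a topological sort. This sketch assumes Python 3.9+ for the standard-library `graphlib` module; the task names and the `has_cycle` helper are illustrative, not part of any orchestrator.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
pipeline = {
    "extract_a": set(),
    "extract_b": set(),
    "transform": {"extract_a", "extract_b"},
    "validate": {"transform"},
    "load": {"validate"},
}

# A topological order lists every task after all of its upstream tasks.
order = list(TopologicalSorter(pipeline).static_order())
print("Execution order:", order)


def has_cycle(graph: dict) -> bool:
    """Return True if the dependency graph contains a cycle."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


# Making extract_a depend on load closes a loop, so no valid order exists.
cyclic = {**pipeline, "extract_a": {"load"}}
print("Valid pipeline has cycle:", has_cycle(pipeline))
print("Broken pipeline has cycle:", has_cycle(cyclic))
```

The two extract tasks have no ordering constraint between them, which is why an orchestrator is free to run them in parallel.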

### Dependency Types

| Type | Description | Example |
|------|-------------|----------|
| **Sequential** | Task B waits for Task A | Extract → Transform → Load |
| **Parallel** | Tasks run concurrently | Extract from multiple sources |
| **Fan-out** | One task triggers many | Split data for parallel processing |
| **Fan-in** | Many tasks feed into one | Aggregate results from parallel tasks |
| **Conditional** | Task runs based on condition | Run cleanup only if transform fails |
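
The fan-out and fan-in rows can be illustrated with a thread pool: one step splits the work, independent chunk tasks run concurrently, and a final step combines their results. This is a rough sketch using only the standard library; the function names and the sum-of-squares workload are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(records: List[int], parts: int) -> List[List[int]]:
    """Fan-out: split the input into roughly equal chunks."""
    return [records[i::parts] for i in range(parts)]


def process_chunk(chunk: List[int]) -> int:
    """Work done independently on each chunk (here: a sum of squares)."""
    return sum(x * x for x in chunk)


def aggregate(partials: List[int]) -> int:
    """Fan-in: combine the partial results into one value."""
    return sum(partials)


records = list(range(1, 11))
chunks = split(records, parts=3)

# The chunk tasks do not depend on each other, so they can run in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(process_chunk, chunks))

total = aggregate(partials)
print(f"{len(chunks)} chunks -> total {total}")
```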

```python
# Example: Simple DAG structure in Python (conceptual)
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Task:
    """Represents a single task in a DAG."""
    name: str
    execute: Callable
    # default_factory gives each Task its own list instead of a None default
    dependencies: List[str] = field(default_factory=list)

# Define tasks
def extract_data():
    print("Extracting data from source...")
    return {"records": 1000}

def transform_data():
    print("Transforming data...")
    return {"transformed": True}

def load_data():
    print("Loading data to warehouse...")
    return {"loaded": True}

# Create DAG structure
dag = [
    Task("extract", extract_data),
    Task("transform", transform_data, dependencies=["extract"]),
    Task("load", load_data, dependencies=["transform"]),
]

print("DAG Tasks:")
for task in dag:
    deps = ", ".join(task.dependencies) or "None"
    print(f"  {task.name} -> depends on: {deps}")
```

---
## Apache Airflow, Dagster, Prefect Comparison

### Overview

| Feature | Apache Airflow | Dagster | Prefect |
|---------|---------------|---------|----------|
| **Released** | 2015 (Airbnb) | 2019 | 2018 |
| **Paradigm** | Task-centric | Asset-centric | Task-centric |
| **Configuration** | Python DSL | Python DSL | Python DSL |
| **UI** | Web-based | Web-based | Web-based (self-hosted or Cloud) |
| **Scaling** | Celery, Kubernetes | Kubernetes, Dagster+ | Kubernetes, Prefect Cloud |
| **Learning Curve** | Moderate | Steeper | Gentle |

### Apache Airflow

**Strengths:**
- Industry standard with massive community
- Extensive operator ecosystem (AWS, GCP, databases)
- Battle-tested at scale (for example, at Airbnb and Lyft)
- Strong scheduling capabilities

**Weaknesses:**
- Complex local development setup
- DAG files are re-parsed periodically on the scheduler side, so heavy top-level code can be slow
- Testing workflows is challenging

### Dagster

**Strengths:**
- Asset-centric (data lineage built-in)
- Excellent local development experience
- Strong typing and data contracts
- First-class testing support

**Weaknesses:**
- Smaller community than Airflow
- Different mental model (assets vs tasks)
- Fewer third-party integrations

### Prefect

**Strengths:**
- Pythonic API (feels natural)
- Dynamic workflows at runtime
- Easy migration from scripts
- Excellent hybrid execution model

**Weaknesses:**
- Full feature set requires Prefect Cloud, which adds cost
- Less mature than Airflow
- Breaking API changes between major versions (for example, 1.x to 2.x)

```python
# Airflow example (conceptual; requires an Airflow 2.x installation)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # The return value is pushed to XCom automatically
    return "data extracted"

def transform(ti):
    # Pull the upstream task's return value from XCom
    data = ti.xcom_pull(task_ids='extract_task')
    return f"transformed: {data}"

def load(ti):
    data = ti.xcom_pull(task_ids='transform_task')
    print(f"Loading: {data}")

with DAG(
    'etl_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule='@daily',  # Airflow 2.4+; earlier releases use schedule_interval
    catchup=False
) as dag:

    extract_task = PythonOperator(
        task_id='extract_task',
        python_callable=extract
    )

    transform_task = PythonOperator(
        task_id='transform_task',
        python_callable=transform
    )

    load_task = PythonOperator(
        task_id='load_task',
        python_callable=load
    )

    # Airflow DAG: uses operators and >> syntax for dependencies
    extract_task >> transform_task >> load_task
```

```python
# Dagster example (conceptual; requires a Dagster installation)
from dagster import asset, Definitions

@asset
def raw_data():
    """Extract raw data from source."""
    return [{"id": 1, "value": 100}, {"id": 2, "value": 200}]

@asset
def cleaned_data(raw_data):
    """Transform and clean the raw data."""
    return [{"id": r["id"], "value": r["value"] * 2} for r in raw_data]

@asset
def aggregated_data(cleaned_data):
    """Aggregate cleaned data."""
    total = sum(r["value"] for r in cleaned_data)
    return {"total_value": total, "count": len(cleaned_data)}

# Dagster: asset-centric approach; dependencies are inferred from the
# parameter names, which match the upstream asset names
defs = Definitions(assets=[raw_data, cleaned_data, aggregated_data])
```

```python
# Prefect example (conceptual; requires a Prefect 2+ installation)
from prefect import flow, task

# Prefect: Pythonic decorators with built-in retry logic
@task(retries=3, retry_delay_seconds=60)
def extract():
    return {"records": [1, 2, 3, 4, 5]}

@task
def transform(data):
    return {"records": [r * 2 for r in data["records"]]}

@task
def load(data):
    print(f"Loading {len(data['records'])} records")
    return True

@flow(name="ETL Pipeline")
def etl_pipeline():
    # Passing one task's result to the next defines the dependency
    raw = extract()
    transformed = transform(raw)
    load(transformed)

if __name__ == "__main__":
    etl_pipeline()
```

---
## Scheduling and Monitoring

### Scheduling Strategies

| Strategy | Use Case | Example |
|----------|----------|----------|
| **Cron-based** | Regular intervals | `0 2 * * *` (daily at 2 AM) |
| **Event-driven** | React to data arrival | S3 file upload triggers pipeline |
| **Sensor-based** | Wait for condition | Check if upstream table is updated |
| **Manual** | Ad-hoc runs | Backfill historical data |
| **Data-aware** | Based on data freshness | Run when source data changes |
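
The sensor-based row boils down to polling a condition until it holds or a timeout expires. Below is a minimal sketch of that loop, under the assumption that the condition is cheap to check; `wait_for`, `poke_interval`, and `upstream_ready` are illustrative names (loosely modeled on how orchestrator sensors are usually configured), and the readiness check is a stand-in for a real query against an upstream table.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], poke_interval: float, timeout: float) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical stand-in for "is the upstream table updated?":
# it reports ready on the third check.
checks = {"count": 0}


def upstream_ready() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


ready = wait_for(upstream_ready, poke_interval=0.01, timeout=1.0)
print(f"Upstream ready: {ready} after {checks['count']} checks")
```

Returning `False` on timeout lets the caller decide whether a missed condition should fail the pipeline or simply skip the run.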

### Common Cron Expressions

```
┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of month (1-31)
│ │ │ ┌───────────── month (1-12)
│ │ │ │ ┌───────────── day of week (0-6, Sunday = 0)
│ │ │ │ │
* * * * *

@hourly   = 0 * * * *      (every hour)
@daily    = 0 0 * * *      (midnight daily)
@weekly   = 0 0 * * 0      (midnight Sunday)
@monthly  = 0 0 1 * *      (midnight first of month)
```
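
To show how the five fields are evaluated, here is a deliberately small matcher that checks whether a given minute is selected by an expression. It is a sketch under stated limitations: it supports only `*` and comma-separated integers (no ranges, steps, or `@daily`-style macros), it requires every field to match (real cron treats day-of-month and day-of-week as alternatives when both are restricted), and `field_matches` and `cron_matches` are hypothetical helper names.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Match one cron field against a value. Supports '*' and 'a,b,c'."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if `moment` falls on a minute selected by the expression."""
    minute, hour, day, month, weekday = expression.split()
    # Cron counts Sunday as 0; datetime.weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        and field_matches(weekday, cron_weekday)
    )


# 2024-01-01 was a Monday, so 2024-01-07 was a Sunday.
print(cron_matches("0 2 * * *", datetime(2024, 1, 1, 2, 0)))
print(cron_matches("0 0 * * 0", datetime(2024, 1, 7, 0, 0)))
print(cron_matches("0 0 * * 0", datetime(2024, 1, 8, 0, 0)))
```

In production, a maintained cron-parsing library or the orchestrator's own scheduler is a safer choice than hand-rolled matching.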

### Monitoring Best Practices

1. **SLAs (Service Level Agreements)**
   - Define expected completion times
   - Alert when SLAs are breached

2. **Key Metrics to Track**
   - Task success/failure rates
   - Pipeline duration trends
   - Queue depth and resource utilization
   - Data freshness (time since last update)

3. **Alerting Strategy**
   - Critical failures → PagerDuty/immediate alert
   - Warnings → Slack notification
   - Info → Dashboard/logs only

4. **Observability Stack**
   - Logs: Structured logging with context
   - Metrics: Prometheus/Grafana dashboards
   - Traces: OpenTelemetry for distributed tracing
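
The SLA, freshness, and alert-routing ideas above reduce to a few comparisons. The sketch below is one possible shape for them, assuming timestamps are already available from run metadata; the function names and the channel labels (`pager`, `chat`, `dashboard`) are placeholders rather than real integrations.

```python
from datetime import datetime, timedelta
from typing import Optional


def sla_breached(started: datetime, finished: Optional[datetime],
                 sla: timedelta, now: datetime) -> bool:
    """A run breaches its SLA if it finished late or is still running past the limit."""
    end = finished if finished is not None else now
    return end - started > sla


def is_stale(last_updated: datetime, max_age: timedelta, now: datetime) -> bool:
    """Data is stale when the last successful update is older than max_age."""
    return now - last_updated > max_age


def route_alert(severity: str) -> str:
    """Map a severity level to a notification channel (placeholder names)."""
    routes = {"critical": "pager", "warning": "chat", "info": "dashboard"}
    return routes.get(severity, "dashboard")


now = datetime(2024, 1, 1, 3, 0)
started = datetime(2024, 1, 1, 2, 0)

# Still running one hour after start against a 30-minute SLA.
print("SLA breached:", sla_breached(started, None, timedelta(minutes=30), now))
# Last update was 27 hours ago against a 24-hour freshness limit.
print("Stale:", is_stale(datetime(2023, 12, 31, 0, 0), timedelta(hours=24), now))
print("Critical alerts go to:", route_alert("critical"))
```

Passing `now` explicitly, instead of calling `datetime.now()` inside the functions, keeps the checks deterministic and easy to test.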

```python
# Example: Simple monitoring pattern
import time
from datetime import datetime
from typing import Any, Dict

class TaskMonitor:
    """Simple task monitoring utility."""

    def __init__(self):
        self.metrics: Dict[str, Any] = {}

    def record_start(self, task_name: str):
        self.metrics[task_name] = {
            "start_time": datetime.now(),
            "status": "running"
        }
        print(f"[{datetime.now()}] Task '{task_name}' started")

    def record_success(self, task_name: str):
        if task_name in self.metrics:
            duration = (datetime.now() - self.metrics[task_name]["start_time"]).total_seconds()
            self.metrics[task_name]["status"] = "success"
            self.metrics[task_name]["duration_seconds"] = duration
            print(f"[{datetime.now()}] Task '{task_name}' completed in {duration:.2f}s")

    def record_failure(self, task_name: str, error: str):
        if task_name in self.metrics:
            self.metrics[task_name]["status"] = "failed"
            self.metrics[task_name]["error"] = error
            print(f"[{datetime.now()}] Task '{task_name}' FAILED: {error}")

    def get_summary(self) -> Dict[str, int]:
        return {
            "total_tasks": len(self.metrics),
            "successful": sum(1 for m in self.metrics.values() if m["status"] == "success"),
            "failed": sum(1 for m in self.metrics.values() if m["status"] == "failed")
        }

# Demo
monitor = TaskMonitor()

monitor.record_start("extract")
time.sleep(0.1)  # Simulate work
monitor.record_success("extract")

monitor.record_start("transform")
time.sleep(0.05)
monitor.record_success("transform")

print(f"\nPipeline Summary: {monitor.get_summary()}")
```

---
## Decision Framework: Choosing an Orchestrator

```
                    START
                      │
         ┌────────────┴────────────┐
         │ Need enterprise support │
         │    and large team?      │
         └────────────┬────────────┘
                      │
              ┌───────┴───────┐
             YES              NO
              │               │
              ▼               ▼
        ┌─────────┐   ┌──────────────┐
        │ Airflow │   │ Data lineage │
        │ (MWAA)  │   │  important?  │
        └─────────┘   └──────┬───────┘
                             │
                     ┌───────┴───────┐
                    YES              NO
                     │               │
                     ▼               ▼
               ┌─────────┐    ┌──────────────┐
               │ Dagster │    │ Simple Python│
               └─────────┘    │   scripts?   │
                              └──────┬───────┘
                                     │
                             ┌───────┴───────┐
                            YES              NO
                             │               │
                             ▼               ▼
                       ┌─────────┐    ┌─────────┐
                       │ Prefect │    │ Airflow │
                       └─────────┘    └─────────┘
```
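
The same decision tree can be written as a small function, which makes the order of the questions explicit. The function and parameter names are illustrative; the branches simply mirror the diagram above and carry no more authority than it does.

```python
def choose_orchestrator(enterprise_support: bool, lineage_important: bool,
                        simple_python_scripts: bool) -> str:
    """Walk the decision tree above and return the suggested tool."""
    if enterprise_support:
        return "Airflow (MWAA)"
    if lineage_important:
        return "Dagster"
    if simple_python_scripts:
        return "Prefect"
    return "Airflow"


print(choose_orchestrator(enterprise_support=False,
                          lineage_important=True,
                          simple_python_scripts=False))
```

Because the questions are checked in order, a team that needs enterprise support is routed to Airflow even if lineage also matters to it.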

---
## Takeaway

### Key Concepts

| Concept | Summary |
|---------|----------|
| **Orchestration** | Automated coordination of data workflow tasks |
| **DAG** | Directed Acyclic Graph defining task dependencies |
| **Scheduling** | Cron, event-driven, or sensor-based triggers |
| **Monitoring** | SLAs, metrics, alerting, and observability |

### Tool Selection Guide

- **Apache Airflow** — Industry standard, extensive ecosystem, best for large teams
- **Dagster** — Asset-centric, great for data lineage and testing
- **Prefect** — Pythonic, dynamic workflows, easy migration from scripts

### Best Practices

1. **Idempotency** — Tasks should be safely re-runnable
2. **Atomicity** — Tasks succeed or fail completely
3. **Small Tasks** — Prefer many small tasks over few large ones
4. **Retry Logic** — Implement exponential backoff for failures
5. **Documentation** — Document DAGs, dependencies, and data contracts
6. **Testing** — Test DAGs in isolation before production
7. **Version Control** — Store DAG definitions in Git
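
As an illustration of the retry practice, the sketch below retries a callable with exponentially growing delays plus random jitter. It is one common way to implement backoff, not the mechanism any specific orchestrator uses; `retry_with_backoff`, its parameters, and the `flaky_load` task are hypothetical, and the delays are kept tiny so the example runs quickly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 0.01, max_delay: float = 1.0) -> T:
    """Call func, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Random jitter spreads out retries from tasks that failed together.
            delay += random.uniform(0, delay / 2)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.3f}s")
            time.sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


calls = {"count": 0}


def flaky_load() -> str:
    """Hypothetical task that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("warehouse unavailable")
    return "loaded"


result = retry_with_backoff(flaky_load)
print(result, "after", calls["count"], "attempts")
```

Retries are only safe when the task is idempotent, which is why the two practices are usually adopted together.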

### Next Steps

- Explore specific orchestrator deep-dives (Airflow, Dagster, Prefect)
- Learn about data quality checks in pipelines
- Study event-driven architectures for real-time orchestration