# Lab 6: Orchestration with Airflow - Integrating the Complete Pipeline

## 🎯 Objectives
- Create Airflow DAGs for the complete data lakehouse pipeline
- Orchestrate Kafka, Spark, Iceberg, dbt, and Great Expectations (GE)
- Implement error handling and retries
- Set up monitoring and alerting

## 📋 Prerequisites
- Labs 1-5 completed
- Airflow is running
- All services have been set up


In [None]:
# Complete Airflow DAG for the data lakehouse pipeline
dag_code = """
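from datetime import timedelta  # needed for retry_delay below

import pendulum
from airflow.sdk import dag, task  # the @dag decorator, not the DAG class
from airflow.providers.standard.operators.bash import BashOperator
from airflow.providers.standard.operators.empty import EmptyOperator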

@dag(
    dag_id='data_lakehouse_pipeline',
    schedule='@daily',
    start_date=pendulum.datetime(2024, 1, 1, tz='UTC'),
    catchup=False,
    default_args={
        'retries': 2,
        'retry_delay': timedelta(minutes=5),
    },
    tags=['lakehouse', 'end-to-end'],
)
def data_lakehouse_pipeline():
    start = EmptyOperator(task_id='start')
    
    # Stage 1: Ingest data with the Kafka producer
    ingest_task = BashOperator(
        task_id='ingest_data',
        bash_command='python /opt/airflow/data/scripts/kafka_producer.py',
    )
    
    # Stage 2: Process with Spark Streaming
    spark_streaming = BashOperator(
        task_id='spark_streaming',
        bash_command='spark-submit --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.0 /opt/airflow/spark_jobs/streaming_job.py',
    )
    
    # Stage 3: Transform with dbt
    dbt_run = BashOperator(
        task_id='dbt_transform',
        bash_command='cd /opt/airflow/dbt_project && dbt run --profiles-dir . --project-dir .',
    )
    
    # Stage 4: Validate with Great Expectations
    ge_validate = BashOperator(
        task_id='ge_validation',
        bash_command='cd /opt/airflow/ge_project && great_expectations checkpoint run data_quality_checkpoint',
    )
    
    # Stage 5: Generate reports
    @task
    def generate_report(**context):
        print('Pipeline completed successfully!')
        return 'Report generated'
    
    end = EmptyOperator(task_id='end', trigger_rule='all_done')
    
    # Define workflow
    start >> ingest_task >> spark_streaming >> dbt_run >> ge_validate >> generate_report() >> end

pipeline = data_lakehouse_pipeline()
"""

print("📋 Complete Airflow DAG:")
print("=" * 60)
print(dag_code)
print("=" * 60)
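
The DAG above configures retries, but the alerting objective is not covered by it. One possible approach is a failure callback, sketched below. The names `build_failure_alert` and `notify_on_failure` are hypothetical and not part of the lab. The sketch assumes the callback context exposes `task_instance` and `exception`, as Airflow's failure-callback context normally does. The `print` call is a stand-in for a real channel such as Slack or email.

```python
from types import SimpleNamespace


def build_failure_alert(context):
    """Format a one-line alert from an Airflow-style task context (hypothetical helper)."""
    ti = context["task_instance"]
    error = context.get("exception")  # assumed to be set by Airflow on failure
    return (
        f"[ALERT] DAG={ti.dag_id} task={ti.task_id} "
        f"try={ti.try_number} error={error!r}"
    )


def notify_on_failure(context):
    """Hypothetical callback: prints the alert; replace print with a real notifier."""
    message = build_failure_alert(context)
    print(message)
    return message


# Local smoke test with a stand-in object instead of a real TaskInstance
fake_context = {
    "task_instance": SimpleNamespace(
        dag_id="data_lakehouse_pipeline",
        task_id="dbt_transform",
        try_number=3,
    ),
    "exception": RuntimeError("dbt run failed"),
}
alert = notify_on_failure(fake_context)
```

To try it in the pipeline, the function could be passed as `'on_failure_callback': notify_on_failure` inside `default_args`. Airflow would then call it once a task has exhausted its retries and is marked failed.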
