# Lab 6: Airflow Integration - Orchestrating GE with Airflow

## 🎯 Objectives
- Integrate Great Expectations with Airflow
- Run validations inside Airflow pipelines
- Handle errors and send alerts
- Apply best practices for GE + Airflow

## 📋 Prerequisites
- Labs 1-5 completed
- Airflow Lab completed
- A running Airflow cluster
- A GE project already set up

## 🏗️ Integration Overview

**Airflow + Great Expectations** for data quality automation:
- **Airflow**: Orchestration and scheduling
- **Great Expectations**: Data validation
- **Together**: Automated data quality checks

### Architecture:
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Data     │────▶│   Airflow   │────▶│     GE      │
│  Pipeline   │     │(Orchestrate)│     │ (Validate)  │
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │   Alerts    │
                    │ (On Failure)│
                    └─────────────┘
```


## 1. Airflow DAG with Great Expectations

Create a DAG that runs GE validations.


In [None]:
# Example Airflow DAG with Great Expectations
dag_example = """
from datetime import timedelta

import pendulum
from airflow.providers.standard.operators.bash import BashOperator
from airflow.providers.standard.operators.empty import EmptyOperator
from airflow.sdk import dag, task

@dag(
    dag_id='ge_validation_pipeline',
    schedule='@daily',
    start_date=pendulum.datetime(2024, 1, 1, tz='UTC'),
    catchup=False,
    tags=['great-expectations', 'data-quality'],
)
def ge_validation_pipeline():
    start = EmptyOperator(task_id='start')
    
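    # Run the GE checkpoint through the legacy CLI (shipped with GE releases before 1.0).
    # A failed validation exits non-zero, which fails this task and triggers the retries.
    ge_checkpoint = BashOperator(
        task_id='run_ge_checkpoint',
        bash_command='cd /path/to/Great_Expectations_lab && great_expectations checkpoint run customers_checkpoint',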
        retries=2,
        retry_delay=timedelta(minutes=5),
    )
    
    # Check validation results
    @task
    def check_validation_results(**context):
        import json
        # Read validation results
        # Send alerts if validation failed
        return \"Validation completed\"
    
    end = EmptyOperator(task_id='end')
    
    # Define dependencies
    start >> ge_checkpoint >> check_validation_results() >> end

ge_pipeline_instance = ge_validation_pipeline()
"""

print("📋 Airflow DAG Example:")
print("=" * 60)
print(dag_example)
print("=" * 60)

print("\n💡 Key Points:")
print("  - Run GE checkpoints in Airflow tasks")
print("  - Check validation results")
print("  - Send alerts if validation fails")
print("  - Integrate with data pipelines")

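The `check_validation_results` task in the DAG above is only a placeholder. A minimal sketch of what it might do follows, assuming the checkpoint result has been loaded as a dict with the layout GE uses for stored validation results (a top-level `success` flag and a `results` list whose items carry `success` and `expectation_config`). The helper names `summarize_validation` and `format_alert` are hypothetical and not part of GE or Airflow.

```python
from typing import Any


def summarize_validation(result: dict[str, Any]) -> dict[str, Any]:
    """Collect the failed expectations from a validation result dict.

    Assumes the GE-style layout: {"success": bool, "results": [...]}.
    """
    checks = result.get("results", [])
    failed = []
    for check in checks:
        if check.get("success", False):
            continue
        config = check.get("expectation_config", {})
        failed.append(
            {
                "expectation": config.get("expectation_type", "unknown"),
                "column": config.get("kwargs", {}).get("column"),
            }
        )
    return {
        "success": bool(result.get("success", False)),
        "evaluated": len(checks),
        "failed": failed,
    }


def format_alert(summary: dict[str, Any]) -> str:
    """Build a one-line alert message from a summary (hypothetical format)."""
    if summary["success"]:
        return f"All {summary['evaluated']} expectations passed"
    details = ", ".join(
        f"{item['expectation']}({item['column']})" for item in summary["failed"]
    )
    return (
        f"{len(summary['failed'])} of {summary['evaluated']} "
        f"expectations failed: {details}"
    )
```

Inside the task, one option is to call `summarize_validation` on the loaded result, pass the message from `format_alert` to whatever alerting channel the team uses, and raise an exception when `summary["success"]` is false so that the task itself fails.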

## 2. Best Practices

Best practices for integrating GE with Airflow.


In [None]:
print("✅ Best Practices:")
print("=" * 60)
print("""
1. **Validation Timing:**
   - Validate data after loading
   - Validate before transforming
   - Validate after transforming

2. **Error Handling:**
   - Set retries for GE tasks
   - Fail fast if validation fails
   - Send alerts for critical failures

3. **Checkpoints:**
   - Use checkpoints instead of running validations directly
   - Store validation results
   - Track validation history

4. **Integration Points:**
   - After data ingestion
   - Before data transformation
   - After data transformation
   - Before data delivery

5. **Monitoring:**
   - Track validation success rates
   - Monitor data quality trends
   - Alert on quality degradation

6. **Documentation:**
   - Generate Data Docs in the pipeline
   - Share docs with the team
   - Keep expectations documented
""")
print("=" * 60)

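The monitoring practices above (track success rates, alert on degradation) can be sketched in plain Python. This is an illustration only: the `ValidationHistory` class, its rolling window, and the threshold value are assumptions for the example, not a GE or Airflow feature. In a real pipeline the outcomes would come from stored validation results.

```python
from collections import deque


class ValidationHistory:
    """Keep the most recent validation outcomes and flag quality degradation."""

    def __init__(self, window: int = 7, min_success_rate: float = 0.9) -> None:
        # Only the last `window` outcomes are kept; older ones drop off.
        self.outcomes: deque[bool] = deque(maxlen=window)
        self.min_success_rate = min_success_rate

    def record(self, success: bool) -> None:
        """Store the outcome of one checkpoint run."""
        self.outcomes.append(success)

    def success_rate(self) -> float:
        """Share of successful runs in the window (1.0 when nothing is recorded)."""
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self) -> bool:
        """True when the success rate falls below the configured threshold."""
        return self.success_rate() < self.min_success_rate
```

A task that runs after each checkpoint could call `record(...)` with the run's outcome and send an alert when `is_degraded()` returns true. The window size and threshold are tuning choices that depend on how often the pipeline runs.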

## 3. Summary and Conclusion

### ✅ What you learned across the Great Expectations Lab Series:

**Lab 1: GE Basics**
- What Great Expectations is
- Data Context and Data Sources
- Creating your first Expectations

**Lab 2: Expectations**
- Types of Expectations
- Column-level and table-level expectations
- Custom expectations

**Lab 3: Checkpoints**
- Creating and running checkpoints
- Validation Actions
- Handle results

**Lab 4: Data Docs**
- Generating and customizing docs
- Share documentation

**Lab 5: dbt Integration**
- dbt-expectations package
- GE-like tests in dbt

**Lab 6: Airflow Integration**
- Integrating GE with Airflow
- Automated validations
- Error handling

### 🎯 Key Takeaways:

**Great Expectations:**
- Data quality and validation tool
- Declarative expectations
- Automated validation
- Auto-generated documentation

**Integration:**
- Standalone: GE as an independent tool
- dbt: dbt-expectations package
- Airflow: Orchestrate validations

**Best Practices:**
- Validate at multiple stages
- Fail fast on quality issues
- Monitor and alert
- Document expectations

### 📚 Next Steps:

1. **Production Deployment:**
   - Set up GE in production
   - Configure checkpoints
   - Setup monitoring

2. **Advanced Topics:**
   - Custom expectations
   - Profiling data
   - Data quality metrics
   - Integration with BI tools

### 🔗 Useful Links:
- [Great Expectations Documentation](https://docs.greatexpectations.io/)
- [dbt-expectations](https://github.com/calogica/dbt-expectations)
- [GE Best Practices](https://docs.greatexpectations.io/docs/guides/expectations/expectations_best_practices/)

### 🎉 Congratulations!

You have completed the Great Expectations Lab Series! You now know enough to:
- Implement data quality checks with GE
- Integrate GE with dbt and Airflow
- Deploy and maintain data quality pipelines
- Build robust data validation systems
- Build robust data validation systems

**Happy Validating! 🎯**
