# Lab 6: Scheduling and Timetables - Scheduling DAGs in Airflow

## 🎯 Objectives
- Understand how scheduling works in Airflow
- Use cron expressions for scheduling
- Use timedelta for interval-based scheduling
- Create custom timetables for complex scheduling
- Understand catchup and data intervals
- Handle timezone and scheduling edge cases

## 📋 Prerequisites
- Completed Labs 1-5
- Understand DAGs and tasks
- Airflow cluster is running

## 🏗️ Scheduling Overview
Airflow scheduling is based on:
- **Schedule Interval**: Time period between DAG runs
- **Start Date**: Earliest date from which runs are scheduled
- **Catchup**: Whether to create the runs missed since the start date
- **Data Intervals**: Time window of data that each run processes
- **Timetables**: Custom logic for complex scheduling


## 1. Import Libraries and Setup


In [None]:
# Import Airflow scheduling modules
# `dag` is required because the cells below use the @dag decorator
from airflow.sdk import DAG, dag, task
from airflow.providers.standard.operators.empty import EmptyOperator
from airflow.timetables.interval import CronDataIntervalTimetable
from airflow.timetables.base import Timetable
from airflow.timetables.trigger import CronTriggerTimetable

import pendulum
from datetime import datetime, timedelta

print("✅ Airflow scheduling modules imported successfully!")


## 2. Cron Expressions - Scheduling with Cron

Cron expressions allow flexible scheduling through presets and custom patterns.
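
Each preset is shorthand for a five-field cron expression. The sketch below spells that mapping out as a plain dictionary for reference. The names `CRON_PRESETS` and `expand_preset` are hypothetical and exist only for this illustration; they are not Airflow imports. `@once` is left out because it describes a single run rather than a cron pattern.

```python
# Illustrative mapping from preset name to the equivalent cron expression.
# CRON_PRESETS is a hypothetical name for this sketch, not an Airflow object.
CRON_PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def expand_preset(schedule: str) -> str:
    """Return the cron expression for a preset, or the input unchanged."""
    return CRON_PRESETS.get(schedule, schedule)


print(expand_preset("@daily"))
print(expand_preset("0 */6 * * *"))
```

Reading a preset this way makes it easier to see which fields to change when a preset is almost, but not quite, the schedule you need.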


In [None]:
# DAG with cron presets
@dag(
    dag_id="cron_presets_example",
    schedule="@daily",  # Cron preset: runs daily at 00:00
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["scheduling", "cron", "presets"],
)
def cron_presets_dag():
    """
    ### Cron Presets Example
    Uses a cron preset for simple scheduling.
    """

    @task
    def daily_task():
        """Task that runs daily."""
        print("Running daily task")
        return "Daily task completed"
    
    daily_task()

# Create DAG
cron_presets_dag_instance = cron_presets_dag()

print("✅ Cron Presets DAG created!")
print("Schedule: @daily (runs every day at 00:00)")
print("\n📋 Common Cron Presets:")
print("  - @once: Runs exactly once")
print("  - @hourly: Runs every hour")
print("  - @daily: Runs daily at 00:00")
print("  - @weekly: Runs weekly (Sunday)")
print("  - @monthly: Runs monthly (on the 1st)")
print("  - @yearly: Runs yearly (on January 1)")


## 3. Custom Cron Expressions

Cron expressions let you define more complex schedule patterns.
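
To make the field syntax concrete, here is a minimal sketch of how a single cron field such as `*/6` or `9-17` expands into the values it matches. `expand_field` is a hypothetical helper written for this lab, not part of Airflow or any cron library, and it covers only numeric values, ranges, lists, and steps (no names such as `MON` or `JAN`).

```python
def expand_field(field: str, low: int, high: int) -> list[int]:
    """Expand one cron field (e.g. '*/6', '9-17', '1,15') into sorted values."""
    values = set()
    for part in field.split(","):
        # Split off an optional step, e.g. '*/6' -> base '*', step 6
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = int(base)
            # A single value with a step ('5/10') runs to the field maximum
            end = high if step_text else start
        values.update(range(start, end + 1, step))
    return sorted(values)


print(expand_field("*/6", 0, 23))   # hour field of '0 */6 * * *'
print(expand_field("9-17", 0, 23))  # hour field of '0 9-17 * * 1-5'
print(expand_field("1-5", 0, 6))    # day-of-week field, Monday to Friday
```

A schedule fires when every field matches the current time, so `0 */6 * * *` fires at minute 0 of hours 0, 6, 12, and 18.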


In [None]:
# DAG with a custom cron expression
@dag(
    dag_id="custom_cron_example",
    schedule="0 */6 * * *",  # Runs every 6 hours (00:00, 06:00, 12:00, 18:00)
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["scheduling", "cron", "custom"],
)
def custom_cron_dag():
    """
    ### Custom Cron Expression Example
    Uses a custom cron expression for flexible scheduling.
    """

    @task
    def scheduled_task():
        """Task that runs on the custom schedule."""
        print("Running scheduled task")
        return "Task completed"
    
    scheduled_task()

# Create DAG
custom_cron_dag_instance = custom_cron_dag()

print("✅ Custom Cron DAG created!")
print("Schedule: 0 */6 * * * (every 6 hours)")
print("\n📋 Cron Expression Format:")
print("  ┌───────────── minute (0-59)")
print("  │ ┌─────────── hour (0-23)")
print("  │ │ ┌───────── day of month (1-31)")
print("  │ │ │ ┌─────── month (1-12)")
print("  │ │ │ │ ┌───── day of week (0-6, Sunday=0)")
print("  │ │ │ │ │")
print("  * * * * *")
print("\n💡 Examples:")
print("  - '0 0 * * *': Daily at midnight")
print("  - '0 */3 * * *': Every 3 hours")
print("  - '0 9 * * 1-5': Weekdays at 9 AM")
print("  - '0 0 1 * *': First day of month")
print("  - '30 14 * * *': Daily at 2:30 PM")


## 4. Timedelta Scheduling

Passing a `timedelta` as `schedule` runs the DAG at a fixed interval.


In [None]:
# DAG with timedelta scheduling
@dag(
    dag_id="timedelta_scheduling_example",
    schedule=timedelta(hours=2),  # Runs every 2 hours
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["scheduling", "timedelta"],
)
def timedelta_scheduling_dag():
    """
    ### Timedelta Scheduling Example
    Uses timedelta for interval-based scheduling.
    """

    @task
    def interval_task():
        """Task that runs at a fixed interval."""
        print("Running interval-based task")
        return "Interval task completed"
    
    interval_task()

# Create DAG
timedelta_scheduling_dag_instance = timedelta_scheduling_dag()

print("✅ Timedelta Scheduling DAG created!")
print("Schedule: timedelta(hours=2) - every 2 hours")
print("\n📋 Timedelta Examples:")
print("  - timedelta(minutes=30): Every 30 minutes")
print("  - timedelta(hours=1): Every hour")
print("  - timedelta(hours=6): Every 6 hours")
print("  - timedelta(days=1): Daily")
print("  - timedelta(weeks=1): Weekly")
print("\n💡 Timedelta vs Cron:")
print("  - Timedelta: Simple, for regular intervals")
print("  - Cron: More flexible, for complex patterns")
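
One practical difference between the two is worth a small sketch. Assuming a timedelta schedule counts its intervals from `start_date`, the boundaries inherit whatever minute the start date has, whereas a cron expression such as `0 */2 * * *` always lands on the top of the hour. `interval_boundaries` is a hypothetical helper using only the standard library, and the 00:30 start date is chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def interval_boundaries(start: datetime, delta: timedelta, count: int) -> list[datetime]:
    """Return the first `count` boundaries spaced `delta` apart from `start`."""
    return [start + i * delta for i in range(count)]


# Hypothetical start date that is not aligned to the top of the hour
start_date = datetime(2024, 1, 1, 0, 30, tzinfo=timezone.utc)

for boundary in interval_boundaries(start_date, timedelta(hours=2), 4):
    print(boundary.isoformat())
```

Under that assumption the boundaries fall at 00:30, 02:30, 04:30, and 06:30, so a start date aligned to the hour is the simplest way to keep timedelta and cron schedules comparable.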


## 5. Catchup - Handling Missed Runs

Catchup determines whether Airflow creates the DAG runs that were missed since `start_date`.
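
To get a feel for the size of a backlog, the sketch below counts the complete daily intervals between a start date and the moment a DAG is enabled. It uses only the standard library; `missed_daily_intervals` is a hypothetical helper, and the enable date of 2024-10-01 is an assumption for illustration. The number of runs Airflow actually creates also depends on the timetable and on limits such as `max_active_runs`.

```python
from datetime import datetime, timedelta, timezone


def missed_daily_intervals(start_date: datetime, enabled_at: datetime):
    """List the complete one-day intervals between start_date and enabled_at."""
    intervals = []
    interval_start = start_date
    one_day = timedelta(days=1)
    # Only intervals that have fully elapsed count as missed
    while interval_start + one_day <= enabled_at:
        intervals.append((interval_start, interval_start + one_day))
        interval_start += one_day
    return intervals


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
enabled = datetime(2024, 10, 1, tzinfo=timezone.utc)  # assumed enable date
backlog = missed_daily_intervals(start, enabled)
print(len(backlog))  # 274 complete daily intervals in this window
```

A backlog of this size is why `catchup=True` deserves a deliberate decision rather than being left on by accident.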


In [None]:
# DAG with catchup=True
@dag(
    dag_id="catchup_true_example",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),  # Starts on 1 January 2024
    catchup=True,  # Creates every missed run since start_date
    tags=["scheduling", "catchup"],
)
def catchup_true_dag():
    """
    ### Catchup = True Example
    If the DAG is enabled on 1 October 2024, it runs every interval from 1 January to 1 October.
    """

    @task
    def catchup_task():
        """Task with catchup enabled."""
        print("Running catchup task")
        return "Catchup task completed"
    
    catchup_task()

# DAG with catchup=False
@dag(
    dag_id="catchup_false_example",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,  # Only runs from the current time onward
    tags=["scheduling", "no-catchup"],
)
def catchup_false_dag():
    """
    ### Catchup = False Example
    Runs only from the moment the DAG is enabled; missed runs are not created.
    """

    @task
    def no_catchup_task():
        """Task with catchup disabled."""
        print("Running no-catchup task")
        return "No catchup task completed"
    
    no_catchup_task()

# Create DAGs (the decorated functions must be called to register them)
catchup_true_dag_instance = catchup_true_dag()
catchup_false_dag_instance = catchup_false_dag()

print("✅ Catchup Examples DAGs created!")
print("\n📊 Catchup Behavior:")
print("=" * 60)
print("Catchup = True:")
print("  - Runs every missed run since start_date")
print("  - Useful for backfilling historical data")
print("  - Can create many DAG runs at once")
print("\nCatchup = False:")
print("  - Only runs from the current time onward")
print("  - Recommended for production DAGs")
print("  - Avoids creating too many runs")
print("=" * 60)


## 6. Data Intervals - Understanding Data Processing Windows

A data interval defines the time range of data that each DAG run processes.
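
The idea can be sketched without Airflow. The example below builds a one-day window from a logical date and selects the events that fall inside it. `daily_data_interval` and the sample events are hypothetical, and treating the window as half-open (start included, end excluded) is an assumption chosen here so that adjacent runs never process the same record twice.

```python
from datetime import datetime, timedelta, timezone


def daily_data_interval(logical_date: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the one-day window beginning at logical_date."""
    start = logical_date.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, start + timedelta(days=1)


start, end = daily_data_interval(datetime(2024, 1, 1, tzinfo=timezone.utc))

# Hypothetical event timestamps, one on each side of the window boundary
events = [
    datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc),
    datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc),
]

# Half-open filter: include the start, exclude the end
in_window = [event for event in events if start <= event < end]

print(f"window: {start.isoformat()} -> {end.isoformat()}")
print(f"events in window: {len(in_window)}")
```

In a task, the same filter would use `data_interval_start` and `data_interval_end` from the context instead of values computed by hand.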


In [None]:
# DAG that inspects its data interval
@dag(
    dag_id="data_intervals_example",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["scheduling", "data-intervals"],
)
def data_intervals_dag():
    """
    ### Data Intervals Example
    Shows how data intervals work in Airflow.
    """

    @task
    def process_data(**context):
        """Process data using the data interval information."""
        # Start and end of the data interval
        data_interval_start = context['data_interval_start']
        data_interval_end = context['data_interval_end']

        # Logical date as a YYYY-MM-DD string (formerly the execution date)
        logical_date = context['ds']
        
        print("=" * 60)
        print("Data Interval Information:")
        print(f"  Data Interval Start: {data_interval_start}")
        print(f"  Data Interval End: {data_interval_end}")
        print(f"  Logical Date (ds): {logical_date}")
        print(f"  Duration: {data_interval_end - data_interval_start}")
        print("=" * 60)
        
        # Example: process data from 2024-01-01 00:00 to 2024-01-02 00:00
        print(f"\nProcessing data from {data_interval_start} to {data_interval_end}")
        
        return {
            "interval_start": data_interval_start.isoformat(),
            "interval_end": data_interval_end.isoformat(),
            "records_processed": 1000,
        }
    
    process_data()

# Create DAG
data_intervals_dag_instance = data_intervals_dag()

print("✅ Data Intervals DAG created!")
print("\n💡 Data Intervals:")
print("  - data_interval_start: Start of the data window")
print("  - data_interval_end: End of the data window")
print("  - logical_date (ds): Logical date of the DAG run")
print("\n📊 Example with an @daily schedule:")
print("  (Applies when the schedule produces data intervals, e.g. CronDataIntervalTimetable)")
print("  DAG run on 2024-01-02 processes data from:")
print("    Start: 2024-01-01 00:00:00")
print("    End:   2024-01-02 00:00:00")
print("    Logical Date: 2024-01-01")


## 7. Custom Timetables - Advanced Scheduling

Custom timetables let you build complex scheduling logic that cron or timedelta cannot express.
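
Before wiring the logic into Airflow, it can help to prototype the date arithmetic on its own. The sketch below is a standalone version of a "workdays only" rule using the standard library. `next_workday_interval` is a hypothetical helper, not an Airflow API, and it ignores public holidays.

```python
from datetime import datetime, timedelta, timezone


def next_workday_interval(candidate: datetime) -> tuple[datetime, datetime]:
    """Return the one-day interval starting at the first weekday on or after candidate."""
    start = candidate.replace(hour=0, minute=0, second=0, microsecond=0)
    while start.weekday() >= 5:  # datetime.weekday(): Monday=0 ... Sunday=6
        start += timedelta(days=1)
    return start, start + timedelta(days=1)


saturday = datetime(2024, 1, 6, tzinfo=timezone.utc)  # 2024-01-06 is a Saturday
start, end = next_workday_interval(saturday)
print(start.date(), "->", end.date())  # 2024-01-08 -> 2024-01-09
```

Testing the rule in isolation like this makes it easier to trust the same loop once it sits inside a `Timetable` subclass.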


In [None]:
# Custom timetable: runs only on workdays (Monday-Friday)
# Note: Airflow expects custom timetables to be registered through a plugin.
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

class WorkdayTimetable(Timetable):
    """
    Custom timetable that runs only on workdays (Monday-Friday).
    """

    def infer_manual_data_interval(self, run_after: pendulum.DateTime) -> DataInterval:
        """Infer the data interval for manually triggered runs."""
        # Step back to the most recent workday
        while run_after.weekday() >= 5:  # Saturday=5, Sunday=6
            run_after = run_after.subtract(days=1)
        
        start = run_after.start_of("day")
        end = start.add(days=1)
        return DataInterval(start=start, end=end)
    
    def next_dagrun_info(
        self,
        last_automated_data_interval: DataInterval | None,
        restriction: TimeRestriction,
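    ) -> DagRunInfo | None:
        """Calculate the next DAG run."""
        if last_automated_data_interval is None:
            # First run: start from the earliest allowed date (the DAG's start_date)
            if restriction.earliest is None:
                return None  # No start_date, so there is nothing to schedule
            start = restriction.earliest
        else:
            start = last_automated_data_interval.end

        # Find the next workday
        while start.weekday() >= 5:  # Skip weekends
            start = start.add(days=1)

        end = start.add(days=1)

        # Stop if the interval would extend past the latest allowed date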
        if restriction.latest and end > restriction.latest:
            return None
        
        return DagRunInfo.interval(start=start, end=end)

# DAG with the custom timetable
@dag(
    dag_id="custom_timetable_example",
    schedule=WorkdayTimetable(),  # Custom timetable
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["scheduling", "custom-timetable"],
)
def custom_timetable_dag():
    """
    ### Custom Timetable Example
    DAG that runs only on workdays (Monday-Friday).
    """

    @task
    def workday_task():
        """Task that runs only on workdays."""
        print("Running workday-only task")
        return "Workday task completed"
    
    workday_task()

# Create DAG
custom_timetable_dag_instance = custom_timetable_dag()

print("✅ Custom Timetable DAG created!")
print("\n💡 Custom Timetable:")
print("  - Inherit from the Timetable base class")
print("  - Implement the next_dagrun_info() method")
print("  - Lets you build complex scheduling logic")
print("\nüìã Use Cases:")
print("  - Business days only")
print("  - Skip holidays")
print("  - Custom business rules")
print("  - Event-driven scheduling")


## 8. Timezone Handling

Airflow uses UTC as its default timezone. Always use pendulum to handle timezones correctly.
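
The difference between a naive and a timezone-aware datetime is easy to see with the standard library alone. In this sketch Vietnam time is modelled as a fixed UTC+7 offset, which is an assumption made to keep the example dependency-free; `VIETNAM_OFFSET` is a hypothetical name, and the DAG in this section uses pendulum's named zone `Asia/Ho_Chi_Minh` instead.

```python
from datetime import datetime, timedelta, timezone

# Assumption for this sketch: Vietnam is modelled as a fixed UTC+7 offset
VIETNAM_OFFSET = timezone(timedelta(hours=7))

naive = datetime(2024, 1, 1, 9, 0)                       # no tzinfo attached
aware = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)  # explicit UTC

print(naive.tzinfo is None)  # True
print(aware.astimezone(VIETNAM_OFFSET).isoformat())  # 2024-01-01T16:00:00+07:00
```

A 9 AM UTC schedule therefore corresponds to 4 PM in Vietnam, which is worth checking before choosing the hour field of a cron expression.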


In [None]:
# DAG with timezone handling
@dag(
    dag_id="timezone_example",
    schedule="0 9 * * *",  # 9 AM UTC
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),  # Always use UTC
    catchup=False,
    tags=["scheduling", "timezone"],
)
def timezone_dag():
    """
    ### Timezone Handling Example
    Always use UTC and pendulum for timezone handling.
    """

    @task
    def timezone_task(**context):
        """Task that demonstrates timezone handling."""
        # Datetimes in the Airflow context are in UTC
        execution_date = context['data_interval_start']
        
        print("=" * 60)
        print("Timezone Information:")
        print(f"  Execution Date (UTC): {execution_date}")
        print(f"  Timezone: {execution_date.timezone_name}")
        
        # Convert to another timezone if needed
        vietnam_time = execution_date.in_timezone("Asia/Ho_Chi_Minh")
        print(f"  Vietnam Time: {vietnam_time}")
        print("=" * 60)
        
        return {
            "utc_time": execution_date.isoformat(),
            "vietnam_time": vietnam_time.isoformat(),
        }
    
    timezone_task()

# Create DAG
timezone_dag_instance = timezone_dag()

print("✅ Timezone DAG created!")
print("\n💡 Timezone Best Practices:")
print("  - Always use UTC for start_date")
print("  - Use pendulum.datetime() with tz='UTC'")
print("  - Convert to the local timezone inside tasks if needed")
print("  - Do not use a naive datetime.datetime() (it has no timezone)")
print("\n⚠️  Common Mistakes:")
print("  - Using datetime.datetime() instead of pendulum")
print("  - Not specifying a timezone in start_date")
print("  - Mixing UTC and local timezones")


## 9. Schedule Examples - Real-world Scenarios

Examples of scheduling patterns that are common in practice.
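
When choosing among patterns like these, it helps to estimate how many runs a schedule produces. The sketch below counts the hourly ticks in one week that match a business-hours rule (09:00 through 17:00, Monday to Friday) by brute force, using only the standard library. `count_business_hour_runs` is a hypothetical helper written for this estimate.

```python
from datetime import datetime, timedelta, timezone


def count_business_hour_runs(week_start: datetime) -> int:
    """Count hourly ticks in one week that fall at 09:00-17:00 on Monday-Friday."""
    runs = 0
    for hour_offset in range(7 * 24):
        tick = week_start + timedelta(hours=hour_offset)
        # weekday() < 5 means Monday to Friday; hours 9..17 inclusive
        if tick.weekday() < 5 and 9 <= tick.hour <= 17:
            runs += 1
    return runs


monday = datetime(2024, 1, 1, tzinfo=timezone.utc)  # 2024-01-01 is a Monday
print(count_business_hour_runs(monday))  # 45 (9 ticks per day x 5 weekdays)
print(24 * 60 // 15)  # 96 ticks per day for a 15-minute schedule
```

Numbers like these give a quick sanity check on scheduler load before a high-frequency DAG goes live.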


In [None]:
# Example 1: Business Hours Only (9 AM - 5 PM, Weekdays)
@dag(
    dag_id="business_hours_example",
    schedule="0 9-17 * * 1-5",  # 9 AM to 5 PM, Monday to Friday
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["scheduling", "business-hours"],
)
def business_hours_dag():
    """Runs during business hours on workdays."""
    @task
    def business_task():
        print("Running during business hours")
        return "Business task completed"
    business_task()

# Example 2: End of Day Processing
@dag(
    dag_id="end_of_day_example",
    schedule="0 23 * * *",  # 11 PM daily
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["scheduling", "eod"],
)
def end_of_day_dag():
    """Runs at the end of the day to process daily data."""
    @task
    def eod_task():
        print("Running end-of-day processing")
        return "EOD task completed"
    eod_task()

# Example 3: High Frequency (Every 15 minutes)
@dag(
    dag_id="high_frequency_example",
    schedule="*/15 * * * *",  # Every 15 minutes
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["scheduling", "high-frequency"],
)
def high_frequency_dag():
    """Runs at high frequency for near-real-time processing."""
    @task
    def frequent_task():
        print("Running high-frequency task")
        return "Frequent task completed"
    frequent_task()

# Example 4: Weekly Report (Monday morning)
@dag(
    dag_id="weekly_report_example",
    schedule="0 8 * * 1",  # 8 AM every Monday
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["scheduling", "weekly"],
)
def weekly_report_dag():
    """Runs weekly to generate weekly reports."""
    @task
    def weekly_task():
        print("Generating weekly report")
        return "Weekly report generated"
    weekly_task()

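# Create DAGs (the decorated functions must be called to register them)
business_hours_dag_instance = business_hours_dag()
end_of_day_dag_instance = end_of_day_dag()
high_frequency_dag_instance = high_frequency_dag()
weekly_report_dag_instance = weekly_report_dag()

print("✅ Real-world Scheduling Examples created!")
print("\n📋 Common Scheduling Patterns:")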
print("=" * 60)
print("1. Business Hours:")
print("   Schedule: '0 9-17 * * 1-5'")
print("   Runs: 9 AM - 5 PM, Monday-Friday")
print("\n2. End of Day:")
print("   Schedule: '0 23 * * *'")
print("   Runs: 11 PM daily")
print("\n3. High Frequency:")
print("   Schedule: '*/15 * * * *'")
print("   Runs: Every 15 minutes")
print("\n4. Weekly Reports:")
print("   Schedule: '0 8 * * 1'")
print("   Runs: 8 AM every Monday")
print("=" * 60)


## 10. Best Practices and Common Pitfalls

### ✅ Best Practices:

1. **Start Date:**
   - Always use the UTC timezone
   - Use `pendulum.datetime()` with `tz="UTC"`
   - Set start_date in the past so that backfilling is possible

2. **Catchup:**
   - Set `catchup=False` for production DAGs
   - Use `catchup=True` only for backfilling
   - Monitor the scheduler when catchup is enabled to avoid overload

3. **Schedule Intervals:**
   - Use cron presets when possible (@daily, @hourly)
   - Use custom cron for complex patterns
   - Use timedelta for regular intervals

4. **Timezone:**
   - Always use UTC in Airflow
   - Convert to the local timezone inside tasks if needed
   - Do not mix timezones

5. **Data Intervals:**
   - Understand data_interval_start and data_interval_end
   - Use logical_date (ds) for data partitioning
   - Make sure the data interval matches the business logic

### ⚠️ Common Pitfalls:

1. **Timezone confusion**: Mixing UTC and local timezones
2. **Catchup overload**: Enabling catchup creates too many runs
3. **Wrong start_date**: Setting it in the future means the DAG does not run
4. **Schedule mismatch**: The schedule does not match data availability
5. **Data interval misunderstanding**: Processing the wrong data window


## 11. Summary and Next Steps

### ✅ What you learned:
1. Cron expressions and presets (@daily, @hourly, etc.)
2. Custom cron patterns for complex scheduling
3. Timedelta scheduling for regular intervals
4. Catchup behavior and when to use it
5. Data intervals and data processing windows
6. Custom timetables for advanced scheduling
7. Timezone handling with UTC and pendulum
8. Real-world scheduling patterns
9. Best practices and common pitfalls

### 📚 Next Lab:
- **Lab 7**: End-to-End Pipeline Integration
- Integrating Airflow with Kafka, Spark, and databases
- Multi-service orchestration
- Error handling and recovery
- Monitoring and alerting

### 🔗 Useful Links:
- [Scheduling Documentation](https://airflow.apache.org/docs/apache-airflow/3.1.1/core-concepts/dags.html#scheduling)
- [Cron Expressions](https://airflow.apache.org/docs/apache-airflow/3.1.1/authoring-and-scheduling/cron.html)
- [Timetables](https://airflow.apache.org/docs/apache-airflow/3.1.1/authoring-and-scheduling/timetable.html)
- [Data Intervals](https://airflow.apache.org/docs/apache-airflow/3.1.1/core-concepts/dag-run.html#data-interval)

### 💡 Key Takeaways:

**Scheduling Options:**
- Cron presets: @daily, @hourly, etc. (simple)
- Cron expressions: Custom patterns (flexible)
- Timedelta: Regular intervals (simple)
- Custom timetables: Complex logic (advanced)

**Important Concepts:**
- Data intervals: The time range of data that a run processes
- Catchup: Whether missed runs are created
- Timezone: Always use UTC in Airflow
- Start date: The date the schedule begins

### 💡 Exercises:
1. Create a DAG with a schedule that runs every 3 hours
2. Create a DAG that runs only on workdays
3. Implement a custom timetable for your business logic
4. Handle timezone conversion inside tasks
5. Create a DAG with catchup to backfill data
