# Lab 5: XCom and Data Sharing - Sharing Data Between Tasks

## 🎯 Objectives
- Understand XCom (Cross-Communication) in Airflow
- Use task return values to share data
- Perform XCom push and pull operations
- Use XCom with the Task SDK (`@task` decorator)
- Use XCom with operators (`PythonOperator`)
- Apply best practices for data sharing
- Recognize XCom limitations and alternatives

## 📋 Prerequisites
- Completed Labs 1-4
- Understanding of task dependencies
- Airflow cluster is running

## 🏗️ XCom Overview
XCom (Cross-Communication) is Airflow's mechanism for sharing data between tasks:
- **XCom Push**: Save data to XCom
- **XCom Pull**: Get data from XCom
- **Automatic**: Task return values are automatically pushed to XCom
- **Manual**: Use `xcom_push()` and `xcom_pull()` methods
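
The push and pull operations above can be pictured with a toy in-memory store. This is only a sketch for intuition, not Airflow's implementation: `ToyXComStore` is a hypothetical class, and real XCom entries are also scoped by DAG, run, and map index. The one detail taken from Airflow is the `return_value` key, which is where a task's return value is stored.

```python
from typing import Any

RETURN_KEY = "return_value"  # key Airflow uses for automatically pushed return values


class ToyXComStore:
    """Hypothetical in-memory stand-in for XCom storage (illustration only)."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, str], Any] = {}

    def push(self, task_id: str, value: Any, key: str = RETURN_KEY) -> None:
        # One entry per (task_id, key) pair; a second push overwrites the first.
        self._rows[(task_id, key)] = value

    def pull(self, task_id: str, key: str = RETURN_KEY, default: Any = None) -> Any:
        # A missing entry yields the default instead of raising.
        return self._rows.get((task_id, key), default)


store = ToyXComStore()
store.push("extract", {"total": 3})                       # "automatic" push of a return value
store.push("extract", "/tmp/out.csv", key="output_path")  # "manual" push under a custom key

auto_value = store.pull("extract")
manual_value = store.pull("extract", key="output_path")
missing_value = store.pull("transform")
print(auto_value, manual_value, missing_value)
```

The automatic and manual styles differ only in which key the value is stored under.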


## 1. Import Libraries and Setup


In [None]:
# Import the Airflow modules used in this lab
from airflow.sdk import dag, task  # the cells below use the @dag and @task decorators
from airflow.providers.standard.operators.python import PythonOperator
from airflow.providers.standard.operators.bash import BashOperator

import pendulum
from datetime import datetime
import json

print("✅ Airflow XCom modules imported successfully!")


## 2. XCom with Task SDK (@task decorator) - Automatic Return Values

With the Task SDK, a task's return value is automatically pushed to XCom, and passing that value to another task's parameter pulls it automatically. This is the simplest way to share data.


In [None]:
# DAG with Task SDK - Automatic XCom
@dag(
    dag_id="xcom_task_sdk_example",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["xcom", "task-sdk"],
)
def xcom_task_sdk_dag():
    """
    ### XCom with Task SDK
    Task SDK automatically pushes return values to XCom.
    """
    
    @task
    def extract_data():
        """Extract data and return - automatically pushed to XCom"""
        data = {
            "users": [
                {"id": 1, "name": "Alice", "age": 30},
                {"id": 2, "name": "Bob", "age": 25},
                {"id": 3, "name": "Charlie", "age": 35},
            ],
            "total": 3,
            "timestamp": datetime.now().isoformat(),
        }
        print(f"Extracted {data['total']} users")
        return data  # Automatically pushed to XCom
    
    @task
    def transform_data(data: dict):
        """Transform data - automatically received from XCom"""
        users = data["users"]
        
        # Calculate statistics
        total_age = sum(user["age"] for user in users)
        avg_age = total_age / len(users)
        
        transformed = {
            "total_users": len(users),
            "average_age": avg_age,
            "max_age": max(user["age"] for user in users),
            "min_age": min(user["age"] for user in users),
        }
        print(f"Transformed data: {transformed}")
        return transformed  # Automatically pushed to XCom
    
    @task
    def load_data(stats: dict):
        """Load data - automatically received from XCom"""
        print("Loading statistics:")
        print(f"  Total users: {stats['total_users']}")
        print(f"  Average age: {stats['average_age']:.2f}")
        print(f"  Age range: {stats['min_age']} - {stats['max_age']}")
        return f"Loaded {stats['total_users']} users successfully"
    
    # Define workflow - data automatically passes through XCom
    extracted = extract_data()
    transformed = transform_data(extracted)  # extracted automatically pulled from XCom
    load_data(transformed)  # transformed automatically pulled from XCom

# Create DAG
xcom_task_sdk_dag_instance = xcom_task_sdk_dag()

print("✅ XCom Task SDK DAG created!")
print(f"Tasks: {[t.task_id for t in xcom_task_sdk_dag_instance.tasks]}")
print("\n💡 With the Task SDK:")
print("  - Return values are automatically pushed to XCom")
print("  - Function parameters are automatically pulled from XCom")
print("  - No manual xcom_push/xcom_pull is needed")


## 3. XCom with PythonOperator - Manual Push/Pull

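With PythonOperator, the callable's return value is still pushed to XCom automatically, but to store values under custom keys or to read values from upstream tasks you call `xcom_push()` and `xcom_pull()` on the task instance (`context['ti']`).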
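
Because these callables only touch `context['ti']`, their push/pull logic can be exercised outside Airflow with a small stub. The sketch below assumes a hypothetical `StubTaskInstance` that mimics just the two methods used in this lab. It is not an Airflow class, and the real `xcom_pull` accepts more arguments.

```python
class StubTaskInstance:
    """Hypothetical stand-in exposing only xcom_push / xcom_pull."""

    def __init__(self):
        self.pushed = {}
        self.current_task_id = None  # set by the test before each call

    def xcom_push(self, key, value):
        self.pushed[(self.current_task_id, key)] = value

    def xcom_pull(self, task_ids, key="return_value"):
        return self.pushed.get((task_ids, key))


def extract_data(**context):
    """Push the extracted payload under a custom key."""
    data = {"records": [1, 2, 3, 4, 5], "sum": 15}
    context["ti"].xcom_push(key="extracted_data", value=data)
    return "Extraction completed"


def transform_data(**context):
    """Pull the payload pushed by extract_data and derive statistics."""
    data = context["ti"].xcom_pull(key="extracted_data", task_ids="extract_data")
    if not data:
        raise ValueError("No data found in XCom")
    count = len(data["records"])
    return {"total_records": count, "average": data["sum"] / count}


ti = StubTaskInstance()
ti.current_task_id = "extract_data"
extract_data(ti=ti)
ti.current_task_id = "transform_data"
result = transform_data(ti=ti)
print(result)
```

The same stub also makes the failure path easy to check: calling `transform_data` with a fresh stub raises `ValueError`, because nothing was pushed.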


In [None]:
# DAG with PythonOperator - Manual XCom
@dag(
    dag_id="xcom_python_operator_example",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["xcom", "python-operator"],
)
def xcom_python_operator_dag():
    """
    ### XCom with PythonOperator
    Manually push and pull XCom values with PythonOperator.
    """
    
    def extract_data(**context):
        """Extract data and push to XCom manually"""
        data = {
            "records": [1, 2, 3, 4, 5],
            "sum": 15,
        }
        
        # Manual XCom push
        context['ti'].xcom_push(key='extracted_data', value=data)
        print(f"Pushed data to XCom: {data}")
        return "Extraction completed"
    
    def transform_data(**context):
        """Transform data - pull from XCom manually"""
        # Manual XCom pull
        data = context['ti'].xcom_pull(key='extracted_data', task_ids='extract_data')
        
        if data:
            # Transform data
            transformed = {
                "total_records": len(data['records']),
                "sum": data['sum'],
                "average": data['sum'] / len(data['records']),
            }
            
            # Push transformed data
            context['ti'].xcom_push(key='transformed_data', value=transformed)
            print(f"Transformed data: {transformed}")
            return "Transformation completed"
        else:
            raise ValueError("No data found in XCom")
    
    def load_data(**context):
        """Load data - pull from XCom manually"""
        # Pull from another task
        transformed = context['ti'].xcom_pull(key='transformed_data', task_ids='transform_data')
        
        if transformed:
            print("Loading data:")
            print(f"  Total records: {transformed['total_records']}")
            print(f"  Sum: {transformed['sum']}")
            print(f"  Average: {transformed['average']:.2f}")
            return "Load completed"
        else:
            raise ValueError("No transformed data found")
    
    # Tasks with PythonOperator
    extract_task = PythonOperator(
        task_id="extract_data",
        python_callable=extract_data,
    )
    
    transform_task = PythonOperator(
        task_id="transform_data",
        python_callable=transform_data,
    )
    
    load_task = PythonOperator(
        task_id="load_data",
        python_callable=load_data,
    )
    
    # Define dependencies
    extract_task >> transform_task >> load_task

# Create DAG
xcom_python_operator_dag_instance = xcom_python_operator_dag()

print("✅ XCom PythonOperator DAG created!")
print(f"Tasks: {[t.task_id for t in xcom_python_operator_dag_instance.tasks]}")
print("\n💡 With PythonOperator:")
print("  - Use context['ti'].xcom_push() to push")
print("  - Use context['ti'].xcom_pull() to pull")
print("  - You need to specify key and task_ids")


## 4. XCom with Multiple Return Values

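A task can return a dictionary holding several values. With `@task(multiple_outputs=True)`, each top-level key is pushed to XCom as its own entry, so downstream tasks can consume a single key instead of the whole dictionary.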
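
As a rough mental model of which XCom entries end up stored, consider the sketch below. `xcom_entries` is a hypothetical helper written for this lab, not an Airflow function. It assumes the whole return value is stored under `return_value` and that `multiple_outputs=True` additionally stores one entry per top-level key.

```python
def xcom_entries(return_value, multiple_outputs=False):
    """Model the XCom entries produced by a task's return value (simplified)."""
    entries = {"return_value": return_value}
    if multiple_outputs:
        if not isinstance(return_value, dict):
            raise TypeError("multiple_outputs=True expects the task to return a dict")
        for key, value in return_value.items():
            entries[key] = value  # one extra entry per top-level key
    return entries


result = {
    "source_a": {"records": [1, 2, 3], "source": "A"},
    "source_b": {"records": [4, 5, 6], "source": "B"},
    "total_records": 6,
}

single = xcom_entries(result)
split = xcom_entries(result, multiple_outputs=True)
print(sorted(single))
print(sorted(split))
```

This is why `extracted['source_a']` in the DAG can resolve to just that part of the result.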


In [None]:
# DAG with Multiple Return Values
@dag(
    dag_id="xcom_multiple_values_example",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["xcom", "multiple-values"],
)
def xcom_multiple_values_dag():
    """
    ### XCom with Multiple Return Values
    Tasks can return dictionaries with multiple values.
    """
    
    @task(multiple_outputs=True)  # Enable multiple outputs
    def extract_multiple_sources():
        """Extract from multiple sources and return dictionary"""
        source_a = {"records": [1, 2, 3], "source": "A"}
        source_b = {"records": [4, 5, 6], "source": "B"}
        
        return {
            "source_a": source_a,
            "source_b": source_b,
            "total_records": 6,
        }
    
    @task
    def process_source_a(source_a: dict):
        """Process source A"""
        print(f"Processing {source_a['source']}: {source_a['records']}")
        return sum(source_a['records'])
    
    @task
    def process_source_b(source_b: dict):
        """Process source B"""
        print(f"Processing {source_b['source']}: {source_b['records']}")
        return sum(source_b['records'])
    
    @task
    def aggregate_results(result_a: int, result_b: int):
        """Aggregate results from both sources"""
        total = result_a + result_b
        print(f"Aggregated result: {total}")
        return total
    
    # Extract data with multiple outputs
    extracted = extract_multiple_sources()
    
    # Access individual values from dictionary
    process_a = process_source_a(extracted['source_a'])
    process_b = process_source_b(extracted['source_b'])
    
    # Aggregate
    aggregate_results(process_a, process_b)

# Create DAG
xcom_multiple_values_dag_instance = xcom_multiple_values_dag()

print("✅ XCom Multiple Values DAG created!")
print(f"Tasks: {[t.task_id for t in xcom_multiple_values_dag_instance.tasks]}")
print("\n💡 Multiple outputs:")
print("  - Use @task(multiple_outputs=True)")
print("  - Return a dictionary with multiple keys")
print("  - Access values by key: extracted['source_a']")


## 5. XCom with Lists and Complex Data Structures

XCom can store lists, dictionaries, and nested structures, provided they can be serialized. Values are kept in the metadata database by default, so they should stay small.
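
One way to sanity-check a value before returning it from a task is to measure its JSON-encoded size. The sketch below uses plain JSON as a conservative proxy: Airflow's own serializer may accept more types. `SOFT_LIMIT_BYTES` is a hypothetical threshold based on this lab's 48KB rule of thumb, not an Airflow setting.

```python
import json
from datetime import datetime

SOFT_LIMIT_BYTES = 48 * 1024  # hypothetical threshold, not an Airflow configuration value


def json_size_bytes(value) -> int:
    """Return the UTF-8 size of the value when encoded as JSON."""
    return len(json.dumps(value).encode("utf-8"))


def check_xcom_candidate(value) -> str:
    """Classify a value as 'ok', 'too large', or 'not JSON-serializable'."""
    try:
        size = json_size_bytes(value)
    except TypeError:
        return "not JSON-serializable"
    return "ok" if size <= SOFT_LIMIT_BYTES else "too large"


metadata = {"file_path": "/tmp/data/large_dataset.parquet", "record_count": 1_000_000}
big_payload = {"rows": list(range(100_000))}
raw_timestamp = {"timestamp": datetime(2024, 1, 1)}

metadata_status = check_xcom_candidate(metadata)
big_status = check_xcom_candidate(big_payload)
timestamp_status = check_xcom_candidate(raw_timestamp)
print(metadata_status, big_status, timestamp_status)
```

The `raw_timestamp` case shows why the cells in this lab call `.isoformat()` on timestamps: a string is safe under any serializer.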


In [None]:
# DAG with Complex Data Structures
@dag(
    dag_id="xcom_complex_data_example",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["xcom", "complex-data"],
)
def xcom_complex_data_dag():
    """
    ### XCom with Complex Data Structures
    XCom can store lists, dictionaries, and nested structures.
    """
    
    @task(multiple_outputs=True)  # also push each top-level key so complex_data['metadata'] resolves
    def generate_complex_data():
        """Generate complex nested data structure"""
        data = {
            "metadata": {
                "timestamp": datetime.now().isoformat(),
                "version": "1.0",
            },
            "users": [
                {
                    "id": 1,
                    "name": "Alice",
                    "scores": [95, 87, 92],
                    "metadata": {"department": "Engineering"}
                },
                {
                    "id": 2,
                    "name": "Bob",
                    "scores": [78, 85, 90],
                    "metadata": {"department": "Sales"}
                },
            ],
            "statistics": {
                "total_users": 2,
                "average_score": 88.5,
            }
        }
        print(f"Generated complex data with {data['statistics']['total_users']} users")
        return data
    
    @task
    def process_users(complex_data: dict):
        """Process users from complex data"""
        users = complex_data['users']
        
        results = []
        for user in users:
            avg_score = sum(user['scores']) / len(user['scores'])
            results.append({
                "user_id": user['id'],
                "name": user['name'],
                "average_score": avg_score,
                "department": user['metadata']['department']
            })
        
        print(f"Processed {len(results)} users")
        return results
    
    @task
    def generate_report(user_results: list, metadata: dict):
        """Generate report from processed data"""
        print("=" * 60)
        print("User Performance Report")
        print("=" * 60)
        print(f"Generated at: {metadata['timestamp']}")
        print(f"Version: {metadata['version']}")
        print("\nUser Details:")
        for result in user_results:
            print(f"  {result['name']} ({result['department']}): {result['average_score']:.2f}")
        print("=" * 60)
        return "Report generated"
    
    # Extract complex data
    complex_data = generate_complex_data()
    
    # Process users
    user_results = process_users(complex_data)
    
    # Generate report with multiple inputs
    generate_report(user_results, complex_data['metadata'])

# Create DAG
xcom_complex_data_dag_instance = xcom_complex_data_dag()

print("✅ XCom Complex Data DAG created!")
print(f"Tasks: {[t.task_id for t in xcom_complex_data_dag_instance.tasks]}")
print("\n💡 Complex data structures:")
print("  - XCom supports nested dictionaries and lists")
print("  - Nested values can be accessed: data['metadata']['timestamp']")
print("  - Note: keep XCom values small (rule of thumb: under 48KB)")


## 6. XCom Pull from Multiple Tasks

A task can pull XCom values from multiple different upstream tasks.


In [None]:
# DAG with XCom from Multiple Tasks
@dag(
    dag_id="xcom_multiple_tasks_example",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["xcom", "multiple-tasks"],
)
def xcom_multiple_tasks_dag():
    """
    ### XCom from Multiple Tasks
    Pull XCom values from multiple upstream tasks.
    """
    
    @task
    def extract_source_a():
        """Extract from source A"""
        data = {"source": "A", "records": [1, 2, 3], "sum": 6}
        print(f"Extracted from source A: {data}")
        return data
    
    @task
    def extract_source_b():
        """Extract from source B"""
        data = {"source": "B", "records": [4, 5, 6], "sum": 15}
        print(f"Extracted from source B: {data}")
        return data
    
    @task
    def extract_source_c():
        """Extract from source C"""
        data = {"source": "C", "records": [7, 8, 9], "sum": 24}
        print(f"Extracted from source C: {data}")
        return data
    
    @task
    def merge_data(source_a: dict, source_b: dict, source_c: dict):
        """Merge data from all 3 sources"""
        all_records = (
            source_a['records'] + 
            source_b['records'] + 
            source_c['records']
        )
        total_sum = source_a['sum'] + source_b['sum'] + source_c['sum']
        
        merged = {
            "all_records": all_records,
            "total_sum": total_sum,
            "total_records": len(all_records),
            "sources": [source_a['source'], source_b['source'], source_c['source']]
        }
        
        print(f"Merged data: {merged}")
        return merged
    
    @task
    def finalize(merged_data: dict):
        """Finalize with merged data"""
        print(f"Finalizing with {merged_data['total_records']} records")
        print(f"Total sum: {merged_data['total_sum']}")
        print(f"Sources: {', '.join(merged_data['sources'])}")
        return "Finalized"
    
    # Extract from multiple sources (parallel)
    source_a_data = extract_source_a()
    source_b_data = extract_source_b()
    source_c_data = extract_source_c()
    
    # Merge data from all 3 sources
    merged = merge_data(source_a_data, source_b_data, source_c_data)
    
    # Finalize
    finalize(merged)

# Create DAG
xcom_multiple_tasks_dag_instance = xcom_multiple_tasks_dag()

print("✅ XCom Multiple Tasks DAG created!")
print(f"Tasks: {[t.task_id for t in xcom_multiple_tasks_dag_instance.tasks]}")
print("\n💡 Multiple upstream tasks:")
print("  - A task can receive inputs from several upstream tasks")
print("  - Function parameters map to the return values of upstream tasks")
print("  - All upstream tasks must complete before this task runs")


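## 7. XCom Best Practices and Limitations

XCom is meant for small values such as metadata, file paths, and summaries. Large datasets belong in external storage, with only a reference passed through XCom.


In [None]: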
# DAG demonstrating XCom Best Practices
@dag(
    dag_id="xcom_best_practices_example",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["xcom", "best-practices"],
)
def xcom_best_practices_dag():
    """
    ### XCom Best Practices Example
    Demonstrates best practices when using XCom.
    """
    
    @task
    def extract_metadata_only():
        """
        ✅ Best Practice: Only pass metadata, don't pass large data
        Instead of passing entire dataset, only pass file path or reference
        """
        # Simulate: Instead of passing large dataset, only pass file path
        file_path = "/tmp/data/large_dataset.parquet"
        metadata = {
            "file_path": file_path,
            "record_count": 1000000,
            "file_size_mb": 250,
            "schema": ["id", "name", "value"],
        }
        print(f"Extracted metadata: {metadata}")
        return metadata  # Only pass metadata, not data
    
    @task
    def process_file(metadata: dict):
        """
        ✅ Best Practice: Process file from path, not from XCom
        """
        file_path = metadata['file_path']
        print(f"Processing file: {file_path}")
        print(f"Records: {metadata['record_count']}")
        # In practice, read file from path and process
        return {"status": "processed", "records_processed": metadata['record_count']}
    
    @task
    def store_summary(summary: dict):
        """
        ✅ Best Practice: Only store summary/aggregated data
        """
        print(f"Storing summary: {summary}")
        return "Summary stored"
    
    # Workflow with best practices
    metadata = extract_metadata_only()
    summary = process_file(metadata)
    store_summary(summary)

# Create DAG
xcom_best_practices_dag_instance = xcom_best_practices_dag()

print("✅ XCom Best Practices DAG created!")
print("\n📋 XCom Best Practices:")
print("=" * 60)
print("✅ DO:")
print("  - Pass only small data (< 48KB)")
print("  - Pass metadata/references instead of large datasets")
print("  - Pass file paths instead of file contents")
print("  - Pass aggregated/summary data")
print("  - Use the Task SDK for automatic XCom")
print("\n❌ DON'T:")
print("  - Pass large datasets via XCom")
print("  - Pass binary data via XCom")
print("  - Pass sensitive data (use Variables/Connections)")
print("  - Rely on XCom for data storage")
print("=" * 60)
print("\n💡 Alternatives for Large Data:")
print("  - File storage (S3, GCS, local files)")
print("  - Databases")
print("  - External storage systems")
print("  - Pass only references/IDs via XCom")
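
The "pass a file path, not the file contents" practice can be sketched in plain Python. The functions below are hypothetical stand-ins for task bodies. They use a temporary local directory, which assumes both tasks see the same filesystem. With distributed workers the path would need to point at shared or object storage instead.

```python
import json
import os
import tempfile


def extract_to_file(rows, directory):
    """Write the full dataset to disk and return only small metadata."""
    path = os.path.join(directory, "dataset.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh)
    return {"file_path": path, "record_count": len(rows)}


def process_file(metadata):
    """Load the dataset from the referenced path and return a summary."""
    with open(metadata["file_path"], encoding="utf-8") as fh:
        rows = json.load(fh)
    return {
        "status": "processed",
        "records_processed": len(rows),
        "total_value": sum(row["value"] for row in rows),
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    rows = [{"id": i, "value": i * 2} for i in range(1000)]
    metadata = extract_to_file(rows, tmp_dir)  # this dict is what would travel through XCom
    summary = process_file(metadata)           # the heavy data is read from disk instead

print(summary)
```

Only `metadata` and `summary` would cross task boundaries. The 1000-row dataset stays on disk.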


## 8. XCom with Task Mapping and Dynamic Tasks

XCom works well with dynamic task mapping - each mapped task instance has its own XCom.
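
Conceptually, mapping fans out one task instance per input element, and a downstream task receives the collected results as a list. The plain-Python sketch below models that flow. The `map_index` dictionary is an illustration of per-instance results, not Airflow's storage format.

```python
def process_file(file_info):
    """Body of the mapped task: handle exactly one file."""
    return {
        "file_path": file_info["path"],
        "records_processed": file_info["size"] // 100,
    }


files = [
    {"path": "/data/file1.csv", "size": 1000},
    {"path": "/data/file2.csv", "size": 2000},
    {"path": "/data/file3.csv", "size": 1500},
]

# One result per mapped instance, keyed by its position in the input list.
mapped_results = {map_index: process_file(f) for map_index, f in enumerate(files)}

# The aggregating task sees the results as a list in map-index order.
collected = [mapped_results[i] for i in sorted(mapped_results)]
total_records = sum(r["records_processed"] for r in collected)
print(len(collected), total_records)
```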


In [None]:
# DAG with XCom and Task Mapping
@dag(
    dag_id="xcom_task_mapping_example",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    tags=["xcom", "task-mapping"],
)
def xcom_task_mapping_dag():
    """
    ### XCom with Task Mapping
    XCom works with dynamic task mapping.
    """
    
    @task
    def get_files_to_process():
        """Get list of files to process"""
        files = [
            {"path": "/data/file1.csv", "size": 1000},
            {"path": "/data/file2.csv", "size": 2000},
            {"path": "/data/file3.csv", "size": 1500},
        ]
        print(f"Found {len(files)} files to process")
        return files
    
    @task
    def process_file(file_info: dict):
        """Process one file - will be mapped for each file"""
        file_path = file_info['path']
        file_size = file_info['size']
        
        # Simulate processing
        records_processed = file_size // 100
        
        result = {
            "file_path": file_path,
            "records_processed": records_processed,
            "status": "success"
        }
        
        print(f"Processed {file_path}: {records_processed} records")
        return result
    
    @task
    def aggregate_results(results: list):
        """Aggregate results from all mapped tasks"""
        total_records = sum(r['records_processed'] for r in results)
        total_files = len(results)
        
        summary = {
            "total_files": total_files,
            "total_records": total_records,
            "average_records_per_file": total_records / total_files if total_files > 0 else 0
        }
        
        print(f"Aggregated summary: {summary}")
        return summary
    
    # Get files
    files = get_files_to_process()
    
    # Process files with dynamic mapping
    # Each mapped task instance will have its own XCom
    processed_files = process_file.expand(file_info=files)
    
    # Aggregate - receives list of all results
    aggregate_results(processed_files)

# Create DAG
xcom_task_mapping_dag_instance = xcom_task_mapping_dag()

print("✅ XCom Task Mapping DAG created!")
print(f"Tasks: {[t.task_id for t in xcom_task_mapping_dag_instance.tasks]}")
print("\n💡 XCom with Task Mapping:")
print("  - Each mapped task instance has its own XCom")
print("  - The aggregate task receives a list of all results")
print("  - Each XCom entry is stored together with its map index")


## 9. Summary and Next Steps

### ✅ What we learned:
1. XCom basics - Cross-communication between tasks
2. Task SDK automatic XCom - Automatic return values
3. PythonOperator manual XCom - Push/pull operations
4. Multiple return values with dictionaries
5. Complex data structures in XCom
6. Pull from multiple upstream tasks
7. XCom limitations and best practices
8. XCom with task mapping

### 📚 Next Lab:
- **Lab 6**: Scheduling and Timetables
- Cron expressions
- Timedelta schedules
- Custom timetables
- Catchup and data intervals

### 🔗 Useful Links:
- [XCom Documentation](https://airflow.apache.org/docs/apache-airflow/3.1.1/core-concepts/xcoms.html)
- [Task SDK XCom](https://airflow.apache.org/docs/apache-airflow/3.1.1/task-sdk/index.html)
- [XCom Best Practices](https://airflow.apache.org/docs/apache-airflow/3.1.1/best-practices.html#xcom)

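### 💡 Key Takeaways:

**XCom Size Limits:**
- XCom values are stored in the metadata database by default, so the practical limit depends on the database backend
- Treat roughly 48KB per value as a conservative rule of thumb
- For larger payloads, pass a reference or configure a custom XCom backend (for example, object storage)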

**Best Practices:**
- ✅ Pass metadata/references
- ✅ Pass file paths instead of contents
- ✅ Use Task SDK for automatic XCom
- ❌ Don't pass large datasets
- ❌ Don't use XCom as a database

**Alternatives:**
- File storage (S3, GCS, local)
- Databases
- External APIs
- Airflow Variables (for config)

### 💡 Exercises:
1. Create a DAG that passes metadata between tasks via XCom
2. Implement a data pipeline that passes file paths via XCom
3. Use multiple return values
4. Combine XCom with task mapping
5. Implement error handling with XCom
