# Unit 2 Designing an ML Pipeline with Apache Airflow

Welcome to the second lesson in our "Automating Retraining with Apache Airflow" course!

In the previous lesson, we introduced Apache Airflow and built a simple "hello world" DAG. Now that we understand the basics, we're ready to take a significant step forward and explore how to structure more complex workflows specifically designed for machine learning pipelines.

Machine learning workflows typically involve several distinct stages, from data extraction to model deployment. In this lesson, we'll build a complete ML pipeline using Airflow's TaskFlow API, learning how to orchestrate the various steps needed to train and deploy a model. We'll see how Airflow can help us manage dependencies between tasks, pass data between steps, and implement conditional logic based on model performance.

By the end of this lesson, you'll understand how to design a functional DAG that represents an end-to-end ML training pipeline, giving you a solid foundation for implementing actual ML code in future lessons.

## Understanding ML Workflows in Airflow

Before diving into code, let's consider what an ML pipeline typically involves and how we can represent it in Airflow. A standard machine learning workflow often includes these key stages:

  * **Data extraction:** Collecting data from various sources.
  * **Data transformation:** Cleaning, preprocessing, and feature engineering.
  * **Model training:** Using the prepared data to train a machine learning model.
  * **Model validation:** Evaluating the model's performance against a validation set.
  * **Model deployment:** Deploying the model to production (if it meets quality thresholds).

Each of these stages can be represented as a task in our Airflow DAG, with clear dependencies between them. For example, we can't train a model until the data has been extracted and transformed, and we can't deploy it until it has been trained and evaluated. Airflow is particularly well-suited for ML workflows because it:

  * Handles task scheduling and retries automatically;
  * Provides visibility into the execution of each step;
  * Allows us to pass data between tasks;
  * Enables conditional execution based on the results of previous tasks;
  * Maintains a record of runs for reproducibility and auditing.
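
The stage ordering described above can be sketched without Airflow at all. The following is a minimal illustration, assuming Python 3.9+ for the standard-library `graphlib` module. The `stage_dependencies` mapping and its stage names are hypothetical stand-ins for the structure Airflow would manage; this is not Airflow code.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
# Hypothetical names mirroring the stage list above.
stage_dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "train": {"transform"},
    "validate": {"train"},
    "deploy": {"validate"},
}

# A topological sort yields an order in which every stage runs
# only after everything it depends on.
execution_order = list(TopologicalSorter(stage_dependencies).static_order())
print(execution_order)
```

Because the dependencies form a single chain, only one valid order exists. A real orchestrator adds scheduling, retries, and run history on top of this ordering.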

## Designing an ML Pipeline DAG

Let's start designing our ML pipeline by setting up the basic DAG structure:

```python
from datetime import datetime, timedelta
from airflow.decorators import dag, task

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),  # Longer retry delay for ML tasks
}
```

This code should look familiar from our first lesson. We import the necessary modules and set up default arguments. Notice that we've increased the `retry_delay` to 5 minutes, which is more appropriate for ML tasks that might take longer to execute than simple examples.
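
To make the retry settings concrete, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes a fixed delay between attempts and ignores the time each attempt spends running.

```python
from datetime import timedelta

# Same retry settings as in default_args above
default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

# One initial attempt plus the configured number of retries
max_attempts = 1 + default_args["retries"]

# Total time spent waiting between attempts in the worst case
max_wait = default_args["retries"] * default_args["retry_delay"]

print(f"Up to {max_attempts} attempts, waiting at most {max_wait} in total")
```

With these values, a failing task is attempted at most twice, with a single five-minute pause in between.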

Now, let's define our DAG using the `@dag` decorator:

```python
@dag(
    dag_id='mlops_pipeline',  # Unique identifier for the DAG
    description='Introduction to Airflow for ML pipelines',
    default_args=default_args,
    schedule='@daily',  # Run daily
    start_date=datetime(2023, 1, 1),
    catchup=False,  # Don't run for past dates
    tags=['intro', 'ml'],
)
def training_pipeline():
    """
    This DAG introduces Airflow concepts for ML workflows.
    It demonstrates task definition, dependencies, and flow control
    before we implement our actual ML pipeline in the later units.
    """
```

We've named our DAG `mlops_pipeline` and set it to run daily. While scheduling a DAG to run at regular intervals (like daily) is common, Airflow also supports alternative triggering mechanisms: for example, you can use triggers to launch a DAG in response to external events, or sensors to wait for specific conditions (such as the arrival of a file or completion of another process) before starting the workflow. This flexibility allows you to tailor your ML pipeline's execution to your specific operational requirements.

The `catchup=False` parameter ensures that Airflow won't execute the DAG for past dates, which is important for ML pipelines where we typically only want to train with the most recent data. Finally, the tags help categorize our DAG in the Airflow UI, making it easier to find among many workflows.
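
To get a feel for what `catchup=False` avoids, the rough count below compares the two settings. This is a simplified illustration, not Airflow's scheduler logic, and the `now` value is a hypothetical moment at which the DAG is first enabled.

```python
from datetime import datetime, timedelta

start_date = datetime(2023, 1, 1)
now = datetime(2023, 1, 11)  # hypothetical: when the DAG is first enabled
interval = timedelta(days=1)  # corresponds to the '@daily' schedule

# Number of complete daily intervals between start_date and now
completed_intervals = (now - start_date) // interval

# With catchup enabled, each completed interval would get its own run;
# with catchup disabled, only the most recent interval is scheduled.
runs_with_catchup = completed_intervals
runs_without_catchup = min(1, completed_intervals)

print(runs_with_catchup, runs_without_catchup)
```

For a training pipeline, those extra backfill runs would each retrain a model that is immediately superseded, which is why disabling catchup is a sensible default here.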

## Implementing Data Extraction and Transformation

Now let's implement the first two tasks of our ML pipeline:

```python
    @task(task_id="extract_data")
    def extract_data():
        """Simulate extracting data from a source."""
        print("Extracting data from source...")
        # In a real scenario, this would connect to a data source
        return {"data_extracted": True, "records": 1000}
```

This task simulates extracting data from a source system. In a real ML pipeline, this might involve querying a database, accessing an API, or reading files from a data lake. The task returns a dictionary with metadata about the extraction process, including a flag indicating success and the number of records extracted. This information will be useful for downstream tasks.

Next, let's implement the transformation task:

```python
    @task(task_id="transform_data")
    def transform_data(extract_result):
        """Simulate transforming the extracted data."""
        if extract_result["data_extracted"]:
            num_records = extract_result["records"]
            print(f"Transforming {num_records} records...")
            # Simulate data transformation
            return {"data_transformed": True, "features": 10}
        else:
            raise ValueError("Data extraction failed")
```

The `transform_data` task takes the result from the extraction task as an input parameter, demonstrating how the TaskFlow API makes it easy to pass data between tasks. The task checks if extraction was successful, performs its transformation (simulated in this case), and returns information about the features generated. Note how we're raising an exception if extraction failed — this ensures that our pipeline will fail appropriately if a critical step doesn't complete successfully.
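
Since a TaskFlow task body is ordinary Python, its logic can be exercised outside Airflow. The sketch below copies the body of `transform_data` into a plain function (without the `@task` decorator) and checks both the success path and the failure path. The input dictionaries are made-up test values.

```python
def transform_data(extract_result):
    """Same logic as the task above, without the @task decorator."""
    if extract_result["data_extracted"]:
        num_records = extract_result["records"]
        print(f"Transforming {num_records} records...")
        return {"data_transformed": True, "features": 10}
    raise ValueError("Data extraction failed")


# Success path: extraction reported success
ok = transform_data({"data_extracted": True, "records": 1000})

# Failure path: extraction reported failure, so the function raises
try:
    transform_data({"data_extracted": False, "records": 0})
    failed = False
except ValueError as exc:
    failed = True
    error_message = str(exc)

print(ok, failed)
```

Inside Airflow, the raised exception marks the task as failed, which in turn prevents the downstream training task from running with bad input.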

## Building Model Training and Validation Tasks

With our data prepared, let's implement the next stages of our ML pipeline:

```python
    @task(task_id="train_model")
    def train_model(transform_result):
        """Simulate training a machine learning model."""
        if transform_result["data_transformed"]:
            num_features = transform_result["features"]
            print(f"Training model with {num_features} features...")
            # Simulate model training
            return {"model_trained": True, "accuracy": 0.85}
        else:
            raise ValueError("Data transformation failed")
```

The `train_model` task simulates training a machine learning model using the transformed data. In a real pipeline, this is where you would implement your model training code using frameworks like `scikit-learn`, `TensorFlow`, or `PyTorch`. Our simulated task returns a dictionary with the training results, including an accuracy metric that will be used for model validation.

Now for the validation task:

```python
    @task(task_id="validate_model")
    def validate_model(train_result):
        """Simulate validating the model's performance."""
        if train_result["model_trained"]:
            accuracy = train_result["accuracy"]
            print(f"Validating model. Accuracy: {accuracy}")
            # Determine if model meets quality threshold
            if accuracy >= 0.8:
                return {"validation": "passed", "accuracy": accuracy}
            else:
                return {"validation": "failed", "accuracy": accuracy}
        else:
            raise ValueError("Model training failed")
```

The `validate_model` task evaluates the model's performance against a predefined threshold (0.8 in this case). Rather than failing the task if the model doesn't meet the threshold, it returns a status indicating whether validation passed or failed. This approach gives us flexibility in how we handle underperforming models — we might choose to deploy them anyway with monitoring, or we might prevent deployment entirely.

**Note:** in general, model validation could be more complex and involve different metrics than simple accuracy. We'll delve into this topic in more detail in a later lesson in the course.
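
As a hint of what a richer check might look like, the sketch below validates several metrics against per-metric minimums. The function name `validate_metrics`, the `recall` metric, and the threshold values are all assumptions made for illustration; they are not part of this lesson's pipeline.

```python
def validate_metrics(metrics, thresholds):
    """Pass only if every thresholded metric is present and meets its minimum."""
    failures = {
        name: metrics.get(name)
        for name, minimum in thresholds.items()
        if metrics.get(name) is None or metrics[name] < minimum
    }
    status = "passed" if not failures else "failed"
    return {"validation": status, "metrics": metrics, "failures": failures}


# Hypothetical thresholds for illustration only
thresholds = {"accuracy": 0.8, "recall": 0.7}

good = validate_metrics({"accuracy": 0.85, "recall": 0.75}, thresholds)
bad = validate_metrics({"accuracy": 0.85, "recall": 0.6}, thresholds)

print(good["validation"], bad["validation"], bad["failures"])
```

The result keeps the same `"validation"` key used by `validate_model`, so a downstream deployment step written against that key would not need to change.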

## Implementing Conditional Deployment Logic

The final task in our ML pipeline is model deployment, which should only happen if the model passes validation:

```python
    @task(task_id="deploy_model", trigger_rule="none_failed")
    def deploy_model(validation_result):
        """Simulate deploying the model if validation passed."""
        if validation_result["validation"] == "passed":
            print(f"Deploying model with accuracy: {validation_result['accuracy']}")
            return {"deployed": True}
        else:
            print(f"Model validation failed with accuracy: {validation_result['accuracy']}")
            return {"deployed": False}
```

This task introduces an important concept: the `trigger_rule` parameter. By default, a task runs only when all of its upstream tasks have succeeded. Setting `trigger_rule="none_failed"` relaxes this: the task runs as long as no upstream task has failed (i.e., raised an exception), even if some were skipped. Because `validate_model` returns normally whether or not the model meets the threshold, it counts as a successful task in both cases, so `deploy_model` always runs after it. The deployment logic inside the task then decides whether to actually deploy the model based on the validation result.

Inside the task, we check the validation result and either deploy the model or log the failure. In both cases, the task itself succeeds, but the return value indicates whether deployment actually occurred. This pattern is useful for conditional logic that doesn't necessarily represent a failure of the pipeline itself.
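
The difference between the default `all_success` rule and `none_failed` can be illustrated with a simplified model. The `should_run` helper below is hypothetical and covers only these two rules, evaluated once every upstream task has finished; Airflow's real implementation handles more rules and states.

```python
def should_run(trigger_rule, upstream_states):
    """Simplified model of two trigger rules (not Airflow's implementation)."""
    if trigger_rule == "all_success":
        # Every upstream task must have succeeded
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "none_failed":
        # Upstream tasks may have succeeded or been skipped, but not failed
        return all(state in ("success", "skipped") for state in upstream_states)
    raise ValueError(f"Unsupported trigger rule in this sketch: {trigger_rule}")


# validate_model returns normally, so its state is "success" under both rules
print(should_run("all_success", ["success"]))
print(should_run("none_failed", ["success"]))

# The rules differ only when an upstream task was skipped
print(should_run("all_success", ["skipped"]))
print(should_run("none_failed", ["skipped"]))

# Neither rule lets the task run after an upstream failure
print(should_run("none_failed", ["failed"]))
```

In this linear pipeline, the rule starts to matter once a skipped task appears upstream, for example if a branching step is added later.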

## Defining Task Dependencies

Now that we've defined all our tasks, we need to establish the relationships between them:

```python
    # Define the task dependencies using TaskFlow
    # This creates the DAG structure automatically based on function calls
    extract_result = extract_data()
    transform_result = transform_data(extract_result)
    train_result = train_model(transform_result)
    validation_result = validate_model(train_result)
    deploy_model(validation_result)

# Now actually enable this to be run as a DAG
training_pipeline()
```

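When we call `extract_data()` inside the DAG function, the TaskFlow API does not run the task immediately. Instead, the call returns a reference to the task's future output (an `XComArg`), which we pass to `transform_data()`, creating a dependency between the two tasks. This pattern continues through our entire workflow, creating a linear sequence of tasks.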

The final line `training_pipeline()` instantiates our DAG object, making it discoverable by the Airflow scheduler. This is a critical step — without it, our DAG won't be registered with Airflow.
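
To build intuition for how plain function calls can define a graph, here is a toy model of the idea. It is not how Airflow is implemented: the `toy_task` decorator, the `Ref` class, and the `edges` list are all hypothetical names invented for this sketch.

```python
class Ref:
    """Placeholder returned by a call, standing in for a future task result."""

    def __init__(self, task_name):
        self.task_name = task_name


edges = []  # (upstream, downstream) pairs recorded while the graph is defined


def toy_task(func):
    """Record a dependency for every Ref argument instead of running func."""

    def wrapper(*args):
        for arg in args:
            if isinstance(arg, Ref):
                edges.append((arg.task_name, func.__name__))
        return Ref(func.__name__)

    return wrapper


@toy_task
def extract_data(): ...


@toy_task
def transform_data(extract_result): ...


@toy_task
def train_model(transform_result): ...


# Calling the functions records the dependencies; no task body runs here
train_model(transform_data(extract_data()))
print(edges)
```

The key point is that the calls describe the workflow rather than execute it, which is why the order of the calls at the bottom of the DAG function determines the task graph.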

## Conclusion and Next Steps

In this lesson, you've built a complete ML pipeline using Apache Airflow's TaskFlow API. You've learned how to structure a DAG for machine learning workflows, create tasks that represent each stage of an ML pipeline, pass data between tasks, implement conditional logic, and use trigger rules to control task execution. These concepts form the foundation of production-ready ML pipelines that can reliably train and deploy models on a schedule.

In the upcoming practice exercises, you'll have the opportunity to apply these concepts by creating your own ML pipeline DAG. You'll experiment with different task configurations, implement conditional logic, and explore how Airflow can help make your ML workflows more robust and manageable. This hands-on experience will solidify your understanding of how Airflow can serve as a powerful orchestration tool for machine learning operations.

## Model Validation Logic in Airflow

You’ve just seen how each stage of an ML pipeline can be represented as a task in Airflow, and how data and results flow from one step to the next. Now it’s your turn to put this into practice by completing the validation logic for the model evaluation step.

In the `validate_model` task, you need to decide if the trained model is good enough to move forward. Your job is to:

  * Add a conditional check to see if the model’s accuracy is at least 0.8.
  * If the accuracy meets or exceeds this threshold, return a dictionary with `"validation": "passed"` and include the accuracy.
  * If the accuracy is below the threshold, return a dictionary with `"validation": "failed"` and include the accuracy as well.

This step is important because it controls whether the model will be deployed in the next stage. Give it a try and see how your logic shapes the pipeline’s outcome!

```python
"""
Introduction to Airflow for MLOps - Sample DAG

This module defines a basic Airflow DAG that demonstrates core concepts
for ML workflows using the TaskFlow API. This serves as an introduction 
to Airflow before integrating our actual ML components.
"""

from datetime import datetime, timedelta
from airflow.decorators import dag, task
import logging

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
@dag(
    dag_id='mlops_pipeline',  # Unique identifier for the DAG
    description='Introduction to Airflow for ML pipelines',
    default_args=default_args,
    schedule='@daily',  # Run daily (same parameter name as in the lesson above)
    start_date=datetime(2023, 1, 1),  # Start date (in the past)
    catchup=False,  # Don't run for past dates
    tags=['intro', 'ml'],
)
def training_pipeline():
    """
    This DAG introduces Airflow concepts for ML workflows.
    It demonstrates task definition, dependencies, and flow control
    before we implement our actual ML pipeline in the later units.
    """
    
    # Define tasks using the TaskFlow API
    @task(task_id="extract_data")
    def extract_data():
        """Simulate extracting data from a source."""
        print("Extracting data from source...")
        # In a real scenario, this would connect to a data source
        return {"data_extracted": True, "records": 1000}
    
    @task(task_id="transform_data")
    def transform_data(extract_result):
        """Simulate transforming the extracted data."""
        if extract_result["data_extracted"]:
            num_records = extract_result["records"]
            print(f"Transforming {num_records} records...")
            # Simulate data transformation
            return {"data_transformed": True, "features": 10}
        else:
            raise ValueError("Data extraction failed")
    
    @task(task_id="train_model")
    def train_model(transform_result):
        """Simulate training a machine learning model."""
        if transform_result["data_transformed"]:
            num_features = transform_result["features"]
            print(f"Training model with {num_features} features...")
            # Simulate model training
            return {"model_trained": True, "accuracy": 0.85}
        else:
            raise ValueError("Data transformation failed")
    
    @task(task_id="validate_model")
    def validate_model(train_result):
        """Simulate validating the model's performance."""
        if train_result["model_trained"]:
            accuracy = train_result["accuracy"]
            print(f"Validating model. Accuracy: {accuracy}")
            # TODO: Add a conditional check to determine if the model meets quality standards
            # Hint: Check if accuracy is greater than or equal to 0.8

            # TODO: If the model meets quality standards, return a dictionary with validation status "passed"
            # TODO: If the model doesn't meet quality standards, return a dictionary with validation status "failed"
            # Remember to include the accuracy in both cases
        else:
            raise ValueError("Model training failed")
    
    @task(task_id="deploy_model", trigger_rule="none_failed")
    def deploy_model(validation_result):
        """Simulate deploying the model if validation passed."""
        if validation_result["validation"] == "passed":
            print(f"Deploying model with accuracy: {validation_result['accuracy']}")
            return {"deployed": True}
        else:
            logging.warning(f"Model validation failed with accuracy: {validation_result['accuracy']}")
            return {"deployed": False}
    
    # Define the task dependencies using TaskFlow
    # This creates the DAG structure automatically based on function calls
    extract_result = extract_data()
    transform_result = transform_data(extract_result)
    train_result = train_model(transform_result)
    validation_result = validate_model(train_result)
    deploy_model(validation_result)

# Now actually enable this to be run as a DAG
training_pipeline()
```

One way to complete the `validate_model` task:

```python
"""
Introduction to Airflow for MLOps - Sample DAG

This module defines a basic Airflow DAG that demonstrates core concepts
for ML workflows using the TaskFlow API. This serves as an introduction 
to Airflow before integrating our actual ML components.
"""

from datetime import datetime, timedelta
from airflow.decorators import dag, task
import logging

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
@dag(
    dag_id='mlops_pipeline',  # Unique identifier for the DAG
    description='Introduction to Airflow for ML pipelines',
    default_args=default_args,
    schedule='@daily',  # Run daily (same parameter name as in the lesson above)
    start_date=datetime(2023, 1, 1),  # Start date (in the past)
    catchup=False,  # Don't run for past dates
    tags=['intro', 'ml'],
)
def training_pipeline():
    """
    This DAG introduces Airflow concepts for ML workflows.
    It demonstrates task definition, dependencies, and flow control
    before we implement our actual ML pipeline in the later units.
    """
    
    # Define tasks using the TaskFlow API
    @task(task_id="extract_data")
    def extract_data():
        """Simulate extracting data from a source."""
        print("Extracting data from source...")
        # In a real scenario, this would connect to a data source
        return {"data_extracted": True, "records": 1000}
    
    @task(task_id="transform_data")
    def transform_data(extract_result):
        """Simulate transforming the extracted data."""
        if extract_result["data_extracted"]:
            num_records = extract_result["records"]
            print(f"Transforming {num_records} records...")
            # Simulate data transformation
            return {"data_transformed": True, "features": 10}
        else:
            raise ValueError("Data extraction failed")
    
    @task(task_id="train_model")
    def train_model(transform_result):
        """Simulate training a machine learning model."""
        if transform_result["data_transformed"]:
            num_features = transform_result["features"]
            print(f"Training model with {num_features} features...")
            # Simulate model training
            return {"model_trained": True, "accuracy": 0.85}
        else:
            raise ValueError("Data transformation failed")
    
    @task(task_id="validate_model")
    def validate_model(train_result):
        """Simulate validating the model's performance."""
        if train_result["model_trained"]:
            accuracy = train_result["accuracy"]
            print(f"Validating model. Accuracy: {accuracy}")
            
            # Add a conditional check to determine if the model meets quality standards
            # If the model meets quality standards, return a dictionary with validation status "passed"
            if accuracy >= 0.8:
                return {"validation": "passed", "accuracy": accuracy}
            # If the model doesn't meet quality standards, return a dictionary with validation status "failed"
            else:
                return {"validation": "failed", "accuracy": accuracy}
        else:
            raise ValueError("Model training failed")
    
    @task(task_id="deploy_model", trigger_rule="none_failed")
    def deploy_model(validation_result):
        """Simulate deploying the model if validation passed."""
        if validation_result["validation"] == "passed":
            print(f"Deploying model with accuracy: {validation_result['accuracy']}")
            return {"deployed": True}
        else:
            logging.warning(f"Model validation failed with accuracy: {validation_result['accuracy']}")
            return {"deployed": False}
    
    # Define the task dependencies using TaskFlow
    # This creates the DAG structure automatically based on function calls
    extract_result = extract_data()
    transform_result = transform_data(extract_result)
    train_result = train_model(transform_result)
    validation_result = validate_model(train_result)
    deploy_model(validation_result)

# Now actually enable this to be run as a DAG
training_pipeline()
```

## Archiving Models in Your Pipeline

You’ve just seen how to validate and deploy a model in your Airflow ML pipeline. Now, let’s take it a step further by ensuring your deployed models are properly archived for future reference.

Your task is to add a new step to the pipeline that archives the model details after a successful deployment. This helps keep track of which models have been put into production.

Here’s what you need to do:

  * Create a new task called `archive_model` using the TaskFlow API.
  * This task should:
    * Take the output from the `deploy_model` task as its input.
    * Check if the deployment was successful.
    * If it was, print a message about archiving and return a dictionary with `"archived": True`.
    * If not, print a message about skipping archiving and return a dictionary with `"archived": False`.
  * Update the dependency chain so that `archive_model` runs after `deploy_model` and receives its result.

Adding this step will help you see how to extend your pipeline with new tasks and manage the flow of information between them.

```python
"""
Introduction to Airflow for MLOps - Sample DAG

This module defines a basic Airflow DAG that demonstrates core concepts
for ML workflows using the TaskFlow API. This serves as an introduction 
to Airflow before integrating our actual ML components.
"""

from datetime import datetime, timedelta
from airflow.decorators import dag, task
import logging

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
@dag(
    dag_id='mlops_pipeline',  # Unique identifier for the DAG
    description='Introduction to Airflow for ML pipelines',
    default_args=default_args,
    schedule='@daily',  # Run daily (same parameter name as in the lesson above)
    start_date=datetime(2023, 1, 1),  # Start date (in the past)
    catchup=False,  # Don't run for past dates
    tags=['intro', 'ml'],
)
def training_pipeline():
    """
    This DAG introduces Airflow concepts for ML workflows.
    It demonstrates task definition, dependencies, and flow control
    before we implement our actual ML pipeline in the later units.
    """
    
    # Define tasks using the TaskFlow API
    @task(task_id="extract_data")
    def extract_data():
        """Simulate extracting data from a source."""
        print("Extracting data from source...")
        # In a real scenario, this would connect to a data source
        return {"data_extracted": True, "records": 1000}
    
    @task(task_id="transform_data")
    def transform_data(extract_result):
        """Simulate transforming the extracted data."""
        if extract_result["data_extracted"]:
            num_records = extract_result["records"]
            print(f"Transforming {num_records} records...")
            # Simulate data transformation
            return {"data_transformed": True, "features": 10}
        else:
            raise ValueError("Data extraction failed")
    
    @task(task_id="train_model")
    def train_model(transform_result):
        """Simulate training a machine learning model."""
        if transform_result["data_transformed"]:
            num_features = transform_result["features"]
            print(f"Training model with {num_features} features...")
            # Simulate model training
            return {"model_trained": True, "accuracy": 0.85}
        else:
            raise ValueError("Data transformation failed")
    
    @task(task_id="validate_model")
    def validate_model(train_result):
        """Simulate validating the model's performance."""
        if train_result["model_trained"]:
            accuracy = train_result["accuracy"]
            print(f"Validating model. Accuracy: {accuracy}")
            # Determine if model meets quality threshold
            if accuracy >= 0.8:
                return {"validation": "passed", "accuracy": accuracy}
            else:
                return {"validation": "failed", "accuracy": accuracy}
        else:
            raise ValueError("Model training failed")
    
    @task(task_id="deploy_model", trigger_rule="none_failed")
    def deploy_model(validation_result):
        """Simulate deploying the model if validation passed."""
        if validation_result["validation"] == "passed":
            print(f"Deploying model with accuracy: {validation_result['accuracy']}")
            return {"deployed": True}
        else:
            logging.warning(f"Model validation failed with accuracy: {validation_result['accuracy']}")
            return {"deployed": False}

    # TODO: Add a new task called "archive_model" that simulates archiving model details after deployment.
    # The task should:
    #   - Accept the output from deploy_model as input
    #   - Check if deployment was successful 
    #   - If successful, print a message about archiving and return {"archived": True}
    #   - If not, print a message about skipping archiving and return {"archived": False}

    # Define the task dependencies using TaskFlow
    # This creates the DAG structure automatically based on function calls
    extract_result = extract_data()
    transform_result = transform_data(extract_result)
    train_result = train_model(transform_result)
    validation_result = validate_model(train_result)
    deploy_result = deploy_model(validation_result)
    # TODO: Update the dependency chain to include the new archive_model task after deploy_model.
    # The archive_model task should receive deploy_result as its input.

# Now actually enable this to be run as a DAG
training_pipeline()

```

One way to add the `archive_model` task and extend the dependency chain so that it runs after `deploy_model`:

```python
"""
Introduction to Airflow for MLOps - Sample DAG

This module defines a basic Airflow DAG that demonstrates core concepts
for ML workflows using the TaskFlow API. This serves as an introduction 
to Airflow before integrating our actual ML components.
"""

from datetime import datetime, timedelta
from airflow.decorators import dag, task
import logging

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
@dag(
    dag_id='mlops_pipeline',  # Unique identifier for the DAG
    description='Introduction to Airflow for ML pipelines',
    default_args=default_args,
    schedule='@daily',  # Run daily (same parameter name as in the lesson above)
    start_date=datetime(2023, 1, 1),  # Start date (in the past)
    catchup=False,  # Don't run for past dates
    tags=['intro', 'ml'],
)
def training_pipeline():
    """
    This DAG introduces Airflow concepts for ML workflows.
    It demonstrates task definition, dependencies, and flow control
    before we implement our actual ML pipeline in the later units.
    """
    
    # Define tasks using the TaskFlow API
    @task(task_id="extract_data")
    def extract_data():
        """Simulate extracting data from a source."""
        print("Extracting data from source...")
        # In a real scenario, this would connect to a data source
        return {"data_extracted": True, "records": 1000}
    
    @task(task_id="transform_data")
    def transform_data(extract_result):
        """Simulate transforming the extracted data."""
        if extract_result["data_extracted"]:
            num_records = extract_result["records"]
            print(f"Transforming {num_records} records...")
            # Simulate data transformation
            return {"data_transformed": True, "features": 10}
        else:
            raise ValueError("Data extraction failed")
    
    @task(task_id="train_model")
    def train_model(transform_result):
        """Simulate training a machine learning model."""
        if transform_result["data_transformed"]:
            num_features = transform_result["features"]
            print(f"Training model with {num_features} features...")
            # Simulate model training
            return {"model_trained": True, "accuracy": 0.85}
        else:
            raise ValueError("Data transformation failed")
    
    @task(task_id="validate_model")
    def validate_model(train_result):
        """Simulate validating the model's performance."""
        if train_result["model_trained"]:
            accuracy = train_result["accuracy"]
            print(f"Validating model. Accuracy: {accuracy}")
            # Determine if model meets quality threshold
            if accuracy >= 0.8:
                return {"validation": "passed", "accuracy": accuracy}
            else:
                return {"validation": "failed", "accuracy": accuracy}
        else:
            raise ValueError("Model training failed")
    
    @task(task_id="deploy_model", trigger_rule="none_failed")
    def deploy_model(validation_result):
        """Simulate deploying the model if validation passed."""
        if validation_result["validation"] == "passed":
            print(f"Deploying model with accuracy: {validation_result['accuracy']}")
            return {"deployed": True}
        else:
            logging.warning(f"Model validation failed with accuracy: {validation_result['accuracy']}")
            return {"deployed": False}

    # Add a new task called "archive_model" that simulates archiving model details after deployment.
    @task(task_id="archive_model")
    def archive_model(deploy_result):
        """Simulate archiving the model details."""
        if deploy_result.get("deployed"):
            print("Model deployment was successful. Archiving model details...")
            return {"archived": True}
        else:
            print("Model was not deployed. Skipping archiving.")
            return {"archived": False}

    # Define the task dependencies using TaskFlow
    # This creates the DAG structure automatically based on function calls
    extract_result = extract_data()
    transform_result = transform_data(extract_result)
    train_result = train_model(transform_result)
    validation_result = validate_model(train_result)
    deploy_result = deploy_model(validation_result)
    # archive_model runs after deploy_model and receives deploy_result as its input
    archive_model(deploy_result)

# Now actually enable this to be run as a DAG
training_pipeline()
```

## Adding Rollback to Your ML Pipeline

Now, let’s make your pipeline even more reliable by adding a way to respond if deployment fails.

Your task is to add a new step called `rollback_model` to the pipeline. This step should check whether the deployment was unsuccessful and, if so, perform a simulated rollback. If deployment was successful, it should simply note that no rollback is needed. Here’s what you need to do:

  * Create a new task named `rollback_model`.
  * This task should:
      * Take the output from the `deploy_model` task as its input.
      * Check if deployment failed (when `deploy_result["deployed"]` is `False`).
      * If deployment failed, print a message about rolling back and return `{"rollback_performed": True}`.
      * If deployment succeeded, print a message that no rollback is needed and return `{"rollback_performed": False}`.
  * Update the pipeline so that `rollback_model` runs after `deploy_model` and receives its result as input.

```python
"""
Introduction to Airflow for MLOps - Sample DAG

This module defines a basic Airflow DAG that demonstrates core concepts
for ML workflows using the TaskFlow API. This serves as an introduction 
to Airflow before integrating our actual ML components.
"""

from datetime import datetime, timedelta
from airflow.decorators import dag, task
import logging

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
@dag(
    dag_id='mlops_pipeline',  # Unique identifier for the DAG
    description='Introduction to Airflow for ML pipelines',
    default_args=default_args,
    schedule_interval='@daily',  # Run daily
    start_date=datetime(2023, 1, 1),  # Start date (in the past)
    catchup=False,  # Don't run for past dates
    tags=['intro', 'ml'],
)
def training_pipeline():
    """
    This DAG introduces Airflow concepts for ML workflows.
    It demonstrates task definition, dependencies, and flow control
    before we implement our actual ML pipeline in the later units.
    """
    
    # Define tasks using the TaskFlow API
    @task(task_id="extract_data")
    def extract_data():
        """Simulate extracting data from a source."""
        print("Extracting data from source...")
        # In a real scenario, this would connect to a data source
        return {"data_extracted": True, "records": 1000}
    
    @task(task_id="transform_data")
    def transform_data(extract_result):
        """Simulate transforming the extracted data."""
        if extract_result["data_extracted"]:
            num_records = extract_result["records"]
            print(f"Transforming {num_records} records...")
            # Simulate data transformation
            return {"data_transformed": True, "features": 10}
        else:
            raise ValueError("Data extraction failed")
    
    @task(task_id="train_model")
    def train_model(transform_result):
        """Simulate training a machine learning model."""
        if transform_result["data_transformed"]:
            num_features = transform_result["features"]
            print(f"Training model with {num_features} features...")
            # Simulate model training
            return {"model_trained": True, "accuracy": 0.85}
        else:
            raise ValueError("Data transformation failed")
    
    @task(task_id="validate_model")
    def validate_model(train_result):
        """Simulate validating the model's performance."""
        if train_result["model_trained"]:
            accuracy = train_result["accuracy"]
            print(f"Validating model. Accuracy: {accuracy}")
            # Determine if model meets quality threshold
            if accuracy >= 0.8:
                return {"validation": "passed", "accuracy": accuracy}
            else:
                return {"validation": "failed", "accuracy": accuracy}
        else:
            raise ValueError("Model training failed")
    
    @task(task_id="deploy_model", trigger_rule="none_failed")
    def deploy_model(validation_result):
        """Simulate deploying the model if validation passed."""
        if validation_result["validation"] == "passed":
            print(f"Deploying model with accuracy: {validation_result['accuracy']}")
            return {"deployed": True}
        else:
            logging.warning(f"Model validation failed with accuracy: {validation_result['accuracy']}")
            return {"deployed": False}

    # TODO: Add a new task called "rollback_model" using the @task decorator.
    # This task should:
    #   - Take the output from deploy_model as its input
    #   - Check if deployment was unsuccessful
    #   - If deployment failed, print a message about rolling back and return {"rollback_performed": True}
    #   - If deployment succeeded, print a message that no rollback is needed and return {"rollback_performed": False}

    # Define the task dependencies using TaskFlow
    # This creates the DAG structure automatically based on function calls
    extract_result = extract_data()
    transform_result = transform_data(extract_result)
    train_result = train_model(transform_result)
    validation_result = validate_model(train_result)
    deploy_result = deploy_model(validation_result)
    # TODO: Call your rollback_model task here, passing deploy_result as input.
    # The rollback_model task should always run after deploy_model,
    # but only perform a rollback if deployment failed.

# Now actually enable this to be run as a DAG
training_pipeline()
```

The solution below adds a `rollback_model` task and updates the dependency chain to include it.

```python
"""
Introduction to Airflow for MLOps - Sample DAG

This module defines a basic Airflow DAG that demonstrates core concepts
for ML workflows using the TaskFlow API. This serves as an introduction 
to Airflow before integrating our actual ML components.
"""

from datetime import datetime, timedelta
from airflow.decorators import dag, task
import logging

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
@dag(
    dag_id='mlops_pipeline',  # Unique identifier for the DAG
    description='Introduction to Airflow for ML pipelines',
    default_args=default_args,
    schedule_interval='@daily',  # Run daily
    start_date=datetime(2023, 1, 1),  # Start date (in the past)
    catchup=False,  # Don't run for past dates
    tags=['intro', 'ml'],
)
def training_pipeline():
    """
    This DAG introduces Airflow concepts for ML workflows.
    It demonstrates task definition, dependencies, and flow control
    before we implement our actual ML pipeline in the later units.
    """
    
    # Define tasks using the TaskFlow API
    @task(task_id="extract_data")
    def extract_data():
        """Simulate extracting data from a source."""
        print("Extracting data from source...")
        # In a real scenario, this would connect to a data source
        return {"data_extracted": True, "records": 1000}
    
    @task(task_id="transform_data")
    def transform_data(extract_result):
        """Simulate transforming the extracted data."""
        if extract_result["data_extracted"]:
            num_records = extract_result["records"]
            print(f"Transforming {num_records} records...")
            # Simulate data transformation
            return {"data_transformed": True, "features": 10}
        else:
            raise ValueError("Data extraction failed")
    
    @task(task_id="train_model")
    def train_model(transform_result):
        """Simulate training a machine learning model."""
        if transform_result["data_transformed"]:
            num_features = transform_result["features"]
            print(f"Training model with {num_features} features...")
            # Simulate model training
            return {"model_trained": True, "accuracy": 0.85}
        else:
            raise ValueError("Data transformation failed")
    
    @task(task_id="validate_model")
    def validate_model(train_result):
        """Simulate validating the model's performance."""
        if train_result["model_trained"]:
            accuracy = train_result["accuracy"]
            print(f"Validating model. Accuracy: {accuracy}")
            # Determine if model meets quality threshold
            if accuracy >= 0.8:
                return {"validation": "passed", "accuracy": accuracy}
            else:
                return {"validation": "failed", "accuracy": accuracy}
        else:
            raise ValueError("Model training failed")
    
    @task(task_id="deploy_model", trigger_rule="none_failed")
    def deploy_model(validation_result):
        """Simulate deploying the model if validation passed."""
        if validation_result["validation"] == "passed":
            print(f"Deploying model with accuracy: {validation_result['accuracy']}")
            return {"deployed": True}
        else:
            logging.warning(f"Model validation failed with accuracy: {validation_result['accuracy']}")
            return {"deployed": False}

    # New task: simulate a rollback when deployment did not succeed.
    @task(task_id="rollback_model")
    def rollback_model(deploy_result):
        """Simulates a rollback if deployment failed."""
        if deploy_result["deployed"]:
            print("Model deployed successfully. No rollback needed.")
            return {"rollback_performed": False}
        else:
            print("Model deployment failed. Performing rollback...")
            return {"rollback_performed": True}
            
    # Define the task dependencies using TaskFlow
    # This creates the DAG structure automatically based on function calls
    extract_result = extract_data()
    transform_result = transform_data(extract_result)
    train_result = train_model(transform_result)
    validation_result = validate_model(train_result)
    deploy_result = deploy_model(validation_result)
    # rollback_model always runs after deploy_model; it only rolls back if deployment failed.
    rollback_model(deploy_result)

# Now actually enable this to be run as a DAG
training_pipeline()
```
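
One way to sanity-check the rollback branching without a running Airflow instance is to pull the decision into a plain Python function and call it directly. The sketch below is only an illustration: `decide_rollback` is a hypothetical helper name, not part of the lesson's DAG, and treating a missing `"deployed"` key as a failed deployment is an assumption made here for safety.

```python
def decide_rollback(deploy_result):
    """Mirror the rollback_model branching for a deploy_model-style result dict.

    Hypothetical helper for local testing; it is not part of the DAG above.
    """
    # Assumption: a missing "deployed" key is treated as a failed deployment.
    if deploy_result.get("deployed", False):
        print("Model deployed successfully. No rollback needed.")
        return {"rollback_performed": False}
    print("Model deployment failed. Performing rollback...")
    return {"rollback_performed": True}


# Exercise both branches with the same dict shapes deploy_model returns.
success_case = decide_rollback({"deployed": True})
failure_case = decide_rollback({"deployed": False})
print(success_case, failure_case)
```

Inside the DAG, the `@task`-decorated `rollback_model` could simply return `decide_rollback(deploy_result)`, which keeps the branching logic testable with ordinary assertions while Airflow handles scheduling and data passing.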

## Build a Complete ML Pipeline DAG