# Amazon Managed Workflows for Apache Airflow (MWAA)

<img src="../_assets/aws_service_icons/mwaa.svg" width="80" alt="AWS MWAA">

## Goals
- Understand what **MWAA** is (and what it is not).
- Know what it’s practically used for.
- See a minimal **AWS SDK** pseudo-code workflow (no execution).


## Prerequisites
- Basic Airflow concepts: DAGs, tasks, schedules, retries.
- Basic AWS concepts: IAM roles, S3, VPC/subnets/security groups.
- Familiarity with an AWS SDK (e.g., Python `boto3`).

> The AWS SDK workflow in this notebook is **pseudo-code only**. Nothing here makes live AWS calls.


## What MWAA is
**Amazon Managed Workflows for Apache Airflow (MWAA)** is AWS’s **managed** offering for running **Apache Airflow**.

You bring:
- **DAG code** (workflow definitions) and optional dependencies/plugins.
- Connection details and IAM permissions for the systems your tasks interact with.

AWS manages:
- Provisioning and operating the **Airflow components** (scheduler, web server, workers, metadata database).
- Scaling, patching, and integrating with AWS primitives (CloudWatch logs/metrics, IAM, VPC networking).

MWAA environments are typically configured with:
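- An **S3 bucket** for DAGs (and often plugins/requirements); the bucket must have versioning enabled.
- An **execution role** (IAM) that the environment assumes to read from S3, write logs, and call other services.
- **VPC networking** (two private subnets in different Availability Zones, plus security groups).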
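
The execution role has to be assumable by MWAA itself. As a minimal sketch, assuming the service principals `airflow.amazonaws.com` and `airflow-env.amazonaws.com` described in the AWS execution-role documentation for MWAA, the role's trust policy can be built as a plain dictionary. The helper name `mwaa_trust_policy` is hypothetical, and the permissions policy (S3, CloudWatch Logs, and so on) is deliberately left out; check both against the current AWS docs before use.

```python
import json


def mwaa_trust_policy() -> dict:
    """Return a trust policy document that lets MWAA assume the execution role.

    Assumption: the two service principals below are the ones MWAA uses;
    verify them against the AWS documentation for your region.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "airflow.amazonaws.com",
                        "airflow-env.amazonaws.com",
                    ]
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Render the document as JSON, e.g. to paste into an IAM role definition
print(json.dumps(mwaa_trust_policy(), indent=2))
```

This only covers who may assume the role. What the role may do once assumed is a separate permissions policy.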

### What it is not
- Not an ETL/compute engine by itself: Airflow **orchestrates** work; your tasks run on other compute/services.
- Not a replacement for data processing tools (Spark/Glue), model training services (SageMaker), or storage (S3/RDS).


## What MWAA is practically used for
MWAA is used when you want **Airflow-style orchestration** (dependencies, retries, schedules, backfills, observability) without running Airflow yourself.

Common use-cases:
- **Data pipelines**: ingest → validate → transform → load (ETL/ELT).
- **ML pipelines**: feature generation, training, evaluation, batch inference, model promotion.
- **Cross-service orchestration**: coordinate work across Lambda/ECS/Batch/Glue/EMR/SageMaker and databases.
- **Operational workflows**: report generation, periodic maintenance jobs, data quality checks.
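
To make "dependencies and retries" concrete, here is a toy sketch of the ingest → validate → transform → load pipeline using only the Python standard library. It is not how Airflow is implemented; `run_pipeline`, the task names, and the flaky `validate` step are all hypothetical and exist only to illustrate the orchestration semantics MWAA provides as a managed service.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, upstream, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks:    mapping of task name -> zero-argument callable
    upstream: mapping of task name -> set of task names it depends on
    Returns a log of (task, attempt, outcome) tuples.
    """
    log = []
    for name in TopologicalSorter(upstream).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt, "success"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
        else:
            # Every attempt failed: stop so downstream tasks do not run
            raise RuntimeError(f"{name} failed after {max_retries + 1} attempts")
    return log


calls = {"validate": 0}


def flaky_validate():
    """Fail on the first call only, to simulate a transient error."""
    calls["validate"] += 1
    if calls["validate"] == 1:
        raise ValueError("transient failure")


tasks = {
    "ingest": lambda: None,
    "validate": flaky_validate,
    "transform": lambda: None,
    "load": lambda: None,
}
upstream = {
    "validate": {"ingest"},
    "transform": {"validate"},
    "load": {"transform"},
}

log = run_pipeline(tasks, upstream)
for entry in log:
    print(entry)
```

In a real DAG the callables would hand work to other services (Glue, ECS, SageMaker, and so on), and Airflow would add scheduling, backfills, and a UI on top of this ordering and retry behavior.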

Why teams choose MWAA:
- Prefer the Airflow ecosystem (operators/sensors) but want AWS-managed ops.
- Need a central scheduler with clear dependency graphs, retries, and auditability.


## Using MWAA with the AWS SDK (pseudo-code)
Below is a minimal, **non-executable** sketch of a typical workflow:
1) Upload a DAG to the S3 DAGs prefix.
2) Create an MWAA environment pointing at that bucket/prefix.
3) Wait for the environment to become available.
4) Trigger a DAG run using an MWAA CLI token.

```python
# PSEUDO-CODE (do not run)

import boto3
import requests
from time import sleep

region = "us-east-1"

# Service clients
mwaa = boto3.client("mwaa", region_name=region)
s3 = boto3.client("s3", region_name=region)

# 1) Upload a DAG to the MWAA DAGs S3 prefix
bucket = "my-mwaa-artifacts"
dags_prefix = "dags/"  # MWAA reads DAGs from this prefix
s3.upload_file(
    Filename="./dags/example_pipeline.py",
    Bucket=bucket,
    Key=f"{dags_prefix}example_pipeline.py",
)

# 2) Create the MWAA environment (one-time setup)
env_name = "ml-pipelines"
mwaa.create_environment(
    Name=env_name,
    AirflowVersion="2.7.2",  # example; choose a supported version
    SourceBucketArn=f"arn:aws:s3:::{bucket}",
    DagS3Path=dags_prefix,
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/<mwaa-execution-role>",
    NetworkConfiguration={
        "SecurityGroupIds": ["sg-xxxxxxxx"],
        # MWAA expects two private subnets in different Availability Zones
        "SubnetIds": ["subnet-aaaaaaa", "subnet-bbbbbbb"],
    },
    EnvironmentClass="mw1.small",
    LoggingConfiguration={
        "DagProcessingLogs": {"Enabled": True, "LogLevel": "INFO"},
        "SchedulerLogs": {"Enabled": True, "LogLevel": "INFO"},
        "TaskLogs": {"Enabled": True, "LogLevel": "INFO"},
        "WebserverLogs": {"Enabled": True, "LogLevel": "INFO"},
        "WorkerLogs": {"Enabled": True, "LogLevel": "INFO"},
    },
)

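# 3) Wait until the environment is ready (environment creation is async)
def wait_until_available(name: str, poll_seconds: int = 60, max_polls: int = 90):
    for _ in range(max_polls):
        env = mwaa.get_environment(Name=name)["Environment"]
        status = env["Status"]
        if status == "AVAILABLE":
            return env
        # Stop on terminal failure states instead of polling forever
        if status in ("CREATE_FAILED", "UPDATE_FAILED", "UNAVAILABLE", "DELETED"):
            raise RuntimeError(f"Environment {name} entered status {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"Environment {name} not AVAILABLE after {max_polls} polls")

env = wait_until_available(env_name)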

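# 4) Trigger a DAG run via the MWAA Airflow CLI endpoint
# MWAA provides a short-lived token to call the web server's CLI endpoint.
# The web server must be reachable from wherever this runs (see WebserverAccessMode).
import base64  # imported here so this step reads on its own

token = mwaa.create_cli_token(Name=env_name)
webserver = token["WebServerHostname"]
cli_token = token["CliToken"]

resp = requests.post(
    url=f"https://{webserver}/aws_mwaa/cli",
    headers={"Authorization": f"Bearer {cli_token}", "Content-Type": "text/plain"},
    data="dags trigger example_pipeline",
    timeout=30,
)
resp.raise_for_status()

# The endpoint returns JSON whose stdout/stderr fields are base64-encoded
result = resp.json()
print(base64.b64decode(result["stdout"]).decode("utf-8"))
print(base64.b64decode(result["stderr"]).decode("utf-8"))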
```


## Pitfalls & quick tips
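- **Networking**: MWAA attaches to private subnets in your VPC; plan route tables plus NAT or VPC endpoints so the environment and your tasks can reach required endpoints.
- **IAM**: the execution role must allow reading the S3 bucket (DAGs, plugins, requirements), writing CloudWatch logs, and calling any services your tasks use.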
- **Dependencies**: Python deps are typically provided via a `requirements.txt` in S3 (and plugins via `plugins.zip`).
- **Asynchronous ops**: environment creation/updates take time; build retry/polling into automation.
- **Costs**: MWAA bills for as long as the environment exists, even when no DAGs are running; automate cleanup for demos.
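
Cleanup for demos can be automated along these lines. This is a sketch under stated assumptions: demo environments follow a hypothetical `demo-` naming convention, `cleanup_demo_environments` is an illustrative helper, and `StubMwaaClient` is a stand-in that mimics only the `list_environments` and `delete_environment` calls of the boto3 `mwaa` client so the example runs without touching AWS.

```python
class StubMwaaClient:
    """Stand-in for boto3.client("mwaa") so the sketch runs without AWS."""

    def __init__(self, names):
        self.names = list(names)
        self.deleted = []

    def list_environments(self):
        # Mirrors the shape of the real response: a list of environment names
        return {"Environments": list(self.names)}

    def delete_environment(self, Name):
        # Records the request instead of deleting anything
        self.deleted.append(Name)
        return {}


def cleanup_demo_environments(client, prefix="demo-"):
    """Request deletion of every environment whose name starts with prefix."""
    requested = []
    # Simplification: accounts with many environments need NextToken pagination
    for name in client.list_environments()["Environments"]:
        if name.startswith(prefix):
            client.delete_environment(Name=name)  # deletion itself is asynchronous
            requested.append(name)
    return requested


stub = StubMwaaClient(["demo-airflow", "ml-pipelines", "demo-etl"])
requested = cleanup_demo_environments(stub)
print(requested)
```

Against a real account, the same function would take `boto3.client("mwaa")` in place of the stub. Because deletion is asynchronous, an automation job would typically poll afterwards until the environments are gone.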


## References
- AWS Docs: Amazon MWAA (concepts, setup, IAM, networking)
- Apache Airflow Docs: DAGs, operators, scheduling, best practices
