# Amazon EMR (Elastic MapReduce)

<img src="../_assets/aws_service_icons/emr.svg" width="80" alt="Amazon EMR">

## Goals
- Understand what **Amazon EMR** is (and what it is not).
- Know common practical use-cases for EMR in data/ML workflows.
- See a minimal **AWS SDK** pseudo-code workflow (no execution).


## Prerequisites
- Basic AWS concepts (regions, IAM roles, VPC/subnets).
- Familiarity with batch processing and distributed compute (e.g., Spark concepts) helps.

> This notebook includes **pseudo-code only**. It does not run any AWS SDK calls.


## What EMR is
**Amazon EMR** is AWS’s managed service for running common **open-source distributed data processing frameworks**.

Think of EMR as “managed clusters + managed integrations” for frameworks like:
- **Apache Spark** (ETL, feature engineering, ML pipelines)
- **Apache Hadoop** (HDFS/MapReduce ecosystem)
- **Hive/Presto/Trino** (SQL-on-data-lake style querying)
- Other ecosystem tools such as HBase or Flink, depending on the release

EMR can be used in different deployment modes:
- **EMR on EC2**: EMR provisions and manages an EC2 cluster for you.
- **EMR on EKS**: run EMR workloads on Kubernetes (EKS) using EMR’s runtime.
- **EMR Serverless**: run supported workloads without managing clusters.

Key concepts (EMR on EC2 terminology):
- **Cluster / Job flow**: a set of compute instances configured for your frameworks. "Job flow" is the older name and still appears in API calls such as `RunJobFlow`.
- **Steps**: ordered units of work (e.g., a Spark submit, a Hive query).
- **Release label**: the EMR platform version (pins framework versions).
- **Logs**: typically shipped to S3 for debugging and auditing.
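
To make the **step** concept concrete, here is a minimal sketch of a Spark step written as a plain dictionary, in the shape that `run_job_flow` and `add_job_flow_steps` accept for their `Steps` argument. The helper name `build_spark_step` and the S3 paths are hypothetical placeholders; the code only builds data and makes no AWS calls.

```python
def build_spark_step(name, script_uri, script_args=(), action_on_failure="CONTINUE"):
    """Return a step definition for a spark-submit run via command-runner.jar.

    Hypothetical helper for illustration; it builds a dict and never contacts AWS.
    """
    allowed = {"TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE"}
    if action_on_failure not in allowed:
        raise ValueError(f"action_on_failure must be one of {sorted(allowed)}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # Everything after the script URI is passed to the script itself
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *script_args],
        },
    }


step = build_spark_step(
    "spark-etl",
    "s3://<bucket>/jobs/etl.py",
    ["--input", "s3://<bucket>/data/raw/", "--output", "s3://<bucket>/data/curated/"],
    action_on_failure="TERMINATE_CLUSTER",
)
print(step["HadoopJarStep"]["Args"])
```

A dictionary like this could be placed in the `Steps` list when the cluster is created, or submitted to a running cluster with `add_job_flow_steps(JobFlowId=..., Steps=[step])`.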

### What it is not
- Not a general-purpose orchestrator (use Airflow/Step Functions/Prefect for multi-system workflows).
- Not primarily a storage layer (most modern EMR setups use **S3** as the data lake; HDFS is optional/temporary).
- Not the same as AWS Glue: Glue is a serverless ETL service that hides most of the infrastructure, while EMR is for when you want **more control** over the runtime, libraries, tuning, and cluster shape.


## What EMR is practically used for
EMR is commonly used when a workload benefits from **distributed compute** and you want a managed way to run/tune the underlying frameworks.

Typical use-cases:
- **Batch ETL at scale**: transform raw data in S3 into curated tables/files.
- **Feature engineering**: build training datasets and feature tables using Spark.
- **Large joins/aggregations**: computations that are too slow/expensive on a single machine.
- **SQL over a data lake**: interactive or scheduled queries via Hive/Trino/Presto.
- **Migration from on-prem Hadoop**: lift-and-shift (then modernize) existing Spark/Hive workloads.

Common operational patterns:
- **Ephemeral clusters**: create a cluster, run steps, and auto-terminate to control cost.
- **Separation of storage and compute**: keep data in S3; treat the cluster as disposable.
- **Cost optimization**: mix On-Demand and Spot instances; right-size instance groups.
- **Reproducibility**: pin the EMR release label and package dependencies carefully.
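
The ephemeral-cluster and cost-optimization patterns above can be sketched as a small builder for the `Instances` argument of `run_job_flow`. This is an illustrative sketch, not a recommended sizing: the helper name `build_instances_config`, the instance type, and the counts are assumptions, and the code builds a dictionary without calling AWS.

```python
def build_instances_config(subnet_id, core_count=2, spot_task_count=0,
                           instance_type="m5.xlarge"):
    """Build an `Instances` dict for an ephemeral cluster (data only, no AWS calls)."""
    groups = [
        {"Name": "Primary", "InstanceRole": "MASTER", "InstanceType": instance_type,
         "InstanceCount": 1, "Market": "ON_DEMAND"},
        {"Name": "Core", "InstanceRole": "CORE", "InstanceType": instance_type,
         "InstanceCount": core_count, "Market": "ON_DEMAND"},
    ]
    if spot_task_count > 0:
        # Task nodes do not store HDFS data, so a reclaimed Spot instance
        # loses in-flight work but not cluster storage.
        groups.append({"Name": "Task-Spot", "InstanceRole": "TASK",
                       "InstanceType": instance_type,
                       "InstanceCount": spot_task_count, "Market": "SPOT"})
    return {
        "Ec2SubnetId": subnet_id,
        "KeepJobFlowAliveWhenNoSteps": False,  # ephemeral: terminate after the last step
        "TerminationProtected": False,
        "InstanceGroups": groups,
    }


instances = build_instances_config("subnet-xxxxxxxx", spot_task_count=4)
print([(g["InstanceRole"], g["Market"]) for g in instances["InstanceGroups"]])
```

Keeping the primary and core groups On-Demand and putting only task capacity on Spot is one common way to limit the impact of Spot interruptions; the right mix depends on the workload.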


## Using EMR with the AWS SDK (pseudo-code)
Below is a minimal, **non-executable** sketch of creating an EMR cluster (EMR on EC2), submitting a Spark step, monitoring state, and cleaning up.

Notes:
- Prefer **IAM roles** for permissions; never hardcode AWS keys in notebooks.
- Creating EMR resources can incur cost; use auto-termination and clean up.
- Network settings (subnets, security groups) and IAM roles vary by organization; treat them as placeholders.

```python
# PSEUDO-CODE (do not run)

import time
import boto3

region = "us-east-1"
emr = boto3.client("emr", region_name=region)

# Inputs/outputs live in S3 (common EMR pattern)
s3_code_uri = "s3://<bucket>/jobs/etl.py"
s3_input_uri = "s3://<bucket>/data/raw/"
s3_output_uri = "s3://<bucket>/data/curated/"
s3_log_uri = "s3://<bucket>/emr-logs/"

# IAM roles (use org-approved roles; names below are common defaults)
service_role = "EMR_DefaultRole"           # role the EMR service assumes to provision and manage the cluster
job_flow_role = "EMR_EC2_DefaultRole"      # EC2 instance profile used by the cluster's instances

# Minimal cluster + step definition
resp = emr.run_job_flow(
    Name="demo-emr-spark-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    LogUri=s3_log_uri,
    Instances={
        "Ec2SubnetId": "subnet-xxxxxxxx",
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
        "InstanceGroups": [
            {
                "Name": "Primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
                "Market": "ON_DEMAND",
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
                "Market": "ON_DEMAND",
            },
        ],
    },
    Steps=[
        {
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode",
                    "cluster",
                    s3_code_uri,
                    "--input",
                    s3_input_uri,
                    "--output",
                    s3_output_uri,
                ],
            },
        }
    ],
    ServiceRole=service_role,
    JobFlowRole=job_flow_role,
    VisibleToAllUsers=True,
)

cluster_id = resp["JobFlowId"]
print("Started cluster:", cluster_id)

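# Poll cluster state until it ends (or use an org-specific orchestration tool)
terminal_states = {"TERMINATED", "TERMINATED_WITH_ERRORS"}
while True:
    cluster = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]
    state = cluster["Status"]["State"]
    print("Cluster state:", state)
    if state in terminal_states:
        break
    time.sleep(30)

# Surface failures (for example, a failed step) instead of finishing silently
if state == "TERMINATED_WITH_ERRORS":
    reason = cluster["Status"].get("StateChangeReason", {})
    raise RuntimeError(f"Cluster {cluster_id} failed: {reason.get('Message', 'unknown reason')}")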

# If you kept the cluster alive (KeepJobFlowAliveWhenNoSteps=True), terminate explicitly:
# emr.terminate_job_flows(JobFlowIds=[cluster_id])
```
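
When a cluster ends in `TERMINATED_WITH_ERRORS`, the next question is usually which step failed and where its logs are. The sketch below collects that information from `list_steps`, following the `Marker` field for pagination. It assumes the response carries `Status.FailureDetails` with `Message`, `Reason`, and `LogFile` keys. To stay consistent with the no-AWS-calls rule of this notebook, it runs against `_StubEmr`, a hypothetical stand-in for `boto3.client("emr")` that returns made-up sample data.

```python
def failed_steps(emr_client, cluster_id):
    """Collect id, name, reason, and log location for each FAILED step on a cluster."""
    failures = []
    kwargs = {"ClusterId": cluster_id, "StepStates": ["FAILED"]}
    while True:
        page = emr_client.list_steps(**kwargs)
        for step in page.get("Steps", []):
            details = step.get("Status", {}).get("FailureDetails", {})
            failures.append({
                "id": step["Id"],
                "name": step["Name"],
                "reason": details.get("Message") or details.get("Reason", "unknown"),
                "log_file": details.get("LogFile"),
            })
        marker = page.get("Marker")
        if not marker:  # no further pages
            return failures
        kwargs["Marker"] = marker


class _StubEmr:
    """Hypothetical offline stand-in for boto3.client("emr"); the data is made up."""

    def list_steps(self, **kwargs):
        return {"Steps": [{
            "Id": "s-EXAMPLE",
            "Name": "spark-etl",
            "Status": {"State": "FAILED", "FailureDetails": {
                "Message": "sample failure message",
                "LogFile": "s3://<bucket>/emr-logs/<cluster-id>/steps/s-EXAMPLE/",
            }},
        }]}


failures = failed_steps(_StubEmr(), "j-EXAMPLE")
for failure in failures:
    print(f"{failure['name']} failed: {failure['reason']} (log: {failure['log_file']})")
```

With a real client, the same function would be called as `failed_steps(emr, cluster_id)` after the polling loop reports a terminal state.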
