# AWS Glue

<img src="../_assets/aws_service_icons/glue.svg" width="80" alt="AWS Glue">

## Goals
- Understand what **AWS Glue** is.
- Know what it is used for in real data platforms.
- Walk through a minimal **AWS SDK** workflow as pseudo-code (a sketch to read, not to run as-is).


## Prerequisites
- Familiarity with **S3** and basic **IAM** concepts helps.
- High-level idea of a **data lake** (raw → curated) is useful.


## What Glue is
**AWS Glue** is a managed, serverless **data integration** service. In practice, it gives you:

- A central **metadata catalog** (the *Glue Data Catalog*) for datasets and schemas.
- **Crawlers** to discover data (often in S3) and populate/update that catalog.
- Managed **ETL / ELT jobs** (commonly Apache Spark) to transform and move data.

Glue often sits in the middle of an AWS analytics stack, connecting storage (S3), query engines (Athena), warehouses (Redshift), and processing frameworks.


## What Glue is practically used for
Common real-world uses include:

- **Cataloging** a data lake: keep an inventory of datasets, schemas, partitions, and locations.
- **Schema discovery** with crawlers: infer tables/columns from files (CSV/JSON/Parquet) and update metadata over time.
- **Batch ETL**: read raw data, clean/normalize it, write curated outputs (often partitioned Parquet) back to S3.
- **Interoperability**: make datasets queryable via **Athena** and usable by other engines that read the Glue Data Catalog (for example Redshift Spectrum or Spark on EMR).
- **Orchestrated pipelines**: run jobs on schedules/events via Glue triggers/workflows (or external orchestrators).
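
As a rough illustration of the last bullet, the sketch below builds the request for a scheduled trigger that starts one job. The field names (`Name`, `Type`, `Schedule`, `Actions`, `StartOnCreation`) follow the `create_trigger` operation in `boto3`. The helper `scheduled_trigger_request`, the trigger name, and the 02:00 UTC schedule are hypothetical placeholders. The code only builds a dictionary and makes no AWS call.

```python
def scheduled_trigger_request(trigger_name, job_name, cron_expression):
    """Build keyword arguments for glue.create_trigger (one job, time-based)."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Glue schedules use the six-field cron(...) syntax, evaluated in UTC.
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        # Activate the trigger immediately instead of leaving it in CREATED state.
        "StartOnCreation": True,
    }


request = scheduled_trigger_request(
    trigger_name="nightly_raw_to_curated",  # hypothetical trigger name
    job_name="etl_raw_to_curated",
    cron_expression="0 2 * * ? *",  # assumed schedule: every day at 02:00 UTC
)
print(request["Schedule"])  # cron(0 2 * * ? *)
```

With a configured client, the dictionary could be passed as `boto3.client("glue").create_trigger(**request)`. Keeping the request builder separate from the API call makes the schedule easy to review and test without AWS access.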


## Core concepts (minimum you should know)
- **Data Catalog**: databases + tables + partitions (metadata pointing to data in S3, JDBC sources, etc.).
- **Crawler**: scans a target (often S3) and creates/updates catalog tables.
- **Job**: an ETL program (often Spark) Glue runs using a configured IAM role and arguments.
- **IAM Role**: permissions for Glue to read sources (e.g., S3), write targets, and log to CloudWatch.
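
To make catalog partitions more concrete: a partition is metadata that points to one S3 prefix, and crawlers read Hive-style `key=value` folder names as partition columns. The sketch below shows one possible layout, assuming a `year/month/day` scheme. The helper `partition_prefix` and the bucket path are hypothetical, and other partitioning schemes are equally valid.

```python
from datetime import date


def partition_prefix(base_path, day):
    """Return the Hive-style S3 prefix holding one day's worth of data."""
    base = base_path.rstrip("/")  # avoid a double slash when joining
    return f"{base}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


prefix = partition_prefix("s3://my-bucket/curated/events/", date(2026, 1, 6))
print(prefix)  # s3://my-bucket/curated/events/year=2026/month=01/day=06/
```

A crawler pointed at `s3://my-bucket/curated/events/` would typically register `year`, `month`, and `day` as partition columns, which lets query engines skip prefixes that a filter rules out.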


## Using Glue with the AWS SDK (pseudo-code)
Below is a minimal, **non-executable** sketch using an AWS SDK (shown as `boto3`, the AWS SDK for Python).

Notes:
- Avoid hardcoding credentials; prefer **IAM roles** (for AWS compute) and SSO/role-based credentials locally.
- `create_*` APIs are not idempotent: calling one for a resource that already exists raises an error (for example `AlreadyExistsException`), which production code typically catches and handles.

```python
# PSEUDO-CODE (do not run)

import boto3
import time

region = "us-east-1"
glue = boto3.client("glue", region_name=region)

# Placeholders — replace with your environment
database_name = "my_data_lake_raw"
crawler_name = "raw_s3_crawler"
table_prefix = "raw_"
s3_target_path = "s3://my-bucket/raw/events/"

job_name = "etl_raw_to_curated"
iam_role_arn = "arn:aws:iam::123456789012:role/AWSGlueServiceRole-MyRole"
script_location = "s3://my-bucket/glue-scripts/etl_raw_to_curated.py"

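# 1) Create the Glue Data Catalog database, tolerating a pre-existing one
try:
    glue.create_database(DatabaseInput={"Name": database_name})
except glue.exceptions.AlreadyExistsException:
    pass  # database already exists; nothing to do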

# 2) Create a crawler to infer schema from S3 and populate the catalog
glue.create_crawler(
    Name=crawler_name,
    Role=iam_role_arn,
    DatabaseName=database_name,
    Targets={"S3Targets": [{"Path": s3_target_path}]},
    TablePrefix=table_prefix,
)

glue.start_crawler(Name=crawler_name)

# 3) (Optional) Poll until the crawler finishes
while True:
    state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]  # READY / RUNNING / STOPPING
    if state == "READY":
        break
    time.sleep(15)

# 4) Create a Glue ETL job (Spark) and run it
glue.create_job(
    Name=job_name,
    Role=iam_role_arn,
    Command={
        "Name": "glueetl",
        "ScriptLocation": script_location,
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    DefaultArguments={
        "--job-language": "python",
        "--TempDir": "s3://my-bucket/glue-temp/",
        "--SOURCE_DB": database_name,
        "--SOURCE_TABLE": f"{table_prefix}events",
        "--TARGET_S3": "s3://my-bucket/curated/events/",
    },
)

run = glue.start_job_run(JobName=job_name, Arguments={"--RUN_ID": "2026-01-06"})
run_id = run["JobRunId"]

# 5) (Optional) Poll job status
while True:
    jr = glue.get_job_run(
        JobName=job_name,
        RunId=run_id,
        PredecessorsIncluded=False,
    )["JobRun"]
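    status = jr["JobRunState"]  # STARTING / RUNNING / STOPPING / SUCCEEDED / FAILED / ...
    # Stop on every terminal state; omitting ERROR or EXPIRED would loop forever
    if status in {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}: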
        break
    time.sleep(30)

print("final_status:", status)
```
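
The polling loops above run until a terminal state appears, with no upper bound on waiting time. One way to bound them is a small generic helper, sketched below. `wait_for_state` and its parameters are hypothetical and not part of `boto3`. The `sleep` and `clock` arguments exist only so the helper can be exercised without real delays, and the demo uses a fake state sequence instead of a Glue call.

```python
import time


def wait_for_state(get_state, done_states, timeout_s=1800, interval_s=30,
                   sleep=time.sleep, clock=time.monotonic):
    """Call get_state() until it returns a value in done_states or time runs out."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in done_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"still in state {state!r} after {timeout_s}s")
        sleep(interval_s)


# Demo with a fake state sequence instead of a real Glue client.
fake_states = iter(["STARTING", "RUNNING", "SUCCEEDED"])
final = wait_for_state(
    get_state=lambda: next(fake_states),
    done_states={"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"},
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final)  # SUCCEEDED
```

For a real job run, `get_state` could be a lambda wrapping `glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]`. For a crawler, it could wrap `glue.get_crawler(...)["Crawler"]["State"]` with `done_states={"READY"}`.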
