# AWS Glue: Reading from & Writing to S3 ☁️

This notebook performs a **Real-World ETL Job**:
1. **EXTRACT**: Read CSV data directly from your S3 bucket.
2. **TRANSFORM**: Rename columns and cast them to the right types.
3. **LOAD**: Write the results back to S3 as Parquet.

*Prerequisite: Ensure `sales_data.csv` is uploaded to `s3://egirgis-datalake-v1/raw/sales_data/`*

In [None]:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Set up the Spark and Glue contexts (reuses a SparkContext if one is already running)
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
print("Glue Context Initialized")

### 1. Read from S3 (Extract)
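Read the CSV files under the input prefix into a DynamicFrame, then preview a few rows and print the schema to confirm the bucket is reachable and the data parsed as expected.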

In [None]:
# Define S3 Paths (Change these if you use a different bucket)
BUCKET_NAME = "egirgis-datalake-v1"
INPUT_PATH = f"s3://{BUCKET_NAME}/raw/sales_data/"
OUTPUT_PATH = f"s3://{BUCKET_NAME}/processed/sales_clean/"

print(f"Reading from: {INPUT_PATH}")

# Read every CSV file under INPUT_PATH; the first line of each file is treated as the header
dyf_s3 = glueContext.create_dynamic_frame.from_options(
    format_options={"quoteChar": "\"", "withHeader": True, "separator": ","},
    connection_type="s3",
    format="csv",
    connection_options={"paths": [INPUT_PATH], "recurse": True},
    transformation_ctx="input_dyf"
)

dyf_s3.show(5)
dyf_s3.printSchema()
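
To make the three `format_options` concrete, here is a minimal local sketch that uses Python's standard `csv` module instead of Glue. The sample rows are hypothetical (only the column names come from the mapping used later in this notebook), and the `csv` module is a stand-in for illustration, not the parser Glue uses.

```python
import csv
import io

# Hypothetical sample data; the real sales_data.csv may look different.
SAMPLE_CSV = (
    "City,Product,Sales,Date\n"
    '"Cairo, EG",Laptop,1200.50,2024-01-15\n'
    "Giza,Phone,300,2024-01-16\n"
)


def read_rows(text):
    """Parse CSV text, using the first line as the header (like withHeader=True)."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter=",",   # plays the role of "separator"
        quotechar='"',   # plays the role of "quoteChar"
    )
    return list(reader)


rows = read_rows(SAMPLE_CSV)
for row in rows:
    print(row)
```

The quote character is what lets a value such as `Cairo, EG` contain the separator without being split in two. Every value also comes back as a string in this sketch, which is why a cast is needed before `Sales` can be treated as a number.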

### 2. Transform
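Let's cast `Sales` from string to double and rename the columns to lowercase, descriptive names, as we would in a real job. Each mapping tuple reads `(source_name, source_type, target_name, target_type)`.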

In [None]:
mapped_dyf = dyf_s3.apply_mapping([
    ("City", "string", "city", "string"),
    ("Product", "string", "product_name", "string"),
    ("Sales", "string", "sales_amount", "double"),
    ("Date", "string", "date", "string")
])

mapped_dyf.printSchema()
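
To show what each mapping tuple means, the sketch below applies the same four tuples to a single plain-Python dict. This is an illustration only, not how Glue implements `apply_mapping`: the helper `apply_mapping_to_record`, the `CASTERS` table, and the sample record are all hypothetical, and turning an uncastable value into `None` is an assumption made for this example.

```python
# Same (source_name, source_type, target_name, target_type) tuples as the cell above.
MAPPINGS = [
    ("City", "string", "city", "string"),
    ("Product", "string", "product_name", "string"),
    ("Sales", "string", "sales_amount", "double"),
    ("Date", "string", "date", "string"),
]

# Illustrative casters for the two target types used in this notebook.
CASTERS = {"string": str, "double": float}


def apply_mapping_to_record(record, mappings):
    """Rename and cast the mapped fields of one record (hypothetical helper)."""
    out = {}
    for source, _source_type, target, target_type in mappings:
        value = record.get(source)
        try:
            out[target] = None if value is None else CASTERS[target_type](value)
        except ValueError:
            # Assumption for this sketch: a value that cannot be cast becomes None.
            out[target] = None
    return out


# Hypothetical input record with one extra field that no tuple mentions.
raw = {
    "City": "Cairo",
    "Product": "Laptop",
    "Sales": "1200.50",
    "Date": "2024-01-15",
    "Notes": "not in the mapping",
}
clean = apply_mapping_to_record(raw, MAPPINGS)
print(clean)
```

Fields that no tuple mentions (`Notes` here) are left out of the result in this sketch, so it is worth comparing the schema printed above against the columns you expect to keep.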

### 3. Write to S3 (Load)
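Write the clean data back to S3 as Parquet, a compressed columnar format that analytics engines can scan efficiently. The output is partitioned by `city`, so each city's rows land under their own S3 prefix.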

In [None]:
print(f"Writing to: {OUTPUT_PATH}")

glueContext.write_dynamic_frame.from_options(
    frame=mapped_dyf,
    connection_type="s3",
    format="parquet",
    connection_options={"path": OUTPUT_PATH, "partitionKeys": ["city"]},
    transformation_ctx="output_dyf"
)

print("Write Complete! Check your S3 bucket for the 'processed' folder.")
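
As a rough picture of what `partitionKeys` does to the output layout, the sketch below groups plain-Python records under Hive-style `key=value` prefixes. The helpers `partition_prefix` and `group_by_partition` and the sample records are hypothetical; the actual file names inside each prefix are chosen by the writer and are not modeled here, and escaping of special characters in partition values is not handled.

```python
from collections import defaultdict

# Same output location as defined earlier in this notebook.
OUTPUT_PATH = "s3://egirgis-datalake-v1/processed/sales_clean/"


def partition_prefix(base_path, record, partition_keys):
    """Build the Hive-style prefix (key=value/...) for one record."""
    parts = [f"{key}={record[key]}" for key in partition_keys]
    return base_path.rstrip("/") + "/" + "/".join(parts) + "/"


def group_by_partition(records, base_path, partition_keys):
    """Group records by the prefix they would be written under."""
    groups = defaultdict(list)
    for record in records:
        prefix = partition_prefix(base_path, record, partition_keys)
        # The partition value is carried by the path, so it is dropped from the row here.
        payload = {k: v for k, v in record.items() if k not in partition_keys}
        groups[prefix].append(payload)
    return dict(groups)


# Hypothetical cleaned records.
records = [
    {"city": "Cairo", "product_name": "Laptop", "sales_amount": 1200.5, "date": "2024-01-15"},
    {"city": "Giza", "product_name": "Phone", "sales_amount": 300.0, "date": "2024-01-16"},
    {"city": "Cairo", "product_name": "Mouse", "sales_amount": 25.0, "date": "2024-01-17"},
]

layout = group_by_partition(records, OUTPUT_PATH, ["city"])
for prefix, rows in layout.items():
    print(prefix, len(rows))
```

After the real write, you would expect to see prefixes of this shape (for example `processed/sales_clean/city=Cairo/`) when browsing the bucket. Query engines can use them to skip partitions that a filter on `city` rules out.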