# PySpark on Docker Compose
Distributed tabular ETL with Apache Spark — petabyte-scale joins with fault tolerance and Adaptive Query Execution (AQE).

## Setup

Start the Spark stack and launch Jupyter:

```bash
# 1. Build images
docker compose build

# 2. Start MinIO + Spark + App
docker compose up -d minio minio-setup spark-master
docker compose up -d --scale spark-worker=1 spark-worker app

# 3. Upload sample data
./scripts/upload-data.sh

# 4. Launch Jupyter Lab
docker compose exec app jupyter lab --ip 0.0.0.0 --port 8888 --allow-root --no-browser --notebook-dir=/app/notebook
```

Then open http://localhost:8888 in your browser.

## What is Spark?

Apache Spark is a distributed engine widely used for **petabyte-scale tabular ETL**. Key concepts:

- **SparkSession** — entry point to all Spark functionality
- **DataFrames** — distributed collections with SQL-like API
- **Lazy evaluation** — transformations build a DAG, actions trigger execution
- **Adaptive Query Execution (AQE)** — runtime query re-optimization
- **Fault tolerance** — automatic recovery via RDD lineage
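
The lazy-evaluation bullet above can be illustrated without a cluster. The following is a plain-Python model, not Spark's implementation: the class `LazyPlan` and the helper `double` are hypothetical names, chosen only to show that transformations record steps and that work happens when an action runs.

```python
from typing import Callable, Iterable, List, Optional


class LazyPlan:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, source: Iterable, steps: Optional[List[Callable]] = None):
        self._source = source
        self._steps = list(steps or [])

    def map(self, fn: Callable) -> "LazyPlan":
        # Transformation: returns a new plan with one more step; no data is touched.
        return LazyPlan(self._source, self._steps + [lambda rows: (fn(r) for r in rows)])

    def filter(self, predicate: Callable) -> "LazyPlan":
        # Transformation: also only extends the plan.
        return LazyPlan(
            self._source,
            self._steps + [lambda rows: (r for r in rows if predicate(r))],
        )

    def collect(self) -> list:
        # Action: runs every recorded step, in order, over the source.
        rows = iter(self._source)
        for step in self._steps:
            rows = step(rows)
        return list(rows)


calls = []  # records which inputs were actually processed


def double(x: int) -> int:
    calls.append(x)
    return x * 2


plan = LazyPlan(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # building the plan ran nothing
result = plan.collect()           # the action triggers the work
print(calls_before_action, result, calls)
```

In Spark the recorded plan is also optimized before it runs, which this toy model does not attempt.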

## Architecture

```
Driver (app container) → Executor (spark-worker) → MinIO (S3 storage)
```

The driver plans and coordinates the job. Executors on the workers run tasks in parallel, reading input from and writing output to MinIO, the shared S3-compatible storage layer.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = (
    SparkSession.builder.appName("Notebook_Spark_ETL")
    .master("spark://spark-master:7077")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
print("Spark master UI: http://localhost:8080")
print(f"Connected to: {spark.sparkContext.master}")
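
The session cell above hard-codes the MinIO credentials. One way to keep them out of the notebook is to read them from environment variables. This is a minimal sketch under assumptions: the helper `s3a_options` and the variable names `S3_ENDPOINT`, `S3_ACCESS_KEY`, and `S3_SECRET_KEY` are hypothetical, and it assumes the Compose file passes them to the `app` container. The defaults mirror the values used in the cell.

```python
import os
from typing import Dict, Mapping


def s3a_options(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the S3A settings from environment variables.

    The variable names are hypothetical; the defaults mirror the notebook's values.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": env.get("S3_ENDPOINT", "http://minio:9000"),
        "spark.hadoop.fs.s3a.access.key": env.get("S3_ACCESS_KEY", "minioadmin"),
        "spark.hadoop.fs.s3a.secret.key": env.get("S3_SECRET_KEY", "minioadmin"),
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Example with an explicit mapping instead of the real environment.
options = s3a_options({"S3_ACCESS_KEY": "notebook-user"})
for key, value in options.items():
    print(key, "=", value)
```

Each pair can then be passed to `SparkSession.builder.config(key, value)` in a loop before calling `getOrCreate()`.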

## Explore Data

In [None]:
trips = spark.read.parquet("s3a://lake/taxi/*.parquet")
zones = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://lake/taxi/taxi_zone_lookup.csv")
)

print(f"Trips: {trips.count():,} rows")
trips.printSchema()
trips.show(5)

print(f"\nZones: {zones.count()} rows")
zones.show(5)

## Filter Operations

In [None]:
# High-value trips: fare > $10 and distance > 5 miles
filtered = trips.filter((F.col("fare_amount") > 10.0) & (F.col("trip_distance") > 5.0))
print(f"High-value trips: {filtered.count():,} (out of {trips.count():,})")
filtered.select("trip_distance", "fare_amount", "tip_amount", "total_amount").show(10)
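
The parentheses around each comparison in the filter are required. In Python, `&` binds more tightly than `>`, so an unparenthesized condition is parsed differently. The sketch below uses plain integers to show the parse; column expressions follow the same grammar.

```python
# Without parentheses, `&` is applied first and the comparisons chain:
# 10 > 5 & 3 > 1  is parsed as  10 > (5 & 3) > 1, and 5 & 3 == 1.
unparenthesized = 10 > 5 & 3 > 1

# With parentheses, each comparison is evaluated first, then combined.
parenthesized = (10 > 5) & (3 > 1)

print(unparenthesized, parenthesized)  # False True
```

With Spark columns, the unparenthesized form usually raises an error instead of returning a wrong result, because the chained comparison tries to convert a column to a Python boolean.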

## GroupBy + Aggregation

In [None]:
# Revenue breakdown by payment type
payment_stats = (
    trips.groupBy("payment_type")
    .agg(
        F.count("*").alias("trip_count"),
        F.sum("total_amount").alias("total_revenue"),
        F.avg("tip_amount").alias("avg_tip"),
        F.avg("trip_distance").alias("avg_distance"),
    )
    .orderBy(F.desc("total_revenue"))
)
payment_stats.show()
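
The aggregates in this cell treat NULLs differently: `count("*")` counts every row, while `sum` and `avg` skip NULL values. The following is a plain-Python model of that behavior, not Spark code; the sample rows are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Invented sample rows: (payment_type, tip_amount); None stands for NULL.
sample_rows: List[Tuple[int, Optional[float]]] = [
    (1, 2.0),
    (1, None),
    (1, 4.0),
    (2, None),
]

# Group tips by payment type, keeping NULLs in place.
groups: Dict[int, List[Optional[float]]] = defaultdict(list)
for payment_type, tip in sample_rows:
    groups[payment_type].append(tip)

stats = {}
for payment_type, tips in groups.items():
    present = [t for t in tips if t is not None]
    stats[payment_type] = {
        "trip_count": len(tips),  # like count("*"): every row counts
        "avg_tip": sum(present) / len(present) if present else None,  # like avg: NULLs skipped
    }

print(stats)
```

Payment type 1 has three trips but its average tip is computed over two values, and a group with only NULL tips gets a NULL average.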

## Join — Enrich with Zone Names

In [None]:
# Join trips with zone lookup to get human-readable pickup locations
enriched = trips.join(zones, trips["PULocationID"] == zones["LocationID"], "inner")
enriched.select("Borough", "Zone", "fare_amount", "trip_distance", "tip_amount").show(
    10
)
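
An inner join silently drops trips whose `PULocationID` has no match in the zone lookup. The following is a plain-Python model of that behavior, with invented sample data and hypothetical variable names, showing how to see which rows would be lost.

```python
# Invented sample data: trips as (PULocationID, fare_amount),
# zones as LocationID -> (Borough, Zone).
sample_trips = [(1, 12.5), (2, 30.0), (999, 8.0)]
sample_zones = {1: ("EWR", "Newark Airport"), 2: ("Queens", "Jamaica Bay")}

# Inner join: keep only trips whose location ID exists in the lookup.
inner = [
    (sample_zones[loc][0], sample_zones[loc][1], fare)
    for loc, fare in sample_trips
    if loc in sample_zones
]

# Anti join: the trips the inner join drops.
unmatched = [(loc, fare) for loc, fare in sample_trips if loc not in sample_zones]

print(inner)
print(unmatched)
```

In PySpark, the same check can be done by running the join with the `"left_anti"` join type in place of `"inner"`, which returns only the unmatched trips.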

## Window Functions + Ranking

In [None]:
# Top 3 zones per borough by revenue
window = Window.partitionBy("Borough").orderBy(F.desc("revenue"))

zone_revenue = (
    enriched.groupBy("Borough", "Zone")
    .agg(
        F.sum("total_amount").alias("revenue"),
        F.count("*").alias("trips"),
        F.avg("tip_amount").alias("avg_tip"),
    )
    .withColumn("rank", F.row_number().over(window))
    .filter(F.col("rank") <= 3)
    .orderBy("Borough", "rank")
)
zone_revenue.show(20, truncate=False)
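
`row_number()` assigns distinct consecutive numbers within each partition even when revenues tie, whereas `rank()` gives tied rows the same number and then leaves a gap. The following is a plain-Python model of the two numbering schemes, not Spark code; the sample rows and the helpers `row_numbers` and `ranks` are invented for illustration.

```python
from itertools import groupby

# Invented sample: (Borough, Zone, revenue); two Queens zones tie.
sample = [
    ("Queens", "JFK Airport", 500.0),
    ("Queens", "LaGuardia Airport", 500.0),
    ("Queens", "Astoria", 120.0),
    ("Bronx", "Fordham South", 80.0),
]


def _partitions(rows):
    # Partition by borough, order by revenue descending within each partition.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    return [list(part) for _, part in groupby(ordered, key=lambda r: r[0])]


def row_numbers(rows):
    # Consecutive numbering; ties are broken by input order in this model.
    return [
        (borough, zone, n)
        for part in _partitions(rows)
        for n, (borough, zone, _) in enumerate(part, start=1)
    ]


def ranks(rows):
    # rank = 1 + number of rows in the partition with strictly higher revenue.
    return [
        (borough, zone, 1 + sum(1 for r in part if r[2] > revenue))
        for part in _partitions(rows)
        for borough, zone, revenue in part
    ]


print(row_numbers(sample))
print(ranks(sample))
```

In this model ties are broken by input order. In Spark the order among tied rows is not guaranteed, so adding a second sort key such as `Zone` to the window makes a top-3 cut reproducible.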

## Full ETL Pipeline

Filter → Join → Aggregate → Rank → Write to S3

In [None]:
window = Window.partitionBy("Borough").orderBy(F.desc("revenue"))

report = (
    trips.filter((F.col("fare_amount") > 10.0) & (F.col("trip_distance") > 0))
    .join(zones, trips["PULocationID"] == zones["LocationID"], "inner")
    .groupBy("Borough", "Zone")
    .agg(
        F.sum("total_amount").alias("revenue"),
        F.count("*").alias("trips"),
        F.avg("tip_amount").alias("avg_tip"),
        F.avg("trip_distance").alias("avg_distance"),
    )
    .withColumn("rank", F.row_number().over(window))
    .orderBy("Borough", "rank")
)

report.write.partitionBy("Borough").mode("overwrite").parquet(
    "s3a://warehouse/notebook_report/"
)
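# Count the written output; calling report.count() here would re-run the whole pipeline.
written = spark.read.parquet("s3a://warehouse/notebook_report/").count()
print(f"Wrote {written:,} rows to s3a://warehouse/notebook_report/")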
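
`partitionBy("Borough")` writes one sub-directory per distinct borough, using the Hive-style `column=value` layout, and stores the partition value in the path instead of in the data files. The sketch below models the resulting directory names. It is a simplification: the helper `partition_dirs` is hypothetical, it ignores the escaping Spark applies to some special characters, and the file names inside each directory are generated by Spark.

```python
from typing import Iterable, List


def partition_dirs(base: str, column: str, values: Iterable[str]) -> List[str]:
    """Return Hive-style partition directories (simplified: no character escaping)."""
    root = base.rstrip("/")
    return [f"{root}/{column}={value}/" for value in sorted(set(values))]


# Invented sample values; duplicates collapse into one directory per value.
dirs = partition_dirs(
    "s3a://warehouse/notebook_report/", "Borough", ["Queens", "Bronx", "Queens"]
)
for d in dirs:
    print(d)
```

With Spark's default static overwrite mode, `mode("overwrite")` replaces everything under the output path, including partitions that the new run does not produce.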

## Read Back Results

In [None]:
saved = spark.read.parquet("s3a://warehouse/notebook_report/")
saved.filter(F.col("rank") <= 3).orderBy("Borough", "rank").show(20, truncate=False)

## Cleanup

In [None]:
spark.stop()
print("SparkSession stopped.")