# DAY 6 – Medallion Architecture (Bronze → Silver → Gold)

**Dataset:** E-commerce Events (Oct & Nov 2019)

**Source:**
- /Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv
- /Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv

**Target:** Delta Lake tables using Medallion Architecture

## Architecture Overview

- **Bronze**: Raw ingested events (append-only)
- **Silver**: Cleaned & deduplicated events
- **Gold**: Business aggregates for analytics

## 🥉 Bronze Layer – Raw Ingestion (Oct + Nov)

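In [None]:
from pyspark.sql import functions as F

# Read each monthly CSV export; inferSchema costs an extra pass over the file.
oct_df = spark.read.csv(
    "/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv",
    header=True,
    inferSchema=True
)

nov_df = spark.read.csv(
    "/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv",
    header=True,
    inferSchema=True
)

# Combine both months, matching columns by name rather than by position.
raw_events = oct_df.unionByName(nov_df)

# Stamp every row with its load time so each ingestion run can be traced.
bronze = raw_events.withColumn("ingestion_ts", F.current_timestamp())

# "overwrite" rebuilds the table on every run, which keeps this notebook
# re-runnable; a truly append-only Bronze load would use mode("append").
bronze.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("bronze_events")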
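
Schema inference makes Spark scan each CSV an extra time, and the two months are inferred independently. A minimal sketch of reading with an explicit schema instead is shown below. It assumes the files follow a nine-column layout: the columns `category_id` and `brand`, every type chosen here, and the helper `read_events` are assumptions that do not appear in this notebook and should be checked against the actual CSV header, including the column order.

```python
# Assumed column layout; verify names and order against the real CSV header.
EVENT_COLUMNS = {
    "event_time": "STRING",      # kept as text here; Silver parses it later
    "event_type": "STRING",
    "product_id": "BIGINT",
    "category_id": "BIGINT",     # assumption: not referenced in this notebook
    "category_code": "STRING",
    "brand": "STRING",           # assumption: not referenced in this notebook
    "price": "DOUBLE",
    "user_id": "BIGINT",
    "user_session": "STRING",
}

# DataFrameReader.csv accepts a DDL-formatted string as its schema argument.
EVENT_SCHEMA_DDL = ", ".join(
    f"{name} {dtype}" for name, dtype in EVENT_COLUMNS.items()
)


def read_events(spark, path):
    """Read one monthly CSV with a fixed schema instead of inferSchema=True."""
    return spark.read.csv(path, header=True, schema=EVENT_SCHEMA_DDL)
```

With the notebook's session, this would be called as `oct_df = read_events(spark, "/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv")`, and likewise for November.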

## 🥈 Silver Layer – Clean & Validate Data

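In [None]:
bronze = spark.table("bronze_events")

# Validation rules, applied in order:
#   1. keep prices in (0, 10000); rows with a NULL price are dropped as well
#   2. drop repeated events sharing session, timestamp, and event type
#   3. cast event_time to a timestamp and derive event_date from it
#   4. bucket each price into a tier
silver = bronze.filter(F.col("price") > 0) \
    .filter(F.col("price") < 10000) \
    .dropDuplicates(["user_session", "event_time", "event_type"]) \
    .withColumn("event_time", F.to_timestamp("event_time")) \
    .withColumn("event_date", F.to_date("event_time")) \
    .withColumn(
        "price_tier",
        # The first matching branch wins; each upper bound is exclusive
        F.when(F.col("price") < 10, "budget")
         .when(F.col("price") < 50, "mid")
         .otherwise("premium")
    )

silver.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("silver_events")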
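
The `price_tier` rule is a chain of `when` clauses, so the first matching bound wins and each upper bound is exclusive. The plain-Python sketch below mirrors the Silver price filter and tier rule so the boundary cases can be checked without a cluster. The function names `is_valid_price` and `price_tier` are hypothetical helpers, not part of the notebook.

```python
from typing import Optional


def is_valid_price(price: Optional[float]) -> bool:
    """Mirror of the Silver filter: keep rows with 0 < price < 10000.

    A missing price is rejected, just as a NULL comparison in Spark
    does not pass a filter.
    """
    return price is not None and 0 < price < 10000


def price_tier(price: float) -> str:
    """Mirror of the when/otherwise chain: the first matching branch wins."""
    if price < 10:
        return "budget"
    if price < 50:
        return "mid"
    return "premium"


# Print the tier on both sides of each boundary.
for sample in (9.99, 10.0, 49.99, 50.0):
    print(sample, price_tier(sample))
```

A price of exactly 10 lands in `mid` and a price of exactly 50 lands in `premium`, because both comparisons use `<`.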

## 🥇 Gold Layer – Business Aggregates

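In [None]:
silver = spark.table("silver_events")

# Column names are wrapped in F.col(): a bare string passed as the value of
# F.when() is treated as a string literal, not as a column reference.
product_perf = silver.groupBy("product_id", "category_code") \
    .agg(
        # views and purchases count distinct users, not raw events
        F.countDistinct(F.when(F.col("event_type") == "view", F.col("user_id"))).alias("views"),
        F.countDistinct(F.when(F.col("event_type") == "purchase", F.col("user_id"))).alias("purchases"),
        F.sum(F.when(F.col("event_type") == "purchase", F.col("price"))).alias("revenue")
    ) \
    .withColumn(
        "conversion_rate",
        # Leave the rate NULL when a product has no views (avoids division by zero)
        F.when(F.col("views") > 0, F.col("purchases") / F.col("views") * 100)
    )

product_perf.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("gold_product_performance")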
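
To make the Gold metrics concrete, the sketch below recomputes them in plain Python for a handful of made-up events. It assumes that `views` and `purchases` are meant to count distinct users rather than raw events. The sample rows and the helper `product_metrics` are illustrative only, and unlike Spark's `sum`, this sketch reports a revenue of 0.0 rather than NULL for products without purchases.

```python
from collections import defaultdict


def product_metrics(events):
    """Per product: distinct viewers, distinct buyers, revenue, conversion %."""
    products = set()
    viewers = defaultdict(set)
    buyers = defaultdict(set)
    revenue = defaultdict(float)
    for e in events:
        pid = e["product_id"]
        products.add(pid)
        if e["event_type"] == "view":
            viewers[pid].add(e["user_id"])
        elif e["event_type"] == "purchase":
            buyers[pid].add(e["user_id"])
            revenue[pid] += e["price"]
    result = {}
    for pid in sorted(products):
        views, purchases = len(viewers[pid]), len(buyers[pid])
        result[pid] = {
            "views": views,
            "purchases": purchases,
            "revenue": revenue[pid],
            # No viewers means the rate is undefined, so report None.
            "conversion_rate": purchases / views * 100 if views else None,
        }
    return result


# Made-up sample rows, not taken from the dataset.
events = [
    {"product_id": 1, "user_id": "a", "event_type": "view", "price": 20.0},
    {"product_id": 1, "user_id": "a", "event_type": "view", "price": 20.0},
    {"product_id": 1, "user_id": "b", "event_type": "view", "price": 20.0},
    {"product_id": 1, "user_id": "b", "event_type": "purchase", "price": 20.0},
    {"product_id": 2, "user_id": "c", "event_type": "purchase", "price": 5.0},
]
metrics = product_metrics(events)
print(metrics)
```

Product 1 has two distinct viewers and one buyer, so its conversion rate is 50%. Product 2 was purchased without any recorded view, so its rate is left undefined instead of dividing by zero.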

## ✅ Key Takeaways

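- Medallion architecture separates raw, cleaned, and aggregated data, so data quality improves at each layer
- Bronze keeps the raw events and adds an ingestion timestamp
- Silver applies validation rules: price bounds, deduplication, typed timestamps, and a price tier
- Gold serves analytics-ready metrics per product: views, purchases, revenue, and conversion rate
- Delta tables make incremental (append or merge) pipelines reliable; this notebook uses full overwrites for simplicity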
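
As an illustration of the last point, an incremental Silver load could insert only unseen events with Delta Lake's `MERGE INTO` instead of overwriting the table. This is a sketch under assumptions: the helper `build_merge_sql`, the source view name `silver_updates`, and the choice of merge keys (the same columns the Silver step deduplicates on) are hypothetical and not part of the notebook.

```python
def build_merge_sql(target, source, keys):
    """Build a Delta Lake MERGE statement that inserts only unseen rows.

    The null-safe operator <=> is used so rows with NULL key values
    (for example a missing user_session) can still match each other.
    """
    condition = " AND ".join(f"t.{k} <=> s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {condition}\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = build_merge_sql(
    "silver_events",
    "silver_updates",  # hypothetical temp view holding the new batch
    ["user_session", "event_time", "event_type"],
)
print(sql)
```

In a notebook, the new batch would first be registered with `createOrReplaceTempView("silver_updates")` and the statement run with `spark.sql(sql)`. `INSERT *` requires the batch to have the same columns as the target table.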