# Lesson 15 - Medallion Architecture

---

## PySpark Technical Notes: Lesson 15 - The Medallion Architecture for Data Lakehouses

### Introduction

As organizations collect vast amounts of data, structuring and refining this data for reliable analytics and machine learning becomes paramount. The "Data Lakehouse" paradigm, combining the flexibility of data lakes with the management features of data warehouses, often employs architectural patterns to manage data flow and quality. The **Medallion Architecture** (often associated with Databricks and Delta Lake) is a popular pattern that organizes data into distinct layers – Bronze, Silver, and Gold – representing progressive stages of refinement. This lesson explores this architecture, principles for building robust PySpark data pipelines to implement it, and techniques for managing data quality throughout the process.

### The Medallion Architecture Explained

**Theory:**

The Medallion Architecture provides a blueprint for logically organizing data within a data lake or lakehouse. It promotes incremental data improvement, auditability, and decoupling of data processing stages. Data flows through successive layers, each serving a different purpose and audience.

1.  **Bronze Layer (Raw Data):**
    *   **Purpose:** Ingestion point for all source data into the lakehouse. Captures data in its original, unaltered state ("as-is"). Think of it as the historical archive.
    *   **Characteristics:**
        *   Often mirrors source system structures.
        *   Minimal transformations (perhaps only metadata addition like load timestamps, source identifiers).
        *   Schema is often inferred or captured as it arrives (schema-on-read).
        *   Immutable: Data is typically appended, preserving the raw history. Older records are not updated in place.
        *   Long retention periods.
    *   **Formats:** Can be diverse (JSON, CSV, Avro, Parquet, raw logs, database CDC streams). Storing Bronze as Delta Lake even here can offer benefits like schema evolution tracking and time travel (plain Parquet provides efficient columnar storage but not these table-level features). Either way, the *content* remains raw.
    *   **Consumers:** Data engineers, data scientists (for exploration or fixing upstream issues). Rarely queried directly by analysts.
    *   **Analogy:** The raw, unrefined metal ore.

2.  **Silver Layer (Cleansed & Conformed Data):**
    *   **Purpose:** Provides a validated, enriched, and more structured view of the data. This is where data quality rules, cleansing, and basic transformations (like joining reference data) occur.
    *   **Characteristics:**
        *   Data is cleansed (e.g., handling nulls, standardizing formats, type casting).
        *   Data is validated against quality rules. Invalid data might be quarantined or flagged.
        *   Schema is more defined and enforced (schema-on-write).
        *   Data is often joined or conformed (e.g., standardizing codes, aligning data from different sources).
        *   Represents a single "source of truth" for key business entities (e.g., customers, products).
        *   May involve some level of normalization or semi-denormalization.
    *   **Formats:** Typically query-optimized columnar storage: Delta Lake tables (Parquet data files plus a transaction log) or plain Parquet. Delta Lake is highly recommended here for its ACID transactions, time travel (useful for debugging), and schema enforcement/evolution capabilities.
    *   **Consumers:** Data engineers, data scientists (for feature engineering), data analysts (for ad-hoc querying).
    *   **Analogy:** The refined, shaped, but unpolished silver medallion.

3.  **Gold Layer (Curated Business-Level Data):**
    *   **Purpose:** Delivers highly refined, aggregated data views optimized for specific business use cases, analytics, and reporting.
    *   **Characteristics:**
        *   Business-centric: Organized around business dimensions and measures.
        *   Often denormalized and aggregated to support specific reporting needs (e.g., star schemas, data marts).
        *   Focuses on performance for analytical queries.
        *   Contains derived metrics, KPIs, and potentially features for ML models.
        *   "Ready-to-consume" data.
    *   **Formats:** Almost always Delta Lake or Parquet for performance and reliability.
    *   **Consumers:** Business analysts, data scientists (consuming features), BI dashboards, reporting tools.
    *   **Analogy:** The polished, finished, valuable gold medallion ready for display/use.
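
The flow through the three layers can be sketched in plain Python, with no Spark involved, to make the responsibility of each layer concrete. This is only an illustrative model: the sample records, field names, and the functions `to_bronze`, `to_silver`, and `to_gold` are assumptions made for this sketch, not part of any library.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw events as they might arrive from a source system.
RAW_EVENTS = [
    {"event_id": "e1", "event_timestamp": "2024-01-01T10:00:00", "user_id": "7", "value": "2.5"},
    {"event_id": "e2", "event_timestamp": "2024-01-01T11:30:00", "user_id": "7", "value": "1.5"},
    {"event_id": None, "event_timestamp": "2024-01-01T12:00:00", "user_id": "9", "value": "4.0"},
]

def to_bronze(raw, source):
    """Bronze: keep every record as-is and only add ingestion metadata."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [{**r, "source": source, "ingest_timestamp": loaded_at} for r in raw]

def to_silver(bronze):
    """Silver: skip records missing critical fields and cast types."""
    silver = []
    for r in bronze:
        if r["event_id"] is None or r["event_timestamp"] is None:
            continue  # a real pipeline would typically quarantine these
        silver.append({
            "event_id": r["event_id"],
            "event_timestamp": datetime.fromisoformat(r["event_timestamp"]),
            "user_id": int(r["user_id"]),
            "event_value": float(r["value"]),
        })
    return silver

def to_gold(silver):
    """Gold: aggregate to one row per (event_date, user_id)."""
    totals = defaultdict(lambda: {"daily_event_count": 0, "total_daily_value": 0.0})
    for r in silver:
        key = (r["event_timestamp"].date().isoformat(), r["user_id"])
        totals[key]["daily_event_count"] += 1
        totals[key]["total_daily_value"] += r["event_value"]
    return dict(totals)

bronze = to_bronze(RAW_EVENTS, source="source_x")
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver))  # 3 2
print(gold)
```

Bronze retains all three records, including the one with a missing `event_id`, so the invalid record can still be inspected or reprocessed later. Only Silver and Gold reflect the validation rule.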

**Benefits of the Medallion Architecture:**

*   **Improved Data Quality:** Incremental validation and cleaning steps.
*   **Auditability & Replayability:** Raw data in Bronze allows reprocessing if logic changes or errors are found later. Delta Lake's time travel enhances this.
*   **Decoupling:** Changes in source systems primarily affect Bronze -> Silver pipelines. Changes in reporting needs primarily affect Silver -> Gold pipelines.
*   **Self-Service:** Different user groups can reliably access data at the appropriate level of refinement.
*   **Simplified Debugging:** Easier to trace data issues back through the layers.

### Building Robust Data Pipelines with PySpark

**Theory:**

Implementing the Medallion Architecture requires building reliable data pipelines. Robustness means the pipeline is resilient to failures, produces consistent results, is maintainable, and can handle evolving requirements. Key principles include:

1.  **Idempotency:** Running the pipeline multiple times with the same input should produce the same output state. This is crucial for recovery from failures. Delta Lake's `MERGE` operation and transactional writes are key enablers.
2.  **Modularity:** Breaking down complex pipelines into smaller, reusable functions or stages (e.g., separate PySpark jobs/scripts for Bronze->Silver and Silver->Gold).
3.  **Error Handling & Logging:** Implementing `try...except` blocks for I/O operations or complex transformations. Comprehensive logging helps diagnose issues.
4.  **Configuration Management:** Externalizing configurations (paths, connection strings, business rules) instead of hardcoding them.
5.  **Monitoring & Alerting:** Tracking pipeline execution status, data volumes, and quality metrics. Setting up alerts for failures or anomalies.
6.  **Schema Management:** Handling schema changes gracefully (using Delta Lake's schema evolution or explicit schema definitions).
7.  **Testing:** Implementing unit and integration tests for transformation logic and data quality rules.
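
To illustrate why idempotency matters for retries, here is a minimal plain-Python sketch that contrasts an append-style write with a keyed upsert. The functions `append_write` and `merge_write` are hypothetical stand-ins that loosely model the semantics; they do not reproduce Delta-specific behavior such as how `MERGE` treats duplicate keys in the source.

```python
def append_write(table, batch):
    """Non-idempotent: re-running with the same batch duplicates rows."""
    return table + list(batch)

def merge_write(table, batch, key="event_id"):
    """Keyed upsert: matched keys are updated, unmatched keys are inserted."""
    by_key = {row[key]: row for row in table}
    for row in batch:
        by_key[row[key]] = row
    return list(by_key.values())

batch = [{"event_id": "e1", "value": 1.0}, {"event_id": "e2", "value": 2.0}]

# Simulate a job that is retried with the same input after a failure.
appended = append_write(append_write([], batch), batch)
merged = merge_write(merge_write([], batch), batch)

print(len(appended))  # 4
print(len(merged))    # 2
```

With the keyed upsert, the table state after the retry is the same as after the first run, which is the property the pipeline relies on when recovering from failures.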

**PySpark Implementation Aspects:**

*   **Structuring Jobs:** Organize PySpark code into functions or classes for readability and reuse (e.g., `ingest_to_bronze()`, `cleanse_to_silver()`, `aggregate_to_gold()`).
*   **Parameterization:** Use command-line arguments or configuration files to pass parameters like input/output paths, dates, etc.
*   **Delta Lake:** Leverage Delta Lake (`.format("delta")`) for reading and writing, especially for Silver and Gold layers, to gain ACID transactions, time travel, schema enforcement, and `MERGE` capabilities for idempotency.
*   **Partitioning:** Strategically partition data at each layer to optimize read/write performance based on common query patterns.
    *   Bronze: Often partitioned by ingestion date (`/bronze/source_a/ingest_date=YYYY-MM-DD/`).
    *   Silver: Might be partitioned by event date or key business identifiers (`/silver/events/event_date=YYYY-MM-DD/`, `/silver/customers/country=US/`).
    *   Gold: Partitioned based on primary query dimensions (`/gold/sales_summary/region=EU/year=YYYY/`).
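
As a sketch of parameterization, the snippet below reads job parameters from the command line and derives a Hive-style Bronze partition path from them. The class `PipelineConfig`, the argument names, and the helper `bronze_partition_path` are assumptions chosen for illustration. Note that when writing with `partitionBy`, Spark creates the partition directories itself; a helper like this is mainly useful for targeting one partition on read or for tooling outside Spark.

```python
import argparse
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class PipelineConfig:
    lakehouse_root: str
    source_name: str
    run_date: date

def parse_args(argv=None) -> PipelineConfig:
    """Parse job parameters instead of hardcoding them in the script."""
    parser = argparse.ArgumentParser(description="Medallion pipeline parameters")
    parser.add_argument("--lakehouse-root", required=True)
    parser.add_argument("--source-name", required=True)
    # date.fromisoformat rejects malformed dates, so argparse reports an error
    parser.add_argument("--run-date", required=True, type=date.fromisoformat)
    ns = parser.parse_args(argv)
    return PipelineConfig(ns.lakehouse_root.rstrip("/"), ns.source_name, ns.run_date)

def bronze_partition_path(cfg: PipelineConfig) -> str:
    """Build the Hive-style partition directory for one ingestion date."""
    return (
        f"{cfg.lakehouse_root}/bronze/{cfg.source_name}/"
        f"ingest_date={cfg.run_date.isoformat()}/"
    )

cfg = parse_args([
    "--lakehouse-root", "/mnt/lakehouse",
    "--source-name", "source_a",
    "--run-date", "2024-01-15",
])
print(bronze_partition_path(cfg))
# /mnt/lakehouse/bronze/source_a/ingest_date=2024-01-15/
```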

**Code Example: Conceptual Pipeline Stages**

```python
import logging  # Configure handlers and levels appropriately in a real application

from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F  # Used below as F.to_date, F.count, F.sum
from pyspark.sql.functions import col, current_timestamp, input_file_name, lit

# Initialize SparkSession with Delta Lake support
spark = SparkSession.builder \
    .appName("MedallionPipelineExample") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# --- Configuration ---
raw_source_path = "/mnt/landing/source_x/*.json" # Example path
bronze_table_path = "/mnt/lakehouse/bronze/source_x_events"
silver_table_path = "/mnt/lakehouse/silver/events"
gold_table_path = "/mnt/lakehouse/gold/daily_event_summary"
quarantine_path = "/mnt/lakehouse/quarantine/events"

# --- Bronze Layer ---
def ingest_to_bronze(source_path: str, bronze_path: str):
    """Ingests raw data, adds metadata, and appends to Bronze Delta table."""
    try:
        print(f"Ingesting from {source_path} to {bronze_path}")
        raw_df = spark.read.format("json").load(source_path) # Adjust format as needed

        # Add ingestion metadata
        bronze_df = raw_df.withColumn("ingest_timestamp", current_timestamp()) \
                         .withColumn("source_file", input_file_name())

        # Append to Bronze table; mergeSchema lets new source columns be added.
        # Parentheses (not backslashes) allow inline comments inside the chain.
        (bronze_df.write.format("delta")
                  .mode("append")
                  .option("mergeSchema", "true")  # Allow schema evolution
                  .save(bronze_path))
        print("Ingestion to Bronze successful.")
        return True
    except Exception as e:
        logging.error(f"Error ingesting to Bronze: {e}")
        return False

# --- Silver Layer (including Data Quality) ---
# (See Data Quality section below for DQ implementation details)
def process_to_silver(bronze_path: str, silver_path: str, quarantine_path: str):
    """Reads from Bronze, applies cleaning/validation, writes good data to Silver, bad to quarantine."""
    try:
        print(f"Processing from {bronze_path} to {silver_path}")
        bronze_df = spark.read.format("delta").load(bronze_path)
        # --- Placeholder for Data Quality Checks ---
        # validated_df = apply_quality_checks(bronze_df) # Returns DF with quality flags/results
        # good_df = validated_df.filter(col("quality_status") == "PASS")
        # bad_df = validated_df.filter(col("quality_status") == "FAIL")
        # For simplicity here, let's assume basic filtering is the 'quality check'
        
        # Example: Assume 'event_id' and 'event_timestamp' are critical
        validated_df = bronze_df.filter(col("event_id").isNotNull() & col("event_timestamp").isNotNull())
        
        # Simple type casting and selection for Silver
        silver_df = validated_df.select(
            col("event_id").cast("string"),
            col("event_timestamp").cast("timestamp"),
            col("payload.user_id").alias("user_id").cast("long"), # Example transformation
            col("payload.value").alias("event_value").cast("double"),
            "source_file", # Keep some provenance
            "ingest_timestamp"
        )

        # Use MERGE for Idempotency (example assumes event_id is a unique key)
        if DeltaTable.isDeltaTable(spark, silver_path):
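            delta_table = DeltaTable.forPath(spark, silver_path)
            # MERGE errors out if several source rows match one target row,
            # so deduplicate on the key first (which duplicate is kept is arbitrary).
            source_df = silver_df.dropDuplicates(["event_id"])
            delta_table.alias("target") \
                .merge(source_df.alias("source"), "target.event_id = source.event_id") \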
                .whenMatchedUpdateAll() \
                .whenNotMatchedInsertAll() \
                .execute()
            print("Merge into Silver successful.")
        else:
            (silver_df.write.format("delta")
                      .mode("overwrite")
                      .partitionBy("ingest_timestamp")  # Example partitioning only
                      .save(silver_path))
            print("Initial write to Silver successful.")

        # --- Handle Bad Data (Example) ---
        # bad_df = bronze_df.filter(~(col("event_id").isNotNull() & col("event_timestamp").isNotNull()))
        # if bad_df.count() > 0:
        #    print(f"Writing {bad_df.count()} records to quarantine: {quarantine_path}")
        #    bad_df.withColumn("quarantine_reason", lit("Missing critical fields")) \
        #          .write.format("delta").mode("append").save(quarantine_path)

        return True
    except Exception as e:
        logging.error(f"Error processing to Silver: {e}")
        return False

# --- Gold Layer ---
def aggregate_to_gold(silver_path: str, gold_path: str):
    """Reads from Silver, performs business aggregation, writes to Gold."""
    try:
        print(f"Aggregating from {silver_path} to {gold_path}")
        silver_df = spark.read.format("delta").load(silver_path)

        # Example: Daily event count per user
        gold_df = silver_df.groupBy(F.to_date("event_timestamp").alias("event_date"), "user_id") \
                           .agg(F.count("*").alias("daily_event_count"),
                                F.sum("event_value").alias("total_daily_value"))

        # Overwrite the Gold table with the freshly computed snapshot
        (gold_df.write.format("delta")
                .mode("overwrite")
                .partitionBy("event_date")  # Optimize for queries filtering by date
                .option("overwriteSchema", "true")  # Allow schema changes in Gold aggregations
                .save(gold_path))
        print("Aggregation to Gold successful.")
        return True
    except Exception as e:
        logging.error(f"Error aggregating to Gold: {e}")
        return False

# --- Pipeline Execution ---
if ingest_to_bronze(raw_source_path, bronze_table_path):
    if process_to_silver(bronze_table_path, silver_table_path, quarantine_path):
        aggregate_to_gold(silver_table_path, gold_table_path)

# Stop SparkSession
spark.stop()

```

**Code Explanation:**

1.  **Initialization:** Sets up SparkSession with Delta Lake extensions.
2.  **Configuration:** Defines paths for different layers and quarantine zone. In production, use a proper config framework.
3.  **`ingest_to_bronze`:**
    *   Reads raw data (here, JSON).
    *   Adds metadata columns (`ingest_timestamp`, `source_file`) for provenance.
    *   Appends data to the Bronze Delta table using `.mode("append")`. `mergeSchema=true` allows adding new columns found in source data without failing the job.
4.  **`process_to_silver`:**
    *   Reads from the Bronze Delta table.
    *   **Placeholder for DQ:** Comments indicate where comprehensive DQ checks (`apply_quality_checks`) would fit. The example shows basic filtering (`isNotNull`).
    *   Performs transformations: selecting columns, aliasing (`alias`), casting types (`cast`).
    *   **Idempotency with MERGE:** Checks if the Silver table exists. If yes, uses `DeltaTable.merge()` to upsert data based on a key (`event_id`). This ensures that re-running the job for the same Bronze data doesn't create duplicates and can update existing records if needed. If the table doesn't exist, it performs an initial write.
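    *   **Partitioning:** Example shows partitioning Silver by `ingest_timestamp` on the initial write. Because `current_timestamp()` is fixed per query, this yields roughly one partition per ingestion run; in practice a date column (such as the event date) is usually a better key. Choose partitions based on query patterns.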
    *   **Quarantine Handling:** Comments show how filtered-out `bad_df` could be written to a separate quarantine location with a reason.
5.  **`aggregate_to_gold`:**
    *   Reads from the validated Silver table.
    *   Performs business aggregation (`groupBy`, `agg` with `count`, `sum`).
    *   Writes to the Gold Delta table, often using `.mode("overwrite")` for aggregate tables that represent a snapshot (like daily summaries). Partitioning (`event_date`) is crucial for query performance. `overwriteSchema=true` allows the schema of the aggregate table to change if the aggregation logic changes.
6.  **Pipeline Execution:** Calls the functions sequentially. In a real system, use a workflow orchestrator (like Airflow, Databricks Workflows, Azure Data Factory). Basic error checking is shown.
7.  **Error Handling:** Basic `try...except` blocks log errors (should be more detailed in production).
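
The stage functions above catch exceptions and return booleans. One possible alternative, sketched here under the assumption that stages are allowed to raise, is a small runner that logs each stage, stops at the first failure, and reports which stages had already completed. The names `run_stages` and `StageFailed` are hypothetical, and the stand-in stages replace the real Spark functions so the sketch runs anywhere.

```python
import logging
from typing import Callable, List, Tuple

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("medallion_pipeline")

class StageFailed(RuntimeError):
    """Raised when a stage fails; records what had completed before it."""
    def __init__(self, stage_name: str, completed: List[str]):
        super().__init__(f"stage '{stage_name}' failed")
        self.stage_name = stage_name
        self.completed = completed

def run_stages(stages: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run stages in order and stop at the first failure."""
    completed: List[str] = []
    for name, stage in stages:
        logger.info("starting stage %s", name)
        try:
            stage()
        except Exception as exc:
            logger.exception("stage %s failed", name)  # logs the traceback
            raise StageFailed(name, completed) from exc
        completed.append(name)
        logger.info("finished stage %s", name)
    return completed

# Stand-in stages; in the pipeline these would wrap the Spark functions.
def ok() -> None:
    pass

def broken() -> None:
    raise ValueError("bad input")

try:
    run_stages([("bronze", ok), ("silver", broken), ("gold", ok)])
except StageFailed as err:
    print(err.stage_name, err.completed)  # silver ['bronze']
```

Raising a dedicated exception lets an orchestrator mark the run as failed, while the logged traceback preserves the original cause for diagnosis.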

### Managing Data Quality

**Theory:**

Data quality (DQ) is integral to the Medallion Architecture, primarily enforced during the Bronze -> Silver transition. It involves defining rules, measuring data against those rules, and deciding how to handle non-compliant data.

**Common DQ Dimensions:**

*   **Completeness:** Are required fields populated? (e.g., `isNotNull`)
*   **Uniqueness:** Are identifiers unique? (e.g., `groupBy().count()` check)
*   **Validity/Accuracy:** Do values conform to expected formats or ranges? (e.g., regex checks for emails, range checks for numbers, valid enum values)
*   **Consistency:** Do related data points align across records or tables? (e.g., state/zip code consistency)
*   **Timeliness:** Is the data arriving within the expected timeframe?
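
Several of these dimensions can be expressed as simple dataset-level metrics. The sketch below uses plain Python on a handful of made-up records so it runs without a cluster; the function names and sample data are assumptions for illustration, and the consistency dimension is omitted because it depends on domain-specific reference data. In PySpark the same ideas map to null checks, `groupBy().count()`, and regex matching on columns.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# Hypothetical sample records
records = [
    {"event_id": "e1", "email": "a@example.com", "event_timestamp": datetime(2024, 1, 1, 10, 0)},
    {"event_id": "e1", "email": "not-an-email", "event_timestamp": datetime(2024, 1, 1, 11, 0)},
    {"event_id": None, "email": "b@example.com", "event_timestamp": datetime(2023, 12, 1, 9, 0)},
]

def completeness(rows, field):
    """Fraction of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def duplicate_keys(rows, field):
    """Non-null key values that occur more than once (uniqueness)."""
    counts = Counter(r[field] for r in rows if r[field] is not None)
    return sorted(k for k, n in counts.items() if n > 1)

def validity(rows, field, pattern):
    """Fraction of rows whose field matches the expected format."""
    return sum(bool(r[field] and pattern.match(r[field])) for r in rows) / len(rows)

def late_records(rows, field, as_of, max_age):
    """Rows older than the allowed age at evaluation time (timeliness)."""
    return [r for r in rows if as_of - r[field] > max_age]

as_of = datetime(2024, 1, 2)
print(completeness(records, "event_id"))    # 2 of 3 rows populated
print(duplicate_keys(records, "event_id"))  # ['e1']
print(validity(records, "email", EMAIL_RE)) # 2 of 3 rows valid
print(len(late_records(records, "event_timestamp", as_of, timedelta(days=7))))  # 1
```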

**Implementing DQ Checks in PySpark:**

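*   **Built-in Functions and `Column` Methods:** Leverage `pyspark.sql.functions` and the `Column` API extensively:
    *   `Column.isNull()`, `Column.isNotNull()` (or the function `isnull()`)
    *   `when().otherwise()` for conditional logic/flagging
    *   `length()`, `substring()` for string checks
    *   `Column.rlike()` for regex pattern matching
    *   `Column.cast()` for type validation (returns null for unparseable values by default; raises an error when ANSI mode is enabled)
    *   `assert_true()` to fail the job when a condition does not hold (though it is often better to filter/flag than fail the job)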
*   **User-Defined Functions (UDFs):** For highly complex or custom validation logic not covered by built-ins (use sparingly due to performance implications).
*   **Dedicated DQ Libraries:** Frameworks like `Deequ` (a Scala library built on Spark, usable from Python through PyDeequ) or Python libraries integrated via UDFs/Pandas UDFs can provide more structured DQ definition and reporting. Databricks also offers built-in expectations with Delta Live Tables.

**Handling Bad Records:**

*   **Filtering:** Simply drop invalid records (simplest, but data is lost).
*   **Flagging:** Add DQ metadata columns (e.g., `quality_status`, `validation_errors`) to records in the Silver layer. Allows downstream users to decide whether to use flagged records.
*   **Quarantining:** Move invalid records to a separate "quarantine" table or location, often with metadata about why they failed. Allows for later investigation and potential reprocessing. This is often the preferred approach.

**Code Example: Data Quality Checks (Conceptual Integration into `process_to_silver`)**

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F  # Used below as F.expr and F.size
from pyspark.sql.functions import array, col, isnull, lit, when

def apply_quality_checks(df: DataFrame) -> DataFrame:
    """Applies various data quality checks and returns DataFrame with results."""
    
    email_regex = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

    # Define checks as conditions
    check_event_id_null = isnull(col("event_id"))
    check_timestamp_null = isnull(col("event_timestamp"))
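    # Note: comparisons on null values evaluate to null, which `when` treats as
    # "not failed", so a missing user_id or email is not flagged by these two checks.
    check_user_id_negative = col("payload.user_id") < 0
    check_email_format = ~col("payload.user_email").rlike(email_regex) # Assuming email in payload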

    # Create flags or error messages for each check
    df_with_checks = df.withColumn("dq_errors", array(
        when(check_event_id_null, lit("event_id is null")).otherwise(lit(None)),
        when(check_timestamp_null, lit("event_timestamp is null")).otherwise(lit(None)),
        when(check_user_id_negative, lit("user_id is negative")).otherwise(lit(None)),
        when(check_email_format, lit("email format invalid")).otherwise(lit(None))
    ))

    # Filter out nulls from the error array
    df_with_filtered_errors = df_with_checks.withColumn(
        "dq_errors", F.expr("filter(dq_errors, x -> x is not null)")
    )
    
    # Determine overall status
    df_with_status = df_with_filtered_errors.withColumn(
        "quality_status",
        when(F.size(col("dq_errors")) == 0, "PASS").otherwise("FAIL")
    )
    
    return df_with_status

# --- Integrating into process_to_silver ---
# Inside the try block of process_to_silver:
# ... read bronze_df ...
validated_df = apply_quality_checks(bronze_df)
validated_df.cache() # Cache if filtering multiple times

good_df = validated_df.filter(col("quality_status") == "PASS").drop("dq_errors", "quality_status")
bad_df = validated_df.filter(col("quality_status") == "FAIL")

# --- Process good_df to Silver (using MERGE etc. as before) ---
# Select and transform columns from good_df
silver_ready_df = good_df.select(...) 
# ... merge silver_ready_df into silver_table_path ...

# --- Write bad_df to Quarantine ---
bad_count = bad_df.count()  # Count once; every count() call triggers a Spark job
if bad_count > 0:  # Skip the write when there is nothing to quarantine
    print(f"Writing {bad_count} records to quarantine: {quarantine_path}")
    bad_df.write.format("delta").mode("append").partitionBy("ingest_timestamp").save(quarantine_path)

validated_df.unpersist()
# ... rest of the function ...
```

**Code Explanation (DQ):**

1.  **`apply_quality_checks` Function:** Encapsulates DQ logic.
2.  **Define Checks:** Boolean Spark SQL expressions define conditions for failure (e.g., `isnull`, `<`, `~rlike`).
3.  **Generate Error Messages:** Uses `when` to create string messages if a check fails, otherwise `null`. Collects these into an array column `dq_errors`.
4.  **Filter Nulls:** Removes the `null` entries from the `dq_errors` array using `F.expr("filter(...)")`.
5.  **Determine Status:** Checks the size of the filtered `dq_errors` array. If empty, status is "PASS"; otherwise, "FAIL".
6.  **Integration:**
    *   Calls `apply_quality_checks` on the Bronze DataFrame.
    *   Caches the result as it's used multiple times (for filtering good/bad).
    *   Filters into `good_df` and `bad_df` based on `quality_status`.
    *   Processes `good_df` for the Silver layer (selecting needed columns, dropping DQ columns).
    *   Writes `bad_df` (including the `dq_errors` column) to the quarantine Delta table.
    *   Unpersists the cached DataFrame.
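
Because the rules are just boolean conditions paired with messages, it can help to mirror them at the record level so the rule logic can be unit-tested without a Spark session. The following is a sketch under that assumption: `RULES` and `check_record` are hypothetical names, the record is a flat dictionary rather than a nested `payload` struct, and null values are deliberately treated as "not failed" for the range and format rules to match how null comparisons behave inside `when` in Spark.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# Each rule pairs an error message with a predicate that is True on FAILURE.
RULES = [
    ("event_id is null", lambda r: r.get("event_id") is None),
    ("event_timestamp is null", lambda r: r.get("event_timestamp") is None),
    ("user_id is negative",
     lambda r: r.get("user_id") is not None and r["user_id"] < 0),
    ("email format invalid",
     lambda r: r.get("user_email") is not None and not EMAIL_RE.match(r["user_email"])),
]

def check_record(record):
    """Return the record with dq_errors and quality_status attached."""
    errors = [message for message, failed in RULES if failed(record)]
    status = "PASS" if not errors else "FAIL"
    return {**record, "dq_errors": errors, "quality_status": status}

good = check_record({
    "event_id": "e1", "event_timestamp": "2024-01-01T10:00:00",
    "user_id": 7, "user_email": "a@example.com",
})
bad = check_record({
    "event_id": None, "event_timestamp": "2024-01-01T10:00:00",
    "user_id": -3, "user_email": None,
})

print(good["quality_status"], good["dq_errors"])  # PASS []
print(bad["quality_status"], bad["dq_errors"])
# FAIL ['event_id is null', 'user_id is negative']
```

Building the error list only from failing rules plays the same role as filtering nulls out of the `dq_errors` array, and an empty list corresponds to the "PASS" status.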

### Conclusion

The Medallion Architecture provides a structured, scalable approach to building reliable data lakehouses. By progressively refining data through Bronze, Silver, and Gold layers using robust PySpark pipelines, organizations can improve data quality, ensure auditability, and deliver trustworthy data for analytics and AI. Implementing comprehensive data quality checks, leveraging Delta Lake features for reliability and idempotency, and adopting sound pipeline engineering principles (modularity, logging, configuration) are crucial for the success of this architecture. PySpark provides the powerful tools and APIs necessary to build and manage these sophisticated data processing workflows effectively.

---
**End of Lesson 15 Notes**