# **PHASE 2: DATA ENGINEERING (Days 5-8)**

## **DAY 6 (14/01/26) - Delta Lake Advanced**



### **Section 1 - Learn**

### **_1. Bronze (raw) → Silver (cleaned) → Gold (aggregated)_**

The **Medallion Architecture** (also known as "Multi-hop" architecture) is a data design pattern used to organize data within a Lakehouse. It incrementally improves the structure and quality of data as it flows through each layer.

##### **1. Bronze Layer (The Raw Data)**

* **The "Landing Zone":** Data is ingested exactly as it appears in the source system (REST APIs, IoT sensors, Databases).
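* **Format:** Keeps the structure and fields of the source records (which may have arrived as JSON, CSV, or Parquet), but is usually persisted as a Delta table so that it gains ACID transactions and time travel.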
* **Retention:** Acts as a permanent historical archive. Because it is never modified, you can always re-process the data if your logic changes later.
* **Key Characteristic:** Contains "dirty" data, including duplicates, nulls, and raw timestamps.

##### **2. Silver Layer (The Cleaned & Standardized Data)**

* **The "Integration Layer":** Data from the Bronze layer is parsed, cleaned, and normalized.
* **Transformation:** Includes filtering out bad records, handling missing values, standardizing date formats, and joining related tables (e.g., joining "Sales" with "Product Names").
* **Schema Enforcement:** This is where strict schemas are applied to ensure high-quality data.
* **Purpose:** This layer is typically used by Data Scientists and Power Analysts for ad-hoc exploration.

##### **3. Gold Layer (The Business-Ready Data)**

* **The "Presentation Layer":** Data is organized into consumption-ready "Project" or "Department" views.
* **Transformation:** Involves heavy aggregations (e.g., "Daily Revenue by Region") and complex business logic.
* **Format:** Often organized in a **Star Schema** (Fact and Dimension tables) optimized for speed and reporting.
* **Purpose:** Powers production BI dashboards (Tableau, PowerBI) and final executive reporting.


##### **Summary of the Layers**

| Feature | Bronze | Silver | Gold |
| --- | --- | --- | --- |
| **Data Quality** | Raw / Dirty | Cleaned / Conformed | Aggregated / Validated |
| **Primary User** | Data Engineers | Data Scientists / Analysts | Business Users / Execs |
| **Complexity** | Low (Ingest) | Medium (Clean/Join) | High (Business Logic) |
| **Truth** | Source Truth | Technical Truth | Business Truth |

##### **Why Use This Architecture?**

* **Traceability:** If a number in a Gold dashboard looks wrong, you can trace it back through Silver to the raw Bronze data to find the bug.
* **Efficiency:** You don't have to re-clean the raw data every time you want to build a new report; you simply start from the already-cleaned Silver layer.
* **Security:** You can restrict access so that regular analysts only see the "Gold" or "Silver" data, keeping sensitive "Bronze" PII data hidden.
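
To make the flow concrete, here is a toy, Spark-free sketch of the three hops using plain Python lists. The sample records and the helper names `to_silver` and `to_gold` are made up for illustration; a real pipeline would apply the same steps to Delta tables.

```python
from collections import defaultdict

# Bronze: records exactly as received (hypothetical sample data).
bronze = [
    {"order_id": "1", "region": " EU ", "amount": "10.50"},
    {"order_id": "1", "region": " EU ", "amount": "10.50"},  # duplicate delivery
    {"order_id": "2", "region": "us", "amount": None},        # missing amount
    {"order_id": "3", "region": "US", "amount": "4.00"},
]


def to_silver(rows):
    """Drop incomplete rows, deduplicate on order_id, and standardize types."""
    seen, cleaned = set(), []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return cleaned


def to_gold(rows):
    """Aggregate revenue per region for reporting."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["region"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'EU': 10.5, 'US': 4.0}
```

Because `bronze` is left untouched, changing the cleaning rules later only means re-running `to_silver` and `to_gold` on the same raw records.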

---

### **_2. Best practices for each layer_**

Maintaining a Medallion architecture requires a balance between data integrity and processing speed. Here are the industry best practices for each layer:

##### **1. Bronze Layer (Raw)**

* **Keep it Immutable:** Never update or delete rows in Bronze. It should be an "append-only" record of exactly what came from the source.
* **Minimal Transformation:** Avoid any complex logic here. Only add "metadata" columns like `ingestion_timestamp` and `source_file_name` to help with auditing.
* **Retain Original Data Types:** Store data as strings or raw types if necessary to prevent ingestion failures due to schema changes at the source.
* **Organize by Ingestion Date:** Use a folder structure or partition by `year/month/day` of ingestion to make incremental processing easier.
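
One way these Bronze practices might look in PySpark is sketched below. The helper name `append_to_bronze` and the metadata column names are illustrative assumptions, and `_metadata.file_path` is only available when the DataFrame was read directly from files.

```python
# Audit columns added to every Bronze load (names are illustrative).
BRONZE_METADATA_EXPRS = [
    "current_timestamp() AS ingestion_timestamp",
    "_metadata.file_path AS source_file_name",  # hidden column on file-based reads
    "current_date() AS ingestion_date",
]


def append_to_bronze(df, table_name):
    """Append df to a Bronze Delta table without altering the source columns."""
    (
        df.selectExpr("*", *BRONZE_METADATA_EXPRS)
        .write.format("delta")
        .mode("append")                  # append-only: existing rows are never rewritten
        .partitionBy("ingestion_date")   # one partition per ingestion day
        .saveAsTable(table_name)
    )
```

The source columns pass through as `*`, so no business logic leaks into this layer.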

##### **2. Silver Layer (Cleaned)**

* **Enforce Strict Schemas:** This is the gatekeeper. Use Delta Lake's schema enforcement to ensure all columns have the correct data types.
* **Deduplicate Early:** Use `dropDuplicates()` or Delta `MERGE` to remove any redundant records that might have come from multiple Bronze loads.
* **Standardize Formats:** Convert all timestamps to UTC, normalize currency codes, and trim whitespace from string columns.
* **Apply Data Masking:** If you have PII (Personally Identifiable Information), mask or pseudonymize it here so that downstream analysts in the Gold layer don't see sensitive data.
* **Z-Order on Join Keys:** Z-Order the Silver tables on columns frequently used for joins or filters to speed up Gold layer aggregations.
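
A minimal sketch of the standardizing, masking, and deduplicating steps, written as Spark SQL expressions. The column names (`order_id`, `event_time`, `currency`, `email`) and the source time zone are hypothetical. Note that hashing with `sha2` is pseudonymization, not full anonymization.

```python
# Spark SQL expressions for a hypothetical orders feed.
SILVER_EXPRS = [
    "order_id",
    # Interpret event_time as local New York time and convert it to UTC.
    "to_utc_timestamp(event_time, 'America/New_York') AS event_time_utc",
    "upper(trim(currency)) AS currency",
    "sha2(email, 256) AS email_hash",  # pseudonymize the PII column
]


def to_silver(bronze_df):
    """Standardize and mask columns, then keep one row per order_id."""
    return bronze_df.selectExpr(*SILVER_EXPRS).dropDuplicates(["order_id"])
```

`dropDuplicates` keeps an arbitrary row per key; if the newest record must win, a Delta `MERGE` keyed on `order_id` is the safer choice.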

##### **3. Gold Layer (Aggregated)**

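* **Optimize for Read Performance:** BI tools (PowerBI/Tableau) query this layer directly, so run `OPTIMIZE` to compact small files and `ZORDER BY` the columns that dashboards filter on most. This reduces the data scanned per query; actual response times still depend on table size and compute capacity.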
* **Use Star Schemas:** Organize data into Fact tables (events/transactions) and Dimension tables (users/products) to make it intuitive for business users.
* **Implement Business Logic:** This is the only layer where complex business definitions (e.g., "What defines a Churned Customer?") should live.
* **Minimize Column Count:** Don't include every available column. Only provide the data necessary for the specific dashboard or report to reduce I/O overhead.
* **Use Descriptive Naming:** Ensure column names make sense to non-technical users (e.g., use `total_revenue` instead of `sum_rev_amt_usd`).
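
As a rough illustration of the read-performance advice, a small helper can build the `OPTIMIZE ... ZORDER BY` statement for a Gold table. The helper name `optimize_statement` is hypothetical.

```python
def optimize_statement(table_name, zorder_columns=()):
    """Build an OPTIMIZE statement, optionally Z-Ordering by the given columns."""
    statement = f"OPTIMIZE {table_name}"
    if zorder_columns:
        statement += " ZORDER BY (" + ", ".join(zorder_columns) + ")"
    return statement


print(optimize_statement("workspace.gold.products", ["product_id"]))
# OPTIMIZE workspace.gold.products ZORDER BY (product_id)
```

On Databricks the result would be passed to `spark.sql(...)`, typically after each Gold refresh rather than on every query.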

##### **General Cross-Layer Best Practices**

* **Incremental Processing:** Use **Auto Loader** for file ingestion and Structured Streaming or **Delta Live Tables (DLT)** for the hops between layers, so that each run processes only new data rather than re-processing the entire dataset.
* **Automated Testing:** Run data quality checks (expectations) at each hop. For example, check that `order_id` is never null before moving data from Silver to Gold.
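* **Vacuum Regularly:** Schedule `VACUUM` to delete data files that the table no longer references and that are older than the retention threshold (7 days by default). This saves storage costs, but time travel cannot reach versions whose files have been removed, so keep enough history for recovery.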
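
The quality-gate and cleanup ideas above could be sketched as follows. All three helper names are hypothetical, and table and column names are interpolated directly into SQL, so they should only come from trusted configuration.

```python
def null_check_query(table_name, column):
    """Build a query that counts rows where `column` is NULL."""
    return f"SELECT COUNT(*) AS bad_rows FROM {table_name} WHERE {column} IS NULL"


def assert_no_nulls(spark, table_name, column):
    """Raise before the next hop if any row has a NULL in `column`."""
    bad_rows = spark.sql(null_check_query(table_name, column)).first()["bad_rows"]
    if bad_rows > 0:
        raise ValueError(f"{bad_rows} rows in {table_name} have NULL {column}")


def vacuum_statement(table_name, retain_hours=168):
    """Build a VACUUM statement; 168 hours matches Delta's 7-day default."""
    return f"VACUUM {table_name} RETAIN {retain_hours} HOURS"
```

A Silver-to-Gold job might call `assert_no_nulls(spark, "workspace.silver.ecommerce_cleaned", "order_id")` as its first step, so that bad data stops the run instead of reaching a dashboard.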

---

### **_3. Incremental processing patterns_**

Incremental processing is the practice of processing only the **new or changed data** since the last run, rather than re-processing the entire dataset. This is the cornerstone of efficient Lakehouse pipelines.

Here are the primary patterns used in Databricks:

#### 1. Auto Loader (Cloud Files)

* **The Concept:** Specifically designed to ingest millions of files from cloud storage (S3/ADLS/GCS) incrementally and efficiently.
* **Checkpointing:** Uses a directory to keep track of which files have already been processed, so it never pulls the same file twice even if the job restarts.
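* **Schema Inference & Evolution:** It can infer the schema of raw files and detect new columns. In the default evolution mode (`addNewColumns`), the stream stops when a new column appears and continues with the updated schema once it is restarted, so the job should be set to restart automatically.
* **Scalability:** Discovers new files through either directory listing mode or file notification mode (backed by cloud queue services such as AWS SNS/SQS), which lets it handle file volumes that would be very slow or impractical to list with a plain Spark `read`.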
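
A minimal Auto Loader sketch under stated assumptions: it runs on Databricks (the `cloudFiles` source is not part of open-source Spark), the function names are hypothetical, and the schema is tracked in a sub-folder of the checkpoint path.

```python
def autoloader_options(file_format, schema_location):
    """Options for an Auto Loader read with schema tracking enabled."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }


def start_bronze_ingest(spark, source_path, checkpoint_path, target_table):
    """Load files not yet seen under source_path into a Bronze Delta table."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options("json", f"{checkpoint_path}/schema"))
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)  # remembers processed files
        .trigger(availableNow=True)                      # drain the backlog, then stop
        .toTable(target_table)
    )
```

With `availableNow=True` the job behaves like an incremental batch, which suits scheduled runs; dropping the trigger keeps the stream running continuously.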

#### 2. Structured Streaming

* **The Engine:** The underlying Spark engine that treats a stream of data as an "unbounded table."
* **Micro-Batching:** Processes data in small increments (e.g., every 10 seconds or every 1 minute) to provide near real-time updates.
* **Fault Tolerance:** Uses **Checkpoints** and **Write-Ahead Logs** so that a restarted query resumes from the last committed offsets. Combined with a replayable source and an idempotent sink (such as Delta), this provides end-to-end "exactly-once" processing guarantees.

#### 3. Delta Table Streaming (Source & Sink)

* **Delta as a Source:** You can treat a Delta table as a streaming source. Spark reads the **Transaction Log** to find exactly which new Parquet files were added since the last offset.
* **Delta as a Sink:** You can stream data *into* a Delta table. Because Delta is ACID compliant, your streaming writes are always consistent.
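* **The Pattern:** `spark.readStream.format("delta").load(source_path)` → `writeStream.format("delta").option("checkpointLocation", checkpoint_path).start(target_path)`. The source and target paths must differ, and each streaming write needs its own checkpoint location.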
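
The source-and-sink pattern can be sketched as one reusable hop between two Delta tables. The function name `start_delta_hop` and its parameters are hypothetical, and the sketch assumes the source table only receives appends.

```python
def start_delta_hop(spark, source_table, target_table, checkpoint_path,
                    interval="1 minute"):
    """Continuously copy newly appended rows from one Delta table to another."""
    stream = spark.readStream.table(source_table)  # Delta table as a streaming source
    return (
        stream.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)  # progress survives restarts
        .trigger(processingTime=interval)                # micro-batch interval
        .toTable(target_table)
    )
```

Any cleaning logic for the hop would be applied to `stream` before `writeStream`; the checkpoint path should be unique to each target table.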

#### 4. Delta Live Tables (DLT)

* **Declarative Framework:** You define the "What" (the transformation logic) and DLT handles the "How" (orchestration, scaling, and error handling).
* **CDC (Change Data Capture):** Includes built-in support for `APPLY CHANGES INTO`, which simplifies processing "Upserts" from source databases.
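* **Expectations:** Lets you define data quality rules (e.g., `CONSTRAINT valid_id EXPECT (id IS NOT NULL)`). By default, failing records are kept and counted in the pipeline metrics; adding `ON VIOLATION DROP ROW` drops them, and `ON VIOLATION FAIL UPDATE` stops the update.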

#### 5. Change Data Feed (CDF)

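* **Row-Level Tracking:** A feature enabled on Delta tables that records the type of change for every row in a `_change_type` column (`insert`, `update_preimage`, `update_postimage`, or `delete`), along with the commit version and timestamp.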
* **Efficient Downstream Sync:** Instead of comparing two massive tables to find differences, the downstream job simply reads the "Feed" to see exactly which rows changed and applies those changes.
* **Enablement:** `ALTER TABLE my_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)`.
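
Once the feed is enabled, a downstream job might read it roughly as follows. The function name `read_changes` is hypothetical, and `starting_version` must not be earlier than the version at which the feed was enabled.

```python
# Keep rows that describe the current state; the pre-update image is skipped.
CURRENT_STATE_FILTER = "_change_type IN ('insert', 'update_postimage', 'delete')"


def read_changes(spark, table_name, starting_version):
    """Return row-level changes committed since starting_version."""
    return (
        spark.read
        .option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .table(table_name)
        .filter(CURRENT_STATE_FILTER)
    )
```

The result carries `_change_type`, `_commit_version`, and `_commit_timestamp` next to the table's own columns, so a `MERGE` into the downstream table can branch on the change type.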

#### **Comparison: When to use which?**

| Pattern | Best For... | Key Advantage |
| --- | --- | --- |
| **Auto Loader** | Ingesting raw files (JSON, CSV, etc.) | High performance, handles schema drift. |
| **Structured Streaming** | Real-time dashboards or alerts | Low latency (seconds). |
| **Delta Live Tables** | Complex Medallion pipelines | Built-in monitoring and quality checks. |
| **Change Data Feed** | Propagating Updates/Deletes | Avoids expensive full-table scans. |


---

### **Practice**

In [0]:
from pyspark.sql import functions as F

In [0]:
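def load_ecommerce_dataset(month_name):
    """Read one month of the 2019 e-commerce CSV export (e.g. "Oct") into a DataFrame."""
    path = f"/Volumes/workspace/ecommerce/ecommerce_data/2019-{month_name}.csv"
    # inferSchema makes Spark scan the file an extra time to guess column types
    return spark.read.csv(path, header=True, inferSchema=True)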

In [0]:
# df_n = load_ecommerce_dataset("Nov")
df_o = load_ecommerce_dataset("Oct")

#### **1. Design 3-layer architecture** 

In [0]:
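# Bronze layer: persist the raw data unchanged, plus an ingestion timestamp.
# "overwrite" keeps this exercise re-runnable; a production Bronze table would use "append".
(
    df_o.withColumn("ingestion_ts", F.current_timestamp())
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("workspace.bronze.ecommerce_raw")
)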

In [0]:
# Silver layer

bronze = spark.read.table("workspace.bronze.ecommerce_raw")
silver = bronze.filter(
    F.col("price") > 0
).filter(
    F.col("price") < 10000
).dropDuplicates(
    ["user_session", "event_time"]
).withColumn(
    "event_date", F.to_date("event_time")
).withColumn(
    "price_tier",
    F.when(F.col("price") < 10, "budget")
     .when(F.col("price") < 50, "mid")
     .otherwise("premium")
)
silver.write.format("delta").mode("overwrite").saveAsTable("workspace.silver.ecommerce_cleaned")

In [0]:
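# Gold layer: per-product performance metrics
silver = spark.read.table("workspace.silver.ecommerce_cleaned")
# NOTE: this assumes the source data has a product_name column; group by product_id alone if it does not.
product_perf = silver.groupBy("product_id", "product_name") \
    .agg(
        # F.when needs a Column as its value; a bare "user_id" string is treated as a literal
        F.countDistinct(F.when(F.col("event_type")=="view", F.col("user_id"))).alias("views"),
        F.countDistinct(F.when(F.col("event_type")=="purchase", F.col("user_id"))).alias("purchases"),
        F.sum(F.when(F.col("event_type")=="purchase", F.col("price"))).alias("revenue")
    ).withColumn(
        # Leave the rate null when a product has no views, to avoid dividing by zero
        "conversion_rate",
        F.when(F.col("views") > 0, F.col("purchases")/F.col("views")*100)
    )

# Write the aggregated DataFrame (not the Silver one); overwriteSchema lets a re-run replace an older table layout
product_perf.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("workspace.gold.products")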

---

#### **2. Verification**

In [0]:
# limit() returns a DataFrame, which display() renders as a table; head() returns a list of Row objects
bronze = spark.read.table("workspace.bronze.ecommerce_raw")
display(bronze.limit(10))

In [0]:
silver = spark.read.table("workspace.silver.ecommerce_cleaned")
display(silver.limit(10))

In [0]:
gold = spark.read.table("workspace.gold.products")
display(gold.limit(10))

---

### **Resources**
- [Medallion architecture](https://www.databricks.com/glossary/medallion-architecture)
- [Video](https://www.youtube.com/watch?v=njjBdmAQnR0)

----