# **PHASE 1: FOUNDATION (Days 1-4)**

## **DAY 4 (12/01/26) - Delta Lake Introduction**



### **Section 1 - Learn**:

### **_1. What is Delta Lake_**

Delta Lake is an open-source storage layer that brings **reliability** and **performance** to data lakes. It essentially turns a "Data Lake" into a "Data Lakehouse" by adding a transactional layer (via a JSON-based Transaction Log) on top of standard Parquet files.

Here are the 7 key pillars of Delta Lake:

##### 1. ACID Transactions

* **Reliability:** Like a traditional database, Delta Lake ensures that data operations (Inserts, Updates, Deletes) are **Atomic**.
* **No Partial Writes:** If a job fails halfway, its changes are never committed to the transaction log, so readers never see a corrupted or "half-written" table.

##### 2. The Transaction Log (`_delta_log`)

* **Single Source of Truth:** Every change made to the table is recorded as an ordered sequence of JSON commit files in the table's `_delta_log` directory.
* **Metadata Performance:** Instead of scanning thousands of files on S3 or ADLS to find data, Spark simply reads the log to know exactly which files are relevant, making query start times significantly faster.

##### 3. Time Travel (Data Versioning)

* **History Tracking:** Because every transaction is logged, you can query a table as it existed at a specific point in time or version number.
* **Use Case:** This is essential for auditing, reproducing machine learning models, or "undoing" an accidental data deletion.
* *SQL Example:* `SELECT * FROM my_table VERSION AS OF 5`
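
The same idea can be sketched through the PySpark DataFrame reader. The `versionAsOf` and `timestampAsOf` reader options belong to Delta Lake; the helper name `read_snapshot` and its arguments are illustrative assumptions, and `spark` is assumed to be an active SparkSession.

```python
def read_snapshot(spark, path, version=None, timestamp=None):
    """Read the Delta table at `path` as of a version number or a timestamp.

    Hypothetical helper: `spark` is an active SparkSession supplied by the caller.
    """
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of `version` or `timestamp`")

    reader = spark.read.format("delta")
    if version is not None:
        # Same effect as: SELECT * FROM <table> VERSION AS OF <version>
        reader = reader.option("versionAsOf", version)
    else:
        # Same effect as: SELECT * FROM <table> TIMESTAMP AS OF '<timestamp>'
        reader = reader.option("timestampAsOf", timestamp)
    return reader.load(path)
```

For example, `read_snapshot(spark, path, version=5)` corresponds to the SQL example above.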



##### 4. Schema Enforcement & Evolution

* **Enforcement:** Prevents "data swamp" issues by rejecting any write that doesn't match the table's defined schema (e.g., writing a string into an integer column).
* **Evolution:** Allows you to safely change your table structure over time (like adding new columns) without having to rewrite the entire dataset.

##### 5. The Medallion Architecture

Delta Lake is commonly paired with a "multi-hop" design to manage data quality:

* **Bronze (Raw):** Keeps the data in its original, messy state for audit history.
* **Silver (Cleaned):** Data is filtered, joined, and standardized for "Enterprise" views.
* **Gold (Aggregated):** Clean, high-level summaries ready for BI dashboards and final reporting.
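
The three hops can be illustrated with a toy, pure-Python pipeline. The sample records and the helper names `to_silver` and `to_gold` are assumptions made for this sketch; in a real pipeline each layer would be its own Delta table.

```python
from collections import defaultdict

# Bronze: raw events exactly as they arrived (hypothetical sample records).
bronze = [
    {"user_id": "1", "event_type": "purchase", "price": "10.50"},
    {"user_id": "1", "event_type": "purchase", "price": "10.50"},  # duplicate
    {"user_id": "2", "event_type": "view", "price": None},         # no price
    {"user_id": "2", "event_type": "purchase", "price": "4.00"},
]

def to_silver(rows):
    """Clean: drop rows without a price, cast types, remove exact duplicates."""
    seen, silver = set(), []
    for row in rows:
        if row["price"] is None:
            continue
        clean = (int(row["user_id"]), row["event_type"], float(row["price"]))
        if clean not in seen:
            seen.add(clean)
            silver.append(
                {"user_id": clean[0], "event_type": clean[1], "price": clean[2]}
            )
    return silver

def to_gold(rows):
    """Aggregate: total purchase revenue per user, ready for reporting."""
    revenue = defaultdict(float)
    for row in rows:
        if row["event_type"] == "purchase":
            revenue[row["user_id"]] += row["price"]
    return dict(revenue)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # → {1: 10.5, 2: 4.0}
```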

##### 6. Unified Batch and Streaming

* A Delta table can serve **batch** reads and writes while also acting as a **streaming source or sink** at the same time. This eliminates the need for complex "Lambda" architectures where you'd otherwise need separate systems for real-time and historical data.

##### 7. Performance Optimizations

* **Z-Ordering / Multi-dimensional Clustering:** Reorganizes data on disk so that related information is stored in the same files, allowing Spark to skip over massive amounts of irrelevant data during a query.
* **Small File Compaction:** The `OPTIMIZE` command (or auto compaction, where it is enabled) merges tiny, inefficient files into larger, "right-sized" files to improve I/O performance.
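
Why clustering helps can be shown with a toy model of data skipping on per-file min/max statistics. The file names, the `price` ranges, and the helper `files_to_scan` are hypothetical; this is a sketch of the idea, not Delta Lake's implementation.

```python
# Per-file min/max statistics, as a table format might record them in its log.
file_stats = [
    {"path": "part-0.parquet", "min_price": 0.0, "max_price": 49.9},
    {"path": "part-1.parquet", "min_price": 50.0, "max_price": 199.9},
    {"path": "part-2.parquet", "min_price": 200.0, "max_price": 999.0},
]

def files_to_scan(stats, low, high):
    """Return files whose [min, max] range overlaps the query range [low, high]."""
    return [
        f["path"]
        for f in stats
        if f["max_price"] >= low and f["min_price"] <= high
    ]

# Query: WHERE price BETWEEN 60 AND 120 -> only one file can contain matches.
print(files_to_scan(file_stats, 60, 120))  # → ['part-1.parquet']
```

Skipping is most effective when each file covers a narrow, mostly non-overlapping range of values, which is what clustering the data aims for.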



---

### **_2. ACID transactions_**

**ACID** is the standard for reliability in database systems. In a traditional Data Lake (like raw S3 or HDFS), files can easily become corrupted if a write job fails halfway. Delta Lake solves this by adding a **Transaction Log** that guarantees these four properties:

##### 1. **Atomicity (All or Nothing)**

* **The Problem:** In a standard data lake, if a job writing 100 files crashes at file #50, those 50 files remain in your lake, leaving you with "partial" or "dirty" data.
* **The Delta Fix:** A transaction only "exists" once it is committed to the **Transaction Log** (`_delta_log`). If a job fails at 99%, the log is never updated, and readers simply ignore the uncommitted files as if they never existed.

##### 2. **Consistency (Valid State Only)**

* **The Problem:** Bad data (e.g., a string in a "Price" column) can break downstream reports.
* **The Delta Fix:** Delta Lake applies **schema enforcement**. It checks every write against the table's "rules" (column names, data types, and constraints such as `NOT NULL`). If the data doesn't fit, the transaction is rejected before it can "pollute" the table.

##### 3. **Isolation (No Interference)**

* **The Problem:** What happens when one user is reading a table while another user is deleting half of it? Usually, the reader gets an error or incorrect results.
* **The Delta Fix:** Delta Lake uses **Optimistic Concurrency Control**. It allows multiple users to read and write at the same time. Readers always see a "snapshot" of the data as it was when they started, unaffected by ongoing writes.
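
A toy sketch of the optimistic part, assuming a local directory stands in for the log: each writer notes the version it read, then tries to create the next commit file with "create only if absent" semantics. The helper names and the simplified action format are assumptions for illustration. Delta Lake additionally checks whether the intervening commits logically conflict before retrying, which this sketch omits.

```python
import json
import os
import tempfile

def latest_version(log_dir):
    """Highest committed version in the log directory, or -1 if it is empty."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=-1)

def try_commit(log_dir, read_version, actions):
    """Try to commit `actions` as version read_version + 1.

    Returns True on success, False if another writer already took that version.
    """
    target = os.path.join(log_dir, f"{read_version + 1:020d}.json")
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    return True

log_dir = tempfile.mkdtemp()

# Two writers both read the table at the same version...
seen_by_a = seen_by_b = latest_version(log_dir)

# ...writer A commits first, so writer B's attempt at the same version fails.
a_ok = try_commit(log_dir, seen_by_a, [{"add": "a.parquet"}])
b_ok = try_commit(log_dir, seen_by_b, [{"add": "b.parquet"}])

# Writer B re-reads the latest version and retries on top of it.
b_retry_ok = try_commit(log_dir, latest_version(log_dir), [{"add": "b.parquet"}])
print(a_ok, b_ok, b_retry_ok)  # → True False True
```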

##### 4. **Durability (Permanent Changes)**

* **The Problem:** A system crash right after a "Success" message could lead to data loss if it wasn't truly saved to disk.
* **The Delta Fix:** Once a transaction is recorded in the `_delta_log`, it is considered permanent. Since the log and the data files reside on reliable cloud storage (like AWS S3 or Azure Data Lake), the data survives even if the entire Spark cluster goes down.

##### **How the Transaction Log Works**

1. **The Log:** Every commit is written as a JSON file named after its zero-padded version number (e.g., `00000000000000000001.json`) in the `_delta_log` folder.
2. **The Truth:** When you query a table, Spark doesn't just "look at the files." It first reads the **JSON Log** to see which specific files are "valid" and "current."
3. **The Cleanup:** Files that are "deleted" stay on disk but are marked as "removed" in the log. They only disappear physically when you run the `VACUUM` command, which deletes unreferenced files older than the retention threshold (7 days by default).
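
The three steps above can be sketched as a small log replay. The commit contents are heavily simplified (real entries carry more fields, such as sizes and statistics), and the file names and the `replay` helper are assumptions for illustration.

```python
import json

# Each string stands for one commit file in _delta_log: one JSON action per line.
commits = [
    '{"add": {"path": "part-0.parquet"}}\n{"add": {"path": "part-1.parquet"}}',
    '{"remove": {"path": "part-0.parquet"}}\n{"add": {"path": "part-2.parquet"}}',
]

def replay(commit_files):
    """Replay commits in order; return (live files, removed files)."""
    live, removed = set(), set()
    for content in commit_files:
        for line in content.splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
                removed.discard(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
                removed.add(action["remove"]["path"])
    return live, removed

# Files physically present, including one left behind by a job that crashed
# before committing (it never appears in the log).
on_disk = {"part-0.parquet", "part-1.parquet", "part-2.parquet", "part-orphan.parquet"}

live, removed = replay(commits)
print(sorted(live))                      # → ['part-1.parquet', 'part-2.parquet']
print(sorted(removed))                   # → ['part-0.parquet']
print(sorted(on_disk - live - removed))  # → ['part-orphan.parquet']
```

Replaying only a prefix of the log, such as `replay(commits[:1])`, yields the table as of an earlier version, which is the mechanism behind time travel.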

---

### **_3. Schema Enforcement_**

In Delta Lake, **Schema Enforcement** (also known as schema validation) is a safety mechanism that ensures data quality by rejecting any write operation that does not match a table's predefined structure.

Unlike traditional "Schema-on-Read" data lakes (like raw Parquet or CSV on S3), Delta Lake uses **Schema-on-Write** to act as a gatekeeper.

##### **The 3 Strict Rules of Enforcement**

When you try to write a DataFrame to a Delta table, Delta Lake checks for these three things:

* **No Extra Columns:** The incoming DataFrame cannot contain any columns that aren't already in the target table.
* **Data Type Match:** If a column in the table is an `Integer`, you cannot write a `String` to it. (Note: Small "upcasts" like `Byte` to `Integer` are sometimes allowed).
* **Case Sensitivity:** Delta Lake is case-preserving but insensitive by default. You cannot have both a "Name" and "name" column; attempting to add one that differs only by case will trigger an error.

##### **What Happens During a Mismatch?**

If any of the rules above are broken, Delta Lake will:

1. **Stop the Transaction:** No data is written to the disk.
2. **Throw an Exception:** You will see an `AnalysisException` in your notebook.
3. **Preserve the Log:** The transaction log remains clean, and readers are unaffected by the failed attempt.

> **Note:** It is perfectly fine if your incoming data has **fewer** columns than the target table. Delta Lake will simply insert `null` values into the missing columns for those new rows.
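
The rules above, including the "fewer columns is fine" exception, can be expressed as a toy checker. This is a sketch of the logic only, not Delta Lake's implementation: the `check_write` helper is hypothetical, schemas are plain dictionaries, and permitted upcasts are ignored.

```python
def check_write(table_schema, incoming_schema):
    """Return a list of schema violations; an empty list means the write passes.

    Both schemas map column name -> type name.
    """
    errors = []
    # Column matching is case-insensitive, so index the table by lowercase name.
    table_types = {name.lower(): dtype for name, dtype in table_schema.items()}

    seen = set()
    for name, dtype in incoming_schema.items():
        key = name.lower()
        if key in seen:
            errors.append(f"duplicate column differing only by case: {name}")
            continue
        seen.add(key)
        if key not in table_types:
            errors.append(f"extra column: {name}")
        elif table_types[key] != dtype:
            errors.append(
                f"type mismatch on {name}: got {dtype}, table has {table_types[key]}"
            )
    # Columns missing from the incoming data are fine: they are filled with null.
    return errors

table = {"event_time": "timestamp", "event_type": "string", "product_id": "int"}

print(check_write(table, {"event_type": "string", "product_id": "int"}))
# → []
print(check_write(table, {"event_type": "string", "location": "string"}))
# → ['extra column: location']
print(check_write(table, {"Product_ID": "string"}))
# → ['type mismatch on Product_ID: got string, table has int']
```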

##### **Schema Enforcement vs. Schema Evolution**

While enforcement is about **protection**, evolution is about **flexibility**.

| Feature | Schema Enforcement (Default) | Schema Evolution (Optional) |
| --- | --- | --- |
| **Purpose** | Prevents "Data Swamps" by blocking bad data. | Allows table structure to change over time. |
| **Action** | Rejects the write if schemas don't match. | Automatically adds new columns to the table. |
| **Command** | Standard `.write.save()` | Requires `.option("mergeSchema", "true")` |

##### **How to "Bypass" Enforcement (Evolution)**

If you actually *want* to add a new column from your DataFrame to your existing table, you must explicitly enable **Schema Evolution**:

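```python
# Assumes `df` contains a column (e.g. 'new_column') that the target table lacks,
# and `path` is the table's storage location.

# Fails with an AnalysisException: schema enforcement rejects the extra column
df.write.format("delta").mode("append").save(path)

# Succeeds: mergeSchema adds 'new_column' to the table's schema during the write
df.write.format("delta") \
  .mode("append") \
  .option("mergeSchema", "true") \
  .save(path)
```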

---

### **_4. Delta vs Parquet_**

In short, **Delta Lake is built on top of Parquet**. While Parquet is an efficient **file format** for storing big data, Delta Lake is a **storage layer** (or table format) that adds a management system—the Transaction Log—on top of those files.

##### **Key Differences**

* **Transactional Support:** **Parquet** does not support ACID transactions; if a write fails halfway, you end up with "dirty" data. **Delta** uses a JSON-based transaction log to ensure every operation is all-or-nothing (Atomic).
* **Mutability:** **Parquet** files are immutable. To delete or update a single row, you have to rewrite the entire file yourself. **Delta** supports `UPDATE`, `DELETE`, and `MERGE` commands: it handles the file rewrites for you and records the change in the transaction log.
* **Metadata Performance:** To read a **Parquet** lake, Spark must "list" all files on disk, which is slow for large lakes. **Delta** stores file paths and statistics (like min/max values) in its log, so Spark can find the relevant files without listing directories and skip files whose statistics rule them out.
* **Time Travel:** **Parquet** only shows you the data as it exists right now. **Delta** keeps a versioned history of changes, allowing you to query the table as it appeared at any point in the past.
* **Schema Evolution:** **Parquet** can be brittle if columns change. **Delta** allows you to safely evolve your schema (adding columns) or enforce strict data quality rules to prevent "data swamp" issues.
* **Unified Processing:** **Delta** allows the same table to be used for both batch processing and real-time streaming simultaneously, a feat that is extremely difficult to coordinate with raw Parquet files.


##### **Comparison Table**

| Feature | Apache Parquet | Delta Lake |
| --- | --- | --- |
| **Type** | File Format (`.parquet`) | Table Format / Storage Layer |
| **ACID Transactions** | No | Yes |
| **Updates / Deletes** | Difficult (Rewrite required) | Easy (Native DML support) |
| **Data Versioning** | No | Yes (Time Travel) |
| **Data Skipping** | Basic (Footer-based) | Advanced (Z-Ordering & Statistics) |
| **Open Source** | Yes | Yes |


##### **When to use what?**

* **Use Parquet:** For simple, static data archival where the data never changes and you don't need advanced features like versioning.
* **Use Delta:** For almost all modern production pipelines, especially those involving ETL, frequent updates, or multi-user access.

---

### **Practice**

In [0]:
from pyspark.sql import functions as F

In [0]:
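def load_ecommerce_dataset(month_name):
    """Load one month of 2019 e-commerce events, e.g. month_name="Oct"."""
    path = f"/Volumes/workspace/ecommerce/ecommerce_data/2019-{month_name}.csv"
    # inferSchema=True makes Spark scan the file to guess each column's type
    return spark.read.csv(path, header=True, inferSchema=True)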

In [0]:
# df_n = load_ecommerce_dataset("Nov")
df_o = load_ecommerce_dataset("Oct")

#### **1. Convert CSV to Delta format** 

In [0]:
df_o.write.format("delta").mode("overwrite").saveAsTable("workspace.default.df_o_table")

#### **2. Create Delta tables (SQL and PySpark)** 

In [0]:
df_o.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("df_o_table")
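
For the SQL route, one possible sketch builds a `CREATE TABLE ... USING DELTA` statement and hands it to `spark.sql`. The helper `create_table_sql`, the table name `events_sql_demo`, and the column types are assumptions for illustration; only the column names come from the dataset used here.

```python
def create_table_sql(table_name, columns):
    """Build a CREATE TABLE ... USING DELTA statement from (name, type) pairs."""
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE IF NOT EXISTS {table_name} ({column_list}) USING DELTA"

# Hypothetical table name and column types.
ddl = create_table_sql(
    "workspace.default.events_sql_demo",
    [("event_time", "TIMESTAMP"), ("event_type", "STRING"), ("product_id", "INT")],
)
print(ddl)
# In a notebook with an active session, run it with: spark.sql(ddl)
```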

#### **3. Test schema enforcement**

In [0]:
from pyspark.sql import Row

In [0]:
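# Create a 'bad' row with an extra column 'location'.
# event_time is a plain string here, so it may also conflict with the type
# that inferSchema assigned to the table's event_time column.
bad_data = [Row(
    event_time="2023-10-01 10:00:00",
    event_type="view",
    product_id=123,
    location="New York",  # This column doesn't exist in the table
)]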

df_bad = spark.createDataFrame(bad_data)

# Trying to append this to our Delta table
try:
    df_bad.write.format("delta").mode("append").saveAsTable("df_o_table")
except Exception as e:
    print("SCHEMA ENFORCEMENT TRIGGERED")
    print(e)

SCHEMA ENFORCEMENT TRIGGERED
[DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields 'event_time' and 'event_time'.

JVM stacktrace:
com.databricks.sql.transaction.tahoe.DeltaAnalysisException
	at com.databricks.sql.transaction.tahoe.schema.SchemaMergingUtils$.$anonfun$mergeDataTypes$1(SchemaMergingUtils.scala:231)
	at scala.collection.ArrayOps$.map$extension(ArrayOps.scala:936)
	at com.databricks.sql.transaction.tahoe.schema.SchemaMergingUtils$.merge$1(SchemaMergingUtils.scala:217)
	at com.databricks.sql.transaction.tahoe.schema.SchemaMergingUtils$.mergeDataTypes(SchemaMergingUtils.scala:335)
	at com.databricks.sql.transaction.tahoe.schema.SchemaMergingUtils$.mergeSchemas(SchemaMergingUtils.scala:179)
	at com.databricks.sql.transaction.tahoe.schema.ImplicitMetadataOperation$.mergeSchema(ImplicitMetadataOperation.scala:359)
	at com.databricks.sql.transaction.tahoe.schema.ImplicitMetadataOperation.updateMetadata(ImplicitMetadataOperation.scala:112)
	at com.databricks.sql.transaction.tahoe

> **Note:** The error names `event_time`, not `location`. `DELTA_FAILED_TO_MERGE_FIELDS` reports a type conflict: the test row supplies `event_time` as a string, which does not merge with the type the table holds for that column. The write is still rejected, but to observe the extra-column rule in isolation, every shared column in the test row must match the table's type.

---

#### **4. Handle duplicate inserts**

In [0]:
# Remove exact row duplicates
df_clean = df_o.dropDuplicates()

# Remove duplicates based on the columns that together identify an event
df_deduped = df_o.dropDuplicates(["user_id", "event_time", "product_id", "event_type"])

print(f"Rows before: {df_o.count()} | Rows after: {df_deduped.count()}")

Rows before: 42448764 | Rows after: 42413557
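
`dropDuplicates` cleans a DataFrame before it is written, but it does not stop the same rows from being appended to the table twice. One way to make inserts idempotent is an insert-only `MERGE`, sketched below with the `DeltaTable` API from the `delta-spark` package. The helper names are hypothetical, and `delta_table` is assumed to come from something like `DeltaTable.forName(spark, "df_o_table")`.

```python
def merge_condition(keys, target="t", source="s"):
    """Build the MERGE join condition, e.g. 't.a = s.a AND t.b = s.b'."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)

def insert_only_new(delta_table, new_rows_df, keys):
    """Append only rows whose key is not already present in the table."""
    (
        delta_table.alias("t")
        # De-duplicate the incoming batch first so one key maps to one row.
        .merge(new_rows_df.dropDuplicates(keys).alias("s"), merge_condition(keys))
        .whenNotMatchedInsertAll()
        .execute()
    )

keys = ["user_id", "event_time", "product_id", "event_type"]
print(merge_condition(keys))
```

Running the same batch through `insert_only_new` twice should leave the table unchanged the second time, as long as the key columns contain no nulls (a null never satisfies `=`).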


---

### **Resources**
- [Delta Lake](https://docs.databricks.com/delta/)
- [Tutorials](https://docs.databricks.com/delta/tutorial.html)

----