1. What is Delta Lake?

Delta Lake is an open-source storage layer that brings reliability and governance to data lakes.

Delta Lake provides:

ACID transactions for reliable writes

Schema enforcement to prevent bad data

Scalable metadata handling

Time travel for audit and rollback

📌 Internally, a Delta table is:

Parquet data files + _delta_log (transaction log)
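To make the "Parquet files + _delta_log" picture concrete, here is a minimal sketch in plain Python of how a reader could replay a transaction log to work out which data files make up the table. It is a toy model under stated assumptions: the commits are an in-memory list rather than the JSON files Delta keeps under _delta_log, the file names are made up, and only `add` and `remove` actions are modeled.

```python
# Toy model of replaying a Delta-style transaction log (not the real Delta reader).
# Each commit is a list of actions; the file names below are hypothetical.
commits = [
    # version 0: initial write
    [{"add": {"path": "part-0000.parquet"}}, {"add": {"path": "part-0001.parquet"}}],
    # version 1: append one more file
    [{"add": {"path": "part-0002.parquet"}}],
    # version 2: overwrite, which removes the old files and adds a new one
    [
        {"remove": {"path": "part-0000.parquet"}},
        {"remove": {"path": "part-0001.parquet"}},
        {"remove": {"path": "part-0002.parquet"}},
        {"add": {"path": "part-0003.parquet"}},
    ],
]


def live_files(commits, version=None):
    """Return the set of data files visible at `version` (latest if None)."""
    if version is None:
        version = len(commits) - 1
    files = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(live_files(commits)))             # latest snapshot
print(sorted(live_files(commits, version=1)))  # "time travel" to version 1
```

Older versions are just shorter prefixes of the log, which is why time travel works as long as the old data files have not been cleaned up.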

#Load CSV & Observe Lazy Evaluation
##Code (Transformation only)

In [0]:
#Load the CSV dataset as a DataFrame.
events = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv")


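Explanation

Transformations on the DataFrame are not executed yet

Spark only records a plan; the rows are processed when an action (such as count() or a write) runs

Caveat: because inferSchema is true, Spark does scan the CSV once at load time to work out the column types, so a short job can appear for this cell

Apart from that scan, this demonstrates Spark's lazy evaluation model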

In [0]:
events.printSchema()
events.explain()


What to Observe

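explain() prints the physical plan Spark has built (call explain(True) to see the logical plans as well)

Neither printSchema() nor explain() runs a job

Neither call reads the rows of the file

 Note:

Spark is lazy: defining a DataFrame and its transformations does not execute them. Work happens only when an action runs (or, as with inferSchema, when Spark needs to look at the data to determine the schema).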
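The idea of "plan now, execute on an action" can be sketched in plain Python. The `LazyPipeline` class below is hypothetical and is not Spark's API; it only records steps and reads its source when `collect()` is called.

```python
# Toy illustration of lazy evaluation (hypothetical LazyPipeline class, not Spark's API).
class LazyPipeline:
    def __init__(self, source, steps=()):
        self.source = source        # zero-argument callable that produces the rows
        self.steps = tuple(steps)   # recorded transformations, not yet applied

    def filter(self, predicate):
        # Transformation: returns a new plan and touches no data.
        return LazyPipeline(self.source, self.steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: returns a new plan and touches no data.
        return LazyPipeline(self.source, self.steps + (("map", func),))

    def explain(self):
        # Describe the plan without running it.
        return " -> ".join(["source"] + [name for name, _ in self.steps])

    def collect(self):
        # Action: only now is the source read and every step applied.
        rows = self.source()
        for name, func in self.steps:
            if name == "filter":
                rows = [r for r in rows if func(r)]
            else:
                rows = [func(r) for r in rows]
        return rows


reads = []  # records each time the source is actually read


def load_prices():
    reads.append("read")
    return [5, 20, 35]


plan = LazyPipeline(load_prices).filter(lambda p: p > 10).map(lambda p: p * 2)
print(plan.explain())  # the plan only
print(len(reads))      # the source has not been read yet
print(plan.collect())  # the action triggers the read
print(len(reads))
```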

#Convert CSV → Delta (First ACID Write)

Task: Write DataFrame as Delta

In [0]:
events.write \
    .format("delta") \
    .mode("overwrite") \
    .save("/Volumes/workspace/ecommerce/ecommerce_data/delta/events")


Verify Storage

In [0]:
display(dbutils.fs.ls("/Volumes/workspace/ecommerce/ecommerce_data/delta/events"))


What Happens Internally

Parquet files written

_delta_log directory created

Transaction version = 0

Key Learning:

Delta commits data atomically: either everything is written or nothing is.
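Delta gets this all-or-nothing behaviour from the log: data files are written first, and the write only counts once a new numbered commit file appears in _delta_log. The sketch below imitates that step with an exclusive file create, so only one writer can claim a given version. It is a simplified illustration, not Delta's implementation: the `commit` helper and the file paths are hypothetical, it runs against a local temporary directory, and it does not handle a crash halfway through writing the commit file.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Try to publish `actions` as commit `version`; return True on success."""
    # Commit files are named by zero-padded version number.
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" creates the file only if it does not exist yet, so at most
        # one writer can claim a given version number.
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


log_dir = tempfile.mkdtemp()
first = commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
second = commit(log_dir, 0, [{"add": {"path": "part-9999.parquet"}}])
print(first, second)
print(sorted(os.listdir(log_dir)))
```

The second writer loses the race for version 0 and would have to retry as version 1; its data file is never referenced by the log, so readers never see it.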

ACID Transactions (Concept)
Delta Lake guarantees:

Atomicity – all or nothing writes

Consistency – valid state after each transaction

Isolation – concurrent reads/writes are safe

Durability – committed data survives failures

📌 Each write operation creates a new Delta version

#Create Delta Tables 
##PySpark – Managed Delta Table

In [0]:
events.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("events_managed")



Why managed table?

Unity Catalog–governed

No external location errors

Best practice for Databricks

#Create Delta Table Using SQL


In [0]:
%sql
CREATE OR REPLACE TABLE ecommerce_events_delta_sql
USING DELTA
AS
SELECT * FROM events_managed;


In [0]:
%sql

select count(*) from events_managed

Schema Enforcement (On Existing Delta Table)
Objective

Verify that Delta Lake prevents incompatible schema writes.

In [0]:
wrong_schema_df = spark.createDataFrame(
    [("x", "y", "z")],
    ["a", "b", "c"]
)

try:
    wrong_schema_df.write \
        .format("delta") \
        .mode("append") \
        .saveAsTable("events_managed")
except Exception as e:
    print("Schema enforcement triggered:")
    print(e)


Test Schema Enforcement
Attempt to write wrong schema

Expected Result

❌ Write fails
✔ Delta blocks the operation due to schema mismatch

What This Proves

Delta validates column names and data types

Invalid writes are rejected before data corruption

Schema enforcement works even after data is loaded

📌 This protection is not available with plain Parquet tables
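The kind of check Delta performs before an append can be sketched in a few lines of plain Python. This is a simplified illustration, not Delta's implementation: `check_append_schema` is a hypothetical helper, the column names and types are assumptions, and types must match exactly here, whereas the real rules are more nuanced.

```python
# Toy schema check (hypothetical helper, not Delta's implementation).
def check_append_schema(table_schema, incoming_schema):
    """Raise ValueError if incoming columns are unknown or have a different type."""
    problems = []
    for name, dtype in incoming_schema.items():
        if name not in table_schema:
            problems.append(f"unknown column '{name}'")
        elif table_schema[name] != dtype:
            problems.append(
                f"column '{name}' is {dtype}, table expects {table_schema[name]}"
            )
    if problems:
        raise ValueError("schema mismatch: " + "; ".join(problems))


# Column names and types here are illustrative assumptions.
table_schema = {"event_time": "timestamp", "product_id": "int", "price": "double"}

# A compatible batch passes silently.
check_append_schema(table_schema, {"product_id": "int", "price": "double"})

# A batch shaped like wrong_schema_df above is rejected before anything is written.
try:
    check_append_schema(table_schema, {"a": "string", "b": "string", "c": "string"})
except ValueError as e:
    print(e)
```

The point of validating first is ordering: the check runs before any data is committed, so a rejected batch leaves the table exactly as it was.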

#Handle Duplicate Inserts (ACID Transactions)
Objective

Understand how Delta handles repeated inserts safely.

In [0]:
spark.table("events_managed") \
    .write \
    .format("delta") \
    .mode("append") \
    .saveAsTable("events_managed")


In [0]:
%sql
select count(*) from events_managed

What Happens Internally

Each append creates a new Delta transaction

Data is added consistently

No partial writes or corruption

Readers always see a stable snapshot

📌 Delta does not auto-deduplicate, but it guarantees safe writes
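Both points, stable snapshots for readers and no automatic deduplication, can be seen in a small plain-Python model. `ToyTable` is a hypothetical class written for this illustration; it keeps commits in a list and is not how Delta stores data.

```python
# Toy model of snapshot reads (not Delta's implementation).
class ToyTable:
    def __init__(self):
        self.commits = []  # commits[i] holds the rows added by version i

    def append(self, rows):
        # Publishing is a single list append, so a commit is all-or-nothing here.
        self.commits.append(list(rows))
        return len(self.commits) - 1  # the new version number

    def snapshot(self):
        # A reader pins the latest version at the moment it starts.
        return len(self.commits) - 1

    def read(self, version):
        # Only commits up to the pinned version are visible.
        return [row for commit in self.commits[: version + 1] for row in commit]


table = ToyTable()
table.append(["view", "cart"])

pinned = table.snapshot()       # a reader starts at version 0
table.append(["view", "cart"])  # a writer appends the same rows again

print(len(table.read(pinned)))            # the first reader still sees version 0
print(len(table.read(table.snapshot())))  # a new reader sees the duplicated rows
```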

ACID Transactions (Validated Practically)

From the above steps, Delta guarantees:

Atomicity
The append either fully succeeds or fully fails; no partial result becomes visible.

Consistency
Table remains valid after every write.

Isolation
Queries never see partial data during writes.

Durability
Once committed, the data is persisted: the Parquet files are in storage and the commit is recorded in the transaction log.

In [0]:
%sql
DESCRIBE HISTORY events_managed;


You will see:

Operation type (for example WRITE, with the write mode such as Append shown under operationParameters)

Version number

Timestamp

User / notebook info

📌 This confirms Delta's auditable transaction model.
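The history view is essentially a summary of the commit entries in the log. As a rough sketch, the plain-Python code below builds a similar listing from made-up commit files. The `make_commit` and `history` helpers are hypothetical, the values are invented, and the field names are modeled loosely on the commitInfo entries found in Delta logs rather than taken from this table.

```python
import json


def make_commit(info, *actions):
    """Build the text of one commit file: one JSON action per line."""
    return "\n".join(json.dumps(a) for a in ({"commitInfo": info},) + actions)


# Hypothetical log contents; timestamps and file names are invented.
log_files = {
    "00000000000000000000.json": make_commit(
        {"timestamp": 1700000000000, "operation": "WRITE",
         "operationParameters": {"mode": "Overwrite"}},
        {"add": {"path": "part-0000.parquet"}},
    ),
    "00000000000000000001.json": make_commit(
        {"timestamp": 1700000060000, "operation": "WRITE",
         "operationParameters": {"mode": "Append"}},
        {"add": {"path": "part-0001.parquet"}},
    ),
}


def history(log_files):
    """Return one summary row per commit, newest first."""
    rows = []
    for name in sorted(log_files):
        version = int(name.split(".")[0])
        for line in log_files[name].splitlines():
            action = json.loads(line)
            if "commitInfo" in action:
                info = action["commitInfo"]
                rows.append({
                    "version": version,
                    "operation": info["operation"],
                    "mode": info["operationParameters"].get("mode"),
                    "timestamp": info["timestamp"],
                })
    return sorted(rows, key=lambda r: r["version"], reverse=True)


for row in history(log_files):
    print(row)
```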

4. Handle Duplicate Inserts (Delta Lake)
Key Point (Very Important)

👉 Delta Lake does NOT handle deduplication automatically
👉 It guarantees ACID safety, not row-level uniqueness

Let us clean up the duplicates first by dropping the table and reloading it from the CSV.


In [0]:
%sql

DROP TABLE events_managed

In [0]:
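# Reload the CSV. inferSchema is not set here, so every column is read as a string.
events_df = (
    spark.read
    .option("header", "true")
    .csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv")
)

# The table was dropped above, so this append recreates events_managed from scratch.
events_df.write.mode("append").saveAsTable("events_managed")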


In [0]:
%sql

 SELECT COUNT(*) FROM events_managed


In [0]:
new_events = (
    spark.read
    .option("header", "true")
    .csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv")
)

new_events.createOrReplaceTempView("new_events")


In [0]:
%sql
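-- Insert only the source rows whose key is not already present in the target.
-- Rows with a NULL in any key column never match and are inserted again.
MERGE INTO events_managed t
USING new_events s
ON t.user_id = s.user_id
AND t.product_id = s.product_id
AND t.event_time = s.event_time
WHEN NOT MATCHED THEN
  INSERT *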


In [0]:
%sql
SELECT COUNT(*) FROM events_managed
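The logic of this insert-only MERGE can be sketched in plain Python to show why re-running it with the same batch leaves the count unchanged. `merge_insert_only` is a hypothetical helper and the rows are invented; it models the matching rule, not how Delta executes the statement.

```python
# Toy model of an insert-only MERGE (not Delta's implementation).
def merge_insert_only(target, source, key_columns):
    """Append source rows whose key is not already present in target."""
    def key(row):
        return tuple(row[c] for c in key_columns)

    # Keys are taken from the target as it was before the merge started.
    existing = {key(row) for row in target}
    inserted = [row for row in source if key(row) not in existing]
    target.extend(inserted)
    return len(inserted)


keys = ["user_id", "product_id", "event_time"]
target = [
    {"user_id": 1, "product_id": 10, "event_time": "2019-11-01 00:00:00"},
]
source = [
    {"user_id": 1, "product_id": 10, "event_time": "2019-11-01 00:00:00"},  # already loaded
    {"user_id": 2, "product_id": 20, "event_time": "2019-11-01 00:00:05"},  # new
]

print(merge_insert_only(target, source, keys))  # first run inserts the new row
print(merge_insert_only(target, source, keys))  # second run inserts nothing
print(len(target))
```

As in the SQL version, rows are compared against the target only, so duplicate rows inside the source batch itself would all be inserted. Deduplicate the source first if that matters.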


Delta Lake vs Parquet (Quick Comparison)

Parquet:
- Columnar file format (storage only)
- No transactions
- No schema enforcement
- No built-in deduplication or updates
- Appends can easily create duplicates
- Good for immutable, append-only data

Delta Lake:
- Storage format + transaction log (_delta_log)
- Supports ACID transactions
- Enforces schema (and can evolve it)
- Supports UPDATE, DELETE, MERGE
- Prevents partial writes and corrupt
