## 1. Incremental MERGE (Upserts)

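MERGE applies incremental changes to a Delta table in a single atomic operation:
- Source rows that match an existing target row update that row
- Source rows with no match are inserted as new rows

Because matched rows are updated rather than appended, re-running the same load does not create duplicates, and the table never has to be fully overwritten.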
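
As a rough mental model, the two MERGE clauses behave like an upsert into a dictionary keyed by the match columns. The sketch below is plain Python, not Delta Lake code. The `merge_upsert` helper and the sample rows are hypothetical, and the key columns `user_session` and `event_time` are taken from the merge condition used in this notebook.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Key = Tuple[str, str]


def merge_upsert(target: List[Row], source: List[Row]) -> List[Row]:
    """Toy model of MERGE: update rows whose key matches, insert the rest."""
    # Index the target by the match key (user_session, event_time).
    by_key: Dict[Key, Row] = {
        (r["user_session"], r["event_time"]): r for r in target
    }
    for row in source:
        key = (row["user_session"], row["event_time"])
        # Matched key: the source row replaces the target row (update all columns).
        # Unmatched key: the source row is added (insert all columns).
        by_key[key] = row
    return list(by_key.values())


# Hypothetical sample data.
target = [
    {"user_session": "a", "event_time": "2024-01-01 10:00", "event_type": "view"},
    {"user_session": "b", "event_time": "2024-01-01 10:05", "event_type": "view"},
]
source = [
    {"user_session": "b", "event_time": "2024-01-01 10:05", "event_type": "cart"},      # existing key
    {"user_session": "c", "event_time": "2024-01-01 10:09", "event_type": "purchase"},  # new key
]

merged = merge_upsert(target, source)
print(len(merged))  # 3
```

Applying the same `source` a second time leaves `merged` unchanged, which is the property that makes upserts safe to retry. One difference from this toy: Delta Lake raises an error when several source rows match the same target row, so the source usually needs de-duplicating on the key first.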

In [None]:
from delta.tables import DeltaTable

# Open the existing Delta table and load the batch of new or changed rows
deltaTable = DeltaTable.forPath(spark, "/delta/events")
updates = spark.read.csv("/path/to/new_data.csv", header=True, inferSchema=True)

# Match on (user_session, event_time): update matching rows, insert the rest
(
    deltaTable.alias("t")
    .merge(
        updates.alias("s"),
        "t.user_session = s.user_session AND t.event_time = s.event_time",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
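
Hand-writing the match condition gets error-prone as the number of key columns grows. One option, sketched here with a hypothetical helper named `merge_condition`, is to build the string from a list of key columns. The default aliases `t` and `s` mirror the ones used in the merge above.

```python
from typing import Sequence


def merge_condition(keys: Sequence[str], target: str = "t", source: str = "s") -> str:
    """Build an equality condition over `keys` for a MERGE between two aliases."""
    if not keys:
        raise ValueError("at least one key column is required")
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


condition = merge_condition(["user_session", "event_time"])
print(condition)
# t.user_session = s.user_session AND t.event_time = s.event_time
```

The resulting string can be passed as the second argument to `merge(...)` in place of the literal condition.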

## 2. Query Historical Versions (Time Travel)

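Delta Lake records every change as a new table version, and an older version can be read with one of two reader options:
- versionAsOf (a specific version number, starting at 0)
- timestampAsOf (the table as it was at a given date or timestamp)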
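
The two options address the same commit history in different ways: `versionAsOf` names a commit directly, while `timestampAsOf` resolves to the latest version committed at or before the given time. The plain-Python sketch below models that lookup. The `resolve_version` helper and the commit times are made up for illustration, no real Delta log is read, and rejecting timestamps outside the recorded range is an assumption of this sketch.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List

# Hypothetical commit times; the list index is the table version.
commit_times: List[datetime] = [
    datetime(2023, 12, 30, 9, 0),    # version 0
    datetime(2023, 12, 31, 18, 30),  # version 1
    datetime(2024, 1, 2, 8, 15),     # version 2
]


def resolve_version(commits: List[datetime], as_of: datetime) -> int:
    """Return the latest version committed at or before `as_of`."""
    if as_of < commits[0] or as_of > commits[-1]:
        # Assumption of this sketch: out-of-range timestamps are rejected.
        raise ValueError(f"{as_of} is outside the recorded history")
    # bisect_right counts the commits with time <= as_of; the last of them wins.
    return bisect_right(commits, as_of) - 1


print(resolve_version(commit_times, datetime(2024, 1, 1)))  # 1
```

A date-only value such as "2024-01-01" is read as midnight, so in this example it selects the last commit of the previous day rather than anything written during 1 January.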

In [None]:
# Read version 0, the first commit of the Delta table
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/events")

# Read the table as it was at a timestamp (a date-only value means midnight)
as_of_jan_1 = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/delta/events")

## 3. Optimize Tables

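OPTIMIZE compacts many small files into fewer, larger ones, which reduces the per-file overhead of each query.
ZORDER BY co-locates rows with similar values of the listed columns in the same files, so queries that filter on those columns can skip more files.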
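
To make the compaction idea concrete, here is a toy bin-packing model in plain Python: small files are grouped until each group approaches a target size. This is not Delta Lake's actual algorithm. The `plan_compaction` helper, the file sizes, and the 128 MB target are all illustrative assumptions.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int = 128) -> List[List[int]]:
    """Greedily group file sizes so each group stays at or under `target_mb`."""
    groups: List[List[int]] = []
    current: List[int] = []
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current group when adding this file would exceed the target.
        if current and sum(current) + size > target_mb:
            groups.append(current)
            current = []
        current.append(size)
    if current:
        groups.append(current)
    return groups


small_files = [10, 40, 25, 60, 5, 30, 50, 20]  # hypothetical sizes in MB
plan = plan_compaction(small_files)
print(len(small_files), "files ->", len(plan), "files")
# 8 files -> 3 files
```

The total amount of data is unchanged; only the number of files a query has to open and track goes down.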

In [None]:
# Assumes the Delta table is registered in the metastore as events_table
spark.sql("OPTIMIZE events_table ZORDER BY (event_type, user_id)")

## 4. Clean Old Files (VACUUM)

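VACUUM permanently deletes data files that the table no longer references and that are older than the retention period, which frees storage.
The retention period also limits how far back time travel can reach, so it should not be shorter than the oldest version you still need to query or restore.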
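
The rule can be pictured as a two-part test: a file is deleted only if the current table version no longer references it and it has been unreferenced for longer than the retention window. The following plain-Python sketch is a simplification of that rule, using a hypothetical `vacuum_candidates` helper and made-up file names and timestamps.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Hypothetical files: the value is when the file stopped being referenced
# by the table (None means the current version still uses it).
files: Dict[str, Optional[datetime]] = {
    "part-000.parquet": None,
    "part-001.parquet": datetime(2024, 1, 1, 12, 0),
    "part-002.parquet": datetime(2024, 1, 9, 12, 0),
}


def vacuum_candidates(
    files: Dict[str, Optional[datetime]], now: datetime, retain_hours: int = 168
) -> List[str]:
    """Return files that are unreferenced and older than the retention window."""
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(
        name
        for name, removed_at in files.items()
        if removed_at is not None and removed_at < cutoff
    )


now = datetime(2024, 1, 10, 12, 0)
print(vacuum_candidates(files, now))  # ['part-001.parquet']
```

Once `part-001.parquet` is deleted, any older version that depended on it can no longer be read with time travel, which is why the retention value matters.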

In [None]:
# 168 hours = 7 days, which is also Delta Lake's default retention period
spark.sql("VACUUM events_table RETAIN 168 HOURS")
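
Because VACUUM is destructive, it can help to preview what would be deleted first; Delta Lake's `VACUUM` accepts a `DRY RUN` suffix for this. The sketch below uses a hypothetical `vacuum_statement` helper to build both statements and to refuse retention values under 168 hours. That guard is a choice made for this example, not something the notebook prescribes.

```python
def vacuum_statement(table: str, retain_hours: int = 168, dry_run: bool = False) -> str:
    """Build a VACUUM statement, refusing retention below 168 hours (7 days)."""
    if retain_hours < 168:
        raise ValueError("retention under 168 hours risks breaking time travel")
    stmt = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    return f"{stmt} DRY RUN" if dry_run else stmt


preview = vacuum_statement("events_table", dry_run=True)
delete = vacuum_statement("events_table")
print(preview)  # VACUUM events_table RETAIN 168 HOURS DRY RUN
print(delete)   # VACUUM events_table RETAIN 168 HOURS

# With a live session, each statement would be run as: spark.sql(preview)
```

Running the preview first lists the files that would be removed without deleting anything.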

### Key Takeaway

Delta Lake supports reliable data pipelines by enabling:
- Incremental upserts (MERGE)
- Auditing and rollback (Time Travel)
- Performance improvements (OPTIMIZE + ZORDER)
- Storage cleanup (VACUUM)