# Modern Table Formats: Delta Lake, Iceberg, Hudi

Modern data lakes have evolved beyond simple file storage to support **ACID transactions**, **schema evolution**, and **time travel**. This notebook explores three leading open table formats that bring data warehouse-like reliability to data lakes.

---

## Table of Contents
1. [The Problem: ACID on Data Lakes](#acid-problem)
2. [Delta Lake](#delta-lake)
3. [Apache Iceberg](#apache-iceberg)
4. [Apache Hudi](#apache-hudi)
5. [Comparison of Table Formats](#comparison)
6. [Key Takeaways](#takeaways)

---

## 1. The Problem: ACID on Data Lakes <a id="acid-problem"></a>

### Traditional Data Lake Challenges

Traditional data lakes (storing raw Parquet, ORC, or JSON files) suffer from several critical limitations:

| Challenge | Description |
|-----------|-------------|
| **No ACID Transactions** | Concurrent writes can corrupt data; partial failures leave inconsistent state |
| **No Schema Enforcement** | Schema drift leads to data quality issues |
| **No Time Travel** | Cannot query historical versions of data |
| **Inefficient Updates/Deletes** | Changing a single row means rewriting whole files, and often whole partitions |
| **Small File Problem** | Streaming workloads create many small files, degrading read performance |

### ACID Properties Explained

```
┌─────────────────────────────────────────────────────────────────────┐
│                        ACID PROPERTIES                             │
├─────────────────────────────────────────────────────────────────────┤
│  A - Atomicity     │ All operations succeed or all fail together   │
│  C - Consistency   │ Data remains valid after transaction          │
│  I - Isolation     │ Concurrent transactions don't interfere       │
│  D - Durability    │ Committed changes persist despite failures    │
└─────────────────────────────────────────────────────────────────────┘
```

### How Table Formats Solve This

Modern table formats add a **metadata layer** on top of data files:

```
┌──────────────────────────────────────────────────────┐
│                   Query Engine                       │
│            (Spark, Trino, Flink, etc.)              │
└───────────────────────┬──────────────────────────────┘
                        │
┌───────────────────────▼──────────────────────────────┐
│              Table Format Layer                      │
│       (Delta Lake / Iceberg / Hudi)                 │
│  • Transaction Log    • Schema Management           │
│  • Version Control    • File Statistics             │
└───────────────────────┬──────────────────────────────┘
                        │
┌───────────────────────▼──────────────────────────────┐
│              Data Files (Parquet/ORC)                │
│                 on Object Storage                    │
│            (S3, ADLS, GCS, HDFS)                    │
└──────────────────────────────────────────────────────┘
```
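
The atomicity that the metadata layer provides can be sketched without any engine. The toy below is an illustration under stated assumptions, not how any of the three formats is implemented: a commit becomes visible only when a numbered log entry is created, and exclusive file creation (`open(..., "x")`) stands in for the atomic put-if-absent primitive a real format needs from its storage or catalog. `try_commit`, `latest_version`, and `commit_with_retry` are hypothetical names.

```python
import json
import os
import tempfile


def try_commit(log_dir, version, actions):
    """Try to publish `actions` as log entry `version`; return False if another writer won."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file already exists, so only one writer can claim a version.
        with open(path, "x") as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False


def latest_version(log_dir):
    """Return the highest committed version, or -1 for an empty log."""
    versions = [int(name.split(".")[0]) for name in os.listdir(log_dir)]
    return max(versions, default=-1)


def commit_with_retry(log_dir, actions, max_attempts=5):
    """Optimistic concurrency: re-read the latest version and retry on conflict."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many concurrent commits")


log_dir = tempfile.mkdtemp()
v0 = commit_with_retry(log_dir, [{"add": "part-00000.parquet"}])
v1 = commit_with_retry(log_dir, [{"add": "part-00001.parquet"}])
lost_race = try_commit(log_dir, v1, [{"add": "part-00002.parquet"}])
print(v0, v1, lost_race)  # 0 1 False
```

Readers that only trust files referenced by committed log entries never observe a half-finished write. A real implementation would also check whether the winning commit touched the same data before retrying.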

---

## 2. Delta Lake <a id="delta-lake"></a>

**Delta Lake** is an open-source storage layer developed by Databricks that brings ACID transactions to Apache Spark and big data workloads.

### Key Features

| Feature | Description |
|---------|-------------|
| **ACID Transactions** | Optimistic concurrency control; WriteSerializable isolation by default, Serializable optional |
| **Time Travel** | Query any historical version using version numbers or timestamps |
| **Schema Evolution** | Add or reorder columns without rewriting data; renaming or dropping columns requires column mapping |
| **Schema Enforcement** | Prevent bad data from entering the table |
| **MERGE (Upserts)** | Efficient insert, update, delete in a single operation |
| **Z-Ordering** | Multi-dimensional clustering for faster queries |
| **Change Data Feed** | Track row-level changes for CDC pipelines |
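
Z-ordering is easier to picture with a small example. The sketch below shows the general Z-order (Morton) curve idea: interleave the bits of two column values so that rows close in both dimensions sort near each other. It is conceptual only and does not reproduce how Delta Lake computes its clustering; `interleave_bits` is a hypothetical helper.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x occupies even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit positions
    return code


# Sort a 4x4 grid of (x, y) pairs along the Z-order curve.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(*p))
print(z_sorted[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Because neighbouring points land in the same stretch of the sort order, files written in this order tend to have tight min/max ranges on both columns, which is what makes file skipping effective.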

### Architecture

```
Delta Table Structure:
──────────────────────────────────────────
my_delta_table/
├── _delta_log/                  # Transaction log
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   └── 00000000000000000010.checkpoint.parquet
├── part-00000-xxx.parquet       # Data files
├── part-00001-xxx.parquet
└── part-00002-xxx.parquet
```
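
To see how the numbered files in `_delta_log/` yield both the current table and its history, the sketch below replays a heavily simplified log. It assumes each commit is newline-delimited JSON with only `add` and `remove` actions carrying a `path`; real commits hold more fields and action types, and `active_files` is a hypothetical helper rather than a Delta API.

```python
import json

# Simplified commit files: version -> newline-delimited JSON actions.
log = {
    0: '{"add": {"path": "part-00000.parquet"}}\n{"add": {"path": "part-00001.parquet"}}',
    1: '{"remove": {"path": "part-00000.parquet"}}\n{"add": {"path": "part-00002.parquet"}}',
    2: '{"add": {"path": "part-00003.parquet"}}',
}


def active_files(log, as_of_version):
    """Replay commits 0..as_of_version and return the data files visible at that version."""
    files = set()
    for version in sorted(v for v in log if v <= as_of_version):
        for line in log[version].splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


print(active_files(log, 0))  # files visible at version 0
print(active_files(log, 2))  # files visible at the latest version
```

Time travel falls out of the same replay: stopping at an earlier version gives the file set as it was then, provided the removed files have not yet been physically deleted.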

In [None]:
# Delta Lake Example with PySpark
# Note: Requires PySpark 3.5.x and the delta-spark package (pip install delta-spark==3.1.0)

from pyspark.sql import SparkSession

# Initialize Spark with Delta Lake
spark = SparkSession.builder \
    .appName("DeltaLakeDemo") \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

In [None]:
# Create a Delta Table
data = [
    (1, "Alice", "Engineering", 75000),
    (2, "Bob", "Marketing", 65000),
    (3, "Charlie", "Engineering", 80000),
]

columns = ["id", "name", "department", "salary"]
df = spark.createDataFrame(data, columns)

# Write as Delta format
df.write.format("delta").mode("overwrite").save("/tmp/employees_delta")

print("Delta table created successfully!")

In [None]:
# Time Travel - Query Historical Versions
from delta.tables import DeltaTable

# Make some updates to create versions
delta_table = DeltaTable.forPath(spark, "/tmp/employees_delta")

# Update: Give Engineering a raise
delta_table.update(
    condition="department = 'Engineering'",
    set={"salary": "salary * 1.1"}
)

# Query current version
print("Current Version:")
spark.read.format("delta").load("/tmp/employees_delta").show()

# Query previous version (version 0)
print("Version 0 (Original):")
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/employees_delta").show()

# Query by timestamp
# spark.read.format("delta").option("timestampAsOf", "2024-01-15").load("/tmp/employees_delta").show()

In [None]:
# MERGE Operation (Upsert)

# New/updated employee data
updates = [
    (2, "Bob", "Sales", 70000),      # Update: Bob moved to Sales
    (4, "Diana", "Engineering", 85000),  # Insert: New employee
]

updates_df = spark.createDataFrame(updates, columns)

# Perform MERGE operation
delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

print("After MERGE:")
spark.read.format("delta").load("/tmp/employees_delta").show()
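
The logical effect of `whenMatchedUpdateAll()` combined with `whenNotMatchedInsertAll()` can be modelled in plain Python. This is a sketch of the result only; it says nothing about how Delta Lake rewrites files to get there. `merge_upsert` is a hypothetical helper and the rows are illustrative.

```python
def merge_upsert(target_rows, source_rows, key="id"):
    """Logical MERGE result: matched keys take every source column, unmatched keys are inserted."""
    merged = {row[key]: dict(row) for row in target_rows}
    for row in source_rows:
        merged[row[key]] = dict(row)  # the same assignment covers both update and insert
    return [merged[k] for k in sorted(merged)]


target = [
    {"id": 1, "name": "Alice", "department": "Engineering", "salary": 75000},
    {"id": 2, "name": "Bob", "department": "Marketing", "salary": 65000},
    {"id": 3, "name": "Charlie", "department": "Engineering", "salary": 80000},
]
source = [
    {"id": 2, "name": "Bob", "department": "Sales", "salary": 70000},
    {"id": 4, "name": "Diana", "department": "Engineering", "salary": 85000},
]

result = merge_upsert(target, source)
for row in result:
    print(row)
```

Rows 1 and 3 are untouched, row 2 is replaced by the source version, and row 4 is new.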

In [None]:
# Schema Evolution

# Add a new column by enabling schema evolution
new_data = [
    (5, "Eve", "HR", 60000, "2024-01-15"),  # New column: hire_date
]

new_df = spark.createDataFrame(new_data, columns + ["hire_date"])

# Write with schema merge enabled
new_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/tmp/employees_delta")

print("Schema after evolution:")
spark.read.format("delta").load("/tmp/employees_delta").printSchema()
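
What `mergeSchema` does to the table can be pictured as follows. This is a simplified model, assuming columns are identified by name and that rows written before the change read the new column as null; `merge_schema` and `read_with_schema` are hypothetical helpers.

```python
def merge_schema(table_columns, incoming_columns):
    """Append columns that are new in the incoming batch, keeping the existing order."""
    return table_columns + [c for c in incoming_columns if c not in table_columns]


def read_with_schema(rows, schema):
    """Project each row onto the schema; columns absent from older rows read as None."""
    return [{col: row.get(col) for col in schema} for row in rows]


table_columns = ["id", "name", "department", "salary"]
existing_rows = [{"id": 1, "name": "Alice", "department": "Engineering", "salary": 75000}]
new_rows = [
    {"id": 5, "name": "Eve", "department": "HR", "salary": 60000, "hire_date": "2024-01-15"}
]

evolved = merge_schema(table_columns, list(new_rows[0]))
rows = read_with_schema(existing_rows + new_rows, evolved)
print(evolved)
print(rows[0]["hire_date"], rows[1]["hire_date"])
```

No existing data file has to change: the new column exists only in the schema and in files written after the append.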

In [None]:
# View Table History
delta_table = DeltaTable.forPath(spark, "/tmp/employees_delta")
delta_table.history().select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)

---

## 3. Apache Iceberg <a id="apache-iceberg"></a>

**Apache Iceberg** is an open table format originally developed at Netflix for massive-scale analytics. It is designed to be **engine-agnostic**, and one of its distinguishing features is **hidden partitioning**.

### Key Features

| Feature | Description |
|---------|-------------|
| **Hidden Partitioning** | Queries filter on the source column; Iceberg derives partition values and prunes files automatically |
| **Partition Evolution** | Change partitioning scheme without rewriting data |
| **Schema Evolution** | Full schema evolution with column ID tracking |
| **Time Travel** | Query snapshots by ID or timestamp |
| **Engine Agnostic** | Works with Spark, Flink, Trino, Dremio, and more |
| **Branching & Tagging** | Named references to snapshots for Git-like data management |
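
Column ID tracking is what makes renames safe. In the sketch below, which is a toy model rather than Iceberg's file format, data rows are keyed by field ID and the schema maps IDs to names, so renaming or adding a column only changes metadata. `read_rows` and the schemas are hypothetical.

```python
# Data file written under the original schema: values are keyed by field ID, not by name.
data_file_rows = [{1: 1, 2: "click"}, {1: 2, 2: "purchase"}]

schema_v1 = {1: "event_id", 2: "event_type"}
# Rename field 2 and add field 3: a metadata-only change, no data file is rewritten.
schema_v2 = {1: "event_id", 2: "kind", 3: "user_id"}


def read_rows(rows, schema):
    """Resolve field IDs to the names in `schema`; fields missing from the file read as None."""
    return [{name: row.get(field_id) for field_id, name in schema.items()} for row in rows]


print(read_rows(data_file_rows, schema_v1)[0])
print(read_rows(data_file_rows, schema_v2)[0])
```

Resolving by name instead would either lose the renamed column or silently bind a new column to old data that happened to share its name.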

### Architecture

```
Iceberg Table Structure:
──────────────────────────────────────────
Catalog (e.g., Hive Metastore, AWS Glue)
    │
    └── Points to → metadata/v1.metadata.json (current version pointer)
                            │
┌───────────────────────────┴────────────────────────────┐
│                   Metadata Layer                       │
├────────────────────────────────────────────────────────┤
│  metadata/                                             │
│  ├── v1.metadata.json      # Table metadata           │
│  ├── v2.metadata.json      # Updated metadata         │
│  ├── snap-xxx.avro         # Manifest list (snapshot) │
│  └── manifest-xxx.avro     # File manifests           │
└───────────────────────────┬────────────────────────────┘
                            │
┌───────────────────────────▼────────────────────────────┐
│                    Data Layer                          │
├────────────────────────────────────────────────────────┤
│  data/                                                 │
│  ├── dt=2024-01-15/                                    │
│  │   ├── file1.parquet                                 │
│  │   └── file2.parquet                                 │
│  └── dt=2024-01-16/                                    │
│      └── file3.parquet                                 │
└────────────────────────────────────────────────────────┘
```
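
The layering in the diagram can be traced with a small in-memory model. The classes below are hypothetical stand-ins, not Iceberg's API: table metadata points at a current snapshot, the snapshot lists manifests, and each manifest records its data files together with the partition range they cover, which lets a scan skip whole manifests.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    """Lists data files and summarizes the partition range they cover."""
    min_day: str
    max_day: str
    data_files: list


@dataclass
class Snapshot:
    snapshot_id: int
    manifests: list  # stands in for the manifest list file


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict = field(default_factory=dict)


def plan_scan(table, day):
    """Walk metadata -> snapshot -> manifests -> data files, skipping manifests out of range."""
    snapshot = table.snapshots[table.current_snapshot_id]
    files = []
    for manifest in snapshot.manifests:
        # ISO dates compare correctly as strings.
        if manifest.min_day <= day <= manifest.max_day:
            files.extend(manifest.data_files)
    return files


m1 = Manifest("2024-01-15", "2024-01-15",
              ["data/dt=2024-01-15/file1.parquet", "data/dt=2024-01-15/file2.parquet"])
m2 = Manifest("2024-01-16", "2024-01-16", ["data/dt=2024-01-16/file3.parquet"])
table = TableMetadata(current_snapshot_id=2, snapshots={2: Snapshot(2, [m1, m2])})

print(plan_scan(table, "2024-01-16"))
```

Planning touches only metadata, so the engine never has to list directories on object storage to find the files for a query.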

In [None]:
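# Apache Iceberg with PySpark
# Requires: Spark 3.5 with the iceberg-spark-runtime package
# Run in a fresh kernel (or call spark.stop() first): getOrCreate() reuses an active session,
# so static settings such as spark.jars.packages from the Delta cells would still apply.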

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("IcebergDemo") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse") \
    .getOrCreate()

In [None]:
# Create an Iceberg table with partitioning
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        event_id BIGINT,
        event_type STRING,
        event_time TIMESTAMP,
        user_id BIGINT,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_time), event_type)
""")

# Note: Hidden partitioning! Users query by event_time, not by partition
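
Hidden partitioning works because the partition value is a function of the source column, so a filter on `event_time` can be translated into a filter on partitions. The sketch below models a `days(...)`-style transform as whole days since 1970-01-01 and assumes UTC timestamps; `days_transform` and `partitions_to_scan` are hypothetical helpers, not Iceberg functions.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_transform(ts):
    """Map a UTC timestamp to whole days since 1970-01-01."""
    return (ts - EPOCH).days


def partitions_to_scan(partition_days, lower_bound):
    """Keep partitions that may contain rows with event_time > lower_bound."""
    first_day = days_transform(lower_bound)
    return [d for d in partition_days if d >= first_day]


events = [
    datetime(2024, 1, 14, 23, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc),
    datetime(2024, 1, 16, 9, 0, tzinfo=timezone.utc),
]
partition_days = sorted({days_transform(ts) for ts in events})
selected = partitions_to_scan(partition_days, datetime(2024, 1, 15, tzinfo=timezone.utc))
print(partition_days, selected)
```

The query author only ever writes the `event_time` predicate; the derived day values stay an internal detail of the table.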

In [None]:
# Insert data
spark.sql("""
    INSERT INTO local.db.events VALUES
        (1, 'click', timestamp'2024-01-15 10:30:00', 100, '{"page": "home"}'),
        (2, 'purchase', timestamp'2024-01-15 11:00:00', 100, '{"amount": 99.99}'),
        (3, 'click', timestamp'2024-01-16 09:00:00', 101, '{"page": "product"}')
""")

# Query - no need to specify partition columns!
spark.sql("""
    SELECT * FROM local.db.events 
    WHERE event_time > timestamp'2024-01-15 00:00:00'
""").show()

In [None]:
# Partition Evolution - Change partitioning without rewriting data!
spark.sql("""
    ALTER TABLE local.db.events 
    ADD PARTITION FIELD bucket(16, user_id)
""")

# New data will use the new partitioning; old data remains unchanged
print("Partition evolution applied! New writes will use additional bucketing.")
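
Partition evolution avoids rewrites because each data file remembers the partition spec it was written under. The toy model below uses hypothetical names (`partition_specs`, `write_file`) and made-up file paths to show old and new files coexisting under different specs.

```python
# Every spec the table has ever had, keyed by spec ID.
partition_specs = {
    0: ["days(event_time)", "event_type"],
    1: ["days(event_time)", "event_type", "bucket(16, user_id)"],  # after ADD PARTITION FIELD
}
default_spec_id = 1  # new writes use the latest spec

data_files = [
    {"path": "file-a.parquet", "spec_id": 0},  # written before the change, left untouched
]


def write_file(path):
    """Record a new data file under the table's current default spec."""
    data_files.append({"path": path, "spec_id": default_spec_id})


write_file("file-b.parquet")
layout = {f["path"]: partition_specs[f["spec_id"]] for f in data_files}
for path, spec in layout.items():
    print(path, spec)
```

At query time, each group of files is pruned using the spec it was written with, so filters keep working across the change.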

In [None]:
# Time Travel with Iceberg

# View table snapshots
spark.sql("SELECT * FROM local.db.events.snapshots").show(truncate=False)

# Query a specific snapshot
# spark.sql("SELECT * FROM local.db.events VERSION AS OF 123456789").show()

# Query by timestamp
# spark.sql("SELECT * FROM local.db.events TIMESTAMP AS OF '2024-01-15 12:00:00'").show()
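
A timestamp-based query has to be resolved to one concrete snapshot. The sketch below assumes the rule is "the last snapshot committed at or before the requested time" and uses invented snapshot IDs; `snapshot_as_of` is a hypothetical helper, not an Iceberg call.

```python
from bisect import bisect_right

# (committed_at, snapshot_id) pairs in commit order; the values are made up.
history = [
    ("2024-01-15 10:30:00", 101),
    ("2024-01-15 11:00:00", 102),
    ("2024-01-16 09:00:00", 103),
]


def snapshot_as_of(history, timestamp):
    """Return the ID of the last snapshot committed at or before `timestamp`."""
    times = [committed_at for committed_at, _ in history]
    idx = bisect_right(times, timestamp)
    if idx == 0:
        raise ValueError("no snapshot exists at or before " + timestamp)
    return history[idx - 1][1]


print(snapshot_as_of(history, "2024-01-15 12:00:00"))  # 102
```

Once a snapshot has been expired it drops out of this history, so time travel only reaches back as far as the retained snapshots.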

In [None]:
# Iceberg Maintenance Operations

# Expire old snapshots (cleanup)
spark.sql("""
    CALL local.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-14 00:00:00',
        retain_last => 5
    )
""")

# Rewrite data files (compaction)
spark.sql("""
    CALL local.system.rewrite_data_files(
        table => 'db.events'
    )
""")

---

## 4. Apache Hudi <a id="apache-hudi"></a>

**Apache Hudi** (Hadoop Upserts Deletes and Incrementals) was developed at Uber for **streaming data lake** use cases and is optimized for **incremental processing**.

### Key Features

| Feature | Description |
|---------|-------------|
| **Upserts/Deletes** | First-class support for record-level updates |
| **Incremental Queries** | Efficiently query only changed data |
| **Two Table Types** | Copy-on-Write (CoW) and Merge-on-Read (MoR) |
| **Time Travel** | Query historical versions via timeline |
| **Streaming Ingestion** | Native integration with Spark Streaming, Flink |
| **Automatic Compaction** | Background compaction for MoR tables |

### Table Types

```
┌─────────────────────────────────────────────────────────────────────────┐
│                      HUDI TABLE TYPES                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   Copy-on-Write (CoW)              Merge-on-Read (MoR)                 │
│   ─────────────────────            ─────────────────────               │
│   • Rewrites files on update       • Logs updates separately           │
│   • Higher write latency           • Lower write latency               │
│   • Faster reads                   • Slower reads (merge required)     │
│   • Best for: batch, infrequent    • Best for: streaming, frequent     │
│     updates                          updates                            │
│                                                                         │
│   ┌─────────────┐                  ┌─────────────┐  ┌────────────┐     │
│   │  Parquet    │                  │  Parquet    │  │ Log Files  │     │
│   │  (base)     │                  │  (base)     │  │ (updates)  │     │
│   └─────────────┘                  └─────────────┘  └────────────┘     │
│         │                                   \           /              │
│         ▼                                    \         /               │
│   ┌─────────────┐                         ┌──────────────┐             │
│   │  Parquet    │                         │ Merge at     │             │
│   │  (updated)  │                         │ Read Time    │             │
│   └─────────────┘                         └──────────────┘             │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
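
The trade-off in the diagram can be reproduced with dictionaries standing in for files. This is a conceptual sketch with hypothetical helper names, not Hudi's storage layout: Copy-on-Write pays the merge cost when writing, Merge-on-Read defers it to read time or to compaction.

```python
def cow_upsert(base, updates):
    """Copy-on-Write: produce a new base file with the updates already applied."""
    return {**base, **updates}


def mor_upsert(log, updates):
    """Merge-on-Read: append the updates to a log; the base file is left untouched."""
    return log + [updates]


def mor_read(base, log):
    """Merge base and log records at read time; later log entries win."""
    merged = dict(base)
    for entry in log:
        merged.update(entry)
    return merged


def compact(base, log):
    """Compaction folds the log into a new base file and resets the log."""
    return mor_read(base, log), []


base = {1: 150.0, 2: 250.0}      # record key -> amount
updates = {2: 300.0, 4: 500.0}

cow_base = cow_upsert(base, updates)
log = mor_upsert([], updates)
print(cow_base == mor_read(base, log))  # True: both views agree
print(compact(base, log))
```

Both table types return the same rows; they differ in when the work of merging is done.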

In [None]:
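# Apache Hudi with PySpark
# Run in a fresh kernel (or call spark.stop() first): getOrCreate() reuses an active session,
# so static settings such as spark.jars.packages from an earlier cell would still apply.
# The Spark 3.5 bundle is published from Hudi 0.15.0 onward; match the bundle to your Spark version.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HudiDemo") \
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
    .getOrCreate()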

In [None]:
# Create a Hudi table (Copy-on-Write)
from datetime import datetime

data = [
    (1, "order_001", "Alice", 150.00, datetime(2024, 1, 15, 10, 30)),
    (2, "order_002", "Bob", 250.00, datetime(2024, 1, 15, 11, 0)),
    (3, "order_003", "Charlie", 100.00, datetime(2024, 1, 15, 12, 0)),
]

columns = ["id", "order_id", "customer", "amount", "order_time"]
df = spark.createDataFrame(data, columns)

# Hudi table configuration
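hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "id",
    # Non-partitioned table: empty partition path plus the matching key generator
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    # When two records share a key, the one with the larger order_time wins
    "hoodie.datasource.write.precombine.field": "order_time",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.upsert.shuffle.parallelism": 2,
}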

# Write to Hudi
df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save("/tmp/hudi_orders")

print("Hudi table created!")
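
The record key and precombine field together decide which version of a record survives. The sketch below models only that rule, under the assumption that the record with the largest precombine value wins among duplicates of a key; `precombine` is a hypothetical helper and the batch is illustrative.

```python
def precombine(records, key_field, precombine_field):
    """For each record key, keep the record with the largest precombine value."""
    latest = {}
    for rec in records:
        key = rec[key_field]
        if key not in latest or rec[precombine_field] >= latest[key][precombine_field]:
            latest[key] = rec
    return list(latest.values())


batch = [
    {"id": 2, "amount": 250.0, "order_time": "2024-01-15T11:00:00"},
    {"id": 2, "amount": 300.0, "order_time": "2024-01-15T14:00:00"},  # newer version of id 2
    {"id": 4, "amount": 500.0, "order_time": "2024-01-15T15:00:00"},
]

deduped = precombine(batch, "id", "order_time")
for rec in deduped:
    print(rec)
```

This is why the precombine field is usually an event timestamp or a sequence number: it gives late or duplicated records a deterministic outcome.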

In [None]:
# Upsert operation
updates = [
    (2, "order_002", "Bob", 300.00, datetime(2024, 1, 15, 14, 0)),  # Update amount
    (4, "order_004", "Diana", 500.00, datetime(2024, 1, 15, 15, 0)),  # New order
]

updates_df = spark.createDataFrame(updates, columns)

updates_df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("/tmp/hudi_orders")

# Read current state
print("After Upsert:")
spark.read.format("hudi").load("/tmp/hudi_orders").select(columns).show()

In [None]:
# Incremental Query - Get only changed records

# Get commits timeline
commits_df = spark.read.format("hudi").load("/tmp/hudi_orders") \
    .select("_hoodie_commit_time").distinct().orderBy("_hoodie_commit_time")
commits_df.show()

# Incremental query from a specific commit
# begin_time = "20240115100000"  # Format: yyyyMMddHHmmss
# incremental_df = spark.read.format("hudi") \
#     .option("hoodie.datasource.query.type", "incremental") \
#     .option("hoodie.datasource.read.begin.instanttime", begin_time) \
#     .load("/tmp/hudi_orders")
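
An incremental query boils down to a filter on the commit time stored with each record. The sketch below assumes the begin instant is exclusive and that instants are fixed-width strings, so they compare correctly as text; `incremental_read` is a hypothetical helper and the commit times are invented.

```python
records = [
    {"_hoodie_commit_time": "20240115100000", "id": 1, "amount": 150.0},
    {"_hoodie_commit_time": "20240115100000", "id": 3, "amount": 100.0},
    {"_hoodie_commit_time": "20240115140000", "id": 2, "amount": 300.0},
    {"_hoodie_commit_time": "20240115140000", "id": 4, "amount": 500.0},
]


def incremental_read(records, begin_instant):
    """Return records whose commit time is strictly after `begin_instant`."""
    return [r for r in records if r["_hoodie_commit_time"] > begin_instant]


changed = incremental_read(records, "20240115100000")
print([r["id"] for r in changed])  # [2, 4]
```

A downstream job can store the last instant it processed and pass it as the next begin instant, which turns a full-table scan into a read of only the new commits.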

In [None]:
# Time Travel with Hudi

# Query as of a specific timestamp
# historical_df = spark.read.format("hudi") \
#     .option("as.of.instant", "20240115120000") \
#     .load("/tmp/hudi_orders")

# Inspect the commit time recorded on each record
timeline_df = spark.read.format("hudi").load("/tmp/hudi_orders") \
    .select("_hoodie_commit_time", "_hoodie_record_key", "customer", "amount")
timeline_df.show()

---

## 5. Comparison of Table Formats <a id="comparison"></a>

### Feature Comparison Matrix

| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---------|------------|----------------|-------------|
| **Origin** | Databricks | Netflix | Uber |
| **First Release** | 2019 | 2018 | 2016 |
| **License** | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| **ACID Transactions** | ✅ | ✅ | ✅ |
| **Time Travel** | ✅ | ✅ | ✅ |
| **Schema Evolution** | ✅ | ✅ (best) | ✅ |
| **Partition Evolution** | ❌ | ✅ | ✅ (limited) |
| **Hidden Partitioning** | ❌ | ✅ | ❌ |
| **Incremental Processing** | Change Data Feed | ✅ | ✅ (best) |
| **Engine Support** | Spark-centric | Multi-engine | Spark/Flink |
| **Small File Handling** | OPTIMIZE + Z-Order | Compaction | Auto-compaction |
| **Best For** | Spark/Databricks | Multi-engine analytics | Streaming CDC |

### When to Use Each Format

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    CHOOSING A TABLE FORMAT                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Use DELTA LAKE when:                                                   │
│  • Using Databricks or primarily Spark                                  │
│  • Need tight integration with MLflow                                   │
│  • Want simplest getting-started experience                             │
│  • Batch-heavy workloads with occasional updates                        │
│                                                                         │
│  Use ICEBERG when:                                                      │
│  • Multi-engine environment (Spark + Trino + Flink)                    │
│  • Need hidden partitioning for user simplicity                        │
│  • Want partition evolution without rewrites                            │
│  • Large-scale analytics with complex schemas                           │
│                                                                         │
│  Use HUDI when:                                                         │
│  • Heavy streaming/CDC workloads                                        │
│  • Need efficient incremental processing                                │
│  • Frequent upserts with low latency requirements                       │
│  • Database replication to data lake                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Performance Characteristics

| Workload | Best Format | Reason |
|----------|-------------|--------|
| Batch analytics | Iceberg | Hidden partitioning, excellent query planning |
| Streaming CDC | Hudi (MoR) | Optimized for frequent small updates |
| ML Pipelines | Delta Lake | MLflow integration, feature store support |
| Multi-engine | Iceberg | Broadest engine compatibility |
| Databricks ecosystem | Delta Lake | Native integration, optimized performance |

In [None]:
# Side-by-side: Creating tables in each format

sample_data = [
    (1, "product_a", 100),
    (2, "product_b", 200),
    (3, "product_c", 150),
]
columns = ["id", "product", "quantity"]

# This is conceptual - each requires its own Spark configuration

# Delta Lake
delta_write = """
(df.write.format("delta")
    .mode("overwrite")
    .save("/path/to/delta_table"))
"""

# Apache Iceberg
iceberg_write = """
(df.writeTo("catalog.db.iceberg_table")
    .using("iceberg")
    .createOrReplace())
"""

# Apache Hudi
hudi_write = """
(df.write.format("hudi")
    .option("hoodie.table.name", "hudi_table")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "quantity")
    .mode("overwrite")
    .save("/path/to/hudi_table"))
"""

print("Delta Lake Write:")
print(delta_write)
print("\nIceberg Write:")
print(iceberg_write)
print("\nHudi Write:")
print(hudi_write)

---

## 6. Key Takeaways <a id="takeaways"></a>

### Summary

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         KEY TAKEAWAYS                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  1. MODERN TABLE FORMATS SOLVE CRITICAL DATA LAKE LIMITATIONS          │
│     • ACID transactions prevent data corruption                         │
│     • Time travel enables debugging and regulatory compliance           │
│     • Schema evolution allows flexible data modeling                    │
│                                                                         │
│  2. ALL THREE FORMATS PROVIDE SIMILAR CORE CAPABILITIES                │
│     • ACID guarantees                                                   │
│     • Time travel / versioning                                          │
│     • Efficient updates and deletes                                     │
│     • Schema management                                                 │
│                                                                         │
│  3. CHOOSE BASED ON YOUR ECOSYSTEM AND WORKLOAD                        │
│     • Delta Lake → Databricks/Spark-centric environments               │
│     • Iceberg → Multi-engine, complex analytics                         │
│     • Hudi → Streaming and CDC-heavy workloads                          │
│                                                                         │
│  4. CONSIDER OPERATIONAL ASPECTS                                        │
│     • All require maintenance (compaction, cleanup)                     │
│     • Cloud integration varies (AWS/Azure/GCP support)                 │
│     • Tooling ecosystem and community support matter                   │
│                                                                         │
│  5. THE INDUSTRY IS CONVERGING                                          │
│     • UniForm lets Iceberg/Hudi clients read Delta tables              │
│     • Apache XTable enables cross-format translation                    │
│     • Future: more interoperability between formats                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Quick Decision Matrix

| If you need... | Choose |
|----------------|--------|
| Easiest Spark integration | **Delta Lake** |
| Best multi-engine support | **Apache Iceberg** |
| Streaming/CDC optimization | **Apache Hudi** |
| Partition evolution | **Apache Iceberg** |
| Databricks compatibility | **Delta Lake** |
| AWS native integration | All (Iceberg has best Athena/EMR support) |

### Further Reading

- [Delta Lake Documentation](https://docs.delta.io/)
- [Apache Iceberg Documentation](https://iceberg.apache.org/docs/latest/)
- [Apache Hudi Documentation](https://hudi.apache.org/docs/overview/)
- [Apache XTable (formerly OneTable)](https://xtable.apache.org/) - Cross-format translation