vigneshc/TableFormatsExplorer

Table Formats Explorer

Side-by-side comparison of three modern table/storage formats through runnable Python wrappers and agent-driven exploration.

What this repo teaches

| Format | Category | Key concept |
| --- | --- | --- |
| Delta Lake | ACID data-lake table | Transaction log (_delta_log) drives versioning; checkpoints compact the log |
| Apache Iceberg | Analytic table format | Snapshot/manifest tree separates metadata from data; branches are snapshot refs |
| SlateDB | Embedded LSM key-value store | WAL → L0 SSTs → compacted runs; every state transition is visible in manifests |

Intended workflow

This repo is for educational purposes and is designed to be explored with a coding agent:

  1. Clone the repo and install dependencies.
  2. Open a chat.
  3. Ask questions like "show me what happens to Iceberg metadata when I append a row" or "walk me through the SlateDB LSM phases".
  4. The agent writes and runs code using the wrappers in table_formats_demo/, saves tables under demo_output/<format>/, and exports human-readable YAML explanation files under demo_output/<format>/scratch/.
  5. The agent explains what changed and links to the generated files.

Rule for agents: always use the wrappers in table_formats_demo/ (or write code that calls their APIs) to generate outputs. Never hand-craft YAML or JSON manifests. See AGENTS.md for full guidance.

Project structure

All the code in this repo was generated by a coding agent, with only a very brief review by the author. The repo is meant only for demo and educational purposes.

table_formats_demo/
├── base/           # Shared models (TableRow, OperationResult, …) and abstract TableFormat
├── delta/          # DeltaFormat  – wraps deltalake
├── iceberg/        # IcebergFormat – wraps pyiceberg with SQLite catalog
├── slatedb/        # SlateDBFormat – wraps slatedb (LSM-tree key-value store)
└── utils/          # yaml_helpers, logging_config

demos/              # Runnable end-to-end demo scripts
tests/              # pytest test suite (one file per wrapper)
demo_output/        # Generated at runtime, git-ignored
  delta/
    users/          # Delta table files
    scratch/        # YAML explanation files (transaction log entries)
  iceberg/
    default/users/  # Iceberg data + metadata
    catalog/        # SQLite catalog
    scratch/        # YAML explanation files (metadata snapshots, manifests)
  slatedb/
    users/          # SlateDB table files (wal/, manifest/, compacted/)
    scratch/        # YAML explanation files (manifest versions)

Setup

Prerequisites: Python 3.11+, uv

git clone https://github.com/vigneshc/TableFormatsDemo.git
cd TableFormatsDemo
uv sync --dev

Running demos

Each demo script creates a fresh table, runs a complete lifecycle of operations, and writes YAML scratch files for inspection.

uv run python demos/delta_demo.py
uv run python demos/iceberg_demo.py
uv run python demos/slatedb_demo.py

Output is written to demo_output/<format>/. Scratch files land in demo_output/<format>/scratch/.
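
To see what a demo run produced, the scratch directory can be listed with a few lines of standard-library Python. This is a minimal sketch, not part of the repo: the helper name `list_scratch_files` is hypothetical, and it assumes the scratch files carry a `.yaml` or `.yml` extension.

```python
from pathlib import Path


def list_scratch_files(output_root: str, table_format: str) -> list[Path]:
    """Return YAML files under <output_root>/<table_format>/scratch/, sorted by name.

    Hypothetical helper for illustration; not provided by table_formats_demo.
    """
    scratch_dir = Path(output_root) / table_format / "scratch"
    if not scratch_dir.is_dir():
        return []  # the demo for this format has not been run yet
    # Assumption: scratch files use a .yaml or .yml extension.
    return sorted(p for p in scratch_dir.iterdir() if p.suffix in {".yaml", ".yml"})


if __name__ == "__main__":
    for fmt in ("delta", "iceberg", "slatedb"):
        files = list_scratch_files("demo_output", fmt)
        print(f"{fmt}: {len(files)} scratch file(s)")
        for path in files:
            print(f"  {path.name}")
```

Running it before any demo simply reports zero files per format, which makes it a quick check that a demo actually wrote output.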

Running tests

uv run pytest
uv run pytest --cov=table_formats_demo --cov-report=html

Key APIs used by agents

Delta Lake

from table_formats_demo.delta.delta_format import DeltaFormat
delta = DeltaFormat(base_path="demo_output/delta", table_name="users")
delta.create_table(initial_data=...)   # writes _delta_log/00000000000000000000.json
delta.append_data(...)                  # new version entry in transaction log
delta.perform_maintenance()             # checkpoint + vacuum + compact
delta.export_scratch("demo_output/delta/scratch")  # YAML per log entry
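
The transaction log can also be inspected directly, without the wrapper. The sketch below relies on two properties of the Delta log: commit files are named with the version zero-padded to 20 digits, and each line of a commit file is a JSON object whose single top-level key names the action (`add`, `remove`, `metaData`, `protocol`, `commitInfo`, and so on). The function names are hypothetical, and the log path in the usage note is an assumption based on the project layout above.

```python
import json
from collections import Counter
from pathlib import Path


def commit_file_name(version: int) -> str:
    """Delta commit files are named with the version zero-padded to 20 digits."""
    return f"{version:020d}.json"


def summarize_commit(log_dir: str, version: int) -> Counter:
    """Count the action types recorded in one commit file of a _delta_log directory."""
    counts: Counter = Counter()
    path = Path(log_dir) / commit_file_name(version)
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            action = json.loads(line)
            # Each line is one action; its top-level key is the action type.
            counts.update(action.keys())
    return counts
```

After running the Delta demo, something like `summarize_commit("demo_output/delta/users/_delta_log", 0)` should show which actions the initial commit recorded, assuming the table lives at that path.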

Apache Iceberg

from table_formats_demo.iceberg.iceberg_format import IcebergFormat
iceberg = IcebergFormat(base_path="demo_output/iceberg", table_name="users")
iceberg.create_table(initial_data=...)      # snapshot 0, manifest list/manifest
iceberg.append_data(...)                     # new snapshot with new manifest
iceberg.perform_maintenance()                # full-table rewrite → compacted snapshot
iceberg.create_branch("feature")             # named snapshot ref in metadata
iceberg.export_scratch("demo_output/iceberg/scratch")  # YAML per .json and .avro file
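
To see why a branch is "just" a named snapshot ref, it helps to read the table metadata JSON directly. The sketch below uses field names from the Iceberg table spec (`refs`, `snapshots`, `snapshot-id`, `parent-snapshot-id`); the function names are hypothetical and not part of the wrapper, and the exact metadata file name under demo_output/iceberg/ depends on the run, so it is left to the caller.

```python
import json
from pathlib import Path


def load_metadata(path: str) -> dict:
    """Load an Iceberg table metadata JSON file (path chosen by the caller)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def describe_refs(metadata: dict) -> dict[str, dict]:
    """Map each ref name (branch or tag) to its type and target snapshot id."""
    refs = metadata.get("refs", {})
    return {
        name: {"type": ref.get("type"), "snapshot_id": ref.get("snapshot-id")}
        for name, ref in sorted(refs.items())
    }


def snapshot_lineage(metadata: dict, snapshot_id: int) -> list[int]:
    """Follow parent-snapshot-id links from snapshot_id back to the first snapshot."""
    by_id = {s["snapshot-id"]: s for s in metadata.get("snapshots", [])}
    lineage: list[int] = []
    current = snapshot_id
    while current is not None and current in by_id:
        lineage.append(current)
        current = by_id[current].get("parent-snapshot-id")
    return lineage
```

Calling `describe_refs` before and after `create_branch("feature")` should show a new entry that points at an existing snapshot id, with no new data files written.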

SlateDB

from table_formats_demo.slatedb.slatedb_format import SlateDBFormat
db = SlateDBFormat(base_path="demo_output/slatedb", table_name="users")
db.create_table(initial_data=...)        # opens DB, writes rows
db.flush_wal_only()                      # WAL SST on disk, memtable unchanged
db.flush_memtable_to_l0()               # memtable → L0 SST, manifest updated
db.compact_l0_to_lower_levels()         # L0 SSTs → compacted runs
db.create_clone(clone_name="snapshot")  # zero-copy checkpoint-based clone
db.export_scratch("demo_output/slatedb/scratch")  # YAML per manifest version

Format comparison cheat-sheet

| Concept | Delta Lake | Iceberg | SlateDB |
| --- | --- | --- | --- |
| Versioning unit | Transaction log entry | Snapshot | Manifest |
| Metadata format | JSON (+ Parquet checkpoint) | JSON + Avro | FlatBuffer (readable via admin API) |
| Compaction | optimize.compact() | Full-table overwrite | L0 → compacted runs |
| Branching | Not supported (use clone) | Snapshot refs | Checkpoint-based clone |
| Catalog | None (path-based) | SQL (SQLite here) | N/A |

License

MIT
