## 1. Introduction
---
    
In this tutorial, we explore and compare two optimization strategies in Databricks: Partitioning with Z-Ordering and Liquid Clustering. These techniques help optimize data skipping and read performance for large Delta Lake tables.

**Explanation**
- Partitioning splits data into directories based on column values.
- Z-Ordering reorders data within files to colocate related information.
- Liquid Clustering introduces automatic optimization without strict partition boundaries.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import expr
from pprint import pprint
import json 

# `spark` comes instantiated out of the box in Databricks. If you're running locally, you need to create it yourself.
# We create it explicitly here to show how that is done.
spark = SparkSession.builder.appName("DeltaOptimization").getOrCreate()


## 2. Dataset Preparation
---
We'll create a synthetic dataset that simulates a typical claims dataset, saved as a Delta table.

**Explanation**
- The dataset includes `claim_id`, `member_id`, `claim_date`, `claim_amount`, and `state`.
- We'll use this dataset to demonstrate both optimization approaches.

In [None]:
df = spark.range(0, 100)
claims_df = df.withColumn("claim_id", expr("id")) \
              .withColumn("member_id", expr("id % 100000")) \
              .withColumn("claim_date", expr("date_add('2020-01-01', cast(id % 10 as int))")) \
              .withColumn("state", expr("CASE WHEN id % 5 = 0 THEN 'CA' WHEN id % 5 = 1 THEN 'TX' WHEN id % 5 = 2 THEN 'NY' WHEN id % 5 = 3 THEN 'FL' ELSE 'WA' END")) \
              .withColumn("claim_amount", expr("round(rand() * 1000, 2)"))
claims_df.display()
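
The column expressions above can be mirrored in plain Python to check what the synthetic data looks like before it is written. This is only a sketch for sanity-checking the generator: `make_row` and `STATES` are hypothetical helper names, and `claim_amount` is left out because it is random. It shows that `id % 5` and `id % 10` are correlated, so each state only ever sees two distinct `claim_date` values.

```python
from collections import defaultdict
from datetime import date, timedelta

STATES = ["CA", "TX", "NY", "FL", "WA"]  # same order as the CASE expression


def make_row(i: int) -> dict:
    """Mirror the Spark column expressions for a single id (claim_amount omitted)."""
    return {
        "claim_id": i,
        "member_id": i % 100000,
        "claim_date": date(2020, 1, 1) + timedelta(days=i % 10),
        "state": STATES[i % 5],
    }


rows = [make_row(i) for i in range(100)]

# Collect the distinct claim dates that occur for each state.
dates_per_state = defaultdict(set)
for row in rows:
    dates_per_state[row["state"]].add(row["claim_date"].isoformat())

for state in STATES:
    print(state, sorted(dates_per_state[state]))
```

For `CA` this prints `2020-01-01` and `2020-01-06`, which is why those two dates show up together in the partition inspected later.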

In [None]:
#TODO: Add ingestion time clustering here

## 3. Traditional Partitioning + Z-Ordering
---
We write the data into a Delta table using `state` as the partition column and then apply Z-Ordering on `claim_date`.

**Explanation**
- Partitioning creates directory structures like `/state=CA/`.
- Z-Ordering within each partition sorts data to improve skipping on `claim_date`.
- Z-Ordering physically colocates similar `claim_date` values into the same files, so when queries filter by date, Spark can skip unrelated files more effectively.
- The Spark optimizer uses **min/max statistics per file** to determine if a file can be skipped for a query predicate.
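
To make the min/max idea concrete, here is a minimal pure-Python sketch of file skipping. It is not how Spark or Delta Lake implement it internally; `FileStats` and `files_to_scan` are hypothetical names, and the file ranges are invented to contrast an unsorted layout with a date-ordered one.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical stand-in for the per-file statistics kept in the Delta log."""
    path: str
    min_date: str  # ISO-formatted dates compare correctly as strings
    max_date: str


def files_to_scan(files, wanted_date):
    """Keep only files whose [min, max] range could contain wanted_date."""
    return [f.path for f in files if f.min_date <= wanted_date <= f.max_date]


# Unsorted layout: every file spans almost the whole date range.
unsorted_layout = [
    FileStats("part-0000", "2020-01-01", "2020-01-10"),
    FileStats("part-0001", "2020-01-01", "2020-01-10"),
    FileStats("part-0002", "2020-01-02", "2020-01-09"),
]
# Date-ordered layout: each file covers a narrow, non-overlapping range.
sorted_layout = [
    FileStats("part-0000", "2020-01-01", "2020-01-03"),
    FileStats("part-0001", "2020-01-04", "2020-01-07"),
    FileStats("part-0002", "2020-01-08", "2020-01-10"),
]

print(files_to_scan(unsorted_layout, "2020-01-06"))
print(files_to_scan(sorted_layout, "2020-01-06"))
```

In the unsorted layout no file can be ruled out, while in the ordered layout only `part-0001` has a range that includes the requested date.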

**File Structure Example**:
```
/tmp/delta/claims_partitioned/
├── state=CA/
│   ├── part-0000.snappy.parquet
│   └── ...
├── state=TX/
│   ├── part-0000.snappy.parquet
...
```

In [None]:
# Removing the old data for new tutorial runs
%rm -rf /dbfs/tmp/delta/claims_partitioned

In [None]:
claims_df.write.format("delta") \
    .partitionBy("state") \
    .mode("overwrite") \
    .save("/tmp/delta/claims_partitioned")

In [None]:
# Partition structure
%ls /dbfs/tmp/delta/claims_partitioned

In [None]:
# File structure
%ls /dbfs/tmp/delta/claims_partitioned/state=CA

In [None]:
# Inspecting the first delta log
with open("/dbfs/tmp/delta/claims_partitioned/_delta_log/00000000000000000000.json", "r") as f:
    for line in f:
        pprint(json.loads(line))

In [None]:
# File observation prior to optimizing
spark.read.format("delta").load(
    "dbfs:/tmp/delta/claims_partitioned/state=CA"
).withColumn("file_name", F.element_at(F.split(F.input_file_name(), "/"), -1)).display()

In [None]:
# Optimizing the parquet files
results = spark.sql("OPTIMIZE delta.`/tmp/delta/claims_partitioned` ZORDER BY (claim_date)")

# Inspecting the delta log entry written by OPTIMIZE (commit version 1)
with open("/dbfs/tmp/delta/claims_partitioned/_delta_log/00000000000000000001.json", "r") as f:
    for line in f:
        pprint(json.loads(line))

In [None]:
# File structure
%ls /dbfs/tmp/delta/claims_partitioned/state=CA

In [None]:
# File inspection after optimizing
spark.read.format("delta").load(
    "dbfs:/tmp/delta/claims_partitioned/state=CA"
).withColumn("file_name", F.element_at(F.split(F.input_file_name(), "/"), -1)).display()

**Why Z-Ordering ordered files by `claim_date`**
- Z-Ordering sorts the data within each partition by the specified columns, in this case `claim_date`.
- This ensures that data with similar `claim_date` values ends up together in fewer files.

**Benefit of Ordering**
- When a query filters on `claim_date`, Spark reads only the small subset of files that contain the relevant dates.
- Spark uses **file-level statistics (min/max values)** for `claim_date` to skip files that are outside the filter range. You can see these statistics in the `*.json` files of the `_delta_log`.
- Example:
    - id: 0 → `claim_date`: 2020-01-01 → file: `part-00000`
    - id: 5 → `claim_date`: 2020-01-06 → same file: `part-00000`
- Thus, file `part-00000` holds a range of sorted dates, enabling **data skipping** and faster reads.
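
As a rough illustration of where those statistics live, the sketch below parses a single `add` action of the kind found in a `_delta_log/*.json` file. The sample line is hypothetical (path, size, and values are invented to match the `CA` partition of this tutorial), and `extract_stats` is an illustrative helper, not part of any Delta API.

```python
import json

# Hypothetical `add` action, shaped like one line of a `_delta_log/*.json` file.
# The path, size, and statistics values are invented for illustration.
sample_line = json.dumps({
    "add": {
        "path": "state=CA/part-00000.snappy.parquet",
        "partitionValues": {"state": "CA"},
        "size": 1234,
        "dataChange": True,
        "stats": json.dumps({
            "numRecords": 20,
            "minValues": {"claim_id": 0, "claim_date": "2020-01-01"},
            "maxValues": {"claim_id": 95, "claim_date": "2020-01-06"},
            "nullCount": {"claim_id": 0, "claim_date": 0},
        }),
    }
})


def extract_stats(log_line: str, column: str):
    """Return (path, min, max) for `column`, or None if the line has no add-stats."""
    action = json.loads(log_line)
    add = action.get("add")
    if add is None or not add.get("stats"):
        return None
    # The stats field is itself a JSON string nested inside the action.
    stats = json.loads(add["stats"])
    return (
        add["path"],
        stats.get("minValues", {}).get(column),
        stats.get("maxValues", {}).get(column),
    )


print(extract_stats(sample_line, "claim_date"))
```

The same helper could be applied to each line read in the log-inspection cells above to list the `claim_date` range covered by every file.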

The `OPTIMIZE` command with Z-Ordering coalesces small files (e.g., many part files) into larger, fewer files and reorders the data by the Z-Order columns.

**So why are files combined?**
- Spark and Delta Lake aim to reduce the small file problem, which hurts performance because every extra file adds metadata and file-open overhead.
- When `OPTIMIZE` runs, it rewrites many small files into fewer, larger files (typically 1 GB) and physically sorts the rows within those files using a space-filling curve (like Z-order).
- If two small files both contain rows for similar `claim_date` values, they are merged into one and sorted.
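
The space-filling curve mentioned above can be sketched with the textbook Z-order (Morton) key, which interleaves the bits of two integer columns. This is a generic illustration under the assumption of two small non-negative integer columns; it is not taken from Delta Lake's `OPTIMIZE` implementation, and `interleave_bits` is a hypothetical helper. With a single Z-Order column, as in this tutorial, the ordering reduces to a plain sort on that column.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return key


# Toy 4x4 grid of (x, y) points, ordered by their Z-order key.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

# Split into "files" of four rows each and report the min/max of both columns.
for start in range(0, len(z_sorted), 4):
    chunk = z_sorted[start:start + 4]
    xs = [p[0] for p in chunk]
    ys = [p[1] for p in chunk]
    print(chunk, "x range:", (min(xs), max(xs)), "y range:", (min(ys), max(ys)))
```

Each four-row chunk ends up with a narrow range on both columns, so min/max statistics can skip files for a filter on either one.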

**Benefits of combining**
- Improves query performance through better data skipping.
- Reduces file system overhead and metadata load.
- Enhances parallel read performance by creating more evenly sized files.

## 4. Liquid Clustering
---
Next, we load the same dataset into a non-partitioned Delta table and enable Liquid Clustering with clustering on `state` and `claim_date`.

**Explanation**
- Liquid Clustering groups rows by the clustering columns inside the data files instead of splitting them into fixed directories.
- This makes it more flexible for high-cardinality or skewed columns.
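
To see why skew matters, the toy model below compares file sizes under one-file-per-value partitioning with a simple sort-and-split layout. It assumes a made-up skewed distribution and a made-up target of 25 rows per file, and it does not model the actual Liquid Clustering algorithm; `partitioned_file_sizes` and `clustered_file_sizes` are hypothetical helpers.

```python
from collections import Counter

# Hypothetical skewed key distribution: one hot value and several rare ones.
rows = ["CA"] * 90 + ["TX"] * 4 + ["NY"] * 3 + ["FL"] * 2 + ["WA"] * 1
TARGET_ROWS_PER_FILE = 25  # assumed target size for this toy example


def partitioned_file_sizes(keys):
    """One directory (modeled here as one file) per distinct key value."""
    return dict(Counter(keys))


def clustered_file_sizes(keys, target):
    """Sort by key, then cut into files of about `target` rows regardless of key."""
    ordered = sorted(keys)
    return [len(ordered[i:i + target]) for i in range(0, len(ordered), target)]


print(partitioned_file_sizes(rows))
print(clustered_file_sizes(rows, TARGET_ROWS_PER_FILE))
```

Partitioning yields one large file and several tiny ones, while the sort-and-split layout keeps files evenly sized and still groups equal keys next to each other.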

**File Structure Example**:
```
/tmp/delta/claims_liquid/
├── part-0000.snappy.parquet
├── part-0001.snappy.parquet
...
```

In [None]:
# Removing the old data for new tutorial runs
%rm -rf /dbfs/tmp/delta/claims_liquid/

In [None]:
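claims_df.write.format("delta") \
    .mode("overwrite") \
    .save("/tmp/delta/claims_liquid")

# CLUSTER BY only records the clustering columns; it does not rewrite existing files.
spark.sql("ALTER TABLE delta.`/tmp/delta/claims_liquid` CLUSTER BY (state, claim_date)")

# OPTIMIZE is what rewrites the existing data according to the clustering columns.
spark.sql("OPTIMIZE delta.`/tmp/delta/claims_liquid`")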

In [None]:
# No partitioned structure
%ls /dbfs/tmp/delta/claims_liquid

In [None]:
spark.read.format("delta").load("dbfs:/tmp/delta/claims_liquid/").withColumn("file_name", F.element_at(F.split(F.input_file_name(), "/"), -1)).display()

In [None]:
%ls /dbfs/tmp/delta/claims_liquid/_delta_log/

In [None]:
# Inspecting the last delta log entry of the liquid-clustered table
import glob
log_files = sorted(glob.glob("/dbfs/tmp/delta/claims_liquid/_delta_log/*.json"))
with open(log_files[-1], "r") as f:
    for line in f:
        pprint(json.loads(line))

## 5. Performance Comparison
---
We time the same Spark queries against both tables with `%timeit` to see how much each layout benefits from data skipping.

**Explanation**
- Try filtering by `state='CA' AND claim_date='2020-01-06'`
- Compare scan statistics from both tables

In [None]:
%timeit spark.read.format("delta").load("/tmp/delta/claims_partitioned").filter("state = 'CA' AND claim_date = '2020-01-06'").show()

In [None]:
%timeit spark.read.format("delta").load("/tmp/delta/claims_liquid").filter("state = 'CA' AND claim_date = '2020-01-06'").show()

In [None]:
# input_file_name() is not meaningful after an aggregation, so the file_name column is dropped here
%timeit spark.read.format("delta").load("dbfs:/tmp/delta/claims_partitioned/") \
    .groupBy("state", "claim_date") \
    .agg(F.count("claim_id").alias("total_claims")) \
    .show()

In [None]:
# input_file_name() is not meaningful after an aggregation, so the file_name column is dropped here
%timeit spark.read.format("delta").load("dbfs:/tmp/delta/claims_liquid/") \
    .groupBy("state", "claim_date") \
    .agg(F.count("claim_id").alias("total_claims")) \
    .show()

## 6. Conclusion
---
Use Partitioning + Z-Ordering for low-cardinality, evenly distributed columns. Use Liquid Clustering for high-cardinality or skewed distributions.

**Explanation**
- Partitioning is directory-based and static.
- Liquid Clustering is more dynamic and easier to maintain.
- Choose based on data distribution and access patterns.

**Limitations of liquid clustering**
- Statistics are collected only for the first 32 columns of the Delta table by default; clustering on a column outside this range will not work.
- You can specify at most 4 clustering columns.
