### Delta Lake (Lakehouse Performance Optimization, Cost Saving & Best Practices)

# Delta Lake OPTIMIZE: Internal Working (Deep Dive)

## What problem OPTIMIZE solves

Delta Lake tables often suffer from the **small files problem** due to:

* Frequent `INSERT`s
* `MERGE`, `UPDATE`, `DELETE` operations
* Streaming writes

Small files cause:

* Excessive metadata reads
* Slow query planning
* Inefficient disk I/O

👉 **`OPTIMIZE` compacts many small files into fewer large files.**

---

## High-level definition

```sql
OPTIMIZE table_name;
```

> **OPTIMIZE rewrites data files by combining small Parquet files into fewer large Parquet files without changing table data or schema.**

---

## Internal Working (Step-by-Step)

### Step 1: Read Delta Transaction Log (`_delta_log`)

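OPTIMIZE starts by reconstructing the current table snapshot from the transaction log (the JSON commit files, plus Parquet checkpoints when present):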

```
_delta_log/*.json
```

It identifies:

* Active data files
* File sizes
* Partition information

❗ Files already marked as removed in the log are ignored.
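The log replay in this step can be pictured with a toy model. The sketch below is plain Python, not Delta's implementation: `commits` is a hypothetical in-memory stand-in for the ordered `_delta_log` entries, and only the `add` and `remove` actions are modeled.

```python
def active_files(commits):
    """Replay add/remove actions in commit order and return the live files.

    `commits` is a hypothetical stand-in for the ordered _delta_log entries:
    a list of commits, each a list of single-key action dicts.
    """
    live = {}  # path -> metadata of files in the current snapshot
    for actions in commits:
        for action in actions:
            if "add" in action:
                info = action["add"]
                live[info["path"]] = info
            elif "remove" in action:
                # A removed file leaves the snapshot but stays on storage.
                live.pop(action["remove"]["path"], None)
    return live


commits = [
    [{"add": {"path": "region=North/file1", "size": 10, "partition": "North"}}],
    [{"add": {"path": "region=North/file2", "size": 20, "partition": "North"}}],
    [
        {"remove": {"path": "region=North/file1"}},
        {"add": {"path": "region=North/file1b", "size": 9, "partition": "North"}},
    ],
]

snapshot = active_files(commits)
print(sorted(snapshot))  # ['region=North/file1b', 'region=North/file2']
```

Only the files left in `snapshot` are considered for compaction; `file1` was removed by the third commit, so it is skipped.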

---

### Step 2: Select candidate files

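Delta selects **small files**, meaning files below a size threshold (about 1 GB with Delta's default OPTIMIZE settings; 128 MB is the traditional Parquet/Spark target):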

* Only within the **same partition**
* Never mixes files across partitions

Example:<br>
region=North/file1 + file2 → region=North/file_new ✅<br>
region=North/file1 + region=South/file3 → never combined into one file ❌ (files from different partitions are never merged together)<br>


```
partition = 'North'
  ├── file1 (10MB)
  ├── file2 (20MB)
  ├── file3 (15MB)
```
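Candidate selection can be sketched as a group-by on the partition value. This is an illustrative model only: the `min_file_size` threshold is a parameter (set to 1 GB here as an assumption), the file dictionaries are hypothetical, and dropping partitions that hold a single small file is a simplification.

```python
from collections import defaultdict

MB = 1024 ** 2
ONE_GB = 1024 ** 3  # assumed threshold; the real value is configurable


def select_candidates(files, min_file_size=ONE_GB):
    """Group files smaller than `min_file_size` by partition value."""
    by_partition = defaultdict(list)
    for f in files:
        if f["size"] < min_file_size:
            by_partition[f["partition"]].append(f["path"])
    # Simplification: a partition with one small file has nothing to merge with.
    return {p: paths for p, paths in by_partition.items() if len(paths) > 1}


files = [
    {"path": "region=North/file1", "partition": "North", "size": 10 * MB},
    {"path": "region=North/file2", "partition": "North", "size": 20 * MB},
    {"path": "region=North/file3", "partition": "North", "size": 15 * MB},
    {"path": "region=South/file4", "partition": "South", "size": 12 * MB},
]

candidates = select_candidates(files)
print(candidates)
# {'North': ['region=North/file1', 'region=North/file2', 'region=North/file3']}
```

Because grouping happens before any merging, a file from `region=South` can never end up in the same output file as one from `region=North`.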

---

### Step 3: Spark reads selected files

Spark:

* Reads the selected small files into memory
* Applies **no transformations**
* Preserves rows exactly

⚠ No aggregation
⚠ No filtering
⚠ No deduplication

---

### Step 4: Write new large files

Spark writes:

* Fewer Parquet files
* Files close to the target size (about 1 GB by default for Delta OPTIMIZE; 128 MB is the traditional Parquet/Spark target)

Example:

```
Before OPTIMIZE:
  10 files × 12MB

After OPTIMIZE:
  1 file × 120MB
```
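The before/after numbers above follow from simple bin packing. The sketch below uses a greedy strategy as an assumption (the engine's actual packing may differ) and a 128 MB target only to match the example sizes; `pack_into_bins` is a hypothetical helper.

```python
def pack_into_bins(sizes_mb, target_mb):
    """Greedy packing: start a new bin when the next file would exceed the target."""
    bins, current, current_size = [], [], 0
    for size in sorted(sizes_mb):
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


# Ten 12 MB files fit into a single 120 MB output file.
print(len(pack_into_bins([12] * 10, target_mb=128)))  # 1

# Thirty 12 MB files need three output files of 120 MB each.
print(len(pack_into_bins([12] * 30, target_mb=128)))  # 3
```

Each bin corresponds to one rewritten output file, so the number of bins is the file count after compaction within that partition.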

---

### Step 5: Atomic Delta commit

Delta performs an **atomic commit**:

* `ADD` actions → new optimized files
* `REMOVE` actions → old small files

Results:

* A **new table version**
* No partial state visible to readers

✔ ACID guarantees preserved
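A compaction commit can be pictured as one log entry that carries the `add` and `remove` actions together. The sketch below builds simplified JSON lines; the field names follow the Delta transaction protocol, but the record is trimmed (real actions carry more fields, such as statistics), and `build_optimize_commit` is a hypothetical helper, not a Delta API.

```python
import json


def build_optimize_commit(old_paths, new_path, new_size, partition_values, timestamp_ms):
    """Return the JSON lines of one commit that swaps old files for a new one."""
    actions = [
        {
            "add": {
                "path": new_path,
                "partitionValues": partition_values,
                "size": new_size,
                "modificationTime": timestamp_ms,
                # Compaction rearranges data but does not change table contents.
                "dataChange": False,
            }
        }
    ]
    for path in old_paths:
        actions.append(
            {
                "remove": {
                    "path": path,
                    "deletionTimestamp": timestamp_ms,
                    "dataChange": False,
                }
            }
        )
    return [json.dumps(action) for action in actions]


lines = build_optimize_commit(
    old_paths=["region=North/file1", "region=North/file2"],
    new_path="region=North/file_new",
    new_size=30 * 1024 ** 2,
    partition_values={"region": "North"},
    timestamp_ms=1_700_000_000_000,
)
for line in lines:
    print(line)
```

Because all three actions land in a single commit file, a reader sees either the version before it (two small files) or the version after it (one new file), never a mixture.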

---

### Step 6: Logical deletion of old files

Old small files:

* Are marked as **removed** in the transaction log
* Still exist physically in storage
* Are invisible to queries

Physical deletion happens only via:

```sql
VACUUM table_name;
```

---

## OPTIMIZE with Z-ORDER

```sql
OPTIMIZE table_name ZORDER BY (customer_id, region);
```

Additional behavior:

* Rows are reordered using a **space-filling curve**
* Related column values are colocated
* Improves **data skipping** for selective queries

⚠ Z-ORDER is CPU and I/O intensive
⚠ Rewrites **all files in scope**
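The space-filling curve behind Z-ORDER can be illustrated with bit interleaving (a Morton code). This is a minimal sketch for two small non-negative integers; real engines first normalize column values of any type into comparable integers, and that step is skipped here. `z_value` is a hypothetical helper name.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y (x takes the even positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sort a 4 x 4 grid of (x, y) points along the Z-curve.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Points that are close in both dimensions get nearby Z-values, so writing rows in this order keeps each file's min/max range narrow on both columns at once.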

---

## What OPTIMIZE does NOT do

| Myth               | Reality        |
| ------------------ | -------------- |
| Deletes files      | ❌ VACUUM does |
| Changes data       | ❌             |
| Changes schema     | ❌             |
| Removes duplicates | ❌             |
| Repartitions table | ❌             |

---

## Why OPTIMIZE is safe during reads

Delta Lake ensures:

* Readers see **either old or new snapshot**
* Never partial or corrupted data

This is guaranteed by:

* Snapshot isolation
* Versioned transaction log

---

## Performance impact

**Before OPTIMIZE**

```
Query → 1000 small files → slow planning + I/O
```

**After OPTIMIZE**

```
Query → 20 large files → faster scan + pruning
```

---

## Production best practices

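Optimize only hot data (the `WHERE` clause of `OPTIMIZE` accepts only partition-column predicates, so `date` is assumed to be a partition column here):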

```sql
OPTIMIZE table_name
WHERE date >= current_date() - 7;
```

Benefits:

* Avoids rewriting cold partitions
* Reduces compute cost
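In a scheduled job, the statement is often assembled from parameters. The helper below is a hypothetical sketch (the name `build_optimize_sql` and its arguments are assumptions, not part of any Delta API); it only builds the SQL string and does no validation or escaping.

```python
def build_optimize_sql(table, partition_predicate=None, zorder_cols=()):
    """Assemble an OPTIMIZE statement with optional WHERE and ZORDER BY clauses."""
    parts = [f"OPTIMIZE {table}"]
    if partition_predicate:
        parts.append(f"WHERE {partition_predicate}")
    if zorder_cols:
        parts.append(f"ZORDER BY ({', '.join(zorder_cols)})")
    return " ".join(parts)


print(build_optimize_sql("table_name", "date >= current_date() - 7"))
# OPTIMIZE table_name WHERE date >= current_date() - 7
```

In a notebook, the resulting string could be passed to `spark.sql(...)` from a daily job so that only the last seven days of partitions are rewritten.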

---

## Interview-ready summary

> **OPTIMIZE compacts small Delta files by reading active files from the transaction log, rewriting them into larger files per partition, and committing the changes atomically while logically removing old files.**


In [0]:
%sql
use lakehousecat.deltadb

In [0]:
%sql
CREATE OR REPLACE TABLE tblsales
(
  sales_id INT,
  product_id INT,
  region STRING,
  sales_amount DOUBLE,
  sales_date DATE
)
USING DELTA

In [0]:
%sql
select * from tblsales

In [0]:
%sql
INSERT INTO tblsales VALUES
  (1, 101, 'North', 1000.50, '2025-10-16'),
  (2, 102, 'South', 500.75, '2025-10-16'),
  (3, 103, 'East', 700.20, '2025-10-16'),
  (4, 104, 'West', 1200.00, '2025-10-16');

INSERT INTO tblsales VALUES
  (5, 101, 'North', 800.00, '2025-10-17'),
  (6, 102, 'South', 450.00, '2025-10-17'),
  (7, 103, 'East', 600.00, '2025-10-17'),
  (8, 104, 'West', 1100.00, '2025-10-17');


In [0]:
%sql
select * from tblsales;

In [0]:
%sql
-- Check fragmentation (numFiles and sizeInBytes)
DESCRIBE DETAIL tblsales;

-- Each INSERT is one Delta transaction: it produces one commit and writes one or more data files.
-- How many files a transaction writes depends on Spark execution and on data size:
--   small inserts -> often 1 file
--   large inserts -> many files
-- Here the data is small, nothing is repartitioned, and default settings apply,
-- 👉 so each of the two INSERTs produced 1 data file: total files = 2.

Two target file sizes are commonly quoted (128 MB and ~1 GB). Both numbers exist, but they mean different things.

128 MB → traditional Parquet / Spark best-practice target

~1 GB → Delta Lake OPTIMIZE default target file size

So for Delta OPTIMIZE, the correct default is ~1 GB, not 128 MB.

In [0]:
%sql
--Optimize the table
--This performs file compaction:
--Combines many small Parquet files into fewer large files (around 1 GB default).
--Improves read performance and reduces metadata overhead.
OPTIMIZE tblsales;

In [0]:
%sql
-- Verify compaction
-- After optimization, run:
DESCRIBE DETAIL tblsales;

#### 2. ZORDER
- ZORDER is an optional feature used with OPTIMIZE to colocate related data physically in the same set of files by sorting.
- Reduces file scan for queries filtering on ZORDER columns.
- Works best for columns used frequently in WHERE clauses.

#### EXAMPLE USE CASE:
- Periodically optimize large Delta tables with frequent writes/updates.
- Use ZORDER on high-selectivity/filtering columns to improve read performance.


# Delta Lake Z-ORDER: Explained with One Concrete Example

## Scenario

We have a **sales Delta table**:

```sql
CREATE TABLE tblsales (
  order_id INT,
  customer_id INT,
  region STRING,
  amount DOUBLE
)
USING DELTA
PARTITIONED BY (region);
```

The table is frequently queried by **customer_id**.

---

## Data (inside one partition: `region = 'North'`)

| order_id | customer_id | amount |
| -------- | ----------- | ------ |
| 1        | 101         | 500    |
| 2        | 305         | 200    |
| 3        | 102         | 800    |
| 4        | 501         | 300    |
| 5        | 103         | 900    |
| 6        | 302         | 100    |

Rows are written in **random order**.

---

## Without Z-ORDER (default layout)

Files after normal OPTIMIZE:

```
region=North/
  ├── file1 → customer_id [101, 501]
  ├── file2 → customer_id [102, 305]
  ├── file3 → customer_id [103, 302]
```

### Query

```sql
SELECT * FROM tblsales
WHERE region = 'North' AND customer_id = 101;
```

### What happens

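* Each file spans a wide, overlapping `customer_id` range (101–501, 102–305, 103–302)
* Min/max statistics can rule out a file only when the value lies outside its range, so most lookups (for example `customer_id = 302`) must scan **all 3 files**
* `customer_id = 101` happens to be the smallest value in the partition, which makes it a lucky case here; in a real table with many rows per file, it could sit anywhere

❌ Poor data skipping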

---

## OPTIMIZE with Z-ORDER

```sql
OPTIMIZE tblsales
ZORDER BY (customer_id);
```

---

## How Z-ORDER rearranges data

Z-ORDER:

* Computes a **Z-value** for `customer_id`
* Sorts rows by this Z-value
* Writes rows with similar `customer_id` **together**

Sorted order (conceptual):

```
101 → 102 → 103 → 302 → 305 → 501
```

---

## After Z-ORDER (new file layout)

```
region=North/
  ├── file1 → customer_id [101, 102, 103]
  ├── file2 → customer_id [302, 305]
  ├── file3 → customer_id [501]
```

Each file now covers a **tight min/max range**.

---

## Same query after Z-ORDER

```sql
SELECT * FROM tblsales
WHERE region = 'North' AND customer_id = 101;
```

### File skipping logic

| File  | customer_id min–max | Read?   |
| ----- | ------------------- | ------- |
| file1 | 101–103             | ✅ YES  |
| file2 | 302–305             | ❌ SKIP |
| file3 | 501–501             | ❌ SKIP |

✅ **Only 1 file scanned**
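The skipping decision in the table is a range check per file. The sketch below models it with plain dictionaries of `(min, max)` pairs taken from the two layouts in this example; `files_to_read` is a hypothetical helper, and real engines evaluate richer statistics and predicates.

```python
def files_to_read(file_stats, value):
    """Keep only the files whose [min, max] range could contain `value`."""
    return [name for name, (lo, hi) in file_stats.items() if lo <= value <= hi]


# customer_id min/max per file, before and after Z-ORDER
before = {"file1": (101, 501), "file2": (102, 305), "file3": (103, 302)}
after = {"file1": (101, 103), "file2": (302, 305), "file3": (501, 501)}

print(files_to_read(after, 101))   # ['file1']
print(files_to_read(before, 302))  # ['file1', 'file2', 'file3']
print(files_to_read(after, 302))   # ['file2']
```

The tighter the ranges, the more files fail the check and are skipped without being opened.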

---

## Why performance improves

* File-level statistics become precise
* Spark skips irrelevant files
* Less I/O, faster queries

---

## What Z-ORDER does NOT do

* ❌ No index creation
* ❌ No data filtering
* ❌ No deduplication
* ❌ No cross-partition mixing

---

## One-line interview takeaway

> **Z-ORDER physically colocates similar column values inside Delta files so that selective queries scan fewer files using data skipping.**


# Delta Lake OPTIMIZE & Z-ORDER: Use Case Explained

This document explains **why and when** `OPTIMIZE` and `Z-ORDER` are used in real-world Delta Lake systems, using a **practical production scenario**.

---

## Real-world scenario: E-commerce Orders Table

### Table

```sql
orders_delta
```

### Data ingestion pattern

* Streaming job inserts new orders every **5 minutes**
* CDC job performs `MERGE` every **hour**
* Millions of rows added daily

This is a **write-heavy Delta table**.

---

## Problem 1: Why OPTIMIZE is needed

### What happens without OPTIMIZE

Every write creates small Parquet files:

```
Day 1  → 2,000 files (5–20 MB)
Day 2  → 4,000 files
Day 7  → 15,000+ files
```

### Query example

```sql
SELECT *
FROM orders_delta
WHERE order_date = '2026-01-28';
```

Even though only one day is queried:

* Spark must open **thousands of files**
* Query planning and I/O become slow

✅ Data is correct
❌ Performance is poor

---

## OPTIMIZE use case

### Business requirement

> Queries on **recent orders** must be fast.

### Solution

Run `OPTIMIZE` **periodically** (for example, once per day):

```sql
OPTIMIZE orders_delta
WHERE order_date >= current_date() - 7;
```

### Result

* Small files are compacted into large files (~1 GB)
* File count is drastically reduced
* Queries become significantly faster

✅ This is the **OPTIMIZE use case**

---

## Problem 2: Why Z-ORDER is needed

Even after OPTIMIZE, data inside files is **not ordered**.

Example layout:

```
region=US/
  ├── file1 → customer_id [1 … 1,000,000]
  ├── file2 → customer_id [1 … 1,000,000]
```

Each file contains a wide range of customers.

---

## Query pattern that causes slowness

```sql
SELECT *
FROM orders_delta
WHERE customer_id = 987654;
```

Spark still:

* Scans many large files
* Because the customer's data is spread everywhere

✅ File count is low
❌ File skipping is ineffective

---

## Z-ORDER use case

### Business requirement

> Customer-specific queries must be fast.

### Solution

Apply Z-ORDER on a **high-selectivity column**:

```sql
OPTIMIZE orders_delta
WHERE order_date >= current_date() - 7
ZORDER BY (customer_id);
```

### Result

* Rows for the same `customer_id` are physically colocated
* File-level min/max statistics become tighter
* Only a few files are scanned per query

✅ This is the **Z-ORDER use case**

---

## Why OPTIMIZE and Z-ORDER are used together

They solve **different but complementary problems**:

| Problem                | Feature  |
| ---------------------- | -------- |
| Too many small files   | OPTIMIZE |
| Too many files scanned | Z-ORDER  |

Combined usage improves read performance in two ways:

* Fewer files to open and plan
* Fewer files scanned per selective query

---

## Simple mental model

* **OPTIMIZE** → fixes *how many files exist*
* **Z-ORDER** → fixes *which files are read*

---

## Interview-ready summary

> **In large Delta tables with frequent writes, OPTIMIZE is run periodically to compact small files, and Z-ORDER is applied on high-selectivity columns to ensure selective queries scan fewer files and perform faster.**


In [0]:
%sql

-- Step 1: Create the Delta table
use lakehousecat.deltadb;
CREATE OR REPLACE TABLE customer_txn (
    txn_id INT,
    customer_id INT,
    region STRING,
    txn_amount DOUBLE,
    txn_type STRING,
    transaction_date DATE
)
USING DELTA;

In [0]:
%sql
describe history customer_txn

In [0]:
%sql
-- Step 2: Insert multiple small batches
-- Each insert writes a few small Parquet files.
-- Batch 1
INSERT INTO customer_txn VALUES
 (1, 1001, 'North', 250.00, 'Online', '2025-10-01'),
 (2, 1002, 'South', 400.00, 'Offline', '2025-10-02'),
 (3, 1003, 'West', 600.00, 'Online', '2025-10-03');

-- Batch 2
INSERT INTO customer_txn VALUES
 (4, 1001, 'North', 300.00, 'Offline', '2025-10-01'),
 (5, 1004, 'East', 750.00, 'Online', '2025-10-02'),
 (6, 1005, 'South', 180.00, 'Online', '2025-10-03');

-- Batch 3
INSERT INTO customer_txn VALUES
 (7, 1001, 'North', 270.00, 'Online', '2025-10-01'),
 (8, 1003, 'West', 500.00, 'Offline', '2025-10-02'),
 (9, 1002, 'South', 900.00, 'Online', '2025-10-03');



# OPTIMIZE vs Z-ORDER: Simple File-Level Explanation

This document explains **what is happening** in the example step by step, in **very simple terms**, focusing only on **files and folders**.

---

## Table setup

```sql
customer_txn
PARTITIONED BY (region)
```

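This means **each region is a folder**. (The `customer_txn` table created in the cell above is not partitioned; this walkthrough assumes a variant of it that is partitioned by `region`.)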

---

## STEP 1: Before OPTIMIZE

```
region=North
    part-0
    part-1
    part-2
region=South
    part-0
    part-1
    part-2
region=West
    part-0
    part-1
region=East
    part-0
```

### What this means

* Each `region` folder has **multiple small files**
* Rows inside files are in **random order**

### Query example

```sql
SELECT * FROM customer_txn WHERE region = 'North';
```

Spark behavior:

* Goes only to `region=North/` folder (partition pruning)
* Reads **all files inside that folder**

---

## STEP 2: After `OPTIMIZE customer_txn`

```
region=North
    part-3   ← new large file
region=South
    part-3
region=West
    part-2
region=East
    part-1
```

(Old files are logically removed)

### What OPTIMIZE does

* Reads all small files **within the same region**
* Combines them into **fewer, larger files**
* Creates a **new Delta version**

### What OPTIMIZE does NOT do

* ❌ Does NOT sort rows
* ❌ Does NOT mix regions

### Benefit

* Fewer files to read
* Faster queries

---

## STEP 3: OPTIMIZE with Z-ORDER

```sql
OPTIMIZE customer_txn
ZORDER BY (transaction_date);
```

---

## STEP 4: After `OPTIMIZE + ZORDER`

```
region=North
    part-3   ← rows grouped by transaction_date
region=South
    part-3   ← rows grouped by transaction_date
region=West
    part-2   ← rows grouped by transaction_date
region=East
    part-1   ← rows grouped by transaction_date
```

### What Z-ORDER does

* Rewrites files again
* **Arranges rows inside each file**
* Keeps similar `transaction_date` values close together

---

## Why Z-ORDER helps

Query:

```sql
SELECT *
FROM customer_txn
WHERE region = 'North'
  AND transaction_date = '2026-01-02';
```

### Without Z-ORDER

* Entire file must be scanned

### With Z-ORDER

* Spark checks file metadata (min/max date)
* Skips irrelevant data
* Reads much less data

---

## Very simple analogy

* **Partition (region)** → folders
* **Files** → notebooks
* **OPTIMIZE** → combine many notebooks into one
* **Z-ORDER** → sort pages inside the notebook

---

## One-line takeaway (important)

> **OPTIMIZE reduces the number of files per partition, and Z-ORDER arranges rows inside those files so queries scan less data.**


In [0]:
%sql
describe history customer_txn;

In [0]:
%sql
-- Step 3: Inspect fragmentation (numFiles & sizeInBytes)
DESCRIBE DETAIL customer_txn;

In [0]:
%sql
-- Step 4: Run OPTIMIZE with ZORDER and watch the metrics in the output (zOrderStats)
-- This compacts the files and physically orders the data.
OPTIMIZE customer_txn 
ZORDER BY (transaction_date);

In [0]:
%sql
DESCRIBE HISTORY customer_txn

In [0]:
%sql
-- Step 5: Inspect fragmentation again after OPTIMIZE
DESCRIBE DETAIL customer_txn;

OPTIMIZE rewrites Parquet files, so metadata is regenerated per new file; the data content remains the same, but total file size may change slightly due to improved compression and reduced metadata overhead.

#### 3. Partitioning
Partitioning is the practice of physically splitting a table's data into separate **folders** based on the values of a column.<br>
Good partition columns are:<br>
- Low cardinality (few distinct values, such as date, age, city, region, gender)
- Frequently used in filters
- Stable (the partition columns of a table cannot be changed easily once data is written)

# Partitioning in Delta Lake: Detailed Explanation

This document explains **partitioning** step by step: what it is, why it exists, how Spark/Delta use it internally, and how it works together with OPTIMIZE and Z-ORDER.

---

## What is Partitioning? (Plain English)

> **Partitioning means physically splitting table data into folders based on column values.**

* Each partition = **one folder**
* Each folder contains rows for **only one partition value**

---

## Simple Example

```sql
CREATE TABLE customer_txn (
  txn_id INT,
  customer_id INT,
  region STRING,
  transaction_date DATE,
  amount DOUBLE
)
USING DELTA
PARTITIONED BY (region);
```

### Physical layout in storage

```
customer_txn/
  ├── region=North/
  │     ├── part-000.parquet
  │     ├── part-001.parquet
  ├── region=South/
  │     ├── part-000.parquet
  ├── region=West/
  │     ├── part-000.parquet

Each folder contains **only that region's data**.

---

## Why Partitioning Exists (Very Important)

Partitioning exists mainly for **partition pruning**:

> **Avoid reading unnecessary data.**

---

## Query Without Partitioning

```sql
SELECT * FROM customer_txn WHERE region = 'North';
```

* Spark scans the **entire table**
* Filters rows after reading

❌ Slow
❌ Expensive

---

## Query With Partitioning

Same query:

```sql
SELECT * FROM customer_txn WHERE region = 'North';
```

Spark behavior:

* Reads only `region=North/`
* Skips South, West, East folders

✅ Fast
✅ Cheap

This is called **partition pruning**.

---

## How Spark Uses Partitioning Internally

1. Spark analyzes the query
2. Detects filter on partition column
3. Reads table metadata
4. Prunes irrelevant folders
5. Scans only matching partitions
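The five steps above reduce to filtering the file list by partition value before any data is read. The sketch below is a plain-Python model, not Spark's planner: the file entries are hypothetical and mimic how the Delta log records a `partitionValues` map for every data file; only equality filters are handled.

```python
def prune_partitions(files, partition_filter):
    """Return paths whose partition values satisfy every equality filter."""
    return [
        f["path"]
        for f in files
        if all(
            f["partitionValues"].get(col) == val
            for col, val in partition_filter.items()
        )
    ]


files = [
    {"path": "region=North/part-000.parquet", "partitionValues": {"region": "North"}},
    {"path": "region=North/part-001.parquet", "partitionValues": {"region": "North"}},
    {"path": "region=South/part-000.parquet", "partitionValues": {"region": "South"}},
    {"path": "region=West/part-000.parquet", "partitionValues": {"region": "West"}},
]

# WHERE region = 'North'
selected = prune_partitions(files, {"region": "North"})
print(selected)
# ['region=North/part-000.parquet', 'region=North/part-001.parquet']
```

Files in `region=South` and `region=West` are dropped during planning, so they are never opened.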

---

## Partitioning vs Spark Partitions (Common Confusion)

| Concept         | Meaning                    |
| --------------- | -------------------------- |
| Delta partition | Physical folder in storage |
| Spark partition | Parallel execution task    |

They are **not the same**.

---

## Good Partition Columns (Rules of Thumb)

A good partition column:

* Low to medium cardinality
* Frequently used in filters
* Grows naturally over time

### Good examples

* `date`
* `region`
* `country`
* `year`, `month`

### Bad examples

* `customer_id`
* `order_id`
* `uuid`
* `transaction_id`

---

## Why High-Cardinality Partitioning Is Bad

Partitioning by `customer_id` creates:

```
customer_id=1/
customer_id=2/
customer_id=3/
...
```

Problems:

* Millions of folders
* Huge metadata overhead
* Slow query planning
* OPTIMIZE becomes ineffective

❌ Anti-pattern
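A quick calculation shows why compaction cannot rescue an over-partitioned table: files are never merged across partitions, so the largest possible file is bounded by the partition's size. The numbers below (a 100 GB table, evenly distributed data) are illustrative assumptions, and `avg_file_size_mb` is a hypothetical helper.

```python
def avg_file_size_mb(table_size_gb, num_partitions, files_per_partition=1):
    """Average file size when data is spread evenly over partitions."""
    return table_size_gb * 1024 / (num_partitions * files_per_partition)


# 100 GB partitioned by region (4 values): plenty of data per partition.
print(avg_file_size_mb(100, 4))          # 25600.0

# 100 GB partitioned by customer_id (1,000,000 values): about 0.1 MB per
# partition, even with a single fully compacted file in each folder.
print(avg_file_size_mb(100, 1_000_000))
```

With a million partitions, every file stays tiny no matter how often `OPTIMIZE` runs.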

---

## Partitioning + OPTIMIZE

Partitioning decides:

> **Which folder to read**

OPTIMIZE decides:

> **How many files inside the folder**

Example:

```
region=North/
  part-0 (5MB)
  part-1 (7MB)
  part-2 (6MB)
```

After OPTIMIZE:

```
region=North/
  part-3 (18MB)
```

---

## Partitioning + Z-ORDER

Partitioning:

* Prunes folders

Z-ORDER:

* Skips files **inside folders**

Query example:

```sql
WHERE region = 'North'
  AND transaction_date = '2026-01-15'
```

Execution order:

1. Partition pruning → folder selection
2. Z-ORDER → file skipping

---

## Multi-Column Partitioning

```sql
PARTITIONED BY (year, month)
```

Storage layout:

```
year=2026/
  month=01/
  month=02/
```

⚠ Too many partition columns cause deep folder structures
⚠ Over-partitioning hurts performance

---

## When NOT to Use Partitioning

* Small tables
* Columns rarely used in filters
* High-cardinality columns
* Temporary or exploratory data

---

## Interview-Ready Summary

> **Partitioning physically organizes data into folders based on column values, enabling Spark to prune irrelevant data and scan only what is required.**

---

## One-Line Mental Model

* Partitioning → which folders to read
* OPTIMIZE → how many files inside
* Z-ORDER → which files inside


In [0]:
%sql
use lakehousecat.deltadb;
CREATE OR REPLACE TABLE customer_txn_part1 (
    txn_id INT,
    customer_id INT,
    region STRING,
    txn_amount DOUBLE,
    txn_type STRING,
    transaction_date DATE
) 
using delta
partitioned by (transaction_date);
insert into customer_txn_part1 select * from customer_txn;
--or
create or replace table customer_txn_part 
partitioned by (transaction_date) 
as select * from customer_txn;

In [0]:
%sql
explain select * from customer_txn_part1 where transaction_date='2025-10-01';

In [0]:
%python
# Show how partitioned data is laid out on the filesystem (one folder per region value)
spark.sql("select * from customer_txn").write.partitionBy("region").format("delta").save("/Volumes/lakehousecat/deltadb/datalake/cust_txns_partdelta")


# PySpark DataFrame API counterpart of the SQL CTAS above (this one partitions by region instead of transaction_date)
spark.sql("select * from customer_txn").write.partitionBy("region").saveAsTable("customer_txn_part2")

In [0]:
display(spark.sql('SHOW PARTITIONS customer_txn_part'))

In [0]:
display(spark.sql("SHOW PARTITIONS customer_txn_part2"))

In [0]:
%sql
SELECT * 
FROM customer_txn_part
WHERE transaction_date BETWEEN '2025-10-01' AND '2025-10-01'; -- partition pruning: only the transaction_date=2025-10-01 folder is read, so the result returns quickly.

#### 4. VACUUM
*VACUUM* in Delta Lake permanently deletes old, unreferenced data files to free up storage. The default retention period is 168 hours (7 days). These files are left behind by operations such as DELETE, UPDATE, MERGE, or OPTIMIZE and are kept temporarily so that time-travel queries can work.<br>

Before VACUUM<br>
Active + logically deleted Parquet files exist<br>

After VACUUM<br>
Active Parquet files remain, along with any removed files still inside the retention window; older removed files (from UPDATE/MERGE/DELETE/OPTIMIZE) are deleted<br>
The transaction log is kept (VACUUM deletes old data files, not log entries)<br>
Time travel to versions older than the retention period becomes impossible<br>

In [0]:
%sql
VACUUM drugstbl_merge RETAIN 168 HOURS;
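-- To retain less than the 168-hour default, this safety check must be disabled first (risky: it can break time travel and long-running readers):
--SET spark.databricks.delta.retentionDurationCheck.enabled = false;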

# Active vs Deleted Parquet Files (Before VACUUM)

This document explains **what it means when we say**:

> **Before VACUUM: Active + deleted parquet files exist**

in the simplest and clearest way.

---

## Very Important Rule (Read First)

> **In Delta Lake, DELETE does NOT mean the file is removed from storage.**

Delta Lake uses **logical deletion**, not physical deletion.

---

## Step 1: Initial state (no changes yet)

Parquet files in storage:

```
part-0.parquet
part-1.parquet
part-2.parquet
```

All files are:

* Active
* Used by queries

---

## Step 2: A write operation happens

You run any of the following:

* `OPTIMIZE`
* `UPDATE`
* `DELETE`
* `MERGE`

Delta Lake:

1. Creates **new Parquet file(s)**
2. Marks old files as **REMOVED in metadata**

Example:

```
part-3.parquet   ← new file
```

---

## Step 3: What exists AFTER the operation

### Physically on storage (S3 / ADLS / DBFS)

```
part-0.parquet   ← exists
part-1.parquet   ← exists
part-2.parquet   ← exists
part-3.parquet   ← exists
```

### Logically in Delta metadata (`_delta_log`)

* **Active files** (used by queries)

  * `part-3.parquet`

* **Deleted files (logical)**

  * `part-0.parquet`
  * `part-1.parquet`
  * `part-2.parquet`

👉 This state is called:

> **Active + deleted parquet files exist**

---

## What "deleted parquet files" really means

Deleted parquet files:

* Are **NOT visible** to current queries
* Are **NOT counted** in the current table details (for example `numFiles` and `sizeInBytes` in `DESCRIBE DETAIL`)
* **DO still exist physically** on storage

This is called **logical deletion**.

---

## Why Delta keeps deleted files

Delta Lake keeps old files for:

### 1. Time travel

```sql
SELECT * FROM table_name VERSION AS OF 10;
```

### 2. Safety for running queries

Queries started earlier can still read old files.

### 3. ACID guarantees

No partial or corrupted reads.

---

## Step 4: What VACUUM does

```sql
VACUUM table_name;
```

VACUUM:

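* Permanently deletes **logically removed files** that are older than the retention period (7 days by default)
* Keeps active files and any removed files still inside the retention window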
* Frees storage space

### After VACUUM

```
part-3.parquet   ← exists
part-0.parquet   ← deleted ❌
part-1.parquet   ← deleted ❌
part-2.parquet   ← deleted ❌
```
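The retention rule can be sketched as a timestamp comparison. This is a simplified model under stated assumptions: `removed_files` is a hypothetical map from path to the time the file was logically removed, and the real command also has to handle files in the table directory that the log never referenced.

```python
from datetime import datetime, timedelta


def vacuum_candidates(removed_files, now, retention_hours=168):
    """Return logically removed files that are old enough to delete physically."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(
        path for path, removed_at in removed_files.items() if removed_at < cutoff
    )


now = datetime(2025, 10, 20, 12, 0)
removed = {
    "part-0.parquet": datetime(2025, 10, 1, 9, 0),   # removed 19 days ago
    "part-1.parquet": datetime(2025, 10, 12, 9, 0),  # removed 8 days ago
    "part-2.parquet": datetime(2025, 10, 19, 9, 0),  # removed yesterday
}

print(vacuum_candidates(removed, now))
# ['part-0.parquet', 'part-1.parquet']
```

With the default 168 hours, `part-2.parquet` survives because it was removed only a day ago, which is what keeps recent time travel and in-flight queries working.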

---

## Key difference to remember

| Term         | Meaning                                 |
| ------------ | --------------------------------------- |
| Active file  | Used by current table version           |
| Deleted file | Removed from metadata, still on storage |
| OPTIMIZE     | Logical delete + rewrite                |
| VACUUM       | Physical delete                         |

---

## One-line summary (Interview-ready)

> **Before VACUUM, Delta tables contain both active Parquet files used by the current version and logically deleted Parquet files that still exist on storage for time travel and consistency.**

---

## Ultra-short mental model

* OPTIMIZE / DELETE → logical delete
* Files still exist
* VACUUM → physical delete


#### 5. Liquid Clustering
*Liquid Clustering* is a next-generation data layout feature for Delta tables. It manages the physical organization of data on disk to minimize scan cost for frequently queried columns, replacing both Hive-style partitioning and manual `ZORDER BY`.<br>
**Does liquid clustering create literal partitions?**<br>
No. Liquid clustering does not create physical partition subdirectories.<br>
**Do we still get the benefits of partitioning?**<br>
Yes. Queries that filter on the clustering columns still skip irrelevant data, but through file-level statistics rather than folder pruning.<br>

**Partition vs Liquid Clustering**
| Use case                       | Recommendation         |
| ------------------------------ | ---------------------- |
| High-cardinality columns       | Liquid clustering    |
| Frequently changing filters    | Liquid clustering    |
| Streaming / incremental loads  | Liquid clustering    |
| Static, low-cardinality (date) | Partition OR Liquid |
| Legacy Hive-style tables       | Partition           |


**Typical Use Cases**
- Large tables with frequent inserts, updates, and deletes.
- Query filtering on specific columns like customer_id, region, order_date.

Running OPTIMIZE at table creation time only optimizes the data that exists at that moment. Any new inserts/appends after that are NOT optimized.

Why this happens (conceptually):

Delta Lake stores data as immutable files.

* CREATE TABLE → creates metadata
* Initial INSERT / CTAS → creates data files
* OPTIMIZE → rewrites those existing files only
* Later INSERT / STREAMING WRITE → new files are added
* Those new files remain unoptimized

| Feature                        | Manual or Automatic?             | Why                                               |
| ------------------------------ | -------------------------------- | ------------------------------------------------- |
| **Z-ORDER**                    | ✅ **Manual**                    | Runs only when you execute `OPTIMIZE … ZORDER BY` |
| **OPTIMIZE (file compaction)** | ⚠️ **Manual by default**         | Rewrites files only when triggered                |
| **PARTITION BY**               | ❌ **NOT manual after creation** | Applied **at write time**, not rerun              |
| **Liquid Clustering**          | ✅ **Largely automatic**         | Clusters data on eligible new writes; `OPTIMIZE` reclusters the rest incrementally |


# Liquid Clustering in Delta Lake: Detailed Explanation (with Example)

This document explains **Liquid Clustering** clearly and completely: what it is, why it exists, how it works internally, how it differs from partitioning and Z-ORDER, and when to use it, with a concrete example.

---

## What is Liquid Clustering? (Plain English)

> **Liquid Clustering is a dynamic data layout technique in Delta Lake that automatically organizes data based on frequently filtered columns, without fixed partitions.**

Key idea:

* No static folders like partitioning
* No manual reorganization like Z-ORDER
* Delta automatically **maintains clustering over time**

---

## Why Liquid Clustering Exists

Traditional approaches have problems:

### Partitioning problems

* Fixed at table creation
* Over-partitioning causes too many folders
* Under-partitioning causes large scans
* Hard to change later

### Z-ORDER problems

* Manual operation
* Expensive full rewrites
* Clustering degrades with new writes

👉 **Liquid Clustering solves both.**

---

## How Liquid Clustering Works (High Level)

1. You define **clustering columns**
2. Delta tracks clustering metadata
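3. New writes are **clustered on write** where the operation supports it
4. Incremental optimization (`OPTIMIZE`, which a managed service such as Databricks predictive optimization can run for you) maintains the layout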
5. No fixed directory structure

---

## Simple Example

### Create a table with Liquid Clustering

```sql
CREATE TABLE customer_txn (
  txn_id BIGINT,
  customer_id BIGINT,
  region STRING,
  transaction_date DATE,
  amount DOUBLE
)
USING DELTA
CLUSTER BY (customer_id, transaction_date);
```

Here:

* `customer_id` and `transaction_date` are **clustering columns**

---

## How Data Is Stored (Important)

Unlike partitioning:

```
/customer_txn/
  ├── part-0001.parquet
  ├── part-0002.parquet
  ├── part-0003.parquet
```

There are **no `customer_id=...` folders**.

Instead:

* Rows with similar `customer_id` and `transaction_date` are colocated **inside files**
* File-level min/max stats are optimized

---

## Query Example

```sql
SELECT *
FROM customer_txn
WHERE customer_id = 101
  AND transaction_date = '2026-01-15';
```

### What Delta does internally

1. Uses file-level statistics
2. Skips files that don't match
3. Reads only a small subset of files

👉 Similar benefit to Z-ORDER, but **maintained automatically**.

---

## What Happens on New Writes

When new data is inserted:

```sql
INSERT INTO customer_txn VALUES (...);
```

Delta:

* Writes data already clustered
* Avoids layout degradation
* No need to rerun Z-ORDER

---

## Comparison: Partitioning vs Z-ORDER vs Liquid Clustering

| Feature                  | Partitioning | Z-ORDER | Liquid Clustering |
| ------------------------ | ------------ | ------- | ----------------- |
| Folder-based             | Yes          | No      | No                |
| Fixed layout             | Yes          | No      | No                |
| Manual maintenance       | No           | Yes     | No                |
| Handles high cardinality | No           | Yes     | Yes               |
| Degrades with new writes | No           | Yes     | No                |
| Automatic optimization   | No           | No      | Yes               |

---

## Liquid Clustering vs Z-ORDER (Key Difference)

| Aspect           | Z-ORDER            | Liquid Clustering    |
| ---------------- | ------------------ | -------------------- |
| Trigger          | Manual OPTIMIZE    | Automatic            |
| Rewrite cost     | High               | Incremental          |
| Layout stability | Degrades           | Maintained           |
| Best for         | Batch optimization | Continuous workloads |

---

## When to Use Liquid Clustering

Use Liquid Clustering when:

* Table is large
* Queries frequently filter on certain columns
* Columns have **high cardinality**
* Continuous writes / streaming data
* You want minimal operational overhead

---

## When NOT to Use Liquid Clustering

* Small tables
* Rarely queried tables
* Simple partition-based filtering is enough
* Very infrequent writes

---

## How It Works Internally (Simplified)

* Delta tracks clustering statistics
* Uses adaptive file compaction
* Maintains balanced data distribution
* Preserves ACID guarantees

(No user-visible background jobs required)

---

## One-Line Mental Model

> **Partitioning chooses folders, Z-ORDER rearranges files manually, Liquid Clustering continuously maintains data organization automatically.**

---

## Interview-Ready Summary

> **Liquid Clustering is an automatic, adaptive data layout technique in Delta Lake that continuously clusters data based on query patterns, eliminating the need for fixed partitions or manual Z-ORDER operations.**


In [0]:
%sql
use lakehousecat.deltadb

In [0]:
%sql
DROP TABLE IF EXISTS sales_orders_liquid

In [0]:
%sql
CREATE TABLE IF NOT EXISTS sales_orders_liquid
(
  order_id INT,
  customer_id INT,
  region STRING,
  product STRING,
  quantity INT,
  price DOUBLE,
  order_date DATE
)
USING DELTA
CLUSTER BY (customer_id, region);
-- Clustering columns can have high or low cardinality, unlike partition columns, which should be low cardinality.
-- Order the CLUSTER BY columns by your primary filter: list first whichever of customer_id or region you filter on first.

In [0]:
%sql
-- Each insert simulates separate data ingestion.

INSERT INTO sales_orders_liquid VALUES
 (1, 101, 'North', 'Laptop', 2, 65000, '2025-10-01'),
 (2, 102, 'South', 'Headphones', 5, 2500, '2025-10-01'),
 (3, 103, 'West', 'Desk Chair', 3, 4500, '2025-10-02');

INSERT INTO sales_orders_liquid VALUES
 (4, 101, 'North', 'Keyboard', 1, 1200, '2025-10-03'),
 (5, 104, 'East', 'Monitor', 2, 9500, '2025-10-03'),
 (6, 105, 'South', 'Mouse', 4, 700, '2025-10-03');


In [0]:
%sql
SELECT * FROM sales_orders_liquid where customer_id=102;

In [0]:
%sql
DESCRIBE DETAIL sales_orders_liquid

In [0]:
%sql
UPDATE sales_orders_liquid
SET price = price * 1.05
WHERE region = 'North';

In [0]:
%sql
DESCRIBE DETAIL sales_orders_liquid

In [0]:
%sql
-- Look at the operation column for entries showing that optimization and clustering were applied without a manual OPTIMIZE ... ZORDER BY.
DESCRIBE HISTORY sales_orders_liquid;

In [0]:
%sql
DELETE FROM sales_orders_liquid
WHERE region = 'East';

In [0]:
%sql
DESCRIBE HISTORY sales_orders_liquid;

In [0]:
(
    spark.sql("select * from lakehousecat.deltadb.sales_orders_liquid order by region")
    .write.clusterBy("region")
    .format("csv")
    .save("/Volumes/lakehousecat/deltadb/datalake/cust_txns_clustercsv", mode="overwrite")
)

5,104,East,Monitor,2,9500.0,2025-10-03<br>
1,101,North,Laptop,2,68250.0,2025-10-01<br>
4,101,North,Keyboard,1,1260.0,2025-10-03<br>
2,102,South,Headphones,5,2500.0,2025-10-01<br>
6,105,South,Mouse,4,700.0,2025-10-03<br>
3,103,West,Desk Chair,3,4500.0,2025-10-02<br>

####6. Delta Table - CLONE

Delta cloning lets you create a **copy of a Delta table** efficiently:
- **Full (deep) clone**: independent copy of data and metadata  
- **Shallow clone**: metadata-only copy referencing the same underlying data files  

**Clone vs CTAS**

| Aspect                  | CLONE (Delta Lake)                     | CTAS (Create Table As Select)                |
| ----------------------- | -------------------------------------- | -------------------------------------------- |
| Type                    | Delta Lake feature                     | Standard SQL feature                         |
| Data copy               | Metadata-only (Shallow) or full (Deep) | Full physical data copy                      |
| Speed                   | Very fast (especially Shallow Clone)   | Slower for large tables                      |
| Storage usage           | Minimal for Shallow Clone              | High (duplicates data)                       |
| Time travel & history   | Source history is not carried over; the clone starts its own history | Not preserved            |
| Schema                  | Exact copy                             | Can be modified                              |
| Dependency on source    | Shallow clone depends on source files  | Fully independent                            |
| Use case                | Dev/Test copies, backups, experiments  | Aggregations, filtered or transformed tables |
| Source table type       | Delta tables only                      | Delta or non-Delta tables                    |

# Delta Table - CLONE (Full Clone vs Shallow Clone)

Delta Lake provides **CLONE** to create a copy of an existing Delta table efficiently.

There are two types of clones:
- **Full Clone**
- **Shallow Clone**

---

## What does “referencing the same underlying data files” mean?

**Shallow clone does NOT copy the actual data files (Parquet files).**  
It only creates a new Delta table with its own metadata that **points to the same Parquet files** as the source table.
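The idea can be sketched with a small pure-Python toy model. This is not how Delta is implemented: the "transaction log" here is just a list of active file paths, and all paths, table names, and helpers (`shallow_clone`, `deep_clone`, `read`) are hypothetical.

```python
import copy

# Toy "storage": path -> file contents. Paths are made up for illustration.
storage = {
    "/sales_table/part-0001.parquet": ["row1", "row2"],
    "/sales_table/part-0002.parquet": ["row3"],
}

# Toy "transaction log": each table lists the data files it considers active.
sales_table = {"active_files": list(storage)}

def shallow_clone(source):
    """New log, same file paths: no data file is copied."""
    return {"active_files": list(source["active_files"])}

def deep_clone(source, source_prefix, target_prefix):
    """New log plus a physical copy of every data file under the target prefix."""
    new_files = []
    for path in source["active_files"]:
        new_path = path.replace(source_prefix, target_prefix, 1)
        storage[new_path] = copy.deepcopy(storage[path])
        new_files.append(new_path)
    return {"active_files": new_files}

def read(table):
    """Read a table by following the file paths in its own log."""
    return [row for path in table["active_files"] for row in storage[path]]

files_before = len(storage)
sales_clone = shallow_clone(sales_table)
files_after_shallow = len(storage)   # unchanged: only metadata was created
sales_full = deep_clone(sales_table, "/sales_table/", "/sales_full/")
files_after_deep = len(storage)      # grew: data files were copied

# A later write to the source adds a file to the source's log only.
storage["/sales_table/part-0003.parquet"] = ["row4"]
sales_table["active_files"].append("/sales_table/part-0003.parquet")

print(read(sales_table))   # the source sees the new row
print(read(sales_clone))   # the shallow clone still reads only the files it referenced
```

The shallow clone adds no files to storage, while the deep clone doubles them. Because each table has its own log, the later write to the source is invisible to the shallow clone.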

---

## Delta Table Structure Reminder

A Delta table consists of:

1. **Data files** → Parquet files (`.parquet`)
2. **Metadata** → Delta transaction log (`_delta_log/`)

---

## Original Table (Source)

/sales_table/<br>
├── _delta_log/<br>
├── part-0001.parquet<br>
├── part-0002.parquet<br>


---

## Shallow Clone

```sql
CREATE TABLE sales_clone
SHALLOW CLONE sales_table;
```

What happened:<br>
/sales_clone/<br>
├── _delta_log/   (new metadata)<br>
├── (no Parquet files copied)<br>

Internally:<br>
sales_clone → references → part-0001.parquet<br>
sales_clone → references → part-0002.parquet<br>

**Key point**

- Both tables read the same Parquet files
- Only metadata is duplicated

This is what “referencing the same underlying data files” means.


##### CTAS (Create Table as Select)

**CTAS** creates an **independent copy**:
- Data is **rewritten** into new data files
- No source metadata is copied

**CREATE TABLE sales_ctas AS<br>
SELECT * FROM sales_source<br>**

CTAS:

- Reads data from sales_source
- Writes NEW Parquet files
- Creates a brand-new Delta table
- Does NOT copy source table metadata

What is “metadata” here?

Metadata includes:

- Transaction history (versions)
- Commit logs
- Operation history
- Optimization info (Z-ORDER, clustering)
- Table properties

📌 CTAS does not copy any of this.

Source Table<br>
/sales_source/<br>
 ├── _delta_log/<br>
 │    ├── 000000.json<br>
 │    ├── 000001.json<br>
 ├── part-0001.parquet<br>
 ├── part-0002.parquet<br>

 CTAS Table<br>
 /sales_ctas/<br>
 ├── _delta_log/<br>
 │    ├── 000000.json   ← NEW log (fresh history)<br>
 ├── part-0001.parquet  ← NEW files<br>
 ├── part-0002.parquet  ← NEW files<br>

✔ Data copied<br>
✔ Metadata rebuilt from scratch


| Feature             | CTAS   | Full Clone | Shallow Clone |
| ------------------- | ------ | ---------- | ------------- |
| Data files copied   | ✅ Yes | ✅ Yes | ❌ No |
| Metadata copied     | ❌ No | ✅ Yes | ✅ Yes |
| Source transaction history carried over | ❌ No | ❌ No (clone starts its own history) | ❌ No (clone starts its own history) |
| Speed               | Slower | Fast       | Very fast     |
| VACUUM safe         | ✅ Yes | ✅ Yes | ❌ No |
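The "VACUUM safe" row can be illustrated with a toy pure-Python model. It is a simplified sketch under loud assumptions: real `VACUUM` honours a retention period, which is ignored here, and the paths and helpers (`compact`, `vacuum`, `missing_files`) are hypothetical.

```python
# Toy storage and logs; all paths are made up for illustration.
storage = {"/src/part-0001.parquet": ["a"], "/src/part-0002.parquet": ["b"]}
source_log = {"active_files": ["/src/part-0001.parquet", "/src/part-0002.parquet"]}
# A shallow clone records the same paths in its own log, without copying data.
shallow_log = {"active_files": list(source_log["active_files"])}

def compact(log, new_path):
    """Rewrite all active files into one new file (stand-in for OPTIMIZE or UPDATE)."""
    storage[new_path] = [row for p in log["active_files"] for row in storage[p]]
    log["active_files"] = [new_path]

def vacuum(log, prefix):
    """Delete files under prefix that this log no longer references (no retention window)."""
    for path in [p for p in storage if p.startswith(prefix)]:
        if path not in log["active_files"]:
            del storage[path]

def missing_files(log):
    """Files a log still references that no longer exist in storage."""
    return [p for p in log["active_files"] if p not in storage]

compact(source_log, "/src/part-0003.parquet")   # source now points at a new file
vacuum(source_log, "/src/")                     # old files are physically removed
missing = missing_files(shallow_log)
print(missing)   # files the shallow clone still needs but can no longer read
```

The source table stays readable after the vacuum, but the shallow clone's log still points at the deleted files. A full clone or a CTAS table owns its own files, so neither is affected by maintenance on the source.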


In [0]:
%sql
CREATE TABLE sales_orders_ctas AS
SELECT * FROM sales_orders_liquid;

In [0]:
%sql
DESCRIBE HISTORY sales_orders_ctas

In [0]:
%sql
DESC HISTORY sales_orders_liquid;


##### Full Clone

**Full clone** creates an **independent copy**:
- Data files are **copied**
- Metadata copied
- Uses more storage


In [0]:
%sql
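-- CLONE without the SHALLOW keyword creates a deep (full) clone.
-- Note: despite the "_sclone" suffix, sales_orders_sclone is a full clone.
CREATE TABLE sales_orders_sclone 
CLONE sales_orders_liquid;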

In [0]:
%sql
describe history sales_orders_liquid

In [0]:
%sql
describe history sales_orders_sclone

##### Shallow Clone

**Shallow clone** creates a **metadata-only copy**:
- Shares the same underlying data files
- Very fast, uses minimal extra storage
- A shallow clone shares data files, but it does NOT share the transaction log
- Even if two tables point to the same data files, they are logically independent because they have separate logs.

In [0]:
%sql
CREATE OR REPLACE TABLE sales_orders_l_sclone
SHALLOW CLONE sales_orders_sclone;

In [0]:
%sql
-- Verify shallow clone
SELECT * FROM sales_orders_l_sclone;

In [0]:
%sql
DESCRIBE HISTORY sales_orders_l_sclone;

In [0]:
%sql
INSERT INTO sales_orders_sclone VALUES
 (7, 101, 'North', 'Keyboard', 1, 1200, '2025-10-04');

In [0]:
%sql
INSERT INTO sales_orders_sclone VALUES
 (7, 101, 'North', 'Keyboard', 1, 1200, '2025-10-04');

In [0]:
%sql
UPDATE sales_orders_sclone
SET price = 200.1
WHERE region = 'South';

In [0]:
%sql
-- Verify the changes in the source table (the full clone)
SELECT * FROM sales_orders_sclone;

In [0]:
%sql
-- The shallow clone still points to the old data files, so the changes above are not visible here
SELECT * FROM sales_orders_l_sclone;

####7. Deletion Vector
A deletion vector (DV) is a metadata structure that marks specific rows in a Parquet file as deleted, without rewriting the file.<br>
E.g., instead of rewriting whole files, Delta just records: “row 3, row 15, row 102 are deleted”.

DV benefits (what you see after a `DELETE`):
- Parquet file count is unchanged
- New DV files are created internally

If you disable DVs:
- The affected files are rewritten
- New Parquet files are created
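The difference between the two delete paths can be sketched in pure Python. This is a toy model only, not Delta's on-disk DV format: the deletion vector is a plain set of row positions, the data mirrors the `orders_dv` example used in this section, and the helper names are hypothetical.

```python
# One immutable toy "Parquet file": (order_id, region) rows, like the orders_dv example.
data_file = [(i, "APAC" if i % 2 == 0 else "EMEA") for i in range(20)]

def delete_with_dv(rows, deleted_positions, predicate):
    """Record the positions of matching rows; the data file itself is untouched."""
    return deleted_positions | {pos for pos, row in enumerate(rows) if predicate(row)}

def read_with_dv(rows, deleted_positions):
    """Readers skip every position listed in the deletion vector."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]

def delete_with_rewrite(rows, predicate):
    """Without DVs: produce a new file that contains only the surviving rows."""
    return [row for row in rows if not predicate(row)]

def is_apac(row):
    return row[1] == "APAC"

dv = delete_with_dv(data_file, set(), is_apac)
visible = read_with_dv(data_file, dv)
rewritten = delete_with_rewrite(data_file, is_apac)

print("rows still stored in the original file:", len(data_file))
print("positions marked deleted:", len(dv))
print("rows visible to readers:", len(visible))
```

Both paths return the same rows to a reader. The DV path leaves the original file in place and adds a small side structure, while the rewrite path has to produce a whole new file.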


In [0]:
%sql
CREATE OR REPLACE TABLE orders_dv AS
SELECT
  id AS order_id,
  CASE WHEN id % 2 = 0 THEN 'APAC' ELSE 'EMEA' END AS region
FROM range(0, 20);
select * from orders_dv;

In [0]:
%sql
ALTER TABLE orders_dv
SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

In [0]:
%sql
DESCRIBE DETAIL orders_dv;

In [0]:
%sql
DELETE FROM orders_dv WHERE region = 'APAC';

In [0]:
%sql
DESCRIBE HISTORY orders_dv

In [0]:
%sql
DESCRIBE DETAIL orders_dv;

In [0]:
%sql
ALTER TABLE orders_dv
SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);

In [0]:
%sql
DELETE FROM orders_dv WHERE order_id = 3;

In [0]:
%sql
DESCRIBE DETAIL orders_dv;

In [0]:
%sql
-- Check the operationMetrics column
DESCRIBE HISTORY orders_dv;