### Delta Lake & Lakehouse Optimization Use Cases

![](/Workspace/Users/sachinchitte4@gmail.com/MI_databricks_MainRepo/5_all_databricks_workouts/DELTA OPTIMIZATIONS.png)

> **Data skew** is a situation where data is unevenly distributed across partitions, so some tasks process much more data than others and the whole stage waits on the slowest task. It commonly occurs during joins, groupBy, and other shuffle operations. It can be handled with techniques such as salting, broadcast joins, better partitioning, and Spark's adaptive query execution (AQE).


- **Data skew** occurs when the data being processed is not evenly distributed across the Spark cluster, so some tasks take significantly longer than others. The result is inefficient resource utilization and longer processing times. Skew can arise from uneven data distribution, skewed join keys, or uneven data processing patterns.




### Salting:
![](/Workspace/Users/sachinchitte4@gmail.com/MI_databricks_MainRepo/5_all_databricks_workouts/salting.png)
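
The effect of salting can be sketched in plain Python without a cluster. This is a toy model under stated assumptions: the partition count, the number of salt values, the sample cities, and the byte-sum hash are all made up for illustration (Spark uses its own hash function and partitioner). It only shows how appending a salt turns one hot key into several keys that land on different partitions.

```python
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical number of shuffle partitions
NUM_SALTS = 4       # hypothetical number of salt values


def toy_partition(key: str) -> int:
    """Toy hash partitioner: sum of the key's bytes modulo the partition count."""
    return sum(key.encode("utf-8")) % NUM_PARTITIONS


# Skewed sample data: one hot key dominates.
rows = ["Mumbai"] * 900 + ["Pune"] * 50 + ["Delhi"] * 50

# Without salting, every "Mumbai" row is sent to the same partition.
plain = Counter(toy_partition(city) for city in rows)

# With salting, a suffix 0..NUM_SALTS-1 is appended round-robin,
# so the hot key is split into NUM_SALTS distinct keys.
salted = Counter(
    toy_partition(f"{city}_{i % NUM_SALTS}") for i, city in enumerate(rows)
)

print("rows per partition without salt:", dict(plain))
print("rows per partition with salt:   ", dict(salted))
print("largest partition:", max(plain.values()), "->", max(salted.values()))
```

In a real salted join, the other side of the join has to be replicated once per salt value so that every salted key still finds its match.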

#### 1. Handling Data Skew & Query Performance (Optimize & Z-Order)
Scenario: The analytics team reports that queries filtering silver_shipments by source_city and shipment_date are becoming slow as data volume grows.

Task: Run the OPTIMIZE command with ZORDER on the silver_shipments table to co-locate related data in the same files.

Outcome:
Why did we choose source_city and shipment_date for Z-Ordering instead of shipment_id? Think about high cardinality vs. query filtering.

In [0]:
%sql
OPTIMIZE catalog2_we47.schema2_we47.silver_shipments
ZORDER BY (source_city, shipment_date);

- The main reason is:
- 👉 ZORDER should be applied on columns that are frequently used in WHERE clauses.
- source_city → Used frequently ✅
- shipment_date → Used frequently ✅
- shipment_id → Rarely used
- ZORDER works best when it matches query filters. High cardinality alone is not a reason to Z-Order: shipment_id is unique per row, but because queries rarely filter on it, clustering by it would not help these queries skip files.

| Column        | Cardinality | Meaning           |
| ------------- | ----------- | ----------------- |
| source_city   | Low–Medium  | Limited cities    |
| shipment_date | Low         | One value per day |
| shipment_id   | Very High   | Unique per row    |
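
To see why Z-Ordering on two filter columns helps, here is a minimal sketch of the textbook Z-order (Morton) curve in Python. It is not Delta Lake's implementation: the integer-coded `city_code` and `day_number` columns, the 4x4 grid, and the four-row "files" are assumptions for illustration. The point is that after sorting by the interleaved value, each file covers a narrow range of both columns, so a filter on either one can skip files.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Morton value: bits of x go to even positions, bits of y to odd positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Hypothetical rows: (city_code, day_number), both already mapped to small integers.
rows = [(c, d) for c in range(4) for d in range(4)]
rows.sort(key=lambda r: interleave_bits(r[0], r[1]))

# Split the sorted rows into "files" of 4 rows and report min/max per column.
files = [rows[i:i + 4] for i in range(0, len(rows), 4)]
for n, f in enumerate(files):
    cities = [r[0] for r in f]
    days = [r[1] for r in f]
    print(f"file {n}: city {min(cities)}-{max(cities)}, day {min(days)}-{max(days)}")
```

Sorting by a single column would give narrow ranges for that column only; interleaving trades a little precision on each column for usable ranges on both.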


#### 2. Speeding up Regional Queries (Partition Pruning)
Scenario: The dashboard team reports that queries on the gold_core_curated_tbl table that filter origin_hub_city for "New York" shipments are scanning the entire dataset (terabytes of data), even though New York is only 5% of the data. This is racking up compute costs.

Task: Re-create the gold_core_curated_tbl table partitioned by origin_hub_city. Run a query filtering for one city to demonstrate "Partition Pruning" (where Spark skips files that don't match the filter).

Outcome: Verify whether partition filtering is applied by running an explain plan and checking for PartitionFilters in the output.

In [0]:
%sql
CREATE OR REPLACE TABLE catalog2_we47.schema2_we47.gold_core_curated_tbl_partitioned
USING DELTA
PARTITIONED BY (origin_hub_city)
AS
SELECT *
FROM catalog2_we47.schema2_we47.gold_core_curated_tbl;

In [0]:
%sql
DESCRIBE DETAIL catalog2_we47.schema2_we47.gold_core_curated_tbl_partitioned;

In [0]:
%sql
-- Query for Mumbai (on the partitioned table)
SELECT *
FROM catalog2_we47.schema2_we47.gold_core_curated_tbl_partitioned
WHERE origin_hub_city = 'Mumbai';

In [0]:
%sql
-- Verify partition pruning
EXPLAIN FORMATTED
SELECT *
FROM catalog2_we47.schema2_we47.gold_core_curated_tbl_partitioned
WHERE origin_hub_city = 'Mumbai';
-- Output: PartitionFilters: [isnotnull(origin_hub_city#15454), (origin_hub_city#15454 = Mumbai)]
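
The pruning behaviour that the PartitionFilters entry points to can be modelled in a few lines of Python. This is only a sketch: the sample rows and the two query helpers are hypothetical, and a dictionary stands in for the per-city partition directories.

```python
from collections import defaultdict

# Hypothetical rows: (origin_hub_city, shipment_id)
rows = [("Mumbai", 1), ("Delhi", 2), ("Mumbai", 3), ("New York", 4), ("Delhi", 5)]

# "Partitioned table": one bucket (directory) per distinct origin_hub_city value.
partitions = defaultdict(list)
for city, shipment_id in rows:
    partitions[city].append((city, shipment_id))


def query_unpartitioned(city):
    """Full scan: every row is read, then filtered."""
    scanned = len(rows)
    result = [r for r in rows if r[0] == city]
    return result, scanned


def query_partitioned(city):
    """Pruned scan: only the matching partition is opened."""
    bucket = partitions.get(city, [])
    return list(bucket), len(bucket)


full_result, full_scanned = query_unpartitioned("Mumbai")
pruned_result, pruned_scanned = query_partitioned("Mumbai")
print("rows scanned without pruning:", full_scanned)
print("rows scanned with pruning:   ", pruned_scanned)
```

Both queries return the same rows; only the amount of data touched differs, which is where the compute savings come from.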

#### 3. Storage Cost Savings (Vacuum)
Scenario: Your project pipeline runs every hour, creating many small files and obsolete versions of data. Your storage costs are rising. You need to clean up files that are no longer needed for time travel.

Task: Execute a Vacuum command to remove data files older than the retention threshold.

Outcome: Performance improvement, cost saving, best practices.

Observation: Run DESCRIBE HISTORY before and after, and check whether the VACUUM operation appears in the table history.

In [0]:
%sql
DESCRIBE HISTORY catalog2_we47.schema2_we47.gold_core_curated_tbl_partitioned;

In [0]:
%sql
-- Delete unreferenced data files older than 200 hours (above the 168-hour default retention)
VACUUM catalog2_we47.schema2_we47.gold_core_curated_tbl_partitioned RETAIN 200 HOURS;

In [0]:
%sql
DESCRIBE HISTORY catalog2_we47.schema2_we47.gold_core_curated_tbl_partitioned;
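
The retention rule behind `RETAIN 200 HOURS` can be sketched as a simple cutoff check. This is a simplified model, not how Delta Lake tracks files internally: the `removed_at` field, the file names, and the fixed "current time" are assumptions for illustration.

```python
from datetime import datetime, timedelta


def files_to_vacuum(files, now, retain_hours):
    """Return paths that are no longer referenced and were removed before the cutoff."""
    cutoff = now - timedelta(hours=retain_hours)
    return [
        f["path"]
        for f in files
        if f["removed_at"] is not None and f["removed_at"] < cutoff
    ]


now = datetime(2024, 1, 31, 12, 0)  # hypothetical "current time"
files = [
    {"path": "part-000.parquet", "removed_at": None},                        # still active
    {"path": "part-001.parquet", "removed_at": now - timedelta(hours=300)},  # old, unreferenced
    {"path": "part-002.parquet", "removed_at": now - timedelta(hours=24)},   # recent, unreferenced
]

deletable = files_to_vacuum(files, now, retain_hours=200)
print(deletable)
```

Only the file that has been unreferenced for longer than the retention window is eligible. The recently replaced file is kept so that time travel to recent versions still works.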

#### 4. Modern Data Layout (Liquid Clustering)
Scenario: You are redesigning the silver_shipments table. You want to avoid the "small files" problem and need a flexible layout that adapts to changing query patterns automatically without rewriting the table.

Task: Re-create the silver_shipments table using Liquid Clustering on the shipment_id column.

Outcome: Explain why Liquid Clustering is preferred over traditional partitioning when the cardinality of shipment_id is very high.

In [0]:
%sql
CREATE OR REPLACE TABLE catalog2_we47.schema2_we47.silver_shipments_liquid
USING DELTA
CLUSTER BY (shipment_id)
AS
SELECT *
FROM catalog2_we47.schema2_we47.silver_shipments;

In [0]:
%sql
DESCRIBE DETAIL catalog2_we47.schema2_we47.silver_shipments_liquid;

In [0]:
%sql
SELECT *
FROM catalog2_we47.schema2_we47.silver_shipments_liquid
WHERE shipment_id = 6000037;

We recreated silver_shipments using Liquid Clustering on shipment_id because it has very high cardinality. Partitioning on such a column would lead to over-partitioning and metadata overhead. Liquid Clustering automatically organizes data inside files and handles compaction, providing efficient data skipping and adaptability to changing query patterns.
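
The data-skipping benefit described above can be illustrated with per-file min/max statistics. This sketch does not reproduce Liquid Clustering's actual algorithm. It assumes 1,000 hypothetical shipment_id values, 100 rows per file, and a plain sort as a stand-in for clustering, and it counts how many files a point lookup has to open.

```python
import random


def build_files(ids, rows_per_file):
    """Split ids into files and keep per-file min/max statistics."""
    chunks = [ids[i:i + rows_per_file] for i in range(0, len(ids), rows_per_file)]
    return [{"min": min(c), "max": max(c)} for c in chunks]


def files_to_read(files, target):
    """A file can be skipped when target falls outside its [min, max] range."""
    return sum(1 for f in files if f["min"] <= target <= f["max"])


ids = list(range(1000))              # hypothetical shipment_id values
shuffled = ids[:]
random.Random(42).shuffle(shuffled)  # arrival order with no clustering

unclustered = build_files(shuffled, rows_per_file=100)
clustered = build_files(sorted(ids), rows_per_file=100)

print("files read without clustering:", files_to_read(unclustered, 537))
print("files read with clustering:   ", files_to_read(clustered, 537))
```

With clustered data, the lookup for a single shipment_id touches one file. With random arrival order, almost every file's min/max range contains the value, so few files can be skipped.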

#### 5. Cost Efficient Environment Cloning (Shallow Clone)
Scenario: The QA team needs to test an update on the gold_core_curated_tbl table. The table is 5TB in size. You cannot afford to duplicate the storage cost just for a test and the update should not affect the copied table.

Source table: catalog2_we47.schema2_we47.gold_core_curated_tbl

Task: Create a Shallow Clone of the gold table for the QA team.

Outcome: If we delete records from the source table (gold_core_curated_tbl), will the QA table (gold_core_curated_tbl_qa) be affected? Why or why not?

In [0]:
%sql
CREATE TABLE catalog2_we47.schema2_we47.gold_core_curated_tbl_qa
SHALLOW CLONE catalog2_we47.schema2_we47.gold_core_curated_tbl;

- A Shallow Clone is a metadata-only copy of a table. It's incredibly fast and efficient because it doesn't actually move or duplicate the heavy data files (Parquet files) sitting in your cloud storage.
- When you create a shallow clone, Databricks copies the Transaction Log (the Delta Log) of the source table but points to the existing data files of that source.
- Deleting records from the source table does not affect the QA table: the delete is recorded as a new version in the source's own log, while the clone's log still references the data files as they were at clone time.
- This is the most important thing to remember: a shallow clone still depends on the source table's data files.
- If you run a VACUUM command on the source table and it deletes old data files that the shallow clone is still pointing to, your shallow clone will break. It will try to find a file that no longer exists, and your queries will fail.
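
The dependency between a shallow clone and its source can be modelled with two file lists pointing into shared storage. This is a toy model, not Delta Lake internals: the `storage` dictionary, the file names, and the `read_table` helper are hypothetical, and a table's "log" is reduced to a list of file paths.

```python
# Physical storage shared by both tables: path -> rows (toy model).
storage = {"f1.parquet": [1, 2], "f2.parquet": [3, 4]}

# Each table is only a list of file paths, standing in for its transaction log.
source_files = ["f1.parquet", "f2.parquet"]
clone_files = list(source_files)  # shallow clone: copy references, not data


def read_table(file_list):
    """Read all rows; raises KeyError if a referenced file is gone."""
    return [row for path in file_list for row in storage[path]]


# DELETE on the source: row 4 is dropped by writing a new file and
# updating only the source's file list. The old file stays in storage.
storage["f3.parquet"] = [3]
source_files = ["f1.parquet", "f3.parquet"]

source_after_delete = read_table(source_files)
clone_after_delete = read_table(clone_files)
print("source after delete:", source_after_delete)
print("clone after delete: ", clone_after_delete)

# VACUUM on the source physically removes f2, which only the clone still references.
del storage["f2.parquet"]
try:
    read_table(clone_files)
    clone_readable = True
except KeyError:
    clone_readable = False
print("clone readable after vacuum:", clone_readable)
```

The delete leaves the clone untouched because only the source's file list changes. The clone fails once the shared file is physically removed.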

#### 6. Disaster Recovery (Time Travel & Restore)
Scenario: A junior data engineer accidentally ran a logic error that corrupted the gold_core_curated_tbl table 15 minutes ago. You need to revert the table to its previous state immediately.

Task: Use Delta Lake's Restore feature to roll back the table.

Outcome: What is the difference between querying with VERSION AS OF (Time Travel) and running RESTORE?

In [0]:
%sql
DESCRIBE HISTORY catalog2_we47.schema2_we47.gold_core_curated_tbl;

In [0]:
%sql
RESTORE TABLE catalog2_we47.schema2_we47.gold_core_curated_tbl
TO VERSION AS OF 1;

In [0]:
%sql
-- Verify after restore
SELECT COUNT(*)
FROM catalog2_we47.schema2_we47.gold_core_curated_tbl;

In [0]:
%sql
DESCRIBE HISTORY catalog2_we47.schema2_we47.gold_core_curated_tbl;

### Difference Between VERSION AS OF (Time Travel) and RESTORE
1. VERSION AS OF → Temporary Read (Time Travel)

> SELECT * FROM gold_core_curated_tbl VERSION AS OF 1;

What it does:
- 👉 Shows how the table looked in the past.
- ✔ Read-only
- ✔ Does NOT change the table
- ✔ Point-in-time look at old data
- ✔ Other users see latest data
- ✔ No recovery happens

In [0]:
%sql
-- Compare old vs new
SELECT COUNT(*) FROM gold_core_curated_tbl VERSION AS OF 1;
SELECT COUNT(*) FROM gold_core_curated_tbl;

In [0]:
%sql
-- 2. RESTORE → Permanent Rollback
-- Example
RESTORE TABLE gold_core_curated_tbl TO VERSION AS OF 1;

| Feature       | VERSION AS OF (Time Travel) | RESTORE         |
| ------------- | --------------------------- | --------------- |
| Purpose       | View past data              | Roll back table |
| Table State   | Unchanged                   | Changed         |
| Read/Write    | Read-only                   | Write operation |
| Permanent     | ❌ No                        | ✅ Yes           |
| New Version   | ❌ No                        | ✅ Yes           |
| Affects Users | ❌ No                        | ✅ Yes           |
| Use Case      | Audit/Debug                 | Recovery        |
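
The rows of this comparison can be reproduced with a small versioned-table model. This is an illustrative sketch only: the `VersionedTable` class and its methods are hypothetical, and each version is stored as a full snapshot, which is not how Delta Lake stores history.

```python
class VersionedTable:
    """Toy model of a versioned table: each commit appends a full snapshot."""

    def __init__(self, rows):
        self.history = [list(rows)]  # version 0

    def write(self, rows):
        """Commit a new version."""
        self.history.append(list(rows))

    def read(self, version=None):
        """Time travel: return a snapshot without touching the history."""
        if version is None:
            version = len(self.history) - 1
        return list(self.history[version])

    def restore(self, version):
        """Restore: commit the old snapshot as a new latest version."""
        self.history.append(list(self.history[version]))


table = VersionedTable(["a", "b"])  # version 0
table.write(["a", "b", "c"])        # version 1 (good data)
table.write(["corrupted"])          # version 2 (bad write)

old = table.read(version=1)         # time travel only
versions_after_read = len(table.history)
latest_after_read = table.read()

table.restore(1)                    # creates version 3
versions_after_restore = len(table.history)
latest_after_restore = table.read()

print(old, latest_after_read, versions_after_read)
print(latest_after_restore, versions_after_restore)
```

Reading version 1 leaves the latest data and the version count unchanged. Restoring adds a new version whose contents match version 1, and the bad version stays in the history for auditing.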
