### Delta Lake & Lakehouse Optimization Use Cases

![](/Workspace/Users/infoblisstech@gmail.com/databricks-code-repo/5_all_databricks_workouts/DELTA OPTIMIZATIONS.png)

#### 1. Handling Data Skew & Query Performance (Optimize & Z-Order)
Scenario: The analytics team reports that queries filtering silver_shipments by source_city and shipment_date are becoming slow as data volume grows.

Task: Run the OPTIMIZE command with ZORDER on the silver_shipments table to co-locate related data in the same files.

Outcome:
Why did we choose source_city and shipment_date for Z-Ordering instead of shipment_id? Think about column cardinality versus which columns the queries actually filter on.
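
One way to reason about this question is a toy model of file-level data skipping. The sketch below is plain Python, not Delta Lake's implementation: it uses a simple sort on one column as a stand-in for Z-ordering, and the city values, file size, and helper names (`files_scanned`, `chunk`) are all hypothetical.

```python
# Toy model of file-level data skipping (not Delta Lake's actual implementation).
# Each "file" holds a few rows; the engine keeps min/max stats per file and
# skips any file whose [min, max] range cannot contain the filter value.

def files_scanned(files, city):
    """Count files whose [min, max] source_city range could contain `city`."""
    return sum(1 for f in files if min(f) <= city <= max(f))

def chunk(rows, size):
    """Split rows into consecutive files of `size` rows each."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

# Hypothetical source_city values in arrival order (cities interleaved).
rows = ["Austin", "Boston", "Chicago", "Denver"] * 4   # 16 rows

unclustered = chunk(rows, 4)          # every file contains all four cities
clustered = chunk(sorted(rows), 4)    # each file contains a single city

print(files_scanned(unclustered, "Boston"))  # 4: no file can be skipped
print(files_scanned(clustered, "Boston"))    # 1: three files are skipped
```

Co-locating rows only pays off for the columns that appear in query filters, which is why the filter columns are the natural candidates here.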

In [0]:
%sql
DESCRIBE DETAIL logistics.schema.silver_shipments;


In [0]:
%sql
CREATE OR REPLACE TABLE silver_shipments_optimized
AS SELECT * FROM logistics.schema.silver_shipments;

In [0]:
%sql
OPTIMIZE silver_shipments_optimized ZORDER BY (source_city, shipment_date);
-- SELECT *  FROM prodcatalog_wd36.logistics_wd36.silver_shipments;
DESCRIBE HISTORY silver_shipments_optimized;
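-- shipment_id is a high-cardinality column (every value is unique), but the slow queries do not filter on it, so Z-ordering by it would not help them skip files.
-- source_city and shipment_date are the columns used in the query filters, so Z-ordering by them co-locates the rows those queries read.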


In [0]:
spark.sql("OPTIMIZE logistics.schema.silver_shipments ZORDER BY (source_city, shipment_date)").display()


In [0]:
# /Volumes/workspace/default/logistics_project_data/customer1.xlsx
# for practice ignore this

# df = spark.read \
#   .format("csv") \
#   .option("header", "true") \
#   .option("inferSchema", "true") \
#   .load("/Volumes/workspace/default/logistics_project_data/customer1.xlsx").toDF("cust_id", "cust_name", "age", "cust_city", "plan")

# df.write \
#   .format("delta") \
#   .mode("overwrite") \
#   .saveAsTable("customer1_optimization")


#### 2. Speeding up Regional Queries (Partition Pruning)
Scenario: The dashboard team reports that queries on the gold_core_curated_tbl table that filter origin_hub_city for "New York" shipments scan the entire dataset (terabytes of data), even though New York accounts for only 5% of it. This is driving up compute costs.

Task: Re-create the gold_core_curated_tbl table partitioned by origin_hub_city. Run a query filtering for one city to demonstrate "Partition Pruning" (where Spark skips files that don't match the filter).

Outcome: Verify whether partition filtering is applied by running EXPLAIN on the query and checking for PartitionFilters in the output.
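
The idea behind partition pruning can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's planner: the rows, city values, and helper names (`partition_by_city`, `query`) are hypothetical, and each dictionary key stands in for one partition directory.

```python
from collections import defaultdict

# Hypothetical rows: (shipment_id, origin_hub_city).
rows = [(1, "New York"), (2, "Chicago"), (3, "New York"), (4, "Dallas"), (5, "Chicago")]

def partition_by_city(rows):
    """Group rows into one 'directory' per origin_hub_city value."""
    partitions = defaultdict(list)
    for shipment_id, city in rows:
        partitions[city].append(shipment_id)
    return dict(partitions)

def query(partitions, city):
    """Return the matching shipment ids and how many partitions were read."""
    matched = partitions.get(city, [])
    partitions_read = 1 if city in partitions else 0
    return matched, partitions_read

partitions = partition_by_city(rows)
matched, partitions_read = query(partitions, "New York")
print(matched, partitions_read, len(partitions))  # [1, 3] 1 3
```

Because the filter column is the partition key, the lookup touches one partition out of three instead of reading every row.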

In [0]:
%sql

create or replace table curated_tbl_optimized partitioned by (origin_hub_city) as select * from logistics.schema.core_curated_tbl;
select * from curated_tbl_optimized;


In [0]:
%sql
explain select * from curated_tbl_optimized where origin_hub_city ='Newyork';

In [0]:
%sql
select * from curated_tbl_optimized where origin_hub_city ='Newyork';

#### 3. Storage Cost Savings (Vacuum)
Scenario: Your project pipeline runs every hour, creating many small files and obsolete versions of data. Your storage costs are rising. You need to clean up files that are no longer needed for time travel.

Task: Execute a VACUUM command to remove data files older than the retention threshold.

Outcome: Run DESCRIBE HISTORY and confirm that the VACUUM operation completed (look for the VACUUM START and VACUUM END entries).
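
The retention rule can be illustrated with a small Python sketch. It is a simplified model, not the actual VACUUM implementation: the file names, timestamps, and the `vacuum` helper are hypothetical, and a file is treated as deletable only when it is no longer part of the current table version and was removed before the cutoff.

```python
from datetime import datetime, timedelta

def vacuum(files, now, retain_hours=168):
    """Return the names of files this toy vacuum would delete.

    A file qualifies only if it is no longer referenced by the current
    table version (removed_at is set) and it was removed before the cutoff.
    """
    cutoff = now - timedelta(hours=retain_hours)
    return [name for name, removed_at in files
            if removed_at is not None and removed_at < cutoff]

now = datetime(2024, 1, 15, 12, 0)
files = [
    ("part-000.parquet", None),                        # still in the current version
    ("part-001.parquet", now - timedelta(hours=200)),  # stale, past the 168-hour window
    ("part-002.parquet", now - timedelta(hours=24)),   # stale, but inside the window
]
print(vacuum(files, now))  # ['part-001.parquet']
```

The file removed 24 hours ago survives, which is what keeps time travel to recent versions possible.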

In [0]:
%sql
VACUUM curated_tbl_optimized RETAIN 168 HOURS;
DESC HISTORY curated_tbl_optimized;

#### 4. Modern Data Layout (Liquid Clustering)
Scenario: You are redesigning the silver_shipments table. You want to avoid the "small files" problem and need a flexible layout that adapts to changing query patterns automatically without rewriting the table.

Task: Re-create the silver_shipments table using Liquid Clustering on the shipment_id column.

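Outcome: Liquid Clustering is preferred over traditional partitioning when the cardinality of shipment_id is very high, because partitioning on such a column would create a very large number of tiny partitions.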

In [0]:
%sql
SELECT * FROM logistics.schema.silver_shipments;

In [0]:
%sql
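-- CLUSTER BY belongs to the CREATE TABLE clause, before AS SELECT;
-- placed after the SELECT it is parsed as part of the query, not as the table's clustering.
CREATE OR REPLACE TABLE silver_shipments_optimized
CLUSTER BY (shipment_id)
AS
SELECT * FROM logistics.schema.silver_shipments;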

DESC HISTORY silver_shipments_optimized;

#### 5. Cost Efficient Environment Cloning (Shallow Clone)
Scenario: The QA team needs to test an update on the gold_core_curated_tbl table. The table is 5TB in size. You cannot afford to duplicate the storage cost just for a test and the update should not affect the copied table.

Task: Create a Shallow Clone of the gold table for the QA team.

Outcome: If we delete records from the source table (gold_core_curated_tbl), will the QA table (gold_core_curated_tbl_qa) be affected? Why or why not?

In [0]:
%sql
DROP TABLE IF EXISTS core_curated_tbl_qa;

CREATE TABLE core_curated_tbl_qa
SHALLOW CLONE curated_tbl_optimized;

In [0]:
%sql
SELECT * FROM core_curated_tbl_qa;

In [0]:
%sql
DESCRIBE HISTORY core_curated_tbl_qa;

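Deleting or inserting data in the source table **(curated_tbl_optimized)** does not affect the QA table **(the shallow clone, core_curated_tbl_qa)**. A shallow clone copies only the table metadata (the snapshot of the source at clone time) and references the source's existing data files. Later changes to the source are recorded as new versions in the source's own transaction log, which the clone does not read.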
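
The behaviour can be modelled with a short Python sketch. It is a toy under stated assumptions, not Delta Lake's code: `ToyTable` and the file names are hypothetical, and a delete is simplified to dropping a whole file from the snapshot rather than rewriting it.

```python
class ToyTable:
    """A table is an ordered list of versions; each version is a tuple of file names."""

    def __init__(self, versions):
        self.versions = list(versions)

    def current_files(self):
        return self.versions[-1]

    def delete_file(self, name):
        # A delete commits a new version that no longer references the file;
        # the file itself stays in storage until it is vacuumed.
        self.versions.append(tuple(f for f in self.current_files() if f != name))

    def shallow_clone(self):
        # Copy only the current snapshot's metadata (the file list), not the files.
        return ToyTable([self.current_files()])

source = ToyTable([("a.parquet", "b.parquet")])
qa = source.shallow_clone()
source.delete_file("a.parquet")

print(source.current_files())  # ('b.parquet',)
print(qa.current_files())      # ('a.parquet', 'b.parquet')
```

The clone keeps its own version list, so a new commit on the source leaves the clone's snapshot untouched.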

#### 6. Disaster Recovery (Time Travel & Restore)
Scenario: A junior data engineer accidentally ran a logic error that corrupted the gold_core_curated_tbl table 15 minutes ago. You need to revert the table to its previous state immediately.

Task: Use Delta Lake's Restore feature to roll back the table.

Outcome: What is the difference between querying with VERSION AS OF (Time Travel) and running RESTORE?
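
A toy version log in Python can help frame the comparison. This is a sketch under stated assumptions, not Delta Lake's transaction log: `ToyDeltaLog` and the row values are hypothetical. Reading an old version leaves the log unchanged, while a restore commits a new version whose contents match the old one.

```python
class ToyDeltaLog:
    """Minimal version log: list index = version number."""

    def __init__(self):
        self.versions = []

    def commit(self, rows):
        self.versions.append(list(rows))

    def read(self, version_as_of=None):
        """Time travel: read a version without modifying the log."""
        v = len(self.versions) - 1 if version_as_of is None else version_as_of
        return self.versions[v]

    def restore(self, version):
        """Restore: commit a new version whose contents equal an older one."""
        self.commit(self.versions[version])

log = ToyDeltaLog()
log.commit(["r1"])                 # version 0
log.commit(["r1", "r2"])           # version 1 (good state)
log.commit(["corrupted"])          # version 2 (bad write)

print(log.read(version_as_of=1))   # ['r1', 'r2']; the current version is still 2
print(log.read())                  # ['corrupted']
log.restore(1)
print(log.read(), len(log.versions))  # ['r1', 'r2'] 4
```

After the restore, the bad write is still in the history as version 2, so it can be inspected later.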

In [0]:
%sql
RESTORE TABLE curated_tbl_optimized
TO VERSION AS OF 1;

In [0]:
%sql
DESC HISTORY curated_tbl_optimized;