### Delta Lake & Lakehouse Optimization Use Cases

![](/Workspace/Users/infoblisstech@gmail.com/databricks-code-repo/5_all_databricks_workouts/DELTA OPTIMIZATIONS.png)

#### 1. Handling Data Skew & Query Performance (Optimize & Z-Order)
Scenario: The analytics team reports that queries filtering silver_shipments by source_city and shipment_date are becoming slow as data volume grows.

Task: Run the OPTIMIZE command with ZORDER on the silver_shipments table to co-locate related data in the same files.

Outcome: Why did we choose source_city and shipment_date for Z-Ordering instead of shipment_id? Think about high cardinality vs. the columns that queries actually filter on.
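As a rough intuition for the question above, the sketch below counts how many files a filter on one city has to open before and after the rows are clustered on that column. It is a toy model in plain Python, not Delta Lake's implementation: a single-column sort stands in for Z-Ordering, and the city values, file size, and the helper names `chunk` and `files_to_read` are all hypothetical.

```python
import random

def chunk(rows, size):
    """Split rows into consecutive 'files' of `size` rows each."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def files_to_read(files, city):
    """Count files whose [min, max] statistics could contain `city`."""
    return sum(1 for f in files if min(f) <= city <= max(f))

random.seed(7)
cities = [f"city_{i:02d}" for i in range(20)]          # hypothetical source_city values
rows = [random.choice(cities) for _ in range(1000)]

unclustered = chunk(rows, 100)           # 10 files in arrival order
clustered = chunk(sorted(rows), 100)     # 10 files after clustering on the column

files_before = files_to_read(unclustered, "city_05")
files_after = files_to_read(clustered, "city_05")
print(f"files read before clustering: {files_before}, after: {files_after}")
```

Clustering narrows each file's min/max range, so more files can be skipped. That only pays off for columns that appear in query filters, which is why a rarely filtered identifier such as shipment_id is a weaker candidate here.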

In [0]:
display(spark.sql("select * from catalog2_we47.schema2_we47.silver_shipments"))

In [0]:
%sql
describe detail catalog2_we47.schema2_we47.silver_shipments;

In [0]:
%sql
OPTIMIZE catalog2_we47.schema2_we47.silver_shipments ZORDER BY (source_city, shipment_date);

In [0]:
%sql
describe detail catalog2_we47.schema2_we47.silver_shipments;

#### 2. Speeding up Regional Queries (Partition Pruning)
Scenario: The dashboard team reports that queries on the gold_core_curated_tbl table that filter origin_hub_city for "New York" shipments scan the entire dataset (terabytes of data), even though New York is only 5% of the data. This is racking up compute costs.

Task: Re-create the gold_core_curated_tbl table partitioned by origin_hub_city. Run a query filtering for one city to demonstrate "Partition Pruning" (where Spark skips files that don't match the filter).

Outcome: Verify whether partition filtering is applied by running an explain plan and checking for PartitionFilters in the output.

In [0]:
#Schema Modeling (Denormalization)
#adding partition by as part of optimization
spark.sql(f"""
          CREATE OR REPLACE TABLE {GOLDDB}.gold_core_curated_tbl
USING DELTA
PARTITIONED BY (origin_hub_city)         
AS
SELECT
    s.shipment_id,
    CONCAT(
        SUBSTRING(s.staff_full_name, 1, 2),
        '****',
        SUBSTRING(s.staff_full_name, -1, 1)
    ) AS masked_staff_name,
    s.role,
    s.origin_hub_city,
    s.latitude,
    s.longitude,
    sh.shipment_cost,
    sh.shipment_year,
    sh.shipment_month,
    sh.route_segment,
    sh.cost_per_kg,
    sh.tax_amount,
    sh.ingestion_timestamp,
    sh.is_expedited,
    sh.is_weekend,
    sh.is_high_value,
    sh.order_prefix,
    sh.order_sequence,
    sh.ship_day,
    sh.route_lane
FROM silver_staff_geo_latlong_tv s
INNER JOIN {SILVERDB}.silver_shipments sh
    USING (shipment_id)
""")

In [0]:
spark.sql (f"""show partitions catalog2_we47.schema2_we47.gold_core_curated_tbl;""").display()

In [0]:
%sql
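-- Filter on the partition column so that only the matching partition is read.
select * from catalog2_we47.schema2_we47.gold_core_curated_tbl
where origin_hub_city = 'New York';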
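To check the Outcome for this section, one option is to pull the plan text from an `EXPLAIN` statement and look for the partition column inside the `PartitionFilters` entry. The helper `has_partition_filter` below is a hypothetical name, and the sample plan line is only an assumed shape; the exact plan text varies by runtime, so treat this as a sketch rather than a definitive check.

```python
import re

def has_partition_filter(plan_text: str, column: str) -> bool:
    """Return True if any PartitionFilters: [...] entry in the plan mentions `column`."""
    for filters in re.findall(r"PartitionFilters:\s*\[([^\]]*)\]", plan_text):
        if column in filters:
            return True
    return False

# Assumed example of what a pruned scan line can look like.
sample_plan = (
    "PartitionFilters: [isnotnull(origin_hub_city#10), "
    "(origin_hub_city#10 = New York)], PushedFilters: []"
)
print(has_partition_filter(sample_plan, "origin_hub_city"))

# In a Databricks notebook `spark` is predefined; this part is skipped elsewhere.
if "spark" in globals():
    plan = spark.sql(
        "EXPLAIN SELECT * FROM catalog2_we47.schema2_we47.gold_core_curated_tbl "
        "WHERE origin_hub_city = 'New York'"
    ).collect()[0][0]
    print(has_partition_filter(plan, "origin_hub_city"))
```

If the check comes back false on the real table, compare against a query with no filter on origin_hub_city to see how the scan node differs.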

#### 3. Storage Cost Savings (Vacuum)
Scenario: Your project pipeline runs every hour, creating many small files and obsolete versions of data. Your storage costs are rising. You need to clean up files that are no longer needed for time travel.

Task: Execute a Vacuum command to remove data files older than the retention threshold.

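Outcome: Lower storage cost, because data files that the table no longer references and that are older than the retention threshold are deleted. The trade-off is that time travel to versions that need those files stops working. 168 hours (7 days) is the default retention period.

Observation: Run DESCRIBE HISTORY and check whether the VACUUM operation completed.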

In [0]:
%sql
DESCRIBE DETAIL catalog2_we47.schema2_we47.gold_core_curated_tbl;

In [0]:
%sql
VACUUM catalog2_we47.schema2_we47.gold_core_curated_tbl retain 168 hours;
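For the Observation step, the table history can be filtered down to its VACUUM entries. The sketch below assumes the history reports vacuum runs as `VACUUM START` and `VACUUM END` operations; `vacuum_operations` and the sample rows are hypothetical, and the Spark part only runs where `spark` is defined.

```python
def vacuum_operations(history_rows):
    """Return (version, operation) pairs for history entries whose operation starts with VACUUM."""
    return [
        (row["version"], row["operation"])
        for row in history_rows
        if row["operation"].startswith("VACUUM")
    ]

# Hypothetical history rows, newest first, as DESCRIBE HISTORY would list them.
sample_history = [
    {"version": 3, "operation": "VACUUM END"},
    {"version": 2, "operation": "VACUUM START"},
    {"version": 1, "operation": "OPTIMIZE"},
    {"version": 0, "operation": "CREATE OR REPLACE TABLE AS SELECT"},
]
print(vacuum_operations(sample_history))

# In a Databricks notebook `spark` is predefined; this part is skipped elsewhere.
if "spark" in globals():
    rows = [
        r.asDict()
        for r in spark.sql(
            "DESCRIBE HISTORY catalog2_we47.schema2_we47.gold_core_curated_tbl"
        ).select("version", "operation").collect()
    ]
    print(vacuum_operations(rows))
```

A start entry without a matching end entry would suggest the vacuum did not finish.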

#### 4. Modern Data Layout (Liquid Clustering)
Scenario: You are redesigning the silver_shipments table. You want to avoid the "small files" problem and need a flexible layout that adapts to changing query patterns automatically without rewriting the table.

Task: Re-create the silver_shipments table using Liquid Clustering on the shipment_id column.

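Outcome: Explain why Liquid Clustering is preferred over traditional partitioning when the cardinality of shipment_id is very high. Hint: partitioning on a high-cardinality column creates one partition per distinct value, which leads to a very large number of tiny files.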
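The cardinality argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is a simplified model, not a measurement: the row count, the rows-per-file target, and both function names are hypothetical, and it assumes that a partitioned layout needs at least one file per partition value while a clustered layout sizes files by a target that does not depend on key cardinality.

```python
import math

def min_files_partitioned(distinct_values: int) -> int:
    """Partitioned layout: at least one file per distinct partition value."""
    return distinct_values

def files_clustered(total_rows: int, rows_per_file: int) -> int:
    """Clustered layout: file count driven by a target file size, not by key cardinality."""
    return math.ceil(total_rows / rows_per_file)

total_rows = 1_000_000           # hypothetical table size
distinct_shipment_ids = 1_000_000  # shipment_id is unique per row in this model
rows_per_file = 100_000          # hypothetical target file size in rows

print(min_files_partitioned(distinct_shipment_ids))
print(files_clustered(total_rows, rows_per_file))
```

Under these assumptions, partitioning by shipment_id produces a file per row, which is the "small files" problem the scenario wants to avoid.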

In [0]:
spark.sql(f"""create table {SILVERDB}.silver_shipments_liquid
using delta
cluster by (shipment_id) as
(
select * from {SILVERDB}.silver_shipments
)
;""")

display(spark.sql(f"""describe detail catalog2_we47.schema2_we47.silver_shipments_liquid;"""))

display(spark.sql("select * from catalog2_we47.schema2_we47.silver_shipments_liquid;"))

#### 5. Cost Efficient Environment Cloning (Shallow Clone)
Scenario: The QA team needs to test an update on the gold_core_curated_tbl table. The table is 5TB in size. You cannot afford to duplicate the storage cost just for a test and the update should not affect the original table.

Task: Create a Shallow Clone of the gold table for the QA team.

Outcome: If we delete records from the source table (gold_core_curated_tbl), will the QA table (gold_core_curated_tbl_qa) be affected & vice versa? Why or why not?
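One way to reason about the question above is a toy model in which a table is just a list of references to data files, and a shallow clone copies the references but not the files. The class `ToyTable`, its methods, and the file names are hypothetical; real behavior, especially around VACUUM, can differ by platform and table type.

```python
class ToyTable:
    """Toy model: a table is the set of data files its transaction log references."""

    def __init__(self, files):
        self.files = set(files)

    def shallow_clone(self):
        # Copy only the file references (metadata); no data files are duplicated.
        return ToyTable(self.files)

    def delete_rows_in(self, old_file, rewritten_file):
        # A DELETE rewrites the affected file and swaps the reference in this table's log only.
        self.files.discard(old_file)
        self.files.add(rewritten_file)

storage = {"part-0", "part-1"}     # hypothetical data files on shared storage
source = ToyTable(storage)
qa = source.shallow_clone()

# Delete on the source: the clone keeps pointing at the original files.
source.delete_rows_in("part-0", "part-0-v2")
storage.add("part-0-v2")
qa_after_source_delete = set(qa.files)

# Delete on the clone: new files are written for the clone, the source is untouched.
qa.delete_rows_in("part-1", "qa-part-1")
storage.add("qa-part-1")
source_after_qa_delete = set(source.files)

# VACUUM on the source past retention removes files the source no longer references.
storage -= {"part-0"}
broken = qa.files - storage        # files the clone still needs but can no longer read
print(qa_after_source_delete, source_after_qa_delete, broken)
```

In this model, deletes on either side stay isolated, but a vacuum on the source can remove files the clone still depends on.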

In [0]:

# Shallow clone: copies only the table metadata; data files stay with the source table.
spark.sql(f"""create or replace table {GOLDDB}.gold_core_curated_tbl_qa
shallow clone {GOLDDB}.gold_core_curated_tbl""")

spark.sql(f"""select * from {GOLDDB}.gold_core_curated_tbl_qa""").display()

spark.sql(f"""describe detail {GOLDDB}.gold_core_curated_tbl_qa""").display()

#### 6. Disaster Recovery (Time Travel & Restore)
Scenario: A junior data engineer accidentally ran a job with a logic error that corrupted the gold_core_curated_tbl table 15 minutes ago. You need to revert the table to its previous state immediately.

Task: Use Delta Lake's Restore feature to roll back the table.

Outcome: What is the difference between querying with VERSION AS OF (Time Travel) and running RESTORE?
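The difference asked about above can be sketched with a toy version log: time travel reads an old snapshot and leaves the log alone, while a restore appends a new commit whose contents match the old snapshot. `ToyDeltaLog` and its methods are hypothetical and only model the idea, not Delta Lake's actual log format.

```python
class ToyDeltaLog:
    """Toy model: a list of table snapshots, where the list index is the version number."""

    def __init__(self, first_snapshot):
        self.versions = [first_snapshot]

    def commit(self, snapshot):
        self.versions.append(snapshot)

    def read(self, version=None):
        # Time travel: return an older snapshot; the log is not modified.
        return self.versions[-1 if version is None else version]

    def restore(self, version):
        # Restore: append a NEW commit whose contents equal the chosen old snapshot.
        self.versions.append(self.versions[version])

log = ToyDeltaLog(["good rows"])
log.commit(["corrupted rows"])            # the accidental bad write becomes version 1

time_travel_result = log.read(version=0)  # query-only view of version 0
current_after_time_travel = log.read()    # the current table is still corrupted

log.restore(0)                            # version 2 now matches version 0
current_after_restore = log.read()
print(time_travel_result, current_after_time_travel, current_after_restore, len(log.versions))
```

Note that the corrupted version stays in the log after the restore, so it can still be inspected with time travel.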

In [0]:
%sql
describe history catalog2_we47.schema2_we47.gold_core_curated_tbl;
--select count(*) from catalog2_we47.schema2_we47.gold_core_curated_tbl;

In [0]:
%sql
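-- Replace 0 with the last good version shown by DESCRIBE HISTORY (version 0 is the table's first commit).
RESTORE TABLE catalog2_we47.schema2_we47.gold_core_curated_tbl TO VERSION AS OF 0;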

In [0]:
%sql
select * from catalog2_we47.schema2_we47.gold_core_curated_tbl;