# Analyze Query Execution Plans
## Objective

Understand how Spark executes a query before optimizing anything.

Rule: Never optimize before understanding the plan.

# Table Without Partitions
## Explain FULL execution plan
### Observe the Physical Plan Steps


In [0]:
spark.sql("""
SELECT *
FROM silver.events
WHERE event_type = 'purchase'
""").explain(True)


Even without partitioning, Spark pushes the predicate down to the file scan.
Without partition pruning, however, every file in the table is still a candidate for scanning, which limits the performance gain.
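One way to check this without reading the whole plan by eye is to capture the text that `explain(True)` prints and pull out the filter lists attached to the scan node. The sketch below is a hypothetical helper, not part of Spark: `capture_plan` and `scan_filters` are illustrative names. It assumes `explain` prints through Python's stdout and that the plan uses the usual `PushedFilters`, `DataFilters`, and `PartitionFilters` labels (other runtimes, such as Photon, may label them differently).

```python
import contextlib
import io
import re


def capture_plan(df):
    """Return the text that df.explain(True) would print (hypothetical helper)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain(True)
    return buf.getvalue()


def scan_filters(plan_text):
    """Collect the filter lists found in the plan text, keyed by label.

    Assumes the lists contain no nested square brackets, which holds for
    simple equality filters like the one used here.
    """
    found = {}
    for key in ("PartitionFilters", "PushedFilters", "DataFilters"):
        found[key] = re.findall(rf"{key}: \[(.*?)\]", plan_text)
    return found


# Usage in the notebook (assumes the `spark` session and silver.events):
# plan = capture_plan(spark.sql(
#     "SELECT * FROM silver.events WHERE event_type = 'purchase'"))
# scan_filters(plan)
```

On a non-partitioned table the `PartitionFilters` entries would be expected to come back empty, while the pushed or data filters should mention `event_type`.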

In [0]:
df = spark.sql("""
SELECT *
FROM silver.events
WHERE event_type = 'purchase'
""")
df.explain(True)
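`explain(True)` prints several sections: the parsed, analyzed, and optimized logical plans, followed by the physical plan. When the output is long, it can help to split it so each stage can be read on its own. A minimal sketch, assuming the plan text has already been captured as a string and uses the standard `== ... ==` section headers; `split_plan_sections` is a hypothetical helper name, not a Spark API.

```python
import re


def split_plan_sections(plan_text):
    """Split explain(True) output into {section name: section body}."""
    sections = {}
    current = None
    for line in plan_text.splitlines():
        header = re.fullmatch(r"== (.+) ==", line.strip())
        if header:
            # A new section starts, e.g. "Physical Plan".
            current = header.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


# Usage (plan_text is whatever df.explain(True) printed, captured as a string):
# print(split_plan_sections(plan_text)["Physical Plan"])
```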


In [0]:
spark.sql("""
SELECT count(*)
FROM silver.events
WHERE event_type = 'purchase'
""").explain(True)


✔ Learned how to inspect logical & physical execution plans

✔ Verified predicate pushdown on a non-partitioned Delta table

✔ Observed full table scan behavior

✔ Identified clear justification for partitioning

# Choose Partition Columns

| Column | Reason |
|---|---|
| event_date | Time-based filtering (most common) |
| event_type | Highly selective (purchase, view, etc.) |
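Partitioning by two columns creates at most one directory per distinct combination of their values, so it is worth estimating that number before creating the table. The sketch below is plain arithmetic; `max_partitions` is a hypothetical helper and the counts are placeholder values, not measurements from `silver.events`.

```python
import math


def max_partitions(distinct_counts):
    """Upper bound on partition directories: product of per-column distinct counts."""
    return math.prod(distinct_counts.values())


# Placeholder counts for illustration only. The real ones could be measured with:
#   SELECT count(DISTINCT event_date), count(DISTINCT event_type) FROM silver.events
example_counts = {"event_date": 30, "event_type": 4}
print(max_partitions(example_counts))  # 120
```

If the bound is very large relative to the table size, each partition would hold little data, which usually argues for partitioning on fewer columns.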

In [0]:
%sql
CREATE TABLE IF NOT EXISTS silver.events_part
USING DELTA
PARTITIONED BY (event_date, event_type)
AS
SELECT *
FROM silver.events;


In [0]:
%sql
DESCRIBE DETAIL silver.events_part;


# Prove Partition Pruning
## Now rerun explain() on the partitioned table:

In [0]:
spark.sql("""
SELECT *
FROM silver.events_part
WHERE event_type = 'purchase'
  AND event_date = '2019-11-05'
""").explain(True)


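After partitioning by event_date and event_type, Spark applies partition pruning: the filters on both columns appear as partition filters in the physical plan, so only the matching partition is read instead of the whole table.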
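The pruning check can also be made programmatic. The sketch below looks for each partition column inside the `PartitionFilters` lists of a captured plan; `pruned_columns` is a hypothetical helper, and it assumes the plan text uses the `PartitionFilters: [...]` label with no nested square brackets.

```python
import re


def pruned_columns(plan_text, partition_columns):
    """Return the partition columns that appear in any PartitionFilters list."""
    filters = " ".join(re.findall(r"PartitionFilters: \[(.*?)\]", plan_text))
    return [
        col
        for col in partition_columns
        # \b keeps event_date from matching a longer name like event_date_utc.
        if re.search(rf"\b{re.escape(col)}\b", filters)
    ]


# Usage (plan_text is the captured output of .explain(True) on silver.events_part):
# pruned_columns(plan_text, ["event_date", "event_type"])
```

If both column names come back, the plan is filtering on both partition columns; an empty list suggests the query is not benefiting from pruning.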

# Apply OPTIMIZE with ZORDER
Why these columns?

user_id → frequent point lookups

product_id → common joins / aggregations

In [0]:
%sql
OPTIMIZE silver.events_part
ZORDER BY (user_id, product_id);


In [0]:
%sql
DESCRIBE HISTORY silver.events_part;


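# ZORDER Effect
OPTIMIZE with ZORDER rewrites the table's files so that rows with similar user_id and product_id values sit in the same files. This improves data locality and lets Delta skip files when a query filters on these high-cardinality columns.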
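To see why clustering enables data skipping, consider a simplified one-dimensional model: each file keeps the minimum and maximum of a column, and a point lookup only needs files whose range contains the value. The sketch below is a pure-Python illustration under that assumption. It is not how Delta implements Z-ordering (which interleaves several columns), and all names and values are made up.

```python
def file_stats(files):
    """Per-file (min, max) statistics, like the ones used for data skipping."""
    return [(min(f), max(f)) for f in files]


def files_to_read(stats, value):
    """Count files whose [min, max] range could contain the value."""
    return sum(1 for lo, hi in stats if lo <= value <= hi)


values = list(range(1000))  # stand-in for user_id values

# Unclustered: values are spread round-robin, so every file spans nearly 0..999.
unclustered = [values[i::10] for i in range(10)]

# Clustered: each file holds a contiguous range of 100 values.
clustered = [values[i * 100:(i + 1) * 100] for i in range(10)]

print(files_to_read(file_stats(unclustered), 345))  # 10
print(files_to_read(file_stats(clustered), 345))    # 1
```

With the same data and the same number of files, the clustered layout lets the lookup rule out nine of the ten files from statistics alone.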

In [0]:
spark.sql("""
SELECT *
FROM silver.events_part
WHERE user_id = 12345
""").explain(True)


# Benchmark Improvements

🎯 Objective

Prove that:

Partitioning + ZORDER improve performance

Caching helps repeated/iterative queries

We will compare before vs after.

# Before Partitioning
Expected:

Full scan

Slower execution

In [0]:
spark.sql("""
SELECT count(*)
FROM silver.events
WHERE user_id = 12345
  AND event_type = 'purchase'
""").explain(True)


In [0]:
import time

start = time.time()
spark.sql("""
SELECT *
FROM silver.events
WHERE user_id = 12345
  AND event_type = 'purchase'
""").count()

print(f"Unoptimized Time: {time.time() - start:.2f} seconds")


# After Partitioning
✅ Expect to see a lower execution time
(A lower time is consistent with partition pruning + data skipping taking effect)

In [0]:
spark.sql("""
SELECT count(*)
FROM silver.events_part
WHERE user_id = 12345
  AND event_type = 'purchase'
""").explain(True)


In [0]:
start = time.time()
spark.sql("""
SELECT *
FROM silver.events_part
WHERE event_type = 'purchase'
  AND user_id = 12345
""").count()

print(f"Optimized Time: {time.time() - start:.2f} seconds")
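A single timed run can be noisy, since the first execution may include warm-up costs and later ones may benefit from file-system caching. A slightly more robust comparison repeats each query a few times and looks at the median. The sketch below is one possible approach; `time_runs` is a hypothetical helper, and `time.perf_counter` is used in place of `time.time` because it is intended for measuring intervals.

```python
import statistics
import time


def time_runs(action, runs=3):
    """Run a zero-argument callable several times; return (median, all durations)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), durations


# Usage in the notebook (assumes the `spark` session and tables from above):
# query = ("SELECT * FROM silver.events_part "
#          "WHERE event_type = 'purchase' AND user_id = 12345")
# median, all_runs = time_runs(lambda: spark.sql(query).count())
# print(f"Median: {median:.2f} seconds over {len(all_runs)} runs")
```

Running the same helper against `silver.events` and `silver.events_part` gives two medians that can be compared directly.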


# Apply Caching
Caching is useful when:

Same dataset is queried repeatedly

Dashboards / iterative analytics

In [0]:
cached_df = spark.table("silver.events_part").cache()
cached_df.count()  # materialize cache
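Cached data occupies executor memory until it is released, so it is usually worth pairing `cache()` with `unpersist()` once the repeated queries are done. A minimal sketch of that pattern; `run_cached` is a hypothetical helper that works with any object exposing `cache()`, `count()`, and `unpersist()`, as a Spark DataFrame does.

```python
def run_cached(df, work):
    """Cache df, materialize it, run work(cached_df), then release the cache."""
    cached = df.cache()
    cached.count()  # materialize the cache before the repeated queries
    try:
        return work(cached)
    finally:
        # Release the cached data even if work() raises.
        cached.unpersist()


# Usage in the notebook (assumes the `spark` session and silver.events_part):
# purchases = run_cached(
#     spark.table("silver.events_part"),
#     lambda d: d.filter("event_type = 'purchase'").count(),
# )
```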
