### Day 10: Performance Optimization

This notebook covers: query execution plans, partitioning, OPTIMIZE + ZORDER, and benchmarking improvements on the e-commerce transactions dataset.


## Analyzing query execution plans

Start by inspecting the physical plan for a common aggregation query.


In [0]:
%sql
EXPLAIN FORMATTED
SELECT Product_Category, COUNT(*) AS total_purchases, ROUND(SUM(Purchase_Amount),2) AS revenue
FROM default.ecommerce_transactions
WHERE Transaction_Date >= '2025-01-01'
GROUP BY Product_Category
ORDER BY revenue DESC;


plan
"== Physical Plan == AdaptiveSparkPlan (7) +- == Initial Plan ==  ColumnarToRow (6)  +- PhotonResultStage (5)  +- PhotonSort (4)  +- PhotonGroupingAgg (3)  +- PhotonProject (2)  +- PhotonScan parquet workspace.default.ecommerce_transactions (1) (1) PhotonScan parquet workspace.default.ecommerce_transactions Output [3]: [Product_Category#15472, Purchase_Amount#15473, Transaction_Date#15475] DictionaryFilters: [(Transaction_Date#15475 >= 2025-01-01)] Location: PreparedDeltaFileIndex [s3://dbstorage-prod-ddsbc/uc/4c57a8ce-0d74-47d7-a48c-ab32ee4283dc/9397e82b-e750-426d-a9f7-45294d9825b9/__unitystorage/catalogs/67ba59bb-62e1-4fd3-9186-cab1ffe4d2cb/tables/c0b45f50-b853-4c38-ba3f-5b1c7b52bfa2] ReadSchema: struct RequiredDataFilters: [isnotnull(Transaction_Date#15475), (Transaction_Date#15475 >= 2025-01-01)] (2) PhotonProject Input [3]: [Product_Category#15472, Purchase_Amount#15473, Transaction_Date#15475] Arguments: [Product_Category#15472, Purchase_Amount#15473] (3) PhotonGroupingAgg Input [2]: [Product_Category#15472, Purchase_Amount#15473] Arguments: [Product_Category#15472], [count(1), sum(Purchase_Amount#15473)], [count(1)#15476L, sum(Purchase_Amount)#15478], [Product_Category#15472, count(1)#15476L AS total_purchases#15451L, round(sum(Purchase_Amount)#15478, 2) AS revenue#15452], true (4) PhotonSort Input [3]: [Product_Category#15472, total_purchases#15451L, revenue#15452] Arguments: [revenue#15452 DESC NULLS LAST] (5) PhotonResultStage Input [3]: [Product_Category#15472, total_purchases#15451L, revenue#15452] (6) ColumnarToRow Input [3]: [Product_Category#15472, total_purchases#15451L, revenue#15452] (7) AdaptiveSparkPlan Output [3]: [Product_Category#15472, total_purchases#15451L, revenue#15452] Arguments: isFinalPlan=false == Photon Explanation == The query is fully supported by Photon. == Optimizer Statistics (table names per statistics state) ==  missing = partial = full = ecommerce_transactions"


#### Plan for a selective filter

Next, check how Spark plans a query with a selective filter on `User_Name`.

In [0]:
%sql
EXPLAIN FORMATTED
SELECT *
FROM default.ecommerce_transactions
WHERE User_Name = 'Ava Clark';

plan
"== Physical Plan == * ColumnarToRow (3) +- PhotonResultStage (2)  +- PhotonScan parquet workspace.default.ecommerce_transactions (1) (1) PhotonScan parquet workspace.default.ecommerce_transactions Output [8]: [Transaction_ID#15510L, User_Name#15511, Age#15512L, Country#15513, Product_Category#15514, Purchase_Amount#15515, Payment_Method#15516, Transaction_Date#15517] DictionaryFilters: [(User_Name#15511 = Ava Clark)] Location: PreparedDeltaFileIndex [s3://dbstorage-prod-ddsbc/uc/4c57a8ce-0d74-47d7-a48c-ab32ee4283dc/9397e82b-e750-426d-a9f7-45294d9825b9/__unitystorage/catalogs/67ba59bb-62e1-4fd3-9186-cab1ffe4d2cb/tables/c0b45f50-b853-4c38-ba3f-5b1c7b52bfa2] ReadSchema: struct RequiredDataFilters: [isnotnull(User_Name#15511), (User_Name#15511 = Ava Clark)] (2) PhotonResultStage Input [8]: [Transaction_ID#15510L, User_Name#15511, Age#15512L, Country#15513, Product_Category#15514, Purchase_Amount#15515, Payment_Method#15516, Transaction_Date#15517] (3) ColumnarToRow [codegen id : 1] Input [8]: [Transaction_ID#15510L, User_Name#15511, Age#15512L, Country#15513, Product_Category#15514, Purchase_Amount#15515, Payment_Method#15516, Transaction_Date#15517] == Photon Explanation == The query is fully supported by Photon. == Optimizer Statistics (table names per statistics state) ==  missing = partial = full = ecommerce_transactions"


## Creating a partitioned Delta table

Create a new Delta table partitioned by `Transaction_Date` to enable **partition pruning** for date-based filters.

In [0]:
%sql
CREATE OR REPLACE TABLE default.ecommerce_txn_part
USING DELTA
PARTITIONED BY (Transaction_Date)
AS
SELECT *
FROM default.ecommerce_transactions;

num_affected_rows,num_inserted_rows
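To make the idea of partition pruning concrete, here is a minimal pure-Python sketch. It is not how Delta implements pruning; it only mimics the effect. The rows and the helper names (`partition_by`, `count_full_scan`, `count_pruned`) are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical toy rows; the real table has more columns and 50,000 rows.
rows = [
    {"Transaction_Date": "2025-01-01", "Purchase_Amount": 10.0},
    {"Transaction_Date": "2025-01-01", "Purchase_Amount": 25.5},
    {"Transaction_Date": "2025-01-02", "Purchase_Amount": 7.0},
    {"Transaction_Date": "2025-01-03", "Purchase_Amount": 42.0},
]

def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions

def count_full_scan(rows, date):
    """Unpartitioned layout: every row is inspected. Returns (matches, rows_scanned)."""
    matches = sum(1 for row in rows if row["Transaction_Date"] == date)
    return matches, len(rows)

def count_pruned(partitions, date):
    """Partitioned layout: only the matching partition is read. Returns (matches, rows_scanned)."""
    candidate = partitions.get(date, [])
    return len(candidate), len(candidate)

partitions = partition_by(rows, "Transaction_Date")
full = count_full_scan(rows, "2025-01-01")
pruned = count_pruned(partitions, "2025-01-01")
print("full scan:", full)   # (2, 4)
print("pruned:   ", pruned)  # (2, 2)
```

Both paths return the same count; the difference is how many rows had to be touched to get it.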


### Validating table metadata

Confirm the new table is **Delta** and check the `partitionColumns` in the output.

In [0]:
%sql
DESCRIBE DETAIL default.ecommerce_txn_part;


format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics,clusterByAuto
delta,fcff97aa-d4cc-4ccd-89bf-b488a373e832,workspace.default.ecommerce_txn_part,,,2026-01-18T13:48:24.605Z,2026-01-18T13:48:42.000Z,List(Transaction_Date),List(),731,2810107,"Map(delta.parquet.compression.codec -> zstd, delta.enableDeletionVectors -> true)",3,7,"List(appendOnly, deletionVectors, invariants)","Map(numRowsDeletedByDeletionVectors -> 0, numDeletionVectors -> 0)",False


### Row count check

Verify that the partitioned table has the same number of rows as the source table.


In [0]:
%sql
SELECT COUNT(*) FROM default.ecommerce_txn_part;


COUNT(*)
50000
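The `numFiles` and `sizeInBytes` values from `DESCRIBE DETAIL`, combined with the row count above, give a rough picture of the file layout. The sketch below only does the arithmetic; `file_layout_summary` is a hypothetical helper name.

```python
def file_layout_summary(num_files, size_in_bytes, row_count):
    """Return average file size and rows per file for a table."""
    return {
        "avg_file_bytes": size_in_bytes / num_files,
        "avg_rows_per_file": row_count / num_files,
    }

# Figures copied from the DESCRIBE DETAIL and COUNT(*) outputs above.
summary = file_layout_summary(num_files=731, size_in_bytes=2_810_107, row_count=50_000)
print(f"average file size: {summary['avg_file_bytes']:.0f} bytes")      # 3844 bytes
print(f"average rows per file: {summary['avg_rows_per_file']:.1f}")     # 68.4
```

Files of a few kilobytes are far smaller than Parquet readers are typically tuned for, so daily partitioning may be finer than a table of this size needs.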


## OPTIMIZE + ZORDER

Run file compaction and data clustering to improve query performance.
- `OPTIMIZE` compacts small files into fewer, larger ones.
- `ZORDER BY` co-locates rows with similar values in frequently filtered columns, which improves **data skipping**.
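Data skipping relies on per-file min/max statistics: a file can be skipped when the filter value falls outside its range. The sketch below illustrates why clustering helps, under a deliberate simplification: it sorts on a single hypothetical integer column instead of computing a real Z-order curve over several columns. The names `file_stats` and `files_to_read` are made up for this example.

```python
def file_stats(files):
    """Compute the (min, max) statistics a reader would keep for each file."""
    return [(min(f), max(f)) for f in files]

def files_to_read(stats, value):
    """Count files whose [min, max] range could contain `value`."""
    return sum(1 for lo, hi in stats if lo <= value <= hi)

values = list(range(1000))  # hypothetical column values
num_files = 10

# Unclustered: every file holds values spread across the whole range,
# so all min/max ranges overlap.
unclustered = [values[i::num_files] for i in range(num_files)]

# Clustered: contiguous chunks of the sorted values, so ranges do not overlap.
chunk = len(values) // num_files
clustered = [values[i * chunk:(i + 1) * chunk] for i in range(num_files)]

target = 250
unclustered_reads = files_to_read(file_stats(unclustered), target)
clustered_reads = files_to_read(file_stats(clustered), target)
print("unclustered files to read:", unclustered_reads)  # 10
print("clustered files to read:  ", clustered_reads)    # 1
```

With overlapping ranges no file can be ruled out; once similar values sit together, the same filter touches a single file.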

In [0]:
%sql
OPTIMIZE default.ecommerce_txn_part
ZORDER BY (User_Name, Product_Category);

path,metrics
,"List(0, 0, List(null, null, 0.0, 0, 0), List(null, null, 0.0, 0, 0), 731, List(minCubeSize(107374182400), List(0, 0), List(731, 2810107), 0, List(0, 0), 0, null), null, 0, 0, 731, 731, false, 0, 0, 1768744127318, 1768744130643, 8, 0, null, List(0, 0), null, 8, 8, 0, 0, null)"


## Benchmark 

Use `EXPLAIN FORMATTED` to inspect how the **original** table scans data for a date filter.


In [0]:
%sql
EXPLAIN FORMATTED
SELECT COUNT(*)
FROM default.ecommerce_transactions
WHERE Transaction_Date = '2025-01-01';


plan
"== Physical Plan == AdaptiveSparkPlan (6) +- == Initial Plan ==  ColumnarToRow (5)  +- PhotonResultStage (4)  +- PhotonAgg (3)  +- PhotonProject (2)  +- PhotonScan parquet workspace.default.ecommerce_transactions (1) (1) PhotonScan parquet workspace.default.ecommerce_transactions Output [1]: [Transaction_Date#16349] DictionaryFilters: [(Transaction_Date#16349 = 2025-01-01)] Location: PreparedDeltaFileIndex [s3://dbstorage-prod-ddsbc/uc/4c57a8ce-0d74-47d7-a48c-ab32ee4283dc/9397e82b-e750-426d-a9f7-45294d9825b9/__unitystorage/catalogs/67ba59bb-62e1-4fd3-9186-cab1ffe4d2cb/tables/c0b45f50-b853-4c38-ba3f-5b1c7b52bfa2] ReadSchema: struct RequiredDataFilters: [isnotnull(Transaction_Date#16349), (Transaction_Date#16349 = 2025-01-01)] (2) PhotonProject Input [1]: [Transaction_Date#16349] (3) PhotonAgg Input: [] Arguments: [count(1)], [count(1)#16350L], [count(1)#16350L AS COUNT(*)#16352L], true (4) PhotonResultStage Input [1]: [COUNT(*)#16352L] (5) ColumnarToRow Input [1]: [COUNT(*)#16352L] (6) AdaptiveSparkPlan Output [1]: [COUNT(*)#16352L] Arguments: isFinalPlan=false == Photon Explanation == The query is fully supported by Photon. == Optimizer Statistics (table names per statistics state) ==  missing = partial = full = ecommerce_transactions"


### Explain on partitioned table

Compare the plan for the same date filter on the **partitioned** Delta table.


In [0]:
%sql
EXPLAIN FORMATTED
SELECT COUNT(*)
FROM default.ecommerce_txn_part
WHERE Transaction_Date = '2025-01-01';

plan
== Physical Plan == LocalTableScan (1) (1) LocalTableScan Output [1]: [COUNT(*)#16393L] Arguments: [COUNT(*)#16393L] == Photon Explanation == Photon does not fully support the query because:  Unsupported node: LocalTableScan [COUNT(*)#16393L]. Reference node: 	LocalTableScan [COUNT(*)#16393L]
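The plan for the partitioned table collapses to a single `LocalTableScan`, which suggests the optimizer answered the count at planning time (most likely from Delta metadata) instead of scanning data files. A small helper can make that difference easy to spot when comparing plans. The sketch below is one possible approach, not a Databricks API: `plan_operators` is a hypothetical name, and the two sample strings are abbreviated from the plans above. In a notebook, the plan text could presumably be fetched from the single `plan` column that `EXPLAIN` returns.

```python
import re

# Matches the heading of each detail section, e.g. "(1) PhotonScan".
_OPERATOR_HEADING = re.compile(r"\((\d+)\) ([A-Z]\w+)")

def plan_operators(plan_text):
    """Return operator names in id order from EXPLAIN FORMATTED text."""
    found = {int(num): name for num, name in _OPERATOR_HEADING.findall(plan_text)}
    return [found[num] for num in sorted(found)]

# Abbreviated from the plans above; attribute ids and most detail lines are omitted.
original_plan = (
    "== Physical Plan == AdaptiveSparkPlan (6) +- == Initial Plan == ColumnarToRow (5) "
    "+- PhotonResultStage (4) +- PhotonAgg (3) +- PhotonProject (2) "
    "+- PhotonScan parquet workspace.default.ecommerce_transactions (1) "
    "(1) PhotonScan parquet workspace.default.ecommerce_transactions "
    "(2) PhotonProject (3) PhotonAgg (4) PhotonResultStage "
    "(5) ColumnarToRow (6) AdaptiveSparkPlan"
)
partitioned_plan = (
    "== Physical Plan == LocalTableScan (1) "
    "(1) LocalTableScan Output [1]: [COUNT(*)#16393L] Arguments: [COUNT(*)#16393L]"
)

original_ops = plan_operators(original_plan)
partitioned_ops = plan_operators(partitioned_plan)
print(original_ops)
print(partitioned_ops)
print("original scans files:   ", "PhotonScan" in original_ops)
print("partitioned scans files:", "PhotonScan" in partitioned_ops)
```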


#### Runtime benchmark on both tables (original and partitioned)


Run the same date-filter query on both tables and compare the runtimes.

In [0]:
%sql
SELECT COUNT(*)
FROM default.ecommerce_transactions
WHERE Transaction_Date = '2025-01-01';


COUNT(*)
77


In [0]:
%sql
SELECT COUNT(*)
FROM default.ecommerce_txn_part
WHERE Transaction_Date = '2025-01-01';


COUNT(*)
77
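Both cells return the same count, but the output does not record how long each query took. One way to capture comparable numbers is a small timing harness that runs a query several times and reports the median. The sketch below is an assumption about how that could be done, not part of the original notebook: `time_runs` is a hypothetical helper, and a stand-in workload replaces the real query so the code runs anywhere.

```python
import statistics
import time

def time_runs(run_query, repeats=5):
    """Call run_query() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in workload so the sketch runs without a cluster. In the notebook this
# could be replaced by something like `lambda: spark.sql(query).collect()`,
# assuming the usual `spark` session is available.
def stand_in_query():
    return sum(range(100_000))

durations = time_runs(stand_in_query, repeats=5)
print(f"median: {statistics.median(durations):.6f} s over {len(durations)} runs")
```

The median is less sensitive than the mean to a slow first run. Later runs may also be served from cache, so it can be worth looking at the first duration separately.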


#### Runtime benchmark (Selective filters)

Run a selective filter on both tables (**original** and **partitioned + ZORDER**) using a specific `User_Name` and `Product_Category`, then compare the runtimes.


In [0]:
%sql
SELECT COUNT(*)
FROM default.ecommerce_transactions
WHERE User_Name = 'Ava Clark'
  AND Product_Category = 'Clothing';


COUNT(*)
72


In [0]:
%sql
SELECT COUNT(*)
FROM default.ecommerce_txn_part
WHERE User_Name = 'Ava Clark'
  AND Product_Category = 'Clothing';


COUNT(*)
72
