# Performance Optimization

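Create a second version of the events table, partitioned by event_type, so the same queries can be compared against both versions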

In [0]:
(
  spark.table("ecom.bronze.events")
  .orderBy("brand", "category")
  .write
  .mode("overwrite")
  .partitionBy("event_type")
  .saveAsTable("ecom.bronze.events_partition")
)

Without partitioning, the filter on event_type cannot skip anything, so the query reads all files

In [0]:
%sql
SELECT *
FROM ecom.bronze.events
WHERE event_type = 'purchase' AND brand IS NOT NULL AND category IS NOT NULL


After partitioning, about 90% of the data is pruned from the read, because only the event_type = 'purchase' partition is scanned

In [0]:
%sql
SELECT *
FROM ecom.bronze.events_partition
WHERE event_type = 'purchase' AND brand IS NOT NULL AND category IS NOT NULL
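
The pruning seen above can be illustrated with a toy, pure-Python model. This is only a sketch: the file names, the file counts per partition, and the `prune` helper are invented for illustration and do not reflect the real layout of ecom.bronze.events_partition or how Spark implements pruning internally.

```python
# Toy model of partition pruning (illustrative only, not Spark internals).
# A partitioned table keeps each partition value in its own directory,
# so a filter on the partition column can skip whole directories.

# Hypothetical layout: partition directory -> data files it contains.
layout = {
    "event_type=view": ["part-000", "part-001", "part-002", "part-003"],
    "event_type=cart": ["part-004"],
    "event_type=purchase": ["part-005"],
}


def prune(layout, column, value):
    """Return only the files stored under the matching partition directory."""
    wanted = f"{column}={value}"
    selected = []
    for directory, files in layout.items():
        if directory == wanted:
            selected.extend(files)
    return selected


total_files = sum(len(files) for files in layout.values())
files_to_read = prune(layout, "event_type", "purchase")
skipped = 1 - len(files_to_read) / total_files

print(f"read {len(files_to_read)} of {total_files} files, skipped {skipped:.0%}")
```

The share that gets skipped depends entirely on how the rows are distributed across partition values, which is why a low-cardinality column that queries filter on often, such as event_type, is a reasonable partition key.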

Before Z-ordering, a lookup on user_id reads 38+ files

In [0]:
%sql
SELECT * FROM ecom.bronze.events
WHERE user_id = '460307564'

Z-Ordering on a High-Cardinality Column (user_id)

In [0]:
%sql
OPTIMIZE ecom.bronze.events
ZORDER BY (user_id)

After Z-ordering, the same query reads only 1 file (98% filtered out)

In [0]:
%sql
SELECT * FROM ecom.bronze.events
WHERE user_id = '460307564'
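
Why clustering on user_id helps can be sketched with a toy model of file-level min/max statistics. The assumptions are loud ones: the IDs, the file size, and the helpers `file_stats` and `files_to_scan` are hypothetical, and a plain sort stands in for Z-ordering (with a single column the effect is similar; with several columns Z-ordering interleaves them instead of sorting by one).

```python
import random

# Toy model of data skipping (illustrative only, not Delta Lake internals).
# Each "file" records the min and max of a column; a point lookup only needs
# to open files whose [min, max] range could contain the value.


def file_stats(values, rows_per_file):
    """Split values into fixed-size 'files' and return (min, max) per file."""
    files = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return [(min(f), max(f)) for f in files]


def files_to_scan(stats, target):
    """Count files whose min/max range could contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


random.seed(0)
user_ids = list(range(10_000))  # hypothetical user IDs
random.shuffle(user_ids)        # arrival order unrelated to user_id

target = 4_603
unclustered = file_stats(user_ids, rows_per_file=250)
clustered = file_stats(sorted(user_ids), rows_per_file=250)

print("before clustering:", files_to_scan(unclustered, target), "of", len(unclustered), "files")
print("after clustering: ", files_to_scan(clustered, target), "of", len(clustered), "files")
```

When rows arrive in an order unrelated to user_id, almost every file spans nearly the whole ID range, so the statistics cannot rule files out. Once the data is clustered, each file covers a narrow range and the lookup touches a single file.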

Difference when we cache a DataFrame (could not be demonstrated here, since caching requires classic compute, which is not available in the free version)

In [0]:
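# Cached DataFrame: cache() is lazy, so nothing is stored until an action runs
df = spark.read.load("/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv", format="csv", header=True, inferSchema=True)
df.cache()
df.count()  # action that materializes the cache, so the later timing measures a cached read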

In [0]:
# Uncached DataFrame over the same file, used as the baseline
df2 = spark.read.load("/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv", format="csv", header=True, inferSchema=True)

In [0]:
import time
start = time.time()
df.select("brand", "category").distinct().count()
end = time.time()
print(end - start)

In [0]:
import time
start = time.time()
df2.select("brand", "category").distinct().count()
end = time.time()
print(end - start)
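
A single timing can be misleading, because the first action on a cached DataFrame also pays for filling the cache. One possible way to make the comparison fairer is to repeat each action a few times. The sketch below assumes a hypothetical helper named `time_action`, and uses a stand-in workload so it runs without a Spark session.

```python
import time


def time_action(action, repeats=3):
    """Run a zero-argument callable `repeats` times; return elapsed seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload so the sketch runs anywhere. In the notebook the callable
# could instead be: lambda: df.select("brand", "category").distinct().count()
timings = time_action(lambda: sum(range(100_000)))
print([f"{t:.4f}s" for t in timings])
```

Comparing the later runs of df against the later runs of df2 separates the one-time cost of caching from the steady-state benefit.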