%md
## Table Optimization and Clustering (Performance Tuning)

This step improves **query performance** and **storage efficiency** across all silver tables using Delta Lake's physical optimization features on Databricks: `OPTIMIZE`, `ZORDER BY`, table statistics, and automatic maintenance properties.  
Together, these operations speed up reads and joins and keep file counts and metadata overhead low as the tables grow.

---

### What This Code Does

1. **OPTIMIZE**  
   - Compacts many small Parquet files into larger ones (typically 256 MB–1 GB).  
   - Reduces file overhead and metadata scanning.  
   - Improves query and join performance by minimizing I/O operations.  

2. **ZORDER BY**  
   - Rewrites data so that rows with similar key values (e.g., `user_sk`, `product_sk`) are co-located in the same files.  
   - Enables **data skipping**: Delta uses per-file min/max statistics to skip files that cannot match a query's filter.  
   - Works best on high-cardinality columns that appear often in filters and join conditions, such as `user_sk` or `product_sk`.  

3. **ANALYZE TABLE … COMPUTE STATISTICS**  
   - Refreshes table-level and column-level statistics used by the query optimizer.  
   - Helps the optimizer choose better execution plans (for example, join order and join strategy) based on the actual data distribution.  

4. **ALTER TABLE … SET TBLPROPERTIES**  
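   - Enables Delta's **auto-optimize** features, optimized writes and auto-compaction:  
     - Optimized writes rebalance data before it is written, so each write produces fewer, larger files.  
     - Auto-compaction merges small files after a write completes, reducing the need for manual `OPTIMIZE` runs.  
     - Auto-compaction does not apply Z-ordering, so a periodic `OPTIMIZE … ZORDER BY` is still needed to maintain clustering.  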
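
A minimal sketch of how the properties in item 4 might be set. It assumes the fact table mentioned in this notebook lives at `instacart.silver.fact_user_product_behavior`; the catalog and schema are inferred from the `dim_user` cell rather than stated, so adjust the name to match your workspace.

```sql
-- Sketch only: the fully qualified table name is an assumption.
ALTER TABLE instacart.silver.fact_user_product_behavior
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',  -- write fewer, larger files
  'delta.autoOptimize.autoCompact'   = 'true'   -- compact small files after writes
);

-- Confirm that the properties were applied.
SHOW TBLPROPERTIES instacart.silver.fact_user_product_behavior;
```

Setting these as table properties, rather than as session configuration, means they apply to every writer of the table.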

---

### Why This Is Useful

- **Faster query performance:** Reduces latency for dashboards, notebooks, and ad-hoc analytics.  
- **Lower compute cost:** Minimizes the number of files and data scanned during queries.  
- **Ongoing efficiency:** Keeps large, frequently updated tables (e.g., `fact_user_product_behavior`) optimized automatically.  
- **Scalability:** Ensures consistent performance even as data volume and query concurrency increase.  
- **Maintainability:** Reduces the need for manual tuning while preserving predictable system behavior.

---

**Result:**  
All silver tables remain **physically optimized**, with **faster joins, lower cost, and stable performance** as the dataset scales over time.
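
One way to check the effect on a given table, sketched here for `instacart.silver.dim_user`: `DESCRIBE DETAIL` reports the current file count and total size, and `DESCRIBE HISTORY` lists recent operations such as `OPTIMIZE` along with their metrics. What counts as a healthy value depends on table size, so treat this as a quick inspection rather than a pass/fail test.

```sql
-- Current physical layout: see the numFiles and sizeInBytes columns.
DESCRIBE DETAIL instacart.silver.dim_user;

-- Recent operations; OPTIMIZE rows include operationMetrics
-- such as the number of files added and removed.
DESCRIBE HISTORY instacart.silver.dim_user LIMIT 5;
```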


In [0]:
%sql
USE CATALOG instacart;

In [0]:
%sql
-- Recommended: run on a Photon cluster

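-- 1) dim_user: compact small files and Z-order by user_id for fast lookups
OPTIMIZE instacart.silver.dim_user
ZORDER BY (user_id);

-- Refresh column statistics after the rewrite so the optimizer sees current data
ANALYZE TABLE instacart.silver.dim_user COMPUTE STATISTICS FOR ALL COLUMNS;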
