# Autonomous Performance Tuning: Partitioning vs. Clustering

## 1. The Challenge: Static Partitioning Limitations
Traditional Hive-style partitioning (e.g., partitioning by `date`) can become a bottleneck as data volume and access patterns evolve.
*   **Data Skew:** Daily partitions may vary significantly in size, leading to inefficient resource utilization.
*   **Maintenance Overhead:** Changing the partition key requires rewriting the entire table (a costly full overwrite).
*   **Complexity:** Optimizing for multiple access patterns often involves complex, manual Z-Ordering routines.

## 2. The Solution: Delta Liquid Clustering
We are transitioning towards **Delta Liquid Clustering** to enable autonomous, self-managing data layouts.
*   **Self-Managing:** Delta Lake clusters data by the configured clustering keys (chosen from observed access patterns) without rigid physical directories.
*   **Flexible:** Clustering keys can be changed with `ALTER TABLE ... CLUSTER BY`; existing files are not rewritten at the time of the change.
*   **Skew Resilient:** Automatically handles unequal data distribution.
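
As a rough illustration of what redefining clustering keys involves, the sketch below builds the SQL statements as plain strings. The helper name `build_clustering_statements` and the example column names are hypothetical; the sketch assumes a runtime that supports liquid clustering and its `ALTER TABLE ... CLUSTER BY` and `OPTIMIZE` commands.

```python
def build_clustering_statements(table_name, cluster_columns):
    """Return SQL strings that set clustering keys and then trigger clustering.

    Hypothetical helper for illustration only: it builds strings and never
    touches Spark. Column names are wrapped in backticks as a basic precaution.
    """
    if not cluster_columns:
        # CLUSTER BY NONE removes the clustering keys from the table.
        return [f"ALTER TABLE {table_name} CLUSTER BY NONE"]
    quoted = ", ".join(f"`{c}`" for c in cluster_columns)
    return [
        # Changes the keys used for future clustering.
        f"ALTER TABLE {table_name} CLUSTER BY ({quoted})",
        # Clusters data according to the current keys.
        f"OPTIMIZE {table_name}",
    ]


statements = build_clustering_statements("silver_audit", ["symbol", "date"])
for stmt in statements:
    print(stmt)  # in a notebook, each string could be passed to spark.sql(stmt)
```

Keeping statement construction separate from execution makes it possible to review the proposed change before anything runs against the table.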

## 3. Auto-Discovery Logic
This script serves as a "Tuning Advisor." It:
1.  **Analyzes Query History:** Queries `system.query.history` to identify the table columns referenced most often.
2.  **Evaluates Cardinality:** Counts distinct values for each candidate column.
3.  **Recommends Configuration:** Suggests clustering (`CLUSTER BY`; the script reports this as Z-Order) vs. partitioning (`PARTITIONED BY`) based on each column's cardinality.

In [0]:
from pyspark.sql.functions import col, count, lower, explode, split

# --- CONFIGURATION ---
TARGET_TABLE = "silver_audit"

class TableAutoTuner:
    def __init__(self, spark, table_name):
        self.spark = spark
        self.table_name = table_name
        self.valid_columns = []
        self.candidates = []

    def _get_valid_columns(self):
        """
        Fetches the actual schema of the table to validate extracted text.
        """
        print(f"[INFO] Fetching schema for {self.table_name}...")
        try:
            # Get list of actual column names (e.g., ['symbol', 'date', 'eps'])
            self.valid_columns = self.spark.read.table(self.table_name).columns
            print(f"[INFO] Valid Schema Columns: {self.valid_columns}")
        except Exception as e:
            print(f"[ERROR] Could not read table schema: {str(e)}")

    def _extract_columns_from_history(self):
        """
        Parses system history and finds which valid columns are used most often.
        """
        print(f"[INFO] Mining query history for {self.table_name}...")
        
        # 1. Fetch raw SQL statements that mention this table
        # (substring match, so similarly named tables will also match)
        history_df = self.spark.sql(f"""
            SELECT statement_text 
            FROM system.query.history 
            WHERE statement_text LIKE '%{self.table_name}%'
              AND statement_type = 'SELECT'
              AND start_time > date_add(now(), -30)
        """)

        if history_df.limit(1).count() == 0:
            print("[WARN] No history found. Cannot optimize.")
            return

        # 2. Tokenize the SQL text
        # We split by non-alphanumeric characters to isolate words
        words_df = (history_df
                    .select(explode(split(lower(col("statement_text")), "[^a-z0-9_]")).alias("word"))
                    .filter(col("word") != "")
                   )

        # 3. Keep only tokens that are actual table columns.
        # This prevents keywords such as "WHERE" or "SELECT" from being counted,
        # and whole-token matching keeps 'company_symbol' from matching 'symbol'.
        valid_cols_lower = [c.lower() for c in self.valid_columns]

        matched_cols_df = (words_df
                           .filter(col("word").isin(valid_cols_lower))
                           .groupBy("word")
                           .agg(count("*").alias("frequency"))
                           .sort(col("frequency").desc())
                           .limit(5) # Top 5 used columns
                          )

        print("[INFO] Top detected columns from history:")
        matched_cols_df.show()
        
        # Convert to list for further processing
        self.candidates = [row["word"] for row in matched_cols_df.collect()]

    def _analyze_cardinality_and_recommend(self):
        """
        Recommends partitioning or Z-Order for each candidate column based on its distinct-value count.
        """
        if not self.candidates:
            print("[INFO] No valid usage patterns found.")
            return

        print("\n--- Optimization Recommendations ---")
        
        for col_name in self.candidates:
            try:
                # Check how many unique values exist
                distinct_count = self.spark.table(self.table_name).select(col_name).distinct().count()
                
                print(f"Column: '{col_name}' | Distinct Values: {distinct_count}")
                
                # LOGIC MATRIX
                if distinct_count > 10000:
                    print(f"  -> RECOMMENDATION: **Z-ORDER** (High Cardinality).")
                    print(f"     Why: Too many unique values for partitioning. Z-Order will skip data effectively.")
                    
                elif 50 < distinct_count <= 10000:
                    print(f"  -> RECOMMENDATION: **Z-ORDER** (Medium Cardinality).")
                    print(f"     Why: Safe for Z-Order. Risky for partitioning (potential small files).")
                    
                else:  # <= 50
                    print(f"  -> RECOMMENDATION: **PARTITION** (Low Cardinality).")
                    print(f"     Why: Few unique values. Good candidate for physical folder separation.")
                    
                print("-" * 50)

            except Exception as e:
                print(f"[ERROR] Failed to analyze {col_name}: {e}")

    def run(self):
        print(f"--- Auto-Discovery Tuner for {self.table_name} ---")
        self._get_valid_columns()
        self._extract_columns_from_history()
        self._analyze_cardinality_and_recommend()

# --- EXECUTION ---
tuner = TableAutoTuner(spark, TARGET_TABLE)
tuner.run()
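
The tokenize-and-match step and the cardinality thresholds can also be checked without a Spark session. The following is a minimal pure-Python sketch under stated assumptions: the function names `count_column_mentions` and `classify`, the sample SQL strings, and the table `other` are hypothetical, while the column names follow the example in the schema comment above.

```python
import re
from collections import Counter


def count_column_mentions(statements, valid_columns, top_n=5):
    """Count whole-token mentions of known columns across SQL strings."""
    valid = {c.lower() for c in valid_columns}
    counts = Counter()
    for sql in statements:
        # Same split rule as the Spark version: any character outside
        # [a-z0-9_] separates tokens, so 'company_symbol' stays one token.
        for token in re.split(r"[^a-z0-9_]", sql.lower()):
            if token in valid:
                counts[token] += 1
    return counts.most_common(top_n)


def classify(distinct_count):
    """Mirror the tuner's thresholds: more than 50 distinct values means Z-ORDER."""
    return "Z-ORDER" if distinct_count > 50 else "PARTITION"


sample_sql = [
    "SELECT symbol, eps FROM silver_audit WHERE date = '2024-01-01'",
    "SELECT company_symbol FROM other WHERE symbol = 'ABC' AND date > '2024-01-01'",
]
print(count_column_mentions(sample_sql, ["symbol", "date", "eps"]))
print(classify(12), classify(5000))
```

Note that token counting treats every mention of a column equally, whether it appears in a `SELECT` list or a `WHERE` clause, so the frequencies are a rough proxy for filter usage rather than an exact measure.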