COMPUTE STORAGE METRICS is not available on our cluster

# Data Lake Health Audit & Maintenance Logic

## 1. Pipeline Parameters
The script accepts dynamic inputs from the orchestration pipeline (e.g., Azure Data Factory, Databricks Jobs) to control data retention safety.

| Parameter Name | Default Value | Description |
| :--- | :--- | :--- |
| `vacuum_retention_days` | **14** | The safety window for history retention, in days. Unreferenced files older than this are permanently removed by `VACUUM`. |
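
A minimal sketch of how the parameter could be validated before any cleanup runs. The helper `parse_retention_days` and the constant `MIN_RETENTION_DAYS` are illustrative names, not part of the script, and the 7-day floor assumes Delta Lake's default retention check is left enabled.

```python
MIN_RETENTION_DAYS = 7  # assumption: mirrors Delta Lake's default 7-day safety check


def parse_retention_days(raw_value: str, minimum: int = MIN_RETENTION_DAYS) -> int:
    """Convert a widget string to an integer day count, rejecting unsafe values."""
    try:
        days = int(raw_value.strip())
    except ValueError as exc:
        raise ValueError(
            f"'vacuum_retention_days' must be an integer, got {raw_value!r}."
        ) from exc
    if days < minimum:
        raise ValueError(
            f"'vacuum_retention_days' must be at least {minimum}, got {days}."
        )
    return days


# Widgets always return strings, so the default "14" is parsed the same way.
print(parse_retention_days("14"))
```

Failing fast here would surface a bad pipeline input before the audit starts, rather than as a `VACUUM` error partway through the run.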

## 2. Standards & Thresholds
We define specific metrics to categorize table health and trigger maintenance actions.

| Metric | Threshold | Standard | Rationale |
| :--- | :--- | :--- | :--- |
| **Small Files** | `< 128 MB` | **File Efficiency** | An average file size below 128 MB signals the "small file problem," which slows read operations and increases metadata overhead. |
| **Zombie Bloat** | `≥ 10%` | **Storage Efficiency** | If inactive (deleted/overwritten) data makes up 10% or more of total physical storage, the table is wasting storage and cost. |
| **Safety Limit** | `7 Days (Min)` | **Data Safety** | Delta Lake prevents `VACUUM` retention below 7 days by default to protect against data corruption during concurrent writes/reads. |

## 3. Analysis Process
The `TableHealthAuditor` performs a two-step analysis for every table in the scope:

1.  **Logical Analysis (Metadata):**
    * Runs `DESCRIBE DETAIL` to query the Delta Transaction Log.
    * Retrieves the number of *active* files and *active* data size.
    * Confirms the table is accessible and identifies partitioning columns.

2.  **Physical Analysis (Storage Scan):**
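    * Walks the table's underlying S3 path with `dbutils.fs.ls`, descending into every subdirectory.
    * Calculates the *total* physical size and *total* count of all objects in the directory.
    * **Calculation:** `Inactive Data = Total Physical Size - Active Logical Size`. Because the scan covers the whole table directory, the inactive figure also includes the `_delta_log` transaction log files.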
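
The arithmetic behind these metrics can be sketched in plain Python. The function name `derive_storage_metrics` and the sample figures are made up for illustration; only the formulas follow the process described above.

```python
BYTES_PER_MB = 1024 * 1024


def derive_storage_metrics(active_bytes: int, active_files: int, physical_bytes: int) -> dict:
    """Derive inactive size, waste percentage, and average active file size."""
    # Clamp at zero in case the listing and the transaction log briefly disagree.
    inactive_bytes = max(physical_bytes - active_bytes, 0)
    waste_pct = inactive_bytes / physical_bytes * 100 if physical_bytes > 0 else 0.0
    avg_file_mb = active_bytes / active_files / BYTES_PER_MB if active_files > 0 else 0.0
    return {
        "inactive_mb": round(inactive_bytes / BYTES_PER_MB, 2),
        "waste_pct": round(waste_pct, 2),
        "avg_file_mb": round(avg_file_mb, 2),
    }


# Made-up figures: 40 active files totalling 2 GiB inside a 2.5 GiB directory.
metrics = derive_storage_metrics(
    active_bytes=2 * 1024**3,
    active_files=40,
    physical_bytes=int(2.5 * 1024**3),
)
print(metrics)
```

With these inputs the table carries 512 MB of inactive data (20% of its physical footprint) and averages about 51 MB per active file, so it would cross both thresholds from Section 2.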

## 4. Maintenance Decisions & Conditions
Based on the derived metrics, the script automatically executes the following commands:

### Decision A: Run OPTIMIZE
* **Condition:** Average File Size is **> 0 MB** AND **< 128 MB**.
* **Action:** Executes `OPTIMIZE table_name`.
* **Result:** Small files are compacted into larger files (target ~1GB) to improve read performance.
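
The condition can be expressed as a small predicate. This is a sketch only: `should_optimize` is a hypothetical helper, and the constant repeats the 128 MB threshold from Section 2.

```python
SMALL_FILE_THRESHOLD_MB = 128.0  # threshold from Section 2


def should_optimize(avg_file_size_mb: float) -> bool:
    """True only for non-empty tables whose average file is under the threshold."""
    # The lower bound excludes empty tables, whose average file size is 0.
    return 0 < avg_file_size_mb < SMALL_FILE_THRESHOLD_MB


for size in (0.0, 51.2, 128.0, 900.0):
    print(f"{size:>6.1f} MB -> OPTIMIZE: {should_optimize(size)}")
```

Only the 51.2 MB case qualifies: an empty table has nothing to compact, and a table averaging exactly 128 MB already meets the standard.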

### Decision B: Run VACUUM
* **Condition:** Zombie/Bloat Percentage is **≥ 10%**.
* **Action:** Executes `VACUUM table_name RETAIN {retention_days * 24} HOURS`. Delta Lake's `RETAIN` clause is expressed in hours, so the day count is converted first.
* **Result:** Physically deletes files that are no longer referenced by the Delta log and are older than the retention period.
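
A hedged sketch of how the `VACUUM` decision and statement could be built. `build_vacuum_statement` is an illustrative helper, not the script's own method, and it expresses retention in hours, the unit Delta Lake's `RETAIN` clause accepts.

```python
from typing import Optional

ZOMBIE_BLOAT_THRESHOLD_PCT = 10.0  # threshold from Section 2


def build_vacuum_statement(table_name: str, retention_days: int, bloat_pct: float) -> Optional[str]:
    """Return the VACUUM statement to run, or None when waste is within limits."""
    if bloat_pct < ZOMBIE_BLOAT_THRESHOLD_PCT:
        return None
    retention_hours = retention_days * 24  # convert the day-based parameter to hours
    return f"VACUUM {table_name} RETAIN {retention_hours} HOURS"


print(build_vacuum_statement("silver_audit", 14, 20.0))
print(build_vacuum_statement("bronze_audit", 14, 4.5))
```

Returning the statement instead of executing it keeps the decision testable without a Spark session; the caller would pass a non-`None` result to `spark.sql`.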

## 5. Limitations
* **Recursive Listing Cost:** The physical S3 scan (`_get_physical_stats`) issues one `dbutils.fs.ls` call per directory and tallies every object it returns. For tables with millions of files or many partitions, this step can be slow and may hit S3 API rate limits.
* **Vacuum Safety:** The script respects the retention parameter strictly. If high storage waste exists but files are *newer* than the retention period (e.g., created yesterday), `VACUUM` will **not** delete them, ensuring time-travel capability is preserved.
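
One possible way to bound the listing cost is to cap the number of files counted. The sketch below is an assumption rather than part of the script: `bounded_scan`, `Entry`, and `max_files` are hypothetical names, and `list_dir` stands in for a thin wrapper around `dbutils.fs.ls` so that the logic can run outside a cluster.

```python
from collections import namedtuple
from typing import Callable, List, Tuple

# Stand-in for one directory-listing entry; a real wrapper would map listing results to it.
Entry = namedtuple("Entry", ["path", "size", "is_dir"])


def bounded_scan(
    root: str, list_dir: Callable[[str], List[Entry]], max_files: int
) -> Tuple[int, int, bool]:
    """Sum file sizes under root, stopping once max_files files have been counted.

    Returns (total_bytes, total_files, cap_reached).
    """
    total_bytes = 0
    total_files = 0
    stack = [root]  # explicit stack, so deep trees cannot exhaust the recursion limit
    while stack:
        for entry in list_dir(stack.pop()):
            if entry.is_dir:
                stack.append(entry.path)
                continue
            total_bytes += entry.size
            total_files += 1
            if total_files >= max_files:
                return total_bytes, total_files, True
    return total_bytes, total_files, False


# In-memory directory tree used only to exercise the function.
FAKE_TREE = {
    "/tbl": [Entry("/tbl/part=1", 0, True), Entry("/tbl/a.parquet", 100, False)],
    "/tbl/part=1": [
        Entry("/tbl/part=1/b.parquet", 200, False),
        Entry("/tbl/part=1/c.parquet", 300, False),
    ],
}

print(bounded_scan("/tbl", FAKE_TREE.__getitem__, max_files=10))
print(bounded_scan("/tbl", FAKE_TREE.__getitem__, max_files=2))
```

When the cap is reached, the totals are only a lower bound, so a report built on them should mark the table as partially scanned rather than compute a waste percentage.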

In [0]:
from pyspark.sql.utils import AnalysisException

# 1. Pipeline Parameters
# Initialize widget for retention period
dbutils.widgets.text("vacuum_retention_days", "14", "Vacuum Retention (Days)")

# Validate and retrieve retention period
try:
    RETENTION_DAYS = int(dbutils.widgets.get("vacuum_retention_days"))
except ValueError:
    raise ValueError("The 'vacuum_retention_days' parameter must be an integer.")

# Define the scope of the audit
TABLE_SCOPE = ["bronze_audit", "silver_audit", "quarantine_audit"]

class TableHealthAuditor:
    # Maintenance Thresholds
    SMALL_FILE_THRESHOLD_MB = 128.0
    ZOMBIE_BLOAT_THRESHOLD_PCT = 10.0

    def __init__(self, spark, dbutils, retention_days):
        self.spark = spark
        self.dbutils = dbutils
        self.retention_days = retention_days
        self.results = []

    def _get_logical_stats(self, table_name):
        """Retrieves active metadata using DESCRIBE DETAIL."""
        try:
            df = self.spark.sql(f"DESCRIBE DETAIL {table_name}")
            stats = df.select("location", "numFiles", "sizeInBytes", "partitionColumns").collect()[0]
            return {
                "path": stats["location"],
                "active_files": stats["numFiles"],
                "active_bytes": stats["sizeInBytes"],
                "partitions": stats["partitionColumns"],
                "status": "ACCESSIBLE"
            }
        except AnalysisException as e:
            return {"status": "MISSING", "error": str(e)}

    def _get_physical_stats(self, path):
        """Recursively scans S3 to find the actual storage footprint."""
        total_size = 0
        total_files = 0
        
        # Iterative stack to avoid recursion depth limits
        stack = [path]
        
        try:
            while stack:
                current_path = stack.pop()
                items = self.dbutils.fs.ls(current_path)
                
                for item in items:
                    # isDir is a method on dbutils FileInfo; the bare attribute is always truthy.
                    if item.isDir():
                        stack.append(item.path)
                    else:
                        total_size += item.size
                        total_files += 1
        except Exception as e:
            # Handle permissions or path not found
            return 0, 0, str(e)
            
        return total_size, total_files, None

    def _perform_maintenance(self, table_name, avg_file_size_mb, bloat_pct):
        """Executes OPTIMIZE or VACUUM based on calculated metrics."""
        print(f"[INFO] Evaluating maintenance for {table_name}...")

        # Condition 1: OPTIMIZE (Small Files)
        if 0 < avg_file_size_mb < self.SMALL_FILE_THRESHOLD_MB:
            print(f"[WARNING] Average file size ({avg_file_size_mb:.2f} MB) is below threshold. Executing OPTIMIZE...")
            try:
                self.spark.sql(f"OPTIMIZE {table_name}")
                print(f"[INFO] OPTIMIZE command completed for {table_name}.")
            except Exception as e:
                print(f"[ERROR] Failed to OPTIMIZE {table_name}: {str(e)}")
        else:
            print(f"[INFO] File size is healthy. OPTIMIZE skipped.")

        # Condition 2: VACUUM (Storage Bloat)
        if bloat_pct >= self.ZOMBIE_BLOAT_THRESHOLD_PCT:
            print(f"[WARNING] Storage waste ({bloat_pct:.2f}%) exceeds limit. Executing VACUUM with {self.retention_days} days retention...")
            try:
                # Delta Lake's VACUUM takes its retention window in hours, so convert days first.
                self.spark.sql(f"VACUUM {table_name} RETAIN {self.retention_days * 24} HOURS")
                print(f"[INFO] VACUUM command completed for {table_name}.")
            except Exception as e:
                print(f"[ERROR] Failed to VACUUM {table_name}: {str(e)}")
        else:
            print(f"[INFO] Storage waste is within limits. VACUUM skipped.")

    def analyze_table(self, table_name):
        print(f"Analyzing {table_name}...")
        
        # 1. Logical Stats (Delta Metadata)
        logical = self._get_logical_stats(table_name)
        if logical["status"] != "ACCESSIBLE":
            print(f"[ERROR] Could not analyze {table_name}: {logical.get('error')}")
            return

        # 2. Physical Stats (S3 Scan)
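        phy_bytes, phy_files, phy_err = self._get_physical_stats(logical["path"])
        if phy_err:
            # Without a physical size the waste figures would be meaningless.
            print(f"[ERROR] Could not scan storage for {table_name}: {phy_err}")
            return

        # 3. Calculate Derived Metrics
        active_bytes = logical["active_bytes"]
        # Clamp at zero so a listing that lags the log never reports negative waste.
        inactive_bytes = max(phy_bytes - active_bytes, 0)
        bloat_pct = (inactive_bytes / phy_bytes * 100) if phy_bytes > 0 else 0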
        
        avg_file_size_mb = (active_bytes / logical["active_files"] / (1024*1024)) if logical["active_files"] > 0 else 0
        
        # 4. Trigger Maintenance Actions
        self._perform_maintenance(table_name, avg_file_size_mb, bloat_pct)

        # 5. Determine Reporting Status (Post-Maintenance check reference)
        if 0 < avg_file_size_mb < self.SMALL_FILE_THRESHOLD_MB:
            file_status = "WARNING: Small Files"
        elif avg_file_size_mb > 2000:
            file_status = "WARNING: Oversized Files"
        else:
            file_status = "OK"

        if bloat_pct >= self.ZOMBIE_BLOAT_THRESHOLD_PCT:
            vacuum_status = "WARNING: High Storage Waste"
        else:
            vacuum_status = "OK"

        # 6. Compile Report
        report = {
            "Table Name": table_name,
            "Partitioning": str(logical["partitions"]),
            "Active Size (MB)": round(active_bytes / (1024*1024), 2),
            "Total S3 Size (MB)": round(phy_bytes / (1024*1024), 2),
            "Inactive Data (MB)": round(inactive_bytes / (1024*1024), 2),
            "Waste %": round(bloat_pct, 2),
            "Avg File Size (MB)": round(avg_file_size_mb, 2),
            "File Health": file_status,
            "Storage Health": vacuum_status
        }
        self.results.append(report)

    def print_formal_report(self):
        if not self.results:
            print("No results to display.")
            return

        print("\n" + "="*80)
        print(f"{'DATA LAKE HEALTH AUDIT REPORT':^80}")
        print("="*80 + "\n")

        for row in self.results:
            print(f"TABLE: {row['Table Name']}")
            print("-" * 40)
            print(f"Configuration:")
            print(f"  Partition Strategy   : {row['Partitioning']}")
            
            print(f"\nStorage Metrics:")
            print(f"  Active Data Volume   : {row['Active Size (MB)']} MB")
            print(f"  Total Physical usage : {row['Total S3 Size (MB)']} MB")
            print(f"  Inactive / History   : {row['Inactive Data (MB)']} MB ({row['Waste %']}%)")
            print(f"  Status               : {row['Storage Health']}")

            print(f"\nFile Efficiency:")
            print(f"  Average File Size    : {row['Avg File Size (MB)']} MB")
            print(f"  Status               : {row['File Health']}")
            print("\n" + "."*80 + "\n")

# --- EXECUTION ---
# Pass retention_days to the Auditor
auditor = TableHealthAuditor(spark, dbutils, RETENTION_DAYS)

# Iterate through the defined tables
for table in TABLE_SCOPE:
    auditor.analyze_table(table)

# Display final formal text report
auditor.print_formal_report()