# 🚀 **Daily Data Pipeline - Execution Guide**
## Bronze to Silver Layer Processing with Data Quality

---

## 🎯 **PRODUCTION WORKFLOW - Execute in This Order**

### **Phase 1: Configuration** ⏱️ ~5 seconds

**Cell 1 - Configuration (Auto-discover databases and tables)**
* Scans `bronze/mysql/` to discover all databases automatically
* Auto-discovers all tables within each database
* Creates `all_database_configs` variable with paths
* **Output:** 2 databases (retail_db, students_db), 8 tables total

---

### **Phase 2: Data Quality Analysis** ⏱️ ~30 seconds

**Cell 2 - Analyze MySQL Bronze Data Quality**
* Checks all MySQL tables for nulls, duplicates, data types
* Displays sample records and schema
* Generates quality report
* **Issues Found:** 34 duplicates, null values in several tables

**Cell 3 - Analyze Event Hub Bronze Data Quality**
* Analyzes Event Hub AVRO data structure
* Checks Body field (JSON content), SequenceNumber, Offset
* Shows load date distribution

---

### **Phase 3: Data Cleaning** ⏱️ ~1 minute

**Cell 4 - Define Cleaning Rules**
* Global rules: remove duplicates, trim strings, handle nulls
* Table-specific rules: email validation, required fields, dedupe keys

**Cell 5 - Apply Cleaning to MySQL Data**
* Removes 34 duplicates (27% reduction)
* Trims strings, filters invalid emails, removes null required fields
* Creates temp views: `cleaned_{db_name}_{table_name}`
* **Results:** 124 → 90 records
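
As a rough illustration of the email filter in Cell 5, the sketch below applies the same pattern the notebook passes to `rlike`, but with Python's `re` module instead of Spark. The helper name `is_valid_email` and the sample addresses are made up for the example; null emails are kept, matching the notebook's filter.

```python
import re

# Same pattern the notebook passes to Spark's rlike() in Cell 5.
EMAIL_PATTERN = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

def is_valid_email(value):
    """Hypothetical helper: None passes (nulls are kept), otherwise the pattern must match."""
    if value is None:
        return True
    return EMAIL_PATTERN.match(value) is not None

# Made-up sample values, not taken from the pipeline's data.
samples = ["alice@example.com", "bob@example", "no-at-sign.com", None]
kept = [s for s in samples if is_valid_email(s)]
print(kept)
```

Rows whose email fails the pattern are dropped entirely, so a stricter pattern removes more records from silver.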

**Cell 6 - Apply Cleaning to Event Hub Data**
* Removes duplicates, parses JSON Body field
* Decodes base64 content, flattens nested JSON
* Creates temp view: `cleaned_eventhub_events`
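
The decode-and-flatten step can be sketched in plain Python, assuming the body arrives as a base64 string wrapping a JSON object. The function name `decode_and_flatten` and the sample payload are hypothetical; the `body_` prefix mirrors the column names the notebook creates.

```python
import base64
import json

def decode_and_flatten(body_b64, prefix="body_"):
    """Decode a base64 string to JSON and lift its top-level keys into prefixed fields."""
    body_string = base64.b64decode(body_b64).decode("utf-8")
    parsed = json.loads(body_string)
    return {f"{prefix}{key}": value for key, value in parsed.items()}

# Hypothetical event payload, encoded the way this sketch expects to receive it.
payload = {"device": "sensor-1", "temp": 21.5}
raw = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")

flat = decode_and_flatten(raw)
print(flat)
```

Only top-level keys are lifted here; nested objects stay as values under their prefixed key.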

---

### **Phase 4: Validation** ⏱️ ~20 seconds

**Cell 7 - Validate Cleaned Data Quality**
* Verifies no nulls in required fields
* Confirms no duplicates remain
* **Result:** ✅ 8/8 tables passed validation

---

### **Phase 5: Event Hub Processing** ⏱️ ~1 minute

**Cell 8 - Event Hub → Bronze**
* Reads AVRO files from `streamingingestionsathya/eventhub/`
* Uses Auto Loader with `recursiveFileLookup`
* Processes nested folders: `partition/year/month/day/hour/minute`
* Writes to `bronze/eventhub/eventhub_events/` as Delta
* Partitions by load_date, uses checkpoint
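
A minimal sketch of how such a stream might be wired up, not the notebook's actual Cell 8. It assumes a Databricks runtime with Auto Loader, derives `load_date` from the current date, and reuses the checkpoint folder as the schema location; the function names are illustrative.

```python
def autoloader_options(schema_location):
    """Reader options for Auto Loader over AVRO capture files (values are strings)."""
    return {
        "cloudFiles.format": "avro",
        "cloudFiles.schemaLocation": schema_location,  # assumption: where the inferred schema is stored
        "recursiveFileLookup": "true",
    }

def start_eventhub_to_bronze(spark, source_path, bronze_path, checkpoint_path):
    """Start an incremental AVRO -> Delta stream; requires a Databricks runtime."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql.functions import current_date

    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(checkpoint_path))
        .load(source_path)
        .withColumn("load_date", current_date())  # assumption: load_date is the ingestion date
    )
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .partitionBy("load_date")
        .trigger(availableNow=True)  # process what is pending, then stop
        .start(bronze_path)
    )

opts = autoloader_options("checkpoints/eventhub_to_bronze/eventhub_events/")
print(opts["cloudFiles.format"])
```

The checkpoint is what lets a later run skip files that were already ingested, so it should stay at a stable path between runs.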

**Cell 9 - Event Hub Bronze → Silver**
* Reads from bronze Event Hub Delta table
* Adds processing metadata
* Writes to `silver/eventhub/eventhub_events/`

---

### **Phase 6: MySQL to Silver** ⏱️ ~30 seconds

**Cell 10 - MySQL Cleaned Data → Silver**
* Reads CLEANED data from temp views (Cell 5)
* Adds processing metadata
* Writes to `silver/mysql/{database}/{table}/` as Delta
* **Output:** 90 clean, validated records

---

### **Phase 7: Verification** ⏱️ ~10 seconds

**Cell 11 - Complete Silver Layer Summary**
* Verifies all MySQL databases and Event Hub in silver
* Shows record counts per table
* **Final:** 3 sources, 9 tables, 92 total records

---

## ⏱️ **TOTAL DURATION: ~4 minutes**

---

## 📅 **SCHEDULING RECOMMENDATIONS**

### **Option 1: Daily Batch Processing** ⭐ RECOMMENDED

**Schedule:** Daily at 2:00 AM  
**Cron:** `0 2 * * *`  
**Run Cells:** 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → 9 → 10 → 11

**Best For:**
* Standard daily data processing
* Both MySQL and Event Hub updated once per day
* Cost-effective (runs once per day)

**Databricks Workflow Setup:**
```
Task 1: Configuration & Cleaning
  Cells: 1, 2, 3, 4, 5, 6, 7
  Duration: ~2 min
  Dependencies: None

Task 2: Event Hub Processing
  Cells: 8, 9
  Duration: ~1 min
  Dependencies: Task 1

Task 3: MySQL to Silver
  Cell: 10
  Duration: ~30 sec
  Dependencies: Task 1

Task 4: Verification
  Cell: 11
  Duration: ~10 sec
  Dependencies: Task 2, Task 3
```
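
The dependency layout above implies that Task 2 and Task 3 can run in parallel once Task 1 finishes. A small sketch using Python's standard `graphlib` shows the resulting stages; the short task keys are illustrative labels, not Databricks identifiers.

```python
from graphlib import TopologicalSorter

# Task -> set of upstream tasks, copied from the workflow layout above.
dependencies = {
    "task_1_config_cleaning": set(),
    "task_2_eventhub": {"task_1_config_cleaning"},
    "task_3_mysql_silver": {"task_1_config_cleaning"},
    "task_4_verification": {"task_2_eventhub", "task_3_mysql_silver"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose upstream tasks have all finished
    stages.append(ready)
    sorter.done(*ready)

for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {stage}")
```

One caveat worth checking before splitting the notebook this way: temp views such as `cleaned_{db_name}_{table_name}` live in a single Spark session, so Task 3 can only read them if it shares a session with Task 1.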

---

### **Option 2: Frequent Event Hub, Daily MySQL**

**Event Hub Schedule:** Every hour  
**Cron:** `0 * * * *`  
**Run Cells:** 1 → 8 → 9 → 11

**MySQL Schedule:** Daily at 2:00 AM  
**Cron:** `0 2 * * *`  
**Run Cells:** 1 → 2 → 4 → 5 → 7 → 10 → 11

**Best For:**
* Near real-time Event Hub processing
* Batch MySQL processing
* Different SLAs for different sources
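
The cron strings in these options use the five-field Unix form. Databricks job schedules are entered as Quartz cron expressions, which add a leading seconds field and use `?` for the unused day field. A minimal sketch of the translation, assuming the day-of-week field is `*` (the helper `to_quartz` is illustrative and not part of the notebook):

```python
def to_quartz(cron_expr):
    """Convert a simple 5-field cron string to Quartz's 6-field form.

    Only handles expressions whose day-of-week field is '*'; anything else
    needs manual translation because Quartz numbers weekdays differently.
    """
    minute, hour, day_of_month, month, day_of_week = cron_expr.split()
    if day_of_week != "*":
        raise ValueError("day-of-week translation is not covered by this sketch")
    # Quartz adds a leading seconds field and requires '?' in one of the two day fields.
    return f"0 {minute} {hour} {day_of_month} {month} ?"

daily = to_quartz("0 2 * * *")   # daily at 2:00 AM
hourly = to_quartz("0 * * * *")  # every hour, on the hour
print(daily)
print(hourly)
```

The schedule's time zone is set separately from the expression, so "2:00 AM" should be paired with an explicit time zone in the job settings.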

---

### **Option 3: Skip Analysis (After Initial Setup)**

**Schedule:** Daily at 2:00 AM  
**Run Cells:** 1 → 4 → 5 → 6 → 7 → 8 → 9 → 10 → 11

**Best For:**
* After initial data quality assessment
* Faster execution (skip analysis Cells 2-3)
* Stable data sources with known quality

**Note:** Run Cells 2-3 weekly/monthly to monitor quality trends

---

## 📊 **DATA QUALITY RESULTS**

### **Bronze Layer (Raw):**
* Total records: 124 (MySQL)
* Issues: 34 duplicates, null values in multiple tables

### **Silver Layer (Cleaned):**
* Total records: 92 (90 MySQL + 2 Event Hub)
* Quality: ✅ All validations passed (8/8 tables)
* Improvement: 34 duplicate MySQL records removed (27% reduction, 124 → 90)

### **Specific Improvements:**
* customer_details: 9 → 6 records (3 duplicates removed)
* orders: 42 → 21 records (21 duplicates removed)
* users: 20 → 10 records (10 duplicates removed)
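
The figures in this section can be cross-checked with a few lines of arithmetic. The numbers are copied from the text; nothing else is assumed.

```python
# Per-table counts (before, after) taken from the figures in this section.
tables = {
    "customer_details": (9, 6),
    "orders": (42, 21),
    "users": (20, 10),
}
removed_by_table = {name: before - after for name, (before, after) in tables.items()}
total_removed = sum(removed_by_table.values())

# MySQL record totals before and after cleaning.
bronze_total, silver_mysql_total = 124, 90
reduction_pct = (bronze_total - silver_mysql_total) / bronze_total * 100

print(removed_by_table)
print(total_removed, round(reduction_pct, 1))
```

The three tables account for all 34 removed records, which is where the roughly 27% reduction comes from.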

---

## 📍 **STORAGE LOCATIONS**

**Bronze Layer:**
* MySQL: `bronze/mysql/{database}/{table}/load_date=YYYY-MM-DD/`
* Event Hub: `bronze/eventhub/eventhub_events/load_date=YYYY-MM-DD/`

**Silver Layer:**
* MySQL: `silver/mysql/{database}/{table}/`
* Event Hub: `silver/eventhub/eventhub_events/`

**Checkpoints:**
* Event Hub to Bronze: `checkpoints/eventhub_to_bronze/eventhub_events/`
* Event Hub to Silver: `checkpoints/bronze_to_silver/eventhub/eventhub_events/`
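
These relative locations sit under the ABFSS root the notebook builds from its `container` and `storage_account` settings. A small sketch of composing the full paths follows; the helper `abfss_path` is hypothetical, and `retail_db` / `customer_details` are used only as example values.

```python
def abfss_path(container, storage_account, *parts):
    """Join path parts under an ABFSS root, ending with a trailing slash."""
    root = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    return "/".join([root, *parts]) + "/"

container, storage_account = "datalake", "datamigrationsathya"

bronze = abfss_path(container, storage_account, "bronze", "mysql", "retail_db", "customer_details")
silver = abfss_path(container, storage_account, "silver", "mysql", "retail_db", "customer_details")
print(bronze)
print(silver)
```

Keeping the layer name as the first path segment means the bronze and silver locations for a table differ in exactly one place.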

---

## 🔑 **KEY FEATURES**

✅ **Fully Dynamic** - Auto-discovers new databases and tables  
✅ **Incremental** - Only processes new files (checkpoint-based)  
✅ **Data Quality** - Analysis, cleaning, validation built-in  
✅ **Multi-Source** - Handles MySQL + Event Hub seamlessly  
✅ **Scalable** - Add new sources without code changes  
✅ **Monitored** - Clear success/failure indicators  

---

## ⚡ **PERFORMANCE NOTES**

**First Run:**
* Duration: ~4 minutes
* Processes ALL existing data
* Creates checkpoints

**Subsequent Runs:**
* Duration: ~30 seconds (if no new data)
* Only processes NEW files
* Fast checkpoint lookup

**Auto Loader Behavior:**
* Tracks files by path/name only (not content)
* Modified files with same name are SKIPPED
* Recommendation: Use append-only pattern with unique filenames
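
One way to follow the append-only recommendation is to give every exported file a name that cannot repeat. The naming scheme below (table name, UTC timestamp, random suffix) is an assumption for illustration, not something the pipeline prescribes.

```python
import uuid
from datetime import datetime, timezone

def unique_filename(table_name, extension="parquet"):
    """Hypothetical naming scheme: table, UTC timestamp, and a random suffix."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"{table_name}_{stamp}_{uuid.uuid4().hex[:8]}.{extension}"

first = unique_filename("orders")
second = unique_filename("orders")
print(first)
print(second)
```

Because each call yields a new name, a corrected export lands as a new file that gets picked up, instead of overwriting a file that was already tracked and would be skipped.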

---

## 📝 **NOTEBOOK PATH**

`/Repos/sathyarajeshpk@gmail.com/MigrationProject/Notebooks/Bronze_setup`

---

## 📧 **CONTACT**

**Owner:** sathyarajeshpk@gmail.com  
**Workspace:** datamigrationsathya (Azure)  
**Compute:** Serverless Interactive Cluster

In [0]:
# ============================================
# FULLY DYNAMIC CONFIGURATION
# Auto-discovers databases AND tables
# ============================================

# Base storage account and container
storage_account = "datamigrationsathya"
container = "datalake"

# Path structure configuration
layer = "bronze"  # bronze, silver, gold
source_system = "mysql"  # mysql, postgres, etc.

# Base path for the source system
source_base_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/{layer}/{source_system}/"

print("=" * 70)
print("AUTO-DISCOVERING DATABASES AND TABLES")
print("=" * 70)
print(f"\nSource system: {source_system}")
print(f"Base path: {source_base_path}")

# ============================================
# AUTO-DISCOVER ALL DATABASES
# ============================================

all_database_configs = []

try:
    # List all database folders under bronze/mysql/
    database_folders = dbutils.fs.ls(source_base_path)
    
    print(f"\nFound {len(database_folders)} database(s):\n")
    
    for db_folder in database_folders:
        if db_folder.isDir():
            database_name = db_folder.name.rstrip('/')
            
            # Construct paths for this database
            bronze_path = f"{source_base_path}{database_name}/"
            silver_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/silver/{source_system}/{database_name}/"
            checkpoint_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/checkpoints/bronze_to_silver/{source_system}/{database_name}/"
            
            # Auto-discover tables in this database
            try:
                table_folders = dbutils.fs.ls(bronze_path)
                tables = [t.name.rstrip('/') for t in table_folders if t.isDir()]
                
                if tables:  # Only add if tables exist
                    all_database_configs.append({
                        "database_name": database_name,
                        "bronze_path": bronze_path,
                        "silver_path": silver_path,
                        "checkpoint_path": checkpoint_path,
                        "tables": tables
                    })
                    
                    print(f"  ✓ {database_name}: {len(tables)} table(s)")
                    for table in tables:
                        print(f"      - {table}")
                    print()
                else:
                    print(f"  ⚠ {database_name}: No tables found (skipping)\n")

            except Exception as e:
                print(f"  ✗ {database_name}: Error reading tables - {str(e)}\n")
                continue

    if not all_database_configs:
        print("\n⚠️  No databases with tables found!")
        print("Please check your bronze layer structure.")
    else:
        print("=" * 70)
        print(f"SUMMARY: {len(all_database_configs)} database(s) ready to process")
        total_tables = sum(len(config['tables']) for config in all_database_configs)
        print(f"Total tables across all databases: {total_tables}")
        print("=" * 70)
        
except Exception as e:
    print(f"\n❌ Error discovering databases: {str(e)}")
    print("\nFalling back to manual configuration...")
    
    # Fallback: Manual configuration
    database_name = "retail_db"
    bronze_base_path = f"{source_base_path}{database_name}/"
    silver_base_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/silver/{source_system}/{database_name}/"
    checkpoint_base_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/checkpoints/bronze_to_silver/{source_system}/{database_name}/"
    
    # Auto-discover tables
    try:
        folders = dbutils.fs.ls(bronze_base_path)
        tables_to_process = [folder.name.rstrip('/') for folder in folders if folder.isDir()]
        print(f"Using manual database: {database_name}")
        print(f"Found {len(tables_to_process)} tables: {tables_to_process}")
    except Exception:
        tables_to_process = ["customer_details"]
        print(f"Using fallback table list: {tables_to_process}")
    
    # Create single database config for backward compatibility
    all_database_configs = [{
        "database_name": database_name,
        "bronze_path": bronze_base_path,
        "silver_path": silver_base_path,
        "checkpoint_path": checkpoint_base_path,
        "tables": tables_to_process
    }]

print("\n✅ Configuration complete!")
print("\nNote: Auto Loader will recursively process all files in subdirectories (e.g., load_date partitions)")

In [0]:
# Comprehensive data quality analysis for MySQL bronze layer

from pyspark.sql.functions import col, count, when, isnan, isnull, sum as spark_sum, avg, min as spark_min, max as spark_max

print("=" * 70)
print("MYSQL BRONZE LAYER - DATA QUALITY ANALYSIS")
print("=" * 70)

if 'all_database_configs' not in dir() or not all_database_configs:
    print("\n❌ Please run Cell 1 (Configuration) first!")
else:
    quality_report = []
    
    for config in all_database_configs:
        db_name = config['database_name']
        bronze_path = config['bronze_path']
        tables = config['tables']
        
        print(f"\n{'='*70}")
        print(f"Database: {db_name}")
        print(f"{'='*70}")
        
        for table_name in tables:
            print(f"\n📊 Analyzing: {table_name}")
            print("-" * 70)
            
            try:
                # Read bronze data
                df = spark.read.parquet(f"{bronze_path}{table_name}/")
                
                total_records = df.count()
                total_columns = len(df.columns)
                
                print(f"  Total Records: {total_records:,}")
                print(f"  Total Columns: {total_columns}")
                
                # 1. NULL ANALYSIS
                print(f"\n  📋 Null Analysis:")
                null_counts = df.select([spark_sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in df.columns]).collect()[0]

                has_nulls = False
                for col_name in df.columns:
                    null_count = null_counts[col_name]
                    if null_count > 0:
                        null_pct = (null_count / total_records) * 100
                        print(f"    ⚠ {col_name}: {null_count:,} nulls ({null_pct:.1f}%)")
                        has_nulls = True

                if not has_nulls:
                    print(f"    ✓ No null values found")

                # 2. DUPLICATE ANALYSIS
                print(f"\n  🔄 Duplicate Analysis:")
                duplicate_count = total_records - df.dropDuplicates().count()
                if duplicate_count > 0:
                    print(f"    ⚠ Found {duplicate_count:,} duplicate records ({(duplicate_count/total_records)*100:.1f}%)")
                else:
                    print(f"    ✓ No duplicates found")

                # 3. DATA TYPE ANALYSIS
                print(f"\n  📝 Schema:")
                for field in df.schema.fields:
                    print(f"    - {field.name}: {field.dataType}")

                # 4. SAMPLE DATA
                print(f"\n  📄 Sample Records (first 3):")
                display(df.limit(3))
                
                # Store quality metrics
                quality_report.append({
                    'database': db_name,
                    'table': table_name,
                    'total_records': total_records,
                    'total_columns': total_columns,
                    'has_nulls': has_nulls,
                    'duplicate_count': duplicate_count
                })
                
            except Exception as e:
                print(f"  ❌ Error analyzing {table_name}: {str(e)[:100]}")
    
    # Summary Report
    print(f"\n\n{'='*70}")
    print("QUALITY ANALYSIS SUMMARY")
    print(f"{'='*70}")
    
    for report in quality_report:
        status = "⚠ Issues Found" if (report['has_nulls'] or report['duplicate_count'] > 0) else "✓ Clean"
        print(f"\n{report['database']}.{report['table']}: {status}")
        print(f"  Records: {report['total_records']:,}")
        if report['has_nulls']:
            print(f"  ⚠ Has null values")
        if report['duplicate_count'] > 0:
            print(f"  ⚠ Has {report['duplicate_count']:,} duplicates")

In [0]:
# Data quality analysis for Event Hub bronze layer

from pyspark.sql.functions import col, count, when, length, size, sum as spark_sum

print("=" * 70)
print("EVENT HUB BRONZE LAYER - DATA QUALITY ANALYSIS")
print("=" * 70)

eventhub_name = "eventhub_events"
bronze_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/bronze/eventhub/{eventhub_name}/"

try:
    df_eh = spark.read.format("delta").load(bronze_eventhub_path)
    
    total_records = df_eh.count()
    print(f"\nTotal Records: {total_records:,}")
    print(f"Total Columns: {len(df_eh.columns)}")
    
    # 1. NULL ANALYSIS
    print(f"\n📋 Null Analysis:")
    null_counts = df_eh.select([spark_sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in df_eh.columns]).collect()[0]

    for col_name in df_eh.columns:
        null_count = null_counts[col_name]
        if null_count > 0:
            null_pct = (null_count / total_records) * 100
            print(f"  ⚠ {col_name}: {null_count:,} nulls ({null_pct:.1f}%)")

    # 2. BODY FIELD ANALYSIS (contains JSON data)
    print(f"\n📦 Body Field Analysis:")
    if 'Body' in df_eh.columns:
        # Check for empty bodies
        empty_bodies = df_eh.filter(col('Body').isNull()).count()
        print(f"  Empty bodies: {empty_bodies}")
        
        # Sample body content
        print(f"\n  Sample Body content (first record):")
        sample_body = df_eh.select('Body').limit(1).collect()[0]['Body']
        if sample_body:
            import base64
            decoded = base64.b64decode(sample_body).decode('utf-8')
            print(f"  {decoded[:200]}...")
    
    # 3. DUPLICATE ANALYSIS
    print(f"\n🔄 Duplicate Analysis:")
    duplicate_count = total_records - df_eh.dropDuplicates(['SequenceNumber', 'Offset']).count()
    if duplicate_count > 0:
        print(f"  ⚠ Found {duplicate_count:,} duplicate records")
    else:
        print(f"  ✓ No duplicates found")

    # 4. LOAD DATE DISTRIBUTION
    print(f"\n📅 Load Date Distribution:")
    if 'load_date' in df_eh.columns:
        df_eh.groupBy('load_date').count().orderBy('load_date').show()

    # 5. SCHEMA
    print(f"\n📝 Schema:")
    df_eh.printSchema()

    # 6. SAMPLE DATA
    print(f"\n📄 Sample Records:")
    display(df_eh.limit(3))
    
except Exception as e:
    print(f"\n❌ Error: {str(e)}")
    print("Make sure Event Hub data has been copied to bronze layer (run Cell 8)")

In [0]:
# Define data cleaning rules and transformations

from pyspark.sql.functions import col, trim, upper, lower, regexp_replace, to_timestamp, coalesce, lit

print("=" * 70)
print("DATA CLEANING RULES DEFINITION")
print("=" * 70)

# ============================================
# CLEANING RULES CONFIGURATION
# ============================================

cleaning_rules = {
    # Global rules applied to all tables
    'global': {
        'remove_duplicates': True,
        'trim_strings': True,
        'handle_nulls': True,
        'standardize_dates': True
    },
    
    # Table-specific rules
    'table_specific': {
        'customer_details': {
            'required_fields': ['customer_id', 'customer_name'],
            'email_validation': True,
            'dedupe_key': ['customer_id']
        },
        'orders': {
            'required_fields': ['order_id'],
            'dedupe_key': ['order_id'],
            'filter_negative_amounts': True
        },
        'users': {
            'required_fields': ['user_id', 'name'],
            'dedupe_key': ['user_id']
        },
        'eventhub_events': {
            'required_fields': ['SequenceNumber', 'Offset'],
            'dedupe_key': ['SequenceNumber', 'Offset'],
            'parse_body_json': True
        }
    }
}

print("\n✅ Cleaning rules defined:")
print(f"\n📋 Global Rules:")
for rule, enabled in cleaning_rules['global'].items():
    print(f"  - {rule}: {enabled}")

print(f"\n📋 Table-Specific Rules:")
for table, rules in cleaning_rules['table_specific'].items():
    print(f"\n  {table}:")
    for rule, value in rules.items():
        print(f"    - {rule}: {value}")

print("\n" + "=" * 70)
print("Rules ready to apply!")
print("=" * 70)

In [0]:
# Apply cleaning transformations to MySQL bronze data

from pyspark.sql.functions import col, trim, regexp_replace, when, length, coalesce

def clean_mysql_table(df, table_name, rules):
    """
    Apply cleaning transformations to a DataFrame
    """
    df_clean = df
    
    # Get table-specific rules
    table_rules = rules['table_specific'].get(table_name, {})
    
    # 1. REMOVE DUPLICATES
    if rules['global']['remove_duplicates']:
        dedupe_key = table_rules.get('dedupe_key', None)
        if dedupe_key:
            initial_count = df_clean.count()
            df_clean = df_clean.dropDuplicates(dedupe_key)
            removed = initial_count - df_clean.count()
            if removed > 0:
                print(f"    ✓ Removed {removed:,} duplicates based on {dedupe_key}")
    
    # 2. TRIM STRING COLUMNS
    if rules['global']['trim_strings']:
        # Use df.dtypes for the check; str(field.dataType) differs across Spark versions
        string_cols = [name for name, dtype in df_clean.dtypes if dtype == 'string']
        for col_name in string_cols:
            df_clean = df_clean.withColumn(col_name, trim(col(col_name)))
        if string_cols:
            print(f"    ✓ Trimmed {len(string_cols)} string columns")
    
    # 3. FILTER REQUIRED FIELDS
    required_fields = table_rules.get('required_fields', [])
    if required_fields:
        initial_count = df_clean.count()
        for field in required_fields:
            if field in df_clean.columns:
                df_clean = df_clean.filter(col(field).isNotNull())
        removed = initial_count - df_clean.count()
        if removed > 0:
            print(f"    ✓ Filtered {removed:,} records with null required fields")
    
    # 4. EMAIL VALIDATION (if applicable)
    if table_rules.get('email_validation', False):
        email_cols = [c for c in df_clean.columns if 'email' in c.lower()]
        for email_col in email_cols:
            initial_count = df_clean.count()
            df_clean = df_clean.filter(
                (col(email_col).isNull()) | 
                (col(email_col).rlike(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'))
            )
            removed = initial_count - df_clean.count()
            if removed > 0:
                print(f"    ✓ Filtered {removed:,} records with invalid emails")
    
    # 5. FILTER NEGATIVE AMOUNTS (if applicable)
    if table_rules.get('filter_negative_amounts', False):
        amount_cols = [c for c in df_clean.columns if 'amount' in c.lower() or 'price' in c.lower()]
        for amount_col in amount_cols:
            if amount_col in df_clean.columns:
                initial_count = df_clean.count()
                df_clean = df_clean.filter((col(amount_col).isNull()) | (col(amount_col) >= 0))
                removed = initial_count - df_clean.count()
                if removed > 0:
                    print(f"    ✓ Filtered {removed:,} records with negative {amount_col}")
    
    return df_clean


print("=" * 70)
print("APPLYING CLEANING TRANSFORMATIONS - MYSQL")
print("=" * 70)

if 'all_database_configs' not in dir() or not all_database_configs:
    print("\n❌ Please run Cell 1 (Configuration) first!")
else:
    cleaning_summary = []
    
    for config in all_database_configs:
        db_name = config['database_name']
        bronze_path = config['bronze_path']
        tables = config['tables']
        
        print(f"\n{'='*70}")
        print(f"Database: {db_name}")
        print(f"{'='*70}")
        
        for table_name in tables:
            print(f"\n  🧹 Cleaning: {table_name}")
            
            try:
                # Read bronze data
                df_bronze = spark.read.parquet(f"{bronze_path}{table_name}/")
                initial_count = df_bronze.count()
                
                # Apply cleaning
                df_clean = clean_mysql_table(df_bronze, table_name, cleaning_rules)
                final_count = df_clean.count()
                
                removed = initial_count - final_count
                retention_pct = (final_count / initial_count) * 100 if initial_count > 0 else 0
                
                print(f"    📊 Initial: {initial_count:,} → Final: {final_count:,} ({retention_pct:.1f}% retained)")
                
                # Store cleaned DataFrame for later use
                df_clean.createOrReplaceTempView(f"cleaned_{db_name}_{table_name}")
                
                cleaning_summary.append({
                    'database': db_name,
                    'table': table_name,
                    'initial': initial_count,
                    'final': final_count,
                    'removed': removed
                })
                
            except Exception as e:
                print(f"    ❌ Error: {str(e)[:100]}")
    
    # Summary
    print(f"\n\n{'='*70}")
    print("CLEANING SUMMARY")
    print(f"{'='*70}")
    
    total_initial = sum(s['initial'] for s in cleaning_summary)
    total_final = sum(s['final'] for s in cleaning_summary)
    total_removed = sum(s['removed'] for s in cleaning_summary)
    
    for summary in cleaning_summary:
        print(f"\n{summary['database']}.{summary['table']}:")
        print(f"  Initial: {summary['initial']:,}")
        print(f"  Final: {summary['final']:,}")
        print(f"  Removed: {summary['removed']:,}")
    
    print(f"\n{'='*70}")
    print(f"TOTAL: {total_initial:,} → {total_final:,} (removed {total_removed:,})")
    print(f"{'='*70}")

In [0]:
# Apply cleaning transformations to Event Hub bronze data

from pyspark.sql.functions import col, from_json, schema_of_json, base64, unbase64
import json

print("=" * 70)
print("APPLYING CLEANING TRANSFORMATIONS - EVENT HUB")
print("=" * 70)

eventhub_name = "eventhub_events"
bronze_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/bronze/eventhub/{eventhub_name}/"

try:
    df_eh = spark.read.format("delta").load(bronze_eventhub_path)
    initial_count = df_eh.count()
    
    print(f"\nInitial records: {initial_count:,}")
    
    # 1. REMOVE DUPLICATES
    print(f"\n🧹 Cleaning steps:")
    df_clean = df_eh.dropDuplicates(['SequenceNumber', 'Offset'])
    removed_dupes = initial_count - df_clean.count()
    if removed_dupes > 0:
        print(f"  ✓ Removed {removed_dupes:,} duplicates")
    else:
        print(f"  ✓ No duplicates found")
    
    # 2. FILTER NULL REQUIRED FIELDS
    df_clean = df_clean.filter(
        col('SequenceNumber').isNotNull() & 
        col('Offset').isNotNull()
    )
    # Reuse initial_count instead of re-counting the bronze DataFrame
    removed_nulls = initial_count - removed_dupes - df_clean.count()
    if removed_nulls > 0:
        print(f"  ✓ Filtered {removed_nulls:,} records with null required fields")
    
    # 3. PARSE BODY JSON (if Body field exists)
    if 'Body' in df_clean.columns:
        print(f"\n  📦 Parsing Body JSON field...")
        
        # Decode base64 Body to string
        from pyspark.sql.functions import expr
        df_clean = df_clean.withColumn('body_string', expr("cast(unbase64(Body) as string)"))
        
        # Try to infer JSON schema from sample
        sample_body = df_clean.select('body_string').filter(col('body_string').isNotNull()).limit(1).collect()
        
        if sample_body and sample_body[0]['body_string']:
            try:
                # Infer schema from sample JSON
                json_schema = schema_of_json(sample_body[0]['body_string'])
                
                # Parse JSON
                df_clean = df_clean.withColumn('parsed_body', from_json(col('body_string'), json_schema))
                
                # Flatten parsed JSON fields
                if 'parsed_body' in df_clean.columns:
                    # Get all fields from parsed_body struct
                    parsed_fields = df_clean.select('parsed_body.*').columns
                    for field in parsed_fields:
                        df_clean = df_clean.withColumn(f"body_{field}", col(f"parsed_body.{field}"))
                    
                    print(f"    ✓ Parsed Body JSON into {len(parsed_fields)} fields")
            except Exception as e:
                print(f"    ⚠ Could not parse JSON: {str(e)[:100]}")
    
    final_count = df_clean.count()
    retention_pct = (final_count / initial_count) * 100 if initial_count > 0 else 0
    
    print(f"\n📊 Summary:")
    print(f"  Initial: {initial_count:,}")
    print(f"  Final: {final_count:,}")
    print(f"  Removed: {initial_count - final_count:,}")
    print(f"  Retention: {retention_pct:.1f}%")
    
    # Store cleaned DataFrame
    df_clean.createOrReplaceTempView("cleaned_eventhub_events")
    
    print(f"\n✅ Event Hub data cleaned and ready!")
    print(f"\n📄 Sample cleaned data:")
    display(df_clean.limit(5))

except Exception as e:
    print(f"\n❌ Error: {str(e)}")
    print("Make sure Event Hub data has been copied to bronze layer (run Cell 8)")

In [0]:
# Validate cleaned data quality

from pyspark.sql.functions import col

print("=" * 70)
print("DATA QUALITY VALIDATION - POST CLEANING")
print("=" * 70)

validation_results = []

# Validate MySQL tables
if 'all_database_configs' in dir() and all_database_configs:
    for config in all_database_configs:
        db_name = config['database_name']
        tables = config['tables']
        
        for table_name in tables:
            view_name = f"cleaned_{db_name}_{table_name}"
            
            try:
                df = spark.table(view_name)
                
                # Check for nulls in required fields
                table_rules = cleaning_rules['table_specific'].get(table_name, {})
                required_fields = table_rules.get('required_fields', [])
                
                null_in_required = False
                for field in required_fields:
                    if field in df.columns:
                        null_count = df.filter(col(field).isNull()).count()
                        if null_count > 0:
                            null_in_required = True
                            break
                
                # Check for duplicates
                dedupe_key = table_rules.get('dedupe_key', [])
                has_duplicates = False
                if dedupe_key:
                    total = df.count()
                    unique = df.dropDuplicates(dedupe_key).count()
                    has_duplicates = (total != unique)
                
                status = "✅ PASS" if (not null_in_required and not has_duplicates) else "⚠️ ISSUES"

                validation_results.append({
                    'table': f"{db_name}.{table_name}",
                    'status': status,
                    'null_in_required': null_in_required,
                    'has_duplicates': has_duplicates,
                    'record_count': df.count()
                })

            except Exception as e:
                validation_results.append({
                    'table': f"{db_name}.{table_name}",
                    'status': "❌ ERROR",
                    'error': str(e)[:50]
                })

# Validate Event Hub
try:
    df_eh = spark.table("cleaned_eventhub_events")

    null_in_required = df_eh.filter(
        col('SequenceNumber').isNull() | col('Offset').isNull()
    ).count() > 0

    has_duplicates = df_eh.count() != df_eh.dropDuplicates(['SequenceNumber', 'Offset']).count()

    status = "✅ PASS" if (not null_in_required and not has_duplicates) else "⚠️ ISSUES"

    validation_results.append({
        'table': 'eventhub.eventhub_events',
        'status': status,
        'null_in_required': null_in_required,
        'has_duplicates': has_duplicates,
        'record_count': df_eh.count()
    })
except Exception:
    # The Event Hub view is optional here (for example, when Cell 6 was skipped)
    pass

# Display results
print("\n📋 Validation Results:\n")
for result in validation_results:
    print(f"{result['status']} {result['table']}")
    if 'record_count' in result:
        print(f"    Records: {result['record_count']:,}")
    if result.get('null_in_required'):
        print(f"    ⚠ Has nulls in required fields")
    if result.get('has_duplicates'):
        print(f"    ⚠ Has duplicate records")
    if 'error' in result:
        print(f"    ❌ Error: {result['error']}")
    print()

print("=" * 70)
passed = sum(1 for r in validation_results if r['status'] == "✅ PASS")
total = len(validation_results)
print(f"Validation: {passed}/{total} tables passed")
print("=" * 70)

In [0]:
# ============================================
# PRODUCTION: Write CLEANED MySQL data to Silver Layer
# This replaces the old Cell 19 - now uses cleaned data
# ============================================

from pyspark.sql.functions import current_timestamp, lit

print("=" * 70)
print("MYSQL CLEANED DATA TO SILVER PIPELINE")
print("=" * 70)

if 'all_database_configs' not in dir() or not all_database_configs:
    print("\n❌ ERROR: Please run Cell 1 first!")
else:
    print(f"\n✅ Found configuration for {len(all_database_configs)} database(s)\n")
    
    for config in all_database_configs:
        db_name = config['database_name']
        silver_path = config['silver_path']
        tables = config['tables']
        
        print(f"\n{'='*70}")
        print(f"Database: {db_name}")
        print(f"{'='*70}")
        
        for table_name in tables:
            print(f"\n  📊 Processing: {db_name}/{table_name}")
            
            try:
                # Read CLEANED data from temp view
                view_name = f"cleaned_{db_name}_{table_name}"
                df_clean = spark.table(view_name)
                
                # Add processing metadata
                df_silver = (df_clean
                    .withColumn("processing_timestamp", current_timestamp())
                    .withColumn("source_table", lit(table_name))
                    .withColumn("source_database", lit(db_name))
                )
                
                # Write to silver layer (overwrite mode for cleaned data)
                df_silver.write\
                    .format("delta")\
                    .mode("overwrite")\
                    .option("mergeSchema", "true")\
                    .save(f"{silver_path}{table_name}/")
                
                record_count = df_silver.count()
                print(f"     ✅ Written {record_count:,} cleaned records to silver")
                
            except Exception as e:
                print(f"     ❌ Error: {str(e)[:100]}")
                continue
        
        print(f"\n  ✅ Completed database: {db_name}")
    
    print("\n" + "="*70)
    print("✅ MySQL Cleaned Data Pipeline Complete!")
    print("="*70)

In [0]:
# ============================================
# DAILY PRODUCTION PIPELINE
# Processes all databases and tables discovered in Cell 1
# ============================================

from pyspark.sql.functions import current_timestamp, lit

print("=" * 70)
print("Starting Daily Bronze to Silver Pipeline")
print("=" * 70)

# Use the all_database_configs from Cell 1
if 'all_database_configs' not in dir() or not all_database_configs:
    print("\n❌ ERROR: Please run Cell 1 first to discover databases and tables!")
    print("Cell 1 creates the 'all_database_configs' variable needed for processing.")
else:
    print(f"\n✅ Found configuration for {len(all_database_configs)} database(s)")
    total_tables = sum(len(config['tables']) for config in all_database_configs)
    print(f"Total tables to process: {total_tables}\n")
    
    # Process each database
    for config in all_database_configs:
        db_name = config['database_name']
        bronze_path = config['bronze_path']
        silver_path = config['silver_path']
        checkpoint_path = config['checkpoint_path']
        tables = config['tables']
        
        print(f"\n{'='*70}")
        print(f"Database: {db_name}")
        print(f"Tables: {len(tables)}")
        print(f"{'='*70}")
        
        # Process each table in this database
        for table_name in tables:
            print(f"\n  📊 Processing: {db_name}/{table_name}")
            
            try:
                # Read with Auto Loader (only processes new files)
                df_stream = (spark.readStream
                    .format("cloudFiles")
                    .option("cloudFiles.format", "parquet")
                    .option("cloudFiles.schemaLocation", f"{checkpoint_path}{table_name}/schema")
                    .option("cloudFiles.inferColumnTypes", "true")
                    .option("recursiveFileLookup", "true")
                    .load(f"{bronze_path}{table_name}/")
                )
                
                # Add metadata columns
                df_enriched = (df_stream
                    .withColumn("processing_timestamp", current_timestamp())
                    .withColumn("source_table", lit(table_name))
                    .withColumn("source_database", lit(db_name))
                )
                
                # Write to silver layer
                query = (df_enriched.writeStream
                    .format("delta")
                    .option("checkpointLocation", f"{checkpoint_path}{table_name}/checkpoint")
                    .option("mergeSchema", "true")
                    .outputMode("append")
                    .trigger(availableNow=True)
                    .start(f"{silver_path}{table_name}/")
                )
                
                # Wait for completion
                query.awaitTermination()
                
                print(f"     ✅ Successfully processed {table_name}")
                
            except Exception as e:
                print(f"     ❌ Error processing {table_name}: {str(e)[:100]}")
                continue
        
        print(f"\n  ✅ Completed database: {db_name}")
    
    print("\n" + "="*70)
    print("✅ Daily Pipeline Complete!")
    print("="*70)

In [0]:
# ============================================
# EVENT HUB TO BRONZE PIPELINE
# Run this to copy new Event Hub data to bronze layer
# ============================================

from pyspark.sql.functions import current_timestamp, lit, to_date

print("=" * 70)
print("EVENT HUB TO BRONZE PIPELINE")
print("=" * 70)

# Configuration
eventhub_source = f"abfss://{container}@{storage_account}.dfs.core.windows.net/streamingingestionsathya/eventhub/"
eventhub_name = "eventhub_events"

# Bronze destination
bronze_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/bronze/eventhub/{eventhub_name}/"
checkpoint_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/checkpoints/eventhub_to_bronze/{eventhub_name}/"

print(f"\nSource: {eventhub_source}")
print(f"Destination: {bronze_eventhub_path}")
print("\n⏳ Processing Event Hub data...\n")

try:
    # Read Event Hub data with Auto Loader
    df_eventhub = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "avro")  # Event Hub Capture uses AVRO
        .option("cloudFiles.schemaLocation", f"{checkpoint_eventhub_path}schema")
        .option("cloudFiles.inferColumnTypes", "true")
        .option("recursiveFileLookup", "true")
        .load(eventhub_source)
    )
    
    # Add metadata and partition by date
    df_enriched = (df_eventhub
        .withColumn("ingestion_timestamp", current_timestamp())
        .withColumn("source_system", lit("eventhub"))
        .withColumn("load_date", to_date(current_timestamp()))
    )
    
    # Write to bronze layer
    query = (df_enriched.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", f"{checkpoint_eventhub_path}checkpoint")
        .option("mergeSchema", "true")
        .partitionBy("load_date")
        .trigger(availableNow=True)
        .start(bronze_eventhub_path)
    )
    
    query.awaitTermination()
    
    print("\n" + "=" * 70)
    print("✅ Event Hub data copied to Bronze layer!")
    print("=" * 70)
    
    # Verify
    df_bronze = spark.read.format("delta").load(bronze_eventhub_path)
    record_count = df_bronze.count()
    
    print(f"\nTotal records in bronze: {record_count:,}")
    
    if "load_date" in df_bronze.columns:
        dates = df_bronze.select("load_date").distinct().count()
        print(f"Distinct load dates: {dates}")
    
except Exception as e:
    print(f"\n❌ Error: {str(e)[:100]}")
    import traceback
    traceback.print_exc()

In [0]:
# ============================================
# EVENT HUB BRONZE TO SILVER PIPELINE
# Run this after copying Event Hub data to bronze
# ============================================

from pyspark.sql.functions import current_timestamp, lit

print("=" * 70)
print("EVENT HUB BRONZE TO SILVER PIPELINE")
print("=" * 70)

# Configuration
eventhub_name = "eventhub_events"
bronze_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/bronze/eventhub/{eventhub_name}/"
silver_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/silver/eventhub/{eventhub_name}/"
checkpoint_silver_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/checkpoints/bronze_to_silver/eventhub/{eventhub_name}/"

print(f"\nBronze: {bronze_eventhub_path}")
print(f"Silver: {silver_eventhub_path}")
print("\n⏳ Processing Event Hub bronze to silver...\n")

try:
    # Read from bronze layer
    df_bronze = (spark.readStream
        .format("delta")
        .load(bronze_eventhub_path)
    )
    
    # Add processing metadata
    df_enriched = (df_bronze
        .withColumn("processing_timestamp", current_timestamp())
        .withColumn("source_table", lit(eventhub_name))
        .withColumn("source_database", lit("eventhub"))
    )
    
    # Write to silver layer
    query = (df_enriched.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", f"{checkpoint_silver_path}checkpoint")
        .option("mergeSchema", "true")
        .trigger(availableNow=True)
        .start(silver_eventhub_path)
    )
    
    query.awaitTermination()
    
    print("\n" + "=" * 70)
    print("✅ Event Hub data processed to Silver layer!")
    print("=" * 70)
    
    # Verify
    df_silver = spark.read.format("delta").load(silver_eventhub_path)
    record_count = df_silver.count()
    
    print(f"\nTotal records in silver: {record_count:,}")
    
    if "load_date" in df_silver.columns:
        dates = df_silver.select("load_date").distinct().count()
        print(f"Distinct load dates: {dates}")
    
except Exception as e:
    print(f"\n❌ Error: {str(e)[:100]}")
    import traceback
    traceback.print_exc()

In [0]:
# Complete verification of all data sources in silver layer

print("=" * 70)
print("COMPLETE SILVER LAYER SUMMARY")
print("=" * 70)

total_sources = 0
total_tables = 0
total_records = 0

# 1. MySQL Databases
print("\n📊 MYSQL DATABASES")
print("=" * 70)

if 'all_database_configs' in dir() and all_database_configs:
    for config in all_database_configs:
        db_name = config['database_name']
        silver_path = config['silver_path']
        tables = config['tables']
        
        print(f"\n  Database: {db_name}")
        
        db_records = 0
        for table_name in tables:
            try:
                df = spark.read.format("delta").load(f"{silver_path}{table_name}/")
                count = df.count()
                db_records += count
                print(f"    ✓ {table_name}: {count:,} records")
            except Exception as e:
                # Show why the table could not be read instead of hiding the cause
                print(f"    ✗ {table_name}: Not processed ({str(e)[:80]})")
        
        print(f"  Subtotal: {db_records:,} records")
        total_sources += 1
        total_tables += len(tables)
        total_records += db_records
else:
    print("  ⚠ No MySQL databases configured")

# 2. Event Hub
print("\n\n📡 EVENT HUB")
print("=" * 70)

eventhub_name = "eventhub_events"
silver_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/silver/eventhub/{eventhub_name}/"

try:
    df_eh = spark.read.format("delta").load(silver_eventhub_path)
    eh_count = df_eh.count()
    
    print(f"\n  Source: eventhub")
    print(f"    ✓ {eventhub_name}: {eh_count:,} records")
    
    if "load_date" in df_eh.columns:
        dates = df_eh.select("load_date").distinct().count()
        print(f"    ✓ Load dates: {dates}")
    
    print(f"  Subtotal: {eh_count:,} records")
    total_sources += 1
    total_tables += 1
    total_records += eh_count
    
except Exception as e:
    print("\n  ✗ Event Hub: Not processed or error")
    print(f"     {str(e)[:80]}")

# Final Summary
print("\n\n" + "=" * 70)
print("FINAL SUMMARY")
print("=" * 70)
print(f"\n  Total Sources: {total_sources}")
print(f"  Total Tables/Streams: {total_tables}")
print(f"  Total Records: {total_records:,}")
print("\n" + "=" * 70)

if total_records > 0:
    print("\n✅ SUCCESS! All data sources processed to silver layer!")
    print("\n📍 Silver Layer Locations:")
    print(f"   - MySQL: abfss://{container}@{storage_account}.dfs.core.windows.net/silver/mysql/")
    print(f"   - Event Hub: abfss://{container}@{storage_account}.dfs.core.windows.net/silver/eventhub/")
else:
    print("\n⚠️ No data found in silver layer. Please run the processing pipelines first.")

---
# ⚡ **Delta Lake Optimization**

## **Why Optimize?**

* **Query Performance** - 10-100x faster in favorable cases (selective filters on well-clustered columns)
* **Storage Efficiency** - 30-50% reduction
* **Cost Savings** - Less storage + faster queries

---

## **Techniques**

### **1. OPTIMIZE** - File Compaction
* Combines small files into larger files
* Run weekly after incremental loads
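
One way to decide whether a table is worth compacting is to look at its average file size. The sketch below assumes the two inputs come from `DESCRIBE DETAIL` on the Delta table (its `numFiles` and `sizeInBytes` columns). The helper name `needs_compaction` and the 128 MB threshold are illustrative assumptions, not settings taken from this pipeline.

```python
def needs_compaction(num_files: int, size_in_bytes: int, target_file_mb: int = 128) -> bool:
    """Return True when the average file is smaller than the target size.

    num_files and size_in_bytes are assumed to come from
    DESCRIBE DETAIL delta.`<path>` (columns numFiles and sizeInBytes).
    The 128 MB default is an illustrative threshold, not a pipeline setting.
    """
    if num_files <= 1:
        return False  # a single file cannot be combined with anything
    avg_bytes = size_in_bytes / num_files
    return avg_bytes < target_file_mb * 1024 * 1024


# Hypothetical numbers: 400 files totalling 2 GiB, roughly 5 MiB per file
print(needs_compaction(400, 2 * 1024**3))
```

In the notebook the inputs could be read with `spark.sql(f"DESCRIBE DETAIL delta.`{table_path}`").first()` and the check used to skip `OPTIMIZE` on tables that are already well compacted.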

### **2. Z-ORDER** - Data Clustering  
* Co-locates related data
* Enables data skipping
* Best for 2-4 columns
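
To see why clustering on several columns helps data skipping, the toy model below compares a plain sort with a Z-order (Morton code) layout. It is a pure-Python illustration under simplifying assumptions: an 8x8 grid of integer values, 4 rows per file, and skipping based only on per-file min/max. The function names are hypothetical, and Delta's actual Z-ORDER implementation differs in its details.

```python
def interleave_bits(x: int, y: int, bits: int = 3) -> int:
    """Morton (Z-order) code: interleave the low `bits` bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def files_scanned(rows, rows_per_file, column, value):
    """Count files whose min/max range for `column` could contain `value`."""
    scanned = 0
    for start in range(0, len(rows), rows_per_file):
        chunk = [row[column] for row in rows[start:start + rows_per_file]]
        if min(chunk) <= value <= max(chunk):
            scanned += 1
    return scanned


# Toy table: every (x, y) pair on an 8x8 grid, stored 4 rows per file (16 files)
rows = [{"x": x, "y": y} for x in range(8) for y in range(8)]

sorted_by_x = sorted(rows, key=lambda r: (r["x"], r["y"]))
z_ordered = sorted(rows, key=lambda r: interleave_bits(r["x"], r["y"]))

for name, layout in [("sorted by x", sorted_by_x), ("z-ordered", z_ordered)]:
    on_x = files_scanned(layout, 4, "x", 3)
    on_y = files_scanned(layout, 4, "y", 5)
    print(f"{name}: filter on x reads {on_x} files, filter on y reads {on_y} files")
```

With the plain sort, a filter on `x` reads 2 of the 16 files but a filter on `y` reads 8. With the Z-order layout both filters read 4 files, which is the trade-off Z-ORDER makes: slightly worse on the leading column, much better on the others.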

### **3. Liquid Clustering** ⭐ RECOMMENDED
* Automatic incremental clustering
* Self-optimizing
* Supports up to 4 clustering columns
* Replaces partitioning and Z-ORDER on a table; do not combine them
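
A clustering configuration can be checked before any SQL is sent to the cluster. The sketch below is a minimal validator that builds the `ALTER TABLE ... CLUSTER BY` statement, assuming the four-column limit documented by Databricks for liquid clustering. The helper name `build_cluster_by_sql` and the example path are hypothetical.

```python
MAX_CLUSTER_COLUMNS = 4  # assumed limit, per the Databricks liquid clustering docs


def build_cluster_by_sql(table_path: str, columns: list) -> str:
    """Validate the column list and return an ALTER TABLE ... CLUSTER BY statement."""
    if not columns:
        raise ValueError("at least one clustering column is required")
    if len(columns) > MAX_CLUSTER_COLUMNS:
        raise ValueError(
            f"{len(columns)} clustering columns given, at most {MAX_CLUSTER_COLUMNS} allowed"
        )
    if len(set(columns)) != len(columns):
        raise ValueError("clustering columns must be unique")
    quoted = ", ".join(f"`{name}`" for name in columns)
    return f"ALTER TABLE delta.`{table_path}` CLUSTER BY ({quoted})"


# Hypothetical path, for illustration only
print(build_cluster_by_sql("/tmp/silver/orders/", ["user_id", "order_date"]))
```

The returned string can be passed to `spark.sql(...)`. Existing rows are only reclustered when `OPTIMIZE` next runs on the table.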

### **4. Optimized Writes**
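* Rebalances data before each write to produce fewer, larger files
* Auto compaction (a separate setting) merges small files after a write
* Enable both at session start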
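
Session-level settings are lost when the session ends. An alternative is to store the same behavior as Delta table properties so it applies to every writer. The sketch below builds that statement using the `delta.autoOptimize.optimizeWrite` and `delta.autoOptimize.autoCompact` properties. The helper name and example path are hypothetical, and whether table-level settings suit this pipeline is an assumption.

```python
def build_auto_optimize_sql(table_path: str,
                            optimize_write: bool = True,
                            auto_compact: bool = True) -> str:
    """Return an ALTER TABLE statement that persists the two write settings."""
    props = {
        "delta.autoOptimize.optimizeWrite": optimize_write,
        "delta.autoOptimize.autoCompact": auto_compact,
    }
    # Table property values are written as quoted lowercase strings
    assignments = ", ".join(
        f"'{key}' = '{str(value).lower()}'" for key, value in props.items()
    )
    return f"ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES ({assignments})"


# Hypothetical path, for illustration only
print(build_auto_optimize_sql("/tmp/silver/orders/"))
```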

### **5. VACUUM**
* Removes data files that the table no longer references and that are older than the retention period
* Run monthly
* Saves 20-50% storage
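
Because VACUUM deletes files permanently, it can help to preview the work and to refuse short retention periods up front. The sketch below builds the statement with `DRY RUN` by default and rejects anything under 168 hours, which matches Delta's default retention check. The helper name `build_vacuum_sql` and the example path are hypothetical.

```python
DEFAULT_RETENTION_HOURS = 168  # 7 days, Delta's default retention threshold


def build_vacuum_sql(table_path: str,
                     retention_hours: int = DEFAULT_RETENTION_HOURS,
                     dry_run: bool = True) -> str:
    """Return a VACUUM statement, defaulting to a DRY RUN that deletes nothing."""
    if retention_hours < DEFAULT_RETENTION_HOURS:
        raise ValueError(
            f"retention of {retention_hours}h is below the {DEFAULT_RETENTION_HOURS}h "
            "default; shorter windows can break time travel and concurrent readers"
        )
    sql = f"VACUUM delta.`{table_path}` RETAIN {retention_hours} HOURS"
    return f"{sql} DRY RUN" if dry_run else sql


# Hypothetical path, for illustration only
print(build_vacuum_sql("/tmp/silver/orders/"))
```

Running the `DRY RUN` form first lists the files that would be removed. Calling the helper with `dry_run=False` yields the statement that actually deletes them.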

In [0]:
# Compact small files in Silver layer

print("=" * 70)
print("OPTIMIZING SILVER LAYER - FILE COMPACTION")
print("=" * 70)

if 'all_database_configs' not in dir() or not all_database_configs:
    print("\n❌ Please run Cell 1 (Configuration) first!")
else:
    # Optimize MySQL tables
    for config in all_database_configs:
        db_name = config['database_name']
        silver_path = config['silver_path']
        tables = config['tables']
        
        print(f"\nDatabase: {db_name}")
        
        for table_name in tables:
            try:
                table_path = f"{silver_path}{table_name}/"
                spark.sql(f"OPTIMIZE delta.`{table_path}`")
                print(f"  ✅ {table_name}: Optimized")
            except Exception as e:
                print(f"  ❌ {table_name}: {str(e)[:80]}")
    
    # Optimize Event Hub
    eventhub_name = "eventhub_events"
    silver_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/silver/eventhub/{eventhub_name}/"
    
    try:
        spark.sql(f"OPTIMIZE delta.`{silver_eventhub_path}`")
        print(f"\n  ✅ eventhub/{eventhub_name}: Optimized")
    except Exception as e:
        print(f"\n  ❌ eventhub/{eventhub_name}: {str(e)[:80]}")
    
    print(f"\n{'='*70}")
    print("✅ OPTIMIZE COMPLETE")
    print(f"{'='*70}")

In [0]:
# Apply Z-ORDER clustering for query performance

print("=" * 70)
print("APPLYING Z-ORDER CLUSTERING")
print("=" * 70)

# Z-ORDER columns for each table
zorder_config = {
    'customer_details': ['customer_id', 'load_date'],
    'customer_orders': ['order_id', 'customer_id'],
    'orders': ['order_id', 'user_id'],
    'users': ['user_id', 'country'],
    'monthly_active_users': ['year__of_month'],
    'departments': ['department_id'],
    'employees': ['employee_id', 'department_id'],
    'students_table': ['id', 'grade'],
    'eventhub_events': ['load_date', 'SequenceNumber']
}

if 'all_database_configs' not in dir() or not all_database_configs:
    print("\n❌ Please run Cell 1 first!")
else:
    # Z-ORDER MySQL tables
    for config in all_database_configs:
        db_name = config['database_name']
        silver_path = config['silver_path']
        tables = config['tables']
        
        print(f"\nDatabase: {db_name}")
        
        for table_name in tables:
            if table_name in zorder_config:
                zorder_cols = zorder_config[table_name]
                try:
                    table_path = f"{silver_path}{table_name}/"
                    zorder_clause = ', '.join(zorder_cols)
                    spark.sql(f"OPTIMIZE delta.`{table_path}` ZORDER BY ({zorder_clause})")
                    print(f"  ✅ {table_name}: Z-ORDERed by {', '.join(zorder_cols)}")
                except Exception as e:
                    print(f"  ❌ {table_name}: {str(e)[:80]}")
    
    # Z-ORDER Event Hub
    eventhub_name = "eventhub_events"
    if eventhub_name in zorder_config:
        zorder_cols = zorder_config[eventhub_name]
        silver_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/silver/eventhub/{eventhub_name}/"
        try:
            zorder_clause = ', '.join(zorder_cols)
            spark.sql(f"OPTIMIZE delta.`{silver_eventhub_path}` ZORDER BY ({zorder_clause})")
            print(f"\n  ✅ eventhub/{eventhub_name}: Z-ORDERed by {', '.join(zorder_cols)}")
        except Exception as e:
            print(f"\n  ❌ eventhub/{eventhub_name}: {str(e)[:80]}")
    
    print(f"\n{'='*70}")
    print("✅ Z-ORDER COMPLETE")
    print(f"{'='*70}")

In [0]:
# Enable Liquid Clustering (DBR 13.3+)

print("=" * 70)
print("ENABLING LIQUID CLUSTERING")
print("=" * 70)

clustering_config = {
    'customer_details': ['customer_id', 'load_date'],
    'customer_orders': ['customer_id', 'order_id', 'load_date'],
    'orders': ['user_id', 'order_date', 'load_date'],
    'users': ['user_id', 'country', 'load_date'],
    'monthly_active_users': ['year__of_month', 'load_date'],
    'departments': ['department_id', 'load_date'],
    'employees': ['department_id', 'employee_id', 'load_date'],
    'students_table': ['grade', 'id', 'load_date'],
    'eventhub_events': ['load_date', 'SequenceNumber']
}

if 'all_database_configs' not in dir() or not all_database_configs:
    print("\n❌ Please run Cell 1 first!")
else:
    for config in all_database_configs:
        db_name = config['database_name']
        silver_path = config['silver_path']
        tables = config['tables']
        
        print(f"\nDatabase: {db_name}")
        
        for table_name in tables:
            if table_name in clustering_config:
                cluster_cols = clustering_config[table_name]
                try:
                    table_path = f"{silver_path}{table_name}/"
                    cluster_clause = ', '.join(cluster_cols)
                    spark.sql(f"ALTER TABLE delta.`{table_path}` CLUSTER BY ({cluster_clause})")
                    print(f"  ✅ {table_name}: Liquid Clustering enabled")
                except Exception as e:
                    if "CLUSTER BY is only supported" in str(e):
                        print(f"  ⚠️  {table_name}: Not supported (use Z-ORDER instead)")
                    else:
                        print(f"  ❌ {table_name}: {str(e)[:80]}")
    
    # Event Hub
    eventhub_name = "eventhub_events"
    if eventhub_name in clustering_config:
        cluster_cols = clustering_config[eventhub_name]
        silver_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/silver/eventhub/{eventhub_name}/"
        try:
            cluster_clause = ', '.join(cluster_cols)
            spark.sql(f"ALTER TABLE delta.`{silver_eventhub_path}` CLUSTER BY ({cluster_clause})")
            print(f"\n  ✅ eventhub/{eventhub_name}: Liquid Clustering enabled")
        except Exception as e:
            if "CLUSTER BY is only supported" in str(e):
                print(f"\n  ⚠️  eventhub/{eventhub_name}: Not supported (use Z-ORDER)")
            else:
                print(f"\n  ❌ eventhub/{eventhub_name}: {str(e)[:80]}")
    
    print(f"\n{'='*70}")
    print("✅ LIQUID CLUSTERING COMPLETE")
    print(f"{'='*70}")

In [0]:
# Enable Optimized Writes for better write performance

print("=" * 70)
print("ENABLING OPTIMIZED WRITES")
print("=" * 70)

# Enable optimized writes
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
print("✅ spark.databricks.delta.optimizeWrite.enabled = true")

# Enable auto compaction
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
print("✅ spark.databricks.delta.autoCompact.enabled = true")

print("\n" + "=" * 70)
print("✅ OPTIMIZED WRITES ENABLED")
print("=" * 70)
print("\n💡 Applies to all future writes in this session")

In [0]:
# VACUUM removes old file versions (Run monthly)
# WARNING: Permanently deletes old versions

print("=" * 70)
print("VACUUM - CLEAN UP OLD FILE VERSIONS")
print("=" * 70)

print("\n⚠️  WARNING: Code is commented for safety")
print("   Uncomment to execute\n")

retention_hours = 168  # 7 days

# UNCOMMENT TO RUN:
# if 'all_database_configs' in dir() and all_database_configs:
#     for config in all_database_configs:
#         db_name = config['database_name']
#         silver_path = config['silver_path']
#         for table_name in config['tables']:
#             try:
#                 table_path = f"{silver_path}{table_name}/"
#                 spark.sql(f"VACUUM delta.`{table_path}` RETAIN {retention_hours} HOURS")
#                 print(f"  ✅ {db_name}.{table_name}: Vacuumed")
#             except Exception as e:
#                 print(f"  ❌ {db_name}.{table_name}: {str(e)[:80]}")
#     
#     # Event Hub
#     eventhub_name = "eventhub_events"
#     silver_eventhub_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/silver/eventhub/{eventhub_name}/"
#     try:
#         spark.sql(f"VACUUM delta.`{silver_eventhub_path}` RETAIN {retention_hours} HOURS")
#         print(f"\n  ✅ eventhub/{eventhub_name}: Vacuumed")
#     except Exception as e:
#         print(f"\n  ❌ eventhub/{eventhub_name}: {str(e)[:80]}")

print("\n💡 Uncomment code above to run VACUUM")