# Schema and Partition Evolution with Apache Iceberg

This tutorial covers two Apache Iceberg capabilities: schema evolution and partition evolution. Both let you change a table's structure over time as metadata operations, without rewriting existing data files or taking the table offline.

## Learning Objectives
- Master schema evolution techniques (add, rename, drop, reorder columns)
- Understand partition evolution and optimization strategies
- Learn to handle data type changes and compatibility
- Explore backward and forward compatibility patterns
- Practice evolution monitoring and best practices

## Prerequisites
- Completion of notebooks 1-5
- Understanding of data modeling concepts
- Basic knowledge of partitioning strategies

## Why Evolution Matters
- **Business Requirements Change**: New fields, different analytics needs
- **Performance Optimization**: Better partitioning for query patterns
- **Cost Efficiency**: Avoid expensive data rewrites
- **Zero Downtime**: Keep systems running during changes

## 1. Environment Setup and Current State Analysis

Let's begin by establishing our environment and understanding the current table structure.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    BooleanType, TimestampType, LongType, DoubleType
)
from pyspark.sql.functions import (
    col, current_timestamp, lit, count, 
    max as spark_max, min as spark_min, 
    coalesce, round as spark_round, avg
)
from datetime import datetime
import json

# Configuration constants
CATALOG_NAME = "rest.`play_iceberg`"
TABLE_NAME = "users"

In [2]:
def create_spark_session():
    """Create Spark session with evolution-friendly settings."""
    return SparkSession.builder \
        .appName("Iceberg Schema and Partition Evolution") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
        .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true") \
        .getOrCreate()

def print_session_info(spark):
    """Print session configuration information."""
    print(f"Spark Version: {spark.version}")
    print(f"Default Catalog: {spark.conf.get('spark.sql.defaultCatalog', 'Not set')}")

# Initialize Spark session
spark = create_spark_session()
print_session_info(spark)

# Set working catalog
spark.sql(f"USE {CATALOG_NAME}")
print(f"\nWorking with catalog: {CATALOG_NAME}")

25/07/01 13:48:09 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Spark Version: 3.5.5
Default Catalog: rest

Working with catalog: rest.`play_iceberg`


In [3]:
def analyze_table_structure(table_name):
    """Analyze and display current table structure and statistics."""
    print("=== CURRENT TABLE ANALYSIS ===\n")
    
    # Get detailed table information
    table_desc = spark.sql(f"DESCRIBE EXTENDED {table_name}")
    print("Current Schema and Partitioning:")
    table_desc.show(50, truncate=False)
    
    # Get current data statistics
    current_stats = spark.sql(f"""
        SELECT 
            COUNT(*) as total_records,
            COUNT(DISTINCT created_year) as distinct_years,
            COUNT(DISTINCT created_month) as distinct_months,
            COUNT(DISTINCT created_day) as distinct_days,
            COUNT(CASE WHEN is_active = true THEN 1 END) as active_users,
            COUNT(CASE WHEN is_active = false THEN 1 END) as inactive_users
        FROM {table_name}
    """)
    print("\nCurrent Data Statistics:")
    current_stats.show()

# Analyze current table
analyze_table_structure(TABLE_NAME)

=== CURRENT TABLE ANALYSIS ===

Current Schema and Partitioning:
+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                             |comment|
+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|user_id                     |bigint                                                                                                                |NULL   |
|username                    |string                                                                                                                |NULL   |
|email                       |string                                                                                                             

+-------------+--------------+---------------+-------------+------------+--------------+
|total_records|distinct_years|distinct_months|distinct_days|active_users|inactive_users|
+-------------+--------------+---------------+-------------+------------+--------------+
|         1006|             1|              2|            2|         506|           500|
+-------------+--------------+---------------+-------------+------------+--------------+

In [4]:
def show_partition_distribution(table_name):
    """Display current partition distribution and sample data."""
    print("Current Partition Distribution:")
    partition_stats = spark.sql(f"""
        SELECT 
            created_year,
            created_month,
            created_day,
            COUNT(*) as record_count,
            COUNT(CASE WHEN is_active = true THEN 1 END) as active_count,
            COUNT(CASE WHEN is_active = false THEN 1 END) as inactive_count
        FROM {table_name} 
        GROUP BY created_year, created_month, created_day
        ORDER BY created_year, created_month, created_day
    """)
    partition_stats.show()

    print("\nCurrent Sample Data:")
    sample_data = spark.sql(f"""
        SELECT user_id, username, email, is_active 
        FROM {table_name} 
        ORDER BY user_id 
        LIMIT 10
    """)
    sample_data.show()

# Show partition distribution
show_partition_distribution(TABLE_NAME)

Current Partition Distribution:
+------------+-------------+-----------+------------+------------+--------------+
|created_year|created_month|created_day|record_count|active_count|inactive_count|
+------------+-------------+-----------+------------+------------+--------------+
|        2025|            6|         27|         999|         499|           500|
|        2025|            7|          1|           7|           7|             0|
+------------+-------------+-----------+------------+------------+--------------+


Current Sample Data:
+-------+----------------+--------------------+---------+
|user_id|        username|               email|is_active|
+-------+----------------+--------------------+---------+
|      1|  updated_user_1|updated.user.1@ex...|     true|
|      2|  updated_user_2|updated.user.2@ex...|     true|
|      3|  updated_user_3|updated.user.3@ex...|     true|
|      4|  updated_user_4|updated.user.4@ex...|     true|
|      5|  updated_user_5|updated.user.5@ex...|

## 2. Schema Evolution - Adding Columns

Let's start with the most common schema evolution: adding new columns. This is a metadata-only operation that doesn't require rewriting existing data.
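
To see why adding a column needs no rewrite, it helps to model how Iceberg tracks columns: each column has a stable field ID, and data files store values by ID rather than by name. The sketch below is a simplified illustration of that idea, not Iceberg's actual reader. `ToySchema` and `read_row` are hypothetical names invented for this example, and the field IDs are illustrative.

```python
from typing import Any, Dict, Optional


class ToySchema:
    """Toy table schema mapping column names to stable field IDs."""

    def __init__(self) -> None:
        self.name_to_id: Dict[str, int] = {}
        self._next_id = 1

    def add_column(self, name: str) -> int:
        """Register a new column; only the schema metadata changes."""
        if name in self.name_to_id:
            raise ValueError(f"column already exists: {name}")
        field_id = self._next_id
        self.name_to_id[name] = field_id
        self._next_id += 1
        return field_id


def read_row(schema: ToySchema, stored: Dict[int, Any]) -> Dict[str, Optional[Any]]:
    """Project a stored row (keyed by field ID) onto the current schema.

    Field IDs that are missing from the stored row come back as None (NULL).
    """
    return {name: stored.get(fid) for name, fid in schema.name_to_id.items()}


schema = ToySchema()
schema.add_column("user_id")    # field ID 1
schema.add_column("username")   # field ID 2

# A row written before the schema change: it only contains field IDs 1 and 2.
old_row = {1: 1, 2: "updated_user_1"}

schema.add_column("country")    # field ID 3, added after old_row was written

print(read_row(schema, old_row))
```

The old row is never touched; it reads back with `country` set to `None`, which mirrors the NULL values that existing records show for newly added columns.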

In [5]:
def add_schema_columns(table_name):
    """Add multiple new columns to demonstrate schema evolution."""
    print("=== SCHEMA EVOLUTION: ADDING COLUMNS ===\n")
    
    columns_to_add = [
        ("country", "STRING", "Geographic location"),
        ("registration_source", "STRING", "Source of user registration (web, mobile, api)"),
        ("user_score", "DOUBLE", "User engagement score"),
        ("last_login_at", "TIMESTAMP", "Last login timestamp")
    ]
    
    for i, (col_name, col_type, description) in enumerate(columns_to_add, 1):
        print(f"{i}. Adding '{col_name}' column...")
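        # Only registration_source is created with a COMMENT; the other
        # descriptions in columns_to_add are not written to the table.
        if description and "Source of" in description: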
            spark.sql(f"ALTER TABLE {table_name} ADD COLUMN {col_name} {col_type} COMMENT '{description}'")
        else:
            spark.sql(f"ALTER TABLE {table_name} ADD COLUMN {col_name} {col_type}")
        print(f"   ✓ {col_name} column added")
        if i < len(columns_to_add):
            print()

# Add new columns
add_schema_columns(TABLE_NAME)

=== SCHEMA EVOLUTION: ADDING COLUMNS ===

1. Adding 'country' column...
   ✓ country column added

2. Adding 'registration_source' column...
   ✓ registration_source column added

3. Adding 'user_score' column...
   ✓ user_score column added

4. Adding 'last_login_at' column...
   ✓ last_login_at column added


In [6]:
def verify_schema_evolution(table_name):
    """Verify the schema evolution and show updated structure."""
    print("Updated Schema:")
    spark.sql(f"DESCRIBE TABLE {table_name}").show(20, truncate=False)

    print("\nExisting Data (new columns show as NULL):")
    evolved_data = spark.sql(f"""
        SELECT 
            user_id, username, is_active,
            country, registration_source, user_score, last_login_at
        FROM {table_name} 
        ORDER BY user_id 
        LIMIT 5
    """)
    evolved_data.show(truncate=False)

# Verify schema changes
verify_schema_evolution(TABLE_NAME)

Updated Schema:
+-----------------------+-------------+----------------------------------------------+
|col_name               |data_type    |comment                                       |
+-----------------------+-------------+----------------------------------------------+
|user_id                |bigint       |NULL                                          |
|username               |string       |NULL                                          |
|email                  |string       |NULL                                          |
|is_active              |boolean      |NULL                                          |
|created_year           |int          |NULL                                          |
|created_month          |int          |NULL                                          |
|created_day            |int          |NULL                                          |
|updated_at             |timestamp_ntz|NULL                                          |
|country                |st

In [7]:
def insert_evolved_data(table_name):
    """Insert new data using the evolved schema."""
    print("Inserting new records with evolved schema:")
    
    insert_query = f"""
        INSERT INTO {table_name} 
        (user_id, username, email, is_active, created_year, created_month, created_day,
         updated_at, country, registration_source, user_score, last_login_at)
        VALUES 
        (2001, 'global_user_1', 'global1@company.com', true, 2025, 6, 27,
         current_timestamp(), 'Canada', 'web', 85.5, current_timestamp()),
        (2002, 'global_user_2', 'global2@company.com', true, 2025, 6, 27,
         current_timestamp(), 'Germany', 'mobile', 92.0, current_timestamp()),
        (2003, 'global_user_3', 'global3@company.com', false, 2025, 6, 27,
         current_timestamp(), 'Japan', 'api', 78.3, null)
    """
    
    spark.sql(insert_query)
    print("✓ New users inserted successfully")

# Insert evolved data
insert_evolved_data(TABLE_NAME)

Inserting new records with evolved schema:
✓ New users inserted successfully


In [8]:
def show_mixed_schema_data(table_name):
    """Display data showing both old and new schema versions."""
    print("Data with Mixed Schema Evolution:")
    mixed_data = spark.sql(f"""
        SELECT 
            user_id, username, country, registration_source, user_score,
            CASE 
                WHEN country IS NULL THEN 'Legacy User'
                ELSE 'New Schema User'
            END as user_type
        FROM {table_name} 
        WHERE user_id IN (1, 2, 3, 2001, 2002, 2003)
        ORDER BY user_id
    """)
    mixed_data.show(truncate=False)

    print("\nSchema Version Statistics:")
    version_stats = spark.sql(f"""
        SELECT 
            CASE 
                WHEN country IS NULL THEN 'Original Schema'
                ELSE 'Evolved Schema'
            END as schema_version,
            COUNT(*) as user_count,
            COUNT(CASE WHEN is_active = true THEN 1 END) as active_count
        FROM {table_name} 
        GROUP BY CASE WHEN country IS NULL THEN 'Original Schema' ELSE 'Evolved Schema' END
    """)
    version_stats.show()

# Show mixed schema data
show_mixed_schema_data(TABLE_NAME)

Data with Mixed Schema Evolution:
+-------+--------------+-------+-------------------+----------+---------------+
|user_id|username      |country|registration_source|user_score|user_type      |
+-------+--------------+-------+-------------------+----------+---------------+
|1      |updated_user_1|NULL   |NULL               |NULL      |Legacy User    |
|2      |updated_user_2|NULL   |NULL               |NULL      |Legacy User    |
|3      |updated_user_3|NULL   |NULL               |NULL      |Legacy User    |
|2001   |global_user_1 |Canada |web                |85.5      |New Schema User|
|2002   |global_user_2 |Germany|mobile             |92.0      |New Schema User|
|2003   |global_user_3 |Japan  |api                |78.3      |New Schema User|
+-------+--------------+-------+-------------------+----------+---------------+


Schema Version Statistics:
+---------------+----------+------------+
| schema_version|user_count|active_count|
+---------------+----------+------------+
| Evolved S

## 3. Schema Evolution - Column Operations

This section covers two more metadata-only schema operations: renaming a column and updating a column comment.
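
A rename is safe for existing data because Iceberg resolves columns by field ID, so only the name attached to an ID changes. The following is a minimal sketch of that idea under simplifying assumptions; `rename_column` and `read_row` are hypothetical helpers, and the field IDs are made up for illustration.

```python
from typing import Any, Dict, Optional


def rename_column(name_to_id: Dict[str, int], old: str, new: str) -> Dict[str, int]:
    """Return a new name-to-ID mapping with `old` renamed to `new`.

    The field ID is preserved, so stored data does not need to change.
    """
    if old not in name_to_id:
        raise KeyError(f"no such column: {old}")
    if new in name_to_id:
        raise ValueError(f"column already exists: {new}")
    return {(new if name == old else name): fid for name, fid in name_to_id.items()}


def read_row(name_to_id: Dict[str, int], stored: Dict[int, Any]) -> Dict[str, Optional[Any]]:
    """Look up each column's value in a stored row keyed by field ID."""
    return {name: stored.get(fid) for name, fid in name_to_id.items()}


before = {"user_id": 1, "username": 2, "user_score": 3}

# A stored row keyed by field ID; it is the same object before and after the rename.
stored_row = {1: 2001, 2: "global_user_1", 3: 85.5}

after = rename_column(before, "user_score", "engagement_score")

print(read_row(before, stored_row))
print(read_row(after, stored_row))
```

Both reads return the value 85.5 for field ID 3; only the column name attached to it differs.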

In [9]:
def perform_column_operations(table_name):
    """Demonstrate advanced column operations like renaming."""
    print("=== ADVANCED SCHEMA OPERATIONS ===\n")
    
    # Check if user_score column exists before renaming
    schema_df = spark.sql(f"DESCRIBE TABLE {table_name}")
    column_names = [row['col_name'] for row in schema_df.collect()]
    
    if 'user_score' in column_names:
        print("1. Renaming 'user_score' to 'engagement_score'...")
        spark.sql(f"ALTER TABLE {table_name} RENAME COLUMN user_score TO engagement_score")
        print("   ✓ Column renamed successfully")
    elif 'engagement_score' in column_names:
        print("1. Column 'user_score' already renamed to 'engagement_score'")
        print("   ✓ Column rename previously completed")
    else:
        print("1. Neither 'user_score' nor 'engagement_score' found in schema")

    print("\n2. Updating column comment for engagement_score...")
    try:
        spark.sql(f"""
            ALTER TABLE {table_name} ALTER COLUMN engagement_score 
            COMMENT 'User engagement score (0-100 scale)'
        """)
        print("   ✓ Column comment updated")
    except Exception as e:
        print(f"   ⚠ Comment update failed: {str(e)}")

def show_updated_schema(table_name):
    """Show the updated schema after column operations."""
    print("\nUpdated schema after column operations:")
    schema_info = spark.sql(f"DESCRIBE TABLE {table_name}")
    schema_info.show(truncate=False)

# Perform column operations
perform_column_operations(TABLE_NAME)
show_updated_schema(TABLE_NAME)

=== ADVANCED SCHEMA OPERATIONS ===

1. Renaming 'user_score' to 'engagement_score'...
   ✓ Column renamed successfully

2. Updating column comment for engagement_score...
   ✓ Column comment updated

Updated schema after column operations:
+-----------------------+-------------+----------------------------------------------+
|col_name               |data_type    |comment                                       |
+-----------------------+-------------+----------------------------------------------+
|user_id                |bigint       |NULL                                          |
|username               |string       |NULL                                          |
|email                  |string       |NULL                                          |
|is_active              |boolean      |NULL                                          |
|created_year           |int          |NULL                                          |
|created_month          |int          |NULL                     

## 4. Partition Evolution - Adding New Partition Fields

Now let's explore partition evolution, which changes the partition layout used for newly written data so that it better matches common query filters.
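
Partition evolution does not rewrite existing files: each data file stays tied to the partition spec it was written under, and query planning handles each spec separately. The sketch below is a toy model of that behavior, not Iceberg's planner. `SPECS`, `DataFile`, `plan_scan`, and the file names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

# Partition specs by spec ID (toy model of a table before and after evolution).
SPECS: Dict[int, Tuple[str, ...]] = {
    0: ("created_year", "created_month", "created_day"),
    1: ("created_year", "created_month", "created_day", "is_active"),
}


@dataclass
class DataFile:
    path: str                  # hypothetical file name, for illustration only
    spec_id: int               # spec the file was written under
    partition: Dict[str, Any]  # partition values recorded for the file


def plan_scan(files: List[DataFile], column: str, value: Any) -> List[DataFile]:
    """Return the files that may contain rows where `column == value`."""
    selected = []
    for f in files:
        if column not in SPECS[f.spec_id]:
            # Written before `column` became a partition field: partition
            # values cannot rule this file out, so it has to be read.
            selected.append(f)
        elif f.partition[column] == value:
            selected.append(f)
    return selected


files = [
    DataFile("old-0.parquet", 0,
             {"created_year": 2025, "created_month": 6, "created_day": 27}),
    DataFile("new-0.parquet", 1,
             {"created_year": 2025, "created_month": 6, "created_day": 28,
              "is_active": True}),
    DataFile("new-1.parquet", 1,
             {"created_year": 2025, "created_month": 6, "created_day": 28,
              "is_active": False}),
]

for f in plan_scan(files, "is_active", True):
    print(f.path, "spec", f.spec_id)
```

In this model the file written under the old spec is always kept for an `is_active` filter, while files written under the new spec can be skipped by partition value. Real Iceberg planning may also skip files using column statistics stored in its metadata, which this sketch ignores.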

In [10]:
def analyze_partition_patterns(table_name):
    """Analyze current query patterns for partition evolution decisions."""
    print("=== PARTITION EVOLUTION ANALYSIS ===\n")
    print("Current partitioning: created_year, created_month, created_day")
    print("\nAnalyzing query patterns that would benefit from additional partitioning...")

    # Distribution by is_active (common filter)
    print("\nDistribution by is_active (common filter):")
    active_distribution = spark.sql(f"""
        SELECT 
            is_active,
            COUNT(*) as user_count,
            ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM {table_name}), 2) as percentage
        FROM {table_name} 
        GROUP BY is_active
        ORDER BY is_active
    """)
    active_distribution.show()

    # Distribution by country (for geographic queries)
    print("\nDistribution by country:")
    country_distribution = spark.sql(f"""
        SELECT 
            COALESCE(country, 'Unknown') as country,
            COUNT(*) as user_count
        FROM {table_name} 
        GROUP BY country
        ORDER BY user_count DESC
    """)
    country_distribution.show()

# Analyze partition patterns
analyze_partition_patterns(TABLE_NAME)

=== PARTITION EVOLUTION ANALYSIS ===

Current partitioning: created_year, created_month, created_day

Analyzing query patterns that would benefit from additional partitioning...

Distribution by is_active (common filter):
+---------+----------+----------+
|is_active|user_count|percentage|
+---------+----------+----------+
|    false|       501|     49.65|
|     true|       508|     50.35|
+---------+----------+----------+


Distribution by country:
+-------+----------+
|country|user_count|
+-------+----------+
|Unknown|      1006|
|Germany|         1|
| Canada|         1|
|  Japan|         1|
+-------+----------+



In [11]:
def evolve_partitioning(table_name):
    """Add new partition fields for better query optimization."""
    print("Adding is_active as partition field...")
    try:
        spark.sql(f"ALTER TABLE {table_name} ADD PARTITION FIELD is_active")
        print("✓ is_active partition field added")
    except Exception as e:
        if "already exists" in str(e) or "Partition field" in str(e):
            print("✓ is_active partition field already exists")
        else:
            print(f"⚠ Partition field addition failed: {str(e)}")

def show_partition_info(table_name):
    """Display updated partition information."""
    print("\nUpdated partition information:")
    partition_info = spark.sql(f"DESCRIBE EXTENDED {table_name}")
    
    # Filter for partition-related information
    relevant_info = partition_info.filter(
        (col("col_name").contains("Partition")) |
        (col("col_name").isin(["created_year", "created_month", "created_day", "is_active"])) |
        (col("col_name") == "# col_name")
    )
    relevant_info.show(truncate=False)

# Evolve partitioning
evolve_partitioning(TABLE_NAME)
show_partition_info(TABLE_NAME)

Adding is_active as partition field...
✓ is_active partition field added

Updated partition information:
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|is_active              |boolean  |NULL   |
|created_year           |int      |NULL   |
|created_month          |int      |NULL   |
|created_day            |int      |NULL   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|created_year           |int      |NULL   |
|created_month          |int      |NULL   |
|created_day            |int      |NULL   |
|is_active              |boolean  |NULL   |
+-----------------------+---------+-------+



In [12]:
# Insert data that will use the new partition specification
print("Inserting data with new partition specification:")

spark.sql("""
    INSERT INTO users 
    (user_id, username, email, is_active, created_year, created_month, created_day, 
     updated_at, country, registration_source, engagement_score, last_login_at)
    VALUES 
    (3001, 'partition_test_active', 'active@test.com', true, 2025, 6, 28, 
     current_timestamp(), 'USA', 'web', 88.0, current_timestamp()),
    (3002, 'partition_test_inactive', 'inactive@test.com', false, 2025, 6, 28, 
     current_timestamp(), 'UK', 'mobile', 45.0, null),
    (3003, 'partition_test_active2', 'active2@test.com', true, 2025, 6, 28, 
     current_timestamp(), 'France', 'api', 91.5, current_timestamp())
""")

print("✓ Test data inserted with new partitioning")

Inserting data with new partition specification:
✓ Test data inserted with new partitioning


In [13]:
def add_geographic_partitioning(table_name):
    """Add geographic partitioning for optimization."""
    print("\nAdding country-based partitioning for geographic optimization...")
    try:
        spark.sql(f"ALTER TABLE {table_name} ADD PARTITION FIELD bucket(5, country)")
        print("✓ Country bucket partition field added (5 buckets)")
    except Exception as e:
        if "already exists" in str(e) or "Partition field" in str(e):
            print("✓ Country bucket partition field already exists")
        else:
            print(f"⚠ Country bucket partition addition failed: {str(e)}")

def show_final_partition_spec(table_name):
    """Show final partition specification."""
    print("\nFinal partition specification:")
    final_partition_info = spark.sql(f"DESCRIBE EXTENDED {table_name}")
    
    # Show partition-related rows
    partition_rows = []
    for row in final_partition_info.collect():
        col_name = row['col_name']
        if ('Partition' in col_name or 
            col_name in ['created_year', 'created_month', 'created_day', 'is_active'] or
            col_name == '# col_name' or
            'bucket' in str(row['data_type']).lower()):
            partition_rows.append(row)
    
    if partition_rows:
        partition_df = spark.createDataFrame(partition_rows)
        partition_df.show(truncate=False)
    else:
        print("No partition information found")

# Add geographic partitioning
add_geographic_partitioning(TABLE_NAME)
show_final_partition_spec(TABLE_NAME)


Adding country-based partitioning for geographic optimization...
✓ Country bucket partition field added (5 buckets)

Final partition specification:
+--------------+-------------------------------------------------------------------------------------------------+-------+
|col_name      |data_type                                                                                        |comment|
+--------------+-------------------------------------------------------------------------------------------------+-------+
|is_active     |boolean                                                                                          |NULL   |
|created_year  |int                                                                                              |NULL   |
|created_month |int                                                                                              |NULL   |
|created_day   |int                                                                                              |NULL   |
|# Partitioning|                                                                                                 |       |
|Part 4        |
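
The `bucket(5, country)` transform used above hashes each country value into one of five buckets. The Iceberg table specification defines bucketing in terms of a 32-bit Murmur3 hash. The sketch below is a pure-Python approximation of that computation for string values, written for illustration and not verified against Iceberg's own implementation. `murmur3_32` and `bucket` are names chosen for this example.

```python
from typing import Optional


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit unsigned integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant), returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n_blocks = len(data) // 4

    # Body: process the input four bytes at a time (little-endian).
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: force the bits to avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def bucket(value: Optional[str], num_buckets: int) -> Optional[int]:
    """Map a string to a bucket number in [0, num_buckets); NULL stays NULL."""
    if value is None:
        return None
    hashed = murmur3_32(value.encode("utf-8"))
    return (hashed & 0x7FFFFFFF) % num_buckets


for country in ["Canada", "Germany", "Japan", "USA", "UK", "France", None]:
    print(country, "->", bucket(country, 5))
```

With six distinct countries and five buckets, at least two countries must share a bucket, so a bucket partition groups several countries together instead of isolating one. Rows with a NULL country get a NULL partition value instead of a bucket number.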

## 5. Evolution Impact Analysis

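Let's look at how records are distributed across the evolved partition fields and in which evolution phase each group of records was written, then run queries that filter and group on the new fields.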

In [14]:
# Analyze partition distribution after evolution
print("=== EVOLUTION IMPACT ANALYSIS ===")
print()

# Show how data is distributed across partitions
partition_analysis = spark.sql("""
    SELECT 
        created_year,
        created_month,
        created_day,
        is_active,
        COUNT(*) as record_count,
        COUNT(DISTINCT country) as distinct_countries
    FROM users 
    GROUP BY created_year, created_month, created_day, is_active
    ORDER BY created_year, created_month, created_day, is_active
""")

print("Partition Distribution Analysis:")
partition_analysis.show()

# Show schema evolution timeline
print("\nSchema Evolution Timeline:")
evolution_timeline = spark.sql("""
    SELECT 
        CASE 
            WHEN country IS NULL THEN 'Phase 1: Original Schema'
            WHEN user_id < 3000 THEN 'Phase 2: Schema + Geography'
            ELSE 'Phase 3: Full Evolution'
        END as evolution_phase,
        COUNT(*) as user_count,
        MIN(user_id) as min_user_id,
        MAX(user_id) as max_user_id
    FROM users 
    GROUP BY CASE 
        WHEN country IS NULL THEN 'Phase 1: Original Schema'
        WHEN user_id < 3000 THEN 'Phase 2: Schema + Geography'
        ELSE 'Phase 3: Full Evolution'
    END
    ORDER BY min_user_id
""")
evolution_timeline.show(truncate=False)

=== EVOLUTION IMPACT ANALYSIS ===

Partition Distribution Analysis:
+------------+-------------+-----------+---------+------------+------------------+
|created_year|created_month|created_day|is_active|record_count|distinct_countries|
+------------+-------------+-----------+---------+------------+------------------+
|        2025|            6|         27|    false|         501|                 1|
|        2025|            6|         27|     true|         501|                 2|
|        2025|            6|         28|    false|           1|                 1|
|        2025|            6|         28|     true|           2|                 2|
|        2025|            7|          1|     true|           7|                 0|
+------------+-------------+-----------+---------+------------+------------------+


Schema Evolution Timeline:
+---------------------------+----------+-----------+-----------+
|evolution_phase            |user_count|min_user_id|max_user_id|
+-------------------------

In [15]:
# Test queries that benefit from partition evolution
print("Testing queries optimized by partition evolution:")
print()

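# Query 1: Filter by is_active (now a partition field).
# Only files written after the partition change are laid out by is_active;
# older files keep their original partitioning.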
print("1. Query filtering by is_active (leverages new partitioning):")
active_users_query = spark.sql("""
    SELECT 
        country,
        registration_source,
        COUNT(*) as active_users,
        AVG(engagement_score) as avg_engagement
    FROM users 
    WHERE is_active = true
    GROUP BY country, registration_source
    ORDER BY active_users DESC
""")
active_users_query.show()

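# Query 2: Geographic analysis grouped by country.
# Note: with no filter on country, this aggregation still scans every bucket;
# bucket pruning applies when a query filters on specific country values.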
print("\n2. Geographic analysis (leverages country bucketing):")
geographic_query = spark.sql("""
    SELECT 
        COALESCE(country, 'Unknown') as country,
        COUNT(*) as total_users,
        COUNT(CASE WHEN is_active = true THEN 1 END) as active_users,
        ROUND(AVG(engagement_score), 2) as avg_engagement,
        COUNT(CASE WHEN last_login_at IS NOT NULL THEN 1 END) as users_with_login
    FROM users 
    GROUP BY country
    ORDER BY total_users DESC
""")
geographic_query.show()

Testing queries optimized by partition evolution:

1. Query filtering by is_active (leverages new partitioning):
+-------+-------------------+------------+--------------+
|country|registration_source|active_users|avg_engagement|
+-------+-------------------+------------+--------------+
|   NULL|               NULL|         506|          NULL|
| France|                api|           1|          91.5|
|    USA|                web|           1|          88.0|
| Canada|                web|           1|          85.5|
|Germany|             mobile|           1|          92.0|
+-------+-------------------+------------+--------------+


2. Geographic analysis (leverages country bucketing):
+-------+-----------+------------+--------------+----------------+
|country|total_users|active_users|avg_engagement|users_with_login|
+-------+-----------+------------+--------------+----------------+
|Unknown|       1006|         506|          NULL|               0|
|Germany|          1|           1|       

## 6. Evolution Best Practices and Monitoring

Let's explore monitoring and best practices for managing evolving schemas.
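
One thing to watch when deriving metrics from `DESCRIBE TABLE` is that, for a partitioned table, the result can include partition rows after the column rows (the output earlier in this notebook shows entries such as `# Partitioning` and `Part 4`). The helper below is a minimal sketch, assuming rows arrive as `(col_name, data_type)` tuples. `split_describe_rows` is a hypothetical name, and the sample rows are illustrative values, not copied from a real run.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[str], Optional[str]]


def split_describe_rows(rows: Sequence[Row]) -> Tuple[List[Row], List[Row]]:
    """Split DESCRIBE TABLE rows into (column rows, trailing metadata rows).

    Column rows end at the first row whose name is empty or starts with '#'.
    """
    columns: List[Row] = []
    extras: List[Row] = []
    in_columns = True
    for col_name, data_type in rows:
        name = (col_name or "").strip()
        if in_columns and (not name or name.startswith("#")):
            in_columns = False
        (columns if in_columns else extras).append((name, data_type))
    return columns, extras


# Illustrative rows shaped like the notebook's DESCRIBE output (assumed values).
sample_rows = [
    ("user_id", "bigint"),
    ("username", "string"),
    ("is_active", "boolean"),
    ("country", "string"),
    ("", ""),
    ("# Partitioning", ""),
    ("Part 0", "is_active"),
    ("Part 1", "bucket(5, country)"),
]

columns, extras = split_describe_rows(sample_rows)
print(len(columns), "columns:", [name for name, _ in columns])
print(len(extras), "non-column rows")
```

In the notebook, the input could be built with `[(r['col_name'], r['data_type']) for r in schema_df.collect()]`. Nullability is not part of this output; the DataFrame schema (`spark.table(table_name).schema`) exposes a `nullable` flag per field instead.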

In [16]:
def monitor_schema_complexity(table_name):
    """Monitor schema complexity and data quality metrics."""
    print("=== EVOLUTION MONITORING AND BEST PRACTICES ===\n")
    
    # Get schema information
    schema_df = spark.sql(f"DESCRIBE TABLE {table_name}")
    schema_rows = schema_df.collect()
    
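    # Calculate complexity metrics.
    # Caveat: for a partitioned table, DESCRIBE TABLE also returns partition rows
    # (for example "Part 0"), so total_columns can exceed the real column count.
    total_columns = len([
        row for row in schema_rows
        if row['col_name'] and not row['col_name'].startswith('#')
    ])
    # Caveat: DESCRIBE TABLE output does not report nullability, so this counts
    # every row that has a data_type rather than the columns that allow NULLs.
    nullable_columns = len([
        row for row in schema_rows
        if row['data_type'] and 'nullable' not in row['data_type']
    ])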
    
    # Create metrics DataFrame
    metrics_data = [
        ('Schema Complexity', str(total_columns)),
        ('Nullable Columns', str(nullable_columns))
    ]
    
    metrics_df = spark.createDataFrame(metrics_data, ['metric', 'value'])
    print("Schema Complexity Metrics:")
    metrics_df.show()

def monitor_data_quality(table_name):
    """Monitor data quality across evolution phases."""
    data_quality = spark.sql(f"""
        SELECT 
            'Total Records' as metric,
            CAST(COUNT(*) as STRING) as value
        FROM {table_name}
        
        UNION ALL
        
        SELECT 
            'Records with Country' as metric,
            CAST(COUNT(*) as STRING) as value
        FROM {table_name} WHERE country IS NOT NULL
        
        UNION ALL
        
        SELECT 
            'Records with Engagement Score' as metric,
            CAST(COUNT(*) as STRING) as value
        FROM {table_name} WHERE engagement_score IS NOT NULL
        
        UNION ALL
        
        SELECT 
            'Records with Last Login' as metric,
            CAST(COUNT(*) as STRING) as value
        FROM {table_name} WHERE last_login_at IS NOT NULL
        
        ORDER BY metric
    """)
    print("\nData Quality Across Evolution:")
    data_quality.show(truncate=False)

# Monitor evolution
monitor_schema_complexity(TABLE_NAME)
monitor_data_quality(TABLE_NAME)

=== EVOLUTION MONITORING AND BEST PRACTICES ===

Schema Complexity Metrics:
+-----------------+-----+
|           metric|value|
+-----------------+-----+
|Schema Complexity|   17|
| Nullable Columns|   17|
+-----------------+-----+


Data Quality Across Evolution:
+-----------------------------+-----+
|metric                       |value|
+-----------------------------+-----+
|Records with Country         |6    |
|Records with Engagement Score|6    |
|Records with Last Login      |4    |
|Total Records                |1012 |
+-----------------------------+-----+



In [17]:
# Demonstrate backward compatibility
print("Demonstrating Backward Compatibility:")
print()

# Old-style query (original schema) still works
print("1. Original schema query (still functional):")
original_query = spark.sql("""
    SELECT 
        user_id, username, email, is_active, 
        created_year, created_month, created_day
    FROM users 
    WHERE created_year = 2025 AND created_month = 6
    ORDER BY user_id
    LIMIT 5
""")
original_query.show()

# New schema query with all features
print("\n2. Full evolved schema query:")
evolved_query = spark.sql("""
    SELECT 
        user_id, username, country, registration_source,
        engagement_score, is_active,
        CASE 
            WHEN last_login_at IS NOT NULL THEN 'Recent User'
            WHEN country IS NOT NULL THEN 'Geographic User'
            ELSE 'Legacy User'
        END as user_category
    FROM users 
    WHERE user_id IN (1, 2001, 3001)
    ORDER BY user_id
""")
evolved_query.show(truncate=False)

Demonstrating Backward Compatibility:

1. Original schema query (still functional):
+-------+----------------+--------------------+---------+------------+-------------+-----------+
|user_id|        username|               email|is_active|created_year|created_month|created_day|
+-------+----------------+--------------------+---------+------------+-------------+-----------+
|      8|new_premium_user| premium@company.com|     true|        2025|            6|         27|
|    100|   bulk_user_100|bulk.user.100@exa...|     true|        2025|            6|         27|
|    101|   bulk_user_101|bulk.user.101@exa...|     true|        2025|            6|         27|
|    102|   bulk_user_102|bulk.user.102@exa...|    false|        2025|            6|         27|
|    103|   bulk_user_103|bulk.user.103@exa...|     true|        2025|            6|         27|
+-------+----------------+--------------------+---------+------------+-------------+-----------+


2. Full evolved schema query:
+-------+--
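
The original-schema query keeps working because Iceberg resolves columns by field ID rather than by name or position, so a data file written before a column existed simply yields NULL for that column. The toy model below illustrates the idea in plain Python. The field IDs, the `read_row` helper, and the sample values are assumptions made for illustration and do not reflect this table's actual metadata.

```python
def read_row(file_row_by_id, current_schema):
    """Project a stored row (keyed by field ID) onto the current schema.

    Toy model: fields absent from the data file come back as None,
    mirroring how columns added later read as NULL for older files.
    """
    return {name: file_row_by_id.get(field_id) for field_id, name in current_schema}


# Hypothetical field IDs; real IDs live in the table metadata.
current_schema = [(1, "user_id"), (2, "username"), (9, "country")]

# A row from a file written before field 9 (country) was added.
old_file_row = {1: 100, 2: "bulk_user_100"}

row = read_row(old_file_row, current_schema)
print(row)
```

A rename fits the same model: only the name paired with an ID changes, so files written under the old name still resolve.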

## 7. Advanced Evolution Scenarios

Inspect partition metadata across partition spec versions and measure how widely the newly added columns are populated.

In [18]:
def demonstrate_advanced_scenarios():
    """List the table's partitions and the partition spec each was written under."""
    print("=== ADVANCED EVOLUTION SCENARIOS ===\n")
    
    print("Current partition fields before optimization:")
    try:
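        # Query the Iceberg `partitions` metadata table. spec_id identifies the
        # partition spec that each group of files was written under.
        current_partitions = spark.sql(f"""
            SELECT partition, spec_id, record_count, file_count
            FROM {CATALOG_NAME}.{TABLE_NAME}.partitions
        """)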
        current_partitions.show()
    except Exception as e:
        print(f"Note: Partition metadata query not available in this environment: {e}")
        print("Partition fields can be viewed in DESCRIBE EXTENDED output")

def analyze_evolution_benefits(table_name):
    """Create a summary of evolution benefits."""
    print("\nEvolution Benefits Analysis:")
    benefits_analysis = spark.sql(f"""
        WITH evolution_stats AS (
            SELECT 
                COUNT(*) as total_users,
                COUNT(CASE WHEN country IS NOT NULL THEN 1 END) as users_with_geography,
                COUNT(CASE WHEN engagement_score IS NOT NULL THEN 1 END) as users_with_scores,
                COUNT(CASE WHEN registration_source IS NOT NULL THEN 1 END) as users_with_source,
                COUNT(DISTINCT COALESCE(country, 'Unknown')) as distinct_countries,
                COUNT(DISTINCT created_year || '-' || created_month || '-' || created_day) as distinct_date_partitions
            FROM {table_name}
        )
        SELECT 
            'Total Users' as metric,
            CAST(total_users as STRING) as value,
            '100%' as coverage
        FROM evolution_stats
        
        UNION ALL
        
        SELECT 
            'Geographic Coverage' as metric,
            CAST(users_with_geography as STRING) as value,
            CAST(ROUND(users_with_geography * 100.0 / total_users, 1) as STRING) || '%' as coverage
        FROM evolution_stats
        
        UNION ALL
        
        SELECT 
            'Engagement Scoring' as metric,
            CAST(users_with_scores as STRING) as value,
            CAST(ROUND(users_with_scores * 100.0 / total_users, 1) as STRING) || '%' as coverage
        FROM evolution_stats
        
        UNION ALL
        
        SELECT 
            'Source Tracking' as metric,
            CAST(users_with_source as STRING) as value,
            CAST(ROUND(users_with_source * 100.0 / total_users, 1) as STRING) || '%' as coverage
        FROM evolution_stats
    """)
    benefits_analysis.show(truncate=False)

# Demonstrate advanced scenarios
demonstrate_advanced_scenarios()
analyze_evolution_benefits(TABLE_NAME)

=== ADVANCED EVOLUTION SCENARIOS ===

Current partition fields before optimization:
+--------------------+-------+------------+----------+
|           partition|spec_id|record_count|file_count|
+--------------------+-------+------------+----------+
|{2025, 6, 28, tru...|      1|           2|         1|
|{2025, 7, 1, NULL...|      0|           7|         1|
|{2025, 6, 28, fal...|      1|           1|         1|
|{2025, 6, 27, NUL...|      0|        1002|         3|
+--------------------+-------+------------+----------+


Evolution Benefits Analysis:
+-------------------+-----+--------+
|metric             |value|coverage|
+-------------------+-----+--------+
|Total Users        |1012 |100%    |
|Geographic Coverage|6    |0.6%    |
|Engagement Scoring |6    |0.6%    |
|Source Tracking    |6    |0.6%    |
+-------------------+-----+--------+
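
The partition listing above mixes rows from two partition specs (`spec_id` 0 and 1), which is how partition evolution shows up in metadata: files written before the change keep their old layout. As a rough sketch, the rows can be totalled per spec in plain Python. The tuples below are copied from the printed output (partition values omitted), and `summarize_by_spec` is a hypothetical helper.

```python
from collections import defaultdict

# (spec_id, record_count, file_count) copied from the output above.
partition_rows = [(1, 2, 1), (0, 7, 1), (1, 1, 1), (0, 1002, 3)]


def summarize_by_spec(rows):
    """Return {spec_id: (total_records, total_files)}."""
    totals = defaultdict(lambda: [0, 0])
    for spec_id, records, files in rows:
        totals[spec_id][0] += records
        totals[spec_id][1] += files
    return {spec: tuple(vals) for spec, vals in sorted(totals.items())}


summary = summarize_by_spec(partition_rows)
for spec_id, (records, files) in summary.items():
    print(f"spec {spec_id}: {records} records in {files} files")
```

The per-spec totals add up to the 1012 records reported in the analysis above. In a live session the same aggregation could be written as a `GROUP BY spec_id` over the `partitions` metadata table.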



## 8. Evolution Strategy Recommendations

Compile best practices and recommendations for managing schema and partition evolution.

In [19]:
def create_evolution_recommendations():
    """Create evolution strategy recommendations."""
    print("=== EVOLUTION STRATEGY RECOMMENDATIONS ===\n")
    
    recommendations = [
        ("Schema Evolution", "Add columns as nullable to maintain compatibility"),
        ("Schema Evolution", "Use meaningful column names and comments"),
        ("Schema Evolution", "Plan for data type compatibility"),
        ("Partition Evolution", "Add partition fields based on query patterns"),
        ("Partition Evolution", "Use bucketing for high-cardinality fields"),
        ("Partition Evolution", "Monitor partition distribution and skew"),
        ("Data Quality", "Validate data during evolution phases"),
        ("Data Quality", "Maintain backward compatibility"),
        ("Performance", "Test queries after evolution changes"),
        ("Performance", "Monitor file count and compaction needs")
    ]

    recommendations_df = spark.createDataFrame(
        recommendations, 
        ["category", "recommendation"]
    )
    print("Evolution Best Practices:")
    recommendations_df.show(truncate=False)

def show_final_table_state(table_name):
    """Show final evolved table state without using information_schema."""
    print("\nFinal Evolved Table State:")
    
    # Count columns from the DataFrame schema. Parsing DESCRIBE TABLE output is
    # fragile here because it can also list partitioning rows after the columns.
    column_count = len(spark.table(table_name).columns)
    
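    # Get other metrics from the table. "Active Partitions" counts distinct
    # (year, month, day, is_active) value combinations found in the data; it
    # approximates, but is not read from, the table's partition metadata.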
    final_state = spark.sql(f"""
        SELECT 
            'Total Records' as aspect,
            CAST(COUNT(*) as STRING) as count
        FROM {table_name}
        
        UNION ALL
        
        SELECT 
            'Active Partitions' as aspect,
            CAST(COUNT(DISTINCT created_year || '-' || created_month || '-' || created_day || '-' || is_active) as STRING) as count
        FROM {table_name}
        
        ORDER BY aspect
    """)
    
    # Create final state with schema columns
    final_data = [('Schema Columns', str(column_count))]
    final_data.extend([(row['aspect'], row['count']) for row in final_state.collect()])
    
    final_df = spark.createDataFrame(final_data, ['aspect', 'count'])
    final_df.show()

# Create recommendations and show final state
create_evolution_recommendations()
show_final_table_state(TABLE_NAME)

=== EVOLUTION STRATEGY RECOMMENDATIONS ===

Evolution Best Practices:
+-------------------+-------------------------------------------------+
|category           |recommendation                                   |
+-------------------+-------------------------------------------------+
|Schema Evolution   |Add columns as nullable to maintain compatibility|
|Schema Evolution   |Use meaningful column names and comments         |
|Schema Evolution   |Plan for data type compatibility                 |
|Partition Evolution|Add partition fields based on query patterns     |
|Partition Evolution|Use bucketing for high-cardinality fields        |
|Partition Evolution|Monitor partition distribution and skew          |
|Data Quality       |Validate data during evolution phases            |
|Data Quality       |Maintain backward compatibility                  |
|Performance        |Test queries after evolution changes             |
|Performance        |Monitor file count and compaction needs          |
+-------------------+-------------------------------------------
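
For the "Plan for data type compatibility" recommendation, it helps to know which in-place type changes Iceberg accepts. Assuming the rules for format versions 1 and 2, these are limited to widening promotions: `int` to `long`, `float` to `double`, and increasing the precision of a `decimal` while keeping its scale. The checker below is a hypothetical pre-flight helper written in plain Python, not an Iceberg API.

```python
import re

# Widening promotions assumed from the Iceberg spec (format v1/v2).
SAFE_PRIMITIVE_PROMOTIONS = {("int", "long"), ("float", "double")}
DECIMAL_PATTERN = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def is_safe_promotion(old_type, new_type):
    """Return True if the type is unchanged or the change is a widening promotion."""
    if old_type == new_type:
        return True
    if (old_type, new_type) in SAFE_PRIMITIVE_PROMOTIONS:
        return True
    old_dec = DECIMAL_PATTERN.fullmatch(old_type)
    new_dec = DECIMAL_PATTERN.fullmatch(new_type)
    if old_dec and new_dec:
        old_p, old_s = map(int, old_dec.groups())
        new_p, new_s = map(int, new_dec.groups())
        # Precision may grow; scale must stay the same.
        return new_p >= old_p and new_s == old_s
    return False


print(is_safe_promotion("int", "long"))
print(is_safe_promotion("long", "int"))
print(is_safe_promotion("decimal(10,2)", "decimal(12,2)"))
```

Running a check like this before issuing an `ALTER TABLE ... ALTER COLUMN ... TYPE` statement turns a failed DDL into an early, readable error.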

## 9. Summary and Cleanup

Summarize the evolution journey and clean up the session.

In [20]:
def print_evolution_summary():
    """Create comprehensive evolution summary."""
    print("=== SCHEMA AND PARTITION EVOLUTION TUTORIAL SUMMARY ===\n")
    print("EVOLUTION JOURNEY COMPLETED:")
    print("=" * 50)
    print()
    
    achievements = {
        "📋 SCHEMA EVOLUTION ACHIEVEMENTS": [
            "Added geographic information (country)",
            "Added user registration tracking (registration_source)", 
            "Added engagement scoring (engagement_score)",
            "Added user activity tracking (last_login_at)",
            "Renamed columns for clarity",
            "Added meaningful column comments"
        ],
        "🗂️ PARTITION EVOLUTION ACHIEVEMENTS": [
            "Added is_active partitioning for query optimization",
            "Added country bucketing for geographic queries", 
            "Maintained backward compatibility",
            "Optimized for common query patterns"
        ],
        "📊 DATA QUALITY MAINTENANCE": [
            "Preserved all existing data during evolution",
            "Maintained data type consistency",
            "Handled NULL values appropriately", 
            "Enabled gradual schema adoption"
        ]
    }
    
    for category, items in achievements.items():
        print(category + ":")
        for item in items:
            print(f"   ✓ {item}")
        print()

def show_final_statistics(table_name):
    """Display final table statistics."""
    final_summary = spark.sql(f"""
        SELECT 
            COUNT(*) as total_records,
            COUNT(CASE WHEN country IS NOT NULL THEN 1 END) as records_with_geography,
            COUNT(CASE WHEN engagement_score IS NOT NULL THEN 1 END) as records_with_scoring,
            COUNT(DISTINCT COALESCE(country, 'Unknown')) as countries_represented,
            MAX(user_id) as highest_user_id
        FROM {table_name}
    """)
    print("FINAL TABLE STATISTICS:")
    final_summary.show()

def print_key_learnings():
    """Print key learnings from the evolution process."""
    learnings = [
        "Schema evolution enables zero-downtime table modifications",
        "Partition evolution optimizes query performance over time", 
        "Iceberg handles compatibility across different schema versions",
        "Evolution should be driven by actual usage patterns",
        "Monitoring and validation are crucial during evolution"
    ]
    
    print("\nKEY LEARNINGS:")
    for learning in learnings:
        print(f"• {learning}")
    
    print("\n🎉 Tutorial completed successfully!")
    print("Your table is now evolved and optimized for modern analytics workloads.")

# Generate summary
print_evolution_summary()
show_final_statistics(TABLE_NAME)
print_key_learnings()

=== SCHEMA AND PARTITION EVOLUTION TUTORIAL SUMMARY ===

EVOLUTION JOURNEY COMPLETED:
==================================================

📋 SCHEMA EVOLUTION ACHIEVEMENTS:
   ✓ Added geographic information (country)
   ✓ Added user registration tracking (registration_source)
   ✓ Added engagement scoring (engagement_score)
   ✓ Added user activity tracking (last_login_at)
   ✓ Renamed columns for clarity
   ✓ Added meaningful column comments

🗂️ PARTITION EVOLUTION ACHIEVEMENTS:
   ✓ Added is_active partitioning for query optimization
   ✓ Added country bucketing for geographic queries
   ✓ Maintained backward compatibility
   ✓ Optimized for common query patterns

📊 DATA QUALITY MAINTENANCE:
   ✓ Preserved all existing data during evolution
   ✓ Maintained data type consistency
   ✓ Handled NULL values appropriately
   ✓ Enabled gradual schema adoption

FINAL TABLE STATISTICS:
+-------------+----------------------+--------------------+---------------------+---------------+
|total_records|records_with_geography|records_with_scoring|coun

In [21]:
def cleanup_session():
    """Clear cached data and print session details (the Spark session is left running)."""
    print("Performing session cleanup...")
    
    try:
        # Clear any cached data
        spark.catalog.clearCache()
        print("✓ Spark catalog cache cleared")
        
        # Show final session information
        print("\nSession Information:")
        print(f"Application: {spark.sparkContext.getConf().get('spark.app.name')}")
        print(f"Catalog: {spark.conf.get('spark.sql.defaultCatalog', 'Not set')}")
        
        print("\n✅ Evolution tutorial completed successfully!")
        print("Your Iceberg table now supports modern analytics with evolved schema and optimized partitioning.")
        
    except Exception as e:
        print(f"Cleanup completed with minor warnings: {e}")
    
    finally:
        print("\n" + "="*60)
        print("TUTORIAL COMPLETE - READY FOR PRODUCTION WORKLOADS")
        print("="*60)

# Perform cleanup
cleanup_session()

Performing session cleanup...
✓ Spark catalog cache cleared

Session Information:
Application: PySparkShell
Catalog: rest

✅ Evolution tutorial completed successfully!
Your Iceberg table now supports modern analytics with evolved schema and optimized partitioning.

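============================================================
TUTORIAL COMPLETE - READY FOR PRODUCTION WORKLOADS
============================================================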
