# Querying Iceberg Tables with Apache Spark

This notebook demonstrates how to query Apache Iceberg tables using Apache Spark SQL and DataFrame APIs. Spark provides comprehensive support for Iceberg tables with advanced features like partition pruning, predicate pushdown, and vectorized execution.

## Learning Objectives

By the end of this notebook, you'll understand:
- How to configure Spark for Iceberg table access
- How to use Spark SQL for querying Iceberg tables
- How to leverage Spark DataFrame API for data operations
- How to optimize queries using Iceberg metadata
- How to perform advanced analytics with Spark on Iceberg

## Prerequisites

- Completed notebooks 1-3 (table creation, data insertion, basic querying)
- Apache Spark with Iceberg extensions
- Docker environment running
- Understanding of Spark SQL and DataFrames

## Why Spark with Iceberg?

Spark is a strong choice for Iceberg operations because:
- **Native Integration**: Iceberg provides a Spark 3.x runtime with a catalog implementation and SQL extensions
- **Full Feature Set**: Complete ACID operations, schema evolution
- **Performance**: Advanced optimizations like vectorized execution
- **Scalability**: Distributed processing for large datasets
- **Ecosystem**: Rich integration with data platforms
- **SQL Compatibility**: Standard SQL interface with extensions

## Environment Setup

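Initialize a Spark session with the Iceberg configuration. The settings are loaded from `spark-defaults.conf`, which supplies the default catalog, the Iceberg SQL extensions, the catalog URI, the warehouse location, and the S3 endpoint that the cell below prints back.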

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Create Spark session with Iceberg configuration
# Configuration is loaded from spark-defaults.conf
spark = SparkSession.builder \
    .appName("Iceberg Analytics with Spark") \
    .getOrCreate()

print("Spark Session Configuration:")
print("=" * 40)
print(f"Spark version: {spark.version}")
print(f"Default catalog: {spark.conf.get('spark.sql.defaultCatalog')}")
print(f"Iceberg extensions: {spark.conf.get('spark.sql.extensions')}")

# Verify Iceberg configuration
print("\nIceberg Configuration:")
print("=" * 30)
print(f"Catalog URI: {spark.conf.get('spark.sql.catalog.rest.uri')}")
print(f"Warehouse: {spark.conf.get('spark.sql.catalog.rest.warehouse')}")
print(f"S3 Endpoint: {spark.conf.get('spark.sql.catalog.rest.s3.endpoint')}")

print("\nSpark session ready for Iceberg operations")

Spark Session Configuration:
Spark version: 3.5.5
Default catalog: rest
Iceberg extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Iceberg Configuration:
Catalog URI: http://iceberg-rest:8181/
Warehouse: s3://warehouse/
S3 Endpoint: http://minio:9000

Spark session ready for Iceberg operations


25/07/01 13:44:48 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
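
The cell above only reads settings back from the running session. As a rough sketch of what `spark-defaults.conf` might contain to produce that output: the extensions class, default catalog, catalog URI, warehouse, and S3 endpoint below are the values printed above, while the catalog implementation class, `type`, and `io-impl` entries are assumptions based on a typical Iceberg REST catalog setup and may differ in this environment. The helper names `render_spark_defaults` and `build_session` are made up for illustration.

```python
# Sketch of the Iceberg-related Spark settings behind the output above.
# Entries marked "assumed" are NOT shown in the notebook output.
ICEBERG_CONF = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": "rest",
    "spark.sql.catalog.rest": "org.apache.iceberg.spark.SparkCatalog",  # assumed
    "spark.sql.catalog.rest.type": "rest",  # assumed
    "spark.sql.catalog.rest.uri": "http://iceberg-rest:8181/",
    "spark.sql.catalog.rest.warehouse": "s3://warehouse/",
    "spark.sql.catalog.rest.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",  # assumed
    "spark.sql.catalog.rest.s3.endpoint": "http://minio:9000",
}


def render_spark_defaults(conf):
    """Render a dict of settings as spark-defaults.conf lines (key, whitespace, value)."""
    width = max(len(key) for key in conf)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in conf.items())


def build_session(conf, app_name="Iceberg Analytics with Spark"):
    """Apply the same settings programmatically instead of via spark-defaults.conf."""
    from pyspark.sql import SparkSession  # imported lazily so the dict is usable without Spark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(render_spark_defaults(ICEBERG_CONF))
```

Note that `spark.sql.extensions` is a static setting. As the warning in the output above indicates, only runtime SQL configurations take effect when a session already exists, so `build_session` would need to run in a fresh kernel to have any effect on it.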


## Catalog and Namespace Exploration

Explore the Iceberg catalog structure and verify our table exists.

In [2]:
# Explore catalog structure
print("Catalog Structure Exploration:")
print("=" * 40)

# Show available namespaces
print("\n1. Available Namespaces:")
namespaces_df = spark.sql("SHOW NAMESPACES")
namespaces_df.show()

# Set default namespace
spark.sql("USE rest.`play_iceberg`")
print("\n2. Switched to 'rest.play_iceberg' namespace")

# Show tables in namespace
print("\n3. Tables in Namespace:")
tables_df = spark.sql("SHOW TABLES")
tables_df.show()

# Get table count
table_count = tables_df.count()
print(f"\nFound {table_count} table(s) in the namespace")

# Verify users table exists
users_table_exists = (
    tables_df.filter("tableName = 'users'")
    .count() > 0
)

if users_table_exists:
    print("✓ Users table found and accessible")
else:
    print("✗ Users table not found - run previous notebooks first")
    raise RuntimeError("Users table not found")

Catalog Structure Exploration:

1. Available Namespaces:
+------------+
|   namespace|
+------------+
|play_iceberg|
+------------+


2. Switched to 'rest.play_iceberg' namespace

3. Tables in Namespace:
+------------+---------+-----------+
|   namespace|tableName|isTemporary|
+------------+---------+-----------+
|play_iceberg|    users|      false|
+------------+---------+-----------+



                                                                                


Found 1 table(s) in the namespace
✓ Users table found and accessible
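
The queries in this notebook mix short names (`users`, resolved against the current namespace after `USE`) with fully qualified ones such as ``rest.`play_iceberg`.users``. A small helper can build the qualified form consistently, including the metadata-table suffixes used later (`history`, `manifests`, `files`). This is only a sketch: the function names are hypothetical, and the helper quotes every part with backticks, doubling any embedded backtick, which is how Spark SQL escapes quoted identifiers.

```python
def quote_identifier(part):
    """Wrap one identifier part in backticks, doubling embedded backticks."""
    return "`" + part.replace("`", "``") + "`"


def qualified_name(catalog, namespace, table, metadata_table=None):
    """Build catalog.namespace.table[.metadata_table] with every part quoted."""
    parts = [catalog, namespace, table]
    if metadata_table is not None:
        parts.append(metadata_table)
    return ".".join(quote_identifier(p) for p in parts)


users = qualified_name("rest", "play_iceberg", "users")
users_history = qualified_name("rest", "play_iceberg", "users", "history")
print(users)
print(users_history)

# With a live session:
# spark.sql(f"SELECT COUNT(*) FROM {users}").show()
```

Using the fully qualified name makes a query independent of whichever namespace the session last switched to with `USE`.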


## Table Structure Inspection

Examine the table schema, partitioning, and metadata to understand its structure.

In [3]:
# Detailed table inspection
print("Table Structure Analysis:")
print("=" * 35)

# Get detailed table description
print("\n1. Extended Table Description:")
table_desc = spark.sql("DESCRIBE EXTENDED users")
table_desc.show(50, truncate=False)

# Extract key information
desc_rows = table_desc.collect()
table_info = {}

for row in desc_rows:
    if row['col_name'] and row['data_type']:
        if row['col_name'].startswith('#'):
            continue
        if row['col_name'] in ['Name', 'Type', 'Location', 'Provider']:
            table_info[row['col_name']] = row['data_type']

print("\n2. Table Summary:")
for key, value in table_info.items():
    print(f"   {key}: {value}")

# Show table schema in a cleaner format
print("\n3. Table Schema:")
users_df = spark.table("users")
users_df.printSchema()

print("\n4. Schema Information:")
print(f"   Total columns: {len(users_df.columns)}")
print(f"   Column names: {users_df.columns}")

Table Structure Analysis:

1. Extended Table Description:
+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                             |comment|
+----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+
|user_id                     |bigint                                                                                                                |NULL   |
|username                    |string                                                                                                                |NULL   |
|email                       |string                                                                                                                |NUL
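
The loop in the cell above picks a few keys out of the `DESCRIBE EXTENDED` rows. Because that result is one flat list in which rows starting with `#` act as section headers, it can be easier to split it into sections first. The sketch below works on plain `(col_name, data_type)` tuples so that it runs without Spark. The sample rows are hypothetical and abbreviated (only `user_id` and `username` are taken from the output above), the name `split_describe_rows` is made up, and the exact section headers vary between Spark versions.

```python
def split_describe_rows(rows, first_section="# Columns"):
    """Group (col_name, data_type) pairs under the '#'-prefixed header row that precedes them."""
    sections = {first_section: {}}
    current = first_section
    for col_name, data_type in rows:
        if not col_name:              # blank separator rows carry no information
            continue
        if col_name.startswith("#"):  # a header row opens a new section
            current = col_name
            sections.setdefault(current, {})
            continue
        sections[current][col_name] = data_type
    return sections


# Hypothetical, abbreviated rows in the shape DESCRIBE EXTENDED returns.
sample_rows = [
    ("user_id", "bigint"),
    ("username", "string"),
    ("", ""),
    ("# Detailed Table Information", ""),
    ("Name", "rest.play_iceberg.users"),
    ("Provider", "iceberg"),
]

sections = split_describe_rows(sample_rows)
for header, entries in sections.items():
    print(header, entries)

# With a live session:
# rows = [(r["col_name"], r["data_type"])
#         for r in spark.sql("DESCRIBE EXTENDED users").collect()]
```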

## Basic SQL Queries

Execute fundamental SQL queries to explore the data and verify table functionality.

In [4]:
# Basic SQL queries
print("Basic SQL Query Examples:")
print("=" * 35)

# Query 1: Check if table has data
print("\n1. Record Count:")
count_result = spark.sql("SELECT COUNT(*) as total_records FROM users")
count_result.show()

record_count = count_result.collect()[0]['total_records']
print(f"Total records in table: {record_count}")

if record_count == 0:
    print("\nTable is empty. Let's check table history and snapshots.")
    
    # Check table history
    try:
        history_df = spark.sql("SELECT * FROM rest.`play_iceberg`.users.history")
        print("\nTable History:")
        history_df.show()
        
        if history_df.count() == 0:
            print("No snapshots found. Table was created but no data inserted.")
            print("Please run notebook 2 to insert sample data.")
        
    except Exception as e:
        print(f"Error checking table history: {e}")
        
else:
    # Query 2: Show sample data
    print("\n2. Sample Data (first 10 records):")
    sample_data = spark.sql("SELECT * FROM users ORDER BY user_id LIMIT 10")
    sample_data.show(truncate=False)
    
    # Query 3: Data type verification
    print("\n3. Data Types and Sample Values:")
    # Fetch one row once instead of launching a separate Spark job per column
    sample_row = users_df.first()
    for column, dtype in users_df.dtypes:
        print(f"   {column}: {dtype} (sample: {sample_row[column]})")
    
    # Query 4: Basic aggregations
    print("\n4. Basic Statistics:")
    stats_query = """
    SELECT 
        COUNT(*) as total_users,
        SUM(CASE WHEN is_active THEN 1 ELSE 0 END) as active_users,
        SUM(CASE WHEN NOT is_active THEN 1 ELSE 0 END) as inactive_users,
        COUNT(DISTINCT created_year) as years_span,
        MIN(updated_at) as earliest_update,
        MAX(updated_at) as latest_update
    FROM users
    """
    
    stats_df = spark.sql(stats_query)
    stats_df.show(truncate=False)

Basic SQL Query Examples:

1. Record Count:
+-------------+
|total_records|
+-------------+
|            5|
+-------------+

Total records in table: 5

2. Sample Data (first 10 records):


                                                                                

+-------+-------------+-------------------------+---------+------------+-------------+-----------+--------------------------+
|user_id|username     |email                    |is_active|created_year|created_month|created_day|updated_at                |
+-------+-------------+-------------------------+---------+------------+-------------+-----------+--------------------------+
|1      |john_doe     |john.doe@example.com     |true     |2025        |7            |1          |2025-07-01 13:41:16.173663|
|2      |jane_smith   |jane.smith@example.com   |true     |2025        |7            |1          |2025-07-01 13:41:16.173663|
|3      |alice_wonder |alice.wonder@example.com |false    |2025        |7            |1          |2025-07-01 13:41:16.173663|
|4      |bob_builder  |bob.builder@example.com  |true     |2025        |7            |1          |2025-07-01 13:41:16.173663|
|5      |charlie_brown|charlie.brown@example.com|true     |2025        |7            |1          |2025-07-01 13:41:16.

## DataFrame API Operations

Demonstrate Spark DataFrame API for programmatic data operations.

In [5]:
# DataFrame API examples
print("Spark DataFrame API Examples:")
print("=" * 40)

# Load table as DataFrame
users_df = spark.table("users")

if users_df.count() > 0:
    # Example 1: Column selection and filtering
    print("\n1. Active Users (DataFrame API):")
    active_users = (
        users_df
        .filter(F.col("is_active"))  # boolean column, no "== True" needed
        .select("user_id", "username", "email")
        .orderBy("user_id")
    )
    active_users.show()
    
    # Example 2: Aggregations with groupBy
    print("\n2. User Count by Activity Status:")
    activity_summary = (
        users_df
        .groupBy("is_active")
        .agg(
            F.count("*").alias("user_count"),
            F.collect_list("username").alias("usernames")
        )
        .orderBy("is_active")
    )
    activity_summary.show(truncate=False)
    
    # Example 3: String operations
    print("\n3. Email Domain Analysis:")
    email_analysis = (
        users_df
        .withColumn(
            "email_domain", 
            F.split(F.col("email"), "@").getItem(1)
        )
        .withColumn(
            "username_length",
            F.length(F.col("username"))
        )
        .select(
            "username", "email", "email_domain", 
            "username_length", "is_active"
        )
        .orderBy("username_length")
    )
    email_analysis.show(truncate=False)
    
    # Example 4: Window functions
    print("\n4. User Ranking with Window Functions:")
    from pyspark.sql.window import Window
    
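    # No partitionBy here, so Spark moves every row into a single partition to
    # rank them (this triggers the WindowExec warning in the output). That is
    # acceptable for a small table but does not scale to large ones.
    window_spec = Window.orderBy(F.col("user_id"))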
    
    total_count = users_df.count()
    
    ranked_users = (
        users_df
        .withColumn("user_rank", F.row_number().over(window_spec))
        .withColumn("total_users", F.lit(total_count))
        .withColumn("is_first_half", F.col("user_rank") <= (F.col("total_users") / 2))
        .select(
            "user_rank", "user_id", "username", 
            "is_active", "is_first_half"
        )
        .orderBy("user_rank")
    )
    ranked_users.show()
    
else:
    print("\nTable is empty - DataFrame API examples skipped")
    print("Please run notebook 2 to insert sample data first")

Spark DataFrame API Examples:

1. Active Users (DataFrame API):
+-------+-------------+--------------------+
|user_id|     username|               email|
+-------+-------------+--------------------+
|      1|     john_doe|john.doe@example.com|
|      2|   jane_smith|jane.smith@exampl...|
|      4|  bob_builder|bob.builder@examp...|
|      5|charlie_brown|charlie.brown@exa...|
+-------+-------------+--------------------+


2. User Count by Activity Status:
+---------+----------+--------------------------------------------------+
|is_active|user_count|usernames                                         |
+---------+----------+--------------------------------------------------+
|false    |1         |[alice_wonder]                                    |
|true     |4         |[john_doe, jane_smith, bob_builder, charlie_brown]|
+---------+----------+--------------------------------------------------+


3. Email Domain Analysis:
+-------------+-------------------------+------------+--------------

25/07/01 13:45:07 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+---------+-------+-------------+---------+-------------+
|user_rank|user_id|     username|is_active|is_first_half|
+---------+-------+-------------+---------+-------------+
|        1|      1|     john_doe|     true|         true|
|        2|      2|   jane_smith|     true|         true|
|        3|      3| alice_wonder|    false|        false|
|        4|      4|  bob_builder|     true|        false|
|        5|      5|charlie_brown|     true|        false|
+---------+-------+-------------+---------+-------------+



## Advanced Analytical Queries

Demonstrate complex analytical operations that showcase Spark's capabilities.

In [6]:
# Advanced analytical queries
print("Advanced Analytics Examples:")
print("=" * 40)

if users_df.count() > 0:
    # Analysis 1: Comprehensive user profile analysis
    print("\n1. Comprehensive User Profile Analysis:")
    profile_analysis = spark.sql("""
    SELECT 
        user_id,
        username,
        CASE 
            WHEN LENGTH(username) < 8 THEN 'Short'
            WHEN LENGTH(username) < 12 THEN 'Medium'
            ELSE 'Long'
        END as username_category,
        email,
        SPLIT_PART(email, '@', 2) as domain,
        is_active,
        CASE 
            WHEN is_active THEN 'Active User'
            ELSE 'Inactive User'
        END as status_description,
        created_year,
        updated_at,
        DATEDIFF(CURRENT_DATE(), DATE(updated_at)) as days_since_update
    FROM users
    ORDER BY user_id
    """)
    profile_analysis.show(truncate=False)
    
    # Analysis 2: Temporal pattern analysis
    print("\n2. Temporal Pattern Analysis:")
    temporal_analysis = spark.sql("""
    SELECT 
        created_year,
        created_month,
        created_day,
        COUNT(*) as users_created,
        COUNT(CASE WHEN is_active THEN 1 END) as active_users_created,
        ROUND(
            COUNT(CASE WHEN is_active THEN 1 END) * 100.0 / COUNT(*), 
            2
        ) as active_percentage,
        COLLECT_LIST(username) as usernames
    FROM users
    GROUP BY created_year, created_month, created_day
    ORDER BY created_year, created_month, created_day
    """)
    temporal_analysis.show(truncate=False)
    
    # Analysis 3: Complex aggregations with multiple dimensions
    print("\n3. Multi-Dimensional Analysis:")
    multi_dim_analysis = spark.sql("""
    WITH user_metrics AS (
        SELECT 
            *,
            LENGTH(username) as username_len,
            LENGTH(email) as email_len,
            SPLIT_PART(email, '@', 2) as domain
        FROM users
    )
    SELECT 
        domain,
        is_active,
        COUNT(*) as user_count,
        AVG(username_len) as avg_username_length,
        AVG(email_len) as avg_email_length,
        MIN(username) as first_username_alphabetically,
        MAX(username) as last_username_alphabetically,
        ARRAY_JOIN(COLLECT_LIST(username), ', ') as all_usernames
    FROM user_metrics
    GROUP BY domain, is_active
    ORDER BY domain, is_active DESC
    """)
    multi_dim_analysis.show(truncate=False)
    
    # Analysis 4: Performance demonstration with explain plan
    print("\n4. Query Performance Analysis:")
    performance_query = """
    SELECT 
        is_active,
        COUNT(*) as user_count,
        AVG(LENGTH(username)) as avg_username_length
    FROM users
    WHERE created_year = 2025
    GROUP BY is_active
    """
    
    print("Query execution plan:")
    spark.sql(performance_query).explain(True)
    
    print("\nQuery result:")
    spark.sql(performance_query).show()
    
else:
    print("\nTable is empty - advanced analytics examples skipped")
    print("Please run notebook 2 to insert sample data first")

Advanced Analytics Examples:

1. Comprehensive User Profile Analysis:
+-------+-------------+-----------------+-------------------------+-----------+---------+------------------+------------+--------------------------+-----------------+
|user_id|username     |username_category|email                    |domain     |is_active|status_description|created_year|updated_at                |days_since_update|
+-------+-------------+-----------------+-------------------------+-----------+---------+------------------+------------+--------------------------+-----------------+
|1      |john_doe     |Medium           |john.doe@example.com     |example.com|true     |Active User       |2025        |2025-07-01 13:41:16.173663|0                |
|2      |jane_smith   |Medium           |jane.smith@example.com   |example.com|true     |Active User       |2025        |2025-07-01 13:41:16.173663|0                |
|3      |alice_wonder |Long             |alice.wonder@example.com |example.com|false    |Inacti

## Partition and Performance Analysis

Analyze table partitioning and demonstrate performance optimizations.

In [7]:
# Partition and performance analysis
print("Partition and Performance Analysis:")
print("=" * 45)

if users_df.count() > 0:
    # Analysis 1: Partition distribution
    print("\n1. Partition Distribution:")
    partition_dist = spark.sql("""
    SELECT 
        created_year,
        created_month,
        created_day,
        COUNT(*) as record_count,
        COUNT(DISTINCT user_id) as unique_users
    FROM users
    GROUP BY created_year, created_month, created_day
    ORDER BY created_year, created_month, created_day
    """)
    partition_dist.show()
    
    # Analysis 2: Demonstrate partition pruning
    print("\n2. Partition Pruning Demonstration:")
    
    # Get a specific partition value
    sample_partition = spark.sql(
        "SELECT DISTINCT created_year, created_month, created_day FROM users LIMIT 1"
    ).collect()[0]
    
    year, month, day = (
        sample_partition['created_year'],
        sample_partition['created_month'],
        sample_partition['created_day']
    )
    
    partition_query = f"""
    SELECT 
        user_id, username, email, is_active
    FROM users
    WHERE created_year = {year}
      AND created_month = {month}
      AND created_day = {day}
    ORDER BY user_id
    """
    
    print(f"Querying partition: year={year}, month={month}, day={day}")
    
    # Show execution plan
    print("\nExecution plan (should show partition pruning):")
    spark.sql(partition_query).explain()
    
    print("\nPartition query result:")
    spark.sql(partition_query).show()
    
    # Analysis 3: Query performance comparison
    print("\n3. Performance Comparison:")
    
    # Full table scan
    full_scan_query = "SELECT COUNT(*) FROM users WHERE is_active = true"
    
    # Partition-pruned query
    pruned_query = f"""
    SELECT COUNT(*) FROM users 
    WHERE is_active = true 
      AND created_year = {year}
    """
    
    print("Full table scan result:")
    spark.sql(full_scan_query).show()
    
    print("Partition-pruned query result:")
    spark.sql(pruned_query).show()
    
    print("\nPerformance Benefits:")
    print("- Partition pruning reduces data scan volume")
    print("- Columnar storage enables efficient projections")
    print("- Iceberg metadata accelerates query planning")
    print("- Vectorized execution improves computational efficiency")
    
else:
    print("\nTable is empty - partition analysis skipped")
    print("Please run notebook 2 to insert sample data first")

Partition and Performance Analysis:

1. Partition Distribution:
+------------+-------------+-----------+------------+------------+
|created_year|created_month|created_day|record_count|unique_users|
+------------+-------------+-----------+------------+------------+
|        2025|            7|          1|           5|           5|
+------------+-------------+-----------+------------+------------+


2. Partition Pruning Demonstration:
Querying partition: year=2025, month=7, day=1

Execution plan (should show partition pruning):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [user_id#999L ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(user_id#999L ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=936]
      +- Project [user_id#999L, username#1000, email#1001, is_active#1002]
         +- BatchScan rest.play_iceberg.users[user_id#999L, username#1000, email#1001, is_active#1002, created_year#1003, created_month#1004, created_day#1005] rest.play_iceberg.users (
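
The comparison in step 3 prints both counts but does not measure how long either query takes. One way to get rough wall-clock numbers is a small timing helper like the one below. It is a sketch, not a benchmark harness: `time_action` is a hypothetical name, it times any zero-argument callable, and on a five-row table the difference between the two queries is likely to be dominated by scheduling overhead rather than by pruning.

```python
import time


def time_action(action, repeats=3):
    """Run a zero-argument callable `repeats` times; return (last_result, best_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload so the sketch runs without Spark.
total, seconds = time_action(lambda: sum(range(100_000)))
print(f"result={total}, best of 3 runs: {seconds:.6f}s")

# With a live session, collect() forces execution so the timing covers the whole query:
# rows, secs = time_action(lambda: spark.sql(full_scan_query).collect())
# rows, secs = time_action(lambda: spark.sql(pruned_query).collect())
```

Taking the best of several runs reduces the effect of one-off costs such as JVM warm-up and metadata caching on the first execution.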

## Iceberg Metadata Exploration

Explore Iceberg's metadata tables to understand table internals.

In [8]:
# Iceberg metadata exploration
print("Iceberg Metadata Exploration:")
print("=" * 40)

try:
    # Explore table history
    print("\n1. Table History (Snapshots):")
    history_df = spark.sql("SELECT * FROM rest.`play_iceberg`.users.history")
    history_df.show(truncate=False)
    
    snapshot_count = history_df.count()  # count once; each count() runs a Spark job
    if snapshot_count > 0:
        print(f"Total snapshots: {snapshot_count}")
        
        # Show manifest information
        print("\n2. Manifest Files:")
        try:
            manifests_df = spark.sql("SELECT * FROM rest.`play_iceberg`.users.manifests")
            manifests_df.show(truncate=False)
            print(f"Total manifest files: {manifests_df.count()}")
        except Exception as e:
            print(f"Could not access manifests: {e}")
        
        # Show data files information
        print("\n3. Data Files:")
        try:
            files_df = spark.sql("""
            SELECT 
                file_path,
                file_format,
                record_count,
                file_size_in_bytes,
                ROUND(file_size_in_bytes / 1024.0 / 1024.0, 3) as file_size_mb
            FROM rest.`play_iceberg`.users.files
            """)
            files_df.show(truncate=False)
            
            # File statistics
            file_stats = files_df.agg(
                F.count("*").alias("total_files"),
                F.sum("record_count").alias("total_records"),
                F.sum("file_size_in_bytes").alias("total_size_bytes"),
                F.avg("file_size_in_bytes").alias("avg_file_size_bytes")
            )
            
            print("\nFile Statistics:")
            file_stats.show()
            
        except Exception as e:
            print(f"Could not access data files metadata: {e}")
    
    else:
        print("No table history found - table may be empty")
    
except Exception as e:
    print(f"Error accessing metadata tables: {e}")
    print("This may be normal if the table has no data yet")

print("\nMetadata Benefits:")
print("- Complete audit trail of all table changes")
print("- File-level statistics for query optimization")
print("- Time travel capabilities through snapshots")
print("- Schema evolution tracking")

Iceberg Metadata Exploration:

1. Table History (Snapshots):
+-----------------------+-------------------+---------+-------------------+
|made_current_at        |snapshot_id        |parent_id|is_current_ancestor|
+-----------------------+-------------------+---------+-------------------+
|2025-07-01 13:41:23.087|3824926247776142618|NULL     |true               |
+-----------------------+-------------------+---------+-------------------+

Total snapshots: 1

2. Manifest Files:
+-------+---------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+------------------------------------------------------------------------+
|content|path                                                                                   |length|partition_spec_id|added_snapshot_id  |added_d
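
The `snapshot_id` values in the history table are what time travel queries refer to. As a sketch, the helper below builds the two query forms that Spark SQL (3.3 and later) accepts for Iceberg tables: `VERSION AS OF` with a snapshot ID and `TIMESTAMP AS OF` with a timestamp string. The function name `time_travel_query` is hypothetical. The snapshot ID and timestamp in the usage lines are the ones shown in the history output above.

```python
from datetime import datetime


def time_travel_query(table, snapshot_id=None, as_of=None, columns="*"):
    """Build a SELECT that reads `table` at a snapshot ID or at a point in time."""
    if (snapshot_id is None) == (as_of is None):
        raise ValueError("pass exactly one of snapshot_id or as_of")
    if snapshot_id is not None:
        # int() rejects anything that is not a plain integer before it reaches SQL
        clause = f"VERSION AS OF {int(snapshot_id)}"
    else:
        # as_of must be a datetime; format it with millisecond precision
        stamp = as_of.isoformat(sep=" ", timespec="milliseconds")
        clause = f"TIMESTAMP AS OF '{stamp}'"
    return f"SELECT {columns} FROM {table} {clause}"


by_id = time_travel_query("rest.play_iceberg.users", snapshot_id=3824926247776142618)
by_time = time_travel_query(
    "rest.play_iceberg.users",
    as_of=datetime(2025, 7, 1, 13, 41, 23, 87000),
)
print(by_id)
print(by_time)

# With a live session:
# spark.sql(by_id).show()
```

With only one snapshot in the table, both queries would return the same five rows as the current table. A timestamp earlier than the first snapshot has no snapshot to resolve to, so such a query is expected to fail rather than return an empty result.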

## Performance Optimization Tips

Best practices for optimizing Spark queries on Iceberg tables.

In [9]:
# Performance optimization demonstrations
print("Spark-Iceberg Performance Optimization:")
print("=" * 50)

print("\n1. Current Spark Configuration:")
important_configs = [
    'spark.sql.adaptive.enabled',
    'spark.sql.adaptive.coalescePartitions.enabled',
    'spark.serializer',
    'spark.sql.execution.arrow.pyspark.enabled'
]

for config in important_configs:
    try:
        value = spark.conf.get(config)
        print(f"   {config}: {value}")
    except Exception:
        print(f"   {config}: Not configured")

print("\n2. Optimization Strategies:")
print("   ✓ Column Pruning: Select only needed columns")
print("   ✓ Predicate Pushdown: Apply filters early")
print("   ✓ Partition Pruning: Use partition columns in WHERE")
print("   ✓ Vectorized Execution: Enabled by default")
print("   ✓ Adaptive Query Execution: Dynamic optimization")

if users_df.count() > 0:
    print("\n3. Optimization Examples:")
    
    # Efficient query example
    efficient_query = """
    SELECT username, email
    FROM users
    WHERE is_active = true
      AND created_year = 2025
    ORDER BY user_id
    LIMIT 5
    """
    
    print("   Efficient Query (with optimizations):")
    print(f"   {efficient_query.strip()}")
    
    result = spark.sql(efficient_query)
    result.show()
    
    print("   Optimizations applied:")
    print("   - Column projection (username, email only)")
    print("   - Predicate pushdown (is_active, created_year)")
    print("   - Result limiting (LIMIT 5)")
    print("   - Partition pruning (created_year filter)")

print("\n4. Best Practices Summary:")
best_practices = [
    "Use partition columns in WHERE clauses",
    "Select only required columns",
    "Apply filters as early as possible",
    "Use appropriate data types",
    "Enable adaptive query execution",
    "Monitor query plans with EXPLAIN",
    "Cache frequently accessed data",
    "Use columnar formats (Parquet)"
]

for i, practice in enumerate(best_practices, 1):
    print(f"   {i}. {practice}")

Spark-Iceberg Performance Optimization:

1. Current Spark Configuration:
   spark.sql.adaptive.enabled: true
   spark.sql.adaptive.coalescePartitions.enabled: true
   spark.serializer: org.apache.spark.serializer.KryoSerializer
   spark.sql.execution.arrow.pyspark.enabled: false

2. Optimization Strategies:
   ✓ Column Pruning: Select only needed columns
   ✓ Predicate Pushdown: Apply filters early
   ✓ Partition Pruning: Use partition columns in WHERE
   ✓ Vectorized Execution: Enabled by default
   ✓ Adaptive Query Execution: Dynamic optimization

3. Optimization Examples:
   Efficient Query (with optimizations):
   SELECT username, email
    FROM users
    WHERE is_active = true
      AND created_year = 2025
    ORDER BY user_id
    LIMIT 5


+-------------+--------------------+
|     username|               email|
+-------------+--------------------+
|     john_doe|john.doe@example.com|
|   jane_smith|jane.smith@exampl...|
|  bob_builder|bob.builder@examp...|
|charlie_brown|charlie.brown@exa...|
+-------------+--------------------+

   Optimizations applied:
   - Column projection (username, email only)
   - Predicate pushdown (is_active, created_year)
   - Result limiting (LIMIT 5)
   - Partition pruning (created_year filter)

4. Best Practices Summary:
   1. Use partition columns in WHERE clauses
   2. Select only required columns
   3. Apply filters as early as possible
   4. Use appropriate data types
   5. Enable adaptive query execution
   6. Monitor query plans with EXPLAIN
   7. Cache frequently accessed data
   8. Use columnar formats (Parquet)
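
To make the first practice concrete, here is a toy model of partition pruning in plain Python. It is not how Iceberg implements pruning; it only illustrates the idea that files whose partition values cannot match the filter are skipped before any data is read. The class and function names, the file paths, and the second and third partitions are invented for the example. Only the 2025/7/1 partition with five records appears in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str          # hypothetical file name
    partition: dict    # partition column -> value
    record_count: int


def prune(files, **partition_filter):
    """Keep only files whose partition values equal every requested value."""
    return [
        f for f in files
        if all(f.partition.get(col) == val for col, val in partition_filter.items())
    ]


files = [
    DataFile("data/a.parquet", {"created_year": 2025, "created_month": 7, "created_day": 1}, 5),
    DataFile("data/b.parquet", {"created_year": 2025, "created_month": 6, "created_day": 30}, 8),
    DataFile("data/c.parquet", {"created_year": 2024, "created_month": 12, "created_day": 31}, 3),
]

kept = prune(files, created_year=2025, created_month=7)
print(f"scanning {len(kept)} of {len(files)} files:", [f.path for f in kept])
```

In this simplified model, a filter on a non-partition column such as `is_active` gives `prune` nothing to work with, so every file stays in the scan. That is the reason for pairing such filters with a partition column where possible.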


## Summary

This notebook demonstrated comprehensive querying of Apache Iceberg tables using Apache Spark:

### What We Accomplished:
1. **Environment Setup**: Configured Spark with Iceberg extensions
2. **Catalog Exploration**: Discovered namespaces and tables
3. **Table Inspection**: Examined schema and metadata
4. **SQL Querying**: Executed basic and advanced SQL operations
5. **DataFrame API**: Demonstrated programmatic data operations
6. **Advanced Analytics**: Complex analytical queries and aggregations
7. **Performance Analysis**: Partition pruning and optimization
8. **Metadata Exploration**: Iceberg's internal metadata tables

### Key Features Demonstrated:
- **Native Integration**: Seamless Spark-Iceberg connectivity
- **SQL Compatibility**: Standard SQL with Iceberg extensions
- **Performance Optimization**: Partition pruning, predicate pushdown
- **Metadata Access**: Complete table history and file information
- **Analytical Capabilities**: Complex aggregations and window functions

### Performance Features Covered:
- **Partition Pruning**: Reduced data scanning through partition filters
- **Column Projection**: Efficient columnar data access
- **Vectorized Execution**: High-performance analytical processing
- **Adaptive Optimization**: Dynamic query plan adjustments
- **Metadata Utilization**: Fast query planning using table statistics

### Use Cases for Spark-Iceberg:
1. **Data Lake Analytics**: Large-scale analytical processing
2. **ETL Pipelines**: Extract, transform, and load operations
3. **Real-time Analytics**: Streaming data processing
4. **Data Science**: Exploratory data analysis and ML feature engineering
5. **Reporting**: Business intelligence and dashboard data preparation

Spark is currently the most feature-rich engine for working with Iceberg tables, which makes it a common choice for production data lake operations.

In [10]:
# Wrap-up (the Spark session is intentionally left running; call spark.stop() to release it)
print("Spark-Iceberg querying demonstration completed!")
print("\nKey takeaways:")
print("- Spark provides native, high-performance Iceberg support")
print("- SQL and DataFrame APIs offer flexible query options")
print("- Advanced optimizations enable efficient large-scale analytics")
print("- Metadata tables provide deep insights into table internals")
print("\nSpark session remains active for further operations")

Spark-Iceberg querying demonstration completed!

Key takeaways:
- Spark provides native, high-performance Iceberg support
- SQL and DataFrame APIs offer flexible query options
- Advanced optimizations enable efficient large-scale analytics
- Metadata tables provide deep insights into table internals

Spark session remains active for further operations
