# Querying Iceberg Tables with DuckDB

This notebook demonstrates how to query Apache Iceberg tables with DuckDB, an in-process analytical database engine. DuckDB's `iceberg` extension lets you query Iceberg tables with standard SQL.

## Learning Objectives

By the end of this notebook, you'll understand:
- How to configure DuckDB for Iceberg table access
- How to connect DuckDB to Iceberg REST catalogs
- How to perform efficient SQL queries on Iceberg data
- How to leverage DuckDB's analytical capabilities
- Best practices for DuckDB-Iceberg integration

## Prerequisites

- Completed notebooks 1-2 (table creation and data insertion)
- DuckDB library installed
- Docker environment running
- PyIceberg and pandas libraries installed

## Why DuckDB with Iceberg?

DuckDB is an excellent choice for Iceberg analytics because:
- **Iceberg Extension**: Iceberg support is provided by DuckDB's `iceberg` extension, which is installed and loaded in the setup cell below
- **Performance**: Vectorized execution engine
- **SQL Compatibility**: Standard SQL interface
- **Lightweight**: Runs in-process, so no server setup is required
- **Analytics Focus**: Optimized for analytical workloads
- **Arrow Integration**: Efficient data exchange

## Environment Setup

Import the libraries used throughout this notebook and confirm that they are available.

In [1]:
import duckdb
from pyiceberg.catalog import load_catalog
import pandas as pd

print("Libraries imported successfully:")
print(f"- DuckDB version: {duckdb.__version__}")
print("- PyIceberg: Available")
print("- Pandas: Available for result display")
print("\nReady for Iceberg querying with DuckDB")

Libraries imported successfully:
- DuckDB version: 1.3.1
- PyIceberg: Available
- Pandas: Available for result display

Ready for Iceberg querying with DuckDB


## Iceberg Catalog Verification

First, let's verify our Iceberg table exists and contains data using PyIceberg.

In [2]:
# Configure PyIceberg catalog connection
catalog_config = {
    "uri": "http://localhost:8181",
    "s3.endpoint": "http://localhost:9000",
    "s3.access-key-id": "admin",
    "s3.secret-access-key": "password",
    "s3.path-style-access": "true",
}

# Load catalog and verify table
try:
    catalog = load_catalog("rest", **catalog_config)
    print("PyIceberg catalog connected")
    
    # List available namespaces
    namespaces = list(catalog.list_namespaces())
    print(f"Available namespaces: {namespaces}")
    
    # Load users table
    users_table = catalog.load_table("play_iceberg.users")
    print("Users table loaded successfully")
    
    # Display table information
    print("\nTable Information:")
    print(f"- Schema fields: {len(users_table.schema().fields)}")
    print(f"- Partition fields: {len(users_table.spec().fields)}")
    
    # Get current data count
    current_data = users_table.scan().to_pandas()
    print(f"- Current record count: {len(current_data)}")
    
except Exception as e:
    print(f"Error connecting to Iceberg catalog: {e}")
    print("Please ensure Docker services are running and previous notebooks completed")
    raise

PyIceberg catalog connected
Available namespaces: [('play_iceberg',)]
Users table loaded successfully

Table Information:
- Schema fields: 8
- Partition fields: 3
- Current record count: 5
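
The connection settings above are hard-coded for the local Docker setup. As a minimal sketch, the same properties could be read from environment variables, falling back to the notebook's local values. The variable names (`ICEBERG_REST_URI` and so on) and the helper `catalog_config_from_env` are hypothetical names chosen for illustration.

```python
import os

# Hypothetical environment variable names mapped to the PyIceberg
# properties used in this notebook, with the local Docker values as defaults.
_DEFAULTS = {
    "uri": ("ICEBERG_REST_URI", "http://localhost:8181"),
    "s3.endpoint": ("ICEBERG_S3_ENDPOINT", "http://localhost:9000"),
    "s3.access-key-id": ("ICEBERG_S3_ACCESS_KEY_ID", "admin"),
    "s3.secret-access-key": ("ICEBERG_S3_SECRET_ACCESS_KEY", "password"),
    "s3.path-style-access": ("ICEBERG_S3_PATH_STYLE", "true"),
}


def catalog_config_from_env(environ=None):
    """Build the catalog property dict, preferring environment overrides."""
    env = os.environ if environ is None else environ
    return {prop: env.get(var, default) for prop, (var, default) in _DEFAULTS.items()}


# Pass an explicit mapping here so the example does not depend on the real environment.
config = catalog_config_from_env({"ICEBERG_REST_URI": "http://catalog:8181"})
print(config["uri"])          # overridden value
print(config["s3.endpoint"])  # default value
```

The resulting dict can be passed to `load_catalog("rest", **config)` in the same way as `catalog_config`.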


## Table Data Preview

Let's examine the current table structure and data using PyIceberg before querying with DuckDB.

In [3]:
# Display table schema and current data
print("Table Schema:")
print("=" * 20)
print(users_table.schema())

print("\nTable Partitioning:")
print("=" * 25)
print(users_table.spec())

print("\nCurrent Data Sample:")
print("=" * 30)
print(current_data.head())

print("\nData Summary:")
print(f"- Total records: {len(current_data)}")
print(f"- Active users: {current_data['is_active'].sum()}")
print(f"- Unique usernames: {current_data['username'].nunique()}")
print(f"- Created years: {current_data['created_year'].unique()}")

Table Schema:
table {
  1: user_id: required long
  2: username: required string
  3: email: required string
  4: is_active: required boolean
  5: created_year: required int
  6: created_month: required int
  7: created_day: required int
  8: updated_at: required timestamp
}

Table Partitioning:
[
  1000: created_year: identity(5)
  1001: created_month: identity(6)
  1002: created_day: identity(7)
]

Current Data Sample:
   user_id       username                      email  is_active  created_year  \
0        1       john_doe       john.doe@example.com       True          2025   
1        2     jane_smith     jane.smith@example.com       True          2025   
2        3   alice_wonder   alice.wonder@example.com      False          2025   
3        4    bob_builder    bob.builder@example.com       True          2025   
4        5  charlie_brown  charlie.brown@example.com       True          2025   

   created_month  created_day                 updated_at  
0              7            1

## DuckDB Setup and Configuration

Configure DuckDB to connect to our Iceberg tables via the REST catalog and MinIO storage.

In [4]:
# Create DuckDB connection
print("Setting up DuckDB connection...")
conn = duckdb.connect()

# Install and load Iceberg extension
print("Installing DuckDB Iceberg extension...")
try:
    conn.execute("INSTALL iceberg")
    conn.execute("LOAD iceberg")
    print("Iceberg extension loaded successfully")
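except Exception as e:
    # INSTALL is a no-op when the extension is already present, so a failure
    # here points to a real problem (for example, no network access)
    print(f"Failed to install or load the Iceberg extension: {e}")
    raise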

# Configure S3/MinIO settings
print("\nConfiguring S3/MinIO settings...")
s3_config_commands = [
    "SET s3_endpoint = 'localhost:9000'",
    "SET s3_access_key_id = 'admin'",
    "SET s3_secret_access_key = 'password'",
    "SET s3_use_ssl = false",
    "SET s3_url_style = 'path'"
]

for cmd in s3_config_commands:
    try:
        conn.execute(cmd)
        print(f"  ✓ {cmd}")
    except Exception as e:
        print(f"  ✗ {cmd} - Error: {e}")

print("\nDuckDB S3 configuration completed")

Setting up DuckDB connection...
Installing DuckDB Iceberg extension...
Iceberg extension loaded successfully

Configuring S3/MinIO settings...
  ✓ SET s3_endpoint = 'localhost:9000'
  ✓ SET s3_access_key_id = 'admin'
  ✓ SET s3_secret_access_key = 'password'
  ✓ SET s3_use_ssl = false
  ✓ SET s3_url_style = 'path'

DuckDB S3 configuration completed
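
DuckDB can also hold S3 credentials in a secret instead of individual `SET` statements. The sketch below only builds the `CREATE SECRET` statement as a string. It assumes that the S3 secret options `KEY_ID`, `SECRET`, `ENDPOINT`, `URL_STYLE`, and `USE_SSL` behave for this MinIO setup like the settings above, which this notebook has not verified. The helper names `sql_literal` and `s3_secret_sql` and the secret name `minio_secret` are hypothetical.

```python
def sql_literal(value: str) -> str:
    """Render a Python string as a SQL string literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def s3_secret_sql(name, key_id, secret, endpoint, use_ssl=False, url_style="path"):
    """Build a CREATE SECRET statement for an S3-compatible store such as MinIO."""
    if not name.isidentifier():
        raise ValueError(f"secret name must be a plain identifier: {name!r}")
    options = [
        "TYPE s3",
        f"KEY_ID {sql_literal(key_id)}",
        f"SECRET {sql_literal(secret)}",
        f"ENDPOINT {sql_literal(endpoint)}",
        f"URL_STYLE {sql_literal(url_style)}",
        f"USE_SSL {'true' if use_ssl else 'false'}",
    ]
    body = ",\n    ".join(options)
    return f"CREATE OR REPLACE SECRET {name} (\n    {body}\n)"


# Same values as the SET statements in the cell above.
statement = s3_secret_sql("minio_secret", "admin", "password", "localhost:9000")
print(statement)
```

The generated statement could then be run with `conn.execute(statement)`.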


## Iceberg Catalog Connection in DuckDB

Establish connection between DuckDB and the Iceberg REST catalog.

In [5]:
# Configure Iceberg catalog in DuckDB
print("Configuring Iceberg catalog in DuckDB...")

try:
    # Create Iceberg secret for authentication
    conn.execute("""
        CREATE OR REPLACE SECRET iceberg_secret (
            TYPE iceberg,
            CLIENT_ID 'admin',
            CLIENT_SECRET 'password',
            ENDPOINT 'http://localhost:8181'
        )
    """)
    print("  ✓ Iceberg secret created")
    
    # Attach Iceberg catalog
    conn.execute("""
        ATTACH 'play_iceberg' AS iceberg_catalog (
            TYPE iceberg, 
            SECRET iceberg_secret
        )
    """)
    print("  ✓ Iceberg catalog attached successfully")
    
except Exception as e:
    print(f"  ✗ Catalog configuration failed: {e}")
    print("  Check that the REST catalog is reachable at http://localhost:8181")
    # Re-raise so later cells do not run against a catalog that was never attached
    raise

print("\nDuckDB-Iceberg integration configured")

Configuring Iceberg catalog in DuckDB...
  ✓ Iceberg secret created
  ✓ Iceberg catalog attached successfully

DuckDB-Iceberg integration configured
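
If attaching the catalog fails, one possible fallback is to read the table directly from its metadata file with the extension's `iceberg_scan` table function, bypassing the REST catalog. The sketch below only builds such a query. The helper `direct_scan_sql` and the example path are hypothetical; in this notebook the real location would come from PyIceberg (`users_table.metadata_location`). Whether the direct scan works against this MinIO setup is an assumption that has not been tested here.

```python
def direct_scan_sql(metadata_location, columns=None):
    """Build a query that reads an Iceberg table through iceberg_scan."""
    # Double single quotes so the path is safe inside a SQL string literal.
    escaped = metadata_location.replace("'", "''")
    select_list = ", ".join(columns) if columns else "*"
    return f"SELECT {select_list} FROM iceberg_scan('{escaped}')"


# Placeholder path for illustration only, not the notebook's actual location.
example_location = "s3://example-bucket/play_iceberg/users/metadata/v1.metadata.json"
print(direct_scan_sql(example_location, ["username", "email"]))
```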


## Table Discovery and Verification

Verify that DuckDB can see and access our Iceberg tables.

In [6]:
# List available tables in DuckDB
print("Discovering tables in DuckDB...")

try:
    # Show all available tables
    tables_result = conn.execute("SHOW ALL TABLES").fetchall()
    
    print("\nAvailable tables:")
    for table_info in tables_result:
        catalog_name, schema_name, table_name = table_info[0], table_info[1], table_info[2]
        print(f"  - {catalog_name}.{schema_name}.{table_name}")
    
    # Check if our users table is accessible (exact match on the table name column)
    users_table_found = any(
        table_info[2] == 'users' for table_info in tables_result
    )
    
    if users_table_found:
        print("\n✓ Users table found and accessible via DuckDB")
    else:
        print("\n⚠ Users table not found in DuckDB catalog listing")
        print("  This may be normal - we can still try direct queries")
        
except Exception as e:
    print(f"Table discovery failed: {e}")
    print("Proceeding with direct query attempts")

print("\nTable discovery completed")

Discovering tables in DuckDB...

Available tables:
  - iceberg_catalog.play_iceberg.users

✓ Users table found and accessible via DuckDB

Table discovery completed
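
When several catalogs or schemas are attached, it can help to look a table up by schema and name rather than scanning the listing by eye. A minimal sketch, assuming rows shaped like the first three columns printed above (catalog, schema, table); the helper `find_table` and the `users_archive` row are made up for illustration.

```python
def find_table(rows, schema, table):
    """Return the first (catalog, schema, table) triple matching schema and table exactly."""
    for row in rows:
        catalog_name, schema_name, table_name = row[0], row[1], row[2]
        if schema_name == schema and table_name == table:
            return catalog_name, schema_name, table_name
    return None


# The second row matches the listing above; the first is a made-up table
# whose name merely contains "users" and must not be returned.
sample_rows = [
    ("iceberg_catalog", "play_iceberg", "users_archive"),
    ("iceberg_catalog", "play_iceberg", "users"),
]
print(find_table(sample_rows, "play_iceberg", "users"))
# ('iceberg_catalog', 'play_iceberg', 'users')
```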


## Basic SQL Queries

Execute basic SQL queries against the Iceberg table using DuckDB.

In [7]:
# Execute basic queries against the Iceberg table
print("Executing basic SQL queries...")

# Define the table reference
table_ref = '"iceberg_catalog"."play_iceberg"."users"'

try:
    # Query 1: Select all records
    print("\n1. SELECT ALL RECORDS:")
    print("=" * 30)
    result1 = conn.execute(f"SELECT * FROM {table_ref}").fetchdf()
    print(f"Total records: {len(result1)}")
    print(result1.head())
    
    # Query 2: Count records
    print("\n2. RECORD COUNT:")
    print("=" * 20)
    count_result = conn.execute(f"SELECT COUNT(*) as total_users FROM {table_ref}").fetchone()
    print(f"Total users: {count_result[0]}")
    
    # Query 3: Filter active users
    print("\n3. ACTIVE USERS ONLY:")
    print("=" * 25)
    active_users = conn.execute(f"""
        SELECT username, email, is_active 
        FROM {table_ref} 
        WHERE is_active = true
    """).fetchdf()
    print(f"Active users: {len(active_users)}")
    print(active_users)
    
    # Query 4: Column projection
    print("\n4. CONTACT INFORMATION:")
    print("=" * 30)
    contacts = conn.execute(f"""
        SELECT username, email 
        FROM {table_ref} 
        ORDER BY username
    """).fetchdf()
    print(contacts)
    
except Exception as e:
    print(f"Query execution failed: {e}")
    print("\nTroubleshooting tips:")
    print("- Verify catalog attachment was successful")
    print("- Check table name and namespace")
    print("- Ensure S3/MinIO configuration is correct")
    raise

print("\nBasic queries completed successfully")

Executing basic SQL queries...

1. SELECT ALL RECORDS:
Total records: 5
   user_id       username                      email  is_active  created_year  \
0        1       john_doe       john.doe@example.com       True          2025   
1        2     jane_smith     jane.smith@example.com       True          2025   
2        3   alice_wonder   alice.wonder@example.com      False          2025   
3        4    bob_builder    bob.builder@example.com       True          2025   
4        5  charlie_brown  charlie.brown@example.com       True          2025   

   created_month  created_day                 updated_at  
0              7            1 2025-07-01 13:41:16.173663  
1              7            1 2025-07-01 13:41:16.173663  
2              7            1 2025-07-01 13:41:16.173663  
3              7            1 2025-07-01 13:41:16.173663  
4              7            1 2025-07-01 13:41:16.173663  

2. RECORD COUNT:
Total users: 5

3. ACTIVE USERS ONLY:
Active users: 4
        usernam
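
The queries above interpolate a quoted, fully qualified table reference into each SQL string. A small sketch of building that reference from its parts, so that names containing double quotes are escaped consistently; `quote_ident` and `qualified_name` are hypothetical helper names.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def qualified_name(*parts: str) -> str:
    """Join catalog, schema, and table names into a quoted dotted reference."""
    return ".".join(quote_ident(part) for part in parts)


table_ref = qualified_name("iceberg_catalog", "play_iceberg", "users")
print(table_ref)  # "iceberg_catalog"."play_iceberg"."users"
```

This only covers identifiers. Values supplied by users are better passed as query parameters than formatted into the SQL text.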

## Advanced Analytical Queries

Demonstrate DuckDB's analytical capabilities with more complex queries.

In [8]:
# Advanced analytical queries showcasing DuckDB's capabilities
print("Executing advanced analytical queries...")

try:
    # Query 1: Aggregation by activity status
    print("\n1. USER ACTIVITY SUMMARY:")
    print("=" * 35)
    activity_summary = conn.execute(f"""
        SELECT 
            is_active,
            COUNT(*) as user_count,
            ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
        FROM {table_ref}
        GROUP BY is_active
        ORDER BY is_active DESC
    """).fetchdf()
    print(activity_summary)
    
    # Query 2: String analysis
    print("\n2. USERNAME ANALYSIS:")
    print("=" * 30)
    username_analysis = conn.execute(f"""
        SELECT 
            username,
            LENGTH(username) as username_length,
            CASE 
                WHEN LENGTH(username) < 8 THEN 'Short'
                WHEN LENGTH(username) < 12 THEN 'Medium'
                ELSE 'Long'
            END as length_category
        FROM {table_ref}
        ORDER BY LENGTH(username) DESC
    """).fetchdf()
    print(username_analysis)
    
    # Query 3: Date/time analysis
    print("\n3. TEMPORAL ANALYSIS:")
    print("=" * 28)
    temporal_analysis = conn.execute(f"""
        SELECT 
            created_year,
            created_month,
            created_day,
            COUNT(*) as users_created,
            MIN(updated_at) as earliest_update,
            MAX(updated_at) as latest_update
        FROM {table_ref}
        GROUP BY created_year, created_month, created_day
        ORDER BY created_year, created_month, created_day
    """).fetchdf()
    print(temporal_analysis)
    
    # Query 4: Complex filtering and ranking
    print("\n4. USER RANKING BY EMAIL DOMAIN:")
    print("=" * 40)
    domain_analysis = conn.execute(f"""
        SELECT 
            username,
            email,
            SPLIT_PART(email, '@', 2) as domain,
            is_active,
            ROW_NUMBER() OVER (PARTITION BY SPLIT_PART(email, '@', 2) ORDER BY username) as user_rank
        FROM {table_ref}
        WHERE email LIKE '%@example.com'
        ORDER BY domain, username
    """).fetchdf()
    print(domain_analysis)
    
except Exception as e:
    print(f"Advanced query execution failed: {e}")
    raise

print("\nAdvanced analytical queries completed")

Executing advanced analytical queries...

1. USER ACTIVITY SUMMARY:
   is_active  user_count  percentage
0       True           4        80.0
1      False           1        20.0

2. USERNAME ANALYSIS:
        username  username_length length_category
0  charlie_brown               13            Long
1   alice_wonder               12            Long
2    bob_builder               11          Medium
3     jane_smith               10          Medium
4       john_doe                8          Medium

3. TEMPORAL ANALYSIS:
   created_year  created_month  created_day  users_created  \
0          2025              7            1              5   

             earliest_update              latest_update  
0 2025-07-01 13:41:16.173663 2025-07-01 13:41:16.173663  

4. USER RANKING BY EMAIL DOMAIN:
        username                      email       domain  is_active  user_rank
0   alice_wonder   alice.wonder@example.com  example.com      False          1
1    bob_builder    bob.builder@example.co
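
The bucketing and domain extraction in the queries above can be cross-checked with a plain Python reference, which makes the boundary cases easy to unit test. This is a sketch under two assumptions: the usernames are ASCII, so Python's `len` agrees with SQL `LENGTH`, and `SPLIT_PART` returns an empty string when the requested part does not exist. The function names are hypothetical.

```python
def length_category(username: str) -> str:
    """Mirror the CASE expression: under 8 is Short, under 12 is Medium, otherwise Long."""
    n = len(username)
    if n < 8:
        return "Short"
    if n < 12:
        return "Medium"
    return "Long"


def email_domain(email: str) -> str:
    """Mirror SPLIT_PART(email, '@', 2); empty string if there is no second part."""
    parts = email.split("@")
    return parts[1] if len(parts) > 1 else ""


# Usernames taken from the sample data shown earlier in the notebook.
for name in ["charlie_brown", "alice_wonder", "bob_builder", "jane_smith", "john_doe"]:
    print(f"{name}: {len(name)} -> {length_category(name)}")

print(email_domain("john.doe@example.com"))
```

Note that `john_doe` has exactly 8 characters and therefore falls into `Medium`, matching the query output above.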

## Performance Analysis

Analyze query performance and demonstrate DuckDB's optimization capabilities.

In [9]:
# Performance analysis and optimization demonstration
print("Analyzing query performance...")

try:
    # Enable query profiling
    conn.execute("PRAGMA enable_profiling")
    
    # Query with EXPLAIN for optimization analysis
    print("\n1. QUERY EXECUTION PLAN:")
    print("=" * 35)
    explain_result = conn.execute(f"""
        EXPLAIN SELECT 
            is_active, 
            COUNT(*) as user_count 
        FROM {table_ref} 
        WHERE created_year = 2025 
        GROUP BY is_active
    """).fetchall()
    
    for row in explain_result:
        print(row[1])  # Print the explain plan
    
    # Execute the actual query
    print("\n2. FILTERED AGGREGATION RESULT:")
    print("=" * 40)
    filtered_result = conn.execute(f"""
        SELECT 
            is_active, 
            COUNT(*) as user_count,
            ARRAY_AGG(username) as usernames
        FROM {table_ref} 
        WHERE created_year = 2025 
        GROUP BY is_active
        ORDER BY is_active DESC
    """).fetchdf()
    print(filtered_result)
    
    # Test partition pruning: the sample data only contains created_month = 7,
    # so this filter on month 6 is expected to match no rows
    print("\n3. PARTITION PRUNING TEST:")
    print("=" * 35)
    partition_query = conn.execute(f"""
        SELECT 
            created_year,
            created_month,
            created_day,
            COUNT(*) as record_count
        FROM {table_ref}
        WHERE created_year = 2025 
          AND created_month = 6
        GROUP BY created_year, created_month, created_day
    """).fetchdf()
    print(partition_query)
    
    print("\nPerformance Observations:")
    print("- DuckDB uses vectorized execution for fast processing")
    print("- Iceberg metadata enables partition pruning")
    print("- Column-oriented storage optimizes analytical queries")
    print("- Query planning leverages table statistics")
    
except Exception as e:
    print(f"Performance analysis failed: {e}")
    
print("\nPerformance analysis completed")

Analyzing query performance...

1. QUERY EXECUTION PLAN:
┌───────────────────────────┐
│       HASH_GROUP_BY       │
│    ────────────────────   │
│         Groups: #0        │
│                           │
│        Aggregates:        │
│        count_star()       │
│                           │
│          ~0 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│         is_active         │
│                           │
│          ~1 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       ICEBERG_SCAN        │
│    ────────────────────   │
│         Function:         │
│        ICEBERG_SCAN       │
│                           │
│        Projections:       │
│         is_active         │
│                           │
│          Filters:         │
│     created_year=2025     │
│                           │
│          ~1 Rows          │
└───────────────────────────┘


2. FILTERED

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
         SELECT              is_active,              COUNT(*) as user_count,             ARRAY_AGG(username) as usernames         FROM "iceberg_catalog"."play_iceberg"."users"          WHERE created_year = 2025          GROUP BY is_active         ORDER BY is_active DESC     
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││            in: 0 bytes            ││
││            out: 0 bytes           ││
││              #HEAD: 0             ││
││              #GET: 0              ││
││              #PUT: 0              ││
││              #POST: 0             ││
││             #DELETE: 0            ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
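
Profiling output shows where time goes inside a single query. For a quick end-to-end comparison between query variants, wall-clock timing over several runs can be enough. A minimal sketch with a hypothetical helper `time_call`; the workload here is a stand-in so the example runs without a database connection.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Call fn() `repeats` times and return (last_result, median_seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return result, statistics.median(durations)


# Stand-in workload; replace with a real query call when a connection is available.
result, seconds = time_call(lambda: sum(range(100_000)))
print(f"result={result}, median={seconds * 1000:.3f} ms")
```

With the notebook's connection, the callable could be `lambda: conn.execute(sql).fetchall()`. The median is used because the first run often includes one-off costs such as metadata fetches.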

## Comparison: DuckDB vs PyIceberg

Compare query results between DuckDB and PyIceberg to verify consistency.

In [10]:
# Compare results between DuckDB and PyIceberg
print("Comparing DuckDB vs PyIceberg query results...")

try:
    # DuckDB query
    print("\n1. DUCKDB RESULTS:")
    print("=" * 25)
    duckdb_result = conn.execute(f"""
        SELECT 
            COUNT(*) as total_count,
            SUM(CASE WHEN is_active THEN 1 ELSE 0 END) as active_count,
            AVG(LENGTH(username)) as avg_username_length
        FROM {table_ref}
    """).fetchdf()
    print(duckdb_result)
    
    # PyIceberg equivalent
    print("\n2. PYICEBERG RESULTS:")
    print("=" * 27)
    pyiceberg_data = users_table.scan().to_pandas()
    
    pyiceberg_summary = {
        'total_count': len(pyiceberg_data),
        'active_count': pyiceberg_data['is_active'].sum(),
        'avg_username_length': pyiceberg_data['username'].str.len().mean()
    }
    
    pyiceberg_df = pd.DataFrame([pyiceberg_summary])
    print(pyiceberg_df)
    
    # Verification
    print("\n3. CONSISTENCY CHECK:")
    print("=" * 27)
    
    consistency_checks = [
        ('Total Count', 
         duckdb_result['total_count'].iloc[0], 
         pyiceberg_summary['total_count']),
        ('Active Count', 
         duckdb_result['active_count'].iloc[0], 
         pyiceberg_summary['active_count']),
        ('Avg Username Length', 
         round(duckdb_result['avg_username_length'].iloc[0], 2), 
         round(pyiceberg_summary['avg_username_length'], 2))
    ]
    
    all_consistent = True
    for metric, duckdb_val, pyiceberg_val in consistency_checks:
        is_consistent = duckdb_val == pyiceberg_val
        all_consistent = all_consistent and is_consistent
        status = "✓" if is_consistent else "✗"
        print(f"  {status} {metric}: DuckDB={duckdb_val}, PyIceberg={pyiceberg_val}")
    
    if all_consistent:
        print("\n✅ All results are consistent between DuckDB and PyIceberg")
    else:
        print("\n⚠️ Some inconsistencies found - investigate further")
        
except Exception as e:
    print(f"Comparison failed: {e}")

print("\nComparison analysis completed")

Comparing DuckDB vs PyIceberg query results...

1. DUCKDB RESULTS:
   total_count  active_count  avg_username_length
0            5           4.0                 10.8

2. PYICEBERG RESULTS:
   total_count  active_count  avg_username_length
0            5             4                 10.8

3. CONSISTENCY CHECK:
  ✓ Total Count: DuckDB=5, PyIceberg=5
  ✓ Active Count: DuckDB=4.0, PyIceberg=4
  ✓ Avg Username Length: DuckDB=10.8, PyIceberg=10.8

✅ All results are consistent between DuckDB and PyIceberg

Comparison analysis completed


┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
         SELECT              COUNT(*) as total_count,             SUM(CASE WHEN is_active THEN 1 ELSE 0 END) as active_count,             AVG(LENGTH(username)) as avg_username_length         FROM "iceberg_catalog"."play_iceberg"."users"     
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││            in: 0 bytes            ││
││            out: 0 bytes           ││
││              #HEAD: 0             ││
││              #GET: 0              ││
││              #PUT: 0              ││
││              #POST: 0             ││
││             #DELETE: 0            ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
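
Rounding both sides and comparing with `==` works for this data, but two nearly equal floats can round to different values when they sit on either side of a rounding boundary. A tolerance-based comparison avoids that. The sketch below uses `math.isclose` on the metric values printed above; the helper `compare_metrics` and the tolerance values are illustrative choices.

```python
import math


def compare_metrics(left, right, rel_tol=1e-9, abs_tol=1e-9):
    """Return {metric: True/False} for every metric in `left`, using a float tolerance."""
    report = {}
    for name, left_value in left.items():
        right_value = right.get(name)
        report[name] = right_value is not None and math.isclose(
            float(left_value), float(right_value), rel_tol=rel_tol, abs_tol=abs_tol
        )
    return report


# Values copied from the DuckDB and PyIceberg results shown above.
duckdb_metrics = {"total_count": 5, "active_count": 4.0, "avg_username_length": 10.8}
pyiceberg_metrics = {"total_count": 5, "active_count": 4, "avg_username_length": 10.8}

report = compare_metrics(duckdb_metrics, pyiceberg_metrics)
print(report)
print("consistent" if all(report.values()) else "inconsistent")
```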

## Best Practices and Optimization Tips

Guidelines for effective DuckDB-Iceberg integration in production environments.

In [11]:
# Demonstrate best practices for DuckDB-Iceberg integration
print("DuckDB-Iceberg Best Practices:")
print("=" * 40)

print("\n1. QUERY OPTIMIZATION:")
print("   - Use column projection to reduce data transfer")
print("   - Apply filters early to leverage partition pruning")
print("   - Utilize DuckDB's vectorized execution")
print("   - Consider query result caching for repeated analyses")

print("\n2. CONNECTION MANAGEMENT:")
print("   - Reuse connections for multiple queries")
print("   - Configure appropriate timeouts")
print("   - Handle connection failures gracefully")
print("   - Monitor connection pool usage")

print("\n3. PERFORMANCE OPTIMIZATION:")
print("   - Enable query profiling for performance analysis")
print("   - Use EXPLAIN to understand query plans")
print("   - Leverage Iceberg's metadata for efficient scanning")
print("   - Consider data locality and network bandwidth")

print("\n4. ERROR HANDLING:")
print("   - Implement retry logic for transient failures")
print("   - Validate table accessibility before queries")
print("   - Handle schema evolution gracefully")
print("   - Log query performance metrics")

# Demonstrate query optimization example
print("\n5. OPTIMIZATION EXAMPLE:")
print("=" * 30)

try:
    # Optimized query with early filtering and column projection
    optimized_query = f"""
        SELECT username, email 
        FROM {table_ref}
        WHERE is_active = true 
          AND created_year = 2025
        ORDER BY username
        LIMIT 10
    """
    
    optimized_result = conn.execute(optimized_query).fetchdf()
    print(f"Optimized query returned {len(optimized_result)} results")
    print(optimized_result)
    
    print("\nOptimization features used:")
    print("  ✓ Column projection (username, email only)")
    print("  ✓ Early filtering (is_active, created_year)")
    print("  ✓ Result limiting (LIMIT 10)")
    print("  ✓ Sorted output (ORDER BY username)")
    
except Exception as e:
    print(f"Optimization example failed: {e}")

print("\nBest practices demonstration completed")

DuckDB-Iceberg Best Practices:

1. QUERY OPTIMIZATION:
   - Use column projection to reduce data transfer
   - Apply filters early to leverage partition pruning
   - Utilize DuckDB's vectorized execution
   - Consider query result caching for repeated analyses

2. CONNECTION MANAGEMENT:
   - Reuse connections for multiple queries
   - Configure appropriate timeouts
   - Handle connection failures gracefully
   - Monitor connection pool usage

3. PERFORMANCE OPTIMIZATION:
   - Enable query profiling for performance analysis
   - Use EXPLAIN to understand query plans
   - Leverage Iceberg's metadata for efficient scanning
   - Consider data locality and network bandwidth

4. ERROR HANDLING:
   - Implement retry logic for transient failures
   - Validate table accessibility before queries
   - Handle schema evolution gracefully
   - Log query performance metrics

5. OPTIMIZATION EXAMPLE:
Optimized query returned 4 results
        username                      email
0    bob_builder    bob

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
         SELECT username, email          FROM "iceberg_catalog"."play_iceberg"."users"         WHERE is_active = true            AND created_year = 2025         ORDER BY username         LIMIT 10     
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││            in: 0 bytes            ││
││            out: 0 bytes           ││
││              #HEAD: 0             ││
││              #GET: 0              ││
││              #PUT: 0              ││
││              #POST: 0             ││
││             #DELETE: 0            ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
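
The error-handling guideline about retrying transient failures can be sketched as a small wrapper with exponential backoff. Which exception types count as transient is an assumption: the defaults below are Python built-ins, and in practice you would pass the types raised in your environment. The helper `with_retry` and the simulated `flaky_query` are hypothetical.

```python
import time


def with_retry(fn, attempts=3, base_delay=0.5,
               retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying on the given exception types with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a query that fails twice before succeeding.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


# sleep is replaced with a no-op so the demonstration finishes immediately.
print(with_retry(flaky_query, sleep=lambda seconds: None))
print(calls["count"])
```

Errors outside `retry_on`, such as a SQL syntax error, propagate immediately, since retrying them would only delay the failure.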

## Cleanup and Session Management

Properly close connections and clean up resources.

In [12]:
# Clean up connections and resources
print("Cleaning up connections and resources...")

try:
    # Close DuckDB connection
    if 'conn' in locals():
        conn.close()
        print("✓ DuckDB connection closed")
    
    print("✓ Cleanup completed successfully")
    
except Exception as e:
    print(f"Cleanup error: {e}")

print("\nSession ended - resources released")

Cleaning up connections and resources...
✓ DuckDB connection closed
✓ Cleanup completed successfully

Session ended - resources released


## Summary

This notebook demonstrated how to query Apache Iceberg tables using DuckDB:

### What We Accomplished:
1. **Environment Setup**: Configured DuckDB with Iceberg extension
2. **Catalog Integration**: Connected DuckDB to Iceberg REST catalog
3. **Basic Querying**: Executed fundamental SQL operations
4. **Advanced Analytics**: Demonstrated complex analytical queries
5. **Performance Analysis**: Explored query optimization techniques
6. **Result Verification**: Compared DuckDB and PyIceberg outputs

### Key Benefits Demonstrated:
- **SQL Interface**: Standard SQL for familiar querying
- **High Performance**: Vectorized execution engine
- **Iceberg Extension**: Iceberg support through DuckDB's `iceberg` extension
- **Analytical Focus**: Optimized for complex analytics
- **Lightweight**: No server infrastructure required

### Technical Concepts Learned:
- **Catalog Configuration**: How to connect DuckDB to Iceberg catalogs
- **Query Optimization**: Leveraging partition pruning and column projection
- **Performance Analysis**: Using EXPLAIN and profiling tools
- **Error Handling**: Robust connection and query management

### Use Cases for DuckDB-Iceberg:
1. **Interactive Analysis**: Ad-hoc data exploration
2. **Reporting**: Analytical reports and dashboards
3. **Data Science**: Exploratory data analysis
4. **ETL Validation**: Data quality checks and validation
5. **Performance Testing**: Query performance analysis

DuckDB provides a convenient SQL interface for Iceberg tables, combining the flexibility of SQL with the performance of a columnar, vectorized analytics engine.