# Data Insertion with Polars and Apache Iceberg

This notebook demonstrates how to insert data into an Apache Iceberg table from a Polars DataFrame. We'll cover data preparation, schema alignment, and efficient data loading techniques.

## Learning Objectives

By the end of this notebook, you'll understand:
- How to prepare data with a Polars DataFrame
- How to align the DataFrame schema with the Iceberg table schema
- How to insert data efficiently into Iceberg tables
- How to verify the insertion and query the results
- Best practices for data loading and type conversion

## Prerequisites

- Completed notebook 1 (table creation)
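- Polars and PyArrow installed
- PyIceberg installed
- pandas installed (the verification and query cells call `to_pandas()`)
- Docker environment running (REST catalog on port 8181, S3-compatible storage on port 9000)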

## Why Polars with Iceberg?

Polars is an excellent choice for Iceberg data operations because:
- **Performance**: Fast, memory-efficient operations
- **Arrow Integration**: Native Apache Arrow support
- **Type Safety**: Strict column types with explicit casts
- **API Design**: Clean, intuitive DataFrame operations
- **Lazy Evaluation**: Optimized query planning

## Environment Setup

Import necessary libraries and establish connection to the Iceberg catalog.

In [1]:
from pyiceberg.catalog import load_catalog
from datetime import datetime, timezone
import polars as pl
import pyarrow as pa

print("Libraries imported successfully:")
print(f"- Polars version: {pl.__version__}")
print(f"- PyArrow version: {pa.__version__}")
print("- PyIceberg: Available")
print("\nReady for data insertion operations")

Libraries imported successfully:
- Polars version: 1.31.0
- PyArrow version: 20.0.0
- PyIceberg: Available

Ready for data insertion operations


## Catalog Connection and Table Loading

Connect to the Iceberg catalog and load the table we created in notebook 1.

In [2]:
# Configure catalog connection
catalog_config = {
    "uri": "http://localhost:8181",
    "s3.endpoint": "http://localhost:9000",
    "s3.access-key-id": "admin",
    "s3.secret-access-key": "password",
    "s3.path-style-access": "true",
}

# Load catalog
try:
    catalog = load_catalog("rest", **catalog_config)
    print("Catalog connection established")
    
    # Load the users table
    users_table = catalog.load_table("play_iceberg.users")
    print("Users table loaded successfully")
    
    # Display table information
    print(f"\nTable: {users_table.name()}")
    print(f"Schema fields: {len(users_table.schema().fields)}")
    print(f"Partition fields: {len(users_table.spec().fields)}")
    
except Exception as e:
    print(f"Error connecting to catalog or loading table: {e}")
    print("Please ensure notebook 1 has been completed and Docker services are running")
    raise

Catalog connection established
Users table loaded successfully

Table: ('play_iceberg', 'users')
Schema fields: 8
Partition fields: 3
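
The connection cell hardcodes the storage credentials, which is convenient for the local Docker setup. As a minimal sketch of one alternative, the same properties could be read from environment variables, falling back to the notebook's local defaults. The helper `build_catalog_config` and the variable names (`ICEBERG_REST_URI` and so on) are hypothetical and chosen for this example only; they are not defined by PyIceberg.

```python
import os

# Local defaults copied from the connection cell above.
LOCAL_DEFAULTS = {
    "uri": "http://localhost:8181",
    "s3.endpoint": "http://localhost:9000",
    "s3.access-key-id": "admin",
    "s3.secret-access-key": "password",
    "s3.path-style-access": "true",
}

# Hypothetical environment variable names, chosen for this sketch only.
ENV_NAMES = {
    "uri": "ICEBERG_REST_URI",
    "s3.endpoint": "ICEBERG_S3_ENDPOINT",
    "s3.access-key-id": "ICEBERG_S3_ACCESS_KEY_ID",
    "s3.secret-access-key": "ICEBERG_S3_SECRET_ACCESS_KEY",
}


def build_catalog_config(env=None):
    """Return catalog properties, preferring environment values over local defaults."""
    env = os.environ if env is None else env
    config = dict(LOCAL_DEFAULTS)
    for key, var in ENV_NAMES.items():
        if env.get(var):
            config[key] = env[var]
    return config


catalog_config = build_catalog_config()
# catalog = load_catalog("rest", **catalog_config)  # same call as in the cell above
```

With no variables set, the result is identical to the dictionary used in the cell, so the rest of the notebook is unaffected.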


## Schema Inspection

Before inserting data, let's examine the table schema to understand the expected data types and structure.

In [3]:
# Examine table schema
print("Iceberg Table Schema:")
print("=" * 30)
iceberg_schema = users_table.schema()
print(iceberg_schema)

# Get Arrow schema for data type alignment
print("\nArrow Schema (for type conversion):")
print("=" * 40)
arrow_schema = iceberg_schema.as_arrow()
print(arrow_schema)

# Display field information for reference
print("\nField Details:")
print("=" * 20)
for field in iceberg_schema.fields:
    required = "Required" if field.required else "Optional"
    print(f"  {field.field_id}: {field.name} ({field.field_type}) - {required}")

print("\nThis schema will guide our DataFrame creation")

Iceberg Table Schema:
table {
  1: user_id: required long
  2: username: required string
  3: email: required string
  4: is_active: required boolean
  5: created_year: required int
  6: created_month: required int
  7: created_day: required int
  8: updated_at: required timestamp
}

Arrow Schema (for type conversion):
user_id: int64 not null
  -- field metadata --
  PARQUET:field_id: '1'
username: large_string not null
  -- field metadata --
  PARQUET:field_id: '2'
email: large_string not null
  -- field metadata --
  PARQUET:field_id: '3'
is_active: bool not null
  -- field metadata --
  PARQUET:field_id: '4'
created_year: int32 not null
  -- field metadata --
  PARQUET:field_id: '5'
created_month: int32 not null
  -- field metadata --
  PARQUET:field_id: '6'
created_day: int32 not null
  -- field metadata --
  PARQUET:field_id: '7'
updated_at: timestamp[us] not null
  -- field metadata --
  PARQUET:field_id: '8'

Field Details:
  1: user_id (long) - Required
  2: username (string) -

## Data Preparation with Polars

Create sample user data using Polars DataFrame. We'll demonstrate best practices for:
- **Data Type Alignment**: Ensure types match Iceberg schema
- **Partition Value Creation**: Generate partition key values
- **Data Quality**: Include realistic sample data
- **Timestamp Handling**: Capture one UTC timestamp and reuse it for the partition and audit fields

In [4]:
# Create sample user data
print("Creating sample user data...")

# Get current timestamp for audit fields
current_time = datetime.now(timezone.utc)
print(f"Using timestamp: {current_time}")

# Sample user data
sample_users = [
    {
        "user_id": 1,
        "username": "john_doe",
        "email": "john.doe@example.com",
        "is_active": True
    },
    {
        "user_id": 2,
        "username": "jane_smith",
        "email": "jane.smith@example.com",
        "is_active": True
    },
    {
        "user_id": 3,
        "username": "alice_wonder",
        "email": "alice.wonder@example.com",
        "is_active": False
    },
    {
        "user_id": 4,
        "username": "bob_builder",
        "email": "bob.builder@example.com",
        "is_active": True
    },
    {
        "user_id": 5,
        "username": "charlie_brown",
        "email": "charlie.brown@example.com",
        "is_active": True
    }
]

print(f"Prepared {len(sample_users)} user records")
print("Data includes mix of active and inactive users")

Creating sample user data...
Using timestamp: 2025-07-01 14:13:24.144741+00:00
Prepared 5 user records
Data includes mix of active and inactive users
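
Every field in the table schema is required, so a `None` in any record would make the later insertion fail. A small pre-flight check can surface such problems before any DataFrame is built. This is only a sketch: `validate_users` is a hypothetical helper, and treating `user_id` as unique is an assumption about the data, not something Iceberg enforces.

```python
REQUIRED_KEYS = ("user_id", "username", "email", "is_active")


def validate_users(records):
    """Return a list of readable problems; an empty list means the batch looks clean."""
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        # Every required key must be present and not None.
        for key in REQUIRED_KEYS:
            if record.get(key) is None:
                problems.append(f"record {index}: missing value for '{key}'")
        # Assumption for this sketch: user_id should not repeat within a batch.
        user_id = record.get("user_id")
        if user_id is not None:
            if user_id in seen_ids:
                problems.append(f"record {index}: duplicate user_id {user_id}")
            seen_ids.add(user_id)
    return problems


good = [
    {"user_id": 1, "username": "john_doe", "email": "john.doe@example.com", "is_active": True},
    {"user_id": 2, "username": "jane_smith", "email": "jane.smith@example.com", "is_active": True},
]
bad = good + [{"user_id": 2, "username": None, "email": "x@example.com", "is_active": False}]

print(validate_users(good))  # []
for problem in validate_users(bad):
    print(problem)
```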


## DataFrame Creation and Enhancement

Create a Polars DataFrame and add the required partition and audit fields.

In [5]:
# Create Polars DataFrame
print("Creating Polars DataFrame...")

# Base DataFrame from sample data
df = pl.DataFrame(sample_users)

# Add partition fields (date components)
df = df.with_columns([
    pl.lit(current_time.year).alias("created_year").cast(pl.Int32),
    pl.lit(current_time.month).alias("created_month").cast(pl.Int32),
    pl.lit(current_time.day).alias("created_day").cast(pl.Int32),
    pl.lit(current_time).alias("updated_at")
])

# Ensure proper data types
df = df.with_columns([
    pl.col("user_id").cast(pl.Int64),
    pl.col("username").cast(pl.Utf8),
    pl.col("email").cast(pl.Utf8),
    pl.col("is_active").cast(pl.Boolean)
])

print("\nDataFrame created with enhanced fields:")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns}")

# Display the DataFrame
print("\nSample Data Preview:")
print(df)

Creating Polars DataFrame...

DataFrame created with enhanced fields:
Shape: (5, 8)
Columns: ['user_id', 'username', 'email', 'is_active', 'created_year', 'created_month', 'created_day', 'updated_at']

Sample Data Preview:
shape: (5, 8)
┌─────────┬────────────┬────────────┬───────────┬────────────┬────────────┬────────────┬───────────┐
│ user_id ┆ username   ┆ email      ┆ is_active ┆ created_ye ┆ created_mo ┆ created_da ┆ updated_a │
│ ---     ┆ ---        ┆ ---        ┆ ---       ┆ ar         ┆ nth        ┆ y          ┆ t         │
│ i64     ┆ str        ┆ str        ┆ bool      ┆ ---        ┆ ---        ┆ ---        ┆ ---       │
│         ┆            ┆            ┆           ┆ i32        ┆ i32        ┆ i32        ┆ datetime[ │
│         ┆            ┆            ┆           ┆            ┆            ┆            ┆ μs, UTC]  │
╞═════════╪════════════╪════════════╪═══════════╪════════════╪════════════╪════════════╪═══════════╡
│ 1       ┆ john_doe   ┆ john.doe@e ┆ true      ┆ 2025  

## Schema Alignment and Type Conversion

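Convert the Polars DataFrame to an Arrow table and cast it to the Iceberg table's Arrow schema. As the output below shows, the cast marks every field as non-nullable and drops the UTC time zone from `updated_at`, because the table column is a `timestamp` without time zone.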

In [6]:
# Convert to Arrow and align with Iceberg schema
print("Converting DataFrame to Arrow format...")

# Convert Polars DataFrame to Arrow Table
arrow_table = df.to_arrow()

print("\nOriginal Arrow schema:")
print(arrow_table.schema)

# Fetch the Arrow form of the Iceberg schema (the cast target)
print("\nTarget Iceberg Arrow schema:")
target_schema = users_table.schema().as_arrow()
print(target_schema)

# Perform schema alignment
try:
    aligned_table = arrow_table.cast(target_schema)
    print("\nSchema alignment successful!")
    print(f"Final Arrow table shape: {aligned_table.shape}")
    
    # Verify schema match
    print("\nSchema verification:")
    for i, field in enumerate(target_schema):
        original_type = arrow_table.schema.field(i).type
        aligned_type = aligned_table.schema.field(i).type
        match_status = "✓" if original_type == aligned_type else "→"
        print(f"  {field.name}: {original_type} {match_status} {aligned_type}")
        
except Exception as e:
    print(f"Schema alignment failed: {e}")
    print("Check data types and field alignment")
    raise

Converting DataFrame to Arrow format...

Original Arrow schema:
user_id: int64
username: large_string
email: large_string
is_active: bool
created_year: int32
created_month: int32
created_day: int32
updated_at: timestamp[us, tz=UTC]

Target Iceberg Arrow schema:
user_id: int64 not null
  -- field metadata --
  PARQUET:field_id: '1'
username: large_string not null
  -- field metadata --
  PARQUET:field_id: '2'
email: large_string not null
  -- field metadata --
  PARQUET:field_id: '3'
is_active: bool not null
  -- field metadata --
  PARQUET:field_id: '4'
created_year: int32 not null
  -- field metadata --
  PARQUET:field_id: '5'
created_month: int32 not null
  -- field metadata --
  PARQUET:field_id: '6'
created_day: int32 not null
  -- field metadata --
  PARQUET:field_id: '7'
updated_at: timestamp[us] not null
  -- field metadata --
  PARQUET:field_id: '8'

Schema alignment successful!
Final Arrow table shape: (5, 8)

Schema verification:
  user_id: int64 ✓ int64
  username: large_str

## Data Insertion

Insert the prepared data into the Iceberg table. This operation is atomic and will create a new snapshot.

In [7]:
# Insert data into Iceberg table
print("Inserting data into Iceberg table...")
print("\nInsertion process:")
print("1. Validate data against table schema")
print("2. Write data files to object storage")
print("3. Update manifest files")
print("4. Create new table snapshot")
print("5. Update catalog metadata")

try:
    # Check table state before insertion
    print("\nTable state before insertion:")
    current_snapshot = users_table.current_snapshot()
    if current_snapshot:
        print(f"  Current snapshot: {current_snapshot.snapshot_id}")
    else:
        print("  No current snapshot (empty table)")
    
    # Perform insertion
    users_table.append(aligned_table)
    print("\nData insertion completed successfully!")
    
    # Check table state after insertion
    print("\nTable state after insertion:")
    new_snapshot = users_table.current_snapshot()
    if new_snapshot:
        print(f"  New snapshot: {new_snapshot.snapshot_id}")
        print(f"  Snapshot timestamp: {new_snapshot.timestamp_ms}")
    
    print("\nInsertion summary:")
    print(f"  Records inserted: {len(aligned_table)}")
    print("  Table now contains data and is queryable")
    
except Exception as e:
    print(f"Data insertion failed: {e}")
    print("Check data format and table accessibility")
    raise

Inserting data into Iceberg table...

Insertion process:
1. Validate data against table schema
2. Write data files to object storage
3. Update manifest files
4. Create new table snapshot
5. Update catalog metadata

Table state before insertion:
  No current snapshot (empty table)

Data insertion completed successfully!

Table state after insertion:
  New snapshot: 7988583305662433429
  Snapshot timestamp: 1751379211622

Insertion summary:
  Records inserted: 5
  Table now contains data and is queryable
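
The snapshot's `timestamp_ms` is expressed in milliseconds since the Unix epoch, which is hard to read at a glance. A short sketch of converting it to a UTC datetime with the standard library follows; `snapshot_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snapshot_time(timestamp_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Integer timedelta arithmetic avoids floating-point rounding of the milliseconds.
    return EPOCH + timedelta(milliseconds=timestamp_ms)


print(snapshot_time(1751379211622).isoformat())
# 2025-07-01T14:13:31.622000+00:00
```

The value printed for this run falls a few seconds after the `current_time` captured during data preparation, as expected.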


## Data Verification and Querying

Verify the data was inserted correctly by querying the table and examining the results.

In [8]:
# Verify data insertion by querying the table
print("Verifying data insertion...")

try:
    # Scan all data from the table
    scan_result = users_table.scan()
    
    # Convert to pandas for easy viewing
    result_df = scan_result.to_pandas()
    
    print("\nQuery Results:")
    print(f"Total records: {len(result_df)}")
    print(f"Columns: {list(result_df.columns)}")
    
    print("\nData Summary:")
    print(result_df.info())
    
    print("\nSample Records:")
    print(result_df.head())
    
    # Verify data quality
    print("\nData Quality Checks:")
    print(f"  Unique user IDs: {result_df['user_id'].nunique()}")
    print(f"  Active users: {result_df['is_active'].sum()}")
    print(f"  Inactive users: {(~result_df['is_active']).sum()}")
    print(f"  Null values: {result_df.isnull().sum().sum()}")
    
    # Verify partitioning
    partition_info = result_df[['created_year', 'created_month', 'created_day']].drop_duplicates()
    print("\nPartitioning Verification:")
    print(f"  Unique partitions: {len(partition_info)}")
    print(f"  Partition values: {partition_info.values.tolist()}")
    
except Exception as e:
    print(f"Data verification failed: {e}")
    raise

Verifying data insertion...

Query Results:
Total records: 5
Columns: ['user_id', 'username', 'email', 'is_active', 'created_year', 'created_month', 'created_day', 'updated_at']

Data Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   user_id        5 non-null      int64         
 1   username       5 non-null      object        
 2   email          5 non-null      object        
 3   is_active      5 non-null      bool          
 4   created_year   5 non-null      int32         
 5   created_month  5 non-null      int32         
 6   created_day    5 non-null      int32         
 7   updated_at     5 non-null      datetime64[us]
dtypes: bool(1), datetime64[us](1), int32(3), int64(1), object(2)
memory usage: 357.0+ bytes
None

Sample Records:
   user_id       username                      email  is_active  created_year  \
0

## Performance Analysis

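Inspect the snapshot created by the append and consider how batch size affects insertion efficiency. The cell below reads snapshot metadata only; it does not time the write.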

In [9]:
# Analyze insertion performance and file structure
print("Performance and Storage Analysis:")
print("=" * 40)

# Get table metadata
current_snapshot = users_table.current_snapshot()
if current_snapshot:
    print("\nSnapshot Information:")
    print(f"  Snapshot ID: {current_snapshot.snapshot_id}")
    print(f"  Timestamp: {current_snapshot.timestamp_ms}")
    print("  Operation: append (data insertion)")
    
    # File information
    if hasattr(current_snapshot, 'manifest_list'):
        print(f"  Manifest list: {current_snapshot.manifest_list}")

# Performance considerations
print("\nPerformance Considerations:")
record_count = len(result_df)
if record_count < 1000:
    print(f"  Small dataset ({record_count} records) - suitable for testing")
    print("  For production: batch larger datasets for efficiency")
elif record_count < 100000:
    print(f"  Medium dataset ({record_count} records) - good batch size")
else:
    print(f"  Large dataset ({record_count} records) - consider partitioning")

print("\nBest Practices Applied:")
print("  ✓ Schema alignment before insertion")
print("  ✓ Proper data type conversion")
print("  ✓ Partition key population")
print("  ✓ Atomic insertion operation")
print("  ✓ Data verification after insertion")

Performance and Storage Analysis:

Snapshot Information:
  Snapshot ID: 7988583305662433429
  Timestamp: 1751379211622
  Operation: append (data insertion)
  Manifest list: s3://warehouse/play_iceberg/users/metadata/snap-7988583305662433429-0-a0730b11-4f1d-4ae2-bb4e-c77bd509dc10.avro

Performance Considerations:
  Small dataset (5 records) - suitable for testing
  For production: batch larger datasets for efficiency

Best Practices Applied:
  ✓ Schema alignment before insertion
  ✓ Proper data type conversion
  ✓ Partition key population
  ✓ Atomic insertion operation
  ✓ Data verification after insertion
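
The cell above reports snapshot metadata but does not measure how long the append took. One way to capture that, sketched with the standard library only, is a small timing context manager wrapped around the write. The `timed` helper is hypothetical, and the workload here is a stand-in so the example runs without a catalog.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("append", timings):
    # In the notebook this would wrap: users_table.append(aligned_table)
    sum(range(100_000))  # stand-in workload so the sketch runs on its own

record_count = 5  # number of rows in the sample batch
elapsed = timings["append"]
rate = record_count / elapsed if elapsed > 0 else float("inf")
print(f"append took {elapsed:.4f}s ({rate:.0f} records/s)")
```

Comparing this figure across different batch sizes gives a concrete basis for the batching advice printed by the cell.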


## Advanced Querying Examples

Query the inserted data with a row filter, a column projection, and a combination of both.

In [10]:
# Advanced querying examples
print("Advanced Querying Examples:")
print("=" * 35)

# Example 1: Filter by active status
print("\n1. Active Users Only:")
active_users = users_table.scan(
    row_filter="is_active == true"
).to_pandas()
print(f"   Found {len(active_users)} active users")
print(f"   Usernames: {active_users['username'].tolist()}")

# Example 2: Column projection
print("\n2. User Contact Information:")
contact_info = users_table.scan(
    selected_fields=["username", "email", "is_active"]
).to_pandas()
print(f"   Retrieved {len(contact_info.columns)} columns for {len(contact_info)} users")
print(contact_info.head(3))

# Example 3: Combined filter and projection
print("\n3. Active User Emails:")
active_emails = users_table.scan(
    selected_fields=["username", "email"],
    row_filter="is_active == true"
).to_pandas()
print(f"   Retrieved emails for {len(active_emails)} active users")
for _, user in active_emails.iterrows():
    print(f"   - {user['username']}: {user['email']}")

print("\nQuerying Benefits:")
print("  - Row filtering reduces data transfer")
print("  - Column projection minimizes memory usage")
print("  - Partition pruning improves performance")
print("  - Predicate pushdown optimizes storage scans")

Advanced Querying Examples:

1. Active Users Only:
   Found 4 active users
   Usernames: ['john_doe', 'jane_smith', 'bob_builder', 'charlie_brown']

2. User Contact Information:
   Retrieved 3 columns for 5 users
       username                     email  is_active
0      john_doe      john.doe@example.com       True
1    jane_smith    jane.smith@example.com       True
2  alice_wonder  alice.wonder@example.com      False

3. Active User Emails:
   Retrieved emails for 4 active users
   - john_doe: john.doe@example.com
   - jane_smith: jane.smith@example.com
   - bob_builder: bob.builder@example.com
   - charlie_brown: charlie.brown@example.com

Querying Benefits:
  - Row filtering reduces data transfer
  - Column projection minimizes memory usage
  - Partition pruning improves performance
  - Predicate pushdown optimizes storage scans
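
None of the three queries filters on a partition column, so partition pruning is not exercised by them. A minimal sketch of building a `row_filter` string that targets a single day follows. It assumes the table's three partition fields are `created_year`, `created_month`, and `created_day`; `partition_filter` is a hypothetical helper.

```python
from datetime import date


def partition_filter(day, active_only=False):
    """Build a row_filter string restricted to one created_year/month/day value."""
    clauses = [
        f"created_year == {day.year}",
        f"created_month == {day.month}",
        f"created_day == {day.day}",
    ]
    if active_only:
        clauses.append("is_active == true")
    return " and ".join(clauses)


row_filter = partition_filter(date(2025, 7, 1), active_only=True)
print(row_filter)
# created_year == 2025 and created_month == 7 and created_day == 1 and is_active == true

# In the notebook: users_table.scan(row_filter=row_filter).to_pandas()
```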


## Next Steps

The data has been successfully inserted and verified. Here are recommended next steps:

### Immediate Actions:
1. **Explore Querying**: Try different filter and projection combinations
2. **Data Updates**: Learn how to modify existing records
3. **Batch Operations**: Insert larger datasets efficiently
4. **Schema Evolution**: Add new columns to the table

### Production Considerations:
1. **Error Handling**: Implement robust error handling for data operations
2. **Data Validation**: Add validation rules before insertion
3. **Monitoring**: Track insertion performance and data quality
4. **Optimization**: Tune batch sizes and partition strategies

### Advanced Features:
- **Upsert Operations**: Merge new data with existing records
- **Time Travel**: Query historical versions of data
- **Concurrent Writes**: Handle multiple writers safely
- **Data Compaction**: Optimize storage layout over time

## Summary

This notebook demonstrated the complete process of inserting data into Apache Iceberg using Polars:

### What We Accomplished:
1. **Data Preparation**: Created structured user data with Polars
2. **Schema Alignment**: Matched DataFrame types with Iceberg schema
3. **Type Conversion**: Converted the Polars DataFrame to an Arrow table and cast it to the table schema
4. **Data Insertion**: Appended the data to the Iceberg table as a new snapshot
5. **Verification**: Confirmed data integrity and queryability
6. **Performance Analysis**: Inspected the new snapshot's metadata and reviewed batch-size guidance

### Key Concepts Learned:
- **Schema Compatibility**: Importance of type alignment
- **Arrow Integration**: How Polars works with Iceberg via Arrow
- **Partition Handling**: Iceberg assigns rows to partitions from the populated partition columns during insertion
- **Atomic Operations**: Insertion creates consistent snapshots
- **Query Optimization**: How to efficiently read inserted data

### Best Practices Applied:
- **Type Safety**: Explicit type casting for data integrity
- **Schema Validation**: Verified compatibility before insertion
- **Data Quality**: Included sample data with both active and inactive users
- **Error Handling**: Proper exception handling for robust operations
- **Verification**: Confirmed successful operations through querying

The table now contains sample data and is ready for further operations including updates, deletes, and advanced querying scenarios.