# Creating an Apache Iceberg Table

## Objectives

- Connect to an Iceberg REST catalog and manage namespaces
- Define table schemas with proper data types and field IDs
- Implement effective partitioning strategies for query performance
- Create tables idempotently and verify them through the catalog
- Understand Iceberg metadata management and table structure

In [1]:
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import (
    NestedField,
    StringType,
    LongType,
    BooleanType,
    TimestampType,
    IntegerType
)
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import IdentityTransform
from pyiceberg.exceptions import TableAlreadyExistsError, NamespaceAlreadyExistsError

print("PyIceberg libraries imported successfully")
print("Ready to create Iceberg table")

PyIceberg libraries imported successfully
Ready to create Iceberg table


In [2]:
# Configure the REST catalog with MinIO backend
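# Credentials below are local demo defaults; do not reuse them outside this sandbox
catalog_config = {
    "type": "rest",  # explicit catalog type instead of relying on inference from the URI
    "uri": "http://localhost:8181",
    "s3.endpoint": "http://localhost:9000",
    "s3.access-key-id": "admin",
    "s3.secret-access-key": "password",
    "s3.path-style-access": "true",
}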

# Load the catalog
try:
    catalog = load_catalog("rest", **catalog_config)
    print("Catalog connected successfully!")
    
    # List existing namespaces
    namespaces = list(catalog.list_namespaces())
    print(f"Existing namespaces: {namespaces}")
    
except Exception as e:
    print(f"Failed to connect to catalog: {e}")
    print("Please ensure Docker services are running")
    raise

Catalog connected successfully!
Existing namespaces: [('play_iceberg',)]
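
The connection settings above are hardcoded for the local Docker setup. As a minimal sketch of one alternative, the helper below reads the same properties from environment variables and falls back to the demo defaults. The function name `build_catalog_config` and the `ICEBERG_*` variable names are hypothetical, chosen only for this illustration.

```python
import os
from typing import Dict, Mapping


def build_catalog_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build REST catalog properties, falling back to the local demo defaults.

    The ICEBERG_* variable names are assumptions made for this sketch.
    """
    return {
        "uri": env.get("ICEBERG_REST_URI", "http://localhost:8181"),
        "s3.endpoint": env.get("ICEBERG_S3_ENDPOINT", "http://localhost:9000"),
        "s3.access-key-id": env.get("ICEBERG_S3_ACCESS_KEY_ID", "admin"),
        "s3.secret-access-key": env.get("ICEBERG_S3_SECRET_ACCESS_KEY", "password"),
        "s3.path-style-access": "true",
    }


# The resulting dict can be passed to load_catalog("rest", **catalog_config)
catalog_config = build_catalog_config()
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.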


In [3]:
# Define namespace for our user management system
namespace = "play_iceberg"

print(f"Creating namespace: {namespace}")
print("Namespaces provide logical organization for tables")
print("They support hierarchical organization and access control")

try:
    catalog.create_namespace(namespace)
    print(f"Namespace '{namespace}' created successfully!")
except NamespaceAlreadyExistsError:
    print(f"Namespace '{namespace}' already exists - continuing")
except Exception as e:
    print(f"Error creating namespace: {e}")
    raise  # do not continue with a namespace that may not exist

# Verify namespace creation
updated_namespaces = list(catalog.list_namespaces())
print(f"Current namespaces: {updated_namespaces}")

Creating namespace: play_iceberg
Namespaces provide logical organization for tables
They support hierarchical organization and access control
Namespace 'play_iceberg' already exists - continuing
Current namespaces: [('play_iceberg',)]
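
The output shows namespaces as tuples such as `('play_iceberg',)`, which is how hierarchical names are represented: one tuple element per level. The sketch below shows the dotted-string to tuple mapping in plain Python. The helper `to_identifier` and the nested name `play_iceberg.staging` are hypothetical, and whether nested namespaces are accepted depends on the catalog implementation.

```python
from typing import Tuple


def to_identifier(name: str) -> Tuple[str, ...]:
    """Split a dotted namespace or table name into a tuple, one element per level."""
    parts = tuple(name.split("."))
    if any(not part for part in parts):
        raise ValueError(f"Empty level in identifier: {name!r}")
    return parts


print(to_identifier("play_iceberg"))          # ('play_iceberg',)
print(to_identifier("play_iceberg.staging"))  # ('play_iceberg', 'staging')
```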


In [4]:
# Define the schema for the User table
print("Defining user table schema...")
print("\nSchema Design Considerations:")
print("- user_id: Primary key, long type for large scale")
print("- username/email: String types for text data")
print("- is_active: Boolean for efficient filtering")
print("- created_*: Integer fields for partition keys")
print("- updated_at: Timestamp for audit trails")

user_schema = Schema(
    # Primary identifier (a logical key: Iceberg does not enforce uniqueness)
    NestedField(
        field_id=1, 
        name="user_id", 
        field_type=LongType(), 
        required=True
    ),
    
    # User profile information
    NestedField(
        field_id=2, 
        name="username", 
        field_type=StringType(), 
        required=True
    ),
    NestedField(
        field_id=3, 
        name="email", 
        field_type=StringType(), 
        required=True
    ),
    NestedField(
        field_id=4, 
        name="is_active", 
        field_type=BooleanType(), 
        required=True
    ),
    
    # Partition key fields (created date components)
    NestedField(
        field_id=5, 
        name="created_year", 
        field_type=IntegerType(), 
        required=True
    ),
    NestedField(
        field_id=6, 
        name="created_month", 
        field_type=IntegerType(), 
        required=True
    ),
    NestedField(
        field_id=7, 
        name="created_day", 
        field_type=IntegerType(), 
        required=True
    ),
    
    # Audit timestamp
    NestedField(
        field_id=8, 
        name="updated_at", 
        field_type=TimestampType(), 
        required=True
    ),
)

print(f"\nSchema defined with {len(user_schema.fields)} fields")
print("Field IDs assigned: 1-8 (contiguous; columns added later receive new, never-reused IDs)")

Defining user table schema...

Schema Design Considerations:
- user_id: Primary key, long type for large scale
- username/email: String types for text data
- is_active: Boolean for efficient filtering
- created_*: Integer fields for partition keys
- updated_at: Timestamp for audit trails

Schema defined with 8 fields
Field IDs assigned: 1-8 (contiguous; columns added later receive new, never-reused IDs)
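
Field IDs, not column names, are what tie a column to the data already written, so they must be unique within a schema. The following is a minimal sketch of that bookkeeping in plain Python, using `(field_id, name)` pairs that mirror the schema above. The helpers `validate_field_ids` and `next_field_id` are hypothetical; in a real table the library assigns IDs during schema evolution and the table metadata tracks the highest ID ever used.

```python
from collections import Counter
from typing import List, Tuple

Field = Tuple[int, str]  # (field_id, name)


def validate_field_ids(fields: List[Field]) -> None:
    """Raise ValueError if any field ID appears more than once."""
    counts = Counter(field_id for field_id, _ in fields)
    duplicates = sorted(field_id for field_id, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate field IDs: {duplicates}")


def next_field_id(fields: List[Field]) -> int:
    """Return one more than the highest ID currently in use.

    Simplification: a real table remembers the highest ID ever assigned,
    so IDs of dropped columns are not handed out again.
    """
    return max((field_id for field_id, _ in fields), default=0) + 1


user_fields = [
    (1, "user_id"), (2, "username"), (3, "email"), (4, "is_active"),
    (5, "created_year"), (6, "created_month"), (7, "created_day"),
    (8, "updated_at"),
]

validate_field_ids(user_fields)
print(next_field_id(user_fields))  # 9
```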


In [5]:
# Define partitioning strategy
print("Defining partitioning strategy...")
print("\nPartitioning by date (year/month/day):")
print("- Enables efficient time-range queries")
print("- Supports data retention policies")
print("- Improves query performance through partition pruning")
print("- Facilitates data lifecycle management")

# Create partition specification
# Partition by year, month, and day using identity transforms on precomputed
# integer columns; writers must populate these columns consistently
partition_spec = PartitionSpec(
    PartitionField(
        source_id=5,  # created_year field
        field_id=1000, 
        transform=IdentityTransform(), 
        name="created_year"
    ),
    PartitionField(
        source_id=6,  # created_month field
        field_id=1001, 
        transform=IdentityTransform(), 
        name="created_month"
    ),
    PartitionField(
        source_id=7,  # created_day field
        field_id=1002, 
        transform=IdentityTransform(), 
        name="created_day"
    ),
)

print("\nPartition specification created:")
print("- 3 partition fields (year, month, day)")
print("- Identity transform (no data transformation)")
print("- Field IDs: 1000-1002 (reserved for partition fields)")

print("\nExpected partition structure:")
print("created_year=2025/created_month=6/created_day=27/")
print("This creates a hierarchical directory structure")

Defining partitioning strategy...

Partitioning by date (year/month/day):
- Enables efficient time-range queries
- Supports data retention policies
- Improves query performance through partition pruning
- Facilitates data lifecycle management

Partition specification created:
- 3 partition fields (year, month, day)
- Identity transform (no data transformation)
- Field IDs: 1000-1002 (reserved for partition fields)

Expected partition structure:
created_year=2025/created_month=6/created_day=27/
This creates a hierarchical directory structure
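
Because the partition spec uses identity transforms, each row must arrive with `created_year`, `created_month`, and `created_day` already filled in. A minimal sketch of deriving those values from a creation timestamp, and of the path string they would correspond to, follows. The helper names are hypothetical, and the actual file layout is decided by the writer; Iceberg tracks partition values in metadata rather than relying on directory names.

```python
from datetime import datetime, timezone
from typing import Dict


def partition_values(created_at: datetime) -> Dict[str, int]:
    """Derive the three identity-partition columns from a creation timestamp."""
    return {
        "created_year": created_at.year,
        "created_month": created_at.month,
        "created_day": created_at.day,
    }


def partition_path(values: Dict[str, int]) -> str:
    """Render partition values as a Hive-style path, in partition-spec order."""
    return "/".join(f"{name}={value}" for name, value in values.items()) + "/"


values = partition_values(datetime(2025, 6, 27, 14, 30, tzinfo=timezone.utc))
print(partition_path(values))  # created_year=2025/created_month=6/created_day=27/
```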


In [6]:
# Create the User table
table_name = f"{namespace}.users"

print(f"Creating table: {table_name}")
print("\nTable creation process:")
print("1. Validate schema and partition specification")
print("2. Create metadata files in object storage")
print("3. Register table in catalog")
print("4. Return table reference for operations")

try:
    user_table = catalog.create_table(
        table_name, 
        schema=user_schema, 
        partition_spec=partition_spec
    )
    print(f"\nTable '{table_name}' created successfully!")
    print("Table is ready for data operations")
    
except TableAlreadyExistsError:
    print(f"\nTable '{table_name}' already exists")
    print("Loading existing table reference...")
    user_table = catalog.load_table(table_name)
    print("Existing table loaded successfully")
    
except Exception as e:
    print(f"\nError creating table: {e}")
    raise

print(f"\nTable reference obtained: {type(user_table).__name__}")

Creating table: play_iceberg.users

Table creation process:
1. Validate schema and partition specification
2. Create metadata files in object storage
3. Register table in catalog
4. Return table reference for operations

Table 'play_iceberg.users' already exists
Loading existing table reference...
Existing table loaded successfully

Table reference obtained: Table


In [7]:
# Inspect the created table
print("Table Inspection:")
print("=" * 40)

# Display table schema
print("\nTable Schema:")
print(user_table.schema())

# Display partition specification
print("\nPartition Specification:")
print(user_table.spec())

# Table properties
print("\nTable Properties:")
properties = user_table.properties
if properties:
    for key, value in properties.items():
        print(f"  {key}: {value}")
else:
    print("  No custom properties set")

# Current snapshot information
print("\nCurrent Snapshot:")
current_snapshot = user_table.current_snapshot()
if current_snapshot:
    print(f"  Snapshot ID: {current_snapshot.snapshot_id}")
    print(f"  Timestamp (ms since epoch): {current_snapshot.timestamp_ms}")
else:
    print("  No data snapshots (empty table)")

print("\nTable Status: Ready for data operations")

Table Inspection:
========================================

Table Schema:
table {
  1: user_id: required long
  2: username: required string
  3: email: required string
  4: is_active: required boolean
  5: created_year: required int
  6: created_month: required int
  7: created_day: required int
  8: updated_at: required timestamp
}

Partition Specification:
[
  1000: created_year: identity(5)
  1001: created_month: identity(6)
  1002: created_day: identity(7)
]

Table Properties:
  write.parquet.compression-codec: zstd

Current Snapshot:
  No data snapshots (empty table)

Table Status: Ready for data operations
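
Once the table holds data, the snapshot's `timestamp_ms` value is an epoch timestamp in milliseconds. A small sketch of converting it to a readable UTC datetime is shown here; the helper name `snapshot_time` and the sample value are illustrative only.

```python
from datetime import datetime, timezone


def snapshot_time(timestamp_ms: int) -> datetime:
    """Convert a snapshot timestamp in epoch milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


# Sample value chosen for illustration, not taken from the table above
print(snapshot_time(1751034600000).isoformat())  # 2025-06-27T14:30:00+00:00
```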


In [8]:
# Verify table creation
print("Table Verification:")
print("=" * 30)

# List tables in namespace
try:
    tables_in_namespace = list(catalog.list_tables(namespace))
    print(f"\nTables in '{namespace}' namespace:")
    for table_id in tables_in_namespace:
        print(f"  - {table_id}")
    
    if not tables_in_namespace:
        print("  No tables found")
    
except Exception as e:
    print(f"Error listing tables: {e}")

# Test table accessibility
try:
    test_table = catalog.load_table(table_name)
    print("\nTable accessibility test: SUCCESS")
    print("Table can be loaded and accessed")
    
except Exception as e:
    print("\nTable accessibility test: FAILED")
    print(f"Error: {e}")

# Summary
print("\nTable Creation Summary:")
print(f"- Namespace: {namespace}")
print("- Table: users")
print(f"- Full name: {table_name}")
print(f"- Schema fields: {len(user_schema.fields)}")
print(f"- Partition fields: {len(partition_spec.fields)}")
print("- Status: Ready for data operations")

Table Verification:
==============================

Tables in 'play_iceberg' namespace:
  - ('play_iceberg', 'users')

Table accessibility test: SUCCESS
Table can be loaded and accessed

Table Creation Summary:
- Namespace: play_iceberg
- Table: users
- Full name: play_iceberg.users
- Schema fields: 8
- Partition fields: 3
- Status: Ready for data operations


## Future Steps

### Immediate Next Actions:
1. **Data Ingestion**: Insert sample user data (→ Notebook 2)
2. **Query Operations**: Read and filter data with Spark (→ Notebook 3)
3. **Data Modifications**: Update and upsert operations (→ Notebook 4)
4. **Schema Evolution**: Add columns without breaking changes (→ Notebook 5)

### Production Considerations:
- **Access Control**: Implement namespace-level permissions
- **Monitoring**: Set up query performance tracking
- **Maintenance**: Plan for compaction and snapshot cleanup
- **Backup Strategy**: Design disaster recovery procedures
- **Data Governance**: Establish data quality and lineage tracking
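
To make the snapshot cleanup item concrete, here is a minimal sketch of a retention policy in plain Python: expire snapshots older than a maximum age while always keeping the newest few. It operates on hypothetical `(snapshot_id, timestamp_ms)` pairs and is not the library's expiration API; the parameter names only loosely echo Iceberg's `history.expire.*` table properties.

```python
from typing import List, Tuple

Snapshot = Tuple[int, int]  # (snapshot_id, timestamp_ms)


def snapshots_to_expire(
    snapshots: List[Snapshot],
    now_ms: int,
    max_age_ms: int,
    min_to_keep: int = 1,
) -> List[int]:
    """Return IDs of snapshots older than max_age_ms, oldest first.

    The newest `min_to_keep` snapshots are never returned, whatever their age.
    """
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    candidates = newest_first[min_to_keep:]
    expired = [s for s in candidates if now_ms - s[1] > max_age_ms]
    return [snapshot_id for snapshot_id, _ in sorted(expired, key=lambda s: s[1])]


history = [(1, 1_000), (2, 2_000), (3, 3_000)]
print(snapshots_to_expire(history, now_ms=10_000, max_age_ms=7_500))  # [1, 2]
```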

### Advanced Features to Explore:
- **Time Travel**: Query historical table versions
- **Partition Evolution**: Change partitioning strategy over time
- **Column Statistics**: Optimize query planning with metadata
- **Snapshot Management**: Control table history and retention
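
As a preview of time travel, the sketch below picks the snapshot that was current at a given point in time from a list of hypothetical `(snapshot_id, timestamp_ms)` pairs. It only illustrates the selection rule; with a real table, the chosen ID could then be passed to a read such as PyIceberg's `Table.scan(snapshot_id=...)`.

```python
from typing import List, Optional, Tuple

Snapshot = Tuple[int, int]  # (snapshot_id, timestamp_ms)


def snapshot_as_of(snapshots: List[Snapshot], as_of_ms: int) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before as_of_ms.

    Returns None when the table had no snapshot yet at that time.
    """
    eligible = [s for s in snapshots if s[1] <= as_of_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s[1])[0]


history = [(10, 1_000), (11, 2_000), (12, 3_000)]
print(snapshot_as_of(history, as_of_ms=2_500))  # 11
```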