# Apache Iceberg Architecture: Understanding the Three Layers

This notebook walks through Apache Iceberg's three-layer architecture: catalog, metadata, and data. Knowing what each layer does makes it easier to reason about query planning and to tune Iceberg's performance.

## Architecture Overview

Apache Iceberg is built on a three-layer architecture that separates concerns and provides flexibility:

### 1. Catalog Layer
- **Purpose**: Table discovery and metadata location
- **Function**: Maps table names to metadata file locations
- **Examples**: Hive Metastore, AWS Glue, REST Catalog, JDBC Catalog

### 2. Metadata Layer
- **Purpose**: Table schema, partitioning, and file tracking
- **Components**: Metadata files, manifest lists, manifest files
- **Function**: Enables ACID transactions and time travel

### 3. Data Layer
- **Purpose**: Actual data storage
- **Format**: Parquet, ORC, or Avro files
- **Organization**: Partitioned and optimized for query performance

## What We'll Explore

1. **Catalog Layer Deep Dive**: How table discovery works
2. **Metadata Layer Analysis**: Examine metadata files and manifests
3. **Data Layer Inspection**: Understand data file organization
4. **Layer Interactions**: How the layers work together
5. **Performance Implications**: Impact on query planning and execution
6. **Best Practices**: Optimizing each layer

## Prerequisites

- Basic understanding of data lakes and table formats
- Completed the previous notebooks, which create the sample data
- Access to an Iceberg table through a REST catalog

## Environment Setup

Initialize our environment and prepare for architecture exploration.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, size

# Create Spark session
spark = SparkSession.builder \
    .appName("Iceberg Architecture Layers") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print("Architecture exploration session initialized")

# Set the default namespace
spark.sql("USE rest.`play_iceberg`")
print("Using namespace: rest.play_iceberg")

# Verify table exists
try:
    table_count = spark.sql("SELECT COUNT(*) as count FROM users").collect()[0]['count']
    print(f"Target table 'users' found with {table_count} records")
except Exception as e:
    print(f"Error accessing table: {e}")
    print("Please run previous notebooks to create sample data")

25/07/01 13:54:03 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Spark version: 3.5.5
Architecture exploration session initialized
Using namespace: rest.play_iceberg



Target table 'users' found with 1012 records


                                                                                

## 1. Catalog Layer Deep Dive

The catalog layer is the entry point to Iceberg tables. It provides:
- **Table Discovery**: Maps table names to metadata locations
- **Namespace Management**: Organizes tables into logical groups
- **Access Control**: Authenticates clients and authorizes access to tables
- **Metadata Location**: Points to the current metadata file
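
The pointer role described above can be sketched in plain Python. This is a toy model, not Iceberg's implementation: `ToyCatalog`, its methods, and the metadata file names are all hypothetical and exist only to show how a catalog resolves a table name and how a commit can be made atomic by swapping the pointer only when it has not moved.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyCatalog:
    """Hypothetical in-memory stand-in for a catalog: table name -> metadata file."""
    pointers: Dict[str, str] = field(default_factory=dict)

    def load_table(self, name: str) -> str:
        # Table discovery: resolve a name to its current metadata file location.
        return self.pointers[name]

    def commit(self, name: str, expected: Optional[str], new_location: str) -> bool:
        # Compare-and-swap: succeed only if nobody else moved the pointer first.
        if self.pointers.get(name) != expected:
            return False
        self.pointers[name] = new_location
        return True


catalog = ToyCatalog()
# Creating the table: there is no previous pointer, so `expected` is None.
catalog.commit("play_iceberg.users", None, "metadata/v1.metadata.json")

current = catalog.load_table("play_iceberg.users")
# A writer holding a stale pointer is rejected and must retry against the new state.
print(catalog.commit("play_iceberg.users", "metadata/v0.metadata.json", "metadata/v2.metadata.json"))
# A writer holding the current pointer succeeds.
print(catalog.commit("play_iceberg.users", current, "metadata/v2.metadata.json"))
```

A real catalog performs the same check inside its backing store (a database row, a REST service, and so on), which is why readers always see either the old or the new metadata file and never a half-written state.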

### 1.1 Catalog Configuration
Let's examine our catalog configuration.

In [2]:
# Examine Spark catalog configuration
print("Catalog Configuration:")
print("=" * 40)

catalog_configs = {
    'Default Catalog': spark.conf.get('spark.sql.defaultCatalog', 'Not set'),
    'Catalog Type': spark.conf.get('spark.sql.catalog.rest.type', 'Not set'),
    'Catalog URI': spark.conf.get('spark.sql.catalog.rest.uri', 'Not set'),
    'Warehouse Location': spark.conf.get('spark.sql.catalog.rest.warehouse', 'Not set'),
    'S3 Endpoint': spark.conf.get('spark.sql.catalog.rest.s3.endpoint', 'Not set')
}

for key, value in catalog_configs.items():
    print(f"{key}: {value}")

print("\nCatalog Type: REST Catalog")
print("- Provides HTTP API for table operations")
print("- Centralized metadata management")
print("- Supports multiple storage backends")

Catalog Configuration:
Default Catalog: rest
Catalog Type: rest
Catalog URI: http://iceberg-rest:8181/
Warehouse Location: s3://warehouse/
S3 Endpoint: http://minio:9000

Catalog Type: REST Catalog
- Provides HTTP API for table operations
- Centralized metadata management
- Supports multiple storage backends
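
For reference, the settings printed above could be assembled as follows. This is a sketch under assumptions: `rest_catalog_conf` is a hypothetical helper, and the session in this notebook was configured elsewhere, so its real configuration may include further keys (credentials, file IO settings) that are not shown here.

```python
def rest_catalog_conf(name: str, uri: str, warehouse: str, s3_endpoint: str) -> dict:
    """Build the Spark settings for an Iceberg REST catalog registered under `name`."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        "spark.sql.defaultCatalog": name,
        # Registers the catalog name with Iceberg's Spark catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.s3.endpoint": s3_endpoint,
    }


# Values copied from the configuration output above.
conf = rest_catalog_conf(
    name="rest",
    uri="http://iceberg-rest:8181/",
    warehouse="s3://warehouse/",
    s3_endpoint="http://minio:9000",
)
for key, value in conf.items():
    print(f"{key}={value}")
```

Each pair can be passed to `SparkSession.builder.config(key, value)` before `getOrCreate()`; settings applied after the session exists only take effect if they are runtime SQL configurations.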


### 1.2 Catalog Operations
Explore what the catalog layer provides.

In [3]:
# List all catalogs
print("Available Catalogs:")
print("=" * 30)
spark.sql("SHOW CATALOGS").show()

# List namespaces in our catalog
print("Namespaces in 'rest' catalog:")
print("=" * 35)
try:
    spark.sql("SHOW NAMESPACES IN rest").show()
except Exception as e:
    print(f"Note: {e}")
    print("Using direct namespace access")

# List tables in our namespace
print("Tables in 'rest.play_iceberg' namespace:")
print("=" * 45)
tables_df = spark.sql("SHOW TABLES IN rest.`play_iceberg`")
tables_df.show()

# Get table count
table_count = tables_df.count()
print(f"Total tables found: {table_count}")

Available Catalogs:
+-------------+
|      catalog|
+-------------+
|         rest|
|spark_catalog|
+-------------+

Namespaces in 'rest' catalog:
+------------+
|   namespace|
+------------+
|play_iceberg|
+------------+

Tables in 'rest.play_iceberg' namespace:
+------------+---------+-----------+
|   namespace|tableName|isTemporary|
+------------+---------+-----------+
|play_iceberg|    users|      false|
+------------+---------+-----------+

Total tables found: 1


### 1.3 Table Metadata Location
The catalog stores the location of the current metadata file for each table.

In [4]:
# Get detailed table information
print("Table Metadata Information:")
print("=" * 40)

# Show table details
table_details = spark.sql("DESCRIBE EXTENDED users")
table_details.show(50, truncate=False)

print("\nCatalog Layer Functions:")
print("1. Maps 'users' table name to metadata file location")
print("2. Provides table schema and properties")
print("3. Manages table lifecycle (create, drop, rename)")
print("4. Coordinates concurrent commits by atomically swapping the metadata pointer")

Table Metadata Information:
+----------------------------+----------------------------------------------------------------------------------------------------------------------+----------------------------------------------+
|col_name                    |data_type                                                                                                             |comment                                       |
+----------------------------+----------------------------------------------------------------------------------------------------------------------+----------------------------------------------+
|user_id                     |bigint                                                                                                                |NULL                                          |
|username                    |string                                                                                                                |NULL                               

## 2. Metadata Layer Deep Dive

The metadata layer is the heart of Iceberg's functionality. It consists of:

### Metadata Hierarchy:
1. **Metadata File** (JSON): Root of metadata tree, contains schema and current snapshot
2. **Manifest List** (Avro): Lists all manifest files for a snapshot
3. **Manifest Files** (Avro): Lists data files with statistics
4. **Data Files** (Parquet/ORC/Avro): Actual data storage
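
The hierarchy above can be sketched as a small tree walk. The classes below are hypothetical simplifications (real metadata files, manifest lists, and manifests carry many more fields), and the file paths are made up; the point is only the order in which a reader descends from the metadata file to the data files.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataFile:
    path: str
    record_count: int


@dataclass
class Manifest:
    data_files: List[DataFile]


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: List[Manifest]  # stands in for the Avro manifest list


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: Dict[int, Snapshot]


def plan_files(metadata: TableMetadata) -> List[DataFile]:
    """Walk metadata file -> current snapshot -> manifest list -> manifests -> data files."""
    snapshot = metadata.snapshots[metadata.current_snapshot_id]
    return [f for manifest in snapshot.manifest_list for f in manifest.data_files]


# Hypothetical table with one snapshot, two manifests, and three data files.
metadata = TableMetadata(
    current_snapshot_id=1,
    snapshots={
        1: Snapshot(
            snapshot_id=1,
            manifest_list=[
                Manifest([DataFile("data/a.parquet", 998), DataFile("data/b.parquet", 3)]),
                Manifest([DataFile("data/c.parquet", 7)]),
            ],
        )
    },
)

files = plan_files(metadata)
print(len(files), sum(f.record_count for f in files))
```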

### 2.1 Table Metadata Files
Let's explore the metadata files structure.

In [5]:
# Get current table metadata
print("Current Table Schema and Metadata:")
print("=" * 45)

# Show table schema
users_df = spark.table("users")
print("Table Schema:")
users_df.printSchema()

# Count current records (this scans the table; it does not read table properties)
print("\nTable Properties from Metadata Layer:")
properties = spark.sql("""
SELECT 
    current_timestamp() as query_time,
    'users' as table_name,
    COUNT(*) as current_record_count
FROM users
""")
properties.show(truncate=False)

# Show record counts per partition value (computed from the data, not read from metadata)
print("Partitioning Strategy (from metadata):")
partition_info = spark.sql("""
SELECT 
    created_year,
    created_month, 
    created_day,
    COUNT(*) as records_in_partition
FROM users 
GROUP BY created_year, created_month, created_day
ORDER BY created_year, created_month, created_day
""")
partition_info.show()

Current Table Schema and Metadata:
Table Schema:
root
 |-- user_id: long (nullable = false)
 |-- username: string (nullable = false)
 |-- email: string (nullable = false)
 |-- is_active: boolean (nullable = false)
 |-- created_year: integer (nullable = false)
 |-- created_month: integer (nullable = false)
 |-- created_day: integer (nullable = false)
 |-- updated_at: timestamp_ntz (nullable = false)
 |-- country: string (nullable = true)
 |-- registration_source: string (nullable = true)
 |-- engagement_score: double (nullable = true)
 |-- last_login_at: timestamp (nullable = true)


Table Properties from Metadata Layer:
+--------------------------+----------+--------------------+
|query_time                |table_name|current_record_count|
+--------------------------+----------+--------------------+
|2025-07-01 13:54:14.663151|users     |1012                |
+--------------------------+----------+--------------------+

Partitioning Strategy (from metadata):




+------------+-------------+-----------+--------------------+
|created_year|created_month|created_day|records_in_partition|
+------------+-------------+-----------+--------------------+
|        2025|            6|         27|                1002|
|        2025|            6|         28|                   3|
|        2025|            7|          1|                   7|
+------------+-------------+-----------+--------------------+
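
The counts above come from grouping the data itself. Iceberg also exposes a `partitions` metadata table that reports similar numbers from the metadata layer. The helpers below are a sketch: the function names are hypothetical, and the table name is the one used throughout this notebook.

```python
def partition_overview_sql(table: str) -> str:
    """Build a query against the table's `partitions` metadata table."""
    return f"""
    SELECT `partition`, record_count, file_count
    FROM {table}.partitions
    ORDER BY record_count DESC
    """


def show_partition_overview(spark, table: str = "rest.play_iceberg.users") -> None:
    # `spark` is an existing SparkSession with the Iceberg catalog configured.
    spark.sql(partition_overview_sql(table)).show(truncate=False)


# Usage in this notebook (requires the live session):
# show_partition_overview(spark)
print(partition_overview_sql("rest.play_iceberg.users"))
```

Because this reads manifests rather than data files, it stays cheap even when the table itself is large.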



                                                                                

### 2.2 Snapshots and History
Each committed write to the table (append, overwrite, or delete) creates a new snapshot in the metadata layer.

In [6]:
# Examine table snapshots
print("Table Snapshots (Metadata Layer History):")
print("=" * 50)

snapshots_df = spark.sql("""
SELECT 
    made_current_at,
    snapshot_id,
    parent_id,
    is_current_ancestor
FROM rest.`play_iceberg`.users.history
ORDER BY made_current_at
""")

snapshots_df.show(truncate=False)

# Count snapshots
snapshot_count = snapshots_df.count()
current_snapshots = snapshots_df.filter("is_current_ancestor = true").count()
orphaned_snapshots = snapshot_count - current_snapshots

print("\nSnapshot Analysis:")
print(f"Total Snapshots: {snapshot_count}")
print(f"Current Lineage: {current_snapshots}")
print(f"Orphaned Snapshots: {orphaned_snapshots}")

print("\nMetadata Layer Benefits:")
print("- ACID transactions through atomic metadata updates")
print("- Time travel via snapshot history")
print("- Schema evolution without data rewrite")
print("- Concurrent readers during writes")

Table Snapshots (Metadata Layer History):
+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2025-07-01 13:41:23.087|3824926247776142618|NULL               |true               |
|2025-07-01 13:46:22.566|3682366184055111348|3824926247776142618|true               |
|2025-07-01 13:46:29.92 |5474445044599496672|3682366184055111348|true               |
|2025-07-01 13:46:43.276|703749108260326071 |5474445044599496672|true               |
|2025-07-01 13:46:51.537|1680807781020980082|703749108260326071 |true               |
|2025-07-01 13:47:01.334|8866811446601788626|1680807781020980082|true               |
|2025-07-01 13:48:29.944|5929459840228022626|8866811446601788626|true               |
|2025-07-01 13:48:37.868|1926503896228031236|5929459840228022626|true               |
|2025-07-01 
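
The `is_current_ancestor` flag can be understood as a walk along `parent_id` links starting from the current snapshot. The sketch below uses the first four snapshot IDs from the output above and treats the fourth as current purely for illustration; the snapshot with ID `111` is hypothetical and stands for a snapshot left behind by a rollback.

```python
from typing import Dict, List, Optional


def current_ancestors(parents: Dict[int, Optional[int]], current_id: int) -> List[int]:
    """Follow parent_id links from the current snapshot back to the first one."""
    lineage = []
    node: Optional[int] = current_id
    while node is not None:
        lineage.append(node)
        node = parents.get(node)
    return lineage


# snapshot_id -> parent_id, taken from the first four history rows above.
parents = {
    3824926247776142618: None,
    3682366184055111348: 3824926247776142618,
    5474445044599496672: 3682366184055111348,
    703749108260326071: 5474445044599496672,
    111: 3682366184055111348,  # hypothetical snapshot outside the current lineage
}

lineage = current_ancestors(parents, current_id=703749108260326071)
orphaned = [sid for sid in parents if sid not in lineage]
print(f"Current lineage: {len(lineage)} snapshots")
print(f"Outside lineage: {orphaned}")
```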

### 2.3 Manifest Files
Manifest files act as an index: each one lists data files together with their partition values and column statistics.

In [7]:
# Examine manifest files
print("Manifest Files (Metadata Layer Index):")
print("=" * 50)

manifests_df = spark.sql("""
SELECT 
    content,
    CASE content
        WHEN 0 THEN 'DATA_FILES'
        WHEN 1 THEN 'DELETE_FILES'
        ELSE 'UNKNOWN'
    END as content_type,
    partition_spec_id,
    added_snapshot_id,
    added_data_files_count,
    existing_data_files_count,
    deleted_data_files_count,
    ROUND(length / 1024.0, 2) as manifest_size_kb
FROM rest.`play_iceberg`.users.manifests
ORDER BY added_snapshot_id, partition_spec_id
""")

manifests_df.show(truncate=False)

# Manifest statistics
manifest_stats = spark.sql("""
SELECT 
    COUNT(*) as total_manifests,
    SUM(added_data_files_count + existing_data_files_count) as total_data_files_tracked,
    SUM(deleted_data_files_count) as total_deleted_files,
    ROUND(AVG(length / 1024.0), 2) as avg_manifest_size_kb
FROM rest.`play_iceberg`.users.manifests
""")

print("\nManifest Statistics:")
manifest_stats.show()

print("Manifest File Functions:")
print("- Index data files by partition")
print("- Store file-level statistics (row counts, column bounds)")
print("- Enable partition pruning and file skipping")
print("- Support efficient query planning")

Manifest Files (Metadata Layer Index):
+-------+------------+-----------------+-------------------+----------------------+-------------------------+------------------------+----------------+
|content|content_type|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|deleted_data_files_count|manifest_size_kb|
+-------+------------+-----------------+-------------------+----------------------+-------------------------+------------------------+----------------+
|0      |DATA_FILES  |0                |1926503896228031236|0                     |4                        |0                       |8.86            |
|0      |DATA_FILES  |1                |1926503896228031236|2                     |0                        |0                       |8.76            |
+-------+------------+-----------------+-------------------+----------------------+-------------------------+------------------------+----------------+


Manifest Statistics:
+---------------+---------

### 2.4 Partition Summaries
Manifests contain partition summaries that enable efficient query pruning.

In [8]:
# Examine partition summaries
print("Partition Summaries (Query Optimization Metadata):")
print("=" * 60)

# Get manifest with partition summaries
manifest_with_summaries = spark.sql("""
SELECT 
    partition_spec_id,
    added_snapshot_id,
    partition_summaries
FROM rest.`play_iceberg`.users.manifests
WHERE partition_summaries IS NOT NULL
LIMIT 1
""")

# Show partition summary structure
summaries = manifest_with_summaries.collect()
if summaries:
    summary = summaries[0]
    print(f"Partition Spec ID: {summary['partition_spec_id']}")
    print(f"Snapshot ID: {summary['added_snapshot_id']}")
    print("\nPartition Summary Details:")
    
    if summary['partition_summaries']:
        # Each entry summarizes one partition field (in partition spec order), not one partition
        for i, part_summary in enumerate(summary['partition_summaries']):
            print(f"  Partition {i+1}:")
            print(f"    Lower Bound: {part_summary['lower_bound']}")
            print(f"    Upper Bound: {part_summary['upper_bound']}")
            print(f"    Contains Null: {part_summary['contains_null']}")
            print(f"    Contains NaN: {part_summary['contains_nan']}")
            print()
else:
    print("No partition summaries found")

print("Partition Summary Benefits:")
print("- Enable partition pruning during query planning")
print("- Reduce data scan requirements")
print("- Support predicate pushdown optimizations")
print("- Improve query performance significantly")

Partition Summaries (Query Optimization Metadata):
Partition Spec ID: 1
Snapshot ID: 1926503896228031236

Partition Summary Details:
  Partition 1:
    Lower Bound: 2025
    Upper Bound: 2025
    Contains Null: False
    Contains NaN: False

  Partition 2:
    Lower Bound: 6
    Upper Bound: 6
    Contains Null: False
    Contains NaN: False

  Partition 3:
    Lower Bound: 28
    Upper Bound: 28
    Contains Null: False
    Contains NaN: False

  Partition 4:
    Lower Bound: false
    Upper Bound: true
    Contains Null: False
    Contains NaN: False

Partition Summary Benefits:
- Enable partition pruning during query planning
- Reduce data scan requirements
- Support predicate pushdown optimizations
- Improve query performance significantly
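
To make the pruning idea concrete, here is a minimal sketch of how a planner could use these bounds to decide whether a manifest needs to be read at all. It is not Iceberg's evaluator: `FieldSummary` and `manifest_may_match` are hypothetical, only equality predicates are handled, and the field names are assumed to be `created_year`, `created_month`, `created_day`, and `is_active` in the order shown above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FieldSummary:
    lower_bound: Any
    upper_bound: Any


def manifest_may_match(summaries: Dict[str, FieldSummary], equals: Dict[str, Any]) -> bool:
    """Return False only when an equality predicate falls outside a field's bounds."""
    for field_name, wanted in equals.items():
        summary = summaries.get(field_name)
        if summary is None:
            continue  # no summary for this field, so the manifest cannot be ruled out
        if not (summary.lower_bound <= wanted <= summary.upper_bound):
            return False
    return True


# Bounds copied from the partition summary output above (field names are assumed).
summaries = {
    "created_year": FieldSummary(2025, 2025),
    "created_month": FieldSummary(6, 6),
    "created_day": FieldSummary(28, 28),
    "is_active": FieldSummary(False, True),
}

# A query for day 27 can skip this manifest without opening it.
print(manifest_may_match(summaries, {"created_day": 27}))  # False
# A query for day 28 and active users must read it.
print(manifest_may_match(summaries, {"created_day": 28, "is_active": True}))  # True
```

The check is conservative by design: it may keep a manifest that turns out to hold no matching rows, but it never discards one that could.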


## 3. Data Layer Deep Dive

The data layer contains the actual table data stored in optimized file formats.

### Data Layer Characteristics:
- **File Formats**: Parquet (default), ORC, Avro
- **Partitioning**: Physical data organization
- **Compression**: Storage optimization
- **Statistics**: Column-level min/max values

### 3.1 Data File Organization

In [9]:
# Examine data files
print("Data Layer File Organization:")
print("=" * 40)

data_files_df = spark.sql("""
SELECT 
    file_path,
    file_format,
    record_count,
    file_size_in_bytes,
    ROUND(file_size_in_bytes / 1024.0 / 1024.0, 3) as file_size_mb,
    ROUND(file_size_in_bytes * 1.0 / record_count, 2) as bytes_per_record
FROM rest.`play_iceberg`.users.files
ORDER BY file_size_in_bytes DESC
""")

# Show file overview
data_files_df.select(
    "file_format", "record_count", "file_size_mb", "bytes_per_record"
).show(truncate=False)

# Data layer statistics
data_stats = spark.sql("""
SELECT 
    COUNT(*) as total_files,
    SUM(record_count) as total_records,
    SUM(file_size_in_bytes) as total_size_bytes,
    ROUND(SUM(file_size_in_bytes) / 1024.0 / 1024.0, 3) as total_size_mb,
    ROUND(AVG(file_size_in_bytes), 0) as avg_file_size_bytes,
    ROUND(AVG(record_count), 1) as avg_records_per_file
FROM rest.`play_iceberg`.users.files
""")

print("\nData Layer Statistics:")
data_stats.show(truncate=False)

Data Layer File Organization:
+-----------+------------+------------+----------------+
|file_format|record_count|file_size_mb|bytes_per_record|
+-----------+------------+------------+----------------+
|PARQUET    |998         |0.012       |13.07           |
|PARQUET    |2           |0.004       |1944.50         |
|PARQUET    |3           |0.004       |1282.33         |
|PARQUET    |1           |0.004       |3681.00         |
|PARQUET    |7           |0.003       |375.29          |
|PARQUET    |1           |0.002       |2442.00         |
+-----------+------------+------------+----------------+


Data Layer Statistics:
+-----------+-------------+----------------+-------------+-------------------+--------------------+
|total_files|total_records|total_size_bytes|total_size_mb|avg_file_size_bytes|avg_records_per_file|
+-----------+-------------+----------------+-------------+-------------------+--------------------+
|6          |1012         |29528           |0.028        |4921.0           

### 3.2 Partition Layout Analysis
Examine how data is physically organized by partitions.

In [10]:
# Analyze partition layout from file paths
print("Physical Partition Layout:")
print("=" * 35)

# Extract partition information from file paths
partition_analysis = spark.sql("""
SELECT 
    file_path,
    record_count,
    file_size_in_bytes,
    CASE 
        WHEN file_path LIKE '%created_year=2025/created_month=6/created_day=27%' THEN '2025/06/27'
        ELSE 'OTHER'
    END as partition_path,
    CASE
        WHEN file_path LIKE '%is_active=true%' THEN 'ACTIVE'
        WHEN file_path LIKE '%is_active=false%' THEN 'INACTIVE'
        ELSE 'MIXED'
    END as activity_partition
FROM rest.`play_iceberg`.users.files
""")

# Show partition distribution
partition_summary = spark.sql("""
SELECT 
    CASE 
        WHEN file_path LIKE '%created_year=2025/created_month=6/created_day=27%' THEN '2025/06/27'
        ELSE 'OTHER'
    END as date_partition,
    CASE
        WHEN file_path LIKE '%is_active=true%' THEN 'ACTIVE'
        WHEN file_path LIKE '%is_active=false%' THEN 'INACTIVE'
        ELSE 'MIXED'
    END as activity_partition,
    COUNT(*) as file_count,
    SUM(record_count) as total_records,
    ROUND(SUM(file_size_in_bytes) / 1024.0, 2) as total_size_kb
FROM rest.`play_iceberg`.users.files
GROUP BY 
    CASE 
        WHEN file_path LIKE '%created_year=2025/created_month=6/created_day=27%' THEN '2025/06/27'
        ELSE 'OTHER'
    END,
    CASE
        WHEN file_path LIKE '%is_active=true%' THEN 'ACTIVE'
        WHEN file_path LIKE '%is_active=false%' THEN 'INACTIVE'
        ELSE 'MIXED'
    END
ORDER BY date_partition, activity_partition
""")

partition_summary.show(truncate=False)

print("\nPartition Benefits:")
print("- Enables partition pruning during queries")
print("- Improves data locality and scan performance")
print("- Supports efficient data lifecycle management")
print("- Reduces I/O for time-based queries")

Physical Partition Layout:
+--------------+------------------+----------+-------------+-------------+
|date_partition|activity_partition|file_count|total_records|total_size_kb|
+--------------+------------------+----------+-------------+-------------+
|2025/06/27    |MIXED             |3         |1002         |18.88        |
|OTHER         |ACTIVE            |1         |2            |3.80         |
|OTHER         |INACTIVE          |1         |1            |3.59         |
|OTHER         |MIXED             |1         |7            |2.57         |
+--------------+------------------+----------+-------------+-------------+


Partition Benefits:
- Enables partition pruning during queries
- Improves data locality and scan performance
- Supports efficient data lifecycle management
- Reduces I/O for time-based queries
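
The queries above recover partitions by pattern-matching file paths, which only recognizes the one date that is hard-coded. A less brittle alternative, sketched here, groups on the `partition` struct column of the `files` metadata table. The helper names are hypothetical; because this table has more than one partition spec, some fields of the struct may be null for files written under the older spec.

```python
def files_by_partition_sql(table: str) -> str:
    """Aggregate data files per partition tuple using the `files` metadata table."""
    return f"""
    SELECT
        `partition`,
        COUNT(*) AS file_count,
        SUM(record_count) AS total_records,
        ROUND(SUM(file_size_in_bytes) / 1024.0, 2) AS total_size_kb
    FROM {table}.files
    GROUP BY `partition`
    ORDER BY total_records DESC
    """


def show_files_by_partition(spark, table: str = "rest.play_iceberg.users") -> None:
    # `spark` is an existing SparkSession with the Iceberg catalog configured.
    spark.sql(files_by_partition_sql(table)).show(truncate=False)


# Usage in this notebook (requires the live session):
# show_files_by_partition(spark)
print(files_by_partition_sql("rest.play_iceberg.users"))
```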


### 3.3 File-Level Statistics
Detailed statistics for each data file are stored in the manifest files of the metadata layer.

In [11]:
# Examine file-level statistics
print("File-Level Statistics (for Query Optimization):")
print("=" * 55)

# Get detailed file statistics
file_stats_detailed = spark.sql("""
SELECT 
    SPLIT(file_path, '/')[SIZE(SPLIT(file_path, '/')) - 1] as file_name,
    record_count,
    file_size_in_bytes,
    column_sizes,
    value_counts,
    null_value_counts,
    lower_bounds,
    upper_bounds
FROM rest.`play_iceberg`.users.files
ORDER BY file_size_in_bytes DESC
LIMIT 1
""")

# Show basic file info
basic_info = file_stats_detailed.select("file_name", "record_count", "file_size_in_bytes")
basic_info.show(truncate=False)

# Analyze statistics from the largest file
largest_file_stats = file_stats_detailed.collect()[0]

print(f"\nDetailed Statistics for: {largest_file_stats['file_name']}")
print("=" * 60)

# Map column IDs to names once so both blocks below can use the mapping.
# Assumes field IDs are 1-based and follow the current column order; this appears
# to hold for this table but is not guaranteed after schema evolution.
column_names = spark.table("users").columns

# Column sizes (storage footprint per column)
if largest_file_stats['column_sizes']:
    print("Column Storage Sizes (bytes):")
    column_sizes = largest_file_stats['column_sizes']

    for col_id, size in sorted(column_sizes.items(), key=lambda x: x[1], reverse=True):
        col_name = column_names[col_id - 1] if col_id <= len(column_names) else f"column_{col_id}"
        print(f"  {col_name}: {size} bytes")

# Value counts
if largest_file_stats['value_counts']:
    print("\nValue Counts per Column:")
    value_counts = largest_file_stats['value_counts']
    null_counts = largest_file_stats['null_value_counts'] or {}
    
    for col_id, count in value_counts.items():
        col_name = column_names[col_id - 1] if col_id <= len(column_names) else f"column_{col_id}"
        nulls = null_counts.get(col_id, 0)
        print(f"  {col_name}: {count} values, {nulls} nulls")

print("\nStatistics Usage:")
print("- Enable file skipping by comparing predicates to lower/upper bounds")
print("- Expose row, value, and null counts without opening data files")
print("- Help engines estimate scan cost during query planning")

File-Level Statistics (for Query Optimization):
+--------------------------------------------------------------+------------+------------------+
|file_name                                                     |record_count|file_size_in_bytes|
+--------------------------------------------------------------+------------+------------------+
|00000-839-759a0fd4-25db-41bd-8e21-a70f9cb4dc84-0-00001.parquet|998         |13042             |
+--------------------------------------------------------------+------------+------------------+


Detailed Statistics for: 00000-839-759a0fd4-25db-41bd-8e21-a70f9cb4dc84-0-00001.parquet
Column Storage Sizes (bytes):
  updated_at: 3773 bytes
  email: 2698 bytes
  username: 2348 bytes
  user_id: 1691 bytes
  is_active: 163 bytes
  created_month: 74 bytes
  created_day: 74 bytes
  created_year: 73 bytes

Value Counts per Column:
  user_id: 998 values, 0 nulls
  username: 998 values, 0 nulls
  email: 998 values, 0 nulls
  is_active: 998 values, 0 nulls
  create

## 4. Layer Interactions and Query Flow

Let's trace how a query flows through all three layers.

### 4.1 Query Planning Process

In [12]:
# Demonstrate query flow through layers
print("Query Flow Through Iceberg Layers:")
print("=" * 45)

# Example query: Find active users
query = "SELECT COUNT(*) as active_users FROM users WHERE is_active = true"
print(f"Example Query: {query}")
print()

print("Step 1: CATALOG LAYER")
print("- Resolve 'users' table name to metadata location")
print("- Authenticate and authorize access")
print("- Return current metadata file pointer")
print()

print("Step 2: METADATA LAYER")
print("- Read current snapshot metadata")
print("- Get manifest list for current snapshot")
print("- Read relevant manifest files")
print("- Apply partition pruning using partition summaries")
print("- Filter data files based on predicates (is_active = true)")
print()

print("Step 3: DATA LAYER")
print("- Read identified Parquet files")
print("- Apply column pruning and predicate pushdown")
print("- Use file statistics for further optimization")
print("- Execute query on actual data")
print()

# Execute the example query
result = spark.sql(query)
print("Query Result:")
result.show()

# Show the extended query plan (parsed, analyzed, optimized, and physical)
print("\nQuery Execution Plan (simplified):")
print("=" * 50)
result.explain(True)

Query Flow Through Iceberg Layers:
Example Query: SELECT COUNT(*) as active_users FROM users WHERE is_active = true

Step 1: CATALOG LAYER
- Resolve 'users' table name to metadata location
- Authenticate and authorize access
- Return current metadata file pointer

Step 2: METADATA LAYER
- Read current snapshot metadata
- Get manifest list for current snapshot
- Read relevant manifest files
- Apply partition pruning using partition summaries
- Filter data files based on predicates (is_active = true)

Step 3: DATA LAYER
- Read identified Parquet files
- Apply column pruning and predicate pushdown
- Use file statistics for further optimization
- Execute query on actual data

Query Result:



+------------+
|active_users|
+------------+
|         510|
+------------+


Query Execution Plan (simplified):
== Parsed Logical Plan ==
'Project ['COUNT(1) AS active_users#790]
+- 'Filter ('is_active = true)
   +- 'UnresolvedRelation [users], [], false

== Analyzed Logical Plan ==
active_users: bigint
Aggregate [count(1) AS active_users#790L]
+- Filter (is_active#794 = true)
   +- SubqueryAlias rest.play_iceberg.users
      +- RelationV2[user_id#791L, username#792, email#793, is_active#794, created_year#795, created_month#796, created_day#797, updated_at#798, country#799, registration_source#800, engagement_score#801, last_login_at#802] rest.play_iceberg.users rest.play_iceberg.users

== Optimized Logical Plan ==
Aggregate [count(1) AS active_users#790L]
+- Project
   +- Filter is_active#794: boolean
      +- RelationV2[is_active#794] rest.play_iceberg.users

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(1)], output=[active_users#7

                                                                                

### 4.2 Performance Impact Analysis
Understand how each layer affects query performance.

In [13]:
# Analyze performance characteristics of each layer
print("Performance Impact by Layer:")
print("=" * 35)

# Catalog layer performance
catalog_perf = spark.sql("""
SELECT 
    'CATALOG_LAYER' as layer,
    'Table Resolution' as operation,
    'O(1)' as complexity,
    'Network latency to catalog service' as bottleneck
""")

# Metadata layer performance
metadata_perf = spark.sql("""
SELECT 
    'METADATA_LAYER' as layer,
    'File Planning' as operation,
    'O(manifests * partitions)' as complexity,
    'Number of manifest files' as bottleneck
""")

# Data layer performance
data_perf = spark.sql("""
SELECT 
    'DATA_LAYER' as layer,
    'Data Scanning' as operation,
    'O(data_files * file_size)' as complexity,
    'File count and size' as bottleneck
""")

# Combine performance analysis
performance_df = catalog_perf.union(metadata_perf).union(data_perf)
performance_df.show(truncate=False)

# Get actual metrics for our table
print("\nActual Metrics for 'users' Table:")
print("=" * 40)

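# Metadata layer metrics
# Note: total_data_files sums added_data_files_count only, so files carried over
# as "existing" in a manifest are not counted (compare with the data layer file count)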
metadata_metrics = spark.sql("""
SELECT 
    COUNT(*) as manifest_files,
    SUM(added_data_files_count) as total_data_files,
    ROUND(SUM(length) / 1024.0, 2) as total_manifest_size_kb
FROM rest.`play_iceberg`.users.manifests
""")

print("Metadata Layer Metrics:")
metadata_metrics.show()

# Data layer metrics
data_metrics = spark.sql("""
SELECT 
    COUNT(*) as data_files,
    SUM(record_count) as total_records,
    ROUND(SUM(file_size_in_bytes) / 1024.0 / 1024.0, 2) as total_data_size_mb,
    ROUND(AVG(file_size_in_bytes) / 1024.0, 2) as avg_file_size_kb
FROM rest.`play_iceberg`.users.files
""")

print("Data Layer Metrics:")
data_metrics.show()

Performance Impact by Layer:
+--------------+----------------+-------------------------+----------------------------------+
|layer         |operation       |complexity               |bottleneck                        |
+--------------+----------------+-------------------------+----------------------------------+
|CATALOG_LAYER |Table Resolution|O(1)                     |Network latency to catalog service|
|METADATA_LAYER|File Planning   |O(manifests * partitions)|Number of manifest files          |
|DATA_LAYER    |Data Scanning   |O(data_files * file_size)|File count and size               |
+--------------+----------------+-------------------------+----------------------------------+


Actual Metrics for 'users' Table:
Metadata Layer Metrics:
+--------------+----------------+----------------------+
|manifest_files|total_data_files|total_manifest_size_kb|
+--------------+----------------+----------------------+
|             2|               2|                 17.62|
+--------------+--

## 5. Layer-Specific Optimizations

Each layer has specific optimization strategies.

### 5.1 Catalog Layer Optimizations

In [14]:
# Catalog layer optimization strategies
print("Catalog Layer Optimization Strategies:")
print("=" * 50)

catalog_optimizations = [
    "1. Connection Pooling: Reuse catalog connections",
    "2. Caching: Cache table metadata locations",
    "3. Batch Operations: Group multiple table operations",
    "4. Geographic Distribution: Use regional catalog endpoints",
    "5. Authentication Optimization: Use service accounts with long-lived tokens"
]

for optimization in catalog_optimizations:
    print(optimization)

# Check current catalog settings
print("\nCurrent Catalog Configuration Review:")
print("=" * 45)

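# Simulate a catalog health check
# Note: start_time and end_time are captured but never compared, so the
# "FAST" status printed below is illustrative rather than measured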
try:
    start_time = spark.sql("SELECT current_timestamp() as start_time").collect()[0]['start_time']
    table_check = spark.sql("SELECT COUNT(*) as count FROM users LIMIT 1").collect()[0]['count']
    end_time = spark.sql("SELECT current_timestamp() as end_time").collect()[0]['end_time']
    
    print("Catalog Response: SUCCESS")
    print("Table Resolution: FAST")
    print("Recommendation: Current catalog configuration is performing well")
except Exception as e:
    print(f"Catalog Issue: {e}")
    print("Recommendation: Check catalog connectivity and authentication")

Catalog Layer Optimization Strategies:
1. Connection Pooling: Reuse catalog connections
2. Caching: Cache table metadata locations
3. Batch Operations: Group multiple table operations
4. Geographic Distribution: Use regional catalog endpoints
5. Authentication Optimization: Use service accounts with long-lived tokens

Current Catalog Configuration Review:
Catalog Response: SUCCESS
Table Resolution: FAST
Recommendation: Current catalog configuration is performing well
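
Strategy 2 above (caching table metadata locations) can be sketched in plain Python. This is a minimal illustration, not Iceberg's own caching implementation: `MetadataLocationCache`, `fake_catalog_lookup`, and the `s3://example-bucket/...` path are hypothetical names invented for the example, and the 30-second TTL is an arbitrary choice.

```python
import time
from typing import Callable, Dict, Tuple


class MetadataLocationCache:
    """Tiny TTL cache mapping table names to metadata file locations."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._lookup = lookup      # hypothetical catalog call, e.g. a REST request
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, table: str) -> str:
        now = self._clock()
        cached = self._entries.get(table)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]                 # fresh entry: no catalog round trip
        location = self._lookup(table)       # missing or expired: ask the catalog
        self._entries[table] = (now, location)
        return location


calls = []


def fake_catalog_lookup(table: str) -> str:
    """Stand-in for a real catalog request; records every call it receives."""
    calls.append(table)
    return f"s3://example-bucket/{table}/metadata/v{len(calls)}.metadata.json"


cache = MetadataLocationCache(fake_catalog_lookup, ttl_seconds=30.0)
first = cache.get("play_iceberg.users")
second = cache.get("play_iceberg.users")
print(f"Catalog lookups performed: {len(calls)}")
```

The trade-off is staleness: until an entry expires, a reader may keep using an older metadata location and miss the most recent commit, so the TTL should be short relative to how fresh reads need to be.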


### 5.2 Metadata Layer Optimizations

In [15]:
# Metadata layer optimization analysis
print("Metadata Layer Optimization Analysis:")
print("=" * 50)

# Analyze manifest file health
manifest_health = spark.sql("""
SELECT 
    COUNT(*) as manifest_count,
    AVG(added_data_files_count) as avg_files_per_manifest,
    MAX(added_data_files_count) as max_files_per_manifest,
    MIN(added_data_files_count) as min_files_per_manifest,
    ROUND(AVG(length / 1024.0), 2) as avg_manifest_size_kb
FROM rest.`play_iceberg`.users.manifests
WHERE added_data_files_count > 0
""")

print("Manifest Health Check:")
manifest_health.show()

# Get manifest statistics for recommendations
manifest_stats = manifest_health.collect()[0]
manifest_count = manifest_stats['manifest_count']
avg_files_per_manifest = manifest_stats['avg_files_per_manifest']

print("\nMetadata Layer Recommendations:")
print("=" * 40)

if manifest_count > 20:
    print("WARNING: High manifest count detected")
    print("Recommendation: Consider manifest compaction")
elif manifest_count > 10:
    print("CAUTION: Moderate manifest count")
    print("Recommendation: Monitor manifest growth")
else:
    print("GOOD: Manifest count is healthy")

if avg_files_per_manifest and avg_files_per_manifest < 10:
    print("WARNING: Low files per manifest ratio")
    print("Recommendation: Increase target file size or batch writes")
else:
    print("GOOD: Files per manifest ratio is healthy")

print("\nMetadata Optimization Strategies:")
print("1. Manifest Compaction: Merge small manifests")
print("2. Partition Strategy: Align with query patterns")
print("3. File Sizing: Target 128MB-1GB per file")
print("4. Snapshot Cleanup: Remove old snapshots regularly")
print("5. Statistics Refresh: Update column statistics")

Metadata Layer Optimization Analysis:
Manifest Health Check:
+--------------+----------------------+----------------------+----------------------+--------------------+
|manifest_count|avg_files_per_manifest|max_files_per_manifest|min_files_per_manifest|avg_manifest_size_kb|
+--------------+----------------------+----------------------+----------------------+--------------------+
|             1|                   2.0|                     2|                     2|                8.76|
+--------------+----------------------+----------------------+----------------------+--------------------+


Metadata Layer Recommendations:
GOOD: Manifest count is healthy
WARNING: Low files per manifest ratio
Recommendation: Increase target file size or batch writes

Metadata Optimization Strategies:
1. Manifest Compaction: Merge small manifests
2. Partition Strategy: Align with query patterns
3. File Sizing: Target 128MB-1GB per file
4. Snapshot Cleanup: Remove old snapshots regularly
5. Statistics Refresh: Update column statistics
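
Strategies 1 and 4 above map onto Iceberg's Spark stored procedures `rewrite_manifests` and `expire_snapshots`. The sketch below only builds the SQL text so that it can be reviewed before running. It assumes the session has Iceberg's Spark SQL extensions enabled (required for `CALL`), that the catalog is named `rest` as elsewhere in this notebook, and the cutoff date and `retain_last` value are placeholders rather than recommendations.

```python
from datetime import datetime


def rewrite_manifests_sql(catalog: str, table: str) -> str:
    """Build a CALL statement that merges small manifests for one table."""
    return f"CALL {catalog}.system.rewrite_manifests('{table}')"


def expire_snapshots_sql(catalog: str, table: str, older_than: datetime,
                         retain_last: int = 1) -> str:
    """Build a CALL statement that expires snapshots older than a cutoff."""
    cutoff = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})"
    )


compact_stmt = rewrite_manifests_sql("rest", "play_iceberg.users")
expire_stmt = expire_snapshots_sql(
    "rest", "play_iceberg.users",
    older_than=datetime(2024, 1, 1),   # placeholder cutoff
    retain_last=5,                     # placeholder retention count
)
print(compact_stmt)
print(expire_stmt)

# In the notebook, each statement could then be executed with:
# spark.sql(compact_stmt).show()
# spark.sql(expire_stmt).show()
```

Expiring snapshots removes the ability to time travel to them, so the cutoff should be chosen with any audit or rollback requirements in mind.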


### 5.3 Data Layer Optimizations

In [16]:
# Data layer optimization analysis
print("Data Layer Optimization Analysis:")
print("=" * 45)

# Analyze file size distribution
file_size_analysis = spark.sql("""
SELECT 
    COUNT(*) as total_files,
    MIN(file_size_in_bytes) as min_file_size,
    MAX(file_size_in_bytes) as max_file_size,
    ROUND(AVG(file_size_in_bytes), 0) as avg_file_size,
    ROUND(STDDEV(file_size_in_bytes), 0) as stddev_file_size,
    SUM(CASE WHEN file_size_in_bytes < 1024*1024 THEN 1 ELSE 0 END) as small_files_1mb,
    SUM(CASE WHEN file_size_in_bytes > 128*1024*1024 THEN 1 ELSE 0 END) as large_files_128mb
FROM rest.`play_iceberg`.users.files
""")

print("File Size Distribution:")
file_size_analysis.show(truncate=False)

# Get file size statistics for recommendations
file_stats = file_size_analysis.collect()[0]
total_files = file_stats['total_files']
small_files = file_stats['small_files_1mb']
large_files = file_stats['large_files_128mb']
avg_file_size = file_stats['avg_file_size']

print("\nData Layer Health Assessment:")
print("=" * 40)

# Small files check
if small_files > total_files * 0.3:
    print(f"WARNING: {small_files} small files (<1MB) detected")
    print("Recommendation: Run compaction to merge small files")
elif small_files > 0:
    print(f"CAUTION: {small_files} small files detected")
    print("Recommendation: Monitor file growth patterns")
else:
    print("GOOD: No small files detected")

# Large files check
if large_files > 0:
    print(f"WARNING: {large_files} large files (>128MB) detected")
    print("Recommendation: Consider better partitioning or file splitting")
else:
    print("GOOD: No excessively large files")

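# Average file size assessment
# 64MB is an illustrative target for this small demo table; Iceberg's default
# write.target-file-size-bytes is 512MB, so align this with your table settings
optimal_size = 64 * 1024 * 1024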
if avg_file_size < optimal_size * 0.1:
    print(f"WARNING: Average file size ({avg_file_size/1024/1024:.1f}MB) is very small")
    print("Recommendation: Increase write batch size or run compaction")
elif avg_file_size > optimal_size * 4:
    print(f"WARNING: Average file size ({avg_file_size/1024/1024:.1f}MB) is very large")
    print("Recommendation: Improve partitioning strategy")
else:
    print(f"GOOD: Average file size ({avg_file_size/1024/1024:.1f}MB) is reasonable")

print("\nData Layer Optimization Strategies:")
print("1. File Compaction: Merge small files regularly")
print("2. Partition Tuning: Align partitions with query patterns")
print("3. Compression: Use appropriate compression codecs")
print("4. Column Layout: Optimize for read patterns")
print("5. File Format: Consider Parquet optimizations (bloom filters, etc.)")

Data Layer Optimization Analysis:
File Size Distribution:
+-----------+-------------+-------------+-------------+----------------+---------------+-----------------+
|total_files|min_file_size|max_file_size|avg_file_size|stddev_file_size|small_files_1mb|large_files_128mb|
+-----------+-------------+-------------+-------------+----------------+---------------+-----------------+
|6          |2442         |13042        |4921.0       |4028.0          |6              |0                |
+-----------+-------------+-------------+-------------+----------------+---------------+-----------------+


Data Layer Health Assessment:
WARNING: 6 small files (<1MB) detected
Recommendation: Run compaction to merge small files
GOOD: No excessively large files
WARNING: Average file size (0.0MB) is very small
Recommendation: Increase write batch size or run compaction

Data Layer Optimization Strategies:
1. File Compaction: Merge small files regularly
2. Partition Tuning: Align partitions with query patterns
3. Compression: Use appropriate compression codecs
4. Column Layout: Optimize for read patterns
5. File Format: Consider Parquet optimizations (bloom filters, etc.)
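
To make the compaction recommendation concrete, the sketch below estimates how many files a compaction would leave and builds the matching `rewrite_data_files` call. It is a rough illustration under stated assumptions: the total size is approximated from the reported average (6 files × 4,921 bytes), the 64MB target mirrors `optimal_size` from the cell above, the helper names are made up for this example, and `CALL` requires Iceberg's Spark SQL extensions.

```python
import math


def estimate_compacted_file_count(total_bytes: int, target_file_bytes: int) -> int:
    """Lower-bound estimate of files left after compaction (ignores partition boundaries)."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    if total_bytes <= 0:
        return 0
    return math.ceil(total_bytes / target_file_bytes)


def rewrite_data_files_sql(catalog: str, table: str, target_file_bytes: int) -> str:
    """Build a CALL statement that compacts data files toward a target size."""
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{target_file_bytes}'))"
    )


target = 64 * 1024 * 1024          # same illustrative target as optimal_size
approx_total_bytes = 6 * 4921      # total_files x avg_file_size from the output above
files_after = estimate_compacted_file_count(approx_total_bytes, target)
compaction_stmt = rewrite_data_files_sql("rest", "play_iceberg.users", target)

print(f"Estimated files after compaction: {files_after}")
print(compaction_stmt)

# In the notebook, the statement could then be executed with:
# spark.sql(compaction_stmt).show()
```

The estimate treats the table as a single group of files. Compaction does not merge files across partitions, so a partitioned table will typically end up with at least one file per partition.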

## 6. Advanced Architecture Concepts

### 6.1 Concurrent Operations and ACID Properties

In [17]:
# Demonstrate ACID properties through the three-layer architecture
print("ACID Properties in Iceberg's Three-Layer Architecture:")
print("=" * 60)

print("\nATOMICITY:")
print("- Catalog Layer: Atomic pointer updates to metadata files")
print("- Metadata Layer: All-or-nothing snapshot creation")
print("- Data Layer: Immutable files ensure consistency")

print("\nCONSISTENCY:")
print("- Catalog Layer: Commits succeed only if the metadata pointer has not changed")
print("- Metadata Layer: Each snapshot references a complete, self-consistent set of manifests")
print("- Data Layer: Schema evolution without data rewrite")

print("\nISOLATION:")
print("- Catalog Layer: Concurrent readers see consistent snapshots")
print("- Metadata Layer: MVCC through snapshot isolation")
print("- Data Layer: Immutable files prevent read/write conflicts")

print("\nDURABILITY:")
print("- Catalog Layer: Persisted metadata locations")
print("- Metadata Layer: Immutable manifest and metadata files")
print("- Data Layer: Durable storage with replication")

# Show current transaction isolation
print("\n\nCurrent Table State (Transaction View):")
print("=" * 50)

# Multiple concurrent queries see the same snapshot
snapshot_id = spark.sql("""
SELECT snapshot_id 
FROM rest.`play_iceberg`.users.history 
WHERE is_current_ancestor = true 
ORDER BY made_current_at DESC 
LIMIT 1
""").collect()[0]['snapshot_id']

print(f"Current Snapshot ID: {snapshot_id}")
print("All concurrent readers will see this exact snapshot until a new transaction commits.")

ACID Properties in Iceberg's Three-Layer Architecture:

ATOMICITY:
- Catalog Layer: Atomic pointer updates to metadata files
- Metadata Layer: All-or-nothing snapshot creation
- Data Layer: Immutable files ensure consistency

CONSISTENCY:
- Catalog Layer: Commits succeed only if the metadata pointer has not changed
- Metadata Layer: Each snapshot references a complete, self-consistent set of manifests
- Data Layer: Schema evolution without data rewrite

ISOLATION:
- Catalog Layer: Concurrent readers see consistent snapshots
- Metadata Layer: MVCC through snapshot isolation
- Data Layer: Immutable files prevent read/write conflicts

DURABILITY:
- Catalog Layer: Persisted metadata locations
- Metadata Layer: Immutable manifest and metadata files
- Data Layer: Durable storage with replication


Current Table State (Transaction View):
Current Snapshot ID: 1926503896228031236
All concurrent readers will see this exact snapshot until a new transaction commits.
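
The "atomic pointer update" behind these guarantees can be pictured as a compare-and-swap on the table's metadata location. The toy model below is only an illustration of that idea, not Iceberg's catalog API: `ToyCatalog` and the metadata file names are hypothetical, and a real catalog delegates the atomic swap to its backing service.

```python
import threading
from typing import Dict, Optional


class ToyCatalog:
    """In-memory stand-in for a catalog that stores one metadata pointer per table."""

    def __init__(self) -> None:
        self._pointers: Dict[str, str] = {}
        self._lock = threading.Lock()

    def current(self, table: str) -> Optional[str]:
        with self._lock:
            return self._pointers.get(table)

    def commit(self, table: str, expected: Optional[str], new: str) -> bool:
        """Swap the pointer only if it still equals the value the writer started from."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False          # another writer committed first: caller must retry
            self._pointers[table] = new
            return True


catalog = ToyCatalog()
catalog.commit("users", None, "v1.metadata.json")

# Two writers both start from the same base metadata file
base = catalog.current("users")
writer_a_ok = catalog.commit("users", base, "v2-a.metadata.json")
writer_b_ok = catalog.commit("users", base, "v2-b.metadata.json")

print(f"Writer A committed: {writer_a_ok}")
print(f"Writer B committed: {writer_b_ok}")
print(f"Current pointer: {catalog.current('users')}")
```

Writer B loses the race because its expected base is stale. In this model it would re-read the current pointer, rebuild its new metadata on top of it, and try again, which is the optimistic-concurrency pattern that lets readers always see either the old snapshot or the new one, never a partial state.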


### 6.2 Scaling Characteristics

In [18]:
# Analyze scaling characteristics of each layer
print("Scaling Characteristics by Layer:")
print("=" * 40)

# Create scaling analysis
scaling_analysis = [
    {
        'layer': 'CATALOG',
        'bottleneck': 'Network latency and catalog service capacity',
        'scaling_strategy': 'Horizontal catalog service scaling, caching, regional distribution',
        'complexity': 'O(1) per table operation'
    },
    {
        'layer': 'METADATA', 
        'bottleneck': 'Manifest file count and size',
        'scaling_strategy': 'Manifest compaction, partition pruning, manifest caching',
        'complexity': 'O(manifest_files × partitions_scanned)'
    },
    {
        'layer': 'DATA',
        'bottleneck': 'Data volume and file count',
        'scaling_strategy': 'Horizontal processing, file compaction, columnar storage',
        'complexity': 'O(data_files_scanned × file_size)'
    }
]

for analysis in scaling_analysis:
    print(f"\n{analysis['layer']} LAYER:")
    print(f"  Bottleneck: {analysis['bottleneck']}")
    print(f"  Scaling Strategy: {analysis['scaling_strategy']}")
    print(f"  Complexity: {analysis['complexity']}")

# Project scaling needs based on current metrics
print("\n\nScaling Projection for Current Table:")
print("=" * 45)

current_files = spark.sql("SELECT COUNT(*) as count FROM rest.`play_iceberg`.users.files").collect()[0]['count']
current_manifests = spark.sql("SELECT COUNT(*) as count FROM rest.`play_iceberg`.users.manifests").collect()[0]['count']
current_records = spark.sql("SELECT COUNT(*) as count FROM users").collect()[0]['count']

print("Current Scale:")
print(f"  Records: {current_records:,}")
print(f"  Data Files: {current_files}")
print(f"  Manifest Files: {current_manifests}")

# Project 100x growth
projected_records = current_records * 100
projected_files = current_files * 100
projected_manifests = current_manifests * 10  # Assumption: manifests grow ~10x, slower than data files, if manifests are compacted

print("\nProjected Scale (100x growth):")
print(f"  Records: {projected_records:,}")
print(f"  Data Files: {projected_files}")
print(f"  Manifest Files: {projected_manifests}")

print("\nScaling Recommendations:")
if projected_files > 10000:
    print("- Implement automated compaction")
if projected_manifests > 100:
    print("- Plan manifest compaction strategy")
if projected_records > 1000000:
    print("- Consider partition strategy optimization")
    print("- Implement tiered storage for old data")

Scaling Characteristics by Layer:

CATALOG LAYER:
  Bottleneck: Network latency and catalog service capacity
  Scaling Strategy: Horizontal catalog service scaling, caching, regional distribution
  Complexity: O(1) per table operation

METADATA LAYER:
  Bottleneck: Manifest file count and size
  Scaling Strategy: Manifest compaction, partition pruning, manifest caching
  Complexity: O(manifest_files × partitions_scanned)

DATA LAYER:
  Bottleneck: Data volume and file count
  Scaling Strategy: Horizontal processing, file compaction, columnar storage
  Complexity: O(data_files_scanned × file_size)


Scaling Projection for Current Table:
Current Scale:
  Records: 1,012
  Data Files: 6
  Manifest Files: 2

Projected Scale (100x growth):
  Records: 101,200
  Data Files: 600
  Manifest Files: 20

Scaling Recommendations:
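
The recommendations list above is empty because none of the projected values crosses a threshold (600 files, 20 manifests, 101,200 records). A small variation, sketched here with the same thresholds, collects the recommendations into a list so that the "nothing to do" case is reported explicitly. The function name and the fallback message are illustrative choices.

```python
from typing import List


def scaling_recommendations(files: int, manifests: int, records: int) -> List[str]:
    """Return the recommendations triggered by the projected scale."""
    recommendations = []
    if files > 10_000:
        recommendations.append("Implement automated compaction")
    if manifests > 100:
        recommendations.append("Plan manifest compaction strategy")
    if records > 1_000_000:
        recommendations.append("Consider partition strategy optimization")
        recommendations.append("Implement tiered storage for old data")
    return recommendations


# Projected values taken from the output above
projected = scaling_recommendations(files=600, manifests=20, records=101_200)

print("Scaling Recommendations:")
for line in projected or ["No action needed at the projected scale"]:
    print(f"- {line}")
```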


## 7. Troubleshooting by Layer

Understanding which layer is causing issues helps with faster problem resolution.

### 7.1 Layer-Specific Diagnostics

In [19]:
# Layer-specific diagnostic checks
print("Layer-Specific Diagnostic Checks:")
print("=" * 40)

# Catalog layer diagnostics
print("\n1. CATALOG LAYER DIAGNOSTICS:")
print("=" * 35)

try:
    # Test catalog connectivity
    catalog_test = spark.sql("SHOW TABLES IN rest.`play_iceberg`")
    table_count = catalog_test.count()
    print(f"✓ Catalog connectivity: OK ({table_count} tables found)")
    
    # Test table resolution
    table_desc = spark.sql("DESCRIBE TABLE users")
    column_count = table_desc.count()
    print(f"✓ Table resolution: OK ({column_count} columns)")
    
except Exception as e:
    print(f"✗ Catalog issue: {e}")
    print("  Check: Network connectivity, authentication, catalog service status")

# Metadata layer diagnostics
print("\n2. METADATA LAYER DIAGNOSTICS:")
print("=" * 35)

try:
    # Test metadata access
    history_check = spark.sql("SELECT COUNT(*) as count FROM rest.`play_iceberg`.users.history")
    snapshot_count = history_check.collect()[0]['count']
    print(f"✓ Metadata access: OK ({snapshot_count} snapshots)")
    
    # Test manifest access
    manifest_check = spark.sql("SELECT COUNT(*) as count FROM rest.`play_iceberg`.users.manifests")
    manifest_count = manifest_check.collect()[0]['count']
    print(f"✓ Manifest access: OK ({manifest_count} manifests)")
    
    # Confirm that data files are tracked in the files metadata table
    files_check = spark.sql("SELECT COUNT(*) as count FROM rest.`play_iceberg`.users.files")
    file_count = files_check.collect()[0]['count']
    print(f"✓ File tracking: OK ({file_count} files tracked)")
    
except Exception as e:
    print(f"✗ Metadata issue: {e}")
    print("  Check: Storage accessibility, metadata file corruption, permissions")

# Data layer diagnostics
print("\n3. DATA LAYER DIAGNOSTICS:")
print("=" * 30)

try:
    # Test data access
    data_check = spark.sql("SELECT COUNT(*) as count FROM users")
    record_count = data_check.collect()[0]['count']
    print(f"✓ Data access: OK ({record_count} records)")
    
    # Test data quality
    sample_check = spark.sql("SELECT * FROM users LIMIT 1")
    sample_data = sample_check.collect()
    if sample_data:
        print("✓ Data integrity: OK (sample record retrieved)")
    else:
        print("⚠ Data integrity: No records found")
    
except Exception as e:
    print(f"✗ Data layer issue: {e}")
    print("  Check: Storage connectivity, file permissions, data file corruption")

print("\nCommon Issue Patterns:")
print("- Catalog errors: Authentication, network, service availability")
print("- Metadata errors: File corruption, permission issues, storage problems")
print("- Data errors: File not found, storage issues, format problems")

Layer-Specific Diagnostic Checks:

1. CATALOG LAYER DIAGNOSTICS:
✓ Catalog connectivity: OK (1 tables found)
✓ Table resolution: OK (19 columns)

2. METADATA LAYER DIAGNOSTICS:
✓ Metadata access: OK (11 snapshots)
✓ Manifest access: OK (2 manifests)
✓ File tracking: OK (6 files tracked)

3. DATA LAYER DIAGNOSTICS:
✓ Data access: OK (1012 records)
✓ Data integrity: OK (sample record retrieved)

Common Issue Patterns:
- Catalog errors: Authentication, network, service availability
- Metadata errors: File corruption, permission issues, storage problems
- Data errors: File not found, storage issues, format problems
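
The same layer-by-layer approach can be packaged as a small runner that executes checks in order and reports the first layer that fails. This is a sketch with stand-in callables so it runs without a cluster: `run_checks`, `first_failing_layer`, and the simulated failure are hypothetical, and in the notebook each callable would wrap one of the `spark.sql(...)` checks above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Check = Tuple[str, str, Callable[[], object]]


def run_checks(checks: List[Check]) -> List[Dict[str, object]]:
    """Run every check, capturing failures instead of aborting on the first one."""
    results = []
    for layer, name, fn in checks:
        try:
            detail = str(fn())
            ok = True
        except Exception as exc:
            detail = f"{type(exc).__name__}: {exc}"
            ok = False
        results.append({"layer": layer, "check": name, "ok": ok, "detail": detail})
    return results


def first_failing_layer(results: List[Dict[str, object]]) -> Optional[str]:
    """Return the first layer (in check order) that reported a failure, if any."""
    for result in results:
        if not result["ok"]:
            return str(result["layer"])
    return None


def broken_manifest_read() -> int:
    # Simulated failure standing in for an unreadable manifest file
    raise FileNotFoundError("manifest missing")


# Stand-in checks, ordered catalog -> metadata -> data
checks: List[Check] = [
    ("CATALOG", "connectivity", lambda: 1),
    ("METADATA", "manifest access", broken_manifest_read),
    ("DATA", "data access", lambda: 1012),
]

results = run_checks(checks)
for r in results:
    status = "OK" if r["ok"] else "FAILED"
    print(f"{r['layer']:<9} {r['check']:<16} {status} ({r['detail']})")
print(f"Start troubleshooting at: {first_failing_layer(results)}")
```

Ordering the checks from catalog to data matters because a failure in an upper layer usually explains failures below it, so the first failing layer is the most useful place to start.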


## Summary

This comprehensive exploration of Apache Iceberg's three-layer architecture revealed:

### Key Architectural Insights:

#### 1. **Catalog Layer**
- **Role**: Table discovery and metadata location management
- **Benefits**: Centralized table registry, authentication, namespace organization
- **Optimization**: Connection pooling, caching, geographic distribution
- **Scaling**: Horizontal service scaling, regional replication

#### 2. **Metadata Layer**
- **Role**: Schema management, transaction coordination, file tracking
- **Benefits**: ACID transactions, time travel, schema evolution, query optimization
- **Optimization**: Manifest compaction, partition pruning, statistics maintenance
- **Scaling**: Manifest management, partition strategy, snapshot cleanup

#### 3. **Data Layer**
- **Role**: Actual data storage in optimized file formats
- **Benefits**: Columnar storage, compression, partition pruning, predicate pushdown
- **Optimization**: File sizing, compaction, partition alignment, compression tuning
- **Scaling**: Horizontal processing, file organization, storage tiering

### Architecture Benefits:

1. **Separation of Concerns**: Each layer has distinct responsibilities
2. **Independent Scaling**: Layers can be optimized independently
3. **Flexibility**: Multiple implementations for each layer
4. **Performance**: Optimized for different access patterns
5. **Reliability**: Fault isolation and recovery at each layer

### Best Practices:

1. **Monitor All Layers**: Each layer has different performance characteristics
2. **Optimize by Usage**: Align optimizations with query patterns
3. **Plan for Scale**: Different layers scale differently
4. **Troubleshoot Systematically**: Isolate issues by layer
5. **Maintain Regularly**: Each layer needs different maintenance strategies

Understanding these three layers is essential for:
- Effective performance tuning
- Proper troubleshooting
- Successful production deployments
- Optimal cost management

This layered architecture is what makes Apache Iceberg a powerful, scalable, and reliable table format for modern data lakes.

In [20]:
# Clean up
print("Architecture exploration completed!")
print("\nKey takeaway: Understanding the three layers is crucial for:")
print("- Performance optimization")
print("- Effective troubleshooting")
print("- Successful production operations")
print("\nSpark session cleanup...")
spark.stop()
print("Session closed.")

Architecture exploration completed!

Key takeaway: Understanding the three layers is crucial for:
- Performance optimization
- Effective troubleshooting
- Successful production operations

Spark session cleanup...
Session closed.
