# Apache Iceberg Metadata Inspection

This notebook shows how to inspect Apache Iceberg table metadata through the `history`, `manifests`, and `files` metadata tables. Understanding Iceberg's metadata structure helps with:

- **Performance Optimization**: Query planning and partition pruning
- **Storage Management**: Understanding file organization and sizes
- **Data Evolution**: Tracking schema changes and table history
- **Debugging**: Investigating data issues and performance bottlenecks

## What We'll Explore

1. **Table History**: Complete timeline of all table snapshots and operations
2. **Manifest Files**: Index files that track data file locations and statistics
3. **Data Files**: Physical Parquet files containing the actual data
4. **File-Level Analysis**: Deep dive into storage patterns and efficiency
5. **Column Statistics**: Data distribution and quality insights

## Prerequisites

- Apache Iceberg table with data (run notebooks 1-7 first)
- Spark session configured with Iceberg extensions
- REST catalog connection to MinIO storage
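
For readers starting without a preconfigured environment, the prerequisites above correspond roughly to Spark settings like the ones below. This is a sketch under assumptions, not the configuration this notebook's environment actually uses: the catalog name `rest` and the extensions class match the values printed in Environment Setup, while the REST URI and MinIO endpoint are hypothetical placeholders, and `apply_conf` is a helper introduced here purely for illustration.

```python
# Sketch of Spark settings for an Iceberg REST catalog backed by S3-compatible storage.
# The URI and endpoint values are hypothetical placeholders; adjust them to your setup.
ICEBERG_REST_CONF = {
    "spark.sql.extensions": (
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
    ),
    "spark.sql.defaultCatalog": "rest",
    "spark.sql.catalog.rest": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.rest.type": "rest",
    "spark.sql.catalog.rest.uri": "http://rest:8181",           # hypothetical
    "spark.sql.catalog.rest.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.rest.s3.endpoint": "http://minio:9000",  # hypothetical
    "spark.sql.catalog.rest.s3.path-style-access": "true",
}


def apply_conf(builder, conf):
    """Apply each key/value pair to a builder exposing .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed and the Iceberg runtime jars on the classpath, this could be used as `apply_conf(SparkSession.builder, ICEBERG_REST_CONF).getOrCreate()`.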

## Environment Setup

Initialize our Spark session with Iceberg configuration and verify the connection.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, min as spark_min, max as spark_max, sum as spark_sum
)

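# Create (or reuse) the Spark session; the Iceberg extensions and REST catalog
# settings are expected to come from the environment's Spark configuration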
spark = SparkSession.builder \
    .appName("Iceberg Metadata Deep Dive") \
    .getOrCreate()

print("Spark Session Initialized")
print(f"   Spark Version: {spark.version}")
print(f"   Default Catalog: {spark.conf.get('spark.sql.defaultCatalog')}")
print(f"   Iceberg Extensions: {spark.conf.get('spark.sql.extensions')}")

# Verify table exists
try:
    table_exists = (
        spark.sql("SHOW TABLES IN rest.`play_iceberg`")
        .filter("tableName = 'users'")
        .count() > 0
    )
    if table_exists:
        print("Users table found and accessible")
    else:
        print("Users table not found - run notebooks 1-7 first")
except Exception as e:
    print(f"Error accessing catalog: {e}")

25/07/01 13:51:45 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Spark Session Initialized
   Spark Version: 3.5.5
   Default Catalog: rest
   Iceberg Extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions


Users table found and accessible


                                                                                

## 1. Table History Analysis

The table history is Iceberg's version control system. Each snapshot represents a consistent state of the table after an operation (INSERT, UPDATE, DELETE, etc.).

### Key Concepts:
- **Snapshot ID**: Unique identifier for each table version
- **Parent ID**: Points to the previous snapshot, forming a lineage chain
- **made_current_at**: Timestamp when this snapshot became the current version
- **is_current_ancestor**: Whether this snapshot is an ancestor of the current snapshot (false for snapshots left off the current lineage, for example after a rollback)

This information is crucial for:
- Time travel queries
- Understanding table evolution
- Debugging data changes
- Planning retention policies

In [2]:
# Analyze table history with detailed metrics
history_df = spark.sql("""
SELECT 
    made_current_at,
    snapshot_id,
    parent_id,
    is_current_ancestor,
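    -- Coarse label only: the history table does not record the operation type.
    -- The actual operation (append, overwrite, delete) is in the snapshots table.
    CASE 
        WHEN parent_id IS NULL THEN 'INITIAL'
        ELSE 'UPDATE'
    END as operation_type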
FROM rest.`play_iceberg`.users.history
ORDER BY made_current_at
""")

print("Table History Timeline:")
print("=" * 50)
history_df.show(truncate=False)

# Calculate history statistics
total_snapshots = history_df.count()
current_ancestors = history_df.filter("is_current_ancestor = true").count()
orphaned_snapshots = total_snapshots - current_ancestors

print("\nHistory Statistics:")
print(f"   Total Snapshots: {total_snapshots}")
print(f"   Current Lineage: {current_ancestors}")
print(f"   Orphaned Snapshots: {orphaned_snapshots}")

if total_snapshots > 1:
    first_snapshot = history_df.orderBy("made_current_at").first()
    latest_snapshot = history_df.orderBy(col("made_current_at").desc()).first()
    
    print("\nTimeline:")
    print(f"   First Snapshot: {first_snapshot['made_current_at']}")
    print(f"   Latest Snapshot: {latest_snapshot['made_current_at']}")
    print(f"   Operations Since First Snapshot: {total_snapshots - 1}")

Table History Timeline:
+-----------------------+-------------------+-------------------+-------------------+--------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|operation_type|
+-----------------------+-------------------+-------------------+-------------------+--------------+
|2025-07-01 13:41:23.087|3824926247776142618|NULL               |true               |INITIAL       |
|2025-07-01 13:46:22.566|3682366184055111348|3824926247776142618|true               |UPDATE        |
|2025-07-01 13:46:29.92 |5474445044599496672|3682366184055111348|true               |UPDATE        |
|2025-07-01 13:46:43.276|703749108260326071 |5474445044599496672|true               |UPDATE        |
|2025-07-01 13:46:51.537|1680807781020980082|703749108260326071 |true               |UPDATE        |
|2025-07-01 13:47:01.334|8866811446601788626|1680807781020980082|true               |UPDATE        |
|2025-07-01 13:48:29.944|5929459840228022626|8866811446601788626|tr

## 2. Manifest File Analysis

Manifest files are Iceberg's indexing system. They contain metadata about data files and enable efficient query planning.

### Understanding Manifests:
- **Content Type**: 0 = data files, 1 = delete files
- **Partition Spec**: Defines how data is partitioned
- **File Counts**: Track additions, deletions, and existing files
- **Partition Summaries**: Lower and upper bounds plus a null flag for each partition field, used for partition pruning

### Why This Matters:
- Query engines use manifests to skip irrelevant files
- Partition summaries enable efficient filtering
- File counts help understand table operations
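
The pruning idea can be sketched in a few lines of plain Python: a manifest can be skipped when a filter value falls outside the bounds of any partition field. The bounds below mirror the first two fields of the partition summary example printed by this notebook (2025 to 2025 and 6 to 7); treating them as year and month, using integers rather than the string bounds Spark returns, and the names `FieldSummary` and `manifest_may_match` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FieldSummary:
    """Bounds for one partition field within a manifest."""
    lower_bound: int
    upper_bound: int
    contains_null: bool


def manifest_may_match(
    summaries: Sequence[FieldSummary],
    wanted: Sequence[Optional[int]],
) -> bool:
    """Return False only when an equality filter falls outside a field's bounds.

    `wanted` holds one value per partition field; None means "no filter".
    """
    for summary, value in zip(summaries, wanted):
        if value is None:
            continue
        if not (summary.lower_bound <= value <= summary.upper_bound):
            return False  # no file in this manifest can match
    return True


# Assumed to be (year, month) bounds, taken from the example output.
summaries = [FieldSummary(2025, 2025, False), FieldSummary(6, 7, False)]

keep_july = manifest_may_match(summaries, [2025, 7])       # inside both ranges
skip_august = manifest_may_match(summaries, [2025, 8])     # month out of range
skip_2024 = manifest_may_match(summaries, [2024, None])    # year out of range
print(keep_july, skip_august, skip_2024)  # True False False
```

A result of True only means the manifest might contain matching files; the engine still has to check the individual file entries.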

In [3]:
# Detailed manifest analysis
manifests_df = spark.sql("""
SELECT 
    content,
    CASE content 
        WHEN 0 THEN 'DATA_FILES'
        WHEN 1 THEN 'DELETE_FILES'
        ELSE 'UNKNOWN'
    END as content_type,
    partition_spec_id,
    added_snapshot_id,
    added_data_files_count,
    existing_data_files_count,
    deleted_data_files_count,
    ROUND(length / 1024.0, 2) as manifest_size_kb,
    partition_summaries
FROM rest.`play_iceberg`.users.manifests
ORDER BY added_snapshot_id, partition_spec_id
""")

print("Manifest Files Overview:")
print("=" * 60)
manifests_df.select(
    "content_type", "partition_spec_id", "added_snapshot_id", 
    "added_data_files_count", "existing_data_files_count", "manifest_size_kb"
).show(truncate=False)

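# Analyze manifest statistics (file counts cover only files added by each
# manifest; existing and deleted files are not included)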
manifest_stats = manifests_df.agg(
    count("*").alias("total_manifests"),
    spark_min("added_data_files_count").alias("min_files_per_manifest"),
    spark_max("added_data_files_count").alias("max_files_per_manifest"),
    spark_min("manifest_size_kb").alias("min_manifest_size_kb"),
    spark_max("manifest_size_kb").alias("max_manifest_size_kb")
).collect()[0]

print("\nManifest Statistics:")
print(f"   Total Manifests: {manifest_stats['total_manifests']}")
print(
    f"   Files per Manifest: {manifest_stats['min_files_per_manifest']} - "
    f"{manifest_stats['max_files_per_manifest']}"
)
print(
    f"   Manifest Sizes: {manifest_stats['min_manifest_size_kb']:.2f} - "
    f"{manifest_stats['max_manifest_size_kb']:.2f} KB"
)

# Show partition summaries for the first manifest
first_manifest = manifests_df.filter("partition_summaries IS NOT NULL").first()
if first_manifest and first_manifest['partition_summaries']:
    print("\nPartition Summary Example:")
    print(f"   Partition Spec ID: {first_manifest['partition_spec_id']}")
    for i, summary in enumerate(first_manifest['partition_summaries']):
        print(
            f"   Partition {i+1}: {summary['lower_bound']} - "
            f"{summary['upper_bound']} (nulls: {summary['contains_null']})"
        )

Manifest Files Overview:
+------------+-----------------+-------------------+----------------------+-------------------------+----------------+
|content_type|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|manifest_size_kb|
+------------+-----------------+-------------------+----------------------+-------------------------+----------------+
|DATA_FILES  |0                |1926503896228031236|0                     |4                        |8.86            |
|DATA_FILES  |1                |1926503896228031236|2                     |0                        |8.76            |
+------------+-----------------+-------------------+----------------------+-------------------------+----------------+


Manifest Statistics:
   Total Manifests: 2
   Files per Manifest: 0 - 2
   Manifest Sizes: 8.76 - 8.86 KB

Partition Summary Example:
   Partition Spec ID: 0
   Partition 1: 2025 - 2025 (nulls: False)
   Partition 2: 6 - 7 (nulls: False)
   Partition 3: 1 - 2

## 3. Data Files Deep Dive

Data files are the actual Parquet files containing your table data. Understanding their characteristics is essential for performance optimization.

### Key Metrics:
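- **File Size**: Affects scan performance (many small files add per-file open and planning overhead; very large files limit parallelism)
- **Record Count**: Number of rows per file
- **Bytes per Record**: A rough proxy for storage efficiency; in tiny files it is dominated by fixed Parquet footer and metadata overhead
- **Partition Layout**: How data is organized on disk

### Optimization Insights:
- A common target file size is 128 MB to 1 GB; Iceberg's default `write.target-file-size-bytes` is 512 MB
- Widely varying record counts across files often point to skewed partitions or many small writes
- File paths show how each file was partitioned when it was written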
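
As a rough worked example of the sizing guidance, the sketch below compares a list of file sizes against a target size and estimates how many files the same data would need after compaction. The six sizes are the ones reported for this table (they sum to 29,528 bytes); the 0.75 "small file" threshold and the helper name `small_file_report` are assumptions chosen for illustration.

```python
import math

# Iceberg's default for the write.target-file-size-bytes table property (512 MB).
TARGET_FILE_SIZE_BYTES = 512 * 1024 * 1024


def small_file_report(file_sizes, target=TARGET_FILE_SIZE_BYTES, small_fraction=0.75):
    """Summarize how far a set of data files is from the target file size."""
    total = sum(file_sizes)
    # Files below this fraction of the target are counted as "small" (assumed cutoff).
    small = [size for size in file_sizes if size < target * small_fraction]
    return {
        "total_bytes": total,
        "small_files": len(small),
        "ideal_file_count": max(1, math.ceil(total / target)),
    }


# File sizes in bytes, as listed in this notebook's output.
sizes = [13042, 3889, 3847, 3681, 2627, 2442]
report = small_file_report(sizes)
print(report)
# {'total_bytes': 29528, 'small_files': 6, 'ideal_file_count': 1}
```

For a demo table this small, every file counts as small and all the data would fit in a single file. In Spark, Iceberg's `rewrite_data_files` procedure is the usual tool for that kind of compaction.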

In [4]:
# Comprehensive data files analysis
files_df = spark.sql("""
SELECT 
    file_path,
    file_format,
    record_count,
    file_size_in_bytes,
    ROUND(file_size_in_bytes / 1024.0 / 1024.0, 3) as file_size_mb,
    ROUND(file_size_in_bytes * 1.0 / record_count, 2) as bytes_per_record,
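    -- Derive a coarse partition label from the file path. Files whose path has no
    -- is_active segment (e.g. written under a different partition spec) are
    -- labelled 'mixed_partition'.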
    CASE 
        WHEN file_path LIKE '%is_active=true%' THEN 'active_users'
        WHEN file_path LIKE '%is_active=false%' THEN 'inactive_users'
        ELSE 'mixed_partition'
    END as partition_type
FROM rest.`play_iceberg`.users.files
ORDER BY file_size_in_bytes DESC
""")

print("Data Files Analysis:")
print("=" * 80)
files_df.select(
    "file_format", "record_count", "file_size_mb", 
    "bytes_per_record", "partition_type"
).show(truncate=False)

# Calculate storage statistics using Spark SQL functions
file_stats = files_df.agg(
    count("*").alias("total_files"),
    spark_sum("record_count").alias("total_records"),
    spark_sum("file_size_in_bytes").alias("total_size_bytes"),
    spark_min("file_size_in_bytes").alias("min_file_size"),
    spark_max("file_size_in_bytes").alias("max_file_size"),
    spark_min("record_count").alias("min_records"),
    spark_max("record_count").alias("max_records")
).collect()[0]

print("\nStorage Statistics:")
print(f"   Total Files: {file_stats['total_files']}")
print(f"   Total Records: {file_stats['total_records']:,}")
total_size_mb = file_stats['total_size_bytes']/1024/1024
print(
    f"   Total Storage: {file_stats['total_size_bytes']:,} bytes "
    f"({total_size_mb:.3f} MB)"
)
print(
    f"   File Size Range: {file_stats['min_file_size']} - "
    f"{file_stats['max_file_size']} bytes"
)
print(
    f"   Records per File: {file_stats['min_records']} - "
    f"{file_stats['max_records']}"
)
avg_file_size = file_stats['total_size_bytes']/file_stats['total_files']
print(f"   Average File Size: {avg_file_size:.0f} bytes")
avg_records = file_stats['total_records']/file_stats['total_files']
print(f"   Average Records per File: {avg_records:.1f}")

# Analyze partition distribution
partition_stats = files_df.groupBy("partition_type").agg(
    count("*").alias("file_count"),
    spark_sum("record_count").alias("total_records"),
    spark_sum("file_size_in_bytes").alias("total_size")
).orderBy("file_count")

print("\nPartition Distribution:")
partition_stats.show(truncate=False)

Data Files Analysis:
+-----------+------------+------------+----------------+---------------+
|file_format|record_count|file_size_mb|bytes_per_record|partition_type |
+-----------+------------+------------+----------------+---------------+
|PARQUET    |998         |0.012       |13.07           |mixed_partition|
|PARQUET    |2           |0.004       |1944.50         |active_users   |
|PARQUET    |3           |0.004       |1282.33         |mixed_partition|
|PARQUET    |1           |0.004       |3681.00         |inactive_users |
|PARQUET    |7           |0.003       |375.29          |mixed_partition|
|PARQUET    |1           |0.002       |2442.00         |mixed_partition|
+-----------+------------+------------+----------------+---------------+


Storage Statistics:
   Total Files: 6
   Total Records: 1,012
   Total Storage: 29,528 bytes (0.028 MB)
   File Size Range: 2442 - 13042 bytes
   Records per File: 1 - 998
   Average File Size: 4921 bytes
   Average Records per File: 168.7

Partit

## 4. File-Level Metadata Deep Dive

Iceberg tracks detailed statistics for each data file, enabling efficient query planning and optimization.

### File Statistics Include:
- **Column Sizes**: On-disk bytes per column, keyed by field ID (shows which columns dominate storage)
- **Value Counts**: Number of values per column, including nulls
- **Null Counts**: Null values per column per file
- **Bounds**: Lower and upper values per column, used to skip files during predicate pushdown
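
These statistics maps are keyed by Iceberg field ID rather than by column position, and the two can diverge once columns are dropped or reordered. The sketch below shows a lookup keyed by ID, assuming a schema shaped like the `fields` list in Iceberg's table metadata JSON. The IDs, names, and byte counts are hypothetical, and `field_names_by_id` and `label_stats` are illustrative helpers rather than Iceberg APIs.

```python
from typing import Dict


def field_names_by_id(schema: dict) -> Dict[int, str]:
    """Build a field-ID -> column-name map from a schema's top-level fields."""
    return {field["id"]: field["name"] for field in schema["fields"]}


def label_stats(stats: Dict[int, int], names: Dict[int, str]) -> Dict[str, int]:
    """Replace field IDs with column names, keeping unknown IDs visible."""
    return {names.get(fid, f"field_{fid}"): value for fid, value in stats.items()}


# Hypothetical schema: field 4 was dropped earlier, so IDs are no longer contiguous.
schema = {
    "type": "struct",
    "fields": [
        {"id": 1, "name": "user_id", "required": True, "type": "long"},
        {"id": 2, "name": "username", "required": False, "type": "string"},
        {"id": 3, "name": "email", "required": False, "type": "string"},
        {"id": 5, "name": "is_active", "required": False, "type": "boolean"},
    ],
}

# Hypothetical column_sizes map, keyed by field ID.
column_sizes = {1: 100, 2: 250, 3: 300, 5: 20}

labelled = label_stats(column_sizes, field_names_by_id(schema))
print(labelled)
# {'user_id': 100, 'username': 250, 'email': 300, 'is_active': 20}
```

A positional lookup such as `columns[field_id - 1]` would fail or mislabel field 5 in this schema, which is why keying by ID is the safer choice.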

### Performance Impact:
- Query engines use bounds for file pruning
- Column sizes help estimate scan costs
- Statistics can inform cost-based decisions such as join ordering
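
To illustrate how bounds drive file pruning, the sketch below decodes binary bounds and tests a value against them. The `lower_bounds` and `upper_bounds` columns hold binary values; per the Iceberg spec's single-value serialization, `int` and `long` are little-endian and `string` is UTF-8. The example range of 1 to 998 for a `long` column is hypothetical, and `decode_bound` and `file_may_match` are illustrative helpers covering only three types.

```python
import struct


def decode_bound(raw: bytes, iceberg_type: str):
    """Decode a serialized bound for a few primitive Iceberg types."""
    if iceberg_type == "int":
        return struct.unpack("<i", raw)[0]   # 4-byte little-endian
    if iceberg_type == "long":
        return struct.unpack("<q", raw)[0]   # 8-byte little-endian
    if iceberg_type == "string":
        return raw.decode("utf-8")
    raise ValueError(f"type not handled in this sketch: {iceberg_type}")


def file_may_match(lower, upper, value) -> bool:
    """A file can only contain `value` if it lies within [lower, upper]."""
    return lower <= value <= upper


# Hypothetical bounds for a long column, built here with struct.pack for the demo.
raw_lower = struct.pack("<q", 1)
raw_upper = struct.pack("<q", 998)

lower = decode_bound(raw_lower, "long")
upper = decode_bound(raw_upper, "long")

scan_file = file_may_match(lower, upper, 500)    # inside the range: must read
skip_file = file_may_match(lower, upper, 5000)   # outside the range: can skip
print(lower, upper, scan_file, skip_file)  # 1 998 True False
```

As with manifests, a match only means the file might contain the value; the bounds can rule a file out but never confirm a hit.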

In [6]:
# Detailed file-level metadata analysis
file_metadata = spark.sql("""
SELECT 
    element_at(SPLIT(file_path, '/'), -1) as file_name,
    file_size_in_bytes,
    record_count,
    column_sizes,
    value_counts,
    null_value_counts,
    lower_bounds,
    upper_bounds,
    ROUND(file_size_in_bytes * 1.0 / record_count, 2) as bytes_per_record
FROM rest.`play_iceberg`.users.files
ORDER BY file_size_in_bytes DESC
""")

print("File-Level Metadata Analysis:")
print("=" * 50)

# Show basic file info
file_metadata.select(
    "file_name", "file_size_in_bytes", "record_count", "bytes_per_record"
).show(truncate=False)

# Analyze the largest file in detail
largest_file = file_metadata.first()  # rows are ordered by size, largest first
print("\nDetailed Analysis - Largest File:")
print(f"   File: {largest_file['file_name']}")
print(f"   Size: {largest_file['file_size_in_bytes']:,} bytes")
print(f"   Records: {largest_file['record_count']:,}")
print(f"   Efficiency: {largest_file['bytes_per_record']} bytes/record")

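users_df = spark.table("play_iceberg.users")

# Column names for mapping field IDs to names, defined up front because both
# analyses below use them. This assumes field IDs are 1-based and follow the
# current column order, which may not hold after schema evolution.
column_names = users_df.columns

# Column size analysis
if largest_file['column_sizes']:
    print("\nColumn Storage Breakdown:")
    column_sizes = largest_file['column_sizes']
    total_column_bytes = sum(column_sizes.values())
    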
    for col_id, size in sorted(column_sizes.items(), key=lambda x: x[1], reverse=True):
        col_name = (
            column_names[col_id - 1] 
            if col_id <= len(column_names) 
            else f"column_{col_id}"
        )
        percentage = (size / total_column_bytes) * 100
        print(f"   {col_name}: {size} bytes ({percentage:.1f}%)")

# Value and null count analysis
if largest_file['value_counts'] and largest_file['null_value_counts']:
    print("\nData Quality per Column:")
    value_counts = largest_file['value_counts']
    null_counts = largest_file['null_value_counts']
    
    for col_id in sorted(value_counts.keys()):
        col_name = (
            column_names[col_id - 1] 
            if col_id <= len(column_names) 
            else f"column_{col_id}"
        )
        values = value_counts.get(col_id, 0)
        nulls = null_counts.get(col_id, 0)
        null_rate = (nulls / values) * 100 if values > 0 else 0
        print(f"   {col_name}: {values} values, {nulls} nulls ({null_rate:.1f}% null rate)")

# Storage efficiency analysis
all_files = file_metadata.collect()
total_size = sum(f['file_size_in_bytes'] for f in all_files)
total_records = sum(f['record_count'] for f in all_files)

print("\nStorage Efficiency Summary:")
print(f"   Total Files: {len(all_files)}")
print(f"   Total Storage: {total_size:,} bytes ({total_size/1024/1024:.3f} MB)")
print(f"   Total Records: {total_records:,}")
print(f"   Overall Efficiency: {total_size/total_records:.2f} bytes/record")
print(f"   Average File Size: {total_size/len(all_files):,.0f} bytes")
min_file_size = min(f['file_size_in_bytes'] for f in all_files)
max_file_size = max(f['file_size_in_bytes'] for f in all_files)
print(f"   File Size Range: {min_file_size} - {max_file_size} bytes")

File-Level Metadata Analysis:
+--------------------------------------------------------------+------------------+------------+----------------+
|file_name                                                     |file_size_in_bytes|record_count|bytes_per_record|
+--------------------------------------------------------------+------------------+------------+----------------+
|00000-839-759a0fd4-25db-41bd-8e21-a70f9cb4dc84-0-00001.parquet|13042             |998         |13.07           |
|00000-31-5faa8614-eeb6-445d-b377-644a41656b38-0-00001.parquet |3889              |2           |1944.50         |
|00000-10-106f15bf-6293-4ee2-a926-7f4807b22ec3-0-00001.parquet |3847              |3           |1282.33         |
|00000-31-5faa8614-eeb6-445d-b377-644a41656b38-0-00002.parquet |3681              |1           |3681.00         |
|00000-804-a1e7a35f-d545-4f9d-9c96-2e90163f309c-0-00002.parquet|2627              |7           |375.29          |
|00000-65-cc0575ab-588e-4474-ae8e-9ba2720524f5-0-00001.par


Column Storage Breakdown:
   updated_at: 3773 bytes (34.6%)
   email: 2698 bytes (24.8%)
   username: 2348 bytes (21.6%)
   user_id: 1691 bytes (15.5%)
   is_active: 163 bytes (1.5%)
   created_month: 74 bytes (0.7%)
   created_day: 74 bytes (0.7%)
   created_year: 73 bytes (0.7%)

Data Quality per Column:
   user_id: 998 values, 0 nulls (0.0% null rate)
   username: 998 values, 0 nulls (0.0% null rate)
   email: 998 values, 0 nulls (0.0% null rate)
   is_active: 998 values, 0 nulls (0.0% null rate)
   created_year: 998 values, 0 nulls (0.0% null rate)
   created_month: 998 values, 0 nulls (0.0% null rate)
   created_day: 998 values, 0 nulls (0.0% null rate)
   updated_at: 998 values, 0 nulls (0.0% null rate)

Storage Efficiency Summary:
   Total Files: 6
   Total Storage: 29,528 bytes (0.028 MB)
   Total Records: 1,012
   Overall Efficiency: 29.18 bytes/record
   Average File Size: 4,921 bytes
   File Size Range: 2442 - 13042 bytes


In [None]:
# Clean up resources
print("Cleaning up Spark session...")
spark.stop()
print("Cleanup complete!")