# Complete Banking Reconciliation Setup - Phase 1
## Local Hadoop (File-Based) Catalog

## 🎯 Learning Objectives

This notebook demonstrates how to set up Apache Iceberg with a **local Hadoop catalog** for a banking reconciliation system. You will learn:

### **Apache Iceberg Fundamentals**
- **Catalog System**: Understanding how Iceberg's file-based Hadoop catalog works and how it differs from a Hive Metastore catalog
- **Local Storage**: How Iceberg stores metadata and data locally
- **Table Structure**: The file organization created by Iceberg tables
- **Partitioning**: How Iceberg handles partitioned tables

### **What You'll Build**
- A complete banking reconciliation system with three core tables
- Local file-based storage managed by a Hadoop catalog (no Hive Metastore service required)
- Partitioned tables for efficient querying
- Comprehensive audit trail and validation

### **Key Concepts Explained**

#### **1. Iceberg Hadoop Catalog (File-Based)**
```
spark.sql.catalog.local.type = "hadoop"
spark.sql.catalog.local.warehouse = "file:///opt/bitnami/spark/warehouse"
```
- **Catalog**: A named entry point that tracks namespaces and tables
- **Hadoop catalog**: Stores table metadata (schema, location, properties) as files in the warehouse directory; no Hive Metastore service is involved
- **Local Storage**: Files stored on local filesystem, not distributed

#### **2. File Structure Created by Iceberg**
```
/warehouse/banking/table_name/
├── metadata/               # Table metadata and version history
│   ├── v1.metadata.json    # Table metadata file (schema, partition spec, snapshot list)
│   ├── version-hint.text   # Points to latest metadata version
│   └── .*.crc              # Checksum sidecar files
└── data/                   # Data files (Parquet by default), created on first write
    └── partition_paths/    # Data organized by partition values
```

Snapshots do not get their own directory: each snapshot is recorded in the metadata files, and its manifest lists and manifests are stored under `metadata/`. That record is what enables time travel.

#### **3. Partitioning Strategy**
- **source_transactions**: Partitioned by `days(transaction_date)` and `source_system`
- **reconciliation_results**: Partitioned by `days(reconciliation_timestamp)` and `match_status`
- **reconciliation_batches**: No partitioning (small lookup table)
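
To make the `days(...)` transform concrete, here is a minimal pure-Python sketch of the idea: the transform maps a timestamp to a whole number of days since 1970-01-01, so all rows from the same UTC day and source system share one partition key. The helper names and the `core_banking` source system are hypothetical, chosen only for illustration; Iceberg performs this computation internally.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Return whole days since 1970-01-01 (UTC) for a timestamp."""
    if ts.tzinfo is None:
        # Assumption for this sketch: naive timestamps are treated as UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return (ts.astimezone(timezone.utc).date() - EPOCH).days


def partition_key(ts: datetime, source_system: str) -> tuple:
    """Illustrative key for PARTITIONED BY (days(transaction_date), source_system)."""
    return (day_transform(ts), source_system)


early = datetime(2025, 7, 13, 0, 5, tzinfo=timezone.utc)
late = datetime(2025, 7, 13, 23, 55, tzinfo=timezone.utc)
next_day = datetime(2025, 7, 14, 0, 5, tzinfo=timezone.utc)

# Same day and source system: both rows fall into one partition.
print(partition_key(early, "core_banking") == partition_key(late, "core_banking"))
# A different day yields a different partition.
print(partition_key(early, "core_banking") == partition_key(next_day, "core_banking"))
```

A query that filters on a date range of `transaction_date` can therefore skip every partition whose day value falls outside that range.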

## Phase 1: Foundation Setup

## Step 1: Import Required Libraries

**Purpose**: Import all necessary libraries for Spark, MinIO, and file operations.

**Key Libraries**:
- `pyspark.sql.SparkSession`: Core Spark functionality
- `boto3`: AWS S3/MinIO client for object storage
- `os`, `json`, `datetime`: File and data manipulation utilities

In [2]:
# Import required libraries
!pip install --root-user-action=ignore rich --quiet
from rich import print
from pyspark.sql import SparkSession
import boto3
from botocore.client import Config
import time
import os
import json
from datetime import datetime

print("✓ All required libraries imported successfully")

DEPRECATION: Loading egg at /opt/bitnami/python/lib/python3.12/site-packages/pip-23.3.2-py3.12.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330

## Step 2: Stop any existing Spark session

**Purpose**: Ensure a clean Spark environment by stopping any existing sessions.

**Why This Matters**:
- Prevents configuration conflicts
- Ensures fresh Iceberg catalog initialization
- Clears any cached metadata or connections

In [3]:
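# Stop any existing Spark session.
# Note: getOrCreate() starts a session if none is running, so this cell
# always ends with no active session before Step 3 builds the configured one.
try:
    SparkSession.builder.getOrCreate().stop()
    print("✓ Stopped existing Spark session")
except Exception as e:
    print(f"ℹ No Spark session could be stopped: {e}")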

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/13 08:59:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/07/13 08:59:09 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Step 3: Create Spark Session with Iceberg Configuration

**Purpose**: Initialize Spark with Apache Iceberg extensions and local catalog configuration.

### **Configuration Breakdown**:

#### **Iceberg Extensions**
```python
spark.sql.extensions = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
```
- Adds Iceberg-specific SQL such as `CALL` stored procedures and `ALTER TABLE ... ADD PARTITION FIELD` (partition evolution)
- Plain `CREATE TABLE ... USING iceberg`, reads, and writes go through the catalog configuration below

#### **Catalog Configuration**
```python
spark.sql.catalog.local = "org.apache.iceberg.spark.SparkCatalog"
spark.sql.catalog.local.type = "hadoop"
spark.sql.catalog.local.warehouse = "file:///opt/bitnami/spark/warehouse"
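spark.sql.catalog.spark_catalog.type = "hive"
```
- **local**: Our custom catalog name
- **hadoop**: File-based catalog that keeps table metadata in the warehouse directory (no Hive Metastore involved)
- **warehouse**: Local filesystem path for table data and metadata
- **spark_catalog**: Spark's built-in session catalog, configured separately with type `hive`; tables created under `local` in this notebook do not use it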

#### **Default Catalog**
```python
spark.sql.defaultCatalog = "local"
```
- Sets `local` as the default catalog for table operations
- Allows omitting the catalog prefix: `banking.table_name` resolves to `local.banking.table_name`
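
The settings above can also be collected in one place before they are handed to the session builder. The following is a small sketch, not the notebook's own method: `hadoop_catalog_conf` is a hypothetical helper that only builds a dictionary of the same keys and values shown in this step.

```python
def hadoop_catalog_conf(catalog_name: str, warehouse_dir: str) -> dict[str, str]:
    """Build Spark settings for an Iceberg Hadoop (file-based) catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        # warehouse_dir is expected to be an absolute local path.
        f"{prefix}.warehouse": f"file://{warehouse_dir}",
        "spark.sql.defaultCatalog": catalog_name,
    }


conf = hadoop_catalog_conf("local", "/opt/bitnami/spark/warehouse")
for key, value in conf.items():
    print(f"{key} = {value}")
```

Each pair could then be applied with `builder = builder.config(key, value)` in a loop before calling `getOrCreate()`, which keeps the catalog name and warehouse path defined in a single spot.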

In [4]:
# Create warehouse directory if it doesn't exist
warehouse_dir = "/opt/bitnami/spark/warehouse"
os.makedirs(warehouse_dir, exist_ok=True)
print(f"✓ Warehouse directory: {warehouse_dir}")

# Create a new Spark session with the correct configuration
spark = SparkSession.builder \
    .appName("Banking Reconciliation Demo") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hive") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", f"file://{warehouse_dir}") \
    .config("spark.sql.defaultCatalog", "local") \
    .getOrCreate()

print("✓ Spark session created successfully")
print(f"Spark version: {spark.version}")
print(f"Default catalog: {spark.conf.get('spark.sql.defaultCatalog')}")
print(f"Warehouse location: {spark.conf.get('spark.sql.catalog.local.warehouse')}")

25/07/13 09:00:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## HMS - Hive Metastore

### 1. **Current Setting (Local Hadoop Catalog)**
```python
spark = SparkSession.builder \
    .appName("Banking Reconciliation Demo") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hive") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", f"file://{warehouse_dir}") \
    .config("spark.sql.defaultCatalog", "local") \
    .getOrCreate()
```

- **Catalog Name:** `local`
- **Catalog Type:** `hadoop`
- **Warehouse Location:** Local file system (e.g., `/opt/bitnami/spark/warehouse`)
- **Metastore:** No external Hive Metastore; metadata is stored in the file system (usually as JSON files in the warehouse directory).
- **Use Case:** Good for local development, testing, or single-node setups. No need for a running Hive Metastore service.

---

### 2. **Hive Metastore (HMS) Catalog Setting**
```ini
# Register one catalog called “hms” that points at Hive Metastore
spark.sql.catalog.hms               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hms.type          hive
spark.sql.catalog.hms.uri           thrift://metastore:9083 
spark.sql.catalog.hms.warehouse     /opt/warehouse 
spark.sql.defaultCatalog            hms
```

- **Catalog Name:** `hms`
- **Catalog Type:** `hive`
- **Warehouse Location:** Could be HDFS, S3, or local, but managed by Hive Metastore (e.g., `/opt/warehouse`)
- **Metastore:** Uses an external Hive Metastore service (running at `thrift://metastore:9083`). All table metadata is stored in the Hive Metastore database.
- **Use Case:** Production, multi-user, or distributed environments. Enables sharing tables and metadata between Spark, Hive, Trino, Presto, Flink, etc.
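
The HMS settings above are written in `spark-defaults.conf` style (key, whitespace, value). If you would rather set them from a notebook, a sketch like the following could turn such text into a dictionary; `parse_spark_properties` is a hypothetical helper, and the property values are simply the ones listed in this section.

```python
HMS_PROPERTIES = """
# Register one catalog called "hms" that points at Hive Metastore
spark.sql.catalog.hms               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hms.type          hive
spark.sql.catalog.hms.uri           thrift://metastore:9083
spark.sql.catalog.hms.warehouse     /opt/warehouse
spark.sql.defaultCatalog            hms
"""


def parse_spark_properties(text: str) -> dict[str, str]:
    """Parse 'key<whitespace>value' lines, skipping blanks and # comments."""
    conf = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


hms_conf = parse_spark_properties(HMS_PROPERTIES)
for key, value in hms_conf.items():
    print(f"{key} = {value}")
```

The resulting pairs can be passed to `SparkSession.builder.config(key, value)` one by one. A reachable metastore at the given URI is still required for the session to use the catalog.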

---

## **Key Differences**

| Feature                | Your Setting (`local`/`hadoop`)         | HMS Setting (`hms`/`hive`)                |
|------------------------|-----------------------------------------|-------------------------------------------|
| **Catalog Name**       | `local`                                 | `hms`                                     |
| **Catalog Type**       | `hadoop`                                | `hive`                                    |
| **Metastore**          | No external metastore (file-based)      | External Hive Metastore (service needed)  |
| **Metadata Location**  | Files in warehouse directory            | Hive Metastore DB (plus warehouse files)  |
| **Sharing**            | Not shared across engines               | Shared across Spark, Hive, Trino, etc.    |
| **Best For**           | Local, dev, single-user                 | Production, multi-user, distributed       |
| **Setup Complexity**   | Simple, no extra services               | Needs Hive Metastore running              |

---

## **Summary**

- **The current config** is simple, local, and file-based—great for development and testing.
- **The HMS config** is for production, multi-user, and multi-engine environments, requiring a running Hive Metastore service and enabling true metadata sharing.

**If you want to use Iceberg tables across multiple engines or clusters, or need central metadata management, use the HMS config. For local, quick, or isolated work, your current config is fine.**

Ref: [Hive Metastore as an Apache Iceberg Catalog](https://www.linkedin.com/pulse/hive-metastore-apache-iceberg-catalog-ankur-ranjan-qmhpc/)

## Step 4: Initialize MinIO Client and Check Health

**Purpose**: Set up MinIO (S3-compatible object storage) for future data operations.

### **MinIO Configuration**:
- **Endpoint**: `http://minio:9000` (Docker service name)
- **Credentials**: `minio`/`minio123` (default development credentials)
- **Region**: `us-east-1` (standard region)
- **Signature Version**: `s3v4` (AWS Signature Version 4)

### **Health Check Strategy**:
- Retry logic with a fixed 2-second delay, up to 30 attempts
- Graceful handling of connection failures
- Detailed logging for troubleshooting

In [5]:
# Initialize MinIO client
s3_client = boto3.client(
    's3',
    endpoint_url='http://minio:9000',
    aws_access_key_id='minio',
    aws_secret_access_key='minio123',
    config=Config(signature_version='s3v4'),
    region_name='us-east-1'
)

print("✓ MinIO client initialized")
print(f"Endpoint: http://minio:9000")
print(f"Access Key: minio")
print(f"Signature Version: s3v4")

In [6]:
# Check MinIO health with retries
print("Checking MinIO health...")
max_attempts = 30
attempt = 1

while attempt <= max_attempts:
    try:
        # Try to list buckets to check connectivity
        buckets = s3_client.list_buckets()
        print(f"✓ MinIO is ready! Found {len(buckets['Buckets'])} existing buckets")
        break
    except Exception as e:
        print(f"Waiting for MinIO... (attempt {attempt}/{max_attempts}): {str(e)}")
        if attempt >= max_attempts:
            print("⚠ MinIO failed to start within the expected time. Continuing anyway...")
            break
        time.sleep(2)
        attempt += 1
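
The loop above waits a fixed 2 seconds between attempts. If exponential backoff is preferred, a minimal sketch could look like the following. `wait_until_ready` and its parameters are hypothetical names, and the `sleep` argument is injectable so the demo runs without actually waiting.

```python
import time


def wait_until_ready(check, max_attempts=5, base_delay=1.0, max_delay=30.0,
                     sleep=time.sleep) -> bool:
    """Call check() until it succeeds, doubling the delay after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            check()
            return True
        except Exception as exc:
            if attempt == max_attempts:
                print(f"Giving up after {attempt} attempts: {exc}")
                return False
            # Delay grows 1x, 2x, 4x, ... of base_delay, capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            print(f"Attempt {attempt}/{max_attempts} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
    return False


# Demo with a stand-in check that fails twice before succeeding.
calls = {"count": 0}


def flaky_check():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service not ready")


recorded_delays = []
ready = wait_until_ready(flaky_check, sleep=recorded_delays.append)
print(ready, recorded_delays)
```

In the notebook this could be called as `wait_until_ready(s3_client.list_buckets)`, which treats any exception from the bucket listing as "not ready yet".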

## Step 5: Initialize MinIO Buckets

**Purpose**: Create the required S3 buckets for different stages of data processing.

### **Bucket Strategy**:

| Bucket | Purpose | Data Type |
|--------|---------|-----------|
| `warehouse` | Iceberg table data and metadata | Structured data |
| `raw-data` | Original source files | CSV, JSON, etc. |
| `stage-data` | Intermediate processing data | Transformed data |
| `reconciled-data` | Final reconciliation results | Processed results |

### **Benefits of This Structure**:
- **Data Lineage**: Clear separation of data stages
- **Cost Optimization**: Different retention policies per bucket
- **Security**: Granular access controls per bucket
- **Performance**: Optimized storage for each data type

In [7]:
# List existing buckets
try:
    existing_buckets = [bucket['Name'] for bucket in s3_client.list_buckets()['Buckets']]
    print(f"Existing buckets: {existing_buckets}")
except Exception as e:
    print(f"Error listing buckets: {str(e)}")
    existing_buckets = []

# Define required buckets with their purposes
bucket_purposes = {
    'warehouse': 'Iceberg table data and metadata',
    'raw-data': 'Original source files (CSV, JSON)',
    'stage-data': 'Intermediate processing data',
    'reconciled-data': 'Final reconciliation results'
}

# Create buckets if they don't exist
created_buckets = []
for bucket, purpose in bucket_purposes.items():
    if bucket not in existing_buckets:
        try:
            s3_client.create_bucket(Bucket=bucket)
            created_buckets.append(bucket)
            print(f"✓ Created bucket: {bucket} ({purpose})")
        except Exception as e:
            print(f"✗ Error creating bucket {bucket}: {str(e)}")
    else:
        print(f"ℹ Bucket already exists: {bucket} ({purpose})")

print(f"\n📊 Bucket Summary:")
print(f"- Total buckets: {len(existing_buckets) + len(created_buckets)}")
print(f"- Newly created: {len(created_buckets)}")
print(f"- Already existed: {len(existing_buckets)}")
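
Note that the "Already existed" figure above counts every bucket found on the server, including ones unrelated to this project. A small sketch of a stricter split between required buckets that are missing and required buckets that are present follows; `plan_buckets` is a hypothetical helper and `scratch` is a made-up unrelated bucket.

```python
def plan_buckets(required, existing):
    """Split the required bucket names into (to_create, already_present)."""
    existing_set = set(existing)
    to_create = [name for name in required if name not in existing_set]
    already_present = [name for name in required if name in existing_set]
    return to_create, already_present


required = ["warehouse", "raw-data", "stage-data", "reconciled-data"]
existing = ["warehouse", "scratch"]  # "scratch" is not part of this project

to_create, already_present = plan_buckets(required, existing)
print(f"To create: {to_create}")
print(f"Already present (required only): {already_present}")
```

Only the names in `to_create` would then need a `create_bucket` call, and the summary can report `len(already_present)` instead of the size of the full bucket listing.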

## Step 6: Create Banking Namespace

**Purpose**: Create a logical namespace to organize related tables.

### **Namespace Benefits**:
- **Organization**: Groups related tables together
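- **Access Control**: Permissions can be applied at namespace level in catalogs that support them (with this Hadoop catalog, access is governed by filesystem permissions)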
- **Naming**: Avoid table name conflicts
- **Discovery**: Easier to find related tables

### **File System Impact**:
The namespace maps to a directory under the warehouse; the table directories appear once the tables are created in Step 7:
```
/opt/bitnami/spark/warehouse/banking/
├── source_transactions/
├── reconciliation_results/
└── reconciliation_batches/
```
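
The mapping from a table identifier to its directory can be sketched in a few lines. This assumes the layout shown above (warehouse, then namespace, then table); `hadoop_table_path` is a hypothetical helper, not part of Iceberg's API.

```python
from pathlib import PurePosixPath


def hadoop_table_path(warehouse_dir: str, identifier: str,
                      catalog_name: str = "local") -> PurePosixPath:
    """Map 'catalog.namespace.table' to its directory under the warehouse."""
    parts = identifier.split(".")
    # Drop the leading catalog name if present (assumes no namespace shares it).
    if parts and parts[0] == catalog_name:
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Expected namespace.table, got: {identifier!r}")
    return PurePosixPath(warehouse_dir).joinpath(*parts)


table_path = hadoop_table_path(
    "/opt/bitnami/spark/warehouse", "local.banking.source_transactions"
)
print(table_path)
```

This is handy for the verification cells later in the notebook, where the same paths are passed to `ls` and `cat`.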

In [8]:
# Create the banking namespace
try:
    spark.sql("CREATE NAMESPACE IF NOT EXISTS local.banking")
    print("✓ Created namespace: local.banking")
    print(f"Namespace location: {warehouse_dir}/banking")
except Exception as e:
    print(f"✗ Error creating namespace: {str(e)}")

In [9]:
# Verify namespace creation by listing the directory structure
print("📁 Verifying namespace creation...")
!ls -la {warehouse_dir}

print(f"\n📁 Banking namespace contents:")
!ls -la {warehouse_dir}/banking

total 16
drwxr-xr-x 3 root root 4096 Jul 13 07:44 .
drwxr-xr-x 1 1001 root 4096 Jul 13 07:41 ..
drwxr-xr-x 5 root root 4096 Jul 13 08:19 banking


total 20
drwxr-xr-x 5 root root 4096 Jul 13 08:19 .
drwxr-xr-x 3 root root 4096 Jul 13 07:44 ..
drwxr-xr-x 3 root root 4096 Jul 13 08:19 reconciliation_batches
drwxr-xr-x 3 root root 4096 Jul 13 08:19 reconciliation_results
drwxr-xr-x 3 root root 4096 Jul 13 07:48 source_transactions


## Step 7: Create Iceberg Tables

**Purpose**: Create the core tables for the banking reconciliation system.

### **Table Design Strategy**:

#### **1. source_transactions**
- **Purpose**: Store all transaction data from different source systems
- **Partitioning**: By `days(transaction_date)` and `source_system`
- **Benefits**: Efficient querying by date range and source
- **Schema**: Comprehensive transaction metadata

#### **2. reconciliation_results**
- **Purpose**: Store reconciliation outcomes and discrepancies
- **Partitioning**: By `days(reconciliation_timestamp)` and `match_status`
- **Benefits**: Easy analysis of reconciliation performance
- **Schema**: Detailed discrepancy tracking

#### **3. reconciliation_batches**
- **Purpose**: Track reconciliation batch metadata
- **Partitioning**: None (small lookup table)
- **Benefits**: Batch-level monitoring and reporting
- **Schema**: Batch execution metadata

In [10]:
# Create source_transactions table
print("Creating source_transactions table...")
try:
    spark.sql("""
    CREATE TABLE IF NOT EXISTS local.banking.source_transactions (
      transaction_id STRING,
      source_system STRING,
      transaction_date TIMESTAMP,
      amount DECIMAL(18,2),
      account_id STRING,
      transaction_type STRING,
      reference_id STRING,
      status STRING,
      payload STRING,
      created_at TIMESTAMP,
      processing_timestamp TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(transaction_date), source_system)
    """)
    print("✓ Created table: local.banking.source_transactions")
    print("  - Partitioned by: days(transaction_date), source_system")
    print("  - Purpose: Store all transaction data from different sources")
except Exception as e:
    print(f"✗ Error creating source_transactions table: {str(e)}")

In [11]:
# Create reconciliation_results table
print("Creating reconciliation_results table...")
try:
    spark.sql("""
    CREATE TABLE IF NOT EXISTS local.banking.reconciliation_results (
      reconciliation_id STRING,
      batch_id STRING,
      primary_transaction_id STRING,
      secondary_transaction_id STRING,
      match_status STRING,
      discrepancy_type STRING,
      discrepancy_amount DECIMAL(18,2),
      reconciliation_timestamp TIMESTAMP,
      notes STRING
    )
    USING iceberg
    PARTITIONED BY (days(reconciliation_timestamp), match_status)
    """)
    print("✓ Created table: local.banking.reconciliation_results")
    print("  - Partitioned by: days(reconciliation_timestamp), match_status")
    print("  - Purpose: Store reconciliation outcomes and discrepancies")
except Exception as e:
    print(f"✗ Error creating reconciliation_results table: {str(e)}")

In [None]:
# Create reconciliation_batches table
print("Creating reconciliation_batches table...")
try:
    spark.sql("""
    CREATE TABLE IF NOT EXISTS local.banking.reconciliation_batches (
      batch_id STRING,
      reconciliation_date TIMESTAMP,
      source_systems ARRAY<STRING>,
      start_date TIMESTAMP,
      end_date TIMESTAMP,
      status STRING,
      total_transactions BIGINT,
      matched_count BIGINT,
      unmatched_count BIGINT,
      created_at TIMESTAMP,
      completed_at TIMESTAMP
    )
    USING iceberg
    """)
    print("✓ Created table: local.banking.reconciliation_batches")
    print("  - Partitioned by: None (small lookup table)")
    print("  - Purpose: Track reconciliation batch metadata")
except Exception as e:
    print(f"✗ Error creating reconciliation_batches table: {str(e)}")

## Step 8: Verify Tables and Audit Setup

**Purpose**: Validate that all tables were created correctly and examine the Iceberg file structure.

### **What We'll Verify**:
1. **Table Existence**: Confirm all tables are created
2. **File Structure**: Examine Iceberg metadata and data directories
3. **Schema Validation**: Check table schemas are correct
4. **Accessibility**: Test basic queries on each table

In [12]:
# List tables to verify
print("📋 Verifying created tables...")
tables_df = spark.sql("SHOW TABLES IN local.banking")
tables_df.show()

# Count tables
table_count = tables_df.count()
print(f"\n📊 Table Summary:")
print(f"- Total tables in banking namespace: {table_count}")
print(f"- Expected tables: 3")
print(f"- Status: {'✓ PASS' if table_count >= 3 else '✗ FAIL'}")

+---------+--------------------+-----------+
|namespace|           tableName|isTemporary|
+---------+--------------------+-----------+
|  banking|reconciliation_ba...|      false|
|  banking|reconciliation_re...|      false|
|  banking| source_transactions|      false|
+---------+--------------------+-----------+



                                                                                

In [13]:
# Test a simple query on each table
print("🔍 Testing table accessibility...")

tables_to_test = [
    'local.banking.source_transactions',
    'local.banking.reconciliation_results',
    'local.banking.reconciliation_batches'
]

for table in tables_to_test:
    try:
        result = spark.sql(f"SELECT COUNT(*) as count FROM {table}")
        count = result.collect()[0]['count']
        print(f"✓ {table}: {count} rows")
    except Exception as e:
        print(f"✗ {table}: Error - {str(e)}")
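
The count-based check in this step passes whenever any three tables exist in the namespace. A stricter variant compares names instead. The sketch below uses a hypothetical helper, `check_tables`, and a hard-coded list standing in for the `SHOW TABLES` result.

```python
EXPECTED_TABLES = {
    "source_transactions",
    "reconciliation_results",
    "reconciliation_batches",
}


def check_tables(found, expected=EXPECTED_TABLES) -> dict:
    """Report which expected tables are missing and which extra ones exist."""
    found_set = set(found)
    return {
        "missing": sorted(expected - found_set),
        "unexpected": sorted(found_set - expected),
    }


# Stand-in for table names collected from SHOW TABLES.
found_tables = ["source_transactions", "reconciliation_results", "tmp_scratch"]
report = check_tables(found_tables)
print(report)
print("✓ PASS" if not report["missing"] else "✗ FAIL")
```

With the DataFrame from this step, the names could be gathered as `[row.tableName for row in tables_df.collect()]` and passed in as `found`.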

## Step 9: Examine Iceberg File Structure

**Purpose**: Understand how Iceberg organizes files and metadata on the local filesystem.

### **Iceberg File Organization**:

Each Iceberg table creates this structure:
```
table_name/
├── metadata/              # Table metadata and version history
│   ├── v1.metadata.json   # Table metadata file (schema, partition spec, snapshot list)
│   ├── version-hint.text  # Points to latest metadata version
│   └── .*.crc             # Checksum sidecar files
└── data/                  # Data files (Parquet by default), created on first write
    └── partition_paths/   # Data organized by partition values
```

### **Key Files Explained**:
- **v1.metadata.json**: Contains table schema, partitioning, and properties
- **version-hint.text**: Points to the current metadata version
- **.crc files**: Checksum sidecar files written by Hadoop's local filesystem layer
- **data/**: Contains the Parquet data files; it does not exist until the first write, which is why the listings below show only `metadata/`
- **Snapshots**: Recorded inside the metadata files rather than in a separate `snapshots/` directory; they enable time travel and rollback

In [14]:
# Examine the file structure created by Iceberg
print("📁 Examining Iceberg file structure...")
print("\n🔍 Banking namespace structure:")
!ls -la {warehouse_dir}/banking/

print("\n🔍 source_transactions table structure:")
!ls -la {warehouse_dir}/banking/source_transactions/

print("\n🔍 source_transactions metadata:")
!ls -la {warehouse_dir}/banking/source_transactions/metadata/

print("\n🔍 reconciliation_results table structure:")
!ls -la {warehouse_dir}/banking/reconciliation_results/

print("\n🔍 reconciliation_batches table structure:")
!ls -la {warehouse_dir}/banking/reconciliation_batches/

total 20
drwxr-xr-x 5 root root 4096 Jul 13 08:19 .
drwxr-xr-x 3 root root 4096 Jul 13 07:44 ..
drwxr-xr-x 3 root root 4096 Jul 13 08:19 reconciliation_batches
drwxr-xr-x 3 root root 4096 Jul 13 08:19 reconciliation_results
drwxr-xr-x 3 root root 4096 Jul 13 07:48 source_transactions


total 12
drwxr-xr-x 3 root root 4096 Jul 13 07:48 .
drwxr-xr-x 5 root root 4096 Jul 13 08:19 ..
drwxr-xr-x 2 root root 4096 Jul 13 07:48 metadata


total 24
drwxr-xr-x 2 root root 4096 Jul 13 07:48 .
drwxr-xr-x 3 root root 4096 Jul 13 07:48 ..
-rw-r--r-- 1 root root   24 Jul 13 07:48 .v1.metadata.json.crc
-rw-r--r-- 1 root root   12 Jul 13 07:48 .version-hint.text.crc
-rw-r--r-- 1 root root 1547 Jul 13 07:48 v1.metadata.json
-rw-r--r-- 1 root root    1 Jul 13 07:48 version-hint.text


total 12
drwxr-xr-x 3 root root 4096 Jul 13 08:19 .
drwxr-xr-x 5 root root 4096 Jul 13 08:19 ..
drwxr-xr-x 2 root root 4096 Jul 13 08:19 metadata


total 12
drwxr-xr-x 3 root root 4096 Jul 13 08:19 .
drwxr-xr-x 5 root root 4096 Jul 13 08:19 ..
drwxr-xr-x 2 root root 4096 Jul 13 08:19 metadata


In [15]:
# Examine the metadata file to understand table schema and properties
print("📄 Examining table metadata...")
print("\n🔍 source_transactions metadata content:")
!cat {warehouse_dir}/banking/source_transactions/metadata/v1.metadata.json

print("\n🔍 version hint file:")
!cat {warehouse_dir}/banking/source_transactions/metadata/version-hint.text

{"format-version":2,"table-uuid":"878b837f-5cf0-4eeb-909a-1bbfedb65d92","location":"file:///opt/bitnami/spark/warehouse/banking/source_transactions","last-sequence-number":0,"last-updated-ms":1752392915539,"last-column-id":11,"current-schema-id":0,"schemas":[{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"transaction_id","required":false,"type":"string"},{"id":2,"name":"source_system","required":false,"type":"string"},{"id":3,"name":"transaction_date","required":false,"type":"timestamptz"},{"id":4,"name":"amount","required":false,"type":"decimal(18, 2)"},{"id":5,"name":"account_id","required":false,"type":"string"},{"id":6,"name":"transaction_type","required":false,"type":"string"},{"id":7,"name":"reference_id","required":false,"type":"string"},{"id":8,"name":"status","required":false,"type":"string"},{"id":9,"name":"payload","required":false,"type":"string"},{"id":10,"name":"created_at","required":false,"type":"timestamptz"},{"id":11,"name":"processing_timestamp","required":f

1
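
The raw JSON above is hard to read. As a rough sketch, the same information can be loaded by following `version-hint.text` to the current metadata file and summarizing it. The helper names are hypothetical, the `v<N>.metadata.json` naming is assumed from the listing above, and the demo writes a tiny stand-in metadata file to a temporary directory instead of touching the real warehouse.

```python
import json
import tempfile
from pathlib import Path


def load_current_metadata(table_dir) -> dict:
    """Follow version-hint.text to the current v<N>.metadata.json and parse it."""
    metadata_dir = Path(table_dir) / "metadata"
    version = (metadata_dir / "version-hint.text").read_text().strip()
    with open(metadata_dir / f"v{version}.metadata.json") as handle:
        return json.load(handle)


def summarize(metadata: dict) -> dict:
    """Pick out the format version, current schema columns, and snapshot count."""
    current_id = metadata["current-schema-id"]
    schema = next(s for s in metadata["schemas"] if s["schema-id"] == current_id)
    return {
        "format_version": metadata["format-version"],
        "columns": [(f["name"], f["type"]) for f in schema["fields"]],
        "snapshot_count": len(metadata.get("snapshots", [])),
    }


# Demo: a minimal stand-in table directory with two of the columns shown above.
with tempfile.TemporaryDirectory() as tmp:
    metadata_dir = Path(tmp) / "metadata"
    metadata_dir.mkdir()
    (metadata_dir / "version-hint.text").write_text("1")
    sample = {
        "format-version": 2,
        "current-schema-id": 0,
        "schemas": [{"type": "struct", "schema-id": 0, "fields": [
            {"id": 1, "name": "transaction_id", "required": False, "type": "string"},
            {"id": 2, "name": "source_system", "required": False, "type": "string"},
        ]}],
    }
    (metadata_dir / "v1.metadata.json").write_text(json.dumps(sample))
    summary = summarize(load_current_metadata(tmp))

print(summary)
```

Against the real table, the call would be `summarize(load_current_metadata(f"{warehouse_dir}/banking/source_transactions"))`. A freshly created table has no snapshots yet, so the snapshot count stays at zero until data is written.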

## Step 10: Phase 1 Summary and Next Steps

**Purpose**: Provide a comprehensive summary of what was accomplished and prepare for the next phase.

### **What We've Accomplished**:

#### **✅ Infrastructure Setup**
- Spark session with Iceberg extensions
- Local Hadoop catalog (file-based, no Hive Metastore)
- MinIO object storage buckets

#### **✅ Data Architecture**
- Banking namespace for organization
- Three core tables with proper schemas
- Partitioning strategy for performance

#### **✅ File Structure**
- Iceberg metadata and data directories
- Proper table organization
- Version control and checksums

### **Next Phase Preview**:
- **Data Generation**: Create sample transaction data
- **Data Ingestion**: Load data into Iceberg tables
- **Reconciliation Logic**: Implement matching algorithms
- **Testing**: Validate the complete system

In [16]:
# Generate setup summary
setup_summary = {
    "timestamp": datetime.now().isoformat(),
    "phase": "Phase 1 - Foundation Setup",
    "status": "completed",
    "components": {
        "spark_session": {
            "status": "✓ Created",
            "version": spark.version,
            "catalog": spark.conf.get('spark.sql.defaultCatalog'),
            "warehouse": spark.conf.get('spark.sql.catalog.local.warehouse')
        },
        "minio_connection": {
            "status": "✓ Established",
            "endpoint": "http://minio:9000",
            "buckets": len(bucket_purposes)
        },
        "namespace": {
            "status": "✓ Created",
            "name": "local.banking",
            "location": f"{warehouse_dir}/banking"
        },
        "tables": {
            "status": "✓ Created",
            "count": table_count,
            "names": ["source_transactions", "reconciliation_results", "reconciliation_batches"]
        }
    },
    "file_structure": {
        "warehouse_root": warehouse_dir,
        "banking_namespace": f"{warehouse_dir}/banking",
        "metadata_files": "v1.metadata.json, version-hint.text, .crc files",
        "data_directories": "data/ (created on first write)"
    },
    "next_steps": [
        "Phase 2: Generate sample transaction data",
        "Phase 3: Ingest data into Iceberg tables",
        "Phase 4: Implement reconciliation logic",
        "Phase 5: Run comprehensive tests"
    ]
}

print("🎉 Phase 1 Setup Complete!")
print("=" * 60)
print(json.dumps(setup_summary, indent=2))
print("=" * 60)
print("\n📚 Learning Summary:")
print("✅ Understanding of Iceberg's local Hadoop catalog and how it differs from Hive Metastore")
print("✅ Knowledge of Iceberg file structure and metadata organization")
print("✅ Experience with table partitioning and schema design")
print("✅ Familiarity with MinIO object storage integration")
print("\n🚀 Next: Run Phase 2 notebook for data generation and population.")