# Complete Banking Reconciliation Setup – Phase 1  
## Lakekeeper Catalog with MinIO S3 Storage and Postgres Metadata Management

## 🎯 Learning Objectives

This notebook demonstrates how to set up Apache Iceberg using **Lakekeeper** as the catalog (with Postgres as the metadata backend) and **MinIO** as S3-compatible storage for a banking reconciliation system.

### **Apache Iceberg Fundamentals**
- **Catalog**: Learn how Iceberg works with Lakekeeper (Postgres) and MinIO
- **MinIO Storage**: Understand how Iceberg stores data and metadata in MinIO buckets
- **Table Structure**: Explore the file and folder organization created by Iceberg tables in MinIO
- **Partitioning**: See how Iceberg manages partitioned tables

### **What You Will Build**
- A complete banking reconciliation system with three core tables
- S3-based storage using MinIO
- Partitioned tables for efficient querying
- Full audit trail and validation

### **Key Concepts**

#### **1. Iceberg Catalog with Lakekeeper and MinIO**
In this setup, the Iceberg catalog is of type REST, managed by Lakekeeper, and data is stored in MinIO:
```python
spark.sql.catalog.lakekeeper.type = "rest"
spark.sql.catalog.lakekeeper.uri = "http://lakekeeper:8181/catalog"
spark.sql.catalog.lakekeeper.warehouse = "irisa-ot"
```
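- **Catalog**: Lakekeeper, an Iceberg REST catalog that tracks namespaces, tables, and each table's current metadata location
- **MinIO**: S3-compatible object storage for all table data and metadata
- **Warehouse**: The warehouse name registered in Lakekeeper (`irisa-ot`), which maps to the MinIO location where data and metadata are stored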
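
As a rough sketch, the properties above (plus the catalog implementation class that Spark needs) can be generated from the catalog name. The helper name `rest_catalog_conf` is hypothetical; the property keys and values are the ones used in this notebook.

```python
def rest_catalog_conf(name: str, uri: str, warehouse: str) -> dict[str, str]:
    """Build Spark properties for an Iceberg REST catalog (illustrative helper)."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",  # catalog implementation class
        f"{prefix}.type": "rest",                          # use the Iceberg REST protocol
        f"{prefix}.uri": uri,                              # Lakekeeper catalog endpoint
        f"{prefix}.warehouse": warehouse,                  # warehouse name in Lakekeeper
    }


conf = rest_catalog_conf("lakekeeper", "http://lakekeeper:8181/catalog", "irisa-ot")
for key, value in conf.items():
    print(f"{key} = {value}")
```

Keeping the catalog name in one place makes it harder for the property keys and the SQL identifiers (`lakekeeper.banking...`) to drift apart.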


#### **2. Partitioning Strategy**
- **source_transactions**: Partitioned by `days(transaction_date)` and `source_system`
- **reconciliation_results**: Partitioned by `days(reconciliation_timestamp)` and `match_status`
- **reconciliation_batches**: No partitioning (small lookup table)
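
To make the `days(...)` transform concrete, here is a minimal pure-Python sketch of how a partition value could be derived for `source_transactions`. It assumes timestamps are naive UTC values, and the source system name `core_banking` is a made-up example. Iceberg computes this internally; the code only illustrates the idea that the day transform yields the number of days since 1970-01-01.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def days_transform(ts: datetime) -> int:
    """Days since 1970-01-01, treating the naive timestamp as UTC."""
    return (ts.date() - EPOCH).days


def partition_key(transaction_date: datetime, source_system: str) -> tuple[int, str]:
    """Partition tuple for (days(transaction_date), source_system)."""
    return (days_transform(transaction_date), source_system)


a = partition_key(datetime(2025, 7, 22, 9, 0), "core_banking")
b = partition_key(datetime(2025, 7, 22, 23, 59), "core_banking")
c = partition_key(datetime(2025, 7, 23, 0, 0), "core_banking")

print(a == b)  # same day and same source system: same partition
print(a == c)  # next day: different partition
```

Rows that share a day and a source system land in the same partition, which is what lets a query filtered on a date range and a source skip unrelated files.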

## Phase 1: Foundation Setup

## Step 1: Import Required Libraries

**Purpose:**  
Import the libraries used throughout the notebook and define the catalog and version constants. (`boto3`, the S3/MinIO client, is imported later in Step 4.)

**Key Libraries:**
- `pyspark` (`SparkSession`, `SparkConf`): Core Spark functionality and session configuration
- `rich`: Formatted console output
- `pandas`: DataFrame utilities for inspecting results


In [1]:
# Import required libraries
!pip install --root-user-action=ignore rich --quiet
from rich import print
import pyspark
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pandas as pd

# This CATALOG_URL works for the "docker compose" testing and development environment.
# Replace 'lakekeeper' with the right host if you are not running on "docker compose" (e.g. 'localhost' if Lakekeeper runs locally).
CATALOG_URL = "http://lakekeeper:8181/catalog"
WAREHOUSE = "irisa-ot"  # warehouse name as registered in Lakekeeper (see http://localhost:8181/ui/warehouse)

SPARK_VERSION = pyspark.__version__
SPARK_MINOR_VERSION = '.'.join(SPARK_VERSION.split('.')[:2])
ICEBERG_VERSION = "1.9.2"

print(f"Spark Version: {SPARK_VERSION} - Spark Minor Version: {SPARK_MINOR_VERSION} - Iceberg Version: {ICEBERG_VERSION}")

## Step 2: Stop any existing Spark session

**Purpose**: Ensure a clean Spark environment by stopping any existing sessions.

**Why This Matters**:
- Prevents configuration conflicts
- Ensures fresh Iceberg catalog initialization
- Clears any cached metadata or connections
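
One caveat: `SparkSession.builder.getOrCreate()` starts a new session when none exists, only to stop it again. A possible alternative, sketched below, asks for the active session first. The helper name `stop_active_session` is hypothetical; `SparkSession.getActiveSession` is available in PySpark 3.0 and later. The getter is passed in as an argument so the sketch runs without Spark installed.

```python
def stop_active_session(get_active) -> bool:
    """Stop the active session if there is one; return True if a session was stopped."""
    session = get_active()
    if session is None:
        return False
    session.stop()
    return True


# In this notebook the call would be:
#   from pyspark.sql import SparkSession
#   stop_active_session(SparkSession.getActiveSession)

# Stand-in getter for demonstration: behaves like "no active session".
print(stop_active_session(lambda: None))
```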

In [2]:
# Stop any existing Spark session
try:
    SparkSession.builder.getOrCreate().stop()
    print("✓ Stopped existing Spark session")
except Exception:
    print("ℹ No existing Spark session to stop")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/22 09:44:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/07/22 09:44:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/07/22 09:44:03 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


## Step 3: Create Spark Session with Iceberg + MinIO (S3) Configuration

**Purpose:**  
Initialize Spark with the Apache Iceberg extensions, the Lakekeeper REST catalog, and S3 file I/O for MinIO.

### **Configuration Breakdown:**

#### **Iceberg Extensions**
```python
spark.sql.extensions = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
```
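- Enables Iceberg-specific SQL such as `CALL` stored procedures and `ALTER TABLE ... ADD PARTITION FIELD` (plain `CREATE TABLE ... USING iceberg` does not require it)
- Adds the DDL used for partition evolution and the stored procedures used for table maintenance, for example snapshot rollback and expiry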

#### **Catalog Configuration**
```python
spark.sql.catalog.lakekeeper = "org.apache.iceberg.spark.SparkCatalog"
spark.sql.catalog.lakekeeper.type = "rest"
spark.sql.catalog.lakekeeper.uri = "http://lakekeeper:8181/catalog"
spark.sql.catalog.lakekeeper.warehouse = "irisa-ot"
```
- **lakekeeper**: The catalog name under which the Lakekeeper REST catalog is registered in Spark
- **warehouse**: The warehouse name registered in Lakekeeper, which maps to the MinIO location for table data and metadata

#### **Default Catalog**
- You will use `lakekeeper` as the catalog name in all SQL operations.
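
Besides these properties, the session needs the Iceberg runtime and AWS bundle jars, and the runtime artifact name depends on the Spark minor version. The sketch below shows one way to derive the `spark.jars.packages` value. The helper name `iceberg_packages` is hypothetical, Scala 2.12 is assumed (as in this notebook), and the Spark version `3.5.1` is only an example input.

```python
def iceberg_packages(spark_version: str, iceberg_version: str, scala_version: str = "2.12") -> str:
    """Return Maven coordinates for the Iceberg Spark runtime and AWS bundle."""
    # The runtime artifact is published per Spark minor version, e.g. "3.5".
    spark_minor = ".".join(spark_version.split(".")[:2])
    artifacts = [
        f"org.apache.iceberg:iceberg-spark-runtime-{spark_minor}_{scala_version}:{iceberg_version}",
        f"org.apache.iceberg:iceberg-aws-bundle:{iceberg_version}",
    ]
    return ",".join(artifacts)


# Example input; in the notebook the first argument would be pyspark.__version__.
packages = iceberg_packages("3.5.1", "1.9.2")
print(packages)
```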



In [3]:
# Spark configuration for the Lakekeeper REST catalog, Iceberg extensions, and S3 file I/O
config = {
    "spark.sql.catalog.lakekeeper": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lakekeeper.type": "rest",
    "spark.sql.catalog.lakekeeper.uri": CATALOG_URL,
    "spark.sql.catalog.lakekeeper.warehouse": WAREHOUSE,
    "spark.sql.catalog.lakekeeper.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": "lakekeeper",
    "spark.executor.memory": "1024m",
    "spark.executor.cores": "1",
    "spark.jars.packages": f"org.apache.iceberg:iceberg-spark-runtime-{SPARK_MINOR_VERSION}_2.12:{ICEBERG_VERSION},org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}",
}

spark_config = SparkConf().setMaster('spark://spark-master:7077').setAppName("Iceberg-REST-Cluster-Banking-Sample-Phase1")
for k, v in config.items():
    spark_config = spark_config.set(k, v)

spark = SparkSession.builder.config(conf=spark_config).getOrCreate()

spark.sql("USE lakekeeper")
print("✓ Spark session created successfully")
print(f"Spark version: {spark.version}")
print(f"Default catalog: {spark.conf.get('spark.sql.defaultCatalog')}")


25/07/22 09:44:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/07/22 09:44:13 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


## Step 4: Initialize MinIO Client and Check Health

**Purpose:**  
Set up MinIO (S3-compatible object storage) for future data operations.

### **MinIO Configuration:**
- **Endpoint:** `http://minio:9000` (Docker service name)
- **Credentials:** `minio-root-user`/`minio-root-password` (default development credentials)
- **Region:** `us-east-1` (standard region)
- **Signature Version:** `s3v4` (AWS Signature Version 4)
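
Hard-coded credentials are fine for a local development stack, but the settings can also be read from environment variables. The following is a sketch under stated assumptions: the variable names (`MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_REGION`) and the helper `minio_settings` are hypothetical, and the defaults mirror the values used in the client cell of this notebook.

```python
import os


def minio_settings(env=os.environ) -> dict:
    """Collect boto3 client keyword arguments, falling back to the dev defaults."""
    return {
        "endpoint_url": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "aws_access_key_id": env.get("MINIO_ACCESS_KEY", "minio-root-user"),
        "aws_secret_access_key": env.get("MINIO_SECRET_KEY", "minio-root-password"),
        "region_name": env.get("MINIO_REGION", "us-east-1"),
    }


settings = minio_settings()
print(settings["endpoint_url"])
```

The resulting dictionary could then be unpacked into the client call, for example `boto3.client("s3", config=Config(signature_version="s3v4"), **minio_settings())`.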

### **Health Check Strategy:**
- Retry logic with a fixed 2-second delay between attempts (up to 30 attempts)
- Graceful handling of connection failures
- Detailed logging for troubleshooting
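
For reference, a retry helper with exponential backoff might look like the sketch below. The function name `wait_until_ready` and its parameters are illustrative and not part of any library. The `sleep` argument is injectable so that the demonstration runs instantly; the flaky check simulates a service that becomes ready on the third attempt.

```python
import time


def wait_until_ready(check, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call check() until it stops raising; return True on success, False otherwise."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            check()
            return True
        except Exception as exc:
            print(f"attempt {attempt}/{max_attempts} failed: {exc}")
            if attempt == max_attempts:
                return False
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay
    return False


calls = {"n": 0}


def flaky_check():
    """Simulated health check that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("not ready")


delays = []
ready = wait_until_ready(flaky_check, sleep=delays.append)
print(ready, delays)
```

With a real client, the check could simply be `s3_client.list_buckets`, as in `wait_until_ready(s3_client.list_buckets)`.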



In [4]:
!pip3 install boto3

ERROR: Could not find a version that satisfies the requirement boto3 (from versions: none)
ERROR: No matching distribution found for boto3

In [5]:

import boto3
from botocore.client import Config

# Initialize MinIO client
s3_client = boto3.client(
    's3',
    endpoint_url='http://minio:9000',
    aws_access_key_id='minio-root-user',
    aws_secret_access_key='minio-root-password',
    config=Config(signature_version='s3v4'),
    region_name='us-east-1'
)

print("✓ MinIO client initialized")
print("Endpoint: http://minio:9000")
print("Signature Version: s3v4")

ModuleNotFoundError: No module named 'boto3'

In [7]:
# Check MinIO health with retries
import time

print("Checking MinIO health...")
max_attempts = 30
attempt = 1

while attempt <= max_attempts:
    try:
        # Try to list buckets to check connectivity
        buckets = s3_client.list_buckets()
        print(f"✓ MinIO is ready! Found {len(buckets['Buckets'])} existing buckets")
        break
    except Exception as e:
        print(f"Waiting for MinIO... (attempt {attempt}/{max_attempts}): {str(e)}")
        if attempt >= max_attempts:
            print("⚠ MinIO failed to start within the expected time. Continuing anyway...")
            break
        time.sleep(2)
        attempt += 1

## Step 5: Create Banking Namespace

**Purpose**: Create a logical namespace to organize related tables in MinIO.

### **Namespace Benefits**:
- **Organization**: Groups related tables together
- **Access Control**: Apply permissions at namespace level
- **Naming**: Avoid table name conflicts
- **Discovery**: Easier to find related tables

### **S3 Impact**:
The namespace itself is catalog metadata. Tables created in it are stored under the warehouse location in your MinIO bucket.



In [7]:
# Create the banking namespace
try:
    spark.sql("CREATE NAMESPACE IF NOT EXISTS lakekeeper.banking")
    print("✓ Created namespace: lakekeeper.banking")
    print(f"Namespace location: lakekeeper/banking")
except Exception as e:
    print(f"✗ Error creating namespace: {str(e)}")

## Step 5b: Verify Namespace Creation 

**Purpose:**  
Confirm that the namespace has been created in the Lakekeeper catalog.

> Note: Click on `irisa-ot` at `http://localhost:8181/ui/warehouse` to view the namespace.

**Also, check the Lakekeeper Postgres database:** Look at the `namespace` table.
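
The same check can be done from the notebook with `SHOW NAMESPACES`. The helper below is a minimal sketch: the name `namespace_exists` is hypothetical, and it assumes that `spark` is the session created in Step 3 and that the result rows expose a `namespace` column.

```python
def namespace_exists(spark, catalog: str, namespace: str) -> bool:
    """Return True if `namespace` is listed by SHOW NAMESPACES in the given catalog."""
    rows = spark.sql(f"SHOW NAMESPACES IN {catalog}").collect()
    return any(row["namespace"] == namespace for row in rows)


# Usage inside the notebook (requires the Spark session from Step 3):
#   namespace_exists(spark, "lakekeeper", "banking")
```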



## Step 6: Create Iceberg Tables 

**Purpose:**  
Create the core tables for the banking reconciliation system in Lakekeeper and MinIO.

### **Table Design Strategy:**

#### **1. source_transactions**
- **Purpose:** Store all transaction data from different source systems
- **Partitioning:** By `days(transaction_date)` and `source_system`
- **Benefits:** Efficient querying by date range and source
- **Schema:** Comprehensive transaction metadata

#### **2. reconciliation_results**
- **Purpose:** Store reconciliation outcomes and discrepancies
- **Partitioning:** By `days(reconciliation_timestamp)` and `match_status`
- **Benefits:** Easy analysis of reconciliation performance
- **Schema:** Detailed discrepancy tracking

#### **3. reconciliation_batches**
- **Purpose:** Track reconciliation batch metadata
- **Partitioning:** None (small lookup table)
- **Benefits:** Batch-level monitoring and reporting
- **Schema:** Batch execution metadata

> All tables will be created under the `lakekeeper.banking` namespace and stored in your MinIO S3 bucket.



In [12]:
# Create source_transactions table
print("Creating source_transactions table...")
try:
    spark.sql("""
    CREATE TABLE IF NOT EXISTS lakekeeper.banking.source_transactions (
      transaction_id STRING,
      source_system STRING,
      transaction_date TIMESTAMP,
      amount DECIMAL(18,2),
      account_id STRING,
      transaction_type STRING,
      reference_id STRING,
      status STRING,
      payload STRING,
      created_at TIMESTAMP,
      processing_timestamp TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(transaction_date), source_system)
    """)
    print("✓ Created table: lakekeeper.banking.source_transactions")
    print("  - Partitioned by: days(transaction_date), source_system")
    print("  - Purpose: Store all transaction data from different sources")
except Exception as e:
    print(f"✗ Error creating source_transactions table: {str(e)}")

In [11]:
# Create reconciliation_results table
print("Creating reconciliation_results table...")
try:
    spark.sql("""
    CREATE TABLE IF NOT EXISTS lakekeeper.banking.reconciliation_results (
      reconciliation_id STRING,
      batch_id STRING,
      primary_transaction_id STRING,
      secondary_transaction_id STRING,
      match_status STRING,
      discrepancy_type STRING,
      discrepancy_amount DECIMAL(18,2),
      reconciliation_timestamp TIMESTAMP,
      notes STRING
    )
    USING iceberg
    PARTITIONED BY (days(reconciliation_timestamp), match_status)
    """)
    print("✓ Created table: lakekeeper.banking.reconciliation_results")
    print("  - Partitioned by: days(reconciliation_timestamp), match_status")
    print("  - Purpose: Store reconciliation outcomes and discrepancies")
except Exception as e:
    print(f"✗ Error creating reconciliation_results table: {str(e)}")

In [10]:
# Create reconciliation_batches table
print("Creating reconciliation_batches table...")
try:
    spark.sql("""
    CREATE TABLE IF NOT EXISTS lakekeeper.banking.reconciliation_batches (
      batch_id STRING,
      reconciliation_date TIMESTAMP,
      source_systems ARRAY<STRING>,
      start_date TIMESTAMP,
      end_date TIMESTAMP,
      status STRING,
      total_transactions BIGINT,
      matched_count BIGINT,
      unmatched_count BIGINT,
      created_at TIMESTAMP,
      completed_at TIMESTAMP
    )
    USING iceberg
    """)
    print("✓ Created table: lakekeeper.banking.reconciliation_batches")
    print("  - Partitioned by: None (small lookup table)")
    print("  - Purpose: Track reconciliation batch metadata")
except Exception as e:
    print(f"✗ Error creating reconciliation_batches table: {str(e)}")

## Step 7: Verify Tables and Audit Setup

**Purpose**: Validate that all tables were created correctly in MinIO and examine the Iceberg file structure.

### **What We'll Verify**:
1. **Table Existence**: Confirm all tables are created in the `lakekeeper.banking` namespace
2. **File Structure**: Examine Iceberg metadata and data directories in S3
3. **Schema Validation**: Check table schemas are correct
4. **Accessibility**: Test basic queries on each table
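
A table count alone cannot tell which table is missing. As a sketch, the existence check could compare table names against the expected set. The helper name `missing_tables` is hypothetical, and the rows are assumed to expose a `tableName` column, as in the output of `SHOW TABLES`. The sample rows below are stand-ins so the example runs without Spark.

```python
EXPECTED_TABLES = {
    "source_transactions",
    "reconciliation_results",
    "reconciliation_batches",
}


def missing_tables(rows, expected=EXPECTED_TABLES):
    """Return the sorted names of expected tables that are absent from `rows`."""
    found = {row["tableName"] for row in rows}
    return sorted(set(expected) - found)


# Stand-in for spark.sql("SHOW TABLES IN lakekeeper.banking").collect()
sample_rows = [
    {"tableName": "source_transactions"},
    {"tableName": "reconciliation_results"},
]
print(missing_tables(sample_rows))
```

In the notebook, an empty list from `missing_tables(spark.sql("SHOW TABLES IN lakekeeper.banking").collect())` would indicate that all three tables exist.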

  

### Inspect the Tables in the Lakekeeper UI
 ![Tables in Lakekeeper](images/tables-in-lakekeeper.png)
 
 ![Table Properties in Lakekeeper](images/tables-in-lakekeeper-properties.png)

In [10]:
# List tables to verify
print("📋 Verifying created tables...")
tables_df = spark.sql("SHOW TABLES IN lakekeeper.banking")
tables_df.show()

# Count tables
table_count = tables_df.count()
print(f"\n📊 Table Summary:")
print(f"- Total tables in banking namespace: {table_count}")
print(f"- Expected tables: 3")
print(f"- Status: {'✓ PASS' if table_count >= 3 else '✗ FAIL'}")

+---------+--------------------+-----------+
|namespace|           tableName|isTemporary|
+---------+--------------------+-----------+
|  banking| source_transactions|      false|
|  banking|reconciliation_re...|      false|
|  banking|reconciliation_ba...|      false|
+---------+--------------------+-----------+



                                                                                

## Conclusion

In this notebook, you have successfully set up a foundational lakehouse architecture for banking reconciliation using Apache Iceberg, Lakekeeper as the catalog (with Postgres for metadata management), and MinIO as S3-compatible storage. You learned how to:

- Configure Spark to work with Iceberg, Lakekeeper, and MinIO
- Initialize and verify the health of your MinIO storage
- Create a logical namespace for organizing your banking tables
- Design and create partitioned Iceberg tables for efficient data management and querying
- Validate the setup by checking table existence and file structures in both Lakekeeper and MinIO

This setup provides a robust, scalable, and auditable foundation for advanced analytics and reconciliation workflows in a modern data lakehouse environment.

You can now proceed to ingest data, perform reconciliation logic, and explore advanced Iceberg features such as time travel, schema evolution, and more.

---

In [13]:
spark.stop()