# Complete Banking Reconciliation Setup – Phase 2

## 🎯 Learning Objectives

This notebook demonstrates how to generate and populate data into Apache Iceberg tables using a **Lakekeeper catalog**. You will learn:

### **Data Generation Fundamentals**
- **Realistic Data Creation:** Generate banking transactions that mimic real-world scenarios
- **Multi-System Simulation:** Create data from different source systems (core banking, card processor, payment gateway)
- **Discrepancy Introduction:** Intentionally create mismatches to test reconciliation logic
- **Data Quality:** Ensure generated data meets business requirements

### **Data Ingestion with Iceberg**
- **CSV to Iceberg:** Convert CSV files to Iceberg table format
- **Partitioning Strategy:** Understand how data is organized by date and source system
- **Metadata Management:** Learn how Iceberg tracks data lineage
- **Incremental Loading:** Add new data without affecting existing records

### **Audit and Validation**
- **Data Quality Checks:** Verify data integrity and completeness
- **Performance Analysis:** Monitor query performance with real data
- **File Structure Analysis:** Examine how Iceberg organizes data files
- **Reconciliation Readiness:** Ensure data is ready for reconciliation processes

## Phase 2: Data Generation and Population



## Step 1: Import Required Libraries and Setup

**Purpose**: Import all necessary libraries for data generation, manipulation, and Iceberg operations.

### **Key Libraries**:
- `pyspark.sql`: Spark DataFrame operations and Iceberg integration
- `pandas`: Data manipulation and CSV handling
- `faker`: Generate realistic fake data
- `uuid`, `random`, `datetime`: Data generation utilities
- `json`: Handle complex payload structures

In [5]:
!pip install --root-user-action=ignore rich faker --quiet
from rich import print
import pyspark
# Import required libraries for data generation and manipulation
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql.functions import col, expr, when, lit, current_timestamp
import pandas as pd
import json
import uuid
import random
from datetime import datetime, timedelta
import os
from faker import Faker

# Initialize Faker for realistic data generation
fake = Faker()

print("✓ All required libraries imported successfully")
print(f"Faker locale: {fake.locale}")
print(f"Current timestamp: {datetime.now()}")


Here is the **rewritten markdown for Cell 3**:

---

## Step 2: Create Spark Session with Iceberg Configuration

**Purpose:**  
Initialize Spark with the same Iceberg configuration from Phase 1.

### **Configuration Consistency:**
- Uses the same Lakekeeper catalog configuration
- Maintains warehouse directory structure
- Ensures compatibility with existing tables

### **Why This Matters:**
- **Session Continuity:** Same configuration as Phase 1
- **Table Access:** Can access previously created tables
- **Data Consistency:** Ensures data is written to the correct location



In [8]:
# This CATALOG_URL works for the "docker compose" testing and development environment
# Change 'lakekeeper' if you are not running on "docker compose" (f. ex. 'localhost' if Lakekeeper is running locally).
CATALOG_URL = "http://lakekeeper:8181/catalog"
WAREHOUSE = "irisa-ot" # as is in lakekeeper : http://localhost:8181/ui/warehouse

SPARK_VERSION = pyspark.__version__
SPARK_MINOR_VERSION = '.'.join(SPARK_VERSION.split('.')[:2])
ICEBERG_VERSION = "1.9.2"

print(f"Spark Version: {SPARK_VERSION} - Spark Minor Version: {SPARK_MINOR_VERSION} - Iceberg Version: {ICEBERG_VERSION}")
# Stop any existing Spark session
try:
    SparkSession.builder.getOrCreate().stop()
    print("✓ Stopped existing Spark session")
except:
    print("ℹ No existing Spark session to stop")

# Create warehouse directory if it doesn't exist
config = {
    f"spark.sql.catalog.lakekeeper": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.lakekeeper.type": "rest",
    f"spark.sql.catalog.lakekeeper.uri": CATALOG_URL,
    f"spark.sql.catalog.lakekeeper.warehouse": WAREHOUSE,
    f"spark.sql.catalog.lakekeeper.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": "lakekeeper",
    "spark.executor.memory": "1024m",
    "spark.executor.cores": "1",
    "spark.jars.packages": f"org.apache.iceberg:iceberg-spark-runtime-{SPARK_MINOR_VERSION}_2.12:{ICEBERG_VERSION},org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION}",
}

spark_config = SparkConf().setMaster('spark://spark-master:7077').setAppName("Iceberg-REST-Cluster-Banking-Sample-Phase2")
for k, v in config.items():
    spark_config = spark_config.set(k, v)

spark = SparkSession.builder.config(conf=spark_config).getOrCreate()

spark.sql("USE lakekeeper")
print("✓ Spark session created successfully")
print(f"Spark version: {spark.version}")
print(f"Default catalog: {spark.conf.get('spark.sql.defaultCatalog')}")    

## Step 3: Verify Existing Tables and Data

**Purpose**: Confirm that tables from Phase 1 exist and are accessible.

### **What We'll Check**:
1. **Table Existence**: Verify all three tables are present
2. **Schema Validation**: Confirm table schemas are correct
3. **Current Data State**: Check if any data already exists
4. **Partitioning**: Verify partitioning strategy is in place

In [9]:
# Verify existing tables from Phase 1
print(" Verifying existing tables from Phase 1...")

# List tables in banking namespace
tables_df = spark.sql("SHOW TABLES IN lakekeeper.banking")
print("\n Available tables:")
tables_df.show()

# Check current data counts
print("\n📊 Current data counts:")
tables_to_check = [
    'lakekeeper.banking.source_transactions',
    'lakekeeper.banking.reconciliation_results',
    'lakekeeper.banking.reconciliation_batches'
]

for table in tables_to_check:
    try:
        count = spark.sql(f"SELECT COUNT(*) as count FROM {table}").collect()[0]['count']
        print(f"✓ {table}: {count} rows")
    except Exception as e:
        print(f"✗ {table}: Error - {str(e)}")

# Verify table schemas
print("\n Table schemas:")
for table in tables_to_check:
    try:
        schema = spark.sql(f"DESCRIBE {table}")
        print(f"\n📋 {table} schema:")
        schema.show()
    except Exception as e:
        print(f"✗ Error getting schema for {table}: {str(e)}")

+---------+--------------------+-----------+
|namespace|           tableName|isTemporary|
+---------+--------------------+-----------+
|  banking| source_transactions|      false|
|  banking|reconciliation_re...|      false|
|  banking|reconciliation_ba...|      false|
+---------+--------------------+-----------+



                                                                                

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|      transaction_id|              string|   NULL|
|       source_system|              string|   NULL|
|    transaction_date|           timestamp|   NULL|
|              amount|       decimal(18,2)|   NULL|
|          account_id|              string|   NULL|
|    transaction_type|              string|   NULL|
|        reference_id|              string|   NULL|
|              status|              string|   NULL|
|             payload|              string|   NULL|
|          created_at|           timestamp|   NULL|
|processing_timestamp|           timestamp|   NULL|
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|days(transaction_...|       |
|              Part 1|       source_system|       |
+--------------------+--------------------+-------+



+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|   reconciliation_id|              string|   NULL|
|            batch_id|              string|   NULL|
|primary_transacti...|              string|   NULL|
|secondary_transac...|              string|   NULL|
|        match_status|              string|   NULL|
|    discrepancy_type|              string|   NULL|
|  discrepancy_amount|       decimal(18,2)|   NULL|
|reconciliation_ti...|           timestamp|   NULL|
|               notes|              string|   NULL|
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|days(reconciliati...|       |
|              Part 1|        match_status|       |
+--------------------+--------------------+-------+



+-------------------+-------------+-------+
|           col_name|    data_type|comment|
+-------------------+-------------+-------+
|           batch_id|       string|   NULL|
|reconciliation_date|    timestamp|   NULL|
|     source_systems|array<string>|   NULL|
|         start_date|    timestamp|   NULL|
|           end_date|    timestamp|   NULL|
|             status|       string|   NULL|
| total_transactions|       bigint|   NULL|
|      matched_count|       bigint|   NULL|
|    unmatched_count|       bigint|   NULL|
|         created_at|    timestamp|   NULL|
|       completed_at|    timestamp|   NULL|
+-------------------+-------------+-------+



## Step 4: Define Data Generation Configuration

**Purpose**: Set up the configuration for generating realistic banking transaction data.

### **Data Generation Strategy**:

#### **Source Systems**
- **core_banking**: Primary system with complete transaction records
- **card_processor**: Secondary system with some discrepancies
- **payment_gateway**: Tertiary system with additional discrepancies

#### **Transaction Types**
- **deposit**: Money added to account
- **withdrawal**: Money removed from account
- **transfer**: Money moved between accounts
- **payment**: Payment to merchant/service
- **refund**: Money returned to account
- **fee**: Service charges and fees

#### **Discrepancy Types**
- **amount**: Slight differences in transaction amounts
- **date**: Timing differences in transaction dates
- **status**: Different transaction statuses
- **missing**: Transactions that exist in one system but not another

In [10]:
# Define data generation configuration
print("⚙️ Setting up data generation configuration...")

# Source systems for multi-system reconciliation
SOURCE_SYSTEMS = ['core_banking', 'card_processor', 'payment_gateway']
print(f"Source systems: {SOURCE_SYSTEMS}")

# Transaction types for realistic banking scenarios
TRANSACTION_TYPES = ['deposit', 'withdrawal', 'transfer', 'payment', 'refund', 'fee']
print(f"Transaction types: {TRANSACTION_TYPES}")

# Transaction statuses
STATUS_VALUES = ['completed', 'pending', 'failed', 'reversed']
print(f"Status values: {STATUS_VALUES}")

# Data generation parameters
DATA_CONFIG = {
    'num_accounts': 100,
    'primary_transactions': 5000,
    'date_range_days': 30,
    'error_rate': 0.05,  # 5% discrepancy rate
    'extra_transactions_rate': 0.02  # 2% extra transactions in secondary systems
}

print(f"\n📊 Data generation parameters:")
for key, value in DATA_CONFIG.items():
    print(f"  - {key}: {value}")

# Define date range for transactions
end_date = datetime.now()
start_date = end_date - timedelta(days=DATA_CONFIG['date_range_days'])
print(f"\n Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")

## Step 5: Create Data Generation Functions

**Purpose**: Define functions to generate realistic banking transaction data.

### **Data Generation Functions**:

#### **1. Account ID Generation**
- Creates unique account identifiers
- Uses consistent format for all systems
- Ensures referential integrity

#### **2. Transaction ID Generation**
- System-specific prefixes (CB, CP, PG)
- Unique identifiers for each transaction
- Enables cross-system matching

#### **3. Transaction Data Generation**
- Realistic amounts and dates
- Proper status distribution
- Rich payload metadata

#### **4. Discrepancy Introduction**
- Intentional mismatches for testing
- Various discrepancy types
- Controlled error rates

In [11]:
# Data generation functions
print("🔧 Creating data generation functions...")

def generate_account_ids(num_accounts=100):
    """Generate a list of random account IDs."""
    return [f"ACC{fake.unique.random_number(digits=8)}" for _ in range(num_accounts)]

def generate_transaction_id(source_system):
    """Generate a transaction ID with a prefix based on the source system."""
    prefixes = {
        'core_banking': 'CB',
        'card_processor': 'CP',
        'payment_gateway': 'PG'
    }
    prefix = prefixes.get(source_system, 'TX')
    return f"{prefix}-{uuid.uuid4().hex[:12].upper()}"

def generate_reference_id():
    """Generate a reference ID for transactions."""
    return f"REF-{fake.unique.random_number(digits=10)}"

def generate_transaction_data(source_system, account_ids, start_date, end_date, count=1000):
    """Generate transaction data for a specific source system."""
    transactions = []
    
    for _ in range(count):
        # Generate random transaction date within the date range
        transaction_date = fake.date_time_between_dates(
            datetime_start=start_date,
            datetime_end=end_date
        )
        
        # Generate random amount (between $1 and $10,000)
        amount = round(random.uniform(1, 10000), 2)
        
        # Select random account ID
        account_id = random.choice(account_ids)
        
        # Generate transaction ID
        transaction_id = generate_transaction_id(source_system)
        
        # Select random transaction type
        transaction_type = random.choice(TRANSACTION_TYPES)
        
        # Generate reference ID
        reference_id = generate_reference_id()
        
        # Select random status
        status = random.choice(STATUS_VALUES)
        
        # Generate additional payload data
        payload = {
            'description': fake.sentence(),
            'location': fake.city(),
            'merchant': fake.company() if transaction_type in ['payment', 'refund'] else None,
            'category': fake.word(),
            'metadata': {
                'device': fake.user_agent(),
                'ip_address': fake.ipv4(),
                'channel': random.choice(['web', 'mobile', 'atm', 'branch', 'phone'])
            }
        }
        
        # Create transaction record
        transaction = {
            'transaction_id': transaction_id,
            'source_system': source_system,
            'transaction_date': transaction_date,
            'amount': amount,
            'account_id': account_id,
            'transaction_type': transaction_type,
            'reference_id': reference_id,
            'status': status,
            'payload': json.dumps(payload),
            'created_at': transaction_date - timedelta(minutes=random.randint(1, 60)),
            'processing_timestamp': transaction_date + timedelta(seconds=random.randint(1, 30))
        }
        
        transactions.append(transaction)
    
    return transactions

def create_matching_transactions(primary_transactions, secondary_system, error_rate=0.05):
    """Create matching transactions for a secondary system with intentional discrepancies."""
    secondary_transactions = []
    
    for primary_tx in primary_transactions:
        # Create a copy of the primary transaction
        secondary_tx = primary_tx.copy()
        
        # Change the transaction ID and source system
        secondary_tx['transaction_id'] = generate_transaction_id(secondary_system)
        secondary_tx['source_system'] = secondary_system
        
        # Introduce discrepancies based on error rate
        if random.random() < error_rate:
            # Choose a type of discrepancy
            discrepancy_type = random.choice(['amount', 'date', 'status', 'missing'])
            
            if discrepancy_type == 'amount':
                # Change the amount slightly
                original_amount = secondary_tx['amount']
                secondary_tx['amount'] = round(original_amount * random.uniform(0.95, 1.05), 2)
                
            elif discrepancy_type == 'date':
                # Shift the date slightly
                original_date = secondary_tx['transaction_date']
                secondary_tx['transaction_date'] = original_date + timedelta(
                    minutes=random.randint(-120, 120)
                )
                
            elif discrepancy_type == 'status':
                # Change the status
                original_status = secondary_tx['status']
                new_status = random.choice([s for s in STATUS_VALUES if s != original_status])
                secondary_tx['status'] = new_status
                
            elif discrepancy_type == 'missing':
                # Skip this transaction (don't add to secondary)
                continue
        
        secondary_transactions.append(secondary_tx)
    
    # Add some transactions that only exist in the secondary system
    extra_count = int(len(primary_transactions) * DATA_CONFIG['extra_transactions_rate'])
    account_ids = [tx['account_id'] for tx in primary_transactions]
    start_date = min(tx['transaction_date'] for tx in primary_transactions)
    end_date = max(tx['transaction_date'] for tx in primary_transactions)
    
    extra_transactions = generate_transaction_data(
        secondary_system,
        account_ids,
        start_date,
        end_date,
        count=extra_count
    )
    
    secondary_transactions.extend(extra_transactions)
    
    return secondary_transactions

print("✓ Data generation functions created successfully")

## Step 6: Generate Sample Transaction Data

**Purpose**: Create realistic banking transaction data for all source systems.

### **Data Generation Process**:

#### **1. Primary System (Core Banking)**
- Generate base transactions with complete information
- Use as reference for other systems
- Highest data quality and completeness

#### **2. Secondary Systems (Card Processor, Payment Gateway)**
- Create matching transactions with intentional discrepancies
- Introduce various types of errors
- Add extra transactions that don't exist in primary system

#### **3. Data Quality Features**
- Realistic amounts and dates
- Proper status distribution
- Rich metadata payloads
- Consistent account references

In [12]:
# Generate sample transaction data
print("🎲 Generating sample transaction data...")

# Create data directory if it doesn't exist
data_dir = "/opt/bitnami/spark/data/raw"
os.makedirs(data_dir, exist_ok=True)
print(f"✓ Data directory: {data_dir}")

# Generate account IDs
print("\n Generating account IDs...")
account_ids = generate_account_ids(num_accounts=DATA_CONFIG['num_accounts'])
print(f"✓ Generated {len(account_ids)} unique account IDs")
print(f"Sample account IDs: {account_ids[:5]}")

# Generate primary transactions (core banking)
print("\n Generating primary transactions (core banking)...")
primary_system = 'core_banking'
primary_transactions = generate_transaction_data(
    primary_system,
    account_ids,
    start_date,
    end_date,
    count=DATA_CONFIG['primary_transactions']
)

print(f"✓ Generated {len(primary_transactions)} primary transactions")
print(f"Date range: {min(tx['transaction_date'] for tx in primary_transactions)} to {max(tx['transaction_date'] for tx in primary_transactions)}")
print(f"Amount range: ${min(tx['amount'] for tx in primary_transactions):.2f} to ${max(tx['amount'] for tx in primary_transactions):.2f}")

# Generate matching transactions for secondary systems
print("\n🔄 Generating matching transactions for secondary systems...")
secondary_systems = [s for s in SOURCE_SYSTEMS if s != primary_system]
all_transactions = {primary_system: primary_transactions}

for secondary_system in secondary_systems:
    print(f"\n📊 Processing {secondary_system}...")
    secondary_transactions = create_matching_transactions(
        primary_transactions,
        secondary_system,
        error_rate=DATA_CONFIG['error_rate']
    )
    all_transactions[secondary_system] = secondary_transactions
    print(f"✓ Generated {len(secondary_transactions)} {secondary_system} transactions")

# Summary of generated data
print("\n📊 Data Generation Summary:")
total_transactions = sum(len(txs) for txs in all_transactions.values())
print(f"- Total transactions: {total_transactions:,}")
for system, transactions in all_transactions.items():
    print(f"- {system}: {len(transactions):,} transactions")
print(f"- Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
print(f"- Error rate: {DATA_CONFIG['error_rate']*100:.1f}%")
print(f"- Extra transactions rate: {DATA_CONFIG['extra_transactions_rate']*100:.1f}%")

## Step 7: Save Data to CSV Files

**Purpose**: Save generated data to CSV files for ingestion into Iceberg tables.

### **CSV File Strategy**:

#### **File Organization**
- One CSV file per source system
- Consistent naming convention
- Proper data type handling

#### **Data Format**
- DateTime objects converted to strings
- JSON payloads properly serialized
- Consistent column ordering

#### **File Locations**
- `/opt/bitnami/spark/data/raw/`
- Accessible to Spark for ingestion
- Organized by source system

In [13]:
# Save data to CSV files
print(" Saving data to CSV files...")

def save_to_csv(transactions, filename):
    """Save transactions to a CSV file."""
    df = pd.DataFrame(transactions)
    
    # Convert datetime objects to strings
    for col in ['transaction_date', 'created_at', 'processing_timestamp']:
        df[col] = df[col].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
    
    # Save to CSV
    df.to_csv(filename, index=False)
    print(f"✓ Saved {len(transactions)} transactions to {filename}")
    
    # Show sample data
    print(f"  Sample data from {filename}:")
    print(f"  - Columns: {list(df.columns)}")
    print(f"  - Shape: {df.shape}")
    print(f"  - Sample transaction_id: {df['transaction_id'].iloc[0]}")
    print(f"  - Sample amount: ${df['amount'].iloc[0]:.2f}")
    print(f"  - Sample status: {df['status'].iloc[0]}")

# Save each source system's data
csv_files = {}
for system, transactions in all_transactions.items():
    filename = f"{data_dir}/{system}_transactions.csv"
    save_to_csv(transactions, filename)
    csv_files[system] = filename

print(f"\n📁 CSV files created:")
for system, filename in csv_files.items():
    file_size = os.path.getsize(filename)
    print(f"- {system}: {filename} ({file_size:,} bytes)")

# Verify files exist and are readable
print("\n🔍 Verifying CSV files...")
for filename in csv_files.values():
    if os.path.exists(filename):
        print(f"✓ {filename} exists and is readable")
    else:
        print(f"✗ {filename} not found")

# Show directory contents
print(f"\n Raw data directory contents:")
!ls -la {data_dir}/

total 6968
drwxrwxrwx 1 root root    4096 Jul 16 07:18 .
drwxrwxrwx 1 root root    4096 Jul 16 07:18 ..
-rwxrwxrwx 1 root root       0 Jul 12 19:25 .gitkeep
-rwxrwxrwx 1 root root 2387816 Jul 21 13:18 card_processor_transactions.csv
-rwxrwxrwx 1 root root 2354922 Jul 21 13:18 core_banking_transactions.csv
-rwxrwxrwx 1 root root 2388256 Jul 21 13:18 payment_gateway_transactions.csv


## Step 8: Ingest Data into Iceberg Tables

**Purpose**: Load CSV data into Iceberg tables using Spark DataFrame operations.

### **Ingestion Strategy**:

#### **1. CSV Reading**
- Use Spark's CSV reader with proper schema inference
- Handle datetime parsing correctly
- Ensure data type consistency

#### **2. Data Transformation**
- Convert string dates back to timestamps
- Ensure proper decimal precision for amounts
- Validate data quality

#### **3. Iceberg Writing**
- Use Iceberg's write capabilities
- Leverage partitioning for performance
- Maintain ACID properties

#### **4. Error Handling**
- Validate data before writing
- Handle missing or malformed data
- Provide detailed error reporting

In [14]:
# Ingest data into Iceberg tables
print("📥 Ingesting data into Iceberg tables...")

def ingest_csv_to_iceberg(csv_file, table_name):
    """Ingest CSV file into Iceberg table."""
    try:
        print(f"\n🔄 Processing {csv_file}...")
        
        # Read CSV file
        df = spark.read.option("header", "true").option("inferSchema", "true").csv(csv_file)
        
        print(f"✓ Read {df.count()} rows from {csv_file}")
        print(f"Schema: {df.schema}")
        
        # Convert string dates back to timestamps
        from pyspark.sql.functions import to_timestamp
        
        df = df.withColumn("transaction_date", to_timestamp("transaction_date")) \
               .withColumn("created_at", to_timestamp("created_at")) \
               .withColumn("processing_timestamp", to_timestamp("processing_timestamp"))
        
        # Show sample data
        print(f"\n Sample data from {csv_file}:")
        df.show(5, truncate=False)
        
        # Write to the Iceberg table
        print(f"\n💾 Writing to {table_name}...")
        df.writeTo(table_name).append()
        
        print(f"✓ Successfully ingested {df.count()} rows into {table_name}")
    except Exception as e:
        print(f"✗ Error ingesting {csv_file} into {table_name}: {str(e)}")

# Ingest each system's data
table_map = {
    "core_banking": "lakekeeper.banking.source_transactions",
    "card_processor": "lakekeeper.banking.source_transactions",
    "payment_gateway": "lakekeeper.banking.source_transactions"
}

for system, csv_file in csv_files.items():
    ingest_csv_to_iceberg(csv_file, table_map[system])

                                                                                

[Stage 14:>                                                         (0 + 1) / 1]

+---------------+-------------+-------------------+-------+-----------+----------------+--------------+---------+------------------------------------------------------------------------------+----------+--------------------+
|transaction_id |source_system|transaction_date   |amount |account_id |transaction_type|reference_id  |status   |payload                                                                       |created_at|processing_timestamp|
+---------------+-------------+-------------------+-------+-----------+----------------+--------------+---------+------------------------------------------------------------------------------+----------+--------------------+
|CB-0904A26AD08F|core_banking |2025-06-25 14:57:51|4965.84|ACC20300719|payment         |REF-6634064655|pending  |"{""description"": ""Reach method name source which.""                        |NULL      |NULL                |
|CB-8816AF66FBD4|core_banking |2025-06-25 18:56:16|9672.07|ACC79990125|payment         |REF-49562968

                                                                                

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".                
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.


                                                                                

+---------------+--------------+-------------------+-------+-----------+----------------+--------------+---------+------------------------------------------------------------------------------+----------+--------------------+
|transaction_id |source_system |transaction_date   |amount |account_id |transaction_type|reference_id  |status   |payload                                                                       |created_at|processing_timestamp|
+---------------+--------------+-------------------+-------+-----------+----------------+--------------+---------+------------------------------------------------------------------------------+----------+--------------------+
|CP-EDD1B3B25116|card_processor|2025-06-25 14:57:51|4965.84|ACC20300719|payment         |REF-6634064655|pending  |"{""description"": ""Reach method name source which.""                        |NULL      |NULL                |
|CP-8DCBAB544B03|card_processor|2025-06-25 18:56:16|9672.07|ACC79990125|payment         |REF-495

                                                                                

+---------------+---------------+-------------------+-------+-----------+----------------+--------------+---------+------------------------------------------------------------------------------+----------+--------------------+
|transaction_id |source_system  |transaction_date   |amount |account_id |transaction_type|reference_id  |status   |payload                                                                       |created_at|processing_timestamp|
+---------------+---------------+-------------------+-------+-----------+----------------+--------------+---------+------------------------------------------------------------------------------+----------+--------------------+
|PG-8D430B581B31|payment_gateway|2025-06-25 13:27:51|4965.84|ACC20300719|payment         |REF-6634064655|pending  |"{""description"": ""Reach method name source which.""                        |NULL      |NULL                |
|PG-CA22793CC8CE|payment_gateway|2025-06-25 18:56:16|9672.07|ACC79990125|payment         |RE

                                                                                

## Step 9: Audit and Validate Data Ingestion

**Purpose**: Ensure that data was ingested correctly and is ready for reconciliation.

### **Audit Steps**:
- Count rows in each table
- Check partition distribution
- Validate schema and sample data
- Confirm data ranges and integrity

In [10]:
# Audit and validate data ingestion
print("🔍 Auditing data ingestion...")

for table in tables_to_check:
    try:
        count = spark.sql(f"SELECT COUNT(*) as count FROM {table}").collect()[0]['count']
        print(f"✓ {table}: {count} rows")
    except Exception as e:
        print(f"✗ {table}: Error - {str(e)}")

# Check partition distribution for source_transactions
print("\n📊 Partition distribution for source_transactions:")
try:
    partition_df = spark.sql("""
        SELECT
            date_trunc('day', transaction_date) as day,
            source_system,
            COUNT(*) as count
        FROM lakekeeper.banking.source_transactions
        GROUP BY day, source_system
        ORDER BY day, source_system
    """)
    partition_df.show(10)
except Exception as e:
    print(f"✗ Error checking partition distribution: {str(e)}")

# Show sample data
print("\n📋 Sample data from source_transactions:")
try:
    spark.sql("SELECT * FROM lakekeeper.banking.source_transactions LIMIT 5").show(truncate=False)
except Exception as e:
    print(f"✗ Error showing sample data: {str(e)}")



+-------------------+---------------+-----+
|                day|  source_system|count|
+-------------------+---------------+-----+
|2025-06-13 00:00:00| card_processor|   92|
|2025-06-13 00:00:00|   core_banking|   92|
|2025-06-13 00:00:00|payment_gateway|   96|
|2025-06-14 00:00:00| card_processor|  157|
|2025-06-14 00:00:00|   core_banking|  153|
|2025-06-14 00:00:00|payment_gateway|  153|
|2025-06-15 00:00:00| card_processor|  158|
|2025-06-15 00:00:00|   core_banking|  155|
|2025-06-15 00:00:00|payment_gateway|  156|
|2025-06-16 00:00:00| card_processor|  159|
+-------------------+---------------+-----+
only showing top 10 rows



                                                                                

+---------------+---------------+-------------------+-------+-----------+----------------+--------------+---------+----------------------------------------------------------------------+----------+--------------------+
|transaction_id |source_system  |transaction_date   |amount |account_id |transaction_type|reference_id  |status   |payload                                                               |created_at|processing_timestamp|
+---------------+---------------+-------------------+-------+-----------+----------------+--------------+---------+----------------------------------------------------------------------+----------+--------------------+
|PG-FCEAA40AD81D|payment_gateway|2025-06-21 10:31:46|3129.39|ACC87735766|refund          |REF-1434221495|completed|"{""description"": ""Live court report model college defense police.""|NULL      |NULL                |
|PG-698A7AFE593F|payment_gateway|2025-06-21 10:49:59|9315.79|ACC67788147|withdrawal      |REF-2158590067|reversed |"{""descr

## Check Partitioning - Hive Style

![source_transactions_table_in_minio](images/source_transactions_table_in_minio.png)


## Step 11: Phase 2 Summary and Next Steps

**Purpose**: Summarize what was accomplished in Phase 2 and prepare for the next phase.

### **What We've Accomplished**:
- Generated realistic, multi-system banking transaction data
- Introduced controlled discrepancies for reconciliation testing
- Saved data to CSV files and ingested into Iceberg tables
- Validated data quality, partitioning, and schema
- Examined Iceberg file structure with real data

### **Next Phase Preview**:
- Implement reconciliation logic
- Analyze and resolve discrepancies
- Generate reconciliation reports
- Perform advanced Iceberg operations (time travel, schema evolution)

In [15]:
spark.stop()