# Apache Iceberg Banking Reconciliation Demo

## 🎯 Learning Objectives

This notebook demonstrates Apache Iceberg features in the context of a banking reconciliation system. You will learn:

### **Advanced Iceberg Capabilities**
- **Schema Evolution**: Adding, removing, and modifying columns without downtime
- **Time Travel**: Querying historical data states and audit trails
- **Partition Evolution**: Optimizing performance through partition strategy changes
- **ACID Transactions**: Ensuring data consistency with atomic operations
- **Incremental Processing**: Efficient handling of new data streams

Each section below exercises one of these capabilities against the tables in the `local.banking` namespace.

## 1. Setup

First, let's initialize our Spark session with Iceberg configuration.

In [1]:
!pip install --root-user-action=ignore rich boto3 --quiet
from rich import print

DEPRECATION: Loading egg at /opt/bitnami/python/lib/python3.12/site-packages/pip-23.3.2-py3.12.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, abs, lit, expr, when
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql.functions import current_timestamp
from pyspark.sql.functions import concat
import uuid
from datetime import datetime, timedelta
import os
from pyspark import SparkConf

# Required versions
SPARK_VERSION = pyspark.__version__
SPARK_MINOR_VERSION = '.'.join(SPARK_VERSION.split('.')[:2])
ICEBERG_VERSION = "1.9.2"

# Warehouse path
warehouse_dir = "/opt/bitnami/spark/warehouse"
os.makedirs(warehouse_dir, exist_ok=True)
print(f"✓ Warehouse directory: {warehouse_dir}")

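# Stop the previous session, if any.
# Note: getOrCreate() starts a session when none exists, so the first branch is the usual path.
try:
    SparkSession.builder.getOrCreate().stop()
    print("✓ Stopped existing Spark session")
except Exception:
    print("ℹ No existing Spark session to stop")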


# Spark and Iceberg settings
config = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.local.type": "hadoop",
    "spark.sql.catalog.local.warehouse": f"file://{warehouse_dir}",
    "spark.sql.defaultCatalog": "local",
    "spark.executor.memory": "1024m",
    "spark.executor.cores": "1",
    "spark.jars.packages": f"org.apache.iceberg:iceberg-spark-runtime-{SPARK_MINOR_VERSION}_2.12:{ICEBERG_VERSION}",
}

spark_config = SparkConf().setAppName("Banking Reconciliation Demo")
for k, v in config.items():
    spark_config = spark_config.set(k, v)

spark = SparkSession.builder.config(conf=spark_config).getOrCreate()

print("✓ Spark session created successfully")
print(f"Spark version: {spark.version}")
print(f"Default catalog: {spark.conf.get('spark.sql.defaultCatalog')}")
print(f"Warehouse location: {spark.conf.get('spark.sql.catalog.local.warehouse')}")


✓ Warehouse directory: /opt/bitnami/spark/warehouse


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/22 07:54:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/07/22 07:54:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/07/22 07:54:26 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


✓ Stopped existing Spark session


25/07/22 07:54:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/07/22 07:54:28 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


✓ Spark session created successfully
Spark version: 3.5.6
Default catalog: local
Warehouse location: file:///opt/bitnami/spark/warehouse
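
The `spark.jars.packages` value in the setup cell is assembled from the Spark minor version and a hardcoded Scala suffix (`_2.12`). As a minimal sketch, the helper below (a hypothetical name, not part of the project) builds the same Maven coordinate and turns the Scala version into an explicit parameter, which may matter if the Spark build uses a different Scala version.

```python
def iceberg_runtime_coordinate(spark_version: str,
                               iceberg_version: str,
                               scala_version: str = "2.12") -> str:
    """Build the Maven coordinate for the Iceberg Spark runtime jar.

    Hypothetical helper for illustration; it mirrors the f-string used in
    the setup cell.
    """
    # The runtime artifact is named after the Spark minor version (e.g. 3.5).
    spark_minor = ".".join(spark_version.split(".")[:2])
    return (
        "org.apache.iceberg:"
        f"iceberg-spark-runtime-{spark_minor}_{scala_version}:{iceberg_version}"
    )


coordinate = iceberg_runtime_coordinate("3.5.6", "1.9.2")
print(coordinate)
```

The printed value can be compared with the `spark.jars.packages` entry before starting the session.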


## 2. Explore Iceberg Tables

Let's explore the Iceberg tables we've created.

In [2]:
# List all tables in the banking namespace
spark.sql("SHOW TABLES IN local.banking").show()

+---------+--------------------+-----------+
|namespace|           tableName|isTemporary|
+---------+--------------------+-----------+
|  banking|reconciliation_ba...|      false|
|  banking|reconciliation_re...|      false|
|  banking| source_transactions|      false|
+---------+--------------------+-----------+



In [3]:
# Describe the source_transactions table
spark.sql("DESCRIBE TABLE local.banking.source_transactions").show()

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|      transaction_id|              string|   NULL|
|       source_system|              string|   NULL|
|    transaction_date|           timestamp|   NULL|
|              amount|       decimal(18,2)|   NULL|
|          account_id|              string|   NULL|
|    transaction_type|              string|   NULL|
|        reference_id|              string|   NULL|
|              status|              string|   NULL|
|             payload|              string|   NULL|
|          created_at|           timestamp|   NULL|
|processing_timestamp|           timestamp|   NULL|
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|days(transaction_...|       |
|              Part 1|       source_system|       |
+--------------------+--------------------+-------+
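
The `# Partitioning` rows show two partition fields: a `days(...)` transform on a timestamp column and the identity of `source_system`. As a rough conceptual sketch (plain Python, not Iceberg's implementation), the function below shows how such a spec maps a row to a partition tuple: the `days` transform reduces a timestamp to its calendar day, counted as days since 1970-01-01. The function name and sample values are illustrative assumptions.

```python
from datetime import date, datetime, timedelta

EPOCH = date(1970, 1, 1)


def partition_key(ts: datetime, source_system: str) -> tuple[int, str]:
    """Map a row to (days since epoch, source_system), mimicking the spec."""
    days_since_epoch = (ts.date() - EPOCH).days  # value a days() transform yields
    return days_since_epoch, source_system


morning = partition_key(datetime(2025, 7, 6, 8, 15), "core_banking")
evening = partition_key(datetime(2025, 7, 6, 23, 59), "core_banking")
other_system = partition_key(datetime(2025, 7, 6, 8, 15), "card_processor")

# Same day and same source system share a partition; a different system does not.
print(morning == evening)
print(morning == other_system)
# Converting the day count back gives the calendar day of the partition.
print(EPOCH + timedelta(days=morning[0]))
```

Queries that filter on the timestamp column and on `source_system` can then skip every partition whose tuple cannot match.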



In [5]:
# Check the number of transactions by source system
spark.sql("""
    SELECT source_system, COUNT(*) as transaction_count
    FROM local.banking.source_transactions
    GROUP BY source_system
    ORDER BY source_system
""").show()



+---------------+-----------------+
|  source_system|transaction_count|
+---------------+-----------------+
| card_processor|             5050|
|   core_banking|             5000|
|payment_gateway|             5026|
+---------------+-----------------+



                                                                                

## 3. Demonstrate Iceberg Features

### 3.1 Schema Evolution

Let's demonstrate schema evolution by adding a new column to the source_transactions table.

In [4]:
# Add a new column to the source_transactions table
spark.sql("""
    ALTER TABLE local.banking.source_transactions
    ADD COLUMN transaction_category STRING
""").show()

++
||
++
++
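
`ADD COLUMN` is one of several schema changes that Iceberg's Spark DDL accepts; renaming and dropping columns use the same `ALTER TABLE` form and are also metadata-only changes. The sketch below only builds the SQL strings so they can be reviewed before running. The helper name and the `risk_score` and `tx_category` column names are made up for illustration.

```python
TABLE = "local.banking.source_transactions"


def schema_change_sql(table: str, action: str, column: str, detail: str = "") -> str:
    """Return an ALTER TABLE statement for a metadata-only schema change.

    Hypothetical helper; `detail` is the column type for "add" and the new
    column name for "rename".
    """
    if action == "add":
        return f"ALTER TABLE {table} ADD COLUMN {column} {detail}"
    if action == "rename":
        return f"ALTER TABLE {table} RENAME COLUMN {column} TO {detail}"
    if action == "drop":
        return f"ALTER TABLE {table} DROP COLUMN {column}"
    raise ValueError(f"unsupported action: {action}")


statements = [
    schema_change_sql(TABLE, "add", "risk_score", "DOUBLE"),
    schema_change_sql(TABLE, "rename", "transaction_category", "tx_category"),
    schema_change_sql(TABLE, "drop", "risk_score"),
]
for statement in statements:
    print(statement)
```

Each string could be passed to `spark.sql(...)`. None of them is executed here, so the table used in the rest of the notebook is unchanged.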



In [5]:
# Verify the new column was added
spark.sql("DESCRIBE TABLE local.banking.source_transactions").show()

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|      transaction_id|              string|   NULL|
|       source_system|              string|   NULL|
|    transaction_date|           timestamp|   NULL|
|              amount|       decimal(18,2)|   NULL|
|          account_id|              string|   NULL|
|    transaction_type|              string|   NULL|
|        reference_id|              string|   NULL|
|              status|              string|   NULL|
|             payload|              string|   NULL|
|          created_at|           timestamp|   NULL|
|processing_timestamp|           timestamp|   NULL|
|transaction_category|              string|   NULL|
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|days(transaction_...|       |
|              Part 1|       source_system|       |
+--------------------+--------------------+-------+

In [6]:
# Update some records with the new column
spark.sql("""
    UPDATE local.banking.source_transactions
    SET transaction_category = 
        CASE 
            WHEN transaction_type = 'deposit' THEN 'INCOME'
            WHEN transaction_type = 'withdrawal' THEN 'EXPENSE'
            WHEN transaction_type = 'transfer' THEN 'TRANSFER'
            WHEN transaction_type = 'payment' THEN 'EXPENSE'
            WHEN transaction_type = 'refund' THEN 'INCOME'
            WHEN transaction_type = 'fee' THEN 'FEE'
            ELSE 'OTHER'
        END
    WHERE source_system = 'core_banking'
""").show()

                                                                                

++
||
++
++



In [7]:
# Query the data with the new column
spark.sql("""
    SELECT transaction_type, transaction_category, COUNT(*) as count
    FROM local.banking.source_transactions
    WHERE source_system = 'core_banking'
    GROUP BY transaction_type, transaction_category
    ORDER BY transaction_type
""").show()

+----------------+--------------------+-----+
|transaction_type|transaction_category|count|
+----------------+--------------------+-----+
|         deposit|              INCOME|  826|
|             fee|                 FEE|  841|
|         payment|             EXPENSE|  825|
|          refund|              INCOME|  866|
|        transfer|            TRANSFER|  812|
|      withdrawal|             EXPENSE|  830|
+----------------+--------------------+-----+
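
The `CASE` expression in the `UPDATE` cell encodes a small lookup table. One way to keep that mapping in a single place is to hold it in a Python dict and generate the SQL from it. The sketch below does that; the dict contents are copied from the `UPDATE` statement, while the helper names are assumptions.

```python
CATEGORY_BY_TYPE = {
    "deposit": "INCOME",
    "withdrawal": "EXPENSE",
    "transfer": "TRANSFER",
    "payment": "EXPENSE",
    "refund": "INCOME",
    "fee": "FEE",
}


def category_case_sql(mapping: dict[str, str], default: str = "OTHER") -> str:
    """Render the mapping as a SQL CASE expression over transaction_type."""
    branches = [
        f"WHEN transaction_type = '{tx_type}' THEN '{category}'"
        for tx_type, category in mapping.items()
    ]
    return "CASE " + " ".join(branches) + f" ELSE '{default}' END"


def categorize(tx_type: str, default: str = "OTHER") -> str:
    """Pure-Python equivalent, handy for unit-testing the mapping."""
    return CATEGORY_BY_TYPE.get(tx_type, default)


case_sql = category_case_sql(CATEGORY_BY_TYPE)
print(case_sql)
print(categorize("refund"), categorize("chargeback"))
```

The generated expression could be interpolated into the `SET transaction_category = ...` clause, so the SQL and any Python-side checks stay in agreement.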



### 3.2 Time Travel

Let's demonstrate time travel by querying the table at different points in time.

In [8]:
# Get the current snapshot information
spark.sql("""
    SELECT * FROM local.banking.source_transactions.snapshots
    ORDER BY committed_at DESC
    LIMIT 5
""").show()

+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|        committed_at|        snapshot_id|          parent_id|operation|       manifest_list|             summary|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|2025-07-22 07:55:...|7845908200839870587|1423606334598951226|overwrite|file:/opt/bitnami...|{spark.app.id -> ...|
|2025-07-22 07:52:...|1423606334598951226|9061708065984232342|   append|file:/opt/bitnami...|{spark.app.id -> ...|
|2025-07-22 07:52:...|9061708065984232342| 966229618760623534|   append|file:/opt/bitnami...|{spark.app.id -> ...|
|2025-07-22 07:52:...| 966229618760623534|               NULL|   append|file:/opt/bitnami...|{spark.app.id -> ...|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+



In [9]:
# Store the timestamp of the snapshot before our update
snapshots = spark.sql("""
    SELECT * FROM local.banking.source_transactions.snapshots
    ORDER BY committed_at DESC
    LIMIT 2
""").collect()

# Get the timestamp of the previous snapshot
if len(snapshots) >= 2:
    previous_snapshot_timestamp = snapshots[1]["committed_at"]
    print(f"Previous snapshot timestamp: {previous_snapshot_timestamp}")

Previous snapshot timestamp: 2025-07-22 07:52:56.135000
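
Ordering by `committed_at` works here because the history is linear. The `snapshots` table also records each snapshot's `parent_id`, so the predecessor can be found by following that link instead. The sketch below does this over plain dicts; the IDs are taken from the `snapshots` output shown earlier, and the function name is illustrative.

```python
from typing import Optional

# snapshot_id, parent_id and operation as listed in the snapshots output
SNAPSHOTS = [
    {"snapshot_id": 7845908200839870587, "parent_id": 1423606334598951226, "operation": "overwrite"},
    {"snapshot_id": 1423606334598951226, "parent_id": 9061708065984232342, "operation": "append"},
    {"snapshot_id": 9061708065984232342, "parent_id": 966229618760623534, "operation": "append"},
    {"snapshot_id": 966229618760623534, "parent_id": None, "operation": "append"},
]


def parent_of(snapshots: list[dict], snapshot_id: int) -> Optional[int]:
    """Return the parent snapshot ID, or None for the first snapshot."""
    by_id = {s["snapshot_id"]: s for s in snapshots}
    return by_id[snapshot_id]["parent_id"]


current_id = SNAPSHOTS[0]["snapshot_id"]  # most recent snapshot (the overwrite)
previous_id = parent_of(SNAPSHOTS, current_id)
print(previous_id)
```

In the notebook, the same rows could come from `spark.sql(...).collect()` on the `snapshots` metadata table, converted with `row.asDict()`.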


In [10]:
# Query the table as of the previous snapshot (before adding the new column)
if 'previous_snapshot_timestamp' in locals():
    spark.sql(f"""
        SELECT transaction_type, COUNT(*) as count
        FROM local.banking.source_transactions
        FOR TIMESTAMP AS OF '{previous_snapshot_timestamp}'
        WHERE source_system = 'core_banking'
        GROUP BY transaction_type
        ORDER BY transaction_type
    """).show()
    
    # This fails: time travel reads the table with the snapshot's schema,
    # and the column did not exist in the previous snapshot
    try:
        spark.sql(f"""
            SELECT transaction_category, COUNT(*) as count
            FROM local.banking.source_transactions
            FOR TIMESTAMP AS OF '{previous_snapshot_timestamp}'
            WHERE source_system = 'core_banking'
            GROUP BY transaction_category
            ORDER BY transaction_category
        """).show()
    except Exception as e:
        print(f"Error (expected): {str(e)}")

+----------------+-----+
|transaction_type|count|
+----------------+-----+
|         deposit|  826|
|             fee|  841|
|         payment|  825|
|          refund|  866|
|        transfer|  812|
|      withdrawal|  830|
+----------------+-----+

Error (expected): [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `transaction_category` cannot be resolved. Did you mean one of the following? [`transaction_date`, `transaction_type`, `transaction_id`, `created_at`, `account_id`].; line 2 pos 19;
'Sort ['transaction_category ASC NULLS FIRST], true
+- 'Aggregate ['transaction_category], ['transaction_category, count(1) AS count#341L]
   +- Filter (source_system#343 = core_banking)
      +- SubqueryAlias local.banking.source_transactions
         +- RelationV2[transaction_id#342, source_system#343, transaction_date#344, amount#345, account_id#346, transaction_type#347, reference_id#348, status#349, payload#350, created_at#351, processing_timestamp#352] local.ban
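
Time travel can also target a snapshot ID rather than a timestamp, which avoids ambiguity when two commits land close together. As a hedged sketch, the helper below (a hypothetical name) builds either form of the query using Spark's `VERSION AS OF` and `TIMESTAMP AS OF` clauses. The snapshot ID and timestamp are those of the pre-update snapshot shown in the outputs above.

```python
from datetime import datetime
from typing import Union


def time_travel_sql(table: str, as_of: Union[int, datetime], columns: str = "*") -> str:
    """Build a time-travel SELECT for a snapshot ID (int) or a timestamp."""
    if isinstance(as_of, datetime):
        clause = f"TIMESTAMP AS OF '{as_of.isoformat(sep=' ')}'"
    else:
        clause = f"VERSION AS OF {as_of}"
    return f"SELECT {columns} FROM {table} {clause}"


by_id = time_travel_sql("local.banking.source_transactions", 1423606334598951226)
by_time = time_travel_sql(
    "local.banking.source_transactions",
    datetime(2025, 7, 22, 7, 52, 56, 135000),
    columns="transaction_type",
)
print(by_id)
print(by_time)
```

Either string could be passed to `spark.sql(...)`; both resolve columns against the schema of the selected snapshot.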

### 3.3 Partition Evolution

Let's demonstrate partition evolution by changing the partition spec.

In [8]:
# Inspect the current partitions; spec_id identifies the partition spec each one was written with
spark.sql("""
    SELECT * FROM local.banking.source_transactions.partitions
""").show()

+--------------------+-------+------------+----------+-----------------------------+----------------------------+--------------------------+----------------------------+--------------------------+--------------------+------------------------+
|           partition|spec_id|record_count|file_count|total_data_file_size_in_bytes|position_delete_record_count|position_delete_file_count|equality_delete_record_count|equality_delete_file_count|     last_updated_at|last_updated_snapshot_id|
+--------------------+-------+------------+----------+-----------------------------+----------------------------+--------------------------+----------------------------+--------------------------+--------------------+------------------------+
|{2025-07-06, core...|      1|          29|         1|                         5972|                           0|                         0|                           0|                         0|2025-07-13 12:25:...|     6569966068115906138|
|{2025-06-13, paym...|      

In [None]:
# Add a new partition field
spark.sql("""
    ALTER TABLE local.banking.source_transactions
    ADD PARTITION FIELD transaction_category
""").show()

In [None]:
# Inspect the partitions again (compare the spec_id column)
spark.sql("""
    SELECT * FROM local.banking.source_transactions.partitions
""").show()
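
Adding a partition field only changes table metadata: existing data files keep the spec they were written with, and only files written afterwards use the new spec. In the `partitions` metadata table this shows up as different `spec_id` values. The sketch below groups partition rows by `spec_id` in plain Python. The rows are invented sample data loosely shaped like that table, not output from this notebook.

```python
from collections import defaultdict

# Invented sample rows, loosely shaped like the `partitions` metadata table.
partition_rows = [
    {"partition": ("2025-07-06", "core_banking"), "spec_id": 1, "record_count": 29},
    {"partition": ("2025-07-07", "core_banking"), "spec_id": 1, "record_count": 31},
    {"partition": ("2025-07-22", "core_banking", "NEW"), "spec_id": 2, "record_count": 10},
]


def records_per_spec(rows: list[dict]) -> dict[int, int]:
    """Sum record_count per partition spec to see how much data each spec holds."""
    totals: dict[int, int] = defaultdict(int)
    for row in rows:
        totals[row["spec_id"]] += row["record_count"]
    return dict(totals)


totals = records_per_spec(partition_rows)
print(totals)
```

A breakdown like this indicates how much data still sits under the older spec and therefore whether rewriting it is worth the cost.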

## 4. Run a Reconciliation Process

Let's run a reconciliation process to match transactions across systems.

In [9]:
# Import necessary modules
import sys

# Make the project's source tree importable
sys.path.append('/opt/bitnami/spark/src/main/python')

from etl.extractors import TransactionExtractor
from etl.transformers import TransactionTransformer
from reconciliation.matcher import TransactionMatcher
from reconciliation.reporter import ReconciliationReporter
from etl.loaders import IcebergLoader

In [10]:
# Define reconciliation parameters
batch_id = f"DEMO-{uuid.uuid4().hex[:8]}"
source_systems = ['core_banking', 'card_processor']
end_date = datetime.now()
start_date = end_date - timedelta(days=30)

print(f"Running reconciliation for batch {batch_id}")
print(f"Source systems: {source_systems}")
print(f"Date range: {start_date} to {end_date}")

In [11]:
# Define the batch schema explicitly, then register the reconciliation batch
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, ArrayType, IntegerType


batch_schema = StructType([
    StructField("batch_id", StringType(), False),
    StructField("reconciliation_date", TimestampType(), True),
    StructField("source_systems", ArrayType(StringType()), True),
    StructField("start_date", TimestampType(), True),
    StructField("end_date", TimestampType(), True),
    StructField("status", StringType(), True),
    StructField("total_transactions", IntegerType(), True),
    StructField("matched_count", IntegerType(), True),
    StructField("unmatched_count", IntegerType(), True),
    StructField("created_at", TimestampType(), True),
    StructField("completed_at", TimestampType(), True)
])

batch_df = spark.createDataFrame([{
    "batch_id": batch_id,
    "reconciliation_date": datetime.now(),
    "source_systems": source_systems,
    "start_date": start_date,
    "end_date": end_date,
    "status": "IN_PROGRESS",
    "total_transactions": 0,
    "matched_count": 0,
    "unmatched_count": 0,
    "created_at": datetime.now(),
    "completed_at": None
}], schema=batch_schema)

loader = IcebergLoader(spark)
loader.load_reconciliation_batch(batch_df)

                                                                                

Loaded 1 batches into reconciliation_batches table


In [12]:
# Extract transactions
extractor = TransactionExtractor(spark)
transactions_by_source = extractor.extract_transactions_for_reconciliation(
    source_systems, start_date, end_date
)

# Print transaction counts
for source, df in transactions_by_source.items():
    print(f"{source}: {df.count()} transactions")

In [13]:
# Transform transactions
transformer = TransactionTransformer(spark)
prepared_transactions = transformer.prepare_for_reconciliation(transactions_by_source)

# Get primary and secondary DataFrames
primary_source = source_systems[0]
secondary_source = source_systems[1]
primary_df = prepared_transactions[primary_source]
secondary_df = prepared_transactions[secondary_source]

In [14]:
# Match transactions
matcher = TransactionMatcher(spark)
matched_df, unmatched_primary_df, unmatched_secondary_df = matcher.match_transactions(
    primary_df, secondary_df, match_strategy="hybrid"
)

# Print matching results
print(f"Matched: {matched_df.count()} transactions")
print(f"Unmatched in {primary_source}: {unmatched_primary_df.count()} transactions")
print(f"Unmatched in {secondary_source}: {unmatched_secondary_df.count()} transactions")

                                                                                

                                                                                

                                                                                

In [15]:
# Create reconciliation results
results_df = matcher.create_reconciliation_results(
    batch_id,
    matched_df,
    unmatched_primary_df,
    unmatched_secondary_df,
    primary_source,
    secondary_source
)

# Save reconciliation results
loader.load_reconciliation_results(results_df)

                                                                                

Loaded 5138 results into reconciliation_results table


In [16]:
# Generate reports
reporter = ReconciliationReporter(spark)
summary_report = reporter.generate_summary_report(results_df)
discrepancy_report = reporter.generate_discrepancy_report(results_df)

# Display summary report
summary_report.show()

                                                                                

+-------------+------------+-----+------------------------+
|     batch_id|match_status|count|total_discrepancy_amount|
+-------------+------------+-----+------------------------+
|DEMO-5a45e038|     MATCHED| 4810|                    0.00|
|DEMO-5a45e038|     PARTIAL|   63|                  337.34|
|DEMO-5a45e038|   UNMATCHED|  265|              1267091.26|
+-------------+------------+-----+------------------------+
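
The summary distinguishes `MATCHED`, `PARTIAL`, and `UNMATCHED` rows. The project's `TransactionMatcher` is not shown in this notebook, so the following is only a toy illustration of how such a three-way classification could work, assuming transactions pair up on `reference_id` and are then compared on `amount` and `status`. The function names, the matching key, the `AMOUNT` discrepancy label, and the sample rows are all assumptions.

```python
from decimal import Decimal
from typing import Optional


def _result(primary_id: Optional[str], secondary_id: Optional[str],
            match_status: str, discrepancy_type: Optional[str] = None) -> dict:
    return {
        "primary_transaction_id": primary_id,
        "secondary_transaction_id": secondary_id,
        "match_status": match_status,
        "discrepancy_type": discrepancy_type,
    }


def classify(primary: list[dict], secondary: list[dict]) -> list[dict]:
    """Toy matcher: pair on reference_id, then compare amount and status."""
    secondary_by_ref = {tx["reference_id"]: tx for tx in secondary}
    results = []
    for tx in primary:
        other = secondary_by_ref.pop(tx["reference_id"], None)
        if other is None:
            results.append(_result(tx["transaction_id"], None, "UNMATCHED"))
        elif tx["amount"] != other["amount"]:
            results.append(_result(tx["transaction_id"], other["transaction_id"], "PARTIAL", "AMOUNT"))
        elif tx["status"] != other["status"]:
            results.append(_result(tx["transaction_id"], other["transaction_id"], "PARTIAL", "STATUS"))
        else:
            results.append(_result(tx["transaction_id"], other["transaction_id"], "MATCHED"))
    # Secondary transactions that were never paired are unmatched too.
    for other in secondary_by_ref.values():
        results.append(_result(None, other["transaction_id"], "UNMATCHED"))
    return results


core_banking = [
    {"transaction_id": "CB-1", "reference_id": "REF-1", "amount": Decimal("10.00"), "status": "completed"},
    {"transaction_id": "CB-2", "reference_id": "REF-2", "amount": Decimal("25.00"), "status": "pending"},
    {"transaction_id": "CB-3", "reference_id": "REF-3", "amount": Decimal("40.00"), "status": "completed"},
]
card_processor = [
    {"transaction_id": "CP-1", "reference_id": "REF-1", "amount": Decimal("10.00"), "status": "completed"},
    {"transaction_id": "CP-2", "reference_id": "REF-2", "amount": Decimal("25.00"), "status": "completed"},
    {"transaction_id": "CP-9", "reference_id": "REF-9", "amount": Decimal("5.00"), "status": "completed"},
]

results = classify(core_banking, card_processor)
for row in results:
    print(row["match_status"], row["discrepancy_type"])
```

A real matcher would run as Spark joins rather than Python loops; the point here is only the decision order behind the three statuses.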



In [17]:
# Display discrepancy report (first 10 rows)
discrepancy_report.show(10)

                                                                                

+--------------------+-------------+----------------------+------------------------+------------+----------------+------------------+------------------------+--------------------+
|   reconciliation_id|     batch_id|primary_transaction_id|secondary_transaction_id|match_status|discrepancy_type|discrepancy_amount|reconciliation_timestamp|               notes|
+--------------------+-------------+----------------------+------------------------+------------+----------------+------------------+------------------------+--------------------+
|c241147f-128f-416...|DEMO-5a45e038|       CB-0531DDDA18F2|         CP-CBD78B2DB14D|     PARTIAL|          STATUS|              NULL|    2025-07-13 13:18:...|Partial match bet...|
|826d8f32-807d-474...|DEMO-5a45e038|       CB-05B6C2A6B2B1|         CP-A48E95951AF9|     PARTIAL|          STATUS|              NULL|    2025-07-13 13:18:...|Partial match bet...|
|5ef66ee8-f21d-4c6...|DEMO-5a45e038|       CB-067A134D2E29|         CP-E3DE2F3FFC4F|     PARTIAL|   

In [19]:
# Update batch status (each .count() runs a Spark job, so compute the counts once)
matched_count = matched_df.count()
unmatched_count = unmatched_primary_df.count() + unmatched_secondary_df.count()

spark.sql(f"""
    UPDATE local.banking.reconciliation_batches
    SET
        status = 'COMPLETED',
        matched_count = {matched_count},
        unmatched_count = {unmatched_count},
        total_transactions = {matched_count + unmatched_count},
        completed_at = CURRENT_TIMESTAMP()
    WHERE batch_id = '{batch_id}'
""")

                                                                                

DataFrame[]

## 5. Demonstrate ACID Transactions

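Iceberg commits each write atomically. The cell below runs two writes in sequence, a `MERGE INTO` and an `INSERT`. Each one is atomic on its own, but they are separate commits against different tables, not a single multi-statement transaction: if the `INSERT` fails, the `MERGE INTO` is not rolled back.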

In [21]:
from pyspark.sql.functions import lit
import uuid

try:
    # 1. Build a DataFrame for updates
    updates = spark.sql("""
        SELECT transaction_id AS id
        FROM local.banking.source_transactions
        WHERE status = 'pending' AND source_system = 'core_banking'
    """).withColumn("status", lit("completed"))

    updates.createOrReplaceTempView("updates")

    # 2. Perform MERGE INTO for atomic update
    spark.sql("""
        MERGE INTO local.banking.source_transactions AS t
        USING updates AS u
          ON t.transaction_id = u.id
        WHEN MATCHED THEN
          UPDATE SET t.status = u.status
    """)

    # 3. Insert a new batch row
    new_batch_id = f"ACID-{uuid.uuid4().hex[:8]}"
    spark.sql(f"""
        INSERT INTO local.banking.reconciliation_batches VALUES (
            '{new_batch_id}',
            CURRENT_TIMESTAMP(),
            ARRAY('core_banking', 'payment_gateway'),
            TIMESTAMP('{start_date}'),
            TIMESTAMP('{end_date}'),
            'PENDING',
            0,
            0,
            0,
            CURRENT_TIMESTAMP(),
            NULL
        )
    """)
    print("Operations completed successfully.")

except Exception as e:
    print(f"Error during operations: {str(e)}")

                                                                                

In [22]:
# Verify the changes
spark.sql(f"""
    SELECT * FROM local.banking.reconciliation_batches
    WHERE batch_id = '{new_batch_id}'
""").show()

+-------------+--------------------+--------------------+--------------------+--------------------+-------+------------------+-------------+---------------+--------------------+------------+
|     batch_id| reconciliation_date|      source_systems|          start_date|            end_date| status|total_transactions|matched_count|unmatched_count|          created_at|completed_at|
+-------------+--------------------+--------------------+--------------------+--------------------+-------+------------------+-------------+---------------+--------------------+------------+
|ACID-b3048f2e|2025-07-13 13:30:...|[core_banking, pa...|2025-06-13 13:17:...|2025-07-13 13:17:...|PENDING|                 0|            0|              0|2025-07-13 13:30:...|        NULL|
+-------------+--------------------+--------------------+--------------------+--------------------+-------+------------------+-------------+---------------+--------------------+------------+
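
Each statement in the ACID cell is atomic because Iceberg commits by swapping the table's current metadata pointer in a single step, using optimistic concurrency: a writer prepares new metadata against the version it read, and the commit succeeds only if that version is still current. The toy class below illustrates that compare-and-swap idea in plain Python. It is a conceptual sketch with invented names, not Iceberg's code.

```python
import threading


class ToyCatalog:
    """Holds a table's current metadata version; commits are compare-and-swap."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.version = 0
        self.snapshots: list[str] = []

    def commit(self, expected_version: int, snapshot: str) -> bool:
        """Publish `snapshot` only if nobody committed since `expected_version`."""
        with self._lock:
            if self.version != expected_version:
                return False  # stale base: the caller must refresh and retry
            self.snapshots.append(snapshot)
            self.version += 1
            return True


def commit_with_retry(catalog: ToyCatalog, snapshot: str, max_attempts: int = 3) -> int:
    """Re-read the current version before each attempt; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        if catalog.commit(catalog.version, snapshot):
            return attempt
    raise RuntimeError("commit failed after retries")


catalog = ToyCatalog()
base = catalog.version                           # two writers both read version 0
first = catalog.commit(base, "merge")            # writer A commits first
second = catalog.commit(base, "insert")          # writer B is stale and is rejected
attempts = commit_with_retry(catalog, "insert")  # B refreshes and succeeds
print(first, second, attempts, catalog.snapshots)
```

Readers never observe a half-applied write in this model: they see either the metadata version before a commit or the one after it.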



## 6. Demonstrate Incremental Processing

Let's demonstrate incremental processing by processing only new transactions.

In [23]:
# Get the latest snapshot timestamp
latest_snapshot = spark.sql("""
    SELECT * FROM local.banking.source_transactions.snapshots
    ORDER BY committed_at DESC
    LIMIT 1
""").collect()[0]

latest_timestamp = latest_snapshot["committed_at"]
print(f"Latest snapshot timestamp: {latest_timestamp}")

In [24]:
# Create some new transactions
import random
from decimal import Decimal

# Generate 10 new transactions
new_transactions = []
for i in range(10):
    tx_id = f"NEW-{uuid.uuid4().hex[:8]}"
    source_system = "core_banking"
    tx_date = datetime.now()
    amount = Decimal(random.uniform(100, 1000)).quantize(Decimal('0.01'))
    account_id = f"ACC{random.randint(10000000, 99999999)}"
    tx_type = random.choice(["deposit", "withdrawal", "transfer", "payment"])
    ref_id = f"REF-{random.randint(1000000000, 9999999999)}"
    status = random.choice(["completed", "pending"])
    
    new_transactions.append({
        "transaction_id": tx_id,
        "source_system": source_system,
        "transaction_date": tx_date,
        "amount": amount,
        "account_id": account_id,
        "transaction_type": tx_type,
        "reference_id": ref_id,
        "status": status,
        "payload": "{}",
        "created_at": tx_date,
        "processing_timestamp": tx_date,
        "transaction_category": "NEW"
    })

# Create a DataFrame from the new transactions
new_tx_df = spark.createDataFrame(new_transactions)
new_tx_df.show(5)

+-----------+--------------------+--------------------+-------+--------------------+--------------+-------------+---------+--------------------+--------------------+--------------+----------------+
| account_id|              amount|          created_at|payload|processing_timestamp|  reference_id|source_system|   status|transaction_category|    transaction_date|transaction_id|transaction_type|
+-----------+--------------------+--------------------+-------+--------------------+--------------+-------------+---------+--------------------+--------------------+--------------+----------------+
|ACC32063425|703.3700000000000...|2025-07-13 13:33:...|     {}|2025-07-13 13:33:...|REF-4246999770| core_banking|completed|                 NEW|2025-07-13 13:33:...|  NEW-b678ff98|         deposit|
|ACC57490967|826.9200000000000...|2025-07-13 13:33:...|     {}|2025-07-13 13:33:...|REF-8898056994| core_banking|completed|                 NEW|2025-07-13 13:33:...|  NEW-d367651a|        transfer|
|ACC615056

In [25]:
# Load the new transactions incrementally
loader = IcebergLoader(spark)
loader.load_transactions_incrementally(
    new_tx_df, 
    "core_banking",
    snapshot_time=datetime.now()
)

                                                                                

Loaded 10 transactions into source_transactions table
Loaded 10 new transactions incrementally


In [27]:
# Query the transactions created after the snapshot timestamp captured earlier
spark.sql(f"""
    SELECT * 
    FROM local.banking.source_transactions
    WHERE created_at > TIMESTAMP('{latest_timestamp}')
    ORDER BY transaction_date DESC
""").show()

+--------------+-------------+--------------------+------+-----------+----------------+--------------+---------+-------+--------------------+--------------------+--------------------+
|transaction_id|source_system|    transaction_date|amount| account_id|transaction_type|  reference_id|   status|payload|          created_at|processing_timestamp|transaction_category|
+--------------+-------------+--------------------+------+-----------+----------------+--------------+---------+-------+--------------------+--------------------+--------------------+
|  NEW-c35b672a| core_banking|2025-07-13 13:33:...|785.20|ACC92554416|         deposit|REF-6811757290|  pending|     {}|2025-07-13 13:33:...|2025-07-13 13:33:...|                 NEW|
|  NEW-c684bc0f| core_banking|2025-07-13 13:33:...|831.84|ACC68619522|         payment|REF-8337618624|  pending|     {}|2025-07-13 13:33:...|2025-07-13 13:33:...|                 NEW|
|  NEW-ebcdce89| core_banking|2025-07-13 13:33:...|100.51|ACC86556910|        tr
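
The query above finds new rows by filtering on `created_at`, which depends on that column being populated consistently. Iceberg's Spark reader also supports snapshot-based incremental reads through the `start-snapshot-id` (exclusive) and `end-snapshot-id` (inclusive) read options, which return the rows appended between two snapshots. The sketch below only assembles those options; the helper name is hypothetical and the snapshot IDs are placeholders.

```python
from typing import Optional


def incremental_read_options(start_snapshot_id: int,
                             end_snapshot_id: Optional[int] = None) -> dict[str, str]:
    """Build Iceberg read options for an incremental (append-only) scan."""
    options = {"start-snapshot-id": str(start_snapshot_id)}  # exclusive bound
    if end_snapshot_id is not None:
        options["end-snapshot-id"] = str(end_snapshot_id)    # inclusive bound
    return options


# Placeholder IDs for illustration only.
options = incremental_read_options(1111, 2222)
print(options)
```

With a live session these could be applied along the lines of `spark.read.format("iceberg").options(**options).load("local.banking.source_transactions")`. Incremental scans of this kind cover append snapshots, so rows changed by the `UPDATE` or `MERGE INTO` statements earlier in the notebook would not be returned.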

## 7. Conclusion

In this notebook, we've demonstrated the key features of the Apache Iceberg Banking Reconciliation System:

1. **Schema Evolution**: Adding new columns without rebuilding tables
2. **Time Travel**: Querying historical states of the data
3. **Partition Evolution**: Changing partition specifications
4. **ACID Transactions**: Ensuring consistency during updates
5. **Incremental Processing**: Processing only new data
6. **Reconciliation Process**: Matching transactions across systems

Together, these features let a banking reconciliation pipeline change its schema and partitioning in place, audit earlier table states, and rely on an atomic commit for every write.

In [11]:
spark.stop()