# Testing the Deterministic Bank Statement Parser

This notebook demonstrates how to use the deterministic parser to extract transaction data from DBS bank statement PDFs.

## Features Demonstrated
- Basic parsing and extraction
- Confidence scoring and warnings
- Transaction data structure
- Validation against known values
- Visual inspection of extracted data

## Setup and Imports

In [None]:
import sys
import json
from pathlib import Path
import pandas as pd

# Add the project root (the parent directory) to the path so the `src` package can be imported
sys.path.insert(0, str(Path('..').resolve()))

from src.parsers.deterministic_parser import DeterministicBankStatementParser

## 1. Basic Parsing Example

Let's parse the DBS Singapore bank statement and examine the results.

In [None]:
# Initialize the parser
parser = DeterministicBankStatementParser()

# Path to the PDF
pdf_path = "../resources/statements/DBS_POSB Consolidated Statement_Oct2025.pdf"

# Parse the document
result = parser.parse(pdf_path)

# Display basic results
print(f"Success: {result.success}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Transactions Extracted: {len(result.data)}")
print(f"Warnings: {len(result.warnings)}")

if result.abort_reason:
    print(f"Abort Reason: {result.abort_reason}")

## 2. Examine Warnings

The parser generates warnings for anomalies or issues encountered during parsing.

In [None]:
print("Warnings:")
print("=" * 60)
for i, warning in enumerate(result.warnings, 1):
    print(f"{i}. {warning}")

## 3. View Sample Transactions

Let's examine the first few transactions to understand the data structure.

In [None]:
# Display first 5 transactions
print("First 5 Transactions:")
print("=" * 100)

for i, txn in enumerate(result.data[:5], 1):
    print(f"\n[Transaction {i}]")
    print(f"  Date:        {txn['date']}")
    # Truncate long descriptions to 60 characters for display
    desc = txn['description']
    print(f"  Description: {desc[:60] + '...' if len(desc) > 60 else desc}")
    print(f"  Withdrawal:  {txn['withdrawal'] if txn['withdrawal'] else '-'}")
    print(f"  Deposit:     {txn['deposit'] if txn['deposit'] else '-'}")
    print(f"  Balance:     {txn['balance']}")
    print(f"  Page:        {txn['page']}")

## 4. Convert to Pandas DataFrame

For easier analysis and manipulation, convert the transaction data to a pandas DataFrame.

In [None]:
# Create DataFrame
df = pd.DataFrame(result.data)

# Display basic info
print(f"Total Transactions: {len(df)}")
print("\nDataFrame Info:")
df.info()  # info() prints its report directly and returns None, so it should not be wrapped in print()

# Display first few rows
print("\nFirst 10 Transactions:")
df.head(10)
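
If the amount columns arrive as strings or as a mix of types, the sums and comparisons in later cells will not behave numerically. The sketch below shows one way to coerce them after building the DataFrame. The sample rows are hypothetical and only mimic the field names used above; the real parser output may already be numeric, in which case this step is a no-op.

```python
import pandas as pd

# Hypothetical rows mimicking the fields accessed above; real parser output may differ.
sample_rows = [
    {"date": "01/10/2025", "description": "SALARY", "withdrawal": None,
     "deposit": 3000.0, "balance": 4000.0, "page": 1},
    {"date": "02/10/2025", "description": "PAYNOW TRANSFER", "withdrawal": "150.25",
     "deposit": None, "balance": 3849.75, "page": 1},
]
sample_df = pd.DataFrame(sample_rows)

amount_cols = ["withdrawal", "deposit", "balance"]
# errors="coerce" turns values that cannot be parsed into NaN instead of raising
sample_df[amount_cols] = sample_df[amount_cols].apply(pd.to_numeric, errors="coerce")

print(sample_df.dtypes)
```

Coercing silently hides unparseable values as NaN, so it may be worth counting the NaNs before and after if data quality is in doubt.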

## 5. Data Analysis

Perform basic analysis on the extracted transactions.

In [None]:
# Summary statistics
print("Transaction Statistics:")
print("=" * 60)
print(f"Total Transactions:     {len(df)}")
print(f"Transactions with Withdrawals: {df['withdrawal'].notna().sum()}")
print(f"Transactions with Deposits:    {df['deposit'].notna().sum()}")
print(f"\nTotal Withdrawals:  SGD {df['withdrawal'].sum():,.2f}")
print(f"Total Deposits:     SGD {df['deposit'].sum():,.2f}")
print(f"Net Change:         SGD {df['deposit'].sum() - df['withdrawal'].sum():,.2f}")

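# Opening and closing balances
# Missing amounts may be None or NaN; `x or 0` does not catch NaN because NaN is truthy
first = df.iloc[0]
first_withdrawal = 0 if pd.isna(first['withdrawal']) else first['withdrawal']
first_deposit = 0 if pd.isna(first['deposit']) else first['deposit']
opening_balance = first['balance'] + first_withdrawal - first_deposit
closing_balance = df.iloc[-1]['balance']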

print(f"\nOpening Balance:    SGD {opening_balance:,.2f}")
print(f"Closing Balance:    SGD {closing_balance:,.2f}")
print(f"Actual Change:      SGD {closing_balance - opening_balance:,.2f}")
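
The two figures printed above should agree: deposits minus withdrawals should equal the closing balance minus the opening balance. The sketch below makes that check explicit on hypothetical sample data. It assumes all rows belong to a single account and are in statement order; a consolidated statement covering several accounts would need the check applied per account.

```python
import pandas as pd

# Hypothetical sample rows; None becomes NaN in the numeric columns.
sample_df = pd.DataFrame({
    "withdrawal": [None, 150.25, 20.00],
    "deposit": [3000.00, None, None],
    "balance": [4000.00, 3849.75, 3829.75],
})

# Treat missing amounts as zero for the arithmetic
amounts = sample_df[["withdrawal", "deposit"]].fillna(0)
net_change = amounts["deposit"].sum() - amounts["withdrawal"].sum()

# Opening balance = first running balance with the first transaction undone
opening = (sample_df.iloc[0]["balance"]
           + amounts.iloc[0]["withdrawal"]
           - amounts.iloc[0]["deposit"])
closing = sample_df.iloc[-1]["balance"]

# Half a cent of tolerance absorbs floating point noise
reconciles = abs((closing - opening) - net_change) < 0.005

print(f"Net change from totals:   {net_change:,.2f}")
print(f"Net change from balances: {closing - opening:,.2f}")
print(f"Reconciles: {reconciles}")
```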

## 6. Filter and Search Transactions

Examples of how to filter and search through transactions.

In [None]:
# Find all deposits
deposits = df[df['deposit'].notna()]
print(f"Deposit Transactions ({len(deposits)}):")
print("=" * 100)
deposits[['date', 'description', 'deposit', 'balance']]

In [None]:
# Find large withdrawals (> 100 SGD)
large_withdrawals = df[df['withdrawal'] > 100]
print(f"Large Withdrawals > SGD 100 ({len(large_withdrawals)}):")
print("=" * 100)
large_withdrawals[['date', 'description', 'withdrawal', 'balance']]

In [None]:
# Search for specific keywords in description
keyword = "PAYNOW"  # Change this to search for different keywords
matching_txns = df[df['description'].str.contains(keyword, case=False, na=False)]

print(f"Transactions containing '{keyword}' ({len(matching_txns)}):")
print("=" * 100)
matching_txns[['date', 'description', 'withdrawal', 'deposit', 'balance']]
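
One caveat with the keyword search: `str.contains` interprets the pattern as a regular expression by default, so a keyword containing characters such as `(` or `.` may raise an error or match more than intended. Passing `regex=False` does a plain substring match instead. The descriptions below are made up for illustration.

```python
import pandas as pd

# Hypothetical descriptions, including a missing value
sample_df = pd.DataFrame({
    "description": [
        "PAYNOW TRANSFER (REF 123)",
        "GIRO PAYMENT",
        None,
        "paynow transfer (ref 456)",
    ]
})

# As a regex, "(REF" is an unterminated group; as a literal it is just a substring
keyword = "(REF"
mask = sample_df["description"].str.contains(keyword, case=False, na=False, regex=False)
matches = sample_df[mask]

print(matches)
```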

## 7. Validate Balance Continuity

Check that each transaction's balance follows correctly from the previous balance.

In [None]:
# Validate balance arithmetic
discrepancies = []

for i in range(1, len(df)):
    prev_balance = df.iloc[i-1]['balance']
    curr_balance = df.iloc[i]['balance']
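    # Missing amounts may be None or NaN; `x or 0` would let NaN through because NaN is truthy
    row = df.iloc[i]
    withdrawal = 0 if pd.isna(row['withdrawal']) else row['withdrawal']
    deposit = 0 if pd.isna(row['deposit']) else row['deposit']

    expected_balance = prev_balance - withdrawal + deposit
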
    # Allow for small floating point differences
    if abs(expected_balance - curr_balance) > 0.01:
        discrepancies.append({
            'index': i,
            'date': df.iloc[i]['date'],
            'expected': expected_balance,
            'actual': curr_balance,
            'difference': curr_balance - expected_balance
        })

if discrepancies:
    print(f"Found {len(discrepancies)} balance discrepancies:")
    for disc in discrepancies:
        print(f"  Transaction {disc['index']} ({disc['date']}): Expected {disc['expected']:.2f}, Got {disc['actual']:.2f}")
else:
    print("✓ All balance calculations are consistent!")
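
The same continuity check can be written without an explicit loop by shifting the balance column down one row. This is a sketch on hypothetical data, with the last balance deliberately wrong so that the check has something to find; it assumes the rows are in statement order and belong to one account.

```python
import pandas as pd

# Hypothetical sample; the final balance is deliberately off by 4.75
sample_df = pd.DataFrame({
    "date": ["01/10/2025", "02/10/2025", "03/10/2025", "04/10/2025"],
    "withdrawal": [None, 150.25, 20.00, 5.00],
    "deposit": [3000.00, None, None, None],
    "balance": [4000.00, 3849.75, 3829.75, 3820.00],
})

withdrawal = sample_df["withdrawal"].fillna(0)
deposit = sample_df["deposit"].fillna(0)

# shift(1) aligns each row with the previous row's balance
expected = sample_df["balance"].shift(1) - withdrawal + deposit
difference = sample_df["balance"] - expected

# The first row has no predecessor, so its difference is NaN and the comparison is False
mismatch = difference.abs() > 0.01

report = sample_df.loc[mismatch, ["date", "balance"]].assign(
    expected=expected[mismatch],
    difference=difference[mismatch],
)
print(report)
```

The loop version is easier to step through when debugging a single row; the vectorized version is shorter and scales better on long statements.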

## 8. Export to JSON

Save the extracted data to a JSON file for further processing.

In [None]:
# Create output dictionary
output_data = {
    "success": result.success,
    "confidence": result.confidence,
    "warnings": result.warnings,
    "transaction_count": len(result.data),
    "data": result.data
}

# Save to file
output_path = "../extracted_data_notebook.json"
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(output_data, f, indent=2, ensure_ascii=False)

print(f"Data exported to: {output_path}")
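
To confirm the export is usable downstream, the file can be read back and compared with what was written. The sketch below does this with a hypothetical payload that has the same keys as `output_data`, writing to a temporary directory rather than the real output path.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical payload with the same keys as output_data above
payload = {
    "success": True,
    "confidence": 0.97,
    "warnings": [],
    "transaction_count": 1,
    "data": [
        {"date": "01/10/2025", "description": "SALARY", "withdrawal": None,
         "deposit": 3000.0, "balance": 4000.0, "page": 1},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "extracted_data_example.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# None is written as null and read back as None, so the structures compare equal
round_trip_ok = loaded == payload
count_ok = loaded["transaction_count"] == len(loaded["data"])
print(f"Round trip OK: {round_trip_ok}, count matches: {count_ok}")
```

If the parser returns values that `json` cannot serialize natively, such as `Decimal` amounts or `datetime` objects, `json.dump` raises a `TypeError`; passing `default=str` is one way around that, at the cost of turning those values into strings.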

## 9. Experiment: Parse Specific Pages

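Each transaction records the page it was extracted from, so you can filter the DataFrame by page to inspect specific pages or scenarios without re-running the parser.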

In [None]:
# Example: Get transactions from page 2 only
page_2_txns = df[df['page'] == 2]

print(f"Transactions on Page 2: {len(page_2_txns)}")
print("=" * 100)
page_2_txns

## 10. Custom Analysis Playground

Use this cell to experiment with your own queries and analysis.

In [None]:
# Your custom analysis here

# Example: Group by date and sum amounts
# df['date_only'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
# daily_summary = df.groupby('date_only').agg({
#     'withdrawal': 'sum',
#     'deposit': 'sum'
# })
# print(daily_summary)
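
As a starting point, here is a runnable version of the daily summary on hypothetical sample data. It assumes dates use the `%d/%m/%Y` format shown in the commented example; adjust the format string if the parser emits dates differently.

```python
import pandas as pd

# Hypothetical sample rows in the assumed dd/mm/yyyy date format
sample_df = pd.DataFrame({
    "date": ["01/10/2025", "01/10/2025", "02/10/2025"],
    "withdrawal": [None, 150.25, 20.00],
    "deposit": [3000.00, None, None],
})

sample_df["date_only"] = pd.to_datetime(sample_df["date"], format="%d/%m/%Y")

# sum() skips NaN, so days with no deposits (or withdrawals) total 0
daily_summary = sample_df.groupby("date_only")[["withdrawal", "deposit"]].sum()
print(daily_summary)
```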