Bankable

End-to-end CLI tool for extracting structured transaction data from bank and credit card statement PDFs into clean CSV files.

Pipeline: PDF → Marker → .md → Substitutions → LLM → .csv → Pandas cleanup

Features

PDF to CSV in one command — single pdf_to_csv.py handles the entire pipeline
Plug-and-play LLM — switch Ollama models via --model flag without code changes
Batch processing — processes entire directories of PDFs, merges into one master CSV
Smart table extraction — custom patches fix Marker's handling of collapsed financial tables
Password support — handles encrypted PDFs via --password flag
Text substitutions — user-configurable find/replace rules (SUBSTITUTIONS.md) clean up OCR artifacts before LLM processing
Deterministic post-processing — Credit/Debit classification uses original Markdown markers, not LLM guesses
Apple Silicon optimized — MPS GPU acceleration where supported

Installation

uv sync

Ollama Setup

brew install ollama
ollama serve                 # runs in the foreground; leave it open and use a second terminal
ollama pull llama3.1:8b
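
Before running the pipeline it can help to confirm that the server is reachable and the model has been pulled. The sketch below is not part of Bankable. It assumes Ollama's `GET /api/tags` endpoint, which lists locally available models, and the default base URL `http://localhost:11434`. The function names `model_available` and `fetch_tags` are hypothetical.

```python
import json
import urllib.request


def model_available(tags: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    names = {entry.get("name", "") for entry in tags.get("models", [])}
    # A bare name such as "llama3.1" is treated as "llama3.1:latest".
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in names


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Query a running Ollama server for its locally pulled models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With the server running, `model_available(fetch_tags(), "llama3.1:8b")` should return True once the pull has finished.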

Quick Start

# Single file → produces statement.md + statement.csv
uv run pdf_to_csv.py statement.pdf

# Batch directory → produces per-file .md files + all_transactions.csv
uv run pdf_to_csv.py /path/to/statements/

# Encrypted PDFs
uv run pdf_to_csv.py /path/to/statements/ --password "your_password"

# Use a different model
uv run pdf_to_csv.py statement.pdf --model llama3.2:3b

# Overwrite existing outputs
uv run pdf_to_csv.py /path/to/statements/ --overwrite --password "your_password"
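
Existing outputs are skipped unless --overwrite is given. A minimal sketch of that selection step, assuming a PDF counts as already processed when a .csv with the same stem sits next to it (the real check in pdf_to_csv.py may differ, and `pdfs_to_process` is a hypothetical name):

```python
from pathlib import Path


def pdfs_to_process(input_path: Path, overwrite: bool = False) -> list[Path]:
    """List the PDFs that still need converting.

    Assumption: a PDF is "done" when <stem>.csv exists beside it.
    """
    if input_path.is_dir():
        # Batch mode: every PDF directly inside the directory, in name order.
        candidates = sorted(
            p for p in input_path.iterdir() if p.suffix.lower() == ".pdf"
        )
    else:
        candidates = [input_path]
    if overwrite:
        return candidates
    return [p for p in candidates if not p.with_suffix(".csv").exists()]
```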

Output

Single File

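statement.md: Markdown with extracted tables
statement.csv: cleaned transaction data

Directory

One .md file per input PDF
all_transactions.csv: master CSV merging the transactions from every statement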

CSV Format

Date,Time,Description,Amount,Type
13/03/2025,21:13:01,SWIGGY,937.00,D
02/04/2025,12:29:45,NETBANKING TRANSFER,42545.00,C

Type: C (Credit) or D (Debit)
Date: DD/MM/YYYY; Time: 24-hour HH:MM:SS
Sorted by date and time
Deduplicated across statements
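
Bankable performs this cleanup with Pandas. The dependency-free sketch below only illustrates the same idea with the standard library: drop duplicate rows, then sort chronologically. Treating two rows as duplicates only when every column matches is an assumption, and `clean_rows` is a hypothetical name.

```python
import csv
import io
from datetime import datetime


def clean_rows(csv_text: str) -> list[dict]:
    """Drop exact duplicate rows, then sort by date and time."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.values())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # Dates are day-first, so parse them instead of sorting them as text.
    unique.sort(
        key=lambda r: datetime.strptime(
            f"{r['Date']} {r['Time']}", "%d/%m/%Y %H:%M:%S"
        )
    )
    return unique
```

Parsing matters here: sorted as plain strings, 02/04/2025 would land before 13/03/2025 even though it is the later date.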

Command Line Options

| Option | Description |
| --- | --- |
| `input` | PDF file or directory of PDFs (required) |
| `--model`, `-m` | Ollama model for extraction (default: `llama3.1:8b`) |
| `--password`, `-p` | Password for encrypted PDFs |
| `--overwrite` | Overwrite existing output files |
| `--verbose`, `-v` | Verbose output |
| `--ollama-url` | Ollama base URL (default: `http://localhost:11434`) |
| `--config` | Marker config file (default: `MARKER.md`) |
| `--substitutions` | Substitutions config (default: `SUBSTITUTIONS.md`) |
| `--use-llm` | Enable Marker's built-in LLM mode |
| `--no-llm` | Disable Marker's built-in LLM mode |
| `--html-tables` | HTML tables in Markdown output |
| `--workers`, `-w` | Batch size for parallel processing |
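
For readers who want to script against the same interface, the options above could be declared with argparse roughly as follows. This is a sketch, not the actual parser in pdf_to_csv.py. The option names and defaults come from the table; the integer type for --workers and the choice to leave the LLM setting as None when neither LLM flag is passed (so that MARKER.md decides) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pdf_to_csv.py")
    parser.add_argument("input", help="PDF file or directory of PDFs")
    parser.add_argument("--model", "-m", default="llama3.1:8b")
    parser.add_argument("--password", "-p")
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--verbose", "-v", action="store_true")
    parser.add_argument("--ollama-url", default="http://localhost:11434")
    parser.add_argument("--config", default="MARKER.md")
    parser.add_argument("--substitutions", default="SUBSTITUTIONS.md")
    # --use-llm and --no-llm share one destination and cannot be combined.
    # None means neither flag was given (assumed: defer to MARKER.md).
    llm = parser.add_mutually_exclusive_group()
    llm.add_argument("--use-llm", dest="use_llm", action="store_true", default=None)
    llm.add_argument("--no-llm", dest="use_llm", action="store_false", default=None)
    parser.add_argument("--html-tables", action="store_true")
    parser.add_argument("--workers", "-w", type=int)
    return parser
```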

Configuration

`MARKER.md`

Marker conversion settings (output format, table processing, DPI, etc.). CLI flags override these.

`SUBSTITUTIONS.md`

User-maintained text replacements applied before LLM processing:

"<br>" → " "
"SMARTBUYBANGALORE" → "SMARTBUY BANGALORE"
"PayU*Swiggy Limited" → "SWIGGY"

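The rules above can be read and applied with a few lines of Python. This is a sketch under assumptions, not Bankable's own loader: it assumes one rule per line in the quoted `"find" → "replace"` form shown, literal (non-regex) matching, and application in file order. `parse_rules` and `apply_rules` are hypothetical names.

```python
import re

# One rule per line: "find" → "replace"
RULE = re.compile(r'^\s*"(.*?)"\s*→\s*"(.*)"\s*$')


def parse_rules(text: str) -> list[tuple[str, str]]:
    """Collect (find, replace) pairs; lines that are not rules are ignored."""
    rules = []
    for line in text.splitlines():
        match = RULE.match(line)
        if match:
            rules.append((match.group(1), match.group(2)))
    return rules


def apply_rules(markdown: str, rules: list[tuple[str, str]]) -> str:
    """Apply literal replacements in the order they were listed."""
    for find, replace in rules:
        markdown = markdown.replace(find, replace)
    return markdown
```

Order matters with literal replacement: a rule that rewrites `<br>` to a space runs before later rules see the text, so later patterns should be written against the already-substituted form.
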
Project Structure

bankable/
├── pdf_to_csv.py          # Main program — end-to-end PDF → CSV
├── marker_patched.py      # Marker subprocess wrapper (applies patches)
├── table_split_patch.py   # Fixes collapsed transaction rows in tables
├── pdfprovider_patch.py   # Adds password support to Marker's PDF provider
├── MARKER.md              # Marker configuration
├── SUBSTITUTIONS.md       # Text substitution rules
└── pyproject.toml         # Dependencies

Requirements

Python 3.13+
macOS with Apple Silicon recommended
Ollama with a model pulled (e.g. llama3.1:8b)

Notes

Existing files are skipped by default. Use --overwrite to reprocess.
GPU: Apple Silicon MPS where supported; some models fall back to CPU.
LLM accuracy: llama3.1:8b is recommended over llama3.2:3b because it makes fewer digit transposition errors.
Credit detection uses Cr markers from the original Markdown tables, not LLM output.
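
The last note describes a rule-based step, which can be sketched as follows. The exact shape of the Cr marker is an assumption here: the example supposes that credits appear in the Markdown table as an amount followed by `Cr` (for example `42,545.00 Cr`) and that unmarked amounts are debits. `classify_amount` is a hypothetical name, not a function from the repository.

```python
import re

# An amount with two decimals, optionally followed by a "Cr" marker.
AMOUNT = re.compile(
    r"(?P<number>\d[\d,]*\.\d{2})\s*(?P<credit>Cr\b)?", re.IGNORECASE
)


def classify_amount(cell: str) -> tuple[str, str]:
    """Return (amount, type) for one amount cell of a Markdown table row."""
    match = AMOUNT.search(cell)
    if match is None:
        raise ValueError(f"no amount found in {cell!r}")
    # Strip thousands separators so the value is ready for the CSV.
    amount = match.group("number").replace(",", "")
    return amount, "C" if match.group("credit") else "D"
```

Because the type comes from a pattern match on the source text, the same input always yields the same classification, regardless of which model produced the rest of the row.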

License

MIT License

Acknowledgments

Marker — PDF to Markdown conversion
Ollama — local LLM runtime
