End-to-end CLI tool for extracting structured transaction data from bank and credit card statement PDFs into clean CSV files.
Pipeline: PDF → Marker → .md → Substitutions → LLM → .csv → Pandas cleanup
- PDF to CSV in one command: a single `pdf_to_csv.py` handles the entire pipeline
- Plug-and-play LLM: switch Ollama models via the `--model` flag without code changes
- Batch processing: processes entire directories of PDFs and merges them into one master CSV
- Smart table extraction: custom patches fix Marker's handling of collapsed financial tables
- Password support: handles encrypted PDFs via the `--password` flag
- Text substitutions: user-configurable find/replace rules (`SUBSTITUTIONS.md`) clean up OCR artifacts before LLM processing
- Deterministic post-processing: Credit/Debit classification uses original Markdown markers, not LLM guesses
- Apple Silicon optimized: MPS GPU acceleration where supported
uv sync
brew install ollama
ollama serve
ollama pull llama3.1:8b

# Single file → produces statement.md + statement.csv
uv run pdf_to_csv.py statement.pdf
# Batch directory → produces per-file .md files + all_transactions.csv
uv run pdf_to_csv.py /path/to/statements/
# Encrypted PDFs
uv run pdf_to_csv.py /path/to/statements/ --password "your_password"
# Use a different model
uv run pdf_to_csv.py statement.pdf --model llama3.2:3b
# Overwrite existing outputs
uv run pdf_to_csv.py /path/to/statements/ --overwrite --password "your_password"

Single file:
- `statement.md`: Markdown with extracted tables
- `statement.csv`: cleaned transaction data

Batch directory:
- `<name>.md` for each PDF: individual Markdown files
- `all_transactions.csv`: merged, deduplicated, sorted master CSV

Date,Time,Description,Amount,Type
13/03/2025,21:13:01,SWIGGY,937.00,D
02/04/2025,12:29:45,NETBANKING TRANSFER,42545.00,C
- Type: `C` (Credit) or `D` (Debit)
- Sorted by date and time
- Deduplicated across statements
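
The merge step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library rather than the Pandas cleanup the tool itself performs; the function name `merge_transactions` is hypothetical, the `DD/MM/YYYY` date format is inferred from the sample rows, and treating two rows as duplicates only when all five columns match is an assumption.

```python
import csv
import io
from datetime import datetime

COLUMNS = ("Date", "Time", "Description", "Amount", "Type")


def merge_transactions(csv_texts):
    """Merge several CSV texts, drop exact duplicate rows, sort by date and time."""
    seen = set()
    rows = []
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            # Assumption: a duplicate is a row whose five columns all match.
            key = tuple(row[col] for col in COLUMNS)
            if key not in seen:
                seen.add(key)
                rows.append(row)
    # Dates in the sample output are DD/MM/YYYY, times are HH:MM:SS.
    rows.sort(
        key=lambda r: datetime.strptime(
            f'{r["Date"]} {r["Time"]}', "%d/%m/%Y %H:%M:%S"
        )
    )
    return rows
```

Parsing the date before sorting matters here: sorting `DD/MM/YYYY` strings lexicographically would order rows by day of month rather than chronologically.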
| Option | Description |
|---|---|
| `input` | PDF file or directory of PDFs (required) |
| `--model`, `-m` | Ollama model for extraction (default: `llama3.1:8b`) |
| `--password`, `-p` | Password for encrypted PDFs |
| `--overwrite` | Overwrite existing output files |
| `--verbose`, `-v` | Verbose output |
| `--ollama-url` | Ollama base URL (default: `http://localhost:11434`) |
| `--config` | Marker config file (default: `MARKER.md`) |
| `--substitutions` | Substitutions config (default: `SUBSTITUTIONS.md`) |
| `--use-llm` | Enable Marker's built-in LLM mode |
| `--no-llm` | Disable Marker's built-in LLM mode |
| `--html-tables` | HTML tables in Markdown output |
| `--workers`, `-w` | Batch size for parallel processing |

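
For readers who want to see how these options fit together, here is a hedged `argparse` sketch that mirrors the table. It is not the actual parser in `pdf_to_csv.py`: the function name `build_parser`, the integer type for `--workers`, and the choice to make `--use-llm` and `--no-llm` mutually exclusive with an unset (`None`) default are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdf_to_csv.py")
    parser.add_argument("input", help="PDF file or directory of PDFs")
    parser.add_argument("--model", "-m", default="llama3.1:8b",
                        help="Ollama model for extraction")
    parser.add_argument("--password", "-p", default=None,
                        help="Password for encrypted PDFs")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing output files")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Verbose output")
    parser.add_argument("--ollama-url", default="http://localhost:11434",
                        help="Ollama base URL")
    parser.add_argument("--config", default="MARKER.md",
                        help="Marker config file")
    parser.add_argument("--substitutions", default="SUBSTITUTIONS.md",
                        help="Substitutions config")
    # Assumption: the two LLM flags exclude each other; None means
    # "defer to whatever the Marker config file says".
    llm = parser.add_mutually_exclusive_group()
    llm.add_argument("--use-llm", dest="use_llm", action="store_true",
                     default=None, help="Enable Marker's built-in LLM mode")
    llm.add_argument("--no-llm", dest="use_llm", action="store_false",
                     default=None, help="Disable Marker's built-in LLM mode")
    parser.add_argument("--html-tables", action="store_true",
                        help="HTML tables in Markdown output")
    # Assumption: the batch size is an integer.
    parser.add_argument("--workers", "-w", type=int, default=None,
                        help="Batch size for parallel processing")
    return parser
```

Leaving `use_llm` as `None` when neither flag is given is one way to let a config file decide while still letting a CLI flag override it.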
`MARKER.md`: Marker conversion settings (output format, table processing, DPI, etc.). CLI flags override these.

`SUBSTITUTIONS.md`: User-maintained text replacements applied before LLM processing:
"<br>" → " "
"SMARTBUYBANGALORE" → "SMARTBUY BANGALORE"
"PayU*Swiggy Limited" → "SWIGGY"
bankable/
├── pdf_to_csv.py # Main program — end-to-end PDF → CSV
├── marker_patched.py # Marker subprocess wrapper (applies patches)
├── table_split_patch.py # Fixes collapsed transaction rows in tables
├── pdfprovider_patch.py # Adds password support to Marker's PDF provider
├── MARKER.md # Marker configuration
├── SUBSTITUTIONS.md # Text substitution rules
└── pyproject.toml # Dependencies
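
Of the files above, `SUBSTITUTIONS.md` holds plain find/replace pairs that are applied to the Markdown before it reaches the LLM. A minimal sketch of how such rules could be parsed and applied is shown below. It assumes each rule sits on its own line in the form `"find" → "replace"`; the names `parse_rules` and `apply_rules` are hypothetical and the real parsing in `pdf_to_csv.py` may differ.

```python
import re

# Assumption: one rule per line, written as "find" → "replace".
RULE = re.compile(r'^\s*"(?P<find>.+?)"\s*→\s*"(?P<replace>.*)"\s*$')


def parse_rules(text):
    """Return (find, replace) pairs for every line that looks like a rule."""
    rules = []
    for line in text.splitlines():
        match = RULE.match(line)
        if match:
            rules.append((match.group("find"), match.group("replace")))
    return rules


def apply_rules(markdown, rules):
    """Apply each rule in order as a literal (non-regex) replacement."""
    for find, replace in rules:
        markdown = markdown.replace(find, replace)
    return markdown
```

Using `str.replace` rather than a regex substitution keeps characters such as `*` in `PayU*Swiggy Limited` literal, so rule authors do not need to escape anything.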
- Python 3.13+
- macOS with Apple Silicon recommended
- Ollama with a model pulled (e.g. `llama3.1:8b`)
- Existing files are skipped by default. Use `--overwrite` to reprocess.
- GPU: Apple Silicon MPS where supported; some models fall back to CPU.
- LLM accuracy: `llama3.1:8b` is recommended over `llama3.2:3b` for fewer digit transposition errors.
- Credit detection uses `Cr` markers from the original Markdown tables, not LLM output.
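
The last note can be illustrated with a small sketch of marker-based classification. The row layout here is an assumption: it supposes the amount is the last cell of a Markdown table row and that credits carry a trailing `Cr`. The function name `classify_row` is hypothetical, and real statements may place the marker elsewhere.

```python
import re

# Matches "Cr" as a standalone token, e.g. "42,545.00 Cr".
CR_MARKER = re.compile(r"\bCr\b")


def classify_row(markdown_row):
    """Return "C" if the amount cell carries a Cr marker, otherwise "D"."""
    # Split "| a | b | c |" into trimmed cells.
    cells = [cell.strip() for cell in markdown_row.strip().strip("|").split("|")]
    # Assumption: the amount is the last cell in the row.
    amount_cell = cells[-1]
    return "C" if CR_MARKER.search(amount_cell) else "D"
```

Because the result depends only on the text Marker extracted, the same row always yields the same type, regardless of which model produced the rest of the CSV.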
MIT License