PDFcsv

Universal PDF to CSV Extractor

Extract tabular data from any PDF with intelligent column detection and bank statement support

pdf-to-csv · pdf-to-excel · pdf-to-json · bank-statement-parser

Quick Start

pip3 install git+https://github.com/stexz01/pdfcsv.git
pdfcsv your_file.pdf

Done! All dependencies included. Works on Linux, macOS, and Windows.

Demo

Interactive column selection with live preview

Installation

Option 1: pip install (Recommended)

pip3 install git+https://github.com/stexz01/pdfcsv.git

Option 2: Clone & Install

git clone https://github.com/stexz01/pdfcsv.git
cd pdfcsv
pip3 install -e .

Option 3: Direct Script

curl -O https://raw.githubusercontent.com/stexz01/pdfcsv/main/pdfcsv.py
pip3 install pdfplumber openpyxl
python3 pdfcsv.py input.pdf

Requirements: Python 3.8+

Usage

Basic Commands

Command	Description
`pdfcsv file.pdf`	Interactive mode - select columns with arrow keys
`pdfcsv file.pdf --columns 6`	Extract rows with 6 columns
`pdfcsv file.pdf --analyze`	Show column structure analysis
`pdfcsv file.pdf -o out.csv`	Custom output filename
`pdfcsv file.pdf -f excel`	Export as Excel (.xlsx)
`pdfcsv file.pdf --silent`	Minimal output (scripting)

Output Formats

Format	Flag	Extension
CSV	`-f csv`	.csv
Excel	`-f excel`	.xlsx
JSON	`-f json`	.json
JSON Lines	`-f jsonl`	.jsonl
Markdown	`-f markdown`	.md
TSV	`-f tsv`	.tsv

Examples

# Bank statement to CSV
pdfcsv statement.pdf

# Invoice to Excel with specific columns
pdfcsv invoice.pdf --columns 5 -f excel -o invoice.xlsx

# Analyze structure first
pdfcsv report.pdf --analyze
pdfcsv report.pdf --columns 7

# Silent mode for scripts
pdfcsv data.pdf --columns 4 --silent
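
Once a statement has been exported to CSV, a script usually needs numbers rather than strings. The sketch below is one way to load such a file, assuming the column names shown in the Bank Statement Support example (Date, Desc, Debit, Credit, Balance) and that empty debit/credit cells are written as `-`. The helper names `parse_amount` and `load_statement` are illustrative and not part of PDFcsv.

```python
import csv
import io
from decimal import Decimal
from typing import Dict, List, Optional

PLACEHOLDER = "-"  # marker used for an empty debit/credit cell


def parse_amount(cell: str) -> Optional[Decimal]:
    """Return None for the placeholder, otherwise the cell as a Decimal."""
    text = cell.strip().replace(",", "")
    if text in ("", PLACEHOLDER):
        return None
    return Decimal(text)


def load_statement(handle, amount_columns=("Debit", "Credit", "Balance")) -> List[Dict[str, object]]:
    """Read CSV rows and convert the named amount columns to Decimal or None."""
    rows = []
    for record in csv.DictReader(handle):
        row = dict(record)
        for name in amount_columns:
            if name in row:
                row[name] = parse_amount(row[name])
        rows.append(row)
    return rows


# In-memory sample mirroring the documented output; a real script would
# pass an open file handle instead (opened with newline="").
sample = io.StringIO(
    "Date,Desc,Debit,Credit,Balance\n"
    "01/01,ATM,500,-,1000\n"
    "01/02,Dep,-,200,1200\n"
)
rows = load_statement(sample)
total_debit = sum(r["Debit"] for r in rows if r["Debit"] is not None)
print(total_debit)  # prints 500
```

Using `Decimal` instead of `float` avoids rounding surprises when summing currency values.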

All Options

pdfcsv <file.pdf> [options]

Options:
  --columns N, --column N    Extract rows with N columns
  --analyze                  Show column distribution
  --gap N                    Gap threshold in pixels (default: 5)
  -o, --output FILE          Output filename
  -f, --format FORMAT        Output format (csv/excel/json/jsonl/md/tsv)
  --silent                   Minimal output (only result or errors)
  -h, --help                 Show help
  -v, --version              Show version
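
For batch jobs it can help to build the command line in one place. The following is a minimal sketch of a Python wrapper that uses only the flags listed above. The function names are hypothetical, the format names follow the Output Formats table (the options summary abbreviates Markdown as `md`, so check `pdfcsv --help` for the spelling your version accepts), and it assumes that pdfcsv exits with a non-zero status on failure, which this README does not state.

```python
import subprocess
from typing import List, Optional

# Format names as listed in the Output Formats table.
KNOWN_FORMATS = {"csv", "excel", "json", "jsonl", "markdown", "tsv"}


def build_pdfcsv_command(
    pdf_path: str,
    columns: int,
    fmt: str = "csv",
    output: Optional[str] = None,
    gap: Optional[int] = None,
) -> List[str]:
    """Assemble a non-interactive pdfcsv argument list.

    --columns is always passed, because running without it opens the
    interactive selector, which is unsuitable for unattended scripts.
    """
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    cmd = ["pdfcsv", pdf_path, "--silent", "-f", fmt, "--columns", str(columns)]
    if gap is not None:
        cmd += ["--gap", str(gap)]
    if output is not None:
        cmd += ["-o", output]
    return cmd


def run_pdfcsv(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True)
```

A call such as `run_pdfcsv(build_pdfcsv_command("data.pdf", 4))` then mirrors the silent-mode example from the Examples section.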

Features

Feature	Description
Smart Column Detection	Gap-based analysis using character X-positions
Bank Statement Support	Auto-fills empty debit/credit columns with `-`
Interactive CLI	Arrow-key navigation with live preview
Auto-Select	When only one map option is found, it is selected automatically
Multi-Format Export	CSV, Excel, JSON, JSON Lines, Markdown, TSV
15+ Languages	Bank keyword detection in multiple languages
Deduplication	Removes duplicate rows and columns
Encrypted PDF Handling	Helpful error messages for protected files

Bank Statement Support

PDFcsv automatically detects bank statements and handles missing columns:

Raw PDF:                          Processed Output:
Date | Desc | 500.00 | 1000      Date | Desc | Debit | Credit | Balance
                                 01/01 | ATM  | 500   | -      | 1000
                                 01/02 | Dep  | -     | 200    | 1200
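
One plausible way to decide whether an amount belongs under Debit or Credit is to compare its horizontal position with the header positions. The sketch below illustrates that idea only; it is not taken from the PDFcsv source, the function name `place_by_position` is hypothetical, and the x-coordinates are made up for the example.

```python
from typing import List, Optional, Tuple

FILL = "-"  # placeholder written into empty cells


def place_by_position(
    header: List[Tuple[str, float]],
    cells: List[Tuple[str, float]],
) -> List[str]:
    """Put each cell under the header whose x-center is closest, then fill gaps."""
    row: List[Optional[str]] = [None] * len(header)
    for text, x in cells:
        nearest = min(range(len(header)), key=lambda i: abs(header[i][1] - x))
        # If two cells land in the same column, join them with a space.
        row[nearest] = text if row[nearest] is None else row[nearest] + " " + text
    return [cell if cell is not None else FILL for cell in row]


# (label, x-center) pairs; coordinates are invented for illustration.
header = [("Date", 40.0), ("Desc", 120.0), ("Debit", 300.0), ("Credit", 380.0), ("Balance", 460.0)]
withdrawal = [("01/01", 42.0), ("ATM", 118.0), ("500", 305.0), ("1000", 455.0)]
deposit = [("01/02", 42.0), ("Dep", 118.0), ("200", 378.0), ("1200", 455.0)]

print(place_by_position(header, withdrawal))  # ['01/01', 'ATM', '500', '-', '1000']
print(place_by_position(header, deposit))     # ['01/02', 'Dep', '-', '200', '1200']
```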

Supported languages include English, Spanish, French, German, Portuguese, Italian, Dutch, Hindi, Arabic, Chinese, Japanese, Korean, Russian, and Turkish.
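
Keyword-based detection can be pictured as matching header cells against per-role word lists. The sketch below is an assumption about the general approach, not the contents of banking_keywords.py: the `KEYWORDS` table is a small hypothetical sample, and the rule "a header with both a debit and a credit column is a bank statement" is chosen only for illustration.

```python
import unicodedata
from typing import Iterable, Optional

# Hypothetical sample; the real lists live in banking_keywords.py and may differ.
KEYWORDS = {
    "debit": {"debit", "withdrawal", "debito", "soll"},
    "credit": {"credit", "deposit", "credito", "haben"},
    "balance": {"balance", "saldo", "solde"},
}


def normalize(text: str) -> str:
    """Lowercase and strip accents so 'Débito' and 'debito' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).strip()


def classify_header(cell: str) -> Optional[str]:
    """Return the role ('debit', 'credit', 'balance') for a header cell, if any."""
    word = normalize(cell)
    for role, words in KEYWORDS.items():
        if word in words:
            return role
    return None


def looks_like_bank_statement(header_cells: Iterable[str]) -> bool:
    """Treat a header as a bank statement when it has debit and credit columns."""
    roles = {classify_header(cell) for cell in header_cells}
    return {"debit", "credit"} <= roles


print(looks_like_bank_statement(["Fecha", "Descripción", "Débito", "Crédito", "Saldo"]))
print(looks_like_bank_statement(["Item", "Qty", "Price"]))
```

Accent stripping keeps the keyword table short, at the cost of treating a few distinct words as equal.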

Interactive Mode

Run without --columns to enter interactive selection:

  (3) Map found - Select the perfect one

  ──────────────────────────────────────────────────────────
  > (46 rows) 7 columns
    (12 rows) 4 columns
    (5 rows) 2 columns
  ──────────────────────────────────────────────────────────
  columns: Date | Chq No | Particulars | Debit | Credit | Balance | Init.
        1. 02-04-2023 | - | Pay/ONSPG202 | - | 9170.00 | 1266873.70 | 115
        2. 03-04-2023 | - | PAYMENTSSERV | - | 21000.00 | 1287873.70 | 248
  ──────────────────────────────────────────────────────────
  [up/down] move  [Enter] select  [q] quit

If only one option is found, it is selected automatically without prompting
Use arrow keys to navigate
Press Enter to confirm
Press q to cancel

Troubleshooting

Password-Protected PDF

# Remove protection first
qpdf --decrypt input.pdf unlocked.pdf
pdfcsv unlocked.pdf

Wrong Column Count

# Analyze structure first
pdfcsv file.pdf --analyze

# Adjust gap threshold
pdfcsv file.pdf --gap 10  # Wider gaps
pdfcsv file.pdf --gap 3   # Tighter text

No Matching Rows

# Check available column counts
pdfcsv file.pdf --analyze

# Extract specific count
pdfcsv file.pdf --columns 5

How It Works

┌─────────────────────────────────────────────────────────────────────┐
│                         PDF DOCUMENT                                │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│  1. CHARACTER EXTRACTION                                            │
│     └─ Extract all characters with X,Y coordinates (pdfplumber)     │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│  2. LINE GROUPING                                                   │
│     └─ Group characters by Y position (same row = same Y)           │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│  3. GAP-BASED COLUMN DETECTION                                      │
│     └─ Detect gaps between characters (gap > threshold = new col)   │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│  4. STRUCTURE ANALYSIS                                              │
│     └─ Group rows by column count, find most common structure       │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│  5. BANK STATEMENT PROCESSING (if detected)                         │
│     └─ Identify debit/credit columns, fill missing with '-'         │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│  6. OUTPUT                                                          │
│     └─ Export to CSV / Excel / JSON / JSONL / Markdown / TSV        │
└─────────────────────────────────────────────────────────────────────┘
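
Steps 2 to 4 of the diagram can be sketched in a few lines of plain Python. This is an illustration of the technique under stated assumptions, not the code in pdfcsv.py: the function names, the `y_tolerance` parameter, and the `fake_chars` helper are hypothetical, and the input uses synthetic dictionaries with the same `text`, `x0`, `x1`, and `top` keys that pdfplumber exposes for each character.

```python
from collections import Counter
from typing import Dict, List, Tuple

Char = Dict[str, object]  # keys used here: "text", "x0", "x1", "top"


def group_into_lines(chars: List[Char], y_tolerance: float = 2.0) -> List[List[Char]]:
    """Step 2: bucket characters whose 'top' values lie within y_tolerance."""
    lines: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(lines[-1][0]["top"] - ch["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return lines


def split_into_columns(line: List[Char], gap: float = 5.0) -> List[str]:
    """Step 3: start a new cell whenever the horizontal gap exceeds `gap`."""
    cells: List[str] = []
    current = ""
    prev_x1 = None
    for ch in sorted(line, key=lambda c: c["x0"]):
        if prev_x1 is not None and ch["x0"] - prev_x1 > gap:
            cells.append(current)
            current = ""
        current += ch["text"]
        prev_x1 = ch["x1"]
    if current:
        cells.append(current)
    return cells


def column_distribution(rows: List[List[str]]) -> Counter:
    """Step 4: count how many rows share each column count."""
    return Counter(len(row) for row in rows)


def fake_chars(words: List[Tuple[float, str]], top: float, char_width: float = 5.0) -> List[Char]:
    """Build synthetic character dicts for words placed at given x offsets."""
    out: List[Char] = []
    for x, word in words:
        for i, letter in enumerate(word):
            x0 = x + i * char_width
            out.append({"text": letter, "x0": x0, "x1": x0 + char_width, "top": top})
    return out


chars = (
    fake_chars([(10, "01/01"), (60, "ATM"), (120, "500")], top=100.0)
    + fake_chars([(10, "01/02"), (60, "Dep"), (120, "200")], top=112.0)
    + fake_chars([(10, "Page"), (40, "1")], top=300.0)
)
rows = [split_into_columns(line) for line in group_into_lines(chars)]
print(column_distribution(rows).most_common())  # [(3, 2), (2, 1)]
```

With a real document, the `chars` list could come from a pdfplumber page's `chars` attribute. Raising `gap` merges neighbouring cells and lowering it splits them, which is the effect the `--gap` option controls.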

Project Structure

pdfcsv/
├── pdfcsv.py             # Main CLI tool
├── banking_keywords.py   # Multi-language bank keywords
├── pyproject.toml        # Package configuration
├── requirements.txt      # Dependencies
├── LICENSE               # MIT License
├── README.md             # Documentation
└── assets/
    └── pdfcsv_demo.gif   # CLI demo

Contributing

Contributions are welcome! Here's how to get started:

Development Setup

git clone https://github.com/stexz01/pdfcsv.git
cd pdfcsv
pip3 install -e ".[dev]"

Areas for Contribution

Additional bank statement formats
More language keywords in banking_keywords.py
Performance optimizations for large PDFs
Graphical interface (web or desktop)
Test cases and CI/CD pipeline
Documentation improvements

Pull Request Process

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Test with various PDF types
Commit (git commit -m 'Add amazing feature')
Push (git push origin feature/amazing-feature)
Open a Pull Request

Changelog

v1.4.0 (Current)

Multiple output formats: CSV, TSV, JSON, JSONL, Markdown, Excel
--format / -f flag for output type selection
--silent flag for minimal output (scripting)
--column alias for --columns
Auto-select when only one map option exists
All dependencies bundled (no extras needed)

v1.3.x

Interactive column selector with arrow keys
Bank statement gap-filling with position analysis
Multi-language keyword support
Automatic deduplication
Encrypted PDF handling

License

MIT License - see LICENSE file.

Made with ❤️ by @stexz01

Keywords: pdf to csv, pdf to excel, pdf to json, extract table from pdf,
bank statement parser, pdf table extractor, python pdf parser, convert pdf to csv,
pdf data extraction, tabular data extractor, financial pdf parser
