A split-process-merge pipeline for converting large PDFs to Docling documents. It is useful for projects like soliplex that need to break large PDF files apart before processing.
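The split-process-merge pattern itself is straightforward; a minimal, self-contained sketch (illustrative only, not this package's API, using plain strings in place of PDF pages):

```python
# Sketch of split-process-merge: cut input into chunks, process chunks
# in parallel, then reassemble the results in order.
from concurrent.futures import ThreadPoolExecutor


def split(pages, max_pages):
    """Yield chunks of at most max_pages pages each."""
    for i in range(0, len(pages), max_pages):
        yield pages[i:i + max_pages]


def process(chunk):
    """Stand-in for per-chunk conversion work."""
    return [page.upper() for page in chunk]


def merge(results):
    """Reassemble processed chunks, preserving order."""
    return [page for chunk in results for page in chunk]


if __name__ == "__main__":
    pages = [f"page {n}" for n in range(1, 8)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        processed = list(pool.map(process, split(pages, max_pages=3)))
    print(merge(processed))
```

The real pipeline follows the same shape, with smarter chunk boundaries and Docling conversion as the per-chunk work.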
With pip:

```shell
pip install pdf-splitter          # core only
pip install "pdf-splitter[dev]"   # with test/lint tools
```

With uv:

```shell
uv add pdf-splitter               # core only
uv add pdf-splitter --group dev   # with test/lint tools
```

From source:

```shell
git clone https://github.com/soliplex/pdf-splitter.git
cd pdf-splitter

# pip
pip install -e .          # core only
pip install -e ".[dev]"   # with dev tools

# uv
uv sync                   # core only
uv sync --group dev       # with dev tools
```

Requires Python 3.12+.
```shell
pdf-splitter analyze doc.pdf                # analyze structure
pdf-splitter chunk doc.pdf -o ./chunks      # split into chunks
pdf-splitter convert ./chunks -o out.json   # process & merge
pdf-splitter validate out.json ./chunks     # validate output
```

| Option | Description |
|---|---|
| `-v` | Verbose logging |
| `-s <strategy>` | Force strategy: `fixed`, `hybrid`, or `enhanced` |
| `--max-pages N` | Max pages per chunk (default: 100) |
| `-w N` | Worker processes |
| `--keep-parts` | Keep individual chunk outputs |
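Under the fixed strategy, the `--max-pages` cap amounts to cutting the page range into equal-size windows. A sketch of that boundary computation (a hypothetical helper, not the package's internals):

```python
def fixed_chunk_ranges(total_pages: int, max_pages: int = 100) -> list[tuple[int, int]]:
    """Return inclusive 1-based (start, end) page ranges, each at most max_pages long."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(fixed_chunk_ranges(250))  # [(1, 100), (101, 200), (201, 250)]
```

The hybrid and enhanced strategies adjust these boundaries based on document structure rather than cutting at fixed intervals.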
```python
from pdf_splitter.segmentation_enhanced import smart_split_to_files
from pdf_splitter.processor import BatchProcessor
from pdf_splitter.reassembly import merge_from_results

chunks, _ = smart_split_to_files("doc.pdf", output_dir="./chunks")
results = BatchProcessor(max_workers=4).execute_parallel(chunks)
merged = merge_from_results(results)
merged.export_to_json("output.json")
```

```shell
uv run pytest       # run tests
uv run ruff check   # run ruff
uv run mypy src/    # run mypy
```