A split-process-merge pipeline for converting large PDFs to Docling documents. It is useful for projects like soliplex that need to break large PDF files apart before processing.
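The split-process-merge pattern itself is straightforward; a minimal, self-contained sketch (illustrative only, not this package's API, using plain strings in place of PDF pages):

```python
# Sketch of split-process-merge: cut input into chunks, process chunks
# in parallel, then reassemble the results in order.
from concurrent.futures import ThreadPoolExecutor


def split(pages, max_pages):
    """Yield chunks of at most max_pages pages each."""
    for i in range(0, len(pages), max_pages):
        yield pages[i:i + max_pages]


def process(chunk):
    """Stand-in for per-chunk conversion work."""
    return [page.upper() for page in chunk]


def merge(results):
    """Reassemble processed chunks, preserving order."""
    return [page for chunk in results for page in chunk]


if __name__ == "__main__":
    pages = [f"page {n}" for n in range(1, 8)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        processed = list(pool.map(process, split(pages, max_pages=3)))
    print(merge(processed))
```

The real pipeline follows the same shape, with smarter chunk boundaries and Docling conversion as the per-chunk work.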
With pip:

```shell
pip install pdf-splitter          # core only
pip install "pdf-splitter[dev]"   # with test/lint tools
```

With uv:

```shell
uv add pdf-splitter               # core only
uv add pdf-splitter --group dev   # with test/lint tools
```

From source:

```shell
git clone https://github.com/soliplex/pdf-splitter.git
cd pdf-splitter

# pip
pip install -e .          # core only
pip install -e ".[dev]"   # with dev tools

# uv
uv sync                   # core only
uv sync --group dev       # with dev tools
```

Requires Python 3.12+.
```shell
pdf-splitter analyze doc.pdf                # analyze structure
pdf-splitter chunk doc.pdf -o ./chunks      # split into chunks
pdf-splitter convert ./chunks -o out.json   # process & merge
pdf-splitter validate out.json ./chunks     # validate output
```

| Option | Description |
|---|---|
| `-v` | Verbose logging |
| `-s <strategy>` | Force strategy: `fixed`, `hybrid`, or `enhanced` |
| `--max-pages N` | Max pages per chunk (default: 100) |
| `-w N` | Worker processes |
| `--keep-parts` | Keep individual chunk outputs |
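Under the fixed strategy, the `--max-pages` cap amounts to cutting the page range into equal-size windows. A sketch of that boundary computation (a hypothetical helper, not the package's internals):

```python
def fixed_chunk_ranges(total_pages: int, max_pages: int = 100) -> list[tuple[int, int]]:
    """Return inclusive 1-based (start, end) page ranges, each at most max_pages long."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(fixed_chunk_ranges(250))  # [(1, 100), (101, 200), (201, 250)]
```

The hybrid and enhanced strategies adjust these boundaries based on document structure rather than cutting at fixed intervals.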
```python
from pdf_splitter.segmentation_enhanced import smart_split_to_files
from pdf_splitter.processor import BatchProcessor
from pdf_splitter.reassembly import merge_from_results

chunks, _ = smart_split_to_files("doc.pdf", output_dir="./chunks")
results = BatchProcessor(max_workers=4).execute_parallel(chunks)
merged = merge_from_results(results)
merged.export_to_json("output.json")
```

```shell
uv run pytest       # run tests
uv run ruff check   # run ruff
uv run mypy src/    # run mypy
```