A fast, pure-Rust document parser built for agents & RAG.
Extract positioned, structured content from 12+ formats — every chunk citable with page + bbox.
English | 中文
docparse-rs turns PDF · DOCX · HTML · XLSX · PPTX · Markdown · CSV · SRT/VTT · LaTeX · EML · PNG/JPEG · AsciiDoc into a unified intermediate representation, then emits JSON / Markdown / Text / RAG chunks. It takes the structure-extraction fast path — interpreting PDF content streams for coordinates instead of rendering pages to pixels — so a warm parse is <10 ms (~700 pages/s) and output is byte-identical for identical input. One ~29 MB binary, no JVM / C++ / Python, zero runtime dependencies.
▶️ If the inline player doesn't load, watch / downloadfastdemo.mp4.
- 🦀 One pure-Rust binary — ~29 MB, zero runtime deps, <10 ms warm parse (~700 pages/s)
- 🔌 Four faces, one output — CLI / library / MCP (stdio) / REST, byte-identical across all
- 📍 RAG-native citations — every chunk carries page + bbox + heading breadcrumbs +
section_id;locate(x, y)reverse lookup, 100% coverage - 🌲 Document structure tree — nested sections (title/level/page/bbox) for agentic navigation (
-f outline, MCPoutline); chunks carrysection_idback into the tree - 🖼️ Figures as retrievable chunks — embedded images (PDF + DOCX + PPTX + HTML) become
imagechunks: caption + surrounding context fill the searchabletext, whilefile/base64 + page + bbox let RAG render & cite them. Caption binds the adjacent in-document "Figure N" line (or HTMLalt) for free;--vlm-describeupgrades it to a neural description - 📦 OKF knowledge bundles — first-class Open Knowledge Format producer (
-f okf): structure tree → vendor-neutral, git-native Markdown bundle; citable (page+bbox), deterministic (byte-identical, no wall clock), one binary — no Python/JVM/service. Also MCPexport_okf/ REST?format=okf - 🔍 In-process OCR —
--ocrruns ONNX ontract(PP-OCRv6 tiny by default; offers to fetch ~7 MB on first use); digital pages never touch a model; CCITT G3/G4 fax + JBIG2 scans covered - 🧠 Embedded models, opt-in — merged-cell table structure, formula→LaTeX, full-page transcription (UniRec-0.1B), plus PP-DocLayoutV2 / DocLayout-YOLO layout
- 🛡️ Security pre-checks — hidden-text filtering (flagged & auditable, never silently dropped), zip-bomb & page-count guards, per-page complexity profiling
- 🧩 Pluggable AI boundary — the deterministic core stands alone; models trigger only on hard pages and carry a
sourcetag + capped confidence
Prebuilt binary — no toolchain, macOS · Linux · Windows (from Releases):
# macOS / Linux
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/yzlabai/docparse-rs/releases/latest/download/docparse-cli-installer.sh | sh# Windows (PowerShell)
powershell -c "irm https://github.com/yzlabai/docparse-rs/releases/latest/download/docparse-cli-installer.ps1 | iex"With Cargo (builds from source; honors the vendored tract patch):
cargo install --git https://github.com/yzlabai/docparse-rs docparse-cliBuild locally (development):
git clone https://github.com/yzlabai/docparse-rs && cd docparse-rs
cargo build --release # → ./target/release/docparseAll three give you the docparse binary. The core needs no models — optional tiers are fetched on demand (Optional models).
docparse input.pdf -f json # full IR: provenance + coordinates
docparse input.pdf -f markdown # Markdown
docparse input.pdf -f chunks # RAG chunks (page + bbox + breadcrumbs + section_id)
docparse input.pdf -f outline # document structure tree (nested sections, citable)
docparse input.pdf -f okf # OKF knowledge bundle → report-okf/ (git-native, --okf-tar for stdout)
docparse ./papers --out-dir out/ --jobs 8 # batch a folder (file-level parallelism for digital docs)
docparse scan.pdf --ocr # OCR scans (free for digital pages; offers to fetch models/ppocr-v6 on first use)More commands — layout · tables · formulas · VLM
docparse hard.pdf --layout # layout-model reading order (DocLayout-YOLO; needs models/layout)
docparse hard.pdf --layout --layout-model models/layout-ppv2/PP-DoclayoutV2_simp.onnx # PP-DocLayoutV2 backend (~3x YOLO on messy tables)
docparse doc.pdf --table-model models/unirec # merged-cell table structure (in-process, no service)
docparse doc.pdf --formula-model models/unirec # formula → LaTeX
docparse doc.pdf --transcribe-model models/unirec # full-page transcription (zh/en hard layouts & scans)
docparse doc.pdf --vlm-describe --vlm-url URL --vlm-model M # figure captions via an OpenAI-compatible VLM (written onto each image chunk)
docparse doc.pdf --vlm-tables --vlm-url URL --vlm-model M # VLM table re-extraction (failures keep the deterministic grid)
docparse doc.pdf --image-dir imgs/ -f chunks # export images + emit them as RAG image chunks (caption + context, file/bbox)
docparse doc.pdf --image-dir imgs/ # export embedded images (JSON "file" / Markdown ![alt])
docparse input.pdf --quality --profile --route-plan # quality / per-page profile / routing (JSON on stderr)claude mcp add docparse -- docparse mcp # MCP tools: parse_document / get_chunks / outline / export_okf / locate
# + outputSchema/structuredContent, resources (schemas + decision guide), prompts
docparse serve --port 8642 # REST: POST /parse + GET /healthz + GET /openapi.json + /schema/{name}
curl -F "file=@doc.pdf" "http://127.0.0.1:8642/parse?format=chunks&ocr=true"Machine-readable contract — every output format has a JSON Schema (draft 2020-12) generated from the code (one source of truth, golden-tested against drift), so external projects can codegen typed clients instead of reading prose. docparse schema [--write] emits them (schemas/); REST serves GET /openapi.json + GET /schema/{name}; MCP advertises them per-tool and as resources. See agent integration → §6b.
# Python / LangChain (clients/python — zero-dependency thin client)
from docparse_client.langchain import DocparseLoader
docs = DocparseLoader("paper.pdf").load # one Document per chunk, page + bbox metadata// TypeScript / Node (clients/typescript — zero-dependency, LangChain.js + Vercel AI SDK adapters)
import { DocparseClient } from 'docparse-client';
const chunks = await new DocparseClient.chunks('paper.pdf'); // text + page + bbox + breadcrumbAgent Skill — a SKILL.md bundle that teaches a coding agent (Claude Code / Cursor) to drive the docparse CLI: format selection, OCR/table/formula decision matrix, and a parse → self-check (--quality/--profile) → refine loop. Symlink it where the agent looks for skills:
mkdir -p .claude/skills
ln -s "$(pwd)/skills/docparse-document-intelligence" .claude/skills/ # or ~/.claude/skills (global), ~/.cursor/skills (Cursor)docparse-rs is a first-class producer for Open Knowledge Format (OKF v0.1, Google Cloud) — the vendor-neutral, git-native way to ship parsed documents to any knowledge base or agent. -f okf turns a document's structure tree into a directory of Markdown + YAML-frontmatter concept files (one per section, nested to mirror the outline) that any OKF-aware tool mounts with zero adaptation — no bespoke chunk JSON, no vector-store glue.
docparse report.pdf -f okf # → report-okf/ (one concept .md per section)
docparse report.pdf -f okf --okf-tar | tar x -C kb/ # stream a tar to a knowledge base
git -C report-okf add . && git commit -m "knowledge: report" # versionable, diff-ableWhat makes docparse-rs's OKF distinctive — and fills the gap left by heavier producers (llmsherpa, Docling) that need Python/JVM + models + a running server:
- Citable — every concept's
resourcecarriespage+bbox, so answers trace back to the exact spot in the source. - Deterministic — same source ⇒ byte-identical bundle (timestamps come from file mtime, never the wall clock), so
git diffis meaningful and re-exports are reproducible. - Zero-dependency — one Rust binary, no model download or service to deliver a bundle.
Also available as the MCP export_okf tool and POST /parse?format=okf (deterministic application/x-tar). See agent integration →.
Scored on OmniDocBench (CVPR 2025) against human ground truth, using the embedded UniRec models:
| Dimension | Path | Score |
|---|---|---|
| Text recognition | --transcribe-model, papers |
0.872 |
| Formula → LaTeX | --formula-model, papers |
0.874 |
| Table structure | --table-model, clean tables |
0.810 (median 0.895) |
Text and formula are near paper-level (~0.87). The remaining gap is hard academic tables (multi-row headers + dense numbers + embedded LaTeX). A proxy "Overall" ≈ 75 puts us in the pipeline-tool tier (Marker 78, Docling ~80–85; dedicated VLMs 90+) — see the scoreboard & method →.
| docparse-rs | liteparse | Docling | OpenDataLoader | MarkItDown | |
|---|---|---|---|---|---|
| Deploy | pure-Rust ~29 MB binary, zero runtime deps | Rust + PDFium/Tesseract (C++); LibreOffice/ImageMagick for non-PDF | Python + models (GB env) | Java / JVM | Python |
| PDF engine | own content-stream interpreter | wraps PDFium | own | veraPDF | (delegates) |
| Determinism | byte-identical default path | deterministic | not strictly | deterministic | deterministic |
| Citations | page+bbox both ways, 100% | bbox per text element | element-level | coordinates | none |
| Output | JSON / Markdown / text / RAG chunks / structure tree / OKF bundle | JSON / text / PNG | Markdown / JSON | JSON / Markdown | Markdown |
| Formats | 12, all in-process | PDF native; others via external convert | 15+ | PDF-focused | 20+ |
| Hard pages | opt-in embedded models (table/formula/CJK) | none (by design) | neural layout | rule-based | none |
| Speed (born-digital) | <10 ms / ~700 pg/s | fast | seconds/page | fast | fast |
Closest peer — liteparse (run-llama, also Rust + deterministic + bbox-first): the design philosophies overlap, the tradeoffs differ. liteparse wraps PDFium for PDF text, bundles Tesseract, and converts DOCX/XLSX/PPTX/images through LibreOffice + ImageMagick — so it carries native C++ deps and external tools, where docparse-rs is a single zero-dependency binary with its own from-scratch PDF interpreter and in-process parsers for all 12 formats. liteparse wins on reach: WASM/browser builds, first-class Node/Python bindings, npm/pip/cargo install, and Tesseract's broad multi-language OCR out of the box (docparse-rs OCR is zh/en-focused). docparse-rs adds Markdown + RAG-chunk output with heading breadcrumbs and bidirectional locate, plus opt-in embedded models (merged-cell tables, formula→LaTeX, CJK/full-page transcription) that liteparse deliberately omits to stay light.
Where the others win: Docling's neural layout has a higher ceiling on the hardest layouts and a more mature ecosystem; MarkItDown covers more long-tail formats; we ship no GPU pipeline, and non-zh/en OCR (RTL / Korean …) isn't covered yet. Detailed comparison →
A Cargo workspace of 17 crates. The key invariant: core depends on no PDF library — reading order and output are format-agnostic, so adding a format means implementing the DocumentParser trait plus one registry line.
The heart of the project is a self-built PDF content-stream interpreter (graphics/text matrix state machine emitting positioned chunks — the layer ODL delegates to veraPDF) and a font layer (ToUnicode CMap / AFM / Encoding, independently implemented with veraPDF as the algorithmic reference). Neural models never enter the core — they attach per page through an Enhancer boundary, and only a hard page routed to a model is ever rendered (on demand, pure-Rust).
Full architecture (crate map · data flow · IR · model subsystem · four interfaces): docs/architecture.md →. Complete capability/feature reference: docs/capabilities.md →. Status & scoreboard: docs/status.md →.
All Apache-2.0, fetched from their original repos as external files — never baked into the binary. The core needs none of them: born-digital PDFs and every other format parse with zero downloads. Pull a tier only when you want the feature:
# --ocr's default models are also auto-offered on first use — this is just the explicit path:
./scripts/fetch-models.sh ppocr-v6 # --ocr (default) (~7 MB)
./scripts/fetch-models.sh ocr # --ocr v4 fallback (~16 MB)
./scripts/fetch-models.sh layout # --layout (default) (~75 MB)
./scripts/fetch-models.sh unirec # --table/formula/transcribe-model (~700 MB)
./scripts/fetch-models.sh ppv2 # --layout-model ppv2 (~210 MB + a local prep step)
./scripts/fetch-models.sh allNeeds the HuggingFace CLI (pip install -U huggingface_hub); ppv2 additionally needs onnx+onnxsim to static-ize its graph for tract (the script prints the one-liner). The ppocr-v6 default needs no prep — the loader reads PaddleOCR's raw ONNX directly (tract's ignore_value_info handles its dynamic graph) and parses the char dict out of the rec yml.
| Tier | Model (source) | Powers |
|---|---|---|
ppocr-v6 → models/ppocr-v6/ (~7 MB) |
PP-OCRv6 tiny det+rec (PaddlePaddle/PP-OCRv6_tiny_*_onnx) |
--ocr scanned text (default), auto-deskew |
ocr → models/ppocr/ (~16 MB) |
PP-OCRv4 det+rec+cls (SWHL/RapidOCR) |
--ocr v4 fallback |
layout → models/layout/ (~75 MB) |
DocLayout-YOLO (wybxc/DocLayout-YOLO-DocStructBench-onnx) |
--layout regions (default), formula detection |
ppv2 → models/layout-ppv2/ (~210 MB) |
PP-DocLayoutV2 (topdu/PP_DoclayoutV2_onnx) |
richer layout + native reading order (~3× YOLO on messy tables) |
unirec → models/unirec/ (~700 MB) |
UniRec-0.1B (topdu/unirec_0_1b_onnx) |
--table-model / --formula-model / --transcribe-model |
PP-OCRv6 (PaddleOCR, 2026-06) is the default OCR tier: on a real Chinese scan it's more accurate than the previous PP-OCRv4 mobile (e.g. fixes a 顿号
、misread), ~2× faster, and ~half the size — at 1.5 M params. Same DB-detection + CTC-recognition interface as v4/v5, so it drops into the existing pipeline; tract reads the raw export directly. Evaluation →UniRec and PP-DocLayoutV2 are the two halves of OpenOCR's OpenDoc-0.1B; we run their official ONNX on pure-Rust
tractand stitch them with our own deterministic core. Selection rationale →
Apache-2.0 — an independent implementation containing no veraPDF code (veraPDF is GPLv3+/MPLv2; its algorithms are referenced with attribution in the sources). All external model files are Apache-2.0. The build carries two minimal, attributed tract patches (vendored on main by design) needed to run PP-DocLayoutV2 on tract.