How far is document parsing from solved?
A source-traceable benchmark for OCR and document parsing across clean, digitally degraded, and real-degraded document settings.
PureDocBench uses HTML/CSS document sources as hidden anchors: each page is rendered into images and annotated from the same structured source. This gives a benchmark where text, tables, formulas, captions, and reading order can be scored with less post-hoc annotation noise.
| Item | Count |
|---|---|
| Official pages | 1,475 |
| Official images | 4,425 |
| Top-level domains | 10 |
| Fine-grained subcategories | 66 |
| Image tracks | clean, digital-degraded, real-degraded |
| Scored structures | text, formulas, tables, reading order |
The paper evaluates 40 systems across pipeline specialists, end-to-end document parsers, and general-purpose VLMs. Table 2 is the main leaderboard: each track reports Overall, TextEdit, FormulaCDM, TableTEDS, and ROEdit; Avg3 averages the three track Overall scores.
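As a small illustration of the Avg3 definition above, the sketch below averages the Overall score of the three tracks. The function name `avg3`, the dictionary keys (taken from the track labels in the table, which may differ from the CLI's `--track` identifiers), and the numbers are assumptions for illustration, not values from the paper.

```python
from statistics import mean

# Track labels as listed in the overview table; the CLI may use other identifiers.
TRACKS = ("clean", "digital-degraded", "real-degraded")


def avg3(overall_by_track: dict) -> float:
    """Return the unweighted mean of the three per-track Overall scores."""
    missing = [t for t in TRACKS if t not in overall_by_track]
    if missing:
        raise KeyError(f"missing Overall score for track(s): {missing}")
    return mean(overall_by_track[t] for t in TRACKS)


# Placeholder scores for illustration only, not results from the paper.
example = {"clean": 90.0, "digital-degraded": 84.0, "real-degraded": 78.0}
print(avg3(example))
```

Because the mean is unweighted, a system that drops sharply on one track loses the same amount of Avg3 regardless of which track it is.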
The diagnostic panel shows where current systems still have headroom. Formula recognition is the largest single bottleneck, and real degradation changes rankings more sharply than digital degradation.
The four case studies below are all taken from the paper. They show failures that aggregate scores can hide: notation loss, reading-order mistakes, annotation contamination, table-structure errors, character-level corruption, and missing visual authentication cues.
The appendix documents the degradation design and per-category behavior used to make the benchmark reproducible.
The full image/GT/HTML release is hosted on Hugging Face:
# After downloading all files from Hugging Face:
shasum -a 256 -c SHA256SUMS.txt
cat pdb_full.tar.part-* | tar -xf -

Verify the split archive and reconstructed release:
python scripts/verify_split_archive.py /path/to/downloaded/files
python scripts/validate_release_manifest.py \
--release-root /path/to/puredocbench-v1.0 \
--manifest manifests/release_manifest_candidate_1475.csv

PureDocBench includes a public CLI for model-agnostic inference, fast lightweight scoring, and OmniDocBench-aligned evaluation. Use puredocbench score for quick sanity checks; use puredocbench score-omnidocbench with an OmniDocBench checkout for platform-aligned CDM/TEDS numbers.
pip install -e .
puredocbench infer \
--images /path/to/puredocbench-v1.0/images/clean \
--output-dir predictions/my_model_clean \
--command-template 'python my_model_infer.py --image {image} --out {output}'
puredocbench score \
--release-root /path/to/puredocbench-v1.0 \
--manifest manifests/release_manifest_candidate_1475.csv \
--pred-dir predictions/my_model_clean \
--track clean \
--limit 20 \
--out-dir scores/my_model_clean
puredocbench score-omnidocbench \
--release-root /path/to/puredocbench-v1.0 \
--manifest manifests/release_manifest_candidate_1475.csv \
--pred-dir predictions/my_model_clean \
--track clean \
--omnidocbench-root /path/to/OmniDocBench \
--out-dir omnidocbench_scores/my_model_clean

See docs/INFERENCE_SCORING.md for the full interface and evaluator-version notes.
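The `--command-template` in the infer example calls a per-image script with `{image}` and `{output}` placeholders. A minimal sketch of what such a script could look like follows. The file name `my_model_infer.py` and the `--image`/`--out` flags come from the template; `parse_page` is a hypothetical stand-in for a real model call. The sketch also assumes that `{output}` expands to a file path and that the prediction is plain text. Neither is confirmed here, so check docs/INFERENCE_SCORING.md for the actual contract.

```python
#!/usr/bin/env python3
"""Hypothetical my_model_infer.py: parse one page image, write one prediction file."""
import argparse
import sys
from pathlib import Path


def parse_page(image_path: Path) -> str:
    """Placeholder for a real model call; returns a stub prediction string."""
    return f"<!-- prediction for {image_path.name} -->\n"


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Run a model on a single page image.")
    parser.add_argument("--image", type=Path, required=True, help="input page image")
    parser.add_argument("--out", type=Path, required=True, help="output prediction file")
    args = parser.parse_args(argv)

    # Create the output directory if needed (assumes --out is a file path).
    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(parse_page(args.image), encoding="utf-8")
    return 0


if __name__ == "__main__":
    # Only run when arguments are supplied, so the file can also be imported.
    if len(sys.argv) > 1:
        raise SystemExit(main(sys.argv[1:]))
```

Keeping the model call behind a single function means the same wrapper can be reused for different models by swapping `parse_page` only.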
manifests/ Release and sample manifests
metadata/ Dataset card and Croissant metadata
scripts/ Rendering, degradation, validation, leaderboard tools
puredocbench/ Public inference, scoring, and OmniDocBench export CLI
model_inference/ Sanitized model inference configs and runners
supplemental_inference_scoring/ API/local inference and scoring utilities
assets/figures/ Figures from the paper
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium

Render one HTML page:
python scripts/render_single_image.py \
--html /path/to/page.html \
--out /path/to/page.png \
--dpi 300

Apply a deterministic degradation profile:
python scripts/apply_degradation_ablation.py \
--input /path/to/clean_images \
--output /path/to/degraded_images \
--profile full_medium

- Dataset assets are released under CC BY 4.0; see LICENSE_DATA.
- Code in this repository is released under the license in LICENSE.
- Model weights are not redistributed.
@article{li2026puredocbench,
title = {How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings},
author = {Li, Zhiheng and Ma, Zongyang and Chen, Jiaxian and Zhang, Jianing and Su, Zhaolong and Zhang, Yutong and Yu, Zhiyin and Liu, Ruiqi and Lv, Xiaolei and Li, Bo and Gao, Jun and Zhang, Ziqi and Yuan, Chunfeng and Li, Bing and Hu, Weiming},
journal = {arXiv preprint arXiv:2605.07492},
year = {2026},
doi = {10.48550/arXiv.2605.07492},
url = {https://arxiv.org/abs/2605.07492}
}








