PureDocBench

How far is document parsing from solved?
A source-traceable benchmark for OCR and document parsing across clean, digitally degraded, and real-degraded document settings.


PureDocBench uses HTML/CSS document sources as hidden anchors: each page is rendered into images and annotated from the same structured source. This gives a benchmark where text, tables, formulas, captions, and reading order can be scored with less post-hoc annotation noise.
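To make the source-anchored idea concrete, here is a minimal sketch of how block-level text could be read out of an HTML source in document order. It is an illustration only and not the repository's annotation code: the `BLOCK_TAGS` set, the `BlockExtractor` class, and the `extract_blocks` helper are all hypothetical names, and nested block elements are not handled.

```python
from html.parser import HTMLParser

# Tags treated as scoreable blocks. This set is an assumption for illustration.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "th", "td", "figcaption", "li"}


class BlockExtractor(HTMLParser):
    """Collect (tag, text) pairs in document order as a stand-in for reading order."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (tag, text) pairs
        self._current = None   # block tag currently open, if any
        self._buffer = []      # text fragments seen inside the open block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join fragments, then collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((tag, text))
            self._current = None
            self._buffer = []


def extract_blocks(html):
    """Return the (tag, text) blocks found in an HTML string."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Because the same HTML file is also what gets rendered to an image, labels obtained this way come from the source itself rather than from a later manual pass over the image.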

At A Glance

Item	Count
Official pages	1,475
Official images	4,425
Top-level domains	10
Fine-grained subcategories	66
Image tracks	clean, digital-degraded, real-degraded
Scored structures	text, formulas, tables, reading order

Main Leaderboard

The paper evaluates 40 systems across pipeline specialists, end-to-end document parsers, and general-purpose VLMs. Table 2 is the main leaderboard: each track reports Overall, TextEdit, FormulaCDM, TableTEDS, and ROEdit; Avg3 averages the three track Overall scores.
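As a small worked example of the Avg3 column, the sketch below averages the Overall score of the three tracks. It assumes a plain unweighted mean, which is how the sentence above reads; the function name `avg3` and the scores in the usage note are made up for illustration and are not results from the paper.

```python
from statistics import mean

# Track names as listed in the At A Glance table.
TRACKS = ("clean", "digital-degraded", "real-degraded")


def avg3(overall_by_track):
    """Unweighted mean of the per-track Overall scores (assumed definition)."""
    missing = [track for track in TRACKS if track not in overall_by_track]
    if missing:
        raise ValueError(f"missing Overall score for: {', '.join(missing)}")
    return mean([overall_by_track[track] for track in TRACKS])
```

For instance, hypothetical Overall scores of 90.0, 84.0, and 78.0 on the three tracks give an Avg3 of 84.0.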

Diagnostics

The diagnostic panel shows where current systems still have headroom. Formula recognition is the largest single bottleneck, and real degradation changes rankings more sharply than digital degradation.

Case Studies

The four case studies below are all taken from the paper. They show failures that aggregate scores can hide: notation loss, reading-order mistakes, annotation contamination, table-structure errors, character-level corruption, and missing visual authentication cues.

Case 1: Academic

Case 2: Business

Case 3: Finance

Case 4: Certificate

Appendix Highlights

The appendix documents the degradation design and per-category behavior used to make the benchmark reproducible.

Download

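The full release (images, ground truth, and HTML sources) is hosted on Hugging Face. After downloading, verify the checksums and reassemble the split archive: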

# After downloading all files from Hugging Face:
shasum -a 256 -c SHA256SUMS.txt
cat pdb_full.tar.part-* | tar -xf -

Verify the split archive and reconstructed release:

python scripts/verify_split_archive.py /path/to/downloaded/files

python scripts/validate_release_manifest.py \
  --release-root /path/to/puredocbench-v1.0 \
  --manifest manifests/release_manifest_candidate_1475.csv
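Where `shasum` is not available, the checksum step can be approximated in Python. This is a sketch and not one of the repository's scripts: it assumes `SHA256SUMS.txt` uses the usual `<hex digest>  <file name>` line format, and the helper names `sha256_of` and `verify_sums` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archive parts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sums(sums_file):
    """Return the names of listed files that are missing or have a wrong digest."""
    sums_file = Path(sums_file)
    failures = []
    for line in sums_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # a leading '*' marks binary mode in checksum files
        target = sums_file.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `verify_sums("SHA256SUMS.txt")` means every listed file matched its recorded digest.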

Inference And Scoring

PureDocBench includes a public CLI for model-agnostic inference, fast lightweight scoring, and OmniDocBench-aligned evaluation. Use puredocbench score for quick sanity checks; use puredocbench score-omnidocbench with an OmniDocBench checkout for platform-aligned CDM/TEDS numbers.

pip install -e .

puredocbench infer \
  --images /path/to/puredocbench-v1.0/images/clean \
  --output-dir predictions/my_model_clean \
  --command-template 'python my_model_infer.py --image {image} --out {output}'

puredocbench score \
  --release-root /path/to/puredocbench-v1.0 \
  --manifest manifests/release_manifest_candidate_1475.csv \
  --pred-dir predictions/my_model_clean \
  --track clean \
  --limit 20 \
  --out-dir scores/my_model_clean

puredocbench score-omnidocbench \
  --release-root /path/to/puredocbench-v1.0 \
  --manifest manifests/release_manifest_candidate_1475.csv \
  --pred-dir predictions/my_model_clean \
  --track clean \
  --omnidocbench-root /path/to/OmniDocBench \
  --out-dir omnidocbench_scores/my_model_clean

See docs/INFERENCE_SCORING.md for the full interface and evaluator-version notes.
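The `--command-template` above calls a user-supplied script with `--image` and `--out`. A minimal skeleton for such a script might look like the following. Everything in it is an assumption for illustration: `run_model` is a hypothetical placeholder for a real model call, and writing a single UTF-8 text file per image is a guess at the expected output, so check docs/INFERENCE_SCORING.md for the actual prediction format.

```python
import argparse
import sys
from pathlib import Path


def run_model(image_path):
    """Hypothetical placeholder: replace with a call to the model under test."""
    return f"<!-- no prediction produced for {image_path.name} -->\n"


def build_parser():
    parser = argparse.ArgumentParser(description="Parse one page image.")
    parser.add_argument("--image", type=Path, required=True, help="input page image")
    parser.add_argument("--out", type=Path, required=True, help="prediction file to write")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if not args.image.is_file():
        print(f"image not found: {args.image}", file=sys.stderr)
        return 1
    # Create the output directory on demand so the script works on a fresh run.
    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(run_model(args.image), encoding="utf-8")
    return 0


# Run as a CLI only when arguments were actually passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main())
```

Returning a non-zero exit code on failure lets a calling process tell failed pages apart from empty predictions, assuming the caller inspects exit codes.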

Repository Contents

manifests/                         Release and sample manifests
metadata/                          Dataset card and Croissant metadata
scripts/                           Rendering, degradation, validation, leaderboard tools
puredocbench/                      Public inference, scoring, and OmniDocBench export CLI
model_inference/                   Sanitized model inference configs and runners
supplemental_inference_scoring/    API/local inference and scoring utilities
assets/figures/                    Figures from the paper

Quick Start

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium

Render one HTML page:

python scripts/render_single_image.py \
  --html /path/to/page.html \
  --out /path/to/page.png \
  --dpi 300

Apply a deterministic degradation profile:

python scripts/apply_degradation_ablation.py \
  --input /path/to/clean_images \
  --output /path/to/degraded_images \
  --profile full_medium
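One common way to make a degradation profile deterministic is to derive the random seed from the profile name and the image name, so that every run draws the same parameters for the same page. The sketch below shows that idea only. It is not taken from `apply_degradation_ablation.py`; the parameter names and ranges in `MEDIUM_RANGES` and the function `params_for` are hypothetical.

```python
import hashlib
import random

# Hypothetical parameter ranges for a "medium" profile; not the repository's values.
MEDIUM_RANGES = {
    "blur_sigma": (0.5, 1.5),
    "noise_std": (2.0, 8.0),
    "rotation_deg": (-2.0, 2.0),
}


def params_for(image_name, profile="full_medium", ranges=MEDIUM_RANGES):
    """Draw degradation parameters that depend only on the profile and image name."""
    key = f"{profile}:{image_name}".encode("utf-8")
    # Use the first 8 bytes of the SHA-256 digest as a stable integer seed.
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    rng = random.Random(seed)
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
```

Seeding per image rather than once per run keeps the result independent of processing order, so a partial or parallel run yields the same degraded images as a full sequential one.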

License

Dataset assets are released under CC BY 4.0; see LICENSE_DATA.
Code in this repository is released under the license in LICENSE.
Model weights are not redistributed.

Citation

@article{li2026puredocbench,
  title   = {How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings},
  author  = {Li, Zhiheng and Ma, Zongyang and Chen, Jiaxian and Zhang, Jianing and Su, Zhaolong and Zhang, Yutong and Yu, Zhiyin and Liu, Ruiqi and Lv, Xiaolei and Li, Bo and Gao, Jun and Zhang, Ziqi and Yuan, Chunfeng and Li, Bing and Hu, Weiming},
  journal = {arXiv preprint arXiv:2605.07492},
  year    = {2026},
  doi     = {10.48550/arXiv.2605.07492},
  url     = {https://arxiv.org/abs/2605.07492}
}
