A small, focused benchmark for probing vision-language models on an out-of-distribution document grounding task: given a 200 DPI scan of a 1970s Chinese botanical book page, output pixel-coordinate bounding boxes for every figure on the page (photos, line drawings, illustrations).
No OCR, no captions — just "where are the figures." Modern VLM grounding stacks are trained on clean web documents; this set targets ink bleed, handwritten margin notes, mixed figure styles, and irregular layout.
- 61 pages, 64 figures, all bboxes hand-verified
- Metric: greedy max-IoU match → micro precision / recall / F1 at IoU 0.3 / 0.5 / 0.7 + mean matched IoU
- Runs on Qwen2-VL, Qwen2.5-VL, InternVL2, LLaVA out of the box; add a family in one file
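
The metric bullet above can be sketched in a few lines. This is a minimal illustration rather than the code in src/: the names `iou`, `greedy_match`, and `score` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` pixel tuples, and it assumes matching is done once on IoU and then thresholded. For micro-averaging, the tp / fp / fn counts would be summed over all pages before taking the ratios.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(preds: List[Box], gts: List[Box]) -> List[Tuple[int, int, float]]:
    """Pair predictions with ground truth one-to-one, highest IoU first."""
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(preds) for j, g in enumerate(gts)),
        reverse=True,
    )
    used_p, used_g, matches = set(), set(), []
    for value, i, j in pairs:
        # Skip non-overlapping pairs and boxes that are already matched.
        if value <= 0.0 or i in used_p or j in used_g:
            continue
        used_p.add(i)
        used_g.add(j)
        matches.append((i, j, value))
    return matches


def score(matches, n_pred: int, n_gt: int, thr: float):
    """Precision, recall, F1; a match is a true positive if its IoU >= thr."""
    tp = sum(1 for _, _, v in matches if v >= thr)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gt if n_gt else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# One page with two predictions and one ground-truth figure.
preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
gts = [(0, 0, 10, 8)]
matches = greedy_match(preds, gts)
print(matches)
print(score(matches, len(preds), len(gts), 0.7))
```

Under these assumptions, unmatched predictions count as false positives and unmatched ground-truth boxes as false negatives, so tp + fp always equals the number of predictions and tp + fn the number of ground-truth boxes.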

Install dependencies and run the evaluation:

    uv sync
    uv run eval.py --model-id Qwen/Qwen2-VL-2B-Instruct \
        --device mps --dtype float16 \
        --output runs/qwen2vl_2b.jsonl

Replace --device with cuda / cpu as appropriate. Smoke-test with
--limit 3 first if the model is unfamiliar.

The run prints a table like:

    samples         : 61
    GT bboxes       : 64
    pred bboxes     : 70
    mean matched IoU: 71.2%

    IoU≥    P      R      F1     tp/fp/fn
    0.30    84.3%  92.2%  88.0%  59/11/5
    0.50    71.4%  78.1%  74.6%  50/20/14
    0.70    42.9%  46.9%  44.8%  30/40/34

and writes per-sample JSONL + a summary JSON next to it.

Repository layout:

    agent.md                    one-page brief for AI agents (read this first if you're an agent)
    eval.py                     main CLI
    src/                        dataset loader, prompt, bbox parser, IoU metrics, model runners
    scripts/                    benchmark builders (not needed to run eval)
    benchmarks/gansulishuzhi/   the eval set: pages/ + overlays/ + manifest.jsonl

Auto-routed by --model-id prefix:
| Family | Example |
|---|---|
| Qwen2-VL | Qwen/Qwen2-VL-2B-Instruct |
| Qwen2.5-VL | Qwen/Qwen2.5-VL-7B-Instruct |
| InternVL2 | OpenGVLab/InternVL2-8B |
| LLaVA | llava-hf/llava-1.5-7b-hf |
To add a family: drop a new runner in src/models/ exposing
generate(image, prompt) -> str, then register its id prefix in
src/models/__init__.py.
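
As a rough sketch of what that could look like: the only interface stated above is `generate(image, prompt) -> str`. Everything else here is an assumption for illustration, including the `EchoRunner` class, the `register` / `resolve` helpers, the `example-org/Echo` prefix, and the longest-prefix rule; the actual registry in src/models/__init__.py may be structured differently.

```python
from typing import Callable, Dict, Protocol


class Runner(Protocol):
    """The one interface the text above requires of a model runner."""

    def generate(self, image: object, prompt: str) -> str: ...


class EchoRunner:
    """Hypothetical stand-in: returns a canned answer instead of running a model."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id

    def generate(self, image: object, prompt: str) -> str:
        return "[[10, 20, 110, 220]]"


# Hypothetical prefix registry; the real one may not be a dict at all.
_PREFIX_TO_RUNNER: Dict[str, Callable[[str], Runner]] = {}


def register(prefix: str, factory: Callable[[str], Runner]) -> None:
    _PREFIX_TO_RUNNER[prefix] = factory


def resolve(model_id: str) -> Runner:
    # Try longer prefixes first so a more specific family wins over a shorter one.
    for prefix in sorted(_PREFIX_TO_RUNNER, key=len, reverse=True):
        if model_id.startswith(prefix):
            return _PREFIX_TO_RUNNER[prefix](model_id)
    raise ValueError(f"no runner registered for {model_id!r}")


register("example-org/Echo", EchoRunner)
runner = resolve("example-org/Echo-7B")
print(runner.generate(image=None, prompt="Where are the figures?"))
```

Keeping the runner contract to a single string-returning method means the prompt, bbox parsing, and metrics can stay shared across model families.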
If you are an AI agent (Claude Code, Cursor, etc.), read
agent.md instead of this file. It has the same content
restructured for one-shot programmatic use: manifest schema, output schema,
metric definition, and exact commands.
A Claude Code skill is provided under .claude/skills/gansulishuzhi-eval/.
The repo ships the prepared dataset; the build step below is only needed if you want to regenerate it from the source PDF.

    uv sync --extra build
    uv run scripts/build_gansulishuzhi.py
    # then hand-verify by reviewing benchmarks/gansulishuzhi/overlays/*.png

The detector (scripts/_detect_figures.py) runs two heuristics in parallel:
midtone-smooth blob for photos, and large connected ink components for line
drawings. It then unions and dedupes the results.
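
The union-and-dedupe step can be illustrated with a short sketch. None of the details below are confirmed by the script itself: it assumes each heuristic yields a list of `(x1, y1, x2, y2)` boxes, that two boxes count as duplicates when their IoU reaches a threshold (0.5 here, an arbitrary choice), and that the larger box is the one kept. The function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x1, y1, x2, y2) in pixels


def area(b: Box) -> int:
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def union_and_dedupe(
    photo_boxes: List[Box], drawing_boxes: List[Box], iou_thr: float = 0.5
) -> List[Box]:
    """Pool both candidate lists, then drop boxes that overlap a kept box too much."""
    # Visit larger boxes first so the larger of two near-duplicates survives.
    candidates = sorted(photo_boxes + drawing_boxes, key=area, reverse=True)
    kept: List[Box] = []
    for box in candidates:
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept


photos = [(0, 0, 100, 100)]
drawings = [(2, 2, 100, 100), (200, 200, 300, 300)]
print(union_and_dedupe(photos, drawings))
```

In this example the second candidate nearly coincides with the first and is dropped, while the distant third box is kept. Whatever rule the real detector uses, its output still goes through the hand-verification pass described above.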
Code: MIT. The source page images are scans of a 1970s reference work included for non-commercial research use only.