vlm-ood-eval

A small, focused benchmark for probing vision-language models on an out-of-distribution document grounding task: given a 200 DPI scan of a 1970s Chinese botanical book page, output pixel-coordinate bounding boxes for every figure on the page (photos, line drawings, illustrations).

No OCR, no captions: just "where are the figures." Modern VLM grounding stacks are trained largely on clean web documents; this set targets ink bleed, handwritten margin notes, mixed figure styles, and irregular layouts.

61 pages, 64 figures, all bboxes hand-verified
Metric: greedy max-IoU match → micro precision / recall / F1 at IoU 0.3 / 0.5 / 0.7 + mean matched IoU
Runs on Qwen2-VL, Qwen2.5-VL, InternVL2, LLaVA out of the box; add a family in one file
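The metric line above can be sketched in a few lines of Python. This is a minimal illustration, not the code in src/: the names `iou`, `greedy_match`, and `counts_at` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` pixel tuples, and the tie-breaking and the rule for which pairs count as "matched" are assumptions. For micro-averaging, the per-page counts would be summed over all pages before computing precision, recall, and F1.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), pixel coordinates (assumed layout)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(gt: Sequence[Box], pred: Sequence[Box]) -> List[float]:
    """Pair boxes one-to-one, highest IoU first; return the matched IoUs."""
    pairs = sorted(
        ((iou(g, p), gi, pi) for gi, g in enumerate(gt) for pi, p in enumerate(pred)),
        reverse=True,
    )
    used_gt, used_pred, matched = set(), set(), []
    for score, gi, pi in pairs:
        if score <= 0.0:
            break  # every remaining pair has zero overlap
        if gi in used_gt or pi in used_pred:
            continue  # each box may be matched at most once
        used_gt.add(gi)
        used_pred.add(pi)
        matched.append(score)
    return matched


def counts_at(matched: Sequence[float], n_gt: int, n_pred: int, thr: float):
    """Return (tp, fp, fn) for one page at one IoU threshold."""
    tp = sum(1 for s in matched if s >= thr)
    return tp, n_pred - tp, n_gt - tp


# One ground-truth figure, two predictions: one half-overlapping, one stray.
gt = [(0, 0, 100, 100)]
pred = [(0, 0, 100, 50), (200, 200, 300, 300)]
matched = greedy_match(gt, pred)  # a single pair with IoU 0.5
for thr in (0.3, 0.5, 0.7):
    print(thr, counts_at(matched, len(gt), len(pred), thr))
```

In this sketch an unmatched or below-threshold prediction counts as a false positive and an unmatched ground-truth box as a false negative, so tp + fp equals the number of predictions and tp + fn equals the number of ground-truth boxes.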

Quickstart

uv sync
uv run eval.py --model-id Qwen/Qwen2-VL-2B-Instruct \
               --device mps --dtype float16 \
               --output runs/qwen2vl_2b.jsonl

Replace --device with cuda / cpu as appropriate. Smoke-test with --limit 3 first if the model is unfamiliar.

The run prints a table like:

samples       : 61
GT  bboxes    : 64
pred bboxes   : 70
mean matched IoU: 71.2%

 IoU≥        P        R       F1   tp/fp/fn
 0.30     84.3%    92.2%    88.1%   59/11/5
 0.50     71.4%    78.1%    74.6%   50/20/14
 0.70     42.9%    46.9%    44.8%   30/40/34

and writes per-sample JSONL + a summary JSON next to it.
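To read a row of the table, the percentages follow directly from the tp/fp/fn column. The helper below is a hypothetical sketch (`prf_from_counts` is not a function from this repo) that reproduces the IoU≥0.50 row from its counts.

```python
def prf_from_counts(tp: int, fp: int, fn: int):
    """Micro precision, recall, and F1 from summed true/false positive/negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the IoU >= 0.50 row of the sample table: 50/20/14.
p, r, f1 = prf_from_counts(50, 20, 14)
print(f"{p:.1%} {r:.1%} {f1:.1%}")  # prints "71.4% 78.1% 74.6%"
```

Note that tp + fp is the total number of predicted boxes (70) and tp + fn the total number of ground-truth boxes (64), which is a quick consistency check on any row.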

Repo layout

agent.md                          one-page brief for AI agents (read this first if you're an agent)
eval.py                           main CLI
src/                              dataset loader, prompt, bbox parser, IoU metrics, model runners
scripts/                          benchmark builders (not needed to run eval)
benchmarks/gansulishuzhi/         the eval set: pages/ + overlays/ + manifest.jsonl

Supported models

Auto-routed by --model-id prefix:

Family	Example
Qwen2-VL	`Qwen/Qwen2-VL-2B-Instruct`
Qwen2.5-VL	`Qwen/Qwen2.5-VL-7B-Instruct`
InternVL2	`OpenGVLab/InternVL2-8B`
LLaVA	`llava-hf/llava-1.5-7b-hf`

To add a family: drop a new runner in src/models/ exposing generate(image, prompt) -> str, then register its id prefix in src/models/__init__.py.
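As a rough sketch of what a new family might look like: only the `generate(image, prompt) -> str` contract and the idea of registering an id prefix come from this README. Everything else below (`Runner`, `EchoRunner`, `REGISTRY`, `resolve_runner`, and the `my-org/my-vlm` prefix) is hypothetical, and the real registration code in src/models/__init__.py may look different.

```python
from typing import Any, Callable, Dict, Protocol


class Runner(Protocol):
    """The contract described above: one image and one prompt in, raw text out."""

    def generate(self, image: Any, prompt: str) -> str: ...


class EchoRunner:
    """Hypothetical stand-in for a real model runner; it loads no weights."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id

    def generate(self, image: Any, prompt: str) -> str:
        # A real runner would call the model here and return its decoded text.
        return "[]"


# Hypothetical prefix -> runner factory table.
REGISTRY: Dict[str, Callable[[str], Runner]] = {
    "my-org/my-vlm": EchoRunner,
}


def resolve_runner(model_id: str) -> Runner:
    """Pick the runner whose registered prefix matches model_id (longest prefix wins)."""
    for prefix in sorted(REGISTRY, key=len, reverse=True):
        if model_id.startswith(prefix):
            return REGISTRY[prefix](model_id)
    raise ValueError(f"no runner registered for {model_id!r}")


runner = resolve_runner("my-org/my-vlm-7b")
print(runner.generate(image=None, prompt="Locate every figure."))
```

Checking longer prefixes first keeps a more specific family from being shadowed by a shorter one that happens to share its leading characters.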

For AI agents

If you are an AI agent (Claude Code, Cursor, etc.), read agent.md instead of this file. It has the same content restructured for one-shot programmatic use: manifest schema, output schema, metric definition, and exact commands.

A Claude Code skill is provided under .claude/skills/gansulishuzhi-eval/.

Rebuilding the benchmark (optional)

The repo ships the prepared dataset; you only need this if you want to regenerate from the source PDF.

uv sync --extra build
uv run scripts/build_gansulishuzhi.py
# then hand-verify by reviewing benchmarks/gansulishuzhi/overlays/*.png

The detector (scripts/_detect_figures.py) runs two heuristics in parallel (a midtone-smooth blob detector for photos and large connected ink components for line drawings), then unions the two sets of boxes and dedupes them.
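The union-and-dedupe step could look roughly like the sketch below. The README does not say how the detector decides that two boxes are duplicates, so the IoU threshold of 0.5, the keep-the-larger-box rule, and the names `box_iou` and `dedupe_boxes` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels (assumed layout)


def area(b: Box) -> int:
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def dedupe_boxes(boxes: Sequence[Box], iou_thr: float = 0.5) -> List[Box]:
    """Keep the larger box whenever two candidates overlap at or above iou_thr."""
    kept: List[Box] = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(box_iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept


# Both heuristics fire on the same photo; only one finds the line drawing.
photo_boxes = [(100, 100, 500, 400)]
drawing_boxes = [(105, 98, 498, 405), (600, 700, 900, 1000)]
merged = dedupe_boxes(photo_boxes + drawing_boxes)
print(merged)  # two boxes remain: one for the photo, one for the drawing
```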

License

Code: MIT. The source page images are scans of a 1970s reference work included for non-commercial research use only.
