vra/gansulishuzhi

vlm-ood-eval

A small, focused benchmark for probing vision-language models on an out-of-distribution document grounding task: given a 200 DPI scan of a 1970s Chinese botanical book page, output pixel-coordinate bounding boxes for every figure on the page (photos, line drawings, illustrations).

No OCR, no captions: just "where are the figures?" Modern VLM grounding stacks are trained on clean web documents; this set targets ink bleed, handwritten margin notes, mixed figure styles, and irregular layout.

  • 61 pages, 64 figures, all bboxes hand-verified
  • Metric: greedy max-IoU match → micro precision / recall / F1 at IoU 0.3 / 0.5 / 0.7 + mean matched IoU
  • Runs on Qwen2-VL, Qwen2.5-VL, InternVL2, LLaVA out of the box; add a family in one file
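A minimal sketch of how the metric above could be computed for a single page. It assumes boxes are (x1, y1, x2, y2) pixel tuples and that "greedy" means taking the highest-IoU pair first; the function names are illustrative and are not necessarily the ones used in src/.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(preds: Sequence[Box], gts: Sequence[Box]) -> List[Tuple[int, int, float]]:
    """Pair predictions with ground truth one-to-one, best IoU first."""
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(preds) for j, g in enumerate(gts)),
        reverse=True,
    )
    used_p, used_g, matches = set(), set(), []
    for score, i, j in pairs:
        if score <= 0 or i in used_p or j in used_g:
            continue  # each box may take part in at most one match
        used_p.add(i)
        used_g.add(j)
        matches.append((i, j, score))
    return matches


def counts_at(matches, n_pred: int, n_gt: int, thr: float) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) for one page at a given IoU threshold."""
    tp = sum(1 for _, _, score in matches if score >= thr)
    return tp, n_pred - tp, n_gt - tp
```

For micro-averaging, the per-page tp / fp / fn counts would be summed over all pages before computing precision, recall, and F1.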

Quickstart

uv sync
uv run eval.py --model-id Qwen/Qwen2-VL-2B-Instruct \
               --device mps --dtype float16 \
               --output runs/qwen2vl_2b.jsonl

Set --device to cuda or cpu as appropriate. If the model is unfamiliar, smoke-test with --limit 3 first.

The run prints a table like:

samples       : 61
GT  bboxes    : 64
pred bboxes   : 70
mean matched IoU: 71.2%

 IoU≥        P        R       F1   tp/fp/fn
 0.30     84.3%    92.2%    88.1%   59/11/5
 0.50     71.4%    78.1%    74.6%   50/20/14
 0.70     42.9%    46.9%    44.8%   30/40/34

and writes per-sample JSONL + a summary JSON next to it.
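As a sanity check on the table, the percentages can be recomputed from the tp/fp/fn column alone. The sketch below is not part of the repo; `prf` is a hypothetical helper, and the counts are copied from the sample output above. Note that tp + fp equals the number of predicted boxes (70) and tp + fn equals the number of ground-truth boxes (64).

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Micro precision, recall, and F1 from summed counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Counts taken from the tp/fp/fn column of the sample table.
table = {0.30: (59, 11, 5), 0.50: (50, 20, 14), 0.70: (30, 40, 34)}

for thr, (tp, fp, fn) in table.items():
    p, r, f1 = prf(tp, fp, fn)
    print(f"{thr:.2f}  {p:6.1%}  {r:6.1%}  {f1:6.1%}  {tp}/{fp}/{fn}")
```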

Repo layout

agent.md                          one-page brief for AI agents (read this first if you're an agent)
eval.py                           main CLI
src/                              dataset loader, prompt, bbox parser, IoU metrics, model runners
scripts/                          benchmark builders (not needed to run eval)
benchmarks/gansulishuzhi/         the eval set: pages/ + overlays/ + manifest.jsonl

Supported models

Auto-routed by --model-id prefix:

Family        Example
Qwen2-VL      Qwen/Qwen2-VL-2B-Instruct
Qwen2.5-VL    Qwen/Qwen2.5-VL-7B-Instruct
InternVL2     OpenGVLab/InternVL2-8B
LLaVA         llava-hf/llava-1.5-7b-hf

To add a family: drop a new runner in src/models/ exposing generate(image, prompt) -> str, then register its id prefix in src/models/__init__.py.
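A rough sketch of what a runner and its prefix registration might look like. Only the generate(image, prompt) -> str contract comes from this README; `EchoRunner`, `RUNNERS`, `register`, and `resolve` are hypothetical names, and the actual registry in src/models/__init__.py may be organized differently.

```python
from typing import Any, Callable, Dict, Protocol


class Runner(Protocol):
    """The contract described above: an image and a prompt in, raw text out."""

    def generate(self, image: Any, prompt: str) -> str: ...


class EchoRunner:
    """Hypothetical stand-in; a real runner would load and call a VLM here."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id

    def generate(self, image: Any, prompt: str) -> str:
        return "[]"  # placeholder response: no boxes predicted


# Hypothetical registry mapping a model-id prefix to a runner factory.
RUNNERS: Dict[str, Callable[[str], Runner]] = {}


def register(prefix: str, factory: Callable[[str], Runner]) -> None:
    RUNNERS[prefix] = factory


def resolve(model_id: str) -> Runner:
    """Pick a runner by model-id prefix; the longest matching prefix wins."""
    for prefix in sorted(RUNNERS, key=len, reverse=True):
        if model_id.startswith(prefix):
            return RUNNERS[prefix](model_id)
    raise ValueError(f"no runner registered for {model_id!r}")


register("acme/", EchoRunner)  # "acme/" is a made-up prefix for illustration
runner = resolve("acme/toy-vlm")
```

Trying the longest prefix first means a more specific prefix is never shadowed by a shorter one that also matches.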

For AI agents

If you are an AI agent (Claude Code, Cursor, etc.), read agent.md instead of this file. It has the same content restructured for one-shot programmatic use: manifest schema, output schema, metric definition, and exact commands.

A Claude Code skill is provided under .claude/skills/gansulishuzhi-eval/.

Rebuilding the benchmark (optional)

The repo ships the prepared dataset; you only need this if you want to regenerate from the source PDF.

uv sync --extra build
uv run scripts/build_gansulishuzhi.py
# then hand-verify by reviewing benchmarks/gansulishuzhi/overlays/*.png

The detector (scripts/_detect_figures.py) runs two heuristics in parallel (midtone-smooth blobs for photos, large connected ink components for line drawings), then unions the two sets of candidate boxes and dedupes them.
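The union-and-dedupe step could be sketched as follows. This is an illustration, not the code in scripts/_detect_figures.py: the (x1, y1, x2, y2) box layout, the 0.5 IoU threshold, and the keep-the-larger-box policy are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x1, y1, x2, y2) in pixels


def area(b: Box) -> int:
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def union_and_dedupe(photo_boxes: List[Box], line_boxes: List[Box], thr: float = 0.5) -> List[Box]:
    """Pool both heuristics' boxes, then drop any box that overlaps a kept one."""
    kept: List[Box] = []
    # Visit larger boxes first so the larger of two overlapping boxes survives.
    for box in sorted(photo_boxes + line_boxes, key=area, reverse=True):
        if all(iou(box, k) < thr for k in kept):
            kept.append(box)
    return kept
```

Whatever the actual policy is, the output still needs the manual overlay review mentioned above, since either heuristic can fire on stains or dense text.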

License

Code: MIT. The source page images are scans of a 1970s reference work and are included for non-commercial research use only.
