PlanBench is a benchmark for evaluating LLMs on urban planning tasks, including both text-based QA and vision-based QA.
| Subset | Items | Type | Description |
|---|---|---|---|
| PlanBench | 405 | Text | Urban planning exam questions (memory, understanding, analysis, application, evaluation) |
| PlanBench-V (subset) | 300 | Vision | Planning map understanding with critical-point scoring |
| PlanBench-V (full) | 1,567 | Vision | Full vision benchmark |
- GitHub (this repo): planbench/data/ and planbench-v/data/ contain the question JSONs
- HuggingFace (includes images): chichi56/PlanBench
# Clone with data
git clone https://github.com/zhuchichi56/PlanBench.git
cd PlanBench
# Download images from HuggingFace (required for vision tasks)
pip install huggingface_hub
huggingface-cli download chichi56/PlanBench --repo-type dataset --local-dir .
PlanBench (text), planbench/data/planbench.json:
{
"instruction": "问题文本...",
"response": "参考答案...",
"type": "记忆能力",
"answer": "简答",
"explanation": "解析..."
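}
Field values are in Chinese: here type is 记忆能力 ("memory ability") and answer is 简答 ("short answer"); instruction, response, and explanation hold the question text, the reference answer, and the explanation.
PlanBench-V (vision), planbench-v/data/planbench-v-subset.json: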
{
"type": "要素",
"image_id": "22-1",
"image_url": "images/22-1.png",
"question": "请描述国家重点湿地。",
"answer": "...",
"critical_points": ["[1] ...", "[2] ...", "[3] ..."]
}
Here type is 要素 ("element") and the question reads "Please describe the national key wetlands."
pip install openai httpx tqdm
inference.py uses an OpenAI-compatible backend (OpenRouter by default). Any OpenAI-compatible endpoint works by overriding OPENROUTER_BASE_URL.
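As a rough illustration of what overriding the base URL means, the sketch below resolves a chat-completions URL from the environment. It is not taken from inference.py: the helper name `resolve_endpoint`, the fallback constant, and the local server address are assumptions made for this example, and the actual script may read the variable differently.

```python
import os
from typing import Mapping

# OpenRouter's public OpenAI-compatible base URL, assumed here as the default.
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_endpoint(env: Mapping[str, str] = os.environ) -> str:
    """Return the chat-completions URL, honoring OPENROUTER_BASE_URL if set.

    Hypothetical helper: inference.py may resolve the endpoint differently.
    """
    base = env.get("OPENROUTER_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return f"{base}/chat/completions"


# Hypothetical local server exposing an OpenAI-compatible API.
print(resolve_endpoint({"OPENROUTER_BASE_URL": "http://localhost:8000/v1"}))
# No override: falls back to OpenRouter.
print(resolve_endpoint({}))
```

Passing the environment as a mapping keeps the helper easy to test without mutating the real process environment.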
export OPENROUTER_API_KEY="sk-or-..."
# PlanBench-V (vision)
python inference.py \
--model google/gemini-2.5-pro \
--data planbench-v/data/planbench-v-subset.json \
--image-dir planbench-v/images \
--output planbench-v/results/gemini-2.5-pro.json
# PlanBench (text)
python inference.py \
--model openai/gpt-4o-mini \
--data planbench/data/planbench.json \
--output planbench/results/gpt-4o-mini.json
# Limit to first 5 items for testing
python inference.py --model openai/gpt-4o-mini \
--data planbench-v/data/planbench-v-subset.json \
--image-dir planbench-v/images \
--output planbench-v/results/test.json --limit 5
eval.py scores inference outputs using a judge model.
# Score vision answers (judge needs image access)
python eval.py \
--judge openai/gpt-4o-mini \
--input planbench-v/results/gemini-2.5-pro.json \
--image-dir planbench-v/images \
--output planbench-v/results/gemini-2.5-pro-scored.json
# Score text answers
python eval.py \
--judge openai/gpt-4o-mini \
--input planbench/results/gpt-4o-mini.json \
--output planbench/results/gpt-4o-mini-scored.json
Scoring:
- PlanBench-V: Each answer is checked against critical_points. Score = (matched / total) × 2. Range: 0–2.
- PlanBench: Judge evaluates analysis logic (0–1) + answer correctness (0–1). Range: 0–2.
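The two scoring rules above can be written out as a small worked example. This is a minimal sketch of the arithmetic only, not the code in eval.py: the function names are hypothetical, and whether the judge returns fractional values for the two text components is not stated, so the sketch simply accepts any value in [0, 1].

```python
def planbench_v_score(matched: int, total: int) -> float:
    """Critical-point score: (matched / total) * 2, which lies in [0, 2]."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= matched <= total:
        raise ValueError("matched must be between 0 and total")
    return matched / total * 2


def planbench_text_score(logic: float, correctness: float) -> float:
    """Text score: analysis logic (0-1) plus answer correctness (0-1)."""
    for name, value in (("logic", logic), ("correctness", correctness)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return logic + correctness


# 2 of 3 critical points matched.
print(round(planbench_v_score(2, 3), 3))  # 1.333
# Sound reasoning, partially correct answer (illustrative values).
print(planbench_text_score(1.0, 0.5))  # 1.5
```

Both scores share the 0–2 range, so an item with three critical points can only score 0, 0.667, 1.333, or 2 on PlanBench-V.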
Both scripts support --max-workers N for parallel processing:
python inference.py --model openai/gpt-4o-mini \
--data planbench-v/data/planbench-v-subset.json \
--image-dir planbench-v/images \
--output planbench-v/results/test.json --max-workers 8
PlanBench/
├── README.md
├── inference.py                     # Unified inference (OpenAI-compatible backend)
├── eval.py                          # Unified evaluation / judge
├── planbench/
│   ├── data/
│   │   └── planbench.json           # 405 text items
│   └── results/                     # Text eval results
└── planbench-v/
    ├── data/
    │   ├── planbench-v-subset.json  # 300 vision items
    │   └── planbench-v-full.json    # 1,567 vision items
    ├── images/                      # Planning map images
    └── results/                     # Vision eval results
Numbers match Table 2 in the IJGIS paper. Overall is a per-item (micro) average over the 300-item subset, so it is not the unweighted mean of the eight sub-task columns.
| Rank | Model | Elem. Recog. | Caption | Classification | Spatial Reasoning | Domain Reasoning | Association | Scheme Eval. | Decision Making | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o | 1.051 | 1.878 | 1.406 | 1.305 | 1.564 | 1.527 | 1.223 | 1.429 | 1.342 |
| 2 | Qwen2.5-VL-72B-AWQ | 1.299 | 1.825 | 1.406 | 1.248 | 1.263 | 1.253 | 1.153 | 1.090 | 1.288 |
| 3 | InternVL3-9B | 1.173 | 1.878 | 1.297 | 1.260 | 1.435 | 1.297 | 0.921 | 0.903 | 1.271 |
| 4 | Qwen2.5-VL-7B | 1.101 | 1.628 | 1.089 | 0.865 | 1.110 | 1.069 | 0.802 | 1.054 | 1.050 |
| 5 | InternVL3-14B | 0.931 | 1.580 | 1.177 | 0.793 | 0.998 | 1.098 | 0.709 | 0.917 | 0.980 |
| 6 | Qwen2-VL-72B-AWQ | 1.010 | 1.367 | 1.125 | 0.746 | 0.967 | 1.114 | 0.670 | 0.632 | 0.963 |
| 7 | Qwen2-VL-7B | 0.902 | 1.386 | 1.031 | 0.716 | 0.943 | 0.979 | 0.857 | 0.657 | 0.910 |
| 8 | InternVL3-8B | 0.992 | 1.783 | 1.026 | 0.798 | 0.751 | 0.926 | 0.631 | 1.073 | 0.909 |
| 9 | Qwen2.5-VL-3B | 0.862 | 1.554 | 0.953 | 0.691 | 0.870 | 0.970 | 0.697 | 0.936 | 0.876 |
| 10 | GPT-4o-mini | 0.664 | 0.890 | 1.021 | 0.789 | 1.030 | 0.963 | 0.636 | 1.175 | 0.866 |
| 11 | Qwen2-VL-2B | 0.744 | 0.925 | 0.948 | 0.500 | 0.656 | 0.926 | 0.537 | 0.792 | 0.731 |
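
To make the micro versus macro distinction concrete, the sketch below recomputes the unweighted mean of the eight sub-task columns for the GPT-4o row above and compares it with the reported Overall. The per-sub-task item counts of the 300-item subset are not given here, so the `micro_average` helper is demonstrated on made-up counts that are labeled as hypothetical.

```python
from statistics import mean

# GPT-4o sub-task scores copied from the table above.
gpt4o = {
    "Elem. Recog.": 1.051, "Caption": 1.878, "Classification": 1.406,
    "Spatial Reasoning": 1.305, "Domain Reasoning": 1.564,
    "Association": 1.527, "Scheme Eval.": 1.223, "Decision Making": 1.429,
}
reported_overall = 1.342

# Macro average: every sub-task counts equally, regardless of its size.
macro = mean(gpt4o.values())
print(round(macro, 3), reported_overall)  # 1.423 1.342


def micro_average(task_means: dict, task_counts: dict) -> float:
    """Per-item average: each sub-task mean is weighted by its item count."""
    total_items = sum(task_counts.values())
    return sum(task_means[t] * task_counts[t] for t in task_means) / total_items


# Hypothetical two-task example (counts are NOT from PlanBench-V).
toy_means = {"a": 1.0, "b": 2.0}
toy_counts = {"a": 3, "b": 1}
print(micro_average(toy_means, toy_counts))  # 1.25
print(mean(toy_means.values()))  # 1.5
```

The gap between 1.423 and 1.342 is what one would expect if the lower-scoring sub-tasks contain more items than the higher-scoring ones.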

| Rank | Model | Elem. Recog. | Caption | Classification | Spatial Reasoning | Domain Reasoning | Association | Scheme Eval. | Decision Making | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3.6-Plus | 1.751 | 1.950 | 1.672 | 1.670 | 1.744 | 1.619 | 1.619 | 1.710 | 1.701 |
| 2 | Gemini-2.5-Pro | 1.408 | 1.775 | 1.656 | 1.444 | 1.425 | 1.468 | 1.439 | 1.525 | 1.472 |
| 3 | GPT-5.4 | 1.233 | 1.900 | 1.562 | 1.383 | 1.486 | 1.438 | 1.586 | 1.508 | 1.431 |
| 4 | Kimi-K2.6 | 1.518 | 1.516 | 1.344 | 1.414 | 1.376 | 1.365 | 1.440 | 1.317 | 1.417 |
| 5 | Claude-Opus-4.7 | 1.186 | 1.825 | 1.320 | 1.295 | 1.558 | 1.493 | 1.434 | 1.321 | 1.384 |
| 6 | Qwen3.6-Flash | 1.318 | 1.636 | 1.297 | 1.353 | 1.476 | 1.351 | 1.046 | 1.441 | 1.353 |

Per-model scored JSON files for Gemini-2.5-Pro, GPT-5.4, and Claude-Opus-4.7 are under planbench-v/results/.

Score = answer accuracy (%). Cognitive levels: Remember, Understand, Apply, Analyze, Evaluate.
| Rank | Model | Score | Remember | Understand | Apply | Analyze | Evaluate |
|---|---|---|---|---|---|---|---|
| 1 | Qwen3-32B | 80.9% | 97.5 | 86.4 | 95.1 | 86.1 | 39.5 |
| 2 | Qwen3-14B | 80.6% | 97.5 | 77.8 | 92.6 | 86.8 | 48.1 |
| 3 | QwQ-32B | 80.4% | 95.1 | 85.2 | 91.4 | 91.9 | 38.3 |
| 4 | Qwen3-8B | 80.0% | 93.8 | 80.2 | 90.1 | 90.4 | 45.7 |
| 5 | Qwen3-4B | 78.8% | 95.1 | 72.8 | 90.1 | 89.3 | 46.9 |
| 6 | Qwen3-30B-A3B | 78.4% | 97.5 | 79.0 | 88.9 | 89.5 | 37.0 |
| 7 | Qwen3-1.7B | 74.1% | 95.1 | 79.0 | 76.5 | 85.1 | 34.6 |
| 8 | glm-4-9b-chat | 73.3% | 91.4 | 72.8 | 84.0 | 79.9 | 38.3 |
| 9 | Meta-Llama-3-8B-Instruct | 70.6% | 95.1 | 58.0 | 72.8 | 78.8 | 48.1 |
| 10 | Qwen2.5-3B-Instruct | 70.3% | 98.8 | 66.7 | 92.6 | 64.0 | 29.6 |
| 11 | Qwen2.5-7B-Instruct | 69.5% | 98.8 | 70.4 | 81.5 | 65.9 | 30.9 |
| 12 | Qwen2-VL-7B-Instruct | 68.2% | 93.8 | 65.4 | 76.5 | 65.7 | 39.5 |
| 13 | DeepSeek-R1-Distill-Llama-8B | 68.1% | 93.8 | 64.2 | 75.3 | 78.8 | 28.4 |
| 14 | DeepSeek-R1-Distill-Qwen-7B | 68.0% | 96.3 | 69.1 | 77.8 | 73.4 | 23.5 |
| 15 | Qwen3-0.6B | 55.9% | 90.1 | 55.6 | 46.9 | 74.8 | 12.3 |
| 16 | Llama-3.1-Tulu-3-8B | 49.0% | 60.5 | 56.8 | 30.9 | 80.8 | 16.0 |
| 17 | chatglm3-6b | 48.3% | 80.2 | 37.5 | 44.4 | 58.3 | 21.0 |
| 18 | Qwen2.5-0.5B-Instruct | 39.3% | 65.4 | 21.0 | 25.9 | 69.4 | 14.8 |

Per-model raw eval results are under planbench/results/.
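
One way to read the text leaderboard: in the rows checked below, the Score column equals the unweighted mean of the five cognitive-level columns, rounded to one decimal. This is an observation from the table rather than a stated definition, and the sketch only re-does that arithmetic on three rows copied from it.

```python
from statistics import mean

# (reported Score, [Remember, Understand, Apply, Analyze, Evaluate])
rows = {
    "Qwen3-32B": (80.9, [97.5, 86.4, 95.1, 86.1, 39.5]),
    "Qwen3-14B": (80.6, [97.5, 77.8, 92.6, 86.8, 48.1]),
    "Qwen3-0.6B": (55.9, [90.1, 55.6, 46.9, 74.8, 12.3]),
}

for model, (score, levels) in rows.items():
    recomputed = round(mean(levels), 1)
    # Prints the reported Score next to the recomputed mean of the levels.
    print(f"{model}: reported {score}, mean of levels {recomputed}")
```

Because each level carries equal weight, a low Evaluate score pulls the overall Score down as much as any other level would, which is visible in the gap between the Remember and Evaluate columns.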
- Get an API key from openrouter.ai
- Set the environment variable:
export OPENROUTER_API_KEY="sk-or-v1-..."
- Use any model available on OpenRouter:
python inference.py --model google/gemini-2.5-pro ...
python inference.py --model anthropic/claude-sonnet-4 ...
python inference.py --model openai/gpt-4o ...
If you use PlanBench in your research, please cite:
@misc{zhu2024plangptenhancingurbanplanning,
title={PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval},
author={He Zhu and Wenjia Zhang and Nuoxian Huang and Boyang Li and Luyao Niu and Zipei Fan and Tianle Lun and Yicheng Tao and Junyou Su and Zhaoya Gong and Chenyu Fang and Xing Liu},
year={2024},
eprint={2402.19273},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2402.19273},
}
@misc{deng2025urban,
title = {Urban Planning Bench: A Comprehensive Benchmark for Evaluating Urban Planning Capabilities in Large Language Models},
author = {Yijie Deng and He Zhu and Wen Wang and Minxin Chen and Junyou Su and Wenjia Zhang},
year = {2025},
institution = {Behavioral and Spatial AI Lab, Tongji University and Peking University; College of Architecture and Urban Planning, Tongji University},
note = {†Equal contribution. *Corresponding author: wenjiazhang@tongji.edu.cn},
}
Released for academic research use. See the dataset card on HuggingFace for terms.