GitHub - tongxuluo/gamecraft-bench: Code and Data for paper "GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?"

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

GameCraft-Bench evaluates whether coding agents can transform natural-language game specifications into complete, playable Godot projects. Unlike traditional coding tasks, game generation depends on scripts, scenes, assets, rendering, runtime configuration, and player-game interaction working together as one executable system.

The benchmark contains 140 tasks across 15 game families. Each agent must submit a complete Godot project together with replayable demonstration traces. The verifier launches the project, replays the traces, records gameplay evidence, and scores observed play with a hidden rubric and multimodal judge.

The benchmark runs on top of Harbor and ships with a custom local-subprocess environment for Docker-less hosts.

Benchmark overview

GameCraft-Bench is organized around three desiderata for end-to-end game generation:

Engine Grounding: games are generated and evaluated inside a concrete game engine and runtime environment.
Artifact Completeness: agents must deliver complete launchable game projects rather than isolated scripts, scenes, or assets.
Interactive Verification: games are judged by observed behavior under player input, using replayed demonstrations as standardized gameplay evidence.

The task suite covers diverse 2D game-generation demands, including continuous control and collision, rule and state management, progression and economy, exploration, narrative interaction, and presentation-heavy gameplay.

Family	Tasks	Family	Tasks	Family	Tasks
Platformer	19	Strategy	17	Tycoon	16
Open-world	15	Roguelike	14	Visual novel	11
Puzzle	8	Shooter	7	Simulation	6
Card game	5	Horror	5	Rhythm	5
Idle	4	Racing	4	Sports	4

Main results

Frontier coding agents remain far from reliable end-to-end game generation. The strongest evaluated configuration reaches only 41.46% overall, and the other six configurations all score below 40%.

Harness	Model	Overall	Mechanics	Depth	Visuals	Art
Claude Code	Opus-4.7 high	41.46	55.34	39.48	42.78	36.86
Codex	GPT-5.5 high	39.49	54.36	38.61	41.84	32.94
Kimi Code	Kimi-K2.6	30.65	39.76	28.07	33.66	27.99
Claude Code	MiMo-V2.5-Pro	24.10	32.33	22.59	27.45	20.65
Code Buddy	GLM-5.1	18.29	25.23	17.80	21.14	14.59
Code Buddy	MiniMax-M2.7	10.95	14.27	9.92	14.92	8.85
Codex	DeepSeek-V4-Pro	2.15	2.25	1.69	1.97	2.63

Scores are percentages. Mechanics, Depth, Visuals, and Art correspond to the four rubric categories: Core Mechanics, Content Depth, Functional Visuals, and Art and Presentation.
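
For readers who want to work with these numbers, the table can be transcribed into a short script. The sketch below is illustrative only: it copies the rows above, counts the configurations below 40% overall, and compares Overall with the unweighted mean of the four categories. The two values differ, so Overall is presumably aggregated in some other way (for example per task or with rubric weights); the actual formula is not stated in this README.

```python
# Rows transcribed from the results table above:
# (harness, model, overall, mechanics, depth, visuals, art)
RESULTS = [
    ("Claude Code", "Opus-4.7 high", 41.46, 55.34, 39.48, 42.78, 36.86),
    ("Codex", "GPT-5.5 high", 39.49, 54.36, 38.61, 41.84, 32.94),
    ("Kimi Code", "Kimi-K2.6", 30.65, 39.76, 28.07, 33.66, 27.99),
    ("Claude Code", "MiMo-V2.5-Pro", 24.10, 32.33, 22.59, 27.45, 20.65),
    ("Code Buddy", "GLM-5.1", 18.29, 25.23, 17.80, 21.14, 14.59),
    ("Code Buddy", "MiniMax-M2.7", 10.95, 14.27, 9.92, 14.92, 8.85),
    ("Codex", "DeepSeek-V4-Pro", 2.15, 2.25, 1.69, 1.97, 2.63),
]


def below_threshold(rows, threshold=40.0):
    """Return 'harness + model' labels whose Overall score is under threshold."""
    return [f"{h} + {m}" for h, m, overall, *_ in rows if overall < threshold]


def category_mean(row):
    """Unweighted mean of the four rubric-category scores in one row."""
    return sum(row[3:]) / 4


if __name__ == "__main__":
    print(len(below_threshold(RESULTS)), "configurations score below 40% overall")
    for row in RESULTS:
        gap = category_mean(row) - row[2]
        print(f"{row[0]} + {row[1]}: category mean minus Overall = {gap:+.2f}")
```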

Evaluation protocol

Each task provides a natural-language game specification, a Godot-based workspace, shared resources, and a hidden rubric. A valid submission includes a Godot project under /workspace/game and replayable input traces under /workspace/game/demo_outputs/. The verifier checks launchability, replays the submitted traces, records gameplay evidence, and applies the hidden rubric to the observed behavior.
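
The layout requirement above can be checked before submitting. The following is a minimal sketch, not the verifier's own logic: the function name check_submission is hypothetical, and because the trace file format is not described here, the check only confirms that at least one file exists under demo_outputs/. It relies on the fact that a Godot project has a project.godot file at its root.

```python
from pathlib import Path


def check_submission(workspace: Path) -> list:
    """Return a list of layout problems (empty if the layout looks valid)."""
    problems = []
    game_dir = workspace / "game"

    # A Godot project is identified by project.godot at its root.
    if not (game_dir / "project.godot").is_file():
        problems.append("missing game/project.godot")

    # The trace format is unspecified here, so only presence is checked.
    demo_dir = game_dir / "demo_outputs"
    if not demo_dir.is_dir() or not any(p.is_file() for p in demo_dir.iterdir()):
        problems.append("no trace files under game/demo_outputs/")

    return problems


if __name__ == "__main__":
    for problem in check_submission(Path("/workspace")):
        print("problem:", problem)
```

Passing this check does not imply the project launches or that the traces replay; it only catches a misplaced project or an empty trace directory early.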

Install

Tested on Ubuntu 22.04. Run as root (or with sudo).

1. System dependencies

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    xvfb xdotool ffmpeg x11-utils x11-xserver-utils \
    libxcursor1 libxinerama1 libxrandr2 libxi6 libgl1 libegl1 \
    unzip ca-certificates curl \
    x11vnc novnc

These cover everything the verifier and dashboard need:

xvfb — virtual X display for headless replay
xdotool — synthetic mouse / keyboard injection
ffmpeg — screen recording (x11grab)
x11-utils / x11-xserver-utils — display probes
libxcursor1 … libegl1 — runtime libraries Godot dlopens
x11vnc / novnc — browser-based dashboard play sessions
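
After installation, a quick way to confirm the tools are reachable is to look each one up on PATH. The sketch below is an assumption-laden helper rather than part of the repository: the package-to-executable mapping reflects what these packages normally install on Ubuntu 22.04, and novnc and the shared libraries are left out because they provide no obvious executable to probe.

```python
import shutil

# Package name -> one executable it is expected to provide (assumed mapping).
EXPECTED_TOOLS = {
    "xvfb": "Xvfb",
    "xdotool": "xdotool",
    "ffmpeg": "ffmpeg",
    "x11-utils": "xdpyinfo",
    "x11-xserver-utils": "xset",
    "x11vnc": "x11vnc",
    "unzip": "unzip",
    "curl": "curl",
}


def missing_packages(which=shutil.which):
    """Return the packages whose probe executable is not found on PATH."""
    return sorted(pkg for pkg, exe in EXPECTED_TOOLS.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_packages()
    print("all probes found" if not missing else f"missing: {', '.join(missing)}")
```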

2. Godot 4.6.2

Pinned for reproducibility. Direct GitHub downloads often fail behind the Great Firewall (GFW); the gh-proxy.com mirror is an easy fallback:

mkdir -p /opt/godot && cd /opt/godot
curl -sSL -o godot.zip \
    "https://gh-proxy.com/https://github.com/godotengine/godot/releases/download/4.6.2-stable/Godot_v4.6.2-stable_linux.x86_64.zip"
unzip -o godot.zip && rm godot.zip
mv Godot_v4.6.2-stable_linux.x86_64 godot
chmod +x godot
ln -sf /opt/godot/godot /usr/local/bin/godot

godot --version       # → 4.6.2.stable.official.71f334935

If the direct link works for you, drop the https://gh-proxy.com/ prefix from the URL.
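
Because the engine version is pinned, it can be worth asserting it in scripts rather than checking by eye. This is a small sketch under the assumption that the godot symlink created above is on PATH; the helper names are hypothetical.

```python
import subprocess

PINNED_VERSION = "4.6.2"


def is_pinned(version_output: str, pinned: str = PINNED_VERSION) -> bool:
    """True if a `godot --version` string belongs to the pinned release."""
    # The trailing dot prevents "4.6.2" from matching e.g. "4.6.20".
    return version_output.strip().startswith(pinned + ".")


def installed_godot_version(binary: str = "godot") -> str:
    """Run `<binary> --version` and return its output without whitespace."""
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    version = installed_godot_version()
    print(version, "OK" if is_pinned(version) else f"expected {PINNED_VERSION}.x")
```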

3. Python

git clone <this-repo> game-bench && cd game-bench
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install -e .          # add --index-url https://pypi.tuna.tsinghua.edu.cn/simple if needed

4. Local config

cp .env.example .env         # fill in API keys / paths if defaults aren't right

.env holds judge API keys (OPENAI_API_KEY, ANTHROPIC_AUTH_TOKEN, …), the path to the Godot binary, and any path overrides. All scripts under scripts/ source it automatically.
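
To see which credentials are still empty after copying .env.example, a tiny parser is enough. The sketch below assumes plain KEY=VALUE lines (optionally prefixed with export) and does not handle inline comments or shell expansion. Which keys are actually required depends on the judge and agent you use; the two names in the usage part are simply the ones mentioned above.

```python
from pathlib import Path


def read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and one layer of quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def unset_keys(values: dict, required) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env = read_env_file(Path(".env"))
    print(unset_keys(env, ["OPENAI_API_KEY", "ANTHROPIC_AUTH_TOKEN"]))
```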

5. Asset libraries (optional but expected)

Tasks share two CC0 / permissive 2D asset pools, mounted read-only into each trial:

Kenney at /workspace/assets/library/ — themed packs (sprites, tilesets, UI, audio). ~480 MB across ~157 packs.
OpenGameArt at /workspace/assets/library-oga/ — narrative / RPG / pixel-art entries scraped from opengameart.org. Per-entry LICENSE.txt records the source URL and license string.

python scripts/fetch_assets.py                              # Kenney
python scripts/fetch_oga_assets.py --license CC0            # OGA, CC0 only

Both fetchers are idempotent; safe to re-run.
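
Since each OpenGameArt entry is described as carrying its own LICENSE.txt, a post-fetch audit can confirm none is missing. The sketch assumes that every entry is a direct subdirectory of the library directory; the on-host location of that directory is not stated here, so the path is left as a command-line argument.

```python
import sys
from pathlib import Path


def entries_missing_license(library_root: Path) -> list:
    """Names of entry directories that lack a LICENSE.txt file."""
    return sorted(
        entry.name
        for entry in library_root.iterdir()
        if entry.is_dir() and not (entry / "LICENSE.txt").is_file()
    )


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python audit_licenses.py <path-to-fetched-oga-library>
    for name in entries_missing_license(Path(sys.argv[1])):
        print("no LICENSE.txt:", name)
```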

Run a task

./scripts/run.sh -p tasks/<task> --agent <agent>

run.sh is a thin wrapper around harbor run that sources .env, activates the venv, sets PYTHONPATH, and points Harbor at our custom LocalSubprocessEnvironment. Use --agent oracle to run the bundled reference solution, or --agent nop to confirm that a do-nothing agent fails with a score of 0.0. All extra arguments are forwarded to Harbor, so -n, -x, --job-name, --delete, and --ak ... work the same way as in harbor run.
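
To run the same agent over many tasks, the single-task command can be generated per task directory. This is a sketch rather than a supported entry point: build_run_commands is a hypothetical helper, it assumes every task is a direct subdirectory of tasks/, and it uses only the -p and --agent flags shown above.

```python
import subprocess
from pathlib import Path


def build_run_commands(tasks_root: Path, agent: str = "oracle") -> list:
    """One run.sh invocation (as an argv list) per task directory."""
    return [
        ["./scripts/run.sh", "-p", str(task), "--agent", agent]
        for task in sorted(tasks_root.iterdir())
        if task.is_dir()
    ]


if __name__ == "__main__":
    # Run from the repo root so the relative script path resolves.
    for cmd in build_run_commands(Path("tasks")):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=False)  # keep going if one task fails
```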

For local coding-agent runs, use the dedicated wrappers. They source .env, pin the local agent implementation, forward the required API credentials into the agent subprocess, and default to --no-delete so generated projects remain under the job directory for inspection.

# Claude Code + Opus-4.7
./scripts/run_claude_code_opus_4_7.sh --ak reasoning_effort=high -p tasks/<task>

# Codex CLI + GPT-5.5
./scripts/run_codex_gpt_5_5.sh --ak reasoning_effort=high -p tasks/<task>

# Kimi Code + Kimi coding model
./scripts/run_kimi_code.sh -p tasks/<task>

# Claude Code + MiMo-V2.5-Pro through an Anthropic-compatible endpoint
./scripts/run_claude_code_mimo_2_5.sh --ak reasoning_effort=high -p tasks/<task>

Claude Code / Codex-based wrappers require --ak reasoning_effort=<low|medium|high>. The Kimi wrapper passes --ak thinking=true by default.

Job artifacts land under $GAMECRAFT_BENCH_JOBS_ROOT (default ../gamecraft-bench-jobs/<timestamp>/<task>__<id>/).
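
The directory pattern above makes it straightforward to enumerate finished trials. The sketch below reads the pattern as a root of ../gamecraft-bench-jobs containing <timestamp>/<task>__<id>/ directories, and it splits trial names at the last double underscore; both are assumptions about the layout, not documented guarantees.

```python
import os
from pathlib import Path


def jobs_root() -> Path:
    """Jobs root from the environment, falling back to the stated default."""
    return Path(os.environ.get("GAMECRAFT_BENCH_JOBS_ROOT", "../gamecraft-bench-jobs"))


def split_trial_name(name: str) -> tuple:
    """Split '<task>__<id>' at the last double underscore (assumed convention)."""
    task, sep, trial_id = name.rpartition("__")
    if not sep:
        raise ValueError(f"not a trial directory name: {name!r}")
    return task, trial_id


def list_trials(root: Path):
    """Yield (timestamp, task, id) for each <timestamp>/<task>__<id>/ directory."""
    for trial in sorted(root.glob("*/*__*")):
        if trial.is_dir():
            task, trial_id = split_trial_name(trial.name)
            yield trial.parent.name, task, trial_id


if __name__ == "__main__":
    for timestamp, task, trial_id in list_trials(jobs_root()):
        print(timestamp, task, trial_id)
```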

Dashboard

gamecraft_bench/dashboard/ is a browser-based dashboard for inspecting benchmark jobs, scores, artifacts, and the playable games agents produce. When launching a game for interactive inspection, it snapshots the project to a temporary play directory so the agent's retained sandbox is not disturbed.

./scripts/dashboard_service.sh             # default port 6090, default jobs root
./scripts/dashboard_service.sh --port 7000 --jobs-root /custom/path

Forward the port in VS Code (Ports panel), open http://localhost:6090/, then either:

Click Play on a single trial, or
Tick the checkboxes on multiple trials and click Compare to open a grid view with one live noVNC iframe per game (auto-laid-out 2/3/4 columns). Each cell has its own Refresh / Stop button and is fully interactive.

Architecture: a session pool of up to 8 X displays (:300–:307, disjoint from the verifier's :99–:199 range), each backing a dedicated Xvfb + Godot + x11vnc trio. The FastAPI app serves noVNC's static files at /novnc/ and bridges browser WebSocket frames to x11vnc TCP at /ws/{sid}. Closing the browser tab fires navigator.sendBeacon to free the slot.
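
The session pool described above can be pictured as a fixed set of display numbers handed out and returned per session. The class below is a simplified illustration of that idea, not the dashboard's actual code: DisplayPool and its methods are hypothetical, and starting or stopping the Xvfb, Godot, and x11vnc processes is omitted.

```python
from typing import Optional


class DisplayPool:
    """Hands out X display numbers from a fixed range, one per session."""

    def __init__(self, first: int = 300, size: int = 8):
        self._free = list(range(first, first + size))  # :300 .. :307 by default
        self._in_use = {}  # session id -> display number

    def acquire(self, session_id: str) -> Optional[str]:
        """Return a display such as ':300', or None when all slots are taken."""
        if session_id in self._in_use:
            return f":{self._in_use[session_id]}"
        if not self._free:
            return None
        display = self._free.pop(0)
        self._in_use[session_id] = display
        return f":{display}"

    def release(self, session_id: str) -> None:
        """Free the slot held by a session (no-op for unknown sessions)."""
        display = self._in_use.pop(session_id, None)
        if display is not None:
            self._free.append(display)
            self._free.sort()


if __name__ == "__main__":
    pool = DisplayPool()
    print(pool.acquire("tab-1"))  # first free display
    pool.release("tab-1")         # e.g. triggered when the browser tab closes
```

Keeping the range separate from the verifier's displays means a dashboard session can never collide with a replay that is running at the same time.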

Adding a task

1. Run harbor task init <org>/<task-name> from the repo root.
2. Move the generated directory under tasks/.
3. Fill in instruction.md and tests/rubric.json (build_check, requirements, score_formula).
4. Optionally, write solution/solve.sh for the oracle agent.

End-to-end check:

./scripts/run.sh -p tasks/<task> --agent oracle

Task assets / starter scaffold go under tasks/<task>/workspace/; contents are copied into the agent's /workspace/ at env start.
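
As a quick pre-flight before the oracle run, a new task directory can be checked for the files and rubric fields named above. This sketch assumes that build_check, requirements, and score_formula are top-level keys of a JSON object; the shape of their values is not described here and is not validated.

```python
import json
from pathlib import Path

REQUIRED_RUBRIC_KEYS = ("build_check", "requirements", "score_formula")


def task_problems(task_dir: Path) -> list:
    """Return a list of problems found in a task directory (empty if none)."""
    problems = []
    if not (task_dir / "instruction.md").is_file():
        problems.append("missing instruction.md")

    rubric_path = task_dir / "tests" / "rubric.json"
    if not rubric_path.is_file():
        problems.append("missing tests/rubric.json")
        return problems

    rubric = json.loads(rubric_path.read_text())
    for key in REQUIRED_RUBRIC_KEYS:
        if key not in rubric:
            problems.append(f"rubric.json lacks '{key}'")
    return problems


if __name__ == "__main__":
    import sys
    for problem in task_problems(Path(sys.argv[1])):
        print("problem:", problem)
```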

Scope

GameCraft-Bench focuses on 2D game generation in Godot. This makes the benchmark lightweight and reproducible for headless evaluation, but it does not cover Unity, Unreal, 3D games, multiplayer systems, large-scale physics, or long-form production workflows. The verifier scores visual gameplay evidence; audio-dependent aspects are represented through visible game behavior rather than direct audio evaluation.

The benchmark measures whether an agent follows the game specification and realizes the requested mechanics, content, visual state, and presentation in an executable artifact. It does not attempt to measure whether a generated game is subjectively fun.

Acknowledgment

GameCraft-Bench builds on Godot as the game engine runtime and Harbor as the benchmark and agent-execution harness. We thank the open-source communities behind these projects for making reproducible, end-to-end game-generation evaluation possible.

Citation

If you find GameCraft-Bench useful, please cite us:

@article{luo2026gamecraft,
  title={GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?},
  author={Luo, Tongxu and Wang, Rongsheng and Bi, Jiaxi and Xu, Chenming and Tang, Zhengyang and Chen, Jianlong and Liang, Juhao and Ji, Ke and Guo, Shuqi and Du, Yuhao and Bu, Fan and Du, Wenyu and Zhang, Xiaotong and Li, Kyle and Wang, Shaobo and Zhang, Linfeng and Liu, Yuxuan and Lai, Xin and Li, Chenxin and Guo, Yiduo and Zhang, Zhexin and Wang, Xinyuan and Bai, Tianyi and Li, Ziniu and Wang, Benyou},
  journal={arXiv preprint arXiv:2606.17861},
  year={2026}
}

License

Apache-2.0.
