WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces β paper code release.
114 long-horizon, real-world tasks across 8 work domains, where every task requires the agent to interleave GUI clicks with shell/code in one trajectory. Each task is scored by a trajectory-aware Agent-as-Judge that reads the chat trace + deliverables and emits per-clause evidence β much harder to spoof than file-existence checks.
βΆ Watch the demo (RabbitMQ DLQ topology, OPS domain Β· 110 s) Β· or browse all task demos on the website.
| Backbone Γ harness | PassRate β |
|---|---|
| Claude Opus 4.7 + Claude Code (best frontier pairing) | 41.2 % |
| Claude Opus 4.7 + OpenClaw | 35.1 % |
| GPT-5.5 + Codex CLI | 35.1 % |
| GPT-5.5 + OpenClaw | 33.3 % |
| GPT-5.4 + OpenClaw | 22.8 % |
| Gemini 3.1 Pro + OpenClaw | 1.8 % |
Frontier backbones report >78 % on OSWorld-Verified for comparable models β WeaveBench is far from saturation.
Full per-domain breakdowns + cross-harness sweep in docs/REPRODUCE.md.
Curious before you commit to the full setup? Install + run the offline demo:
git clone https://github.com/weavebench/WeaveBench.git && cd WeaveBench
pip install -e .
weavebench-demoThis prints a real score.json from a paper-canonical Claude Opus 4.7 attempt at WEB_task_1_mockup_pixel_diff β judge model, per-artifact evidence quotes, the works. ~1 second, no network calls. Use this to see what scoring looks like before deciding to spin up a 29.5 GB qcow2.
| What | Why |
|---|---|
| Linux host with KVM + Docker (β₯ 8 CPU, β₯ 32 GB RAM, β₯ 150 GB free disk) | Every task spins up an Ubuntu VM (OSWorld-derived) for the agent to act in |
| Python β₯ 3.10.12, Node β₯ 22 | Node powers the host-side OpenClaw judge (npm install -g openclaw) |
| OpenRouter API key | Used by both the agent and the trajectory-aware judge β both default to openai/gpt-5.5. Pay-as-you-go. export OPENROUTER_API_KEY=... |
π¨π³ Users in China:
export HF_ENDPOINT=https://hf-mirror.comβ allweavebench-download-*commands honor it automatically. Optionallyexport HF_TOKEN=hf_...if you have an account and want higher HF rate limits (not required β dataset is public).
| Resource | Estimate |
|---|---|
| Disk for one-time setup | ~29.5 GB (207 MB tasks + 852 MB runtime tarballs + 28.46 GB qcow2 + 7 KB judge template) |
Wall time and OpenRouter cost per task depend heavily on your VM host, network, chosen model, and how many parallel envs you run β --num_envs 8 on a beefy host is roughly an order of magnitude faster than --num_envs 1 on a laptop-class server. Run one smoke task first to calibrate before kicking off the full sweep.
If you already have an OSWorld-compatible qcow2 locally, use bash scripts/setup.sh --skip-vm and save the 28 GB.
git clone https://github.com/weavebench/WeaveBench.git && cd WeaveBench
export OPENROUTER_API_KEY=sk-or-v1-...
# Optional: export HF_ENDPOINT=https://hf-mirror.com # if in China
# Optional: export HF_TOKEN=hf_... # higher HF rate limits
bash scripts/setup.sh # installs + downloads dataset + runtime + 28 GB qcow2scripts/setup.sh runs prereq checks β pip install -e . β npm install -g openclaw β weavebench-download-{dataset,assets,judge,vm} and prints the next command. Pass --skip-vm if you already have a qcow2 (you'll need to export OSWORLD_LOCAL_QCOW2_PATH=...).
export OSWORLD_LOCAL_QCOW2_PATH=./cache/vm/Ubuntu.qcow2
# Smoke test: one WEB task (note the trailing _ in --task_filter β a bare
# 'task_1' is a substring match and would also include task_10..task_19)
weavebench-run \
--harness openclaw \
--model openai/gpt-5.5 \
--tasks_root ./cache/tasks \
--domains WEB \
--task_filter task_1_ \
--result_dir ./results/smoke
# Full 114-task sweep (multi-hour; tune --num_envs to your host)
weavebench-run \
--harness openclaw \
--model openai/gpt-5.5 \
--tasks_root ./cache/tasks \
--num_envs 8 \
--result_dir ./results/runSwap --harness to codex, claudecode, or hermes. Swap --model to anything OpenRouter exposes (anthropic/claude-opus-4.7, google/gemini-2.5-pro, β¦).
# OpenRouter overrides
export OPENROUTER_BASE_URL=... # default https://openrouter.ai/api/v1
export JUDGE_MODEL=anthropic/claude-opus-4.7 # judge defaults to openai/gpt-5.5;
# set this if you want a different judge
# Reproducibility β pin to the exact dataset commit used for the paper.
# See docs/REPRODUCE.md for the canonical SHA + per-table model snapshot ids.
export WEAVEBENCH_DATASET_REVISION=<full sha from docs/REPRODUCE.md>βΉοΈ Default model. Both the agent and the judge default to
openai/gpt-5.5β one model, one billing line, easy to reason about. Pass--model <other>to change the agent and/orJUDGE_MODEL=<other>to change just the judge. Seedocs/REPRODUCE.mdfor the exact backbones used in the paper.
results/run/<mode>/<model>/<DOMAIN>/<task>/
βββ chat.jsonl # LLM trace
βββ agent.log
βββ results.tar.gz # files the agent produced
βββ score.json # trajectory-aware judge: per-clause evidence + overall_score
Scoring runs automatically β every task is judged by a separate OpenClaw on the host that reads the deliverables + chat trace. See docs/AGENT_JUDGE.md for the one-time host-side setup.
See docs/REPRODUCE.md for the pinned dataset revision, model snapshot ids, and per-table commands used in the paper.
- OSWorld hybrid-scoring experiment (CLI agent vs vision, native + agent-as-judge):
experiments/osworld_hybrid/ - Architecture:
docs/ARCHITECTURE.md - Agent-as-Judge setup:
docs/AGENT_JUDGE.md - Reproducing paper numbers:
docs/REPRODUCE.md - Contributing:
CONTRIBUTING.md - Changelog:
CHANGELOG.md - Dataset + runtime + qcow2: π€ wanlilll/WeaveBench
@article{li2026weavebench,
title = {WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces},
author = {Li, Wanli and Zhou, Bowen and Yang, Yifan and Yu, Yunyao and Li, Dongsheng and Xu, Zhou and Shan, Caihua},
year = {2026},
}Built on OpenClaw, Codex CLI, Claude Code, Hermes-Agent, OSWorld, and WildClawBench.