WeaveBench

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces — paper code release.

114 long-horizon, real-world tasks across 8 work domains, where every task requires the agent to interleave GUI clicks with shell/code in one trajectory. Each task is scored by a trajectory-aware Agent-as-Judge that reads the chat trace + deliverables and emits per-clause evidence — much harder to spoof than file-existence checks.

🎬 Demo

▶ Watch the demo (RabbitMQ DLQ topology, OPS domain · 110 s) · or browse all task demos on the website.

Headline results (from the paper)

Backbone × harness	PassRate ↑
Claude Opus 4.7 + Claude Code (best frontier pairing)	41.2 %
Claude Opus 4.7 + OpenClaw	35.1 %
GPT-5.5 + Codex CLI	35.1 %
GPT-5.5 + OpenClaw	33.3 %
GPT-5.4 + OpenClaw	22.8 %
Gemini 3.1 Pro + OpenClaw	1.8 %

Frontier backbones report >78 % on OSWorld-Verified for comparable models — WeaveBench is far from saturation.

Full per-domain breakdowns + cross-harness sweep in docs/REPRODUCE.md.

Try it in 30 seconds (no VM, no API key)

Curious before you commit to the full setup? Install + run the offline demo:

git clone https://github.com/weavebench/WeaveBench.git && cd WeaveBench
pip install -e .
weavebench-demo

This prints a real score.json from a paper-canonical Claude Opus 4.7 attempt at WEB_task_1_mockup_pixel_diff — judge model, per-artifact evidence quotes, the works. ~1 second, no network calls. Use this to see what scoring looks like before deciding to spin up a 29.5 GB qcow2.

Requirements (for the full pipeline)

What	Why
Linux host with KVM + Docker (≥ 8 CPU, ≥ 32 GB RAM, ≥ 150 GB free disk)	Every task spins up an Ubuntu VM (OSWorld-derived) for the agent to act in
Python ≥ 3.10.12, Node ≥ 22	Node powers the host-side OpenClaw judge (`npm install -g openclaw`)
OpenRouter API key	Used by both the agent and the trajectory-aware judge — both default to `openai/gpt-5.5`. Pay-as-you-go. `export OPENROUTER_API_KEY=...`

🇨🇳 Users in China: export HF_ENDPOINT=https://hf-mirror.com — all weavebench-download-* commands honor it automatically. Optionally export HF_TOKEN=hf_... if you have an account and want higher HF rate limits (not required — dataset is public).

Resource budget (read before you commit)

Resource	Estimate
Disk for one-time setup	~29.5 GB (207 MB tasks + 852 MB runtime tarballs + 28.46 GB qcow2 + 7 KB judge template)

Wall time and OpenRouter cost per task depend heavily on your VM host, network, chosen model, and how many parallel envs you run — --num_envs 8 on a beefy host is roughly an order of magnitude faster than --num_envs 1 on a laptop-class server. Run one smoke task first to calibrate before kicking off the full sweep.

If you already have an OSWorld-compatible qcow2 locally, use bash scripts/setup.sh --skip-vm and save the 28 GB.

One-command setup

git clone https://github.com/weavebench/WeaveBench.git && cd WeaveBench

export OPENROUTER_API_KEY=sk-or-v1-...
# Optional: export HF_ENDPOINT=https://hf-mirror.com   # if in China
# Optional: export HF_TOKEN=hf_...                     # higher HF rate limits

bash scripts/setup.sh                          # installs + downloads dataset + runtime + 28 GB qcow2

scripts/setup.sh runs prereq checks → pip install -e . → npm install -g openclaw → weavebench-download-{dataset,assets,judge,vm} and prints the next command. Pass --skip-vm if you already have a qcow2 (you'll need to export OSWORLD_LOCAL_QCOW2_PATH=...).

Run

export OSWORLD_LOCAL_QCOW2_PATH=./cache/vm/Ubuntu.qcow2

# Smoke test: one WEB task (note the trailing _ in --task_filter — a bare
# 'task_1' is a substring match and would also include task_10..task_19)
weavebench-run \
    --harness openclaw \
    --model openai/gpt-5.5 \
    --tasks_root ./cache/tasks \
    --domains WEB \
    --task_filter task_1_ \
    --result_dir ./results/smoke

# Full 114-task sweep (multi-hour; tune --num_envs to your host)
weavebench-run \
    --harness openclaw \
    --model openai/gpt-5.5 \
    --tasks_root ./cache/tasks \
    --num_envs 8 \
    --result_dir ./results/run

Swap --harness to codex, claudecode, or hermes. Swap --model to anything OpenRouter exposes (anthropic/claude-opus-4.7, google/gemini-2.5-pro, …).

Optional env vars

# OpenRouter overrides
export OPENROUTER_BASE_URL=...                       # default https://openrouter.ai/api/v1
export JUDGE_MODEL=anthropic/claude-opus-4.7         # judge defaults to openai/gpt-5.5;
                                                     # set this if you want a different judge

# Reproducibility — pin to the exact dataset commit used for the paper.
# See docs/REPRODUCE.md for the canonical SHA + per-table model snapshot ids.
export WEAVEBENCH_DATASET_REVISION=<full sha from docs/REPRODUCE.md>

ℹ️ Default model. Both the agent and the judge default to openai/gpt-5.5 — one model, one billing line, easy to reason about. Pass --model <other> to change the agent and/or JUDGE_MODEL=<other> to change just the judge. See docs/REPRODUCE.md for the exact backbones used in the paper.

Output

results/run/<mode>/<model>/<DOMAIN>/<task>/
  ├── chat.jsonl            # LLM trace
  ├── agent.log
  ├── results.tar.gz        # files the agent produced
  └── score.json            # trajectory-aware judge: per-clause evidence + overall_score

Scoring runs automatically — every task is judged by a separate OpenClaw on the host that reads the deliverables + chat trace. See docs/AGENT_JUDGE.md for the one-time host-side setup.

Reproducing paper numbers

See docs/REPRODUCE.md for the pinned dataset revision, model snapshot ids, and per-table commands used in the paper.

More

OSWorld hybrid-scoring experiment (CLI agent vs vision, native + agent-as-judge): experiments/osworld_hybrid/
Architecture: docs/ARCHITECTURE.md
Agent-as-Judge setup: docs/AGENT_JUDGE.md
Reproducing paper numbers: docs/REPRODUCE.md
Contributing: CONTRIBUTING.md
Changelog: CHANGELOG.md
Dataset + runtime + qcow2: 🤗 wanlilll/WeaveBench

Citation

@article{li2026weavebench,
  title  = {WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces},
  author = {Li, Wanli and Zhou, Bowen and Yang, Yifan and Yu, Yunyao and Li, Dongsheng and Xu, Zhou and Shan, Caihua},
  year   = {2026},
}

Acknowledgements

Built on OpenClaw, Codex CLI, Claude Code, Hermes-Agent, OSWorld, and WildClawBench.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
experiments/osworld_hybrid		experiments/osworld_hybrid
scripts		scripts
tests		tests
weavebench		weavebench
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WeaveBench

🎬 Demo

Headline results (from the paper)

Try it in 30 seconds (no VM, no API key)

Requirements (for the full pipeline)

Resource budget (read before you commit)

One-command setup

Run

Optional env vars

Output

Reproducing paper numbers

More

Citation

Acknowledgements

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

WeaveBench

🎬 Demo

Headline results (from the paper)

Try it in 30 seconds (no VM, no API key)

Requirements (for the full pipeline)

Resource budget (read before you commit)

One-command setup

Run

Optional env vars

Output

Reproducing paper numbers

More

Citation

Acknowledgements

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages