Tier: tests-only (tests/benchmark/); zero changes under src/** → P1–P5 trivially preserved
PR target: develop
Background
tests/benchmark/ (benchmark-runner.ts, openchrome-real-adapter.ts, tasks/*.ts) measures mechanical performance — call counts, byte lengths, latency — over synthetic local fixtures. It does not measure task success: whether an LLM agent driving openchrome actually completes a real web task.
notte publicly claims WebVoyager30: 86.2% self-eval, 79.0% LLM-eval, 47s/task. browser-use is reported at 113s/task. These numbers are credible but self-reported; the 7.2-percentage-point gap between notte's self-eval and LLM-eval scores exposes the weakness of LLM judging.
OpenChrome's distinguishing claim is verifiable execution via Outcome Contracts (src/contracts/). This issue makes that claim falsifiable on real-web tasks:
Run an agent against real public websites using tests/benchmark/adapters/openchrome-real-adapter.ts.
Score success via src/contracts/evaluate.ts (URL / DOM / network / screenshot postconditions) — not via LLM judging. Eliminates the self-vs-LLM-eval gap by construction.
Publish a single comparable number: "WebVoyager contract-eval score = X / N tasks passed".
Why this is necessary (not nice-to-have)
The outcome-contracts label is openchrome's core differentiator. There is currently no public number demonstrating contracts work on real-web tasks at scale.
Without this benchmark, "more reliable than browser-use" claims are unfalsifiable.
A failing benchmark on a PR is a strong regression gate that unit tests cannot replicate (they test code, not agent behavior).
Contract-eval scoring is intrinsically more rigorous than LLM-judge scoring: a postcondition either holds or it does not, so there is no 7.2-percentage-point self-vs-LLM-eval gap.
Proposed Implementation
Phase 1 (this issue): 10 tasks + harness
New directory: tests/benchmark/webvoyager/
tasks/ — 10 TypeScript task specs (see task list below)
runner.ts — orchestrator: spawns openchrome MCP server, hands LLM adapter the instruction, runs to completion or timeout, evaluates contract via src/contracts/evaluate.ts, records JSON report
report.ts — emits Markdown table at tests/benchmark/webvoyager/reports/<git-sha>.md
baseline.json — committed minimum score; raised over time, never lowered
llm/claude-adapter.ts — Anthropic API adapter (opt-in via ANTHROPIC_API_KEY)
llm/mock-adapter.ts — deterministic transcript replay (CI default, recorded transcripts in transcripts/)
Task contract format (reuses existing DSL from src/contracts/types.ts — url | dom_text | dom_count | network | screenshot_class | no_dialog + and | or | not; no new operators):
The exact field names follow src/contracts/types.ts as of develop HEAD; the runner does NOT introduce new operators — if a task needs one, the task is rejected at PR review.
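The contract listing that belonged at this point did not survive in this copy of the issue. As a rough illustration only, a task spec might look like the sketch below. Every field name in it (kind, pattern, selector, contains, all, any, inner, timeoutMs) is a hypothetical stand-in, as are the instruction wording and the timeout; the real names are whatever src/contracts/types.ts defines on develop HEAD.

```typescript
// Hypothetical shapes for illustration; the authoritative definitions are in
// src/contracts/types.ts and may use different field names.
type Postcondition =
  | { kind: "url"; pattern: string }
  | { kind: "dom_text"; selector: string; contains: string }
  | { kind: "and"; all: Postcondition[] }
  | { kind: "or"; any: Postcondition[] }
  | { kind: "not"; inner: Postcondition };

interface TaskSpec {
  name: string;
  instruction: string; // natural-language prompt handed to the LLM adapter
  timeoutMs: number;
  contract: Postcondition;
}

// task-01 from the Phase-1 list; instruction wording and timeout are made up.
const task01: TaskSpec = {
  name: "task-01-example-com-title",
  instruction: "Open example.com and read the page heading.",
  timeoutMs: 30_000,
  contract: {
    kind: "and",
    all: [
      { kind: "url", pattern: "^https://example\\.com/?$" },
      { kind: "dom_text", selector: "h1", contains: "Example Domain" },
    ],
  },
};

// Counts leaf postconditions, e.g. as a sanity check during PR review.
function countLeaves(p: Postcondition): number {
  switch (p.kind) {
    case "and":
      return p.all.reduce((n, child) => n + countLeaves(child), 0);
    case "or":
      return p.any.reduce((n, child) => n + countLeaves(child), 0);
    case "not":
      return countLeaves(p.inner);
    default:
      return 1;
  }
}
```

Putting the url postcondition first matches Scenario 2, which expects a wrong final URL to surface as postconditions[0].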
Phase-1 task list (10 tasks, login-free, public, content stable for ≥ 5 years):
task-01-example-com-title — the h1 heading of example.com is "Example Domain"
task-02-mdn-fetch-syntax — MDN page for fetch() contains the literal fetch(resource) syntax line
task-03-wikipedia-eiffel-height — en.wikipedia.org Eiffel_Tower article infobox contains height "330 m" (verified via dom_text with contains; if Wikipedia rounds in future, contract uses or-of-acceptable strings committed in the task spec)
task-04-rfc-9110-section-9-title — RFC 9110 §9 title is "Methods" (immutable RFC)
task-05-w3c-html-section-definition — html.spec.whatwg.org <section> element page contains "represents a generic section"
task-06-arxiv-2401-13919-abstract — arxiv.org/abs/2401.13919 page's author list contains "Hongliang He"
task-07-rust-string-trim-method — following the trim method link on doc.rust-lang.org/std/string/struct.String.html reaches a page whose URL ends with str.html#method.trim
task-08-mdn-array-map-return — MDN Array.prototype.map page contains "A new array with each element being the result"
task-09-wikipedia-speed-of-light — Wikipedia Speed_of_light page contains "299,792,458"
task-10-tc39-ecma262-strict-mode — tc39.es/ecma262/ page reachable; URL after navigation matches ^https://tc39\.es/ecma262/
Selection criteria (committed in tasks/README.md):
Anonymous (logged-out) public access
Content immutable or change-cycle ≥ 5 years (versioned specs, encyclopedia entries about historical facts)
No captcha / no geofencing / no payment
Avoids any contract requiring an operator not in src/contracts/types.ts
No live-updating numbers (intentionally excludes GitHub star counts, HN top story, npm latest version)
Brittleness mitigation: tasks with low-risk drift (e.g., Wikipedia phrasing) use or-of-acceptable strings inside the and-contract; rationale documented per task. If a task contract starts failing for non-openchrome reasons (upstream rewrite), the PR fixing it edits the task spec and re-records the transcript.
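The or-of-acceptable-strings pattern can be sketched as a small helper. This is a minimal sketch under assumptions: the contract shapes are simplified stand-ins for src/contracts/types.ts, the evaluator here is a toy (the real judge is src/contracts/evaluate.ts), and the alternative height strings are illustrative, not a reviewed list.

```typescript
// Simplified, hypothetical contract shapes (text-only) for this sketch.
type Postcondition =
  | { kind: "dom_text"; contains: string }
  | { kind: "and"; all: Postcondition[] }
  | { kind: "or"; any: Postcondition[] };

// Toy evaluator over plain page text; stands in for src/contracts/evaluate.ts.
function evaluate(p: Postcondition, pageText: string): boolean {
  switch (p.kind) {
    case "dom_text":
      return pageText.includes(p.contains);
    case "and":
      return p.all.every((child) => evaluate(child, pageText));
    case "or":
      return p.any.some((child) => evaluate(child, pageText));
  }
}

// Builds an `or` over every phrasing the task author has agreed to accept.
function anyOfText(...accepted: string[]): Postcondition {
  return {
    kind: "or",
    any: accepted.map((contains) => ({ kind: "dom_text" as const, contains })),
  };
}

// task-03 style contract: a fixed anchor plus tolerant height phrasing.
// The accepted strings below are examples only.
const eiffelHeight: Postcondition = {
  kind: "and",
  all: [
    { kind: "dom_text", contains: "Eiffel Tower" },
    anyOfText("330 m", "330 metres", "1,083 ft"),
  ],
};
```

Keeping the accepted strings in the task spec means an upstream rewording shows up as a reviewed diff to the spec rather than as a silent loosening of the judge.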
LLM adapter abstraction:
claude-adapter.ts — Anthropic Messages API, hard caps: max_tokens: 4096 per turn, max_tool_iterations: 50 per task, max_usd_per_task: 0.50 (computed from response usage; aborts the task with BUDGET_EXCEEDED if exceeded). Caps live in llm/budget.ts, configurable per task but never bypassable.
mock-adapter.ts — replays recorded transcripts/<task-name>.jsonl deterministically; each entry is {tool, args_digest_sha256, response_kind}. On replay, the adapter intercepts the LLM-step boundary, looks up the next expected tool call by sequence number, and emits the recorded openchrome tool call directly. Drift (the model would have called a different tool) is not silently tolerated: the replay assertion fails, the task is reported as replay_drift, and the transcript must be re-recorded.
Transcript lifecycle: transcripts are recorded once by running claude-adapter against the real API, manually reviewed (PR review must include a transcript snippet), then frozen as fixtures. Any change to a task spec OR a meaningful model behavior change requires explicit re-recording — PR title must include [transcript-rerecord: <task names>].
Adapter chosen via env: OPENCHROME_BENCH_ADAPTER=mock (default) or claude
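One possible reading of the replay-with-drift-detection behaviour is sketched below. The entry shape is taken from the description above; the TranscriptReplayer class, its method names, and the idea of comparing an observed call against the recorded one are assumptions made for illustration, and the digests are treated as opaque precomputed strings.

```typescript
// Entry shape as described in the issue text.
interface TranscriptEntry {
  tool: string;
  args_digest_sha256: string;
  response_kind: string;
}

type ReplayResult =
  | { status: "ok"; entry: TranscriptEntry }
  | { status: "replay_drift"; seq: number; expected: TranscriptEntry | undefined; actualTool: string };

// Hypothetical helper: walks a recorded transcript by sequence number.
class TranscriptReplayer {
  private seq = 0;
  private readonly entries: readonly TranscriptEntry[];

  constructor(entries: readonly TranscriptEntry[]) {
    this.entries = entries;
  }

  // Parses a .jsonl fixture: one JSON object per non-empty line.
  static fromJsonl(jsonl: string): TranscriptReplayer {
    const entries = jsonl
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as TranscriptEntry);
    return new TranscriptReplayer(entries);
  }

  // Compares the call being made with the next recorded call; any mismatch
  // (or running past the end) is reported as drift instead of being tolerated.
  next(actualTool: string, actualArgsDigest: string): ReplayResult {
    const seq = this.seq;
    const expected = this.entries[seq];
    if (
      expected === undefined ||
      expected.tool !== actualTool ||
      expected.args_digest_sha256 !== actualArgsDigest
    ) {
      return { status: "replay_drift", seq, expected, actualTool };
    }
    this.seq += 1;
    return { status: "ok", entry: expected };
  }

  // True once every recorded call has been consumed.
  get exhausted(): boolean {
    return this.seq === this.entries.length;
  }
}
```

A runner built this way would also want to treat a transcript that is not exhausted at task end as drift, since a shorter run is as much a behaviour change as a different tool.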
CI gating:
npm run bench:webvoyager:mock runs in CI on every PR
Pass condition (strict, not score-based): every replay must succeed and emit task_passed. A single replay_drift or contract failure fails CI. The baseline.json thus stores expected_pass_count: 10 (full set), not a soft threshold — this prevents bootstrapping at 0/10 from being "passing".
Real-LLM run gated behind OPENCHROME_BENCH_REAL=1 ANTHROPIC_API_KEY=...; not run in CI; runbook in docs/benchmarks/webvoyager.md includes total-spend estimate (≤ $5 for the 10-task suite with the budget caps above)
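The strict pass condition can be sketched as a pure function over per-task results. The names task_passed, replay_drift, and expected_pass_count come from the text above; the contract_failed status and the gate function itself are hypothetical.

```typescript
// "contract_failed" is an assumed status name for this sketch.
type TaskStatus = "task_passed" | "contract_failed" | "replay_drift";

interface TaskResult {
  name: string;
  status: TaskStatus;
}

interface Baseline {
  expected_pass_count: number;
}

// Strict gate: every task must pass AND the pass count must equal the
// baseline, so an empty or truncated run cannot count as "passing".
function gate(
  results: readonly TaskResult[],
  baseline: Baseline,
): { ok: boolean; reasons: string[] } {
  const reasons = results
    .filter((r) => r.status !== "task_passed")
    .map((r) => `${r.name}: ${r.status}`);
  const passed = results.length - reasons.length;
  if (passed !== baseline.expected_pass_count) {
    reasons.push(`pass count ${passed} != expected ${baseline.expected_pass_count}`);
  }
  return { ok: reasons.length === 0, reasons };
}
```

Returning the reasons alongside the verdict lets the CI log name the failing task, which is what Scenario 5 asks for.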
Report contents (reports/<git-sha>.md):
Header: git sha, adapter, total tasks, pass count, contract-eval score
Per-task row: name | result | duration_ms | tool_calls | response_bytes | failed_postcondition (if any)
Comparison footer: notte's published 86.2% / 47s; openchrome's number; note that contracts are stricter than LLM-eval
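A minimal sketch of how report.ts might render the header and per-task rows follows. The column names are the ones listed above; the TaskRow interface, the renderReport function, and the exact header wording are assumptions, and the comparison footer is left out.

```typescript
// Per-task row with the columns named in the issue.
interface TaskRow {
  name: string;
  result: "passed" | "failed";
  duration_ms: number;
  tool_calls: number;
  response_bytes: number;
  failed_postcondition?: string;
}

// Hypothetical renderer: header block followed by a Markdown table.
function renderReport(gitSha: string, adapter: string, rows: readonly TaskRow[]): string {
  const passed = rows.filter((r) => r.result === "passed").length;
  const lines = [
    "# WebVoyager contract-eval report",
    "",
    `git sha: ${gitSha}`,
    `adapter: ${adapter}`,
    `total tasks: ${rows.length}`,
    `pass count: ${passed}`,
    `contract-eval score: ${passed} / ${rows.length}`,
    "",
    "| name | result | duration_ms | tool_calls | response_bytes | failed_postcondition |",
    "| --- | --- | --- | --- | --- | --- |",
    ...rows.map(
      (r) =>
        `| ${r.name} | ${r.result} | ${r.duration_ms} | ${r.tool_calls} | ${r.response_bytes} | ${r.failed_postcondition ?? ""} |`,
    ),
  ];
  return lines.join("\n");
}
```

Emitting the JSON report from the same rows array would keep the .md and .json artifacts from drifting apart.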
Phase 2 (follow-up, separate issue): expand to 30 tasks, multi-LLM adapters
Not in scope here.
Acceptance Criteria
10 task files committed; each contract validates against src/contracts/types.ts
npm run bench:webvoyager:mock runs to completion in ≤ 3 minutes, deterministic across re-runs
Mock transcripts committed for all 10 tasks (each is a fixture in transcripts/<task>.jsonl)
claude-adapter.ts exists and runs against real Claude API when keys are set (documented in runbook)
Report emits Markdown + JSON to reports/<git-sha>.{md,json}
baseline.json committed with the bootstrap score
Zero changes under src/** — verified by CI step: test -z "$(git diff --name-only origin/develop...HEAD -- 'src/**')" (exits non-zero if any file under src/ changed)
No new runtime dependencies in package.json — only devDependencies allowed (e.g., @anthropic-ai/sdk)
Scenario 2 — contract evaluator is the sole judge
Hand-craft a corrupted transcript for task-01: same read_page calls but the recorded final URL is https://wrong.example/. Run mock mode.
Pass: runner reports task-01: failed | failed_postcondition: postconditions[0] (url match). The judge call is src/contracts/evaluate.ts (verifiable by stack trace in failure output).
Scenario 3 — real-LLM smoke on the trivial task
ANTHROPIC_API_KEY=sk-... OPENCHROME_BENCH_ADAPTER=claude OPENCHROME_BENCH_REAL=1 \
npm run bench:webvoyager:real -- --task task-01-example-com-title
Pass: returns success: true, duration < 30s. Cost printed at end (estimated USD via response tokens). PR description records the observed value.
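The printed cost and the per-task caps could share one small accumulator, sketched below. The input_tokens and output_tokens fields are the ones the Anthropic Messages API reports in a response's usage object, and the cap names and BUDGET_EXCEEDED come from the adapter description earlier in this issue. The TaskBudget class and the Pricing shape are hypothetical, and any prices passed in are placeholders to be replaced with the current rates for the model in use.

```typescript
// Token counts as reported in the `usage` field of a Messages API response.
interface Usage {
  input_tokens: number;
  output_tokens: number;
}

// ASSUMPTION: prices are supplied by the caller in USD per million tokens.
interface Pricing {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

interface Caps {
  max_tool_iterations: number;
  max_usd_per_task: number;
}

type BudgetStatus = "ok" | "BUDGET_EXCEEDED";

// Hypothetical accumulator: one instance per task.
class TaskBudget {
  private spentUsd = 0;
  private iterations = 0;
  private readonly caps: Caps;
  private readonly pricing: Pricing;

  constructor(caps: Caps, pricing: Pricing) {
    this.caps = caps;
    this.pricing = pricing;
  }

  // Call once per model turn; the runner aborts the task on BUDGET_EXCEEDED.
  charge(usage: Usage): BudgetStatus {
    this.iterations += 1;
    this.spentUsd +=
      (usage.input_tokens / 1e6) * this.pricing.inputUsdPerMTok +
      (usage.output_tokens / 1e6) * this.pricing.outputUsdPerMTok;
    const over =
      this.iterations > this.caps.max_tool_iterations ||
      this.spentUsd > this.caps.max_usd_per_task;
    return over ? "BUDGET_EXCEEDED" : "ok";
  }

  // Running estimate, suitable for the "cost printed at end" line.
  get estimatedUsd(): number {
    return this.spentUsd;
  }
}
```

With a 0.50 USD cap per task, ten tasks can spend at most about 5 USD, which is where the suite-level estimate in the runbook comes from.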
Scenario 4 — comparison number published
After Phase-1 real-LLM run (single run acceptable for v1, variance documented):
Pass: docs/benchmarks/webvoyager.md contains a populated table:
contract-eval score (X / 10)
median task duration (ms)
median tool-call count
notte's published numbers as a reference row
explicit note: "contract-eval is stricter than LLM-eval; numbers not directly comparable to notte's 86.2%"
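The two medians in that table are simple to compute from the per-task rows. A small sketch follows; the median helper is hypothetical and uses the conventional definition (middle value, or the mean of the two middle values for an even count).

```typescript
// Median of a non-empty list; does not mutate the input.
function median(values: readonly number[]): number {
  if (values.length === 0) {
    throw new RangeError("median of an empty list is undefined");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Indices are in range because the list is non-empty.
  return sorted.length % 2 === 1
    ? sorted[mid]!
    : (sorted[mid - 1]! + sorted[mid]!) / 2;
}
```

For a ten-task suite the count is even, so the reported median is the mean of the fifth and sixth fastest tasks.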
Scenario 5 — CI regression gate
Open a no-op PR modifying only README.md. CI runs bench:webvoyager:mock.
Pass: gate passes. Then in a separate test commit, mutate one mock transcript so a contract fails; gate fails with the failing task name + contract clause shown in the CI log.
Keep this issue aligned with OpenChrome's MCP/CDP-first, additive, deterministic-tool-server direction.
If an existing open PR already implements part of this scope, update that PR or mark the overlap explicitly before starting new work.
Do not absorb adjacent benchmark, dashboard, security, or skill-memory work unless the original issue text requires it.
Implementation checklist
Restate the exact contract for "WebVoyager-style contract-eval benchmark — 10 tasks, mock+real adapters, no src/ changes" in code/docs before changing behavior.
Add deterministic local fixtures and machine-readable result artifacts.
Measure the issue-specific success/failure, latency, payload, and evidence fields rather than only checking command exit code.
Document the command, expected artifacts, and pass/fail interpretation for merge verification.
Add regression coverage for the issue-specific happy path, failure path, default/disabled path, and artifact/output bounds.
Update user-facing docs or inline tool descriptions when hosts must choose a new flag, mode, policy, or workflow.
Success criteria
The implementation satisfies the primary deliverable without broadening into non-goals.
Existing default behavior remains backward-compatible or the issue explicitly documents the compatibility break.
Failure cases return bounded, actionable diagnostics rather than silent fallback or unbounded dumps.
Tests/benchmarks cover the concrete surface named in this issue, not only helper utilities.
Any produced artifact is deterministic, redacted, and small enough for merge review or stored behind handles.
Post-merge OpenChrome live verification checklist
Run the documented local OpenChrome fixture or smoke path for "WebVoyager-style contract-eval benchmark — 10 tasks, mock+real adapters, no src/ changes" and capture the exact command/tool calls.
Verify read_page behavior matches the issue goal in both the enabled path and the default/disabled compatibility path.
Inspect generated artifacts/logs/responses for bounded size, redaction, source links, and clear failure diagnostics.
Record sanitized output excerpts, artifact paths, and any benchmark/latency/payload numbers in merge verification notes.
docs/benchmarks/webvoyager.md runbook authored
PR targets develop
Verification (post-merge, using openchrome MCP)
Scenario 1 — mock-mode reproducibility
Pass: empty diff (after timestamp / duration normalization).
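The commands for this scenario are missing from this copy of the issue, so the following is only a sketch of what "timestamp / duration normalization" could mean for two JSON reports. The key names timestamp and duration_ms are assumptions (duration_ms appears as a report column; timestamp is a guess), as are the normalize and sameReport helpers.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// ASSUMPTION: these are the only run-dependent keys in a report.
const VOLATILE_KEYS = new Set(["timestamp", "duration_ms"]);

// Drops volatile keys and sorts object keys so that two runs of the same
// transcript serialize to the same string.
function normalize(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(normalize);
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      const child = value[key];
      if (child !== undefined && !VOLATILE_KEYS.has(key)) {
        out[key] = normalize(child);
      }
    }
    return out;
  }
  return value;
}

// Two reports are "the same run" if they match after normalization.
function sameReport(a: Json, b: Json): boolean {
  return JSON.stringify(normalize(a)) === JSON.stringify(normalize(b));
}
```

Array order is deliberately preserved, since a change in task order or tool-call order between runs is exactly the kind of nondeterminism this scenario is meant to catch.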
Scenario 6 — P1–P5 compliance audit
Pass: first command outputs 0; second outputs only devDependencies entries (or empty). Audit captured in PR description.
Issue closure criteria
Scenarios 1–6 pass + CI green + report file committed for the PR's own git sha.
Out of scope (Phase 2 / follow-up)
Dependencies
semantic mode) — runner could choose read_page mode per task to compare token efficiency; not required
ref-based interaction; not required
References
tests/benchmark/benchmark-runner.ts (existing harness — extended, not replaced)
tests/benchmark/adapters/openchrome-real-adapter.ts (real MCP adapter already exists)
src/contracts/evaluate.ts (the judge)
src/contracts/types.ts (contract DSL — reused as-is)
docs/roadmap/portability-harness-contract.md (P1–P5; this issue trivially complies via tests-only scope)
Curated scope, overlap handling, and verification checklist
Scope classification
read_page, act, interact.
Overlap and conflict resolution