test(bench): WebVoyager-style contract-eval benchmark — 10 tasks, mock+real adapters, no src/ changes #851

@shaun0927

Tier: tests-only (tests/benchmark/); zero changes under src/** → P1–P5 trivially preserved
PR target: develop

Background

tests/benchmark/ (benchmark-runner.ts, adapters/openchrome-real-adapter.ts, tasks/*.ts) measures mechanical performance (call counts, byte lengths, latency) over synthetic local fixtures. It does not measure task success: whether an LLM agent driving openchrome actually completes a real web task.

notte publicly claims WebVoyager30: 86.2% self-eval, 79.0% LLM-eval, 47s/task. browser-use is reported at 113s/task. These numbers are credible but self-reported; the 7.2-percentage-point gap between self-eval and LLM-eval exposes the weakness of LLM judging.

OpenChrome's distinguishing claim is verifiable execution via Outcome Contracts (src/contracts/). This issue makes that claim falsifiable on real-web tasks:

  1. Run an agent against real public websites using tests/benchmark/adapters/openchrome-real-adapter.ts.
  2. Score success via src/contracts/evaluate.ts (URL / DOM / network / screenshot postconditions) — not via LLM judging. Eliminates the self-vs-LLM-eval gap by construction.
  3. Publish a single comparable number: "WebVoyager contract-eval score = X / N tasks passed".

Why this is necessary (not nice-to-have)

  • The outcome-contracts label is openchrome's core differentiator. There is currently no public number demonstrating contracts work on real-web tasks at scale.
  • Without this benchmark, "more reliable than browser-use" claims are unfalsifiable.
  • A failing benchmark on a PR is a strong regression gate that unit tests cannot replicate (they test code, not agent behavior).
  • Contract-eval scoring replaces the LLM judge with deterministic postcondition checks, so the self-eval vs. LLM-eval gap (7.2 percentage points in notte's numbers) cannot arise.

Proposed Implementation

Phase 1 (this issue): 10 tasks + harness

  1. New directory: tests/benchmark/webvoyager/

    • tasks/ — 10 TypeScript task specs (see task list below)
    • runner.ts — orchestrator: spawns openchrome MCP server, hands LLM adapter the instruction, runs to completion or timeout, evaluates contract via src/contracts/evaluate.ts, records JSON report
    • report.ts — emits Markdown table at tests/benchmark/webvoyager/reports/<git-sha>.md
    • baseline.json — committed minimum score; raised over time, never lowered
    • llm/claude-adapter.ts — Anthropic API adapter (opt-in via ANTHROPIC_API_KEY)
    • llm/mock-adapter.ts — deterministic transcript replay (CI default, recorded transcripts in transcripts/)
  2. Task contract format (reuses existing DSL from src/contracts/types.ts: url | dom_text | dom_count | network | screenshot_class | no_dialog, plus and | or | not; no new operators):

    {
      name: 'task-01-example-com-title',
      instruction: 'Visit https://example.com and report the page title.',
      contract: {
        postconditions: {
          kind: 'and',
          operands: [
            { kind: 'url', equals: 'https://example.com/' },
            { kind: 'dom_text', selector: 'h1', contains: 'Example Domain' },
            { kind: 'dom_count', selector: 'h1', op: 'gte', value: 1 }
          ]
        }
      },
      timeout_ms: 60_000
    }

    The exact field names follow src/contracts/types.ts as of develop HEAD; the runner does NOT introduce new operators — if a task needs one, the task is rejected at PR review.
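
To make the scoring step concrete, here is a minimal sketch of how a postcondition tree like the one above could be evaluated against a captured page state. It is not the real src/contracts/evaluate.ts API: the `PageSnapshot` shape, the `evaluatePostcondition` name, the `lte` / `eq` comparison ops, and the `operand` field on `not` are all assumptions made for illustration, and only url, dom_text, dom_count, and the three combinators are modeled.

```typescript
// Hypothetical sketch; NOT the real src/contracts/evaluate.ts API.
// Only 'gte' appears in the issue; 'lte' and 'eq' are assumed here.
type Postcondition =
  | { kind: 'url'; equals: string }
  | { kind: 'dom_text'; selector: string; contains: string }
  | { kind: 'dom_count'; selector: string; op: 'gte' | 'lte' | 'eq'; value: number }
  | { kind: 'and' | 'or'; operands: Postcondition[] }
  | { kind: 'not'; operand: Postcondition }; // field name `operand` is assumed

// Assumed snapshot: final URL plus the text of every element matched per selector.
interface PageSnapshot {
  url: string;
  textBySelector: Record<string, string[]>;
}

function evaluatePostcondition(pc: Postcondition, snap: PageSnapshot): boolean {
  switch (pc.kind) {
    case 'url':
      return snap.url === pc.equals;
    case 'dom_text':
      return (snap.textBySelector[pc.selector] ?? []).some((t) => t.includes(pc.contains));
    case 'dom_count': {
      const n = (snap.textBySelector[pc.selector] ?? []).length;
      return pc.op === 'gte' ? n >= pc.value : pc.op === 'lte' ? n <= pc.value : n === pc.value;
    }
    case 'and':
      return pc.operands.every((p) => evaluatePostcondition(p, snap));
    case 'or':
      return pc.operands.some((p) => evaluatePostcondition(p, snap));
    case 'not':
      return !evaluatePostcondition(pc.operand, snap);
  }
}

// The task-01 contract from the example above, checked against a matching snapshot.
const task01Postconditions: Postcondition = {
  kind: 'and',
  operands: [
    { kind: 'url', equals: 'https://example.com/' },
    { kind: 'dom_text', selector: 'h1', contains: 'Example Domain' },
    { kind: 'dom_count', selector: 'h1', op: 'gte', value: 1 },
  ],
};
const task01Snapshot: PageSnapshot = {
  url: 'https://example.com/',
  textBySelector: { h1: ['Example Domain'] },
};
console.log(evaluatePostcondition(task01Postconditions, task01Snapshot)); // true
```

Because every leaf is a pure predicate over the snapshot, the same transcript always yields the same verdict, which is the property the mock-mode CI gate relies on.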

  3. Phase-1 task list (10 tasks, login-free, public, content stable for ≥ 5 years):

    • task-01-example-com-title — title h1 of example.com is "Example Domain"
    • task-02-mdn-fetch-syntax — MDN page for fetch() contains the literal fetch(resource) syntax line
    • task-03-wikipedia-eiffel-height — en.wikipedia.org Eiffel_Tower article infobox contains height "330 m" (verified via dom_text with contains; if Wikipedia rounds in future, contract uses or-of-acceptable strings committed in the task spec)
    • task-04-rfc-9110-section-9-title — RFC 9110 §9 title is "Methods" (immutable RFC)
    • task-05-w3c-html-section-definition — html.spec.whatwg.org <section> element page contains "represents a generic section"
    • task-06-arxiv-2401-13919-abstract — arxiv.org/abs/2401.13919 page contains author list "Hongliang He"
    • task-07-rust-string-trim-method — doc.rust-lang.org/std/string/struct.String.html trim method link reaches a page whose URL ends str.html#method.trim
    • task-08-mdn-array-map-return — MDN Array.prototype.map page contains "A new array with each element being the result"
    • task-09-wikipedia-speed-of-light — Wikipedia Speed_of_light page contains "299,792,458"
    • task-10-tc39-ecma262-strict-mode — tc39.es/ecma262/ page reachable; URL after navigation matches ^https://tc39\.es/ecma262/

    Selection criteria (committed in tasks/README.md):

    • Anonymous (logged-out) public access
    • Content immutable or change-cycle ≥ 5 years (versioned specs, encyclopedia entries about historical facts)
    • No captcha / no geofencing / no payment
    • Avoids any contract requiring an operator not in src/contracts/types.ts
    • No live-updating numbers (intentionally excludes GitHub star counts, HN top story, npm latest version)

    Brittleness mitigation: tasks with low-risk drift (e.g., Wikipedia phrasing) use or-of-acceptable strings inside the and-contract; rationale documented per task. If a task contract starts failing for non-openchrome reasons (upstream rewrite), the PR fixing it edits the task spec and re-records the transcript.
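
As an illustration of the or-of-acceptable-strings idea, the sketch below builds the task-03 contract with a small helper. The `anyOfText` helper, the `.infobox` selector, the instruction wording, and the second accepted string are hypothetical; only the task name and the "330 m" string come from the task list above.

```typescript
// Leaf and combinator shapes reuse the field names from the task-01 example.
type DomText = { kind: 'dom_text'; selector: string; contains: string };
type OrClause = { kind: 'or'; operands: DomText[] };

// Hypothetical helper: one `or` clause that passes if any accepted string is present.
function anyOfText(selector: string, accepted: string[]): OrClause {
  return {
    kind: 'or',
    operands: accepted.map((contains): DomText => ({ kind: 'dom_text', selector, contains })),
  };
}

// Illustrative task spec. Rationale for the alternative: the two strings differ
// only in the space character (regular vs. non-breaking), a low-risk markup drift.
const task03 = {
  name: 'task-03-wikipedia-eiffel-height',
  instruction: 'Open the English Wikipedia article on the Eiffel Tower and report its height.',
  contract: {
    postconditions: {
      kind: 'and',
      operands: [
        { kind: 'url', equals: 'https://en.wikipedia.org/wiki/Eiffel_Tower' },
        anyOfText('.infobox', ['330 m', '330\u00a0m']),
      ],
    },
  },
  timeout_ms: 60_000,
};
```

Keeping the accepted strings in the committed task spec means any widening of a contract shows up in PR review rather than happening silently at run time.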

  4. LLM adapter abstraction:

    • claude-adapter.ts — Anthropic Messages API, hard caps: max_tokens: 4096 per turn, max_tool_iterations: 50 per task, max_usd_per_task: 0.50 (computed from response usage; aborts the task with BUDGET_EXCEEDED if exceeded). Caps live in llm/budget.ts, configurable per task but never bypassable.
    • mock-adapter.ts — replays recorded transcripts/<task-name>.jsonl deterministically; each entry is {tool, args_digest_sha256, response_kind}. On replay, the adapter intercepts the LLM-step boundary, looks up the next expected tool-call by sequence number, and emits the recorded openchrome tool call directly. Drift (LLM model would have called a different tool) is not silently tolerated: the replay assertion fails, the task is reported as replay_drift, and the issue calls for re-recording.
    • Transcript lifecycle: transcripts are recorded once by running claude-adapter against the real API, manually reviewed (PR review must include a transcript snippet), then frozen as fixtures. Any change to a task spec OR a meaningful model behavior change requires explicit re-recording — PR title must include [transcript-rerecord: <task names>].
    • Adapter chosen via env: OPENCHROME_BENCH_ADAPTER=mock (default) or claude
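
The replay assertion described above could be sketched roughly as follows. The entry fields (`tool`, `args_digest_sha256`, `response_kind`) are taken from the issue text; everything else is an assumption, including the `checkStep` and `parseTranscript` names and the choice to digest arguments as SHA-256 over their plain JSON encoding (a real recorder would likely need a canonical, key-sorted serialization).

```typescript
import { createHash } from 'node:crypto';

// One recorded step, using the fields named in the issue.
interface TranscriptEntry {
  tool: string;
  args_digest_sha256: string;
  response_kind: string;
}

type ReplayStep =
  | { status: 'ok'; entry: TranscriptEntry }
  | { status: 'replay_drift'; seq: number; expected: string; actual: string };

// Assumption: digest = SHA-256 over JSON.stringify(args); key order is not normalized.
function digestArgs(args: unknown): string {
  return createHash('sha256').update(JSON.stringify(args ?? null)).digest('hex');
}

// Parses a transcripts/<task-name>.jsonl payload: one JSON object per non-empty line.
function parseTranscript(jsonl: string): TranscriptEntry[] {
  return jsonl
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as TranscriptEntry);
}

// Compares the attempted call at sequence number `seq` with the recorded entry.
// Any mismatch (different tool, different args, or running past the end) is drift.
function checkStep(entries: TranscriptEntry[], seq: number, tool: string, args: unknown): ReplayStep {
  const expected = entries[seq];
  if (expected === undefined) {
    return { status: 'replay_drift', seq, expected: '<end of transcript>', actual: tool };
  }
  if (expected.tool !== tool || expected.args_digest_sha256 !== digestArgs(args)) {
    return { status: 'replay_drift', seq, expected: expected.tool, actual: tool };
  }
  return { status: 'ok', entry: expected };
}
```

Returning a tagged result instead of throwing lets the runner record replay_drift as a per-task outcome while still failing the overall run.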
  5. CI gating:

    • npm run bench:webvoyager:mock runs in CI on every PR
    • Pass condition (strict, not score-based): every replay must succeed and emit task_passed. A single replay_drift or contract failure fails CI. The baseline.json thus stores expected_pass_count: 10 (full set), not a soft threshold — this prevents bootstrapping at 0/10 from being "passing".
    • Real-LLM run gated behind OPENCHROME_BENCH_REAL=1 ANTHROPIC_API_KEY=...; not run in CI; runbook in docs/benchmarks/webvoyager.md includes total-spend estimate (≤ $5 for the 10-task suite with the budget caps above)
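
A minimal sketch of the strict pass condition, under assumed names: `TaskResult`, `Baseline`, and `gate` are hypothetical, and the `contract_failed` status label is invented for illustration (the issue only names task_passed and replay_drift).

```typescript
// 'task_passed' and 'replay_drift' come from the issue; 'contract_failed' is an assumed label.
type TaskStatus = 'task_passed' | 'replay_drift' | 'contract_failed';

interface TaskResult {
  task: string;
  status: TaskStatus;
}

// Mirrors the single field the issue says baseline.json stores.
interface Baseline {
  expected_pass_count: number;
}

// Strict gate: every task must pass AND the pass count must equal the baseline,
// so an empty or truncated run (for example 0 of 0) cannot count as passing.
function gate(results: TaskResult[], baseline: Baseline): { ok: boolean; failures: string[] } {
  const failures = results
    .filter((r) => r.status !== 'task_passed')
    .map((r) => `${r.task}: ${r.status}`);
  const passed = results.length - failures.length;
  if (passed !== baseline.expected_pass_count) {
    failures.push(`pass count ${passed} != expected ${baseline.expected_pass_count}`);
  }
  return { ok: failures.length === 0, failures };
}
```

A CI script would typically print `failures` and exit non-zero when `ok` is false, which surfaces the failing task names in the log.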
  6. Report contents (reports/<git-sha>.md):

    • Header: git sha, adapter, total tasks, pass count, contract-eval score
    • Per-task row: name | result | duration_ms | tool_calls | response_bytes | failed_postcondition (if any)
    • Comparison footer: notte's published 86.2% / 47s; openchrome's number; note that contracts are stricter than LLM-eval
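
The header and per-task rows could be rendered along these lines. This is a sketch only: `TaskRow` and `renderReport` are hypothetical names, the column order follows the per-task row listed above, and the comparison footer is left out.

```typescript
// Hypothetical row shape; columns follow the per-task row described in the issue.
interface TaskRow {
  task: string;
  result: 'passed' | 'failed';
  duration_ms: number;
  tool_calls: number;
  response_bytes: number;
  failed_postcondition?: string; // only set when result is 'failed'
}

// Builds the Markdown body: header lines first, then one table row per task.
function renderReport(gitSha: string, adapter: string, rows: TaskRow[]): string {
  const passCount = rows.filter((r) => r.result === 'passed').length;
  const lines = [
    `git sha: ${gitSha}`,
    `adapter: ${adapter}`,
    `total tasks: ${rows.length}`,
    `pass count: ${passCount}`,
    `contract-eval score: ${passCount} / ${rows.length}`,
    '',
    '| name | result | duration_ms | tool_calls | response_bytes | failed_postcondition |',
    '| --- | --- | ---: | ---: | ---: | --- |',
  ];
  for (const r of rows) {
    lines.push(
      `| ${r.task} | ${r.result} | ${r.duration_ms} | ${r.tool_calls} | ${r.response_bytes} | ${r.failed_postcondition ?? ''} |`,
    );
  }
  return lines.join('\n');
}
```

Since duration_ms varies between runs, a determinism check over this output would need to normalize that column first.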

Phase 2 (follow-up, separate issue): expand to 30 tasks, multi-LLM adapters

Not in scope here.

Acceptance Criteria

  • 10 task files committed; each contract validates against src/contracts/types.ts
  • npm run bench:webvoyager:mock runs to completion in ≤ 3 minutes, deterministic across re-runs
  • Mock transcripts committed for all 10 tasks (each is a fixture in transcripts/<task>.jsonl)
  • claude-adapter.ts exists and runs against real Claude API when keys are set (documented in runbook)
  • Report emits Markdown + JSON to reports/<git-sha>.{md,json}
  • baseline.json committed with the bootstrap score (expected_pass_count: 10, per the CI gating rule)
  • Zero changes under src/**, verified by CI step: git diff --quiet origin/develop...HEAD -- src/ (exits non-zero if any file under src/ changed)
  • No new runtime dependencies in package.json — only devDependencies allowed (e.g., @anthropic-ai/sdk)
  • docs/benchmarks/webvoyager.md runbook authored
  • CHANGELOG entry under "Tooling"
  • PR targets develop

Verification (post-merge, using openchrome MCP)

Scenario 1 — mock-mode reproducibility

npm run --silent bench:webvoyager:mock > /tmp/run1.json
npm run --silent bench:webvoyager:mock > /tmp/run2.json
diff <(jq -S 'del(.timestamp, .duration_ms)' /tmp/run1.json) \
     <(jq -S 'del(.timestamp, .duration_ms)' /tmp/run2.json)

Pass: empty diff (after timestamp / duration normalization).

Scenario 2 — contract evaluator is the sole judge

Hand-craft a corrupted transcript for task-01: same read_page calls but the recorded final URL is https://wrong.example/. Run mock mode.
Pass: runner reports task-01: failed | failed_postcondition: postconditions[0] (url match). The judge call is src/contracts/evaluate.ts (verifiable by stack trace in failure output).

Scenario 3 — real-LLM smoke on the trivial task

ANTHROPIC_API_KEY=sk-... OPENCHROME_BENCH_ADAPTER=claude OPENCHROME_BENCH_REAL=1 \
  npm run bench:webvoyager:real -- --task task-01-example-com-title

Pass: returns success: true, duration < 30s. Cost printed at end (estimated USD via response tokens). PR description records the observed value.
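
For the cost figure, a per-task budget tracker in the spirit of llm/budget.ts might look like the sketch below. `BudgetTracker` and `Pricing` are hypothetical names, and the per-million-token rates are placeholders to be filled in from current pricing, not real prices. The `input_tokens` / `output_tokens` fields mirror the `usage` object returned by the Anthropic Messages API.

```typescript
// Token counts per response; field names mirror the Messages API `usage` object.
interface Usage {
  input_tokens: number;
  output_tokens: number;
}

// Placeholder rates in USD per million tokens. NOT real prices; supply current ones.
interface Pricing {
  usdPerMillionInput: number;
  usdPerMillionOutput: number;
}

class BudgetTracker {
  private spentUsd = 0;
  private readonly maxUsd: number;
  private readonly pricing: Pricing;

  constructor(maxUsd: number, pricing: Pricing) {
    this.maxUsd = maxUsd;
    this.pricing = pricing;
  }

  // Adds the cost of one response; throws BUDGET_EXCEEDED once the cap is passed.
  // Returns the running total so the runner can print it at the end of the task.
  record(usage: Usage): number {
    this.spentUsd +=
      (usage.input_tokens / 1_000_000) * this.pricing.usdPerMillionInput +
      (usage.output_tokens / 1_000_000) * this.pricing.usdPerMillionOutput;
    if (this.spentUsd > this.maxUsd) {
      throw new Error(`BUDGET_EXCEEDED: $${this.spentUsd.toFixed(4)} > $${this.maxUsd.toFixed(2)}`);
    }
    return this.spentUsd;
  }
}
```

With the issue's cap of max_usd_per_task: 0.50, ten tasks can spend at most 10 × $0.50 = $5.00, which is where the ≤ $5 suite estimate comes from.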

Scenario 4 — comparison number published

After Phase-1 real-LLM run (single run acceptable for v1, variance documented):
Pass: docs/benchmarks/webvoyager.md contains a populated table:

  • contract-eval score (X / 10)
  • median task duration (ms)
  • median tool-call count
  • notte's published numbers as a reference row
  • explicit note: "contract-eval is stricter than LLM-eval; numbers not directly comparable to notte's 86.2%"

Scenario 5 — CI regression gate

Open a no-op PR modifying only README.md. CI runs bench:webvoyager:mock.
Pass: gate passes. Then in a separate test commit, mutate one mock transcript so a contract fails; gate fails with the failing task name + contract clause shown in the CI log.

Scenario 6 — P1–P5 compliance audit

git diff --name-only origin/develop...HEAD | grep -E '^src/' | wc -l
git diff origin/develop...HEAD -- package.json | grep -E '^\+\s+"[^@]'  # added package.json entries with unscoped names; check by hand which section each falls under

Pass: first command outputs 0; second outputs only devDependencies entries (or empty). Audit captured in PR description.

Issue closure criteria

Scenarios 1–6 pass + CI green + report file committed for the PR's own git sha.

Out of scope (Phase 2 / follow-up)

  • Expand to 30 tasks (separate issue)
  • Multi-LLM adapter study (GPT-4o, Llama via Ollama) — separate issue
  • Vision-based / multi-modal eval — text-tree first
  • Login-required, payment, captcha-gated tasks — privacy / cost / determinism concerns
  • Statistical-significance multi-run study — variance is acknowledged but not yet quantified

Dependencies

References

  • WebVoyager paper (task selection methodology): https://arxiv.org/abs/2401.13919
  • notte open-operator-evals: https://github.com/nottelabs/open-operator-evals
  • tests/benchmark/benchmark-runner.ts (existing harness — extended, not replaced)
  • tests/benchmark/adapters/openchrome-real-adapter.ts (real MCP adapter already exists)
  • src/contracts/evaluate.ts (the judge)
  • src/contracts/types.ts (contract DSL — reused as-is)
  • docs/roadmap/portability-harness-contract.md (P1–P5; this issue trivially complies via tests-only scope)

Curated scope, overlap handling, and verification checklist

Scope classification

  • Canonical lane: test/benchmark harness and merge verification.
  • Primary deliverable: WebVoyager-style contract-eval benchmark — 10 tasks, mock+real adapters, no src/ changes.
  • Open PR: none currently linked in the active priority map; verify GitHub again before implementation.
  • Detected labels: enhancement, P1, observability, outcome-contracts.
  • Affected OpenChrome surfaces from issue text: read_page, act, interact.
  • Non-goal: shipping production behavior changes, relying on flaky external websites, or treating proxy metrics as complete validation.

Overlap and conflict resolution

  • Related issues found in the body: feat(core): formalize read_page refs as canonical interaction entrypoint #831. Keep this issue narrow and cross-link instead of duplicating their implementation.
  • Keep this issue aligned with OpenChrome's MCP/CDP-first, additive, deterministic-tool-server direction.
  • If an existing open PR already implements part of this scope, update that PR or mark the overlap explicitly before starting new work.
  • Do not absorb adjacent benchmark, dashboard, security, or skill-memory work unless the original issue text requires it.

Implementation checklist

  • Restate the exact contract for this benchmark (10 tasks, mock+real adapters, no src/ changes) in code/docs before changing behavior.
  • Add deterministic local fixtures and machine-readable result artifacts.
  • Measure the issue-specific success/failure, latency, payload, and evidence fields rather than only checking command exit code.
  • Document the command, expected artifacts, and pass/fail interpretation for merge verification.
  • Add regression coverage for the issue-specific happy path, failure path, default/disabled path, and artifact/output bounds.
  • Update user-facing docs or inline tool descriptions when hosts must choose a new flag, mode, policy, or workflow.

Success criteria

  • The implementation satisfies the primary deliverable without broadening into non-goals.
  • Existing default behavior remains backward-compatible or the issue explicitly documents the compatibility break.
  • Failure cases return bounded, actionable diagnostics rather than silent fallback or unbounded dumps.
  • Tests/benchmarks cover the concrete surface named in this issue, not only helper utilities.
  • Any produced artifact is deterministic, redacted, and small enough for merge review or stored behind handles.

Post-merge OpenChrome live verification checklist

  • Run the documented local OpenChrome fixture or smoke path for this benchmark (10 tasks, mock+real adapters, no src/ changes) and capture the exact command/tool calls.
  • Verify read_page behavior matches the issue goal in both the enabled path and the default/disabled compatibility path.
  • Inspect generated artifacts/logs/responses for bounded size, redaction, source links, and clear failure diagnostics.
  • Record sanitized output excerpts, artifact paths, and any benchmark/latency/payload numbers in merge verification notes.
