test(bench): WebVoyager-style contract-eval benchmark — 10 tasks, mock+real adapters, no src/ changes

**Tier**: tests-only (`tests/benchmark/`); zero changes under `src/**` → P1–P5 trivially preserved
**PR target**: `develop`

## Background

`tests/benchmark/` (`benchmark-runner.ts`, `adapters/openchrome-real-adapter.ts`, `tasks/*.ts`) measures **mechanical performance** (call counts, byte lengths, latency) over synthetic local fixtures. It does **not** measure **task success**: whether an LLM agent driving openchrome actually completes a real web task.

notte publicly claims WebVoyager30: 86.2% self-eval, 79.0% LLM-eval, 47s/task. browser-use is reported at 113s/task. These numbers are credible but self-reported; the 7.2-percentage-point gap between self-eval and LLM-eval (86.2% vs 79.0%) exposes the weakness of LLM judging.

OpenChrome's distinguishing claim is **verifiable execution via Outcome Contracts** (`src/contracts/`). This issue makes that claim falsifiable on real-web tasks:

1. Run an agent against real public websites using `tests/benchmark/adapters/openchrome-real-adapter.ts`.
2. Score success via `src/contracts/evaluate.ts` (URL / DOM / network / screenshot postconditions) — **not** via LLM judging. Eliminates the self-vs-LLM-eval gap by construction.
3. Publish a single comparable number: "WebVoyager contract-eval score = X / N tasks passed".

## Why this is necessary (not nice-to-have)

- The `outcome-contracts` label is openchrome's core differentiator. There is currently **no public number** demonstrating contracts work on real-web tasks at scale.
- Without this benchmark, "more reliable than browser-use" claims are unfalsifiable.
- A failing benchmark on a PR is a strong regression gate that unit tests cannot replicate (they test code, not agent behavior).
- Contract-eval scoring is intrinsically more rigorous than LLM-judge scoring: there is no judge to disagree with, so there is no equivalent of the 7.2-percentage-point self-vs-LLM-eval gap.

## Proposed Implementation

### Phase 1 (this issue): 10 tasks + harness

1. **New directory**: `tests/benchmark/webvoyager/`
   - `tasks/` — 10 TypeScript task specs (see task list below)
   - `runner.ts` — orchestrator: spawns openchrome MCP server, hands LLM adapter the `instruction`, runs to completion or timeout, evaluates `contract` via `src/contracts/evaluate.ts`, records JSON report
   - `report.ts` — emits Markdown table at `tests/benchmark/webvoyager/reports/<git-sha>.md`
   - `baseline.json` — committed minimum score; raised over time, never lowered
   - `llm/claude-adapter.ts` — Anthropic API adapter (opt-in via `ANTHROPIC_API_KEY`)
   - `llm/mock-adapter.ts` — deterministic transcript replay (CI default, recorded transcripts in `transcripts/`)

2. **Task contract format** (reuses existing DSL from `src/contracts/types.ts` — `url | dom_text | dom_count | network | screenshot_class | no_dialog` + `and | or | not`; **no new operators**):
   ```ts
   {
     name: 'task-01-example-com-title',
     instruction: 'Visit https://example.com and report the page title.',
     contract: {
       postconditions: {
         kind: 'and',
         operands: [
           { kind: 'url', equals: 'https://example.com/' },
           { kind: 'dom_text', selector: 'h1', contains: 'Example Domain' },
           { kind: 'dom_count', selector: 'h1', op: 'gte', value: 1 }
         ]
       }
     },
     timeout_ms: 60_000
   }
   ```
   The exact field names follow `src/contracts/types.ts` as of `develop` HEAD; the runner does NOT introduce new operators — if a task needs one, the task is rejected at PR review.
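
   The "no new operators" rule can also be checked mechanically before review. The following is a minimal sketch, not the project's validator: the `Condition` shape is a hypothetical stand-in for whatever `src/contracts/types.ts` exports, the `operand` field for `not` is an assumption, and `findUnknownKinds` and `dom_attr` are invented for illustration.

   ```ts
   // Hypothetical stand-in for the contract DSL; the real definitions live in
   // src/contracts/types.ts and may use different field names.
   type Condition = {
     kind: string;
     operands?: Condition[]; // assumed shape for `and` / `or`
     operand?: Condition;    // assumed shape for `not`
     [key: string]: unknown;
   };

   // Operator kinds named in this issue; anything else counts as "new".
   const ALLOWED_KINDS = new Set([
     "url", "dom_text", "dom_count", "network", "screenshot_class", "no_dialog",
     "and", "or", "not",
   ]);

   // Walks the condition tree and returns one message per node with a disallowed kind.
   function findUnknownKinds(node: Condition, path = "postconditions"): string[] {
     const problems: string[] = ALLOWED_KINDS.has(node.kind)
       ? []
       : [`${path}: unknown kind "${node.kind}"`];
     (node.operands ?? []).forEach((child, i) => {
       problems.push(...findUnknownKinds(child, `${path}.operands[${i}]`));
     });
     if (node.operand) {
       problems.push(...findUnknownKinds(node.operand, `${path}.operand`));
     }
     return problems;
   }

   const example: Condition = {
     kind: "and",
     operands: [
       { kind: "url", equals: "https://example.com/" },
       { kind: "dom_text", selector: "h1", contains: "Example Domain" },
       { kind: "dom_attr", selector: "h1", name: "id" }, // made-up kind, so it is flagged
     ],
   };

   // Reports a single problem, located at postconditions.operands[2].
   console.log(findUnknownKinds(example));
   ```

   A check like this would only complement the type-level validation against `src/contracts/types.ts`; it exists to give reviewers a readable path to the offending node.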

3. **Phase-1 task list (10 tasks, login-free, public, content stable for ≥ 5 years)**:
   - `task-01-example-com-title` — title h1 of example.com is "Example Domain"
   - `task-02-mdn-fetch-syntax` — MDN page for `fetch()` contains the literal `fetch(resource)` syntax line
   - `task-03-wikipedia-eiffel-height` — en.wikipedia.org Eiffel_Tower article infobox contains height "330 m" (verified via `dom_text` with `contains`; if Wikipedia rounds in future, contract uses `or`-of-acceptable strings committed in the task spec)
   - `task-04-rfc-9110-section-9-title` — RFC 9110 §9 title is "Methods" (immutable RFC)
   - `task-05-w3c-html-section-definition` — html.spec.whatwg.org `<section>` element page contains "represents a generic section"
   - `task-06-arxiv-2401-13919-abstract` — arxiv.org/abs/2401.13919 page contains author list "Hongliang He"
   - `task-07-rust-string-trim-method` — doc.rust-lang.org/std/string/struct.String.html `trim` method link reaches a page whose URL ends `str.html#method.trim`
   - `task-08-mdn-array-map-return` — MDN Array.prototype.map page contains "A new array with each element being the result"
   - `task-09-wikipedia-speed-of-light` — Wikipedia Speed_of_light page contains "299,792,458"
   - `task-10-tc39-ecma262-strict-mode` — tc39.es/ecma262/ page reachable; URL after navigation matches `^https://tc39\.es/ecma262/`

   Selection criteria (committed in `tasks/README.md`):
   - Anonymous (logged-out) public access
   - Content immutable or change-cycle ≥ 5 years (versioned specs, encyclopedia entries about historical facts)
   - No captcha / no geofencing / no payment
   - Avoids any contract requiring an operator not in `src/contracts/types.ts`
   - No live-updating numbers (intentionally **excludes** GitHub star counts, HN top story, npm latest version)

   **Brittleness mitigation**: tasks with low-risk drift (e.g., Wikipedia phrasing) use `or`-of-acceptable strings inside the `and`-contract; rationale documented per task. If a task contract starts failing for non-openchrome reasons (upstream rewrite), the PR fixing it edits the task spec and re-records the transcript.
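
   As an illustration of the `or`-of-acceptable-strings pattern, here is a sketch for the Eiffel Tower task. Everything in it is an assumption made for the example: the simplified `Cond` types, the in-memory `Snapshot` that stands in for a live page, the `.infobox` selector, and the non-breaking-space variant chosen as the second acceptable string. The real evaluation happens in `src/contracts/evaluate.ts`, not in this toy `holds` function.

   ```ts
   // Hypothetical, simplified condition shapes: only what this example needs.
   type DomText = { kind: "dom_text"; selector: string; contains: string };
   type AndCond = { kind: "and"; operands: Cond[] };
   type OrCond = { kind: "or"; operands: Cond[] };
   type Cond = DomText | AndCond | OrCond;

   // Stand-in for a page: maps a selector to the text it matched.
   type Snapshot = Record<string, string>;

   // Toy evaluator used only to show how `or` nests inside `and`.
   function holds(cond: Cond, page: Snapshot): boolean {
     switch (cond.kind) {
       case "dom_text":
         return (page[cond.selector] ?? "").includes(cond.contains);
       case "and":
         return cond.operands.every((c) => holds(c, page));
       case "or":
         return cond.operands.some((c) => holds(c, page));
     }
   }

   const eiffelHeight: Cond = {
     kind: "and",
     operands: [
       { kind: "dom_text", selector: "h1", contains: "Eiffel Tower" },
       {
         kind: "or",
         operands: [
           { kind: "dom_text", selector: ".infobox", contains: "330 m" },
           // Illustrative alternate: same value with a non-breaking space.
           { kind: "dom_text", selector: ".infobox", contains: "330\u00a0m" },
         ],
       },
     ],
   };

   const page: Snapshot = { h1: "Eiffel Tower", ".infobox": "Height 330\u00a0m" };
   console.log(holds(eiffelHeight, page));
   ```

   Keeping the alternates inside the task spec means a reviewer can see exactly which renderings were judged acceptable, rather than discovering them in evaluator code.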

4. **LLM adapter abstraction**:
   - `claude-adapter.ts` — Anthropic Messages API, hard caps: `max_tokens: 4096` per turn, `max_tool_iterations: 50` per task, `max_usd_per_task: 0.50` (computed from response usage; aborts the task with `BUDGET_EXCEEDED` if exceeded). Caps live in `llm/budget.ts`, configurable per task but never bypassable.
   - `mock-adapter.ts` — replays recorded `transcripts/<task-name>.jsonl` deterministically; each entry is `{tool, args_digest_sha256, response_kind}`. On replay, the adapter intercepts the LLM-step boundary, looks up the next expected tool call by sequence number, and emits the recorded openchrome tool call directly. Drift (the model would have called a different tool) is **not silently tolerated**: the replay assertion fails, the task is reported as `replay_drift`, and the transcript must be re-recorded.
   - **Transcript lifecycle**: transcripts are recorded once by running `claude-adapter` against the real API, manually reviewed (PR review must include a transcript snippet), then frozen as fixtures. Any change to a task spec OR a meaningful model behavior change requires explicit re-recording — PR title must include `[transcript-rerecord: <task names>]`.
   - Adapter chosen via env: `OPENCHROME_BENCH_ADAPTER=mock` (default) or `claude`
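
   The replay assertion described above could look roughly like the sketch below. The entry shape `{tool, args_digest_sha256, response_kind}` comes from this issue; everything else is an assumption: that the digest is SHA-256 over key-sorted JSON, that drift is detected by comparing an observed call sequence against the recorded one, and the names `canonicalJson`, `digestArgs`, `ObservedCall`, and `checkReplay`.

   ```ts
   import { createHash } from "node:crypto";

   // Entry shape taken from this issue.
   interface TranscriptEntry { tool: string; args_digest_sha256: string; response_kind: string }
   // Hypothetical shape for a call seen during replay.
   interface ObservedCall { tool: string; args: unknown; response_kind: string }
   type ReplayResult =
     | { status: "ok" }
     | { status: "replay_drift"; seq: number; reason: string };

   // Assumption: arguments are serialized with sorted keys so key order cannot cause drift.
   function canonicalJson(value: unknown): string {
     if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
     if (value !== null && typeof value === "object") {
       const obj = value as Record<string, unknown>;
       const body = Object.keys(obj).sort().map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`);
       return `{${body.join(",")}}`;
     }
     return JSON.stringify(value);
   }

   function digestArgs(args: unknown): string {
     return createHash("sha256").update(canonicalJson(args)).digest("hex");
   }

   // Compares calls by sequence number and stops at the first mismatch.
   function checkReplay(expected: TranscriptEntry[], observed: ObservedCall[]): ReplayResult {
     const steps = Math.max(expected.length, observed.length);
     for (let seq = 0; seq < steps; seq++) {
       const want = expected[seq];
       const got = observed[seq];
       if (!want || !got) return { status: "replay_drift", seq, reason: "call count differs" };
       if (want.tool !== got.tool) return { status: "replay_drift", seq, reason: `tool ${got.tool} != ${want.tool}` };
       if (want.args_digest_sha256 !== digestArgs(got.args)) return { status: "replay_drift", seq, reason: "argument digest differs" };
       if (want.response_kind !== got.response_kind) return { status: "replay_drift", seq, reason: "response kind differs" };
     }
     return { status: "ok" };
   }

   const recorded: TranscriptEntry[] = [
     { tool: "read_page", args_digest_sha256: digestArgs({ url: "https://example.com/" }), response_kind: "ok" },
   ];
   console.log(checkReplay(recorded, [{ tool: "read_page", args: { url: "https://example.com/" }, response_kind: "ok" }]));
   ```

   Storing only a digest keeps transcripts small and avoids committing page content, at the cost of less readable failures; the `reason` string is there to compensate for that.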

5. **CI gating**:
   - `npm run bench:webvoyager:mock` runs in CI on every PR
   - Pass condition (strict, not score-based): **every replay must succeed and emit `task_passed`**. A single `replay_drift` or contract failure fails CI. The `baseline.json` thus stores `expected_pass_count: 10` (full set), not a soft threshold — this prevents bootstrapping at 0/10 from being "passing".
   - Real-LLM run gated behind `OPENCHROME_BENCH_REAL=1 ANTHROPIC_API_KEY=...`; not run in CI; runbook in `docs/benchmarks/webvoyager.md` includes total-spend estimate (≤ $5 for the 10-task suite with the budget caps above)
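
   The strict pass condition can be expressed as a small pure function. This is a sketch under assumptions: `task_passed`, `replay_drift`, and `expected_pass_count` come from this issue, while the `task_failed` status, the `TaskResult` shape, and the `gate` function name are hypothetical.

   ```ts
   // `task_failed` is an assumed name for a contract failure; the other two statuses are from this issue.
   type TaskStatus = "task_passed" | "task_failed" | "replay_drift";
   interface TaskResult { name: string; status: TaskStatus }
   interface Baseline { expected_pass_count: number }

   // Strict gate: any non-passing task fails, and so does a pass count that differs
   // from the baseline (for example, when a task silently drops out of the suite).
   function gate(results: TaskResult[], baseline: Baseline): { ok: boolean; failures: string[] } {
     const failures = results
       .filter((r) => r.status !== "task_passed")
       .map((r) => `${r.name}: ${r.status}`);
     const passed = results.length - failures.length;
     if (passed !== baseline.expected_pass_count) {
       failures.push(`pass count ${passed} != expected ${baseline.expected_pass_count}`);
     }
     return { ok: failures.length === 0, failures };
   }

   const demo = gate(
     [
       { name: "task-01-example-com-title", status: "task_passed" },
       { name: "task-02-mdn-fetch-syntax", status: "replay_drift" },
     ],
     { expected_pass_count: 2 },
   );
   console.log(demo.ok, demo.failures);
   ```

   A CI entry point would then set a non-zero exit code when `ok` is false and print `failures`, so the failing task name appears in the log.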

6. **Report contents** (`reports/<git-sha>.md`):
   - Header: git sha, adapter, total tasks, pass count, contract-eval score
   - Per-task row: name | result | duration_ms | tool_calls | response_bytes | failed_postcondition (if any)
   - Comparison footer: notte's published 86.2% / 47s; openchrome's number; note that contracts are stricter than LLM-eval
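
   A possible shape for the Markdown emitter is sketched below. The column names follow the per-task row listed above; the `TaskRow` interface, the `renderReport` function, the heading text, and the header-line layout are assumptions, and the comparison footer is left out for brevity.

   ```ts
   interface TaskRow {
     name: string;
     result: "passed" | "failed";
     duration_ms: number;
     tool_calls: number;
     response_bytes: number;
     failed_postcondition?: string;
   }

   // Escape pipes so a cell value cannot break the Markdown table.
   const cell = (value: string | number | undefined): string =>
     value === undefined ? "" : String(value).replace(/\|/g, "\\|");

   function renderReport(gitSha: string, adapter: string, rows: TaskRow[]): string {
     const passed = rows.filter((r) => r.result === "passed").length;
     const header = [
       "# WebVoyager contract-eval report",
       "",
       `git sha: ${gitSha}, adapter: ${adapter}, tasks: ${rows.length}, passed: ${passed}, score: ${passed}/${rows.length}`,
       "",
       "| name | result | duration_ms | tool_calls | response_bytes | failed_postcondition |",
       "| --- | --- | --- | --- | --- | --- |",
     ];
     const body = rows.map((r) => {
       const cells = [r.name, r.result, r.duration_ms, r.tool_calls, r.response_bytes, r.failed_postcondition];
       return `| ${cells.map(cell).join(" | ")} |`;
     });
     return [...header, ...body].join("\n");
   }

   console.log(
     renderReport("abc1234", "mock", [
       { name: "task-01-example-com-title", result: "passed", duration_ms: 1200, tool_calls: 3, response_bytes: 2048 },
     ]),
   );
   ```

   Rendering from a typed row array makes it straightforward to emit the JSON report from the same data, which keeps the two artifacts from drifting apart.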

### Phase 2 (follow-up, separate issue): expand to 30 tasks, multi-LLM adapters

Not in scope here.

## Acceptance Criteria

- [ ] 10 task files committed; each `contract` validates against `src/contracts/types.ts`
- [ ] `npm run bench:webvoyager:mock` runs to completion in ≤ 3 minutes, deterministic across re-runs
- [ ] Mock transcripts committed for all 10 tasks (each is a fixture in `transcripts/<task>.jsonl`)
- [ ] `claude-adapter.ts` exists and runs against real Claude API when keys are set (documented in runbook)
- [ ] Report emits Markdown + JSON to `reports/<git-sha>.{md,json}`
- [ ] `baseline.json` committed with `expected_pass_count: 10` (the full set, per the CI gating rule)
- [ ] **Zero changes under `src/**`**, verified by a CI step that fails when the diff is non-empty: `test -z "$(git diff --name-only origin/develop...HEAD -- src/)"`
- [ ] **No new runtime `dependencies`** in `package.json` — only `devDependencies` allowed (e.g., `@anthropic-ai/sdk`)
- [ ] `docs/benchmarks/webvoyager.md` runbook authored
- [ ] CHANGELOG entry under "Tooling"
- [ ] PR targets `develop`

## Verification (post-merge, using openchrome MCP)

### Scenario 1 — mock-mode reproducibility
```bash
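# --silent keeps npm's own banner lines out of the JSON written to stdout
npm run --silent bench:webvoyager:mock > /tmp/run1.json
npm run --silent bench:webvoyager:mock > /tmp/run2.json
# Strip timestamp / duration_ms at every nesting level before comparing
diff <(jq -S 'del(.. | .timestamp?, .duration_ms?)' /tmp/run1.json) \
     <(jq -S 'del(.. | .timestamp?, .duration_ms?)' /tmp/run2.json)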
```
**Pass**: empty diff (after timestamp / duration normalization).

### Scenario 2 — contract evaluator is the sole judge
Hand-craft a corrupted transcript for `task-01`: same `read_page` calls but the recorded final URL is `https://wrong.example/`. Run mock mode.
**Pass**: runner reports `task-01: failed | failed_postcondition: postconditions[0] (url match)`. The judge call is `src/contracts/evaluate.ts` (verifiable by stack trace in failure output).

### Scenario 3 — real-LLM smoke on the trivial task
```bash
ANTHROPIC_API_KEY=sk-... OPENCHROME_BENCH_ADAPTER=claude OPENCHROME_BENCH_REAL=1 \
  npm run bench:webvoyager:real -- --task task-01-example-com-title
```
**Pass**: returns `success: true`, duration < 30s. Cost printed at end (estimated USD via response tokens). PR description records the observed value.
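
The cost figure and the `max_usd_per_task: 0.50` cap can both be derived from per-response token usage. The sketch below assumes the `input_tokens` / `output_tokens` fields reported in Anthropic Messages API responses; the `Pricing` values are placeholders rather than real prices, and `estimateUsd` and `createBudget` are hypothetical names, not the contents of `llm/budget.ts`.

```ts
// Token counts as reported in a Messages API response's usage object.
interface Usage { input_tokens: number; output_tokens: number }
// Placeholder rates in USD per million tokens; real values must come from current pricing.
interface Pricing { usdPerMillionInput: number; usdPerMillionOutput: number }

function estimateUsd(usage: Usage, pricing: Pricing): number {
  return (
    usage.input_tokens * pricing.usdPerMillionInput +
    usage.output_tokens * pricing.usdPerMillionOutput
  ) / 1_000_000;
}

// Accumulates spend across turns and aborts once the per-task cap is exceeded.
function createBudget(maxUsd: number, pricing: Pricing) {
  let spentUsd = 0;
  return {
    record(usage: Usage): number {
      spentUsd += estimateUsd(usage, pricing);
      if (spentUsd > maxUsd) {
        throw new Error(`BUDGET_EXCEEDED: ${spentUsd.toFixed(4)} USD > ${maxUsd} USD`);
      }
      return spentUsd;
    },
  };
}

const placeholderPricing: Pricing = { usdPerMillionInput: 1, usdPerMillionOutput: 5 };
const budget = createBudget(0.5, placeholderPricing);
const spent = budget.record({ input_tokens: 100_000, output_tokens: 20_000 });
console.log(`estimated cost so far: $${spent.toFixed(2)}`);
```

With a cap of 0.50 USD per task, the 10-task suite is bounded at 10 × 0.50 = 5 USD, which matches the total-spend estimate in the runbook requirement.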

### Scenario 4 — comparison number published
After Phase-1 real-LLM run (single run acceptable for v1, variance documented):
**Pass**: `docs/benchmarks/webvoyager.md` contains a populated table:
- contract-eval score (X / 10)
- median task duration (ms)
- median tool-call count
- notte's published numbers as a reference row
- explicit note: "contract-eval is stricter than LLM-eval; numbers not directly comparable to notte's 86.2%"
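
For the two median rows, note that a 10-task suite has an even number of samples, so the median is the mean of the two middle values. A minimal sketch follows; the `median` helper and the sample durations are illustrative, not measured results.

```ts
// Median of a non-empty list; for an even count, averages the two middle values.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Made-up durations in milliseconds, only to show the even-count case.
const sampleDurations = [900, 1200, 1500, 4000];
console.log(median(sampleDurations));
```

Median is preferred over mean here because a single task that runs to its timeout would otherwise dominate the reported duration.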

### Scenario 5 — CI regression gate
Open a no-op PR modifying only `README.md`. CI runs `bench:webvoyager:mock`.
**Pass**: gate passes. Then in a separate test commit, mutate one mock transcript so a contract fails; gate fails with the failing task name + contract clause shown in the CI log.

### Scenario 6 — P1–P5 compliance audit
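```bash
# Count of changed files under src/ (expected: 0)
git diff --name-only origin/develop...HEAD | grep -E '^src/' | wc -l
# Runtime dependencies must be identical on both sides (expected: empty diff)
diff <(git show origin/develop:package.json | jq -S '.dependencies // {}') \
     <(git show HEAD:package.json | jq -S '.dependencies // {}')
```
**Pass**: first command outputs `0`; second produces an empty diff (additions under `devDependencies` are allowed and are not inspected). Audit captured in PR description.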

### Issue closure criteria
Scenarios 1–6 pass + CI green + report file committed for the PR's own git sha.

## Out of scope (Phase 2 / follow-up)
- Expand to 30 tasks (separate issue)
- Multi-LLM adapter study (GPT-4o, Llama via Ollama) — separate issue
- Vision-based / multi-modal eval — text-tree first
- Login-required, payment, captcha-gated tasks — privacy / cost / determinism concerns
- Statistical-significance multi-run study — variance is acknowledged but not yet quantified

## Dependencies
- Optional: issue 1 (`semantic` mode) — runner could choose `read_page` mode per task to compare token efficiency; not required
- Optional: #831 (refs) — agent can use `ref`-based interaction; not required
- No blocker

## References
- WebVoyager paper (task selection methodology): https://arxiv.org/abs/2401.13919
- notte open-operator-evals: https://github.com/nottelabs/open-operator-evals
- `tests/benchmark/benchmark-runner.ts` (existing harness — extended, not replaced)
- `tests/benchmark/adapters/openchrome-real-adapter.ts` (real MCP adapter already exists)
- `src/contracts/evaluate.ts` (the judge)
- `src/contracts/types.ts` (contract DSL — reused as-is)
- `docs/roadmap/portability-harness-contract.md` (P1–P5; this issue trivially complies via tests-only scope)



## Curated scope, overlap handling, and verification checklist

### Scope classification
- **Canonical lane:** test/benchmark harness and merge verification.
- **Primary deliverable:** WebVoyager-style contract-eval benchmark — 10 tasks, mock+real adapters, no src/ changes.
- **Open PR:** none currently linked in the active priority map; verify GitHub again before implementation.
- **Detected labels:** enhancement, P1, observability, outcome-contracts.
- **Affected OpenChrome surfaces from issue text:** `read_page`, `act`, `interact`.
- **Non-goal:** shipping production behavior changes, relying on flaky external websites, or treating proxy metrics as complete validation.

### Overlap and conflict resolution
- [ ] Related issues found in the body: #831. Keep this issue narrow and cross-link instead of duplicating their implementation.
- [ ] Keep this issue aligned with OpenChrome's MCP/CDP-first, additive, deterministic-tool-server direction.
- [ ] If an existing open PR already implements part of this scope, update that PR or mark the overlap explicitly before starting new work.
- [ ] Do not absorb adjacent benchmark, dashboard, security, or skill-memory work unless the original issue text requires it.

### Implementation checklist
- [ ] Restate the exact contract for **WebVoyager-style contract-eval benchmark — 10 tasks, mock+real adapters, no src/ changes** in code/docs before changing behavior.
- [ ] Add deterministic local fixtures and machine-readable result artifacts.
- [ ] Measure the issue-specific success/failure, latency, payload, and evidence fields rather than only checking command exit code.
- [ ] Document the command, expected artifacts, and pass/fail interpretation for merge verification.
- [ ] Add regression coverage for the issue-specific happy path, failure path, default/disabled path, and artifact/output bounds.
- [ ] Update user-facing docs or inline tool descriptions when hosts must choose a new flag, mode, policy, or workflow.

### Success criteria
- [ ] The implementation satisfies the primary deliverable without broadening into non-goals.
- [ ] Existing default behavior remains backward-compatible or the issue explicitly documents the compatibility break.
- [ ] Failure cases return bounded, actionable diagnostics rather than silent fallback or unbounded dumps.
- [ ] Tests/benchmarks cover the concrete surface named in this issue, not only helper utilities.
- [ ] Any produced artifact is deterministic, redacted, and small enough for merge review or stored behind handles.

### Post-merge OpenChrome live verification checklist
- [ ] Run the documented local OpenChrome fixture or smoke path for **WebVoyager-style contract-eval benchmark — 10 tasks, mock+real adapters, no src/ changes** and capture the exact command/tool calls.
- [ ] Verify read_page behavior matches the issue goal in both the enabled path and the default/disabled compatibility path.
- [ ] Inspect generated artifacts/logs/responses for bounded size, redaction, source links, and clear failure diagnostics.
- [ ] Record sanitized output excerpts, artifact paths, and any benchmark/latency/payload numbers in merge verification notes.
