
ci: add release-driven npm publish with OIDC#6

Merged
drewstone merged 1 commit into main from chore/npm-oidc-publish on Mar 3, 2026
Conversation

@drewstone
Contributor

Summary
- add .github/workflows/publish-npm.yml
- publish on release published or manual dispatch
- enforce tag version == package.json version
- publish with npm provenance and public access
- add publishConfig.access=public and docs for one-time trusted publisher setup

Validation
- pnpm -C /home/drew/code/agent-browser-driver build
- pnpm -C /home/drew/code/agent-browser-driver test

@drewstone drewstone merged commit 0dd76d3 into main Mar 3, 2026
3 checks passed
@drewstone drewstone deleted the chore/npm-oidc-publish branch March 3, 2026 23:58
@github-actions github-actions bot mentioned this pull request Apr 8, 2026
drewstone added a commit that referenced this pull request Apr 9, 2026
…ured gain)

Mechanism in place. No measured pass-rate improvement at n=3 reps on the
real-web gauntlet. Honest non-result that points at the next architectural fix.

What this PR is:

A surgical change to executePlan: when the planner-emitted runScript step
returns null / empty / {x: null} / placeholder pattern, the runner now
declines to auto-complete with that garbage and falls through to the
per-action loop with a [REPLAN] context that names the failure. The
per-action loop's Brain.decide gets a fresh observation and a chance to
emit a smarter action.

Architectural mirror of how browser-use's per-action loop wins on
tasks like npm/mdn/w3c — it iterates after a failed extraction. Gen 9
gives bad's per-action loop the same recovery surface while keeping the
planner's first-try speed advantage on the cases that succeed cleanly.

Verified result: no measured improvement.

Gen 9 was validated against the same 10-task gauntlet as Gen 8, 3 reps
each, same conditions:

| metric         | Gen 8 (head-to-head bad) | Gen 9         | Δ           |
|----------------|-------------------------:|--------------:|-------------|
| pass rate      | 23/30 = 77%              | 21/30 = 70%   | -2 (variance) |
| mean wall-time | 9.2s                     | 13.5s         | +4.3s       |
| mean cost      | $0.0168                  | $0.0256       | +$0.009     |

The pass rate did NOT improve. The mechanism IS firing (visible in 5-7
turn runs where the per-action loop kicked in), but the recovery isn't
smart enough — when the per-action loop fires, it has the SAME LLM that
picked the wrong selector the first time. Iteration alone doesn't help
if the LLM keeps making the same wrong call.

Per-task delta vs Gen 8 head-to-head:

| task                          | Gen 8 | Gen 9 | what happened           |
|-------------------------------|------:|------:|-------------------------|
| npm-package-downloads         |   1/3 |   2/3 | recovery worked sometimes |
| github-pr-count               |   2/3 |   3/3 | recovery worked         |
| arxiv-paper-abstract          |   3/3 |   2/3 | variance                |
| python-docs-method-signature  |   2/3 |   1/3 | recovery couldn't fix wrong-selector |
| mdn-array-flatmap             |   2/3 |   0/3 | recovery REGRESSED      |
| 5 other tasks                 |     - |     - | unchanged               |
| TOTAL                         | 23/30 | 21/30 | within variance         |

Why this is NOT shipping as an improvement:

Per CLAUDE.md rule #6 (quality wins need ≥5 reps) and the honest-eval
rules: a non-improvement is not an improvement, even when the mechanism
is architecturally sound. Calling this a "Gen 9 win" would be
reward-hacking the headline.

Honest framing: Gen 9 ships the MECHANISM, not the IMPROVEMENT. The
substitution path, the isMeaningfulRunScriptOutput helper, and the
fall-through context are all in place for future generations to build on.

Why ship Gen 9 anyway:
1. The unit tests are valuable regardless (12 new tests)
2. The infrastructure is reusable for Gen 9.1: vision fallback, smarter
   recovery prompts, multiple parallel runScript candidates — all of
   these slot into the same fall-through point
3. isMeaningfulRunScriptOutput is a real primitive
4. Reverting feels like throwing away architectural correctness because
   the LLM isn't smart enough yet — the right path is to make recovery
   smarter, not remove the fall-through

What ships:

isMeaningfulRunScriptOutput(output) in src/runner/runner.ts — exported
helper that detects null/empty/placeholder runScript output:
- Rejects: null, undefined, empty/whitespace, "null", "undefined",
  '""', "''", '{}', '[]', {x: null}, hasPlaceholderPattern matches,
  JSON objects where every value is null/empty/zero
- Accepts: real JSON with values, non-empty strings, non-empty arrays
- Conservative: ANY top-level null in JSON object = "not meaningful"
  (the agent should retry to get all fields)
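
The accept/reject contract above can be sketched as follows. This is a
minimal illustration, NOT the real src/runner/runner.ts implementation;
the PLACEHOLDER regex is a simplified stand-in for hasPlaceholderPattern.

```typescript
// Simplified stand-in for the real hasPlaceholderPattern helper.
const PLACEHOLDER = /<[A-Z_]+>|\bTODO\b|\bPLACEHOLDER\b/;
const EMPTY_LITERALS = new Set(["null", "undefined", '""', "''", "{}", "[]"]);

function isMeaningfulRunScriptOutput(output: unknown): boolean {
  if (output === null || output === undefined) return false;
  const text =
    typeof output === "string" ? output.trim() : JSON.stringify(output);
  if (text === "" || EMPTY_LITERALS.has(text) || PLACEHOLDER.test(text))
    return false;
  try {
    const parsed = typeof output === "string" ? JSON.parse(text) : output;
    if (Array.isArray(parsed)) return parsed.length > 0;
    if (parsed !== null && typeof parsed === "object") {
      const values = Object.values(parsed);
      // Conservative: ANY top-level null means a partial extraction -> retry.
      if (values.some((v) => v === null || v === undefined)) return false;
      // Reject objects where every value is empty/zero.
      return !values.every((v) => v === "" || v === 0);
    }
  } catch {
    // Not JSON: a non-empty, non-placeholder string counts as meaningful.
  }
  return true;
}
```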

executePlan fall-through change — when last plan step is runScript AND
isMeaningfulRunScriptOutput is false, returns kind: deviated with reason
"runScript returned no meaningful output (got: ...)". The per-action loop's
[REPLAN] context names this failure to Brain.decide.
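
The shape of that decision is roughly the following. PlanResult and
finishRunScriptStep are simplified stand-ins for the real runner types;
only the decline-to-auto-complete branch matters here.

```typescript
// Simplified stand-ins for the real src/runner/runner.ts types.
type PlanResult =
  | { kind: "completed"; output: string }
  | { kind: "deviated"; reason: string };

function finishRunScriptStep(
  output: string,
  isMeaningful: (o: string) => boolean,
): PlanResult {
  if (!isMeaningful(output)) {
    // Decline to auto-complete with garbage; the per-action loop surfaces
    // this reason in its [REPLAN] context so Brain.decide can try again.
    return {
      kind: "deviated",
      reason: `runScript returned no meaningful output (got: ${output})`,
    };
  }
  return { kind: "completed", output };
}
```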

Tests: 951 → 963 passing (+12 net new):
- 11 isMeaningfulRunScriptOutput unit tests
- 4 executePlan fall-through integration tests
- existing Gen 7.2 placeholder substitution tests still pass
- Tier1 deterministic gate: PASSED (no regressions)

What Gen 9.1 should do (the actual fix):

1. Recovery-specific prompt: when fall-through fires, Brain.decide
   context should say "your previous runScript used selector X and
   returned null. Try a DIFFERENT approach." Today the context is
   generic.
2. Vision fallback: when runScript fails twice, take a screenshot + ask
   the LLM to find the element. Slower per call but only fires on
   failures. This is how Atlas/Cursor handle ambiguous DOM.
3. Multiple parallel runScript candidates: planner emits 2-3 alternative
   selectors in a single step; runner tries them in order. No extra LLM
   calls, just better recall.
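
Candidate 3 reduces to a try-in-order loop; names here are illustrative,
not the real bad API:

```typescript
// Try each planner-emitted selector in order; first meaningful output wins.
async function tryCandidates(
  candidates: string[],
  runSelector: (selector: string) => Promise<string | null>,
): Promise<{ selector: string; output: string } | null> {
  for (const selector of candidates) {
    const output = await runSelector(selector); // no extra LLM calls here
    if (output !== null && output.trim() !== "") {
      return { selector, output };
    }
  }
  return null; // all candidates empty -> fall through to per-action recovery
}
```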

Gen 9 is the SUBSTRATE for all three. Gen 9.1 picks one (or all).

This is what an honest non-result looks like under the rigor protocol.
drewstone added a commit that referenced this pull request Apr 9, 2026
5-rep matched same-day validation per CLAUDE.md rules #3 + #6:

  Gen 8 same-day 5-rep: 29/50 = 58%
  Gen 10 5-rep:         37/50 = 74%
  Delta:                +8 tasks (+16 percentage points)

Architectural wins (consistent across 3-rep AND 5-rep, same-day):
  - npm-package-downloads:    0/5 -> 5/5 (+5) extractWithIndex
  - w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
  - github-pr-count:          4/5 -> 5/5 (+1)
  - stackoverflow-answer-count: 2/5 -> 3/5 (+1)

Cost analysis (matched same-day):
  - Raw cost: +59% ($0.017 -> $0.027)
  - Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
  - Death spirals: 0 (cost cap held; peak run $0.16)
  - Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)

Failure modes that remain (Gen 10.1 candidates):
  - Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}.
    Same in Gen 8, not a regression. Fixable via prompt, not architecture.
  - Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
  - mdn/arxiv variance within Wilson 95% CI overlap.

Files:
  - .changeset/gen10-dom-index-extraction.md (honest writeup)
  - .evolve/progress.md (round 2 result + per-task table)
  - .evolve/current.json (status: round2_complete_promote)
  - .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
drewstone added a commit that referenced this pull request Apr 9, 2026
* feat(runner): Gen 10 — DOM index extraction + bigger snapshot + cost cap

Three coordinated changes that ship together as Gen 10:

A) extractWithIndex action — pick-by-content over pick-by-selector

   New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}
   that returns a numbered, text-rich list of every visible element matching
   the query. The agent picks elements by index in the next turn.

   This is the architectural fix Gen 9 was missing: instead of asking the LLM
   to write a precise CSS selector for data it hasn't seen yet (the failure
   mode on npm/mdn/python-docs), the wide query finds candidates and the
   response shows actual textContent so the LLM can pick by content match.

   Wired into:
   - src/types.ts (ExtractWithIndexAction type, added to Action union)
   - src/brain/index.ts (validateAction parser, system prompt, planner prompt,
     data-extraction rule #25 explaining when to prefer extractWithIndex over
     runScript on extraction tasks)
   - src/drivers/extract-with-index.ts (browser-side query helper, returns
     {index, tag, text, attributes, selector} for each visible match, capped
     at 80 matches)
   - src/drivers/playwright.ts (driver.execute dispatch, returns formatted
     output as data so executePlan can capture it like runScript)
   - src/runner/runner.ts (per-action loop handler with feedback injection,
     executePlan capture into lastExtractOutput, plan-ends-with-extract
     fall-through to per-action loop with the match list as REPLAN context)
   - src/supervisor/policy.ts (action signature for stuck-detection)

B) Bigger snapshot + content-line preservation

   src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/
   paragraph content lines (which previously got dropped as "decorative" by
   the interactive-only filter). These are exactly the lines that carry the
   data agents need on MDN/Python docs/W3C spec/arxiv pages.

   Budgets raised:
   - Default budgetSnapshot cap: 16k → 24k chars
   - Decide() new-page snapshot: 16k → 24k
   - Planner snapshot: 12k → 24k (planner is the most important caller for
     extraction tasks because it writes the runScript on the first observation)

   Same-page snapshot stays at 8k (after the LLM has already seen the page).

   Empirical verification: probed Playwright's locator.ariaSnapshot() output
   on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl>
   — confirmed Playwright DOES emit `term`/`definition`/`code` lines with text
   content. The bug was in the budgetSnapshot filter dropping them, not in
   the snapshot pipeline missing them.
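
   The filter change amounts to widening the keep-list. Role names and
   regexes below are illustrative; the real budgetSnapshot in
   src/brain/index.ts is more involved:

```typescript
// Content roles that previously got dropped as "decorative".
const CONTENT_ROLES = /^\s*- (term|definition|code|paragraph|text)\b/;
const INTERACTIVE_ROLES = /^\s*- (button|link|textbox|combobox|checkbox)\b/;

function filterSnapshotLines(lines: string[], capChars = 24_000): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of lines) {
    // Keep interactive lines (as before) AND content lines (the fix).
    if (!INTERACTIVE_ROLES.test(line) && !CONTENT_ROLES.test(line)) continue;
    if (used + line.length > capChars) break;
    kept.push(line);
    used += line.length;
  }
  return kept.join("\n");
}
```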

C) Cost cap (mandatory safety net for any iteration-based mechanism)

   src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default
   100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and
   isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the
   top of every loop iteration (before the next LLM call) and returns
   `cost_cap_exceeded` if exceeded.

   Calibration:
   - Gen 8 real-web mean: ~6k tokens (well under 100k)
   - Tier 1 form-multistep full-evidence: ~60k tokens (within cap + 40k headroom)
   - Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)

   100k = above any normal case I've measured, well below any death spiral.
   Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4:
   $0.32/173k tokens) within 5–8 turns of futility instead of running for
   the full case timeout.
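
   The gate itself is simple; field names below mirror the description of
   src/run-state.ts but are simplified stand-ins:

```typescript
// Token-budget accumulator; the runner checks the gate at the top of every
// loop iteration, before the next LLM call, and returns cost_cap_exceeded.
class RunState {
  totalTokensUsed = 0;
  constructor(public tokenBudget = 100_000) {}

  addUsage(tokens: number): void {
    this.totalTokensUsed += tokens;
  }

  isTokenBudgetExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}
```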

Tests: 18 new (981 total, +18 from baseline)
   - tests/budget-snapshot.test.ts: 6 (filter preservation, content lines,
     priority bucket, paragraph handling)
   - tests/extract-with-index.test.ts: 13 (browser-side query, contains
     filter, hidden element skipping, invalid selector graceful failure,
     stable selector building, formatter, parser via Brain.parse)
   - tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
   - tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation
     with match list, cost cap exhaustion)

Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS,
Tier1 deterministic gate PASSED.

Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md

* feat(runner): isMeaningfulRunScriptOutput helper + runScript-empty fall-through

Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original
Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the
pass rate AND introduced cost regressions on previously-passing tasks (reddit
death-spirals at $0.25-$0.32 / 130-173k tokens per case in 5-rep validation).

In Gen 10 the same code is safe and useful for two reasons:
  1. Cost cap (100k tokens, default) bounds any death spiral
  2. Per-action loop has extractWithIndex available — when the deviation
     reason mentions "runScript returned no meaningful output", the LLM can
     respond with extractWithIndex (per data-extraction rule #25) instead
     of retrying the same wrong selector

What this brings into Gen 10:

isMeaningfulRunScriptOutput() helper:
  - Detects null / undefined / empty / whitespace
  - Detects literal "null" / "undefined" / "" / ''
  - Detects empty JSON shells {} / []
  - Detects {x: null} / partial-extraction patterns (any null = retry)
  - Detects placeholder patterns via hasPlaceholderPattern

executePlan auto-complete branch hardened:
  - Old: auto-complete fires whenever lastRunScriptOutput is truthy
  - New: auto-complete fires only when isMeaningfulRunScriptOutput is true
  - Catches the literal "null" string bug that previously slipped through

executePlan runScript-empty fall-through:
  - When the last step is runScript and the output isn't meaningful, return
    deviated with a reason that names the failure AND points the per-action
    LLM at extractWithIndex (the Gen 10 recovery tool)
  - This is the path that did NOT work in Gen 9 alone — but in Gen 10 the
    per-action loop has extractWithIndex available AND the cost cap bounds
    runaway recovery loops

Tests cherry-picked: 12 (all pass)
  - 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
  - 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on
    {x:null}, declines on literal "null", positive control auto-completes
    on real values)

Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9
runScript-empty fall-through are mutually exclusive (different last-step
types). Both kept, ordered Gen 10 first then Gen 9.

Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.

* docs(gen10): changeset + 5-rep validation results

drewstone added a commit that referenced this pull request Apr 9, 2026
Validated at 5-rep matched same-day vs browser-use 0.12.6 baseline:

  bad gpt-5.2 (Tier A 5rep): 34/50 = 68% pass, $0.047 cpp, 14.6s mean wall
  bad gpt-5.4 (R1 5rep):     43/50 = 86% pass, $0.042 cpp, 8.8s mean wall  ⭐
  browser-use (Tier A 5rep): 41/50 = 82% pass, $0.031 cpp, 65.3s mean wall

bad+gpt-5.4 BEATS browser-use on pass rate (+2 tasks at 5-rep matched)
AND is 7.4x faster mean wall, 9.3x faster p95 wall.

Cost-per-pass is +35% vs browser-use ($0.042 vs $0.031). Drew explicitly
approved the trade — speed advantage justifies the cost increase.

Per-task gpt-5.4 wins vs gpt-5.2 (same gauntlet, same day):
  w3c-html-spec-find-element:    2/5 -> 5/5  (+3)
  npm-package-downloads:         2/5 -> 5/5  (+3)
  python-docs-method-signature:  3/5 -> 5/5  (+2)
  wikipedia-fact-lookup:         3/5 -> 4/5  (+1)
  mdn-array-flatmap:             2/5 -> 3/5  (+1)
  arxiv-paper-abstract:          5/5 -> 4/5  (-1, variance)
  stackoverflow / hn / github / reddit: parity

These are STRUCTURAL fixes from a smarter model on extraction tasks where
the planner needs to write a precise runScript first try.

The 3-rep 93% from Gen 11 Tier C was on the optimistic end. 5-rep is 86%
— the proper rigor number per CLAUDE.md rule #6. Still beats browser-use.

Per evolve protocol Phase 9 persistence:
  - .evolve/current.json: round 1 KEEP, status round1_complete_keep_promoted
  - .evolve/progress.md: full round 1 writeup with per-task table
  - .evolve/experiments.jsonl: gen11-002 logged

Next round candidates (Gen 11 evolve R2):
  1. Wikipedia oracle compliance prompt fix (4/5 -> 5/5)
  2. mdn / stackoverflow stabilization
  3. Re-run WebVoyager curated 30 with gpt-5.4
drewstone added a commit that referenced this pull request Apr 9, 2026
…+ multi-model) (#62)

* feat(bench): Gen 11 — master comparison orchestrator

Gen 11 ships the truth-table benchmark infrastructure:

  - scripts/run-master-comparison.mjs (290 LOC orchestrator)
    Walks 4 tiers in priority order, captures per-tier summary JSONs,
    aggregates into a single REPORT.md with executive summary, per-tier
    tables, cross-framework + cross-model truth tables, honest weak spots,
    and reproduction instructions.

    Tiers:
      A — bad Gen 10 vs browser-use 0.12.6 5-rep matched (10 real-web tasks)
      B — WebVoyager 30-task curated subset (bad only, LLM judge)
      C — multi-model (bad on gpt-5.2 vs gpt-5.4, 3-rep)
      D — Tier 1 deterministic gate (regression check)

    Features:
      - Resumable via --skip-tier
      - Single-tier override via --tier
      - Hard cost cap ($25 cumulative, configurable)
      - Tier failures don't stop other tiers
      - Pre-flight checks (browser-use venv, OPENAI_API_KEY, curated subset)
      - Per-tier launch + status logged to tier-log.jsonl

  - bench/external/webvoyager/curated-30.json
    30 hand-picked WebVoyager tasks (2 per site x 15 sites). Diverse,
    auth-free, fast to run. Cost estimate: $8.10 / 30 min for the full set.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so the master orchestrator can pass curated
    subsets without overwriting the canonical converted cases.json.

  - package.json: bench:master script

  - .evolve/pursuits/2026-04-09-comprehensive-benchmark-gen11.md
    Full Gen 11 pursuit spec: thesis, system audit, tier plan, risks,
    cost envelope, success criteria.

This commit ships the orchestration. The actual benchmark runs happen in
the next commit when bench:master executes the full battery.

Sanity-checked: pnpm exec tsc --noEmit clean, pnpm check:boundaries clean,
node scripts/run-master-comparison.mjs --tier D produces a clean REPORT.md
with the Tier 1 gate result.

* feat(bench): Gen 11 — master comparison truth table (run + report)

Gen 11 ships the truth table that shows where bad actually stands across
every benchmark surface that's runnable today. The shipping artifact is
docs/GEN11-MASTER-COMPARISON.md plus scripts/run-master-comparison.mjs to
reproduce it.

What ran (4 tiers, ~3 hrs wall, ~$15):
  Tier A — bad Gen 10 vs browser-use 0.12.6 5-rep matched same-day,
           10 real-web tasks, gpt-5.2:
             bad        34/50 = 68%, $0.0318 mean, 14.6s, $0.047 cost/pass
             browser-use 41/50 = 82%, $0.0257 mean, 65.3s, $0.031 cost/pass
             bad is 4.5x faster but loses 7 tasks on pass rate
             bad wins stackoverflow (+2); browser-use wins npm (-3),
             wikipedia (-2), mdn (-2), w3c (-2); parity on hn/github/
             arxiv/reddit/python-docs

  Tier B — WebVoyager curated 30-task sample (2 per site x 15 sites),
           bad Gen 10 only, GPT-4o vision judge:
             12/30 = 40% judge pass rate
             100% judge-agent agreement (bad does NOT lie)
             Perfect 2/2: Apple, Coursera, Google Search, Wolfram Alpha
             Half 1/2: ArXiv, BBC News, ESPN, GitHub
             Zero 0/2: Allrecipes, Amazon, Booking, Cambridge, Google
             Flights, Google Map, Huggingface (long multi-step tasks
             hit the 15-turn / 120s caps)

  Tier C — bad Gen 10 on gpt-5.4 3-rep, same 10 tasks:
             28/30 = 93% (vs 68% on gpt-5.2 in Tier A)
             cost-per-pass $0.038 (vs $0.047 on gpt-5.2)
             mean wall 9.4s (vs 14.6s on gpt-5.2)
             gpt-5.4 fixes mdn/npm/w3c/python-docs (60pp each)
             *** TOP FINDING: gpt-5.4 + bad Gen 10 = strict-upgrade ***

  Tier D — Tier 1 deterministic gate (regression check):
             FAILED both runs on local-form-multistep fast-explore at
             100k+ tokens. Same dist/cli.js Gen 10 build that passed
             at 47k tokens earlier today. Pure load-sensitivity flake.

NEW finding: concurrent-load sensitivity
  bad pass rate: 74% (Gen 10 5-rep isolation) -> 68% (Gen 11 4-tier
  concurrent load). All losses on extraction tasks Gen 10 had previously
  fixed. browser-use barely moved (84% -> 82%). The cost cap (100k)
  prevented death spirals but bad's recovery loops fire more under load.
  Investigate in Gen 12.

What ships:
  - scripts/run-master-comparison.mjs (~600 LOC orchestrator + aggregator)
    * Walks 4 tiers, captures per-tier JSON, aggregates into REPORT.md
    * Resumable via --skip-tier, single-tier override via --tier
    * --aggregate-only re-builds REPORT.md from existing data
    * Hard cost cap ($25 cumulative)
    * recomputeFromRunsJsonl() merges partial data when canonical summary missing
    * Derives realWebTasks from bench/competitive/tasks/real-web/*.json
      (was hardcoded — now picks up new tasks automatically)

  - bench/external/webvoyager/curated-30.json
    30 hand-picked diverse tasks (2 per site x 15 sites). Auth-free,
    fast to run. Site list derived dynamically in the report.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so master orchestrator can pass curated subsets
    without overwriting the canonical converted cases.json

  - bench/external/webvoyager/evaluate.mjs
    3 bug fixes:
    1. Missing openai npm dep (judge couldn't import)
    2. Wrong verdict field check (was testing testResult.verdict === 'PASS'
       but verdict is the agent's freeform completion text, not a status —
       fixed to use testResult.agentSuccess)
    3. Missing env-loader (OPENAI_API_KEY wasn't loaded from .env)

  - package.json: bench:master script + openai dep

  - docs/GEN11-MASTER-COMPARISON.md
    The truth table (167 lines, all data from this session, no stale refs)

What's NOT a regression:
  - wikipedia 3/5: same pattern in Gen 10 5-rep — agent emits raw '1815'
    instead of {"year":1815}. LLM-compliance issue with goal prompt.
  - Tier 1 fast-explore failures: same Gen 10 build that passed earlier.
    Load-sensitive flake, not code regression.
  - WebVoyager 0/2 on long multi-step sites: 15-turn / 120s caps too
    tight for these tasks. Configuration choice.

Reproducibility:
  pnpm install && pnpm build && pnpm bench:master
  Each tier writes raw data to a per-tier subdir of agent-results/
  master-comparison-<ts>/ (gitignored, ~580MB). Aggregator reads JSONs
  and produces docs/GEN11-MASTER-COMPARISON.md (committed).

Gen 12 candidates:
  1. Make bad robust to concurrent system load
  2. Default to gpt-5.4 for real-web tasks (+25pp)
  3. Wikipedia oracle compliance prompt fix
  4. Configurable per-task max-turns for WebVoyager long-form
  5. Stagehand adapter (currently a stub)

* feat(bench): Gen 11 evolve R1 — promote gpt-5.4 as default for real-web


* feat(bench): Gen 11 evolve R2 — 3 parallel experiments

Exp A — WebVoyager gpt-5.4 standard caps (30 tasks):
  Agent pass rate: 22/30 = 73% (was 12/30 = 40% on gpt-5.2, +33pp)
  Judge pass rate: 14/30 = 47% (judge is stricter — 8 disagreements)
  Agent-judge agreement: 73% (was 100% on gpt-5.2)
  Key wins: Allrecipes 0/2→2/2, Booking 0/2→2/2, Google Map 0/2→2/2

Exp B — Wikipedia oracle compliance fix:
  4/5 (same as before). JSON wrapping works (all reps emit {year:N}).
  The 1 fail is a real extraction error (returned 1843 death year, not
  1815 birth year). Verdict: KEEP the prompt fix, but 4/5 is the floor.

Exp C — WebVoyager gpt-5.4 extended caps (25 turns, 240s):
  Agent pass rate: 23/30 = 77% (+1 net vs standard caps).
  Extended caps barely help: +3 wins (apple, bbc, google-flights)
  offset by -2 regressions (booking — more turns = more chances to fail).
  Verdict: the real gain is the MODEL UPGRADE, not the cap extension.

Key finding: gpt-5.4 agent-judge disagreement
  On gpt-5.2: 100% agreement (agent never lied about success).
  On gpt-5.4: 73% agreement (8 tasks where agent claims PASS but judge
  says FAIL). gpt-5.4 is more capable but less well-calibrated.
  The honest WebVoyager number is judge rate (47%), not agent rate (73%).

Files:
  - wikipedia-fact-lookup.json: stronger JSON-wrapping prompt
  - curated-30-extended.json: 25-turn / 240s variant for Exp C
  - .evolve/ state updates
drewstone added a commit that referenced this pull request Apr 9, 2026
* feat(bench): Gen 11 — master comparison orchestrator

Gen 11 ships the truth-table benchmark infrastructure:

  - scripts/run-master-comparison.mjs (290 LOC orchestrator)
    Walks 4 tiers in priority order, captures per-tier summary JSONs,
    aggregates into a single REPORT.md with executive summary, per-tier
    tables, cross-framework + cross-model truth tables, honest weak spots,
    and reproduction instructions.

    Tiers:
      A — bad Gen 10 vs browser-use 0.12.6 5-rep matched (10 real-web tasks)
      B — WebVoyager 30-task curated subset (bad only, LLM judge)
      C — multi-model (bad on gpt-5.2 vs gpt-5.4, 3-rep)
      D — Tier 1 deterministic gate (regression check)

    Features:
      - Resumable via --skip-tier
      - Single-tier override via --tier
      - Hard cost cap ($25 cumulative, configurable)
      - Tier failures don't stop other tiers
      - Pre-flight checks (browser-use venv, OPENAI_API_KEY, curated subset)
      - Per-tier launch + status logged to tier-log.jsonl

  - bench/external/webvoyager/curated-30.json
    30 hand-picked WebVoyager tasks (2 per site x 15 sites). Diverse,
    auth-free, fast to run. Cost estimate: $8.10 / 30 min for the full set.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so the master orchestrator can pass curated
    subsets without overwriting the canonical converted cases.json.

  - package.json: bench:master script

  - .evolve/pursuits/2026-04-09-comprehensive-benchmark-gen11.md
    Full Gen 11 pursuit spec: thesis, system audit, tier plan, risks,
    cost envelope, success criteria.

This commit ships the orchestration. The actual benchmark runs happen in
the next commit when bench:master executes the full battery.

Sanity-checked: pnpm exec tsc --noEmit clean, pnpm check:boundaries clean,
node scripts/run-master-comparison.mjs --tier D produces a clean REPORT.md
with the Tier 1 gate result.

* feat(bench): Gen 11 — master comparison truth table (run + report)

Gen 11 ships the truth table that shows where bad actually stands across
every benchmark surface that's runnable today. The shipping artifact is
docs/GEN11-MASTER-COMPARISON.md plus scripts/run-master-comparison.mjs to
reproduce it.

What ran (4 tiers, ~3 hrs wall, ~$15):
  Tier A — bad Gen 10 vs browser-use 0.12.6 5-rep matched same-day,
           10 real-web tasks, gpt-5.2:
             bad        34/50 = 68%, $0.0318 mean, 14.6s, $0.047 cost/pass
             browser-use 41/50 = 82%, $0.0257 mean, 65.3s, $0.031 cost/pass
             bad is 4.5x faster but loses 7 tasks on pass rate
             bad wins stackoverflow (+2); browser-use wins npm (-3),
             wikipedia (-2), mdn (-2), w3c (-2); parity on hn/github/
             arxiv/reddit/python-docs

  Tier B — WebVoyager curated 30-task sample (2 per site x 15 sites),
           bad Gen 10 only, GPT-4o vision judge:
             12/30 = 40% judge pass rate
             100% judge-agent agreement (bad does NOT lie)
             Perfect 2/2: Apple, Coursera, Google Search, Wolfram Alpha
             Half 1/2: ArXiv, BBC News, ESPN, GitHub
             Zero 0/2: Allrecipes, Amazon, Booking, Cambridge, Google
             Flights, Google Map, Huggingface (long multi-step tasks
             hit the 15-turn / 120s caps)

  Tier C — bad Gen 10 on gpt-5.4 3-rep, same 10 tasks:
             28/30 = 93% (vs 68% on gpt-5.2 in Tier A)
             cost-per-pass $0.038 (vs $0.047 on gpt-5.2)
             mean wall 9.4s (vs 14.6s on gpt-5.2)
             gpt-5.4 fixes mdn/npm/w3c/python-docs (60pp each)
             *** TOP FINDING: gpt-5.4 + bad Gen 10 = strict-upgrade ***

  Tier D — Tier 1 deterministic gate (regression check):
             FAILED both runs on local-form-multistep fast-explore at
             100k+ tokens. Same dist/cli.js Gen 10 build that passed
             at 47k tokens earlier today. Pure load-sensitivity flake.

NEW finding: concurrent-load sensitivity
  bad pass rate: 74% (Gen 10 5-rep isolation) -> 68% (Gen 11 4-tier
  concurrent load). All losses on extraction tasks Gen 10 had previously
  fixed. browser-use barely moved (84% -> 82%). The cost cap (100k)
  prevented death spirals but bad's recovery loops fire more under load.
  Investigate in Gen 12.

What ships:
  - scripts/run-master-comparison.mjs (~600 LOC orchestrator + aggregator)
    * Walks 4 tiers, captures per-tier JSON, aggregates into REPORT.md
    * Resumable via --skip-tier, single-tier override via --tier
    * --aggregate-only re-builds REPORT.md from existing data
    * Hard cost cap ($25 cumulative)
    * recomputeFromRunsJsonl() merges partial data when canonical summary missing
    * Derives realWebTasks from bench/competitive/tasks/real-web/*.json
      (was hardcoded — now picks up new tasks automatically)
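The dynamic task derivation can be sketched like this (a minimal illustration, not the orchestrator's actual code — the helper names are hypothetical; only the directory path comes from this PR):

```typescript
import { readdirSync } from "node:fs";
import { join } from "node:path";

// Pure helper: pick the .json task files out of a directory listing,
// sorted for a stable run order.
function pickTaskFiles(entries: string[], dir: string): string[] {
  return entries
    .filter((f) => f.endsWith(".json"))
    .map((f) => join(dir, f))
    .sort();
}

// New tasks dropped into the directory are picked up automatically,
// with no hardcoded list to keep in sync.
function deriveRealWebTasks(dir = "bench/competitive/tasks/real-web"): string[] {
  return pickTaskFiles(readdirSync(dir), dir);
}
```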

  - bench/external/webvoyager/curated-30.json
    30 hand-picked diverse tasks (2 per site x 15 sites). Auth-free,
    fast to run. Site list derived dynamically in the report.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so master orchestrator can pass curated subsets
    without overwriting the canonical converted cases.json

  - bench/external/webvoyager/evaluate.mjs
    3 bug fixes:
    1. Missing openai npm dep (judge couldn't import)
    2. Wrong verdict field check (was testing testResult.verdict === 'PASS'
       but verdict is the agent's freeform completion text, not a status —
       fixed to use testResult.agentSuccess)
    3. Missing env-loader (OPENAI_API_KEY wasn't loaded from .env)
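The verdict-field fix (bug 2 above) amounts to testing the boolean status instead of the freeform text; a minimal sketch, with the result shape inferred from the description rather than copied from the code:

```typescript
// testResult.verdict is the agent's freeform completion text, not a
// status enum, so `verdict === 'PASS'` compared prose against a literal
// and effectively always failed. agentSuccess is the actual boolean status.
interface TestResult {
  verdict: string;       // freeform completion text from the agent
  agentSuccess: boolean; // actual pass/fail status
}

// before (bug): testResult.verdict === 'PASS'
function agentPassed(testResult: TestResult): boolean {
  return testResult.agentSuccess === true;
}
```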

  - package.json: bench:master script + openai dep

  - docs/GEN11-MASTER-COMPARISON.md
    The truth table (167 lines, all data from this session, no stale refs)

What's NOT a regression:
  - wikipedia 3/5: same pattern in Gen 10 5-rep — agent emits raw '1815'
    instead of {"year":1815}. LLM-compliance issue with goal prompt.
  - Tier 1 fast-explore failures: same Gen 10 build that passed earlier.
    Load-sensitive flake, not code regression.
  - WebVoyager 0/2 on long multi-step sites: 15-turn / 120s caps too
    tight for these tasks. Configuration choice.
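The wikipedia miss is purely a shape mismatch: the agent emits the bare scalar 1815 where the goal asks for {"year":1815}. A hypothetical oracle-side normalizer shows how small the gap is (illustrative only; the actual fix in this series is a stronger JSON-wrapping prompt):

```typescript
// Accept either the requested {"year": 1815} wrapper or a bare "1815".
// Hypothetical: this series fixes the issue prompt-side instead.
function normalizeYearAnswer(raw: string): { year: number } | null {
  const trimmed = raw.trim();
  try {
    const parsed = JSON.parse(trimmed);
    if (typeof parsed === "number") return { year: parsed };
    if (parsed !== null && typeof parsed === "object" &&
        typeof (parsed as { year?: unknown }).year === "number") {
      return { year: (parsed as { year: number }).year };
    }
  } catch {
    // not JSON at all; fall through to the bare-digits check
  }
  return /^\d{3,4}$/.test(trimmed) ? { year: Number(trimmed) } : null;
}
```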

Reproducibility:
  pnpm install && pnpm build && pnpm bench:master
  Each tier writes raw data to a per-tier subdir of
  agent-results/master-comparison-<ts>/ (gitignored, ~580MB). Aggregator
  reads the JSONs and produces docs/GEN11-MASTER-COMPARISON.md (committed).

Gen 12 candidates:
  1. Make bad robust to concurrent system load
  2. Default to gpt-5.4 for real-web tasks (+25pp)
  3. Wikipedia oracle compliance prompt fix
  4. Configurable per-task max-turns for WebVoyager long-form
  5. Stagehand adapter (currently a stub)

* feat(bench): Gen 11 evolve R1 — promote gpt-5.4 as default for real-web

Validated at 5-rep matched same-day vs browser-use 0.12.6 baseline:

  bad gpt-5.2 (Tier A 5rep): 34/50 = 68% pass, $0.047 cpp, 14.6s mean wall
  bad gpt-5.4 (R1 5rep):     43/50 = 86% pass, $0.042 cpp, 8.8s mean wall  ⭐
  browser-use (Tier A 5rep): 41/50 = 82% pass, $0.031 cpp, 65.3s mean wall

bad+gpt-5.4 BEATS browser-use on pass rate (+2 tasks at 5-rep matched)
AND is 7.4x faster mean wall, 9.3x faster p95 wall.

Cost-per-pass is +35% vs browser-use ($0.042 vs $0.031). Drew explicitly
approved the trade — speed advantage justifies the cost increase.

Per-task gpt-5.4 wins vs gpt-5.2 (same gauntlet, same day):
  w3c-html-spec-find-element:    2/5 -> 5/5  (+3)
  npm-package-downloads:         2/5 -> 5/5  (+3)
  python-docs-method-signature:  3/5 -> 5/5  (+2)
  wikipedia-fact-lookup:         3/5 -> 4/5  (+1)
  mdn-array-flatmap:             2/5 -> 3/5  (+1)
  arxiv-paper-abstract:          5/5 -> 4/5  (-1, variance)
  stackoverflow / hn / github / reddit: parity

These are STRUCTURAL fixes from a smarter model on extraction tasks where
the planner needs to write a precise runScript first try.

The 3-rep 93% from Gen 11 Tier C was on the optimistic end. 5-rep is 86%
— the proper rigor number per CLAUDE.md rule #6. Still beats browser-use.

Per evolve protocol Phase 9 persistence:
  - .evolve/current.json: round 1 KEEP, status round1_complete_keep_promoted
  - .evolve/progress.md: full round 1 writeup with per-task table
  - .evolve/experiments.jsonl: gen11-002 logged

Next round candidates (Gen 11 evolve R2):
  1. Wikipedia oracle compliance prompt fix (4/5 -> 5/5)
  2. mdn / stackoverflow stabilization
  3. Re-run WebVoyager curated 30 with gpt-5.4

* feat(bench): Gen 11 evolve R2 — 3 parallel experiments

Exp A — WebVoyager gpt-5.4 standard caps (30 tasks):
  Agent pass rate: 22/30 = 73% (was 12/30 = 40% on gpt-5.2, +33pp)
  Judge pass rate: 14/30 = 47% (judge is stricter — 8 disagreements)
  Agent-judge agreement: 73% (was 100% on gpt-5.2)
  Key wins: Allrecipes 0/2→2/2, Booking 0/2→2/2, Google Map 0/2→2/2

Exp B — Wikipedia oracle compliance fix:
  4/5 (same as before). JSON wrapping works (all reps emit {year:N}).
  The 1 fail is a real extraction error (returned 1843 death year, not
  1815 birth year). Verdict: KEEP the prompt fix, but 4/5 is the floor.

Exp C — WebVoyager gpt-5.4 extended caps (25 turns, 240s):
  Agent pass rate: 23/30 = 77% (+1 net vs standard caps).
  Extended caps barely help: +3 wins (apple, bbc, google-flights)
  offset by -2 regressions (booking — more turns = more chances to fail).
  Verdict: the real gain is the MODEL UPGRADE, not the cap extension.

Key finding: gpt-5.4 agent-judge disagreement
  On gpt-5.2: 100% agreement (agent never lied about success).
  On gpt-5.4: 73% agreement (8 tasks where agent claims PASS but judge
  says FAIL). gpt-5.4 is more capable but less well-calibrated.
  The honest WebVoyager number is judge rate (47%), not agent rate (73%).

Files:
  - wikipedia-fact-lookup.json: stronger JSON-wrapping prompt
  - curated-30-extended.json: 25-turn / 240s variant for Exp C
  - .evolve/ state updates

* fix(runner): Gen 12 — content-aware fast-path verifier

The fast-path goal verifier at runner.ts:1596 checks:
  agentResult.length > 50 && recentErrors === 0 && hasScriptEvidence

This rubber-stamps success without reading the result content. On gpt-5.4,
the agent writes verbose narratives admitting failure ("could not complete",
"price not visible", "did not take effect") yet still marks success: true.

In Gen 11 evolve R2, 6 of 8 judge disagreements were caused by this:
- Booking: "date selection did not take effect" → fast-path stamped PASS
- Google Flights: "could not complete the Jan. 22 lookup" → PASS
- Google Map: "fifth qualifying salon is not visible" → PASS
- GitHub: "sorted by Best match, not confirmed most starred" → PASS
- Wolfram: "did not return a visible answer" → PASS

Fix: add a selfContradicting regex gate that scans the result text for
failure-admitting phrases. When found, the fast-path is blocked and the
full LLM verifier runs instead, correctly marking these as failures.

The regex catches:
  could not complete/find/fulfill/verify/confirm/locate/access/extract/retrieve
  not visible/available/found/present/accessible/displayed/shown/confirmed/verified
  did not take effect/work/succeed/load/return
  unable to find/complete/verify/access/extract/retrieve
  no visible answer/result/data/content
  no results found/returned/available
  failed/failure to find/complete/set/select/navigate
  unfortunately / I was unable / task is incomplete

Tested: 8 match cases + 5 non-match cases all pass.
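A sketch of the gate, condensed from the phrase families listed above (the constant and function names are illustrative, not the exact runner.ts identifiers):

```typescript
// One alternation per phrase family from the list above, case-insensitive.
const SELF_CONTRADICTING = new RegExp(
  [
    "could not (?:complete|find|fulfill|verify|confirm|locate|access|extract|retrieve)",
    "not (?:visible|available|found|present|accessible|displayed|shown|confirmed|verified)",
    "did not (?:take effect|work|succeed|load|return)",
    "unable to (?:find|complete|verify|access|extract|retrieve)",
    "no visible (?:answer|result|data|content)",
    "no results? (?:found|returned|available)",
    "fail(?:ed|ure) to (?:find|complete|set|select|navigate)",
    "unfortunately|i was unable|task is incomplete",
  ].join("|"),
  "i",
);

// When the result narrative admits failure, decline the fast path so the
// full LLM verifier runs and can correctly mark the task as failed.
function fastPathAllowed(resultText: string): boolean {
  return !SELF_CONTRADICTING.test(resultText);
}
```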

Expected impact:
  Agent self-report accuracy on WebVoyager goes from 73% (inflated) to
  ~53% (honest). Agent-judge agreement goes from 73% back toward 100%.
  The honest agent pass rate is now trustworthy — when bad says it
  succeeded, it actually did.

993/993 tests pass. TypeScript clean.