
ci: add release-driven npm publish with OIDC#6

Merged
drewstone merged 1 commit into main from chore/npm-oidc-publish on Mar 3, 2026
Conversation

@drewstone
Contributor

Summary
- add .github/workflows/publish-npm.yml
- publish on release published or manual dispatch
- enforce tag version == package.json version
- publish with npm provenance and public access
- add publishConfig.access=public and docs for one-time trusted publisher setup

Validation
- pnpm -C /home/drew/code/agent-browser-driver build
- pnpm -C /home/drew/code/agent-browser-driver test

@drewstone drewstone merged commit 0dd76d3 into main Mar 3, 2026
3 checks passed
@drewstone drewstone deleted the chore/npm-oidc-publish branch March 3, 2026 23:58
@github-actions github-actions bot mentioned this pull request Apr 8, 2026
drewstone added a commit that referenced this pull request Apr 9, 2026
…ured gain)

Mechanism in place. No measured pass-rate improvement at n=3 reps on the
real-web gauntlet. Honest non-result that points at the next architectural fix.

What this PR is:

A surgical change to executePlan: when the planner-emitted runScript step
returns null / empty / {x: null} / placeholder pattern, the runner now
declines to auto-complete with that garbage and falls through to the
per-action loop with a [REPLAN] context that names the failure. The
per-action loop's Brain.decide gets a fresh observation and a chance to
emit a smarter action.

Architectural mirror of how browser-use's per-action loop wins on
tasks like npm/mdn/w3c — it iterates after a failed extraction. Gen 9
gives bad's per-action loop the same recovery surface while keeping the
planner's first-try speed advantage on the cases that succeed cleanly.

Verified result: no measured improvement.

Gen 9 was validated against the same 10-task gauntlet as Gen 8, 3 reps
each, same conditions:

| metric         | Gen 8 (head-to-head bad) | Gen 9         | Δ           |
|----------------|-------------------------:|--------------:|-------------|
| pass rate      | 23/30 = 77%              | 21/30 = 70%   | -2 (variance) |
| mean wall-time | 9.2s                     | 13.5s         | +4.3s       |
| mean cost      | $0.0168                  | $0.0256       | +$0.009     |

The pass rate did NOT improve. The mechanism IS firing (visible in 5-7
turn runs where the per-action loop kicked in), but the recovery isn't
smart enough — when the per-action loop fires, it has the SAME LLM that
picked the wrong selector the first time. Iteration alone doesn't help
if the LLM keeps making the same wrong call.

Per-task delta vs Gen 8 head-to-head:

| task                          | Gen 8 | Gen 9 | what happened           |
|-------------------------------|------:|------:|-------------------------|
| npm-package-downloads         |   1/3 |   2/3 | recovery worked sometimes |
| github-pr-count               |   2/3 |   3/3 | recovery worked         |
| arxiv-paper-abstract          |   3/3 |   2/3 | variance                |
| python-docs-method-signature  |   2/3 |   1/3 | recovery couldn't fix wrong-selector |
| mdn-array-flatmap             |   2/3 |   0/3 | recovery REGRESSED      |
| 5 other tasks                 |     - |     - | unchanged               |
| TOTAL                         | 23/30 | 21/30 | within variance         |

Why this is NOT shipping as an improvement:

Per CLAUDE.md rule #6 (quality wins need ≥5 reps) and the honest-eval
rules: a non-improvement is not an improvement, even when the mechanism
is architecturally sound. Calling this a "Gen 9 win" would be
reward-hacking the headline.

Honest framing: Gen 9 ships the MECHANISM, not the IMPROVEMENT. The
substitution path, the isMeaningfulRunScriptOutput helper, and the
fall-through context are all in place for future generations to build on.

Why ship Gen 9 anyway:
1. The unit tests are valuable regardless (12 new tests)
2. The infrastructure is reusable for Gen 9.1: vision fallback, smarter
   recovery prompts, multiple parallel runScript candidates — all of
   these slot into the same fall-through point
3. isMeaningfulRunScriptOutput is a real primitive
4. Reverting feels like throwing away architectural correctness because
   the LLM isn't smart enough yet — the right path is to make recovery
   smarter, not remove the fall-through

What ships:

isMeaningfulRunScriptOutput(output) in src/runner/runner.ts — exported
helper that detects null/empty/placeholder runScript output:
- Rejects: null, undefined, empty/whitespace, "null", "undefined",
  '""', "''", '{}', '[]', {x: null}, hasPlaceholderPattern matches,
  JSON objects where every value is null/empty/zero
- Accepts: real JSON with values, non-empty strings, non-empty arrays
- Conservative: ANY top-level null in JSON object = "not meaningful"
  (the agent should retry to get all fields)
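
The accept/reject contract above can be sketched as follows. This is a
minimal illustration, NOT the real src/runner/runner.ts implementation;
the PLACEHOLDER regex is a simplified stand-in for hasPlaceholderPattern.

```typescript
// Simplified stand-in for the real hasPlaceholderPattern helper.
const PLACEHOLDER = /<[A-Z_]+>|\bTODO\b|\bPLACEHOLDER\b/;
const EMPTY_LITERALS = new Set(["null", "undefined", '""', "''", "{}", "[]"]);

function isMeaningfulRunScriptOutput(output: unknown): boolean {
  if (output === null || output === undefined) return false;
  const text =
    typeof output === "string" ? output.trim() : JSON.stringify(output);
  if (text === "" || EMPTY_LITERALS.has(text) || PLACEHOLDER.test(text))
    return false;
  try {
    const parsed = typeof output === "string" ? JSON.parse(text) : output;
    if (Array.isArray(parsed)) return parsed.length > 0;
    if (parsed !== null && typeof parsed === "object") {
      const values = Object.values(parsed);
      // Conservative: ANY top-level null means a partial extraction -> retry.
      if (values.some((v) => v === null || v === undefined)) return false;
      // Reject objects where every value is empty/zero.
      return !values.every((v) => v === "" || v === 0);
    }
  } catch {
    // Not JSON: a non-empty, non-placeholder string counts as meaningful.
  }
  return true;
}
```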

executePlan fall-through change — when last plan step is runScript AND
isMeaningfulRunScriptOutput is false, returns kind: deviated with reason
"runScript returned no meaningful output (got: ...)". The per-action loop's
[REPLAN] context names this failure to Brain.decide.
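
The shape of that decision is roughly the following. PlanResult and
finishRunScriptStep are simplified stand-ins for the real runner types;
only the decline-to-auto-complete branch matters here.

```typescript
// Simplified stand-ins for the real src/runner/runner.ts types.
type PlanResult =
  | { kind: "completed"; output: string }
  | { kind: "deviated"; reason: string };

function finishRunScriptStep(
  output: string,
  isMeaningful: (o: string) => boolean,
): PlanResult {
  if (!isMeaningful(output)) {
    // Decline to auto-complete with garbage; the per-action loop surfaces
    // this reason in its [REPLAN] context so Brain.decide can try again.
    return {
      kind: "deviated",
      reason: `runScript returned no meaningful output (got: ${output})`,
    };
  }
  return { kind: "completed", output };
}
```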

Tests: 951 → 963 passing (+12 net new):
- 11 isMeaningfulRunScriptOutput unit tests
- 4 executePlan fall-through integration tests
- existing Gen 7.2 placeholder substitution tests still pass
- Tier1 deterministic gate: PASSED (no regressions)

What Gen 9.1 should do (the actual fix):

1. Recovery-specific prompt: when fall-through fires, Brain.decide
   context should say "your previous runScript used selector X and
   returned null. Try a DIFFERENT approach." Today the context is
   generic.
2. Vision fallback: when runScript fails twice, take a screenshot + ask
   the LLM to find the element. Slower per call but only fires on
   failures. This is how Atlas/Cursor handle ambiguous DOM.
3. Multiple parallel runScript candidates: planner emits 2-3 alternative
   selectors in a single step; runner tries them in order. No extra LLM
   calls, just better recall.
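
Candidate 3 reduces to a try-in-order loop; names here are illustrative,
not the real bad API:

```typescript
// Try each planner-emitted selector in order; first meaningful output wins.
async function tryCandidates(
  candidates: string[],
  runSelector: (selector: string) => Promise<string | null>,
): Promise<{ selector: string; output: string } | null> {
  for (const selector of candidates) {
    const output = await runSelector(selector); // no extra LLM calls here
    if (output !== null && output.trim() !== "") {
      return { selector, output };
    }
  }
  return null; // all candidates empty -> fall through to per-action recovery
}
```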

Gen 9 is the SUBSTRATE for all three. Gen 9.1 picks one (or all).

This is what an honest non-result looks like under the rigor protocol.
drewstone added a commit that referenced this pull request Apr 9, 2026
5-rep matched same-day validation per CLAUDE.md rules #3 + #6:

  Gen 8 same-day 5-rep: 29/50 = 58%
  Gen 10 5-rep:         37/50 = 74%
  Delta:                +8 tasks (+16 percentage points)

Architectural wins (consistent across 3-rep AND 5-rep, same-day):
  - npm-package-downloads:    0/5 -> 5/5 (+5) extractWithIndex
  - w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
  - github-pr-count:          4/5 -> 5/5 (+1)
  - stackoverflow-answer-count: 2/5 -> 3/5 (+1)

Cost analysis (matched same-day):
  - Raw cost: +59% ($0.017 -> $0.027)
  - Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
  - Death spirals: 0 (cost cap held; peak run $0.16)
  - Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)

Failure modes that remain (Gen 10.1 candidates):
  - Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}.
    Same in Gen 8, not a regression. Fixable via prompt, not architecture.
  - Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
  - mdn/arxiv variance within Wilson 95% CI overlap.

Files:
  - .changeset/gen10-dom-index-extraction.md (honest writeup)
  - .evolve/progress.md (round 2 result + per-task table)
  - .evolve/current.json (status: round2_complete_promote)
  - .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
drewstone added a commit that referenced this pull request Apr 9, 2026
* feat(runner): Gen 10 — DOM index extraction + bigger snapshot + cost cap

Three coordinated changes that ship together as Gen 10:

A) extractWithIndex action — pick-by-content over pick-by-selector

   New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}
   that returns a numbered, text-rich list of every visible element matching
   the query. The agent picks elements by index in the next turn.

   This is the architectural fix Gen 9 was missing: instead of asking the LLM
   to write a precise CSS selector for data it hasn't seen yet (the failure
   mode on npm/mdn/python-docs), the wide query finds candidates and the
   response shows actual textContent so the LLM can pick by content match.

   Wired into:
   - src/types.ts (ExtractWithIndexAction type, added to Action union)
   - src/brain/index.ts (validateAction parser, system prompt, planner prompt,
     data-extraction rule #25 explaining when to prefer extractWithIndex over
     runScript on extraction tasks)
   - src/drivers/extract-with-index.ts (browser-side query helper, returns
     {index, tag, text, attributes, selector} for each visible match, capped
     at 80 matches)
   - src/drivers/playwright.ts (driver.execute dispatch, returns formatted
     output as data so executePlan can capture it like runScript)
   - src/runner/runner.ts (per-action loop handler with feedback injection,
     executePlan capture into lastExtractOutput, plan-ends-with-extract
     fall-through to per-action loop with the match list as REPLAN context)
   - src/supervisor/policy.ts (action signature for stuck-detection)

B) Bigger snapshot + content-line preservation

   src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/
   paragraph content lines (which previously got dropped as "decorative" by
   the interactive-only filter). These are exactly the lines that carry the
   data agents need on MDN/Python docs/W3C spec/arxiv pages.

   Budgets raised:
   - Default budgetSnapshot cap: 16k → 24k chars
   - Decide() new-page snapshot: 16k → 24k
   - Planner snapshot: 12k → 24k (planner is the most important caller for
     extraction tasks because it writes the runScript on the first observation)

   Same-page snapshot stays at 8k (after the LLM has already seen the page).

   Empirical verification: probed Playwright's locator.ariaSnapshot() output
   on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl>
   — confirmed Playwright DOES emit `term`/`definition`/`code` lines with text
   content. The bug was in the budgetSnapshot filter dropping them, not in
   the snapshot pipeline missing them.
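
   The filter change amounts to widening the keep-list. Role names and
   regexes below are illustrative; the real budgetSnapshot in
   src/brain/index.ts is more involved:

```typescript
// Content roles that previously got dropped as "decorative".
const CONTENT_ROLES = /^\s*- (term|definition|code|paragraph|text)\b/;
const INTERACTIVE_ROLES = /^\s*- (button|link|textbox|combobox|checkbox)\b/;

function filterSnapshotLines(lines: string[], capChars = 24_000): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of lines) {
    // Keep interactive lines (as before) AND content lines (the fix).
    if (!INTERACTIVE_ROLES.test(line) && !CONTENT_ROLES.test(line)) continue;
    if (used + line.length > capChars) break;
    kept.push(line);
    used += line.length;
  }
  return kept.join("\n");
}
```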

C) Cost cap (mandatory safety net for any iteration-based mechanism)

   src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default
   100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and
   isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the
   top of every loop iteration (before the next LLM call) and returns
   `cost_cap_exceeded` if exceeded.

   Calibration:
   - Gen 8 real-web mean: ~6k tokens (well under 100k)
   - Tier 1 form-multistep full-evidence: ~60k tokens (within cap + 40k headroom)
   - Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)

   100k = above any normal case I've measured, well below any death spiral.
   Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4:
   $0.32/173k tokens) within 5–8 turns of futility instead of running for
   the full case timeout.
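
   The gate itself is simple; field names below mirror the description of
   src/run-state.ts but are simplified stand-ins:

```typescript
// Token-budget accumulator; the runner checks the gate at the top of every
// loop iteration, before the next LLM call, and returns cost_cap_exceeded.
class RunState {
  totalTokensUsed = 0;
  constructor(public tokenBudget = 100_000) {}

  addUsage(tokens: number): void {
    this.totalTokensUsed += tokens;
  }

  isTokenBudgetExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}
```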

Tests: 18 new (981 total, +18 from baseline)
   - tests/budget-snapshot.test.ts: 6 (filter preservation, content lines,
     priority bucket, paragraph handling)
   - tests/extract-with-index.test.ts: 13 (browser-side query, contains
     filter, hidden element skipping, invalid selector graceful failure,
     stable selector building, formatter, parser via Brain.parse)
   - tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
   - tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation
     with match list, cost cap exhaustion)

Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS,
Tier1 deterministic gate PASSED.

Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md

* feat(runner): isMeaningfulRunScriptOutput helper + runScript-empty fall-through

Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original
Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the
pass rate AND introduced cost regressions on previously-passing tasks (reddit
death-spirals at $0.25-$0.32 / 130-173k tokens per case in 5-rep validation).

In Gen 10 the same code is safe and useful for two reasons:
  1. Cost cap (100k tokens, default) bounds any death spiral
  2. Per-action loop has extractWithIndex available — when the deviation
     reason mentions "runScript returned no meaningful output", the LLM can
     respond with extractWithIndex (per data-extraction rule #25) instead
     of retrying the same wrong selector

What this brings into Gen 10:

isMeaningfulRunScriptOutput() helper:
  - Detects null / undefined / empty / whitespace
  - Detects literal "null" / "undefined" / "" / ''
  - Detects empty JSON shells {} / []
  - Detects {x: null} / partial-extraction patterns (any null = retry)
  - Detects placeholder patterns via hasPlaceholderPattern

executePlan auto-complete branch hardened:
  - Old: auto-complete fires whenever lastRunScriptOutput is truthy
  - New: auto-complete fires only when isMeaningfulRunScriptOutput is true
  - Catches the literal "null" string bug that previously slipped through

executePlan runScript-empty fall-through:
  - When the last step is runScript and the output isn't meaningful, return
    deviated with a reason that names the failure AND points the per-action
    LLM at extractWithIndex (the Gen 10 recovery tool)
  - This is the path that did NOT work in Gen 9 alone — but in Gen 10 the
    per-action loop has extractWithIndex available AND the cost cap bounds
    runaway recovery loops

Tests cherry-picked: 12 (all pass)
  - 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
  - 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on
    {x:null}, declines on literal "null", positive control auto-completes
    on real values)

Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9
runScript-empty fall-through are mutually exclusive (different last-step
types). Both kept, ordered Gen 10 first then Gen 9.

Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.

* docs(gen10): changeset + 5-rep validation results

drewstone added a commit that referenced this pull request Apr 9, 2026
Validated at 5-rep matched same-day vs browser-use 0.12.6 baseline:

  bad gpt-5.2 (Tier A 5rep): 34/50 = 68% pass, $0.047 cpp, 14.6s mean wall
  bad gpt-5.4 (R1 5rep):     43/50 = 86% pass, $0.042 cpp, 8.8s mean wall  ⭐
  browser-use (Tier A 5rep): 41/50 = 82% pass, $0.031 cpp, 65.3s mean wall

bad+gpt-5.4 BEATS browser-use on pass rate (+2 tasks at 5-rep matched)
AND is 7.4x faster mean wall, 9.3x faster p95 wall.

Cost-per-pass is +35% vs browser-use ($0.042 vs $0.031). Drew explicitly
approved the trade — speed advantage justifies the cost increase.

Per-task gpt-5.4 wins vs gpt-5.2 (same gauntlet, same day):
  w3c-html-spec-find-element:    2/5 -> 5/5  (+3)
  npm-package-downloads:         2/5 -> 5/5  (+3)
  python-docs-method-signature:  3/5 -> 5/5  (+2)
  wikipedia-fact-lookup:         3/5 -> 4/5  (+1)
  mdn-array-flatmap:             2/5 -> 3/5  (+1)
  arxiv-paper-abstract:          5/5 -> 4/5  (-1, variance)
  stackoverflow / hn / github / reddit: parity

These are STRUCTURAL fixes from a smarter model on extraction tasks where
the planner needs to write a precise runScript first try.

The 3-rep 93% from Gen 11 Tier C was on the optimistic end. 5-rep is 86%
— the proper rigor number per CLAUDE.md rule #6. Still beats browser-use.

Per evolve protocol Phase 9 persistence:
  - .evolve/current.json: round 1 KEEP, status round1_complete_keep_promoted
  - .evolve/progress.md: full round 1 writeup with per-task table
  - .evolve/experiments.jsonl: gen11-002 logged

Next round candidates (Gen 11 evolve R2):
  1. Wikipedia oracle compliance prompt fix (4/5 -> 5/5)
  2. mdn / stackoverflow stabilization
  3. Re-run WebVoyager curated 30 with gpt-5.4
drewstone added a commit that referenced this pull request Apr 9, 2026
…+ multi-model) (#62)

* feat(bench): Gen 11 — master comparison orchestrator

Gen 11 ships the truth-table benchmark infrastructure:

  - scripts/run-master-comparison.mjs (290 LOC orchestrator)
    Walks 4 tiers in priority order, captures per-tier summary JSONs,
    aggregates into a single REPORT.md with executive summary, per-tier
    tables, cross-framework + cross-model truth tables, honest weak spots,
    and reproduction instructions.

    Tiers:
      A — bad Gen 10 vs browser-use 0.12.6 5-rep matched (10 real-web tasks)
      B — WebVoyager 30-task curated subset (bad only, LLM judge)
      C — multi-model (bad on gpt-5.2 vs gpt-5.4, 3-rep)
      D — Tier 1 deterministic gate (regression check)

    Features:
      - Resumable via --skip-tier
      - Single-tier override via --tier
      - Hard cost cap ($25 cumulative, configurable)
      - Tier failures don't stop other tiers
      - Pre-flight checks (browser-use venv, OPENAI_API_KEY, curated subset)
      - Per-tier launch + status logged to tier-log.jsonl

  - bench/external/webvoyager/curated-30.json
    30 hand-picked WebVoyager tasks (2 per site x 15 sites). Diverse,
    auth-free, fast to run. Cost estimate: $8.10 / 30 min for the full set.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so the master orchestrator can pass curated
    subsets without overwriting the canonical converted cases.json.

  - package.json: bench:master script

  - .evolve/pursuits/2026-04-09-comprehensive-benchmark-gen11.md
    Full Gen 11 pursuit spec: thesis, system audit, tier plan, risks,
    cost envelope, success criteria.

This commit ships the orchestration. The actual benchmark runs happen in
the next commit when bench:master executes the full battery.

Sanity-checked: pnpm exec tsc --noEmit clean, pnpm check:boundaries clean,
node scripts/run-master-comparison.mjs --tier D produces a clean REPORT.md
with the Tier 1 gate result.

* feat(bench): Gen 11 — master comparison truth table (run + report)

Gen 11 ships the truth table that shows where bad actually stands across
every benchmark surface that's runnable today. The shipping artifact is
docs/GEN11-MASTER-COMPARISON.md plus scripts/run-master-comparison.mjs to
reproduce it.

What ran (4 tiers, ~3 hrs wall, ~$15):
  Tier A — bad Gen 10 vs browser-use 0.12.6 5-rep matched same-day,
           10 real-web tasks, gpt-5.2:
             bad        34/50 = 68%, $0.0318 mean, 14.6s, $0.047 cost/pass
             browser-use 41/50 = 82%, $0.0257 mean, 65.3s, $0.031 cost/pass
             bad is 4.5x faster but loses 7 tasks on pass rate
             bad wins stackoverflow (+2); browser-use wins npm (-3),
             wikipedia (-2), mdn (-2), w3c (-2); parity on hn/github/
             arxiv/reddit/python-docs

  Tier B — WebVoyager curated 30-task sample (2 per site x 15 sites),
           bad Gen 10 only, GPT-4o vision judge:
             12/30 = 40% judge pass rate
             100% judge-agent agreement (bad does NOT lie)
             Perfect 2/2: Apple, Coursera, Google Search, Wolfram Alpha
             Half 1/2: ArXiv, BBC News, ESPN, GitHub
             Zero 0/2: Allrecipes, Amazon, Booking, Cambridge, Google
             Flights, Google Map, Huggingface (long multi-step tasks
             hit the 15-turn / 120s caps)

  Tier C — bad Gen 10 on gpt-5.4 3-rep, same 10 tasks:
             28/30 = 93% (vs 68% on gpt-5.2 in Tier A)
             cost-per-pass $0.038 (vs $0.047 on gpt-5.2)
             mean wall 9.4s (vs 14.6s on gpt-5.2)
             gpt-5.4 fixes mdn/npm/w3c/python-docs (60pp each)
             *** TOP FINDING: gpt-5.4 + bad Gen 10 = strict-upgrade ***

  Tier D — Tier 1 deterministic gate (regression check):
             FAILED both runs on local-form-multistep fast-explore at
             100k+ tokens. Same dist/cli.js Gen 10 build that passed
             at 47k tokens earlier today. Pure load-sensitivity flake.

NEW finding: concurrent-load sensitivity
  bad pass rate: 74% (Gen 10 5-rep isolation) -> 68% (Gen 11 4-tier
  concurrent load). All losses on extraction tasks Gen 10 had previously
  fixed. browser-use barely moved (84% -> 82%). The cost cap (100k)
  prevented death spirals but bad's recovery loops fire more under load.
  Investigate in Gen 12.

What ships:
  - scripts/run-master-comparison.mjs (~600 LOC orchestrator + aggregator)
    * Walks 4 tiers, captures per-tier JSON, aggregates into REPORT.md
    * Resumable via --skip-tier, single-tier override via --tier
    * --aggregate-only re-builds REPORT.md from existing data
    * Hard cost cap ($25 cumulative)
    * recomputeFromRunsJsonl() merges partial data when canonical summary missing
    * Derives realWebTasks from bench/competitive/tasks/real-web/*.json
      (was hardcoded — now picks up new tasks automatically)

  - bench/external/webvoyager/curated-30.json
    30 hand-picked diverse tasks (2 per site x 15 sites). Auth-free,
    fast to run. Site list derived dynamically in the report.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so master orchestrator can pass curated subsets
    without overwriting the canonical converted cases.json

  - bench/external/webvoyager/evaluate.mjs
    3 bug fixes:
    1. Missing openai npm dep (judge couldn't import)
    2. Wrong verdict field check (was testing testResult.verdict === 'PASS'
       but verdict is the agent's freeform completion text, not a status —
       fixed to use testResult.agentSuccess)
    3. Missing env-loader (OPENAI_API_KEY wasn't loaded from .env)

  - package.json: bench:master script + openai dep

  - docs/GEN11-MASTER-COMPARISON.md
    The truth table (167 lines, all data from this session, no stale refs)

What's NOT a regression:
  - wikipedia 3/5: same pattern in Gen 10 5-rep — agent emits raw '1815'
    instead of {"year":1815}. LLM-compliance issue with goal prompt.
  - Tier 1 fast-explore failures: same Gen 10 build that passed earlier.
    Load-sensitive flake, not code regression.
  - WebVoyager 0/2 on long multi-step sites: 15-turn / 120s caps too
    tight for these tasks. Configuration choice.

Reproducibility:
  pnpm install && pnpm build && pnpm bench:master
  Each tier writes raw data to a per-tier subdir of agent-results/
  master-comparison-<ts>/ (gitignored, ~580MB). Aggregator reads JSONs
  and produces docs/GEN11-MASTER-COMPARISON.md (committed).

Gen 12 candidates:
  1. Make bad robust to concurrent system load
  2. Default to gpt-5.4 for real-web tasks (+25pp)
  3. Wikipedia oracle compliance prompt fix
  4. Configurable per-task max-turns for WebVoyager long-form
  5. Stagehand adapter (currently a stub)

* feat(bench): Gen 11 evolve R1 — promote gpt-5.4 as default for real-web


* feat(bench): Gen 11 evolve R2 — 3 parallel experiments

Exp A — WebVoyager gpt-5.4 standard caps (30 tasks):
  Agent pass rate: 22/30 = 73% (was 12/30 = 40% on gpt-5.2, +33pp)
  Judge pass rate: 14/30 = 47% (judge is stricter — 8 disagreements)
  Agent-judge agreement: 73% (was 100% on gpt-5.2)
  Key wins: Allrecipes 0/2→2/2, Booking 0/2→2/2, Google Map 0/2→2/2

Exp B — Wikipedia oracle compliance fix:
  4/5 (same as before). JSON wrapping works (all reps emit {year:N}).
  The 1 fail is a real extraction error (returned 1843 death year, not
  1815 birth year). Verdict: KEEP the prompt fix, but 4/5 is the floor.

Exp C — WebVoyager gpt-5.4 extended caps (25 turns, 240s):
  Agent pass rate: 23/30 = 77% (+1 net vs standard caps).
  Extended caps barely help: +3 wins (apple, bbc, google-flights)
  offset by -2 regressions (booking — more turns = more chances to fail).
  Verdict: the real gain is the MODEL UPGRADE, not the cap extension.

Key finding: gpt-5.4 agent-judge disagreement
  On gpt-5.2: 100% agreement (agent never lied about success).
  On gpt-5.4: 73% agreement (8 tasks where agent claims PASS but judge
  says FAIL). gpt-5.4 is more capable but less well-calibrated.
  The honest WebVoyager number is judge rate (47%), not agent rate (73%).

Files:
  - wikipedia-fact-lookup.json: stronger JSON-wrapping prompt
  - curated-30-extended.json: 25-turn / 240s variant for Exp C
  - .evolve/ state updates
drewstone added a commit that referenced this pull request Apr 9, 2026
* feat(bench): Gen 11 — master comparison orchestrator

Gen 11 ships the truth-table benchmark infrastructure:

  - scripts/run-master-comparison.mjs (290 LOC orchestrator)
    Walks 4 tiers in priority order, captures per-tier summary JSONs,
    aggregates into a single REPORT.md with executive summary, per-tier
    tables, cross-framework + cross-model truth tables, honest weak spots,
    and reproduction instructions.

    Tiers:
      A — bad Gen 10 vs browser-use 0.12.6 5-rep matched (10 real-web tasks)
      B — WebVoyager 30-task curated subset (bad only, LLM judge)
      C — multi-model (bad on gpt-5.2 vs gpt-5.4, 3-rep)
      D — Tier 1 deterministic gate (regression check)

    Features:
      - Resumable via --skip-tier
      - Single-tier override via --tier
      - Hard cost cap ($25 cumulative, configurable)
      - Tier failures don't stop other tiers
      - Pre-flight checks (browser-use venv, OPENAI_API_KEY, curated subset)
      - Per-tier launch + status logged to tier-log.jsonl

  - bench/external/webvoyager/curated-30.json
    30 hand-picked WebVoyager tasks (2 per site x 15 sites). Diverse,
    auth-free, fast to run. Cost estimate: $8.10 / 30 min for the full set.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so the master orchestrator can pass curated
    subsets without overwriting the canonical converted cases.json.

  - package.json: bench:master script

  - .evolve/pursuits/2026-04-09-comprehensive-benchmark-gen11.md
    Full Gen 11 pursuit spec: thesis, system audit, tier plan, risks,
    cost envelope, success criteria.

This commit ships the orchestration. The actual benchmark runs happen in
the next commit when bench:master executes the full battery.

Sanity-checked: pnpm exec tsc --noEmit clean, pnpm check:boundaries clean,
node scripts/run-master-comparison.mjs --tier D produces a clean REPORT.md
with the Tier 1 gate result.

* feat(bench): Gen 11 — master comparison truth table (run + report)

Gen 11 ships the truth table that shows where bad actually stands across
every benchmark surface that's runnable today. The shipping artifact is
docs/GEN11-MASTER-COMPARISON.md plus scripts/run-master-comparison.mjs to
reproduce it.

What ran (4 tiers, ~3 hrs wall, ~$15):
  Tier A — bad Gen 10 vs browser-use 0.12.6 5-rep matched same-day,
           10 real-web tasks, gpt-5.2:
             bad        34/50 = 68%, $0.0318 mean, 14.6s, $0.047 cost/pass
             browser-use 41/50 = 82%, $0.0257 mean, 65.3s, $0.031 cost/pass
             bad is 4.5x faster but loses 7 tasks on pass rate
             bad wins stackoverflow (+2); browser-use wins npm (-3),
             wikipedia (-2), mdn (-2), w3c (-2); parity on hn/github/
             arxiv/reddit/python-docs

  Tier B — WebVoyager curated 30-task sample (2 per site x 15 sites),
           bad Gen 10 only, GPT-4o vision judge:
             12/30 = 40% judge pass rate
             100% judge-agent agreement (bad does NOT lie)
             Perfect 2/2: Apple, Coursera, Google Search, Wolfram Alpha
             Half 1/2: ArXiv, BBC News, ESPN, GitHub
             Zero 0/2: Allrecipes, Amazon, Booking, Cambridge, Google
             Flights, Google Map, Huggingface (long multi-step tasks
             hit the 15-turn / 120s caps)

  Tier C — bad Gen 10 on gpt-5.4 3-rep, same 10 tasks:
             28/30 = 93% (vs 68% on gpt-5.2 in Tier A)
             cost-per-pass $0.038 (vs $0.047 on gpt-5.2)
             mean wall 9.4s (vs 14.6s on gpt-5.2)
             gpt-5.4 fixes mdn/npm/w3c/python-docs (60pp each)
             *** TOP FINDING: gpt-5.4 + bad Gen 10 = strict-upgrade ***

  Tier D — Tier 1 deterministic gate (regression check):
             FAILED both runs on local-form-multistep fast-explore at
             100k+ tokens. Same dist/cli.js Gen 10 build that passed
             at 47k tokens earlier today. Pure load-sensitivity flake.

NEW finding: concurrent-load sensitivity
  bad pass rate: 74% (Gen 10 5-rep isolation) -> 68% (Gen 11 4-tier
  concurrent load). All losses on extraction tasks Gen 10 had previously
  fixed. browser-use barely moved (84% -> 82%). The cost cap (100k)
  prevented death spirals but bad's recovery loops fire more under load.
  Investigate in Gen 12.

What ships:
  - scripts/run-master-comparison.mjs (~600 LOC orchestrator + aggregator)
    * Walks 4 tiers, captures per-tier JSON, aggregates into REPORT.md
    * Resumable via --skip-tier, single-tier override via --tier
    * --aggregate-only re-builds REPORT.md from existing data
    * Hard cost cap ($25 cumulative)
    * recomputeFromRunsJsonl() merges partial data when canonical summary missing
    * Derives realWebTasks from bench/competitive/tasks/real-web/*.json
      (was hardcoded — now picks up new tasks automatically)
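The dynamic task derivation can be sketched like this (a minimal illustration, not the orchestrator's actual code — the helper names are hypothetical; only the directory path comes from this PR):

```typescript
import { readdirSync } from "node:fs";
import { join } from "node:path";

// Pure helper: pick the .json task files out of a directory listing,
// sorted for a stable run order.
function pickTaskFiles(entries: string[], dir: string): string[] {
  return entries
    .filter((f) => f.endsWith(".json"))
    .map((f) => join(dir, f))
    .sort();
}

// New tasks dropped into the directory are picked up automatically,
// with no hardcoded list to keep in sync.
function deriveRealWebTasks(dir = "bench/competitive/tasks/real-web"): string[] {
  return pickTaskFiles(readdirSync(dir), dir);
}
```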

  - bench/external/webvoyager/curated-30.json
    30 hand-picked diverse tasks (2 per site x 15 sites). Auth-free,
    fast to run. Site list derived dynamically in the report.

  - bench/external/webvoyager/run.mjs
    Added --cases-file flag so master orchestrator can pass curated subsets
    without overwriting the canonical converted cases.json

  - bench/external/webvoyager/evaluate.mjs
    3 bug fixes:
    1. Missing openai npm dep (judge couldn't import)
    2. Wrong verdict field check (was testing testResult.verdict === 'PASS'
       but verdict is the agent's freeform completion text, not a status —
       fixed to use testResult.agentSuccess)
    3. Missing env-loader (OPENAI_API_KEY wasn't loaded from .env)
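The verdict-field fix (bug 2 above) amounts to testing the boolean status instead of the freeform text; a minimal sketch, with the result shape inferred from the description rather than copied from the code:

```typescript
// testResult.verdict is the agent's freeform completion text, not a
// status enum, so `verdict === 'PASS'` compared prose against a literal
// and effectively always failed. agentSuccess is the actual boolean status.
interface TestResult {
  verdict: string;       // freeform completion text from the agent
  agentSuccess: boolean; // actual pass/fail status
}

// before (bug): testResult.verdict === 'PASS'
function agentPassed(testResult: TestResult): boolean {
  return testResult.agentSuccess === true;
}
```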

  - package.json: bench:master script + openai dep

  - docs/GEN11-MASTER-COMPARISON.md
    The truth table (167 lines, all data from this session, no stale refs)

What's NOT a regression:
  - wikipedia 3/5: same pattern in Gen 10 5-rep — agent emits raw '1815'
    instead of {"year":1815}. LLM-compliance issue with goal prompt.
  - Tier 1 fast-explore failures: same Gen 10 build that passed earlier.
    Load-sensitive flake, not code regression.
  - WebVoyager 0/2 on long multi-step sites: 15-turn / 120s caps too
    tight for these tasks. Configuration choice.
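The wikipedia miss is purely a shape mismatch: the agent emits the bare scalar 1815 where the goal asks for {"year":1815}. A hypothetical oracle-side normalizer shows how small the gap is (illustrative only; the actual fix in this series is a stronger JSON-wrapping prompt):

```typescript
// Accept either the requested {"year": 1815} wrapper or a bare "1815".
// Hypothetical: this series fixes the issue prompt-side instead.
function normalizeYearAnswer(raw: string): { year: number } | null {
  const trimmed = raw.trim();
  try {
    const parsed = JSON.parse(trimmed);
    if (typeof parsed === "number") return { year: parsed };
    if (parsed !== null && typeof parsed === "object" &&
        typeof (parsed as { year?: unknown }).year === "number") {
      return { year: (parsed as { year: number }).year };
    }
  } catch {
    // not JSON at all; fall through to the bare-digits check
  }
  return /^\d{3,4}$/.test(trimmed) ? { year: Number(trimmed) } : null;
}
```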

Reproducibility:
  pnpm install && pnpm build && pnpm bench:master
  Each tier writes raw data to a per-tier subdir of
  agent-results/master-comparison-<ts>/ (gitignored, ~580MB). Aggregator
  reads the JSONs and produces docs/GEN11-MASTER-COMPARISON.md (committed).

Gen 12 candidates:
  1. Make bad robust to concurrent system load
  2. Default to gpt-5.4 for real-web tasks (+25pp)
  3. Wikipedia oracle compliance prompt fix
  4. Configurable per-task max-turns for WebVoyager long-form
  5. Stagehand adapter (currently a stub)

* feat(bench): Gen 11 evolve R1 — promote gpt-5.4 as default for real-web

Validated at 5-rep matched same-day vs browser-use 0.12.6 baseline:

  bad gpt-5.2 (Tier A 5rep): 34/50 = 68% pass, $0.047 cpp, 14.6s mean wall
  bad gpt-5.4 (R1 5rep):     43/50 = 86% pass, $0.042 cpp, 8.8s mean wall  ⭐
  browser-use (Tier A 5rep): 41/50 = 82% pass, $0.031 cpp, 65.3s mean wall

bad+gpt-5.4 BEATS browser-use on pass rate (+2 tasks at 5-rep matched)
AND is 7.4x faster mean wall, 9.3x faster p95 wall.

Cost-per-pass is +35% vs browser-use ($0.042 vs $0.031). Drew explicitly
approved the trade — speed advantage justifies the cost increase.

Per-task gpt-5.4 wins vs gpt-5.2 (same gauntlet, same day):
  w3c-html-spec-find-element:    2/5 -> 5/5  (+3)
  npm-package-downloads:         2/5 -> 5/5  (+3)
  python-docs-method-signature:  3/5 -> 5/5  (+2)
  wikipedia-fact-lookup:         3/5 -> 4/5  (+1)
  mdn-array-flatmap:             2/5 -> 3/5  (+1)
  arxiv-paper-abstract:          5/5 -> 4/5  (-1, variance)
  stackoverflow / hn / github / reddit: parity

These are STRUCTURAL fixes from a smarter model on extraction tasks where
the planner needs to write a precise runScript first try.

The 3-rep 93% from Gen 11 Tier C was on the optimistic end. 5-rep is 86%
— the proper rigor number per CLAUDE.md rule #6. Still beats browser-use.

Per evolve protocol Phase 9 persistence:
  - .evolve/current.json: round 1 KEEP, status round1_complete_keep_promoted
  - .evolve/progress.md: full round 1 writeup with per-task table
  - .evolve/experiments.jsonl: gen11-002 logged

Next round candidates (Gen 11 evolve R2):
  1. Wikipedia oracle compliance prompt fix (4/5 -> 5/5)
  2. mdn / stackoverflow stabilization
  3. Re-run WebVoyager curated 30 with gpt-5.4

* feat(bench): Gen 11 evolve R2 — 3 parallel experiments

Exp A — WebVoyager gpt-5.4 standard caps (30 tasks):
  Agent pass rate: 22/30 = 73% (was 12/30 = 40% on gpt-5.2, +33pp)
  Judge pass rate: 14/30 = 47% (judge is stricter — 8 disagreements)
  Agent-judge agreement: 73% (was 100% on gpt-5.2)
  Key wins: Allrecipes 0/2→2/2, Booking 0/2→2/2, Google Map 0/2→2/2

Exp B — Wikipedia oracle compliance fix:
  4/5 (same as before). JSON wrapping works (all reps emit {year:N}).
  The 1 fail is a real extraction error (returned 1843 death year, not
  1815 birth year). Verdict: KEEP the prompt fix, but 4/5 is the floor.

Exp C — WebVoyager gpt-5.4 extended caps (25 turns, 240s):
  Agent pass rate: 23/30 = 77% (+1 net vs standard caps).
  Extended caps barely help: +3 wins (apple, bbc, google-flights)
  offset by -2 regressions (booking — more turns = more chances to fail).
  Verdict: the real gain is the MODEL UPGRADE, not the cap extension.

Key finding: gpt-5.4 agent-judge disagreement
  On gpt-5.2: 100% agreement (agent never lied about success).
  On gpt-5.4: 73% agreement (8 tasks where agent claims PASS but judge
  says FAIL). gpt-5.4 is more capable but less well-calibrated.
  The honest WebVoyager number is judge rate (47%), not agent rate (73%).

Files:
  - wikipedia-fact-lookup.json: stronger JSON-wrapping prompt
  - curated-30-extended.json: 25-turn / 240s variant for Exp C
  - .evolve/ state updates

* fix(runner): Gen 12 — content-aware fast-path verifier

The fast-path goal verifier at runner.ts:1596 checks:
  agentResult.length > 50 && recentErrors === 0 && hasScriptEvidence

This rubber-stamps success without reading the result content. On gpt-5.4,
the agent writes verbose narratives admitting failure ("could not complete",
"price not visible", "did not take effect") yet still marks success: true.

In Gen 11 evolve R2, 6 of 8 judge disagreements were caused by this:
- Booking: "date selection did not take effect" → fast-path stamped PASS
- Google Flights: "could not complete the Jan. 22 lookup" → PASS
- Google Map: "fifth qualifying salon is not visible" → PASS
- GitHub: "sorted by Best match, not confirmed most starred" → PASS
- Wolfram: "did not return a visible answer" → PASS

Fix: add a selfContradicting regex gate that scans the result text for
failure-admitting phrases. When found, the fast-path is blocked and the
full LLM verifier runs instead, correctly marking these as failures.

The regex catches:
  could not complete/find/fulfill/verify/confirm/locate/access/extract/retrieve
  not visible/available/found/present/accessible/displayed/shown/confirmed/verified
  did not take effect/work/succeed/load/return
  unable to find/complete/verify/access/extract/retrieve
  no visible answer/result/data/content
  no results found/returned/available
  failed/failure to find/complete/set/select/navigate
  unfortunately / I was unable / task is incomplete

Tested: 8 match cases + 5 non-match cases all pass.
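A sketch of the gate, condensed from the phrase families listed above (the constant and function names are illustrative, not the exact runner.ts identifiers):

```typescript
// One alternation per phrase family from the list above, case-insensitive.
const SELF_CONTRADICTING = new RegExp(
  [
    "could not (?:complete|find|fulfill|verify|confirm|locate|access|extract|retrieve)",
    "not (?:visible|available|found|present|accessible|displayed|shown|confirmed|verified)",
    "did not (?:take effect|work|succeed|load|return)",
    "unable to (?:find|complete|verify|access|extract|retrieve)",
    "no visible (?:answer|result|data|content)",
    "no results? (?:found|returned|available)",
    "fail(?:ed|ure) to (?:find|complete|set|select|navigate)",
    "unfortunately|i was unable|task is incomplete",
  ].join("|"),
  "i",
);

// When the result narrative admits failure, decline the fast path so the
// full LLM verifier runs and can correctly mark the task as failed.
function fastPathAllowed(resultText: string): boolean {
  return !SELF_CONTRADICTING.test(resultText);
}
```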

Expected impact:
  Agent self-report accuracy on WebVoyager goes from 73% (inflated) to
  ~53% (honest). Agent-judge agreement goes from 73% back toward 100%.
  The honest agent pass rate is now trustworthy — when bad says it
  succeeded, it actually did.

993/993 tests pass. TypeScript clean.