feat(runner): Gen 9 — runtime two-pass extraction (mechanism, no measured gain)#59
Closed
Mechanism in place. No measured pass-rate improvement at n=3 reps on the
real-web gauntlet. Honest non-result that points at the next architectural fix.
What this PR is:
A surgical change to executePlan: when the planner-emitted runScript step
returns null / empty / {x: null} / placeholder pattern, the runner now
declines to auto-complete with that garbage and falls through to the
per-action loop with a [REPLAN] context that names the failure. The
per-action loop's Brain.decide gets a fresh observation and a chance to
emit a smarter action.
Architectural mirror of how browser-use's per-action loop wins on
tasks like npm/mdn/w3c — it iterates after a failed extraction. Gen 9
gives bad's per-action loop the same recovery surface while keeping the
planner's first-try speed advantage on the cases that succeed cleanly.
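As a rough sketch of that hand-off (buildReplanContext is a hypothetical helper for illustration, not a name from this codebase — the real [REPLAN] context is assembled inside the runner):

```typescript
// Hypothetical sketch of the planner -> per-action hand-off described above.
// The real [REPLAN] context is assembled by the runner, not by this helper.
function buildReplanContext(deviationReason: string): string {
  return (
    `[REPLAN] The planner's script path failed: ${deviationReason}. ` +
    `Take a fresh observation of the current page and decide the next action.`
  );
}
```

The point is only that the per-action loop's Brain.decide sees a context that names the failure, rather than a generic restart.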
Verified result: no measured improvement.
Gen 9 was validated against the same 10-task gauntlet as Gen 8, 3 reps
each, same conditions:
| metric | Gen 8 (head-to-head bad) | Gen 9 | Δ |
|----------------|-------------------------:|--------------:|---------------------:|
| pass rate | 23/30 = 77% | 21/30 = 70% | -2 passes (variance) |
| mean wall-time | 9.2s | 13.5s | +4.3s |
| mean cost | $0.0168 | $0.0256 | +$0.0088 |
The pass rate did NOT improve. The mechanism IS firing (visible in 5-7
turn runs where the per-action loop kicked in), but the recovery isn't
smart enough — when the per-action loop fires, it has the SAME LLM that
picked the wrong selector the first time. Iteration alone doesn't help
if the LLM keeps making the same wrong call.
Per-task delta vs Gen 8 head-to-head:
| task | Gen 8 | Gen 9 | what happened |
|-------------------------------|------:|------:|-------------------------|
| npm-package-downloads | 1/3 | 2/3 | recovery worked sometimes |
| github-pr-count | 2/3 | 3/3 | recovery worked |
| arxiv-paper-abstract | 3/3 | 2/3 | variance |
| python-docs-method-signature | 2/3 | 1/3 | recovery couldn't fix wrong-selector |
| mdn-array-flatmap | 2/3 | 0/3 | recovery REGRESSED |
| 5 other tasks | - | - | unchanged |
| TOTAL | 23/30 | 21/30 | within variance |
Why this is NOT shipping as an improvement:
Per CLAUDE.md rule #6 (quality wins need ≥5 reps) and the honest-eval
rules: a non-improvement is not an improvement, even when the mechanism
is architecturally sound. Calling this a "Gen 9 win" would be reward-
hacking the headline.
Honest framing: Gen 9 ships the MECHANISM, not the IMPROVEMENT. The
substitution path, the isMeaningfulRunScriptOutput helper, and the
fall-through context are all in place for future generations to build on.
Why ship Gen 9 anyway:
1. The unit tests are valuable regardless (12 new tests)
2. The infrastructure is reusable for Gen 9.1: vision fallback, smarter
recovery prompts, multiple parallel runScript candidates — all of
these slot into the same fall-through point
3. isMeaningfulRunScriptOutput is a real primitive
4. Reverting feels like throwing away architectural correctness because
the LLM isn't smart enough yet — the right path is to make recovery
smarter, not remove the fall-through
What ships:
isMeaningfulRunScriptOutput(output) in src/runner/runner.ts — exported
helper that detects null/empty/placeholder runScript output:
- Rejects: null, undefined, empty/whitespace, "null", "undefined",
'""', "''", '{}', '[]', {x: null}, hasPlaceholderPattern matches,
JSON objects where every value is null/empty/zero
- Accepts: real JSON with values, non-empty strings, non-empty arrays
- Conservative: ANY top-level null in JSON object = "not meaningful"
(the agent should retry to get all fields)
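A minimal sketch of those acceptance rules, assuming a toy hasPlaceholderPattern (the shipped helper in src/runner/runner.ts is the authoritative version):

```typescript
// Sketch of the rejection rules above -- NOT the shipped helper in
// src/runner/runner.ts. hasPlaceholderPattern is a toy stand-in here.
const hasPlaceholderPattern = (s: string): boolean =>
  /\{\{\s*\w+\s*\}\}|PLACEHOLDER/i.test(s);

function isMeaningfulRunScriptOutput(output: unknown): boolean {
  if (output === null || output === undefined) return false;
  const text =
    typeof output === "string" ? output.trim() : JSON.stringify(output);
  // Literal junk the LLM sometimes returns verbatim.
  if (["", "null", "undefined", '""', "''", "{}", "[]"].includes(text)) {
    return false;
  }
  if (hasPlaceholderPattern(text)) return false;
  // Conservative object rule: ANY top-level null/empty/zero value => retry.
  if (typeof output === "object") {
    const values = Object.values(output);
    if (values.length === 0) return false;
    return values.every(
      (v) => v !== null && v !== undefined && v !== "" && v !== 0,
    );
  }
  return true;
}
```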
executePlan fall-through change — when last plan step is runScript AND
isMeaningfulRunScriptOutput is false, returns kind: deviated with reason
"runScript returned no meaningful output (got: ...)". The per-action loop's
[REPLAN] context names this failure to Brain.decide.
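A minimal sketch of that guard (finishPlan and the two-variant PlanResult are illustrative stand-ins for the real executePlan return path):

```typescript
// Illustrative stand-ins for the real executePlan return path.
type PlanResult =
  | { kind: "completed"; output: unknown }
  | { kind: "deviated"; reason: string };

function finishPlan(
  lastStepType: string,
  lastRunScriptOutput: unknown,
  isMeaningfulRunScriptOutput: (output: unknown) => boolean,
): PlanResult {
  // Gen 9 guard: never auto-complete when the final runScript produced junk.
  if (
    lastStepType === "runScript" &&
    !isMeaningfulRunScriptOutput(lastRunScriptOutput)
  ) {
    return {
      kind: "deviated",
      reason: `runScript returned no meaningful output (got: ${JSON.stringify(
        lastRunScriptOutput,
      )})`,
    };
  }
  return { kind: "completed", output: lastRunScriptOutput };
}
```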
Tests: 951 → 963 passing (+12 net new):
- 11 isMeaningfulRunScriptOutput unit tests
- 4 executePlan fall-through integration tests
- existing Gen 7.2 placeholder substitution tests still pass
- pnpm exec tsc --noEmit: clean
- Tier1 deterministic gate: PASSED (no regressions)
What Gen 9.1 should do (the actual fix):
1. Recovery-specific prompt: when fall-through fires, Brain.decide
context should say "your previous runScript used selector X and
returned null. Try a DIFFERENT approach." Today the context is
generic.
2. Vision fallback: when runScript fails twice, take a screenshot + ask
the LLM to find the element. Slower per call but only fires on
failures. This is how Atlas/Cursor handle ambiguous DOM.
3. Multiple parallel runScript candidates: planner emits 2-3 alternative
selectors in a single step; runner tries them in order. No extra LLM
calls, just better recall.
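Option 3 amounts to a first-meaningful-wins loop; a minimal sketch under assumed shapes (Candidate and tryCandidates are illustrative, not planner types):

```typescript
// Illustrative shapes -- the real planner/runner types will differ.
type Candidate = { selector: string; script: (selector: string) => unknown };

// Try planner-emitted candidates in order; return the first meaningful
// output, or null so the runner can fall through to the per-action loop.
function tryCandidates(
  candidates: Candidate[],
  isMeaningful: (output: unknown) => boolean,
): unknown {
  for (const candidate of candidates) {
    const output = candidate.script(candidate.selector);
    if (isMeaningful(output)) return output; // no extra LLM calls needed
  }
  return null; // all candidates failed -- hand off to recovery
}
```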
Gen 9 is the SUBSTRATE for all three. Gen 9.1 picks one (or all).
This is what an honest non-result looks like under the rigor protocol.
Closing without merging — the Gen 9 LLM-iteration approach is fundamentally
broken. After Gen 9.1 enhancements (explicit recovery prompt +
failed-script-in-deviation-reason), 5-rep validation revealed a critical cost
regression that makes this approach worse than baseline:
- Cost regression on previously-passing tasks
- Pass rate did NOT improve
Root cause: the LLM-iteration approach (whether bad's per-action loop or
browser-use's) doesn't reliably recover from extraction failures. Iterating
with the same LLM that picked the wrong selector burns tokens without changing
the outcome. And the per-action loop has unbounded cost when recovery fails —
there's no budget cap that stops a stuck recovery from burning $0.30+ per run.
What we keep: isMeaningfulRunScriptOutput and its tests remain useful.
What's next: Gen 10 — a fundamentally different approach is needed.
Closing per the rigor protocol: a non-improvement that introduces a cost
regression is a regression, not a "mechanism PR."