
feat(runner): Gen 9 — runtime two-pass extraction (mechanism, no measured gain)#59

Closed
drewstone wants to merge 1 commit into main from gen9-runtime-two-pass-extraction

Conversation

@drewstone
Contributor

Summary

Gen 9 ships the mechanism, not the improvement. The architectural fall-through path is in place + tested, but n=3 reps on the real-web gauntlet show no measured pass-rate gain (21/30 = 70% vs Gen 8 head-to-head 23/30 = 77%, within variance).

This is what an honest non-result looks like. Calling it "Gen 9 wins" would be reward-hacking the headline. The mechanism still ships because the substrate is reusable for Gen 9.1 (the actual fix).

What this PR is

A surgical change to executePlan: when the planner-emitted runScript step returns null / empty / {x: null} / placeholder pattern, the runner now declines to auto-complete with that garbage and falls through to the per-action loop with a [REPLAN] context that names the failure. The per-action loop's Brain.decide then gets a fresh observation of the loaded page and a chance to emit a smarter action.

This is the architectural mirror of how browser-use's per-action loop wins on tasks like npm/mdn/w3c — it iterates after a failed extraction. Gen 9 gives bad's per-action loop the same recovery surface while keeping the planner's first-try speed advantage on the cases that succeed cleanly.

Verified result: no measured improvement

| metric         | Gen 8 (head-to-head bad) | Gen 9       | Δ                        |
|----------------|-------------------------:|------------:|--------------------------|
| pass rate      | 23/30 = 77%              | 21/30 = 70% | −2 (within n=3 variance) |
| mean wall-time | 9.2s                     | 13.5s       | +4.3s                    |
| mean cost      | $0.0168                  | $0.0256     | +$0.009                  |
| mean tokens    | 6,134                    | 8,737       | +2,603                   |

The pass rate did NOT improve. The mechanism IS firing (visible in 5-7 turn runs where the per-action loop kicked in after a failed runScript), but the recovery isn't smart enough — when the per-action loop fires, it has the same LLM that picked the wrong selector the first time. Iteration alone doesn't help if the LLM keeps making the same wrong call.

Per-task delta vs Gen 8 head-to-head

| task                         | Gen 8     | Gen 9     | Δ  | what happened                             |
|------------------------------|----------:|----------:|---:|-------------------------------------------|
| npm-package-downloads        | 1/3       | 2/3       | +1 | per-action recovery worked sometimes      |
| github-pr-count              | 2/3       | 3/3       | +1 | recovery worked                           |
| arxiv-paper-abstract         | 3/3       | 2/3       | −1 | variance                                  |
| python-docs-method-signature | 2/3       | 1/3       | −1 | recovery couldn't fix wrong-selector      |
| mdn-array-flatmap            | 2/3       | 0/3       | −2 | recovery REGRESSED — more chances to fail |
| 5 other tasks                | unchanged | unchanged |  0 |                                           |
| TOTAL                        | 23/30     | 21/30     | −2 | (variance)                                |

Why this is NOT shipping as an improvement

Per CLAUDE.md rule #6 ("quality wins need ≥5 reps") and the honest-eval rules: a non-improvement is not an improvement, even when the mechanism is architecturally sound. Calling this a "Gen 9 win" would be reward-hacking the headline.

The honest framing: Gen 9 ships the mechanism, not the improvement. The substitution path, the isMeaningfulRunScriptOutput helper, and the fall-through context are all in place for future generations to build on.

Why ship Gen 9 anyway (the mechanism PR)

  1. The unit tests are valuable regardless (12 new tests covering isMeaningfulRunScriptOutput + the fall-through path)
  2. The infrastructure is reusable for Gen 9.1: vision fallback, smarter recovery prompts, multiple parallel runScript candidates — all of these slot into the same fall-through point
  3. isMeaningfulRunScriptOutput is a real primitive that future code can use
  4. Reverting feels like throwing away architectural correctness because the LLM isn't smart enough yet — the right path is to make recovery smarter, not remove the fall-through

What ships

isMeaningfulRunScriptOutput(output) in src/runner/runner.ts:

  • Rejects: null, undefined, empty/whitespace, "null", "undefined", "", '{}', '[]', {x: null}, any value matching hasPlaceholderPattern, JSON objects where every value is null/empty/zero
  • Accepts: real JSON with values, non-empty strings, non-empty arrays
  • Conservative: if ANY top-level field is null in a JSON object, treats it as "not meaningful" (the agent should retry to get all fields)
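The rules above can be sketched roughly as follows. This is a minimal illustration of the described behavior, not the shipped helper — the real implementation in src/runner/runner.ts also consults hasPlaceholderPattern, which is omitted here, and the exact literal list is an assumption:

```typescript
// Hedged sketch of the "meaningful output" check. Placeholder-pattern
// matching from the real helper is omitted; rules below are assumptions.
const PLACEHOLDER_LITERALS = new Set(["null", "undefined", '""', "''", "{}", "[]"]);

function isMeaningfulRunScriptOutput(output: unknown): boolean {
  if (output === null || output === undefined) return false;

  if (typeof output === "string") {
    const trimmed = output.trim();
    if (trimmed === "" || PLACEHOLDER_LITERALS.has(trimmed)) return false;
    try {
      // JSON strings are judged by their parsed value.
      return isMeaningfulRunScriptOutput(JSON.parse(trimmed));
    } catch {
      return true; // non-empty, non-JSON string: meaningful
    }
  }

  if (Array.isArray(output)) return output.length > 0;

  if (typeof output === "object") {
    const values = Object.values(output as Record<string, unknown>);
    if (values.length === 0) return false;
    if (values.every((v) => v === 0)) return false; // all-zero object
    // Conservative: ANY null/empty top-level field means "retry".
    return values.every((v) => v !== null && v !== undefined && v !== "");
  }

  return true;
}
```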

executePlan fall-through change — when the last plan step is a runScript AND `isMeaningfulRunScriptOutput(lastRunScriptOutput)` is false, the runner returns `kind: deviated` with reason `runScript returned no meaningful output (got: ...)`. The per-action loop's [REPLAN] context (built in BrowserAgent.run) then names this failure to Brain.decide.
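The shape of that decision can be sketched as below. The result type and function name are illustrative assumptions; only the `deviated` kind and the reason string mirror the PR text, and `looksMeaningful` is a stand-in for the real isMeaningfulRunScriptOutput helper:

```typescript
// Hedged sketch of the fall-through: decline to auto-complete on
// garbage runScript output and hand the failure to the per-action loop.
type PlanOutcome =
  | { kind: "completed"; output: unknown }
  | { kind: "deviated"; reason: string };

// Simplified stand-in for isMeaningfulRunScriptOutput.
const looksMeaningful = (o: unknown): boolean =>
  o !== null && o !== undefined && String(o).trim() !== "" && String(o).trim() !== "null";

function finishRunScriptStep(lastRunScriptOutput: unknown): PlanOutcome {
  if (!looksMeaningful(lastRunScriptOutput)) {
    // The per-action loop will receive a [REPLAN] context naming this failure.
    return {
      kind: "deviated",
      reason: `runScript returned no meaningful output (got: ${JSON.stringify(lastRunScriptOutput)})`,
    };
  }
  return { kind: "completed", output: lastRunScriptOutput };
}
```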

Tests

951 → 963 passing (+12 net new):

  • 11 isMeaningfulRunScriptOutput unit tests: rejects null/undefined/empty/whitespace, literal "null"/"undefined"/""/'', empty JSON shells {}/[], JSON objects where all values are empty, partial-extraction JSON {x: null, y: 5} (any null = retry), placeholder patterns. Accepts real JSON, non-JSON strings, non-empty arrays.
  • 4 executePlan integration tests: declines auto-complete on {x: null}, declines on literal "null", still auto-completes on meaningful output (positive control), existing Gen 7.2 placeholder tests still pass.

Tier1 deterministic gate: PASSED (no regressions — Gen 9 only fires on plans that already would have failed via auto-complete).

What Gen 9.1 should do (the actual fix)

Three approaches that could actually move the pass rate:

  1. Recovery-specific prompt for the per-action LLM: when the fall-through fires, the Brain.decide context should explicitly say "your previous runScript used selector X and returned null. Try a DIFFERENT approach: a wider DOM query, a different element type, or click + wait first." Today the context is generic. This is the smallest, cheapest fix and should be tried first.

  2. Vision fallback: when runScript fails twice, take a screenshot + ask the LLM "find the element matching X, return its CSS selector or text content." This is how Atlas/Cursor handle ambiguous DOM. Slower per call but only fires on failures.

  3. Multiple parallel runScript candidates: planner emits 2-3 alternative selectors in a single step; runner tries them in order and uses the first one that returns meaningful output. No extra LLM calls, just better recall.

The Gen 9 mechanism is the substrate for all three. Gen 9.1 picks one (or all).
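Approach 3 is the most mechanical of the three; a sketch of its shape (all names here are hypothetical, not existing runner code):

```typescript
// Illustrative sketch of "multiple parallel runScript candidates": the
// planner emits several selectors in one step; the runner takes the first
// one that yields meaningful output. No extra LLM calls.
type Extract = (selector: string) => string | null;

function tryRunScriptCandidates(selectors: string[], extract: Extract): string | null {
  for (const selector of selectors) {
    const out = extract(selector);
    if (out !== null && out.trim() !== "" && out.trim() !== "null") {
      return out; // first meaningful result wins
    }
  }
  return null; // every candidate failed: fall through to the per-action loop
}
```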

Test plan

  • pnpm test (963/963 pass)
  • pnpm exec tsc --noEmit (clean)
  • pnpm bench:tier1:gate (PASSED, no regressions)
  • Re-ran the 10-task gauntlet with Gen 9 active
  • Honestly compared per-task results vs Gen 8 head-to-head
  • Did NOT cherry-pick numbers
  • Did NOT reword the failure as a win

Honest deliverables checklist

  • ✅ Mechanism designed and tested (12 new tests pass)
  • ✅ Mechanism fires correctly on real-web failures (visible in 5-7 turn recovery runs)
  • ✅ Tier1 gate maintained
  • ❌ Pass rate unchanged: 21/30 vs 23/30 = within n=3 variance
  • ❌ Wall-time and cost increased on the recovery path
  • ❌ Did NOT claim "Gen 9 wins" when the data says it doesn't

This is what an honest non-result looks like under the rigor protocol. The mechanism ships; the improvement doesn't.

@drewstone drewstone marked this pull request as draft April 9, 2026 00:22
@drewstone
Contributor Author

Closing without merging — Gen 9 LLM-iteration approach is fundamentally broken

After Gen 9.1 enhancements (explicit recovery prompt + failed-script-in-deviation-reason), 5-rep validation revealed a critical cost regression that makes this approach worse than baseline:

Cost regression on previously-passing tasks

  • reddit-subreddit-titles: Gen 8 = 3/3 @ ~$0.015/run → Gen 9.1 = 3/5 with reps 3-4 burning $0.25 / 132K tokens and $0.32 / 173K tokens in death-spiral recovery loops
  • mdn-array-flatmap: Gen 8 = 2/3 → Gen 9.1 = 0/5 (more chances to fail with the same wrong selector logic)

Pass rate did NOT improve

  • Gen 8 baseline: 23/30 = 77%
  • Gen 9 (3 reps): 21/30 = 70% — within variance
  • Gen 9.1 (3 reps): 23/30 = 77% — net zero, +1 here, -1 there
  • 5-rep validation killed mid-run after observing the cost death-spiral

Root cause

The LLM-iteration approach (whether bad's fall-through recovery or browser-use's per-action loop) doesn't reliably recover from extraction failures. Iterating with the same LLM that picked the wrong selector burns tokens without changing the outcome. And the per-action loop has unbounded cost when recovery fails — there's no budget cap to stop a stuck recovery from burning $0.30+ per run.

What we keep

  • isMeaningfulRunScriptOutput() helper is a real primitive that's worth having for future code (validators, metric attribution, smarter cost gates)
  • The 12 unit tests are valid regardless

What's next: Gen 10

A fundamentally different approach is needed:

  1. Element-by-index addressing (browser-use's winning approach on npm/mdn/w3c)
  2. Vision fallback (Atlas/Cursor — screenshot + ask LLM to find element, only on failure)
  3. Hard cost cap on recovery loops (stop bleeding tokens when recovery isn't converging)
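Item 3 is simple to express. A minimal sketch of what a hard recovery budget could look like — the class, method names, and cap value are all hypothetical, not shipped code:

```typescript
// Hypothetical hard budget for recovery loops: charge each LLM turn's
// cost and abort the first time the cap is exceeded, so a stuck recovery
// can never death-spiral past the cap.
class RecoveryBudget {
  private spentUsd = 0;
  private capUsd: number;

  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }

  // Returns false once the budget is exhausted => caller aborts recovery.
  charge(costUsd: number): boolean {
    this.spentUsd += costUsd;
    return this.spentUsd < this.capUsd;
  }
}
```

A recovery loop would call `charge()` after each turn and bail out the first time it returns false, bounding worst-case spend at roughly the cap instead of $0.30+.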

Closing per the rigor protocol: a non-improvement that introduces cost regression is a regression, not a "mechanism PR."
