This PR was opened by the Changesets release GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.
Releases
@tangle-network/browser-agent-driver@0.23.0
Minor Changes
#60 a12e466 Thanks @drewstone! - Gen 10: DOM index extraction (extractWithIndex) + bigger snapshot + content-line preservation + cost cap. +8 tasks (+16 pp) on the real-web gauntlet vs same-day Gen 8 baseline, validated at 5-rep per CLAUDE.md rules #3 and #6.

Honest 5-rep numbers (matched same-day baseline)
Key wins (5-rep, same-day):
Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (Gen 9.1 had 3/5 at $0.25-$0.32 death spirals).
What ships
A: extractWithIndex action (the capability change)

New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'} returns a numbered list of every visible element matching query, each with full textContent + key attributes + a stable selector. The agent picks elements by index in the next turn.

This is the architectural fix Gen 9 was missing. Instead of asking the LLM to write a precise CSS selector for data it hasn't seen yet (the failure mode on npm/mdn/python-docs/w3c), the wide query finds candidates and the response shows actual textContent so the LLM picks by content match. Pick-by-content beats pick-by-selector on every page where the planner couldn't see the data at plan time.
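To make the pick-by-index flow concrete, here is a minimal sketch of the action shape and a numbered match-list formatter. Only the action fields (action, query, contains) and the 80-match cap come from the PR text; the IndexedMatch shape, the case-insensitive contains filter, the 1-based numbering, and the output layout are assumptions for illustration, not the actual code in src/drivers/extract-with-index.ts. Key attributes are omitted for brevity.

```typescript
// Action shape as described in the PR; field names match the example action.
interface ExtractWithIndexAction {
  action: 'extractWithIndex';
  query: string;     // wide CSS selector, e.g. 'p, dd, code'
  contains?: string; // optional text filter
}

// Hypothetical shape of one match returned to the agent.
interface IndexedMatch {
  tag: string;
  text: string;     // full textContent of the element
  selector: string; // stable selector the agent can reuse
}

const MAX_MATCHES = 80; // cap mentioned in the PR

// Assumption: the contains filter is a case-insensitive substring match.
function filterMatches(candidates: IndexedMatch[], contains?: string): IndexedMatch[] {
  const needle = contains?.toLowerCase();
  const kept = needle
    ? candidates.filter((m) => m.text.toLowerCase().includes(needle))
    : candidates;
  return kept.slice(0, MAX_MATCHES);
}

// Assumption: 1-based numbering, one match per line.
function formatMatches(matches: IndexedMatch[]): string {
  if (matches.length === 0) return 'No matches.';
  return matches
    .map((m, i) => `[${i + 1}] <${m.tag}> "${m.text}" selector=${m.selector}`)
    .join('\n');
}

// Usage: the agent issues a wide query, then picks a match by index next turn.
const exampleAction: ExtractWithIndexAction = {
  action: 'extractWithIndex',
  query: 'p, dd, code',
  contains: 'downloads',
};
const exampleCandidates: IndexedMatch[] = [
  { tag: 'p', text: 'A fast parser', selector: 'main > p:nth-of-type(1)' },
  { tag: 'dd', text: '1,234 weekly downloads', selector: 'dl > dd:nth-of-type(1)' },
];
console.log(formatMatches(filterMatches(exampleCandidates, exampleAction.contains)));
```

The point of the design is that the response carries real text, so the LLM matches on content it can see rather than guessing a selector up front.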
Wired into:
- src/types.ts: ExtractWithIndexAction type, added to Action union
- src/brain/index.ts: validateAction parser, system prompt, planner prompt, data-extraction rule #25 explaining when to prefer extractWithIndex over runScript
- src/drivers/extract-with-index.ts: browser-side query helper (visibility check, stable selector building, hidden-element skipping, 80-match cap)
- src/drivers/playwright.ts: driver dispatch returns formatted output as data so executePlan can capture it
- src/runner/runner.ts: per-action loop handler with feedback injection, executePlan capture into lastExtractOutput, plan-ends-with-extract fall-through to per-action loop with the match list as REPLAN context
- src/supervisor/policy.ts: action signature for stuck-detection

C: Bigger snapshot + content-line preservation
src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/paragraph content lines (which previously got dropped as "decorative" by the interactive-only filter). These are exactly the lines that carry the data agents need on MDN/Python docs/W3C spec/arxiv pages.

Budgets raised:
- budgetSnapshot cap: 16k → 24k chars

Same-page snapshot stays at 8k (after the LLM has already seen the page).
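The filter change can be sketched roughly as below. This is an illustrative approximation, not the real budgetSnapshot: the function name, the interactive-role list, the line-prefix parsing, and the truncate-at-budget strategy are all assumptions. Only the preserved content roles and the 24k default come from the PR.

```typescript
// Assumption: a small set of interactive roles; the real filter likely knows more.
const INTERACTIVE_ROLES = new Set(['button', 'link', 'textbox', 'checkbox', 'combobox']);
// Content roles the PR says are now preserved.
const CONTENT_ROLES = new Set(['term', 'definition', 'code', 'pre', 'paragraph']);

// Extract the role from an ARIA-snapshot-style line such as `  - code: flatMap(callbackFn)`.
function roleOf(line: string): string | null {
  const m = /^\s*-\s*([A-Za-z]+)/.exec(line);
  return m?.[1] ?? null;
}

// Hypothetical sketch: keep interactive and content lines, stop at the char budget.
function budgetSnapshotSketch(snapshot: string, maxChars = 24000): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of snapshot.split('\n')) {
    const role = roleOf(line);
    if (role === null) continue;
    if (!INTERACTIVE_ROLES.has(role) && !CONTENT_ROLES.has(role)) continue;
    const cost = line.length + 1; // +1 for the newline
    if (used + cost > maxChars) break;
    kept.push(line);
    used += cost;
  }
  return kept.join('\n');
}

// Usage: the img line is dropped, the term/code/definition lines survive.
const exampleSnapshot = [
  '- img "logo"',
  '- term:',
  '  - code: flatMap(callbackFn)',
  '- definition: Returns a new array',
  '- button "Run"',
].join('\n');
console.log(budgetSnapshotSketch(exampleSnapshot));
```

Before Gen 10, the equivalent of CONTENT_ROLES was effectively empty, which is why reference pages lost the lines that held the answer.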
Empirical verification: probed Playwright's locator.ariaSnapshot() output on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl> and confirmed Playwright DOES emit term/definition/code lines with text content. The bug was the filter dropping them, not the snapshot pipeline missing them.

Cost cap (mandatory safety net)
src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default 100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the top of every loop iteration (before the next LLM call) and returns success: false, reason: 'cost_cap_exceeded: ...' if exceeded.

Calibration:
100k = above any normal case observed, well below any death spiral. Result: zero cost cap hits in 50 runs. Reddit Gen 9.1 regression eliminated.
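A rough sketch of how such a gate could work is shown below. The names totalTokensUsed, tokenBudget, isTokenBudgetExhausted, and BAD_TOKEN_BUDGET come from the PR. Everything else is assumed for illustration: the class and function names, the precedence of the scenario value over the env value, the >= threshold, and the text after 'cost_cap_exceeded: ' (which the PR elides).

```typescript
const DEFAULT_TOKEN_BUDGET = 100000;

// Assumption: a scenario-level budget wins over the env override, which wins over the default.
function resolveTokenBudget(
  scenarioBudget: number | undefined,
  env: Record<string, string | undefined>,
): number {
  if (scenarioBudget !== undefined) return scenarioBudget;
  const fromEnv = Number(env.BAD_TOKEN_BUDGET);
  return Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : DEFAULT_TOKEN_BUDGET;
}

// Hypothetical stand-in for the run state described in the PR.
class RunStateSketch {
  totalTokensUsed = 0;
  tokenBudget: number;

  constructor(tokenBudget: number) {
    this.tokenBudget = tokenBudget;
  }

  recordUsage(tokens: number): void {
    this.totalTokensUsed += tokens;
  }

  // Assumption: the budget counts as exhausted once usage reaches it.
  get isTokenBudgetExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}

interface LoopResult {
  success: boolean;
  reason?: string;
  turns: number;
}

// Each entry in turnCosts stands in for the token cost of one LLM call.
function runLoopSketch(state: RunStateSketch, turnCosts: number[]): LoopResult {
  let turns = 0;
  for (const cost of turnCosts) {
    // Gate is checked at the top of the iteration, before the next LLM call.
    if (state.isTokenBudgetExhausted) {
      // The suffix after the prefix is illustrative only.
      const reason = `cost_cap_exceeded: ${state.totalTokensUsed}/${state.tokenBudget} tokens`;
      return { success: false, reason, turns };
    }
    state.recordUsage(cost);
    turns += 1;
  }
  return { success: true, turns };
}
```

Because the check runs before each call rather than after, a single oversized turn can still overshoot the budget; the cap bounds how many further calls follow, which is what stops a death spiral.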
Cherry-picked Gen 9 helper (safe in Gen 10)
isMeaningfulRunScriptOutput() helper detects when a runScript output is too null/empty/placeholder to be a valid extraction. The original Gen 9 PR (#59) was closed because the LLM-iteration recovery loop didn't move pass rate AND introduced cost regressions. In Gen 10 the same code is safe because:
- extractWithIndex: when the deviation reason mentions "runScript returned no meaningful output", rule #25 directs the LLM to extractWithIndex instead of retrying the same wrong selector

The helper hardens the executePlan auto-complete branch (rejects "null", {x:null}, etc.) and gates a runScript-empty fall-through that points the per-action LLM at extractWithIndex.

Tests
993/993 passing (+12 net new vs Gen 8):
- tests/budget-snapshot.test.ts: 6 (filter preservation, content lines, priority bucket, paragraph handling)
- tests/extract-with-index.test.ts: 13 (browser-side query, contains filter, hidden element skipping, invalid selector graceful fail, stable selector, formatter, parser via Brain.parse)
- tests/run-state.test.ts: 7 in 'Gen 10 cost cap' describe (default, env override, accumulator, exhaustion threshold)
- tests/runner-execute-plan.test.ts: 14 new (extractWithIndex deviation with match list, cost cap exhaustion, plus 12 cherry-picked Gen 9 fall-through tests)

Gates
- pnpm exec tsc --noEmit
- pnpm check:boundaries
- pnpm test: 993/993

Honest assessment
What this PR is: a real architectural improvement that adds a new capability (DOM index extraction) and removes a known failure mode (recovery loop death spirals).
What it isn't: a free win. Cost is +59% raw / +28% per-pass. Wall-time is +34%. Some tasks still fail (wikipedia oracle compliance, mdn/arxiv variance).
What the data says: Gen 10 is unambiguously better than Gen 8 at the same model and same conditions. The +8 task gain is well outside Wilson 95% CI overlap. The architectural changes (extractWithIndex, bigger snapshot) deliver exactly the wins they were designed for (npm 0→5, w3c 2→5).
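For readers who want to check the confidence-interval claim against their own run counts, a Wilson score interval can be computed as in the sketch below. The function names are hypothetical and the counts in the usage lines are placeholders, not the PR's results (the per-task numbers are not reproduced in this text).

```typescript
type Interval = [number, number];

// Wilson score interval for a binomial proportion; z = 1.96 gives roughly 95% coverage.
function wilsonInterval(successes: number, n: number, z = 1.96): Interval {
  if (n <= 0) throw new Error('n must be positive');
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return [center - half, center + half];
}

// Two intervals overlap when each one starts before the other ends.
function intervalsOverlap(a: Interval, b: Interval): boolean {
  return a[0] <= b[1] && b[0] <= a[1];
}

// Placeholder counts for illustration only; substitute real pass counts out of 50 runs.
const placeholderBaseline = wilsonInterval(10, 50);
const placeholderCandidate = wilsonInterval(40, 50);
console.log(placeholderBaseline, placeholderCandidate);
console.log(intervalsOverlap(placeholderBaseline, placeholderCandidate));
```

Non-overlapping intervals are a conservative signal of a real difference; overlapping intervals do not by themselves prove there is none.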
What Gen 10.1 should fix:
- {"year":1815} not '1815'