Three performance optimizations targeting LLM token cost and page load time:

1. Snapshot Diffing - Track previous ARIA snapshot elements and compute structured diffs (added/removed/changed/unchanged). Inject a compact SNAPSHOT CHANGES section into LLM context when the diff is small (<30% of the full snapshot). 86% of turns get diffs; unchanged pages produce 24-char diffs vs 45K-char snapshots.
2. Resource Blocking - Configurable request interception via Playwright page.route() to abort analytics, tracking, images, and media. Ships 100+ analytics/ad domains (Google Analytics, Segment, Mixpanel, etc). CLI flags: --block-analytics, --block-images, --block-media.
3. Compact Snapshot Format - Flat, ref-first, one-line-per-element format used in history compaction instead of stripping old snapshots entirely. Preserves element awareness from older turns. Benchmark shows a 55% reduction in per-turn token growth after compaction kicks in.

Includes benchmark suite (bench/perf-benchmark.ts) and 13 new unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
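The diffing and compact-format ideas in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in src/drivers/snapshot.ts: the `SnapshotElement` shape, the function names, the `+`/`-`/`~` prefixes, and the "(no changes)" text are all assumptions. Only the `@ref role "name" val="..."` line format and the 30% threshold come from the PR description.

```typescript
// Hypothetical element shape; the real AriaSnapshotHelper types are not shown in this PR text.
interface SnapshotElement {
  ref: string;
  role: string;
  name: string;
  value?: string;
}

interface SnapshotDiff {
  added: SnapshotElement[];
  removed: SnapshotElement[];
  changed: SnapshotElement[];
  unchangedCount: number;
}

// Compare two snapshots by element ref and bucket every element.
function diffSnapshotElements(prev: SnapshotElement[], next: SnapshotElement[]): SnapshotDiff {
  const prevByRef = new Map<string, SnapshotElement>();
  for (const el of prev) prevByRef.set(el.ref, el);

  const diff: SnapshotDiff = { added: [], removed: [], changed: [], unchangedCount: 0 };
  const nextRefs = new Set<string>();
  for (const el of next) {
    nextRefs.add(el.ref);
    const before = prevByRef.get(el.ref);
    if (before === undefined) {
      diff.added.push(el);
    } else if (before.role !== el.role || before.name !== el.name || before.value !== el.value) {
      diff.changed.push(el);
    } else {
      diff.unchangedCount += 1;
    }
  }
  for (const el of prev) {
    if (!nextRefs.has(el.ref)) diff.removed.push(el);
  }
  return diff;
}

// Flat, ref-first, one line per element: @ref role "name" val="..."
function formatCompactLine(el: SnapshotElement): string {
  const val = el.value !== undefined ? ` val="${el.value}"` : "";
  return `@${el.ref} ${el.role} "${el.name}"${val}`;
}

// Assumed rendering of the SNAPSHOT CHANGES body; the prefixes are illustrative.
function formatSnapshotChanges(diff: SnapshotDiff): string {
  const lines: string[] = [];
  for (const el of diff.added) lines.push(`+ ${formatCompactLine(el)}`);
  for (const el of diff.removed) lines.push(`- ${formatCompactLine(el)}`);
  for (const el of diff.changed) lines.push(`~ ${formatCompactLine(el)}`);
  return lines.length > 0 ? lines.join("\n") : "(no changes)";
}

// Inject the diff only when it is small relative to the full snapshot (<30% by default).
function shouldInjectDiff(diffText: string, fullSnapshot: string, maxRatio: number = 0.3): boolean {
  return diffText.length < fullSnapshot.length * maxRatio;
}
```

Keying on `ref` assumes refs are stable across turns for the same element; if they are not, every turn would look like a full add/remove and the threshold would fall back to the full snapshot.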
drewstone
added a commit
that referenced
this pull request
Apr 9, 2026
5-rep matched same-day validation per CLAUDE.md rules #3 + #6:
Gen 8 same-day 5-rep: 29/50 = 58%
Gen 10 5-rep: 37/50 = 74%
Delta: +8 tasks (+16 percentage points)
Architectural wins (consistent across 3-rep AND 5-rep, same-day):
- npm-package-downloads: 0/5 -> 5/5 (+5) extractWithIndex
- w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
- github-pr-count: 4/5 -> 5/5 (+1)
- stackoverflow-answer-count: 2/5 -> 3/5 (+1)
Cost analysis (matched same-day):
- Raw cost: +59% ($0.017 -> $0.027)
- Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
- Death spirals: 0 (cost cap held; peak run $0.16)
- Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)
Failure modes that remain (Gen 10.1 candidates):
- Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}.
Same in Gen 8, not a regression. Fixable via prompt, not architecture.
- Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
- mdn/arxiv variance within Wilson 95% CI overlap.
Files:
- .changeset/gen10-dom-index-extraction.md (honest writeup)
- .evolve/progress.md (round 2 result + per-task table)
- .evolve/current.json (status: round2_complete_promote)
- .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
drewstone
added a commit
that referenced
this pull request
Apr 9, 2026
* feat(runner): Gen 10 — DOM index extraction + bigger snapshot + cost cap
Three coordinated changes that ship together as Gen 10:
A) extractWithIndex action — pick-by-content over pick-by-selector
New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}
that returns a numbered, text-rich list of every visible element matching
the query. The agent picks elements by index in the next turn.
This is the architectural fix Gen 9 was missing: instead of asking the LLM
to write a precise CSS selector for data it hasn't seen yet (the failure
mode on npm/mdn/python-docs), the wide query finds candidates and the
response shows actual textContent so the LLM can pick by content match.
Wired into:
- src/types.ts (ExtractWithIndexAction type, added to Action union)
- src/brain/index.ts (validateAction parser, system prompt, planner prompt,
data-extraction rule #25 explaining when to prefer extractWithIndex over
runScript on extraction tasks)
- src/drivers/extract-with-index.ts (browser-side query helper, returns
{index, tag, text, attributes, selector} for each visible match, capped
at 80 matches)
- src/drivers/playwright.ts (driver.execute dispatch, returns formatted
output as data so executePlan can capture it like runScript)
- src/runner/runner.ts (per-action loop handler with feedback injection,
executePlan capture into lastExtractOutput, plan-ends-with-extract
fall-through to per-action loop with the match list as REPLAN context)
- src/supervisor/policy.ts (action signature for stuck-detection)
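As a rough illustration of the pick-by-content flow described above, the sketch below filters already-collected candidates and renders the numbered list the agent would choose from. It deliberately leaves out the browser-side DOM query, attributes, and selector building. `CandidateElement`, `selectMatches`, `formatMatches`, the optional `contains`, case-insensitive matching, and zero-based indices are assumptions; the action shape and the 80-match cap come from the commit message.

```typescript
// Action shape from the commit message; treating `contains` as optional is an assumption.
interface ExtractWithIndexAction {
  action: "extractWithIndex";
  query: string;
  contains?: string;
}

// Hypothetical stand-in for what a browser-side query would collect per element.
interface CandidateElement {
  tag: string;
  text: string;
  visible: boolean;
}

interface IndexedMatch {
  index: number;
  tag: string;
  text: string;
}

const MAX_MATCHES = 80; // cap stated in the commit message

// Keep visible, non-empty candidates whose text contains the needle (case-insensitive here).
function selectMatches(candidates: CandidateElement[], action: ExtractWithIndexAction): IndexedMatch[] {
  const needle = action.contains !== undefined ? action.contains.toLowerCase() : undefined;
  const matches: IndexedMatch[] = [];
  for (const el of candidates) {
    if (!el.visible) continue;
    const text = el.text.replace(/\s+/g, " ").trim();
    if (text.length === 0) continue;
    if (needle !== undefined && text.toLowerCase().indexOf(needle) === -1) continue;
    matches.push({ index: matches.length, tag: el.tag, text });
    if (matches.length >= MAX_MATCHES) break;
  }
  return matches;
}

// Numbered, text-rich list so the LLM can pick by index on the next turn.
function formatMatches(matches: IndexedMatch[]): string {
  return matches.map((m) => `[${m.index}] <${m.tag}> ${m.text}`).join("\n");
}
```

The point of the design is that the wide `query` only has to be roughly right; the `contains` filter and the visible text do the narrowing that a hand-written CSS selector would otherwise have to get exactly right.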
B) Bigger snapshot + content-line preservation
src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/
paragraph content lines (which previously got dropped as "decorative" by
the interactive-only filter). These are exactly the lines that carry the
data agents need on MDN/Python docs/W3C spec/arxiv pages.
Budgets raised:
- Default budgetSnapshot cap: 16k → 24k chars
- Decide() new-page snapshot: 16k → 24k
- Planner snapshot: 12k → 24k (planner is the most important caller for
extraction tasks because it writes the runScript on the first observation)
Same-page snapshot stays at 8k (after the LLM has already seen the page).
Empirical verification: probed Playwright's locator.ariaSnapshot() output
on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl>
— confirmed Playwright DOES emit `term`/`definition`/`code` lines with text
content. The bug was in the budgetSnapshot filter dropping them, not in
the snapshot pipeline missing them.
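A simplified sketch of the filter change, assuming the snapshot is a list of `- role ...` lines. The role lists, the `roleOf` helper, and the stop-at-first-overflow behaviour are illustrative assumptions; the real budgetSnapshot (including its priority buckets) is not shown in this message. Only the preserved content roles and the 24k default come from the text above.

```typescript
// Hypothetical role lists; the real filter in src/brain/index.ts may differ.
const INTERACTIVE_ROLES = ["button", "link", "textbox", "checkbox", "combobox"];
// Content roles the commit message says are now preserved.
const CONTENT_ROLES = ["term", "definition", "code", "pre", "paragraph"];

// Extract the role from a snapshot line such as `  - term: flatMap(callbackFn)`.
function roleOf(line: string): string | undefined {
  const m = /^\s*-\s+([A-Za-z]+)/.exec(line);
  return m ? m[1] : undefined;
}

// Keep interactive and content lines, in order, until the character budget is reached.
function budgetSnapshotSketch(snapshot: string, maxChars: number = 24000): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of snapshot.split("\n")) {
    const role = roleOf(line);
    if (role === undefined) continue;
    const keep = INTERACTIVE_ROLES.indexOf(role) !== -1 || CONTENT_ROLES.indexOf(role) !== -1;
    if (!keep) continue;
    if (used + line.length + 1 > maxChars) break; // +1 accounts for the newline
    kept.push(line);
    used += line.length + 1;
  }
  return kept.join("\n");
}
```

With an interactive-only filter, the `term`, `definition`, and `code` lines in a `<dl>` would be dropped before the budget is even consulted, which matches the bug described above.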
Cost cap (mandatory safety net for any iteration-based mechanism)
src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default
100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and
isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the
top of every loop iteration (before the next LLM call) and returns
`cost_cap_exceeded` if exceeded.
Calibration:
- Gen 8 real-web mean: ~6k tokens (well under 100k)
- Tier 1 form-multistep full-evidence: ~60k tokens (within the cap, with 40k headroom)
- Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)
100k = above any normal case I've measured, well below any death spiral.
Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4:
$0.32/173k tokens) within 5–8 turns of futility instead of running for
the full case timeout.
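The gate can be pictured with the minimal sketch below. The field names `totalTokensUsed` and `tokenBudget`, the `isTokenBudgetExhausted` name, the 100k default, and the `cost_cap_exceeded` result come from this message; the helper functions, the `>=` comparison, and the loop shape are assumptions, and the Scenario.tokenBudget / BAD_TOKEN_BUDGET overrides are reduced to a plain parameter.

```typescript
interface TokenBudgetState {
  totalTokensUsed: number;
  tokenBudget: number;
}

const DEFAULT_TOKEN_BUDGET = 100000;

// The override stands in for Scenario.tokenBudget or the BAD_TOKEN_BUDGET env variable.
function createTokenBudgetState(override?: number): TokenBudgetState {
  return { totalTokensUsed: 0, tokenBudget: override !== undefined ? override : DEFAULT_TOKEN_BUDGET };
}

function recordTokens(state: TokenBudgetState, tokens: number): void {
  state.totalTokensUsed += tokens;
}

// Assumption: the budget counts as exhausted once usage reaches it.
function isTokenBudgetExhausted(state: TokenBudgetState): boolean {
  return state.totalTokensUsed >= state.tokenBudget;
}

type LoopOutcome = { status: "completed" | "cost_cap_exceeded"; turns: number };

// Each entry in turnCosts stands in for the tokens one LLM call would consume.
function runLoopSketch(state: TokenBudgetState, turnCosts: number[]): LoopOutcome {
  let turns = 0;
  for (const cost of turnCosts) {
    // Gate at the top of the iteration, before the next LLM call.
    if (isTokenBudgetExhausted(state)) return { status: "cost_cap_exceeded", turns };
    recordTokens(state, cost);
    turns += 1;
  }
  return { status: "completed", turns };
}
```

Because the check runs before each call rather than after, a single turn can overshoot the budget, but the run stops before paying for another one.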
Tests: 18 new (981 total, +18 from baseline)
- tests/budget-snapshot.test.ts: 6 (filter preservation, content lines,
priority bucket, paragraph handling)
- tests/extract-with-index.test.ts: 13 (browser-side query, contains
filter, hidden element skipping, invalid selector graceful failure,
stable selector building, formatter, parser via Brain.parse)
- tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
- tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation
with match list, cost cap exhaustion)
Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS,
Tier1 deterministic gate PASSED.
Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md
* feat(runner): isMeaningfulRunScriptOutput helper + runScript-empty fall-through
Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original
Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the
pass rate AND introduced cost regressions on previously-passing tasks (reddit
death-spirals at $0.25-$0.32 / 130-173k tokens per case in 5-rep validation).
In Gen 10 the same code is safe and useful for two reasons:
1. Cost cap (100k tokens, default) bounds any death spiral
2. Per-action loop has extractWithIndex available — when the deviation
reason mentions "runScript returned no meaningful output", the LLM can
respond with extractWithIndex (per data-extraction rule #25) instead
of retrying the same wrong selector
What this brings into Gen 10:
isMeaningfulRunScriptOutput() helper:
- Detects null / undefined / empty / whitespace
- Detects literal "null" / "undefined" / "" / ''
- Detects empty JSON shells {} / []
- Detects {x: null} / partial-extraction patterns (any null = retry)
- Detects placeholder patterns via hasPlaceholderPattern
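The checks listed above could look roughly like the sketch below. It is an approximation under assumptions, not the cherry-picked helper: `isMeaningfulOutputSketch`, `containsNull`, and `isEmptyShell` are illustrative names, the placeholder check (`hasPlaceholderPattern`) is omitted because its rules are not given here, and treating non-JSON text as meaningful is a guess.

```typescript
// Literal strings that signal "nothing was extracted".
const EMPTY_LITERALS = ["null", "undefined", '""', "''", "{}", "[]"];

// True if any nested value is null (the "any null = retry" rule).
function containsNull(value: unknown): boolean {
  if (value === null) return true;
  if (Array.isArray(value)) return value.some((item) => containsNull(item));
  if (typeof value === "object") {
    const record = value as Record<string, unknown>;
    return Object.keys(record).some((key) => containsNull(record[key]));
  }
  return false;
}

// True for {} and [] after parsing, so whitespace variants like "{ }" are caught too.
function isEmptyShell(value: unknown): boolean {
  if (Array.isArray(value)) return value.length === 0;
  if (typeof value === "object" && value !== null) return Object.keys(value).length === 0;
  return false;
}

function isMeaningfulOutputSketch(output: unknown): boolean {
  if (output === null || output === undefined) return false;
  const raw: unknown = typeof output === "string" ? output : JSON.stringify(output);
  if (typeof raw !== "string") return false; // e.g. functions stringify to undefined
  const text = raw.trim();
  if (text.length === 0 || EMPTY_LITERALS.indexOf(text) !== -1) return false;
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return true; // not JSON: assume non-empty plain text is a real value
  }
  // A JSON-quoted string such as "\"null\"" is unwrapped and checked again.
  if (typeof parsed === "string") return isMeaningfulOutputSketch(parsed);
  return !isEmptyShell(parsed) && !containsNull(parsed);
}
```

A plain truthiness check would accept the string "null", since any non-empty string is truthy; that is the case the hardened auto-complete branch is meant to reject.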
executePlan auto-complete branch hardened:
- Old: auto-complete fires whenever lastRunScriptOutput is truthy
- New: auto-complete fires only when isMeaningfulRunScriptOutput is true
- Catches the literal "null" string bug that previously slipped through
executePlan runScript-empty fall-through:
- When the last step is runScript and the output isn't meaningful, return
deviated with a reason that names the failure AND points the per-action
LLM at extractWithIndex (the Gen 10 recovery tool)
- This is the path that did NOT work in Gen 9 alone — but in Gen 10 the
per-action loop has extractWithIndex available AND the cost cap bounds
runaway recovery loops
Tests cherry-picked: 12 (all pass)
- 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
- 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on
{x:null}, declines on literal "null", positive control auto-completes
on real values)
Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9
runScript-empty fall-through are mutually exclusive (different last-step
types). Both kept, ordered Gen 10 first then Gen 9.
Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.
* docs(gen10): changeset + 5-rep validation results
5-rep matched same-day validation per CLAUDE.md rules #3 + #6:
Gen 8 same-day 5-rep: 29/50 = 58%
Gen 10 5-rep: 37/50 = 74%
Delta: +8 tasks (+16 percentage points)
Architectural wins (consistent across 3-rep AND 5-rep, same-day):
- npm-package-downloads: 0/5 -> 5/5 (+5) extractWithIndex
- w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
- github-pr-count: 4/5 -> 5/5 (+1)
- stackoverflow-answer-count: 2/5 -> 3/5 (+1)
Cost analysis (matched same-day):
- Raw cost: +59% ($0.017 -> $0.027)
- Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
- Death spirals: 0 (cost cap held; peak run $0.16)
- Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)
Failure modes that remain (Gen 10.1 candidates):
- Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}.
Same in Gen 8, not a regression. Fixable via prompt, not architecture.
- Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
- mdn/arxiv variance within Wilson 95% CI overlap.
Files:
- .changeset/gen10-dom-index-extraction.md (honest writeup)
- .evolve/progress.md (round 2 result + per-task table)
- .evolve/current.json (status: round2_complete_promote)
- .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
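For readers who want to reproduce the statistics referenced in this commit, here is a small sketch of a Wilson score interval and the cost-per-pass calculation. The function names are hypothetical and the validation harness itself is not shown in this PR; the standard Wilson formula with z = 1.96 is assumed for the "95% CI". Because the reported dollar figures are already rounded, results computed from them may differ from the reported cost per pass in the last digit.

```typescript
interface WilsonInterval {
  low: number;
  high: number;
}

// Wilson score interval for a pass rate of passes/trials (z = 1.96 for ~95%).
function wilsonInterval(passes: number, trials: number, z: number = 1.96): WilsonInterval {
  if (trials <= 0 || passes < 0 || passes > trials) {
    throw new RangeError("need trials > 0 and 0 <= passes <= trials");
  }
  const p = passes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { low: center - half, high: center + half };
}

// Two intervals overlap when neither lies entirely above the other.
function intervalsOverlap(a: WilsonInterval, b: WilsonInterval): boolean {
  return a.low <= b.high && b.low <= a.high;
}

// Mean cost per run divided by pass rate: what one passing run costs on average.
function costPerPass(meanCostPerRun: number, passes: number, trials: number): number {
  if (passes <= 0) return Infinity;
  return meanCostPerRun / (passes / trials);
}
```

With only 5 reps per task the intervals are wide, for example `intervalsOverlap(wilsonInterval(2, 5), wilsonInterval(3, 5))` is true, which is the sense in which a one-rep swing on a single task is treated as variance rather than a win or a regression.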
Summary
Three performance optimizations targeting the two biggest latency sources: LLM token cost (snapshot size → decision time) and page load time (unnecessary network requests).
1. Snapshot Diffing: Tracks previous ARIA snapshot elements in `AriaSnapshotHelper`, computes structured diffs (added/removed/changed/unchanged), and injects a compact `SNAPSHOT CHANGES` section into LLM context when the diff is small. 86% of turns get diffs; unchanged pages produce 24-char diffs vs 45K-char full snapshots.
2. Resource Blocking: Configurable request interception via Playwright `page.route()` to abort analytics scripts, tracking pixels, images, and media. Ships 100+ analytics/ad domain patterns. New CLI flags: `--block-analytics`, `--block-images`, `--block-media`.
3. Compact Snapshot Format: Flat, ref-first, one-line-per-element format (`@ref role "name" val="..."`) used in history compaction instead of stripping old snapshots entirely. Preserves element awareness from older turns while using ~50-60% fewer tokens.
Benchmark Results (7-turn Hacker News navigation task, gpt-4o)
Files Changed
- src/drivers/snapshot.ts: `diffSnapshots()`, `getDiff()`, `formatCompact()`
- src/drivers/playwright.ts: `setupResourceBlocking()`, snapshotDiff in observe output
- src/drivers/block-patterns.ts
- src/drivers/types.ts: `ResourceBlockingOptions` interface
- src/types.ts: `snapshotDiff` field on `PageState`
- src/brain/index.ts
- src/config.ts: `resourceBlocking` config option
- src/cli.ts: `--block-analytics/images/media` flags
- src/index.ts
- tests/snapshot.test.ts
- bench/perf-benchmark.ts
Test plan
- `pnpm build` passes clean
🤖 Generated with Claude Code