feat: snapshot diffing, resource blocking, compact snapshots by drewstone · Pull Request #3 · tangle-network/browser-agent-driver

drewstone · 2026-03-02T19:21:08Z

Summary

Three performance optimizations targeting the two biggest latency sources: LLM token cost (snapshot size → decision time) and page load time (unnecessary network requests).

Snapshot Diffing - Tracks previous ARIA snapshot elements in AriaSnapshotHelper, computes structured diffs (added/removed/changed/unchanged), and injects a compact SNAPSHOT CHANGES section into LLM context when the diff is small (under 30% of the full snapshot). In the benchmark, 86% of turns get diffs; an unchanged page produces a 24-char diff vs a 45K-char full snapshot.
Resource Blocking - Configurable request interception via Playwright page.route() to abort analytics scripts, tracking pixels, images, and media. Ships 100+ analytics/ad domain patterns (Google Analytics, Segment, Mixpanel, etc.). New CLI flags: --block-analytics, --block-images, --block-media.
Compact Snapshot Format - Flat, ref-first, one-line-per-element format (@ref role "name" val="...") used in history compaction instead of stripping old snapshots entirely. Preserves element awareness from older turns while using ~50-60% fewer tokens.
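
A minimal sketch of the compact one-line-per-element format described in the last item, assuming refs are stored without the leading `@` and that names and values are quoted JSON-style. The `CompactElement` shape and the `toCompactLine` / `toCompactSnapshot` helpers are hypothetical and are not the `formatCompact()` shipped in `src/drivers/snapshot.ts`.

```typescript
// Hypothetical element shape; the real snapshot element type may carry more fields.
interface CompactElement {
  ref: string;
  role: string;
  label: string;
  value?: string;
}

// Render one element as: @ref role "name" val="..."
// The val part is emitted only when the element has a value.
function toCompactLine(el: CompactElement): string {
  const base = `@${el.ref} ${el.role} ${JSON.stringify(el.label)}`;
  return el.value === undefined ? base : `${base} val=${JSON.stringify(el.value)}`;
}

// Flat output: one line per element, no nesting or indentation.
function toCompactSnapshot(elements: CompactElement[]): string {
  return elements.map(toCompactLine).join("\n");
}

const sample: CompactElement[] = [
  { ref: "e5", role: "link", label: "new" },
  { ref: "e7", role: "textbox", label: "Search", value: "cats" },
];
console.log(toCompactSnapshot(sample));
// @e5 link "new"
// @e7 textbox "Search" val="cats"
```

Putting the ref first keeps the token the model needs for its next action at a fixed position on every line.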

Benchmark Results (7-turn Hacker News navigation task, gpt-4o)

| Metric | Value |
|---|---|
| Per-turn token growth (turn 1→2) | +14,127 tokens (full snapshot in history) |
| Per-turn token growth (turns 2+) | +6,295 tokens avg (compact history active) |
| Compact history savings | ~55% reduction in per-turn token growth |
| Snapshot diff coverage | 86% of turns |
| Diff size (unchanged page) | 24 chars vs 45,000 char snapshot |
| Success rate | 100% (all benchmark runs) |
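
The ~55% savings figure follows from the two per-turn growth rows above. A quick check of that arithmetic (the helper name is illustrative):

```typescript
// Percentage reduction going from `before` to `after`.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// +14,127 tokens (turn 1→2) versus +6,295 tokens average (turns 2+).
const growthReduction = percentReduction(14127, 6295);
console.log(growthReduction.toFixed(1) + "%"); // prints 55.4%
```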

Files Changed

| File | Change |
|---|---|
| `src/drivers/snapshot.ts` | Element tracking, `diffSnapshots()`, `getDiff()`, `formatCompact()` |
| `src/drivers/playwright.ts` | `setupResourceBlocking()`, snapshotDiff in observe output |
| `src/drivers/block-patterns.ts` | New file: 100+ analytics/ad domain patterns |
| `src/drivers/types.ts` | `ResourceBlockingOptions` interface |
| `src/types.ts` | `snapshotDiff` field on `PageState` |
| `src/brain/index.ts` | Compact history format, SNAPSHOT CHANGES injection |
| `src/config.ts` | `resourceBlocking` config option |
| `src/cli.ts` | `--block-analytics`, `--block-images`, `--block-media` flags |
| `src/index.ts` | New exports |
| `tests/snapshot.test.ts` | 13 new tests (diffing, compact format, blocking patterns) |
| `bench/perf-benchmark.ts` | New file: A/B benchmark with per-turn token analysis |

Test plan

pnpm build passes clean
All 93 unit tests pass (13 new)
Benchmark: 7-turn HN navigation completes with a 100% success rate
Snapshot diffs generated on 86% of turns
Compact history reduces per-turn token growth by ~55%
Resource blocking patterns match expected analytics URLs and don't false-positive on regular URLs
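
The last check above (match analytics URLs without false positives on regular URLs) can be sketched as a pure predicate. The option names, the three-host list, and `shouldBlock` are assumptions made for illustration; the shipped `ResourceBlockingOptions` and the 100+ patterns in `src/drivers/block-patterns.ts` may look different.

```typescript
// Hypothetical options mirroring the three CLI flags.
interface BlockingOptions {
  blockAnalytics?: boolean;
  blockImages?: boolean;
  blockMedia?: boolean;
}

// Tiny illustrative subset, not the real pattern list.
const ANALYTICS_HOSTS = ["google-analytics.com", "segment.io", "mixpanel.com"];

function isAnalyticsUrl(rawUrl: string): boolean {
  let host: string;
  try {
    host = new URL(rawUrl).hostname;
  } catch {
    return false; // not a parseable absolute URL, so never treated as analytics
  }
  // Match the domain itself or a subdomain, never a substring elsewhere in the URL.
  return ANALYTICS_HOSTS.some((d) => host === d || host.endsWith("." + d));
}

// resourceType uses Playwright's request.resourceType() vocabulary ("image", "media", ...).
function shouldBlock(rawUrl: string, resourceType: string, opts: BlockingOptions): boolean {
  if (opts.blockImages && resourceType === "image") return true;
  if (opts.blockMedia && resourceType === "media") return true;
  return Boolean(opts.blockAnalytics) && isAnalyticsUrl(rawUrl);
}

console.log(shouldBlock("https://www.google-analytics.com/collect", "xhr", { blockAnalytics: true })); // true
console.log(shouldBlock("https://example.com/blog/google-analytics.com-guide", "document", { blockAnalytics: true })); // false
```

With Playwright, a predicate like this would typically sit inside a `page.route('**/*', ...)` handler that calls `route.abort()` when it returns true and `route.continue()` otherwise. Matching on the parsed hostname rather than the raw string is what avoids the false positive in the second example.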

🤖 Generated with Claude Code

Three performance optimizations targeting LLM token cost and page load time:

1. Snapshot Diffing - Track previous ARIA snapshot elements and compute structured diffs (added/removed/changed/unchanged). Inject compact SNAPSHOT CHANGES section into LLM context when diff is small (<30% of full snapshot). 86% of turns get diffs; unchanged pages produce 24-char diffs vs 45K-char snapshots.
2. Resource Blocking - Configurable request interception via Playwright page.route() to abort analytics, tracking, images, and media. 100+ analytics/ad domains (Google Analytics, Segment, Mixpanel, etc). CLI flags: --block-analytics, --block-images, --block-media.
3. Compact Snapshot Format - Flat, ref-first, one-line-per-element format used in history compaction instead of stripping old snapshots entirely. Preserves element awareness from older turns. Benchmark shows 55% reduction in per-turn token growth after compaction kicks in.

Includes benchmark suite (bench/perf-benchmark.ts) and 13 new unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
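
A rough sketch of the diffing step and the "<30% of full snapshot" gate from the commit message above. It assumes element refs are stable between consecutive snapshots and compares role, name, and value; the `TrackedElement` shape and both function bodies are illustrative and are not the code in `src/drivers/snapshot.ts`.

```typescript
// Hypothetical element shape, keyed by ref.
interface TrackedElement {
  ref: string;
  role: string;
  label: string;
  value?: string;
}

interface ElementDiff {
  added: TrackedElement[];
  removed: TrackedElement[];
  changed: TrackedElement[];
  unchangedCount: number;
}

// Bucket every element of the new snapshot against the previous one.
function diffElements(prev: TrackedElement[], next: TrackedElement[]): ElementDiff {
  const prevByRef = new Map<string, TrackedElement>();
  for (const el of prev) prevByRef.set(el.ref, el);

  const result: ElementDiff = { added: [], removed: [], changed: [], unchangedCount: 0 };
  const seen = new Set<string>();
  for (const el of next) {
    seen.add(el.ref);
    const before = prevByRef.get(el.ref);
    if (before === undefined) {
      result.added.push(el);
    } else if (before.role !== el.role || before.label !== el.label || before.value !== el.value) {
      result.changed.push(el);
    } else {
      result.unchangedCount += 1;
    }
  }
  // Anything in the old snapshot that did not reappear was removed.
  for (const el of prev) {
    if (!seen.has(el.ref)) result.removed.push(el);
  }
  return result;
}

// Use the diff only when its text is small relative to the full snapshot.
// The 0.3 default mirrors the <30% threshold; comparing character lengths is an assumption.
function isDiffSmallEnough(diffText: string, fullSnapshot: string, maxRatio = 0.3): boolean {
  return fullSnapshot.length > 0 && diffText.length < fullSnapshot.length * maxRatio;
}
```

When the gate fails (for example after a navigation that replaces most elements), the caller would fall back to sending the full snapshot.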


* feat(runner): Gen 10: DOM index extraction + bigger snapshot + cost cap

Three coordinated changes that ship together as Gen 10:

A) extractWithIndex action: pick-by-content over pick-by-selector

New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'} that returns a numbered, text-rich list of every visible element matching the query. The agent picks elements by index in the next turn. This is the architectural fix Gen 9 was missing: instead of asking the LLM to write a precise CSS selector for data it hasn't seen yet (the failure mode on npm/mdn/python-docs), the wide query finds candidates and the response shows actual textContent so the LLM can pick by content match.

Wired into:
- src/types.ts (ExtractWithIndexAction type, added to Action union)
- src/brain/index.ts (validateAction parser, system prompt, planner prompt, data-extraction rule #25 explaining when to prefer extractWithIndex over runScript on extraction tasks)
- src/drivers/extract-with-index.ts (browser-side query helper, returns {index, tag, text, attributes, selector} for each visible match, capped at 80 matches)
- src/drivers/playwright.ts (driver.execute dispatch, returns formatted output as data so executePlan can capture it like runScript)
- src/runner/runner.ts (per-action loop handler with feedback injection, executePlan capture into lastExtractOutput, plan-ends-with-extract fall-through to per-action loop with the match list as REPLAN context)
- src/supervisor/policy.ts (action signature for stuck-detection)

C) Bigger snapshot + content-line preservation

src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/paragraph content lines (which previously got dropped as "decorative" by the interactive-only filter). These are exactly the lines that carry the data agents need on MDN/Python docs/W3C spec/arxiv pages.

Budgets raised:
- Default budgetSnapshot cap: 16k → 24k chars
- Decide() new-page snapshot: 16k → 24k
- Planner snapshot: 12k → 24k (planner is the most important caller for extraction tasks because it writes the runScript on the first observation)

Same-page snapshot stays at 8k (after the LLM has already seen the page).

Empirical verification: probed Playwright's locator.ariaSnapshot() output on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl> and confirmed Playwright DOES emit `term`/`definition`/`code` lines with text content. The bug was in the budgetSnapshot filter dropping them, not in the snapshot pipeline missing them.

Cost cap (mandatory safety net for any iteration-based mechanism)

src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default 100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the top of every loop iteration (before the next LLM call) and returns `cost_cap_exceeded` if exceeded.

Calibration:
- Gen 8 real-web mean: ~6k tokens (well under 100k)
- Tier 1 form-multistep full-evidence: ~60k tokens (within cap + 40k headroom)
- Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)

100k = above any normal case I've measured, well below any death spiral. Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4: $0.32/173k tokens) within 5–8 turns of futility instead of running for the full case timeout.
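
The cost cap described above can be sketched as an accumulator plus a gate checked before each LLM call. The class, the `>=` comparison, and the simulated loop are assumptions for illustration; the real logic lives in `src/run-state.ts` and `src/runner/runner.ts` and may differ (for example in how `Scenario.tokenBudget` and `BAD_TOKEN_BUDGET` are resolved, which is not modelled here).

```typescript
// Hypothetical stand-in for the run-state accumulator.
class TokenBudgetSketch {
  private totalTokensUsed = 0;
  private readonly tokenBudget: number;

  constructor(tokenBudget: number = 100_000) {
    this.tokenBudget = tokenBudget;
  }

  record(tokens: number): void {
    this.totalTokensUsed += Math.max(0, tokens);
  }

  // Assumption: the budget counts as exhausted once usage reaches the cap.
  isExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}

type LoopOutcome = { result: "completed" | "cost_cap_exceeded"; turns: number };

// Simulated agent loop: each entry in turnCosts stands in for one LLM call.
function runWithCap(turnCosts: number[], budget: TokenBudgetSketch): LoopOutcome {
  let turns = 0;
  for (const cost of turnCosts) {
    // Gate at the top of the iteration, before spending more tokens.
    if (budget.isExhausted()) return { result: "cost_cap_exceeded", turns };
    budget.record(cost);
    turns += 1;
  }
  return { result: "completed", turns };
}

// A run that burns 25k tokens per turn is stopped before its fifth call.
console.log(runWithCap([25_000, 25_000, 25_000, 25_000, 25_000, 25_000], new TokenBudgetSketch()));
```

Checking before the call rather than after means a run can overshoot the cap by at most one turn's worth of tokens.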
Tests: 18 new (981 total, +18 from baseline)
- tests/budget-snapshot.test.ts: 6 (filter preservation, content lines, priority bucket, paragraph handling)
- tests/extract-with-index.test.ts: 13 (browser-side query, contains filter, hidden element skipping, invalid selector graceful failure, stable selector building, formatter, parser via Brain.parse)
- tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
- tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation with match list, cost cap exhaustion)

Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS, Tier1 deterministic gate PASSED.

Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md

* feat(runner): isMeaningfulRunScriptOutput helper + runScript-empty fall-through

Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the pass rate AND introduced cost regressions on previously-passing tasks (reddit death-spirals at $0.25-$0.32 / 130-173k tokens per case in 5-rep validation).

In Gen 10 the same code is safe and useful for two reasons:
1. Cost cap (100k tokens, default) bounds any death spiral
2. Per-action loop has extractWithIndex available: when the deviation reason mentions "runScript returned no meaningful output", the LLM can respond with extractWithIndex (per data-extraction rule #25) instead of retrying the same wrong selector

What this brings into Gen 10:

isMeaningfulRunScriptOutput() helper:
- Detects null / undefined / empty / whitespace
- Detects literal "null" / "undefined" / "" / ''
- Detects empty JSON shells {} / []
- Detects {x: null} / partial-extraction patterns (any null = retry)
- Detects placeholder patterns via hasPlaceholderPattern

executePlan auto-complete branch hardened:
- Old: auto-complete fires whenever lastRunScriptOutput is truthy
- New: auto-complete fires only when isMeaningfulRunScriptOutput is true
- Catches the literal "null" string bug that previously slipped through

executePlan runScript-empty fall-through:
- When the last step is runScript and the output isn't meaningful, return deviated with a reason that names the failure AND points the per-action LLM at extractWithIndex (the Gen 10 recovery tool)
- This is the path that did NOT work in Gen 9 alone, but in Gen 10 the per-action loop has extractWithIndex available AND the cost cap bounds runaway recovery loops

Tests cherry-picked: 12 (all pass)
- 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
- 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on {x:null}, declines on literal "null", positive control auto-completes on real values)

Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9 runScript-empty fall-through are mutually exclusive (different last-step types). Both kept, ordered Gen 10 first then Gen 9.

Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.
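
A minimal sketch of the checks listed for isMeaningfulRunScriptOutput() in the commit message above. It is not the helper from the commit: the function name is hypothetical, it assumes the output arrives as a string, it only inspects the top level of valid JSON, and it leaves out the hasPlaceholderPattern check because that helper's rules are not described here.

```typescript
// Literal strings that look like output but carry no data.
const EMPTY_LITERALS = new Set(["null", "undefined", '""', "''"]);

// True for null, empty arrays, empty objects, and objects with any null field.
function isEmptyOrPartial(parsed: unknown): boolean {
  if (parsed === null) return true;
  if (Array.isArray(parsed)) return parsed.length === 0;
  if (typeof parsed === "object") {
    const obj = parsed as Record<string, unknown>;
    const keys = Object.keys(obj);
    return keys.length === 0 || keys.some((k) => obj[k] === null);
  }
  return false;
}

function looksMeaningful(output: string | null | undefined): boolean {
  if (output === null || output === undefined) return false;
  const text = output.trim();
  if (text === "" || EMPTY_LITERALS.has(text)) return false;
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return true; // non-JSON text is treated as real output in this sketch
  }
  return !isEmptyOrPartial(parsed);
}

console.log(looksMeaningful("null")); // false
console.log(looksMeaningful('{"year": null}')); // false
console.log(looksMeaningful('{"year": 1815}')); // true
```

Gating auto-complete on a check like this, rather than on truthiness, is what stops the string "null" from being accepted as a finished extraction.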
* docs(gen10): changeset + 5-rep validation results

5-rep matched same-day validation per CLAUDE.md rules #3 + #6:
- Gen 8 same-day 5-rep: 29/50 = 58%
- Gen 10 5-rep: 37/50 = 74%
- Delta: +8 tasks (+16 percentage points)

Architectural wins (consistent across 3-rep AND 5-rep, same-day):
- npm-package-downloads: 0/5 -> 5/5 (+5) extractWithIndex
- w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
- github-pr-count: 4/5 -> 5/5 (+1)
- stackoverflow-answer-count: 2/5 -> 3/5 (+1)

Cost analysis (matched same-day):
- Raw cost: +59% ($0.017 -> $0.027)
- Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
- Death spirals: 0 (cost cap held; peak run $0.16)
- Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)

Failure modes that remain (Gen 10.1 candidates):
- Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}. Same in Gen 8, not a regression. Fixable via prompt, not architecture.
- Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
- mdn/arxiv variance within Wilson 95% CI overlap.

Files:
- .changeset/gen10-dom-index-extraction.md (honest writeup)
- .evolve/progress.md (round 2 result + per-task table)
- .evolve/current.json (status: round2_complete_promote)
- .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
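
For readers unfamiliar with the "Wilson 95% CI overlap" wording above, here is a sketch of the standard Wilson score interval. It is applied to the stackoverflow-answer-count result (2/5 vs 3/5) purely as an illustration, since the mdn/arxiv counts are not listed; the function name and the overlap check are not taken from the repository.

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
function wilsonInterval(passes: number, trials: number, z = 1.96): { lower: number; upper: number } {
  const p = passes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { lower: center - half, upper: center + half };
}

const gen8Task = wilsonInterval(2, 5);
const gen10Task = wilsonInterval(3, 5);

// Two intervals overlap when each one starts before the other ends.
const intervalsOverlap = gen8Task.lower <= gen10Task.upper && gen10Task.lower <= gen8Task.upper;
console.log(gen8Task, gen10Task, intervalsOverlap);
```

At 5 reps the intervals are wide, so a one-task swing on a single benchmark sits inside the overlap and cannot be separated from run-to-run noise.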

drewstone merged commit 5416ac3 into main Mar 2, 2026
3 checks passed

github-actions bot mentioned this pull request Apr 9, 2026

Release: version packages #61

Open

drewstone merged 1 commit into main from feat/performance-snapshot-diffing-resource-blocking
