feat: snapshot diffing, resource blocking, compact snapshots#3

Merged
drewstone merged 1 commit into main from feat/performance-snapshot-diffing-resource-blocking
Mar 2, 2026

@drewstone
Contributor

Summary

Three performance optimizations targeting the two biggest latency sources: LLM token cost (snapshot size → decision time) and page load time (unnecessary network requests).

  • Snapshot Diffing — Tracks previous ARIA snapshot elements in AriaSnapshotHelper, computes structured diffs (added/removed/changed/unchanged), and injects a compact SNAPSHOT CHANGES section into LLM context when the diff is small. 86% of turns get diffs; unchanged pages produce 24-char diffs vs 45K-char full snapshots.

  • Resource Blocking — Configurable request interception via Playwright page.route() to abort analytics scripts, tracking pixels, images, and media. Ships 100+ analytics/ad domain patterns. New CLI flags: --block-analytics, --block-images, --block-media.

  • Compact Snapshot Format — Flat, ref-first, one-line-per-element format (@ref role "name" val="...") used in history compaction instead of stripping old snapshots entirely. Preserves element awareness from older turns while using ~50-60% fewer tokens.
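The diffing step described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/drivers/snapshot.ts: the `SnapshotElement` shape, the `diffSnapshotsSketch`, `formatDiffSketch`, and `shouldInjectDiff` names, the `+`/`-`/`~` line prefixes, and the "(no changes)" text are all assumptions. It also assumes element refs stay stable between two snapshots of the same page. The 30% threshold is the one quoted in the commit message.

```typescript
// Hypothetical element shape; the PR's real tracking type may differ.
interface SnapshotElement {
  ref: string; // element reference, assumed stable across snapshots
  role: string; // ARIA role
  name: string; // accessible name
  value?: string;
}

interface SnapshotDiff {
  added: SnapshotElement[];
  removed: SnapshotElement[];
  changed: { before: SnapshotElement; after: SnapshotElement }[];
  unchanged: number;
}

// Classify every element of the new snapshot against the previous one, keyed by ref.
function diffSnapshotsSketch(prev: SnapshotElement[], next: SnapshotElement[]): SnapshotDiff {
  const prevByRef = new Map<string, SnapshotElement>();
  for (const el of prev) prevByRef.set(el.ref, el);

  const diff: SnapshotDiff = { added: [], removed: [], changed: [], unchanged: 0 };
  const seen = new Set<string>();
  for (const el of next) {
    seen.add(el.ref);
    const before = prevByRef.get(el.ref);
    if (before === undefined) {
      diff.added.push(el);
    } else if (before.role !== el.role || before.name !== el.name || before.value !== el.value) {
      diff.changed.push({ before, after: el });
    } else {
      diff.unchanged += 1;
    }
  }
  // Anything in the previous snapshot that was not seen again has been removed.
  for (const el of prev) {
    if (!seen.has(el.ref)) diff.removed.push(el);
  }
  return diff;
}

function describeElement(el: SnapshotElement): string {
  const val = el.value !== undefined ? ` val="${el.value}"` : "";
  return `@${el.ref} ${el.role} "${el.name}"${val}`;
}

// Render the diff as short lines; the exact wording is an assumption.
function formatDiffSketch(diff: SnapshotDiff): string {
  const lines: string[] = [];
  for (const el of diff.added) lines.push(`+ ${describeElement(el)}`);
  for (const el of diff.removed) lines.push(`- ${describeElement(el)}`);
  for (const c of diff.changed) lines.push(`~ ${describeElement(c.after)}`);
  return lines.length > 0 ? lines.join("\n") : "(no changes)";
}

// Inject the diff only when it is small relative to the full snapshot.
function shouldInjectDiff(diffText: string, fullSnapshot: string, maxRatio: number = 0.3): boolean {
  return fullSnapshot.length > 0 && diffText.length < fullSnapshot.length * maxRatio;
}
```

When the diff fails the size check (for example after a full navigation), the caller would presumably fall back to sending the full snapshot.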

Benchmark Results (7-turn Hacker News navigation task, gpt-4o)

| Metric | Value |
| --- | --- |
| Per-turn token growth (turn 1→2) | +14,127 tokens (full snapshot in history) |
| Per-turn token growth (turns 2+) | +6,295 tokens avg (compact history active) |
| Compact history savings | ~55% reduction in per-turn token growth |
| Snapshot diff coverage | 86% of turns |
| Diff size (unchanged page) | 24 chars vs 45,000 char snapshot |
| Success rate | 100% (all benchmark runs) |
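
The ~55% figure can be reproduced from the two growth numbers above. This is a small arithmetic check, assuming the reduction is measured against the turn 1→2 growth as the baseline (the PR does not spell out the baseline, so that reading is an assumption):

```typescript
// Per-turn token growth figures taken from the benchmark results above.
const firstTurnGrowth = 14127; // turn 1→2, full snapshot kept in history
const laterTurnGrowth = 6295; // average for turns 2+, compact history active

// Relative reduction in per-turn growth, rounded to one decimal place.
const growthReduction = (firstTurnGrowth - laterTurnGrowth) / firstTurnGrowth;
const growthReductionPercent = Math.round(growthReduction * 1000) / 10; // 55.4

// Size of the unchanged-page diff relative to the full snapshot.
const diffToSnapshotRatio = 24 / 45000; // roughly 0.05% of the snapshot
```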

Files Changed

| File | Change |
| --- | --- |
| src/drivers/snapshot.ts | Element tracking, diffSnapshots(), getDiff(), formatCompact() |
| src/drivers/playwright.ts | setupResourceBlocking(), snapshotDiff in observe output |
| src/drivers/block-patterns.ts | New: 100+ analytics/ad domain patterns |
| src/drivers/types.ts | ResourceBlockingOptions interface |
| src/types.ts | snapshotDiff field on PageState |
| src/brain/index.ts | Compact history format, SNAPSHOT CHANGES injection |
| src/config.ts | resourceBlocking config option |
| src/cli.ts | --block-analytics/images/media flags |
| src/index.ts | New exports |
| tests/snapshot.test.ts | 13 new tests (diffing, compact format, blocking patterns) |
| bench/perf-benchmark.ts | New: A/B benchmark with per-turn token analysis |

Test plan

  • pnpm build passes clean
  • All 93 unit tests pass (13 new)
  • Benchmark: 7-turn HN navigation completes 100% success rate
  • Snapshot diffs generated on 86% of turns
  • Compact history reduces per-turn token growth by ~55%
  • Resource blocking patterns match expected analytics URLs and don't false-positive on regular URLs
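
The last check (analytics URLs match, regular URLs do not) depends on matching by hostname rather than by substring. A minimal sketch of that idea follows. The option field names, the three sample domains, and the `RouteLike`/`RequestLike` stand-in types are assumptions for illustration; the real list lives in src/drivers/block-patterns.ts and the real options in `ResourceBlockingOptions`. The `"image"` and `"media"` strings are Playwright resource types.

```typescript
// Hypothetical option names mirroring the CLI flags; the real interface may differ.
interface BlockingOptionsSketch {
  blockAnalytics?: boolean;
  blockImages?: boolean;
  blockMedia?: boolean;
}

// Tiny illustrative sample; the PR ships 100+ patterns.
const SAMPLE_ANALYTICS_DOMAINS = ["google-analytics.com", "segment.com", "mixpanel.com"];

// Match on the hostname (exact or subdomain) so that a regular URL that merely
// contains an analytics domain in its path is not blocked.
function isAnalyticsUrl(url: string): boolean {
  let host: string;
  try {
    host = new URL(url).hostname;
  } catch {
    return false; // not a parseable absolute URL
  }
  return SAMPLE_ANALYTICS_DOMAINS.some((d) => host === d || host.endsWith("." + d));
}

function shouldBlock(url: string, resourceType: string, opts: BlockingOptionsSketch): boolean {
  if (opts.blockImages && resourceType === "image") return true;
  if (opts.blockMedia && resourceType === "media") return true;
  if (opts.blockAnalytics && isAnalyticsUrl(url)) return true;
  return false;
}

// Minimal structural stand-ins for the parts of Playwright's Route and Request used here.
interface RequestLike {
  url(): string;
  resourceType(): string;
}
interface RouteLike {
  request(): RequestLike;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Abort blocked requests and let everything else through.
async function handleRoute(route: RouteLike, opts: BlockingOptionsSketch): Promise<void> {
  const req = route.request();
  if (shouldBlock(req.url(), req.resourceType(), opts)) {
    await route.abort();
  } else {
    await route.continue();
  }
}
```

With Playwright, this handler would be registered with something like `await page.route("**/*", (route) => handleRoute(route, opts))`.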

🤖 Generated with Claude Code

Three performance optimizations targeting LLM token cost and page load time:

1. Snapshot Diffing - Track previous ARIA snapshot elements and compute
   structured diffs (added/removed/changed/unchanged). Inject compact
   SNAPSHOT CHANGES section into LLM context when diff is small (<30%
   of full snapshot). 86% of turns get diffs; unchanged pages produce
   24-char diffs vs 45K-char snapshots.

2. Resource Blocking - Configurable request interception via Playwright
   page.route() to abort analytics, tracking, images, and media.
   100+ analytics/ad domains (Google Analytics, Segment, Mixpanel, etc).
   CLI flags: --block-analytics, --block-images, --block-media.

3. Compact Snapshot Format - Flat, ref-first, one-line-per-element format
   used in history compaction instead of stripping old snapshots entirely.
   Preserves element awareness from older turns. Benchmark shows 55%
   reduction in per-turn token growth after compaction kicks in.

Includes benchmark suite (bench/perf-benchmark.ts) and 13 new unit tests.
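
The compact, ref-first line format (`@ref role "name" val="..."`) can be illustrated with a short sketch. The `CompactElement` shape and the function names are assumptions, as is the use of JSON string escaping for quotes inside names and values; the PR's own `formatCompact()` may handle these differently.

```typescript
// Hypothetical element shape for illustration only.
interface CompactElement {
  ref: string;
  role: string;
  name: string;
  value?: string;
}

// One line per element, ref first. JSON.stringify supplies the surrounding
// quotes and escapes any embedded quotes or newlines (an assumption).
function formatCompactLine(el: CompactElement): string {
  const base = `@${el.ref} ${el.role} ${JSON.stringify(el.name)}`;
  return el.value !== undefined ? `${base} val=${JSON.stringify(el.value)}` : base;
}

// Flat output: no indentation and no nesting, so older turns stay cheap to keep in history.
function formatCompactSketch(elements: CompactElement[]): string {
  return elements.map(formatCompactLine).join("\n");
}
```

For example, a search box with ref `e5` holding the text `hn` would render as `@e5 textbox "Search" val="hn"`.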

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

drewstone merged commit 5416ac3 into main on Mar 2, 2026
3 checks passed
drewstone added a commit that referenced this pull request Apr 9, 2026
5-rep matched same-day validation per CLAUDE.md rules #3 + #6:

  Gen 8 same-day 5-rep: 29/50 = 58%
  Gen 10 5-rep:         37/50 = 74%
  Delta:                +8 tasks (+16 percentage points)

Architectural wins (consistent across 3-rep AND 5-rep, same-day):
  - npm-package-downloads:    0/5 -> 5/5 (+5) extractWithIndex
  - w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
  - github-pr-count:          4/5 -> 5/5 (+1)
  - stackoverflow-answer-count: 2/5 -> 3/5 (+1)

Cost analysis (matched same-day):
  - Raw cost: +59% ($0.017 -> $0.027)
  - Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
  - Death spirals: 0 (cost cap held; peak run $0.16)
  - Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)

Failure modes that remain (Gen 10.1 candidates):
  - Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}.
    Same in Gen 8, not a regression. Fixable via prompt, not architecture.
  - Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
  - mdn/arxiv variance within Wilson 95% CI overlap.
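
For reference, the Wilson 95% interval mentioned above can be computed as in the sketch below. The function names are illustrative, and the example counts in the usage note are hypothetical (the per-rep mdn/arxiv counts are not listed here).

```typescript
// Wilson score interval for a binomial proportion; z = 1.96 gives roughly 95% coverage.
function wilsonInterval(
  successes: number,
  trials: number,
  z: number = 1.96
): { low: number; high: number } {
  if (trials <= 0) throw new Error("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { low: center - margin, high: center + margin };
}

// Two intervals overlap when neither lies entirely above the other.
function intervalsOverlap(
  a: { low: number; high: number },
  b: { low: number; high: number }
): boolean {
  return a.low <= b.high && b.low <= a.high;
}
```

With only 5 reps per task the intervals are wide: a hypothetical 3/5 versus 4/5 gives roughly [0.23, 0.88] versus [0.38, 0.96], which overlap, so a one-rep swing at that sample size cannot be told apart from noise.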

Files:
  - .changeset/gen10-dom-index-extraction.md (honest writeup)
  - .evolve/progress.md (round 2 result + per-task table)
  - .evolve/current.json (status: round2_complete_promote)
  - .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
drewstone added a commit that referenced this pull request Apr 9, 2026
* feat(runner): Gen 10 — DOM index extraction + bigger snapshot + cost cap

Three coordinated changes that ship together as Gen 10:

A) extractWithIndex action — pick-by-content over pick-by-selector

   New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}
   that returns a numbered, text-rich list of every visible element matching
   the query. The agent picks elements by index in the next turn.

   This is the architectural fix Gen 9 was missing: instead of asking the LLM
   to write a precise CSS selector for data it hasn't seen yet (the failure
   mode on npm/mdn/python-docs), the wide query finds candidates and the
   response shows actual textContent so the LLM can pick by content match.

   Wired into:
   - src/types.ts (ExtractWithIndexAction type, added to Action union)
   - src/brain/index.ts (validateAction parser, system prompt, planner prompt,
     data-extraction rule #25 explaining when to prefer extractWithIndex over
     runScript on extraction tasks)
   - src/drivers/extract-with-index.ts (browser-side query helper, returns
     {index, tag, text, attributes, selector} for each visible match, capped
     at 80 matches)
   - src/drivers/playwright.ts (driver.execute dispatch, returns formatted
     output as data so executePlan can capture it like runScript)
   - src/runner/runner.ts (per-action loop handler with feedback injection,
     executePlan capture into lastExtractOutput, plan-ends-with-extract
     fall-through to per-action loop with the match list as REPLAN context)
   - src/supervisor/policy.ts (action signature for stuck-detection)
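
   A rough sketch of the pick-by-content idea, kept free of DOM calls so it runs anywhere. The action fields mirror the example action above; everything else (the `Candidate` shape, the case-insensitive `contains` match, the `[index] <tag> text` output format, the function names) is an assumption, and the real helper in src/drivers/extract-with-index.ts also returns attributes. Candidates are assumed to be already collected by running the wide `query` selector in the page.

```typescript
// Fields taken from the example action in the commit message; optionality of `contains` is assumed.
interface ExtractWithIndexActionSketch {
  action: "extractWithIndex";
  query: string;
  contains?: string;
}

// Hypothetical record for an element already matched by `query` in the browser.
interface Candidate {
  tag: string;
  text: string;
  visible: boolean;
  selector: string;
}

interface IndexedMatch {
  index: number;
  tag: string;
  text: string;
  selector: string;
}

const MAX_MATCHES = 80; // cap stated in the commit message

// Keep visible candidates whose text contains the needle, numbering them in order.
function indexMatches(candidates: Candidate[], action: ExtractWithIndexActionSketch): IndexedMatch[] {
  const needle = action.contains ? action.contains.toLowerCase() : undefined;
  const matches: IndexedMatch[] = [];
  for (const c of candidates) {
    if (matches.length >= MAX_MATCHES) break;
    if (!c.visible) continue;
    const text = c.text.trim();
    if (needle !== undefined && text.toLowerCase().indexOf(needle) === -1) continue;
    matches.push({ index: matches.length, tag: c.tag, text, selector: c.selector });
  }
  return matches;
}

// Numbered, text-rich list that the LLM can pick from by index on the next turn.
function formatMatches(matches: IndexedMatch[]): string {
  return matches.map((m) => `[${m.index}] <${m.tag}> ${m.text}`).join("\n");
}
```

   The point of the design is that the LLM sees actual text content before committing to an element, so it never has to guess a precise selector up front.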

C) Bigger snapshot + content-line preservation

   src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/
   paragraph content lines (which previously got dropped as "decorative" by
   the interactive-only filter). These are exactly the lines that carry the
   data agents need on MDN/Python docs/W3C spec/arxiv pages.

   Budgets raised:
   - Default budgetSnapshot cap: 16k → 24k chars
   - Decide() new-page snapshot: 16k → 24k
   - Planner snapshot: 12k → 24k (planner is the most important caller for
     extraction tasks because it writes the runScript on the first observation)

   Same-page snapshot stays at 8k (after the LLM has already seen the page).

   Empirical verification: probed Playwright's locator.ariaSnapshot() output
   on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl>
   — confirmed Playwright DOES emit `term`/`definition`/`code` lines with text
   content. The bug was in the budgetSnapshot filter dropping them, not in
   the snapshot pipeline missing them.
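
   The filter change can be sketched as below. This is a deliberately simplified illustration, not the real budgetSnapshot: the role lists, the line-parsing regex, and the first-come truncation are assumptions (the tests mention a priority bucket that this sketch does not model). Only the content roles and the 24k default come from the text above.

```typescript
// Illustrative subset of interactive roles (assumption).
const INTERACTIVE_ROLES = ["button", "link", "textbox", "checkbox", "combobox"];
// Content roles named in the commit message as newly preserved.
const CONTENT_ROLES = ["term", "definition", "code", "pre", "paragraph"];
const KEPT_ROLES = new Set<string>(INTERACTIVE_ROLES.concat(CONTENT_ROLES));

// Extract the role from a snapshot line such as `- link "Docs"` or `  - code: text`.
function lineRole(line: string): string | null {
  const m = /^\s*-\s+([a-z]+)/i.exec(line);
  return m ? m[1].toLowerCase() : null;
}

// Keep interactive and content lines, in order, until the character budget is reached.
function budgetSnapshotSketch(snapshot: string, maxChars: number = 24000): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of snapshot.split("\n")) {
    const role = lineRole(line);
    if (role === null || !KEPT_ROLES.has(role)) continue;
    const cost = line.length + 1; // +1 for the newline
    if (used + cost > maxChars) break;
    kept.push(line);
    used += cost;
  }
  return kept.join("\n");
}
```

   Under an interactive-only filter the `term`, `code`, and `definition` lines would be dropped, which matches the failure described above.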

Cost cap (mandatory safety net for any iteration-based mechanism)

   src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default
   100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and
   isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the
   top of every loop iteration (before the next LLM call) and returns
   `cost_cap_exceeded` if exceeded.

   Calibration:
   - Gen 8 real-web mean: ~6k tokens (well under 100k)
   - Tier 1 form-multistep full-evidence: ~60k tokens (within cap + 40k headroom)
   - Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)

   100k = above any normal case I've measured, well below any death spiral.
   Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4:
   $0.32/173k tokens) within 5–8 turns of futility instead of running for
   the full case timeout.
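
   A minimal sketch of the gate, assuming a simple accumulator. The class and function names, the `>=` comparison, and the simulated per-turn costs are illustrative assumptions; in the real code the budget can also come from Scenario.tokenBudget or the BAD_TOKEN_BUDGET env var, which is left to the caller here.

```typescript
const DEFAULT_TOKEN_BUDGET = 100_000; // default stated in the commit message

// Hypothetical stand-in for the token accounting added to src/run-state.ts.
class RunStateSketch {
  totalTokensUsed = 0;
  tokenBudget: number;

  constructor(tokenBudget: number = DEFAULT_TOKEN_BUDGET) {
    this.tokenBudget = tokenBudget;
  }

  recordUsage(tokens: number): void {
    this.totalTokensUsed += tokens;
  }

  // Treating "used >= budget" as exhausted is an assumption.
  get isTokenBudgetExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}

type LoopResult = "turns_finished" | "cost_cap_exceeded";

// Simulated runner loop: the gate is checked at the top of each iteration,
// before the (simulated) LLM call that would spend more tokens.
function runLoopSketch(
  state: RunStateSketch,
  turnCosts: number[]
): { result: LoopResult; turnsRun: number } {
  let turnsRun = 0;
  for (const cost of turnCosts) {
    if (state.isTokenBudgetExhausted) return { result: "cost_cap_exceeded", turnsRun };
    state.recordUsage(cost); // stands in for one LLM call
    turnsRun += 1;
  }
  return { result: "turns_finished", turnsRun };
}
```

   Because the check runs before the call rather than after it, a run can overshoot the budget by at most one turn's worth of tokens.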

Tests: 18 new (981 total, +18 from baseline)
   - tests/budget-snapshot.test.ts: 6 (filter preservation, content lines,
     priority bucket, paragraph handling)
   - tests/extract-with-index.test.ts: 13 (browser-side query, contains
     filter, hidden element skipping, invalid selector graceful failure,
     stable selector building, formatter, parser via Brain.parse)
   - tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
   - tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation
     with match list, cost cap exhaustion)

Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS,
Tier1 deterministic gate PASSED.

Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md

* feat(runner): isMeaningfulRunScriptOutput helper + runScript-empty fall-through

Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original
Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the
pass rate AND introduced cost regressions on previously-passing tasks (reddit
death-spirals at $0.25-$0.32 / 132k-173k tokens per case in 5-rep validation).

In Gen 10 the same code is safe and useful for two reasons:
  1. Cost cap (100k tokens, default) bounds any death spiral
  2. Per-action loop has extractWithIndex available — when the deviation
     reason mentions "runScript returned no meaningful output", the LLM can
     respond with extractWithIndex (per data-extraction rule #25) instead
     of retrying the same wrong selector

What this brings into Gen 10:

isMeaningfulRunScriptOutput() helper:
  - Detects null / undefined / empty / whitespace
  - Detects literal "null" / "undefined" / "" / ''
  - Detects empty JSON shells {} / []
  - Detects {x: null} / partial-extraction patterns (any null = retry)
  - Detects placeholder patterns via hasPlaceholderPattern
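
  A sketch of checks along these lines, under stated assumptions: the function names are illustrative, the placeholder regex is invented (the real hasPlaceholderPattern is not shown here), and only top-level null values are inspected.

```typescript
// Invented stand-in for hasPlaceholderPattern; the real patterns are not shown in this PR.
function hasPlaceholderPatternSketch(text: string): boolean {
  return /\b(todo|placeholder|tbd)\b|lorem ipsum/i.test(text);
}

// Strings that look like a value but carry no data.
const EMPTY_LITERALS = ["null", "undefined", '""', "''", "{}", "[]"];

function isMeaningfulOutputSketch(output: unknown): boolean {
  if (output === null || output === undefined) return false;

  // Normalize to trimmed text; non-strings are serialized first.
  const text = (typeof output === "string" ? output : String(JSON.stringify(output))).trim();
  if (text === "") return false;
  if (EMPTY_LITERALS.indexOf(text) !== -1) return false;

  // If the text is JSON, reject empty shells and any top-level null (partial extraction).
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    parsed = undefined; // not JSON, treat as plain text
  }
  if (parsed !== null && typeof parsed === "object") {
    const record = parsed as Record<string, unknown>;
    const values = Object.keys(record).map((k) => record[k]);
    if (values.length === 0) return false;
    if (values.some((v) => v === null)) return false;
  }

  return !hasPlaceholderPatternSketch(text);
}
```

  The key difference from a plain truthiness check is that the string "null" is truthy in JavaScript, which is how it would slip past an `if (output)` guard.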

executePlan auto-complete branch hardened:
  - Old: auto-complete fires whenever lastRunScriptOutput is truthy
  - New: auto-complete fires only when isMeaningfulRunScriptOutput is true
  - Catches the literal "null" string bug that previously slipped through

executePlan runScript-empty fall-through:
  - When the last step is runScript and the output isn't meaningful, return
    deviated with a reason that names the failure AND points the per-action
    LLM at extractWithIndex (the Gen 10 recovery tool)
  - This is the path that did NOT work in Gen 9 alone — but in Gen 10 the
    per-action loop has extractWithIndex available AND the cost cap bounds
    runaway recovery loops

Tests cherry-picked: 12 (all pass)
  - 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
  - 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on
    {x:null}, declines on literal "null", positive control auto-completes
    on real values)

Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9
runScript-empty fall-through are mutually exclusive (different last-step
types). Both kept, ordered Gen 10 first then Gen 9.

Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.

* docs(gen10): changeset + 5-rep validation results

5-rep matched same-day validation per CLAUDE.md rules #3 + #6:

  Gen 8 same-day 5-rep: 29/50 = 58%
  Gen 10 5-rep:         37/50 = 74%
  Delta:                +8 tasks (+16 percentage points)

Architectural wins (consistent across 3-rep AND 5-rep, same-day):
  - npm-package-downloads:    0/5 -> 5/5 (+5) extractWithIndex
  - w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
  - github-pr-count:          4/5 -> 5/5 (+1)
  - stackoverflow-answer-count: 2/5 -> 3/5 (+1)

Cost analysis (matched same-day):
  - Raw cost: +59% ($0.017 -> $0.027)
  - Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
  - Death spirals: 0 (cost cap held; peak run $0.16)
  - Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)

Failure modes that remain (Gen 10.1 candidates):
  - Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}.
    Same in Gen 8, not a regression. Fixable via prompt, not architecture.
  - Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
  - mdn/arxiv variance within Wilson 95% CI overlap.

Files:
  - .changeset/gen10-dom-index-extraction.md (honest writeup)
  - .evolve/progress.md (round 2 result + per-task table)
  - .evolve/current.json (status: round2_complete_promote)
  - .evolve/experiments.jsonl (gen10-002 with verdict KEEP)