feat: research pipeline + speed-v1 experiment results by drewstone · Pull Request #19 · tangle-network/browser-agent-driver

drewstone · 2026-03-18T02:51:58Z

Summary

Research pipeline (scripts/run-research-pipeline.mjs): automated hypothesis testing with two-stage screening, cost estimation, parallel execution, and decision classification
Speed-v1 queue (bench/research/speed-v1.json): 10 hypotheses tested, all annotated with results
Config promotions: micro-plan enabled globally; image/media blocking applied to benchmark profiles
A/B runner fixes: arbitrary arm ID support, spec.modes passthrough, .startsWith('webbench') mode detection

Experiment Results

Hypothesis	Decision	Impact
micro-plan-2	promoted (global)	+8pp pass, -19% tokens
retry-1	promoted (bench profile)	-23% tokens
block-images-media	promoted (bench profile)	-20% tokens, -21% duration
micro-plan-3	candidate	+8pp but CI too wide
compact-first-turn	rejected	+66% tokens
scout-enabled	rejected	+30% tokens
vision-auto-escalation	rejected	+48% tokens
combo-speed	rejected	components individually negative
llm-timeout-30s	neutral	no meaningful change
supervisor-early	neutral	false positive stalls
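The four decision labels in the table can be read as the output of a small classifier over per-hypothesis deltas. The sketch below is only an illustration of that idea: the thresholds, the field names (`passDeltaPp`, `tokenDeltaPct`, `ciExcludesZero`), and the rule ordering are all assumptions, not the rules used by `run-research-pipeline.mjs`.

```javascript
// Hypothetical thresholds. The PR does not state the actual decision rules.
const THRESHOLDS = {
  minPassGainPp: 5,      // pass-rate gain (percentage points) that counts as a win
  minTokenSavingPct: 10, // token reduction (%) that counts as a win
  maxTokenCostPct: 10,   // token increase (%) that counts as a regression
};

/**
 * Classify one hypothesis against its baseline.
 * r.passDeltaPp: change in pass rate, in percentage points
 * r.tokenDeltaPct: change in token usage, in percent (negative = cheaper)
 * r.ciExcludesZero: true when the confidence interval is tight enough to trust
 */
function classify(r, t = THRESHOLDS) {
  const regressed =
    r.tokenDeltaPct >= t.maxTokenCostPct || r.passDeltaPp <= -t.minPassGainPp;
  if (regressed) return 'rejected';

  const improved =
    r.passDeltaPp >= t.minPassGainPp || r.tokenDeltaPct <= -t.minTokenSavingPct;
  if (!improved) return 'neutral';

  // A positive signal with a wide interval stays a candidate.
  return r.ciExcludesZero ? 'promoted' : 'candidate';
}

// Inputs marked "table" come from the results above; zeros are placeholders
// for values the table does not report.
console.log(classify({ passDeltaPp: 8, tokenDeltaPct: -19, ciExcludesZero: true })); // table: micro-plan-2
console.log(classify({ passDeltaPp: 8, tokenDeltaPct: 0, ciExcludesZero: false }));  // table: micro-plan-3 (+8pp, CI too wide)
console.log(classify({ passDeltaPp: 0, tokenDeltaPct: 66, ciExcludesZero: true }));  // table: compact-first-turn (+66% tokens)
```

Checking regressions first means a hypothesis that costs substantially more tokens is rejected even if it also nudges the pass rate up.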

Pipeline Features

--two-stage: screen (1 rep) → validate candidates (5 reps), ~40% cheaper
--estimate: cost estimation before running
--hypothesis-concurrency N: parallel hypothesis execution
--max-priority N, --hypothesis <id>, --resume: filtering and resumption
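To see where a saving of roughly 40% from `--two-stage` could come from, here is a back-of-the-envelope run count. It assumes cost scales with the number of hypothesis runs, that validation runs a fresh 5 reps, and that 4 of 10 hypotheses survive screening. The survivor count is an assumption chosen for illustration; the PR does not state it.

```javascript
/**
 * Compare run counts for a single-stage sweep (every hypothesis at full reps)
 * against two-stage screening (cheap screen, then full reps for survivors).
 */
function runCounts({ hypotheses, screenReps, validateReps, candidates }) {
  const singleStage = hypotheses * validateReps;
  const twoStage = hypotheses * screenReps + candidates * validateReps;
  const savingPct = Math.round((1 - twoStage / singleStage) * 100);
  return { singleStage, twoStage, savingPct };
}

const counts = runCounts({
  hypotheses: 10,  // size of the speed-v1 queue
  screenReps: 1,   // from --two-stage: screen (1 rep)
  validateReps: 5, // from --two-stage: validate candidates (5 reps)
  candidates: 4,   // assumed number of survivors
});
console.log(counts); // { singleStage: 50, twoStage: 30, savingPct: 40 }
```

The saving shrinks as more hypotheses pass screening: if all 10 advanced, two-stage would cost 60 runs, more than the 50-run single-stage sweep.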

Test plan

pnpm lint passes
pnpm check:boundaries passes
pnpm test passes (629 tests)
All 10 hypotheses annotated with results
Verify pnpm research:pipeline --queue bench/research/speed-v1.json --estimate shows cost estimate
Verify pnpm research:pipeline --queue bench/research/speed-v1.json --resume skips completed (all p99)
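The filtering and resume checks above amount to selecting a subset of the queue before anything runs. A minimal sketch of that selection follows, assuming each queue entry looks like `{ id, priority, result }`. That shape, the `selectHypotheses` helper, and the sample priorities are hypothetical and are not taken from `speed-v1.json`.

```javascript
/**
 * Pick the hypotheses that should run, given CLI-style filters.
 * maxPriority mirrors --max-priority N, hypothesisId mirrors --hypothesis <id>,
 * resume mirrors --resume (skip entries that already carry a result).
 */
function selectHypotheses(queue, opts = {}) {
  const { maxPriority = Infinity, hypothesisId = null, resume = false } = opts;
  return queue.filter((h) => {
    if (hypothesisId !== null && h.id !== hypothesisId) return false;
    if (h.priority > maxPriority) return false;
    if (resume && h.result != null) return false; // already annotated
    return true;
  });
}

// Sample queue: ids are from the results table, priorities are made up.
const queue = [
  { id: 'micro-plan-2', priority: 1, result: 'promoted' },
  { id: 'retry-1', priority: 2, result: null },
  { id: 'combo-speed', priority: 5, result: null },
];

console.log(selectHypotheses(queue, { resume: true }).map((h) => h.id));
// [ 'retry-1', 'combo-speed' ]
console.log(selectHypotheses(queue, { maxPriority: 2 }).map((h) => h.id));
// [ 'micro-plan-2', 'retry-1' ]
```

With every entry annotated, as in this PR, a resume pass over such a queue would select nothing.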

- --extract-tokens: add video extraction (<video>, <source>, poster), lazy-load data-* attributes, URL decoding on filenames, inline script library detection (GSAP/p5/Three.js/Lottie/etc.), raise element cap to 50k, new VideoAsset type, detectedLibraries field on DesignTokens
- --rip: full site download via Playwright network interception. Captures every request/response, rewrites HTML/CSS references to local paths, auto-scrolls for lazy loading, reveals hidden content (accordions/tabs/carousels), multi-page crawl, self-contained output directory
- --design-compare: comprehensive side-by-side comparison of two URLs. Token extraction + pixel diff (pixelmatch) + structural diff. Interactive content reveal before screenshots (expands accordions, clicks tabs, scrolls carousels, opens mobile menus, dismisses modals). Captures per-interaction-state screenshots. HTML + JSON report output

New module: src/design/ (page-interaction, rip, compare, types, index)

Add automated hypothesis testing pipeline (run-research-pipeline.mjs) with two-stage screening, cost estimation, and parallel execution.

Speed-v1 results (10 hypotheses, 3 reps each):
- Promoted: micro-plan-2 (global default), retry-1 + block-images-media (benchmark profiles)
- Rejected: compact-first-turn, scout-enabled, vision-auto-escalation, combo-speed
- Neutral: llm-timeout-30s, supervisor-early
- Candidate: micro-plan-3 (positive signal, needs larger case set)

Config changes:
- microPlan enabled by default (maxActionsPerTurn: 2)
- resourceBlocking explicitly set in DEFAULTS (blockImages/blockMedia off globally)
- Benchmark profiles block images/media for both webbench and webbench-stealth
- A/B runner supports arbitrary arm IDs and spec.modes
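The config changes in this commit message can be pictured as global defaults plus a benchmark-profile override, selected by a prefix check on the mode name. This is a sketch under assumptions: the nesting of keys, the `BENCH_OVERRIDES` object, and the `configForMode` helper are hypothetical. Only the option names (`microPlan`, `maxActionsPerTurn`, `resourceBlocking`, `blockImages`, `blockMedia`) and the `.startsWith('webbench')` check come from the PR.

```javascript
// Global defaults as described in the commit message (key nesting assumed).
const DEFAULTS = {
  microPlan: { enabled: true, maxActionsPerTurn: 2 },
  resourceBlocking: { blockImages: false, blockMedia: false },
};

// Hypothetical override applied only to benchmark profiles.
const BENCH_OVERRIDES = {
  resourceBlocking: { blockImages: true, blockMedia: true },
};

/** Resolve the effective config for a run mode. */
function configForMode(mode) {
  // A prefix check covers both 'webbench' and 'webbench-stealth',
  // which an exact comparison against 'webbench' would miss.
  const isBench = mode.startsWith('webbench');
  return {
    microPlan: { ...DEFAULTS.microPlan },
    resourceBlocking: {
      ...DEFAULTS.resourceBlocking,
      ...(isBench ? BENCH_OVERRIDES.resourceBlocking : {}),
    },
  };
}

console.log(configForMode('webbench-stealth').resourceBlocking);
// { blockImages: true, blockMedia: true }
console.log(configForMode('default').resourceBlocking);
// { blockImages: false, blockMedia: false }
```

Keeping blocking off in the global defaults means ordinary runs still load images and media, while the token and duration savings apply only where the experiment measured them.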

Unpublished since 0.10.0:
- feat: screenX/screenY CDP fix for Cloudflare Turnstile (#29)
- fix: boost output tokens near max turns (#28)
- feat: canvas fingerprint noise + stealth patches (#27)
- fix: headless UA override — platform-agnostic Akamai bypass (#26)
- fix: nightly CI — Xvfb headed stealth + system Chrome (#25)
- feat: retry malformed JSON with minimal context (#24)
- feat: three-tier history compression -22% cost (#23)
- feat: headless passthrough + Docker benchmark runner (#22)
- feat: WebVoyager + WebArena benchmark adapters (#20)
- fix: graceful recovery from execute wall-clock timeouts (#21)
- feat: showcase command for marketing asset capture (#18)
- feat: research pipeline + speed-v1 experiment results (#19)
- feat: design rip, compare, and extract-tokens overhaul (#17)
- feat: CDP connection, browser profiles, and asset downloader (#16)

drewstone added 3 commits March 17, 2026 12:39

docs: add design-audit rip, compare, and token extraction to CLI help…

4ef9ada

… and guide

drewstone merged commit cecaf01 into main Mar 18, 2026
5 checks passed

drewstone merged 3 commits into main from feat/research-pipeline
