
feat: research pipeline + speed-v1 experiment results #19

Merged: drewstone merged 3 commits into main from feat/research-pipeline on Mar 18, 2026

@drewstone
Contributor

Summary

  • Research pipeline (scripts/run-research-pipeline.mjs): automated hypothesis testing with two-stage screening, cost estimation, parallel execution, and decision classification
  • Speed-v1 queue (bench/research/speed-v1.json): 10 hypotheses tested, all annotated with results
  • Config promotions: micro-plan enabled globally; image/media blocking applied to benchmark profiles
  • A/B runner fixes: arbitrary arm ID support, spec.modes passthrough, .startsWith('webbench') mode detection
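The `.startsWith('webbench')` mode detection can be pictured with a minimal sketch. It assumes modes are plain strings such as `webbench` and `webbench-stealth`; `isWebbenchMode` and the `default` mode are hypothetical names used for illustration, not necessarily what the runner uses.

```javascript
// Hypothetical helper: a prefix match, so variants such as
// 'webbench-stealth' are detected along with 'webbench' itself.
function isWebbenchMode(mode) {
  return typeof mode === 'string' && mode.startsWith('webbench');
}

console.log(isWebbenchMode('webbench'));         // true
console.log(isWebbenchMode('webbench-stealth')); // true
console.log(isWebbenchMode('default'));          // false
```

A prefix check like this avoids listing each variant by name, at the cost of also matching any unrelated mode that happens to begin with `webbench`.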

Experiment Results

| Hypothesis | Decision | Impact |
| --- | --- | --- |
| micro-plan-2 | promoted (global) | +8pp pass, -19% tokens |
| retry-1 | promoted (bench profile) | -23% tokens |
| block-images-media | promoted (bench profile) | -20% tokens, -21% duration |
| micro-plan-3 | candidate | +8pp but CI too wide |
| compact-first-turn | rejected | +66% tokens |
| scout-enabled | rejected | +30% tokens |
| vision-auto-escalation | rejected | +48% tokens |
| combo-speed | rejected | components individually negative |
| llm-timeout-30s | neutral | no meaningful change |
| supervisor-early | neutral | false positive stalls |
supervisor-early neutral false positive stalls

Pipeline Features

  • --two-stage: screen (1 rep) → validate candidates (5 reps), ~40% cheaper
  • --estimate: cost estimation before running
  • --hypothesis-concurrency N: parallel hypothesis execution
  • --max-priority N, --hypothesis <id>, --resume: filtering and resumption
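The arithmetic behind the `--two-stage` saving can be sketched as follows. The PR does not say how the ~40% figure was measured, so the baseline (5 reps for every hypothesis), the survivor count of 4, and the choice to count validation reps on top of the screening rep are all assumptions. The function names are hypothetical.

```javascript
// Runs needed when every hypothesis gets the full number of reps.
function fullRuns(hypotheses, reps) {
  return hypotheses * reps;
}

// Runs needed when all hypotheses are screened cheaply and only the
// survivors are validated (validation reps counted in addition to screening).
function twoStageRuns(hypotheses, survivors, screenReps, validateReps) {
  return hypotheses * screenReps + survivors * validateReps;
}

// Percentage saved relative to the baseline, rounded to a whole percent.
function savingsPct(baseline, actual) {
  return Math.round((100 * (baseline - actual)) / baseline);
}

const baseline = fullRuns(10, 5);        // 50 runs
const staged = twoStageRuns(10, 4, 1, 5); // 10 + 20 = 30 runs
console.log(savingsPct(baseline, staged)); // 40
```

The saving depends on how many hypotheses survive screening: if all 10 survived, the staged run would cost more than the baseline (60 runs against 50).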

Test plan

  • pnpm lint passes
  • pnpm check:boundaries passes
  • pnpm test passes (629 tests)
  • All 10 hypotheses annotated with results
  • Verify pnpm research:pipeline --queue bench/research/speed-v1.json --estimate shows cost estimate
  • Verify pnpm research:pipeline --queue bench/research/speed-v1.json --resume skips completed (all p99)

- --extract-tokens: add video extraction (<video>, <source>, poster),
  lazy-load data-* attributes, URL decoding on filenames, inline script
  library detection (GSAP/p5/Three.js/Lottie/etc.), raise element cap
  to 50k, new VideoAsset type, detectedLibraries field on DesignTokens

- --rip: full site download via Playwright network interception. Captures
  every request/response, rewrites HTML/CSS references to local paths,
  auto-scrolls for lazy loading, reveals hidden content (accordions/tabs/
  carousels), multi-page crawl, self-contained output directory

- --design-compare: comprehensive side-by-side comparison of two URLs.
  Token extraction + pixel diff (pixelmatch) + structural diff. Interactive
  content reveal before screenshots (expands accordions, clicks tabs,
  scrolls carousels, opens mobile menus, dismisses modals). Captures per-
  interaction-state screenshots. HTML + JSON report output
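One way to picture the "rewrites HTML/CSS references to local paths" step of `--rip` is a function that maps each captured URL to a path inside the output directory. The sketch below is an assumption about the approach, not the module's actual code: `localPathFor` is a hypothetical name, and dropping the query string is a simplification that can make two distinct URLs collide.

```javascript
// Hypothetical mapping from a captured resource URL to a relative local path.
function localPathFor(resourceUrl) {
  const { hostname, pathname } = new URL(resourceUrl);

  // Decode percent-escapes so filenames are readable; fall back to the raw
  // path if the URL contains a malformed escape sequence.
  let decoded;
  try {
    decoded = decodeURIComponent(pathname);
  } catch {
    decoded = pathname;
  }

  // Directory-style URLs get an explicit index file.
  if (decoded.endsWith('/')) decoded += 'index.html';

  // Drop empty and dot segments so the result stays inside the output directory.
  const segments = decoded.split('/').filter((s) => s && s !== '.' && s !== '..');
  return [hostname, ...segments].join('/');
}

console.log(localPathFor('https://example.com/'));
// example.com/index.html
console.log(localPathFor('https://example.com/img/hero%20shot.png?v=2'));
// example.com/img/hero shot.png
```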

New module: src/design/ (page-interaction, rip, compare, types, index)

Add automated hypothesis testing pipeline (run-research-pipeline.mjs)
with two-stage screening, cost estimation, and parallel execution.
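Parallel execution bounded by `--hypothesis-concurrency N` could be implemented with a small worker-lane pool. The sketch below is a generic version of that technique, not the pipeline's actual code; `runWithConcurrency` and `fakeRun` are hypothetical names, and the timer stands in for a real benchmark run.

```javascript
// Run worker(item) for every item, with at most `limit` in flight at once.
// Results are returned in input order.
async function runWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unclaimed index until none remain.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Usage with a simulated run that tracks how many are active at once.
const queue = ['micro-plan-2', 'retry-1', 'block-images-media', 'micro-plan-3'];
let inFlight = 0;
let peak = 0;

function fakeRun(id) {
  inFlight += 1;
  peak = Math.max(peak, inFlight);
  return new Promise((resolve) => {
    setTimeout(() => {
      inFlight -= 1;
      resolve(`${id}: done`);
    }, 10);
  });
}

runWithConcurrency(queue, 2, fakeRun).then((results) => {
  console.log(results);
  console.log('peak in flight:', peak); // peak in flight: 2
});
```

Because JavaScript runs each lane on a single thread between `await` points, claiming an index with `next++` needs no lock.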

Speed-v1 results (10 hypotheses, 3 reps each):
- Promoted: micro-plan-2 (global default), retry-1 + block-images-media (benchmark profiles)
- Rejected: compact-first-turn, scout-enabled, vision-auto-escalation, combo-speed
- Neutral: llm-timeout-30s, supervisor-early
- Candidate: micro-plan-3 (positive signal, needs larger case set)

Config changes:
- microPlan enabled by default (maxActionsPerTurn: 2)
- resourceBlocking explicitly set in DEFAULTS (blockImages/blockMedia off globally)
- Benchmark profiles block images/media for both webbench and webbench-stealth
- A/B runner supports arbitrary arm IDs and spec.modes
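Taken together, the config changes amount to global defaults plus per-profile overrides. The sketch below shows one possible shape. The keys `microPlan`, `maxActionsPerTurn`, `resourceBlocking`, `blockImages`, `blockMedia`, and `DEFAULTS` come from the notes above; the `enabled` field, the `BENCH_PROFILES` object, and `resolveConfig` are assumptions made for illustration.

```javascript
// Global defaults: micro-plan on, resource blocking explicitly off.
const DEFAULTS = {
  microPlan: { enabled: true, maxActionsPerTurn: 2 },
  resourceBlocking: { blockImages: false, blockMedia: false },
};

// Hypothetical benchmark profiles: both block images and media.
const BENCH_PROFILES = {
  webbench: { resourceBlocking: { blockImages: true, blockMedia: true } },
  'webbench-stealth': { resourceBlocking: { blockImages: true, blockMedia: true } },
};

// Merge a profile's overrides over the defaults, one section at a time.
function resolveConfig(profileName) {
  const profile = BENCH_PROFILES[profileName] ?? {};
  return {
    microPlan: { ...DEFAULTS.microPlan, ...profile.microPlan },
    resourceBlocking: { ...DEFAULTS.resourceBlocking, ...profile.resourceBlocking },
  };
}

console.log(resolveConfig('webbench').resourceBlocking);
// { blockImages: true, blockMedia: true }
console.log(resolveConfig(undefined).resourceBlocking);
// { blockImages: false, blockMedia: false }
```

Setting `resourceBlocking` explicitly in the defaults, even with both flags off, makes the benchmark override visible as a deliberate difference, not a missing key.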
@drewstone drewstone merged commit cecaf01 into main Mar 18, 2026
5 checks passed
drewstone added a commit that referenced this pull request Mar 19, 2026
Unpublished since 0.10.0:
- feat: screenX/screenY CDP fix for Cloudflare Turnstile (#29)
- fix: boost output tokens near max turns (#28)
- feat: canvas fingerprint noise + stealth patches (#27)
- fix: headless UA override — platform-agnostic Akamai bypass (#26)
- fix: nightly CI — Xvfb headed stealth + system Chrome (#25)
- feat: retry malformed JSON with minimal context (#24)
- feat: three-tier history compression -22% cost (#23)
- feat: headless passthrough + Docker benchmark runner (#22)
- feat: WebVoyager + WebArena benchmark adapters (#20)
- fix: graceful recovery from execute wall-clock timeouts (#21)
- feat: showcase command for marketing asset capture (#18)
- feat: research pipeline + speed-v1 experiment results (#19)
- feat: design rip, compare, and extract-tokens overhaul (#17)
- feat: CDP connection, browser profiles, and asset downloader (#16)