fix: nightly CI — Xvfb headed stealth + system Chrome#25
Merged
Headless Chrome gets blocked by 20/50 WEBBENCH sites (Cloudflare, captcha, access-denied). Headed mode with Xvfb virtual display passes 48/50.
Changes:
- Install system Chrome via patchright for real TLS fingerprint
- Use xvfb-run for headed mode without a physical display
- Switch from webbench to webbench-stealth profile
- Update model from gpt-5.2 to gpt-5.4 (validated baseline)
Requires: OPENAI_API_KEY secret in repo settings to unblock nightly runs.
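To make the headed-under-Xvfb idea concrete, here is a minimal sketch of choosing launch options. `LaunchOptionsSketch` is a local stand-in that mirrors the `channel` and `headless` fields of Playwright-style launch options. `chooseLaunchOptions` and its fall-back-to-headless rule are assumptions for illustration, not code from this PR.

```typescript
// Local stand-in for Playwright-style launch options (only the fields used here).
interface LaunchOptionsSketch {
  channel?: string; // "chrome" selects the system Chrome install
  headless: boolean; // false = headed mode, which needs an X11 display
}

// Hypothetical helper: run headed with system Chrome when a display is
// available (xvfb-run exports DISPLAY for the wrapped command), otherwise
// fall back to headless so the launch does not fail outright.
function chooseLaunchOptions(
  env: Record<string, string | undefined>,
): LaunchOptionsSketch {
  const display = env["DISPLAY"];
  const hasDisplay = display !== undefined && display.trim() !== "";
  return hasDisplay ? { channel: "chrome", headless: false } : { headless: true };
}

// Under xvfb-run, DISPLAY is set to a value such as ":99".
const underXvfb = chooseLaunchOptions({ DISPLAY: ":99" });
const bareRunner = chooseLaunchOptions({});
console.log(underXvfb, bareRunner);
```

Passing the environment in as a parameter keeps the decision testable without starting a browser or a display server.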
drewstone added a commit that referenced this pull request on Mar 19, 2026
Unpublished since 0.10.0:
- feat: screenX/screenY CDP fix for Cloudflare Turnstile (#29)
- fix: boost output tokens near max turns (#28)
- feat: canvas fingerprint noise + stealth patches (#27)
- fix: headless UA override - platform-agnostic Akamai bypass (#26)
- fix: nightly CI - Xvfb headed stealth + system Chrome (#25)
- feat: retry malformed JSON with minimal context (#24)
- feat: three-tier history compression -22% cost (#23)
- feat: headless passthrough + Docker benchmark runner (#22)
- feat: WebVoyager + WebArena benchmark adapters (#20)
- fix: graceful recovery from execute wall-clock timeouts (#21)
- feat: showcase command for marketing asset capture (#18)
- feat: research pipeline + speed-v1 experiment results (#19)
- feat: design rip, compare, and extract-tokens overhaul (#17)
- feat: CDP connection, browser profiles, and asset downloader (#16)
drewstone added a commit that referenced this pull request on Apr 9, 2026
Three coordinated changes that ship together as Gen 10:
A) extractWithIndex action — pick-by-content over pick-by-selector
New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'}
that returns a numbered, text-rich list of every visible element matching
the query. The agent picks elements by index in the next turn.
This is the architectural fix Gen 9 was missing: instead of asking the LLM
to write a precise CSS selector for data it hasn't seen yet (the failure
mode on npm/mdn/python-docs), the wide query finds candidates and the
response shows actual textContent so the LLM can pick by content match.
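As a rough illustration of the action shape described above, the sketch below types the example action and validates raw LLM output against it. The field names `action`, `query`, and `contains` come from the example in this message. `parseExtractWithIndex` is a hypothetical name and is not the repo's `validateAction`.

```typescript
// Shape taken from the example action in the commit message.
interface ExtractWithIndexAction {
  action: "extractWithIndex";
  query: string; // wide CSS selector list, e.g. "p, dd, code"
  contains?: string; // optional text filter applied to each match
}

// Hypothetical validator: returns a typed action, or null when the raw
// value does not have the expected shape.
function parseExtractWithIndex(raw: unknown): ExtractWithIndexAction | null {
  if (typeof raw !== "object" || raw === null) return null;
  const candidate = raw as Record<string, unknown>;
  if (candidate["action"] !== "extractWithIndex") return null;
  const query = candidate["query"];
  if (typeof query !== "string" || query.trim() === "") return null;
  const contains = candidate["contains"];
  if (contains === undefined) return { action: "extractWithIndex", query };
  if (typeof contains !== "string") return null;
  return { action: "extractWithIndex", query, contains };
}

const parsedAction = parseExtractWithIndex({
  action: "extractWithIndex",
  query: "p, dd, code",
  contains: "downloads",
});
console.log(parsedAction);
```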
Wired into:
- src/types.ts (ExtractWithIndexAction type, added to Action union)
- src/brain/index.ts (validateAction parser, system prompt, planner prompt,
data-extraction rule #25 explaining when to prefer extractWithIndex over
runScript on extraction tasks)
- src/drivers/extract-with-index.ts (browser-side query helper, returns
{index, tag, text, attributes, selector} for each visible match, capped
at 80 matches)
- src/drivers/playwright.ts (driver.execute dispatch, returns formatted
output as data so executePlan can capture it like runScript)
- src/runner/runner.ts (per-action loop handler with feedback injection,
executePlan capture into lastExtractOutput, plan-ends-with-extract
fall-through to per-action loop with the match list as REPLAN context)
- src/supervisor/policy.ts (action signature for stuck-detection)
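The query helper's behaviour (visible matches only, optional contains filter, numbered output, cap of 80) can be sketched without a browser. `ElementLike` is a hand-rolled stand-in for a DOM element so the example runs anywhere. The 0-based index, the case-insensitive filter, and the output format are assumptions, and the `selector` field from the real helper is omitted because building it needs a live DOM.

```typescript
// Minimal structural stand-in for a DOM element (assumption for this sketch).
interface ElementLike {
  tag: string;
  text: string;
  visible: boolean;
  attributes: Record<string, string>;
}

interface IndexedMatch {
  index: number;
  tag: string;
  text: string;
  attributes: Record<string, string>;
}

const MAX_MATCHES = 80; // cap stated in the commit message

// Numbers every visible candidate whose text passes the optional filter.
function indexMatches(candidates: ElementLike[], contains?: string): IndexedMatch[] {
  const needle = contains === undefined ? "" : contains.toLowerCase();
  const matches: IndexedMatch[] = [];
  for (const el of candidates) {
    if (matches.length >= MAX_MATCHES) break;
    if (!el.visible) continue;
    const text = el.text.trim();
    if (needle !== "" && text.toLowerCase().indexOf(needle) === -1) continue;
    matches.push({ index: matches.length, tag: el.tag, text, attributes: el.attributes });
  }
  return matches;
}

// Renders the numbered list the agent would pick from on its next turn.
function formatMatches(matches: IndexedMatch[]): string {
  return matches.map((m) => `[${m.index}] <${m.tag}> ${m.text}`).join("\n");
}

// Made-up fixture, not real page data.
const sampleElements: ElementLike[] = [
  { tag: "p", text: "Weekly downloads 1,234,567", visible: true, attributes: {} },
  { tag: "p", text: "hidden downloads banner", visible: false, attributes: {} },
  { tag: "code", text: "npm i example", visible: true, attributes: { class: "snippet" } },
];
const listed = indexMatches(sampleElements, "downloads");
console.log(formatMatches(listed));
```

Showing the matched text next to each index is what lets the model choose by content instead of guessing a selector.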
B) Bigger snapshot + content-line preservation
src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/
paragraph content lines (which previously got dropped as "decorative" by
the interactive-only filter). These are exactly the lines that carry the
data agents need on MDN/Python docs/W3C spec/arxiv pages.
Budgets raised:
- Default budgetSnapshot cap: 16k → 24k chars
- Decide() new-page snapshot: 16k → 24k
- Planner snapshot: 12k → 24k (planner is the most important caller for
extraction tasks because it writes the runScript on the first observation)
Same-page snapshot stays at 8k (after the LLM has already seen the page).
Empirical verification: probed Playwright's locator.ariaSnapshot() output
on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl>
— confirmed Playwright DOES emit `term`/`definition`/`code` lines with text
content. The bug was in the budgetSnapshot filter dropping them, not in
the snapshot pipeline missing them.
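A simplified sketch of the filter change, assuming snapshot lines of the form `- role: text`. The content roles and the 24k default come from this message. The interactive role list, the helper names, and the sample snapshot are assumptions, and this version has no priority buckets, so it is much simpler than the real budgetSnapshot.

```typescript
// Roles treated as content-bearing, per the commit message.
const CONTENT_ROLES = ["term", "definition", "code", "pre", "paragraph"];
// Hypothetical interactive roles; the real filter's list is not shown here.
const INTERACTIVE_ROLES = ["button", "link", "textbox", "checkbox", "combobox"];

// Extracts the role from a snapshot line such as "  - term: flatMap(callbackFn)".
function roleOf(line: string): string | null {
  const match = /^\s*-\s+([A-Za-z]+)/.exec(line);
  const role = match ? match[1] : undefined;
  return role === undefined ? null : role;
}

// Keeps interactive and content lines, in order, until the character budget is spent.
function budgetSnapshotSketch(snapshot: string, maxChars: number = 24000): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of snapshot.split("\n")) {
    const role = roleOf(line);
    const keep =
      role !== null &&
      (INTERACTIVE_ROLES.indexOf(role) !== -1 || CONTENT_ROLES.indexOf(role) !== -1);
    if (!keep) continue;
    if (used + line.length + 1 > maxChars) break;
    kept.push(line);
    used += line.length + 1;
  }
  return kept.join("\n");
}

// Hand-written sample in the spirit of an ARIA snapshot, not real Playwright output.
const snapshotSample = [
  "- banner",
  "- term:",
  "  - code: flatMap(callbackFn)",
  "- definition: Maps each element, then flattens the result.",
  '- button "Copy"',
  "- img",
].join("\n");
const budgeted = budgetSnapshotSketch(snapshotSample);
console.log(budgeted);
```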
Cost cap (mandatory safety net for any iteration-based mechanism)
src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default
100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and
isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the
top of every loop iteration (before the next LLM call) and returns
`cost_cap_exceeded` if exceeded.
Calibration:
- Gen 8 real-web mean: ~6k tokens (well under 100k)
- Tier 1 form-multistep full-evidence: ~60k tokens (within cap, with ~40k headroom)
- Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)
100k = above any normal case I've measured, well below any death spiral.
Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4:
$0.32/173k tokens) within 5–8 turns of futility instead of running for
the full case timeout.
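The gate described above can be sketched as follows. `totalTokensUsed`, `tokenBudget`, `isTokenBudgetExhausted`, `BAD_TOKEN_BUDGET`, the 100k default, and the `cost_cap_exceeded` result are named in this message. The class name, the override precedence (scenario first, then env), and the `>=` comparison are assumptions.

```typescript
const DEFAULT_TOKEN_BUDGET = 100000; // default stated in the commit message

// Hypothetical resolver: scenario override, then BAD_TOKEN_BUDGET, then default.
function resolveTokenBudget(
  scenarioBudget: number | undefined,
  env: Record<string, string | undefined>,
): number {
  if (scenarioBudget !== undefined && scenarioBudget > 0) return scenarioBudget;
  const fromEnv = Number(env["BAD_TOKEN_BUDGET"]);
  if (isFinite(fromEnv) && fromEnv > 0) return fromEnv;
  return DEFAULT_TOKEN_BUDGET;
}

class RunStateSketch {
  totalTokensUsed = 0;
  readonly tokenBudget: number;

  constructor(tokenBudget: number) {
    this.tokenBudget = tokenBudget;
  }

  recordUsage(tokens: number): void {
    this.totalTokensUsed += tokens;
  }

  isTokenBudgetExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}

// Loop skeleton: the gate is checked before each simulated LLM call.
function runLoop(
  state: RunStateSketch,
  tokensPerTurn: number[],
): "completed" | "cost_cap_exceeded" {
  for (const tokens of tokensPerTurn) {
    if (state.isTokenBudgetExhausted()) return "cost_cap_exceeded";
    state.recordUsage(tokens); // stands in for an LLM call that reports usage
  }
  return "completed";
}

const spiral = runLoop(new RunStateSketch(resolveTokenBudget(undefined, {})), [
  25000, 25000, 25000, 25000, 25000, 25000,
]);
console.log(spiral);
```

Because the check runs before the call, a run can overshoot the budget by at most one turn's worth of tokens.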
Tests: 18 new (981 total, +18 from baseline)
- tests/budget-snapshot.test.ts: 6 (filter preservation, content lines,
priority bucket, paragraph handling)
- tests/extract-with-index.test.ts: 13 (browser-side query, contains
filter, hidden element skipping, invalid selector graceful failure,
stable selector building, formatter, parser via Brain.parse)
- tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
- tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation
with match list, cost cap exhaustion)
Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS,
Tier1 deterministic gate PASSED.
Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md
drewstone added a commit that referenced this pull request on Apr 9, 2026
…ll-through
Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original
Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the
pass rate AND introduced cost regressions on previously-passing tasks (reddit
death-spirals at $0.25-$0.32 / 130-173k tokens per case in 5-rep validation).
In Gen 10 the same code is safe and useful for two reasons:
1. Cost cap (100k tokens, default) bounds any death spiral
2. Per-action loop has extractWithIndex available: when the deviation
reason mentions "runScript returned no meaningful output", the LLM can
respond with extractWithIndex (per data-extraction rule #25) instead
of retrying the same wrong selector
What this brings into Gen 10:
isMeaningfulRunScriptOutput() helper:
- Detects null / undefined / empty / whitespace
- Detects literal "null" / "undefined" / "" / ''
- Detects empty JSON shells {} / []
- Detects {x: null} / partial-extraction patterns (any null = retry)
- Detects placeholder patterns via hasPlaceholderPattern
executePlan auto-complete branch hardened:
- Old: auto-complete fires whenever lastRunScriptOutput is truthy
- New: auto-complete fires only when isMeaningfulRunScriptOutput is true
- Catches the literal "null" string bug that previously slipped through
executePlan runScript-empty fall-through:
- When the last step is runScript and the output isn't meaningful, return
deviated with a reason that names the failure AND points the per-action
LLM at extractWithIndex (the Gen 10 recovery tool)
- This is the path that did NOT work in Gen 9 alone, but in Gen 10 the
per-action loop has extractWithIndex available AND the cost cap bounds
runaway recovery loops
Tests cherry-picked: 12 (all pass)
- 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
- 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on
{x:null}, declines on literal "null", positive control auto-completes
on real values)
Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9
runScript-empty fall-through are mutually exclusive (different last-step
types). Both kept, ordered Gen 10 first then Gen 9.
Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.
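The checks listed for isMeaningfulRunScriptOutput can be sketched as below. This is a reconstruction from the bullet list, not the repo's code. In particular the body of `hasPlaceholderPattern` is hypothetical, since its real patterns are not shown, and treating non-JSON text as meaningful is an assumption.

```typescript
const EMPTY_LITERALS = ["null", "undefined", '""', "''"];

// Hypothetical placeholder detector; the real pattern list is not shown.
function hasPlaceholderPattern(text: string): boolean {
  return /\b(todo|tbd|placeholder|lorem ipsum)\b/i.test(text);
}

// True when any value at any depth is null ("any null = retry").
function containsNull(value: unknown): boolean {
  if (value === null) return true;
  if (Array.isArray(value)) return value.some((item) => containsNull(item));
  if (typeof value === "object") {
    const record = value as Record<string, unknown>;
    return Object.keys(record).some((key) => containsNull(record[key]));
  }
  return false;
}

// True for the empty JSON shells {} and [].
function isEmptyShell(value: unknown): boolean {
  if (Array.isArray(value)) return value.length === 0;
  if (typeof value === "object" && value !== null) return Object.keys(value).length === 0;
  return false;
}

function isMeaningfulRunScriptOutput(output: string | null | undefined): boolean {
  if (output === null || output === undefined) return false;
  const trimmed = output.trim();
  if (trimmed === "" || EMPTY_LITERALS.indexOf(trimmed) !== -1) return false;
  if (hasPlaceholderPattern(trimmed)) return false;
  let parsed: unknown;
  try {
    parsed = JSON.parse(trimmed);
  } catch {
    return true; // plain text that is not JSON counts as real output here
  }
  return !isEmptyShell(parsed) && !containsNull(parsed);
}

// The old truthiness gate accepts the literal string "null"; the helper does not.
const oldGate = Boolean("null");
const newGate = isMeaningfulRunScriptOutput("null");
console.log(oldGate, newGate);
```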
drewstone added a commit that referenced this pull request on Apr 9, 2026
* feat(runner): Gen 10 — DOM index extraction + bigger snapshot + cost cap
* feat(runner): isMeaningfulRunScriptOutput helper + runScript-empty fall-through
* docs(gen10): changeset + 5-rep validation results
5-rep matched same-day validation per CLAUDE.md rules #3 + #6:
Gen 8 same-day 5-rep: 29/50 = 58%
Gen 10 5-rep: 37/50 = 74%
Delta: +8 tasks (+16 percentage points)
Architectural wins (consistent across 3-rep AND 5-rep, same-day):
- npm-package-downloads: 0/5 -> 5/5 (+5) extractWithIndex
- w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
- github-pr-count: 4/5 -> 5/5 (+1)
- stackoverflow-answer-count: 2/5 -> 3/5 (+1)
Cost analysis (matched same-day):
- Raw cost: +59% ($0.017 -> $0.027)
- Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
- Death spirals: 0 (cost cap held; peak run $0.16)
- Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)
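The headline percentages can be rechecked from the figures reported above. The sketch only recomputes ratios from the rounded numbers in this message. It does not assume how cost per pass was derived, and the helper names are illustrative.

```typescript
// Percent change from `before` to `after`.
function percentChange(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

// Pass rate as a whole-number percentage.
function passRatePercent(passes: number, runs: number): number {
  return Math.round((passes / runs) * 100);
}

const gen8Rate = passRatePercent(29, 50); // Gen 8 same-day 5-rep
const gen10Rate = passRatePercent(37, 50); // Gen 10 5-rep
const deltaPoints = gen10Rate - gen8Rate;

const rawCostChange = Math.round(percentChange(0.017, 0.027));
const costPerPassChange = Math.round(percentChange(0.029, 0.037));

console.log({ gen8Rate, gen10Rate, deltaPoints, rawCostChange, costPerPassChange });
```

On the rounded inputs this gives 58% and 74% pass rates, a 16-point delta, and cost changes of 59% and 28%, matching the figures listed.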
Failure modes that remain (Gen 10.1 candidates):
- Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}.
Same in Gen 8, not a regression. Fixable via prompt, not architecture.
- Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
- mdn/arxiv variance within Wilson 95% CI overlap.
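For reference, the Wilson 95% interval mentioned in the last item can be computed as below. This is the standard Wilson score interval with z = 1.96. The per-task counts in the example (1/5 and 3/5) are hypothetical, since the mdn/arxiv counts are not given here.

```typescript
interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a binomial proportion (z = 1.96 for ~95%).
function wilsonInterval(passes: number, runs: number, z: number = 1.96): Interval {
  if (runs <= 0) throw new Error("runs must be positive");
  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = (p + z2 / (2 * runs)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs))) / denom;
  return { lower: center - half, upper: center + half };
}

// Two intervals overlap when neither lies entirely above the other.
function intervalsOverlap(a: Interval, b: Interval): boolean {
  return a.lower <= b.upper && b.lower <= a.upper;
}

// Hypothetical 5-rep counts for one task before and after a change.
const beforeInterval = wilsonInterval(1, 5);
const afterInterval = wilsonInterval(3, 5);
console.log(beforeInterval, afterInterval, intervalsOverlap(beforeInterval, afterInterval));
```

With only five reps per task the intervals are wide, which is why small per-task swings are reported as variance and not as wins or regressions.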
Files:
- .changeset/gen10-dom-index-extraction.md (honest writeup)
- .evolve/progress.md (round 2 result + per-task table)
- .evolve/current.json (status: round2_complete_promote)
- .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
Summary
- … `OPENAI_API_KEY` secret
Changes
- `npx patchright install chrome` step for real TLS fingerprint
- `xvfb-run` (headed mode without physical display)
- `webbench` → `webbench-stealth` profile (patchright + stealth args)
- `gpt-5.2` → `gpt-5.4` (current validated baseline)
Why this works
Xvfb provides a virtual X11 display so Chrome runs in full headed mode with:
- … `navigator.plugins`)
Prerequisite
Add `OPENAI_API_KEY` to repo secrets (Settings → Secrets → Actions)
Test plan
- … `OPENAI_API_KEY` secret to repo
- … `workflow_dispatch`
🤖 Generated with Claude Code