fix: nightly CI — Xvfb headed stealth + system Chrome by drewstone · Pull Request #25 · tangle-network/browser-agent-driver

drewstone · 2026-03-19T01:32:46Z

Summary

Nightly CI has never run successfully (issue #14, "Nightly reliability regression") because the OPENAI_API_KEY secret is missing
Even once the secret is added, headless Chrome is blocked on 20 more of the 50 WEBBENCH sites than headed mode (22 blocked vs. 2, per the table under "Why this works")
Fix: use Xvfb virtual display + stealth profile for headed mode on CI

Changes

Add npx patchright install chrome step for real TLS fingerprint
WebBench sample uses xvfb-run (headed mode without physical display)
Switch webbench → webbench-stealth profile (patchright + stealth args)
Update model gpt-5.2 → gpt-5.4 (current validated baseline)

Why this works

Mode	WEBBENCH-50 pass rate	Anti-bot blocks
Headed (stealth)	48/50 (96%)	2 (Cambridge, AllTrails)
Headless (stealth)	28/50 (56%)	22
Xvfb + headed (stealth)	48/50 (96%) expected	Same as headed

Xvfb provides a virtual X11 display so Chrome runs in full headed mode with:

Real GPU/plugin signals (no empty navigator.plugins)
System Chrome TLS fingerprint (JA3 hash indistinguishable from a regular Chrome install)
Full patchright CDP leak prevention
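
As a rough illustration of the xvfb-run wrapping described above, the sketch below builds the argument list for a headed run. It is only a sketch: the helper name `wrapWithXvfb`, the `DisplayEnv` type, the screen geometry, and the placeholder benchmark command are all assumptions and are not taken from this repository's workflow. The only real CLI pieces used are xvfb-run's `--auto-servernum` and `--server-args` options.

```typescript
// Environment is passed in explicitly so the helper stays a pure function.
type DisplayEnv = Record<string, string | undefined>;

// Hypothetical helper: prefix a command with xvfb-run unless an X display
// is already available (for example on a developer desktop).
function wrapWithXvfb(
  command: string[],
  env: DisplayEnv,
  screen: string = "1920x1080x24", // assumed geometry, not from the PR
): string[] {
  if (command.length === 0) {
    throw new Error("command must not be empty");
  }
  if (env["DISPLAY"]) {
    return command; // a real or virtual display already exists
  }
  return [
    "xvfb-run",
    "--auto-servernum", // pick a free display number
    `--server-args=-screen 0 ${screen}`, // configure the virtual screen
    ...command,
  ];
}

// Placeholder command, NOT the project's real CLI.
const benchCommand: string[] = ["node", "run-benchmark.js", "--profile", "webbench-stealth"];

const onCi = wrapWithXvfb(benchCommand, {});
const onDesktop = wrapWithXvfb(benchCommand, { DISPLAY: ":0" });

console.log(onCi.join(" "));
console.log(onDesktop.join(" "));
```

Passing the environment as a parameter rather than reading it globally keeps the helper easy to test; on a runner without `DISPLAY` the command gains the xvfb-run prefix, while a machine with a display runs the command unchanged.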

Prerequisite

Add OPENAI_API_KEY to repo secrets (Settings → Secrets → Actions)

Test plan

Add OPENAI_API_KEY secret to repo
Trigger workflow manually via workflow_dispatch
Verify Tier1 gate and WebBench sample both pass

🤖 Generated with Claude Code

Headless Chrome gets blocked by 20/50 WEBBENCH sites (Cloudflare, captcha, access-denied). Headed mode with Xvfb virtual display passes 48/50.

Changes:
- Install system Chrome via patchright for real TLS fingerprint
- Use xvfb-run for headed mode without a physical display
- Switch from webbench to webbench-stealth profile
- Update model from gpt-5.2 to gpt-5.4 (validated baseline)

Requires: OPENAI_API_KEY secret in repo settings to unblock nightly runs.

Unpublished since 0.10.0:
- feat: screenX/screenY CDP fix for Cloudflare Turnstile (#29)
- fix: boost output tokens near max turns (#28)
- feat: canvas fingerprint noise + stealth patches (#27)
- fix: headless UA override — platform-agnostic Akamai bypass (#26)
- fix: nightly CI — Xvfb headed stealth + system Chrome (#25)
- feat: retry malformed JSON with minimal context (#24)
- feat: three-tier history compression -22% cost (#23)
- feat: headless passthrough + Docker benchmark runner (#22)
- feat: WebVoyager + WebArena benchmark adapters (#20)
- fix: graceful recovery from execute wall-clock timeouts (#21)
- feat: showcase command for marketing asset capture (#18)
- feat: research pipeline + speed-v1 experiment results (#19)
- feat: design rip, compare, and extract-tokens overhaul (#17)
- feat: CDP connection, browser profiles, and asset downloader (#16)

* feat(runner): Gen 10 — DOM index extraction + bigger snapshot + cost cap

Three coordinated changes that ship together as Gen 10:

A) extractWithIndex action: pick-by-content over pick-by-selector

New action {action:'extractWithIndex', query:'p, dd, code', contains:'downloads'} that returns a numbered, text-rich list of every visible element matching the query. The agent picks elements by index in the next turn. This is the architectural fix Gen 9 was missing: instead of asking the LLM to write a precise CSS selector for data it hasn't seen yet (the failure mode on npm/mdn/python-docs), the wide query finds candidates and the response shows actual textContent so the LLM can pick by content match.

Wired into:
- src/types.ts (ExtractWithIndexAction type, added to Action union)
- src/brain/index.ts (validateAction parser, system prompt, planner prompt, data-extraction rule #25 explaining when to prefer extractWithIndex over runScript on extraction tasks)
- src/drivers/extract-with-index.ts (browser-side query helper, returns {index, tag, text, attributes, selector} for each visible match, capped at 80 matches)
- src/drivers/playwright.ts (driver.execute dispatch, returns formatted output as data so executePlan can capture it like runScript)
- src/runner/runner.ts (per-action loop handler with feedback injection, executePlan capture into lastExtractOutput, plan-ends-with-extract fall-through to per-action loop with the match list as REPLAN context)
- src/supervisor/policy.ts (action signature for stuck-detection)

C) Bigger snapshot + content-line preservation

src/brain/index.ts:budgetSnapshot now preserves term/definition/code/pre/paragraph content lines (which previously got dropped as "decorative" by the interactive-only filter). These are exactly the lines that carry the data agents need on MDN/Python docs/W3C spec/arxiv pages.
Budgets raised:
- Default budgetSnapshot cap: 16k → 24k chars
- Decide() new-page snapshot: 16k → 24k
- Planner snapshot: 12k → 24k (planner is the most important caller for extraction tasks because it writes the runScript on the first observation)

Same-page snapshot stays at 8k (after the LLM has already seen the page).

Empirical verification: probed Playwright's locator.ariaSnapshot() output on a fixture with <dl><dt><code>flatMap(callbackFn)</code></dt><dd>...</dd></dl> and confirmed Playwright DOES emit `term`/`definition`/`code` lines with text content. The bug was in the budgetSnapshot filter dropping them, not in the snapshot pipeline missing them.

Cost cap (mandatory safety net for any iteration-based mechanism)

src/run-state.ts adds totalTokensUsed accumulator, tokenBudget (default 100k, override via Scenario.tokenBudget or BAD_TOKEN_BUDGET env), and isTokenBudgetExhausted gate. src/runner/runner.ts checks the gate at the top of every loop iteration (before the next LLM call) and returns `cost_cap_exceeded` if exceeded.

Calibration:
- Gen 8 real-web mean: ~6k tokens (well under 100k)
- Tier 1 form-multistep full-evidence: ~60k tokens (within cap + 40k headroom)
- Gen 9 death-spirals: 132k–173k (above cap → caught and aborted)

100k = above any normal case I've measured, well below any death spiral. Catches the Gen 9 reddit failure mode (rep 3: $0.25/132k tokens, rep 4: $0.32/173k tokens) within 5–8 turns of futility instead of running for the full case timeout.
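
The cost cap described above could look roughly like the following. This is a simplified sketch and not the code in src/run-state.ts or src/runner/runner.ts: only the names totalTokensUsed, tokenBudget, isTokenBudgetExhausted, BAD_TOKEN_BUDGET, and cost_cap_exceeded come from the commit message. The `resolveTokenBudget` and `runLoop` helpers, the choice that the scenario value wins over the env value, the `>=` comparison, and the simulated per-turn token counts are all assumptions made for illustration.

```typescript
const DEFAULT_TOKEN_BUDGET = 100000;

interface Scenario {
  tokenBudget?: number;
}

type EnvVars = Record<string, string | undefined>;

// Assumption: a scenario-level budget takes precedence over the env override.
function resolveTokenBudget(scenario: Scenario, env: EnvVars): number {
  if (scenario.tokenBudget !== undefined) return scenario.tokenBudget;
  const fromEnv = Number(env["BAD_TOKEN_BUDGET"]);
  return Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : DEFAULT_TOKEN_BUDGET;
}

class RunState {
  totalTokensUsed = 0;
  tokenBudget: number;

  constructor(tokenBudget: number) {
    this.tokenBudget = tokenBudget;
  }

  recordUsage(tokens: number): void {
    this.totalTokensUsed += tokens;
  }

  // Assumption: reaching the budget exactly counts as exhausted.
  isTokenBudgetExhausted(): boolean {
    return this.totalTokensUsed >= this.tokenBudget;
  }
}

type RunResult = { status: "completed" | "cost_cap_exceeded"; turns: number };

// Simulated loop: each array entry stands in for the tokens one LLM call uses.
function runLoop(state: RunState, tokensPerTurn: number[]): RunResult {
  let turns = 0;
  for (const tokens of tokensPerTurn) {
    // Gate is checked at the top of the iteration, before the next LLM call.
    if (state.isTokenBudgetExhausted()) {
      return { status: "cost_cap_exceeded", turns };
    }
    state.recordUsage(tokens);
    turns += 1;
  }
  return { status: "completed", turns };
}

// A runaway case burning 25k tokens per turn against the default budget.
const spiralTurns: number[] = Array.from({ length: 8 }, () => 25000);
const spiral = runLoop(new RunState(resolveTokenBudget({}, {})), spiralTurns);

// A normal case well under the cap.
const normal = runLoop(new RunState(resolveTokenBudget({}, {})), [2000, 2000, 2000]);

console.log(spiral, normal);
```

With these simulated numbers the runaway case is stopped after four turns, because the gate runs before the fifth call, while the normal case completes all three turns untouched.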
Tests: 18 new (981 total, +18 from baseline)
- tests/budget-snapshot.test.ts: 6 (filter preservation, content lines, priority bucket, paragraph handling)
- tests/extract-with-index.test.ts: 13 (browser-side query, contains filter, hidden element skipping, invalid selector graceful failure, stable selector building, formatter, parser via Brain.parse)
- tests/run-state.test.ts: 7 new in 'Gen 10 cost cap' describe block
- tests/runner-execute-plan.test.ts: 2 new (extractWithIndex deviation with match list, cost cap exhaustion)

Gates: TypeScript clean, boundaries clean, full test suite 981/981 PASS, Tier1 deterministic gate PASSED.

Refs: .evolve/pursuits/2026-04-08-gen9-retro-and-gen10-proposal.md

* feat(runner): isMeaningfulRunScriptOutput helper + runScript-empty fall-through

Cherry-picked from the abandoned Gen 9 branch (commit 63e16fe). The original Gen 9 PR was closed because the LLM-iteration recovery loop didn't move the pass rate AND introduced cost regressions on previously-passing tasks (reddit death-spirals at $0.25-$0.32 / 130-173k tokens per case in 5-rep validation).

In Gen 10 the same code is safe and useful for two reasons:
1. Cost cap (100k tokens, default) bounds any death spiral
2. Per-action loop has extractWithIndex available: when the deviation reason mentions "runScript returned no meaningful output", the LLM can respond with extractWithIndex (per data-extraction rule #25) instead of retrying the same wrong selector

What this brings into Gen 10:

isMeaningfulRunScriptOutput() helper:
- Detects null / undefined / empty / whitespace
- Detects literal "null" / "undefined" / "" / ''
- Detects empty JSON shells {} / []
- Detects {x: null} / partial-extraction patterns (any null = retry)
- Detects placeholder patterns via hasPlaceholderPattern

executePlan auto-complete branch hardened:
- Old: auto-complete fires whenever lastRunScriptOutput is truthy
- New: auto-complete fires only when isMeaningfulRunScriptOutput is true
- Catches the literal "null" string bug that previously slipped through

executePlan runScript-empty fall-through:
- When the last step is runScript and the output isn't meaningful, return deviated with a reason that names the failure AND points the per-action LLM at extractWithIndex (the Gen 10 recovery tool)
- This is the path that did NOT work in Gen 9 alone, but in Gen 10 the per-action loop has extractWithIndex available AND the cost cap bounds runaway recovery loops

Tests cherry-picked: 12 (all pass)
- 11 isMeaningfulRunScriptOutput unit tests in tests/runner-execute-plan.test.ts
- 4 executePlan integration tests (Gen 7.2/9 fall-through, declines on {x:null}, declines on literal "null", positive control auto-completes on real values)

Conflict resolution: the Gen 10 extractWithIndex fall-through and the Gen 9 runScript-empty fall-through are mutually exclusive (different last-step types). Both kept, ordered Gen 10 first then Gen 9.

Tests: 993/993 (was 981 before this cherry-pick, +12 from Gen 9)
TypeScript clean. Boundaries clean.
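
The checks listed for isMeaningfulRunScriptOutput() might be approximated as below. This is a sketch under stated assumptions, not the helper from the repository: the function signature, the shallow null check, and especially the patterns inside `hasPlaceholderPattern` are guesses, since the commit message names that helper without showing what it matches.

```typescript
// Assumption: the real placeholder patterns are not shown in the source;
// these three words are stand-ins for illustration only.
function hasPlaceholderPattern(text: string): boolean {
  return /\b(?:placeholder|todo|tbd)\b/i.test(text);
}

// Literal strings that look like output but carry no data.
const EMPTY_LITERALS = new Set<string>(["null", "undefined", '""', "''"]);

// True for null, empty arrays, empty objects, and objects with any null value.
function isEmptyOrPartial(value: unknown): boolean {
  if (value === null) return true;
  if (Array.isArray(value)) return value.length === 0;
  if (typeof value === "object") {
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record);
    return keys.length === 0 || keys.some((key) => record[key] === null);
  }
  return false;
}

function isMeaningfulRunScriptOutput(output: string | null | undefined): boolean {
  if (output === null || output === undefined) return false;
  const trimmed = output.trim();
  if (trimmed === "") return false;
  if (EMPTY_LITERALS.has(trimmed)) return false;
  if (hasPlaceholderPattern(trimmed)) return false;

  let parsed: unknown;
  try {
    parsed = JSON.parse(trimmed);
  } catch {
    return true; // not JSON: treat non-empty plain text as real output
  }
  return !isEmptyOrPartial(parsed);
}

const samples: Array<string | null> = [null, "  ", "null", "{}", "[]", '{"x": null}', '{"year": 1815}', "1815"];
for (const sample of samples) {
  console.log(JSON.stringify(sample), isMeaningfulRunScriptOutput(sample));
}
```

The point of gating auto-complete on a check like this rather than on truthiness is that the string "null" is truthy in JavaScript, which is the bug the commit message says previously slipped through.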
* docs(gen10): changeset + 5-rep validation results

5-rep matched same-day validation per CLAUDE.md rules #3 + #6:
- Gen 8 same-day 5-rep: 29/50 = 58%
- Gen 10 5-rep: 37/50 = 74%
- Delta: +8 tasks (+16 percentage points)

Architectural wins (consistent across 3-rep AND 5-rep, same-day):
- npm-package-downloads: 0/5 -> 5/5 (+5) extractWithIndex
- w3c-html-spec-find-element: 2/5 -> 5/5 (+3) bigger snapshot
- github-pr-count: 4/5 -> 5/5 (+1)
- stackoverflow-answer-count: 2/5 -> 3/5 (+1)

Cost analysis (matched same-day):
- Raw cost: +59% ($0.017 -> $0.027)
- Cost per pass: +28% ($0.029 -> $0.037, more honest framing)
- Death spirals: 0 (cost cap held; peak run $0.16)
- Reddit Gen 9.1 regression FIXED: 5/5 at $0.015 mean (was 3/5 at $0.25-$0.32)

Failure modes that remain (Gen 10.1 candidates):
- Wikipedia oracle compliance: agent emits raw '1815' not {year:1815}. Same in Gen 8, not a regression. Fixable via prompt, not architecture.
- Supervisor extra-context bloat on stuck turns (1 wikipedia rep burned 75K tokens).
- mdn/arxiv variance within Wilson 95% CI overlap.

Files:
- .changeset/gen10-dom-index-extraction.md (honest writeup)
- .evolve/progress.md (round 2 result + per-task table)
- .evolve/current.json (status: round2_complete_promote)
- .evolve/experiments.jsonl (gen10-002 with verdict KEEP)
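
The validation notes above appeal to Wilson 95% confidence intervals when separating real per-task wins from noise. The source does not show how that check was computed, so the following is only a sketch of one way to do it, using the standard Wilson score interval and two per-task counts from the list above. The helper names `wilsonInterval` and `intervalsOverlap` and the fixed z = 1.96 are assumptions.

```typescript
interface Interval {
  low: number;
  high: number;
}

// Wilson score interval for a binomial proportion (z = 1.96 for ~95%).
function wilsonInterval(passes: number, trials: number, z: number = 1.96): Interval {
  if (trials <= 0) throw new Error("trials must be positive");
  const p = passes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}

function intervalsOverlap(a: Interval, b: Interval): boolean {
  return a.low <= b.high && b.low <= a.high;
}

// npm-package-downloads: 0/5 -> 5/5
const npmBefore = wilsonInterval(0, 5);
const npmAfter = wilsonInterval(5, 5);

// github-pr-count: 4/5 -> 5/5
const githubBefore = wilsonInterval(4, 5);
const githubAfter = wilsonInterval(5, 5);

console.log("npm overlap:", intervalsOverlap(npmBefore, npmAfter));
console.log("github overlap:", intervalsOverlap(githubBefore, githubAfter));
```

Under this check the 0/5 to 5/5 change has non-overlapping intervals even at five reps, while a 4/5 to 5/5 change does not, which is the kind of distinction the "variance within Wilson 95% CI overlap" remark points at.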

drewstone merged commit 5d30291 into main Mar 19, 2026
5 checks passed

github-actions bot mentioned this pull request Apr 9, 2026

Release: version packages #61

Open

drewstone merged 1 commit into main from fix/nightly-xvfb-stealth
