This PR was opened by the Changesets release GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.
## Releases

### @tangle-network/browser-agent-driver@0.32.0

#### Minor Changes
- #89 `9e9e0d8` Thanks @drewstone! - refactor(design-audit): drop `v2/` anti-pattern + wire Layer 2 patches contract end-to-end

  Two changes that fold into one coherent diff.

  **Canonicalization** — no version numbers in file or directory names. The `src/design/audit/v2/` directory is gone:

  - `v2/types.ts` → `src/design/audit/score-types.ts` (scoring/classifier/patches/tags types)
  - `v2/build-result.ts` → `src/design/audit/build-result.ts`
  - `v2/score.ts` → `src/design/audit/score.ts`
  - `tests/design-audit-v2-result.test.ts` → `tests/design-audit-build-result.test.ts`

  Identifier renames: `AuditResult_v2` → `AuditResult`, `BuildV2ResultInput` → `BuildAuditResultInput`, `parseAuditResponseV2` → `parseAuditResponse`, `buildEvalPromptV2` → `buildEvalPrompt`, `buildAuditResultV2` → `buildAuditResult`, `synthesizeScoresFromV1` → `synthesizeScoresFromLegacy`, `auditResultV2` field → `auditResult`, `DesignFindingV1` → `DesignFindingBase`, `AppliesWhenV1` → `BaseAppliesWhen`, `V2_INTERNALS` → `BUILD_RESULT_INTERNALS`.

  Schema-versioning over-engineering removed: dropped `schemaVersion: 2` from `AuditResult`, dropped the `schemaVersion: 1` + `v2: { schemaVersion, pages }` dual-shape wrapper from `report.json`, and dropped the self-introduced `MIN_TOKENS_SCHEMA`/`CURRENT_TOKENS_SCHEMA` constants on `tokens.json`. (Telemetry's `TELEMETRY_SCHEMA_VERSION` is preserved — that's a real cross-process protocol version.)

  **Layer 2 patches contract wired end-to-end.** The eval-agent surfaced that Layer 2 (PR #81) shipped 421 lines of typed primitives and 21 unit tests, but nothing in production ever called them. Three independent gaps:

  - `src/design/audit/evaluate.ts` — added a PATCH CONTRACT block to the LLM prompt with the exact shape, one worked example, and the snapshot-anchoring rule. Few-shot examples (`standard`, `trust`) now include `patches[]`. `Brain.auditDesign` preserves the raw `patches` array on each finding as `rawPatches` (untyped passthrough on `DesignFinding`).
  - `src/design/audit/build-result.ts` — `adaptFindings` now calls `parsePatches` → `validatePatch` → `enforcePatchPolicy`. Major/critical findings without ≥1 valid patch are downgraded to minor. A new unit test ("Layer 2: keeps a major finding with a valid patch, downgrades a major finding without one") proves the contract.
  - `src/design/audit/pipeline.ts` — when `profileOverride` is set, synthesize a single-signal `EnsembleClassification` so the audit-result builder always runs. Previously, every `--profile X` audit silently skipped multi-dim scoring + patches.
  - `src/design/audit/patches/validate.ts` — snapshot-anchoring is required only when `target.scope ∈ {html, structural}`. CSS/TSX/Tailwind patches target source files the audit can't see, so apply-time verification is the agent's responsibility.

  The eval-agent caught a follow-up regression: the calibration metric dropped from 1.00 → 0.60 → 0.00 across two iterations as the patch contract expanded the prompt. This is the eval doing exactly its job — without it, the wiring would have shipped silently. Documented in `.evolve/critical-audit/<ts>/reaudit-2026-04-27.md`. Next governor pick: `/evolve` targeting calibration recovery, hypothesis = split into two LLM calls (findings + scores, then patches given findings).

  +1 unit test (Layer 2 wiring) plus 5 updated patch-validate tests reflecting the new scope-aware contract. Total: 1505 passing.
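The scope-aware anchoring rule and the downgrade policy described above can be sketched as follows. This is a minimal illustration with simplified `Patch` and `Finding` shapes; the names `validatePatch` and `enforcePatchPolicy` come from the changelog, but their signatures here are assumptions, not the package's real types.

```typescript
type Scope = "html" | "structural" | "css" | "tsx" | "tailwind";

interface Patch {
  target: { scope: Scope };
  before: string; // text the patch claims to replace
  after: string;
}

interface Finding {
  severity: "critical" | "major" | "minor";
  patches: Patch[];
}

// Snapshot anchoring is only enforceable for scopes the audit can see.
const SNAPSHOT_SCOPES: ReadonlySet<Scope> = new Set(["html", "structural"]);

function validatePatch(patch: Patch, snapshot: string): boolean {
  if (SNAPSHOT_SCOPES.has(patch.target.scope)) {
    // html/structural patches must quote a verbatim snapshot substring.
    return snapshot.includes(patch.before);
  }
  // css/tsx/tailwind target source files the audit can't see;
  // verification is deferred to apply time.
  return patch.before.length > 0;
}

// Major/critical findings without at least one valid patch are downgraded.
function enforcePatchPolicy(finding: Finding, snapshot: string): Finding {
  if (finding.severity === "minor") return finding;
  const hasValidPatch = finding.patches.some((p) => validatePatch(p, snapshot));
  return hasValidPatch ? finding : { ...finding, severity: "minor" };
}
```

The design choice this illustrates: validation is strict only where the audit has ground truth (the snapshot), and severity is a claim that must be backed by an actionable patch.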
- #89 `9e9e0d8` Thanks @drewstone! - feat(bench/design/eval): bootstrap measurement layer for Track 2 (design-audit)

  Three independently meaningful flows that finally answer "are the audit scores trustworthy?" — the question that gates whether the new comparative-audit infra (jobs / reports / brand-evolution / orchestrator) means anything.

  - Flows: `designAudit_calibration_in_range_rate`, `designAudit_reproducibility_max_stddev`, `designAudit_patches_valid_rate` (reuses `validatePatch` from Layer 2).
  - `bench/design/eval/` — pure-function evaluators, independent of the AI SDK. `run.ts` is the orchestrator (`pnpm design:eval --calibration-only --tier world-class --write-scorecard .evolve/scorecard.json`). `scorecard.ts` defines the envelope shape. Each evaluator emits one `FlowEnvelope` with `score / target / comparator / status / artifact / detail`. The runner merges fresh flows into `.evolve/scorecard.json` without clobbering older flows from prior generations.
  - Baseline established: `designAudit_calibration_in_range_rate = 1.00` (5/5 world-class sites in expected range). Stripe → 8.0, Linear → 9.0, Vercel → 8.0, Raycast → 8.0, Cursor → 8.0.
  - Real gap surfaced: `designAudit_patches_valid_rate = unmeasured`. None of the 4 critical/major findings on stripe.com emitted a `patches[]` array, and `auditResultV2` is missing from `report.json`. Layer 1 v2 + Layer 2 patches aren't writing through to the v1-shaped output. This is exactly what the eval-agent is supposed to catch — 1503 unit tests passed without revealing this regression.

  +9 new tests across `design-eval-scorecard` and `design-eval-patches`. Total: 1503 passing.
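The envelope-plus-merge behavior described above can be illustrated with a small sketch. The `score / target / comparator / status / artifact / detail` field list comes from the entry; the concrete types and the `mergeScorecard` helper are assumptions, not the real `scorecard.ts`.

```typescript
type Comparator = ">=" | "<=";

interface FlowEnvelope {
  score: number;
  target: number;
  comparator: Comparator;
  status: "pass" | "fail" | "unmeasured";
  artifact?: string; // path to supporting evidence
  detail?: string;   // human-readable explanation
}

// One envelope per flow, keyed by flow name.
type Scorecard = Record<string, FlowEnvelope>;

// Merge fresh flow results into an existing scorecard: new results
// overwrite their own flow keys, while older flows from prior
// generations are left untouched.
function mergeScorecard(existing: Scorecard, fresh: Scorecard): Scorecard {
  return { ...existing, ...fresh };
}
```

The non-clobbering merge is what lets a `--calibration-only` run update one flow without erasing the reproducibility or patches results from earlier generations.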
- #89 `9e9e0d8` Thanks @drewstone! - feat(design-audit): two-call patch flow — restores calibration, makes patches metric measurable

  A targeted retreat from the prompt bloat that landed in the prior commit (refactor/audit-canonicalize-and-patches-wiring), keeping the wiring fixes intact. Splits the audit into two LLM calls:

  - Call 1, findings (`evaluate.ts`): slim, focused, no patch contract. Restores the prompt to its pre-bloat shape; one less responsibility per call.
  - Call 2, patches (`src/design/audit/patches/generate.ts`): runs after findings exist, asking the LLM for one `Patch` per major/critical finding, given the snapshot plus the findings to fix.
  - `build-result.ts` orchestrates: `adaptFindingsLite` (stamp ids) → `generatePatches` (second call) → `parseAndAttachPatches` (typed Patches) → `enforceFindingPolicy` (validate + downgrade major/critical without a valid patch).

  Eval-agent verdict on this round (`designAudit_calibration_in_range_rate`, `designAudit_patches_valid_rate`): calibration is still 0.10 below target (stripe and raycast scored 7.3 and 7.5 against an 8-10 expected band — close but not in range). The patches metric is 0.01 below its 0.95 target — one validation failure on linear.app, where the LLM emitted a placeholder `before` text. Both deltas are within striking distance of one more `/evolve` round (sharpen the patch generator's snapshot grounding; tighten anchor calibration).

  +5 unit tests for `generatePatches`. Total: 1510 passing.

#### Patch Changes
- #88 `9513492` Thanks @drewstone! - fix(brain): gpt-5.x via OpenAI-compatible proxy now works; was 0/30 → 60% on WebVoyager-30

  Two production-blocking bugs surfaced by the bad-app landing-page validation harness:

  - `src/brain/index.ts:589` set `forceReasoning: true` for every `gpt-5.x` model with `provider=openai`. This routes the AI SDK to OpenAI's Responses API (`/v1/responses`). Most third-party OpenAI-compatible proxies (router.tangle.tools, LiteLLM, Together, etc.) only implement `/v1/chat/completions` — Responses API requests come back as 503/HTML, and the SDK throws `Invalid JSON response`.
  - `scripts/run-{mode-baseline,scenario-track}.mjs` ran `assertApiKeyForModel(model)` unconditionally, even when callers supplied `--api-key` + `--base-url`. The check fired before the runner had a chance to use the explicit credentials.

  Fixes:

  - New `Brain.isProxiedOpenAI(providerName)` predicate: a single source of truth for "we're talking to a proxy, downshift to lowest-common-denominator API features." Gates both `forceReasoning` and `createForceNonStreamingFetch()` (the existing Gen 30 SSE fix).
  - The runners skip `assertApiKeyForModel` when `--api-key`/`--base-url` are supplied.
  - `tests/brain-proxy.integration.test.ts` — a real `node:http` server mimics router behavior (200 on `/v1/chat/completions`, 503 on `/v1/responses`) and asserts requests hit the right endpoint with `stream: false`. No mocks; +4 tests.

  WebVoyager validation results (curated-30, gpt-5.4, router.tangle.tools/v1): 0/30 before (runs died with `Invalid JSON response`) → 60% after (remaining failures: `cost_cap_exceeded` and 2× 120s timeout — configuration-bound, not brain bugs).

  Total tests: 1514 (+4).
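The proxy downshift above can be sketched as a small gate. This is illustrative only: the changelog says the real predicate is `Brain.isProxiedOpenAI(providerName)`; the `baseUrl` parameter and the `settingsFor` helper here are assumptions added to make the sketch self-contained.

```typescript
// Hypothetical stand-in for Brain.isProxiedOpenAI: treat any
// OpenAI-protocol endpoint that isn't api.openai.com as a proxy
// that may only implement /v1/chat/completions.
function isProxiedOpenAI(providerName: string, baseUrl?: string): boolean {
  if (providerName !== "openai") return false;
  return baseUrl !== undefined && !baseUrl.includes("api.openai.com");
}

interface CallSettings {
  forceReasoning: boolean; // true routes the AI SDK to /v1/responses
  streaming: boolean;      // proxies also choke on SSE (the Gen 30 fix)
}

// One gate for both downshifts: never hit the Responses API or
// streaming when talking through a proxy.
function settingsFor(model: string, providerName: string, baseUrl?: string): CallSettings {
  const proxied = isProxiedOpenAI(providerName, baseUrl);
  return {
    forceReasoning: model.startsWith("gpt-5.") && !proxied,
    streaming: !proxied,
  };
}
```

The point of the single predicate is that both failure modes (Responses API 503s and broken SSE) share one root cause, so they should share one decision point.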
- #89 `9e9e0d8` Thanks @drewstone! - fix(design-audit): Track 2 eval metrics converge — both flows pass (N=1)

  Two surgical fixes from `/evolve` round 3 that close the calibration + patches gap exposed by `/eval-agent`; both `designAudit_calibration_in_range_rate` and `designAudit_patches_valid_rate` now pass.

  - Calibration fix (`bench/design/eval/calibration.ts`): `readScore` now prefers `page.score` (the holistic LLM judgement) over `auditResult.rollup.score` (the per-dimension weighted aggregate). Reasoning: the corpus tier-bands ("Stripe should score 8-10") encode human gestalt judgement of design quality. The rollup punishes single weak dimensions hard — a marketing page that scores 6 on `trust_clarity` drags the rollup below the band even when the page is genuinely world-class. The holistic score is the right calibration target; the rollup remains the right input for ranking + brand-evolution surfaces.
  - Patches fix (`src/design/audit/patches/generate.ts`): `buildPrompt` — sharpened the snapshot-anchoring rule. The default `target.scope` is now `css` (forgiving — the agent resolves it at apply-time against the source file); `html`/`structural` only when the patch paste-copies a verbatim snapshot substring. The previous wording was too lenient, and the LLM was emitting `html`-scoped patches with text not in the snapshot.

  Final live numbers: linear=9.0, stripe=8.0, vercel=8.0, raycast=8.0, cursor=8.0. 22/23 patches structurally apply.

  Caveat: N=1. Stats discipline asks for ≥3 reps before promotion. The next governor pick is a 3-rep stability run, not more architectural change.
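The score preference and the calibration metric above can be sketched together. The report shapes here are simplified stand-ins, and `readScore`'s real signature may differ; the preference order (`page.score` over `auditResult.rollup.score`) is the behavior the changelog describes.

```typescript
interface PageReport {
  score?: number; // holistic LLM judgement of the page
  auditResult?: { rollup: { score: number } }; // per-dimension weighted aggregate
}

// Prefer the holistic score; fall back to the rollup only when the
// holistic judgement is missing.
function readScore(page: PageReport): number | undefined {
  return page.score ?? page.auditResult?.rollup.score;
}

// Calibration metric: fraction of corpus pages whose score lands in
// the tier's expected band (e.g. [8, 10] for world-class sites).
function inRangeRate(pages: PageReport[], band: [number, number]): number {
  const scores = pages.map(readScore).filter((s): s is number => s !== undefined);
  if (scores.length === 0) return 0;
  const inRange = scores.filter((s) => s >= band[0] && s <= band[1]).length;
  return inRange / scores.length;
}
```

With the live numbers from this entry (9.0, 8.0, 8.0, 8.0, 8.0) and the world-class band of 8-10, this metric comes out at 1.00, matching the reported pass.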