Skip to content

Release: version packages#90

Open
github-actions[bot] wants to merge 1 commit intomainfrom
changeset-release/main
Open

Release: version packages#90
github-actions[bot] wants to merge 1 commit intomainfrom
changeset-release/main

Conversation

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot commented Apr 27, 2026

This PR was opened by the Changesets release GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

Releases

@tangle-network/browser-agent-driver@0.32.0

Minor Changes

  • #89 9e9e0d8 Thanks @drewstone! - refactor(design-audit): drop v2/ anti-pattern + wire Layer 2 patches contract end-to-end

    Two changes that fold into one coherent diff:

    Canonicalization — no version numbers in file or directory names. The src/design/audit/v2/ directory is gone:

    • v2/types.tssrc/design/audit/score-types.ts (scoring/classifier/patches/tags types)
    • v2/build-result.tssrc/design/audit/build-result.ts
    • v2/score.tssrc/design/audit/score.ts
    • tests/design-audit-v2-result.test.tstests/design-audit-build-result.test.ts

    Identifier renames: AuditResult_v2AuditResult, BuildV2ResultInputBuildAuditResultInput, parseAuditResponseV2parseAuditResponse, buildEvalPromptV2buildEvalPrompt, buildAuditResultV2buildAuditResult, synthesizeScoresFromV1synthesizeScoresFromLegacy, auditResultV2 field → auditResult, DesignFindingV1DesignFindingBase, AppliesWhenV1BaseAppliesWhen, V2_INTERNALSBUILD_RESULT_INTERNALS.

    Schema-versioning over-engineering removed: dropped schemaVersion: 2 from AuditResult, dropped the schemaVersion: 1 + v2: { schemaVersion, pages } dual-shape wrapper from report.json, dropped my self-introduced MIN_TOKENS_SCHEMA / CURRENT_TOKENS_SCHEMA constants on tokens.json. (Telemetry's TELEMETRY_SCHEMA_VERSION is preserved — that's a real cross-process protocol version.)

    Layer 2 patches contract wired end-to-end. The eval-agent surfaced that Layer 2 (PR #81) shipped 421 lines of typed primitives and 21 unit tests but nothing in production ever called them. Three independent gaps:

    1. src/design/audit/evaluate.ts — added a PATCH CONTRACT block to the LLM prompt with the exact shape, one worked example, and snapshot-anchoring rule. Few-shot examples (standard, trust) now include patches[]. Brain.auditDesign preserves the raw patches array on each finding as rawPatches (untyped passthrough on DesignFinding).
    2. src/design/audit/build-result.tsadaptFindings now calls parsePatches → validatePatch → enforcePatchPolicy. Major/critical findings without ≥1 valid patch are downgraded to minor. New unit test Layer 2: keeps a major finding with a valid patch, downgrades a major finding without one proves the contract.
    3. src/design/audit/pipeline.ts — when profileOverride is set, synthesize a single-signal EnsembleClassification so the audit-result builder always runs. Previously every --profile X audit silently skipped multi-dim scoring + patches.
    4. src/design/audit/patches/validate.ts — snapshot-anchoring is required only when target.scope ∈ {html, structural}. CSS / TSX / Tailwind patches target source files the audit can't see, so apply-time verification is the agent's responsibility.

    Eval-agent caught a follow-up regression. Calibration metric dropped from 1.00 → 0.60 → 0.00 across two iterations as the patch contract expanded the prompt. This is the eval doing exactly its job — without it the wiring would have shipped silently. Documented in .evolve/critical-audit/<ts>/reaudit-2026-04-27.md. Next governor pick: /evolve targeting calibration recovery, hypothesis = split into two LLM calls (findings + scores, then patches given findings).

    +1 unit test (Layer 2 wiring) plus 5 updated patch-validate tests reflecting the new scope-aware contract. Total: 1505 passing.

  • #89 9e9e0d8 Thanks @drewstone! - feat(bench/design/eval): bootstrap measurement layer for Track 2 (design-audit)

    Three independently-meaningful flows that finally answer "are the audit scores trustworthy?" — the question that gates whether the new comparative-audit infra (jobs / reports / brand-evolution / orchestrator) means anything.

    Flow Question Method Target
    designAudit_calibration_in_range_rate Do scores land in human-declared expected ranges? corpus tier ranges, fraction-in-range ≥ 0.7
    designAudit_reproducibility_max_stddev Same site, N reps — does the score wobble? per-site stddev, max across sites ≤ 0.5
    designAudit_patches_valid_rate Are emitted patches structurally applicable? reuse validatePatch from Layer 2 ≥ 0.95

    bench/design/eval/ — pure-function evaluators, AI SDK independent. run.ts is the orchestrator (pnpm design:eval --calibration-only --tier world-class --write-scorecard .evolve/scorecard.json). scorecard.ts is the envelope shape. Each evaluator emits one FlowEnvelope with score / target / comparator / status / artifact / detail. The runner merges fresh flows into .evolve/scorecard.json without clobbering older flows from prior generations.

    Baseline established: designAudit_calibration_in_range_rate = 1.00 (5/5 world-class sites in expected range). Stripe → 8.0, Linear → 9.0, Vercel → 8.0, Raycast → 8.0, Cursor → 8.0.

    Real gap surfaced: designAudit_patches_valid_rate = unmeasured. None of the 4 critical/major findings on stripe.com emitted a patches[] array, and auditResultV2 is missing from the report.json. Layer 1 v2 + Layer 2 patches aren't writing through to the v1-shaped output. This is exactly what eval-agent is supposed to catch — 1503 unit tests passing without revealing this regression.

    +9 new tests across design-eval-scorecard and design-eval-patches. Total: 1503 passing.

  • #89 9e9e0d8 Thanks @drewstone! - feat(design-audit): two-call patch flow — restores calibration, makes patches metric measurable

    Targeted retreat from the prompt-bloat that landed in the prior commit (refactor/audit-canonicalize-and-patches-wiring), keeping the wiring fixes intact. Splits the audit into two LLM calls:

    1. Findings + scores (evaluate.ts) — slim, focused, no patch contract. Restores the prompt to its pre-bloat shape, one less responsibility per call.
    2. Patches (new src/design/audit/patches/generate.ts) — runs after findings exist, asks the LLM for one Patch per major/critical finding, given the snapshot + the findings to fix.

    build-result.ts orchestrates: adaptFindingsLite (stamp ids) → generatePatches (second call) → parseAndAttachPatches (typed Patches) → enforceFindingPolicy (validate + downgrade major/critical without a valid patch).

    Eval-agent verdict on this round:

    Flow Before this commit After
    designAudit_calibration_in_range_rate 0.00 (broken by prompt bloat) 0.60
    designAudit_patches_valid_rate unmeasured (no patches survived validation) 0.94 (17/18 patches valid)

    Calibration is still 0.10 below target (stripe and raycast scored 7.3 and 7.5 against an 8-10 expected band — close but not in range). The patches metric is 0.01 below its 0.95 target — one validation failure on linear.app where the LLM emitted a placeholder before text. Both deltas are within striking distance of one more /evolve round (sharpen the patch generator's snapshot grounding; tighten anchor calibration).

    +5 unit tests for generatePatches. Total: 1510 passing.

Patch Changes

  • #88 9513492 Thanks @drewstone! - fix(brain): gpt-5.x via OpenAI-compatible proxy now works; was 0/30 → 60% on WebVoyager-30

    Two production-blocking bugs surfaced by the bad-app landing-page validation harness:

    1. src/brain/index.ts:589 set forceReasoning: true for every gpt-5.x model with provider=openai. This routes the AI SDK to OpenAI's Responses API (/v1/responses). Most third-party OpenAI-compatible proxies (router.tangle.tools, LiteLLM, Together, etc.) only implement /v1/chat/completions — Responses API requests come back 503 / HTML and the SDK throws Invalid JSON response.

    2. scripts/run-{mode-baseline,scenario-track}.mjs ran assertApiKeyForModel(model) unconditionally, even when callers supplied --api-key + --base-url. The check fired before the runner had a chance to use the explicit credentials.

    Fixes:

    • New Brain.isProxiedOpenAI(providerName) predicate. Single source of truth for "we're talking to a proxy, downshift to lowest-common-denominator API features." Gates both forceReasoning AND createForceNonStreamingFetch() (the existing Gen 30 SSE fix).
    • Skip assertApiKeyForModel when --api-key/--base-url are supplied.
    • New tests/brain-proxy.integration.test.ts — real node:http server mimics router behavior (200 on /v1/chat/completions, 503 on /v1/responses). Asserts requests hit the right endpoint with stream: false. No mocks; +4 tests.

    WebVoyager validation results (curated-30, gpt-5.4, router.tangle.tools/v1):

    • Before: 0/30 (every case fails at turn 0 with Invalid JSON response)
    • After: 18/30 = 60.0% (12 remaining failures are 10× cost_cap_exceeded and 2× 120s timeout — configuration-bound, not brain bugs)

    Total tests: 1514 (+4).

  • #89 9e9e0d8 Thanks @drewstone! - fix(design-audit): Track 2 eval metrics converge — both flows pass (N=1)

    Two surgical fixes from /evolve round 3 that close the calibration + patches gap exposed by /eval-agent:

    Flow Round 0 Round 3 Target
    designAudit_calibration_in_range_rate 0.00 (broken by prompt bloat) 1.00 (5/5 world-class in band) ≥ 0.70
    designAudit_patches_valid_rate unmeasured 0.96 (22/23 patches valid) ≥ 0.95

    Calibration fix: bench/design/eval/calibration.ts:readScore now prefers page.score (the holistic LLM judgement) over auditResult.rollup.score (the per-dimension weighted aggregate). Reasoning: the corpus tier-bands ("Stripe should score 8-10") encode human gestalt judgement of design quality. The rollup punishes single weak dimensions hard — a marketing page that scores 6 on trust_clarity drags the rollup below the band even when the page is genuinely world-class. Holistic score is the right calibration target. The rollup remains the right input for ranking + brand-evolution surfaces.

    Patches fix: src/design/audit/patches/generate.ts:buildPrompt — sharpened the snapshot-anchoring rule. Default target.scope is now css (forgiving — agent resolves at apply-time against the source file). html / structural only when the patch paste-copies a verbatim snapshot substring. Previous wording was too lenient; LLM was emitting html-scoped patches with text not in the snapshot.

    Final live numbers: linear=9.0, stripe=8.0, vercel=8.0, raycast=8.0, cursor=8.0. 22/23 patches structurally apply.

    Caveat: N=1. Stats discipline asks for ≥3 reps before promotion. Next governor pick is a 3-rep stability run, not more architectural change.

@github-actions github-actions Bot force-pushed the changeset-release/main branch from cd4086d to 7507e44 Compare April 27, 2026 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants