This PR was opened by the Changesets release GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.
## Releases

### @tangle-network/browser-agent-driver@0.32.0

#### Minor Changes
- #89 `9e9e0d8` Thanks @drewstone! - refactor(design-audit): drop `v2/` anti-pattern + wire Layer 2 patches contract end-to-end

  Two changes that fold into one coherent diff.

  **Canonicalization** — no version numbers in file or directory names. The `src/design/audit/v2/` directory is gone:

  - `v2/types.ts` → `src/design/audit/score-types.ts` (scoring/classifier/patches/tags types)
  - `v2/build-result.ts` → `src/design/audit/build-result.ts`
  - `v2/score.ts` → `src/design/audit/score.ts`
  - `tests/design-audit-v2-result.test.ts` → `tests/design-audit-build-result.test.ts`

  Identifier renames: `AuditResult_v2` → `AuditResult`, `BuildV2ResultInput` → `BuildAuditResultInput`, `parseAuditResponseV2` → `parseAuditResponse`, `buildEvalPromptV2` → `buildEvalPrompt`, `buildAuditResultV2` → `buildAuditResult`, `synthesizeScoresFromV1` → `synthesizeScoresFromLegacy`, `auditResultV2` field → `auditResult`, `DesignFindingV1` → `DesignFindingBase`, `AppliesWhenV1` → `BaseAppliesWhen`, `V2_INTERNALS` → `BUILD_RESULT_INTERNALS`.

  Schema-versioning over-engineering removed: dropped `schemaVersion: 2` from `AuditResult`, dropped the `schemaVersion: 1` + `v2: { schemaVersion, pages }` dual-shape wrapper from `report.json`, and dropped the self-introduced `MIN_TOKENS_SCHEMA`/`CURRENT_TOKENS_SCHEMA` constants on `tokens.json`. (Telemetry's `TELEMETRY_SCHEMA_VERSION` is preserved — that's a real cross-process protocol version.)

  **Layer 2 patches contract wired end-to-end.** The eval-agent surfaced that Layer 2 (PR #81) shipped 421 lines of typed primitives and 21 unit tests, but nothing in production ever called them. Three independent gaps:

  - `src/design/audit/evaluate.ts` — added a PATCH CONTRACT block to the LLM prompt with the exact shape, one worked example, and the snapshot-anchoring rule. Few-shot examples (`standard`, `trust`) now include `patches[]`. `Brain.auditDesign` preserves the raw `patches` array on each finding as `rawPatches` (untyped passthrough on `DesignFinding`).
  - `src/design/audit/build-result.ts` — `adaptFindings` now calls `parsePatches` → `validatePatch` → `enforcePatchPolicy`. Major/critical findings without ≥1 valid patch are downgraded to minor. A new unit test ("Layer 2: keeps a major finding with a valid patch, downgrades a major finding without one") proves the contract.
  - `src/design/audit/pipeline.ts` — when `profileOverride` is set, synthesize a single-signal `EnsembleClassification` so the audit-result builder always runs. Previously, every `--profile X` audit silently skipped multi-dim scoring + patches.
  - `src/design/audit/patches/validate.ts` — snapshot-anchoring is required only when `target.scope ∈ {html, structural}`. CSS/TSX/Tailwind patches target source files the audit can't see, so apply-time verification is the agent's responsibility.

  The eval-agent caught a follow-up regression: the calibration metric dropped from 1.00 → 0.60 → 0.00 across two iterations as the patch contract expanded the prompt. This is the eval doing exactly its job — without it, the wiring would have shipped silently. Documented in `.evolve/critical-audit/<ts>/reaudit-2026-04-27.md`. Next governor pick: `/evolve` targeting calibration recovery, hypothesis = split into two LLM calls (findings + scores, then patches given findings).

  +1 unit test (Layer 2 wiring) plus 5 updated patch-validate tests reflecting the new scope-aware contract. Total: 1505 passing.
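The scope-aware anchoring rule and the downgrade policy described above can be sketched as follows. This is a minimal illustration with simplified `Patch` and `Finding` shapes; the names `validatePatch` and `enforcePatchPolicy` come from the changelog, but their signatures here are assumptions, not the package's real types.

```typescript
type Scope = "html" | "structural" | "css" | "tsx" | "tailwind";

interface Patch {
  target: { scope: Scope };
  before: string; // text the patch claims to replace
  after: string;
}

interface Finding {
  severity: "critical" | "major" | "minor";
  patches: Patch[];
}

// Snapshot anchoring is only enforceable for scopes the audit can see.
const SNAPSHOT_SCOPES: ReadonlySet<Scope> = new Set(["html", "structural"]);

function validatePatch(patch: Patch, snapshot: string): boolean {
  if (SNAPSHOT_SCOPES.has(patch.target.scope)) {
    // html/structural patches must quote a verbatim snapshot substring.
    return snapshot.includes(patch.before);
  }
  // css/tsx/tailwind target source files the audit can't see;
  // verification is deferred to apply time.
  return patch.before.length > 0;
}

// Major/critical findings without at least one valid patch are downgraded.
function enforcePatchPolicy(finding: Finding, snapshot: string): Finding {
  if (finding.severity === "minor") return finding;
  const hasValidPatch = finding.patches.some((p) => validatePatch(p, snapshot));
  return hasValidPatch ? finding : { ...finding, severity: "minor" };
}
```

The design choice this illustrates: validation is strict only where the audit has ground truth (the snapshot), and severity is a claim that must be backed by an actionable patch.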
- #89 `9e9e0d8` Thanks @drewstone! - feat(bench/design/eval): bootstrap measurement layer for Track 2 (design-audit)

  Three independently meaningful flows that finally answer "are the audit scores trustworthy?" — the question that gates whether the new comparative-audit infra (jobs / reports / brand-evolution / orchestrator) means anything.

  - Flows: `designAudit_calibration_in_range_rate`, `designAudit_reproducibility_max_stddev`, `designAudit_patches_valid_rate` (reuses `validatePatch` from Layer 2).
  - `bench/design/eval/` — pure-function evaluators, independent of the AI SDK. `run.ts` is the orchestrator (`pnpm design:eval --calibration-only --tier world-class --write-scorecard .evolve/scorecard.json`). `scorecard.ts` defines the envelope shape. Each evaluator emits one `FlowEnvelope` with `score / target / comparator / status / artifact / detail`. The runner merges fresh flows into `.evolve/scorecard.json` without clobbering older flows from prior generations.
  - Baseline established: `designAudit_calibration_in_range_rate = 1.00` (5/5 world-class sites in expected range). Stripe → 8.0, Linear → 9.0, Vercel → 8.0, Raycast → 8.0, Cursor → 8.0.
  - Real gap surfaced: `designAudit_patches_valid_rate = unmeasured`. None of the 4 critical/major findings on stripe.com emitted a `patches[]` array, and `auditResultV2` is missing from `report.json`. Layer 1 v2 + Layer 2 patches aren't writing through to the v1-shaped output. This is exactly what the eval-agent is supposed to catch — 1503 unit tests passed without revealing this regression.

  +9 new tests across `design-eval-scorecard` and `design-eval-patches`. Total: 1503 passing.
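The envelope-plus-merge behavior described above can be illustrated with a small sketch. The `score / target / comparator / status / artifact / detail` field list comes from the entry; the concrete types and the `mergeScorecard` helper are assumptions, not the real `scorecard.ts`.

```typescript
type Comparator = ">=" | "<=";

interface FlowEnvelope {
  score: number;
  target: number;
  comparator: Comparator;
  status: "pass" | "fail" | "unmeasured";
  artifact?: string; // path to supporting evidence
  detail?: string;   // human-readable explanation
}

// One envelope per flow, keyed by flow name.
type Scorecard = Record<string, FlowEnvelope>;

// Merge fresh flow results into an existing scorecard: new results
// overwrite their own flow keys, while older flows from prior
// generations are left untouched.
function mergeScorecard(existing: Scorecard, fresh: Scorecard): Scorecard {
  return { ...existing, ...fresh };
}
```

The non-clobbering merge is what lets a `--calibration-only` run update one flow without erasing the reproducibility or patches results from earlier generations.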
- #89 `9e9e0d8` Thanks @drewstone! - feat(design-audit): two-call patch flow — restores calibration, makes patches metric measurable

  A targeted retreat from the prompt bloat that landed in the prior commit (refactor/audit-canonicalize-and-patches-wiring), keeping the wiring fixes intact. Splits the audit into two LLM calls:

  - Call 1, findings (`evaluate.ts`): slim, focused, no patch contract. Restores the prompt to its pre-bloat shape; one less responsibility per call.
  - Call 2, patches (`src/design/audit/patches/generate.ts`): runs after findings exist, asking the LLM for one `Patch` per major/critical finding, given the snapshot plus the findings to fix.
  - `build-result.ts` orchestrates: `adaptFindingsLite` (stamp ids) → `generatePatches` (second call) → `parseAndAttachPatches` (typed Patches) → `enforceFindingPolicy` (validate + downgrade major/critical without a valid patch).

  Eval-agent verdict on this round (`designAudit_calibration_in_range_rate`, `designAudit_patches_valid_rate`): calibration is still 0.10 below target (stripe and raycast scored 7.3 and 7.5 against an 8-10 expected band — close but not in range). The patches metric is 0.01 below its 0.95 target — one validation failure on linear.app, where the LLM emitted a placeholder `before` text. Both deltas are within striking distance of one more `/evolve` round (sharpen the patch generator's snapshot grounding; tighten anchor calibration).

  +5 unit tests for `generatePatches`. Total: 1510 passing.

#### Patch Changes
- #88 `9513492` Thanks @drewstone! - fix(brain): gpt-5.x via OpenAI-compatible proxy now works; was 0/30 → 60% on WebVoyager-30

  Two production-blocking bugs surfaced by the bad-app landing-page validation harness:

  - `src/brain/index.ts:589` set `forceReasoning: true` for every `gpt-5.x` model with `provider=openai`. This routes the AI SDK to OpenAI's Responses API (`/v1/responses`). Most third-party OpenAI-compatible proxies (router.tangle.tools, LiteLLM, Together, etc.) only implement `/v1/chat/completions` — Responses API requests come back as 503/HTML, and the SDK throws `Invalid JSON response`.
  - `scripts/run-{mode-baseline,scenario-track}.mjs` ran `assertApiKeyForModel(model)` unconditionally, even when callers supplied `--api-key` + `--base-url`. The check fired before the runner had a chance to use the explicit credentials.

  Fixes:

  - New `Brain.isProxiedOpenAI(providerName)` predicate: a single source of truth for "we're talking to a proxy, downshift to lowest-common-denominator API features." Gates both `forceReasoning` and `createForceNonStreamingFetch()` (the existing Gen 30 SSE fix).
  - The runners skip `assertApiKeyForModel` when `--api-key`/`--base-url` are supplied.
  - `tests/brain-proxy.integration.test.ts` — a real `node:http` server mimics router behavior (200 on `/v1/chat/completions`, 503 on `/v1/responses`) and asserts requests hit the right endpoint with `stream: false`. No mocks; +4 tests.

  WebVoyager validation results (curated-30, gpt-5.4, router.tangle.tools/v1): 0/30 before (runs died with `Invalid JSON response`) → 60% after (remaining failures: `cost_cap_exceeded` and 2× 120s timeout — configuration-bound, not brain bugs).

  Total tests: 1514 (+4).
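The proxy downshift above can be sketched as a small gate. This is illustrative only: the changelog says the real predicate is `Brain.isProxiedOpenAI(providerName)`; the `baseUrl` parameter and the `settingsFor` helper here are assumptions added to make the sketch self-contained.

```typescript
// Hypothetical stand-in for Brain.isProxiedOpenAI: treat any
// OpenAI-protocol endpoint that isn't api.openai.com as a proxy
// that may only implement /v1/chat/completions.
function isProxiedOpenAI(providerName: string, baseUrl?: string): boolean {
  if (providerName !== "openai") return false;
  return baseUrl !== undefined && !baseUrl.includes("api.openai.com");
}

interface CallSettings {
  forceReasoning: boolean; // true routes the AI SDK to /v1/responses
  streaming: boolean;      // proxies also choke on SSE (the Gen 30 fix)
}

// One gate for both downshifts: never hit the Responses API or
// streaming when talking through a proxy.
function settingsFor(model: string, providerName: string, baseUrl?: string): CallSettings {
  const proxied = isProxiedOpenAI(providerName, baseUrl);
  return {
    forceReasoning: model.startsWith("gpt-5.") && !proxied,
    streaming: !proxied,
  };
}
```

The point of the single predicate is that both failure modes (Responses API 503s and broken SSE) share one root cause, so they should share one decision point.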
- #89 `9e9e0d8` Thanks @drewstone! - fix(design-audit): Track 2 eval metrics converge — both flows pass (N=1)

  Two surgical fixes from `/evolve` round 3 that close the calibration + patches gap exposed by `/eval-agent`; both `designAudit_calibration_in_range_rate` and `designAudit_patches_valid_rate` now pass.

  - Calibration fix (`bench/design/eval/calibration.ts`): `readScore` now prefers `page.score` (the holistic LLM judgement) over `auditResult.rollup.score` (the per-dimension weighted aggregate). Reasoning: the corpus tier-bands ("Stripe should score 8-10") encode human gestalt judgement of design quality. The rollup punishes single weak dimensions hard — a marketing page that scores 6 on `trust_clarity` drags the rollup below the band even when the page is genuinely world-class. The holistic score is the right calibration target; the rollup remains the right input for ranking + brand-evolution surfaces.
  - Patches fix (`src/design/audit/patches/generate.ts`): `buildPrompt` — sharpened the snapshot-anchoring rule. The default `target.scope` is now `css` (forgiving — the agent resolves it at apply-time against the source file); `html`/`structural` only when the patch paste-copies a verbatim snapshot substring. The previous wording was too lenient, and the LLM was emitting `html`-scoped patches with text not in the snapshot.

  Final live numbers: linear=9.0, stripe=8.0, vercel=8.0, raycast=8.0, cursor=8.0. 22/23 patches structurally apply.

  Caveat: N=1. Stats discipline asks for ≥3 reps before promotion. The next governor pick is a 3-rep stability run, not more architectural change.
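The score preference and the calibration metric above can be sketched together. The report shapes here are simplified stand-ins, and `readScore`'s real signature may differ; the preference order (`page.score` over `auditResult.rollup.score`) is the behavior the changelog describes.

```typescript
interface PageReport {
  score?: number; // holistic LLM judgement of the page
  auditResult?: { rollup: { score: number } }; // per-dimension weighted aggregate
}

// Prefer the holistic score; fall back to the rollup only when the
// holistic judgement is missing.
function readScore(page: PageReport): number | undefined {
  return page.score ?? page.auditResult?.rollup.score;
}

// Calibration metric: fraction of corpus pages whose score lands in
// the tier's expected band (e.g. [8, 10] for world-class sites).
function inRangeRate(pages: PageReport[], band: [number, number]): number {
  const scores = pages.map(readScore).filter((s): s is number => s !== undefined);
  if (scores.length === 0) return 0;
  const inRange = scores.filter((s) => s >= band[0] && s <= band[1]).length;
  return inRange / scores.length;
}
```

With the live numbers from this entry (9.0, 8.0, 8.0, 8.0, 8.0) and the world-class band of 8-10, this metric comes out at 1.00, matching the reported pass.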