feat(design-audit): drop v2/ anti-pattern + wire Layer 2 patches contract + two-call patch flow by drewstone · Pull Request #89 · tangle-network/browser-agent-driver

drewstone · 2026-04-27T10:17:45Z

Stacks two commits that close the Track-2 evaluation gap surfaced by /eval-agent.

Commit 1: refactor — drop v2/ anti-pattern, wire Layer 2 patches end-to-end

The src/design/audit/v2/ directory is gone; types flatten into score-types.ts, build-result.ts, score.ts. AuditResult_v2 → AuditResult, etc. The unwired Layer 2 patches contract (PR #81 shipped 421 lines + 21 tests; nothing in production called it) is now wired:

evaluate.ts asks the LLM for patches[] on major/critical findings
build-result.ts runs parsePatches → validatePatch → enforcePatchPolicy
pipeline.ts synthesizes EnsembleClassification when profileOverride is set (so multi-dim scoring runs even when --profile X is passed)
patches/validate.ts: snapshot-anchoring required only when target.scope ∈ {html, structural} (CSS/TSX scopes target source files the audit can't see)
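
The scope-aware anchoring rule and the downgrade policy above can be sketched roughly as follows. This is an illustration under assumptions, not the code in patches/validate.ts or build-result.ts: the `Patch` and `Finding` shapes, the `before` field, and the helper names `isPatchValid` and `applyPatchPolicy` are hypothetical stand-ins for the real validatePatch / enforcePatchPolicy steps.

```typescript
// Hypothetical shapes for illustration; the repo's real types may differ.
type Scope = "html" | "structural" | "css" | "tsx";
type Severity = "critical" | "major" | "minor";

interface Patch {
  target: { scope: Scope };
  before: string; // assumed field: the text the patch expects to replace
  after: string;
}

interface Finding {
  id: string;
  severity: Severity;
  patches: Patch[];
}

// Scopes whose patches must anchor to a verbatim substring of the snapshot.
const ANCHORED_SCOPES: ReadonlySet<Scope> = new Set<Scope>(["html", "structural"]);

// Stand-in for the validation step: html/structural patches must match the
// snapshot; css/tsx patches target source files the audit cannot see, so
// they are accepted here and left for apply-time verification.
function isPatchValid(patch: Patch, snapshot: string): boolean {
  if (!ANCHORED_SCOPES.has(patch.target.scope)) return true;
  return patch.before.length > 0 && snapshot.includes(patch.before);
}

// Stand-in for the policy step: a major/critical finding with no valid
// patch is downgraded to minor; invalid patches are dropped.
function applyPatchPolicy(finding: Finding, snapshot: string): Finding {
  const valid = finding.patches.filter((p) => isPatchValid(p, snapshot));
  const needsPatch = finding.severity === "major" || finding.severity === "critical";
  const severity: Severity = needsPatch && valid.length === 0 ? "minor" : finding.severity;
  return { ...finding, severity, patches: valid };
}
```

In this sketch a CSS-scoped patch always passes validation, which mirrors the stated rule that the audit cannot see source files; whether the real validator applies other checks to those scopes is not stated in the description.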

Eval-agent caught a regression as the patch contract expanded the prompt: calibration dropped 1.00 → 0.60 → 0.00 over two iterations. Documented in .evolve/critical-audit/<ts>/reaudit-2026-04-27.md. The eval did exactly what it was built for.

Commit 2: two-call patch flow

Targeted retreat from the prompt bloat. Splits the audit into two LLM calls:

Findings + scores (slim, no patch contract)
Patches (new src/design/audit/patches/generate.ts) — runs after findings exist, asks for one Patch per major/critical given the snapshot
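
A minimal sketch of the two-call shape, assuming a generic prompt-in, text-out LLM client. The `Complete` type, the prompt wording, and the `Finding` / `RawPatch` shapes are all hypothetical; the real generatePatches signature is not shown in this PR description.

```typescript
// Hypothetical shapes; the repo's Finding and Patch types may differ.
type Severity = "critical" | "major" | "minor";
interface Finding { id: string; severity: Severity; description: string }
interface RawPatch { findingId: string; diff: string }

// Assumed LLM client: takes a prompt, resolves to the model's text reply.
type Complete = (prompt: string) => Promise<string>;

interface TwoCallResult { findings: Finding[]; patches: RawPatch[] }

async function auditInTwoCalls(complete: Complete, snapshot: string): Promise<TwoCallResult> {
  // Call 1: findings and scores only, so the prompt carries no patch contract.
  const findingsReply = await complete(
    `Audit this page. Reply with a JSON array of findings.\n${snapshot}`,
  );
  const findings = JSON.parse(findingsReply) as Finding[];

  // Only major/critical findings get a patch request.
  const fixable = findings.filter((f) => f.severity !== "minor");
  if (fixable.length === 0) return { findings, patches: [] };

  // Call 2: patches, given the snapshot plus the findings to fix.
  const patchesReply = await complete(
    `Write one fix per finding. Reply with a JSON array.\n` +
      `Findings: ${JSON.stringify(fixable)}\nSnapshot:\n${snapshot}`,
  );
  return { findings, patches: JSON.parse(patchesReply) as RawPatch[] };
}
```

Keeping the second call conditional on there being fixable findings means a clean page costs one LLM call, not two; that is a choice made for this sketch and may not match the repo.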

Eval-agent verdict (live, world-class tier):

| Flow | Before | After |
| --- | --- | --- |
| designAudit_calibration_in_range_rate | 0.00 | 1.00 (5/5 in band) |
| designAudit_patches_valid_rate | unmeasured | 0.96 (22/23 patches valid) |

Linear=9, Stripe=8, Vercel=8, Raycast=8, Cursor=8. 22/23 patches structurally apply.
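
For readers unfamiliar with the metric, the in-range rate can be computed along these lines. The scores are the ones reported above; the expected bands are invented for illustration, since the corpus bands are not quoted in this PR, and `inRangeRate` is a hypothetical name.

```typescript
interface Band { min: number; max: number }

// Fraction of corpus sites whose score falls inside its expected band.
// Sites with a band but no score count as misses.
function inRangeRate(
  scores: Record<string, number>,
  bands: Record<string, Band>,
): number {
  const sites = Object.keys(bands);
  if (sites.length === 0) return 0;
  const hits = sites.filter((site) => {
    const score = scores[site];
    const band = bands[site];
    return score !== undefined && band !== undefined && score >= band.min && score <= band.max;
  }).length;
  return hits / sites.length;
}

// Scores as reported in this PR.
const reportedScores: Record<string, number> = {
  linear: 9, stripe: 8, vercel: 8, raycast: 8, cursor: 8,
};

// ASSUMPTION: illustrative bands only; the real corpus bands are not shown here.
const assumedBands: Record<string, Band> = {
  linear: { min: 8, max: 10 },
  stripe: { min: 8, max: 10 },
  vercel: { min: 8, max: 10 },
  raycast: { min: 8, max: 10 },
  cursor: { min: 8, max: 10 },
};

const calibrationRate = inRangeRate(reportedScores, assumedBands);
```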

Caveat: N=1. Stats discipline asks for ≥3 reps before promotion. Next governor pick is a stability run, not more architectural change.

Tests

+5 unit tests for generatePatches; +6 patch-validate tests reflecting scope-aware contract; +1 build-result test asserting Layer 2 wiring (major-with-valid-patch survives, major-without downgrades to minor). Total: 1510 passing.

Test plan

pnpm lint clean
pnpm test 1510/1510
pnpm check:boundaries clean (157 files)
CLI smoke: pnpm design:eval:calibration produces auditResult on every report.json with patches[] populated on major findings
Tier1/Tier2 gates green in CI

Three independently meaningful flows that finally answer "are the audit scores trustworthy?", the question that gates whether the comparative-audit infra (jobs / reports / brand / orchestrator) produces anything useful.

designAudit_calibration_in_range_rate    fraction-in-range vs corpus    target >= 0.7
designAudit_reproducibility_max_stddev   max stddev across reps         target <= 0.5
designAudit_patches_valid_rate           validatePatch reuse, fraction  target >= 0.95

bench/design/eval/ holds pure-function evaluators. run.ts orchestrates, emits FlowEnvelopes, and merges into .evolve/scorecard.json without clobbering older flows.

pnpm design:eval               run all three
pnpm design:eval:calibration   cheapest tier, write to scorecard
pnpm design:eval:repro         reproducibility on 3 sites x 3 reps

Baseline established (live run against world-class tier):
designAudit_calibration_in_range_rate = 1.00 (5/5 in range)
linear=9.0 stripe=8.0 vercel=8.0 raycast=8.0 cursor=8.0

Real gap surfaced, exactly what eval-agent is for:
designAudit_patches_valid_rate = unmeasured
None of 4 critical/major findings emit patches[]; auditResultV2 missing from report.json. Layer 1 v2 + Layer 2 patches aren't writing through. 1503 unit tests passing didn't catch this; the eval did.

+9 tests across design-eval-scorecard / design-eval-patches. Total: 1503.
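
As a rough illustration of the reproducibility metric and the non-clobbering scorecard merge described in this commit, assuming a simplified `FlowEnvelope` shape and population standard deviation (the commit message specifies neither), with hypothetical helper names:

```typescript
// Hypothetical envelope shape; the real FlowEnvelope may carry more fields.
interface FlowEnvelope { flow: string; value: number; target: string }
type Scorecard = Record<string, FlowEnvelope>;

// Population standard deviation of one site's scores across reps.
// ASSUMPTION: the real evaluator may use the sample formula instead.
function stddev(scores: number[]): number {
  if (scores.length === 0) return 0;
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance = scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length;
  return Math.sqrt(variance);
}

// Worst-case spread across all sites; 0 when there are no sites.
function maxStddev(scoresBySite: Record<string, number[]>): number {
  return Math.max(0, ...Object.values(scoresBySite).map((reps) => stddev(reps)));
}

// Merge fresh envelopes by flow name. Flows that were not re-run keep
// their previous entry, so older results are not clobbered.
function mergeFlows(existing: Scorecard, fresh: FlowEnvelope[]): Scorecard {
  const merged: Scorecard = { ...existing };
  for (const envelope of fresh) merged[envelope.flow] = envelope;
  return merged;
}
```

Keying the scorecard by flow name is what lets a calibration-only run update one entry while leaving the others in place.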

…contract

Two changes that fold into one coherent diff:

Canonicalization: no version numbers in file or directory names. The src/design/audit/v2/ directory is gone; its contents flatten into src/design/audit/ (build-result.ts, score.ts, score-types.ts).

AuditResult_v2 → AuditResult
BuildV2ResultInput → BuildAuditResultInput
parseAuditResponseV2 → parseAuditResponse
buildEvalPromptV2 → buildEvalPrompt
buildAuditResultV2 → buildAuditResult
auditResultV2 field → auditResult
DesignFindingV1 → DesignFindingBase
AppliesWhenV1 → BaseAppliesWhen
V2_INTERNALS → BUILD_RESULT_INTERNALS
synthesizeScoresFromV1 → synthesizeScoresFromLegacy

Schema-versioning over-engineering removed: dropped schemaVersion: 2 on AuditResult; dropped the schemaVersion: 1 + v2: { ... } dual-shape wrapper in report.json; dropped my self-introduced MIN_TOKENS_SCHEMA / CURRENT_TOKENS_SCHEMA on tokens.json. Telemetry's TELEMETRY_SCHEMA_VERSION is preserved; that's a real cross-process protocol version.

Layer 2 patches contract wired end-to-end. The eval-agent surfaced that PR #81 shipped 421 lines of typed primitives + 21 unit tests but nothing in production ever called them. Three independent gaps:

evaluate.ts: added PATCH CONTRACT block to LLM prompt with exact shape, one worked example, snapshot-anchoring rule. Few-shots (standard, trust) include patches[]. Brain.auditDesign preserves raw patches as `rawPatches` on each finding.
build-result.ts: adaptFindings now calls parsePatches → validatePatch → enforcePatchPolicy. Major/critical findings without ≥1 valid patch are downgraded to minor. Test 'Layer 2: keeps a major finding with a valid patch, downgrades a major finding without one' proves the contract.
pipeline.ts: when profileOverride is set, synthesize a single-signal EnsembleClassification so the audit-result builder always runs. Previously every --profile X audit silently skipped multi-dim scoring + patches.

patches/validate.ts: snapshot-anchoring required only when target.scope ∈ {html, structural}. CSS / TSX / Tailwind patches target source files the audit can't see; agent verifies at apply-time.

Eval-agent caught a follow-up regression. Calibration metric dropped from 1.00 → 0.60 → 0.00 across two iterations as the patch contract expanded the prompt. The eval did exactly its job; without it the wiring would have shipped silently with a worse audit. Documented in .evolve/critical-audit/<ts>/reaudit-2026-04-27.md.

Next governor recommendation: /evolve targeting calibration recovery, hypothesis = split into two LLM calls (findings + scores; then patches given findings).

+1 unit test plus 5 updated patch-validate tests. Total: 1505 passing.
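
The pipeline.ts fix can be pictured as the following sketch. The field names on `EnsembleClassification`, the `"override"` source label, the confidence of 1, and the `resolveClassification` helper are all assumptions made for illustration; the commit message only says that a single-signal classification is synthesized when profileOverride is set.

```typescript
// Hypothetical shapes; the real EnsembleClassification is not shown in this PR.
interface Signal { source: string; profile: string; confidence: number }
interface EnsembleClassification { profile: string; signals: Signal[] }

// When the caller forces a profile (e.g. via --profile X), build a
// one-signal classification instead of skipping classification entirely,
// so downstream multi-dimension scoring and patch generation still run.
function resolveClassification(
  profileOverride: string | undefined,
  classify: () => EnsembleClassification,
): EnsembleClassification {
  if (profileOverride !== undefined) {
    return {
      profile: profileOverride,
      signals: [{ source: "override", profile: profileOverride, confidence: 1 }],
    };
  }
  return classify();
}
```

The point of the sketch is that the function always returns a classification, so the audit-result builder has no code path on which it is silently skipped.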

… patches metric measurable

Targeted retreat from the prompt-bloat that landed in refactor/audit-canonicalize-and-patches-wiring, keeping the wiring fixes intact. Splits the audit into two LLM calls:

1. Findings + scores (evaluate.ts): slim, focused, no patch contract in the prompt. Restored to its pre-bloat shape.
2. Patches (new src/design/audit/patches/generate.ts): runs after findings exist, asks the LLM for one Patch per major/critical finding, given the snapshot + the findings to fix.

build-result.ts orchestrates: adaptFindingsLite → generatePatches → parseAndAttachPatches → enforceFindingPolicy

Eval-agent verdict (live run, world-class tier):
designAudit_calibration_in_range_rate   0.00 → 0.60 (target 0.7)
designAudit_patches_valid_rate          unmeasured → 0.94 (target 0.95), 17/18 patches valid

Both deltas are within striking distance of one more /evolve round.

+5 unit tests for generatePatches. Total: 1510 passing.

Two surgical fixes from /evolve round 3:

bench/design/eval/calibration.ts:readScore. Prefer page.score (holistic LLM judgement) over auditResult.rollup.score for calibration. The rollup punishes single weak dimensions hard, dragging marketing pages below their gestalt quality. Holistic is the right calibration target; rollup stays the right ranking input.
src/design/audit/patches/generate.ts:buildPrompt. Sharpened the snapshot-anchoring rule. Default target.scope is now "css" (agent resolves at apply-time). "html" / "structural" only when paste-copying a verbatim substring of the snapshot.

Live verdict (world-class tier, 5 sites):
designAudit_calibration_in_range_rate   0.00 → 1.00 (target 0.7), 5/5 in band
designAudit_patches_valid_rate          unmeasured → 0.96 (target 0.95), 22/23 patches valid

Caveat: N=1. Stats discipline mandates 3 reps before promotion. Next governor pick is a stability run, not more architectural change.

1510/1510 tests still passing.
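
The readScore preference described in this commit could look something like the sketch below. The `PageReport` shape is a guess at the relevant slice of report.json, and falling back to the rollup when no holistic score exists is an assumption; the commit message only says page.score is preferred.

```typescript
// Hypothetical slice of a report.json page entry.
interface PageReport {
  score?: number; // holistic LLM judgement
  auditResult?: { rollup?: { score?: number } };
}

// Prefer the holistic score for calibration. ASSUMPTION: fall back to the
// rollup score when the holistic score is absent. `??` is used instead of
// `||` so a legitimate score of 0 is not skipped.
function readScore(page: PageReport): number | undefined {
  return page.score ?? page.auditResult?.rollup?.score;
}
```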

drewstone added 4 commits April 27, 2026 04:13

drewstone merged commit 9e9e0d8 into main Apr 27, 2026
10 checks passed

github-actions Bot mentioned this pull request Apr 27, 2026

Release: version packages #90

Open

drewstone merged 4 commits into main from evolve/two-call-patch-flow
