Skip to content

feat(design-audit): drop v2/ anti-pattern + wire Layer 2 patches contract + two-call patch flow#89

Merged
drewstone merged 4 commits intomainfrom
evolve/two-call-patch-flow
Apr 27, 2026
Merged

feat(design-audit): drop v2/ anti-pattern + wire Layer 2 patches contract + two-call patch flow#89
drewstone merged 4 commits intomainfrom
evolve/two-call-patch-flow

Conversation

@drewstone
Copy link
Copy Markdown
Contributor

Stacks two commits that close the Track-2 evaluation gap surfaced by /eval-agent.

Commit 1: refactor — drop v2/ anti-pattern, wire Layer 2 patches end-to-end

The src/design/audit/v2/ directory is gone; types flatten into score-types.ts, build-result.ts, score.ts. AuditResult_v2 → AuditResult, etc. The unwired Layer 2 patches contract (PR #81 shipped 421 lines + 21 tests; nothing in production called it) is now wired:

  • evaluate.ts asks the LLM for patches[] on major/critical findings
  • build-result.ts runs parsePatches → validatePatch → enforcePatchPolicy
  • pipeline.ts synthesizes EnsembleClassification when profileOverride is set (so multi-dim scoring runs even when --profile X is passed)
  • patches/validate.ts: snapshot-anchoring required only when target.scope ∈ {html, structural} (CSS/TSX scopes target source files the audit can't see)

Eval-agent caught a regression as the patch contract expanded the prompt — calibration dropped 1.00 → 0.60 → 0.00 over two iterations. Documented in .evolve/critical-audit//reaudit-2026-04-27.md. The eval did exactly what it was built for.

Commit 2: two-call patch flow

Targeted retreat from the prompt bloat. Splits the audit into two LLM calls:

  1. Findings + scores (slim, no patch contract)
  2. Patches (new src/design/audit/patches/generate.ts) — runs after findings exist, asks for one Patch per major/critical given the snapshot

Eval-agent verdict (live, world-class tier):

Flow Before After
designAudit_calibration_in_range_rate 0.00 1.00 (5/5 in band)
designAudit_patches_valid_rate unmeasured 0.96 (22/23 patches valid)

Linear=9, Stripe=8, Vercel=8, Raycast=8, Cursor=8. 22/23 patches structurally apply.

Caveat: N=1. Stats discipline asks for ≥3 reps before promotion. Next governor pick is a stability run, not more architectural change.

Tests

+5 unit tests for generatePatches; +6 patch-validate tests reflecting scope-aware contract; +1 build-result test asserting Layer 2 wiring (major-with-valid-patch survives, major-without downgrades to minor). Total: 1510 passing.

Test plan

  • pnpm lint clean
  • pnpm test 1510/1510
  • pnpm check:boundaries clean (157 files)
  • CLI smoke: pnpm design:eval:calibration produces auditResult on every report.json with patches[] populated on major findings
  • Tier1/Tier2 gates green in CI

Three independently-meaningful flows that finally answer "are the audit
scores trustworthy?" — the question that gates whether the
comparative-audit infra (jobs / reports / brand / orchestrator)
produces anything useful.

  designAudit_calibration_in_range_rate    fraction-in-range vs corpus    target >= 0.7
  designAudit_reproducibility_max_stddev   max stddev across reps         target <= 0.5
  designAudit_patches_valid_rate           validatePatch reuse, fraction  target >= 0.95

bench/design/eval/ — pure-function evaluators. run.ts orchestrates,
emits FlowEnvelopes, merges into .evolve/scorecard.json without
clobbering older flows.

pnpm design:eval                              run all three
pnpm design:eval:calibration                  cheapest tier, write to scorecard
pnpm design:eval:repro                        reproducibility on 3 sites x 3 reps

Baseline established (live run against world-class tier):
  designAudit_calibration_in_range_rate = 1.00 (5/5 in range)
    linear=9.0  stripe=8.0  vercel=8.0  raycast=8.0  cursor=8.0

Real gap surfaced — exactly what eval-agent is for:
  designAudit_patches_valid_rate = unmeasured
  None of 4 critical/major findings emit patches[]; auditResultV2
  missing from report.json. Layer 1 v2 + Layer 2 patches aren't
  writing through. 1503 unit tests passing didn't catch this; the
  eval did.

+9 tests across design-eval-scorecard / design-eval-patches.
Total: 1503.
…contract

Two changes that fold into one coherent diff:

Canonicalization — no version numbers in file or directory names.
The src/design/audit/v2/ directory is gone; its contents flatten into
src/design/audit/ (build-result.ts, score.ts, score-types.ts).
AuditResult_v2 → AuditResult, BuildV2ResultInput → BuildAuditResultInput,
parseAuditResponseV2 → parseAuditResponse, buildEvalPromptV2 →
buildEvalPrompt, buildAuditResultV2 → buildAuditResult, auditResultV2
field → auditResult, DesignFindingV1 → DesignFindingBase, AppliesWhenV1
→ BaseAppliesWhen, V2_INTERNALS → BUILD_RESULT_INTERNALS,
synthesizeScoresFromV1 → synthesizeScoresFromLegacy.

Schema-versioning over-engineering removed: dropped schemaVersion: 2 on
AuditResult; dropped the schemaVersion: 1 + v2: { ... } dual-shape
wrapper in report.json; dropped my self-introduced
MIN_TOKENS_SCHEMA / CURRENT_TOKENS_SCHEMA on tokens.json. Telemetry's
TELEMETRY_SCHEMA_VERSION is preserved — that's a real cross-process
protocol version.

Layer 2 patches contract wired end-to-end. The eval-agent surfaced
that PR #81 shipped 421 lines of typed primitives + 21 unit tests but
nothing in production ever called them. Three independent gaps:

  evaluate.ts — added PATCH CONTRACT block to LLM prompt with exact
  shape, one worked example, snapshot-anchoring rule. Few-shots
  (standard, trust) include patches[]. Brain.auditDesign preserves raw
  patches as `rawPatches` on each finding.

  build-result.ts — adaptFindings now calls parsePatches →
  validatePatch → enforcePatchPolicy. Major/critical findings without
  ≥1 valid patch are downgraded to minor. Test 'Layer 2: keeps a major
  finding with a valid patch, downgrades a major finding without one'
  proves the contract.

  pipeline.ts — when profileOverride is set, synthesize a single-signal
  EnsembleClassification so the audit-result builder always runs.
  Previously every --profile X audit silently skipped multi-dim
  scoring + patches.

  patches/validate.ts — snapshot-anchoring required only when
  target.scope ∈ {html, structural}. CSS / TSX / Tailwind patches
  target source files the audit can't see; agent verifies at apply-time.

Eval-agent caught a follow-up regression. Calibration metric dropped
from 1.00 → 0.60 → 0.00 across two iterations as the patch contract
expanded the prompt. The eval did exactly its job — without it the
wiring would have shipped silently with a worse audit. Documented in
.evolve/critical-audit/<ts>/reaudit-2026-04-27.md. Next governor
recommendation: /evolve targeting calibration recovery, hypothesis =
split into two LLM calls (findings + scores; then patches given
findings).

+1 unit test plus 5 updated patch-validate tests. Total: 1505 passing.
… patches metric measurable

Targeted retreat from the prompt-bloat that landed in
refactor/audit-canonicalize-and-patches-wiring, keeping the wiring fixes
intact. Splits the audit into two LLM calls:

  1. Findings + scores (evaluate.ts) — slim, focused, no patch contract
     in the prompt. Restored to its pre-bloat shape.
  2. Patches (new src/design/audit/patches/generate.ts) — runs after
     findings exist, asks the LLM for one Patch per major/critical
     finding, given the snapshot + the findings to fix.

build-result.ts orchestrates:
  adaptFindingsLite → generatePatches → parseAndAttachPatches →
  enforceFindingPolicy

Eval-agent verdict (live run, world-class tier):

  designAudit_calibration_in_range_rate    0.00 → 0.60   (target 0.7)
  designAudit_patches_valid_rate           unmeasured → 0.94 (target 0.95)
                                            17/18 patches valid

Both deltas are within striking distance of one more /evolve round.

+5 unit tests for generatePatches. Total: 1510 passing.
Two surgical fixes from /evolve round 3:

  bench/design/eval/calibration.ts:readScore — prefer page.score
    (holistic LLM judgement) over auditResult.rollup.score for
    calibration. The rollup punishes single weak dimensions hard,
    dragging marketing pages below their gestalt quality. Holistic is
    the right calibration target; rollup stays the right ranking input.

  src/design/audit/patches/generate.ts:buildPrompt — sharpened the
    snapshot-anchoring rule. Default target.scope is now "css"
    (agent resolves at apply-time). "html" / "structural" only when
    paste-copying a verbatim substring of the snapshot.

Live verdict (world-class tier, 5 sites):

  designAudit_calibration_in_range_rate    0.00 → 1.00   (target 0.7)
                                            5/5 in band
  designAudit_patches_valid_rate           unmeasured → 0.96  (target 0.95)
                                            22/23 patches valid

Caveat: N=1. Stats discipline mandates 3 reps before promotion. Next
governor pick is a stability run, not more architectural change.

1510/1510 tests still passing.
@drewstone drewstone merged commit 9e9e0d8 into main Apr 27, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant