feat(eval): aggregateJudgeVerdicts — generic judge-ensemble reducer#224
Merged
legal aggregateEnsemble (scoring.ts), creative judges.ts, and tax judge-ensemble.ts implement the same reduction three times: fan out N uncorrelated judges, mean each dimension over the SURVIVORS, report the inter-rater disagreement spread, sum cost. Lift it to the substrate, keyed by the caller's rubric. Fail-loud: a judge that errored is recorded in failedJudges with perDimension: null, never a zero; all-failed throws; a failed judge's cost is still summed (it burned tokens). Composite reuses weightedComposite — no re-implementation. The app-shell eval-campaign scaffold + the products compose this instead of each owning a copy.
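The fail-loud rules above can be sketched roughly as follows. This is an illustration only, not the merged code: `JudgeVerdict`, `EnsembleSummary`, `aggregateSketch`, and the field names `judgeId`, `cost`, and `totalCost` are hypothetical, since the real signature of `aggregateJudgeVerdicts` is not shown on this page. The sketch also assumes an empty input is treated like the all-failed case.

```typescript
// Hypothetical shapes for illustration; the PR's real types are not shown here.
type JudgeVerdict<D extends string> = {
  judgeId: string;
  perDimension: Record<D, number> | null; // null marks a judge that errored
  cost: number; // units are an assumption
};

type EnsembleSummary<D extends string> = {
  perDimension: Record<D, number>; // mean over surviving judges only
  failedJudges: string[];
  totalCost: number;
};

function aggregateSketch<D extends string>(
  dims: readonly D[],
  verdicts: readonly JudgeVerdict<D>[],
): EnsembleSummary<D> {
  const failedJudges: string[] = [];
  const survivors: Record<D, number>[] = [];
  let totalCost = 0;

  for (const v of verdicts) {
    totalCost += v.cost; // a failed judge still burned tokens
    if (v.perDimension === null) {
      failedJudges.push(v.judgeId); // recorded, never folded into a zero
    } else {
      survivors.push(v.perDimension);
    }
  }

  // Fail loud: with no survivors there is no honest number to report.
  // Assumption: empty input takes the same path.
  if (survivors.length === 0) {
    throw new Error(`no surviving judges (${verdicts.length} verdicts, all failed)`);
  }

  const perDimension = {} as Record<D, number>;
  for (const d of dims) {
    let sum = 0;
    for (const s of survivors) sum += s[d];
    perDimension[d] = sum / survivors.length;
  }
  return { perDimension, failedJudges, totalCost };
}

// Judge "b" errored: it is excluded from the means but its cost is counted.
const summary = aggregateSketch<"accuracy" | "clarity">(
  ["accuracy", "clarity"],
  [
    { judgeId: "a", perDimension: { accuracy: 8, clarity: 6 }, cost: 1 },
    { judgeId: "b", perDimension: null, cost: 2 },
    { judgeId: "c", perDimension: { accuracy: 4, clarity: 10 }, cost: 3 },
  ],
);
// summary.perDimension is { accuracy: 6, clarity: 8 }
// summary.failedJudges is ["b"], summary.totalCost is 6
```

Dividing by the survivor count rather than the total judge count is what keeps a failed judge from dragging a dimension toward zero.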
tangletools (Contributor) approved these changes on Jun 5, 2026 and left a comment:
Pure reducer, fail-loud (failed judge ≠ zero, all-failed throws, cost includes failures), composes weightedComposite. 10 new tests, 1889 green. Approving.
drewstone added a commit that referenced this pull request on Jun 5, 2026.
The one substrate primitive the shared agent-app eval-campaign scaffold needs: a generic judge-ensemble reducer. legal (aggregateEnsemble, tests/eval/lib/scoring.ts), creative (production-loop/judges.ts), and tax (judge-ensemble.ts) implement the same reduction three times: fan out N uncorrelated judges, mean each dimension over the survivors, report inter-rater spread, sum cost. This lifts it to the substrate, keyed by the caller's rubric (D extends string).

Behavior (fail-loud)

- A judge that errored (perDimension: null) is recorded in failedJudges, never folded into a zero. All-failed throws: a silent zero here corrupts the gate's number (CLAUDE.md "No fallbacks. Fail loud.").
- maxDisagreement = max over dims of (max − min) across survivors. This is the inter-rater signal that gates/analyzeRuns consume.
- composite reuses the substrate's weightedComposite (no re-implementation). No weights ⇒ uniform; a partial weights map selects-and-weights exactly those dims.

Verification

- pnpm typecheck: clean
- pnpm lint: exit 0
- pnpm test: 1889 passed, 2 skipped (1879 prior + 10 new: survivor-mean, uniform/weighted composite, failed-judge-not-zero, all-failed-throws, cost-includes-failures, disagreement-spread, clamp, empty-input, rationale)

Next

- /eval-campaign façade composes this in buildEnsembleJudge; legal/creative/tax delete their copies and call it.
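The maxDisagreement and composite rules described under Behavior can be illustrated with a small sketch. It makes several assumptions: `maxDisagreementSketch` and `compositeSketch` are hypothetical names, and `compositeSketch` is a stand-in written for this example, not the substrate's `weightedComposite`, whose signature is not shown here. In particular, normalizing by the sum of the weights is an assumption about how the composite is scaled.

```typescript
type Scores = Record<string, number>;

// Largest (max - min) spread across surviving judges, taken over all dimensions.
function maxDisagreementSketch(dims: readonly string[], survivors: readonly Scores[]): number {
  if (survivors.length === 0) {
    throw new Error("no surviving judges"); // fail loud rather than report 0
  }
  let worst = 0;
  for (const d of dims) {
    const scores = survivors.map((s) => s[d]);
    worst = Math.max(worst, Math.max(...scores) - Math.min(...scores));
  }
  return worst;
}

// Stand-in composite (assumption: a weighted mean normalized by total weight).
// No weights: every dimension counts equally.
// Partial weights: only the dimensions named in the map contribute.
function compositeSketch(
  perDimension: Scores,
  weights?: Partial<Record<string, number>>,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const d of Object.keys(perDimension)) {
    const w = weights === undefined ? 1 : weights[d];
    if (w === undefined) continue; // dimension not selected by the weights map
    weighted += perDimension[d] * w;
    totalWeight += w;
  }
  if (totalWeight <= 0) {
    throw new Error("composite needs a positive total weight");
  }
  return weighted / totalWeight;
}

const spread = maxDisagreementSketch(
  ["accuracy", "clarity"],
  [
    { accuracy: 8, clarity: 6 },
    { accuracy: 4, clarity: 9 },
  ],
); // accuracy spread 4, clarity spread 3, so spread is 4

const means: Scores = { accuracy: 6, clarity: 8 };
const uniform = compositeSketch(means); // (6 + 8) / 2 = 7
const weightedBoth = compositeSketch(means, { accuracy: 3, clarity: 1 }); // (18 + 8) / 4 = 6.5
const clarityOnly = compositeSketch(means, { clarity: 2 }); // 16 / 2 = 8
```

Taking the maximum over dimensions means one contested dimension is enough to flag the run, even if the judges agree everywhere else.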