feat(eval): aggregateJudgeVerdicts — generic judge-ensemble reducer#224

Merged
drewstone merged 1 commit into main from feat/judge-ensemble-reducer
Jun 5, 2026

Conversation

@drewstone
Contributor

The one substrate primitive the shared agent-app eval-campaign scaffold needs: a generic judge-ensemble reducer. legal (aggregateEnsemble, tests/eval/lib/scoring.ts), creative (production-loop/judges.ts), and tax (judge-ensemble.ts) implement the same reduction three times — fan out N uncorrelated judges, mean each dimension over the survivors, report inter-rater spread, sum cost. This lifts it to the substrate, keyed by the caller's rubric (D extends string).

Behavior (fail-loud)

  • Survivor mean per dimension; out-of-range judge scores clamped to [0,1].
  • A failed judge (perDimension: null) is recorded in failedJudges, never folded into a zero. All-failed throws — a silent zero here corrupts the gate's number (CLAUDE.md "No fallbacks. Fail loud.").
  • Cost sums over ALL verdicts, failed included — a failed judge still burned tokens; counting only survivors under-reports spend (the cost-ledger silent-zero class).
  • maxDisagreement = max over dims of (max−min) across survivors — the inter-rater signal gates/analyzeRuns consume.
  • composite reuses the substrate's weightedComposite (no re-implementation). No weights ⇒ uniform; a partial weights map selects-and-weights exactly those dims.
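
The rules above can be illustrated with a minimal sketch. This is not the merged implementation: `aggregateSketch`, `compositeSketch`, the `judge` and `cost` fields, `totalCost`, and the explicit `dims` parameter are hypothetical names chosen for illustration, and `compositeSketch` only stands in for the substrate's `weightedComposite`, whose real signature is not shown in this PR. Clamping scores before computing the spread is also an assumption.

```typescript
// Hypothetical verdict shape. Only perDimension, failedJudges, maxDisagreement
// and composite are names taken from the PR text; the rest are assumptions.
type Verdict<D extends string> = {
  judge: string;                          // assumed identifier field
  perDimension: Record<D, number> | null; // null marks a failed judge
  cost: number;                           // assumed spend field
};

type Aggregate<D extends string> = {
  perDimension: Record<D, number>;
  composite: number;
  maxDisagreement: number;
  failedJudges: string[];
  totalCost: number;
};

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Stand-in for the substrate's weightedComposite (real signature not shown in the PR).
function compositeSketch<D extends string>(
  dims: readonly D[],
  means: Record<D, number>,
  weights?: Partial<Record<D, number>>,
): number {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const d of dims) {
    // No weights map: uniform. Partial map: only the listed dimensions count.
    const w: number | undefined = weights ? weights[d] : 1;
    if (w === undefined) continue;
    weightedSum += means[d] * w;
    weightTotal += w;
  }
  if (weightTotal <= 0) throw new Error("weights must sum to a positive number");
  return weightedSum / weightTotal;
}

function aggregateSketch<D extends string>(
  dims: readonly D[],
  verdicts: readonly Verdict<D>[],
  weights?: Partial<Record<D, number>>,
): Aggregate<D> {
  const scored: Record<D, number>[] = [];
  const failedJudges: string[] = [];
  let totalCost = 0;
  for (const v of verdicts) {
    totalCost += v.cost; // failed judges still spent tokens
    if (v.perDimension === null) failedJudges.push(v.judge);
    else scored.push(v.perDimension);
  }
  // Fail loud instead of reporting a zero. This sketch also throws on empty
  // input; the PR text does not say how the real function treats that case.
  if (scored.length === 0) {
    throw new Error(`no surviving judges (${failedJudges.length} failed)`);
  }
  const perDimension = {} as Record<D, number>;
  let maxDisagreement = 0;
  for (const d of dims) {
    // Assumption: scores are clamped before both the mean and the spread.
    const xs = scored.map((s) => clamp01(s[d]));
    perDimension[d] = xs.reduce((a, b) => a + b, 0) / xs.length;
    maxDisagreement = Math.max(maxDisagreement, Math.max(...xs) - Math.min(...xs));
  }
  return {
    perDimension,
    composite: compositeSketch(dims, perDimension, weights),
    maxDisagreement,
    failedJudges,
    totalCost,
  };
}

// Usage: two survivors, one failed judge.
type Dim = "accuracy" | "clarity";
const rubric: Dim[] = ["accuracy", "clarity"];
const verdicts: Verdict<Dim>[] = [
  { judge: "a", perDimension: { accuracy: 0.5, clarity: 0.5 }, cost: 0.25 },
  { judge: "b", perDimension: { accuracy: 1.5, clarity: 0.75 }, cost: 0.25 }, // 1.5 clamps to 1
  { judge: "c", perDimension: null, cost: 0.5 },
];
const result = aggregateSketch(rubric, verdicts);
console.log(result);
```

In this example judge `c` appears in `failedJudges` and contributes its 0.5 to `totalCost`, but it does not pull either dimension mean toward zero. Passing `{ accuracy: 2 }` as the weights map would make the composite equal the accuracy mean alone, since only listed dimensions are weighted.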

Verification

  • pnpm typecheck clean · pnpm lint exit 0
  • pnpm test: 1889 passed, 2 skipped (1879 prior + 10 new: survivor-mean, uniform/weighted composite, failed-judge-not-zero, all-failed-throws, cost-includes-failures, disagreement-spread, clamp, empty-input, rationale)
  • Additive only

Next

  • agent-app /eval-campaign façade composes this in buildEnsembleJudge; legal/creative/tax delete their copies and call it.

legal aggregateEnsemble (scoring.ts), creative judges.ts, and tax
judge-ensemble.ts implement the same reduction three times: fan out N
uncorrelated judges, mean each dimension over the SURVIVORS, report the
inter-rater disagreement spread, sum cost. Lift it to the substrate, keyed by
the caller's rubric.

Fail-loud: a judge that errored is recorded in failedJudges with
perDimension: null, never a zero; all-failed throws; a failed judge's cost is
still summed (it burned tokens). Composite reuses weightedComposite — no
re-implementation. The app-shell eval-campaign scaffold + the products compose
this instead of each owning a copy.
Contributor

@tangletools tangletools left a comment


Pure reducer, fail-loud (failed judge ≠ zero, all-failed throws, cost includes failures), composes weightedComposite. 10 new tests, 1889 green. Approving.

@drewstone drewstone merged commit c58a823 into main Jun 5, 2026
1 check passed
@drewstone drewstone deleted the feat/judge-ensemble-reducer branch June 5, 2026 22:50
drewstone added a commit that referenced this pull request Jun 5, 2026
Lockstep version bump (npm + pyproject + python __version__ fallback) for the
eval-campaign scaffold prep primitives merged in #223 + #224.