feat(eval): aggregateJudgeVerdicts — generic judge-ensemble reducer by drewstone · Pull Request #224 · tangle-network/agent-eval

drewstone · 2026-06-05T22:48:25Z

The one substrate primitive the shared agent-app eval-campaign scaffold needs: a generic judge-ensemble reducer. The legal (aggregateEnsemble, tests/eval/lib/scoring.ts), creative (production-loop/judges.ts), and tax (judge-ensemble.ts) products each implement the same reduction: fan out N uncorrelated judges, mean each dimension over the survivors, report inter-rater spread, sum cost. This PR lifts it to the substrate, keyed by the caller's rubric (D extends string).

Behavior (fail-loud)

Survivor mean per dimension; out-of-range judge scores clamped to [0,1].
A failed judge (perDimension: null) is recorded in failedJudges, never folded into a zero. All-failed throws — a silent zero here corrupts the gate's number (CLAUDE.md "No fallbacks. Fail loud.").
Cost sums over ALL verdicts, failed included — a failed judge still burned tokens; counting only survivors under-reports spend (the cost-ledger silent-zero class).
maxDisagreement = max over dims of (max−min) across survivors — the inter-rater signal gates/analyzeRuns consume.
composite reuses the substrate's weightedComposite (no re-implementation). No weights ⇒ uniform; a partial weights map selects-and-weights exactly those dims.
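
The behavior listed above can be illustrated with a minimal sketch. This is not the merged implementation: the names `JudgeVerdict`, `EnsembleResult`, `judgeId`, `costUsd`, `totalCostUsd`, `aggregateSketch`, and `compositeSketch` are hypothetical, and only `perDimension`, `failedJudges`, `maxDisagreement`, and `composite` come from the PR text. `compositeSketch` is a stand-in for the substrate's `weightedComposite`, whose real signature is not shown here. Normalizing by the weight sum and treating empty input like the all-failed case are assumptions.

```typescript
// Hypothetical shapes; field names other than perDimension, failedJudges,
// maxDisagreement and composite are assumptions made for this sketch.
type JudgeVerdict<D extends string> = {
  judgeId: string;
  perDimension: Record<D, number> | null; // null marks a failed judge
  costUsd: number;
};
type Survivor<D extends string> = JudgeVerdict<D> & { perDimension: Record<D, number> };
type EnsembleResult<D extends string> = {
  perDimension: Record<D, number>;
  composite: number;
  maxDisagreement: number;
  failedJudges: string[];
  totalCostUsd: number;
};

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Stand-in for weightedComposite: uniform when no weights are given,
// otherwise only the dimensions present in the weights map contribute.
function compositeSketch<D extends string>(
  means: Record<D, number>,
  weights?: Partial<Record<D, number>>,
): number {
  const dims = Object.keys(weights ?? means) as D[];
  let num = 0;
  let den = 0;
  for (const d of dims) {
    const w = weights ? (weights[d] ?? 0) : 1;
    num += w * means[d];
    den += w;
  }
  if (den === 0) throw new Error("composite: weights sum to zero");
  return num / den; // assumption: normalized by the weight sum
}

function aggregateSketch<D extends string>(
  verdicts: JudgeVerdict<D>[],
  dims: readonly D[],
  weights?: Partial<Record<D, number>>,
): EnsembleResult<D> {
  const survivors = verdicts.filter(
    (v): v is Survivor<D> => v.perDimension !== null,
  );
  const failedJudges = verdicts
    .filter((v) => v.perDimension === null)
    .map((v) => v.judgeId);
  // Fail loud: no survivors means no score, never a silent zero.
  // Assumption: empty input is treated the same way.
  if (survivors.length === 0) throw new Error("all judges failed");

  const perDimension = {} as Record<D, number>;
  let maxDisagreement = 0;
  for (const d of dims) {
    const scores = survivors.map((v) => clamp01(v.perDimension[d]));
    perDimension[d] = scores.reduce((a, b) => a + b, 0) / scores.length;
    const spread = Math.max(...scores) - Math.min(...scores);
    maxDisagreement = Math.max(maxDisagreement, spread);
  }
  // Cost sums over ALL verdicts: a failed judge still burned tokens.
  const totalCostUsd = verdicts.reduce((a, v) => a + v.costUsd, 0);
  return {
    perDimension,
    composite: compositeSketch(perDimension, weights),
    maxDisagreement,
    failedJudges,
    totalCostUsd,
  };
}

// Usage: two survivors (one with an out-of-range score) and one failed judge.
type Rubric = "accuracy" | "clarity";
const verdicts: JudgeVerdict<Rubric>[] = [
  { judgeId: "a", perDimension: { accuracy: 0.5, clarity: 0.5 }, costUsd: 0.25 },
  { judgeId: "b", perDimension: { accuracy: 1.5, clarity: 0.75 }, costUsd: 0.25 },
  { judgeId: "c", perDimension: null, costUsd: 0.5 },
];
const dims: readonly Rubric[] = ["accuracy", "clarity"];
const uniform = aggregateSketch(verdicts, dims);
const accuracyOnly = aggregateSketch(verdicts, dims, { accuracy: 1 });
console.log(uniform, accuracyOnly.composite);
```

In this sketch judge `c` appears in `failedJudges` and in the cost total but contributes nothing to the means, and the out-of-range 1.5 is clamped to 1 before averaging.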

Verification

pnpm typecheck clean · pnpm lint exit 0
pnpm test — 1889 passed, 2 skipped (1879 prior + 10 new: survivor-mean, uniform/weighted composite, failed-judge-not-zero, all-failed-throws, cost-includes-failures, disagreement-spread, clamp, empty-input, rationale)
Additive only

legal aggregateEnsemble (scoring.ts), creative judges.ts, and tax judge-ensemble.ts implement the same reduction three times: fan out N uncorrelated judges, mean each dimension over the SURVIVORS, report the inter-rater disagreement spread, sum cost. Lift it to the substrate, keyed by the caller's rubric. Fail-loud: a judge that errored is recorded in failedJudges with perDimension: null, never a zero; all-failed throws; a failed judge's cost is still summed (it burned tokens). Composite reuses weightedComposite — no re-implementation. The app-shell eval-campaign scaffold + the products compose this instead of each owning a copy.

tangletools

Pure reducer, fail-loud (failed judge ≠ zero, all-failed throws, cost includes failures), composes weightedComposite. 10 new tests, 1889 green. Approving.

tangletools approved these changes Jun 5, 2026

drewstone merged commit c58a823 into main Jun 5, 2026
1 check passed

drewstone deleted the feat/judge-ensemble-reducer branch June 5, 2026 22:50

drewstone mentioned this pull request Jun 5, 2026

chore(release): 0.81.0 — eval-campaign scaffold prep primitives #225

Merged

drewstone added a commit that referenced this pull request Jun 5, 2026

chore(release): 0.81.0 — eval-campaign scaffold prep primitives (#225)

9368f7d

Lockstep version bump (npm + pyproject + python __version__ fallback) for the eval-campaign scaffold prep primitives merged in #223 + #224.

drewstone merged 1 commit into main from feat/judge-ensemble-reducer

2 participants
