feat(runtime): defineLeaderboard — declarative eval leaderboard facade over runProfileMatrix#453
Conversation
…e over runProfileMatrix A product's harness×model leaderboard reduces to cases + prompt + score: defineLeaderboard assembles expandProfileAxes × loopDispatch/naiveDriver × the backend resolvers into ONE runProfileMatrix call, with toBenchmarkAdapter() exposing the same domain surface in the structural BenchmarkAdapter shape. No new execution/judging/metering code — every default is overridable (backends, flags, parseOutput, onCellEvents, export, dispatch, judges, matrix passthrough) and runProfileMatrix stays public as the escape floor. The default run dir is FRESH per invocation (timestamp+pid under tmpdir): runProfileMatrix caches cells by run dir, and a stable default silently resumes a prior failed zero-token cell — only an explicit --run-dir opts in. Requires agent-eval >=0.101.0 (expandProfileAxes / harnessAxisOf / CODING_HARNESSES); peer floor raised accordingly. 10 offline tests via inProcessSandboxClient. Minor bump to 0.84.0.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved drewstone PR — 0d1936d6
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-03T04:59:34Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 2 (1 low, 1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.1s |
| Interrogation | 147.4s (2 bridge agents) |
| Total | 147.5s |
💰 Value — sound-with-nits
A declarative facade that assembles existing primitives (expandProfileAxes × loopDispatch × naiveDriver × runProfileMatrix) into one call, collapsing ~650 hand-rolled lines per product into ~40; sound consolidation in the codebase's established grain, with only minor nits.
- What it does: Adds
defineLeaderboard<TCase>(spec)to @tangle-network/agent-runtime/loops: give itcases+prompt+score, and.run(argv)expands a base profile across harness×model axes, runs every (profile, case) cell as a driven loop vialoopDispatch+naiveDriver, wrapsscoreas the campaign judge, and emits ONErunProfileMatrixcall, then exports a ranked markdown leaderboard. `.toBenchmarkA - Goals it achieves: Kill the per-product hand-rolled matrix runner (~650 lines each, per the PR body) and the three recurring footguns it re-hits: stale run-dir cell-cache reuse silently skipping dispatch, zero-token stub cells, and RunRecord rejection of bare model ids. Give products one declarative seam so the domain is just cases+prompt+score, and let a product leaderboard register into the bench BenchmarkAdapter
- Assessment: Good change, in the grain of the codebase. It is pure assembly — zero new execution/judging/metering code; every moving part (expandProfileAxes, loopDispatch, naiveDriver, resolveSandboxClient, leaderboard/renderLeaderboardMarkdown, runProfileMatrix) is an existing primitive already exported from this package. It directly continues the recent consolidation siblings:
loop-dispatch.ts(collapse ru - Better / existing approach: Searched:
runProfileMatrix,expandProfileAxes,BenchmarkAdapter,toBenchmarkAdapter,loopCampaignDispatchconsumers (src/**/*.ts), the bench adapter registry (bench/src/benchmarks/types.ts, bench/src/adapters.ts), and hand-rolled matrix.ts across ~/company. No existing equivalent in agent-runtime — none found. The correct alternative (importing the realBenchmarkAdapterfrom bench) is - Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound
A clean declarative facade that collapses the documented ~650-line hand-rolled matrix assembly (tax/legal/gtm) into ~40 lines of domain code, wiring only existing primitives with every default overridable and the documented footguns fixed by default.
- Integration: Properly exported from src/runtime/index.ts:93-104 onto the /loops surface. Every primitive it assembles is confirmed present: loopDispatch (loop-dispatch.ts:159), naiveDriver (steering-drivers.ts:109), resolveSandboxClient (resolve-sandbox-client.ts:54), leaderboard/renderLeaderboardMarkdown (benchmark-report.ts), and expandProfileAxes/harnessAxisOf/CODING_HARNESSES + runProfileMatrix/JudgeConfig
- Fit with existing patterns: Matches the codebase's established grain exactly. runPersonaDispatch in src/conversation/run-persona.ts:221 is the same shape (wrap a runner as a ProfileDispatchFn for runProfileMatrix); loopDispatch itself is the same shape; examples/webcode-matrix/webcode-matrix.ts is the canonical hand-rolled assembly. docs/archive/benchmark-matrix-consolidation.md identifies tax/legal/gtm as three duplicated ~
- Real-world viability: Test coverage hits the load-bearing paths: end-to-end offline run with real metering (integrity verdict='real'), fresh run-dir per invocation (the stale-cell-cache footgun fix), explicit --run-dir resume, case subsetting + unknown-id rejection, snapshot stamping, structured score with dimensions, onCellEvents metric-capture seam, custom flag parsing, fail-loud on the default 'sandbox' backend, and
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added src/runtime/define-leaderboard.ts
console.log(table)
💰 Value Audit
🟡 Local structural BenchmarkAdapter types shadow the canonical bench ones [maintenance] ``
define-leaderboard.ts:96-124 redeclares
LeaderboardBenchTask/LeaderboardBenchScore/LeaderboardBenchmarkAdapteras structural mirrors ofBenchTask/BenchScore/BenchmarkAdapterin bench/src/benchmarks/types.ts:13-46. This is forced by the dep direction (bench → agent-runtime, not back), so it's defensible, but the two definitions can drift silently: e.g. the canonicalBenchmarkAdapteralso has optionaloutput?: OutputAdapter<string>andleafClient?(types.ts:46,57) which the local
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| glm | deepseek | aggregate | |
|---|---|---|---|
| Readiness | 62 | 74 | 62 |
| Confidence | 85 | 85 | 85 |
| Correctness | 62 | 74 | 62 |
| Security | 62 | 74 | 62 |
| Testing | 62 | 74 | 62 |
| Architecture | 62 | 74 | 62 |
Reviewer score is advisory once the run is complete and the verdict has no blockers.
Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision.
🟡 LOW Several newly-listed agent-eval symbols carry no-summary placeholders — docs/api/primitive-catalog.md
New campaign entries discoverEvalFixtures, loadEvalFixture, loadEvalFixtureScenarios, planCampaignRun, planEvalFixtureRun, policyEditProposer all render '(no summary — add a TSDoc line at the declaration)'. These originate from the bumped @tangle-network/agent-eval@0.103.1 dependency, not this repo's source, so the catalog is faithfully generated — but the docs ship public API rows with no description. Track as a follow-up TSDoc task in agent-eval; not a blocker for this PR.
🟡 LOW defineLeaderboard has no TSDoc summary at declaration — docs/api/primitive-catalog.md
Catalog row reads '(no summary — add a TSDoc line at the declaration)'. The module-level comment (define-leaderboard.ts:1-30) documents the facade thoroughly, but the exported function at line 255 has no JSDoc directly above it (line 254 blank, preceded by normalizeScore). Add a one-line TSDoc on the export so the catalog table self-describes instead of showing the placeholder. Cosmetic only — signature and file:line are correct.
🟡 LOW Range-form inconsistency: agent-eval uses caret, sibling Tangle deps use open range — package.json
@tangle-network/agent-interfaceand@tangle-network/sandbox(both dev and peer) use the>=X <1.0.0form; only agent-eval devDep was switched to caret^0.103.1. Cosmetic but breaks the local convention and narrows resolution semantics (caret for 0.x caps at next minor). If the intent was 'latest 0.103.x', keep caret; if intent was 'any compatible', restore>=0.101.0 <1.0.0to match siblings.
🟡 LOW devDep caret pin narrower than peerDep advertised range creates untested compatibility window — package.json
devDep
^0.103.1resolves to>=0.103.1 <0.104.0for 0.x versions, so CI/test only ever exercises 0.103.x. Yet peerDependencies advertises>=0.101.0 <1.0.0(line 126), claiming 0.101.0–0.102.x are supported. The previous devDep (>=0.100.0 <1.0.0) covered the same breadth as the peer range; the new caret does not. Consumers pinning 0.101.x/0.102.x will hit a peer range the project never tested. Fix: either widen devDep back to>=0.101.0 <1.0.0to match peerDep, or raise peerDep floor to>=0.103.1to match what is actually tested. Preferable: align both on the tested floor.
🟡 LOW devDep version pin narrow but within peer range; no minimum-version CI coverage — package.json
devDep changed from
>=0.100.0 <1.0.0to^0.103.1(effectively >=0.103.1 <0.104.0). CI tests only the latest version, never the declared peer floor of 0.101.0. If a consumer resolves to 0.101.0–0.102.x and hits a runtime API gap not exercised by tests, CI won't catch it. The imported APIs (expandProfileAxes, runProfileMatrix, etc.) all exist in 0.101.0, so risk is low.
🟡 LOW peerDependency minimum bump (0.97.0 → 0.101.0) is breaking for existing consumers — package.json
Raising the peerDep floor by 4 minor versions drops support for agent-eval 0.97.0–0.100.x. Under 0.x semver this is permitted in a minor bump, but consumers will see
ERESOLVEpeer warnings on install. No CHANGELOG entry is visible in this shot's scope; recommend the PR description or CHANGELOG explicitly call out the required consumer upgrade so downstreampnpm installfailures are not a surprise.
🟡 LOW Transitive sandbox major-series jump (0.3.0 -> 0.9.5) introduced via tcloud@0.4.14 — pnpm-lock.yaml
tcloud@0.4.14 now resolves its sandbox dependency to 0.9.5 instead of 0.3.0, a large minor-series jump inside the 0.x range (semver-major in 0.x terms). The lock file itself is correct and consistent; the only risk is runtime/behavioral drift in sandbox APIs consumed transitively via tcloud. This is not a lock defect — flagging so source-level shots confirm no sandbox 0.3-era APIs are relied upon. No action required within this shot.
🟡 LOW Default scoreJudge declares only 'composite' but passes extra dimensions — src/runtime/define-leaderboard.ts
The default scoreJudge at line 281-292 declares
dimensions: [{ key: 'composite' }]but its score() implementation spreadss.dimensionsinto the returned dimensions object (line 288). Extra dimensions from spec.score() are recorded but undeclared in the judge config. Downstream consumers iterating declared dimensions would miss them. Either declare all known dimensions or document that undeclared dimensions are only carried in the raw data.
🟡 LOW --shots/--reps parse to NaN with no validation, silently unbounding the naive retry cap — src/runtime/define-leaderboard.ts
const shots = Number(args.shots ?? spec.shots ?? 1)(and same for reps) accepts non-numeric input like--shots abcand yields NaN. In naiveDriver.plan (steering-drivers.ts:122),history.length >= NaNis always false, so the retry cap silently disappears — the loop only stops on a valid verdict, potentially hammering the backend indefinitely on a case that never passes. Fix: validate withNumber.isFinite(n) && n >= 1and throw a fail-loud error otherwise, consistent with the other fail-loud guards in this function.
🟡 LOW as never cast on sandboxOverrides hides shape drift — src/runtime/define-leaderboard.ts
The constructed
sandboxOverrides: { backend: { type: axis.harness, model: {...} } }is cast viaas neverto bypass the type system. Tests pass today, but a future agent-eval type change to the expectedsandboxOverridesshape will not surface as a compile error here — the facade could silently send a wrong-shaped override to the sandbox. Replace with a structural cast or import the real override type.
🟡 LOW sandboxClient = makeClient() runs unconditionally even when spec.dispatch overrides the default — src/runtime/define-leaderboard.ts
A spec that fully replaces
dispatch(LEVEL 2 override) and has no need for the default loop still hitsmakeClient()— and the defaultsandboxbackend factory throws unless overridden. So a LEVEL-2 dispatch user is forced to also overridespec.backends.sandbox(or pass--backend <custom>) just to skip a client they never use. Guard withconst sandboxClient = spec.dispatch ? null : makeClient()and pass it through only to the default path.
🟡 LOW teardown is skipped when setup throws — src/runtime/define-leaderboard.ts
await spec.setup?.(ctx)runs BEFORE thetry/finallythat wrapsrunProfileMatrix(line 481 vs try at 482). Ifsetuppartially succeeds (opens handles, acquires boxes) and then throws,teardownnever runs — a resource leak path. The JSDoc says teardown runs 'after the matrix, even on failure,' which is honored for matrix failure but not for setup failure. Wrap setup in the same try, or document that setup must be self-cleaning on throw.
🟡 LOW toBenchmarkAdapter().loadTasks() ignores opts.split — src/runtime/define-leaderboard.ts
The adapter accepts
split?: string(declared at line 119 to match the BenchmarkAdapter structural contract) but the implementation only honorsidsandlimit. A registry requestingsplit: 'holdout'silently receives every case — a quiet contract gap rather than a fail-loud refusal. Either filter by a per-case split field (if the spec exposes one), or throwsplit is not modeled by this leaderboardso callers don't get a wrong-shaped slice.
🟡 LOW bareModel() has dead null-coalescing fallback — src/runtime/define-leaderboard.ts
model.split('@')[0] ?? model— the?? modelis unreachable becauseString.prototype.splitalways returns at least one element. Remove the?? modelor change tomodel.split('@')[0].
🟡 LOW maxIterations duplicated on naiveDriver and top-level loop options — src/runtime/define-leaderboard.ts
maxIterations: shotsappears on both the naiveDriver options (line 445) and the returned loop options object (line 476). Both use the same variable now so consistent, but fragile: if a future edit updates one and not the other, the effective cap becomes the minimum, potentially masking the bug.
🟡 LOW setup() outside try/finally — teardown not called on setup failure — src/runtime/define-leaderboard.ts
setup()at line 481 is called before the try/finally. If setup allocates resources (e.g., opens a fixture connection, starts a local server) and then throws,teardown()never runs. The documented contract says teardown runs 'even on failure' but that refers to matrix failures. Resources allocated in setup before its throw will leak. Fix: wrap setup inside the try block, or wrap the entire (setup + try/finally) in its own try/finally.
🟡 LOW shotNonce is shared across concurrent cells — src/runtime/define-leaderboard.ts
shotNonceis a single mutable counter shared by all cells. When runProfileMatrix dispatches cells concurrently, taskToPrompt reads and increments it without synchronization. Practically benign — different cells have different base prompts so caching isn't an issue — but semantically the nonce should be per-cell or even per-shot to match intent. A non-atomic increment could produce duplicate nonces across cells.
🟡 LOW toBenchmarkAdapter().judge() has dead code guard — src/runtime/define-leaderboard.ts
const [c] = selectCases([task.id])at line 538 thenif (c === undefined) throwat line 539 is dead at runtime. selectCases() always throws for unknown ids (line 272-274) and always returns a populated array for known ids (map preserves length). The guard is needed for TypeScript strict index access but unreachable at runtime. The throw message will never fire.
tangletools · 2026-07-03T05:39:13Z · trace
tangletools
left a comment
There was a problem hiding this comment.
✅ Approved — 18 non-blocking findings — 0d1936d6
Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 5/5 planned shots over 8 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-07-03T05:39:13Z · immutable trace
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved drewstone PR — 76d764d0
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-07-03T06:47:22Z
What
defineLeaderboard<TCase>(spec)on@tangle-network/agent-runtime/loops: the declarative facade that turns a product's hand-written matrix runner (~650 lines, e.g. tax'smatrix.ts) into ~40 lines of domain code —cases+prompt+score— assembled into ONErunProfileMatrixcall.Design gates held
expandProfileAxes(harness×model axis),loopDispatch+naiveDriver(per-cell driven loop),resolveSandboxClient(cli-bridge backend),leaderboard/renderLeaderboardMarkdown(default export). Zero new execution/judging/metering code.runProfileMatrixstays the escape floor. Level-1 seams:backends,flags,parseOutput,onCellEvents,setup/teardown,export,modelBackend,matrixpassthrough (spread last onto therunProfileMatrixcall). Level-2:dispatch/judgesfull replacement.prompt(shell-to-python, resolved once per case),onCellEvents(search-metric capture),parseOutput(marker/echo-guard scoring), customexport,--cases/--harnesses/--models/--shots/--reps+ spec-declared flags.toBenchmarkAdapter()returns a structurally BenchmarkAdapter-compatible object (name/preflight/loadTasks/judge/goldArtifact) with no dependency on a bench package.Footgun fixed by default
The default run dir is fresh per invocation (timestamp+pid under tmpdir).
runProfileMatrixcaches cells by run dir; a stable default silently resumes a prior FAILED zero-token cell and skips dispatch. Only an explicit--run-diropts into resume. Also default: bare model ids are snapshot-stamped (--model-snapshot, default@leaderboard) so RunRecord validation can't reject the run.Version notes
@tangle-network/agent-eval0.100.0 → 0.103.1; peer floor raised>=0.97.0→>=0.101.0(expandProfileAxes/harnessAxisOf/CODING_HARNESSESfirst shipped in v0.101.0). Additive new subpath otherwise — no fleet bump needed.Proof
inProcessSandboxClient, no network/creds): end-to-end matrix run withintegrity.verdict === 'real', fresh-run-dir default + explicit--run-dir,--casessubsetting + unknown-id rejection, snapshot stamping, score→judge wrapping (dimensions/notes),onCellEventsseam, spec flags →ctx.args, fail-loud default sandbox backend,toBenchmarkAdapterloadTasks/judge round-trip + duplicate-id preflight.typecheck(src + examples),build,lint,verify:package,docs:check(API docs regenerated) all green.