feat(examples): scorecard, held-out-gate, user-simulation-driver by drewstone · Pull Request #81 · tangle-network/agent-eval

drewstone · 2026-05-22T22:26:22Z

Summary

Three production patterns the existing six examples did not cover. All three run offline; each is a single index.ts + a short README.md in its own dir.

- examples/scorecard/: AgentProfile + agentProfileHash + the append-only (persona × profile) JSONL log + diffScorecard (Cohen's d + Welch's t-test verdicts). Demonstrates the one-liner CI guard pattern: `diff.cells.filter(c => c.verdict === 'regressed')`. Output prints `1 regressed · 1 improved · 2 flat · 0 new` and points the regression at the exact cell.
- examples/held-out-gate/: HeldOutGate.evaluate across all three decision paths: a clean promote, a few_runs rejection, and the classic overfit pattern (search 0.95, holdout 0.55) where the gate correctly refuses to ship. Shows paired CI, rejectionCode, overfitGap.
- examples/user-simulation-driver/: decideNextUserTurn driven by a scripted TCloud mock so it runs without an LLM. Multi-turn loop with the DONE sign-off. Swap the mock for `@tangle-network/tcloud` and the same loop drives a real LLM persona.
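For readers unfamiliar with the statistics behind the scorecard verdicts, here is a minimal sketch of how Cohen's d and Welch's t could be combined into a per-cell verdict and a CI guard. This is not the agent-eval implementation of diffScorecard: the `CellDiff` shape, the `diffCell` and `regressedCells` helpers, and both thresholds are assumptions made for illustration.

```typescript
// Hypothetical sketch; NOT the agent-eval diffScorecard implementation.
type Verdict = 'regressed' | 'improved' | 'flat';

interface CellDiff {
  cell: string; // illustrative "(persona × profile)" key
  d: number; // Cohen's d, after minus before
  t: number; // Welch's t statistic, after minus before
  verdict: Verdict;
}

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Unbiased sample variance (n - 1 denominator); needs at least 2 samples.
const variance = (xs: number[]): number => {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) * (x - m), 0) / (xs.length - 1);
};

// Cohen's d with a pooled standard deviation.
// Zero pooled variance is reported as d = 0 to avoid dividing by zero (a simplification).
function cohensD(before: number[], after: number[]): number {
  const n1 = before.length;
  const n2 = after.length;
  const pooledSd = Math.sqrt(
    ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2),
  );
  return pooledSd === 0 ? 0 : (mean(after) - mean(before)) / pooledSd;
}

// Welch's t statistic: does not assume equal variances across the two samples.
function welchT(before: number[], after: number[]): number {
  const se = Math.sqrt(variance(before) / before.length + variance(after) / after.length);
  return se === 0 ? 0 : (mean(after) - mean(before)) / se;
}

// Assumed thresholds, chosen only for this sketch.
const MIN_EFFECT = 0.5; // |d| below this counts as flat
const MIN_T = 2.0; // |t| below this counts as flat

function diffCell(cell: string, before: number[], after: number[]): CellDiff {
  const d = cohensD(before, after);
  const t = welchT(before, after);
  const significant = Math.abs(d) >= MIN_EFFECT && Math.abs(t) >= MIN_T;
  const verdict: Verdict = !significant ? 'flat' : d < 0 ? 'regressed' : 'improved';
  return { cell, d, t, verdict };
}

// The CI guard from the summary: any regressed cell should fail the build.
const regressedCells = (cells: CellDiff[]): CellDiff[] =>
  cells.filter((c) => c.verdict === 'regressed');
```

In a CI script, a non-empty result from `regressedCells` would typically translate into a non-zero exit code, with the offending cell keys printed so the regression is attributable to a specific (persona × profile) pair.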

Test plan

- pnpm typecheck: 0 errors
- pnpm test: 1306 passed (135 files)
- pnpm exec biome check examples/{scorecard,held-out-gate,user-simulation-driver}: clean
- All three runnable: pnpm tsx examples/<dir>/index.ts

Three production patterns the prior six examples did not cover.

- examples/scorecard: AgentProfile + agentProfileHash + the append-only (persona × profile) JSONL log + diffScorecard with Cohen's d + Welch's t-test verdicts. Includes a one-liner CI guard pattern (`filter(c => c.verdict === 'regressed')`).
- examples/held-out-gate: HeldOutGate.evaluate across all three decision paths: a clean promote, a few-runs rejection, and the classic overfit pattern (search 0.95, holdout 0.55) that the gate refuses to ship.
- examples/user-simulation-driver: decideNextUserTurn driven by a scripted TCloud mock, showing a multi-turn loop with the DONE sign-off. Swap the mock for `@tangle-network/tcloud` and the same loop drives a real LLM persona.

All three run offline. typecheck 0, suite 1306 green, biome clean.
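The three decision paths of the held-out gate can be illustrated with a small sketch. It is not the HeldOutGate.evaluate API: the input and output shapes, the `evaluateGate` name, the `overfit` and `no_improvement` codes, the normal-approximation interval, and both thresholds are assumptions. Only the `few_runs` code and the rejectionCode / overfitGap / paired CI vocabulary come from the PR text.

```typescript
// Hypothetical sketch of a held-out promotion gate; NOT the agent-eval HeldOutGate API.
type RejectionCode = 'few_runs' | 'overfit' | 'no_improvement';

interface GateInput {
  searchScore: number; // candidate's mean score on the search split
  baselineHoldout: number[]; // per-run holdout scores for the baseline
  candidateHoldout: number[]; // per-run holdout scores for the candidate, paired by index
}

interface GateDecision {
  promote: boolean;
  rejectionCode?: RejectionCode;
  overfitGap: number; // search score minus candidate holdout mean
  pairedCI?: [number, number]; // approximate 95% CI on the paired mean difference
}

// Assumed thresholds, chosen only for this sketch.
const MIN_RUNS = 5;
const MAX_OVERFIT_GAP = 0.15;

const avg = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

function evaluateGate(input: GateInput): GateDecision {
  const n = Math.min(input.baselineHoldout.length, input.candidateHoldout.length);
  const candidate = input.candidateHoldout.slice(0, n);
  const baseline = input.baselineHoldout.slice(0, n);
  const overfitGap = n > 0 ? input.searchScore - avg(candidate) : input.searchScore;

  // Path 1: too few paired runs to say anything.
  if (n < MIN_RUNS) return { promote: false, rejectionCode: 'few_runs', overfitGap };

  // Path 2: the search score is far above the holdout score, the classic overfit pattern.
  if (overfitGap > MAX_OVERFIT_GAP) return { promote: false, rejectionCode: 'overfit', overfitGap };

  // Path 3: paired difference with a normal-approximation 95% interval.
  const diffs = candidate.map((c, i) => c - (baseline[i] ?? 0));
  const d = avg(diffs);
  const sd = Math.sqrt(diffs.reduce((a, x) => a + (x - d) * (x - d), 0) / (n - 1));
  const half = (1.96 * sd) / Math.sqrt(n);
  const pairedCI: [number, number] = [d - half, d + half];

  // Promote only when the whole interval sits above zero.
  if (pairedCI[0] <= 0) return { promote: false, rejectionCode: 'no_improvement', overfitGap, pairedCI };
  return { promote: true, overfitGap, pairedCI };
}
```

With a search score of 0.95 and a holdout mean of 0.55, the gap of about 0.40 exceeds the assumed limit and this sketch rejects with `overfit`, which mirrors the case described above.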

tangletools

Verified. Three production-pattern examples filling real gaps: scorecard (AgentProfile + diffScorecard, the (#78) build I just shipped — no example existed), held-out-gate (the promotion gate, only previously referenced inside production-loop), user-simulation-driver (decideNextUserTurn with a scripted TCloud mock for offline). All three run end-to-end; typecheck 0, suite 1306 green, biome clean.
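For context on the offline user-simulation pattern, a scripted mock can stand in for the LLM client roughly as follows. This is a sketch under assumptions: `ChatClient`, `scriptedClient`, and `runConversation` are hypothetical names, and the real decideNextUserTurn and `@tangle-network/tcloud` signatures are not reproduced here.

```typescript
// Hypothetical sketch of a scripted user-simulation loop; NOT the agent-eval API.
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

// Minimal client surface assumed for this sketch; a real LLM client would differ.
interface ChatClient {
  complete(history: Message[]): Promise<string>;
}

// Returns queued replies in order, then signs off with DONE once the script runs out.
function scriptedClient(script: string[]): ChatClient {
  let i = 0;
  return {
    async complete(): Promise<string> {
      const next = script[i];
      i += 1;
      return next ?? 'DONE';
    },
  };
}

// Alternates simulated-user and agent turns until the user says DONE or maxTurns is reached.
async function runConversation(
  user: ChatClient,
  agent: (history: Message[]) => Promise<string>,
  maxTurns = 10,
): Promise<Message[]> {
  const history: Message[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    const userText = (await user.complete(history)).trim();
    if (userText === 'DONE') break; // sign-off: the simulated user is satisfied
    history.push({ role: 'user', content: userText });
    history.push({ role: 'assistant', content: await agent(history) });
  }
  return history;
}
```

Because the loop depends only on the `ChatClient` interface, replacing `scriptedClient` with an adapter around a real client would leave `runConversation` unchanged, which is the swap the example describes.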

src/index.ts has exported `PrReviewAuditCase`, `scorePrReviewComments`, `summarizePrReviewBenchmark`, et al. from `./pr-review-benchmark` since the run-record refactor landed, but `src/pr-review-benchmark.ts` and its co-located test were authored locally and never committed. A fresh clone fails typecheck; CI on main has been red on #78, #79, and #81. The files were already typecheck-clean, biome-clean, and the 5 co-located tests pass. No content changes — only `git add`.

tangletools approved these changes May 22, 2026

drewstone merged commit 7e315d8 into main May 22, 2026
1 check failed

drewstone deleted the feat/production-pattern-examples branch May 22, 2026 22:26

drewstone mentioned this pull request May 22, 2026

fix: commit pr-review-benchmark source — restores green CI on main #83

Merged

3 tasks

drewstone mentioned this pull request May 22, 2026

chore(0.34.0): release — eval scorecard + agent profile cells #84

Merged

3 tasks

drewstone merged 1 commit into main from feat/production-pattern-examples
