feat(examples): scorecard, held-out-gate, user-simulation-driver#81
Merged
Three production patterns the prior six examples did not cover.

- examples/scorecard: AgentProfile + agentProfileHash + the append-only (persona × profile) JSONL log + diffScorecard with Cohen's d + Welch's t-test verdicts. Includes a one-liner CI guard pattern (`filter(c => c.verdict === 'regressed')`).
- examples/held-out-gate: HeldOutGate.evaluate across all three decision paths, namely a clean promote, a few-runs rejection, and the classic overfit pattern (search 0.95, holdout 0.55) that the gate refuses to ship.
- examples/user-simulation-driver: decideNextUserTurn driven by a scripted TCloud mock, showing a multi-turn loop with the DONE sign-off. Swap the mock for `@tangle-network/tcloud` and the same loop drives a real LLM persona.

All three run offline. typecheck 0, suite 1306 green, biome clean.
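For readers unfamiliar with the statistics behind the scorecard verdicts, the following is a minimal, self-contained sketch of how a per-cell verdict could be derived from Cohen's d and Welch's t statistic, followed by the CI guard filter. It is not the code in examples/scorecard: `ScoreCell`, `DiffCell`, `diffCells`, and the `minEffect` / `minT` cutoffs are hypothetical names and thresholds chosen for illustration, and a real implementation would likely convert t into a p-value rather than use a fixed cutoff.

```typescript
// Sketch only: types, names, and thresholds are assumptions for illustration,
// not the API shipped in examples/scorecard.
type Verdict = 'improved' | 'regressed' | 'flat' | 'new';

interface ScoreCell {
  persona: string;
  scores: number[]; // one score per run for this (persona × profile) cell
}

interface DiffCell {
  persona: string;
  verdict: Verdict;
  cohensD: number | null;
  welchT: number | null;
}

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Unbiased sample variance (divides by n - 1); callers must pass n >= 2.
const sampleVariance = (xs: number[]): number => {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
};

// Division that stays well-defined when the denominator is exactly zero.
const safeDiv = (num: number, den: number): number =>
  den === 0 ? (num === 0 ? 0 : Math.sign(num) * Infinity) : num / den;

// Cohen's d with a pooled standard deviation; positive means the candidate scored higher.
function cohensD(baseline: number[], candidate: number[]): number {
  const nb = baseline.length;
  const nc = candidate.length;
  const pooledVar =
    ((nb - 1) * sampleVariance(baseline) + (nc - 1) * sampleVariance(candidate)) / (nb + nc - 2);
  return safeDiv(mean(candidate) - mean(baseline), Math.sqrt(pooledVar));
}

// Welch's t statistic: like Student's t, but without assuming equal variances.
function welchT(baseline: number[], candidate: number[]): number {
  const se = Math.sqrt(
    sampleVariance(baseline) / baseline.length + sampleVariance(candidate) / candidate.length,
  );
  return safeDiv(mean(candidate) - mean(baseline), se);
}

// Hypothetical rule: a cell moves only if the effect is both large (|d|) and
// unlikely to be noise (|t|). Cells with fewer than 2 runs are left 'flat'.
function diffCells(
  baseline: ScoreCell[],
  candidate: ScoreCell[],
  minEffect = 0.5,
  minT = 2,
): { cells: DiffCell[] } {
  const byPersona = new Map<string, ScoreCell>(
    baseline.map((c): [string, ScoreCell] => [c.persona, c]),
  );
  const cells = candidate.map((cand): DiffCell => {
    const base = byPersona.get(cand.persona);
    if (!base) return { persona: cand.persona, verdict: 'new', cohensD: null, welchT: null };
    if (base.scores.length < 2 || cand.scores.length < 2) {
      return { persona: cand.persona, verdict: 'flat', cohensD: null, welchT: null };
    }
    const d = cohensD(base.scores, cand.scores);
    const t = welchT(base.scores, cand.scores);
    const moved = Math.abs(d) >= minEffect && Math.abs(t) >= minT;
    const verdict: Verdict = !moved ? 'flat' : d > 0 ? 'improved' : 'regressed';
    return { persona: cand.persona, verdict, cohensD: d, welchT: t };
  });
  return { cells };
}

// Made-up scores: 'novice' drops sharply, 'expert' barely moves, 'skeptic' has no baseline.
const diff = diffCells(
  [
    { persona: 'novice', scores: [0.8, 0.82, 0.78, 0.81] },
    { persona: 'expert', scores: [0.7, 0.72, 0.68, 0.71] },
  ],
  [
    { persona: 'novice', scores: [0.6, 0.62, 0.58, 0.61] },
    { persona: 'expert', scores: [0.71, 0.69, 0.7, 0.72] },
    { persona: 'skeptic', scores: [0.5, 0.55] },
  ],
);

// The CI guard from the description: any regressed cell should fail the build.
const regressed = diff.cells.filter((c) => c.verdict === 'regressed');
```

In a CI script, a non-empty `regressed` array would typically translate into a non-zero exit code, with the offending persona reported so the regression points at an exact cell.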
tangletools
approved these changes
May 22, 2026
Contributor
tangletools
left a comment
Verified. Three production-pattern examples filling real gaps: scorecard (AgentProfile + diffScorecard, the (#78) build I just shipped — no example existed), held-out-gate (the promotion gate, only previously referenced inside production-loop), user-simulation-driver (decideNextUserTurn with a scripted TCloud mock for offline). All three run end-to-end; typecheck 0, suite 1306 green, biome clean.
drewstone
added a commit
that referenced
this pull request
May 22, 2026
src/index.ts has exported `PrReviewAuditCase`, `scorePrReviewComments`, `summarizePrReviewBenchmark`, et al. from `./pr-review-benchmark` since the run-record refactor landed, but `src/pr-review-benchmark.ts` and its co-located test were authored locally and never committed. A fresh clone fails typecheck; CI on main has been red on #78, #79, and #81. The files were already typecheck-clean, biome-clean, and the 5 co-located tests pass. No content changes — only `git add`.
Summary

Three production patterns the existing six examples did not cover. All three run offline; each is a single `index.ts` + a short `README.md` in its own dir.

- `examples/scorecard/`: `AgentProfile` + `agentProfileHash` + the append-only `(persona × profile)` JSONL log + `diffScorecard` (Cohen's d + Welch's t-test verdicts). Demonstrates the one-liner CI guard pattern: `diff.cells.filter(c => c.verdict === 'regressed')`. Output prints `1 regressed · 1 improved · 2 flat · 0 new` and points the regression at the exact cell.
- `examples/held-out-gate/`: `HeldOutGate.evaluate` across all three decision paths, namely a clean promote, a `few_runs` rejection, and the classic overfit pattern (search 0.95, holdout 0.55) where the gate correctly refuses to ship. Shows `paired CI`, `rejectionCode`, `overfitGap`.
- `examples/user-simulation-driver/`: `decideNextUserTurn` driven by a scripted `TCloud` mock so it runs without an LLM. Multi-turn loop with the `DONE` sign-off. Swap the mock for `@tangle-network/tcloud` and the same loop drives a real LLM persona.

Test plan

- `pnpm typecheck`: 0 errors
- `pnpm test`: 1306 passed (135 files)
- `pnpm exec biome check examples/{scorecard,held-out-gate,user-simulation-driver}`: clean
- `pnpm tsx examples/<dir>/index.ts`
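The three held-out-gate decision paths described in the summary can be illustrated with a small standalone sketch. This is not `HeldOutGate.evaluate` itself: `evaluateGate`, `GateOptions`, `minRuns`, and `maxOverfitGap` are hypothetical names, the thresholds are made up, and the paired CI the real gate reports is omitted here. Only `rejectionCode`, `overfitGap`, and the `few_runs` code are taken from the description above.

```typescript
// Sketch only: option names, thresholds, and the return shape are assumptions,
// not the actual HeldOutGate API.
type RejectionCode = 'few_runs' | 'overfit';

interface GateOptions {
  minRuns: number; // minimum runs required on both splits (assumed >= 1)
  maxOverfitGap: number; // largest tolerated mean(search) - mean(holdout)
}

interface GateDecision {
  promote: boolean;
  rejectionCode: RejectionCode | null;
  overfitGap: number | null; // null when there were too few runs to compute it
}

const average = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

function evaluateGate(
  searchScores: number[],
  holdoutScores: number[],
  opts: GateOptions,
): GateDecision {
  // Path 1: not enough runs on either split to trust a comparison.
  if (searchScores.length < opts.minRuns || holdoutScores.length < opts.minRuns) {
    return { promote: false, rejectionCode: 'few_runs', overfitGap: null };
  }
  const overfitGap = average(searchScores) - average(holdoutScores);
  // Path 2: strong on the search split, weak on held-out data.
  if (overfitGap > opts.maxOverfitGap) {
    return { promote: false, rejectionCode: 'overfit', overfitGap };
  }
  // Path 3: held-out performance tracks search performance closely enough.
  return { promote: true, rejectionCode: null, overfitGap };
}

const opts: GateOptions = { minRuns: 3, maxOverfitGap: 0.1 };

// Clean promote: search and holdout agree.
const clean = evaluateGate([0.82, 0.8, 0.84], [0.79, 0.81, 0.8], opts);
// few_runs rejection: only one held-out run.
const fewRuns = evaluateGate([0.9, 0.88, 0.91], [0.85], opts);
// Classic overfit pattern from the summary: search 0.95, holdout 0.55.
const overfit = evaluateGate([0.95, 0.95, 0.95], [0.55, 0.55, 0.55], opts);
```

Checking the run count before computing the gap keeps the `few_runs` path from reporting a misleading `overfitGap` based on one or two samples.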