docs(bench): unify rollout/shot terminology + honestly scope the HumanEval gate #170
… scope the HumanEval gate

A 'shot' was overloaded: the canon (HARNESS.md, roadmap-rsi.md) means rollout = one agent running an AgentProfile to completion (multi-turn allowed; k counts ROLLOUTS, turns live inside one). The HumanEval gate used 'shot' for a raw stateless single completion that bypasses AgentProfile/the runtime, which is exactly the 'equal-k-on-stateless-samples' the harness warns against.

- HARNESS.md: add a one-word Terminology block (rollout ≡ shot = one AgentProfile run; the stateless completion is the degenerate maxTurns=0 case; name the HumanEval probe as the no-self-correction selector LOWER BOUND, distinct from the rollout-based keystone gate).
- humaneval-gate.mts: SCOPE note in the header + a runtime regime banner. Its numbers are the selector lower bound (generator can't self-correct), not a rollout/product number; bridge = run the same arms as real rollouts (AgentProfile through runLoop, dial maxTurns).
✅ No Blockers

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 95 | 95 | 95 |
| Confidence | 70 | 70 | 70 |
| Correctness | 95 | 95 | 95 |
| Security | 95 | 95 | 95 |
| Testing | 95 | 95 | 95 |
| Architecture | 95 | 95 | 95 |

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.
No findings.
tangletools · 2026-06-05T22:30:21Z · trace
tangletools left a comment
✅ Clean — 5f41aa3f
Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-05T22:30:21Z · immutable trace
"Shot" was overloaded. The canon (HARNESS.md, roadmap-rsi.md) already has the right unit: a rollout = one agent running an `AgentProfile` to completion, i.e. a full, possibly multi-turn/stateful trajectory; `k` counts rollouts, turns live inside one. The HumanEval gate, though, used "shot" for a raw stateless single completion that bypasses `AgentProfile`/the runtime, which is exactly the equal-k-on-stateless-samples the harness explicitly warns against.

- HARNESS.md: adds a Terminology block. rollout ≡ shot = one `AgentProfile` run; a stateless completion (`maxTurns=0`, `harness:null`) is the degenerate case; names `humaneval-gate.mts` as the no-self-correction selector LOWER BOUND, distinct from the rollout-based keystone gate.
- `humaneval-gate.mts`: SCOPE note in the header plus a runtime regime banner. Its numbers are the selector lower bound (the generator can't self-correct), not a rollout/product number. The bridge is to run the same arms as real rollouts (`AgentProfile` through `runLoop`, dialing `maxTurns`).

No code-path change; docs + comments + one log line. Merges clean.
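To make the rollout vs. stateless-completion distinction concrete, here is a minimal TypeScript sketch. It is not the repository's code: the `AgentProfile` shape, `runRollout`, `countPasses`, and the toy generator and checker are all hypothetical stand-ins, and it assumes `maxTurns` counts the self-correction turns allowed after the first attempt, so that `maxTurns: 0` is the degenerate stateless case.

```typescript
// Hypothetical shapes for illustration; the real AgentProfile and runLoop
// in the repository may look quite different.
interface AgentProfile {
  name: string;
  maxTurns: number; // assumed: extra self-correction turns after the first attempt
}

// A generator produces an output, optionally seeing feedback from the prior turn.
type Attempt = (task: string, feedback: string | null) => string;
// A checker returns null on success, or a feedback string on failure.
type Check = (output: string) => string | null;

interface RolloutResult {
  passed: boolean;
  turnsUsed: number; // self-correction turns consumed inside this one rollout
}

// One rollout = one profile run to completion. Turns live inside the rollout.
function runRollout(
  profile: AgentProfile,
  task: string,
  attempt: Attempt,
  check: Check,
): RolloutResult {
  let feedback: string | null = null;
  for (let turn = 0; turn <= profile.maxTurns; turn++) {
    const output = attempt(task, feedback);
    feedback = check(output);
    if (feedback === null) return { passed: true, turnsUsed: turn };
  }
  return { passed: false, turnsUsed: profile.maxTurns };
}

// k counts rollouts, never turns.
function countPasses(profile: AgentProfile, k: number, attempt: Attempt, check: Check): number {
  let passes = 0;
  for (let i = 0; i < k; i++) {
    if (runRollout(profile, "toy-task", attempt, check).passed) passes++;
  }
  return passes;
}

// Toy generator: wrong on the first try, right once it has seen feedback.
const toyAttempt: Attempt = (_task, feedback) => (feedback === null ? "wrong" : "right");
const toyCheck: Check = (output) => (output === "right" ? null : "expected 'right'");

const stateless: AgentProfile = { name: "stateless", maxTurns: 0 };
const selfCorrecting: AgentProfile = { name: "self-correcting", maxTurns: 2 };

const statelessPasses = countPasses(stateless, 4, toyAttempt, toyCheck);
const rolloutPasses = countPasses(selfCorrecting, 4, toyAttempt, toyCheck);
console.log(`stateless: ${statelessPasses}/4, self-correcting: ${rolloutPasses}/4`);
```

With the same `k`, the stateless arm never passes in this toy setup because the generator cannot use feedback, while the multi-turn arm always does. That gap is why the description treats the stateless numbers as a lower bound and not a rollout number.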
🤖 Generated with Claude Code