docs(bench): unify rollout/shot terminology + honestly scope the HumanEval gate #170

Merged: drewstone merged 1 commit into main from docs/rollout-terminology on Jun 5, 2026
Conversation

@drewstone
Contributor

"Shot" was overloaded. The canon (HARNESS.md, roadmap-rsi.md) already has the right unit: a rollout = one agent running an AgentProfile to completion — a full, possibly multi-turn/stateful trajectory; k counts rollouts, turns live inside one. The HumanEval gate, though, used "shot" for a raw stateless single completion that bypasses AgentProfile/the runtime — exactly the equal-k-on-stateless-samples the harness explicitly warns against.
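The distinction can be made concrete with a small sketch. The type shapes below are hypothetical stand-ins written for illustration, not the repository's actual `AgentProfile` definition; only the `maxTurns` and `harness` field names come from this PR. It shows that k is the number of rollouts, regardless of how many turns each rollout contains, and that a stateless completion is just a profile with `maxTurns = 0` and `harness = null`.

```typescript
// Hypothetical shapes for illustration only; the real AgentProfile may differ.
interface AgentProfile {
  name: string;
  maxTurns: number;       // 0 in the degenerate stateless case
  harness: string | null; // null when the runtime is bypassed
}

// One rollout = one profile run to completion; turns live inside it.
interface Rollout {
  profile: AgentProfile;
  turns: string[]; // outputs produced along the trajectory
}

// The degenerate case named in the Terminology block.
function isStatelessCompletion(profile: AgentProfile): boolean {
  return profile.maxTurns === 0 && profile.harness === null;
}

// k counts rollouts, never turns.
function countK(rollouts: Rollout[]): number {
  return rollouts.length;
}

const multiTurn: Rollout = {
  profile: { name: "agent", maxTurns: 4, harness: "runtime" },
  turns: ["draft", "fix after test failure", "final"],
};

const stateless: Rollout = {
  profile: { name: "raw-completion", maxTurns: 0, harness: null },
  turns: ["single completion"],
};

const k = countK([multiTurn, stateless]);
const totalTurns = multiTurn.turns.length + stateless.turns.length;
console.log(`k=${k}, total turns=${totalTurns}`);
```

Here two rollouts yield k = 2 even though four turns were produced in total, which is why comparing arms at equal k is only meaningful when each sample is a full rollout.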

  • HARNESS.md — adds a one-word Terminology block: rollout ≡ shot = one AgentProfile run; a stateless completion (maxTurns=0, harness:null) is the degenerate case; names humaneval-gate.mts as the no-self-correction selector LOWER BOUND, distinct from the rollout-based keystone gate.
  • humaneval-gate.mts — a SCOPE note in the header + a runtime regime banner: its numbers isolate the selector with the generator unable to self-correct (the selector's maximum leverage), so a win is the science (the selector works in a deployable-checker regime), not the product. Bridge = run the same arms as real rollouts (AgentProfile through runLoop, dialing maxTurns).
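As a rough illustration of the bridge described in the second bullet, the sketch below runs the same toy arm at several `maxTurns` settings. Everything here is an assumption made for the example: `runLoopSketch` is a stand-in and not the repository's `runLoop`, the generator and checker are toys, and `maxTurns` is assumed to count self-correction turns after the first completion.

```typescript
// Hypothetical stand-ins; not the repository's runLoop or AgentProfile.
interface AgentProfile {
  name: string;
  maxTurns: number; // assumed: self-correction turns allowed after the first completion
  harness: string | null;
}

type Generate = (task: string, feedback: string | null) => string;
type Check = (candidate: string) => boolean;

interface RolloutResult {
  passed: boolean;
  turnsUsed: number;
  final: string;
}

// One rollout: generate once, then retry with feedback while turns remain.
function runLoopSketch(
  profile: AgentProfile,
  task: string,
  generate: Generate,
  check: Check,
): RolloutResult {
  let candidate = generate(task, null);
  let turnsUsed = 0;
  while (!check(candidate) && turnsUsed < profile.maxTurns) {
    turnsUsed += 1;
    candidate = generate(task, `attempt ${turnsUsed} failed the check`);
  }
  return { passed: check(candidate), turnsUsed, final: candidate };
}

// Toy generator: wrong on the first try, correct once it has seen feedback.
const toyGenerate: Generate = (_task, feedback) =>
  feedback === null ? "return a - b" : "return a + b";

// Toy checker standing in for a deployable test.
const toyCheck: Check = (candidate) => candidate === "return a + b";

// Dial maxTurns: 0 is the stateless regime, higher values are real rollouts.
const results = [0, 1, 2].map((maxTurns) =>
  runLoopSketch(
    { name: `arm-maxTurns-${maxTurns}`, maxTurns, harness: maxTurns === 0 ? null : "runtime" },
    "add two numbers",
    toyGenerate,
    toyCheck,
  ),
);

for (const r of results) {
  console.log(`passed=${r.passed} turnsUsed=${r.turnsUsed}`);
}
```

In this toy setup the `maxTurns = 0` arm cannot recover from its first wrong answer, while the arms with `maxTurns >= 1` pass after one correction turn. That is the gap the PR asks readers to keep in mind: a selector's measured gain in the stateless regime may shrink once the generator can fix its own mistakes.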

No code-path change; docs + comments + one log line. Merges clean.

🤖 Generated with Claude Code

… scope the HumanEval gate

A 'shot' was overloaded: the canon (HARNESS.md, roadmap-rsi.md) means rollout = one agent running an
AgentProfile to completion (multi-turn allowed; k counts ROLLOUTS, turns live inside one). The
HumanEval gate used 'shot' for a raw stateless single completion that bypasses AgentProfile/the
runtime — exactly the 'equal-k-on-stateless-samples' the harness warns against.

- HARNESS.md: add a one-word Terminology block (rollout ≡ shot = one AgentProfile run; the stateless
  completion is the degenerate maxTurns=0 case; name the HumanEval probe as the no-self-correction
  selector LOWER BOUND, distinct from the rollout-based keystone gate).
- humaneval-gate.mts: SCOPE note in the header + a runtime regime banner — its numbers are the
  selector lower bound (generator can't self-correct), not a rollout/product number; bridge = run the
  same arms as real rollouts (AgentProfile through runLoop, dial maxTurns).
@tangletools
Contributor

✅ No Blockers — 5f41aa3f

Readiness 95/100 · Confidence 70/100 · 0 findings (none)

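| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 95 | 95 | 95 |
| Confidence | 70 | 70 | 70 |
| Correctness | 95 | 95 | 95 |
| Security | 95 | 95 | 95 |
| Testing | 95 | 95 | 95 |
| Architecture | 95 | 95 | 95 |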

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

No findings.


tangletools · 2026-06-05T22:30:21Z · trace

Contributor

@tangletools tangletools left a comment

✅ Clean — 5f41aa3f

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-05T22:30:21Z · immutable trace

@drewstone drewstone merged commit 3e92f5b into main Jun 5, 2026
1 check passed
2 participants