Merged
15 changes: 14 additions & 1 deletion bench/HARNESS.md
@@ -20,7 +20,20 @@ within-run adaptive-driver layer): **does any non-blind topology beat blind comp
(non-oracle) selector, at significant n?** Gate A is a **narrow diagnostic** — the cost-justification
for parallel/adaptive topology, **NOT** the product verdict. A failed Gate A deletes within-run
steering only; it never touches the corpus+policy product (Gate B). The invariant is equal-COMPUTE,
- not equal-k-on-stateless-samples. Two things to keep straight: today's judges grade a single
not equal-k-on-stateless-samples.

**Terminology (one word, used consistently).** A **rollout** (≡ a "shot") is ONE agent running an
`AgentProfile` to completion — a full, possibly **multi-turn / stateful** trajectory. `k` counts
*rollouts*; **turns live *inside* a rollout**, never as separate shots. A single **stateless
completion** (`maxTurns=0`, `harness: null`, one model call, no persistent workspace) is the
*degenerate* rollout — fine as a selector **lower bound**, never the canonical unit. The HumanEval
probe (`bench/src/humaneval-gate.mts`) uses exactly that degenerate shape — it calls the router
directly and does **not** route through `AgentProfile` / the sandbox / the keystone — so its numbers
are the **no-self-correction lower bound** on the selector, distinct from the rollout-based keystone
gate above. Bridge it to the product by running the same arms with real rollouts (an `AgentProfile`
through `runLoop`), dialing `maxTurns`.

Two things to keep straight: today's judges grade a single
*correctness* scalar (the multi-objective vector is the open contract, architecture.md §6), and every
number below is single-objective + within-run — read them as Gate-A diagnostics, not Gate-B results.
- Within-run STEER (verify-and-revise family) **LOSES** (rung-0, n=40: blind 37.5% →
19 changes: 16 additions & 3 deletions bench/src/humaneval-gate.mts
@@ -10,9 +10,19 @@
* keeps a passer. This file asks: at EQUAL k, does diverse@k + a deployable
* verifier-grounded pick beat random@k + the same pick, and beat blind@1?
*
- * Two paired arms over the SAME tasks:
- * random@K — K identical-base-prompt shots/task (the compute control)
- * diverse@K — K shots, the i-th prefixed with composeStrategies(base, K)[i]
* SCOPE — read the numbers as a LOWER BOUND. Here a "shot" is a single STATELESS
* completion (one router call, `maxTurns=0`, NO `AgentProfile` / sandbox / keystone —
* it calls the router directly). That is the *degenerate* rollout (HARNESS.md's
* "Terminology"): it isolates the SELECTOR with the generator unable to self-correct,
* so it measures the selector's value at its MAXIMUM. A real rollout (an `AgentProfile`
* through `runLoop`, `maxTurns>0` over a persistent workspace) self-verifies by
* iterating, which shrinks the external selector's job — that is the next experiment,
* not this one. A positive result here is the science (the selector works in a
* deployable-checker regime), not the product.
*
* Two paired arms over the SAME tasks (each "shot" = one stateless completion):
* random@K — K identical-base-prompt completions/task (the compute control)
* diverse@K — K completions, the i-th prefixed with composeStrategies(base, K)[i]
*
* The DEPLOYABLE CHECKER runs each candidate against the task's own `test` in an
* isolated `--network=none` python:3.12-slim container (hard timeout) — exit 0 = pass.
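
The two paired arms and the checker-grounded pick described in the header comment could look roughly like the sketch below. It assumes `composeStrategies(base, K)` returns K prefix strings, which the comment implies but this diff does not show, so it is passed in as a parameter. `buildArms` and `pickByChecker` are hypothetical names, and the fallback to candidate 0 when nothing passes is an assumption rather than the file's confirmed behavior.

```typescript
// Assumed signature for the repo's composeStrategies: k strategy prefixes for a base prompt.
type ComposeStrategies = (base: string, k: number) => string[]

interface Arms {
  random: string[] // K identical base prompts (the compute control)
  diverse: string[] // K prompts, the i-th prefixed with strategy i
}

// Build both arms at EQUAL k so the comparison is compute-matched.
function buildArms(base: string, k: number, compose: ComposeStrategies): Arms {
  const prefixes = compose(base, k)
  if (prefixes.length !== k) {
    throw new Error(`expected ${k} prefixes, got ${prefixes.length}`)
  }
  return {
    random: Array.from({ length: k }, () => base),
    diverse: prefixes.map((prefix) => `${prefix}\n${base}`),
  }
}

// Checker-grounded pick: index of the first candidate whose checker exit code is 0.
// Falling back to index 0 when none pass is an assumption made for this sketch.
function pickByChecker(exitCodes: Array<number | null>): number {
  const firstPass = exitCodes.findIndex((code) => code === 0)
  return firstPass >= 0 ? firstPass : 0
}
```

Applying the same `pickByChecker` to both arms keeps the selector fixed, so any difference between random@K and diverse@K comes from the candidate pool.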
@@ -266,6 +276,9 @@ async function main(): Promise<void> {

console.log(`=== HumanEval deployable-verifier gate · N=${n} K=${k} offset=${offset} model=${model} ===`)
console.log(` router=${routerBaseUrl} docker=${dockerImage} (--network=none, timeout ${dockerTimeoutMs}ms)`)
console.log(
' regime: STATELESS single completions (maxTurns=0, no AgentProfile/sandbox) — the selector no-self-correction LOWER BOUND, not a rollout/product number',
)

const tasks = await loadHumanEval(n, offset)
console.log(`loaded ${tasks.length} HumanEval task(s): ${tasks.map((t) => t.taskId).join(', ')}`)
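
The banner above reports the Docker image, `--network=none`, and a timeout. As a rough sketch of how such a checker invocation might be assembled, the helpers below build the `docker run` argument list and the program to execute. `checkerArgv`, `assembleProgram`, and `verdict` are hypothetical names, not functions from `humaneval-gate.mts`. The program layout assumes the public HumanEval record shape, where `test` defines `check(candidate)` and `entry_point` names the function under test.

```typescript
// Arguments for `docker run`: no network, container removed on exit, program read from stdin.
function checkerArgv(image: string): string[] {
  return ['run', '--rm', '-i', '--network=none', image, 'python', '-']
}

// Concatenate the candidate solution, the task's own test, and the call that runs it.
function assembleProgram(candidate: string, test: string, entryPoint: string): string {
  return `${candidate}\n\n${test}\n\ncheck(${entryPoint})\n`
}

// Exit 0 = pass; any other code, or null (for example a process killed on timeout), = fail.
function verdict(exitCode: number | null): 'pass' | 'fail' {
  return exitCode === 0 ? 'pass' : 'fail'
}
```

One way to use this would be to hand the argument list to `spawn('docker', argv, { timeout })` from `node:child_process` and write the assembled program to the child's stdin. Whether the real file does it that way is not visible in this diff.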