diff --git a/bench/HARNESS.md b/bench/HARNESS.md
index 6ef8660..550ebc7 100644
--- a/bench/HARNESS.md
+++ b/bench/HARNESS.md
@@ -20,7 +20,20 @@ within-run adaptive-driver layer): **does any non-blind topology beat blind comp
 (non-oracle) selector, at significant n?** Gate A is a **narrow diagnostic** — the cost-justification
 for parallel/adaptive topology, **NOT** the product verdict. A failed Gate A deletes within-run
 steering only; it never touches the corpus+policy product (Gate B). The invariant is equal-COMPUTE,
-not equal-k-on-stateless-samples. Two things to keep straight: today's judges grade a single
+not equal-k-on-stateless-samples.
+
+**Terminology (one word, used consistently).** A **rollout** (≡ a "shot") is ONE agent running an
+`AgentProfile` to completion — a full, possibly **multi-turn / stateful** trajectory. `k` counts
+*rollouts*; **turns live *inside* a rollout**, never as separate shots. A single **stateless
+completion** (`maxTurns=0`, `harness: null`, one model call, no persistent workspace) is the
+*degenerate* rollout — fine as a selector **lower bound**, never the canonical unit. The HumanEval
+probe (`bench/src/humaneval-gate.mts`) uses exactly that degenerate shape — it calls the router
+directly and does **not** route through `AgentProfile` / the sandbox / the keystone — so its numbers
+are the **no-self-correction lower bound** on the selector, distinct from the rollout-based keystone
+gate above. Bridge it to the product by running the same arms with real rollouts (an `AgentProfile`
+through `runLoop`), dialing `maxTurns`.
+
+Two things to keep straight: today's judges grade a single
 *correctness* scalar (the multi-objective vector is the open contract, architecture.md §6), and every
 number below is single-objective + within-run — read them as Gate-A diagnostics, not Gate-B results.
 - Within-run STEER (verify-and-revise family) **LOSES** (rung-0, n=40: blind 37.5% →
diff --git a/bench/src/humaneval-gate.mts b/bench/src/humaneval-gate.mts
index 52f00c5..d10a963 100644
--- a/bench/src/humaneval-gate.mts
+++ b/bench/src/humaneval-gate.mts
@@ -10,9 +10,19 @@
  * keeps a passer. This file asks: at EQUAL k, does diverse@k + a deployable
  * verifier-grounded pick beat random@k + the same pick, and beat blind@1?
  *
- * Two paired arms over the SAME tasks:
- *   random@K — K identical-base-prompt shots/task (the compute control)
- *   diverse@K — K shots, the i-th prefixed with composeStrategies(base, K)[i]
+ * SCOPE — read the numbers as a LOWER BOUND. Here a "shot" is a single STATELESS
+ * completion (one router call, `maxTurns=0`, NO `AgentProfile` / sandbox / keystone —
+ * it calls the router directly). That is the *degenerate* rollout (HARNESS.md's
+ * "Terminology"): it isolates the SELECTOR with the generator unable to self-correct,
+ * so it measures the selector's value at its MAXIMUM. A real rollout (an `AgentProfile`
+ * through `runLoop`, `maxTurns>0` over a persistent workspace) self-verifies by
+ * iterating, which shrinks the external selector's job — that is the next experiment,
+ * not this one. A positive result here is the science (the selector works in a
+ * deployable-checker regime), not the product.
+ *
+ * Two paired arms over the SAME tasks (each "shot" = one stateless completion):
+ *   random@K — K identical-base-prompt completions/task (the compute control)
+ *   diverse@K — K completions, the i-th prefixed with composeStrategies(base, K)[i]
  *
  * The DEPLOYABLE CHECKER runs each candidate against the task's own `test` in an
  * isolated `--network=none` python:3.12-slim container (hard timeout) — exit 0 = pass.
@@ -266,6 +276,9 @@ async function main(): Promise<void> {
 
   console.log(`=== HumanEval deployable-verifier gate · N=${n} K=${k} offset=${offset} model=${model} ===`)
   console.log(` router=${routerBaseUrl} docker=${dockerImage} (--network=none, timeout ${dockerTimeoutMs}ms)`)
+  console.log(
+    ' regime: STATELESS single completions (maxTurns=0, no AgentProfile/sandbox) — the selector no-self-correction LOWER BOUND, not a rollout/product number',
+  )
 
   const tasks = await loadHumanEval(n, offset)
   console.log(`loaded ${tasks.length} HumanEval task(s): ${tasks.map((t) => t.taskId).join(', ')}`)
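
Review note: the equal-k comparison described in the header comment (blind@1 vs. random@K vs. diverse@K, each with a verifier-grounded pick) could be scored roughly as in the sketch below. This is an illustration only, not the code in `humaneval-gate.mts`: `ArmResult`, `pickVerified`, `solvedWithPick`, and `blindAt1` are hypothetical names, the verdicts are made-up toy data, and treating blind@1 as "take the first completion without consulting the checker" is an assumption.

```typescript
// Hypothetical shape, for illustration only: one checker verdict per stateless completion.
type ArmResult = { taskId: string; passes: boolean[] }

// Verifier-grounded pick: index of the first completion the checker passed, or -1 if none did.
function pickVerified(passes: boolean[]): number {
  return passes.findIndex((p) => p)
}

// Fraction of tasks solved when the checker picks among the first k completions.
function solvedWithPick(results: ArmResult[], k: number): number {
  if (results.length === 0) return 0
  const solved = results.filter((r) => pickVerified(r.passes.slice(0, k)) !== -1).length
  return solved / results.length
}

// blind@1 (assumed definition): keep the first completion, no checker involved.
function blindAt1(results: ArmResult[]): number {
  if (results.length === 0) return 0
  return results.filter((r) => r.passes[0] === true).length / results.length
}

// Toy verdicts over the SAME four tasks; these are not benchmark results.
const randomArm: ArmResult[] = [
  { taskId: 'T/0', passes: [true, true, true] },
  { taskId: 'T/1', passes: [false, false, false] },
  { taskId: 'T/2', passes: [false, true, false] },
  { taskId: 'T/3', passes: [false, false, false] },
]
const diverseArm: ArmResult[] = [
  { taskId: 'T/0', passes: [true, false, true] },
  { taskId: 'T/1', passes: [false, false, true] },
  { taskId: 'T/2', passes: [false, true, false] },
  { taskId: 'T/3', passes: [false, false, false] },
]

const K = 3
console.log({
  blind1: blindAt1(randomArm),
  randomK: solvedWithPick(randomArm, K),
  diverseK: solvedWithPick(diverseArm, K),
})
```

Because both arms are scored at the same `K` with the same pick rule, any gap between `randomK` and `diverseK` in this sketch comes from the completions themselves, which matches the role the comment gives random@K as the compute control.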