feat(bench): wire real benchmark adapters — aec-bench, commit0, programbench, appworld + shared harness#153

Merged
drewstone merged 1 commit into main from feat/wire-benchmarks
Jun 4, 2026

@drewstone
Contributor

What

Four real BenchmarkAdapters, built by reusing the contract + one shared harness — no duplication.

  • benchmarks/_harness.ts — the shared code-bench harness extracted from swe-bench/terminal-bench: stage artifact → run the bench's own evaluator in a .venv/Docker subprocess → parse the JSON report → {resolved, score}, plus runVenvScriptStdin. swe-bench + terminal-bench refactored onto it (behavior-identical; no per-adapter process/venv/temp plumbing left).
  • aec-bench: REAL, runnable with zero extra deps. Judge = the task's own tests/verify.py over python3 stdlib (recompute closed-form ground truth, per-field math.isclose → graded partial credit). Deterministic, correctable middle band: the open-gate candidate domain. Real GitHub task tree + committed fixtures; goldArtifact returns the real golden_pass.md (offline verify-judge works).
  • commit0 / programbench — real loadTasks (HF + *_FIXTURES=1 offline) + judge via the official harness; fail loud (pip install … + Docker) — never a fabricated score.
  • appworld — finished the stub: loadTasks enumerates the live engine, judge = AppWorld's own world.evaluate(); fail loud, no offline fixture by honest design (task data exists only after appworld download data).
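
For readers unfamiliar with per-field graded credit, here is a minimal TypeScript sketch of the idea behind the aec-bench judge. It is not the task's tests/verify.py (which is Python); `gradeFields`, `isClose`, the field names, and the rule that `resolved` means every field matched are assumptions for illustration. `isClose` mirrors the documented semantics of Python's `math.isclose` with its default tolerances.

```typescript
// Hypothetical sketch: names and the Verdict shape are illustrative,
// not taken from aec-bench's tests/verify.py.
interface Verdict {
  resolved: boolean;
  score: number;
}

// Mirrors Python's math.isclose: |a - b| <= max(relTol * max(|a|, |b|), absTol),
// with the same defaults (rel_tol = 1e-9, abs_tol = 0).
function isClose(a: number, b: number, relTol = 1e-9, absTol = 0): boolean {
  if (a === b) return true; // also covers equal infinities
  if (!Number.isFinite(a) || !Number.isFinite(b)) return false; // NaN, mismatched infinities
  const tolerance = Math.max(relTol * Math.max(Math.abs(a), Math.abs(b)), absTol);
  return Math.abs(a - b) <= tolerance;
}

// Each expected field is worth an equal share of the score.
// Assumption: `resolved` is true only when every field matches.
function gradeFields(
  expected: Record<string, number>,
  submitted: Record<string, number | undefined>,
  relTol = 1e-9,
): Verdict {
  const entries = Object.entries(expected);
  if (entries.length === 0) return { resolved: false, score: 0 };
  let correct = 0;
  for (const [field, want] of entries) {
    const got = submitted[field];
    if (typeof got === "number" && isClose(want, got, relTol)) correct += 1;
  }
  return { resolved: correct === entries.length, score: correct / entries.length };
}
```

Under these assumptions a fully correct submission grades to `{resolved: true, score: 1}` and an empty one to `{resolved: false, score: 0}`, matching the end-to-end result reported under Verified; a submission with half its fields correct lands in the middle band at 0.5.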

Registered in both ADAPTERS maps (run.ts + rsi.ts); offline fixture tests for all four; fixed two real execFile stdin-hang bugs (programbench, appworld) by routing through the shared stdin runner; HARNESS.md adapter section rewritten to the honest state.
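
A stdin runner of this kind can be sketched as follows. This is not the repo's runVenvScriptStdin: `runWithStdin` and `parseReport` are hypothetical names, and the sketch assumes an evaluator that reads its input from stdin and prints one JSON object with `resolved` and `score` to stdout. One common way a child process hangs is that it waits for EOF on a stdin pipe the parent never closes; whether that was the exact cause of the two bugs is not stated here.

```typescript
import { spawn } from "node:child_process";

interface Verdict {
  resolved: boolean;
  score: number;
}

// Hypothetical helper: run a command, feed `input` on stdin, return stdout.
function runWithStdin(
  command: string,
  args: string[],
  input: string,
  timeoutMs = 60_000,
): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = "";
    let stderr = "";
    const timer = setTimeout(() => {
      child.kill("SIGKILL");
      reject(new Error(`${command} timed out after ${timeoutMs} ms`));
    }, timeoutMs);
    child.stdout.setEncoding("utf8").on("data", (chunk: string) => { stdout += chunk; });
    child.stderr.setEncoding("utf8").on("data", (chunk: string) => { stderr += chunk; });
    // A child that exits before reading can raise EPIPE; the exit code reports it.
    child.stdin.on("error", () => {});
    child.on("error", (err) => { clearTimeout(timer); reject(err); });
    child.on("close", (code) => {
      clearTimeout(timer);
      if (code === 0) resolve(stdout);
      else reject(new Error(`${command} exited with code ${code}: ${stderr.trim()}`));
    });
    // end() writes the input and then closes the pipe, so the child sees EOF.
    child.stdin.end(input);
  });
}

// Fail loud on a malformed report instead of defaulting to a score.
function parseReport(raw: string): Verdict {
  let report: unknown;
  try {
    report = JSON.parse(raw);
  } catch {
    throw new Error(`evaluator did not emit valid JSON: ${raw.slice(0, 200)}`);
  }
  if (typeof report !== "object" || report === null) {
    throw new Error("evaluator report must be a JSON object");
  }
  const { resolved, score } = report as { resolved?: unknown; score?: unknown };
  if (typeof resolved !== "boolean" || typeof score !== "number" || !Number.isFinite(score)) {
    throw new Error("evaluator report needs a boolean `resolved` and a finite numeric `score`");
  }
  return { resolved, score };
}
```

Usage would look like `parseReport(await runWithStdin(cmd, args, artifactText))`, where the command and arguments depend on the bench's own evaluator. Keeping the parser strict means a broken evaluator surfaces as an error rather than as a fabricated score.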

Verified

cd bench && tsc --noEmit → 0 · 19/19 fixture tests (tsx --test) · repo lint clean (220 files). aec-bench verified end-to-end offline: gold → {resolved:true, score:1} with per-field credit; empty → {resolved:false, score:0}.

Not in scope (honest)

The code benches (commit0/programbench/swe-bench) need Docker; appworld needs its pip env + download data. Adapters are wired + fixture-tested, runnable when those deps are present (preflight fails loud with the exact fix). tau2-bench (multi-turn dual-control conversation) is a different harness regime — deliberately deferred.
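
The fail-loud preflight can be pictured with a small TypeScript sketch. `PreflightCheck` and `preflight` are hypothetical names rather than identifiers from this PR, and the probes themselves (is Docker reachable, is the data downloaded) are assumed to be computed by the caller.

```typescript
// Hypothetical shapes for illustration; not the repository's actual preflight code.
interface PreflightCheck {
  name: string; // dependency being checked
  ok: boolean;  // probe result, computed by the caller
  fix: string;  // exact command or action that resolves a failure
}

// Throws before any judging starts if a dependency is missing,
// listing the fix for each failed check. Never returns a placeholder score.
function preflight(adapter: string, checks: PreflightCheck[]): void {
  const failed = checks.filter((check) => !check.ok);
  if (failed.length === 0) return;
  const fixes = failed.map((check) => `  - ${check.name}: ${check.fix}`);
  throw new Error(`${adapter}: missing dependencies, refusing to score\n${fixes.join("\n")}`);
}
```

For example, an appworld check with `ok: false` and `fix: "appworld download data"` would abort the run with that command in the error message.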

Why aec-bench matters beyond "another adapter"

It's the first wired domain that is deterministic + graded + no-LLM-judge + no-Docker → a clean, cheap, test-retest-zero instrument for the open gate (diverse@k vs blind@k under a deployable selector), and the tie-break domain from architecture-alternatives.md.

…ambench, appworld

Ship four real BenchmarkAdapters by REUSING the contract + one shared harness,
no duplication:
- benchmarks/_harness.ts: shared code-bench harness (stage artifact -> run the
  bench's own evaluator in a .venv/Docker subprocess -> parse JSON report ->
  {resolved,score}) + runVenvScriptStdin. swe-bench + terminal-bench refactored
  onto it (behavior-identical; no per-adapter process/venv/temp plumbing left).
- aec-bench: REAL, runnable with ZERO extra deps — judge = the task's own
  tests/verify.py over python3 stdlib (recompute closed-form ground truth,
  per-field math.isclose -> graded partial credit). Deterministic, correctable
  middle band: the open-gate candidate domain. Real GitHub task tree + fixtures.
- commit0 / programbench: real loadTasks (HF + *_FIXTURES offline) + judge via
  the official harness; fail loud (pip + Docker), never a fabricated score.
- appworld: finished the stub — loadTasks via the engine, judge = world.evaluate;
  fail loud (no offline fixture by honest design).
Registered in both ADAPTERS maps (run.ts + rsi.ts); offline fixture tests for
all four; fixed two execFile stdin-hang bugs (programbench, appworld).
Verified: bench tsc 0, 19/19 fixture tests, lint clean.
@drewstone drewstone merged commit ef67004 into main Jun 4, 2026
1 check passed
@drewstone drewstone deleted the feat/wire-benchmarks branch June 4, 2026 14:51
drewstone added a commit that referenced this pull request Jun 6, 2026
Cuts the 58-commit backlog on main into a published release. Headline surface:
- runToolLoop / streamToolLoop — bounded turn-level tool-dispatch loop (#137)
- RSI agent tree: recursive Agent.act, Supervisor keystone, runProgram, the
  adaptive-driver channel (#139/#151/#165)
- optimization API collapsed onto agent-eval selfImprove; the runtime keeps the
  CODE-surface ImprovementDriver you pass as driver (#172)
- deployable benchmark adapters: AppWorld, commit0, aec-bench, EnterpriseOps-Gym;
  runBenchmarks over one ADAPTERS registry (#153/#156/#157)
- agent-eval floor raised to >=0.83.0 (#175)