feat(bench): wire real benchmark adapters — aec-bench, commit0, programbench, appworld + shared harness by drewstone · Pull Request #153 · tangle-network/agent-runtime

drewstone · 2026-06-04T14:50:56Z

What

Four real BenchmarkAdapters, built by reusing the contract + one shared harness — no duplication.

benchmarks/_harness.ts — the shared code-bench harness extracted from swe-bench/terminal-bench: stage artifact → run the bench's own evaluator in a .venv/Docker subprocess → parse the JSON report → {resolved, score}, plus runVenvScriptStdin. swe-bench + terminal-bench refactored onto it (behavior-identical; no per-adapter process/venv/temp plumbing left).
aec-bench — REAL, runnable with zero extra deps. Judge = the task's own tests/verify.py over python3 stdlib (recompute closed-form ground truth, per-field math.isclose → graded partial credit). Deterministic, correctable middle band — the open-gate candidate domain. Real GitHub task tree + committed fixtures; goldArtifact returns the real golden_pass.md (offline verify-judge works).
commit0 / programbench — real loadTasks (HF + *_FIXTURES=1 offline) + judge via the official harness; fail loud (pip install … + Docker) — never a fabricated score.
appworld — finished the stub: loadTasks enumerates the live engine, judge = AppWorld's own world.evaluate(); fail loud, no offline fixture by honest design (task data exists only after appworld download data).
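The last step of the harness pipeline described above (parse the evaluator's JSON report into {resolved, score}) can be sketched as follows. This is a minimal illustration, not the code in benchmarks/_harness.ts: the `Verdict` type, the `verdictFromReport` name, and the `passed`/`total` report fields are all assumptions made for the example. It throws on an unreadable report instead of returning a zero, in line with the "never a fabricated score" rule.

```typescript
// Hypothetical shapes: the real _harness.ts types are not shown in this PR.
interface Verdict {
  resolved: boolean;
  score: number;
}

// Parse an evaluator's JSON report into {resolved, score}.
// Assumes the report carries `passed` and `total` counts (an invented schema).
function verdictFromReport(rawReport: string): Verdict {
  let report: unknown;
  try {
    report = JSON.parse(rawReport);
  } catch {
    // Fail loud: an unreadable report is an error, not a score of 0.
    throw new Error("evaluator report is not valid JSON");
  }
  if (typeof report !== "object" || report === null) {
    throw new Error("evaluator report must be a JSON object");
  }
  const { passed, total } = report as { passed?: unknown; total?: unknown };
  if (typeof passed !== "number" || typeof total !== "number" || total <= 0) {
    throw new Error("evaluator report is missing numeric passed/total");
  }
  return { resolved: passed === total, score: passed / total };
}
```

Throwing keeps "the evaluator did not run" distinguishable from "the evaluator ran and the artifact failed", which a silent `{resolved:false, score:0}` would blur.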

Registered in both ADAPTERS maps (run.ts + rsi.ts); offline fixture tests for all four; fixed two real execFile stdin-hang bugs (programbench, appworld) by routing through the shared stdin runner; HARNESS.md adapter section rewritten to the honest state.
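For readers unfamiliar with stdin hangs: one common way they arise is a child process that reads stdin until EOF while the parent never closes the pipe. Whether that was the exact cause in programbench and appworld is an assumption; the PR only states that both were fixed by routing through the shared stdin runner. The sketch below shows the general pattern using Node's `spawnSync` and its `input` option. `runScriptStdin` is a hypothetical helper, not the PR's `runVenvScriptStdin`, which may well be asynchronous.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper, not the PR's runVenvScriptStdin.
// spawnSync's `input` option writes the payload to the child's stdin and then
// closes the pipe, so a child that reads until EOF can finish.
function runScriptStdin(command: string, args: string[], input: string): string {
  const result = spawnSync(command, args, { input, encoding: "utf8" });
  if (result.error) {
    throw result.error; // e.g. the interpreter is not installed
  }
  if (result.status !== 0) {
    throw new Error(`${command} exited with ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}
```

Centralizing this in one runner means each adapter hands over a command and a payload and never manages the pipe itself.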

Verified

cd bench && tsc --noEmit → 0 · 19/19 fixture tests (tsx --test) · repo lint clean (220 files). aec-bench verified end-to-end offline: gold → {resolved:true, score:1} with per-field credit; empty → {resolved:false, score:0}.
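The gold and empty results above follow from per-field grading. The real judge is the task's own Python tests/verify.py; the TypeScript below is only an illustrative sketch of the idea, assuming each field carries equal weight (the PR does not say how fields are weighted). `isClose` follows the documented rule of Python's math.isclose with its default tolerances; `gradeFields` and the field names in the usage note are hypothetical.

```typescript
// Same rule as Python's math.isclose with its default tolerances.
function isClose(a: number, b: number, relTol = 1e-9, absTol = 0): boolean {
  if (a === b) return true; // also covers equal infinities
  if (!Number.isFinite(a) || !Number.isFinite(b)) return false; // NaN, mixed infinities
  return Math.abs(a - b) <= Math.max(relTol * Math.max(Math.abs(a), Math.abs(b)), absTol);
}

// Score = fraction of expected fields the answer matches; resolved only at 1.
// Equal weighting per field is an assumption of this sketch.
function gradeFields(
  expected: Record<string, number>,
  answer: Record<string, unknown>,
): { resolved: boolean; score: number } {
  const entries = Object.entries(expected);
  if (entries.length === 0) throw new Error("no expected fields to grade");
  let matched = 0;
  for (const [field, want] of entries) {
    const got = answer[field];
    if (typeof got === "number" && isClose(got, want)) matched += 1;
  }
  return { resolved: matched === entries.length, score: matched / entries.length };
}
```

With a hypothetical expected record `{ span_m: 6, load_kn: 12.5 }`, an identical answer grades to `{resolved: true, score: 1}`, an empty answer to `{resolved: false, score: 0}`, and an answer with only one field correct to a score of 0.5.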

Not in scope (honest)

The code benches (commit0/programbench/swe-bench) need Docker; appworld needs its pip env + download data. Adapters are wired + fixture-tested, runnable when those deps are present (preflight fails loud with the exact fix). tau2-bench (multi-turn dual-control conversation) is a different harness regime — deliberately deferred.
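A preflight that "fails loud with the exact fix" could look roughly like this. It is a sketch under assumptions: the `Requirement` type and `preflight` function are hypothetical and the PR's actual preflight code is not shown. Detecting whether a dependency is present is left to the caller; the point is only that a missing dependency produces an error naming the remedy instead of a score.

```typescript
// Hypothetical dependency check; the PR's real preflight is not shown.
interface Requirement {
  name: string;
  present: boolean;
  fix: string; // the command or step that resolves the missing dependency
}

// Throws one error listing every missing dependency with its fix.
function preflight(requirements: Requirement[]): void {
  const missing = requirements.filter((r) => !r.present);
  if (missing.length === 0) return;
  const lines = missing.map((r) => `- ${r.name}: ${r.fix}`);
  throw new Error(`benchmark cannot run, missing dependencies:\n${lines.join("\n")}`);
}
```

For example, passing `{ name: "appworld data", present: false, fix: "appworld download data" }` raises an error whose message contains the command to run.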

Why aec-bench matters beyond "another adapter"

It's the first wired domain that is deterministic + graded + no-LLM-judge + no-Docker → a clean, cheap, test-retest-zero instrument for the open gate (diverse@k vs blind@k under a deployable selector), and the tie-break domain from architecture-alternatives.md.

…ambench, appworld

Ship four real BenchmarkAdapters by REUSING the contract + one shared harness, no duplication:

- benchmarks/_harness.ts: shared code-bench harness (stage artifact -> run the bench's own evaluator in a .venv/Docker subprocess -> parse JSON report -> {resolved,score}) + runVenvScriptStdin. swe-bench + terminal-bench refactored onto it (behavior-identical; no per-adapter process/venv/temp plumbing left).
- aec-bench: REAL, runnable with ZERO extra deps; judge = the task's own tests/verify.py over python3 stdlib (recompute closed-form ground truth, per-field math.isclose -> graded partial credit). Deterministic, correctable middle band: the open-gate candidate domain. Real GitHub task tree + fixtures.
- commit0 / programbench: real loadTasks (HF + *_FIXTURES offline) + judge via the official harness; fail loud (pip + Docker), never a fabricated score.
- appworld: finished the stub; loadTasks via the engine, judge = world.evaluate; fail loud (no offline fixture by honest design).

Registered in both ADAPTERS maps (run.ts + rsi.ts); offline fixture tests for all four; fixed two execFile stdin-hang bugs (programbench, appworld).

Verified: bench tsc 0, 19/19 fixture tests, lint clean.

Cuts the 58-commit backlog on main into a published release. Headline surface:

- runToolLoop / streamToolLoop: bounded turn-level tool-dispatch loop (#137)
- RSI agent tree: recursive Agent.act, Supervisor keystone, runProgram, the adaptive-driver channel (#139/#151/#165)
- optimization API collapsed onto agent-eval selfImprove; the runtime keeps the CODE-surface ImprovementDriver you pass as driver (#172)
- deployable benchmark adapters: AppWorld, commit0, aec-bench, EnterpriseOps-Gym; runBenchmarks over one ADAPTERS registry (#153/#156/#157)
- agent-eval floor raised to >=0.83.0 (#175)

drewstone merged commit ef67004 into main Jun 4, 2026
1 check passed

drewstone deleted the feat/wire-benchmarks branch June 4, 2026 14:51

drewstone mentioned this pull request Jun 6, 2026

chore(release): agent-runtime 0.45.0 #176

Merged

drewstone merged 1 commit into main from feat/wire-benchmarks
