refactor(improvement): collapse optimization API onto agent-eval selfImprove #172
…ve deployable-selector gate The docker checker leaked containers (timeout killed the client, not the container) and could hang the pool (stuck client = unresolved promise). Fix: unique --name + docker rm -f force-reap on every path + a JS backstop that guarantees each checker promise resolves. Validated: n=50 ran clean, 0 leaked containers. RESULT (n=50, k=4, gpt-3.5-turbo for a correctable band): verifier-grounded selection CAPTURES the oracle ceiling (94%->94%, gap 0) where self-consistency loses. verifier-pick - sc = +12.0pp CI[+4,+22] POSITIVE; random@k - blind = +18.0pp CI[+8,+30]; sc - random = -12.0pp (reproduces the -8/-9pp answer-oracle loss in the deployable-checker domain). First BH-significant admissible non-blind selection win. SCOPE: Layer-0 (stateless completions, no self-correction lower bound).
…Improve

`selfImprove` (`@tangle-network/agent-eval/contract`, 0.83) is now the single entry point for closed-loop text/config optimization: gepaDriver + held-out gate + analyzeGeneration + production intake, behind one budget-shaped options object. agent-runtime keeps only the genuinely runtime-specific piece: the CODE-surface ImprovementDriver (worktree mutation via CandidateGenerator).

- delete src/improvement/optimize-prompt.ts (+ test): the thin wrapper over runImprovementLoop is subsumed by selfImprove's one call.
- delete src/improvement/report-eval-runs.ts (+ test): subsumed by selfImprove hostedTenant + /contract analyzeRuns / partitionRunsByAuthoringModel.
- migrate selfImproveLoopRunner (src/loop-runner.ts) and bench gepa-refine onto selfImprove; field renames (baseline.compositeMean / winner.surface / lift / gateDecision), budget.holdoutScenarios/reps/promoteTopK.
- bump @tangle-network/agent-eval 0.76 -> 0.83 (root + bench); 0.83 is the first release whose selfImprove exposes analyzeGeneration, closing the last gap.

No back-compat shim. -739 LOC. typecheck/lint/build clean; 674 tests pass; bench typecheck clean.
✅ No Blockers
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 72 | 83 | 72 |
| Confidence | 95 | 95 | 95 |
| Correctness | 72 | 83 | 72 |
| Security | 72 | 83 | 72 |
| Testing | 72 | 83 | 72 |
| Architecture | 72 | 83 | 72 |
Full multi-shot audit completed 7/7 planned shots over 8 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Stale +20pp win claim contradicts repo's own evidence ledger — bench/src/improve-prompt.ts
Line 13: 'We proved evidence-gated refinement beats blind (FinSearchComp +20pp) with a HAND-WRITTEN refine directive.' Per CLAUDE.md (repo root): 'The earlier +20pp steering proven was confounded compute — a cautionary precedent.' The claim is demonstrably stale and misleads anyone reading this file as user-facing documentation of what the bench proved. The PR touched the adjacent comment block (lines 1-4) so this file is in scope. Fix: update the comment to reflect the actual evidence state (the +20pp was confounded; subsequent contr
🟠 MEDIUM peerDependencies range too wide — code requires >=0.83.0 — package.json
DevDependency @tangle-network/agent-eval was correctly bumped from ^0.76.0 to ^0.83.0 (line 104), but peerDependencies (line 127) still declares >=0.76.0 <1.0.0. The new code in src/loop-runner.ts:29 imports selfImprove, SelfImproveOptions, SelfImproveResult from @tangle-network/agent-eval/contract — a subpath export added between 0.76.0 and 0.83.0. A consumer with agent-eval 0.76.0–0.82.0 would get a module-resolution or import error at runtime/typecheck. Fix: tighten peerDependencies to >=0.83.0 <1.0.0.
🟡 LOW backstop timer not unref'd — keeps event loop alive — bench/src/humaneval-gate.mts
Line 166: `setTimeout(() => finish({ pass: 0 }), dockerTimeoutMs + 3000)`. The backstop timer is cleared on normal/error paths via `clearTimeout(backstop)` inside `finish`/`fail`, which is correct. However, the timer is not `.unref()`'d, so while the pool workers are running, an idle backstop timer will prevent Node from exiting early. In practice this is harmless (the pool awaits all workers), but `.unref()` would be marginally cleaner for a bench script that might add a top-level timeout later.
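A minimal sketch of the suggested change, assuming Node's `node:timers` API. `makeFinisher` and `runChecker` are hypothetical names, not the helpers in `bench/src/humaneval-gate.mts`; only the `finish({ pass: 0 })` fallback and the `dockerTimeoutMs + 3000` delay are taken from the snippet above.

```typescript
import { clearTimeout, setTimeout } from "node:timers";

/** Result shape assumed from the `finish({ pass: 0 })` call quoted in the finding. */
interface CheckResult {
  pass: number;
}

/**
 * Hypothetical helper: returns a `finish` function that settles at most once.
 * If nobody calls it within `backstopMs`, the backstop settles with `fallback`.
 */
function makeFinisher<T>(
  resolve: (value: T) => void,
  fallback: T,
  backstopMs: number,
): (value: T) => void {
  let settled = false;
  const finish = (value: T): void => {
    if (settled) return; // later calls (including the backstop) are no-ops
    settled = true;
    clearTimeout(backstop);
    resolve(value);
  };
  const backstop = setTimeout(() => finish(fallback), backstopMs);
  // unref(): this timer alone no longer keeps the Node event loop alive.
  backstop.unref();
  return finish;
}

/** Hypothetical wrapper: `start` stands in for whatever launches the docker checker. */
function runChecker(
  start: (finish: (result: CheckResult) => void) => void,
  dockerTimeoutMs: number,
): Promise<CheckResult> {
  return new Promise<CheckResult>((resolve) => {
    start(makeFinisher<CheckResult>(resolve, { pass: 0 }, dockerTimeoutMs + 3000));
  });
}
```

One trade-off to keep in mind: an unref'd backstop only fires while something else (for example the docker child process) still holds the event loop open, so it suits a script that already awaits its workers.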
🟡 LOW docker rm -f cleanup is fire-and-forget with no error logging — bench/src/humaneval-gate.mts
Line 147: `execFile('docker', ['rm', '-f', name], () => {})`. The empty callback silently swallows any error from `docker rm -f`. If docker itself is down (the case where the daemon is unreachable), this will fail silently on every cleanup. Not a bug (the container won't exist if docker run failed), but adding `if (err) console.warn(...)` would aid debugging stuck-container issues in CI.
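The logging suggestion could look roughly like this. `reapContainer` and the injectable `Runner` type are hypothetical and exist only so the sketch can be exercised without a docker daemon; the `docker rm -f <name>` invocation is the one quoted above.

```typescript
import { execFile } from "node:child_process";

/** Hypothetical seam: anything that runs a command and reports an error or null. */
type Runner = (file: string, args: string[], done: (err: Error | null) => void) => void;

/** Default runner: delegates to Node's execFile and forwards only the error. */
const execFileRunner: Runner = (file, args, done) => {
  execFile(file, args, (err) => done(err));
};

/**
 * Force-remove a container by name. Still fire-and-forget, but a failure is
 * surfaced as a warning instead of being swallowed by an empty callback.
 */
function reapContainer(
  name: string,
  run: Runner = execFileRunner,
  warn: (message: string) => void = console.warn,
): void {
  run("docker", ["rm", "-f", name], (err) => {
    if (err) warn(`docker rm -f ${name} failed: ${err.message}`);
  });
}
```

Whether this gets noisy depends on how `docker rm -f` reports a container that is already gone, so the warning may need filtering in practice.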
🟡 LOW Dropped seed: 42 from optimizePrompt→selfImprove migration — bench/src/improve-prompt.ts
The old `optimizePrompt` call passed `seed: 42` (line 576 in the old file). The new `selfImprove` call has no `seed` field. If `selfImprove` uses a nondeterministic seed by default, this changes GEPA's generation-to-generation reproducibility. Verify that `selfImprove`'s default seed behavior matches, or add a `seed` field if the new API supports it. Impact: bench reproducibility only, not production.
🟡 LOW Breaking export removal: optimizePrompt and reportOptimizationRun — src/improvement/index.ts
Removes re-exports for optimizePrompt, reportOptimizationRun, OptimizePromptOptions, OptimizePromptResult, OptimizationRunMeta, optimizePromptResultToEvalRunEvents, and OptimizePromptReflection from the public barrel. All internal consumers have been migrated to @tangle-network/agent-eval/contract (loop-runner.ts, bench/src/improve-prompt.ts). The package.json version (0.44.0) should be bumped to reflect this breaking change for any external consumer importing from @tangle-network/agent-runtime/improvement.
🟡 LOW selfImproveLoopRunner ignores AbortSignal — src/loop-runner.ts
Line 285: `return async () => selfImprove(...)` discards the `signal: AbortSignal` parameter from the `DelegatedLoopRunner` type. This is pre-existing (the old optimizePrompt wrapper had the same shape) so not a regression, but callers passing an AbortSignal get no cancellation semantics. Fix: thread `signal` into `selfImprove` options if the substrate supports it.
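A sketch of what threading the signal through might look like. Whether `selfImprove` accepts a `signal` option is not confirmed here, so `loop` below is a stand-in function and `makeLoopRunner`, `WithSignal`, and `LoopFn` are hypothetical names; the real `DelegatedLoopRunner` type may differ.

```typescript
/** Assumed option: the underlying loop takes an optional AbortSignal. */
interface WithSignal {
  signal?: AbortSignal;
}

/** Stand-in for a selfImprove-like function; not the real agent-eval signature. */
type LoopFn<O extends object, R> = (options: O & WithSignal) => Promise<R>;

/**
 * Build a runner that forwards the caller's signal instead of discarding it.
 * An already-aborted signal short-circuits before the loop is started.
 */
function makeLoopRunner<O extends object, R>(
  loop: LoopFn<O, R>,
  options: O,
): (signal?: AbortSignal) => Promise<R> {
  return async (signal) => {
    if (signal?.aborted) throw new Error("aborted before start");
    return loop({ ...options, signal });
  };
}
```

If the substrate ignores `signal`, the early `aborted` check is still cheap protection against starting work that the caller has already cancelled.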
tangletools · 2026-06-06T00:57:00Z · trace
tangletools
left a comment
✅ Approved — 7 non-blocking findings — 7c0a1790
Full multi-shot audit completed 7/7 planned shots over 8 changed files. Global verifier still owns final merge decision.
…o 0.83 (#175)

PR #172 deleted optimizePrompt + report-eval-runs (selfImprove is the one entry point), but the docs/skills/pins still documented the removed APIs. Synced every surface so the docs match the code:

- README + the SHIPPED adoption SKILL: the optimization story now points at agent-eval's selfImprove (@tangle-network/agent-eval/contract); agent-runtime contributes only the code-surface improvementDriver; reportOptimizationRun → analyzeRuns; /improvement export table corrected to its real exports.
- CLAUDE.md + bench/HARNESS.md: agent-eval pin ^0.76.0 → ^0.83.0; optimizePrompt → selfImprove.
- package.json peerDependency floor >=0.76.0 → >=0.83.0 (selfImprove needs analyzeGeneration, added in 0.83). This is a real correctness fix: a consumer on 0.76 would break.
- drop a stale "0.76" comment label in improve-prompt.ts (heldoutSignificance is unchanged).

Verified: 0 remaining optimizePrompt/reportOptimizationRun/^0.76 refs in tracked source/docs; examples typecheck clean; root typecheck/lint/build green. agent-eval is on the latest published (0.83.0).
Cuts the 58-commit backlog on main into a published release. Headline surface:

- runToolLoop / streamToolLoop: bounded turn-level tool-dispatch loop (#137)
- RSI agent tree: recursive Agent.act, Supervisor keystone, runProgram, the adaptive-driver channel (#139/#151/#165)
- optimization API collapsed onto agent-eval selfImprove; the runtime keeps the CODE-surface ImprovementDriver you pass as driver (#172)
- deployable benchmark adapters: AppWorld, commit0, aec-bench, EnterpriseOps-Gym; runBenchmarks over one ADAPTERS registry (#153/#156/#157)
- agent-eval floor raised to >=0.83.0 (#175)
What
One entry point for closed-loop optimization: agent-eval's `selfImprove` (`@tangle-network/agent-eval/contract`). It wraps `runImprovementLoop` + `gepaDriver` + the held-out gate + `analyzeGeneration` + production intake behind a single budget-shaped options object. agent-runtime keeps only the one genuinely runtime-specific piece: the CODE-surface `ImprovementDriver` (git-worktree mutation via `CandidateGenerator`), which you pass to `selfImprove` as `driver`.
Why
We had three overlapping optimization surfaces in this repo. `optimizePrompt` was a thin wrapper over `runImprovementLoop`; `report-eval-runs` re-implemented production-run intake. Both are now strictly subsumed by `selfImprove` + the `/contract` analysis helpers (`analyzeRuns`, `partitionRunsByAuthoringModel`). One function, not a wrapper zoo.
The unlock was a version skew: installed `@tangle-network/agent-eval` was `0.76`, whose `selfImprove` lacked `analyzeGeneration` (the analyst→reflection wire we depend on). `0.83` adds it, so `selfImprove` is now strictly a superset of what `optimizePrompt` did, and the wrappers can go.
Changes
- Deleted `src/improvement/optimize-prompt.ts` (+ test): subsumed by `selfImprove`.
- Deleted `src/improvement/report-eval-runs.ts` (+ test): subsumed by `selfImprove` `hostedTenant` + `/contract` `analyzeRuns` / `partitionRunsByAuthoringModel`.
- Migrated `selfImproveLoopRunner` (`src/loop-runner.ts`) and `bench/src/improve-prompt.ts` onto `selfImprove`. Field renames: `baselineComposite` → `baseline.compositeMean`, `winnerComposite` → `winner.compositeMean`, `delta` → `lift`, `decision` → `gateDecision`, `prompt` → `winner.surface`, `rationale` → `winner.rationale`; budget gains `holdoutScenarios` / `reps` / `promoteTopK`.
- `src/improvement/index.ts` now exports only the CODE-surface driver pieces (`improvementDriver`, `agenticGenerator`, `reflectiveGenerator`).
- Bumped `@tangle-network/agent-eval` `0.76 → 0.83` (root + `bench/`); 0.83 is the first release whose `selfImprove` exposes `analyzeGeneration`.

No back-compat shim: this is greenfield optimization plumbing.
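For external code that still reads the old field names, the rename list above can be read as a mapping. The sketch below is an illustration, not something this PR ships: the interfaces are hypothetical subsets inferred only from the renames, the value types (`number`, `string`) are assumptions, and `toLegacyFields` is an invented name.

```typescript
/** Hypothetical subset of the new result; field types are assumed, not confirmed. */
interface SelfImproveResultLike {
  baseline: { compositeMean: number };
  winner: { compositeMean: number; surface: string; rationale: string };
  lift: number;
  gateDecision: string;
}

/** Old field names, as listed on the left-hand side of the renames. */
interface LegacyOptimizeFields {
  baselineComposite: number;
  winnerComposite: number;
  delta: number;
  decision: string;
  prompt: string;
  rationale: string;
}

/** Map new names back to old ones, one line per documented rename. */
function toLegacyFields(result: SelfImproveResultLike): LegacyOptimizeFields {
  return {
    baselineComposite: result.baseline.compositeMean,
    winnerComposite: result.winner.compositeMean,
    delta: result.lift,
    decision: result.gateDecision,
    prompt: result.winner.surface,
    rationale: result.winner.rationale,
  };
}
```

In-repo call sites were migrated directly to the new names, so an adapter like this would only make sense as a temporary bridge in a downstream consumer.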
Verification
- `−817 LOC` net.
- `typecheck` 0 / `lint` clean / `build` success / 658 tests pass.
- `bench` typecheck 0 (the `improve-prompt` migration compiles against 0.83).
- `main` (rebased on top of docs(bench): unify rollout/shot terminology + honestly scope the HumanEval gate #170 / refactor(bench): rename gepa-refine → improve-prompt (name by purpose, not method) #171; reconciled the `gepa-refine → improve-prompt` rename).