refactor(improvement): collapse optimization API onto agent-eval selfImprove by drewstone · Pull Request #172 · tangle-network/agent-runtime

drewstone · 2026-06-06T00:39:00Z

What

One entry point for closed-loop optimization: agent-eval's selfImprove (@tangle-network/agent-eval/contract). It wraps runImprovementLoop + gepaDriver + the held-out gate + analyzeGeneration + production intake behind a single budget-shaped options object. agent-runtime keeps only the one genuinely runtime-specific piece — the CODE-surface ImprovementDriver (git-worktree mutation via CandidateGenerator), which you pass to selfImprove as driver.

Why

We had three overlapping optimization surfaces in this repo. optimizePrompt was a thin wrapper over runImprovementLoop; report-eval-runs re-implemented production-run intake. Both are now strictly subsumed by selfImprove + the /contract analysis helpers (analyzeRuns, partitionRunsByAuthoringModel). One function, not a wrapper zoo.

What unlocked this was fixing a version skew: the installed @tangle-network/agent-eval was 0.76, whose selfImprove lacked analyzeGeneration (the analyst→reflection wire we depend on). 0.83 adds it, so selfImprove is now strictly a superset of what optimizePrompt did, and the wrappers can go.

Changes

delete src/improvement/optimize-prompt.ts (+ test) — subsumed by selfImprove.
delete src/improvement/report-eval-runs.ts (+ test) — subsumed by selfImprove hostedTenant + /contract analyzeRuns / partitionRunsByAuthoringModel.
migrate selfImproveLoopRunner (src/loop-runner.ts) and bench/src/improve-prompt.ts onto selfImprove. Field renames: baselineComposite→baseline.compositeMean, winnerComposite→winner.compositeMean, delta→lift, decision→gateDecision, prompt→winner.surface, rationale→winner.rationale; budget gains holdoutScenarios / reps / promoteTopK.
trim src/improvement/index.ts to export only the CODE-surface driver pieces (improvementDriver, agenticGenerator, reflectiveGenerator).
bump @tangle-network/agent-eval 0.76 → 0.83 (root + bench/); 0.83 is the first release whose selfImprove exposes analyzeGeneration.

No back-compat shim — this is greenfield optimization plumbing.
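
For a call site that still reads the old field names, the renames listed above can be sketched as a plain mapping. This is an illustration only: both interfaces are hypothetical and model just the fields named in the rename list, the scalar types (including `winner.surface` as a string) are assumptions, and `toLegacyView` is not part of this PR or of agent-eval.

```typescript
// Hypothetical old shape: only the fields named in the rename list.
interface LegacyResultSketch {
  baselineComposite: number;
  winnerComposite: number;
  delta: number;
  decision: string;
  prompt: string;
  rationale: string;
}

// Hypothetical new shape: field paths follow the rename list,
// field types are assumptions made for this sketch.
interface SelfImproveResultSketch {
  baseline: { compositeMean: number };
  winner: { compositeMean: number; surface: string; rationale: string };
  lift: number;
  gateDecision: string;
}

// Reads each new field path and exposes it under its old name.
function toLegacyView(r: SelfImproveResultSketch): LegacyResultSketch {
  return {
    baselineComposite: r.baseline.compositeMean, // baselineComposite → baseline.compositeMean
    winnerComposite: r.winner.compositeMean, // winnerComposite → winner.compositeMean
    delta: r.lift, // delta → lift
    decision: r.gateDecision, // decision → gateDecision
    prompt: r.winner.surface, // prompt → winner.surface
    rationale: r.winner.rationale, // rationale → winner.rationale
  };
}
```

Since the PR ships no shim, a mapping like this would live in the consumer, and only while its reads are being migrated to the new paths.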

Verification

−817 LOC net.
root typecheck 0 / lint clean / build success / 658 tests pass.
bench typecheck 0 (the improve-prompt migration compiles against 0.83).
merges cleanly into main (rebased on top of #170, "docs(bench): unify rollout/shot terminology + honestly scope the HumanEval gate", and #171, "refactor(bench): rename gepa-refine → improve-prompt (name by purpose, not method)"; reconciled the gepa-refine → improve-prompt rename).

…ve deployable-selector gate

The docker checker leaked containers (timeout killed the client, not the container) and could hang the pool (stuck client = unresolved promise). Fix: unique --name + docker rm -f force-reap on every path + a JS backstop that guarantees each checker promise resolves. Validated: n=50 ran clean, 0 leaked containers.

RESULT (n=50, k=4, gpt-3.5-turbo for a correctable band): verifier-grounded selection CAPTURES the oracle ceiling (94%->94%, gap 0) where self-consistency loses. verifier-pick - sc = +12.0pp CI[+4,+22] POSITIVE; random@k - blind = +18.0pp CI[+8,+30]; sc - random = -12.0pp (reproduces the -8/-9pp answer-oracle loss in the deployable-checker domain). First BH-significant admissible non-blind selection win.

SCOPE: Layer-0 (stateless completions, no self-correction lower bound).

…Improve

`selfImprove` (`@tangle-network/agent-eval/contract`, 0.83) is now the single entry point for closed-loop text/config optimization: gepaDriver + held-out gate + analyzeGeneration + production intake, behind one budget-shaped options object. agent-runtime keeps only the genuinely runtime-specific piece: the CODE-surface ImprovementDriver (worktree mutation via CandidateGenerator).

- delete src/improvement/optimize-prompt.ts (+ test): the thin wrapper over runImprovementLoop is subsumed by selfImprove's one call.
- delete src/improvement/report-eval-runs.ts (+ test): subsumed by selfImprove hostedTenant + /contract analyzeRuns / partitionRunsByAuthoringModel.
- migrate selfImproveLoopRunner (src/loop-runner.ts) and bench gepa-refine onto selfImprove; field renames (baseline.compositeMean / winner.surface / lift / gateDecision), budget.holdoutScenarios/reps/promoteTopK.
- bump @tangle-network/agent-eval 0.76 -> 0.83 (root + bench); 0.83 is the first release whose selfImprove exposes analyzeGeneration, closing the last gap.

No back-compat shim. -739 LOC. typecheck/lint/build clean; 674 tests pass; bench typecheck clean.

tangletools · 2026-06-06T00:57:03Z

✅ No Blockers — `7c0a1790`

Readiness 72/100 · Confidence 95/100 · 7 findings (2 medium, 5 low)

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 72 | 83 | 72 |
| Confidence | 95 | 95 | 95 |
| Correctness | 72 | 83 | 72 |
| Security | 72 | 83 | 72 |
| Testing | 72 | 83 | 72 |
| Architecture | 72 | 83 | 72 |

Full multi-shot audit completed 7/7 planned shots over 8 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Stale +20pp win claim contradicts repo's own evidence ledger — bench/src/improve-prompt.ts

Line 13: 'We proved evidence-gated refinement beats blind (FinSearchComp +20pp) with a HAND-WRITTEN refine directive.' Per CLAUDE.md (repo root): 'The earlier +20pp steering proven was confounded compute — a cautionary precedent.' The claim is demonstrably stale and misleads anyone reading this file as user-facing documentation of what the bench proved. The PR touched the adjacent comment block (lines 1-4) so this file is in scope. Fix: update the comment to reflect the actual evidence state (the +20pp was confounded; subsequent contr

🟠 MEDIUM peerDependencies range too wide — code requires >=0.83.0 — package.json

DevDependency @tangle-network/agent-eval was correctly bumped from ^0.76.0 to ^0.83.0 (line 104), but peerDependencies (line 127) still declares >=0.76.0 <1.0.0. The new code in src/loop-runner.ts:29 imports selfImprove, SelfImproveOptions, SelfImproveResult from @tangle-network/agent-eval/contract — a subpath export added between 0.76.0 and 0.83.0. A consumer with agent-eval 0.76.0–0.82.0 would get a module-resolution or import error at runtime/typecheck. Fix: tighten peerDependencies to >=0.83.0 <1.0.0.

🟡 LOW backstop timer not unref'd — keeps event loop alive — bench/src/humaneval-gate.mts

Line 166: setTimeout(() => finish({ pass: 0 }), dockerTimeoutMs + 3000) — the backstop timer is cleared on normal/error paths via clearTimeout(backstop) inside finish/fail, which is correct. However, the timer is not .unref()'d, so while the pool workers are running, an idle backstop timer will prevent Node from exiting early. In practice this is harmless (the pool awaits all workers), but .unref() would be marginally cleaner for a bench script that might add a top-level timeout later.
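
The backstop pattern described in this finding can be sketched roughly as follows. This is not the code in bench/src/humaneval-gate.mts: `withBackstop` and its parameters are hypothetical names, and the sketch only assumes Node's `node:timers` API, where the returned `Timeout` object has `unref()`.

```typescript
import { setTimeout, clearTimeout } from "node:timers";

// Hypothetical helper: resolves with `fallback` if `work` rejects or has not
// settled within `ms`. The returned promise never rejects and, as long as
// the process stays alive for `ms`, never stays pending.
function withBackstop<T>(work: Promise<T>, fallback: T, ms: number): Promise<T> {
  return new Promise<T>((resolve) => {
    let settled = false;
    const finish = (value: T): void => {
      if (settled) return; // first caller wins; later calls are no-ops
      settled = true;
      clearTimeout(backstop); // cleared on every path, as in the finding
      resolve(value);
    };
    const backstop = setTimeout(() => finish(fallback), ms);
    // unref(): a pending backstop alone no longer keeps the event loop alive.
    backstop.unref();
    work.then(finish, () => finish(fallback));
  });
}
```

One consequence of `unref()` worth keeping in mind: if nothing else holds the event loop open, the process can exit before the backstop fires, which is the intended trade-off for a bench script whose pool is already awaited.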

🟡 LOW docker rm -f cleanup is fire-and-forget with no error logging — bench/src/humaneval-gate.mts

Line 147: execFile('docker', ['rm', '-f', name], () => {}) — the empty callback silently swallows any error from docker rm -f. If docker itself is down (the case where the daemon is unreachable), this will fail silently on every cleanup. Not a bug (the container won't exist if docker run failed), but adding if (err) console.warn(...) would aid debugging stuck-container issues in CI.
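
A minimal sketch of the suggested logging, assuming only Node's `child_process.execFile`. The `reapContainer` name, the `Runner` type, and the injectable `warn`/`run` parameters are hypothetical and exist here so the behaviour can be exercised without a docker daemon.

```typescript
import { execFile } from "node:child_process";

// Hypothetical seam: same call shape as the execFile usage quoted above.
type Runner = (file: string, args: string[], cb: (err: Error | null) => void) => void;

const execFileRunner: Runner = (file, args, cb) => {
  execFile(file, args, (err) => cb(err));
};

// Still fire-and-forget, but a failed `docker rm -f` now leaves a trace.
function reapContainer(
  name: string,
  warn: (msg: string) => void = console.warn,
  run: Runner = execFileRunner,
): void {
  run("docker", ["rm", "-f", name], (err) => {
    if (err) warn(`docker rm -f ${name} failed: ${err.message}`);
  });
}
```

The cleanup remains non-blocking; the only behavioural change is that an unreachable daemon shows up in the logs instead of being swallowed.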

🟡 LOW Dropped seed: 42 from optimizePrompt→selfImprove migration — bench/src/improve-prompt.ts

The old optimizePrompt call passed seed: 42 (line 576 in the old file). The new selfImprove call has no seed field. If selfImprove uses a nondeterministic seed by default, this changes GEPA's generation-to-generation reproducibility. Verify that selfImprove's default seed behavior matches, or add a seed field if the new API supports it. Impact: bench reproducibility only, not production.

🟡 LOW Breaking export removal: optimizePrompt and reportOptimizationRun — src/improvement/index.ts

Removes re-exports for optimizePrompt, reportOptimizationRun, OptimizePromptOptions, OptimizePromptResult, OptimizationRunMeta, optimizePromptResultToEvalRunEvents, and OptimizePromptReflection from the public barrel. All internal consumers have been migrated to @tangle-network/agent-eval/contract (loop-runner.ts, bench/src/improve-prompt.ts). The package.json version (0.44.0) should be bumped to reflect this breaking change for any external consumer importing from @tangle-network/agent-runtime/improvement.

🟡 LOW selfImproveLoopRunner ignores AbortSignal — src/loop-runner.ts

Line 285: return async () => selfImprove(...) discards the signal: AbortSignal parameter from the DelegatedLoopRunner type. This is pre-existing (the old optimizePrompt wrapper had the same shape) so not a regression, but callers passing an AbortSignal get no cancellation semantics. Fix: thread signal into selfImprove options if the substrate supports it.
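
If the substrate turns out not to accept a signal, a caller-side wrapper is one way to honour it anyway. The sketch below uses only the standard `AbortSignal` API; `runWithSignal` is a hypothetical helper, and whether selfImprove's options take a signal is not established by this PR.

```typescript
// Hypothetical helper: rejects as soon as `signal` aborts. It only stops
// waiting; the underlying work keeps running unless it observes the signal.
function runWithSignal<T>(work: () => Promise<T>, signal?: AbortSignal): Promise<T> {
  if (!signal) return work();
  if (signal.aborted) {
    // Already aborted: do not start the work at all.
    return Promise.reject(signal.reason ?? new Error("aborted"));
  }
  return new Promise<T>((resolve, reject) => {
    const onAbort = (): void => reject(signal.reason ?? new Error("aborted"));
    signal.addEventListener("abort", onAbort, { once: true });
    work()
      .then(resolve, reject)
      // Detach the listener once the work settles, to avoid leaking it.
      .finally(() => signal.removeEventListener("abort", onAbort));
  });
}
```

This gives callers prompt rejection on abort, but it is weaker than real cancellation: budget already committed to an in-flight optimization run would still be spent.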

_{tangletools · 2026-06-06T00:57:00Z · trace}

tangletools

✅ Approved — 7 non-blocking findings — `7c0a1790`

Full multi-shot audit completed 7/7 planned shots over 8 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-06T00:57:00Z · immutable trace}

…o 0.83 (#175)

PR #172 deleted optimizePrompt + report-eval-runs (selfImprove is the one entry point), but the docs/skills/pins still documented the removed APIs. Synced every surface so the docs match the code:

- README + the SHIPPED adoption SKILL: the optimization story now points at agent-eval's selfImprove (@tangle-network/agent-eval/contract); agent-runtime contributes only the code-surface improvementDriver; reportOptimizationRun → analyzeRuns; /improvement export table corrected to its real exports.
- CLAUDE.md + bench/HARNESS.md: agent-eval pin ^0.76.0 → ^0.83.0; optimizePrompt → selfImprove.
- package.json peerDependency floor >=0.76.0 → >=0.83.0 (selfImprove needs analyzeGeneration, added in 0.83). This is a real correctness fix: a consumer on 0.76 would break.
- drop a stale "0.76" comment label in improve-prompt.ts (heldoutSignificance is unchanged).

Verified: 0 remaining optimizePrompt/reportOptimizationRun/^0.76 refs in tracked source/docs; examples typecheck clean; root typecheck/lint/build green. agent-eval is on the latest published (0.83.0).

Cuts the 58-commit backlog on main into a published release. Headline surface:

- runToolLoop / streamToolLoop: bounded turn-level tool-dispatch loop (#137)
- RSI agent tree: recursive Agent.act, Supervisor keystone, runProgram, the adaptive-driver channel (#139/#151/#165)
- optimization API collapsed onto agent-eval selfImprove; the runtime keeps the CODE-surface ImprovementDriver you pass as driver (#172)
- deployable benchmark adapters: AppWorld, commit0, aec-bench, EnterpriseOps-Gym; runBenchmarks over one ADAPTERS registry (#153/#156/#157)
- agent-eval floor raised to >=0.83.0 (#175)

drewstone added 2 commits June 5, 2026 18:36

tangletools approved these changes Jun 6, 2026

drewstone merged commit 3be64be into main Jun 6, 2026
1 check passed

drewstone deleted the refactor/selfimprove-collapse branch June 6, 2026 01:05

drewstone mentioned this pull request Jun 6, 2026

docs: sync optimization docs to selfImprove + bump agent-eval floor to 0.83 #175

Merged

drewstone mentioned this pull request Jun 6, 2026

chore(release): agent-runtime 0.45.0 #176

Merged
