refactor(improvement): collapse optimization API onto agent-eval selfImprove#172

Merged
drewstone merged 2 commits into main from refactor/selfimprove-collapse
Jun 6, 2026

Conversation

@drewstone
Contributor

What

One entry point for closed-loop optimization: agent-eval's selfImprove (@tangle-network/agent-eval/contract). It wraps runImprovementLoop + gepaDriver + the held-out gate + analyzeGeneration + production intake behind a single budget-shaped options object. agent-runtime keeps only the one genuinely runtime-specific piece — the CODE-surface ImprovementDriver (git-worktree mutation via CandidateGenerator), which you pass to selfImprove as driver.

Why

We had three overlapping optimization surfaces in this repo. optimizePrompt was a thin wrapper over runImprovementLoop; report-eval-runs re-implemented production-run intake. Both are now strictly subsumed by selfImprove + the /contract analysis helpers (analyzeRuns, partitionRunsByAuthoringModel). One function, not a wrapper zoo.

The unlock was a version skew: installed @tangle-network/agent-eval was 0.76, whose selfImprove lacked analyzeGeneration (the analyst→reflection wire we depend on). 0.83 adds it — so selfImprove is now strictly a superset of what optimizePrompt did, and the wrappers can go.

Changes

  • delete src/improvement/optimize-prompt.ts (+ test) — subsumed by selfImprove.
  • delete src/improvement/report-eval-runs.ts (+ test) — subsumed by selfImprove hostedTenant + /contract analyzeRuns / partitionRunsByAuthoringModel.
  • migrate selfImproveLoopRunner (src/loop-runner.ts) and bench/src/improve-prompt.ts onto selfImprove. Field renames: baselineComposite → baseline.compositeMean, winnerComposite → winner.compositeMean, delta → lift, decision → gateDecision, prompt → winner.surface, rationale → winner.rationale; budget gains holdoutScenarios / reps / promoteTopK.
  • trim src/improvement/index.ts to export only the CODE-surface driver pieces (improvementDriver, agenticGenerator, reflectiveGenerator).
  • bump @tangle-network/agent-eval 0.76 → 0.83 (root + bench/); 0.83 is the first release whose selfImprove exposes analyzeGeneration.

No back-compat shim — this is greenfield optimization plumbing.
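
The field renames listed under Changes can be illustrated with a small reader of the new result shape. This is only a sketch: `ResultLike` is a hypothetical, trimmed-down stand-in for the real `SelfImproveResult` exported by `@tangle-network/agent-eval/contract`, the field types (numbers and strings) are assumptions, and `summarize` is an illustrative helper that does not exist in either package.

```typescript
// Hypothetical, trimmed-down view of the result. The real SelfImproveResult
// type lives in @tangle-network/agent-eval/contract and may differ in both
// field types and extra fields.
interface ResultLike {
  baseline: { compositeMean: number }; // was baselineComposite
  winner: {
    compositeMean: number; // was winnerComposite
    surface: string; // was prompt (string type is an assumption)
    rationale: string; // was rationale
  };
  lift: number; // was delta
  gateDecision: string; // was decision
}

// Illustrative helper: renders a one-line summary using only the new names.
function summarize(r: ResultLike): string {
  const sign = r.lift >= 0 ? "+" : "";
  return (
    `${r.gateDecision}: ${r.baseline.compositeMean.toFixed(3)} -> ` +
    `${r.winner.compositeMean.toFixed(3)} (lift ${sign}${r.lift.toFixed(3)})`
  );
}
```

A caller that previously read `result.winnerComposite` or `result.delta` would read `result.winner.compositeMean` or `result.lift` in the same places. The `"promote"` value in any usage of `gateDecision` is a placeholder, not a documented enum member.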

Verification

drewstone added 2 commits June 5, 2026 18:36
…ve deployable-selector gate

The docker checker leaked containers (timeout killed the client, not the container) and could
hang the pool (stuck client = unresolved promise). Fix: unique --name + docker rm -f force-reap on
every path + a JS backstop that guarantees each checker promise resolves. Validated: n=50 ran clean,
0 leaked containers.

RESULT (n=50, k=4, gpt-3.5-turbo for a correctable band): verifier-grounded selection CAPTURES the
oracle ceiling (94%->94%, gap 0) where self-consistency loses. verifier-pick - sc = +12.0pp CI[+4,+22]
POSITIVE; random@k - blind = +18.0pp CI[+8,+30]; sc - random = -12.0pp (reproduces the -8/-9pp
answer-oracle loss in the deployable-checker domain). First BH-significant admissible non-blind
selection win. SCOPE: Layer-0 (stateless completions, no self-correction lower bound).
…Improve

`selfImprove` (`@tangle-network/agent-eval/contract`, 0.83) is now the single
entry point for closed-loop text/config optimization: gepaDriver + held-out
gate + analyzeGeneration + production intake, behind one budget-shaped options
object. agent-runtime keeps only the genuinely runtime-specific piece — the
CODE-surface ImprovementDriver (worktree mutation via CandidateGenerator).

- delete src/improvement/optimize-prompt.ts (+ test) — the thin wrapper over
  runImprovementLoop is subsumed by selfImprove's one call.
- delete src/improvement/report-eval-runs.ts (+ test) — subsumed by selfImprove
  hostedTenant + /contract analyzeRuns / partitionRunsByAuthoringModel.
- migrate selfImproveLoopRunner (src/loop-runner.ts) and bench gepa-refine onto
  selfImprove; field renames (baseline.compositeMean / winner.surface / lift /
  gateDecision), budget.holdoutScenarios/reps/promoteTopK.
- bump @tangle-network/agent-eval 0.76 -> 0.83 (root + bench); 0.83 is the first
  release whose selfImprove exposes analyzeGeneration, closing the last gap.

No back-compat shim. -739 LOC. typecheck/lint/build clean; 674 tests pass;
bench typecheck clean.
@tangletools
Contributor

✅ No Blockers — 7c0a1790

Readiness 72/100 · Confidence 95/100 · 7 findings (2 medium, 5 low)

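| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 72 | 83 | 72 |
| Confidence | 95 | 95 | 95 |
| Correctness | 72 | 83 | 72 |
| Security | 72 | 83 | 72 |
| Testing | 72 | 83 | 72 |
| Architecture | 72 | 83 | 72 |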

Full multi-shot audit completed 7/7 planned shots over 8 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Stale +20pp win claim contradicts repo's own evidence ledger — bench/src/improve-prompt.ts

Line 13: 'We proved evidence-gated refinement beats blind (FinSearchComp +20pp) with a HAND-WRITTEN refine directive.' Per CLAUDE.md (repo root): 'The earlier +20pp steering proven was confounded compute — a cautionary precedent.' The claim is demonstrably stale and misleads anyone reading this file as user-facing documentation of what the bench proved. The PR touched the adjacent comment block (lines 1-4) so this file is in scope. Fix: update the comment to reflect the actual evidence state (the +20pp was confounded; subsequent contr

🟠 MEDIUM peerDependencies range too wide — code requires >=0.83.0 — package.json

DevDependency @tangle-network/agent-eval was correctly bumped from ^0.76.0 to ^0.83.0 (line 104), but peerDependencies (line 127) still declares >=0.76.0 <1.0.0. The new code in src/loop-runner.ts:29 imports selfImprove, SelfImproveOptions, SelfImproveResult from @tangle-network/agent-eval/contract — a subpath export added between 0.76.0 and 0.83.0. A consumer with agent-eval 0.76.0–0.82.0 would get a module-resolution or import error at runtime/typecheck. Fix: tighten peerDependencies to >=0.83.0 <1.0.0.
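
To make the range problem concrete, here is a minimal sketch that checks plain `x.y.z` versions against a `>=floor <ceiling` range. It deliberately avoids any semver library, ignores pre-release tags, and all function names are illustrative. It is not how npm resolves peer dependencies, only a demonstration that 0.76.0 passes the old range but fails the tightened one.

```typescript
type Version = [number, number, number];

// Parses a plain "x.y.z" string. Pre-release and build suffixes are not supported.
function parseVersion(v: string): Version {
  const parts = v.split(".").map((p) => Number.parseInt(p, 10));
  if (parts.length !== 3 || parts.some((n) => Number.isNaN(n))) {
    throw new Error(`not a plain x.y.z version: ${v}`);
  }
  const [major = 0, minor = 0, patch = 0] = parts;
  return [major, minor, patch];
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: Version, b: Version): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// True when floor <= version < ceiling, i.e. the range ">=floor <ceiling".
function inRange(version: string, floor: string, ceiling: string): boolean {
  const v = parseVersion(version);
  return (
    compareVersions(v, parseVersion(floor)) >= 0 &&
    compareVersions(v, parseVersion(ceiling)) < 0
  );
}

// Old peer range admits 0.76.0; the tightened range does not.
const oldRangeAdmits076 = inRange("0.76.0", "0.76.0", "1.0.0");
const newRangeAdmits076 = inRange("0.76.0", "0.83.0", "1.0.0");
```

Under these assumptions `oldRangeAdmits076` is `true` and `newRangeAdmits076` is `false`, which is the gap the finding describes: the old peer range lets a consumer install a version the code can no longer run against.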

🟡 LOW backstop timer not unref'd — keeps event loop alive — bench/src/humaneval-gate.mts

Line 166: setTimeout(() => finish({ pass: 0 }), dockerTimeoutMs + 3000) — the backstop timer is cleared on normal/error paths via clearTimeout(backstop) inside finish/fail, which is correct. However, the timer is not .unref()'d, so while the pool workers are running, an idle backstop timer will prevent Node from exiting early. In practice this is harmless (the pool awaits all workers), but .unref() would be marginally cleaner for a bench script that might add a top-level timeout later.
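
The pattern under discussion can be sketched as follows. This is not the code in bench/src/humaneval-gate.mts: `withBackstop` and its parameters are hypothetical names, and the sketch only shows a promise that always settles, clears its backstop timer on the normal path, and unrefs the timer so it cannot hold the Node event loop open by itself.

```typescript
// Runs `start`, which must eventually call `finish`. If it never does, the
// backstop timer resolves the promise with `fallback` after `ms` milliseconds.
function withBackstop<T>(
  start: (finish: (value: T) => void) => void,
  fallback: T,
  ms: number,
): Promise<T> {
  return new Promise<T>((resolve) => {
    let settled = false;
    const finish = (value: T): void => {
      if (settled) return; // first caller wins, later calls are ignored
      settled = true;
      clearTimeout(backstop);
      resolve(value);
    };
    const backstop = setTimeout(() => finish(fallback), ms);
    // Node timers expose unref(); the cast keeps this compiling where
    // setTimeout is typed as returning a number (browser typings).
    (backstop as unknown as { unref?: () => void }).unref?.();
    start(finish);
  });
}
```

One consequence of `unref()` worth noting: if nothing else keeps the process alive, Node may exit before the backstop fires, so the fallback is only a guarantee while other work (such as the worker pool) is still pending.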

🟡 LOW docker rm -f cleanup is fire-and-forget with no error logging — bench/src/humaneval-gate.mts

Line 147: execFile('docker', ['rm', '-f', name], () => {}) — the empty callback silently swallows any error from docker rm -f. If docker itself is down (the case where the daemon is unreachable), this will fail silently on every cleanup. Not a bug (the container won't exist if docker run failed), but adding if (err) console.warn(...) would aid debugging stuck-container issues in CI.

🟡 LOW Dropped seed: 42 from optimizePrompt→selfImprove migration — bench/src/improve-prompt.ts

The old optimizePrompt call passed seed: 42 (line 576 in the old file). The new selfImprove call has no seed field. If selfImprove uses a nondeterministic seed by default, this changes GEPA's generation-to-generation reproducibility. Verify that selfImprove's default seed behavior matches, or add a seed field if the new API supports it. Impact: bench reproducibility only, not production.

🟡 LOW Breaking export removal: optimizePrompt and reportOptimizationRun — src/improvement/index.ts

Removes re-exports for optimizePrompt, reportOptimizationRun, OptimizePromptOptions, OptimizePromptResult, OptimizationRunMeta, optimizePromptResultToEvalRunEvents, and OptimizePromptReflection from the public barrel. All internal consumers have been migrated to @tangle-network/agent-eval/contract (loop-runner.ts, bench/src/improve-prompt.ts). The package.json version (0.44.0) should be bumped to reflect this breaking change for any external consumer importing from @tangle-network/agent-runtime/improvement.

🟡 LOW selfImproveLoopRunner ignores AbortSignal — src/loop-runner.ts

Line 285: return async () => selfImprove(...) discards the signal: AbortSignal parameter from the DelegatedLoopRunner type. This is pre-existing (the old optimizePrompt wrapper had the same shape) so not a regression, but callers passing an AbortSignal get no cancellation semantics. Fix: thread signal into selfImprove options if the substrate supports it.
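
Whether `selfImprove` accepts a signal is not stated here, so the following is only a sketch of a caller-side fallback: a hypothetical `raceWithAbort` helper that rejects as soon as the signal aborts. It stops the caller from waiting but does not cancel the underlying work, which still needs support from the substrate.

```typescript
// Rejects with Error("aborted") if `signal` aborts before `work` settles.
// The work itself keeps running; only the caller stops waiting for it.
function raceWithAbort<T>(work: () => Promise<T>, signal: AbortSignal): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted")); // do not even start the work
      return;
    }
    const onAbort = (): void => reject(new Error("aborted"));
    signal.addEventListener("abort", onAbort, { once: true });
    work()
      .then(resolve, reject)
      .finally(() => signal.removeEventListener("abort", onAbort));
  });
}
```

A runner could then return `raceWithAbort(() => selfImprove(options), signal)`, where `options` stands for whatever the runner already passes, giving callers prompt rejection on abort until true cancellation is available.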


tangletools · 2026-06-06T00:57:00Z · trace

Contributor

@tangletools tangletools left a comment


✅ Approved — 7 non-blocking findings — 7c0a1790

Full multi-shot audit completed 7/7 planned shots over 8 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-06T00:57:00Z · immutable trace

@drewstone drewstone merged commit 3be64be into main Jun 6, 2026
1 check passed
@drewstone drewstone deleted the refactor/selfimprove-collapse branch June 6, 2026 01:05
drewstone added a commit that referenced this pull request Jun 6, 2026
…o 0.83 (#175)

PR #172 deleted optimizePrompt + report-eval-runs (selfImprove is the one entry
point), but the docs/skills/pins still documented the removed APIs. Synced every
surface so the docs match the code:
- README + the SHIPPED adoption SKILL: the optimization story now points at
  agent-eval's selfImprove (@tangle-network/agent-eval/contract) — agent-runtime
  contributes only the code-surface improvementDriver; reportOptimizationRun →
  analyzeRuns; /improvement export table corrected to its real exports.
- CLAUDE.md + bench/HARNESS.md: agent-eval pin ^0.76.0 → ^0.83.0; optimizePrompt → selfImprove.
- package.json peerDependency floor >=0.76.0 → >=0.83.0 (selfImprove needs analyzeGeneration,
  added in 0.83) — a real correctness fix: a consumer on 0.76 would break.
- drop a stale "0.76" comment label in improve-prompt.ts (heldoutSignificance is unchanged).

Verified: 0 remaining optimizePrompt/reportOptimizationRun/^0.76 refs in tracked
source/docs; examples typecheck clean; root typecheck/lint/build green. agent-eval
is on the latest published (0.83.0).
drewstone added a commit that referenced this pull request Jun 6, 2026
Cuts the 58-commit backlog on main into a published release. Headline surface:
- runToolLoop / streamToolLoop — bounded turn-level tool-dispatch loop (#137)
- RSI agent tree: recursive Agent.act, Supervisor keystone, runProgram, the
  adaptive-driver channel (#139/#151/#165)
- optimization API collapsed onto agent-eval selfImprove; the runtime keeps the
  CODE-surface ImprovementDriver you pass as driver (#172)
- deployable benchmark adapters: AppWorld, commit0, aec-bench, EnterpriseOps-Gym;
  runBenchmarks over one ADAPTERS registry (#153/#156/#157)
- agent-eval floor raised to >=0.83.0 (#175)