feat(loops+improvement): dynamic loop driver + identity-gated optimizePrompt #75
Merged
drewstone merged 2 commits on May 31, 2026
Third example driver alongside refine and fanout-vote, built on the existing Driver seam with zero kernel changes. Where refine/fanout-vote encode a fixed shape as a pure function of history, createDynamicDriver delegates the per-round shape to an injected TopologyPlanner that emits one TopologyMove (refine | fanout | stop) per round.

- createDynamicDriver: maps moves onto plan/decide, enforces the iteration + fanout caps, and fails loud (PlannerError) on a malformed move. The planner is invoked once per round in plan(); decide() reads the cached move so an LLM planner is never double-called. 'done' is already a kernel-terminal decision, so termination needs no kernel change.
- createSandboxPlanner: wires the planner to a sandbox profile (any harness), streams a prompt carrying the history summary, and decodes the move from a JSON envelope (structured result event or fenced block).
- summarizeHistory: bounded, planner-friendly view of iteration history.
- PlannerError added to the error taxonomy (carries 'validation').

Topology is orthogonal to harness: the planner never names a backend; the kernel's agentRuns round-robin decides which harness runs a branch, so one dynamic driver spans claude-code/codex/opencode/pi, including fanning a single round across several at once.

11 tests through the real kernel (sandbox stubbed at the process boundary): adaptive refine→refine→fanout→stop, explicit scripted trajectory across two harnesses, maxIterations cap, maxFanout clamp, empty-fanout + unknown-kind PlannerError, createSandboxPlanner end-to-end + n-shorthand + fenced-delta parse + decodeTask rejection.
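The plan/decide caching described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the names `Move`, `SketchDriver`, `SketchPlannerError`, and `sketchDynamicDriver` are hypothetical, the planner is synchronous for brevity, and the real Driver seam and TopologyMove shapes may differ.

```typescript
// Hypothetical, simplified shapes; the real Driver, TopologyMove, and
// PlannerError types in the PR may look different.
type Move =
  | { kind: "refine" }
  | { kind: "fanout"; n: number }
  | { kind: "stop" };

class SketchPlannerError extends Error {}

// The planner sees the history so far and returns an unvalidated move.
type Planner = (history: readonly string[]) => unknown;

interface SketchDriver {
  plan(history: readonly string[]): number; // branches to run this round
  decide(): "continue" | "done";
}

// Reject malformed moves loudly instead of guessing what was meant.
function validateMove(raw: unknown): Move {
  if (typeof raw !== "object" || raw === null) {
    throw new SketchPlannerError("move must be an object");
  }
  const m = raw as { kind?: unknown; n?: unknown };
  if (m.kind === "refine") return { kind: "refine" };
  if (m.kind === "stop") return { kind: "stop" };
  if (m.kind === "fanout") {
    if (typeof m.n !== "number" || !Number.isInteger(m.n) || m.n < 1) {
      throw new SketchPlannerError("fanout needs a positive integer n");
    }
    return { kind: "fanout", n: m.n };
  }
  throw new SketchPlannerError("unknown move kind: " + String(m.kind));
}

function sketchDynamicDriver(
  planner: Planner,
  caps: { maxIterations: number; maxFanout: number },
): SketchDriver {
  let cached: Move | undefined;
  return {
    plan(history) {
      // Iteration cap: stop without consulting the planner.
      if (history.length >= caps.maxIterations) {
        cached = { kind: "stop" };
        return 0;
      }
      const move = validateMove(planner(history)); // the only planner call per round
      cached = move;
      if (move.kind === "stop") return 0;
      if (move.kind === "refine") return 1;
      return Math.min(move.n, caps.maxFanout); // clamp fanout to the cap
    },
    decide() {
      // Reads the cached move, so the planner is never invoked twice.
      if (cached === undefined) {
        throw new SketchPlannerError("decide() called before plan()");
      }
      return cached.kind === "stop" ? "done" : "continue";
    },
  };
}
```

Caching the validated move inside `plan()` is what keeps an expensive LLM-backed planner to one call per round; `decide()` only translates the cached move into a terminal or non-terminal decision.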
…ny text prompt surface

The text-surface entry point onto agent-eval's runImprovementLoop, sibling to improvementDriver (the code/worktree path). Defaults the driver to agent-eval's gepaDriver (reflective text mutator) and the gate to heldOutGate; runtime-agnostic via a single runWithPrompt seam.

Identity-gated by construction: the loop runs evals, collects per-scenario signal, proposes candidates, and the held-out gate compares candidate vs baseline. result.prompt is the baseline (identity) UNLESS the gate decided 'ship', so registering a prompt for optimization can never regress it; it only improves when held-out data earns it. Generic over the surface's execution (sandbox streamPrompt, runLoop, direct model call): the optimizer never assumes how a prompt runs. Fails loud on misconfig (no driver/reflection, empty scenarios/holdout) and on a non-string CodeSurface (wrong entry point).

4 tests through the real runImprovementLoop, zero LLM (deterministic driver + judge + runner, in-memory storage): identity holds when no candidate beats baseline on holdout (returns the untouched baseline), promotes + returns the improved prompt + rationale when a candidate wins, fail-loud on misconfig and empty holdout.
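The identity-gating rule above can be illustrated with a small sketch. Everything here is an assumption made for illustration: `Candidate`, `GateResult`, `Scorer`, and `identityGatedSelect` are hypothetical names, and the comparison used (strictly higher mean held-out score) is a stand-in, since the source does not state how heldOutGate actually compares candidate and baseline.

```typescript
// Hypothetical shapes for illustration; these are not agent-eval's types.
interface Candidate {
  prompt: string;
  rationale: string;
}

interface GateResult {
  decision: "ship" | "hold";
  prompt: string;
  rationale?: string;
}

// Scores one prompt on one held-out scenario; higher is better.
type Scorer = (prompt: string, scenario: string) => number;

function meanHoldoutScore(
  prompt: string,
  holdout: readonly string[],
  score: Scorer,
): number {
  let total = 0;
  for (const scenario of holdout) total += score(prompt, scenario);
  return total / holdout.length;
}

function identityGatedSelect(
  baseline: string,
  candidates: readonly Candidate[],
  holdout: readonly string[],
  score: Scorer,
): GateResult {
  // Fail loud on misconfiguration rather than silently returning the baseline.
  if (holdout.length === 0) throw new Error("holdout must not be empty");

  // Identity is the default outcome: the baseline is returned unless beaten.
  let result: GateResult = { decision: "hold", prompt: baseline };
  let best = meanHoldoutScore(baseline, holdout, score);

  for (const candidate of candidates) {
    const candidateScore = meanHoldoutScore(candidate.prompt, holdout, score);
    // Assumed rule: only a strictly better held-out score ships; ties hold.
    if (candidateScore > best) {
      best = candidateScore;
      result = {
        decision: "ship",
        prompt: candidate.prompt,
        rationale: candidate.rationale,
      };
    }
  }
  return result;
}
```

Because the result starts as the baseline and is only replaced on a strict held-out win, the returned prompt never scores below the input prompt on the holdout set under this sketch's scoring.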
tangletools pushed a commit that referenced this pull request on May 31, 2026
drewstone added a commit that referenced this pull request on May 31, 2026
…amic loop driver, optimizePrompt (#76)

* feat(loops): surface aggregated tokenUsage on LoopResult + reportLoopUsage bridge

  runLoop tracked per-call tokensIn/tokensOut (extractLlmCallEvent) but only aggregated costUsd; token counts were dropped before reaching Iteration or LoopResult. A runProfileMatrix/runCampaign dispatch wrapping runLoop could report cost but had no tokens to report, so agent-eval's backend-integrity guard (assertRealBackend, which keys on tokenUsage) would misread a real run as a stub and throw.

  - Iteration + LoopResult gain tokenUsage: { input, output }, summed across every llm_call event (per iteration) and across iterations (LoopResult).
  - reportLoopUsage(cost, result) forwards a finished loop's cost + tokens into a campaign cost meter in one call: the trivial consumption path for the new runProfileMatrix primitive. Typed structurally so loops stay free of an agent-eval import.

  Extends the existing cost-aggregation test to assert token aggregation + reportLoopUsage forwarding. Full suite 381 green.

* chore(deps): bump @tangle-network/agent-eval ^0.54.0 → ^0.61.0

  Consumes the published runProfileMatrix + token-capture release. 7-minor jump verified: typecheck + build + full suite (381) green.

* feat(loops): loopDispatch, a first-class runLoop→campaign dispatch adapter

  The seam critique found reportLoopUsage had one consumer (a test) and zero products: wiring runLoop into runProfileMatrix/runCampaign required hand-building ExecCtx, hand-adapting the campaign trace, and remembering to forward usage (forgetting the last yields a {0,0} stub cell). loopDispatch collapses all three into one typed call:

      const dispatch = loopDispatch({ sandboxClient, toLoopOptions })
      await runProfileMatrix({ profiles, scenarios, dispatch, judges, commitSha })

  It builds the ExecCtx, forwards loop.* trace events into the campaign's scoped trace (campaignTraceToLoopEmitter), runs runLoop, reports cost+tokens via reportLoopUsage internally, and returns winner.output. loopCampaignDispatch is the runCampaign (no-profile) variant. AgentProfile is imported from agent-eval (the eval-harness type ProfileDispatchFn keys on), NOT sandbox's, which closes the name-collision footgun at this call site.

  Tests: returns winner artifact + reports exact usage + forwards trace spans; usage still flows on a validator-failing run (must not read as a stub). Full suite 383 green.

* chore(deps): declare agent-eval as a required peerDependency, not a hard dependency

  Version-discipline fix (boundary critique, VERSIONING 3/10). agent-eval was the lone hard dependency while sandbox + agent-knowledge are already peers. A hard dep lets pnpm install a SECOND, divergent agent-eval tree with an incompatible RunRecord/DefaultVerdict; today only pnpm.overrides prevents it. As a peer (>=0.61.0 <1.0.0, required, not optional), a consumer running a stale or divergent substrate gets a loud unmet-peer warning instead of a silent split tree. agent-eval moves to devDependencies for agent-runtime's own build/test. Typecheck + full suite (383) green with the peer layout.

* chore(release): 0.32.0, loopDispatch adapter + tokenUsage seam + agent-eval peer-dep

* feat(loops+improvement): dynamic loop driver + identity-gated optimizePrompt (#75)

  * feat(loops): dynamic driver, agent-authored loop topology
  * feat(improvement): optimizePrompt, identity-gated optimization for any text prompt surface

* chore(release): 0.33.0, dynamic loop driver + identity-gated optimizePrompt (#75)

---------
Co-authored-by: Drew Stone <drewstone329@gmail.com>
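The token aggregation described in the tokenUsage commit above (per-call counts summed per iteration, then across iterations, then forwarded with cost in one call) can be sketched roughly like this. The field names tokensIn, tokensOut, costUsd, and tokenUsage: { input, output } come from the commit message; `UsageTotals`, `CostMeterSketch`, `sumCalls`, `sumIterations`, and `reportUsageSketch` are hypothetical, and the real reportLoopUsage signature and cost-meter interface are not shown in the source.

```typescript
interface TokenUsage {
  input: number;
  output: number;
}

// One llm_call event, using the per-call fields named in the commit message.
interface LlmCall {
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
}

// Hypothetical shared shape for an iteration total and a loop total.
interface UsageTotals {
  costUsd: number;
  tokenUsage: TokenUsage;
}

function emptyTotals(): UsageTotals {
  return { costUsd: 0, tokenUsage: { input: 0, output: 0 } };
}

// Per iteration: sum cost and tokens across every llm_call event.
function sumCalls(calls: readonly LlmCall[]): UsageTotals {
  const totals = emptyTotals();
  for (const call of calls) {
    totals.costUsd += call.costUsd;
    totals.tokenUsage.input += call.tokensIn;
    totals.tokenUsage.output += call.tokensOut;
  }
  return totals;
}

// Per loop: sum the iteration totals.
function sumIterations(iterations: readonly UsageTotals[]): UsageTotals {
  const totals = emptyTotals();
  for (const iteration of iterations) {
    totals.costUsd += iteration.costUsd;
    totals.tokenUsage.input += iteration.tokenUsage.input;
    totals.tokenUsage.output += iteration.tokenUsage.output;
  }
  return totals;
}

// Hypothetical structural meter type: any object with this method works,
// so this module needs no import from an eval package.
interface CostMeterSketch {
  record(costUsd: number, tokens: TokenUsage): void;
}

// Forward a finished loop's cost and tokens in a single call.
function reportUsageSketch(meter: CostMeterSketch, result: UsageTotals): void {
  meter.record(result.costUsd, { ...result.tokenUsage });
}
```

Typing the meter structurally mirrors the stated goal of keeping loops free of an agent-eval import: the caller passes whatever meter it has, as long as it matches the expected method shape.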
Two additions to the loops / improvement substrate. Stacked on feat/loop-token-usage-for-profile-matrix because optimizePrompt needs the agent-eval ^0.61 bump that lives on that branch (gepaDriver / runImprovementLoop); it does not compile against main's ^0.54. Retarget to main once the base lands.

1. Dynamic loop driver: agent-authored topology (662cd4e)

Third example driver beside refine and fanout-vote, built on the existing Driver seam with zero kernel changes. Where the other two encode a fixed shape as a pure function of history, createDynamicDriver delegates the per-round shape to an injected TopologyPlanner that emits one TopologyMove (refine | fanout | stop) per round.

- createDynamicDriver: maps moves onto plan/decide, enforces the iteration + fanout caps, and fails loud (PlannerError) on a malformed move. The planner is invoked once per round in plan(); decide() reads the cached move so an LLM planner is never double-called. 'done' is already a kernel-terminal decision.
- createSandboxPlanner: wires the planner to a sandbox profile (any harness); decodes the move from a JSON envelope (structured result event or fenced block).
- summarizeHistory: bounded, planner-friendly view of iteration history.

Topology is orthogonal to harness: the planner never names a backend; the kernel's agentRuns round-robin decides which harness runs a branch, so one driver spans claude-code/codex/opencode/pi (incl. fanning one round across several).
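Decoding a move from a JSON envelope that arrives either as a structured result or inside a fenced block, as described for createSandboxPlanner, might look roughly like the sketch below. The function name `decodeEnvelopeSketch` and its two-argument shape are assumptions for illustration; the source does not show the real decoder, and this sketch does not model the n-shorthand or fenced-delta cases mentioned in the tests.

```typescript
// Three backticks, built at runtime so this example can itself sit inside
// a fenced block without closing it early.
const FENCE = "`".repeat(3);

// Matches the first fenced block, with an optional "json" language tag,
// and captures its body.
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

// Hypothetical decoder: prefer a structured result when one is present,
// otherwise fall back to the first fenced block in the streamed text.
function decodeEnvelopeSketch(structured: unknown, text: string): unknown {
  if (structured !== undefined && structured !== null) return structured;

  const match = FENCED_JSON.exec(text);
  if (match === null) {
    throw new Error("no JSON envelope found in planner output");
  }
  try {
    return JSON.parse(match[1] ?? "");
  } catch {
    // Fail loud: a malformed envelope is not silently treated as a stop.
    throw new Error("planner envelope is not valid JSON");
  }
}
```

The decoded value is still untrusted at this point; in the design described above it would then be validated as a move, with a malformed move surfacing as a PlannerError.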
2. optimizePrompt: identity-gated prompt optimization (1050b25)

The TEXT-surface entry point onto agent-eval's runImprovementLoop, sibling to the existing improvementDriver (code/worktree path); it extends, does not fork. Defaults the driver to agent-eval's gepaDriver and the gate to heldOutGate; runtime-agnostic via a single runWithPrompt seam.

Identity-gated by construction: the loop runs evals, collects per-scenario signal, proposes candidates, and the held-out gate compares candidate vs baseline. result.prompt is the baseline UNLESS the gate decided 'ship', so registering a prompt for optimization can never regress it; it only improves when held-out data earns it. Fails loud on misconfig and on a non-string CodeSurface.

Tests
- tests/loops/dynamic.test.ts: 11 tests through the real kernel (sandbox stubbed at the process boundary): adaptive refine→refine→fanout→stop, scripted trajectory across two harnesses, maxIterations cap, maxFanout clamp, empty-fanout + unknown-kind PlannerError, createSandboxPlanner end-to-end + n-shorthand + fenced-delta parse + decodeTask rejection.
- tests/optimize-prompt.test.ts: 4 tests through the real runImprovementLoop, zero LLM (deterministic driver + judge + runner, in-memory storage): identity holds when no candidate beats baseline on holdout, promotes + returns the improved prompt + rationale on a real win, fail-loud on misconfig + empty holdout.
Full suite 398/398, tsc + biome clean. Both modules @experimental.

Follow-ups: #825 (wire the dynamic driver into a real consumer: skeletal-os composer #555 / research-loop #294), #826 (adopt optimizePrompt on real prompt surfaces, starting with the #294 research-loop which already has the infra).