feat(loops+improvement): dynamic loop driver + identity-gated optimizePrompt by tangletools · Pull Request #75 · tangle-network/agent-runtime

tangletools · 2026-05-31T00:57:59Z

Two additions to the loops / improvement substrate. Stacked on
feat/loop-token-usage-for-profile-matrix because optimizePrompt needs the
agent-eval ^0.61 bump that lives on that branch (gepaDriver / runImprovementLoop);
it does not compile against main's ^0.54. Retarget to main once the base lands.

1. Dynamic loop driver — agent-authored topology (`662cd4e`)

Third example driver beside refine and fanout-vote, built on the existing
Driver seam with zero kernel changes. Where the other two encode a fixed
shape as a pure function of history, createDynamicDriver delegates the per-round
shape to an injected TopologyPlanner that emits one TopologyMove
(refine | fanout | stop) per round.
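
A minimal sketch of that delegation, under stated assumptions: the `TopologyMove` fields (including `n`), the `Caps` shape, the branch-count return value of `plan()`, and the synchronous planner are all hypothetical simplifications for illustration, not the actual agent-runtime types.

```typescript
// Hypothetical shapes for illustration; the real TopologyMove and Driver types
// in agent-runtime may differ. The planner is synchronous here only for brevity.
type TopologyMove =
  | { kind: "refine" }
  | { kind: "fanout"; n: number }
  | { kind: "stop" };

type TopologyPlanner = (history: readonly string[]) => unknown;

class PlannerError extends Error {}

interface Caps {
  maxIterations: number;
  maxFanout: number;
}

// Validate an untrusted planner result; throw on anything malformed.
function validateMove(raw: unknown, caps: Caps): TopologyMove {
  if (typeof raw !== "object" || raw === null) {
    throw new PlannerError("move must be an object");
  }
  const { kind, n } = raw as { kind?: unknown; n?: unknown };
  if (kind === "refine") return { kind: "refine" };
  if (kind === "stop") return { kind: "stop" };
  if (kind === "fanout") {
    if (typeof n !== "number" || !Number.isInteger(n) || n < 1) {
      throw new PlannerError("fanout needs a positive integer n");
    }
    // Clamp the requested width to the configured cap.
    return { kind: "fanout", n: Math.min(n, caps.maxFanout) };
  }
  throw new PlannerError(`unknown move kind: ${String(kind)}`);
}

function createDynamicDriverSketch(planner: TopologyPlanner, caps: Caps) {
  let cached: TopologyMove | undefined;
  return {
    // plan(): consult the planner once and return how many branches to run.
    plan(history: readonly string[]): number {
      cached =
        history.length >= caps.maxIterations
          ? { kind: "stop" } // iteration cap reached: stop without asking
          : validateMove(planner(history), caps);
      if (cached.kind === "stop") return 0;
      return cached.kind === "fanout" ? cached.n : 1;
    },
    // decide(): reuse the cached move, so the planner is never called twice.
    decide(): "continue" | "done" {
      if (cached === undefined) throw new PlannerError("decide() before plan()");
      return cached.kind === "stop" ? "done" : "continue";
    },
  };
}
```

Caching the move between `plan()` and `decide()` is what keeps an expensive planner to one call per round in this sketch; the cap check runs before the planner so a capped loop stops without spending another call.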

- createDynamicDriver: maps moves onto plan/decide, enforces the iteration + fanout caps, fails loud (PlannerError) on a malformed move. Planner invoked once per round in plan(); decide() reads the cached move so an LLM planner is never double-called. 'done' is already a kernel-terminal decision.
- createSandboxPlanner: wires the planner to a sandbox profile (any harness); decodes the move from a JSON envelope (structured result event or fenced block).
- summarizeHistory: bounded, planner-friendly view of iteration history.
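
The fenced-block decoding path can be sketched roughly as follows. This is an illustration only: `decodeMoveEnvelope`, the local `PlannerError` stand-in, and the fallback to parsing the whole text are assumptions, and the real createSandboxPlanner may decode differently (for example from a structured result event).

```typescript
// Stand-in for the runtime's PlannerError; hypothetical for this sketch.
class PlannerError extends Error {}

// Build the fence marker programmatically so this snippet stays readable
// when it is itself embedded in a fenced block.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*\\n([\\s\\S]*?)" + FENCE);

// Hypothetical helper: take the first fenced block from the agent's text
// output, or the whole text if there is no fence, and parse it as JSON.
function decodeMoveEnvelope(text: string): unknown {
  const match = FENCED_BLOCK.exec(text);
  const payload = match?.[1] ?? text;
  try {
    return JSON.parse(payload.trim());
  } catch {
    throw new PlannerError("planner output did not contain a JSON move");
  }
}
```

The decoded value is still untrusted at this point; in this sketch, shape validation (known kind, non-empty fanout) is left to the driver so that a malformed move fails in one place.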
Topology is orthogonal to harness: the planner never names a backend; the
kernel's agentRuns round-robin decides which harness runs a branch, so one
driver spans claude-code/codex/opencode/pi (incl. fanning one round across
several).
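
To make the orthogonality concrete, here is a sketch of round-robin harness assignment. The function name, the `offset` parameter, and the representation of agentRuns as a list of harness names are assumptions; the kernel's actual scheduling may differ.

```typescript
// Hypothetical: the planner only supplies a branch count; the harness for
// each branch comes from cycling through the configured agent runs.
function assignHarnesses(
  agentRuns: readonly string[],
  branchCount: number,
  offset = 0,
): string[] {
  if (agentRuns.length === 0) {
    throw new Error("at least one agent run is required");
  }
  return Array.from(
    { length: branchCount },
    (_, i) => agentRuns[(offset + i) % agentRuns.length] as string,
  );
}
```

With four configured harnesses and a fanout of three, each branch lands on a different harness; with two harnesses, the third branch wraps around to the first.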

2. `optimizePrompt` — identity-gated prompt optimization (`1050b25`)

The TEXT-surface entry point onto agent-eval's runImprovementLoop, sibling to
the existing improvementDriver (code/worktree path) — extends, does not fork.
Defaults the driver to agent-eval's gepaDriver and the gate to heldOutGate;
runtime-agnostic via a single runWithPrompt seam.

Identity-gated by construction: the loop runs evals, collects per-scenario
signal, proposes candidates, and the held-out gate compares candidate vs baseline.
result.prompt is the baseline UNLESS the gate decided 'ship' — so registering
a prompt for optimization can never regress it; it only improves when held-out
data earns it. Fails loud on misconfig and on a non-string CodeSurface.
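
The identity property can be illustrated with a small sketch. Everything here is hypothetical: `pickPrompt`, the `Candidate` and `Score` shapes, the `'hold'` decision name, and the strict mean-score comparison are assumptions, not agent-eval's heldOutGate or the real optimizePrompt signature.

```typescript
// Hypothetical shapes; agent-eval's real gate, candidate, and result types may differ.
type GateDecision = "ship" | "hold"; // 'hold' is an assumed name for "do not promote"

interface Candidate {
  prompt: string;
  rationale: string;
}

// Scores one prompt on one held-out scenario (higher is better).
type Score = (prompt: string, scenario: string) => number;

function meanScore(prompt: string, holdout: readonly string[], score: Score): number {
  return holdout.reduce((sum, s) => sum + score(prompt, s), 0) / holdout.length;
}

function pickPrompt(
  baseline: string,
  candidates: readonly Candidate[],
  holdout: readonly string[],
  score: Score,
): { prompt: string; decision: GateDecision; rationale?: string } {
  // Fail loud on an empty holdout: without it the gate has nothing to compare.
  if (holdout.length === 0) throw new Error("holdout must not be empty");

  // Start from the baseline (identity); only a strict held-out win replaces it.
  let best: { prompt: string; decision: GateDecision; rationale?: string } = {
    prompt: baseline,
    decision: "hold",
  };
  let bestScore = meanScore(baseline, holdout, score);
  for (const c of candidates) {
    const s = meanScore(c.prompt, holdout, score);
    if (s > bestScore) {
      best = { prompt: c.prompt, decision: "ship", rationale: c.rationale };
      bestScore = s;
    }
  }
  return best;
}
```

Because the result starts as the baseline and a tie never ships in this sketch, the returned prompt cannot score worse than the baseline on the held-out set.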

Tests

- tests/loops/dynamic.test.ts: 11 tests through the real kernel (sandbox stubbed at the process boundary): adaptive refine→refine→fanout→stop, scripted trajectory across two harnesses, maxIterations cap, maxFanout clamp, empty-fanout + unknown-kind PlannerError, createSandboxPlanner end-to-end + n-shorthand + fenced-delta parse + decodeTask rejection.
- tests/optimize-prompt.test.ts: 4 tests through the real runImprovementLoop, zero LLM (deterministic driver + judge + runner, in-memory storage): identity holds when no candidate beats baseline on holdout, promotes + returns the improved prompt + rationale on a real win, fail-loud on misconfig + empty holdout.

Full suite 398/398, tsc + biome clean. Both modules @experimental.

Follow-ups: #825 (wire the dynamic driver into a real consumer — skeletal-os
composer #555 / research-loop #294), #826 (adopt optimizePrompt on real prompt
surfaces, starting with the #294 research-loop which already has the infra).

Third example driver alongside refine and fanout-vote, built on the existing Driver seam with zero kernel changes. Where refine/fanout-vote encode a fixed shape as a pure function of history, createDynamicDriver delegates the per-round shape to an injected TopologyPlanner that emits one TopologyMove (refine | fanout | stop) per round.

- createDynamicDriver: maps moves onto plan/decide, enforces the iteration + fanout caps, fails loud (PlannerError) on a malformed move. Planner invoked once per round in plan(); decide() reads the cached move so an LLM planner is never double-called. 'done' is already a kernel-terminal decision, so termination needs no kernel change.
- createSandboxPlanner: wires the planner to a sandbox profile (any harness); streams a prompt carrying the history summary, decodes the move from a JSON envelope (structured result event or fenced block).
- summarizeHistory: bounded, planner-friendly view of iteration history.
- PlannerError added to the error taxonomy (carries 'validation').

Topology is orthogonal to harness: the planner never names a backend; the kernel's agentRuns round-robin decides which harness runs a branch, so one dynamic driver spans claude-code/codex/opencode/pi, including fanning a single round across several at once.

11 tests through the real kernel (sandbox stubbed at the process boundary): adaptive refine→refine→fanout→stop, explicit scripted trajectory across two harnesses, maxIterations cap, maxFanout clamp, empty-fanout + unknown-kind PlannerError, createSandboxPlanner end-to-end + n-shorthand + fenced-delta parse + decodeTask rejection.

…ny text prompt surface

The text-surface entry point onto agent-eval's runImprovementLoop, sibling to improvementDriver (the code/worktree path). Defaults the driver to agent-eval's gepaDriver (reflective text mutator) and the gate to heldOutGate; runtime-agnostic via a single runWithPrompt seam.

Identity-gated by construction: the loop runs evals, collects per-scenario signal, proposes candidates, and the held-out gate compares candidate vs baseline. result.prompt is the baseline (identity) UNLESS the gate decided 'ship', so registering a prompt for optimization can never regress it; it only improves when held-out data earns it.

Generic over the surface's execution (sandbox streamPrompt, runLoop, direct model call): the optimizer never assumes how a prompt runs. Fails loud on misconfig (no driver/reflection, empty scenarios/holdout) and on a non-string CodeSurface (wrong entry point).

4 tests through the real runImprovementLoop, zero LLM (deterministic driver + judge + runner, in-memory storage): identity holds when no candidate beats baseline on holdout (returns the untouched baseline), promotes + returns the improved prompt + rationale when a candidate wins, fail-loud on misconfig and empty holdout.

…ePrompt (#75)

…amic loop driver, optimizePrompt (#76)

* feat(loops): surface aggregated tokenUsage on LoopResult + reportLoopUsage bridge

  runLoop tracked per-call tokensIn/tokensOut (extractLlmCallEvent) but only aggregated costUsd; token counts were dropped before reaching Iteration or LoopResult. A runProfileMatrix/runCampaign dispatch wrapping runLoop could report cost but had no tokens to report, so agent-eval's backend-integrity guard (assertRealBackend, which keys on tokenUsage) would misread a real run as a stub and throw.

  - Iteration + LoopResult gain tokenUsage: { input, output }, summed across every llm_call event (per iteration) and across iterations (LoopResult).
  - reportLoopUsage(cost, result) forwards a finished loop's cost + tokens into a campaign cost meter in one call: the trivial consumption path for the new runProfileMatrix primitive. Typed structurally so loops stay free of an agent-eval import.

  Extends the existing cost-aggregation test to assert token aggregation + reportLoopUsage forwarding. Full suite 381 green.

* chore(deps): bump @tangle-network/agent-eval ^0.54.0 → ^0.61.0

  Consumes the published runProfileMatrix + token-capture release. 7-minor jump verified: typecheck + build + full suite (381) green.

* feat(loops): loopDispatch — first-class runLoop→campaign dispatch adapter

  The seam critique found reportLoopUsage had one consumer (a test) and zero products: wiring runLoop into runProfileMatrix/runCampaign required hand-building ExecCtx, hand-adapting the campaign trace, and remembering to forward usage (forgetting the last yields a {0,0} stub cell). loopDispatch collapses all three into one typed call:

      const dispatch = loopDispatch({ sandboxClient, toLoopOptions })
      await runProfileMatrix({ profiles, scenarios, dispatch, judges, commitSha })

  It builds the ExecCtx, forwards loop.* trace events into the campaign's scoped trace (campaignTraceToLoopEmitter), runs runLoop, reports cost+tokens via reportLoopUsage internally, and returns winner.output.

  loopCampaignDispatch is the runCampaign (no-profile) variant. AgentProfile imported from agent-eval (the eval-harness type ProfileDispatchFn keys on), NOT sandbox's; this closes the name-collision footgun at this call site.

  Tests: returns winner artifact + reports exact usage + forwards trace spans; usage still flows on a validator-failing run (must not read as a stub). Full suite 383 green.

* chore(deps): declare agent-eval as a required peerDependency, not a hard dependency

  Version-discipline fix (boundary critique, VERSIONING 3/10). agent-eval was the lone hard dependency while sandbox + agent-knowledge are already peers. A hard dep lets pnpm install a SECOND, divergent agent-eval tree with an incompatible RunRecord/DefaultVerdict; today only pnpm.overrides prevents it. As a peer (>=0.61.0 <1.0.0, required, not optional), a consumer running a stale or divergent substrate gets a loud unmet-peer warning instead of a silent split tree. agent-eval moves to devDependencies for agent-runtime's own build/test.

  Typecheck + full suite (383) green with the peer layout.

* chore(release): 0.32.0 — loopDispatch adapter + tokenUsage seam + agent-eval peer-dep

* feat(loops+improvement): dynamic loop driver + identity-gated optimizePrompt (#75)

  * feat(loops): dynamic driver — agent-authored loop topology

    Third example driver alongside refine and fanout-vote, built on the existing Driver seam with zero kernel changes. Where refine/fanout-vote encode a fixed shape as a pure function of history, createDynamicDriver delegates the per-round shape to an injected TopologyPlanner that emits one TopologyMove (refine | fanout | stop) per round.

    - createDynamicDriver: maps moves onto plan/decide, enforces the iteration + fanout caps, fails loud (PlannerError) on a malformed move. Planner invoked once per round in plan(); decide() reads the cached move so an LLM planner is never double-called. 'done' is already a kernel-terminal decision, so termination needs no kernel change.
    - createSandboxPlanner: wires the planner to a sandbox profile (any harness); streams a prompt carrying the history summary, decodes the move from a JSON envelope (structured result event or fenced block).
    - summarizeHistory: bounded, planner-friendly view of iteration history.
    - PlannerError added to the error taxonomy (carries 'validation').

    Topology is orthogonal to harness: the planner never names a backend; the kernel's agentRuns round-robin decides which harness runs a branch, so one dynamic driver spans claude-code/codex/opencode/pi, including fanning a single round across several at once.

    11 tests through the real kernel (sandbox stubbed at the process boundary): adaptive refine→refine→fanout→stop, explicit scripted trajectory across two harnesses, maxIterations cap, maxFanout clamp, empty-fanout + unknown-kind PlannerError, createSandboxPlanner end-to-end + n-shorthand + fenced-delta parse + decodeTask rejection.

  * feat(improvement): optimizePrompt — identity-gated optimization for any text prompt surface

    The text-surface entry point onto agent-eval's runImprovementLoop, sibling to improvementDriver (the code/worktree path). Defaults the driver to agent-eval's gepaDriver (reflective text mutator) and the gate to heldOutGate; runtime-agnostic via a single runWithPrompt seam.

    Identity-gated by construction: the loop runs evals, collects per-scenario signal, proposes candidates, and the held-out gate compares candidate vs baseline. result.prompt is the baseline (identity) UNLESS the gate decided 'ship', so registering a prompt for optimization can never regress it; it only improves when held-out data earns it.

    Generic over the surface's execution (sandbox streamPrompt, runLoop, direct model call): the optimizer never assumes how a prompt runs. Fails loud on misconfig (no driver/reflection, empty scenarios/holdout) and on a non-string CodeSurface (wrong entry point).

    4 tests through the real runImprovementLoop, zero LLM (deterministic driver + judge + runner, in-memory storage): identity holds when no candidate beats baseline on holdout (returns the untouched baseline), promotes + returns the improved prompt + rationale when a candidate wins, fail-loud on misconfig and empty holdout.

  ---------

  Co-authored-by: Drew Stone <drewstone329@gmail.com>

* chore(release): 0.33.0 — dynamic loop driver + identity-gated optimizePrompt (#75)

---------

Co-authored-by: Drew Stone <drewstone329@gmail.com>

drewstone added 2 commits May 30, 2026 18:35

drewstone merged commit 39ccd42 into feat/loop-token-usage-for-profile-matrix May 31, 2026

tangletools pushed a commit that referenced this pull request May 31, 2026

chore(release): 0.33.0 — dynamic loop driver + identity-gated optimiz…

d8c237e

…ePrompt (#75)

tangletools deleted the feat/dynamic-loop-driver branch May 31, 2026 01:06

tangletools mentioned this pull request May 31, 2026

Land 0.32.0 + 0.33.0 release line: loopDispatch, tokenUsage seam, dynamic loop driver, optimizePrompt #76

Merged

drewstone merged 2 commits into feat/loop-token-usage-for-profile-matrix from feat/dynamic-loop-driver
