feat(loops+improvement): dynamic loop driver + identity-gated optimizePrompt #75

Merged
drewstone merged 2 commits into feat/loop-token-usage-for-profile-matrix from feat/dynamic-loop-driver
May 31, 2026

@tangletools
Contributor

Two additions to the loops / improvement substrate. Stacked on
feat/loop-token-usage-for-profile-matrix because optimizePrompt needs the
agent-eval ^0.61 bump that lives on that branch (gepaDriver / runImprovementLoop);
it does not compile against main's ^0.54. Retarget to main once the base lands.

1. Dynamic loop driver — agent-authored topology (662cd4e)

Third example driver beside refine and fanout-vote, built on the existing
Driver seam with zero kernel changes. Where the other two encode a fixed
shape as a pure function of history, createDynamicDriver delegates the per-round
shape to an injected TopologyPlanner that emits one TopologyMove
(refine | fanout | stop) per round.

  • createDynamicDriver — maps moves onto plan/decide, enforces the iteration
    + fanout caps, fails loud (PlannerError) on a malformed move. Planner invoked
    once per round in plan(); decide() reads the cached move so an LLM planner is
    never double-called. 'done' is already a kernel-terminal decision.
  • createSandboxPlanner — wires the planner to a sandbox profile (any harness);
    decodes the move from a JSON envelope (structured result event or fenced block).
  • summarizeHistory — bounded, planner-friendly view of iteration history.
  • Topology is orthogonal to harness: the planner never names a backend; the
    kernel's agentRuns round-robin decides which harness runs a branch, so one
    driver spans claude-code/codex/opencode/pi (incl. fanning one round across
    several).
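
A minimal sketch of the plan/decide caching described above, not the shipped createDynamicDriver. The type shapes, the synchronous planner, the `createDynamicDriverSketch` name, and the choice to force a stop once the iteration cap is exceeded are all assumptions made for illustration; only the move kinds, the caps, and PlannerError come from this PR.

```typescript
// Hypothetical shapes; the real TopologyMove and Driver types may differ.
type TopologyMove =
  | { kind: "refine" }
  | { kind: "fanout"; n: number }
  | { kind: "stop" };

class PlannerError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PlannerError";
  }
}

// Returns unknown: a planner-produced move is untrusted until validated.
// Synchronous here for brevity; an LLM-backed planner would be async.
type Planner = (history: readonly string[]) => unknown;

function validateMove(raw: unknown): TopologyMove {
  if (typeof raw !== "object" || raw === null) {
    throw new PlannerError("move must be an object");
  }
  const candidate = raw as { kind?: unknown; n?: unknown };
  if (candidate.kind === "refine") return { kind: "refine" };
  if (candidate.kind === "stop") return { kind: "stop" };
  if (candidate.kind === "fanout") {
    const n = candidate.n;
    if (typeof n !== "number" || !Number.isInteger(n) || n < 1) {
      throw new PlannerError("fanout requires a positive integer n");
    }
    return { kind: "fanout", n };
  }
  throw new PlannerError(`unknown move kind: ${String(candidate.kind)}`);
}

function createDynamicDriverSketch(
  planner: Planner,
  caps: { maxIterations: number; maxFanout: number },
) {
  let cached: TopologyMove | undefined;
  let round = 0;
  return {
    // Returns the number of branches to run this round (0 means stop).
    plan(history: readonly string[]): number {
      round += 1;
      // Assumed cap behaviour: past maxIterations, stop without asking the planner.
      const move: TopologyMove =
        round > caps.maxIterations ? { kind: "stop" } : validateMove(planner(history));
      cached = move; // the single planner call for this round is remembered here
      if (move.kind === "fanout") return Math.min(move.n, caps.maxFanout);
      return move.kind === "refine" ? 1 : 0;
    },
    // Reads the cached move; never invokes the planner again.
    decide(): "continue" | "done" {
      const move = cached;
      if (move === undefined) throw new PlannerError("decide() called before plan()");
      return move.kind === "stop" ? "done" : "continue";
    },
  };
}
```

Because decide() only reads the cached move, calling it several times per round costs nothing extra, which is the property that matters when the planner is an LLM call.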

2. optimizePrompt — identity-gated prompt optimization (1050b25)

The TEXT-surface entry point onto agent-eval's runImprovementLoop, sibling to
the existing improvementDriver (code/worktree path) — extends, does not fork.
Defaults the driver to agent-eval's gepaDriver and the gate to heldOutGate;
runtime-agnostic via a single runWithPrompt seam.

Identity-gated by construction: the loop runs evals, collects per-scenario
signal, proposes candidates, and the held-out gate compares candidate vs baseline.
result.prompt is the baseline UNLESS the gate decided 'ship' — so registering
a prompt for optimization can never regress it; it only improves when held-out
data earns it. Fails loud on misconfig and on a non-string CodeSurface.
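
The identity property can be illustrated with a toy gate. This is a sketch under stated assumptions, not agent-eval's heldOutGate: the mean-score comparison, the strict inequality, the 'hold' decision name, and every identifier below are hypothetical. Only the rule "return the baseline unless the gate decided 'ship'" is taken from the description above.

```typescript
// 'ship' is named in this PR; 'hold' is a placeholder for any non-ship decision.
type GateDecision = "ship" | "hold";

interface Candidate {
  prompt: string;
  rationale: string;
}

interface OptimizeResultSketch {
  prompt: string;
  promoted: boolean;
  rationale?: string;
}

function mean(scores: readonly number[]): number {
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Toy held-out comparison: ship only on a strict improvement of the mean score.
function toyHeldOutGate(
  baselineScores: readonly number[],
  candidateScores: readonly number[],
): GateDecision {
  if (baselineScores.length === 0 || candidateScores.length === 0) {
    throw new Error("holdout must not be empty"); // fail loud on misconfiguration
  }
  return mean(candidateScores) > mean(baselineScores) ? "ship" : "hold";
}

// Identity by construction: any decision other than 'ship' returns the baseline untouched.
function resolvePrompt(
  baseline: string,
  candidate: Candidate,
  decision: GateDecision,
): OptimizeResultSketch {
  if (decision === "ship") {
    return { prompt: candidate.prompt, promoted: true, rationale: candidate.rationale };
  }
  return { prompt: baseline, promoted: false };
}
```

With this shape, a tie or a loss on the holdout leaves the registered prompt exactly as it was, so the worst outcome of an optimization run is a no-op.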

Tests

  • tests/loops/dynamic.test.ts — 11 tests through the real kernel (sandbox
    stubbed at the process boundary): adaptive refine→refine→fanout→stop, scripted
    trajectory across two harnesses, maxIterations cap, maxFanout clamp, empty-fanout
    + unknown-kind PlannerError, createSandboxPlanner end-to-end + n-shorthand +
    fenced-delta parse + decodeTask rejection.
  • tests/optimize-prompt.test.ts — 4 tests through the real runImprovementLoop,
    zero LLM (deterministic driver + judge + runner, in-memory storage): identity
    holds when no candidate beats baseline on holdout, promotes + returns the improved
    prompt + rationale on a real win, fail-loud on misconfig + empty holdout.

Full suite 398/398, tsc + biome clean. Both modules @experimental.

Follow-ups: #825 (wire the dynamic driver into a real consumer — skeletal-os
composer #555 / research-loop #294), #826 (adopt optimizePrompt on real prompt
surfaces, starting with the #294 research-loop which already has the infra).

drewstone added 2 commits May 30, 2026 18:35
Third example driver alongside refine and fanout-vote, built on the
existing Driver seam with zero kernel changes. Where refine/fanout-vote
encode a fixed shape as a pure function of history, createDynamicDriver
delegates the per-round shape to an injected TopologyPlanner that emits
one TopologyMove (refine | fanout | stop) per round.

- createDynamicDriver: maps moves onto plan/decide, enforces the
  iteration + fanout caps, fails loud (PlannerError) on a malformed move.
  Planner invoked once per round in plan(); decide() reads the cached
  move so an LLM planner is never double-called. 'done' is already a
  kernel-terminal decision, so termination needs no kernel change.
- createSandboxPlanner: wires the planner to a sandbox profile (any
  harness) — streams a prompt carrying the history summary, decodes the
  move from a JSON envelope (structured result event or fenced block).
- summarizeHistory: bounded, planner-friendly view of iteration history.
- PlannerError added to the error taxonomy (carries 'validation').
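
One way the "structured result event or fenced block" decoding could look, as a rough sketch only. The `PlannerOutput` shape, the `decodeEnvelope` name, and the preference order (structured first, fenced block second) are assumptions; the real createSandboxPlanner may decode differently.

```typescript
// Built at runtime so this snippet never contains a literal fence of its own.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

// Hypothetical planner output: an optional structured payload plus the raw text.
interface PlannerOutput {
  structured?: unknown;
  text: string;
}

// Returns the raw envelope; validating it into a move is a separate step.
function decodeEnvelope(output: PlannerOutput): unknown {
  // Assumed preference: a structured result wins over anything embedded in text.
  if (output.structured !== undefined) return output.structured;
  const body = FENCED_BLOCK.exec(output.text)?.[1];
  if (body === undefined) {
    throw new Error("no JSON envelope found in planner output");
  }
  return JSON.parse(body); // throws SyntaxError on malformed JSON
}
```

Keeping decoding separate from validation means a malformed envelope and a well-formed but invalid move can be reported as distinct failures.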

Topology is orthogonal to harness: the planner never names a backend;
the kernel's agentRuns round-robin decides which harness runs a branch,
so one dynamic driver spans claude-code/codex/opencode/pi, including
fanning a single round across several at once.
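
To make the round-robin concrete, here is a small sketch of how branches in one round might be spread over harnesses. The `assignHarnesses` helper is hypothetical, and whether the kernel restarts the rotation each round or continues it across rounds is not stated in this PR; this version restarts at index 0.

```typescript
// Assigns each branch of a round to a harness by cycling through agentRuns.
function assignHarnesses(agentRuns: readonly string[], branches: number): string[] {
  if (agentRuns.length === 0) {
    throw new Error("agentRuns must not be empty");
  }
  const assigned: string[] = [];
  for (let i = 0; i < branches; i++) {
    const harness = agentRuns[i % agentRuns.length];
    if (harness !== undefined) assigned.push(harness);
  }
  return assigned;
}

// A fanout of 3 over two harnesses wraps back to the first one.
const example = assignHarnesses(["claude-code", "codex"], 3);
```

The planner only chooses how many branches to run; the harness list is supplied separately, which is what keeps topology independent of backend.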

11 tests through the real kernel (sandbox stubbed at the process
boundary): adaptive refine→refine→fanout→stop, explicit scripted
trajectory across two harnesses, maxIterations cap, maxFanout clamp,
empty-fanout + unknown-kind PlannerError, createSandboxPlanner
end-to-end + n-shorthand + fenced-delta parse + decodeTask rejection.
…ny text prompt surface

The text-surface entry point onto agent-eval's runImprovementLoop, sibling
to improvementDriver (the code/worktree path). Defaults the driver to
agent-eval's gepaDriver (reflective text mutator) and the gate to
heldOutGate; runtime-agnostic via a single runWithPrompt seam.

Identity-gated by construction: the loop runs evals, collects per-scenario
signal, proposes candidates, and the held-out gate compares candidate vs
baseline. result.prompt is the baseline (identity) UNLESS the gate decided
'ship' — so registering a prompt for optimization can never regress it; it
only improves when held-out data earns it.

Generic over the surface's execution (sandbox streamPrompt, runLoop, direct
model call) — the optimizer never assumes how a prompt runs. Fails loud on
misconfig (no driver/reflection, empty scenarios/holdout) and on a non-string
CodeSurface (wrong entry point).

4 tests through the real runImprovementLoop, zero LLM (deterministic driver +
judge + runner, in-memory storage): identity holds when no candidate beats
baseline on holdout (returns the untouched baseline), promotes + returns the
improved prompt + rationale when a candidate wins, fail-loud on misconfig and
empty holdout.
@drewstone drewstone merged commit 39ccd42 into feat/loop-token-usage-for-profile-matrix May 31, 2026
@tangletools tangletools deleted the feat/dynamic-loop-driver branch May 31, 2026 01:06
drewstone added a commit that referenced this pull request May 31, 2026
…amic loop driver, optimizePrompt (#76)

* feat(loops): surface aggregated tokenUsage on LoopResult + reportLoopUsage bridge

runLoop tracked per-call tokensIn/tokensOut (extractLlmCallEvent) but only
aggregated costUsd — token counts were dropped before reaching Iteration or
LoopResult. A runProfileMatrix/runCampaign dispatch wrapping runLoop could
report cost but had no tokens to report, so agent-eval's backend-integrity
guard (assertRealBackend, which keys on tokenUsage) would misread a real run
as a stub and throw.

- Iteration + LoopResult gain tokenUsage: { input, output }, summed across
  every llm_call event (per iteration) and across iterations (LoopResult).
- reportLoopUsage(cost, result) forwards a finished loop's cost + tokens into
  a campaign cost meter in one call — the trivial consumption path for the new
  runProfileMatrix primitive. Typed structurally so loops stay free of an
  agent-eval import.

Extends the existing cost-aggregation test to assert token aggregation +
reportLoopUsage forwarding. Full suite 381 green.
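
The two-level aggregation can be sketched as follows. The event shape, the `CostMeterLike` interface with its `add` method, and the function names are assumptions for illustration; only the `{ input, output }` shape, the tokensIn/tokensOut fields, and the per-iteration then per-loop summing come from the commit message.

```typescript
interface TokenUsage {
  input: number;
  output: number;
}

// Hypothetical event shape; only llm_call events carry token counts.
type LoopEvent =
  | { type: "llm_call"; tokensIn: number; tokensOut: number; costUsd: number }
  | { type: "other" };

// Per-iteration total: sum over every llm_call event in that iteration.
function sumIterationUsage(events: readonly LoopEvent[]): TokenUsage {
  const usage: TokenUsage = { input: 0, output: 0 };
  for (const event of events) {
    if (event.type === "llm_call") {
      usage.input += event.tokensIn;
      usage.output += event.tokensOut;
    }
  }
  return usage;
}

// Loop total: sum of the per-iteration totals.
function sumLoopUsage(iterations: readonly TokenUsage[]): TokenUsage {
  return iterations.reduce(
    (total, it) => ({ input: total.input + it.input, output: total.output + it.output }),
    { input: 0, output: 0 },
  );
}

// Structural typing: any object with this method works, so no agent-eval import is needed.
interface CostMeterLike {
  add(costUsd: number, usage: TokenUsage): void;
}

function reportLoopUsageSketch(
  meter: CostMeterLike,
  result: { costUsd: number; tokenUsage: TokenUsage },
): void {
  meter.add(result.costUsd, result.tokenUsage);
}
```

Accepting a structural meter rather than a concrete class is what lets a loops module report usage without depending on the eval package.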

* chore(deps): bump @tangle-network/agent-eval ^0.54.0 → ^0.61.0

Consumes the published runProfileMatrix + token-capture release. 7-minor
jump verified: typecheck + build + full suite (381) green.

* feat(loops): loopDispatch — first-class runLoop→campaign dispatch adapter

The seam critique found reportLoopUsage had one consumer (a test) and zero
products: wiring runLoop into runProfileMatrix/runCampaign required hand-building
ExecCtx, hand-adapting the campaign trace, and remembering to forward usage
(forgetting the last yields a {0,0} stub cell). loopDispatch collapses all three
into one typed call:

  const dispatch = loopDispatch({ sandboxClient, toLoopOptions })
  await runProfileMatrix({ profiles, scenarios, dispatch, judges, commitSha })

It builds the ExecCtx, forwards loop.* trace events into the campaign's scoped
trace (campaignTraceToLoopEmitter), runs runLoop, reports cost+tokens via
reportLoopUsage internally, and returns winner.output. loopCampaignDispatch is
the runCampaign (no-profile) variant. AgentProfile imported from agent-eval
(the eval-harness type ProfileDispatchFn keys on), NOT sandbox's — closes the
name-collision footgun at this call site.

Tests: returns winner artifact + reports exact usage + forwards trace spans;
usage still flows on a validator-failing run (must not read as a stub).
Full suite 383 green.

* chore(deps): declare agent-eval as a required peerDependency, not a hard dependency

Version-discipline fix (boundary critique, VERSIONING 3/10). agent-eval was the
lone hard dependency while sandbox + agent-knowledge are already peers. A hard
dep lets pnpm install a SECOND, divergent agent-eval tree with an incompatible
RunRecord/DefaultVerdict; today only pnpm.overrides prevents it. As a peer
(>=0.61.0 <1.0.0, required — not optional), a consumer running a stale or
divergent substrate gets a loud unmet-peer warning instead of a silent split
tree. agent-eval moves to devDependencies for agent-runtime's own build/test.
Typecheck + full suite (383) green with the peer layout.

* chore(release): 0.32.0 — loopDispatch adapter + tokenUsage seam + agent-eval peer-dep

* feat(loops+improvement): dynamic loop driver + identity-gated optimizePrompt (#75)

* feat(loops): dynamic driver — agent-authored loop topology

* feat(improvement): optimizePrompt — identity-gated optimization for any text prompt surface

---------

Co-authored-by: Drew Stone <drewstone329@gmail.com>

* chore(release): 0.33.0 — dynamic loop driver + identity-gated optimizePrompt (#75)

---------

Co-authored-by: Drew Stone <drewstone329@gmail.com>