
fix(runtime): use agent-centric hook target names#163

Merged
drewstone merged 1 commit into main from feat/agent-hook-target-names
Jun 5, 2026



drewstone commented on Jun 5, 2026

Summary

  • rename runtime hook targets from implementation-shaped names to agent-centric names
  • use agent.run, agent.turn, agent.tool_call, agent.plan, and agent.decision
  • keep producer details in event metadata for debugging
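
As a rough illustration of the rename, the sketch below models the five target names from the summary and keeps the producer in event metadata. The `HookEvent` shape and `makeHookEvent` helper are hypothetical, and `toolLoop` is used only as an example producer; the PR's actual types are not shown on this page.

```typescript
// Hypothetical shapes for illustration only; not the runtime's real types.
type HookTarget =
  | "agent.run"
  | "agent.turn"
  | "agent.tool_call"
  | "agent.plan"
  | "agent.decision";

interface HookEvent {
  target: HookTarget;
  // Producer details stay available for debugging without shaping the target name.
  metadata: { producer: string; [key: string]: unknown };
}

// Build an event named after what the agent did, not the code path that emitted it.
function makeHookEvent(
  target: HookTarget,
  producer: string,
  extra: Record<string, unknown> = {},
): HookEvent {
  return { target, metadata: { ...extra, producer } };
}

const exampleEvent = makeHookEvent("agent.tool_call", "toolLoop", { tool: "search" });
```

A subscriber can then filter on `target` alone and only look at `metadata.producer` when debugging.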

Verification

  • pnpm lint
  • pnpm typecheck
  • pnpm test
  • pnpm build

drewstone merged commit 7c2507b into main on Jun 5, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 5, 2026
… tree, observable + deep-cleaned (#165)

* feat(bench): wire AppWorld as a router-worker GEPA target + harden the refine loop

Adds AppWorld to gepa-refine as a no-sandbox router worker (the worker writes the
Python solution via the router; the local venv AppWorld engine executes + scores
it — objective passes/num_tests, no LLM judge, never touches the sandbox SSE
path). Plus the weak-baseline + headroom setup the empirical-proof pursuit needed,
and three robustness fixes surfaced by live runs:

- AppWorld adapter wired into ADAPTERS + scoreBased (graded fraction); a
  router-worker branch in runWithPrompt (refine the solution k rounds, extract
  the last fenced python block); AppWorld-domain reflectionTarget +
  reflectionPrimitives so the analyst diagnoses in AppWorld terms.
- WEAK_APPWORLD_DIRECTIVE (deliberately bare) + BASELINE_DIRECTIVE env override —
  the standard GEPA "optimize a weak starting prompt" setup.
- Reflect-model default gpt-5 -> gpt-4o: gpt-5 takes 40-60s/call and trips the
  client timeout (the call aborts); gpt-4o is ~2s and reliable on a latency-critical inner loop.
- analyzeGeneration: retry transient LLM failures (4x exp backoff) instead of
  silently starving a generation of findings.
- Deterministic difficulty-balanced train/holdout split (FNV hash of task id) —
  benchmark task lists arrive ordered, so a raw first/second-half slice can hand
  TRAIN only easy tasks (0 failures to learn from) and HOLDOUT only hard ones.
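
The hash-based split in the last bullet can be sketched as follows. This is a minimal version under assumptions: it uses 32-bit FNV-1a over the id's UTF-16 code units (identical to byte-wise FNV-1a for ASCII ids), and the `splitTrainHoldout` name and 50/50 default are hypothetical. The commit does not say which FNV variant or ratio it uses.

```typescript
// 32-bit FNV-1a; for ASCII task ids this matches the byte-wise definition.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  return hash >>> 0;
}

// Assign each task by its own hash, so membership never depends on list order.
function splitTrainHoldout(
  taskIds: string[],
  holdoutFraction = 0.5,
): { train: string[]; holdout: string[] } {
  const train: string[] = [];
  const holdout: string[] = [];
  for (const id of taskIds) {
    const bucket = fnv1a32(id) / 0x100000000; // uniform-ish value in [0, 1)
    (bucket < holdoutFraction ? holdout : train).push(id);
  }
  return { train, holdout };
}
```

Because a task's bucket depends only on its id, an easy-to-hard ordering of the input list cannot leak into the split, and re-running with a reordered list gives the same partition.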

Verified: clean end-to-end AppWorld run, 0 stubs/aborts, EYES->HANDS findings
fire. (Substrate timeout hardening is agent-eval#212.)

* feat(bench): multi-turn REPL AppWorld agent (execution feedback) — the scaffold axis

The blind one-shot AppWorld worker can't act on directive guidance ("inspect
api_docs, authenticate, paginate, verify") because it never sees API output — so
a directive is inert and prompt-optimization measures ~0 lift. This adds the
missing scaffold: a multi-turn REPL agent.

- appworld_driver.py `react` subcommand: the agent writes ONE python block per
  turn, the engine EXECUTES it in a PERSISTENT AppWorld world, the output (or
  error traceback) is fed back, and it iterates until complete_task() or
  max-turns; scored in-process by AppWorld's own evaluator. Returns score + token
  usage + a compact transcript. Router calls go through the OpenAI-compatible
  router with retry on transient/429/5xx. Validated standalone: passes 1/2 (0.5)
  in 8 turns vs ~0 for the blind one-shot.
- gepa-refine: AppWorld worker defaults to the REPL agent (APPWORLD_REACT=0 keeps
  the blind one-shot as a control arm). The episode is scored in-process so the
  artifact carries the score (judge passes it through — a multi-turn episode
  can't be re-executed from one artifact), and the transcript rides along so the
  failure-analyst reflection diagnoses WHAT went wrong — giving the directive real
  signal to optimize against.

This is the genuinely-informative test: optimize the directive for a worker that
can actually use it. Powered run in flight.

* feat(bench): agentic EOPS rollout + the score-vs-n best-of-n curve

The gate's leaf was a one-shot completion, which floors at 0 on an agentic tool-use
domain (the agent must observe a tool result before it can act on it). gym-agent.ts
runs a REAL rollout: a tool-calling loop against the gym's MCP tools — seed an
isolated DB, loop completion->tool_calls->execute->feed-results->repeat with carried
state, then score the final DB with the deterministic SQL verifier. This is the POMDP
rollout (partial observations = tool results, actions = tool calls), not a blind plan.

gym-sweep.mts plots the missing measurement: best-of-m under the verifier over m=1..N
independent rollouts, the exact subset-expectation estimator (one batch of N rollouts
yields the whole curve). Bad-data robustness: a verifier whose SQL the server rejects
(a malformed query in the task) is excluded from the denominator, never charged to the
agent; an all-malformed task is skipped; a failed rollout drops from the sample.
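
One standard way to get the whole best-of-m curve from a single batch is sketched below: the expected maximum over a uniformly random size-m subset of the N observed scores, computed in closed form from the sorted scores. The function names are hypothetical, and this is an assumption about what "subset-expectation estimator" means here, not a copy of gym-sweep.mts.

```typescript
// Binomial coefficient C(n, k); every intermediate value is an exact integer.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

// curve[m - 1] = E[max score of a random m-subset of the N rollouts].
// The score with ascending rank r is the subset maximum in C(r - 1, m - 1)
// of the C(N, m) equally likely subsets.
function bestOfCurve(scores: number[]): number[] {
  const sorted = [...scores].sort((a, b) => a - b);
  const n = sorted.length;
  const curve: number[] = [];
  for (let m = 1; m <= n; m++) {
    let weighted = 0;
    sorted.forEach((score, index) => {
      weighted += score * choose(index, m - 1); // index = rank - 1
    });
    curve.push(weighted / choose(n, m));
  }
  return curve;
}
```

The first entry is the plain mean (best-of-1) and the last is the batch maximum (best-of-N), so the curve is non-decreasing in m.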

First curve (EOPS itsm, gpt-4.1, 6 tasks x 8 rollouts): best-of-1 13.5% -> best-of-8
56.7% graded score (+43.2pp), pass@8 16.7%. The canonical inference-time-scaling result,
reproduced through real agentic rollouts + a real deployable verifier. This is the
PARALLEL best-of-n baseline (brute-force compute + a verifier) — the baseline the
sequential driver-steered loop must beat at equal compute.

* feat(bench): general agentic primitive (depth + breadth over a shared artifact) through the keystone

Collapses the EOPS spike into the real primitive. The domain now lives behind ONE seam,
AgenticSurface (open an artifact, list tools, call a tool, score it, close it); EnterpriseOps
is one implementation (agentic-eops.ts: seed a gym DB, MCP tools, SQL verifier). Commit0 /
AppWorld / terminal-bench plug in by implementing the same interface — the drivers never change.
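
The five-operation seam described above might look roughly like the sketch below. The method names, the synchronous signatures, and the toy counter surface are all assumptions for illustration; a real surface that talks to a gym over HTTP would presumably return promises.

```typescript
interface ToolSpec {
  name: string;
  description: string;
}

// Hypothetical shape of the seam: open, list tools, call a tool, score, close.
interface AgenticSurface<Handle> {
  open(taskId: string): Handle;
  listTools(handle: Handle): ToolSpec[];
  callTool(handle: Handle, name: string, args: Record<string, unknown>): string;
  score(handle: Handle): number;
  close(handle: Handle): void;
}

// Toy implementation: the artifact is a counter that should reach a target.
interface CounterArtifact {
  value: number;
  target: number;
  closed: boolean;
}

const counterSurface: AgenticSurface<CounterArtifact> = {
  open: () => ({ value: 0, target: 3, closed: false }),
  listTools: () => [{ name: "increment", description: "Add 1 to the counter" }],
  callTool: (handle, name) => {
    if (name !== "increment") return `error: unknown tool ${name}`;
    handle.value += 1;
    return `value=${handle.value}`;
  },
  score: (handle) => Math.min(handle.value, handle.target) / handle.target,
  close: (handle) => {
    handle.closed = true;
  },
};

// A domain-blind shot: it only ever touches the interface.
function runShot<H>(surface: AgenticSurface<H>, taskId: string, calls: number): number {
  const handle = surface.open(taskId);
  try {
    const tool = surface.listTools(handle)[0];
    if (tool === undefined) return surface.score(handle);
    for (let i = 0; i < calls; i++) surface.callTool(handle, tool.name, {});
    return surface.score(handle);
  } finally {
    surface.close(handle);
  }
}
```

Swapping in another domain means writing another `AgenticSurface` value; `runShot` stays unchanged, which is the point of the seam.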

The drivers are domain-blind Agents run through createSupervisor().run (Agent.act over a
conserved-budget Scope, spawning leaf shots via scope.spawn):
- DEPTH: ONE persistent artifact carried across shots. Each shot the agent works the tool loop;
  between shots a trace-analyst (selector != judge: reads the trajectory, never the score) steers
  the resumed session toward what's unfinished. shot n stands on shot n-1's artifact state +
  history. Reports the score-per-shot progress curve.
- BREADTH: K independent rollouts (each own artifact), the deployable verifier picks the best.
The leaf (one shot over a handle) is resolved per-spawn from a surface-closed registry — the open
LeafExecutor seam, not bespoke per-benchmark glue. Equal-k by the conserved pool.

Verified live through the Supervisor (EOPS itsm, gpt-4.1): depth + breadth both run end-to-end;
depth progress curve flat at 50% on a short itsm task (agent capability plateau, not a mechanism
break — the shots+analyst loop runs). EOPS is short-horizon; the long-horizon depth curve is
Commit0 (next surface impl). Exports the EOPS HTTP primitives from gym-agent for reuse.

* feat(bench): the operator atom — driver+analyst+IC fused, leads over the shared artifact

Replaces the hardcoded relay with a real operator: ONE agent whose brain reads the
trace (firewalled — never the score), judges, and each round picks the single best
move over the shared artifact — steer (delegate a worker shot), work (do one decisive
turn itself), branch (fresh line, adopt if better), or done. The driver, the analyst,
and an independent-contributor are the same self-similar atom now; how it leans is the
AgentProfile, not a separate type.

Also adds the mix driver (depth main line + branch-when-stalled, adopt-if-better, the
MCTS-PW shape the traces motivated), a trace hook (AgenticTraceEvent) surfacing every
steer + tool call + score + finding so the loop is readable, and MODE on the trace
runner.

Verified live + legible (EOPS itsm, gpt-4.1, operator on task 15): the operator
sequenced recon -> retire+migrate -> ownership-check -> incidents -> done, authoring
each instruction with a rationale, choosing steer/work/done with judgment — readable
end to end. Honest calibration gaps the trace exposed: it declared `done` at 86% (one
verifier still failing — firewalled stop is over-eager vs the persistent relay's 100%),
and the `work` move was a no-op (second-guessed an already-correct value). Both are
prompt calibration, not structural.

* refactor(bench): drive the surface through the existing personify combinators (delete-and-improve)

Audit (two parallel passes over agent-runtime src/loops + agent-eval) found the hand-rolled
agentic.ts drivers duplicate existing primitives: breadthDriver=fanout, depthDriver=loopUntil+
createScopeAnalyst, mixDriver=widen, analyze()=createScopeAnalyst (and worse — bypasses the
pool/journal/firewall), runAgentic=runPersonified. "Many personas in parallel" = agent-eval
runEvalCampaign; worker tooling/skills = AgentProfile; the sequential operator ~ agent-eval
runProposeReview. Genuinely new: only the AgenticSurface seam + the operator's "work" move.

agentic-personify.ts is the bridge: surfacePersona() wraps the surface registry (the shot is
already a LeafExecutor) into a Persona, so the keystone fanout/loopUntil/widen + runPersonified
drive it directly. PROVEN: runBreadthPersonified (fanout + runPersonified) reproduces the
hand-rolled breadthDriver EXACTLY on EOPS itsm (50%=50%, 57%=57%) — the combinators drive the
surface, the bespoke drivers are redundant.

Next (each deletes code): depth→loopUntil + wire the dormant analyst live (createScopeAnalyst →
buildSteerContext → the gate; combinators.ts:319 currently passes empty findings); mix→widen;
operator→one LoopShape (the work-move); personas→runEvalCampaign. agentic.ts then shrinks to the
surface + the operator shape.

* refactor(bench): delete the redundant drivers — breadth=fanout, operator subsumes depth/mix

Executes the consolidation. Deleted ~239 lines from agentic.ts: breadthDriver (= the existing
`fanout` combinator, proven identical), depthDriver + mixDriver (the operator subsumes them —
steer-only = depth, branch-when-stalled = mix), analyze() + analystExecutor (the worse, non-
firewalled, non-pool-metered analyst), and the now-orphaned leaf/drainOne/perChild helpers.

What remains is minimal: the AgenticSurface seam + its shot LeafExecutor/registry, the OPERATOR
(the one driver — reads the trace, judges, steers/works/branches/stops over the shared artifact,
run through createSupervisor().run), and the surface persona that lets the keystone combinators
drive the surface. BREADTH now runs through runPersonified + fanout (agentic-personify.ts);
runAgentic is operator-only. agentic-run compares operator vs breadth(fanout); the trace runner is
operator. tsc clean; consolidated path runs end-to-end on EOPS itsm.

Reuse over duplication: the drivers are the keystone combinators + the operator; personas are
Persona/runEvalCampaign; the analyst is createScopeAnalyst (wiring it live in the operator is the
next step). agentic.ts is now the surface + the operator, nothing more.

* feat(bench): the profile directory — roles are data, the driver profile is the lead-steerer's config

Full unification: one primitive (a profiled agent run as Agent.act in an executor, orchestrated by
the Supervisor); worker / analyst / driver are the SAME agent differing only by profile. profiles.ts
is the directory — RoleProfile (id/role/model/systemPrompt/skills/tools/mcp), the worker/analyst/
driver profiles, and OPERATOR_TOOLS (the verbs a driver uses to lead workers: list/define/run_analyst,
observe/spawn/steer_worker, stop — in-process = Scope methods, in-sandbox = the same verbs as MCP).

The driver profile is the sketched operator, formalized best-in-class: review the workers you drive
(in-flight or last-shot traces), investigate (run/define analysts, fan out sub-analysts in parallel),
identify everything, propose reusable findings as skills, take the lead when confident or dispatch
otherwise, parallelize intelligently not frivolously, judge from traces never the score. Topology
(breadth/depth) is the driver's agentic choice; the combinators are the deterministic floor.

The directory is the population the agent-eval RSI loop optimizes (runImprovementLoop: gepaDriver over
prompts, a skillopt ImprovementDriver over skills, Pareto + holdout-gated; analyst findings are the
optimizer's input). Next: the operator-toolbox MCP server (Scope-as-MCP) so a sandbox coding-harness
agent can BE the driver via these tools; then wire createScopeAnalyst as run_analyst's impl.

* feat(loops): scope.send — the missing operator verb (steer/interrupt a running child)

The driver's toolbox was three keystone verbs + one gap: spawn=scope.spawn, observe=scope.view,
analyze=createScopeAnalyst all existed; steering a RUNNING worker did not. Adds it as an additive
primitive: Scope.send(nodeId, msg) delivers an out-of-band message to a live child's executor
inbox (LeafExecutor.deliver?), returning false for an unknown/settled child or a leaf with no
inbox. A streaming executor drains its inbox between turns (steer / interrupt / resume); a one-shot
leaf that can't be steered mid-flight simply omits deliver.
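
The delivery rule can be sketched with a toy scope. `ToyScope`, `register`, and `settle` are hypothetical scaffolding; only the `send(nodeId, msg)` contract and the optional `deliver` on the executor come from the description above.

```typescript
interface LeafExecutor {
  // Optional inbox hook: a one-shot leaf that cannot be steered omits it.
  deliver?: (msg: string) => void;
}

interface ChildNode {
  executor: LeafExecutor;
  settled: boolean;
}

class ToyScope {
  private children = new Map<string, ChildNode>();

  register(nodeId: string, executor: LeafExecutor): void {
    this.children.set(nodeId, { executor, settled: false });
  }

  settle(nodeId: string): void {
    const child = this.children.get(nodeId);
    if (child) child.settled = true;
  }

  // Returns false for an unknown child, a settled child, or a leaf with no inbox.
  send(nodeId: string, msg: string): boolean {
    const child = this.children.get(nodeId);
    if (!child || child.settled || !child.executor.deliver) return false;
    child.executor.deliver(msg);
    return true;
  }
}
```

A streaming executor would push delivered messages onto an inbox and drain it between turns; returning a boolean rather than throwing lets a driver probe whether a child is still steerable.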

This is the one verb the sandbox operator needs that wasn't there: in-process it's this Scope
method; in a box it's the steer_worker MCP tool (Scope-as-MCP); cross-box it's the Agent Bus —
same verb, three transports. spawn stays budget-bounded/fail-closed, so a maximally-agentic driver
that spawns + steers at will is still equal-k by construction.

Additive (optional deliver, new send method on the experimental Scope). 683/683 tests pass,
including 2 new: send steers a live child via its inbox, and returns false for a settled child,
an unknown child, or a leaf with no inbox.

* feat(mcp): operator toolbox — Scope-as-MCP, the driver's verbs as tools (the sandbox driver enabler)

The driver's four verbs, now callable from inside a sandbox: createOperatorToolbox({scope, blobs,
makeWorkerAgent, perWorker}) returns MCP tools backed by a live keystone Scope —
  spawn_worker   → scope.spawn   (budget-bounded, fail-closed; a worker may itself be a DRIVER → drivers-of-drivers)
  observe_worker → scope.view + the result blob
  steer_worker   → scope.send    (deliver next-instruction/interrupt to a RUNNING worker)
  stop           → the driver's terminal move
Same verbs the in-process operator calls directly; this is the binding that lets a coding-harness
agent running the driver/operator profile BE the driver from inside a box. createMcpServer gains an
`extraTools` option (fail-loud on shadowing a built-in) so the toolbox is served over the existing
JSON-RPC transport. Exported from the mcp barrel.
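
The fail-loud shadowing check could be as small as the sketch below. `McpTool` and `mergeTools` are hypothetical names and the built-in tool in the usage is made up; the real createMcpServer internals are not shown in this commit message.

```typescript
interface McpTool {
  name: string;
  handler: (args: Record<string, unknown>) => unknown;
}

// Merge extra tools into the built-in set, refusing any name collision.
function mergeTools(builtIn: McpTool[], extraTools: McpTool[] = []): Map<string, McpTool> {
  const registry = new Map<string, McpTool>();
  for (const tool of builtIn) registry.set(tool.name, tool);
  for (const tool of extraTools) {
    if (registry.has(tool.name)) {
      // Fail loud: silently replacing a built-in would change server behavior.
      throw new Error(`extraTools entry "${tool.name}" shadows an existing tool`);
    }
    registry.set(tool.name, tool);
  }
  return registry;
}
```

Throwing at construction time surfaces a naming conflict when the server is built rather than at the first confusing tool call.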

run_analyst is deliberately deferred to the analyst-kind directory (agent-eval createTraceAnalystKind
+ createScopeAnalyst). 688/688 tests (5 new: spawn/observe/steer/stop handlers + server wiring +
fail-closed + shadow-throws). The budget discipline holds for a maximally-agentic driver: spawn fails
closed at the conserved pool, so equal-k survives a driver that spawns + steers at will.

* feat(mcp): analyst-kind directory + run_analyst/list_analysts operator verbs

The operator's review verb, completing the driver's toolbox. An analyst is
not one question: a kind is ONE lens (completeness, correctness, policy,
efficiency, tool-use), each emitting AnalystFindings tagged by its area. The
kinds are composable DATA, the runner is generic.

- analyst-kinds.ts: the lens directory (defaultAnalystKinds) + a generic
  runner (runAnalystLens/makeAnalystRunner). Reuses agent-eval makeFinding for
  the lift and the runtime assertTraceDerivedFindings firewall (selector != judge:
  a finding citing a judge/verdict/score metric is rejected). The raw-row
  validator + schema prompt are inlined (kind-factory internals are not root-
  exported) so a lens emits the same finding shape without an unstable import.
  AnalystKind is a deliberate subset of TraceAnalystKindSpec — upgradeable to
  the full agentic factory without changing the directory surface.
- operator-toolbox.ts: list_analysts (the menu) + run_analyst (apply a kind
  over a settled worker's trace -> findings). Both present only when the
  analyst seam (analystKinds + runAnalyst) is wired; a pure-dispatcher driver
  omits them. run_analyst fails loud if the worker has not settled.
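
The firewall rule (a finding must not cite a judge, verdict, or score metric) can be illustrated as below. The `AnalystFinding` fields, the regex, and the `assertTraceDerived` name are assumptions for this sketch; the runtime's assertTraceDerivedFindings may use a different shape and matching rule.

```typescript
// Hypothetical finding shape: a lens tags its area and may cite a metric.
interface AnalystFinding {
  area: string;
  claim: string;
  metric?: string;
}

// Assumed rule: any cited metric whose name mentions these words is judge-derived.
const JUDGE_DERIVED = /(judge|verdict|score)/i;

// Pass findings through unchanged, or throw on the first one that cites the judge.
function assertTraceDerived(findings: AnalystFinding[]): AnalystFinding[] {
  for (const finding of findings) {
    if (finding.metric !== undefined && JUDGE_DERIVED.test(finding.metric)) {
      throw new Error(
        `finding in area "${finding.area}" cites judge-derived metric "${finding.metric}"`,
      );
    }
  }
  return findings;
}
```

Rejecting at the boundary keeps the selector and the judge separate: an analyst can describe what the trace shows but cannot smuggle the score back into steering.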

Tests: 12 new (lens directory is plural/distinct, lift + firewall, renderTrace,
injected-chat lens run, unknown-kind error; toolbox list/run over a mock scope
with a settled + a running worker). Full suite 695 pass.

* feat(mcp): live operator driver — LLM tool-loop over the toolbox is the topology

The piece that makes "the root decides any topology" real. createOperatorDriverAgent
returns an Agent whose act() hands the operator toolbox to an injected LLM tool-loop, so
the model's tool calls ARE the run shape: how many workers, how deep, when to branch, when
to stop. It runs inside the keystone Supervisor over the conserved budget pool, so an LLM
that tries to spawn past the pool gets a fail-closed error (equal-k by construction), the
journal records the tree, and a worker may itself carry a driver profile (drivers-of-drivers).

- await_next verb (scope.next) added to the toolbox: THE wake event — block until the next
  spawned worker settles, returning its deployable verdict. Workers run concurrently; spawn
  a batch, then await_next to collect them. The toolbox now keeps a settled() ledger.
- selector != judge holds: the driver selects on the deployable verdict (the ledger); the
  analyst lens reads only the trace. selectBest = best valid by score, ties to highest.
- chat is injected (provider-neutral ToolChat) so the driver is runtime-agnostic: a router
  chat in-process, or the harness's own tool loop pointed at the toolbox over MCP in a box.

Tests: live driver through the REAL Supervisor + conserved pool with stub workers —
spawn -> await_next -> stop selects the best verdict; a spawn past the pool fails closed
(equal-k). Plus a toolbox await_next/settled ledger unit test. Full suite 698 pass.

* fix(mcp): harden the operator driver — survive bad tool calls, force a real attempt

Robustness from the codex audit (FunctionCallError RespondToModel-vs-Fatal pattern) + the
aec smoke (the model "answered in prose" and shipped nothing):

- Tool dispatch no longer crashes the run. Malformed JSON args are fed back as a tool error
  (the handler is NOT called with garbage); a handler throw is caught and returned as a
  tool result so the model self-corrects, instead of throwing out of act() and discarding
  the whole run over one correctable tool call.
- minWorkersBeforeStop: reject stop until >=N workers settled done — an operator cannot ship
  nothing (fixes premature done with zero spawns).
- maxBareTurns: re-nudge firmly when the model returns prose instead of a tool call (some
  providers ignore tool_choice:'required'); bounded so a non-driving model ends as a
  no-winner, never an infinite spin.
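
The first bullet's dispatch behavior can be sketched like this: parse failures and handler throws both become tool results the model can read, and the handler never sees unparsed input. The types and the `dispatchToolCall` name are hypothetical, and handlers are synchronous here only to keep the example short.

```typescript
type ToolHandler = (args: Record<string, unknown>) => string;

interface ToolResult {
  ok: boolean;
  content: string; // fed back to the model either way
}

function dispatchToolCall(
  handlers: Map<string, ToolHandler>,
  name: string,
  rawArgs: string,
): ToolResult {
  const handler = handlers.get(name);
  if (!handler) return { ok: false, content: `unknown tool: ${name}` };

  let args: unknown;
  try {
    args = JSON.parse(rawArgs);
  } catch {
    // Malformed JSON: report it to the model, do not call the handler.
    return { ok: false, content: `malformed JSON arguments for ${name}` };
  }
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return { ok: false, content: `arguments for ${name} must be a JSON object` };
  }

  try {
    return { ok: true, content: handler(args as Record<string, unknown>) };
  } catch (err) {
    // A handler throw becomes a tool error instead of aborting the whole run.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, content: `tool ${name} failed: ${message}` };
  }
}
```

Every path returns a `ToolResult`, so one correctable mistake costs a turn instead of the run.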

Driver tests still pass (spawn->await->stop selects best; spawn past pool fails closed).

* feat(bench): operator-driver gate arm on aec — adaptive ≤K workers vs blind@K

The non-blind topology arm of the open gate, router-only (deterministic verify.py judge,
no sandbox). operator-gate.mts runs the live operator driver per task: it may spawn UP TO K
solver workers (each = one router solve + verify.py grade, reusing benchSolveLeaf), await
their deployable verdicts, and stop when satisfied. The conserved pool caps it at K workers
(equal-k vs random@K); it may use fewer (adaptive), at the cost of its own coordination
tokens, which are accounted separately.

- router-client.ts: routerChatWithTools — a tool-calling completion (OpenAI message shape,
  configurable tool_choice), the operator driver's LLM seam. Same real-usage/fail-loud
  discipline as routerChatWithUsage.
- keystone-gate.ts: export benchSolveLeaf so the operator's workers reuse the deployable
  solve-and-grade leaf (the spawn_worker `task` string becomes the worker's strategy).

Validated on 2 tasks: conduit-fill resolves with 1 worker; catenary spawns 2 then stops
(adaptive). Writes an operator@K corpus that corpus-report pairs against random@K/diverse@K.

* fix(loops): close the reserve→factory budget-leak window — protect the equal-k invariant

From the codex audit. In scope.spawn the conserved-pool reservation is taken (pool.reserve)
BEFORE the executor factory runs (resolved.value(spec, ctx)); a synchronous factory throw
there leaked the reservation, because runChild — which reconciles the ticket — is never
reached. A leaked reservation silently breaks total ≡ free + reserved + committed, the
equal-k invariant the whole instrument rests on.

- scope.spawn: wrap the post-reserve region in try/catch; on a synchronous throw release the
  reservation with zero spend (pool.reconcile(ticket, zeroSpend())) and rethrow. runChild is
  the last statement and never sync-throws, so there is no double-reconcile.
- BudgetPool.assertNoOpenTickets(): the leak detector — fail loud if any reservation remains.
  Called at the supervisor join barrier on the success path (a leak would corrupt spentTotal).
- Tests: a synchronous factory throw releases the reservation (pool fully restored, no open
  ticket); assertNoOpenTickets throws while a ticket is open and passes once reconciled.
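
The leak and its fix can be reproduced with a toy pool. `ToyBudgetPool` and `spawnChild` are hypothetical stand-ins that only mirror the verbs named above (reserve, reconcile, assertNoOpenTickets); the real BudgetPool and scope.spawn are not shown in this message.

```typescript
interface Ticket {
  id: number;
  amount: number;
}

class ToyBudgetPool {
  private free: number;
  private committed = 0;
  private open = new Map<number, number>(); // ticket id -> reserved amount
  private nextId = 1;

  constructor(total: number) {
    this.free = total;
  }

  reserve(amount: number): Ticket {
    if (amount > this.free) throw new Error("pool exhausted"); // fail closed
    this.free -= amount;
    const ticket: Ticket = { id: this.nextId++, amount };
    this.open.set(ticket.id, amount);
    return ticket;
  }

  // Close a ticket: commit what was spent, return the rest to the free pool.
  reconcile(ticket: Ticket, spent: number): void {
    if (!this.open.delete(ticket.id)) throw new Error("ticket already reconciled");
    this.committed += spent;
    this.free += ticket.amount - spent;
  }

  assertNoOpenTickets(): void {
    if (this.open.size > 0) throw new Error(`${this.open.size} reservation(s) still open`);
  }

  get freeBudget(): number {
    return this.free;
  }
}

// Reserve first, then build the executor; a synchronous factory throw must
// release the reservation with zero spend before the error propagates.
function spawnChild<T>(pool: ToyBudgetPool, cost: number, factory: () => T): { ticket: Ticket; executor: T } {
  const ticket = pool.reserve(cost);
  try {
    return { ticket, executor: factory() };
  } catch (err) {
    pool.reconcile(ticket, 0);
    throw err;
  }
}
```

Without the catch branch, a factory throw would leave the reservation open forever, and free + reserved + committed would still sum to the total while the usable budget quietly shrank.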

Full suite 700 pass.

* feat(bench): make the operator + blind gate runners benchmark-agnostic (BENCH=…)

Both gate runners now resolve their adapter via resolveAdapter(process.env.BENCH ?? 'aec-bench')
instead of hardcoding aec, and stamp the corpus benchmark field from adapter.name. One BENCH=…
flag points operator-gate.mts and aec-gate.mts at any of the 16 registered adapters.

Validated: a FinSearchComp smoke (FINSEARCHCOMP_FIXTURES=1 N=3 K=3) ran clean end-to-end — the
operator resolved the finsearch adapter, drove adaptive workers (2.67/task), and the per-record
LLM judge scored each. (It scored 0% there because finsearch needs WEB SEARCH and the current
benchSolveLeaf worker is a plain router chat with none — a worker-capability gap, not a runner
bug; the search/agentic-worker wiring is the next step.)

* feat(bench): operator driver over EnterpriseOps — the full-vision multi-turn test

The new createOperatorDriverAgent over the EOPS AgenticSurface, where every spawn_worker is a
real multi-turn agentic rollout (seeded gym DB → MCP tool loop → verifier score), not a single
router chat. So the operator's full verb set fires for the first time: spawn_worker (a fresh
rollout with a strategy), await_next (deployable verifier score), run_analyst (trace lenses over
the worker transcript — the analyst directory live), stop. Conserved pool caps it at K rollouts
(equal-K vs the breadth best-of-K baseline, runBreadthPersonified).

operatorSurfaceRegistry wraps shotExecutor (fresh per spawn) so the operator's strategy string
maps to a ShotTask steer over the task; the analyst seam is wired from defaultAnalystKinds +
makeAnalystRunner. Smoke (TASKS=1 K=2 gpt-4.1) ran clean: operator spawned 2 rollouts, 8 turns,
selected best; breadth baseline ran. Scaling next for the operator-vs-breadth signal.

* fix(mcp): hard maxWorkers equal-k cap on the operator driver + paired sign test

The first EOPS operator-vs-breadth run (+9.7pp partial score, n=12) was confounded: short
workers refund their unspent iteration budget, so the conserved pool readmitted further spawns
and the operator ran 5-6 rollouts vs breadth's 4, i.e. more compute than the control. A win under
unequal compute is not a topology win (repo law).

- createOperatorDriverAgent gains maxWorkers: a HARD cap on successful spawns, independent of the
  pool. spawn_worker past it returns a typed cap error (handler not called). Set maxWorkers=K for an
  exactly-≤K-rollout comparison; the coordination-token overhead is disclosed separately (worker-K
  is equal). Test: spawns past the cap are rejected even with pool budget to spare.
- operator-eops.mts: maxWorkers=k + a two-sided exact binomial sign test over score-discordant
  tasks, so the operator-vs-breadth verdict carries wins/losses/ties + p, not just a mean delta.
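
A two-sided exact binomial sign test over paired per-task scores can be sketched as follows. The function names are hypothetical, and the two-sided p-value uses the common "double the smaller tail, cap at 1" convention, which may differ in detail from what operator-eops.mts computes.

```typescript
// Binomial coefficient C(n, k); every intermediate value is an exact integer.
function binomial(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

// Count per-task outcomes; ties are reported but excluded from the test.
function countOutcomes(operator: number[], breadth: number[]): { wins: number; losses: number; ties: number } {
  let wins = 0;
  let losses = 0;
  let ties = 0;
  operator.forEach((score, i) => {
    const other = breadth[i];
    if (other === undefined) return; // skip unpaired tasks
    if (score > other) wins++;
    else if (score < other) losses++;
    else ties++;
  });
  return { wins, losses, ties };
}

// Under H0 each discordant task is a fair coin flip: X ~ Binomial(n, 0.5).
function signTestP(wins: number, losses: number): number {
  const n = wins + losses;
  if (n === 0) return 1;
  const smaller = Math.min(wins, losses);
  let tail = 0;
  for (let i = 0; i <= smaller; i++) tail += binomial(n, i);
  return Math.min(1, (2 * tail) / 2 ** n);
}
```

Reporting wins, losses, ties, and p together shows whether a mean delta comes from many small wins or a single outlier task.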

* fix(mcp): exhaustUnlessResolved — stop the operator giving up early at equal-K

The probe exposed under-exploration: the operator stopped after one low-scoring worker (minWorkers
met) while blind best-of-K used its full budget and scored higher. Adaptive early-stop is good for
easy wins but loses on hard tasks. exhaustUnlessResolved (with maxWorkers set) rejects stop while NO
worker has fully resolved and spawn budget remains — forcing the operator to use its full K with
analyst-informed strategies. Coverage then ≥ blind best-of-K, so a score comparison isolates
STRATEGY quality, not a give-up-early compute-vs-score tradeoff. Wired into operator-eops.mts.
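
The stop guard amounts to a small predicate, sketched here together with the earlier minWorkersBeforeStop rule. The state shape, the `checkStop` name, and treating a score of 1 as "fully resolved" are assumptions for illustration.

```typescript
interface DriverState {
  settledScores: number[]; // verifier scores of settled workers, assumed in [0, 1]
  successfulSpawns: number;
}

interface StopPolicy {
  minWorkersBeforeStop: number;
  maxWorkers: number;
  exhaustUnlessResolved: boolean;
}

function checkStop(state: DriverState, policy: StopPolicy): { allowed: boolean; reason: string } {
  if (state.settledScores.length < policy.minWorkersBeforeStop) {
    return { allowed: false, reason: "too few settled workers" };
  }
  const resolved = state.settledScores.some((score) => score >= 1);
  const budgetLeft = state.successfulSpawns < policy.maxWorkers;
  // Reject stop while nothing has fully resolved and spawns remain.
  if (policy.exhaustUnlessResolved && !resolved && budgetLeft) {
    return { allowed: false, reason: "no worker resolved and spawn budget remains" };
  }
  return { allowed: true, reason: "ok" };
}
```

With the flag set, the operator can still stop early on a full resolve, so easy tasks stay cheap while hard tasks use all K rollouts.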

* fix(bench): bounded retry on EOPS seed — survive transient SQLite 5xx under concurrent seeding

The clean n=20 EOPS run logged "unable to open database file" 500s (the gym's SQLite under
concurrent within-task seeding — up to ~8 simultaneous seeds), silently dropping some worker
rollouts to 0 and adding noise to the operator-vs-breadth comparison. seed() now retries a
transient 5xx up to 4 times with linear backoff (fresh database_id per attempt); non-5xx fails
loud immediately. Makes a scaled EOPS run trustworthy.
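
A bounded retry with linear backoff along these lines is sketched below. The callback signatures, the 250 ms base delay, and the `seedWithRetry` name are assumptions; only "up to 4 retries on 5xx, linear backoff, fresh database_id per attempt, non-5xx fails immediately" comes from the description.

```typescript
interface SeedResponse {
  status: number;
  databaseId: string;
}

const isTransient = (status: number): boolean => status >= 500 && status <= 599;

// Linear backoff: the delay grows by one base step per failed attempt.
const linearBackoffMs = (failedAttempts: number, baseMs: number): number => failedAttempts * baseMs;

async function seedWithRetry(
  seedOnce: (databaseId: string) => Promise<SeedResponse>,
  newDatabaseId: () => string,
  sleep: (ms: number) => Promise<void>,
  maxRetries = 4,
  baseDelayMs = 250, // assumed value, not taken from the source
): Promise<SeedResponse> {
  for (let attempt = 0; ; attempt++) {
    // A fresh id per attempt avoids reusing a half-created database.
    const response = await seedOnce(newDatabaseId());
    if (response.status >= 200 && response.status < 300) return response;
    if (!isTransient(response.status) || attempt >= maxRetries) {
      throw new Error(`seed failed with HTTP ${response.status}`); // fail loud
    }
    await sleep(linearBackoffMs(attempt + 1, baseDelayMs));
  }
}
```

Injecting `sleep` keeps the function testable without real delays; in production it would wrap `setTimeout`.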

* docs(canon): split Gate A (inner GO/NO-GO) from Gate B (flywheel success); fix equal-k + oracle/verifier

The canon let one word — "THE gate" — mean both a narrow inner-loop diagnostic AND the program's
success criterion, contradicting learning-flywheel.md (outer loop is the product; a within-run zero
is fine). That mis-specified measurement contract made a predicted inner-loop null read as a
program verdict. Grounded against the real files + this session's EOPS evidence (stateful multi-turn
workers under a HARD equal-K cap beat blind +6.1pp — so equal-compute and depth coexist).

Naming/scoping fixes only — the engine and the equal-compute rigor are correct and kept:
- learning-flywheel.md: add the ONE success definition (Gate B) — cross-run score-vs-run slope under
  a frozen-controller control, equal per-run compute, deployable-checker graded.
- roadmap-rsi.md + architecture-interpretations.md §5 + HARNESS.md: rename "THE gate" → Gate A (the
  inner GO/NO-GO for building the recursive-driver layer); scope "if it fails, stop/delete" to
  within-run steering only — never the corpus+controller product.
- equal-k → equal-COMPUTE everywhere: it bounds Σ rollouts × turns; k counts ROLLOUTS, each may be a
  full multi-turn/stateful trajectory. Statelessness was the corpus-replay instrument's choice, not
  the principle.
- architecture.md §7: split the overloaded "selector ≠ judge" into ORACLE (banned from selection AND
  steering), VERIFIER (a deployable checker — ALLOWED in both; what depth needs), WRITE-ONLY JUDGE
  (banned from steering only).
- architecture.md §9: demote FinSearchComp from primary to a deployable-selector negative control;
  the primary depth/operator bench is stateful (EnterpriseOps-Gym / commit0 / swe-bench).

* refactor(drivers): the driver is harness+profile+MCP — delete dead pair + anchor the decision

Per the vision-owner: a driver is an agent f(trace, outputs) -> thoughts + MCP tool calls, RUN BY a
coding harness in a sandbox. The harness already owns the loop, tool-calling, subagent spawning, and
native idioms (parallelize/ultrathink/dynamic-workflow). We own only the steering MCP, the profiles,
and the orchestrator. There is no createDriver function in the product; the driver is launched, not
constructed. The in-process LLM tool-loop + the create*Driver factory zoo + the TopologyMove DSL were
a measurement shim that calcified into a fake product — slop to delete.

- docs/architecture-driver.md: the decision, the one Decider seam (code-rule / in-process LLM /
  sandbox-harness all unify), the delete-vs-keep table, and the phased build order. The anchor that
  prevents re-drift (companion to the Gate A/Gate B canon fix).
- delete bench/src/drivers/{llm-meta-driver,progressive-widening}.ts — a mutually-referential dead
  pair, 0 external callers (the real widen lives in personify/wave-types.ts + the WidenGate type in
  supervise/types.ts, both kept).

* refactor(drivers): bench has ZERO drivers — delete the dead blind control; sharpen the anchor

The repo builds ONE driver, in the library (src/). bench is a thin experiment consumer: adapter +
"run the one library driver at a profile" + score. A blind control is not a bench driver — it is the
one library driver with a blind decider; the equal-compute guard is experiment infra.

- delete bench/src/drivers/flat-harness.ts (the gate's blind control — 0 callers, superseded by
  keystone-gate's fanout). bench/src/drivers/ is now empty/gone.
- architecture-driver.md: add the "bench has ZERO drivers" section + flag the mislocated library
  abstractions (AgenticSurface / shotExecutor / agenticRegistry in bench/src/agentic.ts) that must
  move to src/ — bench is squatting on the library because the experiment predated the one driver.

* chore(deep-clean): delete dead example orphans + the dead preset registry

Deep-clean batch 1-2 (caller-verified, 0 importers):
- src/loops/personify/examples/{code,meta-orchestrator,research}.ts — proof-of-concept demos,
  not barrel-exported, 0 importers, 0 tests. The examples/ dir is now gone.
- src/loops/drivers/planners.ts + tests/loops/planners.test.ts — the dead preset registry
  (blind/PROMPT_PLANNERS/resolvePlanner), own-test-only; superseded by the decider model in
  docs/architecture-driver.md. Barrel re-exports dropped from src/loops/index.ts.

* chore(deep-clean): drop planners barrel re-exports + lint-format operator-driver

* chore(deep-clean): delete the dead in-process EOPS experiment cluster (10 files)

A closed dead subgraph — 0 external importers, in no package.json script, imported by nothing kept
(verified by caller-grep). The in-process operator experiment + bench squatting on library
abstractions, both slop per docs/architecture-driver.md (the driver is harness+profile+MCP; bench
holds no drivers/abstractions).

Deleted: agentic.ts (AgenticSurface/shotExecutor/agenticRegistry — library abstractions, to be
rebuilt clean in src/), gym-agent.ts + agentic-eops.ts (EOPS gym surface — wiring preserved in
memory), operator-eops.mts (the in-process EOPS experiment runner), agentic-personify.ts,
agentic-run.mts, agentic-trace.mts, keystone-gate-probe.mts, gym-sweep.mts, gym-depth.mts.

The EOPS operator>blind +6.1pp result + the gym recipe live in persistent memory; the experiment
re-runs through the product (sandbox-harness driver + MCP), not the in-process shim. Kept:
keystone-gate / operator-gate / aec-gate / corpus-* / experiment / gepa-refine.

* docs(canon): rewrite the atom as the recursive agent tree; fold in the driver doc; wire the hook model

Consolidate into the existing spine — no new doc, less is more, eliminate the framing that led to
bad decisions.

- architecture.md §1 ("the atom"): replace the Program/TopologyMove opcode DSL (the slop) with the
  true model — ONE agent = AgentProfile + harness in a Scope; it calls a tool; one tool (spawn over
  MCP) creates a child agent; topology EMERGES from spawn/steer, no DSL. The recursive execution
  tree IS the product; we own only the MCP + profiles + orchestrator. Checks (analyst/judge/verifier)
  are DATA the driver authors on the fly (define_check/run_check); oracle≠verifier≠judge made explicit.
- §1b (new): the lifecycle hook stream from the merged PR (#162/#163) — agent.{run,turn,tool_call,
  spawn,child,plan,decision} × {before,after,error,event}; runLoop/toolLoop/Scope.spawn are producers;
  hooks attach at the execution/spawn boundary, not the profile. This is what the topology viz reads.
  Open gap flagged: Scope.spawn doesn't emit into it yet.
- Header status: product-core (supervise tree + sandbox seam + MCP + corpus + hooks) vs driver-as-code
  slop being deleted; runLoop is one backend, not the center.
- §6 corollary: bench holds ZERO drivers/abstractions — thin experiment consumer only.
- Delete docs/architecture-driver.md (folded into §1) + drop its README pointers.
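
Illustrative sketch of the §1b hook grid (not the repo's actual types): the target and phase lists come from the bullet above, and the `target:phase` key form follows the `agent.run:after` spelling used later in this log. `TARGETS`, `PHASES`, `HookEventSketch`, and `allHookKeys` are hypothetical names.

```typescript
// Hypothetical names throughout; only the target and phase strings come from the commit text.
const TARGETS = ["run", "turn", "tool_call", "spawn", "child", "plan", "decision"] as const;
const PHASES = ["before", "after", "error", "event"] as const;

type HookTarget = `agent.${(typeof TARGETS)[number]}`;
type HookPhase = (typeof PHASES)[number];

interface HookEventSketch {
  target: HookTarget;
  phase: HookPhase;
  // Producer details (runLoop/toolLoop/Scope.spawn) are assumed to ride in metadata.
  metadata?: Record<string, unknown>;
}

// Enumerate every target x phase pair as a "agent.<target>:<phase>" key.
function allHookKeys(): string[] {
  const keys: string[] = [];
  for (const t of TARGETS) {
    for (const p of PHASES) {
      keys.push(`agent.${t}:${p}`);
    }
  }
  return keys;
}

const example: HookEventSketch = { target: "agent.spawn", phase: "before" };
```

Seven targets times four phases gives 28 keys; a viz consumer would subscribe by key rather than by producer.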

* refactor(kill): delete the in-process operator-driver loop + its experiment arm

The in-process LLM tool-loop was the measurement shim that calcified into a fake product (arch
§1: the driver is a sandbox harness, not an in-process loop). Deleted operator-driver.ts (the
loop + Decider/ToolChat types), operator-driver.test.ts, bench/src/operator-gate.mts (the AEC
operator arm that drove the shim — its result is in memory; the blind arms in aec-gate.mts stay),
and the mcp/index barrel exports. Kept operator-toolbox.ts (the steering MCP) + analyst-kinds.ts
(the checks) — the product surface the sandbox-harness driver will mount. typecheck + 338 loop/mcp
tests green.

* refactor(kill): delete createRefineDriver factory; kernel tests use a 15-line local helper

refine.ts + refine.test.ts + the barrel export gone (no production caller — only the barrel + a
README mention). The 3 kernel/composition tests that used it as a generic driver-under-test now
import a tiny tests/loops/refine-driver.ts (test scaffolding, not a product factory). typecheck +
31 tests green.

* refactor(kill): delete createFanoutVoteDriver factory; inline a 4-line fanout driver where used

fanout-vote.ts + fanout-vote.test.ts + the barrel exports gone. The fanout BEHAVIOR (N copies
round 0 → kernel picks best-valid) is a 4-line inline Driver where production needs it
(profiles/coder.ts:multiHarnessCoderFanout, the delegate_code N>1 path) and in the two examples;
composition.test uses the shared tests/loops/refine-driver.ts helper. The public create*Driver
factory, Decision type, and test-helper exports are gone; only the few lines of behavior remain at
each call site.
typecheck + lint clean; coder + composition tests green.
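
A minimal sketch of the fanout behavior described above, assuming a driver is asked per round what to run and the kernel then picks the best valid result. `fanoutPlan`, `pickBestValid`, and `CandidateSketch` are hypothetical names, not the repo's Driver interface.

```typescript
// Hypothetical shapes; the real Driver/kernel types are not shown in this log.
interface CandidateSketch<T> {
  value: T;
  valid: boolean;
  score: number;
}

// Round 0 fans the same task out n times; every later round returns nothing (stop).
function fanoutPlan<T>(task: T, n: number, round: number): T[] {
  return round === 0 ? Array.from({ length: n }, () => task) : [];
}

// Kernel-side selection: the highest-scoring candidate among the valid ones.
function pickBestValid<T>(candidates: CandidateSketch<T>[]): CandidateSketch<T> | undefined {
  let best: CandidateSketch<T> | undefined;
  for (const c of candidates) {
    if (c.valid && (best === undefined || c.score > best.score)) best = c;
  }
  return best;
}
```

The point of inlining is that the whole behavior is the one-line `fanoutPlan` body; selection already lives in the kernel.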

* refactor(kill): delete createSandboxPlanner — the LLM-emits-TopologyMove envelope bridge

The last factory. createSandboxPlanner wrapped an LLM-in-a-box into a TopologyPlanner that decoded
JSON topology-move envelopes — the DSL the native-language-over-MCP model replaces (arch §1: the
harness steers in natural language, topology emerges from spawn/steer, no envelope DSL). No
production caller (only docstring mentions). Deleted sandbox-planner.ts + the barrel exports + the
`describe('createSandboxPlanner')` block + the two envelope-decoding it-cases + the now-dead
plannerAndWorkerClient helper. The kept dynamic-driver tests use inline code TopologyPlanners.
createDynamicDriver stays as a kernel backend. Full suite green.

* refactor(names): delete drivers/ dir — move dynamic.ts up (it's a loop backend, not a driver)

A drivers/ directory with one file actively lies after 'the driver is not a code factory'. dynamic.ts
is the planner-driven runLoop backend — it lives next to run-loop.ts now. Import paths updated; typecheck
+ dynamic tests green.

* refactor(names): operator-toolbox→agent-bus + analyst-kinds→checks; gitignore data dumps

Names now match the model (arch §1):
- operator-toolbox → agent-bus: the MCP an agent uses to spawn/observe/steer child agents (the Agent
  Bus). 'operator' was the deleted in-process driver. createOperatorToolbox→createAgentBus etc.
- analyst-kinds → checks: checks are DATA with kind analyst|judge|verifier; AnalystKind→Check,
  defaultAnalystKinds→defaultChecks, makeAnalystRunner→makeCheckRunner, runAnalystLens→runCheck.
- .gitignore: bench/data, bench/experiments, __pycache__, .claude — stop accidental data dumps.
typecheck + lint + renamed tests green.

* refactor(names): src/loops → src/runtime — the dir was named after one backend (runLoop)

'loops' named the whole execution substrate after its least-central piece (the run-loop backend),
burying the recursive agent tree (supervise/) under it. Renamed to src/runtime/ — the agent
execution runtime: the recursive tree + the loop backends + the sandbox seam + personify. Internal
relative imports are unaffected (siblings); outside importers repointed to ../runtime.

NON-BREAKING: ./loops stays a back-compat export alias (tsup builds src/runtime/index.ts → both
dist/runtime.js and dist/loops.js; package.json keeps ./loops, adds ./runtime). External consumers
(agent-knowledge imports ./loops) keep working; new code should import ./runtime.

typecheck + build (both entries) + lint + 679 tests green.
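
Sketch of the alias wiring described above, written as plain objects so it stands alone. It assumes tsup's `entry` option is given as an output-name to source-path map; the `packageExports` paths are illustrative and the real package.json may use conditional exports.

```typescript
// Assumed tsup `entry` map: two output names built from the same source file.
const entry: Record<string, string> = {
  runtime: "src/runtime/index.ts",
  loops: "src/runtime/index.ts", // back-compat alias, same source as runtime
};

// Assumed package.json "exports" fragment (illustrative paths).
const packageExports: Record<string, string> = {
  "./runtime": "./dist/runtime.js",
  "./loops": "./dist/loops.js",
};
```

Both subpaths resolve to builds of the same module, so existing `./loops` importers keep working while new code moves to `./runtime`.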

* refactor(names): agent-bus toolbox → coordination (free the name for the real protocol)

The MCP verb set (spawn/observe/steer/check/stop a parent uses over its children) is NOT a transport
— and 'agent-bus' is already a real, distinct thing: the cross-org call protocol in
docs/agent-bus-protocol.md (forwarded billing identity, depth-4 ceiling, trace stitching). Renaming
the toolbox to agent-bus collided with it. It's now `coordination` (Scope-as-MCP): createAgentBus→
createCoordinationTools, AgentBus→CoordinationTools. The docstring now states it's the verb API, and
that steer rides a transport (in-process Scope.send / SDK SessionMessage / the agent-bus protocol) —
one verb, several bindings. typecheck + lint + test green.
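
A rough sketch of "one verb, several bindings", assuming steer is a function closed over a pluggable transport. `SteerTransportSketch`, `InProcessTransportSketch`, and `makeSteer` are hypothetical names; the real createCoordinationTools signature is not shown in this log.

```typescript
// Hypothetical transport seam: the verb does not know how the message travels.
interface SteerTransportSketch {
  send(childId: string, message: string): void;
}

// Stand-in for the in-process binding (Scope.send in the commit text).
class InProcessTransportSketch implements SteerTransportSketch {
  readonly inbox = new Map<string, string[]>();

  send(childId: string, message: string): void {
    const queue = this.inbox.get(childId) ?? [];
    queue.push(message);
    this.inbox.set(childId, queue);
  }
}

// The steer verb: identical for every binding, only the transport differs.
function makeSteer(transport: SteerTransportSketch): (childId: string, message: string) => void {
  return (childId, message) => transport.send(childId, message);
}
```

An SDK or agent-bus binding would implement the same one-method interface, which is why the toolbox itself is not a transport.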

* docs(canon): Gate A/B split + verifier-grounded deployable selector

Carry the prior-session canon: Gate A (inner refine@k>random@k GO/NO-GO) vs
Gate B (cross-run flywheel slope) split across learning-flywheel/roadmap/HARNESS;
add verifierGroundedSelect (highest deployable-checker pass-count, ties→earliest)
+ its assertion test; fix the coordination-toolbox label in the MCP server comment.
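
The selection rule stated above (highest deployable-checker pass-count, ties to the earliest) can be sketched as below. This is an illustration under assumed shapes, not the repo's verifierGroundedSelect; `ScoredSketch` and `selectByPassCount` are hypothetical names.

```typescript
// Hypothetical input shape: a candidate plus how many deployable checks it passed.
interface ScoredSketch<T> {
  candidate: T;
  passCount: number;
}

function selectByPassCount<T>(items: ScoredSketch<T>[]): T | undefined {
  let best: ScoredSketch<T> | undefined;
  for (const item of items) {
    // Strict ">" means a later item with an equal count never displaces an earlier one.
    if (best === undefined || item.passCount > best.passCount) best = item;
  }
  return best?.candidate;
}
```

Because the rule reads only pass-counts from a checker that can run at deploy time, it needs no oracle score.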

* feat(runtime): emit lifecycle hooks from Scope.spawn/settle — one observable tree

Until now the recursive Scope produced only internal journal events (SpawnEvent), so the live
tree was replay-only and invisible to the RuntimeHooks stream that runLoop/toolLoop already
feed. Thread SupervisorOpts.hooks into the root Scope and
emit on the agent-centric stream: agent.spawn at child creation (childId, label,
runtime, budget, depth) and agent.child at settle (status, score/valid or
reason/infra, spend). Fire-and-forget via notifyRuntimeHookEvent (non-throwing);
the journal stays the durable record, the hook stream is its live projection.

This is the source the topology visualization reads — the recursive agent tree is
now one stream across every backend. Updates architecture.md §1b (gap closed).

Tests: +2 in tests/loops/supervise.test.ts (stream emits with parent/child+status;
stays journal-only + silent when no hooks wired). 681 pass, typecheck+lint clean.
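
A minimal sketch of the fire-and-forget, non-throwing notification described above. It is not the real notifyRuntimeHookEvent; `notifySketch`, `HookSketch`, and the event fields are assumptions trimmed from the commit text.

```typescript
// Hypothetical event shapes, reduced to a few of the fields the commit names.
type LifecycleEventSketch =
  | { target: "agent.spawn"; childId: string; label: string; depth: number }
  | { target: "agent.child"; childId: string; status: string };

type HookSketch = (event: LifecycleEventSketch) => void | Promise<void>;

function notifySketch(hooks: HookSketch[] | undefined, event: LifecycleEventSketch): void {
  if (!hooks) return; // no hooks wired: stay journal-only and silent
  for (const hook of hooks) {
    try {
      // Never awaited; a rejected async hook is swallowed instead of surfacing.
      Promise.resolve(hook(event)).catch(() => undefined);
    } catch {
      // A synchronously throwing hook must not break spawn/settle or later hooks.
    }
  }
}
```

The caller never awaits or observes the hooks, so a broken subscriber cannot change the outcome of a spawn.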

* feat(topology): live recursive-agent-tree view over the lifecycle stream

Fold the one hook stream (agent.spawn/child from Scope, agent.run + turn/tool_call/
plan/decision from the loops) into the recursive agent tree and render it. An agent
node is born from agent.spawn (childId) or the root agent.run (runId); a step advances
the agent it belongs to (matched by runId/parentId); agent.child/agent.run:after settle
it with status + deployable score. Pure projection — no I/O, no backend coupling — so
the same fold drives a CLI render, a TUI, or a web tree, and renderTopologyTree is pure
over a folded tree (journal-replay friendly).

This is the topology visualization architecture.md §1b names — it consumes exactly the
stream the prior commit made Scope.spawn emit. New export: ./topology.

Tests: tests/topology.test.ts (structure + status + step attribution, ASCII render,
compact/maxDepth, unknown-agent drop, render purity). 686 pass; build + verify:package
+ typecheck + lint clean.
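
A compact sketch of the fold and render described above, assuming a flattened event shape. `foldSketch`, `renderSketch`, and the field names are hypothetical; the real ./topology export may differ in types and options (compact, maxDepth are omitted here).

```typescript
// Hypothetical flattened event; real events carry more fields.
interface StreamEventSketch {
  target: string;
  phase: "before" | "after" | "error" | "event";
  runId?: string;
  childId?: string;
  parentId?: string;
  status?: string;
}

interface AgentNodeSketch {
  id: string;
  status: string;
  steps: number;
  children: AgentNodeSketch[];
}

// Pure projection: events in, tree out, no I/O.
function foldSketch(events: StreamEventSketch[]): AgentNodeSketch | undefined {
  const nodes = new Map<string, AgentNodeSketch>();
  let root: AgentNodeSketch | undefined;
  const born = (id: string): AgentNodeSketch => {
    const node: AgentNodeSketch = { id, status: "running", steps: 0, children: [] };
    nodes.set(id, node);
    return node;
  };
  for (const e of events) {
    if (e.target === "agent.run" && e.phase === "before" && e.runId) {
      if (!root) root = born(e.runId); // the first run becomes the root agent
    } else if (e.target === "agent.spawn" && e.childId) {
      const parent = e.parentId ? nodes.get(e.parentId) : undefined;
      if (parent) parent.children.push(born(e.childId)); // unknown parent: drop
    } else if (e.target === "agent.child" || (e.target === "agent.run" && e.phase === "after")) {
      const node = nodes.get(e.childId ?? e.runId ?? "");
      if (node) node.status = e.status ?? "done"; // settle
    } else {
      const node = nodes.get(e.runId ?? e.parentId ?? "");
      if (node) node.steps += 1; // any other event for a known agent counts as a step
    }
  }
  return root;
}

// Pure over a folded tree: one indented line per agent.
function renderSketch(node: AgentNodeSketch, depth = 0): string {
  const line = `${"  ".repeat(depth)}${node.id} [${node.status}] steps=${node.steps}`;
  return [line, ...node.children.map((c) => renderSketch(c, depth + 1))].join("\n");
}
```

Keeping the fold separate from the render is what lets the same tree feed a CLI, a TUI, or a web view, and lets a journal replay reuse the renderer unchanged.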

* docs(research): land experimental belief-state + program-synthesis research drafts

Forward-looking research-track drafts (advisory, NOT the canonical spine), each carrying
its Status banner: belief-state-learner-spec (BUILD-ON-GREEN, gated on a positive diverse@k
gate), belief-agent-research-agenda (offline-on-committed-corpora tier + gated learner tier),
program-research-plan (fund-or-kill audit), codex-techniques-audit (codex adoption report).
Cross-link the existing research docs; reference src/loops/ (main's layout).