fix(runtime): use agent-centric hook target names by drewstone · Pull Request #163 · tangle-network/agent-runtime

drewstone · 2026-06-05T18:25:26Z

Summary

- rename runtime hook targets from implementation-shaped names to agent-centric names
- use agent.run, agent.turn, agent.tool_call, agent.plan, and agent.decision
- keep producer details in event metadata for debugging
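
A minimal sketch of how a subscriber might see the renamed targets. Only the five target names and the idea of carrying producer details in metadata come from the summary above; the `HookEvent` shape, the `HookBus` class, and the `producer` field name are hypothetical and not taken from the repository.

```typescript
// Agent-centric hook targets named in the summary above.
type HookTarget =
  | "agent.run"
  | "agent.turn"
  | "agent.tool_call"
  | "agent.plan"
  | "agent.decision";

// Hypothetical event shape: the target says WHAT happened in agent terms,
// the metadata says WHICH implementation piece produced it (debugging only).
interface HookEvent {
  target: HookTarget;
  metadata: { producer: string; [key: string]: unknown };
}

type HookHandler = (event: HookEvent) => void;

// Hypothetical in-memory bus; subscribers key on the agent-centric target only.
class HookBus {
  private handlers = new Map<HookTarget, HookHandler[]>();

  on(target: HookTarget, handler: HookHandler): void {
    const list = this.handlers.get(target) ?? [];
    list.push(handler);
    this.handlers.set(target, list);
  }

  // Returns how many handlers received the event.
  emit(event: HookEvent): number {
    const list = this.handlers.get(event.target) ?? [];
    for (const handler of list) handler(event);
    return list.length;
  }
}

const bus = new HookBus();
const seen: HookEvent[] = [];
bus.on("agent.tool_call", (event) => {
  seen.push(event);
});

// Two different producers emit; the subscriber never needs to know their names.
const delivered = bus.emit({
  target: "agent.tool_call",
  metadata: { producer: "toolLoop", tool: "search" },
});
const ignored = bus.emit({ target: "agent.turn", metadata: { producer: "runLoop" } });
```

The point of the sketch is that a consumer subscribes to `agent.tool_call` without depending on which producer emitted it, while the producer name remains available for debugging.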

Verification

- pnpm lint
- pnpm typecheck
- pnpm test
- pnpm build

… tree, observable + deep-cleaned (#165)

* feat(bench): wire AppWorld as a router-worker GEPA target + harden the refine loop

Adds AppWorld to gepa-refine as a no-sandbox router worker (the worker writes the Python solution via the router; the local venv AppWorld engine executes + scores it — objective passes/num_tests, no LLM judge, never touches the sandbox SSE path). Plus the weak-baseline + headroom setup the empirical-proof pursuit needed, and three robustness fixes surfaced by live runs:
- AppWorld adapter wired into ADAPTERS + scoreBased (graded fraction); a router-worker branch in runWithPrompt (refine the solution k rounds, extract the last fenced python block); AppWorld-domain reflectionTarget + reflectionPrimitives so the analyst diagnoses in AppWorld terms.
- WEAK_APPWORLD_DIRECTIVE (deliberately bare) + BASELINE_DIRECTIVE env override — the standard GEPA "optimize a weak starting prompt" setup.
- Reflect-model default gpt-5 -> gpt-4o: gpt-5 takes 40-60s/call and aborts the client timeout; gpt-4o is ~2s and reliable on a latency-critical inner loop.
- analyzeGeneration: retry transient LLM failures (4x exp backoff) instead of silently starving a generation of findings.
- Deterministic difficulty-balanced train/holdout split (FNV hash of task id) — benchmark task lists arrive ordered, so a raw first/second-half slice can hand TRAIN only easy tasks (0 failures to learn from) and HOLDOUT only hard ones.
Verified: clean end-to-end AppWorld run, 0 stubs/aborts, EYES->HANDS findings fire. (Substrate timeout hardening is agent-eval#212.)

* feat(bench): multi-turn REPL AppWorld agent (execution feedback) — the scaffold axis

The blind one-shot AppWorld worker can't act on directive guidance ("inspect api_docs, authenticate, paginate, verify") because it never sees API output — so a directive is inert and prompt-optimization measures ~0 lift. This adds the missing scaffold: a multi-turn REPL agent.
- appworld_driver.py `react` subcommand: the agent writes ONE python block per turn, the engine EXECUTES it in a PERSISTENT AppWorld world, the output (or error traceback) is fed back, and it iterates until complete_task() or max-turns; scored in-process by AppWorld's own evaluator. Returns score + token usage + a compact transcript. Router calls go through the OpenAI-compatible router with retry on transient/429/5xx. Validated standalone: passes 1/2 (0.5) in 8 turns vs ~0 for the blind one-shot.
- gepa-refine: AppWorld worker defaults to the REPL agent (APPWORLD_REACT=0 keeps the blind one-shot as a control arm). The episode is scored in-process so the artifact carries the score (judge passes it through — a multi-turn episode can't be re-executed from one artifact), and the transcript rides along so the failure-analyst reflection diagnoses WHAT went wrong — giving the directive real signal to optimize against.
This is the genuinely-informative test: optimize the directive for a worker that can actually use it. Powered run in flight.

* feat(bench): agentic EOPS rollout + the score-vs-n best-of-n curve

The gate's leaf was a one-shot completion, which floors at 0 on an agentic tool-use domain (the agent must observe a tool result before it can act on it). gym-agent.ts runs a REAL rollout: a tool-calling loop against the gym's MCP tools — seed an isolated DB, loop completion->tool_calls->execute->feed-results->repeat with carried state, then score the final DB with the deterministic SQL verifier. This is the POMDP rollout (partial observations = tool results, actions = tool calls), not a blind plan.
gym-sweep.mts plots the missing measurement: best-of-m under the verifier over m=1..N independent rollouts, the exact subset-expectation estimator (one batch of N rollouts yields the whole curve).
Bad-data robustness: a verifier whose SQL the server rejects (a malformed query in the task) is excluded from the denominator, never charged to the agent; an all-malformed task is skipped; a failed rollout drops from the sample.
First curve (EOPS itsm, gpt-4.1, 6 tasks x 8 rollouts): best-of-1 13.5% -> best-of-8 56.7% graded score (+43.2pp), pass@8 16.7%. The canonical inference-time-scaling result, reproduced through real agentic rollouts + a real deployable verifier. This is the PARALLEL best-of-n baseline (brute-force compute + a verifier) — the baseline the sequential driver-steered loop must beat at equal compute.

* feat(bench): general agentic primitive (depth + breadth over a shared artifact) through the keystone

Collapses the EOPS spike into the real primitive. The domain now lives behind ONE seam, AgenticSurface (open an artifact, list tools, call a tool, score it, close it); EnterpriseOps is one implementation (agentic-eops.ts: seed a gym DB, MCP tools, SQL verifier). Commit0 / AppWorld / terminal-bench plug in by implementing the same interface — the drivers never change.
The drivers are domain-blind Agents run through createSupervisor().run (Agent.act over a conserved-budget Scope, spawning leaf shots via scope.spawn):
- DEPTH: ONE persistent artifact carried across shots. Each shot the agent works the tool loop; between shots a trace-analyst (selector != judge: reads the trajectory, never the score) steers the resumed session toward what's unfinished. shot n stands on shot n-1's artifact state + history. Reports the score-per-shot progress curve.
- BREADTH: K independent rollouts (each own artifact), the deployable verifier picks the best.
The leaf (one shot over a handle) is resolved per-spawn from a surface-closed registry — the open LeafExecutor seam, not bespoke per-benchmark glue. Equal-k by the conserved pool.
Verified live through the Supervisor (EOPS itsm, gpt-4.1): depth + breadth both run end-to-end; depth progress curve flat at 50% on a short itsm task (agent capability plateau, not a mechanism break — the shots+analyst loop runs). EOPS is short-horizon; the long-horizon depth curve is Commit0 (next surface impl). Exports the EOPS HTTP primitives from gym-agent for reuse.

* feat(bench): the operator atom — driver+analyst+IC fused, leads over the shared artifact

Replaces the hardcoded relay with a real operator: ONE agent whose brain reads the trace (firewalled — never the score), judges, and each round picks the single best move over the shared artifact — steer (delegate a worker shot), work (do one decisive turn itself), branch (fresh line, adopt if better), or done. The driver, the analyst, and an independent-contributor are the same self-similar atom now; how it leans is the AgentProfile, not a separate type.
Also adds the mix driver (depth main line + branch-when-stalled, adopt-if-better, the MCTS-PW shape the traces motivated), a trace hook (AgenticTraceEvent) surfacing every steer + tool call + score + finding so the loop is readable, and MODE on the trace runner.
Verified live + legible (EOPS itsm, gpt-4.1, operator on task 15): the operator sequenced recon -> retire+migrate -> ownership-check -> incidents -> done, authoring each instruction with a rationale, choosing steer/work/done with judgment — readable end to end.
Honest calibration gaps the trace exposed: it declared `done` at 86% (one verifier still failing — firewalled stop is over-eager vs the persistent relay's 100%), and the `work` move was a no-op (second-guessed an already-correct value). Both are prompt calibration, not structural.
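
The best-of-m curve mentioned in the gym-sweep commit above can be computed from a single batch of N rollouts. The sketch below is one way to do that, assuming "subset-expectation" means the expected maximum score over a uniformly random m-subset of the N observed scores; the function names are hypothetical and the repository's gym-sweep.mts may differ.

```typescript
// Binomial coefficient C(n, k) as a float; exact for the small n used here.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

// Expected best score when m rollouts are drawn without replacement
// from the N observed scores (assumed estimator, see lead-in).
function expectedBestOfM(scores: number[], m: number): number {
  const n = scores.length;
  if (m < 1 || m > n) throw new RangeError(`m must be in 1..${n}`);
  const sorted = [...scores].sort((a, b) => a - b);
  const total = choose(n, m);
  let expectation = 0;
  sorted.forEach((score, idx) => {
    const rank = idx + 1; // 1-based rank in ascending order
    // The rank-th smallest score is the subset maximum in C(rank-1, m-1) subsets.
    if (rank >= m) expectation += score * (choose(rank - 1, m - 1) / total);
  });
  return expectation;
}

// One batch of N scores yields the whole curve for m = 1..N.
function bestOfCurve(scores: number[]): number[] {
  return scores.map((_, idx) => expectedBestOfM(scores, idx + 1));
}

const curve = bestOfCurve([0, 1, 0.5]);
```

For the three illustrative scores, m=1 gives the plain mean, m=3 gives the maximum, and m=2 averages the maxima of the three possible pairs (0.5, 1, 1).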
* refactor(bench): drive the surface through the existing personify combinators (delete-and-improve)

Audit (two parallel passes over agent-runtime src/loops + agent-eval) found the hand-rolled agentic.ts drivers duplicate existing primitives: breadthDriver=fanout, depthDriver=loopUntil+createScopeAnalyst, mixDriver=widen, analyze()=createScopeAnalyst (and worse — bypasses the pool/journal/firewall), runAgentic=runPersonified. "Many personas in parallel" = agent-eval runEvalCampaign; worker tooling/skills = AgentProfile; the sequential operator ~ agent-eval runProposeReview. Genuinely new: only the AgenticSurface seam + the operator's "work" move.
agentic-personify.ts is the bridge: surfacePersona() wraps the surface registry (the shot is already a LeafExecutor) into a Persona, so the keystone fanout/loopUntil/widen + runPersonified drive it directly.
PROVEN: runBreadthPersonified (fanout + runPersonified) reproduces the hand-rolled breadthDriver EXACTLY on EOPS itsm (50%=50%, 57%=57%) — the combinators drive the surface, the bespoke drivers are redundant.
Next (each deletes code): depth→loopUntil + wire the dormant analyst live (createScopeAnalyst → buildSteerContext → the gate; combinators.ts:319 currently passes empty findings); mix→widen; operator→one LoopShape (the work-move); personas→runEvalCampaign. agentic.ts then shrinks to the surface + the operator shape.

* refactor(bench): delete the redundant drivers — breadth=fanout, operator subsumes depth/mix

Executes the consolidation. Deleted ~239 lines from agentic.ts: breadthDriver (= the existing `fanout` combinator, proven identical), depthDriver + mixDriver (the operator subsumes them — steer-only = depth, branch-when-stalled = mix), analyze() + analystExecutor (the worse, non-firewalled, non-pool-metered analyst), and the now-orphaned leaf/drainOne/perChild helpers.
What remains is minimal: the AgenticSurface seam + its shot LeafExecutor/registry, the OPERATOR (the one driver — reads the trace, judges, steers/works/branches/stops over the shared artifact, run through createSupervisor().run), and the surface persona that lets the keystone combinators drive the surface. BREADTH now runs through runPersonified + fanout (agentic-personify.ts); runAgentic is operator-only. agentic-run compares operator vs breadth(fanout); the trace runner is operator. tsc clean; consolidated path runs end-to-end on EOPS itsm.
Reuse over duplication: the drivers are the keystone combinators + the operator; personas are Persona/runEvalCampaign; the analyst is createScopeAnalyst (wiring it live in the operator is the next step). agentic.ts is now the surface + the operator, nothing more.

* feat(bench): the profile directory — roles are data, the driver profile is the lead-steerer's config

Full unification: one primitive (a profiled agent run as Agent.act in an executor, orchestrated by the Supervisor); worker / analyst / driver are the SAME agent differing only by profile. profiles.ts is the directory — RoleProfile (id/role/model/systemPrompt/skills/tools/mcp), the worker/analyst/driver profiles, and OPERATOR_TOOLS (the verbs a driver uses to lead workers: list/define/run_analyst, observe/spawn/steer_worker, stop — in-process = Scope methods, in-sandbox = the same verbs as MCP).
The driver profile is the sketched operator, formalized best-in-class: review the workers you drive (in-flight or last-shot traces), investigate (run/define analysts, fan out sub-analysts in parallel), identify everything, propose reusable findings as skills, take the lead when confident or dispatch otherwise, parallelize intelligently not frivolously, judge from traces never the score. Topology (breadth/depth) is the driver's agentic choice; the combinators are the deterministic floor.
The directory is the population the agent-eval RSI loop optimizes (runImprovementLoop: gepaDriver over prompts, a skillopt ImprovementDriver over skills, Pareto + holdout-gated; analyst findings are the optimizer's input).
Next: the operator-toolbox MCP server (Scope-as-MCP) so a sandbox coding-harness agent can BE the driver via these tools; then wire createScopeAnalyst as run_analyst's impl.

* feat(loops): scope.send — the missing operator verb (steer/interrupt a running child)

The driver's toolbox was three keystone verbs + one gap: spawn=scope.spawn, observe=scope.view, analyze=createScopeAnalyst all existed; steering a RUNNING worker did not. Adds it as an additive primitive: Scope.send(nodeId, msg) delivers an out-of-band message to a live child's executor inbox (LeafExecutor.deliver?), returning false for an unknown/settled child or a leaf with no inbox. A streaming executor drains its inbox between turns (steer / interrupt / resume); a one-shot leaf that can't be steered mid-flight simply omits deliver.
This is the one verb the sandbox operator needs that wasn't there: in-process it's this Scope method; in a box it's the steer_worker MCP tool (Scope-as-MCP); cross-box it's the Agent Bus — same verb, three transports. spawn stays budget-bounded/fail-closed, so a maximally-agentic driver that spawns + steers at will is still equal-k by construction.
Additive (optional deliver, new send method on the experimental Scope) — 683/683 tests pass, including 2 new: send steers a live child via its inbox / returns false for settled-unknown-noinbox.
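
The contract described for `Scope.send` in the commit above can be illustrated with a small stand-in. This is a sketch only: `SketchScope`, `ChildNode`, `StreamingExecutor`, `register`, and `settle` are hypothetical names invented for the example, and the real Scope carries budget, journal, and spawn logic that is omitted here. Only `send(nodeId, msg)`, the optional `deliver` on `LeafExecutor`, and the three false-returning cases come from the commit text.

```typescript
// Optional inbox: a one-shot leaf that cannot be steered mid-flight omits deliver.
interface LeafExecutor {
  deliver?: (msg: string) => void;
}

// Hypothetical bookkeeping for one child of the scope.
interface ChildNode {
  executor: LeafExecutor;
  settled: boolean;
}

class SketchScope {
  private children = new Map<string, ChildNode>();

  register(nodeId: string, executor: LeafExecutor): void {
    this.children.set(nodeId, { executor, settled: false });
  }

  settle(nodeId: string): void {
    const child = this.children.get(nodeId);
    if (child) child.settled = true;
  }

  // Deliver an out-of-band message to a live child's inbox.
  // Returns false for an unknown child, a settled child, or a leaf with no inbox.
  send(nodeId: string, msg: string): boolean {
    const child = this.children.get(nodeId);
    if (!child || child.settled) return false;
    if (!child.executor.deliver) return false;
    child.executor.deliver(msg);
    return true;
  }
}

// A streaming executor keeps an inbox that it would drain between turns.
class StreamingExecutor implements LeafExecutor {
  readonly inbox: string[] = [];
  deliver = (msg: string): void => {
    this.inbox.push(msg);
  };
}

const scope = new SketchScope();
const streaming = new StreamingExecutor();
scope.register("worker-1", streaming);
scope.register("one-shot", {}); // no deliver: cannot be steered

const steered = scope.send("worker-1", "focus on the failing verifier");
const noInbox = scope.send("one-shot", "ignored");
const unknown = scope.send("missing", "ignored");
scope.settle("worker-1");
const afterSettle = scope.send("worker-1", "too late");
```

Returning a boolean rather than throwing keeps steering best-effort: a driver can attempt to steer any child and simply learn whether the message landed.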
* feat(mcp): operator toolbox — Scope-as-MCP, the driver's verbs as tools (the sandbox driver enabler)

The driver's four verbs, now callable from inside a sandbox: createOperatorToolbox({scope, blobs, makeWorkerAgent, perWorker}) returns MCP tools backed by a live keystone Scope —
spawn_worker → scope.spawn (budget-bounded, fail-closed; a worker may itself be a DRIVER → drivers-of-drivers)
observe_worker → scope.view + the result blob
steer_worker → scope.send (deliver next-instruction/interrupt to a RUNNING worker)
stop → the driver's terminal move
Same verbs the in-process operator calls directly; this is the binding that lets a coding-harness agent running the driver/operator profile BE the driver from inside a box. createMcpServer gains an `extraTools` option (fail-loud on shadowing a built-in) so the toolbox is served over the existing JSON-RPC transport. Exported from the mcp barrel. run_analyst is deliberately deferred to the analyst-kind directory (agent-eval createTraceAnalystKind + createScopeAnalyst).
688/688 tests (5 new: spawn/observe/steer/stop handlers + server wiring + fail-closed + shadow-throws). The budget discipline holds for a maximally-agentic driver: spawn fails closed at the conserved pool, so equal-k survives a driver that spawns + steers at will.

* feat(mcp): analyst-kind directory + run_analyst/list_analysts operator verbs

The operator's review verb, completing the driver's toolbox. An analyst is not one question: a kind is ONE lens (completeness, correctness, policy, efficiency, tool-use), each emitting AnalystFindings tagged by its area. The kinds are composable DATA, the runner is generic.
- analyst-kinds.ts: the lens directory (defaultAnalystKinds) + a generic runner (runAnalystLens/makeAnalystRunner). Reuses agent-eval makeFinding for the lift and the runtime assertTraceDerivedFindings firewall (selector != judge: a finding citing a judge/verdict/score metric is rejected).
The raw-row validator + schema prompt are inlined (kind-factory internals are not root-exported) so a lens emits the same finding shape without an unstable import. AnalystKind is a deliberate subset of TraceAnalystKindSpec — upgradeable to the full agentic factory without changing the directory surface.
- operator-toolbox.ts: list_analysts (the menu) + run_analyst (apply a kind over a settled worker's trace -> findings). Both present only when the analyst seam (analystKinds + runAnalyst) is wired; a pure-dispatcher driver omits them. run_analyst fails loud if the worker has not settled.
Tests: 12 new (lens directory is plural/distinct, lift + firewall, renderTrace, injected-chat lens run, unknown-kind error; toolbox list/run over a mock scope with a settled + a running worker). Full suite 695 pass.

* feat(mcp): live operator driver — LLM tool-loop over the toolbox is the topology

The piece that makes "the root decides any topology" real. createOperatorDriverAgent returns an Agent whose act() hands the operator toolbox to an injected LLM tool-loop, so the model's tool calls ARE the run shape: how many workers, how deep, when to branch, when to stop. It runs inside the keystone Supervisor over the conserved budget pool, so an LLM that tries to spawn past the pool gets a fail-closed error (equal-k by construction), the journal records the tree, and a worker may itself carry a driver profile (drivers-of-drivers).
- await_next verb (scope.next) added to the toolbox: THE wake event — block until the next spawned worker settles, returning its deployable verdict. Workers run concurrently; spawn a batch, then await_next to collect them. The toolbox now keeps a settled() ledger.
- selector != judge holds: the driver selects on the deployable verdict (the ledger); the analyst lens reads only the trace. selectBest = best valid by score, ties to highest.
- chat is injected (provider-neutral ToolChat) so the driver is runtime-agnostic: a router chat in-process, or the harness's own tool loop pointed at the toolbox over MCP in a box.
Tests: live driver through the REAL Supervisor + conserved pool with stub workers — spawn -> await_next -> stop selects the best verdict; a spawn past the pool fails closed (equal-k). Plus a toolbox await_next/settled ledger unit test. Full suite 698 pass.

* fix(mcp): harden the operator driver — survive bad tool calls, force a real attempt

Robustness from the codex audit (FunctionCallError RespondToModel-vs-Fatal pattern) + the aec smoke (the model "answered in prose" and shipped nothing):
- Tool dispatch no longer crashes the run. Malformed JSON args are fed back as a tool error (the handler is NOT called with garbage); a handler throw is caught and returned as a tool result so the model self-corrects, instead of throwing out of act() and discarding the whole run over one correctable tool call.
- minWorkersBeforeStop: reject stop until >=N workers settled done — an operator cannot ship nothing (fixes premature done with zero spawns).
- maxBareTurns: re-nudge firmly when the model returns prose instead of a tool call (some providers ignore tool_choice:'required'); bounded so a non-driving model ends as a no-winner, never an infinite spin.
Driver tests still pass (spawn->await->stop selects best; spawn past pool fails closed).

* feat(bench): operator-driver gate arm on aec — adaptive ≤K workers vs blind@K

The non-blind topology arm of the open gate, router-only (deterministic verify.py judge, no sandbox). operator-gate.mts runs the live operator driver per task: it may spawn UP TO K solver workers (each = one router solve + verify.py grade, reusing benchSolveLeaf), await their deployable verdicts, and stop when satisfied. The conserved pool caps it at K workers (equal-k vs random@K); it may use fewer (adaptive), at the cost of its own coordination tokens, which are accounted separately.
- router-client.ts: routerChatWithTools — a tool-calling completion (OpenAI message shape, configurable tool_choice), the operator driver's LLM seam. Same real-usage/fail-loud discipline as routerChatWithUsage.
- keystone-gate.ts: export benchSolveLeaf so the operator's workers reuse the deployable solve-and-grade leaf (the spawn_worker `task` string becomes the worker's strategy).
Validated on 2 tasks: conduit-fill resolves with 1 worker; catenary spawns 2 then stops (adaptive). Writes an operator@K corpus that corpus-report pairs against random@K/diverse@K.

* fix(loops): close the reserve→factory budget-leak window — protect the equal-k invariant

From the codex audit. In scope.spawn the conserved-pool reservation is taken (pool.reserve) BEFORE the executor factory runs (resolved.value(spec, ctx)); a synchronous factory throw there leaked the reservation, because runChild — which reconciles the ticket — is never reached. A leaked reservation silently breaks total ≡ free + reserved + committed, the equal-k invariant the whole instrument rests on.
- scope.spawn: wrap the post-reserve region in try/catch; on a synchronous throw release the reservation with zero spend (pool.reconcile(ticket, zeroSpend())) and rethrow. runChild is the last statement and never sync-throws, so there is no double-reconcile.
- BudgetPool.assertNoOpenTickets(): the leak detector — fail loud if any reservation remains. Called at the supervisor join barrier on the success path (a leak would corrupt spentTotal).
- Tests: a synchronous factory throw releases the reservation (pool fully restored, no open ticket); assertNoOpenTickets throws while a ticket is open and passes once reconciled. Full suite 700 pass.

* feat(bench): make the operator + blind gate runners benchmark-agnostic (BENCH=…)

Both gate runners now resolve their adapter via resolveAdapter(process.env.BENCH ?? 'aec-bench') instead of hardcoding aec, and stamp the corpus benchmark field from adapter.name.
One BENCH=… flag points operator-gate.mts and aec-gate.mts at any of the 16 registered adapters.
Validated: a FinSearchComp smoke (FINSEARCHCOMP_FIXTURES=1 N=3 K=3) ran clean end-to-end — the operator resolved the finsearch adapter, drove adaptive workers (2.67/task), and the per-record LLM judge scored each. (It scored 0% there because finsearch needs WEB SEARCH and the current benchSolveLeaf worker is a plain router chat with none — a worker-capability gap, not a runner bug; the search/agentic-worker wiring is the next step.)

* feat(bench): operator driver over EnterpriseOps — the full-vision multi-turn test

The new createOperatorDriverAgent over the EOPS AgenticSurface, where every spawn_worker is a real multi-turn agentic rollout (seeded gym DB → MCP tool loop → verifier score), not a single router chat. So the operator's full verb set fires for the first time: spawn_worker (a fresh rollout with a strategy), await_next (deployable verifier score), run_analyst (trace lenses over the worker transcript — the analyst directory live), stop. Conserved pool caps it at K rollouts (equal-K vs the breadth best-of-K baseline, runBreadthPersonified).
operatorSurfaceRegistry wraps shotExecutor (fresh per spawn) so the operator's strategy string maps to a ShotTask steer over the task; the analyst seam is wired from defaultAnalystKinds + makeAnalystRunner.
Smoke (TASKS=1 K=2 gpt-4.1) ran clean: operator spawned 2 rollouts, 8 turns, selected best; breadth baseline ran. Scaling next for the operator-vs-breadth signal.

* fix(mcp): hard maxWorkers equal-k cap on the operator driver + paired sign test

The first EOPS operator-vs-breadth run (+9.7pp partial score, n=12) was confounded: short workers refund their unspent iteration budget, so the conserved pool readmitted and the operator ran 5-6 rollouts vs breadth's 4 — more compute than control. A win under unequal compute is not a topology win (repo law).
- createOperatorDriverAgent gains maxWorkers: a HARD cap on successful spawns, independent of the pool. spawn_worker past it returns a typed cap error (handler not called). Set maxWorkers=K for an exactly-≤K-rollout comparison; the coordination-token overhead is disclosed separately (worker-K is equal). Test: spawns past the cap are rejected even with pool budget to spare.
- operator-eops.mts: maxWorkers=k + a two-sided exact binomial sign test over score-discordant tasks, so the operator-vs-breadth verdict carries wins/losses/ties + p, not just a mean delta.

* fix(mcp): exhaustUnlessResolved — stop the operator giving up early at equal-K

The probe exposed under-exploration: the operator stopped after one low-scoring worker (minWorkers met) while blind best-of-K used its full budget and scored higher. Adaptive early-stop is good for easy wins but loses on hard tasks. exhaustUnlessResolved (with maxWorkers set) rejects stop while NO worker has fully resolved and spawn budget remains — forcing the operator to use its full K with analyst-informed strategies. Coverage then ≥ blind best-of-K, so a score comparison isolates STRATEGY quality, not a give-up-early compute-vs-score tradeoff. Wired into operator-eops.mts.

* fix(bench): bounded retry on EOPS seed — survive transient SQLite 5xx under concurrent seeding

The clean n=20 EOPS run logged "unable to open database file" 500s (the gym's SQLite under concurrent within-task seeding — up to ~8 simultaneous seeds), silently dropping some worker rollouts to 0 and adding noise to the operator-vs-breadth comparison. seed() now retries a transient 5xx up to 4 times with linear backoff (fresh database_id per attempt); non-5xx fails loud immediately. Makes a scaled EOPS run trustworthy.
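
The paired sign test mentioned in the maxWorkers commit above reduces to a small amount of arithmetic. The sketch below assumes a common convention (ties are counted but dropped from the test, and the two-sided p-value doubles the smaller binomial tail, capped at 1); the commit text does not say which convention operator-eops.mts uses, and `signTest`, `binomial`, and the sample scores are hypothetical.

```typescript
interface SignTestResult {
  wins: number;
  losses: number;
  ties: number;
  pValue: number;
}

// Binomial coefficient C(n, k) as a float; exact for small n.
function binomial(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

// Two-sided exact binomial sign test over paired per-task scores.
// Each pair is [operatorScore, baselineScore] for the same task.
function signTest(pairs: Array<[number, number]>): SignTestResult {
  let wins = 0;
  let losses = 0;
  let ties = 0;
  for (const [operatorScore, baselineScore] of pairs) {
    if (operatorScore > baselineScore) wins++;
    else if (operatorScore < baselineScore) losses++;
    else ties++;
  }
  const n = wins + losses; // only score-discordant tasks enter the test
  if (n === 0) return { wins, losses, ties, pValue: 1 };
  const k = Math.min(wins, losses);
  // P(X <= k) for X ~ Binomial(n, 1/2) under the null of no difference.
  let tail = 0;
  for (let i = 0; i <= k; i++) tail += binomial(n, i);
  tail *= Math.pow(0.5, n);
  return { wins, losses, ties, pValue: Math.min(1, 2 * tail) };
}

// Illustrative scores only, not data from the PR.
const result = signTest([
  [0.9, 0.5],
  [0.7, 0.6],
  [1.0, 0.8],
  [0.4, 0.2],
  [0.6, 0.1],
  [0.5, 0.5],
]);
```

With five wins, no losses, and one tie, the smaller tail is 1/32 and the two-sided p-value is 2/32 = 0.0625, which shows why a mean delta over a handful of tasks is weak evidence on its own.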
* docs(canon): split Gate A (inner GO/NO-GO) from Gate B (flywheel success); fix equal-k + oracle/verifier

The canon let one word — "THE gate" — mean both a narrow inner-loop diagnostic AND the program's success criterion, contradicting learning-flywheel.md (outer loop is the product; a within-run zero is fine). That mis-specified measurement contract made a predicted inner-loop null read as a program verdict. Grounded against the real files + this session's EOPS evidence (stateful multi-turn workers under a HARD equal-K cap beat blind +6.1pp — so equal-compute and depth coexist).
Naming/scoping fixes only — the engine and the equal-compute rigor are correct and kept:
- learning-flywheel.md: add the ONE success definition (Gate B) — cross-run score-vs-run slope under a frozen-controller control, equal per-run compute, deployable-checker graded.
- roadmap-rsi.md + architecture-interpretations.md §5 + HARNESS.md: rename "THE gate" → Gate A (the inner GO/NO-GO for building the recursive-driver layer); scope "if it fails, stop/delete" to within-run steering only — never the corpus+controller product.
- equal-k → equal-COMPUTE everywhere: it bounds Σ rollouts × turns; k counts ROLLOUTS, each may be a full multi-turn/stateful trajectory. Statelessness was the corpus-replay instrument's choice, not the principle.
- architecture.md §7: split the overloaded "selector ≠ judge" into ORACLE (banned from selection AND steering), VERIFIER (a deployable checker — ALLOWED in both; what depth needs), WRITE-ONLY JUDGE (banned from steering only).
- architecture.md §9: demote FinSearchComp from primary to a deployable-selector negative control; the primary depth/operator bench is stateful (EnterpriseOps-Gym / commit0 / swe-bench).

* refactor(drivers): the driver is harness+profile+MCP — delete dead pair + anchor the decision

Per the vision-owner: a driver is an agent f(trace, outputs) -> thoughts + MCP tool calls, RUN BY a coding harness in a sandbox.
The harness already owns the loop, tool-calling, subagent spawning, and native idioms (parallelize/ultrathink/dynamic-workflow). We own only the steering MCP, the profiles, and the orchestrator. There is no createDriver function in the product; the driver is launched, not constructed. The in-process LLM tool-loop + the create*Driver factory zoo + the TopologyMove DSL were a measurement shim that calcified into a fake product — slop to delete.
- docs/architecture-driver.md: the decision, the one Decider seam (code-rule / in-process LLM / sandbox-harness all unify), the delete-vs-keep table, and the phased build order. The anchor that prevents re-drift (companion to the Gate A/Gate B canon fix).
- delete bench/src/drivers/{llm-meta-driver,progressive-widening}.ts — a mutually-referential dead pair, 0 external callers (the real widen lives in personify/wave-types.ts + the WidenGate type in supervise/types.ts, both kept).

* refactor(drivers): bench has ZERO drivers — delete the dead blind control; sharpen the anchor

The repo builds ONE driver, in the library (src/). bench is a thin experiment consumer: adapter + "run the one library driver at a profile" + score. A blind control is not a bench driver — it is the one library driver with a blind decider; the equal-compute guard is experiment infra.
- delete bench/src/drivers/flat-harness.ts (the gate's blind control — 0 callers, superseded by keystone-gate's fanout). bench/src/drivers/ is now empty/gone.
- architecture-driver.md: add the "bench has ZERO drivers" section + flag the mislocated library abstractions (AgenticSurface / shotExecutor / agenticRegistry in bench/src/agentic.ts) that must move to src/ — bench is squatting on the library because the experiment predated the one driver.
* chore(deep-clean): delete dead example orphans + the dead preset registry

Deep-clean batch 1-2 (caller-verified, 0 importers):
- src/loops/personify/examples/{code,meta-orchestrator,research}.ts — proof-of-concept demos, not barrel-exported, 0 importers, 0 tests. The examples/ dir is now gone.
- src/loops/drivers/planners.ts + tests/loops/planners.test.ts — the dead preset registry (blind/PROMPT_PLANNERS/resolvePlanner), own-test-only; superseded by the decider model in docs/architecture-driver.md. Barrel re-exports dropped from src/loops/index.ts.

* chore(deep-clean): drop planners barrel re-exports + lint-format operator-driver

* chore(deep-clean): delete the dead in-process EOPS experiment cluster (10 files)

A closed dead subgraph — 0 external importers, in no package.json script, imported by nothing kept (verified by caller-grep). The in-process operator experiment + bench squatting on library abstractions, both slop per docs/architecture-driver.md (the driver is harness+profile+MCP; bench holds no drivers/abstractions).
Deleted: agentic.ts (AgenticSurface/shotExecutor/agenticRegistry — library abstractions, to be rebuilt clean in src/), gym-agent.ts + agentic-eops.ts (EOPS gym surface — wiring preserved in memory), operator-eops.mts (the in-process EOPS experiment runner), agentic-personify.ts, agentic-run.mts, agentic-trace.mts, keystone-gate-probe.mts, gym-sweep.mts, gym-depth.mts.
The EOPS operator>blind +6.1pp result + the gym recipe live in persistent memory; the experiment re-runs through the product (sandbox-harness driver + MCP), not the in-process shim. Kept: keystone-gate / operator-gate / aec-gate / corpus-* / experiment / gepa-refine.

* docs(canon): rewrite the atom as the recursive agent tree; fold in the driver doc; wire the hook model

Consolidate into the existing spine — no new doc, less is more, eliminate the framing that led to bad decisions.
- architecture.md §1 ("the atom"): replace the Program/TopologyMove opcode DSL (the slop) with the true model — ONE agent = AgentProfile + harness in a Scope; it calls a tool; one tool (spawn over MCP) creates a child agent; topology EMERGES from spawn/steer, no DSL. The recursive execution tree IS the product; we own only the MCP + profiles + orchestrator. Checks (analyst/judge/verifier) are DATA the driver authors on the fly (define_check/run_check); oracle≠verifier≠judge made explicit.
- §1b (new): the lifecycle hook stream from the merged PR (#162/#163) — agent.{run,turn,tool_call,spawn,child,plan,decision} × {before,after,error,event}; runLoop/toolLoop/Scope.spawn are producers; hooks attach at the execution/spawn boundary, not the profile. This is what the topology viz reads. Open gap flagged: Scope.spawn doesn't emit into it yet.
- Header status: product-core (supervise tree + sandbox seam + MCP + corpus + hooks) vs driver-as-code slop being deleted; runLoop is one backend, not the center.
- §6 corollary: bench holds ZERO drivers/abstractions — thin experiment consumer only.
- Delete docs/architecture-driver.md (folded into §1) + drop its README pointers.

* refactor(kill): delete the in-process operator-driver loop + its experiment arm

The in-process LLM tool-loop was the measurement shim that calcified into a fake product (arch §1: the driver is a sandbox harness, not an in-process loop). Deleted operator-driver.ts (the loop + Decider/ToolChat types), operator-driver.test.ts, bench/src/operator-gate.mts (the AEC operator arm that drove the shim — its result is in memory; the blind arms in aec-gate.mts stay), and the mcp/index barrel exports. Kept operator-toolbox.ts (the steering MCP) + analyst-kinds.ts (the checks) — the product surface the sandbox-harness driver will mount. typecheck + 338 loop/mcp tests green.
* refactor(kill): delete createRefineDriver factory; kernel tests use a 15-line local helper

refine.ts + refine.test.ts + the barrel export gone (no production caller, only the barrel + a README mention). The 3 kernel/composition tests that used it as a generic driver-under-test now import a tiny tests/loops/refine-driver.ts (test scaffolding, not a product factory). typecheck + 31 tests green.

* refactor(kill): delete createFanoutVoteDriver factory; inline a 4-line fanout driver where used

fanout-vote.ts + fanout-vote.test.ts + the barrel exports gone. The fanout BEHAVIOR (N copies round 0 → kernel picks best-valid) is a 4-line inline Driver where production needs it (profiles/coder.ts:multiHarnessCoderFanout, the delegate_code N>1 path) and in the two examples; composition.test uses the shared tests/loops/refine-driver.ts helper. No public create*Driver factory + Decision type + test-helper exports, just the few lines of behavior at each call site. typecheck + lint clean; coder + composition tests green.

* refactor(kill): delete createSandboxPlanner, the LLM-emits-TopologyMove envelope bridge

The last factory. createSandboxPlanner wrapped an LLM-in-a-box into a TopologyPlanner that decoded JSON topology-move envelopes, the DSL the native-language-over-MCP model replaces (arch §1: the harness steers in natural language, topology emerges from spawn/steer, no envelope DSL). No production caller (only docstring mentions).

Deleted sandbox-planner.ts + the barrel exports + the `describe('createSandboxPlanner')` block + the two envelope-decoding it-cases + the now-dead plannerAndWorkerClient helper. The kept dynamic-driver tests use inline code TopologyPlanners. createDynamicDriver stays as a kernel backend. Full suite green.

* refactor(names): delete drivers/ dir; move dynamic.ts up (it's a loop backend, not a driver)

A drivers/ directory with one file actively lies after 'the driver is not a code factory'.
dynamic.ts is the planner-driven runLoop backend, so it lives next to run-loop.ts now. Import paths updated; typecheck + dynamic tests green.

* refactor(names): operator-toolbox→agent-bus + analyst-kinds→checks; gitignore data dumps

Names now match the model (arch §1):
- operator-toolbox → agent-bus: the MCP an agent uses to spawn/observe/steer child agents (the Agent Bus). 'operator' was the deleted in-process driver. createOperatorToolbox→createAgentBus etc.
- analyst-kinds → checks: checks are DATA with kind analyst|judge|verifier; AnalystKind→Check, defaultAnalystKinds→defaultChecks, makeAnalystRunner→makeCheckRunner, runAnalystLens→runCheck.
- .gitignore: bench/data, bench/experiments, __pycache__, .claude (stop accidental data dumps).

typecheck + lint + renamed tests green.

* refactor(names): src/loops → src/runtime: the dir was named after one backend (runLoop)

'loops' named the whole execution substrate after its least-central piece (the run-loop backend), burying the recursive agent tree (supervise/) under it. Renamed to src/runtime/, the agent execution runtime: the recursive tree + the loop backends + the sandbox seam + personify. Internal relative imports are unaffected (siblings); outside importers repointed to ../runtime.

NON-BREAKING: ./loops stays a back-compat export alias (tsup builds src/runtime/index.ts → both dist/runtime.js and dist/loops.js; package.json keeps ./loops, adds ./runtime). External consumers (agent-knowledge imports ./loops) keep working; new code should import ./runtime. typecheck + build (both entries) + lint + 679 tests green.

* refactor(names): agent-bus toolbox → coordination (free the name for the real protocol)

The MCP verb set (spawn/observe/steer/check/stop a parent uses over its children) is NOT a transport, and 'agent-bus' is already a real, distinct thing: the cross-org call protocol in docs/agent-bus-protocol.md (forwarded billing identity, depth-4 ceiling, trace stitching).
Renaming the toolbox to agent-bus collided with it. It's now `coordination` (Scope-as-MCP): createAgentBus→createCoordinationTools, AgentBus→CoordinationTools. The docstring now states it's the verb API, and that steer rides a transport (in-process Scope.send / SDK SessionMessage / the agent-bus protocol): one verb, several bindings. typecheck + lint + test green.

* docs(canon): Gate A/B split + verifier-grounded deployable selector

Carry the prior-session canon: Gate A (inner refine@k>random@k GO/NO-GO) vs Gate B (cross-run flywheel slope) split across learning-flywheel/roadmap/HARNESS; add verifierGroundedSelect (highest deployable-checker pass-count, ties→earliest) + its assertion test; fix the coordination-toolbox label in the MCP server comment.

* feat(runtime): emit lifecycle hooks from Scope.spawn/settle: one observable tree

The recursive Scope was a producer only of internal journal events (SpawnEvent), so the live tree was replay-only, invisible to the same RuntimeHooks stream runLoop/toolLoop already feed. Thread SupervisorOpts.hooks into the root Scope and emit on the agent-centric stream: agent.spawn at child creation (childId, label, runtime, budget, depth) and agent.child at settle (status, score/valid or reason/infra, spend). Fire-and-forget via notifyRuntimeHookEvent (non-throwing); the journal stays the durable record, the hook stream is its live projection. This is the source the topology visualization reads: the recursive agent tree is now one stream across every backend. Updates architecture.md §1b (gap closed).

Tests: +2 in tests/loops/supervise.test.ts (stream emits with parent/child+status; stays journal-only + silent when no hooks wired). 681 pass, typecheck+lint clean.

* feat(topology): live recursive-agent-tree view over the lifecycle stream

Fold the one hook stream (agent.spawn/child from Scope, agent.run + turn/tool_call/plan/decision from the loops) into the recursive agent tree and render it.
An agent node is born from agent.spawn (childId) or the root agent.run (runId); a step advances the agent it belongs to (matched by runId/parentId); agent.child/agent.run:after settle it with status + deployable score. Pure projection (no I/O, no backend coupling), so the same fold drives a CLI render, a TUI, or a web tree, and renderTopologyTree is pure over a folded tree (journal-replay friendly). This is the topology visualization architecture.md §1b names; it consumes exactly the stream the prior commit made Scope.spawn emit. New export: ./topology.

Tests: tests/topology.test.ts (structure + status + step attribution, ASCII render, compact/maxDepth, unknown-agent drop, render purity). 686 pass; build + verify:package + typecheck + lint clean.

* docs(research): land experimental belief-state + program-synthesis research drafts

Forward-looking research-track drafts (advisory, NOT the canonical spine), each carrying its Status banner: belief-state-learner-spec (BUILD-ON-GREEN, gated on a positive diverse@k gate), belief-agent-research-agenda (offline-on-committed-corpora tier + gated learner tier), program-research-plan (fund-or-kill audit), codex-techniques-audit (codex adoption report). Cross-link the existing research docs; reference src/loops/ (main's layout).
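The fold described in the feat(topology) commit (nodes born from agent.spawn or the root agent.run, steps attributed by runId/parentId, settlement on agent.child or agent.run:after, events for unknown agents dropped) could look roughly like the sketch below. It is a minimal illustration under assumptions: `TreeEvent`, `AgentNode`, `foldTree`, and `renderTree` are hypothetical names and shapes, not the actual ./topology export or its renderTopologyTree function.

```typescript
// Hypothetical event and node shapes; only the target/phase strings come from the commit text.
type Status = 'running' | 'done' | 'failed';

interface TreeEvent {
  target: 'agent.run' | 'agent.spawn' | 'agent.child' | 'agent.turn' | 'agent.tool_call';
  phase: 'before' | 'after' | 'error' | 'event';
  runId?: string;    // the agent a run or step belongs to
  parentId?: string; // the spawning agent, on agent.spawn
  childId?: string;  // the spawned or settled agent
  label?: string;
  status?: Status;
}

interface AgentNode {
  id: string;
  label: string;
  status: Status;
  steps: number;
  children: AgentNode[];
}

// Pure fold: no I/O, just events in and a tree out.
function foldTree(events: TreeEvent[]): AgentNode | undefined {
  const nodes = new Map<string, AgentNode>();
  let root: AgentNode | undefined;
  for (const e of events) {
    if (e.target === 'agent.run' && e.phase === 'before' && e.runId) {
      // The root agent is born from the first agent.run.
      root = { id: e.runId, label: e.label ?? e.runId, status: 'running', steps: 0, children: [] };
      nodes.set(root.id, root);
    } else if (e.target === 'agent.spawn' && e.childId && e.parentId) {
      const parent = nodes.get(e.parentId);
      if (!parent) continue; // unknown parent: drop the event
      const child: AgentNode = {
        id: e.childId, label: e.label ?? e.childId, status: 'running', steps: 0, children: [],
      };
      parent.children.push(child);
      nodes.set(child.id, child);
    } else if (e.target === 'agent.child' && e.childId) {
      // Settle a child with its reported status.
      const node = nodes.get(e.childId);
      if (node && e.status) node.status = e.status;
    } else if (e.target === 'agent.run' && e.phase === 'after' && e.runId) {
      const node = nodes.get(e.runId);
      if (node) node.status = e.status ?? 'done';
    } else if ((e.target === 'agent.turn' || e.target === 'agent.tool_call') && e.phase === 'after') {
      // A completed step advances the agent it belongs to.
      const node = nodes.get(e.runId ?? '');
      if (node) node.steps += 1;
    }
  }
  return root;
}

// Pure render over a folded tree: one indented line per agent.
function renderTree(node: AgentNode, indent = ''): string {
  const line = `${indent}${node.label} [${node.status}] steps=${node.steps}`;
  return [line, ...node.children.map((c) => renderTree(c, indent + '  '))].join('\n');
}
```

Because both functions are pure, the same event list replayed from a journal yields the same tree and the same text, which is what makes a CLI, TUI, or web view interchangeable consumers of one fold.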

fix(runtime): use agent-centric hook target names

4b25194

drewstone merged commit 7c2507b into main Jun 5, 2026
1 check passed

drewstone merged 1 commit into main from feat/agent-hook-target-names
