feat: durable-run supervisor — cross-worker / cross-DO durability #29
Merged
runReconnectableTurn (#23) recovered an interrupted turn only on a retry re-invocation, left an unattended window between worker death and that retry, depended on the sandbox runtime buffering events, and made the correctness-critical reconnect a per-product callback. It checkpointed the run handle as "a completed step at index 0", an admitted migration-dodge. This relocates the durability boundary off the ephemeral worker onto an always-attached supervisor that owns the run.

Substrate (platform-agnostic, tested in Node):
- DurableRunStore gains an ordered, replayable stream-event log (appendStreamEvent / readStreamEvents), idempotent on eventId so a reconnecting adapter that re-yields a boundary event cannot double-log. RunHandle is real run-row state via setRunHandle, not a step hack. Schema v2 (durable_stream_events table + durable_runs.handle_json), implemented across the in-memory / file-system / D1 stores.
- runSupervisedTurn: drains a run's events into the stream log as they flow, persists the reconnect pointer the instant the substrate yields it, heartbeats the lease. A fresh supervisor reads the log for its cursor and resumes via the adapter (fresh / resumed / replayed).
- SandboxReconnectAdapter: one typed, conformance-tested contract. The dangerous reconnect glue lives once per substrate, never per product.

Cloudflare host (thin):
- SessionSupervisorDO: a Durable Object that hosts runSupervisedTurn; alarm() re-attaches a run a dropped response stream abandoned. CF types are structural (no @cloudflare/workers-types dep).

runReconnectableTurn / run-handle.ts are removed (superseded; no product consumed them). RunHandle + the four-mode resolution carry forward.

15 new tests incl. the cross-worker chaos keystone (kill mid-stream, resume, no gap, no duplicate); suite 251 green; typecheck + biome + build clean.
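To make the "idempotent on eventId" property of the stream-event log concrete, here is a minimal in-memory sketch. Only the names appendStreamEvent, readStreamEvents and eventId come from the description; the StreamEvent / LoggedEvent shapes, the seq field, the boolean return value and the InMemoryStreamLog class are assumptions made for illustration, not the package's actual signatures.

```typescript
// Hypothetical shapes: the real DurableRunStore types may differ.
interface StreamEvent {
  eventId: string; // stable id supplied by the adapter
  payload: string;
}

interface LoggedEvent extends StreamEvent {
  seq: number; // position in the ordered log (assumed field)
}

class InMemoryStreamLog {
  private readonly logs: Record<string, LoggedEvent[]> = {};

  /** Appends at most once per eventId; returns false for a duplicate. */
  appendStreamEvent(runId: string, event: StreamEvent): boolean {
    const log = this.logs[runId] ?? (this.logs[runId] = []);
    // A re-yielded boundary event carries an eventId we already logged.
    if (log.some((e) => e.eventId === event.eventId)) return false;
    log.push({ eventId: event.eventId, payload: event.payload, seq: log.length });
    return true;
  }

  /** Returns the events with seq >= fromSeq, in append order. */
  readStreamEvents(runId: string, fromSeq = 0): LoggedEvent[] {
    return (this.logs[runId] ?? []).slice(fromSeq);
  }
}
```

Because the duplicate check sits inside the store, an adapter may safely re-yield the last event it delivered before a disconnect; the log length doubles as a resume cursor in this sketch.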
tangletools (Contributor) approved these changes on May 22, 2026 and left a comment:
Verified. Relocates the durability boundary off the ephemeral worker onto an always-attached supervisor: DurableRunStore gains an idempotent ordered stream-event log (3 stores + schema v2), runSupervisedTurn drains into it + heartbeats + resumes from cursor via a typed conformance-tested SandboxReconnectAdapter, SessionSupervisorDO is a thin DO host with alarm()-driven orphan re-attach. Diff-audit caught + fixed the completeStep/handle ordering bug. Cross-worker chaos keystone test green (kill mid-stream → resume → no gap, no dup); suite 251 green; typecheck + biome + build clean.
drewstone added a commit that referenced this pull request on May 22, 2026:

…ples (#30)

Three days of substrate changes (#28 model-resolution, #29 durable-run supervisor + DO) shipped with no docs and no real-workerd test. This catches the surface up.

- Real-workerd integration test for SessionSupervisorDO via @cloudflare/vitest-pool-workers: runs the DO under actual workerd with real DurableObjectState + real ReadableStream + real Response. Pinned to a vitest-3-compatible pool version; separate vitest.workers.config.ts so the main Node suite is unaffected.
- README rewrite: entry-point table covers every current primitive (durable turn, supervisor + DO, model resolution, defineAgent, durable chat-turn engine, analyst loop, platform clients).
- docs/concepts.md: real mental-model doc covering the five layers, the three durability levels, the reconnect-adapter contract, model resolution, and the reading order for new consumers. Replaces the dead src/index.ts JSDoc link.
- examples/model-resolution: resolveChatModel + validateChatModelId (fail-closed) + withConfiguredModels. Runs offline.
- examples/durable-supervisor: cross-worker resume keystone. w1 drains 2 of 5, lease lapses, w2 resumes from cursor, full sequence exactly once. Runs offline.
- examples/agent-into-reviewer: pipe one runtime's stream into a reviewer agent (the "two-runtime" pattern). Runs offline.

Verified: typecheck 0, Node suite 251, workerd suite 2 (real DO), biome clean, build green. All three new examples runnable end-to-end.
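The cross-worker resume scenario in the durable-supervisor bullet (w1 drains 2 of 5, w2 resumes from the cursor) can be illustrated with a small synchronous sketch. This is not the contents of examples/durable-supervisor: sandboxEventsFrom, DurableLog, supervise and the dieAfter parameter are hypothetical stand-ins, and lease handling is left out entirely. The second worker deliberately re-attaches one event early to show that a re-yielded boundary event is absorbed by the dedup in the log.

```typescript
type RunEvent = { eventId: string; data: string };

// Stand-in for a sandbox that can re-serve its 5-event stream from any index.
function sandboxEventsFrom(index: number): RunEvent[] {
  const out: RunEvent[] = [];
  for (let i = index; i < 5; i++) out.push({ eventId: `e${i}`, data: `chunk-${i}` });
  return out;
}

// Durable, ordered log that ignores an eventId it has already stored.
class DurableLog {
  private readonly events: RunEvent[] = [];

  append(event: RunEvent): void {
    const known = this.events.some((e) => e.eventId === event.eventId);
    if (!known) this.events.push(event);
  }

  all(): RunEvent[] {
    return this.events.slice();
  }
}

// One supervisor attachment. `dieAfter` simulates the worker being killed
// after handling that many events (assumed test hook, not a real option).
function supervise(log: DurableLog, dieAfter = Number.POSITIVE_INFINITY): "completed" | "died" {
  const cursor = log.all().length; // cursor is derived from the durable log
  const start = Math.max(0, cursor - 1); // overlap by one boundary event
  let drained = 0;
  for (const event of sandboxEventsFrom(start)) {
    if (drained >= dieAfter) return "died";
    log.append(event); // the duplicate boundary event is dropped here
    drained++;
  }
  return "completed";
}
```

Calling supervise(log, 2) and then supervise(log) on the same DurableLog leaves e0 through e4 in the log exactly once each, which is the no-gap / no-duplicate property the keystone test asserts.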
Summary
runReconnectableTurn (#23) recovered an interrupted turn only on a retry re-invocation, leaving an unattended window between worker death and that retry, depending on the sandbox runtime to buffer events, and making the correctness-critical reconnect a per-product callback. It stored the run handle as "a completed step at index 0", an admitted migration-dodge.

This evolves #23 (the RunHandle shape + four-mode resolution carry forward) by relocating the durability boundary off the ephemeral worker onto an always-attached supervisor.

Substrate (platform-agnostic, tested in Node)
- DurableRunStore: appendStreamEvent / readStreamEvents, idempotent on eventId so a reconnecting adapter that re-yields a boundary event cannot double-log. Replay is now guaranteed by our store, not hoped-for from the sandbox.
- RunHandle is real run-row state (setRunHandle + durable_runs.handle_json); the step-0 hack is gone.
- Schema v2: durable_stream_events table + the handle column, implemented across the in-memory / file-system / D1 stores.
- runSupervisedTurn: drains events into the log as they flow, persists the reconnect pointer the instant the substrate yields it, heartbeats the lease. A fresh supervisor reads the log for its cursor and resumes via the adapter: fresh / resumed / replayed.
- SandboxReconnectAdapter: one typed, conformance-tested contract. The dangerous reconnect glue lives once per substrate, never per product.

Cloudflare host (thin)
- SessionSupervisorDO: a Durable Object hosting runSupervisedTurn; alarm() re-attaches a run that a dropped response stream abandoned. CF types are structural (no @cloudflare/workers-types dep, the same discipline as D1DatabaseLike).
- runReconnectableTurn / run-handle.ts are removed: superseded; no product consumed them yet.

Diff-audit (self-review, concurrency-focused)
Caught + fixed one ordering bug: the handle was flipped to completed before completeStep, so a crash in that window made a finished run re-run instead of replay. completeStep now lands first; it is the replay path's "turn finished" signal.

Out of scope (deliberate)
Client↔edge SSE resumability (the event log is its enabler — a clean follow-on); WebSocket Hibernation (cost optimization, not correctness); a generic workflow engine (the sandbox runtime is the durable engine).
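Returning to the ordering fix from the diff-audit: the sketch below shows why the order of the two writes matters when a crash lands between them. It is a simplified model built on assumptions. RunRow, resolveMode and finishTurn are hypothetical names, and only three of the four resolution modes are modelled, with "fresh" standing in for a re-run.

```typescript
type HandleStatus = "running" | "completed";
type Mode = "fresh" | "resumed" | "replayed";

// Assumed minimal run row: a handle status plus a step-completed flag.
interface RunRow {
  handleStatus?: HandleStatus;
  stepCompleted: boolean;
}

// Assumed resolution rule: a completed step is the "turn finished" signal.
function resolveMode(row: RunRow): Mode {
  if (row.stepCompleted) return "replayed";
  if (row.handleStatus === "running") return "resumed";
  return "fresh"; // nothing to replay or re-attach to, so the turn re-runs
}

// Performs the two end-of-turn writes in the given order, optionally
// simulating a crash after the first write.
function finishTurn(row: RunRow, order: "step-first" | "handle-first", crashBetween: boolean): void {
  const markStep = (): void => {
    row.stepCompleted = true;
  };
  const markHandle = (): void => {
    row.handleStatus = "completed";
  };
  const first = order === "step-first" ? markStep : markHandle;
  const second = order === "step-first" ? markHandle : markStep;
  first();
  if (crashBetween) return; // simulated crash between the two writes
  second();
}
```

In this model, "handle-first" with a crash leaves a completed handle and no completed step, so the next supervisor resolves "fresh" and re-runs a finished turn. With "step-first", the same crash resolves "replayed".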
Test plan
- pnpm typecheck: clean
- pnpm test: 251 passed. 15 new: the cross-worker chaos keystone (kill mid-stream, resume on a fresh worker, no gap / no duplicate); seam dedup; two successive deaths; lease/heartbeat; lease-lost abort; DO fetch + alarm re-attach; adapter conformance
- pnpm exec biome check src: clean
- pnpm build: green

Pursuit spec: .evolve/pursuits/2026-05-22-durable-run-supervisor.md.