
feat: durable-run supervisor — cross-worker / cross-DO durability#29

Merged: drewstone merged 1 commit into main from feat/durable-run-supervisor on May 22, 2026.


@drewstone
Contributor

Summary

runReconnectableTurn (#23) recovered an interrupted turn only when a retry re-invoked it. That left an unattended window between worker death and the retry, depended on the sandbox runtime to buffer events, and made the correctness-critical reconnect a per-product callback. It also stored the run handle as "a completed step at index 0", an admitted migration-dodge.

This evolves #23 (the RunHandle shape + four-mode resolution carry forward) by relocating the durability boundary off the ephemeral worker onto an always-attached supervisor.

Substrate — platform-agnostic, tested in Node

  • Stream-event log on DurableRunStore: appendStreamEvent / readStreamEvents, idempotent on eventId so a reconnecting adapter that re-yields a boundary event cannot double-log. Replay is now guaranteed by our store, not hoped-for from the sandbox.
  • RunHandle is real run-row state (setRunHandle + durable_runs.handle_json) — the step-0 hack is gone.
  • Schema v2: durable_stream_events table + the handle column, implemented across the in-memory / file-system / D1 stores.
  • runSupervisedTurn — drains events into the log as they flow, persists the reconnect pointer the instant the substrate yields it, heartbeats the lease. A fresh supervisor reads the log for its cursor and resumes via the adapter: fresh / resumed / replayed.
  • SandboxReconnectAdapter — one typed, conformance-tested contract. The dangerous reconnect glue lives once per substrate, never per product.
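
The idempotent append described in the first bullet can be sketched as follows. This is a minimal in-memory illustration, not the store shipped in this PR: only the names appendStreamEvent, readStreamEvents, and eventId come from the description; the class name, the `seq` and `payload` fields, the `fromSeq` parameter, and the boolean return value are assumptions made for the example.

```typescript
// Hypothetical event shapes. Only eventId is named in the PR description.
interface StreamEvent {
  eventId: string;
  payload: string;
}

interface LoggedEvent extends StreamEvent {
  seq: number; // position in the per-run log, assigned on first append
}

// Hypothetical in-memory log; the real stores (in-memory / file-system / D1)
// may use different signatures and storage layouts.
class InMemoryStreamLog {
  private readonly logs: Record<string, LoggedEvent[]> = {};

  // Returns true if the event was appended, false if its eventId was already logged.
  appendStreamEvent(runId: string, event: StreamEvent): boolean {
    const runLog = this.logs[runId] ?? [];
    if (runLog.some((e) => e.eventId === event.eventId)) return false;
    runLog.push({ eventId: event.eventId, payload: event.payload, seq: runLog.length });
    this.logs[runId] = runLog;
    return true;
  }

  // Returns the events with seq >= fromSeq, in append order.
  readStreamEvents(runId: string, fromSeq = 0): LoggedEvent[] {
    return (this.logs[runId] ?? []).slice(fromSeq);
  }
}

const eventLog = new InMemoryStreamLog();
eventLog.appendStreamEvent("run-1", { eventId: "e1", payload: "hello" });
eventLog.appendStreamEvent("run-1", { eventId: "e2", payload: "world" });
// A reconnecting adapter re-yields the boundary event e2; the append is a no-op.
const duplicateAppended = eventLog.appendStreamEvent("run-1", { eventId: "e2", payload: "world" });
const replayedIds = eventLog.readStreamEvents("run-1").map((e) => e.eventId);
```

Keying deduplication on eventId rather than on position means the adapter does not need to know exactly where the previous worker stopped; it can safely overlap by one or more events.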

Cloudflare host — thin

  • SessionSupervisorDO — a Durable Object hosting runSupervisedTurn; alarm() re-attaches a run a dropped response stream abandoned. CF types are structural (no @cloudflare/workers-types dep — same discipline as D1DatabaseLike).
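
As an illustration of the structural-typing discipline mentioned above, the sketch below declares only the slice of a platform object that the code actually uses, so a plain test double satisfies it without any Cloudflare import. Everything here is hypothetical: `AlarmStorageLike`, `ensureAlarm`, and `FakeStorage` are invented for the example, and the methods are synchronous for brevity, whereas the real Durable Object storage API is promise-based.

```typescript
// Hypothetical structural slice of a storage object with alarm support.
// Simplified to synchronous methods; this is an assumption for the sketch.
interface AlarmStorageLike {
  getAlarm(): number | null;
  setAlarm(scheduledTimeMs: number): void;
}

// Schedules a check unless one is already pending. Returns true if it scheduled one.
function ensureAlarm(storage: AlarmStorageLike, nowMs: number, delayMs: number): boolean {
  if (storage.getAlarm() !== null) return false;
  storage.setAlarm(nowMs + delayMs);
  return true;
}

// A plain class satisfies the interface structurally, so Node tests need no platform types.
class FakeStorage implements AlarmStorageLike {
  private alarmAtMs: number | null = null;

  getAlarm(): number | null {
    return this.alarmAtMs;
  }

  setAlarm(scheduledTimeMs: number): void {
    this.alarmAtMs = scheduledTimeMs;
  }
}

const fakeStorage = new FakeStorage();
const scheduledFirst = ensureAlarm(fakeStorage, 1000, 30000);
const scheduledAgain = ensureAlarm(fakeStorage, 2000, 30000);
const pendingAlarmMs = fakeStorage.getAlarm();
```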

runReconnectableTurn / run-handle.ts are removed — superseded; no product consumed them yet.

Diff-audit (self-review, concurrency-focused)

Caught + fixed one ordering bug: the handle was flipped to completed before completeStep, so a crash in that window made a finished run re-run instead of replay. completeStep now lands first — the replay path's "turn finished" signal.
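
The ordering bug can be modeled in a few lines. This is a simplified sketch consistent with the description above, not the actual resolution code: the `RunState` shape, the three-way `resolve` function, and `crashAfterFirstWrite` are assumptions; only completeStep and the "handle flipped to completed" write are named in the PR.

```typescript
type HandleStatus = "running" | "completed";
type Resolution = "replay" | "reconnect" | "rerun";

// Hypothetical persisted state for one turn.
interface RunState {
  stepCompleted: boolean;
  handleStatus: HandleStatus;
}

// Hypothetical resolver: a completed step is the "turn finished" signal for replay.
function resolve(state: RunState): Resolution {
  if (state.stepCompleted) return "replay";
  if (state.handleStatus === "running") return "reconnect";
  return "rerun"; // handle says completed, but there is no finished step to replay
}

type Write = (state: RunState) => void;
const completeStep: Write = (state) => {
  state.stepCompleted = true;
};
const markHandleCompleted: Write = (state) => {
  state.handleStatus = "completed";
};

// Applies only the first write, simulating a crash before the second lands,
// then asks what a fresh supervisor would do with the surviving state.
function crashAfterFirstWrite(order: Write[]): Resolution {
  const state: RunState = { stepCompleted: false, handleStatus: "running" };
  const firstWrite = order[0];
  if (firstWrite) firstWrite(state);
  return resolve(state);
}

const buggyOrderOutcome = crashAfterFirstWrite([markHandleCompleted, completeStep]);
const fixedOrderOutcome = crashAfterFirstWrite([completeStep, markHandleCompleted]);
```

Under this model the buggy order resolves to a re-run of a finished turn, while the fixed order resolves to a replay.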

Out of scope (deliberate)

Client↔edge SSE resumability (the event log is its enabler — a clean follow-on); WebSocket Hibernation (cost optimization, not correctness); a generic workflow engine (the sandbox runtime is the durable engine).

Test plan

  • pnpm typecheck — clean
  • pnpm test: 251 passed (15 new: the cross-worker chaos keystone (kill mid-stream, resume on a fresh worker, no gap / no duplicate); seam dedup; two successive deaths; lease/heartbeat; lease-lost abort; DO fetch + alarm re-attach; adapter conformance)
  • pnpm exec biome check src — clean
  • pnpm build — green

Pursuit spec: .evolve/pursuits/2026-05-22-durable-run-supervisor.md.

runReconnectableTurn (#23) recovered an interrupted turn only on a retry
re-invocation, left an unattended window between worker death and that
retry, depended on the sandbox runtime buffering events, and made the
correctness-critical reconnect a per-product callback. It checkpointed
the run handle as "a completed step at index 0" — an admitted
migration-dodge.

This relocates the durability boundary off the ephemeral worker onto an
always-attached supervisor that owns the run.

Substrate (platform-agnostic, tested in Node):
- DurableRunStore gains an ordered, replayable stream-event log —
  appendStreamEvent / readStreamEvents, idempotent on eventId so a
  reconnecting adapter that re-yields a boundary event cannot double-log.
  RunHandle is real run-row state via setRunHandle, not a step hack.
  Schema v2 (durable_stream_events table + durable_runs.handle_json),
  implemented across the in-memory / file-system / D1 stores.
- runSupervisedTurn — drains a run's events into the stream log as they
  flow, persists the reconnect pointer the instant the substrate yields
  it, heartbeats the lease. A fresh supervisor reads the log for its
  cursor and resumes via the adapter — fresh / resumed / replayed.
- SandboxReconnectAdapter — one typed, conformance-tested contract. The
  dangerous reconnect glue lives once per substrate, never per product.
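
The cursor-based resume described above can be sketched like this. It is a synchronous toy model under stated assumptions, not runSupervisedTurn itself: `RunLog`, `reconnect`, `superviseTurn`, and the `dieAfter` parameter are hypothetical, the adapter returns an array where a real one would stream, and lease handling is omitted.

```typescript
interface Ev {
  eventId: string;
  data: string;
}
type TurnMode = "fresh" | "resumed" | "replayed";

// Hypothetical durable state for one run: an idempotent event log plus a finished flag.
class RunLog {
  readonly events: Ev[] = [];
  finished = false;

  append(ev: Ev): boolean {
    if (this.events.some((e) => e.eventId === ev.eventId)) return false;
    this.events.push(ev);
    return true;
  }

  // The cursor is the eventId of the last logged event, if any.
  cursor(): string | undefined {
    return this.events[this.events.length - 1]?.eventId;
  }
}

// Stand-in for the sandbox's event stream.
const SOURCE: Ev[] = ["e1", "e2", "e3", "e4", "e5"].map((id) => ({ eventId: id, data: "chunk " + id }));

// Hypothetical adapter: on reconnect it re-yields the boundary event, then the rest.
function reconnect(afterEventId?: string): Ev[] {
  if (afterEventId === undefined) return SOURCE.slice();
  const boundary = SOURCE.map((e) => e.eventId).indexOf(afterEventId);
  return SOURCE.slice(Math.max(boundary, 0));
}

// dieAfter simulates the worker dying once that many events have been logged.
function superviseTurn(runLog: RunLog, dieAfter = Number.POSITIVE_INFINITY): TurnMode {
  if (runLog.finished) return "replayed";
  const cursor = runLog.cursor();
  const mode: TurnMode = cursor === undefined ? "fresh" : "resumed";
  let appended = 0;
  for (const ev of reconnect(cursor)) {
    if (appended >= dieAfter) return mode; // simulated death: run is left unfinished
    if (runLog.append(ev)) appended += 1;
  }
  runLog.finished = true;
  return mode;
}

const sharedLog = new RunLog();
const firstWorker = superviseTurn(sharedLog, 2); // logs e1, e2, then dies
const secondWorker = superviseTurn(sharedLog); // resumes from cursor e2
const thirdWorker = superviseTurn(sharedLog); // run already finished
const fullSequence = sharedLog.events.map((e) => e.eventId).join(",");
```

The second worker sees e2 again at the seam, and the idempotent append drops it, so the log ends with each event exactly once.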

Cloudflare host (thin):
- SessionSupervisorDO — a Durable Object that hosts runSupervisedTurn;
  alarm() re-attaches a run a dropped response stream abandoned. CF
  types are structural (no @cloudflare/workers-types dep).
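
One way to picture the alarm-driven re-attach is a periodic sweep for unfinished runs whose lease has lapsed. The sketch below is a hypothetical model, not SessionSupervisorDO: `RunRow`, `SupervisorHostSketch`, the explicit `nowMs` arguments, and the lease bookkeeping are all assumptions, and the real alarm() handler takes no clock argument and is asynchronous.

```typescript
// Hypothetical run row: a lease that a live stream keeps extending via heartbeats.
interface RunRow {
  runId: string;
  finished: boolean;
  leaseExpiresAtMs: number;
}

class SupervisorHostSketch {
  readonly reattached: string[] = [];
  private readonly runs: RunRow[];
  private readonly leaseMs: number;

  constructor(runs: RunRow[], leaseMs: number) {
    this.runs = runs;
    this.leaseMs = leaseMs;
  }

  // A live stream extends its run's lease.
  heartbeat(runId: string, nowMs: number): void {
    for (const run of this.runs) {
      if (run.runId === runId) run.leaseExpiresAtMs = nowMs + this.leaseMs;
    }
  }

  // On alarm, any unfinished run with a lapsed lease is treated as orphaned:
  // the host takes over the lease and records the run for re-attachment.
  alarm(nowMs: number): void {
    for (const run of this.runs) {
      if (run.finished || run.leaseExpiresAtMs > nowMs) continue;
      run.leaseExpiresAtMs = nowMs + this.leaseMs;
      this.reattached.push(run.runId);
    }
  }
}

const supervisorHost = new SupervisorHostSketch(
  [
    { runId: "live", finished: false, leaseExpiresAtMs: 1000 },
    { runId: "orphan", finished: false, leaseExpiresAtMs: 1000 },
    { runId: "done", finished: true, leaseExpiresAtMs: 1000 },
  ],
  5000,
);
supervisorHost.heartbeat("live", 900); // the live run's lease now expires at 5900
supervisorHost.alarm(2000); // only "orphan" is unfinished with a lapsed lease
supervisorHost.alarm(3000); // "orphan" was re-leased at 2000, so nothing new happens
const reattachedRuns = supervisorHost.reattached.join(",");
```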

runReconnectableTurn / run-handle.ts are removed — superseded; no
product consumed them. RunHandle + the four-mode resolution carry
forward. 15 new tests incl. the cross-worker chaos keystone (kill
mid-stream, resume, no gap, no duplicate); suite 251 green; typecheck +
biome + build clean.

@tangletools left a comment

Verified. Relocates the durability boundary off the ephemeral worker onto an always-attached supervisor: DurableRunStore gains an idempotent ordered stream-event log (3 stores + schema v2), runSupervisedTurn drains into it + heartbeats + resumes from cursor via a typed conformance-tested SandboxReconnectAdapter, SessionSupervisorDO is a thin DO host with alarm()-driven orphan re-attach. Diff-audit caught + fixed the completeStep/handle ordering bug. Cross-worker chaos keystone test green (kill mid-stream → resume → no gap, no dup); suite 251 green; typecheck + biome + build clean.

@drewstone drewstone merged commit 9325ea6 into main May 22, 2026
1 check passed
@drewstone drewstone deleted the feat/durable-run-supervisor branch May 22, 2026 21:13
drewstone added a commit that referenced this pull request May 22, 2026
…ples

Three days of substrate changes (#28 model-resolution, #29 durable-run
supervisor + DO) shipped with no docs and no real-workerd test. This
catches the surface up.

- Real-workerd integration test for SessionSupervisorDO via
  @cloudflare/vitest-pool-workers — runs the DO under actual workerd
  with real DurableObjectState + real ReadableStream + real Response.
  Pinned to a vitest-3-compatible pool version; separate
  vitest.workers.config.ts so the main Node suite is unaffected.
- README rewrite: entry-point table covers every current primitive —
  durable turn, supervisor + DO, model resolution, defineAgent,
  durable chat-turn engine, analyst loop, platform clients.
- docs/concepts.md: real mental-model doc — the five layers, the three
  durability levels, the reconnect-adapter contract, model resolution,
  reading order for new consumers. Replaces the dead src/index.ts
  JSDoc link.
- examples/model-resolution — resolveChatModel + validateChatModelId
  (fail-closed) + withConfiguredModels. Runs offline.
- examples/durable-supervisor — cross-worker resume keystone: w1
  drains 2 of 5, lease lapses, w2 resumes from cursor, full sequence
  exactly once. Runs offline.
- examples/agent-into-reviewer — pipe one runtime's stream into a
  reviewer agent (the "two-runtime" pattern). Runs offline.

Verified: typecheck 0, Node suite 251, workerd suite 2 (real DO),
biome clean, build green. All three new examples runnable end-to-end.
drewstone added a commit that referenced this pull request May 22, 2026
…ples (#30)
