feat(0.15.0): cross-worker sandbox-reconnect durability#23
Merged
Conversation
A 15-minute agentic sandbox turn must survive the Cloudflare worker
isolate dying mid-turn. `runDurableTurn` already replays a *completed*
turn, but an *interrupted* one re-runs from the top — the producer's
`streamPrompt` generator died with the isolate.
The sandbox container is orchestrator-managed and outlives the worker.
`runReconnectableTurn` checkpoints a `RunHandle` — `{ kind, sandboxId,
sessionId, runId, status, cursor }` — at turn start. On a retry that
finds a `running` handle, a fresh worker calls a product-supplied
`reconnect(handle)` callback (which wires the sandbox SDK's event-replay
endpoint) instead of re-prompting. tcloud products omit `reconnect` and
fall through to a clean re-run.
The handle is checkpointed as a completed step at index 0; the turn runs
at index 1. This reuses the existing `completeStep` JSON-result path
with zero schema change — a completed step is the only shape
`startOrResume` returns to a retry, and the handle must be readable
while the turn step is still `running`.
Tests cover fresh / reconnected / replayed / rerun / reconnect-failure
across the InMemory / FileSystem / D1 store matrix.
biome flagged three errors — a `let` that should be `const` (run-handle.test.ts), an unsorted export block (durable/index.ts), and a formatting nit (run-handle.ts). All mechanical (biome --write), no behaviour change. A pre-existing unused-import *warning* in tests/agent.test.ts is left untouched — warnings do not fail CI and it is outside this PR's scope.
af9fa2e to
252bcb0
Compare
tangletools
approved these changes
May 22, 2026
Contributor
tangletools
left a comment
There was a problem hiding this comment.
Verified after bringing the branch current with main (merged 9 commits incl. the model-resolution primitive, no conflicts). runReconnectableTurn + RunHandle: checkpoints a run pointer at turn index 0, a fresh worker re-attaches to the in-flight sandbox run (replayed/reconnected/rerun/fresh) instead of re-prompting — reuses the existing completeStep path, zero schema change. Full gate on the merged tree: typecheck 0, 254 tests green (18 sandbox-reconnect across the InMemory/FileSystem/D1 store matrix), build success, biome clean. CI green.
4 tasks
drewstone
added a commit
that referenced
this pull request
May 22, 2026
runReconnectableTurn (#23) recovered an interrupted turn only on a retry re-invocation, left an unattended window between worker death and that retry, depended on the sandbox runtime buffering events, and made the correctness-critical reconnect a per-product callback. It checkpointed the run handle as "a completed step at index 0" — an admitted migration-dodge. This relocates the durability boundary off the ephemeral worker onto an always-attached supervisor that owns the run. Substrate (platform-agnostic, tested in Node): - DurableRunStore gains an ordered, replayable stream-event log — appendStreamEvent / readStreamEvents, idempotent on eventId so a reconnecting adapter that re-yields a boundary event cannot double-log. RunHandle is real run-row state via setRunHandle, not a step hack. Schema v2 (durable_stream_events table + durable_runs.handle_json), implemented across the in-memory / file-system / D1 stores. - runSupervisedTurn — drains a run's events into the stream log as they flow, persists the reconnect pointer the instant the substrate yields it, heartbeats the lease. A fresh supervisor reads the log for its cursor and resumes via the adapter — fresh / resumed / replayed. - SandboxReconnectAdapter — one typed, conformance-tested contract. The dangerous reconnect glue lives once per substrate, never per product. Cloudflare host (thin): - SessionSupervisorDO — a Durable Object that hosts runSupervisedTurn; alarm() re-attaches a run a dropped response stream abandoned. CF types are structural (no @cloudflare/workers-types dep). runReconnectableTurn / run-handle.ts are removed — superseded; no product consumed them. RunHandle + the four-mode resolution carry forward. 15 new tests incl. the cross-worker chaos keystone (kill mid-stream, resume, no gap, no duplicate); suite 251 green; typecheck + biome + build clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A 15-minute agentic sandbox turn must survive the Cloudflare worker isolate dying mid-turn (deploy roll, CPU limit, OOM).
runDurableTurnalready replays a completed turn, but an interrupted turn re-runs from the top — the producer'sstreamPromptgenerator died with the isolate.The Tangle sandbox container is orchestrator-managed and outlives the worker. This PR adds
runReconnectableTurn: it checkpoints aRunHandleat turn start so a fresh worker re-attaches to the in-flight sandbox run instead of re-prompting.RunHandle—{ kind: 'sandbox' | 'tcloud', sandboxId?, sessionId?, runId?, status, cursor? }. A pointer to a substrate run that outlives the isolate.runReconnectableTurn— three resolution paths on a retry:replayed(turn already finished — cached text replays),reconnected(arunninghandle survived — calls the product'sreconnect(handle)callback),rerun/fresh(no reconnectable handle — produces live).reconnect(handle)is product-supplied substrate glue. Sandbox products wire the SDK's event-replay endpoint (GET {runtimeUrl}/agents/run/{runId}/events?lastEventId={cursor}); tcloud products omit it and fall through to a clean re-run.completeStepJSON-result path with zero schema change — a completed step is the only shapestartOrResumereturns to a retry, and the handle must be readable while the turn step is stillrunning. A newdurable_stepscolumn would force a migration across all three stores plus a new store method.This is a thin handle registry, not a second durable-execution framework — the sandbox runtime is the durable engine; agent-runtime just remembers the pointer.
Spike findings (
@tangle-network/sandbox@0.1.2)Cross-worker attach is feasible.
streamPrompt's reconnect usesexecutionId(run id, carried on theexecution.startedSSE frame'sdata) +lastEventId(the SSEid:cursor). The runtime exposesGET {runtimeUrl}/agents/run/{executionId}/events?lastEventId={cursor}&format=sse, reachable from any process via the publicSandboxConnection.runtimeUrl+authToken. The SDK does not expose a one-callresumeRun(executionId)— its reconnect loop is closure-local — so the raw replay fetch is product-owned, which is exactly whyreconnectis a product-supplied callback.Test plan
pnpm typecheckpassespnpm test— 231/231 pass (18 new inrun-handle.test.ts)runninghandle callsreconnectnotproduce;completedhandle replays;runninghandle with noreconnectfalls through to re-run; reconnect-stream failure fails the run (error not swallowed);registeradvances the persisted cursor