Snapshot Runtime: QuickJS WASM VM with snapshot/restore for workflow execution #1300

TooTallNate wants to merge 134 commits into `main`
Conversation
…refix

Start of the serialization refactor (separate from snapshot-runtime). New files:

- serialization/types.ts — SerializationFormat enum, SerializableSpecial interface, Reducers/Revivers types
- serialization/codec.ts — Codec interface with formatPrefix, serialize, deserialize, and optional deserializeLegacy
- serialization/format.ts — format prefix encode/decode/peek, moved from the monolithic serialization.ts

The Codec interface enables future alternative formats (CBOR, JSON) while keeping the devalue implementation as the current default.
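The interface described above can be sketched as follows. This is an assumption-based illustration from the commit message, not the actual source — the member names (`formatPrefix`, `serialize`, `deserialize`, `deserializeLegacy`) come from the text, and the `jsonCodec` is a hypothetical alternative implementation to show how the abstraction extends:

```typescript
// 4-char [a-z0-9] tag such as 'devl' (devalue) or 'json'.
type FormatPrefix = string;

// Codec interface as described in the commit message (a sketch).
interface Codec {
  formatPrefix: FormatPrefix;
  serialize(value: unknown): Uint8Array;
  deserialize(data: Uint8Array): unknown;
  // Optional hook for blobs written before format prefixes existed.
  deserializeLegacy?(data: Uint8Array): unknown;
}

// A trivial JSON codec showing how an alternative format would plug in:
// the 4-byte prefix is prepended on serialize and stripped on deserialize.
const jsonCodec: Codec = {
  formatPrefix: 'json',
  serialize: (value) =>
    new TextEncoder().encode('json' + JSON.stringify(value)),
  deserialize: (data) =>
    JSON.parse(new TextDecoder().decode(data).slice(4)),
};
```

A caller that peeks the first four bytes of a blob can dispatch to whichever codec claims that prefix.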
Serialization refactor Phase 1: create the new module structure alongside the existing monolithic serialization.ts (which continues to work). New files:

- serialization/reducers/common.ts — Date, Error, Map, Set, URL, BigInt, typed arrays, Headers, Request, Response, RegExp, URLSearchParams
- serialization/reducers/class.ts — Class/Instance with WORKFLOW_SERIALIZE/DESERIALIZE support
- serialization/reducers/step-function.ts — StepFunction with closure vars
- serialization/codec-devalue.ts — devalue Codec implementation
- serialization/encryption.ts — composable encrypt/decrypt layer
- serialization/workflow.ts — synchronous, no encryption, for VM use
- serialization/step.ts — async with encryption, for the step handler
- serialization/client.ts — async with encryption, for the start() API
- serialization/index.ts — re-exports the public API
- serialization/serialization.test.ts — 25 focused tests

All modes compose their reducer/reviver sets from the shared building blocks. Cross-mode compatibility is verified: data serialized in any mode can be deserialized in any other mode (for common types). The existing 108 serialization tests continue to pass unchanged.
- Add ./serialization/workflow export to @workflow/core package.json
- Add ./internal/serialization re-export to workflow meta-package
- The workflow bundle can now import serialize/deserialize via:
import { serialize, deserialize } from 'workflow/internal/serialization'
Full test suite passes: 493 tests across 22 files (including 25 new
serialization module tests).
1. Fix reducer composition order: Class/Instance reducers now come BEFORE common reducers in all three modes (workflow, step, client). This ensures custom Error subclasses with WORKFLOW_SERIALIZE are handled by the Instance reducer before the generic Error reducer (devalue uses first-match-wins semantics).
2. Fix encryption decrypt() to fail fast when encrypted data is encountered without a decryption key, instead of silently returning encrypted bytes that would fail later with an unhelpful format error.
3. Remove Request/Response from the common reducers — they have no matching common revivers, so including them caused asymmetric behavior (serialize as Request, deserialize as a plain object). Request/Response handling belongs in mode-specific modules that can provide proper revivers.
4. Document the Node.js dependency in the workflow serialization re-export. The current implementation uses node:util and Buffer. For the QuickJS VM (snapshot runtime), these will need polyfills — tracked separately.
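The ordering bug in point 1 is easy to see in isolation. Below is a minimal sketch of first-match-wins dispatch (the names `instanceReducer`, `CustomError`, and the tuple shapes are illustrative, not the actual reducers): a custom Error subclass matches both the specific and the generic reducer, so only the list order decides which representation wins.

```typescript
// A reducer either claims a value (returns its reduced form) or declines
// by returning false — first match wins, like devalue's custom types.
type Reducer = (value: unknown) => unknown | false;

class CustomError extends Error {
  static WORKFLOW_SERIALIZE = true; // marker from the commit message
}

// Specific: only handles classes opted in via WORKFLOW_SERIALIZE.
const instanceReducer: Reducer = (v) =>
  v instanceof CustomError ? ['Instance', 'CustomError', v.message] : false;

// Generic: handles any Error — including CustomError, which is the trap.
const errorReducer: Reducer = (v) =>
  v instanceof Error ? ['Error', v.message] : false;

function reduce(value: unknown, reducers: Reducer[]): unknown {
  for (const r of reducers) {
    const result = r(value);
    if (result !== false) return result;
  }
  return value;
}
```

With `[instanceReducer, errorReducer]` a `CustomError` is reduced as an Instance; with the pre-fix order `[errorReducer, instanceReducer]` it is swallowed by the generic Error reducer and its class identity is lost.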
The Codec interface now takes a SerializationMode ('workflow', 'step',
'client') instead of raw reducers/revivers. The reducer/reviver
composition is internal to the devalue codec implementation.
This is the right abstraction because reducers/revivers are devalue-
specific concepts. A future CBOR codec would handle Date, typed arrays,
Map, Set natively via the CBOR type system — it wouldn't use reducers
at all. A JSON codec would only support standard JSON types.
The mode-specific modules (workflow.ts, step.ts, client.ts) are now
simpler — they just pass the mode string to the codec.
The format prefix is now a branded string type validated by
isFormatPrefix() — any 4-character [a-z0-9] string is valid.
This removes the hard-coded enum of known formats, making the system
truly open for extension:
type FormatPrefix = string & { __brand: 'FormatPrefix' };
function isFormatPrefix(value: string): value is FormatPrefix;
The SerializationFormat object still provides well-known constants
('devl', 'encr') but they're now just typed constants, not an
exhaustive enum.
peekFormatPrefix() and decodeFormatPrefix() use isFormatPrefix() for
validation instead of checking against a known list. Unknown but valid
prefixes (e.g. 'cbor', 'json', 'v2b1') are accepted — the caller
decides whether they can handle the format.
6 new isFormatPrefix tests covering: valid strings, too short, too long,
uppercase, special characters. 1 new test for unknown-but-valid prefixes.
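The branded-prefix validation described above can be sketched directly from the stated rule (any 4-character `[a-z0-9]` string is valid); the regex is the obvious encoding of that rule, though the actual implementation may differ:

```typescript
// Branded string type: only values that passed isFormatPrefix() can be
// assigned to FormatPrefix, even though it is a plain string at runtime.
type FormatPrefix = string & { __brand: 'FormatPrefix' };

// Exactly 4 lowercase-alphanumeric characters, per the stated rule.
function isFormatPrefix(value: string): value is FormatPrefix {
  return /^[a-z0-9]{4}$/.test(value);
}
```

Unknown-but-valid prefixes like `'cbor'` or `'v2b1'` pass this check; deciding whether the format is actually supported is left to the caller, which is what keeps the system open for extension.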
Proves that data serialized by the new modules can be deserialized by the old serialization.ts functions, and vice versa. This validates that the new modules are wire-format compatible and safe for incremental migration:

- new workflow.serialize → old hydrateStepReturnValue (primitives, Date, Map, nested)
- old dehydrateStepReturnValue → new workflow.deserialize (primitives, Date, nested)
- old dehydrateWorkflowArguments → new workflow.deserialize
- new client.serialize → old hydrateWorkflowArguments
- new step.serialize + encryption → old hydrateStepArguments + decryption
- old dehydrateStepArguments + encryption → new step.deserialize + decryption

All 11 tests pass, confirming the new and old modules produce identical wire formats and can coexist during the migration.
Phase 1 of the VM snapshot runtime (RFC #1298).

World interface changes (packages/world):
- Add SnapshotMetadata type (lastEventId, createdAt) with zod schema
- Add snapshots sub-interface to Storage: save(), load(), delete()
- Export the new types and schema from @workflow/world

world-local implementation (packages/world-local):
- Filesystem-based snapshot storage in {dataDir}/snapshots/
- {runId}.bin for serialized VM snapshot data
- {runId}.json for metadata (lastEventId, createdAt)
- save() overwrites existing snapshots (atomic via ensureDir + write)
- load() returns null if no snapshot exists
- delete() removes both files
- Wired into createStorage() with tracing instrumentation
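The world-local contract above (paired `{runId}.bin` / `{runId}.json` files, null on missing, delete removes both) can be sketched roughly as below. This is a simplified stand-in for the real package — no tracing, no atomicity guarantees beyond mkdir-then-write, and the factory name is invented:

```typescript
import { promises as fs } from 'node:fs';
import * as path from 'node:path';

interface SnapshotMetadata {
  lastEventId: string;
  createdAt: string;
}

// Hypothetical filesystem snapshot store matching the described layout.
function createSnapshotStore(dir: string) {
  const bin = (runId: string) => path.join(dir, `${runId}.bin`);
  const meta = (runId: string) => path.join(dir, `${runId}.json`);
  return {
    async save(runId: string, data: Uint8Array, m: SnapshotMetadata) {
      await fs.mkdir(dir, { recursive: true }); // ensureDir
      await fs.writeFile(bin(runId), data);      // overwrites if present
      await fs.writeFile(meta(runId), JSON.stringify(m));
    },
    async load(runId: string) {
      try {
        const data = await fs.readFile(bin(runId));
        const metadata: SnapshotMetadata = JSON.parse(
          await fs.readFile(meta(runId), 'utf8'),
        );
        return { data, metadata };
      } catch {
        return null; // no snapshot exists yet
      }
    },
    async delete(runId: string) {
      await fs.rm(bin(runId), { force: true });
      await fs.rm(meta(runId), { force: true });
    },
  };
}
```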
Phase 2 of the VM snapshot runtime (RFC #1298).

- Add quickjs-wasi dependency to @workflow/core
- Create snapshot-runtime.ts with the basic structure:
  - runSnapshotWorkflow() entry point
  - Fresh VM creation with a deterministic WASI clock and seeded Math.random
  - Snapshot restore path (TODO: event processing)
  - Host function stubs for useStep, sleep, createHook via Symbol.for()
  - Interrupt handler (30s timeout)
  - Memory limit (64MB)
  - Snapshot serialization on suspension

The useStep, sleep, and createHook host functions are stubs with TODO markers — the basic VM lifecycle and snapshot/restore flow is in place.
Demonstrates the core snapshot/restore mechanism with a compiled workflow pattern:

- useStep implemented inside QuickJS as JS code (not host functions)
- Pending step resolve/reject functions stored on globalThis.__resolvers
- Step metadata (stepId, args) preserved across snapshot/restore
- Multi-step workflow: snapshot at each suspension, restore and resolve, workflow continues from the exact suspension point
- Both tests pass: simple workflow + metadata preservation
The snapshot runtime (runSnapshotWorkflow) now handles the complete workflow lifecycle:

- First run: bootstrap the VM with workflow primitives, evaluate the compiled workflow bundle, start the workflow function, process any existing events
- Snapshot: capture VM state when the workflow suspends on a step/sleep
- Restore: deserialize the snapshot, process delta events to resolve/reject pending promises, execute pending jobs
- Completion: detect the workflow result or error

Workflow primitives (useStep, sleep) are implemented as JavaScript code inside the QuickJS VM, not as host function callbacks. This keeps the implementation simple — the host communicates by evaluating small JS snippets to resolve/reject promises.

7 tests covering: simple completion, step suspension, snapshot/restore with step completion, multi-step across 3 snapshots, sleep suspension and wake, step failure with try/catch.
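The resolver-parking pattern described above can be shown in plain JS, independent of QuickJS. This is a sketch under stated assumptions — `__resolvers` comes from the commit messages, while `useStep`'s shape and `resolveStep` are illustrative stand-ins for the code the runtime evaluates inside the VM:

```typescript
// Pending resolve/reject pairs are parked on globalThis so the host can
// find them later by stepId after a snapshot restore.
type Resolvers = Record<
  string,
  { resolve: (v: unknown) => void; reject: (e: unknown) => void }
>;
const g = globalThis as any;
g.__resolvers = {} as Resolvers;

// Inside the VM: suspend by returning a promise whose settle functions
// are stashed for later. A real runtime would snapshot the VM heap here.
function useStep(stepId: string): Promise<unknown> {
  return new Promise((resolve, reject) => {
    g.__resolvers[stepId] = { resolve, reject };
  });
}

// What the host evaluates inside the VM once the step handler finishes:
// a small snippet that settles the parked promise and cleans up.
function resolveStep(stepId: string, value: unknown) {
  g.__resolvers[stepId].resolve(value);
  delete g.__resolvers[stepId];
}
```

Because the resolver functions live in the VM heap, they survive a snapshot/restore cycle along with the suspended async frames that await them.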
…napshot flag

- Add snapshot-entrypoint.ts that handles the full lifecycle: snapshot load → event fetching → runSnapshotWorkflow → result handling (create events, queue steps, save/delete snapshots)
- Add a feature flag: set WORKFLOW_RUNTIME=snapshot to use the new runtime
- When enabled, the snapshot path runs before the event-replay path
- Step queuing matches the existing step handler's expected payload format
- Wait handling includes timeout calculation for delayed re-queuing
- Extract the workflow ID from the SWC-compiled bundle's manifest comment
The snapshot runtime now successfully:

1. Evaluates the compiled workflow bundle in QuickJS
2. Suspends on the first step call
3. Snapshots the VM state
4. Creates step_created events and queues step execution

Web API stubs were added for TransformStream, ReadableStream, WritableStream, TextEncoder, TextDecoder, Headers, URL, and console — these are referenced by the compiled bundle but not needed for basic step/sleep workflows.

Remaining issue: step_created events use raw JSON for step input args, but the step handler expects devalue-serialized data. This is the data serialization boundary that needs to be resolved (RFC #1298 discusses moving devalue inside the QuickJS VM).
…untime

The step_created events now contain properly devalue-serialized input data (a Uint8Array with the 'devl' format prefix) instead of raw JSON. This makes the step handler's hydrateStepArguments() work correctly.

When processing step_completed events, the output is deserialized via workflow.deserialize() on the host side before being passed to the QuickJS VM as JSON. This handles the devalue format prefix correctly. Also properly serializes the run_completed output.
Step arguments are now wrapped in { args: [...], closureVars?: {...} }
before being serialized with workflow.serialize(), matching the format
expected by the step handler's hydrateStepArguments().
The step handler successfully:
- Receives the step message
- Deserializes the step arguments
- Executes the step function (add(10, 7))
- Handles retry on retryable errors
- Completes the step and re-queues the workflow
New files:

- serialization/base64.ts — pure-JS base64 encode/decode (no Buffer)
- serialization/reducers/common-vm.ts — VM-compatible reducers using instanceof Error instead of types.isNativeError(), and pure-JS base64 instead of Buffer
- serialization/codec-devalue-vm.ts — devalue codec using the VM reducers
- serialization/workflow-vm.ts — VM workflow serialize/deserialize

The VM serializer produces the EXACT same wire format as the Node.js serializer (devl-prefixed devalue data). Verified by 14 tests, including critical cross-compatibility:

- VM serialize → Node.js hydrateStepArguments (step handler path)
- Node.js dehydrateStepReturnValue → VM deserialize (step result path)
- Pure-JS base64 matches Node.js Buffer base64

Sub-path export: @workflow/core/serialization/workflow-vm
Re-export: workflow/internal/serialization now points to workflow-vm
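A Buffer-free base64 encoder is small enough to sketch in full. This is an illustrative stand-in for `serialization/base64.ts` (encode direction only), showing why no Node.js APIs are needed inside the VM:

```typescript
const B64 =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Pure-JS base64 encode: 3 input bytes -> 4 output chars, '=' padding.
function encodeBase64(bytes: Uint8Array): string {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const a = bytes[i];
    const b = bytes[i + 1];
    const c = bytes[i + 2];
    out += B64[a >> 2] + B64[((a & 3) << 4) | ((b ?? 0) >> 4)];
    out += b === undefined ? '=' : B64[((b & 15) << 2) | ((c ?? 0) >> 6)];
    out += c === undefined ? '=' : B64[c & 63];
  }
  return out;
}
```

The cross-compatibility claim above reduces to: for any byte array, this must produce the identical string to `Buffer.from(bytes).toString('base64')`.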
Data now flows as format-prefixed devalue bytes (devl + devalue.stringify)
across the VM boundary, with no JSON conversion in the middle:
Step args: VM __wdk_serialize({args}) → Uint8Array → event input
Step results: event output Uint8Array → VM __wdk_deserialize → value
Workflow result: VM __wdk_serialize(result) → Uint8Array → event output
Host functions __wdk_serialize/__wdk_deserialize are installed on
globalThis and use the VM-compatible workflow serializer (pure JS,
no Node.js deps). They are re-installed after snapshot restore since
host callbacks don't survive the snapshot.
VM-compatible serializer (workflow-vm.ts) produces the EXACT same
wire format as the Node.js serializer — verified by cross-compatibility
tests.
The serializer (devalue + reducers + TextEncoder/TextDecoder polyfills) is now bundled as a 16.6KB IIFE that's evaluated inside the QuickJS VM during bootstrap. The serialize/deserialize functions are real JS functions running inside the VM, operating on QuickJS-native values (Date, Map, Set, etc.) that can't cross the VM boundary via dump().

Architecture:
- vm-bundle-entry.ts is bundled by esbuild into a self-contained IIFE
- esbuild's inject option ensures the TextEncoder/TextDecoder polyfills run before any module-level code
- The host only passes opaque Uint8Array blobs (devl-prefixed devalue) across the VM boundary
- On snapshot restore, the serde functions survive in the QuickJS heap (no re-registration needed)

New files:
- polyfills/text-encoder.ts — pure-JS TextEncoder (from nx.js)
- polyfills/text-decoder.ts — pure-JS TextDecoder (from nx.js)
- polyfills/install-text-coding.ts — installs the polyfills on globalThis
- serialization/vm-bundle-entry.ts — esbuild entry for the VM serde bundle
- runtime/vm-serde-bundle.generated.ts — auto-generated bundle string
- scripts/build-vm-serde-bundle.js — build script (runs during pnpm build)

Removed: installSerdeHostFunctions (no longer needed — serde is in-VM)
…ecution

The snapshot metadata now stores eventsCursor (the pagination cursor from events.list()) instead of lastEventId (the raw event ID). The world-local pagination expects cursors in 'timestamp|id' format, not raw event IDs.

This fix enables the full workflow lifecycle:

1. First invocation: the QuickJS VM evaluates the workflow and suspends on step_0
2. Step handler executes add(10, 7) = 17
3. Second invocation: snapshot restored, step_0 resolved, suspends on step_1
4. Step handler executes add(17, 8) = 25
5. Third invocation: snapshot restored, both steps resolved, workflow completes
6. run_completed event created, snapshot cleaned up

Verified end-to-end with the nextjs-turbopack workbench:

- All events created correctly (run_created → run_completed)
- Step retries work (the add function throws on first attempt)
- Snapshots are saved/restored/deleted at the correct lifecycle points
- Run status transitions to 'completed'
🦋 Changeset detected — latest commit: 8349c88. The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
🧪 E2E Test Results — ✅ All tests passed

- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other
- Extract workflow arguments from the run_created event and pass them to the workflow function via __wdk_deserialize()
- Call executePendingJobs() after each step_completed/step_failed/wait_completed event to allow async-function await resumptions to unwind one step at a time
- Add debug logging for workflow result bytes

The addTenWorkflow e2e test is still failing: the workflow result bytes are 'devl-1' (devalue for undefined) even though all steps complete successfully. The issue appears to be that the async function's return value is not propagating through the SWC-compiled workflow bundle's promise chain. This needs investigation — the unit tests with simple inline workflow code work correctly.
Adds snapshot.* semantic conventions and threads the parent
`WORKFLOW {workflowName}` span into the snapshot entrypoint and VM
runner so operators can see snapshot-restore latency, snapshot size,
encrypt/decrypt overhead, and event-fetch behavior in their traces.
Attributes attached to the parent span:
- snapshot.runtime ('snapshot' | 'replay')
- snapshot.invocation_kind ('first' | 'restore')
- snapshot.outcome ('completed' | 'suspended' | 'failed')
- snapshot.events.preloaded, .fetched_count, .fetched_pages
- snapshot.pending_ops_count, .events_cursor
- snapshot.{load,save,delete,decrypt,encrypt,deserialize,serialize}.duration_ms
- snapshot.{load,save}.bytes, snapshot.save.plaintext_bytes
Two child spans:
- snapshot.load — wraps world.snapshots.load + decrypt (deserialize
duration is recorded as an attribute since it occurs inside the
VM runner where the load span is no longer in scope).
- snapshot.save — wraps QuickJS.serializeSnapshot + encrypt +
world.snapshots.save.
No metrics histograms — the codebase has no metric pipeline yet, so
this matches the existing attributes-on-spans convention used by the
replay runtime.
Previously the seedrandom seed for each VM invocation was `runId:workflowName:startedAt` — constant across all resumptions of a run. Each restore re-initialized the RNG from that same seed and replayed the first N draws, so the VM's `__generateUlid` and `__generateNanoid` produced identical IDs on every resumption. That collapsed the hasCreatedEvent dedup guard and caused step/hook correlation IDs to drift between invocations.

Mix `existingSnapshot.metadata.eventsCursor` into the seed when restoring. The cursor is stable for retries of the same resumption (idempotent within a single resume) but advances across resumes, which is exactly the determinism boundary we want.
…invocations

Two queue messages for the same workflow run can be processed concurrently by separate workflow handler instances. The replay runtime is naturally idempotent (full event-log replay produces deterministic correlationIds via the seeded PRNG), but the snapshot runtime previously used `ulid(Date.now())` for correlationIds — concurrent VMs hit it at slightly different milliseconds and produced different ULIDs even though the seeded-PRNG portion was identical. The world had no way to recognize these as duplicates, so a single logical step became two step_created events with two independent step handlers. For workflows like fibonacciWorkflow that do `Promise.all([runA.returnValue, runB.returnValue])`, this manifested as 4 step_created events for 2 logical operations, with 2 of the 4 `Run#returnValue` proxies hanging because nothing wrote their step_completed.

Inject a deterministic timestamp (`workflowRun.startedAt`, constant per run) into the VM as `__ulidTimestamp`. The bundle's `__generateUlid` reads it instead of `Date.now()` when present, so concurrent VMs produce identical ULIDs. Distinctness across resumptions still comes from the cursor mixed into the seedrandom seed, which advances the PRNG sequence between resumes.
Three unit tests covering:

- Same fresh start (no snapshot) → identical correlationIds across two concurrent invocations.
- Same restore (snapshot + same events) → identical correlationIds across two concurrent invocations.
- Different resume (cursor advanced) → distinct correlationIds across resumes (so EntityConflictError doesn't falsely dedup unrelated steps).

The first two tests fail against the pre-fix runtime (different ULID timestamp portions across concurrent invocations); the third was already passing pre-fix because the cursor-mixed seedrandom seed already produced distinct random portions across resumes.
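The determinism argument above can be demonstrated with a toy ULID generator. This is a sketch, not the real `__generateUlid`: `mulberry32` stands in for seedrandom, and the ULID layout (10 Crockford-base32 time chars + 16 random chars) is the standard ULID format. With a fixed timestamp and the same seed, two "concurrent invocations" emit identical IDs; a different seed (the cursor advancing between resumes) diverges:

```typescript
const CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ';

// Tiny seeded PRNG standing in for seedrandom (assumption for this sketch).
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// ULID = 10 chars of time (48 bits) + 16 chars drawn from the PRNG.
// Passing a fixed timestamp plays the role of __ulidTimestamp.
function ulid(timestamp: number, rand: () => number): string {
  let time = '';
  for (let i = 9; i >= 0; i--) {
    time += CROCKFORD[Math.floor(timestamp / 32 ** i) % 32];
  }
  let random = '';
  for (let i = 0; i < 16; i++) {
    random += CROCKFORD[Math.floor(rand() * 32)];
  }
  return time + random;
}
```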
…-local Concurrent invocations producing identical correlationIds (as the snapshot runtime does by design across replays) previously both succeeded and persisted duplicate events. step_created had no guard at all; wait_created used a TOCTOU read-then-check that allowed both writers through under concurrency. Both now claim a per-(runId, correlationId) constraint file with O_CREAT|O_EXCL before writing, so the loser surfaces as EntityConflictError — which the runtime's dedup catch path already handles.
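The atomic claim described above relies on a single kernel-level primitive: opening a file with O_CREAT|O_EXCL fails if the file already exists, so exactly one concurrent writer wins. A minimal sketch in Node (the function name and `EntityConflictError` stand-in are assumptions; Node's `'wx'` flag maps to O_CREAT|O_EXCL):

```typescript
import { promises as fs } from 'node:fs';
import * as path from 'node:path';

// Stand-in for the real error type the runtime's dedup path catches.
class EntityConflictError extends Error {}

// Claim a per-(runId, correlationId) constraint file before writing the
// event. The second concurrent claimer gets EEXIST and surfaces as a
// conflict instead of a duplicate event.
async function claimCorrelationId(
  dir: string,
  runId: string,
  correlationId: string,
): Promise<void> {
  const file = path.join(dir, `${runId}.${correlationId}.lock`);
  try {
    const handle = await fs.open(file, 'wx'); // O_CREAT|O_EXCL
    await handle.close();
  } catch (err: any) {
    if (err.code === 'EEXIST') throw new EntityConflictError(correlationId);
    throw err;
  }
}
```

Unlike a read-then-check (TOCTOU), the existence check and the creation are one atomic syscall, so no interleaving lets two writers both succeed.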
…in world-postgres Adds a unique partial index on workflow_events(run_id, correlation_id, type) filtered to step_created/hook_created/wait_created, and translates the resulting unique-violation (pg code 23505, surfaced via DrizzleQueryError.cause) into EntityConflictError. The steps table already deduped via onConflictDoNothing, but the event row still inserted, leaving duplicate events in the log. Now both rows are kept consistent and the runtime's existing dedup catch path handles concurrent writers cleanly.
Three coupled changes in the snapshot entrypoint's suspension handler:

1. Build per-pending-op promises and await them with Promise.all instead of running them in a sequential for-loop. Mirrors the replay runtime's suspension-handler.ts pattern.
2. Run snapshot.save concurrently with the op dispatch via the same Promise.all. The snapshot is an optimization — if the save lags or fails, the next workflow invocation simply replays from events. Previously this blocked step queueing on a full storage round-trip.
3. Drop the redundant hooks.list pre-check from the hook_created branch. With deterministic correlationIds (snapshot runtime PRNG fix) and per-(runId, correlationId) uniqueness in worlds (world-local + world-postgres dedup fixes), EntityConflictError on events.create is the correct dedup signal and the pre-check is an unnecessary round-trip per pending hook.

CI run 25095263499 measured snapshot ~2.37x slower than replay per test on Vercel (sum: 2418s vs 1021s); these changes should narrow that gap considerably on cloud worlds where each storage call is a network round-trip.
Hook-related e2e tests (hookWorkflow, hookCleanupTestWorkflow,
hookDisposeTestWorkflow, hookWithSleepWorkflow, distributedAbortController)
previously slept a fixed 5 seconds before calling getHookByToken to wait
for the hook to be registered. On slower runtimes — notably the snapshot
runtime on Vercel where each workflow round-trip is several seconds longer
than replay — that fixed budget is too tight and the test fails with
HookNotFoundError. On faster runtimes it's unnecessarily slow.
Adds a waitForHook(token, { timeoutMs, intervalMs, runId }) helper that
polls until the hook resolves or the timeout (default 30s) expires, with
an optional runId filter for token-reuse tests where eventually-consistent
backends may briefly still report a stale hook. Each hook-wait site now
uses this helper. Non-hook fixed sleeps (workflow-progress polling for
sleepingWorkflow cancel tests, payload-processing waits in
hookWithSleepWorkflow) are left unchanged.
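The polling helper described above can be sketched as follows. This is a simplified illustration — the real `waitForHook` takes a token and a runId filter and calls `getHookByToken`; here the lookup is abstracted into a caller-supplied function:

```typescript
// Poll until the lookup returns a non-null value or the deadline passes.
// Replaces a fixed sleep: fast runtimes return early, slow ones get the
// full timeout budget instead of a hard-coded 5 seconds.
async function waitFor<T>(
  lookup: () => Promise<T | null>,
  { timeoutMs = 30_000, intervalMs = 100 } = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await lookup();
    if (result !== null) return result;
    if (Date.now() >= deadline) {
      throw new Error(`timed out after ${timeoutMs}ms waiting for hook`);
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

In the token-reuse case, the real helper's runId filter makes the lookup return null for a stale hook reported by an eventually-consistent backend, so polling simply continues until the fresh registration appears.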
The recursion-hazard fixes that motivated the blast-radius cap have all
landed:
1. Snapshot runtime correlationIds are now deterministic across
concurrent VM invocations (commit 83bcec — `__ulidTimestamp`
injection so same-resumption invocations produce identical ULIDs).
2. The seeded PRNG state is preserved by the VM heap snapshot itself
(commit a71503 — events cursor mixed into seed; ULID
monotonicFactory closure persists in the QuickJS heap).
3. Per-(runId, correlationId) uniqueness is enforced atomically in
world-local (commit ca0078) and via unique partial index in
world-postgres (commit 009a00) for step_created / hook_created /
wait_created.
With those guarantees the duplicate `start()` invocation that previously
fanned out hundreds of thousands of child runs on the fastify deployment
is no longer possible. Restore the full Vercel project matrix
(11 frameworks) and unskip fibonacciWorkflow on Vercel.
…aces

Pipelining world.snapshots.save with the per-pending-op events.create + queueMessage dispatch (introduced in 22ab779) opened a window where a fast-completing step could re-invoke the workflow handler before the new snapshot was persisted. The handler then loads a stale (or missing) snapshot whose coroutine state doesn't match the latest events, leaving the workflow stuck. CI run 25098135190 caught this: fetchWorkflow on Vercel snapshot mode regressed from ~16s passing to a 60s timeout. Diagnostics showed both step_completed events landing at +5.5s but no run_completed ever firing.

Restore the original ordering: await snapshot.save fully before any step is queued. Per-pending-op dispatch within a single suspension still runs in parallel via Promise.all, which retains the bulk of the wall-clock reduction (run 25098135190 measured ~568s saved on Vercel snapshot vs. the pre-parallelize baseline). Only the cross-invocation pipelining of save with queue is rolled back.
Wedges on the Vercel snapshot runtime under concurrent matrix load are opaque from CI logs alone — the workflow handler runs inside a function on Vercel and its console output isn't surfaced in the CI job. This commit adds two pieces of diagnostic plumbing:

1. Always-on checkpoint logs at every major step of the snapshot suspension/restore lifecycle (`SNAPSHOT_DIAG`), plus matching entry/exit logs in the workflow and step queue handlers (`WORKFLOW_HANDLER_DIAG`, `STEP_HANDLER_DIAG`). Each record carries a per-invocation id, runId, elapsed time, and structured fields (snapshot bytes, events fetched + counts by type, pending-op summary, outcome, exit action). Emitted at `warn` level so they show up in Vercel function logs without DEBUG=1.
2. An e2e diagnostic-harness extension that fetches matching function logs from `/v3/deployments/:id/events` for the wedged runId after a test failure and appends them to the existing run-diagnostic block. Only runs when `WORKFLOW_VERCEL_AUTH_TOKEN` / `WORKFLOW_VERCEL_TEAM` / `VERCEL_DEPLOYMENT_ID` are set (i.e. the Vercel-prod CI matrix); silently no-ops elsewhere.

Together these let a failed test surface the function-side activity for its wedged run — e.g. whether the snapshot runtime even reached its post-VM checkpoint, what its last successful save/queue operation was, whether the next handler invocation ever started, etc. That visibility is what we need to actually find the wedge cause.
…reserve Buffer body across retries
Wedge root cause for snapshot runtime on Vercel under concurrent matrix
load. The old save() in world-vercel/src/snapshots.ts used:
fetch(url, { method: 'PUT', body: compressed, dispatcher: getDispatcher() })
where getDispatcher() returns a RetryAgent. fetch() wraps Buffer/Uint8Array
bodies in a one-shot ReadableStream (web fetch spec), so when the
RetryAgent retries on a transient 5xx or network error, the second
attempt has nothing left to read — the iterable yields 0 bytes, undici
detects the mismatch with Content-Length, and throws
UND_ERR_REQ_CONTENT_LENGTH_MISMATCH. With 5–15 MB snapshot bodies the
bug fires under any meaningful network turbulence.
The downstream impact is a permanent wedge:
1. Save throws -> workflow handler returns 500.
2. Queue retries the handler with backoff.
3. Each retry repeats the same save -> same throw -> same 500.
4. Production logs showed attempt: 19 (≈1.5 hours of retries)
before the test framework gave up at the 60s test timeout.
Switch to undici.request() (the lower-level API), which hands the Buffer
to the connection layer directly without stream wrapping, so retries
can replay the same body. Verified locally with a vitest regression
test that reproduces the exact production stack trace
(AsyncWriter.end -> writeIterable -> UND_ERR_REQ_CONTENT_LENGTH_MISMATCH)
without the fix and passes with it.
Other world-vercel endpoints (events, hooks, runs, …) hit the same
underlying undici limitation but in practice rarely fail this way: their
bodies are tiny (KB CBOR-encoded payloads), so the chance of network
turbulence mid-stream is much lower. They remain on fetch() for now.
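The failure mechanism above can be demonstrated in isolation with web streams, no network required: a fixed byte body wrapped in a `ReadableStream` is one-shot, so a transport-level retry that re-reads the body sees zero bytes the second time — exactly the Content-Length mismatch undici detects. A sketch (helper names are illustrative):

```typescript
// Wrap a byte buffer in a one-shot ReadableStream, as fetch() does for
// Buffer/Uint8Array bodies per the web fetch spec.
function streamFrom(bytes: Uint8Array): ReadableStream<Uint8Array> {
  return new ReadableStream({
    start(controller) {
      controller.enqueue(bytes);
      controller.close();
    },
  });
}

// Count the bytes a consumer (the "request attempt") can read.
async function drain(stream: ReadableStream<Uint8Array>): Promise<number> {
  const reader = stream.getReader();
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value!.length;
  }
  reader.releaseLock();
  return total;
}
```

The first drain sees the full body; a second drain of the same stream yields nothing. `undici.request()` avoids this by handing the raw buffer to the connection layer, which can replay it on retry.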
Avoid a guaranteed-404 round-trip to the snapshot storage backend on
the very first workflow handler invocation. The suspension handler in
this file always saves the snapshot BEFORE creating any
step_created / hook_created / wait_created events, so if the events
preloaded by events.create('run_started') contain only run_created /
run_started, no save cycle has run yet and no snapshot can exist.
Detected by the new exported `canSkipSnapshotLoad(preloadedEvents)`
helper, with 8 unit tests covering each event-type combination
(undefined / empty / run_created+run_started / run_started only /
step_* / hook_received / wait_completed). When the helper returns true,
`existingSnapshot` is set to null without calling
`world.snapshots.load()` and the entrypoint falls through to the
first-run path with the preloaded events.
The wfdiag('snapshot_loaded') checkpoint now also reports
`skippedLoad: true` when the fast path was taken so we can confirm
the optimization is firing in production logs.
Reduces 404 noise on workflow-server's `/v2/runs/:runId/snapshot`
endpoint and saves a network round-trip on every initial workflow
invocation. Falls back to the normal load path whenever
`preloadedEvents` is missing or contains any non-initial event.
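The helper's logic, as described, reduces to a small pure function. This is a sketch: the event shape is assumed to carry a string `type`, and the conservative treatment of an empty array (fall back to a real load) is an assumption where the commit message doesn't pin down the behavior:

```typescript
// Events that can exist before the suspension handler has ever saved a
// snapshot. If these are the only preloaded events, no snapshot can exist.
const INITIAL_EVENT_TYPES = new Set(['run_created', 'run_started']);

function canSkipSnapshotLoad(
  preloadedEvents?: { type: string }[],
): boolean {
  // Missing or empty: can't prove anything — take the normal load path.
  if (!preloadedEvents || preloadedEvents.length === 0) return false;
  return preloadedEvents.every((e) => INITIAL_EVENT_TYPES.has(e.type));
}
```

When it returns true the entrypoint sets `existingSnapshot` to null without the guaranteed-404 round-trip; any non-initial event (a `step_completed`, `hook_received`, etc.) forces the real load.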
…ming breakdown
Two changes that go together:
1. New `stripInlineSourceMap()` helper in `source-map.ts` (with 4 unit
tests). The runtime entrypoint now strips the trailing
`//# sourceMappingURL=data:…` comment from the workflow bundle
before passing it to `vm.evalCode()`. The original (unstripped)
string is kept in the host-side scope so `remapErrorStack` can
still resolve original source positions on workflow failures.
The map is purely host-side metadata for stack-trace remapping —
the VM never reads it. But QuickJS retains source text for
stack-trace line lookups, so the multi-MB base64 comment was being
carried into the VM heap and showing up in every snapshot save+load
round-trip. Empirically, on the example workbench's bundle:
- Bundle string drops 5.16 MB → 1.20 MB (-77%)
- QuickJS heap snapshot drops 11.75 MB → 8.00 MB (-32%)
That maps to ~1s saved per step round-trip on Vercel.
2. Extend the `SNAPSHOT_DIAG snapshot_loaded` and
`SNAPSHOT_DIAG snapshot_saved` checkpoint logs with per-stage byte
counts and timings:
- load: returnedBytes (post-decompress, pre-decrypt),
loadDurationMs (HTTP round-trip), decryptDurationMs
- save: plaintextBytes (raw QuickJS output),
handedToWorldBytes (after host-side encrypt),
encryptDurationMs, storeDurationMs
So the savings show up in CI-fetched function logs alongside the
existing OTel attributes. Naming clarified: 'returnedBytes' /
'handedToWorldBytes' instead of misleading 'wireBytes', because
the world (e.g. world-vercel) applies its own gzip layer below
this — true on-the-wire bytes are emitted by world-vercel's own
diagnostic (separate commit).
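The stripping helper from point 1 above is essentially a single trailing-comment regex. A sketch (the real `stripInlineSourceMap` in `source-map.ts` may handle more cases, e.g. `//@` syntax or non-data URLs):

```typescript
// Remove a trailing inline source-map comment from a bundle string so the
// multi-MB base64 payload never enters the VM heap. The caller keeps the
// original string host-side for stack-trace remapping.
function stripInlineSourceMap(code: string): string {
  return code.replace(/\n?\/\/# sourceMappingURL=data:[^\n]*\s*$/, '');
}
```

Code without a source-map comment passes through unchanged, so the helper is safe to apply unconditionally before `vm.evalCode()`.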
Adds `WORLD_SNAPSHOT_DIAG` checkpoint logs to the snapshot save and load paths.

Save reports inputBytes (what core handed in) → wireBytes (after gzipSync) → compressionRatio, plus separate gzipDurationMs and putDurationMs. Load reports the equivalents: wireBytes (raw HTTP body) → decompressedBytes (after gunzipSync), plus getDurationMs and gunzipDurationMs.

Pairs with the core `SNAPSHOT_DIAG` checkpoints from the previous commit so the entire snapshot lifecycle for any wedged run is grep-able by runId in Vercel function logs. Also covers the 404 (no-snapshot) case so a core `skippedLoad: true` checkpoint can be cross-referenced against the world's view: when both line up, the optimization is firing as intended; when only one side fires, something's off.

All emitted at `console.warn` level — no DEBUG required, matching the format and style of the core wfdiag helper.
…able
The snapshot save path was doing the wrong thing: each world (vercel,
postgres, local) gzipped the bytes BEFORE handing them to its
transport, but core's encryption wrapped them AFTER. Net result was
`gzip(encrypt(plain))` on the wire — encryption produces ciphertext
that doesn't compress, so the gzip step was largely wasted CPU.
Flip the order so compression goes BEFORE encryption (the standard
compress-then-encrypt pattern used for at-rest blob encryption — no
CRIME/BREACH applicability here since the snapshot is opaque, no
attacker injection, no per-request size leakage). Move compression
into core so it happens once, in the right place, and so the world
layers can be simplified to opaque-bytes transport.
Codec choice: zstd when available (Node 22.15+), gzip otherwise.
Benchmarked against an 8 MB QuickJS heap snapshot (representative
production payload):
| codec | ratio | compress | decompress |
|--------|-------|----------|------------|
| zstd-3 | 4.29x | 18 ms | 6 ms |
| gzip-6 | 4.02x | 127 ms | 11 ms |
zstd is faster AND smaller. The format prefix on each blob (`zstd`
or `gzip`) marks the codec, so deployments running different Node
versions remain interoperable.
Pipeline now:
- SAVE: serialize → compress → encrypt → world.snapshots.save
- LOAD: world.snapshots.load → decrypt → decompress → deserialize
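A minimal sketch of the pipeline's compression half, assuming a 4-byte ASCII codec prefix and `node:zlib` feature detection. The function names are illustrative; the real implementation lives in `serialization/compression.ts`:

```typescript
import * as zlib from 'node:zlib';

// Feature-detect zstd (node:zlib gained zstdCompressSync in Node 22.15+),
// falling back to gzip — the PREFERRED_CODEC idea described above.
const zstdAvailable = typeof (zlib as any).zstdCompressSync === 'function';

// A short ASCII codec prefix on every blob keeps mixed-Node deployments
// interoperable: the reader dispatches on the prefix, never on its own codec.
function compressSnapshot(bytes: Buffer): Buffer {
  if (zstdAvailable) {
    return Buffer.concat([Buffer.from('zstd'), (zlib as any).zstdCompressSync(bytes)]);
  }
  return Buffer.concat([Buffer.from('gzip'), zlib.gzipSync(bytes)]);
}

function decompressSnapshot(blob: Buffer): Buffer {
  const prefix = blob.subarray(0, 4).toString('ascii');
  const body = blob.subarray(4);
  if (prefix === 'zstd') return (zlib as any).zstdDecompressSync(body);
  if (prefix === 'gzip') return zlib.gunzipSync(body);
  // Unprefixed blob: treat as an uncompressed passthrough.
  return blob;
}

const snapshot = Buffer.from('q'.repeat(8192)); // stand-in for VM heap bytes
const wire = compressSnapshot(snapshot);
console.log(decompressSnapshot(wire).equals(snapshot)); // true
console.log(wire.length < snapshot.length); // true
```

Because the writer picks whichever codec its Node version supports and the reader dispatches on the prefix, a zstd-writing deployment and a gzip-only one can share the same snapshot store.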
`@workflow/core`:
* New `serialization/compression.ts` with `compress` /
`decompress` / `isCompressed` / `PREFERRED_CODEC`. 11 unit
tests covering codec selection, idempotency, format-prefix
dispatch, legacy-blob passthrough.
* New SerializationFormat constants `GZIP` / `ZSTD`.
* `runtime/snapshot-entrypoint.ts` save path: compress → encrypt
→ store. Load path: decrypt → decompress. New byte-count and
timing fields on `SNAPSHOT_DIAG snapshot_saved` /
`snapshot_loaded` (compressedBytes, compressionRatio,
compressionCodec, compressDurationMs, decompressDurationMs).
* 7 new tests in `runtime/snapshot-encryption.test.ts` covering
the full pipeline round-trip with and without encryption, plus
legacy-blob backward compatibility.
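For illustration, an AES-256-GCM round trip over already-compressed bytes might look like the following. The `iv + authTag + ciphertext` layout and the function names are assumptions for this sketch, not the actual `snapshot-entrypoint.ts` wire format:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Encrypt AFTER compression: ciphertext doesn't compress, so this ordering
// is what makes the compression step worthwhile.
function encryptSnapshot(plain: Buffer, key: Buffer): Buffer {
  const iv = randomBytes(12); // GCM-recommended 96-bit nonce
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plain), cipher.final()]);
  // Assumed layout: 12-byte iv, 16-byte auth tag, then ciphertext.
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

function decryptSnapshot(blob: Buffer, key: Buffer): Buffer {
  const iv = blob.subarray(0, 12);
  const tag = blob.subarray(12, 28);
  const ciphertext = blob.subarray(28);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag); // GCM authenticates as well as decrypts
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}

const key = randomBytes(32);
const compressed = Buffer.from('compressed snapshot bytes');
const roundTripped = decryptSnapshot(encryptSnapshot(compressed, key), key);
console.log(roundTripped.equals(compressed)); // true
```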
`@workflow/world-vercel`:
* Drop `gzipSync` from save. Body is sent verbatim (already
compressed+encrypted by core upstream).
* Drop the `X-Snapshot-Content-Encoding: gzip` header on save.
* Load still gunzips when the response carries that header — for
backward compatibility with blobs written by older deployments.
`@workflow/world-postgres`:
* Drop `gzipSync` / `gunzipSync`. Stores opaque bytes.
Snapshots table is created per CI run; no migration concern.
`@workflow/world-local`:
* Save as `{runId}.bin` (was `.bin.gz`). Load still gunzips
legacy `.bin.gz` files via the `dataFile` metadata so a
developer's stale `.workflow-data/` directory keeps working.
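A sketch of the `{runId}.bin` + `{runId}.json` layout; `dataPath()` / `metadataPath()` echo the helper names mentioned in a later commit, but the bodies here are assumptions, not the real `snapshots-storage.ts`:

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Stand-in for the .workflow-data/ directory.
const dir = mkdtempSync(join(tmpdir(), 'wf-snap-'));
const dataPath = (runId: string) => join(dir, `${runId}.bin`);
const metadataPath = (runId: string) => join(dir, `${runId}.json`);

// Opaque bytes in {runId}.bin; eventsCursor + createdAt in {runId}.json.
function save(runId: string, bytes: Buffer, eventsCursor: number): void {
  writeFileSync(dataPath(runId), bytes);
  writeFileSync(
    metadataPath(runId),
    JSON.stringify({ eventsCursor, createdAt: new Date().toISOString() })
  );
}

function load(runId: string): { bytes: Buffer; eventsCursor: number } {
  const meta = JSON.parse(readFileSync(metadataPath(runId), 'utf8'));
  return { bytes: readFileSync(dataPath(runId)), eventsCursor: meta.eventsCursor };
}

save('run_123', Buffer.from([1, 2, 3]), 42);
console.log(load('run_123').eventsCursor); // 42
```

The bytes file stays opaque to this layer: compression and encryption already happened in core, so the world only shuttles buffers.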
The compress-then-encrypt pipeline that landed in 519bb1d added backward-compatibility code to read older snapshot blobs that were written under the previous SDK-side gzip scheme. The snapshot runtime is still on the snapshot-runtime feature branch and has no production deploy, so no such blob has ever been written under the old scheme that needs to outlive a feature-branch deploy.

world-vercel:
- Remove the X-Snapshot-Content-Encoding: gzip header round-trip on save and load.
- Drop the gunzipSync import.
- File header comment no longer mentions back-compat.

world-local:
- Drop the .bin.gz / dataFile metadata mechanism. Snapshots are now always stored as {runId}.bin alongside {runId}.json.
- Drop the gunzipSync import and the LocalSnapshotMetadataSchema extension; metadata is just SnapshotMetadataSchema (eventsCursor + createdAt).
- File-naming helpers extracted as dataPath() / metadataPath().

core: remove the now-irrelevant 'legacy snapshots saved before compression was added' test from snapshot-encryption.test.ts. The remaining 'plaintext bytes pass through unchanged' test still exercises the contract that decryptSerializedData() does not require prefixed input — that's a real pre-existing API contract used by non-snapshot callers, not snapshot back-compat.
Replaces 14 incremental per-commit changesets with 4 terse, package-scoped ones (one each for @workflow/core, world-vercel, world-postgres, world-local). The detailed per-change context is preserved in git history; CHANGELOG entries from changesets should describe what consumers need to know, not the implementation history.
This changeset is part of the serialization-refactor base branch (introduced in 6add40c) and was incorrectly deleted in the previous consolidation pass. Only changesets local to the snapshot-runtime branch should have been consolidated.
The file is regenerated on every build (`scripts/build-vm-serde-bundle.js`) and is already listed under turbo.json's outputs for caching. Tracking it just produced noisy diffs whenever someone built the package with a slightly different esbuild version.
…isites
Standardize on `Symbol.for('workflow-serialize')` /
`Symbol.for('workflow-deserialize')` everywhere — the parallel
`globalThis.__wdk_serialize` / `__wdk_deserialize` aliases have been
removed from `vm-bundle-entry.ts` and the snapshot runtime's inline
JS strings now use the symbol form directly. Single canonical name,
no duplication.
Drop the `?? Math.random` and `?? Date.now()` fallbacks from the
ULID generator setup. Both prerequisites
(`globalThis.__ulidTimestamp` and the host-replaced seeded
`Math.random`) are always set by `snapshot-runtime.ts` before the
serde bundle is evaluated; silently falling back to unseeded
`Math.random` or live `Date.now()` would re-introduce the
non-determinism we deliberately fixed (concurrent VM invocations of
the same resumption must produce identical correlationIds for the
world's EntityConflictError dedup to work). Now throws if
`__ulidTimestamp` isn't a number, and passes the seeded
`Math.random` reference explicitly to `monotonicFactory` so
upstream's `detectPRNG` never runs (it'd throw in QuickJS anyway,
since `crypto` is unavailable).
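The determinism requirement above can be sketched as follows: two concurrent VM invocations of the same resumption, given the same seed and the same `__ulidTimestamp`, must emit identical correlationIds. Everything here is illustrative (`makeSeededRandom`, `makeCorrelationIdFactory`, and the simplified non-monotonic random part are assumptions, not the real `vm-bundle-entry.ts` code):

```typescript
// mulberry32: a tiny deterministic PRNG, standing in for the host-seeded
// Math.random that snapshot-runtime.ts installs before the bundle runs.
function makeSeededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ';

// ULID-shaped ids: 10-char base32 time part + 16-char random part.
function makeCorrelationIdFactory(timestamp: unknown, random: () => number) {
  if (typeof timestamp !== 'number') {
    // Mirrors the new behavior: fail loudly rather than silently falling
    // back to Date.now(), which would reintroduce non-determinism.
    throw new Error('__ulidTimestamp must be a number before the serde bundle runs');
  }
  const fixedTime: number = timestamp;
  return () => {
    let time = '';
    let t = fixedTime;
    for (let i = 0; i < 10; i++) {
      time = CROCKFORD[t % 32] + time;
      t = Math.floor(t / 32);
    }
    let rand = '';
    for (let i = 0; i < 16; i++) {
      rand += CROCKFORD[Math.floor(random() * 32)];
    }
    return time + rand;
  };
}

// Two "invocations" with the same seed + timestamp agree on every id, which
// is what lets the world's EntityConflictError dedup collapse duplicates.
const a = makeCorrelationIdFactory(1700000000000, makeSeededRandom(42));
const b = makeCorrelationIdFactory(1700000000000, makeSeededRandom(42));
console.log(a() === b() && a() === b()); // true
```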
Drop the `URL` / `URLSearchParams` / `DOMException` availability
guards in `common-vm.ts`. quickjs-wasi's URL extension is always
loaded (`url.so`) and DOMException is always constructible — the
guards were dead code carried over from when those weren't reliably
available. The reducer/reviver code is now straightforward
`instanceof URL` / `new URL(...)` / `new DOMException(...)`.
Remove `packages/core/src/serialization/base64.ts` and its
sub-path exports (`./serialization/workflow`,
`./serialization/workflow-vm`). The pure-JS base64 helpers were
leftover from before `base64.so` shipped `btoa`/`atob` natively;
the VM-side reducers in `common-vm.ts` now build base64 strings via
the native ones. The sub-path exports had zero consumers in this
repo (the same cleanup landed on the `serialization-refactor`
branch in 05e0fee but never made it onto `snapshot-runtime`
because the branches diverged earlier).
Remove `packages/workflow/src/internal/serialization.ts` and its
`./internal/serialization` package.json export. Same story — zero
consumers, previously removed in #1082, then accidentally
reintroduced via `f04fd8e91`.
The `/v3/deployments/:id/events` endpoint mostly returned empty results in our wedge-debugging usage and the runId-substring filter made it slow when it did return data. The function-log fetch belongs in a dedicated diagnostic CLI command rather than baked into the test diagnostic block. Dropping for now; can be revived in a follow-up PR if needed.
Updates the per-package changesets to match AGENTS.md guidance and the
current state of the PR:
- Bump from `patch` to `minor` (snapshot runtime is a new feature, not
a bug fix; correctness matters when the changesets land on `stable`)
- Correct snapshot-runtime-core.md: snapshot is now the default, with
replay available via `WORKFLOW_RUNTIME=replay` (was incorrectly
describing snapshot as opt-in)
- Drop the misleading 'enforces uniqueness' line from
snapshot-runtime-world-vercel.md (no uniqueness work happens in this
package; that lives in workflow-server)
- Tighten language across all four changesets per AGENTS.md
('Keep the changesets terse')
…stack regression

Per CI history (runs 25100278265 vs 25130930859), the regression boundary for the 'basic step error preserves message and stack trace' / 'cross-file step error preserves message and function names in stack' e2e tests on astro local-dev is commit 770c433 ('Add CI-visible runtime diagnostics for snapshot wedges'), NOT the later 9168353 source-map-strip commit. The astro-dev failure reproduces on both replay and snapshot runtimes with identical symptoms (function name shows up as `__getOwnPropDesc` instead of the actual step function name in the source-mapped stack), which rules out any snapshot-runtime specific cause.

The STEP_HANDLER_DIAG entries were always-on `runtimeLogger.warn` calls inside the step queue handler. They didn't add real diagnostic value beyond what the existing OTel spans already cover; their main purpose was to grep-correlate step activity with SNAPSHOT_DIAG checkpoints in Vercel function logs during the wedge-debugging session that's now resolved. SNAPSHOT_DIAG and WORKFLOW_HANDLER_DIAG are kept; only the STEP_HANDLER_DIAG pair is removed.

The exact mechanism by which the diagnostic warns affect the `stepFn.apply()` stack frame's source-mapped function name is still unclear (the most plausible explanation is that the line-shift in step-handler.ts perturbed Vite's dev-mode module graph in a way that changes which export getter wraps the step function reference at the `__copyProps` site shared with the namespace import in `_workflows.ts`). Reverting the diagnostic is sufficient to restore the test, and the diagnostic itself is not load-bearing.
Pull request overview
Implements the new default snapshot-based workflow runtime (QuickJS WASM VM with snapshot/restore) and wires snapshot persistence into world backends, while keeping the existing event-replay runtime as an opt-out via WORKFLOW_RUNTIME=replay.
Changes:
- Add snapshot runtime execution path in `@workflow/core` (VM bootstrap, snapshot save/load pipeline with compression + optional encryption, runtime-mode dispatch, and new telemetry attributes).
- Introduce `snapshots.save/load/delete` to the `@workflow/world` storage interface and implement it for `world-vercel`, `world-postgres`, and `world-local`.
- Expand CI/E2E coverage to run tests against both runtimes and reduce E2E flakiness by polling for hook registration instead of fixed sleeps.
Reviewed changes
Copilot reviewed 53 out of 54 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/create-test-matrix.mjs | Duplicates app matrix across snapshot and replay runtime axes. |
| pnpm-lock.yaml | Adds quickjs-wasi@2.0.0 lock entries. |
| packages/world/src/snapshots.ts | Adds SnapshotMetadataSchema (eventsCursor, createdAt). |
| packages/world/src/interfaces.ts | Extends Storage with snapshots.save/load/delete. |
| packages/world/src/index.ts | Exposes snapshot types/schema from @workflow/world. |
| packages/world-vercel/src/storage.ts | Wires snapshots into Vercel storage and instrumentation. |
| packages/world-vercel/src/snapshots.ts | Implements snapshot storage via workflow-server snapshot endpoints. |
| packages/world-vercel/src/snapshots.test.ts | Adds tests for PUT body correctness and retry behavior. |
| packages/world-postgres/test/storage.test.ts | Adds tests asserting dedup behavior for entity-creation races. |
| packages/world-postgres/src/storage.ts | Maps pg unique-violation for entity-creating events to EntityConflictError. |
| packages/world-postgres/src/snapshots.ts | Implements Postgres snapshot upsert/load/delete storage. |
| packages/world-postgres/src/index.ts | Wires snapshots storage into Postgres createStorage. |
| packages/world-postgres/src/drizzle/schema.ts | Adds snapshots table + entity-creation partial unique index. |
| packages/world-postgres/src/drizzle/migrations/meta/_journal.json | Registers new migrations in drizzle journal. |
| packages/world-postgres/src/drizzle/migrations/0010_add_snapshots_table.sql | Creates workflow.workflow_snapshots table. |
| packages/world-postgres/src/drizzle/migrations/0011_add_events_entity_creation_unique_index.sql | Adds partial unique index for step/hook/wait creation events. |
| packages/world-local/src/storage/snapshots-storage.ts | Adds filesystem-backed snapshot storage (bytes + metadata files). |
| packages/world-local/src/storage/index.ts | Wires snapshots storage into local storage and instrumentation. |
| packages/world-local/src/storage/events-storage.ts | Adds atomic lock-file dedup for step_created and wait_created. |
| packages/world-local/src/storage.test.ts | Adds race tests for local step/wait creation dedup behavior. |
| packages/world-local/src/queue.ts | Logs queue handler errors with stack for debugging. |
| packages/core/turbo.json | Adds generated VM bundle/assets files to build outputs. |
| packages/core/src/telemetry/semantic-conventions.ts | Adds snapshot runtime semantic convention attributes. |
| packages/core/src/source-map.ts | Adds stripInlineSourceMap() to reduce VM heap/snapshot size. |
| packages/core/src/source-map.test.ts | Tests stripInlineSourceMap() behavior. |
| packages/core/src/serialization/workflow-vm.ts | Adds VM-safe workflow-mode serializer/deserializer. |
| packages/core/src/serialization/workflow-vm.test.ts | Tests VM serializer and VM↔Node compatibility. |
| packages/core/src/serialization/vm-bundle-entry.ts | VM bundle entry: installs serde + deterministic ULID generator. |
| packages/core/src/serialization/types.ts | Adds compression format prefixes (gzip, zstd). |
| packages/core/src/serialization/reducers/common-vm.ts | Adds VM-safe reducers/revivers (base64 via btoa/atob). |
| packages/core/src/serialization/compression.ts | Adds compress/decompress layer with gzip/zstd feature detection. |
| packages/core/src/serialization/compression.test.ts | Tests compression layer behavior and codec selection. |
| packages/core/src/serialization/compat.test.ts | Adds compatibility tests between modular and legacy serialization APIs. |
| packages/core/src/serialization/codec-devalue.ts | Adds clarifying notes about modular modules vs legacy runtime path. |
| packages/core/src/serialization/codec-devalue-vm.ts | Adds VM-compatible devalue codec using VM reducers/revivers. |
| packages/core/src/runtime/start.ts | Propagates WORKFLOW_RUNTIME choice into executionContext. |
| packages/core/src/runtime/snapshot-runtime.ts | Implements QuickJS snapshot/restore runtime engine. |
| packages/core/src/runtime/snapshot-runtime.test.ts | Unit tests for snapshot runtime behavior and determinism. |
| packages/core/src/runtime/snapshot-entrypoint.ts | Integrates snapshot runtime into devkit entrypoint + storage pipeline. |
| packages/core/src/runtime/snapshot-entrypoint.test.ts | Tests snapshot-load skip heuristic. |
| packages/core/src/runtime/snapshot-encryption.test.ts | Tests compress→encrypt→decrypt→decompress contract. |
| packages/core/src/runtime/runtime-mode.ts | Adds WORKFLOW_RUNTIME parsing/validation. |
| packages/core/src/runtime/runtime-mode.test.ts | Tests runtime-mode env parsing. |
| packages/core/src/runtime.ts | Switches default runtime to snapshot with replay fallback. |
| packages/core/scripts/build-vm-serde-bundle.js | Generates VM serde bundle source used by snapshot runtime. |
| packages/core/scripts/build-quickjs-assets.js | Generates embedded quickjs-wasi wasm/extension assets. |
| packages/core/package.json | Adds quickjs-wasi dependency and generators to build script. |
| packages/core/e2e/e2e.test.ts | Replaces fixed hook sleeps with polling helper to reduce flakiness. |
| packages/core/.gitignore | Ignores generated VM bundle/assets files. |
| .github/workflows/tests.yml | Expands CI matrix across runtimes and avoids ARG_MAX in sticky comment. |
| .changeset/snapshot-runtime-world-vercel.md | Changeset for world-vercel snapshot storage + undici.request rationale. |
| .changeset/snapshot-runtime-world-postgres.md | Changeset for world-postgres snapshots + event uniqueness fix. |
| .changeset/snapshot-runtime-world-local.md | Changeset for world-local snapshots + event dedup fix. |
| .changeset/snapshot-runtime-core.md | Changeset for core snapshot runtime default + replay opt-out. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
| "scripts": { | ||
| "build": "genversion --es6 src/version.ts && tsc", | ||
| "build": "genversion --es6 src/version.ts && node scripts/build-vm-serde-bundle.js && node scripts/build-quickjs-assets.js && tsc", | ||
| "dev": "genversion --es6 src/version.ts && tsc --watch", | ||
| "clean": "tsc --build --clean && rm -rf dist src/version.ts docs ||:", |
```ts
 * The binary data is stored gzip-compressed in the `data` column.
 * Metadata (`eventsCursor`, `createdAt`) lives alongside for cheap loads.
 */
```
```ts
const escapedCid = cid.replace(/"/g, '\\"');
const eventData =
```
```ts
function arrayBufferToBase64(
  value: ArrayBufferLike,
  offset: number,
  length: number
): string {
  if (length === 0) return '.';
  // btoa requires a binary string. Build it from the byte view.
  const uint8 = new Uint8Array(value, offset, length);
  let binary = '';
  for (let i = 0; i < uint8.length; i++) {
    binary += String.fromCharCode(uint8[i]!);
  }
  return btoa(binary);
}
```
Summary
Implements the snapshot-based workflow runtime described in RFC #1298. Instead of replaying the full event log on every workflow handler invocation, workflows run inside a QuickJS WASM VM that is snapshotted at suspension points and restored on resumption — so each invocation only fetches and processes events that arrived since the last save.
The snapshot runtime is the default in this PR. The previous event-replay runtime remains available as an opt-out via `WORKFLOW_RUNTIME=replay` or `executionContext.workflowRuntime: 'replay'`.

How it works

- On suspension, the VM snapshot bytes go through the `compress → encrypt` pipeline (zstd on Node 22.15+, gzip fallback; AES-256-GCM when an encryption key is configured) and are persisted via `world.snapshots.save`.
- On resumption, `world.snapshots.load` returns the bytes, the inverse `decrypt → decompress` pipeline restores them, and `vm.restore()` resumes the VM at the exact suspension point.
- The restored workflow then fetches only the events that arrived after `eventsCursor`, processes them, and either resolves to a result, suspends on a new pending op, or fails.

Most of the snapshot-runtime work lives in `@workflow/core` (`runtime/snapshot-runtime.ts`, `runtime/snapshot-entrypoint.ts`, `serialization/compression.ts`, `serialization/vm-bundle-entry.ts`); each world implements `snapshots.save/load/delete` for its storage backend.

Scope of this PR

- `@workflow/core`: snapshot runtime, VM bootstrap, event-cursor-driven resume, deterministic correlationIds (seeded ULIDs across concurrent VM invocations of the same resumption), encryption and compression pipeline, `WORKFLOW_RUNTIME` env-var dispatch with replay-runtime fallback, OTel spans/attributes for the snapshot lifecycle, CI-visible diagnostic checkpoints (`SNAPSHOT_DIAG`).
- `@workflow/world`: new `Snapshots` interface (save/load/delete) and metadata schema.
- `@workflow/world-vercel`: workflow-server snapshot endpoints (`PUT/GET/DELETE /v2/runs/:runId/snapshot`), opaque-bytes transport, switch to `undici.request()` for retry-with-Buffer-body correctness, atomic per-(run, correlation) uniqueness for entity-creating events.
- `@workflow/world-postgres`: new `workflow_snapshots` table, unique partial index on `workflow_events` `(run_id, correlation_id, type)` for entity-creating events.
- `@workflow/world-local`: filesystem-backed snapshot storage (`{runId}.bin` + `{runId}.json`), atomic correlationId uniqueness for `step_created` / `wait_created`.
- CI: the test matrix is duplicated across `[snapshot, replay]`, with full Vercel-prod E2E coverage of the snapshot runtime across 11 frameworks.

Custom serializers (`Symbol.for('workflow-serialize')` / `Symbol.for('workflow-deserialize')`) and workflow-side `DOMException` / `WorkflowFunction` round-trip through the VM serde bundle alongside the standard reducers.

Out of scope / future work

- A dedicated diagnostic CLI for fetching function logs by `runId` (`getVercelFunctionLogs` was removed from the e2e diagnostic harness — belongs in its own PR).
- Shrinking the VM bundle, which currently pulls in `@opentelemetry/api`, `zod`, `ai-sdk`, etc. — tree-shaking those out is a builder-side change worth pursuing later.
- Reducing suspension overhead: each suspension costs a `snapshot.save` + storage RTT; further work could batch saves or skip them entirely for ops the runtime can recompute.

Based on `serialization-refactor` (PR #1299).