Serialize run_failed/step_failed errors through serialization pipeline #1851
TooTallNate wants to merge 15 commits into main
🦋 Changeset detected
Latest commit: 4963546
The changes in this PR will be included in the next version bump. This PR includes changesets to release 22 packages.
🧪 E2E Test Results
❌ Some tests failed

Failed tests: ▲ Vercel Production (6 failed)
- example (1 failed)
- express (1 failed)
- fastify (1 failed)
- hono (1 failed)
- nextjs-webpack (1 failed)
- nuxt (1 failed)

Details by category:
- ❌ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other

❌ Some E2E test jobs failed. Check the workflow run for details.
Switch run_failed, step_failed, and step_retrying events to persist
the full thrown value via the workflow serialization pipeline (as
SerializedData / Uint8Array) instead of a lossy { message, stack, code }
StructuredError shape. Consumers hydrate via hydrateRunError /
hydrateStepError to reconstruct the original thrown value, preserving
Error subclass identity, cause chains, and custom properties.
- WorkflowRun.error and Step.error are now SerializedData
- WorkflowRun gains a top-level errorCode plaintext field
- WorkflowRunFailedError.cause is now the hydrated thrown value
- Adds world-postgres migration 0010_add_error_code.sql
- Legacy pre-pipeline errorJson records surface as undefined on read
cause is now `unknown` (the hydrated thrown value) rather than
`Error & { code }`. Defensively extract Error-shaped fields when the
hydrated value is an Error, otherwise round-trip the raw value, and
expose the new `errorCode` classification field.
The hydrated `cause` is now `unknown` (the original thrown value through the serialization pipeline) and the error classification has moved to the top-level `errorCode` property. Update the two affected docs pages and the `TSDoc` interface to reflect the new shape, and narrow `cause` with `instanceof Error` before accessing fields.
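A minimal sketch of the narrowing described above. The `WorkflowRunFailedError` class here is a local stand-in that carries only the `cause` and `errorCode` fields named in this PR; `describeFailure`, `FailureSummary`, and the `"EXAMPLE_CODE"` value are hypothetical and not part of the SDK.

```typescript
// Local stand-in for illustration only; the real class lives in `@workflow/errors`.
class WorkflowRunFailedError extends Error {
  readonly cause: unknown;
  readonly errorCode: string | undefined;

  constructor(cause: unknown, errorCode?: string) {
    super("Workflow run failed");
    this.name = "WorkflowRunFailedError";
    this.cause = cause;
    this.errorCode = errorCode;
  }
}

// Hypothetical summary shape used only by this example.
interface FailureSummary {
  errorCode: string | undefined;
  name?: string;
  message: string;
}

function describeFailure(err: WorkflowRunFailedError): FailureSummary {
  const { cause, errorCode } = err;

  // `cause` is `unknown`, so narrow before touching Error-shaped fields.
  if (cause instanceof Error) {
    return { errorCode, name: cause.name, message: cause.message };
  }

  // Non-Error thrown values (strings, plain objects, ...) are reported as-is.
  const message =
    typeof cause === "string" ? cause : JSON.stringify(cause) ?? String(cause);
  return { errorCode, message };
}

const fromError = describeFailure(
  new WorkflowRunFailedError(new TypeError("bad input"), "EXAMPLE_CODE"),
);
const fromObject = describeFailure(new WorkflowRunFailedError({ reason: "nope" }));
```

Reading the classification from `errorCode` rather than from the cause keeps the handler working even when the thrown value was not an `Error` at all.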
Unit tests:
- 19 new dehydrate/hydrate{Step,Run}Error round-trip tests covering
FatalError, plain Error, built-in Error subclasses, non-Error thrown
values (string, plain object), cause chains, encryption round-trip,
the binary format prefix contract, and the unserializable / unknown-
format error paths.
- 5 new tests for Run.returnValue when the run is failed: hydrated
FatalError + cause as cause, plain Error preservation, non-Error
thrown values surfaced verbatim, cross-class cause chains, and the
hydration-failure fallback that still surfaces errorCode.
E2E tests (new, in 99_e2e.ts + e2e.test.ts):
- Step throw → workflow catch round-trips a FatalError with a TypeError
cause chain, asserting class identity, fatal marker, and cause name +
message all survive the step_failed event pipeline.
- Workflow throw → run_failed reaches status with the new
top-level errorCode metadata exposed (cause-shape coverage lives at
the unit level, since the SWC plugin's class registration is not
invoked in the plain-Node e2e runner).
- Workflow throw of a non-Error value round-trips that value verbatim
as WorkflowRunFailedError.cause.
Adjustments to existing assertions:
- error.cause is now `unknown`; tests narrow with `instanceof Error`
and use the new top-level `errorCode` field instead of `cause.code`.
- step.error / run.error from CLI --withData are now hydrated payloads:
unregistered class instances surface as Instance refs whose
carries the original message + stack.
Observability hydration:
- hydrateStepIO / hydrateWorkflowIO in serialization-format.ts now
hydrate the `error` field via hydrateData, so the CLI and web UI
continue to surface readable run/step error messages and stacks.
When a workflow runs in a Node `vm` context, its bundled
`@workflow/errors` is a different module instance than the host's
import (separate prototype chains, separate class identity). Calling
`new FatalError(...)` from the host-side reviver produces a
host-realm instance that fails `err instanceof FatalError` checks
in the workflow code — even when the serialized payload was correctly
tagged via the dedicated `FatalError` reducer.
Surfaced by the local-prod e2e "step throw round-trips FatalError"
test on Next.js Turbopack: each route gets its own bundled chunk, so
the flow handler's `@workflow/errors` and the workflow VM bundle's
`@workflow/errors` are two distinct copies of the same module.
Fix:
- Each bundled copy of `@workflow/errors` self-registers its
`FatalError` and `RetryableError` classes on `globalThis` via
`Symbol.for("@workflow/errors//FatalError")` /
`Symbol.for("@workflow/errors//RetryableError")`. First load wins
per realm; the descriptor is non-writable / non-configurable to make
accidental clobbering loud.
- The revivers in `@workflow/core`'s common reducers module read the
consumer's `globalThis` (passed in as `global`) to pick up the
realm-local class, falling back to the host-imported class when no
registration is present (e.g. in the CLI / test runner).
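The registration and lookup described in the fix can be sketched as follows. This is an illustration under assumptions, not the code from `@workflow/errors` or `@workflow/core`: `registerOnce`, `resolveCtor`, `HostFatalError`, and `VmFatalError` are hypothetical names, and plain objects stand in for each realm's `globalThis`. Only the `Symbol.for` key comes from the commit message.

```typescript
// Stand-ins for two bundled copies of the same class (illustration only).
class HostFatalError extends Error {
  readonly fatal = true;
}
class VmFatalError extends Error {
  readonly fatal = true;
}

type ErrorCtor = new (message: string) => Error;

const FATAL_ERROR_KEY = Symbol.for("@workflow/errors//FatalError");

// First load wins: define the slot only if this realm has no registration yet.
function registerOnce(realmGlobal: object, key: symbol, ctor: ErrorCtor): void {
  if (Object.prototype.hasOwnProperty.call(realmGlobal, key)) return;
  Object.defineProperty(realmGlobal, key, {
    value: ctor,
    writable: false,
    configurable: false,
    enumerable: false,
  });
}

// Prefer the realm-local class; fall back to the host-imported one.
function resolveCtor(realmGlobal: object, key: symbol, fallback: ErrorCtor): ErrorCtor {
  const registered = (realmGlobal as Record<symbol, unknown>)[key];
  return typeof registered === "function" ? (registered as ErrorCtor) : fallback;
}

// Plain objects stand in for each realm's `globalThis`.
const vmGlobal = {};
registerOnce(vmGlobal, FATAL_ERROR_KEY, VmFatalError);
registerOnce(vmGlobal, FATAL_ERROR_KEY, HostFatalError); // ignored: first load wins

const revived = new (resolveCtor(vmGlobal, FATAL_ERROR_KEY, HostFatalError))("boom");
const fallback = new (resolveCtor({}, FATAL_ERROR_KEY, HostFatalError))("boom");
```

Because the reviver constructs the instance from the class registered in the consumer's realm, an `instanceof` check made by code in that realm sees the class it expects.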
The runtime's run-failure path computes a source-map-remapped stack and then assigns it back onto the thrown value via `if (err instanceof Error) err.stack = errorStack`. Workflows run inside a Node `vm` context, so a workflow-thrown error is an instance of the VM realm's `Error`: `instanceof` against the host realm's `Error` returns `false`, the assignment is skipped, and the serialized `run_failed` event carries the un-remapped (bundled-line-number) stack instead of the source-mapped one.

Switch the gate to `types.isNativeError`, which uses V8's internal type tag and works across realms. This is the same approach already in place for the serialization reducers.

Caught by the local-prod e2e "nested function calls preserve message and stack trace" and "cross-file imports preserve message and stack trace" tests, which assert that the persisted run-error stack contains `99_e2e.ts` / `helpers.ts`.
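The cross-realm behavior can be reproduced in a few lines with Node's `node:vm` and `node:util` modules. This is a standalone sketch; `setStackIfError` is a hypothetical helper that mirrors the gate described above, not the runtime's actual function.

```typescript
import { runInNewContext } from "node:vm";
import { types } from "node:util";

// An Error created inside a fresh vm context belongs to that context's realm.
const vmError: unknown = runInNewContext('new Error("boom")');

// Host-realm check: compares against the host's Error.prototype.
const passesInstanceof = vmError instanceof Error;

// Realm-independent check based on the engine's internal type tag.
const passesIsNativeError = types.isNativeError(vmError);

// Hypothetical helper: assign a remapped stack only to genuine Error objects.
function setStackIfError(value: unknown, stack: string): boolean {
  if (types.isNativeError(value)) {
    value.stack = stack;
    return true;
  }
  return false;
}

const assigned = setStackIfError(vmError, "Error: boom\n    at remapped.ts:1:1");
```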
Two issues with the CLI's hand-rolled reviver list:

1. It hadn't been updated for the new first-class Error subclass reducers (`TypeError`, `RangeError`, `FatalError`, `RetryableError`, etc.). devalue throws "Unknown type X" when it encounters a reduced value with no matching reviver, and `hydrateResourceIO` swallows that error and surfaces the raw `Uint8Array` payload, so `step.error` / `run.error` showed up as raw byte dumps in `workflow inspect` output.
2. Even with all the right revivers, `Error.prototype`'s `message` / `stack` / `cause` are non-enumerable, so `JSON.stringify` (used by `workflow inspect --json`) drops them, leaving the subclass-specific enumerable fields (e.g. `FatalError.fatal`) visible but the actual error data missing.

Fix:

- Build the CLI reviver set on top of `getCommonRevivers()` from `@workflow/core` so the CLI stays in sync with the runtime's reducer set automatically. New core reducers/revivers will Just Work without any CLI-side change.
- Wrap each Error reviver from the common set with a thin shim that attaches a non-enumerable `toJSON` method to the produced `Error` instance. `JSON.stringify` calls `toJSON` and gets a full object (`name` + `message` + `stack` + `cause` + any enumerable subclass fields like `fatal` / `retryAfter` / `errors`); `util.inspect` ignores `toJSON` and renders the canonical `Error: msg\n at ...` format. Best of both worlds for CLI output without compromising the runtime hydration path.

Caught by the local-prod e2e "basic step error preserves" and "cross-file step error preserves" tests, which read `failedStep.error.message` / `.stack` from the CLI's JSON output.
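The `toJSON` shim idea can be sketched like this. It is an approximation under assumptions: `withToJSON` is a hypothetical helper name, `FatalError` is a local stand-in, and the actual shim in the CLI may differ in which fields it copies.

```typescript
// Local stand-in; the real `FatalError` is exported by `@workflow/errors`.
class FatalError extends Error {
  readonly fatal = true;
  constructor(message: string) {
    super(message);
    this.name = "FatalError";
  }
}

// Attach a non-enumerable `toJSON` so JSON.stringify sees the Error's data.
function withToJSON<E extends Error>(err: E): E {
  Object.defineProperty(err, "toJSON", {
    enumerable: false,
    configurable: true,
    writable: true,
    value(this: Error): Record<string, unknown> {
      return {
        ...this, // enumerable own fields, e.g. `fatal`
        name: this.name,
        message: this.message,
        stack: this.stack,
        cause: (this as { cause?: unknown }).cause,
      };
    },
  });
  return err;
}

// Without the shim, the non-enumerable message and stack are dropped.
const bare = JSON.parse(JSON.stringify(new Error("boom")));

// With the shim, the serialized object carries the full error data.
const shimmed = JSON.parse(JSON.stringify(withToJSON(new FatalError("boom"))));
```

Keeping `toJSON` non-enumerable means it does not show up when the error's own keys are listed, so only JSON serialization is affected.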
📊 Benchmark Results
Each benchmark ran in 💻 Local Development and ▲ Production (Vercel), with 🔍 Observability links for Nitro and Next.js (Turbopack). Scenarios:
- workflow with no steps
- workflow with 1 step
- workflow with 10, 25, and 50 sequential steps
- Promise.all with 10, 25, and 50 concurrent steps
- Promise.race with 10, 25, and 50 concurrent steps
- workflow with 10, 25, and 50 sequential data payload steps (10KB)
- workflow with 10, 25, and 50 concurrent data payload steps (10KB)
- Stream Benchmarks (includes TTFB metrics): workflow with stream; stream pipeline with 5 transform steps (1MB); 10 parallel streams (1MB each); fan-out fan-in 10 streams (1MB each)

Summary: Fastest Framework by World and Fastest World by Framework, with the winner determined by most benchmark wins.
Pull request overview
This PR updates the workflow failure event model so run_failed, step_failed, and step_retrying persist the full thrown value through the existing serialization pipeline (SerializedData / Uint8Array) rather than a lossy { message, stack, code } shape, enabling consumers to hydrate back to the original thrown value (including Error subclass identity, cause chains, and custom properties).
Changes:
- Persist run/step failure payloads as opaque `SerializedData` and add `errorCode` as plaintext metadata for workflow runs.
- Add dedicated `dehydrate{Step,Run}Error` / `hydrate{Step,Run}Error` helpers and wire them into step/workflow handlers and consumer APIs (`Run.returnValue`, step promise rejection).
- Update storage backends (local + Postgres + Vercel world), CLI/web UI hydration, docs, and tests; add Postgres migration for `workflow_runs.error_code`.
Reviewed changes
Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| workbench/nextjs-webpack/pages/api/trigger-pages.ts | Adjusts workbench API response to handle WorkflowRunFailedError.cause as unknown and surfaces errorCode. |
| workbench/nextjs-turbopack/pages/api/trigger-pages.ts | Same as above for turbopack workbench app. |
| workbench/example/workflows/99_e2e.ts | Adds e2e workflows that validate round-tripping thrown values (FatalError + cause chains; non-Error throws). |
| packages/world/src/steps.ts | Switches Step.error to SerializedData and updates schema/docs accordingly. |
| packages/world/src/runs.ts | Switches WorkflowRun.error to SerializedData and adds top-level errorCode. |
| packages/world/src/events.ts | Updates step_failed/step_retrying/run_failed event payload schemas to carry serialized error (+ errorCode for runs). |
| packages/world-vercel/src/utils.ts | Makes error (de)serialization helpers pass-through for SerializedData wire format. |
| packages/world-vercel/src/steps.ts | Updates step wire schema + deserializer to pass through serialized error payloads. |
| packages/world-vercel/src/runs.ts | Updates run wire schema to accept serialized error payloads and separate errorCode. |
| packages/world-vercel/src/events.ts | Updates event result transformation docs/behavior for serialized error fields. |
| packages/world-postgres/test/storage.test.ts | Updates Postgres storage tests to assert opaque Uint8Array persistence and legacy error handling. |
| packages/world-postgres/src/storage.ts | Updates Postgres storage read/write paths for serialized errors and run errorCode. |
| packages/world-postgres/src/drizzle/schema.ts | Changes CBOR error column typing to SerializedData and adds error_code column. |
| packages/world-postgres/src/drizzle/migrations/meta/_journal.json | Registers the new Postgres migration. |
| packages/world-postgres/src/drizzle/migrations/0010_add_error_code.sql | Adds error_code column to workflow.workflow_runs. |
| packages/world-local/src/storage/events-storage.ts | Updates local world event application to store serialized errors verbatim (+ errorCode). |
| packages/world-local/src/storage.test.ts | Updates local world storage tests for new error payload shape and stripping behavior. |
| packages/web-shared/src/components/sidebar/attribute-panel.tsx | Adds UI display handling for the new errorCode attribute. |
| packages/errors/src/index.ts | Updates WorkflowRunFailedError to use cause: unknown and add errorCode; registers FatalError/RetryableError on globalThis for cross-realm identity. |
| packages/core/src/step.ts | Hydrates step failure payloads via hydrateStepError and rejects with the original thrown value. |
| packages/core/src/step.test.ts | Updates tests to use error de/rehydration pipeline and validate subclass preservation. |
| packages/core/src/serialization/reducers/common.ts | Updates FatalError/RetryableError revivers to resolve constructors via globalThis Symbol registry when available. |
| packages/core/src/serialization.ts | Adds dehydrate{Step,Run}Error / hydrate{Step,Run}Error helpers and integrates with format-prefix + optional encryption. |
| packages/core/src/serialization.test.ts | Adds unit tests covering round-trips for step/run error helpers, subclasses, causes, and encryption. |
| packages/core/src/serialization-format.ts | Extends observability hydration to hydrate error fields on step/run resources. |
| packages/core/src/runtime/step-handler.ts | Writes step_failed/step_retrying errors via dehydrateStepError (encrypted when configured). |
| packages/core/src/runtime/step-handler.test.ts | Updates mocked serialization + assertions to account for binary error payload. |
| packages/core/src/runtime/runs.test.ts | Adds tests ensuring Run.returnValue throws WorkflowRunFailedError with hydrated cause + errorCode. |
| packages/core/src/runtime/run.ts | Hydrates run errors via hydrateRunError before throwing WorkflowRunFailedError. |
| packages/core/src/runtime.ts | Writes run_failed errors via dehydrateRunError and preserves remapped stacks on the serialized thrown value. |
| packages/core/src/async-deserialization-ordering.test.ts | Updates ordering tests to use serialized step errors. |
| packages/core/e2e/e2e.test.ts | Updates/extends e2e coverage for new errorCode + hydrated causes across step/run failures. |
| packages/cli/src/lib/inspect/hydration.ts | Uses core common revivers and adds Error.toJSON shims so --json output includes name/message/stack/cause. |
| docs/content/docs/foundations/errors-and-retries.mdx | Updates docs to reference errorCode and cause: unknown narrowing. |
| docs/content/docs/api-reference/workflow-errors/workflow-run-failed-error.mdx | Updates API reference for WorkflowRunFailedError’s new cause and errorCode. |
| .changeset/run-step-error-hydration.md | Declares major-version breaking changes across affected packages. |
The previous JSDoc described preserving legacy values "for best-effort hydration" which contradicted the implementation, where legacy errors are intentionally surfaced as absent (the pre-pipeline shapes can't be hydrated by the new error revivers). Rewrite the comment so the contract matches behavior. Also rename the now-unused parameter to `_errorJson` to reflect that the function ignores it. Caught by a code review on #1851.
karthikscale3 left a comment
AI review: Five flagged items from my analysis of this PR, posted as inline comments below.
Three review-driven adjustments that all touch the queue handlers and their interaction with the error serialization pipeline:

1. Memoize the per-run encryption key fetch. The step handler used to eagerly fetch + import the key at the top of every step delivery so the value would be in scope for every potential dehydrateStepError path. That pessimized step-started early-return cases (the fetch happens unconditionally even when the step never reaches user code) and required duplicating the same boilerplate at four call sites in runtime.ts. Introduce `memoizeEncryptionKey(world, run)` in runtime/helpers.ts that returns a lazy, single-fetch accessor; step-handler / runtime call sites use `await getEncryptionKey()` instead. The first caller pays the fetch cost, subsequent callers await the cached promise, and steps that fail before any encryption-aware work happens skip the fetch entirely.
2. Preserve the prior attempt's serialized error as the cause on the defensive max-retries-exceeded `step_failed` re-invocation guard. The existing comment explicitly opted out of cause attachment, but the symmetric post-failure path below already does this and the reviewer is right that consumers shouldn't have to walk the step_retrying event history to recover the underlying error. Best-effort: if hydration of the prior `step.error` throws, fall back to a FatalError without cause rather than letting the event write itself fail.
3. Document the intentional `unflatten` throw in `hydrateStepError` / `hydrateRunError` for non-Uint8Array input. SDK version is pinned per workflow run via skew protection so the non-binary branch is dead in production; if a misshapen value reaches it, surfacing the throw via the surrounding o11y try/catch is more debuggable than masking it. Add a comment so future reviewers don't reach for a defensive fallback.
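The lazy, single-fetch accessor from item 1 follows a common memoization pattern, sketched here in generic form. `memoizeAsync` and the fake fetcher are hypothetical; the real `memoizeEncryptionKey(world, run)` takes different arguments and its internals are not shown in this PR text.

```typescript
type AsyncAccessor<K> = () => Promise<K>;

// Wrap a fetcher so that it runs at most once, and only when first requested.
function memoizeAsync<K>(fetchKey: AsyncAccessor<K>): AsyncAccessor<K> {
  let pending: Promise<K> | undefined;
  return () => {
    if (pending === undefined) {
      pending = fetchKey(); // the first caller pays the fetch cost
    }
    return pending; // later callers await the same cached promise
  };
}

// Fake fetcher that counts how often it is invoked (illustration only).
let fetchCount = 0;
const getEncryptionKey = memoizeAsync(async () => {
  fetchCount += 1;
  return "example-key";
});

const countBeforeUse = fetchCount; // nothing fetched until first use
const first = getEncryptionKey();
const second = getEncryptionKey();
```

Note that this sketch also caches a rejected promise; whether the real helper retries after a failed fetch is not stated here.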
A standalone `falls back to plaintext` suggestion on the run_failed key fetch was rejected: when encryption is configured we should fail loudly rather than silently emit plaintext error data. The queue's redelivery semantics will retry the key fetch; persistent KMS outages get logged with the existing "persistent error preventing the run from being terminated" message rather than a security regression.
`hydrateEventData` enumerated the per-event fields that need
hydration (`result`, `input`, `output`, `metadata`, `payload`)
but omitted the new `error` field on `step_failed`,
`step_retrying`, and `run_failed` events. Without this branch,
o11y tools that list events (e.g. `workflow inspect events`) surface
the raw `Uint8Array` payload instead of a hydrated
`{ name, message, stack, … }` object even though the entity-level
`Run.error` / `Step.error` paths already hydrate.
Mirrors the existing per-field branches; the `try/catch` leaves the
field un-hydrated on parse failure rather than failing the whole
event view. Adds a unit test.
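A rough sketch of the per-field branch described above, assuming the hydrator is passed in as a function. `hydrateErrorField` and the `Hydrator` type are hypothetical names for illustration; the real `hydrateEventData` handles several fields and uses the pipeline's own hydration.

```typescript
type Hydrator = (data: Uint8Array) => unknown;

// Hydrate the `error` field if it is a binary payload; on failure, keep the
// raw payload so the rest of the event view still renders.
function hydrateErrorField<E extends { error?: unknown }>(
  eventData: E,
  hydrate: Hydrator,
): E {
  if (!(eventData.error instanceof Uint8Array)) return eventData;
  try {
    return { ...eventData, error: hydrate(eventData.error) };
  } catch {
    return eventData;
  }
}

const raw = { stepId: "step_1", error: new Uint8Array([1, 2, 3]) };

// Stand-in hydrator that succeeds.
const hydrated = hydrateErrorField(raw, (data) => ({
  name: "Error",
  message: `decoded ${data.length} bytes`,
}));

// Stand-in hydrator that fails, as an unknown-type parse error would.
const untouched = hydrateErrorField(raw, () => {
  throw new Error("Unknown type");
});
```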
Workflows execute inside a separate `vm` realm: the `WorkflowRuntimeError` class bundled into the workflow code and the host-imported one are distinct constructors, so an `err instanceof WorkflowRuntimeError` check on a VM-thrown error returns `false` and we'd misclassify genuine runtime errors (corrupted event log, missing timestamps, workflow/step not registered) as user errors. Switch to each subclass's `.is()` static (a name-based duck check that works across realms). Since `WorkflowRuntimeError.is` only matches its own concrete name, enumerate every concrete subclass we want to recognize (`StepNotRegisteredError`, `WorkflowNotRegisteredError`) in a `RUNTIME_ERROR_CHECKS` table; keep that table in sync with the class hierarchy in `@workflow/errors`. Existing `classify-error.test.ts` already covers `WorkflowRuntimeError` and `WorkflowNotRegisteredError` cases — both still pass.
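A name-based check table along these lines might look like the sketch below. It assumes each check compares only the `name` property; the real `.is()` statics in `@workflow/errors` may check more, and `isNamed` and `isRuntimeError` are hypothetical helper names.

```typescript
type ErrorCheck = (value: unknown) => boolean;

// Duck-typed check: match on the `name` property instead of the constructor,
// so it still works when the class comes from a different realm or bundle.
function isNamed(name: string): ErrorCheck {
  return (value) =>
    typeof value === "object" &&
    value !== null &&
    (value as { name?: unknown }).name === name;
}

// One entry per concrete class to recognize; keep in sync with the hierarchy.
const RUNTIME_ERROR_CHECKS: readonly ErrorCheck[] = [
  isNamed("WorkflowRuntimeError"),
  isNamed("StepNotRegisteredError"),
  isNamed("WorkflowNotRegisteredError"),
];

function isRuntimeError(value: unknown): boolean {
  return RUNTIME_ERROR_CHECKS.some((check) => check(value));
}

// Simulates an error whose class identity differs from the host's copy.
const foreign = Object.assign(new Error("step missing"), {
  name: "StepNotRegisteredError",
});
const userError = new Error("user code failed");
```

The trade-off is that a name match is weaker than class identity: any object with a matching `name` passes, which is why the table lists each concrete name explicitly.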
We had `errorWorkflowThrowNonErrorValue` (workflow body throws a plain object — round-trips verbatim as `WorkflowRunFailedError.cause`) but no symmetric coverage for the step-throw side. Step-throw goes through a different code path: non-Error values aren't recognized as `FatalError` (no `name === 'FatalError'`) nor `RetryableError`, so they take the transient retry path. After max retries the runtime wraps the original thrown value as `cause` on a fresh `FatalError` which the workflow's catch block then sees. Add a workflow that throws a recognizable plain object from a step with `maxRetries = 0` (so we exhaust on first attempt and avoid a long test wait) and a workflow that asserts the wrapped FatalError shape: `isFatal`, `instanceof FatalError`, message includes the original object's serialized form, `cause` is the original non-Error object verbatim with structure preserved. Documents the current retry-then-wrap behavior so any future change to "non-Error throws skip retries" semantics has to update the test.
Pre-upgrade failed runs that wrote into world-postgres's deprecated `error` text column can't be hydrated through the new pipeline (the shape is incompatible with the new revivers). The new runtime intentionally surfaces them as `error: undefined` on read; the original payload is still readable directly from the `errorJson` column for manual inspection. Add a one-sentence note to the changeset's migration text so consumers upgrading don't get blindsided by suddenly-empty error fields on historical runs.
Summary

Switch `run_failed`, `step_failed`, and `step_retrying` events to persist the full thrown value through the workflow serialization pipeline (as `SerializedData` / `Uint8Array`) instead of a lossy `{ message, stack, code }` `StructuredError` shape. Consumers hydrate via `hydrateRunError` / `hydrateStepError` to reconstruct the original thrown value, preserving `Error` subclass identity, `cause` chains, and custom properties (e.g. `FatalError.fatal`, `RetryableError.retryAfter`).

Breaking changes

- `WorkflowRun.error` and `Step.error` are now `SerializedData` (`Uint8Array`) instead of `{ message, stack?, code? }`. Consumers must hydrate via `hydrateRunError` / `hydrateStepError`.
- `WorkflowRun` gains a top-level `errorCode` field carrying the previous `error.code` value as plaintext metadata.
- `WorkflowRunFailedError.cause` is now `unknown` (the hydrated thrown value with its original type identity preserved) instead of a synthesized `Error`. A new `errorCode` property exposes the error classification.
- `step_failed`, `step_retrying`, and `run_failed` now contain `error: SerializedData`.
- `@workflow/world-postgres` migration `0010_add_error_code.sql` (new `error_code` column on `workflow.workflow_runs`).
- Legacy pre-pipeline records in the `error` text column are surfaced as `undefined` on read (they cannot be hydrated into the original thrown value).

Stack

- `FatalError` and `RetryableError` #1513: `FatalError` / `RetryableError` first-class serde (base of this PR)
- `run_failed` / `step_failed` errors through the serialization pipeline

Notes

- `FatalError` and `RetryableError` #1513, which adds dedicated reducers/revivers for `FatalError` / `RetryableError`. Together with the built-in `Error` subclass support from "Add first-class serialization for built-in Error subclasses" #1511, this means thrown values round-trip with full type identity preserved.