
[core] Move stream reconnect logic to getReadable level #1847

Open
VaguelySerious wants to merge 9 commits into stable from peter/stream-control-at-getreadable-level

Conversation


@VaguelySerious VaguelySerious commented Apr 23, 2026

Reverts #1790 and instead moves stream reconnects to the getReadable level, where we already do chunk framing.

Closes #1801
Closes #1802

After shipping this

This will need to be forward-ported to main later.
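
For reviewers, a rough sketch of the shape of the change (simplified, not the actual implementation; readFromStream and parseFrames below are stand-ins for the world transport call and the existing chunk-framing logic, and backpressure handling is omitted):

// Stand-ins for the world transport and the existing frame parser (not the real signatures):
declare function readFromStream(name: string, startIndex: number): Promise<AsyncIterable<Uint8Array>>;
declare function parseFrames(bytes: AsyncIterable<Uint8Array>): AsyncIterable<Uint8Array>;

const MAX_RECONNECTS = 10;

function createReconnectingFramedStream(name: string, startIndex: number): ReadableStream<Uint8Array> {
  let consumedFrames = 0;
  return new ReadableStream<Uint8Array>({
    async start(controller) {
      for (let attempt = 0; attempt <= MAX_RECONNECTS; attempt++) {
        try {
          const upstream = await readFromStream(name, startIndex + consumedFrames);
          for await (const frame of parseFrames(upstream)) {
            controller.enqueue(frame); // one fully parsed frame = one resumable unit
            consumedFrames++;
          }
          controller.close(); // clean EOF is treated as "stream truly complete"
          return;
        } catch {
          // transient upstream error: loop again and reconnect at startIndex + consumedFrames;
          // any partial-frame bytes buffered by the parser are simply discarded
        }
      }
      controller.error(new Error('exceeded max reconnect attempts'));
    },
  });
}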


changeset-bot Bot commented Apr 23, 2026

🦋 Changeset detected

Latest commit: e58a449

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 18 packages
| Name | Type |
| --- | --- |
| @workflow/world-vercel | Patch |
| @workflow/core | Patch |
| @workflow/cli | Patch |
| @workflow/web | Patch |
| @workflow/builders | Patch |
| @workflow/next | Patch |
| @workflow/nitro | Patch |
| @workflow/vitest | Patch |
| @workflow/web-shared | Patch |
| workflow | Patch |
| @workflow/world-testing | Patch |
| @workflow/astro | Patch |
| @workflow/nest | Patch |
| @workflow/rollup | Patch |
| @workflow/sveltekit | Patch |
| @workflow/vite | Patch |
| @workflow/nuxt | Patch |
| @workflow/ai | Patch |



vercel Bot commented Apr 23, 2026

The latest updates on your projects.

Project Deployment Actions Updated (UTC)
example-nextjs-workflow-turbopack Ready Ready Preview, Comment Apr 29, 2026 0:49am
example-nextjs-workflow-webpack Ready Ready Preview, Comment Apr 29, 2026 0:49am
example-workflow Ready Ready Preview, Comment Apr 29, 2026 0:49am
workbench-astro-workflow Ready Ready Preview, Comment Apr 29, 2026 0:49am
workbench-express-workflow Ready Ready Preview, Comment Apr 29, 2026 0:49am
workbench-fastify-workflow Ready Ready Preview, Comment Apr 29, 2026 0:49am
workbench-hono-workflow Ready Ready Preview, Comment Apr 29, 2026 0:49am
workbench-nitro-workflow Ready Ready Preview, Comment Apr 29, 2026 0:49am
workbench-nuxt-workflow Ready Ready Preview, Comment Apr 29, 2026 0:49am
workbench-sveltekit-workflow Ready Ready Preview, Comment Apr 29, 2026 0:49am
workbench-vite-workflow Ready Ready Preview, Comment Apr 29, 2026 0:49am
workflow-docs Ready Ready Preview, Comment, Open in v0 Apr 29, 2026 0:49am
workflow-swc-playground Ready Ready Preview, Comment Apr 29, 2026 0:49am
workflow-web Ready Ready Preview, Comment Apr 29, 2026 0:49am


github-actions Bot commented Apr 23, 2026

🧪 E2E Test Results

Some tests failed

Summary

| | Passed | Failed | Skipped | Total |
| --- | --- | --- | --- | --- |
| ❌ ▲ Vercel Production | 900 | 1 | 67 | 968 |
| ✅ 🪟 Windows | 88 | 0 | 0 | 88 |
| Total | 988 | 1 | 67 | 1056 |

❌ Failed Tests

▲ Vercel Production (1 failed)

nextjs-turbopack (1 failed):

  • pathsAliasWorkflow - TypeScript path aliases resolve correctly | wrun_01KQCMWDP78VRTE0FQ5D0HBVME | 🔍 observability

Details by Category

❌ ▲ Vercel Production

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ astro | 81 | 0 | 7 |
| ✅ example | 81 | 0 | 7 |
| ✅ express | 81 | 0 | 7 |
| ✅ fastify | 81 | 0 | 7 |
| ✅ hono | 81 | 0 | 7 |
| ❌ nextjs-turbopack | 85 | 1 | 2 |
| ✅ nextjs-webpack | 86 | 0 | 2 |
| ✅ nitro | 81 | 0 | 7 |
| ✅ nuxt | 81 | 0 | 7 |
| ✅ sveltekit | 81 | 0 | 7 |
| ✅ vite | 81 | 0 | 7 |

✅ 🪟 Windows

| App | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| ✅ nextjs-turbopack | 88 | 0 | 0 |

📋 View full workflow run


Some E2E test jobs failed:

  • Vercel Prod: failure
  • Local Dev: failure
  • Local Prod: failure
  • Local Postgres: failure
  • Windows: success

Check the workflow run for details.


@TooTallNate TooTallNate left a comment


Review

The architectural shift makes sense: client-side frame counting is a cleaner abstraction than wire-level control frames, and moving it to core means it works for any world that returns a ReadableStream from readFromStream, not just world-vercel. The reconnect math, frame-counting, and partial-frame discard are all correct.

But there are two significant concerns I think need addressing before merge.

1. Byte streams lose auto-reconnect entirely

The PR explicitly opts byte streams out of reconnect:

if (value.type === 'bytes') {
  // No auto-reconnect here: raw byte streams have no wire framing
  const readable = new WorkflowServerReadableStream(value.name, value.startIndex);
  // ...
} else {
  const readable = createReconnectingFramedStream(value.name, value.startIndex);
  // ...
}

The reason given is technically correct (no wire framing → no chunk boundary detection client-side), but this is a regression vs. the reverted #1790, which handled byte streams just fine because the server sent the resume hint via control frame.

The use cases that lose auto-reconnect:

  • AI streaming responses (text/SSE) piped from getWritable()
  • Any HTTP route doing return new Response(run.getReadable()) for raw bytes
  • Any streaming workflow output that runs longer than 2 minutes (the prior server-side timeout window) and uses the bytes type

The docs callout added by this PR points users to WorkflowChatTransport and supportsCancellation, but those address a different problem (cancellation, not reconnect). Pushing reconnect to the application layer — where every consumer has to reimplement it — is a step backward in usability.

Possible directions:

  1. Frame byte streams on the writable side too (4 bytes per chunk overhead) so createReconnectingFramedStream works for them. The user-facing surface stays raw bytes; only the wire format changes.
  2. Keep the control-frame approach for byte streams only as a hybrid — frame counting for non-byte streams, server-side hint for byte streams.
  3. Document this as an explicit limitation and update the docs callout to specifically warn about byte streams losing reconnect, not just talk about supportsCancellation (separate issue).

(1) seems best to me — it removes the asymmetry entirely and keeps the cleaner architecture.

2. The "clean EOF means done" assumption needs verification

if (result.done || !result.value) {
  // Clean EOF — stream is truly complete...
  controller.close();
  return;
}

This assumes the workflow-server signals "done" and "timeout/aborted" differently at the network level — clean done = FIN, timeout = error/reset. The deleted control-frame logic disambiguated these because both manifested as clean closes from a TCP perspective; the magic-footer frame was the disambiguator.

Without the control frame, the new code can't tell them apart. If the workflow-server's 2-minute timeout sends a clean FIN (rather than a TCP reset or stream error), this PR will appear to "complete" any stream that hits 2 minutes.

Is that assumption verified against the actual server behavior? The new test simulates max-duration as controller.error(...), which is fine for the unit test, but I'd want to see either:

  • An e2e test confirming a real long-lived stream against workflow-server triggers reconnect (not premature close)
  • A statement in the PR description / commit explaining why the server-side timeout is now an error not a clean close (was the workflow-server changed? was the timeout removed?)

The supportsCancellation callout suggests the architecture has shifted such that streams now run for the full function maxDuration rather than the old 2-minute server timeout — but if so, that's a precondition for this PR and worth calling out explicitly.

Minor

See inline comments.

What looks good

  • Frame-counting math is correct: currentStartIndex += consumedFrames resumes at the right place, partial-frame buffer is correctly discarded, the math is symmetric for non-zero initial startIndex (worked example after this list).
  • Negative startIndex correctly bypasses reconnect with a clear reason (can't compute absolute resume index without a tail-index lookup) — and there's a test for it.
  • AbortController plumbing in world-vercel readFromStream is the right primitive. Cancel propagation through cancel(reason) { abortController.abort(reason) } correctly tears down the fetch.
  • Test coverage for createReconnectingFramedStream is good — frames split across reads, partial frame at error, clean EOF, non-zero initial startIndex, negative startIndex bypass, cancel propagation. Six tests, all targeted.
  • Two changesets correctly scoped: @workflow/core for the new wrapper, @workflow/world-vercel for the cancel propagation.
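
Spelling the resume invariant out with made-up numbers (illustrative only, not taken from the PR's tests):

// Worked example of the resume math described above:
const resolvedStartIndex = 5;  // caller asked to resume at frame 5
const consumedFrames = 3;      // frames 5, 6 and 7 were fully delivered before the drop
// Partial bytes of frame 8 buffered at the moment of the error are discarded, so
// the reconnect asks for frame 8 and the server re-sends it in full.
const reconnectStartIndex = resolvedStartIndex + consumedFrames; // = 8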

value.name,
value.startIndex
);
if (value.type === 'bytes') {

Byte streams are intentionally opted out of auto-reconnect here. This is a behavioral regression vs. the reverted #1790, which handled byte streams via server-sent control frames.

The comment correctly identifies why this is hard (no wire framing → no chunk boundary detection client-side), but pushing reconnect to the application layer means:

  1. Every consumer of run.getReadable() for byte streams (AI text streaming, raw HTTP responses, etc.) has to implement its own reconnect logic.
  2. The docs callout added by this PR (about supportsCancellation) doesn't actually help — that's a cancellation fix, not a reconnect fix.

I think the right move is to frame byte streams on the writable side too (4 bytes per chunk overhead), so createReconnectingFramedStream can be used uniformly. The user-facing API stays raw bytes; only the wire format gets the length prefix. That removes the asymmetry and keeps the cleaner architecture this PR is trying to achieve.

* the writable buffers one frame per chunk when multi-writing). The wrapper
* counts completed frames and, on upstream error, reopens the connection
* with `startIndex = resolvedStartIndex + consumedFrames`. Partial-frame
* bytes buffered before the cut are discarded — the server will resend the

The code comment says: "On serverfull backends, reconnects should only happen during transient errors. For serverless backends, we set this constant so that we cover at least 10 minutes even if the server would be limited to e.g. 1 minute per session."

10 reconnects × 1 minute per session = 10 minutes covered. That's tighter than the deleted constant in world-vercel (MAX_RECONNECTS = 50, ~100 minutes of coverage at 2-minute server timeouts). If the underlying assumption is that streams now run for the full function maxDuration (which on Pro/Enterprise can exceed 10 minutes), this cap may be too low.

Worth either:

  1. Bumping the constant to match the longest realistic maxDuration (~15 min Pro), so something like 30 (see the sizing sketch after this list), or
  2. Making it configurable per-call (or via the world)
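
For reference, the arithmetic behind option (1); the 15-minute figure and the 2x headroom factor are assumptions, not measured values:

const LONGEST_REALISTIC_MAX_DURATION_MIN = 15; // ~Pro-tier maxDuration mentioned above (assumption)
const WORST_CASE_SESSION_MIN = 1;              // pessimistic per-connection lifetime
const RECONNECTS_TO_COVER_MAX_DURATION = LONGEST_REALISTIC_MAX_DURATION_MIN / WORST_CASE_SESSION_MIN; // 15
const FRAMED_STREAM_MAX_RECONNECTS = 2 * RECONNECTS_TO_COVER_MAX_DURATION; // 30, leaving headroom for transient errors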

console.warn("Error closing ReadableStream reader:", err)
});
reader = undefined;
}

Nit: cancel() here only cancels the active reader. There's a small race window: if cancel fires while connect() is in flight (between reader = undefined after a reconnect-triggering error and the new reader being assigned), there's nothing to cancel — the new connection completes and the loop continues reading.

A cancelled flag checked at the top of the pull loop and inside connect() would close this. Same race existed in the deleted world-vercel cancel handler, so it's not a regression — just worth tightening if you're touching this code.

let cancelled = false;
// ... at the top of each pull-loop iteration, and again inside connect()
// before the new reader is assigned:
if (cancelled) { controller.close(); return; }
// ... in cancel():
cancelled = true;

const { world } = makeWorldWithScriptedStreams({
0: () =>
scriptedStream([
// Split frame into 3 byte-level reads to prove boundary-aware

Test simulates max-duration abort as controller.error(...) — which is correct for what the wrapper sees on a network reset, but doesn't verify the actual workflow-server behavior matches.

If workflow-server's stream timeout sends a clean FIN (i.e., calls controller.close() on its end) instead of an error, this code path will treat it as EOF and not reconnect. The control-frame logic that this PR removes was specifically designed to disambiguate these two cases.

Could you confirm in the PR description whether:

  1. workflow-server's stream timeout has been removed entirely (streams now run for full function maxDuration), OR
  2. the timeout still exists but now manifests as a network error / TCP reset rather than a clean FIN?

This is the load-bearing assumption of the whole design.

@TooTallNate

Following up after the discussion thread — consolidating the recommended direction so it's all in one place.

Recommended direction

Move byte-stream framing into core, gated on a per-run feature flag, with the resolved choice baked into the serialized stream ref.

The PR's instinct (move reconnect to core) is right. The concrete change to make it work uniformly for byte streams:

1. Frame byte streams on the writer side

In serialization.ts, the byte-stream branch of the ReadableStream reducer currently does:

ops.push(value.pipeTo(writable));

It would become:

ops.push(
  value
    .pipeThrough(getByteFramingStream())  // wrap each chunk in [4-byte len][bytes]
    .pipeTo(writable)
);

Cost: 4 bytes per server-side chunk. That's negligible for structured byte payloads in the KB+ range, and still only a small fixed overhead even for short AI text chunks of a few dozen bytes.
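
For concreteness, a minimal sketch of what that framing transform could look like (the real getByteFramingStream may differ; the 4-byte big-endian length prefix is an assumption here):

// Sketch: wrap each written chunk as [u32 big-endian length][chunk bytes].
function getByteFramingStream(): TransformStream<Uint8Array, Uint8Array> {
  return new TransformStream({
    transform(chunk, controller) {
      const framed = new Uint8Array(4 + chunk.byteLength);
      new DataView(framed.buffer).setUint32(0, chunk.byteLength); // big-endian by default
      framed.set(chunk, 4);
      controller.enqueue(framed); // one framed write per user-facing chunk
    },
  });
}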

2. Use createReconnectingFramedStream for both branches on the reader side

The non-byte branch already does this. The byte branch additionally pipes through an unframing transform that strips the 4-byte length prefix and emits raw bytes to a type: 'bytes' WHATWG stream, preserving the user-facing API exactly as it is today.
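
Roughly the inverse of the framing sketch above (the name getByteUnframingStream is hypothetical, and the real buffering/BYOB plumbing in core will differ):

// Sketch: strip [u32 big-endian length] prefixes and re-emit the raw payload bytes.
function getByteUnframingStream(): TransformStream<Uint8Array, Uint8Array> {
  let buffer = new Uint8Array(0);
  return new TransformStream({
    transform(chunk, controller) {
      const merged = new Uint8Array(buffer.byteLength + chunk.byteLength);
      merged.set(buffer, 0);
      merged.set(chunk, buffer.byteLength);
      buffer = merged;
      // Emit every complete frame currently in the buffer; keep any partial tail for the next read.
      while (buffer.byteLength >= 4) {
        const len = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength).getUint32(0);
        if (buffer.byteLength < 4 + len) break;
        controller.enqueue(buffer.slice(4, 4 + len));
        buffer = buffer.slice(4 + len);
      }
    },
  });
}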

3. WHATWG type: 'bytes' semantics are unaffected

To clarify a point that came up in the discussion: WHATWG's type: 'bytes' is purely about the reader-side API (BYOB readers, Uint8Array chunks, optional autoAllocateChunkSize). The spec says nothing about wire format or chunk-boundary semantics. Whether the bytes are framed on the wire is a transport choice the SDK gets to make — it doesn't change what the user sees from getReader().

So the framing change is purely internal to serialization. User-facing API is identical.
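
As a tiny self-contained illustration of that point (not SDK code): the reader-side API of a type: 'bytes' stream is the same whether or not the transport framed the bytes underneath.

// type: 'bytes' only shapes the reader API (Uint8Array chunks, BYOB reads);
// it says nothing about how those bytes were framed in transit.
const stream = new ReadableStream({
  type: 'bytes',
  start(controller) {
    controller.enqueue(new TextEncoder().encode('hello')); // arbitrary payload bytes
    controller.close();
  },
});
const reader = stream.getReader({ mode: 'byob' });          // BYOB reads still work as usual
const { value } = await reader.read(new Uint8Array(16));    // value is a Uint8Array view containing "hello"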

Backwards compatibility

This is the load-bearing concern, since byte-stream wire format becomes a versioning surface.

Cross-version exposures (post-version-skew-protection)

Within a single run: no exposure. Workflow runs are pinned to one deployment, so all chunks of any stream within a run are written and read by the same SDK version.

The only real exposures are streams that cross the run boundary via hook payloads, where the producer and consumer can be different SDK versions:

  1. Newer caller → older run (resumeHook(token, { stream: writable }) where the older run writes to it): older writer can't frame, newer reader must accept raw.
  2. Newer caller → older run (resumeHook(token, { stream: readable }) where the older run reads from it): older reader can't unframe, newer writer must produce raw.
  3. Older caller → newer run: mirror cases — newer side must defer to the older side's format.

In all cases, the framing decision must be made at the producer side based on the consumer side's capability.

Proposed mechanism

  • Per-run feature flag in run.features, e.g. 'byte-stream-framing'. Set at run-creation time based on the SDK version of the run's pinned deployment.
  • NOT specVersion: that's reserved for World-protocol changes (queue transport, event schemas). Byte-stream framing is purely a core/serialization concern that worlds don't need to know about. Features are the right granularity.
  • Reducer resolves at serialization time: the ReadableStream / WritableStream reducer looks up the target run's features and decides framing. For hook payloads the target is the hook's owning run (already looked up by the resumeHook code path); for same-run streams the target is the current run.
  • Bake the resolved choice into the stream ref:
    ReadableStream:
      | { name: string; type?: 'bytes'; startIndex?: number; framing?: 'raw' | 'framed-v1' }
      | { bodyInit: any };
  • Reader dispatches on the ref field: framing === 'framed-v1' → use createReconnectingFramedStream + unframing transform; framing === undefined | 'raw' → use existing WorkflowServerReadableStream (no reconnect). (Sketched after this list.)
  • Default is raw: absence of the field means raw, so existing serialized refs from older SDKs still work.
  • Auto-reconnect for byte streams becomes opt-in for new runs only. Old runs keep current no-reconnect behavior. Consistent with how feature flags work elsewhere in the codebase.
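
A rough sketch of that dispatch for the byte-stream branch (createReconnectingFramedStream and WorkflowServerReadableStream are the existing pieces discussed above; getByteUnframingStream is the hypothetical inverse transform sketched earlier; everything else is illustrative, not the final API):

type ByteStreamRef = { name: string; type: 'bytes'; startIndex?: number; framing?: 'raw' | 'framed-v1' };

function hydrateByteStream(ref: ByteStreamRef): ReadableStream<Uint8Array> {
  const startIndex = ref.startIndex ?? 0;
  if (ref.framing === 'framed-v1') {
    // New runs: length-prefixed on the wire, so the frame-counting reconnect wrapper applies,
    // and the unframing transform restores the raw-bytes user-facing API.
    return createReconnectingFramedStream(ref.name, startIndex).pipeThrough(getByteUnframingStream());
  }
  // undefined or 'raw': legacy unframed wire format, existing behavior, no auto-reconnect.
  return new WorkflowServerReadableStream(ref.name, startIndex);
}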

One implementation note

For start(workflow, args, { deploymentId }) with cross-deployment args, the args are serialized before the target run exists. The reducer needs a path to predict features for the target deployment without an actual run object — probably reading the deployment manifest's SDK version. Worth confirming this lookup is feasible at reducer-call time before committing to the design.

What's still open in the current PR

The architectural shift (frame counting in core, simpler world-vercel transport, AbortController plumbing) is good and should land. The two outstanding points from my prior review:

  1. Byte streams are opted out of reconnect — addressed by the above.
  2. "Clean EOF means done" assumption — still worth verifying explicitly. Either confirm that workflow-server's stream timeout now manifests as a network error (not a clean FIN), or document that this design only works if the server signals timeout-via-error.

The framing change for byte streams could be a follow-up PR if you want to keep the scope of this one tight, but the docs callout should at minimum be updated to clarify that byte streams currently lose auto-reconnect, distinct from the supportsCancellation issue (which is about cancellation, not reconnect).


@TooTallNate TooTallNate left a comment


Approving — withdrawing my prior request-for-changes.

Context: my earlier blocker was that this PR opts byte streams out of auto-reconnect, which I called a regression vs. the now-reverted #1790. Since then we discussed it and settled on a different plan: this PR lands on stable as-is (object-stream reconnect only), and byte-stream support gets added on main/v5 via wire-level framing in a follow-up. The framing work is now in PRs #1854 (workflowCoreVersion on HealthCheckResult) and #1853 (the framing itself), which together let createReconnectingFramedStream be applied uniformly to byte streams on main once they land.

So for stable, this PR is the right scope:

  • Object-stream reconnect via createReconnectingFramedStream is correct.
  • Byte streams legitimately can't be auto-reconnected with the legacy unframed wire format that stable ships, so opting them out is the right call there.
  • Frame-counting math, AbortController plumbing, world-vercel simplification all look good.

The earlier non-blocking concerns I raised still apply — would be nice to address them but I'm not gating on them:

  1. The "clean EOF means done" assumption. Worth a sentence in the commit/PR description confirming whether workflow-server's stream timeout now manifests as a network error rather than a clean FIN, since the deleted control-frame logic was specifically there to disambiguate them.
  2. FRAMED_STREAM_MAX_RECONNECTS = 10 is tighter than the deleted MAX_RECONNECTS = 50. Probably fine, but worth a sanity check against the longest realistic Pro/Enterprise maxDuration.
  3. Cancel race during reconnect — pre-existing, not a regression here.

