fix(core): harden wait_completed.resumeAt validation (defensive; root cause fixed by #2171) by TooTallNate · Pull Request #2177 · vercel/workflow

TooTallNate · 2026-05-30T15:30:57Z

Summary

Fixes a second, independent source of CORRUPTED_EVENT_LOG on replay: a wait_completed whose resumeAt is validated against a non-deterministic, wall-clock-derived value, producing a false corruption error on a perfectly consistent event log. This is the residual wait_completed.resumeAt shape that survived the hook-vs-sleep fix (#2171) in stress testing.

Root cause (confirmed via production instrumentation)

sleep(<ms|string>) computes resumeAt = Date.now() + duration (parseDurationToDate). The original run records that absolute timestamp into both wait_created and wait_completed. During replay the VM clock advances to each event's createdAt, so a freshly-created sleep recomputes a different absolute resumeAt.

Normally harmless: the wait_created consumer overwrites the queue item's resumeAt with the recorded (authoritative) value before wait_completed is validated. The bug fires when a wait_completed is consumed by a sleep consumer that never applied a wait_created (hasCreatedEvent=false) — the queue item still holds the freshly-recomputed value, and the comparison fails even though the log is internally consistent.

I instrumented the SDK and captured this in production stress runs. Every failing sample showed hasCreatedEvent=false, with ~18–42s deltas between the recomputed and recorded resumeAt, e.g.:

hasCreatedEvent=false queueItemResumeAt=1780153339381 (recomputed)
eventMs=1780153320646 (recorded)  delta=-18735ms

The recorded resumeAt is the source of truth; the consumer's recomputed value is not a valid basis for a corruption assertion.
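To make the drift concrete, here is a minimal TypeScript sketch of a duration-based resumeAt being recomputed under an advanced replay clock. `resumeAtFromDuration` is a hypothetical stand-in for the SDK's duration parsing, and the 5-second duration and starting timestamp are illustrative values chosen so the result lines up with the log sample above; they are not taken from the captured run.

```typescript
// Hypothetical stand-in for the SDK's duration parsing: resumeAt is derived
// from whatever "now" is at the moment sleep() is called.
function resumeAtFromDuration(nowMs: number, durationMs: number): number {
  return nowMs + durationMs;
}

const durationMs = 5_000; // illustrative duration (assumption)

// Original run: the timestamp is computed once and written to the event log.
const originalNowMs = 1780153315646; // illustrative clock value (assumption)
const recordedResumeAt = resumeAtFromDuration(originalNowMs, durationMs);

// Replay: the VM clock has advanced to a later event's createdAt.
const replayNowMs = originalNowMs + 18_735;
const recomputedResumeAt = resumeAtFromDuration(replayNowMs, durationMs);

// Same sign convention as the log sample: recorded minus recomputed.
const deltaMs = recordedResumeAt - recomputedResumeAt;
console.log({ recordedResumeAt, recomputedResumeAt, deltaMs });
```

The two values differ by exactly the amount the replay clock advanced, even though the duration passed to sleep never changed.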

Fix

Only validate resumeAt when an authoritative recorded value is available (hasCreatedEvent=true). When it is not, the correlationId match already establishes the wait's identity, so skip the check rather than fail a consistent log. Validation is extracted into detectResumeAtMismatch, which also lowers the consumer callback's pre-existing cognitive-complexity warning (33 → 21).
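The guard described here can be sketched roughly as follows. This is not the code in sleep.ts: the `WaitQueueItem` shape, the function name `detectResumeAtMismatchSketch`, and the message wording are all assumptions made for illustration.

```typescript
// Hypothetical queue-item shape; the SDK's real types are not shown in this PR text.
interface WaitQueueItem {
  correlationId: string;
  resumeAt: Date;
  hasCreatedEvent: boolean; // true once a wait_created has been applied
}

// Returns a description of the mismatch, or undefined when there is nothing to report.
function detectResumeAtMismatchSketch(
  item: WaitQueueItem,
  recordedResumeAt: Date,
): string | undefined {
  // No authoritative value on the queue item: skip the check rather than fail.
  if (!item.hasCreatedEvent) return undefined;
  if (item.resumeAt.getTime() === recordedResumeAt.getTime()) return undefined;
  return (
    `wait_completed resumeAt ${recordedResumeAt.toISOString()} ` +
    `does not match expected ${item.resumeAt.toISOString()}`
  );
}
```

Returning a value instead of throwing keeps the decision of how to surface the error with the caller, which is one plausible way an extraction like this reduces branching in the consumer callback.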

Tests

New regression test in sleep.test.ts advances the replay clock (updateTimestamp) and asserts a consistent wait_completed with hasCreatedEvent=false no longer raises CorruptedEventLogError. Fails before the fix (reproduces the exact production error), passes after.
Existing resumeAt-mismatch / invalid-resumeAt tests (which have hasCreatedEvent=true) still correctly fire.
Full @workflow/core suite: 635/635, typecheck clean.

Scope

Pre-existing bug on stable. Independent of the hook-vs-sleep race fix (#2171) and of the server-side wait_created atomicity work (workflow-server #462) — those address different failure shapes. Stress data showed #2171 removes the step-consumer-mismatch shape; this removes the wait_completed.resumeAt shape.

Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com

…rded value

A reused/duration sleep races a `wait_completed` replay against a non-deterministic, wall-clock-derived expected value, producing a false CorruptedEventLogError on a perfectly consistent event log.

`sleep(<ms|string>)` computes its resumeAt as `Date.now() + duration` (see parseDurationToDate). The original run records that absolute timestamp into both wait_created and wait_completed. During replay the VM clock advances to each event's createdAt, so a freshly-created sleep recomputes a *different* absolute resumeAt.

Normally harmless: the wait_created consumer overwrites the queue item's resumeAt with the recorded (authoritative) value before wait_completed is validated.

The bug: when a wait_completed is consumed by a sleep consumer that never applied a wait_created (hasCreatedEvent=false), the queue item still holds the freshly-recomputed value, and the resumeAt comparison fails, even though the event log is internally consistent and the recorded resumeAt is the source of truth.

Captured in production stress runs: hasCreatedEvent=false with ~18-42s deltas between the recomputed and recorded resumeAt.

Fix: only validate resumeAt when an authoritative recorded value is available (hasCreatedEvent=true). When it is not, the correlationId match already establishes the wait's identity, so skip the check rather than fail.

Extracted the validation into `detectResumeAtMismatch`, which also lowers the consumer callback's cognitive-complexity warning (33 -> 21).

Adds a regression test that advances the replay clock (via updateTimestamp) and asserts a consistent wait_completed with hasCreatedEvent=false no longer raises CorruptedEventLogError.

Pre-existing stable bug; independent of the hook-vs-sleep race fix (#2171) and of the server-side work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

changeset-bot · 2026-05-30T15:31:01Z

🦋 Changeset detected

Latest commit: 671a3ea

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 16 packages

Name	Type
@workflow/core	Patch
@workflow/builders	Patch
@workflow/cli	Patch
@workflow/next	Patch
@workflow/nitro	Patch
@workflow/vitest	Patch
@workflow/web-shared	Patch
@workflow/web	Patch
workflow	Patch
@workflow/world-testing	Patch
@workflow/astro	Patch
@workflow/nest	Patch
@workflow/rollup	Patch
@workflow/sveltekit	Patch
@workflow/vite	Patch
@workflow/nuxt	Patch

vercel · 2026-05-30T15:31:04Z

Project	Deployment	Actions	Updated (UTC)
example-nextjs-workflow-turbopack	Ready	Preview, Comment	Jun 1, 2026 10:05pm
example-nextjs-workflow-webpack	Ready	Preview, Comment	Jun 1, 2026 10:05pm
example-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workbench-astro-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workbench-express-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workbench-fastify-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workbench-hono-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workbench-nitro-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workbench-nuxt-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workbench-sveltekit-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workbench-tanstack-start-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workbench-vite-workflow	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workflow-docs	Ready	Preview, Comment, Open in v0	Jun 1, 2026 10:05pm
workflow-swc-playground	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workflow-tarballs	Ready	Preview, Comment	Jun 1, 2026 10:05pm
workflow-web	Ready	Preview, Comment	Jun 1, 2026 10:05pm

github-actions · 2026-05-30T15:31:07Z

🧪 E2E Test Results

❌ Some tests failed

Summary

	Passed	Failed	Skipped	Total
❌ ▲ Vercel Production	922	1	67	990
✅ 💻 Local Development	994	0	86	1080
✅ 📦 Local Production	994	0	86	1080
✅ 🐘 Local Postgres	994	0	86	1080
✅ 🪟 Windows	90	0	0	90
❌ 🌍 Community Worlds	136	92	0	228
✅ 📋 Other	504	0	36	540
Total	4634	93	361	5088

❌ Failed Tests

▲ Vercel Production (1 failed)

vite (1 failed):

webhookWorkflow | wrun_01KT2KE1245NRQWPBV1AT1Z8CW | 🔍 observability

🌍 Community Worlds (92 failed)

mongodb (14 failed):

hookWorkflow is not resumable via public webhook endpoint | wrun_01KT2KDWR4CJPCV4KF93JHGCG6
webhookWorkflow | wrun_01KT2KE1245NRQWPBV1AT1Z8CW
sleepingWorkflow | wrun_01KT2KFHF3TAK9BCQGVVBMH2KN
outputStreamWorkflow no startIndex (reads all chunks)
outputStreamWorkflow negative startIndex (reads from end)
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
outputStreamInsideStepWorkflow - getWritable() called inside step functions | wrun_01KT2KJFXQ9AF2Y076M0EYNW7X
writableForwardedFromWorkflowWorkflow | wrun_01KT2KJX8NFAHFABXH2TTK02DD
writableForwardedFromStepWorkflow | wrun_01KT2KK2QDGCPFCD1BHXXJ0GYR
concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KT2KPQYFBNDX1GHX631MPXPB
pages router sleepingWorkflow via pages router
resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KT2KWARMSYH1VWQ6TF28GWAV

redis (9 failed):

hookWorkflow is not resumable via public webhook endpoint | wrun_01KT2KDWR4CJPCV4KF93JHGCG6
sleepingWorkflow | wrun_01KT2KFHF3TAK9BCQGVVBMH2KN
outputStreamWorkflow negative startIndex (reads from end)
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KT2KPQYFBNDX1GHX631MPXPB
pages router sleepingWorkflow via pages router
resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KT2KWARMSYH1VWQ6TF28GWAV

turso-dev (1 failed):

dev e2e should rebuild on imported step dependency change

turso (68 failed):

addTenWorkflow | wrun_01KT2KCS4M5XK193AB6VN0F3BN
addTenWorkflow | wrun_01KT2KCS4M5XK193AB6VN0F3BN
wellKnownAgentWorkflow (.well-known/agent) | wrun_01KT2KE4MF3Y78S5QNQTZ8Y20M
should work with react rendering in step
promiseAllWorkflow | wrun_01KT2KD0G60VB2N7TTTEYSYT1N
promiseRaceWorkflow | wrun_01KT2KD6W8F158K74NPC5HBSQK
promiseAnyWorkflow | wrun_01KT2KD8THC6MBHSVRK63XVNKY
importedStepOnlyWorkflow | wrun_01KT2KEQPT5ZE0ZEGFE7DA1GSE
readableStreamWorkflow | wrun_01KT2KDAPVZEARDZ5CC857R7PX
hookWorkflow | wrun_01KT2KDPYJ51TANZ87FXBEYR8C
hookWorkflow is not resumable via public webhook endpoint | wrun_01KT2KDWR4CJPCV4KF93JHGCG6
webhookWorkflow | wrun_01KT2KE1245NRQWPBV1AT1Z8CW
sleepingWorkflow | wrun_01KT2KFHF3TAK9BCQGVVBMH2KN
parallelSleepWorkflow | wrun_01KT2KG0BQX1KX9F7D8AXQ43QK
nullByteWorkflow | wrun_01KT2KG4EWHYFE7QR71HQAJQHZ
workflowAndStepMetadataWorkflow | wrun_01KT2KG6E4Q8KYGAW7EYF1Z5PB
outputStreamWorkflow no startIndex (reads all chunks)
outputStreamWorkflow positive startIndex (skips first chunk)
outputStreamWorkflow negative startIndex (reads from end)
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns correct index after stream completes
outputStreamWorkflow - getTailIndex and getStreamChunks getTailIndex returns -1 before any chunks are written
outputStreamWorkflow - getTailIndex and getStreamChunks getStreamChunks returns same content as reading the stream
outputStreamInsideStepWorkflow - getWritable() called inside step functions | wrun_01KT2KJFXQ9AF2Y076M0EYNW7X
writableForwardedFromWorkflowWorkflow | wrun_01KT2KJX8NFAHFABXH2TTK02DD
writableForwardedFromStepWorkflow | wrun_01KT2KK2QDGCPFCD1BHXXJ0GYR
fetchWorkflow | wrun_01KT2KK61105R59RS7J6C4TMCC
promiseRaceStressTestWorkflow | wrun_01KT2KK94CT1PQEDHVN9EG6XFT
error handling error propagation workflow errors nested function calls preserve message and stack trace
error handling error propagation workflow errors cross-file imports preserve message and stack trace
error handling error propagation step errors basic step error preserves message and stack trace
error handling error propagation step errors cross-file step error preserves message and function names in stack
error handling retry behavior regular Error retries until success
error handling retry behavior FatalError fails immediately without retries
error handling retry behavior RetryableError respects custom retryAfter delay
error handling retry behavior maxRetries=0 disables retries
error handling catchability FatalError can be caught and detected with FatalError.is()
error handling not registered WorkflowNotRegisteredError fails the run when workflow does not exist
error handling not registered StepNotRegisteredError fails the step but workflow can catch it
error handling not registered StepNotRegisteredError fails the run when not caught in workflow
hookCleanupTestWorkflow - hook token reuse after workflow completion | wrun_01KT2KPD9A3J85BPKK93SW6028
concurrent hook token conflict - two workflows cannot use the same hook token simultaneously | wrun_01KT2KPQYFBNDX1GHX631MPXPB
hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running | wrun_01KT2KQ4ZTYXAB0FZ3G5V4FJSE
stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars) | wrun_01KT2KQM39T6NE24VSJSPGJ1BV
stepFunctionWithClosureWorkflow - step function with closure variables passed as argument | wrun_01KT2KQXFKFKC86FXH71BSEB3N
closureVariableWorkflow - nested step functions with closure variables | wrun_01KT2KR23MQ0MCVPWZRHYV9X1V
spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step | wrun_01KT2KR40C0XSFZCKNRA9BES1A
health check (queue-based) - workflow and step endpoints respond to health check messages
health check (CLI) - workflow health command reports healthy endpoints
pathsAliasWorkflow - TypeScript path aliases resolve correctly | wrun_01KT2KRHXX954WKGQEFSBXY854
Calculator.calculate - static workflow method using static step methods from another class | wrun_01KT2KRPTHBM5853GKVXWA1272
AllInOneService.processNumber - static workflow method using sibling static step methods | wrun_01KT2KRWX43JHCP82FAZ5NCSW4
ChainableService.processWithThis - static step methods using this to reference the class | wrun_01KT2KS2VZS54D3XX5FYNCF8HN
thisSerializationWorkflow - step function invoked with .call() and .apply() | wrun_01KT2KS919QGAZAPVN894TPD5Z
customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE | wrun_01KT2KSG0J449SDCQN4TRRTVJQ
instanceMethodStepWorkflow - instance methods with "use step" directive | wrun_01KT2KSPZ1574P6861D4GVX6JB
crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context | wrun_01KT2KT39HDD75T7ZE6RPVZKC5
stepFunctionAsStartArgWorkflow - step function reference passed as start() argument | wrun_01KT2KTBH88BN1RE0A512BMMNQ
cancelRun - cancelling a running workflow | wrun_01KT2KTJFC29XD67PD4GHQ3VQX
cancelRun via CLI - cancelling a running workflow | wrun_01KT2KTV2HKFKK36N528CYNQ2X
pages router addTenWorkflow via pages router
pages router promiseAllWorkflow via pages router
pages router sleepingWorkflow via pages router
hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep | wrun_01KT2KV5Z6FWM79SHB3Y3PQEBM
sleepInLoopWorkflow - sleep inside loop with steps actually delays each iteration | wrun_01KT2KVMF6RM60YJGZBAFY1HNG
sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control) | wrun_01KT2KW12WZ52V1G011E6AH59C
importMetaUrlWorkflow - import.meta.url is available in step bundles | wrun_01KT2KW70B6VRBXH9M4Q65JD72
metadataFromHelperWorkflow - getWorkflowMetadata/getStepMetadata work from module-level helper (#1577) | wrun_01KT2KW8XPTJY23Z0BGBD2ATX6
resilient start: addTenWorkflow completes when run_created returns 500 | wrun_01KT2KWARMSYH1VWQ6TF28GWAV

Details by Category

❌ ▲ Vercel Production

App	Passed	Failed	Skipped
✅ astro	83	0	7
✅ example	83	0	7
✅ express	83	0	7
✅ fastify	83	0	7
✅ hono	83	0	7
✅ nextjs-turbopack	88	0	2
✅ nextjs-webpack	88	0	2
✅ nitro	83	0	7
✅ nuxt	83	0	7
✅ sveltekit	83	0	7
❌ vite	82	1	7

✅ 💻 Local Development

App	Passed	Skipped
✅ astro-stable	84	6
✅ express-stable	84	6
✅ fastify-stable	84	6
✅ hono-stable	84	6
✅ nextjs-turbopack-canary	71	19
✅ nextjs-turbopack-stable	90	0
✅ nextjs-webpack-canary	71	19
✅ nextjs-webpack-stable	90	0
✅ nitro-stable	84	6
✅ nuxt-stable	84	6
✅ sveltekit-stable	84	6
✅ vite-stable	84	6

✅ 📦 Local Production

App	Passed	Skipped
✅ astro-stable	84	6
✅ express-stable	84	6
✅ fastify-stable	84	6
✅ hono-stable	84	6
✅ nextjs-turbopack-canary	71	19
✅ nextjs-turbopack-stable	90	0
✅ nextjs-webpack-canary	71	19
✅ nextjs-webpack-stable	90	0
✅ nitro-stable	84	6
✅ nuxt-stable	84	6
✅ sveltekit-stable	84	6
✅ vite-stable	84	6

✅ 🐘 Local Postgres

App	Passed	Skipped
✅ astro-stable	84	6
✅ express-stable	84	6
✅ fastify-stable	84	6
✅ hono-stable	84	6
✅ nextjs-turbopack-canary	71	19
✅ nextjs-turbopack-stable	90	0
✅ nextjs-webpack-canary	71	19
✅ nextjs-webpack-stable	90	0
✅ nitro-stable	84	6
✅ nuxt-stable	84	6
✅ sveltekit-stable	84	6
✅ vite-stable	84	6

✅ 🪟 Windows

App	Passed	Failed	Skipped
✅ nextjs-turbopack	90	0	0

❌ 🌍 Community Worlds

App	Passed	Failed
✅ mongodb-dev	5	0
❌ mongodb	57	14
✅ redis-dev	5	0
❌ redis	62	9
❌ turso-dev	4	1
❌ turso	3	68

✅ 📋 Other

App	Passed	Skipped
✅ e2e-local-dev-nest-stable	84	6
✅ e2e-local-dev-tanstack-start-stable	84	6
✅ e2e-local-postgres-nest-stable	84	6
✅ e2e-local-postgres-tanstack-start-stable	84	6
✅ e2e-local-prod-nest-stable	84	6
✅ e2e-local-prod-tanstack-start-stable	84	6

📋 View full workflow run

❌ Some E2E test jobs failed:

Vercel Prod: failure
Local Dev: success
Local Prod: success
Local Postgres: success
Windows: success

Check the workflow run for details.

⚠️ Community world tests failed (non-blocking):

Community Worlds: failure

Check the workflow run for details.

Copilot

Pull request overview

Fixes a false-positive CorruptedEventLogError raised on replay when a sleep consumer processes a wait_completed event without having first applied the corresponding wait_created. In that case the queue item's resumeAt still reflects a freshly recomputed (wall-clock-dependent) value from parseDurationToDate, so comparing it against the recorded resumeAt produces a spurious mismatch on a consistent event log.

Changes:

Extracted resumeAt validation into detectResumeAtMismatch() helper.
Skip the resumeAt comparison unless queueItem.hasCreatedEvent is true (authoritative recorded value present).
Added regression test simulating replay clock drift with wait_completed and no prior wait_created.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
packages/core/src/workflow/sleep.ts	Extract resumeAt mismatch detection; only validate when authoritative recorded value (`hasCreatedEvent`) is available.
packages/core/src/workflow/sleep.test.ts	Add regression test for replay-clock-advanced `wait_completed` without `wait_created`; expose `updateTimestamp` from setup helper.

pranaygp

Please add a patch changeset for @workflow/core before merging. This PR changes a published package on stable; without a changeset this bug fix will not produce a package release.

VaguelySerious

AI review: blocking issues found

VaguelySerious · 2026-05-31T08:53:00Z

AI Review: Blocking

Missing changeset. This PR fixes a real bug in a published package (@workflow/core) but ships no changeset, so it won't produce a version bump or changelog entry — the fix wouldn't actually reach users on release. Per the repo's contribution rule every workflow PR needs one (pnpm changeset, a patch for @workflow/core here). There's no automated changeset gate on this PR, which is why it slipped past CI.

This is the only blocker — the code change itself is regression-free and low-risk (see the inline note on sleep.ts): it strictly weakens the resumeAt validation, adding no new error path, and legitimate corruption detection (hasCreatedEvent=true) is preserved.

AI Review: Note — on the red CI checks, none of which are caused by this change:

Unit Tests (windows-latest) — failed at @workflow/swc-plugin#build (the Rust/wasm crate build), not at any test. This job is itself flaky on stable (mixed success/failure in recent runs); unrelated to a pure-TypeScript @workflow/core change.
E2E Community World (Redis / MongoDB) — the failing cases are error-handling workflows (errorWorkflowNested, errorWorkflowCrossFile, errorStepBasic) and the jobs ended in cancellation/timeout. These adapters are cancelled/incomplete on stable too. A change that only removes a wait_completed.resumeAt error path cannot cause unrelated error-workflow tests to start failing.
E2E Required Check — red only because it aggregates the Windows (UNIT_STATUS/WINDOWS_STATUS=failure) and community (COMMUNITY_STATUS=cancelled) jobs above.

Local validation on 80545bb: full @workflow/core unit suite 635/635 (stable across 5 repeats), typecheck clean. The new regression test reproduces the exact production error on stable (resumeAt "2025-07-25..." but expects "2026-05-31...") and passes on this branch. I also confirmed, via an ad-hoc test, that a consistent log with wait_created does not false-positive across a 30s replay-clock advance, and that a genuine resumeAt mismatch still fires regardless of clock state.

Recommend adding the changeset; once that's in, this is good to merge.

Add the missing patch changeset for the `@workflow/core` wait_completed resumeAt replay fix so it gets a version bump and changelog entry. Also remove the fixed 250ms grace timer from the new regression test: it now races the error-vs-resolve outcomes directly, so a regression surfaces deterministically (error branch, or a hang caught by the test timeout) rather than via a flaky race against a wall-clock guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
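One way to race the two outcomes without a wall-clock guard is sketched below. It assumes the test can represent each outcome as a promise (one that settles when the sleep resolves, one that settles when the error handler fires); `firstOutcome` and both parameter names are hypothetical, not helpers from sleep.test.ts.

```typescript
type Outcome<T> =
  | { kind: "resolved"; value: T }
  | { kind: "error"; error: Error };

// Settles with whichever outcome happens first. There is no timer: if neither
// promise settles, the surrounding test runner's own timeout reports the hang.
function firstOutcome<T>(
  resolved: Promise<T>,
  errored: Promise<Error>,
): Promise<Outcome<T>> {
  return Promise.race([
    resolved.then((value): Outcome<T> => ({ kind: "resolved", value })),
    errored.then((error): Outcome<T> => ({ kind: "error", error })),
  ]);
}
```

A test would then assert that the result's `kind` is "resolved", so a regression shows up either as the "error" branch or as a timeout.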

VaguelySerious

LGTM, applied the AI nit about the test case directly

(cherry picked from commit ae3c833) (cherry picked from commit c1d7bab)

…ist + infra breakdown

Rework the PR-comment renderer so a human can immediately see what gates the job and inspect every failing run:

- 🚨 Event-Log Regressions table lists *every* gating run in full (never truncated), each with its duration, a synthesised detail line, and a direct dashboard link. Stuck runs render "no terminal state after <ms>".
- Infra (non-gating) section groups harness noise by error code with a plain-language explanation and example run links, instead of flooding one table with thousands of rows.
- Headline names the regression count and digests the infra noise (e.g. "904 HOOK_RESUME_FAILED, 61 NO_WAKE_BRANCH").

Adds unit coverage for the breakdown, message synthesis and the never-truncate-regressions guarantee.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
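The split between gating regressions and grouped infra noise could look something like this sketch. The renderer in the repository is a JavaScript script whose internals are not shown here, so the `RunResult` shape, the `summarizeResults` name, and the exact headline wording are assumptions.

```typescript
// Hypothetical per-run record produced by the harness.
interface RunResult {
  runId: string;
  code: string; // e.g. "CORRUPTED_EVENT_LOG", "HOOK_RESUME_FAILED"
  gating: boolean; // true when the run should fail the job
}

function summarizeResults(results: RunResult[]): {
  regressions: RunResult[];
  headline: string;
} {
  // Every gating run is kept; the list is never truncated.
  const regressions = results.filter((r) => r.gating);

  // Non-gating runs are only counted per error code.
  const infraCounts = new Map<string, number>();
  for (const r of results) {
    if (!r.gating) infraCounts.set(r.code, (infraCounts.get(r.code) ?? 0) + 1);
  }

  const digest = Array.from(infraCounts.entries())
    .sort((a, b) => b[1] - a[1]) // most frequent code first
    .map(([code, count]) => `${count} ${code}`)
    .join(", ");

  const n = regressions.length;
  const headline = `${n} regression${n === 1 ? "" : "s"}; infra: ${digest || "none"}`;
  return { regressions, headline };
}
```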

On run-poll timeout, fetch the run's event log and record the latest event (type, step name, elapsed) as the stuck run's errorMessage. The summary's regression table then shows "stuck after N events; latest step_started (foo) at +12.3s" with a dashboard link, instead of only a duration — so a human can see where the run wedged without opening every link. Best-effort; falls back to the duration-only note if the event fetch fails. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
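A rough sketch of how such a detail line might be synthesised, assuming a simplified event shape. `LoggedEvent`, `describeStuckRunSketch`, and the field names are hypothetical; only the output wording follows the example in the commit message.

```typescript
// Hypothetical, simplified view of an event-log entry.
interface LoggedEvent {
  eventType: string;
  stepName?: string;
  createdAtMs: number;
}

function describeStuckRunSketch(
  events: LoggedEvent[],
  runStartedAtMs: number,
  elapsedMs: number,
): string {
  const latest = events[events.length - 1];
  // Fall back to the duration-only note when no events are available.
  if (latest === undefined) return `no terminal state after ${elapsedMs}ms`;

  const step = latest.stepName ? ` (${latest.stepName})` : "";
  const offsetSeconds = ((latest.createdAtMs - runStartedAtMs) / 1000).toFixed(1);
  return `stuck after ${events.length} events; latest ${latest.eventType}${step} at +${offsetSeconds}s`;
}
```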

…ep-resumeat-replay # Conflicts: # .github/scripts/render-event-log-race-repro-results.js # .github/scripts/render-event-log-race-repro-results.test.js

A run flagged at the 150s poll budget can simply be slow on a loaded preview deployment — observed wrun_…EFDZ9 completed shortly after the harness gave up and was wrongly gated as `stuck`. Add a generous post-budget grace window: a run that reaches a terminal state during grace is classified by its real outcome (completed → non-gating `SLOW_COMPLETION` infra, surfaced for visibility; failed → its error class). Only a run still non-terminal after budget + grace is a genuine wedge (gating `stuck`). Renderer gains notes for SLOW_COMPLETION/CANCELLED and singular/plural agreement fixes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
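The classification rule after the grace window can be summarised in a small sketch. It assumes the harness already has a separate classifier for failed runs and passes its result in; `classifyAfterGrace`, `RunStatus`, and `RunOutcome` are hypothetical names, and the cancelled case is left out because the commit message does not say how it is gated.

```typescript
type RunStatus = "running" | "completed" | "failed";

interface RunOutcome {
  code: string;
  gating: boolean;
}

// statusAfterGrace is the run's status once budget + grace has elapsed.
// failureOutcome is whatever the harness's own failure classifier produced.
function classifyAfterGrace(
  statusAfterGrace: RunStatus,
  failureOutcome: RunOutcome,
): RunOutcome {
  if (statusAfterGrace === "completed") {
    // Slow but healthy: surfaced for visibility, never gates the job.
    return { code: "SLOW_COMPLETION", gating: false };
  }
  if (statusAfterGrace === "failed") {
    // Classified by its real error class, not by the timeout.
    return failureOutcome;
  }
  // Still non-terminal after budget + grace: a genuine wedge.
  return { code: "stuck", gating: true };
}
```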

…e they occur

Investigating HARNESS_ERRORs on a repro run: a `fetch failed` and a `Hook not found`. Both came from harness-side network calls to the deployment, not the SDK. A single dropped connection should never abort tracking an otherwise healthy run.

- Add `withRetry` (linear backoff, transient-network detection) and apply it to the harness network calls: getWorkflowMetadata, start, resumeHook, and the run-status poll. On final failure the error is prefixed with the call site (e.g. "start: fetch failed", "poll runs.get: fetch failed"), so the infra breakdown says *where* it happened.
- pollTerminalRun no longer aborts on a flaky GET: a transient error just retries/continues until the deadline.
- waitForHook labels its surfaced error ("waitForHook: Hook not found") so the hook-propagation timeout is identifiable in the summary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
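A retry helper along these lines might look like the sketch below. The harness's actual `withRetry` signature is not shown in this PR text, so the parameter list, the default attempt count and delay, and the set of messages treated as transient are all assumptions.

```typescript
// Assumed heuristic: treat common dropped-connection messages as transient.
function isTransientNetworkError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return /fetch failed|ECONNRESET|ETIMEDOUT|socket hang up/i.test(message);
}

async function withRetry<T>(
  label: string, // call site, used to prefix the final error
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Give up immediately on non-transient errors or on the last attempt.
      if (!isTransientNetworkError(err) || attempt === attempts) break;
      // Linear backoff: baseDelayMs, 2 * baseDelayMs, 3 * baseDelayMs, ...
      await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
  const message = lastError instanceof Error ? lastError.message : String(lastError);
  throw new Error(`${label}: ${message}`);
}
```

Called as `withRetry("start", () => startRun())` (with `startRun` being whatever network call is wrapped), a final failure surfaces as "start: fetch failed", which matches the labelled form described above.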

…issing field

A failed WorkflowRun exposes its reason as `error: { code, message }` and has no top-level `errorCode`, so the poller's `classifyFailure(runData.errorCode)` was always passing `undefined`, collapsing every polled failure to an uncategorised, detail-less `other`. Read `runData.error.code`/`.message` so USER_ERROR/RUNTIME_ERROR/CORRUPTED_EVENT_LOG are classified correctly and the regression row shows why the run failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

(cherry picked from commit d2a59d4)
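The field mix-up is easy to see in a small sketch. `FailedRunData` models only the `error: { code, message }` shape described above, and `classifyFailureSketch` is a hypothetical stand-in for the poller's classifier, not its real implementation.

```typescript
// Only the fields mentioned in the commit message are modelled here.
interface FailedRunData {
  status: "failed";
  error?: { code?: string; message?: string };
}

// Hypothetical classifier: known codes pass through, everything else is "other".
function classifyFailureSketch(code: string | undefined): string {
  switch (code) {
    case "USER_ERROR":
    case "RUNTIME_ERROR":
    case "CORRUPTED_EVENT_LOG":
      return code;
    default:
      return "other";
  }
}

function describeFailure(run: FailedRunData): { errorClass: string; detail: string } {
  return {
    // Read the nested field; a top-level errorCode does not exist on the run.
    errorClass: classifyFailureSketch(run.error?.code),
    detail: run.error?.message ?? "",
  };
}
```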

…leep(Date)

Addresses review feedback on the resumeAt-skip guard: the no-`wait_created` skip was too broad. It correctly avoids a false `CorruptedEventLogError` for duration-based `sleep(<ms|string>)` (whose resumeAt is `Date.now() + duration` and therefore varies across replays), but it also skipped validation for an absolute `sleep(Date)`, whose resumeAt is recomputed identically every replay and so remains an authoritative value worth checking even without a recorded `wait_created`.

Track `resumeAtIsDeterministic` on the wait queue item (true when the sleep was given a Date / date-like), and only skip the equality check when resumeAt is non-deterministic AND no `wait_created` was applied. A genuine absolute-Date mismatch now still raises.

Adds a regression test (mismatched `sleep(Date)` without `wait_created` → CorruptedEventLogError). The malformed/Invalid-Date case was already handled unconditionally before the gate and is already covered.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ep-resumeat-replay

An A/B stress reproduction (instrumented build, 300 runs each) confirmed the no-`wait_created` state this guard handles is a downstream symptom of the hook-vs-sleep race fixed in #2171: every captured case was a `wait_completed` whose correlationId has NO matching `wait_created` in the log (a divergent replay shifted the deterministic ULID sequence). It reproduced readily with #2171 reverted (5 hits / 300) and never with #2171 present (0 / 300). Reword the changeset to one sentence describing the validation hardening, and document the confirmed root cause inline. No behavior change.

TooTallNate · 2026-06-01T22:01:44Z

Sit-rep: root cause confirmed — this is defensive hardening, not an independent bug fix

We dug into why a wait_completed would ever be validated without a recorded wait_created (hasCreatedEvent=false). An instrumented A/B stress reproduction settled it.

Instrumentation

Built a debug SDK that force-fails on any hasCreatedEvent=false at wait_completed, capturing the consumer's subscribe-time event index, completed-at index, and — critically — the wait_created log index for that correlationId.

A/B result (same harness, 300 runs each, identical params)

Build	CORRUPTED_EVENT_LOG	`hasCreatedEvent=false` captured
stable + this PR (incl. #2171)	0	0 / 300
#2171 reverted (+ this PR)	241	5 / 300

Every captured hasCreatedEvent=false case showed waitCreatedLogIndex = -1 — i.e. the wait_completed has no matching wait_created anywhere in the event log for its correlationId.

Root cause

That state is a divergent-replay artifact of the hook-vs-sleep race fixed in #2171: when the race resolved non-deterministically, the workflow's branch decisions diverged on replay, shifting the deterministic ULID sequence — so a sleep got a correlationId whose wait_created isn't in the committed log. The consumer then validated a wait_completed for a wait it never saw created, comparing against a freshly-recomputed wall-clock resumeAt → spurious CorruptedEventLogError.

With #2171 (now merged) the race is deterministic and this no longer occurs — confirmed by the 0/300 above.

What this PR is now

Defensive hardening of the resumeAt validation path, reframed accordingly:

Skip the equality check only when there's no authoritative recorded value (no wait_created AND a non-deterministic duration-based sleep).
Still validate absolute sleep(Date) (deterministic resumeAt) even without wait_created.
Still reject malformed/non-finite resumeAt unconditionally.
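The three rules above amount to a small decision function. The sketch below uses hypothetical names (`SleepQueueItem`, `checkResumeAt`, `Verdict`) and works on millisecond timestamps for brevity; only `hasCreatedEvent` and `resumeAtIsDeterministic` are field names taken from this PR.

```typescript
type Verdict = "invalid" | "skipped" | "ok" | "mismatch";

interface SleepQueueItem {
  resumeAtMs: number;
  hasCreatedEvent: boolean; // a wait_created was applied
  resumeAtIsDeterministic: boolean; // sleep was given a Date, not a duration
}

function checkResumeAt(item: SleepQueueItem, recordedResumeAtMs: number): Verdict {
  // Rule 3: malformed or non-finite values are rejected unconditionally.
  if (!Number.isFinite(recordedResumeAtMs)) return "invalid";

  // Rule 1: skip only when there is no authoritative value to compare against,
  // i.e. no wait_created AND a duration-based (wall-clock-derived) resumeAt.
  const authoritative = item.hasCreatedEvent || item.resumeAtIsDeterministic;
  if (!authoritative) return "skipped";

  // Rule 2 (and the normal path): compare against the authoritative value.
  return item.resumeAtMs === recordedResumeAtMs ? "ok" : "mismatch";
}
```

In this framing, "invalid" and "mismatch" are the two verdicts that would map to a CorruptedEventLogError; "skipped" is the only case this PR adds.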

Changeset reworded to one sentence; root cause documented inline. No behavior change since the last review pass. Given the root cause is fixed upstream by #2171, this is belt-and-suspenders — happy to keep it as hardening or close it if folks would rather not carry it. cc @pranaygp @VaguelySerious

TooTallNate requested a review from a team as a code owner May 30, 2026 15:30

Copilot AI review requested due to automatic review settings May 30, 2026 15:30

Copilot started reviewing on behalf of TooTallNate May 30, 2026 15:31 View session

Copilot AI reviewed May 30, 2026

pranaygp reviewed May 30, 2026

Comment thread packages/core/src/workflow/sleep.ts Outdated

VaguelySerious reviewed May 31, 2026

Comment thread packages/core/src/workflow/sleep.ts Outdated

Comment thread packages/core/src/workflow/sleep.test.ts Outdated

VaguelySerious and others added 2 commits May 31, 2026 10:56

Merge branch 'stable' into nate/fix-reused-sleep-resumeat-replay

4cc9b43

VaguelySerious approved these changes May 31, 2026

VaguelySerious mentioned this pull request May 31, 2026

[e2e] Separate event-log-race-repro harness noise from real regressions #2190

Merged

Merge branch 'stable' into nate/fix-reused-sleep-resumeat-replay

ed8f9d7

VaguelySerious and others added 6 commits June 1, 2026 11:17

[e2e] Improve error labeling in event-log-race-repro CI job (#2190)

d392c7e

(cherry picked from commit ae3c833) (cherry picked from commit c1d7bab)

Merge remote-tracking branch 'origin/stable' into nate/fix-reused-sleep-resumeat-replay

Conflicts: .github/scripts/render-event-log-race-repro-results.js, .github/scripts/render-event-log-race-repro-results.test.js

2864608

VaguelySerious mentioned this pull request Jun 1, 2026

[world-vercel] [builders] Add ENFORCE_STRICT_CONCURRENCY option to limit flow route concurrency to one #2193

Draft

VaguelySerious and others added 2 commits June 1, 2026 03:29

Merge branch 'stable' into nate/fix-reused-sleep-resumeat-replay

49b3057

VaguelySerious mentioned this pull request Jun 1, 2026

[TESTING] Combined #2177 + #2193 for event-log-race-repro #2197

Draft

TooTallNate and others added 3 commits June 1, 2026 11:29

Merge remote-tracking branch 'origin/stable' into nate/fix-reused-sleep-resumeat-replay

f0594a3

TooTallNate wants to merge 17 commits into stable from nate/fix-reused-sleep-resumeat-replay

4 participants

changeset-bot Bot commented May 30, 2026 (edited)

🦋 Changeset detected

github-actions Bot commented May 30, 2026 (edited)

🧪 E2E Test Results

Summary

❌ Failed Tests

Details by Category

Check the workflow run for details.

Copilot AI left a comment

Pull request overview

Reviewed changes

pranaygp left a comment

VaguelySerious left a comment

VaguelySerious commented May 31, 2026

AI Review: Blocking

VaguelySerious left a comment

TooTallNate commented Jun 1, 2026

Sit-rep: root cause confirmed — this is defensive hardening, not an independent bug fix

Instrumentation

A/B result (same harness, 300 runs each, identical params)

Root cause

What this PR is now
