fix(core): harden wait_completed.resumeAt validation (defensive; root cause fixed by #2171)#2177
fix(core): harden wait_completed.resumeAt validation (defensive; root cause fixed by #2171)#2177TooTallNate wants to merge 17 commits into
Conversation
…rded value A reused/duration sleep races a `wait_completed` replay against a non-deterministic, wall-clock-derived expected value, producing a false CorruptedEventLogError on a perfectly consistent event log. `sleep(<ms|string>)` computes its resumeAt as `Date.now() + duration` (see parseDurationToDate). The original run records that absolute timestamp into both wait_created and wait_completed. During replay the VM clock advances to each event's createdAt, so a freshly-created sleep recomputes a *different* absolute resumeAt. Normally harmless: the wait_created consumer overwrites the queue item's resumeAt with the recorded (authoritative) value before wait_completed is validated. The bug: when a wait_completed is consumed by a sleep consumer that never applied a wait_created (hasCreatedEvent=false), the queue item still holds the freshly-recomputed value, and the resumeAt comparison fails — even though the event log is internally consistent and the recorded resumeAt is the source of truth. Captured in production stress runs: hasCreatedEvent=false with ~18-42s deltas between the recomputed and recorded resumeAt. Fix: only validate resumeAt when an authoritative recorded value is available (hasCreatedEvent=true). When it is not, the correlationId match already establishes the wait's identity, so skip the check rather than fail. Extracted the validation into `detectResumeAtMismatch`, which also lowers the consumer callback's cognitive-complexity warning (33 -> 21). Adds a regression test that advances the replay clock (via updateTimestamp) and asserts a consistent wait_completed with hasCreatedEvent=false no longer raises CorruptedEventLogError. Pre-existing stable bug; independent of the hook-vs-sleep race fix (#2171) and of the server-side work. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: 671a3ea The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests▲ Vercel Production (1 failed)vite (1 failed):
🌍 Community Worlds (92 failed)mongodb (14 failed):
redis (9 failed):
turso-dev (1 failed):
turso (68 failed):
Details by Category❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details.
Check the workflow run for details. |
There was a problem hiding this comment.
Pull request overview
Fixes a false-positive CorruptedEventLogError raised on replay when a sleep consumer processes a wait_completed event without having first applied the corresponding wait_created. In that case the queue item's resumeAt still reflects a freshly recomputed (wall-clock-dependent) value from parseDurationToDate, so comparing it against the recorded resumeAt produces a spurious mismatch on a consistent event log.
Changes:
- Extracted
resumeAtvalidation intodetectResumeAtMismatch()helper. - Skip the
resumeAtcomparison unlessqueueItem.hasCreatedEventis true (authoritative recorded value present). - Added regression test simulating replay clock drift with
wait_completedand no priorwait_created.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/core/src/workflow/sleep.ts | Extract resumeAt mismatch detection; only validate when authoritative recorded value (hasCreatedEvent) is available. |
| packages/core/src/workflow/sleep.test.ts | Add regression test for replay-clock-advanced wait_completed without wait_created; expose updateTimestamp from setup helper. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pranaygp
left a comment
There was a problem hiding this comment.
Please add a patch changeset for @workflow/core before merging. This PR changes a published package on stable; without a changeset this bug fix will not produce a package release.
VaguelySerious
left a comment
There was a problem hiding this comment.
AI review: blocking issues found
AI Review: BlockingMissing changeset. This PR fixes a real bug in a published package ( This is the only blocker — the code change itself is regression-free and low-risk (see the inline note on AI Review: Note — on the red CI checks, none of which are caused by this change:
Local validation on Recommend adding the changeset; once that's in, this is good to merge. |
Add the missing patch changeset for the `@workflow/core` wait_completed resumeAt replay fix so it gets a version bump and changelog entry. Also remove the fixed 250ms grace timer from the new regression test: it now races the error-vs-resolve outcomes directly, so a regression surfaces deterministically (error branch, or a hang caught by the test timeout) rather than via a flaky race against a wall-clock guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
VaguelySerious
left a comment
There was a problem hiding this comment.
LGTM, applied the AI nit about the test case directly
…ist + infra breakdown Rework the PR-comment renderer so a human can immediately see what gates the job and inspect every failing run: - 🚨 Event-Log Regressions table lists *every* gating run in full (never truncated), each with its duration, a synthesised detail line, and a direct dashboard link. Stuck runs render "no terminal state after <ms>". - Infra (non-gating) section groups harness noise by error code with a plain-language explanation and example run links, instead of flooding one table with thousands of rows. - Headline names the regression count and digests the infra noise (e.g. "904 HOOK_RESUME_FAILED, 61 NO_WAKE_BRANCH"). Adds unit coverage for the breakdown, message synthesis and the never-truncate-regressions guarantee. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On run-poll timeout, fetch the run's event log and record the latest event (type, step name, elapsed) as the stuck run's errorMessage. The summary's regression table then shows "stuck after N events; latest step_started (foo) at +12.3s" with a dashboard link, instead of only a duration — so a human can see where the run wedged without opening every link. Best-effort; falls back to the duration-only note if the event fetch fails. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ep-resumeat-replay # Conflicts: # .github/scripts/render-event-log-race-repro-results.js # .github/scripts/render-event-log-race-repro-results.test.js
A run flagged at the 150s poll budget can simply be slow on a loaded preview deployment — observed wrun_…EFDZ9 completed shortly after the harness gave up and was wrongly gated as `stuck`. Add a generous post-budget grace window: a run that reaches a terminal state during grace is classified by its real outcome (completed → non-gating `SLOW_COMPLETION` infra, surfaced for visibility; failed → its error class). Only a run still non-terminal after budget + grace is a genuine wedge (gating `stuck`). Renderer gains notes for SLOW_COMPLETION/CANCELLED and singular/plural agreement fixes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e they occur
Investigating HARNESS_ERRORs on a repro run: a `fetch failed` and a `Hook not
found`. Both came from harness-side network calls to the deployment, not the
SDK. A single dropped connection should never abort tracking an otherwise
healthy run.
- Add `withRetry` (linear backoff, transient-network detection) and apply it to
the harness network calls: getWorkflowMetadata, start, resumeHook, and the
run-status poll. On final failure the error is prefixed with the call site
(e.g. "start: fetch failed", "poll runs.get: fetch failed"), so the infra
breakdown says *where* it happened.
- pollTerminalRun no longer aborts on a flaky GET: a transient error just
retries/continues until the deadline.
- waitForHook labels its surfaced error ("waitForHook: Hook not found") so the
hook-propagation timeout is identifiable in the summary.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…issing field
A failed WorkflowRun exposes its reason as `error: { code, message }` and has no
top-level `errorCode`, so the poller's `classifyFailure(runData.errorCode)` was
always passing `undefined` — collapsing every polled failure to an
uncategorised, detail-less `other`. Read `runData.error.code`/`.message` so
USER_ERROR/RUNTIME_ERROR/CORRUPTED_EVENT_LOG are classified correctly and the
regression row shows why the run failed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit d2a59d4)
…leep(Date) Addresses review feedback on the resumeAt-skip guard: the no-`wait_created` skip was too broad. It correctly avoids a false `CorruptedEventLogError` for duration-based `sleep(<ms|string>)` (whose resumeAt is `Date.now() + duration` and therefore varies across replays), but it also skipped validation for an absolute `sleep(Date)`, whose resumeAt is recomputed identically every replay and so remains an authoritative value worth checking even without a recorded `wait_created`. Track `resumeAtIsDeterministic` on the wait queue item (true when the sleep was given a Date / date-like), and only skip the equality check when resumeAt is non-deterministic AND no `wait_created` was applied. A genuine absolute-Date mismatch now still raises. Adds a regression test (mismatched `sleep(Date)` without `wait_created` → CorruptedEventLogError). The malformed/Invalid-Date case was already handled unconditionally before the gate and is already covered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ep-resumeat-replay
An A/B stress reproduction (instrumented build, 300 runs each) confirmed the no-`wait_created` state this guard handles is a downstream symptom of the hook-vs-sleep race fixed in #2171: every captured case was a `wait_completed` whose correlationId has NO matching `wait_created` in the log (a divergent replay shifted the deterministic ULID sequence). It reproduced readily with #2171 reverted (5 hits / 300) and never with #2171 present (0 / 300). Reword the changeset to one sentence describing the validation hardening, and document the confirmed root cause inline. No behavior change.
Sit-rep: root cause confirmed — this is defensive hardening, not an independent bug fixWe dug into why a InstrumentationBuilt a debug SDK that force-fails on any A/B result (same harness, 300 runs each, identical params)
Every captured Root causeThat state is a divergent-replay artifact of the hook-vs-sleep race fixed in #2171: when the race resolved non-deterministically, the workflow's branch decisions diverged on replay, shifting the deterministic ULID sequence — so a sleep got a correlationId whose With #2171 (now merged) the race is deterministic and this no longer occurs — confirmed by the 0/300 above. What this PR is nowDefensive hardening of the
Changeset reworded to one sentence; root cause documented inline. No behavior change since the last review pass. Given the root cause is fixed upstream by #2171, this is belt-and-suspenders — happy to keep it as hardening or close it if folks would rather not carry it. cc @pranaygp @VaguelySerious |
Summary
Fixes a second, independent source of
CORRUPTED_EVENT_LOGon replay: await_completedwhoseresumeAtis validated against a non-deterministic, wall-clock-derived value, producing a false corruption error on a perfectly consistent event log. This is the residualwait_completed.resumeAtshape that survived the hook-vs-sleep fix (#2171) in stress testing.Root cause (confirmed via production instrumentation)
sleep(<ms|string>)computesresumeAt = Date.now() + duration(parseDurationToDate). The original run records that absolute timestamp into bothwait_createdandwait_completed. During replay the VM clock advances to each event'screatedAt, so a freshly-created sleep recomputes a different absoluteresumeAt.Normally harmless: the
wait_createdconsumer overwrites the queue item'sresumeAtwith the recorded (authoritative) value beforewait_completedis validated. The bug fires when await_completedis consumed by a sleep consumer that never applied await_created(hasCreatedEvent=false) — the queue item still holds the freshly-recomputed value, and the comparison fails even though the log is internally consistent.I instrumented the SDK and captured this in production stress runs. Every failing sample showed
hasCreatedEvent=false, with ~18–42s deltas between the recomputed and recordedresumeAt, e.g.:The recorded
resumeAtis the source of truth; the consumer's recomputed value is not a valid basis for a corruption assertion.Fix
Only validate
resumeAtwhen an authoritative recorded value is available (hasCreatedEvent=true). When it is not, the correlationId match already establishes the wait's identity, so skip the check rather than fail a consistent log. Validation is extracted intodetectResumeAtMismatch, which also lowers the consumer callback's pre-existing cognitive-complexity warning (33 → 21).Tests
sleep.test.tsadvances the replay clock (updateTimestamp) and asserts a consistentwait_completedwithhasCreatedEvent=falseno longer raisesCorruptedEventLogError. Fails before the fix (reproduces the exact production error), passes after.hasCreatedEvent=true) still correctly fire.@workflow/coresuite: 635/635, typecheck clean.Scope
Pre-existing bug on
stable. Independent of the hook-vs-sleep race fix (#2171) and of the server-sidewait_createdatomicity work (workflow-server #462) — those address different failure shapes. Stress data showed #2171 removes the step-consumer-mismatch shape; this removes thewait_completed.resumeAtshape.Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com