Fix concurrent wait_completed race condition in world-local #1388
TooTallNate wants to merge 2 commits into main
Use writeExclusive (O_CREAT|O_EXCL) to atomically prevent concurrent invocations from both completing the same wait.

Previously, two concurrent runtime invocations could both read the wait as 'waiting' and both create wait_completed events, causing duplicate events in the event log. On replay, the sleep callback consumed the first and the second was reported as an unconsumed event.

Add wait test helpers (createWait, completeWait) and tests for:
- Basic wait creation and completion
- Duplicate wait_created rejection
- Sequential duplicate wait_completed rejection (409)
- Concurrent wait_completed race (Promise.allSettled, exactly one wins)
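
The exclusive-create claim described above can be sketched roughly as follows. This is a minimal illustration, not the actual `writeExclusive` implementation in world-local: `tryClaim` and the lock file name are hypothetical, and the sketch assumes Node's `'wx'` file flag, which opens with O_CREAT|O_EXCL and fails with `EEXIST` when the file already exists.

```typescript
import { mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

/**
 * Hypothetical helper: tries to claim `lockPath` by creating it exclusively.
 * The "wx" flag maps to O_CREAT | O_EXCL, so creation fails with EEXIST
 * if the file already exists, regardless of which caller created it.
 */
export function tryClaim(lockPath: string): boolean {
  try {
    writeFileSync(lockPath, "", { flag: "wx" });
    return true; // this caller created the file and owns the completion
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") {
      return false; // another invocation already claimed the completion
    }
    throw err; // any other I/O error is unexpected and should surface
  }
}

// Usage: two attempts against the same (illustrative) lock path.
const demoDir = mkdtempSync(join(tmpdir(), "wait-claim-"));
const demoLockPath = join(demoDir, "wait_123.completed");
const first = tryClaim(demoLockPath);
const second = tryClaim(demoLockPath);
console.log({ first, second }); // { first: true, second: false }
```

Because the existence check and the creation happen in a single system call, there is no window in which two callers can both observe "not yet completed". A caller that loses the claim would surface it as a 409, as the commit message describes.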
🦋 Changeset detected
Latest commit: 8aa0385
The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages.
📊 Benchmark Results

(Result tables were not preserved. Each benchmark was run in 💻 Local Development and ▲ Production (Vercel) and compared across Next.js (Turbopack), Nitro, and Express.)

- workflow with no steps
- workflow with 1 step
- workflow with 10, 25, and 50 sequential steps
- Promise.all with 10, 25, and 50 concurrent steps
- Promise.race with 10, 25, and 50 concurrent steps
- Stream Benchmarks (includes TTFB metrics): workflow with stream

Summary: Fastest Framework by World and Fastest World by Framework, with the winner determined by most benchmark wins.
🧪 E2E Test Results

❌ Some tests failed

Failed tests:
- ▲ Vercel Production (14 failed)
  - nextjs-webpack (14 failed)
- 🌍 Community Worlds (54 failed)
  - mongodb (2 failed)
  - redis (2 failed)
  - turso (50 failed)

Details by category:
- ❌ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ✅ 📋 Other

❌ Some E2E test jobs failed. Check the workflow run for details.
Pull request overview
Fixes a concurrency bug in @workflow/world-local event storage where concurrent wait_completed writes could create duplicate events and break replay, by adding an atomic completion-claim mechanism and regression tests.
Changes:
- Add `createWait`/`completeWait` helpers for storage lifecycle tests.
- Update `wait_completed` handling to atomically claim a completion lock file via `writeExclusive`.
- Add wait lifecycle + concurrent completion regression tests and a patch changeset.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/world-local/src/test-helpers.ts | Adds wait-focused helper functions used by storage tests. |
| packages/world-local/src/storage/events-storage.ts | Introduces exclusive-create lock file to prevent concurrent double-completion. |
| packages/world-local/src/storage.test.ts | Adds wait lifecycle tests, including a concurrent completion regression test. |
| .changeset/fix-concurrent-wait-completed.md | Declares a patch release note for the race-condition fix. |
    @@ -732,12 +747,6 @@ export function createEventsStorage(
          status: 404,
        });
      }
      const lockPath = taggedPath(
        basedir,
        'waits',
        `${waitCompositeKey}.completed`,
        tag
When a concurrent invocation already created a wait_completed event, the local events array was missing it (the 409 handler just continued). This could cause the workflow to not see the event during replay. Re-fetch the full event log from the source of truth when any 409 is encountered during wait completion, ensuring the events array has the correct ordering with all events from concurrent invocations.
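
The re-fetch behavior described in this commit message might look roughly like the sketch below. It is an illustration only: `ConflictError`, `EventStore`, `InMemoryStore`, and `completeWaitsAndSync` are hypothetical stand-ins rather than the runtime's actual types, and the in-memory store exists only to make the example self-contained.

```typescript
interface WorkflowEvent {
  eventType: string;
  correlationId: string;
}

// Hypothetical stand-in for an API error carrying a 409 status.
class ConflictError extends Error {
  readonly status = 409;
}

// Hypothetical minimal storage interface used by this sketch.
interface EventStore {
  completeWait(waitId: string): Promise<WorkflowEvent>; // rejects with ConflictError if already completed
  list(): Promise<WorkflowEvent[]>; // full event log, the source of truth
}

class InMemoryStore implements EventStore {
  private log: WorkflowEvent[] = [];
  private completed = new Set<string>();

  async completeWait(waitId: string): Promise<WorkflowEvent> {
    if (this.completed.has(waitId)) {
      throw new ConflictError(`Wait ${waitId} already completed`);
    }
    this.completed.add(waitId);
    const event = { eventType: "wait_completed", correlationId: waitId };
    this.log.push(event);
    return event;
  }

  async list(): Promise<WorkflowEvent[]> {
    return [...this.log];
  }
}

/**
 * Completes each wait and appends the resulting event to the local array.
 * If any completion hits a 409, the local array may be missing events written
 * by a concurrent invocation, so the full log is re-fetched instead.
 */
export async function completeWaitsAndSync(
  store: EventStore,
  events: WorkflowEvent[],
  waitIds: string[],
): Promise<WorkflowEvent[]> {
  let sawConflict = false;
  for (const waitId of waitIds) {
    try {
      events.push(await store.completeWait(waitId));
    } catch (err) {
      if (err instanceof ConflictError) {
        sawConflict = true; // someone else completed this wait first
        continue;
      }
      throw err;
    }
  }
  return sawConflict ? store.list() : events;
}
```

The design trade-off is an extra read on the conflict path in exchange for a local event array that matches the stored ordering.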
    events.push(result.event!);
    } catch (err) {
      if (WorkflowAPIError.is(err) && err.status === 409) {
        runtimeLogger.info('Wait already completed, skipping', {
If we see a 409 here, wouldn't it mean we can exit this flow early since there's a different flow that created the wait_completed event? 🤔 Unsure. Otherwise LGTM
Ya I had that same optimization in mind. I kinda worry about if that earlier replay might not have the full event log though (i.e. if two steps completed ~simultaneously after the wait had elapsed, first replay only has the first step_completed event but the second one has both?)
Mmh fair. I think this is likely fine to merge and maybe only leads to additional unnecessary replays. Can we add a TODO in here to maybe log or monitor how often this happens? Might also need to re-examine with all the changes in #1338
VaguelySerious left a comment
LGTM let's add a comment or TODO though about re-examining this since it feels like a hack and it also applies to vercel/postgres world despite being a local world fix
Summary
- Fixes a race condition in `world-local` where two concurrent runtime invocations could both create `wait_completed` events for the same wait, causing duplicate events in the event log
- The duplicate `wait_completed` caused `WorkflowRuntimeError: Unconsumed event` during replay, since the sleep callback consumed the first event and was removed, leaving the second with no consumer
- Uses `writeExclusive` (O_CREAT|O_EXCL) to atomically claim a `.completed` lock file before transitioning a wait to `completed`. If a concurrent invocation already claimed it, the second gets a 409, which the runtime's existing conflict handler gracefully skips
- Adds `createWait` and `completeWait` test helpers
- Adds a concurrent race test (`Promise.allSettled` with two simultaneous completions: exactly one succeeds, the other gets 409)
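
A concurrent-completion regression test along the lines described in the summary could be sketched as follows. This is not the test added in the PR: `completeWaitSketch`, `ConflictError`, and `raceTwoCompletions` are hypothetical names, and the completion is reduced to an exclusive file create using `open(path, "wx")` from `node:fs/promises`.

```typescript
import { mkdtemp, open } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical stand-in for an API error carrying a 409 status.
class ConflictError extends Error {
  readonly status = 409;
}

// Hypothetical stand-in for completing a wait: claim the lock file or fail with 409.
async function completeWaitSketch(lockPath: string): Promise<string> {
  try {
    const handle = await open(lockPath, "wx"); // O_CREAT | O_EXCL
    await handle.close();
    return "completed";
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") {
      throw new ConflictError("Wait already completed");
    }
    throw err;
  }
}

// Fires two completions at once and counts winners and 409 conflicts.
export async function raceTwoCompletions(): Promise<{ wins: number; conflicts: number }> {
  const dir = await mkdtemp(join(tmpdir(), "wait-race-"));
  const lockPath = join(dir, "wait_1.completed");

  const results = await Promise.allSettled([
    completeWaitSketch(lockPath),
    completeWaitSketch(lockPath),
  ]);

  const wins = results.filter((r) => r.status === "fulfilled").length;
  const conflicts = results.filter(
    (r) => r.status === "rejected" && r.reason instanceof ConflictError,
  ).length;
  return { wins, conflicts };
}
```

`Promise.allSettled` is used rather than `Promise.all` so that the losing call's rejection is captured as a result instead of aborting the test, which lets the assertion check both outcomes together.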