sleep() causes 'Unconsumed event in event log' when called concurrently via Promise.all #1266

@desenmeng

Bug Report

Description

When multiple tool calls execute in parallel via Promise.all (as DurableAgent does internally), concurrent sleep() calls cause event log corruption during replay, resulting in:

Unconsumed event in event log: eventType=wait_completed, correlationId=wait_xxx, eventId=evnt_xxx.
This indicates a corrupted or invalid event log.

Reproduction

  1. Create a DurableAgent with tools that use sleep() for polling (e.g., polling an external API queue)
  2. Prompt the model so that it returns 5 or more tool calls in a single response
  3. DurableAgent executes tools via Promise.all (line 189 of durable-agent.js)
  4. Each tool calls sleep() concurrently
  5. On replay, the event log has unconsumed wait_completed events

Minimal Example

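import { DurableAgent } from "@workflow/ai/agent";
import { getWritable, sleep } from "workflow";
import { tool, type ModelMessage } from "ai";
import { z } from "zod";

// submitStep, checkStatusStep and getResultStep are "use step" functions
// defined elsewhere; their bodies are not relevant to the bug.
export async function myWorkflow(messages: ModelMessage[]) {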
  "use workflow";

  const agent = new DurableAgent({
    model: "anthropic/claude-sonnet-4.6",
    tools: {
      generateImage: tool({
        description: "Generate an image",
        inputSchema: z.object({ prompt: z.string() }),
        execute: async (params) => {
          // Submit to external queue
          const requestId = await submitStep(params.prompt);

          // Poll with sleep — breaks when multiple tools run concurrently
          for (let i = 0; i < 100; i++) {
            await sleep("5s");
            const status = await checkStatusStep(requestId, i);
            if (status === "COMPLETED") break;
          }

          return await getResultStep(requestId);
        },
      }),
    },
  });

  await agent.stream({ messages, writable: getWritable() });
}

When the LLM returns 5 generateImage tool calls, DurableAgent runs them all via Promise.all, creating 5 concurrent polling loops that each call sleep(). On replay, the order in which these concurrent sleep() calls execute may differ from the original run, causing event correlation mismatches.
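To illustrate why the order can differ, here is a small self-contained simulation. It does not use the workflow runtime: `fakeSleep`, `runTool`, and the deferred "submit" promises are hypothetical stand-ins for sleep(), a tool's execute function, and submitStep. The only point is that the order in which concurrent tasks reach their first sleep depends on which awaited call resolves first, not on the order in which Promise.all started them.

```typescript
// Toy simulation. fakeSleep and the deferred "submit" promises are
// hypothetical stand-ins, not the real sleep() or step functions.
type Deferred = { promise: Promise<void>; resolve: () => void };

function deferred(): Deferred {
  let resolve: () => void = () => {};
  const promise = new Promise<void>((r) => {
    resolve = r;
  });
  return { promise, resolve };
}

// Starts one "tool call" per name via Promise.all and records the order in
// which each one reaches its first sleep. finishOrder controls which submit
// resolves first, standing in for variable latency of an external API.
async function recordSleepOrder(
  names: string[],
  finishOrder: string[],
): Promise<string[]> {
  const sleepCalls: string[] = [];
  const submits: Record<string, Deferred> = {};
  for (const name of names) submits[name] = deferred();

  const fakeSleep = async (caller: string): Promise<void> => {
    // This is the point where an order-based correlation ID would be assigned.
    sleepCalls.push(caller);
  };

  const runTool = async (name: string): Promise<void> => {
    await submits[name].promise; // stand-in for `await submitStep(...)`
    await fakeSleep(name);
  };

  // The tools are always launched in the same order...
  const all = Promise.all(names.map(runTool));
  // ...but their submits complete in whatever order the outside world decides.
  for (const name of finishOrder) submits[name].resolve();
  await all;
  return sleepCalls;
}

// Same three tool calls, two different completion orders.
recordSleepOrder(["a", "b", "c"], ["a", "b", "c"]).then((order) => console.log(order));
recordSleepOrder(["a", "b", "c"], ["b", "a", "c"]).then((order) => console.log(order));
```

The two runs launch identical tool calls, yet the second one reaches sleep in the order b, a, c. If IDs are handed out by call order, the same logical sleep ends up with a different ID in each run.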

Root Cause Analysis

sleep() uses order-based correlation IDs internally. With Promise.all, the order in which concurrent sleep() calls execute is non-deterministic. During replay, the runtime may match a wait_completed event to the wrong sleep() call because the execution order differs from the original run.

This contrasts with "use step" functions, which are identified by file + function + args (content-based), making them safe for concurrent use.
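The difference between the two correlation schemes can be sketched with a toy model. This is an assumption-laden illustration, not the runtime's actual implementation: `assignByOrder`, `assignByToken`, and the `wait_` ID format are hypothetical, and the caller labels reuse the `poll-<requestId>-<iteration>` shape purely as an example.

```typescript
// Toy model only: not how the workflow runtime is actually implemented.
// Maps correlation ID -> the logical caller that owns it.
type Assignment = Record<string, string>;

// Order-based: the ID depends on when a call happens relative to the others.
function assignByOrder(callOrder: string[]): Assignment {
  const ids: Assignment = {};
  callOrder.forEach((caller, index) => {
    ids[`wait_${index}`] = caller;
  });
  return ids;
}

// Content-based: the ID is derived from a stable token supplied by the caller.
function assignByToken(callOrder: string[]): Assignment {
  const ids: Assignment = {};
  callOrder.forEach((caller) => {
    ids[`wait_${caller}`] = caller;
  });
  return ids;
}

// Returns the correlation IDs whose owner differs between the two runs.
function mismatchedIds(original: Assignment, replay: Assignment): string[] {
  return Object.keys(original).filter((id) => original[id] !== replay[id]);
}

// The same three sleeps, reached in a different order on replay.
const originalOrder = ["poll-a-0", "poll-b-0", "poll-c-0"];
const replayOrder = ["poll-b-0", "poll-a-0", "poll-c-0"];

const orderBased = mismatchedIds(
  assignByOrder(originalOrder),
  assignByOrder(replayOrder),
);
const tokenBased = mismatchedIds(
  assignByToken(originalOrder),
  assignByToken(replayOrder),
);

console.log(orderBased); // IDs that now point at a different caller
console.log(tokenBased); // empty: token-derived IDs do not depend on order
```

In this model the counter-based scheme reassigns `wait_0` and `wait_1` to different callers on replay, while the token-based scheme produces the same mapping regardless of call order.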

Suggested Fix

Add an optional token or name parameter to sleep() for content-based correlation:

// Current — order-based, breaks with Promise.all
await sleep("5s");

// Proposed — content-based, safe for concurrent use
await sleep("5s", { token: `poll-${requestId}-${iteration}` });

This would align with how createWebhook({ token }) and createHook({ token }) already support deterministic correlation via explicit tokens.

Environment

  • workflow@4.1.0-beta.63
  • @workflow/ai@4.0.1-beta.54
  • @workflow/core@4.1.0-beta.63
  • Platform: Vercel (Next.js)

Workaround

I am currently working around this by avoiding sleep() in tools that may execute concurrently: fast operations use fal.subscribe() (a blocking poll), and I plan to move long-running ones to createWebhook().
