fix(webapp): treat Phase 2 batch-stream retries as idempotent (TRI-9944) by matt-aitken · Pull Request #3766 · triggerdotdev/trigger.dev

matt-aitken · 2026-05-27T21:32:31Z

Summary

Customers were seeing BatchTriggerError logs (status 422, Batch ... is not in PENDING status (current: COMPLETED)) even though their runs completed successfully. The SDK was hitting Phase 2 of the 2-phase batch API a second time on a network retry; by the time the retry arrived for a fast-completing single-item batch, the batch had already gone PENDING → PROCESSING → COMPLETED, and the server rejected the retry instead of treating it as a success.

StreamBatchItemsService.call now returns the standard sealed: true success response when the batch is already sealed or has moved past PENDING into PROCESSING/COMPLETED. This mirrors the idempotency already applied at the two post-loop race branches in the same file (lines 226 and 306). ABORTED and any other unexpected state still throw.

Safe because:

The original successful request already enqueued all items (otherwise the batch wouldn't be sealed/COMPLETED).
The SDK only checks sealed: true to decide success — a no-op success response is exactly what stops the retry loop.
Engine-side idempotency by item index would prevent duplicates even if items were reprocessed; we just skip the work entirely.

Changes

apps/webapp/app/runEngine/services/streamBatchItems.server.ts — replace the pre-loop sealed/non-PENDING throws with an idempotent success return for sealed || PROCESSING || COMPLETED.
apps/webapp/test/engine/streamBatchItems.test.ts — flip the existing "already sealed" race-condition test from expecting a throw to expecting sealed: true; add a COMPLETED-pre-loop test (the exact customer scenario: single-item batch, status=COMPLETED, sealed=false since tryCompleteBatch sets status without sealing); add an ABORTED negative test.
.server-changes/batch-stream-phase2-retry-idempotency.md.

Follow-up commits — scope expanded across all three branches

Review surfaced two adjacent classes of bug in the same file. The original fix only covered the pre-loop check; the same flawed idempotency reasoning also lived in the post-loop count-mismatch and seal-failed handlers. Subsequent commits unified all three branches behind a single helper and added two more "work is done" states the original PR missed:

sealed=true + ABORTED must throw, not succeed. The first revision admitted any sealed batch as success. But the V2 batchCompletionCallback (runEngineHandlers.server.ts:946) can set status=ABORTED (every run failed) on a batch this endpoint already sealed, leaving sealed=true next to a terminally-failed batch. Surfacing this as success masks the failure and prevents the customer's own retry from creating a fresh batch. CodeRabbit caught this; commit e8caa1d tightens the check to exclude sealed + ABORTED everywhere.
PARTIAL_FAILED is also an idempotent-success state. Same V2 callback can set status=PARTIAL_FAILED when some run-creation attempts failed but at least one succeeded. The original PR did not list this state, so the pre-loop and the two race branches all rejected it. Devin flagged this; commit e8caa1d adds it to the success list.
Cleanup-race (sealed=false + PENDING + processingCompletedAt set). Reported by a customer on 4-item batchTriggerAndWait. BatchQueue rushes through every item before the service finishes its loop, the V2 callback fires (writing processingCompletedAt and leaving status=PENDING because all runs created cleanly), cleanup() deletes the Redis metadata, then getBatchEnqueuedCount returns 0 ≠ runCount. The count-mismatch branch returned sealed:false because it couldn't distinguish "callback fired, work is done" from "client should stream more items". The SDK then retried the stream against the cleaned-up batch, the engine threw Batch not found or not initialized, retries exhausted, and the customer saw BatchTriggerError despite every child run completing successfully. Commit f28f53f adds processingCompletedAt as the discriminator: it is set exclusively by the V2 completion callback (runEngineHandlers.server.ts:968), so (status=PENDING) && (sealed || processingCompletedAt != null) cleanly separates the two cases.

Commit 5a25abf extracts the four-line condition into isIdempotentRetrySuccess(status, sealed, processingCompletedAt) and routes the pre-loop check, the count-mismatch handler, and the seal-failed handler through it, so any future "work is done" state only needs to be added in one place. ABORTED is explicitly excluded in all three branches: it means zero TaskRun records exist for the customer to monitor (every per-item attempt failed AND the pre-failed-TaskRun fallback also failed), so the trigger call must throw to give their batchTrigger() retry the chance to create a fresh batch.

Test coverage grew alongside: each of the three branches has positive cases for every admitted state (PROCESSING, COMPLETED, PARTIAL_FAILED, PENDING+sealed, PENDING+processingCompletedAt) and a negative case that pins down the ABORTED throw.

Test plan

pnpm run test ./test/engine/streamBatchItems.test.ts --run — 31/31 pass, including the 3 new TRI-9944 cases.
Confirmed red phase first: the two new behavior tests fail at the expected throw sites against the unchanged service before applying the fix.
pnpm run typecheck --filter webapp — passes.

🤖 Generated with Claude Code

When the SDK created a batch and then streamed its items (Phase 2 of the 2-phase batch API), a lost response would trigger the SDK's network-retry path. For small, fast-completing batches the original request had already enqueued every item, sealed the batch, and the runs flipped the batch to PROCESSING or COMPLETED by the time the retry arrived. The retry then failed the pre-loop check at streamBatchItems.server.ts:109 with a 422 — surfacing a customer-visible BatchTriggerError for a batch whose runs had actually succeeded. StreamBatchItemsService.call now returns the standard sealed:true success response (itemsAccepted: 0, itemsDeduplicated: 0, runCount: batch.runCount) when the batch is already sealed or in PROCESSING/COMPLETED, matching the idempotency already applied at the two post-loop race-condition branches in the same file (lines 226 and 306). ABORTED and other unexpected non-PENDING states still throw. Tests: - Rewrote the existing "already sealed" race test from expecting a throw to expecting sealed:true (Phase 2 retry idempotency). - Added a COMPLETED-pre-loop test mirroring the exact customer scenario (single-item batch, status=COMPLETED, sealed=false — tryCompleteBatch sets status without setting sealed). - Added a negative ABORTED test to lock in that terminal-failure states still surface as errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

changeset-bot · 2026-05-27T21:32:35Z

⚠️ No Changeset found

Latest commit: d4e8af9

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai · 2026-05-27T21:32:46Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 68ece743-1c8e-435c-bd4d-6482972130e2

📥 Commits

Reviewing files that changed from the base of the PR and between f28f53f and d4e8af9.

📒 Files selected for processing (1)

.server-changes/batch-stream-phase2-retry-idempotency.md

✅ Files skipped from review due to trivial changes (1)

.server-changes/batch-stream-phase2-retry-idempotency.md

📜 Recent review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)

GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
GitHub Check: typecheck / typecheck
GitHub Check: audit
GitHub Check: audit
GitHub Check: Analyze (python)
GitHub Check: Analyze (javascript-typescript)
GitHub Check: Analyze (actions)

Walkthrough

StreamBatchItemsService.call() now uses an exported isIdempotentRetrySuccess(status, sealed, processingCompletedAt) helper to detect Phase 2 idempotent-success states. When a retry finds a batch already sealed or advanced to PROCESSING, COMPLETED, PARTIAL_FAILED, or has processingCompletedAt set while PENDING, the service returns { sealed: true, itemsAccepted: 0, itemsDeduplicated: 0, runCount } instead of throwing. Fast-path, post-enqueue re-check, and concurrent-seal fallback use this helper; ABORTED still causes a ServiceValidationError. Tests and changelog updated accordingly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: treating Phase 2 batch-stream retries as idempotent, which directly addresses the core issue (TRI-9944) described in the changeset.
Description check	✅ Passed	The description comprehensively addresses all required template sections: issue closure, checklist items, detailed testing approach, changelog content, and technical follow-up explanation. All critical information is present and well-documented.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/tri-9944-phase-2-batch-stream-retry-idempotency

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…race Addresses two PR review findings: CodeRabbit: sealed=true + ABORTED would silently succeed under the previous `if (batch.sealed || ...)` check. V2's batch completion callback can set status=ABORTED (failedRunCount > 0 && successfulRunCount === 0) on a batch that streamBatchItems already sealed — leaving sealed=true alongside a terminally-failed batch. A Phase 2 retry of such a batch must surface the error, not mask it. Devin: PARTIAL_FAILED (failedRunCount > 0 with at least one success) is a real V2 completion-callback status, but neither the pre-loop check nor the post-loop race handlers (lines 226 and 306) accepted it as success. A retry whose original stream succeeded would either 422 at the pre-loop or hit "unexpected state" at the post-loop seal- failed branch. Changes: - Pre-loop: replace the broad `sealed || PROCESSING || COMPLETED` check with an `isIdempotentRetrySuccess` boolean that admits PROCESSING, COMPLETED, PARTIAL_FAILED, or (sealed && PENDING) — ABORTED falls through to the throw. - Post-loop count-mismatch (line 226 region): add PARTIAL_FAILED to the success short-circuit alongside sealed and COMPLETED. - Post-loop seal-failed (line 306 region): add PARTIAL_FAILED to the success short-circuit alongside (sealed && PROCESSING) and COMPLETED. Tests (TDD red-then-green): - New: pre-loop sealed=true + ABORTED → throws (CodeRabbit's case). - New: pre-loop PARTIAL_FAILED → returns sealed:true. - New: post-loop seal-failed race with PARTIAL_FAILED → returns sealed:true (uses the same racingPrisma pattern as the existing COMPLETED race test). All 34 tests in streamBatchItems.test.ts pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…branches After settling the operational contract — ABORTED throws because zero TaskRun records exist for the customer to monitor; every other terminal state returns sealed:true because TaskRun records exist (some may be in failed state, but per-run signals reach the customer through run monitoring) — three inconsistencies remained between the pre-loop check and the two post-loop race handlers: 1. Seal-failed branch threw "unexpected state" on sealed=true + PENDING, which is the legitimate post-callback "all runs created" state (V2 batchCompletionCallback resets PROCESSING → PENDING and leaves sealed=true). Pre-loop and count-mismatch both accept this state. 2. Count-mismatch branch admitted sealed=true + ABORTED via the bare `currentBatch?.sealed` clause, returning sealed:true. Pre-loop throws on this state. The count-mismatch outcome would silently hide a batch where zero TaskRuns were created. 3. Count-mismatch branch's fall-through return (sealed:false) implies "retry with missing items", which is wrong for ABORTED — a fresh batch is needed. Extracted the per-status policy into an exported helper: isIdempotentRetrySuccess(status, sealed) returns true for PROCESSING, COMPLETED, PARTIAL_FAILED, or (sealed && PENDING). ABORTED is excluded so the customer's batchTrigger() retry fires. All three branches now call the same helper. The count-mismatch branch additionally throws explicitly on ABORTED before falling through to the sealed:false return. Tests (TDD red-then-green): - New: seal-failed race with sealed=true + PENDING returns sealed:true (was throwing "unexpected state"). Uses racingPrisma to set the exact post-callback shape during the seal updateMany. - New: count-mismatch race with sealed=true + ABORTED throws ServiceValidationError (was returning sealed:false). Uses a call-counter on findFirst to flip the batch state between the pre-loop read and the re-query. All 36 tests in streamBatchItems.test.ts pass; webapp typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…erAndWait reports) Another customer hit a related mode of the same Phase 2 stream class of bug: parent batchTriggerAndWait throwing BatchTriggerError despite every child run completing successfully. Pattern: 10 occurrences over 2 days, all 4-item batches, parents fail ~45s after the 200 — exactly five SDK stream-retry attempts exhausting. Trace of the failure mode against the existing code: 1. SDK POST /items sends 4 items. Server enqueues all 4. 2. BatchQueue rushes through them (independent items, fast). All 4 TaskRuns created. 3. batchCompletionCallback fires — sets processingCompletedAt = now(), successfulRunCount = 4, runIds. Status stays PENDING (the callback's "all created" happy path). sealed stays false (callback never touches it). 4. cleanup() runs, deletes the Redis batch metadata. 5. Our service's getBatchEnqueuedCount returns 0. Count-mismatch branch: 0 != 4. 6. Re-query: status=PENDING, sealed=false. Neither (sealed && PENDING) nor any of PROCESSING/COMPLETED/PARTIAL_FAILED matched → fell through to the sealed:false "client should stream more items" return. Server: 200 + sealed:false (matches the customer's "first POST returned 200, 8.1s"). 7. SDK retries the stream. engine.enqueueBatchItem at batch-queue/ index.ts:346 throws `Batch not found or not initialized` because cleanup deleted the metadata. Five retries exhaust → SDK throws BatchTriggerError (~45s after the 200). The discriminator that distinguishes "callback fired, work is done" from "client should stream more items" is processingCompletedAt: it's written exclusively by the V2 batchCompletionCallback (verified by grep across the run-engine and webapp). Nothing else touches it. Extended isIdempotentRetrySuccess to take processingCompletedAt as a third argument: (status === PENDING) && (sealed === true || processingCompletedAt != null) now means "callback fired, every item has a TaskRun, return sealed:true". The same helper is used by all three branches (pre-loop, count-mismatch, seal-failed) so the contract stays uniform. All three findFirst selects add `processingCompletedAt`. ABORTED still excluded everywhere. Test helper createBatch now accepts PARTIAL_FAILED (per CodeRabbit's adjacent nit on the previous commit) and processingCompletedAt. Tests (TDD red-then-green): - New: pre-loop with sealed=false + PENDING + processingCompletedAt set → returns sealed:true. Exercises the path a Phase 2 retry would hit if it arrived after the original count-mismatch returned sealed:false. - New: count-mismatch race with the customer's exact shape (sealed=false + PENDING + processingCompletedAt flipped between pre-loop read and re-query) → returns sealed:true. Uses the findFirst-counter racing pattern to reproduce the production timing. All 38 tests in streamBatchItems.test.ts pass; webapp typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

devin-ai-integration

Devin Review found 0 new potential issues.

View 7 additional findings in Devin Review.

This comment was marked as resolved.

Sign in to view

matt-aitken and others added 2 commits May 27, 2026 15:18

ericallam approved these changes May 28, 2026

View reviewed changes

chore: trim .server-changes file to release-note length

d4e8af9

devin-ai-integration Bot reviewed May 28, 2026

View reviewed changes

nicktrn enabled auto-merge (squash) May 28, 2026 09:25

nicktrn merged commit 816986d into main May 28, 2026
27 checks passed

nicktrn deleted the feature/tri-9944-phase-2-batch-stream-retry-idempotency branch May 28, 2026 09:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(webapp): treat Phase 2 batch-stream retries as idempotent (TRI-9944)#3766

fix(webapp): treat Phase 2 batch-stream retries as idempotent (TRI-9944)#3766
nicktrn merged 5 commits into
mainfrom
feature/tri-9944-phase-2-batch-stream-retry-idempotency

matt-aitken commented May 27, 2026 •

edited by nicktrn

Loading

Uh oh!

changeset-bot Bot commented May 27, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 27, 2026 •

edited

Loading

Reviews paused

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

matt-aitken commented May 27, 2026 • edited by nicktrn Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Follow-up commits — scope expanded across all three branches

Test plan

Uh oh!

changeset-bot Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

matt-aitken commented May 27, 2026 •

edited by nicktrn

Loading

changeset-bot Bot commented May 27, 2026 •

edited

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading