fix(webapp): treat Phase 2 batch-stream retries as idempotent (TRI-9944) #3766

Merged
nicktrn merged 5 commits into main from feature/tri-9944-phase-2-batch-stream-retry-idempotency
May 28, 2026

Conversation

@matt-aitken
Member

@matt-aitken matt-aitken commented May 27, 2026

Summary

Closes TRI-9944.

Customers were seeing BatchTriggerError logs (status 422, Batch ... is not in PENDING status (current: COMPLETED)) even though their runs completed successfully. The SDK was hitting Phase 2 of the 2-phase batch API a second time on a network retry; by the time the retry arrived for a fast-completing single-item batch, the batch had already gone PENDING → PROCESSING → COMPLETED, and the server rejected the retry instead of treating it as a success.

StreamBatchItemsService.call now returns the standard sealed: true success response when the batch is already sealed or has moved past PENDING into PROCESSING/COMPLETED. This mirrors the idempotency already applied at the two post-loop race branches in the same file (lines 226 and 306). ABORTED and any other unexpected state still throw.

Safe because:

  • The original successful request already enqueued all items (otherwise the batch wouldn't be sealed/COMPLETED).
  • The SDK only checks sealed: true to decide success — a no-op success response is exactly what stops the retry loop.
  • Engine-side idempotency by item index would prevent duplicates even if items were reprocessed; we just skip the work entirely.
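
The second point can be illustrated with a minimal client-side sketch. This is not the SDK's actual retry code: `streamWithRetry`, `send`, and `maxAttempts` are hypothetical names, and the response type only borrows the field names quoted in this PR. It shows why a no-op `sealed: true` reply is enough to end the loop after a lost response.

```typescript
// Hypothetical response shape; field names follow the PR text.
type StreamBatchItemsResponse = {
  sealed: boolean;
  itemsAccepted: number;
  itemsDeduplicated: number;
  runCount: number;
};

// Hypothetical retry loop (an assumption, not the SDK implementation).
// It keeps calling `send` until a response reports sealed: true.
async function streamWithRetry(
  send: () => Promise<StreamBatchItemsResponse>,
  maxAttempts: number
): Promise<StreamBatchItemsResponse> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const response = await send();
      if (response.sealed) {
        return response; // success: the loop stops here
      }
      lastError = new Error("batch not sealed yet");
    } catch (error) {
      lastError = error; // e.g. a lost response or a 422 from the server
    }
  }
  throw lastError;
}
```

With the old server behaviour the second attempt would land in the `catch` arm with a 422 and the loop would run until attempts ran out; with the idempotent response it returns on the second attempt.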

Changes

  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts — replace the pre-loop sealed/non-PENDING throws with an idempotent success return for sealed || PROCESSING || COMPLETED.
  • apps/webapp/test/engine/streamBatchItems.test.ts — flip the existing "already sealed" race-condition test from expecting a throw to expecting sealed: true; add a COMPLETED-pre-loop test (the exact customer scenario: single-item batch, status=COMPLETED, sealed=false since tryCompleteBatch sets status without sealing); add an ABORTED negative test.
  • .server-changes/batch-stream-phase2-retry-idempotency.md.

Follow-up commits — scope expanded across all three branches

Review surfaced two adjacent classes of bug in the same file. The original fix only covered the pre-loop check; the same flawed idempotency reasoning also lived in the post-loop count-mismatch and seal-failed handlers. Subsequent commits unified all three branches behind a single helper and added two more "work is done" states the original PR missed:

  • sealed=true + ABORTED must throw, not succeed. The first revision admitted any sealed batch as success. But the V2 batchCompletionCallback (runEngineHandlers.server.ts:946) can set status=ABORTED (every run failed) on a batch this endpoint already sealed, leaving sealed=true next to a terminally-failed batch. Surfacing this as success masks the failure and prevents the customer's own retry from creating a fresh batch. CodeRabbit caught this; commit e8caa1d tightens the check to exclude sealed + ABORTED everywhere.
  • PARTIAL_FAILED is also an idempotent-success state. The same V2 callback can set status=PARTIAL_FAILED when some run-creation attempts failed but at least one succeeded. The original PR did not list this state, so the pre-loop and the two race branches all rejected it. Devin flagged this; commit e8caa1d adds it to the success list.
  • Cleanup-race (sealed=false + PENDING + processingCompletedAt set). Reported by a customer on a 4-item batchTriggerAndWait. BatchQueue rushes through every item before the service finishes its loop, the V2 callback fires (writing processingCompletedAt and leaving status=PENDING because all runs were created cleanly), cleanup() deletes the Redis metadata, and getBatchEnqueuedCount then returns 0 ≠ runCount. The count-mismatch branch returned sealed:false because it couldn't distinguish "callback fired, work is done" from "client should stream more items". The SDK then retried the stream against the cleaned-up batch, the engine threw Batch not found or not initialized, the retries were exhausted, and the customer saw BatchTriggerError despite every child run completing successfully. Commit f28f53f adds processingCompletedAt as the discriminator: it is set exclusively by the V2 completion callback (runEngineHandlers.server.ts:968), so (status=PENDING) && (sealed || processingCompletedAt != null) cleanly separates the two cases.

Commit 5a25abf extracts the four-line condition into isIdempotentRetrySuccess(status, sealed, processingCompletedAt) and routes the pre-loop check, the count-mismatch handler, and the seal-failed handler through it, so any future "work is done" state only needs to be added in one place. ABORTED is explicitly excluded in all three branches: it means zero TaskRun records exist for the customer to monitor (every per-item attempt failed AND the pre-failed-TaskRun fallback also failed), so the trigger call must throw to give their batchTrigger() retry the chance to create a fresh batch.
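
As a rough illustration of the policy described above, the helper might look like the following. This is a sketch reconstructed from the PR text, not the code in streamBatchItems.server.ts: the `BatchStatus` union is an assumption built from the status names mentioned here, and the real helper may be typed differently.

```typescript
// Assumed status union; the names come from the PR description.
type BatchStatus =
  | "PENDING"
  | "PROCESSING"
  | "COMPLETED"
  | "PARTIAL_FAILED"
  | "ABORTED";

// True when a Phase 2 retry can be answered with a no-op sealed: true.
function isIdempotentRetrySuccess(
  status: BatchStatus,
  sealed: boolean,
  processingCompletedAt: Date | null
): boolean {
  switch (status) {
    case "PROCESSING":
    case "COMPLETED":
    case "PARTIAL_FAILED":
      // TaskRun records exist, so the original stream did its work.
      return true;
    case "PENDING":
      // Either this endpoint sealed the batch, or the completion
      // callback already ran and stamped processingCompletedAt.
      return sealed || processingCompletedAt != null;
    default:
      // ABORTED: no TaskRun records exist, so the caller must throw.
      return false;
  }
}
```

Keeping ABORTED in the `default` arm means a sealed but aborted batch is rejected regardless of the other two arguments.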

Test coverage grew alongside: each of the three branches has positive cases for every admitted state (PROCESSING, COMPLETED, PARTIAL_FAILED, PENDING+sealed, PENDING+processingCompletedAt) and a negative case that pins down the ABORTED throw.

Test plan

  • pnpm run test ./test/engine/streamBatchItems.test.ts --run — 31/31 pass, including the 3 new TRI-9944 cases.
  • Confirmed red phase first: the two new behavior tests fail at the expected throw sites against the unchanged service before applying the fix.
  • pnpm run typecheck --filter webapp — passes.

🤖 Generated with Claude Code

When the SDK created a batch and then streamed its items (Phase 2 of
the 2-phase batch API), a lost response would trigger the SDK's
network-retry path. For small, fast-completing batches the original
request had already enqueued every item, sealed the batch, and the
runs flipped the batch to PROCESSING or COMPLETED by the time the
retry arrived. The retry then failed the pre-loop check at
streamBatchItems.server.ts:109 with a 422 — surfacing a
customer-visible BatchTriggerError for a batch whose runs had
actually succeeded.

StreamBatchItemsService.call now returns the standard sealed:true
success response (itemsAccepted: 0, itemsDeduplicated: 0,
runCount: batch.runCount) when the batch is already sealed or in
PROCESSING/COMPLETED, matching the idempotency already applied at
the two post-loop race-condition branches in the same file (lines
226 and 306). ABORTED and other unexpected non-PENDING states still
throw.
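
A minimal sketch of the pre-loop check as this first commit describes it. The names `Batch`, `StreamResult`, and `preLoopCheck` are hypothetical, and a plain `Error` stands in for the service's validation error. Later commits in this PR narrow the bare `sealed` clause so that sealed + ABORTED throws.

```typescript
type BatchStatus =
  | "PENDING"
  | "PROCESSING"
  | "COMPLETED"
  | "PARTIAL_FAILED"
  | "ABORTED";

// Hypothetical shapes; only the fields named in the commit message.
type Batch = { status: BatchStatus; sealed: boolean; runCount: number };
type StreamResult = {
  sealed: boolean;
  itemsAccepted: number;
  itemsDeduplicated: number;
  runCount: number;
};

// Returns a no-op success for a batch that is already done,
// null when streaming should proceed, and throws otherwise.
function preLoopCheck(batch: Batch): StreamResult | null {
  if (
    batch.sealed ||
    batch.status === "PROCESSING" ||
    batch.status === "COMPLETED"
  ) {
    return {
      sealed: true,
      itemsAccepted: 0,
      itemsDeduplicated: 0,
      runCount: batch.runCount,
    };
  }
  if (batch.status !== "PENDING") {
    throw new Error(
      `Batch is not in PENDING status (current: ${batch.status})`
    );
  }
  return null; // unsealed PENDING batch: stream the items as usual
}
```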

Tests:
- Rewrote the existing "already sealed" race test from expecting a
  throw to expecting sealed:true (Phase 2 retry idempotency).
- Added a COMPLETED-pre-loop test mirroring the exact customer
  scenario (single-item batch, status=COMPLETED, sealed=false —
  tryCompleteBatch sets status without setting sealed).
- Added a negative ABORTED test to lock in that terminal-failure
  states still surface as errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@changeset-bot

changeset-bot Bot commented May 27, 2026

⚠️ No Changeset found

Latest commit: d4e8af9

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@coderabbitai
Contributor

coderabbitai Bot commented May 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 68ece743-1c8e-435c-bd4d-6482972130e2

📥 Commits

Reviewing files that changed from the base of the PR and between f28f53f and d4e8af9.

📒 Files selected for processing (1)
  • .server-changes/batch-stream-phase2-retry-idempotency.md
✅ Files skipped from review due to trivial changes (1)
  • .server-changes/batch-stream-phase2-retry-idempotency.md
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: typecheck / typecheck
  • GitHub Check: audit
  • GitHub Check: audit
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (actions)

Walkthrough

StreamBatchItemsService.call() now uses an exported isIdempotentRetrySuccess(status, sealed, processingCompletedAt) helper to detect Phase 2 idempotent-success states. When a retry finds a batch already sealed or advanced to PROCESSING, COMPLETED, PARTIAL_FAILED, or has processingCompletedAt set while PENDING, the service returns { sealed: true, itemsAccepted: 0, itemsDeduplicated: 0, runCount } instead of throwing. Fast-path, post-enqueue re-check, and concurrent-seal fallback use this helper; ABORTED still causes a ServiceValidationError. Tests and changelog updated accordingly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main change: treating Phase 2 batch-stream retries as idempotent, which directly addresses the core issue (TRI-9944) described in the changeset.
Description check | ✅ Passed | The description comprehensively addresses all required template sections: issue closure, checklist items, detailed testing approach, changelog content, and technical follow-up explanation. All critical information is present and well-documented.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


coderabbitai[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

…race

Addresses two PR review findings:

CodeRabbit: sealed=true + ABORTED would silently succeed under the
previous `if (batch.sealed || ...)` check. V2's batch completion
callback can set status=ABORTED (failedRunCount > 0 &&
successfulRunCount === 0) on a batch that streamBatchItems already
sealed — leaving sealed=true alongside a terminally-failed batch.
A Phase 2 retry of such a batch must surface the error, not mask it.

Devin: PARTIAL_FAILED (failedRunCount > 0 with at least one success)
is a real V2 completion-callback status, but neither the pre-loop
check nor the post-loop race handlers (lines 226 and 306) accepted
it as success. A retry whose original stream succeeded would either
422 at the pre-loop or hit "unexpected state" at the post-loop seal-
failed branch.
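
The status rules quoted in the two findings above can be condensed into a small sketch. It is inferred only from these commit messages, not from runEngineHandlers.server.ts, and `statusAfterCompletionCallback` is a hypothetical name. The PENDING result for the all-created case follows the description elsewhere in this PR.

```typescript
type CallbackStatus = "PENDING" | "PARTIAL_FAILED" | "ABORTED";

// Status the V2 completion callback is described as leaving behind,
// given how many run-creation attempts succeeded and failed.
function statusAfterCompletionCallback(
  successfulRunCount: number,
  failedRunCount: number
): CallbackStatus {
  if (failedRunCount > 0 && successfulRunCount === 0) {
    return "ABORTED"; // every attempt failed
  }
  if (failedRunCount > 0) {
    return "PARTIAL_FAILED"; // at least one success, at least one failure
  }
  return "PENDING"; // all runs created cleanly
}
```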

Changes:
- Pre-loop: replace the broad `sealed || PROCESSING || COMPLETED`
  check with an `isIdempotentRetrySuccess` boolean that admits
  PROCESSING, COMPLETED, PARTIAL_FAILED, or (sealed && PENDING) —
  ABORTED falls through to the throw.
- Post-loop count-mismatch (line 226 region): add PARTIAL_FAILED to
  the success short-circuit alongside sealed and COMPLETED.
- Post-loop seal-failed (line 306 region): add PARTIAL_FAILED to the
  success short-circuit alongside (sealed && PROCESSING) and
  COMPLETED.

Tests (TDD red-then-green):
- New: pre-loop sealed=true + ABORTED → throws (CodeRabbit's case).
- New: pre-loop PARTIAL_FAILED → returns sealed:true.
- New: post-loop seal-failed race with PARTIAL_FAILED → returns
  sealed:true (uses the same racingPrisma pattern as the existing
  COMPLETED race test).
All 34 tests in streamBatchItems.test.ts pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

matt-aitken and others added 2 commits May 27, 2026 15:18
…branches

After settling the operational contract — ABORTED throws because zero
TaskRun records exist for the customer to monitor; every other terminal
state returns sealed:true because TaskRun records exist (some may be
in failed state, but per-run signals reach the customer through run
monitoring) — three inconsistencies remained between the pre-loop
check and the two post-loop race handlers:

1. Seal-failed branch threw "unexpected state" on sealed=true + PENDING,
   which is the legitimate post-callback "all runs created" state
   (V2 batchCompletionCallback resets PROCESSING → PENDING and leaves
   sealed=true). Pre-loop and count-mismatch both accept this state.
2. Count-mismatch branch admitted sealed=true + ABORTED via the bare
   `currentBatch?.sealed` clause, returning sealed:true. Pre-loop
   throws on this state. The count-mismatch outcome would silently
   hide a batch where zero TaskRuns were created.
3. Count-mismatch branch's fall-through return (sealed:false) implies
   "retry with missing items", which is wrong for ABORTED — a fresh
   batch is needed.

Extracted the per-status policy into an exported helper:

  isIdempotentRetrySuccess(status, sealed) returns true for
  PROCESSING, COMPLETED, PARTIAL_FAILED, or (sealed && PENDING).
  ABORTED is excluded so the customer's batchTrigger() retry fires.

All three branches now call the same helper. The count-mismatch
branch additionally throws explicitly on ABORTED before falling
through to the sealed:false return.
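
A hedged sketch of the count-mismatch decision after this commit, using the
two-argument helper described above. `resolveCountMismatch` and the outcome
labels are hypothetical, and a plain Error stands in for
ServiceValidationError.

```typescript
type BatchStatus =
  | "PENDING"
  | "PROCESSING"
  | "COMPLETED"
  | "PARTIAL_FAILED"
  | "ABORTED";

// Two-argument form of the helper, as described in this commit message.
function isIdempotentRetrySuccess(
  status: BatchStatus,
  sealed: boolean
): boolean {
  if (
    status === "PROCESSING" ||
    status === "COMPLETED" ||
    status === "PARTIAL_FAILED"
  ) {
    return true;
  }
  return status === "PENDING" && sealed;
}

type CountMismatchOutcome = "return-sealed-true" | "return-sealed-false";

// Decision taken after re-querying the batch on a count mismatch.
function resolveCountMismatch(
  status: BatchStatus,
  sealed: boolean
): CountMismatchOutcome {
  if (isIdempotentRetrySuccess(status, sealed)) {
    return "return-sealed-true"; // work is done, answer with a no-op success
  }
  if (status === "ABORTED") {
    // No TaskRuns exist; the caller needs a fresh batch, not more items.
    throw new Error("Batch was aborted");
  }
  return "return-sealed-false"; // client should stream the missing items
}
```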

Tests (TDD red-then-green):
- New: seal-failed race with sealed=true + PENDING returns sealed:true
  (was throwing "unexpected state"). Uses racingPrisma to set the
  exact post-callback shape during the seal updateMany.
- New: count-mismatch race with sealed=true + ABORTED throws
  ServiceValidationError (was returning sealed:false). Uses a
  call-counter on findFirst to flip the batch state between the
  pre-loop read and the re-query.
All 36 tests in streamBatchItems.test.ts pass; webapp typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erAndWait reports)

Another customer hit a related failure mode of the same class of Phase 2
stream bug: the parent batchTriggerAndWait threw BatchTriggerError despite
every child run completing successfully. Pattern: 10 occurrences over
2 days, all 4-item batches, parents failing ~45s after the 200, which
matches five SDK stream-retry attempts being exhausted.

Trace of the failure mode against the existing code:

1. SDK POST /items sends 4 items. Server enqueues all 4.
2. BatchQueue rushes through them (independent items, fast). All 4
   TaskRuns created.
3. batchCompletionCallback fires — sets processingCompletedAt = now(),
   successfulRunCount = 4, runIds. Status stays PENDING (the callback's
   "all created" happy path). sealed stays false (callback never
   touches it).
4. cleanup() runs, deletes the Redis batch metadata.
5. Our service's getBatchEnqueuedCount returns 0. Count-mismatch
   branch: 0 != 4.
6. Re-query: status=PENDING, sealed=false. Neither (sealed && PENDING)
   nor any of PROCESSING/COMPLETED/PARTIAL_FAILED matched → fell
   through to the sealed:false "client should stream more items"
   return. Server: 200 + sealed:false (matches the customer's "first
   POST returned 200, 8.1s").
7. SDK retries the stream. engine.enqueueBatchItem at batch-queue/
   index.ts:346 throws `Batch not found or not initialized` because
   cleanup deleted the metadata. Five retries exhaust → SDK throws
   BatchTriggerError (~45s after the 200).

The discriminator that distinguishes "callback fired, work is done"
from "client should stream more items" is processingCompletedAt: it's
written exclusively by the V2 batchCompletionCallback (verified by
grep across the run-engine and webapp). Nothing else touches it.

Extended isIdempotentRetrySuccess to take processingCompletedAt as a
third argument:

  (status === PENDING) && (sealed === true || processingCompletedAt != null)

now means "callback fired, every item has a TaskRun, return sealed:true".
The same helper is used by all three branches (pre-loop, count-mismatch,
seal-failed) so the contract stays uniform. All three findFirst selects
add `processingCompletedAt`. ABORTED still excluded everywhere.

Test helper createBatch now accepts PARTIAL_FAILED (per CodeRabbit's
adjacent nit on the previous commit) and processingCompletedAt.

Tests (TDD red-then-green):
- New: pre-loop with sealed=false + PENDING + processingCompletedAt set
  → returns sealed:true. Exercises the path a Phase 2 retry would hit
  if it arrived after the original count-mismatch returned sealed:false.
- New: count-mismatch race with the customer's exact shape (sealed=false
  + PENDING + processingCompletedAt flipped between pre-loop read and
  re-query) → returns sealed:true. Uses the findFirst-counter racing
  pattern to reproduce the production timing.
All 38 tests in streamBatchItems.test.ts pass; webapp typecheck clean.
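
The findFirst-counter racing pattern mentioned in the tests could be sketched
roughly as below. This is an illustrative stub, not the test file's code:
`makeRacingFindFirst` and `BatchRow` are hypothetical, and only a
`findFirst`-like method is modelled rather than a real Prisma client.

```typescript
// Hypothetical row shape with just the fields the branches read.
type BatchRow = {
  status: string;
  sealed: boolean;
  processingCompletedAt: Date | null;
};

// Returns a stub whose first read sees `before` (the pre-loop read) and
// whose later reads see `after` (the state the callback left behind).
function makeRacingFindFirst(before: BatchRow, after: BatchRow) {
  let calls = 0;
  return {
    findFirst: async (): Promise<BatchRow> => {
      calls += 1;
      return calls === 1 ? before : after;
    },
    callCount: (): number => calls,
  };
}
```

A test built on this stub would pass `before` with processingCompletedAt
null and `after` with it set, then assert that the service returns
sealed:true on the count-mismatch path.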

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 0 new potential issues.

View 7 additional findings in Devin Review.

@nicktrn nicktrn enabled auto-merge (squash) May 28, 2026 09:25
@nicktrn nicktrn merged commit 816986d into main May 28, 2026
27 checks passed
@nicktrn nicktrn deleted the feature/tri-9944-phase-2-batch-stream-retry-idempotency branch May 28, 2026 09:28
3 participants