fix(a2a): resubscribe on graceful SSE close with in-flight task#2406
malinskibeniamin merged 4 commits into master
Conversation
Load balancers with idle timeouts (commonly ~5 minutes) close idle TCP connections gracefully with a FIN rather than an error. The A2A streaming iterator surfaces this as end-of-stream, so the catch-block reconnect never runs and the message is finalized with a non-terminal taskState, orphaning the task server-side.

Route clean closes through the same resubscribeLoop used in the error path when the task is still resubscribable. The loop already handles backoff, progress detection, and give-up; it just needed a second entry point.

Covered by three regression tests: clean-close with in-flight task, clean-close with terminal task (no resubscribe), and clean-close where resubscribe exhausts retries (finalizes with gave-up status).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
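The backoff/give-up behavior the commit message leans on can be sketched as follows. This is a minimal stand-in, not the PR's code: the attempt count (5) and ~31s total delay are taken from later comments in this thread, and `tryResubscribe`/`sleep` are hypothetical injection points (the real loop also does progress detection).

```typescript
// Sketch of a resubscribe loop with exponential backoff and give-up:
// 5 attempts at 1s, 2s, 4s, 8s, 16s delays (~31s total), assumptions
// drawn from the review discussion below.
type Outcome = { reachedTerminal: boolean; attempts: number };

async function resubscribeLoop(
  tryResubscribe: () => Promise<"terminal" | "still-working" | "unreachable">,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 5,
): Promise<Outcome> {
  let delayMs = 1_000;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await sleep(delayMs); // 1s, 2s, 4s, 8s, 16s before giving up
    const result = await tryResubscribe();
    if (result === "terminal") {
      return { reachedTerminal: true, attempts: attempt }; // task finished
    }
    delayMs *= 2;
  }
  return { reachedTerminal: false, attempts: maxAttempts }; // gave up
}
```

Because the loop is entry-point-agnostic — it only needs an injected `tryResubscribe` — calling it from the clean-close branch as well as the error catch is the "second entry point" the fix adds.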
@claude review
Claude finished @malinskibeniamin's task in 1m 18s — Code Review
The fix is correct, well-motivated, and minimal — it reuses the existing resubscribe loop for the clean-close path rather than adding a parallel one.
| Aspect | Assessment |
|---|---|
| Bug fix correctness | ✅ Correct — routes clean close through the existing resubscribe loop |
| Minimal change | ✅ 12 lines of production code |
| Test coverage | ✅ Three new targeted regression scenarios |
| Minor cleanups | See redundant close call and test numbering above |
…failure

After the clean-close path enters resubscribeLoop and exhausts its 5 attempts without reaching a terminal state, the captured task state is still 'working' and isResubscribable() remains true. If finalizeMessage then throws (e.g., the DB write fails), the outer catch would re-enter resubscribeLoop for another full 31s backoff round.

Track entry with a resubscribeAttempted flag and skip the catch-path reconnect when the clean-close path already ran one. This mirrors the defensive pattern in the existing catch block around finalizeMessage. Add a regression test asserting exactly one round of 5 retries even when both resubscribe and finalize fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
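A minimal sketch of the double-loop hazard and the flag that prevents it. All names mirror identifiers used in this thread, but the harness itself (`streamWithGuard`, the simplified `isResubscribable`) is hypothetical — the real hook tracks more state.

```typescript
// Sketch, assuming a one-field state and injected loop/finalize stand-ins.
type State = { capturedTaskState: string };
const isResubscribable = (s: State) => s.capturedTaskState === "working";

async function streamWithGuard(
  s: State,
  resubscribeLoop: (s: State) => Promise<boolean>,
  finalizeMessage: (s: State) => Promise<boolean>,
): Promise<{ success: boolean }> {
  let resubscribeAttempted = false;
  try {
    // ...stream consumed; a graceful FIN ends the for-await without throwing...
    if (isResubscribable(s)) {
      resubscribeAttempted = true; // set BEFORE awaiting the loop
      await resubscribeLoop(s);    // may give up: state stays "working"
    }
    return { success: await finalizeMessage(s) };
  } catch {
    // Without the flag, a gave-up state is still resubscribable here and
    // the outer catch would start a second full backoff round.
    if (!resubscribeAttempted && isResubscribable(s)) {
      await resubscribeLoop(s);
    }
    return { success: false };
  }
}
```

With a loop that gives up and a finalize that throws, the loop runs exactly once — which is what the regression test described above asserts.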
@claude please review this PR. Specifically interested in your read on:
Claude finished @malinskibeniamin's task in 2m 33s — Code Review — Focused Q&A
| Path | resubscribeLoop returns false (gave-up) |
|---|---|
| Clean-close | Falls through to finalizeMessage — success=true, gave-up block visible in UI |
| Error catch | Skips finalizeMessage, falls to the a2a-error block — success=false |
A "shared resubscribe + finalize" helper would need to express this same fork internally — either via its own parameter flag, or by returning a discriminated result. That's the same complexity in a different location, and it fragments the paths that are currently readable in one place.
The one structural alternative worth noting: move the clean-close finalizeMessage call inside the if (isResubscribable) block, then add a second finalizeMessage call for the non-resubscribable branch. That eliminates the flag but introduces a duplicate call (or a helper that calls it in both arms). The flag wins.
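For concreteness, the discriminated result such a shared helper would have to return might look like this. These types are hypothetical — they are not in the PR — and exist only to show that the caller would still fork on the outcome, just in a different place:

```typescript
// Hypothetical return type for a shared "resubscribe + finalize" helper.
type ResubscribeResult =
  | { kind: "terminal" }                   // task reached a terminal state
  | { kind: "gave-up"; attempts: number }  // retries exhausted
  | { kind: "not-needed" };                // task was already terminal

// The same fork the current code expresses in-line, relocated here:
function followUp(
  r: ResubscribeResult,
  entry: "clean-close" | "error-catch",
): "finalize-message" | "a2a-error-block" {
  if (r.kind === "gave-up" && entry === "error-catch") return "a2a-error-block";
  return "finalize-message"; // clean-close finalizes even after give-up
}
```

Note the `entry` parameter: the helper cannot avoid knowing which path called it, which is exactly the complexity the review says would merely move rather than disappear.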
One thing the current implementation does paper over: after resubscribeLoop gives up, state.capturedTaskState is still non-terminal (the server never sent a terminal event). So isResubscribable(state) is still true when the outer catch fires. The flag is what prevents re-entry. A deeper fix would be to force-transition capturedTaskState to a pseudo-terminal value inside resubscribeLoop before returning false — then isResubscribable itself would return false and the flag becomes unnecessary. That said, mutating task state to something that was never sent by the server has its own risks. The flag is safer.
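The rejected "deeper fix" can be sketched in a few lines. The `"gave-up"` value below is a hypothetical pseudo-terminal state — the server never sends it, which is exactly the risk the review calls out — and the state/terminal-set shapes are assumptions, not the hook's real types:

```typescript
// Sketch of the force-transition alternative, under assumed state shapes.
type TaskState = "working" | "submitted" | "completed" | "failed" | "gave-up";
type State = { capturedTaskState: TaskState };

const TERMINAL: ReadonlySet<TaskState> = new Set(["completed", "failed", "gave-up"]);
const isResubscribable = (s: State) => !TERMINAL.has(s.capturedTaskState);

// If resubscribeLoop called this before returning false, isResubscribable
// itself would block catch-path re-entry and the flag would be unnecessary —
// at the cost of recording a state the server never reported.
function markGaveUp(s: State): void {
  s.capturedTaskState = "gave-up";
}
```

After `markGaveUp`, the outer catch's `isResubscribable` check fails on its own; the trade-off is that persisted task state no longer reflects anything the server sent.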
Q2 — Does the clean-close entry need a minimum delay or other guardrails to avoid flashing the reconnect UI?
Short answer: no extra guardrails needed. The flash concern mostly doesn't apply.
The scenario you're worried about — "quick task completes + clean TCP close → UI briefly shows disconnected" — is actually handled at the isResubscribable check:
- If the task completed normally, the server sent a terminal `status-update` event before closing the connection. That event updated `state.capturedTaskState` to `'completed'`. `isResubscribable` returns `false`. No reconnect UI, no flash. (Scenario 16c.)
- If both the FIN and the final event are racing in the TCP buffer, the iterator still drains them before exiting. By the time the `for await` exits, `capturedTaskState` is already terminal. Same result.
The flash only occurs when the LB closed the connection before the terminal event arrived — meaning the task is genuinely still in-flight. In that case the "disconnected" status block is accurate information, not noise.
The one real edge case: the task completes server-side in the 1-second window between the clean close and the first resubscribeTask attempt. The user sees disconnected → reconnecting → reconnected (completed) for ~1s. That's an accurate summary of what happened, and it's a narrow window. A pre-check getTask call would eliminate it, but it adds a network round-trip and complexity that isn't worth it here.
Q3 — Missing test scenarios
The four new tests (16b–16e) are solid. The one gap I'd flag:
Missing: clean-close + resubscribe succeeds + finalizeMessage throws
Scenario 16e tests: clean-close → gave-up → finalizeMessage throws → no second resubscribe loop.
Scenario 16 tests: error path → resubscribe succeeds → finalizeMessage throws → falls to error path.
Nobody tests: clean-close → resubscribe succeeds (task reaches terminal) → finalizeMessage throws → catch fires with resubscribeAttempted = true → no second loop → error path.
This matters because resubscribeAttempted = true is set before resubscribeLoop is awaited (line 361), so it guards both gave-up and success outcomes. But there's no test that a successful clean-close resubscribe followed by a DB failure doesn't trigger a second loop. The production code path is:
```ts
// line 360-365 (clean-close branch)
if (isResubscribable(state)) {
  resubscribeAttempted = true; // ← set here
  ...
  await resubscribeLoop(...); // ← task now terminal, returns true
}
return await finalizeMessage(...); // ← DB write fails → throws
// catch fires: !resubscribeAttempted is false → guard triggers → correct
```

The guard is correct but untested for this specific combination. Worth adding a test analogous to Scenario 16 but with clean-close as the entry.
Lower priority: clean-close when taskId comes from response metadata fallback
Scenario 10 confirms that capturedTaskId can be populated from streamResult.response metadata. The clean-close check at line 360 runs after the metadata fallback at lines 344–352, so the ordering is correct. A test combining clean-close + metadata-fallback taskId would confirm this ordering holds. Low priority since the logic path is sequential and obvious, but it's a coverage gap.
Production code (use-message-streaming.ts):
- Remove the redundant closeActiveTextBlock call in the clean-close branch; the block at the top of that stretch already closed it and nothing between could have opened a new one.
- Report success=false when the clean-close path's resubscribe gives up, mirroring the error path's gave-up semantics. An orphaned task is a failure regardless of whether the original disconnect was graceful.

Tests (use-message-streaming.test.ts):
- Renumber scenarios into a clean 1–25 sequence (was 1–13, 13b, 13c, 14–17 with an added 16b/c/d/e block mid-file).
- Rename the TypeError test: the code breaks out of the retry loop rather than rethrowing, so "stops retrying immediately on TypeError" is accurate.
- Update the gave-up clean-close test to assert success=false.
- Add scenario 23: clean-close resubscribe succeeds, then finalizeMessage throws — guards the success-path arm of resubscribeAttempted against future regressions that change terminal-state tracking.
- Add scenario 24: clean-close with taskId captured only from the response-metadata fallback — confirms the metadata block runs before the clean-close isResubscribable check and does not trigger spurious reconnects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@claude I've addressed the review feedback. Summary of changes in 2d9abcf: Production code:
Tests: 33/33 tests pass. Type-check and lint clean. Please take another pass.
Claude finished @malinskibeniamin's task in 1m 30s — Code Review — Follow-up Pass
All seven items from the trigger comment are confirmed in the code. Here's a point-by-point check plus one minor observation.

Verified changes
1. Redundant `closeActiveTextBlock` call removed ✅
2. Clean-close gave-up returns `success=false` ✅
3. Scenarios renumbered 1–25 ✅
4. Scenario 17 renamed ✅
5. Scenario 21 asserts `success=false` ✅
6. Scenario 23 added (clean-close succeeds → `finalizeMessage` throws) ✅
7. Scenario 24 added (metadata-fallback taskId + clean close) ✅

One minor observation (non-blocking)
Not worth a change request — just something to be aware of when reading the test output.

Summary

This is in good shape to merge.
@claude review |
Claude finished @malinskibeniamin's task in 1m 48s — Code Review — Final Pass
This is a clean final state. All items from the previous three review rounds are addressed, and the two new scenarios (23, 24) are well-constructed. One minor observation, non-blocking.

Production code (`use-message-streaming.ts`)
| Aspect | Assessment |
|---|---|
| Bug fix correctness | ✅ |
| `resubscribeAttempted` guard | ✅ Correct — prevents double-loop in all four paths (gave-up×2, success×2) |
| `success: false` symmetry | ✅ Clean-close gave-up now mirrors error-path gave-up |
| Scenario renumbering (1–25) | ✅ |
| Scenario 22: gave-up + finalize fail | ✅ |
| Scenario 23: success + finalize fail | ✅ |
| Scenario 24: metadata-fallback ordering | ✅ |
| Logging gap on clean-close finalize fail | ⚠️ Minor, non-blocking |
LGTM. The one observation above is minor enough that I wouldn't hold the merge for it.
Mirror the catch-block recovery branch's inner try/catch so a DB write failure after a clean-close reconnect is observable in production telemetry instead of silently producing an a2a-error block. Closes the last review observation from the final Claude pass: the clean-close path previously let finalizeMessage errors propagate to the outer catch, which runs parseA2AError on the DB error but emits no log. Scenarios 22 and 23 now also assert the log fires exactly once from the clean-close branch, making the previously-inaccurate console.error spy comments accurate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
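The mirrored inner try/catch could look like this. A sketch only: `console.error` is the assumed sink, and the wrapper shape and message text are illustrative, not the PR's exact code.

```typescript
// Sketch: log a finalize failure after a clean-close reconnect, then
// rethrow so the outer catch still runs parseA2AError and emits the
// a2a-error block (the behavior the commit message describes).
async function finalizeWithLogging(
  finalizeMessage: () => Promise<boolean>,
  log: (msg: string, err: unknown) => void = console.error,
): Promise<boolean> {
  try {
    return await finalizeMessage();
  } catch (err) {
    log("finalizeMessage failed after clean-close resubscribe", err);
    throw err; // propagate: UI behavior unchanged, failure now observable
  }
}
```

This matches the test change described above: scenarios can spy on the log and assert it fires exactly once, while the error-block behavior stays identical.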
@claude review |
Claude finished @malinskibeniamin's task in 1m 27s — Code Review — Final Pass
The last remaining observation from the previous review pass — the missing log on a clean-close finalize failure — is now addressed.

Production code (`use-message-streaming.ts`)
| Aspect | Assessment |
|---|---|
| Bug fix correctness | ✅ |
| `resubscribeAttempted` guard | ✅ |
| `success: false` symmetry for gave-up | ✅ |
| Logging on clean-close finalize fail | ✅ Addressed in 3cd57b2 |
| Scenarios 22/23 spy now verifies log | ✅ Accurate and assertive |
| Scenario 24 metadata-fallback ordering | ✅ |
| Scenario numbering (1–25) | ✅ |
No outstanding issues. LGTM.

Summary
Load balancers with idle timeouts (commonly ~5 min — AWS ALB, GCP LB, nginx defaults) close idle TCP connections gracefully with a FIN rather than an error. The A2A streaming iterator surfaces this as a clean end-of-stream, so `streamMessage`'s existing catch-block reconnect never fires and the message is finalized with a non-terminal `taskState`, orphaning the task server-side and showing the user a truncated reply.

Fix: after the `for await` over `streamResult.fullStream` exits cleanly, if `isResubscribable(state)` is true, route through the same `resubscribeLoop` already used in the error path. The loop already handles backoff, progress detection, and give-up — it just needed a second entry point.

Guard: if the clean-close path already ran `resubscribeLoop` and a subsequent `finalizeMessage` throws (e.g., DB write fails), the outer catch must not invoke `resubscribeLoop` a second time. Tracked with a `resubscribeAttempted` flag.

Context — not caused by recent SDK bumps
This bug predates the A2A 0.3.10 → 0.3.13, AI SDK v6, and streamdown v2 upgrades (#2389). The `try { for-await } catch { resubscribeLoop }` shape is unchanged from the original SSE reconnection commit (4fefd7b73). Clean-close never triggered the catch path in any version.

Which 5-min LB failure modes does this cover

- Clean FIN close with the task still in flight — `resubscribeTask` fires, the same backoff loop runs ✅
- The other two cases surface as errors and already go through the existing catch-path `resubscribeLoop`.

If the existing error path is "still buggy" for case 2 or 3, the root cause is different from what this PR addresses.
Test plan

New regression tests in `use-message-streaming.test.ts`:

- `resubscribes when stream ends cleanly but task is still in-flight (LB idle-timeout)` — asserts `resubscribeTask` is called and the task completes after a clean-close reconnect
- `does not resubscribe when stream ends cleanly and task is already terminal` — negative guard
- `finalizes with gave-up status when clean-close triggers resubscribe but it exhausts retries` — content preserved, single gave-up block shown
- `does not re-enter resubscribeLoop when finalizeMessage fails after a gave-up clean-close` — asserts exactly 5 retries (one round), not 10

Verification (run locally):

- `bun run test:unit -- use-message-streaming` — 31/31 pass
- `bun run type:check` — clean
- `bun run lint` on touched files — clean

Users now see `disconnected → reconnecting → reconnected` status blocks instead of silent truncation. The DevTools Network panel will show `200 OK`/completed on the streaming request (graceful close) vs `(failed)`/`net::ERR_*` (error close — a different code path).

@claude please review.
🤖 Generated with Claude Code