Retry ForwardPoll on ResourceExhausted instead of disabling forwarding #10019
Merged
…arding

When CancelOutstandingWorkerPolls RPCs exhaust the matching service rate limiter during a rolling deployment, ForwardPoll from child partitions to root gets ResourceExhausted. Previously this permanently disabled forwarding for that poller (setting forwardCtx = nil), causing it to wait the full 60s poll timeout. Now we re-enqueue with forwarding still enabled so the poll retries and succeeds once the rate limiter recovers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dnr approved these changes on Apr 24, 2026
Replaces reuse of BacklogTaskForwardTimeout with a dedicated config (default 10s) for the ForwardPoll retry max interval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What

On child partitions, when ForwardPoll gets a ResourceExhausted error, re-enqueue the poller with forwarding still enabled so it retries.

Why

When frontend.enableCancelWorkerPollsOnShutdown is enabled, a wave of cancelled polls followed by re-polls can trigger rate limiting on the root partition. The forwardPolls goroutine was treating ResourceExhausted the same as other errors: it permanently disabled forwarding by setting forwardCtx = nil. This caused the poll to fall back to waiting for a local task match, which on a child partition with no backlog means waiting the full 60s long-poll timeout before the poller retries.

The forwardTasks goroutine already had proper ResourceExhausted retry logic with exponential backoff; forwardPolls was missing it.

How did you test it?

Unit test (TestForwardPollRetriesOnResourceExhausted) that sets up a child partition with a mock matching client returning ResourceExhausted on the first ForwardPoll call and a valid task on the second, then verifies the poll succeeds via retry.

🤖 Generated with Claude Code