Retry ForwardPoll on ResourceExhausted instead of disabling forwarding #10019

Merged
rkannan82 merged 3 commits into main from kannan/debug-slow-poll
Apr 24, 2026

Conversation

@rkannan82
Contributor

What

On child partitions, when ForwardPoll gets a ResourceExhausted error, re-enqueue the poller with forwarding still enabled so it retries.

Why

When frontend.enableCancelWorkerPollsOnShutdown is enabled, a wave of cancelled polls followed by re-polls can trigger rate limiting on the root partition. The forwardPolls goroutine treated ResourceExhausted like any other error and permanently disabled forwarding for that poller by setting forwardCtx = nil. The poll then fell back to waiting for a local task match, which on a child partition with no backlog means waiting out the full 60s long-poll timeout before the poller retries.

The forwardTasks goroutine already had proper ResourceExhausted retry logic with exponential backoff; forwardPolls was missing it.

How did you test it?

Unit test (TestForwardPollRetriesOnResourceExhausted) that sets up a child partition with a mock matching client returning ResourceExhausted on the first ForwardPoll call and a valid task on the second, then verifies the poll succeeds via retry.

🤖 Generated with Claude Code

@rkannan82 rkannan82 force-pushed the kannan/debug-slow-poll branch 2 times, most recently from 80078cd to f99e11d Compare April 22, 2026 06:10
@rkannan82 rkannan82 requested a review from dnr April 23, 2026 01:23
@rkannan82 rkannan82 marked this pull request as ready for review April 23, 2026 01:24
@rkannan82 rkannan82 requested a review from a team as a code owner April 23, 2026 01:24
…arding

When CancelOutstandingWorkerPolls RPCs exhaust the matching service rate
limiter during a rolling deployment, ForwardPoll from child partitions
to root gets ResourceExhausted. Previously this permanently disabled
forwarding for that poller (setting forwardCtx = nil), causing it to
wait the full 60s poll timeout. Now we re-enqueue with forwarding still
enabled so the poll retries and succeeds once the rate limiter recovers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rkannan82 rkannan82 force-pushed the kannan/debug-slow-poll branch from f99e11d to 2e2ba8c Compare April 23, 2026 21:04
Comment thread service/matching/pri_matcher.go Outdated
Comment thread service/matching/pri_matcher_test.go
Comment thread service/matching/pri_matcher_test.go
Replaces reuse of BacklogTaskForwardTimeout with a dedicated config
(default 10s) for the ForwardPoll retry max interval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rkannan82 rkannan82 requested review from a team as code owners April 24, 2026 03:04
@rkannan82 rkannan82 merged commit dd32c8f into main Apr 24, 2026
46 checks passed
@rkannan82 rkannan82 deleted the kannan/debug-slow-poll branch April 24, 2026 05:18

2 participants