tx_migration: avoid ping pong of requests between brokers #15953
Merged: piyushredpanda merged 2 commits into redpanda-data:dev from bharathv:tx_migration_fix_alloc on Jan 6, 2024
When the leader table is stale (say, at startup or during failures), the current code can result in a ping-pong of requests between two brokers in a tight loop.

Example:
- `tx_migration_replicate` is dispatched from node 1 to node 2 (because node 1 thinks node 2 is the leader).
- Node 2 dispatches the request back to node 1 because it thinks node 1 is the leader.

Until leadership stabilizes, this results in a huge pile-up of requests, which manifested as an oversized allocation.

I don't think a router is the right choice in the handler, as the handler is supposed to process the request locally. If it returns an error, the error is propagated to the source router, which dispatches to the correct leader. This also breaks the ping-pong loop, as the source router sleeps between retries for backoff.
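The difference between the two behaviours can be illustrated with a small simulation. This is only a sketch, not Redpanda's code: `Node`, `forwarding_handler`, `local_handler`, `source_router`, the `max_hops` cap, and the exponential backoff schedule are all hypothetical names and assumptions made for illustration. The real handler has no hop cap (which is why the loop is tight), and the real router's backoff policy may differ.

```python
from dataclasses import dataclass

OK = "ok"
NOT_LEADER = "not_leader"


@dataclass
class Node:
    node_id: int
    believed_leader: int    # this node's (possibly stale) view of the leader
    is_leader: bool = False
    handled: int = 0        # how many requests this node's handler has seen


def forwarding_handler(nodes, target, max_hops):
    """Old behaviour (hypothetical model): the handler re-routes using its own
    leader view. max_hops exists only so the simulation terminates."""
    hops = 0
    current = target
    while hops < max_hops:
        node = nodes[current]
        node.handled += 1
        if node.is_leader:
            return OK, hops
        current = node.believed_leader  # bounce to whoever this node believes leads
        hops += 1
    return NOT_LEADER, hops


def local_handler(node):
    """New behaviour: process locally, or fail with an error. Never forward."""
    node.handled += 1
    return OK if node.is_leader else NOT_LEADER


def source_router(nodes, source, max_retries, base_backoff_ms, on_retry=None):
    """Only the originating node retries, backing off between attempts.
    Backoff is accumulated rather than slept so the sketch runs instantly.
    Returns (status, attempts, total_backoff_ms)."""
    slept_ms = 0
    for attempt in range(max_retries):
        target = nodes[source].believed_leader
        if local_handler(nodes[target]) == OK:
            return OK, attempt + 1, slept_ms
        slept_ms += base_backoff_ms * 2 ** attempt  # assumed exponential backoff
        if on_retry is not None:
            on_retry(attempt)  # hook: lets a caller model leadership changing
    return NOT_LEADER, max_retries, slept_ms


if __name__ == "__main__":
    # Nodes 1 and 2 each believe the other is the leader; neither is yet.
    stale = {1: Node(1, believed_leader=2), 2: Node(2, believed_leader=1)}
    print(forwarding_handler(stale, target=2, max_hops=1000))

    fresh = {1: Node(1, believed_leader=2), 2: Node(2, believed_leader=1)}
    print(source_router(fresh, source=1, max_retries=5, base_backoff_ms=10))
```

In the forwarding model, every hop is a new in-flight request with no pause between them, so the request count grows as fast as the two nodes can bounce it. In the local-handler model, each failed attempt costs exactly one request, and the next one is issued only after the source router's backoff.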
new failures in https://buildkite.com/redpanda/redpanda/builds/43450#018cd6dd-f8b9-42e9-8821-73b33cbb5da4:
/ci-repeat 1
Failure is a known issue #15944
rockwotj approved these changes on Jan 5, 2024
/backport v23.3.x
This was referenced Jan 6, 2024
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request on Apr 16, 2024:
A tight forward-to-leader loop has been discovered in a test where metadata about leader is out of date: redpanda-data#17873. Instead, we remove the forwarding from the request handler and do it only once on the original invoker. In `id_allocator_frontend::allocate_id` we call `allocate_router::process_or_dispatch` which will do the redirect and retry if the target node returns an error/does not respond. It also has backoff built in. This is a fix very similar to one described in redpanda-data#15953. Fixes redpanda-data#17873
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request on Apr 16, 2024:
A tight forward-to-leader loop has been discovered in a test where metadata about leader is out of date: redpanda-data#17873. Instead, we remove the forwarding from the request handler and do it only once on the original invoker. In `id_allocator_frontend::allocate_id` we call `allocate_router::process_or_dispatch` which will do the redirect and retry if the target node returns an error/does not respond. It also has backoff built in. This is a fix very similar to one described in redpanda-data#15953. Fixes redpanda-data#17873 (cherry picked from commit f388566)
Fixes: #15901
Backports Required
Release Notes