-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return retryable error when method not found during bootstrap #8482
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks right to me.
There will be a window on first startup where nodes will still have the issue of returning un-retriable errors, which should be closed by #8282
src/v/redpanda/application.cc
Outdated
} catch (ss::abort_requested_exception&) { | ||
// Shutting down | ||
} catch (...) { | ||
// Should never happen, abort is the only exception that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is giving me deja-vu: is this copy-paste? Maybe a higher level await_feature method like await_feature_then
could wrap this error handling pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, this seems like it'd be generally useful to activate usage of features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using the feature manager, this looks reasonable. per yesterday's conversation, though, does our transport-level versioning support a solution that avoids the use of the feature manager?
We chatted about this on Slack and it seems like there are certainly ways to skirt around the use of feature manager for this new error code. That said, the solution would likely look like a new transport version, with a similar handshake to what exists today between v0 and v2 (i.e. new transports send requests with some new version, new servers respond based on what it knows both can serialize), and I think ultimately we'd end up relying on the feature manager to use this new transport version by default. Moving to a new transport version for this new error code also heavy-weight compared to what it seems like it was designed for (new compression formats, etc.), so I'm leaning towards just using the feature manager for now. |
rpc::transport previously hit undefined behavior if a server came across a status code it didn't know about. This commit adds handling for default/unknown codes, and adds a feature for this so new servers know when it is safe to send new codes.
This commit adds a new server status code that, when received will be a signal for clients to retry. This will be useful to tell clients to retry if a method is missing if we're in the middle of bootstrapping.
fb7d757
to
fa25041
Compare
This hit a SEGV in test_cloud_storage_rpunit https://buildkite.com/redpanda/redpanda/builds/22245#01860c50-8ce1-4c06-8879-fd4040217faf -- I think it's an issue in this PR because the crash is immediately after "Feature rpc_transport_unknown_errc already active", suggesting that the code after that feature goes active is segfaulting. |
It isn't uncommon to want to wait for the activation of a feature, and then do something once activated (e.g. start using the feature). I intend on using this exact pattern to introduce a new error code that isn't compatible with older versions. This commit encapsulates the wait logic in redpanda/application.cc to be a first-class citizen of the feature_table.
We previously saw issues where during bootstrap, a server would return `method_not_found` upon receiving RPCs from other nodes, and that code could result in an unexpected error (e.g. in the case of the admin server, it resulted in a 500 code being sent back to the client). This commit adds an explicit period where method_not_found errors don't get returned if a method isn't found. Instead, the new service_unavailable error is returned, indicating a request to retry. Once the RPC subsystem has had all services registered, the behavior reverts to always returning method_not_found.
fa25041
to
af98a25
Compare
Thanks for the look -- the issue was a lifetime issue with the input std::function |
CI failure was #8595 |
@@ -360,6 +362,25 @@ ss::future<> feature_table::await_feature(feature f, ss::abort_source& as) { | |||
} | |||
} | |||
|
|||
ss::future<> | |||
feature_table::await_feature_then(feature f, std::function<void(void)> fn) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This refactor is nice, thanks for taking care of it
/backport v22.3.x |
Failed to run cherry-pick command. I executed the below command:
|
@andrwng , could you look into backporting this when you have a chance? |
We previously saw issues where during bootstrap, a server would return
method_not_found
upon receiving RPCs from other nodes, and that codecould result in an unexpected error (e.g. in the case of the admin
server, it resulted in a 500 code being sent back to the client).
This commit adds an explicit period where method_not_found errors don't
get returned if a method isn't found. Instead, the new
service_unavailable error is returned, indicating a request to retry.
Once the RPC subsystem has had all services registered, the behavior
reverts to always returning
method_not_found
.This PR also addresses potential UB when adding a new status --
rpc::transport
handles error codes it doesn't know about, making it difficult to extendrpc::status
.Related #8406
Backports Required
UX Changes
Release Notes
Improvements