
Return retryable error when method not found during bootstrap #8482

Merged: 4 commits into redpanda-data:dev on Feb 6, 2023

Conversation

@andrwng (Contributor) commented Jan 28, 2023

We previously saw issues where during bootstrap, a server would return
method_not_found upon receiving RPCs from other nodes, and that code
could result in an unexpected error (e.g. in the case of the admin
server, it resulted in a 500 code being sent back to the client).

This commit adds an explicit period where method_not_found errors don't
get returned if a method isn't found. Instead, the new
service_unavailable error is returned, indicating a request to retry.

Once the RPC subsystem has had all services registered, the behavior
reverts to always returning method_not_found.

This PR also addresses potential UB when adding a new status: rpc::transport previously had no handling for error codes it didn't know about, which made it difficult to extend rpc::status.

Related #8406

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Improvements

  • Redpanda will now return a retriable status when its internal RPC subsystem is bootstrapping.

src/v/rpc/types.h (review comment, outdated; resolved)
@jcsp (Contributor) left a comment

This looks right to me.

There will be a window on first startup where nodes will still have the issue of returning un-retriable errors, which should be closed by #8282

src/v/rpc/rpc_server.cc (review comment, resolved)
} catch (ss::abort_requested_exception&) {
// Shutting down
} catch (...) {
// Should never happen, abort is the only exception that
Contributor comment:

This comment is giving me déjà vu: is this copy-paste? Maybe a higher-level await_feature method like await_feature_then could wrap this error-handling pattern.

@andrwng (Contributor Author) replied:

Done

@andrwng (Contributor Author) replied:

Good idea, this seems like it'd be generally useful to activate usage of features.

src/v/redpanda/application.cc (review comment, outdated; resolved)
@dotnwat (Member) left a comment

Using the feature manager, this looks reasonable. Per yesterday's conversation, though, does our transport-level versioning support a solution that avoids the use of the feature manager?

@andrwng (Contributor Author) commented Feb 1, 2023

> Using the feature manager, this looks reasonable. Per yesterday's conversation, though, does our transport-level versioning support a solution that avoids the use of the feature manager?

We chatted about this on Slack, and it seems there are certainly ways to skirt around the feature manager for this new error code.

That said, the solution would likely look like a new transport version, with a handshake similar to what exists today between v0 and v2 (i.e. new transports send requests with some new version, and new servers respond based on what both sides are known to serialize), and I think we'd ultimately end up relying on the feature manager to use this new transport version by default.

Moving to a new transport version for this new error code also seems heavyweight compared to what that mechanism was designed for (new compression formats, etc.), so I'm leaning towards just using the feature manager for now.

rpc::transport previously hit undefined behavior if a server came across
a status code it didn't know about. This commit adds handling for
default/unknown codes, and adds a feature for this so new servers know
when it is safe to send new codes.

This commit adds a new server status code that, when received, signals clients to retry. This will be useful for telling clients to retry when a method is missing because we're in the middle of bootstrapping.
@andrwng (Contributor Author) commented Feb 1, 2023

> There will be a window on first startup where nodes will still have the issue of returning un-retriable errors, which should be closed by #8282

Good point, I'll remove the "Fixed" label since I think the flakiness mentioned in #8406 may still exist.

@andrwng andrwng marked this pull request as ready for review February 1, 2023 09:34
@jcsp (Contributor) commented Feb 2, 2023

This hit a SEGV in test_cloud_storage_rpunit https://buildkite.com/redpanda/redpanda/builds/22245#01860c50-8ce1-4c06-8879-fd4040217faf. I think it's an issue in this PR, because the crash happens immediately after "Feature rpc_transport_unknown_errc already active", suggesting that the code that runs after that feature goes active is segfaulting.

It isn't uncommon to want to wait for the activation of a feature and
then do something once it activates (e.g. start using the feature).
I intend to use this exact pattern to introduce a new error code that
isn't compatible with older versions.

This commit encapsulates the wait logic in redpanda/application.cc to be
a first-class citizen of the feature_table.
@andrwng (Contributor Author) commented Feb 3, 2023

> This hit a SEGV in test_cloud_storage_rpunit https://buildkite.com/redpanda/redpanda/builds/22245#01860c50-8ce1-4c06-8879-fd4040217faf. I think it's an issue in this PR, because the crash happens immediately after "Feature rpc_transport_unknown_errc already active", suggesting that the code that runs after that feature goes active is segfaulting.

Thanks for taking a look. The issue was the lifetime of the input std::function.

@andrwng (Contributor Author) commented Feb 3, 2023

CI failure was #8595

@@ -360,6 +362,25 @@ ss::future<> feature_table::await_feature(feature f, ss::abort_source& as) {
}
}

ss::future<>
feature_table::await_feature_then(feature f, std::function<void(void)> fn) {
Contributor comment:

This refactor is nice, thanks for taking care of it

@jcsp jcsp merged commit 0e5756f into redpanda-data:dev Feb 6, 2023
@vshtokman (Contributor):
/backport v22.3.x

@vbotbuildovich (Collaborator):

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x 798661a7afc827084a2a6d176f164b570230db09 e666aabda9db690b11d90c2fd867cff4580b1cfc a82ae9f5be883e971222db8b55a964ab5b0d8e44 af98a25e597a77bf50e19b5166679e780f20a4cb

Workflow run logs.

@vshtokman (Contributor):

@andrwng , could you look into backporting this when you have a chance?

@andrwng (Contributor Author) commented Feb 28, 2023

> @andrwng, could you look into backporting this when you have a chance?

As it turns out, the implemented fix can't be backported, since it depends on the feature flag used for 23.1.1 features. See here for a brief discussion.

I opened #9175 for the bad log line in CI on v22.3.x.
