
Return retryable error when method not found during bootstrap #8482

Merged: 4 commits into redpanda-data:dev on Feb 6, 2023

Conversation

@andrwng (Contributor) commented Jan 28, 2023

We previously saw issues where during bootstrap, a server would return
method_not_found upon receiving RPCs from other nodes, and that code
could result in an unexpected error (e.g. in the case of the admin
server, it resulted in a 500 code being sent back to the client).

This commit adds an explicit period where method_not_found errors don't
get returned if a method isn't found. Instead, the new
service_unavailable error is returned, indicating a request to retry.

Once the RPC subsystem has had all services registered, the behavior
reverts to always returning method_not_found.

This PR also addresses potential UB when adding a new status: rpc::transport previously had no handling for error codes it didn't know about, which made it difficult to extend rpc::status.

Related #8406

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Improvements

  • Redpanda will now return a retriable status when its internal RPC subsystem is bootstrapping.

src/v/rpc/types.h (review comment, outdated; resolved)
@jcsp (Contributor) left a comment

This looks right to me.

There will be a window on first startup where nodes will still have the issue of returning un-retriable errors, which should be closed by #8282

src/v/rpc/rpc_server.cc (review comment, resolved)
} catch (ss::abort_requested_exception&) {
// Shutting down
} catch (...) {
// Should never happen, abort is the only exception that
Contributor comment:

This comment is giving me déjà vu: is this copy-paste? Maybe a higher-level await_feature method like await_feature_then could wrap this error-handling pattern.

@andrwng (Contributor Author) replied:

Done

@andrwng (Contributor Author) replied:

Good idea, this seems like it'd be generally useful to activate usage of features.

src/v/redpanda/application.cc (review comment, outdated; resolved)
@dotnwat (Member) left a comment

Using the feature manager, this looks reasonable. Per yesterday's conversation, though, does our transport-level versioning support a solution that avoids the use of the feature manager?

@andrwng (Contributor Author) commented Feb 1, 2023

> Using the feature manager, this looks reasonable. Per yesterday's conversation, though, does our transport-level versioning support a solution that avoids the use of the feature manager?

We chatted about this on Slack, and it seems there are certainly ways to skirt around the feature manager for this new error code.

That said, the solution would likely look like a new transport version, with a handshake similar to what exists today between v0 and v2 (i.e. new transports send requests with some new version, and new servers respond based on what both sides are known to serialize), and I think we'd ultimately end up relying on the feature manager to use this new transport version by default.

Moving to a new transport version for this new error code also seems heavyweight compared to what that mechanism was designed for (new compression formats, etc.), so I'm leaning towards just using the feature manager for now.

rpc::transport previously hit undefined behavior if a server came across
a status code it didn't know about. This commit adds handling for
default/unknown codes, and adds a feature for this so new servers know
when it is safe to send new codes.

This commit adds a new server status code that, when received, signals clients to retry. This will be useful for telling clients to retry when a method is missing because we're in the middle of bootstrapping.
@andrwng (Contributor Author) commented Feb 1, 2023

> There will be a window on first startup where nodes will still have the issue of returning un-retriable errors, which should be closed by #8282

Good point, I'll remove the "Fixed" label since I think the flakiness mentioned in #8406 may still exist.

@andrwng andrwng marked this pull request as ready for review February 1, 2023 09:34
@jcsp (Contributor) commented Feb 2, 2023

This hit a SEGV in test_cloud_storage_rpunit https://buildkite.com/redpanda/redpanda/builds/22245#01860c50-8ce1-4c06-8879-fd4040217faf. I think it's an issue in this PR, because the crash happens immediately after "Feature rpc_transport_unknown_errc already active", suggesting that the code that runs after that feature goes active is segfaulting.

It isn't uncommon to want to wait for the activation of a feature and
then do something once it activates (e.g. start using the feature).
I intend to use this exact pattern to introduce a new error code that
isn't compatible with older versions.

This commit encapsulates the wait logic in redpanda/application.cc to be
a first-class citizen of the feature_table.
@andrwng (Contributor Author) commented Feb 3, 2023

> This hit a SEGV in test_cloud_storage_rpunit https://buildkite.com/redpanda/redpanda/builds/22245#01860c50-8ce1-4c06-8879-fd4040217faf. I think it's an issue in this PR, because the crash happens immediately after "Feature rpc_transport_unknown_errc already active", suggesting that the code that runs after that feature goes active is segfaulting.

Thanks for taking a look. The issue was the lifetime of the input std::function.

@andrwng (Contributor Author) commented Feb 3, 2023

CI failure was #8595

@@ -360,6 +362,25 @@ ss::future<> feature_table::await_feature(feature f, ss::abort_source& as) {
}
}

ss::future<>
feature_table::await_feature_then(feature f, std::function<void(void)> fn) {
Contributor comment:

This refactor is nice, thanks for taking care of it

@jcsp jcsp merged commit 0e5756f into redpanda-data:dev Feb 6, 2023
@vshtokman (Contributor):
/backport v22.3.x

@vbotbuildovich (Collaborator):

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x 798661a7afc827084a2a6d176f164b570230db09 e666aabda9db690b11d90c2fd867cff4580b1cfc a82ae9f5be883e971222db8b55a964ab5b0d8e44 af98a25e597a77bf50e19b5166679e780f20a4cb

Workflow run logs.

@vshtokman (Contributor):

@andrwng , could you look into backporting this when you have a chance?

@andrwng (Contributor Author) commented Feb 28, 2023

> @andrwng, could you look into backporting this when you have a chance?

As it turns out, the implemented fix can't be backported, since it depends on the feature flag used for 23.1.1 features. See here for a brief discussion.

I opened #9175 for the bad log line in CI on v22.3.x.
