rpc: fix correlation_id overflow #16156
If we end up in this exception handler, chances are that we never inserted an entry into _requests_queue. This leaves a hole in the sequence numbers, and the connection is no longer usable. Better to reset it.
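To illustrate why a hole in the sequence numbers makes the connection unusable, here is a minimal sketch. It assumes the send loop only dispatches the entry whose sequence number directly follows the last one sent. The class `send_queue` and all of its members are hypothetical and are not taken from the actual transport code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of an ordered send queue keyed by sequence number.
class send_queue {
public:
    // Hands out the next sequence number (starting at 1).
    uint64_t allocate_sequence() { return _next_seq++; }

    // Stores a payload under a previously allocated sequence number.
    void enqueue(uint64_t seq, std::string payload) {
        _pending.emplace(seq, std::move(payload));
    }

    // Sends payloads strictly in sequence order and stops at the first gap.
    std::vector<std::string> dispatch() {
        std::vector<std::string> sent;
        auto it = _pending.begin();
        while (it != _pending.end() && it->first == _last_sent + 1) {
            sent.push_back(std::move(it->second));
            _last_sent = it->first;
            it = _pending.erase(it);
        }
        return sent;
    }

    // Number of entries that are queued but cannot be sent yet.
    std::size_t stalled() const { return _pending.size(); }

private:
    uint64_t _next_seq{1};
    uint64_t _last_sent{0};
    std::map<uint64_t, std::string> _pending;
};

// Sequence 2 is allocated but never enqueued (as if an exception was thrown
// in between), so sequence 3 can never be dispatched.
inline void demo_hole_in_sequence() {
    send_queue q;
    const uint64_t a = q.allocate_sequence();
    const uint64_t lost = q.allocate_sequence();
    const uint64_t c = q.allocate_sequence();
    (void)lost;
    q.enqueue(a, "first");
    q.enqueue(c, "third");
    assert(q.dispatch().size() == 1); // only "first" goes out
    assert(q.dispatch().empty());     // "third" waits forever
    assert(q.stalled() == 1);
}
```

In this model nothing short of resetting the queue (here, the connection) lets later entries make progress, which matches the reasoning in the commit message.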
To check that the netbuf header is initialized, we checked correlation_id, treating 0 as a special invalid value. But it turns out we can get 0 if correlation_id overflows uint32_t (this can happen on a reasonably busy connection). The overflow itself is benign, as correlation_id just wraps around to 0 (and it is unlikely that entries with these values still remain in the _correlations map), but the check throws. Since it is just a precautionary check that doesn't add much value, remove it.
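The interaction between the wrap-around and the removed check can be reproduced in isolation. The sketch below is only an illustration: the `header` struct models just the two fields the check inspects, and `next_correlation_id` and `check_header_initialized` are hypothetical helper names, not functions from the code base.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical stand-in for the netbuf header; only the two fields
// inspected by the removed check are modelled.
struct header {
    uint32_t correlation_id{0};
    uint32_t meta{0};
};

// Unsigned arithmetic wraps modulo 2^32, so the id after UINT32_MAX is 0.
inline uint32_t next_correlation_id(uint32_t current) {
    return static_cast<uint32_t>(current + 1u);
}

// The removed precautionary check, reproduced as a free function.
inline void check_header_initialized(const header& hdr) {
    if (hdr.correlation_id == 0 || hdr.meta == 0) {
        throw std::runtime_error(
          "cannot compose scattered view with incomplete header. missing "
          "correlation_id or remote method id");
    }
}

// Returns true if the check rejects a fully initialized header whose
// correlation_id has just wrapped around.
inline bool old_check_rejects_wrapped_id() {
    header hdr;
    hdr.meta = 42; // arbitrary non-zero method id (assumption)
    hdr.correlation_id = next_correlation_id(
      std::numeric_limits<uint32_t>::max());
    assert(hdr.correlation_id == 0);
    try {
        check_header_initialized(hdr);
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

The wrapped id is a legitimate value, so the check produces a false positive exactly once per 2^32 requests on a connection.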
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43922#018d1fa9-d4e6-4018-ac77-41ef6505cc54
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43922#018d1fa9-d4df-4426-a257-a1ef5d476a7a
/backport v23.3.x
/backport v23.2.x
/backport v23.1.x
Failed to create a backport PR to v23.1.x branch. I tried:
if (hdr.correlation_id == 0 || hdr.meta == 0) {
    throw std::runtime_error(
      "cannot compose scattered view with incomplete header. missing "
      "correlation_id or remote method id");
}
in the commit message:
"(and it is unlikely that there still remain entries with these values in the _correlations map)"
in the unlikely event this occurred, wouldn't this pair a response with the incorrect request?
qq @ztlpn
We have a check for this here
ah yes of course i forgot. thanks!
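The check referred to in this thread is not reproduced here. As a rough sketch of how such a guard could work, assuming that a new request is refused when its correlation_id is still in flight, consider the following. The class `correlation_table` and its methods are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical table of in-flight requests keyed by correlation_id.
class correlation_table {
public:
    // Registers a request. Returns false, leaving the existing entry
    // untouched, if the id is already in flight.
    bool try_register(uint32_t id, std::string request) {
        return _correlations.try_emplace(id, std::move(request)).second;
    }

    // Marks a request as answered. Returns false if the id was unknown.
    bool complete(uint32_t id) { return _correlations.erase(id) > 0; }

private:
    std::unordered_map<uint32_t, std::string> _correlations;
};

// After a wrap-around, an id that is still pending is rejected rather than
// silently paired with the wrong response.
inline void demo_duplicate_detection() {
    correlation_table table;
    assert(table.try_register(0, "old request"));
    assert(!table.try_register(0, "request after wrap-around"));
    assert(table.complete(0));
    assert(table.try_register(0, "request after wrap-around"));
}
```

With a guard of this shape, a collision after wrap-around fails the new request instead of mismatching a response, which is the concern raised in the review comment.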
It is possible that in an active RPC client connection, correlation_id (which is uint32_t) will overflow: at, say, 5k req/s (a high but not unreasonable value), overflow happens after about 10 days. This isn't particularly bad, because it will just wrap around to 0, but we treat correlation_id=0 as an invalid value and throw an exception. Because this exception gets thrown from an unexpected place, the connection send loop stalls and the connection becomes unusable afterwards.

Remove this check as it doesn't add much value, and also ensure that connections get shut down after these unexpected exceptions.
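The 10-day figure can be checked with a short calculation, assuming one correlation_id is consumed per request at a constant rate. The function names below are illustrative only.

```cpp
// Number of distinct uint32_t values, i.e. requests until the id wraps.
constexpr double ids_before_wrap = 4294967296.0; // 2^32

// Seconds until correlation_id wraps at a constant request rate.
constexpr double seconds_until_wrap(double requests_per_second) {
    return ids_before_wrap / requests_per_second;
}

// The same interval expressed in days (86400 seconds per day).
constexpr double days_until_wrap(double requests_per_second) {
    return seconds_until_wrap(requests_per_second) / 86400.0;
}

// At 5000 req/s the wrap happens after roughly 9.94 days.
static_assert(
  days_until_wrap(5000.0) > 9.9 && days_until_wrap(5000.0) < 10.0,
  "2^32 ids at 5k req/s last just under 10 days");
```

Lower request rates push the wrap-around out proportionally, so the failure would only appear on long-lived, busy connections.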
Backports Required
Release Notes
Bug Fixes