Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: fix correlation_id overflow #16156

Merged
merged 2 commits into from
Jan 19, 2024

Conversation

ztlpn
Copy link
Contributor

@ztlpn ztlpn commented Jan 19, 2024

It is possible that in an active RPC client connection, correlation_id (which is uint32_t) will overflow - at, say, 5k req/s (high, but not unreasonable value) overflow happens after 10 days. This isn't particularly bad, because it will just wrap around to 0, but we treat correlation_id=0 as an invalid value and throw an exception. Because this exception gets thrown from an unexpected place, the connection send loop stalls and the connection becomes unusable afterwards.

Remove this check as it doesn't add much value, and also ensure that connections get shut down after these unexpected exceptions.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Bug Fixes

  • Fix internal RPC client connection stall after more than 2^32 requests are sent.

If we happen to end up in this exception handler, chances are that we
never inserted an entry into _requests_queue. This means that there is a
hole in the sequence numbers sequence and the connection is not usable
anymore. Better to reset it.
To check that netbuf header is initialized, we checked correlation_id,
treating 0 as special invalid value. But it turns out, we can get 0 if
correlation_id overflows uint32_t (this can happen on a reasonably busy
connection). The overflow itself is benign as correlation_id just wraps
around to 0 (and it is unlikely that there still remain entries with
these values in the _correlations map) but the check throws. Since it is
just a precautionary check that doesn't add much value, remove it.
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Jan 19, 2024

@piyushredpanda piyushredpanda merged commit 7d8d220 into redpanda-data:dev Jan 19, 2024
18 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16156-v23.1.x-674 remotes/upstream/v23.1.x
git cherry-pick -x c517c9b0de68959c9098ee547aeb678b2ab853b8 58f4dbf9e3e9ba353c16d573a4ed24e2b151c4a2

Workflow run logs.

Comment on lines -35 to -39
if (hdr.correlation_id == 0 || hdr.meta == 0) {
throw std::runtime_error(
"cannot compose scattered view with incomplete header. missing "
"correlation_id or remote method id");
}
Copy link
Member

@dotnwat dotnwat Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the commit message:

(and it is unlikely that there still remain entries with
these values in the _correlations map)

in the unlikely event this occurred, wouldn't this pair a response with the incorrect request?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq @ztlpn

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a check for this here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes of course i forgot. thanks!

@ztlpn ztlpn deleted the fix-rpc-overflow branch April 19, 2024 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants