
cloud_metadata: upload consumer groups in batches #15748

Merged: 2 commits into redpanda-data:dev on Dec 19, 2023

Conversation

@andrwng (Contributor) commented Dec 19, 2023

We currently serialize the entirety of the consumer offsets partition
into a single iobuf. This imposes a fundamental limit on the number of
groups that can be snapshotted.

The cluster manifest previously accounted for this by allowing multiple
snapshots per offsets partition in the manifest, even though existing
callers would only ever pass a single snapshot for the entire partition.

This commit begins batching the group snapshots, capping each batch at a
configured number of groups (default 1000).

Fixes #15749
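
As a rough illustration of the batching idea (all names below are hypothetical stand-ins, not the actual Redpanda types), each batch of at most N groups becomes its own snapshot that can be serialized and uploaded independently:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the real types: per-group offsets and a
// snapshot holding one batch of groups.
struct group_offsets {};
struct group_snapshot {
    std::vector<group_offsets> groups;
};

// Split the partition's groups into snapshots of at most
// max_groups_per_batch groups each (the PR defaults this cap to 1000).
std::vector<group_snapshot> batch_groups(
  const std::vector<group_offsets>& all_groups,
  std::size_t max_groups_per_batch = 1000) {
    std::vector<group_snapshot> batches;
    for (std::size_t i = 0; i < all_groups.size(); i += max_groups_per_batch) {
        auto end = std::min(i + max_groups_per_batch, all_groups.size());
        group_snapshot snap;
        snap.groups.assign(all_groups.begin() + i, all_groups.begin() + end);
        batches.push_back(std::move(snap));
    }
    return batches;
}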

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

This introduces a safeguard config, protecting against uploads from
consumer offsets partitions that may contain a huge number of groups.
As is, the offsets uploader uploads all groups as a single iobuf,
which imposes an eventual limit on the scalability of offsets uploads.

This commit introduces a property to cap the number of groups per
snapshot that will be used in a subsequent commit.
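
A minimal sketch of the kind of cap this describes (the struct and field name here are hypothetical; the real change wires the property through the cluster configuration):

#include <cstddef>

// Hypothetical configuration knob: an upper bound on how many consumer
// groups are packed into a single uploaded offsets snapshot. The property
// added by this PR defaults to 1000 groups per snapshot.
struct offsets_upload_config {
    std::size_t max_groups_per_snapshot{1000};
};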
We currently serialize the entirety of the consumer offsets partition
into a single iobuf. This imposes a fundamental limit on the number of
groups that can be snapshotted.

The cluster manifest previously accounted for this by allowing multiple
snapshots per offsets partition in the manifest, even though existing
callers would only ever pass a single snapshot for the entire partition.

This commit begins batching the group snapshots, capping each batch at a
configured number of groups.
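
Put together, the upload path conceptually becomes a loop over the per-batch snapshots, with each batch written to object storage under its own key (a simplified, synchronous sketch with hypothetical names; the real code is asynchronous and uses the cloud_storage remote client, as in the excerpt quoted below):

#include <optional>
#include <string>
#include <vector>

// One serialized batch, ready to be written under its own remote key
// (hypothetical, simplified types).
struct serialized_snapshot {
    std::string remote_key;
    std::string bytes;
};

// Stub standing in for the real cloud storage client; always "succeeds"
// here, purely for illustration.
bool upload_object(const std::string&, const std::string&) { return true; }

// Upload every batch and collect the remote paths. If any upload fails,
// return nothing so the caller can report an error instead of recording a
// partial set of snapshots.
std::optional<std::vector<std::string>> upload_all_batches(
  const std::vector<serialized_snapshot>& batches) {
    std::vector<std::string> paths;
    for (const auto& snap : batches) {
        if (!upload_object(snap.remote_key, snap.bytes)) {
            return std::nullopt;
        }
        paths.push_back(snap.remote_key);
    }
    return paths;
}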
@andrwng (Contributor, Author) commented Dec 19, 2023

It'd be nice to add a ducktape test for this, but for now, this only tweaks C++ fixture tests.

@andrwng added this to the v23.3.1-rc5 milestone Dec 19, 2023
Comment on lines +72 to +85
        auto upload_res = co_await _remote.local().upload_object(
          _bucket, remote_key, std::move(buf), retry_node);
        if (upload_res == cloud_storage::upload_result::success) {
            paths.paths.emplace_back(remote_key().c_str());
        }
    } catch (...) {
        auto eptr = std::current_exception();
        if (ssx::is_shutdown_exception(eptr)) {
            vlog(clusterlog.debug, "Shutdown while uploading offsets");
        } else {
            vlog(
              clusterlog.debug, "Error while uploading offsets: {}", eptr);
        }
        co_return error_outcome::upload_failed;
Member commented:
Is there any difference in behavior to take into consideration now that we can have a partial upload failure? Either here, or in restore, to identify a partial upload?

@andrwng (Contributor, Author) replied:

Not really -- if any upload fails, an error is returned and the paths aren't passed back to the manifest. In that case, we do not update the manifest:

if (reply.ec != cluster::errc::success) {

This does mean that the "snapshot" may not reflect the same round of metadata uploads, but at least it should still represent the partition completely at a given (stale) point in time.
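
In other words, the manifest's entry for the offsets partition is only replaced once a full round of batch uploads has succeeded, so it always points at a complete, if possibly stale, set of snapshots. A tiny sketch of that invariant (names hypothetical):

#include <string>
#include <vector>

// Hypothetical slice of the cluster manifest: the snapshot objects that
// together cover one consumer offsets partition.
struct offsets_partition_entry {
    std::vector<std::string> snapshot_paths;
};

// Only swap in the new paths when every batch from this round uploaded.
// On failure, keep the previous (complete, possibly stale) set of paths.
void maybe_update_manifest(
  offsets_partition_entry& entry,
  const std::vector<std::string>& new_paths,
  bool all_uploads_succeeded) {
    if (!all_uploads_succeeded) {
        return;
    }
    entry.snapshot_paths = new_paths;
}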

@andrwng merged commit 9f7f123 into redpanda-data:dev on Dec 19, 2023
21 checks passed
@vbotbuildovich (Collaborator) commented:

/backport v23.3.x

Development

Successfully merging this pull request may close these issues.

Improve scalability of consumer offsets uploads
3 participants