
cloud_metadata: upload consumer groups in batches #15748

Merged: 2 commits into redpanda-data:dev on Dec 19, 2023

Conversation

@andrwng (Contributor) commented Dec 19, 2023

We currently serialize the entirety of the consumer offsets partition
into a single iobuf. This imposes a fundamental limit on the number of
groups that can be snapshotted.

The cluster manifest previously accounted for this by allowing multiple
snapshots per offsets partition in the manifest, even though existing
callers would only ever pass a single snapshot for the entire partition.

This commit begins batching the group snapshots, capping each batch at a
configured number of groups (default 1000).

Fixes #15749
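
As a rough illustration of the batching idea (all names below are hypothetical stand-ins, not the actual Redpanda types), each batch of at most N groups becomes its own snapshot that can be serialized and uploaded independently:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the real types: per-group offsets and a
// snapshot holding one batch of groups.
struct group_offsets {};
struct group_snapshot {
    std::vector<group_offsets> groups;
};

// Split the partition's groups into snapshots of at most
// max_groups_per_batch groups each (the PR defaults this cap to 1000).
std::vector<group_snapshot> batch_groups(
  const std::vector<group_offsets>& all_groups,
  std::size_t max_groups_per_batch = 1000) {
    std::vector<group_snapshot> batches;
    for (std::size_t i = 0; i < all_groups.size(); i += max_groups_per_batch) {
        auto end = std::min(i + max_groups_per_batch, all_groups.size());
        group_snapshot snap;
        snap.groups.assign(all_groups.begin() + i, all_groups.begin() + end);
        batches.push_back(std::move(snap));
    }
    return batches;
}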

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

This introduces a safeguard config, protecting against uploads from
consumer offsets partitions that may contain a huge number of groups.
As is, the offsets uploader uploads all groups as a single iobuf,
which imposes an eventual limit on the scalability of offsets uploads.

This commit introduces a property to cap the number of groups per
snapshot that will be used in a subsequent commit.
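
A minimal sketch of the kind of cap this describes (the struct and field name here are hypothetical; the real change wires the property through the cluster configuration):

#include <cstddef>

// Hypothetical configuration knob: an upper bound on how many consumer
// groups are packed into a single uploaded offsets snapshot. The property
// added by this PR defaults to 1000 groups per snapshot.
struct offsets_upload_config {
    std::size_t max_groups_per_snapshot{1000};
};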
We currently serialize the entirety of the consumer offsets partition
into a single iobuf. This imposes a fundamental limit on the number of
groups that can be snapshotted.

The cluster manifest previously accounted for this by allowing multiple
snapshots per offsets partition in the manifest, even though existing
callers would only ever pass a single snapshot for the entire partition.

This commit begins batching the group snapshots, capping each batch at a
configured number of groups.
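
Put together, the upload path conceptually becomes a loop over the per-batch snapshots, with each batch written to object storage under its own key (a simplified, synchronous sketch with hypothetical names; the real code is asynchronous and uses the cloud_storage remote client, as in the excerpt quoted below):

#include <optional>
#include <string>
#include <vector>

// One serialized batch, ready to be written under its own remote key
// (hypothetical, simplified types).
struct serialized_snapshot {
    std::string remote_key;
    std::string bytes;
};

// Stub standing in for the real cloud storage client; always "succeeds"
// here, purely for illustration.
bool upload_object(const std::string&, const std::string&) { return true; }

// Upload every batch and collect the remote paths. If any upload fails,
// return nothing so the caller can report an error instead of recording a
// partial set of snapshots.
std::optional<std::vector<std::string>> upload_all_batches(
  const std::vector<serialized_snapshot>& batches) {
    std::vector<std::string> paths;
    for (const auto& snap : batches) {
        if (!upload_object(snap.remote_key, snap.bytes)) {
            return std::nullopt;
        }
        paths.push_back(snap.remote_key);
    }
    return paths;
}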
@andrwng (Contributor, Author) commented Dec 19, 2023

It'd be nice to add a ducktape test for this, but for now, this only tweaks C++ fixture tests.

@andrwng added this to the v23.3.1-rc5 milestone Dec 19, 2023
Comment on lines +72 to +85
        auto upload_res = co_await _remote.local().upload_object(
          _bucket, remote_key, std::move(buf), retry_node);
        if (upload_res == cloud_storage::upload_result::success) {
            paths.paths.emplace_back(remote_key().c_str());
        }
    } catch (...) {
        auto eptr = std::current_exception();
        if (ssx::is_shutdown_exception(eptr)) {
            vlog(clusterlog.debug, "Shutdown while uploading offsets");
        } else {
            vlog(
              clusterlog.debug, "Error while uploading offsets: {}", eptr);
        }
        co_return error_outcome::upload_failed;
Member commented:
Is there any difference in behavior to take into consideration now that we can have a partial upload failure? Either here, or in restore, to identify a partial upload?

@andrwng (Contributor, Author) replied:

Not really -- if any upload fails, an error is returned and the paths aren't passed back to the manifest. In that case, we do not update the manifest:

if (reply.ec != cluster::errc::success) {

This does mean that the "snapshot" may not reflect the same round of metadata uploads, but at least it should still represent the partition completely at a given (stale) point in time.
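
In other words, the manifest's entry for the offsets partition is only replaced once a full round of batch uploads has succeeded, so it always points at a complete, if possibly stale, set of snapshots. A tiny sketch of that invariant (names hypothetical):

#include <string>
#include <vector>

// Hypothetical slice of the cluster manifest: the snapshot objects that
// together cover one consumer offsets partition.
struct offsets_partition_entry {
    std::vector<std::string> snapshot_paths;
};

// Only swap in the new paths when every batch from this round uploaded.
// On failure, keep the previous (complete, possibly stale) set of paths.
void maybe_update_manifest(
  offsets_partition_entry& entry,
  const std::vector<std::string>& new_paths,
  bool all_uploads_succeeded) {
    if (!all_uploads_succeeded) {
        return;
    }
    entry.snapshot_paths = new_paths;
}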

@andrwng merged commit 9f7f123 into redpanda-data:dev on Dec 19, 2023
21 checks passed
@vbotbuildovich (Collaborator) commented:

/backport v23.3.x

Development

Successfully merging this pull request may close these issues.

Improve scalability of consumer offsets uploads
3 participants