
cloud_storage: metric for memory consumed by manifest segments metadata #8304

Merged: 2 commits into redpanda-data:dev on Jan 20, 2023

Conversation

@andrwng (Contributor) commented Jan 18, 2023

This commit plumbs a `util::mem_tracker` into the `partition_manifest`
so it may allocate its `segment_map` with a `tracking_allocator`. This
will be useful to expose a metric for the memory consumed by segment
metadata.

This adds some unit tests to demonstrate how different manifest APIs
affect tracked memory. While the results are tightly coupled to the
underlying implementation, they're still worth having around to better
understand how our memory footprint responds to different events, at
least until the implementation changes.

This will also be useful as a metric with which to gauge further improvements to the manifest's in-memory format (see #7898).
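
To make the mechanism concrete, here is a minimal sketch of the idea rather than the actual Redpanda implementation: the tracker is reduced to a bare byte counter, and the allocator charges every allocation and deallocation against it, so a map built with the allocator reports its live node memory. The map's key and value types below are placeholders, not the real segment metadata types.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Bare-bones stand-in for util::mem_tracker: just a live-bytes counter.
struct mem_tracker {
    int64_t consumption{0};
};

// Allocator that charges every allocation/deallocation to a shared tracker.
template<typename T>
class tracking_allocator {
public:
    using value_type = T;

    explicit tracking_allocator(std::shared_ptr<mem_tracker> t)
      : _tracker(std::move(t)) {}

    template<typename U>
    tracking_allocator(const tracking_allocator<U>& other)
      : _tracker(other.tracker()) {}

    T* allocate(std::size_t n) {
        _tracker->consumption += n * sizeof(T);
        return std::allocator<T>{}.allocate(n);
    }

    void deallocate(T* p, std::size_t n) {
        _tracker->consumption -= n * sizeof(T);
        std::allocator<T>{}.deallocate(p, n);
    }

    std::shared_ptr<mem_tracker> tracker() const { return _tracker; }

    bool operator==(const tracking_allocator& o) const {
        return _tracker == o._tracker;
    }
    bool operator!=(const tracking_allocator& o) const { return !(*this == o); }

private:
    std::shared_ptr<mem_tracker> _tracker;
};

// A map whose node allocations are charged to the tracker, roughly how the
// manifest's segment_map could be wired up.
using tracked_segment_map = std::map<
  int64_t,
  std::string,
  std::less<int64_t>,
  tracking_allocator<std::pair<const int64_t, std::string>>>;
```

The manifest constructs its segment map with an allocator bound to its tracker; reading the tracker's counter then yields the number that backs the new metric.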

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Improvements

  • A new partition metric `redpanda_cloud_storage_segments_metadata_bytes` is added that tracks the amount of memory consumed by segment metadata managed by the remote partition.

sm::make_current_bytes(
"segments_metadata_bytes",
[this] { return _segments_metadata_bytes; },
sm::description("Current number of bytes consumed by remote segments "
Contributor

Description says "by the partition", but I think this is actually aggregated by topic in the output (via `aggregate_labels`)? I'm good with having it aggregated by topic (per-partition metrics are expensive); it just needs the description to line up.

Contributor Author

I ended up moving this into the partition probe, which is aggregated by partition. I am also wary of metric cardinality problems, so I can find it another home if there's concern that it doesn't belong.
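
For context on the aggregation point above, a hedged sketch of how a per-partition metric can be aggregated over a label with the Seastar metrics API. The group name, label names, and description are illustrative, not the exact code in this PR.

```cpp
#include <seastar/core/metrics.hh>
#include <seastar/core/metrics_registration.hh>

#include <cstdint>
#include <functional>
#include <vector>

namespace sm = seastar::metrics;

// Illustrative only: registers a bytes metric labelled by topic and partition,
// then aggregates away the partition label so the exported series is per topic.
void register_segments_metadata_bytes(
  sm::metric_groups& metrics,
  sm::label_instance topic,
  sm::label_instance partition,
  std::function<uint64_t()> current_bytes) {
    const std::vector<sm::label> aggregate_labels{sm::label("partition")};

    metrics.add_group(
      "cloud_storage",
      {sm::make_current_bytes(
         "segments_metadata_bytes",
         [fn = std::move(current_bytes)] { return fn(); },
         sm::description(
           "Bytes of segment metadata in the manifest, aggregated per topic"),
         {topic, partition})
         .aggregate(aggregate_labels)});
}
```

Aggregating over the partition label collapses the per-partition series into one series per topic, which keeps output cardinality down while the per-partition values still feed the aggregate.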

@@ -1008,6 +1028,8 @@ struct partition_manifest_handler
std::optional<model::offset_delta> _delta_offset_end;
std::optional<segment_name_format> _meta_sname_format;

ss::shared_ptr<util::mem_tracker> _manifest_mem_tracker;
Contributor

I haven't gone into the impl of mem_tracker: are you confident it doesn't have any significant overhead?

Contributor Author

Fairly confident, under the hood it's just a list of child trackers (of which there are none here), a counter, and a label.

It'll be interesting to see how mem-tracking evolves in the codebase, but this seems like a great place to start, given the ongoing memory-focused efforts.
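
For readers in the same position, a minimal sketch of the shape described above: a label, a counter, and a list of child trackers. The real `util::mem_tracker` differs in detail (for one, it is held behind `ss::shared_ptr`, as the diff above shows), but the per-allocation cost is on this order: a couple of integer operations.

```cpp
#include <cstdint>
#include <list>
#include <memory>
#include <string>

// Sketch only; not the real util::mem_tracker.
class mem_tracker {
public:
    explicit mem_tracker(std::string label)
      : _label(std::move(label)) {}

    std::shared_ptr<mem_tracker> create_child(std::string label) {
        auto child = std::make_shared<mem_tracker>(std::move(label));
        _children.push_back(child);
        return child;
    }

    // Called by the tracking allocator on every (de)allocation.
    void allocate(int64_t bytes) { _consumption += bytes; }
    void deallocate(int64_t bytes) { _consumption -= bytes; }

    // Own consumption plus everything charged to descendants.
    int64_t consumption() const {
        int64_t total = _consumption;
        for (const auto& child : _children) {
            total += child->consumption();
        }
        return total;
    }

private:
    std::string _label;
    int64_t _consumption{0};
    std::list<std::shared_ptr<mem_tracker>> _children;
};
```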

@@ -502,6 +502,7 @@ void ntp_archiver::update_probe() {
const auto& man = manifest();

_probe->segments_in_manifest(man.size());
_probe->segments_metadata_bytes(man.segments_metadata_bytes());
Contributor

Looks like update_probe() is only called from the segment upload loop: this will give us zero stats on followers, and stale stats on nodes that used to be leaders, right?

I believe metrics can be constructed with a lambda for fetching gauges on demand, which might be more robust here than trying to remember to call update_probe in all the right places.

(Clearly this issue already exists for segments_in_manifest too)

Contributor Author

Nice catch. I moved the metric to accommodate. Will put up a separate PR to address segments_in_manifest and co.
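
To illustrate the push-versus-pull distinction discussed here, a hedged sketch against the Seastar metrics API; the probe shape and names are simplified and are not the actual archiver or partition probe.

```cpp
#include <seastar/core/metrics.hh>
#include <seastar/core/metrics_registration.hh>

#include <cstdint>
#include <functional>

namespace sm = seastar::metrics;

class example_probe {
public:
    // Push style: holds whatever value the last caller set, so it goes stale
    // once update_probe()-like code stops running (e.g. on a follower).
    void segments_metadata_bytes(uint64_t v) { _segments_metadata_bytes = v; }

    // Pull style: the gauge re-reads the source of truth on every scrape.
    void setup_metrics(std::function<uint64_t()> read_current_bytes) {
        _metrics.add_group(
          "cloud_storage",
          {sm::make_gauge(
            "segments_metadata_bytes",
            [fn = std::move(read_current_bytes)] { return fn(); },
            sm::description(
              "Bytes of segment metadata tracked by the manifest"))});
    }

private:
    uint64_t _segments_metadata_bytes{0};
    sm::metric_groups _metrics;
};
```

The pull form also removes the need to remember to call `update_probe()` from every relevant code path.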

This commit plumbs a `util::mem_tracker` into the `partition_manifest`
so it may allocate its `segment_map` with a `tracking_allocator`. This
will be useful to expose a metric for memory consumed by segments.

This adds some unit tests to exemplify how different manifest APIs
affect tracked memory. While the results are strongly coupled with the
underlying implementation, they're still worth having around to better
understand how our memory footprint responds to different events, until
otherwise changed.

This exposes the newly tracked memory for segment metadata in the
manifest as a metric.
@jcsp merged commit e030ea8 into redpanda-data:dev on Jan 20, 2023
@jcsp (Contributor) commented Jan 20, 2023

Having this as an internal metric makes sense for near-term testing, although we're probably going to need to revisit it, as the internal metrics endpoint becomes unmanageably large when there are more than a few thousand partitions.

Ultimately it would be good to reinstate this as a public metric, suitably reduced in cardinality. If we could augment the mem tracker to accumulate some per-shard totals as well, those would make great public metrics, so that the user could get a per-shard breakdown of which subsystem is eating all the memory.
