cloud_storage: metric for memory consumed by manifest segments metadata #8304
src/v/archival/probe.cc
sm::make_current_bytes(
  "segments_metadata_bytes",
  [this] { return _segments_metadata_bytes; },
  sm::description("Current number of bytes consumed by remote segments "
Description says "by the partition" but I think this is actually aggregated by topic in the output? (via aggregate_labels). I'm good with having it aggregated by topic (per-partition metrics are expensive), just needs the description to line up.
I ended up moving this into the partition probe, which is aggregated by partition. I am also wary of metric cardinality problems, so I can find it another home if there's concern that it doesn't belong.
@@ -1008,6 +1028,8 @@ struct partition_manifest_handler
    std::optional<model::offset_delta> _delta_offset_end;
    std::optional<segment_name_format> _meta_sname_format;

    ss::shared_ptr<util::mem_tracker> _manifest_mem_tracker;
I haven't gone into the impl of mem_tracker: are you confident it doesn't have any significant overhead?
Fairly confident, under the hood it's just a list of child trackers (of which there are none here), a counter, and a label.
It'll be interesting to see how mem-tracking evolves in the codebase, but this seems like a great place to start considering the ongoing memory-focused efforts.
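For readers who have not looked at the tracker either, the shape described above (child trackers, a counter, a label) can be sketched roughly as follows. This is a hypothetical stand-in, not the actual `util::mem_tracker`; the class name `simple_mem_tracker` and all of its methods are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for util::mem_tracker: a label, a counter, and a
// list of child trackers. Not the actual Redpanda implementation.
class simple_mem_tracker {
public:
    explicit simple_mem_tracker(std::string label)
      : _label(std::move(label)) {}

    // Record an allocation of `bytes` against this tracker.
    void allocate(int64_t bytes) { _consumption += bytes; }

    // Record a deallocation; the counter should never go negative.
    void deallocate(int64_t bytes) {
        assert(bytes <= _consumption);
        _consumption -= bytes;
    }

    // Create a child whose usage is included in this tracker's total.
    std::shared_ptr<simple_mem_tracker> create_child(std::string label) {
        auto child = std::make_shared<simple_mem_tracker>(std::move(label));
        _children.push_back(child);
        return child;
    }

    // Bytes recorded directly on this tracker plus all of its descendants.
    int64_t consumption() const {
        int64_t total = _consumption;
        for (const auto& c : _children) {
            total += c->consumption();
        }
        return total;
    }

    const std::string& label() const { return _label; }

private:
    std::string _label;
    int64_t _consumption{0};
    std::vector<std::shared_ptr<simple_mem_tracker>> _children;
};
```

With no children, as in the manifest case discussed here, each allocation costs one integer addition and reading the total is a single load plus an empty loop.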
@@ -502,6 +502,7 @@ void ntp_archiver::update_probe() {
    const auto& man = manifest();

    _probe->segments_in_manifest(man.size());
    _probe->segments_metadata_bytes(man.segments_metadata_bytes());
Looks like update_probe() is only called from the segment upload loop: this will give us zero stats on followers, and stale stats on nodes that used to be leaders, right?
I believe metrics can be constructed with a lambda for fetching gauges on demand, which might be more robust here than trying to remember to call update_probe in all the right places.
(Clearly this issue already exists for segments_in_manifest too)
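The difference between the two styles can be illustrated with a small sketch. Everything here is hypothetical: `gauge_registry`, `fake_manifest`, and `push_probe` are illustrative names and not Seastar's or Redpanda's metrics API. The point is only that a gauge backed by a lambda reads the source of truth at scrape time, while a cached value is only as fresh as the last explicit update.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical miniature metrics registry; not Seastar's metrics API.
class gauge_registry {
public:
    // Register a gauge whose value is computed each time it is scraped.
    void add_gauge(std::string name, std::function<uint64_t()> fetch) {
        assert(fetch); // an empty callback could not be scraped
        _gauges[std::move(name)] = std::move(fetch);
    }

    // Invoke the callback registered under `name` and return its value.
    uint64_t scrape(const std::string& name) const {
        return _gauges.at(name)();
    }

private:
    std::map<std::string, std::function<uint64_t()>> _gauges;
};

// Hypothetical stand-in for the manifest that owns the real value.
struct fake_manifest {
    uint64_t metadata_bytes{0};
};

// Push style: the probe caches a copy that must be refreshed explicitly.
struct push_probe {
    uint64_t cached_bytes{0};
    void update(const fake_manifest& m) { cached_bytes = m.metadata_bytes; }
};
```

Registering `[&man] { return man.metadata_bytes; }` directly keeps the gauge current without anyone calling `update()`, whereas a gauge that reads `push_probe::cached_bytes` goes stale as soon as the update loop stops running, which is the follower and former-leader case raised above.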
Nice catch. I moved the metric to accommodate. Will put up a separate PR to address segments_in_manifest and co.
This commit plumbs a `util::mem_tracker` into the `partition_manifest` so it may allocate its `segment_map` with a `tracking_allocator`. This will be useful to expose a metric for memory consumed by segments. This adds some unit tests to exemplify how different manifest APIs affect tracked memory. While the results are strongly coupled with the underlying implementation, they're still worth having around to better understand how our memory footprint responds to different events, until otherwise changed.
This adds the newly introduced tracked memory from the segments metadata in the manifest as a metric.
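As a rough illustration of the idea in these commits, the sketch below shows how an allocator that records sizes can be handed to a standard map so that the map's node allocations show up in a shared counter. It is a minimal sketch under stated assumptions: `byte_counter`, `counting_allocator`, and `tracked_map` are hypothetical names, and the real `tracking_allocator` and `segment_map` in the codebase may differ in both interface and container type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Hypothetical byte counter shared by all copies of the allocator.
struct byte_counter {
    int64_t bytes{0};
};

// Minimal allocator that forwards to std::allocator and records sizes.
template<typename T>
class counting_allocator {
public:
    using value_type = T;

    explicit counting_allocator(std::shared_ptr<byte_counter> c)
      : _counter(std::move(c)) {}

    // Rebinding constructor: containers allocate node types, not T itself.
    template<typename U>
    counting_allocator(const counting_allocator<U>& other)
      : _counter(other.counter()) {}

    T* allocate(std::size_t n) {
        _counter->bytes += static_cast<int64_t>(n * sizeof(T));
        return std::allocator<T>{}.allocate(n);
    }

    void deallocate(T* p, std::size_t n) {
        _counter->bytes -= static_cast<int64_t>(n * sizeof(T));
        assert(_counter->bytes >= 0);
        std::allocator<T>{}.deallocate(p, n);
    }

    const std::shared_ptr<byte_counter>& counter() const { return _counter; }

private:
    std::shared_ptr<byte_counter> _counter;
};

// Allocators compare equal when they share the same counter.
template<typename T, typename U>
bool operator==(const counting_allocator<T>& a, const counting_allocator<U>& b) {
    return a.counter() == b.counter();
}

template<typename T, typename U>
bool operator!=(const counting_allocator<T>& a, const counting_allocator<U>& b) {
    return !(a == b);
}

// Illustrative map type whose node allocations are counted.
using tracked_map = std::map<
  int64_t,
  int64_t,
  std::less<int64_t>,
  counting_allocator<std::pair<const int64_t, int64_t>>>;
```

The exact byte counts depend on the standard library's node layout, which is the same coupling to the underlying implementation that the commit message calls out for the unit tests.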
36b4d52 to 9030384
Having this as an internal metric makes sense for near-term testing, although we're probably going to need to revisit it, as the internal metrics endpoint becomes unmanageably large when there are more than a few thousand partitions. Ultimately it would be good to reinstate this as a public metric, suitably reduced in cardinality. If we could augment the mem tracker to accumulate some per-shard totals as well, those would be great for public metrics, so that the user could basically get a per-shard breakdown of which subsystem is eating all the memory.
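To make the cardinality point concrete, here is a small sketch of collapsing per-partition samples into one total per shard. The `partition_sample` struct and `aggregate_by_shard` function are hypothetical and exist only for illustration; nothing here reflects how the mem tracker or the metrics layer would actually do this.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-partition sample; field names are illustrative only.
struct partition_sample {
    unsigned shard;
    std::string topic;
    int partition;
    uint64_t bytes;
};

// Collapse per-partition samples into a single series per shard, so the
// number of exported series scales with shards rather than partitions.
std::map<unsigned, uint64_t>
aggregate_by_shard(const std::vector<partition_sample>& samples) {
    std::map<unsigned, uint64_t> totals;
    for (const auto& s : samples) {
        totals[s.shard] += s.bytes;
    }
    return totals;
}
```

With a few thousand partitions spread over a handful of shards, this turns thousands of series into a handful while still showing which shard is carrying the memory.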
This commit plumbs a `util::mem_tracker` into the `partition_manifest` so it may allocate its `segment_map` with a `tracking_allocator`. This will be useful to expose a metric for memory consumed by segments.
This adds some unit tests to exemplify how different manifest APIs affect tracked memory. While the results are strongly coupled with the underlying implementation, they're still worth having around to better understand how our memory footprint responds to different events, until otherwise changed.
This will also be useful as a metric with which to gauge further improvements to the manifest's in-memory format (see #7898)
Backports Required
UX Changes
Release Notes
Improvements
A new metric, `redpanda_cloud_storage_segments_metadata_bytes`, is added that tracks the amount of memory consumed by segment metadata managed by the remote partition.