-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add partition aggregation to some metrics #15966
Add partition aggregation to some metrics #15966
Conversation
Add a comment explaining that compaction_ratio metric cannot be aggregated as it is a ratio, for which a sum is meaningless.
When metrics aggregation is turned on, we want to aggregate away the partition labels on most metrics: this wasn't occurring in the consensus object for two metrics: leader_for and reconfiguration changes in progress. This change enable aggregation on the partition label for these metrics if aggregation is turned on. Issue redpanda-data#15811. Issue redpanda-data/core-internal#677.
I think I'll tap @StephanDollberg as the primary reviewer but it would be nice if @mmaslankaprv could take a look (or nominate somone) as it touches a fair number of consensus metrics as well as @dotnwat for the "partition probe" metrics which I think are sort of a mix of enterprise/storage things. The main thing to check is if any of the metrics are absolutely critical for incidents and diagnosis with their "partition" label intact. If so, we can try to keep them. Most of the numbers & analysis can be found over here: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but will defer to Michal/Noah for which ones they will want to keep.
src/v/cluster/partition_probe.cc
Outdated
// aggregate any labels since aggregation does not make sense for "leader | ||
// ID" values. | ||
_metrics.add_group( | ||
prometheus_sanitize::metrics_name("cluster:partition"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make "cluster:partition" a constant/variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made it a namespace-scope const global.
20a34ab
to
563bd3e
Compare
When metrics aggregation is turned on, we want to aggregate away the partition labels on most metrics: this wasn't occurring in the partition probe object. This change enable aggregation on the partition label for almost all metrics in the partition problem, with only two excluded: vectorized_cluster_partition_leader_id vectorized_cluster_partition_under_replicated_replicas Issue redpanda-data#15811. Issue redpanda-data/core-internal#677.
When metrics aggregation is turned on, we want to aggregate away the partition labels on most metrics: this wasn't occurring in the tx metrics. This change enable aggregation on the partition label for almost all three metrics in this probe. Issue redpanda-data#15811. Issue redpanda-data/core-internal#677.
563bd3e
to
65c0a89
Compare
new failures in https://buildkite.com/redpanda/redpanda/builds/43500#018cdb81-ad48-42ab-9ed0-eb02abcf835d:
|
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/43500#018cdb93-0234-4c59-a951-a74a978e38e3 |
@@ -153,13 +164,13 @@ void replicated_partition_probe::setup_internal_metrics(const model::ntp& ntp) { | |||
labels), | |||
}, | |||
{}, | |||
{sm::shard_label}); | |||
{sm::shard_label, partition_label}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also aggregate offset
metrics like high_watermark, end_offset, etc. I think in this case the aggregation has no point, we may either remove those metrics or consider not aggregating them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mmaslankaprv - I was thinking that it still makes sense in that offsets are also a "count of records" in the partition, so after aggregation they will be a count of records in the topic. E.g., end_offset - start_offset is roughly the record count (if we ignore that there may be non-Kafka records in there too).
So maybe at least one of these may make sense to keep. For debugging issues, in your experience which are the most valuable offsets? Note that we do keep redpanda_kafka_max_offset
(a true "kafka" offset, i.e., it uses the offset translator) unaggregated on the public metrics side.
CI failure looks like: #15117 |
Let me know if this needs force-merging. |
Will do, it's still under review ATM. |
/backport v23.3.x |
/backport v23.2.x |
Failed to create a backport PR to v23.2.x branch. I tried:
|
Oops! Something went wrong. |
/backport v23.3.x |
Oops! Something went wrong. |
/backport v23.3.x |
Oops! Something went wrong. |
This series adds partition label aggregation to some metrics that were missing it.
When metrics aggregation is turned on, we want to aggregate away the partition labels on most metrics: this wasn't occurring in some cases, perhaps due to oversight (this logic needs to be applied at each metrics registration site), or because it was believed that the metrics were not suitable for aggregation.
This change enable aggregation on the partition label for most metrics, leaving 3+1 unaggregated as detailed in this comment:
https://github.com/redpanda-data/core-internal/issues/677#issuecomment-1879103823
It affects only the internal
/metrics
endpoint.Fixes #15811.
Fixes redpanda-data/core-internal#677.
Backports Required
Release Notes
Bug Fixes