
Add partition aggregation to some metrics #15966

Merged

Conversation

travisdowns (Member)

This series adds partition label aggregation to some metrics that were missing it.

When metrics aggregation is turned on, we want to aggregate away the partition labels on most metrics: this wasn't occurring in some cases, perhaps due to oversight (this logic needs to be applied at each metrics registration site), or because it was believed that the metrics were not suitable for aggregation.

This change enables aggregation on the partition label for most metrics, leaving 3+1 unaggregated, as detailed in this comment:

https://github.com/redpanda-data/core-internal/issues/677#issuecomment-1879103823

It affects only the internal /metrics endpoint.
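
The registration-site pattern involved looks roughly like the following (a minimal sketch assuming Seastar's metrics API; the probe class, the gauge, and its callback are hypothetical illustrations, not copied from this diff):

#include <seastar/core/metrics.hh>
#include <seastar/core/metrics_registration.hh>

namespace sm = seastar::metrics;

// Hypothetical probe, for illustration only.
class example_probe {
public:
    void setup_metrics() {
        auto partition_label = sm::label("partition");

        // When aggregation is enabled, list the labels to sum away; an
        // empty list keeps one series per shard and partition.
        const auto aggregate_labels
          = config::shard_local_cfg().aggregate_metrics()
              ? std::vector<sm::label>{sm::shard_label, partition_label}
              : std::vector<sm::label>{};

        _metrics.add_group(
          prometheus_sanitize::metrics_name("cluster:partition"),
          {sm::make_gauge(
             "records_produced", // hypothetical metric name
             [this] { return _records; },
             sm::description("Example: records produced"))
             .aggregate(aggregate_labels)});
    }

private:
    uint64_t _records{0};
    sm::metric_groups _metrics;
};

Because the aggregate-label list is built separately at each registration site, any site that omits partition_label keeps per-partition series even when aggregation is on, which is the class of omission this series fixes.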

Fixes #15811.
Fixes redpanda-data/core-internal#677.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Bug Fixes

  • Several additional metrics will have their "partition" label aggregated away (i.e., into a single series per remaining label set with no partition label, whose value is the sum of all input series with the same label set and different partition labels). This is already the default behavior for most metrics, but this change extends it to almost all remaining metrics.
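    For example (illustrative values only): the three series records{topic="t",partition="0"} = 5, records{topic="t",partition="1"} = 7, and records{topic="t",partition="2"} = 3 would be replaced by the single series records{topic="t"} = 15.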

Add a comment explaining that compaction_ratio metric cannot be
aggregated as it is a ratio, for which a sum is meaningless.
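
(An illustration, not from the commit: if two partitions have compaction ratios 0.5 and 0.7, the summed series would read 1.2, which is not a meaningful ratio of anything.)
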
When metrics aggregation is turned on, we want to aggregate away the
partition labels on most metrics: this wasn't occurring in the
consensus object for two metrics: leader_for and reconfiguration
changes in progress.

This change enables aggregation on the partition label for these metrics
when aggregation is turned on.

Issue redpanda-data#15811.
Issue redpanda-data/core-internal#677.
@travisdowns (Member Author)

I think I'll tap @StephanDollberg as the primary reviewer, but it would be nice if @mmaslankaprv could take a look (or nominate someone), since it touches a fair number of consensus metrics, and @dotnwat for the "partition probe" metrics, which I think are a mix of enterprise/storage things.

The main thing to check is if any of the metrics are absolutely critical for incidents and diagnosis with their "partition" label intact. If so, we can try to keep them.

Most of the numbers & analysis can be found over here:

https://github.com/redpanda-data/core-internal/issues/677

StephanDollberg previously approved these changes Jan 5, 2024

@StephanDollberg (Member) left a comment:

Looks good but will defer to Michal/Noah for which ones they will want to keep.

// aggregate any labels since aggregation does not make sense for "leader
// ID" values.
_metrics.add_group(
prometheus_sanitize::metrics_name("cluster:partition"),
Member:

Maybe make "cluster:partition" a constant/variable.

@travisdowns (Member Author):

Made it a namespace-scope const global.
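
A minimal sketch of that shape (the identifier name and placement are guesses, not taken from the merged diff):

namespace {
// Shared group name, so the string cannot drift between call sites.
const auto group_name = prometheus_sanitize::metrics_name(
  "cluster:partition");
} // namespace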

Review thread on src/v/cluster/partition_probe.cc (outdated, resolved).
When metrics aggregation is turned on, we want to aggregate away the
partition labels on most metrics: this wasn't occurring in the
partition probe object.

This change enables aggregation on the partition label for almost all
metrics in the partition probe, with only two excluded:

vectorized_cluster_partition_leader_id
vectorized_cluster_partition_under_replicated_replicas

Issue redpanda-data#15811.
Issue redpanda-data/core-internal#677.
When metrics aggregation is turned on, we want to aggregate away the
partition labels on most metrics: this wasn't occurring in the
tx metrics.

This change enables aggregation on the partition label for all three
metrics in this probe.

Issue redpanda-data#15811.
Issue redpanda-data/core-internal#677.
@travisdowns force-pushed the td-677-15811-metrics-reduction branch from 563bd3e to 65c0a89 on January 5, 2024 20:33
@vbotbuildovich (Collaborator)

new failures in https://buildkite.com/redpanda/redpanda/builds/43500#018cdb81-ad48-42ab-9ed0-eb02abcf835d:

"rptest.tests.cluster_config_test.ClusterConfigAliasTest.test_aliasing.prop_set=PropertyAliasData.primary_name=.cloud_storage_graceful_transfer_timeout_ms.aliased_name=.cloud_storage_graceful_transfer_timeout.redpanda_version=.23.2.test_values=.1234.1235.1236.expect_restart=False"


@@ -153,13 +164,13 @@ void replicated_partition_probe::setup_internal_metrics(const model::ntp& ntp) {
       labels),
     },
     {},
-    {sm::shard_label});
+    {sm::shard_label, partition_label});
@mmaslankaprv (Member):

This will also aggregate offset metrics like high_watermark, end_offset, etc. I think in this case the aggregation has no point, we may either remove those metrics or consider not aggregating them

@travisdowns (Member Author):

@mmaslankaprv - I was thinking that it still makes sense in that offsets are also a "count of records" in the partition, so after aggregation they will be a count of records in the topic. E.g., end_offset - start_offset is roughly the record count (if we ignore that there may be non-Kafka records in there too).

So maybe at least one of these may make sense to keep. For debugging issues, in your experience which are the most valuable offsets? Note that we do keep redpanda_kafka_max_offset (a true "kafka" offset, i.e., it uses the offset translator) unaggregated on the public metrics side.
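
To make that concrete with made-up numbers: partitions with offset ranges [0, 100), [0, 250), and [50, 150) hold roughly 100 + 250 + 100 = 450 records in total, and the aggregated series agree: sum(end_offset) − sum(start_offset) = 500 − 50 = 450.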

@travisdowns (Member Author)

CI failure looks like: #15117

@piyushredpanda (Contributor)

CI failure looks like: #15117

Let me know if this needs force-merging.

@travisdowns (Member Author)

Let me know if this needs force-merging.

Will do, it's still under review ATM.

@mmaslankaprv self-requested a review on January 12, 2024 16:59
@travisdowns merged commit 750bbdb into redpanda-data:dev on Jan 12, 2024; 19 checks passed.
@vbotbuildovich (Collaborator)

/backport v23.3.x

@vbotbuildovich (Collaborator)

/backport v23.2.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-15966-v23.2.x-239 remotes/upstream/v23.2.x
git cherry-pick -x 1ab66f971885835d590c46dca1795f3567e81e94 04b15356b894750668f036fd31ca81611f8ea9f5 27e6e2b3f5b78a3a1e2c1094d9e195c99070711e 65c0a896008fec511b224e484201794bf3fb7680

Workflow run logs.

@vbotbuildovich (Collaborator)

Oops! Something went wrong.

Workflow run logs.

@savex (Contributor) commented Jan 12, 2024

/backport v23.3.x

@vbotbuildovich (Collaborator)

Oops! Something went wrong.

Workflow run logs.

@travisdowns (Member Author)

/backport v23.3.x

@vbotbuildovich (Collaborator)

Oops! Something went wrong.

Workflow run logs.
