
Add partition aggregation to some metrics #15966

Merged

Conversation

travisdowns (Member)

This series adds partition label aggregation to some metrics that were missing it.

When metrics aggregation is turned on, we want to aggregate away the partition labels on most metrics: this wasn't occurring in some cases, perhaps due to oversight (this logic needs to be applied at each metrics registration site), or because it was believed that the metrics were not suitable for aggregation.

This change enables aggregation on the partition label for most metrics, leaving 3+1 unaggregated, as detailed in this comment:

https://github.com/redpanda-data/core-internal/issues/677#issuecomment-1879103823

It affects only the internal /metrics endpoint.
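
The registration-site pattern involved looks roughly like the following (a minimal sketch assuming Seastar's metrics API; the probe class, the gauge, and its callback are hypothetical illustrations, not copied from this diff):

#include <seastar/core/metrics.hh>
#include <seastar/core/metrics_registration.hh>

namespace sm = seastar::metrics;

// Hypothetical probe, for illustration only.
class example_probe {
public:
    void setup_metrics() {
        auto partition_label = sm::label("partition");

        // When aggregation is enabled, list the labels to sum away; an
        // empty list keeps one series per shard and partition.
        const auto aggregate_labels
          = config::shard_local_cfg().aggregate_metrics()
              ? std::vector<sm::label>{sm::shard_label, partition_label}
              : std::vector<sm::label>{};

        _metrics.add_group(
          prometheus_sanitize::metrics_name("cluster:partition"),
          {sm::make_gauge(
             "records_produced", // hypothetical metric name
             [this] { return _records; },
             sm::description("Example: records produced"))
             .aggregate(aggregate_labels)});
    }

private:
    uint64_t _records{0};
    sm::metric_groups _metrics;
};

Because the aggregate-label list is built separately at each registration site, any site that omits partition_label keeps per-partition series even when aggregation is on, which is the class of omission this series fixes.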

Fixes #15811.
Fixes redpanda-data/core-internal#677.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Bug Fixes

  • Several additional metrics will have their "partition" label aggregated away (i.e., into a single series per remaining label set with no partition label, whose value is the sum of all input series with the same label set and different partition labels). This is already the default behavior for most metrics, but this change extends it to almost all remaining metrics.
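    For example (illustrative values only): the three series records{topic="t",partition="0"} = 5, records{topic="t",partition="1"} = 7, and records{topic="t",partition="2"} = 3 would be replaced by the single series records{topic="t"} = 15.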

Add a comment explaining that compaction_ratio metric cannot be
aggregated as it is a ratio, for which a sum is meaningless.
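
(An illustration, not from the commit: if two partitions have compaction ratios 0.5 and 0.7, the summed series would read 1.2, which is not a meaningful ratio of anything.)
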
When metrics aggregation is turned on, we want to aggregate away the
partition labels on most metrics: this wasn't occurring in the
consensus object for two metrics: leader_for and reconfiguration
changes in progress.

This change enables aggregation on the partition label for these metrics
when aggregation is turned on.

Issue redpanda-data#15811.
Issue redpanda-data/core-internal#677.
@travisdowns (Member Author)

I think I'll tap @StephanDollberg as the primary reviewer, but it would be nice if @mmaslankaprv could take a look (or nominate someone), since it touches a fair number of consensus metrics, and @dotnwat for the "partition probe" metrics, which I think are a mix of enterprise/storage things.

The main thing to check is if any of the metrics are absolutely critical for incidents and diagnosis with their "partition" label intact. If so, we can try to keep them.

Most of the numbers & analysis can be found over here:

https://github.com/redpanda-data/core-internal/issues/677

StephanDollberg previously approved these changes Jan 5, 2024

@StephanDollberg (Member) left a comment:

Looks good but will defer to Michal/Noah for which ones they will want to keep.

// aggregate any labels since aggregation does not make sense for "leader
// ID" values.
_metrics.add_group(
prometheus_sanitize::metrics_name("cluster:partition"),
Member:

Maybe make "cluster:partition" a constant/variable.

@travisdowns (Member Author):

Made it a namespace-scope const global.
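
A minimal sketch of that shape (the identifier name and placement are guesses, not taken from the merged diff):

namespace {
// Shared group name, so the string cannot drift between call sites.
const auto group_name = prometheus_sanitize::metrics_name(
  "cluster:partition");
} // namespace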

Review thread on src/v/cluster/partition_probe.cc (outdated, resolved).
When metrics aggregation is turned on, we want to aggregate away the
partition labels on most metrics: this wasn't occurring in the
partition probe object.

This change enables aggregation on the partition label for almost all
metrics in the partition probe, with only two excluded:

vectorized_cluster_partition_leader_id
vectorized_cluster_partition_under_replicated_replicas

Issue redpanda-data#15811.
Issue redpanda-data/core-internal#677.
When metrics aggregation is turned on, we want to aggregate away the
partition labels on most metrics: this wasn't occurring in the
tx metrics.

This change enables aggregation on the partition label for all three
metrics in this probe.

Issue redpanda-data#15811.
Issue redpanda-data/core-internal#677.
@travisdowns force-pushed the td-677-15811-metrics-reduction branch from 563bd3e to 65c0a89 on January 5, 2024 20:33
@vbotbuildovich (Collaborator)

new failures in https://buildkite.com/redpanda/redpanda/builds/43500#018cdb81-ad48-42ab-9ed0-eb02abcf835d:

"rptest.tests.cluster_config_test.ClusterConfigAliasTest.test_aliasing.prop_set=PropertyAliasData.primary_name=.cloud_storage_graceful_transfer_timeout_ms.aliased_name=.cloud_storage_graceful_transfer_timeout.redpanda_version=.23.2.test_values=.1234.1235.1236.expect_restart=False"


@@ -153,13 +164,13 @@ void replicated_partition_probe::setup_internal_metrics(const model::ntp& ntp) {
       labels),
     },
     {},
-    {sm::shard_label});
+    {sm::shard_label, partition_label});
@mmaslankaprv (Member):

This will also aggregate offset metrics like high_watermark, end_offset, etc. I think in this case the aggregation has no point, we may either remove those metrics or consider not aggregating them

@travisdowns (Member Author):

@mmaslankaprv - I was thinking that it still makes sense in that offsets are also a "count of records" in the partition, so after aggregation they will be a count of records in the topic. E.g., end_offset - start_offset is roughly the record count (if we ignore that there may be non-Kafka records in there too).

So maybe at least one of these may make sense to keep. For debugging issues, in your experience which are the most valuable offsets? Note that we do keep redpanda_kafka_max_offset (a true "kafka" offset, i.e., it uses the offset translator) unaggregated on the public metrics side.
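
To make that concrete with made-up numbers: partitions with offset ranges [0, 100), [0, 250), and [50, 150) hold roughly 100 + 250 + 100 = 450 records in total, and the aggregated series agree: sum(end_offset) − sum(start_offset) = 500 − 50 = 450.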

@travisdowns (Member Author)

CI failure looks like: #15117

@piyushredpanda (Contributor)

CI failure looks like: #15117

Let me know if this needs force-merging.

@travisdowns (Member Author)

Let me know if this needs force-merging.

Will do, it's still under review ATM.

@mmaslankaprv self-requested a review on January 12, 2024 16:59
@travisdowns merged commit 750bbdb into redpanda-data:dev on Jan 12, 2024; 19 checks passed.
@vbotbuildovich (Collaborator)

/backport v23.3.x

@vbotbuildovich (Collaborator)

/backport v23.2.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-15966-v23.2.x-239 remotes/upstream/v23.2.x
git cherry-pick -x 1ab66f971885835d590c46dca1795f3567e81e94 04b15356b894750668f036fd31ca81611f8ea9f5 27e6e2b3f5b78a3a1e2c1094d9e195c99070711e 65c0a896008fec511b224e484201794bf3fb7680

Workflow run logs.

@vbotbuildovich (Collaborator)

Oops! Something went wrong.

Workflow run logs.

@savex (Contributor) commented Jan 12, 2024

/backport v23.3.x

@vbotbuildovich (Collaborator)

Oops! Something went wrong.

Workflow run logs.

@travisdowns (Member Author)

/backport v23.3.x

@vbotbuildovich (Collaborator)

Oops! Something went wrong.

Workflow run logs.
