Add metrics aggregation capabilities #26

VladLazar · 2022-06-20T16:57:46Z

This PR contains a set of patches from the seastar mailing list (patch set and repo). They add aggregation
support to the seastar metrics subsystem. This PR used to be based on an older version of the upstream patch set,
but Ben fed our findings back upstream and the new version matches our needs as is.

Mailing List Cover Letter

Motivation:
Histograms are the Prometheus killer. They are big to send over the
network and are big to store in Prometheus.
Histograms' size is always an issue, but with Seastar, it becomes even trickier.
Typically, each shard would collect its histograms, so the overall data
is multiplied by the number of shards.

This series addresses the need to report quantile information like latency
without generating massive metrics reports.

A summary is a Prometheus metric type that holds a quantile summary (e.g.
p95, p99).

The downside of summaries is that they cannot be aggregated, which is
needed for a distributed system (e.g., to calculate the p99 latency of a cluster).
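
To make the aggregation problem concrete, here is a minimal standalone sketch. It is not seastar code; the helper names `percentile`, `average_of_percentiles` and `percentile_of_pooled` are hypothetical and exist only for this illustration. It compares averaging per-shard percentiles with taking the percentile of the pooled samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. p must be in [1, 100].
double percentile(std::vector<double> samples, unsigned p) {
    assert(!samples.empty() && p >= 1 && p <= 100);
    std::sort(samples.begin(), samples.end());
    std::size_t rank = (p * samples.size() + 99) / 100; // ceil(p * n / 100)
    return samples[rank - 1];
}

// Naive aggregation: average the per-shard percentiles. Shown for contrast only.
double average_of_percentiles(const std::vector<std::vector<double>>& shards, unsigned p) {
    assert(!shards.empty());
    double total = 0;
    for (const auto& shard : shards) {
        total += percentile(shard, p);
    }
    return total / shards.size();
}

// Reference result: pool the raw samples of all shards, then take the percentile.
double percentile_of_pooled(const std::vector<std::vector<double>>& shards, unsigned p) {
    std::vector<double> all;
    for (const auto& shard : shards) {
        all.insert(all.end(), shard.begin(), shard.end());
    }
    return percentile(std::move(all), p);
}
```

With one busy shard holding 100 samples of 1 ms and one idle shard holding just two samples (1 ms and 500 ms), the averaged p99 is 250.5 ms while the p99 of the pooled samples is 1 ms. Bucket counts, by contrast, can simply be added across shards, which is why aggregation is done on histograms.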

The series adds four tools for Prometheus performance:

1. Add summaries.
2. Optionally, remove empty metrics. It's common to register metrics for
   optional services. It is now possible to mark those metrics as
   skip_when_empty and they will not be reported if they were never used.
3. Allow aggregating metrics. The most common case is reporting per-node
   metrics instead of per-shard metrics. For example, for multi-node quantile
   calculation, we need a per-node histogram. It is now possible to mark a
   registered metric for aggregation. The metrics layer will aggregate this
   metric based on a list of labels. (Typically, this will be by shard,
   but it could be any other combination of labels.)
4. Reuse the stringstream instead of recreating an object on each
   iteration.
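
As a rough illustration of aggregating by a list of labels, the sketch below sums counter samples after removing the aggregated labels from each sample's label set. It is a standalone model, not the seastar implementation; `label_set`, `counter_sample` and `aggregate_counters` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Label name -> label value, e.g. {"shard": "0", "op": "read"}.
using label_set = std::map<std::string, std::string>;

struct counter_sample {
    label_set labels;
    double value;
};

// Sum all samples that become identical once the labels named in
// aggregate_labels are removed. Aggregating over "shard" therefore turns
// per-shard series into one per-node series per remaining label combination.
std::map<label_set, double> aggregate_counters(
        const std::vector<counter_sample>& samples,
        const std::vector<std::string>& aggregate_labels) {
    std::map<label_set, double> out;
    for (const auto& s : samples) {
        assert(s.value >= 0); // counters never go negative
        label_set key = s.labels;
        for (const auto& name : aggregate_labels) {
            key.erase(name);
        }
        out[key] += s.value;
    }
    return out;
}
```

Aggregating over `shard` in this model collapses the `read` samples of shard 0 and shard 1 into a single `read` series, while other labels such as `op` keep the series apart.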

force push:

split into smaller commits
use sm::label strong type instead of std::string for specifying aggregation labels
reuse string stream when converting aggregated metrics to text

force push:

Remove aggregation_labels argument from metric creation functions (make_<metric_type>).

force push:

Use the write_counter function in the commit it's introduced

force push:

Use strong label type sm::label in setter of aggregation labels

force push:

Amnon's latest version of the patch set includes everything we need and is functionally
equivalent to what we had on this branch before (it also allows for skipping reporting of
empty metrics configurably).

BenPope

This would be easier to review if the last commit was split up a bit and the comments improved.

src/core/prometheus.cc

include/seastar/core/metrics.hh

BenPope · 2022-06-21T11:19:12Z

The patchset brought in optimises histograms by not outputting them if they aren't used. I wonder if we always want that. Maybe it should be configurable.

VladLazar · 2022-06-21T11:46:08Z

> The patchset brought in optimises histograms by not outputting them if they aren't used. I wonder if we always want that. Maybe it should be configurable.

Didn't you remove that in this commit (the value.is_empty() check)?

BenPope · 2022-06-21T11:49:23Z

> > The patchset brought in optimises histograms by not outputting them if they aren't used. I wonder if we always want that. Maybe it should be configurable.
>
> Didn't you remove that in this commit (the value.is_empty() check)?

Looks like it. But it was also printing the headers twice. Let's consider this later.

VladLazar · 2022-06-21T14:25:24Z

> This would be easier to review if the last commit was split up a bit and the comments improved.

I split it up and expanded on the commit messages in this force push.

include/seastar/core/metrics.hh

src/core/prometheus.cc

This patch adds support for the summary type to the metrics layer. A summary is a different kind of histogram: its buckets are percentiles, so the reporting layer (Prometheus, for example) knows how to report it correctly. Signed-off-by: Amnon Heiman <amnon@scylladb.com>

This patch adds a missing part of histogram aggregation: the sum and count need to be aggregated as well. Signed-off-by: Amnon Heiman <amnon@scylladb.com>
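
A minimal sketch of what a complete histogram merge involves, assuming every shard uses the same bucket bounds. The `histogram` struct and `merge_into` function are hypothetical stand-ins for this example, not the seastar types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct histogram {
    std::vector<std::uint64_t> buckets; // count per bucket, same bounds on every shard
    std::uint64_t count = 0;            // total number of samples
    double sum = 0;                     // sum of all sample values
};

// Fold one shard's histogram into the running total. Adding only the buckets
// would leave count and sum describing a single shard, so all three fields
// are accumulated.
void merge_into(histogram& total, const histogram& shard) {
    if (total.buckets.empty()) {
        total.buckets.resize(shard.buckets.size(), 0);
    }
    assert(total.buckets.size() == shard.buckets.size());
    for (std::size_t i = 0; i < shard.buckets.size(); ++i) {
        total.buckets[i] += shard.buckets[i];
    }
    total.count += shard.count;
    total.sum += shard.sum;
}
```

Prometheus exposes `_sum` and `_count` series next to the buckets, so an aggregate without them would be inconsistent with its own bucket data.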

Aggregate labels are a mechanism for reporting aggregated results. Most commonly, they allow reporting one histogram per node instead of one per shard. This patch adds an option to mark a metric with a vector of labels. That vector becomes part of the metric metadata so that the reporting layer can aggregate over it.

Skip when empty means that metrics that are not in use will not be reported. A common scenario is that a user registers a metric that is never used. The most common cases are histograms and summaries, but it can also happen with counters. This patch adds an option to mark a metric with skip_when_empty. When a metric is marked this way and was never used (this applies to histograms, counters and summaries), it will not be reported. Signed-off-by: Amnon Heiman <amnon@scylladb.com>
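
The skip-when-empty rule can be sketched as a simple filter. This is a standalone model with hypothetical names (`metric`, `should_report`, `reported_names`), and it assumes that "never used" can be approximated by a value of zero, which may differ from how seastar decides emptiness.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct metric {
    std::string name;
    std::uint64_t value;  // assumption: 0 stands for "never used"
    bool skip_when_empty; // opt-in flag set at registration time
};

// A metric is dropped from the report only if it opted in AND is empty.
bool should_report(const metric& m) {
    assert(!m.name.empty()); // every registered metric has a name
    return !(m.skip_when_empty && m.value == 0);
}

// Names of the metrics that would appear in the report, in input order.
std::vector<std::string> reported_names(const std::vector<metric>& metrics) {
    std::vector<std::string> names;
    for (const auto& m : metrics) {
        if (should_report(m)) {
            names.push_back(m.name);
        }
    }
    return names;
}
```

Metrics that do not opt in are always reported, even at zero, so existing dashboards keep seeing the series they expect.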

This patch adds several features to Prometheus reporting:
1. Add summary reporting. Summaries are used in seastar to report aggregated percentile information (for example p95 and p99). The main usage is to report a per-shard summary of a latency histogram.
2. Support aggregated metrics. With aggregated metrics, the Prometheus reporting layer aggregates multiple metrics based on labels and reports the result. Usually this is used to report a single latency histogram per node instead of one per shard, but it can be used for counters and gauges as well.
3. Skip empty counters, histograms and summaries. It's common practice to register lots of metrics even if they are not being used. Histograms have a huge effect on performance, so not reporting an empty histogram is a great performance boost both for the application and for the Prometheus server. This is true for summaries and counters as well; marking a metric with skip_when_empty means it will not be reported while it is empty.
4. As an optimization, the stringstream that is used per metric is reused and cleared instead of recreated.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
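
The stringstream reuse mentioned in item 4 follows a standard C++ pattern, sketched below with a hypothetical `render_lines` helper. The actual code in `src/core/prometheus.cc` may be structured differently.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Render one "name value" line per metric, reusing a single stream object.
std::vector<std::string> render_lines(
        const std::vector<std::pair<std::string, std::uint64_t>>& metrics) {
    std::vector<std::string> lines;
    std::ostringstream os; // constructed once, outside the loop
    for (const auto& m : metrics) {
        assert(!m.first.empty()); // a metric line needs a name
        os.str(std::string());    // discard the previous metric's text
        os.clear();               // reset any error state
        os << m.first << ' ' << m.second;
        lines.push_back(os.str());
    }
    return lines;
}
```

Constructing a stream sets up its buffer and locale state, so doing it once per scrape instead of once per metric avoids repeated setup work on a hot path.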

BenPope

I'm happy with this. Sent some minor nits to the mailing list.

VladLazar changed the title ~~Add metrics aggrecation capabilities~~ Add metrics aggregation capabilities Jun 20, 2022

This was referenced Jun 20, 2022

Minimal primary metrics endpoint redpanda-data/redpanda#5165

Merged

Aggregate metrics to reduce cardinality redpanda-data/redpanda#5166

Merged

BenPope requested review from jcsp, BenPope and dotnwat June 20, 2022 18:14

BenPope reviewed Jun 20, 2022

src/core/prometheus.cc (outdated, resolved)

BenPope reviewed Jun 20, 2022

include/seastar/core/metrics.hh (resolved)

VladLazar force-pushed the aggregation-support branch from 57f55f8 to b45c02b Compare June 21, 2022 11:23

BenPope reviewed Jun 22, 2022

include/seastar/core/metrics.hh (resolved)

BenPope requested changes Jun 22, 2022

include/seastar/core/metrics.hh (outdated, resolved)

VladLazar force-pushed the aggregation-support branch from b45c02b to 2db661d Compare June 22, 2022 18:01

VladLazar requested a review from BenPope June 23, 2022 09:39

BenPope reviewed Jun 23, 2022

src/core/prometheus.cc (outdated, resolved)

VladLazar force-pushed the aggregation-support branch from 2db661d to 1729479 Compare June 23, 2022 12:39

BenPope requested a review from mmaslankaprv June 23, 2022 12:48

VladLazar force-pushed the aggregation-support branch from 1729479 to c8de0ae Compare June 23, 2022 14:18

amnonh added 4 commits June 24, 2022 10:34

metrics.cc: missing count and sum when aggregating histograms

893e2de

This patch adds a missing part of histogram aggregation: the sum and count need to be aggregated as well. Signed-off-by: Amnon Heiman <amnon@scylladb.com>

VladLazar force-pushed the aggregation-support branch from c8de0ae to 3f35820 Compare June 24, 2022 09:39

BenPope approved these changes Jun 24, 2022

BenPope merged commit da64789 into redpanda-data:v22.2.x Jun 27, 2022
