Problem
Aggregating percentiles is a hard problem. A percentile marks the point below which a given percentage of observed values fall: the 99th percentile, for example, is the value greater than 99% of the observations. Aggregating already-computed percentiles (e.g. averaging them) is mathematically incorrect, since we'd only be operating on the percentile values themselves and not on the complete set of observations behind each time series.
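To make the incorrectness concrete, here is a small self-contained sketch (plain Java with made-up latency values; nothing here is Heroic code). Averaging two per-series p99 values produces a number that is not the p99 of anything:

```java
import java.util.Arrays;
import java.util.stream.DoubleStream;

public class PercentileAveraging {
    // Nearest-rank percentile over raw observations.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        // Two instances: one mostly fast, one with a heavy tail.
        double[] fast = {10, 11, 12, 13, 14, 15, 16, 17, 18, 500};
        double[] slow = {20, 30, 40, 600, 700, 800, 900, 1000, 1100, 1200};

        double p99Fast = percentile(fast, 99);    // 500
        double p99Slow = percentile(slow, 99);    // 1200
        double average = (p99Fast + p99Slow) / 2; // 850 -- not a p99 of anything

        // The correct p99 requires the complete set of observations.
        double[] combined = DoubleStream.concat(
                Arrays.stream(fast), Arrays.stream(slow)).toArray();
        double p99True = percentile(combined, 99); // 1200

        System.out.printf("avg of p99s = %.0f, true combined p99 = %.0f%n",
                average, p99True);
    }
}
```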
You can, however, aggregate with "group by" plus min/max to group the information and simplify graphs and alerts. For instance, you could look at p75 latency using a max aggregation grouped by site. This gives you the maximum p75 latency in each site across all instances, but it doesn't tell you how widespread a problem is: since you only get the maximum percentile value in each group, it could stem from a single instance or from many.
Suggestion
Having first-class support for an aggregatable data structure other than doubles would go a long way toward solving this problem. Some type of native histogram support would be an ideal choice, assuming we can ingest histograms in a sane way.
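As a rough sketch of what "first-class support" could mean: the essential property is a merge operation that is associative and commutative, so partial results can be combined in any order across instances and shards. The interface below is purely hypothetical, not Heroic's actual API:

```java
// Hypothetical interface -- not Heroic's real API. The key property is that
// merge() is associative and commutative, so distributions can be combined
// in any order and still yield correct quantiles.
public interface Distribution<T extends Distribution<T>> {
    void record(double value);   // ingest one observation
    T merge(T other);            // combine two partial distributions
    double quantile(double q);   // estimate, e.g. q = 0.99 for p99
    byte[] serialize();          // wire/storage format for ingestion
}
```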
@dmichel1 experimented with t-digests and bucketed histograms recently:
The t-digests are really nice since you don't have to worry about setting bucket boundaries yourself. This would require Heroic to support storing and aggregating t-digests.
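For illustration, a minimal sketch of merging per-instance t-digests using Ted Dunning's t-digest library (the com.tdunning:t-digest artifact); merging by replaying centroids is a simplification, and Heroic would still need its own storage format and aggregation support around something like this:

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

public class TDigestMergeDemo {
    // Merge one digest into another by replaying its centroids as
    // weighted points. Lossier than a native merge, but shows the idea.
    static void mergeInto(TDigest target, TDigest source) {
        for (Centroid c : source.centroids()) {
            target.add(c.mean(), c.count());
        }
    }

    public static void main(String[] args) {
        // One digest per instance; no bucket boundaries to configure.
        TDigest instanceA = TDigest.createMergingDigest(100);
        TDigest instanceB = TDigest.createMergingDigest(100);
        for (int i = 0; i < 10_000; i++) {
            instanceA.add(Math.random() * 100);        // uniform latencies
            instanceB.add(Math.random() * 100 + 900);  // one slow instance
        }

        TDigest merged = TDigest.createMergingDigest(100);
        mergeInto(merged, instanceA);
        mergeInto(merged, instanceB);

        System.out.printf("merged p99 ~ %.1f%n", merged.quantile(0.99));
    }
}
```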
The bucketed histograms would be easier to ingest into Heroic, since the data format is the same as it is today; they would only require adding a new aggregation to Heroic.
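A minimal sketch of such an aggregation (plain Java, illustrative only; Heroic's actual data format and aggregation machinery are not shown): merging bucketed histograms is element-wise addition of counts, and a percentile is estimated by interpolating within the bucket where the target rank falls:

```java
import java.util.Arrays;

public class BucketedHistogram {
    final double[] bounds;  // upper bound of each bucket (plus an overflow bucket)
    final long[] counts;

    BucketedHistogram(double[] bounds) {
        this.bounds = bounds;
        this.counts = new long[bounds.length + 1]; // +1 for overflow
    }

    void record(double value) {
        int i = 0;
        while (i < bounds.length && value > bounds[i]) i++;
        counts[i]++;
    }

    // Merging is just element-wise addition -- this is why bucketed
    // histograms aggregate cleanly, as long as boundaries match.
    void merge(BucketedHistogram other) {
        if (!Arrays.equals(bounds, other.bounds)) {
            throw new IllegalArgumentException("bucket boundaries must match");
        }
        for (int i = 0; i < counts.length; i++) counts[i] += other.counts[i];
    }

    // Estimate a quantile by linear interpolation within the target bucket.
    double quantile(double q) {
        long total = Arrays.stream(counts).sum();
        long target = (long) Math.ceil(q * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target) {
                double lo = (i == 0) ? 0 : bounds[i - 1];
                double hi = (i < bounds.length) ? bounds[i] : lo * 2; // crude guess for the open-ended bucket
                double within = (target - (seen - counts[i])) / (double) counts[i];
                return lo + (hi - lo) * within;
            }
        }
        return Double.NaN;
    }

    public static void main(String[] args) {
        double[] bounds = {10, 25, 50, 100, 250, 500, 1000};
        BucketedHistogram a = new BucketedHistogram(bounds);
        BucketedHistogram b = new BucketedHistogram(bounds);
        for (int i = 1; i <= 1000; i++) a.record(i % 100);        // fast instance
        for (int i = 1; i <= 1000; i++) b.record(400 + i % 100);  // slow instance
        a.merge(b);
        System.out.printf("merged p99 ~ %.1f%n", a.quantile(0.99));
    }
}
```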
The distribution metrics in OpenCensus and OpenTelemetry both use the bucketed approach. From an integration perspective it would be easiest to adopt these in Heroic, as opposed to a new format like t-digests, for which we would have to add support to the clients ourselves.
To compute accurate latency percentiles at reasonable speed in a distributed environment, we need a data structure that preserves the actual data distribution. Whether we choose bucketing, data clustering, or a sketch, we will have to send data to Heroic in a format that is currently not supported.
We are not considering OpenCensus because it uses bucketing with fixed bucket boundaries. To get accurate percentiles you need to set up good bucket boundaries, which is only possible if you have prior knowledge of the data distribution.
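To illustrate how much the boundary choice matters, here is a small example reusing the BucketedHistogram sketch above (made-up numbers): both histograms see identical latencies, but the coarse one has no boundary near the real data, so interpolation across its single giant bucket overestimates p99 by an order of magnitude:

```java
// Why bucket boundaries matter: both histograms ingest the same latencies
// (uniform between 10 and 49 ms), but the coarse boundaries force the
// quantile estimate to interpolate across one giant, mostly-empty range.
public class BoundarySensitivity {
    public static void main(String[] args) {
        BucketedHistogram fine =
                new BucketedHistogram(new double[]{10, 20, 30, 40, 50, 100, 1000});
        BucketedHistogram coarse =
                new BucketedHistogram(new double[]{1000});

        for (int i = 0; i < 10_000; i++) {
            double latency = 10 + (i % 40);  // true p99 is ~49 ms
            fine.record(latency);
            coarse.record(latency);
        }

        // Prints roughly: fine p99 ~ 50 ms, coarse p99 ~ 990 ms
        System.out.printf("fine p99 ~ %.0f ms, coarse p99 ~ %.0f ms%n",
                fine.quantile(0.99), coarse.quantile(0.99));
    }
}
```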
Furthermore, OpenCensus is merging into OpenTelemetry, and the two don't share the same distribution interface, so it is not clear what the long-term support model for the OpenCensus distribution type will be.