Support DataDog Distribution metrics #7393

Open
petoalbert opened this issue Oct 5, 2022 · 1 comment
Labels
metrics ZIO Debugging and Monitoring facilities

petoalbert commented Oct 5, 2022

My use-case

As a DataDog user, I want to measure my app's response times across all hosts and see response time percentiles grouped by endpoint. In DataDog, the metric type that can do this is called Distribution: "it is used to represent a global statistical distribution of a set of values calculated across a distributed infrastructure." It uses a special internal representation called DDSketch. Users don't have to specify buckets or percentiles when reporting the metric; they just send raw data to the DataDog server, most commonly through the DataDog agent. As seen from the screenshot, I have to explicitly enable percentile queries (I assume because it increases indexing costs), and I can then select arbitrary percentiles for that metric.

[screenshots]

DataDog agents running on host machines support an extension of the StatsD protocol called DogStatsD, which supports sending values for the distribution metric type. Reporting applications don't need to do any aggregation: it is done by the DataDog agent (and, for distribution metrics, by DataDog itself). The application only sends raw values.
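As a rough illustration of what "sending raw values" means on the wire, here is a minimal sketch that renders a single DogStatsD datagram for a distribution sample, using the `<name>:<value>|d|#<tags>` layout. `DogStatsDFormat` and its `distribution` method are hypothetical names made up for this example; they are not part of zio.metrics or zio-metrics-connectors.

```scala
// Hypothetical helper for illustration only; not an existing ZIO API.
object DogStatsDFormat {

  // Renders one DogStatsD datagram for a distribution sample:
  //   <name>:<value>|d            when there are no tags
  //   <name>:<value>|d|#k1:v1,k2:v2   when tags are present
  // Tags are sorted by key so the output is deterministic.
  def distribution(
      name: String,
      value: Double,
      tags: Map[String, String] = Map.empty
  ): String = {
    val base = s"$name:$value|d"
    if (tags.isEmpty) base
    else
      base + tags.toSeq
        .sortBy(_._1)
        .map { case (k, v) => s"$k:$v" }
        .mkString("|#", ",", "")
  }
}
```

Each observed value becomes its own datagram, so the application keeps no buckets or quantiles. A real client would still have to send these strings to the agent (typically over UDP), which this sketch leaves out.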

Is this supported in zio.metrics? No

The two zio.metrics metric types capable of measuring statistical distributions of data are Histogram and Summary, which correspond directly to the Prometheus/OpenMetrics Histogram and Summary types. Of these two, Summary calculates percentiles on the client side, which makes it generally not aggregatable across labels/hosts. Histogram counts observations in buckets, which can be flexibly aggregated on the server side across various dimensions. However, unlike Prometheus, DataDog does not support this representation, even though its internal DDSketch algorithm seems to work in a similar way. As seen from the screenshot, I cannot see percentiles of my data distribution for a metric that is uploaded as a set of gauges with labels specifying bucket counts:

[screenshot]
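To make the point about client-side percentiles concrete, here is a small self-contained sketch with made-up numbers. It compares averaging per-host medians against the median of the combined raw samples. The object name, the sample data, and the nearest-rank percentile definition are all assumptions chosen for the example; this is not how zio.metrics.Summary or DDSketch compute quantiles.

```scala
// Illustrative only: invented data and a simple nearest-rank percentile.
object PercentileAggregation {

  // Nearest-rank percentile over raw samples, for q in (0, 1].
  def percentile(samples: Seq[Double], q: Double): Double = {
    val sorted = samples.sorted
    val rank   = math.ceil(q * sorted.size).toInt.max(1)
    sorted(rank - 1)
  }

  // Host A served 100 requests, almost all of them fast.
  val hostA: Seq[Double] = Seq.fill(99)(10.0) :+ 20.0
  // Host B served only 4 requests, all of them slow.
  val hostB: Seq[Double] = Seq(100.0, 200.0, 300.0, 400.0)

  // Combining percentiles that were already computed per host.
  val averagedP50: Double =
    (percentile(hostA, 0.5) + percentile(hostB, 0.5)) / 2

  // Percentile computed over the raw values from both hosts.
  val globalP50: Double = percentile(hostA ++ hostB, 0.5)
}
```

With this data `averagedP50` is 105.0 while `globalP50` is 10.0, because the per-host medians ignore how many requests each host handled. Only a representation that keeps raw values or mergeable buckets lets the server compute the global figure.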

Proposed solution

The solution I envision would add a new MetricEvent handler to zio-metrics-connectors that uses the DogStatsD protocol, making it able to report distribution metrics. The changes would be the following:

1. Notify metric listener every time a metric is updated

DogStatsD relies on the application to send raw values for each metric type. Instead of collecting them in a MetricState and periodically taking a snapshot of the metric states to process them (as is currently done), we should extend MetricsRegistry so that it can notify listeners as soon as a metric is updated. I also see this mentioned in the ScalaDoc comments of zio.metrics.MetricClient, but the implementation seems to be missing.
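The push-style notification described above could look roughly like the sketch below. It is plain Scala with no ZIO dependency, and every name in it (`ListenerSketch`, `UpdateListener`, `Registry`, `addListener`, `update`) is hypothetical; the actual zio.metrics registry and listener interfaces may be shaped quite differently.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical names throughout; this is not the zio.metrics API.
object ListenerSketch {

  // Receives every raw value as soon as it is recorded.
  trait UpdateListener {
    def onUpdate(metricName: String, value: Double): Unit
  }

  final class Registry {
    private val listeners = ListBuffer.empty[UpdateListener]

    def addListener(listener: UpdateListener): Unit =
      synchronized { listeners += listener }

    // Called on each metric update. Instead of only folding the value
    // into aggregated state for a later snapshot, the raw value is
    // forwarded to all registered listeners immediately.
    def update(metricName: String, value: Double): Unit = {
      val current = synchronized { listeners.toList }
      current.foreach(_.onUpdate(metricName, value))
    }
  }
}
```

A DogStatsD reporter would then be one such listener that turns each update into a datagram, while snapshot-based reporters could keep working from aggregated state.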

2. Use the zio.metrics.Histogram metric type for DataDog distributions

I assume that we don't want any platform-specific metric type in zio.metrics, instead we should have generic types that can work with any metrics reporting platform. Is this assumption true?

zio.metrics.Gauge and zio.metrics.Rate already correspond directly to DataDog metric types, so I wouldn't change these.

zio.metrics.Summary is actually quite similar to what's called Histogram in DataDog. It is used for calculating a set of percentiles on the client side (calculating them in the app or in the DataDog agent both count as client-side).

Since the gauges produced by zio.metrics.Histogram are currently not interpreted correctly by DataDog, I believe we should use that type for DataDog's distribution metric. Furthermore, the use case of this metric is the same as that of DataDog's distribution: it is used to represent a global statistical distribution of a set of values calculated across a distributed infrastructure.

To send distributions, the DogStatsD client would ignore the details of MetricState.Histogram, as it would only send raw values to the DataDog agent. This means that DataDog users would be able to create a histogram like this and still use it properly:

// We send raw data to the datadog agent so we can ignore internal state
Metric.histogram("response_time", Boundaries(Chunk.empty))

Note that this can cause compatibility issues if a user tries to switch metric platforms without changing their code, for example when migrating from DataDog to Prometheus. However, as we have seen, there's no one-to-one mapping between Prometheus and DataDog metrics, so I can't think of a scenario where a user wouldn't have to change at least small parts of their code when migrating between metric reporting platforms.

I am happy to create PRs for this issue if you agree with the direction.

@adamgfraser adamgfraser added the metrics ZIO Debugging and Monitoring facilities label Oct 27, 2022
@adamgfraser
Yes, we need to bring back the MetricListener interface to support this.
