My use-case
As a DataDog user, I want to measure my app's response times across all hosts and see response time percentiles grouped by endpoint. In DataDog, the metric type capable of doing that is called Distribution: "it is used to represent a global statistical distribution of a set of values calculated across a distributed infrastructure." It uses a special internal representation called DDSketch. Users don't have to specify buckets or percentiles manually when reporting the metric; they just send raw data to the DataDog server, most commonly through the DataDog agent. As seen in the screenshot, I have to explicitly enable percentile queries (I assume because it increases indexing costs), and I can then select arbitrary percentiles for that metric.
DataDog agents running on host machines support an extension of the StatsD protocol called DogStatsD. The protocol supports sending values for the distribution metric type. Metric-reporting applications don't need to do any aggregation; it is done by the DataDog agent (and by DataDog itself in the case of the distribution metric). The application only sends raw values.
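To make "only sends raw values" concrete, here is a minimal sketch of what reporting one distribution sample over DogStatsD could look like. The datagram layout (`name:value|d|#tag:value`) follows the DogStatsD format, where `d` is the distribution type code. The object name, the helper names, and the default host are assumptions made for illustration, and 8125 is the agent's default DogStatsD port.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of zio-metrics-connectors.
object DogStatsDSketch {
  // Renders one datagram: <name>:<value>|d|#<key>:<value>,<key>:<value>
  // The type code "d" marks the value as a distribution sample.
  def distribution(name: String, value: Double, tags: Map[String, String]): String = {
    val tagPart =
      if (tags.isEmpty) ""
      else tags.toSeq.sortBy(_._1).map { case (k, v) => s"$k:$v" }.mkString("|#", ",", "")
    s"$name:$value|d$tagPart"
  }

  // Fire-and-forget UDP send. Host and port are assumptions about the local agent setup.
  def send(datagram: String, host: String = "localhost", port: Int = 8125): Unit = {
    val socket = new DatagramSocket()
    try {
      val bytes = datagram.getBytes(StandardCharsets.UTF_8)
      socket.send(new DatagramPacket(bytes, bytes.length, InetAddress.getByName(host), port))
    } finally socket.close()
  }
}
```

A call such as `DogStatsDSketch.send(DogStatsDSketch.distribution("response_time", 12.5, Map("endpoint" -> "/users")))` sends a single raw observation; no buckets or percentiles are computed in the application.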
Is this supported in zio.metrics? No
Two of the zio.metrics metric types capable of measuring statistical distributions of data are Histogram and Summary, which correspond directly to the Prometheus/OpenMetrics Histogram and Summary types. Of these two, Summary calculates percentiles on the client side, which makes it generally not aggregatable across labels or hosts. Histogram counts observations in buckets, which can be flexibly aggregated on the server side across various dimensions. However, unlike in Prometheus, this representation is not supported by DataDog, even though its internal DDSketch algorithm seems to work in a similar way. As seen in the screenshot, I cannot see percentiles of my data distribution for a metric that is uploaded as a set of gauges with labels specifying bucket counts:
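The aggregation argument can be illustrated with a small worked example. The host data, bucket bounds, and helper names below are made up for illustration; the percentile uses the simple nearest-rank definition, which is an assumption and not what DataDog or zio.metrics necessarily use internally.

```scala
object AggregationSketch {
  // Nearest-rank percentile for p in (0, 1]: the value at rank ceil(p * n) in sorted order.
  def percentile(values: Seq[Double], p: Double): Double = {
    val sorted = values.sorted
    sorted(math.ceil(p * sorted.size).toInt - 1)
  }

  // Non-cumulative counts: one per upper bound, plus a final overflow bucket.
  def bucketCounts(values: Seq[Double], upperBounds: Seq[Double]): Seq[Int] = {
    val indices = values.map { v =>
      val i = upperBounds.indexWhere(v <= _)
      if (i < 0) upperBounds.size else i
    }
    (0 to upperBounds.size).map(i => indices.count(_ == i))
  }

  // Made-up response times (ms) observed on two hosts.
  val hostA: Seq[Double]  = Seq(10.0, 10.0, 10.0, 10.0)
  val hostB: Seq[Double]  = Seq(100.0, 200.0)
  val bounds: Seq[Double] = Seq(50.0, 150.0)

  // Combining client-side percentiles (Summary-style) does not give the global percentile.
  val averagedMedians: Double = (percentile(hostA, 0.5) + percentile(hostB, 0.5)) / 2
  val globalMedian: Double    = percentile(hostA ++ hostB, 0.5)

  // Bucket counts (Histogram-style) merge exactly by element-wise addition.
  val mergedBuckets: Seq[Int] =
    bucketCounts(hostA, bounds).zip(bucketCounts(hostB, bounds)).map { case (a, b) => a + b }
}
```

Here the averaged per-host medians differ from the median of the combined data, while the summed bucket counts are identical to the bucket counts of the combined data. That is why bucketed or raw-value representations can be aggregated on the server side and pre-computed percentiles cannot.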
Proposed solution
The solution I envision would add a new MetricEvent handler to zio-metrics-connectors that uses the DogStatsD protocol, making it able to report distribution metrics. The changes would be the following:
1. Notify metric listener every time a metric is updated
DogStatsD relies on the application to send raw values for each metric type. Instead of collecting them in a MetricState and periodically taking a snapshot of the metric states and processing them (as is currently done), we should extend MetricsRegistry to be able to notify listeners as soon as a metric is updated. I also see this mentioned in the ScalaDoc comments of zio.metrics.MetricClient; however, the implementation seems to be missing.
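As a rough sketch of the idea, assuming nothing about how MetricsRegistry is actually implemented: a registry that pushes every raw observation to registered listeners, instead of folding it into state that is snapshotted later. `MetricUpdateListener` and `NotifyingRegistry` are hypothetical names and are not part of zio.metrics.

```scala
import java.util.concurrent.CopyOnWriteArrayList

// Hypothetical listener interface; not the actual zio.metrics API.
trait MetricUpdateListener {
  def onUpdate(name: String, value: Double, tags: Map[String, String]): Unit
}

// Hypothetical registry that forwards each raw observation as soon as it is recorded,
// rather than keeping aggregated state for periodic snapshots.
final class NotifyingRegistry {
  // Copy-on-write list so listeners can be added while updates are in flight.
  private val listeners = new CopyOnWriteArrayList[MetricUpdateListener]()

  def addListener(listener: MetricUpdateListener): Unit = {
    listeners.add(listener)
    ()
  }

  def update(name: String, value: Double, tags: Map[String, String] = Map.empty): Unit =
    listeners.forEach(l => l.onUpdate(name, value, tags))
}
```

A DogStatsD handler would then be one such listener that turns each update into a datagram for the agent, while snapshot-based reporters could keep working as they do today.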
2. Use the zio.metrics.Histogram metric type for DataDog distributions
I assume that we don't want any platform-specific metric types in zio.metrics; instead, we should have generic types that can work with any metrics reporting platform. Is this assumption correct?
zio.metrics.Gauge and zio.metrics.Rate already have a direct correspondence with DataDog metric types, so I wouldn't change these.
zio.metrics.Summary is actually quite similar to what's called Histogram in DataDog. It is used for calculating a set of percentiles on the client side (calculating them in the app or in the DataDog agent both count as client-side).
Since the gauges produced by zio.metrics.Histogram are currently not interpreted correctly by DataDog, I believe we should use that type for DataDog's distribution metric. Furthermore, the use case of this metric is the same as that of DataDog's distribution: it is used to represent a global statistical distribution of a set of values calculated across a distributed infrastructure.
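The mapping proposed in this section could be summarised as a small lookup. This is only a sketch: `MetricKind` is a hypothetical, simplified stand-in for the zio.metrics types named above, and the strings are the DogStatsD type codes (`g` for gauge, `h` for histogram, `d` for distribution).

```scala
// Hypothetical, simplified stand-ins for the zio.metrics types discussed above.
sealed trait MetricKind
object MetricKind {
  case object Gauge     extends MetricKind
  case object Summary   extends MetricKind
  case object Histogram extends MetricKind
}

object DogStatsDTypes {
  // Proposed mapping from metric kind to DogStatsD type code.
  def typeCode(kind: MetricKind): String = kind match {
    case MetricKind.Gauge     => "g" // direct correspondence
    case MetricKind.Summary   => "h" // DataDog Histogram: percentiles computed on the client side
    case MetricKind.Histogram => "d" // DataDog Distribution: raw values aggregated globally
  }
}
```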
To send Distributions, the DogStatsD client would ignore the details of MetricState.Histogram, as it would only send raw values to the DataDog agent. This means that DataDog users would be able to create a histogram like this and still use it properly:
// We send raw data to the DataDog agent so we can ignore internal state
Metric.histogram("response_time", Boundaries(Chunk.empty))
Note that this can cause compatibility issues if a user tries to switch metric platforms without changing their code, for example when migrating from DataDog to Prometheus. However, as we have seen, there is no one-to-one mapping between Prometheus and DataDog metrics, so I can't think of a scenario where a user wouldn't have to change at least small parts of their code when migrating between metric reporting platforms.
I am happy to create PRs for this issue if you agree with the direction.