feat(prometheus sink): Support histograms by loony-bean · Pull Request #675 · vectordotdev/vector

loony-bean · 2019-07-23T12:46:03Z

Closes #384

loony-bean · 2019-07-23T13:11:19Z

This PR draft is a playground for discussing the UX of metrics pipelines. Specifically, it touches log_to_metric transform and Prometheus sink, but might be seen in broader context.

The intermediate result is that using this config we can transform such log content

99.16.194.115 - olson6561 [23/07/2019:14:35:37 +0300] "HEAD /extensible" 400 10820
37.180.76.108 - - [23/07/2019:14:35:37 +0300] "POST /redefine/proactive" 301 2346
120.17.2.3 - oreilly2474 [23/07/2019:14:35:37 +0300] "GET /user-centric/relationships/morph/models" 503 7372
32.57.26.2 - cartwright1010 [23/07/2019:14:35:37 +0300] "PUT /grow/innovate/best-of-breed" 201 27801
151.103.54.126 - - [23/07/2019:14:35:37 +0300] "DELETE /revolutionary/networks" 201 12691
244.115.139.219 - - [23/07/2019:14:35:37 +0300] "GET /iterate/granular/recontextualize/iterate" 404 20862
146.197.22.54 - - [23/07/2019:14:35:37 +0300] "DELETE /impactful" 400 28990
254.83.212.8 - - [23/07/2019:14:35:37 +0300] "GET /drive/cross-media/e-enable/integrate" 403 20683
65.180.197.177 - gleichner6855 [23/07/2019:14:35:37 +0300] "PUT /distributed/reinvent" 504 1112
237.200.72.216 - schmeler4652 [23/07/2019:14:35:37 +0300] "HEAD /metrics/markets" 204 11111
136.193.221.152 - - [23/07/2019:14:35:37 +0300] "POST /magnetic/one-to-one/orchestrate" 502 25796
149.70.197.139 - - [23/07/2019:14:35:37 +0300] "DELETE /deploy/empower/scalable" 100 5134

into such Prometheus plain text exposition

# HELP bytes_out bytes_out
# TYPE bytes_out gauge
bytes_out 5134
# HELP bytes_out_histogram bytes_out_histogram
# TYPE bytes_out_histogram histogram
bytes_out_histogram_bucket{le="1"} 0
bytes_out_histogram_bucket{le="10"} 0
bytes_out_histogram_bucket{le="100"} 0
bytes_out_histogram_bucket{le="1000"} 0
bytes_out_histogram_bucket{le="10000"} 4
bytes_out_histogram_bucket{le="100000"} 12
bytes_out_histogram_bucket{le="+Inf"} 12
bytes_out_histogram_sum 174718
bytes_out_histogram_count 12
# HELP bytes_out_total bytes_out_total
# TYPE bytes_out_total counter
bytes_out_total 174718
# HELP message_total message_total
# TYPE message_total counter
message_total 12

binarylogic · 2019-07-23T13:27:53Z

Nice! Are there any specific issues you want us to look at? The config looks great to me, but we can have Luke or Lucio do a review on the direction if it helps.

loony-bean · 2019-07-23T13:35:37Z

Now the issues and questions.

Prometheus HELP sections

Adding meta information about the metric seem like a good thing to allow in configs. The question is whether it should be part of metric source, log_to_metric transform or sink config? We might even want to have some descriptions attached to log fields as well.

Labels

Labels are imperative for Prometheus UX. It's hard to imagine how we could express typical Prom patterns like e.g. status code counter (http_requests_total{method="post",code="200"}) without labels. We have #512 for tracking labels support, and it might be reasonable to bring it into the scope.

Sets

Unlike Statsd, Prometheus doesn't support Set types of metrics. It is still possible to transform Sets into Gauges for cardinality of stuff, like here, but it doesn't seem easy to bring it into our rust-prometheus based code.

In general, the idea of handling metric types coercions, implicit or explicit, sounds a bit controversial to me.

loony-bean · 2019-07-23T13:40:11Z

Default names in log_to_metric

It is a good idea to help users writing smaller configs, but here is an example how it might strike us back.

We need to provide the name (bytes_out_total) which we do not really control because it was generated from this user input automatically.

loony-bean · 2019-07-23T13:46:09Z

Individual buckets configs

Bucket configuration needs to be done in sinks, and it could be done similar to this. However I'm not sure this config is ergonomic, mainly because:

user needs to repeat field names in many places
user can try to put different types for the same thing (and who knows what will happen then)

loony-bean · 2019-07-23T13:48:29Z

@binarylogic Yeah, need some help to spec this out.

lukesteensen · 2019-07-23T22:11:23Z

@loony-bean Great questions! I'm not sure of all the answers, but here are my thoughts:

Prometheus HELP sections

Adding meta information about the metric seem like a good thing to allow in configs. The question is whether it should be part of metric source, log_to_metric transform or sink config? We might even want to have some descriptions attached to log fields as well.

I think the main question here is whether or not this information is specific to prometheus. If it is, any configuration of it should probably be limited to the prometheus sink. But we should also think about the overall user experience here. My suspicion is that this would require a lot of time spent writing configuration before providing much value to the user. And at first glance, it doesn't seem as through tools like https://github.com/prometheus/statsd_exporter have decided it's worthwhile to implement anything like this.

Labels

Labels are imperative for Prometheus UX. It's hard to imagine how we could express typical Prom patterns like e.g. status code counter (http_requests_total{method="post",code="200"}) without labels. We have #512 for tracking labels support, and it might be reasonable to bring it into the scope.

Labels were definitely one of the next things I think we should tackle in the metrics area. I always try to split up tasks into small iterative steps, but if you think it would be better to do labels alongside this work feel free to go ahead. I just don't want any individual change to snowball too much and become hard to review.

Sets

Unlike Statsd, Prometheus doesn't support Set types of metrics. It is still possible to transform Sets into Gauges for cardinality of stuff, like here, but it doesn't seem easy to bring it into our rust-prometheus based code.

Yeah, sets are kinda funky. The simplest thing we could do is simply ignore them, which would be totally valid for this early stage. We've also been talking about trying to drop our dependency on the prometheus crate. It doesn't seem well suited to our use case, and we've experimented with leaving it out of some of our internal metrics implementations. If that would make this easier, feel free to explore that route as well.

Default names in log_to_metric

It is a good idea to help users writing smaller configs, but here is an example how it might strike us back.

We need to provide the name (bytes_out_total) which we do not really control because it was generated from this user input automatically.

I'm not sure I totally understand this example. Do we change the name of the field from bytes_out to bytes_out_total when we turn it into a metric? My initial thought is that we should default to just keeping the name the same. If the user configures something that conflicts we try to give them a warning, but there is probably some degree of that which is impossible to prevent.

Individual buckets configs

Bucket configuration needs to be done in sinks, and it could be done similar to this. However I'm not sure this config is ergonomic, mainly because:

user needs to repeat field names in many places

user can try to put different types for the same thing (and who knows what will happen then)

We could provide the option of giving a default bucketing for histograms. That would cut down on the repetitiveness if the users wants the same behavior for most fields. If we wanted to be fancier, we could let them define named bucketings and then just specify the name for each metric.

What do you mean by different types for the same thing?

loony-bean · 2019-07-24T08:23:37Z

Thanks for input @lukesteensen! Re HELP, Labels, Sets - all clear, will ignore it for now.

I'm not sure I totally understand this example. Do we change the name of the field from bytes_out to bytes_out_total when we turn it into a metric? My initial thought is that we should default to just keeping the name the same. If the user configures something that conflicts we try to give them a warning, but there is probably some degree of that which is impossible to prevent.

Yes, we have a special logic of adding a suffix to counter names. Maybe we can adjust this a bit to simply use the field name as is, like you said? /cc @binarylogic

loony-bean · 2019-07-24T08:54:02Z

What do you mean by different types for the same thing?

Using this config as an example:

[[transforms.log_to_metric.metrics]]
type = "histogram"
field = "bytes_out"
name = "bytes_out_histogram"

[sinks.prometheus]
inputs = ["log_to_metric"]
type = "prometheus"

[[sinks.prometheus.metrics]]
type = "histogram"
field = "bytes_out_histogram"
buckets = {type = "linear", start = 100.0, width = 2.0, count = 5} 

[[sinks.prometheus.metrics]]
type = "summary"
field = "bytes_out_histogram"
name = "bytes_out_histogram_summary"

[[sinks.prometheus.metrics]]
type = "counter"
field = "bytes_out_histogram"
name = "bytes_out_histogram_counter"

In here user can try to connect histogram to histogram (fine), histogram to summary (makes sense), or even histogram to counter (wat, but still possible).

The question in general is if we are going to allow user to convert internal metric type (internal histogram) into different sink types (prometheus histogram/summary), and how user should be guided?

lukesteensen · 2019-07-24T19:36:57Z

I see what you're saying. This seems to boil down to how the sink is configured. I think we should maintain the current prometheus sink behavior of aggregating all the metrics it receives without requiring an explicit configuration for each metric. This makes the sink much easier to use by reducing the burden of writing configs, especially in use cases with many different metrics.

Obviously, we'll still need a way for users to specify how certain metrics should be aggregated. I think our approach should be layers here, to try to minimize the amount of config a user would need to write:

For each metric type, we'll have a default way of aggregating that metric for the prometheus sink. For types like counters, this is pretty obvious and should be all we need. For things like histograms, we can only guess at a bucketing strategy that should be generally useful.
For the next layer, we should allow users to specify things like the default bucketing strategy for timers/histograms (or specify that they should be aggregated as summaries instead of histograms, for example).
Finally, we can provide more specific overrides that operate on metrics of a particular type that match a particular pattern. For example, users could configure that all timers that match foo.* should use a summary, all that match bar.* should use bucketing strategy A, and all that match baz.* should use bucketing strategy B. All the others would fall back to the defaults set in (1), and (2).

Not all of this should be implemented right away, of course. Starting with (1) would be very reasonable, and we could add (2) and (3) as their own followup features.

To get back to your question, I think the answer is that we should try to design our configuration to avoid that problem as much as possible. With the design above, nothing necessarily has to match correctly for the system to work. It will do a reasonable default thing on it's own, and you have the ability to layer patterned overrides on top. If an override doesn't match an incoming metric, it will simply be handled with the default behavior.

loony-bean · 2019-07-27T13:51:04Z

Thanks again @lukesteensen for your comments! In this PR all defaults (1) layer is respected, as well as buckets override from layer (2).

@binarylogic Within this PR the way we construct a default name for Counters has changed, and a new example config is added. I'd like to understand the necessary amount of documentation changes that I need to include into this PR, thus asking for your review.

binarylogic · 2019-07-27T14:34:06Z

Nice work!

the way we construct a default name for Counters has changed

You're talking specifically about Prometheus metrics, correct? If the user provides an explicit name for a metric (bytes_out in your example) then that is the name we should use throughout the pipeline unless explicitly changed again. My previous requirement of default naming in the log_to_metric transform was very heavily inspired by Promtheus' recommendations, and it probably makes sense to move that logic into the prometheus sink at some point in the future. For example, I could see us implementing default rules for metric renaming in the prometheus sink (ex: append all counters with _total). We can address this in a follow up PR though.

I'd like to understand the necessary amount of documentation changes that I need to include into this PR, thus asking for your review.

Good question and thanks for considering that. It appears there are some behavior changes that we can better reflect in the docs. I can take a stab at that and push up a commit.

loony-bean · 2019-07-27T14:42:20Z

You're talking specifically about Prometheus metrics, correct?

Almost. The change happened in the upstream log_to_metric transform. Current logic is that if name is not specified, field string (as is) is used as a default for all metric types. In practice, it means that we are not appending _total to Counter names any more in log_to_metric.

binarylogic · 2019-07-27T14:45:42Z

Sure, I'm ok with that for now. I think any naming changes we need to make for specific storages will become more clear over time.

lukesteensen

Added some comments, but this looks good! I'll defer to @binarylogic for approval on docs and such.

Signed-off-by: Alexey Suslov <alexey.suslov@gmail.com>

binarylogic · 2019-08-05T23:48:39Z

+                }
+            }),
+            Metric::Set { .. } => {
+                trace!("Sets are not supported in Prometheus sink");


@LucioFranco as part of #628, do you think it makes sense to log this at the warning level and rate limit it? I'm curious what you think.

A small remark here. I believe we will eventually add support for this, and this message will disappear.

Luke: I was also wondering if we could support sets by exposing them as a gauge/counter value. We don't need to do it now, but could be interesting for the future.

binarylogic · 2019-08-06T00:11:05Z

Just to update, I'm in the process of updating the docs. I have a few questions though

Is there any reason we don't default to Prometheus' summary type instead of the histogram type when mapping our own internal histograms? I know there are caveats to summaries, but I wanted to check that this was intentional and document the reasons why.
It looks like the default histogram buckets are 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0 but the example above shows buckets of 1, 10, 100, .... How exactly are these buckets maps when exposed via the exposition format?
I couldn't find any tests for this sink. Any reason why?

Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>

binarylogic · 2019-08-06T01:19:00Z

I've updated the docs to reflect everything going on here. I need answers to my above questions to verify some of the content.

Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>

loony-bean · 2019-08-06T08:12:35Z

Is there any reason we don't default to Prometheus' summary type instead of the histogram type when mapping our own internal histograms? I know there are caveats to summaries, but I wanted to check that this was intentional and document the reasons why.

I'm sure you've seen this comparison. My understanding is that histograms provide more features, and require less work on our side (because the work is delegated to Prometheus server).

It looks like the default histogram buckets are 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0 but the example above shows buckets of 1, 10, 100, .... How exactly are these buckets maps when exposed via the exposition format?

Sorry about this. This PR started as a draft, and some changes were not reflected in the upstream comments. I believe I'd need to start a new PR next time, or update the comments retrospectively.

Currently our default buckets are precisely Prometheus' defaults, and user should not see 1, 10, 100... buckets.

I couldn't find any tests for this sink. Any reason why?

This is a good question. Actually we have tests, but they live in Statsd module for historical reasons. We are missing histograms asserts in those tests though (will add some). I'm not sure if we need to copy or move the tests from this module or keep it there.

binarylogic · 2019-08-06T16:02:18Z

Thanks @loony-bean

My understanding is that histograms provide more features, and require less work on our side (because the work is delegated to Prometheus server).

Agree, I was curious to hear your reasoning. That makes sense to me.

Currently our default buckets are precisely Prometheus' defaults, and user should not see 1, 10, 100... buckets.

Perfect. I've updated the docs to reflect this.

I'm not sure if we need to copy or move the tests from this module or keep it there.

Whatever you think is best from an organization standpoint. I just want to make sure we're testing behavior traits to ensure we don't introduce backwards incompatible changes in the future.

New question! 😄

I think we might have an issue with units. Our default buckets are in seconds, but if a metric has a unit that is not seconds it will fall into the wrong bucket.

For example, if we have an internal metric like:

{
  "histogram": {
    "name": "duration_ms",
    "val": 1000.0
  }
}

And our buckets are (in seconds):

[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

This would fall into the 10.0 bucket, when really it should fall into the 1.0 bucket.

I have experience working with the Elixir prometheus client, and they implemented a unit auto-detection featured based on the suffix in the name. Additionally, you could manually specify the unit so it was normalized. I think we should address this in a follow up PR? Or we can just require the user to redefine the buckets.

Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>

binarylogic · 2019-08-06T16:36:47Z

This PR is getting pretty noisy and I think it's a good first step for supporting histograms. I've created follow up issues to address changes in separate PRs. Going to merge as a result.

binarylogic assigned lukesteensen Jul 23, 2019

binarylogic mentioned this pull request Jul 26, 2019

log_to_metric field data into label #684

Closed

loony-bean force-pushed the metrics-prom-hist branch from a5b5b5d to e2256ed Compare July 27, 2019 13:11

loony-bean marked this pull request as ready for review July 27, 2019 13:26

loony-bean requested a review from lukesteensen July 27, 2019 13:27

loony-bean requested a review from binarylogic July 27, 2019 13:51

binarylogic added the needs: docs Needs documentation updates label Jul 31, 2019

lukesteensen approved these changes Jul 31, 2019

View reviewed changes

Comment thread src/sinks/prometheus.rs

Comment thread src/sinks/prometheus.rs Outdated

Comment thread src/sinks/prometheus.rs Outdated

binarylogic self-assigned this Aug 5, 2019

loony-bean added 4 commits August 5, 2019 11:18

Add config draft and limited histogram support

67e1c72

Signed-off-by: Alexey Suslov <alexey.suslov@gmail.com>

Use field as default name

dd1a867

Signed-off-by: Alexey Suslov <alexey.suslov@gmail.com>

Add default bucket config

28cd3f3

Signed-off-by: Alexey Suslov <alexey.suslov@gmail.com>

Adjust error levels

590c983

binarylogic mentioned this pull request Aug 5, 2019

Provide metric naming and strategy overrides in the prometheus sink #710

Closed

binarylogic reviewed Aug 5, 2019

View reviewed changes

update docs

e14e957

Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>

binarylogic force-pushed the metrics-prom-hist branch from 232eac3 to e14e957 Compare August 6, 2019 01:18

binarylogic removed the needs: docs Needs documentation updates label Aug 6, 2019

fix buckets option type

5b88ec4

Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>

binarylogic changed the title ~~feat(metrics): metrics prom hist~~ feat(prometheus sink): support histograms Aug 6, 2019

binarylogic changed the title ~~feat(prometheus sink): support histograms~~ feat(prometheus sink): Support histograms Aug 6, 2019

binarylogic mentioned this pull request Aug 6, 2019

Normalize metric values and units #711

Closed

binarylogic added 3 commits August 6, 2019 12:33

docs: add more examples and clarify buckets

de46f1d

Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>

Fix merge conflicts

599dddd

docs: regenerate docs

ad0e854

Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>

binarylogic mentioned this pull request Aug 6, 2019

Finalize prometheus testing #712

Closed

binarylogic mentioned this pull request Aug 6, 2019

Support sets in the prometheus sink #713

Closed

binarylogic merged commit 855b007 into master Aug 6, 2019

binarylogic deleted the metrics-prom-hist branch August 6, 2019 16:43

Conversation

loony-bean commented Jul 23, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

loony-bean commented Jul 23, 2019

Uh oh!

binarylogic commented Jul 23, 2019

Uh oh!

loony-bean commented Jul 23, 2019

Prometheus HELP sections

Labels

Sets

Uh oh!

loony-bean commented Jul 23, 2019

Default names in log_to_metric

Uh oh!

loony-bean commented Jul 23, 2019

Individual buckets configs

Uh oh!

loony-bean commented Jul 23, 2019

Uh oh!

lukesteensen commented Jul 23, 2019

Prometheus HELP sections

Labels

Sets

Default names in log_to_metric

Individual buckets configs

Uh oh!

loony-bean commented Jul 24, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

loony-bean commented Jul 24, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lukesteensen commented Jul 24, 2019

Uh oh!

loony-bean commented Jul 27, 2019

Uh oh!

binarylogic commented Jul 27, 2019

Uh oh!

loony-bean commented Jul 27, 2019

Uh oh!

binarylogic commented Jul 27, 2019

Uh oh!

lukesteensen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

binarylogic Aug 5, 2019

Choose a reason for hiding this comment

Uh oh!

loony-bean Aug 6, 2019

Choose a reason for hiding this comment

Uh oh!

binarylogic commented Aug 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

binarylogic commented Aug 6, 2019

Uh oh!

loony-bean commented Aug 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

binarylogic commented Aug 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

binarylogic commented Aug 6, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

loony-bean commented Jul 23, 2019 •

edited

Loading

loony-bean commented Jul 24, 2019 •

edited

Loading

loony-bean commented Jul 24, 2019 •

edited

Loading

binarylogic commented Aug 6, 2019 •

edited

Loading

loony-bean commented Aug 6, 2019 •

edited

Loading

binarylogic commented Aug 6, 2019 •

edited

Loading