Skip to content

feat(prometheus sink): Support histograms#675

Merged
binarylogic merged 9 commits intomasterfrom
metrics-prom-hist
Aug 6, 2019
Merged

feat(prometheus sink): Support histograms#675
binarylogic merged 9 commits intomasterfrom
metrics-prom-hist

Conversation

@loony-bean
Copy link
Copy Markdown
Contributor

@loony-bean loony-bean commented Jul 23, 2019

Closes #384

@loony-bean
Copy link
Copy Markdown
Contributor Author

This PR draft is a playground for discussing the UX of metrics pipelines. Specifically, it touches log_to_metric transform and Prometheus sink, but might be seen in broader context.

The intermediate result is that using this config we can transform such log content

99.16.194.115 - olson6561 [23/07/2019:14:35:37 +0300] "HEAD /extensible" 400 10820
37.180.76.108 - - [23/07/2019:14:35:37 +0300] "POST /redefine/proactive" 301 2346
120.17.2.3 - oreilly2474 [23/07/2019:14:35:37 +0300] "GET /user-centric/relationships/morph/models" 503 7372
32.57.26.2 - cartwright1010 [23/07/2019:14:35:37 +0300] "PUT /grow/innovate/best-of-breed" 201 27801
151.103.54.126 - - [23/07/2019:14:35:37 +0300] "DELETE /revolutionary/networks" 201 12691
244.115.139.219 - - [23/07/2019:14:35:37 +0300] "GET /iterate/granular/recontextualize/iterate" 404 20862
146.197.22.54 - - [23/07/2019:14:35:37 +0300] "DELETE /impactful" 400 28990
254.83.212.8 - - [23/07/2019:14:35:37 +0300] "GET /drive/cross-media/e-enable/integrate" 403 20683
65.180.197.177 - gleichner6855 [23/07/2019:14:35:37 +0300] "PUT /distributed/reinvent" 504 1112
237.200.72.216 - schmeler4652 [23/07/2019:14:35:37 +0300] "HEAD /metrics/markets" 204 11111
136.193.221.152 - - [23/07/2019:14:35:37 +0300] "POST /magnetic/one-to-one/orchestrate" 502 25796
149.70.197.139 - - [23/07/2019:14:35:37 +0300] "DELETE /deploy/empower/scalable" 100 5134

into such Prometheus plain text exposition

# HELP bytes_out bytes_out
# TYPE bytes_out gauge
bytes_out 5134
# HELP bytes_out_histogram bytes_out_histogram
# TYPE bytes_out_histogram histogram
bytes_out_histogram_bucket{le="1"} 0
bytes_out_histogram_bucket{le="10"} 0
bytes_out_histogram_bucket{le="100"} 0
bytes_out_histogram_bucket{le="1000"} 0
bytes_out_histogram_bucket{le="10000"} 4
bytes_out_histogram_bucket{le="100000"} 12
bytes_out_histogram_bucket{le="+Inf"} 12
bytes_out_histogram_sum 174718
bytes_out_histogram_count 12
# HELP bytes_out_total bytes_out_total
# TYPE bytes_out_total counter
bytes_out_total 174718
# HELP message_total message_total
# TYPE message_total counter
message_total 12

@binarylogic
Copy link
Copy Markdown
Contributor

Nice! Are there any specific issues you want us to look at? The config looks great to me, but we can have Luke or Lucio do a review on the direction if it helps.

@loony-bean
Copy link
Copy Markdown
Contributor Author

Now the issues and questions.

Prometheus HELP sections

Adding meta information about the metric seem like a good thing to allow in configs. The question is whether it should be part of metric source, log_to_metric transform or sink config? We might even want to have some descriptions attached to log fields as well.

Labels

Labels are imperative for Prometheus UX. It's hard to imagine how we could express typical Prom patterns like e.g. status code counter (http_requests_total{method="post",code="200"}) without labels. We have #512 for tracking labels support, and it might be reasonable to bring it into the scope.

Sets

Unlike Statsd, Prometheus doesn't support Set types of metrics. It is still possible to transform Sets into Gauges for cardinality of stuff, like here, but it doesn't seem easy to bring it into our rust-prometheus based code.

In general, the idea of handling metric types coercions, implicit or explicit, sounds a bit controversial to me.

@loony-bean
Copy link
Copy Markdown
Contributor Author

Default names in log_to_metric

It is a good idea to help users writing smaller configs, but here is an example how it might strike us back.

We need to provide the name (bytes_out_total) which we do not really control because it was generated from this user input automatically.

@loony-bean
Copy link
Copy Markdown
Contributor Author

Individual buckets configs

Bucket configuration needs to be done in sinks, and it could be done similar to this. However I'm not sure this config is ergonomic, mainly because:

  • user needs to repeat field names in many places
  • user can try to put different types for the same thing (and who knows what will happen then)

@loony-bean
Copy link
Copy Markdown
Contributor Author

@binarylogic Yeah, need some help to spec this out.

@lukesteensen
Copy link
Copy Markdown
Member

@loony-bean Great questions! I'm not sure of all the answers, but here are my thoughts:

Prometheus HELP sections

Adding meta information about the metric seem like a good thing to allow in configs. The question is whether it should be part of metric source, log_to_metric transform or sink config? We might even want to have some descriptions attached to log fields as well.

I think the main question here is whether or not this information is specific to prometheus. If it is, any configuration of it should probably be limited to the prometheus sink. But we should also think about the overall user experience here. My suspicion is that this would require a lot of time spent writing configuration before providing much value to the user. And at first glance, it doesn't seem as through tools like https://github.com/prometheus/statsd_exporter have decided it's worthwhile to implement anything like this.

Labels

Labels are imperative for Prometheus UX. It's hard to imagine how we could express typical Prom patterns like e.g. status code counter (http_requests_total{method="post",code="200"}) without labels. We have #512 for tracking labels support, and it might be reasonable to bring it into the scope.

Labels were definitely one of the next things I think we should tackle in the metrics area. I always try to split up tasks into small iterative steps, but if you think it would be better to do labels alongside this work feel free to go ahead. I just don't want any individual change to snowball too much and become hard to review.

Sets

Unlike Statsd, Prometheus doesn't support Set types of metrics. It is still possible to transform Sets into Gauges for cardinality of stuff, like here, but it doesn't seem easy to bring it into our rust-prometheus based code.

Yeah, sets are kinda funky. The simplest thing we could do is simply ignore them, which would be totally valid for this early stage. We've also been talking about trying to drop our dependency on the prometheus crate. It doesn't seem well suited to our use case, and we've experimented with leaving it out of some of our internal metrics implementations. If that would make this easier, feel free to explore that route as well.

Default names in log_to_metric

It is a good idea to help users writing smaller configs, but here is an example how it might strike us back.

We need to provide the name (bytes_out_total) which we do not really control because it was generated from this user input automatically.

I'm not sure I totally understand this example. Do we change the name of the field from bytes_out to bytes_out_total when we turn it into a metric? My initial thought is that we should default to just keeping the name the same. If the user configures something that conflicts we try to give them a warning, but there is probably some degree of that which is impossible to prevent.

Individual buckets configs

Bucket configuration needs to be done in sinks, and it could be done similar to this. However I'm not sure this config is ergonomic, mainly because:

  • user needs to repeat field names in many places

  • user can try to put different types for the same thing (and who knows what will happen then)

We could provide the option of giving a default bucketing for histograms. That would cut down on the repetitiveness if the users wants the same behavior for most fields. If we wanted to be fancier, we could let them define named bucketings and then just specify the name for each metric.

What do you mean by different types for the same thing?

@loony-bean
Copy link
Copy Markdown
Contributor Author

loony-bean commented Jul 24, 2019

Thanks for input @lukesteensen! Re HELP, Labels, Sets - all clear, will ignore it for now.

I'm not sure I totally understand this example. Do we change the name of the field from bytes_out to bytes_out_total when we turn it into a metric? My initial thought is that we should default to just keeping the name the same. If the user configures something that conflicts we try to give them a warning, but there is probably some degree of that which is impossible to prevent.

Yes, we have a special logic of adding a suffix to counter names. Maybe we can adjust this a bit to simply use the field name as is, like you said? /cc @binarylogic

@loony-bean
Copy link
Copy Markdown
Contributor Author

loony-bean commented Jul 24, 2019

What do you mean by different types for the same thing?

Using this config as an example:

[[transforms.log_to_metric.metrics]]
type = "histogram"
field = "bytes_out"
name = "bytes_out_histogram"

[sinks.prometheus]
inputs = ["log_to_metric"]
type = "prometheus"

[[sinks.prometheus.metrics]]
type = "histogram"
field = "bytes_out_histogram"
buckets = {type = "linear", start = 100.0, width = 2.0, count = 5}


[[sinks.prometheus.metrics]]
type = "summary"
field = "bytes_out_histogram"
name = "bytes_out_histogram_summary"

[[sinks.prometheus.metrics]]
type = "counter"
field = "bytes_out_histogram"
name = "bytes_out_histogram_counter"

In here user can try to connect histogram to histogram (fine), histogram to summary (makes sense), or even histogram to counter (wat, but still possible).

The question in general is if we are going to allow user to convert internal metric type (internal histogram) into different sink types (prometheus histogram/summary), and how user should be guided?

@lukesteensen
Copy link
Copy Markdown
Member

I see what you're saying. This seems to boil down to how the sink is configured. I think we should maintain the current prometheus sink behavior of aggregating all the metrics it receives without requiring an explicit configuration for each metric. This makes the sink much easier to use by reducing the burden of writing configs, especially in use cases with many different metrics.

Obviously, we'll still need a way for users to specify how certain metrics should be aggregated. I think our approach should be layers here, to try to minimize the amount of config a user would need to write:

  1. For each metric type, we'll have a default way of aggregating that metric for the prometheus sink. For types like counters, this is pretty obvious and should be all we need. For things like histograms, we can only guess at a bucketing strategy that should be generally useful.

  2. For the next layer, we should allow users to specify things like the default bucketing strategy for timers/histograms (or specify that they should be aggregated as summaries instead of histograms, for example).

  3. Finally, we can provide more specific overrides that operate on metrics of a particular type that match a particular pattern. For example, users could configure that all timers that match foo.* should use a summary, all that match bar.* should use bucketing strategy A, and all that match baz.* should use bucketing strategy B. All the others would fall back to the defaults set in (1), and (2).

Not all of this should be implemented right away, of course. Starting with (1) would be very reasonable, and we could add (2) and (3) as their own followup features.

To get back to your question, I think the answer is that we should try to design our configuration to avoid that problem as much as possible. With the design above, nothing necessarily has to match correctly for the system to work. It will do a reasonable default thing on it's own, and you have the ability to layer patterned overrides on top. If an override doesn't match an incoming metric, it will simply be handled with the default behavior.

@loony-bean loony-bean marked this pull request as ready for review July 27, 2019 13:26
@loony-bean loony-bean requested a review from lukesteensen July 27, 2019 13:27
@loony-bean
Copy link
Copy Markdown
Contributor Author

Thanks again @lukesteensen for your comments! In this PR all defaults (1) layer is respected, as well as buckets override from layer (2).

@binarylogic Within this PR the way we construct a default name for Counters has changed, and a new example config is added. I'd like to understand the necessary amount of documentation changes that I need to include into this PR, thus asking for your review.

@loony-bean loony-bean requested a review from binarylogic July 27, 2019 13:51
@binarylogic
Copy link
Copy Markdown
Contributor

Nice work!

the way we construct a default name for Counters has changed

You're talking specifically about Prometheus metrics, correct? If the user provides an explicit name for a metric (bytes_out in your example) then that is the name we should use throughout the pipeline unless explicitly changed again. My previous requirement of default naming in the log_to_metric transform was very heavily inspired by Promtheus' recommendations, and it probably makes sense to move that logic into the prometheus sink at some point in the future. For example, I could see us implementing default rules for metric renaming in the prometheus sink (ex: append all counters with _total). We can address this in a follow up PR though.

I'd like to understand the necessary amount of documentation changes that I need to include into this PR, thus asking for your review.

Good question and thanks for considering that. It appears there are some behavior changes that we can better reflect in the docs. I can take a stab at that and push up a commit.

@loony-bean
Copy link
Copy Markdown
Contributor Author

You're talking specifically about Prometheus metrics, correct?

Almost. The change happened in the upstream log_to_metric transform. Current logic is that if name is not specified, field string (as is) is used as a default for all metric types. In practice, it means that we are not appending _total to Counter names any more in log_to_metric.

@binarylogic
Copy link
Copy Markdown
Contributor

Sure, I'm ok with that for now. I think any naming changes we need to make for specific storages will become more clear over time.

@binarylogic binarylogic added the needs: docs Needs documentation updates label Jul 31, 2019
Copy link
Copy Markdown
Member

@lukesteensen lukesteensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, but this looks good! I'll defer to @binarylogic for approval on docs and such.

Comment thread src/sinks/prometheus.rs
Comment thread src/sinks/prometheus.rs Outdated
Comment thread src/sinks/prometheus.rs Outdated
@binarylogic binarylogic self-assigned this Aug 5, 2019
Signed-off-by: Alexey Suslov <alexey.suslov@gmail.com>
Signed-off-by: Alexey Suslov <alexey.suslov@gmail.com>
Signed-off-by: Alexey Suslov <alexey.suslov@gmail.com>
Comment thread src/sinks/prometheus.rs
}
}),
Metric::Set { .. } => {
trace!("Sets are not supported in Prometheus sink");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LucioFranco as part of #628, do you think it makes sense to log this at the warning level and rate limit it? I'm curious what you think.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small remark here. I believe we will eventually add support for this, and this message will disappear.

Luke: I was also wondering if we could support sets by exposing them as a gauge/counter value. We don't need to do it now, but could be interesting for the future.

@binarylogic
Copy link
Copy Markdown
Contributor

binarylogic commented Aug 6, 2019

Just to update, I'm in the process of updating the docs. I have a few questions though

  1. Is there any reason we don't default to Prometheus' summary type instead of the histogram type when mapping our own internal histograms? I know there are caveats to summaries, but I wanted to check that this was intentional and document the reasons why.
  2. It looks like the default histogram buckets are 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0 but the example above shows buckets of 1, 10, 100, .... How exactly are these buckets maps when exposed via the exposition format?
  3. I couldn't find any tests for this sink. Any reason why?

Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>
@binarylogic
Copy link
Copy Markdown
Contributor

I've updated the docs to reflect everything going on here. I need answers to my above questions to verify some of the content.

@binarylogic binarylogic removed the needs: docs Needs documentation updates label Aug 6, 2019
Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>
@binarylogic binarylogic changed the title feat(metrics): metrics prom hist feat(prometheus sink): support histograms Aug 6, 2019
@loony-bean
Copy link
Copy Markdown
Contributor Author

loony-bean commented Aug 6, 2019

  1. Is there any reason we don't default to Prometheus' summary type instead of the histogram type when mapping our own internal histograms? I know there are caveats to summaries, but I wanted to check that this was intentional and document the reasons why.

I'm sure you've seen this comparison. My understanding is that histograms provide more features, and require less work on our side (because the work is delegated to Prometheus server).

  1. It looks like the default histogram buckets are 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0 but the example above shows buckets of 1, 10, 100, .... How exactly are these buckets maps when exposed via the exposition format?

Sorry about this. This PR started as a draft, and some changes were not reflected in the upstream comments. I believe I'd need to start a new PR next time, or update the comments retrospectively.

Currently our default buckets are precisely Prometheus' defaults, and user should not see 1, 10, 100... buckets.

  1. I couldn't find any tests for this sink. Any reason why?

This is a good question. Actually we have tests, but they live in Statsd module for historical reasons. We are missing histograms asserts in those tests though (will add some). I'm not sure if we need to copy or move the tests from this module or keep it there.

@binarylogic binarylogic changed the title feat(prometheus sink): support histograms feat(prometheus sink): Support histograms Aug 6, 2019
@binarylogic
Copy link
Copy Markdown
Contributor

binarylogic commented Aug 6, 2019

Thanks @loony-bean

My understanding is that histograms provide more features, and require less work on our side (because the work is delegated to Prometheus server).

Agree, I was curious to hear your reasoning. That makes sense to me.

Currently our default buckets are precisely Prometheus' defaults, and user should not see 1, 10, 100... buckets.

Perfect. I've updated the docs to reflect this.

I'm not sure if we need to copy or move the tests from this module or keep it there.

Whatever you think is best from an organization standpoint. I just want to make sure we're testing behavior traits to ensure we don't introduce backwards incompatible changes in the future.


New question! 😄

I think we might have an issue with units. Our default buckets are in seconds, but if a metric has a unit that is not seconds it will fall into the wrong bucket.

For example, if we have an internal metric like:

{
  "histogram": {
    "name": "duration_ms",
    "val": 1000.0
  }
}

And our buckets are (in seconds):

[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

This would fall into the 10.0 bucket, when really it should fall into the 1.0 bucket.

I have experience working with the Elixir prometheus client, and they implemented a unit auto-detection featured based on the suffix in the name. Additionally, you could manually specify the unit so it was normalized. I think we should address this in a follow up PR? Or we can just require the user to redefine the buckets.

Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>
Signed-off-by: Ben Johnson <bjohnson@binarylogic.com>
@binarylogic
Copy link
Copy Markdown
Contributor

This PR is getting pretty noisy and I think it's a good first step for supporting histograms. I've created follow up issues to address changes in separate PRs. Going to merge as a result.

@binarylogic binarylogic merged commit 855b007 into master Aug 6, 2019
@binarylogic binarylogic deleted the metrics-prom-hist branch August 6, 2019 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support all metric types for prometheus sink

3 participants