Implementing Prometheus Exporter and documentation #534
Conversation
This is a continuation of PR #526, but I decided to create a new PR to clean things up for a fresh review.
@tsenart Could you please take a look at this PR?
@flaviostutz: I've been swamped with work but saved some time to look at this on Saturday! Sorry.
No worries... I know how it is... it's the same here with me... thanks!
First round of review done! Thanks for working on this 🙇
@tsenart could you please take a look at this PR when you have some time? I think I managed to do all the requested changes.
Looks good! Just a few leftover changes / inconsistencies. Thank you 🙇
lib/prom/prom.go
Outdated
vegeta "github.com/tsenart/vegeta/v12/lib"
)

//PrometheusMetrics vegeta metrics observer with exposition as Prometheus metrics endpoint
Please do this.
@tsenart please take a look at the requested changes.
go.sum
Outdated
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.1.1/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
@flaviostutz: Why is this still here?
I removed all dependencies on testify in the prom lib, but go.sum already had an existing testify dependency in master and I don't know where it comes from. go.mod doesn't declare it; it's probably an indirect dependency somewhere (it already existed before my PR).
@@ -803,6 +807,62 @@
$ ulimit -u # processes / threads

Just pass a new number as the argument to change it.

## Prometheus Support

Vegeta has a built-in Prometheus Exporter that can be enabled during attacks so that you can point any Prometheus instance at Vegeta and collect metrics about HTTP request performance and about the Vegeta process itself.
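To illustrate what this enables, here is a minimal, hedged library-level sketch, assuming the `lib/prom` API discussed later in this thread (`prom.NewMetrics`, `Metrics.Register`, `Metrics.Observe`); the listen address and target URL are arbitrary placeholders:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	vegeta "github.com/tsenart/vegeta/v12/lib"
	"github.com/tsenart/vegeta/v12/lib/prom"
)

func main() {
	// Register the attack metrics with a dedicated registry and expose it
	// over HTTP for Prometheus to scrape.
	registry := prometheus.NewRegistry()
	metrics := prom.NewMetrics()
	if err := metrics.Register(registry); err != nil {
		log.Fatal(err)
	}
	go func() {
		log.Fatal(http.ListenAndServe("0.0.0.0:8880", promhttp.HandlerFor(registry, promhttp.HandlerOpts{})))
	}()

	// Run the attack, observing every result as it completes.
	// Note: as raised later in this review, a real implementation should
	// ensure Prometheus scrapes the final results before the process exits.
	rate := vegeta.Rate{Freq: 5, Per: time.Second}
	targeter := vegeta.NewStaticTargeter(vegeta.Target{Method: "GET", URL: "http://localhost:8080/"})
	attacker := vegeta.NewAttacker()
	for res := range attacker.Attack(targeter, rate, 10*time.Minute, "prometheus-example") {
		metrics.Observe(res)
	}
}
```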
While I like this for a lot of reasons, I think vegeta should be exporting metrics to a Prometheus push gateway rather than being scraped directly by Prometheus.
Since vegeta isn't a long-lived process, you'll have race conditions where Prometheus might not scrape the last bit of attack data before vegeta shuts down.
The Prometheus best-practices docs say you should use the push gateway for "service-level batch jobs", which is what I think vegeta qualifies as:
> Usually, the only valid use case for the Pushgateway is for capturing the outcome of a service-level batch job. A "service-level" batch job is one which is not semantically related to a specific machine or job instance (for example, a batch job that deletes a number of users for an entire service). Such a job's metrics should not include a machine or instance label to decouple the lifecycle of specific machines or instances from the pushed metrics. This decreases the burden for managing stale metrics in the Pushgateway. See also the best practices for monitoring batch jobs.
I bring this up because I think it should instead have, or at least additionally offer, a report type of `prom`, which would then export metrics from a report to a Prometheus push gateway. I am currently writing that for Datadog rather than Prometheus, but the idea is the same. The advantage of the report is that one could take an existing Vegeta test result and push it into a metrics store rather than having to run a new test. And the user could publish the results to a Prometheus and a Datadog and wherever else they needed, with the reporting decoupled from the attacking.
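For illustration, a hedged sketch of that report-style flow, assuming vegeta's gob `Decoder` from the library and the `lib/prom` `Metrics` API discussed in this PR; the push gateway URL and job name are placeholders:

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"

	vegeta "github.com/tsenart/vegeta/v12/lib"
	"github.com/tsenart/vegeta/v12/lib/prom"
)

func main() {
	registry := prometheus.NewRegistry()
	metrics := prom.NewMetrics()
	if err := metrics.Register(registry); err != nil {
		log.Fatal(err)
	}

	// Replay previously saved attack results (e.g. from
	// `vegeta attack ... > results.bin`) into the Prometheus metrics.
	decode := vegeta.NewDecoder(os.Stdin)
	for {
		var r vegeta.Result
		if err := decode(&r); err != nil {
			if err == io.EOF {
				break
			}
			log.Fatal(err)
		}
		metrics.Observe(&r)
	}

	// Push everything once, after the whole report has been processed.
	pusher := push.New("http://pushgateway.example.org:9091", "vegeta-report").Gatherer(registry)
	if err := pusher.Push(); err != nil {
		log.Fatal(err)
	}
}
```

Because the push happens once, after all results are decoded, this also sidesteps the scrape race mentioned above.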
Note to self: When `-prometheus-addr` is set and we start an HTTP server for Prometheus to scrape, make sure that all metrics are scraped before exiting the program, even after the attack itself has finished.
Hi guys, hope you are all well! Just wanted to know the current status of this PR, as it sounds awesome for load testing any webapp. What is missing to merge it? Cheers,
As someone who uses Vegeta as a library for testing/benchmarking/load testing, this is fantastic! Having to manually implement OpenCensus middleware for HTTP is a bit of a pain. Just a note: is there a way to customise the latency histogram buckets? They looked hardcoded to me, and I can think of two immediate ways of making them customisable: exposing the prom metric types to the user, or letting the user set them as a "param" when creating the metrics observer.
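For illustration, one hedged way the buckets could become a parameter; the constructor, metric name, and labels below are hypothetical, not the PR's actual API:

```go
package prom

import "github.com/prometheus/client_golang/prometheus"

// NewLatencyHistogram is a hypothetical constructor sketching how the
// latency buckets could be accepted as a parameter instead of being
// hardcoded. Metric and label names are illustrative only.
func NewLatencyHistogram(buckets []float64) *prometheus.HistogramVec {
	if len(buckets) == 0 {
		buckets = prometheus.DefBuckets // fall back to the client library's defaults
	}
	return prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "request_seconds",
		Help:    "Duration of HTTP requests, in seconds.",
		Buckets: buckets,
	}, []string{"method", "url", "status"})
}
```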
Hello people! Any news on this one? It seems this functionality is widely needed. Many thanks for your nice work! cc @flaviostutz @tsenart
@hfuss @flaviostutz I'm wondering what the state of this PR is and whether it's still possible to get Prometheus support implemented. I agree that normally an ad-hoc job like a Vegeta process could benefit from just delivering results to the Pushgateway. I also think creating an exporter could be useful if you're going to run Vegeta for a long time and want a simpler integration. I'm preparing to use Vegeta for some long-running (hours to days) load-testing jobs and wondered if I could help at all with the Prometheus integration one way or the other.
Depending on the use case, I would advise against the PushGateway approach except as a last resort. In my efforts to use it, I observed that metrics pushed to the gateway have no TTL and persist forever, and a metric's value only updates if you send new values over time with exactly the same labels/dimensions. If the labels change over time (different instance, host, etc.), you just get new metrics, and both the old and new metrics persist in the system.

So when viewing the metrics in, say, Grafana, you get a never-ending set of time series, rather than time series with filled lines for whenever metrics are sent to the gateway and missing data on the chart for intervals where nothing is sent. I wanted and expected the latter but got the former. For ad-hoc and periodic load testing, I'd assume people want the latter; the former is more suitable if, say, you ran vegeta to monitor a test instance that never changes, the same service monitored over time.

For what I want to do, if you go with the push gateway, you may need something like this forked version instead, which adds a TTL to push gateway metrics: https://github.com/dinumathai/pushgateway. So I think the exporter route is good; it doesn't have the TTL issue that the push gateway has.

Another option to consider as an alternative to the exporter and the push gateway is the Prometheus remote write protocol, as a way to push metrics to Prometheus. We're looking into and testing remote write at our organization; I'm not aware of the specific implementation or how well it's performing at the moment. But I wouldn't know, in the case of vegeta, which would be more optimal from a performance standpoint, since vegeta needs to both generate the load (test) and also expose or push the metrics. https://last9.io/blog/what-is-prometheus-remote-write/
@daluu: Makes sense. I think the Prometheus exporter should then be implemented as a vegeta sub command, like `prom`.
As discussed in #477, we decided to instrument attack because then we don't need to parse stdin, and it would be simpler and more performant. If we decide to change this at this point in the PR, we should actually cancel it and start a new branch, because a lot of things would be different. I'd recommend we finish this PR with the initial requirements, because it's almost there; if necessary, we can discuss creating a reporter / push gateway / direct-write version of it in the future. What do you think?
Ah, I had forgotten that discussion. It was a long time ago! Thinking about it now, having the attack command expose a web server handler that a Prometheus instance can scrape is good for performance and for reducing moving pieces in distributed attacks. But for interactive use and debugging past attacks, I think we should still have a sub command that exports saved results as Prometheus metrics. So the answer is: I think we need both, and they should share as much code as possible. The only difference between doing it in attack and doing it in another sub-command is that in attack we can observe the metric without decoding the result first. I suggest we introduce a lib/prom package where all of this is encapsulated and modularized. Also, very much up for hacking on this together, so let me know how you'd want to go about it.
This PR already creates a lib/prom package. I just reviewed and fixed all comments that were open (after 3y lol!). Please do another review round and mark comments as resolved if they are ok.
Force-pushed from 44c66f3 to 7962a87
Signed-off-by: Flávio Stutz <flaviostutz@gmail.com>
Force-pushed from 7962a87 to a998c8e
I squashed the messy commits (various upstream merges and small changes) from my branch into one and applied the PGP signatures so you can merge it to master 😁
Thank you for moving this forward! Out of all my comments, let me know what you want to work on. I'm happy to address my own feedback on top of your changes. I'll also implement the `prom` sub-command that I mentioned.
…es with P90; minor changes
Signed-off-by: Flávio Stutz <flaviostutz@gmail.com>
@tsenart Pushed everything I changed here. In summary, I tried to resolve all comments, so please do another round of checks. As this PR is already too big, I'd advise us to merge it as soon as possible and build the "prom" command in another one.
Signed-off-by: Tomás Senart <tsenart@gmail.com>
@daluu #534 (comment) As info: I use a PushGateway and delete all the Vegeta-created metrics at the end, so I see no problem with the pushgateway:

```go
import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"

	vegeta "github.com/tsenart/vegeta/v12/lib"
	"github.com/tsenart/vegeta/v12/lib/prom"
)

// PushGatewayRegister adapts a push.Pusher to the prometheus.Registerer
// interface, so prom.Metrics can register its collectors with a pusher.
type PushGatewayRegister struct {
	Pusher *push.Pusher
}

// MustRegister implements prometheus.Registerer.
func (p *PushGatewayRegister) MustRegister(c ...prometheus.Collector) {
	for _, v := range c {
		p.Pusher = p.Pusher.Collector(v)
	}
}

// Register implements prometheus.Registerer.
func (p *PushGatewayRegister) Register(c prometheus.Collector) error {
	p.Pusher = p.Pusher.Collector(c)
	return nil
}

// Unregister implements prometheus.Registerer.
func (p *PushGatewayRegister) Unregister(c prometheus.Collector) bool {
	return true
}

func LoadTest() {
	metrics := prom.NewMetrics()
	pusher := push.New("url", "job")
	if err := metrics.Register(&PushGatewayRegister{Pusher: pusher}); err != nil {
		log.Fatal(err)
	}

	// The attack setup was elided in the original comment; these are
	// placeholder values so the example compiles.
	rate := vegeta.Rate{Freq: 10, Per: time.Second}
	duration := 1 * time.Minute
	targeter := vegeta.NewStaticTargeter(vegeta.Target{Method: "GET", URL: "http://localhost:8080/"})
	attacker := vegeta.NewAttacker()

	for res := range attacker.Attack(targeter, rate, duration, "Big Bang!") {
		metrics.Observe(res)
		if err := pusher.Push(); err != nil {
			log.Printf("push error: %v", err)
		}
	}

	// That's the point: Delete removes all metrics created by this pusher
	// from the pushgateway.
	_ = pusher.Delete()
}
```
but change my mind
As long as that works out in the end, then yes, it should be fine. I would advise testing that this works as anticipated: send some metrics, then stop sending any more for some time (e.g. over 15 minutes to an hour or more), and verify that Grafana (or an equivalent data viewer) properly renders the metric as discrete data points for when metrics were sent, rather than as an extrapolated line that continues through the current time even though no more metrics were sent. Expecting how something works in theory is different from actually validating it with testing. So as long as this special handling in the push gateway has been confirmed to work as expected, then it sounds good. Note that this approach may work better when you have more direct access to the push gateway interface; I'm not aware that the library/interface we were using at the time had a way to "delete" metrics on the push gateway, we only sent metrics to it.

I'm not sure what is meant by "but change my mind"; it's a little vague to interpret. Did you mean that, unless someone can convince you to change your mind, the pushgateway works fine for you? Or did you mean that, despite what you mentioned, you have decided to change your mind about using the push gateway approach?
@daluu My Grafana graph looks like this: [Grafana screenshot: data points appear only during the attack window]. As you can see, there is data only at the time of the attack. So I think advising against the pushgateway (see readme) is incorrect.
Yes, that makes sense. I take back my prior advice, but with the caveat/warning that the user must properly utilize the push gateway logic: if you omit the delete step at the end, I believe you will run into the concern I previously mentioned (one can try to confirm it). So if we're building a solution and documentation here, we need to account for that to ensure a proper, successful deployment.

@fasibio, curious what made you issue a delete at the end of pushing metrics? Were you somehow aware of the need for it (or the issue when you don't delete), did you come across it by trial & error, or did you find it documented somewhere? Unless I overlooked the documentation/example code, as far as I can recall I don't remember seeing documentation or example code telling the user to issue a delete after pushing out metrics.

It's not so intuitive to me how the push gateway was designed. One would think the push (and pull) model is discrete: you send/poll data, you get data; when no push or poll is occurring, there should be no data. But the push gateway holds on to the data for continuous forwarding to Prometheus unless you clear it out specifically when the metric values don't change.
@daluu Simple: I followed the pushgateway "Use it" docs (not Go-specific), and the curl DELETE command is part of them. @tsenart see discussion; it might make sense to update the readme.
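For reference, the delete in that "Use it" section is a plain HTTP DELETE against the job's metric group; the host and job name below are placeholders:

```sh
# Remove all metrics grouped under the "vegeta" job from the Pushgateway.
curl -X DELETE http://pushgateway.example.org:9091/metrics/job/vegeta
```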
Background
This is a complete initial implementation of Prometheus Exporter support as discussed in #477