Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Prometheus Plugin #692

Merged
merged 6 commits into from
Jul 28, 2016
Merged

Added Prometheus Plugin #692

merged 6 commits into from
Jul 28, 2016

Conversation

pradeepchhetri
Copy link
Member

This pull request adds support for Prometheus backend. Since Prometheus uses pull model for collecting metrics, it provides a Pushgateway server which acts as an intermediate proxy for pushing datapoints.

Sample Screenshot:

screen shot 2016-05-22 at 6 22 33 pm

@pradeepchhetri pradeepchhetri changed the title Added Prometheus Plugin (https://prometheus.io) Added Prometheus Plugin May 22, 2016
@pradeepchhetri
Copy link
Member Author

Can someone please review this pull request ?

@jamtur01
Copy link
Member

@pradeepchhetri Might be a little while - we're all very part time maintainers! I'll try to look on the weekend. But please be patient. :)

@pradeepchhetri
Copy link
Member Author

@jamtur01 James, thank you for the quick reply. :)

@mfournier
Copy link

This looks promising @pradeepchhetri ! Thanks for working on this !

One thing that strikes me is that the attributes from the riemann event get dropped. Only tags and the "host" field seem to get submitted to the push-gateway. IMHO it's important to also pass (most of) the attributes along to prometheus. We'll typically want to see "service", "state" and friends on the prometheus side too. What do you think ?

@pradeepchhetri
Copy link
Member Author

pradeepchhetri commented May 28, 2016

@mfournier Thank you for looking into it. We are actually passing the "service" and "metric" fields as the body of the http post request (https://github.com/pradeepchhetri/riemann/blob/prometheus/src/riemann/prometheus.clj#L18-L22). I wasn't able to figure out how to pass "state" field into each datapoint.
screen shot 2016-05-28 at 5 44 31 pm

The reason i am not attaching the timestamp is because prometheus overwrites the timestamp with the time it scrapes pushgateway. For more details: https://github.com/prometheus/pushgateway#about-timestamps

Let me know if you feel we can improve the plugin in some way.

@mfournier
Copy link

Oh, just noticed this problem when submitting an event with an empty metric field:

clojure.lang.ExceptionInfo: clj-http: status 500 {:status 500, :headers {"Content-Type" "text/plain; charset=utf-8", "X-Content-Type-Options" "nosniff", "Date" "Sat, 28 May 2016 12:32:59 GMT", "Content-Length" "84", "Connection" "close"}, :body "text format parsing error in line 1: expected float as value, got \"5940577/1000000\"\n", :request-time 3, :trace-redirects ["http://localhost:9091/metrics/job/riemann/host/lonquimay"], :orig-content-encoding nil}

Here is what the event looked like when dumped to the logs:

INFO [2016-05-28 14:35:22,268] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host toto, :service foobar, :state nil, :description nil, :metric nil, :tags nil, :time 1.464438922266E9, :ttl 60, :x-client riemann-c-client}

(fn [event]
(let [url (generate-url opts event)
datapoint (generate-datapoint event)]
(when (and (:metric event) (:service event))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this logic repeated? Don't you already check for :service and :metric in generate-datapoint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. I will fix it.

@jamtur01
Copy link
Member

@pradeepchhetri Can you pass :state et al as a label? Like you did tags?

@pradeepchhetri
Copy link
Member Author

@mfournier I can reproduce the error. Looking into it.

@mfournier
Copy link

Looking at the implementation more in details, I must say I'm not really sold to the tagN, tagN+1, etc, idea. On the prometheus side, building a graph/alert rule based on this implies knowning in advance how many tags there are, and then searching each of them for the element we're looking for. Wouldn't it be better to join them together in a single label, seperated by a comma for example (better: configurable character) ? ie: some_event{tags="foo,bar,baz"}. Then it would simply be a matter of using the =~ operator in the prometheus query.

About the attributes, again I think it's really important to keep them all. You want as much info as possible about your events in prometheus too. Cleaning up events from extraneous tags/attributes before submission to prometheus is easy in riemann. Having to work with incomplete data in prometheus, less so.

So here's a sample event as seen from riemann:
#riemann.codec.Event{:host toto, :service example, :state ok, :description this is some description, :metric 123, :tags [foo bar blah], :time 1.464465272033E9, :ttl 60, :x-client riemann-c-client, :attr1 val1, :attr2 val2}
...and here is what it looks like in prometheus:
example{host="toto",instance="",job="riemann",tag0="foo",tag1="bar",tag2="blah"} 123
My suggestion is that it should look like:
example{host="toto",instance="",job="riemann",state="ok",description="this is some description",x-client="riemann-c-client",attr1="val1",attr2="val2",tags="foo,bar,blah"} 123

Finally, a couple of thoughts/wishlist for later (ie: I non-requirements for an initial implementation imho):

  • ability to delete metrics. typically to call from the (expired) function in riemann.
  • add the riemann server's hostname in the "instance" field.
  • ability to add the timestamp & ttl to outgoing events, as an option. We're basically at the interface between a push-based and a pull-based system here, so there is a risk of missing ephemeral state changes, detecting stale data will be tricky, etc. By providing this, we give users a key element to make it less painful.
  • also wondering if firing an http request for every event, at high rate, won't be harmful to the pushgateway. But I'm not sure it's really supposed to be used with frequently updated metrics anyways.

@faxm0dem
Copy link
Contributor

👍 for keeping all attributes

@brian-brazil
Copy link

Prometheus developer here. I'm not too familiar with Riemann, I've only read through the docs a few times.

Wouldn't it be better to join them together in a single label, seperated by a comma for example (better: configurable character)

This is our standard way of dealing with lists of items for service discovery. We also prefix and suffix a comma so that it's easier to write the regex so tags=",foo,bar,baz,". http://www.robustperception.io/little-things-matter/ explains why.

It provides a Pushgateway server which acts as an intermediate proxy for pushing datapoints.

This is explicitly not what the pushgateway is for, see https://prometheus.io/docs/practices/pushing/

I think what you want here is two fold. First something similar to the influxdb and graphite exporters that can take in data in the Riemann format and expose it in the Prometheus data format. Secondly an integration in Riemann itself to talk out to the Prometheus Alertmanager (https://prometheus.io/docs/alerting/clients/) as you would something like Pagerduty.

We're basically at the interface between a push-based and a pull-based system here, so there is a risk of missing ephemeral state changes, detecting stale data will be tricky, etc

Push vs pull is not the issue here, there's ways to handle that. The real challenge is that this is an interface between an event logging system and a metric system. I'm not sure it's sane to try an create a generic zero-configuration link between such systems, something more query based is probably the most you can do (https://github.com/chop-dbhi/prometheus-sql is probably the closest example, and https://github.com/prometheus/nagios_plugins from the other direction).

You want your time series to be continuous over time, which doesn't work with a dynamic label such as state as every time it changes you get a new time series which will not work out well either semantically (staleness) or performance (churn) wise.

But I'm not sure it's really supposed to be used with frequently updated metrics anyways.

No, the intention is relativity rarely run service-level batch jobs that might have say a few pushes a minute in aggregate. Accordingly we haven't done any tuning or benchmarking of the pushgateway.

@brian-brazil
Copy link

Taking a bit of a look at your data format, your host maps to our instance and your service probably maps to our job label.

@pradeepchhetri
Copy link
Member Author

@brian-brazil Thank you for joining the discussion and providing great suggestions in improving the plugin.

One quick question: Is pushgateway the only way to push metrics to prometheus. Since riemann pushes streams of metrics while pushgateway is for batch kind of jobs, I personally feel that there is high possiblity of missing some datapoints while pushing through pushgateway. Can i push datapoints directly to prometheus in realtime.

@brian-brazil
Copy link

I personally feel that there is high possiblity of missing some datapoints while pushing through pushgateway. Can i push datapoints directly to prometheus in realtime.

We don't and won't support this. However if you look at things like the influxdb, collectd and graphite exporters they all take in realtime pushes of metrics.

You can lose information with exporter style approaches depending on exactly what you're doing, which is why we recommend using the Prometheus client libraries which are designed to be more resilient (you could even get them to output data in the Riemann format).

Since riemann pushes streams of metrics while pushgateway is for batch kind of jobs

As far as I can tell Riemann pushes events, not metrics. This mismatch is what makes this trickier than usual, as it's not just the usual question of how to map your tags and attributes to ours.

How has this problem been solved with graphite, opentsdb and kariosdb for riemann? They also deal in metrics rather than events.

@pradeepchhetri
Copy link
Member Author

pradeepchhetri commented May 30, 2016

How has this problem been solved with graphite, opentsdb and kariosdb for riemann? They also deal in metrics rather than events.

Graphite, opentsdb, kairosdb & other timeseries databases pre-define the attributes for each datapoint while riemann provides the flexibility of adding additional attributes apart from the pre-defined ones. Hence a riemann event is just a superset of each of these timeseries database metric.

@pradeepchhetri
Copy link
Member Author

pradeepchhetri commented May 30, 2016

@jamtur01 @mfournier I have updated the plugin with the following fixes:

  • Fixed the issue when there is an empty riemann attribute.
  • Fixed the tags.
  • Added extra test for the case with tags.
  • Any extra riemann attributes will be available as prometheus label.

screen shot 2016-05-30 at 8 13 28 pm

@pradeepchhetri
Copy link
Member Author

Bumping this.


(def special-fields
"A set of event fields in Riemann with special handling logic."
#{:service :metric :tags :time :ttl})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there needs to be some way to exclude fields from labels. Let's say I create a new field with a value, like a metric, for which a label doesn't make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

special-fields is the variable for excluding riemann fields from prometheus labels (probably should give better name). Also should expose it as configuration variable so that anyone can overwrite it. @jamtur01 Does that sounds right ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I meant exposing it.

@pradeepchhetri
Copy link
Member Author

Updated plugin with the above feedbacks.

@pradeepchhetri
Copy link
Member Author

Bumping this :)

@jamtur01
Copy link
Member

jamtur01 commented Jun 9, 2016

@pradeepchhetri You can keep bumping it but I am afraid it's still going to be "we'll get to it soon"!

@jamtur01
Copy link
Member

Thanks @pradeepchhetri !

@jamtur01 jamtur01 merged commit 9d522ce into riemann:master Jul 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants