Ruby-based handlers are expensive to spawn, causing high CPU usage/load average #425
Comments
Just catching up, but have you looked at the other Handler types? TCP, UDP, AMQP.
Perhaps we could experiment w/ a plugin architecture for mutators. Sensu could provide a base mutator class, glob library loading, e.g. "sensu-mutator-statsd" providing sensu/mutators/statsd?
I think it's a good place to start with a more flexible solution. Naming conventions are always fishy :-) This would remove the need for a handler class when you just want to shape the data & ship it out. Hmmm
@eladroz I've been able to push > 1k events/sec through a pipe handler that I wrote in C. spawn() is cheap, it's just the mutator or handler VM that makes it dog slow :(
@eladroz You're doing a bunch of data manipulation, which could be done w/ a mutator, perhaps a quality mutator in C for converting Graphite format into various other formats (StatsD, OpenTSDB, etc).
So we're back to C now? Let me grow a beard then ;-)
@eladroz I'm going to throw together a Sensu::Extension class, w/ :type (check, mutator, handler), providing a simple API to hack on.
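To make the idea concrete, here's a rough Ruby sketch of what such an in-process extension might look like. Everything below (the `descendants` discovery helper, the `DebugHandler` name, the `[output, status]` return convention) is illustrative, not the actual API being proposed:

```ruby
module Sensu
  class Extension
    # Discover loaded subclasses so the server can register them
    # automatically after requiring files from an extensions dir.
    def self.descendants
      ObjectSpace.each_object(Class).select { |klass| klass < self }
    end
  end
end

# An in-process handler: no fork/exec, no new Ruby VM per event.
class DebugHandler < Sensu::Extension
  def type
    :handler
  end

  def name
    'debug'
  end

  def run(event)
    # Return [output, status] instead of exiting a child process.
    [event.inspect, 0]
  end
end
```

The key difference from a pipe handler is that `run` is invoked inside the server's existing Ruby VM, so handling an event costs a method call rather than a fork/exec plus a fresh VM boot.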
Wow, really looks like a great start! Two smallish thoughts:
@eladroz (a) no chaining, keeping it simple (b) will add a cli arg "-e" to point sensu at an additional dir to require. Made a few changes to https://github.com/portertech/sensu/compare/extension :)
"Now witness the firepower of this fully ARMED and OPERATIONAL battle station!"
"Sometimes I feel like I could take on the whole Empire" This is rad. It strikes me that this will create some confusion as to when to use extension mutators/checks/handlers, TCP/UDP handlers, or spawned mutators/checks/handlers. It seems like there is some overlap, which creates maintenance overhead and configuration ambiguity. Maybe TCP/UDP handlers could be refactored as extensions? Maybe spawned mutators can get yanked (since they are heavy and often used for high-throughput metrics, and there is a small set of standard ones)? Great functionality, definitely addresses a common complaint. I just think of helping people install/configure Sensu in IRC, or over a beer, and it's getting more complex ;)
This "feature" is intended only for those who are capable of Sensu spelunking; it is not "safe" to suggest most of the community even touch this.
@nstielau I think it's important to push check, mutator, and handler scripting, this is more for the elitists :P
Zrrm, yeah OK. Maybe we can start charging for spelunking tours: "put on your helmet and turn on your flashlight, this might get a lil' messy" ;)
I love the new extension stuff. I'm trying to port my statsd ruby handler to an extension. With the number of metrics we receive, trying to spawn many tens of processes means Sensu can't process keepalives anymore, so all the boxes start looking dead. I'm sure it's mostly starting up the VM. The part I'm stuck on now is that sensu-plugin handlers have access to the settings. They parse them when they start up, and it's still a very common idiom in the community plugins repo. Any reason we can't have access to the base settings in the extensions?
I've merged #443 for settings in extensions into master, which will be in the next build (0.9.10). I'm closing this issue as handler and mutator extension support was added in 0.9.9, and is currently in use. |
Hi, so...
Bottom-line
Seems that when I have Ruby handlers for metrics (or if many non-ok events are gonna fire at some point), it's very easy for the server to become a resource hog.
Sorry for the scroll, I want to share the data...
The details
On our first production deployment of Sensu, there are about ~15 machines, each with a set of about 3-5 standalone metric checks running every 30 seconds. That means about 2 checks per second arriving at the server, a small VM with 2 cores & 2GB RAM (grep'ing the actual log yielded a count of 132 metric-type results per minute).
I'm not counting all the other non-metric checks here, which do not cause handlers to run since they are almost all in an OK state.
Metrics arrive in Graphite format, and I'm using a custom handler to transform them into StatsD format (this is a server-side optimization, so all metrics in the landscape are only flushed periodically).
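For illustration, a minimal Ruby sketch of the kind of conversion such a handler performs (the function name is mine; it assumes Graphite's plaintext protocol, `<path> <value> <timestamp>`, and emits StatsD gauges, `<path>:<value>|g`):

```ruby
# Convert Graphite plaintext metric lines into StatsD gauge lines.
# The Graphite timestamp is dropped because StatsD stamps metrics
# itself at flush time.
def graphite_to_statsd(output)
  output.each_line.map do |line|
    path, value, _timestamp = line.split(' ')
    next if path.nil? || value.nil?
    "#{path}:#{value}|g"
  end.compact
end
```

The work itself is trivial string munging; the point of the issue is that paying a full Ruby VM startup per event just to run a function like this is what drives the load up.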
The resulting load is sometimes fairly low (0.2-0.3), but might climb to 0.5, 1.0 or even more...
The investigation :-)
It's actually hard to see the culprit via top, so the first suspect was actually Graphite (currently running on the same machine), especially since it's a VM. However, stopping both StatsD & Graphite had no effect on the load.
When I directed the metrics to a kind of very cheap "dummy" handler (sending the results via UDP to a random port), load dropped to practically zero.
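Conceptually, that "dummy" path does almost nothing in-process: write the event data to a UDP socket and move on. A sketch of the idea (function name, host, and port are mine, not Sensu's code):

```ruby
require 'socket'

# Fire-and-forget: push the raw event data at a UDP port and return.
# There is no fork/exec and no child VM to boot, which is why the
# load drops to practically zero with this kind of handler.
def udp_dummy_handler(data, host = '127.0.0.1', port = 9999)
  socket = UDPSocket.new
  socket.send(data, 0, host, port)
ensure
  socket.close if socket
end
```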
I patched the server code to do the same thing the handler does, just to see what the cost of that would be. Again, very low load, even when changing metric checks to 1-second intervals on a few nodes.
What can be done?
Usually handlers don't fire much; however, imagine we have an extensive set of checks, and then the DB goes down and everything starts screaming... so the monitor slows down just when you need it most.
I think we may need to allow in-process plugins or something similar, in some form. I know, I know...but it's so much more efficient, and the problem is real. The server itself seems very efficient, but the handler part...
When you start the server after it has been down for some time, you feel a REAL crunch, with the load average going through the roof (this is also somewhat related to #398 I guess... but that's a different story).
Maybe we can think of this in the context of a plugin architecture inside the server, allowing custom mutators, handlers, or whatever to be registered without a huge penalty. For example, I want to store the last output of any metric received, so I can fetch that data, show it, and alert on it (server-side) without going to Graphite. Another example: I want a "positive confirmation" for each check, ensuring that it actually ran OK. These may be features, or plugins... but they need to run very efficiently.
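The first example could be as simple as an in-process plugin keeping a hash in memory; a hedged Ruby sketch (class and method names are hypothetical, not a proposed API):

```ruby
# Keep the most recent output per (client, check) pair in memory so it
# can be served or alerted on without a round trip to Graphite.
class LastOutputStore
  def initialize
    @last = {}
  end

  def record(event)
    key = [event[:client], event[:check]]
    @last[key] = event[:output]
  end

  def fetch(client, check)
    @last[[client, check]]
  end
end
```

A real version would need bounding/expiry so the hash doesn't grow without limit, but the per-event cost is a hash write, which is exactly the efficiency profile being asked for.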
/scroll ends