Ruby-based handlers are expensive to spawn, causing high CPU usage/load average #425
Comments
Just catching up, but have you looked at the other Handler types? TCP, UDP, AMQP.
Perhaps we could experiment w/ a plugin architecture for mutators. Sensu could provide a base mutator class, glob library loading, e.g. "sensu-mutator-statsd" providing sensu/mutators/statsd?
I think it's a good place to start with a more flexible solution. Naming conventions are always fishy :-) This would remove the need for a handler class when you just want to shape the data & ship it out. Hmmm
@eladroz I've been able to push > 1k events/sec through a pipe handler that I wrote in C. spawn() is cheap, it's just the mutator or handler VM that makes it dog slow :(
@eladroz You're doing a bunch of data manipulation, which could be done w/ a mutator, perhaps a quality mutator in C for converting Graphite format into various other formats (StatsD, OpenTSDB, etc).
So we're back to C now? Let me grow a beard then ;-)
@eladroz I'm going to throw together a Sensu::Extension class, w/ :type (check, mutator, handler), providing a simple API to hack on.
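To make the idea concrete, here's a rough Ruby sketch of what such an in-process extension might look like. Everything below (the `descendants` discovery helper, the `DebugHandler` name, the `[output, status]` return convention) is illustrative, not the actual API being proposed:

```ruby
module Sensu
  class Extension
    # Discover loaded subclasses so the server can register them
    # automatically after requiring files from an extensions dir.
    def self.descendants
      ObjectSpace.each_object(Class).select { |klass| klass < self }
    end
  end
end

# An in-process handler: no fork/exec, no new Ruby VM per event.
class DebugHandler < Sensu::Extension
  def type
    :handler
  end

  def name
    'debug'
  end

  def run(event)
    # Return [output, status] instead of exiting a child process.
    [event.inspect, 0]
  end
end
```

The key difference from a pipe handler is that `run` is invoked inside the server's existing Ruby VM, so handling an event costs a method call rather than a fork/exec plus a fresh VM boot.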
Wow, really looks like a great start! Two smallish thoughts:
@eladroz (a) no chaining, keeping it simple (b) will add a cli arg "-e" to point sensu at an additional dir to require. Made a few changes to https://github.com/portertech/sensu/compare/extension :)
"Now witness the firepower of this fully ARMED and OPERATIONAL battle station!"
"Sometimes I feel like I could take on the whole Empire" This is rad. It strikes me that this will create some confusion as to when to use extension mutators/checks/handlers, TCP/UDP handlers, or spawned mutators/checks/handlers. It seems like there is some overlap, which creates maintenance overhead and configuration ambiguity. Maybe TCP/UDP handlers could be refactored as extensions? Maybe spawned mutators can get yanked (since they are heavy and often used for high-throughput metrics, and there is a small set of standard ones)? Great functionality, definitely addresses a common complaint. I just think of helping people install/configure Sensu in IRC, or over a beer, and it's getting more complex ;)
This "feature" is intended only for those who are capable of Sensu spelunking; it is not "safe" to suggest most of the community even touch this.
@nstielau I think it's important to push check, mutator, and handler scripting, this is more for the elitists :P
Zrrm, yeah OK. Maybe we can start charging for spelunking tours: "put on your helmet and turn on your flashlight, this might get a lil' messy" ;)
I love the new extension stuff. I'm trying to port my statsd ruby handler to an extension. With the number of metrics we receive, trying to spawn many tens of processes means Sensu can't process keepalives anymore, so all the boxes start looking dead. I'm sure it's mostly starting up the VM. The part I'm stuck on now is that sensu-plugin handlers have access to the settings. They parse them when they start up, and it's still a very common idiom in the community plugins repo. Any reason we can't have access to the base settings in the extensions?
I've merged #443 for settings in extensions into master, which will be in the next build (0.9.10). I'm closing this issue as handler and mutator extension support was added in 0.9.9, and is currently in use. |
Hi, so...
Bottom-line
Seems that when I have Ruby handlers for metrics (or if many non-ok events are gonna fire at some point), it's very easy for the server to become a resource hog.
Sorry for the scroll, I want to share the data...
The details
On our first production deployment of Sensu, there are about ~15 machines, each with a set of about 3-5 standalone metric checks running every 30 seconds. That means about 2 checks per second arriving at the server, a small VM with 2 cores & 2GB RAM (grep'ing the actual log yielded a count of 132 metric-type results per minute).
I'm not counting all the other non-metric checks here, which do not cause handlers to run since they are almost all in an OK state.
Metrics arrive in Graphite format, and I'm using a custom handler to transform them into StatsD format (this is a server-side optimization, so all metrics in the landscape are only flushed periodically).
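For illustration, a minimal Ruby sketch of the kind of conversion such a handler performs (the function name is mine; it assumes Graphite's plaintext protocol, `<path> <value> <timestamp>`, and emits StatsD gauges, `<path>:<value>|g`):

```ruby
# Convert Graphite plaintext metric lines into StatsD gauge lines.
# The Graphite timestamp is dropped because StatsD stamps metrics
# itself at flush time.
def graphite_to_statsd(output)
  output.each_line.map do |line|
    path, value, _timestamp = line.split(' ')
    next if path.nil? || value.nil?
    "#{path}:#{value}|g"
  end.compact
end
```

The work itself is trivial string munging; the point of the issue is that paying a full Ruby VM startup per event just to run a function like this is what drives the load up.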
The resulting load is sometimes fairly low (0.2-0.3), but might climb to 0.5, 1.0 or even more...
The investigation :-)
It's actually hard to see the culprit via top, so the first suspect was actually Graphite (currently running on the same machine), especially since it's a VM. However, stopping both StatsD & Graphite had no effect on the load.
When I directed the metrics to a kind of very cheap "dummy" handler (sending the results via UDP to a random port), load dropped to practically zero.
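Conceptually, that "dummy" path does almost nothing in-process: write the event data to a UDP socket and move on. A sketch of the idea (function name, host, and port are mine, not Sensu's code):

```ruby
require 'socket'

# Fire-and-forget: push the raw event data at a UDP port and return.
# There is no fork/exec and no child VM to boot, which is why the
# load drops to practically zero with this kind of handler.
def udp_dummy_handler(data, host = '127.0.0.1', port = 9999)
  socket = UDPSocket.new
  socket.send(data, 0, host, port)
ensure
  socket.close if socket
end
```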
I patched the server code to do the same thing the handler does, just to see what the cost of that would be. Again, very low load, even when changing metric checks to 1-second intervals on a few nodes.
What can be done?
Usually handlers don't fire much; however, imagine we have an extensive set of checks, and then the DB goes down and everything starts screaming... so the monitor slows down just when you need it most.
I think we may need to allow in-process plugins or something similar, in some form. I know, I know...but it's so much more efficient, and the problem is real. The server itself seems very efficient, but the handler part...
When you start the server after it has been down for some time, you feel a REAL crunch, with the load average going through the roof (this is also somewhat related to #398 I guess... but that's a different story).
Maybe we can think of this in the context of a plugin architecture inside the server, allowing custom mutators, handlers, or whatever to be registered without a huge penalty. For example, I want to store the last output of any metric received, so I can fetch that data, show it, and alert on it (server-side) without going to Graphite. Another example: I want a "positive confirmation" for each check, ensuring that it actually ran OK. These may be features, or plugins... but they need to run very efficiently.
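The first example could be as simple as an in-process plugin keeping a hash in memory; a hedged Ruby sketch (class and method names are hypothetical, not a proposed API):

```ruby
# Keep the most recent output per (client, check) pair in memory so it
# can be served or alerted on without a round trip to Graphite.
class LastOutputStore
  def initialize
    @last = {}
  end

  def record(event)
    key = [event[:client], event[:check]]
    @last[key] = event[:output]
  end

  def fetch(client, check)
    @last[[client, check]]
  end
end
```

A real version would need bounding/expiry so the hash doesn't grow without limit, but the per-event cost is a hash write, which is exactly the efficiency profile being asked for.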
/scroll ends