runaway rabbitmq queues #398
Comments
3 days ago I upgraded my Erlang version from R14B04 to R15B02 and since then we've had no issues with Sensu. Closing this issue with cautious optimism! 😃
The non-draining keepalive and results RabbitMQ queues are back on my production Sensu server, even though the sensu-server process appears healthy. If I stop and start sensu-server without purging the queues, they never drain (or at least still haven't after an hour), and I end up manually purging them just to get our sensu-dashboard to update and to force stale alerts to recover.
I recommend adding a second Sensu server, as the work will be distributed between them. The 0.9.7 build introduced mutators; the built-in ones unnecessarily use the thread pool, which has been changed in the 0.9.8 betas. I'm currently experimenting w/ AMQP pre-fetching/QoS to stop RabbitMQ from flooding a single Sensu server's in-memory buffers w/ results & keepalives.
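The prefetch/QoS idea can be sketched with a small pure-Ruby model (no real broker; `FakeBroker` and the limit of 5 are illustrative assumptions, not Sensu or ruby-amqp code): the broker only pushes a message while the consumer has fewer unacknowledged messages than the prefetch limit, which bounds the consumer's in-memory buffer no matter how deep the queue gets.

```ruby
# Toy model of AMQP prefetch/QoS -- illustrative only, not Sensu code.
# The broker delivers a message only while the consumer's unacked
# count is below the prefetch limit, bounding in-memory buffering.
class FakeBroker
  attr_reader :max_in_flight

  def initialize(prefetch:)
    @queue = []
    @unacked = 0
    @prefetch = prefetch
    @max_in_flight = 0
  end

  def publish(msg)
    @queue << msg
  end

  # Deliver as many messages as QoS allows, yielding each to the consumer.
  def deliver
    while @unacked < @prefetch && (msg = @queue.shift)
      @unacked += 1
      @max_in_flight = [@max_in_flight, @unacked].max
      yield msg
    end
  end

  def ack
    @unacked -= 1
  end
end

broker = FakeBroker.new(prefetch: 5)
10_000.times { |i| broker.publish("result-#{i}") }

processed = 0
until processed == 10_000
  batch = []
  broker.deliver { |msg| batch << msg } # broker stops at 5 in flight
  batch.each { broker.ack; processed += 1 }
end

puts broker.max_in_flight # => 5, never exceeds the prefetch limit
```

Without the prefetch check in `deliver`, the broker would push all 10,000 messages at once, which is the buffer-flooding behaviour described above.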
We brought up a second sensu-server VM on 10/24 and so far it's been completely stable through deploys. I'll continue to keep an eye on it over the next week. Thanks @portertech !
Trying #409 to help with this.
Hi, since all checks run periodically, after starting the server you should very quickly end up with an up-to-date state of affairs, without having to process untold thousands of past check results. One approach is to set a TTL at the queue or message level; I hope this would also limit storage size, but I'm not sure.

Another option (on top of that, or instead) is to use proper exchanges rather than writing to the results queue directly, which is really discouraged (http://rdoc.info/github/ruby-amqp/amqp/master/AMQP/Queue:publish). That way, the queue can be set to auto-delete when there's been no consumer around for a while. Using exchanges makes it a bit more hassle to ensure reliability when the clients, the server, or RabbitMQ go down, but it can definitely be done. The exchange must be created before reading from or writing to the queue, and it should not auto-delete; otherwise everything breaks quite quickly. [edit: we would lose all metrics sent during server downtime, but that's OK IMHO]
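The per-message TTL idea can be modelled in a few lines of plain Ruby (a toy sketch, not RabbitMQ internals; `TtlQueue` and the 60-second TTL are made-up for illustration): messages older than the TTL are discarded instead of delivered, so a server that comes back after downtime only sees recent results.

```ruby
# Toy model of per-message TTL -- illustrative only, not RabbitMQ internals.
# Expired messages are discarded at delivery time, so a consumer that
# comes back after downtime never churns through stale results.
class TtlQueue
  Entry = Struct.new(:body, :enqueued_at)

  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @entries = []
  end

  def publish(body, now:)
    @entries << Entry.new(body, now)
  end

  # Pop the next live message, silently dropping expired ones.
  def pop(now:)
    while (e = @entries.shift)
      return e.body if now - e.enqueued_at <= @ttl
    end
    nil
  end
end

q = TtlQueue.new(60) # 60-second TTL
q.publish("stale-result", now: 0)
q.publish("fresh-result", now: 90)

# Server comes back at t=100: the 100s-old message is dropped,
# the 10s-old one survives.
puts q.pop(now: 100)         # => fresh-result
puts q.pop(now: 100).inspect # => nil
```

In real RabbitMQ the same effect comes from the `x-message-ttl` queue argument (or a per-message `expiration` property), with the broker doing the expiry rather than the consumer.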
@eladroz doesn't that mean that, given what you described, the template for such messaging semantics is XMPP rather than AMQP?
@jondot, can you elaborate? Do you mean something more real-time, which is valid only right now and only if the receiver is online?
Yes. Like you, I've been wondering what the value is of a Sensu server recovering from a crash, just to churn through all the (now probably stale) data that stacked up in the queue. If one strategy to resolve this is the Sensu server draining the queue on recovery (or the solution you've proposed), then what's the purpose of having a queue at all? This question interested me so much that I had a talk with our ops guy, and we looked at three such commercial and clustered systems; none of them had queues, and all had direct requests from agents. All that's left for me is to assume that in this case, there was more value in AMQP as a protocol and a facilitator of networking patterns than in a queue per se (topical fan-out, and the load balancing / job distribution that happens incidentally through a queue). As an educational exercise, I'd love to know the real reason for this in Sensu; perhaps there's also a lesson these commercial tools need to learn.
I also share your assumption: AMQP is just one way to not have to worry about whether the receiver(s) can handle N simultaneous connections or are available at that second, and to not have to know who your clients are or how many servers there are... Btw, I couldn't work out where you work (when you said "our Ops guy") - but maybe that's to remain a great mystery?... ;-)
0.9.11 no longer uses the default exchange, or publishes directly to queues. https://github.com/sensu/sensu/blob/master/CHANGELOG.md#0911---2013-02-22 Adding result acks seems to throttle consumption enough to stop the strange ruby-amqp DoS behaviour. @bethanybenzur Have you experienced this issue lately? Has your Sensu infrastructure grown?
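The result-ack behaviour mentioned above can be sketched with a small pure-Ruby model (a toy `AckQueue`, not ruby-amqp's actual API): a delivered-but-unacknowledged message is requeued if the consumer fails, so results are not lost, and acking only after processing is what lets the broker pace delivery against the outstanding (unacked) window.

```ruby
# Toy model of explicit message acks -- illustrative, not ruby-amqp's API.
# A delivered-but-unacked message is requeued if the consumer fails,
# so results are not lost; acking only after processing also paces
# delivery, since the broker tracks the outstanding (unacked) window.
class AckQueue
  def initialize
    @ready = []
    @unacked = {}
    @next_tag = 0
  end

  def publish(msg)
    @ready << msg
  end

  # Deliver one message; it stays "unacked" until acked or requeued.
  def get
    msg = @ready.shift or return nil
    tag = (@next_tag += 1)
    @unacked[tag] = msg
    [tag, msg]
  end

  def ack(tag)
    @unacked.delete(tag)
  end

  # Consumer died: everything it held goes back to the head of the queue.
  def requeue_all
    @ready.unshift(*@unacked.values.tap { @unacked.clear })
  end
end

q = AckQueue.new
q.publish("result-1")
q.publish("result-2")

tag, msg = q.get # consumer receives "result-1" but crashes before acking
q.requeue_all    # broker puts it back at the head of the queue

tag, msg = q.get
q.ack(tag)       # processed successfully this time
puts msg         # => result-1
```

With auto-ack, the crashed consumer in this scenario would have silently dropped "result-1".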
@portertech no such issues in a while. Our Sensu infrastructure has not changed - still running four Sensu servers, and am considering scaling that down to three. We're still running 0.9.9 in prod but will be upgrading to 0.9.11 this week.
@skymob I would hold off on the 0.9.11 upgrade, as 0.9.12 will be dropping very soon, with several improvements.
On 9/26 I upgraded sensu from 9.6-4 to 9.7-1. On 10/10 I received an alert from my external monitoring service saying that there was no disk space left on my sensu server. /var/lib/rabbitmq was filling up because one of the queues had no consumers. Stopping sensu-server, purging the queue and restarting sensu-server resolved the issue.
On 10/11 I upgraded rabbitmq to the latest stable 2.8.7 and made sure all of my sensu clients were running 9.7-1, but on 10/12 and every day since we've had one or two instances of the results queue (or both the results and keepalives queues) filling up because nothing is consuming them, and have had to repeat the stop, purge, restart procedure to get our alerts back. There is nothing useful or unusual in the rabbitmq log or sensu-server log.
Per @portertech, going to try this patch this afternoon to see if we can get some better info in the logs: https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L79-87