This repository has been archived by the owner on Jan 1, 2020. It is now read-only.

runaway rabbitmq queues #398

Closed
bethanybenzur opened this issue Oct 16, 2012 · 13 comments

Comments

@bethanybenzur

On 9/26 I upgraded sensu from 9.6-4 to 9.7-1. On 10/10 I received an alert from my external monitoring service saying that there was no disk space left on my sensu server. /var/lib/rabbitmq was filling up because one of the queues had no consumers. Stopping sensu-server, purging the queue and restarting sensu-server resolved the issue.

On 10/11 I upgraded rabbitmq to the latest stable release, 2.8.7, and made sure all of my sensu clients were running 9.7-1. But on 10/12 and every day since, we've had one or two instances of /results, or both /results and /keepalives, filling up because nothing is consuming them, and we've had to repeat the stop, purge, restart procedure to get our alerts back. There is nothing useful or unusual in the rabbitmq log or the sensu-server log.
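For anyone wanting to catch this before the disk fills, here is a minimal sketch that flags queues holding a backlog with zero consumers. It assumes the tab-separated output of `rabbitmqctl list_queues name messages consumers`; the sample text, the `max_backlog` threshold, and the function names are made up for illustration and are not part of Sensu.

```python
from typing import List, NamedTuple


class QueueStat(NamedTuple):
    name: str
    messages: int
    consumers: int


def parse_list_queues(output: str) -> List[QueueStat]:
    """Parse tab-separated `name messages consumers` rows, skipping banner lines."""
    stats = []
    for line in output.splitlines():
        fields = line.strip().split("\t")
        # Banner or header lines do not have three fields with numeric counts.
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        stats.append(QueueStat(fields[0], int(fields[1]), int(fields[2])))
    return stats


def runaway_queues(stats: List[QueueStat], max_backlog: int = 1000) -> List[QueueStat]:
    """Return queues that hold a backlog while nothing is consuming them."""
    return [q for q in stats if q.consumers == 0 and q.messages > max_backlog]


# Made-up sample output, for illustration only.
SAMPLE = "Listing queues ...\nkeepalives\t12\t1\nresults\t48211\t0\n...done.\n"

if __name__ == "__main__":
    for q in runaway_queues(parse_list_queues(SAMPLE)):
        print(f"{q.name}: {q.messages} messages, {q.consumers} consumers")
```

This only covers the zero-consumer case. A queue that has a registered consumer but still does not drain would need a check on whether the message count keeps growing between samples.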

Per @portertech, going to try this patch this afternoon to see if we can get some better info in the logs: https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L79-87

@bethanybenzur
Author

3 days ago I upgraded my Erlang version from R14B04 to R15B02 and since then we've had no issues with Sensu. Closing this issue with cautious optimism! 😃

@bethanybenzur
Author

Non-draining keepalive and results RabbitMQ queues with a seemingly healthy sensu-server process are back on my production Sensu server. If I stop and start sensu-server without purging the queues, they never drain (or at least still haven't after an hour) and I end up manually purging them just to get our sensu-dashboard to update and to force stale alerts to recover.

@bethanybenzur bethanybenzur reopened this Oct 23, 2012
@portertech
Contributor

I recommend adding a second Sensu server, as work will be distributed. The 0.9.7 build introduced Mutators, and the built-in ones unnecessarily use the thread pool; this has been changed in the 0.9.8 betas. I'm currently experimenting w/ AMQP pre-fetching/QoS to stop RabbitMQ from flooding a single Sensu server's in-memory buffers w/ results & keepalives.
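To illustrate what pre-fetching/QoS changes, here is a toy model of the idea, not RabbitMQ's or Sensu's implementation. In AMQP 0-9-1 the limit is set with `basic.qos` (prefetch count) and only applies to consumers that acknowledge messages manually. The class and method names below are hypothetical.

```python
from collections import deque


class PrefetchingQueue:
    """Toy broker queue: delivers at most `prefetch` unacknowledged messages."""

    def __init__(self, prefetch):
        self.prefetch = prefetch
        self.ready = deque()   # messages still held by the broker
        self.unacked = {}      # delivery tag -> message held by the consumer
        self._next_tag = 1

    def publish(self, message):
        self.ready.append(message)

    def deliver(self):
        """Hand over messages until the consumer's unacked window is full."""
        delivered = []
        while self.ready and len(self.unacked) < self.prefetch:
            tag = self._next_tag
            self._next_tag += 1
            self.unacked[tag] = self.ready.popleft()
            delivered.append((tag, self.unacked[tag]))
        return delivered

    def ack(self, tag):
        """Acknowledging a message frees one slot in the window."""
        del self.unacked[tag]


if __name__ == "__main__":
    q = PrefetchingQueue(prefetch=3)
    for i in range(10):
        q.publish(f"result-{i}")
    batch = q.deliver()
    print(len(batch), len(q.ready))  # 3 delivered, 7 still on the broker
    q.ack(batch[0][0])
    print(len(q.deliver()))          # 1 more after one ack
```

Without a limit, the broker pushes the whole backlog to whichever consumer is connected. With one, the consumer's in-memory buffer never holds more than `prefetch` messages, and the rest stay on the broker where a second server can pick them up.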

@bethanybenzur
Copy link
Author

We brought up a second sensu-server VM on 10/24 and so far it's been completely stable through deploys. I'll continue to keep an eye on it over the next week. Thanks @portertech !

@portertech
Contributor

Trying #409 to aid.

@eladroz

eladroz commented Jan 2, 2013

Hi,
Do we actually want messages to queue up when there's no server to process them?

Since all checks are periodically running, after starting the server you should end up very quickly with the up-to-date state of affairs, without having to process untold thousands of past checks.
Processing all these stale results can even be considered an anti-pattern in our case IMHO.

One approach to handle this is setting a TTL at the queue or message level. I hope this would also limit storage size, but I'm not sure.
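A minimal sketch of the TTL idea, assuming RabbitMQ's `x-message-ttl` and `x-expires` queue arguments and the per-message `expiration` property, all of which take milliseconds. The helper names and the numbers are hypothetical. On the storage question: if every message expires after the same TTL, the backlog with no consumer is bounded by roughly publish rate times TTL.

```python
def ttl_queue_arguments(message_ttl_s, queue_expires_s=None):
    """Build RabbitMQ queue arguments; both values are sent in milliseconds."""
    args = {"x-message-ttl": int(message_ttl_s * 1000)}
    if queue_expires_s is not None:
        # x-expires deletes the whole queue after it has been unused this long.
        args["x-expires"] = int(queue_expires_s * 1000)
    return args


def per_message_expiration(ttl_s):
    """Per-message TTL travels in the `expiration` property as a string of ms."""
    return str(int(ttl_s * 1000))


def max_backlog(publish_rate_per_s, message_ttl_s):
    """Rough upper bound on queued messages when nothing consumes: rate x TTL."""
    return int(publish_rate_per_s * message_ttl_s)


if __name__ == "__main__":
    print(ttl_queue_arguments(120, queue_expires_s=300))
    print(per_message_expiration(60))
    # Assumed load: 50 results per second, dropped after 120 seconds.
    print(max_backlog(50, 120))
```

The dictionary would be passed as the `arguments` of a queue declaration in whichever client library is in use. Note that RabbitMQ refuses to redeclare an existing queue with different arguments, so an existing queue would have to be deleted first.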

Another thing (which might be on top of that, or instead) is to use proper exchanges instead of writing to the results queue directly, which is really discouraged (http://rdoc.info/github/ruby-amqp/amqp/master/AMQP/Queue:publish). In this way, the queue can be set to auto-delete when there's no consumer around for a while.

It's a bit more hassle, when using exchanges, to ensure reliability when either client/s, server or rabbitmq go down, but it can definitely be done. The exchange must be created before reading/writing to the queue, and it should not auto delete. Otherwise, everything gets fucked quite quickly.

[edit: we would lose all metrics sent during server downtime, but that's ok IMHO]
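The exchange/queue separation described above can be sketched as follows. This is an assumption-laden illustration, not what Sensu does: the exchange and queue names, the 60-second grace period, and the `RecordingChannel` stand-in are hypothetical. The keyword arguments follow the style of the Python `pika` client's channel methods, and the stand-in lets the sketch run without a broker.

```python
def declare_results_topology(channel, exchange="results", queue="results",
                             unused_grace_s=60):
    """Declare a durable exchange and a queue that expires when left unused."""
    # The exchange outlives consumers, so publishers always have a target.
    channel.exchange_declare(exchange=exchange, exchange_type="direct",
                             durable=True, auto_delete=False)
    # x-expires (milliseconds) removes the queue after a grace period with
    # no consumers, instead of letting results pile up on disk.
    channel.queue_declare(queue=queue,
                          arguments={"x-expires": unused_grace_s * 1000})
    channel.queue_bind(queue=queue, exchange=exchange, routing_key=queue)


class RecordingChannel:
    """Stand-in for a real channel: records each call and its arguments."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record


if __name__ == "__main__":
    channel = RecordingChannel()
    declare_results_topology(channel)
    for name, kwargs in channel.calls:
        print(name, kwargs)
```

Once the queue has expired, anything published to the exchange has no bound queue to route to and is dropped by the broker, which is the "lose all metrics sent during server downtime" trade-off. The server would redeclare and rebind the queue on startup.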

@jondot

jondot commented Jan 25, 2013

@eladroz doesn't that mean that given what you described, the template for such messaging semantics is XMPP rather than AMQP?

@eladroz

eladroz commented Jan 30, 2013

@jondot, can you elaborate? do you mean something more real-time which is valid only right now, and if the receiver is online?

@jondot

jondot commented Jan 30, 2013

Yes. Like you, I've been wondering what is the value of a Sensu server recovering from a crash, just to churn on all the (now probably stale) data that stacked up in the queue.

If one strategy to resolve this is the Sensu server draining the queue on recovery (or the solution you've proposed), then what's the purpose of having a queue at all?
Being able to handle an unexpected (and exceptional) surge of requests? When does that happen, given that each agent sends back results at a regular interval?

This question interested me so much that I had a talk with our Ops guy, and we looked at 3 such commercial and clustered systems. None had queues; all had direct requests from agents.

All that is left for me is to assume that in this case, there was more value in AMQP as a protocol and facilitator of networking patterns than in a queue per se (topical fan-out, and load balancing / job distribution that happens incidentally through a queue).

As an educational exercise, I'd love to know the real reason for this in Sensu, perhaps there's also a lesson these commercial tools need to learn.

@eladroz

eladroz commented Jan 31, 2013

I also share your assumption: AMQP is just one way to not worry about receiver/s being able to handle N simultaneous connections or be available at that second, to not have to know who your clients are, how many servers there are...
It's a nice abstraction (with implementation ;-) of all this goodness, so my point was that with proper exchange/queue separation, you could make it actually behave the way you need to - from deleting unused queues after a very short grace period, to persisting them to disk and surviving a broker crash...

Btw, I couldn't work out where you work (when you said "our Ops guy") - but maybe that's to remain a great mystery?... ;-)

@portertech
Contributor

0.9.11 no longer uses the default exchange, or publishes directly to queues.

https://github.com/sensu/sensu/blob/master/CHANGELOG.md#0911---2013-02-22

Adding result acks seems to throttle consumption enough to stop the strange ruby-amqp DoS behaviour.

@bethanybenzur Have you experienced this issue lately? Has your Sensu infrastructure grown?

@skymob

skymob commented Mar 20, 2013

@portertech no such issues in a while. Our Sensu infrastructure has not changed - still running four Sensu servers, and am considering scaling that down to three. We're still running 0.9.9 in prod but will be upgrading to 0.9.11 this week.

@portertech
Contributor

@skymob I would hold on the 0.9.11 upgrade, as 0.9.12 will be dropping very soon, with several improvements.


5 participants