runaway rabbitmq queues #398
Comments
3 days ago I upgraded my Erlang version from R14B04 to R15B02 and since then we've had no issues with Sensu. Closing this issue with cautious optimism! 😃
The non-draining keepalive and results RabbitMQ queues are back on my production Sensu server, even though the sensu-server process appears healthy. If I stop and start sensu-server without purging the queues, they never drain (or at least still haven't after an hour), and I end up manually purging them just to get our sensu-dashboard to update and to force stale alerts to recover.
I recommend adding a second Sensu server, as the work will be distributed between them. The 0.9.7 build introduced mutators; the built-in ones unnecessarily use the thread pool, which has been changed in the 0.9.8 betas. I'm currently experimenting w/ AMQP pre-fetching/QoS to stop RabbitMQ from flooding a single Sensu server's in-memory buffers w/ results & keepalives.
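The prefetch/QoS idea can be sketched with a small pure-Ruby model (no real broker; `FakeBroker` and the limit of 5 are illustrative assumptions, not Sensu or ruby-amqp code): the broker only pushes a message while the consumer has fewer unacknowledged messages than the prefetch limit, which bounds the consumer's in-memory buffer no matter how deep the queue gets.

```ruby
# Toy model of AMQP prefetch/QoS -- illustrative only, not Sensu code.
# The broker delivers a message only while the consumer's unacked
# count is below the prefetch limit, bounding in-memory buffering.
class FakeBroker
  attr_reader :max_in_flight

  def initialize(prefetch:)
    @queue = []
    @unacked = 0
    @prefetch = prefetch
    @max_in_flight = 0
  end

  def publish(msg)
    @queue << msg
  end

  # Deliver as many messages as QoS allows, yielding each to the consumer.
  def deliver
    while @unacked < @prefetch && (msg = @queue.shift)
      @unacked += 1
      @max_in_flight = [@max_in_flight, @unacked].max
      yield msg
    end
  end

  def ack
    @unacked -= 1
  end
end

broker = FakeBroker.new(prefetch: 5)
10_000.times { |i| broker.publish("result-#{i}") }

processed = 0
until processed == 10_000
  batch = []
  broker.deliver { |msg| batch << msg } # broker stops at 5 in flight
  batch.each { broker.ack; processed += 1 }
end

puts broker.max_in_flight # => 5, never exceeds the prefetch limit
```

Without the prefetch check in `deliver`, the broker would push all 10,000 messages at once, which is the buffer-flooding behaviour described above.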
We brought up a second sensu-server VM on 10/24 and so far it's been completely stable through deploys. I'll continue to keep an eye on it over the next week. Thanks @portertech !
Trying #409 to help with this.
Hi, since all checks run periodically, after starting the server you should very quickly end up with an up-to-date state of affairs, without having to process untold thousands of past check results. One approach is to set a TTL at the queue or message level; I hope this would also limit storage size, but I'm not sure.

Another option (on top of that, or instead) is to use proper exchanges rather than writing to the results queue directly, which is really discouraged (http://rdoc.info/github/ruby-amqp/amqp/master/AMQP/Queue:publish). That way, the queue can be set to auto-delete when there's been no consumer around for a while. Using exchanges makes it a bit more hassle to ensure reliability when the clients, the server, or RabbitMQ go down, but it can definitely be done. The exchange must be created before reading from or writing to the queue, and it should not auto-delete; otherwise everything breaks quite quickly. [edit: we would lose all metrics sent during server downtime, but that's OK IMHO]
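The per-message TTL idea can be modelled in a few lines of plain Ruby (a toy sketch, not RabbitMQ internals; `TtlQueue` and the 60-second TTL are made-up for illustration): messages older than the TTL are discarded instead of delivered, so a server that comes back after downtime only sees recent results.

```ruby
# Toy model of per-message TTL -- illustrative only, not RabbitMQ internals.
# Expired messages are discarded at delivery time, so a consumer that
# comes back after downtime never churns through stale results.
class TtlQueue
  Entry = Struct.new(:body, :enqueued_at)

  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @entries = []
  end

  def publish(body, now:)
    @entries << Entry.new(body, now)
  end

  # Pop the next live message, silently dropping expired ones.
  def pop(now:)
    while (e = @entries.shift)
      return e.body if now - e.enqueued_at <= @ttl
    end
    nil
  end
end

q = TtlQueue.new(60) # 60-second TTL
q.publish("stale-result", now: 0)
q.publish("fresh-result", now: 90)

# Server comes back at t=100: the 100s-old message is dropped,
# the 10s-old one survives.
puts q.pop(now: 100)         # => fresh-result
puts q.pop(now: 100).inspect # => nil
```

In real RabbitMQ the same effect comes from the `x-message-ttl` queue argument (or a per-message `expiration` property), with the broker doing the expiry rather than the consumer.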
@eladroz doesn't that mean that, given what you described, the template for such messaging semantics is XMPP rather than AMQP?
@jondot, can you elaborate? Do you mean something more real-time, which is valid only right now and only if the receiver is online?
Yes. Like you, I've been wondering what the value is of a Sensu server recovering from a crash, just to churn through all the (now probably stale) data that stacked up in the queue. If one strategy to resolve this is the Sensu server draining the queue on recovery (or the solution you've proposed), then what's the purpose of having a queue at all? This question interested me so much that I had a talk with our ops guy, and we looked at three such commercial and clustered systems; none of them had queues, and all had direct requests from agents. All that's left for me is to assume that in this case, there was more value in AMQP as a protocol and a facilitator of networking patterns than in a queue per se (topical fan-out, and the load balancing / job distribution that happens incidentally through a queue). As an educational exercise, I'd love to know the real reason for this in Sensu; perhaps there's also a lesson these commercial tools need to learn.
I also share your assumption: AMQP is just one way to not have to worry about whether the receiver(s) can handle N simultaneous connections or are available at that second, and to not have to know who your clients are or how many servers there are... Btw, I couldn't work out where you work (when you said "our Ops guy") - but maybe that's to remain a great mystery?... ;-)
0.9.11 no longer uses the default exchange, or publishes directly to queues. https://github.com/sensu/sensu/blob/master/CHANGELOG.md#0911---2013-02-22 Adding result acks seems to throttle consumption enough to stop the strange ruby-amqp DoS behaviour. @bethanybenzur Have you experienced this issue lately? Has your Sensu infrastructure grown?
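The result-ack behaviour mentioned above can be sketched with a small pure-Ruby model (a toy `AckQueue`, not ruby-amqp's actual API): a delivered-but-unacknowledged message is requeued if the consumer fails, so results are not lost, and acking only after processing is what lets the broker pace delivery against the outstanding (unacked) window.

```ruby
# Toy model of explicit message acks -- illustrative, not ruby-amqp's API.
# A delivered-but-unacked message is requeued if the consumer fails,
# so results are not lost; acking only after processing also paces
# delivery, since the broker tracks the outstanding (unacked) window.
class AckQueue
  def initialize
    @ready = []
    @unacked = {}
    @next_tag = 0
  end

  def publish(msg)
    @ready << msg
  end

  # Deliver one message; it stays "unacked" until acked or requeued.
  def get
    msg = @ready.shift or return nil
    tag = (@next_tag += 1)
    @unacked[tag] = msg
    [tag, msg]
  end

  def ack(tag)
    @unacked.delete(tag)
  end

  # Consumer died: everything it held goes back to the head of the queue.
  def requeue_all
    @ready.unshift(*@unacked.values.tap { @unacked.clear })
  end
end

q = AckQueue.new
q.publish("result-1")
q.publish("result-2")

tag, msg = q.get # consumer receives "result-1" but crashes before acking
q.requeue_all    # broker puts it back at the head of the queue

tag, msg = q.get
q.ack(tag)       # processed successfully this time
puts msg         # => result-1
```

With auto-ack, the crashed consumer in this scenario would have silently dropped "result-1".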
@portertech no such issues in a while. Our Sensu infrastructure has not changed - still running four Sensu servers, and am considering scaling that down to three. We're still running 0.9.9 in prod but will be upgrading to 0.9.11 this week.
@skymob I would hold off on the 0.9.11 upgrade, as 0.9.12 will be dropping very soon, with several improvements.
On 9/26 I upgraded sensu from 9.6-4 to 9.7-1. On 10/10 I received an alert from my external monitoring service saying that there was no disk space left on my sensu server. /var/lib/rabbitmq was filling up because one of the queues had no consumers. Stopping sensu-server, purging the queue and restarting sensu-server resolved the issue.
On 10/11 I upgraded rabbitmq to the latest stable 2.8.7 and made sure all of my sensu clients were running 9.7-1, but on 10/12 and every day since we've had one or two instances of the results queue (or both the results and keepalives queues) filling up because nothing is consuming them, and have had to repeat the stop, purge, restart procedure to get our alerts back. There is nothing useful or unusual in the rabbitmq log or sensu-server log.
Per @portertech, going to try this patch this afternoon to see if we can get some better info in the logs: https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L79-87