aliveness-test does still work when memory high watermark is reached and publishing is blocked #23

LarsFronius · 2013-10-17T12:36:28Z

Hi,

From the docs, it seems that if I call /api/aliveness-test, I can be 100% sure my apps are able to publish and consume messages from my MQ:

Declares a test queue, then publishes and consumes a message. Intended for use by monitoring tools.

So, when I put my queue into a state where it will not allow publishing new items (for instance by running rabbitmqctl set_vm_memory_high_watermark 0.01), and I can see "Publishers will be blocked until this alarm clears" in the logs, and my application can't actually publish new messages, I'd expect the aliveness-test to tell me that.
What I actually see is:

root@mq:/home/vagrant# curl -u foo:bar http://localhost:15672/api/aliveness-test/%2F
{"status":"ok"}

This is not correct, considering this API call should really "publish and consume a message. Intended for use by monitoring tools."
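For reference, the same check can be scripted. This is a minimal Python sketch, assuming the management listener and credentials from the curl call above; the helper names (aliveness_url, aliveness_ok, check_aliveness) are hypothetical. Note that the vhost must be percent-encoded, which is why the default vhost "/" appears as %2F. The sketch only reports what the endpoint returns; it says nothing about whether your application's own connections are blocked.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def aliveness_url(base_url: str, vhost: str) -> str:
    """Build the aliveness-test URL; "/" must be encoded as %2F."""
    return f"{base_url.rstrip('/')}/api/aliveness-test/{quote(vhost, safe='')}"


def aliveness_ok(body: str) -> bool:
    """True only if the response body is a JSON object with status "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def check_aliveness(base_url: str, vhost: str, user: str, password: str,
                    timeout: float = 5.0) -> bool:
    """Call the endpoint with HTTP basic auth (hypothetical helper).

    urlopen raises HTTPError for non-2xx responses, so callers should
    treat exceptions as a failed check as well.
    """
    req = urllib.request.Request(aliveness_url(base_url, vhost))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return aliveness_ok(resp.read().decode())
```

For example, check_aliveness("http://localhost:15672", "/", "foo", "bar") mirrors the curl command shown above.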

This is what I run at the moment:

{running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.1.5"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.1.5"},
      {webmachine,"webmachine","1.10.3-rmq3.1.5-gite9359c7"},
      {mochiweb,"MochiMedia Web Server","2.7.0-rmq3.1.5-git680dba8"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.1.5"},
      {rabbit,"RabbitMQ","3.1.5"},
      {os_mon,"CPO  CXC 138 46","2.2.9"},
      {inets,"INETS  CXC 138 49","5.9"},
      {xmerl,"XML parser","1.3.1"},
      {mnesia,"MNESIA  CXC 138 12","4.7"},
      {amqp_client,"RabbitMQ AMQP Client","3.1.5"},
      {sasl,"SASL  CXC 138 11","2.2.1"},
      {stdlib,"ERTS  CXC 138 10","1.18.1"},
      {kernel,"ERTS  CXC 138 10","2.15.1"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R15B01 (erts-5.9.1) [source] [64-bit] [async-threads:30] [kernel-poll:true]\n"}


michaelklishin · 2013-10-17T12:42:50Z

Publishers are blocked on a per-connection basis. Since the aliveness test cannot use your app's connections, it cannot know that they are blocked.

A new monitoring endpoint may be worth adding, but the aliveness-test cannot provide information about your connections, only about overall node health.

michaelklishin · 2013-10-17T12:43:31Z

"when I put my queue into a state where it will not allow publishing new items"

This is not a correct statement. Connections can be in that (blocked) state, but queues cannot.

michaelklishin · 2013-10-17T12:44:24Z

There are already HTTP API endpoints for connection info. If you'd like something to be added there, please start a discussion on rabbitmq-discuss.
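As an illustration of using that connection info, a monitoring script could look at each connection's state as reported by GET /api/connections. This sketch assumes each connection object carries a "state" field whose values include "blocked" (the connection published and the broker stopped reading from it) and "blocking" (an alarm is in effect but the connection has not published yet); verify those values against your management plugin version. The function name and sample data are hypothetical.

```python
from typing import Any

# Assumed state values for connections affected by a resource alarm.
BLOCKED_STATES = {"blocked", "blocking"}


def blocked_connections(connections: list[dict[str, Any]]) -> list[str]:
    """Return names of connections whose state indicates an alarm block.

    `connections` is the already-parsed JSON list from GET /api/connections.
    """
    return [
        c.get("name", "?")
        for c in connections
        if c.get("state") in BLOCKED_STATES
    ]


# Hypothetical sample, shaped like a (trimmed) /api/connections response.
sample = [
    {"name": "app-1", "state": "running"},
    {"name": "app-2", "state": "blocked"},
    {"name": "app-3", "state": "blocking"},
]
print(blocked_connections(sample))  # ['app-2', 'app-3']
```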

LarsFronius · 2013-10-17T13:20:22Z

Thanks for the clarification. I get the point: it is of course not my app's connection, so it can't know that the app is blocked from publishing.
But this entry in the log:

Publishers will be blocked until this alarm clears.

This tells me, as a user, that no publisher should be able to publish anything to this RabbitMQ host, doesn't it? So shouldn't the aliveness-test be unable to publish as well?

Anyway, as a user tagged for monitoring, is there any way I can find out that RabbitMQ is in the alarm state where publishers may be blocked?

michaelklishin · 2013-10-17T13:26:25Z

The aliveness test is part of a RabbitMQ plugin, which can do and use things AMQP 0-9-1 clients cannot:

- There is a so-called direct Erlang client that uses distributed Erlang facilities instead of AMQP connections.
- Plugins can bypass many checks that the basic.publish handler normally goes through.

And so on. The aliveness test is meant to test key broker infrastructure, not every possible failure scenario.

I need to check whether the management API can provide any information about alarms. Will get back to you.

CVTJNII · 2016-02-22T18:11:38Z

As a user I agree with @LarsFronius that the aliveness check should fail when publishing is blocked by the watermarks being exceeded. That the aliveness check passes when the queue cannot process messages is both surprising and disappointing.

michaelklishin · 2016-02-22T19:32:44Z

@CVTJNII only specific connections are blocked, so your claim that "the queue cannot process messages" is factually untrue even when alarms are in place. Like I said, we are VERY far from having a consensus about how the aliveness test should work. Everybody wants it to work the way their company's monitoring works. Sorry, we cannot accommodate all requests.

CVTJNII · 2016-02-22T20:56:33Z

@michaelklishin I respectfully disagree. Per Rabbit's output when the watermark is breached:

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

Per the API documentation at http://hg.rabbitmq.com/rabbitmq-management/raw-file/rabbitmq_v3_3_4/priv/www/api/index.html:

api/aliveness-test/vhost    Declares a test queue, then publishes and consumes a message. Intended for use by monitoring tools.

So per the check's documentation, it should publish and consume a message. However, per Rabbit's log in this scenario, publishers are blocked. So I disagree that this is a matter of ambiguity in how the check should work; in my opinion, the check is not operating as documented. That is the spirit of my objection: based on the documentation I've found, I would not expect a 200 response when publishers are blocked.

Furthermore, when I hit this issue I had the watermark set to zero, and per the memory documentation at https://www.rabbitmq.com/memory.html, I would expect all publishing to be stopped, as per the log message above:

A value of 0 makes the memory alarm go off immediately and thus disables all publishing (this may be useful if you wish to disable publishing globally); use rabbitmqctl set_vm_memory_high_watermark 0.
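The arithmetic behind both watermark settings used in this thread can be sketched as follows. This is a simplified model, assuming a relative watermark is multiplied by the RAM the node detects and the alarm fires once used memory exceeds that limit; the 2 GiB VM size and the memory figures are hypothetical, and the function names are illustrative only.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def memory_threshold_bytes(watermark: float, total_ram_bytes: int) -> int:
    """Memory limit implied by a relative high watermark (simplified model)."""
    if watermark < 0:
        raise ValueError("watermark must be non-negative")
    return int(watermark * total_ram_bytes)


def alarm_active(used_bytes: int, watermark: float, total_ram_bytes: int) -> bool:
    """The alarm fires once used memory exceeds the limit."""
    return used_bytes > memory_threshold_bytes(watermark, total_ram_bytes)


# On a hypothetical 2 GiB VM, 0.01 gives a limit of about 20.5 MiB,
# which a running node typically exceeds right away.
print(memory_threshold_bytes(0.01, 2 * GIB))  # 21474836
# With 0, the limit is 0 bytes, so any memory use at all trips the alarm.
print(alarm_active(1, 0.0, 2 * GIB))  # True
```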

If there is better documentation for the API, please let me know; that was the best I found, and it aligned with other searches for how the endpoint should function.

So again, I disagree with the assertion that this is an issue with consensus on how the check should operate, and instead believe the check is not operating as documented. As the check is supposed to publish and then consume a message it should not pass when publishing is disabled.

michaelklishin · 2016-02-23T05:41:06Z

@CVTJNII publishers are blocked after they publish at least one message, which may (although this is not guaranteed, and typically it won't be) be accepted in full, e.g. if it has a blank body. How else would RabbitMQ know that a connection is publishing?

The aliveness check uses a regular RabbitMQ client, and the only thing that could be missing is a timeout, which, again, everybody has their own ideal default for.

So instead of assuming that the aliveness test should cover everything, please stop suggesting that we make it do A, B, C, …, X, Y, Z. Use multiple monitoring checks: the HTTP API provides a ton of data, and so does rabbitmqctl, which now reports alarms in rabbitmqctl status.
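One such separate check could read the per-node alarm flags from the HTTP API. This sketch assumes the objects returned by GET /api/nodes expose boolean "mem_alarm" and "disk_free_alarm" fields; the function name and sample data are hypothetical. It complements the aliveness test rather than replacing it.

```python
from typing import Any

# Assumed boolean alarm fields on each node object from GET /api/nodes.
ALARM_FIELDS = ("mem_alarm", "disk_free_alarm")


def nodes_in_alarm(nodes: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map node name -> list of active alarm fields (empty nodes omitted)."""
    result: dict[str, list[str]] = {}
    for node in nodes:
        active = [field for field in ALARM_FIELDS if node.get(field)]
        if active:
            result[node.get("name", "?")] = active
    return result


# Hypothetical sample, shaped like a (trimmed) /api/nodes response.
sample_nodes = [
    {"name": "rabbit@mq", "mem_alarm": True, "disk_free_alarm": False},
    {"name": "rabbit@mq2", "mem_alarm": False, "disk_free_alarm": False},
]
print(nodes_in_alarm(sample_nodes))  # {'rabbit@mq': ['mem_alarm']}
```

A monitoring tool could then alert whenever the returned dict is non-empty, independently of what the aliveness test reports.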

michaelklishin · 2016-02-23T05:42:56Z

I'm locking this because this is getting ridiculous. I cannot think of a data service where a single request can discover every possible issue there might be (given that definitions of "healthy" vary from company to company, or even from project to project). Yet somehow RabbitMQ is expected to provide it.

michaelklishin · 2016-02-23T05:48:20Z

Filed rabbitmq/rabbitmq-management#137 for one specific improvement we can make.

michaelklishin · 2016-02-23T05:52:21Z

@CVTJNII furthermore, you assume that when a publish method returns, the message is published. That's wrong: it simply means that the data was written to the socket. It absolutely does not mean that it has reached RabbitMQ and was read, parsed, and routed. When alarms are in place, connections that see a basic.publish, content header, or content body frame will stop reading from the socket. However, the client knows nothing about that, even if it is actually blocked, unless it registers a connection.blocked handler.
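The socket-level point can be illustrated without RabbitMQ at all. In this sketch, which is plain Python sockets and not RabbitMQ or client-library code, a local socket pair stands in for a client connection, and a reader that never calls recv() stands in for a broker that has stopped reading. The function names are hypothetical.

```python
import socket


def written_but_unread(payload: bytes) -> bytes:
    """sendall() returns before the peer has read anything."""
    writer, reader = socket.socketpair()
    try:
        writer.sendall(payload)  # returns once the kernel has buffered the bytes
        # Nothing has been consumed by the reader yet; it reads only now.
        return reader.recv(len(payload))
    finally:
        writer.close()
        reader.close()


def bytes_accepted_before_full(chunk: int = 4096, limit: int = 64 * 1024 * 1024) -> int:
    """Keep writing to a peer that never reads until the buffers fill up.

    Every successful send() here "returns" even though no byte was read;
    the writer only notices once the kernel buffers are full.
    """
    writer, reader = socket.socketpair()
    writer.setblocking(False)
    sent = 0
    try:
        while sent < limit:  # safety bound for unusually large buffers
            try:
                sent += writer.send(b"x" * chunk)
            except BlockingIOError:
                break  # buffers full: the first sign the peer isn't reading
    finally:
        writer.close()
        reader.close()
    return sent
```

A blocking writer would simply stall at that point instead of raising, which is why client libraries offer connection.blocked notifications: they let an application learn about an alarm explicitly rather than inferring it from stalled writes. Check your client's documentation for how to register such a handler.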

michaelklishin closed this as completed Oct 17, 2013

dumbbell modified the milestone: n/a Mar 24, 2015

rabbitmq locked and limited conversation to collaborators Feb 23, 2016
