This continues the discussion from logstash-plugins/logstash-input-rabbitmq#76. I have confirmed that this is a March Hare bug. It may also be a Java client bug, but I'm not familiar enough with the internals to say for certain.
The bottom line: there are situations where, even with :automatic_recovery on, the TCP connection can completely disappear and yet March Hare will not attempt to reconnect, leaving it dead in an unrecoverable state. Restarting the server on the other end has no effect; that client is permanently borked.
I confirmed this with a Wireshark capture and lsof, using the setup from the link above:

logstash <-> haproxy <-> rmq <- producer
The short summary here is that if the following sequence happens:
NOTE that by step 8 the connection is half-open: the final FIN,ACK from HAProxy is never sent. However, the connection is dead at the OS level, so presumably an exception was surfaced to the socket (my sockets knowledge is rusty) that went unhandled.
At this point nothing can be done to repair the connection because the client will not attempt to reconnect again.
A thread dump of March Hare under a healthy Logstash (one receiving RMQ input correctly) shows three background thread pools that I believe come from March Hare, while the unhealthy one (in a wedged state) shows only one.
Any ideas @michaelklishin ? My money is on some threadpools dying with the socket in a weird state.
Perhaps HAProxy shares some of the fault here, but the client library should recover from any bad connection state at the TCP level; a bad router could inflict the same kind of damage.
@andrewvc I'm sorry, but "threadpools dying with the socket in a weird state" doesn't make any sense. No thread pools are used for connection I/O, and the RabbitMQ Java client uses JDK I/O exceptions to detect connection failures. If none are thrown, it can only learn about peer termination after a heartbeat timeout. A "bad connection state at the TCP level" cannot be reliably detected any other way.
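The heartbeat-based detection described above can be illustrated with plain Ruby sockets (a sketch of the principle, not March Hare internals): when a peer vanishes without the OS surfacing an I/O error, no exception is ever raised on the socket, and the only signal left is the absence of traffic within a timeout window.

```ruby
require "socket"

# A "server" that accepts a connection and then goes completely silent,
# standing in for a peer that is gone but never sent FIN/RST.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]
Thread.new { server.accept } # accept, then do nothing

client = TCPSocket.new("127.0.0.1", port)

# No I/O exception is raised here; the socket simply stays quiet.
# A heartbeat timeout is the only detection mechanism: no readable
# data arrives within the window, so IO.select returns nil.
heartbeat_window = 0.5 # seconds; real clients use multiples of the heartbeat
ready = IO.select([client], nil, nil, heartbeat_window)

if ready.nil?
  puts "no traffic within #{heartbeat_window}s: assume peer is dead"
else
  puts "peer is alive"
end
```

This is why, absent an I/O exception, detection can never be faster than the heartbeat interval.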
How can this be reproduced without Logstash?
I demonstrated in logstash-plugins/logstash-input-rabbitmq#76 that I cannot reproduce it with March Hare alone.
@acogoluegnes can you try reproducing this with logstash-plugins/logstash-input-rabbitmq#76 (comment) on Linux? I'd not use the compose thing just yet because the last thing we need is involving more moving parts.
It would be interesting to conduct the same test for the Java client (e.g. with a JRuby or Groovy script).
From logstash-plugins/logstash-input-rabbitmq#76 (comment):
Interestingly, March Hare does recover if RabbitMQ comes back within the 5s window, so a working theory is that MarchHare::Session#reconnecting_on_network_failures gets an exception it does not expect.
Catch juc.TimeoutException errors on reconnect
March Hare now automatically recovers from all network failures
Fixes ruby-amqp#107 (comment)
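The gist of the fix can be sketched in plain Ruby (hypothetical names throughout; under JRuby the real exception is java.util.concurrent.TimeoutException, stood in for here by Ruby's Timeout::Error): the recovery loop must treat a handshake timeout like any other network failure and retry, rather than letting it escape and kill recovery permanently.

```ruby
require "timeout"

# Hypothetical recovery loop. attempt_connect is a stand-in for the real
# connection attempt; here it times out twice, then succeeds.
attempts = 0
attempt_connect = lambda do
  attempts += 1
  raise Timeout::Error, "handshake timed out" if attempts < 3
  :connected
end

result = nil
loop do
  begin
    result = attempt_connect.call
    break
  rescue IOError, Timeout::Error
    # Before the fix, only I/O errors were rescued here; a handshake
    # timeout escaped the loop and recovery never ran again.
    sleep 0.01
  end
end

puts "connected after #{attempts} attempts"
```

The design point is narrow: the rescue list simply gains the timeout class, so a slow or silent handshake is retried instead of being fatal.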
Based on the behavior, I was able to reproduce this without Docker and without HAProxy:

1. Run RabbitMQ + Logstash.
2. Let the Logstash rmq input connect.
3. Stop RabbitMQ so the input starts retrying.
4. Run `nc -l 5672` to accept the next connection retry from Logstash.
5. Wait until `AMQP` appears in nc's output (this is the rmq client attempting to reconnect).

After this attempt to reconnect, the rmq input never tries again.
So what's really going on: AMQConnection in the RabbitMQ Java client never receives a protocol handshake response, and no heartbeat mechanism has been initialized at that point. To cope with that, the Java client has a handshake timeout, but March Hare did not retry on the resulting java.util.concurrent.TimeoutException.
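The nc repro can be simulated in a single Ruby script (a sketch: the byte values are the real AMQP 0-9-1 protocol header, but the timeout handling is simplified relative to the Java client's). A listener accepts the connection and never replies, so the client's wait for a handshake response can only end via a timeout:

```ruby
require "socket"
require "timeout"

server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

# Like `nc -l`: accept the connection, read the header, never respond.
received = nil
listener = Thread.new do
  conn = server.accept
  received = conn.read(8) # the 8-byte AMQP protocol header
end

client = TCPSocket.new("127.0.0.1", port)
client.write("AMQP\x00\x00\x09\x01") # AMQP 0-9-1 protocol header
listener.join

handshake_failed = false
begin
  # The server will never send Connection.Start, so this must time out.
  Timeout.timeout(0.5) { client.read(1) }
rescue Timeout::Error
  handshake_failed = true
end

puts "listener saw: #{received[0, 4]}"   # the "AMQP" seen in nc's output
puts "handshake timed out: #{handshake_failed}"
```

This matches the observed sequence: `AMQP` appears on the nc side, the handshake wait times out, and (before the fix) that timeout ended recovery for good.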
Thanks to solid investigative work by @andrewvc and @jordansissel, this should be resolved by #108.