Connection recovery doesn't happen with HAProxy without available backends #107

Closed
andrewvc opened this Issue Nov 1, 2016 · 6 comments

@andrewvc
Contributor
andrewvc commented Nov 1, 2016

This continues the discussion from logstash-plugins/logstash-input-rabbitmq#76. I have confirmed that this is a March Hare bug. It is potentially also a Java client bug, but I'm not knowledgeable enough about the internals here to know.

The bottom line is that there are situations where, even with :automatic_recovery on, the TCP connection can completely disappear and yet March Hare does not attempt to reconnect and ends up in an unrecoverable state. Restarting the server on the other end has no effect; that client is permanently borked.

I confirmed this using the logstash <-> haproxy <-> rmq <- producer setup in the link above with a wireshark capture and lsof.

The short summary is that the following sequence happens:

  1. RabbitMQ dies
  2. HAProxy sends FIN,ACK, MarchHare sends ACK
  3. MarchHare sends FIN,ACK, HAProxy sends ACK
  4. Connection now closed
  5. MarchHare attempts reconnect 5s later
  6. No response is received
  7. 5s after the reconnect attempt MarchHare sends FIN,ACK
  8. HAProxy sends ACK
  9. Silence, no connection exists in the OS stack.

Note that by step 8 the connection is half-open: the final FIN,ACK from HAProxy is never sent. However, the connection is dead at the OS level, so presumably an exception was surfaced on the socket (my sockets knowledge is rusty) and was not handled.

At this point nothing can be done to repair the connection because the client will not attempt to reconnect again.

The stack trace of a healthy Logstash (receiving RMQ input correctly) has 3 background thread pools that I believe come from March Hare, while the unhealthy one (in a wedged state) has only one.

Any ideas @michaelklishin ? My money is on some threadpools dying with the socket in a weird state.

HAProxy is perhaps partly at fault here, but the client library should recover from any bad connection state at the TCP level. A bad router could inflict the same kind of damage.

@andrewvc andrewvc referenced this issue in logstash-plugins/logstash-input-rabbitmq Nov 1, 2016
Closed

No error if RabbitMQ goes down #76

@michaelklishin
Member

@andrewvc I'm sorry but "threadpools dying with the socket in a weird state" doesn't make any sense. No thread pools are used for connection I/O and RabbitMQ Java client uses JDK I/O exceptions to detect connection failures. If none are thrown then it can only learn about peer termination after a heartbeat timeout. "Bad connection state at TCP level" cannot be reliably detected any other way.
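To make the heartbeat point concrete, here is a rough sketch of timeout-based peer detection. It is not code from March Hare or the RabbitMQ Java client: the HeartbeatMonitor class, its method names, and the "two missed intervals" policy are assumptions made for illustration only.

```ruby
# Hypothetical sketch, not March Hare or RabbitMQ Java client code.
# Remembers when traffic was last seen and reports the peer as dead
# once too much time has passed without any.
class HeartbeatMonitor
  # interval: heartbeat interval in seconds
  # clock:    callable returning the current time in seconds (injectable for tests)
  def initialize(interval:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @interval = interval
    @clock = clock
    @last_seen = @clock.call
  end

  # Call whenever any frame (heartbeat or otherwise) arrives from the peer.
  def record_activity
    @last_seen = @clock.call
  end

  # Assumed policy: more than two silent intervals means the peer is gone.
  def peer_dead?
    @clock.call - @last_seen > 2 * @interval
  end
end
```

The relevant limitation is that a monitor like this only exists once a connection has been fully negotiated, so it cannot help with a peer that goes silent earlier than that.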

How can this be reproduced without Logstash?

@michaelklishin michaelklishin changed the title from March Hare can enter unrecoverable state during connection instability to Connection recovery doesn't happen with HAProxy without available backends Nov 1, 2016
@michaelklishin
Member

I demonstrated in logstash-plugins/logstash-input-rabbitmq#76 that I cannot reproduce it with March Hare alone.

@michaelklishin
Member

@acogoluegnes can you try reproducing this with logstash-plugins/logstash-input-rabbitmq#76 (comment) on Linux? I'd not use the compose thing just yet because the last thing we need is involving more moving parts.

It would be interesting to conduct the same test for the Java client (e.g. with a JRuby or Groovy script).

@michaelklishin
Member
michaelklishin commented Nov 1, 2016 edited

From logstash-plugins/logstash-input-rabbitmq#76 (comment):

Interestingly MarchHare does recover if rabbitmq comes back in under the 5s window

so a working theory is that MarchHare::Session#reconnecting_on_network_failures gets an exception it does not expect.

@andrewvc andrewvc added a commit to andrewvc/march_hare that referenced this issue Nov 1, 2016
@andrewvc andrewvc Catch juc.TimeoutException errors on reconnect
This automatically recovers in the event of all network failures now

Fixes ruby-amqp#107 (comment)
a118759
@michaelklishin
Member

From logstash-plugins/logstash-input-rabbitmq#76 (comment):

Based on the behavior, I was able to reproduce this without docker and without haproxy:

  1. Run rabbitmq + logstash
  2. Let logstash rmq input connect
  3. Stop rabbitmq
  4. Run nc -l 5672 to accept the next connection retry from Logstash
  5. Wait until 'AMQP' appears on nc (this is the rmq client attempting to reconnect)

After this attempt to reconnect, rmq input never tries again.
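For anyone who wants to script the nc step, a listener that accepts one connection and then stays silent can be approximated in plain Ruby. This is only a sketch: start_silent_listener and its port and hold parameters are hypothetical names, not part of any library mentioned in this thread.

```ruby
require "socket"

# Hypothetical helper: accept one TCP connection, read the client's opening
# bytes, and never reply. Returns [port, thread]; the thread's value is the
# data the client sent first.
def start_silent_listener(port: 0, hold: 30)
  server = TCPServer.new("127.0.0.1", port) # port 0 asks the OS for a free port
  thread = Thread.new do
    client = server.accept
    header = client.readpartial(8) # up to 8 bytes; an AMQP 0-9-1 header is 8 bytes long
    sleep hold                     # stay silent: no reply and no close
    client.close
    server.close
    header
  end
  [server.addr[1], thread]
end
```

Calling start_silent_listener(port: 5672) after stopping RabbitMQ should play the same role as nc -l 5672 here: the client connects, sends its protocol header, and gets nothing back.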

So what's really going on is that AMQConnection in the RabbitMQ Java client never receives a protocol handshake response, and no heartbeat mechanism is initialized at that point. To cope with that, the Java client has a handshake timeout, but March Hare did not retry on a java.util.concurrent.TimeoutException.
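The failure mode can be illustrated with a small retry loop. This is a sketch in plain Ruby, not March Hare's actual implementation: reconnect_with_retries and RECOVERABLE are hypothetical names, and Ruby's Timeout::Error stands in for java.util.concurrent.TimeoutException so the example runs without JRuby.

```ruby
require "timeout"

# Hypothetical list of exception classes treated as "try again".
# Timeout::Error stands in for java.util.concurrent.TimeoutException.
RECOVERABLE = [IOError, SystemCallError, Timeout::Error].freeze

# Calls the block until it succeeds, sleeping `interval` seconds between
# attempts. Any exception not listed in `recoverable` escapes immediately,
# which ends the retrying for good.
def reconnect_with_retries(interval: 5, max_attempts: nil, recoverable: RECOVERABLE)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *recoverable
    raise if max_attempts && attempts >= max_attempts
    sleep interval
    retry
  end
end
```

If the timeout class is left out of the list, the first handshake timeout propagates out of the loop and no further attempt is made, which matches the symptom described above.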

@michaelklishin
Member

Thanks to solid investigative work by @andrewvc and @jordansissel, this should be resolved by #108.
