Connection recovery doesn't happen with HAProxy without available backends #107
This continues the discussion from logstash-plugins/logstash-input-rabbitmq#76. I have confirmed that this is a March Hare bug. It is potentially also a Java client bug, but I'm not knowledgeable enough about the internals to say.
The bottom line is that there are situations where, even with
I confirmed this using the
The short summary here is that if the following sequence happens:
NOTE that by step 8 the connection is half-open: the final FIN,ACK from HAProxy is never sent. However, the connection is dead at the OS level, so presumably an exception was surfaced to the socket (my sockets knowledge is rusty) and was not handled.
At this point nothing can be done to repair the connection because the client will not attempt to reconnect again.
The stack trace of a healthy Logstash (receiving RMQ input correctly) shows 3 background thread pools that I believe come from March Hare, while the unhealthy one (in a wedged state) shows only one.
Any ideas @michaelklishin ? My money is on some threadpools dying with the socket in a weird state.
HAProxy is perhaps partly at fault here, but the client library should recover from any bad connection state at the TCP level. A bad router could inflict the same kind of damage.
@andrewvc I'm sorry but "threadpools dying with the socket in a weird state" doesn't make any sense. No thread pools are used for connection I/O, and the RabbitMQ Java client uses JDK I/O exceptions to detect connection failures. If none are thrown, it can only learn about peer termination after a heartbeat timeout. "Bad connection state at TCP level" cannot be reliably detected any other way.
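To illustrate the heartbeat point, here is a minimal plain-Ruby sketch of timeout-based peer detection. It is not the RabbitMQ Java client's implementation: the `HeartbeatMonitor` class and its methods are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```ruby
# Hypothetical sketch: with no I/O exception to go on, the peer is declared
# dead only after it has been silent for longer than the heartbeat timeout.
class HeartbeatMonitor
  def initialize(timeout_seconds, now: 0.0)
    @timeout = timeout_seconds
    @last_seen = now
  end

  # Call whenever any frame (data or heartbeat) arrives from the peer.
  def record_activity(now)
    @last_seen = now
  end

  # True once the peer has been silent for longer than the timeout.
  def peer_dead?(now)
    (now - @last_seen) > @timeout
  end
end

monitor = HeartbeatMonitor.new(10)
monitor.record_activity(3.0)
alive_at_12 = !monitor.peer_dead?(12.0) # 9 seconds of silence
dead_at_14 = monitor.peer_dead?(14.0)   # 11 seconds of silence
```

On a half-open connection nothing arrives and nothing is thrown, so a check like `peer_dead?` is the only signal available, and it fires no sooner than the configured timeout.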
How can this be reproduced without Logstash?
@acogoluegnes can you try reproducing this with logstash-plugins/logstash-input-rabbitmq#76 (comment) on Linux? I'd not use the compose thing just yet because the last thing we need is involving more moving parts.
It would be interesting to conduct the same test for the Java client (e.g. with a JRuby or Groovy script).
so a working theory is that MarchHare::Session#reconnecting_on_network_failures gets an exception it does not expect.
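A rough sketch of how that theory would play out, in plain Ruby. This is not March Hare's code: `reconnecting_on`, `NetworkFailure`, and the choice of `IOError` as the unexpected exception are all hypothetical, picked only to show a retry loop that rescues a fixed list of exception classes.

```ruby
# Hypothetical names throughout; this is not March Hare's implementation.
class NetworkFailure < StandardError; end

# Retries the block on the exception classes listed in `expected`.
# Any other exception propagates to the caller and ends the retry loop.
def reconnecting_on(expected, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *expected
    retry if attempts < max_attempts
    raise
  end
end

# An expected failure is retried until the block succeeds.
recovered = reconnecting_on([NetworkFailure]) do |n|
  raise NetworkFailure, "connection refused" if n < 3
  :connected
end

# An unexpected failure escapes on the first attempt; no retry happens.
unexpected_attempts = 0
outcome = begin
  reconnecting_on([NetworkFailure]) do |n|
    unexpected_attempts = n
    raise IOError, "socket in unexpected state"
  end
rescue IOError
  :gave_up
end
```

If the real recovery loop behaves like this, a single exception outside the rescued set would be enough to stop all further reconnection attempts, which matches the wedged state described above.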
added a commit on Nov 1, 2016
referenced this issue on Nov 1, 2016
so what's really going on is