Connection recovery doesn't happen with HAProxy without available backends #107
@andrewvc I'm sorry but "threadpools dying with the socket in a weird state" doesn't make any sense. No thread pools are used for connection I/O and RabbitMQ Java client uses JDK I/O exceptions to detect connection failures. If none are thrown then it can only learn about peer termination after a heartbeat timeout. "Bad connection state at TCP level" cannot be reliably detected any other way. How can this be reproduced without Logstash?
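To illustrate the heartbeat point above, here is a minimal sketch of timeout-based peer detection. It is not the Java client's or March Hare's implementation; `HeartbeatMonitor`, its method names, and the "two missed intervals" threshold are assumptions made for this example.

```ruby
# Hypothetical sketch: declare the peer dead when nothing has been
# received for more than two heartbeat intervals. Names are illustrative.
class HeartbeatMonitor
  # interval: heartbeat interval in seconds
  # clock:    callable returning the current time in seconds (injectable for tests)
  def initialize(interval:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @interval = interval
    @clock = clock
    @last_seen = clock.call
  end

  # Call whenever any frame (heartbeat or otherwise) arrives from the peer.
  def record_activity
    @last_seen = @clock.call
  end

  # True once more than two intervals have passed in silence (assumed threshold).
  def peer_dead?
    @clock.call - @last_seen > 2 * @interval
  end
end
```

When no I/O exception is raised on the socket, a check like `peer_dead?` is the only signal the client has, which is why a silent half-open connection can take up to the heartbeat timeout to be noticed.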
I demonstrated in logstash-plugins/logstash-input-rabbitmq#76 that I cannot reproduce it with March Hare alone.
@acogoluegnes can you try reproducing this with logstash-plugins/logstash-input-rabbitmq#76 (comment) on Linux? I'd not use the compose thing just yet because the last thing we need is involving more moving parts. It would be interesting to conduct the same test for the Java client (e.g. with a JRuby or Groovy script).
From logstash-plugins/logstash-input-rabbitmq#76 (comment):
so a working theory is that MarchHare::Session#reconnecting_on_network_failures gets an exception it does not expect.
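A rough sketch of how that theory would play out, assuming a retry loop that only rescues a fixed list of exception classes. This is not March Hare's actual code: `reconnect_with`, `NARROW_FAILURES`, and `BROAD_FAILURES` are hypothetical names, and the exception classes are plain Ruby stand-ins for whatever the JVM raises.

```ruby
# Hypothetical stand-in for a recovery loop; not March Hare's implementation.
NARROW_FAILURES = [IOError, Errno::ECONNREFUSED].freeze
BROAD_FAILURES  = [StandardError].freeze

# Runs the block until it succeeds, retrying only on the listed exceptions.
# Any other exception escapes immediately and recovery stops for good.
def reconnect_with(rescued, attempts: 3)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *rescued
    retry if tries < attempts
    raise
  end
end

# Simulated connect: the first attempt fails with an error the narrow list
# does not cover, later attempts succeed.
flaky_connect = lambda do |try|
  raise Errno::ECONNRESET if try == 1
  :connected
end

begin
  reconnect_with(NARROW_FAILURES, &flaky_connect)
rescue Errno::ECONNRESET
  puts "narrow list: unexpected exception escaped, no further attempts"
end

puts "broad list: #{reconnect_with(BROAD_FAILURES, &flaky_connect)}"
```

With the narrow list the loop gives up on the first exception it was not written for, which would match a client that never tries to reconnect again.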
This automatically recovers in the event of all network failures now. Fixes ruby-amqp#107 (comment)
From logstash-plugins/logstash-input-rabbitmq#76 (comment):
so what's really going on is
Thanks to solid investigative work by @andrewvc and @jordansissel, this should be resolved by #108.
This continues the discussion from logstash-plugins/logstash-input-rabbitmq#76 . I have confirmed that this is a March Hare bug. This is potentially also a Java client bug, but I'm not knowledgeable enough about the internals here to know that.
The bottom line is that there are situations where, even with `:automatic_recovery` on, the TCP connection can completely disappear and yet March Hare will not attempt to reconnect; it essentially goes dead in an unrecoverable state. Restarting the server on the other end has no effect: that client is permanently borked.
I confirmed this using the `logstash <-> haproxy <-> rmq <- producer` setup in the link above with a Wireshark capture and `lsof`.
The short summary here is that if the following sequence happens:
NOTE that by step 8 the connection is half-open: the final FIN,ACK from HAProxy is never sent. However, the connection is dead at the OS level, so presumably an exception was surfaced to the socket (my sockets knowledge is rusty) that was not handled.
At this point nothing can be done to repair the connection because the client will not attempt to reconnect again.
The stack trace of March Hare in a healthy Logstash (receiving RMQ input correctly) shows 3 background thread pools that I believe come from March Hare, while the unhealthy one (in a wedged state) has only one.
Any ideas @michaelklishin ? My money is on some threadpools dying with the socket in a weird state.
On the one hand, HAProxy perhaps has some fault here, but the client library should recover from any bad connection state at the TCP level. A bad router could also inflict this kind of damage.