Connection recovery attempt interrupted by an ECONNRESET seems to stop recovery process #491
Comments
The connection does attempt a recovery: it sends a protocol preamble and then the connection is closed by the peer. I'd recommend upgrading to 2.6.4 first. It is difficult to reproduce the sequence of events without a traffic capture or a lot more logging, and 2.6.4 logs quite a bit more.
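For anyone diagnosing this, a minimal sketch of turning up Bunny's logging and making the recovery settings explicit; option names follow Bunny's documented connection options, so adjust to the version you run:

```ruby
require "bunny"
require "logger"

# Sketch: verbose logging plus explicit recovery settings while diagnosing
# reconnection problems.
conn = Bunny.new(
  host: "localhost",
  logger: Logger.new($stdout),
  log_level: :debug,            # newer Bunny versions log recovery steps in detail
  automatically_recover: true,  # enable network failure recovery
  network_recovery_interval: 5  # seconds between recovery attempts
)
conn.start
```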
Just an update to let you know that I've found some time to dig into the problem. I'm currently able to reproduce the issue (kind of...), but I'm still looking for a suitable solution.
I seem to be having the same problem. Interestingly, my RabbitMQ server logs indicate a handshake failure, which is weird, since the initial connection was not made with TLS. Could the reconnect be attempting to use TLS?
You haven't provided any logs, but AMQP 0-9-1 also has a handshake, as do many protocols.
@michaelklishin I just added a spec on my fork ("issue491" branch) that simulates the issue by inserting a sleep into Bunny::Session#send_preamble. Long story short: run the issue491_spec and note that at some point the client becomes a "zombie" and does nothing more. I dug into the code and it seems the exception raised here goes up the stack without being caught. I also made a few attempts to catch it in the right place, but unfortunately I haven't found a suitable solution yet. Do you have any suggestions to point me in the right direction? Ale
You injected a sleep after sending the preamble, not before, so I'm not sure what may be going on without digging deeper. If a preamble isn't sent within a certain amount of time, RabbitMQ will simply close the TCP connection.
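A minimal sketch, not the actual issue491 spec, of the kind of monkey patch that reproduces this: delay the handshake around Bunny::Session#send_preamble so RabbitMQ's handshake timeout (10 seconds by default) closes the TCP connection. The SlowPreamble module and the timings are assumptions for illustration:

```ruby
require "bunny"

# Sketch only: stall the handshake right after the protocol preamble so the
# broker's handshake timeout closes the socket while Bunny is (re)connecting.
module SlowPreamble
  def send_preamble
    super      # preamble goes out first (the sleep comes after, as in the spec)
    sleep 15   # longer than the broker's handshake timeout
  end
end

Bunny::Session.prepend(SlowPreamble)

conn = Bunny.new(automatically_recover: true, network_recovery_interval: 2)
conn.start
```

In practice the delay would probably need to be switched on only after the initial connection has succeeded, so that only the recovery handshake is affected.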
Ok, maybe I've found the issue. Following the method calls: Session#recover_from_network_failure -> Session#start -> an exception is raised and caught by this rescue (https://github.com/ruby-amqp/bunny/blob/master/lib/bunny/session.rb#L320). The exception is then re-raised and, back in Session#recover_from_network_failure, isn't caught by any rescue. The method then returns and the client enters the said "zombie mode".
I suggested a simple fix to the problem here (https://github.com/madAle/bunny/tree/issue491) by adding SystemCallError to the list of rescued exceptions (https://github.com/madAle/bunny/blob/issue491/lib/bunny/session.rb#L727). @michaelklishin do you think this can be a good solution?
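For context, a sketch of the shape of the proposed change rather than the actual diff from the linked branch: Errno::ECONNRESET is a subclass of SystemCallError, so widening the rescue around the re-handshake keeps the reset inside the recovery loop. The attempt_recovery helper below is hypothetical, not a Bunny method:

```ruby
require "bunny"

# Errno::ECONNRESET is a SystemCallError:
p Errno::ECONNRESET.ancestors.include?(SystemCallError)  # => true

# Shape of the fix: rescue SystemCallError alongside the existing network
# exceptions so a reset is retried like any other recoverable failure
# instead of escaping recover_from_network_failure and "zombifying" the client.
def attempt_recovery(session, interval: 5)
  session.start
rescue Bunny::TCPConnectionFailed, SystemCallError => e
  warn "recovery attempt failed (#{e.class}), retrying in #{interval}s"
  sleep interval
  retry
end
```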
At first approximation this sounds reasonable. Please submit a PR and test it to the extent you can. Thank you very much for not giving up on this!
The test is reasonable, but it has two issues (like all tests that kill RabbitMQ nodes):
* The node is always running in the background and started using rabbitmq-server in PATH, so it's not possible to use a node built from source, for example.
* The test is environment-specific with respect to test suite permissions, timing, and the RabbitMQ provisioning choice.
Those are the reasons why our existing recovery tests force-close connections using the HTTP API and use a special Bunny recovery option. Ultimately such issues can only be considered resolved after the code has been running in production for some time.
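For reference, a hedged sketch of the force-close-via-HTTP-API pattern mentioned above, using the rabbitmq_http_api_client gem; the endpoint, port, and credentials are assumptions for a default localhost broker with the management plugin enabled:

```ruby
require "bunny"
require "rabbitmq/http/client"  # rabbitmq_http_api_client gem

http = RabbitMQ::HTTP::Client.new("http://127.0.0.1:15672",
                                  username: "guest", password: "guest")

conn = Bunny.new(automatically_recover: true,
                 network_recovery_interval: 1,
                 recover_from_connection_close: true)
conn.start

sleep 3  # give the management plugin a moment to register the connection

# Close every client connection the management API can see; the node itself
# never goes down, so the test does not depend on how RabbitMQ was provisioned.
http.list_connections.each { |c| http.close_connection(c.name) }

sleep 5
p conn.open?  # expected to be true again once recovery has run
```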
Hi everybody,
Sometimes I'm having this weird behaviour:
Note: where the log says "Client not connected! Check internet connection", that is output produced by my code, simply saying that the Bunny client is not connected.
Sequence of events:
RabbitMQ becomes unreachable -> reconnection attempts are triggered -> reconnection is initiated (using CA certificates etc.) -> an Errno::ECONNRESET exception is caught
So it seems that Bunny is trying to reconnect, but then it catches an Errno::ECONNRESET exception that somehow prevents the client from initiating the reconnection procedure. 😅
Does this smell like some kind of race condition?
I'm digging into the code trying to find a cause, but I'm facing difficulties reproducing the issue. Do you know how to simulate an Errno::ECONNRESET on localhost?
Cheers,
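On the last question, one way to provoke an Errno::ECONNRESET on localhost, sketched here rather than taken from the thread, is a throwaway TCP server that performs an abortive close so the peer sees a reset (the port is arbitrary):

```ruby
require "socket"

# Sketch: a local TCP server that resets every connection. Closing with
# SO_LINGER enabled and a zero timeout makes the OS send an RST instead of
# a FIN, so the peer should see Errno::ECONNRESET on its next read/write.
server = TCPServer.new("127.0.0.1", 5999)
Thread.new do
  loop do
    client = server.accept
    client.setsockopt(Socket::Option.linger(true, 0))
    client.close
  end
end

sock = TCPSocket.new("127.0.0.1", 5999)
sleep 0.1
begin
  sock.read(1)
rescue Errno::ECONNRESET => e
  puts "got #{e.class}"
end
```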