
Connection recovery attempt interrupted by an ECONNRESET seems to stop recovery process #491

Closed
madAle opened this issue Mar 29, 2017 · 9 comments · Fixed by #500

Comments

@madAle
Contributor

madAle commented Mar 29, 2017

Hi everybody,

sometimes I run into this weird behaviour:

D, [2017-03-27T08:40:17.603625 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:40:13 +0000, interval (s): 4
D, [2017-03-27T08:40:37.605348 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:40:33 +0000, interval (s): 4
D, [2017-03-27T08:40:57.611197 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:40:53 +0000, interval (s): 4
D, [2017-03-27T08:41:17.612933 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:41:13 +0000, interval (s): 4
D, [2017-03-27T08:41:37.619373 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:41:33 +0000, interval (s): 4
D, [2017-03-27T08:41:57.621130 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:41:53 +0000, interval (s): 4
E, [2017-03-27T08:41:58.193471 #2124] ERROR -- : Exception in the reader loop: AMQ::Protocol::EmptyResponseError: Empty response received from the server.
E, [2017-03-27T08:41:58.193818 #2124] ERROR -- : Backtrace: 
E, [2017-03-27T08:41:58.194270 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/amq-protocol-2.1.0/lib/amq/protocol/frame.rb:60:in `decode_header'
E, [2017-03-27T08:41:58.194442 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/transport.rb:245:in `read_next_frame'
E, [2017-03-27T08:41:58.194607 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/reader_loop.rb:68:in `run_once'
E, [2017-03-27T08:41:58.194770 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/reader_loop.rb:35:in `block in run_loop'
E, [2017-03-27T08:41:58.194933 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/reader_loop.rb:32:in `loop'
E, [2017-03-27T08:41:58.195093 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/reader_loop.rb:32:in `run_loop'
W, [2017-03-27T08:41:58.195293 #2124]  WARN -- : Recovering from a network failure...
E, [2017-03-27T08:41:58.268906 #2124] ERROR -- : Client not connected! Check internet connection
D, [2017-03-27T08:42:03.195727 #2124] DEBUG -- : About to start connection recovery...
E, [2017-03-27T08:42:03.271855 #2124] ERROR -- : Client not connected! Check internet connection
D, [2017-03-27T08:42:08.196443 #2124] DEBUG -- : About to start connection recovery...
D, [2017-03-27T08:42:08.207352 #2124] DEBUG -- : Using CA certificates at /usr/lib/ssl/cert.pem, /usr/lib/ssl/certs/ca-certificates.crt, /usr/lib/ssl/certs/ca-bundle.crt, /usr/lib/ssl/certs/ca-bundle.pem
D, [2017-03-27T08:42:08.207748 #2124] DEBUG -- : Using 0 inline CA certificates
W, [2017-03-27T08:42:08.295329 #2124]  WARN -- : Using TLS but no client certificate is provided! If RabbitMQ is configured to verify peer
certificate, connection upgrade will fail!

W, [2017-03-27T08:42:08.295715 #2124]  WARN -- : Retrying connection on next host in line: endpoint.spacebunny.io:5671
E, [2017-03-27T08:42:08.314578 #2124] ERROR -- : Client not connected! Check internet connection
D, [2017-03-27T08:42:08.499313 #2124] DEBUG -- : Sent protocol preamble
E, [2017-03-27T08:42:09.498437 #2124] ERROR -- : Got an exception when receiving data: Connection reset by peer (Errno::ECONNRESET)
E, [2017-03-27T08:42:13.315922 #2124] ERROR -- : Client not connected! Check internet connection
E, [2017-03-27T08:42:18.315014 #2124] ERROR -- : Client not connected! Check internet connection
E, [2017-03-27T08:42:23.315040 #2124] ERROR -- : Client not connected! Check internet connection
E, [2017-03-27T08:42:28.331195 #2124] ERROR -- : Client not connected! Check internet connection
E, [2017-03-27T08:42:33.315217 #2124] ERROR -- : Client not connected! Check internet connection

Note: the "Client not connected! Check internet connection" lines are output from my own code and simply mean that the Bunny client is not connected.

Sequence of events:
RabbitMQ becomes unreachable -> reconnection attempts are triggered -> a reconnection is initiated ("Using CA certificates", etc.) -> an Errno::ECONNRESET exception is caught

So it seems that Bunny is trying to reconnect, but then it catches an Errno::ECONNRESET exception that somehow prevents the client from continuing the reconnection procedure.
Does this smell like some kind of race condition?
I'm digging into the code trying to find the cause, but I'm having trouble reproducing the issue. Do you know how to simulate an Errno::ECONNRESET on localhost? 😅
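For reference, one general way to provoke an Errno::ECONNRESET locally is a throwaway TCP server that aborts connections with an RST instead of a normal FIN. A rough sketch of the mechanism (untested against Bunny itself):

```ruby
require "socket"

# Accept connections and immediately abort them. Enabling SO_LINGER with a
# zero timeout makes close() send a TCP RST, so the peer's next read or write
# fails with Errno::ECONNRESET. Point the client at this port instead of a
# real broker.
server = TCPServer.new("127.0.0.1", 5672)
loop do
  client = server.accept
  client.setsockopt(Socket::SOL_SOCKET, Socket::SO_LINGER, [1, 0].pack("ii"))
  client.close
end
```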

Cheers,

@michaelklishin
Copy link
Member

ECONNRESET means that the peer (a RabbitMQ node, a proxy, a load balancer) closed the TCP connection.

The client does attempt a recovery and sends the protocol preamble, and then its connection is closed by the peer.

I'd recommend upgrading to 2.6.4 first. It is difficult to reproduce the sequence of events without a traffic capture or a lot more logging and 2.6.4 logs quite a bit more.

@michaelklishin michaelklishin changed the title Weird behaviour - client not automatically reconnecting Connection recovery attempt interrupted by an ECONNRESET seems to stop recovery process Mar 29, 2017
@michaelklishin
Member

ECONNRESET means that the peer (a RabbitMQ node, a proxy, a load balancer) closed the TCP connection.
The client does attempt a recovery and sends the protocol preamble, and then its connection is closed by the peer. My only guess is that while that happened, this parallel recovery guard could kick in.
Adding more logging in the alternative branch could shed some light.

I'd recommend upgrading to 2.6.4 first. It is difficult to reproduce the sequence of events without a traffic capture or a lot more logging and 2.6.4 logs quite a bit more.

@madAle
Contributor Author

madAle commented Apr 21, 2017

Just an update to let you know that I've found some time to dig into the problem. I'm now able to reproduce the issue (kind of...), but I'm still looking for a suitable solution.

@mattcl

mattcl commented Apr 23, 2017

I seem to be having the same problem. Interestingly, my RabbitMQ server logs indicate a handshake failure, which is odd since the initial connection was not made with TLS. Could the client be attempting the reconnect with TLS?

@michaelklishin
Member

michaelklishin commented Apr 23, 2017 via email

@madAle
Contributor Author

madAle commented Apr 23, 2017

@michaelklishin I just added a spec on the "issue491" branch of my fork that simulates the issue by inserting a sleep into Bunny::Session#send_preamble.
It's an awful trick, but it produces a sequence of events very similar to the one reported in the opening message. So far I haven't found a better way to reproduce the issue 😞, despite many attempts with a proxy that truncates connections on demand...
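Roughly, the patching trick amounts to something like this (a hypothetical sketch, not the actual spec on the branch):

```ruby
require "bunny"

# Stall the preamble step so the broker's handshake timer (10 seconds by
# default) expires and resets the socket while a recovery attempt is still
# in flight. The sleep duration is arbitrary, just longer than the timeout.
module DelayedPreamble
  def send_preamble
    super      # write the AMQP protocol header as usual
    sleep 15   # then stall until the broker gives up on the handshake
  end
end

Bunny::Session.prepend(DelayedPreamble)
```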

Long story short: run the issue491 spec and note that at some point the client becomes a "zombie" and does nothing more.

I dug into the code and it seems that the exception raised here goes up the stack without being caught. I also made a few attempts to catch it in the right place, but unfortunately I haven't found a suitable solution yet.

Do you have any suggestions to point me in the right direction?
Cheers,

Ale

@michaelklishin
Member

read_fully is used in read_next_frame, so I'd expect its exceptions to be handled the same way as any other socket operation exceptions in the I/O loop.

You injected a sleep after sending the preamble, not before, so I'm not sure what may be going on without digging deeper. If a preamble isn't sent within a certain amount of time, RabbitMQ will simply close the TCP connection.

@madAle
Contributor Author

madAle commented Apr 27, 2017

Ok, maybe I've found the issue.

Following the method calls:

Session#recover_from_network_failure -> Session#start -> an exception is raised and caught by this rescue. The exception is then re-raised, and back in Session#recover_from_network_failure it isn't caught by any rescue.
The method then returns and the client enters the aforementioned "zombie mode".

I suggested a simple fix to the problem here: adding SystemCallError to the list of rescued exceptions.
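Roughly, the idea is the following (a simplified sketch, not the exact diff; the names below are illustrative rather than the actual Bunny internals):

```ruby
# Rescue SystemCallError (the parent class of Errno::ECONNRESET and the other
# errno exceptions) in the recovery path, so a reset during a recovery attempt
# schedules another attempt instead of escaping recover_from_network_failure.
def recover_from_network_failure
  sleep @network_recovery_interval
  start
rescue Bunny::TCPConnectionFailed, SystemCallError => e
  @logger.warn "Recovery attempt failed (#{e.class}: #{e.message}), retrying..."
  retry
end
```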

@michaelklishin do you think this could be a good solution?

@michaelklishin
Member

michaelklishin commented Apr 27, 2017 via email

madAle added a commit to madAle/bunny that referenced this issue Apr 28, 2017
@madAle madAle mentioned this issue Apr 28, 2017
michaelklishin added a commit that referenced this issue Apr 28, 2017
The test is reasonable but it has two issues (like all tests that kill RabbitMQ nodes):

 * The node is always running in the background and started using rabbitmq-server
   in PATH, so it's not possible to use a node built from source, for example.
 * The test is environment-specific w.r.t. the test suite permissions, timing,
   and RabbitMQ provisioning choice.

Those are the reasons why our existing recovery tests force close
connections using HTTP API and use a special Bunny recovery option.

Ultimately such issues can only be considered resolved after the code
has been running in production for some time.
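For context, the HTTP API force-close approach mentioned above looks roughly like this (a sketch assuming the management plugin on localhost:15672 with the default guest/guest credentials):

```ruby
require "net/http"
require "json"
require "uri"

# List the broker's open connections, then ask it to close the first one.
# Closing via the management API produces a clean, reproducible connection
# loss without killing the node itself.
base = URI("http://localhost:15672/api/connections")

list = Net::HTTP::Get.new(base)
list.basic_auth("guest", "guest")
connections = JSON.parse(Net::HTTP.start(base.host, base.port) { |http| http.request(list) }.body)

name  = URI.encode_www_form_component(connections.first.fetch("name"))
close = Net::HTTP::Delete.new(URI("http://localhost:15672/api/connections/#{name}"))
close.basic_auth("guest", "guest")
Net::HTTP.start(base.host, base.port) { |http| http.request(close) }
```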
michaelklishin pushed a commit that referenced this issue Apr 29, 2017
michaelklishin added a commit that referenced this issue Apr 29, 2017