Connection recovery attempt interrupted by an ECONNRESET seems to stop recovery process #491

Closed
madAle opened this Issue Mar 29, 2017 · 9 comments

Contributor

madAle commented Mar 29, 2017

Hi everybody,

sometimes I'm running into this weird behaviour:

D, [2017-03-27T08:40:17.603625 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:40:13 +0000, interval (s): 4
D, [2017-03-27T08:40:37.605348 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:40:33 +0000, interval (s): 4
D, [2017-03-27T08:40:57.611197 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:40:53 +0000, interval (s): 4
D, [2017-03-27T08:41:17.612933 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:41:13 +0000, interval (s): 4
D, [2017-03-27T08:41:37.619373 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:41:33 +0000, interval (s): 4
D, [2017-03-27T08:41:57.621130 #2124] DEBUG -- : Sending a heartbeat, last activity time: 2017-03-27 08:41:53 +0000, interval (s): 4
E, [2017-03-27T08:41:58.193471 #2124] ERROR -- : Exception in the reader loop: AMQ::Protocol::EmptyResponseError: Empty response received from the server.
E, [2017-03-27T08:41:58.193818 #2124] ERROR -- : Backtrace: 
E, [2017-03-27T08:41:58.194270 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/amq-protocol-2.1.0/lib/amq/protocol/frame.rb:60:in `decode_header'
E, [2017-03-27T08:41:58.194442 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/transport.rb:245:in `read_next_frame'
E, [2017-03-27T08:41:58.194607 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/reader_loop.rb:68:in `run_once'
E, [2017-03-27T08:41:58.194770 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/reader_loop.rb:35:in `block in run_loop'
E, [2017-03-27T08:41:58.194933 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/reader_loop.rb:32:in `loop'
E, [2017-03-27T08:41:58.195093 #2124] ERROR -- : 	/usr/local/rvm/gems/ruby-2.3.1/gems/bunny-2.6.3/lib/bunny/reader_loop.rb:32:in `run_loop'
W, [2017-03-27T08:41:58.195293 #2124]  WARN -- : Recovering from a network failure...
E, [2017-03-27T08:41:58.268906 #2124] ERROR -- : Client not connected! Check internet connection
D, [2017-03-27T08:42:03.195727 #2124] DEBUG -- : About to start connection recovery...
E, [2017-03-27T08:42:03.271855 #2124] ERROR -- : Client not connected! Check internet connection
D, [2017-03-27T08:42:08.196443 #2124] DEBUG -- : About to start connection recovery...
D, [2017-03-27T08:42:08.207352 #2124] DEBUG -- : Using CA certificates at /usr/lib/ssl/cert.pem, /usr/lib/ssl/certs/ca-certificates.crt, /usr/lib/ssl/certs/ca-bundle.crt, /usr/lib/ssl/certs/ca-bundle.pem
D, [2017-03-27T08:42:08.207748 #2124] DEBUG -- : Using 0 inline CA certificates
W, [2017-03-27T08:42:08.295329 #2124]  WARN -- : Using TLS but no client certificate is provided! If RabbitMQ is configured to verify peer
certificate, connection upgrade will fail!

W, [2017-03-27T08:42:08.295715 #2124]  WARN -- : Retrying connection on next host in line: endpoint.spacebunny.io:5671
E, [2017-03-27T08:42:08.314578 #2124] ERROR -- : Client not connected! Check internet connection
D, [2017-03-27T08:42:08.499313 #2124] DEBUG -- : Sent protocol preamble
E, [2017-03-27T08:42:09.498437 #2124] ERROR -- : Got an exception when receiving data: Connection reset by peer (Errno::ECONNRESET)
E, [2017-03-27T08:42:13.315922 #2124] ERROR -- : Client not connected! Check internet connection
E, [2017-03-27T08:42:18.315014 #2124] ERROR -- : Client not connected! Check internet connection
E, [2017-03-27T08:42:23.315040 #2124] ERROR -- : Client not connected! Check internet connection
E, [2017-03-27T08:42:28.331195 #2124] ERROR -- : Client not connected! Check internet connection
E, [2017-03-27T08:42:33.315217 #2124] ERROR -- : Client not connected! Check internet connection

Note: the "Client not connected! Check internet connection" lines are output produced by my own code, simply indicating that the Bunny client is not connected.

Sequence of events:
RabbitMQ becomes unreachable -> reconnection attempts are triggered -> reconnection is initiated (using CA certificates, etc.) -> an Errno::ECONNRESET exception is caught

So it seems that Bunny is trying to reconnect, but then it catches an Errno::ECONNRESET exception that somehow prevents the client from initiating the reconnection procedure.
Does this smell like some kind of race condition?
I'm digging into the code trying to find a cause, but I'm facing difficulties reproducing the issue. Do you know how to simulate an Errno::ECONNRESET on localhost? 😅

Cheers,
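
For reference, one way to provoke an Errno::ECONNRESET locally is a throwaway TCP listener that aborts accepted connections with an RST. This is just a rough sketch, not part of the actual spec; host, port and timing are arbitrary:

require "socket"

# Throwaway listener: setting SO_LINGER to (on, 0) makes close() send a TCP RST
# instead of a normal FIN, so the peer's next read/write raises Errno::ECONNRESET.
server = TCPServer.new("127.0.0.1", 5672)

Thread.new do
  loop do
    client = server.accept
    client.setsockopt(Socket::SOL_SOCKET, Socket::SO_LINGER, [1, 0].pack("ii"))
    client.close
  end
end

# Point a Bunny connection (or a plain TCPSocket) at 127.0.0.1:5672;
# the first read or write after the reset should raise Errno::ECONNRESET.
sleep # keep the listener alive while a client is pointed at it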

Member

michaelklishin commented Mar 29, 2017

ECONNRESET means that the peer (a RabbitMQ node, a proxy, a load balancer) closed the TCP connection.

The client does attempt recovery: it sends the protocol preamble, and then the connection is closed by the peer.

I'd recommend upgrading to 2.6.4 first. It is difficult to reproduce the sequence of events without a traffic capture or a lot more logging, and 2.6.4 logs quite a bit more.
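
A minimal sketch of turning on more verbose client-side logging while diagnosing this, assuming the :logger and recovery options available in Bunny 2.6.x (host and values are illustrative):

require "bunny"
require "logger"

logger = Logger.new($stdout)
logger.level = Logger::DEBUG

conn = Bunny.new(host: "127.0.0.1",
                 logger: logger,                # route Bunny's own messages to this logger
                 automatically_recover: true,   # enable network failure recovery
                 network_recovery_interval: 5)  # seconds between recovery attempts
conn.start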

michaelklishin changed the title from "Weird behaviour - client not automatically reconnecting" to "Connection recovery attempt interrupted by an ECONNRESET seems to stop recovery process" Mar 29, 2017

Member

michaelklishin commented Mar 29, 2017

ECONNRESET means that the peer (a RabbitMQ node, a proxy, a load balancer) closed the TCP connection.
The client does attempt recovery: it sends the protocol preamble, and then the connection is closed by the peer. My only guess is that while that happened, this parallel recovery guard could kick in.
Adding more logging in the alternative branch could shed some light.

I'd recommend upgrading to 2.6.4 first. It is difficult to reproduce the sequence of events without a traffic capture or a lot more logging, and 2.6.4 logs quite a bit more.
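
For context, a purely illustrative sketch of what such a parallel-recovery guard pattern can look like; this is not Bunny's actual code, and the class and method names are made up. The early-return branch is the place where extra logging would show whether a concurrent attempt gets silently dropped:

require "monitor"
require "logger"

# Illustrative only: a guard flag so that only one thread performs recovery at a time.
class RecoveringClient
  include MonitorMixin

  def initialize(logger = Logger.new($stdout))
    super()
    @logger     = logger
    @recovering = false
  end

  def recover_from_network_failure
    synchronize do
      if @recovering
        # The "alternative branch": a second attempt is skipped; log it to see when this happens.
        @logger.debug "Recovery already in progress, skipping this attempt"
        return
      end
      @recovering = true
    end

    begin
      reconnect
    ensure
      synchronize { @recovering = false }
    end
  end

  private

  def reconnect
    # Hypothetical placeholder for re-establishing the TCP connection.
    @logger.info "Reconnecting..."
  end
end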

Contributor

madAle commented Apr 21, 2017

Just an update to let you know that I've found some time to dig into the problem. I'm now able to reproduce the issue (kind of...), but I'm still looking for a suitable solution.

mattcl commented Apr 23, 2017

I seem to be having this problem as well. Interestingly, my RabbitMQ server logs indicate a handshake failure, which is weird, since the initial connection was not made with TLS. Could the client be attempting to reconnect with TLS?

Contributor

madAle commented Apr 23, 2017

@michaelklishin I just added a spec on the "issue491" branch of my fork that simulates the issue by inserting a sleep into Bunny::Session#send_preamble.
Despite being an awful trick, this causes a sequence of events very similar to the one reported in the opening message. I haven't found a better way to reproduce the issue yet 😞, despite many attempts with a proxy that truncates connections on demand...
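
The trick boils down to something like the following sketch (not the actual spec code; the 10-second value is arbitrary). Stalling right after the preamble is written keeps the handshake from progressing, so the server eventually drops the TCP connection and recovery runs into the reset:

require "bunny"

# Rough sketch of the delay injection described above; the real spec lives on
# the "issue491" branch.
module Bunny
  class Session
    alias_method :original_send_preamble, :send_preamble

    def send_preamble
      original_send_preamble
      sleep 10 # arbitrary delay, long enough for the server to give up on the handshake
    end
  end
end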

Long story short: run the issue491_spec and note that at some point the client becomes a "zombie" and does nothing more.

I dug into the code and it seems like the exception raised here goes up the stack without being caught. I made a few attempts to catch it in the right place, but unfortunately I haven't found a suitable solution yet.

Do you have any suggestions to point me in the right direction?
Cheers,

Ale

Member

michaelklishin commented Apr 23, 2017

read_fully is used in read_next_frame, so I'd expect it to be handled the same way as any other socket operation exception in the I/O loop.

You injected a sleep after sending the preamble, not before, so I'm not sure what may be going on without digging deeper. If a preamble isn't sent within a certain amount of time, RabbitMQ will simply close the TCP connection.

Contributor

madAle commented Apr 27, 2017

Ok, maybe I've found the issue.

Following the method calls:

Session#recover_from_network_failure -> Session#start -> an exception is raised and caught by this rescue. The exception is then re-raised, and on the way back up to Session#recover_from_network_failure it isn't caught by any rescue.
The method then returns and the client enters the aforementioned "zombie mode".

I suggested a simple fix to the problem here: adding SystemCallError to the list of rescued exceptions.
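
The idea, roughly (an illustrative sketch of the pattern, not the exact patch): Errno::ECONNRESET is a subclass of SystemCallError, so rescuing SystemCallError in the recovery path keeps a reset during the handshake from escaping the recovery thread, and the attempt is retried instead:

require "socket"

RECOVERY_INTERVAL = 5 # seconds, arbitrary for this sketch

# Stand-in for Session#start: connecting to a port nobody listens on raises
# Errno::ECONNREFUSED, another SystemCallError subclass, for demonstration.
def attempt_reconnect
  TCPSocket.new("127.0.0.1", 5671).close
end

def recover_from_network_failure(max_attempts = 3)
  attempts = 0
  begin
    attempt_reconnect
    puts "Recovered"
  rescue SystemCallError, IOError => e
    attempts += 1
    if attempts >= max_attempts
      puts "Giving up after #{attempts} attempts"
      return
    end
    puts "Recovery attempt failed (#{e.class}: #{e.message}), retrying in #{RECOVERY_INTERVAL}s"
    sleep RECOVERY_INTERVAL
    retry
  end
end

recover_from_network_failure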

@michaelklishin do you think this can be a good solution?

madAle added a commit to madAle/bunny that referenced this issue Apr 28, 2017

madAle referenced this issue Apr 28, 2017

Merged

Fix #491 (#500)

michaelklishin added a commit that referenced this issue Apr 28, 2017

Drop the proposed test for #491.
The test is reasonable but it has two issues (like all tests that kill RabbitMQ nodes):

 * The node is always running in the background and started using rabbitmq-server
   in PATH, so it's not possible to use a node built from source, for example.
 * The test is environment-specific w.r.t. the test suite permissions, timing,
   and RabbitMQ provisioning choice.

Those are the reasons why our existing recovery tests force close
connections using HTTP API and use a special Bunny recovery option.

Ultimately such issues can only be considered resolved after the code
has been running in production for some time.

michaelklishin added a commit that referenced this issue Apr 29, 2017

michaelklishin added a commit that referenced this issue Apr 29, 2017


michaelklishin added the bug label and removed the no decision label May 10, 2017
