Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.4.4 -> 3.5.1 Timeout #68

Closed
btecu opened this issue Apr 22, 2015 · 21 comments
Closed

3.4.4 -> 3.5.1 Timeout #68

btecu opened this issue Apr 22, 2015 · 21 comments
Assignees
Milestone

Comments

@btecu
Copy link

btecu commented Apr 22, 2015

I have a small wrapper using this library on .NET 4.5.
I've updated from 3.4.4 to 3.5.1 and now, the connection gets closed and it crashes whenver I try to awknowledge a message:
AMQP close-reason, initiated by Library, code=0, text="End of stream"

I've found that this only happens when processing the message takes more than the current RequestHeartbeat. So if I increase it, it Ack() fine, otherwise it closes the connection and if throws.

Has anything changed related to this or am I using RequestHeartbeat in the wrong way?

@michaelklishin
Copy link
Member

3.5 uses timers for heartbeats. I suspect there's an interesting side-effect to that that we haven't foreseen.

@michaelklishin
Copy link
Member

Can you please do a wire capture with Wireshark or our Tracer tool (writes to stdout)? Use a small heartbeat timeout value, e.g. 5 seconds.

@btecu
Copy link
Author

btecu commented Apr 22, 2015

Tracer doesn't seem to want to cooperate:

   Serially traces connections on the <listenport>, logging
   frames received and passing them to the connect host and port.
Invoked as: Tracer 5673 localhost 5672
com.rabbitmq.tools.Tracer.WITHHOLD_INBOUND_HEARTBEATS = false
com.rabbitmq.tools.Tracer.WITHHOLD_OUTBOUND_HEARTBEATS = false
com.rabbitmq.tools.Tracer.NO_ASSEMBLE_FRAMES = false
com.rabbitmq.tools.Tracer.NO_DECODE_FRAMES = false
com.rabbitmq.tools.Tracer.SUPPRESS_COMMAND_BODIES = false

I did find the following in the logs:

=INFO REPORT==== 22-Apr-2015::16:43:49 ===
accepting AMQP connection <0.10029.1> (192.168.1.6:64537 -> 192.168.1.10:5672)

=ERROR REPORT==== 22-Apr-2015::16:44:01 ===
closing AMQP connection <0.10029.1> (192.168.1.6:64537 -> 192.168.1.10:5672):
{heartbeat_timeout,running}

=INFO REPORT==== 22-Apr-2015::16:45:48 ===
accepting AMQP connection <0.10100.1> (192.168.1.6:64554 -> 192.168.1.10:5672)

=ERROR REPORT==== 22-Apr-2015::16:46:00 ===
closing AMQP connection <0.10100.1> (192.168.1.6:64554 -> 192.168.1.10:5672):
{heartbeat_timeout,running}

@michaelklishin
Copy link
Member

I could reproduce it but it's sporadic. Investigating.

michaelklishin added a commit that referenced this issue Apr 23, 2015
With this change .NET client uses the same approach as the Java
one.

Previously in 3.5.x we've used two timers that
re-scheduled themselves and used auto-reset events to be
notified about connection activity on the I/O loop.

This turned out to have some issues:

 * It was clever but fairly complicated
 * Picking the wrong heartbeat interval value led to sporadic
   failures due to race conditions [on the wire]
 * It didn't play well with socket timeouts

Our new strategy is more straightforward:
heartbeat writer uses a periodic timer that always sends
a frame, regardless of recent activity, every 1/2nd a timeout.

Given that heartbeat timeouts are often in minutes, this is not
unreasonable but possibly the most straightforward implementation
possible. It also doesn't suffer from race conditions on the wire
because we send a heartbeat frame every half the interval, so
slight inaccuracy in .NET runtime scheduling of timers won't
get in our way with reasonable (> 1 second) heartbeat intervals.

Heartbeat "reader" now relies on the socket receive timeout,
just like in the Java client. This is more straightforward
and there is no dissonance between I/O timeouts and the one we
use for incoming heartbeats.

Various cases when a SocketException could indicate a problem
other than a timeout (usually early in the connection
lifecycle) are accounted for.

References #68.
@michaelklishin
Copy link
Member

Can someone please build the client from branch rabbitmq-dotnet-client-68 and give it a try?

I believe it fixes the issue (there was a couple of issues, in fact). I'll leave a test running for 6-8 hours.

@btecu
Copy link
Author

btecu commented Apr 24, 2015

I'll give it a try tomorrow. Thanks for your help.

@btecu
Copy link
Author

btecu commented Apr 24, 2015

I now get a different error, although it may be related:
AMQP close-reason, initiated by Library, code=0, text="End of stream", classId=0, methodId=0, cause=System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host.

The heartbeat is under five seconds. This only happens if the consumer spends more than the heartbeat duration to process and acknowledge the message. If I Ack() the message immediately, it works.

@michaelklishin
Copy link
Member

That message means that TCP connection was closed by the server: what's in RabbitMQ log? Please always consult RabbitMQ log when investigating a heartbeat issue.

Any activity counts for a heartbeat, including acks. Values < 5 seconds are generally unnecessary and may result in false positives with slow networks, although I've added tests with values from 2 to 6.

@michaelklishin
Copy link
Member

I've just pushed something that should improve the situation for very low heartbeats (1-3 seconds). Please pull, rebuild and give it another try.

@michaelklishin
Copy link
Member

The branch I'm pushing to is rabbitmq-dotnet-client-68.

@michaelklishin
Copy link
Member

I was still able to reproduce timeouts after a longer period of time. Back to the drawing board.

@btecu
Copy link
Author

btecu commented Apr 24, 2015

I just tried the update.
Before the update, I could see the connection was closed right when I was about to Ack(). Now it's not closed at that point and I stepped into ModelBase.cs and I see the connection closed at ModelSend():
CloseReason: {AMQP close-reason, initiated by Library, code=0, text="End of stream", classId=0, methodId=0, cause=System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host

From the log:

=ERROR REPORT==== 24-Apr-2015::09:45:09 ===
closing AMQP connection <0.20369.1> (192.168.1.6:58138 -> 192.168.1.10:5672):
{heartbeat_timeout,running}

@michaelklishin
Copy link
Member

What you see is RabbitMQ closing TCP connection due to a detected heartbeat, then the client fails a write.

@michaelklishin
Copy link
Member

With the most recent precision fix I've been running 500 connections with intervals from 1 to 7 seconds for about an hour now.

@btecu we are ready for another round :)

@btecu
Copy link
Author

btecu commented Apr 24, 2015

It works perfectly! Thanks.

@michaelklishin
Copy link
Member

My tests have been running for a few hours and also see not a single heartbeat timeout. Will merge soon.

@dumbbell
Copy link
Member

dumbbell commented May 6, 2015

I change the milestone to 3.6.0 because it was merged to master.

@dumbbell dumbbell modified the milestones: 3.6.0, 3.5.2 May 6, 2015
@btecu
Copy link
Author

btecu commented May 6, 2015

So it won't make it into 3.5.2?

@dumbbell dumbbell modified the milestones: 3.5.2, 3.6.0 May 6, 2015
@dumbbell
Copy link
Member

dumbbell commented May 6, 2015

Sorry, disregard my comment, my working copy of the stable branch was out-of-date. It will be in 3.5.2.

@michaelklishin
Copy link
Member

It should be in stable already.

MK

On 6/5/2015, at 16:55, btecu notifications@github.com wrote:

So it won't make it into 3.5.2?


Reply to this email directly or view it on GitHub.

@osudude
Copy link

osudude commented May 6, 2015

@michaelklishin Thanks for the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants