Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async API - socket read timeout has no effect #1492

Closed
lukebakken opened this issue Feb 6, 2024 · 7 comments · Fixed by #1497
Closed

Async API - socket read timeout has no effect #1492

lukebakken opened this issue Feb 6, 2024 · 7 comments · Fixed by #1497
Assignees
Milestone

Comments

@lukebakken
Copy link
Contributor

lukebakken commented Feb 6, 2024

Is your feature request related to a problem? Please describe.

The v7 async API uses .NET Pipelines and ReadAsync from a NetworkStream in SocketFrameHandler:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/RabbitMQ.Client/client/impl/SocketFrameHandler.cs#L284

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/RabbitMQ.Client/client/impl/Frame.cs#L257-L283

Even though we set a socket read timeout here, this will have no effect on an async read (docs)

So, this means that reads will never time out. This should be OK because we have heartbeats, but it would be great to have a test for this.

Describe the solution you'd like

A test for socket read timeout. It'll probably involve toxiproxy.

@lukebakken lukebakken added this to the 7.0.0 milestone Feb 6, 2024
@lukebakken lukebakken self-assigned this Feb 6, 2024
@lukebakken
Copy link
Contributor Author

lukebakken commented Feb 9, 2024

cc @danielmarbach @bording @bollhals

I'm pinging y'all because of what I've found with regard to an async read timeout. As I suspected, it can only be achieved using a CancellationTokenSource per read attempt:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq-dotnet-client-1492/projects/RabbitMQ.Client/client/impl/SocketFrameHandler.cs#L289-L294

This will probably affect performance negatively.

What I have found, however, is that the heartbeat mechanism does eventually figure out when no data is coming over the socket, and closes the connection. However, with RequestedHeartbeat set to 5 seconds, it takes about 45 seconds for the connection to actually close:

> .\projects\gH1492\bin\Debug\net6.0\GH1492.exe
[INFO] 2024-02-09T17:52:47 START TOXIC NOW
[INFO] 2024-02-09T17:52:58 publishing message: 0
[INFO] 2024-02-09T17:52:59 publishing message: 1
[INFO] 2024-02-09T17:53:00 publishing message: 2
[INFO] 2024-02-09T17:53:01 publishing message: 3
[INFO] 2024-02-09T17:53:02 publishing message: 4
[INFO] 2024-02-09T17:53:03 publishing message: 5
[INFO] 2024-02-09T17:53:04 publishing message: 6
[ERROR] 2024-02-09T17:53:48 Caught execption: RabbitMQ.Client.Exceptions.AlreadyClosedException: Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Library, code=0, text='End of stream', classId=0, methodId=0, exception=System.IO.EndOfStreamException: Heartbeat missing with heartbeat == 00:00:05 seconds
   at RabbitMQ.Client.Impl.ChannelBase.WaitForConfirmsOrDieAsync(CancellationToken token) in C:\Users\lbakken\development\rabbitmq\rabbitmq-dotnet-client_main\projects\RabbitMQ.Client\client\impl\ChannelBase.cs:line 1922
   at GH1492.Program.AsyncMain() in C:\Users\lbakken\development\rabbitmq\rabbitmq-dotnet-client_main\projects\GH1492\Program.cs:line 40
[INFO] 2024-02-09T17:53:48 Closing channel and connection...
[INFO] 2024-02-09T17:53:48 Done closing channel and connection.

I feel like I should focus on getting the heartbeat mechanism a bit more reactive in this situation.

Thoughts?

@bording
Copy link
Collaborator

bording commented Feb 9, 2024

This will probably affect performance negatively.

Can you expand a bit on what you mean by this?

@lukebakken
Copy link
Contributor Author

Can you expand a bit on what you mean by this?

Well, there have been so many micro-optimizations in this library with regard to reducing object initialization and allocation I'm assuming the same would apply to this scenario. I can run some benchmarks, of course.

I think taking a close look at heartbeats needs to happen anyway.

@danielmarbach
Copy link
Collaborator

Can you observe the timer queue counter behavior and take a dump? The timed CTS uses timers as far as I recall and it sounds like we are stalling there somehow.

You could also try to pool the CTS and try reset them to see if it makes the problem occur less which could give us more intel too

@lukebakken
Copy link
Contributor Author

@danielmarbach there's no problem - when I use a cancellation token source with a timeout, the async read is cancelled as expected:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq-dotnet-client-1492/projects/RabbitMQ.Client/client/impl/SocketFrameHandler.cs#L289-L294

My concern is about the performance cost of setting up a CTS for every call to that method.

@danielmarbach
Copy link
Collaborator

danielmarbach commented Feb 11, 2024

@lukebakken sorry I misunderstood the whole conversation.

I think I would get started with the version you have in the linked code. You can then throw it under a benchmark if needed to compare to a more optimized version. I was thinking a bit about an optimized version but it could become quite cumbersome and potentially error prone. For example you could optimize stuff around the fact that there is only one mainloop and the read timeout rarely ever changes. That would make it possible to have a cancellation token source managed by the frame handler that is always reset after every ReadFrameAsync call. You would then still have to have a linked token source that is linked to the token passed into ReadFrameAsync. Of course technically you could do a similar type of trickery to further assume the token passed there is only ever get cancelled when the main loop gets cancelled and cache that linked one too. Given though that this significantly increases the complexity of this code due to the branching and the coupling to the assumptions of how it is being called this type of "optimization" could bite back sooner rather than latter so I'm very skeptical if it is worth it and as said you would first have to measure the proposed changes against the version with resets and additional conditionals to see if it is indeed faster.

Maybe there are also other solutions but this is all I could come up with in the limited time I had available.

@lukebakken
Copy link
Contributor Author

@danielmarbach I appreciate your input!

That would make it possible to have a cancellation token source managed by the frame handler that is always reset after every ReadFrameAsync call.

Yep, I considered that, but reset is not supported in netstandard2.0.

I am going to add tests to ensure the heartbeat and continuation timeout feature adequately covers the scenario of a socket read timeout. I'm pretty sure it will. In that case, socket read and write timeouts should no longer be necessary.

lukebakken added a commit that referenced this issue Feb 19, 2024
Fixes #1492

* Add dedicated test class for Toxiproxy
* Start Toxiproxy via docker on GHA ubuntu runner
* Start Toxiproxy in same Powershell session as `dotnet test` on GHA
  windows runner
* Add toxiproxy-netcore as a submodule so that we can strong-name it.
* Remove use of socket read timeout, and depend on heartbeats instead.
* Remove duplicate tests
* Use `ConcurrentBag` where necessary
lukebakken added a commit that referenced this issue Feb 19, 2024
Fixes #1492

* Add dedicated test class for Toxiproxy
* Start Toxiproxy via docker on GHA ubuntu runner
* Start Toxiproxy in same Powershell session as `dotnet test` on GHA
  windows runner
* Add toxiproxy-netcore as a submodule so that we can strong-name it.
* Remove use of socket read timeout, and depend on heartbeats instead.
* Remove duplicate tests
* Use `ConcurrentBag` where necessary
lukebakken added a commit that referenced this issue Feb 19, 2024
Fixes #1492

* Add dedicated test class for Toxiproxy
* Start Toxiproxy via docker on GHA ubuntu runner
* Start Toxiproxy in same Powershell session as `dotnet test` on GHA
  windows runner
* Add toxiproxy-netcore as a submodule so that we can strong-name it.
* Remove use of socket read timeout, and depend on heartbeats instead.
* Remove duplicate tests
* Use `ConcurrentBag` where necessary
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants