When server closes TCP connection because TCP window is full, connection recovery does not kick in #341
I executed the provided test. It simulates a slow consumer, which eventually leads to a full TCP buffer on the server end that trips up socket writes. It works well over localhost. Connection recovery does kick in, however, and the "Consumed so far …" counter keeps growing:
There's evidence of successful client reconnections in the broker log:
Thanks @michaelklishin for looking into this! If you are seeing connection recoveries, then that is more than I'm seeing. The moment I see:

2018-01-11 07:09:24.930 [warning] <0.548.0> closing AMQP connection <0.548.0> (127.0.0.1:57760 -> 127.0.0.1:5672):
{writer,send_failed,{error,timeout}}

I see no additional reconnects from the client (and your log shows the same?), and Wireshark seems to confirm this as well. I see two TCP connections established to RabbitMQ from the Java process: the producer connection and the consumer connection. Once the TCP buffer is full, RabbitMQ appears to close the connection. No matter how long I wait, I see none of my Shutdown or Recovery listeners fire, and Wireshark and the RabbitMQ logs seem to indicate no further connections being established.

I haven't looked deeply into the Java client library, but I suspect that the reason it keeps consuming messages is that they are simply buffered in memory and we're processing them very slowly (one every three seconds). If the underlying connection did recover, I suspect that the basicAcks would eventually begin to work again, but no matter how long you run the tests they never begin to work.
This is not my area of expertise, but I can say that we observe a similar issue from time to time, going back to the 4.x series of the client. Symptoms: (a) it only happens when a lot of messages are sent and received (single instance) within a very short time span, (b) it only happens when SSL is enabled, and (c) recovery doesn't kick in. I'm sorry that I cannot be of more help; we cannot reproduce the issue on demand (which is why we haven't filed an issue so far), and I cannot even say for sure whether what we observe matches @vincentjames501's issue. But it does sound like it.
Socket write timeouts can be configured via TCP socket settings. Our team does not use issues for investigations or discussions. This one is about listeners not firing even though recovery does happen with the provided test.
@michaelklishin, are you positive about this? As I mentioned above, I'm not seeing this at all, and your log suggested that too from what you posted. Any feedback on my comment #341 (comment)?
I cannot reproduce the supposed inability of the library to recover the connection. I've provided evidence in the form of test program output and server logs. I don't see any messages about listener events; we haven't investigated why.

As I explained above, there is no way for this client or RabbitMQ to avoid TCP socket write timeouts, whether they happen due to a full buffer because the consumer cannot keep up or for any other reason. I have mentioned two TCP socket settings that should reduce the probability of socket write timeouts for the specific scenario in the provided test app.

As I mentioned twice, our team does not use GitHub issues for investigations. Please take this to the mailing list and clarify the behavior you expect.
I've provided evidence in the form of test program output and server logs, which clearly mention multiple successful client connections (I did not start multiple copies of the test).
I re-ran the test, and here are some more findings.
So even after connection recovery, it would not get any more of the deliveries @vincentjames501 expects. I see different socket exception variations, e.g.:

Judging by RabbitMQ's connection state, there are two things that I hadn't noticed the first time:

Without inspecting the entire capture and adding a certain amount of debug logging to the client, I cannot tell why the client never gets any I/O exceptions on the path that kicks off connection recovery. Once we have more details and can confirm that this is a bug in the library, a new issue will be filed. Using manual acknowledgements with a limited prefetch avoids the fundamental issue of TCP buffers filling up, so that's the recommendation we have in the meantime. Thanks again for the provided example.
@Stephan202 if what you see only happens with TLS enabled, it could be related to this rabbitmq-users thread. |
new ConnectionFactory().isAutomaticRecoveryEnabled(); => true
new ConnectionFactory().isTopologyRecoveryEnabled(); => true

I pushed some small changes to the test to make things easier. I now close the producer connection & channel after we're done, and name the connections so they are easier to see and visualize. I also added the topology recovery option. I don't know why this issue keeps getting closed; there is plenty of information here to suggest an issue with the client. Here is a video showing that there are no reconnection attempts made at all.
@vincentjames501 our team does not use GitHub issues for investigations. When we understand what is going on, a new issue will be filed. If you have new findings, feel free to start a rabbitmq-users thread. |
For those who are affected: the workaround is to use manual acknowledgements with a prefetch. |
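In Java client terms, the workaround boils down to setting a prefetch with basicQos before subscribing, then acking manually as each message is processed. The sketch below uses a minimal stand-in Channel interface so it is self-contained; the real interface is com.rabbitmq.client.Channel (with more parameters on basicConsume), and the queue name and prefetch value here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

class PrefetchWorkaround {
    // Minimal stand-in for com.rabbitmq.client.Channel (simplified signatures).
    interface Channel {
        void basicQos(int prefetchCount);
        void basicConsume(String queue, boolean autoAck);
        void basicAck(long deliveryTag, boolean multiple);
    }

    // Records calls so the ordering can be verified without a broker.
    static class RecordingChannel implements Channel {
        final List<String> calls = new ArrayList<>();
        public void basicQos(int prefetchCount) { calls.add("basicQos(" + prefetchCount + ")"); }
        public void basicConsume(String queue, boolean autoAck) { calls.add("basicConsume(autoAck=" + autoAck + ")"); }
        public void basicAck(long deliveryTag, boolean multiple) { calls.add("basicAck(" + deliveryTag + ")"); }
    }

    // With a prefetch of 50, the broker stops sending once 50 deliveries are
    // unacknowledged, so unprocessed messages can no longer pile up in the
    // TCP buffers of a slow consumer.
    static void subscribe(Channel ch) {
        ch.basicQos(50);                 // 1. limit unacked deliveries FIRST
        ch.basicConsume("work", false);  // 2. then subscribe with manual ack
    }
}
```

Each worker then calls basicAck when it finishes a message, which releases one prefetch slot and lets the broker send the next delivery.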
I imported the project but couldn't reproduce the issue. I got the following twice:
My environment:
@acogoluegnes, FWIW I can't get it to happen when running RabbitMQ inside the official RabbitMQ Docker container. It is easily reproducible on an OSX machine running the official RabbitMQ binaries directly on the host (OSX 10.13.2); I forgot to note that in the original bug report. We DO see the exact same behavior in our AWS environment, however, so I don't think it is a Linux vs OSX thing either. I think the reason it works in Docker for me is that it is substantially slower when using persistent messages than when I run it natively, but it could be a lower-level networking difference as well (though that seems unlikely?).
I can reproduce it on a 4-year-old Mac easily. I suspect the rate of deliveries has to be high enough for the TCP window to be exhausted completely.
@acogoluegnes, you may be able to get it to happen by disabling persistent messages to increase the rate of delivery? Not entirely sure, but I created that test to be a boiled-down version of exactly what we are doing in production, so I didn't tinker with many more settings.
Reproduced as-is on my 2015 MBP:
Reproduced on Linux by setting the number of messages to 100 000:
A thread dump reveals the reading thread is stuck waiting for a delivery to be put on the work queue. Unfortunately, this operation has no timeout. It needs more digging to know why the consumption of deliveries doesn't unblock the work queue.
The consumption does unblock the work queue, but more deliveries come in. They must be the remainder of the saturated TCP buffer. Automatic connection recovery doesn't kick in, as it's started only on the reading side. The reading thread will finally notice something is wrong once it has finished reading the TCP buffer, so this can take a while. There's not much we can do to detect that the connection is dead, considering this is a typical case of the broker overflowing a slow client, and that's what QoS was implemented for. We can make the work queue capacity configurable (the default is 1000 per channel). Making it bigger wouldn't hurt in cases like this one. WDYT @vincentjames501 @michaelklishin?
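The bounded work queue idea can be sketched with a plain BlockingQueue: enqueueing a delivery fails after a timeout instead of blocking the reading thread forever when the consumer can't keep up. This is a generic illustration of the mechanism, not the client's actual WorkPool implementation; the capacity and timeout values are placeholders:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class BoundedWorkQueue {
    private final BlockingQueue<Runnable> queue;
    private final long enqueueTimeoutMs;

    BoundedWorkQueue(int capacity, long enqueueTimeoutMs) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.enqueueTimeoutMs = enqueueTimeoutMs;
    }

    // Returns false when the queue stayed full for the whole timeout.
    // The caller can then treat the channel as overloaded (or the
    // connection as suspect) instead of parking indefinitely.
    boolean enqueue(Runnable work) throws InterruptedException {
        return queue.offer(work, enqueueTimeoutMs, TimeUnit.MILLISECONDS);
    }

    // Consumer side: blocks until work is available.
    Runnable take() throws InterruptedException {
        return queue.take();
    }
}
```

With an unbounded wait (the current behavior described above), the reading thread would simply block here and never reach the code that notices the dead socket.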
I haven't spent much time analyzing the internals of the library to give a thoughtful response; however, there must be some hook indicating that the underlying TCP connection is disconnected, no? A packet capture clearly shows the connection is terminated; shouldn't the consuming thread notice this and kick off a recovery (and stop delivering more work to the subscribers)? Would you mind explaining (because of my ignorance) why this is different from a typical connection failure? If I just terminate the connection at RabbitMQ, connection recovery initiates immediately, even if there are unacked messages.
@vincentjames501 this feature has been around for years; no need to assume it doesn't handle the most obvious failure scenarios. If we ignore NIO for a moment, every connection has an I/O loop thread which handles all exceptions and starts a shutdown procedure. Connections with automatic recovery enabled register a shutdown hook that kicks off the recovery. When the TCP window is saturated, there is no I/O exception thrown by the JDK, even after the server detects missed heartbeats and closes the connection. If we knew why, this would likely have been addressed already.
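The mechanism described above can be sketched generically. This is an illustration of the pattern (I/O exception on read → shutdown procedure → listeners, with recovery registered as one listener), not the client's actual main-loop or recovery code:

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class IoLoopSketch {
    interface ShutdownListener { void shutdownCompleted(IOException cause); }
    interface Reader { void readFrame() throws IOException; }

    private final List<ShutdownListener> listeners = new CopyOnWriteArrayList<>();
    volatile boolean recoveryStarted = false;

    void addShutdownListener(ShutdownListener l) { listeners.add(l); }

    // Automatic recovery is effectively a shutdown listener that reconnects.
    void enableAutomaticRecovery() {
        addShutdownListener(cause -> recoveryStarted = true);
    }

    // One iteration of the I/O loop: any I/O exception on read starts the
    // shutdown procedure, which notifies listeners. If the read never throws
    // (the JDK reports nothing on a saturated window, or the thread is parked
    // on a full work queue), none of this runs -- the failure mode in this issue.
    void runOnce(Reader reader) {
        try {
            reader.readFrame();
        } catch (IOException e) {
            for (ShutdownListener l : listeners) l.shutdownCompleted(e);
        }
    }
}
```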
@acogoluegnes seems to have a hypothesis, we are discussing it at the moment. |
@acogoluegnes I think we should try making the interval lower or, rather, in line with the heartbeat mechanism. Having a consumer work queue timeout would be an improvement, but it wouldn't address the root cause here, unless I misunderstand the above explanation.
This reminds me of a similar scenario that took us a long time to handle well in the server, see rabbitmq/rabbitmq-common#31. Socket implementations can behave differently when it comes to event/error reporting on sockets with saturated buffers (windows). It should be possible to do something similar with JDK sockets, although it can be hacky as hell. |
@vincentjames501 Believe it or not, there's no hook to know the TCP connection is terminated with Java's blocking socket I/O. As @michaelklishin suggested, we should be able to notify the reading thread that the connection is gone from the heartbeat sender or from any writing operation that has failed due to a network problem. The only way is to interrupt the reading thread. Interrupting a thread in Java doesn't carry much information, but this should be enough to know when to trigger recovery.
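The interrupt-based signal can be sketched as follows: a parked reading thread (standing in for a thread stuck on a full work pool) is unparked by an interrupt from the writing side, and the catch block is where recovery would be kicked off. This is a generic sketch of the idea, not the client's implementation:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class InterruptSignal {
    static final AtomicBoolean recoveryTriggered = new AtomicBoolean(false);

    // Starts a "reading" thread that parks indefinitely. An interrupt --
    // e.g. sent by the heartbeat sender or after a failed write -- unparks
    // it via InterruptedException, which carries no detail but is enough
    // to know that recovery should be triggered.
    static Thread startReader(CountDownLatch parked) {
        Thread reader = new Thread(() -> {
            parked.countDown();                    // signal we're about to park
            try {
                Thread.sleep(Long.MAX_VALUE);      // blocked, like a stuck read
            } catch (InterruptedException signal) {
                recoveryTriggered.set(true);       // kick off recovery here
            }
        });
        reader.start();
        return reader;
    }
}
```

The writing side, on detecting a network failure, simply calls reader.interrupt().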
Detecting connection failure on reading can take a lot of time (even forever if the reading thread is stuck), so connection recovery can now be triggered when a write operation fails. This makes the client more reactive in detecting failing connections. References #341
Test disabled for NIO: the I/O thread (reading and writing) can be stuck in reading mode (work pool full), which also blocks writing (no heartbeat), so there is no connection failure detection. References #341
Now that recovery can be triggered from write operations, late connection failure discoveries can re-trigger the shutdown and emit spurious exceptions. References #341
Automatic connection recovery now triggers by default when a write operation fails because of an I/O exception. The recovery process takes place in a dedicated thread so the write operation doesn't wait (it receives the same I/O exception immediately). The test to trigger the error has changed: it doesn't use manual ack anymore, as this could sometimes block the broker and make recovery fail (the broker was busy re-enqueuing messages). The test now sends a message in the consumer, which is enough to reproduce the error. Note the test against NIO is skipped right now, as it needs additional care. [#154263515] References #341
Making the work pool fail after it doesn't manage to enqueue work for a given time makes the client more reactive to broker overload. Note this usually happens to clients that do not set QoS properly. Nevertheless, making the client fail as early as possible can avoid hard-to-debug connection failures. This complements the triggering of connection recovery on failed write operations. The work pool enqueueing timeout is useful for NIO, where the same thread is used for both reading and writing (if the thread is stuck waiting on work pool enqueueing, no write operation can occur, and the TCP connection failure is never detected). [#154263515] Fixes #341
With a version I'm testing (#349), I observe a successful recovery:
This way connection recovery triggering on write can be disabled or customised. [#154263515] References #341
Summary
We started seeing the following in production:
After looking at a packet capture, we were seeing [TCP Window Full] messages when RabbitMQ was sending to our consumer, followed by a ton of [TCP ZeroWindow] frames coming from the consumer. After enough of these, RabbitMQ abruptly closes the connection by sending a RST frame and reporting only this in the logs:

Further investigation revealed that I was setting the QoS for the consumer only AFTER I had already started consuming (i.e. I was calling basicConsume directly before basicQos). I realize that this was wrong of me; however, it seems odd that this would cause the exception above, especially with such a relatively small number of small messages. The more concerning thing is that none of the recovery/shutdown methods on either the connection or channel seemed to be called, despite the fact that the connection was indeed closed.

Reproduction
I made a little test project that shows the issue.
https://github.com/vincentjames501/rmq-fail/blob/master/src/test/java/RMQProblemTest.java
Running the test will show the issue after 30 seconds or so. It appears reproducible on most RabbitMQ versions, but I documented my setup below.
Environment
RabbitMQ server - 3.6.12
Erlang - Erlang 20.1
Operating system version (and distribution, if applicable) - OSX 10.13.2
All client libraries used - RabbitMQ Java Client 4.4.1
RabbitMQ plugins (if applicable) - Just Management Console
Thanks!