Add a Read timeout for readbuffer #68
Conversation
`Read()` may wait forever if no deadline is set and no calls to `Close()` are made, causing goroutine and memory leaks.

This patch adds a read timeout in case no deadline is set, calling `Close()` if more than `DefaultReadTimeout` elapses. It is meant to be a last line of defense if all else fails closing the connection.

Such a problematic condition was observed in Rancher 2.7.6, specifically with `remotedialer.pipe()` calls blocked by `Read()` on the Rancher pods. In that scenario, `Close()` is expected to be called upon receiving `Error` messages, but those might never be delivered at all, e.g. because of write timeouts at the sending end.

Signed-off-by: Silvio Moioli <silvio@moioli.net>
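For readers who want a concrete picture of the mechanism being proposed, here is a minimal hedged sketch - illustrative names only, not the actual `readbuffer` implementation: `Read()` blocks on a `sync.Cond` and is only woken by `Offer()` or `Close()`, so a watchdog timer force-closes the buffer if no data arrives within the timeout.

```go
// Package readtimeoutsketch is a minimal, self-contained illustration of the
// mechanism proposed in this PR. Names are illustrative; this is NOT the
// actual remotedialer readbuffer code.
package readtimeoutsketch

import (
	"errors"
	"sync"
	"time"
)

// defaultReadTimeout is a hypothetical value, not the one used by the PR.
const defaultReadTimeout = 5 * time.Minute

type buffer struct {
	cond  sync.Cond
	data  []byte
	err   error
	timer *time.Timer
}

func newBuffer() *buffer {
	b := &buffer{cond: sync.Cond{L: &sync.Mutex{}}}
	// Last line of defense: if no data is ever offered again and nobody
	// calls Close, wake up any goroutine stuck in Read with an error.
	b.timer = time.AfterFunc(defaultReadTimeout, func() {
		b.Close(errors.New("read timed out"))
	})
	return b
}

// Offer appends received bytes and pushes the watchdog deadline forward:
// as long as data keeps flowing, the timeout never fires.
func (b *buffer) Offer(p []byte) {
	b.cond.L.Lock()
	defer b.cond.L.Unlock()
	b.data = append(b.data, p...)
	b.timer.Reset(defaultReadTimeout)
	b.cond.Broadcast()
}

// Read blocks on the condition variable until data is offered or the buffer
// is closed; without the watchdog it could block forever.
func (b *buffer) Read(p []byte) (int, error) {
	b.cond.L.Lock()
	defer b.cond.L.Unlock()
	for len(b.data) == 0 {
		if b.err != nil {
			return 0, b.err
		}
		b.cond.Wait() // woken only by Offer or Close
	}
	n := copy(p, b.data)
	b.data = b.data[n:]
	return n, nil
}

// Close records err and wakes up all blocked readers.
func (b *buffer) Close(err error) {
	b.cond.L.Lock()
	defer b.cond.L.Unlock()
	if b.err == nil {
		b.err = err
	}
	b.timer.Stop()
	b.cond.Broadcast()
}
```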
Started reviewing this. Here's a diagram for anyone else that may find this useful: [diagram omitted]
I think those are separate concerns - they overlap to a degree, but not completely.

One difference is that the deadline is set on the overall request - e.g. "this overall http request should be done before 10:20:21". That is different from saying: after X timeout seconds without receiving any data, we consider this connection closed, because remotedialer assumes some notifications were lost or never sent. So the timeout should be reset every time a byte is successfully received, because that indicates the connection is still alive. A deadline's contract, on the other hand, says it should be respected no matter the network activity, because the caller has an idea of how long the overall request should take, even if network activity is always present.

Another problem with setting a default deadline is that from remotedialer we have no idea how long a "long" request could realistically take, and if we set it too short, it will create problems for the callers. Therefore I believe it's up to callers to set deadlines depending on their knowledge of how long requests should take, while the timeout can be a remotedialer-specific protocol parameter that is true for every connection.

Does that make sense?
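To make the distinction concrete, a small hedged sketch follows (the function name and parameters are illustrative, not remotedialer code): the caller sets an absolute deadline, while the idle timeout is pushed forward whenever a byte is received.

```go
// Hedged sketch of the distinction above; copyWithIdleTimeout and its
// parameters are illustrative and not part of remotedialer.
package deadlinesketch

import (
	"net"
	"time"
)

// copyWithIdleTimeout contrasts the two concepts: an absolute deadline chosen
// by the caller, and an idle timeout that is reset whenever data arrives.
func copyWithIdleTimeout(conn net.Conn, deadline time.Time, idleTimeout time.Duration) error {
	// Absolute deadline: fires at a wall-clock instant, regardless of whether
	// data keeps flowing. Only the caller knows a sensible value for this.
	if err := conn.SetReadDeadline(deadline); err != nil {
		return err
	}

	// Idle timeout: a protocol-level guard that only fires when the
	// connection has gone quiet for too long.
	idle := time.AfterFunc(idleTimeout, func() { conn.Close() })
	defer idle.Stop()

	buf := make([]byte, 4096)
	for {
		n, err := conn.Read(buf)
		if n > 0 {
			idle.Reset(idleTimeout) // activity observed: restart the idle clock
		}
		if err != nil {
			return err
		}
	}
}
```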
Yeah gotcha. I read the docs for [...]
"sync" | ||
"time" | ||
|
||
"github.com/sirupsen/logrus" | ||
) | ||
|
||
const ( | ||
MaxBuffer = 1 << 21 | ||
MaxBuffer = 1 << 21 | ||
DefaultReadTimeout = 1 * time.Minute |
Sorry, two things that came to mind.
1. Long running requests
Correct me if I'm wrong, but I believe the remotedialer is used for clients to be able to WATCH kubernetes resources.
In that case, what happens if there are no changes in a minute? Do we / the UI send PING frames (cursory look, seems to be the keepalive mechanism in http2 land) to keep the readbuffer open?
2. Big downloads
For regular http 1.1 requests downloading a big file, I don't remember if the receiver sends data or waits for the whole thing. In case it waits for the whole thing without sending any data back, the readbuffer would be closed after 1 minute which would close the two pipes and break the connection. If that's the case, do we know about such big downloads within rancher/rancher that would be broken with this new read timeout?
I'll go in reverse order.
Big downloads
This should be OK, IIUC. The http client sends request headers to the http server, which in this case is the remotedialer server, which pipes them to the remotedialer client, which reads them via readbuffer. This does not last long, and after it's done it may time out without consequences.
Of course the response is sent in the opposite direction, with the http server getting the downloaded bytes from a readbuffer - but here data continuously flows, so it will not time out. Of course it will time out if data stalls for more than the timeout, which is the whole point anyway.
You can try this out via:
```sh
./server/server -debug -id first -token aaa -listen :8121
./client/client -debug -id foo -connect ws://localhost:8121/connect
python3 -m http.server 8125 --bind 127.0.0.1 --directory ~/Downloads/
curl --limit-rate 1K --output /dev/null -v http://localhost:8121/client/foo/http/127.0.0.1:8125/big_file.zip
```

The `--limit-rate` option slows down the transfer to make the download last longer.
Long running requests
> remotedialer is used for clients to be able to WATCH kubernetes resources.
That is correct, and WATCH does use long-running requests (long polling).
> Do we / the UI send PING frames (cursory look, seems to be the keepalive mechanism in http2 land) to keep the readbuffer open?
Bad news: if there are no changes before the timeout the connection will time out and be closed as you predicted. http 2 PING frames won't help here, because the readbuffer timeout is reset only when actual message payloads are delivered - and PINGs carry no data. And by the way, http 1.1 has nothing like that at all - it's delegated to TCP keepalive (which makes it equally invisible to the upper protocols).
This causes, for example, `kubectl get --watch` to exit:

```
SMOIOLI@suse-mbp ~ % time kubectl get pods --watch --all-namespaces
[...]
0.11s user 0.04s system 0% cpu 1:00.22 total
```
Good news: `kubectl` watch times out anyway for the same reason, just with a longer timeout (30 minutes). IIUC clients of the WATCH API should be prepared to reconnect anytime - client-go does that. Also, the Rancher UI automatically reconnects on dropped WebSockets (which Rancher serves out of WATCH calls). So, to the best of my knowledge and testing, this isn't really a problem for general Rancher navigation, shouldn't generally be a problem for direct API usage (but could expose new bugs in clients), and is probably a nuisance for interactive `kubectl` watch users.
One thing I can do quickly is to lengthen the timeout to a more reasonable value (I will admit that 1 minute was chosen to make test results very evident, I do not have a good reason to keep it that low).
Alternative ideas are also welcome. My goal is to try not to make this intervention too invasive.
I believe the problem resides in the server never receiving the close notification, because the client failed and gave up sending it but moved on closing resources on its side, hence causing the server's list of active connections to differ from the client's.
Something that came to my mind was to extend the protocol with a new message type to communicate the active connection IDs, so that the other end can figure out which connections are no longer needed and close them. This could also be implemented using the payload of ping control messages (according to the RFC, it can contain application data, but Pong responses should then match this payload) by setting a custom PingHandler, although it does not necessarily need to be sent on every ping.
This idea would still be compatible with a longer idle timeout, in case users at the client side also fail to close the connection.
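A rough sketch of how the ping-payload variant could look with gorilla/websocket; everything here is illustrative (function names, the JSON encoding of connection IDs), not a proposed diff. Note that RFC 6455 limits control-frame payloads to 125 bytes, so as written this only works for short ID lists.

```go
// Hedged sketch of the ping-payload idea above, not an actual remotedialer
// change; names and the ID encoding are illustrative.
package pingsketch

import (
	"encoding/json"
	"time"

	"github.com/gorilla/websocket"
)

// sendPings periodically embeds the sender's active connection IDs in the
// ping payload.
func sendPings(conn *websocket.Conn, activeIDs func() []int64, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for range t.C {
		payload, err := json.Marshal(activeIDs())
		if err != nil {
			return
		}
		if err := conn.WriteControl(websocket.PingMessage, payload, time.Now().Add(10*time.Second)); err != nil {
			return
		}
	}
}

// installPingHandler lets the receiver close local connections the peer no
// longer knows about, while still echoing the payload back as a pong.
func installPingHandler(conn *websocket.Conn, closeStale func(alive []int64)) {
	conn.SetPingHandler(func(appData string) error {
		var alive []int64
		if err := json.Unmarshal([]byte(appData), &alive); err == nil {
			closeStale(alive)
		}
		// Pong responses should carry the same application data as the ping.
		return conn.WriteControl(websocket.PongMessage, []byte(appData), time.Now().Add(10*time.Second))
	})
}
```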
> I believe the problem resides in the server never receiving the close notification, because the client failed and gave up sending it but moved on closing resources on its side, hence causing the server's list of active connections to differ from the client's.
So from the PR description, the issue was located in the `Read()` call, where it was blocked, preventing the goroutine from being removed. This being blocked can happen in two places:

1. In the pipe in `Peer B`, if data is tunneled between peers. If that's the case then the issue is that `Peer B` stopped receiving data from `Peer A`. (This means `OnData()` is not called to fill up the readbuffer.)
2. In the pipe in `Client A`. If that's the case then the issue is that `Client A` stopped receiving data from `Peer B` (which could mean `Peer B` is also not receiving data from `Peer A`, fwiw).

Now what you're believing is that the server (eg: `Peer B`) never received a close notification from the client (eg: `Client A`). I'm struggling to see how that would cause the `Read()` call to be blocked. Can you elaborate on that?
> Now what you're believing is that the server (eg: `Peer B`) never received a close notification from the client (eg: `Client A`). I'm struggling to see how that would cause the `Read()` call to be blocked. Can you elaborate on that?
Sure. What we see from flame graphs is an accumulation of goroutines stuck on `cond.Wait()` inside of `Read()`. So there are too many of them, and they grow without limit until OOM. We are observing this phenomenon on the local cluster while piping from/to peers, so this is case 1 in your description.

Now, because of how `sync.Cond` works, once a remotedialer goroutine gets into a `sync.Cond.Wait()`, the only way to get out of it is some other goroutine calling `sync.Cond.Broadcast()` or `sync.Cond.Signal()`. In the case of `readbuffer`, that only happens in three cases:

1. on `Offer()`, after at least one new byte has been copied from the source into the `readbuffer`,
2. on `Read()`, after at least one new byte has been read from the `readbuffer` and is ready to be returned to the caller (that, BTW, can only happen after an `Offer()`), or
3. on `Close()`.

For those accumulating goroutines:

- We know that 1. does not happen - otherwise we would not see a growing count of goroutines stuck in `cond.Wait()`.
- We know 2. does not happen either - `Read()` code is actually stuck later, in `cond.Wait()`.
- Therefore the only reasonable case is that 3. also never happens. That is also indirectly confirmed by the fact we gave the customer a patched image with the changes in the current version of this PR, which force close after a certain amount of time, and that stopped the leaking.

Now the question changes to: why is `readbuffer.Close()` never called?
IIUC `readbuffer.Close()` is called only by `doTunnelClose()`, which in turn is called:

1. when a pipe is done - but pipe is stuck in `Read()`, so that will never happen in this condition, or
2. when `tunnelClose()` is called, which happens when:
   - 2.1. the whole Session is closed, which is clearly not always the case - downstream clusters and peers can of course be up for weeks or more
   - 2.2. there are errors during dialing - not the case here, the connection is established fine, the problem happens later
   - 2.3. `closeConnection` is called, which happens when:
     - 2.3.1. there is an error while sending a connect message - not the case here, messages go through without problems
     - 2.3.2. there is an error receiving data - not the case here
     - 2.3.3. `connection.Close` is called
     - 2.3.4. an explicit `Error` message is received
Further considerations are needed for the last two points.
2.3.3 is interesting because `remotedialer.connection` implements `net.Conn` and it's returned by `Dial` outside of `remotedialer` code. Therefore users of `remotedialer.connection` can and should, of course, call `Close` when they are done, and that will end up closing the `readbuffer` as well. But in this case the server (local cluster) can't know that the client has abruptly disconnected without some notification from the peer, so it might very well never call `Close` - for all it knows, the connection might still have work to do. So that's my understanding as to why it does not happen either.
2.3.4 is error handling, and it includes the special case when the "error" is actually `EOF`, which is the way clients/peers notify servers that they are done with a connection. IOW: a graceful connection close mechanism. Now of course remotedialer code should be robust to non-graceful closes as well - `EOF` messages from clients/peers might never be delivered to servers for a variety of network reasons, if nothing else because all `net.Conn` writes have a configurable timeout that's outside of remotedialer's control. Moreover, in case of failure, graceful closing is not retried.
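As an aside, here is a tiny hedged sketch of that last point (the function name is illustrative, not remotedialer code): a graceful-close message sent under a write deadline can simply be lost, and nothing retries it, so the other end's blocked `Read` never wakes up.

```go
// Hedged illustration of the failure mode in 2.3.4, not remotedialer code:
// the close/EOF notification is sent best-effort under a write deadline, so
// a stalled link can silently drop it.
package closesketch

import (
	"log"
	"net"
	"time"
)

func notifyClose(conn net.Conn, closeMsg []byte, writeTimeout time.Duration) {
	_ = conn.SetWriteDeadline(time.Now().Add(writeTimeout))
	if _, err := conn.Write(closeMsg); err != nil {
		// Not retried: the other end never learns this connection is done.
		log.Printf("close notification lost: %v", err)
	}
}
```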
So what we are saying is that the various close mechanisms in place all seem to have failed; in particular, the delivery of messages for graceful connection close cannot be relied upon in all cases. Therefore we need something more, or something else, to avoid leaks.
The original version of this PR added a Read timeout as a catch-all measure which of course worked, but might create regressions for very long and inactive, but otherwise legitimate, connections. As a potential alternative, we propose to have peers periodically advertise their known client lists.
Does that clarify?
> Does that clarify?
Yes, thanks for the explanation. I ended up with a similar list of possible paths leading to a blocked `Read()` some weeks ago, but yesterday I got stuck thinking about only `OnData()` and forgot the other cases that would also unblock `Read()`.

I think the suggested solution makes sense. A timeout is needed to detect ungraceful connection termination.

This alone is not enough, though, because we may end up closing connections too soon if we're simply not receiving any data. So we need to differentiate between a timeout due to a connection terminated on the other end and no data coming in for other reasons (eg: waiting on some response, etc). This differentiation can be done by introducing some type of "keepalive" mechanism that signals that the connection is still active.
We could do this:

1. Per connection (per `Session`). There would be one message per session every X seconds telling the other end that this session is still alive. OR
2. As suggested by @aruiz14, one message per peer every X seconds telling the other peer all the sessions that are still alive. I would be against this if every `Session` from two peers had a different TCP/HTTP connection, because then you might run into the case of the "keepalive" connection messages being dropped while the actual data connection is fine, so connections could be dropped for no reason. Looking at the code, it seems like the custom dialer we use simply piggybacks on the same connection, so that's not an issue here.
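For illustration, a minimal hedged sketch of the bookkeeping option 1 would imply on the receiving side; the names, the `tracker` type and the `int64` session IDs are ours, not remotedialer API. A sender-side ticker would emit the per-session keepalive messages; the receiver then records when each session was last heard from and reaps the ones that stay silent too long.

```go
// Hedged sketch of the receiving-side bookkeeping for per-session keepalives;
// illustrative names and types, not remotedialer API.
package keepalivesketch

import (
	"sync"
	"time"
)

type tracker struct {
	mu       sync.Mutex
	lastSeen map[int64]time.Time // session ID -> last time a keepalive (or data) arrived
}

// markAlive is called whenever a keepalive or payload arrives for a session.
func (t *tracker) markAlive(sessionID int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.lastSeen == nil {
		t.lastSeen = map[int64]time.Time{}
	}
	t.lastSeen[sessionID] = time.Now()
}

// reap closes sessions that have been silent longer than maxIdle, using
// whatever teardown the caller provides.
func (t *tracker) reap(maxIdle time.Duration, closeSession func(int64)) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for id, seen := range t.lastSeen {
		if time.Since(seen) > maxIdle {
			closeSession(id)
			delete(t.lastSeen, id)
		}
	}
}
```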
We need to make sure that we can reliably detect which sessions are alive from browser -> first peer, otherwise we're just moving the problem. Do we already have a mechanism to reliably detect that a browser has closed the connection (in an ungraceful termination case)?
@tomleb sorry I missed your last comment. I assume that we rely on HTTP/TCP to detect that the websocket connection died: `Session.Serve` will fail obtaining `NextReader` and return, making the caller call `Session.Close`, which will eventually close all the multiplexed connections. The problem here is when the client closes one of those multiplexed connections without the other end noticing.
I've submitted this PR, please take a look when you have the chance, thanks! #74
Superseded by #74
`Read()` may wait forever if no deadline is set and no calls to `Close()` are made, causing goroutine and memory leaks.

This patch adds a read timeout in case no deadline is set, calling `Close()` if more than `DefaultReadTimeout` elapses. It is meant to be a last line of defense if all else fails closing the connection.

Such a problematic condition was observed in Rancher 2.7.6 (rancher/rancher#44576, internal reference SURE-7122), specifically with `remotedialer.pipe()` calls blocked by `Read()` on the Rancher pods. In that scenario, `Close()` is expected to be called upon receiving `Error` messages, but those might never be delivered at all, e.g. because of write timeouts at the sending end.

Please note this mechanism is meant as a robustness measure - in general, trust in network communication/well-behaving clients cannot be guaranteed for cleanups. Nevertheless, any systematic causes should be analyzed and addressed individually wherever possible (in this specific case, investigation continues via #67).
Credits to @aruiz14 for finding the crucial piece of the puzzle in this investigation.
This patch was tested by an affected user and it noticeably helped in keeping the number of Rancher goroutines (summed over the three replica pods) under control: [goroutine count graph omitted]