Finagle streaming client not closing sockets #475
Hi @mritman, the HTTP client should be managing connections correctly, whether closed locally or remotely, even when interrupted during a read or write. We test these properties quite exhaustively in https://github.com/twitter/finagle/blob/develop/finagle-http/src/test/scala/com/twitter/finagle/http/StreamingTest.scala (see the tests around line 80). I'll have to experiment with your example, but I didn't see anything out of the ordinary while testing in the REPL: https://gist.github.com/luciferous/8f98fd3367af7d2aa7ef. The tcpdump output seems reasonable too: http://pastie.org/private/tbiuuafiory7jsm2e9v0xg.
Hi @luciferous, I should note that the problem does not occur when the server sends back a 200 response. I'm interested to hear what you find when testing with my example.
When running the example application as provided I see "CLOSE_WAIT" states as well with …

Edit: Also, when just returning a success message I see far fewer open connections, and they don't enter the "CLOSE_WAIT" state. Output: …

Edit 2: When explicitly closing the reader and the response, all connections are in a … state:

response = Await.result(service.apply(request), Duration.fromMilliseconds(20));
response.reader().discard();
Await.ready(response.close(), Duration.fromMilliseconds(200));
This introduces a new test to com.twitter.finagle.http.StreamingTest that works as follows:

1. Start a streaming HTTP server that never responds to requests. It has a special transport that may be closed prematurely.
2. Connect a client to this service.
3. Send a request on this client.
4. Close the transport on the server.
5. Ensure that the client observes an exception on its request.

A failure is not observed in step 5.
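For readers without the Finagle source at hand, here is a rough, self-contained approximation of those five steps; it is a sketch, not the actual StreamingTest code. A raw java.net.ServerSocket stands in for the test's special closeable transport, and the port, sleep, and object name are illustrative assumptions:

import java.net.ServerSocket
import com.twitter.finagle.Http
import com.twitter.finagle.http.{Method, Request}
import com.twitter.util.Await

object PrematureCloseSketch extends App {
  // Steps 1 and 4: accept a connection, never respond, then close it.
  val server = new ServerSocket(8080)
  new Thread(new Runnable {
    def run(): Unit = {
      val socket = server.accept()
      Thread.sleep(500)
      socket.close()
    }
  }).start()

  // Steps 2 and 3: connect a streaming client and send a request.
  val client = Http.client.withStreaming(enabled = true).newService("localhost:8080")
  val rep = client(Request(Method.Get, "/"))

  // Step 5: the request's Future should resolve to a Throw rather than hang.
  println(Await.result(rep.liftToTry))
}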
The behavior is as follows:
- A server processes a request and then closes the connection.
- The client receives the response.
- The client never closes the transport, leaving the connection in the CLOSE_WAIT state.
As I understand it, the situation is as follows:
Running @mritman's test program locally, I was able to confirm that the server sockets end up in FIN_WAIT_2 and the client sockets end up in CLOSE_WAIT. I have attempted to reproduce this in a Finagle test: develop...BuoyantIO:ver/test-streaming-disconnects#diff-84324cdca2db122c30fa0be6887261a2R126

If the disconnect were received during the lifetime of a request, the transport would be torn down properly. I am still trying to determine how the client should receive notification of remote connection teardown outside of the request lifecycle... Perhaps this can't be determined until another operation (read or write) occurs on the transport? I'll try to dive deeper with that test to figure out what's going on.
Try replacing your Tomcat server with the following:

while true; do printf 'HTTP/1.1 403 Forbidden\r\nServer: test\r\n\r\n'|nc -l 8080; done

You'll see the connections are closing. Also try a Finagle HTTP server:

import com.twitter.finagle.http.{Http=>_, _}
import com.twitter.finagle.{Http, Service}
import com.twitter.util.{Await, Future}

val forbidden = new Service[Request, Response] {
  def apply(req: Request): Future[Response] = {
    val res = Response()
    res.status = Status.Forbidden
    Future.value(res)
  }
}

Http.serve(":8080", forbidden)

These connections are also closing, but they don't close as fast. My lsof output showed the connection count hovering around 50. Seems to me like the Tomcat server is the issue?
Ah, I was setting up the server wrong. I need to make the server send a chunk-encoded response:

while true; do printf 'HTTP/1.1 403 Forbidden\r\nServer: test\r\nTransfer-Encoding: chunked\r\n\r\n1\r\na\r\n0\r\n\r\n'|nc -l 8080; done

The lsof count hovers around 50 again. The connections do get cleared, so it doesn't sound like a leak to me. What do you think?

For curiosity's sake I also tried the Finagle server again, with streaming enabled:

val forbidden = new Service[Request, Response] {
  def apply(req: Request): Future[Response] =
    Future.value(Response(Version.Http11, Status.Forbidden, Reader.fromBuf(Buf.Utf8("a"))))
}

Http.server.withStreaming(enabled=true).serve(":8080", forbidden)

I saw the same thing, with lsof lines hovering around 50. The strangest thing I'm seeing is that a lot of the requests are timing out, so maybe that's worth looking into. But I don't think there's a connection leak.
Correct me if I'm wrong, but I think the netcat and Finagle examples don't close the connection, whereas the Spring Boot server does.
I'd also argue that, regardless of whether this issue is specific to the use of Tomcat, since it's the client's sockets that increasingly end up in CLOSE_WAIT, this is something you'll want to guard against on the client side.
@mritman that's reasonable. I'm really suspicious of the timeouts. I think looking into why they're happening is a good place to start.
@luciferous I also noticed the timeouts while testing. If you set the connection pool limit to 1 in my example project and wait for Tomcat to close the connection, you'll see that Finagle starts timing out on it. So it looks like Finagle doesn't detect that the connection was closed by the server and continues trying to read from it, which is in line with @olix0r's observations.
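To make the pool-of-one experiment concrete, here is a minimal sketch of that client configuration; the address, timeout, and loop are illustrative assumptions rather than code from the linked project:

import com.twitter.finagle.Http
import com.twitter.finagle.http.{Method, Request}
import com.twitter.util.{Await, Duration}

// A streaming client restricted to a single pooled connection.
val client = Http.client
  .withStreaming(enabled = true)
  .withSessionPool.maxSize(1)
  .newService("localhost:8080")

// Once the server closes that one connection, each request below
// reportedly times out instead of re-establishing the session.
while (true) {
  val rep = client(Request(Method.Get, "/"))
  println(Await.result(rep.liftToTry, Duration.fromSeconds(1)))
}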
@mritman, very interesting! I'd like to take a look too; I'm hoping to check it out this weekend if folks haven't already figured it out by then.
So, it turns out this has a really simple explanation: because we set up the client with …, the fix is to add … I remember being bitten by this before too, once or twice. There's definitely room to improve the API/UX here. Sorry about that.
And, if you weren't seeing it with the …
@luciferous isn't this an inherent race condition in the protocol? If a stream is chunk-encoded and ends the transfer by closing the connection, that's indistinguishable from the remote dying. Wouldn't a "correct" implementation of this API signal to the reader, "we don't know whether the server died or is finished"?
@luciferous actually, it looks like it doesn't matter whether the server's responses are chunked or not... weird.
The clients do close if the server closes. See above, where I tried to reproduce with servers that close. But the Tomcat one seems to keep the connection open.
How then do you explain these CLOSE_WAITs? I discarded the reader there. Granted, I see a lot fewer connections than without, but they are still there. Also, the 200 OK responses contained the string "Good", and thus there was no empty reader.
@olix0r I found a race condition in your first test; it seems to work OK after fixing that. Taking a look at the second one now.
@spockz Raise your timeout (e.g. to 1 second) and you'll see the CLOSE_WAITs disappear. With the 20ms timeout, does the number of CLOSE_WAITs grow? I see it hovering at ~20 connections.
@spockz Regarding the 200 OK case, it really depends on how Netty reads it from the socket. If the whole response fits into the first read, Netty doesn't chunk-encode it.
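A quick way to observe which mode a response arrived in, sketched under the assumption of a streaming client against the localhost server from the earlier snippets (isChunked is on Message in finagle-http):

import com.twitter.finagle.Http
import com.twitter.finagle.http.{Method, Request}
import com.twitter.util.Await

val client = Http.client.withStreaming(enabled = true).newService("localhost:8080")
val rep = Await.result(client(Request(Method.Get, "/")))
// Per the comment above: false when the whole body arrived in Netty's first
// read, true when the body is delivered in chunks through rep.reader.
println(rep.isChunked)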
@olix0r and I think that ChannelTransport closes itself, which is why your proxy never detected a close. I've added the commit which no longer uses the proxy to my PR, so both tests should now pass.
@spockz @mritman I'm beginning to think that @luciferous is right that this is a case of a difficult-to-use API, although it looks like there are a few bugs too. Since your server is sending a chunk-encoded response, you need to use the … A couple of open questions: …

My suspicion is that the reader constitutes a read handle, and that we don't want to cut the read handle prematurely, so that it can still read whatever is in its socket buffer; so 3 makes sense, but 1 and 2 less so. I think it's also a bug that content / getContentString don't actually drain the reader. The intent is to avoid blocking on IO, but I think we should instead make it clear that getContent is only safe to call on unchunked content, or when you don't mind blocking. @luciferous what do you think?

@mritman @spockz For what it's worth, if you control your remote servers, the remote server closing the socket should be an edge case, not the normal one. TCP sessions are not light, and you'll incur significant GC pressure if you're constantly tearing down TCP sessions. With that said, we absolutely want to get to the bottom of these bugs, so we appreciate you finding the minimization, since we know that it will happen occasionally even with servers you control, and that some servers you don't control may do weird things. ✨ 👍
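To illustrate what consuming a chunked response via the reader looks like in practice, here is a minimal sketch that drains the whole body and then discards the handle. The helper name is mine, and Reader.readAll is the com.twitter.io utility of this era; this is a sketch, not the thread's prescribed fix:

import com.twitter.finagle.http.Response
import com.twitter.io.{Buf, Reader}
import com.twitter.util.Future

// Consume the whole chunked body, then release the read handle so the
// underlying connection can be reused or torn down cleanly.
def drain(rep: Response): Future[Buf] =
  Reader.readAll(rep.reader).ensure {
    rep.reader.discard()
  }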
@mritman @spockz I've made a PR here: mritman/finagle-not-closing-sockets#1. Please let me know what you think!
I can confirm that consuming the whole reader as done in mritman/finagle-not-closing-sockets#1 solves the issue.
I think it is pretty clear from the signature that it is a blocking call (or that it returns what it already received). Perhaps …
If we can still read from the socket when the connection is already closed, that means the contents are buffered somewhere on the machine. Thus, it might not be so bad to pull the contents into the Finagle layer when the connection is closed. If we do that, the connection can be closed without losing data.
@spockz yeah, it returns what is already received. I think this is confusing, but it's potentially also confusing to change a call that doesn't block to one that does... I agree that it would be nice if chunked responses didn't have a content / getContentString, but it's difficult to change the API of Request, since it's one of the APIs that people use the most. Yeah, it might be worth pursuing having Finagle try to read what's in the socket buffer into memory. I think that should be safe, although I'm not sure whether it might make us run into weird problems if a remote host has sent us more bytes than we want to copy into the JVM.
OK, I dug into Finagle streaming clients some more, and I now think that the two main problems were not getting to the end of the read stream, and timing out without getting to the end of the read stream. Note that if our Await.result fails, then we never assign the new response to the response variable, so if it opens a read handle, we never have the opportunity to discard it. I no longer think that there's a bug where you need to drain AND discard. I think that worked because it slowed down the operation, not because of an underlying bug. @mritman @spockz I'm convinced that there isn't a bug, but the API is difficult to use. We'll be working to make it simpler. Do you think it's OK to close the ticket?
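That failure mode is easy to hit with code shaped like the snippet earlier in this thread. A hedged sketch of one way to guard against it; the method name and 20ms timeout are illustrative assumptions, not code from the thread:

import com.twitter.finagle.Service
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.{Await, Duration, TimeoutException}

def call(service: Service[Request, Response], req: Request): Option[Response] = {
  val f = service(req)
  try Some(Await.result(f, Duration.fromMilliseconds(20)))
  catch {
    case _: TimeoutException =>
      // We never got a reference to the Response, but the Future may still
      // resolve later; make sure its read handle is discarded when it does,
      // so the connection isn't parked in CLOSE_WAIT forever.
      f.onSuccess(_.reader.discard())
      None
  }
}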
So if this case can't be handled by the user of the Finagle API, I take it it's correctly handled by Finagle itself? Is the following understanding of how to use the Finagle client with streaming enabled correct? When using streaming you either: …
No need to answer my questions specifically, but if you can provide a summary, or a link to some documentation on how to correctly use the Finagle client with streaming enabled, you can close this ticket as far as I'm concerned.
@mritman it can be handled by the user of the Finagle API; almost none of our users use the … Ah, the problem is with 1 / 2. Please note that any use of …
As far as docs go, we know that streaming is a weak part of our documentation. This ticket is convincing us that we should change the Request / Response API too, so we're probably going to change the API to be easier to use, and then document it. It might make sense to document safe use of Reader first, though, since that's probably not going to change that significantly. I'm going to close this ticket, but I've opened ticket #488 to improve the documentation of Reader. Thanks for bearing with us as we figured out what was going on in your snippet!
Problem

finagle-http's streaming test coverage misses some important cases.

Solution

Introduce new tests to http.StreamingTest to validate that:
1. A client behaves well when a connection is lost before a server responds.
2. A server closes the transport after responding, as described in #475.

Signed-off-by: Moses Nakamura <mnakamura@twitter.com>
RB_ID=837540
The Finagle client does not seem to close connections that are closed by the other side when streaming is enabled. If you run the application from the repository linked below, you should notice the number of sockets held by the process in the CLOSE_WAIT state increasing rapidly (Tomcat is configured to close connections after 5 seconds). This ultimately caused our application to hit its open-file limit. When setting the max size of the connection pool to 1, Finagle keeps using the same connection while timing out on each request that it makes.

When streaming is disabled, none of these issues occur.
https://github.com/mritman/finagle-not-closing-sockets
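For context, a sketch of roughly what the reproduction does as described in this thread: a streaming Finagle client repeatedly hitting a Tomcat server that closes connections after 5 seconds, without ever consuming or discarding the response reader. The address, timeout, and object name are assumptions for illustration, not code from the repository:

import com.twitter.finagle.Http
import com.twitter.finagle.http.{Method, Request}
import com.twitter.util.{Await, Duration}

object Repro extends App {
  val client = Http.client
    .withStreaming(enabled = true)
    .newService("localhost:8080")

  while (true) {
    // The response's reader is never read or discarded here, which is what
    // leaves remotely-closed connections parked in CLOSE_WAIT.
    Await.result(client(Request(Method.Get, "/")), Duration.fromMilliseconds(200))
  }
}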