Finagle streaming client not closing sockets #475

Closed
mritman opened this issue Mar 7, 2016 · 28 comments

@mritman

mritman commented Mar 7, 2016

The Finagle client does not seem to close connections that are closed by the other side when streaming is enabled. If you run the application from the repository linked below, you should see the number of sockets in the CLOSE_WAIT state held by the process increase rapidly (Tomcat is configured to close connections after 5 seconds). This ultimately caused our application to hit its open file limit. When the connection pool's maximum size is set to 1, Finagle keeps using the same connection and times out on every request it makes.

When streaming is disabled none of the issues occur.

https://github.com/mritman/finagle-not-closing-sockets
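For readers who don't want to run the linked project, here is a rough sketch of the setup it describes (in Scala rather than the project's Java; the address, path, and 20 ms timeout are assumptions based on later comments in this thread):

import com.twitter.conversions.time._
import com.twitter.finagle.Http
import com.twitter.finagle.http.Request
import com.twitter.util.{Await, Try}

// Streaming client pointed at a server that closes idle connections itself.
val client = Http.client
  .withStreaming(enabled = true)
  .newService("localhost:8080")

while (true) {
  // The chunked response body is never drained or discarded, so the client
  // never releases the connection; once the server closes its side, the
  // socket stays in CLOSE_WAIT. Timeouts are swallowed to keep the loop going.
  Try(Await.result(client(Request("/")), 20.milliseconds))
}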

@luciferous
Collaborator

Hi @mritman, the HTTP client should be managing connections correctly whether they are closed locally or remotely, even when interrupted during a read or write. We test these properties quite exhaustively in https://github.com/twitter/finagle/blob/develop/finagle-http/src/test/scala/com/twitter/finagle/http/StreamingTest.scala (see the tests around line 80). I'll have to experiment with your example, but I didn't see anything out of the ordinary while testing in the REPL: https://gist.github.com/luciferous/8f98fd3367af7d2aa7ef. The tcpdump output seems reasonable too: http://pastie.org/private/tbiuuafiory7jsm2e9v0xg.

@mritman
Author

mritman commented Mar 8, 2016

Hi @luciferous, I should note that the problem does not occur when the server sends back a 200 response. I'm interested to hear what you find when testing with my example.

@mritman mritman closed this as completed Mar 8, 2016
@mritman mritman reopened this Mar 8, 2016
@spockz
Contributor

spockz commented Mar 21, 2016

When running the example application as provided I see "CLOSE_WAIT" states as well with lsof -i :8080. Full output here.

Edit: Also, when just returning a success message I see far fewer open connections, and they don't enter the "CLOSE_WAIT" state. Output

Edit 2: When explicitly closing the reader and the message, all connections are in the CLOSE_WAIT state and no requests get handled anymore. Output.
Code:

// timeout value assumed: 20 ms, per the discussion later in this thread
response = Await.result(service.apply(request), Duration.fromMilliseconds(20));
response.reader().discard();
Await.ready(response.close(), Duration.fromMilliseconds(200));

olix0r added a commit to BuoyantIO/finagle that referenced this issue Mar 22, 2016
This introduces a new test to com.twitter.finagle.http.StreamingTest that works as follows:

1. Start a streaming http server that never responds to requests.
   It has a special transport that may be closed prematurely.
2. Connect a client to this service.
3. Send a request on this client.
4. Close the transport on the server.
5. Ensure that the client observes an exception on its request.

Currently, the exception expected in step 5 is not observed.
olix0r added a commit to BuoyantIO/finagle that referenced this issue Mar 22, 2016
The behavior is as follows:
- A server processes a request and then closes the connection
- The client receives the response
- The client never closes the transport, leaving the connection in the CLOSE_WAIT state.
@olix0r
Contributor

olix0r commented Mar 22, 2016

As I understand it the situation is as follows:

  1. The client sends a request.
  2. The server responds properly.
  3. The client receives the response.
  4. The server closes the connection.
  5. The client never closes its side of the connection, leaving it in CLOSE_WAIT.

Running @mritman's test program locally, I was able to confirm that the server sockets end up in FIN_WAIT_2 and the client sockets end up in CLOSE_WAIT. I have attempted to reproduce this in a finagle test: develop...BuoyantIO:ver/test-streaming-disconnects#diff-84324cdca2db122c30fa0be6887261a2R126

If the disconnect were received during the lifetime of a request, the transport would be torn down properly. I am still trying to determine how the client should receive notification of remote connection teardown outside of the request lifecycle... Perhaps this can't be determined until another operation (read or write) occurs on the transport? I'll try to dive deeper with that test to figure out what's going on.
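For reference, the transport layer does surface teardown as a Future (Transport#onClose); a hypothetical sketch of observing it outside the request path, not a proposed fix:

import com.twitter.finagle.transport.Transport
import com.twitter.util.Future

// Hypothetical: react to a transport closing even when no request is in flight.
def logTeardown(trans: Transport[Any, Any]): Future[Unit] =
  trans.onClose.map { cause =>
    println(s"transport closed: $cause") // cause describes why it closed
  }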

@luciferous
Collaborator

Try replacing your Tomcat server with the following:

while true; do printf 'HTTP/1.1 403 Forbidden\r\nServer: test\r\n\r\n'|nc -l 8080; done

You'll see the connections are closing. Also try a Finagle HTTP server:

import com.twitter.finagle.http.{Http=>_, _}
import com.twitter.finagle.{Http, Service}
import com.twitter.util.{Await, Future}

val forbidden = new Service[Request, Response] {
  def apply(req: Request): Future[Response] = {
    val res = Response()
    res.status = Status.Forbidden
    Future.value(res)
  }
}

Http.serve(":8080", forbidden)

These connections are also closing, but they don't close as fast. My lsof output showed the connection count hovering around 50.

Seems to me like the Tomcat server is the issue?

@luciferous
Collaborator

Ah, I'm setting up the server incorrectly. I need to make the server send a chunk-encoded response.

while true; do printf 'HTTP/1.1 403 Forbidden\r\nServer: test\r\nTransfer-Encoding: chunked\r\n\r\n1\r\na\r\n0\r\n\r\n'|nc -l 8080; done

The lsof count hovers around 50 again. The connections do get cleared, so it doesn't sound like a leak to me. What do you think? For curiosity's sake I also tried the Finagle server again with streaming enabled:

import com.twitter.io.{Buf, Reader} // plus the imports from the earlier snippet

val forbidden = new Service[Request, Response] {
  def apply(req: Request): Future[Response] =
    Future.value(Response(Version.Http11, Status.Forbidden, Reader.fromBuf(Buf.Utf8("a"))))
}
Http.server.withStreaming(enabled = true).serve(":8080", forbidden)

I saw the same thing with lsof lines hovering around 50. The strangest thing I'm seeing is that a lot of the requests are timing out, so maybe that's worth looking into. But I don't think there's a connection leak.

@spockz
Contributor

spockz commented Mar 23, 2016

Correct me if I'm wrong, but I think the netcat and Finagle examples don't close the connection, whereas the Spring Boot server does.

@mritman
Author

mritman commented Mar 23, 2016

I'd also argue that, regardless of whether this issue is specific to Tomcat, since it's the client's sockets that increasingly end up in CLOSE_WAIT, this is something you'll want to guard against on the client side.

@luciferous
Collaborator

@mritman that's reasonable. I'm really suspicious of the timeouts. I think looking into why they're happening is a good place to start.

@mritman
Author

mritman commented Mar 24, 2016

@luciferous I also noticed the timeouts while testing. If you set the connection pool limit to 1 in my example project and wait for Tomcat to close the connection you'll see that Finagle starts timing out on it. So it looks like Finagle doesn't detect the connection is closed by the server and continues to try to read from it, which is in line with @olix0r's observations.

@mosesn
Contributor

mosesn commented Mar 25, 2016

@mritman, very interesting! I'd like to take a look too, hoping to check it out this weekend if folks haven't already figured it out by then.

@luciferous
Collaborator

So, it turns out this has a really simple explanation. Because we set up the client with withStreaming(true), the response received will be chunk-encoded (depending on how Netty reads it from the socket). This means that before Finagle can consider a response "done", you have to fully drain the reader or discard it.

Add response.reader().discard(); and you'll see the issue go away.

I remember being bitten by this before too, once or twice. There's definitely room to improve the API/UX here. Sorry about that.
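In the blocking style of the example project, that fix might look roughly like this sketch (assuming a service and request already in scope; the 200 ms timeout is a placeholder):

import com.twitter.conversions.time._
import com.twitter.util.Await

val response = Await.result(service(request), 200.milliseconds)

// We don't intend to read the chunked body, so release it (and with it the
// underlying connection) explicitly.
response.reader.discard()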

@luciferous
Collaborator

And if you weren't seeing it with the 200 OK, it might have been because the HTTP response didn't contain a body and therefore had an empty reader.

@mosesn
Contributor

mosesn commented Mar 27, 2016

@luciferous Isn't this an inherent race condition in the protocol? If a stream is chunk-encoded and ends the transfer by closing the connection, it's indistinguishable from the remote dying. Wouldn't a "correct" implementation of this API signal to the reader, "we don't know whether the server died or is finished"?

@mosesn
Contributor

mosesn commented Mar 27, 2016

@luciferous Actually, it looks like it doesn't matter whether the server's responses are chunked or not... weird.

@luciferous
Collaborator

The clients do close if the server closes. See above, where I tried to reproduce with servers that close. But the Tomcat one seems to keep the connection open.

@spockz
Contributor

spockz commented Mar 27, 2016

How then do you explain these CLOSE_WAITs? I discarded the reader there. Granted, I see far fewer connections than without it, but they are still there. Also, the 200 OK responses contained the string "Good", so the reader was not empty.

@mosesn
Contributor

mosesn commented Mar 27, 2016

@olix0r I found a race condition in your first test; it seems to work OK after fixing that. Taking a look at the second one now.

@luciferous
Collaborator

@spockz Raise your timeout (e.g. to 1 second) and you'll see the CLOSE_WAITs disappear. With the 20 ms timeout, does the number of CLOSE_WAIT sockets grow? I see it hovering at ~20 connections.

@luciferous
Collaborator

@spockz Regarding the 200 OK case, it really depends on how Netty reads it from the socket. If the whole response fits into the first read, Netty doesn't chunk-encode it.
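A client can check which case it got at runtime; a small sketch (assuming a response already in hand and an arbitrary one-second timeout):

import com.twitter.conversions.time._
import com.twitter.io.{Buf, Reader}
import com.twitter.util.Await

val body: Buf =
  if (response.isChunked)
    // Chunked: the body arrives through the reader and must be drained
    // (or discarded) before the connection can be reused.
    Await.result(Reader.readAll(response.reader), 1.seconds)
  else
    // Not chunked: Netty delivered the whole message at once and the body
    // is already buffered in the response.
    response.content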

@mosesn
Contributor

mosesn commented Mar 28, 2016

@olix0r and I think that ChannelTransport closes itself, which is why your proxy never detected a close. I've added the commit which no longer uses the proxy to my PR, so both tests should now pass.

@mosesn
Contributor

mosesn commented Mar 28, 2016

@spockz @mritman I'm beginning to think that @luciferous is right that this is a case of a difficult-to-use API, although it looks like there are a few bugs too. Since your server is sending a chunk-encoded response, you need to use response.reader to parse it. The CLOSE_WAITs seem to disappear when I switch to reading the entire reader explicitly and then discarding it.

A couple open questions:

  1. why isn't it sufficient to read the entire reader?
  2. why isn't it sufficient to discard the entire reader?
  3. why isn't it sufficient for the server to cut the connection?

My suspicion is that the reader constitutes a read handle and that we don't want to cut the read handle prematurely, so that it can still read whatever is in its socket buffer. That would explain 3, but 1 and 2 less so.

I think it's also a bug that content / getContentString don't actually drain the reader. The intent is to avoid blocking on I/O, but I think we should instead make it clear that getContent is only safe to call on unchunked content, or when you don't mind blocking.

@luciferous what do you think?

@mritman @spockz For what it's worth, if you control your remote servers, the remote server closing the socket should be an edge case, not the normal one. TCP sessions are not lightweight, and you'll incur significant GC pressure if you're constantly tearing down TCP sessions. That said, we absolutely want to get to the bottom of these bugs, so we appreciate you producing the minimal reproduction, since we know this will happen occasionally even with servers you control, and some servers you don't control may do weird things. ✨ 👍

@mosesn
Contributor

mosesn commented Mar 28, 2016

@mritman @spockz I've made a PR here: mritman/finagle-not-closing-sockets#1 please let me know what you think!

@spockz
Contributor

spockz commented Mar 28, 2016

I can confirm that consuming the whole reader as done in mritman/finagle-not-closing-sockets#1 solves the issue.

I think it's also a bug that content / getContentString don't actually drain the reader. The intent is to avoid blocking on I/O, but I think we should instead make it clear that getContent is only safe to call on unchunked content, or when you don't mind blocking.

I think it is pretty clear from the signature that it is a blocking call (or that it returns what it has already received). Perhaps Response values that are chunked shouldn't have a getContent? In other words, using a streaming client is a conscious decision by the developer that should be reflected in which methods they can use?

My suspicion is that the reader constitutes a read handle and that we don't want to cut the read handle prematurely, so that it can still read whatever is in its socket buffer. That would explain 3, but 1 and 2 less so.

If we can still read from the socket when the connection is already closed, that means the contents are buffered somewhere on the machine. Thus, it might not be so bad to pull the contents into the Finagle layer when the connection is closed. If we do that, the connection can be closed without losing data.

@mritman @spockz For what it's worth ..
I agree that we should prevent connections from being closed on errors. I'll investigate where these closures occur and why. Perhaps there are good reasons for them.

@mosesn
Contributor

mosesn commented Mar 28, 2016

@spockz yeah, it returns what is already received. I think this is confusing, but it's potentially also confusing to change a call that doesn't block to one that does . . . I agree that it would be nice if responses that were chunked didn't have a content / getContentString, but it's difficult to change the API of Request, since it's one of the APIs that people use the most.

Yeah, it might be worth pursuing having Finagle try to read what's in the socket buffer into memory. I think that should be safe, although I'm not sure whether it might make us run into weird problems if a remote host has sent us more bytes than we want to copy into the JVM.

@mosesn
Contributor

mosesn commented Mar 29, 2016

OK, I dug into Finagle streaming clients some more, and I now think the two main problems were not reading to the end of the read stream, and timing out before reaching the end of the read stream. Note that if our Await.result fails, then we never assign the new response to the response variable, so if it opens a read handle, we will never have the opportunity to discard the read handle.
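A hedged sketch of guarding against that in the blocking style of the example project (service, request, and the 20 ms timeout are stand-ins):

import com.twitter.conversions.time._
import com.twitter.util.{Await, TimeoutException}

val pending = service(request)
val response =
  try Await.result(pending, 20.milliseconds)
  catch {
    case e: TimeoutException =>
      // The RPC may still complete after Await gives up; make sure a
      // late-arriving response still releases its connection.
      pending.onSuccess(_.reader.discard())
      throw e
  }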

I no longer think that there's a bug where you need to drain AND discard. I think it worked because it slowed down the operation, and not because of an underlying bug.

@mritman @spockz I'm convinced that there isn't a bug, but the API is difficult to use. We'll be working to make it simpler. Do you think it's OK to close the ticket?

@mritman
Author

mritman commented Mar 29, 2016

Note that if our Await.result fails, then we never assign the new response to the response variable, so if it opens a read handle, we will never have the opportunity to discard the read handle.

So if this case can't be handled by the user of the Finagle API, I take it it's correctly handled by Finagle itself?

Is the following understanding of how to use the Finagle client with streaming enabled correct?

When using streaming you either:

  1. use Await.result(Reader$.MODULE$.readAll(response.reader())); without a timeout, thus blocking
  2. or use it with a timeout, but keep trying until the call returns without timing out: Await.result(Reader$.MODULE$.readAll(response.reader()), duration);
  3. or keep reading in a streaming fashion until you encounter a final chunk of length zero (How do you detect this? By inspecting the response? Does the Response class offer a method to check for this?)
  4. or make sure you call response.reader().discard().

No need to answer my questions specifically, but if you can provide a summary or a link to some documentation on how to correctly use the Finagle client with streaming enabled you can close this ticket as far as I'm concerned.

@mosesn
Contributor

mosesn commented Mar 29, 2016

@mritman it can be handled by the user of the Finagle API; almost none of our users use the Await.result API, because most of our users use the asynchronous API, not the blocking one. It's easy to deadlock if you use the blocking API while also running Finagle clients alongside servers, so we advise strongly against it. It's much simpler to reason about if you're using future transformations. I've included an example of how you might describe this behavior asynchronously further down.

ah, the problem is with the Await.result(service.apply(request)), not in the Await.result(Reader.readAll(reader)) section.

1 / 2. Please note that any use of Await.result is blocking, regardless of whether you set the timeout or not. The only case in which it's not is if you set a timeout of 0. You should not use Await.result if you can avoid it. It's better to use future transformations. If you want to time something out, you can use Future#within or Future#raiseWithin.

Reader.readAll buffers the entire message, so you should only use it if you don't need the response streamed to you. Otherwise it's preferable to use the normal Reader#read(Int). You can set a timeout on the read with Future#within or Future#raiseWithin, and if it fails either of those timeouts, you can then discard the reader. Here is one way that would be safe:

val rep = service(request)
rep.raiseWithin(20.milliseconds).flatMap { response =>
  Reader.readAll(response.reader).raiseWithin(20.milliseconds).onFailure { _ =>
    // e.g. a timeout while reading: release the connection via the read handle
    response.reader.discard()
  }
}.onFailure { _ =>
  rep.onSuccess { response => response.reader.discard() }
}
  3. If you're using the Reader#read API directly, it will return a Future[Option[Buf]]. When the future is satisfied with a None, you know you've reached the EOF and you can stop reading (see the sketch after this list).
  4. You only need to make sure you call response.reader.discard() if you know you're no longer going to need the read handle, you haven't reached the end of the stream, and the stream hasn't failed (timeouts don't count). In other cases, the stream will take care of itself.
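To illustrate point 3, a minimal read-until-EOF loop over the plain Reader#read API might look like this (a sketch, not library code):

import com.twitter.io.{Buf, Reader}
import com.twitter.util.Future

// Keep reading until Reader#read yields None, which signals the end of the
// stream; after that the connection can be reused. A real caller would
// process each chunk instead of dropping it.
def drain(reader: Reader): Future[Unit] =
  reader.read(Int.MaxValue).flatMap {
    case Some(_) => drain(reader)  // got a chunk; keep reading
    case None    => Future.Done    // EOF reached
  }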

As far as docs go, we know that streaming is a weak part of our documentation. This ticket is convincing us that we should change the Request / Response API too, so we're probably going to change the API to be easier to use, and then document it. It might make sense to document safe use of Reader first though, since that's probably not going to change that significantly. I'm going to close this ticket, but I've opened ticket #488 to improve documentation of Reader.

Thanks for bearing with us as we figured out what was going on in your snippet!

@mosesn mosesn closed this as completed Mar 29, 2016
olix0r added a commit to BuoyantIO/finagle that referenced this issue May 27, 2016
Problem

finagle-http's streaming test coverage misses some important cases

Solution

Introduce new tests to http.StreamingTest to validate that

1. A client behaves well when a connection is lost before a server responds.
2. A server closes the transport after responding, as described in twitter#475.
finaglehelper pushed a commit that referenced this issue Jun 1, 2016
Problem

finagle-http's streaming test coverage misses some important cases

Solution

Introduce new tests to http.StreamingTest to validate that

1. A client behaves well when a connection is lost before a server responds.
2. A server closes the transport after responding, as described in #475.

Signed-off-by: Moses Nakamura <mnakamura@twitter.com>

RB_ID=837540