Closing responses with incomplete chunked bodies retrieved over a Unix socket takes a long time on Linux #4233

Closed
charleskorn opened this issue Aug 26, 2018 · 16 comments
Comments

@charleskorn

charleskorn commented Aug 26, 2018

Description

If a response meets all of the following criteria...

  • it was retrieved over a Unix domain socket (using the sample code)
  • the body is chunked (i.e. has the header Transfer-Encoding: chunked)
  • the body is incomplete (i.e. does not have a trailing 0\r\n\r\n chunk), such as when streaming events from a Docker daemon with the events API
  • the application is running on Linux

...then calling close() on Response takes quite a long time. The time taken seems to track whatever has been set with readTimeout() on OkHttpClient (i.e. if the timeout is 3s, closing the response takes just over 3s; if the timeout is 20s, closing the response takes just over 20s).

This behaviour does not occur if any of the following is true:

  • the body is not chunked
  • the body is chunked but complete (i.e. has a trailing 0\r\n\r\n chunk)
  • the application is running on OS X

Steps to reproduce

I've created a sample application that demonstrates this behaviour: https://github.com/charleskorn/okhttp-issue.

Expected behaviour

Calling Response.close() takes a very short period of time (a few milliseconds at most).

Actual behaviour

Calling Response.close() takes quite some time, roughly equal to the configured readTimeout.
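
For reference, the shape of the measurement is roughly the following. This is a minimal sketch rather than the sample application itself; the Unix socket factory and the events URL are assumptions based on the description above.

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;
    import javax.net.SocketFactory;
    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.Response;

    public final class SlowCloseRepro {
      public static void main(String[] args) throws IOException {
        // Assumption: a jnr-unixsocket-backed factory, as in the sample application.
        SocketFactory unixSocketFactory = createUnixSocketFactory();

        OkHttpClient client = new OkHttpClient.Builder()
            .socketFactory(unixSocketFactory)
            .readTimeout(3, TimeUnit.SECONDS)
            .build();

        Request request = new Request.Builder()
            .url("http://localhost/events") // assumption: the Docker events endpoint, which streams chunks indefinitely
            .build();

        Response response = client.newCall(request).execute();
        response.body().source().readUtf8Line(); // read part of the never-ending chunked body

        long start = System.nanoTime();
        response.close(); // expected: a few milliseconds; observed on Linux: roughly readTimeout
        System.out.println("close() took "
            + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) + " ms");
      }

      private static SocketFactory createUnixSocketFactory() {
        throw new UnsupportedOperationException("see the sample application for the Unix socket setup");
      }
    }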

@charleskorn charleskorn changed the title Closing Responses with incomplete chunked bodies retrieved over a Unix socket takes a long time on Linux Closing responses with incomplete chunked bodies retrieved over a Unix socket takes a long time on Linux Aug 26, 2018
charleskorn added a commit to batect/batect that referenced this issue Aug 26, 2018
@yschimke yschimke added the bug Bug in existing code label Aug 26, 2018
@swankjesse
Member

Good find. We're attempting to discard the response body for the benefit of connection reuse. We allow up to 100ms before we give up and discard the connection.

But in this case the 100ms timeout isn't being honored. My guess is it's a problem in the Unix socket library. We should prepare a minimal test case & try to get that fixed in that project.
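
For context, the drain works roughly like this (an illustrative sketch of the idea, not OkHttp's exact Util.discard code): put a deadline on the source, read into a throwaway buffer until end-of-stream, and treat a timeout or error as "couldn't drain, don't reuse the connection". The symptom above suggests the deadline isn't enforced over the jnr Unix socket, so the read blocks until readTimeout instead.

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;
    import okio.Buffer;
    import okio.Source;

    final class DiscardSketch {
      /** Returns true if the source reached end-of-stream within the timeout. */
      static boolean discardWithin(Source source, long timeout, TimeUnit unit) {
        source.timeout().deadlineNanoTime(System.nanoTime() + unit.toNanos(timeout));
        Buffer skipBuffer = new Buffer();
        try {
          while (source.read(skipBuffer, 8192) != -1) {
            skipBuffer.clear(); // throw the bytes away; we only care about reaching end-of-stream
          }
          return true;  // fully drained: the connection can be reused
        } catch (IOException e) {
          return false; // deadline hit or read failed: give up on the connection
        } finally {
          source.timeout().clearDeadline();
        }
      }
    }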

@yschimke yschimke added this to the 3.12 milestone Aug 26, 2018
@charleskorn
Author

Thanks for the quick response. Do you need any more information from me?

@swankjesse
Member

Nope! Got all we need.

@yschimke
Collaborator

@swankjesse potentially two issues to fix/improve here

  1. Correct the (observed) timeout for unix sockets
  2. Can we make the drain on close async? We have a dispatcher, so we could hand the close off to it, since the client has indicated it doesn't care any more and the remaining work only benefits other requests.

@charleskorn
Author

charleskorn commented Sep 16, 2018

@swankjesse / @yschimke: I'm keen to see the second option implemented (independent of getting it fixed in the Unix socket library), as that will remove the 100ms delay on Mac as well. Are you able to give me some pointers on where to look and how to do this?

(If I get something working, I'll submit a PR.)

@swankjesse
Member

The idea would be to put connections back into the pool without draining them. The pool would then either drain them itself on its cleanup thread, or on some new thread that we create for this task.

There's a policy decision on what to do if a caller asks the pool for a connection that hasn't yet been drained. My preference is for that connection to be returned but only if it drains within 100 ms.
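
A rough sketch of the first option, reusing the discardWithin helper from the earlier sketch (names like pendingDrain are hypothetical, not existing OkHttp types): the connection carries its unread body back into the pool, and the pool's cleanup pass finishes the job.

    import java.util.Iterator;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import okio.Source;

    final class PooledConnectionSketch {
      /** Unread response body left behind by the previous call, or null if none. */
      Source pendingDrain;
    }

    final class PoolCleanupSketch {
      /** Hypothetical extra step for the pool's existing cleanup thread. */
      void drainIdleConnections(List<PooledConnectionSketch> idleConnections) {
        for (Iterator<PooledConnectionSketch> i = idleConnections.iterator(); i.hasNext(); ) {
          PooledConnectionSketch connection = i.next();
          if (connection.pendingDrain == null) continue; // nothing left over, nothing to do
          boolean drained = DiscardSketch.discardWithin(connection.pendingDrain, 100, TimeUnit.MILLISECONDS);
          connection.pendingDrain = null;
          if (!drained) i.remove(); // couldn't reach end-of-stream: evict rather than reuse
        }
      }
    }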

@yschimke
Collaborator

@charleskorn you have a good reproduction above, maybe bring the test over from https://github.com/charleskorn/okhttp-issue and use it as the start of the PR. It would make a good integration test confirming that this is finally fixed.

@charleskorn
Author

I've just started digging into this and I've never looked at the OkHttp codebase before, so I would love some feedback on what I'm thinking before I dive in:

  • It looks like the problematic code is the calls to Util.discard() in FixedLengthSource and ChunkedSource:

    @Override public void close() throws IOException {
      if (closed) return;

      // DISCARD_STREAM_TIMEOUT_MILLIS is the 100 ms budget mentioned above; if the rest of
      // the body can't be read within it, the connection is discarded instead of reused.
      if (bytesRemaining != 0 && !Util.discard(this, DISCARD_STREAM_TIMEOUT_MILLIS, MILLISECONDS)) {
        endOfInput(false, null);
      }

      closed = true;
    }

    If I'm understanding your suggestion correctly, then the call to Util.discard() should run on another thread (e.g. the connection pool cleanup thread)

  • Side note: in the current setup, if there are bytes / chunks remaining and Util.discard() returns true (i.e. it reached the end of the stream), what calls endOfInput() (and presumably with arguments like endOfInput(true, null)) to signal that the stream can be reused?

  • If that's correct, then I'm thinking that, in close():

    • If the source has no bytes or chunks remaining it just immediately sets closed to true and returns
    • If the source has bytes or chunks remaining then it queues the source for draining by the connection pool. I'm imagining something like a drain() method on StreamAllocation which passes this to the connection pool to drain the source, and the draining is done by the connection pool on a separate thread (rather than blocking the cleanup thread).
      • Given we can't guarantee that the draining will take less than 100 ms, we wouldn't return the connection to the connection pool until the draining has completed (or timed out) - that way, new calls aren't blocked by responses that are slow to drain (or that will never drain in the case of endless chunked responses).
      • What's the correct level of abstraction for ConnectionPool to operate at? It looks like it works at the connection level, so I'm thinking that the background thread there should call a method on the connection to drain the response source, rather than it having a reference to the source itself.

How does that sound?

@swankjesse
Member

It's getting closer. There are some moving parts that complicate things.

The discard() method calls back into the source stream to read it, and that's how endOfInput() is reached.

The StreamAllocation object is owned by the call, so we can’t give it to the ConnectionPool. Instead the ConnectionPool would need another type; maybe just the RealConnection. That class could benefit from a new field like streamToDrain that would need to be drained before the connection is returned.

@charleskorn
Author

Ah, I see - I hadn't made the connection that discard() works by calling the source's read(), but that makes sense.

Using the RealConnection in ConnectionPool makes sense as well, given it already operates at that level.

Given that, I'm thinking something like this:

  • If close() in ChunkedSource and FixedLengthSource is called and the stream hasn't been completely read, it calls something like drain(source) on StreamAllocation, which sets streamToDrain on itself and then calls drain(connection) on the ConnectionPool
  • The ConnectionPool has a background thread for draining sources, which is started when the first drain() call is made (or restarted if it finished draining all the connections that needed to be drained) - or should this just queue separate jobs on the ThreadPoolExecutor, one for each connection that needs to be drained? That way, multiple connections could be drained in parallel.
  • Each drainable Source would implement DrainableSource (a new interface) with a drain() method that calls Util.discard(this, ...) and calls endOfInput(false, null) if the discard times out - roughly the shape sketched below.
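
A hypothetical shape for that interface, just to make the proposal concrete (DrainableSource and drain() are not existing OkHttp types):

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;
    import okio.Source;

    /** A response body source that knows how to finish reading itself before its connection is reused. */
    interface DrainableSource extends Source {
      /**
       * Tries to read the rest of the stream within the timeout; implementations would call
       * Util.discard(this, ...) and fall back to endOfInput(false, null) if that times out,
       * mirroring what close() does today.
       */
      void drain(long timeout, TimeUnit unit) throws IOException;
    }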

Does that line up more with what you were thinking? Any suggestions?

@swankjesse
Member

I wanna do it with fewer types and fewer moving parts. Maybe a connection in the pool has a nullable SourceToDrain, and we spend up to 100ms draining it when fetching the stream from the pool. Rather than draining it on a background thread we just let the pool’s user do the work.

The motivation here is better performance in the common case. If we shuttle a connection to a background thread then callers who make 2 calls to the same server in succession will see a pool miss, resulting in overall worse performance. If we put the connection in the pool with its own unfinished business then we only pay for that business before we benefit from it.
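
In other words, something like this when a caller asks the pool for a connection (illustrative only, reusing the hypothetical PooledConnectionSketch and DiscardSketch names from the earlier sketches; pendingDrain plays the role of the nullable SourceToDrain above):

    import java.util.concurrent.TimeUnit;

    final class PoolFetchSketch {
      /**
       * Run by the pool's user when it pulls an idle connection: spend up to 100 ms
       * finishing the previous response's unread body, and only reuse the connection
       * if that succeeds. If it fails, the caller builds a fresh connection instead.
       */
      static boolean acquireForReuse(PooledConnectionSketch connection) {
        if (connection.pendingDrain == null) return true; // nothing left over: reuse immediately
        boolean drained = DiscardSketch.discardWithin(connection.pendingDrain, 100, TimeUnit.MILLISECONDS);
        connection.pendingDrain = null;
        return drained;
      }
    }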

@charleskorn
Author

For my use case (connecting to an unencrypted Unix socket), creating a new connection takes a tiny fraction of 100 ms, so it doesn't make sense for me to wait for the connection to drain in any circumstance - but I understand the 'normal' use case is a bit different.

So I'm now thinking that instead of trying to drain the connection (synchronously or otherwise), could we add an option to just disable discarding streams if they haven't been read completely (and so just close the connection rather than trying to reuse it)? That solves my problem and also means I'll never spend 100ms waiting for a connection to drain that I won't be able to reuse anyway.

@swankjesse
Member

If we wanted to get super fancy we could measure the time it took to create the connection, and use, say, (50% of that + 10 ms) as the discard timeout.
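
That heuristic would be tiny (illustrative only, assuming the connect time is recorded when the connection is created):

    final class DiscardTimeoutSketch {
      /** 50% of the measured connect time, plus 10 ms. */
      static long adaptiveDiscardTimeoutMillis(long connectDurationMillis) {
        return connectDurationMillis / 2 + 10;
      }
    }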

I think the real proper fix for all of this is not in OkHttp, however. I think we really just want to fix the underlying UNIX domain sockets system to honor timeouts.

@charleskorn
Author

Completely agree, the underlying issue needs to be fixed.

However, I still think there's something here for not even trying to drain a connection if the user knows there's no point - for example, in my case, I know the drain will never succeed because the server will never stop sending events, so there's no point spending any amount of time trying to drain it. Would you be open to a PR that adds this functionality?

charleskorn added a commit to charleskorn/jnr-unixsocket that referenced this issue Sep 23, 2018
This passes on OS X and fails on Linux and is the cause of
square/okhttp#4233.
charleskorn added a commit to batect/batect that referenced this issue Sep 23, 2018
charleskorn added a commit to batect/batect that referenced this issue Sep 23, 2018
@swankjesse
Member

You’re in the intersection of two special cases: you’re reading a never-ending stream, and you’re using sockets that don’t correctly implement timeouts. I suspect that if we offered an API for this almost nobody would use it.

The right long-term strategy for you and everyone else is HTTP/2. It doesn’t have this problem. Anything we do to further band-aid HTTP/1 is wasted complexity.

@swankjesse swankjesse modified the milestones: 3.13, Icebox Nov 3, 2018
charleskorn added a commit to charleskorn/jnr-unixsocket that referenced this issue Feb 14, 2019
This passes on OS X and fails on Linux and is the cause of
square/okhttp#4233.
@yschimke
Collaborator

Won't fix.
