Closing responses with incomplete chunked bodies retrieved over a Unix socket takes a long time on Linux #4233

Closed
charleskorn opened this issue Aug 26, 2018 · 16 comments
Comments

@charleskorn

charleskorn commented Aug 26, 2018

Description

If a response meets all of the following criteria...

  • it was retrieved over a Unix domain socket (using the sample code)
  • the body is chunked (i.e. has the header Transfer-Encoding: chunked)
  • the body is incomplete (i.e. does not have a trailing 0\r\n\r\n chunk), such as when streaming events from a Docker daemon with the events API
  • the application is running on Linux

...then calling close() on Response takes quite a long time. The time taken seems to track whatever has been set with readTimeout() on OkHttpClient (i.e. if the timeout is 3s, closing the response takes just over 3s; if the timeout is 20s, closing the response takes just over 20s).

This behaviour does not occur if any of the following is true:

  • the body is not chunked
  • the body is chunked but complete (i.e. has a trailing 0\r\n\r\n chunk)
  • the application is running on OS X

Steps to reproduce

I've created a sample application that demonstrates this behaviour: https://github.com/charleskorn/okhttp-issue.

Expected behaviour

Calling Response.close() takes a very short period of time (a few milliseconds at most).

Actual behaviour

Calling Response.close() takes quite some time, roughly equal to the configured readTimeout.
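
For reference, the shape of the measurement is roughly the following. This is a minimal sketch rather than the sample application itself; the Unix socket factory and the events URL are assumptions based on the description above.

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;
    import javax.net.SocketFactory;
    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.Response;

    public final class SlowCloseRepro {
      public static void main(String[] args) throws IOException {
        // Assumption: a jnr-unixsocket-backed factory, as in the sample application.
        SocketFactory unixSocketFactory = createUnixSocketFactory();

        OkHttpClient client = new OkHttpClient.Builder()
            .socketFactory(unixSocketFactory)
            .readTimeout(3, TimeUnit.SECONDS)
            .build();

        Request request = new Request.Builder()
            .url("http://localhost/events") // assumption: the Docker events endpoint, which streams chunks indefinitely
            .build();

        Response response = client.newCall(request).execute();
        response.body().source().readUtf8Line(); // read part of the never-ending chunked body

        long start = System.nanoTime();
        response.close(); // expected: a few milliseconds; observed on Linux: roughly readTimeout
        System.out.println("close() took "
            + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) + " ms");
      }

      private static SocketFactory createUnixSocketFactory() {
        throw new UnsupportedOperationException("see the sample application for the Unix socket setup");
      }
    }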

@charleskorn charleskorn changed the title Closing Responses with incomplete chunked bodies retrieved over a Unix socket takes a long time on Linux Closing responses with incomplete chunked bodies retrieved over a Unix socket takes a long time on Linux Aug 26, 2018
charleskorn added a commit to batect/batect that referenced this issue Aug 26, 2018
@yschimke yschimke added the bug Bug in existing code label Aug 26, 2018
@swankjesse
Member

Good find. We're attempting to discard the response body for the benefit of connection reuse. We allow up to 100ms before we give up and discard the connection.

But in this case the 100ms timeout isn't being honored. My guess is it's a problem in the Unix socket library. We should prepare a minimal test case & try to get that fixed in that project.
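
For context, the drain works roughly like this (an illustrative sketch of the idea, not OkHttp's exact Util.discard code): put a deadline on the source, read into a throwaway buffer until end-of-stream, and treat a timeout or error as "couldn't drain, don't reuse the connection". The symptom above suggests the deadline isn't enforced over the jnr Unix socket, so the read blocks until readTimeout instead.

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;
    import okio.Buffer;
    import okio.Source;

    final class DiscardSketch {
      /** Returns true if the source reached end-of-stream within the timeout. */
      static boolean discardWithin(Source source, long timeout, TimeUnit unit) {
        source.timeout().deadlineNanoTime(System.nanoTime() + unit.toNanos(timeout));
        Buffer skipBuffer = new Buffer();
        try {
          while (source.read(skipBuffer, 8192) != -1) {
            skipBuffer.clear(); // throw the bytes away; we only care about reaching end-of-stream
          }
          return true;  // fully drained: the connection can be reused
        } catch (IOException e) {
          return false; // deadline hit or read failed: give up on the connection
        } finally {
          source.timeout().clearDeadline();
        }
      }
    }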

@yschimke yschimke added this to the 3.12 milestone Aug 26, 2018
@charleskorn
Author

Thanks for the quick response. Do you need any more information from me?

@swankjesse
Member

Nope! Got all we need.

@yschimke
Collaborator

@swankjesse potentially two issues to fix/improve here

  1. Correct the (observed) timeout for unix sockets
  2. Can we make the drain on close async? We have a dispatcher, so we could hand the close off to it, since the client has indicated it doesn't care any more and the remaining work only benefits other requests.

@charleskorn
Author

charleskorn commented Sep 16, 2018

@swankjesse / @yschimke: I'm keen to see the second option implemented (independent of getting it fixed in the Unix socket library), as that will remove the 100ms delay on Mac as well. Are you able to give me some pointers on where to look and how to do this?

(If I get something working, I'll submit a PR.)

@swankjesse
Member

The idea would be to put connections back into the pool without draining them. The pool would then either drain them itself on its cleanup thread, or on some new thread that we create for this task.

There's a policy decision on what to do if a caller asks the pool for a connection that hasn't yet been drained. My preference is for that connection to be returned but only if it drains within 100 ms.
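
A rough sketch of the first option, reusing the discardWithin helper from the earlier sketch (names like pendingDrain are hypothetical, not existing OkHttp types): the connection carries its unread body back into the pool, and the pool's cleanup pass finishes the job.

    import java.util.Iterator;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import okio.Source;

    final class PooledConnectionSketch {
      /** Unread response body left behind by the previous call, or null if none. */
      Source pendingDrain;
    }

    final class PoolCleanupSketch {
      /** Hypothetical extra step for the pool's existing cleanup thread. */
      void drainIdleConnections(List<PooledConnectionSketch> idleConnections) {
        for (Iterator<PooledConnectionSketch> i = idleConnections.iterator(); i.hasNext(); ) {
          PooledConnectionSketch connection = i.next();
          if (connection.pendingDrain == null) continue; // nothing left over, nothing to do
          boolean drained = DiscardSketch.discardWithin(connection.pendingDrain, 100, TimeUnit.MILLISECONDS);
          connection.pendingDrain = null;
          if (!drained) i.remove(); // couldn't reach end-of-stream: evict rather than reuse
        }
      }
    }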

@yschimke
Collaborator

@charleskorn you have a good reproduction above, maybe bring the test over from https://github.com/charleskorn/okhttp-issue and use it as the start of the PR. It would make a good integration test confirming that this is finally fixed.

@charleskorn
Author

I've just started digging into this and I've never looked at the OkHttp codebase before, so I would love some feedback on what I'm thinking before I dive in:

  • It looks like the problematic code is the calls to Util.discard() in FixedLengthSource and ChunkedSource:

    @Override public void close() throws IOException {
      if (closed) return;

      // DISCARD_STREAM_TIMEOUT_MILLIS is the 100 ms budget mentioned above; if the rest of
      // the body can't be read within it, the connection is discarded instead of reused.
      if (bytesRemaining != 0 && !Util.discard(this, DISCARD_STREAM_TIMEOUT_MILLIS, MILLISECONDS)) {
        endOfInput(false, null);
      }

      closed = true;
    }

    If I'm understanding your suggestion correctly, then the call to Util.discard() should run on another thread (e.g. the connection pool cleanup thread)

  • Side note: in the current setup, if there are bytes / chunks remaining and Util.discard() returns true (i.e. it reached the end of the stream), what calls endOfInput() (and presumably with arguments like endOfInput(true, null)) to signal that the stream can be reused?

  • If that's correct, then I'm thinking that, in close():

    • If the source has no bytes or chunks remaining it just immediately sets closed to true and returns
    • If the source has bytes or chunks remaining then it queues the source for draining by the connection pool. I'm imagining something like a drain() method on StreamAllocation which passes this to the connection pool to drain the source, and the draining is done by the connection pool on a separate thread (rather than blocking the cleanup thread).
      • Given we can't guarantee that the draining will take less than 100 ms, we wouldn't return the connection to the connection pool until the draining has completed (or timed out) - that way, new calls aren't blocked by responses that are slow to drain (or that will never drain in the case of endless chunked responses).
      • What's the correct level of abstraction for ConnectionPool to operate at? It looks like it works at the connection level, so I'm thinking that the background thread there should call a method on the connection to drain the response source, rather than it having a reference to the source itself.

How does that sound?

@swankjesse
Member

It's getting closer. There are some moving parts that complicate things.

The discard() method calls back into the source stream to read it, and that's how endOfInput() is reached.

The StreamAllocation object is owned by the call, so we can’t give it to the ConnectionPool. Instead the ConnectionPool would need another type; maybe just the RealConnection. That class could benefit from a new field like streamToDrain that would need to be drained before the connection is returned.

@charleskorn
Author

Ah, I see - I hadn't made the connection that discard() works by calling the source's read(), but that makes sense.

Using the RealConnection in ConnectionPool makes sense as well, given it already operates at that level.

Given that, I'm thinking something like this:

  • If close() in ChunkedSource and FixedLengthSource is called and the stream hasn't been completely read, it calls something like drain(source) on StreamAllocation, which sets streamToDrain on itself and then calls drain(connection) on the ConnectionPool
  • The ConnectionPool has a background thread for draining sources, which is started when the first drain() call is made (or restarted if it finished draining all the connections that needed to be drained) - or should this just queue separate jobs on the ThreadPoolExecutor, one for each connection that needs to be drained? That way, multiple connections could be drained in parallel.
  • Each drainable Source would implement DrainableSource (a new interface) with a drain() method that calls Util.discard(this, ...) and calls endOfInput(false, null) if the discard times out - roughly the shape sketched below.
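
A hypothetical shape for that interface, just to make the proposal concrete (DrainableSource and drain() are not existing OkHttp types):

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;
    import okio.Source;

    /** A response body source that knows how to finish reading itself before its connection is reused. */
    interface DrainableSource extends Source {
      /**
       * Tries to read the rest of the stream within the timeout; implementations would call
       * Util.discard(this, ...) and fall back to endOfInput(false, null) if that times out,
       * mirroring what close() does today.
       */
      void drain(long timeout, TimeUnit unit) throws IOException;
    }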

Does that line up more with what you were thinking? Any suggestions?

@swankjesse
Member

I wanna do it with fewer types and fewer moving parts. Maybe a connection in the pool has a nullable SourceToDrain, and we spend up to 100ms draining it when fetching the stream from the pool. Rather than draining it on a background thread we just let the pool’s user do the work.

The motivation here is better performance in the common case. If we shuttle a connection to a background thread then callers who make 2 calls to the same server in succession will see a pool miss, resulting in overall worse performance. If we put the connection in the pool with its own unfinished business then we only pay for that business before we benefit from it.
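
In other words, something like this when a caller asks the pool for a connection (illustrative only, reusing the hypothetical PooledConnectionSketch and DiscardSketch names from the earlier sketches; pendingDrain plays the role of the nullable SourceToDrain above):

    import java.util.concurrent.TimeUnit;

    final class PoolFetchSketch {
      /**
       * Run by the pool's user when it pulls an idle connection: spend up to 100 ms
       * finishing the previous response's unread body, and only reuse the connection
       * if that succeeds. If it fails, the caller builds a fresh connection instead.
       */
      static boolean acquireForReuse(PooledConnectionSketch connection) {
        if (connection.pendingDrain == null) return true; // nothing left over: reuse immediately
        boolean drained = DiscardSketch.discardWithin(connection.pendingDrain, 100, TimeUnit.MILLISECONDS);
        connection.pendingDrain = null;
        return drained;
      }
    }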

@charleskorn
Author

For my use case (connecting to an unencrypted Unix socket), creating a new connection takes a tiny fraction of 100 ms, so it doesn't make sense for me to wait for the connection to drain in any circumstance - but I understand the 'normal' use case is a bit different.

So I'm now thinking that instead of trying to drain the connection (synchronously or otherwise), could we add an option to just disable discarding streams if they haven't been read completely (and so just close the connection rather than trying to reuse it)? That solves my problem and also means I'll never spend 100ms waiting for a connection to drain that I won't be able to reuse anyway.

@swankjesse
Member

If we wanted to get super fancy we could measure the time it took to create the connection, and use, say, (50% of that + 10 ms) as the discard timeout.
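
That heuristic would be tiny (illustrative only, assuming the connect time is recorded when the connection is created):

    final class DiscardTimeoutSketch {
      /** 50% of the measured connect time, plus 10 ms. */
      static long adaptiveDiscardTimeoutMillis(long connectDurationMillis) {
        return connectDurationMillis / 2 + 10;
      }
    }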

I think the real proper fix for all of this is not in OkHttp, however. I think we really just want to fix the underlying UNIX domain sockets system to honor timeouts.

@charleskorn
Author

Completely agree, the underlying issue needs to be fixed.

However, I still think there's something here for not even trying to drain a connection if the user knows there's no point - for example, in my case, I know the drain will never succeed because the server will never stop sending events, so there's no point spending any amount of time trying to drain it. Would you be open to a PR that adds this functionality?

charleskorn added a commit to charleskorn/jnr-unixsocket that referenced this issue Sep 23, 2018
This passes on OS X and fails on Linux and is the cause of
square/okhttp#4233.
charleskorn added a commit to batect/batect that referenced this issue Sep 23, 2018
charleskorn added a commit to batect/batect that referenced this issue Sep 23, 2018
@swankjesse
Member

You’re in the intersection of two special cases: you’re reading a never-ending stream, and you’re using sockets that don’t correctly implement timeouts. I suspect that if we offered an API for this almost nobody would use it.

The right long-term strategy for you and everyone else is HTTP/2. It doesn’t have this problem. Anything we do to further band-aid HTTP/1 is wasted complexity.

@swankjesse swankjesse modified the milestones: 3.13, Icebox Nov 3, 2018
charleskorn added a commit to charleskorn/jnr-unixsocket that referenced this issue Feb 14, 2019
This passes on OS X and fails on Linux and is the cause of
square/okhttp#4233.
@yschimke
Collaborator

Won't fix.
