Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dispatcher no longer has quadratic behaviour #4581

Merged
merged 8 commits into from
Feb 1, 2019

Conversation

iamdanfox
Copy link
Contributor

We had a production incident earlier today in which outgoing requests were blocked, with jstacks suggesting a lock in the Dispatcher as the culprit (on 3.12.0). (cc @swankjesse who worked on this recently)

This PR eliminates some quadratic behaviour in the Dispatcher by keeping track of requests per host.

I doubt this PR will really solve the problem we encountered (especially as we only use maxRequests = 256 and maxRequestsPerHost = 64), but it did seem a bit inefficient that the promoteAndExecute method iterated through the entire runningAsyncCalls for every item of readyAsyncCalls. I thought I'd put this PR up anyway as you might be interested in a perf improvement. Happy to add more test coverage if this is a direction you'd be interested in.


A very unscientific benchmark on my laptop showed speedups as maxRequests increased:

maxRequests | before   | after
64          | 30ms     | 30ms
640         | 101ms    | 71ms
6400        | 1266ms   | 196ms
64000       | 2minutes | 917ms
  @Test public void benchmark() throws Exception {
    int maxRequests = 640;
    dispatcher.setMaxRequestsPerHost(maxRequests);
    dispatcher.setMaxRequests(maxRequests);
    long before = System.currentTimeMillis();
    for (int i = 0; i < maxRequests; i ++) {
      client.newCall(newRequest("http://a/" + i)).enqueue(callback);
    }
    long after = System.currentTimeMillis();
    System.out.println("Elapsed = " + (after - before));
    assertEquals(executor.calls.size(), maxRequests);
  }

For reference, jstacks showed:

  • 64 threads BLOCKED on 0x00007f28956a5000 line okhttp3.Dispatcher.finished(Dispatcher.java:219)
  • 52 threads BLOCKED on 0x00007f28956a5000 on line okhttp3.Dispatcher.enqueue(Dispatcher.java:134)
  • jstacks over ~5 seconds showed a RUNNABLE thread stuck holding the lock in the runningCallsForHost method:
"OkHttp https://redacted:redacted/..." #699949 prio=5 os_prio=0 tid=0x00007f2929116000 nid=0x9730 runnable [0x00007f28956a5000]
   java.lang.Thread.State: RUNNABLE
	at okhttp3.Dispatcher.runningCallsForHost(Dispatcher.java:198)
	at okhttp3.Dispatcher.promoteAndExecute(Dispatcher.java:175)
	- locked <0x00000000ab2ff958> (a okhttp3.Dispatcher)
	at okhttp3.Dispatcher.finished(Dispatcher.java:224)
	at okhttp3.Dispatcher.finished(Dispatcher.java:209)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:218)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

@swankjesse
Copy link
Member

Wow, that's both awful and unexpected. I assume the problem is that we're doing work while holding that lock that we shouldn't be. The dispatcher lock is intended to defend against memory accesses and should never be a contention point.

@swankjesse
Copy link
Member

May I propose an alternate fix? Two broad ideas:

Take an optional host hint for the method and pass it when a call completes or is enqueued. If we have more than 1 general capacity then search for a call for that host specifically.

Add an AtomicInteger to RealCall which represents the running calls for that host. Initialize it when the call is enqueued to the AtomicInteger of another call that shares the host, or 0 if there are none.

@iamdanfox
Copy link
Contributor Author

@swankjesse OK I can take a stab at that AtomicInteger version - is the motivation there to reduce the Map allocation or minimise the new abstractions in one PR? It seems a bit sad to add mutability to the otherwise immutable RealCall.AsyncCall...

Copy link
Member

@swankjesse swankjesse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I like this much better.

// Mutate the AsyncCall so that it shares the AtomicInteger of an existing running call to
// the same host.
if (!call.get().forWebSocket) {
for (AsyncCall existingCall : runningAsyncCalls) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably needs to loop over ready calls also!

@iamdanfox
Copy link
Contributor Author

Also, it seems that synchronous calls are not affected by the maxRequestsPerHost limit... I think I've preserved this behaviour, but is this desirable/correct?

@swankjesse
Copy link
Member

Synchronous calls don't participate. You got it.

}
promoteAndExecute();
}

@Nullable private AsyncCall findExistingCallWithHost(String host) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@swankjesse swankjesse merged commit 56f55d9 into square:master Feb 1, 2019
@swankjesse
Copy link
Member

Nice improvement. Presumably your benchmark numbers are still a big improvement?!

@iamdanfox
Copy link
Contributor Author

Yep, I think we should still see a speedup when lots of requests are queued/in-flight!

bulldozer-bot bot pushed a commit to palantir/conjure-java-runtime that referenced this pull request Mar 11, 2019
…o okhttp3.Dispatcher (#940)

## Before this PR

We recently experienced two P0 issues that seemed to be caused by locking inside the okhttp 3.12.0 dispatcher.  I opened a PR to okhttp which should hopefully reduce the amount of work done inside `synchronized` blocks (square/okhttp#4581).

One internal service forced okhttp to 3.11.0 as a temporary workaround.

## After this PR

Hopefully that 3.11.0 downgrade can be reverted, and P0s won't re-occur!

_Full changelog https://github.com/square/okhttp/blob/master/CHANGELOG.md_

Do not merge until:

- [x] ~TLSv1 and TLSv1.1 are no longer enabled by default. - do we need to opt-in to `ConnectionSpec.COMPATIBLE_TLS`~ seems like we've always been overriding the connection spec to [only specify TLS 1.2](https://github.com/palantir/conjure-java-runtime/blob/9d88bfa3c3e1de74ea57f158734186628d1bb8c9/okhttp-clients/src/main/java/com/palantir/conjure/java/okhttp/OkHttpClients.java#L225).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants