Remove static concurrency limit from gRPC server #7544
Conversation
Can one of the admins verify this patch?
Test FAILed.
I know it was also an issue before, but should we be concerned at all that there is no backpressure on the server's request queue?
@raulchen merging this because the high memory consumption is causing CI failures in master and the tests passed here, but please have a look when you can and let me know if I'm missing something.
@edoakes, while reducing memory usage, this PR also removes the back-pressure functionality on the server side. I believe this will cause issues as well. Previously, when I first introduced gRPC to Ray without back-pressure support, I saw an issue where a worker would suddenly submit tons of tasks to the raylet and overload it; some stress tests were failing. I believe this PR will cause similar problems. I think the best solution to this issue is to pre-allocate only K tags and, when a new request is accepted, allocate a new tag only if the total number of outstanding requests is less than N.
@raulchen not sure I fully understand - where was the backpressure that this change removes? We can still only ever have at most `K` handlers running at a given time. Also note that most handlers had …
Why are these changes needed?
We currently have a configurable limit on the number of concurrent calls of each RPC handler that can be accepted at a given time (after that, the calls will block in the gRPC layer as there is no `tag` for the completion queue to populate). On startup, that number of gRPC tags are created for future requests. This causes high memory consumption with no real benefit, as each thread can only be consuming at most one call at a time (while it is executing the handler) and creates a new one synchronously once it finishes executing the handler (even if the RPC reply was not yet sent).

To summarize, the existing behavior is:

- `N` tags that will be populated by `K` completion queues as requests come in.
- `K` threads poll one tag at a time from the completion queues. The threads handle the request synchronously (calling the user-defined methods), then create a new tag once the handler is finished (note that many of our handlers just post to another event loop, so return quickly - "finished" just means that the handler returned, not necessarily that the reply callback was called).

Note that there are never fewer than `N` - `K` tags in the completion queues and there can never be more than `K` handlers running at a given time (because there are only `K` threads to run them), but there's no limit on the number of outstanding requests being handled (if they're being handled on another thread). If all tags in the completion queue are populated, the completion queues are also blocked waiting for the handlers to create new tags.

The behavior with this PR is:

- `K` tags that will be populated by `K` completion queues as requests come in.
- `K` threads poll one tag at a time from the completion queues. The threads create a new tag, then handle the request synchronously (calling the user-defined methods).

The key differences are:

- `K` tags in the completion queues. The queues will be blocked if these tags are populated, but we still have some pipelining to populate these tags while handlers run.
- `K` handlers running at a given time because there are only `K` threads handling requests.

Related issue number
Closes #7543
Checks
I've run `scripts/format.sh` to lint the changes in this PR.