
Gcs pull resource reports #14336

Merged: 57 commits merged into ray-project:master on Mar 29, 2021

Conversation

@wuisawesome (Contributor) commented Feb 25, 2021

Why are these changes needed?

This PR should greatly stabilize the scheduling resource report system. Previously, raylets pushed heartbeats to the GCS. With many raylets, the overhead of these reports was significant because a large number of GCS services all run on the main thread, so raylet resource reports could overwhelm the GCS and slow down all other operations.

Testing:
To test this, we have a feature flag, pull_based_resource_reporting, which causes the GCS to pull resource reports. This should:
1. Move the resource polling to a new thread, decreasing the stress on the main thread.
2. Naturally rate limit GCS polling: if the GCS is overwhelmed, it slows down the rate of pull requests it sends out rather than overwhelming itself (see the sketch below).
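To make the polling flow concrete, here is a minimal sketch of a pull-based poller in the spirit of this change. It is illustrative only: the class, method, and parameter names (ResourceReportPoller, RequestResourceReport, max_inflight) are made up, not the PR's actual code. The key ideas are that the polling loop runs on its own io_service, off the GCS main thread, and that capping in-flight requests is what provides the natural rate limiting.

// Minimal sketch, not the PR's actual implementation.
#include <deque>
#include <functional>

#include <boost/asio.hpp>

class ResourceReportPoller {  // Hypothetical name for illustration.
 public:
  explicit ResourceReportPoller(size_t max_inflight) : max_inflight_(max_inflight) {}

  void AddNode(int node_id) { to_pull_.push_back(node_id); }

  // Called from a periodic tick and whenever a pull completes.
  void TryPull() {
    while (inflight_ < max_inflight_ && !to_pull_.empty()) {
      int node_id = to_pull_.front();
      to_pull_.pop_front();
      ++inflight_;
      RequestResourceReport(node_id, [this, node_id]() {
        // Completion runs on the polling thread, not the GCS main thread.
        --inflight_;
        to_pull_.push_back(node_id);  // Re-queue the node for the next round.
        TryPull();                    // Pull more without waiting for a tick.
      });
    }
  }

 private:
  // Stand-in for the real RPC to the raylet; here it just defers the callback.
  void RequestResourceReport(int /*node_id*/, std::function<void()> done) {
    polling_service_.post(std::move(done));
  }

  boost::asio::io_service polling_service_;  // Run on a dedicated thread.
  std::deque<int> to_pull_;
  size_t inflight_ = 0;
  const size_t max_inflight_;
};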

Here is GCS latency before and after (note: we simulated high resource report load by increasing the report frequency to 10 ms):

[Latency chart: push based (old)]

[Latency chart: pull based (new)]

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

auto raylet_client = raylet_client_pool_->GetOrConnectByAddress(address);
raylet_client->RequestResourceReport(
    [this, node_id](const Status &status, const rpc::RequestResourceReportReply &reply) {
      // TODO (Alex): This callback is always posted onto the main thread. Since most
@wuisawesome (author):

If asio schedules in FIFO order, this will result in a bounded delay for everything else on the main thread, but we should verify that asio schedules the way we expect.
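For reference, here is a quick standalone check (not part of the PR) of that assumption: handlers posted to a single-threaded boost::asio io_service should run in the order they were posted.

// Standalone sanity check, illustrative only: posted handlers on a
// single-threaded io_service execute in posting (FIFO) order.
#include <cassert>
#include <vector>

#include <boost/asio.hpp>

int main() {
  boost::asio::io_service io;
  std::vector<int> order;
  for (int i = 0; i < 5; ++i) {
    io.post([&order, i] { order.push_back(i); });
  }
  io.run();  // One thread drains the handler queue.
  for (int i = 0; i < 5; ++i) {
    assert(order[i] == i);
  }
  return 0;
}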

@wuisawesome (author):

Move this to be node based (instead of round based)

@wuisawesome wuisawesome marked this pull request as ready for review February 27, 2021 06:07
@clarkzinzow (Contributor) left a comment:

A few small things and nits, especially some repeated nits about avoiding default by-reference lambda captures (lambda best practice, item 31 of "Effective Modern C++", yada yada yada) and about using absl's thread-safety annotations, both of which I believe we'd like to standardize around, but please correct me if I'm wrong.
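For readers unfamiliar with those two nits, here is a small illustrative sketch (not code from this PR; the class and member names are hypothetical): capture exactly what a deferred callback needs instead of using a default by-reference capture, and annotate mutex-guarded members so clang's -Wthread-safety analysis catches unguarded access.

// Illustrative only; names are made up.
#include <functional>
#include <vector>

#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

class ReportTracker {
 public:
  void SchedulePoll(int node_id) {
    // A default capture like [&] would capture the local `node_id` by
    // reference, which dangles once this scope ends. Capture it by value
    // and capture `this` explicitly instead.
    deferred_.push_back([this, node_id] { RecordPoll(node_id); });
  }

  void RunDeferred() {
    for (auto &fn : deferred_) fn();
    deferred_.clear();
  }

  int num_polled() {
    absl::MutexLock lock(&mutex_);
    return num_polled_;
  }

 private:
  void RecordPoll(int /*node_id*/) {
    absl::MutexLock lock(&mutex_);
    ++num_polled_;  // Annotated below, so -Wthread-safety flags unguarded access.
  }

  std::vector<std::function<void()>> deferred_;
  absl::Mutex mutex_;
  int num_polled_ ABSL_GUARDED_BY(mutex_) = 0;
};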

Resolved review threads (outdated) on:
  • src/ray/common/ray_config_def.h
  • src/ray/gcs/gcs_server/gcs_resource_manager.cc
  • src/ray/gcs/gcs_server/gcs_resource_manager.h
  • src/ray/gcs/gcs_server/gcs_resource_report_poller.h (4 threads)
  • src/ray/gcs/gcs_server/gcs_resource_report_poller.cc
@ericl (Contributor) commented Mar 21, 2021

@wuisawesome is this ready for review? Please remove the label if so. Don't use the request-reviews button.

@wuisawesome wuisawesome removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 22, 2021
Resolved review thread (outdated): src/ray/gcs/gcs_server/gcs_resource_report_poller.cc
<< ")# of remaining nodes: " << nodes_.size();
}

void GcsResourceReportPoller::Tick() { TryPullResourceReport(); }
Reviewer:

This method seems unnecessary; you can just call TryPull directly.

Resolved review thread (outdated): src/ray/gcs/gcs_server/gcs_resource_report_poller.cc
to_pull_queue_.push_back(state);
}

polling_service_.post([this] { TryPullResourceReport(); });
Reviewer:

Why do we need this post? Isn't there a Tick() scheduled already?

@wuisawesome (author):

I think we explicitly want to pull more things without waiting for a tick. cc @stephanie-wang

Reviewer:

Why not just wait for the tick, especially if it's firing at high frequency (<10 ms)? We should either have a ticker or call it periodically; having both is unnecessary.

@wuisawesome (author):

I don't think it's going to make a big difference either way, but why not do it this way, since it's more responsive?

The only scenario where I can see this mattering is if RPCs fail very quickly for some reason.
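To make the trade-off in this thread concrete, here is a small sketch (hypothetical names such as TickDriver and try_pull; not the PR's code) of the two triggers being discussed: a periodic timer tick, plus an explicit post when a pull finishes so the next pull does not wait up to a full tick interval.

// Illustrative sketch of driving the pull loop from both a periodic tick
// and an explicit re-trigger.
#include <chrono>
#include <functional>
#include <utility>

#include <boost/asio.hpp>

class TickDriver {
 public:
  TickDriver(boost::asio::io_context &io, std::function<void()> try_pull)
      : io_(io), timer_(io), try_pull_(std::move(try_pull)) {}

  void ScheduleTick() {
    timer_.expires_after(std::chrono::milliseconds(100));
    timer_.async_wait([this](const boost::system::error_code &ec) {
      if (ec) return;   // Timer cancelled, e.g. on shutdown.
      try_pull_();      // Periodic trigger.
      ScheduleTick();   // Re-arm for the next interval.
    });
  }

  // Called when a pull completes, so the next pull starts immediately
  // instead of waiting for the next tick.
  void TriggerNow() { boost::asio::post(io_, try_pull_); }

 private:
  boost::asio::io_context &io_;
  boost::asio::steady_timer timer_;
  std::function<void()> try_pull_;
};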

// Now enough time has passed.
Tick(100);
RunPollingService();
ASSERT_TRUE(rpc_sent);
Reviewer:

Eventually we shouldn't retry failed polls, right? We can treat it the same as a heartbeat timeout. Maybe add a TODO in the relevant locations for now.

@wuisawesome (author):

Hmmm, see this discussion: #14336 (comment)

I think in the common case the RPC will fail because the node died, in which case the node-removed handler will be called, which stops the polling.
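A rough sketch of that behavior (assumed shape with made-up names, not the PR's actual code): when the node-removed handler fires, the node is dropped from the poller's bookkeeping, so a failed RPC for that node is simply never retried.

// Illustrative only; names are hypothetical.
#include <algorithm>
#include <deque>
#include <unordered_set>

class PollerState {
 public:
  void HandleNodeAdded(int node_id) {
    nodes_.insert(node_id);
    to_pull_.push_back(node_id);
  }

  // Called by the GCS node-removed handler: forget the node entirely.
  void HandleNodeRemoved(int node_id) {
    nodes_.erase(node_id);
    to_pull_.erase(std::remove(to_pull_.begin(), to_pull_.end(), node_id),
                   to_pull_.end());
  }

  // Called when a pull RPC fails; only re-queue nodes we still know about.
  void OnPullFailed(int node_id) {
    if (nodes_.count(node_id) > 0) {
      to_pull_.push_back(node_id);
    }
  }

 private:
  std::unordered_set<int> nodes_;
  std::deque<int> to_pull_;
};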

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 22, 2021
@wuisawesome wuisawesome added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Mar 25, 2021
@ericl (Contributor) commented Mar 25, 2021

@wuisawesome why did you mark this tests-ok? I looked at the past three commits and it seems test_placement_group is consistently failing:

//python/ray/tests:test_placement_group                                  FAILED in 3 out of 3 in 272.3s
  Stats over 3 runs: max = 272.3s, min = 234.8s, avg = 252.8s, dev = 15.3s
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_placement_group/test.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_placement_group/test_attempts/attempt_1.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_placement_group/test_attempts/attempt_2.log

@ericl ericl added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Mar 25, 2021
@wuisawesome (author):

Oh interesting, I looked at the Buildkite tests and they passed.

@wuisawesome (author):

@ericl I think you're looking at the branch build, which was useful for building wheels but doesn't auto-merge latest master. The PR build is green.

@ericl ericl merged commit 1f4d4df into ray-project:master Mar 29, 2021
@ericl (Contributor) commented Mar 29, 2021

Merged. Please remember to remove the action-required label in the future.

Labels: @author-action-required
4 participants