
Gcs pull resource reports #14336

Merged: 57 commits merged into ray-project:master on Mar 29, 2021

Conversation

@wuisawesome (Contributor) commented Feb 25, 2021

Why are these changes needed?

This PR should greatly stabilize the scheduling resource report system. Previously, raylets pushed heartbeats to the GCS. With many raylets, the overhead of these reports was significant because a large number of GCS services all run on the main thread, so raylet resource reports could overwhelm the GCS and slow down all other operations.

Testing:
To test this, we have a feature flag, pull_based_resource_reporting, which causes the GCS to pull resource reports. This should:
1. Move the resource polling to a new thread, decreasing the stress on the main thread.
2. Naturally rate limit GCS polling: if the GCS is overwhelmed, it slows down the rate of pull requests it sends out rather than overwhelming itself (see the sketch below).
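To make the polling flow concrete, here is a minimal sketch of a pull-based poller in the spirit of this change. It is illustrative only: the class, method, and parameter names (ResourceReportPoller, RequestResourceReport, max_inflight) are made up, not the PR's actual code. The key ideas are that the polling loop runs on its own io_service, off the GCS main thread, and that capping in-flight requests is what provides the natural rate limiting.

// Minimal sketch, not the PR's actual implementation.
#include <deque>
#include <functional>

#include <boost/asio.hpp>

class ResourceReportPoller {  // Hypothetical name for illustration.
 public:
  explicit ResourceReportPoller(size_t max_inflight) : max_inflight_(max_inflight) {}

  void AddNode(int node_id) { to_pull_.push_back(node_id); }

  // Called from a periodic tick and whenever a pull completes.
  void TryPull() {
    while (inflight_ < max_inflight_ && !to_pull_.empty()) {
      int node_id = to_pull_.front();
      to_pull_.pop_front();
      ++inflight_;
      RequestResourceReport(node_id, [this, node_id]() {
        // Completion runs on the polling thread, not the GCS main thread.
        --inflight_;
        to_pull_.push_back(node_id);  // Re-queue the node for the next round.
        TryPull();                    // Pull more without waiting for a tick.
      });
    }
  }

 private:
  // Stand-in for the real RPC to the raylet; here it just defers the callback.
  void RequestResourceReport(int /*node_id*/, std::function<void()> done) {
    polling_service_.post(std::move(done));
  }

  boost::asio::io_service polling_service_;  // Run on a dedicated thread.
  std::deque<int> to_pull_;
  size_t inflight_ = 0;
  const size_t max_inflight_;
};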

Here is GCS latency before and after (note: we simulated high resource report load by increasing the report frequency to 10 ms):

[Latency chart: push based (old)]

[Latency chart: pull based (new)]

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

auto raylet_client = raylet_client_pool_->GetOrConnectByAddress(address);
raylet_client->RequestResourceReport(
    [this, node_id](const Status &status, const rpc::RequestResourceReportReply &reply) {
      // TODO (Alex): This callback is always posted onto the main thread. Since most
@wuisawesome (author):

If asio schedules in FIFO order, this will result in a bounded delay for everything else on the main thread, but we should verify that asio schedules the way we expect.
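For reference, here is a quick standalone check (not part of the PR) of that assumption: handlers posted to a single-threaded boost::asio io_service should run in the order they were posted.

// Standalone sanity check, illustrative only: posted handlers on a
// single-threaded io_service execute in posting (FIFO) order.
#include <cassert>
#include <vector>

#include <boost/asio.hpp>

int main() {
  boost::asio::io_service io;
  std::vector<int> order;
  for (int i = 0; i < 5; ++i) {
    io.post([&order, i] { order.push_back(i); });
  }
  io.run();  // One thread drains the handler queue.
  for (int i = 0; i < 5; ++i) {
    assert(order[i] == i);
  }
  return 0;
}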

@wuisawesome (author):

Move this to be node based (instead of round based)

@wuisawesome wuisawesome marked this pull request as ready for review February 27, 2021 06:07
@clarkzinzow (Contributor) left a comment:

A few small things and nits, especially some repeated nits about avoiding default by-reference lambda captures (lambda best practice, item 31 of "Effective Modern C++", yada yada yada) and about using absl's thread-safety annotations, both of which I believe we'd like to standardize around, but please correct me if I'm wrong.
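For readers unfamiliar with those two nits, here is a small illustrative sketch (not code from this PR; the class and member names are hypothetical): capture exactly what a deferred callback needs instead of using a default by-reference capture, and annotate mutex-guarded members so clang's -Wthread-safety analysis catches unguarded access.

// Illustrative only; names are made up.
#include <functional>
#include <vector>

#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

class ReportTracker {
 public:
  void SchedulePoll(int node_id) {
    // A default capture like [&] would capture the local `node_id` by
    // reference, which dangles once this scope ends. Capture it by value
    // and capture `this` explicitly instead.
    deferred_.push_back([this, node_id] { RecordPoll(node_id); });
  }

  void RunDeferred() {
    for (auto &fn : deferred_) fn();
    deferred_.clear();
  }

  int num_polled() {
    absl::MutexLock lock(&mutex_);
    return num_polled_;
  }

 private:
  void RecordPoll(int /*node_id*/) {
    absl::MutexLock lock(&mutex_);
    ++num_polled_;  // Annotated below, so -Wthread-safety flags unguarded access.
  }

  std::vector<std::function<void()>> deferred_;
  absl::Mutex mutex_;
  int num_polled_ ABSL_GUARDED_BY(mutex_) = 0;
};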

Resolved review threads (outdated) on:
  • src/ray/common/ray_config_def.h
  • src/ray/gcs/gcs_server/gcs_resource_manager.cc
  • src/ray/gcs/gcs_server/gcs_resource_manager.h
  • src/ray/gcs/gcs_server/gcs_resource_report_poller.h (4 threads)
  • src/ray/gcs/gcs_server/gcs_resource_report_poller.cc
@ericl (Contributor) commented Mar 21, 2021

@wuisawesome is this ready for review? Please remove the label if so. Don't use the request-reviews button.

@wuisawesome wuisawesome removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 22, 2021
Resolved review thread (outdated): src/ray/gcs/gcs_server/gcs_resource_report_poller.cc
<< ")# of remaining nodes: " << nodes_.size();
}

void GcsResourceReportPoller::Tick() { TryPullResourceReport(); }
Reviewer:

This method seems unnecessary; you can just call TryPull directly.

Resolved review thread (outdated): src/ray/gcs/gcs_server/gcs_resource_report_poller.cc
to_pull_queue_.push_back(state);
}

polling_service_.post([this] { TryPullResourceReport(); });
Reviewer:

Why do we need this post? Isn't there a Tick() scheduled already?

@wuisawesome (author):

I think we explicitly want to pull more things without waiting for a tick. cc @stephanie-wang

Reviewer:

Why not just wait for the tick, especially if it's firing at high frequency (<10 ms)? We should either have a ticker or call it periodically; having both is unnecessary.

@wuisawesome (author):

I don't think it's going to make a big difference either way, but why not do it this way, since it's more responsive?

The only scenario where I can see this mattering is if RPCs fail very quickly for some reason.
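To make the trade-off in this thread concrete, here is a small sketch (hypothetical names such as TickDriver and try_pull; not the PR's code) of the two triggers being discussed: a periodic timer tick, plus an explicit post when a pull finishes so the next pull does not wait up to a full tick interval.

// Illustrative sketch of driving the pull loop from both a periodic tick
// and an explicit re-trigger.
#include <chrono>
#include <functional>
#include <utility>

#include <boost/asio.hpp>

class TickDriver {
 public:
  TickDriver(boost::asio::io_context &io, std::function<void()> try_pull)
      : io_(io), timer_(io), try_pull_(std::move(try_pull)) {}

  void ScheduleTick() {
    timer_.expires_after(std::chrono::milliseconds(100));
    timer_.async_wait([this](const boost::system::error_code &ec) {
      if (ec) return;   // Timer cancelled, e.g. on shutdown.
      try_pull_();      // Periodic trigger.
      ScheduleTick();   // Re-arm for the next interval.
    });
  }

  // Called when a pull completes, so the next pull starts immediately
  // instead of waiting for the next tick.
  void TriggerNow() { boost::asio::post(io_, try_pull_); }

 private:
  boost::asio::io_context &io_;
  boost::asio::steady_timer timer_;
  std::function<void()> try_pull_;
};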

// Now enough time has passed.
Tick(100);
RunPollingService();
ASSERT_TRUE(rpc_sent);
Reviewer:

Eventually we shouldn't retry failed polls, right? We can treat it the same as a heartbeat timeout. Maybe add a TODO in the relevant locations for now.

@wuisawesome (author):

Hmmm, see this discussion: #14336 (comment)

I think in the common case the RPC will fail because the node died, in which case the node-removed handler will be called, which stops the polling.
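A rough sketch of that behavior (assumed shape with made-up names, not the PR's actual code): when the node-removed handler fires, the node is dropped from the poller's bookkeeping, so a failed RPC for that node is simply never retried.

// Illustrative only; names are hypothetical.
#include <algorithm>
#include <deque>
#include <unordered_set>

class PollerState {
 public:
  void HandleNodeAdded(int node_id) {
    nodes_.insert(node_id);
    to_pull_.push_back(node_id);
  }

  // Called by the GCS node-removed handler: forget the node entirely.
  void HandleNodeRemoved(int node_id) {
    nodes_.erase(node_id);
    to_pull_.erase(std::remove(to_pull_.begin(), to_pull_.end(), node_id),
                   to_pull_.end());
  }

  // Called when a pull RPC fails; only re-queue nodes we still know about.
  void OnPullFailed(int node_id) {
    if (nodes_.count(node_id) > 0) {
      to_pull_.push_back(node_id);
    }
  }

 private:
  std::unordered_set<int> nodes_;
  std::deque<int> to_pull_;
};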

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 22, 2021
@wuisawesome wuisawesome added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Mar 25, 2021
@ericl (Contributor) commented Mar 25, 2021

@wuisawesome why did you mark this tests-ok? I looked at the past three commits and it seems test_placement_group is consistently failing:

//python/ray/tests:test_placement_group                                  FAILED in 3 out of 3 in 272.3s
  Stats over 3 runs: max = 272.3s, min = 234.8s, avg = 252.8s, dev = 15.3s
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_placement_group/test.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_placement_group/test_attempts/attempt_1.log
  /home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_placement_group/test_attempts/attempt_2.log

@ericl ericl added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Mar 25, 2021
@wuisawesome (author):

Oh interesting, I looked at the Buildkite tests and they passed.

@wuisawesome (author):

@ericl I think you're looking at the branch build, which was useful for building wheels but doesn't auto-merge latest master. The PR build is green.

@ericl ericl merged commit 1f4d4df into ray-project:master Mar 29, 2021
@ericl (Contributor) commented Mar 29, 2021

Merged. Please remember to remove the action-required label in the future.

Labels: @author-action-required
4 participants