New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gcs pull resource reports #14336
Gcs pull resource reports #14336
Conversation
Co-authored-by: Eric Liang <ekhliang@gmail.com>
auto raylet_client = raylet_client_pool_->GetOrConnectByAddress(address); | ||
raylet_client->RequestResourceReport( | ||
[this, node_id](const Status &status, const rpc::RequestResourceReportReply &reply) { | ||
// TODO (Alex): This callback is always posted onto the main thread. Since most |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If asio schedules in FIFO order this will result in a bounded delay for everything else on the main thread, but we sure verify asio schedules the way we expect to.
Move this to be node based (instead of round based) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small things and nits, especially some repeated nits about avoiding default by-reference lambda captures (lambda best practice, item 31 of "Effective Modern C++", yada yada yada) and about using absl's thread-safety annotations, both of which I believe we'd like to standardize around, but please correct me if I'm wrong.
@wuisawesome is this ready for review? Please remove the label if so. Don't use the request reviews button |
<< ")# of remaining nodes: " << nodes_.size(); | ||
} | ||
|
||
void GcsResourceReportPoller::Tick() { TryPullResourceReport(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method seems unnecessary; you can just call TryPull directly.
to_pull_queue_.push_back(state); | ||
} | ||
|
||
polling_service_.post([this] { TryPullResourceReport(); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this post, isn't there a Tick() scheduled already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we explicitly want to pull more things without waiting for a tick. cc @stephanie-wang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just wait for the tick? Especially if it's firing at high frequency (<10ms) We should either have a ticker or call it periodically, having both is unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's going to make a big difference either way, but why not do it this way since it's more responsive?
The only scenario I can see this mattering in is if rpc's fail very quickly for some reason.
// Now enough time has passed. | ||
Tick(100); | ||
RunPollingService(); | ||
ASSERT_TRUE(rpc_sent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually we shouldn't retry failed polls right? We can treat it the same as heartbeat timeout. Maybe add a TODO in the relevant locations for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm see this discussion: #14336 (comment)
I think in the common case, the rpc will fail because the node died, in which case the node removed handler will get called which will stop the polling.
@wuisawesome why did you mark this tests-ok? I looked at the past three commits and it seems test_placement_groups is consistently failing:
|
oh interesting, i looked at the buildkite tests and they passed. |
@ericl I think you're looking at the branch build which was useful for building wheels, but doesn't auto-merge latest master. The PR build is green |
Merged. Please remember to remove the action required label in the future |
Why are these changes needed?
This PR should greatly stabilize the scheduling resource report system. In the past, raylets have pushed heartbeats to GCS. If there were many raylets, the overhead of all these reports was quite bad because a large number of services are all run on the main thread. Thus, raylet resource reports had the potential to overwhelm GCS and slow down all other operations.
Testing:
To test this, we have a feature flag
pull_based_resource_reporting
which causes gcs to pull resource reports. This should: 1. Move the resource polling to a new thread, decreasing the stress on the main thread.2. Naturally rate limit gcs polling, since if gcs is overwhelmed, it will naturally slow down the number of pull requests it sends out to avoid overwhelming itself.
Here is gcs latency before and after:
note: we simulated high resource report load by increasing the report frequency to 10ms.
push based (old)
pull based (new)
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.