[autoscaler] Fewer non terminated nodes calls #20359
Conversation
Please review; it would be good to have this ready soon.
nice! this logic feels a lot simpler.
@wuisawesome thanks for reviewing. Now we just need to make sure CI builds the wheels.
self.load_metrics.prune_active_ips(
    [self.provider.internal_ip(node_id) for node_id in self.all_nodes])
why self.all_nodes instead of self.all_workers?
because the head node is not a worker node
Oh, I see.
Misinterpreted the question.
Previously, the load metrics method secretly hacked in the head node IP, leading to weird bugs when writing new node providers.
I've removed the head IP instance variable from load metrics; the head's IP is now passed in here, along with the worker IPs.
But won't it always be active?
Yes. Setting the head IP active in load metrics saves us the redundancy of passing it in each time, but it makes the code brittle. There have been bugs from setting the head IP differently when initializing load metrics than when reading from the node provider.
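A minimal sketch of the idea (illustrative names and internals, not the actual LoadMetrics/StandardAutoscaler code): the head node's IP is no longer baked into LoadMetrics at construction time; it arrives through the same prune_active_ips call as every worker IP, because all_nodes includes the head node.

```python
# Illustrative sketch only; the real LoadMetrics has more state and methods.
class LoadMetrics:
    def __init__(self):
        # No head_ip argument: nothing is "secretly" marked active here.
        self.last_heartbeat_time_by_ip = {}

    def prune_active_ips(self, active_ips):
        # Drop state for any IP the node provider no longer reports.
        active = set(active_ips)
        self.last_heartbeat_time_by_ip = {
            ip: t
            for ip, t in self.last_heartbeat_time_by_ip.items()
            if ip in active
        }

# In the autoscaler update, all_nodes (head + workers) supplies every active IP:
# load_metrics.prune_active_ips(
#     [provider.internal_ip(node_id) for node_id in all_nodes])
```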
Requires e2e testing and deliberation before merge.
LGTM. Let's run the necessary testing and merge.
We need to come to a consensus on what the necessary testing is.
...Should at least pass CI :)
e2d47e4 to 2c661da
manual testing in progress
Passes.
I'd say that's good enough (best we can do). (Need to re-run CI for the mac build, though.)
This is great @DmitriGekhtman, thanks for pushing this very close to the finish line!
oh, actually the mac build looks like it's running, will let that do its thing
Can you shed more light on the sanity checks you have in mind? Can we do something more intensive (e.g. run nightly autoscaler tests against AWS and GCP)? Ideally we would do this for Azure too.
Following up offline.
While there aren't currently nightly autoscaler tests, what we can do is schedule some Tune runs to trigger brief upscaling to, say, 20 GPU workers and 100 CPU workers. That can be repeated for AWS and GCP.
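A rough sketch of the kind of Tune workload meant here (the trainable, sleep duration, and sample count are illustrative placeholders, not the actual test script): launching many short trials that each request a CPU or GPU forces the autoscaler to scale up and then idle back down.

```python
# Illustrative load generator for exercising autoscaler upscaling; not the
# actual nightly test. Adjust num_samples / resources_per_trial to target the
# desired worker counts (e.g. ~100 CPU workers or ~20 GPU workers).
import time

import ray
from ray import tune


def busy_trial(config):
    # Hold the requested resources long enough for new nodes to launch.
    time.sleep(300)
    tune.report(done=1)


if __name__ == "__main__":
    ray.init(address="auto")  # connect to the running cluster
    tune.run(
        busy_trial,
        num_samples=100,
        resources_per_trial={"cpu": 1},  # use {"gpu": 1} for the GPU variant
    )
```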
I ran an experiment along the lines suggested in the last comment, as well as the nightly decision tree example with various tree depths.
When I ran a commit after this one, I saw the following:

(base) ray@ip-10-0-0-71:~/sang-nightly-large-disk-test% ray status
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1989, in main
return cli()
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1534, in status
print(debug_status(status, error))
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 110, in debug_status
lm_summary = LoadMetricsSummary(**as_dict["load_metrics_report"])
TypeError: __init__() got an unexpected keyword argument 'head_ip'

Do you think this could be related to this PR?
Yes, will submit a fix momentarily. Not sure how I missed that.
I see, it's an old autoscaler <-> new Ray compatibility issue -- fix incoming.
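A hedged sketch of the kind of defensive handling that would resolve this (the helper name is hypothetical, and the actual follow-up fix may differ): when a newer `ray status` parses a report produced by an older autoscaler, ignore report keys the local LoadMetricsSummary no longer declares, such as the legacy head_ip field from the traceback above.

```python
# Sketch only, assuming LoadMetricsSummary is a dataclass; the real fix may
# differ. The idea: tolerate extra fields in the serialized report instead of
# letting the dataclass constructor raise a TypeError.
import dataclasses


def build_summary(summary_cls, report: dict):
    # Keep only the keyword arguments the dataclass actually declares.
    known = {f.name for f in dataclasses.fields(summary_cls)}
    return summary_cls(**{k: v for k, v in report.items() if k in known})


# Hypothetical usage at the call site shown in the traceback:
# lm_summary = build_summary(LoadMetricsSummary, as_dict["load_metrics_report"])
```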
Why are these changes needed?
non_terminated_nodes calls are expensive for some node provider implementations.
This PR refactors autoscaler._update() such that it results in at most one non_terminated_nodes call. Conceptually, the change is that the autoscaler only needs a consistent view of the world once per update interval.
The structure of an autoscaler update is now roughly as sketched below.
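A minimal sketch of the new shape of the update (illustrative names and stubs, not the real StandardAutoscaler internals): the node list is fetched exactly once at the top of _update, and every later step works off that snapshot.

```python
# Illustrative sketch of the update structure; method and attribute names are
# placeholders rather than the actual StandardAutoscaler implementation.
class AutoscalerSketch:
    def __init__(self, provider, load_metrics):
        self.provider = provider
        self.load_metrics = load_metrics

    def _update(self):
        # Exactly one non_terminated_nodes call per update: a single,
        # consistent snapshot of the cluster for this interval.
        self.all_nodes = self.provider.non_terminated_nodes(tag_filters={})

        # Everything below reuses the snapshot instead of re-querying the
        # provider: prune stale IPs, terminate idle/outdated nodes, launch
        # new nodes, and start NodeUpdater threads for recently created ones.
        self.load_metrics.prune_active_ips(
            [self.provider.internal_ip(node_id) for node_id in self.all_nodes])
        # ... terminate / launch / update steps all operate on self.all_nodes ...
```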
There's a small operational difference introduced:
Previously -- After a node is created, its NodeUpdater thread is initiated immediately.
Now -- After a node is created, its NodeUpdater thread is initiated in the next autoscaler update.
This typically will not introduce latency, since the time to get SSH access (a few minutes) is much longer than the autoscaler update interval (5 seconds by default).
Along the way, I've removed the local_ip initialization parameter for LoadMetrics because it was confusing and not useful (and caused some tests to fail).
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.
Updated existing autoscaler tests to verify that non_terminated_nodes is called only once per autoscaler._update.
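A sketch of how that verification can look in a test (mock-based and illustrative, not the exact updated test code): wrap the provider's non_terminated_nodes method and assert on the call count after a single update.

```python
# Illustrative test pattern: count provider calls across one autoscaler update.
from unittest.mock import patch


def assert_single_non_terminated_nodes_call(autoscaler):
    provider = autoscaler.provider
    with patch.object(
            provider,
            "non_terminated_nodes",
            wraps=provider.non_terminated_nodes) as mock_call:
        autoscaler.update()  # runs one _update internally
        assert mock_call.call_count == 1
```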