[autoscaler] Fix local node provider #16202
Conversation
did another check, just in case :)
# Set external head ip, if provided by user.
# Useful if calling `ray up` from outside the network.
external_head_ip = provider_config.get("external_head_ip")
if external_head_ip:
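The snippet above reads an optional `external_head_ip` override from the provider config. A minimal sketch of the idea, with hypothetical helper and field names (the real logic lives in Ray's local node provider and may differ):

```python
def resolve_head_ip(provider_config):
    """Return the ip the launcher should ssh to.

    Hypothetical helper: prefer an explicit ``external_head_ip``
    override, else fall back to the head node's internal ip.
    """
    external_head_ip = provider_config.get("external_head_ip")
    if external_head_ip:
        return external_head_ip
    # No override: use the statically configured internal head ip.
    return provider_config["head_ip"]

config = {"head_ip": "10.0.0.1", "external_head_ip": "54.1.2.3"}
print(resolve_head_ip(config))  # -> 54.1.2.3
```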
This is still puzzling for me
Glad to remove it, but I will no longer be able to test.
This works exactly like external and internal ips for aws.
(In fact that's what it is.)
I'd refer to the AWS implementation for more context.
Added a more detailed docstring and removed from example-minimal.yaml, as this is primarily a debugging tool.
Elaborating a little on our setup -- this isn't something I was aware of previously:
The autoscaler tells its node updaters to use the internal node ip:
use_internal_ip=True,
The local cluster launcher does not get the `use_internal_ip` arg and so uses an external ip:
updater = NodeUpdaterThread(
Whether an internal or external ip is used affects the ssh command here:
def _get_node_ip(self):
To be able to `ray up` from my laptop, I had to use an external ip.
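The ip-selection behavior described above can be modeled with a toy class (hypothetical simplification; the real `NodeUpdaterThread` in Ray's autoscaler queries the node provider for ips):

```python
class NodeUpdaterSketch:
    """Toy model of how `use_internal_ip` selects the ssh target."""

    def __init__(self, internal_ip, external_ip, use_internal_ip):
        self.internal_ip = internal_ip
        self.external_ip = external_ip
        self.use_internal_ip = use_internal_ip

    def _get_node_ip(self):
        # The autoscaler passes use_internal_ip=True to its updaters;
        # `ray up` from outside the network needs the external ip.
        if self.use_internal_ip:
            return self.internal_ip
        return self.external_ip
```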
Thought about it -- using public ips everywhere in this context wouldn't work, as ray internals infer internal ips but the autoscaler would try to use the public ips, leading to key errors in ResourceDemandScheduler...
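A toy illustration of that mismatch (hypothetical data): node state inside Ray is keyed by internal ip, so a lookup by public ip raises a KeyError.

```python
# Node resources are keyed by internal ip (hypothetical values).
node_resources = {"10.0.0.2": {"CPU": 4}}
public_ip = "54.1.2.9"  # the same node's public ip

try:
    node_resources[public_ip]
except KeyError:
    # This is the failure mode described above: the scheduler's keys
    # are internal ips, so public-ip lookups miss.
    print("lookup by public ip fails; internal ips are the canonical keys")
```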
# If restarting Ray on a manually-managed on-prem cluster,
# we need to sync local and head representations of cluster state.
# If we're not restarting the cluster (empty ray start cmds), don't
# sync to avoid breaking on-prem cluster autoscaler state.
Why is this related to ray start commands though?
ray start commands are non-empty if and only if the ray cluster is being restarted.
I'll make this clearer with variable naming.
It now reads more like English:
if restarting_ray and is_local_manual_node_provider
The ray up machine thinks that worker nodes are terminated because it never creates worker nodes. It knows that the head node is non-terminated. The only time it is acceptable and necessary to sync that state to the head node is when ray is being started or restarted. If you update the cluster without restarting ray and you sync the state, the autoscaler will think that the workers are terminated and will restart ray on the workers.
Splitting up the control plane in the way we do can lead to complications.
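The restart-gated sync condition discussed above can be sketched as a small predicate (hypothetical names; the real check lives in the autoscaler's commands module):

```python
def should_sync_cluster_state(ray_start_commands, is_local_manual_node_provider):
    """Sync local and head cluster-state only when Ray is (re)started.

    Ray start commands are non-empty iff the cluster is being
    restarted, so syncing outside a restart would clobber the head
    node's view of live workers.
    """
    restarting_ray = bool(ray_start_commands)
    return restarting_ray and is_local_manual_node_provider
```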
Reviewers can merge at their discretion.
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
* Don't override resources for local node provider.
* Wip
* Local node provider prep logic
* ../python/ray/autoscaler/local/defaults.yaml
* wip
* Fix example-full
* defaults comment
* wip
* head type max workers
* sync-state
* No docker
* Fix
* external head ip option
* wip
* move external_ip out of tags
* Update examples
* Update comment
* Skip local defaults
* Config test
* Test external ip
* Change ray start commands to what they were before
* missing yamls
* Fix test
* Remove scary Docker
* Fixes
* Extra test
* address comments
* fixes pre-single-node-type-attempy
* rewrite comment a bit
* One type
* fix
* get rid of pdb
* no placeholders
* fix
* worker nodes and head node optional during launch
* fix
* fix again
* config comment fixes
* mock -> aws, not local
* Update python/ray/autoscaler/_private/local/config.py (Co-authored-by: Ian Rodney <ian.rodney@gmail.com>)
* second pop fixed
* Explanatory comments for config logic
* deprecation comments
* Update python/ray/autoscaler/_private/local/config.py (Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>)
* update test
* fix
* More descriptive name for local provider check
* Remove external-ip from example minimal and add a more detailed doc string.
* Make clearer the equivalence between a ray restart and non-empty ray-start commands
* extra comment
* Update python/ray/autoscaler/_private/local/node_provider.py
* Update python/ray/autoscaler/_private/commands.py
* Update python/ray/autoscaler/_private/commands.py
* Update python/ray/autoscaler/_private/util.py
* lint
* Update python/ray/autoscaler/_private/local/node_provider.py (Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>)

Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
Why are these changes needed?
Closes #15342
This PR focuses on the most common use case for the local node provider: bootstrapping Ray at a given static set of ips.
This functionality was broken in the transition to available-node-type configs; this PR fixes it.
Mostly to simplify testing, this PR also adds an optional `external_head_ip` so that you can `ray up` from a machine (e.g. your laptop) outside of the Ray cluster's network (e.g. an AWS VPC).
Default behavior is updated to be more intuitive; for example, users don't have to explicitly set min and max workers when specifying a list of ips.
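For illustration, here is a sketch of a minimal local-provider cluster config, written as a Python dict mirroring the YAML cluster config (field names and ips are assumptions and may differ from the shipped example files):

```python
# Hypothetical minimal local ("on-prem") provider config.
local_cluster_config = {
    "cluster_name": "on-prem",
    "provider": {
        "type": "local",
        "head_ip": "10.0.0.1",                     # internal head ip
        "worker_ips": ["10.0.0.2", "10.0.0.3"],    # static worker ips
        # Optional override added by this PR: lets `ray up` reach the
        # head node from outside the cluster's network (e.g. a laptop
        # outside an AWS VPC). Primarily a debugging tool.
        "external_head_ip": "54.1.2.3",
    },
}
```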
Checks
I've run `scripts/format.sh` to lint the changes in this PR.
Added unit tests.
Tested manually as follows (recording for posterity if someone needs to debug this again):
1. `ray up` with the following AWS config.
2. `ray attach` to the head node, start ipython, then paste and run the following script to get aws node ids, instance types, resources, internal ips, and external ips.
3. In the same session, run the following to stop ray on all nodes.
4. Exit the ssh session.
5. `ray up` with the following "local node provider" config.
6. `ray attach` to the head node using the local config.
7. `ray status` shows the head and all workers with the correct resource counts.
8. `ray down` to confirm that `ray stop` is executed successfully on the head node.