
[autoscaler] Fix local node provider #16202

Merged

Conversation

DmitriGekhtman
Contributor

@DmitriGekhtman commented Jun 2, 2021

Why are these changes needed?

Closes #15342

This PR focuses on the most common use case for the local node provider: bootstrapping Ray on a given static set of IPs.
This functionality was broken in the transition to available-node-type configs. This PR fixes it by:

  • Preparing local node provider configs in the available-node-type format for ingestion by the autoscaler
  • Rsyncing the cluster state file so that the head node is able to recognize itself as a non-terminated node
  • Modifying the node updater to avoid overriding node resources when using the local provider

Mostly to simplify testing, this PR also adds an optional external_head_ip field so that you can run ray up from a machine (e.g. your laptop) outside of the Ray cluster's network (e.g. an AWS VPC).

Default behavior is updated to be more intuitive; for example, users no longer have to explicitly set min and max workers when specifying a list of IPs.
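
For context, the config preparation is conceptually along these lines. This is a simplified sketch rather than the actual code in python/ray/autoscaler/_private/local/config.py; the node type name "ray.node" mirrors the ray status output below, and the helper name is made up for illustration:

def prepare_local_config(config: dict) -> dict:
    # Sketch: rewrite a legacy local-provider config (head_ip + worker_ips)
    # into the available-node-type format the autoscaler expects.
    provider = config["provider"]
    num_workers = len(provider.get("worker_ips", []))

    # A single node type suffices for a manually managed static cluster.
    config["available_node_types"] = {
        "ray.node": {
            # Leave resources empty so each node reports what it actually
            # has, rather than the updater overriding them.
            "resources": {},
            "node_config": {},
            "min_workers": num_workers,
            "max_workers": num_workers,
        }
    }
    config["head_node_type"] = "ray.node"
    # With a static ip list, users no longer need to set this explicitly.
    config["max_workers"] = num_workers
    return config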

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Added unit tests to

  • Test bootstrapping of manually managed static clusters
  • Check file-mounts are set to sync cluster state
  • Check external head ip is processed correctly

Tested manually as follows
(recording for posterity if someone needs to debug this again):
ray up with the following AWS config:

cluster_name: local-node-provider-test
max_workers: 3
upscaling_speed: 1.0
docker: {}
idle_timeout_minutes: 5
provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b
    cache_stopped_nodes: True
auth:
    ssh_user: ubuntu
    ssh_private_key: xxxx.pem
available_node_types:
    ray.head.default:
        min_workers: 0
        max_workers: 0
        resources: {}
        node_config:
            KeyName: xxxx
            InstanceType: m5.2xlarge
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 100
    ray.worker.big:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            KeyName: xxxx
            InstanceType: m5.xlarge
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    ray.worker.default:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            KeyName: xxxx
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    ray.worker.gpu:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            KeyName: xxxx
            InstanceType: p2.xlarge
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
head_node_type: ray.head.default
file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands: []
setup_commands:
    - >-
        (stat $HOME/anaconda3/envs/tensorflow2_latest_p37/ &> /dev/null &&
        echo 'export PATH="$HOME/anaconda3/envs/tensorflow2_latest_p37/bin:$PATH"' >> ~/.bashrc) || true
    - which ray || pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
head_setup_commands:
    - pip install 'boto3>=1.4.8'  # 1.4.8 adds InstanceMarketOptions
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
head_node: {}
worker_nodes: {}

ray attach to the head node, start ipython, then paste and run the following script to get AWS node ids, instance types, resources, internal ips, and external ips:

import yaml
import subprocess
from ray.autoscaler._private.aws.node_provider import AWSNodeProvider

# Load the bootstrap config that ray up copied to the head node.
config = yaml.safe_load(open("ray_bootstrap_config.yaml").read())
provider = AWSNodeProvider(config["provider"], config["cluster_name"])

# All non-terminated nodes in the cluster (no tag filter).
nodes = provider.non_terminated_nodes({})

def get_type(node):
    # The user node type tag, e.g. "ray.worker.gpu".
    return provider.node_tags(node)["ray-user-node-type"]

def node_type(node):
    return config["available_node_types"][get_type(node)]

def node_config(node):
    return node_type(node)["node_config"]

def instance_type(node):
    return node_config(node)["InstanceType"]

def resources(node):
    return node_type(node)["resources"]

# Print node id, instance type, configured resources, internal ip, external ip.
for node in nodes:
    print(node, instance_type(node), resources(node),
          "i" + provider.internal_ip(node), "e" + provider.external_ip(node))

In the same session, run the following to stop Ray on all nodes:

def runner(node):
    # Args: log prefix, node id, auth config, cluster name, process runner,
    # use_internal_ip=True, docker config.
    return provider.get_command_runner("dmitri",
                                       node,
                                       config["auth"],
                                       config["cluster_name"],
                                       subprocess,
                                       True,
                                       {})

def run(node, cmd):
    return runner(node).run(cmd, with_output=True)

# Stop Ray locally on the head node, then on the remaining nodes over ssh.
print(subprocess.check_output("ray stop", shell=True))
for node in nodes[1:]:
    out = run(node, "ray stop")
    print(out)

Exit the ssh session.

ray up with the following "local node provider" config

cluster_name: minimal-manual

provider:
    type: local
    head_ip: <INTERNAL HEAD IP>
    external_head_ip: <EXTERNAL HEAD IP>
    worker_ips: [<INTERNAL WORKER IP 1>, <INTERNAL WORKER IP 2>, <INTERNAL WORKER IP 3>]

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    ssh_private_key: xxxx.pem

setup_commands:
    - if [ $(which ray) ]; then pip uninstall ray -y; fi
    - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
    - if [ -d ray ]; then rm -Rf ray; fi
    - git clone --single-branch --branch fix-local-node-provider https://github.com/DmitriGekhtman/ray
    - pushd ray/python/ray && python setup-dev.py -y && popd

ray attach to the head node using the local config.
ray status shows the head and all workers with the correct resource counts:

======== Autoscaler status: 2021-06-02 21:22:13.680789 ========
Node status
---------------------------------------------------------------
Healthy:
 4 ray.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/18.0 CPU
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:K80
 0.00/75.396 GiB memory
 0.00/33.604 GiB object_store_memory

Demands:
 (no resource demands)

ray down to confirm that ray stop is executed successfully on the head node.

@DmitriGekhtman
Contributor Author

did another check, just in case :)

# Set external head ip, if provided by user.
# Useful if calling `ray up` from outside the network.
external_head_ip = provider_config.get("external_head_ip")
if external_head_ip:
Contributor

This is still puzzling to me.

Contributor Author

Glad to remove it, but I will no longer be able to test.

Contributor Author

This works exactly like external and internal ips for AWS.

Contributor Author

(In fact that's what it is.)

Contributor Author

I'd refer to the AWS implementation for more context.

Contributor Author

Added a more detailed docstring and removed it from example-minimal.yaml, as this is primarily a debugging tool.

Contributor Author

@DmitriGekhtman Jun 4, 2021

Elaborating a little on our setup -- this isn't something I was aware of previously:

The autoscaler tells its node updaters to use the internal node ip (use_internal_ip=True). The local cluster launcher does not pass the use_internal_ip arg when it constructs its NodeUpdaterThread, so it uses an external ip. Whether an external or internal ip is used affects the ssh command the updater runs.

To be able to ray up from my laptop, I had to use an external ip.
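
A minimal sketch of the ip-selection behavior described above (simplified, with an assumed helper name; not the actual updater/command-runner code):

def ssh_target_ip(provider, node_id, use_internal_ip):
    # Pick the ip the updater's ssh command connects to.
    if use_internal_ip:
        # Autoscaler running on the head node: peers are reachable by
        # their internal ips.
        return provider.internal_ip(node_id)
    # ray up from a laptop outside the cluster network needs an ip that
    # is reachable externally, hence the new external_head_ip option.
    return provider.external_ip(node_id)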

Contributor Author

Thought about it -- using public ips everywhere in this context wouldn't work, as Ray internals infer internal ips but the autoscaler would try to use the public ips, leading to key errors in ResourceDemandScheduler...

# If restarting Ray on a manually-managed on-prem cluster,
# we need to sync local and head representations of cluster state.
# If we're not restarting the cluster (empty ray start cmds), don't
# sync to avoid breaking on-prem cluster autoscaler state.
Contributor

Why is this related to ray start commands though?

Contributor Author

ray start commands are non-empty if and only if the ray cluster is being restarted.
I'll make this clearer with variable naming.

Contributor Author

It now reads more like English:
if restarting_ray and is_local_manual_node_provider

Contributor Author

@DmitriGekhtman Jun 4, 2021

The ray up machine thinks that worker nodes are terminated because it never creates worker nodes. It knows that the head node is non-terminated. The only time it is acceptable and necessary to sync that state to the head node is when Ray is being started or restarted. If you update the cluster without restarting Ray and you sync the state, the autoscaler will think that the workers are terminated and will restart Ray on the workers.

Splitting up the control plane in the way we do can lead to complications.
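
A rough sketch of that gating logic, using the variable names from the comment above (assumed, not the exact code in python/ray/autoscaler/_private/commands.py):

def should_sync_cluster_state(config: dict) -> bool:
    # Ray is being (re)started iff the ray start commands are non-empty.
    restarting_ray = bool(config.get("head_start_ray_commands"))
    provider = config.get("provider", {})
    # Manually managed on-prem cluster: local provider with a static ip list.
    is_local_manual_node_provider = (
        provider.get("type") == "local" and "worker_ips" in provider
    )
    # Only sync the cluster state file when restarting, so a no-restart
    # update cannot clobber the head node's view of which workers are alive.
    return restarting_ray and is_local_manual_node_provider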

@DmitriGekhtman
Contributor Author

Reviewers can merge at their discretion.

DmitriGekhtman and others added 2 commits June 4, 2021 16:46
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
@AmeerHajAli merged commit 7d1e7a0 into ray-project:master Jun 5, 2021
@DmitriGekhtman deleted the fix-local-node-provider branch June 5, 2021 18:28
mwtian pushed a commit that referenced this pull request Jun 5, 2021
* Don't override resources for local node provider.

* Wip

* Local node provider prep logic

* ../python/ray/autoscaler/local/defaults.yaml

* wip

* Fix example-full

* defaults comment

* wip

* head type max workers

* sync-state

* No docker

* Fix

* external head ip option

* wip

* move external_ip out of tags

* Update examples

* Update comment

* Skip local defaults

* Config test

* Test external ip

* Change ray start commands to what they were before

* missing yamls

* Fix test

* Remove scary Docker

* Fixes

* Extra test

* address comments

* fixes pre-single-node-type-attempy

* rewrite comment a bit

* One type

* fix

* get rid of pdb

* no placeholders

* fix

* worker nodes and head node optional during launch

* fix

* fix again

* config comment fixes

* mock -> aws, not local

* Update python/ray/autoscaler/_private/local/config.py

Co-authored-by: Ian Rodney <ian.rodney@gmail.com>

* second pop fixed

* Explanatory comments for config logic

* deprecation comments

* Update python/ray/autoscaler/_private/local/config.py

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* update test

* fix

* More descriptive name for local provider check

* Remove external-ip from example minimal and add a more detailed doc string.

* Make clearer the equivalence between a ray restart and non-empty ray-start commands

* extra comment

* Update python/ray/autoscaler/_private/local/node_provider.py

* Update python/ray/autoscaler/_private/commands.py

* Update python/ray/autoscaler/_private/commands.py

* Update python/ray/autoscaler/_private/util.py

* lint

* Update python/ray/autoscaler/_private/local/node_provider.py

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
Successfully merging this pull request may close these issues.

[autoscaler] Local node provider doesn't work