
[autoscaler] Fix local node provider #16202

Merged

Conversation

DmitriGekhtman
Contributor

@DmitriGekhtman commented Jun 2, 2021

Why are these changes needed?

Closes #15342

This PR focuses on the most common use case for the local node provider: bootstrapping Ray on a given static set of IPs.
This functionality was broken in the transition to available-node-type configs. This PR fixes it by:

  • Preparing local node provider configs in the available-node-type format for ingestion by the autoscaler
  • Rsyncing the cluster state file so that the head node is able to recognize itself as a non-terminated node
  • Modifying the node updater to avoid overriding node resources when using the local provider

Mostly to simplify testing, this PR also adds an optional external_head_ip field so that you can run ray up from a machine (e.g. your laptop) outside of the Ray cluster's network (e.g. an AWS VPC).

Default behavior is updated to be more intuitive; for example, users no longer have to explicitly set min and max workers when specifying a list of IPs.
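
For context, the config preparation is conceptually along these lines. This is a simplified sketch rather than the actual code in python/ray/autoscaler/_private/local/config.py; the node type name "ray.node" mirrors the ray status output below, and the helper name is made up for illustration:

def prepare_local_config(config: dict) -> dict:
    # Sketch: rewrite a legacy local-provider config (head_ip + worker_ips)
    # into the available-node-type format the autoscaler expects.
    provider = config["provider"]
    num_workers = len(provider.get("worker_ips", []))

    # A single node type suffices for a manually managed static cluster.
    config["available_node_types"] = {
        "ray.node": {
            # Leave resources empty so each node reports what it actually
            # has, rather than the updater overriding them.
            "resources": {},
            "node_config": {},
            "min_workers": num_workers,
            "max_workers": num_workers,
        }
    }
    config["head_node_type"] = "ray.node"
    # With a static ip list, users no longer need to set this explicitly.
    config["max_workers"] = num_workers
    return config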

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Added unit tests to

  • Test bootstrapping of manually managed static clusters
  • Check file-mounts are set to sync cluster state
  • Check external head ip is processed correctly

Tested manually as follows
(recording for posterity if someone needs to debug this again):
ray up with the following AWS config:

cluster_name: local-node-provider-test
max_workers: 3
upscaling_speed: 1.0
docker: {}
idle_timeout_minutes: 5
provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b
    cache_stopped_nodes: True
auth:
    ssh_user: ubuntu
    ssh_private_key: xxxx.pem
available_node_types:
    ray.head.default:
        min_workers: 0
        max_workers: 0
        resources: {}
        node_config:
            KeyName: xxxx
            InstanceType: m5.2xlarge
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 100
    ray.worker.big:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            KeyName: xxxx
            InstanceType: m5.xlarge
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    ray.worker.default:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            KeyName: xxxx
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    ray.worker.gpu:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            KeyName: xxxx
            InstanceType: p2.xlarge
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
head_node_type: ray.head.default
file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands: []
setup_commands:
    - >-
        (stat $HOME/anaconda3/envs/tensorflow2_latest_p37/ &> /dev/null &&
        echo 'export PATH="$HOME/anaconda3/envs/tensorflow2_latest_p37/bin:$PATH"' >> ~/.bashrc) || true
    - which ray || pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
head_setup_commands:
    - pip install 'boto3>=1.4.8'  # 1.4.8 adds InstanceMarketOptions
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
head_node: {}
worker_nodes: {}

ray attach to the head node, start ipython, then paste and run the following script to get AWS node ids, instance types, resources, internal ips, and external ips:

import yaml
import subprocess
from ray.autoscaler._private.aws.node_provider import AWSNodeProvider

# Load the bootstrap config that ray up copied to the head node.
config = yaml.safe_load(open("ray_bootstrap_config.yaml").read())
provider = AWSNodeProvider(config["provider"], config["cluster_name"])

# All non-terminated nodes in the cluster (no tag filter).
nodes = provider.non_terminated_nodes({})

def get_type(node):
    # The user node type tag, e.g. "ray.worker.gpu".
    return provider.node_tags(node)["ray-user-node-type"]

def node_type(node):
    return config["available_node_types"][get_type(node)]

def node_config(node):
    return node_type(node)["node_config"]

def instance_type(node):
    return node_config(node)["InstanceType"]

def resources(node):
    return node_type(node)["resources"]

# Print node id, instance type, configured resources, internal ip, external ip.
for node in nodes:
    print(node, instance_type(node), resources(node),
          "i" + provider.internal_ip(node), "e" + provider.external_ip(node))

In the same session, run the following to stop Ray on all nodes:

def runner(node):
    # Args: log prefix, node id, auth config, cluster name, process runner,
    # use_internal_ip=True, docker config.
    return provider.get_command_runner("dmitri",
                                       node,
                                       config["auth"],
                                       config["cluster_name"],
                                       subprocess,
                                       True,
                                       {})

def run(node, cmd):
    return runner(node).run(cmd, with_output=True)

# Stop Ray locally on the head node, then on the remaining nodes over ssh.
print(subprocess.check_output("ray stop", shell=True))
for node in nodes[1:]:
    out = run(node, "ray stop")
    print(out)

Exit the ssh session.

ray up with the following "local node provider" config

cluster_name: minimal-manual

provider:
    type: local
    head_ip: <INTERNAL HEAD IP>
    external_head_ip: <EXTERNAL HEAD IP>
    worker_ips: [<INTERNAL WORKER IP 1>, <INTERNAL WORKER IP 2>, <INTERNAL WORKER IP 3>]

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    ssh_private_key: xxxx.pem

setup_commands:
    - if [ $(which ray) ]; then pip uninstall ray -y; fi
    - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
    - if [ -d ray ]; then rm -Rf ray; fi
    - git clone --single-branch --branch fix-local-node-provider https://github.com/DmitriGekhtman/ray
    - pushd ray/python/ray && python setup-dev.py -y && popd

ray attach to the head node using the local config.
ray status shows the head and all workers with the correct resource counts:

======== Autoscaler status: 2021-06-02 21:22:13.680789 ========
Node status
---------------------------------------------------------------
Healthy:
 4 ray.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/18.0 CPU
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:K80
 0.00/75.396 GiB memory
 0.00/33.604 GiB object_store_memory

Demands:
 (no resource demands)

ray down to confirm that ray stop is executed successfully on the head node.

@DmitriGekhtman
Contributor Author

did another check, just in case :)

# Set external head ip, if provided by user.
# Useful if calling `ray up` from outside the network.
external_head_ip = provider_config.get("external_head_ip")
if external_head_ip:
Contributor

This is still puzzling to me.

Contributor Author

Glad to remove it, but I will no longer be able to test.

Contributor Author

This works exactly like external and internal ips for AWS.

Contributor Author

(In fact that's what it is.)

Contributor Author

I'd refer to the AWS implementation for more context.

Contributor Author

Added a more detailed docstring and removed it from example-minimal.yaml, as this is primarily a debugging tool.

Contributor Author

@DmitriGekhtman Jun 4, 2021

Elaborating a little on our setup -- this isn't something I was aware of previously:

The autoscaler tells its node updaters to use the internal node ip (use_internal_ip=True). The local cluster launcher does not pass the use_internal_ip arg when it constructs its NodeUpdaterThread, so it uses an external ip. Whether an external or internal ip is used affects the ssh command the updater runs.

To be able to ray up from my laptop, I had to use an external ip.
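
A minimal sketch of the ip-selection behavior described above (simplified, with an assumed helper name; not the actual updater/command-runner code):

def ssh_target_ip(provider, node_id, use_internal_ip):
    # Pick the ip the updater's ssh command connects to.
    if use_internal_ip:
        # Autoscaler running on the head node: peers are reachable by
        # their internal ips.
        return provider.internal_ip(node_id)
    # ray up from a laptop outside the cluster network needs an ip that
    # is reachable externally, hence the new external_head_ip option.
    return provider.external_ip(node_id)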

Contributor Author

Thought about it -- using public ips everywhere in this context wouldn't work, as Ray internals infer internal ips but the autoscaler would try to use the public ips, leading to key errors in ResourceDemandScheduler...

# If restarting Ray on a manually-managed on-prem cluster,
# we need to sync local and head representations of cluster state.
# If we're not restarting the cluster (empty ray start cmds), don't
# sync to avoid breaking on-prem cluster autoscaler state.
Contributor

Why is this related to ray start commands though?

Contributor Author

ray start commands are non-empty if and only if the ray cluster is being restarted.
I'll make this clearer with variable naming.

Contributor Author

It now reads more like English:
if restarting_ray and is_local_manual_node_provider

Contributor Author

@DmitriGekhtman Jun 4, 2021

The ray up machine thinks that worker nodes are terminated because it never creates worker nodes. It knows that the head node is non-terminated. The only time it is acceptable and necessary to sync that state to the head node is when Ray is being started or restarted. If you update the cluster without restarting Ray and you sync the state, the autoscaler will think that the workers are terminated and will restart Ray on the workers.

Splitting up the control plane in the way we do can lead to complications.
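
A rough sketch of that gating logic, using the variable names from the comment above (assumed, not the exact code in python/ray/autoscaler/_private/commands.py):

def should_sync_cluster_state(config: dict) -> bool:
    # Ray is being (re)started iff the ray start commands are non-empty.
    restarting_ray = bool(config.get("head_start_ray_commands"))
    provider = config.get("provider", {})
    # Manually managed on-prem cluster: local provider with a static ip list.
    is_local_manual_node_provider = (
        provider.get("type") == "local" and "worker_ips" in provider
    )
    # Only sync the cluster state file when restarting, so a no-restart
    # update cannot clobber the head node's view of which workers are alive.
    return restarting_ray and is_local_manual_node_provider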

@DmitriGekhtman
Contributor Author

Reviewers can merge at their discretion.

DmitriGekhtman and others added 2 commits June 4, 2021 16:46
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
@AmeerHajAli merged commit 7d1e7a0 into ray-project:master Jun 5, 2021
@DmitriGekhtman deleted the fix-local-node-provider branch June 5, 2021 18:28
mwtian pushed a commit that referenced this pull request Jun 5, 2021
* Don't override resources for local node provider.

* Wip

* Local node provider prep logic

* ../python/ray/autoscaler/local/defaults.yaml

* wip

* Fix example-full

* defaults comment

* wip

* head type max workers

* sync-state

* No docker

* Fix

* external head ip option

* wip

* move external_ip out of tags

* Update examples

* Update comment

* Skip local defaults

* Config test

* Test external ip

* Change ray start commands to what they were before

* missing yamls

* Fix test

* Remove scary Docker

* Fixes

* Extra test

* address comments

* fixes pre-single-node-type-attempy

* rewrite comment a bit

* One type

* fix

* get rid of pdb

* no placeholders

* fix

* worker nodes and head node optional during launch

* fix

* fix again

* config comment fixes

* mock -> aws, not local

* Update python/ray/autoscaler/_private/local/config.py

Co-authored-by: Ian Rodney <ian.rodney@gmail.com>

* second pop fixed

* Explanatory comments for config logic

* deprecation comments

* Update python/ray/autoscaler/_private/local/config.py

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* update test

* fix

* More descriptive name for local provider check

* Remove external-ip from example minimal and add a more detailed doc string.

* Make clearer the equivalence between a ray restart and non-empty ray-start commands

* extra comment

* Update python/ray/autoscaler/_private/local/node_provider.py

* Update python/ray/autoscaler/_private/commands.py

* Update python/ray/autoscaler/_private/commands.py

* Update python/ray/autoscaler/_private/util.py

* lint

* Update python/ray/autoscaler/_private/local/node_provider.py

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
Successfully merging this pull request may close these issues.

[autoscaler] Local node provider doesn't work