[Doc] Add vSphere Ray cluster launcher user guide (#39630)
Similar to other providers, this change adds a user guide for the vSphere Ray cluster launcher,
including how to prepare the vSphere environment and the frozen VM, as well as the general
steps to launch a cluster. It also contains a section on how to use vSAN File Service to
provision NFS endpoints as persistent storage for Ray AIR, with a new example YAML file.

In addition, existing examples and docs are updated to include the correct command
to install the vSphere Python SDK.

Signed-off-by: Fangchi Wang wfangchi@vmware.com

Why are these changes needed?
As mentioned in PR #39379, we need a dedicated user guide for launching Ray clusters on vSphere. This change adds one in a new vsphere.md, including a solution, based on VMware vSAN File Service, for Ray 2.7's deprecation of syncing to the head node in Ray AIR.

---------

Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
wfangchi and architkulkarni committed Sep 15, 2023
1 parent 586f1b5 commit ba730ce
Showing 6 changed files with 256 additions and 13 deletions.
2 changes: 1 addition & 1 deletion doc/source/cluster/vms/getting-started.rst
@@ -69,7 +69,7 @@ Before we start, you will need to install some Python dependencies as follows:

.. code-block:: shell
$ pip install -U "ray[default]" vsphere-automation-sdk-python
$ pip install -U "ray[default]" "git+https://github.com/vmware/vsphere-automation-sdk-python.git"
vSphere Cluster Launcher Maintainers (GitHub handles): @vinodkri, @LaynePeng

@@ -1,7 +1,7 @@
.. _launching-vm-clusters:

Launching Ray Clusters on AWS, GCP, Azure, On-Prem
==================================================
Launching Ray Clusters on AWS, GCP, Azure, vSphere, On-Prem
===========================================================

In this section, you can find guides for launching Ray clusters in various clouds or on-premises.

@@ -14,4 +14,5 @@ Table of Contents
aws.md
gcp.md
azure.md
vsphere.md
on-premises.md
98 changes: 98 additions & 0 deletions doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md
@@ -0,0 +1,98 @@
# Launching Ray Clusters on vSphere

This guide details the steps needed to launch a Ray cluster in a vSphere environment.

To start a vSphere Ray cluster, you will use the Ray cluster launcher with the VMware vSphere Automation SDK for Python.

## Prepare the vSphere environment

If you don't already have a vSphere deployment, you can learn more about it by reading the [vSphere documentation](https://docs.vmware.com/en/VMware-vSphere/index.html). The following prerequisites are needed to create Ray clusters.
* [A vSphere cluster](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vcenter-esxi-management/GUID-F7818000-26E3-4E2A-93D2-FCDCE7114508.html) and [resource pools](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-resource-management/GUID-60077B40-66FF-4625-934A-641703ED7601.html) to host the VMs that make up Ray clusters.
* A network port group (either for a [standard switch](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-networking/GUID-E198C88A-F82C-4FF3-96C9-E3DF0056AD0C.html) or [distributed switch](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-networking/GUID-375B45C7-684C-4C51-BA3C-70E48DFABF04.html)) or an [NSX segment](https://docs.vmware.com/en/VMware-NSX/4.1/administration/GUID-316E5027-E588-455C-88AD-A7DA930A4F0B.html). VMs connected to this network should be able to obtain IP addresses via DHCP.
* A datastore that can be accessed by all the hosts in the vSphere cluster.

Another way to prepare the vSphere environment is with VMware Cloud Foundation (VCF). VCF is a unified software-defined datacenter (SDDC) platform that integrates vSphere, vSAN, and NSX into a natively integrated stack, delivering enterprise-ready cloud infrastructure for both private and public cloud environments. If you are using VCF, you can refer to the VCF documentation to [create workload domains](https://docs.vmware.com/en/VMware-Cloud-Foundation/5.0/vcf-admin/GUID-3A478CF8-AFF8-43D9-9635-4E40A0E372AD.html) for running Ray clusters. A VCF workload domain comprises one or more vSphere clusters, shared storage such as vSAN, and a software-defined network managed by NSX. You can also [create NSX Edge clusters using VCF](https://docs.vmware.com/en/VMware-Cloud-Foundation/5.0/vcf-admin/GUID-D17D0274-7764-43BD-8252-D9333CA7415A.html) and create a segment for the Ray VM network.

## Prepare the frozen VM

The vSphere Ray cluster launcher requires the vSphere environment to have a VM in a frozen state prior to deploying a Ray cluster. This VM is later used to rapidly create head and worker nodes with VMware's [instant clone](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-853B1E2B-76CE-4240-A654-3806912820EB.html) technology. The internal details of the Ray cluster provisioning process using the frozen VM can be found in this [Ray on vSphere architecture document](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/vsphere/ARCHITECTURE.md).

You can follow [this document](https://via.vmw.com/frozen-vm-ovf) to create and set up the frozen VM. By default, Ray clusters' head and worker node VMs will be placed in the same resource pool as the frozen VM. When building and deploying the frozen VM, there are a couple of things to note:

* The VM's network adapter should be connected to the port group or NSX segment configured in the above section, and the `Connect At Power On` check box should be selected.
* After the frozen VM is built, a private key file (`ray-bootstrap-key.pem`) and a public key file (`ray_bootstrap_public_key.key`) will be generated under the HOME directory of the current user. If you want to deploy Ray clusters from another machine, these files should be copied to that machine's HOME directory to be picked up by the vSphere cluster launcher.
* An OVF will be generated in the content library. If you want to deploy Ray clusters in other vSphere deployments, you can export the OVF and use it to create the frozen VM, instead of building it again from scratch.
* The VM should be powered on before you launch Ray clusters.

## Install Ray cluster launcher

The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop, and attach to a running Ray cluster with commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install the Ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions.

```bash
# Install Ray with the cluster launcher and dashboard dependencies.
pip install -U "ray[default]"
```

## Install VMware vSphere Automation SDK for Python

Next, install the VMware vSphere Automation SDK for Python.

```bash
# Install the VMware vSphere Automation SDK for Python.
pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git'
```

You can append a version tag to install a specific version.
```bash
# Install the v8.0.1.0 version of the SDK.
pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git@v8.0.1.0'
```

## Start Ray with the Ray cluster launcher

Once the vSphere Automation SDK is installed, you should be ready to launch your cluster using the cluster launcher. The provided [cluster config file](https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-full.yaml) will create a small cluster with a head node configured to autoscale to up to two workers.

Note that you need to configure your vSphere credentials and vCenter Server address, either by setting environment variables or by adding them to the Ray cluster configuration YAML file.

Test that it works by running the following commands from your local machine:

```bash
# Download the example-full.yaml
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-full.yaml

# Setup vSphere credentials using environment variables
export VSPHERE_SERVER=vcenter-address.example.com
export VSPHERE_USER=foo
export VSPHERE_PASSWORD=bar

# Edit the example-full.yaml to update the frozen VM name for both the head's and worker's node_config.
# Optionally configure the head and worker node resource pool and datastore placement.
# If not configured via environment variables, the vSphere credentials can alternatively be configured in this file.
# vi example-full.yaml

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
ray up example-full.yaml

# Get a remote screen on the head node.
ray attach example-full.yaml

# Try running a Ray program.
python -c 'import ray; ray.init()'
exit

# Tear down the cluster.
ray down example-full.yaml
```
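
The one-liner above only checks that `ray.init()` succeeds. If you want a slightly richer check, the following is a minimal sketch you could run from the head node after `ray attach` (the task body and the number of tasks are illustrative, not part of the example config):

```python
# sanity_check.py: run on the head node after `ray attach example-full.yaml`.
import socket

import ray

ray.init()  # Connects to the cluster started by `ray up`.

@ray.remote
def get_hostname() -> str:
    """Return the hostname of the node that executes this task."""
    return socket.gethostname()

# Fan out a handful of tasks; they may be scheduled across the head and worker nodes.
hostnames = ray.get([get_hostname.remote() for _ in range(8)])
print("Tasks ran on:", sorted(set(hostnames)))
print("Cluster resources:", ray.cluster_resources())
```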

Congrats, you have started a Ray cluster on vSphere!

## Configure vSAN File Service as persistent storage for Ray AIR

Starting in Ray 2.7, Ray AIR (Train and Tune) requires users to provide a cloud storage or NFS path when running distributed training or tuning jobs. In a vSphere environment with a vSAN datastore, you can use the vSAN File Service feature to employ vSAN as shared persistent storage. Refer to [this vSAN File Service document](https://docs.vmware.com/en/VMware-vSphere/8.0/vsan-administration/GUID-CA9CF043-9434-454E-86E7-DCA9AD9B0C09.html) to create and configure NFS file shares backed by vSAN. The general steps are as follows:

1. Enable vSAN File Service and configure it with domain information and IP address pools.
2. Create a vSAN file share with NFS as the protocol.
3. View the file share information to get the NFS export path.

Once a file share is created, you can mount it into the head and worker nodes and use the mount path as the `storage_path` of Ray AIR's `RunConfig`. Refer to [this example YAML](https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-vsan-file-service.yaml) as a template for how to mount and configure the path. You will need to modify the NFS export path in the `initialization_commands` list and bind-mount the mounted path into the Ray container. In this example, you would put `/mnt/shared_storage/experiment_results` as the `storage_path` for `RunConfig`.
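
For reference, a training or tuning job launched on this cluster would then point its `RunConfig` at that mount path. The snippet below is a minimal sketch: the trainable function and experiment name are placeholders, and it assumes the file share is mounted at `/mnt/shared_storage/experiment_results` on every node, as set up by the example YAML.

```python
# A minimal sketch of using the vSAN-backed NFS mount as persistent storage
# for Ray Train/Tune. The trainable below is a placeholder.
from ray import train, tune

def objective(config):
    # Report a dummy metric; a real job would train a model here.
    train.report({"score": config["x"] ** 2})

tuner = tune.Tuner(
    objective,
    param_space={"x": tune.grid_search([1, 2, 3])},
    run_config=train.RunConfig(
        name="vsphere_vsan_demo",
        # Every node writes results and checkpoints to the shared NFS path
        # mounted via the initialization_commands in the example YAML.
        storage_path="/mnt/shared_storage/experiment_results",
    ),
)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").metrics)
```
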
13 changes: 8 additions & 5 deletions python/ray/autoscaler/vsphere/defaults.yaml
@@ -52,8 +52,9 @@ auth:
# The node config specifies the launch config and physical instance type.
available_node_types:
ray.head.default:
# For example, {"CPU": 4, "Memory": 8192}
resources: {}
# The node type's CPU and GPU resources are by default the same as the frozen VM.
# You can override the resources here.
resources: {"CPU": 2}
node_config:
# The resource pool where the head node should live, if unset, will be
# the frozen VM's resource pool.
@@ -67,8 +68,9 @@ available_node_types:
# The minimum number of nodes of this type to launch.
# This number should be >= 0.
min_workers: 0
# For example, {"CPU": 4, "Memory": 8192}
resources: {}
# The node type's CPU and GPU resources are by default the same as the frozen VM.
# You can override the resources here.
resources: {"CPU": 2}
node_config:
# The resource pool where the worker node should live, if unset, will be
# the frozen VM's resource pool.
@@ -116,7 +118,8 @@ initialization_commands: []
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []
head_setup_commands:
- pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git'

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
13 changes: 8 additions & 5 deletions python/ray/autoscaler/vsphere/example-full.yaml
@@ -52,8 +52,9 @@ auth:
# The node config specifies the launch config and physical instance type.
available_node_types:
ray.head.default:
# For example, {"CPU": 4, "Memory": 8192}
resources: {}
# The node type's CPU and GPU resources are by default the same as the frozen VM.
# You can override the resources here.
resources: {"CPU": 2}
node_config:
# The resource pool where the head node should live, if unset, will be
# the frozen VM's resource pool.
@@ -67,8 +68,9 @@ available_node_types:
# The minimum number of nodes of this type to launch.
# This number should be >= 0.
min_workers: 1
# For example, {"CPU": 4, "Memory": 8192}
resources: {}
# The node type's CPU and GPU resources are by default the same as the frozen VM.
# You can override the resources here.
resources: {"CPU": 2}
node_config:
# The resource pool where the worker node should live, if unset, will be
# the frozen VM's resource pool.
@@ -116,7 +118,8 @@ initialization_commands: []
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []
head_setup_commands:
- pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git'

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
138 changes: 138 additions & 0 deletions python/ray/autoscaler/vsphere/example-vsan-file-service.yaml
@@ -0,0 +1,138 @@
# A unique identifier for the head node and workers of this cluster.
cluster_name: vsan-fs

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
image: "rayproject/ray-ml:latest"
# image: rayproject/ray:latest # use this one if you don't need ML dependencies, it's faster to pull
container_name: "ray_container"
# If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
# if no cached version is present.
pull_before_run: True
run_options: # Extra options to pass into "docker run"
- --ulimit nofile=65536:65536
- -v /mnt/shared_storage/experiment_results:/mnt/shared_storage/experiment_results

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: vsphere

    # Credentials configured here will take precedence over credentials set in the
    # environment variables.
    # vsphere_config:
    #     credentials:
    #         user: vc_username
    #         password: vc_password
    #         server: vc_address

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ray
    # By default Ray creates a new private keypair, but you can also use your own.
    # If you do so, make sure to also set "KeyName" in the head and worker node
    # configurations below.
    # ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are by default the same as the frozen VM.
        # You can override the resources here.
        resources: {"CPU": 2}
        node_config:
            # The resource pool where the head node should live, if unset, will be
            # the frozen VM's resource pool.
            resource_pool:
            # Mandatory: The frozen VM name from which the head node will be instant-cloned.
            frozen_vm_name: frozen-vm
            # The datastore to store the vmdk of the head node vm, if unset, will be
            # the frozen VM's datastore.
            datastore:
    worker:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The node type's CPU and GPU resources are by default the same as the frozen VM.
        # You can override the resources here.
        resources: {"CPU": 2}
        node_config:
            # The resource pool where the worker node should live, if unset, will be
            # the frozen VM's resource pool.
            resource_pool:
            # Mandatory: The frozen VM name from which the worker node will be instant-cloned.
            frozen_vm_name: frozen-vm
            # The datastore to store the vmdk(s) of the worker node vm(s), if unset, will be
            # the frozen VM's datastore.
            datastore:

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude: []

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter: []

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    - sudo mkdir -p /mnt/shared_storage/experiment_results
    - sudo mount <NFS_SERVER_ADDRESS>:<NFS_EXPORT_PATH> /mnt/shared_storage/experiment_results

# List of shell commands to run to set up nodes.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git'

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
