[Cluster Launcher] Vsphere fixes cherry pick (#39954)
* [Doc] Add vSphere Ray cluster launcher user guide (#39630)

Similar to other providers, this change adds a user guide for the vSphere Ray cluster launcher,
including how to prepare the vSphere environment and the frozen VM, as well as the general
steps to launch the cluster. It also contains a section on how to use vSAN File Service to
provision NFS endpoints as persistent storage for Ray AIR, with a new example YAML file.

In addition to that, existing examples and docs are updated to include the correct command
to install vSphere Python SDK.

Signed-off-by: Fangchi Wang wfangchi@vmware.com

Why are these changes needed?
As mentioned in PR #39379, we need a dedicated user guide for launching Ray clusters on vSphere. This change does that with a newly added vsphere.md, including a solution for Ray 2.7's deprecation of syncing to the head node for Ray AIR, using VMware vSAN File Service.

---------

Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

* [vSphere Provider] Optimize the log, and remove the part for connecting NIC in Python (#39143)

This addresses one of the tech debt items.
The philosophy of this change is:

For the one-time operations during ray up, such as creating the tag category and the tags on vSphere, keep using cli_logger.info.
For the other code, which is executed both during ray up and by the autoscaler on the head node, I use the logger.
I changed many logs to debug level, except for the important ones, such as creating a VM, deleting a VM, and reusing an existing VM.

This change also removes the logic for connecting the NIC. We don't need that part anymore, because the customize.sh script planted in the frozen VM does the job. That script is executed once right after instant cloning.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>

* [Cluster launcher] [vSphere] Support deploying one frozen VM, or a set of frozen VMs from OVF, then do ray up. (#39783)

Bug fix
The defaults.yaml file was not built into the Python wheel and was not included in the setup.py script. This change adds it.
New features
1. Support creating Ray nodes from a set of frozen VMs in a resource pool.
The motivation is that when doing an instant clone, the new VM must be on the same ESXi host as the parent VM. Previously we had only one frozen VM, so the Ray nodes created from that frozen VM had to be relocated to other ESXi hosts by vSphere DRS. After this change, we can round-robin across the ESXi hosts when doing instant clones to create the Ray nodes, saving the overhead of DRS.

2. Support creating the frozen VM, or a set of frozen VMs from OVF template.
This feature saves some manual steps when the user has no existing frozen VM(s) but has an OVF template. Previously the user had to manually log in to vSphere and deploy a frozen VM from the OVF first. Now this functionality is covered by ray up.

3. Support powering on the frozen VM when it is in powered-off status during ray up; we wait until the frozen VM is really "frozen" before doing ray up.
Previously we had logic to power on the frozen VM, but we did not wait until it was frozen (usually about 2 minutes), which was actually a bug. This change adds a function called "wait_until_frozen" to resolve the issue.

4. Some code refactoring: we split the vSphere SDK-related code into a separate Python file.
5. Update the YAML example files and the corresponding docs for the above changes.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>

* [Doc] Update the vSphere cluster Launcher Maintainer. (#39758)

Since Vinod has left the company, we need to update the vSphere Launcher maintainer list to add Roshan and Chen. Roshan acts as Vinod's successor, while Chen will be responsible for overseeing Ray-OSS and facilitating open-source development collaboration.

Signed-off-by: Layne Peng <playne@vmware.com>

---------

Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Layne Peng <playne@vmware.com>
Co-authored-by: Fangchi Wang <wfangchi@vmware.com>
Co-authored-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Layne Peng <appamail@hotmail.com>
4 people committed Sep 29, 2023
1 parent c0c6fd7 commit 9cacdc4
Showing 18 changed files with 1,314 additions and 335 deletions.
1 change: 1 addition & 0 deletions BUILD.bazel
@@ -2403,6 +2403,7 @@ filegroup(
"python/ray/autoscaler/azure/defaults.yaml",
"python/ray/autoscaler/gcp/defaults.yaml",
"python/ray/autoscaler/local/defaults.yaml",
"python/ray/autoscaler/vsphere/defaults.yaml",
"python/ray/cloudpickle/*.py",
"python/ray/core/__init__.py",
"python/ray/core/generated/__init__.py",
4 changes: 2 additions & 2 deletions doc/source/cluster/vms/getting-started.rst
@@ -69,9 +69,9 @@ Before we start, you will need to install some Python dependencies as follows:

.. code-block:: shell

    $ pip install -U "ray[default]" vsphere-automation-sdk-python
    $ pip install -U "ray[default]" "git+https://github.com/vmware/vsphere-automation-sdk-python.git"

vSphere Cluster Launcher Maintainers (GitHub handles): @vinodkri, @LaynePeng
vSphere Cluster Launcher Maintainers (GitHub handles): @LaynePeng, @roshankathawate, @JingChen23


Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials:
139 changes: 136 additions & 3 deletions doc/source/cluster/vms/references/ray-cluster-configuration.rst
@@ -179,6 +179,8 @@ vSphere Config
:ref:`credentials <cluster-configuration-vsphere-credentials>`:
:ref:`vSphere Credentials <cluster-configuration-vsphere-credentials-type>`
:ref:`frozen_vm <cluster-configuration-vsphere-frozen-vm>`:
:ref:`vSphere Frozen VM Configs <cluster-configuration-vsphere-frozen-vm-configs>`
.. _cluster-configuration-vsphere-credentials-type:

@@ -195,8 +197,26 @@ vSphere Credentials
:ref:`password <cluster-configuration-vsphere-password>`: str
:ref:`server <cluster-configuration-vsphere-server>`: str
.. _cluster-configuration-vsphere-frozen-vm-configs:

vSphere Frozen VM Configs
~~~~~~~~~~~~~~~~~~~~~~~~~

.. tab-set::

    .. tab-item:: vSphere

        .. parsed-literal::

            :ref:`name <cluster-configuration-vsphere-frozen-vm-name>`: str
            :ref:`library_item <cluster-configuration-vsphere-frozen-vm-library-item>`: str
            :ref:`resource_pool <cluster-configuration-vsphere-frozen-vm-resource-pool>`: str
            :ref:`cluster <cluster-configuration-vsphere-frozen-vm-cluster>`: str
            :ref:`datastore <cluster-configuration-vsphere-frozen-vm-datastore>`: str

.. _cluster-configuration-node-types-type:


Node types
~~~~~~~~~~

@@ -254,8 +274,6 @@ nodes with the newly applied ``node_config`` will then be created according to c
# The resource pool where the head node should live, if unset, will be
# the frozen VM's resource pool.
resource_pool: str
# Mandatory: The frozen VM name from which the head node will be instant-cloned.
frozen_vm_name: str
# The datastore to store the vmdk of the head node vm, if unset, will be
# the frozen VM's datastore.
datastore: str
@@ -1127,7 +1145,7 @@ controlled by your cloud provider's configuration.

.. tab-item:: vSphere

vSphere configuations used to connect vCenter Server. If not configured,
vSphere configurations used to connect vCenter Server. If not configured,
the VSPHERE_* environment variables will be used.

* **Required:** No
@@ -1201,6 +1219,121 @@ The vSphere vCenter Server address.
* **Importance:** Low
* **Type:** String

.. _cluster-configuration-vsphere-frozen-vm:

``vsphere_config.frozen_vm``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The frozen VM related configurations.

If the frozen VM(s) already exist, then ``library_item`` should be unset. Either an existing frozen VM should be specified by ``name``, or a resource pool name of frozen VMs on every ESXi (https://docs.vmware.com/en/VMware-vSphere/index.html) host should be specified by ``resource_pool``.

If the frozen VM(s) are to be deployed from an OVF template, then ``library_item`` must be set to point to an OVF template (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-AFEDC48B-C96F-4088-9C1F-4F0A30E965DE.html) in the content library. In that case, ``name`` must be set to indicate the name or the name prefix of the frozen VM(s). Then, either ``resource_pool`` should be set to indicate that a set of frozen VMs will be created on each ESXi host of the resource pool, or ``cluster`` should be set to indicate that a single frozen VM will be created in the vSphere cluster. The config ``datastore`` (https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-D5AB2BAD-C69A-4B8D-B468-25D86B8D39CE.html) is mandatory in this case.

Valid examples:

1. ``ray up`` on a frozen VM to be deployed from an OVF template:

.. code-block:: yaml

    frozen_vm:
        name: single-frozen-vm
        library_item: frozen-vm-template
        cluster: vsanCluster
        datastore: vsanDatastore

2. ``ray up`` on an existing frozen VM:

.. code-block:: yaml

    frozen_vm:
        name: existing-single-frozen-vm

3. ``ray up`` on a resource pool of frozen VMs to be deployed from an OVF template:

.. code-block:: yaml

    frozen_vm:
        name: frozen-vm-prefix
        library_item: frozen-vm-template
        resource_pool: frozen-vm-resource-pool
        datastore: vsanDatastore

4. ``ray up`` on an existing resource pool of frozen VMs:

.. code-block:: yaml

    frozen_vm:
        resource_pool: frozen-vm-resource-pool

Combinations other than the above examples are invalid.

* **Required:** Yes
* **Importance:** High
* **Type:** :ref:`vSphere Frozen VM Configs <cluster-configuration-vsphere-frozen-vm-configs>`

.. _cluster-configuration-vsphere-frozen-vm-name:

``vsphere_config.frozen_vm.name``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The name or the name prefix of the frozen VM.
It can only be unset when ``resource_pool`` is set and points to an existing resource pool of frozen VMs.

* **Required:** No
* **Importance:** Medium
* **Type:** String

.. _cluster-configuration-vsphere-frozen-vm-library-item:

``vsphere_config.frozen_vm.library_item``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The library item (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-D3DD122F-16A5-4F36-8467-97994A854B16.html#GUID-D3DD122F-16A5-4F36-8467-97994A854B16) of the OVF template of the frozen VM. If set, the frozen VM or a set of frozen VMs will be deployed from the OVF template specified by ``library_item``. Otherwise, the frozen VM(s) must already exist.

Visit the VM Packer for Ray project (https://github.com/vmware-ai-labs/vm-packer-for-ray) to learn how to create an OVF template for frozen VMs.

* **Required:** No
* **Importance:** Low
* **Type:** String

.. _cluster-configuration-vsphere-frozen-vm-resource-pool:

``vsphere_config.frozen_vm.resource_pool``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The resource pool name of the frozen VMs; it can point to an existing resource pool of frozen VMs. Otherwise, ``library_item`` must be specified and a set of frozen VMs will be deployed, one on each ESXi host.

The frozen VMs will be named "{frozen_vm.name}-{the VM's IP address}".

* **Required:** No
* **Importance:** Medium
* **Type:** String

.. _cluster-configuration-vsphere-frozen-vm-cluster:

``vsphere_config.frozen_vm.cluster``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The vSphere cluster name; this only takes effect when ``library_item`` is set and ``resource_pool`` is unset.
It indicates that a single frozen VM will be deployed on the vSphere cluster from the OVF template.

* **Required:** No
* **Importance:** Medium
* **Type:** String

.. _cluster-configuration-vsphere-frozen-vm-datastore:

``vsphere_config.frozen_vm.datastore``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target vSphere datastore name for storing the virtual machine files of the frozen VM to be deployed from the OVF template.
This takes effect only when ``library_item`` is set. If ``resource_pool`` is also set, this datastore must be a shared datastore among the ESXi hosts.

* **Required:** No
* **Importance:** Low
* **Type:** String

.. _cluster-configuration-node-config:

``available_node_types.<node_type_name>.node_type.node_config``
@@ -1,7 +1,7 @@
.. _launching-vm-clusters:

Launching Ray Clusters on AWS, GCP, Azure, On-Prem
==================================================
Launching Ray Clusters on AWS, GCP, Azure, vSphere, On-Prem
===========================================================

In this section, you can find guides for launching Ray clusters in various clouds or on-premises.

Expand All @@ -14,4 +14,5 @@ Table of Contents
aws.md
gcp.md
azure.md
vsphere.md
on-premises.md
103 changes: 103 additions & 0 deletions doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md
@@ -0,0 +1,103 @@
# Launching Ray Clusters on vSphere

This guide details the steps needed to launch a Ray cluster in a vSphere environment.

To start a vSphere Ray cluster, you will use the Ray cluster launcher with the VMware vSphere Automation SDK for Python.

## Prepare the vSphere environment

If you don't already have a vSphere deployment, you can learn more about it by reading the [vSphere documentation](https://docs.vmware.com/en/VMware-vSphere/index.html). The following prerequisites are needed in order to create Ray clusters.
* [A vSphere cluster](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vcenter-esxi-management/GUID-F7818000-26E3-4E2A-93D2-FCDCE7114508.html) and [resource pools](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-resource-management/GUID-60077B40-66FF-4625-934A-641703ED7601.html) to host VMs composing Ray Clusters.

* A network port group (either for a [standard switch](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-networking/GUID-E198C88A-F82C-4FF3-96C9-E3DF0056AD0C.html) or [distributed switch](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-networking/GUID-375B45C7-684C-4C51-BA3C-70E48DFABF04.html)) or an [NSX segment](https://docs.vmware.com/en/VMware-NSX/4.1/administration/GUID-316E5027-E588-455C-88AD-A7DA930A4F0B.html). VMs connected to this network should be able to obtain IP addresses via DHCP.
* A datastore that can be accessed by all the hosts in the vSphere cluster.

Another way to prepare the vSphere environment is with VMware Cloud Foundation (VCF). VCF is a unified software-defined datacenter (SDDC) platform that seamlessly integrates vSphere, vSAN, and NSX into a natively integrated stack, delivering enterprise-ready cloud infrastructure for both private and public cloud environments. If you are using VCF, you can refer to the VCF documentation to [create workload domains](https://docs.vmware.com/en/VMware-Cloud-Foundation/5.0/vcf-admin/GUID-3A478CF8-AFF8-43D9-9635-4E40A0E372AD.html) for running Ray Clusters. A VCF workload domain comprises one or more vSphere clusters, shared storage like vSAN, and a software-defined network managed by NSX. You can also [create NSX Edge Clusters with VCF](https://docs.vmware.com/en/VMware-Cloud-Foundation/5.0/vcf-admin/GUID-D17D0274-7764-43BD-8252-D9333CA7415A.html) and create a segment for the Ray VM network.

## Prepare the frozen VM

The vSphere Ray cluster launcher requires the vSphere environment to have a VM in a frozen state prior to deploying a Ray cluster. This VM has all the dependencies installed and is later used to rapidly create head and worker nodes with VMware's [instant clone](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-853B1E2B-76CE-4240-A654-3806912820EB.html) technology. The internal details of the Ray cluster provisioning process using the frozen VM can be found in this [Ray on vSphere architecture document](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/vsphere/ARCHITECTURE.md).

You can follow [this document](https://via.vmw.com/frozen-vm-ovf) to create and set up the frozen VM, or a set of frozen VMs, each hosted on a distinct ESXi host in the vSphere cluster. By default, the Ray clusters' head and worker node VMs will be placed in the same resource pool as the frozen VM. When building and deploying the frozen VM, there are a couple of things to note:

* The VM's network adapter should be connected to the port group or NSX segment configured in the section above, and the `Connect At Power On` check box should be selected.
* After the frozen VM is built, a private key file (`ray-bootstrap-key.pem`) and a public key file (`ray_bootstrap_public_key.key`) will be generated under the HOME directory of the current user. If you want to deploy Ray clusters from another machine, these files should be copied to that machine's HOME directory to be picked up by the vSphere cluster launcher.
* An OVF will be generated in the content library. If you want to deploy Ray clusters in other vSphere deployments, you can export the OVF and use it to create the frozen VM, instead of building it again from scratch.
* The VM should be in powered-on status before you launch Ray clusters.

## Install Ray cluster launcher

The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop, and attach to a running Ray cluster with commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install the Ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions.

```bash
# install ray
pip install -U ray[default]
```

## Install VMware vSphere Automation SDK for Python

Next, install the VMware vSphere Automation SDK for Python.

```bash
# Install the VMware vSphere Automation SDK for Python.
pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git'
```

You can append a version tag to install a specific version.
```bash
# Install the v8.0.1.0 version of the SDK.
pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git@v8.0.1.0'
```

## Start Ray with the Ray cluster launcher

Once the vSphere Automation SDK is installed, you should be ready to launch your cluster using the cluster launcher. The provided [cluster config file](https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-full.yaml) will create a small cluster with a head node configured to autoscale to up to two workers.

Note that you need to configure your vSphere credentials and vCenter Server address, either by setting environment variables or by adding them to the Ray cluster configuration YAML file (see the sketch below).
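
For reference, here is a minimal sketch of how those settings might look inside the cluster YAML instead of environment variables. The key names follow the vSphere configuration reference; all values, including the `administrator@vsphere.local` user, are placeholders.

```yaml
# Illustrative sketch of the provider section in the cluster YAML;
# key names follow the vSphere configuration reference, values are placeholders.
provider:
    type: vsphere
    vsphere_config:
        credentials:
            user: administrator@vsphere.local
            password: bar
            server: vcenter-address.example.com
```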

Test that it works by running the following commands from your local machine:

```bash
# Download the example-full.yaml
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-full.yaml

# Setup vSphere credentials using environment variables
export VSPHERE_SERVER=vcenter-address.example.com
export VSPHERE_USER=foo
export VSPHERE_PASSWORD=bar

# Edit the example-full.yaml to update the frozen-VM-related configs under vsphere_config. There are 3 options (sketched after this code block):
# 1. If you have a single frozen VM, set the "name" under "frozen_vm".
# 2. If you have a set of frozen VMs in a resource pool (one VM on each ESXi host), set the "resource_pool" under "frozen_vm".
# 3. If you don't have any existing frozen VM in the vSphere cluster but you have an OVF template of a frozen VM, set the "library_item" under "frozen_vm". After that, either set the "name" of the to-be-deployed frozen VM, or set the "resource_pool" to point to an existing resource pool for the to-be-deployed frozen VMs on all the ESXi hosts in the vSphere cluster. Also, the "datastore" must be specified.
# Optionally configure the head and worker node resource pool and datastore placement.
# If not configured via environment variables, the vSphere credentials can alternatively be configured in this file.

# vi example-full.yaml

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
ray up example-full.yaml

# Get a remote screen on the head node.
ray attach example-full.yaml

# Try running a Ray program.
python -c 'import ray; ray.init()'
exit

# Tear down the cluster.
ray down example-full.yaml
```
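
For reference, the three `frozen_vm` variants mentioned in the comments above mirror the valid examples in the cluster configuration reference. The sketch below uses placeholder names and shows each variant as a separate YAML document.

```yaml
# Variant 1: a single existing frozen VM.
frozen_vm:
    name: existing-single-frozen-vm
---
# Variant 2: an existing resource pool of frozen VMs (one per ESXi host).
frozen_vm:
    resource_pool: frozen-vm-resource-pool
---
# Variant 3: deploy the frozen VM(s) from an OVF template in the content
# library; "datastore" is mandatory here. Use "cluster" instead of
# "resource_pool" to deploy only a single frozen VM.
frozen_vm:
    name: frozen-vm-prefix
    library_item: frozen-vm-template
    resource_pool: frozen-vm-resource-pool
    datastore: vsanDatastore
```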

Congrats, you have started a Ray cluster on vSphere!

## Configure vSAN File Service as persistent storage for Ray AIR

Starting in Ray 2.7, Ray AIR (Train and Tune) will require users to provide a cloud storage or NFS path when running distributed training or tuning jobs. In a vSphere environment with a vSAN datastore, you can utilize the vSAN File Service feature to employ vSAN as a shared persistent storage. You can refer to [this vSAN File Service document](https://docs.vmware.com/en/VMware-vSphere/8.0/vsan-administration/GUID-CA9CF043-9434-454E-86E7-DCA9AD9B0C09.html) to create and configure NFS file shares supported by vSAN. The general steps are as follows:

1. Enable vSAN File Service and configure it with domain information and IP address pools.
2. Create a vSAN file share with NFS as the protocol.
3. View the file share information to get the NFS export path.

Once a file share is created, you can mount it into the head and worker nodes and use the mount path as the `storage_path` for Ray AIR's `RunConfig` parameter. Please refer to [this example YAML](https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-vsan-file-service.yaml) as a template for how to mount and configure the path, and see the sketch below. You will need to modify the NFS export path in the `initialization_commands` list and bind the mounted path within the Ray container. In this example, you will need to put `/mnt/shared_storage/experiment_results` as the `storage_path` for `RunConfig`.
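
As a rough sketch of the pieces involved (the NFS export address and share path below are placeholders; the linked example YAML has the authoritative values), the cluster YAML mounts the file share on each node and binds it into the Ray container roughly like this:

```yaml
# Mount the vSAN file share on each node before the Ray container starts.
# Replace the NFS export address and share path with the values from your
# vSAN File Service file share.
initialization_commands:
    - sudo mkdir -p /mnt/shared_storage
    - sudo mount -t nfs <vsan-fs-nfs-export-ip>:/<file-share-path> /mnt/shared_storage

# Bind-mount the share into the Ray container so the same path is visible inside it.
docker:
    run_options:
        - "-v /mnt/shared_storage:/mnt/shared_storage"
```

With the share mounted this way, the `/mnt/shared_storage/experiment_results` path mentioned above can be passed as the `storage_path` in Ray AIR's `RunConfig`.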