[Cluster Launcher] vSphere fixes cherry-pick (#39954)

* [Doc] Add vSphere Ray cluster launcher user guide (#39630). As with other providers, this change adds a user guide for the vSphere Ray cluster launcher, including how to prepare the vSphere environment and the frozen VM, as well as the general steps to launch the cluster. It also contains a section on how to use vSAN File Service to provision NFS endpoints as persistent storage for Ray AIR, with a new example YAML file. In addition, existing examples and docs are updated to include the correct command to install the vSphere Python SDK. Why are these changes needed? As mentioned in PR #39379, we need a dedicated user guide for launching Ray clusters on vSphere. This change does that with a newly added vsphere.md, including a solution for Ray 2.7's deprecation of syncing to the head node for Ray AIR, using VMware vSAN File Service. Signed-off-by: Fangchi Wang <wfangchi@vmware.com> Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

* [vSphere Provider] Optimize the logs, and remove the Python code for connecting the NIC (#39143). This addresses one of the tech-debt items. The philosophy of this change is: for one-time operations during `ray up`, such as creating the tag category and the tags on vSphere, keep using cli_logger.info; for the other code, which is executed both during `ray up` and by the autoscaler on the head node, use the logger. Many logs are changed to debug level, except for the important ones, such as creating a VM, deleting a VM, and reusing an existing VM. This change also removes the logic for connecting the NIC. We don't need that part anymore, because a script in the customize.sh script planted in the frozen VM does the job; it is executed once right after instant cloning. Signed-off-by: Chen Jing <jingch@vmware.com>

* [Cluster launcher] [vSphere] Support deploying one frozen VM, or a set of frozen VMs, from OVF, then do `ray up` (#39783). Bug fix: the default.yaml file was not built into the Python wheel and was missing from the setup.py script; this change adds it. New features: 1. Support creating Ray nodes from a set of frozen VMs in a resource pool. The motivation is that when doing an instant clone, the new VM must be on the same ESXi host as the parent VM. Previously we had only one frozen VM, and the Ray nodes created from it had to be relocated to other ESXi hosts by vSphere DRS. After this change, we can round-robin over the ESXi hosts when doing instant clones to create the Ray nodes, saving the overhead of DRS. 2. Support creating the frozen VM, or a set of frozen VMs, from an OVF template. This saves some manual steps when the user has no existing frozen VM(s) but has an OVF template. Previously the user had to manually log in to vSphere and deploy a frozen VM from the OVF first; now `ray up` covers this functionality. 3. Support powering on the frozen VM when it is powered off during `ray up`: we now wait until the frozen VM is really "frozen" (usually about 2 minutes) before proceeding. Previously we had code to power on the frozen VM but did not wait until it was frozen, which was a bug; this change adds a function called `wait_until_frozen` to resolve the issue. 4. Some code refactoring: the vSphere SDK related code is split into a separate Python file. 5. The example YAML files and the corresponding docs are updated for the above changes. Signed-off-by: Chen Jing <jingch@vmware.com>

* [Doc] Update the vSphere cluster launcher maintainers (#39758). Since Vinod has left the company, we need to update the vSphere launcher maintainer list to add Roshan and Chen. Roshan acts as Vinod's successor, while Chen will be responsible for overseeing Ray-OSS and facilitating open-source development collaboration. Signed-off-by: Layne Peng <playne@vmware.com>

Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Layne Peng <playne@vmware.com>
Co-authored-by: Fangchi Wang <wfangchi@vmware.com>
Co-authored-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Layne Peng <appamail@hotmail.com>
1 parent c0c6fd7 · commit 9cacdc4 · 18 changed files with 1,314 additions and 335 deletions
doc/source/cluster/vms/user-guides/launching-clusters/vsphere.md (103 additions, 0 deletions)
# Launching Ray Clusters on vSphere

This guide details the steps needed to launch a Ray cluster in a vSphere environment.

To start a vSphere Ray cluster, you will use the Ray cluster launcher with the VMware vSphere Automation SDK for Python.
## Prepare the vSphere environment

If you don't already have a vSphere deployment, you can learn more about it by reading the [vSphere documentation](https://docs.vmware.com/en/VMware-vSphere/index.html). The following prerequisites are needed in order to create Ray clusters.

* [A vSphere cluster](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vcenter-esxi-management/GUID-F7818000-26E3-4E2A-93D2-FCDCE7114508.html) and [resource pools](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-resource-management/GUID-60077B40-66FF-4625-934A-641703ED7601.html) to host the VMs composing Ray clusters.
* A network port group (either for a [standard switch](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-networking/GUID-E198C88A-F82C-4FF3-96C9-E3DF0056AD0C.html) or a [distributed switch](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-networking/GUID-375B45C7-684C-4C51-BA3C-70E48DFABF04.html)) or an [NSX segment](https://docs.vmware.com/en/VMware-NSX/4.1/administration/GUID-316E5027-E588-455C-88AD-A7DA930A4F0B.html). VMs connected to this network must be able to obtain IP addresses via DHCP.
* A datastore that is accessible from all the hosts in the vSphere cluster.

Another way to prepare the vSphere environment is with VMware Cloud Foundation (VCF). VCF is a unified software-defined datacenter (SDDC) platform that seamlessly integrates vSphere, vSAN, and NSX into a natively integrated stack, delivering enterprise-ready cloud infrastructure for both private and public cloud environments. If you are using VCF, you can refer to the VCF documentation to [create workload domains](https://docs.vmware.com/en/VMware-Cloud-Foundation/5.0/vcf-admin/GUID-3A478CF8-AFF8-43D9-9635-4E40A0E372AD.html) for running Ray clusters. A VCF workload domain comprises one or more vSphere clusters, shared storage like vSAN, and a software-defined network managed by NSX. You can also [create NSX Edge clusters using VCF](https://docs.vmware.com/en/VMware-Cloud-Foundation/5.0/vcf-admin/GUID-D17D0274-7764-43BD-8252-D9333CA7415A.html) and create a segment for the Ray VMs' network.
## Prepare the frozen VM

The vSphere Ray cluster launcher requires the vSphere environment to have a VM in a frozen state before deploying a Ray cluster. This VM has all the dependencies installed and is later used to rapidly create head and worker nodes with VMware's [instant clone](https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-853B1E2B-76CE-4240-A654-3806912820EB.html) technology. The internal details of the Ray cluster provisioning process using a frozen VM can be found in this [Ray on vSphere architecture document](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/vsphere/ARCHITECTURE.md).

You can follow [this document](https://via.vmw.com/frozen-vm-ovf) to create and set up the frozen VM, or a set of frozen VMs in which each one is hosted on a distinct ESXi host in the vSphere cluster. By default, the Ray cluster's head and worker node VMs are placed in the same resource pool as the frozen VM. When building and deploying the frozen VM, there are a couple of things to note:

* The VM's network adapter should be connected to the port group or NSX segment configured in the section above, with the `Connect At Power On` check box selected.
* After the frozen VM is built, a private key file (`ray-bootstrap-key.pem`) and a public key file (`ray_bootstrap_public_key.key`) are generated under the HOME directory of the current user. If you want to deploy Ray clusters from another machine, copy these files to that machine's HOME directory so the vSphere cluster launcher can pick them up.
* An OVF is generated in the content library. If you want to deploy Ray clusters in other vSphere deployments, you can export this OVF and use it to create the frozen VM there, instead of building it again from scratch.
* The VM should be in powered-on status before you launch Ray clusters.
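If you plan to run the launcher from a different machine, a quick stdlib-only way to verify that both key files were copied into that machine's HOME directory is a check like the following (the helper name is ours, purely illustrative; the file names are the ones listed above):

```python
from pathlib import Path

# Key file names generated when the frozen VM is built (see the bullets above).
KEY_FILES = ("ray-bootstrap-key.pem", "ray_bootstrap_public_key.key")

def bootstrap_keys_present(home):
    """Return True if both Ray bootstrap key files exist under `home`."""
    home = Path(home)
    return all((home / name).is_file() for name in KEY_FILES)
```

For example, `bootstrap_keys_present(Path.home())` should return `True` on a machine that is ready to deploy clusters.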
## Install Ray cluster launcher

The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop, and attach to a running Ray cluster with commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install the `ray` CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions.

```bash
# Install Ray (quoting the extras avoids glob expansion in shells like zsh).
pip install -U "ray[default]"
```
## Install VMware vSphere Automation SDK for Python

Next, install the VMware vSphere Automation SDK for Python.

```bash
# Install the VMware vSphere Automation SDK for Python.
pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git'
```

You can append a version tag to install a specific version.

```bash
# Install the v8.0.1.0 version of the SDK.
pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git@v8.0.1.0'
```
## Start Ray with the Ray cluster launcher

Once the vSphere Automation SDK is installed, you should be ready to launch your cluster using the cluster launcher. The provided [cluster config file](https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-full.yaml) creates a small cluster with a head node configured to autoscale to up to two workers.

Note that you need to configure your vSphere credentials and vCenter Server address, either by setting environment variables or by adding them to the Ray cluster configuration YAML file.
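Before launching, you can confirm that the credential environment variables are set with a small stdlib-only check (the helper is a hypothetical convenience, not part of Ray; the variable names are the ones used in the commands in this guide):

```python
import os

# Environment variables the vSphere cluster launcher reads for credentials.
REQUIRED_VARS = ("VSPHERE_SERVER", "VSPHERE_USER", "VSPHERE_PASSWORD")

def missing_vsphere_vars(env=os.environ):
    """Return the names of required credential variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

If `missing_vsphere_vars()` returns a non-empty list, export those variables (or put the credentials in the YAML file) before running `ray up`.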
Test that it works by running the following commands from your local machine:

```bash
# Download the example-full.yaml
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-full.yaml

# Set up vSphere credentials using environment variables
export VSPHERE_SERVER=vcenter-address.example.com
export VSPHERE_USER=foo
export VSPHERE_PASSWORD=bar

# Edit example-full.yaml to update the frozen VM related configs under vsphere_config. There are 3 options:
# 1. If you have a single frozen VM, set the "name" under "frozen_vm".
# 2. If you have a set of frozen VMs in a resource pool (one VM on each ESXi host), set the "resource_pool" under "frozen_vm".
# 3. If you don't have any existing frozen VM in the vSphere cluster, but you have an OVF template of a frozen VM, set the "library_item" under "frozen_vm". After that, either set the "name" of the to-be-deployed frozen VM, or set the "resource_pool" to point to an existing resource pool for the to-be-deployed frozen VMs on all the ESXi hosts in the vSphere cluster. Also, the "datastore" must be specified.
# Optionally, configure the resource pool and datastore placement for the head and worker nodes.
# If not configured via environment variables, the vSphere credentials can alternatively be configured in this file.
# vi example-full.yaml

# Create or update the cluster. When the command finishes, it prints
# the command that can be used to SSH into the cluster head node.
ray up example-full.yaml

# Get a remote screen on the head node.
ray attach example-full.yaml

# Try running a Ray program.
python -c 'import ray; ray.init()'
exit

# Tear down the cluster.
ray down example-full.yaml
```
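The three `frozen_vm` options described in the comments above map onto `vsphere_config` roughly as follows. This is a hedged sketch: the key names come from this guide, but the exact nesting and the values are placeholders, so take the authoritative layout from `example-full.yaml` itself.

```yaml
# Sketch of the vsphere_config section in example-full.yaml.
# Use exactly one of the three frozen_vm options.
vsphere_config:
  frozen_vm:
    # Option 1: a single existing frozen VM.
    name: frozen-vm

    # Option 2: a resource pool of frozen VMs, one per ESXi host.
    # resource_pool: frozen-vm-pool

    # Option 3: deploy from an OVF template in the content library;
    # also set "name" (single VM) or "resource_pool" (one VM per host),
    # and "datastore" is required in this case.
    # library_item: frozen-vm-template
    # datastore: vsanDatastore
```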
Congrats, you have started a Ray cluster on vSphere!

## Configure vSAN File Service as persistent storage for Ray AIR

Starting in Ray 2.7, Ray AIR (Train and Tune) requires users to provide a cloud storage or NFS path when running distributed training or tuning jobs. In a vSphere environment with a vSAN datastore, you can use the vSAN File Service feature to employ vSAN as shared persistent storage. Refer to [this vSAN File Service document](https://docs.vmware.com/en/VMware-vSphere/8.0/vsan-administration/GUID-CA9CF043-9434-454E-86E7-DCA9AD9B0C09.html) to create and configure NFS file shares backed by vSAN. The general steps are as follows:

1. Enable vSAN File Service and configure it with domain information and IP address pools.
2. Create a vSAN file share with NFS as the protocol.
3. View the file share information to get the NFS export path.

Once a file share is created, you can mount it into the head and worker nodes and use the mount path as the `storage_path` for Ray AIR's `RunConfig` parameter. Refer to [this example YAML](https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/vsphere/example-vsan-file-service.yaml) as a template for how to mount and configure the path. You need to modify the NFS export path in the `initialization_commands` list and bind the mounted path into the Ray container. In this example, you would put `/mnt/shared_storage/experiment_results` as the `storage_path` for `RunConfig`.
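Based on the description above, the relevant portion of the cluster YAML looks roughly like the sketch below. This is an illustration, not the shipped file: the export path is a placeholder you must replace with the one shown by vSAN File Service, and the container bind mount is our assumption of how the path is exposed inside the Ray container; copy the exact lines from `example-vsan-file-service.yaml`.

```yaml
# Mount the vSAN file share on each node, then expose it inside the
# Ray container so Ray AIR can use it as storage_path.
initialization_commands:
  - sudo mkdir -p /mnt/shared_storage
  # Replace with the NFS export path shown for your vSAN file share.
  - sudo mount -t nfs <nfs-export-path> /mnt/shared_storage
docker:
  run_options:
    - -v /mnt/shared_storage:/mnt/shared_storage
```

With the share mounted, pass `storage_path="/mnt/shared_storage/experiment_results"` to `RunConfig` in your Train or Tune job.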