[vSphere] Support deploying one frozen VM, or a set of frozen VMs from OVF, then do ray up. #39783

JingChen23 · 2023-09-21T08:20:17Z

Why are these changes needed?

Bug fix

The default.yaml file was not built into the Python wheel and was not listed in the setup.py script. This change adds it.

New features

1. Support creating Ray nodes from a set of frozen VMs in a resource pool.

The motivation is that, with instant clone, the new VM must be on the same ESXi host as its parent VM. Previously we had only one frozen VM, so the Ray nodes created from it had to be relocated to other ESXi hosts by vSphere DRS. After this change, we can round-robin across the ESXi hosts when instant cloning the Ray nodes, which saves the overhead of DRS relocation.
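The round-robin idea can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names `round_robin_parents` and `plan_clones`, and the host and VM names, are hypothetical, and it assumes exactly one frozen VM per ESXi host.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def round_robin_parents(frozen_vms_by_host: Dict[str, str]) -> Iterator[str]:
    """Return an endless iterator over frozen VM names, one host after another.

    `frozen_vms_by_host` maps an ESXi host name to the frozen VM on that host
    (hypothetical structure, assumed for this sketch).
    """
    if not frozen_vms_by_host:
        raise ValueError("at least one frozen VM is required")
    # Sort by host name so the order is deterministic across runs.
    ordered = [frozen_vms_by_host[host] for host in sorted(frozen_vms_by_host)]
    return cycle(ordered)


def plan_clones(frozen_vms_by_host: Dict[str, str], node_count: int) -> List[str]:
    """Pick the parent frozen VM for each of `node_count` new Ray nodes."""
    parents = round_robin_parents(frozen_vms_by_host)
    return [next(parents) for _ in range(node_count)]


plan = plan_clones({"esxi-b": "frozen-vm-2", "esxi-a": "frozen-vm-1"}, node_count=3)
print(plan)  # ['frozen-vm-1', 'frozen-vm-2', 'frozen-vm-1']
```

Because each instant clone lands on the host of its parent, cycling through the parents spreads the Ray nodes across hosts without a later relocation step.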

2. Support creating the frozen VM, or a set of frozen VMs, from an OVF template.

This feature saves some manual steps when the user has no existing frozen VM(s) but has an OVF template. Previously the user had to manually log in to vSphere and deploy a frozen VM from the OVF first. Now ray up covers this functionality.

3. Support powering on the frozen VM if it is powered off when doing ray up; we wait until the frozen VM is really "frozen", then continue with ray up.

Previously we had code logic to power on the frozen VM, but it did not wait until the VM was frozen (this usually takes 2 minutes or so). This was actually a bug. In this change we add a function called "wait_until_frozen" to resolve this issue.
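A polling loop of this kind might look like the sketch below. Only the name `wait_until_frozen` comes from the PR; the signature, the `is_frozen` callback, and the default timeout and poll interval are assumptions made for illustration, and the real function may check the VM state differently.

```python
import time
from typing import Callable


def wait_until_frozen(
    is_frozen: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
) -> None:
    """Block until `is_frozen()` returns True, or raise TimeoutError.

    `is_frozen` is a hypothetical callback that reports whether the
    powered-on VM has reached the frozen state.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_frozen():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"frozen VM did not reach the frozen state within {timeout_s} seconds"
            )
        time.sleep(poll_interval_s)
```

Raising on timeout, rather than continuing silently, keeps ray up from attempting an instant clone against a parent VM that is not ready yet.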

4. Some code refactoring work. We split the vSphere SDK-related code into another Python file.

5. Update the yaml example files and the corresponding docs for the above changes.

Tests

Create one single frozen VM then do 'ray up'

The yaml snippet:

      frozen_vm:
        name: frozen-vm
        library_item: frozen-vm-1
        cluster: x77-cluster
        datastore: vsanDatastore

Verified that ray up succeeds.

ray up with one existing frozen VM

The yaml snippet:

      frozen_vm:
        name: frozen-vm
        # library_item: frozen-vm-1
        # resource_pool: frozen-vms
        # cluster: x77-cluster
        # datastore: vsanDatastore

Verified that ray up succeeds.

Create a set of frozen VMs in a resource pool, then do ray up and create Ray nodes by round robin

The yaml snippet:

      frozen_vm:
        name: frozen-vm
        library_item: frozen-vm-item
        resource_pool: frozen-vms
        # cluster: vsphere-cluster
        datastore: vsanDatastore

Verified that ray up succeeds.
Verified that the Ray nodes are spread across different ESXi hosts.

Ray up on an existing resource pool of frozen VMs.

The yaml snippet:

      frozen_vm:
        #name: frozen-vm
        #library_item: frozen-vm-item
        resource_pool: frozen-vms
        # cluster: vsphere-cluster
        # datastore: vsanDatastore

Verified that ray up succeeds.
Verified that the Ray nodes are spread across different ESXi hosts.

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Chen Jing <jingch@vmware.com>

architkulkarni · 2023-09-21T16:38:27Z

@JingChen23 Just letting you know you can mark the PR as draft until it's ready for review! https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/changing-the-stage-of-a-pull-request#converting-a-pull-request-to-a-draft

python/ray/autoscaler/_private/vsphere/node_provider.py

python/ray/autoscaler/_private/vsphere/scheduler.py

architkulkarni · 2023-09-25T17:01:32Z

@JingChen23 Looks like there are some potentially relevant test failures: https://buildkite.com/ray-project/premerge/builds/6411

architkulkarni

The new API seems a bit complicated for someone unfamiliar with vSphere, but this might be unavoidable. I think the following suggestions might mitigate this:

Are there references in vSphere docs for the new terms "library item", "resource pool", "datastore" etc? It would be good to link to these.
I think adding some example config snippets for frozen_vm in the docs would go a long way! The ones in the PR description might be a good starting point.
There are a lot of constraints of the form "X must be specified, or if Y is specified Z must also be specified". Can we make sure these constraints are validated and can we add unit tests to make sure they fail fast with user friendly errors?

Other than this, looks good! Just minor comments.

The PR is a bit large, in the future it would be great to submit a series of smaller PRs.

doc/source/cluster/vms/references/ray-cluster-configuration.rst

python/ray/autoscaler/_private/vsphere/round_robin_scheduler.py

python/ray/autoscaler/_private/vsphere/scheduler.py

python/ray/autoscaler/_private/vsphere/sdk_provider.py

JingChen23 · 2023-09-26T05:14:09Z

> @JingChen23 Looks like there are some potentially relevant test failures: https://buildkite.com/ray-project/premerge/builds/6411

Thanks Archit, this is because I forgot to bring over the change to the UT file from our internal repo.

JingChen23 · 2023-09-26T08:23:09Z

> The new API seems a bit complicated for someone unfamiliar with vSphere, but this might be unavoidable. I think the following suggestions might mitigate this:
>
> Are there references in vSphere docs for the new terms "library item", "resource pool", "datastore" etc? It would be good to link to these.
>
> I think adding some example config snippets for frozen_vm in the docs would go a long way! The ones in the PR description might be a good starting point.
>
> There are a lot of constraints of the form "X must be specified, or if Y is specified Z must also be specified". Can we make sure these constraints are validated and can we add unit tests to make sure they fail fast with user friendly errors?
>
> Other than this, looks good! Just minor comments.
>
> The PR is a bit large, in the future it would be great to submit a series of smaller PRs.

Reference added. Our initial thought was that people who want to run Ray on vSphere should already have some basic knowledge of vSphere. It is a commercial product, so it is rare for someone to buy it without learning about it. 😄
Example snippets added.
We will add this in the next PR. My idea is to add a validator function that checks the node config at an early stage, covering all the combinations, and then add a UT that covers all the cases for the validator function.
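As a rough illustration of such a validator, here is a sketch that encodes the frozen_vm rules stated in the docs for this PR: without ``library_item``, either ``name`` or ``resource_pool`` is needed; with ``library_item``, ``name`` and ``datastore`` are required, plus either ``resource_pool`` or ``cluster``. The function name `validate_frozen_vm_config` and the error wording are hypothetical, and the validator that was eventually added to config.py may differ.

```python
from typing import Any, Dict, List


def validate_frozen_vm_config(frozen_vm: Dict[str, Any]) -> None:
    """Raise ValueError listing every violated frozen_vm constraint."""
    errors: List[str] = []
    if frozen_vm.get("library_item"):
        # Deploying from an OVF template in the content library.
        if not frozen_vm.get("name"):
            errors.append("'name' must be set when 'library_item' is set")
        if not frozen_vm.get("datastore"):
            errors.append("'datastore' must be set when 'library_item' is set")
        if not (frozen_vm.get("resource_pool") or frozen_vm.get("cluster")):
            errors.append(
                "either 'resource_pool' or 'cluster' must be set "
                "when 'library_item' is set"
            )
    else:
        # Reusing frozen VM(s) that already exist.
        if not (frozen_vm.get("name") or frozen_vm.get("resource_pool")):
            errors.append(
                "either 'name' or 'resource_pool' must be set "
                "when 'library_item' is unset"
            )
    if errors:
        raise ValueError("Invalid frozen_vm config: " + "; ".join(errors))
```

Collecting all violations before raising lets the user fix the config in one pass, which fits the request for fail-fast, user-friendly errors.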

The PR is large because we didn't know that you have a code freeze and cherry-pick process. We made this change in our internal repo as 8 small MRs and intentionally put them on hold to wait for your 2.7.0 tag. But actually we needn't have worried about your 2.7.0 release, because the commits would not be cherry-picked if we reached consensus on this in our Slack channel.

From now on we will only raise small PRs, and we will pin the ones we would like you to cherry-pick in our Slack channel.

architkulkarni

Looks good to me, as long as the unit tests are added in the next PR.

architkulkarni · 2023-09-26T17:41:09Z

doc/source/cluster/vms/references/ray-cluster-configuration.rst

+
+The frozen VM related configurations.
+If the frozen VM(s) is/are existing, then ``library_item`` should be unset. Either an existing frozen VM should be specified by ``name``, or a resource pool name of frozen VMs on every ESXi (https://docs.vmware.com/en/VMware-vSphere/index.html) host should be specified by ``resource_pool``.
+If the frozen VM(s) is/are to be deployed from OVF template, then `library_item` must be set to point to an OVF template (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-AFEDC48B-C96F-4088-9C1F-4F0A30E965DE.html) in the content library. In such as case, ``name`` must be set to indicate the name or the name prefix of the frozen VM(s). Then, either ``resource_pool`` should be set to indicate that a set of frozen VMs will be created on each ESXi host of the resource pool, or ``cluster`` should be set to indicate that creating a single frozen VM in the vSphere cluster. The config ``datastore`` (https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-D5AB2BAD-C69A-4B8D-B468-25D86B8D39CE.html) is mandatory in this case.


I'm not sure whether raw URLs are automatically converted to hyperlinks in RST. It would be good to check this in the doc build output, or just mimic how URLs are written in other RST files.

architkulkarni · 2023-09-26T17:42:39Z

Lint failed: https://buildkite.com/ray-project/premerge/builds/6504#018ad097-37b8-460b-b2c0-9fae441269bb/185-440

You can run setup_hooks.sh to install a pre-push hook that will automatically run lint and prevent this issue

architkulkarni · 2023-09-26T17:46:19Z

Assigning @richardliaw as remaining codeowner

JingChen23 · 2023-09-27T00:37:21Z

> Lint failed: https://buildkite.com/ray-project/premerge/builds/6504#018ad097-37b8-460b-b2c0-9fae441269bb/185-440
>
> You can run setup_hooks.sh to install a pre-push hook that will automatically run lint and prevent this issue

Thank you! This has bothered me for some time.

JingChen23 · 2023-09-27T02:08:30Z

> Looks good to me, as long as the unit tests are added in the next PR.

I have added the unit tests to this PR in the latest commit; yesterday I didn't have time for that. Since this PR had some issues and wasn't merged, and I have plenty of time today, I paid off the tech debt.

python/ray/autoscaler/_private/vsphere/config.py

architkulkarni · 2023-09-27T17:05:15Z

Nice, thanks for adding the tests!

JingChen23 · 2023-09-28T04:40:44Z

@architkulkarni @richardliaw Could you please help merge this PR if Buildkite reports no issues related to this PR?

JingChen23 · 2023-09-28T05:24:52Z

The rst document

architkulkarni · 2023-09-28T16:23:08Z

Failed tests: learning_tests_pendulum_ddppo, nested_action_spaces_ppo_torch, test_legacy_dataset_config are all unrelated

* [Doc] Add vSphere Ray cluster launcher user guide (#39630)

  Similar to other providers, this change adds a user guide for the vSphere Ray cluster launcher, including how to prepare the vSphere environment and the frozen VM, as well as the general steps to launch the cluster. It also contains a section on how to use vSAN File Service to provision NFS endpoints as persistent storage for Ray AIR, with a new example YAML file. In addition, existing examples and docs are updated to include the correct command to install the vSphere Python SDK.

  Why are these changes needed? As mentioned in PR #39379, we need a dedicated user guide for launching Ray clusters on vSphere. This change does that with a newly added vsphere.md, including a solution for Ray 2.7's deprecation of syncing to head node for Ray AIR, using VMware vSAN File Service.

  Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
  Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

* [vSphere Provider] Optimize the log, and remove the part for connecting NIC in Python (#39143)

  This is one of the tech debts. The philosophy of this change is:
  - For one-time operations during ray up, such as creating the tag category and the tags on vSphere, keep using cli_logger.info.
  - For other code that is executed both during ray up and by the autoscaler on the head node, use the logger. Many logs are changed to debug level, except for the important ones, such as creating a VM, deleting a VM, and reusing an existing VM.

  This change also removes the logic for connecting the NIC. We don't need that part anymore, because a script in the customze.sh script planted in the frozen VM does the job. This script is executed once right after instant cloning.

  Signed-off-by: Chen Jing <jingch@vmware.com>

* [Cluster launcher] [vSphere] Support deploying one frozen VM, or a set of frozen VMs from OVF, then do ray up. (#39783)

  Signed-off-by: Chen Jing <jingch@vmware.com>

* [Doc] Update the vSphere cluster Launcher Maintainer. (#39758)

  Since Vinod has left the company, we need to update the vSphere Launcher maintainer list to add Roshan and Chen. Roshan acts as Vinod's successor, while Chen will be responsible for overseeing Ray-OSS and facilitating open-source development collaboration.

  Signed-off-by: Layne Peng <playne@vmware.com>

Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Layne Peng <playne@vmware.com>
Co-authored-by: Fangchi Wang <wfangchi@vmware.com>
Co-authored-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Layne Peng <appamail@hotmail.com>

Provide option to launch Frozen VM on each ESXi host

5be1cbc

Signed-off-by: Chen Jing <jingch@vmware.com>

JingChen23 requested review from architkulkarni, wuisawesome, DmitriGekhtman, maxpumperla, pcmoritz, kevin85421, a team, richardliaw, ericl and edoakes as code owners September 21, 2023 08:20

JingChen23 changed the title ~~Provide option to launch Frozen VM on each ESXi host~~ [vSphere] Support deploy the Frozen VM from OVF, and support deploying a Frozen VM on each ESXi host Sep 21, 2023

JingChen23 force-pushed the frozen-vm-optimization branch from 61d2513 to 5be1cbc Compare September 21, 2023 10:39

JingChen23 changed the title ~~[vSphere] Support deploy the Frozen VM from OVF, and support deploying a Frozen VM on each ESXi host~~ [WIP] Support deploy the Frozen VM from OVF, and support deploying a Frozen VM on each ESXi host Sep 21, 2023

JingChen23 changed the title ~~[WIP] Support deploy the Frozen VM from OVF, and support deploying a Frozen VM on each ESXi host~~ [WIP don't review] Support deploy the Frozen VM from OVF, and support deploying a Frozen VM on each ESXi host Sep 21, 2023

JingChen23 marked this pull request as draft September 22, 2023 04:22

JingChen23 added 2 commits September 25, 2023 10:10

Fix the issue of ray up on Frozen VM deployed from OVF

4f7ceea

Signed-off-by: Chen Jing <jingch@vmware.com>

Merge branch 'master' into frozen-vm-optimization

03a6922

Signed-off-by: Chen Jing <jingch@vmware.com>

JingChen23 changed the title ~~[WIP don't review] Support deploy the Frozen VM from OVF, and support deploying a Frozen VM on each ESXi host~~ [vSphere] Support deploying one frozen VM, or a set of frozen VMs from OVF, then do ray up. Sep 25, 2023

JingChen23 marked this pull request as ready for review September 25, 2023 02:17

JingChen23 commented Sep 25, 2023

python/ray/autoscaler/_private/vsphere/node_provider.py

python/ray/autoscaler/_private/vsphere/scheduler.py

add a missed comma in the bazel file, thanks buildkite.

12f5445

Signed-off-by: Chen Jing <jingch@vmware.com>

architkulkarni self-assigned this Sep 25, 2023

architkulkarni reviewed Sep 25, 2023

add the UT change to resolve the buildkite failure

bf6bea0

Signed-off-by: Chen Jing <jingch@vmware.com>

optimize the code to address comments

8a17fbc

Signed-off-by: Chen Jing <jingch@vmware.com>

architkulkarni approved these changes Sep 26, 2023

architkulkarni reviewed Sep 26, 2023

architkulkarni assigned richardliaw Sep 26, 2023

fix lint

eb160d7

Signed-off-by: Chen Jing <jingch@vmware.com>

add validator and UT, optimize rst doc

e40d44c

Signed-off-by: Chen Jing <jingch@vmware.com>

optimize the rst doc

53aa6c4

Signed-off-by: Chen Jing <jingch@vmware.com>

architkulkarni reviewed Sep 27, 2023

python/ray/autoscaler/_private/vsphere/config.py

optimize the validator's msg

efa6188

Signed-off-by: Chen Jing <jingch@vmware.com>

edoakes approved these changes Sep 28, 2023

architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 28, 2023

architkulkarni merged commit a069695 into ray-project:master Sep 28, 2023
103 of 107 checks passed

architkulkarni mentioned this pull request Sep 28, 2023

[Cluster Launcher] Vsphere fixes cherry pick #39954

Merged

8 tasks
