[vSphere Provider] Fix the bug that multiple worker types doesn't work #40487

JingChen23 · 2023-10-19T06:16:56Z

Description

Currently the code assumes that there is only one worker node type.
This change fixes that bug so that multiple worker node types are supported.
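
A minimal sketch of the idea, not the code from this PR: instead of looking up a single hard-coded worker entry, walk every non-head entry in available_node_types. The function name fill_worker_node_configs and its defaults argument are hypothetical, and "ray.head.default" is only assumed as a fallback head type name because the test YAML in this description uses it.

```python
import copy


def fill_worker_node_configs(config: dict, defaults: dict) -> dict:
    """Return a copy of config with defaults applied to every worker node type.

    Hypothetical helper for illustration; not the Ray implementation.
    """
    config = copy.deepcopy(config)
    # Assumption: fall back to the head type name used in the test YAML.
    head_type = config.get("head_node_type", "ray.head.default")
    for name, node_type in config["available_node_types"].items():
        if name == head_type:
            continue  # the head node is configured separately
        node_config = node_type.setdefault("node_config", {})
        for key, value in defaults.items():
            # Keep any value the user set explicitly for this worker type.
            node_config.setdefault(key, value)
    return config


cluster = {
    "available_node_types": {
        "ray.head.default": {"node_config": {"resource_pool": "test"}},
        "worker": {"node_config": {"resource_pool": "test1"}},
        "worker1": {"node_config": {}},
    }
}
filled = fill_worker_node_configs(cluster, {"resource_pool": "default-pool"})
```

Because the loop visits each worker type by name, adding a third or fourth type to the config needs no further code change in this sketch.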

Test

Added a unit test assertion to cover the new case.

Tested with a YAML snippet like this:

available_node_types:
    ray.head.default:
        resources: {"CPU": 8, "Memory": 8192}
        node_config:
            resource_pool: test
            datastore: vsanDatastore
    worker:
        min_workers: 1
        max_workers: 5
        resources: {"CPU": 4, "Memory": 8192}
        node_config:
            resource_pool: test1
            datastore: vsanDatastore
    worker1:
        min_workers: 1
        max_workers: 5
        resources: {"CPU": 2, "Memory": 4096}
        node_config:
            resource_pool: test1
            datastore: vsanDatastore

Verified that the workers are in the expected resource pool and that the two node types have the expected resources.

The Ray cluster also looks good from the dashboard.

Signed-off-by: Chen Jing <jingch@vmware.com>

architkulkarni · 2023-10-19T16:37:07Z

python/ray/autoscaler/_private/vsphere/config.py


-    worker_node_config["datastore"] = worker_datastore


We don't need datastore anymore?

JingChen23 · 2023-10-20 (edited)

Please forgive the original code, which is hard to read.

Before this change, the logic was: if the worker node's datastore is unset, use the head node's datastore by default.

But what we actually want (and what our documentation states) is: if the worker node's datastore is unset, use the frozen VM's datastore.

So I made this change. We now leave the datastore empty for the worker node. With this parameter empty, the worker automatically uses the frozen VM's datastore.
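
The difference between the two behaviors can be sketched as below. This only models the outcome described in the reply; the function names and the frozen_vm_datastore parameter are hypothetical, and the real fallback is said to happen automatically when the datastore parameter is left empty.

```python
from typing import Optional


def resolve_datastore_before(
    worker_node_config: dict, head_node_config: dict
) -> Optional[str]:
    """Old behavior: an unset worker datastore inherits the head node's datastore."""
    return worker_node_config.get("datastore") or head_node_config.get("datastore")


def resolve_datastore_after(
    worker_node_config: dict, frozen_vm_datastore: str
) -> str:
    """New behavior: an unset worker datastore ends up on the frozen VM's datastore."""
    return worker_node_config.get("datastore") or frozen_vm_datastore


head = {"datastore": "vsanDatastore"}
worker_unset = {"resource_pool": "test1"}  # no datastore key
old_choice = resolve_datastore_before(worker_unset, head)
new_choice = resolve_datastore_after(worker_unset, "frozenVmDatastore")
```

In both versions an explicitly configured worker datastore wins; only the default for an unset value changes.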

architkulkarni

Looks good, one minor question

architkulkarni · 2023-10-23T23:07:17Z

The linkcheck, chaos test, and HA test failures are unrelated; this PR only touches the vSphere cluster launcher.

…oesn't work (ray-project#40487)

Currently our code assumes that there is only one worker node type. In this change I fix the bug to let it support multiple worker node types.

Signed-off-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

…sue and support GPU nodes (#40667)

* [Cluster launcher] [vSphere] Fix the bug that multiple worker types doesn't work (#40487)

  Currently our code assumes that there is only one worker node type. In this change I fix the bug to let it support multiple worker node types.

  Signed-off-by: Chen Jing <jingch@vmware.com>
  Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

* [cluster launcher] [vSphere Provider] Fix vc conn timout issue (#40516)

  Fixed the issue using SessionOrientedStub. A session-oriented stub adapter that will relogin to the destination if a session-oriented exception is thrown.

  ---------

  Signed-off-by: Chen Jing <jingch@vmware.com>

* [cluster launcher] [vSphere Provider] Support GPU Ray nodes on vSphere (#40616)

  This is for supporting passthrough the GPU on vSphere ESXi host into the Ray nodes.

  ---------

  Signed-off-by: Chen Jing <jingch@vmware.com>

* [cluster launcher] [vSphere] Do not fetch runtime-info of vm from cached_nodes (#40655)

  Power-on-off status is runtime info of VM, should not fetch it from cached-nodes, which is probably dirty data. It should query by pyvmomi_sdk every time.

  Signed-off-by: Chen Hui <huchen@vmware.com>

---------

Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Chen Hui <huchen@vmware.com>
Co-authored-by: Chen Jing <jingch@vmware.com>
Co-authored-by: huchen2021 <85480625+huchen2021@users.noreply.github.com>

[PROT-317] Fix the bug that multiple worker types doesn't work

22e877f

Signed-off-by: Chen Jing <jingch@vmware.com>

JingChen23 requested review from ericl, architkulkarni and a team as code owners October 19, 2023 06:16

architkulkarni self-assigned this Oct 19, 2023

architkulkarni reviewed Oct 19, 2023

architkulkarni approved these changes Oct 19, 2023

architkulkarni added 2 commits October 20, 2023 10:06

Merge branch 'master' into multi-worker-type-support

cac284c

Merge branch 'master' into multi-worker-type-support

c2270ab

architkulkarni added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) Oct 23, 2023

architkulkarni merged commit 7b83147 into ray-project:master Oct 23, 2023
39 of 43 checks passed

architkulkarni mentioned this pull request Oct 25, 2023

[Cluster launcher] [vSphere] Fix multiple worker_types and timeout issue and support GPU nodes #40667

Merged

8 tasks

JingChen23 commented Oct 20, 2023 (edited)
