Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vSphere Provider] Fix the bug that multiple worker types doesn't work #40487

Merged

Conversation

JingChen23
Copy link
Contributor

Description

Currently our code assumes that there is only one worker node type.
In this change I fix the bug to let it support multiple worker node types.

Test

Added UT assert to cover the new case.

Used a yaml snippet like this:

available_node_types:
    ray.head.default:
        resources: {"CPU": 8, "Memory": 8192}
        node_config:
            resource_pool: test
            datastore: vsanDatastore
    worker:
        min_workers: 1
        max_workers: 5
        resources: {"CPU": 4, "Memory": 8192}
        node_config:
            resource_pool: test1
            datastore: vsanDatastore
    worker1:
        min_workers: 1
        max_workers: 5
        resources: {"CPU": 2, "Memory": 4096}
        node_config:
            resource_pool: test1
            datastore: vsanDatastore

Verified that the workers are in the expected resource pool, the 2 different node types resources are expected:

image

image
Also the Ray cluster looks good from the dashboard:

image

Signed-off-by: Chen Jing <jingch@vmware.com>
@architkulkarni architkulkarni self-assigned this Oct 19, 2023

worker_node_config["datastore"] = worker_datastore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need datastore anymore?

Copy link
Contributor Author

@JingChen23 JingChen23 Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please forgive the original code which is hard to read.

Before the change, the logic is: if the datastore of the worker node is unset, use the head node's datastore by default.

But, what we actually want is (which is also published in our documentations): if the datastore of the worker node is unset, use the frozen VM's datastore.

So I make this change. We left the datastore empty for the worker node. It will automatically use the datastore of the frozen VM with this empty parameter.

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, one minor question

@architkulkarni
Copy link
Contributor

Linkcheck and chaos test and HA tests are unrelated, this PR only touches the vSphere cluster launcher.

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 23, 2023
@architkulkarni architkulkarni merged commit 7b83147 into ray-project:master Oct 23, 2023
39 of 43 checks passed
architkulkarni added a commit to architkulkarni/ray that referenced this pull request Oct 25, 2023
…oesn't work (ray-project#40487)

Currently our code assumes that there is only one worker node type.
In this change I fix the bug to let it support multiple worker node types.

Signed-off-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
vitsai pushed a commit that referenced this pull request Oct 26, 2023
…sue and support GPU nodes (#40667)

* [Cluster launcher] [vSphere] Fix the bug that multiple worker types doesn't work (#40487)

Currently our code assumes that there is only one worker node type.
In this change I fix the bug to let it support multiple worker node types.

Signed-off-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>

* [cluster launcher] [vSphere Provider] Fix vc conn timout issue (#40516)

Fixed the issue using SessionOrientedStub. A session-oriented stub adapter that will relogin to the destination if a session-oriented exception is thrown.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>

* [cluster launcher] [vSphere Provider] Support GPU Ray nodes on vSphere (#40616)

This is for supporting passthrough the GPU on vSphere ESXi host into the Ray nodes.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>

* [cluster launcher] [vSphere] Do not fetch runtime-info of vm from cached_nodes (#40655)

Power-on-off status is runtime info of VM, should not fetch it from cached-nodes, which is probably dirty data.
It should query by pyvmomi_sdk every time.

Signed-off-by: Chen Hui <huchen@vmware.com>

---------

Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Chen Hui <huchen@vmware.com>
Co-authored-by: Chen Jing <jingch@vmware.com>
Co-authored-by: huchen2021 <85480625+huchen2021@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants