[Cluster launcher] [vSphere] Fix multiple worker_types and timeout issue and support GPU nodes #40667

Merged

architkulkarni
Contributor

Why are these changes needed?

Cherry-picks the following PRs to the 2.8 release branch:

The changes are localized to the vSphere cluster launcher, so they will not affect any other Ray component.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

JingChen23 and others added 4 commits October 25, 2023 11:56
…oesn't work (ray-project#40487)

Currently, the code assumes that there is only one worker node type.
This change fixes that bug so that multiple worker node types are supported.
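
As a rough illustration of the idea (not the code from this PR), the sketch below treats every entry in `available_node_types` other than the head node type as a worker type, instead of assuming there is exactly one. The keys `head_node_type` and `available_node_types` follow the Ray cluster config layout; the node type names `worker` and `worker_gpu` and the helper `worker_node_types` are hypothetical.

```python
from typing import Any, Dict, List


def worker_node_types(config: Dict[str, Any]) -> List[str]:
    """Return every node type name except the head node type."""
    head = config["head_node_type"]
    return [name for name in config["available_node_types"] if name != head]


# Hypothetical cluster config fragment; the node type names are illustrative.
example_config = {
    "head_node_type": "ray.head.default",
    "available_node_types": {
        "ray.head.default": {"resources": {"CPU": 2}, "node_config": {}},
        "worker": {"resources": {"CPU": 4}, "min_workers": 1, "node_config": {}},
        "worker_gpu": {
            "resources": {"CPU": 4, "GPU": 1},
            "min_workers": 0,
            "node_config": {},
        },
    },
}

# Handle each worker type separately rather than picking a single one.
for name in worker_node_types(example_config):
    node_type = example_config["available_node_types"][name]
    print(name, node_type["resources"])
```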

Signed-off-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
…roject#40516)

Fixed the issue using SessionOrientedStub, a session-oriented stub adapter that logs in to the destination again if a session-oriented exception is thrown.
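
The re-login behavior described above can be sketched in plain Python as follows. This is not how SessionOrientedStub is implemented and it does not use pyVmomi; `SessionExpiredError`, `FakeService`, and `call_with_relogin` are hypothetical names that only illustrate the retry-after-login pattern.

```python
class SessionExpiredError(Exception):
    """Hypothetical stand-in for a session-oriented exception."""


class FakeService:
    """Hypothetical remote service whose session can expire."""

    def __init__(self):
        self.logged_in = False
        self.login_count = 0

    def login(self):
        self.logged_in = True
        self.login_count += 1

    def expire(self):
        self.logged_in = False

    def list_vms(self):
        if not self.logged_in:
            raise SessionExpiredError("session is no longer valid")
        return ["vm-1", "vm-2"]


def call_with_relogin(service, method_name, *args, retries=1, **kwargs):
    """Call a method; on a session error, log in again and retry."""
    for attempt in range(retries + 1):
        try:
            return getattr(service, method_name)(*args, **kwargs)
        except SessionExpiredError:
            if attempt == retries:
                raise  # out of retries, surface the error to the caller
            service.login()


demo = FakeService()
demo.login()
demo.expire()  # simulate a session timeout between calls
print(call_with_relogin(demo, "list_vms"))
```

The caller no longer needs to know whether the session timed out between calls; the wrapper logs in again and repeats the request once.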

---------

Signed-off-by: Chen Jing <jingch@vmware.com>
ray-project#40616)

This adds support for passing through the GPUs on a vSphere ESXi host to the Ray nodes.

---------

Signed-off-by: Chen Jing <jingch@vmware.com>
…hed_nodes (ray-project#40655)

Power on/off status is runtime information about a VM, so it should not be fetched from the cached nodes, which may hold stale data.
It should be queried through pyvmomi_sdk every time.
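
A minimal sketch of why the cached value can be wrong, assuming a hypothetical client and provider (`FakeVSphereClient`, `NodeProviderSketch`) rather than the real vSphere node provider. The state strings `poweredOn` and `poweredOff` are the vSphere power state names; everything else is illustrative.

```python
class FakeVSphereClient:
    """Hypothetical client that holds the live power state of each VM."""

    def __init__(self, power_states):
        self._power_states = dict(power_states)

    def power_off(self, vm_id):
        self._power_states[vm_id] = "poweredOff"

    def get_power_state(self, vm_id):
        return self._power_states[vm_id]


class NodeProviderSketch:
    """Hypothetical provider that keeps a cache of node snapshots."""

    def __init__(self, client, vm_ids):
        self.client = client
        # The snapshot is taken once and is not updated automatically.
        self.cached_nodes = {
            vm_id: {"power_state": client.get_power_state(vm_id)}
            for vm_id in vm_ids
        }

    def is_running_from_cache(self, vm_id):
        # May return a stale answer if the VM changed state after caching.
        return self.cached_nodes[vm_id]["power_state"] == "poweredOn"

    def is_running(self, vm_id):
        # Ask the backend on every call, because power state is runtime data.
        return self.client.get_power_state(vm_id) == "poweredOn"


client = FakeVSphereClient({"vm-1": "poweredOn"})
provider = NodeProviderSketch(client, ["vm-1"])
client.power_off("vm-1")  # state changes after the cache was filled
print(provider.is_running_from_cache("vm-1"), provider.is_running("vm-1"))
```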

Signed-off-by: Chen Hui <huchen@vmware.com>
Collaborator

@zhe-thoughts zhe-thoughts left a comment

All changes are contained within vSphere support, and there's a deadline for this feature. Let's pick.

@vitsai vitsai merged commit dd3e687 into ray-project:releases/2.8.0 Oct 26, 2023
41 of 47 checks passed
Labels
release-blocker P0 Issue that blocks the release