Build Headless Service for Multi-Host TPU Worker Pods #1920

Merged (15 commits) on Feb 27, 2024

ryanaoleary
Contributor

Why are these changes needed?

To support multi-host TPU worker groups with KubeRay, we need a headless service that exposes the worker pods so they can communicate pod-to-pod. This PR builds a headless service that selects the worker pods in a RayCluster with NumOfHosts > 1. cc: @richardsliu @kevin85421
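
For readers unfamiliar with headless services, here is a minimal sketch of the idea in Go. It is not the code from this PR: the struct types are local stand-ins for the Kubernetes Service type (real code would use `corev1.Service` from `k8s.io/api/core/v1`), and the function names, the service name suffix, and the label keys are assumptions chosen for illustration. The one part that is standard Kubernetes behavior is that setting `clusterIP` to `None` makes a Service headless, so DNS returns the selected pods' own IPs.

```go
package main

import "fmt"

// Local stand-ins for the few Kubernetes Service fields this sketch needs.
// Real code would use corev1.Service from k8s.io/api/core/v1 instead.
type ServiceSpec struct {
	ClusterIP string            // "None" makes the Service headless
	Selector  map[string]string // labels a Pod must carry to be selected
}

type Service struct {
	Name      string
	Namespace string
	Spec      ServiceSpec
}

// needsHeadlessService reports whether any worker group is multi-host,
// i.e. has NumOfHosts > 1 (the condition described in this PR).
func needsHeadlessService(numOfHostsPerGroup []int32) bool {
	for _, n := range numOfHostsPerGroup {
		if n > 1 {
			return true
		}
	}
	return false
}

// buildHeadlessWorkerService returns a headless Service that selects the
// worker Pods of one RayCluster. The name suffix and the label keys are
// assumptions for illustration, not taken from the PR diff.
func buildHeadlessWorkerService(clusterName, namespace string) Service {
	return Service{
		Name:      clusterName + "-headless-worker-svc", // hypothetical naming
		Namespace: namespace,
		Spec: ServiceSpec{
			ClusterIP: "None",
			Selector: map[string]string{
				"ray.io/cluster":   clusterName, // assumed label key
				"ray.io/node-type": "worker",    // assumed label key and value
			},
		},
	}
}

func main() {
	// One single-host group and one 4-host group: a headless Service is needed.
	if needsHeadlessService([]int32{1, 4}) {
		svc := buildHeadlessWorkerService("raycluster-sample", "default")
		fmt.Printf("%s/%s clusterIP=%s selector=%v\n",
			svc.Namespace, svc.Name, svc.Spec.ClusterIP, svc.Spec.Selector)
	}
}
```

Because the Service has no cluster IP, each worker pod stays individually addressable, which is what pod-to-pod communication between TPU hosts needs.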

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@ryanaoleary
Contributor Author

Following up on this @kevin85421 now that #1913 has been merged.

Resolved review threads (outdated):

  • ray-operator/controllers/ray/utils/constant.go (3 threads)
  • ray-operator/controllers/ray/common/service.go (4 threads)
  • ray-operator/controllers/ray/raycluster_controller.go (1 thread)

@kevin85421
Member

CI has some errors. You should be able to run the unit tests and lint tests locally. Refer to https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md for more details.

Member

@kevin85421 kevin85421 left a comment

LGTM. Could you
(1) address https://github.com/ray-project/kuberay/pull/1920/files#r1503536623 => [Update] I will open a follow-up directly.
(2) explain in detail how you manually tested this PR? Did you check the endpoints attached to the headless service?

Thanks!

@ryanaoleary
Contributor Author

ryanaoleary commented Feb 27, 2024

I tested manually by creating a RayCluster with a multi-host worker group and confirming that the headless service was created. I then verified that the headless service exposed the worker endpoints.

@kevin85421 kevin85421 merged commit be4f988 into ray-project:master Feb 27, 2024
23 checks passed
@kevin85421
Member

Thanks!
