Build Headless Service for Multi-Host TPU Worker Pods #1920
Following up on this @kevin85421 now that #1913 has been merged.
CI has some errors. You should be able to run the unit tests and lint tests locally. Refer to https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md for more details.
826b979 to f1a67bd
LGTM. Could you
(1) address https://github.com/ray-project/kuberay/pull/1920/files#r1503536623 => [Update] I will open a follow-up directly.
(2) explain in detail how you manually tested this PR? Did you check the endpoints attached to the headless service?
Thanks!
This reverts commit 305220a.
I manually tested this by creating a RayCluster with a multi-host worker group and checking that the headless service was created. I then verified that the worker endpoints were exposed by the headless service.
Thanks!
Why are these changes needed?
To support multi-host TPU worker groups with KubeRay, a headless service is needed that exposes worker pods so they can communicate pod-to-pod. This PR builds a headless service that selects for worker nodes in a RayCluster with NumOfHosts > 1.
cc: @richardsliu @kevin85421
Related issue number
Checks
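For readers unfamiliar with the idea described under "Why are these changes needed?", here is a minimal Go sketch of what "building a headless service for multi-host worker groups" could look like. It is not the code from this PR. The `Service`, `ServiceSpec`, and `WorkerGroup` types are simplified local stand-ins for the Kubernetes and KubeRay types. The service name suffix and the helper names are hypothetical, and the label keys `ray.io/cluster` and `ray.io/node-type` are assumed to match what KubeRay puts on worker pods. The one Kubernetes fact the sketch relies on is that a Service with `clusterIP: None` is headless, so DNS resolves directly to the backing pod IPs. The selector here matches every worker pod in the cluster, which may be broader than what the PR does.

```go
package main

import "fmt"

// ServiceSpec mirrors only the Kubernetes Service fields this sketch needs.
type ServiceSpec struct {
	ClusterIP string            // "None" makes the service headless
	Selector  map[string]string // labels that backing pods must carry
}

// Service is a simplified stand-in for a Kubernetes Service object.
type Service struct {
	Name      string
	Namespace string
	Spec      ServiceSpec
}

// WorkerGroup is a hypothetical stand-in for a RayCluster worker group spec.
type WorkerGroup struct {
	GroupName  string
	NumOfHosts int32
}

// needsHeadlessService reports whether any worker group is multi-host,
// i.e. has NumOfHosts > 1.
func needsHeadlessService(groups []WorkerGroup) bool {
	for _, g := range groups {
		if g.NumOfHosts > 1 {
			return true
		}
	}
	return false
}

// buildHeadlessWorkerService returns a headless service that selects the
// worker pods of the given cluster. Name suffix and label keys are assumptions.
func buildHeadlessWorkerService(clusterName, namespace string) Service {
	return Service{
		Name:      clusterName + "-headless-worker-svc", // hypothetical name
		Namespace: namespace,
		Spec: ServiceSpec{
			ClusterIP: "None", // headless: no virtual IP is allocated
			Selector: map[string]string{
				"ray.io/cluster":   clusterName, // assumed KubeRay label
				"ray.io/node-type": "worker",    // assumed KubeRay label
			},
		},
	}
}

func main() {
	groups := []WorkerGroup{
		{GroupName: "cpu-group", NumOfHosts: 1},
		{GroupName: "tpu-group", NumOfHosts: 4},
	}
	if needsHeadlessService(groups) {
		svc := buildHeadlessWorkerService("demo-cluster", "default")
		fmt.Printf("%s/%s clusterIP=%s selector=%v\n",
			svc.Namespace, svc.Name, svc.Spec.ClusterIP, svc.Spec.Selector)
	}
}
```

The manual test described in the conversation corresponds to checking that such a service exists for the cluster and that its endpoints list the worker pod IPs.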