Add Ray cluster spec for TPU pods #1292
annotations:
  {}
labels:
  cloud.google.com/gke-ray-node-type: head
Is this a special label for TPU?
These labels are for TPU telemetry; they help us identify TPU node issues for customers. They are not strictly needed but are useful for support purposes.
Got it. Thanks!
imagePullSecrets:
  []
containers:
- volumeMounts:
Should we add the ports info explicitly?
Do you mean the TPU/JAX ports? I'm not sure if they are needed.
I mean the ports for the dashboard and GCS.
kuberay/ray-operator/config/samples/ray-cluster.complete.yaml
Lines 35 to 38 in 4f85055
- containerPort: 6379
  name: gcs
- containerPort: 8265
  name: dashboard
Thanks, added.
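For reference, the explicit ports discussed in this thread might look like the following on the head container. This is a minimal sketch, not necessarily the exact change made in this PR: the port numbers and names come from the referenced ray-cluster.complete.yaml, while the container name `ray-head` is a placeholder assumption.

```yaml
containers:
- name: ray-head            # placeholder name (assumption, not from the PR)
  ports:
  - containerPort: 6379     # GCS port, as in ray-cluster.complete.yaml
    name: gcs
  - containerPort: 8265     # Ray dashboard port, as in ray-cluster.complete.yaml
    name: dashboard
```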
LGTM. I cannot verify this YAML at the moment. I can try it once TPUs are available on GKE.
Add Ray cluster spec for TPU pods
This is a sample KubeRay spec for deploying a Ray cluster on a 2x2x2 TPU v4 topology.
For TPU onboarding (prerequisite), please refer to the documentation here: https://docs.google.com/document/d/1TRwtfi2pzXbT6We0WQdwNzDdKHruk3u1XeLLJwW38kE/edit#heading=h.8ntu2hqwqvhl
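
As a rough illustration of how a worker group could target a 2x2x2 TPU v4 slice on GKE, a sketch follows. It is not the spec added in this PR. It assumes the GKE node selector labels `cloud.google.com/gke-tpu-accelerator` and `cloud.google.com/gke-tpu-topology` and the `google.com/tpu` resource name; the group name, container name, and image are placeholders, and the replica count assumes the slice spans two hosts with four chips each. Check these values against the TPU onboarding documentation before use.

```yaml
workerGroupSpecs:
- groupName: tpu-workers                # placeholder name (assumption)
  replicas: 2                           # assumes 2 hosts x 4 chips for a 2x2x2 slice
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice   # assumed label value
        cloud.google.com/gke-tpu-topology: 2x2x2
      containers:
      - name: ray-worker                # placeholder name (assumption)
        image: rayproject/ray:latest    # placeholder image (assumption)
        resources:
          limits:
            google.com/tpu: 4           # TPU chips requested per host (assumption)
```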