Add Ray cluster spec for TPU pods #1292

Merged (6 commits) on Aug 10, 2023

richardsliu
Contributor

This is a sample KubeRay spec for deploying a Ray cluster on a 2x2x2 TPU v4 topology.

For TPU onboarding (prerequisite), please refer to the documentation here: https://docs.google.com/document/d/1TRwtfi2pzXbT6We0WQdwNzDdKHruk3u1XeLLJwW38kE/edit#heading=h.8ntu2hqwqvhl
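
For readers unfamiliar with the shape of such a spec, here is a minimal sketch of what a TPU worker group in a RayCluster might look like. It is not the merged sample file. The group name, container name, image tag, and replica count are illustrative assumptions, and the node selector keys and the `google.com/tpu` resource name are taken from GKE's TPU documentation rather than from this PR. The replica count assumes a 2x2x2 v4 slice has 8 chips at 4 chips per host.

```yaml
# Sketch only: a hypothetical worker group for a 2x2x2 TPU v4 slice.
workerGroupSpecs:
- groupName: tpu-workers            # hypothetical name
  replicas: 2                       # assumption: 8 chips / 4 chips per host
  rayStartParams: {}
  template:
    spec:
      nodeSelector:
        # Selector keys as documented by GKE for TPU node pools (assumed here).
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x2
      containers:
      - name: ray-worker            # hypothetical name
        image: rayproject/ray:2.6.0 # assumed tag, pick the one you actually use
        resources:
          limits:
            google.com/tpu: "4"     # one pod per host, all chips on that host
          requests:
            google.com/tpu: "4"
```

Requesting all of a host's chips in one pod keeps a single Ray worker per TPU host, which is the simplest layout to reason about. Other layouts may be possible.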

rkooo567 assigned and unassigned rkooo567 on Aug 5, 2023
architkulkarni self-assigned this on Aug 7, 2023
ray-operator/config/samples/ray-cluster-tpu.yaml (outdated)
annotations:
  {}
labels:
  cloud.google.com/gke-ray-node-type: head
Member

Is this a special label for TPU?

Contributor Author

These labels are for TPU telemetry: they help us identify TPU node issues for customers. They are not strictly needed but are useful for support purposes.

Member

Got it. Thanks!

ray-operator/config/samples/ray-cluster-tpu.yaml (outdated)
imagePullSecrets:
  []
containers:
- volumeMounts:
Member

Should we add ports info explicitly?

Contributor Author

Do you mean the TPU/JAX ports? I'm not sure if they are needed.

Member

I mean the ports for the dashboard and GCS.

- containerPort: 6379
  name: gcs
- containerPort: 8265
  name: dashboard
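
For context, a sketch of where those two entries would sit in the head pod template. The container name and image tag are assumptions for illustration and are not taken from the sample file in this PR.

```yaml
# Sketch only: head container with the GCS and dashboard ports declared.
containers:
- name: ray-head                  # hypothetical name
  image: rayproject/ray:2.6.0     # assumed tag
  ports:
  - containerPort: 6379
    name: gcs
  - containerPort: 8265
    name: dashboard
```

Declaring the ports explicitly mostly serves as documentation in the pod spec, and the names make it easy to tell at a glance which port belongs to the GCS and which to the dashboard.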

Contributor Author

Thanks, added.

Member

kevin85421 left a comment

LGTM. I cannot verify this YAML at this moment. I can try it after TPU is available on GKE.

kevin85421 merged commit 80a6d58 into ray-project:master on Aug 10, 2023
19 of 21 checks passed
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 15, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Add Ray cluster spec for TPU pods