[Bug] Kubernetes: GPU resource detection not accurate #20265
Comments
It may be worth also posting this issue in the KubeRay repo.
Hmm, why? What is the issue with the cluster config?
It's hard for me to say at this point -- I need to learn more about the KubeRay operator, which processes the config.
Ah ok. I didn't realize that the KubeRay and native Ray cluster configs are so different. So if I use the native RayOperator, will I no longer have this issue?
The two operators are independently developed solutions for the same problem. The operator in the main Ray repo reads GPU resources from the k8s pod spec and advertises those to Ray (overriding the built-in detection, which is likely to pick up the host's resources). The config linked does not specify GPU resources -- is the expected behavior for Ray not to pick up any GPU resources?
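For illustration, here is a rough sketch of the idea described above: read the GPU count from a pod spec and pass it to Ray explicitly instead of relying on auto-detection. This is not the operator's actual code. The helper names `gpus_from_pod_spec` and `ray_start_args` are hypothetical, the resource name `nvidia.com/gpu` assumes the NVIDIA device plugin, and the sketch assumes the Ray container is the first container in the pod.

```python
from typing import Any, Dict, List

# Assumption: GPUs are exposed under the NVIDIA device plugin's resource name.
GPU_RESOURCE = "nvidia.com/gpu"


def gpus_from_pod_spec(pod_spec: Dict[str, Any], container_index: int = 0) -> int:
    """Return the GPU count declared for one container of a pod spec.

    Hypothetical helper: looks at resource limits first, then requests,
    and returns 0 when no GPU resource is declared.
    """
    containers = pod_spec.get("containers", [])
    if not containers:
        return 0
    resources = containers[container_index].get("resources", {})
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    raw = limits.get(GPU_RESOURCE, requests.get(GPU_RESOURCE, 0))
    # Kubernetes quantities for GPUs are whole numbers, often written as strings.
    return int(raw)


def ray_start_args(pod_spec: Dict[str, Any]) -> List[str]:
    """Build a `ray start` command that advertises the pod's GPU count explicitly."""
    return ["ray", "start", f"--num-gpus={gpus_from_pod_spec(pod_spec)}"]


# Example pod spec fragment (illustrative values only).
example_pod_spec = {
    "containers": [
        {
            "name": "ray-worker",
            "resources": {"limits": {"cpu": "1", GPU_RESOURCE: "1"}},
        }
    ]
}

if __name__ == "__main__":
    print(ray_start_args(example_pod_spec))
```

With an explicit `--num-gpus` value, the node reports the pod's allocation rather than whatever the host exposes. A pod spec with no GPU entry yields `--num-gpus=0` in this sketch.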
cc @pcmoritz
I would like Ray to pick up GPU resources. How do I specify that in config?
You should be able to do that by adding a
This has been fixed. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/gpu.html for details.
Search before asking
Ray Component
Ray Clusters
What happened + What you expected to happen
Ray's automatic GPU resource detection reflects the physical host's GPUs rather than the GPUs allocated to each Kubernetes pod.
When using ray.cluster_resources() on a Ray K8s cluster that is on a host machine with 8 GPUs total, I am getting output like:
Versions / Dependencies
Python 3.7.7
Ray 1.7.1
Ubuntu 18.04.5 LTS
Reproduction script
The cluster config used is here: https://github.com/ray-project/kuberay/blob/2c2bc7defbcb8c930d821dd8854bde5f44006cbb/ray-operator/config/samples/ray-cluster.heterogeneous.yaml
Python workload (borrowed from https://docs.ray.io/en/latest/cluster/quickstart.html#ref-cluster-quick-start):
run:
ADDRESS=ray://... python script.py
Anything else
No response
Are you willing to submit a PR?