[Bug] Kubernetes: GPU resource detection not accurate #20265

Closed
iconix opened this issue Nov 11, 2021 · 9 comments

iconix commented Nov 11, 2021

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

Ray's automatic GPU resource detection reflects the physical host's GPUs rather than the GPUs allocated to each Kubernetes pod.

When calling ray.cluster_resources() on a Ray K8s cluster whose host machine has 8 GPUs in total, I get output like:

This cluster consists of
    5 nodes in total
    5.0 CPU resources in total
    40.0 GPU resources in total

That is, each of the 5 pods reports all 8 of the host's GPUs (5 × 8 = 40) rather than the GPUs allocated to that pod.

Versions / Dependencies

Python 3.7.7
Ray 1.7.1
Ubuntu 18.04.5 LTS

Reproduction script

The cluster config used is here: https://github.com/ray-project/kuberay/blob/2c2bc7defbcb8c930d821dd8854bde5f44006cbb/ray-operator/config/samples/ray-cluster.heterogeneous.yaml

Python workload (borrowed from https://docs.ray.io/en/latest/cluster/quickstart.html#ref-cluster-quick-start):

from collections import Counter
import os
import socket
import time

import ray

ray.init(address=os.getenv('ADDRESS'))

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
    {} GPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU'], ray.cluster_resources()['GPU']))

print(ray.cluster_resources())
print(ray.available_resources())

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))

run: ADDRESS=ray://... python script.py

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
iconix added the bug and triage labels on Nov 11, 2021
DmitriGekhtman self-assigned this on Nov 11, 2021
DmitriGekhtman (Contributor) commented:

It may be worth also posting this issue in the KubeRay repo.

iconix (Author) commented Nov 11, 2021

> It may be worth also posting this issue in the KubeRay repo.

Hmm why? What is the issue with the cluster config?

DmitriGekhtman (Contributor) commented:

It's hard for me to say at this point -- I need to learn more about the KubeRay operator, which processes the config.

iconix (Author) commented Nov 12, 2021

> It's hard for me to say at this point -- I need to learn more about the KubeRay operator, which processes the config.

Ah, OK. I didn't realize that the KubeRay and native Ray cluster configs are so different.

So if I use the native Ray operator, will I no longer have this issue?

DmitriGekhtman (Contributor) commented:

The two operators are independently developed solutions for the same problem.
In the future, we're aiming to simplify and have one preferred solution (likely based more closely on the KubeRay operator), exactly to reduce this sort of confusion.

The operator in the main Ray repo reads GPU resources from the k8s pod spec and advertises those to Ray (overriding the built-in detection, which is likely to pick up the host's resources).
It appears that the KubeRay operator needs to be modified to do the same thing.

The config linked does not specify GPU resources -- is the expected behavior for Ray not to pick up any GPU resources?
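
For concreteness, "GPU resources in the k8s pod spec" means a container-level limit along the lines of the fragment below. This is illustrative only, not taken from the linked config: nvidia.com/gpu is the standard NVIDIA device-plugin resource name, and the image tag is a placeholder.

containers:
  - name: ray-worker
    image: rayproject/ray:latest-gpu    # placeholder image tag
    resources:
      limits:
        nvidia.com/gpu: 1               # the GPU limit the operator reads and advertises to Ray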

DmitriGekhtman (Contributor) commented:

cc @pcmoritz

iconix (Author) commented Nov 12, 2021

I would like Ray to pick up GPU resources. How do I specify that in config?

DmitriGekhtman (Contributor) commented Nov 12, 2021

You should be able to do that by adding a num-gpus parameter to the rayStartParams for the headGroupSpec or the desired workerGroupSpec, depending on whether you want the head or a particular worker group to use GPUs.
The setting is similar to num-cpus in the example.
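
A sketch of how that could look, following the rayStartParams layout of the linked heterogeneous example; the group name, replica count, and resource values are illustrative, and the exact CRD field names may differ slightly between KubeRay versions:

workerGroupSpecs:
  - groupName: gpu-group        # illustrative group name
    replicas: 1
    rayStartParams:
      num-cpus: "1"
      num-gpus: "1"             # advertise one GPU per worker pod to Ray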

DmitriGekhtman added this to the Serverless Autoscaling milestone on Jan 4, 2022
AmeerHajAli added the "infra autoscaler, ray client, kuberay, related issues" label on Mar 26, 2022
DmitriGekhtman (Contributor) commented:

This has been fixed. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/gpu.html for details.
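
In short, per that guide, KubeRay now derives the Ray GPU count from the container's GPU limit, so an explicit num-gpus entry in rayStartParams is no longer required (it can still be set as an override). A minimal illustrative worker-group fragment, with placeholder names, image tag, and values:

workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 1
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:latest-gpu   # placeholder image tag
            resources:
              limits:
                nvidia.com/gpu: 1   # KubeRay advertises this limit to Ray as the node's GPU count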
