[Bug] Kubernetes: GPU resource detection not accurate #20265

Closed
iconix opened this issue Nov 11, 2021 · 9 comments

iconix commented Nov 11, 2021

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

Ray's automatic GPU resource detection reflects the physical host's GPUs rather than the GPUs allocated to each Kubernetes pod.

When calling ray.cluster_resources() on a Ray K8s cluster whose host machine has 8 GPUs in total, I get output like:

This cluster consists of
    5 nodes in total
    5.0 CPU resources in total
    40.0 GPU resources in total

That is, each of the 5 pods reports all 8 of the host's GPUs (5 × 8 = 40) rather than the GPUs allocated to that pod.

Versions / Dependencies

Python 3.7.7
Ray 1.7.1
Ubuntu 18.04.5 LTS

Reproduction script

The cluster config used is here: https://github.com/ray-project/kuberay/blob/2c2bc7defbcb8c930d821dd8854bde5f44006cbb/ray-operator/config/samples/ray-cluster.heterogeneous.yaml

Python workload (borrowed from https://docs.ray.io/en/latest/cluster/quickstart.html#ref-cluster-quick-start):

from collections import Counter
import os
import socket
import time

import ray

ray.init(address=os.getenv('ADDRESS'))

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
    {} GPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU'], ray.cluster_resources()['GPU']))

print(ray.cluster_resources())
print(ray.available_resources())

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))

run: ADDRESS=ray://... python script.py

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
iconix added the bug and triage labels on Nov 11, 2021
DmitriGekhtman self-assigned this on Nov 11, 2021
DmitriGekhtman (Contributor) commented:

It may be worth also posting this issue in the KubeRay repo.

iconix (Author) commented Nov 11, 2021

> It may be worth also posting this issue in the KubeRay repo.

Hmm why? What is the issue with the cluster config?

DmitriGekhtman (Contributor) commented:

It's hard for me to say at this point -- I need to learn more about the KubeRay operator, which processes the config.

iconix (Author) commented Nov 12, 2021

> It's hard for me to say at this point -- I need to learn more about the KubeRay operator, which processes the config.

Ah, OK. I didn't realize that the KubeRay and native Ray cluster configs are so different.

So if I use the native Ray operator, will I no longer have this issue?

DmitriGekhtman (Contributor) commented:

The two operators are independently developed solutions for the same problem.
In the future, we're aiming to simplify and have one preferred solution (likely based more closely on the KubeRay operator), exactly to reduce this sort of confusion.

The operator in the main Ray repo reads GPU resources from the k8s pod spec and advertises those to Ray (overriding the built-in detection, which is likely to pick up the host's resources).
It appears that the KubeRay operator needs to be modified to do the same thing.

The config linked does not specify GPU resources -- is the expected behavior for Ray not to pick up any GPU resources?
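
For concreteness, "GPU resources in the k8s pod spec" means a container-level limit along the lines of the fragment below. This is illustrative only, not taken from the linked config: nvidia.com/gpu is the standard NVIDIA device-plugin resource name, and the image tag is a placeholder.

containers:
  - name: ray-worker
    image: rayproject/ray:latest-gpu    # placeholder image tag
    resources:
      limits:
        nvidia.com/gpu: 1               # the GPU limit the operator reads and advertises to Ray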

DmitriGekhtman (Contributor) commented:

cc @pcmoritz

iconix (Author) commented Nov 12, 2021

I would like Ray to pick up GPU resources. How do I specify that in config?

DmitriGekhtman (Contributor) commented Nov 12, 2021

You should be able to do that by adding a num-gpus parameter to the rayStartParams for the headGroupSpec or the desired workerGroupSpec, depending on whether you want the head or a particular worker group to use GPUs.
The setting is similar to num-cpus in the example.
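
A sketch of how that could look, following the rayStartParams layout of the linked heterogeneous example; the group name, replica count, and resource values are illustrative, and the exact CRD field names may differ slightly between KubeRay versions:

workerGroupSpecs:
  - groupName: gpu-group        # illustrative group name
    replicas: 1
    rayStartParams:
      num-cpus: "1"
      num-gpus: "1"             # advertise one GPU per worker pod to Ray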

DmitriGekhtman added this to the Serverless Autoscaling milestone on Jan 4, 2022
AmeerHajAli added the "infra autoscaler, ray client, kuberay, related issues" label on Mar 26, 2022
DmitriGekhtman (Contributor) commented:

This has been fixed. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/gpu.html for details.
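
In short, per that guide, KubeRay now derives the Ray GPU count from the container's GPU limit, so an explicit num-gpus entry in rayStartParams is no longer required (it can still be set as an override). A minimal illustrative worker-group fragment, with placeholder names, image tag, and values:

workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 1
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:latest-gpu   # placeholder image tag
            resources:
              limits:
                nvidia.com/gpu: 1   # KubeRay advertises this limit to Ray as the node's GPU count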
