[Bug] Ray Head access to extra GPU resources #2098

shaowei-su · 2024-04-23T23:48:29Z

Search before asking

I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

If the Ray head node is scheduled on a GPU node without requesting any GPU resources, e.g.:

      resources:
        limits:
          ephemeral-storage: 10Gi
          memory: 16Gi
        requests:
          cpu: '4'
          ephemeral-storage: 10Gi
          memory: 16Gi

The Ray resource scheduler can still access those GPUs unintentionally, and it counts all of the host's GPUs as "Logical Resources" during scheduling.
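A possible mitigation, sketched here as an assumption rather than a fix confirmed in this thread: override Ray's GPU autodetection on the head by passing `num-gpus: "0"` through `rayStartParams`, so the head advertises zero logical GPUs regardless of what the container can see. The container name `ray-head` is a placeholder, and the resource values are copied from the snippet above.

```yaml
# Hypothetical headGroupSpec excerpt; not the reporter's actual manifest.
headGroupSpec:
  rayStartParams:
    # Assumption: forwarded to `ray start --num-gpus=0`, so the head
    # reports 0 logical GPUs instead of autodetecting the host's GPUs.
    num-gpus: "0"
  template:
    spec:
      containers:
        - name: ray-head  # placeholder name
          resources:
            limits:
              ephemeral-storage: 10Gi
              memory: 16Gi
            requests:
              cpu: '4'
              ephemeral-storage: 10Gi
              memory: 16Gi
```

This only changes what Ray schedules against. It does not stop processes in the head container from seeing the physical devices if the container runtime exposes them.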

Reproduction script

Use the RayJob CRD to schedule both the head and the workers on the same physical host with more than one GPU.

Anything else

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!


kevin85421 · 2024-04-24T17:54:27Z

This is not a KubeRay-specific issue. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/gpu.html#gpu-multi-tenancy for more details. Recently, GPU UX on K8s seems to have improved. I will take a look at MIG and time-slicing GPU and get back to you.

shaowei-su added the bug and triage labels Apr 23, 2024

shaowei-su mentioned this issue Apr 24, 2024

[Feature] Allow different LocalQueue label for head and worker groups #2099

Closed

2 tasks

kevin85421 added rayjob and removed triage labels Apr 24, 2024

kevin85421 self-assigned this Apr 24, 2024

kevin85421 added the go and gpu labels and removed the go and rayjob labels Apr 24, 2024
