
rpc error: code = Unknown desc = failed to find gpu id #10

Closed
zishen opened this issue Nov 10, 2020 · 14 comments

Comments


zishen commented Nov 10, 2020

This may be a bug.

The YAML is:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-dls-gpu
  namespace: vcjob
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  maxRetry: 3
  queue: default
  tasks:
  - name: "default-1p"
    replicas: 1
    template:
      metadata:
        labels:
          app: tf
      spec:
        containers:
        - image: nvidia-train:v1
          imagePullPolicy: IfNotPresent
          name: cuda-container
          command:
          - "/bin/bash"
          - "-c"
          #- "chmod 777 -R /job;cd /job/code/ModelZoo_Resnet50_HC; bash train_start.sh"
          args: [ "while true; do sleep 3000000; done;"  ]
          resources:
            requests:
              volcano.sh/gpu-number: 1
            limits:
              volcano.sh/gpu-number: 1
          volumeMounts:
          - name: timezone
            mountPath: /etc/timezone
          - name: localtime
            mountPath: /etc/localtime
        nodeSelector:
          accelerator: nvidia-tesla-v100
        volumes:
        - name: timezone
          hostPath:
            path: /etc/timezone
        - name: localtime
          hostPath:
            path: /etc/localtime
        restartPolicy: OnFailure

When I use the "volcano.sh/gpu-memory" resource instead, it fails with:

Nov 10 20:11:23 ubuntu560 kubelet[26515]: E1110 20:11:23.149895   26515 manager.go:374] Failed to allocate device plugin resource for pod 28bc8549-e3b9-40f6-8adb-7830f967d97b: rpc error: code = Unknown desc = failed to find gpu id
Nov 10 20:11:23 ubuntu560 kubelet[26515]: W1110 20:11:23.149941   26515 predicate.go:74] Failed to admit pod mindx-dls-gpu-default-1p-0_vcjob(28bc8549-e3b9-40f6-8adb-7830f967d97b) - Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.

env:

volcano:v1.0.1
volcano-deviceplugin:v1.0.0
os:ubuntu 18.04  amd64

So, is volcano.sh/gpu-memory supported?


zishen commented Nov 10, 2020

The volcano-deviceplugin error screenshot is shown above.

@hzxuzhonghu

cc @william-wang @Thor-wl


Thor-wl commented Nov 11, 2020

OK, let me have a try.


mmhhss commented Nov 16, 2020

I ran into the same problem too.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: horovod_gpu:v2.0
      command: ["sleep", "100"]
      resources:
        limits:
          volcano.sh/gpu-memory: "1024"
kubectl describe pod gpu-pod1
Name:         gpu-pod1
Namespace:    default
Priority:     0
Node:         cmp-node6/
Start Time:   Mon, 16 Nov 2020 18:14:08 +0800
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"gpu-pod1","namespace":"default"},"spec":{"containers":[{"command":["s...
              scheduling.k8s.io/group-name: podgroup-e9a8ebc2-da88-402a-a8ea-c719e52fede3
Status:       Failed
Reason:       UnexpectedAdmissionError
Message:      Pod Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.
IP:           
Containers:
  cuda-container:
    Image:      horovod_gpu:v2.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
      100
    Limits:
      volcano.sh/gpu-memory:  1024
    Requests:
      volcano.sh/gpu-memory:  1024
    Environment:              <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kr9kv (ro)
Volumes:
  default-token-kr9kv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kr9kv
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                    Age   From                Message
  ----     ------                    ----  ----                -------
  Normal   Scheduled                 30s   volcano             Successfully assigned default/gpu-pod1 to cmp-node6
  Warning  UnexpectedAdmissionError  30s   kubelet, cmp-node6  Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.


william-wang commented Nov 28, 2020

@Thor-wl is there any progress on this issue?


wpeng102 commented Dec 2, 2020

Please check:
1) Whether the volcano scheduler image contains the GPU sharing feature.
2) Whether the volcano-scheduler-configmap enables predicate.GPUSharingEnable (refer to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_gpu_sharing.md).
3) Whether the volcano-scheduler cluster role has the patch verb for "pods" and "pods/status".

If you did not do a fresh install of the latest volcano and only replaced the volcano scheduler image, you may have missed steps 2) and 3); a sketch of both is below.
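For reference, a minimal sketch of what steps 2) and 3) can look like. The plugin list, namespace, and role contents below are illustrative and may differ in your deployment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
        arguments:
          predicate.GPUSharingEnable: true   # step 2): enable GPU sharing in the predicates plugin
      - name: proportion
      - name: nodeorder
      - name: binpack

And the ClusterRole bound to the volcano scheduler should contain a rule like the following (step 3)):

- apiGroups: [""]
  resources: ["pods", "pods/status"]
  verbs: ["patch"]   # merge with whatever verbs the role already grants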

In my test env, it works well:

root@mscpptb00006:~/peng# cat test1.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: nvidia/cuda:10.1-base-ubuntu18.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/gpu-memory: 1024
root@mscpptb00006:~/peng# kubectl get po
NAME       READY   STATUS    RESTARTS   AGE
gpu-test   1/1     Running   0          2s

root@mscpptb00006:~/peng# kubectl exec -it gpu-test nvidia-smi
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
Wed Dec  2 09:34:46 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:21:01.0 Off |                    0 |
| N/A   27C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


jxfruit commented Dec 4, 2020

@wpeng102 It still does not work after following your steps; my volcano version is 1.0.1 and volcano-device-plugin version is 1.0.1.


wpeng102 commented Dec 4, 2020

@jxfruit could you paste your volcano-scheduler-configmap and the scheduler log?


jxfruit commented Dec 4, 2020

@wpeng102 The volcano-scheduler-configmap, the scheduler log, and the test YAML are attached here:
gpu-test.yaml.zip


jxfruit commented Dec 4, 2020

I found that the field "volcano.sh/gpu-memory" cannot be set to a value like "1024Mi". After I changed it from "1024Mi" to 1024, the pod runs (see the snippet at the end of this comment). But when I deployed two jobs on the same node, there was only one process on the GPU, and the other job's logs threw an OOM error.
It looks like the GPU memory is not actually being shared, or the allocated GPU memory is not being enforced. Is there something I missed?
gpu-test.zip
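For anyone else hitting this, the difference that mattered (values illustrative; the extended resource takes a bare number, not a suffixed quantity):

resources:
  limits:
    volcano.sh/gpu-memory: 1024        # works: bare number
    # volcano.sh/gpu-memory: "1024Mi"  # did not work: suffixed quantity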


wpeng102 commented Dec 4, 2020

Talked with @jxfruit: this version cannot support GPU memory hard isolation; refer to https://github.com/volcano-sh/devices#docs


jxfruit commented Dec 5, 2020

Thanks to the experts.
Now I understand what hard isolation means. In this version the GPU memory is not actually controlled: the "volcano.sh/gpu-memory" field only tells the volcano scheduler the minimum GPU memory a device must have for it to be chosen. What's more, one GPU device can be scheduled onto again, and the volcano device plugin does not actually partition memory for our jobs. However, we can set the limit ourselves inside our TF/MS/PT... jobs.
For example, in TF 1.x:

import tensorflow
tf_config = tensorflow.ConfigProto()
tf_config.gpu_options.allow_growth = True   # allocate GPU memory on demand instead of grabbing it all up front
session = tensorflow.Session(config=tf_config)

This may be helpful for others. I suggest closing this issue.
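If a hard cap on the framework side is wanted rather than on-demand growth, a minimal TF 1.x sketch (the fraction is illustrative and should roughly match the pod's volcano.sh/gpu-memory request):

import tensorflow as tf

tf_config = tf.ConfigProto()
# Cap this process at roughly 25% of the device's total memory; adjust the
# fraction to match what was requested via volcano.sh/gpu-memory.
tf_config.gpu_options.per_process_gpu_memory_fraction = 0.25
session = tf.Session(config=tf_config)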

@wpeng102

/close

@william-wang

/close

Thor-wl closed this as completed Nov 20, 2021