
rpc error: code = Unknown desc = failed to find gpu id #10

Closed
zishen opened this issue Nov 10, 2020 · 14 comments

Comments


zishen commented Nov 10, 2020

This may be a bug.

The YAML is:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-dls-gpu
  namespace: vcjob
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  maxRetry: 3
  queue: default
  tasks:
  - name: "default-1p"
    replicas: 1
    template:
      metadata:
        labels:
          app: tf
      spec:
        containers:
        - image: nvidia-train:v1
          imagePullPolicy: IfNotPresent
          name: cuda-container
          command:
          - "/bin/bash"
          - "-c"
          #- "chmod 777 -R /job;cd /job/code/ModelZoo_Resnet50_HC; bash train_start.sh"
          args: [ "while true; do sleep 3000000; done;"  ]
          resources:
            requests:
              volcano.sh/gpu-number: 1
            limits:
              volcano.sh/gpu-number: 1
          volumeMounts:
          - name: timezone
            mountPath: /etc/timezone
          - name: localtime
            mountPath: /etc/localtime
        nodeSelector:
          accelerator: nvidia-tesla-v100
        volumes:
        - name: timezone
          hostPath:
            path: /etc/timezone
        - name: localtime
          hostPath:
            path: /etc/localtime
        restartPolicy: OnFailure

When I use the "volcano.sh/gpu-memory" resource instead, it fails with:

Nov 10 20:11:23 ubuntu560 kubelet[26515]: E1110 20:11:23.149895   26515 manager.go:374] Failed to allocate device plugin resource for pod 28bc8549-e3b9-40f6-8adb-7830f967d97b: rpc error: code = Unknown desc = failed to find gpu id
Nov 10 20:11:23 ubuntu560 kubelet[26515]: W1110 20:11:23.149941   26515 predicate.go:74] Failed to admit pod mindx-dls-gpu-default-1p-0_vcjob(28bc8549-e3b9-40f6-8adb-7830f967d97b) - Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.

env:

volcano:v1.0.1
volcano-deviceplugin:v1.0.0
os:ubuntu 18.04  amd64

So, is volcano.sh/gpu-memory supported?


zishen commented Nov 10, 2020

The volcano-deviceplugin error screenshot is shown above.

@hzxuzhonghu

cc @william-wang @Thor-wl


Thor-wl commented Nov 11, 2020

OK, let me have a try.


mmhhss commented Nov 16, 2020

I ran into the same problem too.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: horovod_gpu:v2.0
      command: ["sleep", "100"]
      resources:
        limits:
          volcano.sh/gpu-memory: "1024"
kubectl describe pod gpu-pod1
Name:         gpu-pod1
Namespace:    default
Priority:     0
Node:         cmp-node6/
Start Time:   Mon, 16 Nov 2020 18:14:08 +0800
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"gpu-pod1","namespace":"default"},"spec":{"containers":[{"command":["s...
              scheduling.k8s.io/group-name: podgroup-e9a8ebc2-da88-402a-a8ea-c719e52fede3
Status:       Failed
Reason:       UnexpectedAdmissionError
Message:      Pod Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.
IP:           
Containers:
  cuda-container:
    Image:      horovod_gpu:v2.0
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
      100
    Limits:
      volcano.sh/gpu-memory:  1024
    Requests:
      volcano.sh/gpu-memory:  1024
    Environment:              <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kr9kv (ro)
Volumes:
  default-token-kr9kv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kr9kv
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                    Age   From                Message
  ----     ------                    ----  ----                -------
  Normal   Scheduled                 30s   volcano             Successfully assigned default/gpu-pod1 to cmp-node6
  Warning  UnexpectedAdmissionError  30s   kubelet, cmp-node6  Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.


william-wang commented Nov 28, 2020

@Thor-wl is there any progress on this issue?


wpeng102 commented Dec 2, 2020

Please check:
1) Whether the volcano scheduler image contains the GPU sharing feature.
2) Whether the volcano-scheduler-configmap enables predicate.GPUSharingEnable (refer to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_gpu_sharing.md).
3) Whether the volcano-scheduler cluster role has the patch verb for "pods" and "pods/status".

If you did not do a fresh install of the latest volcano and only replaced the volcano scheduler image, you may have missed steps 2) and 3); a sketch of both is below.
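For reference, a minimal sketch of what steps 2) and 3) can look like. The plugin list, namespace, and role contents below are illustrative and may differ in your deployment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
        arguments:
          predicate.GPUSharingEnable: true   # step 2): enable GPU sharing in the predicates plugin
      - name: proportion
      - name: nodeorder
      - name: binpack

And the ClusterRole bound to the volcano scheduler should contain a rule like the following (step 3)):

- apiGroups: [""]
  resources: ["pods", "pods/status"]
  verbs: ["patch"]   # merge with whatever verbs the role already grants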

In my test env, it works well:

root@mscpptb00006:~/peng# cat test1.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: nvidia/cuda:10.1-base-ubuntu18.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/gpu-memory: 1024
root@mscpptb00006:~/peng# kubectl get po
NAME       READY   STATUS    RESTARTS   AGE
gpu-test   1/1     Running   0          2s

root@mscpptb00006:~/peng# kubectl exec -it gpu-test nvidia-smi
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
Wed Dec  2 09:34:46 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:21:01.0 Off |                    0 |
| N/A   27C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


jxfruit commented Dec 4, 2020

@wpeng102 It still does not work after following your steps; my volcano version is 1.0.1 and volcano-device-plugin version is 1.0.1.


wpeng102 commented Dec 4, 2020

@jxfruit could you paste your volcano-scheduler-configmap and the scheduler log?


jxfruit commented Dec 4, 2020

@wpeng102 The volcano-scheduler-configmap, the scheduler log, and the test YAML are attached here:
gpu-test.yaml.zip


jxfruit commented Dec 4, 2020

I found that the field "volcano.sh/gpu-memory" cannot be set to a value like "1024Mi". After I changed it from "1024Mi" to 1024, the pod runs (see the snippet at the end of this comment). But when I deployed two jobs on the same node, there was only one process on the GPU, and the other job's logs threw an OOM error.
It looks like the GPU memory is not actually being shared, or the allocated GPU memory is not being enforced. Is there something I missed?
gpu-test.zip
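For anyone else hitting this, the difference that mattered (values illustrative; the extended resource takes a bare number, not a suffixed quantity):

resources:
  limits:
    volcano.sh/gpu-memory: 1024        # works: bare number
    # volcano.sh/gpu-memory: "1024Mi"  # did not work: suffixed quantity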


wpeng102 commented Dec 4, 2020

Talked with @jxfruit: this version cannot support GPU memory hard isolation; refer to https://github.com/volcano-sh/devices#docs


jxfruit commented Dec 5, 2020

Thanks to the experts.
Now I understand what hard isolation means. In this version the GPU memory is not actually controlled: the "volcano.sh/gpu-memory" field only tells the volcano scheduler the minimum GPU memory a device must have for it to be chosen. What's more, one GPU device can be scheduled onto again, and the volcano device plugin does not actually partition memory for our jobs. However, we can set the limit ourselves inside our TF/MS/PT... jobs.
For example, in TF 1.x:

import tensorflow
tf_config = tensorflow.ConfigProto()
tf_config.gpu_options.allow_growth = True   # allocate GPU memory on demand instead of grabbing it all up front
session = tensorflow.Session(config=tf_config)

This may be helpful for others. I suggest closing this issue.
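If a hard cap on the framework side is wanted rather than on-demand growth, a minimal TF 1.x sketch (the fraction is illustrative and should roughly match the pod's volcano.sh/gpu-memory request):

import tensorflow as tf

tf_config = tf.ConfigProto()
# Cap this process at roughly 25% of the device's total memory; adjust the
# fraction to match what was requested via volcano.sh/gpu-memory.
tf_config.gpu_options.per_process_gpu_memory_fraction = 0.25
session = tf.Session(config=tf_config)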

@wpeng102

/close

@william-wang

/close

Thor-wl closed this as completed Nov 20, 2021