
UnexpectedAdmissionError Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected #2701

Closed
631068264 opened this issue Feb 23, 2023 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@631068264

What happened:

I tried to deploy the example pod following https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_gpu_sharing.md, but it fails:

Containers:
  cuda-container:
    Image:      xxxx/library/nvidia/cuda:11.6.2-base-ubuntu20.04
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
    Args:
      100000
    Limits:
      volcano.sh/gpu-memory:  1024
    Requests:
      volcano.sh/gpu-memory:  1024
    Environment:              <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p5qs6 (ro)
Volumes:
  kube-api-access-p5qs6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                    Age   From     Message
  ----     ------                    ----  ----     -------
  Normal   Scheduled                 23s   volcano  Successfully assigned default/gpu-pod1 to d-ecs-38357230
  Warning  UnexpectedAdmissionError  23s   kubelet  Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected
nvidia-smi 
Thu Feb 23 17:50:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.89       Driver Version: 450.89       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:08.0 Off |                    0 |
| N/A   70C    P0    32W /  70W |   1144MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3028      C   python                           1141MiB |
+-----------------------------------------------------------------------------+

nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-6b1d3e62-f94c-9236-1398-813bb48aab5a)



# Volcano itself was deployed successfully

kubectl describe node

Capacity:
...
  volcano.sh/gpu-memory:  15109
  volcano.sh/gpu-number:  1
Allocatable:
...
  volcano.sh/gpu-memory:  15109
  volcano.sh/gpu-number:  1

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version: 1.7
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.16", GitCommit:"60e5135f758b6e43d0523b3277e8d34b4ab3801f", GitTreeState:"clean", BuildDate:"2023-01-18T16:01:10Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9", GitCommit:"6df4433e288edc9c40c2e344eb336f63fad45cd2", GitTreeState:"clean", BuildDate:"2022-04-13T19:52:02Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.2 LTS
  • Kernel (e.g. uname -a): Linux d-ecs-38357230 4.15.0-128-generic
  • Install tools: helm
  • Others:
631068264 added the kind/bug label on Feb 23, 2023
@archlitchi
Contributor

Currently we only support using either volcano.sh/gpu-memory or volcano.sh/gpu-number; setting both at the same time is not supported right now. It will be implemented in the next version.
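
For example, a container resources section like the following (a minimal sketch of the unsupported combination, not taken from this issue) is what triggers that limitation:

resources:
  limits:
    volcano.sh/gpu-memory: 1024  # GPU memory in MB
    volcano.sh/gpu-number: 1     # whole-GPU count; cannot be combined with gpu-memory yet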

@631068264
Author

@archlitchi No, I only use volcano.sh/gpu-memory; I don't know where you got that info. I just ran the example from your install doc:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: xxx:8080/library/nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-memory: 1024 # requesting 1024MB GPU memory

---

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: xxx:8080/library/nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-memory: 1024 # requesting 1024MB GPU memory

@631068264
Author

631068264 commented Feb 24, 2023

volcano-controllers error log

I0224 06:16:06.953772       1 queue_controller.go:224] Finished syncing queue default (13.940321ms).
I0224 06:16:07.968531       1 job_controller.go:320] Try to handle request <Queue: , Job: default/podgroup-e65591b5-b390-4893-9e88-4258d5a27b0c, Task:, Event:, ExitCode:0, Action:, JobVersion: 0>
E0224 06:16:07.968578       1 job_controller.go:325] Failed to get job by <Queue: , Job: default/podgroup-e65591b5-b390-4893-9e88-4258d5a27b0c, Task:, Event:, ExitCode:0, Action:, JobVersion: 0> from cache: failed to find job <default/podgroup-e65591b5-b390-4893-9e88-4258d5a27b0c>
I0224 06:16:07.969759       1 queue_controller.go:242] Begin execute SyncQueue action for queue default, current status Open
I0224 06:16:07.969783       1 queue_controller_action.go:35] Begin to sync queue default.
I0224 06:16:07.970417       1 job_controller.go:320] Try to handle request <Queue: , Job: default/podgroup-07a820b5-a0a1-4097-8cf4-05fbba2124da, Task:, Event:, ExitCode:0, Action:, JobVersion: 0>
E0224 06:16:07.970445       1 job_controller.go:325] Failed to get job by <Queue: , Job: default/podgroup-07a820b5-a0a1-4097-8cf4-05fbba2124da, Task:, Event:, ExitCode:0, Action:, JobVersion: 0> from cache: failed to find job <default/podgroup-07a820b5-a0a1-4097-8cf4-05fbba2124da>
I0224 06:16:07.977594       1 queue_controller_action.go:83] End sync queue default.
I0224 06:16:07.977614       1 queue_controller.go:224] Finished syncing queue default (7.864486ms).
I0224 06:16:07.977630       1 queue_controller.go:242] Begin execute SyncQueue action for queue default, current status Open
kubectl get podgroup -A -owide 


NAMESPACE   NAME                                            STATUS    MINMEMBER   RUNNINGS   AGE     QUEUE
default     podgroup-07a820b5-a0a1-4097-8cf4-05fbba2124da   Inqueue   1                      4m18s   default
default     podgroup-e65591b5-b390-4893-9e88-4258d5a27b0c   Inqueue   1                      4m18s   default

@archlitchi
Contributor

Could you show the log of volcano-device-plugin, please?

@631068264
Author

volcano-device-plugin log:

2023/02/23 07:49:21 Starting OS signal watcher.
time="2023-02-23T07:49:21Z" level=info msg="set gpu memory: 15109" source="utils.go:67"
2023/02/23 07:49:21 Starting GRPC server for 'volcano.sh/gpu-memory'
2023/02/23 07:49:21 Starting to serve 'volcano.sh/gpu-memory' on /var/lib/kubelet/device-plugins/volcano.sock
2023/02/23 07:49:21 Registered device plugin for 'volcano.sh/gpu-memory' with Kubelet
I0223 09:27:06.139662       1 server.go:300] Got candidate Pod 53ce111b-ae45-4d23-89e6-cc4f0683243c(cuda-container), the device count is: 1024
W0223 09:27:06.139708       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0223 09:27:06.176309       1 server.go:300] Got candidate Pod 1c8b0ca2-afd6-4c03-996a-de1ab6a6c1d7(cuda-container), the device count is: 1024
W0223 09:27:06.176344       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod1
I0223 09:39:40.409725       1 server.go:300] Got candidate Pod ec217e6a-254e-4937-87e3-d31b1d2b4052(cuda-container), the device count is: 1024
W0223 09:39:40.409760       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0223 09:39:40.463164       1 server.go:300] Got candidate Pod 760f9edd-89e5-4f21-9b28-7260273c890b(cuda-container), the device count is: 1024
W0223 09:39:40.463208       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod1
I0223 09:49:01.711100       1 server.go:300] Got candidate Pod 44318361-2419-490d-be4b-3346425ba040(cuda-container), the device count is: 1024
W0223 09:49:01.711144       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0223 09:49:01.751763       1 server.go:300] Got candidate Pod 44318361-2419-490d-be4b-3346425ba040(cuda-container), the device count is: 1024
W0223 09:49:01.751795       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0224 06:16:06.986111       1 server.go:300] Got candidate Pod e65591b5-b390-4893-9e88-4258d5a27b0c(cuda-container), the device count is: 1024
W0224 06:16:06.986142       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0224 06:16:07.024074       1 server.go:300] Got candidate Pod e65591b5-b390-4893-9e88-4258d5a27b0c(cuda-container), the device count is: 1024
W0224 06:16:07.024107       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2

@631068264
Author

631068264 commented Feb 24, 2023

Oh, I found why I got this problem:

The volcano-scheduler-configmap format was wrong:

      - name: predicates
            arguments:
                predicate.GPUSharingEnable: true # enable GPU sharing
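
The arguments block has to be nested directly under the predicates plugin entry. A corrected sketch of that part of volcano-scheduler-configmap (the surrounding plugin list follows the default volcano-scheduler.conf and may differ in your deployment):

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: true # enable GPU sharing
  - name: proportion
  - name: nodeorder
  - name: binpack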

But I get another error when I use https://github.com/volcano-sh/devices/blob/release-1.0/volcano-device-plugin.yml:

  Warning  UnexpectedAdmissionError  7m3s  kubelet  Allocate failed due to rpc error: code = Unknown desc = failed to update pod annotation pods "gpu-pod1" is forbidden: User "system:serviceaccount:kube-system:volcano-device-plugin" cannot update resource "pods" in API group "" in the namespace "default", which is unexpected

@archlitchi
Contributor

@631068264 Please use the YAML from the master branch instead.
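
The error above boils down to missing RBAC for the device plugin's service account (it needs to update pod annotations), which the newer manifest grants. A minimal sketch of the kind of rule involved, assuming the service account named in the error message; the actual manifest may structure this differently:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: volcano-device-plugin
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: volcano-device-plugin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: volcano-device-plugin
subjects:
- kind: ServiceAccount
  name: volcano-device-plugin   # from "system:serviceaccount:kube-system:volcano-device-plugin" in the error
  namespace: kube-system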

@631068264
Author

631068264 commented Feb 24, 2023

The pods are running now, but I can't find any processes from nvidia-smi.

[screenshots: pod status and nvidia-smi output showing no processes]

@631068264
Author

631068264 commented Feb 24, 2023

Also, I want to know: is schedulerName required? What about other CRDs, for example Kubeflow InferenceServices?

Or is only volcano.sh/gpu-memory required?

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  schedulerName: volcano 
  containers:
    - name: cuda-container
      image: xxx:8080/library/nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-memory: 1024 # requesting 1024MB GPU memory


@631068264
Author

631068264 commented Feb 24, 2023

I used Kubeflow 1.6.1 to deploy this, and it does not work:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "volcano"
spec:
  predictor:
    schedulerName: volcano
    containers:
      - name: kserve-container
        image: xxx:8080/library/model/firesmoke:v1
        env:
          - name: MODEL_NAME
            value: volcano
        command:
          - python
          - -m
          - fire_smoke
        resources:
          limits:
            volcano.sh/gpu-memory: 2048

But I get this error:

Events:
  Type     Reason         Age                    From                Message
  ----     ------         ----                   ----                -------
  Warning  InternalError  3m4s (x17 over 8m32s)  v1beta1Controllers  fails to reconcile predictor: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.schedulerName

@archlitchi
Contributor

Yeah, you can't see any processes in nvidia-smi inside the container; that's expected because the PID namespace is different. Currently you have to set schedulerName to volcano for GPU sharing to work.
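
As for the Knative validation error above: Knative Serving rejects schedulerName on the pod spec unless the corresponding feature flag is enabled. If the Knative Serving version bundled with Kubeflow 1.6.1 supports it, the flag lives in the config-features ConfigMap (flag name taken from upstream Knative feature-flag docs; verify against your installed version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  kubernetes.podspec-schedulername: "enabled"  # allows spec.template.spec.schedulerName through webhook validation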

@631068264
Author

Yeah, you can't see any processes in nvidia-smi inside the container; that's expected because the PID namespace is different. Currently you have to set schedulerName to volcano for GPU sharing to work.

No, these three processes are running in a Docker container, where I use the NVIDIA GPU Operator as the device plugin:

[screenshot: nvidia-smi output showing three processes]

@archlitchi
Contributor

Yeah, you can't see any processes in nvidia-smi inside the container; that's expected because the PID namespace is different. Currently you have to set schedulerName to volcano for GPU sharing to work.

No, these three processes are running in a Docker container, where I use the NVIDIA GPU Operator as the device plugin [screenshot]

The NVIDIA GPU Operator applies a dozen improvements on top of the plain nvidia-device-plugin, including a fix for this problem. However, if you deploy nvidia-device-plugin directly, you will encounter the same behavior.

@archlitchi
Contributor

You can do a simple test: download and install nvidia-docker2 and mount GPUs into a container with the command "docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=all --runtime=nvidia {your image} {command}", and you will see that no processes are displayed by nvidia-smi.
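
Conversely, sharing the host PID namespace makes the processes show up again, which is a quick way to confirm this is purely a PID-namespace effect. A sketch, assuming the same nvidia-docker2 setup as the command above:

docker run -it --rm --pid=host -e NVIDIA_VISIBLE_DEVICES=all --runtime=nvidia {your image} nvidia-smi
# with --pid=host, nvidia-smi can resolve the host PIDs that are using the GPU, so the
# Processes table is populated; without it, those PIDs do not exist in the container's
# namespace and the table stays empty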
