
UnexpectedAdmissionError Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected #2701

Closed
631068264 opened this issue Feb 23, 2023 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@631068264

What happened:

I tried to deploy the example pod following https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_gpu_sharing.md, but it fails:

Containers:
  cuda-container:
    Image:      xxxx/library/nvidia/cuda:11.6.2-base-ubuntu20.04
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
    Args:
      100000
    Limits:
      volcano.sh/gpu-memory:  1024
    Requests:
      volcano.sh/gpu-memory:  1024
    Environment:              <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p5qs6 (ro)
Volumes:
  kube-api-access-p5qs6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                    Age   From     Message
  ----     ------                    ----  ----     -------
  Normal   Scheduled                 23s   volcano  Successfully assigned default/gpu-pod1 to d-ecs-38357230
  Warning  UnexpectedAdmissionError  23s   kubelet  Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected
nvidia-smi 
Thu Feb 23 17:50:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.89       Driver Version: 450.89       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:08.0 Off |                    0 |
| N/A   70C    P0    32W /  70W |   1144MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3028      C   python                           1141MiB |
+-----------------------------------------------------------------------------+

nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-6b1d3e62-f94c-9236-1398-813bb48aab5a)



# Volcano itself was deployed successfully

kubectl describe node

Capacity:
...
  volcano.sh/gpu-memory:  15109
  volcano.sh/gpu-number:  1
Allocatable:
...
  volcano.sh/gpu-memory:  15109
  volcano.sh/gpu-number:  1

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version: 1.7
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.16", GitCommit:"60e5135f758b6e43d0523b3277e8d34b4ab3801f", GitTreeState:"clean", BuildDate:"2023-01-18T16:01:10Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9", GitCommit:"6df4433e288edc9c40c2e344eb336f63fad45cd2", GitTreeState:"clean", BuildDate:"2022-04-13T19:52:02Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.2 LTS
  • Kernel (e.g. uname -a): Linux d-ecs-38357230 4.15.0-128-generic
  • Install tools: helm
  • Others:
631068264 added the kind/bug label on Feb 23, 2023
@archlitchi
Contributor

Currently we only support using either volcano.sh/gpu-memory or volcano.sh/gpu-number; setting both at the same time is not supported right now. It will be implemented in the next version.
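
For example, a container resources section like the following (a minimal sketch of the unsupported combination, not taken from this issue) is what triggers that limitation:

resources:
  limits:
    volcano.sh/gpu-memory: 1024  # GPU memory in MB
    volcano.sh/gpu-number: 1     # whole-GPU count; cannot be combined with gpu-memory yet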

@631068264
Author

@archlitchi No, I only use volcano.sh/gpu-memory; I don't know where you got that info. I just ran the example from your install doc:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: xxx:8080/library/nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-memory: 1024 # requesting 1024MB GPU memory

---

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: xxx:8080/library/nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-memory: 1024 # requesting 1024MB GPU memory

@631068264
Author

631068264 commented Feb 24, 2023

volcano-controllers error log

I0224 06:16:06.953772       1 queue_controller.go:224] Finished syncing queue default (13.940321ms).
I0224 06:16:07.968531       1 job_controller.go:320] Try to handle request <Queue: , Job: default/podgroup-e65591b5-b390-4893-9e88-4258d5a27b0c, Task:, Event:, ExitCode:0, Action:, JobVersion: 0>
E0224 06:16:07.968578       1 job_controller.go:325] Failed to get job by <Queue: , Job: default/podgroup-e65591b5-b390-4893-9e88-4258d5a27b0c, Task:, Event:, ExitCode:0, Action:, JobVersion: 0> from cache: failed to find job <default/podgroup-e65591b5-b390-4893-9e88-4258d5a27b0c>
I0224 06:16:07.969759       1 queue_controller.go:242] Begin execute SyncQueue action for queue default, current status Open
I0224 06:16:07.969783       1 queue_controller_action.go:35] Begin to sync queue default.
I0224 06:16:07.970417       1 job_controller.go:320] Try to handle request <Queue: , Job: default/podgroup-07a820b5-a0a1-4097-8cf4-05fbba2124da, Task:, Event:, ExitCode:0, Action:, JobVersion: 0>
E0224 06:16:07.970445       1 job_controller.go:325] Failed to get job by <Queue: , Job: default/podgroup-07a820b5-a0a1-4097-8cf4-05fbba2124da, Task:, Event:, ExitCode:0, Action:, JobVersion: 0> from cache: failed to find job <default/podgroup-07a820b5-a0a1-4097-8cf4-05fbba2124da>
I0224 06:16:07.977594       1 queue_controller_action.go:83] End sync queue default.
I0224 06:16:07.977614       1 queue_controller.go:224] Finished syncing queue default (7.864486ms).
I0224 06:16:07.977630       1 queue_controller.go:242] Begin execute SyncQueue action for queue default, current status Open
kubectl get podgroup -A -owide 


NAMESPACE   NAME                                            STATUS    MINMEMBER   RUNNINGS   AGE     QUEUE
default     podgroup-07a820b5-a0a1-4097-8cf4-05fbba2124da   Inqueue   1                      4m18s   default
default     podgroup-e65591b5-b390-4893-9e88-4258d5a27b0c   Inqueue   1                      4m18s   default

@archlitchi
Contributor

Could you show the log of volcano-device-plugin, please?

@631068264
Author

volcano-device-plugin log:

2023/02/23 07:49:21 Starting OS signal watcher.
time="2023-02-23T07:49:21Z" level=info msg="set gpu memory: 15109" source="utils.go:67"
2023/02/23 07:49:21 Starting GRPC server for 'volcano.sh/gpu-memory'
2023/02/23 07:49:21 Starting to serve 'volcano.sh/gpu-memory' on /var/lib/kubelet/device-plugins/volcano.sock
2023/02/23 07:49:21 Registered device plugin for 'volcano.sh/gpu-memory' with Kubelet
I0223 09:27:06.139662       1 server.go:300] Got candidate Pod 53ce111b-ae45-4d23-89e6-cc4f0683243c(cuda-container), the device count is: 1024
W0223 09:27:06.139708       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0223 09:27:06.176309       1 server.go:300] Got candidate Pod 1c8b0ca2-afd6-4c03-996a-de1ab6a6c1d7(cuda-container), the device count is: 1024
W0223 09:27:06.176344       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod1
I0223 09:39:40.409725       1 server.go:300] Got candidate Pod ec217e6a-254e-4937-87e3-d31b1d2b4052(cuda-container), the device count is: 1024
W0223 09:39:40.409760       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0223 09:39:40.463164       1 server.go:300] Got candidate Pod 760f9edd-89e5-4f21-9b28-7260273c890b(cuda-container), the device count is: 1024
W0223 09:39:40.463208       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod1
I0223 09:49:01.711100       1 server.go:300] Got candidate Pod 44318361-2419-490d-be4b-3346425ba040(cuda-container), the device count is: 1024
W0223 09:49:01.711144       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0223 09:49:01.751763       1 server.go:300] Got candidate Pod 44318361-2419-490d-be4b-3346425ba040(cuda-container), the device count is: 1024
W0223 09:49:01.751795       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0224 06:16:06.986111       1 server.go:300] Got candidate Pod e65591b5-b390-4893-9e88-4258d5a27b0c(cuda-container), the device count is: 1024
W0224 06:16:06.986142       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2
I0224 06:16:07.024074       1 server.go:300] Got candidate Pod e65591b5-b390-4893-9e88-4258d5a27b0c(cuda-container), the device count is: 1024
W0224 06:16:07.024107       1 server.go:314] Failed to get the gpu id for pod default/gpu-pod2

@631068264
Author

631068264 commented Feb 24, 2023

Oh, I found why I got this problem:

The volcano-scheduler-configmap format was wrong:

      - name: predicates
            arguments:
                predicate.GPUSharingEnable: true # enable GPU sharing
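
The arguments block has to be nested directly under the predicates plugin entry. A corrected sketch of that part of volcano-scheduler-configmap (the surrounding plugin list follows the default volcano-scheduler.conf and may differ in your deployment):

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
    arguments:
      predicate.GPUSharingEnable: true # enable GPU sharing
  - name: proportion
  - name: nodeorder
  - name: binpack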

But I get another error when I use https://github.com/volcano-sh/devices/blob/release-1.0/volcano-device-plugin.yml:

  Warning  UnexpectedAdmissionError  7m3s  kubelet  Allocate failed due to rpc error: code = Unknown desc = failed to update pod annotation pods "gpu-pod1" is forbidden: User "system:serviceaccount:kube-system:volcano-device-plugin" cannot update resource "pods" in API group "" in the namespace "default", which is unexpected

@archlitchi
Contributor

@631068264 Please use the YAML from the master branch instead.
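
The error above boils down to missing RBAC for the device plugin's service account (it needs to update pod annotations), which the newer manifest grants. A minimal sketch of the kind of rule involved, assuming the service account named in the error message; the actual manifest may structure this differently:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: volcano-device-plugin
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: volcano-device-plugin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: volcano-device-plugin
subjects:
- kind: ServiceAccount
  name: volcano-device-plugin   # from "system:serviceaccount:kube-system:volcano-device-plugin" in the error
  namespace: kube-system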

@631068264
Author

631068264 commented Feb 24, 2023

The pods are running now, but I can't find any processes from nvidia-smi.

[screenshots: pod status and nvidia-smi output showing no processes]

@631068264
Author

631068264 commented Feb 24, 2023

Also, I want to know: is schedulerName required? What about other CRDs, for example Kubeflow InferenceServices?

Or is only volcano.sh/gpu-memory required?

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  schedulerName: volcano 
  containers:
    - name: cuda-container
      image: xxx:8080/library/nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-memory: 1024 # requesting 1024MB GPU memory


@631068264
Author

631068264 commented Feb 24, 2023

I used Kubeflow 1.6.1 to deploy this, and it does not work:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "volcano"
spec:
  predictor:
    schedulerName: volcano
    containers:
      - name: kserve-container
        image: xxx:8080/library/model/firesmoke:v1
        env:
          - name: MODEL_NAME
            value: volcano
        command:
          - python
          - -m
          - fire_smoke
        resources:
          limits:
            volcano.sh/gpu-memory: 2048

But I get this error:

Events:
  Type     Reason         Age                    From                Message
  ----     ------         ----                   ----                -------
  Warning  InternalError  3m4s (x17 over 8m32s)  v1beta1Controllers  fails to reconcile predictor: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.schedulerName

@archlitchi
Contributor

Yeah, you can't see any processes in nvidia-smi inside the container; that's expected because the PID namespace is different. Currently you have to set schedulerName to volcano for GPU sharing to work.
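
As for the Knative validation error above: Knative Serving rejects schedulerName on the pod spec unless the corresponding feature flag is enabled. If the Knative Serving version bundled with Kubeflow 1.6.1 supports it, the flag lives in the config-features ConfigMap (flag name taken from upstream Knative feature-flag docs; verify against your installed version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  kubernetes.podspec-schedulername: "enabled"  # allows spec.template.spec.schedulerName through webhook validation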

@631068264
Author

Yeah, you can't see any processes in nvidia-smi inside the container; that's expected because the PID namespace is different. Currently you have to set schedulerName to volcano for GPU sharing to work.

No, these three processes are running in a Docker container, where I use the NVIDIA GPU Operator as the device plugin:

[screenshot: nvidia-smi output showing three processes]

@archlitchi
Contributor

Yeah, you can't see any processes in nvidia-smi inside the container; that's expected because the PID namespace is different. Currently you have to set schedulerName to volcano for GPU sharing to work.

No, these three processes are running in a Docker container, where I use the NVIDIA GPU Operator as the device plugin [screenshot]

The NVIDIA GPU Operator applies a dozen improvements on top of the plain nvidia-device-plugin, including a fix for this problem. However, if you deploy nvidia-device-plugin directly, you will encounter the same behavior.

@archlitchi
Contributor

You can do a simple test: download and install nvidia-docker2 and mount GPUs into a container with the command "docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=all --runtime=nvidia {your image} {command}", and you will see that no processes are displayed by nvidia-smi.
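
Conversely, sharing the host PID namespace makes the processes show up again, which is a quick way to confirm this is purely a PID-namespace effect. A sketch, assuming the same nvidia-docker2 setup as the command above:

docker run -it --rm --pid=host -e NVIDIA_VISIBLE_DEVICES=all --runtime=nvidia {your image} nvidia-smi
# with --pid=host, nvidia-smi can resolve the host PIDs that are using the GPU, so the
# Processes table is populated; without it, those PIDs do not exist in the container's
# namespace and the table stays empty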
