gpu number cannot be used #31

Trainbow · 2023-01-18T01:54:21Z

No description provided.

Trainbow · 2023-01-18T01:57:26Z

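Hello, I am trying to use volcano gpu number scheduling. After installing by following the steps in the volcano tutorial, every GPU node correctly reports how many GPUs it has. However, when I create a pod, the container has no volcano-gpu-number environment variable, and running nvidia-smi inside it shows all of the node's GPUs. Do I need to change the YAML file?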

Thor-wl · 2023-01-19T01:25:46Z

Hey, which version are you using?

Trainbow · 2023-01-28T06:37:44Z

volcano-1.6.0

Thor-wl · 2023-01-29T01:28:28Z

/cc @wangyang0616 Can you help take a look?

wangyang0616 · 2023-01-29T01:45:21Z

ok, let me take a look

wangyang0616 · 2023-01-29T01:47:10Z

@Trainbow Could you post the YAML file used to create the test task?
By the way, can the pod be scheduled successfully with the default k8s scheduler?

Trainbow · 2023-01-29T02:22:26Z

I used the sample YAML from the volcano gpu-number README.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  namespace: model
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-number: 1 # requesting 1 GPU card
          # nvidia.com/gpu: 1

I also installed NVIDIA's k8s-device-plugin for testing. When the limits field uses nvidia.com/gpu, the pod's container works well and gets one GPU device. When I use volcano.sh/gpu-number, the container's environment does not have the variable VOLCANO_GPU_ALLOCATED, and NVIDIA_VISIBLE_DEVICES is all.
I also tried GPU sharing with volcano by following the official tutorial, and in that case I can find the corresponding environment variables in the pod.

wangyang0616 · 2023-03-09T08:39:33Z

The volcano device plugin's GPU strategy defaults to share mode, which means you can use volcano.sh/gpu-memory.
To use volcano.sh/gpu-number, you need to set the strategy to `number`; for details, see: config-the-volcano-device-plugin-binary

Hope the above information is helpful to you.
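
As a rough illustration of the strategy switch described above, the fragment below sketches how the device plugin DaemonSet might be edited. It is an assumption-laden sketch, not the project's confirmed manifest: the `--gpu-strategy` flag spelling, the container name, and the image are assumed here and should be checked against the README section referenced above.

```yaml
# Hypothetical excerpt of the device plugin DaemonSet (volcano-device-plugin.yml).
# The container name and image are placeholders; only the args line matters here.
spec:
  template:
    spec:
      containers:
        - name: volcano-device-plugin                     # assumed container name
          image: volcanosh/volcano-device-plugin:latest   # assumed image
          # Assumed flag spelling: switch from the default share mode
          # (volcano.sh/gpu-memory) to number mode (volcano.sh/gpu-number).
          args: ["--gpu-strategy=number"]
```

After a change like this, the plugin pods would have to be recreated before a pod requesting volcano.sh/gpu-number could be expected to receive VOLCANO_GPU_ALLOCATED.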

Hugh-yw · 2024-11-06T03:03:26Z

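@wangyang0616 Hello, I am using volcano v1.8.1 and k8s v1.23.17. I installed volcano-device-plugin for the first time, and after testing I uninstalled the volcano-device-plugin component and then uninstalled volcano itself. The nodes in the cluster still report the volcano.sh resources, and a pod can request volcano.sh/gpu-number and be scheduled by the native k8s scheduler. Besides volcano-device-plugin.yml, are there other special resources that were not cleaned up? Or is there another cause?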

Capacity:
  cpu:                    128
  ephemeral-storage:      824646552Ki
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 1056477056Ki
  nvidia.com/gpu:         8
  pods:                   520
  volcano.sh/gpu-memory:  0   # why does the volcano.sh resource still exist?
  volcano.sh/gpu-number:  8   # why does the volcano.sh resource still exist?
Allocatable:
  cpu:                    127600m
  ephemeral-storage:      824646552Ki
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 1027216591271
  nvidia.com/gpu:         8
  pods:                   520
  volcano.sh/gpu-memory:  0   # why does the volcano.sh resource still exist?
  volcano.sh/gpu-number:  8   # why does the volcano.sh resource still exist?

Trainbow · 2024-11-06T03:03:57Z

Received, thanks!

Trainbow changed the title ~~gpu~~ gpu number cannot be used Jan 18, 2023
