
SKS-2160: Fix 'where' condition when getting GPU device allocation details #157

Merged
merged 1 commit into from
Dec 1, 2023

@haijianyang haijianyang commented Nov 30, 2023

Issue

When SKS creates a GPU passthrough cluster, the virtual machine is configured with a GPU, but the device actually attached is a vGPU.

image

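As the screenshot shows, the where condition was never actually given a value, so it had no effect and the query returned both GPU and vGPU devices. Because gpuDeviceUsage was not specified and the GPU and vGPU report the same model (Tesla T4), the vGPU device was treated as a GPU.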
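
To make the failure mode concrete, here is a minimal Go sketch. The `gpuDevice` type, its fields, the usage strings, and `matchByModel` are hypothetical stand-ins rather than the real SDK models. It only illustrates that a lookup keyed on model alone cannot tell a passthrough card from a vGPU-mode card when both report Tesla T4.

```go
package main

import "fmt"

// gpuDevice is a hypothetical, simplified device record (not the real SDK model).
type gpuDevice struct {
	ID    string
	Model string
	Usage string // "PASS_THROUGH" or "VGPU" (assumed values)
}

// matchByModel mimics a lookup that compares only the model name
// and ignores how the device is used.
func matchByModel(devices []gpuDevice, model string) []gpuDevice {
	var out []gpuDevice
	for _, d := range devices {
		if d.Model == model {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	devices := []gpuDevice{
		{ID: "gpu-1", Model: "Tesla T4", Usage: "PASS_THROUGH"},
		{ID: "gpu-2", Model: "Tesla T4", Usage: "VGPU"},
		{ID: "gpu-3", Model: "Tesla T4", Usage: "PASS_THROUGH"},
	}
	// All three devices match, including the vGPU-mode card.
	for _, d := range matchByModel(devices, "Tesla T4") {
		fmt.Printf("%s %s %s\n", d.ID, d.Model, d.Usage)
	}
}
```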

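Steps to reproduce

  1. The cluster has three Tesla T4 cards; split one of them into vGPUs.
    image

  2. Create a cluster with 1 CP + 3 Workers, configuring each Worker with one Tesla T4 in passthrough mode.
    image

  3. After the cluster is created, one of the Workers is observed to have a vGPU attached.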

image

Change

When querying GPU information, specify the correct where query conditions (gpuDeviceUsage and host ID).
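
The effect of an empty versus a populated where condition can be sketched as follows. This is an in-memory illustration under stated assumptions: `device`, `where`, `query`, and the usage strings are hypothetical and do not reproduce the SDK's `GpuDeviceWhereInput` fields or the server's behavior.

```go
package main

import "fmt"

// device is a hypothetical, simplified GPU device record.
type device struct {
	ID     string
	HostID string
	Usage  string // "PASS_THROUGH" or "VGPU" (assumed values)
}

// where is a hypothetical filter. An unset field means "no constraint",
// so a zero-value where matches every device.
type where struct {
	Usage   *string
	HostIDs []string
}

// matches reports whether d satisfies every constraint that is set in w.
func (w where) matches(d device) bool {
	if w.Usage != nil && d.Usage != *w.Usage {
		return false
	}
	if len(w.HostIDs) == 0 {
		return true
	}
	for _, id := range w.HostIDs {
		if id == d.HostID {
			return true
		}
	}
	return false
}

// query returns the devices that satisfy w.
func query(devices []device, w where) []device {
	var out []device
	for _, d := range devices {
		if w.matches(d) {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	devices := []device{
		{ID: "gpu-1", HostID: "host-1", Usage: "PASS_THROUGH"},
		{ID: "gpu-2", HostID: "host-2", Usage: "VGPU"},
		{ID: "gpu-3", HostID: "host-3", Usage: "PASS_THROUGH"},
	}

	// Empty where: nothing is filtered, so GPU and vGPU devices both come back.
	fmt.Println(len(query(devices, where{})))

	// Usage and host IDs set: only passthrough devices on the listed hosts remain.
	usage := "PASS_THROUGH"
	filtered := query(devices, where{Usage: &usage, HostIDs: []string{"host-1", "host-2"}})
	fmt.Println(len(filtered), filtered[0].ID)
}
```

Treating unset fields as "no constraint" is the design choice that makes an empty where silently return everything, which is why the condition has to be populated explicitly.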

Test

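  1. The cluster has three Tesla T4 cards; split one of them into vGPUs.
    image

  2. Create a cluster with 1 CP + 3 Workers, configuring each Worker with one Tesla T4 in passthrough mode.
    image

  3. After the cluster is created, two Workers have a Tesla T4 attached. The remaining Worker, haijian-gpu-workergroup1-ltp5m, has no vGPU attached and is waiting for enough available GPU passthrough devices.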

No host with the required GPU devices for the virtual machine, so wait for enough available hosts" namespace="default" elfCluster="haijian-gpu" elfMachine="haijian-gpu-workergroup1-ltp5m" machine="haijian-gpu-workergroup1-nwfxx-nd6q5"

The Tesla T4 that was split into vGPUs is not used.
image

  4. Switch the vGPU back to passthrough; the Worker haijian-gpu-workergroup1-ltp5m mentioned above then has a Tesla T4 attached.
image
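
The expected outcome of the test steps above can be sketched as a small simulation. It is a simplified model, not the provider's scheduling code: `card`, `assign`, the worker names, and the usage strings are hypothetical, and host placement is ignored.

```go
package main

import "fmt"

// card is a hypothetical record for one physical GPU and its current mode.
type card struct {
	ID    string
	Usage string // "PASS_THROUGH" or "VGPU" (assumed values)
}

// assign gives each worker one unused passthrough card, in order.
// Workers that cannot get a card are returned in waiting.
func assign(workers []string, cards []card) (assigned map[string]string, waiting []string) {
	assigned = make(map[string]string)
	next := 0
	for _, w := range workers {
		// Skip cards that are not in passthrough mode.
		for next < len(cards) && cards[next].Usage != "PASS_THROUGH" {
			next++
		}
		if next == len(cards) {
			waiting = append(waiting, w)
			continue
		}
		assigned[w] = cards[next].ID
		next++
	}
	return assigned, waiting
}

func main() {
	workers := []string{"worker-1", "worker-2", "worker-3"}
	cards := []card{
		{ID: "t4-a", Usage: "PASS_THROUGH"},
		{ID: "t4-b", Usage: "VGPU"},
		{ID: "t4-c", Usage: "PASS_THROUGH"},
	}

	// One card is in vGPU mode: two workers get a card, one waits.
	assigned, waiting := assign(workers, cards)
	fmt.Println(len(assigned), waiting)

	// After switching the vGPU card back to passthrough, no worker waits.
	cards[1].Usage = "PASS_THROUGH"
	assigned, waiting = assign(workers, cards)
	fmt.Println(len(assigned), waiting)
}
```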


codecov bot commented Nov 30, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (846e355) 59.28% compared to head (9dd0e5a) 59.31%.

Files Patch % Lines
pkg/service/vm.go 0.00% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #157      +/-   ##
==========================================
+ Coverage   59.28%   59.31%   +0.03%     
==========================================
  Files          19       19              
  Lines        3522     3520       -2     
==========================================
  Hits         2088     2088              
+ Misses       1285     1283       -2     
  Partials      149      149              


getDetailVMInfoByGpuDevicesParams := clientgpu.NewGetDetailVMInfoByGpuDevicesParams()
getDetailVMInfoByGpuDevicesParams.RequestBody = &models.GetGpuDevicesRequestBody{
Where: &models.GpuDeviceWhereInput{},
getDetailVMInfoByGpuDevicesParams.RequestBody.Where.AvailableVgpusNumGt = TowerInt32(0)

What was the reason it did not take effect?

@haijianyang haijianyang changed the title SKS-1775: Get GPU device allocation details using where SKS-2160 SKS-1775: Get GPU device allocation details using where Nov 30, 2023
@jessehu jessehu changed the title SKS-2160 SKS-1775: Get GPU device allocation details using where SKS-2160: Fix 'where' condition when getting GPU device allocation details Nov 30, 2023
@haijianyang haijianyang merged commit 2fe52eb into smartxworks:master Dec 1, 2023
2 of 3 checks passed