SKS-2160: Fix 'where' condition when getting GPU device allocation details #157
Issue
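When SKS creates a GPU passthrough cluster, the virtual machine is configured with a GPU, but the device actually attached is a vGPU.
As shown in the screenshot, the where condition was never populated, so it had no effect and the query returned both GPU and vGPU devices. Because gpuDeviceUsage was not specified and the GPU and vGPU share the same model (Tesla T4), the vGPU device was treated as a GPU.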
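Steps to reproduce
The cluster has three Tesla T4 cards; split one of them into vGPUs.
Create a 1 CP + 3 Worker cluster, with each Worker configured for one Tesla T4 in passthrough mode.
After the cluster is created, one of the Workers is observed to have a vGPU attached.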
Change
When querying GPU information, specify the correct where condition (gpuDeviceUsage and host ID).
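The effect of an empty versus a populated where condition can be illustrated with a minimal in-memory sketch. This is not the code changed in this PR: the `GpuDevice` and `GpuWhere` types, the `find_gpu_devices` helper, the usage values, and the device and host IDs are all hypothetical stand-ins chosen for illustration. It assumes only what the description states, namely that an unset filter matches every device and that the GPU and vGPU share the model Tesla T4.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical usage values; the real API's enum names may differ.
PASSTHROUGH = "PASS_THROUGH"
VGPU = "VGPU"


@dataclass(frozen=True)
class GpuDevice:
    id: str
    host_id: str
    model: str
    usage: str


@dataclass(frozen=True)
class GpuWhere:
    """Hypothetical filter: a field left as None matches every device."""
    host_id: Optional[str] = None
    usage: Optional[str] = None


def find_gpu_devices(devices: List[GpuDevice], where: GpuWhere) -> List[GpuDevice]:
    """Return the devices that satisfy every populated field of the filter."""
    return [
        d for d in devices
        if (where.host_id is None or d.host_id == where.host_id)
        and (where.usage is None or d.usage == where.usage)
    ]


# Three Tesla T4 cards, one of them split into vGPUs (IDs are made up).
inventory = [
    GpuDevice("gpu-1", "host-a", "Tesla T4", PASSTHROUGH),
    GpuDevice("gpu-2", "host-a", "Tesla T4", PASSTHROUGH),
    GpuDevice("gpu-3", "host-a", "Tesla T4", VGPU),
]

# Empty filter: the vGPU card is returned too, and its model alone
# cannot distinguish it from a passthrough GPU.
unfiltered = find_gpu_devices(inventory, GpuWhere())

# Populated filter: only passthrough devices on the requested host remain.
filtered = find_gpu_devices(inventory, GpuWhere(host_id="host-a", usage=PASSTHROUGH))
```

In this sketch `unfiltered` contains all three cards while `filtered` contains only the two passthrough cards, which mirrors the behavior described under Issue and the expected result under Test.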
Test
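The cluster has three Tesla T4 cards; split one of them into vGPUs.
Create a 1 CP + 3 Worker cluster, with each Worker configured for one Tesla T4 in passthrough mode.
After creation, two Workers have a Tesla T4 attached. The remaining Worker, haijian-gpu-workergroup1-ltp5m, has no vGPU attached and is waiting for enough available GPU passthrough devices.
The Tesla T4 that was split into vGPUs is not used.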