volcano vgpu metrics not updated properly #3605

Open
archlitchi opened this issue Jul 17, 2024 · 6 comments · May be fixed by #3614

Labels
good first issue · help wanted · kind/bug

Comments

@archlitchi · Contributor

If you submit a vGPU job, you can see the corresponding metrics from the scheduler metrics endpoint, as follows:

Task YAML:

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: ubuntu:20.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-memory: 1024
        volcano.sh/vgpu-number: 1

Then, by querying the scheduler metrics endpoint, you can get the vGPU overview of vc-scheduler:

curl {vc-scheduler}:8080/metrics
# HELP volcano_vgpu_device_allocated_cores The percentage of gpu compute cores allocated in this card
# TYPE volcano_vgpu_device_allocated_cores gauge
volcano_vgpu_device_allocated_cores{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_allocated_cores{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 0
# HELP volcano_vgpu_device_allocated_memory The number of vgpu memory allocated in this card
# TYPE volcano_vgpu_device_allocated_memory gauge
volcano_vgpu_device_allocated_memory{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_allocated_memory{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 1024
# HELP volcano_vgpu_device_memory_limit The number of total device memory allocated in this card
# TYPE volcano_vgpu_device_memory_limit gauge
volcano_vgpu_device_memory_limit{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 32768
volcano_vgpu_device_memory_limit{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 32768
# HELP volcano_vgpu_device_shared_number The number of vgpu tasks sharing this card
# TYPE volcano_vgpu_device_shared_number gauge
volcano_vgpu_device_shared_number{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_shared_number{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 1

But these metrics are not cleaned up after the pod ends; they remain even after the pod is deleted.

@archlitchi added the kind/bug label on Jul 17, 2024
@Monokaix · Member

/good-first-issue

@volcano-sh-bot · Contributor

@Monokaix:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot added the good first issue and help wanted labels on Jul 19, 2024
@googs1025 · Member

/assign

@googs1025 · Member

Let me make sure I understand the problem first. When the pod is updated, the metric is not updated accordingly, right? @archlitchi

@Monokaix · Member

The percentage of gpu compute cores allocated in this card

I think the problem is that the metric is not updated when the pod is deleted :) The core code is in pkg/scheduler/api/devices/nvidia/vgpu/metrics.go and pkg/scheduler/api/devices/nvidia/vgpu/device_info.go.
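
A minimal sketch of the kind of cleanup this implies, assuming the series in metrics.go are prometheus.GaugeVec values keyed by devID; the variable names and the releaseVGPUMetrics helper below are hypothetical illustrations, not the actual Volcano code:

package vgpu

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical gauge vectors mirroring the series shown in the report above;
// the real definitions live in pkg/scheduler/api/devices/nvidia/vgpu/metrics.go
// and may use different names.
var (
	deviceSharedNumber = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "volcano_vgpu_device_shared_number",
		Help: "The number of vgpu tasks sharing this card",
	}, []string{"devID"})
	deviceAllocatedMemory = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "volcano_vgpu_device_allocated_memory",
		Help: "The number of vgpu memory allocated in this card",
	}, []string{"devID"})
	deviceAllocatedCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "volcano_vgpu_device_allocated_cores",
		Help: "The percentage of gpu compute cores allocated in this card",
	}, []string{"devID"})
)

func init() {
	// Register with the default registry so the gauges show up on /metrics.
	prometheus.MustRegister(deviceSharedNumber, deviceAllocatedMemory, deviceAllocatedCores)
}

// releaseVGPUMetrics is a hypothetical helper that the device release path in
// device_info.go could call when a pod's vGPU allocation is freed, so each
// per-device gauge drops back to the remaining usage instead of keeping its
// last written value forever.
func releaseVGPUMetrics(devID string, releasedMemoryMiB, releasedCorePercent float64) {
	deviceSharedNumber.WithLabelValues(devID).Dec()
	deviceAllocatedMemory.WithLabelValues(devID).Sub(releasedMemoryMiB)
	deviceAllocatedCores.WithLabelValues(devID).Sub(releasedCorePercent)
}

An alternative is to recompute the gauges from the scheduler's device snapshot with Set on every update, or to call DeleteLabelValues for devices whose series should disappear entirely; either way, something has to touch the series when the pod goes away, because a Prometheus gauge otherwise keeps reporting its last written value.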

@googs1025 linked a pull request (#3614) on Jul 19, 2024 that will close this issue
@archlitchi · Contributor Author

Let me make sure I understand the problem first. When the pod is updated, the metric is not updated accordingly, right? @archlitchi

Yes, I will submit a patch to fix that.
