volcano vgpu metrics not updated properly #3605

Open
archlitchi opened this issue Jul 17, 2024 · 6 comments · May be fixed by #3614

Labels
good first issue · help wanted · kind/bug

Comments

@archlitchi · Contributor

If you submit a vGPU job, you can see the corresponding metrics from the scheduler metrics endpoint, as follows:

Task YAML:

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: ubuntu:20.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-memory: 1024
        volcano.sh/vgpu-number: 1

Then, by querying the scheduler metrics endpoint, you can get the vGPU overview of vc-scheduler:

curl {vc-scheduler}:8080/metrics
# HELP volcano_vgpu_device_allocated_cores The percentage of gpu compute cores allocated in this card
# TYPE volcano_vgpu_device_allocated_cores gauge
volcano_vgpu_device_allocated_cores{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_allocated_cores{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 0
# HELP volcano_vgpu_device_allocated_memory The number of vgpu memory allocated in this card
# TYPE volcano_vgpu_device_allocated_memory gauge
volcano_vgpu_device_allocated_memory{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_allocated_memory{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 1024
# HELP volcano_vgpu_device_memory_limit The number of total device memory allocated in this card
# TYPE volcano_vgpu_device_memory_limit gauge
volcano_vgpu_device_memory_limit{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 32768
volcano_vgpu_device_memory_limit{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 32768
# HELP volcano_vgpu_device_shared_number The number of vgpu tasks sharing this card
# TYPE volcano_vgpu_device_shared_number gauge
volcano_vgpu_device_shared_number{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_shared_number{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 1

But these metrics are not cleaned up after the pod ends; they remain even after the pod is deleted.

@archlitchi added the kind/bug label on Jul 17, 2024
@Monokaix · Member

/good-first-issue

@volcano-sh-bot · Contributor

@Monokaix:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot added the good first issue and help wanted labels on Jul 19, 2024
@googs1025 · Member

/assign

@googs1025 · Member

Let me make sure I understand the problem first. When the pod is updated, the metric is not updated accordingly, right? @archlitchi

@Monokaix · Member

The percentage of gpu compute cores allocated in this card

I think the problem is that the metric is not updated when the pod is deleted :) The core code is in pkg/scheduler/api/devices/nvidia/vgpu/metrics.go and pkg/scheduler/api/devices/nvidia/vgpu/device_info.go.
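
A minimal sketch of the kind of cleanup this implies, assuming the series in metrics.go are prometheus.GaugeVec values keyed by devID; the variable names and the releaseVGPUMetrics helper below are hypothetical illustrations, not the actual Volcano code:

package vgpu

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical gauge vectors mirroring the series shown in the report above;
// the real definitions live in pkg/scheduler/api/devices/nvidia/vgpu/metrics.go
// and may use different names.
var (
	deviceSharedNumber = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "volcano_vgpu_device_shared_number",
		Help: "The number of vgpu tasks sharing this card",
	}, []string{"devID"})
	deviceAllocatedMemory = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "volcano_vgpu_device_allocated_memory",
		Help: "The number of vgpu memory allocated in this card",
	}, []string{"devID"})
	deviceAllocatedCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "volcano_vgpu_device_allocated_cores",
		Help: "The percentage of gpu compute cores allocated in this card",
	}, []string{"devID"})
)

func init() {
	// Register with the default registry so the gauges show up on /metrics.
	prometheus.MustRegister(deviceSharedNumber, deviceAllocatedMemory, deviceAllocatedCores)
}

// releaseVGPUMetrics is a hypothetical helper that the device release path in
// device_info.go could call when a pod's vGPU allocation is freed, so each
// per-device gauge drops back to the remaining usage instead of keeping its
// last written value forever.
func releaseVGPUMetrics(devID string, releasedMemoryMiB, releasedCorePercent float64) {
	deviceSharedNumber.WithLabelValues(devID).Dec()
	deviceAllocatedMemory.WithLabelValues(devID).Sub(releasedMemoryMiB)
	deviceAllocatedCores.WithLabelValues(devID).Sub(releasedCorePercent)
}

An alternative is to recompute the gauges from the scheduler's device snapshot with Set on every update, or to call DeleteLabelValues for devices whose series should disappear entirely; either way, something has to touch the series when the pod goes away, because a Prometheus gauge otherwise keeps reporting its last written value.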

@googs1025 linked a pull request (#3614) on Jul 19, 2024 that will close this issue
@archlitchi · Contributor Author

Let me make sure I understand the problem first. When the pod is updated, the metric is not updated accordingly, right? @archlitchi

Yes, I will submit a patch to fix that.
