This repository was archived by the owner on Aug 7, 2025. It is now read-only.
Metric collector should collect more GPU metrics #1937
Closed
Description
🚀 The feature
I'm seeing customers run commands like the one below:
nvidia-smi --format=csv,noheader,nounits --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw,clocks.current.sm,clocks.current.memory -l 10 > logs/gpu-stats.log &
However, our existing metric_collector.py only has metrics for utilization and memory used, and customers may also be interested in clock speed and power draw. We should make those available as well.
serve/ts/metrics/system_metrics.py
Lines 72 to 88 in e181fee
for value in info:
    dimension_gpu = [
        Dimension("Level", "Host"),
        Dimension("device_id", value["index"]),
    ]
    system_metrics.append(
        Metric(
            "GPUMemoryUtilization",
            value["mem_used_percent"],
            "percent",
            dimension_gpu,
        )
    )
    system_metrics.append(
        Metric("GPUMemoryUsed", value["mem_used"], "MB", dimension_gpu)
    )
Motivation, pitch
This is easy enough to do with NVML, which already exposes this data; we just need to create new Metric objects for it.
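As a rough sketch of what that could look like, the snippet below reads power draw and clock speeds through the `pynvml` bindings and turns them into metric rows. The metric names (`GPUPowerDraw`, `GPUClockSM`, `GPUClockMemory`), the helper names, and the plain tuple used in place of TorchServe's `Metric`/`Dimension` objects are all assumptions made for illustration, not existing TorchServe code.

```python
from typing import Dict, List, Tuple

# (name, value, unit, device_id): a plain stand-in for a Metric plus its
# device_id Dimension. This type is hypothetical, used only for this sketch.
MetricRow = Tuple[str, float, str, int]


def build_extra_gpu_metrics(readings: List[Dict[str, float]]) -> List[MetricRow]:
    """Convert raw per-GPU readings into metric rows (metric names are hypothetical)."""
    rows: List[MetricRow] = []
    for reading in readings:
        device_id = int(reading["index"])
        # NVML reports power in milliwatts; convert to watts.
        rows.append(("GPUPowerDraw", reading["power_mw"] / 1000.0, "W", device_id))
        rows.append(("GPUClockSM", reading["clock_sm_mhz"], "MHz", device_id))
        rows.append(("GPUClockMemory", reading["clock_mem_mhz"], "MHz", device_id))
    return rows


def read_nvml() -> List[Dict[str, float]]:
    """Query NVML via pynvml; returns [] if pynvml or the NVIDIA driver is unavailable."""
    try:
        import pynvml

        pynvml.nvmlInit()
    except Exception:
        return []
    try:
        readings = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            # Some GPUs do not support every query; these calls may raise
            # pynvml.NVMLError there, which this sketch does not handle.
            readings.append(
                {
                    "index": i,
                    "power_mw": pynvml.nvmlDeviceGetPowerUsage(handle),
                    "clock_sm_mhz": pynvml.nvmlDeviceGetClockInfo(
                        handle, pynvml.NVML_CLOCK_SM
                    ),
                    "clock_mem_mhz": pynvml.nvmlDeviceGetClockInfo(
                        handle, pynvml.NVML_CLOCK_MEM
                    ),
                }
            )
        return readings
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    for row in build_extra_gpu_metrics(read_nvml()):
        print(row)
```

In the collector itself, each row would presumably become a `Metric(name, value, unit, dimension_gpu)` appended to `system_metrics`, mirroring the existing `GPUMemoryUsed` entry.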
Alternatives
No response
Additional context
No response