This repository was archived by the owner on Aug 7, 2025. It is now read-only.

Metric collector should collect more GPU metrics #1937

@msaroufim

Description

🚀 The feature

I'm seeing customers run commands like the one below:

nvidia-smi --format=csv,noheader,nounits --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw,clocks.current.sm,clocks.current.memory -l 10 > logs/gpu-stats.log &

However, our existing metric_collector.py only has metrics for memory utilization and memory used, as shown in the snippet below. Customers may also be interested in clock speed and power draw, so we should make those available as well.

for value in info:
    dimension_gpu = [
        Dimension("Level", "Host"),
        Dimension("device_id", value["index"]),
    ]
    system_metrics.append(
        Metric(
            "GPUMemoryUtilization",
            value["mem_used_percent"],
            "percent",
            dimension_gpu,
        )
    )
    system_metrics.append(
        Metric("GPUMemoryUsed", value["mem_used"], "MB", dimension_gpu)
    )

Motivation, pitch

This is easy enough to do with NVML, which already exposes this data; we just need to create new Metric objects for it.
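A rough sketch of what this could look like, assuming the pynvml bindings are used to read the values. The `Dimension` and `Metric` classes here are simplified stand-ins for the real ones so the snippet is self-contained, and the metric names (`GPUPowerDraw`, `GPUClockSM`, `GPUClockMemory`, `GPUTemperature`) and the function name are hypothetical, not existing TorchServe metrics. The NVML handle is passed in as a parameter so the function can be exercised without a GPU.

```python
from dataclasses import dataclass
from typing import Any, List


# Simplified stand-ins for the Dimension/Metric classes used in
# metric_collector.py (assumption: the real classes take similar arguments).
@dataclass
class Dimension:
    name: str
    value: Any


@dataclass
class Metric:
    name: str
    value: Any
    unit: str
    dimensions: List[Dimension]


def collect_extra_gpu_metrics(nvml) -> List[Metric]:
    """Collect power, clock, and temperature metrics for every GPU.

    `nvml` is expected to be the pynvml module (or any object exposing the
    same functions and constants). Metric names are hypothetical.
    """
    metrics: List[Metric] = []
    nvml.nvmlInit()
    try:
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            dimension_gpu = [
                Dimension("Level", "Host"),
                Dimension("device_id", index),
            ]
            # NVML reports power usage in milliwatts; convert to watts.
            power_w = nvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            # Clock speeds are reported in MHz.
            sm_mhz = nvml.nvmlDeviceGetClockInfo(handle, nvml.NVML_CLOCK_SM)
            mem_mhz = nvml.nvmlDeviceGetClockInfo(handle, nvml.NVML_CLOCK_MEM)
            # Temperature is reported in degrees Celsius.
            temp_c = nvml.nvmlDeviceGetTemperature(
                handle, nvml.NVML_TEMPERATURE_GPU
            )
            metrics.append(Metric("GPUPowerDraw", power_w, "W", dimension_gpu))
            metrics.append(Metric("GPUClockSM", sm_mhz, "MHz", dimension_gpu))
            metrics.append(Metric("GPUClockMemory", mem_mhz, "MHz", dimension_gpu))
            metrics.append(Metric("GPUTemperature", temp_c, "C", dimension_gpu))
    finally:
        # Always release NVML, even if a query fails part-way through.
        nvml.nvmlShutdown()
    return metrics
```

On a machine with an NVIDIA driver installed, this would be called as `collect_extra_gpu_metrics(pynvml)` after `import pynvml`, and the returned list could be appended to `system_metrics` alongside the existing memory metrics.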

Alternatives

No response

Additional context

No response
