feat(sdk): add support for monitoring AMD GPU system metrics #5449
small nits, otherwise lgtm.
Let's go Team Red! 😎
Codecov Report
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #5449       +/-   ##
===========================================
+ Coverage   63.73%   82.40%   +18.66%
===========================================
  Files         352      327       -25
  Lines       40239    39378      -861
===========================================
+ Hits        25645    32448     +6803
+ Misses      14594     6930     -7664
I'm running this. I see this in
But I don't get the new graphs. |
Hi! Would it be possible to get this submitted soon (it looks like it's essentially ready)? We'd love to use it for a few large runs we're launching soon. |
@dirkgr thank you for testing this! I'm guessing that you didn't see the graphs because some of the expected keys were missing or invalid in the output of the rocm-smi call, and I wasn't too careful about that. I've added some robustness checks, so now, as long as any data is present and parsable, a graph for it will appear in the app. Could you give it another try? One reason for the missing values could be that the version of rocm-smi does not (fully?) support the GPU. I saw that when I tried testing this PR on AWS g4ad VMs, which come with Radeon Pro V520 GPUs: I could get the static info about the card, but not the metrics. I noticed that you tested this on an MI200-series card. Which one specifically? Is it on the list of supported GPUs for the latest rocm-smi release (https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html#gpu-support-table)? |
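For readers following along, the kind of robustness check described in the comment above might look roughly like the sketch below. It is not the code from this PR: the function name `parse_card_stats`, the metric names, and the rocm-smi key strings in `ASSUMED_KEYS` are all illustrative assumptions, and the real keys vary between rocm-smi versions and GPU models.

```python
import math
from typing import Any, Dict

# ASSUMPTION: these key strings and metric names are illustrative only;
# the keys actually emitted depend on the rocm-smi version and the GPU.
ASSUMED_KEYS = {
    "GPU use (%)": "gpu_utilization",
    "GPU memory use (%)": "memory_utilization",
    "Temperature (Sensor memory) (C)": "temperature_c",
    "Average Graphics Package Power (W)": "power_watts",
}


def parse_card_stats(raw: Dict[str, Any]) -> Dict[str, float]:
    """Keep only the metrics that are present and parse as finite numbers.

    Missing keys, None, and strings such as "N/A" are skipped rather than
    raising, so one bad field does not hide the remaining metrics.
    """
    stats: Dict[str, float] = {}
    for key, name in ASSUMED_KEYS.items():
        try:
            value = float(raw.get(key))
        except (TypeError, ValueError):
            continue  # key missing or value not numeric
        if math.isfinite(value):
            stats[name] = value
    return stats
```

With this shape, a card that reports only utilization still yields one metric instead of none, which matches the "any parsable data gets a graph" behavior described above.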
@jacob-morrison I'm merging this PR into main -- it'll be in the 0.15.4 release scheduled for Tuesday, 6/6/23. |
@dirkgr @jacob-morrison would you mind sharing what a |
They are MI250X GPUs. I can send you the output of |
Turns out we're in a maintenance window and can't access the cluster, but I have this from the logs:
|
@dirkgr thanks! once you can access the cluster again, could you please run |
Sorry for the delay, here's the output of that command from one of the nodes we're running a job on:
|
Fixes the part of #4767 that is about AMD GPU stats tracking.
Description
🤖 Generated by Copilot at aff452d
This pull request adds support for detecting and reporting AMD GPU devices in the wandb system metrics data. It defines a new field env.amd_gpu in the wandb_telemetry.proto file and its corresponding Python files. It also adds a new class GPUAMD in the assets package, which uses the rocm_smi tool to collect and send AMD GPU stats to the wandb interface. It also includes a unit test for the GPUAMD class.
https://wandb.ai/dimaduev/amd/runs/k9wp95u5?workspace=user-dimaduev
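As a rough illustration of the collection step described above, the sketch below shells out to rocm-smi and parses its JSON output. It is not the `GPUAMD` implementation from this PR. The helper names `run_rocm_smi` and `collect_amd_gpu_info` are hypothetical, and the sketch assumes that the installed rocm-smi accepts `-a --json` and that per-device entries are keyed as `card0`, `card1`, and so on.

```python
import json
import shutil
import subprocess
from typing import Any, Callable, Dict, Optional


def run_rocm_smi(timeout: float = 10.0) -> Optional[str]:
    """Return raw rocm-smi output, or None if the tool is missing or fails."""
    path = shutil.which("rocm-smi")
    if path is None:
        return None
    try:
        # ASSUMPTION: "-a --json" is supported by the installed rocm-smi.
        result = subprocess.run(
            [path, "-a", "--json"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout


def collect_amd_gpu_info(
    runner: Callable[[], Optional[str]] = run_rocm_smi,
) -> Dict[str, Any]:
    """Parse rocm-smi output into a per-card dict; return {} on any failure."""
    output = runner()
    if not output:
        return {}
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    # ASSUMPTION: device entries use keys that start with "card".
    return {
        key: value
        for key, value in data.items()
        if key.startswith("card") and isinstance(value, dict)
    }
```

Passing the runner in as a parameter keeps the parsing testable on machines without an AMD GPU, which is roughly what a unit test for such a class would need.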
Testing
Tested on AMD Cloud (ubuntu 22.04 box with 2 MI210's and pre-installed rocm-smi).
Checklist