feat(sdk): add support for monitoring AMD GPU system metrics #5449

Merged
merged 26 commits into from Jun 2, 2023

Conversation

dmitryduev
Member

@dmitryduev dmitryduev commented Apr 28, 2023

Fixes the part of #4767 that is about AMD GPU stats tracking.

Description

🤖 Generated by Copilot at aff452d

This pull request adds support for detecting and reporting AMD GPU devices in the wandb system metrics data. It defines a new field env.amd_gpu in the wandb_telemetry.proto file and its corresponding Python files. It also adds a new class GPUAMD in the assets package, which uses the rocm_smi tool to collect and send AMD GPU stats to the wandb interface. It also includes a unit test for the GPUAMD class.

https://wandb.ai/dimaduev/amd/runs/k9wp95u5?workspace=user-dimaduev

Testing

Tested on AMD Cloud (ubuntu 22.04 box with 2 MI210's and pre-installed rocm-smi).

Checklist

  • Include reference to internal ticket "Fixes WB-NNNN" and/or GitHub issue "Fixes #NNNN" (if applicable)
  • Ensure PR title compliance with the conventional commits standards

🤖 Generated by Copilot at aff452d

We're the wandb crew and we monitor the stats
Of the GPUs that make our models run fast
We've added a new field for the AMD ones
So env.amd_gpu will tell us if they're on

@github-actions github-actions bot added cc-feat and removed cc-feat labels Apr 28, 2023
@dmitryduev dmitryduev requested a review from a team May 9, 2023 08:27
@github-actions github-actions bot added cc-feat and removed cc-feat labels May 9, 2023
@dmitryduev dmitryduev marked this pull request as ready for review May 9, 2023 09:02
Contributor

@andrewtruong andrewtruong left a comment

small nits, otherwise lgtm.

Let's go Team Red! 😎

Two review comments on wandb/sdk/internal/system/assets/gpu_amd.py (outdated, resolved)
@codecov

codecov bot commented May 11, 2023

Codecov Report

Merging #5449 (fbcf6a8) into main (427b736) will increase coverage by 18.66%.
The diff coverage is 92.55%.

❗ Current head fbcf6a8 differs from pull request most recent head 8871904. Consider uploading reports for the commit 8871904 to get more accurate results

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #5449       +/-   ##
===========================================
+ Coverage   63.73%   82.40%   +18.66%     
===========================================
  Files         352      327       -25     
  Lines       40239    39378      -861     
===========================================
+ Hits        25645    32448     +6803     
+ Misses      14594     6930     -7664     

| Flag | Coverage Δ |
|---|---|
| unittest | 82.40% <92.55%> (+18.66%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| wandb/sdk/internal/system/assets/gpu_amd.py | 92.47% <92.47%> (ø) |
| wandb/sdk/internal/system/assets/__init__.py | 100.00% <100.00%> (ø) |

... and 215 files with indirect coverage changes

@github-actions github-actions bot added cc-feat and removed cc-feat labels May 11, 2023
@dirkgr

dirkgr commented May 14, 2023

I'm running this. I see this in wandb-metadata.json:

"gpu_devices": [
        {
            "id": "0x7408",
            "unique_id": "0xa140459056f1eb61",
            "vbios_version": "113-D65201-046",
            "performance_level": "auto",
            "gpu_overdrive": "0",
            "gpu_memory_overdrive": "0",
            "max_power": "560.0",
            "series": "AMD INSTINCT MI200 (MCM) OAM LC MBA HPE C2",
            "model": "FirePro W4300",
            "vendor": "Advanced Micro Devices, Inc. [AMD/ATI]",
            "sku": "D65201",
            "sclk_range": "500Mhz - 1700Mhz",
            "mclk_range": "400Mhz - 1600Mhz"
        },
        [...]

But I don't get the new graphs.
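
For context, the `gpu_devices` fields above appear to mirror the static (non-metric) fields of the `rocm-smi -a --json` output posted later in this thread. The sketch below shows one way such a mapping could work. The key mapping is inferred by comparing the values in this thread, not copied from the PR, and `STATIC_KEY_MAP` and `static_info` are hypothetical names.

```python
from typing import Any, Dict

# Assumed mapping from rocm-smi JSON keys to metadata field names.
# Inferred from the values posted in this thread; not taken from the PR code.
STATIC_KEY_MAP = {
    "GPU ID": "id",
    "Unique ID": "unique_id",
    "VBIOS version": "vbios_version",
    "Performance Level": "performance_level",
    "GPU OverDrive value (%)": "gpu_overdrive",
    "GPU Memory OverDrive value (%)": "gpu_memory_overdrive",
    "Max Graphics Package Power (W)": "max_power",
    "Card series": "series",
    "Card model": "model",
    "Card vendor": "vendor",
    "Card SKU": "sku",
    "Valid sclk range": "sclk_range",
    "Valid mclk range": "mclk_range",
}


def static_info(card: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the static fields that are present; silently skip missing keys."""
    return {dst: card[src] for src, dst in STATIC_KEY_MAP.items() if src in card}
```

Because this path only copies strings, it can succeed even when the numeric metrics are missing or unparsable, which would be consistent with seeing the metadata but no graphs.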

@jacob-morrison

Hi! Would it be possible to get this submitted soon (it looks like it's essentially ready)? We'd love to use it for a few large runs we're launching soon.

@dmitryduev
Member Author

dmitryduev commented Jun 2, 2023

@dirkgr thank you for testing this! I'm guessing that you didn't see the graphs because some of the expected keys were missing or invalid in the output of the rocm-smi call, and I wasn't too careful about that. I've added some robustness checks, so now, as long as any data is present and parsable, a graph for it will appear in the app. Could you give it another try?

One reason for the missing values could be that the installed version of rocm-smi does not (fully?) support the GPU. I saw that when I tried testing this PR on AWS g4ad VMs, which come with Radeon Pro V520 GPUs: I could get the static info about the card, but not the metrics.

I noticed that you tested this on an MI200-series card. Which one specifically? Is it on the list of supported GPUs for the latest rocm-smi release https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html#gpu-support-table?
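
To illustrate the kind of robustness check described above, here is a minimal sketch, not the code from this PR. It keeps whichever metrics can be parsed and drops the rest, so one bad field does not suppress the others. The rocm-smi keys are ones that appear in output posted in this thread; the metric names, `METRIC_KEYS`, `to_float`, and `parse_metrics` are hypothetical.

```python
from typing import Any, Dict, Optional

# Hypothetical metric names mapped to keys seen in rocm-smi JSON output.
METRIC_KEYS = {
    "gpu": "GPU use (%)",
    "memoryAllocated": "GPU memory use (%)",
    "temp": "Temperature (Sensor memory) (C)",
    "powerWatts": "Average Graphics Package Power (W)",
}


def to_float(value: Any) -> Optional[float]:
    """Convert a raw value to float, or return None if it is missing or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Covers None (missing key) and strings such as "N/A (Secondary die)".
        return None


def parse_metrics(card: Dict[str, Any]) -> Dict[str, float]:
    """Return only the metrics whose values could be parsed."""
    metrics: Dict[str, float] = {}
    for name, key in METRIC_KEYS.items():
        value = to_float(card.get(key))
        if value is not None:
            metrics[name] = value
    return metrics
```

With this approach, a card that reports utilization and temperature but no power reading still yields two usable series instead of none.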

@dmitryduev dmitryduev enabled auto-merge (squash) June 2, 2023 07:46
@dmitryduev
Member Author

@jacob-morrison I'm merging this PR into main -- it'll be in the 0.15.4 release scheduled for Tuesday, 6/6/23.
In the meantime, could you give it a try and see if it works for you? I'd be happy to get in a follow-up PR before the release in case anything needs to be adjusted/fixed.

@dmitryduev dmitryduev merged commit 1bb62e9 into main Jun 2, 2023
50 of 52 checks passed
@dmitryduev dmitryduev deleted the amd-gpu branch June 2, 2023 07:56
@dmitryduev
Member Author

dmitryduev commented Jun 2, 2023

@dirkgr @jacob-morrison would you mind sharing what a rocm-smi call prints for you?

@dirkgr

dirkgr commented Jul 10, 2023

They are MI250X GPUs. I can send you the output of rocm-smi tomorrow.

@dirkgr

dirkgr commented Jul 10, 2023

Turns out we're in a maintenance window and can't access the cluster, but I have this from the logs:

nid005113 out: ======================= ROCm System Management Interface =======================
nid005113 out: ================================= Concise Info =================================
nid005113 out: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
nid005113 out: 0    47.0c  91.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%
nid005113 out: 1    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%
nid005113 out: 2    46.0c  97.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%
nid005113 out: 3    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%
nid005113 out: 4    47.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%
nid005113 out: 5    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%
nid005113 out: 6    45.0c  95.0W   800Mhz  1600Mhz  0%   auto  560.0W    0%   0%
nid005113 out: 7    39.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W      0%   0%
nid005113 out: ================================================================================
nid005113 out: ============================= End of ROCm SMI Log ==============================

@dmitryduev
Member Author

@dirkgr thanks! Once you can access the cluster again, could you please run rocm-smi -a --json? That is what we use internally.
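
For anyone who wants to reproduce that call from Python, a rough sketch follows. Only the command line `rocm-smi -a --json` comes from this thread; the function names, the timeout value, and the decision to return `None` on any failure are assumptions for illustration, not the SDK's actual implementation.

```python
import json
import shutil
import subprocess
from typing import Any, Dict, Optional


def parse_rocm_smi_json(text: str) -> Dict[str, Dict[str, Any]]:
    """Return only the per-GPU entries (keys such as "card0") from the JSON text."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return {}
    return {
        key: value
        for key, value in data.items()
        if key.startswith("card") and isinstance(value, dict)
    }


def query_rocm_smi(timeout: float = 10.0) -> Optional[Dict[str, Dict[str, Any]]]:
    """Run `rocm-smi -a --json`; return None if the tool is missing or the call fails."""
    binary = shutil.which("rocm-smi")
    if binary is None:
        return None
    try:
        result = subprocess.run(
            [binary, "-a", "--json"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
        return parse_rocm_smi_json(result.stdout)
    except (subprocess.SubprocessError, OSError, ValueError):
        # Non-zero exit, timeout, launch failure, or malformed JSON.
        return None
```

Keeping the parsing step separate from the subprocess call makes it possible to test against saved output on a machine without AMD GPUs.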

@jacob-morrison

Sorry for the delay, here's the output of that command from one of the nodes we're running a job on:

{"card0": {"GPU ID": "0x7408", "Unique ID": "0xa32343daf897162a", "VBIOS version": "113-D65201-046", "Temperature (Sensor edge) (C)": "56.0", "Temperature (Sensor junction) (C)": "61.0", "Temperature (Sensor memory) (C)": "65.0", "Temperature (Sensor HBM 0) (C)": "65.0", "Temperature (Sensor HBM 1) (C)": "63.0", "Temperature (Sensor HBM 2) (C)": "62.0", "Temperature (Sensor HBM 3) (C)": "63.0", "fclk clock speed:": "(0Mhz)", "fclk clock level:": "0", "mclk clock speed:": "(1600Mhz)", "mclk clock level:": "3", "sclk clock speed:": "(1700Mhz)", "sclk clock level:": "1", "socclk clock speed:": "(1090Mhz)", "socclk clock level:": "3", "Performance Level": "auto", "GPU OverDrive value (%)": "0", "GPU Memory OverDrive value (%)": "0", "Max Graphics Package Power (W)": "560.0", "Average Graphics Package Power (W)": "144.0", "GPU use (%)": "100", "GFX Activity": "324450778", "GPU memory use (%)": "0", "Memory Activity": "87110708", "GPU memory vendor": "hynix", "PCIe Replay Count": "0", "Serial Number": "692140000412", "Voltage (mV)": "912", "PCI Bus": "0000:C1:00.0", "ASD firmware version": "0x00000000", "CE firmware version": "0", "DMCU firmware version": "0", "MC firmware version": "0", "ME firmware version": "0", "MEC firmware version": "63", "MEC2 firmware version": "63", "PFP firmware version": "0", "RLC firmware version": "17", "RLC SRLC firmware version": "0", "RLC SRLG firmware version": "0", "RLC SRLS firmware version": "0", "SDMA firmware version": "8", "SDMA2 firmware version": "8", "SMC firmware version": "00.68.54.00", "SOS firmware version": "0x00270080", "TA RAS firmware version": "27.00.01.60", "TA XGMI firmware version": "32.00.00.13", "UVD firmware version": "0x00000000", "VCE firmware version": "0x00000000", "VCN firmware version": "0x0110101b", "Card series": "AMD INSTINCT MI200 (MCM) OAM LC MBA HPE C2", "Card model": "FirePro W4300", "Card vendor": "Advanced Micro Devices, Inc. 
[AMD/ATI]", "Card SKU": "D65201", "Valid sclk range": "500Mhz - 1700Mhz", "Valid mclk range": "400Mhz - 1600Mhz", "Voltage point 0": "0Mhz 0mV", "Voltage point 1": "0Mhz 0mV", "Voltage point 2": "0Mhz 0mV", "Energy counter": "15448344049909", "Accumulated Energy (uJ)": "236359666910145.5"}, "card1": {"GPU ID": "0x7408", "Unique ID": "0xf5dce4b4c0ea6eb1", "VBIOS version": "113-D65201-046", "Temperature (Sensor edge) (C)": "56.0", "Temperature (Sensor junction) (C)": "60.0", "Temperature (Sensor memory) (C)": "64.0", "Temperature (Sensor HBM 0) (C)": "63.0", "Temperature (Sensor HBM 1) (C)": "64.0", "Temperature (Sensor HBM 2) (C)": "63.0", "Temperature (Sensor HBM 3) (C)": "63.0", "fclk clock speed:": "(0Mhz)", "fclk clock level:": "0", "mclk clock speed:": "(1600Mhz)", "mclk clock level:": "3", "sclk clock speed:": "(1700Mhz)", "sclk clock level:": "1", "socclk clock speed:": "(1090Mhz)", "socclk clock level:": "3", "Performance Level": "auto", "GPU OverDrive value (%)": "0", "GPU Memory OverDrive value (%)": "0", "Max Graphics Package Power (W)": "0.0", "Average Graphics Package Power (W)": "N/A (Secondary die)", "GPU use (%)": "100", "GFX Activity": "465773592", "GPU memory use (%)": "0", "Memory Activity": "256584672", "GPU memory vendor": "hynix", "PCIe Replay Count": "0", "Serial Number": "692140000412", "Voltage (mV)": "887", "PCI Bus": "0000:C6:00.0", "ASD firmware version": "0x00000000", "CE firmware version": "0", "DMCU firmware version": "0", "MC firmware version": "0", "ME firmware version": "0", "MEC firmware version": "63", "MEC2 firmware version": "63", "PFP firmware version": "0", "RLC firmware version": "17", "RLC SRLC firmware version": "0", "RLC SRLG firmware version": "0", "RLC SRLS firmware version": "0", "SDMA firmware version": "8", "SDMA2 firmware version": "8", "SMC firmware version": "00.68.54.00", "SOS firmware version": "0x00270080", "TA RAS firmware version": "27.00.01.60", "TA XGMI firmware version": "32.00.00.13", "UVD firmware 
version": "0x00000000", "VCE firmware version": "0x00000000", "VCN firmware version": "0x0110101b", "Card series": "AMD INSTINCT MI200 (MCM) OAM LC MBA HPE C2", "Card model": "FirePro W4300", "Card vendor": "Advanced Micro Devices, Inc. [AMD/ATI]", "Card SKU": "D65201", "Valid sclk range": "500Mhz - 1700Mhz", "Valid mclk range": "400Mhz - 1600Mhz", "Voltage point 0": "0Mhz 0mV", "Voltage point 1": "0Mhz 0mV", "Voltage point 2": "0Mhz 0mV", "Energy counter": "0", "Accumulated Energy (uJ)": "0.0"}, "card2": {"GPU ID": "0x7408", "Unique ID": "0xd43c25c9441d1ba1", "VBIOS version": "113-D65201-046", "Temperature (Sensor edge) (C)": "51.0", "Temperature (Sensor junction) (C)": "57.0", "Temperature (Sensor memory) (C)": "63.0", "Temperature (Sensor HBM 0) (C)": "63.0", "Temperature (Sensor HBM 1) (C)": "60.0", "Temperature (Sensor HBM 2) (C)": "59.0", "Temperature (Sensor HBM 3) (C)": "63.0", "fclk clock speed:": "(0Mhz)", "fclk clock level:": "0", "mclk clock speed:": "(1600Mhz)", "mclk clock level:": "3", "sclk clock speed:": "(1700Mhz)", "sclk clock level:": "1", "socclk clock speed:": "(1090Mhz)", "socclk clock level:": "3", "Performance Level": "auto", "GPU OverDrive value (%)": "0", "GPU Memory OverDrive value (%)": "0", "Max Graphics Package Power (W)": "560.0", "Average Graphics Package Power (W)": "128.0", "GPU use (%)": "100", "GFX Activity": "1141796410", "GPU memory use (%)": "0", "Memory Activity": "237430125", "GPU memory vendor": "hynix", "PCIe Replay Count": "0", "Serial Number": "692139000119", "Voltage (mV)": "906", "PCI Bus": "0000:C9:00.0", "ASD firmware version": "0x00000000", "CE firmware version": "0", "DMCU firmware version": "0", "MC firmware version": "0", "ME firmware version": "0", "MEC firmware version": "63", "MEC2 firmware version": "63", "PFP firmware version": "0", "RLC firmware version": "17", "RLC SRLC firmware version": "0", "RLC SRLG firmware version": "0", "RLC SRLS firmware version": "0", "SDMA firmware version": "8", "SDMA2 firmware 
version": "8", "SMC firmware version": "00.68.54.00", "SOS firmware version": "0x00270080", "TA RAS firmware version": "27.00.01.60", "TA XGMI firmware version": "32.00.00.13", "UVD firmware version": "0x00000000", "VCE firmware version": "0x00000000", "VCN firmware version": "0x0110101b", "Card series": "AMD INSTINCT MI200 (MCM) OAM LC MBA HPE C2", "Card model": "FirePro W4300", "Card vendor": "Advanced Micro Devices, Inc. [AMD/ATI]", "Card SKU": "D65201", "Valid sclk range": "500Mhz - 1700Mhz", "Valid mclk range": "400Mhz - 1600Mhz", "Voltage point 0": "0Mhz 0mV", "Voltage point 1": "0Mhz 0mV", "Voltage point 2": "0Mhz 0mV", "Energy counter": "14006838908878", "Accumulated Energy (uJ)": "214304637977425.9"}, "card3": {"GPU ID": "0x7408", "Unique ID": "0x134e5b5cbe55ecb4", "VBIOS version": "113-D65201-046", "Temperature (Sensor edge) (C)": "59.0", "Temperature (Sensor junction) (C)": "59.0", "Temperature (Sensor memory) (C)": "61.0", "Temperature (Sensor HBM 0) (C)": "60.0", "Temperature (Sensor HBM 1) (C)": "56.0", "Temperature (Sensor HBM 2) (C)": "57.0", "Temperature (Sensor HBM 3) (C)": "61.0", "fclk clock speed:": "(0Mhz)", "fclk clock level:": "0", "mclk clock speed:": "(1600Mhz)", "mclk clock level:": "3", "sclk clock speed:": "(1700Mhz)", "sclk clock level:": "1", "socclk clock speed:": "(1090Mhz)", "socclk clock level:": "3", "Performance Level": "auto", "GPU OverDrive value (%)": "0", "GPU Memory OverDrive value (%)": "0", "Max Graphics Package Power (W)": "0.0", "Average Graphics Package Power (W)": "N/A (Secondary die)", "GPU use (%)": "100", "GFX Activity": "888974754", "GPU memory use (%)": "0", "Memory Activity": "253318460", "GPU memory vendor": "hynix", "PCIe Replay Count": "0", "Serial Number": "692139000119", "Voltage (mV)": "912", "PCI Bus": "0000:CE:00.0", "ASD firmware version": "0x00000000", "CE firmware version": "0", "DMCU firmware version": "0", "MC firmware version": "0", "ME firmware version": "0", "MEC firmware version": "63", "MEC2 
firmware version": "63", "PFP firmware version": "0", "RLC firmware version": "17", "RLC SRLC firmware version": "0", "RLC SRLG firmware version": "0", "RLC SRLS firmware version": "0", "SDMA firmware version": "8", "SDMA2 firmware version": "8", "SMC firmware version": "00.68.54.00", "SOS firmware version": "0x00270080", "TA RAS firmware version": "27.00.01.60", "TA XGMI firmware version": "32.00.00.13", "UVD firmware version": "0x00000000", "VCE firmware version": "0x00000000", "VCN firmware version": "0x0110101b", "Card series": "AMD INSTINCT MI200 (MCM) OAM LC MBA HPE C2", "Card model": "FirePro W4300", "Card vendor": "Advanced Micro Devices, Inc. [AMD/ATI]", "Card SKU": "D65201", "Valid sclk range": "500Mhz - 1700Mhz", "Valid mclk range": "400Mhz - 1600Mhz", "Voltage point 0": "0Mhz 0mV", "Voltage point 1": "0Mhz 0mV", "Voltage point 2": "0Mhz 0mV", "Energy counter": "0", "Accumulated Energy (uJ)": "0.0"}, "card4": {"GPU ID": "0x7408", "Unique ID": "0x4ab1f739f8361acd", "VBIOS version": "113-D65201-046", "Temperature (Sensor edge) (C)": "64.0", "Temperature (Sensor junction) (C)": "65.0", "Temperature (Sensor memory) (C)": "69.0", "Temperature (Sensor HBM 0) (C)": "69.0", "Temperature (Sensor HBM 1) (C)": "67.0", "Temperature (Sensor HBM 2) (C)": "69.0", "Temperature (Sensor HBM 3) (C)": "69.0", "fclk clock speed:": "(0Mhz)", "fclk clock level:": "0", "mclk clock speed:": "(1600Mhz)", "mclk clock level:": "3", "sclk clock speed:": "(1700Mhz)", "sclk clock level:": "1", "socclk clock speed:": "(1090Mhz)", "socclk clock level:": "3", "Performance Level": "auto", "GPU OverDrive value (%)": "0", "GPU Memory OverDrive value (%)": "0", "Max Graphics Package Power (W)": "560.0", "Average Graphics Package Power (W)": "130.0", "GPU use (%)": "100", "GFX Activity": "491488437", "GPU memory use (%)": "0", "Memory Activity": "223075763", "GPU memory vendor": "hynix", "PCIe Replay Count": "0", "Serial Number": "692139000260", "Voltage (mV)": "893", "PCI Bus": 
"0000:D1:00.0", "ASD firmware version": "0x00000000", "CE firmware version": "0", "DMCU firmware version": "0", "MC firmware version": "0", "ME firmware version": "0", "MEC firmware version": "63", "MEC2 firmware version": "63", "PFP firmware version": "0", "RLC firmware version": "17", "RLC SRLC firmware version": "0", "RLC SRLG firmware version": "0", "RLC SRLS firmware version": "0", "SDMA firmware version": "8", "SDMA2 firmware version": "8", "SMC firmware version": "00.68.54.00", "SOS firmware version": "0x00270080", "TA RAS firmware version": "27.00.01.60", "TA XGMI firmware version": "32.00.00.13", "UVD firmware version": "0x00000000", "VCE firmware version": "0x00000000", "VCN firmware version": "0x0110101b", "Card series": "AMD INSTINCT MI200 (MCM) OAM LC MBA HPE C2", "Card model": "FirePro W4300", "Card vendor": "Advanced Micro Devices, Inc. [AMD/ATI]", "Card SKU": "D65201", "Valid sclk range": "500Mhz - 1700Mhz", "Valid mclk range": "400Mhz - 1600Mhz", "Voltage point 0": "0Mhz 0mV", "Voltage point 1": "0Mhz 0mV", "Voltage point 2": "0Mhz 0mV", "Energy counter": "13913808507170", "Accumulated Energy (uJ)": "212881272813549.38"}, "card5": {"GPU ID": "0x7408", "Unique ID": "0x87fc3ea34b2236f3", "VBIOS version": "113-D65201-046", "Temperature (Sensor edge) (C)": "64.0", "Temperature (Sensor junction) (C)": "64.0", "Temperature (Sensor memory) (C)": "69.0", "Temperature (Sensor HBM 0) (C)": "68.0", "Temperature (Sensor HBM 1) (C)": "67.0", "Temperature (Sensor HBM 2) (C)": "64.0", "Temperature (Sensor HBM 3) (C)": "69.0", "fclk clock speed:": "(0Mhz)", "fclk clock level:": "0", "mclk clock speed:": "(1600Mhz)", "mclk clock level:": "3", "sclk clock speed:": "(1700Mhz)", "sclk clock level:": "1", "socclk clock speed:": "(1090Mhz)", "socclk clock level:": "3", "Performance Level": "auto", "GPU OverDrive value (%)": "0", "GPU Memory OverDrive value (%)": "0", "Max Graphics Package Power (W)": "0.0", "Average Graphics Package Power (W)": "N/A (Secondary die)", 
"GPU use (%)": "100", "GFX Activity": "283946554", "GPU memory use (%)": "0", "Memory Activity": "245061500", "GPU memory vendor": "hynix", "PCIe Replay Count": "0", "Serial Number": "692139000260", "Voltage (mV)": "906", "PCI Bus": "0000:D6:00.0", "ASD firmware version": "0x00000000", "CE firmware version": "0", "DMCU firmware version": "0", "MC firmware version": "0", "ME firmware version": "0", "MEC firmware version": "63", "MEC2 firmware version": "63", "PFP firmware version": "0", "RLC firmware version": "17", "RLC SRLC firmware version": "0", "RLC SRLG firmware version": "0", "RLC SRLS firmware version": "0", "SDMA firmware version": "8", "SDMA2 firmware version": "8", "SMC firmware version": "00.68.54.00", "SOS firmware version": "0x00270080", "TA RAS firmware version": "27.00.01.60", "TA XGMI firmware version": "32.00.00.13", "UVD firmware version": "0x00000000", "VCE firmware version": "0x00000000", "VCN firmware version": "0x0110101b", "Card series": "AMD INSTINCT MI200 (MCM) OAM LC MBA HPE C2", "Card model": "FirePro W4300", "Card vendor": "Advanced Micro Devices, Inc. 
[AMD/ATI]", "Card SKU": "D65201", "Valid sclk range": "500Mhz - 1700Mhz", "Valid mclk range": "400Mhz - 1600Mhz", "Voltage point 0": "0Mhz 0mV", "Voltage point 1": "0Mhz 0mV", "Voltage point 2": "0Mhz 0mV", "Energy counter": "0", "Accumulated Energy (uJ)": "0.0"}, "card6": {"GPU ID": "0x7408", "Unique ID": "0x7be4ee6dcc2097c9", "VBIOS version": "113-D65201-046", "Temperature (Sensor edge) (C)": "52.0", "Temperature (Sensor junction) (C)": "54.0", "Temperature (Sensor memory) (C)": "64.0", "Temperature (Sensor HBM 0) (C)": "60.0", "Temperature (Sensor HBM 1) (C)": "60.0", "Temperature (Sensor HBM 2) (C)": "59.0", "Temperature (Sensor HBM 3) (C)": "64.0", "fclk clock speed:": "(0Mhz)", "fclk clock level:": "0", "mclk clock speed:": "(1600Mhz)", "mclk clock level:": "3", "sclk clock speed:": "(1700Mhz)", "sclk clock level:": "1", "socclk clock speed:": "(1090Mhz)", "socclk clock level:": "3", "Performance Level": "auto", "GPU OverDrive value (%)": "0", "GPU Memory OverDrive value (%)": "0", "Max Graphics Package Power (W)": "560.0", "Average Graphics Package Power (W)": "136.0", "GPU use (%)": "100", "GFX Activity": "394498625", "GPU memory use (%)": "0", "Memory Activity": "248671447", "GPU memory vendor": "hynix", "PCIe Replay Count": "0", "Serial Number": "692137001312", "Voltage (mV)": "893", "PCI Bus": "0000:D9:00.0", "ASD firmware version": "0x00000000", "CE firmware version": "0", "DMCU firmware version": "0", "MC firmware version": "0", "ME firmware version": "0", "MEC firmware version": "63", "MEC2 firmware version": "63", "PFP firmware version": "0", "RLC firmware version": "17", "RLC SRLC firmware version": "0", "RLC SRLG firmware version": "0", "RLC SRLS firmware version": "0", "SDMA firmware version": "8", "SDMA2 firmware version": "8", "SMC firmware version": "00.68.54.00", "SOS firmware version": "0x00270080", "TA RAS firmware version": "27.00.01.60", "TA XGMI firmware version": "32.00.00.13", "UVD firmware version": "0x00000000", "VCE firmware 
version": "0x00000000", "VCN firmware version": "0x0110101b", "Card series": "AMD INSTINCT MI200 (MCM) OAM LC MBA HPE C2", "Card model": "FirePro W4300", "Card vendor": "Advanced Micro Devices, Inc. [AMD/ATI]", "Card SKU": "D65201", "Valid sclk range": "500Mhz - 1700Mhz", "Valid mclk range": "400Mhz - 1600Mhz", "Voltage point 0": "0Mhz 0mV", "Voltage point 1": "0Mhz 0mV", "Voltage point 2": "0Mhz 0mV", "Energy counter": "14610425995231", "Accumulated Energy (uJ)": "223539520513751.9"}, "card7": {"GPU ID": "0x7408", "Unique ID": "0xff4251c4d1cac80c", "VBIOS version": "113-D65201-046", "Temperature (Sensor edge) (C)": "51.0", "Temperature (Sensor junction) (C)": "53.0", "Temperature (Sensor memory) (C)": "64.0", "Temperature (Sensor HBM 0) (C)": "63.0", "Temperature (Sensor HBM 1) (C)": "59.0", "Temperature (Sensor HBM 2) (C)": "61.0", "Temperature (Sensor HBM 3) (C)": "64.0", "fclk clock speed:": "(0Mhz)", "fclk clock level:": "0", "mclk clock speed:": "(1600Mhz)", "mclk clock level:": "3", "sclk clock speed:": "(1700Mhz)", "sclk clock level:": "1", "socclk clock speed:": "(1090Mhz)", "socclk clock level:": "3", "Performance Level": "auto", "GPU OverDrive value (%)": "0", "GPU Memory OverDrive value (%)": "0", "Max Graphics Package Power (W)": "0.0", "Average Graphics Package Power (W)": "N/A (Secondary die)", "GPU use (%)": "100", "GFX Activity": "989214951", "GPU memory use (%)": "0", "Memory Activity": "282396622", "GPU memory vendor": "hynix", "PCIe Replay Count": "0", "Serial Number": "692137001312", "Voltage (mV)": "918", "PCI Bus": "0000:DE:00.0", "ASD firmware version": "0x00000000", "CE firmware version": "0", "DMCU firmware version": "0", "MC firmware version": "0", "ME firmware version": "0", "MEC firmware version": "63", "MEC2 firmware version": "63", "PFP firmware version": "0", "RLC firmware version": "17", "RLC SRLC firmware version": "0", "RLC SRLG firmware version": "0", "RLC SRLS firmware version": "0", "SDMA firmware version": "8", "SDMA2 firmware 
version": "8", "SMC firmware version": "00.68.54.00", "SOS firmware version": "0x00270080", "TA RAS firmware version": "27.00.01.60", "TA XGMI firmware version": "32.00.00.13", "UVD firmware version": "0x00000000", "VCE firmware version": "0x00000000", "VCN firmware version": "0x0110101b", "Card series": "AMD INSTINCT MI200 (MCM) OAM LC MBA HPE C2", "Card model": "FirePro W4300", "Card vendor": "Advanced Micro Devices, Inc. [AMD/ATI]", "Card SKU": "D65201", "Valid sclk range": "500Mhz - 1700Mhz", "Valid mclk range": "400Mhz - 1600Mhz", "Voltage point 0": "0Mhz 0mV", "Voltage point 1": "0Mhz 0mV", "Voltage point 2": "0Mhz 0mV", "Energy counter": "0", "Accumulated Energy (uJ)": "0.0"}, "system": {"Driver version": "5.16.9.22.20", "PID25695": "python, 1, unknown, unknown, unknown", "PID25685": "python, 1, unknown, unknown, unknown", "PID25552": "python, 1, unknown, unknown, unknown", "PID25688": "python, 1, unknown, unknown, unknown", "PID25686": "python, 1, unknown, unknown, unknown", "PID25684": "python, 1, unknown, unknown, unknown", "PID25690": "python, 1, unknown, unknown, unknown", "PID25687": "python, 1, unknown, unknown, unknown"}}


4 participants