[Dashboard] Add GPU component usage #46188
Open
Why are these changes needed?
Close #45755.
This PR addresses the need for enhanced GPU usage metrics at the task/actor level in the Ray dashboard. Currently, the Ray dashboard provides detailed CPU and memory usage metrics for individual tasks and actors, but lacks similar granularity for GPU metrics. This enhancement aims to fill that gap by introducing per-task/actor GPU utilization and memory usage metrics.
Summary of Changes
GPU Metrics Collection:
- `reporter_agent.py`: retrieves GPU utilization and memory usage for each process using the `pynvml` library.
- `METRICS_GAUGES` dictionary: entries for `component_gpu_utilization` and `component_gpu_memory_usage`.

Worker Process Updates:
- `_get_workers` function: includes GPU usage information (both memory and utilization) for each process.

Metric Reporting:
- `_generate_system_stats_record` function: includes the new GPU metrics, so they are reported alongside the existing CPU and memory metrics.

Dashboard Panels:
- `default_dashboard_panels.py`: displays the GPU utilization and memory usage metrics at the task/actor level in the Ray dashboard.

@nemo9cby @liuxsh9 I think this can also apply to our clusters with different accelerators. @liuxsh9, you can build on this PR to get component usage for different accelerators with the decoupled device.
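For readers unfamiliar with per-process NVML queries, the collection step described in the summary can be sketched roughly as below. This is an illustrative sketch, not the code in this PR: `ProcessGpuSample`, `read_samples_from_nvml`, and `aggregate_by_pid` are hypothetical names, and summing readings across GPUs per PID is an assumption made for the example. Only the `pynvml` calls and the two metric names come from the library and the PR description respectively.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ProcessGpuSample(NamedTuple):
    """One per-process reading from one GPU (hypothetical record type)."""
    pid: int
    gpu_index: int
    memory_bytes: int
    sm_utilization_pct: int


def read_samples_from_nvml() -> List[ProcessGpuSample]:
    """Query NVML for per-process GPU usage; returns [] if NVML is unavailable."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        return []  # pynvml not installed, or no NVIDIA driver on this node

    samples = []
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)

            # Memory held by each compute process on this GPU.
            mem_by_pid = {
                proc.pid: proc.usedGpuMemory or 0  # may be None on some platforms
                for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            }

            # SM utilization per process; NVML raises if no samples exist yet.
            util_by_pid = {}
            try:
                for util in pynvml.nvmlDeviceGetProcessUtilization(handle, 0):
                    util_by_pid[util.pid] = util.smUtil
            except pynvml.NVMLError:
                pass

            for pid in set(mem_by_pid) | set(util_by_pid):
                samples.append(ProcessGpuSample(
                    pid, index, mem_by_pid.get(pid, 0), util_by_pid.get(pid, 0)))
    finally:
        pynvml.nvmlShutdown()
    return samples


def aggregate_by_pid(samples: Iterable[ProcessGpuSample]) -> Dict[int, Dict[str, int]]:
    """Sum readings across GPUs for each PID (an assumption of this sketch)."""
    totals = defaultdict(
        lambda: {"component_gpu_memory_usage": 0, "component_gpu_utilization": 0})
    for sample in samples:
        totals[sample.pid]["component_gpu_memory_usage"] += sample.memory_bytes
        totals[sample.pid]["component_gpu_utilization"] += sample.sm_utilization_pct
    return dict(totals)
```

Calling `aggregate_by_pid(read_samples_from_nvml())` yields a mapping keyed by PID that a reporter could join against its existing per-worker CPU and memory stats. How the PR itself attributes PIDs to tasks and actors is not shown here.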
Related issue number
#45755
![image](https://private-user-images.githubusercontent.com/121425509/342480359-e9603526-ead9-4349-985e-cc3d3b0957c7.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE0MDg3ODAsIm5iZiI6MTcyMTQwODQ4MCwicGF0aCI6Ii8xMjE0MjU1MDkvMzQyNDgwMzU5LWU5NjAzNTI2LWVhZDktNDM0OS05ODVlLWNjM2QzYjA5NTdjNy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzE5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxOVQxNzAxMjBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03ZjQwNDQ5OWZhYWYyYzZiMjg5NmY0NzZmZjU4ODgzZTE0YWUxYmU1ZDkxZGU2NWEzOTdhOWFhZWExOGMwZDk4JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.U9VDWCYZ-VnSFc2gSqBXjS7smY6oH2cAqsCk3tt5_u8)
Checks
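- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.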