[Dashboard] Add GPU component usage #46188

Open
wants to merge 8 commits into
base: master


Bye-legumes
Contributor

@Bye-legumes Bye-legumes commented Jun 21, 2024

Why are these changes needed?

Closes #45755.
This PR addresses the need for enhanced GPU usage metrics at the task/actor level in the Ray dashboard. Currently, the Ray dashboard provides detailed CPU and memory usage metrics for individual tasks and actors, but lacks similar granularity for GPU metrics. This enhancement aims to fill that gap by introducing per-task/actor GPU utilization and memory usage metrics.

Summary of Changes

  1. GPU Metrics Collection:

    • Extended the existing reporter_agent.py to retrieve GPU utilization and memory usage for each process using the pynvml library.
    • Added new metrics to the METRICS_GAUGES dictionary for component_gpu_utilization and component_gpu_memory_usage.
  2. Worker Process Updates:

    • Modified the _get_workers function to include GPU usage information (both memory and utilization) for each process.
  3. Metric Reporting:

    • Updated the _generate_system_stats_record function to include the new GPU metrics, ensuring they are reported alongside existing CPU and memory metrics.
  4. Dashboard Panels:

    • Added new panels in default_dashboard_panels.py to display the GPU utilization and memory usage metrics at the task/actor level in the Ray dashboard.
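As a rough illustration of the collection and aggregation steps listed above, here is a minimal sketch. It is not the code in this PR: the helper names `read_gpu_process_samples` and `aggregate_by_component`, the component names, and the record layout are all hypothetical. It assumes pynvml's `nvmlDeviceGetComputeRunningProcesses` and `nvmlDeviceGetProcessUtilization` calls are available on the host, and falls back to an empty result when pynvml or an NVIDIA driver is missing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIB = 1024 * 1024

# One sample per (GPU, process): (pid, used GPU memory in bytes, SM utilization in percent).
Sample = Tuple[int, int, int]


def read_gpu_process_samples() -> List[Sample]:
    """Hypothetical helper: query NVML for per-process GPU usage.

    Returns an empty list if pynvml or the NVIDIA driver is unavailable.
    """
    try:
        import pynvml

        pynvml.nvmlInit()
    except Exception:
        return []

    samples: List[Sample] = []
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util_by_pid: Dict[int, int] = {}
            try:
                # Per-process utilization is not supported on every GPU or driver.
                for entry in pynvml.nvmlDeviceGetProcessUtilization(handle, 0):
                    util_by_pid[entry.pid] = entry.smUtil
            except pynvml.NVMLError:
                pass
            for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
                # usedGpuMemory can be None when the driver cannot report it.
                used = proc.usedGpuMemory or 0
                samples.append((proc.pid, used, util_by_pid.get(proc.pid, 0)))
    except pynvml.NVMLError:
        pass
    finally:
        pynvml.nvmlShutdown()
    return samples


def aggregate_by_component(
    samples: Iterable[Sample], component_by_pid: Dict[int, str]
) -> Dict[str, Dict[str, float]]:
    """Sum GPU memory (MiB) and utilization per component.

    Processes whose pid is not a known worker are ignored.
    """
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"gpu_memory_mib": 0.0, "gpu_utilization": 0.0}
    )
    for pid, used_bytes, sm_util in samples:
        component = component_by_pid.get(pid)
        if component is None:
            continue
        totals[component]["gpu_memory_mib"] += used_bytes / MIB
        totals[component]["gpu_utilization"] += sm_util
    return dict(totals)
```

Keeping the NVML query separate from the aggregation means the aggregation can be unit tested with hand-written samples on machines without a GPU, and the query step could be swapped out for other accelerators.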

@nemo9cby @liuxsh9 I think this can also be applied to our clusters with different accelerators. @liuxsh9, you can build on this PR to get component usage for different accelerators with the decoupled device.

Related issue number

#45755
image

Checks

  • [√] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [√] I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [√] Unit tests
    • Release tests
    • This PR is not tested :(

Bye-legumes and others added 8 commits June 21, 2024 15:54
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Contributor Author

@jjyao Could you check whether this PR solves your issue, please? Thanks!

Development

Successfully merging this pull request may close these issues.

[Core] Show per task/actor GPU usage metric
1 participant