-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] fix gpu check #13825
[tune] fix gpu check #13825
Conversation
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
python/ray/tune/utils/util.py
Outdated
@@ -471,18 +471,16 @@ def wait_for_gpu(gpu_id=None, gpu_memory_limit=0.1, retry=20): | |||
gpu_id (Optional[str]): GPU id to check. Must be found | |||
within GPUtil.getGPUs(). If none, resorts to | |||
the first item returned from `ray.get_gpu_ids()`. | |||
gpu_memory_limit (float): If memory usage is below | |||
gpu_memory_limit (float): If fractional memory usage is below |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original docstring was correct here. We're comparing absolute memory usage (memoryUsed
) vs. the absolute amount we need to have available (gpu_memory_limit
). The fact that the default value here is 0.1
might be confusing. Shouldn't we just require passing a GPU limit here? And maybe state that this is memory usage in bytes (if this is in bytes, which I think it is?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you keep it as a fraction and compare against memoryUtil
instead of memoryUsed
?
https://github.com/anderskm/gputil/blob/master/GPUtil/GPUtil.py#L50
This is more intuitive for us, particularly since in most cases you just want to say "wait until all the gpu memory is free" or wait_for_gpu(gpu, 1.0)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so now, the usage is:
wait_for_gpu(target_util=0.1)
or wait_for_gpu(target_util=0)
(for full blocking)
python/ray/tune/utils/util.py
Outdated
for i in range(int(retry)): | ||
gpu_object = GPUtil.getGPUs()[gpu_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gpu_id
is a string, but getGPUs()
returns a list. You probably need to compare against the GPU id (which is also an int) returned in the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just one tiny nit
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
This reverts commit de2a997.
This reverts commit de2a997.
Why are these changes needed?
closes #13486
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.