[tune] fix gpu check #13825

richardliaw · 2021-01-31T08:15:23Z

Why are these changes needed?

closes #13486

Related issue number

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

krfricke · 2021-02-01T09:33:36Z

python/ray/tune/utils/util.py

@@ -471,18 +471,16 @@ def wait_for_gpu(gpu_id=None, gpu_memory_limit=0.1, retry=20):
        gpu_id (Optional[str]): GPU id to check. Must be found
            within GPUtil.getGPUs(). If none, resorts to
            the first item returned from `ray.get_gpu_ids()`.
-        gpu_memory_limit (float): If memory usage is below
+        gpu_memory_limit (float): If fractional memory usage is below


I think the original docstring was correct here. We're comparing absolute memory usage (memoryUsed) vs. the absolute amount we need to have available (gpu_memory_limit). The fact that the default value here is 0.1 might be confusing. Shouldn't we just require passing a GPU limit here? And maybe state that this is memory usage in bytes (if this is in bytes, which I think it is?)

Can you keep it as a fraction and compare against memoryUtil instead of memoryUsed?

https://github.com/anderskm/gputil/blob/master/GPUtil/GPUtil.py#L50

This is more intuitive for us, particularly since in most cases you just want to say "wait until all the GPU memory is free" or wait_for_gpu(gpu, 1.0).

So now the usage is wait_for_gpu(target_util=0.1), or wait_for_gpu(target_util=0) for full blocking.
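The fractional check discussed above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `wait_for_gpu_sketch`, `FakeGPU`, `get_gpus`, `gpu_index` and `delay_s` are hypothetical names introduced here so the example runs without a GPU. The only things taken from the thread are the `target_util` parameter, the `retry` count, and the comparison against GPUtil's `memoryUtil` (used memory divided by total memory) instead of `memoryUsed`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class FakeGPU:
    """Hypothetical stand-in for a GPUtil GPU object, for illustration."""
    id: int
    memoryUsed: float   # absolute memory in use
    memoryTotal: float  # absolute memory on the device

    @property
    def memoryUtil(self) -> float:
        # GPUtil reports memoryUtil as memoryUsed / memoryTotal.
        return self.memoryUsed / self.memoryTotal


def wait_for_gpu_sketch(
    get_gpus: Callable[[], Sequence[FakeGPU]],
    gpu_index: int = 0,
    target_util: float = 0.01,
    retry: int = 20,
    delay_s: float = 5.0,
) -> bool:
    """Poll until fractional memory usage drops to target_util or below.

    get_gpus is assumed to behave like GPUtil.getGPUs(); it is injected
    here so the sketch can be exercised without real hardware.
    """
    for _ in range(int(retry)):
        gpu = get_gpus()[gpu_index]
        # Compare the fraction of memory in use, not the absolute amount.
        if gpu.memoryUtil <= target_util:
            return True
        time.sleep(delay_s)
    raise RuntimeError(
        f"GPU memory utilization stayed above {target_util} "
        f"after {retry} checks.")


# Simulated readings: 50%, 25%, then about 3% of memory in use.
usage = iter([8000.0, 4000.0, 500.0])
ready = wait_for_gpu_sketch(
    lambda: [FakeGPU(id=0, memoryUsed=next(usage), memoryTotal=16000.0)],
    target_util=0.1, retry=5, delay_s=0.0)
```

With a fraction as the threshold, the same call works across devices with different memory sizes, and `target_util=0` expresses "block until the GPU is completely free" without knowing the device's capacity.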

tgaddair · 2021-02-04T00:56:52Z

python/ray/tune/utils/util.py

    for i in range(int(retry)):
+        gpu_object = GPUtil.getGPUs()[gpu_id]


gpu_id is a string, but getGPUs() returns a list. You probably need to compare against the GPU id (which is also an int) returned in the list.
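One way to address this is to search the returned list for a matching entry instead of indexing it. The sketch below is an assumption about how that could look, not the change that was eventually committed: `find_gpu` and `FakeGPUInfo` are hypothetical names, and matching on `uuid` as well as `id` is a guess at what a caller might pass. GPUtil GPU objects do expose `id` (an int) and `uuid` (a string), which is all the lookup relies on.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class FakeGPUInfo:
    """Hypothetical stand-in exposing the id and uuid fields of a GPU."""
    id: int
    uuid: str


def find_gpu(gpus: Sequence[FakeGPUInfo],
             gpu_id: Union[int, str]) -> FakeGPUInfo:
    """Return the GPU whose id or uuid matches gpu_id.

    gpu_id may be an int, a numeric string such as "1", or a UUID string.
    """
    wanted = str(gpu_id)
    for gpu in gpus:
        # Compare as strings so that "1" and 1 both match gpu.id == 1.
        if str(gpu.id) == wanted or gpu.uuid == wanted:
            return gpu
    raise ValueError(f"No GPU matching {gpu_id!r} among "
                     f"{[g.id for g in gpus]}")


gpus = [FakeGPUInfo(0, "GPU-aaa"), FakeGPUInfo(1, "GPU-bbb")]
by_string = find_gpu(gpus, "1")       # numeric string
by_uuid = find_gpu(gpus, "GPU-bbb")   # UUID string
```

Looking the GPU up by value also avoids relying on list position, which only equals the device id when every device is visible and listed in order.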

krfricke

Looks good! Just one tiny nit

amogkam and others added 8 commits January 29, 2021 17:23

update test

277c738

add smoke test

6c0449c

formatting

3325ccf

add

385d274

typo

5e298a7

add-symlink

36df586

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

link

87a9718

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

ok

298cf1b

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw requested a review from krfricke January 31, 2021 08:15

krfricke reviewed Feb 1, 2021

tgaddair reviewed Feb 4, 2021

richardliaw added 8 commits February 3, 2021 17:48

gpu-wait

b5a5e5e

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

uuid

bee4d8a

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

ok

9e3a6bc

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

fix

5705fce

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

none

e13650e

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Merge branch 'master' into gpu

a874cfe

util

cf6d61c

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

util

46090e9

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

krfricke approved these changes Feb 4, 2021

richardliaw commented Feb 4, 2021

Apply suggestions from code review

b58dd62

richardliaw merged commit 0fc81e2 into ray-project:master Feb 4, 2021

richardliaw deleted the gpu branch February 4, 2021 09:14

fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021

[tune] fix gpu check (ray-project#13825)

de2a997

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>

fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021

Revert "[tune] fix gpu check (ray-project#13825)"

819b7fd

This reverts commit de2a997.

fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021

Revert "[tune] fix gpu check (ray-project#13825)"

1fab2c1

This reverts commit de2a997.
