Description
🐛 Bug
In common_distributed.py we have some widely-used utilities that skip tests if a certain number of GPUs is not available. The problem is that many of them rely on the following code: https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_distributed.py#L30 which hardcodes the error message as "Need at least 2 CUDA devices", even though functions such as skip_if_lt_x_gpu(x) can be invoked with an arbitrary number of GPUs. We should enhance this test-skipping logic so that the correct number of GPUs is reported when a test is skipped. Currently, it is quite confusing to get an error message such as "Need at least 2 CUDA devices" when you're running on (for example) a 4-GPU machine.
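A minimal sketch of the proposed fix (not the actual implementation in common_distributed.py): the decorator interpolates its `x` argument into the skip message instead of hardcoding "2". The `_gpu_count` helper is a stand-in added here so the example runs even without torch installed.

```python
import unittest
from functools import wraps

try:
    import torch

    def _gpu_count():
        return torch.cuda.device_count() if torch.cuda.is_available() else 0
except ImportError:  # stand-in fallback so this sketch runs without torch
    def _gpu_count():
        return 0


def skip_if_lt_x_gpu(x):
    """Skip the wrapped test unless at least `x` CUDA devices are available."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if _gpu_count() >= x:
                return func(*args, **kwargs)
            # Report the actual requirement `x` rather than a hardcoded "2"
            raise unittest.SkipTest(f"Need at least {x} CUDA devices")
        return wrapper
    return decorator
```

With this change, a test decorated with `skip_if_lt_x_gpu(4)` that is skipped on a 2-GPU machine reports "Need at least 4 CUDA devices", removing the confusion described above.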
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski