
GPU Skip decorators in common_distributed can misreport device information #41378

@rohan-varma

🐛 Bug

In common_distributed.py we have some widely-used utilities that skip tests unless a certain number of GPUs is available. The problem, however, is that many of them rely on the following code: https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_distributed.py#L30, which hardcodes the skip message as "Need at least 2 CUDA devices", even though functions such as skip_if_lt_x_gpu(x) can be invoked with an arbitrary number of GPUs. We should enhance this test-skipping logic so that the correct number of required GPUs is reported when a test is skipped. Currently, it is quite confusing to get a message such as "Need at least 2 CUDA devices" on, for example, a 4-GPU machine, when the test actually requires more devices than that.
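
For illustration, here is a minimal sketch of what a parameterized skip message could look like. The decorator name skip_if_lt_x_gpu matches the utility mentioned above, but the surrounding structure (a simple unittest.SkipTest-based decorator) is an assumption for this sketch; the real common_distributed.py utilities also coordinate skips across subprocesses via exit codes, which is omitted here.

```python
import unittest
from functools import wraps

import torch


def skip_if_lt_x_gpu(x):
    """Skip the wrapped test unless at least `x` CUDA devices are available.

    The skip message interpolates `x` instead of hardcoding
    "Need at least 2 CUDA devices".
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if torch.cuda.is_available() and torch.cuda.device_count() >= x:
                return func(*args, **kwargs)
            # Report the actual requirement, not a hardcoded count.
            raise unittest.SkipTest(f"Need at least {x} CUDA devices")
        return wrapper
    return decorator
```

With this change, a test decorated with @skip_if_lt_x_gpu(8) would report "Need at least 8 CUDA devices" when skipped on a 4-GPU machine, instead of the misleading hardcoded message.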

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski

Labels

- better-engineering: Relatively self-contained tasks for better engineering contributors
- module: bootcamp: We plan to do a full writeup on the issue, and then get someone to do it for onboarding
- oncall: distributed: Add this issue/PR to distributed oncall triage queue
- triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
