Fix decorators skipping NCCL tests #122397
base: main
Conversation
The decorators are written to call `sys.exit` when the test function is invoked, which is AFTER the `setup` call that forks the processes and (potentially) uses a GPU/NCCL-based barrier — so "n GPUs" must be present *before* we even check whether "n GPUs" are available. This PR rewrites those decorators to use `unittest.skipIf`, which does not even enter the `setup` function. This also exposed that `require_n_gpus_for_nccl_backend` is the same as `nccl_skip_if_lt_x_gpu`; the former has the better name, so I removed the latter.

Fixes pytorch#89686
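To illustrate the difference, here is a minimal, self-contained sketch (not the exact PyTorch helpers: `device_count` stands in for `torch.cuda.device_count()`, and the decorator names are simplified for illustration):

```python
import sys
import unittest
from functools import wraps

def device_count():
    # Stand-in for torch.cuda.device_count(); assume a CPU-only machine here.
    return 0

# Old pattern (roughly): the check only runs once the wrapped test body is
# called, i.e. AFTER setUp has already forked processes and possibly hit a
# GPU/NCCL-based barrier.
def skip_if_lt_x_gpu_old(x):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if device_count() < x:
                sys.exit(0)  # kills the (sub)process instead of skipping
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# New pattern: unittest.skipIf marks the test as skipped up front, so setUp
# (and any forking it triggers) never runs.
def require_n_gpus(x):
    return unittest.skipIf(device_count() < x, f"Need at least {x} GPUs")

class Demo(unittest.TestCase):
    @require_n_gpus(2)
    def test_collective(self):
        pass
```

Running `Demo` on a machine without enough GPUs reports the test as skipped rather than terminating a worker process mid-setup.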
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122397

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure. As of commit c865f77 with merge base ac51920:

NEW FAILURE - The following job has failed:
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Does the same issue apply to other decorators that use the same `sys.exit` pattern, e.g. `skip_if_no_gpu`?
Factor out `exit_if_lt_x_gpu`
Replace checks by `unittest.skip*` where possible
Possibly yes; I'd strongly suggest avoiding this pattern.

However, many of the other usages are harder to change. I did what I could for now to reduce the issues and do some cleanup.
```diff
-        world_size = int(os.environ["WORLD_SIZE"])
-        if torch.cuda.device_count() < world_size:
-            sys.exit(TEST_SKIPS[f"multi-gpu-{world_size}"].exit_code)
+        exit_if_lt_x_gpu(int(os.environ["WORLD_SIZE"]))
```
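For reference, a sketch of what the factored-out helper could look like. This is an assumption, not the actual implementation: `TEST_SKIPS` and `device_count` are simplified stand-ins for the real test-harness objects, and the exit codes here are made up.

```python
import sys

class _Skip:
    def __init__(self, exit_code):
        self.exit_code = exit_code

# Hypothetical table; the real TEST_SKIPS in torch's distributed test utils
# maps skip reasons to dedicated exit codes understood by the harness.
TEST_SKIPS = {f"multi-gpu-{n}": _Skip(80 + n) for n in range(1, 9)}

def device_count():
    # Stand-in for torch.cuda.device_count(); assume a CPU-only machine.
    return 0

def exit_if_lt_x_gpu(x):
    """Exit the current worker process if fewer than x GPUs are present."""
    if device_count() < x:
        sys.exit(TEST_SKIPS[f"multi-gpu-{x}"].exit_code)
```

Note this still exits the process; it only deduplicates the check, which is why the comment below asks whether this area needs further work.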
Can you comment on the state of this code? Is it still "ok" to use `exit` here, or is this just an area that needs further work and is not handled in this PR?
I tried to explain this in #122397 (comment). In short: it needs further work and might not even be possible.

The question here is when exactly `$WORLD_SIZE` is set, i.e. whether it is available when the decorator is executed or only when the function is executed, which is after the test setup (likely involving forking). In the latter case this cannot be helped.

Similarly below: there is a dependency on `self.world_size`, and we don't have `self` available in the decorator. And especially as this property can be set/changed by subclasses, I figure it is very hard to fix this.

However, what could be improved is the duplication:

- `with_comms` is defined at least twice and very similarly
- Unclear differences between usage of `self.world_size` and `$WORLD_SIZE`; maybe use either one?
- After the previous point, maybe a (new) separate decorator `require_gpus_foreach_rank` could be used to move this out of `with_comms`
- In all those cases we can skip early (`unittest.skip`) if there are no GPUs available at all. That might already work around a common case (CPU-only machines)
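The last point can be sketched as a small wrapper. The name `require_any_gpu` is hypothetical, and `device_count` again stands in for `torch.cuda.device_count()`:

```python
import unittest

def device_count():
    # Stand-in for torch.cuda.device_count(); 0 models a CPU-only machine.
    return 0

def require_any_gpu(fn):
    # Hypothetical early-skip decorator: the condition is evaluated at
    # decoration time, before any setUp/forking can happen.
    return unittest.skipIf(device_count() == 0, "No GPUs available")(fn)

class Demo(unittest.TestCase):
    @require_any_gpu
    def test_needs_gpu(self):
        pass
```

On a CPU-only machine this marks the test as skipped without ever entering setup, which covers the common case even when the exact per-rank GPU count can only be known later.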
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as stale.
No objections from me, but I assume it needs a stamp from someone with merge rights.
LGTM.

Could you also comment on the future work that you would propose? I think these changes look correct, but I noticed that only a small set of tests (4) use the new skip functionality.
I'd propose to double-check the uses of the
That only looks like this; see the first commit, which has the most important changes. I.e. instead of starting the function and conditionally exiting the process when not enough GPUs are available, the test is skipped without being started. These decorators are already used by many tests.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: trunk / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 1, linux.rocm.gpu). Details for Dev Infra team: raised by workflow job.
@wconstab The failure seems unrelated to these changes and only happened in the merge attempt, not before. Might be a regression?
Avoid failures caused by tests exiting via `sys.exit` instead of `unittest.skip`.

In particular, the harness will no longer try to start the test (forking subprocesses) just to stop it again (killing the subprocesses) during test setup. Using `unittest.skip` decorators avoids starting the test in the first place.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @rohan-varma