
Fix decorators skipping NCCL tests #122397

Open
wants to merge 6 commits into main

Conversation

@Flamefire (Collaborator) commented Mar 21, 2024

Avoid failures caused by tests exiting via sys.exit instead of unittest.skip.

In particular, the test runner will no longer start the test (forking subprocesses in the test setup) only to stop it again immediately (killing those subprocesses).

Using unittest.skip decorators avoids starting the test in the first place.
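A minimal sketch of the difference, with hypothetical decorator names (not the exact helpers touched by this PR):

```python
import functools
import sys
import unittest

import torch


# Old pattern (sketch): the check runs inside the already-started test and
# aborts the process via sys.exit, so any expensive setup (forking worker
# processes, NCCL initialization) has already happened when it fires.
def exit_if_lt_n_gpus(n):  # hypothetical name, for illustration only
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if torch.cuda.device_count() < n:
                sys.exit(0)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


# New pattern (sketch): unittest marks the test as skipped before setUp is
# ever entered, so no subprocesses are forked just to be killed again.
def skip_if_lt_n_gpus(n):  # hypothetical name, for illustration only
    return unittest.skipIf(
        torch.cuda.device_count() < n,
        f"Requires at least {n} GPUs",
    )
```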

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @rohan-varma

Flamefire and others added 3 commits September 18, 2023 12:48
The decorators are written to call `sys.exit` when the test function is
invoked, which is AFTER the `setup` call that forks the processes and
(potentially) uses a GPU/NCCL based barrier requiring "n GPUs" to be
present before even checking whether "n GPUs" are available.

Rewrite those decorators to use `unittest.skipIf`, which does not even
enter the `setup` function.
This also exposed that `require_n_gpus_for_nccl_backend` is the same as
`nccl_skip_if_lt_x_gpu`; the former has the better name, so I removed
the latter.

Fixes pytorch#89686
@pytorch-bot (bot) commented Mar 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122397

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit c865f77 with merge base ac51920:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 21, 2024
@eqy (Collaborator) left a comment


Does the same issue apply to other decorators that use the same sys.exit pattern e.g., skip_if_no_gpu?

Factor out `exit_if_lt_x_gpu`
Replace checks by `unittest.skip*` where possible
@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Mar 22, 2024
@Flamefire (Collaborator, Author) commented

Does the same issue apply to other decorators that use the same sys.exit pattern e.g., skip_if_no_gpu?

Possibly, yes. I'd strongly suggest avoiding this pattern; e.g. exit_if_lt_x_gpu could be replaced by skipUnless (rough sketch below).

with_comms in test_c10d_logger looks strange, as it checks BACKEND, which seems to be a constant.

exit_if_lt_x_gpu & with_comms in test_functional_api seem to duplicate each other (both check the number of GPUs against the world size).

However, many of the other usages of sys.exit(TEST_SKIPS... involve the world size, e.g. the usages of exit_if_lt_x_gpu. I don't think that can be done in a decorator, as either self is not available until the method is called, or $WORLD_SIZE might not be set until the test has started spawning processes.

I did what I could for now to reduce the issues and do some cleanup.
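A rough sketch of the skipUnless suggestion for cases where the required GPU count is a plain constant; the decorator name is hypothetical, not a helper added by this PR:

```python
import unittest

import torch


def skip_unless_x_gpus(x):  # hypothetical name, for illustration only
    # Reported as a unittest skip instead of the test process calling
    # sys.exit after worker processes have already been forked.
    return unittest.skipUnless(
        torch.cuda.is_available() and torch.cuda.device_count() >= x,
        f"Requires at least {x} GPUs",
    )


class ExampleTest(unittest.TestCase):
    @skip_unless_x_gpus(2)
    def test_collective_on_two_gpus(self):
        ...  # never entered on machines with fewer than 2 GPUs
```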

world_size = int(os.environ["WORLD_SIZE"])
if torch.cuda.device_count() < world_size:
    sys.exit(TEST_SKIPS[f"multi-gpu-{world_size}"].exit_code)
exit_if_lt_x_gpu(int(os.environ["WORLD_SIZE"]))
A Contributor left a comment

Can you comment on the state of this code? Is it still 'ok' to use exit here, or is this an area that needs further work and is not handled in this PR?

@Flamefire (Collaborator, Author) replied

I tried to explain this in #122397 (comment)

In short: It needs further work and might not even be possible.

The question here is when exactly $WORLD_SIZE is set, i.e. whether it is available when the decorator is evaluated or only when the test function is executed, i.e. after the test setup (which likely involves forking). In the latter case this cannot be helped.

Similarly below: there is a dependency on self.world_size, and self is not available in the decorator. Especially since this property can be set/changed by subclasses, I figure this is very hard to fix.

However, what could be improved is the duplication:

  • with_comms is defined at least twice, in very similar ways
  • The differences between using self.world_size and $WORLD_SIZE are unclear; maybe settle on one of them?
  • Building on the previous point, a (new) separate decorator such as require_gpus_foreach_rank could move this check out of with_comms
  • In all those cases we can skip early (unittest.skip) if there are no GPUs available at all; that alone would already cover the common case of a CPU-only machine (see the sketch below)
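A minimal sketch of that last bullet, assuming nothing about the eventual world size (the decorator name is hypothetical, not something added by this PR):

```python
import unittest

import torch

# Skip before any worker process is spawned when the machine has no GPU at
# all. The exact per-rank requirement still has to be checked at runtime,
# because self.world_size / $WORLD_SIZE may only be known once the test runs.
skip_if_no_gpu_available = unittest.skipIf(  # hypothetical name
    not torch.cuda.is_available(),
    "CUDA not available",
)


class ExampleTest(unittest.TestCase):
    @skip_if_no_gpu_available
    def test_needs_one_gpu_per_rank(self):
        ...
```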


Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 21, 2024
@Flamefire (Collaborator, Author) commented

@eqy @wconstab can this be merged please?

@eqy (Collaborator) commented May 28, 2024

No objections from me, but I assume it needs a stamp from someone with merge rights.

@wconstab (Contributor) left a comment

lgtm.

could you also comment on the future work that you would propose?

I think these changes look correct, but I noticed that only a small set of tests (4) used the new 'skip' functionality

@Flamefire (Collaborator, Author) commented May 28, 2024

could you also comment on the future work that you would propose?

I'd propose double-checking the uses of the exit_if_* markers/functions to see whether the condition can be determined before the test starts. As mentioned above, this is hard if the condition is self.world_size, which is determined by environment variables or only set after the test has been initialized. Other similar cases exist (e.g. os.environ["WORLD_SIZE"], which IIRC is set when the program is forked to start the test).

I think these changes look correct, but I noticed that only a small set of tests (4) used the new 'skip' functionality

It only looks that way; see the first commit, which contains the most important changes.

I.e. instead of starting the test function and conditionally exiting the process when not enough GPUs are available, the test is now skipped without being started at all. These decorators are already used by many tests.

@Flamefire (Collaborator, Author) commented

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 28, 2024
@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator) commented

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 1, linux.rocm.gpu)

Details for Dev Infra team (raised by workflow job)

@Flamefire (Collaborator, Author) commented

@wconstab The failure seems unrelated to these changes and only happened in the merge attempt, not before. Might it be a regression? main looks pretty red right now.

Labels
ciflow/trunk · oncall: distributed · open source · release notes: distributed (c10d) · Stale

5 participants