
Fix decorators skipping NCCL tests #122397

Open
wants to merge 6 commits into main

Conversation

@Flamefire (Collaborator) commented Mar 21, 2024

Avoid failures caused by tests exiting via sys.exit instead of unittest.skip.

In particular, the test runner will no longer start the test (forking subprocesses in the test setup) only to stop it again immediately (killing those subprocesses).

Using unittest.skip decorators avoids starting the test in the first place.
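A minimal sketch of the difference, with hypothetical decorator names (not the exact helpers touched by this PR):

```python
import functools
import sys
import unittest

import torch


# Old pattern (sketch): the check runs inside the already-started test and
# aborts the process via sys.exit, so any expensive setup (forking worker
# processes, NCCL initialization) has already happened when it fires.
def exit_if_lt_n_gpus(n):  # hypothetical name, for illustration only
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if torch.cuda.device_count() < n:
                sys.exit(0)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


# New pattern (sketch): unittest marks the test as skipped before setUp is
# ever entered, so no subprocesses are forked just to be killed again.
def skip_if_lt_n_gpus(n):  # hypothetical name, for illustration only
    return unittest.skipIf(
        torch.cuda.device_count() < n,
        f"Requires at least {n} GPUs",
    )
```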

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @rohan-varma

Flamefire and others added 3 commits September 18, 2023 12:48
The decorators are written to call `sys.exit` when the test function is
invoked, which is AFTER the `setup` call that forks the processes and
(potentially) uses a GPU/NCCL based barrier requiring "n GPUs" to be
present before even checking whether "n GPUs" are available.

Rewrite those decorators to use `unittest.skipIf`, which does not even
enter the `setup` function.
This also exposed that `require_n_gpus_for_nccl_backend` is the same as
`nccl_skip_if_lt_x_gpu`; the former has the better name, so I removed
the latter.

Fixes pytorch#89686
@pytorch-bot (bot) commented Mar 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122397

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit c865f77 with merge base ac51920:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 21, 2024
@eqy (Collaborator) left a comment


Does the same issue apply to other decorators that use the same sys.exit pattern e.g., skip_if_no_gpu?

Factor out `exit_if_lt_x_gpu`
Replace checks by `unittest.skip*` where possible
@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Mar 22, 2024
@Flamefire (Collaborator, Author) commented

Does the same issue apply to other decorators that use the same sys.exit pattern e.g., skip_if_no_gpu?

Possibly, yes. I'd strongly suggest avoiding this pattern; e.g. exit_if_lt_x_gpu could be replaced by skipUnless (rough sketch below).

with_comms in test_c10d_logger looks strange, as it checks BACKEND, which seems to be a constant.

exit_if_lt_x_gpu & with_comms in test_functional_api seem to duplicate each other (both check the number of GPUs against the world size).

However, many of the other usages of sys.exit(TEST_SKIPS... involve the world size, e.g. the usages of exit_if_lt_x_gpu. I don't think that can be done in a decorator, as either self is not available until the method is called, or $WORLD_SIZE might not be set until the test has started spawning processes.

I did what I could for now to reduce the issues and do some cleanup.
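A rough sketch of the skipUnless suggestion for cases where the required GPU count is a plain constant; the decorator name is hypothetical, not a helper added by this PR:

```python
import unittest

import torch


def skip_unless_x_gpus(x):  # hypothetical name, for illustration only
    # Reported as a unittest skip instead of the test process calling
    # sys.exit after worker processes have already been forked.
    return unittest.skipUnless(
        torch.cuda.is_available() and torch.cuda.device_count() >= x,
        f"Requires at least {x} GPUs",
    )


class ExampleTest(unittest.TestCase):
    @skip_unless_x_gpus(2)
    def test_collective_on_two_gpus(self):
        ...  # never entered on machines with fewer than 2 GPUs
```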

world_size = int(os.environ["WORLD_SIZE"])
if torch.cuda.device_count() < world_size:
    sys.exit(TEST_SKIPS[f"multi-gpu-{world_size}"].exit_code)
exit_if_lt_x_gpu(int(os.environ["WORLD_SIZE"]))
A Contributor left a comment

Can you comment on the state of this code? Is it still 'ok' to use exit here, or is this an area that needs further work and is not handled in this PR?

@Flamefire (Collaborator, Author) replied

I tried to explain this in #122397 (comment)

In short: It needs further work and might not even be possible.

The question here is when exactly $WORLD_SIZE is set, i.e. whether it is available when the decorator is evaluated or only when the test function is executed, i.e. after the test setup (which likely involves forking). In the latter case this cannot be helped.

Similarly below: there is a dependency on self.world_size, and self is not available in the decorator. Especially since this property can be set/changed by subclasses, I figure this is very hard to fix.

However, what could be improved is the duplication:

  • with_comms is defined at least twice, in very similar ways
  • The differences between using self.world_size and $WORLD_SIZE are unclear; maybe settle on one of them?
  • Building on the previous point, a (new) separate decorator such as require_gpus_foreach_rank could move this check out of with_comms
  • In all those cases we can skip early (unittest.skip) if there are no GPUs available at all; that alone would already cover the common case of a CPU-only machine (see the sketch below)
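A minimal sketch of that last bullet, assuming nothing about the eventual world size (the decorator name is hypothetical, not something added by this PR):

```python
import unittest

import torch

# Skip before any worker process is spawned when the machine has no GPU at
# all. The exact per-rank requirement still has to be checked at runtime,
# because self.world_size / $WORLD_SIZE may only be known once the test runs.
skip_if_no_gpu_available = unittest.skipIf(  # hypothetical name
    not torch.cuda.is_available(),
    "CUDA not available",
)


class ExampleTest(unittest.TestCase):
    @skip_if_no_gpu_available
    def test_needs_one_gpu_per_rank(self):
        ...
```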


Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 21, 2024
@Flamefire (Collaborator, Author) commented

@eqy @wconstab can this be merged please?

@eqy (Collaborator) commented May 28, 2024

No objections from me, but I assume it needs a stamp from someone with merge rights.

@wconstab (Contributor) left a comment

lgtm.

could you also comment on the future work that you would propose?

I think these changes look correct, but I noticed that only a small set of tests (4) used the new 'skip' functionality

@Flamefire (Collaborator, Author) commented May 28, 2024

could you also comment on the future work that you would propose?

I'd propose double-checking the uses of the exit_if_* markers/functions to see whether the condition can be determined before the test starts. As mentioned above, this is hard if the condition is self.world_size, which is determined by environment variables or only set after the test has been initialized. Other similar cases exist (e.g. os.environ["WORLD_SIZE"], which IIRC is set when the program is forked to start the test).

I think these changes look correct, but I noticed that only a small set of tests (4) used the new 'skip' functionality

It only looks that way; see the first commit, which contains the most important changes.

I.e. instead of starting the test function and conditionally exiting the process when not enough GPUs are available, the test is now skipped without being started at all. These decorators are already used by many tests.

@Flamefire (Collaborator, Author) commented

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 28, 2024
@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator) commented

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 1, linux.rocm.gpu)

Details for Dev Infra team (raised by workflow job)

@Flamefire (Collaborator, Author) commented

@wconstab The failure seems unrelated to these changes and only happened in the merge attempt, not before. Might it be a regression? main looks pretty red right now.

Labels
ciflow/trunk · oncall: distributed · open source · release notes: distributed (c10d) · Stale

5 participants