
Conversation

XilunWu
Contributor

@XilunWu XilunWu commented May 31, 2024

Stack from ghstack (oldest at bottom):

Summary
Fix `test_init_pg_and_rpc_with_same_socket` in `test/distributed/test_store.py`, which was missing a call to destroy the created ProcessGroup before exiting the test function. This led to an "init PG twice" error in the test.

Test Plan
`pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket`
`ciflow/periodic`, since this test is included in `.ci/pytorch/multigpu-test.sh`
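For reference, a minimal sketch of the intended teardown pattern (illustrative only; the backend, addresses, and RPC options below are placeholders, not the exact test body):

```python
import torch.distributed as dist
from torch.distributed import rpc

def test_init_pg_and_rpc_with_same_socket():
    # Placeholder single-rank setup; the real test shares one TCP socket
    # between the ProcessGroup and RPC initialization.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
    )
    rpc.init_rpc(
        "worker0",
        rank=0,
        world_size=1,
        rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
            init_method="tcp://127.0.0.1:29500"
        ),
    )

    # ... test body ...

    rpc.shutdown()
    # The missing call: without it, the ProcessGroup leaks into the next
    # test and re-initialization fails with "init PG twice".
    dist.destroy_process_group()
```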

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented May 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127654

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 6fde438 with merge base 0e7bd7f:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and topic: not user facing labels May 31, 2024
XilunWu added a commit that referenced this pull request May 31, 2024
ghstack-source-id: 718abf9
Pull Request resolved: #127654
@XilunWu XilunWu added the ciflow/periodic label May 31, 2024
Reviewed lines in `test/distributed/test_store.py`:

```python
)

rpc.shutdown()
dist.destroy_process_group()
```
Skylion007 (Collaborator)

Nit: these `dist.destroy_process_group()` calls should probably be in a `try: ... finally:` construct or something.
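A minimal sketch of that shape (illustrative; the setup and test body are abbreviated):

```python
import torch.distributed as dist
from torch.distributed import rpc

def test_init_pg_and_rpc_with_same_socket():
    # ... set up the store, ProcessGroup, and RPC as the test already does ...
    try:
        # ... test body / assertions ...
        pass
    finally:
        # Teardown runs even if the body raised, so the ProcessGroup
        # cannot leak into the following test.
        rpc.shutdown()
        dist.destroy_process_group()
```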

XilunWu (Contributor Author)

Will address this in the next PR to avoid wasting CI resources. ;-)

XilunWu (Contributor Author)

Hey @Skylion007, right before I start adding `dist.destroy_process_group()` in a try-catch block, I want to raise some questions that bother me:

  1. Should `destroy_process_group` be put inside a try-catch block? I thought it would be better to have it error out if the call to `destroy_process_group()` fails. That said, should we make `destroy_process_group` idempotent?

  2. In the case of testing, we need to make sure that the ProcessGroup created in one test is actually destroyed before entering the next test. Won't a try-catch block silently hide the error if one occurs in `destroy_process_group`? (See the generic sketch below.)
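To make the distinction in question 2 concrete, a generic illustration (the `destroy` helper below is a hypothetical stand-in for `destroy_process_group()`, not the actual test code): a `try/except` swallows a failed teardown, while a `try/finally` only guarantees the teardown runs and still lets its exception propagate.

```python
def destroy():
    # Stand-in for a destroy_process_group() call that fails.
    raise RuntimeError("destroy failed")

# try/except around the cleanup (the pattern question 2 worries about):
# the cleanup failure is swallowed and the test appears to pass.
try:
    destroy()
except Exception:
    pass

# try/finally (what the review suggests): the cleanup is guaranteed to run
# after the test body, and if it raises, the error still propagates and
# fails the test visibly.
try:
    pass  # ... test body ...
finally:
    destroy()
```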

XilunWu (Contributor Author)

@Skylion007 I see what you mean. Yes, you're right: `destroy_process_group()` should be in a `finally` block.

@XilunWu XilunWu added the ciflow/trunk label Jun 3, 2024
@XilunWu
Contributor Author

XilunWu commented Jun 3, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@XilunWu
Contributor Author

XilunWu commented Jun 3, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / macos-13-py3-arm64 / build, trunk / macos-py3-arm64-mps

Details for Dev Infra team: raised by workflow job.

@XilunWu
Contributor Author

XilunWu commented Jun 3, 2024

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/XilunWu/81/orig onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/127654)

pytorchmergebot pushed a commit that referenced this pull request Jun 3, 2024
ghstack-source-id: e661550
Pull Request resolved: #127654
@XilunWu
Contributor Author

XilunWu commented Jun 4, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

@XilunWu
Contributor Author

XilunWu commented Jun 4, 2024

@pytorchbot merge -f "ignore ios-build-test job since it's pending forever"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i / --ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

bigfootjon pushed a commit that referenced this pull request Jun 5, 2024
Pull Request resolved: #127654
Approved by: https://github.com/Skylion007, https://github.com/malfet

(cherry picked from commit 6580a18)
petrex pushed a commit to petrex/pytorch that referenced this pull request Jun 5, 2024

Pull Request resolved: pytorch#127654
Approved by: https://github.com/Skylion007, https://github.com/malfet
@github-actions github-actions bot deleted the gh/XilunWu/81/head branch July 7, 2024 01:59
