-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[c10d][BE] fix test_init_pg_and_rpc_with_same_socket #127654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127654
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New FailuresAs of commit 6fde438 with merge base 0e7bd7f ( NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
) | ||
|
||
rpc.shutdown() | ||
dist.destroy_process_group() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: these dist.destroy_process_group()
should probably be in a try: finally construct or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will address this in next PR to avoid wasting of CI usage. ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Skylion007 , right before I start adding dist.destroy_process_group()
in a try-catch block, I want to raise some questions that bothers me:
-
Should
destroy_process_group
be put inside a try-catch block? I thought it would be better to have it error out if the calling todestroy_process_group()
failed. IMO, that being said, should we makedestroy_process_group
idempotent? -
In the case of testing, we need to make sure that the ProcessGroup created in one test is actually destroyed before entering into the next test. Won't a try-catch block silently hide the error if one occurred in
destroy_process_group
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Skylion007 I see what you mean. Yes you're right, destroy_process_group()
should be in a finally
block.
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: trunk / macos-13-py3-arm64 / build, trunk / macos-py3-arm64-mps Details for Dev Infra teamRaised by workflow job |
@pytorchbot rebase -b main |
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here |
Successfully rebased |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: periodic / ios-build-test / build (default, 1, 1, macos-14-xlarge, SIMULATOR, arm64, 1, 0, 1), periodic / ios-build-test / build (default, 1, 1, macos-14-xlarge, OS, arm64, 1, 1, 1, mobilenetv2.yaml) Details for Dev Infra teamRaised by workflow job |
@pytorchbot merge -f "ignore ios-build-test job since it's pending forever" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
**Summary** fix `test_init_pg_and_rpc_with_same_socket` in `test/distributed/test_store.py` which missed a call to destroy the created ProcessGroup before exiting test function. It lead to "init PG twice" error in the test. **Test Plan** `pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket` `ciflow/periodic` since this test is included in `.ci/pytorch/multigpu-test.sh` Pull Request resolved: #127654 Approved by: https://github.com/Skylion007, https://github.com/malfet (cherry picked from commit 6580a18)
**Summary** fix `test_init_pg_and_rpc_with_same_socket` in `test/distributed/test_store.py` which missed a call to destroy the created ProcessGroup before exiting test function. It lead to "init PG twice" error in the test. **Test Plan** `pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket` `ciflow/periodic` since this test is included in `.ci/pytorch/multigpu-test.sh` Pull Request resolved: pytorch#127654 Approved by: https://github.com/Skylion007, https://github.com/malfet
Stack from ghstack (oldest at bottom):
Summary
fix
test_init_pg_and_rpc_with_same_socket
intest/distributed/test_store.py
which missed a call to destroy the created ProcessGroup before exiting test function. It lead to "init PG twice" error in the test.Test Plan
pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket
ciflow/periodic
since this test is included in.ci/pytorch/multigpu-test.sh
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k