
Conversation

XilunWu
Contributor

@XilunWu XilunWu commented May 31, 2024

Stack from ghstack (oldest at bottom):

Summary
Fix `test_init_pg_and_rpc_with_same_socket` in `test/distributed/test_store.py`, which was missing a call to destroy the created ProcessGroup before exiting the test function. This led to an "init PG twice" error in the test.

Test Plan
`pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket`
`ciflow/periodic`, since this test is included in `.ci/pytorch/multigpu-test.sh`
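For reference, a minimal sketch of the intended teardown pattern (illustrative only; the backend, addresses, and RPC options below are placeholders, not the exact test body):

```python
import torch.distributed as dist
from torch.distributed import rpc

def test_init_pg_and_rpc_with_same_socket():
    # Placeholder single-rank setup; the real test shares one TCP socket
    # between the ProcessGroup and RPC initialization.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
    )
    rpc.init_rpc(
        "worker0",
        rank=0,
        world_size=1,
        rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
            init_method="tcp://127.0.0.1:29500"
        ),
    )

    # ... test body ...

    rpc.shutdown()
    # The missing call: without it, the ProcessGroup leaks into the next
    # test and re-initialization fails with "init PG twice".
    dist.destroy_process_group()
```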

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented May 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127654

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 6fde438 with merge base 0e7bd7f:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and topic: not user facing labels May 31, 2024
XilunWu added a commit that referenced this pull request May 31, 2024
ghstack-source-id: 718abf9
Pull Request resolved: #127654
@XilunWu XilunWu added the ciflow/periodic label May 31, 2024
Reviewed lines in `test/distributed/test_store.py`:

```python
)

rpc.shutdown()
dist.destroy_process_group()
```
Skylion007 (Collaborator)

Nit: these `dist.destroy_process_group()` calls should probably be in a `try: ... finally:` construct or something.
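A minimal sketch of that shape (illustrative; the setup and test body are abbreviated):

```python
import torch.distributed as dist
from torch.distributed import rpc

def test_init_pg_and_rpc_with_same_socket():
    # ... set up the store, ProcessGroup, and RPC as the test already does ...
    try:
        # ... test body / assertions ...
        pass
    finally:
        # Teardown runs even if the body raised, so the ProcessGroup
        # cannot leak into the following test.
        rpc.shutdown()
        dist.destroy_process_group()
```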

XilunWu (Contributor Author)

Will address this in the next PR to avoid wasting CI resources. ;-)

XilunWu (Contributor Author)

Hey @Skylion007, right before I start adding `dist.destroy_process_group()` in a try-catch block, I want to raise some questions that bother me:

  1. Should `destroy_process_group` be put inside a try-catch block? I thought it would be better to have it error out if the call to `destroy_process_group()` fails. That said, should we make `destroy_process_group` idempotent?

  2. In the case of testing, we need to make sure that the ProcessGroup created in one test is actually destroyed before entering the next test. Won't a try-catch block silently hide the error if one occurs in `destroy_process_group`? (See the generic sketch below.)
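To make the distinction in question 2 concrete, a generic illustration (the `destroy` helper below is a hypothetical stand-in for `destroy_process_group()`, not the actual test code): a `try/except` swallows a failed teardown, while a `try/finally` only guarantees the teardown runs and still lets its exception propagate.

```python
def destroy():
    # Stand-in for a destroy_process_group() call that fails.
    raise RuntimeError("destroy failed")

# try/except around the cleanup (the pattern question 2 worries about):
# the cleanup failure is swallowed and the test appears to pass.
try:
    destroy()
except Exception:
    pass

# try/finally (what the review suggests): the cleanup is guaranteed to run
# after the test body, and if it raises, the error still propagates and
# fails the test visibly.
try:
    pass  # ... test body ...
finally:
    destroy()
```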

XilunWu (Contributor Author)

@Skylion007 I see what you mean. Yes, you're right: `destroy_process_group()` should be in a `finally` block.

@XilunWu XilunWu added the ciflow/trunk label Jun 3, 2024
@XilunWu
Contributor Author

XilunWu commented Jun 3, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@XilunWu
Contributor Author

XilunWu commented Jun 3, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / macos-13-py3-arm64 / build, trunk / macos-py3-arm64-mps

Details for Dev Infra team: raised by workflow job.

@XilunWu
Contributor Author

XilunWu commented Jun 3, 2024

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/XilunWu/81/orig onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/127654)

pytorchmergebot pushed a commit that referenced this pull request Jun 3, 2024
ghstack-source-id: e661550
Pull Request resolved: #127654
@XilunWu
Contributor Author

XilunWu commented Jun 4, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

@XilunWu
Contributor Author

XilunWu commented Jun 4, 2024

@pytorchbot merge -f "ignore ios-build-test job since it's pending forever"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i / --ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

bigfootjon pushed a commit that referenced this pull request Jun 5, 2024
Pull Request resolved: #127654
Approved by: https://github.com/Skylion007, https://github.com/malfet

(cherry picked from commit 6580a18)
petrex pushed a commit to petrex/pytorch that referenced this pull request Jun 5, 2024

Pull Request resolved: pytorch#127654
Approved by: https://github.com/Skylion007, https://github.com/malfet
@github-actions github-actions bot deleted the gh/XilunWu/81/head branch July 7, 2024 01:59
