Increase timeout for ProcessGroupGlooTest #85474

Closed
Flamefire wants to merge 1 commit

Flamefire
Collaborator

We see spurious failures due to timeouts in `test_allreduce_coalesced_basics`, but only when running the whole test suite with `python run_test.py --verbose -i distributed/test_c10d_gloo`. Increasing the timeout to 50s should provide enough leeway to avoid this. Note that the default for `_timeout` is 30 minutes, so 50s is still well below it.

Originally reported in EasyBuild at easybuilders/easybuild-easyconfigs#15137 (comment); patch proposed by @casparvl.
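
To illustrate why a tight timeout produces spurious failures only under load, here is a minimal, PyTorch-free sketch. It is not the diff from this PR: `run_with_timeout` and `slow_collective` are hypothetical names, and the sleep is an assumed stand-in for a collective that is slowed down when the whole suite runs on a busy machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(op, timeout_s):
    """Run op() in a worker thread; return True if it finishes within timeout_s."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(op)
        try:
            future.result(timeout=timeout_s)
            return True
        except FutureTimeout:
            # The operation itself is fine, it was just slower than the budget.
            return False


def slow_collective():
    # Hypothetical stand-in: the same work, delayed by contention on the host.
    time.sleep(0.3)


# A budget that is too tight reports a failure even though nothing is broken.
tight = run_with_timeout(slow_collective, timeout_s=0.05)
# A budget with more leeway lets the same operation pass.
generous = run_with_timeout(slow_collective, timeout_s=5.0)
print(tight, generous)
```

The point of the sketch is only that the operation is identical in both calls; the outcome changes with the time budget, which is why raising the test timeout removes the flakiness without changing what is tested.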

pytorch-bot bot commented Sep 22, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85474

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 3 Pending

As of commit 0b92aca:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "topic: not user facing" label Sep 22, 2022
facebook-github-bot added the "cla signed" and "oncall: distributed" labels Sep 22, 2022
samdow added the "triaged" label Sep 22, 2022
Member

rohan-varma left a comment

thanks!

@rohan-varma
Member

@pytorchbot merge -g

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@Flamefire Flamefire deleted the test_c10d_gloo_timeout branch September 23, 2022 06:59
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
Pull Request resolved: #85474
Approved by: https://github.com/rohan-varma
alvgaona pushed a commit to alvgaona/pytorch that referenced this pull request Oct 11, 2022
Labels: cla signed, Merged, oncall: distributed, open source, topic: not user facing, triaged