Fix test_init_pg_and_rpc_with_same_socket by retrying on addr in use error #67638
@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Do we need retry_on_connect_failures for other tests as well?
Looking into other tests that use |
…ddr in use error"
Fixes #66983
The port returned by `find_free_port()` can become unavailable between the time it is retrieved and the time it is used by `init_process_group` and `init_rpc`. Example flaky test failure: https://github.com/pytorch/pytorch/runs/3954935266. Added the `retry_on_connect_failures` decorator (https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_utils.py#L2241-L2266) to retry the function on the `Address already in use` error.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang
Differential Revision: [D32074698](https://our.internmc.facebook.com/intern/diff/D32074698) [ghstack-poisoned]
Stack from ghstack:
Fixes #66983
The port returned by `find_free_port()` can become unavailable between the time it is retrieved and the time it is used by `init_process_group` and `init_rpc`. Example flaky test failure: https://github.com/pytorch/pytorch/runs/3954935266. Added the `retry_on_connect_failures` decorator (https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_utils.py#L2241-L2266) to retry the function on the `Address already in use` error.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang
Differential Revision: D32074698
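For readers unfamiliar with the race described above, here is a minimal, self-contained sketch of the general pattern. It is not the code from `common_utils.py`. The names `pick_free_port` and `retry_on_addr_in_use`, the retry count, the delay, and the choice to match on the error message text are all assumptions made for illustration.

```python
import functools
import socket
import time

# Message text the PR description says the retry keys on.
ADDRESS_IN_USE = "Address already in use"


def pick_free_port():
    """Hypothetical stand-in for a free-port helper.

    The OS assigns an unused port, but the socket is closed before the
    function returns, so another process may claim the port before the
    caller binds to it. That window is the race the PR describes.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]


def retry_on_addr_in_use(max_retries=3, delay_s=0.1):
    """Hypothetical decorator: rerun the wrapped function when it fails
    with an error whose message contains ADDRESS_IN_USE."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except (RuntimeError, OSError) as err:
                    is_last_attempt = attempt == max_retries - 1
                    # Unrelated errors, and the final failed attempt,
                    # propagate to the caller unchanged.
                    if ADDRESS_IN_USE not in str(err) or is_last_attempt:
                        raise
                    time.sleep(delay_s)

        return wrapper

    return decorator
```

For the retry to help, the wrapped function has to pick a new port on every attempt. Retrying with the same port would most likely hit the same error again.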