Commit
[BE] Fix flaky ProcessGroupGloo tests (#61396)
Summary: A hypothesis as to why tests such as #57469 may be flaky is that `c10d = ProcessGroupGloo(...)` is not actually guaranteed to be a synchronization point, so some ranks may create the PG, run all the error checking (which does not actually call into gloo APIs and so doesn't require synchronization), and then exit, all before other ranks have created the gloo pg. This can result in the following error:

```
File "distributed/test_c10d_gloo.py", line 1037, in test_reduce_checks
May 03 06:42:34 pg = c10d.ProcessGroupGloo(store, self.rank, self.world_size, self.opts())
May 03 06:42:34 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:35521
```

which indicates that the remote end has hung up. Furthermore, all the flaky tests in this file only do error checking and don't call into the gloo APIs, further indicating that this issue may be the root cause.

Not 100% sure this PR will fix it because I haven't been able to actually repro the issue even after 10000+ runs, but it happens regularly in CI.

To fix this, we add a `dist.barrier(group=pg)` call after creating the pg to enforce a synchronization. Would be good to land this and observe whether it helps with the flakiness.

Pull Request resolved: #61396
Reviewed By: mrshenli
Differential Revision: D29664189
Pulled By: rohan-varma
fbshipit-source-id: bc046d5d816fe6cb426522b85312383bfa3f90b7
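The ordering that the barrier is meant to enforce can be illustrated without Gloo. The sketch below is only an analogy, not the code in this PR: it uses threads and the standard library's `threading.Barrier` in place of ranks and `dist.barrier(group=pg)`, and the names `run_rank`, `simulate`, and `WORLD_SIZE` are hypothetical. It shows that once every rank waits on a barrier after setup, no rank can reach its exit step before all ranks have finished creating their group.

```python
import threading

WORLD_SIZE = 4  # hypothetical number of ranks, chosen for illustration


def run_rank(rank, barrier, events, lock):
    # Stand-in for `pg = c10d.ProcessGroupGloo(...)`: each rank sets up on its own.
    with lock:
        events.append(("created", rank))
    # Stand-in for `dist.barrier(group=pg)`: block until every rank has set up.
    barrier.wait()
    # Stand-in for the local-only error checks followed by the rank exiting.
    with lock:
        events.append(("exited", rank))


def simulate(world_size=WORLD_SIZE):
    barrier = threading.Barrier(world_size)
    events, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=run_rank, args=(rank, barrier, events, lock))
        for rank in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for kind, rank in simulate():
        print(kind, rank)
```

Without the `barrier.wait()` call, a fast thread could record its exit before a slow thread records its setup, which is the analogue of a rank hanging up before its peers have connected.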
1 parent 3e5d2b5, commit d520406
Showing 1 changed file with 36 additions and 30 deletions.