Commit 8deb4fe

pritamdamania authored and facebook-github-bot committed
Fix flaky NCCL error handling tests. (#42149)
Summary:
Pull Request resolved: #42149

Some of these tests were flaky because we could kill the process in some way without cleaning up the ProcessGroup. As a result, the FileStore was not cleaned up properly, which caused other processes in the group to crash.

Fixed this by explicitly deleting the process_group before we forcibly bring a process down.

ghstack-source-id: 108629057

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D22785042

fbshipit-source-id: c31d0f723badbc23b7258e322f75b57e0a1a42cf
1 parent b6a9f42 commit 8deb4fe
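
The failure mode described in the summary can be reproduced without NCCL or GPUs. The following is a minimal sketch, not the test's actual code: `FileBackedStore` is a hypothetical stand-in for a store that removes its file when the object is destroyed, and `os._exit` is assumed here as one way of bringing a process down without running any cleanup. It relies on CPython destroying an object as soon as its last reference is deleted.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Child program. FileBackedStore is a hypothetical stand-in, not the real
# FileStore: it creates a file and removes it only when the object is destroyed.
CHILD = textwrap.dedent('''
    import os, sys

    class FileBackedStore:
        def __init__(self, path):
            self.path = path
            open(path, "w").close()

        def __del__(self):
            if os.path.exists(self.path):
                os.remove(self.path)

    store = FileBackedStore(sys.argv[1])
    if sys.argv[2] == "cleanup":
        del store      # last reference dropped: CPython runs __del__ here
    os._exit(1)        # abrupt exit: no destructors or atexit handlers run
''')


def store_file_survives(mode):
    """Run the child and report whether its store file outlived the process."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store")
        subprocess.run([sys.executable, "-c", CHILD, path, mode])
        return os.path.exists(path)


leaked_without_del = store_file_survives("no-cleanup")
leaked_with_del = store_file_survives("cleanup")
print("file left behind without del:", leaked_without_del)
print("file left behind with del:", leaked_with_del)
```

Deleting the object before the forced exit is what gives the destructor a chance to run. The commit applies the same ordering: `del process_group` runs before `func()` takes the process down.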

1 file changed, +3 -1 lines changed

test/distributed/test_c10d.py

Lines changed: 3 additions & 1 deletion
@@ -3413,6 +3413,8 @@ def _test_nccl_errors_blocking(self, func):
             # aborting nccl communicators before throwing Operation timed out
             a = torch.rand(10).cuda(self.rank)
         elif self.rank == 1:
+            # Clean up structures (ex: files for FileStore before going down)
+            del process_group
             func()
         else:
             # Wait for timeout
@@ -3494,7 +3496,7 @@ def _wait_for_comm_abort(self, process_group):
                     return
                 else:
                     raise e
-            time.sleep(1)
+            time.sleep(0.1)
 
     @requires_nccl()
     @skip_if_lt_x_gpu(3)
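
The second hunk shortens the sleep between retries in `_wait_for_comm_abort` from 1 second to 0.1 seconds. The visible context lines suggest a loop that retries until an expected exception appears and re-raises anything else. Below is a standalone sketch of that polling pattern under stated assumptions: `wait_for_abort`, `fake_probe`, the error text, and the `timeout` guard are all hypothetical and not part of the commit.

```python
import time


def wait_for_abort(probe, interval=0.1, timeout=5.0):
    """Call probe() repeatedly until it raises the expected abort error.

    The timeout is an addition for this sketch so the loop cannot spin forever.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            probe()
        except RuntimeError as e:
            # Hypothetical error text, used only for this illustration.
            if "communicator was aborted" in str(e):
                return
            raise  # any other error is unexpected and is propagated
        time.sleep(interval)  # a shorter interval notices the abort sooner
    raise TimeoutError("abort was not observed in time")


calls = {"n": 0}


def fake_probe():
    """Succeeds twice, then reports the abort on the third call."""
    calls["n"] += 1
    if calls["n"] >= 3:
        raise RuntimeError("communicator was aborted")


start = time.monotonic()
wait_for_abort(fake_probe, interval=0.1)
elapsed = time.monotonic() - start
print("probe calls:", calls["n"])
```

With two sleeps before the abort is seen, the wait is governed almost entirely by `interval`, so a smaller value shortens the loop at the cost of more frequent probing.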
