DISABLED test_destroy_full_group (__main__.TestMPIWithFork) #53899

Closed
mrshenli opened this issue Mar 12, 2021 · 1 comment
Assignees: wayi1
Labels: high priority · module: flaky-tests · oncall: distributed · triage review

Comments

mrshenli (Contributor) commented Mar 12, 2021

Marking this as high priority, as this test has recently become very flaky on pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test:

https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129
https://app.circleci.com/pipelines/github/pytorch/pytorch/284384/workflows/32712925-932c-4f3e-abf1-d52d78c89bce/jobs/11491532
https://app.circleci.com/pipelines/github/pytorch/pytorch/284005/workflows/521fd4ba-fc51-4c7a-af44-87b29d06b72b/jobs/11470503
https://app.circleci.com/pipelines/github/pytorch/pytorch/284133/workflows/35a4b67a-96c5-4cc7-90a7-0702e20ff326/jobs/11481258
https://app.circleci.com/pipelines/github/pytorch/pytorch/284161/workflows/2db80538-14d9-49b2-bbe5-1036e0aa97f5/jobs/11481323

Mar 11 01:59:38   test_destroy_full_group (__main__.TestMPIWithFork) ... [f577aff0b45e:610] *** An error occurred in MPI_Comm_create
Mar 11 01:59:38 [f577aff0b45e:610] *** reported by process [4275896321,1]
Mar 11 01:59:38 [f577aff0b45e:610] *** on communicator MPI_COMM_WORLD
Mar 11 01:59:38 [f577aff0b45e:610] *** MPI_ERR_TRUNCATE: message truncated
Mar 11 01:59:38 [f577aff0b45e:610] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
Mar 11 01:59:38 [f577aff0b45e:610] ***    and potentially your MPI job)
Mar 11 01:59:38 Traceback (most recent call last):
Mar 11 01:59:38   File "test/run_test.py", line 1074, in <module>
Mar 11 01:59:38     main()
Mar 11 01:59:38   File "test/run_test.py", line 1053, in main
Mar 11 01:59:38     raise RuntimeError(err_message)
Mar 11 01:59:38 RuntimeError: distributed/test_distributed_fork failed!

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu
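
For context, the aborting call in the log above is `MPI_Comm_create`, which `createProcessGroupMPI` uses to build a subgroup communicator. The standalone program below is only an illustrative sketch of that operation against the same MPI API, not the actual `ProcessGroupMPI.cpp` code; the file and variable names are made up for the example.

```
// Illustrative only: a minimal sketch of the operation the log points at,
// i.e. creating a subgroup communicator from MPI_COMM_WORLD with MPI_Comm_create.
// Not the actual ProcessGroupMPI.cpp code.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Use the full world group, mirroring the "full group" case in the test name.
  MPI_Group world_group;
  MPI_Comm_group(MPI_COMM_WORLD, &world_group);

  // MPI_Comm_create is collective over MPI_COMM_WORLD; this is the call the
  // CI log shows aborting with MPI_ERR_TRUNCATE.
  MPI_Comm sub_comm = MPI_COMM_NULL;
  MPI_Comm_create(MPI_COMM_WORLD, world_group, &sub_comm);

  if (sub_comm != MPI_COMM_NULL) {
    std::printf("rank %d/%d joined the new communicator\n", rank, size);
    MPI_Comm_free(&sub_comm);
  }
  MPI_Group_free(&world_group);
  MPI_Finalize();
  return 0;
}
```

Build and run with, e.g., `mpicxx repro.cpp -o repro && mpirun -np 2 ./repro` (names illustrative); as noted in the commits below, the CI failure itself did not reproduce locally.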

mrshenli added the high priority, oncall: distributed, and module: flaky-tests labels on Mar 12, 2021
mrshenli (Contributor, Author) commented:
This was first reported in #14554

wayi1 self-assigned this on Mar 22, 2021
wayi1 pushed a commit that referenced this issue Apr 13, 2021
…gpu_test by retrying 3 times

Couldn't find the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator. This function does not involve any p2p or collective communication at all.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

#Closes: #53899

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this issue Apr 13, 2021
…gpu_test by retrying 3 times

Couldn't find the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator. This function does not involve any p2p or collective communication at all.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

#Closes: #53899

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)

ghstack-source-id: 126383493
Pull Request resolved: #55921
wayi1 pushed a commit that referenced this issue Apr 13, 2021
…torch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times"


Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator, which it does mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all, and it is hard to dig further into `MPI_Comm_create`, which lives in the MPI codebase.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI using a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.

#Closes: #53899

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)

[ghstack-poisoned]
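
To make the mitigation in the commit message above concrete, here is a hedged C++ sketch of a barrier-plus-retry wrapper around `MPI_Comm_create`. The helper name, the retry-count parameter, the sleep-based backoff, and the `MPI_ERRORS_RETURN` error handler are assumptions for illustration only; the real change is the one in PR #55921 / D27245421.

```
// Illustrative sketch only (assumed helper name and details, not the actual
// ProcessGroupMPI.cpp change): synchronize all ranks with a barrier, then
// retry the MPI_Comm_create collective a few times before giving up.
#include <mpi.h>
#include <chrono>
#include <stdexcept>
#include <thread>

MPI_Comm createCommWithRetry(MPI_Group group, int maxRetries = 3) {
  // Return errors instead of aborting (the default MPI_ERRORS_ARE_FATAL
  // handler is what produces the hard abort seen in the CI log).
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

  // Make sure every rank has reached this point before the collective call.
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Comm newComm = MPI_COMM_NULL;
  for (int attempt = 0; attempt < maxRetries; ++attempt) {
    int rc = MPI_Comm_create(MPI_COMM_WORLD, group, &newComm);
    if (rc == MPI_SUCCESS) {
      // Note: newComm is MPI_COMM_NULL on ranks that are not in `group`.
      return newComm;
    }
    // Brief pause before retrying (illustrative backoff, not from the PR).
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  throw std::runtime_error("MPI_Comm_create failed after retries");
}
```

The barrier keeps all ranks roughly in step before the collective starts, and the retries paper over whatever residual race remains; per the commit message, the underlying root cause was never identified.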
wayi1 pushed a commit that referenced this issue Apr 13, 2021
…_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times

Pull Request resolved: #55921

Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator, which it does mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all, and it is hard to dig further into `MPI_Comm_create`, which lives in the MPI codebase.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.


#Closes: #53899
ghstack-source-id: 126414937

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)
krshrimali pushed a commit to krshrimali/pytorch that referenced this issue May 19, 2021
…_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times (pytorch#55921)

Summary:
Pull Request resolved: pytorch#55921

Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator, which it does mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all, and it is hard to dig further into `MPI_Comm_create`, which lives in the MPI codebase.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.

#Closes: pytorch#53899
ghstack-source-id: 126414937

Test Plan:
```
export BACKEND=mpi
export WORLD_SIZE=2
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py -vs
```

```
#!/bin/bash
for i in {1..100}
do
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py
done
```

The CI tests were triggered from a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=ci-all%2Fwayi_mpi

Reviewed By: mrshenli

Differential Revision: D27245421

fbshipit-source-id: 86e7fe208e34eda8a33885e385d56ec6b60eca27