DISABLED test_destroy_full_group (__main__.TestMPIWithFork) #53899
Labels: high priority, module: flaky-tests (Problem is a flaky test in CI), oncall: distributed (Add this issue/PR to distributed oncall triage queue), triage review

Comments
mrshenli added the high priority, oncall: distributed, and module: flaky-tests labels on Mar 12, 2021
This was first reported in #14554.
wayi1 pushed a commit that referenced this issue on Apr 13, 2021:
…gpu_test by retrying 3 times

Couldn't find the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator; this function does not involve any p2p or collective communication at all. Also checked the commit history: no commit touching `ProcessGroupMPI.cpp` landed within a few days before Mar 10th.

First failure (on Mar 10th): https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Closes: #53899

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this issue on Apr 13, 2021:
…gpu_test by retrying 3 times

Couldn't find the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator; this function does not involve any p2p or collective communication at all. Also checked the commit history: no commit touching `ProcessGroupMPI.cpp` landed within a few days before Mar 10th.

First failure (on Mar 10th): https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Closes: #53899

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)

ghstack-source-id: 126383493
Pull Request resolved: #55921
wayi1 pushed a commit that referenced this issue on Apr 13, 2021:
…torch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times

Fix this flaky test by adding a barrier and retrying the flaky call to `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator, which it does mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all, and `MPI_Comm_create` itself lives in the MPI codebase, so it could not be dug into further. Also checked the commit history: no commit touching `ProcessGroupMPI.cpp` landed within a few days before Mar 10th.

First failure (on Mar 10th): https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI using a new branch: https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852 (test_destroy_full_group was rerun 100 times and passed).

Closes: #53899

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this issue on Apr 13, 2021:
…_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times

Pull Request resolved: #55921

Fix this flaky test by adding a barrier and retrying the flaky call to `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator, which it does mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all, and `MPI_Comm_create` itself lives in the MPI codebase, so it could not be dug into further. Also checked the commit history: no commit touching `ProcessGroupMPI.cpp` landed within a few days before Mar 10th.

First failure (on Mar 10th): https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI: https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852 (test_destroy_full_group was rerun 100 times and passed).

Closes: #53899

ghstack-source-id: 126414937
Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)
krshrimali pushed a commit to krshrimali/pytorch that referenced this issue on May 19, 2021:
…_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times (pytorch#55921)

Summary:
Pull Request resolved: pytorch#55921

Fix this flaky test by adding a barrier and retrying the flaky call to `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator, which it does mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all, and `MPI_Comm_create` itself lives in the MPI codebase, so it could not be dug into further. Also checked the commit history: no commit touching `ProcessGroupMPI.cpp` landed within a few days before Mar 10th.

First failure (on Mar 10th): https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI: https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852 (test_destroy_full_group was rerun 100 times and passed).

Closes: pytorch#53899
ghstack-source-id: 126414937

Test Plan:
```
export BACKEND=mpi
export WORLD_SIZE=2
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py -vs
```
```
#!/bin/bash
for i in {1..100}
do
  pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py
done
```

The CI tests were triggered by a new branch: https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=ci-all%2Fwayi_mpi

Reviewed By: mrshenli

Differential Revision: D27245421

fbshipit-source-id: 86e7fe208e34eda8a33885e385d56ec6b60eca27
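The commit messages above describe the fix only in prose. As a rough illustration of a barrier-plus-retry pattern around `MPI_Comm_create`, here is a minimal standalone C++ sketch. It is not the actual `ProcessGroupMPI.cpp` change: the helper name `createSubgroupComm`, the `kMaxRetries` constant, and the explicit error-handler setup are assumptions made for this example.

```cpp
// Sketch of "barrier, then retry MPI_Comm_create up to 3 times".
// NOT the actual ProcessGroupMPI.cpp change; names are illustrative.
#include <mpi.h>

#include <stdexcept>
#include <vector>

MPI_Comm createSubgroupComm(const std::vector<int>& ranks) {
  constexpr int kMaxRetries = 3;  // mirrors the "3 times" in the fix

  // MPI aborts on error by default; switch to MPI_ERRORS_RETURN so a failed
  // MPI_Comm_create returns an error code the loop below can retry on.
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

  MPI_Group worldGroup = MPI_GROUP_NULL;
  MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);

  MPI_Group subGroup = MPI_GROUP_NULL;
  MPI_Group_incl(worldGroup, static_cast<int>(ranks.size()), ranks.data(),
                 &subGroup);

  // Barrier so all ranks enter MPI_Comm_create together.
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Comm newComm = MPI_COMM_NULL;
  int rv = MPI_ERR_UNKNOWN;
  for (int attempt = 0; attempt < kMaxRetries; ++attempt) {
    rv = MPI_Comm_create(MPI_COMM_WORLD, subGroup, &newComm);
    if (rv == MPI_SUCCESS) {
      break;
    }
  }

  MPI_Group_free(&subGroup);
  MPI_Group_free(&worldGroup);

  if (rv != MPI_SUCCESS) {
    throw std::runtime_error("MPI_Comm_create failed after retries");
  }
  return newComm;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int worldSize = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

  // Exercise the helper with the full set of ranks (a "full group").
  std::vector<int> ranks(worldSize);
  for (int i = 0; i < worldSize; ++i) {
    ranks[i] = i;
  }

  MPI_Comm comm = createSubgroupComm(ranks);
  if (comm != MPI_COMM_NULL) {
    MPI_Comm_free(&comm);
  }

  MPI_Finalize();
  return 0;
}
```

Note that the error-handler switch is what makes the retry loop observable at all; with the default handler, a failing `MPI_Comm_create` would abort the process instead of returning an error code. Retrying a collective call like this is also only well-defined when every rank fails and retries together, so the sketch does not attempt to handle partial failures.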
Marking this as high priority, as this test has recently become very flaky on pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test:
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129
https://app.circleci.com/pipelines/github/pytorch/pytorch/284384/workflows/32712925-932c-4f3e-abf1-d52d78c89bce/jobs/11491532
https://app.circleci.com/pipelines/github/pytorch/pytorch/284005/workflows/521fd4ba-fc51-4c7a-af44-87b29d06b72b/jobs/11470503
https://app.circleci.com/pipelines/github/pytorch/pytorch/284133/workflows/35a4b67a-96c5-4cc7-90a7-0702e20ff326/jobs/11481258
https://app.circleci.com/pipelines/github/pytorch/pytorch/284161/workflows/2db80538-14d9-49b2-bbe5-1036e0aa97f5/jobs/11481323
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu