DISABLED test_destroy_full_group (__main__.TestMPIWithFork) #53899

Closed
mrshenli opened this issue Mar 12, 2021 · 1 comment
Assignees: wayi1
Labels: high priority · module: flaky-tests · oncall: distributed · triage review

Comments

mrshenli (Contributor) commented Mar 12, 2021

Marking this as high priority, as this test has recently become very flaky on pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test:

https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129
https://app.circleci.com/pipelines/github/pytorch/pytorch/284384/workflows/32712925-932c-4f3e-abf1-d52d78c89bce/jobs/11491532
https://app.circleci.com/pipelines/github/pytorch/pytorch/284005/workflows/521fd4ba-fc51-4c7a-af44-87b29d06b72b/jobs/11470503
https://app.circleci.com/pipelines/github/pytorch/pytorch/284133/workflows/35a4b67a-96c5-4cc7-90a7-0702e20ff326/jobs/11481258
https://app.circleci.com/pipelines/github/pytorch/pytorch/284161/workflows/2db80538-14d9-49b2-bbe5-1036e0aa97f5/jobs/11481323

Mar 11 01:59:38   test_destroy_full_group (__main__.TestMPIWithFork) ... [f577aff0b45e:610] *** An error occurred in MPI_Comm_create
Mar 11 01:59:38 [f577aff0b45e:610] *** reported by process [4275896321,1]
Mar 11 01:59:38 [f577aff0b45e:610] *** on communicator MPI_COMM_WORLD
Mar 11 01:59:38 [f577aff0b45e:610] *** MPI_ERR_TRUNCATE: message truncated
Mar 11 01:59:38 [f577aff0b45e:610] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
Mar 11 01:59:38 [f577aff0b45e:610] ***    and potentially your MPI job)
Mar 11 01:59:38 Traceback (most recent call last):
Mar 11 01:59:38   File "test/run_test.py", line 1074, in <module>
Mar 11 01:59:38     main()
Mar 11 01:59:38   File "test/run_test.py", line 1053, in main
Mar 11 01:59:38     raise RuntimeError(err_message)
Mar 11 01:59:38 RuntimeError: distributed/test_distributed_fork failed!

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu
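
For context, the aborting call in the log above is `MPI_Comm_create`, which `createProcessGroupMPI` uses to build a subgroup communicator. The standalone program below is only an illustrative sketch of that operation against the same MPI API, not the actual `ProcessGroupMPI.cpp` code; the file and variable names are made up for the example.

```
// Illustrative only: a minimal sketch of the operation the log points at,
// i.e. creating a subgroup communicator from MPI_COMM_WORLD with MPI_Comm_create.
// Not the actual ProcessGroupMPI.cpp code.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Use the full world group, mirroring the "full group" case in the test name.
  MPI_Group world_group;
  MPI_Comm_group(MPI_COMM_WORLD, &world_group);

  // MPI_Comm_create is collective over MPI_COMM_WORLD; this is the call the
  // CI log shows aborting with MPI_ERR_TRUNCATE.
  MPI_Comm sub_comm = MPI_COMM_NULL;
  MPI_Comm_create(MPI_COMM_WORLD, world_group, &sub_comm);

  if (sub_comm != MPI_COMM_NULL) {
    std::printf("rank %d/%d joined the new communicator\n", rank, size);
    MPI_Comm_free(&sub_comm);
  }
  MPI_Group_free(&world_group);
  MPI_Finalize();
  return 0;
}
```

Build and run with, e.g., `mpicxx repro.cpp -o repro && mpirun -np 2 ./repro` (names illustrative); as noted in the commits below, the CI failure itself did not reproduce locally.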

mrshenli added the high priority, oncall: distributed, and module: flaky-tests labels on Mar 12, 2021
mrshenli (Contributor, Author) commented:
This was first reported in #14554

wayi1 self-assigned this on Mar 22, 2021
wayi1 pushed a commit that referenced this issue Apr 13, 2021
…gpu_test by retrying 3 times

Couldn't find the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator. This function does not involve any p2p or collective communication at all.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

#Closes: #53899

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this issue Apr 13, 2021
…gpu_test by retrying 3 times

Couldn't find the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator. This function does not involve any p2p or collective communication at all.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

#Closes: #53899

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)

ghstack-source-id: 126383493
Pull Request resolved: #55921
wayi1 pushed a commit that referenced this issue Apr 13, 2021
…torch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times"


Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator, which it does mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all, and it is hard to dig further into `MPI_Comm_create`, which lives in the MPI codebase.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI using a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.

#Closes: #53899

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)

[ghstack-poisoned]
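
To make the mitigation in the commit message above concrete, here is a hedged C++ sketch of a barrier-plus-retry wrapper around `MPI_Comm_create`. The helper name, the retry-count parameter, the sleep-based backoff, and the `MPI_ERRORS_RETURN` error handler are assumptions for illustration only; the real change is the one in PR #55921 / D27245421.

```
// Illustrative sketch only (assumed helper name and details, not the actual
// ProcessGroupMPI.cpp change): synchronize all ranks with a barrier, then
// retry the MPI_Comm_create collective a few times before giving up.
#include <mpi.h>
#include <chrono>
#include <stdexcept>
#include <thread>

MPI_Comm createCommWithRetry(MPI_Group group, int maxRetries = 3) {
  // Return errors instead of aborting (the default MPI_ERRORS_ARE_FATAL
  // handler is what produces the hard abort seen in the CI log).
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

  // Make sure every rank has reached this point before the collective call.
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Comm newComm = MPI_COMM_NULL;
  for (int attempt = 0; attempt < maxRetries; ++attempt) {
    int rc = MPI_Comm_create(MPI_COMM_WORLD, group, &newComm);
    if (rc == MPI_SUCCESS) {
      // Note: newComm is MPI_COMM_NULL on ranks that are not in `group`.
      return newComm;
    }
    // Brief pause before retrying (illustrative backoff, not from the PR).
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  throw std::runtime_error("MPI_Comm_create failed after retries");
}
```

The barrier keeps all ranks roughly in step before the collective starts, and the retries paper over whatever residual race remains; per the commit message, the underlying root cause was never identified.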
wayi1 pushed a commit that referenced this issue Apr 13, 2021
…_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times

Pull Request resolved: #55921

Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator, which it does mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all, and it is hard to dig further into `MPI_Comm_create`, which lives in the MPI codebase.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.


#Closes: #53899
ghstack-source-id: 126414937

Differential Revision: [D27245421](https://our.internmc.facebook.com/intern/diff/D27245421/)
krshrimali pushed a commit to krshrimali/pytorch that referenced this issue May 19, 2021
…_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times (pytorch#55921)

Summary:
Pull Request resolved: pytorch#55921

Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it is only creating a subgroup communicator, which it does mainly by invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all, and it is hard to dig further into `MPI_Comm_create`, which lives in the MPI codebase.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was made in the few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.

#Closes: pytorch#53899
ghstack-source-id: 126414937

Test Plan:
```
export BACKEND=mpi
export WORLD_SIZE=2
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py -vs
```

```
#!/bin/bash
for i in {1..100}
do
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py
done
```

The CI tests were triggered from a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=ci-all%2Fwayi_mpi

Reviewed By: mrshenli

Differential Revision: D27245421

fbshipit-source-id: 86e7fe208e34eda8a33885e385d56ec6b60eca27