PyTorch deadlock from distributed multiprocessing #22788
Labels
oncall: distributed
triaged
🐛 Bug
PyTorch 1.1.0 hangs indefinitely when using distributed multiprocessing, after roughly 77k iterations (observed at iterations 77939 and 77940). With PyTorch 1.0.1.post2 there is no such bug.
It appears to be a deadlock.
The following distributed modules combine in a very nasty way for some reason:
To Reproduce
Steps to reproduce the behavior:
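The original reproduction script is not included here; the sketch below is only an assumption of the kind of setup described (torch.multiprocessing.spawn workers joining a torch.distributed process group and running a long training loop). The model, backend, and loop length are illustrative, not taken from the report.

```python
# Hypothetical minimal sketch of the reported setup; the actual reproduction
# script is not part of the issue text. Model, backend, and hyperparameters
# are illustrative assumptions only.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn


def worker(rank, world_size):
    # Each spawned process joins the same process group (gloo backend shown
    # here so the sketch runs without GPUs).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # The report says the hang shows up after roughly 77k iterations.
    for step in range(100000):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 10)).sum()
        loss.backward()
        # Manually all-reduce gradients across processes each step.
        for p in model.parameters():
            dist.all_reduce(p.grad)
        optimizer.step()
        if rank == 0 and step % 1000 == 0:
            print("step", step)

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```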
Expected behavior
PyTorch should not hang.
Environment
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2
Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.0.130
GPU models and configuration: many
Nvidia driver version:
cuDNN version:
Versions of relevant libraries:
[pip] numpy==1.16.3
[pip] torch==1.1.0
[pip] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2019.3 199
[conda] mkl_fft 1.0.12 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch 1.1.0 py3.6_cuda10.0.130_cudnn7.5.1_0 pytorch
[conda] torchvision 0.3.0 py36_cu10.0.130_1 pytorch
Additional context
Reproduced by multiple users on multiple GPUs.