
Pytorch deadlock from distributed multiprocessing #22788

Open
johnwlambert opened this issue Jul 12, 2019 · 3 comments
Labels
oncall: distributed, triaged

Comments

@johnwlambert

🐛 Bug

PyTorch 1.1.0 hangs indefinitely when using distributed multiprocessing, after roughly 77k iterations (observed at iterations 77,939 and 77,940). PyTorch 1.0.1.post2 does not have this bug.

It appears to be a deadlock.

For some reason, the following distributed components combine in a very nasty way:

import apex
import torch
import torch.distributed as dist

# Set up the process group, the distributed sampler, apex AMP, and apex DDP.
dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
model, optimizer = apex.amp.initialize(model.cuda(), optimizer, opt_level=args.opt_level, keep_batchnorm_fp32=args.keep_batchnorm_fp32, loss_scale=args.loss_scale)
model = apex.parallel.DistributedDataParallel(model)

To Reproduce

Steps to reproduce the behavior:

  1. Clone the following repo: semseg
  2. Use the following config:
  3. Run the training script train.py
  4. Observe that PyTorch hangs indefinitely after the 77,939th iteration.

Expected behavior

PyTorch should not hang.

Environment

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.0.130
GPU models and configuration: many
Nvidia driver version:
cuDNN version:

Versions of relevant libraries:
[pip] numpy==1.16.3
[pip] torch==1.1.0
[pip] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2019.3 199
[conda] mkl_fft 1.0.12 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch 1.1.0 py3.6_cuda10.0.130_cudnn7.5.1_0 pytorch
[conda] torchvision 0.3.0 py36_cu10.0.130_1 pytorch

Additional context

Reproduced by multiple users, on multiple GPUs.

@pytorchbot added the oncall: distributed label on Jul 12, 2019
@jerryzh168 added the triaged label on Jul 13, 2019
soumith (Member) commented Jul 14, 2019

@mcarilli @pietern any hints?

pietern (Contributor) commented Jul 15, 2019

Can you grab a stack trace of the processes when they are stuck?
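
For example (a sketch, not taken from this issue, assuming Linux and that train.py can be modified before launch), the standard-library faulthandler module can dump the Python stack of every thread when a stuck worker receives a signal:

import faulthandler
import signal

# Dump all Python thread stacks to stderr when this worker receives SIGUSR1,
# e.g. via `kill -USR1 <pid>` after the hang; the signal choice is arbitrary.
faulthandler.register(signal.SIGUSR1, all_threads=True)

Alternatively, py-spy dump --pid <PID>, or attaching gdb and running thread apply all bt for the native (NCCL/C++) side, works without modifying the script.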

Is there any control flow in your trainer that may desynchronize the workers? For example, do you have an if rank == 0 path in there that calls some collective?
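
To illustrate the kind of desynchronizing control flow meant here (a hypothetical snippet, not taken from the semseg repo), a collective reached by only one rank will hang, because the other ranks never enter the matching call:

import torch
import torch.distributed as dist

def log_global_loss(loss: torch.Tensor, rank: int) -> None:
    # BUG: all_reduce is a collective, so every rank must call it.
    # Only rank 0 calls it here, so rank 0 blocks forever waiting for peers.
    if rank == 0:
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        print(f"global loss: {loss.item()}")

The fix is to call all_reduce on every rank and guard only the print with the rank check.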

shoaibahmed (Contributor) commented

@pietern I have the same issue. I attached a stack trace for my case, along with the NCCL log, in #22834.
