
Pytorch deadlock from distributed multiprocessing #22788

Open
johnwlambert opened this issue Jul 12, 2019 · 3 comments
Labels
oncall: distributed, triaged

Comments

@johnwlambert

🐛 Bug

PyTorch 1.1.0 hangs indefinitely when using distributed multiprocessing, after roughly 77k iterations (observed at iterations 77,939 and 77,940). PyTorch 1.0.1.post2 does not have this bug.

It appears to be a deadlock.

For some reason, the following distributed components combine in a very nasty way:

import apex
import torch
import torch.distributed as dist

# Set up the process group, the distributed sampler, apex AMP, and apex DDP.
dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
model, optimizer = apex.amp.initialize(model.cuda(), optimizer, opt_level=args.opt_level, keep_batchnorm_fp32=args.keep_batchnorm_fp32, loss_scale=args.loss_scale)
model = apex.parallel.DistributedDataParallel(model)

To Reproduce

Steps to reproduce the behavior:

  1. Clone the following repo: semseg
  2. Use the following config:
  3. Run the training script train.py
  4. Observe that PyTorch hangs indefinitely after the 77,939th iteration.

Expected behavior

PyTorch should not hang.

Environment

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.0.130
GPU models and configuration: many
Nvidia driver version:
cuDNN version:

Versions of relevant libraries:
[pip] numpy==1.16.3
[pip] torch==1.1.0
[pip] torchvision==0.3.0
[conda] blas 1.0 mkl
[conda] mkl 2019.3 199
[conda] mkl_fft 1.0.12 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch 1.1.0 py3.6_cuda10.0.130_cudnn7.5.1_0 pytorch
[conda] torchvision 0.3.0 py36_cu10.0.130_1 pytorch

Additional context

Reproduced by multiple users, on multiple GPUs.

@pytorchbot added the oncall: distributed label on Jul 12, 2019
@jerryzh168 added the triaged label on Jul 13, 2019
soumith (Member) commented Jul 14, 2019

@mcarilli @pietern any hints?

pietern (Contributor) commented Jul 15, 2019

Can you grab a stack trace of the processes when they are stuck?
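
For example (a sketch, not taken from this issue, assuming Linux and that train.py can be modified before launch), the standard-library faulthandler module can dump the Python stack of every thread when a stuck worker receives a signal:

import faulthandler
import signal

# Dump all Python thread stacks to stderr when this worker receives SIGUSR1,
# e.g. via `kill -USR1 <pid>` after the hang; the signal choice is arbitrary.
faulthandler.register(signal.SIGUSR1, all_threads=True)

Alternatively, py-spy dump --pid <PID>, or attaching gdb and running thread apply all bt for the native (NCCL/C++) side, works without modifying the script.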

Is there any control flow in your trainer that may desynchronize the workers? For example, do you have an if rank == 0 path in there that calls some collective?
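
To illustrate the kind of desynchronizing control flow meant here (a hypothetical snippet, not taken from the semseg repo), a collective reached by only one rank will hang, because the other ranks never enter the matching call:

import torch
import torch.distributed as dist

def log_global_loss(loss: torch.Tensor, rank: int) -> None:
    # BUG: all_reduce is a collective, so every rank must call it.
    # Only rank 0 calls it here, so rank 0 blocks forever waiting for peers.
    if rank == 0:
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        print(f"global loss: {loss.item()}")

The fix is to call all_reduce on every rank and guard only the print with the rank check.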

shoaibahmed (Contributor) commented

@pietern I have the same issue. I attached a stack trace for my case, along with the NCCL log, in #22834.
