Internal assert failed using multiple GPUs with DataParallel #32036

tomhosking opened this issue Jan 10, 2020 · 3 comments

@tomhosking tomhosking commented Jan 10, 2020

🐛 Bug

I'm getting a runtime error with a "please report a bug" message, as below:

Traceback (most recent call last):
  File "./src/", line 83, in <module>
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/absl/", line 299, in run
    _run_main(main, args)
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/absl/", line 250, in _run_main
  File "./src/", line 64, in main
  File "/home/s1717552/torchaq/src/agents/", line 173, in train
  File "/home/s1717552/torchaq/src/agents/", line 209, in train_one_epoch
    loss = self.step_train(batch, self.tgt_field)
  File "/home/s1717552/torchaq/src/agents/", line 95, in step_train
    output, logits = self.decode_teacher_force(self.model, batch, 'q')
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/modules/", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/s1717552/torchaq/src/models/samplers/", line 25, in forward
    pred_logits, _ = model(batch, output)
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/modules/", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/parallel/", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/parallel/", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/parallel/", line 85, in parallel_apply
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/parallel/", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/modules/", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/s1717552/torchaq/src/models/", line 208, in forward
    memory = self.encoder_projection(memory_full)
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/modules/", line 533, in __call__
    result = hook(self, input)
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/utils/", line 55, in __call__
    setattr(module,, self.compute_weight(module))
  File "/home/s1717552/torchaq/aqenv/lib64/python3.6/site-packages/torch/nn/utils/", line 18, in compute_weight
    return _weight_norm(v, g, self.dim)
RuntimeError: diff_view_meta->output_nr_ == 0 INTERNAL ASSERT FAILED at /pytorch/torch/csrc/autograd/variable.cpp:134, please report a bug to PyTorch.

The model trains without a problem on a single GPU, but fails when using DataParallel on multiple GPUs. Wrapping the model with DataParallel when only one GPU is available still works OK.

To Reproduce

This happens when using DataParallel with a pre-built complex model. I'm not sure how to create a minimal code example that reproduces the problem.
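A minimal reproduction attempt might look something like the sketch below. It assumes the failing layer is an nn.Linear wrapped in torch.nn.utils.weight_norm, which is what the `_weight_norm` call in the traceback suggests; the `TinyModel` class and all tensor sizes are hypothetical, and only the `encoder_projection` and `memory_full` names come from the traceback. Whether it actually triggers the assert probably depends on the PyTorch version and on having more than one GPU visible.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm


class TinyModel(nn.Module):
    """Hypothetical stand-in for the real model: one weight-normalized layer."""

    def __init__(self, in_dim=16, out_dim=8):
        super().__init__()
        # Assumption: the real encoder_projection is a weight-normalized Linear.
        self.encoder_projection = weight_norm(nn.Linear(in_dim, out_dim))

    def forward(self, memory_full):
        return self.encoder_projection(memory_full)


# Falls back to CPU when no GPU is present; DataParallel then just calls the module.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(TinyModel().to(device))

# Batch of 8 so that it can be split across up to 4 GPUs.
batch = torch.randn(8, 16, device=device)
memory = model(batch)      # the reported failure happens during this forward pass
memory.sum().backward()    # also exercise the backward pass
print(tuple(memory.shape))
```

Running it once with a single visible GPU and once with several (for example by changing CUDA_VISIBLE_DEVICES) would show whether the failure is specific to the multi-GPU replicas.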

Expected behavior



PyTorch version: 1.3.1
Is debug build: No
CUDA used to build PyTorch: 10.1.243

OS: Scientific Linux release 7.6 (Nitrogen)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: version

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti

Nvidia driver version: 440.44
cuDNN version: /usr/lib64/

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==1.3.1
[conda] Could not collect

Additional context


cc @ngimel


@izdeby izdeby commented Jan 15, 2020

@tomhosking, could you please provide code that reproduces the issue?


@orz-orz-orz-orz orz-orz-orz-orz commented Jan 16, 2020

I encountered the same error with an nn.Embedding wrapped in nn.DataParallel when max_norm=1.0 is set. Maybe the weight norm is broken.
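The embedding case can be sketched roughly as follows. This is only an illustration of the setup described, not a confirmed reproduction: the vocabulary size, embedding dimension and index shape are made up, and the script falls back to CPU (where DataParallel simply calls the wrapped module) when no GPU is available.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
NUM_EMBEDDINGS = 10
EMBEDDING_DIM = 4

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# max_norm makes the lookup renormalize the selected rows of the weight in place.
embedding = nn.Embedding(NUM_EMBEDDINGS, EMBEDDING_DIM, max_norm=1.0).to(device)
model = nn.DataParallel(embedding)

# 8 sequences of 5 token indices each; the batch dimension is what gets scattered.
indices = torch.randint(0, NUM_EMBEDDINGS, (8, 5), device=device)
vectors = model(indices)
print(tuple(vectors.shape))
```

Dropping max_norm from the constructor and rerunning would help confirm whether the in-place renormalization is what triggers the error.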


@leonardoaraujosantos leonardoaraujosantos commented Jan 20, 2020

I'm getting NCCL errors when using nn.DataParallel on PyTorch 1.4.0:

/mnt/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/ in replicate(network, devices, detach)
     86     params = list(network.parameters())
     87     param_indices = {param: idx for idx, param in enumerate(params)}
---> 88     param_copies = _broadcast_coalesced_reshape(params, devices, detach)
     90     buffers = list(network.buffers())

/mnt/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/ in _broadcast_coalesced_reshape(tensors, devices, detach)
     69         # Use the autograd function to broadcast if not detach
     70         if len(tensors) > 0:
---> 71             tensor_copies = Broadcast.apply(devices, *tensors)
     72             return [tensor_copies[i:i + len(tensors)]
     73                     for i in range(0, len(tensor_copies), len(tensors))]

/mnt/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/ in forward(ctx, target_gpus, *inputs)
     19         ctx.num_inputs = len(inputs)
     20         ctx.input_device = inputs[0].get_device()
---> 21         outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
     22         non_differentiables = []
     23         for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):

/mnt/anaconda3/lib/python3.7/site-packages/torch/cuda/ in broadcast_coalesced(tensors, devices, buffer_size)
     37         corresponding to indices from ``devices``.
     38     """
---> 39     return torch._C._broadcast_coalesced(tensors, devices, buffer_size)

RuntimeError: NCCL Error 2: unhandled system error