
DataParallel with Torch 1.5 #40457

Open
mnxu7979 opened this issue Jun 23, 2020 · 10 comments
Labels
high priority module: data parallel module: regression It used to work, and now it doesn't triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@mnxu7979

mnxu7979 commented Jun 23, 2020

🐛 Bug

I tried to leverage multiple GPUs using nn.DataParallel. I got an error with torch 1.5, but the same code works with torch 1.4.

To Reproduce

I tested it with the code in this tutorial from PyTorch.org

The following code can be used to reproduce the error:

import torch 
import torch.nn as nn 

from torch.utils.data import Dataset, DataLoader 


# params 
input_size = 5
output_size = 2 

batch_size = 32
data_size = 32

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# dataloader 
class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length 
        self.data = torch.randn(length, size)
    
    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len 

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size), batch_size=batch_size, shuffle=True)


# simple model 
class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print('\t\tparameters is located at', next(self.parameters()).device)
        return output 
        
model = Model(input_size, output_size)

model = nn.DataParallel(model)
model.to(device)
for batch in iter(rand_loader):
    batch = batch.to(device)
    model(batch)

And I got the following error message:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-1-c88fccd98bfb> in <module>
     46 for batch in iter(rand_loader):
     47     batch = batch.to(device)
---> 48     model(batch)
     49

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    153             return self.module(*inputs[0], **kwargs[0])
    154         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 155         outputs = self.parallel_apply(replicas, inputs, kwargs)
    156         return self.gather(outputs, self.output_device)
    157

/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    163
    164     def parallel_apply(self, replicas, inputs, kwargs):
--> 165         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    166
    167     def gather(self, outputs, output_device):

/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     83         output = results[i]
     84         if isinstance(output, ExceptionWrapper):
---> 85             output.reraise()
     86         outputs.append(output)
     87     return outputs

/usr/local/lib/python3.6/dist-packages/torch/_utils.py in reraise(self)
    393             # (https://bugs.python.org/issue2651), so we work around it.
    394             msg = KeyErrorMessage(msg)
--> 395         raise self.exc_type(msg)

StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "<ipython-input-1-c88fccd98bfb>", line 39, in forward
    print('\t\tparameters is located at', next(self.parameters()).device)
StopIteration

Expected behavior

With torch 1.4, I got the following output without any error.

                parameters is located at cuda:0
                parameters is located at cuda:1
                parameters is located at cuda:2
                parameters is located at cuda:3

Environment

Collecting environment information...
PyTorch version: 1.5.1
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.5.1
[conda] Could not collect

cc @ezyang @gchanan @zou3519

@zou3519 zou3519 added high priority module: data parallel module: regression It used to work, and now it doesn't triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jun 23, 2020
@zou3519
Contributor

zou3519 commented Jun 23, 2020

This appears to be a regression so I am tentatively labeling it as high-pri.

@mrshenli
Contributor

This is a known regression. After #33907, parameters() on the replicated models is no longer populated, so accessing it in the forward pass (e.g. next(self.parameters())) raises StopIteration. The reason for this change is that parameters in the replicated models are no longer leaves, and hence should no longer be exposed as parameters of the replicas.
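For readers hitting this through next(self.parameters()) in their own forward, here is a minimal sketch of a workaround that avoids internal attributes; the fallback-to-input-device logic is an illustration, not code from this thread. Since a torch 1.5 replica yields an empty parameters() generator, fall back to the device of the scattered input:

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        # On torch 1.5, each DataParallel replica exposes an empty parameters()
        # generator, so next(self.parameters()) raises StopIteration.
        # The scattered input already lives on this replica's device, so use it.
        params = list(self.parameters())
        device = params[0].device if params else input.device
        print('\t\tforward running on', device)
        return self.fc(input)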

@mrshenli
Contributor

If you really need to access those parameters, one hacky solution is to read from the _former_parameters field (available in v1.5.1 but not v1.5.0). But this is not recommended and we cannot guarantee that this attribute will always be there in future releases.

def parameters(m, recurse=True):
    def model_parameters(m):
        ps = m._former_parameters.values() \
            if hasattr(m, "_former_parameters") \
            else m.parameters(recurse=False)
        for p in ps:
            yield p
    for m in m.modules() if recurse else [m]:
        for p in model_parameters(m):
            yield p
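For context, a usage sketch of that helper in place of the failing call; this is an illustration under the assumptions above, not an official API, and _former_parameters is an internal attribute that may disappear in future releases:

import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        # Call the module-level helper above instead of self.parameters();
        # on a replica it reads _former_parameters, which 1.5.1 still populates.
        print('\t\tparameters is located at', next(parameters(self)).device)
        return self.fc(input)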

@mrshenli
Contributor

cc @ngimel

@mnxu7979
Author

Thank you for answering. I don't need to access those parameters directly, but this issue/bug caused a crash when I was using the Hugging Face Transformers package. I downgraded to 1.4 and it worked fine.

@gchanan
Contributor

gchanan commented Jun 29, 2020

Is there an issue for the intersection of this and HuggingFace Transformers?

@ezyang
Contributor

ezyang commented Jun 29, 2020

According to @ngimel, HuggingFace already has an update to deal with this BC breakage.

@ngimel
Collaborator

ngimel commented Jun 29, 2020

@wmmxk

wmmxk commented Sep 21, 2020

I ran into a similar issue. It turns out self.parameters() is called when figuring out which GPU to use.

In my case, I made the following change to the Hugging Face implementation, and it works.


device=input_ids.device,  # after change
# device=next(self.parameters()).device,

@daniel347x

@wmmxk Thanks for the heads-up!

I can confirm this is a bug and your fix (as well as an additional fix, noted below) resolves the issue.

Two changes I made:

1. In transformers/generation_utils.py, change device=next(self.parameters()).device, to device=input_ids.device,
2. In transformers/modeling_gpt2.py, change attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype) to attention_mask = attention_mask.to(dtype=input_ids.dtype)
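The common idea behind both changes, shown as a rough sketch (the TinyLM module below is illustrative, not the actual Transformers code): inside forward, derive device and dtype from tensors that were scattered to the replica rather than from self.parameters().

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=10, hidden=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, attention_mask=None):
        # Take device/dtype from the inputs, which DataParallel has already
        # scattered to this replica, instead of next(self.parameters()).
        device = input_ids.device
        hidden = self.embed(input_ids)
        if attention_mask is not None:
            attention_mask = attention_mask.to(dtype=hidden.dtype, device=device)
            hidden = hidden * attention_mask.unsqueeze(-1)
        return self.head(hidden)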
