
Data parallel failure when using add_module #3174

Closed
zxytim opened this issue Oct 19, 2017 · 3 comments
zxytim commented Oct 19, 2017

code:

#!/usr/bin/env python3

import torch
from torch import nn
from torch.autograd import Variable



class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # submodules are kept in a plain Python list ...
        self.subs = [
            nn.Sequential(
                nn.Linear(2, 3), nn.ReLU(),
                nn.Linear(3, 1), nn.ReLU(),
            )
        ]

        # ... and registered manually via add_module
        for i, s in enumerate(self.subs):
            self.add_module('sub_{}'.format(i), s)

    def forward(self, x):
        # forward iterates over the plain list, not the registered attributes
        return [s(x) for s in self.subs]


m = Model()
m = nn.DataParallel(m, device_ids=[0, 1])
m.cuda()

inpvar = Variable(torch.zeros(8, 2)).cuda()
out = m(inpvar)

output:

$ ./test_data_parallel.py
Traceback (most recent call last):
  File "./test_data_parallel.py", line 31, in <module>
    out = m(inpvar)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 60, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 70, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 75, in parallel_apply
    raise output
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 50, in _worker
    output = module(*input, **kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "./test_data_parallel.py", line 23, in forward
    return [s(x) for s in self.subs]
  File "./test_data_parallel.py", line 23, in <listcomp>
    return [s(x) for s in self.subs]
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/linear.py", line 53, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/functional.py", line 554, in linear
    return torch.addmm(bias, input, weight.t())
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/autograd/variable.py", line 924, in addmm
    return cls._blas(Addmm, args, False)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/autograd/variable.py", line 920, in _blas
    return cls.apply(*(tensors + (alpha, beta, inplace)))
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/autograd/_functions/blas.py", line 26, in forward
    matrix1, matrix2, out=output)
RuntimeError: arguments are located on different GPUs at /pytorch/torch/lib/THC/generic/THCTensorMathBlas.cu:232
zxytim commented Oct 19, 2017

torch 0.2.0_3 with CUDA 8 and cuDNN 6.0.20; the NVIDIA driver version is 375.51.

zxytim commented Oct 19, 2017

However, it works when I use nn.ModuleList.
That solves my problem, but I am wondering why registering the modules with add_module while also keeping them in a plain list does not work.
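For reference, a minimal sketch of the ModuleList variant (only the container changes from the model above):

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList both registers the submodules and is itself replaced
        # on each DataParallel replica, so every replica sees its own copies.
        self.subs = nn.ModuleList([
            nn.Sequential(
                nn.Linear(2, 3), nn.ReLU(),
                nn.Linear(3, 1), nn.ReLU(),
            )
        ])

    def forward(self, x):
        return [s(x) for s in self.subs]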

apaszke commented Oct 19, 2017

That is expected; use ModuleList. The problem is that DataParallel has to replicate your module so that each replica holds parameters on a different GPU, but the replicas' __dict__s are only shallow copies. Because of this, the registered sub_i modules do differ on GPUs other than 0, while self.subs is the same list object in every replica and always points to the module that lives on GPU 0.
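A rough way to see this in code (a hypothetical check, not from this issue: it assumes two visible GPUs and the Model class from the report, and uses the replicate helper that DataParallel calls internally):

from torch.nn.parallel import replicate

m = Model().cuda()
replicas = replicate(m, [0, 1])

# the registered child ('sub_0') was replicated onto GPU 1 ...
print(replicas[1].sub_0[0].weight.data.get_device())    # expected: 1
# ... while the plain list still references the original module on GPU 0
print(replicas[1].subs[0][0].weight.data.get_device())  # expected: 0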

apaszke closed this as completed Oct 19, 2017
timgianitsos added a commit to timgianitsos/heart_cell_ml that referenced this issue Aug 7, 2023
`self.add_module` is a technique to register parameters, but it does
not work with DataParallel because it does not put the parameters into
the module's __dict__ in the normal way.
https://discuss.pytorch.org/t/discrepancy-between-manual-parameter-registration-vs-using-nn-modulelist-when-parallelizing/181055
pytorch/pytorch#3174
timgianitsos added further commits to timgianitsos/heart_cell_ml that referenced this issue on Aug 7 and Aug 14, 2023, with the same commit message as above.