
Data parallel failure when using add_module #3174

Closed
zxytim opened this issue Oct 19, 2017 · 3 comments
zxytim commented Oct 19, 2017

code:

#!/usr/bin/env python3

import torch
from torch import nn
from torch.autograd import Variable



class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # submodules are kept in a plain Python list ...
        self.subs = [
            nn.Sequential(
                nn.Linear(2, 3), nn.ReLU(),
                nn.Linear(3, 1), nn.ReLU(),
            )
        ]

        # ... and registered manually via add_module
        for i, s in enumerate(self.subs):
            self.add_module('sub_{}'.format(i), s)

    def forward(self, x):
        # forward iterates over the plain list, not the registered attributes
        return [s(x) for s in self.subs]


m = Model()
m = nn.DataParallel(m, device_ids=[0, 1])
m.cuda()

inpvar = Variable(torch.zeros(8, 2)).cuda()
out = m(inpvar)

output:

$ ./test_data_parallel.py
Traceback (most recent call last):
  File "./test_data_parallel.py", line 31, in <module>
    out = m(inpvar)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 60, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 70, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 75, in parallel_apply
    raise output
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 50, in _worker
    output = module(*input, **kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "./test_data_parallel.py", line 23, in forward
    return [s(x) for s in self.subs]
  File "./test_data_parallel.py", line 23, in <listcomp>
    return [s(x) for s in self.subs]
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/modules/linear.py", line 53, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/nn/functional.py", line 554, in linear
    return torch.addmm(bias, input, weight.t())
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/autograd/variable.py", line 924, in addmm
    return cls._blas(Addmm, args, False)
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/autograd/variable.py", line 920, in _blas
    return cls.apply(*(tensors + (alpha, beta, inplace)))
  File "/home/zxy/.local/lib/python3.5/site-packages/torch/autograd/_functions/blas.py", line 26, in forward
    matrix1, matrix2, out=output)
RuntimeError: arguments are located on different GPUs at /pytorch/torch/lib/THC/generic/THCTensorMathBlas.cu:232
zxytim commented Oct 19, 2017

torch 0.2.0_3 with CUDA 8 and cuDNN 6.0.20; the NVIDIA driver version is 375.51.

zxytim commented Oct 19, 2017

However, it works when I use nn.ModuleList.
That solves my problem, but I am wondering why registering the modules with add_module while also keeping them in a plain list does not work.
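For reference, a minimal sketch of the ModuleList variant (only the container changes from the model above):

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList both registers the submodules and is itself replaced
        # on each DataParallel replica, so every replica sees its own copies.
        self.subs = nn.ModuleList([
            nn.Sequential(
                nn.Linear(2, 3), nn.ReLU(),
                nn.Linear(3, 1), nn.ReLU(),
            )
        ])

    def forward(self, x):
        return [s(x) for s in self.subs]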

apaszke commented Oct 19, 2017

That is expected; use ModuleList. The problem is that DataParallel has to replicate your module so that each replica holds parameters on a different GPU, but the replicas' __dict__s are only shallow copies. Because of this, the registered sub_i modules do differ on GPUs other than 0, while self.subs is the same list object in every replica and always points to the module that lives on GPU 0.
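A rough way to see this in code (a hypothetical check, not from this issue: it assumes two visible GPUs and the Model class from the report, and uses the replicate helper that DataParallel calls internally):

from torch.nn.parallel import replicate

m = Model().cuda()
replicas = replicate(m, [0, 1])

# the registered child ('sub_0') was replicated onto GPU 1 ...
print(replicas[1].sub_0[0].weight.data.get_device())    # expected: 1
# ... while the plain list still references the original module on GPU 0
print(replicas[1].subs[0][0].weight.data.get_device())  # expected: 0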

apaszke closed this as completed Oct 19, 2017
timgianitsos added a commit to timgianitsos/heart_cell_ml that referenced this issue Aug 7, 2023
`self.add_module` is a technique to register parameters, but it does
not work with DataParallel because it does not put the parameters into
the module's __dict__ in the normal way.
https://discuss.pytorch.org/t/discrepancy-between-manual-parameter-registration-vs-using-nn-modulelist-when-parallelizing/181055
pytorch/pytorch#3174
timgianitsos added further commits to timgianitsos/heart_cell_ml that referenced this issue on Aug 7 and Aug 14, 2023, with the same commit message as above.