Issue for DataParallel #8637
A demo code is provided to reproduce the errors. testModule2 works fine while testModule does not.

import torch
from torch import nn


class testModule(nn.Module):
    def __init__(self):
        super(testModule, self).__init__()
        self.g = nn.Conv2d(in_channels=1, out_channels=1,
                           kernel_size=1, stride=1, padding=0)
        # Saving the bound method as an attribute is what breaks under DataParallel.
        self.operation_function = self._realOperation

    def forward(self, x):
        output = self.operation_function(x)
        return output

    def _realOperation(self, x):
        x = self.g(x)
        return x


class testModule2(nn.Module):
    def __init__(self):
        super(testModule2, self).__init__()
        self.g = nn.Conv2d(in_channels=1, out_channels=1,
                           kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        x = self.g(x)
        return x


if __name__ == '__main__':
    input = torch.rand(4, 1, 1, 1).cuda()
    net = testModule()
    net2 = testModule2()
    gpu_num = torch.cuda.device_count()
    print('GPU NUM: {:2d}'.format(gpu_num))
    if gpu_num > 1:
        net = torch.nn.DataParallel(net, list(range(gpu_num))).cuda()
        net2 = torch.nn.DataParallel(net2, list(range(gpu_num))).cuda()
    out2 = net2(input)
    print(out2.size())
    out = net(input)
    print(out.size())
Yeah. If you just do not save the method as an attribute in __init__, the code should work fine, e.g., by directly calling self._realOperation(x) in forward. The stored attribute is a method bound to the original module, so the replicas created by DataParallel still call into the original module, whose parameters live on a single device. In theory we could fix this by checking the broadcast module's attributes and rebinding such methods to each replica.
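To make that concrete, here is a minimal sketch of the repro with the stored attribute removed (the class name TestModuleFixed is hypothetical, not from the thread); forward calls the helper method directly, so the lookup resolves on whichever replica DataParallel is running:

import torch
from torch import nn

class TestModuleFixed(nn.Module):
    def __init__(self):
        super(TestModuleFixed, self).__init__()
        self.g = nn.Conv2d(in_channels=1, out_channels=1,
                           kernel_size=1, stride=1, padding=0)
        # Note: no self.operation_function = self._realOperation here.

    def forward(self, x):
        # Direct call: resolved on the replica, not on the original module.
        return self._realOperation(x)

    def _realOperation(self, x):
        return self.g(x)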
I use DataParallel to compute the audio loss from https://github.com/JohnVinyard/experiments/blob/master/audio-loss/audio_loss.py
The error looks like this:
@ssnl, is it possible to revisit the decision about supporting this in pytorch? pretrained-models.pytorch uses this pattern to build atop the torchvision models.
@willprice You can see this: Cadene/pretrained-models.pytorch#145
Reopening for discussion!
cc @apaszke @colesbury @soumith @fmassa for opinions
The issue with this pull request, while fine for new users of pretrained-models, is that it changes the classes of the returned models, which breaks downstream code relying on instance checks.
I still don't think we should be handling this. Our cloning logic is already quite expensive, and we already take a fair number of shortcuts to make it faster. Doing something like this would require us to inspect every single attribute of each module, and there are just too many of those. Remember that this is not a one-time cost; it happens at every iteration!
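For context, a fix along those lines would have to walk every instance attribute of every module at replication time and rebind any method bound to the original. A rough, hypothetical sketch of that per-replica work (rebind_method_attrs is illustrative only, not part of PyTorch):

import types

def rebind_method_attrs(original, replica):
    # Inspect every plain instance attribute of the original module.
    for name, value in vars(original).items():
        # If the attribute is a method bound to the original module...
        if isinstance(value, types.MethodType) and value.__self__ is original:
            # ...rebind its underlying function to the replica instead.
            setattr(replica, name, types.MethodType(value.__func__, replica))

Running something like this over every attribute of every module, for every replica, at every iteration is the overhead being objected to.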
So, to make a model work on multiple GPUs using DataParallel, I should put all of its operations in one method, the forward method. Is that right?
@wmmxk you should see SsnL's post above. Not all operations need to be in the same method, but dynamically binding functions to attributes won't work.
When I tried to use multi-GPU training, I got an error during training.
@soumith Do you have any comments about this?