inception_v3 of vision 0.3.0 does not fit in DataParallel of torch 1.1.0 #1048
Comments
Yes, that's probably the reason. I believe we have three options:
I'd vote for option number 2, ccing @TheCodez and @Separius, who commented on / sent the aforementioned PR initially. What are your thoughts here?
@fmassa I agree, option 2 would be the best to avoid problems in the future.
@fmassa Yeah, the second option makes the most sense.
The problem still seems to be there. Here is a kind of mixed solution of @fmassa's options 1 and 2: change the gather function in torch/nn/parallel/scatter_gather.py (the file that imports torch and defines Gather) to

def gather(outputs, target_device, dim=0):
    r"""
    Gathers tensors from different GPUs on a specified device
    (-1 means the CPU).
    """
    def gather_map(outputs):
        def isnamedtupleinstance(x):
            # Heuristic namedtuple check: a direct tuple subclass whose
            # _fields attribute is a tuple of strings.
            t = type(x)
            b = t.__bases__
            if len(b) != 1 or b[0] != tuple:
                return False
            f = getattr(t, '_fields', None)
            if not isinstance(f, tuple):
                return False
            return all(type(n) == str for n in f)

        out = outputs[0]
        if isinstance(out, torch.Tensor):
            return Gather.apply(target_device, dim, *outputs)
        if out is None:
            return None
        if isnamedtupleinstance(out):
            # Convert namedtuples to plain dicts so the dict branch below
            # can gather them field by field.
            outputs = [dict(out._asdict()) for out in outputs]
            out = outputs[0]
        if isinstance(out, dict):
            if not all((len(out) == len(d) for d in outputs)):
                raise ValueError('All dicts must have the same number of keys')
            return type(out)(((k, gather_map([d[k] for d in outputs]))
                              for k in out))
        return type(out)(map(gather_map, zip(*outputs)))

    # Recursive function calls like this create reference cycles.
    # Setting the function to None clears the refcycle.
    try:
        res = gather_map(outputs)
    finally:
        gather_map = None
    return res

With this change you can get the result via

outputs, aux_outputs = self.model(imgs).values()

Don't forget to add .values(). I know this is not the best solution.
I tried out your solution @YongWookHa; however, now I am getting an error when calculating the loss function.
EDIT: I figured out the problem. It was an issue with the dict.
I think you forgot to add .values().
Yes, I did add .values(), but I was copying it incorrectly. Thanks, your method saved me hours of training time. Earlier, I could train Inception on only a single GPU; now, after modifying the PyTorch file with your code, I am able to train on more than one GPU.
* Merge pytorch 1.3 commits

This PR is a fix for issue #422.

1. ImageNet models usually use input size [batch, 3, 224, 224], but all Inception models require an input image size of [batch, 3, 299, 299].
2. Inception models have auxiliary branches which contribute to the loss only during training. The reported classification loss only considers the main classification loss.
3. Inception_V3 normalizes the input inside the network itself.

More details can be found in the comments of @soumendukrg's PR #425.

NOTE: Training using Inception_V3 is only possible on a single GPU as of now. This issue talks about this problem, and I have checked that it persists in torch 1.3.0: inception_v3 of vision 0.3.0 does not fit in DataParallel of torch 1.1.0 #1048 (pytorch/vision#1048).

Co-authored-by: Neta Zmora <neta.zmora@intel.com>
I tried out your solution @YongWookHa, but got an error as shown below:

train Loss: 0.9664 Acc: 0.5738
Traceback (most recent call last):
...

Could you please give me some suggestions?

Edit: fixed. As there is no need to use the aux classifiers for inference, I changed the code to compute the auxiliary loss only in the training phase (see the sketch below):

if phase == 'train':
    ...
else:
    ...

Thanks!
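A minimal sketch of what such a branch could look like (the criterion, the 0.4 auxiliary-loss weight, and the inputs/labels names are assumptions following the common Inception training recipe, not taken from the comment; it also assumes the patched gather from earlier in the thread, so the DataParallel forward returns a dict during training):

import torch
import torchvision

criterion = torch.nn.CrossEntropyLoss()
model = torch.nn.DataParallel(torchvision.models.inception_v3().cuda(), device_ids=[0, 1])
inputs = torch.rand((8, 3, 299, 299)).cuda()
labels = torch.randint(0, 1000, (8,)).cuda()

for phase in ('train', 'val'):
    model.train(phase == 'train')
    if phase == 'train':
        # Training: the patched gather returns a dict holding the main
        # logits and the auxiliary logits.
        outputs, aux_outputs = model(inputs).values()
        loss = criterion(outputs, labels) + 0.4 * criterion(aux_outputs, labels)
    else:
        # Inference: inception_v3 skips the aux classifier outside
        # training mode and returns a plain tensor.
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    print(phase, loss.item())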
I used apex.amp with inception_v3 and got the same problem.
To solve it, I replaced the namedtuple with a function returning a plain tuple, and it works:

torchvision.models.inception.InceptionOutputs = lambda a, b: (a, b)
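A minimal sketch of how this monkey-patch could be applied (the attribute name InceptionOutputs follows the comment above; some torchvision versions spell the namedtuple differently, so check your installed inception.py before relying on this):

import torch
import torchvision
import torchvision.models.inception

# Swap the namedtuple factory for a plain-tuple factory *before* the
# forward pass, so each replica returns an ordinary tuple that
# DataParallel's gather already handles element by element.
torchvision.models.inception.InceptionOutputs = lambda a, b: (a, b)

model = torch.nn.DataParallel(torchvision.models.inception_v3().cuda(), device_ids=[0, 1])
x = torch.rand((8, 3, 299, 299)).cuda()
outputs, aux_outputs = model(x)  # gathered as a plain tuple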
Environment:
Python 3.5
torch 1.1.0
torchvision 0.3.0
Reproducible example:
import torch
import torchvision
model = torchvision.models.inception_v3().cuda()
model = torch.nn.DataParallel(model, [0, 1])
x = torch.rand((8, 3, 299, 299)).cuda()
model.forward(x)
Error:
I guess the error occurs because the output of inception_v3 was changed from a tuple to a namedtuple.
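A minimal illustration of why that change breaks DataParallel's gather (this reproduces the failing reconstruction pattern from scatter_gather.py, not torchvision code itself):

from collections import namedtuple

InceptionOutputs = namedtuple('InceptionOutputs', ['logits', 'aux_logits'])

# One forward result per replica, as gather would see them.
per_gpu = [InceptionOutputs(1, 2), InceptionOutputs(3, 4)]

# gather rebuilds containers with type(out)(map(gather_map, zip(*outputs))).
# For a plain tuple this works:
print(tuple(map(list, zip(*[(1, 2), (3, 4)]))))  # ([1, 3], [2, 4])

# For a namedtuple it fails, because the constructor receives a single
# map object instead of one positional argument per field:
try:
    type(per_gpu[0])(map(list, zip(*per_gpu)))
except TypeError as e:
    print(e)  # __new__() missing 1 required positional argument: 'aux_logits'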