Failed in fine-tuning inception_v3 #302

Closed
JingyunLiang opened this issue Oct 18, 2017 · 35 comments

Comments

@JingyunLiang

I failed in using inception_v3 on my own dataset. (Ubuntu14.04, cuda8.0, python3.6.2)

It outputs a warning when the model is loaded:

/home/ljy/anaconda3/lib/python3.6/site-packages/torchvision-0.1.9-py3.6.egg/torchvision/models/inception.py:65: UserWarning: src is not broadcastable to dst, but they have the same number of elements.  Falling back to deprecated pointwise behavior.

It fails while training:

Traceback (most recent call last):
  File "/home/ljy/pytorch-examples-master/cub_pytorch/main.py", line 382, in <module>
    main()
  File "/home/ljy/pytorch-examples-master/cub_pytorch/main.py", line 213, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "/home/ljy/pytorch-examples-master/cub_pytorch/main.py", line 251, in train
    loss = criterion(output, target_var)
  File "/home/ljy/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ljy/anaconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 482, in forward
    self.ignore_index)
  File "/home/ljy/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 746, in cross_entropy
    return nll_loss(log_softmax(input), target, weight, size_average, ignore_index)
  File "/home/ljy/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 537, in log_softmax
    return _functions.thnn.LogSoftmax.apply(input)
  File "/home/ljy/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto.py", line 126, in forward
    ctx._backend = type2backend[type(input)]
  File "/home/ljy/anaconda3/lib/python3.6/site-packages/torch/_thnn/__init__.py", line 15, in __getitem__
    return self.backends[name].load()
KeyError: <class 'tuple'>
@alykhantejani
Contributor

Hi @MichaelLiang12,

What PyTorch version are you using (you can check with torch.__version__)? Also, can you provide a minimal working example that reproduces this?

Thanks

@alykhantejani
Contributor

Also, the user warning you get when loading the model is fixed in master (via #231).

@jamiechoi1995

jamiechoi1995 commented Oct 27, 2017

Same issue:

(tensorflow) wcai@tdtd-desktop ~/tensorflow/AI_competition/pytorch $ python main.py -a inception_v3 . --pretrained
=> using pre-trained model 'inception_v3'
/home/wcai/tensorflow/lib/python3.5/site-packages/torchvision/models/inception.py:65: UserWarning: src is not broadcastable to dst, but they have the same number of elements. Falling back to deprecated pointwise behavior.
  m.weight.data.copy_(values)
Traceback (most recent call last):
  File "main.py", line 353, in <module>
    main()
  File "main.py", line 176, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main.py", line 214, in train
    loss = criterion(output, target_var)
  File "/home/wcai/tensorflow/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wcai/tensorflow/lib/python3.5/site-packages/torch/nn/modules/loss.py", line 482, in forward
    self.ignore_index)
  File "/home/wcai/tensorflow/lib/python3.5/site-packages/torch/nn/functional.py", line 746, in cross_entropy
    return nll_loss(log_softmax(input), target, weight, size_average, ignore_index)
  File "/home/wcai/tensorflow/lib/python3.5/site-packages/torch/nn/functional.py", line 537, in log_softmax
    return _functions.thnn.LogSoftmax.apply(input)
  File "/home/wcai/tensorflow/lib/python3.5/site-packages/torch/nn/_functions/thnn/auto.py", line 126, in forward
    ctx._backend = type2backend[type(input)]
  File "/home/wcai/tensorflow/lib/python3.5/site-packages/torch/_thnn/__init__.py", line 15, in __getitem__
    return self.backends[name].load()
KeyError: <class 'tuple'>

Python: Python 3.5.2

print(torch.__version__)
0.2.0_3

@alykhantejani
Contributor

Hi @jamiechoi1995,

Can you provide a minimal working example of this failing (i.e. an input that causes this when you pass it to the model)?

From the stack trace it seems like the input to the loss is a tuple, instead of a Variable.
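
One quick way to confirm this diagnosis is to look at what the model returned before handing it to the criterion. The helper below is only a sketch: `describe_output` is a hypothetical name, not a PyTorch function, and plain strings stand in for real tensors.

```python
def describe_output(output):
    """Return a short description of a model's return value.

    Hypothetical debugging helper, not part of PyTorch or torchvision.
    """
    if isinstance(output, tuple):
        # isinstance(..., tuple) is also true for namedtuples.
        parts = ", ".join(type(o).__name__ for o in output)
        return "tuple of %d: (%s)" % (len(output), parts)
    return type(output).__name__


# Stand-in values: strings play the role of tensors here.
single = describe_output("main_logits")
paired = describe_output(("main_logits", "aux_logits"))
print(single)
print(paired)
```

In a real training loop the same call could be placed right after `output = model(input_var)` to see whether the criterion is about to receive a tuple.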

@jamiechoi1995

Hi @alykhantejani

You can reproduce this problem using the code in https://github.com/pytorch/examples/tree/master/imagenet

I changed the rescale and crop sizes to 299 for inception v3,
and my training and validation data are jpg files with corresponding json files.

The same code with a size of 224 works fine with a resnet model,
but when I switch to inception v3, I get this problem.

Thanks.

@TiRune

TiRune commented Nov 1, 2017

Isn't this problem caused by the aux branch in the network? If you remove it, it should work :)

@alykhantejani
Contributor

alykhantejani commented Nov 1, 2017

@jamiechoi1995 @MichaelLiang12, @TiRune is correct, inception_v3 has an aux branch, and if this is not disabled the forward function will return a tuple (see here), which when passed to the criterion will throw this error.

So you have two choices:

  1. disable aux_logits when the model is created here by also passing aux_logits=False to the inception_v3 function.

  2. edit your train function to accept and unpack the returned tuple here to be something like:

output, aux = model(input_var)
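
A minimal sketch of the second option, written so that it also works when the aux branch is disabled. The function name `inception_loss`, the toy criterion, and the default `aux_weight=0.4` are assumptions made for illustration; the sketch also assumes a 2-tuple is ordered (main output, aux output), as in the unpacking above.

```python
def inception_loss(output, target, criterion, aux_weight=0.4):
    """Compute the loss whether or not the model returned an aux output.

    Hypothetical helper. Assumes a tuple result is (main_output, aux_output);
    aux_weight is an assumed hyperparameter, not a library default.
    """
    if isinstance(output, tuple):
        main_out, aux_out = output
        return criterion(main_out, target) + aux_weight * criterion(aux_out, target)
    # Plain output: aux branch disabled, or model in eval mode.
    return criterion(output, target)


def toy_criterion(prediction, target):
    """Stand-in for a real loss such as cross-entropy: absolute error on floats."""
    return abs(prediction - target)


with_aux = inception_loss((1.0, 5.0), 0.0, toy_criterion)  # main + 0.4 * aux
without_aux = inception_loss(1.0, 0.0, toy_criterion)      # main only
```

With a real model, `criterion` would be the loss module already used in the training script, and `output` the value returned by `model(input_var)`.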

@rajasekharponakala

rajasekharponakala commented Mar 23, 2019

@alykhantejani: Hi, why do we have to disable aux_logits? What are these aux_logits? Do they affect training/validation?

I'm trying to reproduce the accuracy of a model trained with bvlc_googlenet (without pretrained weights). When I turn the aux branch off in PyTorch (googlenet), it works, but it reports a val_acc of 50%, which is very low compared to Caffe. Are there any other ways to reproduce the same accuracy using PyTorch?
Thanks.

@jamiechoi1995 @MichaelLiang12, @TiRune is correct, inception_v3 has an aux branch, and if this is not disabled the forward function will return a tuple (see here), which when passed to the criterion will throw this error.

So you have two choices:

1. disable `aux_logits` when the model is created [here](https://github.com/pytorch/examples/blob/master/imagenet/main.py#L75) by also passing `aux_logits=False` to the `inception_v3` function.

2. edit your `train` function to accept and unpack the returned tuple [here](https://github.com/pytorch/examples/blob/master/imagenet/main.py#L194) to be something like:
output, aux = model(input_var)

@fmassa
Member

fmassa commented Mar 24, 2019

@rajasekharponakala the aux_logits is a separate classifier that is added to help during training, but it is not used during inference.

I'm trying to reproduce the accuracy of a model trained with bvlc_googlenet (without pretrained weights). When I turn the aux branch off in PyTorch (googlenet), it works, but it reports a val_acc of 50%, which is very low compared to Caffe. Are there any other ways to reproduce the same accuracy using PyTorch?

Both googlenet and inception_v3 use pre-trained weights from TensorFlow, and as far as I know we didn't manage to reproduce accuracies from the paper when training from scratch.
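
To make the "training only" behaviour of the aux classifier concrete, here is a pure-Python stand-in that imitates the mode-dependent return type. `ToyAuxModel` is a hypothetical class, not torchvision code; it only mirrors the `self.training and self.aux_logits` condition that the real models use.

```python
class ToyAuxModel:
    """Pure-Python stand-in for a network with an auxiliary classifier.

    NOT torchvision code: it only imitates how the return type depends on
    the training flag and on aux_logits.
    """

    def __init__(self, aux_logits=True):
        self.aux_logits = aux_logits
        self.training = True  # like nn.Module, start in training mode

    def train(self):
        self.training = True
        return self

    def eval(self):
        self.training = False
        return self

    def __call__(self, x):
        main = "main(%s)" % x
        if self.training and self.aux_logits:
            # Training mode with the aux branch on: return both outputs.
            return main, "aux(%s)" % x
        # Eval mode, or aux branch off: return the main output only.
        return main


model = ToyAuxModel()
train_out = model("x")         # a 2-tuple in training mode
eval_out = model.eval()("x")   # a single value in eval mode
no_aux_out = ToyAuxModel(aux_logits=False)("x")  # single value even when training
```

This is why a training loop has to handle a tuple while a validation loop that calls `model.eval()` first does not.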

@rajasekharponakala

rajasekharponakala commented Mar 25, 2019

Hi @fmassa, thanks. Following a suggestion on the PyTorch discourse forum, I added the lines below to train() in the imagenet example.

output, aux = model(input_var)
loss1 = criterion(output, target)
loss2 = criterion(aux, target)
loss = loss1 + 0.4*loss2

but it ended with this error:

Traceback (most recent call last):
  File "imagenet.py", line 407, in <module>
    main()
  File "imagenet.py", line 114, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "imagenet.py", line 240, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "imagenet.py", line 281, in train
    output, aux = model(input)
ValueError: too many values to unpack (expected 2)

any idea?

@fmassa
Member

fmassa commented Mar 25, 2019

You need to set your model to train() mode; it's probably in eval mode.

@rajasekharponakala

rajasekharponakala commented Mar 25, 2019

Thanks. Yes, I'm following the example/imagenet/main.py script:

def main():
    ...
def main_worker():
    ...
def train():
    ...
    model.train()
    ...
    outputs, aux_outputs = model(inputs)
    loss1 = criterion(outputs, target)
    loss2 = criterion(aux_outputs, target)
    loss = loss1 + 0.4*loss2
def validate():
    ...
    model.eval()
    ...
    outputs = model(inputs)
    loss = criterion(outputs, target)
    ...
def adjust_learning_rate():
    ...
def accuracy():
    ...

I found another method on the discourse forum:

        output = model(input) 
        loss = None
        # for nets that have multiple outputs such as inception
        if isinstance(output, tuple):
            loss = sum((criterion(o,target) for o in output))
        else:
            loss = criterion(output, target)

This time it throws a different error:

Traceback (most recent call last):
  File "imagenet.py", line 417, in <module>
    main()
  File "imagenet.py", line 114, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "imagenet.py", line 240, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "imagenet.py", line 298, in train
    acc1, acc5 = accuracy(output, target, topk=(1, 5))
  File "imagenet.py", line 405, in accuracy
    _, pred = output.topk(maxk, 1, True, True)
AttributeError: 'tuple' object has no attribute 'topk'
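
The traceback shows the whole tuple being passed on to accuracy(), which expects a single tensor. One possible way around this is to split the main prediction from the auxiliary ones right after the forward pass. The sketch below uses a hypothetical helper `split_outputs`; `main_index` is an assumption the caller has to supply, because the position of the main output depends on the model and library version.

```python
def split_outputs(output, main_index=0):
    """Separate the main prediction from any auxiliary predictions.

    Hypothetical helper. main_index says where the main output sits in a
    tuple result; it must match the model actually in use.
    """
    if isinstance(output, tuple):
        outputs = list(output)
        main = outputs.pop(main_index)
        return main, outputs
    # Plain output: nothing auxiliary to split off.
    return output, []


# Strings stand in for tensors.
main, aux = split_outputs(("aux1", "aux2", "out"), main_index=-1)
plain_main, plain_aux = split_outputs("out")
```

The loss can then be computed from `main` and `aux`, while only `main` is passed to the accuracy function.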

@fmassa
Member

fmassa commented Mar 25, 2019

The issue is that both googlenet and inception can return auxiliary classifiers in training mode.
Your code is not taking that into account, or you didn't set aux classifiers. Double-check that and you'll be able to find the issue.

@rajasekharponakala

Yes. In main_worker() the model is created with:

    if args.pretrained:
        print("=> using pre-trained model '{}'".format(args.arch))
        model = models.__dict__[args.arch](pretrained=True)
    else:
        print("=> creating model '{}'".format(args.arch))
        model = models.__dict__[args.arch](aux_logits=True)

and also vision/models/googlenet.py has

class GoogLeNet(nn.Module):

    def __init__(self, num_classes=1000, aux_logits=True, transform_input=False, init_weights=True):
        super(GoogLeNet, self).__init__()
        self.aux_logits = aux_logits
        self.transform_input = transform_input
        ...

    def forward(self, x):  # uses self.aux_logits
        ...

@TheCodez
Contributor

@rajasekharponakala one thing to note here is that GoogLeNet has two aux branches, whereas inception v3 only has one.

So for GoogLeNet you have to use:
aux1, aux2, output = model(inputs)

@rajasekharponakala

@TheCodez: Thanks, it's working now!
The format is:

aux1, aux2, output = model(inputs)
loss1 = criterion(output, target)
loss2 = criterion(aux1, target)
loss3 = criterion(aux2, target)
loss = loss1 + 0.4*(loss2+loss3)

@TheCodez
Contributor

@rajasekharponakala the correct weighting scheme for GoogLeNet is using 0.3:

aux1, aux2, output = model(inputs)     
loss1 = criterion(output, target)
loss2 = criterion(aux1, target)
loss3 = criterion(aux2, target)
loss = loss1 + 0.3 * (loss2 + loss3)
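
As a worked illustration of how the auxiliary weight changes the total, the sketch below applies both weights mentioned in this thread to the same made-up loss values. `weighted_total` is a hypothetical helper and the numbers are invented for the arithmetic only.

```python
def weighted_total(main_loss, aux_losses, aux_weight):
    """Main loss plus one shared weight applied to every auxiliary loss."""
    return main_loss + aux_weight * sum(aux_losses)


# Made-up loss values, used only to show the arithmetic.
main_loss = 1.0
aux_losses = [2.0, 3.0]

total_03 = weighted_total(main_loss, aux_losses, aux_weight=0.3)  # 1.0 + 0.3 * 5.0
total_04 = weighted_total(main_loss, aux_losses, aux_weight=0.4)  # 1.0 + 0.4 * 5.0
```

With two auxiliary heads, the choice of weight shifts how much of the gradient signal comes from the aux classifiers relative to the main one.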

@rajasekharponakala

Yeah, thanks.

@tejasri19

tejasri19 commented Jul 10, 2019

@TheCodez @fmassa @alykhantejani @rajasekharponakala Do we have to enable the auxiliary classifiers in test mode? I get very poor test accuracy when I load the trained model (the auxiliary classifiers are enabled here). I'm using the inception v3 model for my task!

@fmassa
Member

fmassa commented Jul 10, 2019

@tejasri19 for inference, don't forget to set your model to eval() mode.

You don't need to use the aux classifiers for inference, only for training

@Holmeyoung

Holmeyoung commented Jul 16, 2019

Hi, I have a question. In https://github.com/pytorch/vision/blob/master/torchvision/models/googlenet.py
the code is:

        if self.training and self.aux_logits:
            return _GoogLeNetOutputs(x, aux2, aux1)
        return x
_GoogLeNetOutputs = namedtuple('GoogLeNetOutputs', ['logits', 'aux_logits2', 'aux_logits1'])

So should it be
output, aux2, aux1 = model(inputs)
rather than
aux1, aux2, output = model(inputs)

Is that right? Thanks.

@fmassa
Member

fmassa commented Jul 16, 2019

It should be output, aux2, aux1.
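
The ordering follows directly from the namedtuple definition quoted in the question. The sketch below rebuilds that namedtuple with the same field names and uses strings in place of tensors; it shows that positional unpacking has to follow the field order, while access by name does not depend on it.

```python
from collections import namedtuple

# Field order copied from the googlenet.py excerpt quoted in the question.
GoogLeNetOutputs = namedtuple(
    "GoogLeNetOutputs", ["logits", "aux_logits2", "aux_logits1"]
)

# Strings stand in for tensors.
result = GoogLeNetOutputs("main", "aux2", "aux1")

# Positional unpacking must follow the field order.
output, aux2, aux1 = result

# Access by field name gives the same values regardless of position.
by_name = (result.logits, result.aux_logits2, result.aux_logits1)
```

Using the field names is the safer choice when the code may run against library versions that order the outputs differently.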

@rasha-salim

rasha-salim commented May 16, 2020

Thanks for this thread, it really helped me. But now I'm getting this error when unpacking the model output:
output, aux1= model(data)
ValueError: too many values to unpack (expected 2)

and even when I added an extra output to unpack:
output, aux2, aux1 = model(data)
I still have the following error:
not enough values to unpack (expected 3, got 2)

@rasha-salim

rasha-salim commented May 16, 2020

I solved it by unpacking the outputs separately:
output = model(data).logits
aux1 = model(data).aux_logits
It seems that there are extra outputs such as counts that I don't believe we need for training

@TheCodez
Contributor

@gamesMum I would advise not to do that, as you are essentially running your model twice.
Instead just use this once:
output = model(data)

and then access using:

output.logits
output.aux_logits
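
The cost of the two-call version can be seen with a small stand-in model that counts its forward passes. `ToyOutputs` and `CountingModel` are hypothetical names for illustration; only the field names `logits` and `aux_logits` come from the snippets above.

```python
from collections import namedtuple

# Stand-in for the model's namedtuple result; field names as used above.
ToyOutputs = namedtuple("ToyOutputs", ["logits", "aux_logits"])


class CountingModel:
    """Hypothetical stand-in model that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, data):
        self.calls += 1
        return ToyOutputs("logits(%s)" % data, "aux(%s)" % data)


# Version 1: one forward pass per field, so the model runs twice.
model = CountingModel()
logits = model("batch").logits
aux = model("batch").aux_logits
calls_twice = model.calls

# Version 2: a single forward pass, then read both fields from the result.
model = CountingModel()
out = model("batch")
logits, aux = out.logits, out.aux_logits
calls_once = model.calls
```

Besides doubling the compute, two separate passes over a real network in training mode may not even produce matching outputs if layers such as dropout are active.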

@rasha-salim

@TheCodez oh dear, how did I miss that!
Thanks for pointing this out

@wlj567

wlj567 commented Oct 20, 2021

Traceback (most recent call last):
  File "/home/pxg/DAN/DANet/experiments/segmentation/train.py", line 282, in <module>
    trainer.training(epoch)
  File "/home/pxg/DAN/DANet/experiments/segmentation/train.py", line 214, in training
    loss = self.criterion(outputs, target)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pxg/DAN/DANet/encoding/parallel.py", line 130, in forward
    return self.module(inputs, *targets[0], **kwargs[0])
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pxg/DAN/DANet/encoding/nn/loss.py", line 68, in forward
    return super(SegmentationLosses, self).forward(*outputs)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1295, in log_softmax
    ret = input.log_softmax(dim)
AttributeError: 'tuple' object has no attribute 'log_softmax'

Hello, can you help me with my question? Thank you very much.

@wlj567

wlj567 commented Oct 20, 2021

Traceback (most recent call last):
  File "/home/pxg/DAN/DANet/experiments/segmentation/train.py", line 283, in <module>
    trainer.training(epoch)
  File "/home/pxg/DAN/DANet/experiments/segmentation/train.py", line 215, in training
    loss = self.criterion(outputs, target)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pxg/DAN/DANet/encoding/parallel.py", line 130, in forward
    return self.module(inputs, *targets[0], **kwargs[0])
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pxg/DAN/DANet/encoding/nn/loss.py", line 68, in forward
    return super(SegmentationLosses, self).forward(*inputs)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1295, in log_softmax
    ret = input.log_softmax(dim)
AttributeError: 'tuple' object has no attribute 'log_softmax'

@wlj567

wlj567 commented Oct 27, 2021

Traceback (most recent call last):
  File "/home/pxg/DAN/DANet/experiments/segmentation/train.py", line 292, in <module>
    trainer.training(epoch)
  File "/home/pxg/DAN/DANet/experiments/segmentation/train.py", line 214, in training
    aux1, aux2, outputs = self.model(image)
ValueError: not enough values to unpack (expected 3, got 1)

Hello, I modified it according to the above method and still reported an error. Can you help me see it? Thank you very much.

@TheCodez
Contributor

@gamesMum I would advise not to do that, as you are essentially running your model twice. Instead just use this once: output = model(data)

and then access using:

output.logits
output.aux_logits

@wlj567 see if that works.

@wlj567

wlj567 commented Oct 28, 2021

@TheCodez Hello, I don't quite understand how to modify it. Can you have a look?

def training(self, epoch):
    train_loss = 0.0
    self.model.train()
    tbar = tqdm(self.trainloader)
    for i, (image, target) in enumerate(tbar):
        self.scheduler(self.optimizer, i, epoch, self.best_pred)
        self.optimizer.zero_grad()
        outputs = self.model(image)
        loss = self.criterion(outputs, target)
        loss.backward()
        self.optimizer.step()
        train_loss += loss.item()
        tbar.set_description('Train loss: %.3f' % (train_loss / (i + 1)))

@TheCodez
Contributor

@wlj567 see if that works.

loss = self.criterion(outputs.logits, target)

@wlj567

wlj567 commented Oct 30, 2021

File "/home/pxg/DAN/DANet/experiments/segmentation/train.py", line 223, in training
loss = self.criterion(outputs.logits, target)
AttributeError: 'tuple' object has no attribute 'logits'

I modified the code according to this method, but it still reports an error.

@wlj567

wlj567 commented Oct 30, 2021

def forward(self, *inputs):
    preds, target = tuple(inputs)
    inputs = tuple(list(preds) + [target])
    if not self.se_loss and not self.aux:
        return super(SegmentationLosses, self).forward(*inputs)
    elif not self.se_loss:
        pred1, pred2, target = tuple(inputs)
        loss1 = super(SegmentationLosses, self).forward(pred1, target)
        loss2 = super(SegmentationLosses, self).forward(pred2, target)
        return loss1 + self.aux_weight * loss2

I modified this part in loss.py and added

preds, target = tuple(inputs)
inputs = tuple(list(preds) + [target])

Train loss: 4.745: 0%| | 6/5052 [00:12<2:36:10, 1.86s/it]
With this change training starts, but it runs very slowly, and an error is still reported in the end.
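
To see what the two added lines do, the sketch below applies the same flattening step to stand-in values. `flatten_criterion_inputs` is a hypothetical name, and the guard for a non-tuple `preds` is an addition that is not in the posted code: without it, `list(preds)` on a single tensor would iterate over its first dimension, which is probably not what is intended.

```python
def flatten_criterion_inputs(*inputs):
    """Turn (preds, target) into (pred_1, ..., pred_n, target).

    Hypothetical helper mirroring the two lines added to forward().
    """
    preds, target = inputs
    if not isinstance(preds, (tuple, list)):
        # Added guard (not in the posted code): wrap a single prediction
        # so that list(preds) does not split it apart.
        preds = (preds,)
    return tuple(list(preds) + [target])


# Strings stand in for tensors.
one_pred = flatten_criterion_inputs(("pred1",), "target")
two_preds = flatten_criterion_inputs(("pred1", "pred2"), "target")
bare_pred = flatten_criterion_inputs("pred1", "target")
```

The number of flattened values matters for the branches that follow: the branch that forwards `*inputs` to the base loss needs exactly one prediction plus the target, while the aux branch unpacks two predictions plus the target.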

@wlj567

wlj567 commented Oct 30, 2021

Traceback (most recent call last):
  File "/home/pxg/DAN/DANet/experiments/segmentation/train.py", line 293, in <module>
    trainer.training(epoch)
  File "/home/pxg/DAN/DANet/experiments/segmentation/train.py", line 224, in training
    loss = self.criterion(outputs, target)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pxg/DAN/DANet/encoding/parallel.py", line 130, in forward
    return self.module(inputs, *targets[0], **kwargs[0])
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pxg/DAN/DANet/encoding/nn/loss.py", line 70, in forward
    return super(SegmentationLosses, self).forward(*inputs)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 942, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 2056, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/pxg/DAN/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1350, in log_softmax
    ret = input.log_softmax(dim)
AttributeError: 'tuple' object has no attribute 'log_softmax'

This is the error from the original code.
