[Image Captioning] out of memory immediately after training starts? #35

jtoy · 2017-05-23T02:53:19Z

What size cards are these networks tested and trained on? I just tried running "09 - Image Captioning" and I immediately get errors. I am testing this on a Titan X with 12 GB of memory:

Namespace(batch_size=128, caption_path='./data/annotations/captions_train2014.json', crop_size=224, embed_size=256, hidden_size=512, image_dir='./data/resized2014', learning_rate=0.001, log_step=10, model_path='./models/', num_epochs=5, num_layers=1, num_workers=2, save_step=1000, vocab_path='./data/vocab.pkl')
loading annotations into memory...
Done (t=0.89s)
creating index...
index created!
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 119, in <module>
    main(args)
  File "train.py", line 66, in main
    features = encoder(images)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/pytorch/pytorch-tutorial/tutorials/09 - Image Captioning/model.py", line 26, in forward
    features = self.resnet(images)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/torchvision/models/resnet.py", line 146, in forward
    x = self.layer3(x)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/container.py", line 64, in forward
    input = module(input)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/torchvision/models/resnet.py", line 85, in forward
    out = self.bn3(out)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/batchnorm.py", line 43, in forward
    self.training, self.momentum, self.eps)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/functional.py", line 463, in batch_norm
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /b/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu:66
Command exited with non-zero status 1

yunjey · 2017-05-23T03:42:36Z

As far as I remember, batch_size=128 needs less than 5 GB of GPU memory (possibly much less).

Which Python and PyTorch versions are you using? I guess you are on Python 2.7. Am I right?
This memory issue occurs when requires_grad=False does not take effect.

There are two options for solving this problem:

1. Upgrade PyTorch to a newer version.
2. Use Python 3.5 instead of 2.7.
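
A rough calculation shows why this matters. If gradients are tracked through the CNN, its intermediate activations have to be kept for the backward pass. The sketch below estimates the size of the tensor at which the traceback fails (the bn3 output inside layer3), using batch_size=128 and crop_size=224 from the log. It assumes a torchvision bottleneck ResNet with float32 activations. The block count of 36 is ResNet-152's layer3 and is an assumption, since the thread does not state which depth is used.

```python
def activation_bytes(batch, channels, height, width, bytes_per_elem=4):
    """Size in bytes of one dense activation tensor (float32 by default)."""
    return batch * channels * height * width * bytes_per_elem


MIB = 1024 ** 2

# layer3 of a bottleneck ResNet on 224x224 inputs: 1024 channels at 14x14.
one_tensor = activation_bytes(batch=128, channels=1024, height=14, width=14)

# Assumption: 36 bottleneck blocks in layer3 (ResNet-152).
layer3_blocks = 36
kept_for_backward = layer3_blocks * one_tensor

print(f"{one_tensor / MIB:.1f} MiB per tensor")               # 98.0 MiB per tensor
print(f"{kept_for_backward / MIB / 1024:.2f} GiB in layer3")  # 3.45 GiB in layer3
```

This counts only one tensor per block in a single stage, so it is a lower bound. Each block produces several tensors of comparable size, which is consistent with a 12 GB card running out of memory when the encoder is not actually frozen.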

karandwivedi42 · 2017-05-23T11:08:40Z

This might be related to #26

This is a known issue which will be resolved in the next release.

Until then, as a workaround, change line 56 of train.py to

                images = Variable(images, volatile=True)

and line 66 to

                features = encoder(images)
                features = Variable(features.data)
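
For readers on PyTorch 0.4 or later, where volatile was removed, roughly the same effect can be had with torch.no_grad() and detach(). The following is a minimal sketch, not the tutorial's code: the backbone is a tiny hypothetical stand-in for the pretrained CNN, and the tensor sizes are chosen only to keep the example fast.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pretrained CNN encoder (not the tutorial's ResNet).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
backbone.eval()

images = torch.randn(4, 3, 32, 32)

# Default behaviour: a graph is recorded and activations are kept for backward.
tracked = backbone(images)

# Similar in spirit to Variable(images, volatile=True): no graph is recorded,
# so intermediate activations can be freed right away.
with torch.no_grad():
    features = backbone(images)

# Similar in spirit to Variable(features.data): a tensor with no history.
# Redundant after no_grad(), shown only to mirror the second line of the workaround.
features = features.detach()

print(tracked.requires_grad, features.requires_grad)
```

Like the workaround above, this stops gradients for everything inside the encoder, including any layer there that was meant to be trained.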

jtoy · 2017-05-23T15:59:45Z

@yunjey I am on Python 2.7 and PyTorch 0.12. I will try your changes.
@karandwivedi42 I will also test your fix.
I will let you both know.

yunjey · 2017-05-23T16:35:31Z

@jtoy I recommend installing PyTorch from source. This will give you the latest version of PyTorch.

jtoy · 2017-05-24T15:32:40Z

I tried PyTorch built from source on Python 2.7 and PyTorch for Python 3.5; both died with the same error.

jtoy · 2017-05-24T15:35:37Z

@karandwivedi42 your changes work! @yunjey will the code need to be updated? It seems that building from source doesn't fix the issue. I can do more testing if needed.

yunjey · 2017-05-24T16:01:17Z

@jtoy Ok. Thanks.

yunjey · 2017-05-24T16:17:16Z

@karandwivedi42 That does not work.

images = Variable(images, volatile=True)

With the code above, requires_grad is effectively False for resnet.fc as well, so that layer no longer gets trained. See the PyTorch documentation for details on volatile.

karandwivedi42 · 2017-05-24T16:57:59Z

@yunjey You are right. I don't know how important it is though because this linear layer is followed by another linear layer in the decoder with no non-linearity in between.

jtoy · 2017-05-24T22:34:08Z

So what is the right code to use? I was able to train a model with @karandwivedi42's change, and training completed in 155 minutes. Does that time seem right? I trained the original Show and Tell model and I remember it taking at least a day.

karandwivedi42 · 2017-05-24T22:39:29Z

Put the fc and bn in a separate module between the encoder and decoder so that they can be part of the gradient computation. Does that make sense?
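
One way this suggestion could look in current PyTorch is sketched below. It is an illustration under assumptions, not the fix that was eventually committed: FrozenBackbone and FeatureHead are hypothetical names, the backbone is a tiny stand-in for the pretrained CNN, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn


class FrozenBackbone(nn.Module):
    """Hypothetical stand-in for the pretrained CNN; runs without recording a graph."""

    def __init__(self, feat_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )

    def forward(self, images):
        with torch.no_grad():  # no activations are kept for backward
            return self.net(images)


class FeatureHead(nn.Module):
    """Trainable fc + bn placed between the encoder and the decoder."""

    def __init__(self, feat_dim=16, embed_size=8):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_size)
        self.bn = nn.BatchNorm1d(embed_size)

    def forward(self, features):
        return self.bn(self.fc(features))


backbone = FrozenBackbone().eval()
head = FeatureHead()

images = torch.randn(4, 3, 32, 32)
embeddings = head(backbone(images))

# Placeholder loss, only to show where gradients flow.
loss = embeddings.pow(2).mean()
loss.backward()

print(head.fc.weight.grad is not None)                      # head receives gradients
print(all(p.grad is None for p in backbone.parameters()))   # backbone does not
```

With this split, only the head's parameters would be handed to the optimizer together with the decoder's, while the backbone stays frozen and cheap in memory.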

jtoy · 2017-05-24T22:58:04Z

@karandwivedi42 I don't fully understand; I'm just starting to play with PyTorch. Is there any way to see it as a diff?

karandwivedi42 · 2017-05-25T09:14:58Z

@jtoy This fork is a very hacky way to do exactly what the original code does.
https://github.com/karandwivedi42/pytorch-tutorial/tree/master/tutorials/09%20-%20Image%20Captioning

@yunjey Can you please check this one? (Thanks for the amazing tutorials btw :) )

yunjey · 2017-05-26T05:05:59Z

@jtoy @karandwivedi42 I will fix the code by this weekend.

jtoy · 2017-05-26T05:17:42Z

I have a model training on it now. I'll also test out your version of the code.

yunjey · 2017-05-28T11:16:01Z

@jtoy @karandwivedi42 I modified the code. Try it. Thanks :)

yunjey changed the title ~~out of memory immediately after training starts?~~ [Image Captioning] out of memory immediately after training starts? May 23, 2017

yunjey closed this as completed May 28, 2017
