[Image Captioning] out of memory immediately after training starts? #35

Closed
jtoy opened this issue May 23, 2017 · 16 comments

Comments

@jtoy
Contributor

jtoy commented May 23, 2017

What size cards are these networks tested and trained on? I just tried running "09 - Image Captioning" and I immediately get errors. I am testing this on a Titan X with 12 GB of memory:

Namespace(batch_size=128, caption_path='./data/annotations/captions_train2014.json', crop_size=224, embed_size=256, hidden_size=512, image_dir='./data/resized2014', learning_rate=0.001, log_step=10, model_path='./models/', num_epochs=5, num_layers=1, num_workers=2, save_step=1000, vocab_path='./data/vocab.pkl')
loading annotations into memory...
Done (t=0.89s)
creating index...
index created!
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 119, in <module>
    main(args)
  File "train.py", line 66, in main
    features = encoder(images)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/pytorch/pytorch-tutorial/tutorials/09 - Image Captioning/model.py", line 26, in forward
    features = self.resnet(images)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/torchvision/models/resnet.py", line 146, in forward
    x = self.layer3(x)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/container.py", line 64, in forward
    input = module(input)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/torchvision/models/resnet.py", line 85, in forward
    out = self.bn3(out)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/modules/batchnorm.py", line 43, in forward
    self.training, self.momentum, self.eps)
  File "/root/.venv/local/lib/python2.7/site-packages/torch/nn/functional.py", line 463, in batch_norm
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /b/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu:66
Command exited with non-zero status 1
@yunjey
Owner

yunjey commented May 23, 2017

As I recall, batch_size=128 needs less than 5 GB of GPU memory (possibly much less).

Which Python and PyTorch versions are you using? I guess you are on Python 2.7, am I right?
This memory issue occurs when requires_grad=False does not take effect.

There are two options for solving this problem.

  1. Upgrade PyTorch version
  2. Use Python 3.5 instead of 2.7
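
To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch of activation sizes. It assumes float32 tensors and a Bottleneck-style ResNet at 224x224 input, where each layer3 block outputs 1024 channels at 14x14; the traceback fails in bn3, which only Bottleneck blocks have. The helper name tensor_mib is made up for illustration.

```python
# Rough activation sizes for a Bottleneck ResNet at 224x224 input (float32).
BYTES_PER_FLOAT32 = 4


def tensor_mib(batch, channels, height, width):
    """Size in MiB of one float32 tensor of shape (batch, channels, height, width)."""
    return batch * channels * height * width * BYTES_PER_FLOAT32 / 2**20


batch_size = 128  # from the Namespace printed in the report

input_mib = tensor_mib(batch_size, 3, 224, 224)    # the image batch itself
layer3_mib = tensor_mib(batch_size, 1024, 14, 14)  # output of one layer3 block

print(f"input batch:          {input_mib:.1f} MiB")   # 73.5 MiB
print(f"one layer3 block out: {layer3_mib:.1f} MiB")  # 98.0 MiB
```

A single tensor is small next to 12 GB, but if gradient tracking stays on for the encoder, autograd has to keep the intermediate outputs of every block for the backward pass, and those add up quickly across a deep network.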

@yunjey changed the title from "out of memory immediately after training starts?" to "[Image Captioning] out of memory immediately after training starts?" on May 23, 2017
@karandwivedi42

karandwivedi42 commented May 23, 2017

This might be related to #26

This is a known issue which will be resolved in the next release.

Until then, as a workaround, change line 56 of train.py to

               images = Variable(images, volatile=True)

and line 66 to

                features = encoder(images)
                features = Variable(features.data)
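
For readers on PyTorch 0.4 or later, where Variable was merged into Tensor and volatile no longer has any effect, the same workaround could be sketched roughly as below. The tiny encoder and decoder are hypothetical stand-ins, not the tutorial's models; only torch.no_grad() and detach() are the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the tutorial's encoder and decoder.
encoder = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
decoder = nn.Linear(8, 4)

images = torch.randn(2, 3, 32, 32)

# Plays the role of Variable(images, volatile=True): no graph is recorded here,
# so the encoder's intermediate activations are not kept for backward.
with torch.no_grad():
    features = encoder(images)

# Plays the role of Variable(features.data); redundant after no_grad, kept for clarity.
features = features.detach()

loss = decoder(features).sum()
loss.backward()

encoder_has_grad = any(p.grad is not None for p in encoder.parameters())
decoder_has_grad = all(p.grad is not None for p in decoder.parameters())
print(features.requires_grad, encoder_has_grad, decoder_has_grad)  # False False True
```

Note that nothing inside the encoder receives gradients with this approach, so any encoder layer that is meant to be trained is frozen as well.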

@jtoy
Contributor Author

jtoy commented May 23, 2017

@yunjey I am on Python 2.7 and PyTorch 0.1.12. I will try your suggestions.
@karandwivedi42 I will also test your fix.
I will let you both know.

@yunjey
Owner

yunjey commented May 23, 2017

@jtoy I recommend installing PyTorch from source. This will give you the latest version of PyTorch.

@jtoy
Contributor Author

jtoy commented May 24, 2017

I tried PyTorch built from source on Python 2.7 and PyTorch on Python 3.5; both died with the same error.

@jtoy
Contributor Author

jtoy commented May 24, 2017

@karandwivedi42 your changes work! @yunjey will the code need to be updated? Building from source doesn't seem to fix the issue. I can do more testing if needed.

@yunjey
Owner

yunjey commented May 24, 2017

@jtoy Ok. Thanks.

@yunjey
Owner

yunjey commented May 24, 2017

@karandwivedi42 That does not work.

images = Variable(images, volatile=True)

The code above also sets requires_grad=False for resnet.fc, so that layer stops being trained. See here for the details of volatile.
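
One way to keep the final linear layer trainable while still freezing everything before it is to set requires_grad=False only on the frozen parameters. The sketch below uses current PyTorch syntax and a hypothetical TinyEncoder in place of the tutorial's ResNet-based encoder; it is not necessarily the fix that was eventually committed.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in: a frozen backbone followed by a trainable fc layer."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, 4)
        # Freeze only the backbone; fc keeps requires_grad=True.
        for p in self.backbone.parameters():
            p.requires_grad = False

    def forward(self, images):
        return self.fc(self.backbone(images))


encoder = TinyEncoder()
out = encoder(torch.randn(2, 3, 32, 32))
out.sum().backward()

backbone_frozen = all(p.grad is None for p in encoder.backbone.parameters())
fc_trained = encoder.fc.weight.grad is not None
print(backbone_frozen, fc_trained)  # True True
```

Because neither the input nor the backbone parameters require gradients, autograd records no graph for the backbone, so its activations are not retained; only the fc layer takes part in the backward pass.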

@karandwivedi42

@yunjey You are right. I don't know how important it is though because this linear layer is followed by another linear layer in the decoder with no non-linearity in between.

@jtoy
Contributor Author

jtoy commented May 24, 2017

So what is the right code to use? I was able to train a model with @karandwivedi42's change, and training completed in 155 minutes. Does that time seem right? I trained the original Show and Tell model and I remember it taking at least a day.

@karandwivedi42

karandwivedi42 commented May 24, 2017 via email

@jtoy
Contributor Author

jtoy commented May 24, 2017

@karandwivedi42 I don't fully understand; I'm just starting to play with PyTorch. Is there any way to see it as a diff?

@karandwivedi42

karandwivedi42 commented May 25, 2017

@jtoy This fork is a very hacky way to do exactly what the original code does.
https://github.com/karandwivedi42/pytorch-tutorial/tree/master/tutorials/09%20-%20Image%20Captioning

@yunjey Can you please check this one? (Thanks for the amazing tutorials btw :) )

@yunjey
Owner

yunjey commented May 26, 2017

@jtoy @karandwivedi42 I will fix the code by this weekend.

@jtoy
Contributor Author

jtoy commented May 26, 2017 via email

@yunjey
Owner

yunjey commented May 28, 2017

@jtoy @karandwivedi42 I modified the code. Try it. Thanks :)

@yunjey closed this as completed on May 28, 2017