Out of memory error when training best model on imagenet #14

tremblerz · 2018-07-11T18:38:42Z

I am using a V100 GPU, which has 16 GB of memory. Here is the error log:

07/10 07:05:24 PM valid 000 2.609589e+00 47.656250 76.562500
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train_imagenet.py", line 230, in <module>
    main() 
  File "train_imagenet.py", line 152, in main
    valid_acc_top1, valid_acc_top5, valid_obj = infer(valid_queue, model, criterion)
  File "train_imagenet.py", line 214, in infer
    logits, _ = model(input)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/darts/cnn/model.py", line 207, in forward
    s0, s1 = s1, cell(s0, s1, self.drop_path_prob)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/darts/cnn/model.py", line 51, in forward
    h1 = op1(h1)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/darts/cnn/operations.py", line 66, in forward
    return self.op(x)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

quark0 · 2018-07-11T19:45:56Z

Hard to tell without more details, such as the PyTorch version. If you are using PyTorch 0.4, be sure to wrap the validation code in torch.no_grad(); otherwise autograd keeps the intermediate activations of every forward pass and you would get OOM. I would also try smaller batch sizes and check the memory consumption during training and validation.
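To illustrate the suggestion, here is a minimal sketch of a validation loop wrapped in torch.no_grad(). The infer signature follows the call shown in the traceback, but the body is an assumption and not the repository's actual code. The tiny linear model and random batches are hypothetical stand-ins so the snippet runs on CPU; the real model in the traceback returns a (logits, aux) tuple, whereas this stand-in returns only logits.

```python
import torch
import torch.nn as nn


def infer(valid_queue, model, criterion):
    """Validation pass that does not build autograd graphs (sketch)."""
    model.eval()  # put dropout / batch norm into inference mode
    total_loss, total_correct, total_seen = 0.0, 0, 0
    with torch.no_grad():  # activations are not retained for backward
        for input, target in valid_queue:
            logits = model(input)
            loss = criterion(logits, target)
            n = input.size(0)
            total_loss += loss.item() * n
            total_correct += (logits.argmax(dim=1) == target).sum().item()
            total_seen += n
    return total_correct / total_seen, total_loss / total_seen


# Hypothetical stand-ins for the ImageNet model and validation queue.
torch.manual_seed(0)
model = nn.Linear(16, 10)
criterion = nn.CrossEntropyLoss()
valid_queue = [(torch.randn(4, 16), torch.randint(0, 10, (4,)))
               for _ in range(2)]

# Outputs produced under no_grad carry no autograd history.
with torch.no_grad():
    probe = model(valid_queue[0][0])

acc, avg_loss = infer(valid_queue, model, criterion)
print("top-1 accuracy:", acc, "loss:", avg_loss)

# Optional: on a GPU machine, report the peak memory seen so far.
if torch.cuda.is_available():
    print("peak MiB:", torch.cuda.max_memory_allocated() / 2**20)
```

If validation still runs out of memory with this in place, lowering the validation batch size is the next thing to try, as suggested above.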

tremblerz · 2018-07-12T05:14:28Z

Thank you, that solves the issue.

dragen1860 · 2019-02-01T01:40:06Z

Thanks.
Here is a DARTS version that supports PyTorch 1.0:
https://github.com/dragen1860/DARTS-PyTorch

tremblerz closed this as completed Jul 12, 2018

quark0 mentioned this issue Jul 12, 2018: Out of memory trying to run CIFAR example #16 (Closed)

quark0 mentioned this issue Aug 6, 2018: after 2hours training stopped by CUDA error: output of memory #27 (Closed)
