out of memory cuda runtime error (2) on p2.xlarge AWS instance with batch_size = 1 #4420

Closed
hassanshallal opened this Issue Dec 31, 2017 · 2 comments


hassanshallal commented Dec 31, 2017

Hi folks,

I know this issue has been reported before, and the common wisdom has converged on decreasing the batch_size. In my case, decreasing the batch_size all the way down to 1 didn't resolve the out-of-memory runtime error, which suggests something else is causing the problem. Here is some relevant information:

  1. I am trying to fine-tune a pretrained ResNet-34 on images of the standard 3×224×224 size.

  2. I was able to accomplish the fine-tuning with a batch_size of 32 on my MacBook Air with only 8 GB of memory.

  3. Right now, I am trying to perform the same routine on a p2.xlarge AWS instance with 61 GB of memory, as indicated by AWS: https://aws.amazon.com/ec2/instance-types/p2/

  4. I followed this tutorial to set up the AWS EC2 instance: https://kevinzakka.github.io/2017/08/13/aws-pytorch/

  5. I am reproducing the pipeline in this amazing tutorial: http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

  6. On the AWS instance, everything seems to run fine: I was able to run pre-training inference on the dataset, but only after setting volatile=True. Without volatile, the same out-of-memory runtime error appears (a minimal sketch of this pattern follows at the end of this list).

  7. Here is the error message I get regardless of the batch_size; it appears when trying to train the ResNet-34 even with a batch_size of 1:

/home/ubuntu/envs/deepL/lib/python3.5/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/ubuntu/envs/deepL/lib/python3.5/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
finetunacuda.py:147: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  train_props = m(Variable(train_props))
finetunacuda.py:183: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  test_props = m(Variable(test_props))
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "finetunacuda.py", line 311, in <module>
    main()
  File "finetunacuda.py", line 309, in main
    fine_tuna_protocol()
  File "finetunacuda.py", line 287, in fine_tuna_protocol
    model_ft = train_model(dataloaders, dataset_sizes, model_ft, criterion, optimizer_ft, num_epochs = nep, temp_save_name = name_of_results_output_txt_file)
  File "finetunacuda.py", line 228, in train_model
    outputs = model(inputs)
  File "/home/ubuntu/envs/deepL/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/envs/deepL/lib/python3.5/site-packages/torchvision/models/resnet.py", line 142, in forward
    x = self.maxpool(x)
  File "/home/ubuntu/envs/deepL/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/envs/deepL/lib/python3.5/site-packages/torch/nn/modules/pooling.py", line 143, in forward
    self.return_indices)
  File "/home/ubuntu/envs/deepL/lib/python3.5/site-packages/torch/nn/functional.py", line 334, in max_pool2d
    ret = torch._C._nn.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
  8. Here is the GPU in the instance:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P0    73W / 149W |      0MiB / 11439MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  9. I don't know whether the above information is enough to troubleshoot the issue, so please feel free to ask for anything further. I hope you can help with this; I am a newbie to GPU computing, and this is my first time trying to train a model on a GPU. Any hint or feedback is very much appreciated.
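
For reference, here is a minimal sketch of the inference pattern from point 6, assuming the PyTorch 0.3.x Variable API; the model and dummy input below are illustrative, not my actual script:

```python
# Sketch of the inference pattern from point 6 (PyTorch 0.3.x Variable API).
# Without volatile=True, autograd keeps intermediate activations around for a
# backward pass that never happens, so GPU memory usage is much higher.
import torch
from torch.autograd import Variable
import torchvision.models as models

model = models.resnet34(pretrained=True).cuda()
model.eval()

inputs = torch.randn(1, 3, 224, 224).cuda()  # dummy 3x224x224 batch

# OOM-prone: autograd tracks this forward pass and retains its buffers
outputs = model(Variable(inputs))

# Works for inference: volatile Variables skip graph construction entirely
outputs = model(Variable(inputs, volatile=True))
```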

Happy new year
Ciao

Member

apaszke commented Jan 2, 2018

You're probably accumulating the loss without taking the .data or doing something else that prevents the previous iteration graphs from being freed. Try using the official imagenet example, and reopen the issue if you hit the same issues with that script. Otherwise, please ask on our forums - we're using GitHub issues for bug reports only.
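
A minimal sketch of the accumulation bug described above, assuming the PyTorch 0.3.x API; the model, criterion, and optimizer here are placeholders rather than the reporter's actual code:

```python
# Sketch of the leak apaszke describes (PyTorch 0.3.x API): accumulating the
# raw loss Variable keeps every iteration's autograd graph alive, so GPU
# memory grows until it runs out, even with batch_size=1.
import torch
import torch.nn as nn
from torch.autograd import Variable
import torchvision.models as models

model = models.resnet34(pretrained=True).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

running_loss = 0.0
for _ in range(10):  # stand-in for iterating over a DataLoader
    inputs = Variable(torch.randn(1, 3, 224, 224).cuda())
    labels = Variable(torch.LongTensor([0]).cuda())

    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

    # Leaks: 'loss' is a Variable, so this keeps the whole graph reachable
    # running_loss += loss

    # Safe on 0.3.x: accumulate the plain float (loss.item() in later versions)
    running_loss += loss.data[0]
```

With the Variable version commented back in, memory climbs every iteration; accumulating the detached float lets each iteration's graph be freed as soon as the step finishes.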

@apaszke apaszke closed this Jan 2, 2018

hassanshallal commented Jan 2, 2018

Thanks a lot for the response and the redirection; I'll pay attention to posting on the right platform!
