cuda out of memory error when GPU0 memory is fully utilized #3477
That's because the default device is 0, so PyTorch is trying to create a context on it. You can control the devices you are using either with the CUDA_VISIBLE_DEVICES environment variable, or by guarding your computations like this:
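The guard snippet itself didn't survive in this copy of the thread; a minimal sketch of the usual pattern, using the torch.cuda.device context manager (device index 1 is just an example):

import torch

# Allocations made inside this block are placed on device 1 rather than
# on the default device 0.
with torch.cuda.device(1):
    x = torch.randn(10, 10).cuda()
    y = torch.randn(10, 10).cuda()
    z = torch.mm(x, y)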
Even when I specify the CUDA device for all my transfers, there is still memory being used on GPU0, e.g.

model = nn.Module()
model.rnn = nn.RNN(input_size=features, hidden_size=hidden_size, num_layers=2)
model.cuda(5)
X_train = torch.randn(seq_len, batch_size, features)
y_train = torch.randn(batch_size)
X_train, y_train = Variable(X_train).cuda(5), Variable(y_train).cuda(5)

Is this the right behavior?
@lucylw Yes, this is the intended behavior. Since PyTorch still sees your GPU 0 as first in CUDA_VISIBLE_DEVICES, it will create some context on it. If you want your script to completely ignore GPU 0, you need to set that environment variable before launching, e.g. CUDA_VISIBLE_DEVICES=5 for it to only use GPU 5.
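For completeness, a sketch of how this looks in practice (train.py is a placeholder script name; the variable can also be set from Python, as long as that happens before CUDA is initialized):

# shell form:
#   CUDA_VISIBLE_DEVICES=5 python train.py

import os

# Hide every GPU except physical device 5; this must run before the first
# CUDA call, ideally before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "5"

import torch

print(torch.cuda.device_count())    # 1
print(torch.cuda.current_device())  # 0 -- this logical device is physical GPU 5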
Still, I thought we're only initializing contexts on devices that are actually getting used (in the last snippet). I don't think we should create one on GPU0 in this case. It's worth looking into that.
@apaszke From my experience with 0.2.0, it always creates a ~250MB context on the first visible GPU, whether or not that GPU is actually used.
That's weird, I don't think it should be like this.
It looks like it happens in the context manager in torch/cuda/__init__.py: prev_idx gets reset to 0, and torch._C._cuda_setDevice then switches back to device 0 on exit.
Shouldn't __enter__/__exit__ only get called within a with statement?
There are a bunch of things in PyTorch that can currently lead to initialization of a context on the first visible GPU, such as CPU-GPU copies.
CUDA_VISIBLE_DEVICES may not be the best thing to always rely on: "robust applications should use the CUDA API to enumerate and select devices with appropriate capabilities at run time. To learn how, read the section on Device Enumeration in the CUDA Programming Guide. But the CUDA_VISIBLE_DEVICES environment variable is handy for restricting execution to a specific device or set of devices for debugging and testing." (https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/) It also has known conflicts with multi-process NCCL (as in, it doesn't work).
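Along those lines, a sketch of enumerating and selecting devices at run time through the torch.cuda API instead of the environment variable (torch.cuda.get_device_properties is available in more recent releases):

import torch

# List the GPUs this process can see and pick one explicitly.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, props.total_memory // 2**20, "MiB")

torch.cuda.set_device(1)  # example: make logical device 1 the default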
I met the same problem using PyTorch 0.4.1. I have two GTX 1080Ti GPUs (11GB RAM each). When I run a training code on GPU0, it's OK. When I run the training code on GPU1 by setting CUDA_VISIBLE_DEVICES, the program reports a CUDA out of memory error. It's weird, since GPU0 actually has less free memory, being connected to the monitor. Free GPU memory before running the training code:
After running the training code on GPU0:
Error message when running on GPU1:
@TomHeaven did you set CUDA_VISIBLE_DEVICES outside the python process? If that's the case, PyTorch should not even have driver-level access to your GPU0. Ideally, the variable should be exported in the shell before Python starts, so the process never sees GPU0 at all.
Yes, I did it exactly that way. And if I restart the computer, GPU 1 can sometimes run the code without problems.
@gchanan Please ignore my earlier comment. I have just been informed by the owner of the system that, due to some glitch, the GPU IDs have been reversed. So the RTX is actually GPU id 1, not 0, and nvidia-smi gives the wrong info. I am really sorry for wasting your time. Note: I have passed on this information from my supervisor so that you don't spend time looking into my issue, but I have yet to confirm it myself. I will confirm tonight as soon as I have system access.
UPDATE: It turns out the issue is a discrepancy between how nvidia-smi and the rest of the NVIDIA driver enumerate devices. P.S. Setting …
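The P.S. is cut off in this copy; in case it refers to the ordering mismatch just described, a common way to make CUDA's device numbering agree with nvidia-smi's is to force PCI bus ordering (an assumption about what was meant, not a quote from the author):

import os

# nvidia-smi orders GPUs by PCI bus ID, while CUDA defaults to a "fastest first"
# ordering, so the two numberings can disagree. Forcing PCI bus order makes them match.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # "1" now refers to the same card in both tools

import torch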
I think we've tackled a number of these issues. I'm going to close this now since we haven't seen a reproduction in quite a while. Please reopen if you see this again.
I have been experiencing an error on systems with multiple GPUs. When GPU0 is fully utilized by another process, I get
RuntimeError: cuda runtime error (2) : out of memory
It seems that torch.nn.Module.cuda() transfers data not only to my specified GPU, but also to GPU0, whose memory is already in use.
I can reproduce this error using the below code (memory on GPU0 should be fully utilized by another process):
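The reproduction snippet did not survive in this copy of the issue; it appears to be the code quoted in the comments above. A self-contained version (the tensor sizes are made up for illustration):

import torch
import torch.nn as nn
from torch.autograd import Variable

seq_len, batch_size, features, hidden_size = 35, 20, 100, 200  # hypothetical sizes

model = nn.Module()
model.rnn = nn.RNN(input_size=features, hidden_size=hidden_size, num_layers=2)
model.cuda(5)  # everything is explicitly placed on GPU 5

X_train = torch.randn(seq_len, batch_size, features)
y_train = torch.randn(batch_size)
X_train, y_train = Variable(X_train).cuda(5), Variable(y_train).cuda(5)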
After model.cuda(5), my nvidia-smi output shows:

+-----------------------------------------------------------+
| Processes:                                      GPU Memory |
|  GPU       PID   Type   Process name            Usage      |
|===========================================================|
|    0     32317     C    python                   7773MiB   |
|    0     34873     C    python                    217MiB   |
|    1     41080     C    python                   7775MiB   |
|    5     34873     C    python                    289MiB   |
+-----------------------------------------------------------+
I am using GPU5, but the same process 34873 is also using memory on GPU0.
It looks like in the device class of torch/cuda/__init__.py, prev_idx is being reset to 0, and then torch._C._cuda_setDevice sets the device number to 0 upon exit (torch/cuda/__init__.py:110).
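A simplified sketch of the pattern being described, to make the failure mode concrete; this is an illustration, not the actual PyTorch source, and torch.cuda.current_device() stands in for the internal getter:

import torch

class device(object):
    # Illustration only: if the saved index ends up as 0, __exit__ switches back
    # to device 0, and the thread suggests that call is what creates the unwanted
    # context on GPU 0.
    def __init__(self, idx):
        self.idx = idx
        self.prev_idx = 0

    def __enter__(self):
        self.prev_idx = torch.cuda.current_device()  # 0 if no device was ever selected
        torch._C._cuda_setDevice(self.idx)

    def __exit__(self, *args):
        torch._C._cuda_setDevice(self.prev_idx)      # may set device 0 on the way out
        return False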
cc @ngimel