SPINEPS : RuntimeError: CUDA error: out of memory #24

Closed
Kaonashi22 opened this issue Apr 23, 2024 · 7 comments

@Kaonashi22 commented Apr 23, 2024

Following our discussion in issue #21:

After setting the `TMPDIR` shell variable to the directory where the data is stored, the `batch_processing.sh` script finally starts running and creates a tmp folder. Then, it exits with this error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity
of 1.95 GiB of which 1.19 MiB is free. Process 7900 has 27.12 MiB memory in use. Including non-PyTorch
memory, this process has 1.28 GiB memory in use. Of the allocated memory 1.10 GiB is allocated by PyTorch,
and 139.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try
setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for
Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

After exporting these variables, `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True PYTORCH_NO_CUDA_MEMORY_CACHING=1`, I got this new error:

```
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

I then exported `CUDA_LAUNCH_BLOCKING=1` as well, and I end up with this error:

```
RuntimeError: CUDA error: out of memory
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
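
For reference, a minimal sketch of how to check what PyTorch actually sees on the card from the same environment (standard PyTorch calls only; note that the allocator setting has to be in the environment before CUDA is initialized, hence before the import):

```python
import os

# The allocator config must be set before torch initializes CUDA,
# so set it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

if torch.cuda.is_available():
    # mem_get_info() returns (free, total) in bytes for the current device
    free, total = torch.cuda.mem_get_info()
    print(f"GPU 0: {free / 1024**2:.0f} MiB free of {total / 1024**2:.0f} MiB")
else:
    print("No CUDA device visible to PyTorch")
```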

Any insights? Thanks for your help!

Note from @joshuacwnewton: I added ``` formatting just for readability!

@joshuacwnewton
Member

The first thing I would suggest is to use `nvidia-smi` to check information about the GPU prior to the call to spineps. Here is some sample output from my own GPU:

```
PS C:\Users\Joshua> nvidia-smi
Tue Apr 23 13:12:41 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.01                 Driver Version: 546.01       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050      WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8              N/A / ERR! |      0MiB /  4096MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

I would be curious to see what the output of this is immediately prior to the spineps call. Would it be possible to add this to batch_processing.sh, then share the error log file for whichever subject fails? :)

At the very least, this will hopefully help us isolate the problem, since we should be able to tell whether this is due to SPINEPS' memory requirements or to other processes using the GPU memory.


You may also want to set the following options in the spineps command: `-sd -v`. I'm not sure what sort of debug information SPINEPS provides, but it's possibly worth a try.
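
If editing `batch_processing.sh` directly is awkward, here is a hedged Python version of the same logging idea (the helper name and log filename are just illustrative; it simply shells out to `nvidia-smi` and appends the output):

```python
import subprocess

def log_gpu_state(log_path: str) -> None:
    """Append the current nvidia-smi output to log_path (assumes nvidia-smi is on PATH)."""
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    with open(log_path, "a") as f:
        f.write(result.stdout + "\n")

# Hypothetical usage, called immediately before the spineps invocation:
log_gpu_state("err.batch_processing_sub-BB277.log")
```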

@Kaonashi22
Author

Thanks @joshuacwnewton; this is the information about the GPU:
[Screenshot: nvidia-smi output showing several processes using GPU memory on a GTX 750]

And the log file of the test subject, where you can see the output before calling SPINEPS:
err.batch_processing_sub-BB277.log

I'm trying again to install the SCT on Compute Canada; I'll let you know how it works

@jcohenadad
Member

@NathanMolinier can SPINEPS run on CPU only?

@NathanMolinier
Contributor

I don't think it is implemented in the command line yet, but I talked to Hendrik about it, and he mentioned that the code checks whether a GPU is available before running, so it should not be "too difficult" to change. I will ask if it's possible to allow the code to run on CPU only.
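
For context, the kind of pattern being described is something like the sketch below (this is not SPINEPS's actual code; the `force_cpu` flag and the model are hypothetical stand-ins):

```python
import torch

def pick_device(force_cpu: bool = False) -> torch.device:
    """Fall back to CPU when forced, or when no CUDA device is usable."""
    if force_cpu or not torch.cuda.is_available():
        return torch.device("cpu")
    return torch.device("cuda")

# A hypothetical --cpu CLI flag would just thread through here:
device = pick_device(force_cpu=True)
model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real segmentation model
```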

@joshuacwnewton (Member) commented Apr 23, 2024

> Thanks @joshuacwnewton; this is the information about the GPU

Ah!! That at least partially explains things. You have quite a few memory-intensive processes running on your GPU. Normally, I would expect these processes to run on your CPU's integrated GPU instead, leaving the deep learning tasks solely for your GTX 750 GPU.

I am wondering -- are you SSH'ing into a remote machine and forwarding the display, by chance? (This seems to be a common theme when people talk about Xorg taking up GPU resources in discussions online.)

Either way, you may want to look into help posts such as: https://askubuntu.com/questions/1279809/prevent-usr-lib-xorg-xorg-from-using-gpu-memory-in-ubuntu-20-04-server. The end goal is probably something like what is shown in this answer: https://askubuntu.com/a/1313440. Given that your GPU is relatively old (the GTX 750 is from 2014) and has somewhat limited memory (2 GB), you will want to take steps to conserve as much of the memory as you can, so that you are left with as much as possible when running inference on GPU.
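
To confirm what is actually holding memory on the card before SPINEPS starts, something like the sketch below can help (assumes the `nvidia-ml-py` bindings, i.e. `pip install nvidia-ml-py`; it mirrors the per-process table that `nvidia-smi` prints):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Graphics clients (e.g. Xorg) and compute clients (e.g. PyTorch) are listed separately.
procs = (pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle)
         + pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
for p in procs:
    used = p.usedGpuMemory / 1024**2 if p.usedGpuMemory else 0.0  # may be None on some drivers
    print(f"pid={p.pid}: {used:.0f} MiB")

pynvml.nvmlShutdown()
```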

@Kaonashi22
Author

Yes, I'm running the analysis on the institute's server.

I wasn't running any other process while the script was executing... Not sure I can spare GPU memory...

@jcohenadad
Member

Won't fix (see #26)
