SPINEPS : RuntimeError: CUDA error: out of memory #24

Closed
Kaonashi22 opened this issue Apr 23, 2024 · 7 comments

@Kaonashi22 commented Apr 23, 2024

Following our discussion in issue #21:

After setting the `TMPDIR` shell variable to the directory where the data is stored, the `batch_processing.sh` script finally starts running and creates a tmp folder. Then, it exits with this error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity
of 1.95 GiB of which 1.19 MiB is free. Process 7900 has 27.12 MiB memory in use. Including non-PyTorch
memory, this process has 1.28 GiB memory in use. Of the allocated memory 1.10 GiB is allocated by PyTorch,
and 139.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try
setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for
Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

After exporting these variables, `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True PYTORCH_NO_CUDA_MEMORY_CACHING=1`, I got this new error:

```
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

I then exported `CUDA_LAUNCH_BLOCKING=1` as well, and I end up with this error:

```
RuntimeError: CUDA error: out of memory
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
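
For reference, a minimal sketch of how to check what PyTorch actually sees on the card from the same environment (standard PyTorch calls only; note that the allocator setting has to be in the environment before CUDA is initialized, hence before the import):

```python
import os

# The allocator config must be set before torch initializes CUDA,
# so set it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

if torch.cuda.is_available():
    # mem_get_info() returns (free, total) in bytes for the current device
    free, total = torch.cuda.mem_get_info()
    print(f"GPU 0: {free / 1024**2:.0f} MiB free of {total / 1024**2:.0f} MiB")
else:
    print("No CUDA device visible to PyTorch")
```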

Any insights? Thanks for your help!

Note from @joshuacwnewton: I added ``` formatting just for readability!

@joshuacwnewton
Member

The first thing I would suggest is to use `nvidia-smi` to check information about the GPU prior to the call to spineps. Here is some sample output from my own GPU:

```
PS C:\Users\Joshua> nvidia-smi
Tue Apr 23 13:12:41 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.01                 Driver Version: 546.01       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050      WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8              N/A / ERR! |      0MiB /  4096MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

I would be curious to see what the output of this is immediately prior to the spineps call. Would it be possible to add this to batch_processing.sh, then share the error log file for whichever subject fails? :)

At the very least, this will hopefully help us isolate the problem, since we should be able to tell whether this is due to SPINEPS' memory requirements or to other processes using the GPU memory.


You may also want to set the following options in the spineps command: `-sd -v`. I'm not sure what sort of debug information SPINEPS provides, but it's possibly worth a try.
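
If editing `batch_processing.sh` directly is awkward, here is a hedged Python version of the same logging idea (the helper name and log filename are just illustrative; it simply shells out to `nvidia-smi` and appends the output):

```python
import subprocess

def log_gpu_state(log_path: str) -> None:
    """Append the current nvidia-smi output to log_path (assumes nvidia-smi is on PATH)."""
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    with open(log_path, "a") as f:
        f.write(result.stdout + "\n")

# Hypothetical usage, called immediately before the spineps invocation:
log_gpu_state("err.batch_processing_sub-BB277.log")
```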

@Kaonashi22
Author

Thanks @joshuacwnewton; this is the information about the GPU:
[Screenshot: nvidia-smi output showing several processes using GPU memory on a GTX 750]

And the log file of the test subject, where you can see the output before calling SPINEPS:
err.batch_processing_sub-BB277.log

I'm trying again to install the SCT on Compute Canada; I'll let you know how it works

@jcohenadad
Member

@NathanMolinier can SPINEPS run on CPU only?

@NathanMolinier
Contributor

I don't think it is implemented in the command line yet, but I talked to Hendrik about it, and he mentioned that the code checks whether a GPU is available before running, so it should not be "too difficult" to change. I will ask if it's possible to allow the code to run on CPU only.
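
For context, the kind of pattern being described is something like the sketch below (this is not SPINEPS's actual code; the `force_cpu` flag and the model are hypothetical stand-ins):

```python
import torch

def pick_device(force_cpu: bool = False) -> torch.device:
    """Fall back to CPU when forced, or when no CUDA device is usable."""
    if force_cpu or not torch.cuda.is_available():
        return torch.device("cpu")
    return torch.device("cuda")

# A hypothetical --cpu CLI flag would just thread through here:
device = pick_device(force_cpu=True)
model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real segmentation model
```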

@joshuacwnewton (Member) commented Apr 23, 2024

> Thanks @joshuacwnewton; this is the information about the GPU

Ah!! That at least partially explains things. You have quite a few memory-intensive processes running on your GPU. Normally, I would expect these processes to run on your CPU's integrated GPU instead, leaving the deep learning tasks solely for your GTX 750 GPU.

I am wondering -- are you SSH'ing into a remote machine and forwarding the display, by chance? (This seems to be a common theme when people talk about Xorg taking up GPU resources in discussions online.)

Either way, you may want to look into help posts such as: https://askubuntu.com/questions/1279809/prevent-usr-lib-xorg-xorg-from-using-gpu-memory-in-ubuntu-20-04-server. The end goal is probably something like what is shown in this answer: https://askubuntu.com/a/1313440. Given that your GPU is relatively old (the GTX 750 is from 2014) and has somewhat limited memory (2 GB), you will want to take steps to conserve as much of the memory as you can, so that you are left with as much as possible when running inference on GPU.
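
To confirm what is actually holding memory on the card before SPINEPS starts, something like the sketch below can help (assumes the `nvidia-ml-py` bindings, i.e. `pip install nvidia-ml-py`; it mirrors the per-process table that `nvidia-smi` prints):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Graphics clients (e.g. Xorg) and compute clients (e.g. PyTorch) are listed separately.
procs = (pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle)
         + pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
for p in procs:
    used = p.usedGpuMemory / 1024**2 if p.usedGpuMemory else 0.0  # may be None on some drivers
    print(f"pid={p.pid}: {used:.0f} MiB")

pynvml.nvmlShutdown()
```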

@Kaonashi22
Author

Yes, I'm running the analysis on the institute's server.

I wasn't running any other process while the script was executing... Not sure I can spare GPU memory...

@jcohenadad
Member

Won't fix (see #26)
