Inconsistency between GPU memory usage in torch.cuda.memory_summary and nvidia-smi #37250
Comments
The total allocated memory is shown in the summary. PyTorch also uses a caching mechanism to avoid reallocating memory on the device, as explained in the memory management docs, so the current usage might be lower than the total allocated memory. Re: the OOM issue on validation, see the suggestions in the OOM thread.
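A minimal sketch of the caching behavior described above (sizes are arbitrary; torch.cuda.memory_reserved replaces the older memory_cached):

```python
import torch

x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32
print(torch.cuda.memory_allocated() / 2**20)  # ~1024 MiB held by live tensors
print(torch.cuda.memory_reserved() / 2**20)   # >= allocated (cache included)

del x  # the tensor is gone, but its block stays in the cache for reuse
print(torch.cuda.memory_allocated() / 2**20)  # ~0 MiB
print(torch.cuda.memory_reserved() / 2**20)   # still ~1024 MiB; nvidia-smi agrees

torch.cuda.empty_cache()  # hand the cached blocks back to the driver
print(torch.cuda.memory_reserved() / 2**20)   # drops, and nvidia-smi follows
```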
@ptrblck Thanks for your help! The total allocated memory seems to take different values across runs. Here is another case, where nvidia-smi reports a total usage of 10125MiB:
In all cases, if I just increase the batch size (right now from 4 to 5), it causes an OOM error. As for the evaluation issue, I will read the OOM thread and try to find where I can improve.
The total allocated memory shown here is not the memory PyTorch currently has allocated; it is the cumulative amount allocated over time.
I also encountered a similar problem, where PyTorch reports inconsistent memory usage between the vGPU memory and the actual GPU memory. Environment:
Python 3.9.16 (main, Dec 7 2022, 01:11:58)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> gpu_id = torch.cuda.current_device()
>>> torch.cuda.set_device(0)
>>> batch_size = 1024
>>> data_shape = (3, 224, 224)
>>> tensor = torch.zeros([batch_size] + list(data_shape)).cuda(device=0)
>>> batch_size = 10240
>>> tensor = torch.zeros([batch_size] + list(data_shape)).cuda(device=0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CUDA out of memory. Tried to allocate 5.74 GiB (GPU 0; 5.50 GiB total capacity; 588.00 MiB already allocated; 4.13 GiB free; 588.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
>>>
>>> torch.set_default_tensor_type(torch.FloatTensor)
>>> total_memory = torch.cuda.max_memory_allocated()
>>> free_memory = torch.cuda.max_memory_cached()
/home/admin/langchain-ChatGLM/venv/lib/python3.9/site-packages/torch/cuda/memory.py:392: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
warnings.warn(
>>>
>>> print(f"Total memory: {total_memory / (1024 ** 3):.2f} GB")
Total memory: 0.57 GB
>>> print(f"Free memory: {free_memory / (1024 ** 3):.2f} GB")
Free memory: 0.57 GB
>>> print(f"status:{torch.cuda.memory_summary()}")
Initial state:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 1 | cudaMalloc retries: 1 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 602124 KB | 602124 KB | 602124 KB | 0 B |
| from large pool | 602112 KB | 602112 KB | 602112 KB | 0 B |
| from small pool | 12 KB | 12 KB | 12 KB | 0 B |
|---------------------------------------------------------------------------|
| Active memory | 602124 KB | 602124 KB | 602124 KB | 0 B |
| from large pool | 602112 KB | 602112 KB | 602112 KB | 0 B |
| from small pool | 12 KB | 12 KB | 12 KB | 0 B |
|---------------------------------------------------------------------------|
| GPU reserved memory | 604160 KB | 604160 KB | 604160 KB | 0 B |
| from large pool | 602112 KB | 602112 KB | 602112 KB | 0 B |
| from small pool | 2048 KB | 2048 KB | 2048 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 2035 KB | 2040 KB | 2040 KB | 4608 B |
| from large pool | 0 KB | 0 KB | 0 KB | 0 B |
| from small pool | 2035 KB | 2040 KB | 2040 KB | 4608 B |
|---------------------------------------------------------------------------|
| Allocations | 5 | 5 | 5 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 4 | 4 | 4 | 0 |
|---------------------------------------------------------------------------|
| Active allocs | 5 | 5 | 5 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 4 | 4 | 4 | 0 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 2 | 2 | 2 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 1 | 1 | 1 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 1 | 1 | 1 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 1 | 1 | 1 | 0 |
|---------------------------------------------------------------------------|
| Oversize allocations | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Oversize GPU segments | 0 | 0 | 0 | 0 |
|===========================================================================|

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
Thu May 11 02:39:19 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.172.01 Driver Version: 450.172.01 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:88:00.0 Off | 0 |
| N/A 43C P0 50W / 250W | 2924MiB / 22919MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
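As an aside, the REPL above labels max_memory_allocated as "Total memory" and the deprecated max_memory_cached as "Free memory"; both are peak counters, not totals. A hedged sketch of the current calls (torch.cuda.mem_get_info availability depends on the PyTorch version):

```python
import torch

torch.cuda.init()  # make sure the CUDA context exists

allocated = torch.cuda.memory_allocated()  # bytes held by live tensors right now
peak = torch.cuda.max_memory_allocated()   # peak of the above, not a total
reserved = torch.cuda.memory_reserved()    # live tensors + cached blocks

# Device-level free/total, the numbers nvidia-smi works from
# (wraps cudaMemGetInfo; present in recent PyTorch releases):
free_b, total_b = torch.cuda.mem_get_info()

print(f"allocated {allocated / 2**20:.0f} MiB, peak {peak / 2**20:.0f} MiB, "
      f"reserved {reserved / 2**20:.0f} MiB")
print(f"device free/total: {free_b / 2**20:.0f}/{total_b / 2**20:.0f} MiB")
```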
I have a torch model on CUDA. After I delete it, PyTorch reports the memory as freed, but nvidia-smi still shows it in use. Why are they different?
Update: That is because the deleted model is still held in PyTorch's cache; to really clear it from the GPU you need to call torch.cuda.empty_cache(). Then nvidia-smi drops accordingly. There are still 495MiB left on the GPU, but that's not a big deal to me.
Update 2: If using multiple GPUs, simply do the same for each device.
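A minimal sketch of the sequence described above (the Linear module is a placeholder for the real model):

```python
import gc
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder for the real model
del model

# The freed blocks stay in PyTorch's cache, so nvidia-smi still counts them
# against the process; gc.collect() first in case references linger.
gc.collect()
torch.cuda.empty_cache()

# Conservative multi-GPU pattern: run empty_cache() under each device context,
# in case the installed PyTorch only releases the current device's cache.
for i in range(torch.cuda.device_count()):
    with torch.cuda.device(i):
        torch.cuda.empty_cache()
```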
🐛 Bug
I want to increase the batch size of my model, but the GPU memory fills up quickly. However, when I look at the memory numbers, they are not consistent between memory_summary and nvidia-smi.
The out-of-memory error says:
When I reduced the batch size a little bit, the memory_summary says
Full output from another run
while at the same time, nvidia-smi reports
So it seems that the whole process takes ~10000MB of memory, but only ~5000MB is actually used by PyTorch; is the rest just "allocated" but not used? I have been struggling with this for weeks and haven't found any way to solve it.
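The gap usually has two components: the CUDA context created at initialization (typically a few hundred MB) and blocks PyTorch has reserved in its cache but not handed to tensors. A hedged sketch of comparing the two views in-process (pynvml, i.e. nvidia-ml-py, is an assumption here, not a tool from this thread):

```python
import pynvml  # the nvidia-ml-py bindings
import torch

torch.ones(1, device="cuda")  # force CUDA context creation

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
used = pynvml.nvmlDeviceGetMemoryInfo(handle).used  # device-wide, as in nvidia-smi

allocated = torch.cuda.memory_allocated(0)  # live tensors
reserved = torch.cuda.memory_reserved(0)    # live tensors + cache

print(f"nvidia-smi sees:   {used / 2**20:.0f} MiB")
print(f"PyTorch allocated: {allocated / 2**20:.0f} MiB")
print(f"PyTorch reserved:  {reserved / 2**20:.0f} MiB")
# used - reserved ~= CUDA context + non-PyTorch allocations,
# assuming this is the only process on the GPU.
```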
I found a similar issue #35901, but in my case I don't have any really large tensors. I can post the memory profiling results over several iterations using this tool:
Tensors list
I also want to mention another phenomenon I found. My training process includes a validation step, and if I run validation after each training epoch, the memory cost almost doubles! This happens even though I use the same network object; I just call
net.eval()
and the forward function. So I have to reduce batch_size whenever I enable validation. Is this caused by incorrect usage on my part, or by something related to PyTorch memory management? I also tried pytorch_lightning to avoid the hand-written validation loop, but I still hit the same issue! Thanks for your help in advance!
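One common cause of exactly this doubling (an assumption about the reporter's code, not confirmed by the thread): running the validation forward passes without torch.no_grad(), so autograd keeps the activation buffers alive. A minimal sketch of a validation loop that avoids it:

```python
import torch

def validate(net, val_loader, device="cuda"):
    net.eval()  # eval mode for batchnorm/dropout; does NOT disable autograd
    with torch.no_grad():  # stops activations being stored for a backward pass
        for inputs, targets in val_loader:
            outputs = net(inputs.to(device))
            # ... accumulate metrics from outputs / targets ...
```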
To Reproduce
Unfortunately I cannot provide the code I use...
Environment
Also note that I use this sparse convolution library: spconv
cc @ngimel