
CUDA out of memory issue #1654

Closed
agupta582 opened this issue Mar 27, 2023 · 11 comments
Labels
bug Something isn't working non-reproducible Bug is not reproducible Stale

Comments

@agupta582

Search before asking

  • I have searched the YOLOv8 issues and found no similar bug report.

YOLOv8 Component

Training

Bug

I have run into the following RuntimeError several times, with every model I pick from the Ultralytics package. The error snippet below is from the yolov5n.pt model. Interestingly, when I train the same model directly from the YOLOv5 GitHub repository [https://github.com/ultralytics/yolov5], it works perfectly without any error; I was even able to train the larger yolov5l.pt model with no issues. So it appears to be a memory-management problem in the Ultralytics package. Please look into it.

RuntimeError: CUDA out of memory. Tried to allocate 572.00 MiB (GPU 0; 23.99 GiB total capacity; 21.62 GiB already allocated; 0 bytes free; 22.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Environment

-- Ultralytics YOLOv8.0.57 Python-3.9.12 torch-1.12.0 CUDA:0 (NVIDIA RTX A5000, 24564MiB)
-- OS Ubuntu 22.04

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@agupta582 agupta582 added the bug Something isn't working label Mar 27, 2023
@github-actions

github-actions bot commented Mar 27, 2023

👋 Hello @ashishgupta582, thank you for your interest in YOLOv8 🚀! We recommend a visit to the YOLOv8 Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Install

Pip install the ultralytics package including all requirements in a Python>=3.7 environment with PyTorch>=1.7.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher glenn-jocher added the non-reproducible Bug is not reproducible label Mar 27, 2023
@glenn-jocher
Member

glenn-jocher commented Mar 27, 2023

@ashishgupta582
👋 Hello! Thanks for asking about CUDA memory issues. YOLOv5 🚀 can be trained on CPU, single-GPU, or multi-GPU. When training on GPU it is important to keep your batch-size small enough that you do not use all of your GPU memory, otherwise you will see a CUDA Out Of Memory (OOM) Error and your training will crash. You can observe your CUDA memory utilization using either the nvidia-smi command or by viewing your console output:

[Screenshot: console training output showing GPU memory utilization]

CUDA Out of Memory Solutions

If you encounter a CUDA OOM error, the steps you can take to reduce your memory usage are listed below (a short usage sketch follows the list):

  • Reduce --batch-size
  • Reduce --img-size
  • Reduce model size, i.e. from YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s -> YOLOv5n
  • Train with multi-GPU at the same --batch-size
  • Upgrade your hardware to a larger GPU
  • Train on free GPU backends with up to 16GB of CUDA memory (Google Colab, Kaggle)
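
For the ultralytics package specifically, batch size, image size, and model size all map onto arguments of the Python API. A minimal sketch, assuming the standard ultralytics training call (the dataset name and the numeric values below are placeholders, not tuned recommendations):

    from ultralytics import YOLO

    # Pick a smaller model variant to lower memory pressure (n < s < m < l < x)
    model = YOLO('yolov8n.pt')

    # Reduce batch and imgsz until training fits in GPU memory.
    # 'coco128.yaml' is a placeholder dataset config; substitute your own.
    model.train(data='coco128.yaml', epochs=100, imgsz=512, batch=8, device=0)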

AutoBatch

You can use YOLOv5 AutoBatch (NEW) to find the best batch size for your training by passing --batch-size -1. AutoBatch will solve for a 90% CUDA memory-utilization batch-size given your training settings. AutoBatch is experimental, and only works for Single-GPU training. It may not work on all systems, and is not recommended for production use.
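
For reference, a minimal sketch of both invocations; the YOLOv5 CLI flag is shown as a comment, and AutoBatch behaviour in the ultralytics package may vary by version, so treat this as a sketch rather than a guarantee:

    # YOLOv5 repository (CLI):
    #   python train.py --img 640 --batch-size -1 --data coco128.yaml --weights yolov5s.pt

    # ultralytics package (Python API): batch=-1 requests AutoBatch on a single GPU
    from ultralytics import YOLO

    model = YOLO('yolov8n.pt')
    model.train(data='coco128.yaml', imgsz=640, batch=-1, device=0)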

[Screenshot: AutoBatch console output]

Good luck 🍀 and let us know if you have any other questions!

@ExtReMLapin
Contributor

While YOLOv5 seemed to have very stable memory usage, YOLOv8 doesn't appear to share this benefit.

Even when running Ray optimization, I set the batch size low enough to use only 20 GB of VRAM out of my 40 GB available, yet it still managed to crash (OOM) after I let it run for a few hours.

@glenn-jocher
Member

@ExtReMLapin hello! We're sorry to hear that you're encountering memory-related issues while running YOLOv8 👎. A CUDA Out Of Memory (OOM) error typically means there is not enough free GPU memory for a specific operation. To help avoid this, we recommend lowering your batch size and/or reducing your training image size. If that does not work, running on a GPU with more memory is another option.

Another option you might want to consider is tuning the max_split_size_mb setting of PyTorch's CUDA caching allocator, which can reduce memory fragmentation when reserved memory is much larger than allocated memory. You can find details in the official PyTorch memory management documentation.
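
As a concrete illustration, max_split_size_mb is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable and must be set before the first CUDA allocation. A minimal sketch (the 128 MiB value is an arbitrary starting point, not a tuned recommendation):

    import os

    # Equivalent to: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
    # Must be set before PyTorch initializes its CUDA caching allocator,
    # i.e. before the first tensor is placed on the GPU.
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

    from ultralytics import YOLO

    model = YOLO('yolov8n.pt')  # placeholder model and dataset; substitute your own
    model.train(data='coco128.yaml', imgsz=640, batch=8, device=0)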

We hope that helps! Let us know if you have any additional questions or run into any other issues.

@ExtReMLapin
Contributor

Sorry for the delayed answer. I think the issue is that the memory usage is unstable. Hear me out: with YOLOv5 the memory usage was very stable; there are internal allocations that are invisible in nvidia-smi because PyTorch reserves memory, but with YOLOv8 it seems that, for some reason, deep into the training it needs to reserve more memory.

Starting a training at 35 GB used out of 40 GB sounds safe, but crashing two hours later is frustrating.

Please take a look at the following error:

      Epoch    GPU_mem  giou_loss   cls_loss    l1_loss  Instances       Size
   128/3500      34.4G       1.64     0.3032     0.5341        219       2048: 100%|██████████| 82/82 [00:53<00:00,  1.53it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 36/36 [00:09<00:00,  3.66it/s]
                   all        491      14399     0.0357       0.26     0.0251    0.00567

      Epoch    GPU_mem  giou_loss   cls_loss    l1_loss  Instances       Size
   129/3500      34.2G      1.697     0.2921     0.5755        134       2048:  80%|████████  | 66/82 [00:42<00:10,  1.55it/s]Traceback (most recent call last):
  File "/home/CENSORED/.config/Ultralytics/DDP/_temp_ajvj9jgl139928182858336.py", line 9, in <module>
    trainer.train()
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/yolo/engine/trainer.py", line 192, in train
    self._do_train(world_size)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/yolo/engine/trainer.py", line 332, in _do_train
    self.loss, self.loss_items = self.model(batch)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 44, in forward
    return self.loss(x, *args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 434, in loss
    preds = self.predict(img, batch=targets) if preds is None else preds
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 478, in predict
    x = head([y[j] for j in head.f], batch)  # head inference
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/head.py", line 242, in forward
    dec_bboxes, dec_scores = self.decoder(embed,
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/transformer.py", line 359, in forward
    output = layer(output, refer_bbox, feats, shapes, padding_mask, attn_mask, pos_mlp(refer_bbox))
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/transformer.py", line 319, in forward
    tgt = self.cross_attn(self.with_pos_embed(embed, query_pos), refer_bbox.unsqueeze(2), feats, shapes,
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/transformer.py", line 268, in forward
    output = multi_scale_deformable_attn_pytorch(value, value_shapes, sampling_locations, attention_weights)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/utils.py", line 59, in multi_scale_deformable_attn_pytorch
    value_l_ = (value_list[level].flatten(2).transpose(1, 2).reshape(bs * num_heads, embed_dims, H_, W_))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 2; 40.00 GiB total capacity; 35.54 GiB already allocated; 211.38 MiB free; 35.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Each epoch plus evaluation takes about one minute, and by epoch 120 the training had been running for two hours straight.
The issue I have is that usage was at 35 GB out of 40 GB, which should leave enough headroom for small unpredicted allocations.
(3x A40 @ 40 GB running the training.)

There was nothing else running on the server.

Here are some metrics from wandb

[Image: GPU memory metrics from Weights & Biases]

At the very end there is a peak, but I'm unsure what it is, or whether it was caused by the crash itself.

@glenn-jocher
Member

@ExtReMLapin hi there,

Apologies for the delayed response. It does appear that YOLOv8's memory usage is less stable than YOLOv5's. While YOLOv5 may have hidden memory allocations that are not visible in nvidia-smi because PyTorch reserves memory, YOLOv8 seems to require additional memory reservations deep into training.

Starting the training with 35GB of VRAM out of 40GB may seem safe, but it eventually crashes after running for a couple of hours. The error message you shared indicates that there was a CUDA Out Of Memory (OOM) error at epoch 129 when allocating additional memory.

One potential solution is to reduce the batch size and/or image size during training to lower the memory requirements. Another option is to increase the available GPU memory if you have a larger GPU or consider adjusting the max_split_size_mb parameter to optimize memory allocation and avoid fragmentation.

From the Wandb metrics you shared, there is a peak at the very end, but it's unclear whether it caused the crash or is a result of the crash itself.

Please let us know if you have any further questions or if there's anything else we can assist you with.

@github-actions

github-actions bot commented Aug 1, 2023

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

@github-actions github-actions bot added the Stale label Aug 1, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 12, 2023
@thurnbauermatthi

I have a similar issue. I have access to two NVIDIA A5000 GPUs with 24GB each. I use the pretrained yolov8m model. I get an OOM for an image size of 640 and a batch size of 4(!!). I am using cached images on disk for the training, but the error message says:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.46 GiB (GPU 0; 23.69 GiB total capacity; 16.35 GiB already allocated; 4.53 GiB free; 18.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Why does PyTorch reserve 18 GB? That seems like too much.

@glenn-jocher
Member

@thurnbauermatthi hi there,

The PyTorch CUDA out of memory error you're encountering generally indicates that your GPU does not have enough memory to store the computations needed for training your model. In your case, with the pretrained YOLOv8m model, image size of 640 and a batch size of 4 on NVIDIA A5000 GPUs, it seems like PyTorch is indeed trying to reserve more memory than what's available on your GPUs.

The memory reserved by PyTorch isn't just for the model parameters. It's also used to store intermediate variables for backpropagation, optimizer states, and more. Moreover, PyTorch tends to cache memory to avoid the cost of allocating and deallocating memory every time it’s needed.

The memory requirement can increase with the complexity of the model, the size of the input data, and the batch size. While batch size and input size are more direct factors because they affect how much data is loaded into memory at once, the model complexity indirectly affects the memory requirement through the computational graph used for backpropagation.

Here are a few solutions you could try:

  1. Reduce the batch size: You've already mentioned your batch size is 4. You could try reducing it further, even though that could potentially affect your training dynamics.

  2. Gradual Freezing: This involves freezing some layers of the model while training others, then alternating.

  3. Gradient Checkpointing: This is a trade-off strategy where you decrease memory usage by saving some intermediate variables and re-computing them during backward passes.

  4. Adjusting max_split_size_mb: PyTorch divides tensors into chunks to optimize memory usage, and this parameter tells PyTorch the maximum chunk size. You could try modifying it based on your requirements.

  5. Using PyTorch's memory utilities: torch.cuda.memory_summary() can be particularly useful for understanding how CUDA memory is being allocated and where it is running out (a short sketch follows below).
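
To illustrate point 5, a minimal sketch of the allocator inspection utilities (the device index is an assumption; adjust it to your setup):

    import torch

    device = 0  # GPU index; adjust to your setup

    # Memory currently held by tensors vs. memory cached by PyTorch's allocator
    print(f'allocated: {torch.cuda.memory_allocated(device) / 1e9:.2f} GB')
    print(f'reserved:  {torch.cuda.memory_reserved(device) / 1e9:.2f} GB')

    # Detailed breakdown; useful when reserved memory is much larger than allocated
    print(torch.cuda.memory_summary(device=device, abbreviated=True))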

I hope this information will help you! Let us know if there's anything else we could do to assist you.

@dt140120

@glenn-jocher
Same problem, with imgsz=640, batch=4, satellite images, trained on an RTX 2080 Ti (12 GB).

@glenn-jocher
Member

@dt140120 hello,

Thanks for reaching out. It appears you're experiencing an Out of Memory (OOM) error when attempting to train with a batch size of 4 and image size of 640 using RTX 2080 Ti (12GB).

The GPU memory requirement for training a model is primarily affected by input data size and model complexity. In this case, your batch size, model, and input image size are likely demanding more GPU memory than the available 12GB during training.

To resolve this issue, you could try several approaches:

  1. Reduce the batch size: While your current batch size is relatively small, reducing it further might help lessen the memory demand, though this could affect training dynamics.

  2. Decrease the image size: Reducing the image size will lower the memory footprint per image.

  3. Adjust max_split_size_mb: This can help optimize memory allocation.

  4. Use gradient accumulation: This lets you virtually increase your batch size without increasing memory usage by accumulating gradients over multiple mini-batches and performing a model update afterwards (a generic sketch is included below).

Please note that these solutions come with trade-offs and can potentially impact the accuracy of the model.
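
To illustrate point 4: the ultralytics trainer handles gradient accumulation internally, so the following is only a generic plain-PyTorch sketch of the idea, using a toy model so it runs standalone:

    import torch
    import torch.nn as nn

    # Toy model and data so the sketch is self-contained; substitute your own.
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]

    accumulation_steps = 4  # effective batch size = mini-batch size * 4

    optimizer.zero_grad()
    for step, (x, y) in enumerate(batches):
        loss = nn.functional.mse_loss(model(x), y)
        (loss / accumulation_steps).backward()  # scale so accumulated gradients average out
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one optimizer update per accumulation window
            optimizer.zero_grad()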

I hope this helps, and do let us know if you have further questions!
