
CUDA out of memory issue #1654

Closed
agupta582 opened this issue Mar 27, 2023 · 11 comments
Labels
bug Something isn't working non-reproducible Bug is not reproducible Stale

Comments

@agupta582

Search before asking

  • I have searched the YOLOv8 issues and found no similar bug report.

YOLOv8 Component

Training

Bug

I have run into the following RuntimeError several times, with every model I pick from the Ultralytics package. The error snippet below is from the yolov5n.pt model. Interestingly, when I train the same model directly from the YOLOv5 GitHub repository [https://github.com/ultralytics/yolov5], it works perfectly without any error; I was even able to train the larger yolov5l.pt model with no issues. So it appears to be a memory-management problem in the Ultralytics package. Please look into it.

RuntimeError: CUDA out of memory. Tried to allocate 572.00 MiB (GPU 0; 23.99 GiB total capacity; 21.62 GiB already allocated; 0 bytes free; 22.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Environment

-- Ultralytics YOLOv8.0.57 Python-3.9.12 torch-1.12.0 CUDA:0 (NVIDIA RTX A5000, 24564MiB)
-- OS Ubuntu 22.04

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@agupta582 agupta582 added the bug Something isn't working label Mar 27, 2023
@github-actions

github-actions bot commented Mar 27, 2023

👋 Hello @ashishgupta582, thank you for your interest in YOLOv8 🚀! We recommend a visit to the YOLOv8 Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Install

Pip install the ultralytics package including all requirements in a Python>=3.7 environment with PyTorch>=1.7.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher glenn-jocher added the non-reproducible Bug is not reproducible label Mar 27, 2023
@glenn-jocher
Member

glenn-jocher commented Mar 27, 2023

@ashishgupta582
👋 Hello! Thanks for asking about CUDA memory issues. YOLOv5 🚀 can be trained on CPU, single-GPU, or multi-GPU. When training on GPU it is important to keep your batch-size small enough that you do not use all of your GPU memory, otherwise you will see a CUDA Out Of Memory (OOM) Error and your training will crash. You can observe your CUDA memory utilization using either the nvidia-smi command or by viewing your console output:

[Screenshot: console training output showing GPU memory utilization]

CUDA Out of Memory Solutions

If you encounter a CUDA OOM error, the steps you can take to reduce your memory usage are listed below (a short usage sketch follows the list):

  • Reduce --batch-size
  • Reduce --img-size
  • Reduce model size, i.e. from YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s -> YOLOv5n
  • Train with multi-GPU at the same --batch-size
  • Upgrade your hardware to a larger GPU
  • Train on free GPU backends with up to 16GB of CUDA memory (Google Colab, Kaggle)
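
For the ultralytics package specifically, batch size, image size, and model size all map onto arguments of the Python API. A minimal sketch, assuming the standard ultralytics training call (the dataset name and the numeric values below are placeholders, not tuned recommendations):

    from ultralytics import YOLO

    # Pick a smaller model variant to lower memory pressure (n < s < m < l < x)
    model = YOLO('yolov8n.pt')

    # Reduce batch and imgsz until training fits in GPU memory.
    # 'coco128.yaml' is a placeholder dataset config; substitute your own.
    model.train(data='coco128.yaml', epochs=100, imgsz=512, batch=8, device=0)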

AutoBatch

You can use YOLOv5 AutoBatch (NEW) to find the best batch size for your training by passing --batch-size -1. AutoBatch will solve for a 90% CUDA memory-utilization batch-size given your training settings. AutoBatch is experimental, and only works for Single-GPU training. It may not work on all systems, and is not recommended for production use.
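
For reference, a minimal sketch of both invocations; the YOLOv5 CLI flag is shown as a comment, and AutoBatch behaviour in the ultralytics package may vary by version, so treat this as a sketch rather than a guarantee:

    # YOLOv5 repository (CLI):
    #   python train.py --img 640 --batch-size -1 --data coco128.yaml --weights yolov5s.pt

    # ultralytics package (Python API): batch=-1 requests AutoBatch on a single GPU
    from ultralytics import YOLO

    model = YOLO('yolov8n.pt')
    model.train(data='coco128.yaml', imgsz=640, batch=-1, device=0)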

[Screenshot: AutoBatch console output]

Good luck 🍀 and let us know if you have any other questions!

@ExtReMLapin
Contributor

While YOLOv5 seemed to have very stable memory usage, YOLOv8 doesn't appear to share this benefit.

Even when running Ray optimization, I set the batch size low enough to use only 20 GB of VRAM out of my 40 GB available, yet it still managed to crash (OOM) after I let it run for a few hours.

@glenn-jocher
Member

@ExtReMLapin hello! We're sorry to hear that you're encountering memory-related issues while running YOLOv8 👎. A CUDA Out Of Memory (OOM) error typically means there is not enough free GPU memory for a specific operation. To help avoid this, we recommend lowering your batch size and/or reducing your training image size. If that does not work, running on a GPU with more memory is another option.

Another option you might want to consider is tuning the max_split_size_mb setting of PyTorch's CUDA caching allocator, which can reduce memory fragmentation when reserved memory is much larger than allocated memory. You can find details in the official PyTorch memory management documentation.
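
As a concrete illustration, max_split_size_mb is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable and must be set before the first CUDA allocation. A minimal sketch (the 128 MiB value is an arbitrary starting point, not a tuned recommendation):

    import os

    # Equivalent to: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
    # Must be set before PyTorch initializes its CUDA caching allocator,
    # i.e. before the first tensor is placed on the GPU.
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

    from ultralytics import YOLO

    model = YOLO('yolov8n.pt')  # placeholder model and dataset; substitute your own
    model.train(data='coco128.yaml', imgsz=640, batch=8, device=0)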

We hope that helps! Let us know if you have any additional questions or run into any other issues.

@ExtReMLapin
Contributor

Sorry for the delayed answer. I think the issue is that the memory usage is unstable. Hear me out: with YOLOv5 the memory usage was very stable; there are internal allocations that are invisible in nvidia-smi because PyTorch reserves memory, but with YOLOv8 it seems that, for some reason, deep into the training it needs to reserve more memory.

Starting a training at 35 GB used out of 40 GB sounds safe, but crashing two hours later is frustrating.

Please take a look at the following error:

      Epoch    GPU_mem  giou_loss   cls_loss    l1_loss  Instances       Size
   128/3500      34.4G       1.64     0.3032     0.5341        219       2048: 100%|██████████| 82/82 [00:53<00:00,  1.53it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 36/36 [00:09<00:00,  3.66it/s]
                   all        491      14399     0.0357       0.26     0.0251    0.00567

      Epoch    GPU_mem  giou_loss   cls_loss    l1_loss  Instances       Size
   129/3500      34.2G      1.697     0.2921     0.5755        134       2048:  80%|████████  | 66/82 [00:42<00:10,  1.55it/s]Traceback (most recent call last):
  File "/home/CENSORED/.config/Ultralytics/DDP/_temp_ajvj9jgl139928182858336.py", line 9, in <module>
    trainer.train()
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/yolo/engine/trainer.py", line 192, in train
    self._do_train(world_size)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/yolo/engine/trainer.py", line 332, in _do_train
    self.loss, self.loss_items = self.model(batch)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 44, in forward
    return self.loss(x, *args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 434, in loss
    preds = self.predict(img, batch=targets) if preds is None else preds
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 478, in predict
    x = head([y[j] for j in head.f], batch)  # head inference
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/head.py", line 242, in forward
    dec_bboxes, dec_scores = self.decoder(embed,
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/transformer.py", line 359, in forward
    output = layer(output, refer_bbox, feats, shapes, padding_mask, attn_mask, pos_mlp(refer_bbox))
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/transformer.py", line 319, in forward
    tgt = self.cross_attn(self.with_pos_embed(embed, query_pos), refer_bbox.unsqueeze(2), feats, shapes,
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/transformer.py", line 268, in forward
    output = multi_scale_deformable_attn_pytorch(value, value_shapes, sampling_locations, attention_weights)
  File "/opt/ultralytics/venv/lib/python3.10/site-packages/ultralytics/nn/modules/utils.py", line 59, in multi_scale_deformable_attn_pytorch
    value_l_ = (value_list[level].flatten(2).transpose(1, 2).reshape(bs * num_heads, embed_dims, H_, W_))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 2; 40.00 GiB total capacity; 35.54 GiB already allocated; 211.38 MiB free; 35.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Each epoch plus evaluation takes about one minute, and by epoch 120 the training had been running for two hours straight.
The issue I have is that usage was at 35 GB out of 40 GB, which should leave enough headroom for small unpredicted allocations.
(3x A40 @ 40 GB running the training.)

There was nothing else running on the server.

Here are some metrics from wandb

[Image: GPU memory metrics from Weights & Biases]

At the very end there is a peak, but I'm unsure what it is, or whether it was caused by the crash itself.

@glenn-jocher
Member

@ExtReMLapin hi there,

Apologies for the delayed response. It does appear that YOLOv8's memory usage is less stable than YOLOv5's. While YOLOv5 may have hidden memory allocations that are not visible in nvidia-smi because PyTorch reserves memory, YOLOv8 seems to require additional memory reservations deep into training.

Starting the training with 35GB of VRAM out of 40GB may seem safe, but it eventually crashes after running for a couple of hours. The error message you shared indicates that there was a CUDA Out Of Memory (OOM) error at epoch 129 when allocating additional memory.

One potential solution is to reduce the batch size and/or image size during training to lower the memory requirements. Another option is to increase the available GPU memory if you have a larger GPU or consider adjusting the max_split_size_mb parameter to optimize memory allocation and avoid fragmentation.

From the Wandb metrics you shared, there is a peak at the very end, but it's unclear whether it caused the crash or is a result of the crash itself.

Please let us know if you have any further questions or if there's anything else we can assist you with.

@github-actions

github-actions bot commented Aug 1, 2023

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

@github-actions github-actions bot added the Stale label Aug 1, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 12, 2023
@thurnbauermatthi

I have a similar issue. I have access to two NVIDIA A5000 GPUs with 24GB each. I use the pretrained yolov8m model. I get an OOM for an image size of 640 and a batch size of 4(!!). I am using cached images on disk for the training, but the error message says:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.46 GiB (GPU 0; 23.69 GiB total capacity; 16.35 GiB already allocated; 4.53 GiB free; 18.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Why does PyTorch reserve 18 GB? That seems like too much.

@glenn-jocher
Member

@thurnbauermatthi hi there,

The PyTorch CUDA out of memory error you're encountering generally indicates that your GPU does not have enough memory to store the computations needed for training your model. In your case, with the pretrained YOLOv8m model, image size of 640 and a batch size of 4 on NVIDIA A5000 GPUs, it seems like PyTorch is indeed trying to reserve more memory than what's available on your GPUs.

The memory reserved by PyTorch isn't just for the model parameters. It's also used to store intermediate variables for backpropagation, optimizer states, and more. Moreover, PyTorch tends to cache memory to avoid the cost of allocating and deallocating memory every time it’s needed.

The memory requirement can increase with the complexity of the model, the size of the input data, and the batch size. While batch size and input size are more direct factors because they affect how much data is loaded into memory at once, the model complexity indirectly affects the memory requirement through the computational graph used for backpropagation.

Here are a few solutions you could try:

  1. Reduce the batch size: You've already mentioned your batch size is 4. You could try reducing it further, even though that could potentially affect your training dynamics.

  2. Gradual Freezing: This involves freezing some layers of the model while training others, then alternating.

  3. Gradient Checkpointing: This is a trade-off strategy where you decrease memory usage by saving some intermediate variables and re-computing them during backward passes.

  4. Adjusting max_split_size_mb: PyTorch divides tensors into chunks to optimize memory usage, and this parameter tells PyTorch the maximum chunk size. You could try modifying it based on your requirements.

  5. Using PyTorch's memory utilities: torch.cuda.memory_summary() can be particularly useful for understanding how CUDA memory is being allocated and where it is running out (a short sketch follows below).
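
To illustrate point 5, a minimal sketch of the allocator inspection utilities (the device index is an assumption; adjust it to your setup):

    import torch

    device = 0  # GPU index; adjust to your setup

    # Memory currently held by tensors vs. memory cached by PyTorch's allocator
    print(f'allocated: {torch.cuda.memory_allocated(device) / 1e9:.2f} GB')
    print(f'reserved:  {torch.cuda.memory_reserved(device) / 1e9:.2f} GB')

    # Detailed breakdown; useful when reserved memory is much larger than allocated
    print(torch.cuda.memory_summary(device=device, abbreviated=True))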

I hope this information will help you! Let us know if there's anything else we could do to assist you.

@dt140120

@glenn-jocher
Same problem, with imgsz=640, batch=4, satellite images, trained on an RTX 2080 Ti (12 GB).

@glenn-jocher
Member

@dt140120 hello,

Thanks for reaching out. It appears you're experiencing an Out of Memory (OOM) error when attempting to train with a batch size of 4 and image size of 640 using RTX 2080 Ti (12GB).

The GPU memory requirement for training a model is primarily affected by input data size and model complexity. In this case, your batch size, model, and input image size are likely demanding more GPU memory than the available 12GB during training.

To resolve this issue, you could try several approaches:

  1. Reduce the batch size: While your current batch size is relatively small, reducing it further might help lessen the memory demand, though this could affect training dynamics.

  2. Decrease the image size: Reducing the image size will lower the memory footprint per image.

  3. Adjust max_split_size_mb: This can help optimize memory allocation.

  4. Use gradient accumulation: This lets you virtually increase your batch size without increasing memory usage by accumulating gradients over multiple mini-batches and performing a model update afterwards (a generic sketch is included below).

Please note that these solutions come with trade-offs and can potentially impact the accuracy of the model.
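
To illustrate point 4: the ultralytics trainer handles gradient accumulation internally, so the following is only a generic plain-PyTorch sketch of the idea, using a toy model so it runs standalone:

    import torch
    import torch.nn as nn

    # Toy model and data so the sketch is self-contained; substitute your own.
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]

    accumulation_steps = 4  # effective batch size = mini-batch size * 4

    optimizer.zero_grad()
    for step, (x, y) in enumerate(batches):
        loss = nn.functional.mse_loss(model(x), y)
        (loss / accumulation_steps).backward()  # scale so accumulated gradients average out
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one optimizer update per accumulation window
            optimizer.zero_grad()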

I hope this helps, and do let us know if you have further questions!
