NaN reported in box_loss, cls_loss and dfl_loss when training on a custom dataset #280
Comments
@classico09 can you share your training command?
Here is my command:
@classico09 hi. Is your performance fine with yolov5? Can you run the same command with
Thank you. I tried it but it still doesn't work. I tried with the yolov5 model from the yolov5 repository and it worked, so I think it is not because of the dataset.
Hello, I ran into the same issue. I successfully completed training using the yolov5 environment (mAP 0.907). Starting from that yolov5 environment, I quickly completed the installation with pip install ultralytics following the documentation (I want to train yolov8 and compare it with yolov5 to see the effect). #283 yolo task=init --config-name helmethyp.yaml --config-path /nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8/
I also encountered the same problem. I found it could be worked around by turning down the batch size, but I don't know why that is, and training then becomes very slow with very low GPU utilization.
I have the same/similar problem. When I run the same command with
Hi all.
@pepijnob @duynguyen1907 @jiyuwangbupt hey guys, can you try replacing the following line?
@Laughing-q I tried your suggestion in version 8.0.4 and in 8.0.5, and both times my loss went to NaN in the first epoch. When I just updated to 8.0.5 without your suggestion, I get the same as before, with the loss not going down (on the same dataset where yolov5 did work).
I am seeing the same logs with the default command, i.e. with coco128.yaml, just to do some testing, and I get the same results:
I also met the same problem; it seems cls_loss suddenly becomes NaN, and then all losses are NaN.
The NaN loss issue has been solved in PR #490, which we'll merge later today. :)
@Laughing-q I am still getting NaN in my training. It seems it is solved for validation. After running:
My dataset is
@hdnh2006 it's not merged yet. The update will be available later today.
@AyushExel Thanks Ayush, you are awesome as always!!
I have this issue even on the newest update. I tried NVIDIA drivers 525 (CUDA 12) and 470 (CUDA 11.4), with the ultralytics Docker image, etc.
In my experiments, it is sometimes fixed by replacing the dataset, updating PyTorch to the latest version, or changing the GPU type. It does not seem to be a model problem but an AMP problem. If you don't want to try the operations above, you can disable AMP and use only FP32, i.e. force everything to FP32 and disable autocast.
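For example, with the Ultralytics Python API the whole run can be forced to FP32 via the `amp=False` training argument (a minimal sketch; the dataset path and hyperparameters are placeholders):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# amp=False disables Automatic Mixed Precision, so the forward/backward pass
# stays entirely in FP32 instead of using autocast/FP16.
model.train(data="coco128.yaml", epochs=100, imgsz=640, amp=False)
```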
You are right, the problem is with the GPU. Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650, 3912MiB) -- not working.
This is totally true. I have an RTX 2060 Super and I get the following logs: Meanwhile, on my laptop with a GTX 1650, the logs look like the following (with an important warning): I have the same versions of PyTorch on both computers: RTX2060:
GTX1650:
New important EDIT: If I try the training on my laptop with the GTX 1650 but using the CPU, I don't get any NaN values. So clearly, there's a compatibility problem with this GPU.
I think this is the same problem as in yolov5: ultralytics/yolov5#7908
I had the same errors in both yolov8 and yolov5, but I found a similar bug report for yolov5 that suggested disabling AMP with amp=False in train.py, which fixes box_loss and obj_loss going to NaN. The other suggested fix, for validation not working, was that in train.py validation runs in half precision (half=amp in the validator() call, val in this thread); force-assigning half=False fixed my problem for training on yolov5, and training resumed as usual using CUDA 11.7 with an NVIDIA T1200 Laptop GPU (Compute Capability 7+). It could be a problem with AMP in CUDA, since users in that thread also had issues with AMP on CUDA 11.x and solved them by reverting to CUDA 10.x. Perhaps mirroring the fix found in that thread might help? I can't find the equivalent variables to change in train.py and was wondering where they were moved to in v8. Thread for reference:
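For context, here is roughly what those two switches amount to in a generic PyTorch training step. This is an illustrative sketch with assumed variable names, not the actual YOLOv5/YOLOv8 source:

```python
import torch

use_amp = False  # the effect of amp=False / half=False: keep everything in FP32

scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, images, targets, loss_fn, optimizer):
    optimizer.zero_grad()
    # With enabled=False, autocast is a no-op and no FP16 casts happen.
    with torch.cuda.amp.autocast(enabled=use_amp):
        preds = model(images)
        loss = loss_fn(preds, targets)
    scaler.scale(loss).backward()  # GradScaler is also a pass-through when disabled
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```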
@mithilnettyfy hey there! Thank you for reaching out to us. We apologize for the inconvenience you have faced while training YOLOv8 on your NVIDIA GTX 1650 GPU. The issue you are facing could be related to a compatibility issue with your GPU or with the use of Automatic Mixed Precision (AMP). We recommend trying the following solutions:
If neither of these solutions works, we recommend checking the compatibility of your GTX 1650 GPU with the CUDA version you are using. Some users have reported issues with AMP in CUDA 11.x and have solved the problem by reverting to CUDA 10.x. Please let us know if this helps resolve your issue or if you have any further questions.
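A quick way to check which CUDA build and GPU compute capability PyTorch actually sees (a small diagnostic sketch, not part of the original reply):

```python
import torch

print(torch.__version__, torch.version.cuda)   # PyTorch build and the CUDA version it was compiled against
print(torch.cuda.is_available())                # True if a usable CUDA device was found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # e.g. "NVIDIA GeForce GTX 1650"
    print(torch.cuda.get_device_capability(0))  # compute capability, e.g. (7, 5) for Turing cards
```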
Hey @glenn-jocher, thank you so much for helping to resolve this issue. My program is working, but your second solution is not. Could you please describe what exactly the second point is? There is no autocast=False argument: https://docs.ultralytics.com/modes/train/#arguments
Thank you in advance for your help. I appreciate it. I am still getting 0 values for Box(P R mAP50 mAP50-95).
Hello, I just faced the same problem on the AutoDL platform using YOLOv5. I solved it by cloning the latest version of YOLOv5 rather than using the YOLOv5 provided by the platform. I hope this tip can help you.
@Chi-XU-Sean hello, thank you for sharing your experience. This platform-specific issue seems to be related to the version of YOLOv5 provided by the AutoDL platform. To resolve it, cloning the latest version of YOLOv5 directly from the official repository should ensure that you are using the most up-to-date and bug-free version. I hope this solution works for you. Let me know if you have any further questions or concerns. Best regards.
@mithilnettyfy @glenn-jocher did you find a solution for this? Please discuss. I am getting 0 values for Box(P R mAP50 mAP50-95).
Hi @priyakanabar-crest, yes I solved this. Can you please tell me which GPU you use for training?
Hello @mithilnettyfy, I am using an NVIDIA GeForce GTX 1650. As you said, Box(P R mAP50 mAP50-95) is showing 0 for me even though I have set amp=False.
@priyakanabar-crest
@mithilnettyfy I am trying this. Thank you so much for your reply.
@priyakanabar-crest does it work?
No, it does not work, @mithilnettyfy.
@priyakanabar-crest can you please share your training code?
if __name__ == '__main__':
@mithilnettyfy this is what I am using.
@mithilnettyfy just for your information, it works fine with yolov8s.pt but not with yolov8m.pt; I am not able to understand why.
Hello @mithilnettyfy, thanks for sharing additional details regarding the issue. It's great to hear that it's working as expected with 'yolov8s.pt'.

Different models like 'yolov8s.pt' and 'yolov8m.pt' differ in size, layers, and potentially training regimen, which can lead some models to perform better on certain datasets than others. Issues like the one you're facing with 'yolov8m.pt' could be due to various factors such as data-related issues (e.g., small object size, low-resolution images, class imbalance) or specific model characteristics. It could also be related to GPU memory, since different models have different memory and compute requirements.

If adjusting parameters (like image size, batch size, etc.) or trying different models does not solve the problem, it might be beneficial to review your data. Verify that your annotations are correct and check for class imbalance in your dataset. Also, try to ensure that your dataset has diverse and representative samples of the objects YOLOv8 should detect.

Please let us know if you have any further questions or continue to encounter problems. We appreciate your collaboration and are eager to assist you in resolving this issue. Best,
This issue is present in the latest version when using mps on a Mac M3 through the hub for various (detect) models. Additionally, when using mps, the box_loss and dfl_loss are always zero. Switching to CPU training resolves these issues.
@deKeijzer hey there! 👋 Thanks for bringing this to our attention. Indeed, using MPS on a Mac M3 has shown some unique behaviors with our detect models, including the box_loss and dfl_loss being consistently zero. This seems to be an issue specific to the MPS backend. For now, reverting to CPU training, as you discovered, bypasses these problems. We'll look into what's causing these discrepancies with MPS to find a solution. For users facing similar issues, here's a quick way to switch to CPU training:
from ultralytics import YOLO
model = YOLO('yolov8n.pt')
model.train(data='your_dataset.yaml', device='cpu')  # specify device as 'cpu'
We appreciate your patience and contributions to improving YOLOv8! Stay tuned for updates. 🚀
Same problem on Mac M1. Thanks.
@FedeMorenoOptima hi there! 👋 It seems like the issue you're experiencing on the Mac M1 with YOLOv8 is noted. For now, a workaround is to train on the CPU to circumvent this problem. Here's a quick way to do it:
from ultralytics import YOLO
model = YOLO('yolov8n.pt')
model.train(data='your_dataset.yaml', device='cpu')  # force training on the CPU
We're on it to fix this MPS backend issue. Your patience and support are much appreciated!
Just for reference, disabling AMP worked in Ubuntu Linux 22.04, 16 GB RAM, AMD Ryzen 7 3700X, NVIDIA GeForce GTX 1660 Ti, Python 3.10.12, PyTorch 2.0.0+cu117. Thanks @mithilnettyfy !
Yes, it's working on Ubuntu as well, @thiagodsd.
I don't think the issue is about the GPU. I use Google Colab for training and have access to a T4 GPU but still get the issue.
@janelyd thank you for the detailed information. It appears that the issue might not be solely related to the GPU type but could also involve other factors such as AMP settings or specific configurations. To help us investigate further, could you please ensure you are using the latest versions of YOLOv8 and PyTorch? Additionally, try disabling AMP and running the training again. If the issue persists, please share any additional logs or warnings you encounter. This will help us pinpoint the problem more accurately.
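A quick way to confirm the installed versions and environment (a small sketch using the package's built-in checks helper):

```python
import torch
import ultralytics

print("ultralytics:", ultralytics.__version__)
print("torch:", torch.__version__, "cuda:", torch.version.cuda)

# Prints an environment summary (OS, Python, CUDA, GPU) useful for bug reports.
ultralytics.checks()
```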
@pderrenger I'm using the latest versions of YOLOv9 and PyTorch. I just had the same issue: box_loss returns NaN (obj_loss also returns NaN).
Thank you for the update, @janelyd. It's helpful to know that disabling AMP worsened the issue and that using
@xyrod6 lowering the batch size can sometimes help with NaN issues, but if the model isn't learning properly, consider checking your dataset for any issues or adjusting learning rates and other hyperparameters.
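For instance, a hedged sketch of lowering the batch size and initial learning rate with the Ultralytics API (the dataset path and values are placeholders to tune for your own data):

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")

# A smaller batch and a lower initial learning rate (lr0) can stabilize training
# when losses blow up to NaN; amp=False keeps the run in FP32 as discussed above.
model.train(data="data.yaml", epochs=100, imgsz=640, batch=8, lr0=0.005, amp=False)
```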
Search before asking
YOLOv8 Component
Training
Bug
Hello, I am a newbie in computer vision. I just started to try the new version, YOLOv8, and I get some errors when looking at the results.
It seems like something is wrong, but I don't know how to fix it. Can you give me some suggestions?
Environment
-YOLOv8n
-CUDA: 11.6
-Ultralytics YOLOv8.0.4
-OS: Windows 10
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?