
NaN reported in box_loss, cls_loss and dfl_loss when training a custom dataset #280

Closed
1 of 2 tasks
duynguyen1907 opened this issue Jan 12, 2023 · 54 comments · Fixed by #490
Labels
bug (Something isn't working), Stale (scheduled for closing soon)

Comments

@duynguyen1907

Search before asking

  • I have searched the YOLOv8 issues and found no similar bug report.

YOLOv8 Component

Training

Bug

Hello, I am a newbie in computer vision. I just started trying the new version, YOLOv8, and I get some errors in the results.
It seems like something is wrong, but I don't know how to fix it. Can you give me some suggestions?

bug

Environment

- Model: YOLOv8n
- CUDA: 11.6
- Ultralytics: YOLOv8.0.4
- OS: Windows 10

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@duynguyen1907 duynguyen1907 added the bug Something isn't working label Jan 12, 2023
@Laughing-q
Member

@classico09 can you share your training command?

@duynguyen1907
Author

@classico09 can you share your training command?

Here is my command:
yolo task=detect mode=train model=yolov8n.yaml data="./data/dataset.yaml" epochs=100 batch=20 device='0' workers=4
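(For reference, roughly the same run through the Python API would look like the sketch below; the argument names mirror the CLI flags above and the dataset path is the one from the command.)

from ultralytics import YOLO

# Build the model from its yaml, as in the CLI command above
model = YOLO('yolov8n.yaml')
model.train(data='./data/dataset.yaml', epochs=100, batch=20, device=0, workers=4)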

@AyushExel
Contributor

@classico09 hi. Is your performance fine with yolov5? Can you run the same command with the v5loader=True flag to check?
If that doesn't work, then your dataset probably has some problem.

@duynguyen1907
Author

@classico09 hi. Is your performance fine with yolov5? Can you run the same command with v5loader=True flag to check. If that doesn't work then probably your dataset might have some problem

Thank you. I tried it but it still doesn't work. I tried the yolov5 model from the yolov5 repository and it works, so I think it is not because of the dataset.

@jiyuwangbupt

jiyuwangbupt commented Jan 12, 2023

@classico09 hi. Is your performance fine with yolov5? Can you run the same command with v5loader=True flag to check. If that doesn't work then probably your dataset might have some problem

Hello, I met the same issue. I successfully completed training using the yolov5 environment (mAP is 0.907). Starting from the yolov5 environment, I quickly completed the installation with pip install ultralytics according to the documentation (I want to train yolov8 and compare it with yolov5 to see the effect). #283

yolo task=init --config-name helmethyp.yaml --config-path /nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8/
yolo task=detect mode=train model=yolov8n.yaml data=/nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8/helmet640.yaml device=0 batch=20 workers=0 --config-name=helmethyp.yaml --config-path=/nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8

@M15-3080

I also encountered the same problem, but I found that it could be worked around by turning down the batch size. I don't know why that is, and training is very slow, with very low GPU utilization.

@pepijnob

@classico09 hi. Is your performance fine with yolov5? Can you run the same command with v5loader=True flag to check. If that doesn't work then probably your dataset might have some problem

I have the same/similar problem. When I run the same command with v5loader=True I get: KeyError: 'masks'. However, I can run the same dataset with v5loader=False but get very bad results (high loss/no predictions). I ran the same dataset with the YOLOv5 repository and got good results.

@AyushExel
Contributor

Hi all.
Your issue might've been solved in a PR by @Laughing-q and will be available in the package in a few hours.

@AyushExel AyushExel mentioned this issue Jan 12, 2023
2 tasks
@Laughing-q
Member

@pepijnob @duynguyen1907 @jiyuwangbupt hey guys, can you try replacing the following line with self.optimizer.step() and restart training to check if the losses are good? Thanks!

self.scaler.step(self.optimizer)
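(For context on what that swap does: GradScaler.step() skips the parameter update when the scaled gradients contain inf/NaN, whereas a plain optimizer.step() applies them unconditionally. The snippet below is a generic mixed-precision step written purely as an illustration; it is not the ultralytics trainer code.)

import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, loss_fn, images, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(images), targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)   # bring gradients back to their true (FP32) scale
    # AMP path: skips the update if any gradient is inf/NaN, then adjusts the loss scale
    # scaler.step(optimizer)
    # Debug path discussed above: apply the gradients unconditionally
    optimizer.step()
    scaler.update()              # keep the scaler state consistent for the next iteration
    return loss.detach()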

@pepijnob

pepijnob commented Jan 13, 2023

@Laughing-q I tried your suggestion in version 8.0.4 and in 8.0.5, and both times my loss went to nan in the first epoch. When I just updated to 8.0.5 without your suggestion I get the same as before, with the loss not going down (on the same dataset where yolov5 did work).

@hdnh2006
Contributor

I am seeing the same logs with the default command:
yolo task=detect mode=train model=yolov8s.pt batch=4

I mean, with coco128.yaml, just to do some testing, and I get the same results:

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100      1.51G        nan        nan        nan         71        640: 100%|██████████| 32/32 [00:36<00:00,  1.13s/it]
/home/henry/.local/bin/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:08<00:00,  1.94it/s]
                   all        128        929          0          0          0          0

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      2/100      2.13G        nan        nan        nan         51        640: 100%|██████████| 32/32 [00:34<00:00,  1.07s/it]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:08<00:00,  1.92it/s]
                   all        128        929          0          0          0          0

@nikbobo

nikbobo commented Jan 15, 2023

I also have the same problem; cls_loss suddenly becomes NaN, and then all losses are NaN.

@Laughing-q
Member

@nikbobo @hdnh2006 hi guys, may I ask if there are corrupt labels in your datasets?
Actually I just found that we get a mismatch issue if there are corrupt labels in the dataset, so if you have corrupt labels then your issue could be solved by this PR. #460
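(If you want to sanity-check your own dataset for corrupt labels, a quick hypothetical helper along these lines can flag suspicious files; it assumes the standard YOLO detection txt format, a class id plus four normalised xywh values per row, and is not how the ultralytics scanner works internally.)

from pathlib import Path

def find_suspect_labels(label_dir):
    """Flag YOLO-format detection label files with malformed rows or out-of-range coords."""
    suspects = []
    for txt in Path(label_dir).glob('*.txt'):
        for lineno, line in enumerate(txt.read_text().splitlines(), start=1):
            parts = line.split()
            if not parts:
                continue  # blank line
            try:
                values = [float(v) for v in parts]
            except ValueError:
                suspects.append((txt.name, lineno, 'non-numeric value'))
                continue
            # expect class id + 4 normalised coords, each coord in [0, 1]
            if len(values) < 5 or any(v < 0 or v > 1 for v in values[1:5]):
                suspects.append((txt.name, lineno, line))
    return suspects

for item in find_suspect_labels('datasets/coco128/labels/train2017'):
    print(item)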

@Laughing-q Laughing-q linked a pull request Jan 19, 2023 that will close this issue
@Laughing-q
Member

The NaN loss issue has been solved in PR #490, which we'll merge later today. :)

@hdnh2006
Contributor

@Laughing-q I am still getting nan in my training. It seems it is solved for validation:
image

After running pip install --upgrade ultralytics I get the following:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: ultralytics in ********/.virtualenvs/ultralytics/lib/python3.8/site-packages (8.0.10)

@hdnh2006
Contributor

@nikbobo @hdnh2006 hi guys, may I ask if there're corrupt labels in your datasets? actually I just found that we got a mismatch issue if there's corrupt label in datasets, so if you guys got corrupt labels then your issues could be solved by this PR. #460

My dataset is coco128.yaml, so I am just using the default parameters for testing.

@AyushExel
Contributor

@hdnh2006 it's not merged yet. The update will be available later today

@hdnh2006
Contributor

@AyushExel Thanks Ayush, you are awesome as always!!

@kosmicznemuchomory123pl

I have this issue even on the newest update.

$ yolo detect train data=coco128.yaml model=yolov8n.pt epochs=100 imgsz=640 batch=4
Ultralytics YOLOv8.0.11 🚀 Python-3.10.6 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650 Ti, 3912MiB)
yolo/engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=coco128.yaml, epochs=100, patience=50, batch=4, imgsz=640, save=True, cache=False, device=, workers=8, project=None, name=None, exist_ok=False, pretrained=False, optimizer=SGD, verbose=False, seed=0, deterministic=True, single_cls=False, image_weights=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, hide_labels=False, hide_conf=False, vid_stride=1, line_thickness=3, visualize=False, augment=False, agnostic_nms=False, retina_masks=False, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=17, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, fl_gamma=0.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, hydra={'output_subdir': None, 'run': {'dir': '.'}}, v5loader=False, save_dir=runs/detect/train47

                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.Conv                  [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.Conv                  [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.C2f                   [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.Conv                  [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.C2f                   [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.Conv                  [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules.C2f                   [128, 128, 2, True]           
  7                  -1  1    295424  ultralytics.nn.modules.Conv                  [128, 256, 3, 2]              
  8                  -1  1    460288  ultralytics.nn.modules.C2f                   [256, 256, 1, True]           
  9                  -1  1    164608  ultralytics.nn.modules.SPPF                  [256, 256, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.Concat                [1]                           
 12                  -1  1    148224  ultralytics.nn.modules.C2f                   [384, 128, 1]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.Concat                [1]                           
 15                  -1  1     37248  ultralytics.nn.modules.C2f                   [192, 64, 1]                  
 16                  -1  1     36992  ultralytics.nn.modules.Conv                  [64, 64, 3, 2]                
 17            [-1, 12]  1         0  ultralytics.nn.modules.Concat                [1]                           
 18                  -1  1    123648  ultralytics.nn.modules.C2f                   [192, 128, 1]                 
 19                  -1  1    147712  ultralytics.nn.modules.Conv                  [128, 128, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.Concat                [1]                           
 21                  -1  1    493056  ultralytics.nn.modules.C2f                   [384, 256, 1]                 
 22        [15, 18, 21]  1    897664  ultralytics.nn.modules.Detect                [80, [64, 128, 256]]          
Model summary: 225 layers, 3157200 parameters, 3157184 gradients, 8.9 GFLOPs

Transferred 355/355 items from pretrained weights
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias
train: Scanning /home/karol/Projekty/yolov8/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Scanning /home/karol/Projekty/yolov8/datasets/coco128/labels/train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs/detect/train47
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100      1.07G        nan        nan        nan         71        640: 100%|██████████| 32/32 [00:10<00:00,  3.04it/s]
/home/karol/Projekty/yolov8/yolov8-venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:02<00:00,  5.95it/s]
                   all        128        929      0.697     0.0679     0.0821     0.0503

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      2/100      1.69G        nan        nan        nan         51        640: 100%|██████████| 32/32 [00:09<00:00,  3.47it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 16/16 [00:02<00:00,  6.10it/s]
                   all        128        929      0.672     0.0729      0.082      0.051

I tried NVIDIA drivers 525 (CUDA 12) and 470 (CUDA 11.4), the ultralytics Docker image, etc.
My card is a GTX 1650 Ti Mobile.
Ubuntu 22.
I tried the generic command from the CLI and from Python (only lowering the batch size, because the card has only 4 GB of memory).
It seems better when I lower the batch size to 1-3 or switch to CPU.

@xyrod6

xyrod6 commented Jan 23, 2023

Same problem with the latest version (8.0.17)

When lowering the batch size the losses seem to be working, but the model is not learning properly.

Batch 16

Batch 8

@nikbobo

nikbobo commented Jan 23, 2023

Same problem with the latest version (8.0.17)

When lowering the batch size the losses seems to be working but the model is not learning properly.

Batch 16

Batch 8

In my experiments, this was sometimes fixed by replacing the dataset, updating PyTorch to the latest version, or changing the GPU type. It seems to be an AMP problem rather than a model problem. If you don't want to try the above, you can disable AMP and use FP32 only, by forcing everything to FP32 and turning off autocast.
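(A small illustration of what "force everything to FP32 and turn off autocast" means in a plain PyTorch loop; for a stock ultralytics run, the simpler route is the amp=False training argument that comes up later in this thread.)

import torch

model = torch.nn.Conv2d(3, 16, 3, padding=1).cuda().float()
images = torch.randn(4, 3, 640, 640, device='cuda')

# Mixed precision explicitly disabled: the forward pass stays entirely in FP32
with torch.cuda.amp.autocast(enabled=False):
    out = model(images.float())

print(out.dtype)  # torch.float32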

@xyrod6

xyrod6 commented Jan 23, 2023

Same problem with the latest version (8.0.17)
When lowering the batch size the losses seems to be working but the model is not learning properly.
Batch 16
Batch 8

In my experiment, I sometimes will be fixed by replacing the dataset, updating the PyTorch to latest version, or changing the gpu types. It seems not a model problem but is a AMP problem. If you don't want to try the above operation, you can try close AMP and only using FP32 by force all in FP32 and close autocast.

You are right the problem is with the GPU.
It works on the 3090 and doesn't with the 1650.
Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24265MiB) -- Working

Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650, 3912MiB) -- Not working

@hdnh2006
Contributor

hdnh2006 commented Jan 23, 2023

Same problem with the latest version (8.0.17)
When lowering the batch size the losses seems to be working but the model is not learning properly.
Batch 16
Batch 8

In my experiment, I sometimes will be fixed by replacing the dataset, updating the PyTorch to latest version, or changing the gpu types. It seems not a model problem but is a AMP problem. If you don't want to try the above operation, you can try close AMP and only using FP32 by force all in FP32 and close autocast.

You are right the problem is with the GPU. It works on the 3090 and doesn't with the 1650. Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24265MiB) -- Working

Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650, 3912MiB) -- Not working

This is totally true, I have a RTX2060 Super and I get the following logs:
image

Meanwhile, on my laptop with a GTX 1650, the logs look like the following (with an important warning):
image

I have the same versions of PyTorch in both computers:

RTX2060:

Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch 
>>> torch.__version__
'1.13.1+cu117'

GTX1650:

Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch 
>>> torch.__version__
'1.13.1+cu117'

New important EDIT:

If I try training on my laptop with the GTX 1650 but using the CPU, I don't get any nan values:
image

So clearly, there's a compatibility problem with this GPU.

@kosmicznemuchomory123pl

I think this is the same problem in yolov5: ultralytics/yolov5#7908
I must check with CUDA 10, but it requires an older system.

@Hridh0y

Hridh0y commented Jan 25, 2023

I had the same errors propagate in both yolov8 and yolov5, but I found a similar bug report for yolov5 that suggested disabling AMP with amp=False in train.py, which fixes box_loss and obj_loss coming out as nan.

The other suggested fix, for validation not working, was that in train.py validation runs at half precision (half=amp in the validator() function, val in this thread); force-assigning half=False fixed my problem for training on yolov5, and training has resumed as usual using CUDA 11.7 with an NVIDIA T1200 Laptop GPU (Compute Capability 7+).

This could perhaps be a problem with AMP and CUDA, since I saw users in this thread having issues with AMP on CUDA 11.x that were solved when they reverted to CUDA 10.x.

Perhaps mirroring the fix found in that thread might help? I can't really find the equivalent variables to change in train.py, and was wondering where they were moved to in v8.

Thread for reference:
ultralytics/yolov5#7908
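(For what it's worth, in recent versions of the ultralytics v8 package these knobs appear as configuration arguments rather than variables in train.py: amp for training and half for validation, the latter visible in the config dump earlier in this thread. A hedged sketch:)

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
model.train(data='coco128.yaml', epochs=100, amp=False)  # train with AMP disabled
model.val(data='coco128.yaml', half=False)               # run validation in FP32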

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 2, 2023
@mithilnettyfy

Is there any solution to run YOLOv8 or YOLOv5 on an NVIDIA GTX 1650?

Because I am still facing the same error.

Screenshot (2)

@glenn-jocher
Member

@mithilnettyfy hey there! Thank you for reaching out to us. We apologize for the inconvenience you have faced while training YOLOv8 on your NVIDIA GTX 1650 GPU.

The issue you are facing could be related to a compatibility issue with your GPU or with the use of Automatic Mixed Precision (AMP). We recommend trying the following solutions:

  1. Disable AMP by setting amp=False in train.py while training the YOLOv8 model.

  2. Force all variables to run in FP32 instead of using both FP16 and FP32 by disabling autocast. You can do this by setting autocast=False in train.py while training the model.

If neither of these solutions work, we recommend checking the compatibility of your GTX 1650 GPU with the CUDA version you are using. Some users have reported issues with AMP in CUDA 11.x and have solved the problem by reverting back to CUDA 10.x.

Please let us know if this helps resolve your issue or if you have any further questions.
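(A minimal sketch of the first suggestion above via the Python API, using the amp training argument; the dataset yaml is a placeholder, swap in your own:)

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
# amp=False disables Automatic Mixed Precision for the whole training run
model.train(data='coco128.yaml', imgsz=640, epochs=100, batch=4, amp=False)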

@mithilnettyfy

mithilnettyfy commented Jun 12, 2023

Hey @glenn-jocher, thank you so much for helping to resolve this issue. My program is working perfectly, but your second solution is not working. Could you please describe what exactly the second point means?

There is no argument autocast=False: https://docs.ultralytics.com/modes/train/#arguments

  1. Force all variables to run in FP32 instead of using both FP16 and FP32 by disabling autocast. You can do this by setting autocast=False in train.py while training the model.

Thank you in advance for your help. I appreciate it.

image

I am still getting 0 values for Box(P R mAP50 mAP50-95).

@Chase-Xuu

Hello, I just faced the same problem on the AutoDL platform using YOLOv5. I solved it by cloning the latest version of YOLOv5 rather than using the YOLOv5 provided by the platform. I hope this tip can help you.

@glenn-jocher
Member

@Chi-XU-Sean hello,

Thank you for sharing your experience. This platform-specific issue seems to be related to the version of YOLOv5 provided by the AutoDL platform. To resolve this problem, you can try cloning the latest version of YOLOv5 directly from the official repository. This should help ensure that you are using the most up-to-date and bug-free version of YOLOv5.

I hope this solution works for you. Let me know if you have any further questions or concerns.

Best regards.

@priyakanabar-crest

@mithilnettyfy @glenn-jocher did you get a solution for this? Please discuss: 0 values for Box(P R mAP50 mAP50-95).
Thanks

@mithilnettyfy

Hi @priyakanabar-crest, yes, I solved this. Can you please tell me which GPU you use for training?

@priyakanabar-crest

priyakanabar-crest commented Aug 4, 2023

Hello @mithilnettyfy, I am using an NVIDIA GeForce GTX 1650. As you said, Box(P R mAP50 mAP50-95) is showing 0 for me, and I have set amp=False.

@mithilnettyfy

@priyakanabar-crest
if __name__ == '__main__':
    from multiprocessing import freeze_support
    freeze_support()

    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")

    model.train(data="data_custom.yaml", batch=4, amp=False, device="0", imgsz=1480, epochs=50, profile=True)

This is my training code; please follow it.

@priyakanabar-crest

@mithilnettyfy I am trying this. Thank you so much for your reply.

@mithilnettyfy

@priyakanabar-crest does it work?

@priyakanabar-crest

No it does not work @mithilnettyfy

@priyakanabar-crest

image_2023_08_04T12_31_33_121Z
@mithilnettyfy this is how it's showing.

@mithilnettyfy

mithilnettyfy commented Aug 4, 2023

@priyakanabar-crest can you please share your training code?

@priyakanabar-crest

if __name__ == '__main__':
    try:
        from multiprocessing import freeze_support
        freeze_support()

        from ultralytics import YOLO

        model = YOLO('yolov8m.pt')
        results = model.train(
            data='data.yaml',
            imgsz=640,
            epochs=40,
            batch=4,
            amp=False,
            profile=True,
            name='yolov8n_custom'
        )
    except Exception as e:
        print(e.args)

@mithilnettyfy this is what I am using.

@priyakanabar-crest

@mithilnettyfy just for your information, it works fine with yolov8s.pt but not with yolov8m.pt; I am not able to understand why.

@glenn-jocher
Member

Hello @mithilnettyfy,

Thanks for sharing additional details regarding the issue. It's great to hear that it's working as expected with 'yolov8s.pt'.

When it comes to different models like 'yolov8s.pt' and 'yolov8m.pt', they differ in size, layers, and potentially training regimen, which could lead some models to perform better on certain datasets than others.

Issues like the one you're facing with 'yolov8m.pt' could be due to various factors such as data-related issues (e.g., small object size, low-resolution images, class imbalance, etc.) or specific model characteristics. It could also be related to the GPU memory since different models have different memory and compute requirements.

If adjusting parameters (like image size, batch size, etc.) or trying different models does not solve the problem, it might be beneficial to review your data. Verify if your annotations are correct, or if there's any class imbalance in your dataset. Also, try to ensure that your dataset has diverse and representative samples of objects that YOLOv8 should detect.

Please let us know if you have any further questions or continue to encounter problems. We appreciate your collaboration and are eager to assist you in resolving this issue.

Best,
Glenn

@deKeijzer

This issue is present in the latest version when using mps on a Mac M3 through the hub for various (detect) models. Additionally, when using mps the box_loss and dfl_loss are always zero. Switching to cpu training resolves these issues.

@glenn-jocher
Member

@deKeijzer hey there! 👋 Thanks for bringing this to our attention. Indeed, using MPS on a Mac M3 has shown some unique behaviors with our detect models, including the box_loss and dfl_loss being consistently zero. This seems to be an issue specific to the MPS backend.

For now, reverting to CPU training, as you discovered, bypasses these problems. We'll look into what's causing these discrepancies with MPS to find a solution. For users facing similar issues, here's a quick way to switch to CPU training:

from ultralytics import YOLO
model = YOLO('yolov8n.pt')
model.train(data='coco128.yaml', device='cpu')  # pass device='cpu' to train() to force CPU training

We appreciate your patience and contributions to improving YOLOv8! Stay tuned for updates. 🚀

@FedeMorenoOptima

Same problem on MAC M1. Thanks

@glenn-jocher
Member

@FedeMorenoOptima hi there! 👋 It seems like the issue you're experiencing on the Mac M1 with YOLOv8 is noted. For now, a workaround is to train on the CPU to circumvent this problem. Here's a quick way to do it:

from ultralytics import YOLO
model = YOLO('yolov8n.pt')
model.train(data='coco128.yaml', device='cpu')  # force training on the CPU by passing device='cpu' to train()

We're on it to fix this MPS backend issue. Your patience and support are much appreciated!

@thiagodsd

@mithilnettyfy hey there! Thank you for reaching out to us. We apologize for the inconvenience you have faced while training YOLOv8 on your NVIDIA GTX 1650 GPU.

The issue you are facing could be related to a compatibility issue with your GPU or with the use of Automatic Mixed Precision (AMP). We recommend trying the following solutions:

  1. Disable AMP by setting amp=False in train.py while training the YOLOv8 model.
  2. Force all variables to run in FP32 instead of using both FP16 and FP32 by disabling autocast. You can do this by setting autocast=False in train.py while training the model.

If neither of these solutions work, we recommend checking the compatibility of your GTX 1650 GPU with the CUDA version you are using. Some users have reported issues with AMP in CUDA 11.x and have solved the problem by reverting back to CUDA 10.x.

Please let us know if this helps resolve your issue or if you have any further questions.

Just for reference, disabling AMP worked in Ubuntu Linux 22.04, 16 GB RAM, AMD Ryzen 7 3700X, NVIDIA GeForce GTX 1660 Ti, Python 3.10.12, pytorch 2.0.0+cu117.

Thanks @mithilnettyfy !

@mithilnettyfy

Yes, it's working on Ubuntu also @thiagodsd

@janelyd

janelyd commented Jul 30, 2024

Same problem with the latest version (8.0.17)
When lowering the batch size the losses seems to be working but the model is not learning properly.
Batch 16
Batch 8

In my experiment, I sometimes will be fixed by replacing the dataset, updating the PyTorch to latest version, or changing the gpu types. It seems not a model problem but is a AMP problem. If you don't want to try the above operation, you can try close AMP and only using FP32 by force all in FP32 and close autocast.

You are right the problem is with the GPU. It works on the 3090 and doesn't with the 1650. Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3090, 24265MiB) -- Working
Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650, 3912MiB) -- Not working

This is totally true, I have a RTX2060 Super and I get the following logs: image

Meanwhile in my laptop with GTX 1650, the logs are like following (with an important warning): image

I have the same versions of PyTorch in both computers:

RTX2060:

Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch 
>>> torch.__version__
'1.13.1+cu117'

GTX1650:

Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch 
>>> torch.__version__
'1.13.1+cu117'

New important EDIT:

If I try the trianing on my laptop with GTX1650 but using the CPU, I don't get any nan values: image

So clearly, there's a compatibilty problem with this GPU.

I don't think the issue is about the GPU. I use Google Colab for training and have access to a T4 GPU but still get the issue.

@pderrenger
Member

@janelyd thank you for the detailed information. It appears that the issue might not be solely related to the GPU type but could also involve other factors such as AMP settings or specific configurations. To help us investigate further, could you please ensure you are using the latest versions of YOLOv8 and PyTorch? Additionally, try disabling AMP and running the training again. If the issue persists, please share any additional logs or warnings you encounter. This will help us pinpoint the problem more accurately.

@janelyd

janelyd commented Aug 2, 2024

@janelyd thank you for the detailed information. It appears that the issue might not be solely related to the GPU type but could also involve other factors such as AMP settings or specific configurations. To help us investigate further, could you please ensure you are using the latest versions of YOLOv8 and PyTorch? Additionally, try disabling AMP and running the training again. If the issue persists, please share any additional logs or warnings you encounter. This will help us pinpoint the problem more accurately.

@pderrenger I'm using the latest versions of YOLOv9 and PyTorch. I just had the same issue: box_loss returns nan (and obj_loss returns nan too).
But then I tried disabling AMP. It got worse: I got nan for other metrics too after disabling AMP (speaking for YOLOv9).
Now I have tried --freeze 10 and it gives better results; there is no nan issue for me anymore. I also tried changing hyperparameters such as the augmentation parameters.
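(That workaround uses the YOLOv9 repo's train.py CLI. For anyone wanting to try the same idea with the ultralytics package, recent versions expose layer freezing as a train argument; a hedged sketch, with the dataset yaml as a placeholder:)

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
# freeze=10 keeps the first 10 modules (layers 0-9, the backbone in the model summary above) frozen during training
model.train(data='coco128.yaml', epochs=100, freeze=10)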

@pderrenger
Member

Thank you for the update, @janelyd. It's helpful to know that disabling AMP worsened the issue and that using --freeze 10 improved the results. This suggests that the problem might be related to the model's initial layers or specific hyperparameters. If you encounter further issues, please share any additional logs or warnings. This will assist us in diagnosing the problem more accurately.

@glenn-jocher
Member

@xyrod6 lowering the batch size can sometimes help with NaN issues, but if the model isn't learning properly, consider checking your dataset for any issues or adjusting learning rates and other hyperparameters.
