NaN reported in box_loss, cls_loss and dfl_loss when training on a custom dataset #280
Comments
@classico09 can you share your training command?
Here is my command:
@classico09 hi. Is your performance fine with yolov5? Can you run the same command with
Thank you. I tried it but it still doesn't work. I tried with the yolov5 model from the yolov5 repository and it worked, so I think it is not because of the dataset.
Hello, I ran into the same issue. I successfully completed training using the yolov5 environment (mAP 0.907). Starting from that yolov5 environment, I quickly completed the installation with pip install ultralytics following the documentation (I want to train yolov8 and compare it with yolov5 to see the effect). #283 yolo task=init --config-name helmethyp.yaml --config-path /nfs/volume-622-1/lanzhixiong/project/smoking/code/yolov8/
I also encountered the same problem. I found it could be worked around by turning down the batch size, but I don't know why that is, and training then becomes very slow with very low GPU utilization.
I have the same/similar problem. When I run the same command with
Hi all.
@pepijnob @duynguyen1907 @jiyuwangbupt hey guys, can you try replacing the following line?
@Laughing-q I tried your suggestion in version 8.0.4 and in 8.0.5, and both times my loss went to NaN in the first epoch. When I just updated to 8.0.5 without your suggestion, I get the same as before, with the loss not going down (on the same dataset where yolov5 did work).
I am seeing the same logs with the default command, i.e. with coco128.yaml, just to do some testing, and I get the same results:
I also met the same problem; it seems cls_loss suddenly becomes NaN, and then all losses are NaN.
The NaN loss issue has been solved in PR #490, which we'll merge later today. :)
@Laughing-q I am still getting NaN in my training. It seems it is solved for validation. After running:
My dataset is
@hdnh2006 it's not merged yet. The update will be available later today.
@AyushExel Thanks Ayush, you are awesome as always!!
I have this issue even on the newest update. I tried NVIDIA drivers 525 (CUDA 12) and 470 (CUDA 11.4), with the ultralytics Docker image, etc.
In my experiments, it is sometimes fixed by replacing the dataset, updating PyTorch to the latest version, or changing the GPU type. It does not seem to be a model problem but an AMP problem. If you don't want to try the operations above, you can disable AMP and use only FP32, i.e. force everything to FP32 and disable autocast.
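For example, with the Ultralytics Python API the whole run can be forced to FP32 via the `amp=False` training argument (a minimal sketch; the dataset path and hyperparameters are placeholders):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# amp=False disables Automatic Mixed Precision, so the forward/backward pass
# stays entirely in FP32 instead of using autocast/FP16.
model.train(data="coco128.yaml", epochs=100, imgsz=640, amp=False)
```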
You are right, the problem is with the GPU. Python-3.10.9 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1650, 3912MiB) -- not working.
This is totally true. I have an RTX 2060 Super and I get the following logs: Meanwhile, on my laptop with a GTX 1650, the logs look like the following (with an important warning): I have the same versions of PyTorch on both computers: RTX2060:
GTX1650:
New important EDIT: If I try the training on my laptop with the GTX 1650 but using the CPU, I don't get any NaN values. So clearly, there's a compatibility problem with this GPU.
I think this is the same problem as in yolov5: ultralytics/yolov5#7908
I had the same errors in both yolov8 and yolov5, but I found a similar bug report for yolov5 that suggested disabling AMP with amp=False in train.py, which fixes box_loss and obj_loss going to NaN. The other suggested fix, for validation not working, was that in train.py validation runs in half precision (half=amp in the validator() call, val in this thread); force-assigning half=False fixed my problem for training on yolov5, and training resumed as usual using CUDA 11.7 with an NVIDIA T1200 Laptop GPU (Compute Capability 7+). It could be a problem with AMP in CUDA, since users in that thread also had issues with AMP on CUDA 11.x and solved them by reverting to CUDA 10.x. Perhaps mirroring the fix found in that thread might help? I can't find the equivalent variables to change in train.py and was wondering where they were moved to in v8. Thread for reference:
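For context, here is roughly what those two switches amount to in a generic PyTorch training step. This is an illustrative sketch with assumed variable names, not the actual YOLOv5/YOLOv8 source:

```python
import torch

use_amp = False  # the effect of amp=False / half=False: keep everything in FP32

scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, images, targets, loss_fn, optimizer):
    optimizer.zero_grad()
    # With enabled=False, autocast is a no-op and no FP16 casts happen.
    with torch.cuda.amp.autocast(enabled=use_amp):
        preds = model(images)
        loss = loss_fn(preds, targets)
    scaler.scale(loss).backward()  # GradScaler is also a pass-through when disabled
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```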
@mithilnettyfy hey there! Thank you for reaching out to us. We apologize for the inconvenience you have faced while training YOLOv8 on your NVIDIA GTX 1650 GPU. The issue you are facing could be related to a compatibility issue with your GPU or with the use of Automatic Mixed Precision (AMP). We recommend trying the following solutions:
If neither of these solutions works, we recommend checking the compatibility of your GTX 1650 GPU with the CUDA version you are using. Some users have reported issues with AMP in CUDA 11.x and have solved the problem by reverting to CUDA 10.x. Please let us know if this helps resolve your issue or if you have any further questions.
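A quick way to check which CUDA build and GPU compute capability PyTorch actually sees (a small diagnostic sketch, not part of the original reply):

```python
import torch

print(torch.__version__, torch.version.cuda)   # PyTorch build and the CUDA version it was compiled against
print(torch.cuda.is_available())                # True if a usable CUDA device was found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # e.g. "NVIDIA GeForce GTX 1650"
    print(torch.cuda.get_device_capability(0))  # compute capability, e.g. (7, 5) for Turing cards
```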
Hey @glenn-jocher, thank you so much for helping to resolve this issue. My program is working, but your second solution is not. Could you please describe what exactly the second point is? There is no autocast=False argument: https://docs.ultralytics.com/modes/train/#arguments
Thank you in advance for your help. I appreciate it. I am still getting 0 values for Box(P R mAP50 mAP50-95).
Hello, I just faced the same problem on the AutoDL platform using YOLOv5. I solved it by cloning the latest version of YOLOv5 rather than using the YOLOv5 provided by the platform. I hope this tip can help you.
@Chi-XU-Sean hello, thank you for sharing your experience. This platform-specific issue seems to be related to the version of YOLOv5 provided by the AutoDL platform. To resolve it, cloning the latest version of YOLOv5 directly from the official repository should ensure that you are using the most up-to-date and bug-free version. I hope this solution works for you. Let me know if you have any further questions or concerns. Best regards.
@mithilnettyfy @glenn-jocher did you find a solution for this? Please discuss. I am getting 0 values for Box(P R mAP50 mAP50-95).
Hi @priyakanabar-crest, yes I solved this. Can you please tell me which GPU you use for training?
Hello @mithilnettyfy, I am using an NVIDIA GeForce GTX 1650. As you said, Box(P R mAP50 mAP50-95) is showing 0 for me even though I have set amp=False.
@priyakanabar-crest
@mithilnettyfy I am trying this. Thank you so much for your reply.
@priyakanabar-crest does it work?
No, it does not work, @mithilnettyfy.
@priyakanabar-crest can you please share your training code?
if __name__ == '__main__':
@mithilnettyfy this is what I am using.
@mithilnettyfy just for your information, it works fine with yolov8s.pt but not with yolov8m.pt; I am not able to understand why.
Hello @mithilnettyfy, thanks for sharing additional details regarding the issue. It's great to hear that it's working as expected with 'yolov8s.pt'.

Different models like 'yolov8s.pt' and 'yolov8m.pt' differ in size, layers, and potentially training regimen, which can lead some models to perform better on certain datasets than others. Issues like the one you're facing with 'yolov8m.pt' could be due to various factors such as data-related issues (e.g., small object size, low-resolution images, class imbalance) or specific model characteristics. It could also be related to GPU memory, since different models have different memory and compute requirements.

If adjusting parameters (like image size, batch size, etc.) or trying different models does not solve the problem, it might be beneficial to review your data. Verify that your annotations are correct and check for class imbalance in your dataset. Also, try to ensure that your dataset has diverse and representative samples of the objects YOLOv8 should detect.

Please let us know if you have any further questions or continue to encounter problems. We appreciate your collaboration and are eager to assist you in resolving this issue. Best,
This issue is present in the latest version when using mps on a Mac M3 through the hub for various (detect) models. Additionally, when using mps, the box_loss and dfl_loss are always zero. Switching to CPU training resolves these issues.
@deKeijzer hey there! 👋 Thanks for bringing this to our attention. Indeed, using MPS on a Mac M3 has shown some unique behaviors with our detect models, including the box_loss and dfl_loss being consistently zero. This seems to be an issue specific to the MPS backend. For now, reverting to CPU training, as you discovered, bypasses these problems. We'll look into what's causing these discrepancies with MPS to find a solution. For users facing similar issues, here's a quick way to switch to CPU training:
from ultralytics import YOLO
model = YOLO('yolov8n.pt')
model.train(data='your_dataset.yaml', device='cpu')  # specify device as 'cpu'
We appreciate your patience and contributions to improving YOLOv8! Stay tuned for updates. 🚀
Same problem on Mac M1. Thanks.
@FedeMorenoOptima hi there! 👋 It seems like the issue you're experiencing on the Mac M1 with YOLOv8 is noted. For now, a workaround is to train on the CPU to circumvent this problem. Here's a quick way to do it:
from ultralytics import YOLO
model = YOLO('yolov8n.pt')
model.train(data='your_dataset.yaml', device='cpu')  # force training on the CPU
We're on it to fix this MPS backend issue. Your patience and support are much appreciated!
Just for reference, disabling AMP worked in Ubuntu Linux 22.04, 16 GB RAM, AMD Ryzen 7 3700X, NVIDIA GeForce GTX 1660 Ti, Python 3.10.12, PyTorch 2.0.0+cu117. Thanks @mithilnettyfy !
Yes, it's working on Ubuntu as well, @thiagodsd.
I don't think the issue is about the GPU. I use Google Colab for training and have access to a T4 GPU but still get the issue.
@janelyd thank you for the detailed information. It appears that the issue might not be solely related to the GPU type but could also involve other factors such as AMP settings or specific configurations. To help us investigate further, could you please ensure you are using the latest versions of YOLOv8 and PyTorch? Additionally, try disabling AMP and running the training again. If the issue persists, please share any additional logs or warnings you encounter. This will help us pinpoint the problem more accurately.
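A quick way to confirm the installed versions and environment (a small sketch using the package's built-in checks helper):

```python
import torch
import ultralytics

print("ultralytics:", ultralytics.__version__)
print("torch:", torch.__version__, "cuda:", torch.version.cuda)

# Prints an environment summary (OS, Python, CUDA, GPU) useful for bug reports.
ultralytics.checks()
```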
@pderrenger I'm using the latest versions of YOLOv9 and PyTorch. I just had the same issue: box_loss returns NaN (obj_loss also returns NaN).
Thank you for the update, @janelyd. It's helpful to know that disabling AMP worsened the issue and that using
@xyrod6 lowering the batch size can sometimes help with NaN issues, but if the model isn't learning properly, consider checking your dataset for any issues or adjusting learning rates and other hyperparameters.
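For instance, a hedged sketch of lowering the batch size and initial learning rate with the Ultralytics API (the dataset path and values are placeholders to tune for your own data):

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")

# A smaller batch and a lower initial learning rate (lr0) can stabilize training
# when losses blow up to NaN; amp=False keeps the run in FP32 as discussed above.
model.train(data="data.yaml", epochs=100, imgsz=640, batch=8, lr0=0.005, amp=False)
```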
Search before asking
YOLOv8 Component
Training
Bug
Hello, I am a newbie in computer vision. I just started to try the new version, YOLOv8, and I get some errors when looking at the results.
It seems like something is wrong, but I don't know how to fix it. Can you give me some suggestions?
Environment
-YOLOv8n
-CUDA: 11.6
-Ultralytics YOLOv8.0.4
-OS: Windows 10
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?