-
-
Notifications
You must be signed in to change notification settings - Fork 16.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
the results obtained by training with Yolov5 are different each time #11626
Comments
@Ellohiye hello, Thank you for reaching out. The performance difference between successive training runs might be due to several factors, including variations in the number of iterations and differences in the way the network initializes its parameters. One of the reasons can be the Cuda cuDNN version. You can try setting cudnn_benchmark to true to provide the best performance. Another reason can be the shuffling order of the dataset. YOLOv5 shuffles the data before each epoch; you can try disabling the shuffle by setting ‘--noautoanchor –nosave –nosplit’. Hope this helps. Let me know if you have any further concerns. |
Thank you for your reply! |
@Ellohiye hello, Yes, I understand that you are experiencing this issue during training. The parameter --nosave does not affect the randomness, it disables saving the weights of the model. The --nosplit parameter is actually a combination of two parameters --nosplit_datasets and --nosplit_val, which disables splitting the dataset into training and validation sets. You can use this parameter to disable data shuffling and ensure that the order of the training data is consistent across runs. I hope this helps clarify your questions. Please let me know if you have any further concerns or questions. |
Thank you for your answer! I still have some doubts. |
Thank you for your follow-up questions. Yes, YOLOv5 shuffles the training set order before each epoch to avoid memorization. However, if the random seed is fixed, the training set order should be consistent across runs. Disabling the --nosplit parameter can indeed mess up the training set order, so if you want to ensure that the training set order is consistent across runs, you should set --nosplit to disable data shuffling. Regarding the --nosave parameter, you are correct that it does not directly help with randomness. However, disabling the model weights saving can help ensure that the model does not overfit the training data, which can also be a cause of performance variation across runs. I hope this clears up your doubts. Please let me know if you have any further questions or concerns. |
Thank you for your reply! |
Thank you for your response. I'm glad that my previous explanations could help clarify the situation. And yes, you are correct that the argument --nosplit is disabled by default, and even if random seeds are fixed, the order of the training sets will be scrambled in each epoch. To use the --nosplit parameter, you can set it in the command line when using YOLOv5 to run your training script. For example, you can run Alternatively, if you want to set it in your training configuration file, you can add I hope this helps. Please don't hesitate to let me know if you have any further questions or issues. |
Thank you very much! I will try it later! |
@Ellohiye you're welcome! I'm glad that I could help. Please let me know if you have any further questions or concerns. Good luck with your training! |
When I enter the training model according to your command line, the following error occurs: |
@Ellohiye hello, I apologize for the confusion and the error you've encountered. The command you specified in your issue seems to be missing the directory path to the YOLOv5 train.py file. Assuming you are using the latest version of YOLOv5, you can use this command to train your model:
Please ensure that you replace I hope this helps you. Let me know if you encounter any further issues or have any other questions. |
Thank you for your reply! |
@Ellohiye hello, Thank you for your response. I apologize for the confusion. The parameters that I mentioned in my previous reply have been deprecated in the latest version of YOLOv5, and I was not aware of this change. I am sorry for any inconvenience that may have caused you. Based on the latest YOLOv5 code, the recommended command to train your model would be something like this:
Please replace I hope this clears up any confusion. Please let me know if you have any further questions or concerns. |
That's OK! |
@Ellohiye hello! That's correct! By fixing the random seeds and using the --cache parameter, the training sets should have the same order for every round of training. Regarding your question, the --cache parameter in YOLOv5 enables caching of training dataset images to speed up training. It allows the images to be preprocessed only once during the training process, and the processed images are then cached in memory or on disk. This can help reduce the training time required for each epoch, especially for large datasets. Please let me know if you have any further questions or concerns. |
I added these parameters in two training sessions as you suggested, and then trained 30 rounds to check the results, but the results were still different each time. This frustrated me and made me lose faith in deep learning. If I couldn't guarantee repeatability, my improvements wouldn't be convincing. |
@Ellohiye hello, I apologize for the frustration you're experiencing with the lack of repeatability in your training results. I can understand how it can be discouraging, especially when trying to assess the impact of changes or improvements to your models. While there are several factors that can affect the consistency of your training results, using the --nosplit and --cache parameters as suggested above should help ensure that the order of training sets remains the same and speed up the training process, respectively. Other potential factors that may be affecting the repeatability of your training results include:
I would suggest reviewing your training setup and configurations to ensure that everything is consistent and properly tuned to obtain more reliable results. Please don't hesitate to let me know if you have any further questions or concerns. |
Every time I train, I make sure I train in one environment, on one machine, and the environment is exactly the same. |
@Ellohiye hello! I understand that you have taken measures to ensure the consistency of your training setup, however, there are still variations in your results. In addition to the parameters I suggested earlier, I would recommend double-checking your hardware and software configurations, such as ensuring that you are using the same versions of CUDA or cuDNN across all runs. Additionally, you may want to pay attention to random seeds that are involved in the initialization of the network weights. Correct initialization with a set seed can usually help with reproducibility, and incorrect initialization without a set seed will produce different weights across every training run. Lastly, you might also consider creating a controlled environment for training, such as using virtual machines to control for hardware or software changes. This can help ensure consistency in your training environment across all runs. I hope that helps! Let me know if you have any other questions or concerns. |
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help. For additional resources and information, please see the links below:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLO 🚀 and Vision AI ⭐ |
Thank you for your information. I tried using the following code (I have tested more than ten model families on the hugging face classification project, only the effcientformer cannot be completely fixed, and the related reasons are still being tested). code: I found that the images of each batch are completely consistent. However, I output loss(code loss, loss_items = compute_loss(pred, targets.to(device)), and the result is as follows: result: There is a slight difference in the results, probably due to floating-point arithmetic or AMP. But I don't know how to modify it, if not completely fixed, sometimes it is found that randomness alone may make the result reach sota. (I used a hugface classification model with incomplete fixed randomness, and found that the final result was +-2% when all inputs were constant. About 20 experiments. I have found that experiments that modify the backbone or structure of the model sometimes have opposite results, which leads to problems with erosion experiments.) Looking forward to your reply. |
@Ellohiye hello! The modified code is as follows: Maybe I didn't modify it correctly. Or maybe there's some other reason. Looking forward to your reply. |
@deadly-fish i understand the frustration you may be experiencing with unrepeatable training results despite taking various measures to ensure consistency. It seems that the floating-point arithmetic or mixed-precision training (AMP) could be contributing to the slight variations in your loss results. In the modified code snippet you provided, it looks like there is an attempt to set the seed and control the environment's randomness, which is a good approach. Since the torch version used may affect the behavior of the code, I recommend consulting the official documentation or community forums for information specific to handling randomness in your torch version, as its behavior could have changed in the newer version. Another approach you may consider is conducting a thorough analysis of the floating-point arithmetic and AMP configurations to identify potential sources of variation in your results. This may involve reviewing the precision of computations, rounding methods, or AMP settings to ensure reproducibility. Additionally, you could validate the consistency of your floating-point operations across different runs and make adjustments accordingly. I hope this information is helpful, and I encourage you to seek guidance from the official documentation or community resources specific to your torch version for further insights into managing reproducibility in your training environment. If you have any other questions or concerns, please feel free to ask. |
Thank you for writing back and providing me with direction. I'm hoping to address this problem later! |
@deadly-fish, you're welcome! I'm glad to hear that you're feeling more directed in tackling this issue. Remember, achieving perfect reproducibility in deep learning can be challenging due to the inherent stochastic nature of the training process, but with careful control of the factors we've discussed, you can minimize variability. If you encounter any more issues or have additional questions in the future, don't hesitate to reach out. Good luck with your experiments! 🍀 |
Search before asking
Question
Hello, author!
Even with a fixed random seed and using the same pre-defined dataset split, the results obtained by training with Yolov5 are different each time. What is the reason for this situation?
Additional
No response
The text was updated successfully, but these errors were encountered: