Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the results obtained by training with Yolov5 are different each time #11626

Closed
1 task done
Ellohiye opened this issue May 30, 2023 · 25 comments
Closed
1 task done

the results obtained by training with Yolov5 are different each time #11626

Ellohiye opened this issue May 30, 2023 · 25 comments
Labels
question Further information is requested Stale Stale and schedule for closing soon

Comments

@Ellohiye
Copy link

Search before asking

Question

Hello, author!
Even with a fixed random seed and using the same pre-defined dataset split, the results obtained by training with Yolov5 are different each time. What is the reason for this situation?

Additional

No response

@Ellohiye Ellohiye added the question Further information is requested label May 30, 2023
@glenn-jocher
Copy link
Member

@Ellohiye hello,

Thank you for reaching out. The performance difference between successive training runs might be due to several factors, including variations in the number of iterations and differences in the way the network initializes its parameters. One of the reasons can be the Cuda cuDNN version. You can try setting cudnn_benchmark to true to provide the best performance.

Another reason can be the shuffling order of the dataset. YOLOv5 shuffles the data before each epoch; you can try disabling the shuffle by setting ‘--noautoanchor –nosave –nosplit’.

Hope this helps. Let me know if you have any further concerns.

@Ellohiye
Copy link
Author

Thank you for your reply!
Did you have this problem during training?
The environment I'm training in is exactly the same,
Does the parameter --nosave have an effect on randomness?
Where is the parameter - nosplit set?

@glenn-jocher
Copy link
Member

@Ellohiye hello,

Yes, I understand that you are experiencing this issue during training. The parameter --nosave does not affect the randomness, it disables saving the weights of the model. The --nosplit parameter is actually a combination of two parameters --nosplit_datasets and --nosplit_val, which disables splitting the dataset into training and validation sets. You can use this parameter to disable data shuffling and ensure that the order of the training data is consistent across runs.

I hope this helps clarify your questions. Please let me know if you have any further concerns or questions.

@Ellohiye
Copy link
Author

@Ellohiye你好,

是的,我了解到您在培训期间遇到了这个问题。参数 --nosave 不影响随机性,它禁用保存模型的权重。--nosplit 参数实际上是两个参数--nosplit_datasets 和--nosplit_val 的组合,它禁止将数据集拆分为训练集和验证集。您可以使用此参数来禁用数据混洗并确保训练数据的顺序在运行中保持一致。

我希望这有助于澄清您的问题。如果您还有任何疑虑或问题,请告诉我。

Thank you for your answer! I still have some doubts.
Does yolov5 scramble the training set order before each epoch? If random seeds are fixed, the training set order will not be messed up.
If it does mess up the training set order, disabling --nospilt will still mess up the training set order.
If the nosave parameter doesn't help with randomness, disabling it won't help keep the same value for multiple drills.

@glenn-jocher
Copy link
Member

@Ellohiye,

Thank you for your follow-up questions. Yes, YOLOv5 shuffles the training set order before each epoch to avoid memorization. However, if the random seed is fixed, the training set order should be consistent across runs. Disabling the --nosplit parameter can indeed mess up the training set order, so if you want to ensure that the training set order is consistent across runs, you should set --nosplit to disable data shuffling.

Regarding the --nosave parameter, you are correct that it does not directly help with randomness. However, disabling the model weights saving can help ensure that the model does not overfit the training data, which can also be a cause of performance variation across runs.

I hope this clears up your doubts. Please let me know if you have any further questions or concerns.

@Ellohiye
Copy link
Author

@Ellohiye,

Thank you for your follow-up questions. Yes, YOLOv5 shuffles the training set order before each epoch to avoid memorization. However, if the random seed is fixed, the training set order should be consistent across runs. Disabling the --nosplit parameter can indeed mess up the training set order, so if you want to ensure that the training set order is consistent across runs, you should set --nosplit to disable data shuffling.

Regarding the --nosave parameter, you are correct that it does not directly help with randomness. However, disabling the model weights saving can help ensure that the model does not overfit the training data, which can also be a cause of performance variation across runs.

I hope this clears up your doubts. Please let me know if you have any further questions or concerns.

Thank you for your reply!
I misunderstood before. After reading your explanation, my understanding is as follows:
The argument --nosplit is disabled by default, and even if random seeds are fixed, the order of the training sets will be scrambled in each epoch.
If I want to use the --nospilt parameter, in which file should I specify the value of that parameter?

@glenn-jocher
Copy link
Member

@Ellohiye,

Thank you for your response. I'm glad that my previous explanations could help clarify the situation. And yes, you are correct that the argument --nosplit is disabled by default, and even if random seeds are fixed, the order of the training sets will be scrambled in each epoch.

To use the --nosplit parameter, you can set it in the command line when using YOLOv5 to run your training script. For example, you can run python train.py --img 640 --batch 16 --epochs 50 --data coco.yaml --noautoanchor --nosave --cache --nosplit to turn off the automatic shuffling of data at the end of each training epoch.

Alternatively, if you want to set it in your training configuration file, you can add nosplit: True under the train field in your YAML file and then launch the training script with your customized configuration file.

I hope this helps. Please don't hesitate to let me know if you have any further questions or issues.

@Ellohiye
Copy link
Author

@Ellohiye,

感谢您的答复。很高兴我之前的解释可以帮助澄清情况。是的,你是对的,参数 --nosplit 默认是禁用的,即使随机种子是固定的,训练集的顺序也会在每个时期被打乱。

要使用 --nosplit 参数,您可以在使用 YOLOv5 运行训练脚本时在命令行中设置它。例如,您可以运行python train.py --img 640 --batch 16 --epochs 50 --data coco.yaml --noautoanchor --nosave --cache --nosplit以在每个训练时期结束时关闭数据的自动混洗。

或者,如果您想在训练配置文件中设置它,您可以在 YAML 文件的字段nosplit: True下添加train,然后使用您的自定义配置文件启动训练脚本。

我希望这有帮助。如果您还有任何疑问或问题,请随时告诉我。

Thank you very much! I will try it later!

@glenn-jocher
Copy link
Member

@Ellohiye you're welcome! I'm glad that I could help. Please let me know if you have any further questions or concerns. Good luck with your training!

@Ellohiye
Copy link
Author

Ellohiye commented Jun 1, 2023

@Ellohiye you're welcome! I'm glad that I could help. Please let me know if you have any further questions or concerns. Good luck with your training!

When I enter the training model according to your command line, the following error occurs:
There is no parameter in the train.py file
image

@glenn-jocher
Copy link
Member

@Ellohiye hello,

I apologize for the confusion and the error you've encountered. The command you specified in your issue seems to be missing the directory path to the YOLOv5 train.py file.

Assuming you are using the latest version of YOLOv5, you can use this command to train your model:

python yolov5/train.py --img 640 --batch 16 --epochs 50 --data coco.yaml --noautoanchor --nosave --cache --nosplit_datasets --nosplit_val

Please ensure that you replace coco.yaml with the configuration file that corresponds to your dataset. Also, make sure to include the correct path to the train.py file from the YOLOv5 directory.

I hope this helps you. Let me know if you encounter any further issues or have any other questions.

@Ellohiye
Copy link
Author

Ellohiye commented Jun 1, 2023

@Ellohiye hello,

I apologize for the confusion and the error you've encountered. The command you specified in your issue seems to be missing the directory path to the YOLOv5 train.py file.

Assuming you are using the latest version of YOLOv5, you can use this command to train your model:

python yolov5/train.py --img 640 --batch 16 --epochs 50 --data coco.yaml --noautoanchor --nosave --cache --nosplit_datasets --nosplit_val

Please ensure that you replace coco.yaml with the configuration file that corresponds to your dataset. Also, make sure to include the correct path to the train.py file from the YOLOv5 directory.

I hope this helps you. Let me know if you encounter any further issues or have any other questions.

Thank you for your reply!
I don't think the problem is the directory path, but that the parameter is not in the train.py file. I checked the latest yolov5 code and it doesn't have the parameters you mentioned.
image
image

@glenn-jocher
Copy link
Member

@Ellohiye hello,

Thank you for your response. I apologize for the confusion. The parameters that I mentioned in my previous reply have been deprecated in the latest version of YOLOv5, and I was not aware of this change. I am sorry for any inconvenience that may have caused you.

Based on the latest YOLOv5 code, the recommended command to train your model would be something like this:

python train.py --img 640 --batch 16 --epochs 50 --data coco.yaml --noautoanchor --nosave --cache

Please replace coco.yaml with the name of your dataset's configuration file, and make sure you are using the latest version of the repo.

I hope this clears up any confusion. Please let me know if you have any further questions or concerns.

@Ellohiye
Copy link
Author

Ellohiye commented Jun 1, 2023

@Ellohiye你好,

感谢您的答复。对于造成的混乱,我深表歉意。我在之前的回复中提到的参数在最新版本的 YOLOv5 中已被弃用,我并没有意识到这一变化。对于给您带来的任何不便,我们深表歉意。

根据最新的 YOLOv5 代码,训练模型的推荐命令如下所示:

python train.py --img 640 --batch 16 --epochs 50 --data coco.yaml --noautoanchor --nosave --cache

请替换coco.yaml为您的数据集配置文件的名称,并确保您使用的是最新版本的存储库。

我希望这能消除任何困惑。如果您有任何其他问题或疑虑,请告诉我。

--cache

That's OK!
That means that the training sets are now in the same order for each round after fixing the random seeds.
I want to ask what the parameter --cache stands for?

@glenn-jocher
Copy link
Member

@Ellohiye hello!

That's correct! By fixing the random seeds and using the --cache parameter, the training sets should have the same order for every round of training.

Regarding your question, the --cache parameter in YOLOv5 enables caching of training dataset images to speed up training. It allows the images to be preprocessed only once during the training process, and the processed images are then cached in memory or on disk. This can help reduce the training time required for each epoch, especially for large datasets.

Please let me know if you have any further questions or concerns.

@Ellohiye
Copy link
Author

Ellohiye commented Jun 2, 2023

@Ellohiye你好!

没错!通过固定随机种子并使用 --cache 参数,训练集在每一轮训练中应该具有相同的顺序。

关于你的问题,YOLOv5 中的 --cache 参数可以缓存训练数据集图像以加快训练速度。它允许在训练过程中只对图像进行一次预处理,然后将处理后的图像缓存在内存或磁盘上。这有助于减少每个时期所需的训练时间,尤其是对于大型数据集。

如果您有任何其他问题或疑虑,请告诉我。

I added these parameters in two training sessions as you suggested, and then trained 30 rounds to check the results, but the results were still different each time. This frustrated me and made me lose faith in deep learning. If I couldn't guarantee repeatability, my improvements wouldn't be convincing.

@glenn-jocher
Copy link
Member

@Ellohiye hello,

I apologize for the frustration you're experiencing with the lack of repeatability in your training results. I can understand how it can be discouraging, especially when trying to assess the impact of changes or improvements to your models.

While there are several factors that can affect the consistency of your training results, using the --nosplit and --cache parameters as suggested above should help ensure that the order of training sets remains the same and speed up the training process, respectively.

Other potential factors that may be affecting the repeatability of your training results include:

  • Hardware/software configurations: inconsistencies in your hardware or software configurations, such as different versions of CUDA or cuDNN, can cause variations in performance.
  • Use of augmentation: random augmentation techniques, such as flipping or rotation, can cause slight variations in the data and lead to different results during training.
  • Model architectures and hyperparameters: different model architectures or hyperparameters can yield different results, and it's important to ensure that these are properly tuned for your specific dataset and problem.

I would suggest reviewing your training setup and configurations to ensure that everything is consistent and properly tuned to obtain more reliable results.

Please don't hesitate to let me know if you have any further questions or concerns.

@Ellohiye
Copy link
Author

Ellohiye commented Jun 2, 2023

@Ellohiye你好,

对于您因训练结果缺乏可重复性而感到沮丧,我深表歉意。我能理解它是多么令人沮丧,尤其是在尝试评估模型更改或改进的影响时。

虽然有几个因素会影响训练结果的一致性,但使用上面建议的 --nosplit 和 --cache 参数应该有助于确保训练集的顺序保持不变并分别加快训练过程。

其他可能影响训练结果可重复性的潜在因素包括:

  • 硬件/软件配置:硬件或软件配置的不一致,例如不同版本的 CUDA 或 cuDNN,可能会导致性能差异。
  • 使用增强:随机增强技术,例如翻转或旋转,可能会导致数据发生轻微变化,并在训练过程中导致不同的结果。
  • 模型架构和超参数:不同的模型架构或超参数会产生不同的结果,重要的是要确保针对您的特定数据集和问题正确调整这些结果。

我建议检查您的训练设置和配置,以确保一切一致并适当调整以获得更可靠的结果。

如果您有任何其他问题或疑虑,请随时告诉我。

@Ellohiye你好,

对于您因训练导致缺乏可再生性而感到油腻,我深深地表示正确。我能理解它是多么令人油腻,尤其是在尝试评价模型更改或者改变进入的影响时。

虽然有几个因素会影响训练结果的一个致性,但使用上面建议的 --nosplit 和 --cache 参数应该有助于确保训练集的顺序保护保持不变并分别加快速训练过程。

其他可能影响训练结果可能重新性的潜在原因包括:

  • 硬件/软件配置:硬件或软件配置的不同,例如不同版本的CUDA或cuDNN,可能会导致性能差异。
  • 使用增强:随机增强技术,例如翻转或旋转,可能会导致数据发生轻微变化,而在训练过程中导致不同的结果。
  • 模型构架和超参数:不同的模型构架或超参数会产生不同的结果,重要的是要确保针头对您的特定数据集和问题正确调整这些结果。

我建议检查您的训练设置和配置,以确保一切一致并适当当调整整以获取更可靠的结果。

如果您有任何其他问题或疑问,请随时告诉我。

Every time I train, I make sure I train in one environment, on one machine, and the environment is exactly the same.
The other parameters are completely consistent and the data set is completely consistent.
The parameters of the data enhancement method adopted in the training also remain completely consistent, and the images after each enhancement are also completely consistent under the action of random seeds.
So I don't know what went wrong.

@glenn-jocher
Copy link
Member

@Ellohiye hello! I understand that you have taken measures to ensure the consistency of your training setup, however, there are still variations in your results.

In addition to the parameters I suggested earlier, I would recommend double-checking your hardware and software configurations, such as ensuring that you are using the same versions of CUDA or cuDNN across all runs. Additionally, you may want to pay attention to random seeds that are involved in the initialization of the network weights. Correct initialization with a set seed can usually help with reproducibility, and incorrect initialization without a set seed will produce different weights across every training run.

Lastly, you might also consider creating a controlled environment for training, such as using virtual machines to control for hardware or software changes. This can help ensure consistency in your training environment across all runs.

I hope that helps! Let me know if you have any other questions or concerns.

@github-actions
Copy link
Contributor

github-actions bot commented Jul 3, 2023

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Jul 3, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 14, 2023
@wangat
Copy link

wangat commented Sep 11, 2023

Thank you for your information. I tried using the following code (I have tested more than ten model families on the hugging face classification project, only the effcientformer cannot be completely fixed, and the related reasons are still being tested).

code:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
cudnn.benchmark, cudnn.deterministic = (False, True) if seed == 0 else (False, True)
torch.backends.cudnn.enabled = True
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':16:8'

I found that the images of each batch are completely consistent. However, I output loss(code loss, loss_items = compute_loss(pred, targets.to(device)), and the result is as follows:

result:
tensor([13.44336], device='cuda:0', grad_fn=) tensor([0.11942, 0.04171, 0.04893], device='cuda:0') (Control all variables unchanged, the first output of the first epoch of the first experiment)
tensor([13.44390], device='cuda:0', grad_fn=) tensor([0.11942, 0.04172, 0.04893], device='cuda:0')(Control all variables unchanged, the first output of the first epoch of the second experiment)

There is a slight difference in the results, probably due to floating-point arithmetic or AMP. But I don't know how to modify it, if not completely fixed, sometimes it is found that randomness alone may make the result reach sota. (I used a hugface classification model with incomplete fixed randomness, and found that the final result was +-2% when all inputs were constant. About 20 experiments. I have found that experiments that modify the backbone or structure of the model sometimes have opposite results, which leads to problems with erosion experiments.)

Looking forward to your reply.

@LXDxiaoxiaoda
Copy link

@Ellohiye hello!
I have read the above question carefully as I am experiencing the same problem. I was able to repeat the experiment just on my laptop, but not on the server. I have tried all the ideas mentioned above: specify cuDNN; keep the same dataset order; try noautoanchor, nosave. all of these failed and the training results are still unrepeatable on the server. But I noticed this code, since my torch version is 1.8.1+cu101, it can't satisfy the following code, so it can't go into the judgment statement. So I modified the code, but the training results are still unrepeatable and I am confused.

The modified code is as follows:
image

Maybe I didn't modify it correctly. Or maybe there's some other reason.

Looking forward to your reply.

@glenn-jocher
Copy link
Member

@deadly-fish i understand the frustration you may be experiencing with unrepeatable training results despite taking various measures to ensure consistency. It seems that the floating-point arithmetic or mixed-precision training (AMP) could be contributing to the slight variations in your loss results.

In the modified code snippet you provided, it looks like there is an attempt to set the seed and control the environment's randomness, which is a good approach. Since the torch version used may affect the behavior of the code, I recommend consulting the official documentation or community forums for information specific to handling randomness in your torch version, as its behavior could have changed in the newer version.

Another approach you may consider is conducting a thorough analysis of the floating-point arithmetic and AMP configurations to identify potential sources of variation in your results. This may involve reviewing the precision of computations, rounding methods, or AMP settings to ensure reproducibility.

Additionally, you could validate the consistency of your floating-point operations across different runs and make adjustments accordingly.

I hope this information is helpful, and I encourage you to seek guidance from the official documentation or community resources specific to your torch version for further insights into managing reproducibility in your training environment. If you have any other questions or concerns, please feel free to ask.

@LXDxiaoxiaoda
Copy link

@deadly-fish i understand the frustration you may be experiencing with unrepeatable training results despite taking various measures to ensure consistency. It seems that the floating-point arithmetic or mixed-precision training (AMP) could be contributing to the slight variations in your loss results.

In the modified code snippet you provided, it looks like there is an attempt to set the seed and control the environment's randomness, which is a good approach. Since the torch version used may affect the behavior of the code, I recommend consulting the official documentation or community forums for information specific to handling randomness in your torch version, as its behavior could have changed in the newer version.

Another approach you may consider is conducting a thorough analysis of the floating-point arithmetic and AMP configurations to identify potential sources of variation in your results. This may involve reviewing the precision of computations, rounding methods, or AMP settings to ensure reproducibility.

Additionally, you could validate the consistency of your floating-point operations across different runs and make adjustments accordingly.

I hope this information is helpful, and I encourage you to seek guidance from the official documentation or community resources specific to your torch version for further insights into managing reproducibility in your training environment. If you have any other questions or concerns, please feel free to ask.

Thank you for writing back and providing me with direction. I'm hoping to address this problem later!

@glenn-jocher
Copy link
Member

@deadly-fish, you're welcome! I'm glad to hear that you're feeling more directed in tackling this issue. Remember, achieving perfect reproducibility in deep learning can be challenging due to the inherent stochastic nature of the training process, but with careful control of the factors we've discussed, you can minimize variability. If you encounter any more issues or have additional questions in the future, don't hesitate to reach out. Good luck with your experiments! 🍀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested Stale Stale and schedule for closing soon
Projects
None yet
Development

No branches or pull requests

4 participants