
"Loss is nan, stopping train" appears regularly #26

Closed
YizJia opened this issue Jun 7, 2022 · 4 comments

YizJia commented Jun 7, 2022

I followed the steps in the README, set up the directory structure, and started training the model. However, a strange problem keeps occurring; see the log excerpt below.

----OUTPUT----
Epoch: [5] [1660/2241] eta: 0:08:56 lr: 0.003000 loss: 2.2882 (2.4257) loss_proposal_cls: 0.0818 (0.0915) loss_proposal_reg: 1.2728 (1.4000) loss_box_cls: 0.1167 (0.1311) loss_box_reg: 0.1667 (0.1707) loss_box_reid: 0.4618 (0.5611) loss_rpn_reg: 0.0283 (0.0344) loss_rpn_cls: 0.0317 (0.0369) time: 0.9248 data: 0.0005 max mem: 24005
Loss is nan, stopping training
{'loss_proposal_cls': tensor(0.0837, device='cuda:0', grad_fn=), 'loss_proposal_reg': tensor(1.3923, device='cuda:0', grad_fn=), 'loss_box_cls': tensor(0.1187, device='cuda:0', grad_fn=), 'loss_box_reg': tensor(0.1719, device='cuda:0', grad_fn=), 'loss_box_reid': tensor(nan, device='cuda:0', grad_fn=), 'loss_rpn_reg': tensor(0.0457, device='cuda:0', grad_fn=), 'loss_rpn_cls': tensor(0.0226, device='cuda:0', grad_fn=)}

The failure occurs after a fixed number of iterations and is very regular. For example, after 5 full epochs, "Loss is nan, stopping training" appears around the 1160th batch of the 6th epoch, regardless of whether training starts from epoch 0 or resumes with --resume.

The error occurs whether the model is trained on an RTX A6000, an RTX A5000, or a Tesla V100 32G, and whether or not the batch size and learning rate are scaled proportionally, so training always stops.

Using --resume, I trained for 20 epochs and observed that the NaN always appears in loss_box_reid.

This looks like a bug in the code, but I am not sure what causes it or how to fix it.
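
For reference, a minimal sketch (not from the repository) of how one might pinpoint the offending term programmatically, assuming loss_dict is the per-term dictionary printed in the log above; the helper name is hypothetical:

```python
import torch

def find_nonfinite_losses(loss_dict):
    """Return the names of loss terms that are NaN or Inf."""
    return [name for name, value in loss_dict.items()
            if not torch.isfinite(value).all()]

# Example with values like those in the log above:
losses = {
    "loss_box_reid": torch.tensor(float("nan")),
    "loss_rpn_cls": torch.tensor(0.0226),
}
print(find_nonfinite_losses(losses))  # ['loss_box_reid']
```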

serend1p1ty (Owner) commented Jun 8, 2022

@YizJia What is your PyTorch version? I strongly recommend using the same version as in requirements.txt; different versions of PyTorch can cause hard-to-diagnose problems.

serend1p1ty (Owner) commented Jun 8, 2022

BTW, are you using the default configuration files (configs/cuhk_sysu.yaml, configs/prw.yaml) without modifying any code?

YizJia (Author) commented Jun 10, 2022

Thank you for your reply. I have now found and solved the problem.
The NaN appears when computing loss_box_reid and traces back to the code in the OIM.py file. It occurs when all persons in a batch are negative instances, i.e., every instance has pid 5555. This is indeed caused by the PyTorch version: the result of torch.nn.functional.cross_entropy() differs across versions. In PyTorch 1.11, cross_entropy() returns nan in the above case, while earlier versions such as 1.8.2 and 1.10, which I tested, return 0.
I switched to PyTorch 1.8.2 and everything has worked fine so far.
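
For illustration, here is a minimal sketch (not taken from the repository) of the behavior described above, plus a version-agnostic guard. The label value 5555, the use of ignore_index, the tensor shapes, and the guard itself are assumptions for the example, not the repository's actual code:

```python
import torch
import torch.nn.functional as F

num_classes = 10
logits = torch.randn(4, num_classes)                # 4 detections in one batch
targets = torch.full((4,), 5555, dtype=torch.long)  # every instance is a "negative" person

# All targets are ignored, so the mean reduction divides by zero valid samples.
loss = F.cross_entropy(logits, targets, ignore_index=5555)
print(loss)  # nan on PyTorch 1.11; 0.0 on 1.8.2 / 1.10 according to this thread

# One possible version-agnostic fix: compute the loss only over valid targets.
valid = targets != 5555
if valid.any():
    reid_loss = F.cross_entropy(logits[valid], targets[valid])
else:
    # Zero loss that still keeps the autograd graph connected.
    reid_loss = logits.sum() * 0.0
print(reid_loss)
```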

YizJia closed this as completed Jun 10, 2022
serend1p1ty (Owner) commented

I am glad to hear that.
