
"Loss is nan, stopping train" appears regularly #26

Closed
YizJia opened this issue Jun 7, 2022 · 4 comments

YizJia commented Jun 7, 2022

I followed the steps in the README, set up the directory structure, and started training the model. However, a strange problem keeps occurring; see the log excerpt below.

----OUTPUT----
Epoch: [5] [1660/2241] eta: 0:08:56 lr: 0.003000 loss: 2.2882 (2.4257) loss_proposal_cls: 0.0818 (0.0915) loss_proposal_reg: 1.2728 (1.4000) loss_box_cls: 0.1167 (0.1311) loss_box_reg: 0.1667 (0.1707) loss_box_reid: 0.4618 (0.5611) loss_rpn_reg: 0.0283 (0.0344) loss_rpn_cls: 0.0317 (0.0369) time: 0.9248 data: 0.0005 max mem: 24005
Loss is nan, stopping training
{'loss_proposal_cls': tensor(0.0837, device='cuda:0', grad_fn=), 'loss_proposal_reg': tensor(1.3923, device='cuda:0', grad_fn=), 'loss_box_cls': tensor(0.1187, device='cuda:0', grad_fn=), 'loss_box_reg': tensor(0.1719, device='cuda:0', grad_fn=), 'loss_box_reid': tensor(nan, device='cuda:0', grad_fn=), 'loss_rpn_reg': tensor(0.0457, device='cuda:0', grad_fn=), 'loss_rpn_cls': tensor(0.0226, device='cuda:0', grad_fn=)}

The failure occurs after a fixed number of iterations and is very regular. For example, after 5 full epochs, "Loss is nan, stopping training" appears around the 1160th batch of the 6th epoch, regardless of whether training starts from epoch 0 or resumes with --resume.

The error occurs whether the model is trained on an RTX A6000, an RTX A5000, or a Tesla V100 32G, and whether or not the batch size and learning rate are scaled proportionally, so training always stops.

Using --resume, I trained for 20 epochs and observed that the NaN always appears in loss_box_reid.

This looks like a bug in the code, but I am not sure what causes it or how to fix it.
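
For reference, a minimal sketch (not from the repository) of how one might pinpoint the offending term programmatically, assuming loss_dict is the per-term dictionary printed in the log above; the helper name is hypothetical:

```python
import torch

def find_nonfinite_losses(loss_dict):
    """Return the names of loss terms that are NaN or Inf."""
    return [name for name, value in loss_dict.items()
            if not torch.isfinite(value).all()]

# Example with values like those in the log above:
losses = {
    "loss_box_reid": torch.tensor(float("nan")),
    "loss_rpn_cls": torch.tensor(0.0226),
}
print(find_nonfinite_losses(losses))  # ['loss_box_reid']
```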

serend1p1ty (Owner) commented Jun 8, 2022

@YizJia What is your PyTorch version? I strongly recommend using the same version as in requirements.txt; different versions of PyTorch can cause hard-to-diagnose problems.

serend1p1ty (Owner) commented Jun 8, 2022

BTW, are you using the default configuration files (configs/cuhk_sysu.yaml, configs/prw.yaml) without modifying any code?

YizJia (Author) commented Jun 10, 2022

Thank you for your reply. I have now found and solved the problem.
The NaN appears when computing loss_box_reid and traces back to the code in the OIM.py file. It occurs when all persons in a batch are negative instances, i.e., every instance has pid 5555. This is indeed caused by the PyTorch version: the result of torch.nn.functional.cross_entropy() differs across versions. In PyTorch 1.11, cross_entropy() returns nan in the above case, while earlier versions such as 1.8.2 and 1.10, which I tested, return 0.
I switched to PyTorch 1.8.2 and everything has worked fine so far.
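
For illustration, here is a minimal sketch (not taken from the repository) of the behavior described above, plus a version-agnostic guard. The label value 5555, the use of ignore_index, the tensor shapes, and the guard itself are assumptions for the example, not the repository's actual code:

```python
import torch
import torch.nn.functional as F

num_classes = 10
logits = torch.randn(4, num_classes)                # 4 detections in one batch
targets = torch.full((4,), 5555, dtype=torch.long)  # every instance is a "negative" person

# All targets are ignored, so the mean reduction divides by zero valid samples.
loss = F.cross_entropy(logits, targets, ignore_index=5555)
print(loss)  # nan on PyTorch 1.11; 0.0 on 1.8.2 / 1.10 according to this thread

# One possible version-agnostic fix: compute the loss only over valid targets.
valid = targets != 5555
if valid.any():
    reid_loss = F.cross_entropy(logits[valid], targets[valid])
else:
    # Zero loss that still keeps the autograd graph connected.
    reid_loss = logits.sum() * 0.0
print(reid_loss)
```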

YizJia closed this as completed Jun 10, 2022
serend1p1ty (Owner) commented

I am glad to hear that.
