Appearing "nan" during training #24

Closed
b03505036 opened this issue Mar 16, 2020 · 7 comments

@b03505036

When just using atss_r50, the loss_centerness keeps being nan:

2020-03-16 11:31:03,886 atss_core.trainer INFO: eta: 7:16:51 iter: 35320 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7951) loss_reg: nan (nan) time: 0.4746 (0.4794) data: 0.0127 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:13,489 atss_core.trainer INFO: eta: 7:16:41 iter: 35340 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7951) loss_reg: nan (nan) time: 0.4779 (0.4794) data: 0.0126 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:23,103 atss_core.trainer INFO: eta: 7:16:32 iter: 35360 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7952) loss_reg: nan (nan) time: 0.4784 (0.4794) data: 0.0129 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:32,660 atss_core.trainer INFO: eta: 7:16:22 iter: 35380 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7952) loss_reg: nan (nan) time: 0.4760 (0.4794) data: 0.0133 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:42,264 atss_core.trainer INFO: eta: 7:16:13 iter: 35400 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7952) loss_reg: nan (nan) time: 0.4782 (0.4794) data: 0.0125 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:51,898 atss_core.trainer INFO: eta: 7:16:03 iter: 35420 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7952) loss_reg: nan (nan) time: 0.4796 (0.4794) data: 0.0126 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:01,503 atss_core.trainer INFO: eta: 7:15:53 iter: 35440 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4759 (0.4794) data: 0.0132 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:10,856 atss_core.trainer INFO: eta: 7:15:44 iter: 35460 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4740 (0.4794) data: 0.0129 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:20,435 atss_core.trainer INFO: eta: 7:15:34 iter: 35480 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4752 (0.4794) data: 0.0138 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:30,141 atss_core.trainer INFO: eta: 7:15:25 iter: 35500 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4839 (0.4794) data: 0.0134 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:39,787 atss_core.trainer INFO: eta: 7:15:15 iter: 35520 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4814 (0.4794) data: 0.0124 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:49,445 atss_core.trainer INFO: eta: 7:15:06 iter: 35540 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4775 (0.4794) data: 0.0128 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:59,029 atss_core.trainer INFO: eta: 7:14:56 iter: 35560 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4765 (0.4794) data: 0.0133 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:08,577 atss_core.trainer INFO: eta: 7:14:46 iter: 35580 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4760 (0.4794) data: 0.0130 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:18,283 atss_core.trainer INFO: eta: 7:14:37 iter: 35600 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4818 (0.4794) data: 0.0123 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:27,834 atss_core.trainer INFO: eta: 7:14:27 iter: 35620 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4761 (0.4794) data: 0.0127 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:37,441 atss_core.trainer INFO: eta: 7:14:18 iter: 35640 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7955) loss_reg: nan (nan) time: 0.4758 (0.4794) data: 0.0125 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:47,026 atss_core.trainer INFO: eta: 7:14:08 iter: 35660 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7955) loss_reg: nan (nan) time: 0.4770 (0.4794) data: 0.0138 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:56,665 atss_core.trainer INFO: eta: 7:13:59 iter: 35680 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7955) loss_reg: nan (nan) time: 0.4823 (0.4794) data: 0.0131 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:06,223 atss_core.trainer INFO: eta: 7:13:49 iter: 35700 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7955) loss_reg: nan (nan) time: 0.4770 (0.4794) data: 0.0136 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:15,821 atss_core.trainer INFO: eta: 7:13:39 iter: 35720 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4782 (0.4794) data: 0.0128 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:25,217 atss_core.trainer INFO: eta: 7:13:30 iter: 35740 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4768 (0.4794) data: 0.0136 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:34,755 atss_core.trainer INFO: eta: 7:13:20 iter: 35760 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4747 (0.4794) data: 0.0126 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:44,318 atss_core.trainer INFO: eta: 7:13:10 iter: 35780 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4766 (0.4794) data: 0.0135 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:53,960 atss_core.trainer INFO: eta: 7:13:01 iter: 35800 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4813 (0.4794) data: 0.0139 (0.0135) lr: 0.010000 max mem: 5423
Traceback (most recent call last):
  File "/home/wxp/anaconda3/envs/atss/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wxp/anaconda3/envs/atss/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wxp/anaconda3/envs/atss/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/wxp/anaconda3/envs/atss/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/wxp/anaconda3/envs/atss/bin/python', '-u', 'tools/train_net.py', '--local_rank=0', '--config-file', 'configs/atss/atss_R_50_FPN_1x.yaml', 'DATALOADER.NUM_WORKERS', '2', 'OUTPUT_DIR', 'training_dir/atss_R_50_FPN_1x']' died with <Signals.SIGKILL: 9>.

@b03505036 (Author)

Hope I could get some advice, thanks!

@sfzhang15 (Owner)

@b03505036
Did you modify anything?

@b03505036 (Author)

I haven't changed anything.
Also, could you provide the anchor-free version mentioned in the paper?
Thanks for your reply.

@sfzhang15 (Owner)

@b03505036
It's weird. We have not encountered this problem. Does it always appear?
You can modify POSITIVE_TYPE and REGRESSION_TYPE to train anchor-free detectors, e.g., FCOS (POSITIVE_TYPE=SSC and REGRESSION_TYPE=POINT).
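For example, the launch command from the crash log above would take the two extra overrides like this (assuming these options live under MODEL.ATSS in the yacs config defaults; the OUTPUT_DIR name is only illustrative):

python -m torch.distributed.launch --nproc_per_node=1 tools/train_net.py --config-file configs/atss/atss_R_50_FPN_1x.yaml MODEL.ATSS.POSITIVE_TYPE SSC MODEL.ATSS.REGRESSION_TYPE POINT DATALOADER.NUM_WORKERS 2 OUTPUT_DIR training_dir/atss_anchor_free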

@wulongjian

Did you solve the problem? I have met the same problem.

@IgorSechko

Did you try reducing BASE_LR?
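For example, the same command-line override mechanism already used for DATALOADER.NUM_WORKERS in the crash log should work for the learning rate too (assuming the standard maskrcnn-benchmark-style SOLVER.BASE_LR key; 0.005 is just an illustrative halving of the 0.01 shown in the log):

python -m torch.distributed.launch --nproc_per_node=1 tools/train_net.py --config-file configs/atss/atss_R_50_FPN_1x.yaml SOLVER.BASE_LR 0.005 DATALOADER.NUM_WORKERS 2 OUTPUT_DIR training_dir/atss_R_50_FPN_1x_lowlr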

@HansolEom

I have figured out the root cause of this problem.
It occurs when an extreme value enters the sigmoid: with limited floating-point precision the output saturates to exactly 0 or 1, the following log() becomes infinite, and the loss turns into nan.
I think you used the CPU sigmoid_focal_loss. If so, fix it with max or clamp.
(With sigmoid_focal_loss_cuda, this problem doesn't happen.)
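A minimal sketch of that fix in PyTorch (not the repo's exact function; it assumes targets are integer labels in [0, num_classes] with 0 meaning background, as in the maskrcnn-benchmark-style CPU focal loss). Clamping the sigmoid output keeps torch.log finite, so the 0 * (-inf) products that produce nan can no longer occur:

import torch

def sigmoid_focal_loss_cpu(logits, targets, gamma=2.0, alpha=0.25, eps=1e-6):
    # logits: (N, num_classes) raw scores; targets: (N,) labels, 0 = background.
    num_classes = logits.shape[1]
    class_range = torch.arange(1, num_classes + 1, dtype=logits.dtype,
                               device=logits.device).unsqueeze(0)   # (1, C)
    t = targets.unsqueeze(1)                                        # (N, 1)
    # Clamp probabilities away from exactly 0 and 1 so the logs stay finite.
    p = torch.sigmoid(logits).clamp(min=eps, max=1.0 - eps)
    term1 = (1 - p) ** gamma * torch.log(p)      # term for the assigned class
    term2 = p ** gamma * torch.log(1 - p)        # term for all other classes
    pos = (t == class_range).float()
    neg = ((t != class_range) & (t >= 0)).float()
    return -pos * term1 * alpha - neg * term2 * (1 - alpha)

# Logits this extreme saturate sigmoid to exactly 0/1 in fp32 and would give nan without the clamp.
logits = torch.randn(8, 80) * 60
targets = torch.randint(0, 81, (8,))
assert torch.isfinite(sigmoid_focal_loss_cpu(logits, targets).sum())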
