Appearing "nan" during training #24
Comments
I hope I can get some advice, thanks!
@b03505036
I have not changed anything.
@b03505036
Did you solve the problem? I have run into the same problem, 0.0
Did you try reducing BASE_LR?
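For anyone wanting to try the lower learning rate without editing the YAML, here is a minimal sketch of appending an override to the training arguments. It assumes the option is spelled `SOLVER.BASE_LR` (that key name is my assumption; check your config) and that trailing `KEY VALUE` pairs are accepted the same way as the `DATALOADER.NUM_WORKERS 2` pair in the command shown in the traceback further down. `with_lower_lr` is a hypothetical helper, not part of the repo.

```python
def with_lower_lr(train_args, base_lr, factor=0.5, key="SOLVER.BASE_LR"):
    """Return a copy of train_args with a reduced learning-rate override appended.

    `key` is an assumed option name; adjust it to whatever your config uses.
    """
    new_lr = base_lr * factor
    # Config overrides are passed as alternating KEY VALUE strings.
    return list(train_args) + [key, f"{new_lr:g}"]


args = [
    "tools/train_net.py",
    "--config-file", "configs/atss/atss_R_50_FPN_1x.yaml",
    "DATALOADER.NUM_WORKERS", "2",
]
# The log in this thread reports lr: 0.010000, so halving it gives 0.005.
print(with_lower_lr(args, base_lr=0.01))
```

Halving is only a starting point; if the loss still goes to nan, the cause may be something other than the learning rate.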
I have figured out the root cause of this problem.
I am just using atss_r50, but loss_centerness keeps coming out as nan:
2020-03-16 11:31:03,886 atss_core.trainer INFO: eta: 7:16:51 iter: 35320 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7951) loss_reg: nan (nan) time: 0.4746 (0.4794) dat[0/1224]7 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:13,489 atss_core.trainer INFO: eta: 7:16:41 iter: 35340 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7951) loss_reg: nan (nan) time: 0.4779 (0.4794) data: 0.0126 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:23,103 atss_core.trainer INFO: eta: 7:16:32 iter: 35360 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7952) loss_reg: nan (nan) time: 0.4784 (0.4794) data: 0.0129 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:32,660 atss_core.trainer INFO: eta: 7:16:22 iter: 35380 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7952) loss_reg: nan (nan) time: 0.4760 (0.4794) data: 0.0133 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:42,264 atss_core.trainer INFO: eta: 7:16:13 iter: 35400 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7952) loss_reg: nan (nan) time: 0.4782 (0.4794) data: 0.0125 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:31:51,898 atss_core.trainer INFO: eta: 7:16:03 iter: 35420 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7952) loss_reg: nan (nan) time: 0.4796 (0.4794) data: 0.0126 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:01,503 atss_core.trainer INFO: eta: 7:15:53 iter: 35440 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4759 (0.4794) data: 0.0132 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:10,856 atss_core.trainer INFO: eta: 7:15:44 iter: 35460 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4740 (0.4794) data: 0.0129 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:20,435 atss_core.trainer INFO: eta: 7:15:34 iter: 35480 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4752 (0.4794) data: 0.0138 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:30,141 atss_core.trainer INFO: eta: 7:15:25 iter: 35500 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4839 (0.4794) data: 0.0134 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:39,787 atss_core.trainer INFO: eta: 7:15:15 iter: 35520 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7953) loss_reg: nan (nan) time: 0.4814 (0.4794) data: 0.0124 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:49,445 atss_core.trainer INFO: eta: 7:15:06 iter: 35540 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4775 (0.4794) data: 0.0128 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:32:59,029 atss_core.trainer INFO: eta: 7:14:56 iter: 35560 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4765 (0.4794) data: 0.0133 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:08,577 atss_core.trainer INFO: eta: 7:14:46 iter: 35580 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4760 (0.4794) data: 0.0130 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:18,283 atss_core.trainer INFO: eta: 7:14:37 iter: 35600 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4818 (0.4794) data: 0.0123 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:27,834 atss_core.trainer INFO: eta: 7:14:27 iter: 35620 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7954) loss_reg: nan (nan) time: 0.4761 (0.4794) data: 0.0127 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:37,441 atss_core.trainer INFO: eta: 7:14:18 iter: 35640 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7955) loss_reg: nan (nan) time: 0.4758 (0.4794) data: 0.0125 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:47,026 atss_core.trainer INFO: eta: 7:14:08 iter: 35660 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7955) loss_reg: nan (nan) time: 0.4770 (0.4794) data: 0.0138 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:33:56,665 atss_core.trainer INFO: eta: 7:13:59 iter: 35680 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7955) loss_reg: nan (nan) time: 0.4823 (0.4794) data: 0.0131 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:06,223 atss_core.trainer INFO: eta: 7:13:49 iter: 35700 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7955) loss_reg: nan (nan) time: 0.4770 (0.4794) data: 0.0136 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:15,821 atss_core.trainer INFO: eta: 7:13:39 iter: 35720 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4782 (0.4794) data: 0.0128 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:25,217 atss_core.trainer INFO: eta: 7:13:30 iter: 35740 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4768 (0.4794) data: 0.0136 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:34,755 atss_core.trainer INFO: eta: 7:13:20 iter: 35760 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4747 (0.4794) data: 0.0126 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:44,318 atss_core.trainer INFO: eta: 7:13:10 iter: 35780 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4766 (0.4794) data: 0.0135 (0.0135) lr: 0.010000 max mem: 5423
2020-03-16 11:34:53,960 atss_core.trainer INFO: eta: 7:13:01 iter: 35800 loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7956) loss_reg: nan (nan) time: 0.4813 (0.4794) data: 0.0139 (0.0135) lr: 0.010000 max mem: 5423
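Every line in this excerpt already reports nan in both the current and the running-average value, so the loss must have diverged earlier in the run. To find where, one could scan the full log file for the first iteration with a non-finite loss. The sketch below assumes the `name: value (average)` layout seen in these lines; `first_non_finite` is a hypothetical helper, not something shipped with the repo.

```python
import math
import re

# "iter: 35340" gives the iteration; "loss_reg: nan (nan)" gives name and current value.
ITER_RE = re.compile(r"\biter: (\d+)")
LOSS_RE = re.compile(r"\b(loss\w*): (\S+) \(")


def first_non_finite(lines):
    """Return (iteration, [loss names]) for the first line with a nan/inf loss, else None."""
    for line in lines:
        it = ITER_RE.search(line)
        if it is None:
            continue  # not a trainer progress line
        bad = [name for name, value in LOSS_RE.findall(line)
               if not math.isfinite(float(value))]
        if bad:
            return int(it.group(1)), bad
    return None


sample = [
    "2020-03-16 11:31:13,489 atss_core.trainer INFO: eta: 7:16:41 iter: 35340 "
    "loss: nan (nan) loss_centerness: nan (nan) loss_cls: 21.8341 (21.7951) "
    "loss_reg: nan (nan)",
]
print(first_non_finite(sample))
# (35340, ['loss', 'loss_centerness', 'loss_reg'])
```

Running it over the whole log (for example `first_non_finite(open(path))`, with `path` pointing at your own log file) shows which loss term went non-finite first, which helps narrow down whether the regression or the centerness branch is the source.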
Traceback (most recent call last):
  File "/home/wxp/anaconda3/envs/atss/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wxp/anaconda3/envs/atss/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wxp/anaconda3/envs/atss/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/wxp/anaconda3/envs/atss/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/wxp/anaconda3/envs/atss/bin/python', '-u', 'tools/train_net.py', '--local_rank=0', '--config-file', 'configs/atss/atss_R_50_FPN_1x.yaml', 'DATALOADER.NUM_WORKERS', '2', 'OUTPUT_DIR', 'training_dir/atss_R_50_FPN_1x']' died with <Signals.SIGKILL: 9>.
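The log shows training carrying on for hundreds of iterations with `loss: nan` before the process was killed. A small guard in the training loop could stop the run at the first non-finite value instead. This is only a sketch: `check_losses` and `NonFiniteLossError` are hypothetical names, and it assumes the per-term losses have already been converted to Python floats (for instance with `.item()` on PyTorch tensors).

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised when any loss term is nan or inf."""


def check_losses(loss_values, iteration):
    """Raise NonFiniteLossError if any value in the name -> float mapping is nan/inf."""
    bad = sorted(name for name, value in loss_values.items()
                 if not math.isfinite(value))
    if bad:
        raise NonFiniteLossError(
            f"iteration {iteration}: non-finite losses: {', '.join(bad)}"
        )


# Values taken from the log above: loss_cls is finite, the other two are nan.
try:
    check_losses(
        {"loss_cls": 21.8341, "loss_reg": float("nan"), "loss_centerness": float("nan")},
        iteration=35320,
    )
except NonFiniteLossError as err:
    print(err)
```

Failing fast like this keeps the last good checkpoint close to the point of divergence and makes the offending loss term obvious in the error message.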