Could you share your train log? #8
Comments
@zxt881108 Hi, the full training log is too big, so here is a part of the training log of RefineDet512_ResNet101_COCO. PS: If you train the RefineDet512_ResNet101_COCO model, every GPU must hold more than 4 images (e.g., 5 images in our training) to keep the BN layers stable.
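As a minimal sketch of the batch-size arithmetic behind this advice (the helper names and the 4-image threshold below are assumptions taken from this thread, not from the RefineDet scripts):

```python
# Sketch: check whether a multi-GPU Caffe-style setup keeps enough images
# per GPU for stable BatchNorm statistics (rule of thumb from this thread:
# more than 4, e.g. 5, images per GPU). Names are illustrative only.

def images_per_gpu(total_batch_size: int, num_gpus: int) -> int:
    """Per-GPU mini-batch when the total batch is split evenly across GPUs."""
    return total_batch_size // num_gpus

def bn_is_stable(per_gpu_batch: int, min_images: int = 5) -> bool:
    """BatchNorm statistics are computed per GPU in vanilla Caffe,
    so only the per-GPU batch matters, not the accumulated batch."""
    return per_gpu_batch >= min_images

if __name__ == "__main__":
    # Setting from the thread: 4 GPUs, 5 images each -> stable BN.
    print(bn_is_stable(images_per_gpu(20, 4)))   # True
    # 2 images per GPU (the questioner's setting) -> likely unstable BN.
    print(bn_is_stable(images_per_gpu(8, 4)))    # False
```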
Thx! Limited by GPU memory, I set minibatch=2 for each GPU; maybe this is the main reason.
@sfzhang15 Hi, which GPU hardware and CUDA/cuDNN versions are you using to train ResNet101-512? I use P100 cards with 16 GB memory, but each can hold at most 3 images.
@XiongweiWu Hi, as stated in footnote 7 of our paper, we use 4 M40 GPUs (24 GB) with CUDA 8.0 and cuDNN 6.0.
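For anyone hitting the same memory limit, here is a hedged sketch of the trade-off. Caffe's solver `iter_size` is a real parameter for gradient accumulation, but the helper and the numbers below are illustrative assumptions, not part of the RefineDet training scripts:

```python
# Sketch: Caffe's solver `iter_size` accumulates gradients over several
# forward/backward passes, so a memory-limited per-GPU batch can still
# reach a large effective batch for SGD. Note that BatchNorm still only
# sees the per-GPU, per-pass batch, so iter_size does not fix unstable BN.

def solver_iter_size(target_effective_batch: int,
                     per_gpu_batch: int,
                     num_gpus: int) -> int:
    """Number of accumulation steps needed to reach the target batch."""
    per_step = per_gpu_batch * num_gpus
    return max(1, (target_effective_batch + per_step - 1) // per_step)

if __name__ == "__main__":
    # e.g. target an effective batch of 20 with 3 images per P100 and 2 GPUs:
    print(solver_iter_size(20, 3, 2))  # 4 accumulation steps (24 images effectively)
```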
Is it normal for the final training loss to be around 4?
@moyans |
@sfzhang15 thanks |
@sfzhang15 Hi! Thanks for your great work! When I use your code to train Res101 on COCO, I find the training loss is very high (both ARM and ODM); the total loss stays around 10 (with learning rate 0.001). Is this normal?