
Could you share your train log? #8

Closed
zxt881108 opened this issue Nov 23, 2017 · 7 comments
@zxt881108

@sfzhang15 Hi! Thanks for your great work! When I use your code to train ResNet-101 on COCO, the training loss is very high (both arm and odm); the total loss stays around 10 (with learning rate 0.001). Is that normal?

@sfzhang15
Owner

sfzhang15 commented Nov 23, 2017

@zxt881108 Hi, the full training log is too big, so here is part of the RefineDet512_ResNet101_COCO training log:
I1009 15:11:17.230608 5518 solver.cpp:243] Iteration 0, loss = 77.6168
I1009 15:11:17.230693 5518 solver.cpp:259] Train net output #0: arm_loss = 30.3417 (* 1 = 30.3417 loss)
I1009 15:11:17.230715 5518 solver.cpp:259] Train net output #1: odm_loss = 47.2751 (* 1 = 47.2751 loss)
I1009 15:11:17.230823 5518 sgd_solver.cpp:138] Iteration 0, lr = 0.001
I1009 22:51:56.919220 13991 solver.cpp:243] Iteration 10000, loss = 8.28546
I1009 22:51:56.919309 13991 solver.cpp:259] Train net output #0: arm_loss = 4.22539 (* 1 = 4.22539 loss)
I1009 22:51:56.919328 13991 solver.cpp:259] Train net output #1: odm_loss = 5.07301 (* 1 = 5.07301 loss)
I1009 22:51:58.349028 13991 sgd_solver.cpp:138] Iteration 10000, lr = 0.001
I1010 06:42:09.443035 13991 solver.cpp:243] Iteration 20000, loss = 8.75266
I1010 06:42:09.443153 13991 solver.cpp:259] Train net output #0: arm_loss = 4.03777 (* 1 = 4.03777 loss)
I1010 06:42:09.443168 13991 solver.cpp:259] Train net output #1: odm_loss = 3.69968 (* 1 = 3.69968 loss)
I1010 06:42:10.775079 13991 sgd_solver.cpp:138] Iteration 20000, lr = 0.001
I1010 22:09:05.001247 13991 solver.cpp:243] Iteration 40000, loss = 8.14145
I1010 22:09:05.001334 13991 solver.cpp:259] Train net output #0: arm_loss = 3.60879 (* 1 = 3.60879 loss)
I1010 22:09:05.001350 13991 solver.cpp:259] Train net output #1: odm_loss = 3.71262 (* 1 = 3.71262 loss)
I1010 22:09:05.995534 13991 sgd_solver.cpp:138] Iteration 40000, lr = 0.001
I1012 05:23:05.806807 13991 solver.cpp:243] Iteration 80000, loss = 7.2509
I1012 05:23:05.806875 13991 solver.cpp:259] Train net output #0: arm_loss = 3.2211 (* 1 = 3.2211 loss)
I1012 05:23:05.806884 13991 solver.cpp:259] Train net output #1: odm_loss = 2.79722 (* 1 = 2.79722 loss)
I1012 05:23:06.111764 13991 sgd_solver.cpp:138] Iteration 80000, lr = 0.001
I1014 21:35:23.140607 13991 solver.cpp:243] Iteration 160000, loss = 6.43958
I1014 21:35:23.140681 13991 solver.cpp:259] Train net output #0: arm_loss = 3.72447 (* 1 = 3.72447 loss)
I1014 21:35:23.140689 13991 solver.cpp:259] Train net output #1: odm_loss = 2.4261 (* 1 = 2.4261 loss)
I1014 21:35:24.672145 13991 sgd_solver.cpp:138] Iteration 160000, lr = 0.001
I1019 00:20:20.329696 21651 solver.cpp:243] Iteration 280000, loss = 6.08943
I1019 00:20:20.329771 21651 solver.cpp:259] Train net output #0: arm_loss = 3.02973 (* 1 = 3.02973 loss)
I1019 00:20:20.329785 21651 solver.cpp:259] Train net output #1: odm_loss = 2.79212 (* 1 = 2.79212 loss)
I1019 00:20:21.572805 21651 sgd_solver.cpp:138] Iteration 280000, lr = 0.001
I1025 10:35:06.596961 22840 solver.cpp:243] Iteration 480000, loss = 5.46378
I1025 10:35:06.597018 22840 solver.cpp:259] Train net output #0: arm_loss = 3.11097 (* 1 = 3.11097 loss)
I1025 10:35:06.597024 22840 solver.cpp:259] Train net output #1: odm_loss = 3.3891 (* 1 = 3.3891 loss)
I1025 10:35:06.973990 22840 sgd_solver.cpp:138] Iteration 480000, lr = 1e-05
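
For anyone comparing their own run against this excerpt, a small script can pull the arm_loss / odm_loss values out of the Caffe log. The sketch below is not part of this repo; it just assumes glog lines in the exact format shown above, and the log filename is made up.

```python
import re

# Match the solver.cpp summary line, e.g.
# "I1009 15:11:17.230608 5518 solver.cpp:243] Iteration 0, loss = 77.6168"
ITER_RE = re.compile(r"solver\.cpp:\d+\] Iteration (\d+), loss = ([\d.]+)")
# Match the per-branch outputs, e.g.
# "Train net output #0: arm_loss = 30.3417 (* 1 = 30.3417 loss)"
PART_RE = re.compile(r"Train net output #\d+: (arm_loss|odm_loss) = ([\d.]+)")

def parse_caffe_log(path):
    """Return a list of dicts like {'iter': ..., 'loss': ..., 'arm_loss': ..., 'odm_loss': ...}."""
    records, current = [], None
    with open(path) as f:
        for line in f:
            m = ITER_RE.search(line)
            if m:
                current = {"iter": int(m.group(1)), "loss": float(m.group(2))}
                records.append(current)
                continue
            m = PART_RE.search(line)
            if m and current is not None:
                current[m.group(1)] = float(m.group(2))
    return records

if __name__ == "__main__":
    # "refinedet512_res101_coco.log" is a placeholder filename
    for r in parse_caffe_log("refinedet512_res101_coco.log"):
        print(r["iter"], r.get("arm_loss"), r.get("odm_loss"), r["loss"])
```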

PS: If you train the RefineDet512_ResNet101_COCO model, every GPU must hold more than 4 images (e.g., 5 images per GPU in our training) to keep the BN layers stable.
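
To make the per-GPU requirement concrete, here is a rough sketch (the function and parameter names are assumptions, not the repo's actual training script) of how the per-GPU image count follows from the total batch size: Caffe computes BN statistics separately on each GPU, so it is this per-GPU share, not the total batch, that has to exceed 4.

```python
# Rough sketch with assumed names: SSD-style Caffe training scripts split the
# total batch_size evenly across the GPUs listed in `gpus`, and BN statistics
# are computed independently on each GPU.
def images_per_gpu(total_batch_size, gpus):
    num_gpus = len(gpus.split(","))
    per_gpu = total_batch_size // num_gpus
    assert per_gpu * num_gpus == total_batch_size, "batch_size should divide evenly across GPUs"
    return per_gpu

print(images_per_gpu(20, "0,1,2,3"))  # -> 5, the per-GPU count used by the authors
print(images_per_gpu(8, "0,1,2,3"))   # -> 2, too few for stable BN statistics
```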

@zxt881108
Author

Thanks! Due to limited GPU memory, I set minibatch=2 for each GPU; maybe that is the main reason.

@XiongweiWu

@sfzhang15 Hi, which GPU hardware and CUDA/cuDNN versions are you using for training ResNet101-512? I use P100 cards with 16 GB of memory, but they can hold at most 3 images each.

@sfzhang15
Owner

@XiongweiWu Hi, as stated in footnote 7 of our paper, we use 4 M40 GPUs (24 GB) with CUDA 8.0 and cuDNN 6.0.

@moyans

moyans commented May 30, 2018

Is it normal for the training loss to end up around 4?

@sfzhang15
Owner

@moyans
The following is our log at the end of training:
[screenshot of the final training log iterations]

@moyans

moyans commented May 31, 2018

@sfzhang15 thanks
