
Training not converge #19

Open
yq1011 opened this issue Feb 20, 2017 · 21 comments

Comments


yq1011 commented Feb 20, 2017

Hi,

Can you share how long it took you to train, and on what GPU? And what should the final loss be?

After 2 days' training on a K80, it has only reached 17,100 iterations and the loss is still in the range of 500-1000. Is this right?

Thanks

ZheC (Owner) commented Feb 20, 2017

With the large training dataset, convergence is slow. We use two Titan X GPUs (the older version) and train for six days. You can change the batch size from 10 to 8 to speed it up a little; we did not try a batch size below 8. I will plot the loss with respect to iterations and post it here.
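For a rough sense of scale, iteration counts can be converted to epochs: images seen = batch size × iterations. A minimal sketch, assuming a COCO training set of roughly 120,000 images (an approximation, not a number stated in this thread):

```python
# Rough epoch math for the numbers discussed in this thread (a sketch;
# the dataset size of ~120k images is an assumption, not from the repo).
def epochs_seen(batch_size, iterations, dataset_size=120_000):
    """Approximate number of full passes over the training set."""
    return batch_size * iterations / dataset_size

# e.g. 17,100 iterations at batch size 10 is still under 2 epochs:
print(epochs_seen(10, 17_100))  # → 1.425
```

By this estimate, a loss that is still high after ~17k iterations is unsurprising: the model has seen the data less than twice.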

yq1011 (Author) commented Feb 21, 2017

Thanks a lot!

ZheC (Owner) commented Feb 21, 2017

I deleted the wrong figure posted before; it will be updated soon.

yq1011 (Author) commented Feb 22, 2017

Hi, is this the loss of L1 or L2?

yq1011 (Author) commented Mar 2, 2017

hi, any updates? :D

Shaswat27 commented:

@ZheC Can you post the loss vs iterations curve again?

guanxiongsun commented:

@yq1011 What's your final result? Did it converge? How long did you train it for? I have been training this model for days, and it is not as fast as ZheC said: I used 4 GPUs, trained for 2 days, and it is only at about 20,000 iterations...


ds2268 commented Jul 9, 2017

I think we should talk in terms of epochs instead (the training log prints that). @ZheC when you mentioned that you were using 2 x Titan X with a batch size of 10, did that mean the actual batch size was 20 (10 per GPU)?

wujiyoung commented:

I have the same problem as you, @yq1011. Did your loss converge in the end?

jricheimer commented:

@ZheC Hi, would you be able to post the loss curve again? I would like to compare it against the model when I train it locally, to make sure it is performing comparably. Thanks!

ildoonet commented:

@ZheC Can you post the loss curve?

ildoonet commented:

@ZheC At the least, could you tell us the final loss value so that we can compare?

ZheC (Owner) commented Sep 26, 2017

Hi all, I am really sorry for my late response. I graduated from CMU, so it is not easy to access the old files anymore. But I have plotted the loss for the two levels here:

https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation/blob/master/training/example_loss/Loss_l1.png

https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation/blob/master/training/example_loss/Loss_l2.png

The full terminal output is here: https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation/blob/master/training/example_loss/output.txt

The code to plot the loss is here: https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation/blob/master/training/example_loss/plotLoss.sh
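For reference, Caffe's solver prints training-loss lines of the form `Iteration N, loss = X`, which is what a plotting script like the one above extracts. A minimal Python sketch of the same idea; the regex and the sample log lines are illustrative, not taken from the repo:

```python
# Sketch: extract (iteration, loss) pairs from Caffe's standard solver
# output. The sample log text below is made up for illustration.
import re

LOSS_RE = re.compile(r"Iteration (\d+), loss = ([\d.eE+-]+)")

def parse_loss(log_text):
    """Return a list of (iteration, loss) pairs found in a Caffe log."""
    return [(int(it), float(loss)) for it, loss in LOSS_RE.findall(log_text)]

log = """I0220 solver.cpp:228] Iteration 100, loss = 812.4
I0220 solver.cpp:228] Iteration 200, loss = 640.1"""
print(parse_loss(log))  # → [(100, 812.4), (200, 640.1)]
```

The resulting pairs can be fed straight into any plotting tool to reproduce curves like Loss_l1.png and Loss_l2.png from your own `output.txt`.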

ildoonet commented:

@ZheC Thanks for sharing!!


Nestarneal commented Oct 6, 2017

@ZheC Hi, I set the parameters based on your terminal output, but training terminates at around iteration 1200 without any log shown on the screen. Do you have any idea why? Thanks.

Ai-is-light commented:

Hi @ZheC, thanks for your great work! I have a question about training the pose model. Your loss plots
https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation/blob/master/training/example_loss/Loss_l1.png
https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation/blob/master/training/example_loss/Loss_l2.png
only show results up to about 250,000 iterations. However, in OpenPose
https://github.com/CMU-Perceptual-Computing-Lab/openpose
the released model under https://github.com/CMU-Perceptual-Computing-Lab/openpose/tree/master/models/pose/coco
is pose_iter_440000.caffemodel. Was the model trained for 250,000 iterations or 440,000?
Thanks for your attention.

Ai-is-light commented:

Has anyone reproduced the results in the paper, or tried training the smaller model with only 2 stages?

Ai-is-light commented:

@yq1011 Did you get the same results as the paper? Thanks

ZheC (Owner) commented Feb 7, 2018

@Ai-is-light I use the 440,000-iteration model. I pick the best iteration based on the evaluation score on a validation set: I keep testing the accuracy of the trained snapshots at different iterations. The best iteration is not fixed across different models, so you probably want to follow the same procedure to pick your trained model.
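The selection procedure described above (evaluate every saved snapshot on a validation set and keep the highest-scoring iteration) can be sketched as follows. Here `evaluate` is a hypothetical stand-in for a real COCO-style evaluation run, and the scores in the toy example are made up:

```python
# Sketch of snapshot selection: score each saved iteration on a
# validation set and keep the best one. `evaluate` is a placeholder
# for a real evaluation function (e.g. COCO keypoint AP on a snapshot).
def pick_best_snapshot(snapshot_iters, evaluate):
    """Return (iteration, score) for the best-scoring snapshot."""
    scores = {it: evaluate(it) for it in snapshot_iters}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy example with made-up validation scores:
fake_scores = {250_000: 56.1, 350_000: 57.4, 440_000: 58.2}
print(pick_best_snapshot(fake_scores, fake_scores.get))  # → (440000, 58.2)
```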


yw155 commented Mar 2, 2018

Hi @ZheC, I have a question: how much does the number of images used in evaluation affect the evaluation score? I saw in your paper that you chose 1160 images randomly. Why not use the whole validation set?

soans1994 commented:

Hello,
How should the step size be chosen for a given number of iterations?

The author's default is stepsize=13106 for iterations=600000.

Thank you
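For context on what stepsize does: Caffe's "step" learning-rate policy multiplies the base rate by gamma every `stepsize` iterations, i.e. lr = base_lr * gamma^floor(iter / stepsize). A sketch with illustrative values; the `base_lr` and `gamma` below are placeholders, not necessarily this repo's exact solver settings:

```python
# Caffe's "step" learning-rate policy:
#   lr = base_lr * gamma ** floor(iter / stepsize)
# base_lr and gamma here are illustrative placeholder values.
def step_lr(iteration, base_lr=4e-5, gamma=0.333, stepsize=13106):
    """Learning rate at a given iteration under the 'step' policy."""
    return base_lr * gamma ** (iteration // stepsize)

# With stepsize=13106, a 600,000-iteration run decays the rate 45 times:
print(600_000 // 13106)  # → 45
```

So the stepsize controls how many decay events fit into the training run; choosing it is a trade-off between keeping the rate high long enough to make progress and decaying often enough to settle at the end.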
