Training Loss is Puzzling #11

Closed
XYZ-qiyh opened this issue Jan 8, 2020 · 12 comments

XYZ-qiyh commented Jan 8, 2020

[screenshot: training loss curve]
I have re-trained the network using the same configuration, but the training loss doesn't converge.

Author

XYZ-qiyh commented Jan 9, 2020

[screenshot: training loss curve]
Hello @xy-guo. Thanks for your amazing work. I have re-trained the network using the same configuration, but the training loss doesn't converge.
The training loss doesn't look very plausible to me. Did you have this problem?
Regards.


kwea123 commented Feb 3, 2020

Hi, did you continue training? What does the final loss curve look like? Also, does the depth in the image tab seem to be improving? My training loss also fluctuates a lot.

@zhiwenfan

> [screenshot: training loss curve]
> Hello @xy-guo. Thanks for your amazing work. I have re-trained the network using the same configuration, but the training loss doesn't converge.
> The training loss doesn't look very plausible to me. Did you have this problem?
> Regards.

Have you ever tried using a larger batch size?

Owner

xy-guo commented Feb 13, 2020

I use 8 GPUs to train the model. It seems your training loss is converging; try using a larger smoothing weight in TensorBoard.
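
For reference, TensorBoard's smoothing slider applies an exponential moving average to the logged scalars. Below is a minimal sketch of the same smoothing applied offline, with a synthetic noisy loss curve standing in for the real one:

```python
import numpy as np

def ema_smooth(values, weight=0.9):
    """Exponential moving average, roughly what TensorBoard's smoothing slider does."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1.0 - weight) * v
        smoothed.append(last)
    return np.array(smoothed)

# Synthetic stand-in for a noisy training-loss curve: decaying trend plus noise.
steps = np.arange(2000)
raw_loss = 2.0 * np.exp(-steps / 500.0) + 0.5 * np.random.rand(len(steps))

print(raw_loss[-5:])                           # raw values fluctuate a lot
print(ema_smooth(raw_loss, weight=0.99)[-5:])  # heavy smoothing exposes the downward trend
```

With a weight close to 1 the per-step fluctuation is averaged out and the overall trend, if there is one, becomes visible.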


kwea123 commented Feb 13, 2020

Do you have the final metrics for abs_depth_error, thres2mm_error, thres4mm_error, and thres8mm_error? I want to compare them with my results, thank you.
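
For context, these metric names follow the usual MVS depth-evaluation pattern: mean absolute depth error over valid pixels, plus per-threshold error rates at 2/4/8 mm. Here is a sketch of one plausible implementation; whether this repo counts pixels above or below each threshold is an assumption:

```python
import torch

def depth_metrics(pred, gt, mask, thresholds=(2.0, 4.0, 8.0)):
    """Common MVS depth metrics; the exact definitions in this repo may differ.

    pred, gt: depth maps in mm; mask: boolean tensor marking valid ground-truth pixels.
    """
    err = (pred[mask] - gt[mask]).abs()
    metrics = {"abs_depth_error": err.mean().item()}
    for t in thresholds:
        # Assumed here: fraction of valid pixels whose absolute error exceeds t mm.
        metrics[f"thres{int(t)}mm_error"] = (err > t).float().mean().item()
    return metrics

# Example with random tensors, just to show the call signature.
pred = torch.rand(1, 128, 160) * 1000
gt = torch.rand(1, 128, 160) * 1000
mask = gt > 0
print(depth_metrics(pred, gt, mask))
```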


kwea123 commented Feb 16, 2020

@xy-guo

Owner

xy-guo commented Feb 16, 2020

Sorry, I have lost the final metrics information. @kwea123

@whubaichuan

@kwea123 Have you tried a bigger batch size? How does the training loss compare to batchsize=1?


kwea123 commented Feb 20, 2020

The model consumes too much memory, so I cannot try a bigger batch size (batchsize=1 already requires ~8 GB). You would need multiple GPUs.
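
A minimal sketch of what a larger batch spread over several GPUs could look like, assuming PyTorch's nn.DataParallel, which splits each batch across the visible devices; the tiny Sequential module below is only a placeholder for the actual network:

```python
import torch
import torch.nn as nn

# Placeholder module; the real network in this repo is far larger (~8 GB per sample).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, 3, padding=1),
)

# nn.DataParallel splits each batch across the visible GPUs, so the batch size
# can grow to roughly one sample per card without exceeding one card's memory.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

batch_size = max(1, torch.cuda.device_count())  # e.g. 8 on an 8-GPU machine
```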

@whubaichuan

@QTODD Hi, have you solved the problem of the training loss not looking plausible?

@whubaichuan

@xy-guo Do you also use 8 GPUs for testing?


kwea123 commented Feb 29, 2020

I tried; even 8 GPUs with batch size 4 still result in fluctuating losses. I think it's unsolvable, then, although it doesn't worsen the model's performance.
