Weird loss progression #10

Closed
RaphaelRoyerRivard opened this issue Jun 20, 2019 · 4 comments

@RaphaelRoyerRivard

Since I am training the model on VLOG with a very small batch size, the training will take forever (8 days). Because I don't want to wait that long, I'll stop training before 30 epochs. But the losses shown in the logs seem odd to me. Can someone provide the log of a complete training run so I can compare the losses and see whether my early results are normal? Thanks

Learning Rate	Train Loss	Theta Loss	Theta Skip Loss	
0.000200	-0.002401	0.366067	0.331109	
0.000200	-0.002381	0.369635	0.328924	
0.000200	-0.001740	0.402181	0.374113	
0.000200	-0.001929	0.378956	0.342752
@xiaolonw (Owner)

An example log follows.

Note that the current code will not give you the exact same loss, but the trend of how the loss develops should be similar.

Learning Rate Train Loss Theta Loss Theta Skip Loss
0.000200 -0.023201 0.223515 0.185768
0.000200 -0.082967 0.149054 0.120956
0.000200 -0.121153 0.138757 0.109839
0.000200 -0.141511 0.132837 0.103349
0.000200 -0.154124 0.130685 0.101065
0.000200 -0.164161 0.126941 0.097509
0.000200 -0.171910 0.124375 0.094423
0.000200 -0.177002 0.123230 0.092237
0.000200 -0.182402 0.120037 0.089529
0.000200 -0.186588 0.118543 0.086799
0.000200 -0.189803 0.116007 0.084808
0.000200 -0.192916 0.114425 0.082736
0.000200 -0.196440 0.112402 0.080228
0.000200 -0.198626 0.111003 0.079104
0.000200 -0.200321 0.109698 0.077720
0.000200 -0.201791 0.108161 0.076239
0.000200 -0.204281 0.105937 0.073543
0.000200 -0.207024 0.104847 0.071410
0.000200 -0.207578 0.102365 0.069629
0.000200 -0.209727 0.101646 0.069230
0.000200 -0.210965 0.100404 0.067125
0.000200 -0.213229 0.097842 0.064572
0.000200 -0.214765 0.096944 0.063795
0.000200 -0.215127 0.095416 0.062738
0.000200 -0.215839 0.094996 0.062121
0.000200 -0.217097 0.093684 0.060339
0.000200 -0.219261 0.092733 0.059287
0.000200 -0.219723 0.091869 0.058745
0.000200 -0.221097 0.091318 0.058428
0.000200 -0.221912 0.090675 0.058063

@RaphaelRoyerRivard (Author) commented Jun 25, 2019

The only things I modified in your code are YOUR_DATASET_FOLDER (to put my own path) and another hardcoded path.
I ran the following command on VLOG (resized to 256):
python train_video_cycle_simple.py --checkpoint pytorch_checkpoints/release_model_simple --batchSize 4 --workers 4
but the losses are very different from yours...

Learning Rate	Train Loss	Theta Loss	Theta Skip Loss	
0.000200	-0.002401	0.366067	0.331109	
0.000200	-0.002381	0.369635	0.328924	
0.000200	-0.001740	0.402181	0.374113	
0.000200	-0.001929	0.378956	0.342752	
0.000200	-0.001893	0.402664	0.362544	
0.000200	-0.001851	0.384101	0.343538	
0.000200	-0.001888	0.392817	0.348998	
0.000200	-0.002026	0.373430	0.329414	
0.000200	-0.002127	0.374545	0.322591	
0.000200	-0.002059	0.373383	0.322823	
0.000200	-0.002283	0.347109	0.295166	
0.000200	-0.002365	0.354452	0.294233	
0.000200	-0.002127	0.369732	0.314337	
0.000200	-0.002101	0.369753	0.312066	
0.000200	-0.002192	0.354708	0.296371	
0.000200	-0.002064	0.373753	0.311506	
0.000200	-0.002031	0.386576	0.323555	
0.000200	-0.001990	0.379806	0.317385	
0.000200	-0.001882	0.391573	0.329034	
0.000200	-0.002011	0.374667	0.311523	
0.000200	-0.001822	0.412275	0.347809	
0.000200	-0.001636	0.460999	0.391921	
0.000200	-0.001858	0.373273	0.313632	
0.000200	-0.001881	0.371901	0.308502	

The train loss is increasing slightly instead of decreasing like yours, and the other two losses are not really changing... Do you have an idea of what is going on?
Thank you

@xiaolonw (Owner)

A very small batch size works badly with batch norm. You will also need to adjust the learning rate according to the batch size: if you divide the batch size by 8, you should also divide the learning rate by 8.
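
For reference, this is the standard linear LR scaling rule. A minimal sketch of the arithmetic, assuming the 0.000200 seen in the logs corresponds to the reference batch size (taken here as 32, i.e. 8x the batch size of 4 used above; neither value is confirmed in this thread):

# Linear LR scaling rule (sketch; the reference batch size of 32 is an assumption)
base_lr = 2e-4         # learning rate seen in the logs above
base_batch_size = 32   # assumed reference batch size
batch_size = 4         # --batchSize used in the command above

# scale the learning rate in proportion to the batch size change
scaled_lr = base_lr * batch_size / base_batch_size
print(scaled_lr)       # 2.5e-05

So the run above would use a learning rate of 2.5e-5 instead of 2e-4, assuming the training script exposes a flag to override the default.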

@RaphaelRoyerRivard (Author)

Thank you for your fast answer; I will try that.
