Model params change with 0 learning rate #2293

Closed

vabh opened this issue Aug 4, 2017 · 8 comments

vabh commented Aug 4, 2017

While training a convnet with SGD, the train/test loss and error change from epoch to epoch even when the learning rate is 0.

Log:

[TEST] Loss: 2.3440, Error: 8999/10000 (90%)

[TRAIN Epoch 1] Loss: 2.33110598225, Error: 44996/50000
[TEST] Loss: 2.3364, Error: 9027/10000 (90%)

[TRAIN Epoch 2] Loss: 2.33058553279, Error: 45001/50000
[TEST] Loss: 2.3342, Error: 9025/10000 (90%)

[TRAIN Epoch 3] Loss: 2.33120793699, Error: 44981/50000
[TEST] Loss: 2.3358, Error: 9030/10000 (90%)

[TRAIN Epoch 4] Loss: 2.33072890223, Error: 44987/50000
[TEST] Loss: 2.3350, Error: 9024/10000 (90%)

[TRAIN Epoch 5] Loss: 2.33064097578, Error: 45025/50000
[TEST] Loss: 2.3367, Error: 9029/10000 (90%)

[TRAIN Epoch 6] Loss: 2.33016999603, Error: 44991/50000
[TEST] Loss: 2.3359, Error: 9026/10000 (90%)

[TRAIN Epoch 7] Loss: 2.33080320681, Error: 44999/50000
[TEST] Loss: 2.3352, Error: 9035/10000 (90%)

[TRAIN Epoch 8] Loss: 2.33087820165, Error: 44996/50000
[TEST] Loss: 2.3365, Error: 9018/10000 (90%)

[TRAIN Epoch 9] Loss: 2.33066928387, Error: 45002/50000
[TEST] Loss: 2.3356, Error: 9025/10000 (90%)

This happens with DenseNet and ResNet.

Training script: https://gist.github.com/vabh/50c12ca28619836e32a869aa0e52ea38
The architecture can be chosen on lines 52-65 of the gist.

Links to implementations:
DenseNet: https://github.com/bamos/densenet.pytorch
DenseNet: https://github.com/andreasveit/densenet-pytorch
ResNeXt: https://github.com/prlz77/ResNeXt.pytorch

PyTorch version: 0.1.12_2

apaszke (Contributor) commented Aug 4, 2017

Probably because BatchNorm changes its running averages
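
Not from the thread, just for illustration: a minimal sketch showing that a BatchNorm layer updates its running statistics during the forward pass in train mode, with no backward pass and no optimizer step involved, and that eval mode freezes them.

import torch
import torch.nn as nn

# Minimal sketch: BatchNorm running statistics are buffers updated in the
# forward pass while the module is in train mode, so they drift even if the
# optimizer never takes a step (lr = 0 is irrelevant here).
bn = nn.BatchNorm2d(3)
print(bn.running_mean)                        # tensor([0., 0., 0.]) initially

bn.train()
_ = bn(torch.randn(8, 3, 16, 16))             # forward only: no backward, no optimizer.step()
print(bn.running_mean)                        # already moved toward the batch mean

bn.eval()
frozen = bn.running_mean.clone()
_ = bn(torch.randn(8, 3, 16, 16))
print(torch.equal(frozen, bn.running_mean))   # True: eval mode does not update the stats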

apaszke (Contributor) commented Aug 4, 2017

Try with model.eval()
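
Not from the thread, for illustration: a minimal sketch of that pattern, using a placeholder model and a random batch. Switching to eval mode makes BatchNorm use its stored running statistics during testing; switch back to train mode afterwards.

import torch
import torch.nn as nn

# Placeholder model and batch, only to make the sketch self-contained.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
batch = torch.randn(4, 3, 16, 16)

model.eval()                  # BatchNorm uses running mean/var, Dropout is disabled
with torch.no_grad():         # no autograd bookkeeping needed while testing
    output = model(batch)

model.train()                 # back to train mode before the next training epoch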

rasoolfa commented Aug 4, 2017

I observe the same thing with BatchNorm even with model.eval()

@shreyassaxena

The running mean and variance of the BatchNorm layers will change from epoch to epoch, even when the LR is 0. To get the same loss, you can set the momentum parameter of the BatchNorm layers to 0 (in conjunction with the LR being 0). This should fix the problem.
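
A minimal sketch of this suggestion (the model below is a placeholder, not the one from the issue). PyTorch updates the stats as running = (1 - momentum) * running + momentum * batch_stat, so momentum = 0 leaves them unchanged.

import torch.nn as nn

def zero_bn_momentum(m):
    # With momentum = 0, running = (1 - 0) * running + 0 * batch_stat,
    # i.e. the running mean/var are never updated.
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        m.momentum = 0.0

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())  # placeholder network
model.apply(zero_bn_momentum)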

fmassa (Member) commented Aug 4, 2017

Do you have weight_decay? That could also explain the change.

vabh (Author) commented Aug 5, 2017

In the training script, I do the testing after calling model.eval().
weight_decay is also set to 0.

I had missed that the running mean/std change on every iteration. After setting momentum = 0 in the BN modules, the train/test loss and error no longer change.

vabh closed this as completed Aug 5, 2017
bigbrother33 commented Aug 23, 2019

Thanks! I use this to freeze the BN layers during training:

def fix_bn(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.eval()
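
A usage sketch for the helper above (the model here is a placeholder for your own network): call it after switching to train mode, so only the BatchNorm layers are flipped back to eval mode and their running statistics stay frozen.

import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())  # placeholder network

model.train()          # put the whole network into train mode first
model.apply(fix_bn)    # then switch only the BatchNorm layers back to eval mode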

@kluaspan

When I use parameter-server training and set the learning rate to 0, the loss changes a lot between epochs:
epoch 1: loss: 0.9186
epoch 2: loss: 0.8939
epoch 3: loss: 0.9186
epoch 4: loss: 0.9957
epoch 5: loss: 0.9710
Without the parameter server, the loss stays at 0.6916. My model is a single linear layer.
