Thanks for the great work!
I have some cases where performance drops when using SWA compared to a single model.
In this case, the loss goes from 0.67 for the single model to 0.72 for the exact SWA copy.
To debug the problem, I ran SWA for only one epoch and compared the model against its SWA copy.
All the parameters are the same except the batch norms' running_mean and running_var, and it seems that the deeper you go in the network, the bigger the divergence is:
Do you have any tips on how to recalculate the batch norm parameters more accurately? Or should I just run the training set through the SWA version multiple times so that they converge to the original model's stats?
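For reference, one pass over the training data is usually enough if the running stats are recomputed from scratch rather than updated with the default exponential momentum. A minimal sketch using `torch.optim.swa_utils.update_bn` (the toy model and random data here are placeholders for the real SWA model and training loader):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model with a BatchNorm layer, standing in for the SWA copy.
swa_model = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

# Hypothetical data; in practice, use the actual training loader.
loader = DataLoader(TensorDataset(torch.randn(256, 8)), batch_size=32)

# update_bn zeroes the running stats, switches BN to cumulative averaging
# (momentum=None), and does one forward pass over the loader. Because the
# stats are recomputed from scratch, multiple passes are not needed.
torch.optim.swa_utils.update_bn(loader, swa_model)
```

Because the stats are estimated from exact averages over the whole pass, repeating the pass would not change the result (up to data-order effects), so a single epoch suffices.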
This is the code I use to compare m, the model state dict, and swa, the SWA copy state dict:
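The snippet didn't make it into the post, so here is a hypothetical reconstruction of such a comparison: report the maximum absolute difference per key between two state dicts (the helper name and toy model are mine, not from the original):

```python
import torch
import torch.nn as nn

def compare_state_dicts(m, swa):
    """Return the max absolute difference per shared key of two state dicts."""
    diffs = {}
    for key in m:
        if key in swa:
            # .float() so integer buffers like num_batches_tracked also work.
            diffs[key] = (m[key].float() - swa[key].float()).abs().max().item()
    return diffs

# Example with two identical copies of a toy model.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
copy_ = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
copy_.load_state_dict(model.state_dict())

for key, d in compare_state_dicts(model.state_dict(), copy_.state_dict()).items():
    print(f"{key}: {d:.6f}")
```

With identical copies every difference is zero; after SWA training, the BatchNorm `running_mean`/`running_var` keys are where the divergence described above shows up.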