Kaiming init of conv and linear layers, why gain = sqrt(5) #15314
Comments
I've also been trying to work out where the √5 comes from. This thread explains the reasoning: it was due to a refactor of the initialisation code.
Closing via @eugeneware's comment. The code refactor from jramseyer changed the default PyTorch initialization from manually initializing the weights by calling a random number generator function. The initialization itself comes from torch7 and torch5, and is a modified version of the initialization from LeCun '98, Efficient BackProp. This post gives more context: https://plus.google.com/106447253626219410322/posts/RZfdrRQWL6u
The G+ link no longer works. An Internet Archive alternative: https://web.archive.org/web/20170721060953/https://plus.google.com/+SoumithChintala/posts/RZfdrRQWL6u
cc @fmassa as he introduced those in #9038.
Looking into the initialisation of the Linear and Convolution layers, we have the following:
Linear:
pytorch/torch/nn/modules/linear.py
Lines 58 to 63 in 3df79f4
Convolution:
pytorch/torch/nn/modules/conv.py
Lines 45 to 51 in 3df79f4
Notice the sqrt(5) scaling factor.
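For reference, the linked reset_parameters code boils down to the bound computations below. This is a torch-free paraphrase (the helper names weight_bound and bias_bound are mine, not PyTorch's), with the math mirroring torch.nn.init:

```python
import math

def weight_bound(fan_in, a=math.sqrt(5)):
    # kaiming_uniform_ fills the weight from U(-bound, bound) where
    # bound = sqrt(3) * gain / sqrt(fan_in), with the leaky-ReLU gain.
    gain = math.sqrt(2.0 / (1.0 + a ** 2))
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std

def bias_bound(fan_in):
    # The bias is filled from U(-sqrt(k), sqrt(k)) with k = 1/in_features.
    return 1.0 / math.sqrt(fan_in)

print(weight_bound(64))  # with a = sqrt(5), this comes out to about 0.125
print(bias_bound(64))    # 1/sqrt(64) = 0.125
```

Note that with a = √5 the weight bound coincides with the bias bound 1/√fan_in, i.e. the old torch7-style uniform init mentioned in the comments above.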
Kaiming paper
https://arxiv.org/abs/1502.01852
The standard deviation should be sqrt(2 / fan_in)
Using the same principle as the Glorot et al. paper, for a uniform distribution we should use bounds of
±sqrt(3) * sqrt(2 / fan_in)
This is what is done here:
pytorch/torch/nn/init.py
Lines 288 to 293 in 700271d
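Numerically, the pure-Kaiming uniform bound works out as follows (a plain-Python sketch; the √3 factor is there because a uniform distribution U(-b, b) has variance b²/3):

```python
import math

fan_in = 100
std = math.sqrt(2.0 / fan_in)   # Kaiming std for a ReLU fan-in of 100
bound = math.sqrt(3.0) * std    # chosen so that Var(U(-bound, bound)) == std**2
print(bound)                    # sqrt(6/100), about 0.245
```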
Diving deeper into the implementation

It seems like a = √5 is used in

pytorch/torch/nn/init.py
Lines 8 to 47 in 700271d

The a is only used for leaky_relu, which is the default if we don't pass any activation to kaiming_uniform:

pytorch/torch/nn/init.py
Line 261 in 700271d
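Evaluating the leaky-ReLU gain formula sqrt(2 / (1 + negative_slope**2)), which is what calculate_gain computes for leaky_relu, makes the scale of this default concrete (plain-Python sketch):

```python
import math

def leaky_relu_gain(negative_slope):
    # The recommended leaky-ReLU gain from calculate_gain.
    return math.sqrt(2.0 / (1.0 + negative_slope ** 2))

print(leaky_relu_gain(0.01))          # about 1.414, close to the ReLU gain sqrt(2)
print(leaky_relu_gain(math.sqrt(5)))  # about 0.577, i.e. sqrt(1/3)
```

A negative slope of √5 ≈ 2.24 does not correspond to any commonly used leaky-ReLU configuration, which is part of what makes the default surprising.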
Furthermore, this √5 factor conflicts with the recommended
sqrt(2.0 / (1 + negative_slope ** 2))
in calculate_gain, and I suspect this is unintentional.

Docs
Whether the √5 factor is intentional or not, the documentation is wrong for the weights.
Linear
While for the bias

k = 1/in_features

is true, for the weight

k = 6/in_features

assuming pure Kaiming, or

k = 6 * 5/in_features

at the moment.

Convolution
The same remark applies to the convolution weights.
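As a sanity check on the k values discussed above, here is the effective k (where bound = sqrt(k / fan_in)) under two readings of the √5: as the negative_slope that the code actually passes to kaiming_uniform_, and as pure Kaiming with a = 0. This is a plain-Python sketch with the kaiming math reproduced from init.py; note that under the negative_slope reading the effective k comes out to 1/in_features, while the 6·5/in_features figure corresponds to reading the √5 as a gain multiplier instead:

```python
import math

fan_in = 64

# Reading 1: sqrt(5) passed as negative_slope (what the code does).
gain = math.sqrt(2.0 / (1.0 + 5.0))               # a**2 = 5, so gain = sqrt(1/3)
bound = math.sqrt(3.0) * gain / math.sqrt(fan_in)
k_slope = bound ** 2 * fan_in                     # k such that bound = sqrt(k / fan_in)
print(k_slope)                                    # about 1.0, same as the bias k

# Reading 2: pure Kaiming, a = 0, gain = sqrt(2).
bound0 = math.sqrt(3.0) * math.sqrt(2.0 / fan_in)
k_pure = bound0 ** 2 * fan_in
print(k_pure)                                     # about 6.0, i.e. k = 6/in_features
```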
Closing thoughts
Plenty of tutorials use ReLU and not LeakyReLU; having the default initialisation for
kaiming_uniform
assume leaky ReLU would create suboptimal training for those.

At the very least it should be noted in the documentation that the Linear and Conv layer initialisation is done assuming a leaky ReLU activation follows.
Finally, the √5 should be explained.
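A back-of-envelope illustration of the "suboptimal training" concern, under the standard Kaiming variance assumptions for a deep ReLU network (the numbers below are my own sketch, not from the thread): with the a = √5 default, each ReLU layer scales the activation variance by roughly 1/6 instead of keeping it constant.

```python
import math

fan_in = 256

# Weight bound produced by kaiming_uniform_ with a = sqrt(5).
bound = math.sqrt(3.0) * math.sqrt(2.0 / (1.0 + 5.0)) / math.sqrt(fan_in)
w_var = bound ** 2 / 3.0        # variance of U(-bound, bound)

# For a ReLU layer: output variance ~ fan_in * w_var * input_variance / 2.
ratio = fan_in * w_var / 2.0
print(ratio)                    # about 1/6: variance shrinks ~6x per layer

# After 10 such layers, the signal std has shrunk by roughly:
print((1.0 / math.sqrt(6.0)) ** 10)
```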