Kaiming init of conv and linear layers, why gain = sqrt(5) #15314

mratsim · 2018-12-17T22:16:02Z

cc @fmassa as he introduced those in #9038.

Looking into the initialisation of Linear and Convolution layers, we have the following:

Linear:

Lines 58 to 63 in 3df79f4

    
    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

Convolution:

pytorch/torch/nn/modules/conv.py

Lines 45 to 51 in 3df79f4

    
    def reset_parameters(self):
        n = self.in_channels
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

Notice the a=math.sqrt(5) argument passed to kaiming_uniform_.

Kaiming paper

https://arxiv.org/abs/1502.01852

The standard deviation should be sqrt(2 / fan_in)

Using the same principle as the Glorot et al. paper, for a uniform distribution we should use bounds of ±√3 * sqrt(2 / fan_in)

This is what is done here:

pytorch/torch/nn/init.py

Lines 288 to 293 in 700271d

    
    fan = _calculate_correct_fan(tensor, mode)
    gain = calculate_gain(nonlinearity, a)
    std = gain / math.sqrt(fan)
    bound = math.sqrt(3.0) * std  # Calculate uniform bounds from standard deviation
    with torch.no_grad():
        return tensor.uniform_(-bound, bound)
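As a sanity check on the √3 factor: a uniform distribution on [-b, b] has variance b²/3, so b = √3 * std. The sketch below only restates the quoted lines in plain Python, it is not PyTorch's code. The helper name `uniform_bound`, the fan_in of 256 and the sample count are made up for illustration.

```python
import math
import random
import statistics


def uniform_bound(fan, gain):
    """Bound b such that U(-b, b) has standard deviation gain / sqrt(fan)."""
    std = gain / math.sqrt(fan)
    return math.sqrt(3.0) * std


# Hypothetical layer with fan_in = 256, using the ReLU gain sqrt(2).
fan_in = 256
bound = uniform_bound(fan_in, math.sqrt(2.0))
target_std = math.sqrt(2.0 / fan_in)

# Draw samples from U(-bound, bound) and compare the empirical std to the target.
rng = random.Random(0)
samples = [rng.uniform(-bound, bound) for _ in range(200_000)]
empirical_std = statistics.pstdev(samples)

print("bound:", bound)
print("target std:", target_std)
print("empirical std:", empirical_std)
```

The empirical standard deviation should land close to sqrt(2 / fan_in), which is the value the paper asks for.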

Diving deeper into the implementation

It seems that a = √5 ends up being used in calculate_gain:

pytorch/torch/nn/init.py

Lines 8 to 47 in 700271d

    
    def calculate_gain(nonlinearity, param=None):
        r"""Return the recommended gain value for the given nonlinearity function.
        The values are as follows:
        ================= ====================================================
        nonlinearity      gain
        ================= ====================================================
        Linear / Identity :math:`1`
        Conv{1,2,3}D      :math:`1`
        Sigmoid           :math:`1`
        Tanh              :math:`\frac{5}{3}`
        ReLU              :math:`\sqrt{2}`
        Leaky Relu        :math:`\sqrt{\frac{2}{1 + \text{negative\_slope}^2}}`
        ================= ====================================================
        Args:
            nonlinearity: the non-linear function (`nn.functional` name)
            param: optional parameter for the non-linear function
        Examples:
            >>> gain = nn.init.calculate_gain('leaky_relu')
        """
        linear_fns = ['linear', 'conv1d', 'conv2d', 'conv3d', 'conv_transpose1d', 'conv_transpose2d', 'conv_transpose3d']
        if nonlinearity in linear_fns or nonlinearity == 'sigmoid':
            return 1
        elif nonlinearity == 'tanh':
            return 5.0 / 3
        elif nonlinearity == 'relu':
            return math.sqrt(2.0)
        elif nonlinearity == 'leaky_relu':
            if param is None:
                negative_slope = 0.01
            elif not isinstance(param, bool) and isinstance(param, int) or isinstance(param, float):
                # True/False are instances of int, hence check above
                negative_slope = param
            else:
                raise ValueError("negative_slope {} not a valid number".format(param))
            return math.sqrt(2.0 / (1 + negative_slope ** 2))
        else:
            raise ValueError("Unsupported nonlinearity {}".format(nonlinearity))

The a parameter is only used for leaky_relu, which is in fact the default nonlinearity if we don't pass one to kaiming_uniform:

pytorch/torch/nn/init.py

Line 261 in 700271d

    def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):

Furthermore, this √5 factor conflicts with the recommended sqrt(2.0 / (1 + negative_slope ** 2)) in calculate_gain, and I suspect this is unintentional.
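For reference, a quick sketch of what that formula evaluates to when a is plugged in as negative_slope, which is how the quoted kaiming_uniform_ lines pass it to calculate_gain. The helper name `leaky_relu_gain` is hypothetical and only restates the leaky_relu branch in plain Python.

```python
import math


def leaky_relu_gain(negative_slope):
    """Same formula as the leaky_relu branch of the quoted calculate_gain."""
    return math.sqrt(2.0 / (1 + negative_slope ** 2))


# a=0 is the default of kaiming_uniform_; it reduces to the ReLU gain sqrt(2).
gain_default = leaky_relu_gain(0.0)

# a=sqrt(5) is what reset_parameters passes; 2 / (1 + 5) = 1/3.
gain_sqrt5 = leaky_relu_gain(math.sqrt(5))

print("gain for a=0:      ", gain_default)
print("gain for a=sqrt(5):", gain_sqrt5)
```

So a = √5 is not itself the gain: it is treated as a negative slope, and the resulting gain is sqrt(1/3).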

Docs

Whether the √5 factor is intentional or not, the documentation is wrong for the weights.

Linear

While for bias k = 1/in_features is true, for the weight, k = 6/in_features assuming pure Kaiming, or k = 6 * 5/in_features at the moment.

Convolution

Same remark

Closing thoughts

Plenty of tutorials use ReLU rather than LeakyReLU; having kaiming_uniform default to leaky relu could make training suboptimal for those.

At the very least, the documentation should note that the initialisation of Linear and Conv layers assumes they are followed by a leaky relu activation.

Finally the √5 should be explained.


eugeneware · 2018-12-17T23:37:37Z

I've also been trying to work out where the sqrt(5) factor comes from for Linear layer initialisation.

This thread explains the reasoning. It was due to a refactor of initialisation code.

soumith · 2019-03-28T04:59:24Z

Closing via @eugeneware's comment.

The code refactor from jramseyer changed the default PyTorch initialization from manually initializing the weights by calling the random number generator function uniform to using torch.nn.init.kaiming, but it had to produce the same end result in the weights because we wanted to preserve backward compatibility. So the sqrt(5) does nothing more than give the code the same end result as before.
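A short sketch of that equivalence, assuming the pre-refactor code drew the weights from U(-1/sqrt(fan_in), 1/sqrt(fan_in)) (that is an assumption here, since the old code is not quoted in this thread). It restates the formulas from the init.py lines quoted above in plain Python; the function name `kaiming_uniform_bound` and the fan_in values are illustrative only.

```python
import math


def kaiming_uniform_bound(fan_in, a):
    """Uniform bound from the quoted kaiming_uniform_ lines, leaky_relu gain."""
    gain = math.sqrt(2.0 / (1 + a ** 2))
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std


results = []
for fan_in in (16, 784, 4096):
    new_bound = kaiming_uniform_bound(fan_in, math.sqrt(5))
    legacy_bound = 1 / math.sqrt(fan_in)  # assumed pre-refactor bound
    results.append((fan_in, new_bound, legacy_bound))
    print(fan_in, new_bound, legacy_bound, math.isclose(new_bound, legacy_bound))
```

With a = sqrt(5) the gain is sqrt(1/3), which cancels the sqrt(3) factor, so the weight bound reduces to 1/sqrt(fan_in). That is the same bound reset_parameters uses for the bias.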

The initialization itself comes from torch7 and torch5 and is a modified version of the initialization from LeCun'98 Efficient Backprop. This post gives more context: https://plus.google.com/106447253626219410322/posts/RZfdrRQWL6u

dguera · 2019-07-03T19:56:19Z

The G+ link no longer works. Alternative Internet Archive link follows: https://web.archive.org/web/20170721060953/https://plus.google.com/+SoumithChintala/posts/RZfdrRQWL6u

soumith closed this as completed Mar 28, 2019

msbaines mentioned this issue Apr 10, 2021

_ConvNd weight initialization does not match docs #55741

Closed

jbschlosser mentioned this issue May 25, 2021

DOC Adds code comment for _ConvNd.reset_parameters #58931

Closed

vasanthzoh mentioned this issue Mar 7, 2024

Issue with rescale_conv ? facebookresearch/demucs#505

Closed
