## Kaiming Initialization

**Paper:** https://arxiv.org/pdf/1502.01852.pdf

The weights of every layer in a neural network need initial values. Kaiming initialization (also known as He initialization) is one way to choose them. It was built with the ReLU activation function in mind.

We sample the weights from a zero-mean normal distribution with standard deviation `std`:

**X ~ N(0, std^2)**

--> fan = `fan_in` or `fan_out`

--> std = sqrt(2 / ((1 + a^2) * fan))

- a: Negative slope of the rectifier used after the layer (0 for ReLU).
- fan_in: Number of inputs. If we create a Linear layer with dimensions (784, 50), `fan_in` is 784. Using `fan_in` preserves the magnitude of the variance of the activations in the forward pass.
- fan_out: Number of outputs (50 for the same layer). Using `fan_out` preserves the magnitude of the gradients in the backward pass.
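As a quick check of the formula, the following minimal sketch computes `std` for a given fan and negative slope. The helper name `kaiming_std` is made up for this illustration and is not part of PyTorch.

```python
import math


def kaiming_std(fan: int, a: float = 0.0) -> float:
    """Std of the Kaiming normal distribution: sqrt(2 / ((1 + a^2) * fan))."""
    return math.sqrt(2.0 / ((1.0 + a ** 2) * fan))


# Linear layer with 784 inputs and 50 outputs, followed by ReLU (a = 0).
std_fan_in = kaiming_std(784)   # fan_in mode
std_fan_out = kaiming_std(50)   # fan_out mode

# A leaky ReLU with negative slope 0.01 barely changes the result.
std_leaky = kaiming_std(784, a=0.01)

print(std_fan_in, std_fan_out, std_leaky)
```

Because `fan` sits in the denominator, wider layers get smaller weights, which keeps the sum over many inputs from growing with the layer width.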

In [91]:
import torch
from torch.nn.functional import relu, linear

### Without Kaiming Init

In [92]:
x = torch.randn(784)

In [93]:
w = torch.randn(50, 784)
b = torch.randn(50)

In [94]:
y = relu(linear(x, w, b))

The mean and std of the output are very large. If every layer scales its activations up like this, the values (and the gradients) blow up as the network gets deeper, which makes training deep neural networks really difficult. This is the problem Kaiming init is trying to solve.

In [95]:
y.mean()

tensor(11.2751)

In [96]:
y.std()

tensor(16.0316)
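These numbers are roughly what theory predicts. As a back-of-the-envelope sketch, assume the pre-activation of each output unit is approximately Gaussian with zero mean: summing 784 products of independent unit-variance terms and adding a unit-variance bias gives a variance of about 784 + 1. For z ~ N(0, sigma^2), ReLU(z) has mean sigma / sqrt(2 * pi) and variance sigma^2 * (1/2 - 1/(2 * pi)). The function name below is hypothetical.

```python
import math


def relu_gaussian_moments(sigma: float) -> tuple:
    """Mean and std of ReLU(z) for z ~ N(0, sigma^2)."""
    mean = sigma / math.sqrt(2.0 * math.pi)
    std = sigma * math.sqrt(0.5 - 1.0 / (2.0 * math.pi))
    return mean, std


fan_in = 784
# Variance of w . x + b when w, x and b are all drawn from N(0, 1).
sigma = math.sqrt(fan_in + 1)

mean, std = relu_gaussian_moments(sigma)
print(f"pre-activation std: {sigma:.2f}")
print(f"expected ReLU mean: {mean:.2f}, std: {std:.2f}")
```

The predicted mean and std come out at roughly 11.2 and 16.4, close to the measured 11.28 and 16.03. Both scale with sqrt(fan_in), which is exactly the factor Kaiming init divides out.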

### With Kaiming Init

In [111]:
i = 784
o = 50

In [124]:
std = torch.sqrt(torch.tensor([2 / i]))  # a = 0 (ReLU), fan = fan_in
std

tensor([0.0505])

In [125]:
w = torch.randn(o, i) * std  # scale unit-normal weights to the Kaiming std
b = torch.randn(50)

In [126]:
w.shape

torch.Size([50, 784])

In [127]:
y = relu(linear(x, w, b)) # - 0.5

The mean and std are much better than before. One idea would be to subtract 0.5 from the output to counteract the effect of ReLU, which removes all negative values, and to push the mean closer to 0.
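To see where a shift of about 0.5 comes from, here is a small experiment that is not part of the original notebook run. It uses a batch of inputs instead of a single vector and a zero bias (an assumption made here so that only the weights determine the output scale). Under those assumptions the pre-activations have variance close to 2, so the ReLU output should average about sqrt(2) / sqrt(2 * pi) = 1 / sqrt(pi), which is roughly 0.56.

```python
import math

import torch
from torch.nn.functional import relu, linear

torch.manual_seed(0)  # fixed seed so the run is repeatable

fan_in, fan_out = 784, 50
std = math.sqrt(2 / fan_in)

# A batch of inputs gives a far more stable estimate than one vector.
x = torch.randn(2000, fan_in)
w = torch.randn(fan_out, fan_in) * std
b = torch.zeros(fan_out)  # assumption: zero bias, unlike the cells above

y = relu(linear(x, w, b))
shifted = y - 0.5

print("mean before shift:", y.mean().item())
print("mean after shift: ", shifted.mean().item())
print("theoretical mean: ", 1 / math.sqrt(math.pi))
```

With a random unit-variance bias, as in the cells above, the pre-activation variance is closer to 3 and the mean of the ReLU output is correspondingly larger, so the best shift depends on how the bias is initialized.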

In [128]:
y.mean()

tensor(0.0479)

In [129]:
y.std()

tensor(0.9681)
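PyTorch ships this scheme as `torch.nn.init.kaiming_normal_`, which fills a tensor in place. A minimal sketch, assuming the same 784-to-50 layer as above; the bias is set to zero here, which is a choice made for this example rather than what the cells above do.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

layer = nn.Linear(784, 50)

# mode="fan_in" uses the number of inputs; nonlinearity="relu" sets the gain to sqrt(2).
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
nn.init.zeros_(layer.bias)  # assumption: start with a zero bias

expected_std = math.sqrt(2 / 784)
print("empirical weight std:", layer.weight.std().item())
print("expected weight std: ", expected_std)
```

The empirical std of the weights should land close to the 0.0505 computed by hand above. Passing `mode="fan_out"` instead would use 50 in the denominator.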