In [1]:
import torch

## Why you need a good init

To understand why initialization matters in a neural net, we'll focus on its basic operation: matrix multiplication. Let's take a vector `x` and a randomly initialized matrix `a`, then multiply them 100 times (to simulate 100 layers).

The activations grow so large, so quickly, that they overflow: by the end both the mean and the standard deviation are `nan`.

In [4]:
x = torch.randn(512)
a = torch.randn(512, 512)

for i in range(100):
    x = a @ x
    
x.mean(), x.std()

(tensor(nan), tensor(nan))

We can ask the loop to break as soon as this happens. It turns out the loop exits at `i = 28`, so it takes fewer than 30 multiplications.

In [5]:
x = torch.randn(512)
a = torch.randn(512, 512)

for i in range(100):
    x = a @ x
    if torch.isnan(x.std()):  # the std turns nan once the values overflow
        break
print(i)

28
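A rough sketch of why the blow-up happens so fast: with standard-normal weights, each multiplication scales the standard deviation by about `sqrt(512)`, or roughly 22.6. Dividing the log of the largest float32 value by the log of that factor gives an estimate of about 28 steps before overflow, which is in line with the output above. The fixed seed and the names `growth` and `steps_to_overflow` are assumptions added for illustration.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

n = 512
x = torch.randn(n)
a = torch.randn(n, n)

# One "layer": compare the spread of the output with the spread of the input.
y = a @ x
growth = (y.std() / x.std()).item()
print(f"measured growth per layer: {growth:.1f}")
print(f"sqrt(n):                   {math.sqrt(n):.1f}")

# Estimate how many such steps fit below the largest float32 value.
float32_max = torch.finfo(torch.float32).max
steps_to_overflow = math.log(float32_max) / math.log(math.sqrt(n))
print(f"estimated steps to overflow: {steps_to_overflow:.1f}")
```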


What if we make the weights very small instead? Then the activations shrink to **zero**.

In [6]:
x = torch.randn(512)
a = torch.randn(512, 512) * 0.01

for i in range(100):
    x = a @ x
    
x.mean(), x.std()

(tensor(0.), tensor(0.))
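The same reasoning can be sketched for the vanishing case: scaling the weights by 0.01 turns the per-layer factor into about `0.01 * sqrt(512)`, which is below 1, so the activations shrink geometrically until float32 can no longer represent them. The seed and the names `factor` and `first_zero` are illustrative assumptions.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

n = 512
scale = 0.01
x = torch.randn(n)
a = torch.randn(n, n) * scale

# Expected change in std per layer: below 1 means the activations shrink.
factor = scale * math.sqrt(n)

first_zero = None  # first iteration at which every activation is exactly 0
for i in range(100):
    x = a @ x
    if first_zero is None and x.abs().max().item() == 0:
        first_zero = i

print(f"expected factor per layer: {factor:.3f}")
print(f"all activations are zero from iteration: {first_zero}")
```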

### The magic number of scaling

With Xavier initialization (in its simplest form), we scale the weights by **`1/math.sqrt(n_in)`**, where `n_in` is the number of inputs to the layer (512 here).
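A short derivation of where this factor comes from, under simplifying assumptions that go beyond the text above: the entries $a_{ij}$ are independent with zero mean and variance $\sigma^2$, and they are independent of the inputs $x_j$, which have zero mean and a common variance.

$$
y_i = \sum_{j=1}^{n_{in}} a_{ij}\, x_j
\quad\Longrightarrow\quad
\operatorname{Var}(y_i) = \sum_{j=1}^{n_{in}} \operatorname{Var}(a_{ij})\, \mathbb{E}[x_j^2] = n_{in}\, \sigma^2\, \operatorname{Var}(x_j)
$$

$$
\operatorname{Var}(y_i) = \operatorname{Var}(x_j)
\quad\Longleftrightarrow\quad
\sigma = \frac{1}{\sqrt{n_{in}}}
$$

For $n_{in} = 512$ this gives $\sigma \approx 0.044$. With $\sigma = 1$ the standard deviation grows by $\sqrt{512} \approx 22.6$ per layer, and with $\sigma = 0.01$ it shrinks by a factor of about $0.23$ per layer.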

In [13]:
import math

x = torch.randn(512)
a = torch.randn(512, 512) / math.sqrt(512)

for i in range(100):
    x = a @ x
    
x.mean(), x.std()

(tensor(-0.3070), tensor(2.9736))
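For comparison, PyTorch ships a built-in version, `torch.nn.init.xavier_normal_`, which draws weights with std `gain * sqrt(2 / (fan_in + fan_out))`. For a square 512 x 512 matrix that equals `1 / sqrt(512)`, so it should match the manual scaling above. This is a minimal sketch; the seed and the variable names are assumptions.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

n = 512
w = torch.empty(n, n)
torch.nn.init.xavier_normal_(w)  # fills w in place, default gain=1.0

expected_std = 1 / math.sqrt(n)
print(f"built-in std: {w.std().item():.4f}")
print(f"1/sqrt(n):    {expected_std:.4f}")

# Same 100-layer experiment as above, using the built-in init.
x = torch.randn(n)
for i in range(100):
    x = w @ x
print(x.mean(), x.std())
```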

<img src="https://snag.gy/CUFMsa.jpg">

- There's another paper called "all you need is a good init"
- Orthogonal initialization
- Kaiming
- Xavier
- Fixup initialization
- Self-normalizing neural networks: how to choose a combination of activation function and init that guarantees unit variance (std = 1)

The last two (Fixup and self-normalizing networks, whose activation is called **SELU**) are incredibly fiddly: if the architecture changes at all, the inits and activations all need to be retuned. "All you need is a good init" is a good paper that suggests doing a little loop to find the best init.
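The "little loop" idea can be sketched roughly as follows: pass a batch through the layers one at a time and rescale each weight matrix until that layer's output has a standard deviation close to 1. This is a simplified illustration and not the paper's exact procedure (the paper starts from an orthonormal init, for instance); the helper `rescale_to_unit_std`, the tolerance, and the layer sizes are hypothetical.

```python
import torch

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

n, n_layers = 512, 10
x = torch.randn(1000, n)  # a batch of inputs
# Deliberately badly scaled weights (std 1 instead of 1/sqrt(n)).
weights = [torch.randn(n, n) for _ in range(n_layers)]

def rescale_to_unit_std(w, inp, tol=0.01, max_iters=10):
    """Divide w by the std of its output until that std is within tol of 1."""
    for _ in range(max_iters):
        std = (inp @ w).std().item()
        if abs(std - 1.0) < tol:
            break
        w = w / std
    return w

# Fix the layers in order, feeding each one the output of the fixed previous layer.
h = x
for k in range(n_layers):
    weights[k] = rescale_to_unit_std(weights[k], h)
    h = h @ weights[k]

final_std = h.std().item()
print(f"std after {n_layers} rescaled layers: {final_std:.3f}")
```

Because these layers are purely linear, one rescaling step per layer is enough; with nonlinearities in between, the loop may need more than one pass.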