# Lesson 9 

#### How to Train your model

<img src="https://snag.gy/GxVakf.jpg" width="600px"/>

#### Recap from last Lesson

- We looked at `Conv2d` and how it initializes its parameters
- We found `math.sqrt(5)` in the initialization code and didn't know the reasoning behind it
- How I researched it: [fastai nb link](https://github.com/fastai/fastai_docs/blob/master/dev_course/dl2/02a_why_sqrt5.ipynb)
- Step 1: loaded everything we had from last week
- Step 2: used MNIST
- Step 3: set up a `Conv2d` layer
- Step 4: looked at the first 100 samples
- Step 5: looked at the mean / std of the different weights



In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [5]:
from fastai_lib import *

def get_data():
    path = datasets.download_data(MNIST_URL, ext='.gz')
    with gzip.open(path, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
    return map(tensor, (x_train,y_train,x_valid,y_valid))

def normalize(x, m, s): 
    return (x-m)/s

## Why SQRT(5)?

`torch.nn.modules.conv._ConvNd.reset_parameters??`

    Signature: torch.nn.modules.conv._ConvNd.reset_parameters(self)
    Docstring: <no docstring>
    Source:   
        def reset_parameters(self):
            n = self.in_channels
            init.kaiming_uniform_(self.weight, a=math.sqrt(5))
            if self.bias is not None:
                fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
                bound = 1 / math.sqrt(fan_in)
                init.uniform_(self.bias, -bound, bound)
    File:      ~/envs/py3/lib/python3.6/site-packages/torch/nn/modules/conv.py
    Type:      function
    
    
To understand the reasoning behind `math.sqrt(5)`, we start with the MNIST data and observe how the initialization affects the mean and standard deviation of the layer's output. The following code block loads and prepares the MNIST dataset.

In [11]:
# load our data from MNIST
x_train, y_train, x_valid, y_valid = get_data()

# find our mean and std
train_mean, train_std = x_train.mean(), x_train.std()

# normalize the training and the validation data
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

# the images come as flat vectors, so we reshape them
# into single-channel 28 x 28 images
x_train = x_train.view(-1,1,28,28)
x_valid = x_valid.view(-1,1,28,28)
x_train.shape, x_valid.shape

# get the number of samples
n, *_ = x_train.shape

# number of classes
c = y_train.max()+1

# number of hidden units
nh = 32
n, c

(50000, tensor(10))

### Let's experiment with the Conv2d

In [12]:
l1 = nn.Conv2d(1, nh, 5)

# look at the first 100 examples
x = x_valid[:100]
print(x.shape)

torch.Size([100, 1, 28, 28])


In [15]:
# helper to summarize a tensor (weights or activations) by its mean and std
def stats(x): 
    """ Returns mean and std """
    return x.mean(), x.std()

print(l1.weight.shape)
print(stats(l1.weight), stats(l1.bias))

torch.Size([32, 1, 5, 5])
(tensor(0.0005, grad_fn=<MeanBackward1>), tensor(0.1133, grad_fn=<StdBackward0>)) (tensor(-0.0158, grad_fn=<MeanBackward1>), tensor(0.1103, grad_fn=<StdBackward0>))


#### Pass the first 100 samples through the Conv2d

If we look at the stats of the result `t`, we see that the mean is close to zero, but the standard deviation is NOT close to 1.

In [18]:
t = l1(x)
stats(t)

(tensor(-0.0151, grad_fn=<MeanBackward1>),
 tensor(0.6587, grad_fn=<StdBackward0>))

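In comparison, Kaiming normal initialization with `a=1` gets close to mean=0, std=1. Here `a` is the negative slope of a leaky ReLU; with `a=1` the leaky ReLU is the identity, so this setting treats the layer as purely linear.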

In [23]:
init.kaiming_normal_(l1.weight, a=1.)
stats(l1(x))

(tensor(-0.0083, grad_fn=<MeanBackward1>),
 tensor(0.9999, grad_fn=<StdBackward0>))
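
As a rough sanity check of why `a=1` lands on a standard deviation of 1, here is a minimal sketch that uses only the Python standard library instead of the notebook's tensors. It assumes idealized unit-variance Gaussian inputs rather than MNIST patches, and `simulated_output_std` and `n_outputs` are names made up for this illustration.

```python
import math
import random
import statistics

random.seed(0)

fan_in = 25        # 1 input channel * 5x5 kernel, as in the layer above
n_outputs = 5000   # hypothetical number of simulated output activations

def gain(a):
    # gain for a leaky ReLU with negative slope `a`
    return math.sqrt(2.0 / (1 + a ** 2))

def simulated_output_std(a):
    # Kaiming normal: weights ~ N(0, w_std^2) with w_std = gain / sqrt(fan_in)
    w_std = gain(a) / math.sqrt(fan_in)
    outs = []
    for _ in range(n_outputs):
        # one output = dot product of fan_in unit-variance inputs
        # with freshly sampled weights (a simplification of a real layer)
        outs.append(sum(random.gauss(0, 1) * random.gauss(0, w_std)
                        for _ in range(fan_in)))
    return statistics.pstdev(outs)

std_linear = simulated_output_std(a=1.0)
print(std_linear)
```

With `a=1` the gain is 1, so each weight has std `1/sqrt(25) = 0.2`, and a sum of 25 such products has a standard deviation of roughly 1. That is in line with the `0.9999` measured above.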

Let's define a function `f1` that applies the conv layer followed by a leaky ReLU. With the default `a=0` this is a plain ReLU.

In [38]:
import torch.nn.functional as F
import numpy as np

def f1(x, a=0):
    return F.leaky_relu(l1(x), a) 

With Kaiming normal initialization (`a=0`), the output of `f1` has a standard deviation close to 1. The mean is no longer 0, because the ReLU zeroes out the negative activations.

In [29]:
init.kaiming_normal_(l1.weight, a=0.)
stats(f1(x))

(tensor(0.5679, grad_fn=<MeanBackward1>),
 tensor(1.0585, grad_fn=<StdBackward0>))
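
To see why a ReLU pushes the mean above zero, the following sketch applies a ReLU to idealized zero-mean Gaussian pre-activations, using only the standard library. The variance of 2 is an assumption: it is what Kaiming initialization with `a=0` targets when the inputs have unit variance. Real MNIST activations are not Gaussian, so the notebook's numbers will differ from this idealization.

```python
import math
import random
import statistics

random.seed(0)

# Idealized pre-activations: zero-mean Gaussian with variance 2
# (an assumption for illustration, not the real MNIST activations)
pre = [random.gauss(0, math.sqrt(2)) for _ in range(100_000)]
post = [max(0.0, z) for z in pre]   # ReLU

mean_post = statistics.fmean(post)
second_moment = statistics.fmean(z * z for z in post)
print(mean_post, second_moment)
```

For a Gaussian with standard deviation `s`, the mean after a ReLU is `s / sqrt(2 * pi)`, which for `s = sqrt(2)` is `1 / sqrt(pi)`, about 0.56. The mean of the squared outputs is `s**2 / 2 = 1`, which is the quantity the factor of 2 in the gain is meant to preserve.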

But if we re-create the layer with PyTorch's default initialization, the mean is not 0 and the standard deviation is well below 1:

In [30]:
l1 = nn.Conv2d(1, nh, 5)
stats(f1(x))

(tensor(0.2036, grad_fn=<MeanBackward1>),
 tensor(0.3707, grad_fn=<StdBackward0>))

### To understand, we will write our own Kaiming Init function

Note that because we are looking at convolutions, the fan-in and fan-out also depend on the kernel size, not just on the number of input and output channels.

Remember from before:

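```python
print(l1.weight.shape)
# -> torch.Size([32, 1, 5, 5])
```
The receptive field size is the kernel size (`5 x 5 = 25`). Multiplying it by the number of input channels (`1`) gives the fan-in, and multiplying it by the number of filters (`32`) gives the fan-out.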


In [31]:
# receptive field size (5 x 5)
rec_fs = l1.weight[0, 0].numel()
rec_fs

25

In [32]:
# number of filters out, number of filters in
nf, ni, *_ = l1.weight.shape
nf, ni

(32, 1)

In [33]:
fan_in = ni * rec_fs
fan_out = nf * rec_fs
fan_in, fan_out

(25, 800)
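
The three cells above can be folded into one small function. This is only a sketch: `fan_in_and_fan_out` is a hypothetical helper that takes a plain shape tuple laid out as `(filters out, filters in, *kernel size)`, so it runs without PyTorch. The second shape is a made-up deeper layer added for comparison.

```python
import math

def fan_in_and_fan_out(weight_shape):
    """Hypothetical helper: fans for a weight shaped (out, in, *kernel)."""
    nf, ni, *kernel = weight_shape
    # receptive field size; math.prod of an empty list is 1 (linear layer)
    rec_fs = math.prod(kernel)
    return ni * rec_fs, nf * rec_fs

fans_first = fan_in_and_fan_out((32, 1, 5, 5))     # the layer used above
fans_deeper = fan_in_and_fan_out((64, 32, 3, 3))   # made-up second layer
print(fans_first, fans_deeper)
```

For the first layer this reproduces the `(25, 800)` computed by hand above.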

In [39]:
def gain(a):
    # gain for a leaky ReLU with negative slope `a` (a=0: ReLU, a=1: linear)
    return math.sqrt(2.0 / (1 + a**2))

for a in [1, 0, 0.01, 0.1, math.sqrt(5)]:
    print(f"{np.round(a,2)} : \t{gain(a)}")

1 : 	1.0
0 : 	1.4142135623730951
0.01 : 	1.4141428569978354
0.1 : 	1.4071950894605838
2.24 : 	0.5773502691896257
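
Combining the gain with the fan-in shows what `a=math.sqrt(5)` does to the weights. The sketch below assumes the usual Kaiming recipe, `std = gain / sqrt(fan_in)`, and uses the fact that `Uniform(-b, b)` has standard deviation `b / sqrt(3)`. `kaiming_uniform_bound` is a name made up for this illustration, not a PyTorch function.

```python
import math

def gain(a):
    return math.sqrt(2.0 / (1 + a ** 2))

def kaiming_uniform_bound(fan_in, a):
    # target std of the weights, then the half-width b of the
    # uniform distribution Uniform(-b, b) that has this std
    std = gain(a) / math.sqrt(fan_in)
    bound = math.sqrt(3.0) * std
    return bound, std

bound, std = kaiming_uniform_bound(fan_in=25, a=math.sqrt(5))
relu_bound, relu_std = kaiming_uniform_bound(fan_in=25, a=0)
print(bound, std, relu_bound, relu_std)
```

Under these assumptions, `a=math.sqrt(5)` gives a bound of `1/sqrt(fan_in) = 0.2`, the same bound `reset_parameters` uses for the bias. The resulting weight std is about 0.115, close to the `0.1133` measured on `l1.weight` earlier. With `a=0` the std would be larger by a factor of `sqrt(6)`, about 2.45.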
