# Lesson 9 

#### How to Train your model

<img src="https://snag.gy/GxVakf.jpg" width="600px"/>

#### Recap from last Lesson

- We looked at Conv2D, and looked at how it initializes parameters
- We found `math.sqrt(5)` and didn't know the reasoning
- How I researched: [fastai nb link](https://github.com/fastai/fastai_docs/blob/master/dev_course/dl2/02a_why_sqrt5.ipynb)
- Step 1: loaded everything we had from last week
- Step 2: used MNIST
- Step 3: set up a Conv2D layer
- Step 4: looked at the first 100 samples
- Step 5: looked at the mean / std of different weights



In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai_lib import *

def get_data():
    path = datasets.download_data(MNIST_URL, ext='.gz')
    with gzip.open(path, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
    return map(tensor, (x_train,y_train,x_valid,y_valid))

def normalize(x, m, s): 
    return (x-m)/s

## Why SQRT(5)?

`torch.nn.modules.conv._ConvNd.reset_parameters??`

    Signature: torch.nn.modules.conv._ConvNd.reset_parameters(self)
    Docstring: <no docstring>
    Source:   
        def reset_parameters(self):
            n = self.in_channels
            init.kaiming_uniform_(self.weight, a=math.sqrt(5))
            if self.bias is not None:
                fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
                bound = 1 / math.sqrt(fan_in)
                init.uniform_(self.bias, -bound, bound)
    File:      ~/envs/py3/lib/python3.6/site-packages/torch/nn/modules/conv.py
    Type:      function
    
    
To understand the reasoning behind the square root of 5, we will start with MNIST data and observe how the initialization affects the layer outputs. The following code block loads and prepares the MNIST dataset.

In [3]:
# load our data from MNIST
x_train, y_train, x_valid, y_valid = get_data()

# find our mean and std
train_mean, train_std = x_train.mean(), x_train.std()

# normalize the training and the validation data
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

# because the data comes in single vectors, we
# will reshape to the 28 x 28 sized 1 channel format images
x_train = x_train.view(-1,1,28,28)
x_valid = x_valid.view(-1,1,28,28)
x_train.shape, x_valid.shape

# get the number of samples
n, *_ = x_train.shape

# number of classes
c = y_train.max()+1

# number of hidden units
nh = 32
n, c

(50000, tensor(10))
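Note that the validation set is normalized with the *training* mean and std, not its own. A minimal pure-Python sketch of that step, using made-up numbers rather than MNIST pixels (the lists and the list-based `normalize` are illustrative assumptions, not part of the notebook):

```python
import statistics

def normalize(x, m, s):
    """Shift by mean m and scale by std s (same formula as the notebook helper)."""
    return [(v - m) / s for v in x]

# Made-up stand-ins for the training and validation pixels
train = [0.0, 0.1, 0.4, 0.9, 1.0]
valid = [0.2, 0.5, 0.8]

# Statistics come from the training data only
train_mean = statistics.mean(train)
train_std = statistics.stdev(train)   # sample std, like torch's default .std()

train_n = normalize(train, train_mean, train_std)
valid_n = normalize(valid, train_mean, train_std)

# The training data ends up with mean 0 and std 1;
# the validation data is only approximately standardized.
print(statistics.mean(train_n), statistics.stdev(train_n))
print(statistics.mean(valid_n), statistics.stdev(valid_n))
```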

### Let's experiment with the Conv2d

We will make a single conv2d layer with 1 input channel, `nh` (32) output filters, and a 5x5 kernel

In [4]:
l1 = nn.Conv2d(1, nh, 5)

# look at the first 100 examples for testing
x = x_valid[:100]
print(x.shape)

torch.Size([100, 1, 28, 28])


If we look at the weights, we have
- 32 different filters
- 1 channel
- 5 kernel width
- 5 kernel height

In [5]:
l1.weight.shape

torch.Size([32, 1, 5, 5])
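The weight shape also tells us what output shape to expect. A small sketch of the usual convolution size arithmetic, assuming dilation 1 (the helper name `conv_out_size` is made up for this note):

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1

# nn.Conv2d(1, nh, 5) on 28x28 inputs: stride 1, no padding
side = conv_out_size(28, kernel=5)
print(side)                    # 24
print((100, 32, side, side))   # expected shape of l1(x) for the 100-sample batch
```

So each of the 32 filters produces a 24x24 map per image.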

Here we see the stats of the weights, which are initialized with kaiming_uniform and `a=sqrt(5)`

In [6]:
# helper to inspect the mean and std of a tensor (weights, biases, activations)
def stats(x): 
    """ Returns mean and std """
    return x.mean(), x.std()

print(stats(l1.weight), stats(l1.bias))

(tensor(0.0084, grad_fn=<MeanBackward1>), tensor(0.1142, grad_fn=<StdBackward0>)) (tensor(-0.0250, grad_fn=<MeanBackward1>), tensor(0.1251, grad_fn=<StdBackward0>))


#### Pass the first 100 samples through the Conv2D

If we look at the stats of the result `t`, we see that the mean is close to zero, but the **standard deviation is NOT close to 1.**

In [7]:
t = l1(x)
stats(t)

(tensor(-0.0039, grad_fn=<MeanBackward1>),
 tensor(0.5825, grad_fn=<StdBackward0>))

In comparison, regular Kaiming normal init gets much closer to mean=0, std=1. Kaiming init is designed with a leaky ReLU in mind: the slope of a leaky ReLU for inputs below zero is called `a`, or the leak. Here nothing follows the conv layer (no activation at all), so the slope is 1 everywhere and we pass `a=1`

In [8]:
init.kaiming_normal_(l1.weight, a=1.)
stats(l1(x))


(tensor(-0.0538, grad_fn=<MeanBackward1>),
 tensor(1.1656, grad_fn=<StdBackward0>))
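Why does this land near std 1? With `a=1` the gain is 1, so the weight std is `1 / sqrt(fan_in)`. A rough pure-Python simulation, assuming the inputs are independent unit Gaussians (real MNIST patches are not, which is one reason the notebook shows about 1.17 rather than exactly 1):

```python
import math
import random
import statistics

random.seed(0)

fan_in = 25                        # 1 input channel * 5 * 5 kernel
gain = 1.0                         # a=1: no nonlinearity to correct for
w_std = gain / math.sqrt(fan_in)   # 0.2

def one_output():
    """One conv output value: dot product of 25 inputs with 25 weights."""
    return sum(random.gauss(0, 1) * random.gauss(0, w_std) for _ in range(fan_in))

outs = [one_output() for _ in range(20000)]
print(w_std, statistics.mean(outs), statistics.stdev(outs))
```

Each output sums 25 products with variance `1 * 0.2**2 = 0.04`, so the total variance is about `25 * 0.04 = 1`.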

### Let's make our own leaky relu function

In [9]:
import torch.nn.functional as F
import numpy as np

def f1(x, leak_amt=0):
    return F.leaky_relu(l1(x), leak_amt) 

Applied after Kaiming normal init with `a=0`, the std stays in the neighborhood of 1 (about 0.75 here), but **the mean is no longer 0**, because the ReLU removes every negative value.

In [10]:
init.kaiming_normal_(l1.weight, a=0.)
stats(f1(x))


(tensor(0.4494, grad_fn=<MeanBackward1>),
 tensor(0.7478, grad_fn=<StdBackward0>))
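The shifted mean is expected. A hedged simulation, assuming the pre-activations are exactly Gaussian with std `sqrt(2)` (which is what the `a=0` gain would give for ideal unit-variance inputs; the notebook's real activations only roughly match this):

```python
import math
import random
import statistics

random.seed(0)

# Idealized pre-activations: zero mean, std sqrt(2)
pre = [random.gauss(0, math.sqrt(2)) for _ in range(100000)]
post = [max(0.0, z) for z in pre]            # ReLU

mean = statistics.mean(post)
std = statistics.pstdev(post)
second_moment = statistics.mean([z * z for z in post])

print(mean, 1 / math.sqrt(math.pi))          # both about 0.56
print(std, math.sqrt(1 - 1 / math.pi))       # both about 0.83
print(second_moment)                         # about 1
```

Under these assumptions, what stays at 1 after the ReLU is the mean of the squares, not the variance: the mean moves up to roughly 0.56 and the std drops to roughly 0.83, the same pattern as the 0.45 / 0.75 seen above.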

But if we apply our ReLU to a layer with the default PyTorch init, **we don't get mean 0, std 1**

In [11]:
l1 = nn.Conv2d(1, nh, 5)
stats(f1(x))

(tensor(0.2404, grad_fn=<MeanBackward1>),
 tensor(0.4425, grad_fn=<StdBackward0>))

### To understand, we will write our own Kaiming Init function

Recall: when we do the matrix multiplication for a convolution, the kernel isn't just a 2D array applied to the input; there is an additional dimension that represents **channels**. So it's really like multiplying against a volume

Remember from before:

```python
print(l1.weight.shape)
# -> torch.Size([32, 1, 5, 5])
```
For the fan-in we multiply the kernel size `5x5` by the number of input channels `1`; for the fan-out we multiply it by the number of filters `32`


In [12]:
# receptive field size (5 x 5)
# how many elements in that kernel (only single channel)
rec_fs = l1.weight[0, 0].numel()
rec_fs

25

In [13]:
# number of filters out, number of filters in
nf, ni, *_ = l1.weight.shape
nf, ni

(32, 1)

In [14]:
# input filters times receptive field size
fan_in = ni * rec_fs

# output filters times receptive field size
fan_out = nf * rec_fs
fan_in, fan_out

(25, 800)
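The three cells above can be folded into one helper that works from the weight shape alone. A minimal sketch (the name `fans` is made up here; it only assumes the `(out_channels, in_channels, *kernel)` layout seen in `l1.weight.shape`):

```python
import math

def fans(weight_shape):
    """Return (fan_in, fan_out) for a conv weight shaped (out_ch, in_ch, *kernel)."""
    nf, ni, *kernel = weight_shape
    rec_fs = math.prod(kernel)       # receptive field size, e.g. 5 * 5 = 25
    return ni * rec_fs, nf * rec_fs

print(fans((32, 1, 5, 5)))   # (25, 800)
print(fans((16, 8, 3, 3)))   # (72, 144)
```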

In [15]:
# for a leaky relu
# kaiming init method
def gain(leaky_amt):
    return math.sqrt(2.0 / (1 + leaky_amt **2))

for leaky_amt in [1, 0, 0.01, 0.1, math.sqrt(5)]:
    print(f"{np.round(leaky_amt,2)} : \t{gain(leaky_amt)}")

1 : 	1.0
0 : 	1.4142135623730951
0.01 : 	1.4141428569978354
0.1 : 	1.4071950894605838
2.24 : 	0.5773502691896257


What the heck is 0.577?

Remember that what they use to init isn't `kaiming_normal`; it's actually **`kaiming_uniform`**. What's the difference?

<img src="https://snag.gy/xVbW92.jpg" style='width: 400px' />

In [16]:
# turns out 0.577 is the std of uniform random numbers on [-1, 1], i.e. 1/sqrt(3)
torch.zeros(100000).uniform_(-1,1).std()

tensor(0.5772)
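The 0.577 is not a coincidence: a uniform distribution on `[-b, b]` has std `b / sqrt(3)`. A quick dependency-free check (the helper name `uniform_std` is an assumption of this note):

```python
import math
import random
import statistics

random.seed(0)

def uniform_std(bound):
    """Analytic std of a uniform distribution on [-bound, bound]."""
    return bound / math.sqrt(3)

samples = [random.uniform(-1, 1) for _ in range(100000)]
print(uniform_std(1.0))              # 1/sqrt(3), about 0.577
print(statistics.pstdev(samples))    # empirical estimate, close to the above

# The leaky-ReLU gain formula evaluated at a = sqrt(5) gives the same number
print(math.sqrt(2.0 / (1 + math.sqrt(5) ** 2)))
```

This is also why a uniform init multiplies the target std by `sqrt(3)` to get its bound.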

In [17]:
# this recreates the original pytorch implementation of kaiming init
def kaiming2(x, leaky_amt, use_fan_out=False):
    nf, ni, *_ = x.shape
    rec_fs = x[0, 0].shape.numel()
    
    if use_fan_out:
        fan = nf * rec_fs  
    else:
        fan = ni * rec_fs
        
    std = gain(leaky_amt) / math.sqrt(fan)
    bound = math.sqrt(3) * std
    x.data.uniform_(-bound, bound)

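Testing our custom Kaiming function with `leaky_amt=0`, the std is still reasonably close to 1, but the mean is still off. With `leaky_amt=math.sqrt(5)` (the PyTorch default), the std drops to about 0.39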

In [18]:
kaiming2(l1.weight, leaky_amt=0)
stats(f1(x))

(tensor(0.4799, grad_fn=<MeanBackward1>),
 tensor(0.8741, grad_fn=<StdBackward0>))

In [19]:
kaiming2(l1.weight, leaky_amt=math.sqrt(5.))
stats(f1(x))

(tensor(0.2216, grad_fn=<MeanBackward1>),
 tensor(0.3925, grad_fn=<StdBackward0>))
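Plugging `a = sqrt(5)` into the formulas shows what the default actually does: the `sqrt(3)` and the gain cancel, leaving a bound of `1 / sqrt(fan_in)`. A small sketch that mirrors the arithmetic in `gain` and `kaiming2` (the name `kaiming_uniform_bound` is made up for this note):

```python
import math

def gain(a):
    """Leaky-ReLU gain, same formula as in the notebook."""
    return math.sqrt(2.0 / (1 + a ** 2))

def kaiming_uniform_bound(fan, a):
    """Uniform bound used by kaiming2 for a given fan and leak a."""
    std = gain(a) / math.sqrt(fan)
    return math.sqrt(3) * std

fan_in = 25
default_bound = kaiming_uniform_bound(fan_in, math.sqrt(5))
relu_bound = kaiming_uniform_bound(fan_in, 0)

print(default_bound, 1 / math.sqrt(fan_in))   # both about 0.2
print(default_bound / math.sqrt(3))           # weight std, about 0.115
print(relu_bound / default_bound)             # sqrt(6), about 2.449
```

The weight std of about 0.115 lines up with the 0.1142 measured on `l1.weight` earlier, and the default weights are about 2.45 times smaller than what a ReLU layer wants.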

### Let's test what happens with the default init

We are going to build a small model of four conv2d layers and let the normal PyTorch defaults do the initialization

In [30]:
class Flatten(nn.Module):
    def forward(self, x):
        """ flattens the input to 1-D; fine here because the model outputs one value per sample """
        return x.view(-1)

model = nn.Sequential(
    nn.Conv2d(1,8,5, stride=2, padding=2),
    nn.ReLU(),
    nn.Conv2d(8,16,3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16,32,3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(32,1,3, stride=2, padding=1),
    nn.AdaptiveAvgPool2d(1),
    Flatten(),
)

In [31]:
y = y_valid[:100].float()

### If we look at the std, it's nowhere near 1

This is a problem: the output std has collapsed to roughly 0.01, and the variance of the gradients in the backward pass is off as well

In [32]:
processed_x = model(x)
stats(processed_x)

(tensor(-0.0059, grad_fn=<MeanBackward1>),
 tensor(0.0115, grad_fn=<StdBackward0>))

In [33]:
l = mse(processed_x, y)
l.backward()
stats(model[0].weight.grad)

(tensor(-0.0204), tensor(0.0263))
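The shrinkage compounds with depth. A simplified pure-Python sketch, under loud assumptions: dense layers stand in for the conv layers, every layer is followed by a ReLU, there are no biases, and the inputs are unit Gaussians. It is not the notebook's model, only the same mechanism:

```python
import math
import random
import statistics

def final_std(bound_fn, width=32, depth=4, batch=100, seed=0):
    """Push random inputs through `depth` dense+ReLU layers; return the output std.

    bound_fn(fan_in) gives the uniform init bound for each layer's weights.
    """
    rng = random.Random(seed)
    acts = [[rng.gauss(0, 1) for _ in range(width)] for _ in range(batch)]
    for _ in range(depth):
        b = bound_fn(width)          # fan_in of a dense layer is its width
        w = [[rng.uniform(-b, b) for _ in range(width)] for _ in range(width)]
        acts = [[max(0.0, sum(a * wi for a, wi in zip(row, col))) for col in w]
                for row in acts]
    return statistics.pstdev([v for row in acts for v in row])

default_std = final_std(lambda fan: 1 / math.sqrt(fan))   # a = sqrt(5)
kaiming_std = final_std(lambda fan: math.sqrt(6 / fan))   # a = 0

print(default_std, kaiming_std, kaiming_std / default_std)
```

Because both runs share a seed and ReLU scales linearly for positive factors, the ratio is `sqrt(6)` per layer, i.e. about 36 after four layers: the default init loses that much signal relative to the ReLU-aware one in this toy setup.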

### Let's try the model again, but with kaiming uniform init

In [34]:
model = nn.Sequential(
    nn.Conv2d(1,8,5, stride=2, padding=2),
    nn.ReLU(),
    nn.Conv2d(8,16,3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16,32,3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(32,1,3, stride=2, padding=1),
    nn.AdaptiveAvgPool2d(1),
    Flatten(),
)

for l in model:
    if isinstance(l, nn.Conv2d):
        init.kaiming_uniform_(l.weight)
        l.bias.data.zero_()

In [35]:
processed_x = model(x)
stats(processed_x)

(tensor(0.2259, grad_fn=<MeanBackward1>),
 tensor(0.2259, grad_fn=<StdBackward0>))

In [36]:
l = mse(processed_x, y)
l.backward()
stats(model[0].weight.grad)

(tensor(-0.0650), tensor(0.3066))