## 5. Losses, optimization, and initialization

In [1]:
"""
    Initialization
"""


import torch
import os.path
from torch import cuda, nn, optim, Tensor
from torch.nn import functional as F
from torchvision import datasets
from torch.autograd import Variable

### 1. Cross-entropy
- Mean-squared error is not the best loss function for classification because it is **conceptually wrong** for that task $\rightarrow$ cross-entropy is usually a better choice
- Given 2 distributions $p$ and $q$, their cross-entropy is defined as:
    $$\text{H}(p, q) = - \sum_{k} p(k)\,\log q(k)$$
- `torch.nn.CrossEntropyLoss` :
    $$L(w) = -\frac{1}{N}\sum^{N}_{n=1}\log\left(\frac{\exp f_{y_{n}}(x_{n}; w)}{\sum_{k}\exp f_{k}(x_{n}; w)}\right)$$

In [2]:
"""
    Example of 'torch.nn.CrossEntropyLoss'
"""

f = Variable(Tensor([[-1, -3, 4], [-3, 3, -1]]))
target = Variable(torch.LongTensor([0, 1]))
criterion = torch.nn.CrossEntropyLoss()
print(criterion(f, target))

tensor(2.5141)
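
As a sanity check, the value printed above can be reproduced directly from the formula for $L(w)$ without PyTorch. This is only an illustrative sketch: the helper name `cross_entropy` is made up for this note, and the shift by the row maximum is the usual log-sum-exp trick rather than anything specific to the course.

```python
import math

def cross_entropy(scores, targets):
    """Mean of -log softmax(scores[n])[targets[n]] over the N samples."""
    total = 0.0
    for row, y in zip(scores, targets):
        m = max(row)  # shift by the max so exp() cannot overflow
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[y]  # equals -log(exp(f_y) / sum_k exp(f_k))
    return total / len(scores)

loss = cross_entropy([[-1, -3, 4], [-3, 3, -1]], [0, 1])
print(round(loss, 4))  # 2.5141
```

The first sample is badly misclassified (the correct class has score -1 against 4), so it contributes almost all of the loss, while the second sample contributes about 0.02.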


<img width=50% src="images/5-1.png">

- In a two-class problem, where the $x$ axis is the activation of the correct output unit and the $y$ axis is the activation of the other one, MSE incorrectly penalizes outputs that are perfectly valid for prediction
    <img width=60% src="images/5-2.png">
- If a network should compute log-probabilities, it may have a `torch.nn.LogSoftmax` final layer, and be trained with `torch.nn.NLLLoss`
- Soft-max mapping:
    <img width=40% src="images/5-3.png">
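
The `LogSoftmax` / `NLLLoss` pairing mentioned above should give the same loss as `CrossEntropyLoss` applied to the raw scores. The sketch below checks this on the same small tensor as the earlier example; it assumes a PyTorch version where `torch.tensor` is available, and the variable names are illustrative.

```python
import torch
from torch import nn

scores = torch.tensor([[-1., -3., 4.], [-3., 3., -1.]])
target = torch.tensor([0, 1])

# Route 1: raw scores fed directly to CrossEntropyLoss
ce = nn.CrossEntropyLoss()(scores, target)

# Route 2: explicit log-probabilities, then negative log-likelihood
log_probs = nn.LogSoftmax(dim=1)(scores)
nll = nn.NLLLoss()(log_probs, target)

print(ce.item(), nll.item())  # the two values agree up to rounding
```

The second route is useful when the log-probabilities themselves are needed elsewhere, since the network output can then be read as $\log$ of the class posteriors.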

In [3]:
"""
    Example of Soft-max layer
"""
x = Variable(Tensor([[-10, -10, 10, -5],
                     [  3,   0,  0,  0],
                     [  1,   2,  3,  4]]))
f = torch.nn.Softmax(dim=1)  # normalize each row (over the class dimension)
print(f(x))

tensor([[ 2.0612e-09,  2.0612e-09,  1.0000e+00,  3.0590e-07],
        [ 8.7005e-01,  4.3317e-02,  4.3317e-02,  4.3317e-02],
        [ 3.2059e-02,  8.7144e-02,  2.3688e-01,  6.4391e-01]])


  


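- Loss functions provided by PyTorch include:
    1. `torch.nn.MSELoss`
    2. `torch.nn.CrossEntropyLoss`
    3. `torch.nn.NLLLoss`
    4. `torch.nn.L1Loss`
    5. `torch.nn.NLLLoss2d` (deprecated in later PyTorch releases, where `torch.nn.NLLLoss` also accepts image-shaped inputs)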
    6. `torch.nn.MultiMarginLoss`

### 2. Stochastic gradient descent
- Disadvantages of traditional (full-batch) gradient descent:
    - Each step takes a long time to compute
    - The computation is redundant
    - Poor performance, because the same losses $l_{n}$ are computed repeatedly
    - It is difficult to choose an efficient step size
- **Stochastic gradient descent**
    $$w_{t+1} = w_{t} - \eta \nabla l_{\,n(t)}(w_{t})$$
$\rightarrow$ Does not benefit from the speed-up of batch-processing
- **Mini-batch Stochastic gradient descent**
    - Standard procedure for deep learning
        $$w_{t+1} = w_{t} - \eta \sum^{B}_{b = 1} \nabla l_{\,n(t,\,b)}(w_{t})$$
    - Helps evade local minima
    - Performance
        <img width=60% src="images/5-4.png">
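
The mini-batch update above can be sketched in plain Python on a toy regression problem. Everything here is an assumption made for illustration: the data come from $y = 2x + 1$ without noise, the loss is the squared error, and the step size and batch size are arbitrary small values that happen to be stable for this problem.

```python
import random

random.seed(0)
# Toy data from y = 2x + 1 (no noise), so the optimum is w = 2, b = 1
xs = [random.uniform(-1, 1) for _ in range(200)]
data = [(x, 2 * x + 1) for x in xs]

w, b = 0.0, 0.0
eta, batch_size, nb_epochs = 0.01, 10, 50

for epoch in range(nb_epochs):
    random.shuffle(data)  # visit the samples in a new random order each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the squared error, summed over the mini-batch
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch)
        w -= eta * grad_w
        b -= eta * grad_b

print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
```

Because the gradient is summed rather than averaged over the batch, changing `batch_size` also changes the effective step size, which is one reason the two are usually tuned together.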

### 3. Momentum & Moment estimation
- Vanilla mini-batch stochastic gradient descent (SGD) consists of 2 parts:
    1. the update $w_{t+1} = w_{t} - \eta\, g_{t}$
    2. the gradient $g_{t} = \sum^{B}_{b=1}\nabla\, l_{n(t,\,b)}(w_{t})$, summed over a mini-batch
- Improvements:
    1. Momentum, which adds inertia to the choice of the step direction
        $$u_{t} = \gamma\, u_{t-1} + \eta\,g_{t}$$
        $$w_{t+1} = w_{t} - u_{t}$$
        - With $\gamma = 0$, this is the same as vanilla SGD
            <img width=40% src="images/5-5.png">
        - With $\gamma > 0$:
            - It can "go through" local barriers
            - It accelerates if the gradient does not change much
            - It dampens oscillations in narrow valleys
            <img width=40% src="images/5-6.png">
    2. Adam algorithm

|||
|---|---|
|<img src="images/5-7.png">|<img src="images/5-8.png">|
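
Both improvements can be written as small update functions for a single scalar parameter. The momentum step follows the two equations given above. The Adam step is written in its commonly published form with the usual default hyperparameters, which is assumed (not verified) to match the algorithm shown in the figures. The function names are illustrative.

```python
import math

def sgd_momentum_step(w, u, g, eta=0.1, gamma=0.9):
    """u_t = gamma * u_{t-1} + eta * g_t ;  w_{t+1} = w_t - u_t"""
    u = gamma * u + eta * g
    return w - u, u

def adam_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g          # running mean of the gradient
    v = beta2 * v + (1 - beta2) * g * g      # running mean of its square
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize l(w) = w^2 (gradient 2w) from w = 5 with momentum
w, u = 5.0, 0.0
for _ in range(500):
    w, u = sgd_momentum_step(w, u, 2 * w)
print(abs(w) < 1e-3)  # True

# With gamma = 0 the momentum step reduces to vanilla SGD
w_plain, _ = sgd_momentum_step(5.0, 0.0, 10.0, eta=0.1, gamma=0.0)
print(w_plain)  # 5 - 0.1 * 10 = 4.0

# The first Adam step has size ~eta whatever the gradient scale
w_adam, _, _ = adam_step(5.0, 0.0, 0.0, g=1000.0, t=1)
print(round(5.0 - w_adam, 6))  # 0.001
```

The last line illustrates why Adam is less sensitive to the scale of the loss than plain SGD: the gradient is divided by a running estimate of its own magnitude.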

### 4. `torch.optim`
- Implementing the standard SGD with `torch.optim`:
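    - Vanilla SGD
    ```python
    optimizer = torch.optim.SGD(model.parameters(), lr = eta)
    ...
    optimizer.zero_grad()  # clear the gradients accumulated by the previous step
    loss.backward()        # back-propagate to fill the .grad fields
    optimizer.step()       # update the parameters in place
    ```
    - Adam
    ```python
    optimizer = torch.optim.Adam(model.parameters(), lr = eta)
    ...
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ```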
- Available optimizers include:
    - `torch.optim.SGD` (momentum, Nesterov's algorithm)
    - `torch.optim.Adam`
    - `torch.optim.Adadelta`
    - `torch.optim.Adagrad`
    - `torch.optim.RMSprop`
    - `torch.optim.LBFGS`
- **The learning rate may have to be different if the functional was not properly scaled**
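
A short worked example of this point, for plain gradient descent and under the assumption that the loss is simply rescaled by a constant $c > 0$, i.e. $\tilde{l} = c\,l$:

$$w_{t+1} = w_{t} - \eta\,\nabla \tilde{l}\,(w_{t}) = w_{t} - (c\,\eta)\,\nabla l\,(w_{t})$$

The effective step size is $c\,\eta$, so following the same trajectory as with the original loss requires $\tilde{\eta} = \eta / c$. For instance, summing the loss over a mini-batch of size $B$ instead of averaging it corresponds to $c = B$.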

### 5. An example putting all this together
- Tools to define a deep network:
    - Fully connected layer (`torch.nn.Linear`)
    - Convolutional layer (`torch.nn.Conv2d`)
    - Pooling layer
    - ReLU
- Tools to optimize a deep network:
    - Loss
    - Back-propagation
    - Stochastic gradient descent
- By default, PyTorch initializes parameters with weights normalized according to the layer sizes

In [8]:
"""
    Example putting all things together
"""


PATH = 'model/5-init-optim-ex5.pth.tar'

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.fc1 = nn.Linear(256, 200)
        self.fc2 = nn.Linear(200, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), kernel_size=3, stride=3))
        x = F.relu(F.max_pool2d(self.conv2(x), kernel_size=2, stride=2))
        x = F.relu(self.fc1(x.view(-1, 256)))
        x = self.fc2(x)
        return x

def train_model():
    train_set = datasets.MNIST('./data/mnist/', train = True, download = True)
    # Recent torchvision exposes `data` / `targets`; `train_data` / `train_labels` are deprecated
    train_input = train_set.data.view(-1, 1, 28, 28).float()
    train_target = train_set.targets

    model, criterion = Net(), nn.CrossEntropyLoss()
    
    if cuda.is_available():
        print('(Cuda is available) ', end='')
        model.cuda()
        criterion.cuda()
        train_input, train_target = train_input.cuda(), train_target.cuda()

    ### Scaling Data
    mu, std = train_input.data.mean(), train_input.data.std()
    train_input.data.sub_(mu).div_(std)

    ### Constants
    lr, nb_epochs, batch_size = 1e-1, 10, 100
    optimizer = optim.SGD(model.parameters(), lr = lr)

    for k in range(nb_epochs):
        for b in range(0, train_input.size(0), batch_size):
            output = model(train_input.narrow(0, b, batch_size))
            loss = criterion(output, train_target.narrow(0, b, batch_size))
            model.zero_grad()
            loss.backward()
            optimizer.step()
    print('Done')
    
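    # Save only the parameters (state_dict), creating the target directory if needed
    os.makedirs(os.path.dirname(PATH), exist_ok=True)
    torch.save(model.state_dict(), PATH)
    print('Save model: Done')
    
    return model

if os.path.exists(PATH):
    print('Pretrained model found')
    model = Net()
    model.load_state_dict(torch.load(PATH, map_location='cpu'))
    print('Loading model: Done')
else:
    print('Pretrained model not found')
    print('Training: ', end='')
    model = train_model()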

Pretrained model found
Loading model: Done


### 6. $L_{2}$ and $L_{1}$ penalties
- $L_{2}$ regularization:
    $$\lambda\, \|w\|^{2}_{2}$$
- $L_{1}$ regularization:
    $$\lambda\, \|w\|_{1}$$
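
Both penalties are added to the loss, so what matters during training is their gradient with respect to $w$. The sketch below computes the two penalties and their (sub)gradients for a small weight vector; the function names and the choice of 0 as the subgradient of $|w_i|$ at $w_i = 0$ are conventions picked for this illustration.

```python
def l2_penalty(w, lam):
    """lam * ||w||_2^2 and its gradient 2 * lam * w."""
    value = lam * sum(wi * wi for wi in w)
    grad = [2 * lam * wi for wi in w]
    return value, grad

def l1_penalty(w, lam):
    """lam * ||w||_1 and a subgradient lam * sign(w) (0 is used at w_i = 0)."""
    value = lam * sum(abs(wi) for wi in w)
    grad = [lam * ((wi > 0) - (wi < 0)) for wi in w]
    return value, grad

w = [0.5, -2.0, 0.0]
v2, g2 = l2_penalty(w, 0.1)
v1, g1 = l1_penalty(w, 0.1)
print(v2, g2)  # the L2 gradient is proportional to each weight
print(v1, g1)  # the L1 gradient has the same magnitude for every non-zero weight
```

The $L_2$ gradient shrinks large weights faster than small ones, whereas the $L_1$ gradient pushes every non-zero weight toward zero at the same rate, which is what tends to produce sparse solutions. In PyTorch, an $L_2$-type penalty is commonly applied through the `weight_decay` argument of the `torch.optim` optimizers.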

### 7. Weight initialization
- Rely on controlling the variances
$$\mathbb{V}\left(\frac{\partial\,l}{\partial\,w^{(l)}_{i,\,j}}\right) \,\text{ and }\, \mathbb{V}\left(\frac{\partial\,l}{\partial\,b^{(l)}_{i}}\right)$$
so that
    - the gradient does not vanish
    - weights evolve at the same rate across layers during training, and no layer reaches a saturation behavior before others
- Types of initialization:
    1. Controlling the variance of the activations
        $$\mathbb{V}\,(w^{(l)}) = \frac{1}{N_{l - 1}}$$
    ```python
        def reset_parameters(self):
            # weight has shape (out_features, in_features), so size(1) is N_{l-1}
            stdv = 1. / math.sqrt(self.weight.size(1))
            # Uniform on [-stdv, stdv] has variance stdv ** 2 / 3
            self.weight.data.uniform_(-stdv, stdv)
            if self.bias is not None:
                self.bias.data.uniform_(-stdv, stdv)
    ```
    2. Controlling the variance of the gradient with respect to the activations
        $$\mathbb{V}\,(w^{(l)}) = \frac{1}{N_{l}}$$
    3. Xavier initialization
        $$\mathbb{V}\,(w^{(l)}) = \frac{2}{N_{l - 1} + N_{l}}$$
    ```python
        def xavier_normal(tensor, gain=1):
            if isinstance(tensor, Variable):
                xavier_normal(tensor.data, gain=gain)
                return tensor
            
            fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
            std = gain * math.sqrt(2.0 / (fan_in + fan_out))
            return tensor.normal_(0, std)
    ```
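
To make the three rules above concrete, the sketch below computes the target standard deviation for each of them and checks it against the empirical variance of Gaussian samples. It is a plain-Python illustration, not PyTorch's implementation: the function name `init_std`, the scheme labels and the layer sizes are all chosen for this example.

```python
import math
import random

def init_std(fan_in, fan_out, scheme):
    """Standard deviation of the weights under each of the three rules above."""
    if scheme == "activations":      # V(w) = 1 / N_{l-1}
        return math.sqrt(1.0 / fan_in)
    if scheme == "gradients":        # V(w) = 1 / N_l
        return math.sqrt(1.0 / fan_out)
    if scheme == "xavier":           # V(w) = 2 / (N_{l-1} + N_l)
        return math.sqrt(2.0 / (fan_in + fan_out))
    raise ValueError(f"unknown scheme: {scheme}")

fan_in, fan_out = 256, 200   # e.g. the sizes of a Linear(256, 200) layer
random.seed(0)
for scheme in ("activations", "gradients", "xavier"):
    std = init_std(fan_in, fan_out, scheme)
    sample = [random.gauss(0.0, std) for _ in range(20000)]
    empirical_var = sum(s * s for s in sample) / len(sample)
    print(f"{scheme:12s} target={std ** 2:.5f} empirical={empirical_var:.5f}")
```

The Xavier variance always lies between the other two, which is the sense in which it is a compromise between preserving the variance of the activations and that of the gradients.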

### 7. Data normalization
- Global method: a single mean and standard deviation computed over all components
```python
mu, std = train_input.mean(), train_input.std()
train_input.sub_(mu).div_(std)
test_input.sub_(mu).div_(std)
```
- Component-wise method: one mean and standard deviation per input component
```python
mu, std = train_input.mean(0), train_input.std(0)
train_input.sub_(mu).div_(std)
test_input.sub_(mu).div_(std)
```
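
One practical caveat of the component-wise method is that a component which is constant over the training set (for instance an image pixel that is always zero) has a standard deviation of 0, and the division then produces invalid values. The plain-Python sketch below shows one possible guard; the helper names and the choice of replacing a zero standard deviation by 1 are assumptions made for this example, not part of the snippets above.

```python
import math

def fit_stats(rows, eps=1e-8):
    """Per-column mean and std of a list of equal-length rows (training data only)."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    std = [math.sqrt(sum((r[j] - mu[j]) ** 2 for r in rows) / (n - 1)) for j in range(d)]
    # Constant columns have std 0; replace it by 1 so the division is harmless
    std = [s if s > eps else 1.0 for s in std]
    return mu, std

def normalize(rows, mu, std):
    """Subtract the mean and divide by the std, column by column."""
    return [[(v - m) / s for v, m, s in zip(r, mu, std)] for r in rows]

train = [[0.0, 10.0, 5.0], [0.0, 20.0, 5.0], [0.0, 30.0, 5.0]]
test = [[0.0, 40.0, 5.0]]
mu, std = fit_stats(train)
train_norm = normalize(train, mu, std)
test_norm = normalize(test, mu, std)   # test data reuses the training statistics
print(train_norm)
print(test_norm)
```

As in the snippets above, the statistics are computed on the training set only and then applied unchanged to the test set.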

### 8. Choice of architecture and step size
- Choosing the network structure is a difficult exercise. Strategy:
    - Re-use something "well known, that works"
    - Split feature extraction / inference
    - Modulate the capacity until it overfits a small subset, but does not overfit / underfit the full set
    - Capacity increases with more layers, more channels, larger receptive fields or more units
    - Regularization to reduce the capacity or induce sparsity
    - Identify common paths for siamese-like architectures
    - Identify what path(s) or sub-parts need more/less capacity
    - Use prior knowledge about the "scale of meaningful context" to size filters / combinations of filters
    - Grid-search all the variations that come to mind
- Requirements for the learning rate:
    - Reduce the loss quickly $\rightarrow$ large lr
    - Not be trapped in a bad minimum $\rightarrow$ large lr
    - Not bounce around in narrow valleys $\rightarrow$ small lr
    - Not oscillate around a minimum $\rightarrow$ small lr
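
The tension between these requirements shows up even on a toy quadratic. The sketch below is an illustration under assumed values: the loss $l(x, y) = \frac{1}{2}(x^{2} + 100\,y^{2})$ is a valley that is narrow along $y$, the starting point is arbitrary, and the three step sizes are picked to sit below, just below, and just above the stability limit $2 / 100$ of the steep direction.

```python
def gradient_descent(eta, nb_steps=100):
    """Minimize l(x, y) = 0.5 * (x^2 + 100 * y^2), a valley that is narrow along y."""
    x, y = 10.0, 1.0
    for _ in range(nb_steps):
        # The gradient is (x, 100 * y): gentle along x, steep along y
        x, y = x - eta * x, y - eta * 100.0 * y
    return 0.5 * (x * x + 100.0 * y * y)

for eta in (0.001, 0.019, 0.021):
    print(eta, gradient_descent(eta))
```

The smallest step size is stable but barely moves along the flat direction, the middle one reduces the loss much faster, and the largest one makes the iterate bounce across the valley with growing amplitude, so the loss blows up.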

### 9. Writing a `torch.autograd.Function`
- Need to implement 2 static methods
    1. `forward()`
    2. `backward()`
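
A minimal sketch of such a function, here a re-implementation of ReLU, could look as follows. The class name `ClampedReLU` is hypothetical, and the example assumes a PyTorch version that supports static-method functions with a `ctx` argument and `torch.tensor`.

```python
import torch

class ClampedReLU(torch.autograd.Function):  # hypothetical name
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)       # keep the input for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        # d relu(x) / dx is 1 where x > 0 and 0 elsewhere
        return grad_output * (x > 0).to(grad_output.dtype)

x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
y = ClampedReLU.apply(x)   # custom functions are called through .apply
y.sum().backward()
print(y.detach(), x.grad)
```

`forward()` computes the result and stores whatever `backward()` will need, and `backward()` receives the gradient of the loss with respect to the output and returns the gradient with respect to each input.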