## 6. Going deeper

In [1]:
"""
    Initialization
"""


import torch
from torch import nn
from torch import Tensor
from torchvision import datasets
from torch.autograd import Variable

### 1. Benefits and challenges of greater depth
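- Trend: deeper architectures tend to improve performance
- Notation: $\kappa(f)$ is the number of linear pieces of a piecewise-linear function $f$, and $\sigma$ is the ReLU
- $\forall f,\ \kappa(\sigma(f)) \leq 2\,\kappa(f)$
- $\forall (f, g),\ \kappa(f + g) \leq \kappa(f) + \kappa(g)$
- For any ReLU MLP with $D$ layers of widths $W_{1}, \dots, W_{D}$: $\kappa(y) \leq 2^{D} \prod_{d = 1}^{D} W_{d}$
- Training a deep network requires ensuring that: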
    1. The gradient does not "vanish"
    2. Gradient amplitude is homogeneous so that all parts of the network train at the same rate
    3. The gradient does not vary too unpredictably when the weights change
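
To make the first requirement concrete, here is a minimal sketch that stacks the same activation several times on a scalar input and measures the gradient reaching the input. The helper name `input_gradient`, the depth of 10, and the starting point `x0 = 2.0` are illustrative assumptions, and a real network would also have weights between the activations.

```python
import torch

def input_gradient(activation, depth, x0=2.0):
    """Gradient of `depth` stacked activations with respect to a scalar input."""
    x = torch.tensor(x0, requires_grad=True)
    y = x
    for _ in range(depth):
        y = activation(y)
    y.backward()
    return x.grad.item()

# Each tanh contributes a factor 1 - tanh(.)^2 < 1, so the product shrinks with depth.
grad_tanh = input_gradient(torch.tanh, depth=10)
# For a positive input, each ReLU contributes a factor of exactly 1.
grad_relu = input_gradient(torch.relu, depth=10)
print(f"tanh: {grad_tanh:.4f}  relu: {grad_relu:.4f}")
```

With a saturating activation the gradient is attenuated at every layer, while the rectifier lets it through unchanged on the active side.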

### 2. Rectifiers
- $\mathrm{ReLU}$ works much better than $\tanh$ because:
    1. Its derivative does not vanish for positive inputs (it is exactly 1 there), whereas the derivative of $\tanh$ shrinks toward 0 as the unit saturates
    2. Empirically, a 4-layer CNN with $\mathrm{ReLU}$ reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with $\tanh$ neurons
- Variants of $\mathrm{ReLU}$:
    1. Leaky ReLU:
    $$x \mapsto \max(ax, x) \text{ with } 0 \leq a < 1$$
        - Parameter $a$ can be either fixed or optimized during training
    2. "Maxout" layer - Goodfellow et al. (2013)
    3. Concatenated Rectified Linear Unit (CReLU) - Shang et al. (2016)
    $$\mathbb{R} \to \mathbb{R}^{2}$$
    $$x \mapsto (\max(0,\,x),\,\max(0,\,-x))$$
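
The two formulas above translate almost directly into tensor operations. The following is a sketch only: the function names `leaky_relu` and `crelu` and the choice of concatenating along the last dimension are assumptions made for illustration. PyTorch also ships a ready-made `nn.LeakyReLU` module.

```python
import torch

def leaky_relu(x, a=0.01):
    # x -> max(a * x, x), with 0 <= a < 1
    return torch.maximum(a * x, x)

def crelu(x, dim=-1):
    # x -> (max(0, x), max(0, -x)), concatenated along `dim`,
    # so the size of that dimension doubles
    return torch.cat((torch.relu(x), torch.relu(-x)), dim=dim)

x = torch.tensor([[-2.0, 0.5]])
print(leaky_relu(x, a=0.1))  # the negative entry is scaled by a, the positive one is kept
print(crelu(x).shape)        # twice as many features as the input
```

Because CReLU doubles the number of outputs, the layer that follows it needs twice as many input features.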

### 3. Dropout
- Removing units at random during the forward pass on each sample, and putting them all back during test
<img width=80% src="images/6-1.png">
- Purpose:
    1. Increase independence between units
    2. Distribute the representation
    3. Improve performance

> "Units may change in a way that they fix up the mistakes of the other units" (Srivastava, 2014) $\rightarrow$ That's the reason for the first and the third purpose

- $p$: probability for units to be dropped (default = 0.5)
- Dropout is not implemented by actually switching off units, but equivalently as a module that drops activations at random on each sample
- In PyTorch this is `torch.nn.Dropout`, which is a `torch.nn.Module`
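
The behaviour described above can be sketched by hand. This is a simplified illustration, not the library's actual implementation, and the name `manual_dropout` is hypothetical. It uses the "inverted" formulation, in which kept activations are rescaled by $1 / (1 - p)$ during training so that nothing needs to change at test time. That rescaling is why surviving entries of an all-ones input come out as 2 when $p = 0.5$.

```python
import torch

def manual_dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each activation with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x
    keep_mask = (torch.rand_like(x) >= p).to(x.dtype)  # 1 = kept, 0 = dropped
    return x * keep_mask / (1.0 - p)

torch.manual_seed(0)
x = torch.ones(3, 9)
y = manual_dropout(x, p=0.5)
print(y)  # every entry is either 0 or 1 / (1 - p) = 2
```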

In [3]:
"""
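    Example of the 'torch.nn.Dropout' module
"""

# 3 x 9 input of ones; requires_grad lets us inspect the gradient afterwards
x = torch.ones(3, 9, requires_grad=True)
print(x.data)

# In training mode, dropout zeroes each entry with probability p
# and scales the surviving entries by 1 / (1 - p)
dropout = nn.Dropout(p=0.5)
y = dropout(x)  # calling the module runs its forward pass

# Sum of the per-row L2 norms; dropped entries receive a zero gradient
loss = y.norm(2, 1).sum()
loss.backward()
print(x.grad.data)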

tensor([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]])
tensor([[ 0.7559,  0.7559,  0.7559,  0.7559,  0.7559,  0.0000,  0.0000,
          0.7559,  0.7559],
        [ 0.0000,  0.0000,  0.0000,  1.1547,  1.1547,  0.0000,  1.1547,
          0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.8944,  0.0000,  0.0000,  0.8944,  0.8944,
          0.8944,  0.8944]])


- Adding dropout layers to a network:
    - Original:
    ```python
    model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(),
                          nn.Linear(100, 50), nn.ReLU(),
                          nn.Linear(50, 2))
    ```
    ---
    - Adding dropout layers:
    ```python
    model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(),
                          nn.Dropout(),   # default p = 0.5
                          nn.Linear(100, 50), nn.ReLU(),
                          nn.Dropout(),
                          nn.Linear(50, 2))
    ```
- A model using dropout has to be put in "train" or "test" mode explicitly, with `model.train()` or `model.eval()`
- **Variants of dropout**:
    1. **SpatialDropout** - drops entire channels instead of individual units
    2. **DropConnect** - drops connections instead of individual units. It cannot be implemented as a separate layer and is computationally intensive
- Performance comparison
<img width=60% src="images/6-2.png">
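
The train/test mode switch mentioned above can be sketched as follows. The small network and the input shape are arbitrary choices for illustration.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(),
                      nn.Dropout(),
                      nn.Linear(100, 2))
x = torch.randn(4, 10)

model.train()                      # dropout active: a new random mask on every pass
train_a, train_b = model(x), model(x)

model.eval()                       # dropout disabled: the module acts as the identity
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

print(torch.equal(eval_a, eval_b))  # True
```

Forgetting `model.eval()` before measuring test performance leaves dropout active and makes the predictions noisy.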

In [8]:
"""
    Example of SpatialDropout - 'torch.nn.Dropout2d'
"""
# Batch of 2 samples, 3 channels each, 2 x 2 activation maps
x = torch.ones(2, 3, 2, 2)
dropout2d = nn.Dropout2d()  # default p = 0.5, drops whole channels

print(dropout2d(x))

tensor([[[[ 2.,  2.],
          [ 2.,  2.]],

         [[ 0.,  0.],
          [ 0.,  0.]],

         [[ 0.,  0.],
          [ 0.,  0.]]],


        [[[ 2.,  2.],
          [ 2.,  2.]],

         [[ 0.,  0.],
          [ 0.,  0.]],

         [[ 0.,  0.],
          [ 0.,  0.]]]])


### 4. Activation normalization
- Explicitly forcing the activation statistics during the forward pass by re-normalizing them
- **Batch normalization**:
    - Can be done anywhere in a deep architecture
    - Forces the activations' first- and second-order moments, so the following layers do not need to adapt to their drift
    - During training, shifts and rescales according to the mean and variance estimated on the batch; during test, uses the moments estimated during training

In [9]:
"""
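    Example of 'torch.nn.BatchNorm1d'
"""

# 1000 samples, 3 features with different scales and offsets
x = torch.randn(1000, 3)
x = x * torch.tensor([2.0, 5.0, 10.0]) + torch.tensor([-10.0, 25.0, 3.0])

print(x.data.mean(0))
print(x.data.std(0))

bn = nn.BatchNorm1d(3)
bn.bias.data = torch.tensor([2.0, 4.0, 8.0])    # target mean per feature
bn.weight.data = torch.tensor([1.0, 2.0, 3.0])  # target standard deviation per feature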
y = bn(x)

print(y.data.mean(0))
print(y.data.std(0))

tensor([-10.0133,  24.9739,   2.7379])
tensor([ 1.9520,  5.0703,  9.9830])
tensor([ 2.0000,  4.0000,  8.0000])
tensor([ 1.0005,  2.0010,  3.0015])
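
The shift-and-rescale step can also be written out by hand and compared with the module. This sketch assumes training mode, where the statistics come from the current batch; the variable names `mean`, `var`, and `y_manual` are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(1000, 3) * torch.tensor([2.0, 5.0, 10.0]) + torch.tensor([-10.0, 25.0, 3.0])

bn = nn.BatchNorm1d(3)   # freshly built modules are in training mode
with torch.no_grad():
    bn.weight.copy_(torch.tensor([1.0, 2.0, 3.0]))  # scale (gamma)
    bn.bias.copy_(torch.tensor([2.0, 4.0, 8.0]))    # shift (beta)

# Hand-written version: batch mean and *biased* batch variance, per feature
mean = x.mean(0)
var = x.var(0, unbiased=False)
y_manual = bn.weight * (x - mean) / torch.sqrt(var + bn.eps) + bn.bias

y = bn(x)
print(torch.allclose(y, y_manual, atol=1e-4))
```

Batch normalization uses the biased variance while `std` defaults to the unbiased estimator, which is why the standard deviations printed above are the targets multiplied by $\sqrt{1000 / 999} \approx 1.0005$ rather than exactly 1, 2, and 3.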


### 5. Residual network
- A residual network uses a building block with a pass-through identity mapping
    <img width=80% src="images/6-3.png">
    <img width=80% src="images/6-4.png">
- This structure allows the parameters to be optimized to learn a residual (the difference between the value before the block and the one needed after)
- Two cases need special handling in the pass-through:
    1. Reducing the activation map size by a factor of 2:
        <img width=80% src="images/6-5.png">
    2. Increasing the number of channels from $C$ to $C'$, by either:
        - Padding the original value with $C' - C$ zeros
        - Using $C'$ convolutions with a $1 \times 1 \times C$ filter
- Residual networks are **fully convolutional**
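
A building block with an identity pass-through might look like the sketch below. The class name `ResBlock`, the conv / batch-norm / ReLU ordering, and the 3 x 3 kernels are assumptions chosen for illustration; this version keeps both the number of channels and the activation map size unchanged, so it does not cover the two special cases listed above.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Hypothetical residual block: conv-BN-ReLU-conv-BN plus identity pass-through."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = (kernel_size - 1) // 2   # keeps the activation map size unchanged
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, padding=padding)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size, padding=padding)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)               # the layers only have to learn the residual y

block = ResBlock(channels=16)
x = torch.randn(2, 16, 8, 8)
out = block(x)
print(out.shape)  # same shape as the input
```

Since the block contains only convolutions and per-channel normalization, it accepts activation maps of any spatial size.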

### 6. Smart initialization
- **Layer-Sequential Unit-Variance** initialization:
    1. Initialize the weights of all layers with orthonormal matrices
    2. Re-scale layers one after another in a forward direction, so that the empirical activation variance is $1.0$
- Suggestion: combine CReLU with a **Looks Linear initialization**
    <img width=40% src="images/6-6.png">
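
The two LSUV steps listed above can be sketched for a small fully connected network. This is a simplified illustration under stated assumptions, not a reference implementation: the function name `lsuv_init`, the tolerance `tol`, the iteration cap `max_iter`, and the zeroed biases are all choices made here, and only `nn.Linear` layers inside an `nn.Sequential` are handled.

```python
import torch
from torch import nn

def lsuv_init(model, x, tol=0.01, max_iter=10):
    """Simplified LSUV sketch for an nn.Sequential of Linear layers and activations."""
    with torch.no_grad():
        for layer in model:
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight)   # step 1: orthonormal weights
                nn.init.zeros_(layer.bias)
                for _ in range(max_iter):           # step 2: rescale to unit variance
                    var = layer(x).var().item()
                    if abs(var - 1.0) < tol:
                        break
                    layer.weight.div_(var ** 0.5)
            x = layer(x)                            # propagate to the next layer
    return model

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(),
                      nn.Linear(50, 50), nn.ReLU(),
                      nn.Linear(50, 10))
data = torch.randn(256, 20)  # batch used to estimate the activation variance
lsuv_init(model, data)
```

Layers are processed in forward order because rescaling one layer changes the inputs seen by all the layers after it.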

### 7. Summary
- Techniques that enable the training of very deep architectures:
    1. **Rectifiers** to prevent the gradient from vanishing during the backward pass
    2. **Dropout** to force a distributed representation
    3. **Batch normalization** to dynamically maintain the statistics of activations
    4. **Identity pass-through** to keep a structured gradient and distribute representation
    5. **Smart initialization** to put the gradient in a good regime