<a href="https://colab.research.google.com/github/yexf308/MachineLearning/blob/main/Module3/Training_Neural_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pylab inline 
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from tqdm import tqdm




$\def\m#1{\mathbf{#1}}$
$\def\mm#1{\boldsymbol{#1}}$
$\def\mb#1{\mathbb{#1}}$
$\def\c#1{\mathcal{#1}}$
$\def\mr#1{\mathrm{#1}}$
$\newenvironment{rmat}{\left[\begin{array}{rrrrrrrrrrrrr}}{\end{array}\right]}$
$\newcommand\brm{\begin{rmat}}$
$\newcommand\erm{\end{rmat}}$
$\newenvironment{cmat}{\left[\begin{array}{ccccccccc}}{\end{array}\right]}$
$\newcommand\bcm{\begin{cmat}}$
$\newcommand\ecm{\end{cmat}}$

# Training Neural Networks is HARD!
A model often has millions of parameters, so it can easily overfit the training data.

## Method 1: Data Augmentation
When there is not enough training data for the size of the model, the model will overfit.

We can increase the size of data by

- Rotation: random angle between $-\pi$ and $\pi$.

- Shift: 4 directions

- Rescaling: random scaling up/down

- Flipping

Augmentation combines naturally with SGD: the random transformations are applied when each batch is formed.


In [None]:
from torchvision import datasets, transforms

# 1. Pass the pipeline via transform=transforms.Compose([trans1, trans2,…]).
# 2. transforms.Compose() is similar to nn.Sequential(): it chains a series
#    of operations and executes them as one unit.
# 3. The transforms.RandomXXX classes draw new random parameters on every call,
#    so each epoch sees a different version of every image.

batch_size = 200
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       # Note: flips can change what a digit looks like (e.g. 6 vs. 9).
                       transforms.RandomHorizontalFlip(),
                       transforms.RandomVerticalFlip(),
                       transforms.RandomRotation(15),    # random angle in [-15, 15] degrees
                       # transforms.RandomRotation([90, 180, 270]),
                       transforms.Resize([32, 32]),      # rescale 28x28 -> 32x32
                       transforms.RandomCrop([28, 28]),  # random 28x28 crop acts as a random shift
                       transforms.ToTensor()
                   ])),
    batch_size=batch_size, shuffle=True)




## Method 2: Dropout-Regularization for neural network training
### Dropout in training
Dropout in the training phase:

- For each batch, turn off each neuron (including inputs) independently with probability $\alpha$.

- Zero out the removed nodes/edges, then run backpropagation on the remaining network.

- Dropout can be viewed as training with a noisy version of the weights, $\Theta_{ij}^{(l)}= \mathbf{W}_{ij}^{(l)}\epsilon_{li}$, where $\epsilon_{li}\sim \text{Ber}(1-\alpha)$.

- So if we sample $\epsilon_{li} = 0$, then all of the weights going out of unit $i$ in layer $l-1$ into any unit $j$ in layer $l$ are set to 0.
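
The noisy-weight view can be checked numerically. The following is a minimal sketch, assuming a single layer with 4 input units and 3 output units and a drop probability of $\alpha = 0.5$ (both values are chosen only for illustration). It samples one Bernoulli mask and confirms that masking the weights gives the same pre-activations as masking the units themselves.

```python
import torch

torch.manual_seed(0)

alpha = 0.5                # drop probability (illustrative value)
W = torch.randn(4, 3)      # W[i, j]: weight from unit i in layer l-1 to unit j in layer l

# eps_i ~ Ber(1 - alpha): one keep/drop flag per unit i in layer l-1
eps = torch.bernoulli(torch.full((4,), 1 - alpha))

# Theta_ij = W_ij * eps_i: all outgoing weights of a dropped unit become 0
Theta = W * eps.unsqueeze(1)

x = torch.randn(4)         # activations of layer l-1

# Masking the weights is the same as masking the units
pre_act_noisy_weights = x @ Theta
pre_act_dropped_units = (x * eps) @ W
```

Here `pre_act_noisy_weights` and `pre_act_dropped_units` agree, and every row of `Theta` that belongs to a dropped unit is identically zero.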


![padding](https://github.com/yexf308/MAT592/blob/main/image/dropout.gif?raw=true "padding")

### Dropout in testing
In the testing phase, the full network is used, which differs from the thinned networks sampled during training:


- We usually turn the noise off. To ensure the weights have the same expectation at test time as they did during training (so the
input activation to the neurons is the same, on average), at test time we should use $\Theta_{ij}^{(l)}= \mathbf{W}_{ij}^{(l)}(1-\alpha)$.

- We can, however, keep dropout active at test time if we wish. The result is an ensemble of networks, each with a slightly different sparse graph structure. This is called **Monte Carlo dropout**:
\begin{align}
p(\m{y}|\m{x},\c{D})\approx \frac{1}{S}\sum_{s=1}^S p(\m{y}|\m{x}, \{\Theta_{ij}^{(l),s}\}),
\end{align}
where $\{\Theta_{ij}^{(l),s}\}$ is the $s$-th sample of the noisy weights.
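
As a rough sketch of Monte Carlo dropout in PyTorch, the snippet below averages the softmax outputs of $S$ stochastic forward passes. The model architecture, the input size, and the helper name `mc_dropout_predict` are hypothetical and chosen only for illustration. Note that `nn.Dropout` rescales the kept activations by $1/(1-p)$ during training, so in `eval()` mode no extra $(1-\alpha)$ factor is needed; this matches the test-time scaling above in expectation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical small classifier with one dropout layer
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10)
)

def mc_dropout_predict(model, x, num_samples=50):
    """Average class probabilities over num_samples stochastic forward passes."""
    model.train()  # train mode keeps nn.Dropout sampling a new mask on every pass
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(x), dim=1) for _ in range(num_samples)]
        )
    return probs.mean(dim=0), probs.std(dim=0)

x = torch.randn(8, 20)  # made-up batch of 8 inputs
mean_probs, std_probs = mc_dropout_predict(model, x)

# Standard deterministic prediction with the noise turned off
model.eval()
with torch.no_grad():
    det_probs = F.softmax(model(x), dim=1)
```

The spread `std_probs` gives a rough indication of how much the sampled sub-networks disagree. If the model also contained batch normalization layers, calling `model.train()` would change their behavior as well, so only the dropout layers should be switched to training mode in that case.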


### Explanations of dropout
- For a network with $N$ neurons, there are $2^N$ possible sub-networks. 

- Dropout randomly samples from these $2^N$ possibilities. It can be viewed as a way to learn an ensemble of $2^N$ models that share weights.


<img src="https://github.com/yexf308/MAT592/blob/main/image/dropout2.png?raw=true" width="800" />




##  Method 3: Weight Initialization
The initial values of the weights can have a significant impact on the training process. A proper initialization of the weights in a neural network is critical to its convergence. One can even show that arbitrary initialization can slow down or even stall the training. 


### Initializing Neural Networks depends on 4 main factors
1. Number of Inputs $D_{in}$.
2. Number of Outputs $D_{out}$. 
3. Type of Non-Linearity.
4. Type of Network. 

### Some Popular Initialization Solutions
1. **Uniform initialization**: sample each parameter independently from $U(-a,a)$. 

2. **Normal initialization**: sample each parameter independently from $N(0,\sigma^2)$. 

3. **Orthogonal initialization**: initialize each weight matrix as an orthogonal matrix; widely used for Convolutional Neural Networks.
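
A short sketch of these three schemes using `torch.nn.init`. The layer size and the values $a = 0.1$ and $\sigma = 0.05$ are assumptions chosen only for illustration, not recommended settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(64, 64)  # illustrative layer; weight has shape (64, 64)

# 1. Uniform initialization: each weight drawn from U(-a, a) with a = 0.1
nn.init.uniform_(layer.weight, a=-0.1, b=0.1)
w_uniform = layer.weight.detach().clone()

# 2. Normal initialization: each weight drawn from N(0, sigma^2) with sigma = 0.05
nn.init.normal_(layer.weight, mean=0.0, std=0.05)
w_normal = layer.weight.detach().clone()

# 3. Orthogonal initialization: the weight matrix satisfies W W^T = I
nn.init.orthogonal_(layer.weight)
w_orth = layer.weight.detach().clone()

# Biases are commonly set to zero regardless of the weight scheme
nn.init.zeros_(layer.bias)
```

The functions ending in an underscore modify the tensor in place, so each call overwrites the previous initialization of `layer.weight`.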

### Zero initialization (bad)
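
One way to see the problem is to look at the gradients of a network whose parameters are all zero. The sketch below uses a hypothetical two-layer tanh network with made-up data and labels. Because the hidden activations and the second-layer weights are all zero, the gradients of both weight matrices vanish and only the output bias receives a nonzero gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical two-layer network with every parameter set to zero
net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 2))
for p in net.parameters():
    nn.init.zeros_(p)

x = torch.randn(5, 4)               # made-up inputs
y = torch.tensor([0, 0, 0, 1, 0])   # made-up, deliberately unbalanced labels

loss = F.cross_entropy(net(x), y)
loss.backward()

# Gradient norm of each parameter after one backward pass
grad_norms = {name: p.grad.norm().item() for name, p in net.named_parameters()}
```

In `grad_norms`, the entries for `0.weight`, `0.bias`, and `2.weight` are exactly zero, so a gradient step leaves the weights at zero and the network cannot start using its inputs.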





## Method 4: Batch Normalization

## Method 5: Weight Decay
Impose a prior on the parameters and use MAP estimation. The most common choice is a zero-mean Gaussian prior on the weights, $\mathcal{N}(\m{W}|0, \alpha^2 \m{I})$, and on the biases, $\mathcal{N}(\m{b}|0, \beta^2 \m{I})$. For the weights, the negative log prior is $\frac{1}{2\alpha^2}\lVert\m{W}\rVert^2$ plus a constant, so MAP estimation is the same as $\ell_2$ regularization. It is also called weight decay, since it encourages small weights, and hence simpler models. 
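
The link between the $\ell_2$ penalty and the `weight_decay` argument of `torch.optim.SGD` can be checked on a toy problem. This is a minimal sketch with a made-up quadratic loss and illustrative values for the penalty strength `lam` and the learning rate. It compares one SGD step on the loss plus $\frac{\lambda}{2}\lVert \mathbf{w}\rVert^2$ with one SGD step on the plain loss using `weight_decay=lam`.

```python
import torch

torch.manual_seed(0)

lam, lr = 0.1, 0.5       # illustrative penalty strength and learning rate
w0 = torch.randn(3)      # shared starting point
x = torch.randn(3)       # made-up input

def data_loss(w):
    # Toy squared-error loss on a single example with target 1.0
    return (w @ x - 1.0) ** 2

# (a) Explicit L2 penalty added to the loss, plain SGD
w_a = w0.clone().requires_grad_(True)
opt_a = torch.optim.SGD([w_a], lr=lr)
(data_loss(w_a) + 0.5 * lam * w_a.pow(2).sum()).backward()
opt_a.step()

# (b) Unpenalized loss, SGD with weight_decay (adds lam * w to the gradient)
w_b = w0.clone().requires_grad_(True)
opt_b = torch.optim.SGD([w_b], lr=lr, weight_decay=lam)
data_loss(w_b).backward()
opt_b.step()

same_update = torch.allclose(w_a, w_b, atol=1e-6)
```

Both variants produce the same updated weights. In terms of the Gaussian prior above, `lam` plays the role of $1/\alpha^2$ when the data loss is the unscaled negative log-likelihood. Note that `weight_decay` applies to every parameter passed to the optimizer, including biases, unless separate parameter groups are used.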

## Method 6: Residual Networks

## AlexNet Training

- Dropout: 0.5 (in FC layers)
- A lot of data augmentation
- Momentum SGD with batch size 128, momentum factor 0.9
- L2 weight decay (L2 regularization)
- Learning rate: 0.01, divided by 10 each time the validation accuracy plateaus
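
The optimizer settings in this recipe could be set up in PyTorch roughly as follows. This is only a sketch: the placeholder model, the random data, the weight-decay value `5e-4`, and the `patience` of 2 epochs are assumptions that the list above does not specify, and the validation accuracies are made up to trigger the schedule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder model standing in for the real network (dropout 0.5 before the FC layer)
model = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.5), nn.Linear(32, 10))

# Momentum SGD with L2 weight decay; the weight_decay value is an assumption
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4
)

# Multiply the learning rate by 0.1 when validation accuracy stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2
)

x = torch.randn(128, 32)             # one made-up batch of size 128
y = torch.randint(0, 10, (128,))
fake_val_acc = [0.50, 0.50, 0.50, 0.50, 0.50]  # made-up, flat validation accuracy

lrs = []
for val_acc in fake_val_acc:
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(val_acc)          # the scheduler monitors validation accuracy
    lrs.append(optimizer.param_groups[0]["lr"])
```

With `mode="max"`, the scheduler lowers the learning rate once the monitored accuracy has failed to improve for more than `patience` consecutive epochs, which approximates the manual "divide by 10 on plateau" rule.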