# Advanced Deep Learning - Class 2
## Convolutional Neural Networks and Optimization Tricks

<hr>

**Driving question**: can a network structure reduce the exploration space while also providing useful properties such as invariance and robustness?

<hr>

## Invariance to spatial changes
> convolutional neural network aka CNN, ConvNet

In a convolutional neural network, filters (which emulate simple or complex biological cortex functions) are learnable neurons. Filtering is used intensively in image processing and games.

### Convolution in nature

Based on **Hubel and Wiesel**'s work in 1962, processing in the visual cortex follows these steps:

- Convolution: resulting in the feature map
- Activation
- Pooling

### Deep representation by CNN

**Yann LeCun** (1998) showed that:

- subparts of the field of vision are translation invariant
- S-cells: convolution with filters
- C-cells: max-pooling

A **feature map** is the result of a convolution (before the activation function is applied). Convolution is a **process in which a filter extracts characteristics (e.g. edges) in parallel across the input at each layer**.
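
The convolution, activation and pooling steps can be sketched with NumPy. This is a minimal illustration, not an efficient implementation: the toy image and the hand-written edge filter are assumptions made for the example, and in a real CNN the filter weights would be learned.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (the operation deep learning calls convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

def relu(x):
    """Activation applied to the feature map."""
    return np.maximum(0.0, x)

def max_pool2d(x, size=2):
    """Non-overlapping max-pooling; rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written vertical-edge filter (assumption: chosen only for illustration).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d(image, kernel)   # convolution: shape (4, 4)
activated = relu(feature_map)         # activation
pooled = max_pool2d(activated)        # pooling: shape (2, 2)
```

The feature map responds only where the window straddles the edge, and pooling keeps that response while discarding its exact position.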

## Optimization, tips and tricks

### Pre-processing steps

Pre-processing steps are important to prepare data for a machine learning process:

1. **Train/Validation/Test splits** (Validation is used to tune hyperparameters) are mandatory.

2. **Features must be normalized** (zero mean and unit variance, $x\leftarrow(x-\mu)/\sigma$) so that all features are on the same scale.

3. If the data is highly dimensional, consider using a **dimensionality reduction** technique such as PCA

4. **Data augmentation** can be performed if the dataset is small (e.g. in CV: horizontal flipping, random crops and color jittering; in NLP: synonym substitution).
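
A minimal sketch of steps 1, 2 and 4 with NumPy. The synthetic data, the 60/20/20 split ratio and the helper name `horizontal_flip` are assumptions made for the example. Statistics for normalization are computed on the training split only, so that no information leaks from the validation or test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular data: 100 samples, 3 features on very different scales.
X = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 10.0, 0.1], size=(100, 3))

# Step 1: shuffle, then split into train/validation/test (60/20/20, an assumption).
idx = rng.permutation(len(X))
X_train, X_val, X_test = X[idx[:60]], X[idx[60:80]], X[idx[80:]]

# Step 2: normalize every split with the mean and std of the training split.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_n = (X_train - mean) / std
X_val_n = (X_val - mean) / std
X_test_n = (X_test - mean) / std

# Step 4: one simple image augmentation, mirroring along the width axis.
def horizontal_flip(img):
    """Return a left-right mirrored copy of an image stored as (height, width)."""
    return img[:, ::-1]
```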

### Softmax layer

Multi-class classification models rely on a softmax activation function at the last layer. 

> Softmax is **monotonic** (it preserves the ordering of its inputs) and **non-local** (each output depends on all inputs)
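
As an illustration, here is a small softmax sketch in NumPy. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; the example logits are arbitrary.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])   # arbitrary example scores
probs = softmax(logits)

# Raising only the last logit changes every output probability.
probs_shifted = softmax(logits + np.array([0.0, 0.0, 5.0]))
```

The outputs sum to 1 and keep the ordering of the logits, while changing a single logit moves all of the probabilities.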

### Activation Function

1. <u>Sigmoid</u>
    - **Pros**:
        - $\alpha(x)\in(0,1)$, which is useful at the output layer
        - the derivative is easy to compute: $\alpha'(x)=\alpha(x)(1-\alpha(x))$
    - **Cons**:
        - Saturates when receiving strong signals
        - The derivative approaches 0 at both ends, pushing gradients in previous layers towards 0 (vanishing gradient)
        - The exploding gradient problem can still occur (it can be mitigated with gradient clipping)
        
2. <u>tanh</u>
    - **Pros**:
        - $\alpha(x)$ is centered around 0
        - the derivative is easy to compute: $\alpha'(x)=1-\alpha^2(x)$
        - typically converges faster than Sigmoid
    - **Cons**:
        - Still saturates, so the vanishing or exploding gradient problems remain
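
The derivative formulas for both activations can be checked numerically. This is a small sketch; the sample inputs are arbitrary and only meant to show saturation at both ends.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

x = np.array([-10.0, 0.0, 10.0])
sig_grads = sigmoid_grad(x)    # largest at 0, close to 0 at both ends
tanh_grads = tanh_grad(x)      # largest at 0, close to 0 at both ends
```

Both derivatives peak at 0 (0.25 for the sigmoid, 1 for tanh) and collapse towards 0 for strong inputs, which is the saturation listed in the cons.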
        
### Loss Function $L$

**Total Cost**: $C(W) = \sum_{r=1}^{R}L^r(W) \Leftarrow$ *Find the network parameters $W^*$ that minimize this value*

1. <u>Classification:</u>
    - **Cross-Entropy**: $$L(x) = -\left[y\,\ln f(x) + (1-y)\,\ln(1-f(x))\right]$$
    - **Hinge-Loss** (max-margin loss, a convex surrogate for the 0-1 loss), with labels $y\in\{-1,+1\}$: $$L(x) = \max(0, 1-y\,f(x))$$

2. <u>Regression:</u>
    - **Mean square loss** (or Quadratic Loss): $$L(x) = (f(x)-y)^2$$
    - **Mean Absolute Loss**: $$L(x) = |f(x)-y|$$

3. <u>Retrieval:</u>
    - Triplet loss
    - Cosine similarity
    
> If the loss is minimized but accuracy is low: **check the loss function**
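
The classification and regression losses can be written as per-sample NumPy functions. This is a sketch under the usual conventions: labels in {0, 1} for cross-entropy, labels in {-1, +1} for the hinge loss. The `eps` clipping is an assumption added to avoid `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Cross-entropy for a label y in {0, 1} and a predicted probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw score f(x)."""
    return np.maximum(0.0, 1.0 - y * score)

def squared_loss(y, pred):
    """Squared error for regression."""
    return (pred - y) ** 2

def absolute_loss(y, pred):
    """Absolute error for regression."""
    return np.abs(pred - y)
```

Averaging any of these over the training samples gives a total cost of the form $C(W)$ used in the rest of the notes.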

### Overfitting and underfitting

Y. Bengio: "*Check if the model is powerful enough to overfit, if not then change model structure or make model larger.*"

![baby](images/babysit.png)

### Gradient Descent

1. Randomly pick a starting point $W^0$
2. Compute the negative gradient at $W^0$, $-\nabla C(W^0)$, which points in the direction of steepest descent on the error surface:
![error](images/error_surface.png)
3. Update the weights with a learning rate $\eta$: $W^1 \leftarrow W^0 - \eta\nabla C(W^0)$
4. Repeat until a minimum is reached

**Gradient descent risks**: plateau, saddle points, local minima.
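
The four steps can be sketched on a toy cost. The quadratic cost, its minimum at `(3, -1)`, the learning rate and the stopping tolerance are all assumptions chosen for the example. A convex bowl like this one has no plateau, saddle point or local minimum, so it only shows the basic loop.

```python
import numpy as np

TARGET = np.array([3.0, -1.0])   # minimum of the toy cost (assumption)

def cost(w):
    """Toy cost C(W) = ||W - TARGET||^2."""
    return np.sum((w - TARGET) ** 2)

def grad(w):
    """Gradient of the toy cost."""
    return 2.0 * (w - TARGET)

def gradient_descent(w0, lr=0.1, steps=200, tol=1e-8):
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = grad(w)                    # step 2: gradient at the current point
        if np.linalg.norm(g) < tol:    # step 4: stop once the gradient vanishes
            break
        w = w - lr * g                 # step 3: move against the gradient
    return w

rng = np.random.default_rng(0)
w_start = rng.normal(size=2)           # step 1: random starting point
w_star = gradient_descent(w_start)
```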

![vanilla_sgd](images/vanilla_SGD.png)

![momentum](images/momentum.png)

### Weight initialization

1. **All zero initialization**: Not good, because every neuron in a layer computes the same output and receives the same gradient during back-propagation, so the neurons never become different from one another.

2. **Glorot normal/uniform**: zero-mean Gaussian initialization with a variance scaled by the number of input neurons and the number of output neurons. The goal is to keep the signal in a reasonable range of values through many layers $$W\sim\mathcal{N}(0, 2/(\text{input neurons}+\text{output neurons}))$$

3. **He normal/uniform**: zero-mean Gaussian initialization with a variance scaled by the number of input neurons only $$W\sim\mathcal{N}(0, 2/\text{input neurons})$$
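
A sketch of the normal variants of both schemes in NumPy. The layer sizes are arbitrary, and the He variance used here is the usual $2/n_{in}$. The function names are chosen for this example and are not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(n_in, n_out):
    """Zero-mean Gaussian with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def he_normal(n_in, n_out):
    """Zero-mean Gaussian with variance 2 / n_in."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

# Hypothetical layer with 512 inputs and 256 outputs.
W_glorot = glorot_normal(512, 256)
W_he = he_normal(512, 256)
```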

### Train by Mini-Batches of training Samples

*Usually works faster and better than standard SGD*.

1. Randomly initialize $W^0$
2. Pick the first batch and update the weights: $W^1 \leftarrow W^0 - \eta\nabla C(W^0)$, where the cost $C$ is computed on that batch only
3. Pick the second batch and update the weights: $W^2 \leftarrow W^1 - \eta\nabla C(W^1)$
4. Once all batches have been used, an epoch is done, and the process is repeated
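
The mini-batch loop can be sketched on a small linear-regression problem. The synthetic data, the mean-squared-error cost, the batch size and the learning rate are assumptions made for the example. Reshuffling the samples at every epoch is a common choice, not something the steps above require.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y = X @ w_true + small noise.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ w_true + 0.01 * rng.normal(size=200)

def batch_gradient(w, X_batch, y_batch):
    """Gradient of the mean squared error computed on one mini-batch."""
    residual = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)

def train(X, y, lr=0.1, batch_size=20, epochs=50):
    w = rng.normal(size=X.shape[1])                  # step 1: random W^0
    for _ in range(epochs):                          # step 4: repeat epoch after epoch
        order = rng.permutation(len(X))              # new batch assignment each epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            # steps 2-3: one weight update per mini-batch
            w = w - lr * batch_gradient(w, X[batch], y[batch])
    return w

w_hat = train(X, y)
```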

### Dealing with internal covariate shift

**Internal covariate shift** is the change in the distribution of a layer's input activations caused by parameter updates in the preceding layers; it can slow learning.

This can be dealt with via **batch-normalization** layers, which provide:

- faster learning
- increased accuracy
- the possibility of a higher learning rate
- some protection against bad initialization
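
A minimal sketch of the training-mode forward pass of a batch-normalization layer. Each feature is normalized with the statistics of the current mini-batch, then rescaled by a learnable scale `gamma` and shift `beta`. The shapes and the value of `eps` are assumptions for the example; at inference time, implementations typically use running averages of the statistics instead.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of x over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)

# Hypothetical activations: batch of 64 samples, 4 features, shifted and scaled.
activations = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
gamma = np.ones(4)    # learnable scale, initialized to 1
beta = np.zeros(4)    # learnable shift, initialized to 0

normalized = batch_norm_forward(activations, gamma, beta)
```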

### Recipes when poor performance on training data

1. **Modify the network**: new activation functions (ReLU, LeakyReLU)

2. **Better optimization strategy**: Adaptive learning rates

### Recipes when poor performance on validation data

3. **Preventing overfitting**: dropout

4. **Regularization**

5. **Early Stopping**

### Vanishing Gradient Problem

Layers near the input have smaller gradients, learn very slowly, and stay almost random. Meanwhile, layers near the output have larger gradients, learn quickly and converge, but they converge on top of features that are still close to random.

<u>Solutions:</u>
- Rectified Linear Unit (ReLU): fast to compute, biologically motivated, equivalent to an infinite sum of sigmoids with different biases, and mitigates the vanishing gradient problem.
    - setting nodes to 0 when a result is negative is equivalent to having a thinner linear network.
- Leaky ReLU
- Parametric ReLU
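
The three variants differ only in how they treat negative inputs. In this sketch the default leaky slope of 0.01 is a common choice rather than a prescribed value, and for Parametric ReLU the slope `a` would be learned during training instead of being passed by hand.

```python
import numpy as np

def relu(x):
    """Zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    """Small fixed slope for negative inputs instead of a hard zero."""
    return np.where(x > 0, x, slope * x)

def parametric_relu(x, a):
    """Same shape as Leaky ReLU, but the negative slope a is a learnable parameter."""
    return np.where(x > 0, x, a * x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (no saturation for x > 0)."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5])
```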

### Learning Rates

**Idea**: Reduce the learning rate by some factor at each epoch, e.g. $1/t$ decay: $\eta^t=\eta/\sqrt{t+1}$

0. <u>Original Gradient Descent:</u> $W^t \leftarrow W^{t-1}-\eta\,.\,\nabla C(W^{t-1})$

1. <u>AdaGrad:</u> $W^{t+1} \leftarrow W^{t}-\eta_W\,g^t$, with $g^t=\frac{\partial C(W^{t})}{\partial W}$ and $\eta_W = \frac{\eta}{\sqrt{\sum^t_{i=0}(g^i)^2}}$
- parameter dependent learning rate ($\eta_W$)
- summation of the square of the previous derivatives

2. <u>RMSProp:</u> Root Mean Square of the gradients, with previous gradients being decayed
\begin{align}
W^1 &\leftarrow W^0 - \frac{\eta}{\sigma^0}g^0, & \sigma^0 &= |g^0|\\
W^{t+1} &\leftarrow W^t - \frac{\eta}{\sigma^t}g^t, & \sigma^t &= \sqrt{\alpha(\sigma^{t-1})^2 + (1-\alpha)(g^t)^2}
\end{align}

3. <u>Adam</u>
![adam](images/adam.png)
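
The AdaGrad and RMSProp updates can be sketched on a toy cost whose two coordinates have very different curvature. The cost, the learning rates, the decay `alpha` and the small `eps` added to avoid division by zero are assumptions for the example. Adam is not implemented here.

```python
import numpy as np

def grad(w):
    """Gradient of the toy cost C(w) = 0.5 * (10 * w[0]**2 + w[1]**2)."""
    return np.array([10.0, 1.0]) * w

def adagrad(w0, lr=0.5, steps=200, eps=1e-8):
    w = np.asarray(w0, dtype=float)
    sum_sq = np.zeros_like(w)                 # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        sum_sq += g ** 2
        w = w - lr / (np.sqrt(sum_sq) + eps) * g   # per-parameter learning rate
    return w

def rmsprop(w0, lr=0.01, alpha=0.9, steps=500, eps=1e-8):
    w = np.asarray(w0, dtype=float)
    g = grad(w)
    sigma = np.abs(g)                         # sigma^0 = |g^0|
    w = w - lr / (sigma + eps) * g
    for _ in range(steps - 1):
        g = grad(w)
        # decayed root mean square of past and current gradients
        sigma = np.sqrt(alpha * sigma ** 2 + (1.0 - alpha) * g ** 2)
        w = w - lr / (sigma + eps) * g
    return w

w_adagrad = adagrad([1.0, 1.0])
w_rmsprop = rmsprop([1.0, 1.0])
```

Both methods divide the step by a per-parameter gradient scale, so the steep and the flat coordinate progress at a similar pace. With a constant $\eta$, RMSProp keeps taking steps of roughly constant size and ends up hovering near the minimum rather than settling exactly on it.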

### Regularization

Regularization adds a penalty term to the loss function, so that training must minimize not only the original cost but also the regularization term.

1. <u>L2 Regularization:</u> $$L'(W) = L(W) + \lambda\frac{1}{2}||W||_2^2$$
$$\frac{\partial L'}{\partial W} = \frac{\partial L}{\partial W} + \lambda W$$
$$W^{t+1}\leftarrow (1-\eta\lambda) W^t - \eta\frac{\partial L}{\partial W}$$
Also known as **ridge**, or weight decay, it is the most widely used regularization method.

2. <u>L1 Regularization:</u>
Also known as **lasso**, it produces sparse results: $$L'(W) = L(W) + \lambda||W||_1$$
$$\frac{\partial L'}{\partial W} = \frac{\partial L}{\partial W} + \lambda\,\text{sgn}(W)$$
$$W^{t+1}\leftarrow W^t - \eta\frac{\partial L}{\partial W}-\eta\lambda\,\text{sgn}(W^t)$$

3. <u>Elastic Net:</u> combines L1 and L2
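
The L2 and L1 weight updates can be compared directly. In this sketch the gradient of the original loss is set to zero (an assumption made only to isolate the effect of the penalty), and the values of $\eta$ and $\lambda$ are arbitrary.

```python
import numpy as np

def l2_step(w, grad_loss, lr, lam):
    """One update with L2 regularization (weight decay)."""
    return (1.0 - lr * lam) * w - lr * grad_loss

def l1_step(w, grad_loss, lr, lam):
    """One update with L1 regularization (constant pull towards zero)."""
    return w - lr * grad_loss - lr * lam * np.sign(w)

w = np.array([1.0, -2.0, 0.5])
zero_grad = np.zeros_like(w)      # isolate the regularization term

w_l2 = l2_step(w, zero_grad, lr=0.1, lam=0.5)
w_l1 = l1_step(w, zero_grad, lr=0.1, lam=0.5)
```

L2 shrinks every weight by the same factor, so large weights shrink the most. L1 subtracts the same amount from every weight regardless of its size, which is what drives small weights to zero and produces sparse results.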

### Dropout

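Each neuron has a dropout rate $p$: at every training update it is dropped (set to 0) with probability $p$. Dropout therefore changes the network shape at each update.

During validation/testing, no neuron is dropped and the weights are multiplied by $1-p$ (the keep probability).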

**Dropout ends up being a kind of ensemble technique where parallel networks are 'averaged'**.
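
A small sketch of the training and test behaviour, assuming $p$ is the probability of dropping a unit, so that test-time weights are scaled by $1-p$. The layer of ones and the function names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p):
    """Training: zero each activation independently with probability p."""
    mask = rng.random(activations.shape) >= p
    return activations * mask

def dense_test(x, W, p):
    """Testing: keep every unit and scale the weights by the keep probability."""
    return x @ ((1.0 - p) * W)

p = 0.5
x = np.ones(1000)                       # hypothetical activations of 1000 units
W = np.ones((1000, 1))                  # hypothetical weights to one output unit

train_out = dropout_train(x, p) @ W     # random: about half of the units survive
test_out = dense_test(x, W, p)          # deterministic: matches the training average
```

Scaling the weights at test time makes the output match the expected output of the randomly thinned training networks, which is the 'averaging' described above.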