## Current week: The Data-Driven Deep Learning Process

### Exponential Moving Average (EMA)

$$ v_k = \gamma v_{k-1} + (1-\gamma)\theta_k, \; v_0=0$$

Unrolling the first few steps shows that older values are weighted by increasing powers of $\gamma$:

\begin{align*}
v_1&=(1-\gamma)\theta_1\\
v_2&=\gamma(1-\gamma)\theta_1 +(1-\gamma)\theta_2\\
v_3&=\gamma^2(1-\gamma)\theta_1+\gamma(1-\gamma)\theta_2+(1-\gamma)\theta_3
\end{align*}

Applying it to **gradient** values (gradient descent with momentum):

$$ \hat{v}_k =\gamma \hat{v}_{k-1} +(1-\gamma)\nabla_{w_k}$$
$$w_{k+1}=w_k-\alpha\hat{v}_k$$
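The recursion and the momentum update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the default values of `alpha` and `gamma`, and the quadratic objective $f(w)=\tfrac{1}{2}\|w\|^2$ (whose gradient is $w$) are all assumptions chosen for the example.

```python
import numpy as np

def ema(values, gamma=0.9):
    """Running EMA: v_k = gamma * v_{k-1} + (1 - gamma) * theta_k, with v_0 = 0."""
    v = 0.0
    out = []
    for theta in values:
        v = gamma * v + (1.0 - gamma) * theta
        out.append(v)
    return np.array(out)

def momentum_step(w, v_hat, grad, alpha=0.1, gamma=0.9):
    """One update: EMA of the gradient, then a step along the averaged direction."""
    v_hat = gamma * v_hat + (1.0 - gamma) * grad
    w = w - alpha * v_hat
    return w, v_hat

# Hypothetical objective f(w) = 0.5 * ||w||^2, so the gradient is simply w.
w = np.array([1.0, -2.0])
v_hat = np.zeros_like(w)
for _ in range(200):
    w, v_hat = momentum_step(w, v_hat, grad=w)
```

With these assumed settings the iterate `w` spirals in toward the minimizer at the origin.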


#### RMS Prop

Let's apply EMA to **squared gradient** values:

$$ S_k=\gamma S_{k-1}+(1-\gamma)(\nabla_{w_k})^2$$
$$w_{k+1}=w_k-\frac{\alpha}{\sqrt{S_k}+\epsilon}\nabla_{w_k}$$
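A minimal sketch of one RMSProp step, following the two equations above. The function name and the default values of `alpha`, `gamma`, and `eps` are assumptions for illustration. The square and square root are applied element-wise, so each parameter gets its own effective step size.

```python
import numpy as np

def rmsprop_step(w, S, grad, alpha=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update with an element-wise EMA of squared gradients."""
    S = gamma * S + (1.0 - gamma) * grad**2
    w = w - alpha / (np.sqrt(S) + eps) * grad
    return w, S

# Two coordinates whose gradients differ by a factor of 100 (hypothetical values).
w0 = np.array([1.0, 100.0])
grad = w0.copy()
w1, S1 = rmsprop_step(w0, np.zeros_like(w0), grad)
step = w0 - w1  # both entries have the same magnitude despite the scale gap
```

The example illustrates the design choice: dividing by $\sqrt{S_k}$ rescales the step so that coordinates with large gradients do not dominate the update.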

#### Bias Correction in EMA

Because $v_0=0$, the first few values of $v_k$ are biased toward zero.

Correction: $\displaystyle \tilde{v}_k = \frac{v_k}{1-\gamma^{k}}$
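To see why the division helps, consider a constant sequence $\theta_k=\theta$: the recursion with $v_0=0$ gives $v_k=(1-\gamma^k)\theta$, so dividing by $1-\gamma^k$ recovers $\theta$ exactly. The sketch below checks this numerically; the function name and the input values are hypothetical.

```python
def ema_bias_corrected(values, gamma=0.9):
    """Return the raw EMA and the bias-corrected EMA for each step k = 1, 2, ..."""
    v = 0.0
    raw, corrected = [], []
    for k, theta in enumerate(values, start=1):
        v = gamma * v + (1.0 - gamma) * theta
        raw.append(v)
        corrected.append(v / (1.0 - gamma**k))
    return raw, corrected

# Constant input: the raw EMA starts far below 5.0, the corrected one does not.
raw, corrected = ema_bias_corrected([5.0] * 5, gamma=0.9)
```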

#### Adam

What if we combine Momentum with RMSProp?

$$\hat{v}_k = \gamma_1 \hat{v}_{k-1} +(1-\gamma_1)\nabla_{w_k}$$
$$S_k =\gamma_2 S_{k-1} +(1-\gamma_2)(\nabla_{w_k})^2$$

And apply bias correction to both averages:

$$\hat{v}_k^c = \frac{\hat{v}_k}{1-\gamma_1^k}$$

$$S_k^c = \frac{S_k}{1-\gamma_2^k}$$

The update then uses the corrected values:

$$w_{k+1}=w_k-\frac{\alpha}{\sqrt{S_k^c}+\epsilon}\hat{v}_k^c$$

hyperparameters: $\alpha$, $\gamma_1$, $\gamma_2$, $\epsilon$
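The Adam equations can be put together in a short sketch. The function name is hypothetical, and the defaults ($\gamma_1=0.9$, $\gamma_2=0.999$, $\epsilon=10^{-8}$, $\alpha=10^{-3}$) are commonly quoted values assumed here, not values stated in these notes. The step counter `k` starts at 1.

```python
import numpy as np

def adam_step(w, v_hat, S, grad, k,
              alpha=1e-3, gamma1=0.9, gamma2=0.999, eps=1e-8):
    """One Adam update at step k >= 1 (k is needed for bias correction)."""
    v_hat = gamma1 * v_hat + (1.0 - gamma1) * grad   # momentum term
    S = gamma2 * S + (1.0 - gamma2) * grad**2        # RMSProp term
    v_c = v_hat / (1.0 - gamma1**k)                  # bias-corrected momentum
    S_c = S / (1.0 - gamma2**k)                      # bias-corrected second moment
    w = w - alpha / (np.sqrt(S_c) + eps) * v_c
    return w, v_hat, S

# First step on a hypothetical problem where the gradient equals w.
w0 = np.array([1.0, -3.0])
w1, v1, S1 = adam_step(w0, np.zeros(2), np.zeros(2), grad=w0, k=1)
```

At $k=1$ the corrected averages equal the gradient and its square, so the first step has magnitude close to $\alpha$ in every coordinate.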

### Cross-Validation 

Train/Validation/Test set division:

* 10k samples in total: 8k/1k/1k

* 1M samples in total: 960k/20k/20k
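A minimal sketch of such a split, assuming the samples can be shuffled freely (no time ordering or grouping). The function name, the fraction parameters, and the fixed seed are assumptions for illustration.

```python
import numpy as np

def split_indices(n, val_frac, test_frac, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_val = int(round(n * val_frac))
    n_test = int(round(n * test_frac))
    val = perm[:n_val]
    test = perm[n_val:n_val + n_test]
    train = perm[n_val + n_test:]
    return train, val, test

# 10k samples -> 8k/1k/1k; 1M samples -> 960k/20k/20k
train, val, test = split_indices(10_000, val_frac=0.1, test_frac=0.1)
big_train, big_val, big_test = split_indices(1_000_000, val_frac=0.02, test_frac=0.02)
```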

## Regularization

### L2 norm regularization

### Dropout Regularization
Analogous to L2 norm regularization, Dropout regularization restricts the optimization to fewer parameters. It applies a random mask to the nodes at each training step: every node is kept intact with some probability and switched off (set to zero) otherwise.
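One common way to implement the random mask is "inverted dropout", sketched below. The rescaling by `keep_prob` is an assumption beyond what these notes state: it keeps the expected activation unchanged so that nothing needs to be rescaled at test time. The function name and `keep_prob=0.8` are hypothetical.

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, rng=None):
    """Zero each activation with probability 1 - keep_prob, rescale the rest."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(a.shape) < keep_prob   # True where the node is kept
    return a * mask / keep_prob, mask

a = np.ones((1000, 100))
out, mask = dropout_forward(a, keep_prob=0.8, rng=np.random.default_rng(0))
```

The mask is only applied during training; at test time the activations are used as they are.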


### Additional Regularization

**Data Augmentation**:

Enhance the data set with additional manipulations. 

* General input: add noise/distortions, synthetic

* Images: resolutions, rotate, add symmetries 

* Shapes/digits: distort

So the idea is to obtain more data points from the existing data.
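As a rough sketch of the idea for image data, the snippet below triples a batch by adding Gaussian noise and a horizontal mirror of each image. The function name, the `(N, H, W)` array layout, and the noise level are assumptions. Mirroring is only appropriate when it preserves the label, which is not the case for most digits.

```python
import numpy as np

def augment(images, noise_std=0.05, seed=0):
    """Return originals plus a noisy copy and a horizontally mirrored copy."""
    rng = np.random.default_rng(seed)
    noisy = images + rng.normal(0.0, noise_std, size=images.shape)
    mirrored = images[:, :, ::-1]            # flip along the width axis
    return np.concatenate([images, noisy, mirrored], axis=0)

images = np.arange(4 * 8 * 8, dtype=float).reshape(4, 8, 8)
augmented = augment(images)
```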

**Early Stopping**



### Vanishing/Exploding Gradients

* **Very deep** neural network
without activation

$\displaystyle \hat{y}=w_L\cdots w_l \cdots w_2 \cdot w_1\cdot x$

As the number of layers $L$ grows:

if $w>1, \;\; \hat{y} \to \infty$

if $w<1, \;\; \hat{y} \to 0$

with activation: forward propagation and backward propagation are modified.
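The no-activation case can be checked numerically with a scalar toy model in which every layer has the same weight $w$, so that $\hat{y}=w^L x$. The function name and the chosen values of $w$ and $L$ are assumptions for illustration.

```python
def deep_linear_output(w, L, x=1.0):
    """Output of L stacked linear layers that all share the scalar weight w."""
    y = x
    for _ in range(L):
        y = w * y
    return y

exploding = deep_linear_output(1.5, L=50)   # grows like 1.5**50
vanishing = deep_linear_output(0.5, L=50)   # shrinks like 0.5**50
```

Even weights only slightly away from 1 lead to very large or very small outputs once the network is deep enough, and the same products appear in the gradients.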

## Normalization of the datasets

* Zero mean

* Normalized Variance
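Both steps can be sketched per feature as follows. Reusing the training-set mean and standard deviation for the other sets is an assumption here (a common convention), as are the function name and the small `eps` that guards against division by zero.

```python
import numpy as np

def standardize(X_train, X_other, eps=1e-8):
    """Shift to zero mean and scale to unit variance using training statistics."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / (sigma + eps), (X_other - mu) / (sigma + eps)

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(500, 4))
X_val = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
X_train_n, X_val_n = standardize(X_train, X_val)
```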

### Batch Normalization

Normalize the outputs of each layer.

* A learned scale and shift ensure that the layers' outputs are not forced to have zero mean and unit variance:

$$\tilde{z}^{[l](i)} = \gamma^{[l]}z_{norm}^{[l](i)}+\beta^{[l]}$$

where $\gamma$ and $\beta$ are learned in the process


* Forward propagation

$$ a^{[l-1](i)} \; \rightarrow\; z^{[l](i)}\;\xrightarrow{\text{BN}}\;\tilde{z}^{[l](i)}\;\rightarrow\; a^{[l](i)}$$

* Can do per mini-batch


**Pseudo-code**

For each mini-batch, compute F-prop, in each layer replace $z^{[l](i)}$ by $\tilde{z}^{[l](i)}$. Do B-prop to compute $\nabla_{w^{[l]}},\nabla_{\beta^{[l]}},\nabla_{\gamma^{[l]}}$. Update.
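The forward part of this procedure for a single layer can be sketched as below, assuming a mini-batch `z` of shape `(batch, units)`. The function name and the small `eps` added inside the square root for numerical stability are assumptions, and the running statistics used at test time are left out.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Normalize z over the mini-batch, then apply the learned scale and shift."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
z = 5.0 * rng.standard_normal((64, 3)) + 2.0
gamma = np.array([1.0, 2.0, 3.0])
beta = np.array([0.0, 1.0, -1.0])
z_tilde = batchnorm_forward(z, gamma, beta)
```

After the transform, each unit has mean $\beta^{[l]}$ and standard deviation close to $\gamma^{[l]}$ over the mini-batch, whatever the statistics of `z` were.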

## Initialization

* Zero -> Problematic

* Random Normal (0,1) -> Problematic (vanishing gradients, for example)

* Xavier (tanh):
    $$Var(w^{[l]}): 1/n^{[l-1]}$$
    $$w^{[l]} = N(0,1)\cdot \sqrt{\frac{1}{n^{[l-1]}}}$$

* He (ReLU):
    $$Var(w^{[l]}): 2/n^{[l-1]}$$
    $$w^{[l]} = N(0,1)\cdot \sqrt{\frac{2}{n^{[l-1]}}}$$
    The factor 2 is found to work better in practice.

* Other: 

    $$Var(w^{[l]}): \frac{2}{n^{[l-1]}+n^{[l]}}$$
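The three variance rules above can be sketched as one helper. The function name, the scheme labels, and the `(n_out, n_in)` weight shape are assumptions for illustration; `n_in` plays the role of $n^{[l-1]}$ and `n_out` the role of $n^{[l]}$.

```python
import numpy as np

def init_weights(n_in, n_out, scheme="he", seed=0):
    """Draw N(0, 1) weights and rescale them to the variance of the chosen scheme."""
    if scheme == "xavier":
        var = 1.0 / n_in
    elif scheme == "he":
        var = 2.0 / n_in
    elif scheme == "other":
        var = 2.0 / (n_in + n_out)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(var)

W_he = init_weights(500, 400, scheme="he")
W_xavier = init_weights(500, 400, scheme="xavier")
```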


## Search

* Grid Search, Random Search

* Coarse to fine

* Linear for: $L, n^{[l]}$

* $log_{10}$ scale for