# Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

## Train / dev / test sets

- train on train set
- see which model performs the best on dev set
- evaluate your best model on test set
- dev and test set must come from the same distribution
- not having a test set might be okay if you don't need an unbiased estimate of final performance

<table>
<tr>
    <td>train set error</td>
    <td>1%</td>
    <td>15%</td>
    <td>15%</td>
    <td>0.5%</td>
</tr>
<tr>
    <td>dev set error</td>
    <td>11%</td>
    <td>16%</td>
    <td>30%</td>
    <td>1%</td>
</tr>    
<tr>
    <td>diagnosis</td>
    <td>high variance</td>
    <td>high bias</td>
    <td>high bias and high variance</td>
    <td>neither (low bias, low variance)</td>
</tr>    
</table>

- high bias? use a bigger network, train longer
- high variance? get more data, use regularization

## Regularization

- penalizes weights being large
- $J(w,b) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}||w||^{2}$
- where $||w||^{2} = \displaystyle\sum_{j=1}^{n_{x}}w_{j}^{2} = w^{T}w$

In general
- $J(w^{[1]},b^{[1]} \dots w^{[L]},b^{[L]}) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}\displaystyle\sum_{l=1}^{L}||w^{[l]}||_{F}^{2}$
- where $||w^{[l]}||_{F}^{2} = \displaystyle\sum_{i=1}^{n^{[l]}}\displaystyle\sum_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^{2}$ (Frobenius norm; $w^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$)
- add $\dfrac{\lambda}{m}w^{[l]}$ to $dw^{[l]}$
- $w^{[l]} = w^{[l]} - \alpha dw^{[l]}$ remains the same
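
The penalty and the extra gradient term can be sketched in NumPy as below. This is a minimal illustration, not a full training loop: the function name `l2_cost_and_grads` and its arguments are assumptions made for this example, and the unregularized cost and gradients are taken as given.

```python
import numpy as np

def l2_cost_and_grads(data_loss, weights, grads, lam, m):
    """Add the L2 (Frobenius) penalty to a cost and to the weight gradients.

    data_loss: unregularized cost, (1/m) * sum of per-example losses
    weights:   list of weight matrices w[1], ..., w[L]
    grads:     list of unregularized gradients dw[1], ..., dw[L]
    lam, m:    regularization strength lambda and number of examples
    """
    # (lambda / 2m) * sum over layers of the squared Frobenius norm
    penalty = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    # each dw[l] gets the extra term (lambda / m) * w[l]
    reg_grads = [dW + (lam / m) * W for dW, W in zip(grads, weights)]
    return data_loss + penalty, reg_grads

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
dW1 = np.zeros_like(W1)
cost, (dW1_reg,) = l2_cost_and_grads(0.3, [W1], [dW1], lam=0.1, m=10)
```

Because the update rule itself is unchanged, the extra term shrinks every weight by a factor $(1 - \alpha\lambda/m)$ on each step, which is why L2 regularization is also called weight decay.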

## Dropout

Example: $l =3$
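- keep_prob = 0.8 (each unit is kept with probability 0.8, so there is a 20% chance that a unit is shut down)
- d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
- a3 = np.multiply(a3, d3)
- a3 = a3 / keep_prob (inverted dropout: the scaling keeps the expected value of a3 unchanged)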
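
The three steps above can be wrapped into one function. This is a sketch only: the name `dropout_forward` and the use of a seeded `np.random.default_rng` generator are choices made for this example, not part of the original notes.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Inverted dropout on an activation matrix a of shape (n_units, m)."""
    # boolean mask: True where the unit is kept (probability keep_prob)
    mask = rng.random(a.shape) < keep_prob
    # zero out dropped units, then rescale so the expected value is unchanged
    a_dropped = (a * mask) / keep_prob
    return a_dropped, mask

rng = np.random.default_rng(0)
a3 = np.ones((4, 1000))
a3_dropped, d3 = dropout_forward(a3, keep_prob=0.8, rng=rng)
```

The mask `d3` is returned because backprop needs to apply the same mask and scaling to the gradient of `a3`. At test time no mask is applied.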

Other regularization
- data augmentation
- early stopping

## Normalizing inputs

- makes gradient descent converge faster
- use the same $\mu$ and $\sigma$ (computed on the training set) to normalize both the training and test sets
- $\mu = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}x^{(i)}$, $\sigma^{2} = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}(x^{(i)}-\mu)^{2}$ (element-wise)
- $x = \dfrac{x-\mu}{\sigma}$
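
A minimal sketch of the normalization, assuming inputs are stored as a matrix of shape $(n_{x}, m)$. The helper names `fit_normalizer` and `normalize`, and the small `eps` guarding against zero variance, are additions for this example.

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and standard deviation of X_train, shape (n_x, m)."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)  # uses 1/m, as in the formula
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    """Apply statistics computed on the training set to any data split."""
    return (X - mu) / (sigma + eps)

rng = np.random.default_rng(0)
X_train = 3.0 + 10.0 * rng.standard_normal((2, 500))
X_test = 3.0 + 10.0 * rng.standard_normal((2, 100))

mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)  # same mu and sigma as training
```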

## Vanishing/exploding gradient

- happens in very deep neural networks
- careful weight initialization partially mitigates it
    - Var$(w_{i})$ = $\dfrac{1}{n}$ or $\dfrac{2}{n}$ (good for ReLU), where $n$ is the number of inputs to the unit
    - $w^{[l]}$ = np.random.randn($n^{[l]}$, $n^{[l-1]}$) * init_factor
    - init_factor = np.sqrt$\left(\dfrac{2}{n^{[l-1]}}\right)$ (good for ReLU) or np.sqrt$\left(\dfrac{1}{n^{[l-1]}}\right)$ (Xavier initialization)
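
The initialization rule can be sketched for a whole network as follows. The function name `init_weights`, the `relu` flag, and the dictionary layout of the parameters are assumptions made for this example.

```python
import numpy as np

def init_weights(layer_dims, rng, relu=True):
    """Initialize w[l] with variance 2/n[l-1] (ReLU) or 1/n[l-1] (Xavier).

    layer_dims: [n_x, n[1], ..., n[L]]
    """
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        init_factor = np.sqrt((2.0 if relu else 1.0) / n_prev)
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * init_factor
        params[f"b{l}"] = np.zeros((n_curr, 1))  # biases can start at zero
    return params

rng = np.random.default_rng(0)
params = init_weights([500, 400, 1], rng, relu=True)
```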
    
## Gradient checking

- take $w^{[1]},b^{[1]} \dots w^{[L]},b^{[L]}$ and reshape into a big vector $\theta$
    - $J(w^{[1]},b^{[1]} \dots w^{[L]},b^{[L]}) = J(\theta)$
- take $dw^{[1]},db^{[1]} \dots dw^{[L]},db^{[L]}$ and reshape into a big vector $d\theta$
- for each $i$
    - $d\theta_{approx}[i] = \dfrac{J(\theta_{1}, \theta_{2} \dots \theta_{i}+\epsilon \dots) - J(\theta_{1}, \theta_{2} \dots \theta_{i}-\epsilon \dots)}{2\epsilon} \approx d\theta[i] = \dfrac{\partial J}{\partial \theta_{i}}$
- check
    - $\dfrac{||d\theta_{approx}-d\theta||_{2}}{||d\theta_{approx}||_{2} + ||d\theta||_{2}} \approx 10^{-7}$ good 
    - bigger than $10^{-3}$ means something wrong!
- don't use in training, only to debug
- include regularization term in $d\theta$ calculation
- doesn't work with dropout (you can grad check with keep_prob=1.0, then turn dropout on afterwards)
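
A minimal sketch of the check on a toy cost, assuming the parameters are already flattened into a 1-D vector `theta`. The function `grad_check` and the cubic toy cost are made up for illustration; in a real network `J` would run forward prop and `grad_J` would run backprop.

```python
import numpy as np

def grad_check(J, grad_J, theta, eps=1e-7):
    """Relative difference between analytic and two-sided numerical gradients."""
    d_theta = grad_J(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps    # nudge only component i up
        minus[i] -= eps   # and down
        d_theta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

J = lambda th: np.sum(th ** 3)
good_grad = lambda th: 3 * th ** 2   # correct derivative
bad_grad = lambda th: 2 * th ** 2    # deliberately wrong derivative
theta = np.array([1.0, -2.0, 0.5])

ok_diff = grad_check(J, good_grad, theta)
bug_diff = grad_check(J, bad_grad, theta)
```

Here `ok_diff` should be tiny while `bug_diff` is far above the $10^{-3}$ warning threshold. The loop evaluates `J` twice per parameter, which is why the check is only used for debugging.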

## Mini-batch gradient descent

- in one epoch, mini-batch gradient descent takes $nb$ gradient steps rather than 1 (as in batch gradient descent)
- let batch_size = $bs$
- let number_of_batches = $nb$
- $x^{\{t\}}$: $(n_{x}, bs)$, $y^{\{t\}}$: $(1, bs)$

Implementation
- for $t = 1 \dots nb$
    - forward prop on $x^{\{t\}}$
        - $Z^{[1]} = w^{[1]}X^{\{t\}} + b^{[1]}$
        - $A^{[1]} = g^{[1]}(Z^{[1]})$
        - $ \vdots $
        - $A^{[L]} = g^{[L]}(Z^{[L]})$
    - compute cost 
        - $J^{\{t\}} = \dfrac{1}{bs}\displaystyle\sum_{i=1}^{bs}L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2bs}\displaystyle\sum_{l}||w^{[l]}||_{F}^{2}$
    - backward prop to compute gradients w.r.t. $J^{\{t\}}$ (use $X^{\{t\}}, y^{\{t\}}$) 
    - $w^{[l]} = w^{[l]} - \alpha dw^{[l]}$
    - $b^{[l]} = b^{[l]} - \alpha db^{[l]}$
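
The loop above can be sketched end to end on a small problem. To keep the example short it assumes a single linear layer with a squared-error loss instead of a deep network, and it shuffles the examples before splitting, which the notes do not mention. The helper `make_mini_batches` is a name chosen for this example.

```python
import numpy as np

def make_mini_batches(X, Y, bs, rng):
    """Shuffle the columns (examples) and split them into batches of size bs."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    # the last batch is smaller when m is not a multiple of bs
    return [(X[:, k:k + bs], Y[:, k:k + bs]) for k in range(0, m, bs)]

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 1000))           # shape (n_x, m)
w_true = np.array([[2.0, -1.0, 0.5]])
Y = w_true @ X + 0.3                          # shape (1, m)

w, b, alpha = np.zeros((1, 3)), 0.0, 0.1
for epoch in range(20):
    for X_t, Y_t in make_mini_batches(X, Y, bs=64, rng=rng):
        Y_hat = w @ X_t + b                   # forward prop on X^{t}
        dZ = (Y_hat - Y_t) / X_t.shape[1]     # gradient of the batch cost
        dw, db = dZ @ X_t.T, dZ.sum()         # backward prop
        w, b = w - alpha * dw, b - alpha * db # one gradient step per batch

num_batches = len(make_mini_batches(X, Y, bs=64, rng=rng))
```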
   
Notes   
- batch size = $m$: batch gradient (takes too long per iteration)
- batch size = 1: stochastic gradient (loses speed up from vectorization)
- typical mini-batch sizes are powers of 2: 64, 128, 256, 512, 1024

## Exponentially weighted average (moving average)

- $V_{t} = \beta V_{t-1} + (1-\beta)\theta_{t}$
    - approximately averages over the last $\dfrac{1}{1-\beta}$ data points
    
Implementation
- $V_{\theta} = 0$
- repeat
    - compute $\theta_{t}$
    - $V_{\theta} = \beta V_{\theta} + (1-\beta)\theta_{t}$
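
The update takes only a few lines of plain Python. The function name `ewa` and the constant input sequence are illustrative choices.

```python
def ewa(values, beta):
    """Exponentially weighted average of a sequence, starting from V = 0."""
    v, averages = 0.0, []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

# constant signal of 10.0: the average climbs from 0 toward 10
averages = ewa([10.0] * 50, beta=0.9)
```

Because $V$ starts at 0, the first values are biased toward 0 (here the first average is 1.0, not 10.0). The correction terms $\dfrac{1}{1-\beta^{t}}$ in the Adam section address this.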
        
## Momentum

- reduces oscillation by slowing learning in the vertical direction while speeding it up in the horizontal direction
- init $V_{dw} = 0, V_{db} = 0$
- on iteration $t$
    - compute $dw, db$ on current mini-batch
    - $V_{dw} = \beta V_{dw} + (1-\beta)dw$
    - $V_{db} = \beta V_{db} + (1-\beta)db$
    - $w = w - \alpha V_{dw}$
    - $b = b - \alpha V_{db}$
- $\beta$ is usually set to 0.9
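
A sketch of the update on a toy problem: an elongated quadratic bowl whose curvature is 25 times larger in one direction, which is the setting where plain gradient descent oscillates. The function `momentum_step` and the toy cost are assumptions for this example. Only $w$ is shown, since $b$ is updated the same way.

```python
import numpy as np

def momentum_step(w, dw, v_dw, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum."""
    v_dw = beta * v_dw + (1 - beta) * dw   # moving average of the gradients
    w = w - alpha * v_dw                   # step along the averaged gradient
    return w, v_dw

# toy cost J(w) = 0.5 * (1 * w[0]^2 + 25 * w[1]^2), so dw = curvature * w
curvature = np.array([1.0, 25.0])
w = np.array([1.0, 1.0])
v_dw = np.zeros_like(w)                    # init V_dw = 0
for t in range(300):
    dw = curvature * w
    w, v_dw = momentum_step(w, dw, v_dw, alpha=0.1, beta=0.9)
```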

## RMSprop

- on iteration $t$
    - compute $dw, db$ on current mini-batch
    - $S_{dw} = \beta S_{dw} + (1-\beta)dw^{2}$
    - $S_{db} = \beta S_{db} + (1-\beta)db^{2}$
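    - $w = w - \alpha \dfrac{dw}{\sqrt{S_{dw}}+\epsilon}$
    - $b = b - \alpha \dfrac{db}{\sqrt{S_{db}}+\epsilon}$
    - squares are element-wise; a small $\epsilon$ (e.g. $10^{-8}$) avoids division by zero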
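
A sketch of one RMSprop step for $w$ ($b$ is handled identically). The function name `rmsprop_step` is an assumption, and the small `eps` in the denominator is a numerical-stability safeguard added here.

```python
import numpy as np

def rmsprop_step(w, dw, s_dw, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop step: divide each gradient by its running RMS."""
    s_dw = beta * s_dw + (1 - beta) * dw ** 2      # element-wise square
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)
    return w, s_dw

w = np.zeros(2)
s_dw = np.zeros(2)
dw = np.array([1.0, 100.0])   # second direction has a 100x larger gradient
w_new, s_dw = rmsprop_step(w, dw, s_dw)
```

Even though the two gradient components differ by a factor of 100, both coordinates of `w_new` move by the same amount. This is how RMSprop damps the steep direction and lets a larger $\alpha$ be used.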
    
## Adam (Adaptive moment estimation)

- init $V_{dw} = 0, V_{db} = 0, S_{dw} = 0, S_{db} = 0$
- on iteration $t$
    - compute $dw, db$ on current mini-batch
    - $V_{dw} = \beta_{1} V_{dw} + (1-\beta_{1})dw$
    - $V_{db} = \beta_{1} V_{db} + (1-\beta_{1})db$
    - $S_{dw} = \beta_{2} S_{dw} + (1-\beta_{2})dw^{2}$
    - $S_{db} = \beta_{2} S_{db} + (1-\beta_{2})db^{2}$
    - $V_{dw,corrected} = \dfrac{V_{dw}}{(1-\beta_{1}^{t})}$
    - $V_{db,corrected} = \dfrac{V_{db}}{(1-\beta_{1}^{t})}$
    - $S_{dw,corrected} = \dfrac{S_{dw}}{(1-\beta_{2}^{t})}$
    - $S_{db,corrected} = \dfrac{S_{db}}{(1-\beta_{2}^{t})}$
    - $w = w - \alpha \dfrac{V_{dw,corrected}}{\sqrt{S_{dw,corrected}}+\epsilon}$
    - $b = b - \alpha \dfrac{V_{db,corrected}}{\sqrt{S_{db,corrected}}+\epsilon}$
- usually $\beta_{1} = 0.9$, $\beta_{2} = 0.999$, $\epsilon = 10^{-8}$
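
The Adam update for a single parameter array can be sketched as below. The function `adam_step` is a name chosen for this example, with the commonly used defaults $\beta_{1} = 0.9$, $\beta_{2} = 0.999$, $\epsilon = 10^{-8}$. The iteration counter `t` must start at 1 so that the bias correction is defined.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (v) and RMSprop (s) with bias correction."""
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)
    w = w - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return w, v, s

w = np.array([1.0, -1.0])
v, s = np.zeros_like(w), np.zeros_like(w)   # init V = 0, S = 0
dw = np.array([0.5, -20.0])
w_new, v, s = adam_step(w, dw, v, s, t=1)
```

On the first step the corrected averages equal $dw$ and $dw^{2}$, so each coordinate moves by about $\alpha$ in the direction opposite to its gradient, regardless of the gradient's magnitude.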

## Learning rate decay

- helps gradient descent converge by taking smaller steps as it approaches the minimum
- for example,
    - let decay rate = dr
    - let epoch number = en
    - $\alpha = \dfrac{1}{1 + dr \times en}\,\alpha_{0}$
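
As a quick worked example of the schedule, assuming $\alpha_{0} = 0.2$ and a decay rate of 1 (values picked for illustration; the function name `decayed_lr` is also an assumption):

```python
def decayed_lr(alpha0, decay_rate, epoch_num):
    """Learning rate after epoch_num epochs of 1 / (1 + dr * en) decay."""
    return alpha0 / (1 + decay_rate * epoch_num)

# epochs 1..4 give roughly 0.1, 0.067, 0.05, 0.04
rates = [decayed_lr(0.2, 1.0, en) for en in range(1, 5)]
```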
    
## Local optima

- in high dimensions, a point where the gradient is 0 is almost always a saddle point rather than a local optimum (so the optimization algorithm is unlikely to get stuck in a bad local optimum)
- plateaus (where derivatives are close to 0) can slow down the learning

## Hyperparameters

- use random sampling rather than a grid to choose hyperparameter values (number of layers, number of features, etc.)
- sample each hyperparameter on an appropriate scale
    - for example, $\alpha = 0.0001 \dots 1$
        - sample uniformly on a log scale, so that each decade $0.0001, 0.001, 0.01, 0.1, 1$ is equally likely
    - for example, $\beta = 0.9 \dots 0.999$
        - sample $1-\beta$ on a log scale: $0.1, 0.01, 0.001$
- panda: babysit one model
- caviar: train many models in parallel
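
Log-scale sampling can be sketched as follows: draw the exponent uniformly, then exponentiate. The helper names `sample_learning_rate` and `sample_beta` are made up for this example.

```python
import numpy as np

def sample_learning_rate(rng, low=1e-4, high=1.0):
    """Sample alpha uniformly on a log scale between low and high."""
    r = rng.uniform(np.log10(low), np.log10(high))   # exponent in [-4, 0)
    return 10 ** r

def sample_beta(rng, low=0.9, high=0.999):
    """Sample beta so that 1 - beta is uniform on a log scale."""
    r = rng.uniform(np.log10(1 - high), np.log10(1 - low))  # about [-3, -1)
    return 1 - 10 ** r

rng = np.random.default_rng(0)
alphas = [sample_learning_rate(rng) for _ in range(1000)]
betas = [sample_beta(rng) for _ in range(1000)]
```

With a linear scale about 99% of the sampled $\alpha$ values would exceed 0.01. On the log scale roughly half fall below it, so small learning rates get explored too.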

## Batch normalization

- normalize activations 
    - given some intermediate values in neural network $z^{(1)} \dots z^{(m)}$
    - $\mu = \dfrac{1}{m}\displaystyle\sum_{i}z^{(i)}$
    - $\sigma^{2} = \dfrac{1}{m}\displaystyle\sum_{i}(z^{(i)}-\mu)^{2}$
    - $z_{norm}^{(i)} = \dfrac{z^{(i)}-\mu}{\sqrt{\sigma^{2}+\epsilon}}$
    - $\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta$
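- for example, if $\gamma = \sqrt{\sigma^{2}+\epsilon}, \beta = \mu$, then $\tilde{z}^{(i)} = z^{(i)}$ (the normalization is undone)
- use $\tilde{z}^{(i)}$ instead of $z^{(i)}$
- unlike the inputs, you don't want to force the activations to be ~ $N(0,1)$; the learnable $\gamma$ and $\beta$ let the network choose the mean and variance of $\tilde{z}$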
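
A sketch of the forward computation for one layer, assuming $z$ is stored as a matrix of shape (units, $m$) and statistics are taken across the $m$ examples. The function name `batchnorm_forward` is an assumption for this example.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit across the batch, then rescale with gamma, beta."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)          # sigma^2, using 1/m
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = 4.0 + 3.0 * rng.standard_normal((5, 64))

# gamma = 1, beta = 0: each unit ends up with mean 0 and variance 1
Z_std, mu, var = batchnorm_forward(Z, gamma=1.0, beta=0.0)

# gamma = sqrt(var + eps), beta = mu: the original z is recovered
Z_same, _, _ = batchnorm_forward(Z, gamma=np.sqrt(var + 1e-8), beta=mu)
```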

$X \xrightarrow{w^{[1]}, b^{[1]}} z^{[1]} \xrightarrow{\beta^{[1]}, \gamma^{[1]}} \tilde{z}^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde{z}^{[1]}) \xrightarrow{w^{[2]}, b^{[2]}} z^{[2]} \xrightarrow{\beta^{[2]}, \gamma^{[2]}} \tilde{z}^{[2]} \rightarrow a^{[2]} \rightarrow \dots$ 
- parameters: $w, b, \beta, \gamma$

Working with mini-batches
- parameters: $w, \beta, \gamma$ (no need for $b$: subtracting the mean cancels any constant added to $z^{[l]}$)
- $z^{[l]} = w^{[l]}a^{[l-1]}$
- $\tilde{z}^{[l]} = \gamma^{[l]}z_{norm}^{[l]} + \beta^{[l]}$
- for $t = 1 \dots$ num_mini_batches
    - compute forward prop on $X^{\{t\}}$ 
        - in each layer, use BN to replace $z^{[l]}$ with $\tilde{z}^{[l]}$
    - use backprop to compute $dw^{[l]}, d\beta^{[l]}, d\gamma^{[l]}$ (no need for $db^{[l]}$)
    - update $w^{[l]} = w^{[l]} - \alpha dw^{[l]}, \beta^{[l]} = \beta^{[l]} - \alpha d\beta^{[l]}, \gamma^{[l]} = \gamma^{[l]} - \alpha d\gamma^{[l]}$
    
Batch normalization as regularization
- each mini-batch is scaled by the mean/variance computed on just that mini-batch
- this adds some noise to $z^{[l]}$
- this has slight regularization effect

Batch normalization at test time
- $\mu, \sigma^{2}$: estimate using exponentially weighted average (across mini-batches)
- $X^{\{1\}} \rightarrow \mu^{\{1\}[l]}, \sigma^{\{1\}[l]}, X^{\{2\}} \rightarrow \mu^{\{2\}[l]}, \sigma^{\{2\}[l]}, X^{\{3\}} \rightarrow \mu^{\{3\}[l]}, \sigma^{\{3\}[l]}, \dots$
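
One way to keep these running estimates is sketched below. The function names, the averaging coefficient `momentum=0.9`, and the initial values (mean 0, variance 1) are assumptions for this example, not values given in the notes.

```python
import numpy as np

def update_running_stats(running_mu, running_var, Z, momentum=0.9):
    """Fold one mini-batch's mean/variance into the running averages."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batchnorm_inference(z, running_mu, running_var, gamma, beta, eps=1e-8):
    """Normalize test examples with the running statistics from training."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
running_mu, running_var = np.zeros((3, 1)), np.ones((3, 1))
for t in range(200):                              # training mini-batches
    Z_t = 5.0 + 2.0 * rng.standard_normal((3, 64))
    running_mu, running_var = update_running_stats(running_mu, running_var, Z_t)

z_test = np.full((3, 1), 5.0)                     # a single test example
z_tilde = batchnorm_inference(z_test, running_mu, running_var, gamma=1.0, beta=0.0)
```

A single test example has no batch statistics of its own, so the running estimates stand in for $\mu$ and $\sigma^{2}$.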

## Softmax regression

- let $C$ be number of classes
- last layer (softmax layer) has $n^{[L]}= C$ units
    - $z^{[L]} = w^{[L]}a^{[L-1]} + b^{[L]}$
    - $t = e^{(z^{[L]})}$ (element-wise)
    - $a^{[L]} = \dfrac{e^{(z^{[L]})}}{\displaystyle\sum_{j=1}^{C}t_{j}}, a_{i}^{[L]} = \dfrac{t_{i}}{\displaystyle\sum_{j=1}^{C}t_{j}}$
- softmax regression generalizes logistic regression to $C$ classes
- loss function
    - $L(\hat{y}, y) = -\displaystyle\sum_{j}y_{j}\log\hat{y}_{j}$
- cost function
    - $J(w^{[1]}, b^{[1]}, \dots) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)})$
    
$z^{[L]} \rightarrow a^{[L]} = \hat{y} \rightarrow L(\hat{y}, y)$
- backprop: $dz^{[L]} = \hat{y} - y$
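
The softmax layer, the cost, and the output-layer gradient can be sketched as below, assuming $C = 4$ classes and one-hot labels stored as columns. The function names are chosen for this example. Subtracting the column maximum before exponentiating is an added numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (C, m)."""
    T = np.exp(Z - Z.max(axis=0, keepdims=True))   # shift for stability
    return T / T.sum(axis=0, keepdims=True)

def cross_entropy_cost(Y_hat, Y):
    """Average of -sum_j y_j * log(y_hat_j) over the m examples."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])   # z^{[L]} for one example
Y = np.array([[0.0], [1.0], [0.0], [0.0]])    # true class is the second one

Y_hat = softmax(Z)            # approx. [0.842, 0.042, 0.002, 0.114]
cost = cross_entropy_cost(Y_hat, Y)
dZ = Y_hat - Y                # gradient of the loss w.r.t. z^{[L]}
```

The cost is large here because the model assigns only about 4% probability to the true class. For $C = 2$ this reduces to logistic regression.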