# Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

## Train / dev / test sets

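- train each candidate model on the train set
- pick the model that performs best on the dev set
- evaluate the chosen model on the test set to get an unbiased estimate of its performance
- dev and test sets must come from the same distribution
- skipping the test set can be okay if you don't need an unbiased performance estimate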
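
A minimal sketch of such a split, assuming examples are stored as columns of `X` (shape `(n_x, m)`). The function name `split_train_dev_test` and the 80/10/10 fractions are illustrative assumptions, not something these notes prescribe.

```python
import numpy as np

def split_train_dev_test(X, Y, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the m examples (columns) and split them into train/dev/test."""
    m = X.shape[1]
    idx = np.random.default_rng(seed).permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    # dev and test come from the same shuffled pool, so they share a distribution
    return ((X[:, train_idx], Y[:, train_idx]),
            (X[:, dev_idx], Y[:, dev_idx]),
            (X[:, test_idx], Y[:, test_idx]))

# Toy data (hypothetical): 100 examples with 3 features each
X = np.arange(300, dtype=float).reshape(3, 100)
Y = (np.arange(100) % 2).reshape(1, 100)
(train_X, train_Y), (dev_X, dev_Y), (test_X, test_Y) = split_train_dev_test(X, Y)
print(train_X.shape, dev_X.shape, test_X.shape)
```

With very large datasets, much smaller dev and test fractions are usually enough.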

<table>
<tr>
    <td>train set error</td>
    <td>1%</td>
    <td>15%</td>
    <td>15%</td>
    <td>0.5%</td>
</tr>
<tr>
    <td>dev set error</td>
    <td>11%</td>
    <td>16%</td>
    <td>30%</td>
    <td>1%</td>
</tr>    
<tr>
    <td>diagnosis</td>
    <td>variance</td>
    <td>bias</td>
    <td>both bias and variance</td>
    <td>neither</td>
</tr>    
</table>

- high bias: use a bigger network or train longer
- high variance: get more data or use regularization
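
The reading of the table above can be sketched as a small rule. The function `diagnose` and both thresholds are hypothetical, and the rule assumes the best achievable error is close to 0%. If even humans make many mistakes on the task, a 15% train error may not indicate high bias.

```python
def diagnose(train_err, dev_err, bias_threshold=0.05, gap_threshold=0.05):
    """Classify a (train error, held-out error) pair; thresholds are illustrative."""
    high_bias = train_err > bias_threshold                   # poor fit even on the train set
    high_variance = (dev_err - train_err) > gap_threshold    # large generalization gap
    if high_bias and high_variance:
        return "both bias and variance"
    if high_bias:
        return "bias"
    if high_variance:
        return "variance"
    return "neither"

# The four columns of the table above
for train_err, dev_err in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]:
    print(train_err, dev_err, diagnose(train_err, dev_err))
```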

## Regularization

- penalizes large weights by adding their squared norm to the cost
- $J(w,b) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}||w||_{2}^{2}$
- where $||w||_{2}^{2} = \displaystyle\sum_{j=1}^{n_{x}}w_{j}^{2} = w^{T}w$

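In general, for an $L$-layer network
- $J(w^{[1]},b^{[1]} \dots w^{[L]},b^{[L]}) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}\displaystyle\sum_{l=1}^{L}||w^{[l]}||_{F}^{2}$
- where $||w^{[l]}||_{F}^{2} = \displaystyle\sum_{i=1}^{n^{[l]}}\displaystyle\sum_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^{2}$ (Frobenius norm; $w^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$)
- add $\dfrac{\lambda}{m}w^{[l]}$ to the $dw^{[l]}$ computed by backprop
- the update $w^{[l]} = w^{[l]} - \alpha dw^{[l]}$ keeps the same form; the extra term shrinks $w^{[l]}$ by a factor of $\left(1 - \dfrac{\alpha\lambda}{m}\right)$ each step ("weight decay")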
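
A minimal NumPy sketch of the two regularization pieces above: the penalty added to the cost and the term added to each gradient. The names `l2_penalty`, `add_l2_to_grads`, and `lambd` are assumptions chosen for illustration (`lambda` is a reserved word in Python).

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum over layers of the squared Frobenius norm."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def add_l2_to_grads(grads, weights, lambd, m):
    """Add (lambda / m) * W to each backprop gradient dW."""
    return [dW + (lambd / m) * W for dW, W in zip(grads, weights)]

# Toy example (hypothetical values): two layers, m = 10 examples
weights = [np.ones((2, 3)), 2 * np.ones((1, 2))]
grads = [np.zeros((2, 3)), np.zeros((1, 2))]
penalty = l2_penalty(weights, lambd=0.5, m=10)
reg_grads = add_l2_to_grads(grads, weights, lambd=0.5, m=10)
print(penalty, reg_grads[1])
```

Biases are left out of the penalty here, matching the formulas above, which only sum over $w^{[l]}$.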

## Dropout

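Example: layer $l = 3$ (inverted dropout)
- keep_prop = 0.8 (each unit has a 20% chance of being shut down)
- d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prop (boolean mask, one entry per activation)
- a3 = np.multiply(a3, d3) (zero out the dropped units)
- a3 = a3 / keep_prop (scale up so the expected value of a3 stays the same)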
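
The steps above can be put together as a runnable sketch. The wrapper `dropout_forward` and the toy all-ones activations are assumptions for illustration. The identifier `keep_prop` follows these notes.

```python
import numpy as np

def dropout_forward(a, keep_prop, rng):
    """Inverted dropout on activations `a`; returns the new activations and the mask."""
    d = rng.random(a.shape) < keep_prop   # True where the unit is kept
    a = a * d                             # shut down the dropped units
    a = a / keep_prop                     # rescale so E[a] is unchanged
    return a, d

rng = np.random.default_rng(0)
a3 = np.ones((50, 1000))                  # hypothetical layer-3 activations
a3_drop, d3 = dropout_forward(a3, keep_prop=0.8, rng=rng)
print(d3.mean(), a3_drop.mean())          # roughly 0.8 and roughly 1.0
```

The mask `d3` is returned because backprop has to apply the same mask and the same scaling to `da3`. At test time no mask is applied.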

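Other regularization methods
- data augmentation (e.g. flipped or randomly cropped copies of training images)
- early stopping (stop training when the dev set error starts to rise)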

## Normalizing inputs

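- makes gradient descent faster by putting all features on a similar scale
- compute $\mu$ and $\sigma$ on the training set and apply the same values to the test set
- $\mu = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}x^{(i)}$, $\sigma^{2} = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}(x^{(i)}-\mu)^{2}$ (element-wise square)
- $x = \dfrac{x-\mu}{\sigma}$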
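
A short sketch of the normalization, assuming examples are columns of `X` (shape `(n_x, m)`). The helper names and the small `eps` guard against zero-variance features are assumptions, not part of the formulas above.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Per-feature mean and standard deviation, computed on the training set only."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = np.sqrt(((X_train - mu) ** 2).mean(axis=1, keepdims=True))
    return mu, sigma + eps                # eps avoids division by zero

def normalize(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 1000))   # toy data
X_test = rng.normal(loc=5.0, scale=3.0, size=(2, 200))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)   # reuse the training statistics
```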

## Vanishing/exploding gradient

- happens in very deep neural networks: activations and gradients can shrink or grow exponentially with depth
- careful weight initialization partially overcomes this
    - Var$(w_{i})$ = $\dfrac{1}{n}$ or $\dfrac{2}{n}$ (good for ReLU), where $n$ is the number of inputs to the unit
    - $w^{[l]}$ = np.random.randn($n^{[l]}$, $n^{[l-1]}$) * init_factor
    - init_factor = np.sqrt$\left(\dfrac{2}{n^{[l-1]}}\right)$ (good for ReLU) or np.sqrt$\left(\dfrac{1}{n^{[l-1]}}\right)$ (Xavier initialization)
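
A sketch of this initialization for a whole network. The function `init_weights`, the `layer_dims` list, and the `"W1"`/`"b1"` dictionary keys are hypothetical names. Zero-initialized biases are an assumption as well.

```python
import numpy as np

def init_weights(layer_dims, activation="relu", seed=0):
    """layer_dims = [n_x, n_1, ..., n_L]; returns a dict with W1, b1, ..., WL, bL."""
    rng = np.random.default_rng(seed)
    numerator = 2.0 if activation == "relu" else 1.0   # 2/n for ReLU, 1/n otherwise
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_cur = layer_dims[l - 1], layer_dims[l]
        init_factor = np.sqrt(numerator / n_prev)
        params[f"W{l}"] = rng.standard_normal((n_cur, n_prev)) * init_factor
        params[f"b{l}"] = np.zeros((n_cur, 1))
    return params

params = init_weights([1000, 500, 10])
print(params["W1"].shape, params["W1"].var())   # variance should be near 2/1000
```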
    
## Gradient checking

- take $w^{[1]},b^{[1]} \dots w^{[L]},b^{[L]}$ and reshape into a big vector $\theta$
    - $J(w^{[1]},b^{[1]} \dots w^{[L]},b^{[L]}) = J(\theta)$
- take $dw^{[1]},db^{[1]} \dots dw^{[L]},db^{[L]}$ and reshape into a big vector $d\theta$
- for each $i$
    - $d\theta_{approx}[i] = \dfrac{J(\theta_{1}, \theta_{2} \dots \theta_{i}+\epsilon \dots) - J(\theta_{1}, \theta_{2} \dots \theta_{i}-\epsilon \dots)}{2\epsilon} \approx d\theta[i] = \dfrac{\partial J}{\partial \theta_{i}}$
- check
    - $\dfrac{||d\theta_{approx}-d\theta||_{2}}{||d\theta_{approx}||_{2} + ||d\theta||_{2}} \approx 10^{-7}$ is good (with $\epsilon = 10^{-7}$)
    - bigger than $10^{-3}$ means something is wrong!
- use it only to debug, not during training (it is slow)
- include the regularization term in both $J$ and the $d\theta$ calculation
- doesn't work with dropout (grad check with keep_prop=1.0, then turn dropout on afterwards)
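
The procedure above can be sketched on a toy cost whose gradient is known analytically. The function `grad_check` and the cubic cost are assumptions for illustration. In a real network, `J` would run forward propagation on the parameters unpacked from `theta`.

```python
import numpy as np

def grad_check(J, theta, dtheta, epsilon=1e-7):
    """Relative difference between dtheta and a two-sided numerical estimate."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator

# Toy cost (hypothetical): J(theta) = sum(theta^3), so dJ/dtheta = 3 * theta^2
J = lambda t: np.sum(t ** 3)
theta = np.array([1.0, -2.0, 3.0, 0.5])

good = grad_check(J, theta, 3 * theta ** 2)   # correct gradient
bad = grad_check(J, theta, 2 * theta ** 2)    # deliberately wrong gradient
print(good, bad)
```

The loop evaluates `J` twice per parameter, which is why the check is only practical as a debugging step.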