&copy;Copyright for [Shuang Wu] [2017]<br>
Cite from the [coursera] named [Neural network and Machine Learning] from [deeplearning.ai]<br>
Learning notes<br>

# Hyperparameter tuning

## Tuning process

* Hyperparameters
    * $\alpha$ (learning rate): most important
    * $\beta$ (momentum, default 0.9): second most important
    * $\beta_1$ (Adam): always use the default
    * $\beta_2$ (Adam): always use the default
    * $\epsilon$ (Adam): always use the default
    * number of layers: third most important
    * number of hidden units: second most important
    * learning rate decay: third most important
    * mini-batch size: second most important
    
* Try random values: Don't use a grid
    * ![img1](imgs/img1.jpg)
    
* Coarse to fine
    * ![img2](imgs/img2.jpg)
    * sample coarsely first, then zoom in on the best-performing region and sample it more densely
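
A minimal sketch of random sampling followed by a coarse-to-fine zoom, assuming two hyperparameters that have already been rescaled to $[0, 1]$. The helper names and `toy_score` are hypothetical; `toy_score` only stands in for "train a model and measure dev-set performance" and is not from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_uniform(low, high, n):
    """Draw n points uniformly at random from the box [low, high)."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    return low + (high - low) * rng.random((n, low.size))

def toy_score(points):
    """Hypothetical stand-in for a dev-set score; peaks at (0.3, 0.7)."""
    return -np.sum((points - np.array([0.3, 0.7])) ** 2, axis=1)

# Coarse pass: 25 random points try 25 distinct values of each hyperparameter,
# whereas a 5x5 grid would try only 5 distinct values of each.
coarse = sample_uniform([0.0, 0.0], [1.0, 1.0], 25)
best = coarse[np.argmax(toy_score(coarse))]

# Fine pass: zoom into a smaller box around the best coarse point and resample.
half_width = 0.1
low = np.clip(best - half_width, 0.0, 1.0)
high = np.clip(best + half_width, 0.0, 1.0)
fine = sample_uniform(low, high, 25)

candidates = np.vstack([coarse, fine])
best_overall = candidates[np.argmax(toy_score(candidates))]
```

The box half-width of 0.1 is an arbitrary choice for illustration; in practice it would depend on how sharply the score varies around the best coarse point.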

## Using an appropriate scale to pick hyperparameters

* Picking hyperparameters at random
    * ![img3](imgs/img3.jpg)
    * number of layers $L$: a small integer range such as 2 to 4 can be sampled uniformly

* Appropriate scale for hyperparameters
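    * if $\alpha$ ranges from $0.0001$ to $1$
        * search the value on a log scale
        * do not sample uniformly at random on a linear scale (about 90% of the samples would land between 0.1 and 1)
        * ```python
        import numpy as np
        r = -4 * np.random.rand()  # r in (-4, 0]
        alpha = 10 ** r            # alpha in (1e-4, 1]
        ```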
        
* Hyperparameters for exponentially weighted averages
    * $\beta=0.9,\cdots,0.999$
    * $1-\beta=0.1,\cdots,0.001$
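    * as before, sample $1-\beta$ on a log scale (0.1, 0.01, 0.001) instead of sampling $\beta$ uniformly
    * changing $\beta$ from 0.9000 to 0.9005 makes little difference: both average over roughly $\frac{1}{1-\beta}\approx 10$ values
    * changing $\beta$ from 0.9990 to 0.9995 makes a large difference: the averaging window grows from roughly 1000 to 2000 values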
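
A short sketch of sampling $\beta$ this way, assuming the range $0.9$ to $0.999$ from the notes. The helper `window` is a hypothetical name for the usual $\frac{1}{1-\beta}$ rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(0)

def window(beta):
    """Approximate number of values an exponentially weighted average covers."""
    return 1.0 / (1.0 - beta)

# Sample r uniformly in [-3, -1), so 1 - beta = 10**r lies in [0.001, 0.1).
r = rng.uniform(-3.0, -1.0, size=5)
beta = 1.0 - 10.0 ** r

# Sensitivity check: the same step of 0.0005 matters far more near 1.
small_change = window(0.9005) - window(0.9000)  # a fraction of one value
large_change = window(0.9995) - window(0.9990)  # about a thousand values
```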

## Hyperparameter tuning in practice: Pandas vs. Caviar

* Re-test hyperparameters occasionally
    * Idea to code to experiment to idea
        * NLP, vision, speech, Ads, logistics,...
        * Intuitions do get stale. Re-evaluate occasionally

* Babysitting one model
    * ![img4](imgs/img4.jpg)
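    * suits large datasets when there is not enough CPU or GPU capacity to train many models at once
        * the "panda" approach: few offspring, each one tended carefully
    
* Training many models in parallel
    * ![img5](imgs/img5.jpg)
    * the "caviar" approach (like fish): train many models at once and keep the one that performs best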

# Batch Normalization

## Normalizing activations in a network
    
* Normalizing inputs to speed up learning
    * ![img6](imgs/img6.jpg)
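    * this works for learning $w$, $b$ in a shallow model; on its own it does not help the hidden layers of a deep NN
    
* Batch Norm (BN)
    * normalize the $Z$ values of the hidden layers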
    
* Implementing Batch Norm
    * Given some intermediate values in the NN, $Z^{(1)}, \ldots, Z^{(m)}$
        * $\mu = \frac{1}{m}\sum_i Z^{(i)}$
        * $\sigma^2 = \frac{1}{m}\sum_i(Z^{(i)}-\mu)^2$
        * $$Z^{(i)}_{norm}=\frac{Z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$$
        * $$\tilde{Z}^{(i)}=\gamma Z^{(i)}_{norm}+\beta$$
            * $\gamma$ and $\beta$ are learnable parameters of the model
        * If $\gamma = \sqrt{\sigma^2+\epsilon}$ and $\beta=\mu$
        * Then $\tilde{Z}^{(i)}= Z^{(i)}$
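
The formulas above can be sketched in NumPy as follows. This is an illustration under assumptions, not the course's code: `batch_norm_forward` is a hypothetical name, and $Z$ is assumed to have shape (units, $m$) so that the statistics are taken across the $m$ examples.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z of shape (n_units, m) across the m examples, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)
    sigma2 = Z.var(axis=1, keepdims=True)  # (1/m) * sum_i (Z_i - mu)^2
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, Z_norm, mu, sigma2

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(4, 64))

# Unit scale and zero shift: each unit ends up with mean ~0 and variance ~1.
Z_tilde, Z_norm, mu, sigma2 = batch_norm_forward(Z, np.ones((4, 1)), np.zeros((4, 1)))

# Identity case from the notes: gamma = sqrt(sigma^2 + eps), beta = mu recovers Z.
Z_same, _, _, _ = batch_norm_forward(Z, np.sqrt(sigma2 + 1e-8), mu)
```

The identity case shows why $\gamma$ and $\beta$ are useful: the network can learn to undo the normalization if that turns out to be the better choice for a layer.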

## Fitting Batch Norm into a NN

* Adding Batch Norm to a network
    * ![img7](imgs/img7.jpg)

* Working w/ mini-batches
    * ![img8](imgs/img8.jpg)

* Implementing gradient descent
    * ```python
    for t = 1 ... num_minibatches:
        compute forward prop on X{t}
            in each hidden layer, use BN to replace z[l] with z_tilde[l]
        use backprop to compute dW[l], db[l], dbeta[l], dgamma[l]
        update parameters, e.g. W[l] := W[l] - alpha * dW[l]
    ```
    * ![img9](imgs/img9.jpg)
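
A runnable sketch of this loop for a tiny one-hidden-layer regression network. Everything here is an assumption made for illustration: the toy data, the ReLU activation, the squared-error loss, plain gradient descent, and all function names. The hidden layer has no bias, because subtracting the mini-batch mean cancels any constant added to $z$. The Batch Norm backward step uses the standard closed-form derivative of the normalization, which the notes do not spell out.

```python
import numpy as np

EPS = 1e-8

def forward(X, params):
    W1, gamma, beta, W2, b2 = params
    Z1 = W1 @ X                                    # no b1: BN's mean subtraction cancels it
    mu = Z1.mean(axis=1, keepdims=True)
    std = np.sqrt(Z1.var(axis=1, keepdims=True) + EPS)
    Z_norm = (Z1 - mu) / std
    Z_tilde = gamma * Z_norm + beta                # BN output replaces Z1
    A1 = np.maximum(Z_tilde, 0.0)                  # ReLU
    Y_hat = W2 @ A1 + b2
    return Y_hat, (X, Z_norm, std, Z_tilde, A1)

def loss_fn(Y_hat, Y):
    return 0.5 * np.mean((Y_hat - Y) ** 2)

def backward(Y_hat, Y, params, cache):
    W1, gamma, beta, W2, b2 = params
    X, Z_norm, std, Z_tilde, A1 = cache
    dY = (Y_hat - Y) / X.shape[1]
    dW2 = dY @ A1.T
    db2 = dY.sum(axis=1, keepdims=True)
    dZ_tilde = (W2.T @ dY) * (Z_tilde > 0)
    dgamma = (dZ_tilde * Z_norm).sum(axis=1, keepdims=True)
    dbeta = dZ_tilde.sum(axis=1, keepdims=True)
    dZ_norm = dZ_tilde * gamma
    # Gradient through the normalization (mu and sigma^2 both depend on Z1).
    dZ1 = (dZ_norm - dZ_norm.mean(axis=1, keepdims=True)
           - Z_norm * (dZ_norm * Z_norm).mean(axis=1, keepdims=True)) / std
    dW1 = dZ1 @ X.T
    return [dW1, dgamma, dbeta, dW2, db2]

rng = np.random.default_rng(0)
m, batch_size, alpha = 256, 64, 0.1
X = rng.normal(size=(3, m))                        # toy inputs
Y = np.sin(X[:1]) + 0.5 * X[1:2]                   # toy targets

params = [rng.normal(size=(8, 3)) * 0.5,           # W1
          np.ones((8, 1)), np.zeros((8, 1)),       # gamma, beta
          rng.normal(size=(1, 8)) * 0.5,           # W2
          np.zeros((1, 1))]                        # b2

initial_loss = loss_fn(forward(X, params)[0], Y)
for epoch in range(50):
    perm = rng.permutation(m)
    for start in range(0, m, batch_size):          # for t = 1 ... num_minibatches
        idx = perm[start:start + batch_size]
        Y_hat, cache = forward(X[:, idx], params)  # forward prop with BN
        grads = backward(Y_hat, Y[:, idx], params, cache)
        params = [p - alpha * g for p, g in zip(params, grads)]
final_loss = loss_fn(forward(X, params)[0], Y)
```

The same update line could be swapped for momentum, RMSprop, or Adam without changing the forward and backward passes.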

## Why does Batch Norm work

* Learning on shifting input distribution
    * ![img10](imgs/img10.jpg)

* Why this is a problem with NN
    * 
    
* Batch Norm as regularization
    * Each mini-batch is scaled by the mean/variance computed on just that mini-batch
    * This adds some noise to the values z within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer's activations
    * This has a slight regularization effect
        * a larger mini-batch (e.g. 512 instead of 64) gives less noisy statistics and therefore a weaker regularization effect
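
To make the noise argument concrete, the sketch below measures how much the per-mini-batch mean fluctuates for different mini-batch sizes. The array `z` is a hypothetical stand-in for one unit's pre-activation values over a whole training set, and the sizes are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=1.0, size=100_000)  # hypothetical values of z for one unit

def spread_of_batch_means(batch_size, n_batches=1000):
    """Standard deviation of the mini-batch mean across many random mini-batches."""
    batches = rng.choice(z, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

spread = {size: spread_of_batch_means(size) for size in (64, 128, 256, 512)}
for size, value in spread.items():
    print(size, value)
```

The spread shrinks roughly like $1/\sqrt{\text{batch size}}$, so the statistics used by Batch Norm become steadier as the mini-batch grows.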

## Batch Norm at test time

* 