&copy;Copyright for [Shuang Wu] [2017]<br>
Cite from the [coursera] named [Neural network and Machine Learning] from [deeplearning.ai]<br>
Learning notes<br>

# Hyperparameter tuning

## Tuning process

* Hyperparameters
    * $\alpha$ (learning rate): most important
    * $\beta$ (momentum; 0.9 is a good default): second most important
    * $\beta_1$ (Adam): always use the default
    * $\beta_2$ (Adam): always use the default
    * $\epsilon$ (Adam): always use the default
    * number of layers: third most important
    * number of hidden units: second most important
    * learning rate decay: third most important
    * mini-batch size: second most important
    
* Try random values: Don't use a grid
    * ![img1](imgs/img1.jpg)
    
* Coarse to fine
    * ![img2](imgs/img2.jpg)
    * sample coarsely over the whole range, then zoom in on the best-performing region and sample it more densely
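
A minimal sketch of random sampling followed by one coarse-to-fine step. The `score` function, the hyperparameter names `h1` and `h2`, the unit-square ranges, and the budget of 25 trials are all assumptions made for illustration; in practice the score would come from training a model and evaluating it on the dev set.

```python
import random

def score(h1, h2):
    # Hypothetical stand-in for "train a model with (h1, h2) and return its dev-set score".
    # This toy surface peaks at h1 = 0.3, h2 = 0.7.
    return -((h1 - 0.3) ** 2 + (h2 - 0.7) ** 2)

def random_search(h1_range, h2_range, n_samples, rng):
    # Draw every trial independently, so n_samples trials cover
    # n_samples distinct values of each hyperparameter (a grid would cover far fewer).
    trials = []
    for _ in range(n_samples):
        h1 = rng.uniform(*h1_range)
        h2 = rng.uniform(*h2_range)
        trials.append((score(h1, h2), h1, h2))
    return max(trials)  # (best score, h1, h2)

def zoom(center, half_width, low=0.0, high=1.0):
    # Smaller search interval around the best coarse value, clipped to the original range.
    return (max(low, center - half_width), min(high, center + half_width))

rng = random.Random(0)

# Coarse pass over the whole unit square.
coarse = random_search((0.0, 1.0), (0.0, 1.0), 25, rng)

# Fine pass: same budget, but concentrated around the best coarse point.
fine = random_search(zoom(coarse[1], 0.1), zoom(coarse[2], 0.1), 25, rng)

best = max(coarse, fine)
print("coarse:", coarse)
print("fine:  ", fine)
```

The fine pass spends the same number of trials on a much smaller box, which is the "zoom in and sample more" step from the notes.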

## Using an appropriate scale to pick hyperparameters

* Picking hyperparameters at random
    * ![img3](imgs/img3.jpg)
    * number of layers $L$: a small range such as 2 to 4 can be sampled uniformly

* Appropriate scale for hyperparameters
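    * if $\alpha$ ranges from $0.0001$ to $1$
        * search on a log scale
        * do not sample uniformly at random on a linear scale (about 90% of the samples would land between 0.1 and 1)
        * ```python
        import numpy as np
        r = -4 * np.random.rand()  # r is uniform in (-4, 0]
        alpha = 10 ** r            # alpha ranges over 10^-4 ... 10^0
        ```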
        
* Hyperparameters for exponentially weighted averages
    * $\beta=0.9,\cdots,0.999$
    * $1-\beta=0.1,\cdots,0.001$
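    * as before, sample $1-\beta$ on a log scale: $r \in [-3,-1]$, $1-\beta=10^{r}$, $\beta=1-10^{r}$
    * changing $\beta$ from 0.9000 to 0.9005 makes little difference (both average over roughly 10 values)
    * changing $\beta$ from 0.9990 to 0.9995 makes a big difference (roughly 1000 vs. 2000 values, since the average spans about $1/(1-\beta)$ values)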
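
The two sampling rules above can be sketched with the standard `random` module. The function names `sample_alpha` and `sample_beta` and the sample count are assumptions for illustration; the ranges are the ones given in the notes.

```python
import random

rng = random.Random(0)

def sample_alpha(rng):
    # r is uniform in [-4, 0], so alpha = 10**r lies in [1e-4, 1].
    r = rng.uniform(-4.0, 0.0)
    return 10 ** r

def sample_beta(rng):
    # Sample 1 - beta on a log scale: r in [-3, -1] gives 1 - beta in [0.001, 0.1].
    r = rng.uniform(-3.0, -1.0)
    return 1.0 - 10 ** r

alphas = [sample_alpha(rng) for _ in range(10000)]
betas = [sample_beta(rng) for _ in range(10000)]

# On a log scale each of the four decades of alpha receives a similar share of samples.
share_smallest_decade = sum(a < 1e-3 for a in alphas) / len(alphas)
print("share of alpha samples below 1e-3:", share_smallest_decade)

# 1 / (1 - beta) is roughly the number of values the average spans.
for beta in (0.9000, 0.9005, 0.9990, 0.9995):
    print(beta, round(1.0 / (1.0 - beta)))
```

The last loop illustrates why the sampling has to be denser near $\beta=1$: the same step of 0.0005 barely changes the averaging window at 0.9 but roughly doubles it at 0.999.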

## Hyperparameter tuning in practice: Pandas vs. Caviar

* Re-test hyperparameters occasionally
    * Idea to code to experiment to idea
        * NLP, vision, speech, Ads, logistics,...
        * Intuitions do get stale. Re-evaluate occasionally

* Babysitting one model
    * ![img4](imgs/img4.jpg)
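    * suits large datasets without enough CPUs/GPUs to train many models at once
        * the "panda" approach: nurture a single model, watching and adjusting it as it trains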
    
* Training many models in parallel
    * ![img5](imgs/img5.jpg)
    * the "caviar" approach (like fish laying many eggs): try many hyperparameter settings at once and keep the best

# Batch Normalization

## Normalizing activations in a network
    
* Normalizing inputs to speed up learning
    * ![img6](imgs/img6.jpg)
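    * this speeds up training of $w$, $b$ in a shallow model (e.g. logistic regression); in a deep NN it does not normalize the inputs to the later layers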
    
* Batch Norm (BN)
    * normalize $z^{[l]}$, the value before the activation function, in the hidden layers
    
* Implementing Batch Norm
    * Given some intermediate values in the NN, $Z^{(1)}$, ..., $Z^{(m)}$
        * $\mu = \frac{1}{m}\sum_i Z^{(i)}$
        * $\sigma^2 = \frac{1}{m}\sum_i(Z^{(i)}-\mu)^2$
        * $$Z^{(i)}_{norm}=\frac{Z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$$
        * $$\tilde{Z}^{(i)}=\gamma Z^{(i)}_{norm}+\beta$$
            * $\gamma$ and $\beta$ are learnable parameters of the model
        * If $\gamma = \sqrt{\sigma^2+\epsilon}$ and $\beta=\mu$
        * then $\tilde{Z}^{(i)}= Z^{(i)}$
    * use $\tilde{Z}^{(i)}$ instead of $Z^{(i)}$
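
The Batch Norm formulas above can be sketched in plain Python for a single hidden unit over one mini-batch. The function name `batch_norm`, the default `eps`, and the sample values are assumptions for illustration; a vectorized version would apply the same arithmetic to every unit of the layer.

```python
import math

def batch_norm(z, gamma, beta, eps=1e-8):
    # z holds the values z^(1), ..., z^(m) of one hidden unit over a mini-batch.
    m = len(z)
    mu = sum(z) / m
    sigma2 = sum((zi - mu) ** 2 for zi in z) / m
    z_norm = [(zi - mu) / math.sqrt(sigma2 + eps) for zi in z]
    z_tilde = [gamma * zn + beta for zn in z_norm]
    return z_tilde, mu, sigma2

z = [2.0, 4.0, 6.0, 8.0]  # made-up values

# gamma = 1, beta = 0: the output has mean 0 and variance close to 1.
z_tilde, mu, sigma2 = batch_norm(z, gamma=1.0, beta=0.0)

# gamma = sqrt(sigma^2 + eps), beta = mu: the transform is the identity.
eps = 1e-8
z_identity, _, _ = batch_norm(z, gamma=math.sqrt(sigma2 + eps), beta=mu, eps=eps)

print(z_tilde)
print(z_identity)
```

The second call checks the special case from the notes: with that choice of $\gamma$ and $\beta$ the network can undo the normalization if that is what training prefers.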

## Fitting Batch Norm into a NN

* Adding Batch Norm to a network
    * ![img7](imgs/img7.jpg)

* Working w/ mini-batches
    * ![img8](imgs/img8.jpg)

* Implementing gradient descent
    * ```python
    # Pseudocode: one pass over the mini-batches
    for t in range(1, num_minibatches + 1):
        # compute forward prop on X{t};
        # in each hidden layer, use BN to replace z[l] with z_tilde[l]
        # use backprop to compute dw, db, dbeta, dgamma
        # update parameters
        ...
    ```
    * for $t=1$ ... num MiniBatches
        * compute forward prop on $X^{\{t\}}$
            * In each hidden layer, use BN to replace $z^{[l]}$ with $\tilde{z}^{[l]}$
        * Use backprop to compute the $dw^{[l]}$, $db^{[l]}$, $d\beta^{[l]}$, $d\gamma^{[l]}$
        * Update parameters:
            * $w^{[l]} = w^{[l]}-\alpha dw^{[l]}$
            * $\beta^{[l]} = \beta^{[l]}-\alpha d\beta^{[l]}$
            * $\gamma^{[l]} = \gamma^{[l]}-\alpha d\gamma^{[l]}$
    * works with gradient descent, momentum, RMSprop, and Adam
    * the added constant $b^{[l]}$ is cancelled by the mean subtraction, so it can be dropped; the parameter $\beta^{[l]}$ takes over its role as the shift
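
A small check of the last point, that the bias $b$ has no effect once Batch Norm follows the linear step. This is a sketch for a single unit with a scalar weight; `batch_norm`, `layer_output`, and all numeric values are assumptions for illustration.

```python
import math

def batch_norm(z, gamma, beta, eps=1e-8):
    # Normalize over the mini-batch, then scale by gamma and shift by beta.
    m = len(z)
    mu = sum(z) / m
    sigma2 = sum((zi - mu) ** 2 for zi in z) / m
    return [gamma * (zi - mu) / math.sqrt(sigma2 + eps) + beta for zi in z]

a_prev = [0.5, -1.2, 3.0, 0.7]  # activations from the previous layer (made-up values)
w = 2.0

def layer_output(b):
    # z = w * a + b for every example in the mini-batch, followed by BN.
    z = [w * a + b for a in a_prev]
    return batch_norm(z, gamma=1.5, beta=0.3)

out_b0 = layer_output(b=0.0)
out_b5 = layer_output(b=5.0)
print(out_b0)
print(out_b5)
```

Both calls give the same values up to rounding, because adding $b$ shifts every $z^{(i)}$ and the mean $\mu$ by the same amount.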

## Why does Batch Norm work

* Learning on shifting input distribution
    * ![img10](imgs/img10.jpg)

* Why this is a problem with NN
    * ![img11](imgs/img11.jpg)
    
* Batch Norm as regularization
    * Each *mini-batch* is scaled by the mean/variance computed on just that mini-batch
    * This adds some noise to the values $z^{[l]}$ (and hence $\tilde{z}^{[l]}$) within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer's activations
    * This has a slight regularization effect
        * e.g. mini-batch size from 64 to 512
            * a bigger mini-batch size reduces the regularization effect
            * but Batch Norm is not intended as a regularizer
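
The noise argument can be illustrated by measuring how much the mini-batch mean fluctuates for two batch sizes. This is only a sketch: the Gaussian "activations", the population size, and the helper name `spread_of_batch_means` are assumptions, not part of the course material.

```python
import random
import statistics

rng = random.Random(0)

# Made-up values of one hidden unit over a 20,000-example training set.
population = [rng.gauss(0.0, 1.0) for _ in range(20000)]

def spread_of_batch_means(batch_size, n_batches=200):
    # Standard deviation of the mini-batch mean across randomly drawn mini-batches.
    means = []
    for _ in range(n_batches):
        batch = rng.sample(population, batch_size)
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)

noise_64 = spread_of_batch_means(64)
noise_512 = spread_of_batch_means(512)
print("batch size 64: ", noise_64)
print("batch size 512:", noise_512)
```

The estimate of $\mu$ is noisier for the smaller mini-batch, which is the source of the slight regularization effect and the reason it shrinks as the mini-batch grows.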

## Batch Norm at test time

* Batch norm at test time
    * $$\mu=\frac{1}{m}\sum_i z^{(i)}$$
        * $m$: number of examples in the mini-batch
        * computed per mini-batch, so only available during training
    * $$\sigma^2=\frac{1}{m}\sum_i (z^{(i)}-\mu)^2$$
        * computed per mini-batch, so only available during training
    * $$z_{norm}^{(i)}=\frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$$
    * $$\tilde{z}^{(i)} = \gamma z_{norm}^{(i)}+\beta$$
    * $\mu$, $\sigma^2$: estimate using an exponentially weighted average across the mini-batches
        * $X^{\{1\}}$, $X^{\{2\}}$, $X^{\{3\}}$
        * $\mu^{\{1\}[l]}$, $\mu^{\{2\}[l]}$, $\mu^{\{3\}[l]}$
        * $\theta_1$, $\theta_2$, $\theta_3$
        * $\sigma^{2\{1\}[l]}$, $\sigma^{2\{2\}[l]}$, $\sigma^{2\{3\}[l]}$
    * At test time
        * $$z_{norm}=\frac{z-\mu}{\sqrt{\sigma^2+\epsilon}}$$
        * $$\tilde{z}=\gamma z_{norm}+\beta$$
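
A minimal sketch of the test-time procedure for one hidden unit. The helper names `ewa` and `batch_norm_test`, the decay of 0.9, the bias correction, and the per-mini-batch statistics are assumptions for illustration; deep learning frameworks keep similar running estimates, with details that differ.

```python
import math

def ewa(values, decay=0.9):
    # Exponentially weighted average of the per-mini-batch statistics,
    # with bias correction so that the early zero initialization does not dominate.
    v = 0.0
    for theta in values:
        v = decay * v + (1.0 - decay) * theta
    return v / (1.0 - decay ** len(values))

# Statistics of one hidden unit recorded on successive mini-batches (made-up numbers).
batch_means = [0.9, 1.1, 1.0, 1.05, 0.95]
batch_vars = [4.2, 3.8, 4.0, 4.1, 3.9]

mu_test = ewa(batch_means)
sigma2_test = ewa(batch_vars)

def batch_norm_test(z, gamma, beta, eps=1e-8):
    # A single test example is normalized with the stored estimates,
    # not with statistics of its own (there is no mini-batch at test time).
    z_norm = (z - mu_test) / math.sqrt(sigma2_test + eps)
    return gamma * z_norm + beta

z_tilde = batch_norm_test(z=3.0, gamma=1.0, beta=0.0)
print(mu_test, sigma2_test, z_tilde)
```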

# Multi-class classification

## Softmax Regression

* Recognizing cats, dogs, and baby chicks
    * cats, 1
    * dogs, 2
    * baby chicks, 3
    * other, 0
    * ![img12](imgs/img12.jpg)
    * $C = \#\text{classes} = 4$
    * ![img13](imgs/img13.jpg)
    
* Softmax Layer
    * ![img14](imgs/img14.jpg)
    * $$z^{[L]} = w^{[L]}a^{[L-1]}+b^{[L]}$$
    * Activation function:
        * temporary variable, shape (4,1): $$t = e^{z^{[L]}}$$
        * $$a^{[L]} = \frac{e^{z^{[L]}}}{\sum_{j=1}^4 t_j}$$
        * $$a^{[L]}_i = \frac{t_i}{\sum_{j=1}^4 t_j}$$
    * e.g.
        * $z^{[L]}=\begin{bmatrix} 5\\ 2\\ -1\\ 3\end{bmatrix}$,  $t=\begin{bmatrix} e^5\\ e^2\\ e^{-1}\\ e^3\end{bmatrix}=\begin{bmatrix} 148.4\\ 7.4\\ 0.4\\ 20.1\end{bmatrix}$
        * $$\sum^4_{j=1}t_j=176.3$$
        * $$a^{[L]} = \frac{t}{176.3}$$
        * layer L: $$\hat{y}=\begin{bmatrix} \frac{e^5}{176.3}\\ \frac{e^2}{176.3}\\ \frac{e^{-1}}{176.3}\\ \frac{e^3}{176.3}\end{bmatrix}=\begin{bmatrix} 0.842\\ 0.042\\ 0.002\\ 0.114\end{bmatrix}$$
    * $$a^{[L]} = g^{[L]}(z^{[L]})$$
        * (4,1) for $a$ and $z$
        
* softmax examples
    * $C=3$, linear decision boundaries
        * ![img15](imgs/img15.jpg)
        * ![img16](imgs/img16.jpg)
        * ![img17](imgs/img17.jpg)
    * C= 4, 5, 6
        * ![img18](imgs/img18.jpg)
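
The worked softmax example from this section can be reproduced in a few lines. The function name `softmax` is an assumption, and subtracting `max(z)` before exponentiating is a common numerical-stability step that is not in the notes; it does not change the result.

```python
import math

def softmax(z):
    # Shifting by max(z) avoids overflow in exp and cancels in the ratio.
    shift = max(z)
    t = [math.exp(zi - shift) for zi in z]
    total = sum(t)
    return [ti / total for ti in t]

z_L = [5.0, 2.0, -1.0, 3.0]
y_hat = softmax(z_L)
print([round(p, 3) for p in y_hat])
```

The outputs sum to 1 and the largest entry of $z^{[L]}$ receives the largest probability, matching the vector $\hat{y}$ computed by hand above.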
    

## Training a softmax classifier

* Understanding softmax
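    * "hard max" would put 1 at the largest entry of $z^{[L]}$ and 0 elsewhere, e.g. $[1, 0, 0, 0]^T$; softmax is the gentler mapping to probabilities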
    * Softmax regression generalizes logistic regression to $C$ classes
        * If $C=2$, softmax reduces to logistic regression

* Loss function
    * $y=\begin{bmatrix} 0\\1\\0\\0\end{bmatrix}$, cat
    * $a^{[L]}=\hat{y}=\begin{bmatrix} 0.3\\0.2\\0.1\\0.4\end{bmatrix}$, $C=4$
    * loss, to be made small: $$l(\hat{y},y)=-\sum^4_{j=1}y_j\log\hat{y}_j$$
        * $-y_2\log\hat{y}_2 = -\log\hat{y}_2$
        * so minimizing the loss means making $\hat{y}_2$ big
    * cost: $$J(w,b,...)=\frac{1}{m}\sum^m_{i=1}l(\hat{y}^{(i)},y^{(i)})$$
    
* Gradient descent w/ softmax
    * ![img19](imgs/img19.jpg)
    * Backprop: $dz^{[L]}=\hat{y}-y$, shape (4,1)
        * where $dz^{[L]}=\frac{\partial J}{\partial z^{[L]}}$
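
The claims in this section can be checked numerically. The sketch below computes the loss for the example $y$ and $\hat{y}$ from the notes, compares $dz^{[L]}=\hat{y}-y$ with a finite-difference gradient for a single example, and confirms that softmax with $C=2$ matches the logistic sigmoid. The helper names, the test point `z`, and the step `h` are assumptions for illustration.

```python
import math

def softmax(z):
    shift = max(z)  # numerical-stability shift, cancels in the ratio
    t = [math.exp(zi - shift) for zi in z]
    total = sum(t)
    return [ti / total for ti in t]

def cross_entropy(y_hat, y):
    # l(y_hat, y) = -sum_j y_j * log(y_hat_j); terms with y_j = 0 contribute nothing.
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj > 0)

# Loss for the example in the notes: only the second entry of y is 1.
y = [0, 1, 0, 0]
loss = cross_entropy([0.3, 0.2, 0.1, 0.4], y)  # equals -log(0.2)

# Finite-difference check of dz = y_hat - y at an arbitrary z.
z = [5.0, 2.0, -1.0, 3.0]
analytic = [p - yj for p, yj in zip(softmax(z), y)]
h = 1e-6
numeric = []
for j in range(len(z)):
    z_plus = z[:j] + [z[j] + h] + z[j + 1:]
    z_minus = z[:j] + [z[j] - h] + z[j + 1:]
    diff = cross_entropy(softmax(z_plus), y) - cross_entropy(softmax(z_minus), y)
    numeric.append(diff / (2 * h))

# With C = 2, the first softmax output is the sigmoid of z_1 - z_2.
z1, z2 = 1.3, -0.4
two_class = softmax([z1, z2])[0]
sigmoid = 1.0 / (1.0 + math.exp(-(z1 - z2)))

print(loss)
print(analytic)
print(numeric)
```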

# Introduction to programming frameworks

## DL frameworks

* DL frameworks
    * Caffe/Caffe2
    * CNTK
    * DL4J
    * Keras
    * Lasagne
    * MXNet
    * PaddlePaddle
    * TensorFlow
    * Theano
    * Torch
    
* Choosing DL frameworks
    * Ease of programming (development and deployment)
    * Running speed
    * Truly open (open source w/ good governance)

## TensorFlow

* Motivating problem
    * Minimize some cost function $J(w)=w^2-10w+25$
        * $J(w)=(w-5)^2$, so the minimum is at $w=5$
        * in general, minimize $J(w,b)$ to find $w$, $b$

```python
import numpy as np
import tensorflow as tf
```

```python
# TensorFlow 1.x API (tf.placeholder, tf.Session)
# coefficients = np.array([[1.], [-10.], [25.]])  # J(w) = (w - 5)**2
coefficients = np.array([[1.], [-20.], [100.]])    # J(w) = (w - 10)**2

w = tf.Variable(0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])
# cost = tf.add(tf.add(w**2, tf.multiply(-10., w)), 25)
# cost = w**2 - 10*w + 25
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]  # coefficients are fed in through x
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w))
```

Output: `0.0`

One step of gradient descent:

```python
session.run(train, feed_dict={x: coefficients})
print(session.run(w))
```

Output: `0.2`

1000 more steps:

```python
for i in range(1000):
    session.run(train, feed_dict={x: coefficients})
print(session.run(w))
```

Output: `9.99998`

The initialization can also be written with a `with` block, which closes the session on exit. This starts a new session, so `w` is back at its initial value:

```python
with tf.Session() as session:
    session.run(init)
    print(session.run(w))
```

Output: `0.0`


* ![img20](imgs/img20.jpg)
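
For comparison, the same minimization can be written without a framework by coding the derivative by hand. This is a sketch under the same settings as the TensorFlow session (coefficients 1, -20, 100 and learning rate 0.01); the function name `grad` is an assumption.

```python
def grad(w, coefficients):
    # J(w) = a*w**2 + b*w + c, so dJ/dw = 2*a*w + b.
    a, b, _ = coefficients
    return 2 * a * w + b

coefficients = (1.0, -20.0, 100.0)  # J(w) = (w - 10)**2
learning_rate = 0.01

# One step from w = 0.
w = 0.0
w -= learning_rate * grad(w, coefficients)
w_after_one_step = w

# 1000 more steps.
for _ in range(1000):
    w -= learning_rate * grad(w, coefficients)

print(w_after_one_step, w)
```

The hand-written version needs the derivative spelled out, whereas TensorFlow only needs the forward expression for `cost` and derives the update itself.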