## train/dev/test sets
train set: train the model  
dev set: evaluate the model and tune hyperparameters  
test set: give an unbiased estimate of final performance  

### ratio
- small dataset (e.g. 1,000 or 10,000 examples): 60%/20%/20%
- larger dataset (e.g. 1,000,000 examples): 98%/1%/1%
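
A minimal sketch of how such a split could be produced, assuming the examples are addressed by index. The helper name `split_indices` and the fixed seed are illustrative choices, not part of the notes.

```python
import random

def split_indices(m, dev_frac, test_frac, seed=0):
    """Shuffle the indices 0..m-1 and split them into train/dev/test lists."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)
    n_dev = round(m * dev_frac)
    n_test = round(m * test_frac)
    n_train = m - n_dev - n_test  # whatever is left goes to the train set
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]
    return train, dev, test

# Small dataset: 60% / 20% / 20%
small = split_indices(1_000, dev_frac=0.20, test_frac=0.20)
print([len(part) for part in small])   # [600, 200, 200]

# Larger dataset: 98% / 1% / 1%
large = split_indices(1_000_000, dev_frac=0.01, test_frac=0.01)
print([len(part) for part in large])   # [980000, 10000, 10000]
```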

## Bias and Variance

<img src="../image/note2/bav.png" style="width:70%;">
<img src="../image/note2/td.png" style="width:70%;">

High bias: the model performs poorly even on the training set (high training error), i.e. it underfits  
High variance: the dev set error is significantly higher than the training set error, i.e. the model overfits  

## Basic Recipe for Machine Learning

1. High bias --> does not perform well on the training set --> bigger network (more hidden units or layers), train longer, try a new NN architecture, try a more advanced optimization algorithm

2. High variance --> performs much worse on the dev set than on the training set --> more data, regularization, or try a more appropriate NN architecture

3. Low bias and variance --> done
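
The recipe can be sketched as a small check on the two error rates. The function name `diagnose` and both thresholds are hypothetical, chosen only for illustration; the sketch also assumes the best achievable error is close to 0%.

```python
def diagnose(train_error, dev_error, acceptable_error=0.02, max_gap=0.02):
    """Return the problems suggested by the train/dev error rates.

    acceptable_error and max_gap are illustrative thresholds (assumptions),
    not values taken from the notes.
    """
    problems = []
    if train_error > acceptable_error:
        problems.append("high bias")        # poor fit even on the training set
    if dev_error - train_error > max_gap:
        problems.append("high variance")    # large train/dev gap
    return problems

print(diagnose(0.01, 0.11))   # ['high variance']
print(diagnose(0.15, 0.16))   # ['high bias']
print(diagnose(0.15, 0.30))   # ['high bias', 'high variance']
print(diagnose(0.005, 0.01))  # []
```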

## Regularization
regularization can prevent overfitting and reduce variance    

L2 regularization: 
$$
J(w, b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|\mathbf{w}\|^2_2
$$ 
where $$\|\mathbf{w}\|^2_2 = w^Tw$$ 

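meaning ***it adds the squares of all individual elements in w.*** For a vector this is the squared Euclidean (L2) norm; for a weight matrix $W^{[l]}$ in a NN the same sum of squares is called the squared Frobenius norm.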

L1 regularization:

changing the regularization term into
$$
\frac{\lambda}{2m}\|\mathbf{w}\|_1
$$

w will become sparse, meaning w will have a lot of zeros (which helps to compress the model).

$\lambda$ is called the regularization parameter. It is a hyperparameter.


When doing back propagation,

$dw^{[l]} = (backprop) + \frac{\lambda}{m}w^{[l]}$

$w^{[l]} := w^{[l]} - \alpha dw^{[l]}$

$w^{[l]} := w^{[l]} - \alpha (backprop) - \frac{\alpha\lambda}{m}w^{[l]}$

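L2 regularization is also called weight decay, because each update first multiplies $w^{[l]}$ by $(1 - \frac{\alpha\lambda}{m})$, a factor slightly less than 1.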
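
A small NumPy sketch, under the assumption that the backprop gradient is already available as an array, checking that the regularized update equals shrinking the weights and then taking the usual step. The helper names and the random test values are illustrative.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """Return (lambda / 2m) * sum of squared entries over all weight arrays."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def update_with_l2(W, dW_backprop, lam, m, alpha):
    """One gradient step where dW includes the L2 term (lambda / m) * W."""
    dW = dW_backprop + (lam / m) * W
    return W - alpha * dW

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
dW_backprop = rng.standard_normal((3, 2))   # stand-in for the real backprop term
lam, m, alpha = 0.7, 50, 0.1

W_new = update_with_l2(W, dW_backprop, lam, m, alpha)
# Same update written as "decay the weights, then take the usual step".
W_decay = (1 - alpha * lam / m) * W - alpha * dW_backprop
print(np.allclose(W_new, W_decay))
```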


## How does regularization prevent overfitting?

1. zeroing out the impact of some hidden units --> simplifying the NN --> moving from high variance toward high bias

2. w decreases --> z gets closer to 0 --> every layer stays in the nearly linear part of tanh --> the NN behaves more like a linear model --> less overfitting

## Dropout Regularization

Each node in a layer is eliminated with probability 1 - keep_prob, so every training pass uses a reduced NN. This has a similar effect to regularization.  

When implementing (inverted) dropout, the activations a that are kept should be divided by keep_prob, so that the expected value of z in the next layer stays the same.
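
A minimal sketch of inverted dropout on one layer's activations, assuming NumPy arrays. The function name `dropout_forward` and the all-ones test input are illustrative.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Apply inverted dropout to activations a; returns (a_dropped, mask)."""
    mask = rng.random(a.shape) < keep_prob   # True means the unit is kept
    a_dropped = a * mask / keep_prob         # rescale so the expected value is unchanged
    return a_dropped, mask

rng = np.random.default_rng(0)
a = np.ones((100, 1000))
a_dropped, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
print(a_dropped.mean())   # close to 1.0, the mean of a
```

At test time no units are dropped and no rescaling is applied.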

## Other Regularization Methods
1. Data augmentation: add extra synthetic training examples (e.g. horizontally flipped images) to reduce overfitting.  
2. Early stopping: stop training partway through. It can prevent overfitting, but the cost function is no longer fully optimized.  

## Normalizing Training Sets
1. Subtract the mean: shift the training set so that it has zero mean.
$$
\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}  
$$
$$
x := x - \mu
$$

2. Normalize the variance: scale each feature so that its variance is 1 (the square is element-wise, taken after subtracting the mean)
$$
\sigma^{2} = \frac{1}{m} \sum_{i=1}^m (x^{(i)})^{2}
$$
$$
x := x / \sigma
$$
$$

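Without normalization the cost function can be very elongated, so gradient descent might oscillate. With normalized inputs it is easier and faster to optimize. Use the same $\mu$ and $\sigma$ (computed on the training set) to normalize the dev and test sets.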
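
The two steps can be sketched as follows, assuming `X` stores one example per column (shape `(n_features, m)`). The function names and the small `eps` guard against zero variance are additions for illustration.

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute the per-feature mean and standard deviation of the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = np.sqrt(((X_train - mu) ** 2).mean(axis=1, keepdims=True))
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    """Shift to zero mean and scale to unit variance using the given statistics."""
    return (X - mu) / (sigma + eps)

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 500))
X_test = rng.normal(loc=5.0, scale=3.0, size=(2, 100))

mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)   # reuse the training statistics
print(X_train_norm.mean(axis=1), X_train_norm.std(axis=1))
```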

## Vanishing and Exploding Gradients
<img src="../image/note2/vanishing.png" style="width:70%;">  

If every weight in the network above is small, e.g. 0.5, then $\hat{y}$ will be very small, since each layer multiplies its input by 0.5. If every weight is 1.5, then $\hat{y}$ will be very large. The same thing happens to the gradients in back propagation, so the weights of the early layers are updated inefficiently.
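
A quick arithmetic illustration, under the simplifying assumption that every layer is linear and just multiplies its input by the same scalar w. The depth of 50 layers is an arbitrary choice.

```python
def output_scale(w, num_layers):
    """Factor applied to the input after num_layers layers that each multiply by w."""
    return w ** num_layers

for w in (0.5, 1.5):
    # 0.5 shrinks the signal toward 0, 1.5 blows it up
    print(w, output_scale(w, num_layers=50))
```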

## Weight Initialization for Deep Network
When initializing w with np.random.randn(), we should multiply by a suitable standard deviation so that w stays in an appropriate range, which helps prevent exploding or vanishing gradients.
- ReLU(He initialization): $\sqrt{\frac{2}{n^{[l-1]}}}$
- tanh(Xavier initialization): $\sqrt{\frac{1}{n^{[l-1]}}}$
- other initialization: $\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}$
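
A sketch of the three scalings above for one layer with `n_prev` inputs and `n_curr` units. The function name, the `method` strings, and the use of NumPy's `Generator.standard_normal` (in place of `np.random.randn`) are illustrative choices.

```python
import numpy as np

def init_weights(n_prev, n_curr, method="he", rng=None):
    """Draw W of shape (n_curr, n_prev) from N(0, 1) and rescale it."""
    rng = np.random.default_rng() if rng is None else rng
    if method == "he":          # ReLU
        scale = np.sqrt(2.0 / n_prev)
    elif method == "xavier":    # tanh
        scale = np.sqrt(1.0 / n_prev)
    elif method == "other":     # the third variant listed above
        scale = np.sqrt(2.0 / (n_prev + n_curr))
    else:
        raise ValueError(f"unknown method: {method}")
    return rng.standard_normal((n_curr, n_prev)) * scale

W = init_weights(1000, 500, method="he", rng=np.random.default_rng(0))
print(W.shape, W.std())   # std is close to sqrt(2 / 1000)
```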

## Numerical Approximation of Gradient
$\frac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2\varepsilon} \approx f'(\theta)$
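
A small check of the two-sided formula against a one-sided difference, using $f(\theta) = \theta^3$ at $\theta = 1$ (true derivative 3) as an illustrative example. The function names are hypothetical.

```python
def central_difference(f, theta, eps=0.01):
    """Two-sided estimate of f'(theta); the error shrinks like eps**2."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def one_sided_difference(f, theta, eps=0.01):
    """One-sided estimate of f'(theta); the error shrinks only like eps."""
    return (f(theta + eps) - f(theta)) / eps

def cube(theta):
    return theta ** 3

two_sided = central_difference(cube, 1.0)
one_sided = one_sided_difference(cube, 1.0)
print(two_sided, one_sided)   # the two-sided value is much closer to 3
```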


## Gradient Checking
1. Take $w^{[1]}$, $b^{[1]}$, ... , $w^{[L]}$, $b^{[L]}$ and reshape them into one big vector $\theta$.

2. cost function --> J($\theta$) = J($\theta_1$, $\theta_2$, $\theta_3$, ...)

3. Take $dw^{[1]}$, $db^{[1]}$, ... , $dw^{[L]}$, $db^{[L]}$ and reshape them into one big vector $d\theta$.

4. for each i: $d\theta_{approx}[i] = \frac{J(\theta_1, \theta_2, ..., \theta_i + \varepsilon, ...) - J(\theta_1, \theta_2, ..., \theta_i - \varepsilon, ...)}{2\varepsilon} \approx \frac{\partial J}{\partial\theta_i} = d\theta[i]$

5. compute the Euclidean distance between $d\theta_{approx}$ and $d\theta$ --> $\| d\theta_{approx} - d\theta \|_2$

6. Check the relative error using the ratio $\frac{\| d\theta_{approx} - d\theta \|_2}{\| d\theta_{approx} \|_2 + \| d\theta \|_2}$

7. If the ratio is around $10^{-7}$ or smaller, the gradient is probably correct. If it is around $10^{-5}$, there might be some minor errors, so take a closer look. If it is around $10^{-3}$ or larger, the gradient is most likely incorrect.

Don't use gradient checking in training, only use it in debugging.

When using regularization, remember to include the regularization term's derivative in back propagation.

Gradient checking does not work with dropout, so turn dropout off (keep_prob = 1) while checking.
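
The steps above can be sketched as follows, assuming the parameters are already flattened into a 1-D vector `theta`. The toy cost $J(\theta) = \sum_i \theta_i^2$ (gradient $2\theta$) and the function name `gradient_check` are illustrative stand-ins for a real network.

```python
import numpy as np

def gradient_check(J, theta, dtheta, eps=1e-7):
    """Return the relative error between dtheta and a numerical estimate."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += eps     # nudge only the i-th parameter
        theta_minus[i] -= eps
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator

def J(theta):
    return np.sum(theta ** 2)

theta = np.array([1.0, -2.0, 3.0])
good = gradient_check(J, theta, 2 * theta)   # correct analytic gradient
bad = gradient_check(J, theta, 3 * theta)    # deliberately wrong gradient
print(good, bad)   # good is tiny, bad is about 0.2
```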

## Mini-batch gradient descent

Mini-batch gradient descent divides the entire training set into several smaller training sets (mini-batches) and takes one gradient step per mini-batch, which speeds up training.

$x^{\{t\}}$ represents the $t$-th mini-batch

<img src="../image/note2/minibatch.png" style="width:70%;">  

The cost under mini-batch training oscillates more, since each iteration trains on a different mini-batch:

<img src="../image/note2/minibatchtrain.png" style="width:50%;">  


## Mini-batch Size

- If mini-batch size = m: Batch gradient descent (each iteration is too slow)

- If mini-batch size = 1: Stochastic gradient descent (too much noise, and no speedup from vectorization)

- In practice: choose a size between 1 and m, so that you keep the benefit of vectorization and still make progress without processing the whole training set for every step

#### Choose Your Mini-batch Size

small training set (m <= 2000): use batch gradient descent

Typical mini-batch sizes: 64, 128, 256, 512 (powers of 2)
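
A sketch of how the training set might be shuffled and cut into mini-batches, assuming `X` has shape `(n_x, m)` and `Y` has shape `(1, m)` with one example per column. The function name and the toy data are illustrative.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, rng=None):
    """Shuffle the columns of X and Y together, then slice them into mini-batches."""
    rng = np.random.default_rng() if rng is None else rng
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size   # the last batch may be smaller
        batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return batches

X = np.arange(2 * 150).reshape(2, 150)
Y = np.arange(150).reshape(1, 150)
batches = random_mini_batches(X, Y, batch_size=64, rng=np.random.default_rng(0))
print([x_t.shape[1] for x_t, _ in batches])   # [64, 64, 22]
```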

## Exponentially Weighted Averages
Example: smoothing a series of daily temperatures $\theta_t$ into a moving average.

Formula:

$$V_t = \beta V_{t-1} + (1-\beta)\theta_t$$

where $V_t$ is approximately the average temperature over the last $\frac{1}{1-\beta}$ days.

If $\beta = 0.9$, then this formula can compute the average over the last 10 days. 

<img src="../image/note2/average.png" style="width:70%;">  

The green line represents $\beta = 0.98$ and the red line represents $\beta = 0.9$. A larger $\beta$ gives more weight to the previous values, so the green line adapts more slowly. The red line, on the other hand, is noisier, since it averages over a much shorter window.

The weights on past values decay exponentially (the value from $k$ days back gets weight $(1-\beta)\beta^k$), and the formula simply adds up all the values multiplied by these decaying weights.

## Bias Correction

At the beginning of an exponentially weighted average, the value is underestimated since $V_0 = 0$, so we use $\frac{V_t}{1-\beta^t}$ to correct it. As $t$ grows, $\beta^t$ goes to 0 and the correction fades out.
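
A minimal sketch of the average with and without bias correction. The constant series of 40-degree days is made up so the effect of starting from $V_0 = 0$ is easy to see.

```python
def exp_weighted_average(values, beta=0.9, bias_correction=True):
    """Return the running average V_t for each t, optionally bias-corrected."""
    v = 0.0
    averages = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        averages.append(v / (1 - beta ** t) if bias_correction else v)
    return averages

temps = [40.0] * 5
raw = exp_weighted_average(temps, bias_correction=False)
corrected = exp_weighted_average(temps)
print(raw)        # starts near 4 and slowly climbs toward 40
print(corrected)  # stays near 40 from the first day
```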


## Gradient Descent with Momentum
<img src="../image/note2/momentum.png" style="width:70%;">  

With momentum, each update uses an exponentially weighted average of the previous gradients instead of only the current gradient. Thus, there are fewer oscillations.  

Think of gradient descent with momentum as a ball rolling downhill. It accelerates in the direction where the gradient is consistent, and slows down in directions where the gradient keeps flipping sign (the oscillations).
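
A sketch of the update, assuming the common form $v := \beta v + (1-\beta)\,dw$, $w := w - \alpha v$ (the formula is not written out in the text above). The two gradient sequences are made up to contrast a consistent direction with an oscillating one.

```python
def momentum_step(w, v, dw, alpha=0.1, beta=0.9):
    """One update: v = beta*v + (1-beta)*dw, then w = w - alpha*v."""
    v = beta * v + (1 - beta) * dw
    w = w - alpha * v
    return w, v

def final_velocity(gradients):
    """Run momentum over a sequence of gradients and return the last velocity."""
    w, v = 0.0, 0.0
    for dw in gradients:
        w, v = momentum_step(w, v, dw)
    return v

consistent = final_velocity([1.0] * 50)         # gradient always points the same way
oscillating = final_velocity([1.0, -1.0] * 25)  # gradient flips sign every step
print(consistent, oscillating)   # the oscillating velocity stays much smaller
```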

## RMSprop(root mean square prop)

<img src="../image/note2/RMS.png" style="width:70%;">  

RMSprop is used to reduce oscillations. Directions with consistently large gradients get a large denominator, which effectively lowers their learning rate, while directions with smaller gradients get a relatively larger learning rate.

$\epsilon$ prevents division by zero or by an extremely small value.

The key difference between momentum and RMSprop is that RMSprop averages the squared gradients, so it tracks only their magnitude, not their direction, and uses that average to scale each step.
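
A sketch of one RMSprop step, assuming the update $S := \beta S + (1-\beta)\,dw^2$, $w := w - \alpha \frac{dw}{\sqrt{S}+\epsilon}$ (consistent with the $S$ terms used in the Adam section of these notes). The gradient values are made up to show a steep and a shallow direction.

```python
import numpy as np

def rmsprop_step(w, s, dw, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update; s is the running average of squared gradients."""
    s = beta * s + (1 - beta) * dw ** 2
    w = w - alpha * dw / (np.sqrt(s) + eps)
    return w, s

w = np.zeros(2)
s = np.zeros(2)
dw = np.array([10.0, 0.1])   # steep direction vs. shallow direction

w_new, s_new = rmsprop_step(w, s, dw)
steps = w - w_new
print(steps)   # both directions move by nearly the same amount
```

Plain gradient descent with the same `alpha` would have moved 100 times further along the steep direction than along the shallow one.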

## Adam(Adaptive Moment Estimation) Optimization Algorithm

$V_{dw} = 0, S_{dw} = 0, V_{db} = 0, S_{db} = 0$

On iteration $t$:

Compute $dw, db$ using current mini-batch

$V_{dw} = \beta_1 V_{dw} + (1-\beta_1)dw,  V_{db} = \beta_1 V_{db} + (1-\beta_1)db$  <-- momentum $\beta_1$

$S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dw^2,  S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$  <-- RMSprop $\beta_2$

$V_{dw}^{corrected} = \frac{V_{dw}}{1-\beta_1^t}, V_{db}^{corrected} = \frac{V_{db}}{1-\beta_1^t}$

$S_{dw}^{corrected} = \frac{S_{dw}}{1-\beta_2^t}, S_{db}^{corrected} = \frac{S_{db}}{1-\beta_2^t}$

$w := w - \alpha\frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}}+\epsilon}$

$b := b - \alpha\frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}$


#### Hyperparameter choices
- $\alpha$ : needs to be tuned
- $\beta_1$ : 0.9  --> first moment
- $\beta_2$ : 0.999  --> second moment
- $\epsilon$ : $10^{-8}$
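
A sketch of the Adam update for a single parameter array `w` (the update for `b` has the same form). The `state` dictionary, the toy cost $J(w) = (w-3)^2$, and the choice $\alpha = 0.1$ are illustrative assumptions.

```python
import numpy as np

def adam_step(w, dw, state, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the step count t and the averages v, s."""
    state["t"] += 1
    t = state["t"]
    state["v"] = beta1 * state["v"] + (1 - beta1) * dw         # momentum term
    state["s"] = beta2 * state["s"] + (1 - beta2) * dw ** 2    # RMSprop term
    v_corrected = state["v"] / (1 - beta1 ** t)                # bias correction
    s_corrected = state["s"] / (1 - beta2 ** t)
    return w - alpha * v_corrected / (np.sqrt(s_corrected) + eps)

w = np.array([0.0])
state = {"t": 0, "v": np.zeros_like(w), "s": np.zeros_like(w)}
history = []
for _ in range(1000):
    dw = 2 * (w - 3.0)            # gradient of the toy cost (w - 3)^2
    w = adam_step(w, dw, state, alpha=0.1)
    history.append(w.copy())

print(history[0], w)   # the first step has size about alpha; w ends near 3
```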

## Learning Rate Decay
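A fixed value of $\alpha$ may not converge very well: with noisy mini-batch gradients the steps keep wandering around the minimum. Decaying $\alpha$ over time makes the later steps smaller, so training ends up oscillating in a tighter region around the minimum.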

1 epoch = 1 pass through all data

$\alpha = \frac{1}{1+\text{decayrate} * epoch_{num}} \alpha_0$

$\alpha = 0.95^{epoch_{num}} * \alpha_0$  <-- exponential decay

$\alpha = \frac{k}{\sqrt{epoch_{num}}} * \alpha_0$

or a discrete staircase (e.g. reduce $\alpha$ by a fixed factor every few epochs)
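
The schedules above can be sketched as plain functions of the epoch number. The function names, the staircase parameters (`drop`, `epochs_per_drop`), and the sample values are illustrative assumptions.

```python
import math

def inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base**epoch_num * alpha0."""
    return base ** epoch_num * alpha0

def sqrt_decay(alpha0, epoch_num, k=1.0):
    """alpha = k / sqrt(epoch_num) * alpha0, defined for epoch_num >= 1."""
    return k / math.sqrt(epoch_num) * alpha0

def staircase_decay(alpha0, epoch_num, drop=0.5, epochs_per_drop=10):
    """Multiply alpha0 by `drop` once every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch_num // epochs_per_drop)

for epoch in (1, 2, 3, 4):
    print(epoch, inverse_decay(0.2, decay_rate=1.0, epoch_num=epoch))
```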