# Practical aspects of Deep Learning

- Recall that different types of initializations lead to different results
- Recognize the importance of initialization in complex neural networks.
- Recognize the difference between train/dev/test sets
- Diagnose the bias and variance issues in your model
- Learn when and how to use regularization methods such as dropout or L2 regularization.
- Understand experimental issues in deep learning, such as vanishing or exploding gradients, and learn how to deal with them
- Use gradient checking to verify the correctness of your backpropagation implementation

## Setting up your Machine Learning Application 

### Train / Dev / Test sets

**Applied machine learning is a highly iterative process**

- number of layers
- number of hidden units
- learning rate
- activation function
- ...

It is impossible to guess the right hyperparameters on the first try. By training and evaluating the model frequently and tuning the hyperparameters, we can arrive at the hyperparameters that give the best performance.

**Train / Dev / Test sets**
- Split the dataset into a train set, a cross-validation (dev) set, and a test set. The train set is used to train the model, and the dev set is used to evaluate the algorithm and tune hyperparameters. Once we have settled on the final model, we evaluate it on the test set.
- Previous era (100, 1,000, or 10,000 examples): 70/30 (train/test) or 60/20/20 (train/dev/test)
- Big data era (> 1,000,000 examples): 10,000 dev and 10,000 test examples are enough $\Rightarrow$ 98/1/1 or 99.5/0.25-0.4/0.1-0.25 (train/dev/test)
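
A minimal sketch of such a split, assuming the examples are stored as rows of a NumPy array. The helper name `split_dataset` and its parameters are hypothetical, not from any library:

```python
import numpy as np

def split_dataset(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the examples and split them into train/dev/test.

    Assumes X has shape (m, n_features) and y has shape (m,).
    """
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)              # random order of example indices
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]      # everything left goes to training
    return ((X[train_idx], y[train_idx]),
            (X[dev_idx], y[dev_idx]),
            (X[test_idx], y[test_idx]))

# Small-data example: a 60/20/20 split of 1,000 examples
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)
train, dev, test = split_dataset(X, y, dev_frac=0.2, test_frac=0.2)
print(train[0].shape, dev[0].shape, test[0].shape)  # (600, 2) (200, 2) (200, 2)
```

For a data set with more than a million examples, the default fractions of 0.01 correspond to the 98/1/1 split above.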

**Mismatched train/test distribution**

For example, suppose we are building an application that lets users upload pictures and recognizes cats in those pictures.
- We build the train set from cat pictures on web pages, which have high quality and resolution
- The dev and test sets are pictures uploaded by users, which have low quality and resolution (maybe blurry, etc.)

$\Rightarrow$ Make sure the dev and test sets come from the same distribution

Not having the 

### Bias / Variance

- High bias: underfitting
- High variance: overfitting
- Human-level error approximates the optimal (Bayes) error. For a very blurry picture that even a human cannot recognize, the Bayes error is very high; for a high-quality picture it is nearly zero.
- Compare the train set error with the Bayes error to detect a high bias problem
- Compare the dev/test set error with the train set error to detect a high variance problem
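
The two comparisons above can be sketched as a small helper. The function name `diagnose`, the threshold `tol`, and the error values are illustrative assumptions, not fixed rules:

```python
def diagnose(bayes_error, train_error, dev_error, tol=0.02):
    """Flag bias/variance problems from error rates (given as fractions).

    tol is a hypothetical threshold for what counts as a "large" gap.
    """
    avoidable_bias = train_error - bayes_error  # gap to the optimal error
    variance = dev_error - train_error          # gap between train and dev
    issues = []
    if avoidable_bias > tol:
        issues.append("high bias")
    if variance > tol:
        issues.append("high variance")
    return issues or ["ok"]

# Illustrative error rates, assuming Bayes error is about 0%
print(diagnose(0.0, 0.01, 0.11))   # ['high variance']
print(diagnose(0.0, 0.15, 0.16))   # ['high bias']
print(diagnose(0.0, 0.15, 0.30))   # ['high bias', 'high variance']
print(diagnose(0.0, 0.005, 0.01))  # ['ok']
```

If the Bayes error were, say, 15% (very blurry pictures), a 15% train error would no longer count as high bias.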


### Basic Recipe for Machine Learning

**Check high bias? (train set performance)**
- _Bigger network_
- Train longer (more iterations)
- New NN architecture

**Check high variance? (dev/test set performance)**
- _More data_
- Regularization
- New NN architecture

**Bias vs. Variance trade off**
- Getting more data reduces variance and will not hurt bias
- Training a bigger network reduces bias and will not hurt variance, as long as it is properly regularized




## Regularizing your neural network

### Regularization

**Logistic Regression**

- We try to minimize:
$$
\arg\min_{w, b} \ J(w, b) \quad \text{where} \ w \in \mathbb{R}^{n_x}, \ b \in \mathbb{R}
$$

with:

$$
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})
$$

We add a regularization term to penalize large weights:

$$
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \|w\|^2_2
$$

where the L2 norm of w is:

$$
\|w\|^2_2 = \sum_{j=1}^{n_x} w_j^2 = w^Tw
$$

$\Rightarrow$ This is called L2 regularization

- We can instead use L1 regularization, which penalizes the L1 norm of $w$ and makes $w$ sparse (many entries become zero). In practice it usually does not help much, and L2 regularization is used much more often:

$$
\frac{\lambda}{m} \|w\|_1 = \frac{\lambda}{m}\sum_{j=1}^{n_x}|w_j|
$$

- $\lambda$ is called the regularization parameter
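
A minimal NumPy sketch of the L2-regularized cost, assuming $\mathcal{L}$ is the binary cross-entropy loss and that examples are stored as columns of `X`. The names `sigmoid`, `cost_l2`, and `lambd` are illustrative (`lambd` avoids the Python keyword `lambda`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_l2(w, b, X, Y, lambd):
    """L2-regularized logistic regression cost.

    Assumed shapes: X (n_x, m), Y (1, m), w (n_x, 1), b scalar.
    """
    m = X.shape[1]
    y_hat = sigmoid(w.T @ X + b)                          # predictions, shape (1, m)
    # Assumption: the loss L is binary cross-entropy
    cross_entropy = -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))  # (lambda / 2m) * ||w||_2^2
    return cross_entropy + l2_penalty

# Toy data, only to show the call
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
Y = np.array([[1, 0, 1, 0, 1]])
w = rng.standard_normal((3, 1))
b = 0.0
print(cost_l2(w, b, X, Y, lambd=0.0))  # unregularized cost
print(cost_l2(w, b, X, Y, lambd=0.7))  # larger by (lambda / 2m) * ||w||_2^2
```

Note that only $w$ is penalized here, matching the formula above; $b$ is left out of the regularization term.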

**Neural Network**

- We have a cost function over $L$ layers and $m$ samples in the data set, with an additional regularization term to penalize the weight matrices:

$$
J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, ..., w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L \|W^{[l]}\|^2_F
$$

where the Frobenius norm (L2) of matrix $W^{[l]}$ is:

$$
\|W^{[l]}\|^2_F = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(W_{ij}^{[l]})^2 \quad \text{because } W^{[l]} \text{ is an } n^{[l]} \times n^{[l-1]} \text{ matrix}
$$
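
As a rough sketch, the regularization term can be computed by summing the squared entries of every weight matrix. The helper name `l2_penalty` and the toy matrices are assumptions for illustration:

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Return (lambda / 2m) * sum over layers of the squared Frobenius norm.

    weights is a list of the W^{[l]} matrices.
    """
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

# Toy weights for a 2-layer network: W1 is (3, 2), W2 is (1, 3)
W1 = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, -1.0]])
W2 = np.array([[0.5, -0.5, 1.0]])

# The sum of squared entries equals the squared Frobenius norm
print(np.isclose(np.sum(np.square(W1)), np.linalg.norm(W1, "fro") ** 2))  # True
print(l2_penalty([W1, W2], lambd=0.1, m=10))
```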

- Gradient descent with regularization

$$
dw^{[l]} = \frac{\partial{J}}{\partial{W^{[l]}}} = \text{(backprop)} + \frac{\lambda}{m} W^{[l]}
$$

- Update parameters with regularization

$$
\begin{aligned}
W^{[l]} &= W^{[l]} - \alpha \, dw^{[l]} \\
&= W^{[l]} - \alpha \left(\text{(backprop)} + \frac{\lambda}{m} W^{[l]}\right) \\
&= W^{[l]} - \frac{\lambda\alpha}{m} W^{[l]} - \alpha \, \text{(backprop)} \\
&= W^{[l]}\left(1 - \frac{\lambda\alpha}{m}\right) - \alpha \, \text{(backprop)}
\end{aligned}
$$

This is called "weight decay" because each update multiplies $W^{[l]}$ by $(1 - \frac{\lambda\alpha}{m})$, a factor slightly smaller than 1
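
A small numerical check that the two forms of the update agree. The function names are hypothetical, and `dW_backprop` is a random stand-in for the gradient that backpropagation would produce:

```python
import numpy as np

def update_with_l2(W, dW_backprop, alpha, lambd, m):
    """Gradient step using dW = (backprop) + (lambda / m) * W."""
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW

def update_weight_decay(W, dW_backprop, alpha, lambd, m):
    """Same step written as weight decay: shrink W, then apply the backprop term."""
    return (1 - alpha * lambd / m) * W - alpha * dW_backprop

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
dW_backprop = rng.standard_normal((4, 3))  # placeholder gradient, not real backprop

W_a = update_with_l2(W, dW_backprop, alpha=0.1, lambd=0.7, m=50)
W_b = update_weight_decay(W, dW_backprop, alpha=0.1, lambd=0.7, m=50)
print(np.allclose(W_a, W_b))  # True
```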



### Why regularization reduces overfitting?
- Cost function:

$$
J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, ..., w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L \|W^{[l]}\|^2_F
$$

- With a very large $\lambda$, the weights $W^{[l]}$ are pushed close to zero, which minimizes the impact of many hidden units and gives an effectively simpler network
- If we use $\tanh$ as the activation function, $g(z) = \tanh(z)$, then for $z$ close to zero $g(z)$ is nearly linear
- When we increase $\lambda$, $W$ shrinks toward zero. Since $z = Wx + b$, a smaller $W$ gives a smaller $z$, so $g(z)$ stays in its nearly linear range
- Every layer's activation is then nearly linear, so the deep network behaves almost like a linear function and cannot fit very complicated decision boundaries
- When debugging gradient descent, plot the cost function $J$ including the regularization term; otherwise $J$ may not decrease monotonically
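
A quick numerical illustration of the near-linearity of $\tanh$ around zero. The ranges $[-0.1, 0.1]$ and $[-3, 3]$ are arbitrary choices for the sketch:

```python
import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)  # small pre-activations, as with strongly shrunk W
z_large = np.linspace(-3.0, 3.0, 5)  # larger pre-activations

# Largest deviation of tanh(z) from the linear function g(z) = z
err_small = np.max(np.abs(np.tanh(z_small) - z_small))
err_large = np.max(np.abs(np.tanh(z_large) - z_large))

print(err_small)  # tiny: tanh is almost linear here
print(err_large)  # large: tanh saturates and is clearly nonlinear
```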

### Dropout Regularization

- Go through the layers of the network
- Set a probability of eliminating each node (e.g. keep 80%, drop 20%)
- This leaves a smaller network to train

**Implementing dropout (Inverted dropout)**

with layer $l=3$

```python
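import numpy as np

# a3: activations of layer 3, computed earlier in forward propagation
keep_prob = 0.8                             # probability of keeping a unit
d3 = np.random.rand(*a3.shape) < keep_prob  # boolean mask: True = keep the unit
a3 = a3 * d3                                # zero out the dropped units
a3 = a3 / keep_prob                         # rescale so the expected value of a3 (and of z4) is unchanged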
```
- Do not apply dropout at test time
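
The train/test distinction can be sketched as one function with a `training` flag. The function name `dropout_forward` and its signature are assumptions for illustration, not part of NumPy:

```python
import numpy as np

def dropout_forward(a, keep_prob, training, rng):
    """Inverted dropout: mask and rescale during training, do nothing at test time."""
    if not training:
        return a                             # no dropout and no scaling at test time
    mask = rng.random(a.shape) < keep_prob   # True = keep the unit
    return a * mask / keep_prob              # rescale to keep the expected value

rng = np.random.default_rng(0)
a3 = np.ones((50, 1000))                     # toy activations, all equal to 1

a3_train = dropout_forward(a3, keep_prob=0.8, training=True, rng=rng)
a3_test = dropout_forward(a3, keep_prob=0.8, training=False, rng=rng)

print(a3_train.mean())                # close to 1.0, thanks to the division by keep_prob
print(np.array_equal(a3_test, a3))    # True
```

Because the rescaling happens during training, no extra scaling factor is needed at test time.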

### Understanding Dropout

### Other regularization methods

## Setting up your optimization problem

### Normalizing inputs

### Vanishing / Exploding gradients

### Weight initialization for deep network

### Numerical approximation of gradients

### Gradient checking

### Gradient checking implementation notes