(PAofDL)=
# 5 Practical Aspects of Deep Learning

Discover and experiment with a variety of different initialization methods, apply L2 regularization and dropout to avoid model overfitting, then apply gradient checking to identify errors in a fraud detection model.

**Learning Objectives**

- Give examples of how different types of initializations can lead to different results

- Examine the importance of initialization in complex neural networks

- Explain the difference between train/dev/test sets

- Diagnose the bias and variance issues in your model

- Assess the right time and place for using regularization methods such as dropout or L2 regularization

- Explain vanishing and exploding gradients and how to deal with them

- Use gradient checking to verify the accuracy of your backpropagation implementation

- Apply zeros initialization, random initialization, and He initialization

- Apply regularization to a deep learning model
---

## Setting up your Machine Learning Application

### Train / Dev / Test Sets

[Video](https://youtu.be/AzaEKXlZ4cM)

The main points are:

1. In the modern big data era, where you might have a million examples in total, the trend is for your **dev and test sets** to be a **much smaller percentage** of the total. Remember, the goal of the dev set (development set) is to test different algorithms on it and see which one works better. So the dev set just needs to be big enough for you to evaluate, say, two or ten different algorithm choices and quickly decide which one is doing better. You might not need a whole 20\% of your data for that.
	
	- For example, if you have a million examples, you might decide that 10,000 examples in your dev set is more than enough to evaluate which of one or two algorithms does better. In a similar vein, the main goal of your test set is, given your final classifier, to give you a pretty confident estimate of how well it's doing. Again, if you have a million examples, you might decide that 10,000 examples is more than enough to evaluate a single classifier and give you a good estimate of how well it's doing. So, in this example, if you need just 10,000 examples for your dev set and 10,000 for your test set, each is 1\% of 1 million, so you'll have 98\% train, 1\% dev, 1\% test. I've also seen applications where, with even more than a million examples, you might end up with 99.5\% train, 0.25\% dev and 0.25\% test, or maybe 0.4\% dev and 0.1\% test.

	- When setting up your machine learning problem, I'll often split the data into train, dev and test sets, and if you have a relatively small dataset, the traditional ratios (such as 60\%/20\%/20\%) might be okay. But if you have a much larger dataset, it's also fine to set your dev and test sets to be much smaller than 20\% or even 10\% of your data.

2. **Make sure the dev and test sets come from the same distribution**.

3. Not having a test set might be fine.
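
The 98/1/1 split described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than part of the course material: the function name `split_indices` and its default fractions are assumptions chosen to match the million-example case. Shuffling before cutting is one simple way to make the dev and test sets come from the same distribution.

```python
import numpy as np

def split_indices(m, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the indices 0..m-1 and cut them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev = idx[:n_dev]
    test = idx[n_dev:n_dev + n_test]
    train = idx[n_dev + n_test:]
    return train, dev, test

train_idx, dev_idx, test_idx = split_indices(1_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 980000 10000 10000
```

The returned index arrays can then be used to slice the feature and label arrays consistently.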


### Bias / Variance

[Video](https://youtu.be/2hIwfc5ak_w)

- **Aim**: in deep learning, there is less of a bias-variance trade-off.

```{figure} images/5-1.png
---
height: 200px
name: 5-1
---
```

```{list-table}
:header-rows: 0

* - **Train set error**
  - 1\%
  - 15\%
  - 15\%
  - 0.5\%
* - **Dev set error**
  - 11\%
  - 16\%
  - 30\%
  - 1\%
* - **Comments**
  - high variance
  - high bias
  - high bias and high variance
  - low bias and low variance

```

This analysis is predicated on the assumption that human-level performance gets nearly 0\% error or, more generally, that the optimal error, sometimes called the Bayes error (the Bayesian optimal error), is nearly 0\%.

> For example, if the optimal error or the Bayes error were much higher, say 15\%, then for the second column in the table above, a training error of 15\% is actually perfectly reasonable: you wouldn't call that high bias, and the variance is also pretty low.
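
The reading of the table above can be sketched as a small helper. This is only an illustration: the function name `diagnose` and the 5-point `threshold` are assumptions, not values from the course, and the right cut-off depends on your application.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, threshold=0.05):
    """Label a model from its error rates (all given as fractions).

    threshold is an illustrative cut-off, not a value from the course.
    """
    avoidable_bias = train_err - bayes_err   # gap to the optimal error
    variance = dev_err - train_err           # gap between dev and train
    high_bias = avoidable_bias > threshold
    high_variance = variance > threshold
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

# The four columns of the table, assuming a Bayes error of 0.
for train_err, dev_err in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]:
    print(train_err, dev_err, "->", diagnose(train_err, dev_err))

# With a Bayes error of 15%, the second column no longer counts as high bias.
print(diagnose(0.15, 0.16, bayes_err=0.15))
```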

```{figure} images/5-2.png
---
height: 200px
name: 5-2
---
```

The classifier in the figure has high bias because, being a mostly linear classifier, it does not fit the quadratic shape of the data that well. But by having too much flexibility in the middle, it also overfits those two examples in the middle. So this classifier has high bias, because it is mostly linear where you need a curved or quadratic function, and it has high variance, because it has too much flexibility and fits those two mislabeled or outlying examples in the middle as well. This example is a little contrived in two dimensions, but with very high dimensional inputs you actually do get classifiers with high bias in some regions and high variance in other regions, and there such classifiers seem less contrived.

```{admonition} Summary
:class: warning
You've seen how by looking at your algorithm's error on the training set and your algorithm's error on the dev set, you can try to diagnose, whether it has problems of high bias or high variance, or maybe both, or maybe neither. And depending on whether your algorithm suffers from bias or variance, it turns out that there are different things you could try.
```


### Basic Recipe for Machine Learning

[Video](https://youtu.be/zIxFN41JEyY)

```{figure} images/5-3.png
---
height: 360px
name: 5-3
---
```

>  `find a new NN architecture` is put in parentheses because it is one of those things that you just have to try: maybe you can make it work, maybe not. Whereas getting a bigger network almost always helps, and training longer doesn't always help, but it certainly never hurts. So when training a learning algorithm, I would try these things until I can at least get rid of the bias problems.

```{admonition} Notice
:class: hint

1. Depending on whether you have high bias or high variance, the set of things you should try could be quite different. So I'll usually use the training and dev sets to diagnose whether you have a bias or variance problem, and then use that to select the appropriate subset of things to try. For example, if you actually have a high bias problem, getting more training data is not going to help, or at least it's not the most efficient thing to do. So being clear on how much of a bias problem, variance problem, or both you have can help you focus on the most useful things to try.

2. In the earlier era of machine learning, there used to be a lot of discussion of what is called the bias-variance tradeoff. The reason was that, for a lot of the things you could try, you could increase bias and reduce variance, or reduce bias and increase variance. Back in the pre-deep learning era, we didn't have as many tools that just reduce bias, or just reduce variance, without hurting the other one. But in the modern deep learning, big data era, so long as you can keep training a bigger network and keep getting more data (which isn't always the case for either of these), getting a bigger network almost always just reduces your bias without necessarily hurting your variance, so long as you regularize appropriately. And getting more data pretty much always reduces your variance and doesn't hurt your bias much. So with these two steps, training a bigger network or getting more data, we now have tools to drive down bias and just bias, or drive down variance and just variance, without really hurting the other thing that much. I think this has been one of the big reasons that deep learning has been so useful for supervised learning: there's much less of this tradeoff where you have to carefully balance bias and variance, and you have more options for reducing one without necessarily increasing the other.

3. So long as you have a well-regularized network, training a bigger network almost never hurts. The main cost of training a neural network that's too big is just computational time.
```

## Regularizing your Neural Network

The second section introduces regularization, a very useful technique for reducing variance, and helps you understand how to apply regularization to a neural network.

### Regularization

[Video](https://youtu.be/ZlrNwgvycNw)

If you suspect your neural network is overfitting your data, that is, you have a high variance problem:

- One of the first things you should try is probably **regularization**.
- The other way to address high variance is to **get more training data**, which is also quite reliable. But you can't always get more training data, or it could be expensive to get more data.

Adding regularization will often help to prevent overfitting, or to reduce variance in your network.

Let's develop these ideas using logistic regression. Recall that for logistic regression, you try to minimize the cost function $J$, $\min_{w,b} \ \boldsymbol{J}(w,b)$, which is defined as:

$$\boldsymbol{J}(w,b) = \dfrac{1}{m} \sum_{i=1}^{m}  \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) , \ w \in  \mathbb{R}^{n_x}, \ b \in \mathbb{R}$$

After adding the regularization term:

$$\boldsymbol{J}(w,b) = \dfrac{1}{m} \sum_{i=1}^{m}  \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m} \lVert w \rVert_2^2 + \dfrac{\lambda}{2m} b^2$$

where $\lambda$ is the regularization parameter. Usually, you set this using your development set, or using hold-out cross validation: you try a variety of values and see which does best, trading off doing well on your training set against keeping the L2 norm of your parameters small, which helps prevent overfitting. So $\lambda$ is another hyperparameter that you might have to tune.

```{admonition} Notice
:class: hint

$$\boldsymbol{J}(w,b) = \dfrac{1}{m} \sum_{i=1}^{m}  \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m} \lVert w \rVert_2^2$$

The final regularization term for $b$ ($\dfrac{\lambda}{2m} b^2$) in the equation above can usually be omitted. If you look at your parameters, $w$ is usually a pretty high dimensional parameter vector, especially with a high variance problem: $w$ has a lot of parameters, so you aren't fitting all of them well, whereas $b$ is just a single number. So almost all the parameters are in $w$ rather than $b$. If you add this last term, in practice it won't make much of a difference, because $b$ is just one parameter among a very large number of parameters. In practice, we usually just don't bother to include it. But you can if you want.
```
- L2 regularization (the squared Euclidean norm): $\lVert w \rVert_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^{T}w$
- L1 regularization: $\dfrac{\lambda}{2m} \sum_{j=1}^{n_x} \lvert w_j \rvert = \dfrac{\lambda}{2m}\lVert w \rVert_1$

L2 regularization is the most common type of regularization. L1 regularization is used in some situations.
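
As a rough sketch of the two penalties, the regularized logistic regression cost could be computed as follows. The function name `regularized_cost` is hypothetical, the per-example loss is assumed to be the usual cross-entropy, and the term for $b$ is omitted.

```python
import numpy as np

def regularized_cost(y_hat, y, w, lambd, norm="l2"):
    """Cross-entropy cost plus an L2 or L1 penalty on w (b is not penalized)."""
    m = y.shape[0]
    # Assumption: the loss L(y_hat, y) is the binary cross-entropy.
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    if norm == "l2":
        penalty = lambd / (2 * m) * np.sum(w ** 2)      # (lambda / 2m) * ||w||_2^2
    else:
        penalty = lambd / (2 * m) * np.sum(np.abs(w))   # (lambda / 2m) * ||w||_1
    return cross_entropy + penalty

y = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
w = np.array([0.5, -1.0, 2.0])
print(regularized_cost(y_hat, y, w, lambd=0.0))             # unregularized cost
print(regularized_cost(y_hat, y, w, lambd=0.7))             # with L2 penalty
print(regularized_cost(y_hat, y, w, lambd=0.7, norm="l1"))  # with L1 penalty
```

Setting `lambd=0.0` recovers the unregularized cost, which makes it easy to see how much the penalty adds.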

```{admonition} Notice
:class: hint
If you use L1 regularization, then $w$ will end up being **sparse**, which means that the $w$ vector will **have a lot of zeros in it**. Some people say that this can help with compressing the model, because if some of the parameters are zero, you need less memory to store the model. Although, I find that, in practice, L1 regularization to make your model sparse helps only a little bit. So I don't think it's used that much, at least not for the purpose of compressing your model. And when people train neural networks, L2 regularization is just used much, much more often.
```

L2 regularization for a neural network:

$$\boldsymbol{J}(w^{[1]}, b^{[1]}, \cdots, w^{[L]}, b^{[L]}) = \dfrac{1}{m} \sum_{i=1}^{m}  \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m} \sum_{l=1}^L \lVert w^{[l]} \rVert_F^2$$

As in the logistic regression case, the regularization term for the biases is usually omitted. The term $\lVert \bullet \rVert_F^2$ is the squared **Frobenius norm**. Since the dimension of $w^{[l]}$ is $(n^{[l]}, n^{[l-1]})$, the Frobenius norm formula of a matrix is:

$$\lVert w^{[l]} \rVert_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{i,j}^{[l]})^2$$

The squared Frobenius norm is just **the sum of the squares of the elements of a matrix**.

The row index $i$ runs over the $n^{[l]}$ neurons in the current layer.

The column index $j$ runs over the $n^{[l-1]}$ neurons in the previous layer.
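
The regularization term can be sketched in NumPy as below. The helper name `l2_regularization_term` and the layer sizes are made up for illustration. The check against `np.linalg.norm(W, "fro")` confirms that the element-wise sum of squares is the squared Frobenius norm.

```python
import numpy as np

def l2_regularization_term(weights, lambd, m):
    """(lambda / 2m) * sum over layers of the squared Frobenius norm."""
    return lambd / (2 * m) * sum(np.sum(W ** 2) for W in weights)

rng = np.random.default_rng(0)
layer_dims = [4, 3, 2, 1]          # n^[0], n^[1], n^[2], n^[3] (illustrative)
weights = [rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
           for l in range(1, len(layer_dims))]

# np.linalg.norm(W, "fro") is the (unsquared) Frobenius norm, so squaring it
# should reproduce the element-wise sum of squares.
for W in weights:
    assert np.isclose(np.sum(W ** 2), np.linalg.norm(W, "fro") ** 2)

print(l2_regularization_term(weights, lambd=0.1, m=100))
```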

To implement gradient descent, backpropagation calculates $\mathrm{d}w^{[l]}$ as:

$$\mathrm{d}w^{[l]} = (\text{term calculated from backpropagation}) + \dfrac{\lambda}{m} w^{[l]}$$

where $\mathrm{d}w^{[l]} = \dfrac{\partial J}{\partial w^{[l]}}$.

With this new definition, $\mathrm{d}w^{[l]}$ is still the correct derivative of your cost function with respect to your parameters, now that you've added the extra regularization term at the end.

Then put this $\mathrm{d}w^{[l]}$ into the update step:

$$w^{[l]} := w^{[l]} - \alpha \mathrm{d}w^{[l]}$$

L2 regularization is sometimes also called 'weight decay'. The reason is:

$$
\begin{aligned}
w^{[l]} 
&:= w^{[l]} - \alpha \left[ (\text{term calculated from backpropagation}) + \dfrac{\lambda}{m} w^{[l]} \right] \\
&= w^{[l]} - \dfrac{\alpha \lambda}{m} w^{[l]} - \alpha (\text{term calculated from backpropagation}) \\
&= \underbrace{\left(1 - \dfrac{\alpha \lambda}{m}\right)}_{<1} w^{[l]} - \alpha (\text{term calculated from backpropagation})
\end{aligned}
$$

The first term shows that whatever the matrix $w^{[l]}$ is, you make it a little bit smaller: you multiply the matrix by the number $(1 - \frac{\alpha \lambda}{m})$, which is a little bit less than 1.
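
The equivalence of the two forms of the update can be checked numerically. In this sketch the array `dW_backprop` is just random numbers standing in for the term calculated from backpropagation, and the values of `alpha`, `lambd` and `m` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
dW_backprop = rng.standard_normal((3, 4))   # stand-in for the backprop term
alpha, lambd, m = 0.1, 0.7, 50              # arbitrary illustrative values

# Form 1: add the regularization gradient, then take a gradient step.
dW = dW_backprop + (lambd / m) * W
W_step = W - alpha * dW

# Form 2: shrink W first ("weight decay"), then apply the backprop term.
W_decay = (1 - alpha * lambd / m) * W - alpha * dW_backprop

print(np.allclose(W_step, W_decay))  # True
```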



### Why Regularization Reduces Overfitting?

[Video](https://youtu.be/E-e1LRzRz0o)

**Regularization adds an extra term that penalizes the weight matrices for being too large.**

1. As $\lambda$ increases, the matrices $W$ are pushed reasonably close to zero.

	- One piece of intuition is that it may set the weights so close to zero for a lot of hidden units that it basically zeroes out much of the impact of these hidden units. The much simplified neural network then behaves like a much smaller neural network. This takes you from the overfitting case much closer to the high bias case. But, hopefully, there is an intermediate value of $\lambda$ that gives a result closer to the "just right" case in the middle.
	- However, the intuition of completely zeroing out a bunch of hidden units isn't quite right. What actually happens is that the network still uses all the hidden units, but each of them has a much smaller effect. You do end up with a simpler network, as if you had a smaller network, which is therefore less prone to overfitting.

2. I'm going to assume that we're using the $\text{tanh}()$ activation function:

```{figure} images/5-4.png
---
height: 250px
name: 5-4
---
$g(z) = \text{tanh}(z)$
```

Notice that so long as $z$ is quite small, that is, if $z$ takes on only a smallish range of values, around the middle range shown in the figure, then you're just using the linear regime of the $\text{tanh}$ function. Only if $z$ is allowed to take larger or smaller values does the activation function start to become less linear.

- If $\lambda$, the regularization parameter, is large, then the parameters will be relatively small, because they are penalized for being large in the cost function.

- And so if the weights $W$ are small, then $z$ will also be relatively small.
	
$$z^{[l]} = w^{[l]}a^{[l-1]} + b^{[l]}$$
	
- In particular, if $z$ ends up taking relatively small values, just in this little range, then $g(z)$ will be roughly linear. So if every layer is roughly linear, then the whole network is just a roughly linear network.

- And even a very deep network with a linear activation function is only able to compute a linear function. So it's not able to fit very complicated, very non-linear decision boundaries, and so it is also much less able to overfit.
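
A quick numerical look at the linear regime of tanh, as a sketch: the two ranges of `z` below are arbitrary choices, one small and one large.

```python
import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)   # small pre-activations
z_large = np.linspace(-3.0, 3.0, 5)   # large pre-activations

# Largest gap between tanh(z) and the straight line g(z) = z on each range.
gap_small = np.max(np.abs(np.tanh(z_small) - z_small))
gap_large = np.max(np.abs(np.tanh(z_large) - z_large))
print(gap_small, gap_large)
```

On the small range the gap stays below one thousandth, so tanh is practically the identity there. On the large range the gap is about 2, because tanh saturates near 1 while $z$ keeps growing.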


### Dropout Regularization

[Video](https://youtu.be/62AiUPL_g18)

**Dropout**: Go through each of the layers of the network and set some probability of eliminating each node in the neural network.

```{figure} images/dropout1_kiank.gif
---
height: 360px
name: 5-5
---
```

```{figure} images/dropout2_kiank.gif
---
height: 360px
name: 5-6
---
```

- Implementing dropout in Python: `keep_prob` is a number giving the probability that a given hidden unit will be kept. For example, if `keep_prob = 0.8`, there is a 0.2 chance of eliminating any hidden unit. The implementation generates a random matrix with the same shape as the layer's activations and keeps the units whose random value is below `keep_prob`.

- Making predictions at test time: do **NOT** use dropout at test time. When you are making predictions, you don't want your output to be random; implementing dropout at test time just adds noise to your predictions. With inverted dropout, where the kept activations are divided by `keep_prob` during training, you also don't need an extra scaling parameter at test time.
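
A minimal sketch of the forward pass with inverted dropout, assuming the activations of one layer are stored in an array `a`. The function name `dropout_forward` and the `training` flag are illustrative choices, not the course's reference implementation.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Apply inverted dropout to the activations a of one layer."""
    if not training:
        return a                                  # test time: no dropout, no scaling
    mask = rng.random(a.shape) < keep_prob        # True with probability keep_prob
    return a * mask / keep_prob                   # rescale to keep the expected value

rng = np.random.default_rng(0)
a = np.ones((200, 500))
a_train = dropout_forward(a, keep_prob=0.8, rng=rng)
a_test = dropout_forward(a, keep_prob=0.8, rng=rng, training=False)

print((a_train == 0).mean())   # fraction of dropped units, close to 0.2
print(a_train.mean())          # close to 1.0, the mean before dropout
```

Dividing by `keep_prob` during training is what lets the test-time path return the activations unchanged. A full implementation would also store `mask` so that the same units are dropped in the backward pass.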


### Understanding Dropout

[Video](https://youtu.be/mcNkV_hFoY8)

Intuition: 

1. Dropout randomly knocks out units in your network. So it's as if on every iteration you're working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect.

2. A unit cannot rely on any one feature (because any one feature, that is, any one of its own inputs, could go away at random), so it has to spread out its weights (this tends to have the effect of shrinking the squared norm of the weights).

```{important} The thing to remember is that 
dropout is a regularization technique; it helps prevent overfitting.
```

### Other Regularization Methods

[Video](https://youtu.be/ZMvhY3nSilg)

- Data augmentation

- Early stopping

- Please find more practical exercises in [Practical 6: Regularization](C5_Regularization).


## Setting up your Optimization Problem

The third section will introduce some techniques for setting up your optimization problem to make your training go quickly.

### Normalizing Inputs

[Video](https://youtu.be/bxMVspfz2I4)


When training a neural network, one of the techniques to speed up your training is to normalize your inputs.

#### Normalizing training sets

Consider a training set with two input features.

$$\mathbf{x} = \left[ 
\begin{matrix}
x_1 \\
x_2
\end{matrix}
\right]$$

The input features $\mathbf{x}$ are two-dimensional and here's a scatter plot of your training set.

```{figure} images/5-5.png
---
height: 200px
---
```

Normalizing your inputs corresponds to two steps:

1. **Subtract mean**: 

$$\begin{aligned}
\mu &= \dfrac{1}{m} \sum_{i=1}^m x^{(i)} \\
x &= x-\mu
\end{aligned}$$

This step means that you just move the training set until it has zero mean.

2. **Normalize variance**: 

$$\begin{aligned}
\sigma^2 &= \dfrac{1}{m} \sum_{i=1}^m \left(x^{(i)}\right)^2 \\
x &= x/\sigma
\end{aligned}$$

> Note: the square is element-wise and is taken after the mean has been subtracted. $\sigma^2$ is a vector with the variances of each of the features, and the division by the standard deviation $\sigma$ is element-wise too.

After these two steps, the variances of $x_1$ and $x_2$ are both equal to one.

```{hint} If you use this to scale your training data, 
then use the same $\mu$ and $\sigma$ to normalize your test set.

In particular, you don't want to normalize the training set and the test set differently. Whatever the values of $\mu$ and $\sigma^2$ are, use them in the two formulas above so that you scale your test set in exactly the same way, rather than estimating $\mu$ and $\sigma^2$ separately on your training set and test set. You want both training and test examples to go through the same transformation, defined by the same $\mu$ and $\sigma^2$ calculated on your training data.
```
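
The two normalization steps, and the reuse of the training statistics on the test set, might look like this in NumPy. The helper names `fit_normalizer` and `normalize` are hypothetical, and the data is randomly generated with one example per row. Note that the sketch divides by the standard deviation, which is what yields unit variance.

```python
import numpy as np

def fit_normalizer(X_train):
    """Return per-feature mean and standard deviation of the training set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)     # sqrt of (1/m) * sum of squares after centering
    return mu, sigma

def normalize(X, mu, sigma):
    """Subtract the mean and divide by the standard deviation, per feature."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# Two features on very different scales; rows are examples (an assumed layout).
X_train = np.column_stack([rng.uniform(1, 1000, 500), rng.uniform(0, 1, 500)])
X_test = np.column_stack([rng.uniform(1, 1000, 100), rng.uniform(0, 1, 100)])

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)   # reuse the training statistics

print(X_train_n.mean(axis=0), X_train_n.var(axis=0))
```

The normalized training set has mean 0 and variance 1 in each feature up to rounding. The test set will only be close to that, which is expected because it goes through the training transformation.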

#### Why normalize inputs?

```{figure} images/5-6.png
---
height: 360px
---
```

- If you use unnormalized input features, it's more likely that your cost function will look like the plot on the left of the figure: a very squished out bowl, a very elongated cost function.

	- This happens when the features are on very different scales. For example, if the feature $x_1$ ranges from 1 to 1000 and the feature $x_2$ ranges from 0 to 1, then the parameters $w_1$ and $w_2$ end up taking on very different ranges of values, and the cost function can be a very elongated bowl. If you plot the contours of this function, they are very elongated too.

- Whereas if you normalize the features, then your cost function will on average look more symmetric.

- If you are running gradient descent on a cost function like the one on the left, you might have to use a very small learning rate, and gradient descent might need a lot of steps, oscillating back and forth before it finally finds its way to the minimum.

- Whereas if you have more spherical contours, then wherever you start, gradient descent can pretty much go straight to the minimum. You can take much larger steps, rather than needing to oscillate around like in the picture on the left.

- In practice, $w$ is a high dimensional vector, and plotting this in 2D doesn't convey all the intuitions correctly. But the rough intuition is that your cost function will be more round and easier (faster) to optimize when your features are on similar scales.


```{hint} In practice,
if your dataset has $x_1$ ranging from 0 to 1, $x_2$ ranging from $-1$ to 1, and $x_3$ ranging from 1 to 2, these are fairly similar ranges, so this will work just fine. It is when they are on dramatically different ranges, like one from 1 to 1000 and another from 0 to 1, that it really hurts your optimization algorithm.

Just setting all of them to zero mean and variance one, as in the last section, guarantees that all your features are on a similar scale and will usually help your learning algorithm run faster.
```

```{admonition} Summary
:class: important
If your input features came from very different scales, maybe some features are from $(0,1)$, some from $(1, 1000)$, then it's important to normalize your features. 

If your features came in on similar scales, then this step is less important although performing this type of normalization pretty much never does any harm. 

Often you could do it anyway, if you are not sure whether or not it will help with speeding up training for your algorithm. 
```

### Vanishing / Exploding Gradients

[Video](https://youtu.be/lRutSzyhr8Y)

One of the problems of training neural networks, especially very deep neural networks, is that of vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult. In this section you will see what this problem of exploding and vanishing gradients really means, as well as how you can use careful choices of the random weight initialization to significantly reduce this problem.

The figure below shows a very deep neural network. To save space and keep the example simple, it has only two hidden units per layer, but it could have more as well.

```{figure} images/5-7.png
---
height: 100px
---
```

Assumptions:

1. For a deep neural network with $L$ layers, the weight parameters are $W^{[1]}, W^{[2]}, \cdots, W^{[L]}$;
2. Ignore $b$, setting $b = 0$;
3. Activation function: $g(z) = z$, a linear function.

Based on these assumptions, the output:

$$\hat{y} = W^{[L]}W^{[L-1]}W^{[L-2]} \cdots W^{[3]}W^{[2]}W^{[1]} x$$

$$\begin{aligned}
z^{[1]} &= W^{[1]} x \\
a^{[1]} &= g(z^{[1]}) = z^{[1]} \\
a^{[2]} &= g(z^{[2]}) = z^{[2]} =  W^{[2]} a^{[1]} \\
&\vdots \\
\hat{y} = a^{[L]} &= g(z^{[L]}) = z^{[L]}
\end{aligned}$$

- Now let's say that each of your weight matrices is:

$$W^{[l]} = \left[ 
\begin{matrix}
1.5 & 0 \\
0 & 1.5
\end{matrix}
\right]$$

> Notice: Technically, the last one, $W^{[L]}$, has different dimensions, so this applies only to the other $L-1$ weight matrices.

In this case, 

$$ \hat{y} = W^{[L]}  \left[ 
\begin{matrix}
1.5 & 0 \\
0 & 1.5
\end{matrix}
\right]^{L-1} \mathbf{x} \approx 1.5^{L-1} \mathbf{x}$$

If $L$ is large, as in a very deep neural network, $\hat{y}$ will be very large: it grows exponentially, like 1.5 to the number of layers. So if you have a very deep neural network, the value of $\hat{y}$ will explode.


- Conversely, if we replace $1.5$ with $0.5$, something less than 1, then the result becomes $0.5$ to the power of $L-1$.

$$ \hat{y} = W^{[L]}  \left[ 
\begin{matrix}
0.5 & 0 \\
0 & 0.5
\end{matrix}
\right]^{L-1} \mathbf{x} \approx 0.5^{L-1} \mathbf{x}$$

In this case, the activation values will decrease exponentially as a function of the number of layers L of the network. So in the very deep network, the activations end up decreasing exponentially. 
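
The growth and decay can be reproduced with a toy linear network under the assumptions above ($b = 0$, $g(z) = z$, two units per layer). The function name `linear_forward` is made up for this sketch, and the final matrix with different dimensions is ignored.

```python
import numpy as np

def linear_forward(x, scale, n_layers):
    """Push x through n_layers linear layers, each with W = scale * I and b = 0."""
    W = scale * np.eye(x.shape[0])
    a = x
    for _ in range(n_layers):
        a = W @ a          # g(z) = z, so the activation equals z
    return a

x = np.ones((2, 1))
for n_layers in (10, 50, 100):
    big = linear_forward(x, 1.5, n_layers)[0, 0]     # equals 1.5 ** n_layers
    small = linear_forward(x, 0.5, n_layers)[0, 0]   # equals 0.5 ** n_layers
    print(n_layers, big, small)
```

With 100 layers the first value is astronomically large and the second is vanishingly small, even though each weight is only a factor of 1.5 or 0.5 away from zero change.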

```{admonition} Summary
:class: important

$$\begin{aligned}
W^{[l]} &> \mathbf{I} \  \longrightarrow \text{explode} \\
W^{[l]} &< \mathbf{I} \  \longrightarrow \text{decrease exponentially} 
\end{aligned}$$

- If the weights $W$ are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode.

- And if $W$ is just a little bit less than the identity, then with a very deep network the activations will decrease exponentially.

With such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of $L$, then these values could get really big or really small. This makes training difficult, especially if your gradients are exponentially small in $L$: gradient descent will then take tiny little steps, and it will take a long time to learn anything.
```

To summarize, you've learned how deep networks suffer from the problems of vanishing or exploding gradients. In fact, for a long time this problem was a huge barrier to training deep neural networks. It turns out there's a partial solution that doesn't completely solve this problem but it helps a lot which is careful choice of how you initialize the weights. 



### Weight Initialization for Deep Networks

[Video](https://youtu.be/qbwFUDJ4R_w)

- ReLU (He initialization): $W^{[l]}$ = `np.random.randn`($n^{[l]}, n^{[l-1]}$) * `np.sqrt`($\frac{2}{n^{[l-1]}}$)

- tanh (Xavier initialization): $W^{[l]}$ = `np.random.randn`($n^{[l]}, n^{[l-1]}$) * `np.sqrt`($\frac{1}{n^{[l-1]}}$)

- some other initialization method: $W^{[l]}$ = `np.random.randn`($n^{[l]}, n^{[l-1]}$) * `np.sqrt`($\frac{2}{n^{[l]} + n^{[l-1]}}$)

If the input features or activations have roughly mean 0 and variance 1, then this causes $z$ to also take on a similar scale. This ***doesn't solve***, but it definitely ***helps reduce*** the vanishing and exploding gradients problem, because it tries to set each of the weight matrices $W$ to be not too much bigger than 1 and not too much less than 1, so it doesn't explode or vanish too quickly.

For more details, see [Practical 5: Initialization](C5_Initialization).
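
The three scaling rules above can be collected into one helper. This is a sketch: the function name `initialize_weights` and the `method` strings are assumptions, and it uses NumPy's `default_rng` generator rather than `np.random.randn` so that the result is reproducible.

```python
import numpy as np

def initialize_weights(layer_dims, method="he", seed=0):
    """Return a list of weight matrices W^[1..L] for the given layer sizes."""
    rng = np.random.default_rng(seed)
    weights = []
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        if method == "he":            # suggested above for ReLU
            scale = np.sqrt(2.0 / n_prev)
        elif method == "xavier":      # suggested above for tanh
            scale = np.sqrt(1.0 / n_prev)
        else:                         # the third variant listed above
            scale = np.sqrt(2.0 / (n_prev + n_curr))
        weights.append(rng.standard_normal((n_curr, n_prev)) * scale)
    return weights

weights = initialize_weights([1000, 500, 10], method="he")
print([W.shape for W in weights])           # [(500, 1000), (10, 500)]
print(weights[0].std(), np.sqrt(2 / 1000))  # the two values should be close
```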



### Numerical Approximation of Gradients

[Video](https://youtu.be/5zTEBiHOAhc)

#### One-sided difference

```{figure} images/5-8.png
---
height: 300px
---
```


$$\begin{aligned}
J = f(\theta) &= \theta^3, \ \theta \in \mathbb{R} \\
g(\theta) = \dfrac{\mathrm{d}}{\mathrm{d}\theta} f(\theta) = f'(\theta) &= 3\theta^2 \\
\dfrac{f(\theta+\epsilon) - f(\theta)}{\epsilon} &\approx g(\theta) 
\end{aligned}$$

When $\theta = 1, \ \epsilon = 0.01$, 

- $g(\theta) =  f'(\theta) = 3\theta^2 = 3$

- $\dfrac{f(\theta+\epsilon) - f(\theta)}{\epsilon} = \dfrac{(1.01)^3 - 1^3}{0.01} = 3.0301 \approx 3$

- the approximate error: $0.0301$


#### Two-sided difference

```{figure} images/5-9.png
---
height: 300px
---
```


$$\begin{aligned}
J = f(\theta) &= \theta^3, \ \theta \in \mathbb{R} \\
g(\theta) = \dfrac{\mathrm{d}}{\mathrm{d}\theta} f(\theta) = f'(\theta) &= 3\theta^2 \\
\dfrac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon} &\approx g(\theta) 
\end{aligned}$$

When $\theta = 1, \ \epsilon = 0.01$, 

- $g(\theta) =  f'(\theta) = 3\theta^2 = 3$

- $\dfrac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon} = \dfrac{(1.01)^3 - (0.99)^3}{0.02} = 3.0001 \approx 3$

- the approximate error: $0.0001$

```{admonition} Optional
:class: hint

The formal definition of a derivative:

$$f'(\theta) = \lim_{\epsilon \to 0} \dfrac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon} \tag{1}$$

It turns out that for a non-zero value of $\epsilon$ $(\epsilon \neq 0)$, you can show that the error of this approximation is on the order of epsilon squared ($\boldsymbol{O}(\epsilon^2)$), and remember epsilon is a very small number. The big O notation ($\boldsymbol{O}$) means the error is actually some constant times this.

$$f'(\theta) = \lim_{\epsilon \to 0} \dfrac{f(\theta+\epsilon) - f(\theta)}{\epsilon} \tag{2}$$ 

The error is on the order of epsilon ($\boldsymbol{O}(\epsilon)$).

When $\epsilon$ is a number less than 1, $\epsilon$ is much bigger than $\epsilon^2$, which is why formula (2) is a much less accurate approximation than formula (1). This is why, when doing gradient checking, we use the two-sided difference rather than the less accurate one-sided difference.
```

```{admonition} Summary
:class: important
Two-sided difference formula is much more accurate.
```
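
The two worked examples can be reproduced in plain Python. The function names `one_sided` and `two_sided` are just labels for the two formulas, and the extra values of `eps` are added to show how each error shrinks.

```python
def f(theta):
    return theta ** 3

def one_sided(f, theta, eps):
    return (f(theta + eps) - f(theta)) / eps

def two_sided(f, theta, eps):
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta, exact = 1.0, 3.0   # f'(1) = 3 * 1**2 = 3
for eps in (0.1, 0.01, 0.001):
    err1 = abs(one_sided(f, theta, eps) - exact)
    err2 = abs(two_sided(f, theta, eps) - exact)
    print(eps, err1, err2)
```

For `eps = 0.01` the errors are about 0.0301 and 0.0001, matching the numbers above. Each time `eps` shrinks by a factor of 10, the one-sided error shrinks by roughly 10 and the two-sided error by roughly 100.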

### Gradient Checking

[Video](https://youtu.be/CglVBBL2fZc)

For more details, see [Practical 7: Gradient Checking](C5_Gradient_Checking).


### Gradient Checking Implementation Notes

[Video](https://youtu.be/2lEYV0OP0ug)

In this section, I want to share with you some practical tips or some notes on how to actually go about implementing ***gradient checking*** for your neural network:

1. **DO NOT** use in training - only use to debug.

2. If algorithm fails grad check, look at components to try to identify bug.

3. Remember regularization. When doing grad check, remember your regularization term if you're using regularization.

4. Grad check **DOES NOT** work with dropout.

	- Because in every iteration, dropout randomly eliminates a different subset of the hidden units, there isn't an easy-to-compute cost function $J$ that dropout is doing gradient descent on. It turns out that dropout can be viewed as optimizing some cost function $J$, but this $J$ is defined by summing over the exponentially many subsets of nodes that could be eliminated in any iteration. So the cost function $J$ is very difficult to compute, and you're just sampling it every time you eliminate a different random subset. That makes it difficult to use grad check to double check your computation with dropout. A common workaround is to run grad check with dropout turned off (`keep_prob = 1.0`) and then turn dropout back on.

5. Run at random initialization; perhaps again after some training.
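
A minimal sketch of a gradient check on a toy cost illustrates note 3. The helper `gradient_check`, the relative-difference formula, and the toy cost are assumptions made for this example (the relative difference is a common convention, not something defined in this chapter).

```python
import numpy as np

def gradient_check(cost, grad, theta, eps=1e-7):
    """Compare an analytic gradient with a two-sided numerical estimate."""
    analytic = grad(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        numeric[i] = (cost(theta + step) - cost(theta - step)) / (2 * eps)
    # Relative difference: small values suggest the gradient code is right.
    return np.linalg.norm(analytic - numeric) / (
        np.linalg.norm(analytic) + np.linalg.norm(numeric))

lambd = 0.1
# Toy cost with an L2 term; the regularization must appear in both functions.
cost = lambda th: np.sum(th ** 3) + lambd / 2 * np.sum(th ** 2)
good_grad = lambda th: 3 * th ** 2 + lambd * th
buggy_grad = lambda th: 3 * th ** 2            # forgets the regularization term

theta = np.array([1.0, -2.0, 0.5])
print(gradient_check(cost, good_grad, theta))   # tiny
print(gradient_check(cost, buggy_grad, theta))  # noticeably larger
```

The loop calls `cost` twice per parameter, which is why grad check is only practical for debugging and not during training.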

```{admonition} Summary
:class: important

In this week, you've learned about:

- how to set up your train, dev, and test sets;

- how to analyze bias and variance and what things to do if you have high bias versus high variance versus maybe high bias and high variance;

- how to apply different forms of regularization, like L2 regularization and dropout, on your neural network, and some tricks for speeding up the training of your neural network;

- And then finally, gradient checking. 
```


## Quiz

1. If you have 20,000,000 examples, how would you split the train/dev/test set? Choose the best option.

    A. 90\% train. 5\% dev. 5\% test.

    B. 99\% train. 0.5\% dev. 0.5\% test.

    C. 60\% train. 20\% dev. 20\% test.


2.  In a personal experiment, an M.L. student decides to not use a test set, only train-dev sets. In this case which of the following is true?

    A. He won't be able to measure the bias of the model.
    
    B. He won't be able to measure the variance of the model.
    
    C. He might be overfitting to the dev set.
    
    D. Not having a test set is unacceptable under any circumstance.


3. If your Neural Network model seems to have high variance, which of the following would be promising things to try?

    A. Make the Neural Network deeper
    
    B. Add regularization
    
    C. Get more training data
    
    D. Increase the number of units in each hidden layer
    
    E. Get more test data


4. You are working on an automated check-out kiosk for a supermarket and are building a classifier for apples, bananas, and oranges. Suppose your classifier obtains a training set error of 19\% and a dev set error of 21\%. Which of the following are promising things to try to improve your classifier? (Check all that apply, suppose the human error is approximately 0\%)

    A. Use a bigger network.
    
    B. Increase the regularization parameter lambda.
    
    C. Get more training data.


5.  **True/False** In every case it is a good practice to use dropout when training a deep neural network because it can help to prevent overfitting.   __________

6.  **True/False** The regularization hyperparameter must be set to zero during testing to avoid getting random results.    __________

7.  Which of the following are true about dropout?

    A. It helps to reduce overfitting.
    
    B. In practice, it eliminates units of each layer with probability of $1-$keep_prob.
    
    C. In practice, it eliminates units of each layer with probability of keep_prob.
    
    D. It helps to reduce the bias of the model.
    
 
8.  While training a deep neural network that uses the tanh activation function, the gradients are practically zero. Which of the following is most likely to help with the vanishing gradient problem?
    
    A. Use Xavier initialization.
    
    B. Increase the number of cycles during the training.
    
    C. Increase the number of layers of the network.
    
    D. Use a larger regularization parameter.    
    
    
9. Which of these techniques are useful for reducing variance (reducing overfitting)?

      A. Xavier initialization

      B. L2 regularization

      C. Dropout

      D. Data augmentation
      
      E. Gradient checking
      
      F. Vanishing gradient
      
      G. Exploding gradient


10. Which of the following is the correct expression to normalize the input $\mathbf{x}$?

      A. $x = \dfrac{x-\mu}{\sigma}$
      
      B. $x = \dfrac{x}{\sigma}$
      
      C. $x = \dfrac{1}{m} \sum_{i=1}^{m}\Big(x^{(i)}\Big)^2$

      D. $x = \dfrac{1}{m} \sum_{i=1}^{m} x^{(i)}$


11. The dev and test set should:
    
    A. Be identical to each other (same (x,y) pairs)
    
    B. Have the same number of examples
    
    C. Come from the same distribution
    
    D. Come from different distributions
    

12. A model developed for a project is presenting high bias. One of the sponsors of the project offers some resources that might help reduce the bias. Which of the following additional resources has a better chance to help reduce the bias?

    A. Give access to more computational resources like GPUs.
    
    B. Use different sources to gather data and better test the model.
    
    C. Gather more data for the project.


13. Which of the following are regularization techniques?

    A. Gradient Checking.
    
    B. Increase the number of layers of the network.
    
    C. Dropout.
    
    D. Weight decay.


14. **True/False** To reduce high variance, the regularization hyperparameter lambda must be increased.   __________  


15. Decreasing the parameter keep_prob from (say) 0.6 to 0.4 will likely cause the following:

    A. Increasing the regularization effect.
    
    B. Causing the neural network to have a higher variance.
    
    C. Reducing the regularization effect.


16. Which of the following actions increase the regularization of a model? (Check all that apply)

    A. Increase the value of the hyperparameter lambda.

    B. Decrease the value of the hyperparameter lambda.
    
    C. Normalizing the data.
    
    D. Increase the value of keep_prob in dropout.
    
    E. Make use of data augmentation.


17. Suppose that a model uses, as one feature, the total number of kilometers walked by a person during a year, and another feature is the height of the person in meters. What is the most likely effect of normalization of the input data?

      A. It will make the training faster.
      
      B. It won't have any positive or negative effects.
      
      C. It will make the data easier to visualize.
      
      D. It will increase the variance of the model.
      

18. When designing a neural network to detect if a house cat is present in the picture, 500,000 pictures of cats were taken by their owners. These are used to make the training, dev and test sets. It is decided that to increase the size of the test set, 10,000 new images of cats taken from security cameras are going to be used in the test set. Which of the following is true?

      A. This will increase the bias of the model so the new images shouldn't be used.
      
      B. This will be harmful to the project since now dev and test sets have different distributions.
      
      C. This will reduce the bias of the model and help improve it.


19. You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5\%, and a dev set error of 7\%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)

      A. Increase the regularization parameter lambda.

      B. Decrease the regularization parameter lambda.
      
      C. Getting more training data.
      
      D. Use a bigger neural network.


20. What is weight decay?

      A. A technique to avoid vanishing gradient by imposing a ceiling on the values of the weights.
      
      B. A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.
      
      C. The process of gradually decreasing the learning rate during training.
      
      D. General corruption of the weights in the neural network if it is trained on noisy data.
      
      
21. Which of the following actions increase the regularization of a model? (Check all that apply)

      A. Use Xavier initialization.
      
      B. Increase the value of keep_prob in dropout.
      
      C. Increase the value of the hyperparameter lambda.
      
      D. Decrease the value of keep_prob in dropout.
      
      E. Decrease the value of the hyperparameter lambda.

22. Why do we normalize the inputs $\mathbf{x}$?

      A. It makes it easier to visualize the data.
      
      B. It makes the parameter initialization faster.
      
      C. Normalization is another word for regularization--It helps to reduce variance.
      
      D. It makes the cost function faster to optimize.



:::{admonition} Click here for answers!
:class: tip, dropdown

1. B </br>

    B. Given the size of the dataset, 0.5\% of the samples are enough to get a good estimate of how well the model is doing. </br>

2. C </br>

    C. Although not recommended, it is OK to skip the test set if a more accurate measure of performance is not necessary. However, this might lead to overfitting to the dev set.
    
3. BC </br>

4. A </br>

    A. This can be helpful to reduce the bias of the model, and then we can start trying to reduce the high variance if this happens.  </br>

5. False </br>

     In most cases, it is recommended not to use dropout if there is no overfitting, although in computer vision, due to the nature of the data, it is the default practice. </br>
    
6. False </br>

    The regularization parameter affects how the weights change during training, that is, during backpropagation. It has no effect during forward propagation, which is when predictions for the test set are made. </br>
    
7. AB </br>

    A. Dropout is a regularization technique and thus helps to reduce overfitting. </br>
    B. The probability that dropout doesn't eliminate a neuron is keep_prob.

8. A </br>

   A careful initialization can help reduce the vanishing gradient problem. </br>
    
9. BCD </br>
    
10. A </br>

      A. This shifts the mean of the input to the origin and makes the variance one in each coordinate of the input examples.
    
11. C </br>

12. A </br>

      A. This can allow the developers to try bigger networks, train for more cycles, and test different architectures. </br>  
      
13. CD </br>

       Using dropout layers is a regularization technique. Weight decay is also a form of regularization. </br>

14. True </br>

     By increasing the regularization parameter the magnitude of the weight parameters is reduced. This helps avoid overfitting and reduces the variance. </br> 

15. A </br>

     A. This will make the dropout have a higher probability of eliminating a node in the neural network, increasing the regularization effect. </br> 

16. AE </br>

      A. When increasing the hyperparameter lambda, we increase the effect of the $L_2$ penalization. </br>
      E. Data augmentation is a way to generate "new" data at a relatively low cost. Thus, making use of data augmentation can reduce the variance. </br>

17. A </br>

      A. Because the ranges of the two features are very different, gradient descent on the unnormalized data is likely to oscillate and take longer to converge; normalization avoids this. </br>
      
18. B </br>

      B. The quality and type of the images are quite different, so we can't consider the dev and test sets to come from the same distribution. </br>
      
19. AC </br>

20. B </br>

21. CD </br>

      C. When increasing the hyperparameter lambda, we increase the effect of the $L_2$ penalization. </br>
      D. When decreasing the keep_prob value, the probability that a node gets discarded during training is higher, thus increasing the regularization effect. </br>

22. D </br>
:::