# 1) The Problem of Overfitting

## 1) Understanding Overfitting

Here I'd like to explain what the **overfitting problem** is, and then introduce a technique called **regularization** that allows us to **ameliorate or reduce** overfitting and get these learning algorithms to work much better. So what is overfitting?

**Consider the example below**

For the **first example**, we use a straight line to fit the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices plateau, or flatten out, as we move to the right. So this algorithm does not fit the training set well. We call this problem **underfitting**, and another term for it is that the algorithm has **high bias**.
- The term **"high bias"** is a historical or technical one, but the idea is that if we fit a straight line to the data, it's as if the algorithm has a **very strong preconception**, or a **very strong bias**, that housing prices are going to vary linearly with size. Despite the evidence to the contrary, this preconception, or bias, still causes it to fit a straight line, and this ends up being a poor fit to the data.

For the **second example**, we could fit a quadratic function, and that works pretty well.

For the **third example**, we fit a fourth degree polynomial to the data, and we can actually fit a curve that passes through all five of our training examples. But this is not a good model. We call this problem **overfitting**, and another term for it is that the algorithm has **high variance**.
- The term **high variance** is another historical or technical one. The intuition is that if we're fitting such a high order polynomial, then the hypothesis can fit almost any function, and this space of possible hypotheses is just too large, too variable. We don't have enough data to constrain it to give us a good hypothesis, and that's called **overfitting**.

**The problem of overfitting** arises when we have too many features: the learned hypothesis may fit the training set very well, so the cost function may be very close to zero or even exactly zero, but we may end up with a weird curve. The hypothesis tries so hard to fit the training set that it fails to generalize to new examples, and so fails to predict prices on new examples well. Here the term "generalize" refers to how well a hypothesis applies to new examples.
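The three fits described above can be sketched with NumPy. This is only an illustration under assumptions: the five data points are made up (they are not the lecture's data), and `training_cost` is a hypothetical helper that returns the squared error cost $\frac{1}{2m}\sum (h(x) - y)^2$ on the training set.

```python
import numpy as np

# Hypothetical training set (not the lecture's data):
# house size in 1000 sq ft vs. price in $1000s, with prices levelling off.
size = np.array([0.5, 1.0, 1.5, 2.5, 4.0])
price = np.array([100.0, 190.0, 250.0, 300.0, 320.0])


def training_cost(degree):
    """Fit a polynomial of the given degree; return 1/(2m) * sum of squared errors."""
    coeffs = np.polyfit(size, price, degree)
    residuals = np.polyval(coeffs, size) - price
    return float(np.sum(residuals ** 2) / (2 * len(size)))


cost_linear = training_cost(1)     # straight line: underfits (high bias)
cost_quadratic = training_cost(2)  # quadratic: a reasonable fit
cost_quartic = training_cost(4)    # 4th degree: passes through all 5 points

print("training cost:", cost_linear, cost_quadratic, cost_quartic)

# A training cost near zero says nothing about new examples.
# Compare what each model predicts for a house larger than any in the training set.
new_size = 5.0
for degree in (1, 2, 4):
    coeffs = np.polyfit(size, price, degree)
    print("degree", degree, "predicts", np.polyval(coeffs, new_size))
```

The fourth degree fit drives the training cost to essentially zero because five points determine a quartic exactly, which is why a low training cost alone is not evidence of a good model.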

<img src="images/lec7_pic01.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ACpTQ/the-problem-of-overfitting) 2:16*

<!--TEASER_END-->

**A similar example on logistic regression**

<img src="images/lec7_pic02.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ACpTQ/the-problem-of-overfitting) 5:37*

<!--TEASER_END-->

<img src="images/lec7_pic03.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ACpTQ/the-problem-of-overfitting) 6:03*

<!--TEASER_END-->

## 2) Addressing Overfitting

If we think overfitting is occurring, what can we do to address it?

- In the previous examples, we had one or two dimensional data, so we could just plot the hypothesis, see what was going on, and select the appropriate degree of polynomial. So plotting the hypothesis could be one way to decide what degree of polynomial to use.
- But that doesn't always work. When we have many features, it becomes much harder to plot and visualize the data, and so much harder to decide which features to keep.

If we have a lot of features and very little training data, then overfitting can become a problem.

<img src="images/lec7_pic04.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ACpTQ/the-problem-of-overfitting) 7:00*

<!--TEASER_END-->

<img src="images/lec7_pic05.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ACpTQ/the-problem-of-overfitting) 9:00*

<!--TEASER_END-->

# 2) Cost function

Let's talk about the main intuitions behind **how regularization works**.

Consider the following two examples. If we fit a quadratic function to this data, it gives us a pretty good fit. Whereas, if we fit an overly high order polynomial, we end up with a curve that may fit the training set very well but overfits the data and does not generalize well.

**Suppose we were to penalize the parameters theta 3 and theta 4 and make them really small.**

Here's what I mean. Here is our optimization objective, or optimization problem, where we minimize our usual squared error cost function.

$$\min_{\theta} \dfrac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{i}) - y^{i})^{2}$$

Let's say I take this objective and modify it by adding 1000 theta 3 squared plus 1000 theta 4 squared, where 1000 just stands for some large number.

$$\min_{\theta} \dfrac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{i}) - y^{i})^{2} + 1000\,\theta_{3}^{2} + 1000\,\theta_{4}^{2}$$

Now, if we were to minimize this function, the only way to make this new cost function small is if theta 3 and theta 4 are small. Otherwise, the terms 1000 times theta 3 squared and 1000 times theta 4 squared make the new cost function big.

So when we minimize this new function, we are going to end up with theta 3 close to 0 and theta 4 close to 0, as if we were getting rid of the theta 3 and theta 4 terms in the hypothesis. If we do that, we are left with essentially a quadratic function: the fit to the data is a quadratic plus maybe tiny contributions from the theta 3 and theta 4 terms, which may be very close to 0.
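The effect of the two penalty terms can be checked numerically. The sketch below is an illustration under assumptions, not the lecture's method: the data points are made up, `fit` is a hypothetical helper, and it solves the penalized least squares problem in closed form (by setting the gradient to zero) instead of running gradient descent.

```python
import numpy as np

# Hypothetical training set (not the lecture's data).
size = np.array([0.5, 1.0, 1.5, 2.5, 4.0])
price = np.array([100.0, 190.0, 250.0, 300.0, 320.0])
m = len(size)

# Design matrix with columns 1, x, x^2, x^3, x^4 (parameters theta_0 .. theta_4).
X = np.vander(size, N=5, increasing=True)


def fit(penalty):
    """Minimize 1/(2m) * ||X theta - y||^2 + sum_j penalty[j] * theta_j^2.

    Setting the gradient to zero gives the linear system
    (X^T X / m + 2 * diag(penalty)) theta = X^T y / m.
    """
    A = X.T @ X / m + 2.0 * np.diag(penalty)
    b = X.T @ price / m
    return np.linalg.solve(A, b)


# No penalty: the plain fourth degree fit.
theta_plain = fit(np.zeros(5))

# Penalize only theta_3 and theta_4 with the large weight 1000.
theta_penalized = fit(np.array([0.0, 0.0, 0.0, 1000.0, 1000.0]))

print("theta_3, theta_4 without penalty:", theta_plain[3:])
print("theta_3, theta_4 with penalty:   ", theta_penalized[3:])
```

With the penalty in place, theta 3 and theta 4 should come out close to zero, so the fitted curve behaves almost like a quadratic.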

<img src="images/lec7_pic06.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/B1MnL/cost-function) 2:17*

<!--TEASER_END-->

**Here is the idea behind regularization**

- If we have small values for the parameters, this usually corresponds to having a simpler hypothesis. In our last example, we penalized just theta 3 and theta 4, and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. More generally, it is possible to show that **having smaller values of the parameters usually corresponds to smoother, simpler functions**, which are therefore also less prone to overfitting.

**Let's look at the specific example:**

For a housing prediction problem, we have:
- a hundred features: $x_{1}, x_{2}, ..., x_{100}$
- a hundred and one parameters: $\theta_{0}, \theta_{1}, \theta_{2}, ..., \theta_{100}$

And unlike the polynomial example, we don't know that theta 3 and theta 4 are the high order polynomial terms. If we just have a bag, or a set, of a hundred features, it's hard to pick in advance which ones are less likely to be relevant. With a hundred and one parameters, we don't know which ones to pick to try to shrink.

So, in regularization, what we're going to do is take our cost function:
$$J(\theta) = \dfrac{1}{2m}\sum_{i=1}^m(h_{\theta}(x^{i}) - y^{i})^{2}$$
And modify this cost function to shrink all of my parameters, because I don't know which one or two to try to shrink. So I am going to add a term at the end:
$$J(\theta) = \dfrac{1}{2m}\Big[\sum_{i=1}^m(h_{\theta}(x^{i}) - y^{i})^{2} + \lambda\sum_{j=1}^n\theta_{j}^2 \Big]$$

When I add this extra regularization term at the end, it means I will shrink all of my parameters theta 1, theta 2, theta 3, up to theta 100. By convention the summation here starts from one, so I am not actually going to penalize theta zero for being large, because the sum runs from j = 1 through n. But in practice, it makes very little difference whether you include theta zero or not.
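As a minimal sketch, the regularized cost can be written as a short NumPy function. It uses the form in which the factor $\frac{1}{2m}$ multiplies both the squared error sum and the regularization term. The function name `regularized_cost`, the toy data, and the assumption that `X` carries a leading column of ones (so that `theta[0]` is theta zero) are all choices made for this illustration.

```python
import numpy as np


def regularized_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * [ sum_i (h(x_i) - y_i)^2 + lam * sum_{j>=1} theta_j^2 ].

    X is assumed to have a leading column of ones, so theta[0] is the
    intercept theta_0, which by convention is left out of the penalty.
    """
    m = len(y)
    errors = X @ theta - y                   # h_theta(x_i) - y_i for every example
    penalty = lam * np.sum(theta[1:] ** 2)   # skips theta[0]
    return float((np.sum(errors ** 2) + penalty) / (2 * m))


# Toy data for illustration: y = x exactly, so theta = (0, 1) has zero squared error.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

theta_fit = np.array([0.0, 1.0])
cost_unregularized = regularized_cost(theta_fit, X, y, lam=0.0)
cost_regularized = regularized_cost(theta_fit, X, y, lam=6.0)

# A hypothesis that uses only theta_0 pays no penalty, whatever lambda is.
theta_flat = np.array([1.0, 0.0])
cost_flat_small_lam = regularized_cost(theta_flat, X, y, lam=0.0)
cost_flat_large_lam = regularized_cost(theta_flat, X, y, lam=100.0)
```

Here `theta_fit` has zero squared error, so its whole cost under `lam=6.0` comes from the penalty on theta 1, while the cost of `theta_flat` does not change with lambda because only theta zero is nonzero.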

<img src="images/lec7_pic07.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/B1MnL/cost-function) 5:55*

<!--TEASER_END-->

Writing down our regularized optimization objective, our regularized cost function, again:
$$J(\theta) = \dfrac{1}{2m}\Big[\sum_{i=1}^m(h_{\theta}(x^{i}) - y^{i})^{2} + \lambda\sum_{j=1}^n\theta_{j}^2 \Big]$$

We have: 
- regularization term: $\lambda\sum_{j=1}^n\theta_{j}^2$
- regularization parameter (lambda): $\lambda$

What lambda does is control a trade-off between two different goals.
- The first goal, captured by the first term of the objective, is that we would like to fit the training data well.
- The second goal is that we want to keep the parameters small, and that's captured by the second term, the regularization term.

So lambda, the regularization parameter, controls the trade-off between these two goals: fitting the training set well, and keeping the parameters small and therefore keeping the hypothesis relatively simple to avoid overfitting.

<img src="images/lec7_pic08.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/B1MnL/cost-function) 7:30*

<!--TEASER_END-->

<img src="images/lec7_pic09.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/B1MnL/cost-function) 7:59*

<!--TEASER_END-->

In regularized linear regression, if the regularization parameter lambda is set to be very large, then we will end up penalizing the parameters theta 1, theta 2, theta 3, theta 4 very heavily. We may end up with all of these parameters close to zero. And if we do that, then we're just left with the hypothesis:
$$\large h_\theta(x) = \theta_{0}$$

That means we just fit a flat, horizontal straight line to the data. And **this is an example of underfitting.** Another way of saying this is that the hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta zero, and despite the clear data to the contrary, it chooses to fit a flat horizontal line.

So for regularization to work well, some care should be taken to choose a good value for the regularization parameter lambda as well.
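The flattening effect of a very large lambda can be seen in a small sketch. As before, this is an illustration under assumptions: the data points are made up, `fit_regularized` is a hypothetical helper, and it minimizes the regularized cost in closed form by setting the gradient to zero, with theta zero left unpenalized.

```python
import numpy as np

# Hypothetical training set (not the lecture's data).
size = np.array([0.5, 1.0, 1.5, 2.5, 4.0])
price = np.array([100.0, 190.0, 250.0, 300.0, 320.0])

# Fourth degree polynomial features: columns 1, x, x^2, x^3, x^4.
X = np.vander(size, N=5, increasing=True)


def fit_regularized(lam):
    """Minimize 1/(2m) * [ ||X theta - y||^2 + lam * sum_{j>=1} theta_j^2 ].

    Setting the gradient to zero gives (X^T X + lam * L) theta = X^T y,
    where L is the identity with its top-left entry zeroed so that
    theta_0 is not penalized.
    """
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ price)


for lam in (0.0, 1.0, 1e10):
    print("lambda =", lam, "theta =", fit_regularized(lam))

# With a huge lambda, theta_1 .. theta_4 are pushed to (almost) zero,
# leaving the flat line h(x) = theta_0.
theta_huge = fit_regularized(1e10)
```

Once theta 1 through theta 4 are essentially zero, the best remaining choice for theta zero is close to the average training price, which is exactly the flat horizontal line described above.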

<img src="images/lec7_pic10.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/B1MnL/cost-function) 9:36*

<!--TEASER_END-->

# 3) Regularized Linear Regression