# 1) Characteristics of overfit models

## 1) Symptoms of overfitting in polynomial regression

In the last module, we talked about the potential for high complexity models to become overfit to the data. We also discussed the idea of a bias-variance tradeoff, where high complexity models could have very low bias but high variance, whereas low complexity models have high bias but low variance. 

And in this module, what we're gonna do is talk about a way to automatically balance between bias and variance using something called ridge regression. 

So let's recall this issue of overfitting in the context of polynomial regression. Remember, this is our polynomial regression model. If we fit some low order polynomial to our data, we might get a fit that looks like the following, which is just a quadratic fit to the data. But once we get to a much higher order polynomial, we can get these really wild fits to our training observations. Again, this is an instance of a high variance model. We refer to this fit as being overfit because it is very well tuned to our training observations, but it doesn't generalize well to other observations we might see. 

<img src="images/lec4_pic01.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/TIGJ5/symptoms-of-overfitting-in-polynomial-regression) 1:00*

<!--TEASER_END-->

So, previously we discussed a very formal notion of what it means for a model to be overfit: a model is overfit if there is another model that has higher training error but lower true error. 

Okay, hopefully you remember that from the last module. But a question we have now is, is there some type of quantitative measure that's indicative of when a model is overfit? And to see this, let's look at the following demo, where what we're going to show is that when models become overfit, the estimated coefficients of those models tend to become really large in magnitude. 
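
The lecture demo itself isn't reproduced here, but the effect is easy to sketch. The snippet below is only an illustration under assumed conditions: the data are synthetic (a noisy sine curve, not the housing data), and the helper name `fit_polynomial` is made up for this example. It fits a low order and a high order polynomial by least squares and compares the largest coefficient magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 12 noisy samples of a smooth curve on [0, 1].
x = np.sort(rng.uniform(0.0, 1.0, size=12))
y = np.sin(4.0 * x) + rng.normal(scale=0.1, size=x.size)

def fit_polynomial(x, y, degree):
    """Least squares fit; returns w_0..w_degree (lowest power first)."""
    H = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w

w_low = fit_polynomial(x, y, degree=2)    # simple quadratic fit
w_high = fit_polynomial(x, y, degree=10)  # very flexible fit

print("degree 2  max |w_j|:", np.max(np.abs(w_low)))
print("degree 10 max |w_j|:", np.max(np.abs(w_high)))
```

On data like this, the degree 10 fit typically ends up with coefficients that are orders of magnitude larger than the quadratic fit, which is the symptom described above.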

<img src="images/lec4_pic02.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/TIGJ5/symptoms-of-overfitting-in-polynomial-regression) 2:00*

<!--TEASER_END-->

## 2) Overfitting for more general multiple regression models

So, in particular we can also face this issue of overfitting when we have lots and lots of inputs. That represents a very flexible model that can run into the same issues that we saw in our demo for polynomial regression. 

Or more generally, we can say just if we have lots of features. So we'll say that capital D is very large. And this could be different functions of our input. But when you include lots and lots of these functions of our inputs, in our regression model then again we're in this place where the model has a lot of flexibility to explain the data and we're subject to becoming overfit. 

But this issue of overfitting with respect to increasing model complexity is really relative to how much data we have. So let's talk about overfitting as a function of the number of observations that we have, as well as a function of the number of inputs, or the complexity of the model. 

<img src="images/lec4_pic03.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/38mGi/overfitting-for-more-general-multiple-regression-models) 1:00*

<!--TEASER_END-->

So in particular, if we have very few observations (N is small), then our models can rapidly become overfit to the data. Because we have only a few points, as we increase our model complexity, like the order of the polynomial, it becomes very easy to hit all of our observations, but in between those observations things can go very wild. 

On the other hand, if we have lots and lots of observations, even with really, really complex models, we're not gonna become overfit as quickly, because we have dense observations across our input, so the function is pinned down basically everywhere (in this example, as a function of square feet). It's not able to hit every observation, and it's not able to do these really crazy wiggly things. 

<img src="images/lec4_pic04.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/38mGi/overfitting-for-more-general-multiple-regression-models) 2:00*

<!--TEASER_END-->

Now, when we have just one input, like the number of square feet of a house, in order to avoid overfitting we need to have observations that are very dense across number of square feet. So we need to have lots of representative examples of square feet and house value pairs. This is actually pretty hard to do, to have lots of examples of houses of every possible square footage that you might see. 

<img src="images/lec4_pic05.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/38mGi/overfitting-for-more-general-multiple-regression-models) 2:50*

<!--TEASER_END-->

So this is already a hard problem, but it becomes even harder when I increase the number of inputs in my model. So, for example, just think of a model where I have square feet and number of bathrooms. And I want to cover all possible combinations of those two inputs in order to provide representative examples and avoid overfitting. Well that's really really hard.
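
A back-of-the-envelope way to see why this gets hard so fast: suppose, purely as an assumption for this sketch, that we chop each input into 10 bins and want at least one example in every combination of bins. The function name `cells_to_cover` is made up for the illustration.

```python
def cells_to_cover(bins_per_input, num_inputs):
    """Number of grid cells when every input is split into the same number of bins."""
    return bins_per_input ** num_inputs

# 10 bins per input is an arbitrary choice for illustration.
for d in (1, 2, 3, 8):
    print(d, "inputs ->", cells_to_cover(10, d), "cells to cover")
```

The count grows exponentially with the number of inputs, so the amount of data needed to cover the input space densely grows the same way.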

<img src="images/lec4_pic06.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/38mGi/overfitting-for-more-general-multiple-regression-models) 3:00*

<!--TEASER_END-->

# 2) The ridge objective

## 1) Balancing fit and magnitude of coefficients

So now let's talk about a way to automatically address this issue by modifying the cost term that we're minimizing when we're assessing how good our fit is. So, in particular we're looking at this orange box, this quality metric. Before, our quality metric just depended on the difference between our predicted house sales price and our actual house sales price. In particular, we were looking at residual sum of squares as our measure of fit. But now we're gonna modify this quality metric to also take into account a measure of the complexity of the model, in order to bias us toward simpler models. 

<img src="images/lec4_pic07.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/b1fbX/balancing-fit-and-magnitude-of-coefficients) 0:40*

<!--TEASER_END-->

So when we're thinking about defining this modified cost function, what we're gonna want to do is balance between how well the function fits the data and a measure of how complex, or how potentially overfit, the model is. And what did we see was an indicator of that? The magnitude of our estimated coefficients. 

So, what we're going to balance between is the fit of the model to the data and the magnitude of the coefficients of the model. Okay, so we can write down a total cost that has these two terms, and this is our new measure of the quality of the fit. When I say measure of fit here, what I mean is that a small number indicates that there's a good fit to the data. On the other hand, if the measure of the magnitude of the coefficients is small, that means the coefficients are small and we're unlikely to be in this setting of a very overfit model. 

Okay, so clearly we want to balance between these two measures, because if I just optimize the magnitude of the coefficients, I'd set all the coefficients to zero and that would sure not be overfit, but it also would not fit the data well. So that would be a very high bias solution. On the other hand, if I just focused on optimizing the measure of fit, that's what we did before. That's the thing that was subject to becoming overfit in the face of complex models. So somehow we want to trade off between these two terms, and that's what we're going to discuss now.

<img src="images/lec4_pic08.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/b1fbX/balancing-fit-and-magnitude-of-coefficients) 2:00*

<!--TEASER_END-->

Okay, what's our measure of fit? It's our residual sum of squares, which I've written here, and hopefully this formula is quite familiar to you at this point. 

But sometimes we also write it as follows where remember $\hat y_i$ is our predicted value using w in our model to make these predictions. 

And just remember that a small residual sum of squares is indicative of a model that fits the training data well. So just as we said on the previous slide, when we're thinking about measure of fit, a small number is gonna indicate a good fit. 
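
Since the formula itself only appears in the slide, here are the two forms written out. The notation is an assumption carried over from earlier modules: $h(x_i)$ is the feature vector for observation $i$, and $N$ is the number of training observations.

$$RSS(w) = \sum_{i=1}^N (y_i - h(x_i)^T w)^2 = \sum_{i=1}^N (y_i - \hat y_i(w))^2 $$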

<img src="images/lec4_pic09.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/b1fbX/balancing-fit-and-magnitude-of-coefficients) 3:00*

<!--TEASER_END-->

Okay, so now what we need is a measure of the magnitude of the coefficients. So what summary number might be indicative of the size of the regression coefficients? 

Well maybe you think about just summing all the coefficients together? Is this gonna be a good measure of the overall magnitude of the coefficients? Probably not in a lot of cases because you might end up with a situation where, let's say, 
- $w_0$ is 1,527,301 
- $w_1$ is -1,605,253

Well, let's say $w_0$ and $w_1$ are the only two coefficients in our model. If I look at $w_0 + w_1$, this is gonna be a comparatively small number, despite the fact that each of the coefficients is quite large in magnitude. 

Okay, so you might say, I know how to fix this, I'll just look at the absolute values. So maybe what I'll do is look at the absolute value of $w_0$, plus the absolute value of $w_1$, all the way up to the absolute value of $w_D$. I'll write this compactly as the sum from j=0 to capital D, the number of features we have, of the absolute value of $w_j$. This is defined to be the one norm of the vector of coefficients w, written $||w||_1$, and it's called the L1 norm. This is actually a very reasonable choice, and we're gonna discuss it more in the next module. 

$$L_1 \text{ norm}:  |w_0| + |w_1| + \dots + |w_D| = \sum_{j=0}^D |w_j| \triangleq ||w||_1 $$

But for now, the thing that we're gonna consider is the sum of the squares of the coefficients. So $w_0$ squared plus $w_1$ squared, all the way up to $w_D$ squared. This is the sum from j=0 to capital D of $w_j$ squared. And this is defined to be equal to something we've actually seen many times in this class so far: it's the two norm squared. So this is called our L2 norm, or really the L2 norm squared, and this is gonna be the focus of this module. 

$$\text{Squared } L_2 \text{ norm}:  w_0^2 + w_1^2 + \dots + w_D^2 = \sum_{j=0}^D w_j^2 \triangleq ||w||_2^2 $$
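
As a quick numeric check of the three candidate measures, here is a small sketch using the two example coefficients from above. The variable names are just for this illustration.

```python
import numpy as np

# The two coefficients from the example above.
w = np.array([1_527_301.0, -1_605_253.0])

plain_sum = np.sum(w)          # large values of opposite sign cancel
l1_norm = np.sum(np.abs(w))    # ||w||_1: sum of absolute values
l2_norm_sq = np.sum(w ** 2)    # ||w||_2^2: sum of squares

print("sum of coefficients:", plain_sum)
print("L1 norm:            ", l1_norm)
print("squared L2 norm:    ", l2_norm_sq)
```

The plain sum comes out to -77,952, which hides how large the individual coefficients are, while the L1 norm is 3,132,554 and the squared L2 norm is larger still.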

<img src="images/lec4_pic10.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/b1fbX/balancing-fit-and-magnitude-of-coefficients) 6:00*

<!--TEASER_END-->

So again, just to summarize, our total cost is the sum of a measure of fit plus a measure of the magnitude of the coefficients. We said our measure of fit is our residual sum of squares, and our measure of the magnitude of the coefficients for this module is going to be the two norm of the w vector, squared. 

<img src="images/lec4_pic11.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/b1fbX/balancing-fit-and-magnitude-of-coefficients) 6:50*

<!--TEASER_END-->

## 2) The resulting ridge objective and its extreme solutions

Okay, so let's consider the resulting objective, where I'm gonna search over all possible w vectors to find the one that minimizes the residual sum of squares plus the square of the two norm of w. That's gonna be my $\hat w$, my estimated model parameters. 

But really what I'd like to do is be able to control how much I'm weighing the complexity of the model, as measured by the magnitude of my coefficients, relative to the fit of the model. I'd like to balance between these two terms, and so I'm gonna introduce another parameter. This is called a tuning parameter. We denote it by $\lambda$ (lambda), and it balances between fit and magnitude. 
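
One compact way to write the resulting objective, using $RSS(w)$ for residual sum of squares and the squared two norm defined earlier:

$$\hat w = \arg\min_w \left[ RSS(w) + \lambda ||w||_2^2 \right] $$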

- If $\lambda = 0$: So let's see what happens if I choose lambda to be 0. In that case, the magnitude term that we've introduced completely disappears, and my objective reduces to just minimizing the residual sum of squares, which is exactly my objective from before. So this is our old solution, which leads to some $\hat w$ that I'm gonna call $\hat w^{LS}$, for least squares, because what we were doing before is commonly referred to as the least squares solution. So I'm gonna specifically refer to the parameters associated with that old procedure as the least squares parameters. 

- If $\lambda = \infty$: On the other hand, what if I completely crank up that tuning parameter to be infinity? So I have a really, really massively large weight on this magnitude term. Massively large being infinitely large. So as large as you can possibly imagine. 
    - So what happens to any solution where $\hat w$ is not equal to 0? Then the total cost is what? Well, I get something non-zero times infinity, plus my residual sum of squares, whatever that happens to be. The sum of that is infinity. Okay, so my total cost is infinite. 
    - On the other hand, what if $\hat w$ is exactly equal to 0? Then the total cost is equal to the residual sum of squares of this 0 vector. That's some number, but it's not infinity, so the minimizing solution here is always gonna be $\hat w = 0$, because that's the thing that's gonna minimize the total cost over all possible w's. 

Okay, so just to recap, we said that if we make that tuning parameter very, very small, all the way to 0, then we return to our previous least squares solution. And if we crank that parameter all the way up to be infinite, in that limit we get all of our coefficients being exactly 0, okay? 

- If $\lambda$ is in between ($0 < \lambda < \infty$): We're gonna be operating in a regime where lambda is somewhere in between 0 and infinity. In this case, we know that the magnitude of our estimated coefficients is gonna be less than or equal to the magnitude of our least squares coefficients. In particular, $0 \leq ||\hat w||_2^2 \leq ||\hat w^{LS}||_2^2$. So we're gonna be somewhere in between these two extremes. 
    - And a key question is, what lambda do we actually want? How much do we want to bias away from our least squares solution, which was subject to potentially overfitting, down to the most trivial model you can consider, which is nothing, no coefficients in the model? What's the model if all the coefficients are 0? Just noise: we just have $y = \epsilon$, that noise term. Okay, so we're gonna think about somehow trading off between these two extremes. 
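
The three regimes above can be checked numerically. The sketch below rests on assumptions that are not part of this section: it uses the standard closed-form minimizer of the ridge objective, $(H^T H + \lambda I)^{-1} H^T y$, penalizes every coefficient including $w_0$ (matching the sum from j=0 used above), uses a very large lambda as a stand-in for infinity, and runs on synthetic data. The name `ridge_fit` is made up for the illustration.

```python
import numpy as np

def ridge_fit(H, y, lam):
    """Minimize RSS(w) + lam * ||w||_2^2 via (H^T H + lam I)^-1 H^T y.

    Every coefficient is penalized, including w_0.
    """
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

rng = np.random.default_rng(2)
H = rng.normal(size=(50, 4))  # hypothetical feature matrix
y = H @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.5, size=50)

w_ls, *_ = np.linalg.lstsq(H, y, rcond=None)  # plain least squares
w_zero = ridge_fit(H, y, lam=0.0)             # lambda = 0
w_mid = ridge_fit(H, y, lam=10.0)             # somewhere in between
w_huge = ridge_fit(H, y, lam=1e12)            # stand-in for lambda = infinity

for name, w in [("lambda=0", w_zero), ("lambda=10", w_mid), ("lambda=1e12", w_huge)]:
    print(name, np.round(w, 4), "||w||_2 =", np.linalg.norm(w))
```

With lambda at 0 the result matches least squares, with the huge lambda the coefficients are numerically zero, and the in-between value gives a coefficient vector whose norm sits between those two.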

<img src="images/lec4_pic12.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/xYzTA/the-resulting-ridge-objective-and-its-extreme-solutions) 5:00*

<!--TEASER_END-->

Okay, I wanted to mention that this is referred to as Ridge regression. And that's also known as doing L2 regularization. Because, for reasons that we'll describe a little bit more later in this module, we're regularizing the solution to the old objective that we had, using this L2 norm term. 

<img src="images/lec4_pic13.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/xYzTA/the-resulting-ridge-objective-and-its-extreme-solutions) 5:50*

<!--TEASER_END-->

## 3) How ridge regression balances bias and variance

Let's talk about this in the context of the bias variance trade-off. 

And what we saw is that when we had very large lambda, we had a solution with very high bias but low variance. One way to see this is to think about cranking lambda all the way up to infinity: in that limit, we get coefficients shrunk to zero, and clearly that's a model with high bias but low variance. In fact it has no variance at all, since it doesn't change no matter what data you give me. 

On the other hand, when we had very small lambda, we have a model that is low bias but high variance. To see this, think about setting lambda to zero, in which case we get out just our old solution, our old least squares (minimizing residual sum of squares) fit. And there we saw that for higher complexity models you're gonna have low bias but high variance. 

So what we see is this lambda tuning parameter controls our model complexity and controls this bias variance trade-off.
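
A small simulation can make this trade-off visible. This is a sketch under stated assumptions rather than the lecture's demo: the true function, noise level, polynomial degree, and the helper names (`ridge_fit`, `features`, `bias_and_variance`) are all invented for illustration, and the closed-form ridge solution is assumed. The inputs are held fixed and only the noise is redrawn, so we can see how much the prediction at one point moves from dataset to dataset.

```python
import numpy as np

def ridge_fit(H, y, lam):
    """Closed-form minimizer of RSS(w) + lam * ||w||_2^2 (all coefficients penalized)."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def features(x, degree=4):
    """Polynomial features 1, x, ..., x^degree as rows."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

rng = np.random.default_rng(3)
x_train = np.linspace(0.0, 1.0, 30)  # fixed inputs; only the noise is redrawn
H_train = features(x_train)
x0 = 0.5                             # point where we inspect the prediction
truth = np.sin(4.0 * x0)             # hypothetical true function value at x0

def bias_and_variance(lam, n_datasets=200, noise=0.3):
    """Absolute bias and variance of the prediction at x0 across simulated datasets."""
    preds = np.empty(n_datasets)
    for k in range(n_datasets):
        y = np.sin(4.0 * x_train) + rng.normal(scale=noise, size=x_train.size)
        w = ridge_fit(H_train, y, lam)
        preds[k] = (features(x0) @ w)[0]
    return abs(preds.mean() - truth), preds.var()

bias_small, var_small = bias_and_variance(lam=0.0)  # least squares end
bias_large, var_large = bias_and_variance(lam=1e4)  # heavily regularized end

print("lambda=0:   bias", bias_small, "variance", var_small)
print("lambda=1e4: bias", bias_large, "variance", var_large)
```

The heavily regularized fit barely moves across datasets but sits far from the truth, while the unregularized fit is close to the truth on average but moves around more.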

<img src="images/lec4_pic14.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/697eG/how-ridge-regression-balances-bias-and-variance) 1:00*

<!--TEASER_END-->

Let's return to our polynomial regression demo, but now using ridge regression and see if we can ameliorate the issues of over-fitting as we vary the choice of lambda. And so we're going to explore this ridge regression solution for a couple different choices of this lambda tuning parameter. 

<img src="images/lec4_pic15.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/697eG/how-ridge-regression-balances-bias-and-variance) 1:20*

<!--TEASER_END-->

## 4) The ridge coefficient path

Well, we've motivated analytically how the coefficients that we get when solving this ridge regression problem are gonna change for different settings of lambda. Specifically, we saw that when lambda is 0, we get our least squares solution. When lambda goes to infinity, we get very, very small coefficients approaching 0. And in between, we get some other set of coefficients, and we then explored this experimentally in the polynomial regression demo. 

But one thing that's interesting to draw is what's called the coefficient path for ridge regression, which shows how the coefficients change as you vary lambda all the way from 0 up towards infinity. So how does my solution change as a function of lambda? What we're doing in this plot is drawing this for our housing example, where we have eight different features: number of bedrooms, number of bathrooms, square feet of living space, square feet of the lot, number of floors, the year the house was built, the year the house was renovated, and whether or not the property is waterfront. We're just gonna use each of these inputs directly as a feature. For each one, we're drawing the coefficient value (for example, the coefficient for square feet living) at some specific choice of lambda, and how that coefficient varies as I increase lambda. I'm showing this for each one of the eight coefficients. 

And I just want to briefly mention that in this figure, we've rescaled the features so that each one of these different inputs has unit norm. That's why all of these coefficients are roughly on the same scale, roughly the same order of magnitude. Okay, so what we see in this plot is that as lambda goes towards 0, or specifically when it's at 0, the values of the coefficients, each of these circles touching this line, give my $\hat w^{LS}$, the least squares solution. And as I increase lambda out towards infinity, I see that my solution $\hat w$ approaches 0; the whole vector of coefficients is going to 0. We haven't made lambda large enough in this plot to see them get really, really close to 0, but you can see the trend happening here. 

And then there's some sweet spot in this plot, which we're gonna talk about later in this module. This is gonna be some $\lambda ^* $ (lambda star), the value of lambda that we wanna use when we're selecting our specific regularized model for forming predictions. We're gonna discuss how we choose which lambda to use later in the module. But for now, the main point of this plot is to realize that for every value of lambda, every slice of this plot, we get a different solution, a different $\hat w$ vector. 
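
A coefficient path like this can be computed by solving the ridge problem once per value of lambda. The sketch below is an illustration only: it uses random synthetic features as a stand-in for the eight housing inputs, assumes the closed-form ridge solution, and the grid of lambda values is an arbitrary choice. The name `ridge_fit` is made up for the example.

```python
import numpy as np

def ridge_fit(H, y, lam):
    """Closed-form minimizer of RSS(w) + lam * ||w||_2^2 (all coefficients penalized)."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

rng = np.random.default_rng(4)
H = rng.normal(size=(100, 8))      # synthetic stand-in for eight features
H = H / np.linalg.norm(H, axis=0)  # rescale each feature column to unit norm
true_w = rng.normal(scale=5.0, size=8)
y = H @ true_w + rng.normal(scale=0.1, size=100)

lambdas = np.logspace(-3, 3, 25)   # arbitrary grid from small to large lambda
path = np.array([ridge_fit(H, y, lam) for lam in lambdas])  # one row per lambda
norms = np.linalg.norm(path, axis=1)

for lam, w in zip(lambdas[::6], path[::6]):
    print(f"lambda={lam:10.3f}  w_hat={np.round(w, 2)}")
```

Each row of `path` is one slice of the plot, and plotting each column against `lambdas` on a log-scaled axis would give one curve per feature.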

<img src="images/lec4_pic16.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/Ki05k/the-ridge-coefficient-path) 3:00*

<!--TEASER_END-->