# 1) Defining how we assess performance

## What do we mean by "loss"?

<img src="images/lec3_pic01.png">
<img src="images/lec3_pic02.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/cGUQ3/what-do-we-mean-by-loss) 1:00*

<!--TEASER_END-->

How do we formalize this notion of how much we're losing? In machine learning, we do this by defining something called a loss function.

The loss function specifies the cost incurred when the true observation is $y$ and I make some other prediction. A bit more explicitly, we're going to estimate our model parameters, $\hat w$, and use them to form predictions.
- $f_{\hat w}(x) = \hat f(x)$ is our predicted value at some input $x$.

The loss function $L(y, f_{\hat w}(x))$ measures the difference between these two things: the true value $y$ and the predicted value $f_{\hat w}(x)$.

There are a couple of ways in which we could define a loss function. Very common choices include absolute error, $|y - f_{\hat w}(x)|$, which just looks at the absolute value of the difference between your true value and your predicted value. Another common choice is squared error, $(y - f_{\hat w}(x))^2$, where, instead of the absolute value, you look at the square of that difference. That means you have a very high cost if the difference is large, relative to absolute error.
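To make the two choices concrete, here is a minimal sketch of both loss functions. The function names and the example prices are made up for illustration; they are not from the lecture.

```python
import numpy as np

def absolute_error(y, y_pred):
    """Absolute error loss: |y - f(x)|, computed elementwise."""
    return np.abs(np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float))

def squared_error(y, y_pred):
    """Squared error loss: (y - f(x))^2, computed elementwise."""
    return (np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2

# Hypothetical sale prices in thousands of dollars.
y_true = np.array([500.0, 500.0])
y_hat = np.array([490.0, 400.0])  # one prediction off by 10, one off by 100

abs_loss = absolute_error(y_true, y_hat)  # the larger miss costs 10x more
sq_loss = squared_error(y_true, y_hat)    # the larger miss costs 100x more
```

A miss that is ten times larger costs ten times more under absolute error but a hundred times more under squared error, which is the "very high cost" for large differences described above.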

<img src="images/lec3_pic03.png">
<img src="images/lec3_pic04.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/cGUQ3/what-do-we-mean-by-loss) 3:30*

<!--TEASER_END-->

# 2) 3 measures of loss and their trends with model complexity

## 1) Training error: assessing loss on the training set

The first measure of error of our predictions that we can look at is something called training error. And we discussed this at a high level in the first course of the specialization, but now let's go through it in a little bit more detail.

So, to define training error, we first have to define training data. Typically you have some dataset, shown here as circles, and we choose our training dataset as some subset of these points. The greyed circles are the ones that are not included in the training set; the blue circles are the ones we're keeping. Then we take our training data and, as we've discussed in previous modules of this course, use it to fit our model, that is, to estimate our model parameters. As an example, with this dataset maybe we choose to fit some quadratic function to the data, and to fit this quadratic function we're gonna minimize the residual sum of squares on these training data points.

<img src="images/lec3_pic05.png">
<img src="images/lec3_pic06.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/VN4Qo/training-error-assessing-loss-on-the-training-set) 1:00*

<!--TEASER_END-->

So, now we have our estimated model parameters, $\hat w$, and we want to assess the training error of that estimated model. To do that, we first need to choose a loss function: maybe squared error, maybe absolute error.

Training error is then defined simply as the average loss over the training points. Mathematically:
$$\dfrac{1}{N} \sum_{i=1}^N L(y_i, f_{\hat w}(x_i))$$
- $N$: the total number of observations in my training set

To be very clear, the estimated parameters were fit on the training set. They minimize the residual sum of squares for the same training points over which we're now defining this training error.
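The definition above can be sketched in a few lines. The data here is invented, and `np.polyfit` is used only as a convenient least-squares fit for a quadratic; it is an assumption of this sketch, not the course's own tooling.

```python
import numpy as np

def squared_error(y, y_pred):
    """Squared error loss, elementwise."""
    return (y - y_pred) ** 2

def average_loss(loss, y, y_pred):
    """Average of loss(y_i, prediction_i) over the N points supplied."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(loss(y, y_pred)))

# Made-up training set: square feet (thousands), price (thousands of dollars).
x_train = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y_train = np.array([300.0, 420.0, 500.0, 640.0, 700.0])

# Estimate w_hat by minimizing residual sum of squares on the training points.
w_hat = np.polyfit(x_train, y_train, deg=2)
train_pred = np.polyval(w_hat, x_train)

# Training error: average loss over the same points used for fitting.
training_error = average_loss(squared_error, y_train, train_pred)

# With squared error, this equals RSS / N.
rss = float(np.sum((y_train - train_pred) ** 2))
```

Passing the loss in as an argument keeps the "average loss" definition separate from the particular choice of loss, so absolute error could be swapped in without changing `average_loss`.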

<img src="images/lec3_pic07.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/VN4Qo/training-error-assessing-loss-on-the-training-set) 2:00*

<!--TEASER_END-->

So, we can go through this pictorially in the following example, where we're specifically using squared error as our loss function. In this case, our training error is simply $\dfrac{1}{N}$ times the sum of the squared differences between our actual house sales price and our predicted house sales price, where that sum is taken over all houses in our training data set. So when we choose squared error as our loss function, the training error is exactly $\dfrac{1}{N}$ times our residual sum of squares. Just be aware of that when you're computing training error and reporting these numbers: here we're defining it as the average loss, not the total.

<img src="images/lec3_pic08.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/VN4Qo/training-error-assessing-loss-on-the-training-set) 3:00*

<!--TEASER_END-->

More formally, we can write our training error as follows, and then we can define something commonly referred to as RMSE, whose full name is root mean squared error. RMSE is simply the square root of our average squared loss on the training houses, that is, the square root of our training error. The reason one might look at RMSE is that its units, in this case, are just dollars, whereas the units of our training error were dollars squared.
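As a small illustration of the units point, here is a sketch with invented numbers where every prediction misses by exactly 10 (thousand dollars). The function name `rmse` is just a label chosen for this example.

```python
import numpy as np

def rmse(y, y_pred):
    """Root mean squared error: square root of the average squared error."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

# Hypothetical prices in thousands of dollars; each prediction is off by 10.
y_true = np.array([300.0, 400.0, 500.0, 600.0])
y_hat = np.array([310.0, 390.0, 510.0, 590.0])

mse_value = float(np.mean((y_true - y_hat) ** 2))  # in (thousand dollars)^2
rmse_value = rmse(y_true, y_hat)                   # back in thousand dollars
```

Here the average squared error is 100 in squared units, while the RMSE is 10 in the original units, which matches the size of a typical miss.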

<img src="images/lec3_pic09.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/VN4Qo/training-error-assessing-loss-on-the-training-set) 3:39*

<!--TEASER_END-->

Now that we've defined training error, we can look at how training error behaves as model complexity increases. To start with, let's look at the simplest possible model you might fit, which is just a constant model. You see that there is pretty significant training error.

Then let's say I fit a linear model. Well, these are all linear models we're looking at, since it's linear regression; what I mean is just fitting a line to the data. And you see that my training error has gone down.

Then I fit a quadratic function and again training error goes down. As I increase my model complexity to, say, this higher-order polynomial, I have very low training error, just this one pink bar here. So, training error decreases quite significantly with model complexity.

So, there's a decrease in training error as you increase your model complexity. And why is that? Well, it's pretty intuitive, because the model was fit on the training points and then I'm asking how well it fits those same points. As I increase the model complexity, I'm better and better able to fit my training data points. So, when I go to assess my training error with these high-complexity models, I have very low training error.
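This trend can be checked numerically. The sketch below uses synthetic data (a noisy quadratic, generated with an arbitrary seed) and polynomial degree as a stand-in for model complexity; both are assumptions of this example rather than the lecture's housing data.

```python
import numpy as np

# Synthetic training data: a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=x.size)

def training_error(degree):
    """Fit a polynomial of the given degree by least squares and
    return its average squared error on the same training points."""
    w_hat = np.polyfit(x, y, deg=degree)
    return float(np.mean((y - np.polyval(w_hat, x)) ** 2))

# Constant, line, quadratic, then higher-order polynomials.
degrees = [0, 1, 2, 4, 6]
train_errors = [training_error(d) for d in degrees]

for d, err in zip(degrees, train_errors):
    print(f"degree {d}: training error {err:.5f}")
```

Because each lower-degree polynomial is a special case of the higher-degree one, the least-squares fit of the more complex model can never do worse on the training points, so the printed errors should not increase as the degree grows.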

<img src="images/lec3_pic10.png">
<img src="images/lec3_pic11.png">
<img src="images/lec3_pic12.png">
<img src="images/lec3_pic13.png">
<img src="images/lec3_pic14.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/VN4Qo/training-error-assessing-loss-on-the-training-set) 5:00*

<!--TEASER_END-->

So, a natural question is whether training error is a good measure of predictive performance. What we're showing here is one of our high-complexity, high-order polynomial models that had very low training error, so it fit those training data points really well. But how's it gonna perform on some new house?

<img src="images/lec3_pic15.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/VN4Qo/training-error-assessing-loss-on-the-training-set) 6:00*

<!--TEASER_END-->

So, in particular, maybe we're looking at a house in this gray region, with this range of square feet. The question is: is there something particularly wrong with having $x_t$ square feet? What our fitted function is saying is that houses with roughly $x_t$ square feet are less valuable than houses with fewer square feet, because there's this dip down in the function. Do we really believe that this is a true dip in value, that these houses are just less desirable than houses with fewer or more square feet? Probably not. So, what's going wrong here?

<img src="images/lec3_pic16.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/VN4Qo/training-error-assessing-loss-on-the-training-set) 6:45*

<!--TEASER_END-->

The issue is that training error is overly optimistic as an assessment of predictive performance. That's because these parameters, $\hat w$, were fit on the training data. They were fit to minimize residual sum of squares, which is directly related to training error (with squared error loss, training error is the residual sum of squares divided by $N$). And then we're using training error to assess predictive performance, but that's gonna be very optimistic, as this picture shows. So, in general, having small training error does not imply having good predictive performance, unless your training data set is really representative of everything that you might see out there in the world.

<img src="images/lec3_pic17.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/VN4Qo/training-error-assessing-loss-on-the-training-set) 7:30*

<!--TEASER_END-->

## 2) Generalization error: what we really want

So, instead of using training error to assess our predictive performance, what we'd really like to do is analyze something that's called generalization error, or true error. In particular, we really want an estimate of what the loss is, averaged over all houses that we might ever see in our neighborhood. But in our dataset we only have a few examples of houses that were sold. There are lots of other houses in our neighborhood that we don't have in our dataset, or other houses that you might imagine having been sold.

<img src="images/lec3_pic18.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/CDx5h/generalization-error-what-we-really-want) 0:30*

<!--TEASER_END-->

Okay, so to compute this estimate over all houses that we might see, we'd like to weight these house pairs (the pair of house attributes and house sales price) by how likely each pair is to occur. To do this, we can think about defining a distribution, in this case over the square feet of houses in our neighborhood.

What this picture is showing is a distribution that says we're very unlikely to see very small houses, with a low number of square feet. And we're also very unlikely to see really, really massive houses. So there's some bell curve to this: there's some sweet spot of typical houses in our neighborhood, and then the likelihood drops off from there.

<img src="images/lec3_pic19.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/CDx5h/generalization-error-what-we-really-want) 1:30*

<!--TEASER_END-->

Likewise, we can define a distribution that says, for a given square footage of a house, what's the distribution over the sales price of that house? Let's say the house has 2,640 square feet. Maybe I expect the house price to be somewhere between $680,000 and $950,000. That might be a typical range. But of course, you might see much lower or higher valued houses, depending on the quality of that house.

<img src="images/lec3_pic20.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/CDx5h/generalization-error-what-we-really-want) 1:39*

<!--TEASER_END-->

Formally, when we define our generalization error, we're taking the average value of our loss, weighted by how likely those pairs are to occur.

So specifically, we estimate our model parameters on our training data set, and that's what gives us $\hat w$. That defines the model we're using for prediction. Then we have our loss function, assessing the cost of predicting $f_{\hat w}(x)$ at our square footage $x$ when the true value was $y$. And then we average over all possible $(x, y)$ pairs, weighted by how likely they are according to those distributions over square feet and over value given square feet.
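Written out, the weighted average described above looks roughly like the following. The notation $p(x)$ for the distribution over square feet and $p(y \mid x)$ for the distribution over price given square feet is an assumption made here for illustration; the lecture slide may use different symbols.

$$\text{generalization error} = E_{x,y}\left[L(y, f_{\hat w}(x))\right] = \int \int L(y, f_{\hat w}(x)) \, p(y \mid x) \, p(x) \, dy \, dx$$

The parameters $\hat w$ are held fixed at the values estimated from the training data; only the pair $(x, y)$ is being averaged over.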

<img src="images/lec3_pic21.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/CDx5h/generalization-error-what-we-really-want) 3:00*

<!--TEASER_END-->

Let's go back to these plots of looking at error versus model complexity. But in this case let's quantify our generalization error as a function of this complexity.

To do this, I'm showing this crazy blue region here, with gradation going from white to darker blue, which is the distribution of houses that I'm likely to see. The white region marks the houses that I'm very likely to see, and as I go further away from the white region I get to less likely house sale prices given a specific square foot value.

So when I think about generalization error, I take my fitted function (remember, this green line was fit on the training data, which are these blue circles) and ask how well it predicts houses in this shaded blue region, weighted by how likely they are, that is, how close they are to that white region.

Okay, so what I see here is this constant model, which really doesn't approximate things well except maybe in this region here. So overall it has a reasonably high generalization error. Then I can go to my more complex model.

<img src="images/lec3_pic22.png">
<img src="images/lec3_pic23.png">
<img src="images/lec3_pic24.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/CDx5h/generalization-error-what-we-really-want) 5:00*

<!--TEASER_END-->

Then I get to this much higher order polynomial, and when we were looking at training error, the training error was lower, right? But now, when we think about generalization error, we actually see that the generalization error is gonna go up relative to the simpler model.


<img src="images/lec3_pic25.png">
<img src="images/lec3_pic26.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/CDx5h/generalization-error-what-we-really-want) 6:50*

<!--TEASER_END-->

So our generalization error in general will have some shape where it first goes down, and then we get to a point where the error starts increasing. The error starts increasing because we're getting to these overly complex models that fit the training data really well but don't generalize to other houses that we might see.

But importantly, in contrast to training error, we can't actually compute generalization error, because everything was defined relative to this true distribution, the true way in which the world works: how likely houses are to appear over all possible square feet and all possible house values. And of course, we don't know what that is. So, this is our ideal picture, our cartoon of what would happen, but we can't actually go along and compute these different points.

<img src="images/lec3_pic27.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/CDx5h/generalization-error-what-we-really-want) 8:00*

<!--TEASER_END-->

## 3) Test error: what we can actually compute

So we can't compute generalization error, but we want some better measure of our predictive performance than training error gives us. And so this takes us to something called test error, and what test error is going to allow us to do is approximate generalization error. 

And the way we're gonna do this is by approximating the error, looking at houses that aren't in our training set.

<img src="images/lec3_pic28.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/pq0SM/test-error-what-we-can-actually-compute) 1:00*

<!--TEASER_END-->

So instead of including all these colored houses in our training set, we're gonna shade some of them out, and make these shaded gray houses into what's called a test set.

<img src="images/lec3_pic29.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/pq0SM/test-error-what-we-can-actually-compute) 1:15*

<!--TEASER_END-->

And when we go to fit our models, we're going to fit them on the training data set only. But then when we go to assess the performance of a model, we can look at these test houses, and these are hopefully going to serve as a proxy for everything out there in the world. So hopefully, our test data set is a good stand-in for other houses that we might see, or at least good enough to judge how well a given model is performing.

<img src="images/lec3_pic30.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/pq0SM/test-error-what-we-can-actually-compute) 1:25*

<!--TEASER_END-->

So test error is gonna be our average loss computed over the houses in our test data set:
$$\dfrac{1}{N_{test}} \sum_{i \text{ in test set}} L(y_i, f_{\hat w}(x_i))$$
- $N_{test}$: the number of houses in our test data
- $\hat w$: very important, the estimated parameters were fit on the training data set

Okay, so even though this function looks very much like training error, the sum is over the test houses, while the function we're evaluating was fit on training data. The parameters in this fitted function never saw the test data.
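The distinction between where the parameters come from and where the loss is averaged can be sketched as follows. The house data is invented, and `np.polyfit` stands in for the least-squares fit; neither comes from the lecture.

```python
import numpy as np

# Made-up houses: square feet (thousands), price (thousands of dollars).
x_train = np.array([1.0, 1.4, 1.9, 2.3, 2.8, 3.2])
y_train = np.array([310.0, 400.0, 480.0, 600.0, 690.0, 760.0])
x_test = np.array([1.2, 2.0, 3.0])
y_test = np.array([350.0, 520.0, 720.0])

# w_hat is estimated from the training houses only.
w_hat = np.polyfit(x_train, y_train, deg=2)

def average_squared_error(x, y):
    """Average squared error of the fitted model on the given houses."""
    return float(np.mean((y - np.polyval(w_hat, x)) ** 2))

# Same fitted function, two different sets of houses to average over.
training_error = average_squared_error(x_train, y_train)  # divides by N = 6
test_error = average_squared_error(x_test, y_test)        # divides by N_test = 3
```

Note that `x_test` and `y_test` never appear in the call to `np.polyfit`, which is the code-level version of "the parameters never saw the test data".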


<img src="images/lec3_pic31.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/pq0SM/test-error-what-we-can-actually-compute) 2:20*

<!--TEASER_END-->

So just to illustrate this, we might think of fitting a quadratic function through this data, where we're gonna minimize the residual sum of squares on the training points, those blue circles, to get our estimated parameters $\hat w$. 


<img src="images/lec3_pic32.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/pq0SM/test-error-what-we-can-actually-compute) 2:33*

<!--TEASER_END-->

Then when we go to compute our test error, which in this case again uses squared error as an example, we're computing this error over the test points, all these grey circles here. So test error is $\dfrac{1}{N_{test}}$ times the sum of the squared differences between our true house sales prices and our predicted prices, summing over all houses in our test data set.

<img src="images/lec3_pic33.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/pq0SM/test-error-what-we-can-actually-compute) 2:45*

<!--TEASER_END-->

**Let's summarize our measures of error as a function of model complexity**

- Our training error decreased with increasing model complexity.
- In contrast, our generalization error (or true error) went down for some period of time. But then we started getting to overly complex models that didn't generalize well, and the generalization error started increasing.
- Our test error is a noisy approximation of generalization error. If our test data set included everything we might ever see in the world, in proportion to how likely it was to be seen, then test error would be exactly our generalization error. But of course, our test data set is just some finite data set, and we're using it to approximate generalization error, so it's gonna be some noisy version of this curve here.
 
Test error is the thing that we can actually compute. Generalization error is the thing that we really want. 

<img src="images/lec3_pic34.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/pq0SM/test-error-what-we-can-actually-compute) 3:00*

<!--TEASER_END-->

## 4) Defining overfitting

The notion of overfitting is as follows. Suppose you have a model with estimated parameters $\hat w$, and suppose there exists another model with estimated parameters, I'll just call them $w'$.

The model with parameters $\hat w$ is overfit if both of these conditions hold:
- training error ($\hat w$) < training error ($w'$).
- true error ($\hat w$) > true error ($w'$).

Generally, the models that are overfit are the ones that have smaller training error. These are the ones that are very highly fit to the training data set but don't generalize well. Whereas the points on the other half of this space are the ones that are not really well fit to the training data and also don't generalize well.
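The two conditions can be written as a tiny check. The function name and the error values below are hypothetical. Since true error cannot be computed, in practice one would have to substitute an approximation such as test error, which is an assumption on top of the definition.

```python
def is_overfit(train_err_w_hat, true_err_w_hat, train_err_w_prime, true_err_w_prime):
    """Return True if the model with parameters w_hat is overfit relative to
    the model with parameters w_prime: lower training error, higher true error."""
    return (train_err_w_hat < train_err_w_prime
            and true_err_w_hat > true_err_w_prime)

# Hypothetical numbers: w_hat fits the training set better than w_prime
# but has worse true error, so w_hat is overfit.
overfit_case = is_overfit(train_err_w_hat=0.5, true_err_w_hat=9.0,
                          train_err_w_prime=2.0, true_err_w_prime=3.0)

# Swapping the roles: w_prime is not overfit relative to w_hat.
swapped_case = is_overfit(train_err_w_hat=2.0, true_err_w_hat=3.0,
                          train_err_w_prime=0.5, true_err_w_prime=9.0)
```

The check is always relative to some other model $w'$: a model is called overfit because a competitor exists that trades a bit of training error for better true error.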

<img src="images/lec3_pic35.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/u8c2x/defining-overfitting) 2:00*

<!--TEASER_END-->

## 5) Training/test split

So we've said to assess the performance of our model, we really need to have a test data set carved out from our full data set. So, this raises the question of, how do I think about dividing the data set into training data versus test data? 

- If I put too few points in my training set, then I'm not going to estimate my model well. And so, I'm going to have clearly bad predictive performance because of that.
- If I put too few points in my test set, that's gonna be a bad approximation to generalization error.

A general rule of thumb is that you typically want just enough points in your test set to approximate generalization error well, and you want all your remaining points in your training data set, because you want as many points as possible in your training data set to learn a good model.
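One simple way to carve out a test set is a random split of the row indices. The sketch below is a minimal version of that idea; the function name, the 20% test fraction, and the seed are arbitrary choices for illustration, not a recommendation from the lecture.

```python
import numpy as np

def train_test_split_indices(n_points, test_fraction, seed=0):
    """Randomly partition the indices 0..n_points-1 into training and test
    indices, with roughly test_fraction of the points in the test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_points)
    n_test = int(round(n_points * test_fraction))
    test_idx = shuffled[:n_test]
    train_idx = shuffled[n_test:]   # every remaining point goes to training
    return train_idx, test_idx

# Example: 10 houses, 2 held out for testing and 8 kept for training.
train_idx, test_idx = train_test_split_indices(n_points=10, test_fraction=0.2)
```

Shuffling before splitting avoids accidentally putting, say, only the largest houses in the test set when the data happens to be sorted.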

<img src="images/lec3_pic36.png">
<img src="images/lec3_pic37.png">
<img src="images/lec3_pic38.png">
<img src="images/lec3_pic39.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/ml-regression/lecture/qn2vj/training-test-split) 1:00*

<!--TEASER_END-->

# 3) 3 sources of error and the bias-variance tradeoff

## 1) Irreducible error and bias