Remember how, in chapter 5, we asked which is the "better" measure of central tendency, the mean or the median? This is how we are defining "better": lower error when we use that statistic as the model for the data distribution.


Statisticians tend to model unexplained variation, whether real or induced by data collection, as though it were generated by a random process (e.g., a uniform, normal, or some other probability distribution). They do this because it makes the problem tractable: if the data generation process (DGP) is random, it is easy to predict what unexplained variation should look like. They can then compare what they predicted, assuming a random process, with what the data distribution actually looks like.


As it happens, in the absence of other information about the objects being studied, the mean of our sample is the best estimate we have of the actual mean of the population. The deviations above the sample mean exactly balance the deviations below it, and across repeated samples the sample mean does not systematically land too high or too low, making it an *unbiased* estimator of the parameter. Because it is our best guess of what the population parameter is, it is also the best predictor we have of the value of a subsequent observation. While it will almost certainly be wrong for any single observation, the mean will do a better job, in terms of squared error, than any other single number.

# Chapter 10 - Quantifying Model Error

## 10.1 Total error around a model

Up to now we have developed the idea that a statistical model can be thought of as a number, a predicted value for the outcome variable. We are trying to model the data generation process, but because we can’t see the data generation process directly, we fit a model to our data and estimate parameters.

Using the DATA = MODEL + ERROR framework, we have defined error as the residual that is left after we account for the variance in our data that the model can explain. In the case of our simple model for a quantitative outcome variable, the model is the mean, and the error (or residual) is the deviation of each score above or below the mean.

We represent the simple model like this using the notation of the General Linear Model:

$$ Y_i = b_0 + e_i $$

This equation represents each score in our data as the sum of two components: the mean of the distribution (represented by b<sub>0</sub>), and the deviation of that score above or below the mean (represented as e<sub>i</sub>). In other words, DATA = MODEL + ERROR.

In this chapter, we will dig deeper into the ERROR part of our DATA = MODEL + ERROR framework. In particular, we will develop methods for quantifying the total amount of error around a model, and for modeling the distribution of error itself.

Quantifying the total amount of error will help us compare models later on to see which one explains more variation. Modeling the distribution of error will help us to make more detailed predictions about future observations and more precise statements about the data generation process.

At the outset, it is worth remembering what the whole statistical enterprise is about: explaining variation. Once we have created a model, we can think about explaining variation in a new way, as reducing error around the model. 

We have noted before that the mean is a better model of a quantitative outcome variable when the spread of the distribution is smaller than when it is larger. When the spread is smaller, the residuals from the model are smaller. Quantifying the total error around a model will help us to know how good our models are, and which models are better than others.

To make this concrete, let’s consider our very simple model of the tiny sample (n = 6) of data from the last chapter, ```tiny_fingers```. Here is the data frame, including the predictions and residuals from ```tiny_empty_model```.


```r
# Tiny sample of six thumb measurements
student_ID <- c(1, 2, 3, 4, 5, 6)
thumb <- c(56, 60, 61, 63, 64, 68)

tiny_fingers <- data.frame(student_ID, thumb)

# Fit the empty model (mean only), then save its predictions and residuals
tiny_empty_model <- lm(thumb ~ NULL, data = tiny_fingers)
tiny_fingers$prediction <- predict(tiny_empty_model)
tiny_fingers$residual <- resid(tiny_empty_model)

tiny_fingers
```

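| student_ID | thumb | prediction | residual |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 56 | 62 | -6 |
| 2 | 60 | 62 | -2 |
| 3 | 61 | 62 | -1 |
| 4 | 63 | 62 | 1 |
| 5 | 64 | 62 | 2 |
| 6 | 68 | 62 | 6 |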


The histograms below show the distribution of ```thumb``` and the distribution of ```residual``` for our tiny data set. And of course, as you know by now, these distributions have the exact same shape, but different means.

<img src="images/ch10-dist-resid.png" width="1000">

It makes sense to use residuals to analyze error from the model. If we want to quantify total error, would we just add up all the residuals? Worse models should have more error, so the sum of all the errors should represent the “total” error, right?

Let’s do that, using one of the first R functions you learned, ```sum()```. The following code will add up all the residuals from our ```tiny_empty_model```.

```r
sum(tiny_fingers$residual)
```

The sum of all error in this model is actually 0 (or a number so tiny it's practically zero and is just a rounding error in the computer).

Although we might at first think that the sum of the residuals would be a good indicator of total error, we've discovered a fatal flaw in that approach: the sum of the residuals around the mean is equal to 0! If this were our measure of total error, all data sets would be equally well modeled by the mean, because the residuals around the mean would always sum to 0. Thus a data set widely spread out around the mean, and one tightly clustered around the mean, would have the same amount of error around this simple model. Clearly we need a different approach.
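To see the flaw in action, here is a minimal sketch in R. The two samples, `tight` and `wide`, are hypothetical values made up for illustration (they are not from the course data); both have a mean of 62 but very different spreads.

```r
# Two hypothetical samples with the same mean (62) but very different spread
tight <- c(60, 61, 62, 62, 63, 64)
wide  <- c(32, 47, 56, 68, 77, 92)

# Residuals from the empty model: each score minus the sample mean
tight_resid <- tight - mean(tight)
wide_resid  <- wide - mean(wide)

# Both sums come out to 0, even though wide is far more spread out
sum(tight_resid)
sum(wide_resid)

# all.equal() compares with a small tolerance, which absorbs any
# floating-point rounding in the sums
isTRUE(all.equal(sum(tight_resid), 0))
isTRUE(all.equal(sum(wide_resid), 0))

# The spreads really are different
range(tight)
range(wide)
```

The sum of residuals cannot tell these two samples apart, which is why it fails as a measure of total error.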

We can return to the measures of spread in a distribution that we learned in chapter 5 and apply them to describing total error in a statistical model. Because several of those measures describe spread as deviations away from the middle of a distribution, we can also use them to talk about deviations from a model based on the mean.

### Sum of Squares

First we'll use a measure of spread that we haven't worked with directly yet: the **sum of squares**. As we talked about in chapter 5, one way to get around the issue of positive and negative deviations adding up to zero is to square all those numbers before adding them together. That's all the sum of squares is: the sum of all the squared residuals after fitting a model to data. Mathematically, this is written as:

$$ \sum_{i=1}^{N}(Y_i-\bar{Y})^2$$
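As a worked instance of this formula, here is the arithmetic for the six ```thumb``` values in ```tiny_fingers```, using the residuals listed in the data frame above (the mean is 62):

$$ \sum_{i=1}^{6}(Y_i-62)^2 = (-6)^2 + (-2)^2 + (-1)^2 + 1^2 + 2^2 + 6^2 = 36 + 4 + 1 + 1 + 4 + 36 = 82 $$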

Since we already have the column ```residual``` in ```tiny_fingers```, we can easily square each residual. Adding those squared residuals together gives the overall sum of squares for our model.

```r
sum(tiny_fingers$residual^2)
```

There may be some rounding errors going on again, but you can see the sum of squares (or SS for short) is about 82, and not 0 this time. In addition, SS helps us distinguish better-fitting models from worse ones because it is a measure of total error that is minimized exactly at the mean. Since our goal in statistical modeling is to reduce error, this is a good thing. In any distribution of a quantitative variable, the mean is the point in the distribution at which SS is lower than at any other point. 
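One way to check the claim that SS is lowest at the mean is to compute the sum of squared deviations around several candidate numbers and compare. The sketch below does this for the ```thumb``` values; the helper `ss_around()` and the vector `candidates` are hypothetical names introduced here for illustration, not functions from the course.

```r
thumb <- c(56, 60, 61, 63, 64, 68)

# Hypothetical helper: sum of squared deviations of y around any single number
ss_around <- function(candidate, y) {
  sum((y - candidate)^2)
}

# Try a few candidate "models", including the mean (62)
candidates <- c(60, 61, 62, 63, 64)
ss_values <- sapply(candidates, ss_around, y = thumb)
data.frame(candidates, ss_values)

# optimize() searches numerically for the number that minimizes SS
best <- optimize(ss_around, interval = range(thumb), y = thumb)
best$minimum
mean(thumb)
```

Among the candidates, SS is smallest at 62 and grows as the candidate moves away from it in either direction. The numerical search in `optimize()` should land on approximately the same value as `mean(thumb)`.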

<div class="alert alert-block alert-info">
<b>Note</b>: It is worth pointing out that the advantage of SS is only there if our model is the mean. If we were to choose another number, such as the median, as our model of a distribution, we would probably choose a different measure of error. But our focus in this course is primarily on the mean.
</div>


### Variance

Sum of squares is a good measure of *total* variation if we are using the mean as a model. But it does have one important disadvantage. To see it, compare these two distributions:

<img src="images/ch10-ss.png" width="650">

The distribution on top is clearly less spread out than the one on the bottom, so we would expect less error if we used the mean to model it. However, because the top distribution has more data points, there are simply more squared residuals to add up. As a result, its sum of squares ends up larger than that of the more spread out distribution.
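The same effect can be reproduced with a small sketch in R. The two samples below are hypothetical (they are not the data behind the figure), and `ss()` is a helper name made up for this example.

```r
# Hypothetical data: a tightly clustered sample with many points...
tight_large <- rep(c(60, 61, 62, 63, 64), times = 50)   # n = 250
# ...and a widely spread sample with only a few points
wide_small <- c(50, 56, 62, 68, 74)                      # n = 5

# Hypothetical helper: sum of squared deviations around the sample mean
ss <- function(y) {
  sum((y - mean(y))^2)
}

# Sample sizes
length(tight_large)
length(wide_small)

# Sum of squares for each sample
ss(tight_large)
ss(wide_small)
```

Even though every value in `tight_large` is within 2 of its mean, its SS is larger than that of `wide_small`, simply because 250 squared residuals are being added up instead of 5.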