
In [None]:
# Run this first so it's ready by the time you need it
install.packages("ggformula")
install.packages("supernova")
install.packages("dplyr")
library(ggformula)
library(supernova)
library(dplyr)

studentdata <- read.csv("https://raw.githubusercontent.com/smburns47/Psyc158/main/studentdata.csv")

# Chapter 11 - Adding an Explanatory Variable

As we discussed previously, in the absence of other information about the objects being studied, the mean of our sample (an estimate of the population mean) is the best single number we have for predicting the value of any one data point. A statistical model for this prediction (called the null model) would look like:

$$ Y_i = b_0 + e_i $$

Where $b_0$ represents the mean of our data sample, which we use to predict the values of $Y_i$. $b_0$ is our model: it is our prediction, $\hat{Y}_i$, of what each $Y_i$ will be, while $e_i$ is all the variation in $Y_i$ that we couldn't explain with this model. Since $b_0$ is the same number for every data point, all of the variation in $Y_i$ ends up in $e_i$. So the null model isn't all that useful for making predictions in the end.

However, we don't have to be limited to using *just* a single number like the mean to make predictions. After all, we're interested in the data generation process, and we probably think there is some explanatory variable that contributes to the value of an outcome variable. For instance, earlier in chapter 9 we created a hypothesis about what contributes to reaction time of labeling the color of a word:

$$ \text{reaction time}_i = \beta_1\text{congruence}_i + \beta_2\text{speed}_i + \beta_3\text{font}_i + \beta_4\text{coordination}_i $$

This statistical model clearly has more things going into the equation. In this chapter we'll learn how to add more pieces to a statistical model to start incorporating our thoughts about the data generation process and make better predictions. We started with the null model in order to get some important ideas across, but certainly that’s not where we want to end up. It is time we start building models that include explanatory variables. We will still use the null model, but only as a reference point.

## 11.1 Explaining variation

Let’s first review what we mean by explaining variation. In chapter 6, we developed an intuitive idea of explanation by comparing the distribution of one variable across two different groups. So, for example, we looked at the distribution of thumb length broken down by sex, which we can see in the two histograms below.

In [None]:
gf_histogram(~ Thumb, data = studentdata, fill = ~ Sex) %>% 
 gf_facet_grid(., Sex ~ .)

You can clearly see that sex explains some of the variation in thumb length *in our data*. (This might not be true in the population; it’s always possible that we are being fooled by a sample that doesn’t accurately represent what’s true in the population.) When we break up thumb length by sex it looks like two separate, though overlapping distributions. In general, males have longer thumbs than females in our data.

If we assume that this relationship (between sex and thumb length) exists in the population, and not just in our data, we can use it to help us make a better prediction about a future observation. If you know that someone is male, you would make a different prediction of their thumb length than if you knew they were female.

It seems, then, that if we were to use a statistical model to make predictions about a person's thumb length, somehow incorporating information about their sex would be helpful - our predictions would be more accurate on average, and there would be less overall error in the model. 

## 11.2 Adding an explanatory variable to the model

In the previous chapters we introduced the idea of a statistical model as an equation that is meant to represent our best guess of the data generation process. This model generates a predicted score for each observation. We developed what we called the null model, in which we use the mean as the predicted score for each observation.

We represented this model in General Linear Model (GLM) notation like this:

$$ Y_i = b_0 + e_i $$

where $b_0$ represents the mean of the outcome variable in the sample. When we use the notation of the GLM, we must define the meaning of each symbol in context. $Y_i$, for example, could mean lots of different things, depending on what our outcome variable is. But we will always use it to represent the outcome variable. 

It is also important to remember that $b_0$ is just an estimate of the true mean in the population. To distinguish the true mean, which is unknown, from the estimate of the true mean we construct from our data, we use the Greek letter $\beta_0$ and write the null model like this:

$$ Y_i = \beta_0 + \epsilon_i $$

In the case of thumb length, the null model states that the DATA (each data point, represented as $Y_i$, which is each person’s measured thumb length), can be thought of as being generated by the combination of two inputs: the MODEL, represented as $\beta_0$ (which is the mean thumb length for everyone, usually called the **Grand Mean**) plus ERROR, which is each person’s residual from the model, represented by $\epsilon_i$.

<div class="alert alert-block alert-info">
<b>Note</b>: We use the term Grand Mean to refer to the mean of everyone's outcome value, in order to distinguish it clearly from other means such as the mean for subgroups within the sample.
</div>

It’s useful to illustrate the null model (and what we're about to do to it) with our ```tiny_data``` dataset. ```tiny_data```, you will recall, contains six people’s thumb lengths randomly selected from our complete ```studentdata``` dataset. This time, we'll also include their value on the ```Sex``` variable.


In [None]:
student_ID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")

tiny_data <- data.frame(student_ID, Thumb, Sex)
tiny_data

We can put this data into a basic scatter plot with ```Sex``` on the x-axis and ```Thumb``` on the y-axis in order to visualize how ```Sex``` might explain ```Thumb```. 

<img src="images/ch11-nullmodel.png" width="650">

In the above plot, we drew a blue horizontal line in order to mark where the Grand Mean of the whole ```tiny_data``` dataset is. This is the same value as $b_0$ - in other words, this is what we would predict everyone's thumb length to be if we were using the empty or null model. But there is plenty of error to this prediction - no data point is on this line. We could calculate the RMSE to find out how large the residuals are on average. 

Now let's try to take into account the effect of ```Sex``` and improve our prediction. One thing we could do is, instead of using the Grand Mean of ```Thumb``` to predict everyone's thumb length, we could first consider whether or not the prediction we want to make is for a male or female. Then, we could use the mean of *just that group* in order to make our prediction. 

<img src="images/ch11-sexpredictor.png" width="650">

A model that takes ```Sex``` into account generates a different prediction for a male than it does for a female. Error is still measured the same way, as the deviation of each person’s measured thumb length from their predicted thumb length. But this time, the prediction is each person’s group mean (male or female) instead of the Grand Mean, and this prediction varies as a function of the predictor variable ```Sex```. 
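As a rough sketch of this idea in code, the group-mean predictions can be worked out by hand with base R. The data frame below re-creates the six rows of the tiny dataset under the illustrative name `tiny_example` (a name chosen here so that nothing you have already made gets overwritten), and the column names `Predicted` and `Residual` are likewise just for illustration.

```r
# Hypothetical copy of the six-row tiny dataset
tiny_example <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)

# Grand Mean: the single prediction made by the null model
grand_mean <- mean(tiny_example$Thumb)

# Group means: one prediction per value of Sex
group_means <- tapply(tiny_example$Thumb, tiny_example$Sex, mean)

# Each person's prediction is the mean of their own group
tiny_example$Predicted <- as.numeric(group_means[tiny_example$Sex])

# Error is still measured as observed value minus predicted value
tiny_example$Residual <- tiny_example$Thumb - tiny_example$Predicted

grand_mean
group_means
tiny_example
```

The residuals in the last column are measured from each person's own group mean rather than from the Grand Mean.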

## 11.3 Specifying the model form

The null model only had one parameter in it, the Grand Mean $b_0$. To take into account the effect of Sex, we need to add in another parameter that will change our prediction depending on whether someone is male or female. This makes the Sex model a two-parameter model. 

One way to specify this equation is to use the mean of one group (say, females) as $b_0$, and then add an extra amount to that value if someone is actually male. In other words, we could specify another parameter, $b_1$, as the *difference* between male and female mean thumb lengths. But this should only be added if someone is male, so let's multiply this parameter by 1 if Sex is "male", and by 0 if Sex is "female" (effectively ignoring this addition in the case of females).

Here is how to write this in GLM form: 

$$ Y_i = b_0 + b_1X_i + e_i $$

In this equation, $b_0$ is the group mean for "female" and $b_1$ is the difference between the group means of male and female. $X_i$ is a variable, in this case ```Sex``` (which takes the value of either 0 for female or 1 for male). 

We're still using the DATA = MODEL + ERROR framework for this. Except this time, our MODEL takes into account the value of Sex and has multiple components ($b_0 + b_1X_i$) instead of one component ($b_0$). $b_0$ also no longer stands for the Grand Mean of this sample, but the *group* mean of whatever group we assigned to be 0 in the Boolean variable Sex (called the **reference group**). $b_1$ is called the **effect** of the predictor, because it's the effect that changing the value of X has on our prediction of Y.

Defining the meaning of $b_0$ and $b_1$ this way is for a particular reason - it helps us calculate predictions for $Y_i$ no matter the value of $X_i$. To calculate our prediction of each person's thumb, we'd fill in the parameters with the group mean of female (59) and the effect of Sex, the difference between the group means of male and female (65 - 59 = 6):

$$ \hat{Y}_i = 59 + 6X_i $$

This equation would make one prediction (59) when the value of $X_i$ is 0 (female), and a different prediction (65) when the value of $X_i$ is 1 (male). 
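To make the arithmetic concrete, here is a minimal sketch that builds the prediction equation by hand. The vector names `x_male` and `thumb_mm` are made up for this illustration; `x_male` is a 0/1 coding of Sex that we type in ourselves.

```r
# Hypothetical 0/1 coding of Sex: 0 = female (reference group), 1 = male
x_male   <- c(0, 0, 0, 1, 1, 1)
thumb_mm <- c(56, 60, 61, 63, 64, 68)

# b0 is the mean of the reference group (female)
b0 <- mean(thumb_mm[x_male == 0])

# b1 is the difference between the male mean and the female mean
b1 <- mean(thumb_mm[x_male == 1]) - b0

# Prediction equation: Y-hat = b0 + b1 * X
y_hat <- b0 + b1 * x_male

b0
b1
y_hat
```

Multiplying $b_1$ by 0 leaves the female predictions at $b_0$, while multiplying by 1 shifts the male predictions up by the full difference.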

<img src="images/ch11-sexmodel.png" width="650">

## 11.4 Fitting the one-variable model

Now that you have learned how to specify a model with an explanatory variable (also frequently called a predictor), let’s learn how to fit the model using R.

Fitting a model, as a reminder, simply means automatically calculating the parameter estimates that minimize error in our data sample. We use the word “fitting” because we want to calculate the best estimate, the one that will result in the least amount of error and best "fit" our data. For the tiny data set, we could calculate the parameter estimates by hand — it’s just a matter of calculating the mean for males and the mean for females. But when the data set is larger, it is much easier to use code.

Using R, we will first fit the Sex model to the tiny dataset, just so you can see that R gives you the same parameter estimates you got before. After that we will fit it to the complete data set.

Here's the model form we are going to fit:

$$ Y_i = b_0 + b_1X_i + e_i $$

Note that the parts that are going to have different values for each observation ($X_i$ and $Y_i$) are called variables (because they vary). The parts that are going to have the same value for each observation ($b_0$ and $b_1$) are called parameter estimates.

We do not need to estimate the variables. Each student in the dataset already has a score for the outcome variable ($Y_i$) and the explanatory variable ($X_i$), and these scores vary across students. Notice that the subscript *i* is attached to the parts that are different for each person.

We do need to estimate the parameters because, as discussed previously, they are features of the population, and thus are unknown. The parameter estimates we calculate are those that best fit our particular sample of data. But we would have probably gotten different estimates if we had a different sample. Thus, it is important to keep in mind that these estimates are only that, and they are undoubtedly a bit off. Calling them estimates keeps us humble!

Parameter estimates don’t vary from person to person, so they don’t carry the subscript *i*.

To fit the Sex model we use ```lm()``` again. This time, instead of the right side of the formula being ```NULL```, we have a variable to put there. Thus, the formula argument of ```lm()``` is ```Thumb ~ Sex```. This means we are asking R to find a statistical model where "Thumb varies as a function of Sex." 

In [None]:
lm(Thumb ~ Sex, data = tiny_data)

Notice that the estimates are exactly what we used earlier: the first estimate, for $b_0$, is 59 (the mean for females); the second, $b_1$, is 6, which is the number of millimeters you need to add to the female average thumb length to get average male thumb length.

Notice also that the estimate for $b_0$ is labeled “intercept” in the output. You have encountered the concept of intercept before, when you studied how to plot a line in algebra. Remember that $y = mx + b$ is the equation for a line? $m$ represents the slope of the line, and $b$ represents the y-intercept. The General Linear Model notation is similar to this. It's like if we drew a line between the prediction for females and the prediction for males: the intercept is the value of Y when X=0 (female) and the effect of the predictor (labeled "Sexmale" in the output) is the slope between X=0 and X=1 (aka, the difference between males and females).

<img src="images/ch11-glmeq.png" width="500">
<img src="images/ch11-glmeq2.png" width="500">

If you want (and it’s a good idea), you can save the results of this model fit in a model object. Here’s the code to save the model fit in an object called ```tiny_sex_model```. Once you’ve saved the model, you can see its estimates by typing the name of the model, which gives the same output as above.

In [None]:
tiny_sex_model <- lm(Thumb ~ Sex, data = tiny_data)

#type the name of the saved model below to print out its output


Now that we have estimates for the two parameters (intercept and effect of Sex), we can put them in our model statement to yield: $\hat{Y}_i = 59 + 6X_i$.

You may have noticed that the values of Sex in ```tiny_data``` are the categorical strings ```female``` or ```male```, and not 1 or 0. We were able to run ```lm()``` anyway, so it seems like R is able to handle converting categorical data to Boolean data. But how does R know which level of Sex should be 0 and which should be 1? The answer to this question is, R doesn’t really know. If Sex is character data, it’s just taking whatever group comes first alphabetically (in this case,  ```female```) and making it the reference group. If Sex is a factor variable, the reference group is whichever level of the variable was specified to be the first one (by appearing first in the vector passed to ```levels = ``` in the ```factor()``` function). The mean of the reference group is the first parameter estimate ($b_0$ or the Intercept in the ```lm()``` output). R then takes the difference between the reference group and the second group (in this case, ```male```) and represents it with $b_1$.  
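If you would rather pick the reference group yourself, one option is to convert the variable to a factor and list the levels in the order you want. The sketch below assumes a fresh copy of the tiny dataset called `tiny_example` and a new column called `SexFactor`; both names are invented for this example.

```r
# Hypothetical copy of the six-row tiny dataset
tiny_example <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)

# Character data: the alphabetically first value ("female") is the reference group
coef_character <- coef(lm(Thumb ~ Sex, data = tiny_example))
coef_character

# Factor data: the first level listed ("male") becomes the reference group
tiny_example$SexFactor <- factor(tiny_example$Sex, levels = c("male", "female"))
coef_factor <- coef(lm(Thumb ~ SexFactor, data = tiny_example))
coef_factor
```

The two fits make identical predictions. Only the bookkeeping changes: the intercept becomes the mean of whichever group is the reference, and the sign of the effect flips.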

Let’s say, just for fun, that you changed the code for ```female``` into ```woman``` in the data frame. Because ```male``` now comes first in the alphabet, ```male``` becomes the reference group, and its mean is now the estimate for the intercept ($b_0$) when we automatically fit the model.

In [None]:
tiny_data <- mutate(tiny_data, Sex = case_match(Sex, "female" ~ "woman", "male" ~ "male"))
tiny_data
#male now comes first in the alphabet, so that value is the reference group
lm(Thumb ~ Sex, data = tiny_data)
#b0 is now the group mean of 'male' and b1 is the difference between male and woman

Making a categorical predictor like this into Boolean values is also called making a **dummy variable**. This does not get saved to your data frame - it's just a temporary computation R makes under the hood. You could supply a Sex variable as a Boolean of 0s and 1s already, but if you don't, R will automatically translate a categorical variable into a dummy variable for the purpose of fitting the model.
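If you are curious what that temporary dummy variable looks like, base R's `model.matrix()` function will show the numeric table that a formula gets translated into. This is only a peek under the hood, using an illustrative copy of the tiny dataset named `tiny_example`.

```r
# Hypothetical copy of the six-row tiny dataset
tiny_example <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)

# Translate the right-hand side of the formula into numbers
dummy_coding <- model.matrix(~ Sex, data = tiny_example)
dummy_coding

# The original data frame is untouched: no dummy column was added to it
names(tiny_example)
```

The `Sexmale` column is the dummy variable: 0 for each female row and 1 for each male row. The `(Intercept)` column of 1s is what gets multiplied by $b_0$.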

Now that you have looked in detail at the tiny set of data, let's find the best estimates for using ```Sex``` to predict ```Thumb``` in our bigger dataset ```studentdata```. Do so by completing the code below. What would be $b_0$ and what would be $b_1$?

In [None]:
# fit and store a model where Sex predicts Thumb in studentdata
sex_model <- 

# this prints out the model estimates
sex_model

## 11.5 Generating predictions from the model

Now that you have fit the Sex model, you can use your estimates to make predictions about future observations. Doing this requires you to use your model as an equation. In this case, you will put in a value for your explanatory variable (Sex), and get out a predicted thumb length.

Recall that the basic form of our model equation looks like this:

$$ Y_i = b_0 + b_1X_i + e_i $$

Once fit, we can print out the coefficients of the model, and then replace $b_0$ with the Intercept coefficient value and replace $b_1$ with the 'Sexmale' coefficient value. Then, in order to make a prediction about the value of Thumb for any one person, we remove the error term. We also change the $Y_i$ to $\hat{Y}_i$, which indicates a predicted score for person *i*. Our prediction equation, then, looks like this:

$$ \hat{Y}_i = 58.59 + 6.056X_i$$

We leave out the error term because every person will have a different residual. If we knew their residual ahead of time, we could predict their score exactly. But since we don’t when making a new prediction, all we can do is predict their score based on the information available to us - the group means and the new observation's Sex value.

This prediction equation is straightforward to use. If we want to predict what the next observed thumb length will be, we can see that if the next student sampled is female, their predicted thumb length is 58.59. If they are male, the prediction is 58.59 + 6.056, or 64.646.

If you want to extract each coefficient by itself with code, you can also use the commands below: 

In [None]:
sex_model$coefficients[1] #this will print out the first coefficient, b0
sex_model$coefficients[2] #this will print out the second coefficient, b1

You can use the ```$``` operator here to pull out the property ```coefficients``` from the ```sex_model``` object. This is a vector that contains every coefficient in the model. To get just $b_0$ or $b_1$, we use indexing. 

As we did in chapter 9, we also will want to generate model predictions for our sample data. It seems odd to predict values when we already know the actual values. But it’s very useful to do so, because then we can calculate residuals from the model predictions and measure our model's performance.

To get predicted values from the ```sex_model```, we use the ```fitted.values``` property or the ```predict()``` function:

In [None]:
sex_model$fitted.values

This is a big output, but the results are just what we've already done - for each observation, their predicted thumb length is the mean of female students if their value on Sex is ```female``` (58.59), or the mean of male students if their value on ```Sex``` is ```male``` (64.646).
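The `predict()` function mentioned above can also make predictions for observations that were not in the sample, if you hand it a data frame through its `newdata` argument. Here is a small sketch on the tiny dataset; `tiny_example`, `tiny_fit`, and `new_students` are names made up for this illustration, and the two new students are invented.

```r
# Hypothetical copy of the six-row tiny dataset
tiny_example <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)
tiny_fit <- lm(Thumb ~ Sex, data = tiny_example)

# With no new data, predict() returns the same values as fitted.values
sample_predictions <- predict(tiny_fit)
sample_predictions

# Made-up new observations: a data frame with the same predictor column
new_students <- data.frame(Sex = c("female", "male"))
new_predictions <- predict(tiny_fit, newdata = new_students)
new_predictions
```

Because the model only knows about Sex, every new female gets the female group mean as her prediction and every new male gets the male group mean.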

## 11.6 Quantifying model fit

Why should we take Sex into account in the first place? Using two parameters in our model instead of one makes it more complex, or less **parsimonious**. We'll talk more later about the importance of parsimony, but for now we should just know that it's harder to work with a more complex model than a simpler one. Thus, there should be a good reason for making it more complex - it should reduce the error in our model predictions. 

Let's try it out. We'll add columns of predictions to ```tiny_data```, calculate the residuals, and then calculate the RMSE.

In [None]:
tiny_data$GrandMean <- c(62, 62, 62, 62, 62, 62) #predictions using grand mean
tiny_data$GroupMeans <- c(59, 59, 59, 65, 65, 65) #predictions using group means
tiny_data$GrandResid <- tiny_data$Thumb - tiny_data$GrandMean #grand mean prediction residuals
tiny_data$GroupResid <- tiny_data$Thumb - tiny_data$GroupMeans #group mean prediction residuals

#equation for RMSE in null model 
sqrt(sum(tiny_data$GrandResid^2) / (length(tiny_data$GrandResid) - 1))
#equation for RMSE in Sex model 
sqrt(sum(tiny_data$GroupResid^2) / (length(tiny_data$GroupResid) - 2))

Success! If we make predictions about people's thumb lengths using the Grand Mean, on average we're about 4.05mm off. But if we take into account each person's sex for making the prediction, we're only about 2.65mm off. Not perfect, but we cut our error by about a third!

You may have noticed that, in order to calculate RMSE for the one- and two-parameter models above, we used slightly different equations. Specifically, in the null model it was calculated as: 

$$RMSE_{null} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\hat{Y}_{i-empty})^2}$$

and in the predictor model it was calculated as:

$$RMSE_{sex} = \sqrt{\frac{1}{N-2}\sum_{i=1}^{N}(Y_i-\hat{Y}_{i-sex})^2}$$

The difference is that we divided the sum of squares in the null model by N - 1, and in the Sex model we divided by N - 2. We've already talked about how, if this measure should be the root *mean* squared error, it's weird that we're not actually *calculating the mean of the error* (which would be found by dividing by N only). Now, we can learn more about why that is. 

If we were to only divide by N, that would actually be fine for finding the RMSE *of this specific sample*. It would be the root of the mean squared error, exactly as it sounds. However, remember again that our ultimate goal is not to make a model *for this sample*, but to estimate the data generation process *for the whole population*. As it turns out, if we divide by only N for finding the RMSE of a sample, we will systematically underestimate how much error our model would have in the population. We'll demonstrate this in more detail in a later chapter. 

In order for us to correct for this underestimation, we need to divide by a slightly smaller number than N: N - 1 in the empty model, or N - 2 in the predictor model. This replacement term is called the **degrees of freedom** in the model.

What are degrees of freedom? In essence, they are the number of unique pieces of information in a dataset, or the number of ways the dataset can vary. You might think that, if a dataset has 6 items (N=6), then there should be 6 unique pieces of information there, right? Each observation can vary in its own way? That would be true, until you bring a parameter estimate of that data into play. Once we have an estimate about the dataset as a whole (say, the mean as $b_0$ in the null model), that actually takes away one way the dataset can vary - it takes away one degree of freedom.

Let's demonstrate this with our tiny dataset. We have a set of thumb lengths, [56, 60, 61, 63, 64, 68]. If we were missing one (our set looked like [56, 60, 61, 63, 64, ?]), we wouldn't have any way of knowing what that sixth item should be - it is free to vary. But if we have the 5 known items, *and* we have an estimate of the mean of the sample (grand mean = 62), the missing 6th item can *only* be 68 in order to keep that grand mean estimate at 62. It is not free to vary. Thus, when we have an estimate of the mean of a sample, the degrees of freedom are N - 1. We only need to know 5 of the items in the sample in order to know the whole sample. 
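The same reasoning can be written as a couple of lines of arithmetic. In this sketch the variable names are invented for illustration; the numbers are the ones from the paragraph above.

```r
# The five known thumb lengths and the known Grand Mean
known_thumbs <- c(56, 60, 61, 63, 64)
grand_mean   <- 62
n            <- 6

# The six values must add up to n * mean, so the last one is whatever is left
missing_thumb <- n * grand_mean - sum(known_thumbs)
missing_thumb
```

The result is 68: once the mean is fixed, the sixth value has no freedom left.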

When we extend to the Sex model, we have two parameter estimates - $b_0$ and $b_1$. We could be missing one value from each sex subgroup (2 datapoints total), and still solve for all the values in the dataset since we have each group mean. Thus, the degrees of freedom for a two-parameter model is N - 2. To generalize this, the degrees of freedom of any model is *sample size - number of parameters*, or *N - k* for short. 

When calculating error in a model, dividing by degrees of freedom instead of sample size keeps us from underestimating the error in the population. 
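R keeps track of degrees of freedom for you. As a quick check (a sketch on an illustrative copy of the tiny dataset, with made-up object names), the base R function `df.residual()` reports N - k for a fitted model, and `sigma()` reports the square root of the squared residuals summed and divided by those degrees of freedom, which is the RMSE as we have defined it.

```r
# Hypothetical copy of the six-row tiny dataset
tiny_example <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)
null_fit <- lm(Thumb ~ NULL, data = tiny_example)
sex_fit  <- lm(Thumb ~ Sex, data = tiny_example)

# Degrees of freedom: sample size minus number of parameters
df_null <- df.residual(null_fit)  # 6 - 1
df_sex  <- df.residual(sex_fit)   # 6 - 2

# RMSE by hand, dividing the sum of squares by degrees of freedom
rmse_by_hand <- sqrt(sum(residuals(sex_fit)^2) / df_sex)

# The same quantity from the built-in helper
rmse_builtin <- sigma(sex_fit)

c(df_null = df_null, df_sex = df_sex)
c(rmse_by_hand = rmse_by_hand, rmse_builtin = rmse_builtin)
```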

You can also use ```supernova()``` to get an ANOVA table for models with a predictor variable, not just null models. This way we can find RMSE without a big long line of code and math. Write some code below that will run the function on the model object the same way we did last chapter. 

In [None]:
# ANOVA table of an empty model
null_model <- lm(Thumb ~ NULL, data = studentdata)
supernova(null_model)

# write code to use supernova() on the sex_model object instead


How do the two ANOVA tables compare? It looks like the table for ```sex_model``` has the same information from the empty model, but with a lot of new numbers added also. Let's break this down. 

As you will recall from last chapter, the sum of squares (SS in the table) is the sum of squared residuals from our model. We will refer to it as $SS_{total}$ to differentiate it from the values on other lines. In both tables, this appears on the line "Total (empty model)" in order to give you a comparison for how much the sum of squares is when just using the Grand Mean to make predictions. It is the total variation in the outcome variable that we hope to explain.

df on the same line stands for degrees of freedom, which you now know the meaning of: there are 157 people in the sample with valid thumb measurements, and we're estimating one parameter in the null model, so degrees of freedom are 157 - 1 = 156. 

Mean squared error (MS) on the same line is the sum of squares divided by the degrees of freedom. To get root mean squared error (RMSE), we'd take the square root of this value.

Now, for the new lines of numbers. The line "Error (from model)" was filled out because we now have a model that is not empty - there's an explanatory variable in it. SS, df, and MS mean the same thing for this model as they did in the empty model: they reflect the amount of error that is left unexplained by the model, and how many degrees of freedom there are after estimating the model parameters (157 observations - 2 parameters = 155 df). This SS value is known as $SS_{error}$. We calculate $SS_{error}$ in much the same way we calculate $SS_{total}$, except this time we use residuals from the Sex model predictions instead of from the null model. 

The key thing to look for is whether or not error left over in the predictor model is less than in the null model. If so, that means it was valuable to add our variable! It helped explain some variance in the outcome variable, enabling us to make better predictions and produce smaller residuals. This difference between the null model and the full model is on the top line. $SS_{model}$ refers to how much of the unexplained error from the null model was explained by adding a variable in the full model. To calculate $SS_{model}$, we simply need to subtract $SS_{error}$ (error from the Sex model predictions) from $SS_{total}$ (error from the null model predictions). Thus:

$$ SS_{model} = SS_{total} - SS_{error} $$
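To see where these three numbers come from, here is a minimal by-hand version on the tiny dataset rather than ```studentdata```. The object names (`tiny_example`, `null_fit`, `sex_fit`) are invented for this sketch.

```r
# Hypothetical copy of the six-row tiny dataset
tiny_example <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)
null_fit <- lm(Thumb ~ NULL, data = tiny_example)
sex_fit  <- lm(Thumb ~ Sex, data = tiny_example)

# Squared residuals from the Grand Mean predictions
SS_total <- sum(residuals(null_fit)^2)

# Squared residuals from the group mean predictions
SS_error <- sum(residuals(sex_fit)^2)

# Error from the null model that the Sex model explained
SS_model <- SS_total - SS_error

c(SS_total = SS_total, SS_error = SS_error, SS_model = SS_model)
```

For these six people the values work out to 82, 28, and 54 square millimeters.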


## 11.7 Improvement over empty model

Statistical modeling is all about explaining variation. $SS_{total}$ tells us how much total variation there is to be explained in an outcome variable. When we fit a model (as we have done with the Sex model), that model explains some of the total variation, and leaves some of that variation still unexplained.

These relationships are visualized in the diagram below: Total SS can be seen as the sum of the Model SS (the amount of variation explained by a more complex model) and SS Error, which is the amount left unexplained after fitting the model. SS Total = SS Model + SS Error.

<img src="images/ch11-explainedvar.png" width="750">

We can see this in the ANOVA table: the first two rows (Model and Error) add up to the Total SS. So, 1192.747 + 11777.738 = 12970.485.

We have now quantified how much error from $SS_{total}$ has been explained by our model: $SS_{model}$, or in our specific case with ```Sex``` predicting ```Thumb``` in ```studentdata```, 1192.747 square millimeters. Is that good? Is that a lot of explained variation? Remember, $SS_{total}$ is influenced by how many datapoints are in the dataset, so it's hard to interpret whether 1192.747 is a lot or a little. It would be easier to understand if we knew the *proportion* of total error that has been reduced rather than the raw amount of error reduced measured in mm<sup>2</sup>.

If you take another look back at the ANOVA table produced above, you will see another column labelled PRE. PRE stands for **Proportional Reduction in Error.**

PRE is calculated using the sums of squares. It is simply $SS_{model}$ (i.e., the variation explained by the model) divided by $SS_{total}$ (the total variation in the outcome variable there is to explain). We can represent this in an equation:

$$ PRE = \frac{SS_{model}}{SS_{total}} $$

Based on this equation, PRE can be interpreted as the proportion of total variation in the outcome variable that is explained by the explanatory variable. It tells us something about the overall strength of our statistical model. Because PRE in the ANOVA table is 0.092, that means ```Sex``` explained 9.2% of the variance in ```Thumb```. Most of the original variance is still there (we didn't explain it perfectly), but we made a reduction in error.

It is important to remember that $SS_{model}$ in the numerator of the equation above represents the *reduction* in error when going from the null model to the more complex model, which includes an explanatory variable. To make this clearer we can re-write the above equation like this:

$$ PRE = \frac{SS_{total} - SS_{error}}{SS_{total}} $$

The numerator of this formula starts with the error from the null model $SS_{total}$, and then subtracts the error from the full model $SS_{error}$ to get the error reduced by the full model $SS_{model}$. Dividing this reduction in error by the $SS_{total}$ yields the proportion of total error in the null model that has been reduced by the full model.
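As a sketch of that calculation, PRE can be computed by hand for the tiny dataset and compared against the R-squared value that base R's `summary()` reports for a fitted model. The object names below are made up for this illustration.

```r
# Hypothetical copy of the six-row tiny dataset
tiny_example <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)
null_fit <- lm(Thumb ~ NULL, data = tiny_example)
sex_fit  <- lm(Thumb ~ Sex, data = tiny_example)

SS_total <- sum(residuals(null_fit)^2)  # error from the null model
SS_error <- sum(residuals(sex_fit)^2)   # error from the Sex model

# Proportional Reduction in Error
PRE <- (SS_total - SS_error) / SS_total
PRE

# R-squared from summary() should be the same number
r_squared <- summary(sex_fit)$r.squared
r_squared
```

In this tiny sample Sex accounts for roughly two thirds of the variation (54 / 82), far more than in the full dataset. With only six people, a proportion like this can swing a lot from sample to sample.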

<div class="alert alert-block alert-info">
<b>Note</b>: We're calling the comparison of a more complex model to the empty model PRE since that's what supernova() calls it, but it goes by other names as well. In more traditional statistics it is referred to as $\eta^2$ (eta squared) or $R^2$. For now all you need to know is: these are different terms used to refer to the same thing - reduction of error accomplished by a more complex model.
</div>  

Although right now we're using PRE to compare a model with one predictor to the null model, the comparison doesn’t need to be to the null model. In fact, PRE can be used to compare any two models as long as they're modeling the same outcome variable.

Let's fit another model, this time trying to predict thumb lengths based on which campus a student came from (UCLA or Claremont). 

In [None]:
campus_model <- lm(Thumb ~ Campus, data = studentdata)

#create an ANOVA table for campus_model


This model has a much smaller PRE - it explains next to no variance in thumb lengths. 

Why might a more complex model not reduce error? This is worth thinking about, because it’s exactly what differentiates the predictor model from the null model. Let’s compare the two:

One-predictor model: $$Y_i = b_0 + b_1X_i + e_i$$

Null model: $$Y_i = b_0 + e_i$$

In the one-predictor model, if $b_1 = 0$, then $b_1X_i$ effectively drops out of the equation - any value of X multiplied by 0 is just 0. Then the equation is no different from the null model. Since $b_1$ refers to the difference in group means, this would be the case where the means of UCLA and Claremont thumb lengths are the same. In other words, Campus does NOT explain any variation in Thumb. In the ANOVA table, this would look like $SS_{model}$ of 0 (or close to it) - there is effectively no difference in the error between the null model and the full model. In practice, $SS_{model}$ is almost never 0. Adding an explanatory variable almost always explains some amount of variance, even if it's tiny. In a later chapter we will explore the question of how *much* error reduction is enough for us to care about.
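Here is a contrived example of the $b_1 = 0$ case. The grouping variable `Group` below is entirely made up: the six thumb lengths from the tiny dataset are split so that both groups happen to have a mean of exactly 62.

```r
# Made-up grouping: group A = {60, 64}, group B = {56, 61, 63, 68}
equal_means <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Group = c("B", "A", "B", "B", "A", "B")
)

# Both group means equal the Grand Mean
group_means <- tapply(equal_means$Thumb, equal_means$Group, mean)
group_means

null_fit  <- lm(Thumb ~ NULL, data = equal_means)
group_fit <- lm(Thumb ~ Group, data = equal_means)

# b1 (labeled GroupB) is zero, apart from possible rounding noise
b1 <- unname(coef(group_fit)[2])
b1

# So the predictor model leaves exactly as much error as the null model
SS_total <- sum(residuals(null_fit)^2)
SS_error <- sum(residuals(group_fit)^2)
SS_model <- SS_total - SS_error
c(SS_total = SS_total, SS_error = SS_error, SS_model = SS_model)
```

With $SS_{model}$ at 0, PRE is 0 too: knowing someone's group tells us nothing more than the Grand Mean already did.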

## 11.8 A historical note - the t-test

In this course we are relying on the general linear model framework to formalize and test our ideas of the data generation process. The main logic of the general linear model is that we can create a (simplified) statistical model that is an estimate of at least a part of the true data generation process, and see whether predictions using that model are more accurate than predictions made using a null model. 

This is not the only approach to statistics, however. Indeed until recently, it was not the most common. The traditional approach to statistics was to develop a separate tool for each kind of dataset. In addition, the goal wasn't about improving accuracy of data predictions, but describing the distinguishability of data from different groups. In this course it is your professor's opinion that this is not the best way to teach statistics, since it relies on more abstract concepts from the get-go and using separate tests makes it harder to remember the logic of all of them. But because so many researchers learned statistics with these tools, they are often still found in research publications. Thus it would be good for you to at least know what they are, and what version of the general linear model they map onto. 

### One-sample t-test

The one-sample t-test is a tool to evaluate whether the mean of one sample is *significantly different* from a particular number. We have not covered the logic of *statistical significance* yet - we will get to it later in the course. But a quick understanding of it right now for the purpose of the t-test is that this tool asks how plausible it is that the data in a sample were drawn from a population where the true mean $\mu$ is a particular value.  

A t-test outputs a single number, known as a *t-value*. It can be calculated as:

$$ t_{one} = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{n}}} $$

Where $\bar{X}$ is the mean of a data sample, $\mu$ is some hypothesized *population* mean (which is why it is written with the Greek letter $\mu$), $s$ is the standard deviation of the sample, and $n$ is the sample size. Often, this hypothesized population mean is 0 (but not always). In that case, the t-test measures how far away the mean of a sample is from 0, and whether that's exceptional or not considering the spread of the distribution and the size of the sample. The bigger the absolute value of the t-value, the more unlikely it is that a population with mean 0 created this sample, and thus the more confident we are that some other population did instead. 

A t-value isn't the same thing as any value in our general linear model, but is based on the difference between the Grand Mean and a hypothesized population mean. Because of this you can use $b_0$ in the null form of the General Linear Model for a similar purpose. In this approach, we could ask whether setting $b_0$ to the mean of our sample makes better predictions than setting $b_0$ to 0. Thus if you encounter one-sample t-tests in the wild, just think of using the null form of the general linear model for the same purpose. 
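The connection between the two approaches can be sketched in a few lines of R. This is only an illustration under stated assumptions: `toy_thumb` is simulated data (not `studentdata`), the hypothesized mean is set to 0, and the variable names are hypothetical. It computes the t-value from the formula, checks it against R's built-in `t.test()`, and then asks the model-comparison version of the question: how much error is left when $b_0$ is 0 versus when $b_0$ is the sample mean?

```r
# Simulated stand-in sample (hypothetical; NOT the course's studentdata)
set.seed(11)
toy_thumb <- rnorm(30, mean = 60, sd = 9)
n  <- length(toy_thumb)
mu <- 0  # hypothesized population mean

# One-sample t-value straight from the formula
t_by_hand <- (mean(toy_thumb) - mu) / (sd(toy_thumb) / sqrt(n))

# The same value from R's built-in one-sample t-test
t_builtin <- unname(t.test(toy_thumb, mu = mu)$statistic)

# Model-comparison view: error when b0 = 0 versus when b0 = the sample mean
sse_zero <- sum((toy_thumb - mu)^2)
sse_mean <- sum((toy_thumb - mean(toy_thumb))^2)
pre_vs_zero <- (sse_zero - sse_mean) / sse_zero

# Error removed by using the mean, relative to the error left over
# per degree of freedom (n - 1)
scaled_reduction <- (sse_zero - sse_mean) / (sse_mean / (n - 1))

c(t_by_hand = t_by_hand, t_builtin = t_builtin,
  t_squared = t_by_hand^2, scaled_reduction = scaled_reduction,
  PRE = pre_vs_zero)
```

The squared t-value and `scaled_reduction` should agree, which is one way to see that the t-test and the comparison of two values for $b_0$ are asking the same question in different forms.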

### Independent samples t-test

A related type of t-test compares the means of two samples to each other to ask whether we think the same underlying population created them, or not. The computation for an independent-samples t-test is:

$$ t_{independent} = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

where $s_p$ stands for the *pooled* standard deviation, a way of combining the standard deviations of two samples:

$$s_p = \sqrt{\frac{(n_1 - 1)s^2_1 + (n_2 - 1)s^2_2}{n_1 + n_2 - 2}} $$

The independent samples t-test asks whether two groups are likely from the same population. These two groups are usually split based on an explanatory variable - e.g., the thumb lengths of females in our study would be one group, and the thumb lengths of males would be another one. Once separated, those distributions are compared to each other. If an independent samples t-test value is large, it is unlikely that both groups came from the same population.

In the general linear model, we can achieve a similar goal by looking at the value of $b_1$ in a predictor model. If $b_1$ is large, that means there is a large difference between female and male group means - we'd make substantially different predictions for someone who is female compared to someone who is male. If $b_1$ is small, there is not much difference between the two groups (and perhaps the small difference is only there by random chance). 
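Here is a minimal sketch of that mapping, again on simulated data (`toy_groups` is made up for illustration and is not `studentdata`; all variable names are hypothetical). It computes the independent samples t-value by hand using the standard pooled standard deviation, which divides by $n_1 + n_2 - 2$, checks it against `t.test()` with `var.equal = TRUE`, and then compares it with the one-variable model fit by `lm()`.

```r
# Simulated stand-in data (hypothetical; NOT the course's studentdata)
set.seed(12)
toy_groups <- data.frame(
  Sex   = factor(rep(c("female", "male"), times = c(25, 20))),
  Thumb = c(rnorm(25, mean = 58, sd = 8), rnorm(20, mean = 64, sd = 8))
)

x1 <- toy_groups$Thumb[toy_groups$Sex == "female"]
x2 <- toy_groups$Thumb[toy_groups$Sex == "male"]
n1 <- length(x1)
n2 <- length(x2)

# Pooled standard deviation: denominator is n1 + n2 - 2
s_p <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))

# Independent samples t-value from the formula
t_by_hand <- (mean(x1) - mean(x2)) / (s_p * sqrt(1 / n1 + 1 / n2))

# Built-in version; var.equal = TRUE requests the pooled-variance test
t_builtin <- unname(t.test(x1, x2, var.equal = TRUE)$statistic)

# General linear model version: b1 is the difference between group means
toy_sex_model <- lm(Thumb ~ Sex, data = toy_groups)
b1 <- coef(toy_sex_model)[["Sexmale"]]
t_from_lm <- summary(toy_sex_model)$coefficients["Sexmale", "t value"]

c(t_by_hand = t_by_hand, t_builtin = t_builtin, t_from_lm = t_from_lm, b1 = b1)
```

The t-value reported for $b_1$ has the same size as the hand-computed one but the opposite sign. That is only because `lm()` treats female as the reference group and computes male minus female, while the formula above computes female minus male.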

## Chapter summary

After reading this chapter, you should be able to:

- Understand the difference between an empty model and a one-variable model
- Explain what $b_0$ and $b_1$ mean in a one-variable model
- Write the equation form of the one-variable statistical equation
- Fit a one-variable model in R with lm()
- Explain what degrees of freedom are
- Identify the various error components in an ANOVA table of the one-variable model
- Define what proportional reduction in error means 
- Calculate proportional reduction in error from SS
- Know which general linear model specifications map onto one-sample and independent samples t-tests

## New concepts
- **Grand Mean**: The mean of an outcome variable for all people in the dataset.  
- **reference group**: In a statistical model with a categorical variable as a predictor, the reference group is the level of the categorical variable for which $b_0$ is the group mean and to which all other group means are compared. 
- **effect**: The effect of a predictor in a model is the extent to which outcome variable predictions are changed based on different values in this predictor.
- **dummy variable**: A predictor variable in a model that has been converted to Boolean in order to be useful for mathematical calculation. Meant to represent categorical membership (i.e. 1=in some category, 0=not in this category).
- **parsimony**: A statistical model is parsimonious if it is simple with relatively few predictor variables. Parsimonious models explain less error in a data sample but have a better chance at generalizing to other samples. 
- **degrees of freedom**: The number of values in a data sample that are free to vary after the calculation of some statistic(s). In general the degrees of freedom for a dataset are N-k, where N is the sample size and k is the number of parameters calculated on the dataset.  
- **proportional reduction in error**: The proportion of variation in an outcome variable that a statistical model is able to explain.  

[Next: Chapter 12 - Quantitative Predictor Models](https://colab.research.google.com/github/smburns47/Psyc158/blob/main/chapter-12.ipynb)