In [11]:
# This chapter uses some packages that take a few minutes to download on Google Colab. 
# Run this first so it's ready by the time you need it
install.packages("ggformula")
install.packages("readr")
install.packages("supernova")
library(ggformula)
library(readr)
library(supernova)



# Chapter 11 - Adding an Explanatory Variable

As we discussed previously, in the absence of other information about the objects being studied, the mean of our sample is the best single-number estimate we have of the actual mean of the population. It balances the data points around it: the deviations above the mean exactly cancel the deviations below it. Because it is our best guess of what the population parameter is, it is also the best predictor we have of the value of a subsequent observation. While it will almost certainly be wrong for any one observation, the mean will do a better job than any other single number.

However, we don't have to be limited to *just* a single number. In this chapter we'll learn how to add more pieces to a statistical model to do a better job. 

## 11.1 Explaining variation

We started with the empty model in order to get some important ideas across, but certainly that’s not where we want to end up. It is time we start building models that include explanatory variables. We will still use the empty model, but only as a reference point.

Let’s quickly review what we mean by explaining variation. Earlier in the course, we developed an intuitive idea of explanation by comparing the distribution of one variable across two different groups. So, for example, we looked at the distribution of thumb length broken down by sex, which we can see in the two density histograms below.

<img src="images/ch11-variation.png" width="650">

You can clearly see that sex explains some of the variation in thumb length *in our data*. (This may not be true in the population, of course. It’s always possible that we are being fooled by a sample that doesn’t accurately represent what’s true in the population.) When we break up thumb length by sex it looks like two separate, though overlapping distributions. In general, males have longer thumbs than females in our data.

If we assume that this relationship (between sex and thumb length) exists in the population, and not just in our data, we can use it to help us make a better prediction about a future observation. If you know that someone is male, you would make a different prediction of their thumb length than if you knew they were female.

It seems, then, that if we were to use a statistical model to make predictions about a person's thumb length, somehow incorporating information about their sex would be helpful - our predictions would be more accurate on average, and there would be less overall error in the model. 

## 11.2 Adding an explanatory variable to the model

In the previous chapters we introduced the idea of a statistical model as an equation that is meant to represent our best guess of the data generation process. This model generates a predicted score for each observation. We developed what we called the empty model, in which we use the mean as the predicted score for each observation.

We represented this model in General Linear Model (GLM) notation like this:

$$ Y_i = b_0 + e_i $$

where b<sub>0</sub> represents the mean of the outcome variable in the sample. When we use the notation of the GLM, we must define the meaning of each symbol in context. Y<sub>i</sub>, for example, could mean lots of different things, depending on what our outcome variable is. But we will always use it to represent the outcome variable. 

It is also important to remember that b<sub>0</sub> is just an estimate of the true mean in the population. To distinguish the true mean, which is unknown, from the estimate of the true mean we construct from our data, we use the Greek letter &beta;<sub>0</sub> and write the empty model like this:

$$ Y_i = \beta_0 + \epsilon_i $$

The empty model is called a one parameter model because we only need to estimate one parameter (&beta;<sub>0</sub>) in order to generate a predicted score for each observation.

In the case of thumb length, this model states that the DATA (each data point, represented as Y<sub>i</sub>, which is each person’s measured thumb length), can be thought of as being generated by the combination of two inputs: the MODEL, represented as b<sub>0</sub> (which is the mean thumb length for everyone, usually called the Grand Mean); plus ERROR, which is each person’s residual from the model, represented by e<sub>i</sub>.

<div class="alert alert-block alert-info">
<b>Note</b>: We use the term Grand Mean to refer to the mean of the entire sample in order to distinguish it clearly from other means, such as the mean for subgroups within the sample.
</div>

It’s useful to illustrate the empty model (and what we're about to do to it) with our ```tiny_fingers``` dataset. ```tiny_fingers```, you will recall, contains six people’s thumb lengths randomly selected from our complete ```fingers``` dataset. This time, we'll also include their value on the ```Sex``` variable.


In [6]:
student_ID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
Sex <- c("female", "female", "female", "male", "male", "male")

tiny_fingers <- data.frame(student_ID, Thumb, Sex)
tiny_fingers

| student_ID | Thumb | Sex |
|---|---|---|
| `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 56 | female |
| 2 | 60 | female |
| 3 | 61 | female |
| 4 | 63 | male |
| 5 | 64 | male |
| 6 | 68 | male |


We can put this data into a basic scatter plot with ```Sex``` on the x-axis and ```Thumb``` on the y-axis in order to visualize how ```Sex``` might explain ```Thumb```. 

<img src="images/ch11-nullmodel.png" width="750">

In the above plot, we drew a blue horizontal line in order to mark where the Grand Mean of the whole ```tiny_fingers``` dataset is. This is the same value as b<sub>0</sub> - in other words, this is what we would predict everyone's thumb length to be if we were using the empty or null model. But there is plenty of error to this prediction - no data point is on this line, and we could calculate the RMSE to find out how large the residuals are in general. 
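As a rough sketch of what the blue line represents, the code below computes the Grand Mean of the six thumb lengths and each person's residual from it. The names `thumb`, `b0`, and `empty_resid` are illustrative choices for this example, and the values are retyped from ```tiny_fingers``` so the cell runs on its own.

```r
# Thumb lengths (mm) from tiny_fingers, retyped so this example is self-contained
thumb <- c(56, 60, 61, 63, 64, 68)

# In the empty model, b0 is the Grand Mean of the sample
b0 <- mean(thumb)
b0

# Residuals: how far each person is from the empty model's prediction
empty_resid <- thumb - b0
empty_resid

# Residuals around the mean cancel out, so they sum to zero
sum(empty_resid)
```

Every residual is nonzero, which matches the plot: no data point sits exactly on the line.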

So let's try to take into account the effect of ```Sex``` and improve our prediction. One thing we could do is, instead of using the Grand Mean of ```Thumb``` to predict everyone's thumb length, we could first consider whether or not the prediction we want to make is for a male or female. Then, we could use the mean of *just that group* in order to make our prediction. 

<img src="images/ch11-sexpredictor.png" width="750">

A model that takes ```Sex``` into account generates a different prediction for a male than it does for a female. Error is still measured the same way, as the deviation of each person’s measured thumb length from their predicted thumb length. But this time, the error is calculated from each person’s group mean (male or female) instead of from the Grand Mean.
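One way to see this numerically is sketched below, assuming the same six observations as ```tiny_fingers```. It uses base R's `tapply()` to get the mean of each group and then measures error from each person's own group mean. The object names `group_means`, `group_pred`, and `group_resid` are made up for this illustration.

```r
# Rebuild the tiny dataset so this example runs on its own
tiny_fingers <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)

# Mean thumb length within each level of Sex
group_means <- tapply(tiny_fingers$Thumb, tiny_fingers$Sex, mean)
group_means

# Each person's prediction is the mean of their own group
group_pred <- as.vector(group_means[tiny_fingers$Sex])
group_pred

# Error is still observed minus predicted, now relative to the group mean
group_resid <- tiny_fingers$Thumb - group_pred
group_resid
```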

## 11.3 Specifying the model form

Whereas the empty model was a one-parameter model (we only had to estimate one parameter, the Grand Mean), the ```Sex```  model is a two-parameter model. One of the parameters is the mean for males, the other is the mean for females. By using the model as an equation for predicting the value of ```Thumb```, we should be able to use one or the other mean depending on the value of ```Sex```. 

One way to do this is to use the mean of one group (say, females) as &beta;<sub>0</sub>, and then add an extra amount to that value if someone is actually male. In other words, we could specify another parameter, &beta;<sub>1</sub>, as the *difference* between male and female mean thumb lengths. But this should only be added if someone is male, so let's multiply this parameter by 1 if ```Sex``` is "male", and by 0 if ```Sex``` is "female", in which case the addition is ignored.

Here is how to write this in GLM form: 

$$ Y_i = \beta_0 + \beta_1X_i + \epsilon_i $$

In this equation, &beta;<sub>0</sub> is the group mean for "female" and &beta;<sub>1</sub> is the difference between the group mean of male and female. X<sub>i</sub> is a variable, in this case ```Sex``` (which can take the value of either 0 for female or 1 for male). 

Of course we're making an estimate of the population and not measuring it directly, so the estimate form of this equation would be:

$$ Y_i = b_0 + b_1X_i + e_i $$

We're still using the DATA = MODEL + ERROR framework for this. Except this time, our MODEL takes into account the value of ```Sex``` and has multiple components (b<sub>0</sub> + b<sub>1</sub>X<sub>i</sub>) instead of one component (b<sub>0</sub>). b<sub>0</sub> also no longer stands for the Grand Mean of this sample, but the *group* mean of whatever group we assigned to be 0 in the Boolean variable ```Sex```. To calculate our prediction of each person's thumb, we'd fill in the parameters with the group mean of female (59) and the difference between the group means of male and female (65 - 59 = 6):

$$ \hat{Y}_i = 59 + 6X_i $$

This equation would make one prediction (59) when the value of X<sub>i</sub> is 0 (female), and a different prediction (65) when the value of X<sub>i</sub> is 1 (male). 
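The prediction equation can be tried out directly. This is a minimal sketch that dummy-codes ```Sex``` by hand (0 for female, 1 for male) and plugs in the two estimates from the text. The names `X` and `Y_hat` are illustrative.

```r
# Sex values for the six people in tiny_fingers
Sex <- c("female", "female", "female", "male", "male", "male")

# Dummy-code Sex by hand: 0 for female, 1 for male
X <- ifelse(Sex == "male", 1, 0)

b0 <- 59  # mean thumb length for females
b1 <- 6   # male mean minus female mean (65 - 59)

# Prediction equation: Y-hat = b0 + b1 * X
Y_hat <- b0 + b1 * X
Y_hat
```

Females (X = 0) get the prediction 59 and males (X = 1) get 59 + 6 = 65, so the single equation reproduces both group means.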

## 11.4 Error in the one-variable model

Why should we take ```Sex``` into account in the first place? Using two parameters in our model instead of one makes it more complex, or less **parsimonious**. We'll talk more later about the importance of parsimony, but for now we should just know that it's harder to work with a more complex model than a simpler one. Thus, there should be a good reason for making it more complex - it should reduce the error in our model predictions. 

Let's try it out. We'll add columns of predictions to ```tiny_fingers```, calculate the residuals, and then calculate the RMSE.

In [9]:
tiny_fingers$GrandMean <- c(62, 62, 62, 62, 62, 62) #predictions using grand mean
tiny_fingers$GroupMeans <- c(59, 59, 59, 65, 65, 65) #predictions using group means
tiny_fingers$GrandResid <- tiny_fingers$Thumb - tiny_fingers$GrandMean #grand mean prediction residuals
tiny_fingers$GroupResid <- tiny_fingers$Thumb - tiny_fingers$GroupMeans #group mean prediction residuals

 #equation for RMSE in one-parameter case 
sqrt(sum(tiny_fingers$GrandResid^2) / (length(tiny_fingers$GrandResid) - 1))
 #equation for RMSE in two-parameter case 
sqrt(sum(tiny_fingers$GroupResid^2) / (length(tiny_fingers$GroupResid) - 2))

Success! If we make predictions about people's thumb lengths using the Grand Mean, on average we're about 4.05mm off. But if we take into account each person's sex for making the prediction, we're only about 2.65mm off. Not perfect, but we cut our error by about a third!

You may have noticed that, in order to calculate RMSE for the empty and one-variable models above, we used slightly different equations. Specifically, in the null model it was calculated as: 

$$RMSE_{null} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\hat{Y}_i)^2}$$

and in the one-variable model it was calculated as:

$$RMSE_{sex} = \sqrt{\frac{1}{N-2}\sum_{i=1}^{N}(Y_i-\hat{Y}_i)^2}$$

The difference is that we divided the sum of squares in the null model by N - 1, and in the one-variable model we divided by N - 2. We've already talked about how, if this measure should be the root *mean* squared error, it's weird that we're not actually *calculating the mean of the error* (which would be found by dividing by N only). Now, we can learn more about why that is. 

If we were to only divide by N, that would actually be fine for finding the RMSE *of this specific sample*. It would be the root of the mean squared error, exactly as it sounds. However, remember again that our ultimate goal is not to make a model *for this sample*, but to *estimate the data generation process for the whole population*. As it turns out, if we divide by only N for finding the RMSE of a sample, we will systematically underestimate how much error our model would have in the population. We'll demonstrate this in more detail in a later chapter. 

In order for us to correct for this underestimation, we need to divide by a slightly smaller number than N: N - 1 in the empty model, or N - 2 in the one-variable model. This replacement term is called the **degrees of freedom** in the model.

What are degrees of freedom? In essence, they are the number of unique pieces of information in a dataset, or the number of ways the dataset can vary. You might think that, if a dataset has 6 items (N=6), then there should be 6 unique pieces of information there, right? Each observation can vary in its own way? That would be true, until you bring a parameter estimate of that data into play. Once we have an estimate about the dataset as a whole (say, the mean as b<sub>0</sub> in the empty model), that actually takes away one way the dataset can vary - it takes away one degree of freedom.

Let's demonstrate this with our tiny dataset. We have a set of thumb lengths, [56, 60, 61, 63, 64, 68]. If we were missing one (our set looked like [56, 60, 61, 63, 64, ?]), we wouldn't have any way of knowing what that sixth item should be - it is free to vary. But if we have the 5 known items, *and* we have an estimate of the mean of the sample (grand mean = 62), the missing 6th item can *only* be 68 in order to keep that grand mean estimate at 62. It is not free to vary. Thus, when we have an estimate of the mean of a sample, the degrees of freedom are N - 1. We only need to know 5 of the items in the sample in order to know the whole sample. 
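The arithmetic behind that claim is short enough to check in code. This sketch assumes we know five of the six values plus the grand mean. The names `known`, `grand_mean`, and `missing_value` are illustrative.

```r
known      <- c(56, 60, 61, 63, 64)  # the five thumb lengths we can see
grand_mean <- 62                     # the estimate we already have
n          <- 6                      # total number of observations

# The sample total must equal n * mean, so the sixth value is forced
missing_value <- n * grand_mean - sum(known)
missing_value
```

The result is 68: once the mean is fixed, the last value has no freedom left.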

When we extend to the one-variable model, we have two parameter estimates - b<sub>0</sub> and b<sub>1</sub>. We could be missing one value from each sex subgroup (2 data points total) and still solve for all the values in the dataset, since we have each group mean. Thus, the degrees of freedom for a one-variable model are N - 2. To generalize this, the degrees of freedom of any model are *sample size - number of parameters*, or *N - k* for short. 

When calculating error in a model, dividing by the degrees of freedom instead of the sample size keeps us from underestimating the error in the population. 
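To make the *N - k* rule concrete, here is a small sketch with a hypothetical helper function, `rmse_df()`, that is not part of any package. It takes a vector of residuals and the number of parameters `k`, and divides the sum of squares by N - k.

```r
# Hypothetical helper: RMSE with a degrees-of-freedom correction
# resid: vector of residuals; k: number of parameters the model estimated
rmse_df <- function(resid, k) {
  sqrt(sum(resid^2) / (length(resid) - k))
}

thumb <- c(56, 60, 61, 63, 64, 68)
grand_resid <- thumb - 62                         # empty model, k = 1
group_resid <- thumb - c(59, 59, 59, 65, 65, 65)  # Sex model, k = 2

rmse_df(grand_resid, k = 1)  # divides by N - 1
rmse_df(group_resid, k = 2)  # divides by N - 2

# For comparison, dividing by N itself (k = 0) gives smaller values
rmse_df(grand_resid, k = 0)
rmse_df(group_resid, k = 0)
```

The first two calls should reproduce the values of about 4.05 and 2.65 from earlier in this section. The last two are smaller, which illustrates the underestimation that the degrees-of-freedom correction is meant to offset.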

## 11.5 Fitting the one-variable model

Now that you have learned how to specify a model with an explanatory variable (also frequently called a *predictor*), let’s learn how to fit the model using R.

Fitting a model, as a reminder, simply means automatically calculating the parameter estimates. We use the word “fitting” because we want to calculate the best estimate, the one that will result in the least amount of error and best "fit" our data. For the tiny data set, we could calculate the parameter estimates in our head — it’s just a matter of calculating the mean for males and the mean for females. But when the data set is larger, it is much easier to use R.

Using R, we will first fit the Sex model to the tiny dataset, just so you can see that R gives you the same parameter estimates you got before. After that we will fit it to the complete data set.

Here's the model we are going to fit:

$$ Y_i = b_0 + b_1X_i + e_i $$

Note that the parts that are going to have different values for each observation (X<sub>i</sub> and Y<sub>i</sub>) are called variables (because they vary)! e<sub>i</sub> also varies, but we typically reserve the label “variable” for outcome and explanatory variables. The parts that are going to have the same value for each observation (b<sub>0</sub> and b<sub>1</sub>) are called parameter estimates.

We do not need to estimate the variables. Each student in the dataset already has a score for the outcome variable (Y<sub>i</sub>) and the explanatory variable (X<sub>i</sub>), and these scores vary across students. Notice that the subscript *i* is attached to the parts that are different for each person.

We do need to estimate the parameters because, as discussed previously, they are features of the population, and thus are unknown. The parameter estimates we calculate are those that best fit our particular sample of data. But we would have probably gotten different estimates if we had a different sample. Thus, it is important to keep in mind that these estimates are only that, and they are undoubtedly a bit off. Calling them estimates keeps us humble!

Parameter estimates don’t vary from person to person, so they don’t carry the subscript *i*.

To fit the Sex model we use ```lm()``` again. This time, instead of the right side of the formula being ```NULL```, we have an explanatory variable to put there. Thus, the formula argument of ```lm()``` is ```Thumb ~ Sex```. 

In [10]:
lm(Thumb ~ Sex, data = tiny_fingers)


Call:
lm(formula = Thumb ~ Sex, data = tiny_fingers)

Coefficients:
(Intercept)      Sexmale  
         59            6  


Notice that the estimates are exactly what you should have expected: the first estimate, for b<sub>0</sub>, is 59 (the mean for females); the second, b<sub>1</sub>, is 6, which is the number of millimeters you need to add to the female average thumb length to get average male thumb length.

Notice also that the estimate for b<sub>0</sub> is labeled “intercept” in the output. You have encountered the concept of intercept before, when you studied the concept of a line in algebra. Remember that ```y = mx + b``` is the equation for a line? *m* represents the slope of the line, and *b* represents the y-intercept. The General Linear Model notation is similar to this, though it includes error, whereas the equation for a line does not.

The reason the estimate for b<sub>0</sub> is called "Intercept" is because it is the estimate for thumb length when *X* is equal to 0. In other words, when sex is female. The estimate that R called “Sexmale,” by this line of reasoning, is kind of like the slope of a line. It is the increment in thumb length for a unit increase in *X*.
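A quick way to convince yourself of this reading is to compare the fitted coefficients with the group means computed separately. This is a sketch on the tiny dataset. The object names `fit` and `group_means` are illustrative.

```r
# Rebuild the tiny dataset so this example runs on its own
tiny_fingers <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)

fit <- lm(Thumb ~ Sex, data = tiny_fingers)
group_means <- tapply(tiny_fingers$Thumb, tiny_fingers$Sex, mean)

# The intercept should match the female mean (X = 0)
coef(fit)[["(Intercept)"]]
group_means[["female"]]

# The "slope" should match the male mean minus the female mean
coef(fit)[["Sexmale"]]
group_means[["male"]] - group_means[["female"]]
```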

If you want (and it’s a good idea), you can save the results of this model fit in an R object. Here’s the code to save the model fit in an object called ```tiny_sex_model```. Once you’ve saved the model, you can see its estimates by typing the name of the model, which gives the same output as above.

In [None]:
tiny_sex_model <- lm(Thumb ~ Sex, data = tiny_fingers)

#type the name of the saved model below to print out its output


Now that we have estimates for the two parameters, we can put them in our model statement to yield: Y<sub>i</sub> = 59 + 6X<sub>i</sub> + e<sub>i</sub>.

You may have noticed that the values of ```Sex``` in ```tiny_fingers``` are the categorical strings ```female``` or ```male```, and not 1 or 0. We were able to run ```lm()``` anyway, so it seems like R is able to handle converting categorical data to Boolean data. But how does R know which level of ```Sex``` should be 0 and which should be 1? The answer to this question is, R doesn’t know; it’s just taking whatever group comes first alphabetically (in this case,  ```female```) and making it the **reference group**. The mean of the reference group is the first parameter estimate (b<sub>0</sub> or the Intercept in the ```lm()``` output). R then takes the second group (in this case, ```male```) and represents it with b<sub>1</sub>.  

Let’s say, just for fun, that you changed the code for ```female``` into ```woman``` in the data frame. Because ```male``` now comes first in the alphabet, ```male``` becomes the reference group, and its mean is now the estimate for the intercept (b<sub>0</sub>).
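You can try this relabeling yourself. The sketch below stores the relabeled values in a new, hypothetical column called `Sex2` so the original ```Sex``` column is left alone, and then refits the model.

```r
# Rebuild the tiny dataset so this example runs on its own
tiny_fingers <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)

# Relabel "female" as "woman"; "male" now comes first alphabetically
tiny_fingers$Sex2 <- ifelse(tiny_fingers$Sex == "female", "woman", "male")

relabeled_fit <- lm(Thumb ~ Sex2, data = tiny_fingers)
coef(relabeled_fit)
```

The intercept should now be the male mean (65), and the second coefficient, labeled `Sex2woman`, should be -6: the amount to add to the male mean to get the mean of the other group. The predictions for each person are unchanged. Only the way the two parameters describe them has flipped.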

Converting a categorical variable to a Boolean variable is also called making a **dummy variable** for the model prediction equation (this does not get saved to your data frame - it's just a temporary computation R makes under the hood). You could supply a Sex variable that is already coded as 0s and 1s, but if you don't, R will translate the categorical variable into a dummy variable to build the model with. 
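If you are curious what that hidden dummy variable looks like, base R's `model.matrix()` will display the matrix of predictors that a formula produces. This is only a peek under the hood, sketched on the tiny dataset. You do not need to call it yourself to fit a model.

```r
# Rebuild the tiny dataset so this example runs on its own
tiny_fingers <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)

# Show the predictor matrix built from the formula
X <- model.matrix(Thumb ~ Sex, data = tiny_fingers)
X

# The Sexmale column is the dummy variable: 0 for female, 1 for male
X[, "Sexmale"]
```

The `(Intercept)` column is all 1s (it multiplies b<sub>0</sub>), and the `Sexmale` column is the 0/1 variable that multiplies b<sub>1</sub>.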

Now that you have looked in detail at the tiny set of data, find the best estimates for our bigger set of data (in the data frame called ```fingers```) by modifying the code below. What would be b<sub>0</sub> and what would be b<sub>1</sub>?

In [12]:
fingers <- read_csv("https://raw.githubusercontent.com/smburns47/Psyc158/main/fingers.csv")

# store the model where Sex predicts Thumb
sex_model <- 

# this prints out the model estimates
sex_model

Rows: 157 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Sex, RaceEthnic, Job, MathAnxious, Interest
dbl (11): FamilyMembers, SSLast, Year, GradePredict, Thumb, Index, Middle, R...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.



Call:
lm(formula = Thumb ~ Sex, data = fingers)

Coefficients:
(Intercept)      Sexmale  
     58.256        6.447  


## 11.6 Generating predictions from the model

Now that you have fit the Sex model, you can use your estimates to make predictions about future observations. Doing this requires you to use your model as an equation. In this case, you will put in a value (e.g., “female”) for your explanatory variable (```Sex```), and get out a predicted thumb length.

Recall that our model looks like this:

$$ Y_i = b_0 + b_1X_i + e_i $$

Once the model is fit, we can print out its coefficients, and then replace b<sub>0</sub> with the Intercept value and b<sub>1</sub> with the other coefficient. Then, in order to make a prediction about the value of ```Thumb``` for any one person, we remove the error term. If our goal is to model the variation, we want the error term there. But if our goal is to predict, we are going to ignore error and just do our best! We also change the Y<sub>i</sub> to Ŷ<sub>i</sub>, which indicates a predicted score for person *i*. Our prediction equation, then, looks like this:

$$ \hat{Y}_i = 58.256 + 6.447*X_i$$

We leave out the error term because every person will have a different error term. If we knew their error, we could predict their score exactly. But since we don’t know it when making a new prediction, all we can do is predict their score based on their sex.

This prediction equation is straightforward to use. If we want to predict what the next observed thumb length will be, we can see that if the next student sampled is female, their predicted thumb length is 58.256. If they are male, the prediction is 58.256 + 6.447, or 64.703.
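The same arithmetic can be wrapped in a small function. This sketch hard-codes the two estimates printed above. `predict_thumb()` is a hypothetical helper written only for this illustration, not a function from R or any package.

```r
b0 <- 58.256  # intercept: estimated mean thumb length for females
b1 <- 6.447   # estimated difference between male and female means

# Hypothetical helper: returns the predicted thumb length for a given sex
predict_thumb <- function(sex) {
  X <- as.numeric(sex == "male")  # dummy code: 1 for male, 0 otherwise
  b0 + b1 * X
}

predict_thumb("female")
predict_thumb("male")
```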

If you want to produce each coefficient by itself with code, you can also use the commands below: 

In [13]:
sex_model$coefficients[1] #this will print out the first coefficient, b0
sex_model$coefficients[2] #this will print out the second coefficient, b1

You can use the ```$``` operator here even though ```sex_model``` isn't a data frame because it works with multiple types of complex objects. In a data frame, it accesses a column by name. In a model object, it accesses outputs of the model by name. The ```coefficients``` output has two items in it, so you can use indexing like ```[1]``` or ```[2]``` to access the first or the second coefficient. 
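As a small illustration of this, the sketch below fits the model to the tiny dataset and pulls out coefficients in a few equivalent ways. `names()` and `coef()` are base R functions. The object name `fit` is just for this example.

```r
# Rebuild the tiny dataset so this example runs on its own
tiny_fingers <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)
fit <- lm(Thumb ~ Sex, data = tiny_fingers)

# List the named pieces stored in the model object
names(fit)

# Access coefficients by position ...
fit$coefficients[1]
fit$coefficients[2]

# ... or by name, using the coef() accessor
coef(fit)["(Intercept)"]
coef(fit)["Sexmale"]
```

Accessing by name can be easier to read than remembering which position each coefficient is in.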

As we did in chapter 9, we also will want to generate model predictions for our sample data. It seems odd to predict values when we already know the actual values. But it’s actually very useful to do so, because then we can calculate residuals from the model predictions.

To get predicted values from the ```sex_model```, we use the ```predict()``` function:

In [14]:
predict(sex_model)

This is a big output, but the results are just what we've already worked out: each observation's predicted thumb length is the mean of female students (~58.256) if their value on ```Sex``` is ```female```, or the mean of male students (~64.703) if their value on ```Sex``` is ```male```.
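The pattern is easier to see on the tiny dataset, where the output is only six numbers long. The sketch below also passes a `newdata` data frame to `predict()`, which is a standard argument for getting predictions for observations that were not in the sample. The object names `fit`, `pred`, and `new_people` are illustrative.

```r
# Rebuild the tiny dataset so this example runs on its own
tiny_fingers <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68),
  Sex   = c("female", "female", "female", "male", "male", "male")
)
fit <- lm(Thumb ~ Sex, data = tiny_fingers)

# One prediction per row of the data used to fit the model
pred <- predict(fit)
pred

# Only two distinct predictions exist: one per group
unique(round(pred, 3))

# Predictions for new people, given only their sex
new_people <- data.frame(Sex = c("female", "male"))
predict(fit, newdata = new_people)
```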

Let’s say you want to save these predicted values for each person as a variable called ```Sex_predicted``` (in the ```fingers``` data frame). See if you can complete the R code to do this.

In [None]:
fingers$Sex_predicted <-

# this prints the first few lines of the fingers data frame, to check if your variable saved successfully
head(fingers)


## 11.7 Quantifying model fit

**CASE WHERE DIFFERENCE BETWEEN GROUP MEANS IS 0**


## 11.8 Improvement over empty model