In [None]:
# Run this first so it's ready by the time you need it
install.packages("readr")
install.packages("supernova")
install.packages("ggformula")
library(readr)
library(supernova)
library(ggformula)

# Chapter 12 - Quantitative Predictor Models

## 12.1 Categorical vs. interval predictors

In the previous chapter we figured out how to add an explanatory variable to a model. We explained thumb length using sex, and noticed how doing so reduced the error in our model and made our predictions more accurate.

Sex is a categorical variable, with two easily distinguishable groups and separate means for each of the groups that we can model with b<sub>0</sub> and b<sub>1</sub> in the general linear model equation. But what about a different sort of variable, like height? How do we add in continuous variables like this that don't have clearly distinguishable group means?

One option is to create categorical groups out of this data. We could say anyone who is shorter than the mean height is short, and anyone taller than the mean height is tall. That way we now have a categorical variable with two levels, "short" and "tall", and we can use it in a model the same way we used Sex. 

There are some problems with this approach, however. First, remember that the definition of a categorical variable is one where there is no quantitative relationship between the different categories. That holds true for a variable like Sex - it doesn't make sense to say the "female" level is any more or less than the "male" level. But for categories like "short" and "tall", there is an inherent quantitative relationship between them. "Short" means less than "tall" by definition. So forcing Height to be a categorical variable is misspecifying the meaning of Height.

The other problem is that by forcing all values in Height to be in one category or another, we are throwing away information that the variable could give us for the purposes of modeling the data generation process. Less information means less flexible models and worse predictions. We'll explore this in greater detail later in the chapter.

So instead of using height in inches to divide students into groups (e.g., short or tall), let's just model height as a continuous variable. Earlier in the course we learned how to visualize this approach in a scatterplot. In this chapter we will figure out how to extend our models to accommodate quantitative explanatory variables.

The models we develop in this chapter are a special type usually called **regression models**. Before we start, though, note that the core ideas behind these new models are exactly the same as those we have developed for group-based models. A model still yields a single predicted score for each observation, based on some mathematical function of the explanatory variable.

Further, in regression models, we still use residuals (the difference between the predicted and observed score) to measure error around the model. We also still use the sum of squared deviations from the model predictions as a measure of model fit. And, we still use PRE to indicate the proportion reduction in error of the regression model compared with the empty model.

### The two-group model review

Let’s return to our ```tiny_fingers``` data set, but this time add ```Height``` as an explanatory variable. 

In [None]:
student_ID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
Height <- c(62, 66, 67, 63, 68, 71)

tiny_fingers <- data.frame(student_ID, Thumb, Height)
tiny_fingers

Now, let's make a new variable that recodes ```Height``` as groups, ```Height2Group```. 

In [None]:
# Logical variable: FALSE = below average height, TRUE = at or above average height
tiny_fingers$Height2Group <- tiny_fingers$Height >= mean(tiny_fingers$Height) 

#Recode to meaningful labels
tiny_fingers$Height2Group <- factor(tiny_fingers$Height2Group, levels = c(FALSE, TRUE), 
                                    labels = c("short", "tall"))

tiny_fingers

In the graph below, we illustrate the two-group approach to modeling this data. Just as a reminder, we would specify this model, the Height2Group model, as a two-parameter model: one parameter is the mean thumb length for short people, and the other is the increment to add on for tall people.

<img src="images/ch12-height2group.png" width="650">

We can fit a model to this data and compare it to the empty model, which is represented in the graph above with the blue horizontal line that goes all the way across the plot. The predicted score for everyone under the empty model is the Grand Mean.

We can also generate an ANOVA table to compare errors in the models. The SS<sub>total</sub> for the Height2Group model is the sum of squared deviations of each score from the Grand Mean (the empty model's prediction). The SS<sub>error</sub> for the Height2Group model is the sum of squared deviations of each score from its group mean (```short``` or ```tall```).


In [None]:
height2group_model <- lm(Thumb ~ Height2Group, data=tiny_fingers)

height2group_model

supernova(height2group_model)

We can see just by inspecting the graph above, without even making an ANOVA table, that the two-group model will yield better predictions than the empty model. Knowing whether someone is short or tall would help us make a more accurate prediction of their thumb length than if we didn’t know which group they were in (in which case we would just use the empty model).
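To make the comparison concrete, here is a minimal sketch that computes the sums of squares by hand rather than reading them from the ANOVA table. It re-creates the six ```tiny_fingers``` values as plain vectors so it can run on its own; the object names (```grand_mean```, ```group_means```, and so on) are just labels chosen for this illustration.

```r
# Same six students as in tiny_fingers (values copied from the chapter)
Thumb  <- c(56, 60, 61, 63, 64, 68)
Height <- c(62, 66, 67, 63, 68, 71)
Height2Group <- ifelse(Height >= mean(Height), "tall", "short")

# Empty model: everyone is predicted to have the Grand Mean
grand_mean <- mean(Thumb)
SS_total <- sum((Thumb - grand_mean)^2)

# Two-group model: everyone is predicted to have their own group's mean
group_means <- ave(Thumb, Height2Group)   # each student's group mean
SS_error <- sum((Thumb - group_means)^2)

# Error removed by the group model, and the same thing as a proportion
SS_model <- SS_total - SS_error
PRE <- SS_model / SS_total

c(SS_total = SS_total, SS_error = SS_error, SS_model = SS_model, PRE = PRE)
```

These values should agree with the SS and PRE columns that ```supernova()``` reports for the same model.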

### The continuous variable model

Now let’s compare what happens when we plot ```Height``` as a quantitative variable instead of as a grouping variable.

<img src="images/ch12-groupvscont.png" width ="850">

The scatterplot (right) shows ```Height``` on the x-axis measured in inches (the explanatory variable) and thumb length measured in millimeters on the y-axis (the outcome variable). We have included the ```Height2Group``` plot on the left, for comparison.

The same six data points are represented in the graph on the left (which uses ```Height2Group``` as the explanatory variable) as in the graph on the right (which uses ```Height``` in inches). Whereas the points on the left are categorized into two groups, the points on the right are spread out over the x-axis, with each person’s height and thumb length represented as a point in two-dimensional space.

The empty model can be represented the same way in both graphs, by a horizontal line through the Grand Mean.

<img src="images/ch12-groupvscont2.png" width ="850">

The empty model of the outcome variable is exactly the same no matter how you code the explanatory variable: it is just the mean of the six data points. The sum of squared deviations around the empty model is therefore the same in both cases.

However, we get a sense from the scatterplot that knowing someone’s exact height would help us make a better prediction of their thumb length than just knowing whether they are short or tall. But where, in the scatterplot, is the model? What are the groups to make means for?

When the explanatory variable is categorical, as with ```Height2Group```, we can use the group means as our model (as reviewed above). But when the explanatory variable is quantitative, as with ```Height``` in inches, we can’t use the group means to predict new scores because there are no groups!

## 12.2 The regression line as a model

When both the outcome and explanatory variables are quantitative, we need a different kind of model. We can’t use the mean as a model, because we don’t have groups. Instead we use a line, the **regression line**.

The regression line has a lot more in common with the mean as a model than you might at first think. Both are mathematical objects constructed from data. The mean is easy to construct and thus makes very simple predictions. The regression line is a little more complex and can be used to make better predictions.

<img src="images/ch12-regression.png" width="600">

As you likely learned in your previous study of algebra, the equation for a straight line is ```y = mx + b```. In statistics, we would say that the regression line is a two-parameter model, the parameters being the slope (m) and y-intercept (b). You may recall a teacher at some point telling you that the slope is the “rise over run” and the y-intercept is where the line crosses the y-axis (that is, the value of y when x is 0). Fitting the model, therefore, is a matter of finding the particular line (i.e., slope and intercept) that best fits the data - the line that minimizes error around it.

Both the mean and the regression line, when used as models, are used to generate predicted scores on the outcome variable, both for existing data points and for new observations that might be created in the future.

The two graphs below illustrate how these predictions work for the empty model (left) and the ```Height``` (regression) model (right).

<img src="images/ch12-regmodels.png" width ="850">

In the left panel of the figure above, the blue horizontal line is drawn at the mean of the outcome variable, ```Thumb```, representing the empty model. We have indicated the model prediction for each data point, represented as blue dots at the mean, directly above or below the data point.

Each blue dot represents a person’s height and predicted thumb length under the empty model. For example, one student has a height of 66 inches. Using the empty model, their predicted thumb length is the mean, or 62 millimeters. As you can see, under the empty model, every person has the same predicted thumb length: 62 millimeters.

On the right, in orange, we have drawn in the best-fitting regression line, which represents the ```Height``` model, over the same ```tiny_fingers``` data points. (Don’t worry for now about how to find the best-fitting regression line; R will take care of that later.) Just as the mean is the best predictor of thumb length under the empty model, and the group means are the best predictions of thumb length under the ```Height2Group``` model, the regression line represents the best predictions under the ```Height``` model.

Under the ```Height``` model, we use information about each person’s height to predict their thumb length. The orange dots on the regression line represent the predicted values for each of the six data points in the ```tiny_fingers``` dataset. Notice that the orange dots differ depending on height. Thus, we predict a longer thumb for someone who is 68 inches than for someone 66 inches tall.

Notice that we can even predict different thumb lengths for values of height that do not exist in our data. To predict the thumb length of someone 64 inches tall, find 64 inches on the x-axis (see the plot below). Then go up vertically to the regression line (the red dotted line) to find the model’s prediction. The red dot corresponds to the predicted thumb length for someone who is 64 inches tall: 60 mm.

<img src="images/ch12-regprediction.png" width="600">

If you didn’t know someone’s height, you would need to use the empty model, in which case, the mean of ```Thumb``` would be the best predictor of thumb length.

## 12.3 Defining error in the regression model

As we have said all along, all models are wrong. What we are looking for is a model that is better than nothing, or, as you might by now suggest, better than the empty model. Comparing error between the ```Height``` model and the empty model lets us see which model is less wrong.

Error, under both the empty model and the regression model, is defined as the residual, or the observed score minus the score predicted by the model for each data point. Error in the empty model is the deviation of an observed score from the mean. Error in the regression model is the deviation of the observed score from the regression line, measured in vertical distance (see figure).

<img src="images/ch12-regerror.png" width="600">

Recall that the mean is the middle of a *univariate* distribution (distribution of one variable), equally balancing the residuals above and below. In a similar way, the regression line is the middle of a *bivariate* distribution, the pattern of variation in two quantitative variables. Just as the residuals around the mean add up to 0, so too the residuals around the regression line add up to 0.

Here’s another cool relationship between the mean and regression line — if someone had an average height, intuitively we would predict that their thumb length might also be average. And it turns out that the regression line passes through a point that is both the mean of the outcome variable and of the explanatory variable. So the regression line doesn’t cancel out the empty model, it enhances it!

Finally, just as the mean is the point in the univariate distribution at which the SS<sub>error</sub> is minimized, the same is true of error around the regression line. The sum of the squared deviations of the observed points is at its lowest possible level around the best-fitting regression line.
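The three properties described in this section can be written compactly. This is only a restatement in symbols, assuming the usual notation in which $\hat{Y}_i$ is the model's prediction for person $i$, $\bar{X}$ and $\bar{Y}$ are the means of the explanatory and outcome variables, and $b_0$ and $b_1$ are the intercept and slope of the best-fitting line (the notation is developed in the next section).

The residuals balance:

$$ \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (Y_i - \hat{Y}_i) = 0 $$

The line passes through the point of means:

$$ \bar{Y} = b_0 + b_1\bar{X} $$

And among all possible lines, the best-fitting one has the smallest sum of squared residuals:

$$ SS_{error} = \sum_{i=1}^{n} (Y_i - b_0 - b_1X_i)^2 \text{ is as small as possible} $$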

## 12.4 Specifying the model in GLM notation

Let’s look at how we specify the model in the case where we have a single quantitative explanatory variable (such as ```Height```). We will write the model like this:

$$ Y_i = b_0 + b_1X_i + e_i $$

It might be useful to compare this notation to that used in the previous chapter to specify the two-group model (such as ```Sex``` or ```Height2Group```):

$$ Y_i = b_0 + b_1X_i + e_i $$

No, it’s not a typo: both of the model specifications are identical. This, in fact, is what is so beautiful about the General Linear Model. It is simple and elegant and can be applied across a wide variety of situations, including situations with categorical or quantitative explanatory variables.

Although both models are specified using the same notation, the interpretation of the notation varies from situation to situation. It is always important to think, first, about what each component of the model specification means.

As before, Y<sub>i</sub> is the DATA, and e<sub>i</sub> is the ERROR. b<sub>0</sub> + b<sub>1</sub>X<sub>i</sub> represents the complex model. Let’s think about what each of these elements means in the context of our ```tiny_fingers``` data. We are trying to predict the thumb length of college students a little bit better by considering their variation in height. And let’s also consider what might be different about the interpretation in the current case, with height as a quantitative variable, compared with the previous case in which height was coded with categories (short vs. tall).

Both Y<sub>i</sub> and e<sub>i</sub> have the same interpretation in the quantitative model (regression) as in the group model. The outcome variable in both cases is the thumb length of each person, measured as a quantitative variable. And the error term is each person’s deviation from their predicted thumb length under the model.

The explanatory variable X<sub>i</sub> has a different meaning under these two models. This may be more clear when we write the two models like this:

Group Model: Thumb<sub>i</sub> = b<sub>0</sub> + b<sub>1</sub>Height2Group<sub>i</sub> + e<sub>i</sub>

Regression Model: Thumb<sub>i</sub> = b<sub>0</sub> + b<sub>1</sub>Height<sub>i</sub> + e<sub>i</sub>

In the group model, the variable X<sub>i</sub> was coded as "tall" or "short". Because it divided people into one of two categories, the model could only make the simple prediction of one mean thumb length for short people, and another for tall people. In the regression model, in contrast, X<sub>i</sub> is the actual measurement of person i’s height in inches. This model can make a different prediction of thumb length for every possible value of height.

Coding X<sub>i</sub> in these different ways leads to a different, but related, interpretation of the b<sub>1</sub> coefficient. In the group model, you will recall, the b<sub>1</sub> coefficient represents the increment in millimeters that must be added to the mean thumb length for short people to get the mean thumb length for tall people. This increment is added only when X<sub>i</sub> is equal to 1 in the dummy variable version of ```Height2Group```.

Because X<sub>i</sub> in the regression model represents the measured height of each individual in inches, b<sub>1</sub> represents the increment that must be added to the predicted thumb length of a person for *each one-unit increment in height*. If that sounds familiar to you, it may be because it sounds exactly like the definition of the slope of a line (how much “rise” for each one unit of “run”). b<sub>1</sub> is, in fact, the slope of the best-fitting regression line.

If b<sub>1</sub> is the slope of the line, it stands to reason that b<sub>0</sub> will be the y-intercept of the line: the predicted value (under the regression model) of y when x is equal to 0. Because the regression line is a line, it needs to be defined by a slope and an intercept. But in the case of a height model trying to predict thumb length, the intercept is purely theoretical: it’s impossible for someone to be 0 inches tall! So it’s pretty weird to predict thumb length for a height of 0, but technically the model can make a prediction anyway.
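A short piece of algebra shows why b<sub>1</sub> is the change in the prediction for a one-unit change in height. Here $\hat{Y}(x)$ is shorthand, introduced only for this illustration, for the model's prediction at a height of $x$ inches. Comparing the predictions for two people whose heights differ by exactly one inch:

$$ \hat{Y}(x + 1) - \hat{Y}(x) = [b_0 + b_1(x + 1)] - [b_0 + b_1x] = b_1 $$

The intercept drops out, so the difference is b<sub>1</sub> no matter which height we start from. Setting $x = 0$ instead recovers the intercept:

$$ \hat{Y}(0) = b_0 + b_1 \cdot 0 = b_0 $$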

We have summarized these differences between the two models in the table below.

<img src="images/ch12-groupvscont-table.png" width="850">

## 12.5 Fitting a regression model

Now you can begin to see the power you’ve been granted by the General Linear Model. Fitting the regression model (that is, estimating its parameters) is accomplished the same way as fitting the grouping model. It’s all done using the ```lm()``` function in R.

The ```lm()``` function is smart enough to know that if the explanatory variable is quantitative, it should estimate the regression model. If the explanatory variable is categorical (e.g., defined as a factor in R), ```lm()``` will fit a group model.

In fact, under the hood ```lm()``` actually treats the group model as a regression model. How? It converts the categorical labels into a dummy version of the variable, with 0 for one level and 1 for the other. This way, calculating the slope of a regression line (the change in y for a one-unit increase in the explanatory variable) amounts to finding the change in y that corresponds to going from X=0 to X=1. In other words, going from one level to the other.
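One way to see the dummy coding is to look at the design matrix R builds from a factor, and then to build the same 0/1 variable by hand. The sketch below re-creates the ```tiny_fingers``` values as plain vectors; ```dummy```, ```coef_factor```, and ```coef_dummy``` are names made up for this illustration.

```r
Thumb  <- c(56, 60, 61, 63, 64, 68)
Height <- c(62, 66, 67, 63, 68, 71)
Height2Group <- factor(Height >= mean(Height), levels = c(FALSE, TRUE),
                       labels = c("short", "tall"))

# The design matrix built from the factor: a column of 1s for the
# intercept and a 0/1 column that is 1 only for "tall" students
model.matrix(~ Height2Group)

# Building the same 0/1 variable by hand
dummy <- as.numeric(Height2Group == "tall")

# Fit the model both ways and compare the estimates
coef_factor <- coef(lm(Thumb ~ Height2Group))
coef_dummy  <- coef(lm(Thumb ~ dummy))

unname(coef_factor)
unname(coef_dummy)
```

Both fits should give the same two numbers: the mean thumb length of the short group, and the difference between the tall and short group means.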

Modify the code below to fit the regression model using ```Height``` as the explanatory variable and predicting ```Thumb``` in the ```tiny_fingers``` data.

In [None]:
# modify this to fit Thumb as a function of Height
tiny_Height_model <- lm()

# this prints the best-fitting estimates
tiny_Height_model


Although R is pretty smart about knowing which model to fit, it won’t always think of your data values in the same way you do. If you code the grouping variable with the character strings “short” and “tall,” R will make a dummy variable out of them with 0's and 1's as levels. But if you code the same grouping variable as 1's and 2's yourself, and you forget to make it a factor, R may get confused and fit the model as though the levels had the numeric *values* 1 and 2, meaning the y-intercept (the prediction at x = 0) would be different than we expect.

For example, we’ve added a new variable to our ```tiny_fingers``` data called ```GroupNum```. Here is what the data look like.

In [None]:
tiny_fingers$GroupNum <- c(1,1,2,1,2,2)

tiny_fingers

If you take a look at the variables ```Height2Group``` and ```GroupNum```, they have the same information. Students 1, 2, and 4 are in one group and students 3, 5, and 6 are in another group. If we fit a model with ```Height2Group``` (and called it the ```Height2Group_model```) or ```GroupNum``` (and called it the ```GroupNum_model```), we would expect the models to have the same coefficient estimates. Let's try it.

In [None]:
# fit a model of Thumb length based on Height2Group
Height2Group_model <- lm()

# fit a model of Thumb length based on GroupNum
GroupNum_model <- lm()

# this prints the parameter estimates from the two models
Height2Group_model
GroupNum_model


b<sub>0</sub> is the y-intercept of the line - where x=0. In a group model, we can interpret it as the mean of a reference group, but *only if that reference is coded as 0.* If we code it as 1 and another group as 2, that implies the existence of another possible value for x (at 0). Thus, b<sub>0</sub> will represent that value instead of the mean of our reference group. 
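The sketch below checks this interpretation on the six ```tiny_fingers``` students, and shows one possible remedy: wrapping the numeric codes in ```factor()``` so that R dummy-codes them. The names ```num_fit```, ```fac_fit```, and ```mean_short``` are chosen here for illustration.

```r
Thumb    <- c(56, 60, 61, 63, 64, 68)
GroupNum <- c(1, 1, 2, 1, 2, 2)   # 1 = short, 2 = tall, as in the chapter

# Treated as a number: b0 is the prediction at GroupNum = 0,
# a group that does not exist in the data
num_fit <- coef(lm(Thumb ~ GroupNum))

# Treated as a factor: the codes are dummy-coded, so b0 is the mean of group 1
fac_fit <- coef(lm(Thumb ~ factor(GroupNum)))

mean_short <- mean(Thumb[GroupNum == 1])

unname(num_fit)   # intercept sits one slope below the short-group mean
unname(fac_fit)   # intercept equals the short-group mean
mean_short
```

The slope is the same in both fits; only the intercept moves, because the two codings disagree about where x = 0 is.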

Now that you have looked in detail at the tiny set of data, fit the height model to the full ```fingers``` data frame, and save the model in an R object called ```Height_model```.

In [None]:
fingers <- read_csv("https://raw.githubusercontent.com/smburns47/Psyc158/main/fingers.csv")

# modify this to fit the Height model of Thumb for the fingers data
Height_model <- lm()

# this prints best estimates
Height_model


Here is the code to make a scatterplot to show the relationship between ```Height``` (on the x-axis) and ```Thumb``` (on the y-axis). Note that the code also overlays the best-fitting regression line on the scatterplot. 

In [None]:
gf_point(Thumb ~ Height, data = fingers, size = 4) %>%
gf_lm(color = "orange")


## 12.6 Using the regression line to make predictions

The specific regression line, defined by its slope and intercept, is the one that fits our data best. By this we mean that this model reduces leftover error to the smallest level possible given our variables. Specifically, the sum of squared deviations around this line is the lowest of any possible line we could have used instead. Like the empty and group models, error around the regression line is also balanced. You can almost imagine the data points each pulling on the regression line and the best fitting regression line balances the “pulls” above and below it.
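To get a feel for what "lowest of any possible line" means, the sketch below compares the sum of squared residuals around the fitted line with the sum around a few other lines. It uses the six ```tiny_fingers``` students so it runs on its own; the helper ```ss_around_line()``` and the particular alternative lines are made up for this illustration.

```r
Thumb  <- c(56, 60, 61, 63, 64, 68)
Height <- c(62, 66, 67, 63, 68, 71)

# Sum of squared residuals around any line with intercept b0 and slope b1
ss_around_line <- function(b0, b1) sum((Thumb - (b0 + b1 * Height))^2)

best <- coef(lm(Thumb ~ Height))
ss_best <- ss_around_line(best[[1]], best[[2]])

# A few alternative lines: nudge the intercept or the slope a little
ss_higher  <- ss_around_line(best[[1]] + 1, best[[2]])
ss_steeper <- ss_around_line(best[[1]], best[[2]] + 0.05)
ss_flat    <- ss_around_line(mean(Thumb), 0)   # the empty model as a flat line

c(best = ss_best, higher = ss_higher, steeper = ss_steeper, flat = ss_flat)
```

Trying other intercepts and slopes in ```ss_around_line()``` should never produce a value below ```ss_best```.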

This regression model also is our best estimate of the relationship between height and thumb length *in the population*. As with other models of the population, we can use the regression model to predict future observations. To do so we must turn it into an equation, one that will predict thumb length based on height.

Here is the fitted model for using ```Height``` to predict ```Thumb``` based on the complete ```fingers``` dataset, where Y<sub>i</sub> is each observation of ```Thumb``` and X<sub>i</sub> is each observation of ```Height```:

$$ Y_i = -3.33 + 0.96 * X_i + e_i $$

Remember, an equation takes in some input and spits out a prediction based on a model. Here is the equation we can use to predict a thumb length based on a person’s height:

$$ \hat{Y}_i = -3.33 + 0.96 * X_i $$

With the two-group model it was easy to make predictions from the model: no calculation was required to see that if the person was female, the prediction would be the mean for female people; and if the person was male, the prediction would be the mean for male people. But with the regression model it’s harder to do the calculation in your head.
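As a worked example of the arithmetic, take a person who is 66 inches tall (a height picked only for illustration). Plugging into the fitted equation with the rounded coefficients shown above:

$$ \hat{Y}_i = -3.33 + 0.96 * 66 = -3.33 + 63.36 = 60.03 $$

So the model predicts a thumb length of about 60 mm. Because the coefficients are rounded to two decimals, a prediction computed from the unrounded estimates in R may differ slightly.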

Remember we can use the ```$coefficients``` element of a model object (R also accepts the abbreviation ```$coefficient```) to pull out the coefficient estimates b<sub>0</sub> and b<sub>1</sub>. We can pull out specific ones with double-bracket indexing ([[1]] for b<sub>0</sub> and [[2]] for b<sub>1</sub>). So if we wanted to generate a predicted thumb length using the ```Height_model``` for someone who is 60 inches tall, we could write:

In [None]:
b0 <- Height_model$coefficients[[1]]
b1 <- Height_model$coefficients[[2]]

b0 + b1*60

#how would you output a prediction for someone who is 73.5 inches tall?


This code works fine for making individual predictions, but to check our model quality overall, we would want to generate predictions for each student in the ```fingers``` data frame using the ```predict()``` function. As we’ve said before, we really don’t need predictions when we already know their actual thumb lengths. But this is a way to see how well (or how poorly) the model *would have* predicted the thumb lengths for the students in our data set. 
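Here is a minimal sketch of ```predict()``` in action. To stay self-contained it uses the six ```tiny_fingers``` students rather than the full ```fingers``` data; ```tiny_model```, ```new_heights```, and the ```Predicted``` column are names chosen for this illustration.

```r
tiny_fingers <- data.frame(
  Thumb  = c(56, 60, 61, 63, 64, 68),
  Height = c(62, 66, 67, 63, 68, 71)
)
tiny_model <- lm(Thumb ~ Height, data = tiny_fingers)

# With no new data, predict() returns one prediction per row of the original data
tiny_fingers$Predicted <- predict(tiny_model)

# With newdata, it predicts for heights we supply; the column name must
# match the explanatory variable used in the formula
new_heights <- data.frame(Height = c(60, 64))
new_predictions <- predict(tiny_model, newdata = new_heights)

tiny_fingers
new_predictions
```

The same pattern should carry over to the full data set by swapping in ```Height_model``` and ```fingers```.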

## 12.7 Examining regression model fit

You probably remember from the previous chapters how to save the residuals from a model as well. We can do the same thing with a regression model: whenever we fit a model, we can generate both predictions and residuals from the model. Try to generate the residuals from the ```Height_model``` that you fit to the full ```fingers``` dataset.

In [None]:
# modify to save the residuals from Height_model
fingers$Height_resid <- resid()

# modify to make a histogram of Height_resid
gf_histogram()


The residuals from the regression line are centered at 0, just as they were from the empty model and the two-group model. In those previous models, this was true by definition: deviations of scores around the mean will always sum to 0 because the mean is the balancing point of the residuals. Thus the sum of these negative and positive residuals will be 0.

It turns out this is also true of the best-fitting regression line: the sum of the residuals from each score to the regression line add up to 0, by definition. In this sense, too, the regression line is similar to the mean of a distribution in that it perfectly balances the scores above and below the line.
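Both balancing properties can be checked numerically. The sketch below uses the six ```tiny_fingers``` students so that it runs on its own; ```sum_resid``` and ```pred_at_mean``` are names chosen for this illustration.

```r
tiny_fingers <- data.frame(
  Thumb  = c(56, 60, 61, 63, 64, 68),
  Height = c(62, 66, 67, 63, 68, 71)
)
tiny_model <- lm(Thumb ~ Height, data = tiny_fingers)

# Residuals above and below the line cancel out (up to floating-point rounding)
sum_resid <- sum(resid(tiny_model))

# The prediction for a person of exactly average height is the average thumb length
pred_at_mean <- predict(tiny_model,
                        newdata = data.frame(Height = mean(tiny_fingers$Height)))

sum_resid
c(pred_at_mean = unname(pred_at_mean), mean_thumb = mean(tiny_fingers$Thumb))
```

Because computers store decimals approximately, the sum of the residuals may print as a tiny number such as 1e-15 rather than exactly 0.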

Finally, let’s examine the fit of our regression model by running the ```supernova()``` function on our model. At the same time, let’s compare the table we get from the regression model (```Height_model```) with the one produced for a ```Height2Group_model```.

In [None]:
fingers$Height2Group <- fingers$Height >= mean(fingers$Height, na.rm=TRUE)
Height2Group_model <- lm(Thumb ~ Height2Group, data=fingers)

print("Height2Group model")
supernova(Height2Group_model)
print("Height model")
supernova(Height_model)

Remember, the total sum of squares is the sum of squared residuals from the empty model. Total sum of squares is all about the outcome variable, and isn’t affected by the explanatory variable or variables. And when we compare statistical models, as we are doing here, we always are modeling the same outcome variable.

For any model with an explanatory variable (what we have been calling “full models”), the SS<sub>total</sub> can be partitioned into the SS<sub>error</sub> and the SS<sub>model</sub>. The SS<sub>model</sub> is the amount by which the error is reduced under the full model (e.g., the Height model) compared with the empty model.

As we developed previously for a group model, SS<sub>model</sub> is easily calculated by subtracting SS<sub>error</sub> from SS<sub>total</sub>. This is the same, regardless of whether you are fitting a group model or a regression model. Error from the model is defined in the former case as residuals from the group means, and in the latter, residuals from the regression line.

It also is possible to calculate the SS<sub>model</sub> in the regression model directly, in much the same way we did for the group model. Recall that for the group model, SS<sub>model</sub> was the sum of the squared deviations of each person’s predicted score (their group mean) from the Grand Mean. In the regression model, SS<sub>model</sub> is calculated in exactly the same way, except that each person’s predicted score is defined as a point on the regression line. The Grand Mean is the same in both cases.

PRE has the same interpretation in the context of regression models as it does for group models. As we have pointed out, the total sum of squares is the same for both models. And the PRE is obtained in both cases by dividing SS<sub>model</sub> by SS<sub>total</sub>.
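The partition can be verified by computing each sum of squares directly from its definition. This sketch uses the six ```tiny_fingers``` students; the object names are chosen for illustration, and the values should match what ```supernova()``` reports for the same model.

```r
tiny_fingers <- data.frame(
  Thumb  = c(56, 60, 61, 63, 64, 68),
  Height = c(62, 66, 67, 63, 68, 71)
)
tiny_model <- lm(Thumb ~ Height, data = tiny_fingers)

grand_mean <- mean(tiny_fingers$Thumb)
predicted  <- predict(tiny_model)

SS_total <- sum((tiny_fingers$Thumb - grand_mean)^2)  # error around the empty model
SS_error <- sum((tiny_fingers$Thumb - predicted)^2)   # error around the regression line
SS_model <- sum((predicted - grand_mean)^2)           # predictions vs. the Grand Mean

PRE <- SS_model / SS_total

c(SS_total = SS_total, SS_model = SS_model, SS_error = SS_error, PRE = PRE)
```

Note that SS<sub>model</sub> is computed here without subtracting anything, yet it still equals SS<sub>total</sub> minus SS<sub>error</sub>.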

Many statistics textbooks emphasize the difference between group-based models (using what's called ANOVA tests) and regression models. They try to get students to think of them as very separate things. But in fact, the two types of models are fundamentally the same and easily incorporated into the General Linear Model framework.

## 12.8 Comparing different full models

As we can see by comparing the "Error" and "Total" lines, both the Height2Group model and the Height model reduce error, compared to the empty model. They both are better models than using the Grand Mean to make predictions on ```Thumb```. But how do they compare to each other? 

We learned last chapter that PRE represents the proportion of total variance that a model accounts for. So, if we want to know which model is better for us to use here, we can simply see which gave us a bigger PRE score! Using this standard, we can decide that the better model is the Height model, as it accounts for 15.29% of the variance in ```Thumb``` whereas the Height2Group model only accounts for 10.76%.

Actually, we're being a little too simplistic right now. You can't *always* directly compare PRE scores between models to decide which is better. You can only do that in the case of models that have the same number of parameters (which these models do). When you start making more complex models with more parameters and comparing them to simpler models (but not the empty model), we have to use a different statistic. But more on that next chapter.

The reason the Height model did a better job is because, as we mentioned before, by shoving all the datapoints from ```Height``` into one of two categories, we removed information that this variable was providing. There are now only two unique values instead of many. With less information, we have less predictive ability. 
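The same comparison can be tried on the six ```tiny_fingers``` students. This is only a sketch on a very small data set, so the PRE values will not match the ones from the full ```fingers``` data; the helper ```pre()``` is a name made up for this illustration.

```r
Thumb  <- c(56, 60, 61, 63, 64, 68)
Height <- c(62, 66, 67, 63, 68, 71)
Height2Group <- Height >= mean(Height)   # FALSE = short, TRUE = tall

# PRE = 1 - SS_error / SS_total, written as a small helper
pre <- function(model, outcome) {
  1 - sum(resid(model)^2) / sum((outcome - mean(outcome))^2)
}

PRE_group  <- pre(lm(Thumb ~ Height2Group), Thumb)
PRE_height <- pre(lm(Thumb ~ Height), Thumb)

# How many distinct values does each version of the variable carry?
c(groups = length(unique(Height2Group)), inches = length(unique(Height)))
c(PRE_group = PRE_group, PRE_height = PRE_height)
```

Both models use two parameters, so comparing their PRE values directly is fair in the sense described above.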

## 12.9 Correlation

You might have heard of Pearson’s *r*, often referred to as a “correlation coefficient.” It's yet another historical tool, usually taught separately from other concepts, but which is actually just a special case of regression in which both the outcome and explanatory variables are transformed into z scores prior to analysis.

Let’s see what happens when we transform the two variables we have been working with: ```Thumb``` and ```Height```. Because both variables are transformed into z scores, the mean of each distribution will be 0, and the standard deviation will be 1. The function ```scale()``` will convert all the values in a variable to z scores.

In [None]:
# this transforms all Thumb lengths into zscores
fingers$Thumb_z <- scale(fingers$Thumb)

# modify this to do the same for Height
fingers$Height_z <- 


Let’s make a scatterplot of ```Thumb_z``` and ```Height_z```, and compare it to a scatterplot of ```Thumb``` and ```Height```. 

In [None]:
# this makes a scatterplot of the raw scores
# size makes the points bigger or smaller
gf_point(Thumb ~ Height, data = fingers, size = 4, color = "black")

# modify this to make a scatterplot of the zscores
# feel free to change the colors
gf_point( #FORMULA HERE, data = fingers, size = 4, color = "firebrick")


Compare the scatterplot of ```Thumb``` by ```Height``` (in black) with the scatterplot of ```Thumb_z``` by ```Height_z``` (in firebrick). How are they similar? How are they different?

Z-scoring doesn't change the position of datapoints relative to each other. It just changes the scale on which they fall. 
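To see what the transformation does to individual values, the sketch below computes z scores by hand for the six thumb lengths in ```tiny_fingers``` and compares them with the output of ```scale()```. The names ```Thumb_z_by_hand``` and ```Thumb_z_scale``` are chosen for this illustration.

```r
Thumb <- c(56, 60, 61, 63, 64, 68)

# A z score is the distance from the mean, measured in standard deviations
Thumb_z_by_hand <- (Thumb - mean(Thumb)) / sd(Thumb)

# scale() returns a one-column matrix, so flatten it before comparing
Thumb_z_scale <- as.numeric(scale(Thumb))

# The order of the students is unchanged; only the units are different
data.frame(Thumb, Thumb_z_by_hand, Thumb_z_scale)
c(mean = mean(Thumb_z_by_hand), sd = sd(Thumb_z_by_hand))
```

The student with the shortest thumb still has the lowest z score and the student with the longest thumb still has the highest, which is why the two scatterplots have the same shape.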

Now fit a model using the z score version of each variable:

In [None]:
#fit a regression model Thumb_z ~ Height_z
Height_z_model <- lm(Thumb_z ~ Height_z, data = fingers)

# compare ANOVA tables for each model
supernova(Height_model)
supernova(Height_z_model)

Looking at the PRE scores for both models, the fit of the models is identical. This is because all we have changed is the unit in which we measure the outcome and explanatory variables. Unlike PRE, which is a proportion of the total, SS are expressed in the (squared) units of the measurement. So if we converted the mm (for ```Thumb``` length) and inches (for ```Height```) into cm, feet, etc, the SS would change to reflect those new units.
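A unit conversion makes the same point without z scores. The sketch below takes the six ```tiny_fingers``` thumb lengths, converts them from millimeters to centimeters, and fits the height model both ways; the helper ```ss_and_pre()``` is a name made up for this illustration.

```r
Thumb_mm <- c(56, 60, 61, 63, 64, 68)
Height   <- c(62, 66, 67, 63, 68, 71)
Thumb_cm <- Thumb_mm / 10   # same thumbs, different unit

# Fit a regression and return the sums of squares along with PRE
ss_and_pre <- function(outcome, predictor) {
  model <- lm(outcome ~ predictor)
  SS_total <- sum((outcome - mean(outcome))^2)
  SS_error <- sum(resid(model)^2)
  c(SS_total = SS_total, SS_error = SS_error, PRE = 1 - SS_error / SS_total)
}

fit_mm <- ss_and_pre(Thumb_mm, Height)
fit_cm <- ss_and_pre(Thumb_cm, Height)

rbind(mm = fit_mm, cm = fit_cm)
```

The sums of squares shrink by a factor of 100 (10 squared) when moving from mm to cm, while PRE stays the same.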

Transforming both outcome and explanatory variables can help us assess the strength of a relationship between two quantitative variables another way, independent of the units on which each variable is measured. The slope of the regression line between the standardized variables is also called the correlation coefficient, or Pearson’s *r*. Thus we can calculate Pearson’s *r* by transforming the variables into z scores, fitting a regression line, and pulling out b<sub>1</sub>:

In [None]:
# pull out b1 (the slope), which is the second coefficient
coef(Height_z_model)[[2]]

It seems like a lot of work, though, to transform variables into z scores and then fit a regression line. Fortunately, R provides an easy way to directly calculate the correlation coefficient (Pearson’s *r*): the ```cor()``` function. 

In [None]:
# this calculates the correlation of Thumb and Height
cor(fingers$Thumb, fingers$Height)
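
If you want to convince yourself that the two routes give the same number, a quick check like the following may help. It is a sketch on simulated data (`x` and `y` are hypothetical variables), comparing the slope from the z-scored regression with the output of `cor()`.

```r
set.seed(1)
x <- rnorm(50, mean = 66, sd = 4)      # hypothetical explanatory variable
y <- 10 + 0.8 * x + rnorm(50, sd = 6)  # hypothetical outcome variable

# Convert both variables to z scores by hand
x_z <- (x - mean(x)) / sd(x)
y_z <- (y - mean(y)) / sd(y)

# Route 1: slope of the regression line for the standardized variables
slope_z <- coef(lm(y_z ~ x_z))[[2]]

# Route 2: cor() on the original, unstandardized variables
r <- cor(x, y)

# TRUE if the two agree (up to floating-point rounding)
all.equal(slope_z, r)
```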


```cor()``` is one of those functions that is sensitive to NAs in the data, but it doesn't handle them the same way ```mean()``` or ```sd()``` do: it has no ```na.rm``` argument. Instead, to ignore NAs you use the ```use="pairwise.complete.obs"``` name-value argument, which tells R to skip over any observations that are missing a value in at least one of the variables. In other words, R uses only the observations that are "pairwise complete": having values (complete) on both variables (pairwise). Note that it's okay if there are NAs in *other* variables of a dataset for a particular observation, so long as they aren't in the variables we're making a correlation with right now. 

In [None]:
height_w_NA <- fingers$Height
height_w_NA[10] <- NA

#This will output NA, since the 10th item of height_w_NA is missing
cor(fingers$Thumb, height_w_NA)

#This fixes it by leaving out that observation on both variables
cor(fingers$Thumb, height_w_NA, use="pairwise.complete.obs")
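
The point about NAs in *other* variables is easiest to see when `cor()` is given a whole data frame. The sketch below uses a small made-up data frame `d` (hypothetical values) and contrasts `"pairwise.complete.obs"` with `"complete.obs"`, another option of the `use` argument that drops a row for every pair if it has an NA anywhere.

```r
# Small made-up data frame; the only NA sits in the third variable
d <- data.frame(
  thumb  = c(55, 60, 58, 63, 57, 65),
  height = c(62, 67, 65, 70, 64, 71),
  pinkie = c(50, NA, 53, 57, 52, 59)
)

# thumb and height are complete, so no `use` argument is needed here
r_all_rows <- cor(d$thumb, d$height)

# Pairwise: row 2 is dropped only for the pairs that involve pinkie,
# so the thumb-height correlation still uses all six rows
pairwise <- cor(d, use = "pairwise.complete.obs")
pairwise

# Complete: row 2 is dropped for every pair, including thumb-height
complete <- cor(d, use = "complete.obs")
complete
```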

When *r* = 0.39, the slope of the regression line when both variables are standardized is 0.39. This means that an observation that is one standard deviation above the mean on the explanatory variable (x-axis) is predicted to be 0.39 standard deviations above the mean on the outcome variable (y-axis).

Remember, the best-fitting regression model for ```Thumb``` by ```Height``` was Y<sub>i</sub> = -3.33 + 0.96X<sub>i</sub>. The slope of 0.96 indicates an increment of 0.96 mm of ```Thumb``` length for every one inch increase in ```Height```. We will call this slope the *unstandardized slope* (and we have been representing it as b<sub>1</sub>).

While the slope of the standardized regression line is Pearson’s *r*, the slope of the unstandardized regression line is not *r*: it also depends on the standard deviations of the two variables, so it does not by itself tell you how strong the relationship is. However, there is a way to see the strength of the relationship in an unstandardized regression plot; it’s just not the slope coefficient!
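
For readers who want the exact link between the two slopes: the unstandardized slope equals *r* rescaled by the ratio of the two standard deviations, b<sub>1</sub> = *r* × s<sub>Y</sub> / s<sub>X</sub>. The sketch below checks this on simulated data (`height` and `thumb` here are invented vectors, not columns of `fingers`).

```r
set.seed(2)
height <- rnorm(80, mean = 66, sd = 4)             # simulated, inches
thumb  <- -3 + 0.95 * height + rnorm(80, sd = 8)   # simulated, mm

b1 <- coef(lm(thumb ~ height))[[2]]
r  <- cor(thumb, height)

# The unstandardized slope is r rescaled by the ratio of the two SDs
b1_from_r <- r * sd(thumb) / sd(height)
b1
b1_from_r
```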

In an unstandardized scatterplot we can “see” the strength of a relationship between two quantitative variables by looking at the closeness of the points to the regression line. Pearson’s *r* is directly related to how closely the points adhere to the regression line. If they cluster tightly, it indicates a strong relationship; if they don’t, it indicates a weak or nonexistent one.

Just as z scores let us compare scores from different distributions, the correlation coefficient gives us a way to compare the strength of a relationship between two variables with that of other relationships between pairs of variables measured in different units.

For example, compare the scatterplots below for ```Height``` predicting ```Thumb``` (on the left) and ```Pinkie``` predicting ```Thumb``` (on the right).

<img src="images/ch12-heightxpinkie.png" width="850">

Correlation measures the tightness of the data around the line, that is, the strength of the linear relationship. Judging from the scatterplots, we would say that the relationship between ```Thumb``` and ```Pinkie``` is stronger because the linear Pinkie model would probably produce better predictions. Another way to say that is this: there is less error around the Pinkie regression line.

Try calculating Pearson’s *r* for each relationship. Are our intuitions confirmed by the correlation coefficients?


In [None]:
# this calculates the correlation of Thumb and Height
cor(fingers$Thumb, fingers$Height)

# calculate the correlation of Thumb and Pinkie


The correlation coefficients confirm that pinkie finger length has a stronger relationship with thumb length than height does. This makes sense because the points in the scatterplot of ```Thumb``` by ```Pinkie``` are more tightly clustered around the regression line than in the scatterplot of ```Thumb``` by ```Height```.

Just as the slope of a regression line can be positive or negative, so can a correlation coefficient. Pearson’s *r* can range from +1 to -1. A correlation coefficient of +1 means that the score of an observation on the outcome variable can be perfectly predicted by the observation’s score on the explanatory variable, and that the higher the score on one, the higher on the other.

A correlation coefficient of -1 means the outcome score is just as predictable, but in the opposite direction. With a negative correlation, the higher the score on the explanatory variable, the lower the score on the outcome variable. The image below shows what scatterplots would look like for a perfect correlation of +1, no correlation, and a perfect correlation of -1. 

<img src="images/ch12-correlations.png" width="900">
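
You can reproduce the three extreme cases with a few made-up numbers. In this sketch the vector `x` and the three outcomes are hypothetical; the last outcome is a symmetric, U-shaped set of values picked so that the positive and negative products cancel.

```r
x <- c(1, 2, 3, 4, 5, 6)

# Every point falls on a rising line: r is +1, however steep the line
r_pos <- cor(x, 2 * x + 10)

# Every point falls on a falling line: r is -1
r_neg <- cor(x, -3 * x + 40)

# Symmetric U-shaped values: no linear relationship, so r is 0
r_zero <- cor(x, c(3, 1, 2, 2, 1, 3))

r_pos
r_neg
r_zero
```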

As you get more familiar with statistics you will be able to look at a scatterplot and guess the correlation between the two variables, without even knowing what the variables are! Here’s a game that lets you practice your guessing. Try it, and see what your best score is!

[Click here to play the game](http://www.rossmanchance.com/applets/GuessCorrelation.html)

The unstandardized slope measures steepness of the best-fitting line. Correlation measures the strength of the linear relationship, which is indicated by how close the data points are to the regression line, regardless of slope.
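
One way to see that steepness and strength are separate ideas is to simulate two outcomes with the same underlying slope but different amounts of scatter. This is only a sketch: the variables and the noise levels (`sd = 1` versus `sd = 10`) are arbitrary choices for illustration.

```r
set.seed(3)
x <- runif(200, min = 50, max = 70)

# Same underlying slope (0.9), different amounts of scatter around the line
y_tight <- 5 + 0.9 * x + rnorm(200, sd = 1)
y_loose <- 5 + 0.9 * x + rnorm(200, sd = 10)

# Both fitted slopes estimate the same underlying steepness ...
slope_tight <- coef(lm(y_tight ~ x))[[2]]
slope_loose <- coef(lm(y_loose ~ x))[[2]]
slope_tight
slope_loose

# ... but the correlation is much higher when the points hug the line
r_tight <- cor(x, y_tight)
r_loose <- cor(x, y_loose)
r_tight
r_loose
```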

Let’s think this through more carefully with the two scatterplots and best-fitting regression lines presented in the table below. For a brief moment, let’s give up on ```Thumb``` and try instead to predict the length of middle and pinkie fingers. On the left, we have tried to explain variation in ```Middle``` (outcome variable) with ```Pinkie``` (explanatory variable). On the right, we have simply reversed the two variables, explaining variation in ```Pinkie``` (outcome) with ```Middle``` (explanatory variable).

<img src="images/ch12-pinkiexmiddle.png" width="900">

The slopes of these two lines are different. The one on the left is steeper (0.92), the one on the right more gradual in its ascent (0.61). Despite this difference, however, we know that the strength of the relationship, as measured by Pearson’s *r*, is the same in both cases: they are the same two variables.

The slope difference in this case is due to the fact that middle fingers are a bit longer than pinkies. So, a millimeter difference in ```Pinkie``` is a bigger deal than a millimeter difference in ```Middle```. When the pinkie is a millimeter longer, the middle finger is 0.92 millimeters longer, on average; but when the middle finger is a millimeter longer, the pinkie is only up by 0.61 millimeters on average.

The unstandardized slope (the b<sub>1</sub> coefficient) must always be interpreted in context. It will depend on the units of measurement of the two variables, as well as on the meaning that a change in each unit has in the scheme of things. There’s a lot going on in a slope! The advantage of unstandardized slopes is that they keep your theorizing grounded in context, and the predictions you make are in the actual units on which the outcome variable is measured.

However, the advantage of the standardized slope (Pearson’s *r*) is that it gives you a sense of the strength of the relationship. By the way, is Pearson’s *r* really the same in both the ```Middle``` by ```Pinkie``` and ```Pinkie``` by ```Middle``` situations? Try it out here.

In [None]:
cor(fingers$Middle, fingers$Pinkie)
cor(fingers$Pinkie, fingers$Middle)


The standardized slope is the same because now the original units do not matter. Both variables are now in units of standard deviation. Correlations are an excellent way to gauge the strength of a relationship. But the tradeoff is this: predictions based on correlation are difficult to interpret with regard to the original variables (e.g., “a one SD increase in variable x predicts a 0.75 SD increase in variable y”).
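
The sketch below replays the swap on simulated finger lengths (the vectors `pinkie` and `middle` are invented, not the `fingers` columns). It shows the two unstandardized slopes differing while *r* stays put, and it also checks a handy fact: the two slopes multiply to *r*<sup>2</sup>.

```r
set.seed(4)
# Simulated finger lengths in mm; middle is longer and more variable here
pinkie <- rnorm(150, mean = 58, sd = 5)
middle <- 25 + 0.95 * pinkie + rnorm(150, sd = 4)

b_middle_by_pinkie <- coef(lm(middle ~ pinkie))[[2]]
b_pinkie_by_middle <- coef(lm(pinkie ~ middle))[[2]]

# The two unstandardized slopes differ ...
b_middle_by_pinkie
b_pinkie_by_middle

# ... yet r is the same in either order
r_mp <- cor(middle, pinkie)
r_pm <- cor(pinkie, middle)
r_mp
r_pm

# The product of the two slopes equals r squared
b_middle_by_pinkie * b_pinkie_by_middle
r_mp^2
```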

If you want to stay more grounded in your measures and what they actually mean, then the unstandardized regression slope is useful. Regression models give you predictions you can understand.

## 12.10 Regression limitations to keep in mind

Regression and correlation are powerful tools for modeling relationships between variables. But each must be used thoughtfully, always interpreting the findings in context and using all the other knowledge you have about the context.

### Correlation does not imply causation

Most important to bear in mind is that correlation does not imply causation, something you no doubt have heard before. Just the fact that two variables are correlated does not necessarily mean we understand something about what *created* either of them. In this sense, regression is no different from correlation.

There are many examples of this: children’s shoe size is correlated with scores on an achievement test; the tendency to wear skimpy clothing is correlated with higher temperatures. In the case of shoe size, we can see that the correlation is spurious; age is a confounding variable, causing increases in both shoe size and achievement. In the case of skimpy clothing, the relationship is real, but the causal direction must be sensibly interpreted. Hiking up the temperature might indeed cause people to shed their clothing. But taking off clothes is not going to cause the temperatures to go up, no matter how many times you try.
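
A small simulation can make the shoe size example concrete. Everything here is invented for illustration (the variable names, the coefficients, and the age range), and the setup assumes that age drives both variables while neither affects the other. Correlating the residuals after removing age from each variable is one simple way to look at what is left.

```r
set.seed(5)
n <- 500
age <- runif(n, min = 6, max = 12)   # years (simulated children)

# Age drives both variables; neither one influences the other
shoe_size   <- 1 + 0.5 * age + rnorm(n, sd = 0.5)
achievement <- 20 + 8 * age + rnorm(n, sd = 8)

# The raw correlation looks substantial
r_raw <- cor(shoe_size, achievement)
r_raw

# Remove what age explains from each variable, then correlate the leftovers
shoe_resid  <- resid(lm(shoe_size ~ age))
score_resid <- resid(lm(achievement ~ age))
r_after_age <- cor(shoe_resid, score_resid)
r_after_age
```

In this simulated setup the leftover correlation should land near zero, which is what "spurious" means here: the association comes entirely from the shared dependence on age.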

<img src="images/ch12-xkcd.png" width="700">

Also, thumb length measured in millimeters is going to be perfectly correlated with thumb length measured in centimeters. The points will be perfectly laid out on a straight line. But does spotting this relationship get us any closer to understanding the data generation process that produces variation in thumb length? Of course not.

Disambiguating causal relationships and controlling for possible confounds is not achievable through statistical analysis alone. Statistics can help, and correlation can certainly suggest that there might be causation there. But research design is a necessary tool. Random assignment of equivalent objects to conditions that do and don’t receive some treatment is often required to figure out whether a particular relationship is causal or not.

### Are all regressions straight?

Another thing to point out is that the models we have considered in this chapter are *linear* models. We fit a straight line to a scatter of points, and then look to see how well it fits by measuring residuals around the regression line.

But sometimes a straight line is just not going to be a very good model of the relationship between two variables.

Take this graph from a study of the relationship of body weight to risk of death (from McGee DL, 2005, Ann Epidemiol 15:87 and Adams KF, 2006, N Engl J Med 355:763). Being underweight and being overweight both increase the risk of death, whereas being in the middle reduces that risk.

<img src="images/ch12-weightrisk.png" width="650">

If you ignored the shape of the relationship and overlaid a straight regression line, the line would probably be close to flat, indicating no relationship. But if you did that you would be missing an important *curvilinear* relationship.

Before fitting a linear regression model, look at the relationship and see if a linear association makes sense. If it doesn’t, think about a different model. Statisticians have lots of models to offer beyond just the simple straight line.
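
As a rough illustration, the sketch below simulates a U-shaped relationship (the `weight` and `risk` values are invented and only loosely mimic the shape in the figure) and compares a straight-line model with one that adds a squared term. The squared term is just one possible alternative, shown here as an assumption rather than a recommendation.

```r
set.seed(6)
# Simulated U-shaped relationship in arbitrary units; not real health data
weight <- runif(300, min = 16, max = 40)
risk   <- 1 + 0.02 * (weight - 28)^2 + rnorm(300, sd = 0.3)

# A straight line misses the pattern almost entirely
straight <- lm(risk ~ weight)
pre_straight <- summary(straight)$r.squared
pre_straight

# Adding a squared term lets the model bend
curved <- lm(risk ~ weight + I(weight^2))
pre_curved <- summary(curved)$r.squared
pre_curved
```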

### Do regression lines go on forever?

Finally, there is the problem of extrapolation. We have already pointed out from our regression of ```Thumb``` on ```Height``` that, according to the model, someone who is 0 inches tall would have a thumb length of -3.33 millimeters. That doesn't make any sense, on either variable. Obviously, the regression model only works within a certain range, and it is risky to extend that range beyond where you have substantial amounts of data. In general, common sense and a careful understanding of research methods must be applied to the interpretation of any statistical model.

<img src="images/ch12-extrapolation.png" width="550">
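
The arithmetic behind the extrapolation problem is easy to replay. The sketch below plugs heights into the fitted equation quoted above; the helper functions `predict_thumb()` and `in_range()` and the range of 60 to 75 inches are hypothetical, chosen only to show how a simple range check might work.

```r
# Coefficients quoted in the text for the Thumb by Height model
b0 <- -3.33
b1 <- 0.96

# Hypothetical helper: predicted thumb length (mm) from height (inches)
predict_thumb <- function(height) b0 + b1 * height

predict_thumb(66)  # a height inside a plausible data range
predict_thumb(0)   # extrapolation: just the intercept, a negative length

# Hypothetical guard: flag heights outside an assumed observed range
observed_range <- c(60, 75)  # assumed for illustration, not from the data
in_range <- function(height) {
  height >= observed_range[1] & height <= observed_range[2]
}
in_range(c(0, 66))
```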

## Chapter summary

After reading this chapter, you should be able to:

- Use both categorical and quantitative predictors in linear models
- Explain a regression line
- Identify deviations in a regression model
- Fit a regression model using ```lm()```
- Use a regression model to make predictions about new values of data
- Compare the fit of a regression model to the empty model and the two-group model
- Calculate a correlation
- Describe the limitations of using regression