In [None]:
# Run this first so it's ready by the time you need it
install.packages("dplyr")
install.packages("ggformula")
library(dplyr)
library(ggformula)

# Chapter 19 - Model Bias

## 19.1 Error vs. bias

We use statistics to try to explain complex things in the world and arrive at some general conclusions about what we can expect. More specifically, in the context of the general linear model, we try to model the data generation process of data we care about and see what information helps us make the best predictions about those data.

We know from our discussion of model error that it is hard and maybe impossible to make perfectly correct predictions. In any data generation process, we might be able to figure out that using information from some predictor variables helps us explain some variation in an outcome variable and make *better* predictions, but there is almost always some error left unexplained.  

<img src="images/ch10-var1.png" width="500">

For any particular prediction we make about the outcome value of one data point, that prediction is likely to be off by a bit. This amount that we typically miss by is the **error** of a model. We can quantify it by looking at the distribution of the residuals a model produces when making predictions. For example, let's simulate a sample of data with a partially-known data generation process, and make predictions using the part of the model that we know: 

In [None]:
x <- rnorm(100,0,1)  #defining a random variable
e <- rnorm(100,0,1)  #some unexplained error
y <- 2 + 0.5*x + e   #the data generation process 
sim_df <- data.frame(x,y)

sim_df$predicted_y <- 2 + 0.5*sim_df$x
sim_df$residuals <- sim_df$y - sim_df$predicted_y

If we plot the residuals of the model in this sample, making a histogram of the error distribution, we see that most of them are non-zero. We are missing our predictions by a bit. 

In [None]:
gf_histogram(~ residuals, data = sim_df) %>% gf_vline(., xintercept = 0)

The spread of the error distribution tells us by how much we typically miss our predictions. A wide error distribution means a model has a lot of error. The narrower we can make the error distribution, and the less error there is, the better our model. 

However, model error isn't the only thing for us to be aware of when relying on the predictions a model makes. Let's imagine a situation where the true data generation process only has a very small amount of unaccountable variation, so that we are able to explain almost all the error. We can still make very inaccurate predictions if we use the wrong coefficients in a model: 

In [None]:
x <- rnorm(100,0,1)     #defining a random variable
e <- rnorm(100,0,0.1)   #tiny unexplained error
y <- 2 + 0.5*x + e      #data generation process
sim_df <- data.frame(x,y)

sim_df$biased_y <- 4 + 0.5*sim_df$x
sim_df$residuals <- sim_df$y - sim_df$biased_y

gf_histogram(~ residuals, data = sim_df) %>% gf_vline(., xintercept = 0)

Take a look at the center of both these error distributions. 

In the first error distribution we made, few residual values were exactly equal to 0 (few predictions were perfect), but across all the residuals they clustered *around* 0. That means even if any one prediction is unlikely to be perfect, the predictions as a whole aren't missing in any systematic way. 

In contrast, the spread of the second error distribution is narrow, but the central tendency is way off 0. This means that every prediction we are making is wrong, and they're all wrong in the same way. 

This is known as **bias**. A model has error if its predictions are sometimes wrong. A model has bias if the predictions are wrong in a systematic way. 

We can use a bulls-eye metaphor to understand predictions in terms of error and bias. If the bulls-eye is the actual value of a datapoint on an outcome variable and each dot is a prediction, model error refers to the spread of those predictions while model bias refers to where those predictions are centered. 

<img src="images/ch20-errorbias.png" width="500">

What will make a model biased? Take a look at the model coefficients that were used to make predictions for both error distributions above. In the unbiased model, we used the values "2" and "0.5" as coefficients for the ```intercept``` and ```x```, respectively, in order to make predictions about ```y```. These are the exact same values we used in the data generation process for y, so we know the only reason our predictions were off is because we didn't know the value of ```e``` to include in the prediction equation. 

In the biased model, we used "4" instead of "2" for the coefficient of the intercept. This made our predictions systematically overshoot the actual values of ```y```. 

While error speaks to the spread of the error distribution, bias is about the central tendency. If the central tendency of the error distribution is not zero, a model is biased.
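
To put numbers on this distinction, the sketch below re-creates both simulations from this section and summarizes each error distribution by its mean (bias) and its standard deviation (error). The seed and the names `resid_unbiased` and `resid_biased` are assumptions made for illustration; they do not appear in the cells above.

```r
set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible

# Simulation 1: correct coefficients, sizable unexplained error
x <- rnorm(100, 0, 1)
e <- rnorm(100, 0, 1)
y <- 2 + 0.5 * x + e
resid_unbiased <- y - (2 + 0.5 * x)

# Simulation 2: tiny unexplained error, but the wrong intercept (4 instead of 2)
x <- rnorm(100, 0, 1)
e <- rnorm(100, 0, 0.1)
y <- 2 + 0.5 * x + e
resid_biased <- y - (4 + 0.5 * x)

# mean = center of the error distribution (bias)
# sd   = spread of the error distribution (error)
data.frame(
  model = c("unbiased", "biased"),
  mean_residual = c(mean(resid_unbiased), mean(resid_biased)),
  sd_residual = c(sd(resid_unbiased), sd(resid_biased))
)
```

The first model should show a mean residual near 0 with a standard deviation near 1, while the second should show a mean residual near -2 with a standard deviation near 0.1.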

## 19.2 Out-of-sample predictions

In practice, if you fit models in R, the model will never be biased *for this data sample*. This is because R automatically computes the best-fitting coefficients to reduce error and minimize bias. However, usually we care about more than just these data. A bigger concern is thus whether the model will be biased *for new data*.
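
The in-sample part of this claim can be checked directly. The sketch below fits a model to simulated data and takes the mean of its residuals in that same sample; the seed and object names are assumptions for illustration. For any `lm()` fit that includes an intercept, the in-sample residuals average to zero apart from floating-point rounding.

```r
set.seed(2)  # arbitrary seed for reproducibility
x <- rnorm(100, 0, 1)
e <- rnorm(100, 0, 1)
y <- 2 + 0.5 * x + e

fit <- lm(y ~ x)

# center of the error distribution in the sample the model was fit on
in_sample_mean_resid <- mean(resid(fit))
in_sample_mean_resid
```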

Recall from our discussion of sampling distributions that any one estimate derived from a data sample is unlikely to exactly match the population parameter. For example, let's simulate an entire population of data with the same data generation process above, and fit a model in just a sample of it: 

In [None]:
set.seed(30)
x <- rnorm(10000,0,1)  #defining a random variable
e <- rnorm(10000,0,1)  #some unexplained error
y <- 2 + 0.5*x + e     #data generation process 
sim_df <- data.frame(x,y)

sim_sample <- sample_n(sim_df, size=100, replace=TRUE)

sim_model <- lm(y ~ x, data = sim_sample)
sim_model

Whereas the true intercept and effect of x are 2 and 0.5 respectively, our estimates for those coefficients are 1.95 and 0.57. If we were to use these values to make predictions within a separate dataset:

In [None]:
set.seed(40)
sim_sample2 <- sample_n(sim_df, size=100, replace=TRUE)

sim_sample2$predicted_y <- predict(sim_model, sim_sample2)
sim_sample2$residuals <- sim_sample2$y - sim_sample2$predicted_y
mean(sim_sample2$residuals) #center of the error distribution

The residuals are not centered on 0. This is because the model coefficients we used for predictions are just estimates of the population parameters and in this case didn't exactly match. Predictions about new data will thus be wrong in a systematic way. 

There are a few things we can do to minimize bias when making out-of-sample predictions. One of them is to collect larger samples:

In [None]:
set.seed(50)
sim_sample <- sample_n(sim_df, size=1000, replace=TRUE) #bigger sample
sim_model <- lm(y ~ x, data = sim_sample)


sim_sample2 <- sample_n(sim_df, size=1000, replace=TRUE) #bigger sample

sim_sample2$predicted_y <- predict(sim_model, newdata = sim_sample2)
sim_sample2$residuals <- sim_sample2$y - sim_sample2$predicted_y
mean(sim_sample2$residuals) #center of the error distribution


The center of the error distribution when we make predictions in ```sim_sample2``` using the model fitted on ```sim_sample``` is still not 0, but it's much closer to 0. This is because standard error shrinks as sample size grows - with bigger samples, there is less variance in coefficient estimates from sample to sample. 
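
To see how much the sample-to-sample variation in estimates shrinks, the sketch below repeatedly fits the model at two sample sizes and compares the standard deviation of the resulting slope estimates. The helper `slope_sd()`, the seed, and the number of repetitions are assumptions for illustration, and base R's `sample()` stands in for `sample_n()` so the block runs without extra packages.

```r
set.seed(3)  # arbitrary seed for reproducibility
pop_x <- rnorm(10000, 0, 1)
pop_y <- 2 + 0.5 * pop_x + rnorm(10000, 0, 1)

# hypothetical helper: SD of slope estimates across many samples of size n
slope_sd <- function(n, reps = 300) {
  slopes <- replicate(reps, {
    rows <- sample(length(pop_x), size = n, replace = TRUE)
    coef(lm(pop_y[rows] ~ pop_x[rows]))[[2]]
  })
  sd(slopes)
}

sd_n100 <- slope_sd(100)
sd_n1000 <- slope_sd(1000)
c(n100 = sd_n100, n1000 = sd_n1000, ratio = sd_n100 / sd_n1000)
```

Because the spread of estimates scales roughly with $1/\sqrt{n}$, a tenfold increase in sample size should cut it by a factor of about 3.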

Another option is to collect several small samples and then average their estimates together:

In [None]:
#drawing many samples and saving their coefficients
b0s <- vector(length=100)
b1s <- vector(length=100)
for (i in 1:100) {
    sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
    sim_model <- lm(y ~ x, data = sim_sample)
    b0s[i] <- sim_model$coefficients[[1]]
    b1s[i] <- sim_model$coefficients[[2]]
}

mean_b0 <- mean(b0s)
mean_b1 <- mean(b1s)

#predicting new data in the population based on average estimated effect
sim_df$predicted_y <- mean_b0 + mean_b1*sim_df$x
sim_df$residuals <- sim_df$y - sim_df$predicted_y
mean(sim_df$residuals) #center of the error distribution

This is what people are doing when they perform a meta-analysis. They are compiling the estimates across many different studies to arrive at one average effect that is hopefully closer to the true population parameter, and thus produces less bias in new predictions.

## 19.3 Biased and unbiased estimates

In the example above, we saw that even if coefficient estimates derived in one sample are different from the true population parameter and produce biased out-of-sample predictions by themselves, *on average* across many studies these estimates converge on the population parameter. This means that if we repeat our sampling and analysis process many times (or if we get a really big sample size), there is no systematic bias in our estimation of the population parameter. This is the difference between bias in predictions and bias in estimates. Over many data points there may be error in residuals but no bias in predictions. Comparably, over many studies there may be error in effect estimates but no bias in the estimated population parameter.

It is possible to get bias in coefficient estimates, however. This is the worst situation for bias because if your effect estimate is always off, your predictions will always be off too in a systematic way and no amount of sample size or meta-analyzing will help you. 

Biased estimators come from the process of building and fitting a model. Typically, the GLM framework produces unbiased estimates. But this only holds if your data meet certain **assumptions** that the GLM has about them. For the rest of the chapter we will learn how to check these assumptions. If your data violate these assumptions, the GLM will produce biased estimates that will never give you good predictions in out-of-sample data. We will also learn what to do about this situation if you find that it is the case with your data.

## 19.4 Assumption 1: representative samples

We've discussed before what it means to have a representative sample - the distribution of data in the sample closely resembles the distribution of data in the population. When your data sample is representative, the statistical estimates you make in that sample have the best chance of resembling the true population parameter. A difference between your estimate and the population parameter can happen just by bad luck during random sampling: we just happen to draw a particularly strange sample by chance. But over many samples, random sampling will ensure that on the whole our samples are representative and our estimates unbiased. 

If there's any reason our sampling process is not representative, those estimates might be biased even across many samples. For example, there is a famous saying that "money does not buy happiness". If you were to test this statistically, you might look for a significant relationship between people's happiness levels and how much money they make. But if you were to conduct this research only with college graduates (who make on average 80% higher salaries than non-grads), you would only be investigating the relationship between money and happiness in people with more money, not across all income brackets. No matter what sample you drew, if it was not representative of the wider population, you might get a biased estimate (and you might miss the fact that [money is indeed correlated with happiness](http://content.time.com/time/magazine/article/0,9171,2019628,00.html) at lower income levels where financial security is threatened). If your estimate is biased due to nonrepresentative sampling, it might apply to this specific subset of the population, but not **generalize** to other populations of people. 

One way a sample can be unrepresentative is if you don't have values from the full range of population values on your outcome. We can see the impact of this with a simulation. First, we will simulate a medium relationship between two variables x and y and see what sort of sampling distribution we would get across many samples of those variables:

In [None]:
x <- runif(10000,0,1)    #defining a random variable
e <- rnorm(10000,0,1)    #some unexplained error
y <- 2 + 0.3*x + e  #0.3 is the true b1 value in population 
sim_df <- data.frame(x, y)

#sampling many estimates of b1
nobias_b1s <- vector(length=1000)
for (i in 1:1000) {
    sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
    sim_model <- lm(y ~ x, data = sim_sample)  #fitting a linear model to each sample
    nobias_b1s[i] <- sim_model$coefficients[[2]]
}

#central tendency of b1 estimates
nobias_b1s_df <- data.frame(nobias_b1s)
gf_histogram(~ nobias_b1s, data = nobias_b1s_df, fill="red", alpha=0.5) %>% 
    gf_vline(., xintercept = 0.3) #true b1 is 0.3
mean(nobias_b1s)

Any one sample may find a b<sub>1</sub> estimate that is larger or smaller than 0.3, but in general they will cluster around 0.3.

However, let's say we only sample people who are high on the outcome variable score. The full range of that variable is about -2 to 6, but we'll only select datapoints that had y values of at least 4. We'll make the sampling distribution for this situation as well, and plot it (in blue) on the same graph as the no bias sampling distribution (in red) so that you can compare. A vertical black line marks where the true population parameter is.

In [None]:
x <- runif(40000,0,1)    #oversimulating since the y > 4 filter below keeps only a few percent of the data
e <- rnorm(40000,0,1)    
y <- 2 + 0.3*x + e       #0.3 is the true b1 value in population 
sim_df <- data.frame(x, y)
high_sim_df <- filter(sim_df, y>4) #filtering to only large y values

#sampling many estimates of b1
bias_b1s <- vector(length=1000)
for (i in 1:1000) {
    sim_sample <- sample_n(high_sim_df, size=100, replace=TRUE) 
    sim_model <- lm(y ~ x, data = sim_sample)  #fitting a linear model to each truncated sample
    bias_b1s[i] <- sim_model$coefficients[[2]]
}

#central tendency of b1 estimates
bias_b1s_df <- data.frame(bias_b1s)
gf_histogram(~ nobias_b1s, data = nobias_b1s_df, fill="red", alpha=0.5) %>% 
    gf_histogram(~ bias_b1s, data = bias_b1s_df, fill="blue", alpha=0.5) %>%
    gf_vline(., xintercept = 0.3) #true b1 is 0.3
mean(bias_b1s)

The true relationship between x and y is 0.3, but by truncating the y variable, our sampling distribution is biased downward. 

When your data is missing the full range of the outcome variable, there's less variation present that would be explained by the predictor. Noise now has a bigger proportional presence, and the predictor can't reliably explain noise. 

<img src="images/ch20-selectiony.gif" width="500">

*[gif source](https://sites.google.com/view/robertostling/home/teaching)*

Interestingly, the same is not true for truncating the x variable: 

In [None]:
x <- runif(40000,0,1)    #oversimulating since we're going to cut out ~3/4 of the data
e <- rnorm(40000,0,1)    
y <- 2 + 0.3*x + e       #0.3 is the true b1 value in population 
sim_df <- data.frame(x, y)
high_sim_df <- filter(sim_df, x>0.75) #filtering to only large x values

#sampling many estimates of b1
bias_b1s <- vector(length=1000)
for (i in 1:1000) {
    sim_sample <- sample_n(high_sim_df, size=100, replace=TRUE) 
    sim_model <- lm(y ~ x, data = sim_sample)
    bias_b1s[i] <- sim_model$coefficients[[2]]
}

#red = unbiased sampling distribution
#blue = biased sampling distribution
bias_b1s_df <- data.frame(bias_b1s)
gf_histogram(~ nobias_b1s, data = nobias_b1s_df, fill="red", alpha=0.5) %>% 
    gf_histogram(~ bias_b1s, data = bias_b1s_df, fill="blue", alpha=0.5) %>%
    gf_vline(., xintercept = 0.3) #true b1 is 0.3
mean(bias_b1s)

On average, the sampling distribution will still cluster around an effect of 0.3 for x. The difference here is that the sampling distribution is *wider*, despite having the same sample size. It isn't the model coefficient that is biased in this case; instead, the *standard error* is larger than it would be with the full range of x. On average we will detect the right effect, but any one estimate gives a less precise picture of it. If the true effect is 0.3 but the standard error is inflated, we will have less power than we planned to find that effect significant - there will be more Type II errors. 

<img src="images/ch20-selectionx.gif" width="500">

*[gif source](https://sites.google.com/view/robertostling/home/teaching)*
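
The widening caused by truncating x can also be seen in a single pair of samples. The sketch below fits the same model to 100 points drawn from the full range of `x` and to 100 points drawn only from `x > 0.75`, then compares the standard errors that `summary()` reports for the slope. The seed and object names are assumptions for illustration, and drawing `x` directly from the restricted range stands in for the `filter()` step used above.

```r
set.seed(4)  # arbitrary seed for reproducibility
n <- 100

# full range of x
x_full <- runif(n, 0, 1)
y_full <- 2 + 0.3 * x_full + rnorm(n, 0, 1)

# restricted range of x (only the top quarter)
x_high <- runif(n, 0.75, 1)
y_high <- 2 + 0.3 * x_high + rnorm(n, 0, 1)

# row 2 is the slope; column 2 of the coefficient table holds the standard errors
se_full <- summary(lm(y_full ~ x_full))$coefficients[2, 2]
se_high <- summary(lm(y_high ~ x_high))$coefficients[2, 2]
c(full_range = se_full, restricted = se_high)
```

Because the slope's standard error depends on how spread out the predictor is, squeezing `x` into a quarter of its range should make the standard error roughly four times larger at the same sample size.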

Another instance of bias due to representativeness is when two predictor variables interact, such that predictor 2 influences the relationship between predictor 1 and the outcome, but we only collected data from some values of predictor 2. For example, there may be a difference in how people give to charity between individualistic and collectivistic cultures, but you would miss it if you only sampled people born in the US. 

Now, a nonrepresentative sample doesn't ensure that your estimates will be biased. Estimates will only be biased if your sample is non-representative on variables that are involved in the data generation process you are investigating. If you are investigating the relationship between money and happiness but only have a sample of right-handed people, there's a pretty good argument to make that being right or left handed doesn't influence how happy money makes you. So the assumption of representativeness only applies to variables that are involved in the particular data generation process you are investigating. 

### How to check for assumption

There aren't great ways to check if your samples are representative. You need to know about the expected range of the variables in your model. If you aren't familiar with your research domain enough to know these answers, it will be difficult to catch when your sample isn't representative. You can try to include representativeness on major demographic variables like sex, race, etc. as a default to catch common reasons for nonrepresentativeness, but that doesn't ensure you're not missing something else important. Also, just because a study was run on only US college students (or with culturally-homogenous samples in other countries), that doesn't mean *for sure* that the sample is not representative. There has to be a good reason to expect truncated age and culture will make a difference on the estimated effect. The best you can do here is read about other research in your field to stay abreast of what others have found, and continue developing psychological research. 

### What to do if assumption is violated

The time to deal with this assumption is before you collect data. Plan ahead to figure out what variables are important to be representative on, and do your best to recruit participants that have a variety of values on those variables. If for whatever reason you're unable to do that, make sure you interpret your results within the subset of the population you *were* able to recruit, and do not overgeneralize to people who are not represented in your sample. In other words, as we saw at the end of chapter 12 when talking about the limits of regression, we shouldn't extrapolate conclusions beyond the range of the data in our sample.

## 19.5 Assumption 2: linear relationship between variables

The second assumption that needs to be met in order for the general linear model framework to produce unbiased estimates is that all the variables in the model are related to each other linearly. 

A "linear" model means that the best fitting regression line through the scatterplot of these data is straight. This is specified during the fitting process by combining predictors with addition. Another way of looking at it is that the slope of the effect of X is the same regardless of the value of X. 

<img src="images/ch14-linear.png" width="250">

However, another possible situation is that the effect of X *changes* depending on the value of X. In this case, the best-fitting regression line through the data is not straight. 

<img src="images/ch14-nonlinear.png" width="250">


The GLM form that we've been using assumes that the linear type of relationship is what best fits the data. Most of the time it is safe to assume that the relationship between a predictor and an outcome is linear. However if you happen to be dealing with a nonlinear situation and you misspecify the model as linear by using the basic form of the GLM, this can result in bias. 

Below is an example of what happens if we try to make predictions with a linear model, but the true data generation process was nonlinear. Specifically, the predictor variable in the data generation process is squared, making its effect on Y stronger as X gets larger.

In [None]:
x <- runif(10000,0,4)    #defining a random variable
e <- rnorm(10000,0,1)    #some unexplained error
x_squared <- x^2
y <- 2 + 0.5*x_squared + e  #0.5 is the true b1 value in a quadratic model 
sim_df <- data.frame(x, x_squared, y)

#sampling many estimates of unbiased b1
nobias_b1s <- vector(length=1000)
for (i in 1:1000) {
    sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
    sim_model <- lm(y ~ x_squared, data = sim_sample)  
    nobias_b1s[i] <- sim_model$coefficients[[2]]
}

#sampling many estimates of biased b1
bias_b1s <- vector(length=1000)
for (i in 1:1000) {
    sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
    sim_model <- lm(y ~ x, data = sim_sample)  #fitting a linear model instead of nonlinear
    bias_b1s[i] <- sim_model$coefficients[[2]]
}

#red = unbiased sampling distribution
#blue = biased sampling distribution
nobias_b1s_df <- data.frame(nobias_b1s)
bias_b1s_df <- data.frame(bias_b1s)
gf_histogram(~ nobias_b1s, data = nobias_b1s_df, fill="red", alpha=0.5) %>% 
    gf_histogram(~ bias_b1s, data = bias_b1s_df, fill="blue", alpha=0.5) %>%
    gf_vline(., xintercept = 0.5) #true b1 is 0.5


We set the true &beta;<sub>1</sub> in the data generation process to be 0.5, but we modeled the data generation process as linear rather than quadratic. This caused the sampling distribution of b<sub>1</sub> estimates to center on ~2. By misspecifying the model, we are overestimating the population &beta;<sub>1</sub> and guessing that the effect of ```x``` is larger than it actually is. If we were to take a model estimate at face value without investigating whether the data should actually be fit with a linear model, we'd make incorrect conclusions about the true effect size.  

### How to check for assumption

If you are fitting a simple model with one predictor, you can check whether the best fitting line between the raw values of the predictor and the outcome has a distinctly curved shape. In the ```ggformula``` package of R, adding a straight line to a plot is done with ```gf_lm()``` while adding the best-fitting line regardless of shape is done with ```gf_smooth()```.  

In [None]:
#adding a line with gf_smooth() will let you see whether the best fitting line should be curved or straight
gf_point(y ~ x, data = sim_sample) %>% gf_smooth(.)

If it is quite curved, that indicates you should probably use a nonlinear model. 

If you have a multivariable model, the way to tell is to plot the predictions the linear model would make on the x-axis and the residuals of the model on the y-axis:

In [None]:
bad_model <- lm(y ~ x, data = sim_sample)  #fit the linear model on the sample itself
sim_sample$badpredicted <- predict(bad_model, sim_sample)
sim_sample$badresid <- sim_sample$y - sim_sample$badpredicted

gf_point(badresid ~ badpredicted, data=sim_sample) %>% gf_smooth()

If this plot has a clearly curved shape, that means there is a relationship between what a model predicts and how far off that prediction is. The model is systematically under- or over-predicting for certain data points and is violating the assumption of linearity. On the other hand, if the assumption is met, this prediction-residual plot should have a mostly straight, flat line: 

In [None]:
x <- runif(100,0,4)    #defining a random variable
e <- rnorm(100,0,1)    #some unexplained error
y <- 2 + 0.5*x + e     #a typical linear model
sim_df <- data.frame(x, y)

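good_model <- lm(y ~ x, data = sim_df)  #fit on the data frame built above
sim_df$goodpredicted <- predict(good_model)
sim_df$goodresid <- sim_df$y - sim_df$goodpredicted

gf_point(goodresid ~ goodpredicted, data = sim_df) %>% gf_smooth()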

The line in this plot is a little wobbly, but there's no big curve emerging and no clear relationship between a model's predictions and its residuals. 

### What to do if assumption is violated

If the assumption of linearity is violated, it becomes necessary to do more advanced modeling than is covered in this course. Specifically, you should investigate what sort of nonlinear shape would best explain the relationship between the predictor and the outcome and fit a model with that sort of structure instead of a linear model. This usually involves transforming a predictor variable or combining terms with procedures other than addition.  
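
As a minimal sketch of what transforming a predictor can look like, the block below fits the quadratic data generation process from earlier in this section using `I(x^2)` inside the model formula, which tells `lm()` to square `x` before fitting. The seed and object names are assumptions for illustration, and this is only one of many possible nonlinear specifications.

```r
set.seed(5)  # arbitrary seed for reproducibility
x <- runif(1000, 0, 4)
e <- rnorm(1000, 0, 1)
y <- 2 + 0.5 * x^2 + e   # true data generation process is quadratic

linear_fit <- lm(y ~ x)       # misspecified: straight line
quad_fit <- lm(y ~ I(x^2))    # matches the data generation process

b1_linear <- coef(linear_fit)[[2]]
b1_quad <- coef(quad_fit)[[2]]
c(linear = b1_linear, quadratic = b1_quad)
```

The quadratic fit should recover a coefficient near the true 0.5, while the straight-line fit lands near 2.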

## 19.6 Assumption 3: exogeneity

**Exogeneity** refers to a model that correctly identifies the important predictors in the data generation process. I.e., variation in the outcome variable is actually explained by the proposed predictor(s), and not by another variable that is left out of the model. If the outcome variation is instead better explained by a variable that has been left out of the model, that situation is called **endogeneity** or **omitted variable bias**. 

We've seen before how predictor variables can share variance, and how adding multiple variables in a model lets us control for other possible explanations of the outcome variable. If two predictors share variation but we leave one of them out of the model, the general linear model will lump all that shared variation into the one predictor still present. It will look like the remaining predictor is uniquely related to the outcome in a way that it actually isn't.

Let's see this in an example where we create some variables causally: ```x1``` is a random variable, while ```x2``` and ```y``` are each separately built out of adding some error to ```x1```:

In [None]:
x1 <- runif(10000,0,4)    #defining a random variable, the true explanation of others
ex <- rnorm(10000,0,1)    #some unexplained error
x2 <- x1 + ex             #x1 explains x2
ey <- rnorm(10000,0,1)    #some unexplained error
y <- x1 + ey              #x1 is the only one that explains y
sim_df <- data.frame(x1, x2, y)

If we build a model out of both x's, we can see that ```x1``` is a significant predictor of ```y``` but ```x2``` is not: 

In [None]:
summary(lm(y ~ x1 + x2, data=sim_df))

This is because the only reason ```x2``` would be related to ```y``` is because ```x1``` explains both of them. 

However, if we leave out ```x1``` as a predictor and only use ```x2```:

In [None]:
summary(lm(y ~ x2, data=sim_df))

All the variation shared between ```x1``` and ```x2``` but caused by ```x1``` is now given to ```x2```. It looks like it has a bigger effect on ```y``` than it really does in the true data generation process. 
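
The size of this distortion can be worked out from the simulation settings. With ```x1``` left out, the slope for ```x2``` is the covariance of ```x2``` and ```y``` divided by the variance of ```x2```. The sketch below computes that expected value and compares it to a fitted slope; the seed and object names are assumptions for illustration.

```r
set.seed(6)  # arbitrary seed for reproducibility
x1 <- runif(10000, 0, 4)
x2 <- x1 + rnorm(10000, 0, 1)
y <- x1 + rnorm(10000, 0, 1)

# variance of a uniform(0, 4) variable is (4 - 0)^2 / 12
var_x1 <- 4^2 / 12
var_ex <- 1^2

# x2 and y only share x1, so cov(x2, y) = var(x1)
# and var(x2) = var(x1) + var(ex)
expected_slope <- var_x1 / (var_x1 + var_ex)
fitted_slope <- coef(lm(y ~ x2))[[2]]
c(expected = expected_slope, fitted = fitted_slope)
```

The expected slope works out to 4/7, or about 0.57, even though ```x2``` plays no part in generating ```y```.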

In the simulation we set up above, the true effect of ```x2``` is 0. An unbiased estimate of this effect would give us repeated b estimates that cluster around 0. But if we omit ```x1``` from the model, the estimate for ```x2``` becomes biased: 

In [None]:
#sampling many estimates of the unbiased x2 coefficient
nobias_b1s <- vector(length=1000)
for (i in 1:1000) {
  sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
  sim_model <- lm(y ~ x1 + x2, data = sim_sample)  
  nobias_b1s[i] <- sim_model$coefficients[["x2"]]  #coefficient of x2, not the intercept
}

#sampling many estimates of the biased x2 coefficient
bias_b1s <- vector(length=1000)
for (i in 1:1000) {
  sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
  sim_model <- lm(y ~ x2, data = sim_sample)  #fitting an endogenous model
  bias_b1s[i] <- sim_model$coefficients[["x2"]]  #coefficient of x2, not the intercept
}

#red = unbiased sampling distribution
#blue = biased sampling distribution
nobias_b1s_df <- data.frame(nobias_b1s)
bias_b1s_df <- data.frame(bias_b1s)
gf_histogram(~ nobias_b1s, data = nobias_b1s_df, fill="red", alpha=0.5) %>% 
    gf_histogram(~ bias_b1s, data = bias_b1s_df, fill="blue", alpha=0.5) %>%
    gf_vline(., xintercept = 0) #true effect of x2 is 0
mean(bias_b1s)

When this assumption is violated, the p-values are affected as well: 

In [None]:
set.seed(70)
subsample <- sample_n(sim_df, size=100, replace=TRUE)
goodmodel <- lm(y ~ x1 + x2, data = subsample)
summary(goodmodel)$coefficients[3,4] #return p-value of x2

badmodel <- lm(y ~ x2, data = subsample)
summary(badmodel)$coefficients[2,4] #return p-value of x2

Without ```x1``` in the model, we would erroneously conclude that ```x2``` has a large, significant effect on ```y``` when in fact ```x1``` is the true explanatory variable.

### How to check for assumption

To check whether the assumption of exogeneity is met, you first have to think about what variable(s) might be a better explanation of ```y```, and make sure to collect data on those variables. Then, you can fit a model of ```y``` with all the possible predictors to see which ones are significant predictors of ```y``` when controlling for each other. 

### What to do if assumption is violated

If it looks like an omitted variable should be included, the solution is easy - include it! That way you can figure out the true unique effect of each explanatory variable.

## 19.7 Assumption 4: no multicollinearity 

Speaking of correlated predictors: if you have two predictors ```x1``` and ```x2``` that are both possible explanatory variables of ```y```, you should include them both in a general linear model. However, it is still dangerous to make conclusions about the model estimates if the predictors are highly correlated with each other. If the variation in ```y``` explained by the predictors is almost entirely overlapping, that means there is almost no unique variation attributable to either one. In that case, minute differences in the values of each variable can result in huge swings and instability in the model estimates from sample to sample. This situation is called **multicollinearity**. 

Here's an example of what multicollinearity can do to model estimates. We'll again generate predictor variables that are closely related to each other:

In [None]:
x1 <- runif(10000,0,5)    #defining a random variable
ex <- rnorm(10000,0,0.5)  #some unexplained error between predictors
x2 <- x1 + ex             #another, highly related predictor variable
ey <- rnorm(10000,0,1)    #some unexplained error in the model
y <- 0.5*x1 + ey          #true data generation process only involves x1
sim_df <- data.frame(x1, x2, y)

In this data generation process, the true effect of ```x1``` should be 0.5. If we repeatedly sample and estimate the effect of ```x1``` in a multivariable model when there is multicollinearity: 

In [None]:
#sampling many estimates of b1 from the model with x1 only
nobias_b1s <- vector(length=1000)
for (i in 1:1000) {
  sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
  sim_model <- lm(y ~ x1, data = sim_sample) 
  nobias_b1s[i] <- sim_model$coefficients[[2]]
}

#sampling many estimates of b1 from the model with both x1 and x2 (multicollinearity)
bias_b1s <- vector(length=1000)
for (i in 1:1000) {
  sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
  sim_model <- lm(y ~ x1 + x2, data = sim_sample) 
  bias_b1s[i] <- sim_model$coefficients[[2]]
}

#red = sampling distribution without x2 in the model
#blue = sampling distribution with multicollinearity
nobias_b1s_df <- data.frame(nobias_b1s)
bias_b1s_df <- data.frame(bias_b1s)
gf_histogram(~ nobias_b1s, data = nobias_b1s_df, fill="red", alpha=0.5) %>% 
    gf_histogram(~ bias_b1s, data = bias_b1s_df, fill="blue", alpha=0.5) %>%
    gf_vline(., xintercept = 0.5) #true b1 is 0.5
mean(bias_b1s)

We actually don't get much bias in the b<sub>1</sub> at all. So why is multicollinearity a problem? Well, consider the standard errors of both sampling distributions (the widths). The standard error of the sampling distribution is much higher in the case of multicollinearity. When standard error is higher, the variation in estimates from sample to sample is higher. This means that, even if the sampling distribution of many b<sub>1</sub>s is centered on the correct population parameter, any one b<sub>1</sub> estimate is likely to be farther away from the true population parameter. 

This has the unfortunate effect of inflating Type I error for ```x2``` (the non-causal predictor) *and* inflating Type II error for ```x1``` (the real predictor). We become both more likely to flag the wrong predictor as significant and more likely to miss the predictor that actually matters. When multicollinearity is present, we get an unbiased estimate of the effect size, but the standard error will be biased high, so it is harder to trust our significance decisions. 
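To put numbers on the widths of the two sampling distributions, here is a minimal sketch that repeats the simulation on a smaller scale and compares the standard deviations of the two sets of estimates. The names `mc_df`, `b1_alone`, and `b1_with_x2`, the seed, and the choice of 500 repetitions are assumptions made for illustration; they are not part of the simulation above.

```r
set.seed(19)  # arbitrary seed (an assumption) so the sketch is reproducible

# same data generation process as above, stored under hypothetical names
n_pop <- 10000
mc_x1 <- runif(n_pop, 0, 5)
mc_x2 <- mc_x1 + rnorm(n_pop, 0, 0.5)     # highly related second predictor
mc_y  <- 0.5*mc_x1 + rnorm(n_pop, 0, 1)   # only x1 affects y
mc_df <- data.frame(x1 = mc_x1, x2 = mc_x2, y = mc_y)

n_reps <- 500
b1_alone   <- numeric(n_reps)  # b1 estimates from y ~ x1
b1_with_x2 <- numeric(n_reps)  # b1 estimates from y ~ x1 + x2
for (i in 1:n_reps) {
  rows <- sample(nrow(mc_df), size = 100, replace = TRUE)
  mc_sample <- mc_df[rows, ]
  b1_alone[i]   <- coef(lm(y ~ x1, data = mc_sample))[["x1"]]
  b1_with_x2[i] <- coef(lm(y ~ x1 + x2, data = mc_sample))[["x1"]]
}

# the standard deviation of each sampling distribution estimates its standard error
se_alone   <- sd(b1_alone)
se_with_x2 <- sd(b1_with_x2)
print(c(alone = se_alone, with_x2 = se_with_x2))
print(se_with_x2 / se_alone)  # how many times wider the multicollinear distribution is
```

Both sets of estimates should center near 0.5, but the ratio printed at the end shows how much more any single estimate can wander when ```x2``` is in the model.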

### How to check for assumption
The **variance inflation factor** is a measure of the amount of multicollinearity between predictors in a model. It is calculated as: 

$$VIF = \frac{1}{1 - R^2_i}$$

Where $R^2_i$ is the proportion of variance explained in the ith predictor by all the other predictors. You can access a function ```vif()``` in R to compute this for you using the ```car``` package. 

In [None]:
install.packages("car")
library(car)

mc_model <- lm(y ~ x1 + x2, data = sim_df)
vif(mc_model)

To interpret these numbers, a commonly used (and fairly conservative) rule of thumb is that a VIF of at least 2.5 means multicollinearity is inflating standard errors enough to be concerned. The VIF of both predictors in our model is above 9, well past that threshold. We should interpret that to mean any p-values from this model are untrustworthy. 
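The VIF formula can also be applied by hand, which may help show where the number comes from. The sketch below is a minimal illustration under stated assumptions: the variable names (`vif_x1`, `vif_x2`, `r2_x1`, `vif_by_hand`) and the seed are hypothetical, and the predictors are re-simulated with the same recipe used above. With only two predictors, the R<sup>2</sup> for ```x1``` comes from a model that predicts it from ```x2``` alone.

```r
set.seed(19)  # arbitrary seed (an assumption) for reproducibility

# re-create the two related predictors under hypothetical names
vif_x1 <- runif(10000, 0, 5)
vif_x2 <- vif_x1 + rnorm(10000, 0, 0.5)

# proportion of variance in x1 explained by the other predictor
r2_x1 <- summary(lm(vif_x1 ~ vif_x2))$r.squared

# VIF = 1 / (1 - R^2)
vif_by_hand <- 1 / (1 - r2_x1)
print(r2_x1)
print(vif_by_hand)
```

The closer R<sup>2</sup> gets to 1, the smaller the denominator becomes and the faster the VIF grows.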

### What to do if assumption is violated
The upside of multicollinearity is that, if two predictors share a lot of variation, you don't lose much predictive value by removing one. The second predictor is essentially redundant. So one solution to multicollinearity is to drop one variable from the model.

Which variable should you drop? That's where things get tricky. The variable you should choose to drop is the one you think is least likely to be the cause of the outcome variable. But if you're unsure about the causal structure, another option is to average together the values of both predictors:

In [None]:
x1 <- runif(10000,0,5)    #defining a random variable
ex <- rnorm(10000,0,0.5)  #some unexplained error between predictors
x2 <- x1 + ex             #another, highly related predictor variable
e <- rnorm(10000,0,1)     #some unexplained error
y <- 0.5*x1 + e
x_combined <- (x1+x2)/2   #average of the two related predictors
sim_df <- data.frame(x1, x2, x_combined, y)

#sampling distribution of b1 when using x_combined as sole predictor
combined_b1s <- vector(length=1000)
for (i in 1:1000) {
  sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
  sim_model <- lm(y ~ x_combined, data = sim_sample) 
  combined_b1s[i] <- sim_model$coefficients[[2]]
}

#red = unbiased sampling distribution
#blue = combined sampling distribution
nobias_b1s_df <- data.frame(nobias_b1s)
combined_b1s_df <- data.frame(combined_b1s)
gf_histogram(~ nobias_b1s, data = nobias_b1s_df, fill="red", alpha=0.5) %>% 
    gf_histogram(~ combined_b1s, data = combined_b1s_df, fill="blue", alpha=0.5) %>%
    gf_vline(., xintercept = 0.5) #true b1 is 0.5
mean(combined_b1s)

This is the process that researchers follow when trying to address measurement error with multiple measures of a variable (e.g., with different questions on the same topic in a survey). Ideally, the only reason participants would answer differently on these questions is due to measurement error - they should be representing essentially the same information. Including each individual question in a model would thus be likely to introduce multicollinearity. Averaging them together gives you one composite measurement to make predictions with instead. 
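As a rough sketch of the survey case described above, the code below simulates three questions that each measure the same underlying quantity with their own measurement error, then averages them into a composite. Everything here is an assumption for illustration: the names `true_score`, `item1` to `item3`, and `survey_df`, the amount of noise, and the seed.

```r
set.seed(19)  # arbitrary seed (an assumption) for reproducibility

n_people <- 1000
true_score <- rnorm(n_people, 0, 1)          # what the questions try to measure

# each question = true score + its own measurement error
item1 <- true_score + rnorm(n_people, 0, 1)
item2 <- true_score + rnorm(n_people, 0, 1)
item3 <- true_score + rnorm(n_people, 0, 1)
survey_df <- data.frame(item1, item2, item3)

# composite measurement: the average of the three questions for each person
survey_df$composite <- rowMeans(survey_df[, c("item1", "item2", "item3")])

# how well does one question vs. the composite track the true score?
cor_single    <- cor(survey_df$item1, true_score)
cor_composite <- cor(survey_df$composite, true_score)
print(c(single = cor_single, composite = cor_composite))
```

Because the measurement errors are independent across questions, they partly cancel out in the average, so the composite should track the underlying quantity more closely than any single question does.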

Another option for addressing multicollinearity is through a process called **dimensionality reduction**. This is usually used when you have a large number of predictors that are highly related, so it is valuable to not only reduce the variance inflation factor but to also reduce the number of predictors and gain back degrees of freedom. There are several methods for doing dimensionality reduction, but most of them work by somehow identifying the shared variance between multiple predictors and turning that into one variable. This is beyond the scope of this course, but if you're interested you can read more about dimensionality reduction methods like [factor analysis](https://towardsdatascience.com/exploratory-factor-analysis-in-r-e31b0015f224) and [principal component analysis](https://www.r-bloggers.com/2021/05/principal-component-analysis-pca-in-r/), as well as a related regularization approach called [ridge regression](https://www.statology.org/ridge-regression-in-r/). 
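For the curious, here is a minimal sketch of the idea of "turning shared variance into one variable", using principal component analysis through base R's `prcomp()`. This is one possible illustration rather than a recommended workflow; the names `pc_x1`, `pc_x2`, `pc_y`, `shared`, and the seed are assumptions.

```r
set.seed(19)  # arbitrary seed (an assumption) for reproducibility

# two highly related predictors, built with the same recipe used earlier
pc_x1 <- runif(10000, 0, 5)
pc_x2 <- pc_x1 + rnorm(10000, 0, 0.5)
pc_y  <- 0.5*pc_x1 + rnorm(10000, 0, 1)

# principal component analysis on the standardized predictors
pca <- prcomp(cbind(pc_x1, pc_x2), center = TRUE, scale. = TRUE)

# proportion of the predictors' total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(var_explained)

# use scores on the first component as a single replacement predictor
shared <- pca$x[, 1]
pc_model <- lm(pc_y ~ shared)
print(summary(pc_model)$r.squared)
```

When the first component captures nearly all of the variance in the predictors, little information is lost by replacing both of them with that one variable.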

## 19.8 Assumption 5: Normality of residuals

The remainder of the assumptions have to do with the error distribution after a model is fit. First, the general linear model assumes that the residuals of a model will be normally distributed. If the error distribution is instead highly skewed, that means a few data points are particularly weird and having a stronger influence on the model estimate than the rest of the non-weird data. 

Here's an example of what non-normal residuals might look like: 

In [None]:
x <- runif(10000,0,5)           #defining a random variable
e_norm <- rnorm(10000,0,1)      #error is normally distributed
e_skew <- log(runif(10000,0,1)) #error is skewed, not normal
e_skew <- e_skew - mean(e_skew) #mean centering the error term
y_norm <- 0.5*x + e_norm
y_skew <- 0.5*x + e_skew
sim_df <- data.frame(x, y_skew, y_norm)
gf_histogram(~ e_skew, data = sim_df)

In [None]:
#sampling distribution of b1 with normal errors
nobias_b1s <- vector(length=1000)
for (i in 1:1000) {
  sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
  sim_model <- lm(y_norm ~ x, data = sim_sample)  
  nobias_b1s[i] <- sim_model$coefficients[[2]]
}

#sampling distribution of b1 with skewed errors
bias_b1s <- vector(length=1000)
for (i in 1:1000) {
  sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
  sim_model <- lm(y_skew ~ x, data = sim_sample)  
  bias_b1s[i] <- sim_model$coefficients[[2]]
}

#red = unbiased sampling distribution
#blue = biased sampling distribution
nobias_b1s_df <- data.frame(nobias_b1s)
bias_b1s_df <- data.frame(bias_b1s)
gf_histogram(~ nobias_b1s, data = nobias_b1s_df, fill="red", alpha=0.5) %>% 
    gf_histogram(~ bias_b1s, data = bias_b1s_df, fill="blue", alpha=0.5) %>%
    gf_vline(., xintercept = 0.5) #true b1 is 0.5

In this case, the sampling distribution is not really biased but the standard error will be wrong. We're more likely to commit a Type I error if we don't stop to consider this situation. 

Note that this assumption is about the normality of the distribution of *residuals* in a model, not the distribution of the raw variables. That is a common misconception. A highly non-normal predictor or outcome can still produce normally-distributed errors in the model in many situations. To assess this assumption you need to check the error distribution specifically.
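The point about raw variables versus residuals can be illustrated with a small simulation. In this sketch the predictor is strongly skewed but the error term is normal, so the residuals should look symmetric even though the predictor does not. The helper `skew_stat()` (a simple moment-based skewness measure), the variable names, and the seed are all assumptions introduced for this example.

```r
set.seed(19)  # arbitrary seed (an assumption) for reproducibility

# hypothetical helper: moment-based skewness (about 0 for a symmetric distribution)
skew_stat <- function(v) mean((v - mean(v))^3) / sd(v)^3

nn_x <- rexp(5000, rate = 1)              # strongly right-skewed predictor
nn_y <- 0.5*nn_x + rnorm(5000, 0, 1)      # but the error itself is normal
nn_model <- lm(nn_y ~ nn_x)

skew_x     <- skew_stat(nn_x)             # skewness of the raw predictor
skew_resid <- skew_stat(resid(nn_model))  # skewness of the model residuals
print(c(predictor = skew_x, residuals = skew_resid))
```

A skewness far from zero for the predictor alongside a skewness near zero for the residuals is the pattern described above: the assumption concerns the residuals only.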

### How to check for assumption
To check for the normality of residuals, simply plot the histogram of the model residuals and see if the distribution shape is strongly not normal: 

In [None]:
sim_model <- lm(y_skew ~ x, data = sim_sample)
sim_sample$resid <- sim_sample$y_skew - predict(sim_model, sim_sample)
gf_histogram(~ resid, data = sim_sample)

### What to do if assumption is violated
There are a number of advanced options available for addressing violations of the normality assumption. One might choose to transform the outcome variable, or use a version of the general linear model called [robust regression](https://stats.oarc.ucla.edu/r/dae/robust-regression/) that downweights the influence of data outliers. 
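As one hedged illustration of transforming the outcome, the sketch below simulates an outcome whose error is multiplicative, which produces right-skewed residuals on the raw scale, and then refits the model on the log of the outcome. The example is deliberately constructed so that a log transform is the right fix; with real data you would need to check whether a transformation actually helps. The names `tr_x`, `tr_y`, `skew_stat()`, and the seed are assumptions.

```r
set.seed(19)  # arbitrary seed (an assumption) for reproducibility

# hypothetical helper: moment-based skewness (about 0 for a symmetric distribution)
skew_stat <- function(v) mean((v - mean(v))^3) / sd(v)^3

tr_x <- runif(2000, 0, 5)
# multiplicative error: the outcome is always positive and right-skewed
tr_y <- exp(1 + 0.5*tr_x + rnorm(2000, 0, 0.5))

raw_model <- lm(tr_y ~ tr_x)        # model on the original scale
log_model <- lm(log(tr_y) ~ tr_x)   # model on the log-transformed outcome

skew_raw <- skew_stat(resid(raw_model))
skew_log <- skew_stat(resid(log_model))
print(c(raw = skew_raw, log = skew_log))
```

Keep in mind that after a transformation the coefficients describe change in the transformed outcome (here, log units), not in the original units.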

## 19.9 Assumption 6: Homoscedasticity

This long word refers to the assumption that residuals have constant variance at each level of model prediction. 

The best way to understand this is visually. We will make a plot where, as a predictor ```x``` increases, the error in the model increases as well: 

In [None]:
x <- runif(10000,0,5)             #defining a random variable
e_het <- x*rnorm(10000,0,1)    #error gets wider as a function of predictor
e_hom <- rnorm(10000,0,1)
y_het <- 0.5*x + e_het
y_hom <- 0.5*x + e_hom
sim_df <- data.frame(x, y_het, y_hom)

subsample <- sample_n(sim_df, size=100, replace=TRUE)
sub_model <- lm(y_het ~ x, data = subsample)
subsample$resid <- subsample$y_het - predict(sub_model, subsample)
gf_point(resid ~ x, data=subsample)

In this plot we are showing the size of the model error as it varies with a predictor x. When the assumption of **homoscedasticity** is met, this plot should look like a fairly consistent cloud where the range of residuals is about the same for every level of x. If there is instead a clear cone shape, like in this plot, residuals are more variable at certain levels of x and the assumption is violated. This is a case of **heteroscedasticity**.  

This situation occurs when the amount of error in the model depends on the values of the predictor. When this assumption is violated, model estimates are unbiased across many samples, but are more variable. This means the standard error is inflated and we have a harder time trusting our significance testing results.  

In [None]:
#sampling distribution of b1 with homoscedastic errors
nobias_b1s <- vector(length=1000)
for (i in 1:1000) {
  sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
  sim_model <- lm(y_hom ~ x, data = sim_sample)  #outcome with constant error variance
  nobias_b1s[i] <- sim_model$coefficients[[2]]
}

#sampling distribution of b1 with heteroscedastic errors
bias_b1s <- vector(length=1000)
for (i in 1:1000) {
  sim_sample <- sample_n(sim_df, size=100, replace=TRUE) 
  sim_model <- lm(y_het ~ x, data = sim_sample)  #outcome whose error variance grows with x
  bias_b1s[i] <- sim_model$coefficients[[2]]
}

#red = unbiased sampling distribution
#blue = biased sampling distribution
nobias_b1s_df <- data.frame(nobias_b1s)
bias_b1s_df <- data.frame(bias_b1s)
gf_histogram(~ nobias_b1s, data = nobias_b1s_df, fill="red", alpha=0.5) %>% 
    gf_histogram(~ bias_b1s, data = bias_b1s_df, fill="blue", alpha=0.5) %>%
    gf_vline(., xintercept = 0.5) #true b1 is 0.5


### How to check for assumption
To check whether residuals are homoscedastic or heteroscedastic, one should make the same plot as above (predictor by residuals) and look for the distinctive cone shape in the cloud of residuals. 
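If you want a rough number to go along with the visual check, one simple option is to compare how spread out the residuals are in the lower half of the predictor versus the upper half. The sketch below does this for a heteroscedastic and a homoscedastic outcome. It is an informal illustration, not a formal test; `spread_ratio()` is a hypothetical helper, and the variable names and seed are assumptions.

```r
set.seed(19)  # arbitrary seed (an assumption) for reproducibility

hx     <- runif(1000, 0, 5)
hy_het <- 0.5*hx + hx*rnorm(1000, 0, 1)   # error gets wider as x increases
hy_hom <- 0.5*hx + rnorm(1000, 0, 1)      # error has constant width

# hypothetical helper: ratio of residual spread above vs. below the median of x
spread_ratio <- function(outcome, predictor) {
  res   <- resid(lm(outcome ~ predictor))
  upper <- predictor > median(predictor)
  sd(res[upper]) / sd(res[!upper])
}

ratio_het <- spread_ratio(hy_het, hx)
ratio_hom <- spread_ratio(hy_hom, hx)
print(c(heteroscedastic = ratio_het, homoscedastic = ratio_hom))
```

A ratio near 1 corresponds to the even cloud, while a ratio well above (or below) 1 corresponds to the cone shape.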

### What to do if assumption is violated
Transformation of the dependent variable or robust regression are also ways of dealing with this assumption violation. 

## 19.10 Assumption 7: Independence of residuals

Finally, the last assumption of the general linear model is that residuals are independent of each other. That means if you know the residual of one data point, you're not able to predict what the residual will be for the next data point. 

Violations of this assumption occur when the data points themselves are not independent of each other. Values on the predictors and/or the outcome for one data point help narrow down the possible values of another data point. The most common situations where this occurs are time series data and clustered data.

Time series data is data that is collected over time and has a distinct order - data point #2 comes after data point #1, #3 after #2, etc. Often in time series data, the value of the prior data point bleeds over into the value of the following data point. This is easy to see in plots of time series like yearly temperatures. Temperature in your area can vary from 0 to 80 depending on the time of year, but if you knew it was 60 degrees yesterday, it's likely that the temperature today will be something similar. 

<img src="images/ch20-temperature.png" width="600">
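The carry-over from one time point to the next can be sketched with a short simulation. Below, each value of a made-up series is built from 80% of the previous value plus fresh noise, and we then check how strongly neighboring values are correlated. The carry-over strength of 0.8, the names `temp_like` and `noise`, and the seed are assumptions for illustration; this is not real temperature data.

```r
set.seed(19)  # arbitrary seed (an assumption) for reproducibility

n_days <- 500
noise <- rnorm(n_days, 0, 1)     # independent day-to-day shocks
temp_like <- numeric(n_days)     # hypothetical series with carry-over
temp_like[1] <- noise[1]
for (day in 2:n_days) {
  # today's value keeps 80% of yesterday's value, plus a new shock
  temp_like[day] <- 0.8*temp_like[day - 1] + noise[day]
}

# correlation between each value and the one right before it
lag1_series <- cor(temp_like[-n_days], temp_like[-1])
lag1_noise  <- cor(noise[-n_days], noise[-1])
print(c(series = lag1_series, noise = lag1_noise))
```

In the series with carry-over, knowing yesterday's value tells you a lot about today's, while for the independent noise it tells you almost nothing.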

Another example of non-independence among data points is when data come from the same source. Imagine you are testing the effect of time spent studying on test grades, and you collect data from students in three different sections of the same class. Each class has its own quirks - one might be at the end of the day so students are often tired, one might have a particularly tough professor whose tests are hard to study for, etc. In this case, the relationship between study time and test grade might depend on which class you're in. We would call this data clustered, within class. 

Another example of clustered data is multiple measurements taken from the same person. Say you are training students on a new study technique, so you measure their grades before learning the technique, and then again after learning it. You have two data points per person, called **paired samples** or **repeated measures**. You can imagine how well this technique works will depend on the particular person using it, so these data are clustered within person. 

Here's a simulation of what happens to models when you violate the assumption of independence with paired samples. We'll measure the test scores of each student in the class twice, before and after study training:

In [None]:
before <- c(71, 73, 83, 93, 74, 84, 70, 88, 64, 100, 67, 72, 63, 86, 81)
post <- c(75, 73, 82, 100, 82, 84, 77, 89, 60, 100, 67, 82, 66, 87, 80)
student <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

test_scores <- data.frame(student, before, post)
test_scores

At first glance, we may be tempted to treat "score" as the outcome variable, and "pre/post timing" as the explanatory variable. This way we could see if there's a significant difference in test scores from before vs. after the study training. So maybe we'd actually want to arrange the dataset this way:

In [None]:
install.packages('tidyr')
library(tidyr)

#gather() is a function in the tidyr package that lets you reformat datasets
test_scores2 <- gather(test_scores,    #dataset to reformat
                       key='timing',   #name of new variable with time period
                       value='score',  #name of new variable with corresponding score value
                       2:3)            #which columns of the original dataset to combine

test_scores2

This way we could build a model $Y_i = b_0 + b_1X_i + e_i$ where Y<sub>i</sub> is each test score, b<sub>0</sub> is the mean of scores in the "before" group, b<sub>1</sub> is the difference in means between "before" and "post", and X<sub>i</sub> is whether a score was collected before the study training or post-training.

In [None]:
summary(lm(score ~ timing, data = test_scores2))

The problem with this approach is that this assumes each test score is independent - that student 10's before score has no bearing on student 10's post score. But as we just talked about, this is likely not the case. Both of these scores are drawn from the sub-population of student 10, which is probably different than the population of scores for other students. Each student is independent of each other, but within one student their scores are not independent.

If your data are non-independent like this, adding new data from the same person to a dataset doesn't buy you a whole new degree of freedom - some of that information is already present in the dataset in the form of the person's previous data. Building a model on non-independent data means the model will overestimate the degrees of freedom available in the model. The model estimates will be unbiased, but the standard errors (which are calculated with degrees of freedom) will be biased downward. Our significance test decisions will be wrong more often.  

### How to check for assumption
There are some sophisticated ways to see if data are non-independent within clusters. For time series, you can calculate the [autocorrelation](https://corporatefinanceinstitute.com/resources/data-science/autocorrelation/) in the data. For clustered data, there's a measure called the [Intraclass Correlation](https://www.statisticshowto.com/intraclass-correlation/). For the purposes of this class, let's stick to using our intuition about whether or not our data are a time series, or coming from clustered sources.

### What to do if assumption is violated
The easiest way to deal with clustered data is to make it so you only have one datapoint from each independent source. In the case where there are two scores per person, that could mean calculating one *change* score per person, rather than having separate pre and post scores: 

In [None]:
test_scores$testchange <- test_scores$post - test_scores$before

test_scores

This solves our data independence issue. Now we have just one datapoint per person, and we know the people are independent of each other. Now we can build a model with the change scores as our outcome variable. Specifically, if we're interested in asking whether those changes tended to be non-zero, we can use the empty model to find the average score change:

In [None]:
summary(lm(testchange ~ NULL, data = test_scores))

The estimates for the effect of time from both versions of this model are the same, but the standard errors are different and thus the p-values are different. 
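To see the two versions next to each other, the sketch below refits both models on the same test scores and collects their estimates and standard errors into one small table. It is written to be self-contained, so it rebuilds the long-format data with base R instead of ```gather()``` and writes the empty model as `change ~ 1`, which also fits an intercept-only model. The names `long_df`, `change`, and `comparison` are assumptions introduced for this example.

```r
# same scores as in the example above
before <- c(71, 73, 83, 93, 74, 84, 70, 88, 64, 100, 67, 72, 63, 86, 81)
post   <- c(75, 73, 82, 100, 82, 84, 77, 89, 60, 100, 67, 82, 66, 87, 80)

# version 1: treat all 30 scores as if they were independent
long_df <- data.frame(score  = c(before, post),
                      timing = rep(c("before", "post"), each = length(before)))
indep_coefs <- summary(lm(score ~ timing, data = long_df))$coefficients

# version 2: one change score per student, modeled with the empty model
change <- post - before
paired_coefs <- summary(lm(change ~ 1))$coefficients

# collect the effect of time and its standard error from each version
comparison <- data.frame(
  model     = c("score ~ timing", "change ~ 1"),
  estimate  = c(indep_coefs["timingpost", "Estimate"],
                paired_coefs["(Intercept)", "Estimate"]),
  std_error = c(indep_coefs["timingpost", "Std. Error"],
                paired_coefs["(Intercept)", "Std. Error"])
)
print(comparison)
```

The `estimate` column should match across the two rows, while the `std_error` column should not, which is why the two versions can lead to different significance decisions.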

Sometimes it doesn't make sense to combine scores within independent sources, however. In the case where student data are clustered within three classes, combining data into the class level would only leave behind three data points. In cases like this, there are advanced statistical methods that can handle non-independent data. One that we will learn next chapter is repeated-measures ANOVA. Another more general approach that builds on the GLM framework is called [mixed effects modeling](https://meghan.rbind.io/blog/2022-06-28-a-beginner-s-guide-to-mixed-effects-models/). 

For time series, there are also advanced methods designed for that kind of data in particular. 

In the context of this class, we won't deal with these more advanced types of non-independent data. But you should be aware of them so that you don't violate this assumption in the future. 

## 19.11 Summary

In summary, you should combine bias investigation with significance testing and effect size evaluation when building models in order to draw the best conclusions about your data. This will enable you to understand your model's performance in terms of its magnitude, uncertainty, and bias. 

For a quick reference to the assumptions that the general linear model makes and what kind of bias happens when these assumptions are violated, look over the table below. 

| GLM Assumption        | Violation                    | Biased estimate?   | Biased standard error? | 
| :-------------------: | :--------------------------: | :----------------: | :--------------------: |
| Representative sample | Non-representative sample    | √                  | √                      |
| Linear relationship   | Non-linear relationship      | √                  |                        |
| Exogeneity            | Endogeneity/omitted variable | √                  | √                      |
| No multicollinearity  | Multicollinearity            |                    | √                      |
| Normal residuals      | Non-normal residuals         |                    | √                      |
| Homoscedasticity      | Heteroscedasticity           |                    | √                      |
| Independent residuals | Non-independent residuals    |                    | √                      |

## Chapter Summary

After reading this chapter, you should be able to:

- Explain the difference between error and bias
- Explain the difference between biased and unbiased estimators
- Describe the assumptions of the general linear model where violations lead to model bias
- Remember whether violations of each assumption lead to biased model estimates and/or biased standard errors

[Next: Chapter 20 - Bayesian Statistics](https://colab.research.google.com/github/smburns47/Psyc158/blob/main/chapter-20.ipynb)