# Chapter 20 - Sources of Model Bias

A model can be biased in two ways: its coefficients can be incorrect, or our confidence in them can be incorrect.

## 11.10 The case of paired samples

In the example of male and female thumb lengths, each data point represents the thumb length of one individual person. These datapoints are known as **independent** because one person having a thumb length of 60 mm doesn't influence the thumb length of another person.

Not all datapoints are independent, however. Imagine a separate case where we have a dataset of test scores for every student in a class, in which each student took four different versions of the test. The test has been used for many years, so we know the underlying *population* of scores has a mean of 80 and a standard deviation of 10. The data generation process for this test makes a score of 80 more typical than a score of 98.

But when we draw a sample of test scores for *each student*, we're not sampling randomly from the population of scores. Some students are better at studying than others, so their scores tend to be higher. Other students have a heavy courseload, so their scores are more variable, depending on whether they also had to spend a lot of time studying for a test in another class. Thus, each test score for a particular student is *not* independent - it *depends* on which student it came from.

In this case, you can imagine a student as their own data generation process. Their mean is drawn from the overall population of students, but scores within each student are drawn from that student's particular population of scores. This is illustrated in the graph below. 

<img src="images/ch11-multilevel.png" width="800">
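
The idea of each student as their own data generation process can be sketched with a small simulation. The between-student and within-student standard deviations below (8 and 6) are assumptions chosen only so that the combined spread is about 10; they do not come from the chapter's data. For simplicity, every student is given the same within-student spread, although the text notes that some students are more variable than others.

```r
set.seed(20)  # arbitrary seed so the sketch is reproducible

n_students <- 15
n_tests <- 4

# Assumed values: sqrt(8^2 + 6^2) = 10, matching the population SD in the text
between_sd <- 8
within_sd <- 6

# Step 1: each student's own mean is drawn from the population of students
student_mean <- rnorm(n_students, mean = 80, sd = between_sd)

# Step 2: each test score is drawn around that student's own mean
sim_scores <- data.frame(
  student = rep(1:n_students, each = n_tests),
  score = rnorm(n_students * n_tests,
                mean = rep(student_mean, each = n_tests),
                sd = within_sd)
)

# Spread of scores within each student, versus spread of all scores pooled
sd_by_student <- tapply(sim_scores$score, sim_scores$student, sd)
mean(sd_by_student)
sd(sim_scores$score)
```

Under these assumptions, a student's own scores should typically cluster more tightly than the pooled scores do, which is what it means for a score to depend on the student it came from.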

Now imagine a study where we want to understand the effect of a new studying technique. We test it by giving everyone a test at the beginning of the class, then teaching them the new technique, and giving them another test. Below is a sample for a small class:

In [5]:
pretest <- c(71, 73, 83, 93, 74, 84, 70, 88, 64, 100, 67, 72, 63, 86, 81)
posttest <- c(75, 73, 82, 100, 82, 84, 77, 89, 60, 100, 67, 82, 66, 87, 80)
student <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

test_scores <- data.frame(student, pretest, posttest)
test_scores

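| student | pretest | posttest |
|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 71 | 75 |
| 2 | 73 | 73 |
| 3 | 83 | 82 |
| 4 | 93 | 100 |
| 5 | 74 | 82 |
| 6 | 84 | 84 |
| 7 | 70 | 77 |
| 8 | 88 | 89 |
| 9 | 64 | 60 |
| 10 | 100 | 100 |

(First 10 of 15 rows shown.)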


At first glance, we may be tempted to treat "score" as the outcome variable and "pre/post timing" as the explanatory variable. In that case, we'd want to arrange the dataset with one score per row:

In [6]:
score <- c(71, 75, 73, 73, 83, 82, 93, 100, 74, 82, 84, 84, 70, 77, 88, 89, 64, 60, 100, 100,
          67, 67, 72, 82, 63, 66, 86, 87, 81, 80)
timing <- c("before", "post", "before", "post", "before", "post", "before", "post", "before", "post", 
           "before", "post", "before", "post", "before", "post", "before", "post", "before", "post", 
           "before", "post", "before", "post", "before", "post", "before", "post", "before", "post")
student <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 
            13, 13, 14, 14, 15, 15)

test_scores2 <- data.frame(student, score, timing)
test_scores2

summary(lm(score ~ timing, data = test_scores2))

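| student | score | timing |
|---|---|---|
| `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 71 | before |
| 1 | 75 | post |
| 2 | 73 | before |
| 2 | 73 | post |
| 3 | 83 | before |
| 3 | 82 | post |
| 4 | 93 | before |
| 4 | 100 | post |
| 5 | 74 | before |
| 5 | 82 | post |

(First 10 of 30 rows shown.)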



Call:
lm(formula = score ~ timing, data = test_scores2)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.2667  -7.1833   0.7333   6.5667  22.0667 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   77.933      2.878  27.075   <2e-16 ***
timingpost     2.333      4.071   0.573    0.571    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.15 on 28 degrees of freedom
Multiple R-squared:  0.0116,	Adjusted R-squared:  -0.0237 
F-statistic: 0.3286 on 1 and 28 DF,  p-value: 0.5711
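
The one-score-per-row table above was typed out by hand. As a sketch, the same arrangement can be derived from the wide `test_scores` data frame with base R, which avoids transcription errors. The name `long_scores` is introduced here for illustration only.

```r
pretest <- c(71, 73, 83, 93, 74, 84, 70, 88, 64, 100, 67, 72, 63, 86, 81)
posttest <- c(75, 73, 82, 100, 82, 84, 77, 89, 60, 100, 67, 82, 66, 87, 80)
test_scores <- data.frame(student = 1:15, pretest, posttest)

# rbind() stacks the two vectors as rows; as.vector() then reads the matrix
# column by column, giving before1, post1, before2, post2, ...
long_scores <- data.frame(
  student = rep(test_scores$student, each = 2),
  score = as.vector(rbind(test_scores$pretest, test_scores$posttest)),
  timing = rep(c("before", "post"), times = nrow(test_scores))
)

head(long_scores, 4)

# Fitting the same model on the derived table should give the same coefficients
coef(lm(score ~ timing, data = long_scores))
```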


With the data arranged this way, one score per row, we could build a model Y<sub>i</sub> = b<sub>0</sub> + b<sub>1</sub>X<sub>i</sub> + e<sub>i</sub> where Y<sub>i</sub> is each test score, b<sub>0</sub> is the mean of scores in the "before" group, b<sub>1</sub> is the difference in means between "before" and "post", and X<sub>i</sub> indicates whether a score was collected before the studying training or post-training.

The problem with this approach is that it assumes each test score is independent - that student 10's before score has no bearing on student 10's post score. But as we just discussed, we know this is not the case. Both of these scores are drawn from the population of scores for student 10, which is probably different from the population of scores for other students. Students are independent of one another, but the scores within one student are not.
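
One rough way to see this dependence in the sample is to check how closely each student's post score tracks their own before score. A minimal sketch, using the same pretest and posttest vectors as the data above (the name `pre_post_cor` is introduced here for illustration):

```r
pretest <- c(71, 73, 83, 93, 74, 84, 70, 88, 64, 100, 67, 72, 63, 86, 81)
posttest <- c(75, 73, 82, 100, 82, 84, 77, 89, 60, 100, 67, 82, 66, 87, 80)

# Pearson correlation between each student's two scores.
# Values near 1 mean a student's before score largely predicts their post score;
# independent scores would give a value near 0.
pre_post_cor <- cor(pretest, posttest)
pre_post_cor
```

For this class the correlation comes out above 0.9, so knowing which student a score came from tells us a great deal about that score.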

Data like this, which are not independent, are called **paired scores**, **within-subject scores**, or **repeated measures**, and we have to treat them differently from independent data. If we treated them as independent, our model would run fine and we could make predictions, but we would be building those predictions from a dataset where we thought we had 30 independent datapoints rather than 15 pairs of dependent points. We'd be assuming there were more degrees of freedom than there truly are, and our confidence in our predictions would be biased.

When there are two scores per person, we can set up our data slightly differently in order to make an unbiased estimate. If our hypothesis is that the studying technique improves scores, we would expect the mean of post scores to be higher than the mean of before scores. In other words, we would expect an *increase* in scores. So instead of using each specific score in the model, we can model the *change* in each person's test scores:

In [7]:
test_scores$testchange <- test_scores$posttest - test_scores$pretest

test_scores

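| student | pretest | posttest | testchange |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 71 | 75 | 4 |
| 2 | 73 | 73 | 0 |
| 3 | 83 | 82 | -1 |
| 4 | 93 | 100 | 7 |
| 5 | 74 | 82 | 8 |
| 6 | 84 | 84 | 0 |
| 7 | 70 | 77 | 7 |
| 8 | 88 | 89 | 1 |
| 9 | 64 | 60 | -4 |
| 10 | 100 | 100 | 0 |

(First 10 of 15 rows shown.)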


This solves our data independence issue. We now have just one datapoint per person, and we know the people are independent of each other, so we can build a model with the change scores as our outcome variable. Specifically, if we're interested in asking whether those changes tended to be positive, we can use the null model (a model with only an intercept) to find the average score change:

In [8]:
summary(lm(testchange ~ NULL, data = test_scores))


Call:
lm(formula = testchange ~ NULL, data = test_scores)

Residuals:
   Min     1Q Median     3Q    Max 
-6.333 -2.333 -1.333  3.167  7.667 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    2.333      1.036   2.253   0.0409 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.012 on 14 degrees of freedom
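
As a cross-check, base R's `t.test()` with `paired = TRUE` works on the same change scores, so it should reproduce the estimate, t value, degrees of freedom, and p-value of the intercept-only model above. A minimal sketch (the name `paired_result` is introduced here for illustration):

```r
pretest <- c(71, 73, 83, 93, 74, 84, 70, 88, 64, 100, 67, 72, 63, 86, 81)
posttest <- c(75, 73, 82, 100, 82, 84, 77, 89, 60, 100, 67, 82, 66, 87, 80)

# Paired t-test: tests whether the mean of (posttest - pretest) differs from 0
paired_result <- t.test(posttest, pretest, paired = TRUE)

paired_result$estimate   # mean of the differences
paired_result$statistic  # t value
paired_result$parameter  # degrees of freedom
paired_result$p.value
```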


According to the coefficients in both of these models (the `score ~ timing` model and the change-score model), on average there was a 2.33-point improvement in test scores after students learned our studying technique. Not everyone did better (students 2, 6, 10, and 11 saw no change, and students 3, 9, and 15 actually did worse), but on average scores increased. However, in the first model the difference between the means of before and post scores is not significant (p = 0.571), because differences between students inflate the variance within each subgroup of scores, and with it the standard error of the difference (4.071). In the second model, each student serves as their own comparison: the only variance left is in the change scores, and it is small relative to their mean (standard error 1.036). Thus we can be more confident that the mean change is not 0, and the intercept is significant (p = 0.041).
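
The two standard errors can be reproduced by hand, which shows where the difference comes from. This sketch assumes the standard formulas for two groups of equal size: the independent-groups standard error adds the two group variances, while the paired one uses only the variance of the change scores. The names `se_independent` and `se_paired` are introduced here for illustration.

```r
pretest <- c(71, 73, 83, 93, 74, 84, 70, 88, 64, 100, 67, 72, 63, 86, 81)
posttest <- c(75, 73, 82, 100, 82, 84, 77, 89, 60, 100, 67, 82, 66, 87, 80)
n <- length(pretest)

# Treating the 30 scores as independent: both group variances contribute
se_independent <- sqrt(var(pretest) / n + var(posttest) / n)

# Treating them as 15 pairs: only the variance of the changes matters
se_paired <- sd(posttest - pretest) / sqrt(n)

c(independent = se_independent, paired = se_paired)
```

Because a student's two scores are so strongly related, most of the variance in the raw scores is shared between the two tests and cancels out of the change scores.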

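Pairing does not always improve our confidence, though. It helps when, as here, the two scores from the same person are strongly positively correlated. When they are only weakly related, the change scores vary about as much as the raw scores do, and the paired model can leave us less confident because it has fewer degrees of freedom (14 here, versus 28).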