In [None]:
# Run this first so it's ready by the time you need it
install.packages("dplyr")
install.packages("ggformula")
install.packages("interactions")
library(dplyr)
library(ggformula)
library(interactions)
studentdata <- read.csv("https://raw.githubusercontent.com/smburns47/Psyc158/main/studentdata.csv")

# Chapter 15 - Estimating Populations

We've spent a great deal of time so far learning about how to pull insights from sets of data - what the distribution of a variable looks like, how to predict new values based on other variables, etc. Ideally, these insights are useful not just for describing *this* dataset, but for many others you could possibly collect. We hope that our statistical results are good estimates of how things really are in the wider population. That way, we can make more general claims about broad categories that we're interested in, like human psychology. 

But we usually don't have access to the entire population. Instead, we have to collect a smaller sample and hope that it is representative. In a sample like the ```studentdata``` dataset, which collected information from students in statistics class, we might find that we can sort of predict someone's thumb length from how tall they are:

$$\hat{Thumb_i} = 2.84 + 0.87Height_i$$

By fitting this linear model, we predict that someone who is 60 inches tall will have a thumb length of 55.04mm (2.84 + 0.87\*60), and that for every additional inch of height their thumb will be 0.87mm longer. Now, these guesses will almost certainly be wrong - the RMSE of this model is 9.12, meaning on average our guess will be off by about 9mm. But basing our guesses on someone's height makes them at least better than guesses based only on the mean of ```Thumb``` (12% better in fact, since the PRE score is 0.12). 
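As a small sketch of that arithmetic, the prediction can be reproduced in R. The coefficients are copied from the equation above; the function name `predict_thumb` is made up here purely for illustration.

```r
# Coefficients copied from the fitted equation above
b0 <- 2.84   # intercept
b1 <- 0.87   # slope: change in thumb length (mm) per inch of height

# Hypothetical helper function, named only for this illustration
predict_thumb <- function(height) {
  b0 + b1 * height
}

predict_thumb(60)                       # predicted thumb length at 60 inches
predict_thumb(61) - predict_thumb(60)   # difference for one extra inch of height
```

The second line of output is just the slope again: moving up one inch in height changes the prediction by $b_1$.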

Those RMSE and PRE scores are both measures of model performance called effect sizes. We will return to those in a few chapters, but it's an important concept to start getting familiar with because it helps us decide what to *do* with our statistical model. Is it good enough? Am I confident in my predictions?

If we're okay with guesses that are on average 9mm away, we can stop there and move on with our day. If we want to make even better guesses, we can try to build a more complex model with additional predictors. Ultimately, how confident we are in our predictions is the deciding factor for how good our statistical models are. We build psychological theories about relationships between variables in order to find the predictors that will improve predictions about how people think, feel, and behave.  

With a large PRE score or low RMSE, our predictions are quite good. We are confident that the true values will not be too different... in this dataset, specifically. But one additional layer of uncertainty still at play is how good this model will be in *other* samples, or in the whole population.

## 15.1 Sampling distributions 

To answer this question, we must first remember that any one sample we collect won't look exactly like the population. If it's not a representative sample, it'll be different in important ways. But even if it's sampled randomly, chance alone will cause some variation. The law of large numbers tells us that sample statistics settle toward the population values only as samples get large, and psychology samples are always tiny compared to the population (if our population of interest is the billions of humans who have lived or will ever live). Thus, the models we build to explain variation in a sample can *themselves* vary sample to sample. The sample statistics won't be the same as the population parameters, and they won't be the same across different datasets. 

For example, let's pretend we're omniscient and we know the population mean of height for all Claremont College students is $\mu = 66$ inches, and the standard deviation $\sigma$ is 3.5 inches. A researcher doesn't know this, but collects 5 different samples of 50 students each, creating the following sample statistics:

| Sample number | Height mean | Height SD  | 
| :-----------: | :---------: | :--------: |
| 1             | 65.65       | 3.13       |
| 2             | 67.04       | 3.73       |
| 3             | 65.94       | 3.70       |
| 4             | 64.37       | 4.04       |
| 5             | 66.04       | 3.26       |

The sample mean and standard deviation are similar but not exactly equal to the population values. When statistics vary across samples, this is called **sampling error**. There's *variation* in the many sample estimates around the true population parameter. Where else have we seen variation around a central tendency before?

Now let’s take a large number of samples (say, 5,000) of 50 individuals and compute the mean for each of them. The result is a set of 5,000 sample statistics - kinda sounds like a dataset, doesn't it? 

We've spent a long time working with distributions of data around the mean of those data. We measured that central tendency because it is typically the best single number to use to characterize the value of data in a distribution. 

We can extend that logic to understanding how the estimates we get across several samples relate to the true population parameter. We can make a distribution of sample *statistics*, rather than only raw observations. In this case, each statistic (e.g. the mean of each separate sample) is a unique entry in the distribution. This is a **sampling distribution** - a distribution of sample statistics. Specifically, this is a sampling distribution of the mean (since that's the statistic we're calculating).

<img src="images/ch16-sampdist.png"  width="750">

The gray histogram above shows each raw observation for height in the population, and the blue histogram shows the mean values for each 50-person sample. The sample means vary somewhat, but something to notice is that overall they are centered around the population mean. The average of the 5,000 sample means is very close to the true population mean.
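Here is a minimal sketch of how a sampling distribution like this could be simulated. It assumes the population of heights is normally distributed with the $\mu$ and $\sigma$ from the example (the text doesn't say what shape the population has), and it uses base R's `replicate()` as a shortcut for repeating the same expression many times.

```r
set.seed(158)  # arbitrary seed so the sketch is reproducible

pop_mean <- 66    # population mean height from the example
pop_sd   <- 3.5   # population standard deviation from the example

# Draw 5,000 samples of 50 heights each (assuming a normal population),
# keeping only the mean of each sample
sample_means <- replicate(5000, mean(rnorm(n = 50, mean = pop_mean, sd = pop_sd)))

mean(sample_means)    # center of the sampling distribution
range(sample_means)   # individual sample means still vary
```

The average of the simulated sample means should land very close to 66, even though no single sample is guaranteed to.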

Technically you can make a sampling distribution for any statistic you can think of - the mean, a model coefficient $b_1$, an effect size like PRE, etc. It would be a distribution of those values calculated on different samples.

This is how we go from being confident about predictions in our samples, to being confident about how well those predictions *generalize* to other samples and the population. We build a model within a sample, and then ask ourselves how confident we are that that model can make good predictions regardless of the sample we draw.

The process of asking this question is called **hypothesis testing**, and we will dedicate the rest of the course to doing it. Hypothesis testing is when you make a hypothesis about what the population truly looks like, and then decide how well that picture of the world matches up with the dataset you've collected. It is an exercise in hypotheticals, reasoning about the range of conclusions you could make based on a variety of data samples you could collect.  

To prepare for asking these questions, let's learn more about the nature of sampling distributions and how they relate to the population parameter. 

## 15.2 The importance of sample size

You probably have some intuition that statistical inferences get better as your sample size gets bigger. In the popular consciousness, it seems like "sample size" ranks right under "correlation is not causation" in terms of critiques about psychology research. Sample size is indeed an important concern, and in this section you'll learn why. 

In the example above, each of our samples was 50 people. While we typically don't have the time or funding to conduct many iterations of a research experiment, we can use simulation to tell us what it might look like if we did. 

One option is to use the random sampling functions we've learned about before, that draw samples from known distributions - ```rnorm()```, ```runif()```, etc. 

What if we don't have a good idea about what the population looks like though? What if we only have our own subsample of data?

We can actually still do a useful simulation here. We can take our dataset (like ```studentdata```), and just pretend that *that* is the whole population of data. Then we can repeatedly draw random samples from it, with replacement. This simulation technique is called **bootstrapping**, and helps us make inferences about sampling distributions even when we don't know the population parameter. Essentially, we treat our data distribution as a probability distribution and draw many new samples from that. 

Remember when we learned to use the ```sample()``` function to collect samples from a probability distribution? First we defined a sample space, e.g. the six sides of a die ```c(1,2,3,4,5,6)```. Then we said how many items we want to draw from that sample space, the probability of drawing each value, and whether or not we want to draw with replacement (i.e. allow for us to draw the same item multiple times). 
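As a quick refresher, a die-rolling version of that call might look like the sketch below. The variable names `die` and `rolls` are just illustrative.

```r
die <- c(1, 2, 3, 4, 5, 6)   # the sample space: six sides of a die

# Roll the die 10 times: each side equally likely, drawing with replacement
rolls <- sample(x = die, size = 10, replace = TRUE, prob = rep(1/6, 6))
rolls
```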

In bootstrapping, we do the same thing but with a data distribution as our sample space: 

In [None]:
#drawing a sample of 50 heights from the studentdata dataset
#if we leave out the prob argument, the probability of each value will correspond to 
#its frequency in the dataset
bootstrap_sample <- sample(x = studentdata$Height, size = 50, replace = TRUE)
bootstrap_sample

If we calculate the mean of this sample, we come up with one statistical estimate:

In [None]:
mean(bootstrap_sample)

Compare that to the true "population" mean (mean of the entire ```Height``` variable):

In [None]:
mean(studentdata$Height)

Our sample estimate isn't exactly the same, but it's pretty close. 

Now let's scale this up and simulate many samples of heights - say, 1,000 samples. Complete the code below to make this simulation run without errors.

In [None]:
#creating an empty vector of 1000 spots
bootstrap_means50 <- vector(length=1000)

#generate 1000 unique samples, saving each mean
for (i in 1:1000) {
    #draw a sample of 50 heights from studentdata$Height, replace = TRUE
    bootstrap_sample <- #YOUR CODE HERE
    bootstrap_means50[i] <- mean(bootstrap_sample)
}

bootstrap_means50[1:10]

If we turn this set of means into a data frame, we can plot a histogram and see how these means vary. We'll also add a vertical line at the population mean, to see how these sample means compare to it. 

In [None]:
bootstrap_df <- data.frame(bootstrap_means50)
gf_histogram(~ bootstrap_means50, data=bootstrap_df, fill = "red") %>%
    gf_refine(coord_cartesian(x=c(58,77))) %>%
    gf_vline(., xintercept = mean(studentdata$Height))

Compare the spread of this sampling distribution to the distribution of raw values in ```studentdata$Height```:

In [None]:
gf_histogram(~ Height, data=studentdata) %>% 
    gf_refine(coord_cartesian(x=c(58,77))) %>%
    gf_vline(., xintercept = mean(studentdata$Height))

There's variability among the sample means such that nearly all are different from the "population" mean, but any one sample mean gets pretty close to the population parameter. Further, if we take the mean of *means* and compare it to the population parameter: 

In [None]:
mean(bootstrap_means50)
mean(studentdata$Height)

They are nearly identical. 

Now what if we collected just as many samples, but small ones? Edit this code to do the same sampling procedure as above, but with ```size = 5``` for each sample.

In [None]:
#creating an empty vector of 1000 spots
bootstrap_means5 <- vector(length=1000)

#generate 1000 unique samples, saving each mean
for (i in 1:1000) {
    #draw a sample of 5 heights from studentdata$Height, replace = TRUE
    bootstrap_sample <- #YOUR CODE HERE
    m <- mean(bootstrap_sample)
    bootstrap_means5[i] <- m
}

bootstrap_df$bootstrap_means5 <- bootstrap_means5

#plotting distribution of n=50 samples
gf_histogram(~ bootstrap_means50, data=bootstrap_df, fill = "red") %>%
gf_refine(coord_cartesian(x=c(58,77))) %>% 
gf_vline(., xintercept = mean(studentdata$Height))

#plotting distribution of n=5 samples
gf_histogram(~ bootstrap_means5, data=bootstrap_df, fill = "blue") %>%
gf_refine(coord_cartesian(x=c(58,77))) %>% 
gf_vline(., xintercept = mean(studentdata$Height))

#distribution of population data
gf_histogram(~ Height, data=studentdata) %>% 
gf_refine(coord_cartesian(x=c(58,77))) %>% 
gf_vline(., xintercept = mean(studentdata$Height))

This sampling distribution is wider than the first - on average, each sample mean is farther away from the true population parameter. However, the mean of *means* is still quite close:

In [None]:
mean(bootstrap_means5)
mean(studentdata$Height)

Finally, let's take this exercise to the extreme with a sample size that is as small as possible. Modify the code to draw 1000 samples of only 1 data point each. 

In [None]:
#creating an empty vector of 1000 spots
bootstrap_means1 <- vector(length=1000)

#generate 1000 unique samples, saving each mean
for (i in 1:1000) {
    #draw a sample of 1 height value from studentdata$Height, replace = TRUE
    bootstrap_sample <- #YOUR CODE HERE
    m <- mean(bootstrap_sample)
    bootstrap_means1[i] <- m
}

bootstrap_df$bootstrap_means1 <- bootstrap_means1

#plotting distribution of n=50 samples
gf_histogram(~ bootstrap_means50, data=bootstrap_df, fill = "red") %>%
gf_refine(coord_cartesian(x=c(58,77))) %>% 
gf_vline(., xintercept = mean(studentdata$Height))

#plotting distribution of n=5 samples
gf_histogram(~ bootstrap_means5, data=bootstrap_df, fill = "blue") %>%
gf_refine(coord_cartesian(x=c(58,77))) %>% 
gf_vline(., xintercept = mean(studentdata$Height))

#plotting distribution of n=1 samples
gf_histogram(~ bootstrap_means1, data=bootstrap_df, fill = "orange") %>%
gf_refine(coord_cartesian(x=c(58,77))) %>% 
gf_vline(., xintercept = mean(studentdata$Height))

#distribution of population data
gf_histogram(~ Height, data=studentdata) %>% 
gf_refine(coord_cartesian(x=c(58,77))) %>% 
gf_vline(., xintercept = mean(studentdata$Height))

This reveals a sampling distribution that approaches the same shape as the population distribution, since each "sample" is just one data point from the population. And yet, the mean of even this wide sampling distribution is still close to the population mean:

In [None]:
mean(bootstrap_means1)
mean(studentdata$Height)

Based on this demonstration, we can see that if you only have a few observations in a sample, the mean of that sample is unlikely to be an accurate estimate of the population parameter; if you replicate a small experiment and recalculate the mean you’ll get a very different answer. The sampling distribution is quite wide. This is why statistical answers from studies with a small sample can be misleading.

In contrast, if you run a large experiment and replicate it with another large sample, you’ll probably get nearly the same answer you got last time. The sampling distribution will be very narrow. If we took this to the extreme other side and drew samples that were infinitely large, we'd create a sampling distribution that was just a single line at the population mean. This would be because we'd be taking samples over and over that *are* the population.  

Since we don't have this full population though, only smaller samples, we can quantify the variation in the sampling distribution by calculating the standard deviation of the sampling distribution. This is referred to as the **standard error.** We were exploring the mean as a statistic above, but you can calculate the standard error of any statistic. The standard error of a statistic is often denoted SE. Think of it as how far a sample estimate typically is from the true population parameter.

Sample size matters for collecting data samples because as the sample size gets larger, the standard error of the sampling distribution gets smaller. Our sample estimates are on average closer to the true population parameter. 
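To put numbers on this, here is a rough sketch that estimates the standard error by simulation for three sample sizes. It bootstraps from a simulated stand-in "population" of heights rather than from ```studentdata```, so the exact values are only illustrative; `standard_error_sim` is a made-up helper name.

```r
set.seed(158)  # arbitrary seed so the sketch is reproducible

# Stand-in "population" of heights (simulated; NOT the studentdata values)
population <- rnorm(n = 10000, mean = 66, sd = 3.5)

# Hypothetical helper: SD of 1,000 bootstrapped sample means for sample size n
standard_error_sim <- function(n) {
  means <- replicate(1000, mean(sample(x = population, size = n, replace = TRUE)))
  sd(means)
}

se_by_size <- sapply(c(1, 5, 50), standard_error_sim)
names(se_by_size) <- c("n = 1", "n = 5", "n = 50")
se_by_size   # standard error shrinks as sample size grows
```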

## 15.3 The Central Limit Theorem

Despite how poorly a single small sample can do on telling you about a population parameter, it's important to remember that several samples together, even if very small, will stack up into a distribution that is centered on the true population parameter. That's why, no matter how small our samples were in the previous demonstration, the mean of means was always very close to the population mean. 

On the basis of what we've seen so far, it seems like we have evidence for the following claims about the sampling distribution of the mean:

- The mean of the sampling distribution is the same as the mean of the population
- The standard deviation of the sampling distribution (i.e., the standard error) gets smaller as sample size increases

As it happens, not only are these statements true, there is a very famous theorem in statistics that proves them, known as the **Central Limit Theorem.** We won't spend time here on the [mathematical proof](https://en.wikipedia.org/wiki/Central_limit_theorem#Proof_of_classical_CLT) that establishes this - our simulations above are demonstration enough for our purposes. But from this proof we can get specific equations for calculating the standard error. 

When the mean is our sample estimate of interest, the Central Limit Theorem tells us, for a population with mean $\mu$ and standard deviation $\sigma$, that the sampling distribution of the mean also has mean $\mu$, that its shape gets closer to a normal distribution as the sample size grows, and that the standard error of the mean (SEM) is

$$SEM = \frac{\sigma}{\sqrt{N}}$$

where N is the size of a sample. This says that, when sampling from a population with standard deviation $\sigma$, the means of all samples of size N will vary around the true mean by the SEM amount on average. Further, because we divide the population standard deviation $\sigma$ by the square root of the sample size N, the SEM gets smaller as the sample size increases. 

This result is useful for all sorts of things. It tells us why large experiments are more reliable than small ones, and because it gives us an explicit formula for the standard error it tells us how much *more* reliable a large experiment is than a small one. 
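The formula can be checked against a simulation. The sketch below reuses the height example ($\mu = 66$, $\sigma = 3.5$, N = 50) and assumes a normally distributed population, which is an assumption made only for this illustration.

```r
set.seed(158)  # arbitrary seed so the sketch is reproducible

sigma <- 3.5   # population standard deviation from the height example
n     <- 50    # size of each sample

# Standard error of the mean according to the formula
sem_formula <- sigma / sqrt(n)

# Standard error of the mean according to simulation:
# the SD of many simulated sample means
sample_means  <- replicate(5000, mean(rnorm(n = n, mean = 66, sd = sigma)))
sem_simulated <- sd(sample_means)

sem_formula
sem_simulated

# Because of the square root, quadrupling the sample size only halves the SEM
sigma / sqrt(4 * n)
```

The two values should agree closely. The last line shows what the formula implies about "how much more reliable": precision improves with the square root of sample size, not with sample size itself.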

## 15.4 Estimating population parameters

In the simulations in the previous sections, we knew the population parameter ahead of time. This is helpful for learning about statistics, but of course the most interesting things to research are the things we don't already know: cases where we don't know the population distribution. So if we want to calculate standard error (how far away our estimate of a mean is likely to be from the population mean) with an equation like $\frac{\sigma}{\sqrt{N}}$, how do we do that when we don't know $\mu$ or $\sigma$?

For instance, suppose you wanted to measure the IQ of Claremont College students. IQ in the general population is known - it has a mean of 100 and a standard deviation of 15. But maybe Claremont College students are a different sort of population, with a different mean and SD. So to find out the IQ of the kind of students who come to the 5Cs, we’re going to have to estimate the population parameters from a sample of data. So how do we do this?

### Estimating the population mean
Suppose we camp outside Frary and ask 100 students to take an IQ test for us. The average IQ score among these people turns out to be $\bar{X}=102.5$. So what is the true mean IQ for the entire population of Claremont College students? Obviously, we don’t know the answer to that question. It could be 97.2, or 108. We only have one sample, so we cannot give a definitive answer. Nevertheless, right now our “best guess” is 102.5. That’s the essence of statistical estimation: giving a best guess.

In this example, estimating the unknown population parameter for the mean is straightforward. We calculate the sample mean, and we use that as an estimate of the population mean. We calculate $\bar{X}$, and treat that number as our best guess about the value of $\mu$. Much like we gave hats to symbols in the modeling chapters to designate predicted values vs. true values, we can do so here to designate what our predictions are of the population parameter, based on a sample of data.

| Symbol     | What is it?                       | 
| :--------: | :-------------------------------: | 
| $\bar{X}$  | Sample mean                       |
| $\mu$      | True population mean              |
| $\hat{\mu}$| Prediction of the population mean | 

When we have one data sample, $\bar{X}$ is thus used as $\hat{\mu}$ but we keep in mind that this is likely to vary on average from the true $\mu$ by the standard error. 

### Estimating the population standard deviation
So far, estimation seems pretty simple - just take a sample estimate and treat it as the population parameter. So you might be wondering why we forced you to read through all that stuff about sampling theory before getting here. In the case of the mean, our sample statistic (i.e. $\bar{X}$) turned out to be the best approximation of the true population parameter $\mu$ that we could come up with given our data. However, that’s not always true. To see this, let’s think about how to construct an estimate of the population standard deviation, which we’ll denote $\hat{\sigma}$. What shall we use as our estimate in this case? 

| Symbol        | What is it?                   |
| :-----------: | :---------------------------: | 
| $s$           | Sample sd                     |
| $\sigma$      | True population sd            |
| $\hat{\sigma}$| Estimate of the population sd |

Your first thought might be that we could do the same thing we did when estimating the mean, and just use the sample statistic as our estimate. That’s almost the right thing to do, but not quite. 

Here’s why. Suppose we have a sample that contains a single IQ observation of 98. It has a sample mean of 98, and because every observation in this sample is equal to the sample mean (obviously!) it has a sample standard deviation of 0: the sample contains a single observation and therefore there is no variation observed within the sample. 

But as an estimate of the population standard deviation, this may feel completely wrong. Knowing that data implies variability, the only reason that we don’t see any variability in the sample is that the sample is too small to display any variation, not because everyone has an IQ of 98. So, if you have a sample size of N=1, it feels like the right answer about the population standard deviation is just to say “no idea at all”. 

Suppose we now make a second observation. The dataset now has N=2 observations, and the complete sample now contains the observations 98 and 100. This time around, our sample is just large enough for us to be able to observe some variability: two observations is the bare minimum number needed for any variability to be observed. For our new dataset, the sample mean is $\bar{X} = 99$, and the sample standard deviation is $s = 1$ (computed here by simply averaging the squared deviations, i.e. dividing by N). 

Now that we have variation in our sample, is *this* estimate of the standard deviation going to be more reliable? Or will it also be too small, like 0 was too small? 

We can use R to simulate the results of many samples to demonstrate this. Given the true population mean of IQ is 100 and the standard deviation is 15, we can use the ```rnorm()``` function to generate the results of an experiment in which we measure N=2 IQ scores, and calculate the sample standard deviation. If we do this over and over again, and plot a histogram of these sample standard deviations, we get the sampling distribution of the standard deviation (see figure below). 

<img src="images/ch16-sdestimate.png"  width="350">

Even though the true population standard deviation is 15, the average of the sample standard deviations is only 8.5. Notice that this is a very different result to what we found when we plotted the sampling distribution of the mean of ```Height``` in ```studentdata```. In that sampling distribution, the population mean was 65.95, and the mean of the sample means was always close to that.
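A sketch of how a simulation like this could be run is below. It assumes the standard deviation of each sample is computed by dividing by N (that is the version that gives $s = 1$ for the observations 98 and 100), so it defines its own helper, `sd_n`, instead of using R's built-in ```sd()```, which divides by N-1.

```r
set.seed(158)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: standard deviation that divides by N
# (R's built-in sd() divides by N - 1 instead)
sd_n <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

sd_n(c(98, 100))   # the two-observation example from the text

# 10,000 simulated experiments, each measuring N = 2 IQ scores
sample_sds <- replicate(10000, sd_n(rnorm(n = 2, mean = 100, sd = 15)))

mean(sample_sds)   # noticeably below the true value of 15
```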

Now let’s extend the simulation. Instead of restricting ourselves to the situation where we have a sample size of N=2, let’s repeat the exercise again for sample sizes from 1 to 10. If we plot the average sample mean and average sample standard deviation as a function of sample size, we get the results shown in this next figure. 

<img src="images/ch16-estimatesim.png"  width="650">

On the left hand side (panel A) is the average sample mean for each sample size, and on the right hand side (panel B) is the average standard deviation for each sample size. The two plots are quite different: no matter the sample size, the average of the sampling distribution of the mean is equal to the population mean. This means it is an **unbiased estimator**. This is the reason why your best estimate for the population mean is the sample mean. 

The plot on the right shows that the average of the sampling distribution of the standard deviation is always smaller than the population standard deviation $\sigma$ for small sample sizes. No matter how many samples you collect, the central tendency of that sampling distribution will be systematically wrong. This means it is a **biased estimator.** In other words, if we want to make a “best guess” $\hat{\sigma}$ about the value of the population standard deviation $\sigma$, we should make sure our guess is a little bit larger than the sample standard deviation $s$.

To fix this systematic bias, we rely on degrees of freedom! If you recall from Chapter 5, the sample variance is defined to be the average of the squared deviations from the sample mean, with one change. That is:

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})^2$$

where the sum of the squared deviations is divided by N-1 instead of by N as in a normal average. As it turns out, this change is all we need to do to make variance an unbiased estimator. The same correction is used in the equation for standard deviation, where it removes most (though technically not quite all) of the bias: 

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})^2}$$

When calculating something like mean squared error in a statistical model, we generalize this to: 

$$MSE = \frac{1}{N-k}\sum_{i=1}^{N}(Y_i-\hat{Y_i})^2$$

where k is the number of parameters in the model. This is why we use degrees of freedom when making statistical estimates: dividing by a slightly smaller number corrects for how much estimates of variability from a sample undershoot the actual variability in the population. 
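The effect of dividing by N-1 rather than N can be seen in a small simulation. This sketch assumes normally distributed IQ scores ($\mu = 100$, $\sigma = 15$) and a sample size of 5, both chosen only for illustration.

```r
set.seed(158)  # arbitrary seed so the sketch is reproducible

n        <- 5      # size of each sample (illustrative choice)
true_var <- 15^2   # population variance: 225

# Each column of this matrix is one simulated sample of n IQ scores
samples <- replicate(20000, rnorm(n = n, mean = 100, sd = 15))

# Variance of each sample, dividing by N
var_div_n <- apply(samples, 2, function(x) sum((x - mean(x))^2) / n)

# Variance of each sample, dividing by N - 1 (this is what var() does)
var_div_n_minus_1 <- apply(samples, 2, var)

mean(var_div_n)           # systematically below 225
mean(var_div_n_minus_1)   # close to 225
```

On average, the divide-by-N version undershoots the population variance, while the N-1 version centers on it.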

When we want to know the standard error of the mean (how far off one sample mean of size N is likely to be from the true population mean), we need to use the population parameter $\sigma$ in the calculation. But we don't know it for sure, so our next best option is to use an *estimate* of the population standard deviation, calculated from our sample standard deviation and degrees of freedom:

$$SEM = \frac{\hat{\sigma}}{\sqrt{N}}$$
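In R, this calculation might look like the sketch below. The IQ scores here are simulated stand-ins, not real data from Claremont students, and `iq_sample` is a made-up variable name.

```r
set.seed(158)  # arbitrary seed so the sketch is reproducible

# Hypothetical sample of 100 IQ scores (simulated, not real data)
iq_sample <- rnorm(n = 100, mean = 100, sd = 15)

mu_hat    <- mean(iq_sample)   # best guess for the population mean
sigma_hat <- sd(iq_sample)     # sd() divides by N - 1, giving our estimate of sigma

# Estimated standard error of the mean
sem <- sigma_hat / sqrt(length(iq_sample))

mu_hat
sem
```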

## 15.5 Sampling distributions of model outputs

The mean and standard deviation aren't the only sampling distributions we use. All sample statistics have a counterpart parameter that describes the true population, and thus a distribution of estimates can be built from many samples. 

Another sampling distribution that will be relevant to us is the distribution of coefficients from a statistical model. Let's fit a simple linear regression predicting thumb length from height in the ```studentdata``` dataset and look at the estimate of $b_1$:

In [None]:
#fit a linear model of Thumb predicted by Height in the studentdata dataset
height_model <- #YOUR CODE HERE
height_model
gf_point(Thumb ~ Height, data=studentdata) %>% gf_lm()

The estimate for $b_1$ is 0.87, indicating that a regression line best fits *these data* with the least error if it has a slope of 0.87. In other words, our best guess for the effect of Height in the population is 0.87. But now let's treat ```studentdata``` as a population again, and bootstrap some separate samples, fitting a model to each one. We'll use the function ```slice_sample()``` from the ```dplyr``` package in order to sample entire rows of a data frame. 

In [None]:
#creating an empty vector of 1000 spots
bootstrap_b1 <- vector(length=1000)

#generate 1000 unique samples, saving each slope estimate
for (i in 1:1000) {
    #draw a sample of 50 rows from studentdata, replace = TRUE
    bootstrap_sample <- slice_sample(studentdata, n=50, replace=TRUE)
    bootstrap_model <- lm(Thumb ~ Height, data=bootstrap_sample)
    bootstrap_b1[i] <- bootstrap_model$coefficients[[2]]
}

bootstrap_df <- data.frame(bootstrap_b1)

#plotting distribution of n=50 samples
gf_histogram(~ bootstrap_b1, data=bootstrap_df, fill = "red")

From this, we can clearly see that models built on different data will come up with different coefficient estimates. Some will say that ```Thumb``` and ```Height``` will relate to each other in much the same way as they do in the "population": 

In [None]:
set.seed(23)
bootstrap_sample <- slice_sample(studentdata, n=50, replace=TRUE)
lm(Thumb ~ Height, data=bootstrap_sample)
gf_point(Thumb ~ Height, data=bootstrap_sample) %>% gf_lm()

But others will over-estimate the model slope:

In [None]:
set.seed(735)
bootstrap_sample <- slice_sample(studentdata, n=50, replace=TRUE)
lm(Thumb ~ Height, data=bootstrap_sample)
gf_point(Thumb ~ Height, data=bootstrap_sample) %>% gf_lm()

While still others will even flip the direction of the effect, such that we predict someone's thumb would be *shorter* for each one-inch increase in height: 

In [None]:
set.seed(352)
bootstrap_sample <- slice_sample(studentdata, n=50, replace=TRUE)
lm(Thumb ~ Height, data=bootstrap_sample)
gf_point(Thumb ~ Height, data=bootstrap_sample) %>% gf_lm()

This tells us that the estimate of the coefficient from one dataset may not be the true population coefficient, and is likely to vary around the population parameter by the amount of the standard error. The equation for the standard error of a regression coefficient is a more complicated equation than those for the standard error of the mean so we won't make you learn it, but it is again a function of sample size. You can actually find it directly in R with a new function that works on model objects, ```summary()```. 

In [None]:
summary(height_model)

This gives you a lot more information about a model than just its coefficients. For our purposes right now, look at the table under the heading "Coefficients:". There is one row for each parameter estimated by our model - the intercept $b_0$ and the effect of ```Height``` $b_1$. The first column, "Estimate", shows you the same coefficient estimate as you would get from simply typing the model object name into the console window. 

Some new information is in the next column over, called "Std. Error". This stands for the standard error of the estimate, or how much this estimate is likely to vary on average from the population coefficient based on the sample size for these particular data.
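If you want these numbers as values rather than printed text, one option is to pull the coefficient table out of the summary object. The sketch below uses simulated stand-in data (not ```studentdata```) so that it runs on its own; the names `sim_data` and `sim_model` are made up for the illustration.

```r
set.seed(158)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data (NOT studentdata): thumb length loosely related to height
n      <- 50
Height <- rnorm(n, mean = 66, sd = 3.5)
Thumb  <- 2.84 + 0.87 * Height + rnorm(n, mean = 0, sd = 9)
sim_data <- data.frame(Height, Thumb)

sim_model <- lm(Thumb ~ Height, data = sim_data)

# The "Coefficients:" table from summary() is stored as a matrix
coef_table <- summary(sim_model)$coefficients
coef_table

coef_table["Height", "Estimate"]     # the b1 estimate
coef_table["Height", "Std. Error"]   # its standard error
```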

## 15.6 Confidence intervals

All of this is to say, our sample estimates are only that - estimates. They are likely to be more similar to the true population parameter the larger our samples are, but there's still variation among estimates from different data samples. So how well can we ever trust one particular sample estimate from our one humble research study? How certain can we be that our estimate of the mean, model coefficient, etc. is close to the population parameter? It would be nice to say that we are reasonably confident that the true parameter is in some window around this estimate. 

In fact, there is a way we can say exactly that and find an interval of values around our estimate that expresses a degree of confidence that the true parameter is somewhere in that interval. 

Our understanding of sampling distributions now will help us construct this concept. Let's start out with an estimate of the $b_1$ coefficient in a general linear model. Suppose the true population coefficient $\beta_1$ is 0. Due to sampling error, if we collect samples that are smaller than the population size, there is a distribution of $b_1$ estimates that might happen even with the same underlying population parameter. You can see these $b_1$ estimates in the sampling distribution below. Each value in this distribution is one $b_1$ value that would be estimated from a random sample of the population. 

<img src="images/ch15-betadist0.png" width="400">

Note that this sampling distribution is shaped normally. Some $b_1$ estimates are more likely to occur than others. Thus you can think of the sampling distribution also like a probability distribution centered on the population parameter value. In this way, getting a sample whose $b_1$ estimate is 1 (where the distribution has higher density) is more likely to occur. Values that are far away from 0, like 6 (in the tails of the distribution), are less likely to occur.

Now instead imagine that the true population coefficient $\beta_1$ is actually 10. Below is the visualization of this situation. In this case it is still possible to get $b_1$ estimates of 1 or 6, but a 6 is more likely and a 1 is less likely.

<img src="images/ch15-betadist10.png" width="450">

The width of these sampling distributions is determined by the standard error of the estimate, which is related to sample size. In these sampling distributions, even though a $b_1$ of 6 does not exactly match the population parameter, it is still not that unusual to draw a sample with that as the estimate. 
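To see the link between sample size and the width of a sampling distribution, we can simulate one. This is only a sketch under assumed values: the helper `simulate_b1()`, the residual spread of 3, and the two sample sizes are all hypothetical choices made for illustration.

```r
# Hypothetical helper: draw many samples from a population with a known slope
# and return the slope estimate (b1) from each one
simulate_b1 <- function(sample_size, true_beta1 = 0, n_samples = 1000) {
  replicate(n_samples, {
    x <- rnorm(sample_size)
    y <- true_beta1 * x + rnorm(sample_size, sd = 3)
    coef(lm(y ~ x))[["x"]]   # keep only the slope estimate
  })
}

set.seed(15)
b1_small <- simulate_b1(sample_size = 20)
b1_large <- simulate_b1(sample_size = 200)

# Both sampling distributions are centered near the true slope of 0
mean(b1_small)
mean(b1_large)

# The standard deviation of each sampling distribution is its standard error
sd(b1_small)
sd(b1_large)
```

The estimates from the larger samples cluster more tightly around the population parameter, which is what a smaller standard error means.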

Now, remember that we don't actually know what the population parameter is. We only know the $b_1$ estimate we've gotten for our particular sample. Thus if we calculate $b_1 = 6$, from our musings so far we can see that it *could have* been produced by $\beta_1 = 0$ with reasonable likelihood. It also *could have* been produced by $\beta_1 = 10$ with reasonable likelihood. And these are just two of the many possible parameters that could have produced the sample estimate of 6. 

However, consider a population parameter like -10:

<img src="images/ch15-betadist-10.png" width="450">

Getting a sample estimate $b_1 = 6$ out of this population parameter is very unlikely. It's still technically possible because the range of the normal distribution stretches from $-\infty$ to $\infty$, but it's so unlikely that we would be very surprised to see that happen.

Thus, there is a range of $\beta_1$ parameters that we think can produce a $b_1 = 6$ estimate a reasonable amount of the time. There are also $\beta_1$ values that we think are very unlikely to ever produce a $b_1 = 6$ estimate. 
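We can put rough numbers on "reasonable" and "very unlikely" with ```pnorm()```. The sketch below assumes a standard error of 4, which is a made-up value for illustration (the figures above don't state one), and asks how often each candidate $\beta_1$ would produce an estimate at least as far away from it as 6 is.

```r
b1_observed     <- 6
se_assumed      <- 4               # hypothetical standard error
candidate_betas <- c(-10, 0, 10)   # the three population values discussed above

# How many standard errors separate the observed estimate from each candidate?
distance_in_se <- abs(b1_observed - candidate_betas) / se_assumed

# Probability of an estimate at least that far from the candidate, in either direction
prob_this_far <- 2 * pnorm(-distance_in_se)

data.frame(beta1 = candidate_betas, distance_in_se, prob_this_far)
```

Under this assumed SE, an estimate of 6 is an ordinary result for $\beta_1 = 0$ or $\beta_1 = 10$, but it sits 4 standard errors away from $\beta_1 = -10$, which almost never happens.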

This range of $\beta_1$ parameters that might reasonably produce our sample's $b_1$ estimate is the **confidence interval** of the estimate. Confidence intervals allow us to quantify the expected variation in a sample estimate and make statements such as, “We are 95% confident that the true population parameter falls between these two values.” In order to make such a statement, we need a way to find a lower bound and an upper bound of $\beta_1$ values that are likely to produce our $b_1$ estimate.

We know from our discussion of standard deviation in Chapter 5 that 95% of a normal distribution will fall within about 2 standard deviations of the distribution's mean (it's actually 1.96 standard deviations, to be precise). Applied to a sampling distribution, we can find the range of $b_1$ estimates that 95% of the time will be generated by the sampling distribution of $\beta_1$. To put that in mathematical terms:

$$\beta_1 - (1.96 * SE) \leq b_1 \leq \beta_1 + (1.96 * SE)$$

where the SE is the standard error (aka the standard deviation of the sampling distribution). This equation tells us that, for 100 sample estimates $b_1$, 95 of them are expected to fall between $\beta_1 - (1.96 * SE)$ and $\beta_1 + (1.96 * SE)$. 
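The 95% figure can be checked directly, both from the normal distribution itself and by simulation. In this sketch the values of `beta1` and `se` are hypothetical; any values would give the same proportion.

```r
# Share of a normal distribution within 1.96 standard deviations of its mean
pnorm(1.96) - pnorm(-1.96)

# The same idea by simulation, with hypothetical values for beta_1 and the SE
set.seed(415)
beta1 <- 10
se    <- 4
b1_draws <- rnorm(10000, mean = beta1, sd = se)

# Proportion of simulated estimates that land inside the range in the equation
inside <- b1_draws >= beta1 - 1.96 * se & b1_draws <= beta1 + 1.96 * se
mean(inside)
```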

However, that’s not answering the question that we’re actually interested in. The equation above tells us what we should expect about $b_1$ estimates, given that we know what the population parameter is. What we want is to have this work the other way around: we want to know what we should believe about the population parameter, given that we have observed a particular sample estimate. 

So to answer this question instead, we do some algebra. We first find the lower bound $\beta_{low}$ where a specific $b_1$ is at most 1.96 SE's above the population parameter.

$$b_1 = \beta_{low} + 1.96*SE$$

Rearranging this to solve for $\beta_{low}$, we find:

$$\beta_{low} = b_1 - 1.96*SE$$

This means that the lowest $\beta_1$ that is still likely to produce our specific estimate of $b_1$ is $b_1 - (1.96*SE)$. 

The upper bound is the $\beta_1$ where $b_1$ is at most 1.96 SE's *below* the population parameter.

$$b_1 = \beta_{high} - 1.96*SE$$

And rearranged: 

$$\beta_{high} = b_1 + 1.96*SE$$

This means that the highest $\beta_1$ that is still likely to produce our specific estimate of $b_1$ is $b_1 + (1.96*SE)$. 

Thus, if the true $\beta_1$ is anything in this range, $b_1$ is a reasonably likely outcome. 
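As a worked example of the two rearranged equations, here is the calculation for $b_1 = 6$. The standard error of 4 is again an assumed value for illustration only.

```r
b1 <- 6
se <- 4   # hypothetical standard error

beta_low  <- b1 - 1.96 * se
beta_high <- b1 + 1.96 * se
c(lower = beta_low, upper = beta_high)   # -1.84 and 13.84

# Which of the candidate population values from earlier fall inside this range?
candidates  <- c(-10, 0, 10)
in_interval <- candidates >= beta_low & candidates <= beta_high
data.frame(beta1 = candidates, in_interval)
```

With this SE, 0 and 10 are inside the range and -10 is outside it, matching the intuition built from the three sampling distributions.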

<img src="images/ch15-cirange.png" width="600">

Since the specific values in a confidence interval are calculated based on an estimate $b_1$, you should think of the specific range of a confidence interval as being an estimate too. If we were to collect 100 samples, we would get 100 unique versions of this interval, since we'd have 100 unique values of $b_1$. But about 95 out of 100 of these confidence intervals (95% of them in the long run) would contain the true population parameter. This is why we say we are 95% confident that a particular interval we calculate for a particular sample contains the population parameter. We refer to this range as a 95% confidence interval, denoted CI<sub>95</sub>. We can write this as our general formula for the 95% confidence interval:

$$CI_{95} = b \pm (1.96*SE)$$

Where $b$ is some statistic estimate. When reporting statistical results in text, we usually include the coefficient estimate followed by the 95% CI in brackets. E.g., "the estimate of the effect of Height is $b_1 = 0.87$, 95% CI [0.50, 1.25]".
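The claim that about 95% of such intervals contain the true parameter can be checked by simulation. The sketch below uses a sample mean as the estimate, with its standard error computed as the sample standard deviation divided by $\sqrt{n}$. The population values (mean 50, SD 10) and the sample size are hypothetical.

```r
set.seed(443)
true_mean   <- 50     # hypothetical population mean
true_sd     <- 10     # hypothetical population standard deviation
sample_size <- 50
n_studies   <- 5000

# For each simulated study, build a CI and record whether it contains the true mean
covers <- replicate(n_studies, {
  x  <- rnorm(sample_size, mean = true_mean, sd = true_sd)
  se <- sd(x) / sqrt(sample_size)        # estimated standard error of the mean
  lower <- mean(x) - 1.96 * se
  upper <- mean(x) + 1.96 * se
  lower <= true_mean && true_mean <= upper
})

# Proportion of intervals that captured the true mean
mean(covers)
```

The proportion should land close to 0.95. It tends to come out slightly under, because the SE is itself estimated from each sample rather than known exactly.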

Note that there’s nothing special about the number 1.96 here. That's just the number of standard deviations from the mean within which 95% of a normal distribution can be found. We're using 95% to represent a reasonable probability of getting our sample estimate, but that's just by convention (which we'll discuss more next chapter). If we wanted a 70% confidence interval instead, where we're 70% confident that the interval contains the population parameter, we could have used the ```qnorm()``` function to calculate the 15th and 85th quantiles:

In [None]:
#2.5th and 97.5th quantiles of the normal distribution 
qnorm(p=c(0.025,0.975))

#15th and 85th quantiles of the normal distribution
qnorm(p=c(0.15,0.85))

So the formula for CI<sub>70</sub> would be the same as the formula for CI<sub>95</sub> except that we’d use 1.036 as our magic number rather than 1.96. This would be a narrower range of values than the CI<sub>95</sub>, because to increase our confidence that the population parameter is covered by our interval, we need to include more possibilities (a larger range) in that interval. 
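The same lookup can be wrapped up for any confidence level. The function name `z_multiplier()` below is a hypothetical helper written for this illustration, not something built into R.

```r
# Hypothetical helper: the "magic number" for a given confidence level
z_multiplier <- function(level) {
  tail_area <- (1 - level) / 2   # probability left over in each tail
  qnorm(1 - tail_area)
}

z_multiplier(0.95)
z_multiplier(0.70)

# Interval width at several confidence levels, for an assumed SE of 1
conf_levels <- c(0.70, 0.95, 0.99)
data.frame(level      = conf_levels,
           multiplier = z_multiplier(conf_levels),
           width      = 2 * z_multiplier(conf_levels) * 1)
```

Reading down the table, higher confidence always costs a wider interval.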

Additionally, because the SE is included in the equation, we can conclude that smaller sample sizes (and thus larger SE) lead to a wider confidence interval for the same level of confidence, and larger sample sizes lead to a narrower confidence interval. In other words, we can be 95% confident a population parameter is in a narrower range of values with larger sample sizes. 

<img src="images/ch16-cis.png" width="700">

*95% confidence intervals. The top (panel A) shows 50 simulated replications of an experiment in which we measure the IQs of 10 people. The dot marks the location of the sample mean, and the line shows the 95% confidence interval. In total 47 of the 50 confidence intervals do contain the true mean (i.e., 100), but the three intervals marked with asterisks do not. The lower graph (panel B) shows a similar simulation, but this time we simulate replications of an experiment that measures the IQs of 25 people. The sample means are generally closer to the population mean, so the CIs can be narrower in order for ~95% of them to contain the true mean and 5% of them to not.*
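The narrowing shown in the figure follows directly from how the standard error of a mean depends on sample size. The sketch below assumes a population standard deviation of 15 for IQ scores; the caption only gives the mean, so that value is an assumption, and the sample size of 100 is added just for comparison.

```r
population_sd <- 15                 # assumed SD of IQ scores
sample_sizes  <- c(10, 25, 100)     # 10 and 25 match the two panels

# Standard error of the mean shrinks with the square root of the sample size
se <- population_sd / sqrt(sample_sizes)

# Half-width and full width of the 95% confidence interval at each sample size
margin <- 1.96 * se
data.frame(n = sample_sizes, se = se, margin = margin, ci_width = 2 * margin)
```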

To calculate the confidence interval of a model coefficient estimate, we can use the Std. Error value in the ```summary()``` output:

In [None]:
summary(height_model)

We can use this to calculate the 95% confidence interval of each parameter. For $b_0$, it would be:

$$b_0 - (1.96 * SE) \leq \beta_0 \leq b_0 + (1.96 * SE)$$

$$2.8432 - (1.96 * 12.6549) \leq \beta_0 \leq 2.8432 + (1.96 * 12.6549)$$

$$b_0 = 2.84 \ [-21.96, 27.65]$$

For $b_1$ the 95% confidence interval is:

$$b_1 - (1.96 * SE) \leq \beta_1 \leq b_1 + (1.96 * SE)$$

$$0.8715 - (1.96 * 0.1914) \leq \beta_1 \leq 0.8715 + (1.96 * 0.1914)$$

$$b_1 = 0.87 \ [0.50, 1.25]$$
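The same arithmetic can be done in R, using the estimates and standard errors copied from the formulas above:

```r
# Estimates and standard errors as they appear in the calculations above
b0 <- 2.8432; se_b0 <- 12.6549
b1 <- 0.8715; se_b1 <- 0.1914

# c(-1, 1) gives the lower and upper bound in one step
ci_b0 <- b0 + c(-1, 1) * 1.96 * se_b0
ci_b1 <- b1 + c(-1, 1) * 1.96 * se_b1

round(ci_b0, 2)   # -21.96  27.65
round(ci_b1, 2)   #   0.50   1.25
```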

However, in practice it's better to use the function ```confint()``` directly rather than multiplying out 1.96\*SE by hand, since the SE shown in the ```summary()``` table is rounded and we don't want to propagate rounding errors. (```confint()``` also uses a multiplier that is adjusted for the sample size rather than exactly 1.96, so its bounds can differ slightly from the hand calculation.) To use ```confint()```, pass it a model object, the name of the coefficient you want the confidence interval for, and the level of confidence you want.  

In [None]:
#For the intercept:
confint(height_model, "(Intercept)", level=0.95)

#For the Height coefficient: 
confint(height_model, "Height", level=0.95)

One thing you should notice is that the confidence interval for $b_1$ in our data spans only positive numbers and doesn't intersect with 0. This means that, even if the true effect $\beta_1$ is not the same as our estimate, we are quite confident that it is at least a positive number - that is, as height values go up, estimates of thumb length will also go up. In general, we are confident that there is a positive effect of Height. Hold onto this thought for next chapter.
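Returning to ```confint()``` for a moment: leaving out the coefficient name returns the intervals for every coefficient at once, and the result can be checked against zero in code. The sketch below uses simulated data (`sim_data` and `sim_model` are hypothetical, not ```studentdata```). It also rebuilds the interval by hand to show where the numbers come from: for ```lm()``` models, ```confint()``` takes its multiplier from the t distribution, which depends on sample size and is a little larger than 1.96.

```r
# Simulated stand-in data (hypothetical; not the real studentdata)
set.seed(494)
sim_data <- data.frame(Height = rnorm(40, mean = 66, sd = 4))
sim_data$Thumb <- 3 + 0.9 * sim_data$Height + rnorm(40, sd = 9)
sim_model <- lm(Thumb ~ Height, data = sim_data)

# Intervals for all coefficients at once (a matrix with one row per coefficient)
ci_all <- confint(sim_model, level = 0.95)
ci_all

# Does the Height interval exclude 0?
excludes_zero <- ci_all["Height", 1] > 0 | ci_all["Height", 2] < 0
excludes_zero

# Rebuilding the Height interval by hand from the unrounded estimate and SE
coef_table <- coef(summary(sim_model))
b1 <- coef_table["Height", "Estimate"]
se <- coef_table["Height", "Std. Error"]
multiplier <- qt(0.975, df = df.residual(sim_model))
manual_ci  <- b1 + c(-1, 1) * multiplier * se
manual_ci
```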

### Pitfalls of interpreting confidence intervals
The hardest thing about confidence intervals is remembering specifically what they mean. There are two major errors people often make when interpreting them. 

First, when you see something like 95% CI [0.50, 1.25], sometimes the first instinct is to treat it as *the* interval in which future sample estimates will be found. This is incorrect. We have to remember that the numbers in this specific interval are based on one sample estimate. If we had a different sample instead, we would compute a different confidence interval. Thus CIs are themselves estimates that vary from sample to sample. You can imagine your first sample of data as being pretty unusual by chance (e.g., when measuring the height of students but somehow collecting data only from the basketball team). Estimates from other samples, which will probably be closer to the true mean of height, won't often fall in the CI calculated for this wonky sample. The interval does not say that 95% of sample estimates will land inside it; rather, it is the range of population parameters with a 95% chance of producing *this* estimate. 

The second common error is to say that there is a 95% *probability* that the population parameter is in this interval. This is also incorrect. The correct statement is that we are 95% *confident* that the population parameter is in this interval. This is an incredibly pedantic distinction, but it matters because of the way probability is calculated. The first sentence implies that, given a particular value of b, a &beta; parameter is 95% likely to be in this range. In other words, the probability of &beta;, given b. In mathematics we'd write this as:

$$P(\beta | b)$$

The "|" is the symbol for saying "given". This is called a **conditional probability** - the probability of &beta; is *conditional* on the state of b. 

But the correct meaning of a confidence interval is that there's a 95% probability of getting this b estimate given that the particular value of &beta; is in this range. Written mathematically: 

$$P(b | \beta)$$

We will learn more about conditional probability in chapter 20. The thing to know right now is that these probabilities are different, and come out to different values. Thus, we should be careful about which we are talking about. The *wrong* statement is "there is a 95% probability that &beta; is in this range". The correct statement is "&beta;s in this range have a 95% probability of producing our statistic estimate." To get around this issue, we can instead be more general and just say "we are 95% *confident* that &beta; is in this range." This way you don't have to deal with probabilities at all, and stay away from the trickiness of conditional probabilities. 

## 15.7 Visualizing uncertainty

As we explored extensively in chapter 7, visualizing data is an important step for communicating the results of all your analyses. For communicating model estimates, we've learned how to plot boxplots, simple regressions, and interactions.

In [None]:
#plotting group differences, showing data points
gf_boxplot(Thumb ~ Sex, data = studentdata, color = ~ Sex) %>%
    gf_jitter(., width = 0.3, alpha = 0.5)

#plotting a regression, showing data points
gf_point(Thumb ~ Height, data = studentdata) %>% 
    gf_lm()

#plotting an interaction
interactionmodel <- lm(Thumb ~ Sex*Height, data = studentdata)
interact_plot(interactionmodel, pred = Height, modx = Sex)

Each of these plots communicates model estimates we made based on a sample of data. In the boxplot, we can see the typical Thumb length (the median line) for each sex group. In the scatter plot, we see the intercept and slope of the best fitting regression line for explaining Thumb length based on Height. In the interaction plot, we see this Thumb ~ Height regression line for each category of sex. In each of these plots, we are visualizing the coefficient estimate of the predictor in our model - either as the difference in boxplot locations, or the slope of the regression line. 

However, we've also learned how our estimates might not be the true value of a population parameter. There's some uncertainty around what that actual value is, and our estimate might not be correct. Thus, to follow the data visualization principle of showing the variation, it's important to include that uncertainty in our visualizations. In each of the plot types above, there is a way to add the confidence intervals of an estimate to the plot. 

For the boxplot, there actually is already some uncertainty communicated via the size of the box. Recall that this is the interquartile range (IQR) of the data in each category. But since confidence intervals tell us about the likely population parameters that produced this data, not just about the variation of the data itself, we can make a version of this plot that shows the range of confidence intervals instead. 

Using ```ggformula```, we're going to do this with a jitter plot divided up by group that adds on summary information using the ```gf_summary()``` function. 

In [None]:
#mean_cl_boot comes from ggplot2 but needs the Hmisc package installed to run
gf_jitter(Thumb ~ Sex, data = studentdata, color = ~ Sex, width = 0.1, alpha = 0.5) %>%
  gf_summary(fun.data = "mean_cl_boot", size = 2, color = ~ Sex)

This summary function adds information that is generated by a background function called "mean_cl_boot", which calculates the confidence limits of each group mean via bootstrapping. It essentially adds a dot for the mean estimate of each group, and a line that shows the width of the 95% confidence interval. By changing the ```size```, ```linewidth```, and ```color``` arguments, you can control what these lines look like:

In [None]:
gf_jitter(Thumb ~ Sex, data = studentdata, color = ~ Sex, width = 0.1, alpha = 0.5) %>%
  gf_summary(fun.data = "mean_cl_boot", size = 1, linewidth = 1.2, color = "black")

These parts of a plot are called **error bars**. Compared to generating the ```summary()``` and ```confint()``` outputs for a model and interpreting the numbers, adding these to a plot is a very quick way to communicate model estimates and the uncertainty around them. 
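To get a feel for what a bootstrapped error bar represents, here is a minimal sketch of a percentile bootstrap for one group's mean. It is written in the spirit of "mean_cl_boot" but is not its exact implementation, and the thumb lengths are simulated, hypothetical values.

```r
set.seed(550)
thumb_one_group <- rnorm(60, mean = 58, sd = 8)   # hypothetical thumb lengths (mm)

# Resample the data with replacement many times, saving the mean each time
boot_means <- replicate(1000, {
  resample <- sample(thumb_one_group, size = length(thumb_one_group), replace = TRUE)
  mean(resample)
})

# The middle 95% of the bootstrapped means gives the ends of the error bar
ci_boot <- quantile(boot_means, probs = c(0.025, 0.975))
c(mean = mean(thumb_one_group), lower = ci_boot[[1]], upper = ci_boot[[2]])
```

The dot in the plot corresponds to `mean`, and the line runs from `lower` to `upper`.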

In a regression plot, the estimate of an effect is the slope of the regression line. To communicate the range of slope parameters that are likely to produce our specific estimate, we can add a **confidence band** to the ```gf_lm()``` layer:

In [None]:
gf_point(Thumb ~ Height, data = studentdata) %>% 
    gf_lm(interval = "confidence")

We could tilt the regression line like a seesaw within this band, and that would reflect the range of the 95% confidence interval around the slope estimate. 
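The edges of a confidence band can also be computed as numbers with ```predict()```. The sketch below uses simulated data (`sim_data` and `sim_model` are hypothetical stand-ins) and asks for the band at a handful of heights. Note that the band reflects uncertainty about the whole line, intercept and slope together.

```r
# Simulated stand-in data (hypothetical; not the real studentdata)
set.seed(558)
sim_data <- data.frame(Height = rnorm(100, mean = 66, sd = 4))
sim_data$Thumb <- 3 + 0.9 * sim_data$Height + rnorm(100, sd = 9)
sim_model <- lm(Thumb ~ Height, data = sim_data)

# Heights at which to evaluate the band
new_heights <- data.frame(Height = c(58, 62, 66, 70, 74))

# "fit" is the regression line; "lwr" and "upr" are the edges of the 95% band
band <- predict(sim_model, newdata = new_heights,
                interval = "confidence", level = 0.95)
band_width <- band[, "upr"] - band[, "lwr"]
cbind(new_heights, band, width = band_width)
```

The band is narrowest near the average height and fans out toward the extremes, which is the seesaw shape visible in the plot.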

The ```interact_plot()``` function also allows us to add confidence bands using the pair of arguments ```interval = TRUE``` and ```int.width = 0.95```: 

In [None]:
interact_plot(interactionmodel, pred = Height, modx = Sex,
             interval = TRUE, int.width = 0.95)

Lastly, we've already explored in chapter 13 how it is difficult to visualize relationships between more than two variables. With three we can change the color of points or show scatter plots between one variable and the outcome residuals after regressing out all other predictors, but more than three variables gets even more challenging. In the case where there are several predictors, researchers often make a simpler plot called a **forest plot.** Rather than showing the raw data, a forest plot depicts the b estimate for every predictor in a model, with error bars around each one.

In R there are several ways to do this, and they all take a bit of work to set up. Essentially, you fit a multivariable model, save the b estimates and confidence intervals for each predictor to a separate data frame, and then make a plot out of that. We'll use the ```forestplot``` package to make one here. Read each line of this code slowly and make sure you understand what it is doing. 

In [None]:
install.packages("forestplot")
library(forestplot)

multimodel <- lm(Thumb ~ Height + Siblings + Pinkie + Middle, data = studentdata)
bs <- multimodel$coefficients[2:5] #saving bs of all predictors
predictors <- c("Height", "Siblings", "Pinkie length", "Middle finger length") #vector of predictor names
CI <- confint(multimodel)
upper_ci <- CI[2:5,2] #upper CI bound for each predictor (column 2 of the confint output)
lower_ci <- CI[2:5,1] #lower CI bound for each predictor (column 1 of the confint output)

forest_data <- data.frame(mean = bs,             #the point to visualize on plot
                         upper = upper_ci,       #the value of the top error bar
                         lower = lower_ci,       #the value of the lower error bar
                         predictors = predictors)#names of predictors to put on plot

forestplot(forest_data, labeltext = predictors, boxsize = 0.08)

This plot ultimately gives you a "forest" of b estimates and the confidence interval around them, all positioned relative to each other. This way you can easily see how big the b estimates are, how wide their confidence intervals are, and which confidence intervals overlap with 0 or not. 
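The table behind a forest plot can also be built by coefficient name instead of by position, which avoids having to count rows like `2:5` when the model changes. This is a sketch on simulated data; `sim_data`, `sim_multi`, and `forest_table` are hypothetical names, and the extra `excludes_zero` column is just one convenient way to flag intervals that don't overlap 0.

```r
# Simulated stand-in data (hypothetical; not the real studentdata)
set.seed(588)
n <- 120
sim_data <- data.frame(Height   = rnorm(n, mean = 66, sd = 4),
                       Siblings = rpois(n, lambda = 1.5),
                       Pinkie   = rnorm(n, mean = 60, sd = 6))
sim_data$Thumb <- 5 + 0.5 * sim_data$Height + 0.4 * sim_data$Pinkie + rnorm(n, sd = 6)

sim_multi <- lm(Thumb ~ Height + Siblings + Pinkie, data = sim_data)
ci <- confint(sim_multi)

# Every coefficient except the intercept, selected by name
keep <- setdiff(names(coef(sim_multi)), "(Intercept)")

forest_table <- data.frame(predictor = keep,
                           mean      = unname(coef(sim_multi)[keep]),
                           lower     = unname(ci[keep, 1]),
                           upper     = unname(ci[keep, 2]))

# Flag the predictors whose confidence interval does not overlap 0
forest_table$excludes_zero <- forest_table$lower > 0 | forest_table$upper < 0
forest_table
```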

## Chapter summary

After reading this chapter, you should be able to:

- Define a sampling distribution
- Describe the relationship between sample size and the shape/center of the sampling distribution
- Define the standard error
- Calculate a 95% confidence interval
- Interpret the meaning of a confidence interval of an estimate
- Add confidence intervals to model visualizations

## New concepts
- **sampling error** - The difference between a sample estimate and the population parameter that is due to randomness in the sampling procedure. 
- **sampling distribution** - The distribution of sample estimates that would be generated via repeated samples drawn from a population. 
- **hypothesis testing** - The process in statistics of examining whether a sample of data is likely under some hypothesized version of the population.
- **bootstrapping** - An approach to data simulation that draws random samples with replacement from a dataset distribution. Doing this many times approximates the sampling distribution in the true population. 
- **standard error** - The standard deviation of a sampling distribution; the expected difference between a sample estimate and the true population parameter for any one study. 
- **Central Limit Theorem** - A theorem in probability theory that states, under appropriate conditions, the distribution of sample means will approach a normal distribution regardless of the underlying data distribution; the variance of the sampling distribution will get smaller as sample size increases; and the mean of the sampling distribution matches the mean of the population.  
- **unbiased estimator** - An estimate that will have a sampling distribution mean that matches the population mean; unbiased estimates from one sample are equally likely to be above or below the true population parameter. 
- **biased estimator** - An estimate that will have a sampling distribution mean that differs from the population mean; biased estimates from one sample systematically differ from the true population parameter in a particular direction. 
- **confidence interval** - The range of hypothetical population parameters that are likely to generate a certain sample estimate. The 95% confidence interval is often calculated, but other probabilities may also be used. 
- **forest plot** - A type of data plot that visualizes the results of a multivariable model with confidence intervals included around each coefficient estimate. 

## New R functionality
- [qnorm()](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/Normal)
- [summary()](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/summary)
- [confint()](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/confint)
- [ggformula::gf_summary()](https://www.mosaic-web.org/ggformula/reference/gf_linerange.html)
- [forestplot::forestplot()](https://www.rdocumentation.org/packages/forestplot/versions/3.1.3/topics/forestplot)

[Next: Chapter 16 - Significance Testing](https://colab.research.google.com/github/smburns47/Psyc158/blob/main/chapter-16.ipynb)