<a href="https://colab.research.google.com/github/vectrlab/apex-stats-modules/blob/main/Central_Limit_Theorem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# APEX STATS Module: Central Limit Theorem
Created by David Schuster, Valerie Carr, Morris Jones, and Andy Qui Le

Licensed under CC BY-NC-SA



<img src="https://live.staticflickr.com/65535/49849273307_51fd13f9f4_b.jpg" width="300"/>

Image credit: ["Normal Distribution"](https://live.staticflickr.com/65535/49849273307_51fd13f9f4_b.jpg) by Johnsyweb is licensed under CC BY-NC-ND 2.0

## I. Intro and Learning Objectives
So far, we have encountered population distributions and sample distributions. Now, we introduce a third kind of distribution, one that is created when we take repeated random samples. This third kind of distribution is called a **sampling distribution**. Yes, sampling distribution sounds very close to sample distribution. They sound so similar, people often confuse them. But, they are different kinds of distributions. Sampling distributions have some interesting properties that are important to understanding inferential statistics. 

These exercises map onto several learning objectives for the C-ID descriptor for [Introduction to Statistics](https://c-id.net/descriptors/final/show/365). Upon successful completion of the course, you will be able to:  

(1) Interpret data displayed in tables and graphically  

(3) Calculate measures of central tendency and variation for a given data set  

(7) Distinguish the difference between sample and population distributions and analyze the role played by the Central Limit Theorem 

---

## II. Background Reading

Read through this section before starting this module and consider the questions that follow.

In order to understand sampling distributions, we will first take a moment to discuss some of the underlying concepts: random sampling, sample size, population size, and sampling error. 

### Random Sampling

Sampling is the process of selecting a group representative of the population. The best way to do this is to use **random sampling with replacement**. To take a random sample with replacement, the following conditions must be satisfied:

1. All units need to be available for selection. Usually, this means that you would need a list of all units in the population. The list of units is called the **sampling frame**.
2. The selection is unbiased; each unit has an equal chance of being selected. This is the random part.
3. After a unit is selected, it is returned to the list. This is the replacement part; you are replacing the unit you selected by putting it back. This preserves the prior rule; every time selection is done, every unit has an equal chance of being selected.
4. The random sample is composed of all the units that were selected in step 2. Therefore, not only does the process of selecting units need to be unbiased, the representation of units in the sample also needs to be unbiased.
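
The four conditions above can be sketched in a few lines of Python. This is a minimal illustration and not part of the activity: the sampling frame `frame` and its unit names are made up, and it assumes NumPy's `Generator.choice` with `replace=True` as the selection mechanism.

```python
import numpy as np

# Hypothetical sampling frame: a list of every unit in a small made-up population
frame = ["unit_" + str(i) for i in range(1, 11)]

rng = np.random.default_rng(seed=1)  # seeded only so the example is repeatable

# replace=True returns each selected unit to the frame before the next draw,
# so every unit has the same chance (1 in 10) on every draw
sample = rng.choice(frame, size=4, replace=True)

print(list(sample))
```

Because units are returned to the frame after each draw, the same unit can appear more than once in `sample`.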

### Sample Size and Population Size
The **size** of a distribution is the number of units it contains. In mathematics, sample size is typically represented as $n$ and population size is typically represented as $N$. In APA-style writing, however, sample size is represented as $N$, and $n$ is used to represent a subsample (a part of a sample, such as the units in one condition). There does not seem to be a recommended symbol for population size in APA style. Why not? Many times, the population size is unknown.

### Sampling Error
**Sampling error** is a mismatch between a sample statistic and a population parameter. Taking a single random sample does not guarantee perfect representation of a population. In this activity, we will create a population distribution, take a single random sample, and then compare the population and sample distributions. Let's start by focusing on creating and exploring a population distribution.
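
As a minimal sketch of this definition, the block below computes the sampling error of the mean for one random sample. The population here is made up (200 simulated scores), so the specific numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # seeded only so the example is repeatable

# Made-up population of 200 scores (an assumption for illustration)
population = rng.normal(loc=70, scale=8, size=200)

# One random sample of 5 scores, drawn with replacement
sample = rng.choice(population, size=5, replace=True)

# Sampling error for the mean: sample statistic minus population parameter
sampling_error = np.mean(sample) - np.mean(population)

print("population mean:", round(np.mean(population), 2))
print("sample mean:", round(np.mean(sample), 2))
print("sampling error:", round(sampling_error, 2))
```

A positive value means the sample mean overestimated the population mean, and a negative value means it underestimated it.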


### Discussion Questions

1. Imagine you wanted to randomly sample (with replacement) the mood of fans of the winning team at a stadium. Describe one way you could create a sampling frame, a list of all members of a population.
2. Describe how you would implement random selection.
3. What does it mean to do the selection with replacement?
4. Next, imagine you wanted to randomly sample the attitudes of commuters in your city. What are some challenges in creating a sampling frame?
5. Imagine you had access to auto registration records for your city and participation from companies who provided lists of commuting employees. How might a sampling frame constructed this way be biased?
6. Is there a conflict between informed consent and random sampling?
7. Is it possible to determine the size of the following populations? If so, how? If not, why not?
    * Students enrolled in your school this semester
    * People staying in a hotel
    * Passengers on a flight
    * Adults in the United States
    * Dogs living in your city
8. APA is a format for statistical results used in Psychology and some other disciplines. If you were reading statistics in APA style, what would this mean? $N$ = 50, $n_1$ = 22, $n_2$ = 28

## III. Activity

This activity will use real data collected by the World Bank. In our data file, we have the 2018 life expectancy at birth, in years, for over 100 countries. Imagine that we want to understand the data for the included countries and only the included countries. If the included countries are the only ones we care about, then we can treat these data as a population. For the purpose of our example, we will treat these data as a population. However, if you wanted to define all countries as your population, the data file is a subset of all countries. Although it can be challenging that these data could either be a population or a sample, it illustrates that populations are defined by researchers. The population is the group of interest to the researcher.

Set up this activity by running the code block below. This will import the data. Reminder: to run the cell, you can either use `Shift` + `Enter`, or you can hit the play button.

In [None]:
#Setup Example Data
import pandas as pd # import library
data = pd.read_csv("https://raw.githubusercontent.com/vectrlab/apex-python-datasets/main/healthnutritionandpopulation/example.csv") # read the datafile
data # display the data

### 1. Explore the Population Distribution

When you imported the data in the block above, you should have seen a preview of it in this notebook. You might have noticed that country names are in a variable called `data["Y"]`, and the life expectancy appears in variables `data["X"]` and `data["X1"]` through `data["X5"]`. These correspond to the following variables:

- `data["Y"]`: Country name
- `data["X"]`: Life expectancy at birth in 2018, in years
- `data["X1"]`: Life expectancy at birth in 2008, in years (ten years earlier)
- `data["X2"]`: Life expectancy at birth in 1998, in years
- `data["X3"]`: Life expectancy at birth in 1988, in years
- `data["X4"]`: Life expectancy at birth in 1978, in years
- `data["X5"]`: Life expectancy at birth in 1968, in years

Since we are focusing on life expectancy at birth in 2018, the only variable we will need for now is `X`. This is our population distribution.

In the next few steps, we will find the population size, compute descriptive statistics for the population (which are called **population parameters**), and visualize the population.

To start, preview `data["X"]` by copying and pasting it in the box below. The first value displayed should be for Afghanistan, and it should be 64.833.


In [None]:
# enter data["X"] on its own line in this box, and run it to see a list of 2018 life expectancies



### 2. Population size

Next, how many countries are included in our population? This is the population size. Python calls this the length of the variable, so we use a function called `len()`.

In [None]:
#@title Population size
len(data["X"])

### 3. Population parameters

For this exercise we want to determine the mean and standard deviation of the population. We will need these values for the next step. Let's start with the mean.

In [None]:
#@title Population mean
import numpy as np # import library
np.mean(data["X"]) # display the mean

Next, let's determine the population standard deviation:

In [None]:
#@title Population standard deviation
import numpy as np # import library
np.std(data["X"]) # display the population standard deviation

You're doing great! These are the descriptive statistics for our **population distribution**.

### 4. Visualize the population

Run the code below to generate a histogram of the population.

In [None]:
#@title Generate a histogram with automatic binning and custom color
# valid color names are listed at https://matplotlib.org/stable/gallery/color/named_colors.html
import seaborn as sns # import library
custom_color = input("Type the name of a color : ") # get user input
sns.histplot(data["X"], color = custom_color, binwidth = 1) # display the histogram

**⍰ Consider the following questions:**

- Where is the population mean located on the histogram?
- How would you explain the meaning of this population mean to someone?
- What is the shape of the population distribution? Is it normal?
- How does the standard deviation relate to the histogram?
- What color do you prefer for your histogram?

----
### 5. Explore a sample distribution

Click the play button below to create a single sample of 5 values from the population distribution. We ask Python to pick five values at random from the population distribution. This creates a new **sample distribution** called `one_sample`.

In [None]:
#@title Generate a random sample
import random # import libraries
import numpy as np
n = 5 # sample size of 5
one_sample = np.random.choice(data["X"], n) # generate a single sample
one_sample # show the sample distribution

Use the `len` function to determine how many values are in your sample, which is named `one_sample`. **Hint**: Your code will look identical to the population size code, except you'll use `one_sample` as input, instead of `data["X"]`.

Based on the code above, you already know what the value is going to be. Make a prediction before you run the code.

In [None]:
#@title Sample size
# Fix this code! len(data["X"]) 



Next, do the same modification with the `mean()` function to determine the mean of your sample. Before you do, make a prediction of the value.

In [None]:
#@title Sample mean
import numpy as np # import library
# Fix this code! np.mean(data["X"]) # display the mean



And we will also use the function introduced earlier to find the standard deviation. However, we are finding the standard deviation of a sample distribution, so we need to use the sample standard deviation formula. Python lets us do that by adding `ddof=1`. You can run this block as-is.

In [None]:
#@title Sample standard deviation
import numpy as np # import library
np.std(one_sample, ddof=1) # display the sample standard deviation; ddof is delta degrees of freedom and N - ddof is used in the variance calculation
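
To see what `ddof=1` changes, the sketch below computes both versions of the standard deviation by hand and compares them with NumPy. The five values in `scores` are made up for illustration and are not taken from the data file.

```python
import numpy as np

scores = np.array([61.0, 70.5, 74.2, 79.9, 82.4])  # made-up sample of 5 values
n = len(scores)
squared_deviations = (scores - scores.mean()) ** 2

# Population formula: divide the sum of squared deviations by n (ddof=0, NumPy's default)
sd_population = np.sqrt(squared_deviations.sum() / n)

# Sample formula: divide by n - 1 instead (ddof=1)
sd_sample = np.sqrt(squared_deviations.sum() / (n - 1))

print(sd_population, np.std(scores))      # these two values match
print(sd_sample, np.std(scores, ddof=1))  # these two values match
```

Dividing by a smaller number makes the sample standard deviation slightly larger than the population version, and the gap is most noticeable for small samples like this one.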

Finally, run this block to create a histogram of your sample distribution. This time, you can modify the code yourself to change the color.

In [None]:
#@title Generate a histogram with automatic binning and custom color
import seaborn as sns # import library
custom_color = "blue" # set bar color
sns.histplot(one_sample, color = custom_color, binwidth = 1) # display the histogram

**⍰ Consider the following questions:**

- What was the difference between the sample mean and population mean? 

- If random sampling was always perfect, the mean difference would be zero. Even when these values are randomly generated, we can be fairly confident the difference will not be zero. Why is your sample mean not exactly equal to your population mean?

- How do the shapes of the two distributions compare?

As you can see, a single sample was not perfect. But, sampling error is not all-or-nothing. Sometimes we can observe larger or smaller sampling error. In the next section, we will see how this works with a larger sample size.


### 6. Explore a larger sample distribution
First, we’ll change our sample size from 5 to 900.

In [None]:
#@title Generate a random sample
import random # import libraries
import numpy as np 
n = 900 # sample size of 900
big_one_sample = np.random.choice(data["X"], n) # generate a single sample
big_one_sample # show the sample distribution

You already know how to do the descriptives on this. To save you copy-and-paste time, this code is ready to run:

In [None]:
#@title Sample size
len(big_one_sample) 

In [None]:
#@title Sample mean
import numpy as np # import library
np.mean(big_one_sample) # display the mean

In [None]:
#@title Sample standard deviation
import numpy as np # import library
np.std(big_one_sample, ddof=1) # display the sample standard deviation; ddof is delta degrees of freedom and N - ddof is used in the variance calculation

In [None]:
#@title Generate a histogram with automatic binning and custom color
import seaborn as sns # import library
custom_color = "blue" # set bar color
sns.histplot(big_one_sample, color = custom_color, binwidth = 1) # display the histogram

**⍰ Consider the following questions:**

- What was the difference between the sample mean and population mean? Which had a bigger difference, `big_one_sample` or `one_sample`?

- Looking across all the descriptives, which of the two samples, `big_one_sample` or `one_sample`, was more representative of the population? Why?

These exercises have shown us that samples have variability because populations have variability; the sample mean ultimately depends on which values get selected to be in the sample. The larger the sample size, the smaller the sampling error. 

Although we have not seen it yet, the greater the variability in the population, the larger the sampling error. In all, just two things affect sampling error: population standard deviation and sample size.
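
One way to check both claims is a small simulation. The sketch below is an illustration under stated assumptions: the two populations are made-up normal distributions, and `typical_error` is a hypothetical helper that averages the absolute sampling error over many repeated samples.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # seeded only so the example is repeatable

def typical_error(population, n, repeats=2000):
    """Average absolute gap between the sample mean and the population mean."""
    samples = rng.choice(population, size=(repeats, n), replace=True)
    return np.mean(np.abs(samples.mean(axis=1) - population.mean()))

# Two made-up populations with the same mean but different variability
narrow = rng.normal(loc=70, scale=4, size=1000)
wide = rng.normal(loc=70, scale=12, size=1000)

# Columns: sample size, typical error for the narrow population, then the wide one
for n in (5, 50, 500):
    print(n, round(typical_error(narrow, n), 2), round(typical_error(wide, n), 2))
```

Reading down a column shows the effect of sample size, and reading across a row shows the effect of population variability.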

### Discussion Questions
Review the relationship between sampling error, sample size, population size, and population variability:

* How does sample size relate to sampling error?
* How does population size relate to sampling error?
* How does population variability relate to sampling error?

Next, we will expand this discussion to consider what would happen if we took repeated samples. That is, we will take one sample, then take another and another. Our units will be sample means instead of scores.

----
### 7. Explore the Sampling Distribution
A **sampling distribution** is a distribution of sample means. 

If you take repeated samples, you can plot the mean of each sample. A collection of sample means forms a sampling distribution of the mean. Sampling distributions are made of many samples.

For this exercise, we'll create a sampling distribution based on a sample size of 5. In total, we'll take 2000 samples. Then, we will:
1. Determine the mean and standard deviation of the sampling distribution
2. Create a histogram of the sampling distribution
3. Find the difference between the standard deviation of the sampling distribution and the calculated value of standard error

### 8. Generate the Sampling Distribution
Click the play button below to generate the sampling distribution. Specifically, this will sample from the population 2000 times, randomly choose 5 values each time, calculate the mean of the sample, and add it to the sampling distribution. 

In [None]:
#@title Generate a sampling distribution
import numpy as np # import library
num_samples = 2000 # how many samples to include in the sampling distribution
n = 5 # sample size
# each pass draws n values at random (with replacement) from the population and finds their mean
sampling = [np.mean(np.random.choice(data["X"], n)) for _ in range(num_samples)] # assemble the sampling distribution by finding means of repeated samples
sampling # show the sampling distribution

### 9. Explore the sampling distribution

Use NumPy's `mean` function to determine the mean of `sampling`:

In [None]:
#@title Mean of the sampling distribution
# Hint: The sampling distribution is called sampling



Next, use NumPy's `std` function to determine the standard deviation of `sampling`. Note that we need to use the population standard deviation in this case. This code is ready to run:

In [None]:
#@title Sampling distribution standard deviation
import numpy as np # import library
np.std(sampling) # display the population standard deviation

Finally, create a histogram of the sampling distribution using Seaborn's `histplot` function and `sampling` as your input.

In [None]:
#@title Generate a histogram with automatic binning and custom color
import seaborn as sns # import library
custom_color = "blue" # set bar color
sns.histplot(sampling, color = custom_color, binwidth = 1) # display the histogram

**⍰ Consider the following questions:**

- What was the difference between the **sampling** distribution mean and the **population** mean? Which mean was closest overall, `big_one_sample`, `one_sample`, or `sampling`?

- What is the shape of the sampling distribution? How does the shape of your sampling distribution compare to your classmates'?

### 10. Section Review

Some interesting things happened in this last exercise. Before we continue, make sure you understand the steps that you took to construct a sampling distribution:

1. We started with a population distribution. The shape of the population distribution is not important (did you notice that our population was not normally distributed?). While we're at it, the population _size_ is also not important. This example would have worked with a population of 50 or a population of 300,000.
2. We took a random sample from the population, with replacement. We found the mean of our random sample. 
3. We repeated Step 2 many times to create a list of 2000 sample means stored in a variable called `sampling`. This is our sampling distribution. **Sampling distributions are made of sample means**. Put another way, the units of a sampling distribution are sample means.

What did you notice when you looked at the histogram of the sampling distribution? It should have been close to normally distributed! We started with a non-normal population and ended up with an approximately normal sampling distribution. This is one of the outcomes specified by the central limit theorem.

What did you notice about the difference between the mean of the sampling distribution and the mean of the population? It should have been small. It should have been the smallest value of all the examples in this section. This is another outcome specified by the central limit theorem. As we collect more and more sample means, the mean of the sampling distribution will approach the mean of the population.

Why is this useful? The tendency of repeated random samples to closely approximate the population they are drawn from says that even if a sample is not perfect (it has sampling error), we can use random samples to estimate population parameters. We also can predict how imperfect our samples are; the larger the sample and the lower the population standard deviation, the better we can rely on our single sample. This means we should be more skeptical when we see smaller sample sizes, samples of very diverse populations, or both.

It has taken us a lot of steps and several examples to get here. We are now ready for a more formal definition of the central limit theorem.

----
### 11. Define the Central Limit Theorem
The central limit theorem says that sampling distributions have special properties.

The CLT says that: (1) assuming two things, (2) if you do a series of steps, then (3) you will obtain an outcome. The outcome has implications for us.

- The two **assumptions** are a random sample and a variable that is continuous.
- The **steps** are to take repeated random samples of the population and calculate the mean of each of those samples. Construct a sampling distribution from the sample means.
- The **outcome** is that the histogram of the sample means is approximately normally distributed. We call this the sampling distribution of the mean. Under the CLT, it will be approximately normal as long as we have a sufficiently large sample size.
- This frequency distribution, like all frequency distributions, has a standard deviation called the standard error of the mean.
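
The steps and outcome above can be sketched with a deliberately skewed, made-up population. This is an illustration under assumptions, not the activity's data: the population is drawn from an exponential distribution, and `skewness` and `sampling_distribution` are hypothetical helpers (a skewness near 0 indicates a symmetric shape).

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # seeded only so the example is repeatable

def skewness(values):
    """Standardized third moment; roughly 0 for a symmetric distribution."""
    values = np.asarray(values)
    z = (values - values.mean()) / values.std()
    return np.mean(z ** 3)

# Made-up, strongly right-skewed population
population = rng.exponential(scale=10, size=10_000)

def sampling_distribution(n, num_samples=5000):
    """Means of repeated random samples of size n, drawn with replacement."""
    samples = rng.choice(population, size=(num_samples, n), replace=True)
    return samples.mean(axis=1)

print("population skewness:", round(skewness(population), 2))
for n in (5, 50):
    means = sampling_distribution(n)
    print("n =", n, "skewness of sample means:", round(skewness(means), 2))
```

The skewness of the sample means shrinks toward 0 as the sample size grows, which is the sense in which the sampling distribution becomes approximately normal.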

----
### 12. Explain Standard Error

Sampling distributions have a mean and standard deviation, just like any other distribution we have seen. However, the standard deviation of a sampling distribution has a special name: the standard error.

Standard error is calculated using this formula: $\sigma_{\bar{X}}=\frac{\sigma}{\sqrt{n}}$

In words: divide the standard deviation of the population by the square root of the sample size.
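
As a worked example with made-up numbers, the block below applies the formula to a hypothetical population standard deviation of 8 at three sample sizes. The value of `sigma` is an assumption for illustration and does not come from the data file.

```python
import numpy as np

sigma = 8.0  # hypothetical population standard deviation

for n in (5, 30, 900):
    se = sigma / np.sqrt(n)  # standard error = sigma divided by the square root of n
    print("n =", n, "standard error =", round(se, 3))
```

Because $n$ sits under a square root, the sample size has to grow a lot to shrink the standard error a little: quadrupling $n$ only halves it.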

Because standard error is a fancy term for standard deviation of the sampling distribution, we can determine standard error two different ways. We can compute the standard error using the formula, or we can calculate the standard deviation of our sampling distribution (you did this previously).

First, so we have it handy, copy and paste the code to find the **standard deviation of the sampling distribution** from above:

In [None]:
#@title Sampling distribution standard deviation
import numpy as np # import library
# Add your code here to display the population standard deviation!



The second way to calculate the standard error is using the formula:

In [None]:
#@title Calculate standard error
import numpy as np # import library
n = 5 # sample size
se = np.std(data["X"]) / np.sqrt(n)
se

You should find that the two calculations were close. Why do we need the standard error formula if we could just find the standard deviation of the sampling distribution? Well, we typically do not work with the sampling distribution directly. We simply understand that it exists. Creating a sampling distribution requires the population distribution to be available to us, and researchers often do not have population data available. Further, the central limit theorem specifies taking an _unlimited_ number of samples in order for the sampling distribution mean to equal the population mean. A common rule of thumb also calls for a sample size of $n\ge30$ when the population is not normally distributed. We did not follow that rule (we used a non-normal population distribution and we set our sample size as low as 5), but the numbers still came out pretty close.

Notice that if we assume that the central limit theorem applies, we already know the shape, mean, and standard deviation of a sampling distribution without having to construct it.

#### Discussion Questions

1. What is the shape of a sampling distribution?
2. What is the mean of a sampling distribution?
3. What is the standard deviation of a sampling distribution called? How is it calculated?

----
## IV. What's Next

Where this gets useful is using the sampling distribution to make statements about the probability of obtaining a single sample mean. In many research contexts, we work with a single sample distribution. We do not have access to the population distribution nor the sampling distribution. But, we can use the central limit theorem to imagine what the sampling distribution looks like (it is normal with its mean equal to the population mean and a standard error based on population standard deviation and sample size). Because the sampling distribution is made of sample means, it tells us about what we can expect if we take one single random sample from a population. 
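
As a rough sketch of that idea, the block below treats the sampling distribution as normal and asks how likely a sample mean above a chosen cutoff would be. All numbers are made up for illustration (population mean 70, population standard deviation 8, sample size 64, cutoff 72), and it uses `NormalDist` from Python's standard `statistics` module.

```python
from math import sqrt
from statistics import NormalDist

mu = 70.0      # hypothetical population mean
sigma = 8.0    # hypothetical population standard deviation
n = 64         # hypothetical sample size
cutoff = 72.0  # hypothetical sample mean of interest

se = sigma / sqrt(n)  # standard error of the mean: 8 / 8 = 1

# Sampling distribution under the central limit theorem: normal, centered on mu, SD = se
sampling_dist = NormalDist(mu=mu, sigma=se)

# Probability of drawing a sample whose mean is above the cutoff
p_above = 1 - sampling_dist.cdf(cutoff)
print(round(p_above, 4))
```

Here the cutoff sits two standard errors above the population mean, and the probability comes out to roughly 0.02, so a sample mean that high would be unusual.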

#### Discussion Questions

1. What is the mode of a sampling distribution?
2. What is the most common score in a sampling distribution?
3. Whenever a researcher takes a single random sample and finds the mean, what is the most likely value of the mean?
4. What is the approximate probability of obtaining a sample mean higher than the population mean?
5. What is the approximate probability of obtaining a sample mean lower than the population mean?

----
## V. Summary

To summarize, the central limit theorem allows us to say useful things for research:

- A single random sample will have a mean that approximates the population mean. We can use samples in place of having to measure every member of the population.
- Each time we take a random sample and calculate the mean, we are most likely to get the population mean.
- Our sample means will vary. We can predict how much they vary by calculating the standard error.
- It is possible to take a random sample and calculate the mean only to get a sample mean that is far away from the population mean, but this is unlikely to happen.
- A larger sample size reduces the standard error of the mean. Larger sample sizes give us better estimates of the mean.

---
## VI. All done, congrats! 

Today you've not only learned about sampling distributions and the central limit theorem, but you've also learned how to write some Python code using Google Colab. High five!

<img src="https://live.staticflickr.com/3471/3904325807_8ab0190152_b.jpg" alt="High-five!" width="100"/>

["High-five!"](https://live.staticflickr.com/3471/3904325807_8ab0190152_b.jpg) by Nick J Webb is licensed under CC BY 2.0