# Notebook 4: Central Limit Theorem

**Name(s):**

1.

2.

3.

<br>

---

In this notebook, we will have a look at the Central Limit Theorem in action!

<br>

---

## Exercise 1 - Estimating Mean Income from a Population

The file `income_data.csv` contains Age and Income (in dollars) information from the entire population of a fictitious city with 5000 residents.  

Execute the following cell to read the data from my GitHub repository and load the data into a matrix called `dat`.

`dat` is a matrix with 5000 rows and 2 columns. The first column of `dat` is everyone's age, and the second column is everyone's income. In R, these two dimensions are represented by having 2 indices within `dat`. For example, `dat[6,2]` refers to the element in the 6th row and the 2nd column of `dat`.

We can reference _several_ elements of a matrix in R at once by _slicing_ the matrix, that is, by referencing a set of elements in one or both of the dimensions. For example, `dat[1:60,1]` refers to the vector formed by taking the first 60 elements in the 1st column of `dat`.

*Pro tip: you can also get **all** of the elements along a certain dimension in R by just leaving that index blank.*
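The indexing rules above can be tried out on a small toy matrix before touching the real data. This is only a sketch: the matrix `m` and the names `single`, `part`, and `column` are made up for illustration and are not part of the course data.

```r
# Toy matrix (hypothetical, not the course data); R fills it column by column
m = matrix(1:12, nrow=4, ncol=3)

single = m[2, 3]     # one element: row 2, column 3
part   = m[1:2, 1]   # a slice: rows 1 through 2 of column 1
column = m[, 2]      # blank row index: every row of column 2

print(single)   # 10
print(part)     # 1 2
print(column)   # 5 6 7 8
```

The same patterns should carry over to `dat`, with the row and column numbers changed to match what you need.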

In [None]:
# read the data in by executing this cell
# be sure to actually read the exposition above!
dat = read.csv("https://raw.githubusercontent.com/tonyewong/math251_fall2023/master/income_data.csv", header=TRUE)

### Task 1:

Finish the code below to set the vectors `age` and `income` equal to the appropriate columns from `dat`.

In [None]:
# Finish this code
age = 0    # <-- TODO: edit this!
income = 0 # <-- TODO: edit this!

### Task 2:

Use methods from our previous notebooks to create a probability density histogram of the `income` data. Label your axes appropriately.

_Hint: what argument of the `hist` function makes a **density** histogram instead of a frequency one?_

### Task 3:

In real life, we have populations much bigger than $5000$.  If we want to estimate the mean of the population we have to draw a sample from the population and compute the sample mean.  The important questions we have to ask are things like:

- Is the sample mean a good approximation of the population mean?
- How large does my sample need to be in order for the sample mean to approximate the population mean well?

The following code uses the `sample` function (which you saw in previous notebooks) to sample $n = 10$ elements from `income` and compute the estimated mean based on that small sample.  This yields a single sample for $\bar{X}$, which is reported to the screen.

In [None]:
sample_1 = mean(sample(income, size=10, replace=FALSE))
print(sample_1)

Now, that's a single value for $\bar{X}$. We want to verify that the Central Limit Theorem (CLT) works, and the CLT is a statement about the _distribution_ for $\bar{X}$. So, to approximate that distribution, we are going to need to obtain a great many estimates of $\bar{X}$, and then plot up the approximate distribution of $\bar{X}$ using a probability density histogram.

We can create a set of 5 values for $\bar{X}$ by running the above code 5 times. But, we want to save all of the different $\bar{X}$ values. We can do this by first creating a vector of five 0s as a placeholder to store those 5 $\bar{X}$ values. Then, we can replace each of the five 0s with one of the samples for $\bar{X}$. Complete the following code to do just that. For now, we'll keep using sets of 10 individuals from the population for each calculation of a sample mean, $\bar{X}$.

In [None]:
# Create a placeholder vector of five 0s
samples_5 = c(0, 0, 0, 0, 0)

# Fill in the five values for $\bar{X}$ using a sample of n=10 income values
# Use what we did above as a guide!
set.seed(251) # don't change this
samples_5[1] = 0 # <-- TODO: fill this in!
samples_5[2] = 0 # <-- TODO: fill this in!
samples_5[3] = 0 # <-- TODO: fill this in!
samples_5[4] = 0 # <-- TODO: fill this in!
samples_5[5] = 0 # <-- TODO: fill this in!

# Print the five samples for $\bar{X}$
print(samples_5)
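If you are curious what `set.seed` and `replace=FALSE` are doing in the cells above, here is a minimal sketch on a toy vector. The vector `toy`, the seed value 1, and the names `draw_a` and `draw_b` are assumptions made up for this illustration.

```r
# Toy population (hypothetical values, not the income data)
toy = c(10, 20, 30, 40, 50, 60)

set.seed(1)                                   # fix the random number generator state
draw_a = sample(toy, size=3, replace=FALSE)   # 3 distinct elements of toy

set.seed(1)                                   # reset to the same state
draw_b = sample(toy, size=3, replace=FALSE)   # repeats the same draw

print(draw_a)
print(identical(draw_a, draw_b))    # TRUE: same seed, same sample
print(anyDuplicated(draw_a) == 0)   # TRUE: no element is drawn twice
```

Setting the seed once at the top of a cell, as done above, is what makes everyone's "random" results match each time the cell is run.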

### Task 4:

To create a sample of 5 values for $\bar{X}$, we copy-pasted a bunch of code. That was rather annoying, but not too bad. However, earlier we said we wanted to do this "a great many times". 5 is hardly "a great many". There are so many numbers larger than 5.

To do a great many samples for $\bar{X}$, we're going to need to be more clever. That's where `for` loops become useful! We can also use the `rep()` function to create our placeholder vector of 0s of much longer lengths, without explicitly writing down all of the 0s. The way `rep` works is that it creates a vector that is just the first argument you give it (in this case, 0) repeated a number of times equal to the second argument that you give it (in this case, the integer $M$).

Run the cell below to create a placeholder vector of 20 zeros.

In [None]:
# Set the number of samples for $\bar{X}$ that you want
M = 20

# Create a placeholder vector of M zeros
samples_M = rep(0,M)
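As a quick check of how `rep` behaves, here is a small sketch with made-up lengths. The names `zeros`, `k`, and `placeholder` are hypothetical and only used here.

```r
zeros = rep(0, 4)          # the value 0 repeated 4 times
print(zeros)               # 0 0 0 0
print(length(zeros))       # 4

k = 7                      # hypothetical length, stored in a variable
placeholder = rep(0, k)    # the second argument can be a variable
print(length(placeholder)) # 7
```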

Now we can use a `for` loop to _iterate_ over some set of indices. Some notes about the structure of the `for` loop:
* In the first line, `i in 1:M` tells you that this loop will execute the code inside the `{ }` braces for every value of the _index_ `i` that is _in_ the set `1:M`.
* `1:M` is shorthand for saying "each integer starting with 1 and ending with $M$".
* The second line is all the stuff inside the `{ }` braces, which is the code that will be executed for each value of `i`.
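To see this structure in action on something simpler than sample means, the sketch below fills a placeholder vector with the squares of the integers 1 through 5. The names `K` and `squares` are made up for this example.

```r
K = 5                  # hypothetical number of iterations
squares = rep(0, K)    # placeholder vector of K zeros

for (i in 1:K) {
  squares[i] = i^2     # overwrite the i-th placeholder with i squared
}

print(squares)   # 1 4 9 16 25
```

Each pass through the loop replaces exactly one of the placeholder zeros, which is the same pattern you will want for storing values of $\bar{X}$.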

Add whatever code you used above to create a single sample for $\bar{X}$ to the code cell below, to get this `for` loop running.

In [None]:
for (i in 1:M) {
  samples_M[i] = 0 # <-- TODO: fill this in!
}

Now print out all $M=20$ of the values in `samples_M` by running the cell below.

In [None]:
print(samples_M)

### Task 5:

With $n=10$ individuals from the population in each estimate of $\bar{X}$, the values for $\bar{X}$ are kind of all over the place. You probably ended up with values as low as in the \$40,000s, and as high as in the \$70,000s or \$80,000s.

But, we can fix this! Recall that we saw in lecture that the variance in the sample mean decreases as $n$ increases. So let's try using $n=100$. Re-run your `for` loop using $M=20$ again, but this time set $n=100$ by modifying the `size` argument in your `sample` function. Some starter code is provided for your convenience.

In [None]:
M = 20               # set number of Xbar samples to take
n = 100              # set the sample size for each Xbar
samples_M = rep(0,M) # create a placeholder vector of M zeros
for (i in 1:M) {
  samples_M[i] = 0   # <-- TODO: fill this in!
}
print(samples_M)

That ought to look a little better. The values should all be a bit closer to the known mean of (spoilers!) about \$61,000.
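How much closer should we expect? As a rough sketch, assume the $n$ draws are independent and write $\sigma$ for the population standard deviation (a symbol introduced here for illustration; because we sample with `replace=FALSE`, the draws are slightly dependent, so this is only approximate). Then the standard result for the spread of the sample mean gives:

$$\mathrm{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}, \qquad \frac{\mathrm{SD}(\bar{X}) \text{ at } n=10}{\mathrm{SD}(\bar{X}) \text{ at } n=100} = \sqrt{\frac{100}{10}} = \sqrt{10} \approx 3.16$$

So the $\bar{X}$ values at $n=100$ should be spread out roughly a third as much as they were at $n=10$.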

### Task 6:

We were originally out to examine the distribution of $\bar{X}$. We just got a sample of size $M=20$ from the distribution of $\bar{X}$, but that doesn't feel like enough: to get a picture of the whole distribution, we want many more samples than just 20. Copy-paste your code from the previous task and then modify it to create a sample of size 1,000 from the distribution of $\bar{X}$. As in the last task, let's keep using $n=100$ individuals for each value of $\bar{X}$ that we calculate.

Be sure you aren't accidentally printing them all to the screen still! _Do_, however, calculate the mean of your sample of $\bar{X}$ values and print it to the screen. Comment on the agreement between this mean and the mean of the actual population. (You may want to actually calculate the mean of the whole population using R's `mean()` function.)

### Task 7:

Create a probability density histogram of the values in `samples_M` from Task 6. Be sure to label your axes so renegade physics professors don't shank you. Comment on whether the distribution appears to be roughly normal or not.

### Task 8:

If your histogram in the previous task is correct, and you use the default histogram bins, then the y axis should have tick marks that are something like $10^{-5}$. That's a **tiny** probability! Why are those heights so small?

### Task 9:

The Central Limit Theorem tells us that increasing which quantity, $M$ or $n$, will make this distribution for $\bar{X}$ more normal?

### Task 10:

Create a new histogram of 1000 samples for $\bar{X}$, but this time use $n=5$ individuals from the population for each sample of $\bar{X}$ by changing the `size` argument in your `sample` function. You should be able to modify your code from Tasks 6 and 7 to do this. Comment on whether the distribution of $\bar{X}$ that you obtain appears to be normal or not. How does the Central Limit Theorem explain this result?