In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab08.ipynb")

# Lab 08: Normal Distribution and Variance of Sample Means

Welcome to Lab 8.

In today's lab, we will learn about [the variance of sample means](https://www.inferentialthinking.com/chapters/14/5/variability-of-the-sample-mean.html) as well as [the normal distribution](https://www.inferentialthinking.com/chapters/14/3/SD_and_the_Normal_Curve.html).

In [1]:
# Run this cell, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
import math
from datascience import *

# These lines do some fancy plotting.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# 1. Normal Distributions

When we visualize the distribution of a sample, we are often interested in the mean and the standard deviation of the sample (for the rest of this lab, we will abbreviate “standard deviation” as “SD”). These two summary statistics give us a bird’s-eye view of the distribution: the mean tells us where the distribution sits on the number line, and the SD tells us how spread out it is.

<!-- BEGIN QUESTION -->

### Question 1.1.
The next cell loads the table `births` from a previous lesson, which is a large random sample of US births and includes information about mother-child pairs. 

Plot the **distribution** of mother’s ages from the table. Don’t change the last line of code, which will automatically plot the mean value of the sample on the distribution as a red triangle.

<!--
BEGIN QUESTION
name: q1_1
manual: true
-->

In [2]:
births = Table().read_table('baby.csv')
...

# Do not change this line
plt.scatter(np.mean(births.column('Maternal Age')), -0.001, color = 'red', s = 50 , marker = "^");

<!-- END QUESTION -->



From the plot above, we can see that the mean is the center of gravity or balance point of the distribution. If you cut the distribution out of cardboard, and then placed your finger at the mean, the distribution would perfectly balance on your finger. Since the distribution above is right skewed (which means it has a long right tail), we know that the mean of the distribution is larger than the median, which is the “halfway” point of the data. Conversely, if the distribution had been left skewed, we know the mean would be smaller than the median.
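
To make the relationship between skew, mean, and median concrete, here is a minimal sketch on a tiny made-up array. The values in `right_skewed` are hypothetical and are not taken from `births`; they were chosen only so that one large value creates a long right tail.

```python
import numpy as np

# Hypothetical toy data with a long right tail (not from the `births` table).
right_skewed = np.array([1, 2, 2, 3, 12])

toy_mean = np.mean(right_skewed)      # the balance point
toy_median = np.median(right_skewed)  # the halfway point

print("Mean:  ", toy_mean)
print("Median:", toy_median)
```

The single large value pulls the balance point to the right, while the halfway point stays with the bulk of the data.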

Run the following cell to compare the mean (red) and median (green) of the distribution of mothers' ages.

In [3]:
births.hist('Maternal Age')
plt.scatter(np.mean(births.column('Maternal Age')), -0.001, color = 'red', s = 50, marker = "^");
plt.scatter(np.median(births.column('Maternal Age')), -0.001, color = 'green', s = 50, marker = "^");

### Question 1.2.
Assign `mean_median` to one of the following three integers, depending on the value of the mean (red) and median (green) for Maternal Age.

1. The mean is equal to the median.
2. The mean is greater than the median.
3. The mean is less than the median.

<!--
BEGIN QUESTION
name: q1_2
-->

In [4]:
mean_median = ...

In [None]:
grader.check("q1_2")

We are also interested in the standard deviation of mothers' ages. The SD gives us a sense of how much mothers' ages vary around the average age. If the SD is large, the ages are spread over a wide range around the mean. If the SD is small, the ages are tightly clustered around the mean.

**The SD of an array is defined as the root mean square of deviations (differences) from the average**.
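
The definition above can be unpacked step by step. The sketch below uses a small hypothetical array (`toy_values`, not lab data) chosen so the arithmetic is easy to check by hand, and reads the definition from right to left: deviations, then squares, then mean, then root.

```python
import numpy as np

# Hypothetical toy values, chosen so the arithmetic is easy to follow.
toy_values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

deviations = toy_values - np.mean(toy_values)  # differences from the average
mean_square = np.mean(deviations ** 2)         # mean of the squared deviations
toy_sd = np.sqrt(mean_square)                  # root of the mean square

print("SD computed step by step:", toy_sd)
print("SD from np.std:          ", np.std(toy_values))
```

Both lines print the same number, because `np.std` applies this same definition by default.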

Fun fact: σ (the Greek letter sigma) is used to represent the SD of a population, and μ (the Greek letter mu) is used for the mean of a population.

### Question 1.3.
Complete the cell below to calculate the mean and SD of `Maternal Age`. Assign these values to `age_mean` and `age_sd` respectively. 

Then, run the cell to see blue triangles that are one SD away from the sample mean marked in red.

<!--
BEGIN QUESTION
name: q1_3
-->

In [7]:
age_mean = ...
age_sd = ...
births.hist('Maternal Age')

plt.scatter(age_mean, -0.001, color = 'red', s = 50, marker = '^');
plt.scatter(age_mean + age_sd, -0.001, marker = '^', color = 'blue', s = 50);
plt.scatter(age_mean - age_sd, -0.001, marker = '^', color = 'blue', s = 50);

In [None]:
grader.check("q1_3")

In the histogram above, there aren't any characteristics or shapes that make estimating the standard deviation easy just by looking at the graph.

However, the distributions of some variables allow us to easily spot the standard deviation from the histogram. Specifically, if a sample follows a **normal distribution**, the standard deviation is easy to spot: the points of inflection of the bell-shaped curve (the points where the curve changes the direction of its curvature) lie one SD on either side of the mean.
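
For readers who want to see where the link between inflection points and the SD comes from, here is a short derivation sketch. It assumes the idealized normal curve with mean $\mu$ and SD $\sigma$; a real histogram only approximates this curve, so the match will be rough.

$$\phi(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

$$\phi'(x) = -\frac{x-\mu}{\sigma^2}\,\phi(x)$$

$$\phi''(x) = \left(\frac{(x-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2}\right)\phi(x)$$

Since $\phi(x) > 0$ everywhere, $\phi''(x) = 0$ exactly when $(x-\mu)^2 = \sigma^2$, that is, at $x = \mu - \sigma$ and $x = \mu + \sigma$. The second derivative changes sign at both points, so the curvature flips one SD on either side of the mean.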

### Question 1.4.
Fill in the following code to calculate the mean and standard deviation of maternal heights, which **are** roughly normally distributed. 

Then, run the provided code to plot the standard deviation on the histogram, as before - notice where one standard deviation (blue) away from the mean (red) falls on the plot.

<!--
BEGIN QUESTION
name: q1_4
-->

In [12]:
height_mean = ...
height_sd = ...
births.hist('Maternal Height', bins = np.arange(55,75,1))

plt.scatter((height_mean), -0.003, color = 'red', s = 50, marker = '^');
plt.scatter(height_mean + height_sd, -0.003, marker = '^', color = 'blue', s = 50);
plt.scatter(height_mean - height_sd, -0.003, marker = '^', color = 'blue', s = 50);

In [None]:
grader.check("q1_4")

We don’t always know how a variable will be distributed, and assuming that a variable follows a normal distribution is dangerous. However, the **Central Limit Theorem** identifies one kind of distribution that is *always* roughly normal: the distribution of the *sums* or *means* of large random samples drawn with replacement from a single population, regardless of the shape of that population's distribution.

**Remember:** the Central Limit Theorem refers to the distribution of a *statistic* calculated from a distribution, not the distribution of the original sample or population.
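
As an illustration of that distinction, the sketch below builds a deliberately skewed synthetic population and compares its shape with the shape of the distribution of sample means. Everything here is an assumption made for the example: the exponential population, the sample size of 400, the 2,000 repetitions, and the helper `skewness` (the average cubed z-score, which is about 0 for a symmetric shape and positive for a long right tail).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed "population" (exponential, mean 1); not lab data.
toy_population = rng.exponential(scale=1.0, size=100_000)

def skewness(x):
    """Average cubed z-score: near 0 if symmetric, positive for a right tail."""
    z = (x - np.mean(x)) / np.std(x)
    return np.mean(z ** 3)

# Draw 2,000 samples of size 400 with replacement; record each sample's mean.
toy_sample_means = np.array([
    np.mean(rng.choice(toy_population, size=400, replace=True))
    for _ in range(2_000)
])

print("Skewness of population:  ", skewness(toy_population))
print("Skewness of sample means:", skewness(toy_sample_means))
```

The population stays strongly skewed, while the sample means come out far more symmetric: the statistic is roughly normal, not the data.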

The next section will explore distributions of one statistic, the sample mean, and you will see how the standard deviation of these distributions depends on the size of your sample.

# 2. Variability of the Sample Mean

As mentioned at the end of the previous section, the [Central Limit Theorem](https://www.inferentialthinking.com/chapters/14/4/Central_Limit_Theorem.html) guarantees that the probability distribution of the mean of a large random sample will be roughly normal. The bell-shaped curve of the sample means will be centered at the mean of the population. Due to chance, some of the sample means are higher than the population mean and some are lower, but the deviations from the population mean are roughly symmetric on either side, as we have seen repeatedly. Formally, probability theory shows that the sample mean is an **unbiased estimate** of the population mean.

In our simulations, we also noticed that the means of larger samples tend to be more tightly clustered around the population mean than means of smaller samples. In this section, we will quantify the [variability of the sample mean](https://www.inferentialthinking.com/chapters/14/5/Variability_of_the_Sample_Mean.html) and develop a relation between the variability and the sample size.

Let's take a look at the salaries of employees of the City of San Francisco in 2014. The mean salary reported by the city government was about $75,463.92.

**Note:** If you get stuck on any part of this lab, please refer to [Chapter 14 of the textbook](https://www.inferentialthinking.com/chapters/14/Why_the_Mean_Matters.html).

In [17]:
salaries = Table().read_table('sf_salaries_2014.csv').select('salary')
salaries

Running the cell below will calculate the mean salary from the 2014 dataset. Since this dataset encompasses **every** city employee, we can consider it our population.

In [18]:
salary_mean = np.mean(salaries.column('salary'))
print('Mean salary of San Francisco city employees in 2014 was ', round(salary_mean, 2))

Running the cell below will show the distribution of salaries for city employees and show the population mean marked with a red triangle.

In [19]:
salaries.hist('salary', bins = np.arange(0, 300000+10000*2, 10000))
plt.scatter(salary_mean, -0.0000002, marker = '^', color = 'red', s = 50);
plt.title('2014 Salaries of City of SF Employees');

Clearly, this population does not follow a normal distribution due to the large percentage of city workers that earn between \$0 and \$10,000. Keep that in mind as we progress through these exercises.

In this question we will take random samples **with replacement**, compute the mean value of each sample, and visually inspect the distribution of the sample means. The goal is to investigate how the size of the sample affects the distribution of the sample means.

Throughout this problem, remember that this is an investigation to uncover a pattern between sample size and the distribution of sample means. If all you wanted was the average salary of a San Francisco city worker, this would be unnecessary (we already know that value!). The point is to understand how the choice of sample size would affect our analysis if we were sampling from a population for which we didn't have the full dataset.
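
Because the difference between sampling with and without replacement matters in this section, here is a minimal sketch of the two modes in plain NumPy. The five-value `tiny_population` is hypothetical, and `rng.choice` is used only for illustration; it is not the table method you will use in the questions below.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, for repeatability only

# Hypothetical tiny population of five values (not from `salaries`).
tiny_population = np.array([10, 20, 30, 40, 50])

# Without replacement: each value can be chosen at most once.
without_repl = rng.choice(tiny_population, size=5, replace=False)

# With replacement: every draw comes from the full population, so repeats
# can occur and the sample may even be larger than the population.
with_repl = rng.choice(tiny_population, size=8, replace=True)

print("Without replacement:", without_repl)
print("With replacement:   ", with_repl)
```

A sample of size 8 from five values must contain repeats, which is only possible when drawing with replacement.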

### Question 2.1.
Define a function `one_sample_mean`. Its arguments are `table` (the table to use), `label` (the label of the column containing the variable), and `sample_size` (the number of employees in the sample). It should sample with replacement from the table and return the mean of the data in the `label` column in the sample.

Running `one_sample_mean(salaries, 'salary', 100)` would draw one random sample of size 100 from the `salaries` table and return the average value of the `salary` column of that sample.

<!--
BEGIN QUESTION
name: q2_1
-->

In [20]:
def one_sample_mean(table, label, sample_size):
    new_sample = ...
    new_sample_mean = ...
    ...

In [None]:
grader.check("q2_1")

<!-- BEGIN QUESTION -->

### Question 2.2.

Now, let's run a simulation to generate many samples and compute their corresponding sample means. To do this, define the function `simulate_sample_mean`, whose arguments are the table, the label of the column containing the variable, the sample size, and the number of simulations. Be sure to use the function `one_sample_mean` you defined in the previous problem to create a sample and compute its mean. Append each sample mean to the provided array `means` as it is computed.

The remaining code in the function will create a table named `sample_means` out of the array named `means`, compute some statistics about the simulated sample means, and then display a histogram of the distribution of sample means. Lastly, it returns a single value which is the standard deviation of the simulated sample means.

This one function does a lot of things!

<!--
BEGIN QUESTION
name: q2_2
manual: true
-->

In [22]:
"""Empirical distribution of random sample means"""
def simulate_sample_mean(table, label, sample_size, repetitions):
    means = make_array()
    for i in np.arange(repetitions):
        new_sample_mean = ...
        means = ...
    sample_means = Table().with_column('Sample Means', means)
    
    # Display empirical histogram and print all relevant quantities – don't change this!
    sample_means.hist(bins = 20)
    plt.xlabel('Sample Means')
    plt.title('Sample Size ' + str(sample_size))
    print('Sample size: ', sample_size)
    print('Population mean: ', np.mean(table.column(label)))
    print('Average of sample means: ', np.mean(means))
    print('Population SD: ', np.std(table.column(label)))
    print('SD of sample means: ', np.std(means))
    return np.std(means)

<!-- END QUESTION -->



In the following cell, the code will use your `simulate_sample_mean` function to create 10,000 samples, each of size 100, from `salaries`, compute the sample mean for each sample, display the statistics about those sample means, then create a histogram so you can see how those sample means are distributed. If any of those steps don't seem to be working correctly, reach out to a classmate or instructor to sort out the issue. 

**Important:** The rest of the lab requires this function to be working correctly, so make sure that everything is working as intended before moving on!

In [23]:
simulate_sample_mean(salaries, 'salary', 100, 10000) 
plt.xlim(50000, 100000);

<!-- BEGIN QUESTION -->

### Question 2.3.a
Let's put the `simulate_sample_mean` function to the test!

First, use the `simulate_sample_mean` function to investigate the distribution of sample means for the `'salary'` column in the `salaries` table that is created when you compute 10,000 samples of size 400.

**Note:** Don't worry about the `plt.xlim()` line; it just makes sure that all of the plots generated in this section have the same x-axis, ranging from 50,000 to 100,000.

<!--
BEGIN QUESTION
name: q2_3a
manual: true
-->

In [24]:
simulate_sample_mean(..., ..., ..., ...)
plt.xlim(50000, 100000);

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.3.b
Now, use the `simulate_sample_mean` function to investigate the distribution of sample means for the `'salary'` column in the `salaries` table that is created when you compute 10,000 samples of size 625.

<!--
BEGIN QUESTION
name: q2_3b
manual: true
-->

In [25]:
simulate_sample_mean(..., ..., ..., ...)
plt.xlim(50000, 100000);

<!-- END QUESTION -->

### Question 2.4.
Take a moment and compare the histograms that were generated in the previous problems. Then, use the `make_array` function to assign `q2_4` to an array of numbers corresponding to those statements below that are TRUE about the plots from Question 2.3.

1. We see the Central Limit Theorem (CLT) in action because the distributions of the sample means are bell-shaped.
2. We see the Law of Averages in action because the distributions of the sample means look like the distribution of the population.
3. One of the conditions for CLT is that we have to draw a small random sample with replacement from the population.
4. One of the conditions for CLT is that we have to draw a large random sample with replacement from the population.
5. One of the conditions for CLT is that the population must be normally distributed.
6. Both plots in 2.3 are roughly centered around the population mean.
7. Both plots in 2.3 are roughly centered around the mean of a particular sample.
8. The distribution of sample means for sample size 625 has less variability than the distribution of sample means for sample size 400.
9. The distribution of sample means for sample size 625 has more variability than the distribution of sample means for sample size 400.

<!--
BEGIN QUESTION
name: q2_4
-->

In [26]:
q2_4 = ...

In [None]:
grader.check("q2_4")

## Number of Samples

Next, we'll look at what happens if we keep the sample size fixed but take an increasing number of samples. Notice that in each line of code, the sample size is 100. What changes is how many samples are drawn from the `salaries` table: 500, 1000, 5000, 10000. As you run the cells, think about how the distribution of the sample means changes.

In [29]:
simulate_sample_mean(salaries, 'salary', 100, 500)
plt.xlim(50000, 100000);

In [30]:
simulate_sample_mean(salaries, 'salary', 100, 1000)
plt.xlim(50000, 100000);

In [31]:
simulate_sample_mean(salaries, 'salary', 100, 5000)
plt.xlim(50000, 100000);

In [32]:
simulate_sample_mean(salaries, 'salary', 100, 10000)
plt.xlim(50000, 100000);

Discuss with your classmates what you noticed about the distributions of sample means in the four histograms above. You'll use your observations to answer the following questions.

### Question 2.5.
Assign the variable `SD_of_sample_means` to the integer corresponding to your answer to the following question:

When I increase the number of samples that I take, for a fixed sample size, the SD of my sample means will...

1. Increase
2. Decrease
3. Stay about the same
4. Vary wildly

<!--
BEGIN QUESTION
name: q2_5
-->

In [33]:
SD_of_sample_means = ...

In [None]:
grader.check("q2_5")

### Question 2.6.
Let's think about how the relationships between:
* The SD of the values in a population (population SD)
* The SD of the values in a single sample (sample SD)
* The SD of the sample mean statistic (SD of sample means)

change with varying sample size. 

Which of the following is true? Assign the variable `pop_vs_sample` to an array of integer(s) that correspond to true statement(s).

1. Sample SD gets smaller with increasing sample size.
2. Sample SD gets larger with increasing sample size.
3. Sample SD becomes more consistent with population SD with increasing sample size.
4. SD of sample means gets smaller with increasing sample size.
5. SD of sample means gets larger with increasing sample size.
6. SD of sample means stays the same with increasing sample size.

<!--
BEGIN QUESTION
name: q2_6
-->

In [36]:
pop_vs_sample = ...

In [None]:
grader.check("q2_6")

If you need help making a decision for the previous question, read/run the following cells.

## Additional Data on Sample SD and SD of Sample Means

Run the following three cells multiple times and examine how the sample SD and the SD of sample means change with sample size.

The first histogram is of the sample; the second histogram is the distribution of sample means with that particular sample size. Adjust the bins as necessary.

In [39]:
sample_10 = salaries.sample(10)
sample_10.hist("salary")
plt.title('Distribution of salary for sample size 10')
print("Sample SD: ", np.std(sample_10.column("salary")))
simulate_sample_mean(salaries, 'salary', 10, 1000)
plt.xlim(5,120000);
plt.ylim(0, .0001);
plt.title('Distribution of sample means for sample size 10');

In [40]:
sample_200 = salaries.sample(200)
sample_200.hist("salary")
plt.title('Distribution of salary for sample size 200')
print("Sample SD: ", np.std(sample_200.column("salary")))
simulate_sample_mean(salaries, 'salary', 200, 1000)
plt.xlim(5,100000)
plt.ylim(0, .00015);
plt.title('Distribution of sample means for sample size 200');

In [41]:
sample_1000 = salaries.sample(1000)
sample_1000.hist("salary")
plt.title('Distribution of salary for sample size 1000')
print("Sample SD: ", np.std(sample_1000.column("salary")))
simulate_sample_mean(salaries, 'salary', 1000, 1000)
plt.xlim(5,100000)
plt.ylim(0, .00025);
plt.title('Distribution of sample means for sample size 1000');

You should notice that the distribution of means gets narrower and spikier, and that the distribution of the sample increasingly looks like the distribution of the population as we get to larger sample sizes. 

Let's illustrate these trends. Below, you will see how the sample SD changes with respect to sample size (N). The blue line is the population SD.

In [42]:
# Don't change this cell, just run it!
pop_sd = np.std(salaries.column('salary'))
sample_sds = make_array()
sample_sizes = make_array()
for i in np.arange(10, 500, 10):
    sample_sds = np.append(sample_sds, [np.std(salaries.sample(i).column("salary")) for d in np.arange(100)])
    sample_sizes = np.append(sample_sizes, np.ones(100) * i)
Table().with_columns("Sample SD", sample_sds, "N", sample_sizes).scatter("N", "Sample SD")
matplotlib.pyplot.axhline(y=pop_sd, color='blue', linestyle='-');

The next cell shows how the SD of the sample means changes relative to the sample size (N).

In [43]:
# Don't change this cell, just run it!
def sample_means(sample_size):
    means = make_array()
    for i in np.arange(1000):
        sample = salaries.sample(sample_size).column('salary')
        means = np.append(means, np.mean(sample))
    return np.std(means)

sample_mean_SDs = make_array()
for i in np.arange(50, 1000, 100):
    sample_mean_SDs = np.append(sample_mean_SDs, sample_means(i))
Table().with_columns("SD of sample means", sample_mean_SDs, "Sample Size", np.arange(50, 1000, 100))\
.plot("Sample Size", "SD of sample means")

From these two plots, we can see that the SD of our **sample** approaches the SD of our population as our sample size increases, but the SD of our **sample means** (in other words, the variability of the sample mean) decreases as our sample size increases.
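
The downward trend in the second plot is consistent with a standard result from probability theory that this lab does not state explicitly: for random samples drawn with replacement, the SD of the sample mean equals the population SD divided by the square root of the sample size. The sketch below checks this on a synthetic population. The gamma-distributed `toy_population`, the helper `simulated_sd_of_means`, and the chosen sample sizes are all assumptions made for illustration, not part of the lab's data or code.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population standing in for real data.
toy_population = rng.gamma(shape=2.0, scale=30_000, size=50_000)
toy_pop_sd = np.std(toy_population)

def simulated_sd_of_means(sample_size, repetitions=2_000):
    """Simulate `repetitions` sample means and return their SD."""
    means = np.array([
        np.mean(rng.choice(toy_population, size=sample_size, replace=True))
        for _ in range(repetitions)
    ])
    return np.std(means)

for n in [25, 100, 400]:
    simulated = simulated_sd_of_means(n)
    predicted = toy_pop_sd / np.sqrt(n)
    print(f"n = {n:>3}: simulated {simulated:,.0f}, predicted {predicted:,.0f}")
```

Under this relationship, quadrupling the sample size roughly halves the SD of the sample means.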

## What about Bootstrapping?
Throughout this lab, we have been taking many random samples from a population. However, all of these principles hold for bootstrapped resamples from a single sample. If your original sample is relatively large, all of your re-samples will also be relatively large, and so the SD of resampled means will be relatively small. 

In order to change the variability of your sample mean, you’d have to change the size of the original sample from which you are taking bootstrapped resamples.
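
To illustrate the bootstrapping point, the sketch below resamples from a single synthetic sample and measures how much the resampled means vary. The 500-value `original_sample` is hypothetical, and the comparison value (sample SD divided by the square root of the sample size) relies on the standard result for the SD of a sample mean, which is an assumption brought in from probability theory rather than something stated in this lab.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed so the sketch is repeatable

# Hypothetical original sample of 500 values (synthetic, not lab data).
original_sample = rng.gamma(shape=2.0, scale=30_000, size=500)
n = len(original_sample)

# Each bootstrap resample has the same size as the original sample
# and is drawn from it with replacement.
resampled_means = np.array([
    np.mean(rng.choice(original_sample, size=n, replace=True))
    for _ in range(2_000)
])

sd_resampled = np.std(resampled_means)
sd_expected = np.std(original_sample) / np.sqrt(n)

print("SD of resampled means:        ", sd_resampled)
print("Sample SD / sqrt(sample size):", sd_expected)
```

Because every resample has the size of the original sample, the only way to shrink the first number is to start from a larger original sample.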

That's it. You've completed Lab 8. There weren't many tests, but there were a lot of points at which you should've stopped and understood exactly what was going on. Consult the textbook or ask your instructor if you have any other questions.

## Submitting your work
You're done with Lab 08! All assignments in the course will be distributed as notebooks like this one, and you will submit your work by doing the following:
* Save your notebook
* Restart the kernel and run up to this cell.
* Run all the tests by running the cell containing `grader.check_all()`. Make sure they pass the way you expect them to.
* Run the cell below with the code `grader.export(...)`.
* Download the file named `lab08.zip`, found in the explorer pane on the left side of the screen.
* Upload `lab08.zip` to the Lab 08 assignment on Canvas.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by finding it in the file browser on the left side of the screen, then right-click and select **Download**. You'll submit this .zip file for the assignment in Canvas to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()