## Lab 7.  Why The Mean Matters in School Life. 

Welcome to Lab 7! This lab is due **Friday 11/22 at 11:59pm**. In this lab you will practice calculating variance, calculating standard deviation, and converting values to standard units. You will use these skills to compare grades in a course and you will use Chebyshev’s bounds to predict how hard the students should work in order to rank in the top 5% of the class. Finally, you will use confidence intervals to help college administrators plan for next quarter by predicting the enrollment in a new course.


Reading:
* [Chapter 14.2](https://www.inferentialthinking.com/chapters/14/2/Variability.html): Variability, Standard Deviation, Standard Units, Chebyshev's Bounds
* [Chapter 14.3](https://www.inferentialthinking.com/chapters/14/3/SD_and_the_Normal_Curve.html): The Standard Deviation (SD) and the Normal Curve 
* [Chapter 14.4](https://www.inferentialthinking.com/chapters/14/4/Central_Limit_Theorem.html): The Central Limit Theorem
* [Chapter 14.5](https://www.inferentialthinking.com/chapters/14/5/Variability_of_the_Sample_Mean.html): The Variability of the Sample Mean

As usual, **run the cell** below to prepare the lab and the automatic tests.

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# These lines load the tests.
from client.api.notebook import Notebook
ok = Notebook('lab07.ok')
_ = ok.auth(inline=True)

## 0. Comparing Grades Using Standard Units

Two of your friends, Cathy and Sam, just took their midterms. Cathy took her circuits design midterm and Sam took his Spanish vocabulary midterm. Cathy received a **B+** on her midterm (87%) and Sam received an **A-** (92%). Cathy claims that while she received a lower grade on her midterm, she actually did better (relative to the rest of the class) than Sam. Sam disagrees. Knowing that you are taking DSC10, your two friends come to you to settle their argument.


They show you two tables, `circuits_midterm` and `spanish_midterm`, that contain the grades for their classes. Both exams are out of 100 points. Each table has two columns: `Student`, the student number, and `Score (name of midterm)`, that student's midterm score.

**Note:** You do ***not*** need to make any changes to the cell below. It is there to help you visualize the datasets.

In [None]:
# Cathy's histogram
circuits_midterm = Table.read_table("circuits_midterm.csv")
circuits_midterm.hist('Score (Circuits Midterm)', bins=range(0,101,1))
cathy_score = 87
print("Cathy's Score: " + str(cathy_score))

# Sam's histogram
spanish_midterm = Table.read_table("spanish_midterm.csv")
spanish_midterm.hist('Score (Spanish Midterm)', bins=range(0,101,1))
sam_score = 92
print("Sam's Score: " + str(sam_score))


You know that instead of comparing their actual scores, you should first convert their scores into **standard units**. Standard units are often represented as $\textbf{z}$.
$$ z = \frac{\mbox{value - average}}{SD}$$
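As a rough illustration of the formula, here is a minimal sketch on a small made-up array. The scores and the names `quiz_scores` and `quiz_su` are hypothetical and not part of the lab data, and the sketch uses NumPy's built-in `np.std` rather than the function you will write below.

```python
import numpy as np

# Hypothetical quiz scores, invented for illustration only.
quiz_scores = np.array([60, 70, 80, 90, 100])

average = np.mean(quiz_scores)
sd = np.std(quiz_scores)  # divides by n, matching the variance definition used in this lab

# Standard units: how many SDs each value sits above or below the average.
quiz_su = (quiz_scores - average) / sd
print(quiz_su)
```

After conversion, the values have average 0 and SD 1, which is what makes scores from different exams comparable.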

To compute the midterm score in standard units for each friend, we need to:
1. Compute the **average** grade for the entire class. We will use the `np.mean` function to do this.
2. Compute the **standard deviation** (SD) of the midterm scores for the entire class. We *could* use `np.std`, but we will write our own function to do that. 

*Reminder 1*: Standard deviation is the square root of the variance.  Therefore let's make a function that computes the **variance** first. 

*Reminder 2*: Variance is the mean squared deviation from the average:

$$ variance = \frac{(value_1 - average)^2 + (value_2 - average)^2 +...+ (value_n - average)^2}{n},$$
where `n` is the number of values (exam scores for our problem).
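
As a worked example of the formula, take the eight made-up values 2, 4, 4, 4, 5, 5, 7, 9 (hypothetical numbers, not lab data). Their average is

$$ average = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = 5 $$

so the variance is

$$ variance = \frac{(2-5)^2 + 3(4-5)^2 + 2(5-5)^2 + (7-5)^2 + (9-5)^2}{8} = \frac{9 + 3 + 0 + 4 + 16}{8} = 4 $$

and the standard deviation is $\sqrt{4} = 2$.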


**Question 0.1** Fill in the missing code to complete the function `compute_variance`. It takes as input an *array of numbers* (`data`) and returns the variance as a single number. 

In [None]:
def compute_variance(data):
    average = ... # Find the average of the data
    diff = ...    # diff should be an array that contains the difference between every entry in data and the average
    square_diff = ... # square every entry of diff, the result should still be an array
    sum_square_diff = ... # the sum of the entries of square_diff
    variance = ...    # a single value (the variance of data)
    return variance

 

Then use the `compute_variance` function to compute the variance of the two samples. 

In [None]:
circuits_midterm_var = ...
print("Variance of circuits midterm: " + str(circuits_midterm_var))

spanish_midterm_var = ...
print("Variance of Spanish midterm: " + str(spanish_midterm_var))

In [None]:
_ = ok.grade('q0_1')


**Question 0.2** After calculating the variance, we want to write a function that calculates the standard deviation. Fill in the missing code to complete the function `compute_sd`. It takes as *input an array of numbers* and returns the *standard deviation* as a single number. 


*Hint:* Your solution should only take ***one line*** (the one with `return`) and should use the `compute_variance` function.

In [None]:
def compute_sd(data):
    return ...
    


Then use the `compute_sd` function to compute the standard deviation of the two midterms.

In [None]:
circuits_midterm_sd = ...
print("Standard Deviation of circuits midterm: " + str(circuits_midterm_sd))

spanish_midterm_sd = ...
print("Standard Deviation of Spanish midterm: " + str(spanish_midterm_sd))

In [None]:
_ = ok.grade('q0_2')

**Question 0.3** After writing a function that calculates the standard deviation, you are equipped to write a function that converts a given score to standard units. Fill in the missing code to complete the function `compute_su`. It takes a *score*, the *average score*, and the *standard deviation* and returns the score in standard units. 

**Warning**: Be careful with the order of operations.

In [None]:
def compute_su(score, avg, sd):
    standard_unit = ...
    return standard_unit
    

Then use the `compute_su` function to transform the scores earned by each friend into standard units.

In [None]:
cathy_su = ...
print("Standard Unit of Cathy's Score: " + str(cathy_su))

sam_su = ...
print("Standard Unit of Sam's Score: " + str(sam_su))

In [None]:
_ = ok.grade('q0_3')

**Question 0.4** Cathy's score *is* higher than Sam's score when we convert to standard units, which can be seen as evidence that she did better on her exam relative to her classmates than Sam did relative to his. 

Another way to measure their relative performances is directly from the tables `circuits_midterm` and `spanish_midterm`, by calculating, for each of Cathy and Sam, the proportion of students they scored higher than (or the same as). Comparing Cathy's proportion to Sam's proportion will give us an alternative way of measuring who did better relative to their classmates. Calculate Cathy's proportion and Sam's proportion below.
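
The idea of "proportion scored higher than or the same as" can be sketched on a small made-up array. The scores and the names `toy_scores`, `my_score`, and `toy_proportion` below are hypothetical. With a table, you would first pull the score column out as an array using `.column`.

```python
import numpy as np

# Hypothetical scores, invented for illustration only.
toy_scores = np.array([55, 62, 70, 70, 81, 88, 93, 97])
my_score = 81

# Comparing an array to a number gives an array of True/False values;
# np.count_nonzero counts the True entries.
at_or_below = np.count_nonzero(toy_scores <= my_score)
toy_proportion = at_or_below / len(toy_scores)
print(toy_proportion)
```

Note that the result is a proportion between 0 and 1, not a percentage.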



In [None]:
cathy_proportion = ...
print("Cathy's Proportion: " + str(cathy_proportion))

sam_proportion = ...
print("Sam's Proportion: " + str(sam_proportion))


In [None]:
_ = ok.grade('q0_4')

## 1. Chebyshev's Bounds and Normal Curves

Let's look at the histograms of the two midterms again.

In [None]:
circuits_midterm.hist('Score (Circuits Midterm)', bins=range(0,101,1))
spanish_midterm.hist('Score (Spanish Midterm)', bins=range(0,101,1))

**Question 1.1**
Which of the two graphs is a roughly normal curve?

1. `Only the upper graph (Circuits Midterm)  is normal.`
2. `Only the lower graph (Spanish Midterm) is normal. `
3. `Both graphs are normal. `
4. `Neither graph is normal.`

Remember all normal curves have the following characteristics:

* The mean (average) is always in the center of a normal curve.
* A normal curve has only one mode (peak).

Set variable `q1_1` to either 1, 2, 3 or 4 depending on your answer. 

In [None]:
q1_1 = ...

In [None]:
_ = ok.grade('q1_1')

**Question 1.2**
From looking at the histogram of the Spanish exam above, rank the following values in order **from smallest to largest**.

1. `The mean score.  `
2. `The median score.`
3. `The most common score (the mode).`

Set variable `q1_2` to a list containing the numbers 1, 2, 3 **in the appropriate order**.

In [None]:
q1_2 = []

In [None]:
_ = ok.grade('q1_2')

### Chebyshev's Bounds, recap

Chebyshev's Bounds state that *no matter what the shape of the distribution*, the proportion of values in the range

$$average \pm z \mbox{ standard deviations}$$

is at least

$$1 - \frac{1}{z^{2}}$$

**It's also important to note that these are lower bounds, not approximations:** at least 75% of the data is guaranteed to lie within plus or minus 2 standard deviations of the mean, but 100% of the data might also lie within plus or minus 2 standard deviations of the mean.
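
To see that the bound holds even for a lopsided distribution, here is a minimal sketch on made-up data. The array `skewed` and the list `results` are hypothetical names chosen for illustration, and the data is not from the lab.

```python
import numpy as np

# Hypothetical, deliberately lopsided data: mostly small values plus one very large one.
skewed = np.array([1, 1, 1, 2, 2, 3, 3, 4, 5, 40])

average = np.mean(skewed)
sd = np.std(skewed)

results = []
for z in [2, 3, 4]:
    bound = 1 - 1 / z**2  # Chebyshev's lower bound
    # Actual proportion of values within z SDs of the average.
    within = np.count_nonzero(np.abs(skewed - average) <= z * sd) / len(skewed)
    results.append((z, bound, within))
    print(z, "SDs: bound", round(bound, 3), "actual", within)
```

In every row the actual proportion is at least as large as the bound, and it is often noticeably larger.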

### On the other hand...
**If we know that our data forms a normal curve**, the standard deviation is even more informative. 
<img src="chebyshev.png"/>

**Note that for a normal distribution, the numbers in the last column of the table above are approximations, not lower bounds.**  
* If the distribution is perfectly normal, then about 68% of the data (not substantially more or less) will lie between plus and minus one standard deviation of the mean.
* Additionally, because a normal curve is symmetric, we know that about 34% of the data lies between the average and the average plus one standard deviation.
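
These percentages can be checked approximately by simulation. The sketch below draws from a normal distribution with a hypothetical mean of 70 and SD of 10 (values chosen only for illustration) and measures how much of the data falls within 1 and 2 SDs of the average.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so reruns give the same numbers
draws = rng.normal(loc=70, scale=10, size=100_000)  # hypothetical mean and SD

average = np.mean(draws)
sd = np.std(draws)

# Proportion of draws within 1 SD and within 2 SDs of the average.
within_1 = np.count_nonzero(np.abs(draws - average) <= 1 * sd) / len(draws)
within_2 = np.count_nonzero(np.abs(draws - average) <= 2 * sd) / len(draws)
print(within_1, within_2)
```

The two proportions should land close to 0.68 and 0.95, well above the Chebyshev guarantee of 0 and 0.75 for the same ranges.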
 


**Question 1.3**  Cathy, who is majoring in Engineering, really wanted to score in the top 5% of the class. But before taking the exam, she did not know if the scores would be normally distributed or not. 

Without making any assumptions about the distribution of scores, how many standard deviations above the mean would she have needed to score to **guarantee** that she fell in the top 5% of the class? Set variable `q1_3` to either 1, 2, 3, or 4, depending on your answer.


1. Cathy would need to score roughly 4.5 standard deviations above the average to guarantee being in the top five percent. Using Chebyshev's bounds, setting z = $\sqrt{20} \approx 4.5$  gives that 95% of the data will lie between plus or minus 4.5 SDs. If Cathy scores above 4.5 SDs, then she is guaranteed to have scored better than 95% of the other students.

2. Cathy would need to score above 2 SDs. Since 95% of the data falls between plus or minus 2 SDs, if Cathy scores above 2 SDs, she is guaranteed to score above 95% of the class. 

3. Cathy would need to score slightly less than 2 SDs. 50% of the class will have scored below the average. Which means that if Cathy scores 2 standard deviations above the average she'll have scored higher than 50% + (95% / 2) = 97.5%. 

4. No matter how many standard deviations above the mean Cathy scores, there is no guarantee that she will score in the top 5% of the class. 

In [None]:
q1_3 = ...

In [None]:
_ = ok.grade('q1_3')

**Question 1.4** Now, assuming that the scores for the exam will be normally distributed (as many exams are), how many standard deviations above the mean would Cathy have needed to score to **guarantee** that she fell in the top 5% of the class? Set variable `q1_4` to either 1, 2, 3, or 4, depending on your answer. 


1. Cathy would need to score roughly 4.5 standard deviations above the average to guarantee being in the top five percent. Using Chebyshev bounds, setting z = $\sqrt{20} \approx 4.5 $ gives that 95% of the data will lie between plus or minus 4.5 SDs. If Cathy scores above 4.5 SDs, then she is guaranteed to have scored better than 95% of the other students. 

2. Cathy would need to score above 2 SDs. Since 95% of the data falls between plus or minus 2 SDs, if Cathy scores above 2 SDs she is guaranteed to score above 95% of the class. 

3. Cathy would need to score slightly less than 2 SDs. 50% of the class will have scored below the average. Which means that if Cathy scores 2 standard deviations above the average she'll have scored higher than 50% + (95% / 2) = 97.5%. 

4. No matter how many standard deviations above the mean she scores there is no guarantee that she will score in the top 5% of the class. 

In [None]:
q1_4 = ...

In [None]:
_ = ok.grade('q1_4')

## 2. Planning Class Size (Choosing Sample Size)

For a recap on choosing sample sizes, please refer to this part of the [textbook](https://www.inferentialthinking.com/chapters/14/6/Choosing_a_Sample_Size.html).


A new class is being offered and the administration wants to know how many students will be taking the class so they know how big of a classroom it will need. To take the class, a student must have satisfied the prerequisites first. 

The administration knows there are 900 students eligible to take the class, but they don't have the resources to ask each of them whether they are going to take the class. They decide to ask a sample of the students, but they don't know how many students to ask. They want the width of their confidence interval to be at most 10 students. 

For example, if the results of their sample concluded that with 95% confidence between 200 and 210 students would take the class, the administration would be happy with that sample. However, if the results of the sample concluded that with 95% confidence between 200 and 300 students would take the class, the sample would not have been informative enough because that range is too wide. We are going to help determine how big of a sample the administration should take.

The population parameter we are interested in measuring is the proportion of eligible students who will take the class. We will estimate this using a sample statistic, the proportion of eligible students in the sample who plan to take the class. 

So *where do we start*?

We go to the professor and they tell us that regardless of the distribution of our population, the distribution of the sample statistic will be roughly normal as long as the sample is large enough. Let's run a simulation to see for ourselves.

Below is the data for the whole population. (If the administration had the resources to ask every student whether they were going to take the class, this is what they would see. "0" means they won't take the class and "1" means they will.)

In [None]:
population = Table.read_table("population.csv")
population.hist("Planning on taking", bins=np.arange(0,3,1))
population.show(5)

**Question 2.1** Below is partially implemented code to run a simulation. The simulation will repeatedly take samples (without replacement) from the population and calculate the proportion of students who plan on taking the class. Fill in the missing parts. 

In [None]:
def simulation(population, num_iterations, sample_size):
    results = make_array()
    for i in np.arange(num_iterations):
        sample = ...
        proportion_taking_class = ...
        ...
    Table().with_column("Proportion", results).hist("Proportion", bins=np.arange(-0.05, 1.05, 1/sample_size))


In [None]:
simulation(population, 10000, 40)

Does the distribution of the sample statistic look more like a normal curve or more like the population distribution? 

1. `More like a normal curve.`

2. `More like the original population. `

In [None]:
q2_1 = ...

In [None]:
_ = ok.grade('q2_1')

The professor also tells us that as we increase the sample size, the standard deviation of our sample statistic distribution will decrease. Again we decide to run a simulation to double check. Run the following cell to see how the distribution of the sample statistic changes as we increase the size of our sample. **It might take a while to run.** 

In [None]:
simulation(population, 10000, 20)
simulation(population, 10000, 50)
simulation(population, 10000, 100)
simulation(population, 10000, 850)

This trend can actually be expressed in a formula:
$$ Sample \space Statistic \space SD = \frac{Population\space SD}{\sqrt{sample\space size}}$$

We can use this formula to find the sample size we need to get a desired standard deviation of the sample statistic, and thus a certain confidence interval for that sample statistic. However, before taking our sample, we don't have any way of knowing the standard deviation of our population. The textbook and homework include some ways to get around this problem; here we will use the actual population standard deviation. 
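
The formula can be checked approximately on made-up data. The sketch below uses a hypothetical 0/1 population in which 30% of the entries are 1 (not the lab's `population` table) and draws samples *with* replacement, which is the setting where the formula applies directly. When sampling without replacement from a small population, the actual spread tends to be somewhat smaller than the formula predicts.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so reruns give the same numbers

# Hypothetical 0/1 population in which 30% of the entries are 1.
toy_population = np.array([1] * 300 + [0] * 700)
toy_population_sd = np.std(toy_population)

sample_size = 100
# Draw many samples WITH replacement and record each sample proportion.
sample_proportions = np.array([
    np.mean(rng.choice(toy_population, size=sample_size, replace=True))
    for _ in range(5000)
])

predicted_sd = toy_population_sd / np.sqrt(sample_size)  # the formula
observed_sd = np.std(sample_proportions)                 # the simulation
print(predicted_sd, observed_sd)
```

The two printed values should be close to each other, though not identical, because the simulation is itself random.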

**Question 2.2** The administration wants the *confidence interval* to have a *width* of **10 students**, but we have been calculating the *proportion* of eligible students who are ***planning on taking the class***. Let's express the desired interval width as a proportion of the eligible students. In other words, from the number of students who are eligible to take the class, determine what proportion of that number equals 10 students.

In [None]:
num_eligible_students = ...
print(num_eligible_students)
width_as_proportion = ...
print(width_as_proportion)

In [None]:
_ = ok.grade('q2_2')

**Question 2.3** Now let's calculate the standard deviation of the sample statistic we would need for our 95% confidence interval to have a width of (your answer to q2_2). Remember that for a normal distribution, about 95% of the data lies between *plus and minus* 2 SDs of the mean. Set the variable `target_sd` to that standard deviation.
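
To see how an interval width relates to an SD, consider a hypothetical target width of 0.2 (a made-up number, not the value from q2_2). A 95% interval runs from roughly 2 SDs below the estimate to 2 SDs above it, so

$$ width \approx 4 \times SD $$

and in this hypothetical case the SD would need to be about

$$ SD \approx \frac{0.2}{4} = 0.05 $$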

In [None]:
target_sd = ...
target_sd

In [None]:
_ = ok.grade('q2_3')

**Question 2.4** We also need to calculate the standard deviation of the total population. Calculate this value and store it in the variable `population_sd`. 

In [None]:
population_sd = ...
population_sd

In [None]:
_ = ok.grade('q2_4')

**Question 2.5** Now calculate the required ***sample size*** and store your result as `req_sample_size`. Recall that
$$ Sample \space Statistic \space SD = \frac{Population\space SD}{\sqrt{sample\space size}}$$

In [None]:
req_sample_size = ...
req_sample_size

In [None]:
_ = ok.grade('q2_5')

**Question 2.6** Our required sample size is bigger than our entire population. For each part, say whether it is `True` or `False`.

1. The administration will have to settle for a wider interval to get 95% confidence.
2. Sampling with replacement will be a feasible way to determine the information the administration needs.
3. The administration will have to settle for a lower degree of confidence to get an interval of width 10.
4. We should increase the size of the population until the sample size is smaller than the size of the population.

Set each variable below to either `True` or `False`.


In [None]:
statement_1 = ...
statement_2 = ...
statement_3 = ...
statement_4 = ...



In [None]:
_ = ok.grade('q2_6')

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

## Before submitting, select "Kernel" -> "Restart & Run All" from the menu!

Then make sure that all of your cells ran without error.

In [None]:
_ = ok.submit()