# Homework 8: Confidence Intervals and Sample Size

## Due Sunday December 2nd, 11:59pm

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.

Reading:
- [Chapter 13.3](https://ucsd-dsc10.gitbooks.io/textbook/content/chapters/11/estimation.html): Confidence intervals
- [Chapter 13.4](https://ucsd-dsc10.gitbooks.io/textbook/content/chapters/11/3/confidence-intervals.html): Interpreting confidence intervals
- [Chapter 14](https://ucsd-dsc10.gitbooks.io/textbook/content/chapters/12/why-the-mean-matters.html): Center/spread, the Central Limit Theorem, etc.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck.

In [1]:
#: Don't change this cell; just run it. 

import numpy as np
from datascience import *

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from client.api.notebook import Notebook
ok = Notebook('hw08.ok')
_ = ok.auth(inline=True)

**Important**: The `ok` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach).

Once you're finished, you must do two things:

### a. Turn in to OK
Select "Save and Checkpoint" in the File menu and then execute the `submit` cell below. The result will contain a link that you can use to check that your assignment has been submitted successfully. If you submit more than once before the deadline, we will only grade your final submission.

In [2]:
#: run this to submit your homework
_ = ok.submit()

### b. Turn in PDF to Gradescope
Select File > Download As > PDF via LaTeX. Turn in this PDF file to the corresponding assignment at https://gradescope.com/.
<br>
If you submit more than once before the deadline, we will only grade your final submission.

## 1. Polling


Four candidates are running for President of Dataland. A polling company surveys 1000 people selected uniformly at random from among voters in Dataland, and it asks each one who they are planning on voting for. After compiling the results, the polling company releases the following proportions from their sample:

|Candidate  | Proportion|
|:------------:|:------------:|
|Candidate C | 0.49 |
|Candidate T | 0.36 |
|Candidate J | 0.08 |
|Candidate S | 0.03 |
|Undecided   | 0.04 |

These proportions represent a uniform random sample of the population of Dataland. We will attempt to estimate the corresponding *population parameters* - the proportions of each kind of voter in the entire population.  We will use confidence intervals to compute a range of values that reflects the uncertainty of our estimate.

The table `votes` contains the results of the survey. Candidates are represented by their initials. Undecided voters are denoted by `U`.

In [3]:
#: run this to randomly sample the data -- don't change this cell!
votes = Table().with_column('vote', np.array(['C']*490 + ['T']*360 + ['J']*80 + ['S']*30 + ['U']*40))
num_votes = votes.num_rows
votes.sample()

Below, we have given you code that will use bootstrapped samples to compute estimates of the true proportion of voters who are planning on voting for **Candidate C**.

In [4]:
#: run the bootstrap!
def proportions_in_resamples():
    statistics = make_array()
    for i in np.arange(5000):
        bootstrap = votes.sample()
        sample_statistic = np.count_nonzero(bootstrap.column('vote') == 'C')/num_votes
        statistics = np.append(statistics, sample_statistic)
    return statistics

boot_proportions = proportions_in_resamples()
Table().with_column('Estimated Proportion', boot_proportions).hist(bins=np.arange(0.2,0.6,0.01))

**Question 1.** Using the array `boot_proportions`, compute an approximate 95% confidence interval for the true proportion of voters planning on voting for Candidate C.  (Compute the lower and upper ends of the interval, named `lower_bound` and `upper_bound`, respectively.)

In [5]:
lower_bound = ...
lower_bound

In [6]:
upper_bound = ...
upper_bound

In [7]:
#: print the confidence interval
print("Bootstrapped 95% confidence interval for the proportion of C voters in the population: [{:f}, {:f}]".format(lower_bound, upper_bound))

In [8]:
#: grade
_ = ok.grade('q1_1')

**Question 2.** The survey results seem to indicate that Candidate C is beating Candidate T among voters. We would like to use confidence intervals to determine a range of likely values for her true *lead*. Candidate C's lead over Candidate T is:

$$\text{(Candidate C's proportion of the vote)} - \text{(Candidate T's proportion of the vote)}.$$

Use the bootstrap to compute an approximate distribution for Candidate C's lead over Candidate T, and store your bootstrap estimates in `boot_leads`. Plot a histogram of the resulting samples.

*Hint*: Use the code for `proportions_in_resamples` given to you above as a starting point.

In [9]:
# Use the bootstrap here

boot_leads = ...

# Display a histogram here

In [10]:
#: grade
_ = ok.grade('q1_2')

**Question 3.** Compute an approximate 95% confidence interval for the difference in proportions.

In [11]:
diff_lower_bound = ...
diff_lower_bound

In [12]:
diff_upper_bound = ...
diff_upper_bound

In [13]:
#: print the confidence interval
print("Bootstrapped 95% confidence interval for Candidate C's true lead over Candidate T: [{:f}, {:f}]".format(diff_lower_bound, diff_upper_bound))

In [14]:
#: grade
_ = ok.grade('q1_3')

The staff computed the following 95% confidence interval for the proportion of Candidate C voters: 

$$[.46, .52]$$

(Your answer might have been slightly different, but that doesn't mean it was wrong, since the bootstrap resamples are drawn at random.)

**Question 4.**
Can we say that 95% of the population lies in the range $[.46, .52]$? Explain your answer. 

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

<i>Write your answer here.</i>

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

**Question 5.**
Can we say that there is approximately a 95% probability that the interval $[.46, .52]$ contains the true proportion of the population voting for Candidate C? Explain your answer.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

<i>Write your answer here.</i>

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

**Question 6.**
Suppose we produced 10,000 new samples (each one a uniform random sample of 1,000 voters) and created a 95% confidence interval from each one. Roughly how many of those 10,000 intervals do you expect will actually contain the true proportion of the population? Assign your answer to the variable `how_many` below. It should be the *number* of intervals, not the proportion or percentage.

In [15]:
how_many = ...
how_many

In [16]:
#: grade
_ = ok.grade('q1_6')

**Question 7.**

The staff also created 80%, 90%, and 99% confidence intervals from one sample, but we forgot to label which interval corresponds to which confidence level! Match each interval to its confidence level. (Write the percentage after each interval below.) **Then**, explain your thought process. Tip: Draw the confidence intervals on a piece of paper to help you visualize them.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

**Answers:**

$[.464,.517]$:

$[.47,.511]$:

$[.446,.533]$:


<i>Write your explanation here.</i>

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">


## 2. Grouped Means


Suppose you'd like to know about the ages of the people in a small town.  The local government collects this data about everyone in the town, but to ensure that you don't see any individual's age, it only makes public the number of people of each age.  (This could have been done by calling `group` on the original data table.)  So the first few rows of the dataset look something like this:

In [17]:
#: run this cell, but don't change it!
ages = Table().with_columns('age', [0, 1, 2, 3, 5, 6], 'count', [2, 5, 1, 4, 10, 1])
ages

That means there were 2 people age 0, 5 people age 1, etc. Nobody is age 4.

**Question 1.** You first want to compute the mean age of the people in the town.

Write a function called `grouped_mean`.  It should take as its argument a table like the one above, except that the columns might have different names.  It should return the mean of the numbers in the dataset, assuming the first column contains the numbers themselves and the second column contains the count of each number, as in the example.

*Remember:* Even if you don't know the column name for the first column, you can access it by saying `tbl.column(0)`.

In [18]:
# define your function here


In [19]:
#: here's what your function says about the mean age
grouped_mean(ages)

In [20]:
#: this cell tests your function on two examples
example = Table().with_columns('age', [0, 1, 2, 3, 5, 6], 'count', [2, 5, 1, 4, 10, 1])
example2 = Table().with_columns('age', [10, 11, 12, 23, 25, 26], 'count', [2, 5, 1, 4, 10, 1])

if not (3.258 <= grouped_mean(example) <= 3.265):
    print('Your code fails the first example.')
else:
    print('Your code passes the first example.')
    
if not (19.77 <= grouped_mean(example2) <= 19.8):
    print('Your code fails the second example.')
else:
    print('Your code passes the second example.')

In [21]:
#: grade
_ = ok.grade('q2_1')

**Question 2.**
You next want to summarize how spread out the ages are, so you decide to compute their standard deviation.

Write a function called `grouped_std`.  It should take as its argument a table like the one above, except that the columns might have different names.  It should return the standard deviation of the numbers in the dataset, assuming the first column contains the numbers and the second column contains the count of each number, as in the example.

*Hint:* You can think of the standard deviation as the square root of the mean of a transformed version of the original dataset.  The numbers in the transformed dataset are the squared deviations from the mean.  You've already written a function that computes means of grouped numbers, so that should be useful.
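
To make the hint concrete, here is a minimal sketch on a small, made-up array of ordinary (ungrouped) numbers. The array `values` is purely illustrative and is not part of the assignment data; the sketch only shows that "square root of the mean of the squared deviations" agrees with `np.std`.

```python
import numpy as np

# A small, made-up (ungrouped) dataset, used only for illustration.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = np.mean(values)
squared_deviations = (values - mean) ** 2   # the "transformed" dataset
sd_by_hand = np.sqrt(np.mean(squared_deviations))

# np.std computes the same (population) standard deviation.
print(sd_by_hand, np.std(values))
```

For grouped data, the same idea applies once each squared deviation is weighted by its count.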

In [22]:
# define your function here


In [23]:
#: here's what your function says about the standard deviation of the ages
grouped_std(ages)

In [24]:
#: this cell tests your function on two examples
if not (1.935 <= grouped_std(example) <= 1.945):
    print('Your code fails the first example.')
else:
    print('Your code passes the first example.')
    
if not (6.54 <= grouped_std(example2) <= 6.58):
    print('Your code fails the second example.')
else:
    print('Your code passes the second example.')

In [25]:
#: grade
_ = ok.grade('q2_2')

**Question 3.**
Maybe you aren't sure whether your code for the previous question is correct. We want to test `grouped_mean` and `grouped_std` against the analogous NumPy functions. But to do that, we need to do some preprocessing of the data. There's a NumPy function that will make this easy, but we haven't seen it before. Luckily, NumPy comes with complete documentation. In this problem, we'll get practice in reading it.

The built-in NumPy function `np.std` computes the standard deviation of an array of numbers.  It doesn't work for grouped data, so you couldn't have just used it in your answer to question 2!  But we can use it to check `grouped_std` by manually duplicating each number once for each count, putting the duplicated numbers into an array, and calling `np.mean` or `np.std` on the result. That is, given the following table:

|age|count|
|-|-|
|10|1|
|15|2|

we could create the following array by hand:

$$\verb|make_array(10, 15, 15)|$$

Then we could use `np.std` on this new array to check that our function `grouped_std` works as intended.

But manually creating such an array is a pain! If the town has 1,000 residents, you'll be stuck typing an array of 1,000 entries. It turns out that NumPy has a function, `np.repeat`, which can help us here. The documentation of the function is shown below:

In [26]:
#: run this to read the documentation for `np.repeat`
help(np.repeat)

The documentation above doesn't tell us *exactly* how to do what we want, but it is a good starting point. Try experimenting with the function on some small examples. You can ignore the `axis` keyword argument.
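
For instance, a few small experiments might look like the sketch below. The input values are made up for illustration; the third call reproduces the `10, 15, 15` array from the small table above.

```python
import numpy as np

# Repeat a single value a fixed number of times.
single = np.repeat(7, 3)
print(single)          # [7 7 7]

# Repeat every element the same number of times.
same_count = np.repeat(np.array([10, 15]), 2)
print(same_count)      # [10 10 15 15]

# Give a separate repeat count for each element.
per_element = np.repeat(np.array([10, 15]), np.array([1, 2]))
print(per_element)     # [10 15 15]
```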

Now, using the above documentation, write a function called `ungroup_table` which accepts a table with two columns -- the first being the objects (such as ages), and the second column being the count of each object -- and returns an array with each object duplicated the appropriate number of times. Your function should not assume anything about the column names.

*Hint*: Your function should be simple if you use `np.repeat` properly.

In [27]:
# define your function here


Now we can check to see if your functions above -- `grouped_mean` and `grouped_std` -- agree with the NumPy functions. They should give very similar answers!

In [28]:
#: check this against `grouped_std(ages)`
np.std(ungroup_table(ages))

In [29]:
#: check this against `grouped_mean(ages)`
np.mean(ungroup_table(ages))

In [30]:
#: grade
_ = ok.grade('q2_3')

## 3. Testing the Central Limit Theorem


The Central Limit Theorem tells us that the probability distribution of the sum or average of a large random sample drawn with replacement will be roughly normal, *regardless of the distribution of the population from which the sample is drawn*.

That's a pretty big claim, but the theorem doesn't stop there. It further states that, for the sample average, the standard deviation of this normal distribution is given by $$\frac{\text{sd of the original distribution}}{\sqrt{\text{sample size}}}.$$ In other words, suppose we start with *any distribution* that has standard deviation $\sigma$, take a sample of size $n$ (where $n$ is a large number) from that distribution with replacement, and compute the mean of that sample. If we repeat this procedure many times, then those sample means will have a roughly normal distribution with standard deviation $\frac{\sigma}{\sqrt{n}}$.

That's an even bigger claim than the first one! The proof of the theorem is beyond the scope of this class, but in this exercise, we will be exploring some data to see the CLT in action.
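
As a quick preview of what "seeing the CLT in action" can look like, here is a rough sketch using plain NumPy. The exponential population, the seed, and the names `sigma`, `n`, and `repetitions` are all assumptions chosen for illustration; they are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population: exponential with mean 1 and SD 1.
sigma = 1.0
n = 100              # sample size
repetitions = 5000   # number of sample means to simulate

# Each row is one sample of size n; take the mean of each row.
samples = rng.exponential(scale=1.0, size=(repetitions, n))
sample_means = samples.mean(axis=1)

predicted_sd = sigma / np.sqrt(n)      # what the CLT predicts
empirical_sd = np.std(sample_means)    # what the simulation gives
print(predicted_sd, empirical_sd)
```

The two printed numbers should be close but not identical, because the empirical SD comes from a finite number of random samples.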

**Question 1.** The CLT only applies when sample sizes are "sufficiently large." This isn't a very precise statement. Is 10 large?  How about 50?  The truth is that it depends both on the original population distribution and just how "normal" you want the result to look. Let's use a simulation to get a feel for how the distribution of the sample mean changes as sample size goes up.

Consider a coin flip. If we say `Heads` is $1$ and `Tails` is $0$, then there's a 50% chance of getting a 1 and a 50% chance of getting a 0, which is definitely not a normal distribution.  The average of several coin tosses is equal to the proportion of heads in those coin tosses, so the CLT should apply if we compute the sample proportion of heads many times.
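
The claim that the average of 0/1 coin tosses equals the proportion of heads can be checked on a single small sample. This is only a sketch: the seed and the choice of ten flips are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten fair coin flips, coded as 1 for Heads and 0 for Tails.
flips = rng.integers(0, 2, size=10)

proportion_heads = np.count_nonzero(flips == 1) / len(flips)
print(flips.mean(), proportion_heads)   # the two numbers agree
```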

Write a function called `simulate_sample_n` that takes in a sample size $n$. It should return an array that contains 5000 sample proportions of heads, each from $n$ coin flips.

In [31]:
# define your function here


<div class="hide">\pagebreak</div>
The code below will use the function you just defined to plot the empirical distribution of the sample mean for several different sample sizes. The x- and y-scales are kept the same to facilitate comparisons.

In [32]:
#: run this cell to visualize
bins = np.arange(-0.01,1.05,0.02)

for sample_size in make_array(2, 5, 10, 20, 50, 100, 200, 400):
    Table().with_column('Sample Size: {}'.format(sample_size), simulate_sample_n(sample_size)).hist(bins=bins)
    plots.ylim(0, 30)

You can see that even the means of samples of 10 items follow a roughly bell-shaped distribution.  A sample of 50 items looks quite bell-shaped.

**Question 2.** In the plot for a sample size of 10, why are the bars spaced at intervals of .1, with gaps in between?

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

<i>Write your answer here.</i>

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

<div class="hide">\pagebreak</div>
Now we will test the second claim of the CLT: That the SD of the sample mean is the SD of the original distribution, divided by the square root of the sample size.

We have imported the flight delay data and computed its standard deviation for you.

In [33]:
#: run this cell, but don't change it under penalty of law!
united = Table.read_table('united_summer2015.csv')
united_std = np.std(united.column('Delay'))
united_std

**Question 3.** Write a function called `predict_sd`.  It takes a sample size `n` (a number) as its argument.  It returns the predicted standard deviation of the sample mean for samples of size `n` from the flight delays.

In [34]:
# define your function here


In [35]:
#: the following should be True
39.45 <= predict_sd(1) <= 39.485

In [36]:
#: grade
_ = ok.grade('q3_3')

**Question 4.** Write a function called `empirical_sd` that takes a sample size `n` as its argument. The function should simulate 500 samples of size `n` from the flight delays dataset, and it should return the standard deviation of the **means of those 500 samples**.

*Hint:* This function will be similar to the `simulate_sample_n` function you wrote earlier.

In [37]:
# define your function here


In [38]:
#: this should be True
28 <= empirical_sd(1) <= 50

In [39]:
#: grade
_ = ok.grade('q3_4')

The cell below will plot the predicted and empirical SDs for the delay data for various sample sizes. It may take a few moments to run.

In [40]:
#: run this cell to visualize
sd_table = Table().with_column('Sample Size', np.arange(1,101))
predicted = sd_table.apply(predict_sd, 'Sample Size')
empirical = sd_table.apply(empirical_sd, 'Sample Size')
sd_table = sd_table.with_columns('Predicted SD', predicted, 'Empirical SD', empirical)
sd_table.scatter('Sample Size')

**Question 5.** The empirical SDs are very close to the predicted SDs, but they're not exactly the same.  Why?

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

<i>Write your answer here.</i>

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

## 4. Polling and the Normal Distribution


Michelle is a statistical consultant, and she works for a group that supports Proposition 68 (which would mandate labeling of all horizontal or vertical axes), called Yes on 68.  They want to know how many Californians will vote for the proposition.

Michelle polls a uniform random sample of all California voters, and she finds that 215 of the 400 sampled voters will vote in favor of the proposition.

In [41]:
#: run this cell, but don't change it!
sample = Table().with_columns(
    "Vote",  make_array("Yes", "No"),
    "Count", make_array(215,   185))
sample_size = sum(sample.column("Count"))
sample_proportions = sample.with_column(
    "Proportion", sample.column("Count") / sample_size)
sample_proportions

She uses 10,000 bootstrap resamples to compute a confidence interval for the proportion of all California voters who will vote Yes.  Run the next cell to see the empirical distribution of Yes proportions in the 10,000 resamples.

In [42]:
#: run this cell, but don't change it!
resample_yes_proportions = make_array()
for i in np.arange(10000):
    resample = proportions_from_distribution(sample_proportions, "Proportion", sample_size)
    resample_yes_proportions = np.append(resample_yes_proportions, resample.column("Random Sample").item(0))
Table().with_column("Resample Yes proportion", resample_yes_proportions).hist(bins=np.arange(.2, .8, .01))

**Question 1.**
There are two distributions here: the distribution of opinions on Proposition 68, and the bootstrap distribution of estimates of the proportion in favor of the proposition. The Central Limit Theorem applies to one of these distributions; which is it? Explain *why* the theorem is applicable here.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

<i>Write your answer here.</i>

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

<div class="hide">\pagebreak</div>
In a population whose members are 0 and 1, there is a simple formula for the standard deviation of that population:

$$\text{standard deviation} = \sqrt{(\text{proportion of 0s}) \times (\text{proportion of 1s})}$$

(Figuring out this formula, starting from the definition of the standard deviation, is a fun exercise for those who enjoy algebra -- and who doesn't?)
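
If you would rather check the formula numerically than derive it, a sketch like the following works. The population of 30 zeros and 70 ones is made up for illustration.

```python
import numpy as np

# Hypothetical population: 30 zeros and 70 ones.
population = np.array([0] * 30 + [1] * 70)

proportion_ones = np.mean(population)
proportion_zeros = 1 - proportion_ones

sd_from_formula = np.sqrt(proportion_zeros * proportion_ones)
sd_direct = np.std(population)
print(sd_from_formula, sd_direct)   # both are about 0.458
```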

**Question 2.**
**Without accessing the data in `resample_yes_proportions` in any way**, and instead using only the Central Limit Theorem and the numbers of Yes and No voters in our sample of 400, compute a number `approximate_sd` that's the predicted standard deviation of the array `resample_yes_proportions` according to the Central Limit Theorem. Since you don't know the true proportions of 0s and 1s in the population, use the proportions in the sample instead (since they're probably similar).

In [43]:
approximate_sd = ...
approximate_sd

In [44]:
#: grade
_ = ok.grade('q4_2')

**Question 3.**
Compute the standard deviation of the array `resample_yes_proportions` to verify that your answer to question 2 is approximately right.

In [45]:
exact_sd = ...
exact_sd

In [46]:
#: grade
_ = ok.grade('q4_3')

**Question 4.**
**Still without accessing `resample_yes_proportions` in any way**, compute an approximate 95% confidence interval for the proportion of Yes voters in California.  The cell below draws your interval as a red bar below the histogram of `resample_yes_proportions`; use that to verify that your answer looks right.

*Hint*: Before, we've used `percentile` on the bootstrap distribution to find the bounds for the confidence interval. Now the question says that we can't use the bootstrap distribution -- but we don't need it! We know (from the Central Limit Theorem) that the distribution of the sample mean is Normal with a certain standard deviation. We also know that 95% of the area of the normal distribution falls within a certain number of standard deviations from the mean.

If you're still stuck, try studying [Section 14.3](https://ucsd-ets.github.io/dsc10-fa18-textbook/chapters/14/3/SD_and_the_Normal_Curve) in the textbook.
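
The "certain number of standard deviations" can also be explored by simulation. The sketch below draws from a hypothetical normal distribution (the center of 50 and SD of 4 are arbitrary choices, unrelated to the poll) and measures how much of it falls within 2 SDs of the center.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normal distribution; the center and SD are arbitrary.
center, sd = 50, 4
draws = rng.normal(center, sd, size=100_000)

# Fraction of draws that land within 2 SDs of the center.
within_two_sds = np.mean(np.abs(draws - center) <= 2 * sd)
print(within_two_sds)   # roughly 0.95
```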

In [47]:
lower_limit = ...
lower_limit

In [48]:
upper_limit = ...
upper_limit

In [49]:
#: print the confidence interval
print('lower:', lower_limit, 'upper:', upper_limit)

In [50]:
#: grade
_ = ok.grade('q4_4')

In [51]:
# Run this cell to plot your confidence interval.
Table().with_column('Resample Yes proportion', resample_yes_proportions).hist(bins=np.arange(.2, .8, .01))
plots.plot(make_array(lower_limit, upper_limit), make_array(0, 0), c='r', lw=10);

Your confidence interval should overlap the number 0.5.  That means we can't be very sure whether Proposition 68 is winning, even though the sample Yes proportion is a bit above 0.5.

The Yes on 68 campaign really needs to know whether they're winning. To have more confidence in the result of the poll, they decide to redo it with a larger sample. They'd be happy if the standard deviation of the sample mean were only 0.005.  They ask Michelle to run a new poll with a sample size that's large enough to achieve that.  (Polling is expensive, so the sample also shouldn't be bigger than necessary.)

Michelle consults Chapter 14 of your textbook.  Instead of making the conservative assumption that the population standard deviation is 0.5 (coding Yes voters as 1 and No voters as 0), she decides to assume that it's equal to the standard deviation of the sample,

$$\sqrt{(\text{Yes proportion in the sample}) \times (\text{No proportion in the sample})}.$$

Under that assumption, Michelle computes the smallest sample size necessary in order to be confident that the standard deviation of the sample mean is only 0.005.
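
To illustrate the arithmetic, here is a sketch that uses the conservative assumption instead (a population SD of 0.5). It solves $\frac{\sigma}{\sqrt{n}} = \text{target SD}$ for $n$. The helper name `required_sample_size` is hypothetical and is not something the assignment provides.

```python
import math

def required_sample_size(population_sd, target_sd):
    """Sample size n at which population_sd / sqrt(n) equals target_sd."""
    return (population_sd / target_sd) ** 2

# Conservative assumption: population SD of 0.5, target SD of 0.005.
n = required_sample_size(0.5, 0.005)
print(n)                      # about 10000
print(0.5 / math.sqrt(n))     # about 0.005
```

When the result is not a whole number, it would be rounded up to the next integer, since a sample size must be a whole number of voters.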

**Question 5.**
What sample size did she find? Assign your answer to the variable `sample_size`.

In [52]:
sample_size = ...
sample_size

In [53]:
#: grade
_ = ok.grade('q4_5')

To submit:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Submit using the cell below.
4. Save the PDF and submit it to Gradescope.

In [54]:
#: Run all tests at once
import os
_ = [ok.grade(q[:-3]) for q in os.listdir('tests') if q.startswith('q')]

## Before submitting, select "Kernel" -> "Restart & Run All" from the menu!

Then make sure that all of your cells ran without error.

In [55]:
#: run this to submit your homework
_ = ok.submit()

## Don't forget to submit to both OK and Gradescope!