In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab08.ipynb")

# Lab 8: Confidence Intervals

**Helpful Resource:**
- [Python Reference](http://data8.org/fa21/python-reference.html): Cheat sheet of helpful array & table methods used in Data 8!

**Recommended Readings**: 
* [Ch 13. Estimation](https://www.inferentialthinking.com/chapters/13/Estimation)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again. For all problems that ask you to write explanations and sentences, you **must** provide your answer in the designated space. **Moreover, throughout this lab and all future ones, please be sure not to reassign variables throughout the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Note: This lab has hidden tests. That means that even if the visible tests say 100% passed, your final grade may not be 100%. We will run more tests for correctness once everyone has turned in the lab.**

In [None]:
# Run this cell to set up the notebook.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines make plots look nice and hide some messy Python warnings.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.simplefilter('ignore', np.VisibleDeprecationWarning)

## 1. Spring Street Restaurants (30 pts)



We are trying to determine the best restaurant on Spring Street. We surveyed 1,500 Williams students selected uniformly at random and asked each student which of the following four restaurants is the best. (*Note: This data is entirely fabricated for the purposes of this lab.*) The choices are Pera, Blue Mango, Spring St. Market, and Taste of India. After compiling the results, we release the following percentages from our sample:

| Restaurant  | Percentage|
|:------------ |:------------:|
|Pera | 8.2% |
|Blue Mango | 52.8% |
|Spring St. Market | 25% |
|Taste of India | 14% |

These percentages come from a uniform random sample of the population of Williams students. We will attempt to estimate the corresponding *parameters*, that is, the percentage of the votes that each restaurant would receive from the population (i.e. all Williams students). We will use confidence intervals to compute a range of values that reflects the uncertainty of our estimates.

The table `votes` contains the results of the survey.

In [None]:
# Just run this cell
votes = Table.read_table('votes.csv')
votes

#### Part 1.1 (5 pts)


 Complete the function `one_resampled_percentage` below. It should return Blue Mango's **percentage** of votes after taking the original table (`tbl`) and performing one bootstrap sample of it. Reminder that a percentage is between 0 and 100. 

*Note:* `tbl` will always be in the same format as `votes`.
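
As a rough illustration of what one bootstrap resample involves, here is a minimal sketch that uses plain NumPy on a small made-up array instead of the `votes` table. The names `toy_votes` and `toy_resampled_percentage` are hypothetical and are not part of this lab; your own solution should work with `tbl`.

```python
import numpy as np

# Hypothetical toy sample of 10 votes (not the lab's data).
toy_votes = np.array(["A", "B", "A", "A", "C", "B", "A", "C", "A", "B"])

def toy_resampled_percentage(sample, choice, rng):
    """Percentage (0 to 100) of `choice` in one bootstrap resample of `sample`."""
    # A bootstrap resample draws with replacement and keeps the original sample size.
    resample = rng.choice(sample, size=len(sample), replace=True)
    return 100 * np.mean(resample == choice)

rng = np.random.default_rng(0)
pct = toy_resampled_percentage(toy_votes, "A", rng)
```

Because the resample is random, `pct` changes from draw to draw; that variability is what the bootstrap uses to quantify uncertainty.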



In [None]:
def one_resampled_percentage(tbl):
    ...

one_resampled_percentage(votes)

In [None]:
grader.check("q1.1")

#### Part 1.2 (5 pts)


 Complete the `percentages_in_resamples` function such that it simulates and returns an array of 2000 bootstrapped estimates of the percentage of voters who will vote for Blue Mango. You should use the `one_resampled_percentage` function you wrote above. 

In [None]:
def percentages_in_resamples():
    percentage_imm = make_array()
    ...


In [None]:
grader.check("q1.2")

In the following cell, we run the function you just defined, `percentages_in_resamples`, and create a histogram of the calculated statistic for the 2000 bootstrap estimates of the percentage of voters who voted for Blue Mango. 

*Note:* This might take a few seconds to run.

In [None]:
resampled_percentages = percentages_in_resamples()
Table().with_column('Estimated Percentage', resampled_percentages).hist("Estimated Percentage")

#### Part 1.3 (5 pts)


 Using the array `resampled_percentages`, find the values at the two edges of the middle 95% of the bootstrapped percentage estimates. (Compute the lower and upper ends of the interval, named `imm_lower_bound` and `imm_upper_bound`, respectively.) 

*Hint:* If you are stuck on this question, try looking over [Chapter 13](https://inferentialthinking.com/chapters/13/Estimation.html) of the textbook.
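
The idea of "the middle 95%" can be sketched on a made-up array: leave 2.5% of the values in each tail. This sketch assumes NumPy's `np.percentile`, which interpolates between values by default, so it can give slightly different numbers than the `percentile` function from the `datascience` library. The names `toy_estimates`, `toy_lower`, and `toy_upper` are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for an array of bootstrap estimates: 1, 2, ..., 100.
toy_estimates = np.arange(1, 101)

# The middle 95% leaves 2.5% of the values in each tail.
toy_lower = np.percentile(toy_estimates, 2.5)
toy_upper = np.percentile(toy_estimates, 97.5)

# Fraction of the estimates that fall inside the interval.
fraction_inside = np.mean((toy_estimates >= toy_lower) & (toy_estimates <= toy_upper))
```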

In [None]:
imm_lower_bound = ...
imm_upper_bound = ...
print(f"Bootstrapped 95% confidence interval for the percentage of Blue Mango voters in the population: [{imm_lower_bound:.2f}, {imm_upper_bound:.2f}]")

In [None]:
grader.check("q1.3")

#### Part 1.4 (5 pts)


The survey results seem to indicate that Blue Mango is beating all the other restaurants combined among voters. We would like to use confidence intervals to determine a range of likely values for Blue Mango's true lead over all the other restaurants combined. The calculation for Blue Mango's lead over Pera, Spring St. Market, and Taste of India combined is:

$$\text{Blue Mango's \% of the vote} - (100\% - \text{Blue Mango's \% of the vote})$$

Define the function `one_resampled_difference` that returns **exactly one value** of Blue Mango's percentage lead over Pera, Spring St. Market, and Taste of India combined from one bootstrap sample of `tbl`. 

*Hint 1:* Blue Mango's lead can be negative.

*Hint 2:* Given a table of votes, how can you figure out what percentage of the votes are for a certain restaurant? **Be sure to use percentages, not proportions, for this question!**
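
To see why the lead can be negative, here is a small sketch of the arithmetic in the formula above, applied to made-up percentages. The helper `toy_lead` is hypothetical and works on a single number, not on a table.

```python
def toy_lead(percentage):
    """Lead, in percentage points, of one option over all other options combined."""
    return percentage - (100 - percentage)

lead_ahead = toy_lead(60)   # 60 - 40: the option is ahead
lead_behind = toy_lead(45)  # 45 - 55: the option is behind, so the lead is negative
```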



In [None]:
def one_resampled_difference(tbl):
    bootstrap = tbl.sample()  # one bootstrap resample of tbl (same size, with replacement)
    imm_percentage = ...
    return imm_percentage


In [None]:
grader.check("q1.4")

<!-- BEGIN QUESTION -->

#### Part 1.5 (5 pts)


 Write a function called `leads_in_resamples` that finds 2000 bootstrapped estimates (the result of calling `one_resampled_difference`) of Blue Mango's lead over Pera, Spring St. Market, and Taste of India combined. Plot a histogram of the resulting samples. 

*Hint:* If you see an error involving “NoneType”, consider what components a function needs to have. 
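
The "NoneType" hint can be illustrated with two made-up functions (both names are hypothetical): a Python function without a `return` statement returns `None`, even if it computes something internally.

```python
def toy_no_return():
    values = [1, 2, 3]  # builds a list but never returns it

def toy_with_return():
    values = [1, 2, 3]
    return values  # hands the list back to the caller

missing = toy_no_return()    # None, because the function has no return statement
present = toy_with_return()  # the list built inside the function
```

Passing something like `missing` to a method that expects an array is a common way to trigger an error that mentions `NoneType`.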

In [None]:
def leads_in_resamples():
    ...

sampled_leads = leads_in_resamples()
Table().with_column('Estimated Lead', sampled_leads).hist("Estimated Lead")

<!-- END QUESTION -->

#### Part 1.6 (5 pts)


Use the simulated data in `sampled_leads` from Part 1.5 to compute an approximate 95% confidence interval for Blue Mango's true lead over Pera, Spring St. Market, and Taste of India combined.



In [None]:
diff_lower_bound = ...
diff_upper_bound = ...
print("Bootstrapped 95% confidence interval for Blue Mango's true lead over Pera, Spring St. Market, and Taste of India combined: [{:f}%, {:f}%]".format(diff_lower_bound, diff_upper_bound))

In [None]:
grader.check("q1.6")

## 2. Interpreting Confidence Intervals (25 pts)



The staff computed the following 95% confidence interval for the percentage of Blue Mango voters: 

$$[50.40, 55.40]$$

(Your answer may have been a bit different due to randomness; that doesn't mean it was wrong!)

<!-- BEGIN QUESTION -->

#### Part 2.1 (5 pts)


The staff also created 70%, 90%, and 99% confidence intervals from the same sample, but forgot to label which interval corresponds to which confidence level! First, match each confidence level (70%, 90%, 99%) with its corresponding interval in the cell below (e.g. __ % CI: [52.1, 54] $\rightarrow$ replace the blank with one of the three confidence levels). **Then**, explain your thought process and how you came up with your answers.

The intervals are below:

* [51.47, 54.20]
* [49.60, 56.13]
* [50.80, 55.00]



_Type your answer here, replacing this text._

<!-- END QUESTION -->

#### Part 2.2 (5 pts)


 Suppose we produced 5,000 new samples (each one a uniform random sample of 1,500 voters/students) from the population and created a 95% confidence interval from each one. Roughly how many of those 5,000 intervals do you expect will actually contain the true percentage of the population? 

Assign your answer to `true_percentage_intervals`.



In [None]:
true_percentage_intervals = ...

In [None]:
grader.check("q2.2")

Recall the second bootstrap confidence interval you created, which estimated Blue Mango's lead over Pera, Spring St. Market, and Taste of India combined. Among voters in the sample, Blue Mango's lead was 6%. The staff's 95% confidence interval for the true lead (in the population of all voters) was:

$$[0.933\%, 10.933\%]$$

Suppose we are interested in testing a simple yes-or-no question:

> "Is the percentage of votes for Blue Mango equal to the percentage of votes for Pera, Spring St. Market, and Taste of India combined?"

Our null hypothesis is that the percentages are equal, or equivalently, that Blue Mango's lead is exactly 0. Our alternative hypothesis is that Blue Mango's lead is not equal to 0.  In the questions below, don't compute any confidence interval yourself - use only the staff's 95% confidence interval.

#### Part 2.3 (5 pts)


 Say we use a 5% p-value cutoff. Do we reject the null, fail to reject the null, or are we unable to tell using the staff's confidence interval? 

Assign `restaurants_equal` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval

*Hint:* Consider the relationship between the p-value cutoff and confidence. If you're confused, take a look at [this chapter](https://inferentialthinking.com/chapters/13/4/Using_Confidence_Intervals.html) of the textbook.
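
The relationship in the hint can be sketched as a simple check, assuming the approach from the textbook chapter linked above: a 95% confidence interval can be used to test whether a parameter equals a specific value at the 5% cutoff, and the null is rejected when that value lies outside the interval. The helper `toy_rejects_null` and the intervals below are made up for illustration and are not the staff's interval.

```python
def toy_rejects_null(lower, upper, null_value):
    """True if `null_value` lies outside the interval [lower, upper]."""
    return not (lower <= null_value <= upper)

# Made-up 95% intervals tested against a null value of 0 at the 5% cutoff.
contains_zero = toy_rejects_null(-2.0, 3.0, 0)  # 0 is inside, so the null is not rejected
excludes_zero = toy_rejects_null(1.0, 4.0, 0)   # 0 is outside, so the null is rejected
```

Note that this check only pairs a confidence level with its matching cutoff (95% with 5%); think carefully about what a 95% interval can and cannot tell you at other cutoffs.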



In [None]:
restaurants_equal = ...

In [None]:
grader.check("q2.3")

#### Part 2.4 (5 pts)


What if, instead, we use a p-value cutoff of 1%? Do we reject the null, fail to reject the null, or are we unable to tell using our staff confidence interval?

Assign `cutoff_one_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval



In [None]:
cutoff_one_percent = ...


In [None]:
grader.check("q2.4")

#### Part 2.5 (5 pts)


What if we use a p-value cutoff of 10%? Do we reject the null, fail to reject the null, or are we unable to tell using our staff confidence interval?

Assign `cutoff_ten_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval



In [None]:
cutoff_ten_percent = ...

In [None]:
grader.check("q2.5")

You're done with Lab 8!  

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Go to [Gradescope](https://www.gradescope.com/) and submit the zip file to the corresponding assignment. The name of this assignment is "Lab 8 Autograder". 

**Be sure your work is saved before running the last cell.**

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()