In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw08.ipynb")

# Homework 8: Confidence Intervals

**Helpful Resource:**

- [Python Reference](http://data8.org/sp22/python-reference.html): Cheat sheet of helpful array & table methods used in Data 8!

**Recommended Reading**:

* [Estimation](https://www.inferentialthinking.com/chapters/13/Estimation)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For all problems that ask for a written explanation, you **must** provide your answer in the designated space. **Moreover, throughout this homework and all future ones, please be sure not to reassign variables throughout the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!


**Note: This homework has hidden tests on it. That means even though the tests may say 100% passed, it doesn't mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework.**


Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

You should start early so that you have time to get help if you're stuck.

In [None]:
# Don't change this cell; just run it.

import numpy as np
from datascience import *


# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## A Note on Confidence Intervals

**Remember:** A *confidence interval* is a range of numbers that is likely to contain a mean (or other summary statistic) of the population. How likely the interval is to contain the *actual* mean is expressed as a percentage called the *confidence level*.


This homework focuses on exploring confidence intervals from **two viewpoints**:


1.   Constructing confidence intervals from our given data based on the assumption that our data is *normal* (meaning the underlying data is bell-shaped). This methodology requires a concept called the $z$-score, which we will go through below.
2.   Constructing confidence intervals using a bootstrap approach. This methodology requires no real assumptions about the underlying data being normal, but rather leverages the idea that we can *sample* from a large data set to get an estimate of what our confidence intervals should be.

These two methodologies are used *everywhere* in data science. You will explore data through statistical analysis, resampling, anomaly detection, and visualizations, with hands-on coding exercises and interpretive questions.

## 0. NOAA Tide Gauge Water Level Analysis

This part of the homework focuses on NOAA tidal water gauges, which provide data that Cal Maritime Oceanographer Dr. Maryam Mohammadpour uses in her research, including to understand things like tsunamis!



### Part 1: Exploratory Data Analysis

We're working with a dataset of water levels over time. Our goal is to identify trends and anomalies.

Run the code below to load the data set.

In [None]:
data = Table.read_table('NOAA_Tide_Gauge_Water_Level_Dataset.csv')
print(data)

**Question A.** What variables do you notice in the dataset? What do you think each variable represents?

*Type your answer here, replacing this text.*

Edit the code below to compute the mean, median, and standard deviation to understand the data's central tendencies and spread.

In [None]:
water_levels = data.column('Water_Level_m')
mean_water = np.mean(...)
median_water = ...
std_dev_water = ...
print(f"Mean: {mean_water}, Median: {median_water}, Std Dev: {std_dev_water}")

**Question B.** How does the mean help us summarize the data? Why is it important to also consider the standard deviation?


*Type your answer here, replacing this text.*

### Part 2: Visualizing Trends

Now, we will plot the data to help us visualize patterns over time. Run the code below.

In [None]:
# Convert 'Timestamp' to a datetime object for plotting
data_with_datetime = data.with_column(
    'Datetime', data.apply(lambda x: np.datetime64(x), 'Timestamp')
)

# Extract datetime and water level columns
timestamps = data_with_datetime.column('Datetime')
water_levels = data_with_datetime.column('Water_Level_m')

# Create a line plot with improved aesthetics
plt.figure(figsize=(12, 6))  # Larger figure size for better readability
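plt.plot(timestamps, water_levels, linewidth=1, color='blue', label='Water Level (m)')

# Label the axes and show the legend so the plot is self-explanatory
plt.xlabel('Date')
plt.ylabel('Water Level (m)')
plt.title('Water Levels Over Time')
plt.legend()
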
# Display the plot
plt.show()

**Question C:**
What trends do you notice in the graph? Are there any sudden spikes or drops?

*Type your answer here, replacing this text.*

### Part 3: Aggregating Data

Aggregating data into daily, monthly, or yearly averages helps identify broader trends. Let's now look at annual data to better understand trends over longer timescales.
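To make the idea of aggregation concrete, here is a minimal sketch on a handful of made-up readings. The timestamps, values, and `toy_` names are hypothetical, and the sketch uses plain Python with NumPy rather than the `datascience` table methods used in this notebook.

```python
import numpy as np

# Hypothetical readings: (timestamp string, water level in meters)
toy_readings = [
    ("2020-01-01 00:00", 1.0),
    ("2020-06-01 00:00", 2.0),
    ("2021-01-01 00:00", 3.0),
    ("2021-06-01 00:00", 5.0),
]

# Collect the water levels recorded in each year
# (the year is the first four characters of the timestamp).
toy_levels_by_year = {}
for timestamp, level in toy_readings:
    toy_levels_by_year.setdefault(timestamp[:4], []).append(level)

# Average the readings within each year.
toy_yearly_means = {
    year: float(np.mean(levels)) for year, levels in toy_levels_by_year.items()
}
print(toy_yearly_means)  # {'2020': 1.5, '2021': 4.0}
```

Four readings collapse into two yearly averages: the long-term change is easier to see, but the variation within each year is no longer visible.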

Run the code in the cell below.

In [None]:
# Extract year from Timestamp
data_with_year = data.with_column('Year', data.apply(lambda x: x[:4], 'Timestamp'))

# Group by year and calculate the mean water level for each year
yearly_averages = data_with_year.group('Year', np.mean)

# Rename the 'Water_Level_m mean' column to 'Average Water Level'
yearly_averages = yearly_averages.relabel('Water_Level_m mean', 'Average Water Level')

yearly_averages

**Question D.** What are the benefits and drawbacks of analyzing yearly data instead of hourly?

*Type your answer here, replacing this text.*

### Part 4: Identifying Anomalies

Anomalies are data points significantly different from others. These might indicate unusual events like floods or droughts.

One basic way to do this is to compare our data to the mean, and then measure how far away the data is from the mean. If a data point is very far from the mean, it could be considered anomalous. In other words, it is helpful to look at:

$X - \bar{X}$

where $\bar{X}$ is the mean of $X$.
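As a small illustration of $X - \bar{X}$, the sketch below subtracts the mean from a made-up array. The values and the `toy_` names are hypothetical and are not part of the tide gauge data.

```python
import numpy as np

toy_levels = np.array([1.0, 2.0, 3.0, 6.0])  # hypothetical water levels (m)
toy_mean = np.mean(toy_levels)               # 3.0

# Subtracting a single number from an array subtracts it from every element.
toy_deviations = toy_levels - toy_mean
print(toy_deviations)  # [-2. -1.  0.  3.]
```

The deviations always add up to zero, so a large positive or negative entry marks a value that sits far from the mean.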

First, let's edit the code below to define `water_level_from_mean` as the difference between `water_levels` and the mean water level we found earlier.

In [None]:
# Calculate the differences by subtracting the mean
water_level_from_mean = ...

# Create a new table with the differences from the mean
difference_data = data_with_datetime.select('Datetime', 'Water_Level_m').with_column('Water_Level_From_Mean_m', water_level_from_mean)

# Display the first 5 rows of the new table to verify
# (show() displays the rows itself and returns None, so it is not wrapped in print)
difference_data.show(5)

Now that we have water levels measured as differences from the mean water level, we want to understand these differences in a standard way.

For this, we will use our standard deviation (which is a measure of the typical spread of the data) as a way to "normalize" the `water_level_from_mean` data. This is what we call a $z$-score, which is defined by:

$ \displaystyle z = \frac{X - \bar{X}}{s}$

where $s$ is the standard deviation of our data.
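The sketch below works through the $z$-score formula on a made-up array chosen so the numbers come out cleanly (mean 5, standard deviation 2). The values and `toy_` names are hypothetical. It assumes the standard deviation is computed with `np.std`, which by default divides by $n$.

```python
import numpy as np

toy_values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data

toy_mean = np.mean(toy_values)  # 5.0
toy_sd = np.std(toy_values)     # 2.0 (np.std divides by n by default)

# z-score: how many standard deviations each value lies from the mean.
toy_z = (toy_values - toy_mean) / toy_sd
print(toy_z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

After this conversion the values have mean 0 and standard deviation 1, which is what lets us compare them against fixed thresholds.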

Edit the code below to create a `normalized_water_level`.

In [None]:
normalized_water_level = ...

# Create a new table with the normalized water levels
normalized_data = difference_data.with_column('Normalized_Water_Level', normalized_water_level)

# Display the first few rows of the new table to verify
# (show() displays the rows itself and returns None, so it is not wrapped in print)
normalized_data.show(5)

Notice the differences between the three water level columns in this data table. These are simply three different ways to display the same data.

The normalized data, however, is nice to consider because we have a general understanding of data like this:


*   **Numbers that are bigger than 2** (in absolute value) represent data that is outside of the middle 95% (approximately) of data (if the data is *normal* or shaped like a bell curve). In other words, these data are in the **top 5% of possible anomalies or outliers**.
*   **Numbers that are bigger than 3** (in absolute value) represent data that is outside of the middle 99% (approximately) of data (if the data is *normal* or shaped like a bell curve). In other words, these data are in the **top 1% of possible anomalies or outliers**.
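These rules of thumb can be checked by simulation. The sketch below draws values from a standard normal curve (simulated data, not the tide gauge readings; the seed and the `toy_` names are arbitrary choices) and measures the fraction that falls beyond 2 and beyond 3 in absolute value.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Simulated z-scores from a bell-shaped (standard normal) distribution.
toy_draws = rng.standard_normal(100_000)

# Averaging an array of True/False values gives the fraction that are True.
toy_frac_beyond_2 = np.mean(np.abs(toy_draws) > 2)
toy_frac_beyond_3 = np.mean(np.abs(toy_draws) > 3)

print(f"Beyond 2: {toy_frac_beyond_2:.4f}, beyond 3: {toy_frac_beyond_3:.4f}")
```

For a normal curve the exact fractions are about 4.6% and 0.3%, so the simulated fractions should land close to those values, in line with the approximate 5% and 1% figures above.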


Edit the code below to keep only the rows whose normalized water level is strictly greater than 3 in absolute value.

In [None]:
# Filter the normalized data to remove values between -3 and 3
anomaly_data = normalized_data.where(
    abs(normalized_data.column('Normalized_Water_Level') )>...
)

# Display the first few rows of the filtered data
# (show() displays the rows itself and returns None, so it is not wrapped in print)
anomaly_data.show(5)

Now let's plot these potential anomalies:


In [None]:
# Create the scatter plot

plt.scatter(anomaly_data.column('Datetime'), anomaly_data.column('Water_Level_m'), color='red', label='Anomalies')
plt.xlabel('Year')
plt.ylabel('Water Level (m)')
plt.title('Scatter Plot of Water Levels Over Time')
plt.legend()
plt.show()

**Question E.** What might cause the anomalies you identified? Are these anomalies clustered during specific times?

*Type your answer here, replacing this text.*

### Part 5: Confidence Intervals

A confidence interval provides a range where the true mean is likely to fall. To define one, we choose a *confidence level* (such as 95%) that corresponds to the likelihood that the true mean lies within the interval we are going to find. In this scenario, we again use a $z$-score, but in reverse of what we just did for the anomaly data:



*   **Choosing 95%** means that $z$ is approximately 2.
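*   **Choosing 99%** means that $z$ is approximately 3 (a rounded-up rule of thumb: the normal curve gives about 2.58 for 99%, and 3 corresponds to roughly 99.7%).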

Then, we can define a margin of error for our confidence interval using the formula:

margin of error $\displaystyle = z\frac{s}{\sqrt{n}}$

where again $s$ represents the standard deviation of our data, and $n$ is the number of data points in our data set. Another name for $\frac{s}{\sqrt{n}}$ is the *standard error*.
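Here is the arithmetic worked through on made-up data chosen so the numbers come out cleanly ($n = 16$, mean 5, standard deviation 2). The values and `toy_` names are hypothetical. It assumes $s$ is computed with `np.std`, which by default divides by $n$.

```python
import numpy as np

# Hypothetical data: 16 values with mean 5 and standard deviation 2.
toy_values = np.tile([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], 2)

toy_n = len(toy_values)               # 16
toy_mean = np.mean(toy_values)        # 5.0
toy_sd = np.std(toy_values)           # 2.0
toy_se = toy_sd / np.sqrt(toy_n)      # standard error: 2 / 4 = 0.5

toy_z = 2                             # approximate z for a 95% confidence level
toy_margin = toy_z * toy_se           # margin of error: 1.0

toy_interval = (toy_mean - toy_margin, toy_mean + toy_margin)  # (4.0, 6.0)
print(f"Margin of error: {toy_margin}, interval: [{toy_interval[0]}, {toy_interval[1]}]")
```

Because $n$ sits under a square root, collecting four times as much data only halves the margin of error.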

Edit the code below to find our margin of error for our original water gauge data, assuming we want to use a 95% confidence level.


In [None]:
z = 2
margin_of_error = ...
print(f"Margin of Error: {margin_of_error}")

Now let's compute our 95% confidence interval by editing the code below.

In [None]:
lower_bound = mean_water - margin_of_error
upper_bound = mean_water + ...
print(f"Confidence Interval: [{lower_bound}, {upper_bound}]")

Now try re-running the two code cells above, but for a 99% confidence interval.


**Question F.** How does increasing the confidence level affect the interval? Why?


*Type your answer here, replacing this text.*

## 1. Thai Restaurants in Berkeley

Now we turn to a different way to view a similar problem: bootstrapped confidence intervals.

Oswaldo and Varun are trying to see what the best Thai restaurant in Berkeley is. They survey 1,500 UC Berkeley students selected uniformly at random and ask each student what Thai restaurant is the best. (*Note: This data is fabricated for the purposes of this homework.*) The choices of Thai restaurants are [Lucky House](https://www.google.com/maps/place/Lucky+House+Thai+Cuisine/@37.8707428,-122.270045,15.32z/data=!4m5!3m4!1s0x80857e9e69a8c921:0x7b6d80f58406fb26!8m2!3d37.8721393!4d-122.2672699), [Imm Thai](https://www.google.com/maps/place/Imm+Thai+Street+Food/@37.8704926,-122.2687372,15.51z/data=!4m5!3m4!1s0x80857e9eec4f1e63:0x5f54d96f0dccdb72!8m2!3d37.8719079!4d-122.2691186), [Thai Temple](https://www.google.com/maps/place/Wat+Mongkolratanaram/@37.8689514,-122.2698649,14.75z/data=!4m5!3m4!1s0x80857e886e39daf1:0xe309caa1b5710fc0!8m2!3d37.8563633!4d-122.2707584), and [Thai Basil](https://www.google.com/maps/place/Thai+Basil/@37.8691911,-122.266539,15.37z/data=!4m5!3m4!1s0x80857c2f6ae0e2f1:0x6978b6e8a72d58d4!8m2!3d37.868327!4d-122.258081). After compiling the results, Oswaldo and Varun release the following percentages from their sample:

|Thai Restaurant  | Percentage|
|:------------:|:------------:|
|Lucky House | 8% |
|Imm Thai | 53% |
|Thai Temple | 25% |
|Thai Basil | 14% |

These percentages represent a uniform random sample of the population of UC Berkeley students. We will attempt to estimate the corresponding *parameters*, or the percentage of the votes that each restaurant will receive from the population (i.e. all UC Berkeley students). We will use confidence intervals to compute a range of values that reflects the uncertainty of our estimates.
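As a warm-up on unrelated data, here is a minimal sketch of a bootstrap confidence interval for a *mean*. The sample is simulated, the seed and `toy_` names are arbitrary, and NumPy's `np.percentile` and `rng.choice` stand in for the table methods you will use in the questions below.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is repeatable

# Hypothetical sample of 200 measurements (not the survey data).
toy_sample = rng.normal(loc=10, scale=2, size=200)

# Each bootstrap resample draws as many values as the original sample,
# with replacement, and records the statistic of interest (here, the mean).
toy_boot_means = np.array([
    np.mean(rng.choice(toy_sample, size=len(toy_sample), replace=True))
    for _ in range(1000)
])

# The middle 95% of the bootstrap statistics is an approximate 95% interval.
toy_lower, toy_upper = np.percentile(toy_boot_means, [2.5, 97.5])
print(f"Approximate 95% interval for the mean: [{toy_lower:.2f}, {toy_upper:.2f}]")
```

Nothing here assumes the data are bell-shaped: the spread of the resampled means alone determines the width of the interval.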

The table `votes` contains the results of Oswaldo and Varun's survey.

In [None]:
# Just run this cell
votes = Table.read_table('votes.csv')
votes

**Question 1.1.** Complete the function `one_resampled_percentage` below. It should return Imm Thai's ***percentage*** of votes after taking the original table (`tbl`) and performing one bootstrap sample of it. Reminder that a percentage is between 0 and 100. **(9 Points)**

*Note:* `tbl` will always be in the same format as `votes`.

*Hint:* Given a table of votes, how can you figure out what percentage of the votes are for a certain restaurant? **Be sure to use percentages, not proportions, for this question!**


In [None]:
def one_resampled_percentage(tbl):
    ...

one_resampled_percentage(votes)

In [None]:
grader.check("q1_1")

**Question 1.2.** Complete the `percentages_in_resamples` function such that it simulates and returns an array of 2022 elements, where each element represents a bootstrapped estimate of the percentage of voters who will vote for Imm Thai. You should use the `one_resampled_percentage` function you wrote above. **(9 Points)**


In [None]:
def percentages_in_resamples():
    percentage_imm = make_array()
    ...

In [None]:
grader.check("q1_2")

In the following cell, we run the function you just defined, `percentages_in_resamples`, and create a histogram of the calculated statistic for the 2022 bootstrap estimates of the percentage of voters who voted for Imm Thai.

*Note:* This might take a few seconds to run.

In [None]:
resampled_percentages = percentages_in_resamples()
Table().with_column('Estimated Percentage', resampled_percentages).hist("Estimated Percentage")

**Question 1.3.** Using the array `resampled_percentages`, find the values at the two edges of the middle 95% of the bootstrapped percentage estimates. Compute the lower and upper ends of the interval, named `imm_lower_bound` and `imm_upper_bound` respectively. **(9 Points)**

*Hint:* If you are stuck on this question, try looking over [Chapter 13](https://inferentialthinking.com/chapters/13/Estimation.html) of the textbook.


In [None]:
imm_lower_bound = ...
imm_upper_bound = ...
print(f"Bootstrapped 95% confidence interval for the percentage of Imm Thai voters in the population: [{imm_lower_bound:.2f}, {imm_upper_bound:.2f}]")

In [None]:
grader.check("q1_3")

**Question 1.4.** The survey results seem to indicate that Imm Thai is beating all the other Thai restaurants among the voters. We would like to use confidence intervals to determine a range of likely values for Imm Thai's true lead over all the other restaurants combined. The calculation for Imm Thai's lead over Lucky House, Thai Temple, and Thai Basil combined is:

$$\text{Imm Thai's percent of the vote} - (\text{100 percent} - \text{Imm Thai's percent of vote})$$

Define the function `one_resampled_difference` that returns **exactly one value** of Imm Thai's percentage lead over Lucky House, Thai Temple, and Thai Basil combined from one bootstrap sample of `tbl`. **(9 Points)**

*Hint 1:* Imm Thai's lead can be negative.

*Hint 2:* Given a table of votes, how can you figure out what percentage of the votes are for a certain restaurant? **Be sure to use percentages, not proportions, for this question!**

*Note:* If the skeleton code provided within the function is not helpful for you, feel free to approach the question using your own variables.


In [None]:
def one_resampled_difference(tbl):
    bootstrap = ...
    imm_percentage = ...
    ...

In [None]:
grader.check("q1_4")

<!-- BEGIN QUESTION -->

**Question 1.5.** Write a function called `leads_in_resamples` that returns an array of 2022 elements representing the bootstrapped estimates (the result of calling `one_resampled_difference`) of Imm Thai's lead over Lucky House, Thai Temple, and Thai Basil combined. Afterwards, run the cell to plot a histogram of the resulting samples. **(9 Points)**

*Hint:* If you see an error involving `NoneType`, consider what components a function needs to have!


In [None]:
def leads_in_resamples():
    ...

sampled_leads = leads_in_resamples()
Table().with_column('Estimated Lead', sampled_leads).hist("Estimated Lead")

<!-- END QUESTION -->

**Question 1.6.** Use the simulated data in `sampled_leads` from Question 1.5 to compute an approximate 95% confidence interval for Imm Thai's true lead over Lucky House, Thai Temple, and Thai Basil combined. **(9 Points)**


In [None]:
diff_lower_bound = ...
diff_upper_bound = ...
print("Bootstrapped 95% confidence interval for Imm Thai's true lead over Lucky House, Thai Temple, and Thai Basil combined: [{:f}%, {:f}%]".format(diff_lower_bound, diff_upper_bound))

In [None]:
grader.check("q1_6")

## 2. Interpreting Confidence Intervals

The staff computed the following 95% confidence interval for the percentage of Imm Thai voters:

$$[50.53, 55.53]$$

(Your answer may have been a bit different due to randomness; that doesn't mean it was wrong!)

<!-- BEGIN QUESTION -->

**Question 2.1.** The staff also created 70%, 90%, and 99% confidence intervals from the same sample, but we forgot to label which confidence interval represented which percentages! **First**, match each confidence level (70%, 90%, 99%) with its corresponding interval in the cell below (e.g. __ % CI: [52.1, 54] $\rightarrow$ replace the blank with one of the three confidence levels). **Then**, explain your thought process and how you came up with your answers. **(10 Points)**

The intervals are below:

* [50.03, 55.94]
* [52.1, 54]
* [50.97, 54.99]



_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.2.** Suppose we produced 6,000 new samples (each one a new/distinct uniform random sample of 1,500 students) from the population and created a 95% confidence interval from each one. Roughly how many of those 6,000 intervals do you expect will actually contain the true percentage of the population? **(9 Points)**

Assign your answer to `true_percentage_intervals`.


In [None]:
true_percentage_intervals = ...

In [None]:
grader.check("q2_2")

Recall the second bootstrap confidence interval you created, which estimated Imm Thai's lead over Lucky House, Thai Temple, and Thai Basil combined. Among
voters in the sample, Imm Thai's lead was 6%. The staff's 95% confidence interval for the true lead (in the population of all voters) was:

$$[1.2, 11.2]$$

Suppose we are interested in testing a simple yes-or-no question:

> "Is the percentage of votes for Imm Thai equal to the percentage of votes for Lucky House, Thai Temple, and Thai Basil combined?"

Our null hypothesis is that the percentages are equal, or equivalently, that Imm Thai's lead is exactly 0. Our alternative hypothesis is that Imm Thai's lead is not equal to 0.  In the questions below, don't compute any confidence interval yourself—use only the staff's 95% confidence interval.

**Question 2.3.** Say we use a 5% p-value cutoff. Do we reject the null, fail to reject the null, or are we unable to tell using the staff's confidence interval? **(9 Points)**

Assign `cutoff_five_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval

*Hint:* Consider the relationship between the p-value cutoff and confidence. If you're confused, take a look at [this chapter](https://inferentialthinking.com/chapters/13/4/Using_Confidence_Intervals.html) of the textbook.


In [None]:
cutoff_five_percent = ...

In [None]:
grader.check("q2_3")

**Question 2.4.** What if, instead, we use a p-value cutoff of 1%? Do we reject the null, fail to reject the null, or are we unable to tell using our staff confidence interval? **(9 Points)**

Assign `cutoff_one_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval


In [None]:
cutoff_one_percent = ...

In [None]:
grader.check("q2_4")

**Question 2.5.** What if we use a p-value cutoff of 10%? Do we reject, fail to reject, or are we unable to tell using our confidence interval? **(9 Points)**

Assign `cutoff_ten_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval


In [None]:
cutoff_ten_percent = ...

In [None]:
grader.check("q2_5")

You're done with Homework 8!  

**Important submission steps:**
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**.
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions.

**It is your responsibility to make sure your work is saved before running the last cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)