# Project 3: The Last One

**Name(s):**

1.

2.

3.


---

In this project, we will work with confidence intervals and joint probability distributions, using some real-world data.

---

**Project Rules:**

* You will lose 3 "logistics" points if you supply an equation with mathematical formatting that is _not_ formatted in LaTeX. (E.g., "sqrt(2)" instead of $\sqrt{2}$, or "x^3" instead of $x^3$)
* You can work in groups of **up to 3 students**.
* You are encouraged to collaborate with other groups in the class, but each group's submission must be its own and should differ from other groups' submissions.
* You must **show all work** and **fully justify** your solutions.
* You will lose points if you supply a text answer (say, providing an explanation, or the distribution of a parameter) in a code cell.
* The notebook is partitioned into a few different sections, each of which accompanies a different day of lecture as detailed on the course schedule on myCourses. Toward the end, the problems will rely on previous days of lecture material as well, and **may require some time outside of class to complete**. Please plan accordingly with your group.

---

<br>

<br>

---

## Accompanying Tuesday November 14 - Confidence Intervals for the Mean...

Run the code cell below to read in the data from our sleep survey on Tuesday November 7.

```r
sleep = read.csv("https://raw.githubusercontent.com/tonyewong/math251_fall2023/master/sleep_survey.csv",header=FALSE)[[1]]
```

First, let's revisit our old methods from Exploratory Data Analysis to explore this data set.

### Task 1

When encountering a new data set for the first time, it's good to first get a general sense of what is going on. A **histogram** contains a wealth of information about the data set: you can see the central tendency, the modality (how many modes it has), any apparent skew, and the range and variability.

Create a probability density histogram and label your axes, including units where appropriate. Write a few sentences to describe the data set based on your histogram, being sure to address the points above.
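As a minimal sketch of the plotting call (the vector `hours_demo` is made-up illustrative data, not the survey responses, and the labels are only suggestions), a probability density histogram in base R sets `freq = FALSE` so that the bar areas sum to 1:

```r
# Hypothetical values for illustration only; replace with the real data vector.
hours_demo <- c(5, 6, 6, 6.5, 7, 7, 7, 7.5, 8, 8, 9)

# freq = FALSE puts density (not counts) on the vertical axis.
h <- hist(hours_demo,
          freq = FALSE,
          main = "Illustrative histogram (made-up data)",
          xlab = "Sleep per night (hours)",
          ylab = "Probability density (1/hours)")

# Sanity check: bar heights times bin widths sum to 1.
total_area <- sum(h$density * diff(h$breaks))
print(total_area)
```

The area check is a quick way to confirm that the vertical axis really is a density rather than a count.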

### Task 2

Let's now compute some numerical summaries for our data. First, compute the **Tukey 5-Number Summary** for this data set. The 5-number summary includes:
* the minimum
* the first quartile (25th percentile)
* the median
* the third quartile (75th percentile)
* the maximum

Also compute the mean, standard deviation, and sample size for the data set.
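One possible way to get these summaries in base R is sketched below, using a made-up vector `x_demo` rather than the survey data. The variable names are hypothetical.

```r
# Hypothetical values for illustration only.
x_demo <- c(5, 6, 6.5, 7, 7, 7.5, 8, 9)

five_num <- fivenum(x_demo)   # min, lower hinge, median, upper hinge, max
x_mean   <- mean(x_demo)
x_sd     <- sd(x_demo)        # sample standard deviation (n - 1 denominator)
x_n      <- length(x_demo)

print(five_num)
print(c(mean = x_mean, sd = x_sd, n = x_n))
```

Note that `fivenum` uses Tukey's hinges, which can differ slightly from the quartiles returned by `quantile(x_demo, c(0.25, 0.75))` for some sample sizes.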

Write a few sentences that refine your earlier remarks about central tendency, range, and skew, making use of the new statistics that you have calculated. For example, now you can be more specific about the range and where the "center" of the data is.

Based on these statistics, what proportion of our class do you estimate is getting the National Institutes of Health's recommended [7-9 hours of sleep](https://www.nhlbi.nih.gov/health/sleep/how-much-sleep) per night? Explain your answer using only the statistics that you calculated in the other parts of this Task.

### Task 3

If someone asks you "What's a typical amount of sleep for people in your Prob/Stat class?", how would you respond? Justify your answer using concepts from our work in Exploratory Data Analysis, which might include using the histogram and/or statistics that you generated in the previous tasks!  _Many acceptable answers are possible._

### Task 4

Okay. Now let's calculate a 66% confidence interval for the mean amount of sleep folks in our class get. We're going to break this up into a few steps, so hang tight.

Since we do not know the population standard deviation $\sigma$, we will estimate it using $s$, the sample standard deviation. So a generic CI with confidence level $100(1-\alpha)\%$ is given by
$$\left[\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}}, \ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right]$$

First, to calculate a 66% CI, what is $\alpha$? Write this in a sentence in Markdown/LaTeX.

Then, recall that $z_{\alpha/2}$ is the value of a standard normal random variable that has probability $\alpha/2$ _to the right_ of it. What is $z_{\alpha/2}$ for our 66% CI? This one you can just compute in a code cell (but you can use Markdown text cells too if you'd like!).
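To illustrate the mechanics only, here is a sketch for a different confidence level (90%) with made-up summary statistics. Every number and variable name below is an assumption for illustration, not a value from the survey.

```r
# Illustrative confidence level (not the one asked for in this task).
conf_level <- 0.90
alpha_demo <- 1 - conf_level

# qnorm takes the probability to the LEFT, so use 1 - alpha/2.
z_demo <- qnorm(1 - alpha_demo / 2)
print(z_demo)

# Hypothetical summary statistics, for illustration only.
xbar_demo <- 7.0   # sample mean
s_demo    <- 1.2   # sample standard deviation
n_demo    <- 36    # sample size

half_width <- z_demo * s_demo / sqrt(n_demo)
ci_demo <- c(lower = xbar_demo - half_width, upper = xbar_demo + half_width)
print(ci_demo)
```

Because `qnorm` works with left-tail probability, passing `alpha/2` directly would return the negative of the critical value.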

### Task 5

<img src="https://us-tuna-sounds-images.voicemod.net/cc7b5a05-4e0a-4665-a14b-cf9160c2d73e.jpg" width=200>

Good news everyone! You've already calculated $\bar{x}$, $s$, and $n$. And in the last task, you also calculated $z_{\alpha/2}$ for our 66% CI. So plug all that into the CI formula from the last task and report your CI in the form $[A,B]$ in a **text cell** in a complete sentence. Round the values of the CI bounds $A$ and $B$ to 1 decimal place and include units.

### Task 6

What value of $\alpha$ would place 7 hours of sleep at the upper bound of our CI? What is the corresponding confidence level?

Your calculations should be done in R in code cells below, but the setup and symbolic algebra should all be done in Markdown/LaTeX in a text cell. You can and should add different types of cells if that is useful.
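Once the algebra gives you a critical value, `pnorm` converts it back into a tail probability. The sketch below uses a made-up critical value `z_star` purely to show that conversion. It is not the value this task asks for.

```r
# Hypothetical critical value, for illustration only.
z_star <- 1.96

# The probability in ONE tail to the right of z_star is alpha/2.
alpha_star <- 2 * (1 - pnorm(z_star))
conf_star  <- 1 - alpha_star
print(c(alpha = alpha_star, confidence = conf_star))

# Round trip: qnorm recovers the critical value from alpha.
print(qnorm(1 - alpha_star / 2))
```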

### Task 7

Determine the maximum confidence level for which the CI for the mean amount of sleep has a total width (upper bound minus lower bound) of no more than 10 minutes.

Your calculations should be done in R in code cells below, but the setup and symbolic algebra should all be done in Markdown/LaTeX in a text cell. You can and should add different types of cells if that is useful.

<br>

---

<br>

## Accompanying Thursday November 16 - Confidence Intervals Using the T Distribution...

## Example:

S'pose the fat content (in percentage) of 10 randomly selected hot dogs has a sample mean of 21.9 and sample standard deviation of $s = 4.134$. From historical data, we know that the distribution of fat content is approximately normal.

### Task 8

Since our sample size $n=10$ is pretty small, and the population distribution is approximately normal, we know a confidence interval for the population mean is given by:
$$CI = \bar{x} \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}$$

The only new piece here is $t_{\alpha/2,n-1}$, the critical value from a $t$ distribution with $\nu = n-1$ degrees of freedom. How many degrees of freedom does our $t$ distribution have?

### Task 9

We can compute the $t$ critical value $t_{\alpha/2,n-1}$ using the `qt` function to obtain quantiles from the $t$ distribution. We give it two arguments:
* `p` = probability to the *left* of the quantile returned
* `df` = the degrees of freedom of the $t$ distribution

Replace `p` and `df` below with values appropriate for computing a 95% CI for our example.

```r
t_alp2 = qt(p, df)  # <-- TODO: replace p and df to compute the appropriate t critical value
print(t_alp2)
```
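For comparison, here is a sketch of a `qt` call with made-up inputs (a 90% level and a sample of size 6). These are illustrative assumptions, not the values for the hot dog example.

```r
# Illustrative values only (not the ones this task asks for).
conf_level_t <- 0.90
n_t          <- 6
alpha_t      <- 1 - conf_level_t

# qt takes the probability to the LEFT of the quantile, hence 1 - alpha/2.
t_demo <- qt(p = 1 - alpha_t / 2, df = n_t - 1)
print(t_demo)

# The t critical value exceeds the matching z value for small samples.
print(qnorm(1 - alpha_t / 2))
```

The gap between the two printed values shrinks as the degrees of freedom grow, which is why the $t$ correction matters most for small $n$.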

### Task 10

Compute a 95% confidence interval for the mean fat content of a hot dog. Report the CI in a **complete** sentence in a Markdown/text cell, including appropriate units.

### Task 11

Find a 95% prediction interval for the fat content of the next (11th) hot dog.

Potentially useful formula: $$PI = \bar{x} \pm t_{\alpha/2,n-1}\cdot s \sqrt{1 + \frac{1}{n}}$$
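To see how the prediction interval differs from the confidence interval, the sketch below evaluates both formulas with made-up summary statistics. The names and numbers are hypothetical and unrelated to the hot dog data.

```r
# Hypothetical summary statistics, for illustration only.
xbar_h <- 50
s_h    <- 5
n_h    <- 16
level  <- 0.90

t_crit <- qt(1 - (1 - level) / 2, df = n_h - 1)

# CI for the population mean: margin uses s / sqrt(n).
ci_margin <- t_crit * s_h / sqrt(n_h)
# PI for one new observation: margin uses s * sqrt(1 + 1/n).
pi_margin <- t_crit * s_h * sqrt(1 + 1 / n_h)

ci_h <- c(xbar_h - ci_margin, xbar_h + ci_margin)
pi_h <- c(xbar_h - pi_margin, xbar_h + pi_margin)
print(ci_h)
print(pi_h)
```

The prediction interval is always the wider of the two, since it accounts for the variability of a single new observation in addition to the uncertainty in $\bar{x}$.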


<br>

## Back to sleep [example]

Suppose we were conducting the sleep survey in real time so that we do not actually know all of the data points all at once. For instance, if we run the code cell below, we can artificially limit our data to just the first 12 data points.

```r
sleepr = sleep[1:12]
```

### Task 12

Recall that if our data are _approximately_ normal, then even if our sample size is less than 30, we can still use a $t$ distribution to generate a confidence interval for the population mean.

Make a probability density histogram of the reduced data set, labeling all axes as appropriate (including units). Write a few sentences to justify using a t distribution to construct a confidence interval.

**Solution:**

Many answers are possible, but should present at least a few features that are consistent with a normal distribution. For example:
* higher in the middle and lower on the sides
* unimodal

### Task 13

Now construct a 90% t confidence interval for the mean amount of sleep that students get, based on this reduced data set of just 12 students.

### Task 14

Recall from class that for situations in which we do not know the population standard deviation $\sigma$, a $100\times (1-\alpha)\%$ *prediction interval* for the 13th data point is given by:
$$PI = \bar{x} \pm t_{\alpha/2,n-1}\cdot s \sqrt{1 + \frac{1}{n}}$$

Compute a 90% prediction interval for the 13th data point.

### Task 15

Consider the rest of the data set besides these first 13 points. What proportion of the rest of the data falls within your 90% prediction interval? Does this match what you would expect? Regardless of how closely it matches expectations, name and explain at least one reason why your finding might not match expectations.
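One way to count how many values land inside an interval is sketched below, using a made-up vector and made-up bounds. `all_demo`, `first_k`, `lower_demo`, and `upper_demo` are all hypothetical.

```r
# Hypothetical data and interval, for illustration only.
all_demo <- c(6, 7, 8, 5, 7.5, 9, 4, 7, 6.5, 10)
first_k  <- 4                       # pretend the first 4 points were already used
rest     <- all_demo[-(1:first_k)]  # negative indices drop those positions

lower_demo <- 5.5
upper_demo <- 8.5

inside <- rest >= lower_demo & rest <= upper_demo
prop_inside <- mean(inside)         # the mean of TRUE/FALSE values is a proportion
print(prop_inside)
```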

<br>

---

<br>

## Accompanying Thursday November 30 - Joint Distributions...

Run the code cell below to load and print the first few rows of a data set containing a month of weather data for Rochester, New York. The data includes daily average temperature (`temp`, degrees Fahrenheit), daily average humidity (`humidity`, %), and daily total precipitation (`precip`, inches).

```r
dat = read.csv("https://raw.githubusercontent.com/tonyewong/math251_fall2023/master/weather_rochester.csv")
head(dat)
```

Recall that if you would like to set a specific variable for (say) the temperature data, you can do this by _slicing_ out the `temp` column of `dat` by name:
```
temp = dat[,"temp"]
```
Feel free to create new variables for each of the different columns of `dat` if you would like.
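The sketch below shows a few equivalent ways to pull out a column, using a tiny made-up data frame `dat_demo` that only mimics the column names described above.

```r
# Small hypothetical data frame, for illustration only.
dat_demo <- data.frame(temp     = c(60, 70, 68),
                       humidity = c(80, 65, 77),
                       precip   = c(0.2, 0, 0.1))

temp_a <- dat_demo[, "temp"]   # slice by column name
temp_b <- dat_demo$temp        # equivalent shorthand
temp_c <- dat_demo[, 1]        # by position (only works if temp is column 1)

print(identical(temp_a, temp_b))
```

Slicing by name is usually the safer choice, because it keeps working if the column order changes.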

### Task 16

Based on the data, what is the overall probability of a day with any precipitation? _Hint: if you aren't sure, think back to how we estimated probability in our Exploratory Data Analysis work._

### Task 17

Also based on the data, what is the conditional probability of any precipitation if **temperature** is above 65 degrees Fahrenheit? And what is the conditional probability of precipitation if **humidity** is above 75%? What about if temperature is above 65 degrees and humidity is above 75%? (Note that "above" implies "not equal to".)

Note that this is asking you to calculate 3 probabilities. Set up your calculation in Markdown/LaTeX in a text cell, and clearly specify the conditional probabilities that you must find. Then actually do the calculations in a code cell. **Report all of the results written as _conditional probabilities_ in Markdown/LaTeX in a text cell**.
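As a sketch of how a conditional proportion can be estimated by subsetting, the example below uses a small made-up data frame `wx` with the same column names. The values are invented and the thresholds are reused only for illustration.

```r
# Small hypothetical data frame, for illustration only.
wx <- data.frame(temp     = c(60, 70, 68, 66, 62, 72),
                 humidity = c(80, 65, 77, 90, 70, 76),
                 precip   = c(0.2, 0, 0.1, 0.3, 0, 0))

wet <- wx$precip > 0

# P(precip > 0 | temp > 65): restrict to the conditioning event, then average.
p_wet_given_warm <- mean(wet[wx$temp > 65])

# Conditioning on two events at once uses & (logical AND).
p_wet_given_both <- mean(wet[wx$temp > 65 & wx$humidity > 75])

print(c(p_wet_given_warm, p_wet_given_both))
```

This is the sample version of $P(A \mid B) = P(A \cap B)/P(B)$: subsetting to the rows where $B$ occurs plays the role of dividing by $P(B)$.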

### Task 18

Based on the previous 2 tasks, write a couple of sentences summarizing what you think is the relationship between temperature, humidity, and precipitation. Do you think that temperature and humidity are independent or dependent? Use evidence from the previous few tasks to support your answer.

### Task 19

Now let's fit a discrete joint probability mass function to the distribution of temperature and humidity. This will allow us to use our official checks for (in)dependence to determine whether temperature and humidity are independent.

Define "high" and "low" cases for temperature and humidity as follows:
* high temperature corresponds to temperatures greater than 65 degrees, and low temperature corresponds to temperatures at or below 65 degrees, and
* high humidity corresponds to humidities greater than 75%, and low humidity corresponds to humidities at or below 75%.

Fit a discrete joint probability mass function to temperature and humidity by counting up the numbers of low and high cases for each. Fill in the pmf in the table provided below. For example, in the low-temperature/low-humidity entry, the ? should be replaced by your estimate of the probability that we would see both low temperature and low humidity, based on the data. You can either count these up by hand, or use (say) the `which` function in R like we did in our Exploratory Data Analysis notebooks.


| humidity (rows) / temperature (columns) | low | high |
|-----------------------------------------|-----|------|
| low                                     |  ?  |  ?   |
| high                                    |  ?  |  ?   |
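The counting step can be sketched as follows, with made-up measurements `temp_j` and `hum_j` standing in for the real columns. The `table` approach is one option and the `which` approach mentioned above is another.

```r
# Hypothetical measurements, for illustration only.
temp_j <- c(60, 70, 68, 66, 62, 72, 64, 59)
hum_j  <- c(80, 65, 77, 90, 70, 76, 60, 85)

temp_cat <- ifelse(temp_j > 65, "high", "low")
hum_cat  <- ifelse(hum_j > 75, "high", "low")

# Cross-tabulate counts, then divide by the total to get probabilities.
counts    <- table(humidity = hum_cat, temperature = temp_cat)
joint_pmf <- counts / sum(counts)
print(joint_pmf)

# The same count for one cell using which(): low temperature and low humidity.
n_low_low <- length(which(temp_j <= 65 & hum_j <= 75))
print(n_low_low / length(temp_j))
```

Note that `table` orders the categories alphabetically, so "high" is printed before "low".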



### Task 20

Use your fitted joint probability mass function to prove whether temperature and humidity are dependent or independent. State your result and how you know they are/aren't dependent in a sentence in a text cell.
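Recall that two discrete random variables are independent exactly when $p(x,y) = p_X(x)\,p_Y(y)$ for every pair $(x,y)$. The sketch below applies that check to a made-up joint pmf `pmf_demo`. The numbers are hypothetical and chosen only to show the mechanics.

```r
# Hypothetical joint pmf, for illustration only (rows: X, columns: Y).
pmf_demo <- matrix(c(0.30, 0.20,
                     0.10, 0.40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(X = c("low", "high"), Y = c("low", "high")))

p_x <- rowSums(pmf_demo)   # marginal pmf of X
p_y <- colSums(pmf_demo)   # marginal pmf of Y

# Under independence, every cell would equal p_x * p_y.
product_pmf <- outer(p_x, p_y)
print(product_pmf)

independent <- isTRUE(all.equal(pmf_demo, product_pmf, check.attributes = FALSE))
print(independent)
```

A single cell where the joint probability differs from the product of the marginals is enough to conclude dependence. With estimated probabilities, small differences can also come from sampling variability.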