# Read these instructions completely in order to receive full credit

- Before you submit the problem set, make sure everything runs as expected. Go to the menu bar at the top of Jupyter Notebook and click `Kernel > Restart & Run All`. Your code should run from top to bottom with no errors. Failure to do this will result in loss of points.

- You should not use `install.packages()` anywhere. You may assume that we have already installed all the packages needed to run your code.

- Make sure you fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE` and delete the `stop()` functions. The `stop()` functions produce an error and are there to remind you of cells that need an answer.

- If you are working in a group, make sure you and your collaborators have been [added to a group on Canvas](https://umich.instructure.com/courses/270337/discussion_topics/658777). You should also specify your group members when submitting to Gradescope.
- As a backup, *also* fill in your uniqid as well as those of your collaborators below:

Your uniqid: `<replace with your uniqid>`

Uniqids of your collaborator(s): `<replace with their uniqids>`

- This assignment should be submitted to both Canvas and Gradescope using the [instructions](https://piazza.com/class/jqh1wx3xw9amg?cid=55) posted on Piazza. **You must upload the problem set to __both__ Canvas and Gradescope in order to receive full credit.**
- **Carefully proofread the PDF that you upload to Canvas. PDFs that have missing or truncated code cannot be graded and will not receive credit.**

---

In [None]:
library(tidyverse)
library(modelr)

# STATS 306
## Problem Set 10: Regression and Inference
Each problem is worth two points, for a total of 20.

#### Problem 1
Finish the example I started but did not finish at the end of Wednesday's lecture: use the `gapminder` data to construct a plot which overlays the growth in life expectancy for Botswana, Lesotho, Swaziland, Rwanda, Zambia, and Zimbabwe, with what they *would have* experienced if they continued to grow at their 1950-1980 rates.

In [None]:
# YOUR CODE HERE
stop()

### Other uses of linear regression
So far we have mainly focused on using linear regression to understand patterns in data. In the following exercises, you will see how regression can be used to carry out many types of inference that previously required you to know and understand a specialized type of test. (The equivalences demonstrated by these exercises hold when the sample size is fairly large. For small samples, you should still rely on the specialized tests.)

#### Problem 2
##### Pearson's correlation
Pearson's correlation coefficient, denoted $\rho$, measures the strength of a linear relationship between two variables $x$ and $y$. It's implemented using the `cor()` function in R:

In [None]:
set.seed(1)
x = rnorm(100, sd=.1)
y = 3 + 2 * x + rnorm(100, sd = 1)
cor(x, y)
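
As a rough illustration of what `cor()` computes, the sketch below reproduces Pearson's coefficient from its definition: the covariance divided by the product of the two standard deviations. The vectors `u_demo` and `w_demo` are hypothetical toy data, not the notebook's `x` and `y`.

```r
# Hypothetical toy vectors, chosen only for illustration
u_demo <- c(1, 2, 3, 4, 5)
w_demo <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson's r from its definition: covariance over the product of SDs
r_manual <- cov(u_demo, w_demo) / (sd(u_demo) * sd(w_demo))

# The built-in function should agree up to floating-point error
r_builtin <- cor(u_demo, w_demo)
all.equal(r_manual, r_builtin)
```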

To test whether a linear relationship exists between two variables, we can use the function `cor.test()` to test the null hypothesis $H_0: \rho=0$:

In [None]:
cor.test(x, y)

Consider how you might test whether the correlation between $x$ and $y$ is zero using the linear model. By running an appropriate regression, show that you get *the exact same* $p$-value as that obtained by `cor.test()`. How can we infer the correlation coefficient $\rho=0.183$ from the regression result?

In [None]:
# YOUR CODE HERE
stop()

#### Problem 3
##### Spearman's rank correlation

A noted criticism of Pearson's correlation is that it only measures the strength of a *linear* correlation between two random variables. Consider the variables $x$ and $y$ defined in the file `spearman.csv`:

In [None]:
sp <- read_csv("spearman.csv") %>% print

Are $x$ and $y$ correlated according to Pearson's test? Are they related at all? Support your answer with an appropriate visual or statistical argument.

In [None]:
# YOUR CODE HERE
stop()

#### Problem 4
Recall that the `rank()` function maps a vector to a vector giving the numerical rank of each entry (these ranks are integers when there are no ties):

In [None]:
rank(c(2,6,9,10,8))
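
A small sketch of how `rank()` behaves with and without ties may help here. The second vector is a hypothetical example; by default, tied entries share the average of the ranks they occupy.

```r
# No ties: each entry gets a distinct integer rank
ranks_no_ties <- rank(c(2, 6, 9, 10, 8))
ranks_no_ties   # 1 2 4 5 3

# Ties (hypothetical example): the two 20s occupy ranks 2 and 3,
# so each receives the average rank 2.5
ranks_ties <- rank(c(10, 20, 20, 30))
ranks_ties      # 1.0 2.5 2.5 4.0
```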

To address potential shortcomings in Pearson's test, *Spearman's rank correlation test* looks for correlations between the *ranks* of two vectors $x$ and $y$. This will do a better job of picking out a non-linear relationship between $x$ and $y$, so long as that relationship is [monotonic](https://en.wikipedia.org/wiki/Monotonic_function). Verify this by visualizing the relationship between `rank(x)` and `rank(y)`.

In [None]:
# YOUR CODE HERE
stop()

Spearman's test is implemented using the `cor.test(..., method="spearman")` command.
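
To make the rank idea concrete, here is a minimal sketch on hypothetical toy data (`u_mono` and `w_mono` are not part of the problem set). The relationship is monotonic but strongly non-linear, so Pearson's coefficient falls below 1 while Spearman's equals 1, and Spearman's coefficient matches Pearson's coefficient computed on the ranks.

```r
# Hypothetical monotonic, non-linear toy data
u_mono <- 1:10
w_mono <- exp(u_mono)

# Pearson only measures linear association, so it is below 1 here
pearson_r <- cor(u_mono, w_mono)

# Spearman works on ranks, so a perfectly monotonic relationship gives 1
spearman_rho <- cor(u_mono, w_mono, method = "spearman")

# Spearman's coefficient is Pearson's coefficient applied to the ranks
rank_r <- cor(rank(u_mono), rank(w_mono))

# p-value of Spearman's test for the toy data
spearman_p <- cor.test(u_mono, w_mono, method = "spearman")$p.value

c(pearson = pearson_r, spearman = spearman_rho, on_ranks = rank_r)
```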

Using the rank transformation and an appropriate regression, show that the linear model gives you the *exact same* $p$-values and estimates for $\rho$.

In [None]:
# YOUR CODE HERE
stop()

Spearman's test is an example of a [non-parametric](https://en.wikipedia.org/wiki/Nonparametric_statistics) test: it does not make any assumptions about the distribution of the data. We will see other examples of non-parametric tests below.

#### Problem 5
##### The $t$-test
The one-sample $t$-test is used to test the null hypothesis that the mean of a random variable is zero. It's implemented in R using the `t.test()` command:

In [None]:
x <- rnorm(100, mean=.1)
t.test(x)
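
For reference, the quantities reported by `t.test()` can be reproduced by hand. The sketch below uses a short hypothetical vector `z_demo` (not the notebook's `x`) and computes the statistic as the sample mean divided by its standard error, with a two-sided $p$-value from the $t$ distribution on $n - 1$ degrees of freedom.

```r
# Hypothetical toy sample, chosen only for illustration
z_demo <- c(0.3, -0.1, 0.8, 0.4, -0.2, 0.6, 0.1, 0.5)
n_demo <- length(z_demo)

# t statistic: sample mean divided by its standard error
t_manual <- mean(z_demo) / (sd(z_demo) / sqrt(n_demo))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n_demo - 1)

# Compare with the built-in test
tt_demo <- t.test(z_demo)
c(manual = t_manual, builtin = unname(tt_demo$statistic))
c(manual = p_manual, builtin = tt_demo$p.value)
```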

By running an appropriate regression, show that the linear model produces *the exact same* $t$ statistic, confidence intervals, and $p$-values as the $t$-test.

In [None]:
# YOUR CODE HERE
stop()

#### Problem 6
##### Wilcoxon test
The $t$-test assumes that the underlying data are normally distributed. In cases where this assumption is far from accurate, the test can fail. Consider the $t$ distribution itself, with one degree of freedom:

In [None]:
set.seed(1)
qplot(rt(50, 1), geom="histogram")

This distribution has [heavy tails](https://en.wikipedia.org/wiki/Heavy-tailed_distribution), and in fact, does not even have a theoretical mean. Hence, running the $t$-test will not produce a sensible answer here. We will check this by repeatedly simulating data from this distribution and checking whether the $t$-test would have us reject the null hypothesis at the 5% level. 

Implement this simulation: for each of 10,000 trials, draw a sample of size 50 from the $t_1$ distribution shown above, and calculate the proportion of trials in which the $t$-test would have us reject, at the 5% level, the null hypothesis that the mean equals zero:

In [None]:
set.seed(1)
# YOUR CODE HERE
stop()

Run your simulation a few times to verify that your results are consistent. If the test were properly calibrated, what proportion of the time should we reject the null? How does this compare with the results of your simulation?

YOUR ANSWER HERE

#### Problem 7

To remedy the shortcomings of the $t$-test, Wilcoxon proposed a non-parametric test of a different but related question: is the distribution of $v$ [symmetric about zero](https://en.wikipedia.org/wiki/Symmetric_probability_distribution)? This led him to define the *signed rank test*. The signed rank of $v$ is defined as the rank vector of $|v|$ (using the ranks introduced in Problem 4), multiplied elementwise by a vector that captures the sign of each entry of $v$:

In [None]:
v <- c(-1, 2, -3, 4, -5)
rank(abs(v))
sign(v)

signed_rank <- function(v) rank(abs(v)) * sign(v)
signed_rank(v)

The intuition behind the signed rank vector is as follows: for a distribution that is symmetric around zero, the mean of the signed ranks should approximately equal zero. 
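
The intuition can be checked on a tiny hypothetical example. The sketch below repeats the `signed_rank()` definition from the cell above so that it is self-contained, then compares a vector that is exactly symmetric about zero with a shifted copy of it; `sym_demo` and `shifted_demo` are illustrative names only.

```r
# Same definition as in the cell above, repeated so the sketch stands alone
signed_rank <- function(v) rank(abs(v)) * sign(v)

# Hypothetical vector that is exactly symmetric about zero
sym_demo <- c(-3, -2, -1, 1, 2, 3)

# The same values shifted to the right, so symmetry about zero is lost
shifted_demo <- sym_demo + 2.5

# Positive and negative signed ranks cancel for the symmetric vector
mean_sym <- mean(signed_rank(sym_demo))

# After the shift, most signed ranks are positive and the mean moves away from 0
mean_shifted <- mean(signed_rank(shifted_demo))

c(symmetric = mean_sym, shifted = mean_shifted)
```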

The signed rank test is implemented using the function `wilcox.test()`. Re-run the experiment above and verify that Wilcoxon's test is properly calibrated:

In [None]:
# YOUR CODE HERE
stop()

Using the signed-rank transformation and an appropriate model, show that the linear regression produces nearly the same $p$-value as `wilcox.test()` when applied to the vector `x`.

In [None]:
# YOUR CODE HERE
stop()

#### Problem 8
Suppose I want to test whether the means of two random variables are equal. In STATS 250 you saw that the appropriate test for this is Student's $t$-test, or, in the case of unequal variances, Welch's $t$-test. In STATS 306 you learned that these tests are equivalent to inference in certain linear models:

In [None]:
y1 <- rnorm(100)
y2 <- rnorm(100, sd=2) + .1
## Student's
t.test(y1, y2, var.equal = TRUE)$p.value
df <- tibble(y=c(y1, y2), x=factor(c(rep(1, 100), rep(2, 100))))
lm(y ~ x, data=df) %>% summary %>% coef
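
To read the comparison off programmatically, one option is to pull the $p$-value out of the coefficient matrix by row and column name. The sketch below does this on hypothetical deterministic data (`g1_demo`, `g2_demo`, and `toy_two` are illustrative names, and a base `data.frame` is used so that no packages are needed); the row name `group2` follows from the factor levels chosen here.

```r
# Hypothetical toy samples, chosen only for illustration
g1_demo <- c(4.1, 5.0, 3.8, 4.6, 5.2, 4.4)
g2_demo <- c(5.3, 6.1, 5.8, 4.9, 6.4, 5.5)

# Student's two-sample t-test (equal variances assumed)
p_ttest <- t.test(g1_demo, g2_demo, var.equal = TRUE)$p.value

# The same comparison as a regression on a two-level factor
toy_two <- data.frame(
  value = c(g1_demo, g2_demo),
  group = factor(rep(c(1, 2), each = 6))
)
fit_two <- lm(value ~ group, data = toy_two)

# The slope row tests "no difference in means"; extract its p-value
p_lm <- coef(summary(fit_two))["group2", "Pr(>|t|)"]

all.equal(p_ttest, p_lm)
```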

The "rank" version of the two-sample Student's $t$-test is called the Mann-Whitney U or Wilcoxon rank sum test. It tests the null hypothesis that a randomly chosen observation from one population is equally likely to be larger or smaller than a randomly chosen observation from the second population. (Note again that this is a non-parametric test.)

This test is implemented in R by passing *both* vectors to the `wilcox.test()` function:

In [None]:
wilcox.test(y1, y2) %>% print
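
As a sketch of where the reported statistic `W` comes from, the code below ranks the pooled data, sums the ranks belonging to the first sample, and subtracts the smallest value that sum could take. The vectors `s1_demo` and `s2_demo` are hypothetical and contain no ties.

```r
# Hypothetical toy samples with no ties
s1_demo <- c(1.2, 3.4, 2.2, 5.1)
s2_demo <- c(2.8, 4.0, 6.3, 7.5, 5.9)
n1_demo <- length(s1_demo)

# Rank all observations together, then sum the ranks of the first sample
pooled_ranks <- rank(c(s1_demo, s2_demo))
rank_sum_1 <- sum(pooled_ranks[seq_len(n1_demo)])

# Subtract the minimum possible rank sum, n1 * (n1 + 1) / 2
w_manual <- rank_sum_1 - n1_demo * (n1_demo + 1) / 2

# Compare with the statistic reported by wilcox.test()
w_builtin <- unname(wilcox.test(s1_demo, s2_demo)$statistic)
c(manual = w_manual, builtin = w_builtin)
```

Equivalently, `W` counts the pairs in which an observation from the first sample exceeds one from the second.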

Verify that by rank-transforming the data and running an appropriate regression, you get the same $p$-value from the linear model as from the specialized `wilcox.test()` command.

In [None]:
# YOUR CODE HERE
stop()

#### Problem 9
##### ANOVA
The analysis of variance (ANOVA) is a method for testing whether the means of two or more populations are equal. It is implemented using the command `aov()`.

In [None]:
df <- tibble(x = c(rnorm(100), rnorm(100), rnorm(100) + .2),
             pop = c(rep("a", 100), rep("b", 100), rep("c", 100)))

This toy data frame has three populations, "a", "b", and "c", each with one hundred observations. The means of populations "a" and "b" are equal, while the mean of population "c" is slightly higher:

In [None]:
df %>% group_by(pop) %>% summarize(mean(x))

ANOVA is designed to test whether the means of these populations are different:

In [None]:
aov(x ~ pop, df) %>% summary
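
The $F$ statistic in the ANOVA table can also be computed by hand as the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch on hypothetical deterministic data (`toy_aov` is an illustrative name, not the notebook's `df`).

```r
# Hypothetical toy data: three groups of four observations each
toy_aov <- data.frame(
  value = c(4.1, 5.0, 3.8, 4.6,
            5.3, 6.1, 5.8, 4.9,
            6.4, 5.5, 7.0, 6.2),
  pop = factor(rep(c("a", "b", "c"), each = 4))
)
n_obs <- nrow(toy_aov)
k_groups <- nlevels(toy_aov$pop)

# Between-group sum of squares: group means around the grand mean
group_means <- tapply(toy_aov$value, toy_aov$pop, mean)
group_sizes <- tapply(toy_aov$value, toy_aov$pop, length)
ss_between <- sum(group_sizes * (group_means - mean(toy_aov$value))^2)

# Within-group sum of squares: observations around their own group mean
ss_within <- sum((toy_aov$value - ave(toy_aov$value, toy_aov$pop))^2)

# F statistic and its p-value
f_manual <- (ss_between / (k_groups - 1)) / (ss_within / (n_obs - k_groups))
p_manual_aov <- pf(f_manual, k_groups - 1, n_obs - k_groups, lower.tail = FALSE)

# Compare with the first row of the table reported by aov()
aov_table <- summary(aov(value ~ pop, data = toy_aov))[[1]]
c(manual = f_manual, builtin = aov_table[["F value"]][1])
c(manual = p_manual_aov, builtin = aov_table[["Pr(>F)"]][1])
```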

Show that the $p$-value reported by ANOVA is exactly the same as that reported by an appropriate linear regression (hint: consider the null hypothesis that the true model is the intercept-only model.) Additionally, show that with only two populations, ANOVA is identical to the $t$-test.

In [None]:
# YOUR CODE HERE
stop()

#### Problem 10
As you might have guessed by now, there is a non-parametric version of ANOVA that operates on ranks. It's called the Kruskal-Wallis test, and is designed for situations where the assumptions of ANOVA (normally distributed residuals, the same assumption made by linear regression) are not appropriate. To investigate this test, we will regenerate the data frame from the previous exercise, this time using $t$-distributed random variables instead of normally-distributed variables:

In [None]:
set.seed(1)
df <- tibble(x = c(rt(100,2), rt(100,2), rt(100,2) + .5),
             pop = c(rep("a", 100), rep("b", 100), rep("c", 100)))

ANOVA will generally fail to reject in this setting:

In [None]:
aov(x ~ pop, df) %>% summary

On the other hand, the Kruskal-Wallis test correctly rejects the null:

In [None]:
dfa <- filter(df, pop == "a")$x
dfb <- filter(df, pop == "b")$x
dfc <- filter(df, pop == "c")$x
kruskal.test(list(dfa, dfb, dfc))
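
For orientation, the Kruskal-Wallis statistic can be written in terms of the rank sums of each group. The sketch below computes it by hand on hypothetical tie-free data (`toy_kw` is an illustrative name) using the usual formula $H = \frac{12}{N(N+1)} \sum_i R_i^2 / n_i - 3(N+1)$, and refers $H$ to a chi-squared distribution with one fewer degree of freedom than the number of groups.

```r
# Hypothetical toy data with no ties: three groups of four observations
toy_kw <- data.frame(
  value = c(4.1, 5.0, 3.8, 4.6,
            5.3, 6.1, 5.8, 4.9,
            6.4, 5.5, 7.0, 6.2),
  pop = factor(rep(c("a", "b", "c"), each = 4))
)
n_total <- nrow(toy_kw)
k_pops <- nlevels(toy_kw$pop)

# Rank all observations together, then sum the ranks within each group
all_ranks <- rank(toy_kw$value)
rank_sums <- tapply(all_ranks, toy_kw$pop, sum)
pop_sizes <- tapply(all_ranks, toy_kw$pop, length)

# Kruskal-Wallis H (no tie correction is needed for tie-free data)
h_manual <- 12 / (n_total * (n_total + 1)) * sum(rank_sums^2 / pop_sizes) -
  3 * (n_total + 1)
p_manual_kw <- pchisq(h_manual, df = k_pops - 1, lower.tail = FALSE)

# Compare with the built-in test
kw_demo <- kruskal.test(value ~ pop, data = toy_kw)
c(manual = h_manual, builtin = unname(kw_demo$statistic))
c(manual = p_manual_kw, builtin = kw_demo$p.value)
```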

Show that the Kruskal-Wallis $p$-values are *almost* the same as what you get using the rank transformation and an appropriate linear model.

In [None]:
# YOUR CODE HERE
stop()