Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). Your code should run from top to bottom with no errors. Failure to do this will result in loss of points.

Make sure you fill in every place that says `YOUR CODE HERE` or "YOUR ANSWER HERE" and delete the `stop()` calls. Also fill in your name and collaborators below:

In [None]:
NAME = ""  # your uniqname 
COLLABORATORS = c()  # vector of uniqnames of your collaborators, if any
## IMPORTANT: also enter your group information in Canvas when you upload the assignment

---

In [None]:
library(tidyverse)
library(modelr)

# STATS 306
## Problem set 10: The end
Each part of each problem is worth two to four points, depending on difficulty, for a total of 20.

*Note*: you do not need to use `install.packages()` in this notebook. You may assume that we have already installed all of the necessary packages when we run your code.

#### Problem 1 (2 pts.)
If we regress `y` on `x`, then the regression coefficient gives the average rate of change of `y` with respect to `x`. For example:

In [None]:
lm(cty ~ hwy, mpg) %>% coef

This says that if highway mileage increases by one mile per gallon, average city mileage increases by about 0.68 miles per gallon, e.g.:

In [None]:
dcity = mpg %>% filter(hwy %in% 24:25) %>% 
                group_by(hwy) %>% summarize(mean(cty)) %>% print
diff(dcity[[2]])

In class we saw that `log(price)` has approximately a linear relationship with `log(carat)`:

In [None]:
lm(log(price) ~ log(carat), diamonds) %>% coef

What does this regression say about the relationship between __price__ (not `log(price)`) and __carat__ (not `log(carat)`)?

YOUR ANSWER HERE

#### Problem 2 (2 pts.)
In lecture we said that regressing `log(price)` on `log(carat)` "removed" the effect of size on a diamond's price, enabling us to visualize the net effect of cut, color, and clarity on price by looking at the residuals of that regression.

If the residuals are not affected by `log(carat)`, then intuitively they should be uncorrelated with `log(carat)`. Verify that this is true by computing the correlation between the residuals and the size predictor in this regression. Due to numerical error it will be very small but not exactly zero.

In [None]:
# YOUR CODE HERE
stop()

Explain why this implies that the predicted values are also uncorrelated with the residuals in a linear regression.

YOUR ANSWER HERE

#### Problem 3 (4 pts.)
The standard error of a regression coefficient measures how much that estimate would vary if we were to run the same data experiment many times and re-fit the linear model each time. For example, suppose we get data from the following experiment:

In [None]:
experiment = function() {
    tibble(
        x1 = rexp(rate = 10, n = 100),
        x2 = rnorm(n = 100),
        y = 3 * x1 - 2 * x2 + rnorm(n = 100, sd = .1)
    )
}

If we regress `y` on `x1` and `x2`, the standard error of the `x2` coefficient is estimated to be:

In [None]:
experiment_data = experiment()
lm(y ~ x1 + x2, experiment_data) %>% summary %>% coef %>% .['x2', 'Std. Error']

We can verify that this is (approximately) correct by repeatedly running the experiment, fitting the linear model, and computing the standard deviation of the resulting estimates:

In [None]:
extract_coef = function(df) lm(y ~ x1 + x2, df) %>% coef %>% .['x2'] 
map(1:1000, ~ experiment()) %>% map_dbl(extract_coef) %>% sd

Of course, in real life, we usually do not have the luxury of generating as many new data sets as we want; we just have a single data set. A famous idea in statistics is to use just our one data set to generate many new data sets by sampling the observations with replacement. This is called "[bootstrapping][1]"; starting with just one data set we "pull ourselves up by our bootstraps".

[1]: https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
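
As a minimal sketch of the resampling idea, the following applies it to a plain numeric vector rather than a data frame, using the sample median as the statistic. The names `x` and `boot_medians`, the seed, and the simulated data are illustrative assumptions and are not part of this problem set.

```{r}
# Hypothetical illustration: bootstrap standard error of a sample median.
set.seed(306)                  # arbitrary seed, only for reproducibility
x = rexp(n = 200, rate = 1)    # stands in for our single observed data set

# Each replicate resamples the observations with replacement
# (same size as the original) and recomputes the statistic.
boot_medians = replicate(1000, median(sample(x, replace = TRUE)))

# The spread of the replicated statistics estimates its standard error.
sd(boot_medians)
```

Only base R functions (`sample`, `replicate`, `median`, `sd`) are used, so the same pattern should carry over to other statistics by swapping out `median`.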

**(2 pts.)** Write a function `bootstrap_df(df)` which, given a data frame containing `n` rows, returns a new data frame containing `n` rows sampled randomly _with replacement_ from `df`. (The "with replacement" part is important for obvious reasons.) For example, if
```{r}
df = tribble(
    ~x, ~y,
    1,   2,
    3,   4,
    5,   6
)
```
then `bootstrap_df(df)` might return:

```{r}
  x y
1 1 2
2 1 2
3 3 4
```

In [None]:
bootstrap_df = function(df) {
    # YOUR CODE HERE
    stop()
}

In [None]:
bs = bootstrap_df(sim1)
stopifnot(nrow(bs) == nrow(sim1))
bs = bootstrap_df(tibble(x = rep(1, 100)))
stopifnot(bs$x == 1)

Use `bootstrap_df` to generate 1000 bootstrap replicates from `experiment_data` defined above, and use them to verify that the standard error of the `x2` regression coefficient is again $\approx 0.01$.

In [None]:
# YOUR CODE HERE
stop()

Notice that the bootstrap is very general: nothing about it is specific to linear regression. This generality is why it is so popular: it lets us estimate uncertainty in more complicated models where (unlike regression) we don't have exact formulas for the standard error.
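
For comparison, the exact formula that is available in linear regression is the standard textbook one shown here. The notation is not used elsewhere in this notebook and is introduced only for illustration: $X$ is the $n \times k$ design matrix (including the intercept column), $e_i$ are the residuals, and $\hat\beta_j$ is the $j$-th estimated coefficient.

$$
\widehat{\mathrm{se}}(\hat\beta_j) = \hat\sigma \sqrt{\left[(X^\top X)^{-1}\right]_{jj}},
\qquad
\hat\sigma^2 = \frac{1}{n - k} \sum_{i=1}^{n} e_i^2 .
$$

This is the quantity reported in the `Std. Error` column of `summary()` for an `lm` fit.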

#### Problem 4 (8 pts.)
In lecture we looked at a few examples of variable (a.k.a. model) selection: given a response variable, and some predictors, we studied residual plots in order to understand which predictors should be added to the model.

This procedure works okay for small data sets, but quickly becomes unmanageable on larger data sets. Already with ten predictors, there are $2^{10} = 1024$ different linear models that we could fit (not including interaction terms)! So, on real data, it is usually not possible to do model selection by hand. Some sort of automated procedure is needed.

------

In this problem you will implement one such procedure, known as *all subsets regression*. The idea is simple: given a response variable $y$ and predictors $x_1, \dots, x_p$, we fit all possible linear models involving $y$ and different combinations of the $x_i$. For each fitted model we record some measure of the quality of the fit. We select the model that has the highest quality score.

**(6 pts.)** Write a function `all_subsets_reg(df, qual)` which takes a data frame `df` and a function `qual` and implements all-subsets regression. The function `qual(mod)` takes as input a fitted model, and returns a quality score; higher scores are better. You may assume that the first column of `df` is the response variable, and the remaining columns of `df` are potential predictors. 

The function should return a vector containing the indices of the predictors that were selected under the best model. For example, if `df` is a tibble with four columns `y`, `x1`, `x2` and `x3`, and the algorithm selects `y ~ x1 + x3` as the best model, then `all_subsets_reg(df, qual)` should return the vector `c(1, 3)`.

If you want a more detailed description of how the algorithm should work, see the notebook [all_subsets_demo.ipynb](all_subsets_demo.ipynb) in this folder.

(*Hint*: the `combn(v, k)` function can be used to return all subsets of the vector `v` of size `k`).
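
To make the hint concrete, here is a small sketch of what `combn` returns and of the subset count quoted earlier. The variable names `v` and `subsets` are illustrative only.

```{r}
v = c(1, 2, 3)

# combn returns a matrix with k rows; each COLUMN is one subset of size k.
subsets = combn(v, 2)
subsets

# An optional FUN argument is applied to every subset.
combn(v, 2, FUN = sum)

# Number of subsets (of any size, including the empty one) of ten predictors.
sum(choose(10, 0:10))
```

One thing to watch for: when `v` is a single positive integer, `combn` treats it as `seq_len(v)`, so `combn(2, 1)` enumerates the subsets of `c(1, 2)` rather than of the one-element vector `2`.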

In [None]:
all_subsets_reg = function(df, qual) {
    # YOUR CODE HERE
    stop()
}

The measure of model quality we will use is [AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion). You don't need to understand how AIC works, but you should know that lower AIC values indicate a better model. Hence, we define our quality function to be the negative of AIC:

In [None]:
qual_aic = function(mdl) -AIC(mdl)
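
For reference, the usual definition of AIC is given here. The symbols are not used elsewhere in this notebook: $k$ denotes the number of estimated parameters and $\hat{L}$ the maximized value of the model's likelihood.

$$
\mathrm{AIC} = 2k - 2 \ln \hat{L}
$$

The penalty term $2k$ grows with every added parameter, so adding one predictor improves `qual_aic` only if it raises the maximized log-likelihood by more than one unit.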

In [None]:
set.seed(1)
# if there are no predictors, return NULL
stopifnot(is.null(all_subsets_reg(tibble(y = 1:10), qual_aic)))
# x perfectly predicts y, so of course we include it
stopifnot(identical(all_subsets_reg(tibble(y = 1:10, x = y), qual_aic), 1L))
# x2 is nearly co-linear with x1, so we do not include it.
stopifnot(identical(all_subsets_reg(tibble(y = 1:10, 
                                           x1 = y, 
                                           x2 = x1 + rnorm(n = 10)),
                                    qual_aic), 1L))
# if intercept-only model is selected, again return NULL
stopifnot(is.null(all_subsets_reg(tibble(y = 1:10, x = rnorm(n = 10)), qual_aic)))
# sim4 data (simulated data set from modelr)
df = sim4 %>% select(y, everything()) 
stopifnot(setequal(all_subsets_reg(df, qual_aic), c(1, 2)))
# iris data
data(iris)
stopifnot(setequal(all_subsets_reg(iris, qual_aic), 1:4))
# diamonds data are too big, take subset
df = slice(diamonds, 1:1000) %>% select(price, everything())
stopifnot(setequal(all_subsets_reg(df, qual_aic), c(2, 5, 6, 7, 8, 9)))

**(2 pts.)** Suppose that instead of AIC we had chosen $R^2$ as our measure of model quality:

In [None]:
qual_R2 = function(mdl) summary(mdl)$r.squared

Try running `all_subsets_reg(df, qual_R2)` for a few data sets. You should notice a pattern in terms of the variables that get selected. Explain why this pattern occurs. Is $R^2$ appropriate to use for model selection?

YOUR ANSWER HERE

#### Problem 5 (4 pts.)
The previous problem studied variable selection. Why is variable selection necessary at all? Shouldn't we use all the available data, rather than adding only a subset of the predictors to our model? 

One argument for using fewer variables is parsimony: as we discussed in lecture, simpler models are more interpretable. However, even if we do not care about interpretability, there is another reason for preferring less complex models. In this problem, we will see that reason.

**(2 pts.)** Write a function `split_data(df, p)` which, given a data frame `df` and a number $p \in (0, 1)$, returns a list with two named elements, `train` and `test`. The function will randomly sample (without replacement) $100 \times p \%$ of the rows (rounded to the nearest integer) from `df` and assign them to `train`. The remaining rows will be stored in `test`. For example, if `df` is a tibble with 100 rows and two columns and $p = 0.5$, then
```{r}
> split_data(df, .5)
$train
# A tibble: 50 x 2
<...>
$test
# A tibble: 50 x 2
<...>
```

In [None]:
split_data = function(df, p) {
    # YOUR CODE HERE
    stop()
}

In [None]:
stopifnot(nrow(split_data(sim1, .5)$train) == 15)
stopifnot(nrow(split_data(sim1, .5)$test) == 15)

**(1 pt.)** Next, write a function `ssr(df, mod)` which, given a data frame `df` and a model `mod`, returns the sum of squared residuals obtained by applying `mod` to `df`.

In [None]:
ssr = function(df, mod) {
    # YOUR CODE HERE
    stop()
}

In [None]:
stopifnot(near(ssr(sim1, lm(y ~ x, sim1)), 135.8746, tol=1e-1))
stopifnot(near(ssr(sim4, lm(y ~ x1 + x2, sim4)), 1322.714, tol=1e-1))

Now we will split the `diamonds` data into two halves:

In [None]:
set.seed(1)
df = split_data(diamonds, .5)

On the training data we will fit two models. The first is `log(price) ~ log(carat)` which we saw in lecture. The second, `log(price) ~ log(carat) + .^2`, contains hundreds more interaction terms:

In [None]:
mod1 = lm(log(price) ~ log(carat), df$train)
mod2 = lm(log(price) ~ log(carat) + .^2, df$train)

`mod2` is a much better predictor of `log(price)` on the training data set:

In [None]:
ssr(df$train, mod1)
ssr(df$train, mod2)

However, `mod2` does much *worse* than `mod1` when we compare the two models on the test data set:

In [None]:
ssr(df$test, mod1)
ssr(df$test, mod2)

This is called "[overfitting](https://en.wikipedia.org/wiki/Overfitting)". `mod2` has so many extra predictors that it can fit not only the actual patterns which are present in `df$train`, but also the random noise in that data. Since the noise is different in `df$test`, `mod2` does a worse job of predicting "out of sample".