/
03-linear-regression.Rmd
717 lines (519 loc) · 28.4 KB
/
03-linear-regression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
# Linear Regression {#chapter-3}
Chapter 3 is about (ordinary least squares and polynomial) linear regression, a parametric supervised statistical learning method. Many of the more advanced statistical learning approaches are generalizations or extensions of linear regression, so having a good grasp on linear regression is important for understanding later methods in the book.
It also discusses K-nearest neighbours regression (KNN regression), a non-parametric supervised statistical learning method.
## Exercises
### Prerequisites {.unnumbered}
To access the data sets and functions used to complete the Chapter 3 exercises, load the following packages.
```{r prerequisites, message=FALSE}
library(ISLR2)
library(tidyverse)
library(tidymodels)
# library(skimr)
# library(GGally)
library(patchwork)
```
### Conceptual {.unnumbered}
::: exercise
Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
| Term | Coefficient | Std. Error | t-statistic | p-value |
|-----------|-------------|------------|-------------|----------|
| intercept | 2.939 | 0.3119 | 9.420 | \< .001 |
| TV | 0.046 | 0.0014 | 32.81 | \< .0001 |
| radio | 0.189 | 0.0086 | 21.89 | \< .0001 |
| newspaper | -0.001 | 0.0059 | -0.18 | .8599 |
: Table 3.4
*Answer*.
The null hypotheses to which the p-values in Table 3.4 correspond is that the coefficient for each term is equal to zero, $\beta_i = 0$, which we use to test whether a term is associated with the response variable. We can use the p-values in the table to infer whether the coefficients are sufficiently far from zero that we can be confident they are non-zero, and thus associated with the response variable.
Based on the p-values in Table 3.4, we can conclude that:
- A $1000 increase in the TV advertising budget is associated with an average increase in product sales by 46 units while holding the advertising budgets for radio and newspaper fixed.
- A $1000 increase in the radio advertising budget is associated with an average increase in product sales by 189 units while holding the advertising budgets for TV and newspaper fixed.
- A $1000 increase in the newspaper advertising budget is not associated with a change in product sales.
Data note: For the Advertising data sales are in thousands of units and advertising budgets in thousands of dollars.
:::
::: exercise
Carefully explain the differences between the KNN classifier and KNN regression methods.
*Answer*.
Given a value for $K$ and a test observation $x_0$, both the KNN classifier and KNN regression first identify the $K$ training observations that are closest to $x_0$, represented by $\mathcal N_0$. For the KNN classifier $x_0$ would be a discrete class, and for KNN regression $x_0$ would be a continuous value. The methods then diverge in what they do with $\mathcal N_0$.
The KNN classifier then estimates the conditional probability that the test observation $x_0$ belongs to class $j$ as the fraction of training observations in $\mathcal N_0$ whose response value $y_i$ equals $j$:
$$
\mathrm{Pr}(Y = j|X = x_0) = \frac{1}{K} \sum_{i \in \mathcal N_0} I(y_i = j).
(\#eq:knn-classification)
$$
KNN regression, on the other hand, then estimates $f(x_0)$ using the average of all the training observations in $\mathcal N_0$:
$$
\hat f(x_0) = \frac{1}{K} \sum_{i \in {\mathcal N}_0} y_i.
(\#eq:knn-regression)
$$
As can be seen in Equations \@ref(eq:knn-classification) and \@ref(eq:knn-regression), there are two main differences between the KNN classifier and KNN regression methods:
- On the right side of the equation, the KNN classifier uses an indicator function $I(y_i = j)$ to determine whether a response value $y_i$ equals $j$ (if so then it equals 1, if not 0). These are summed to get the numerator of the conditional probability for each class (where the denominator is $K$). KNN regression, on the other hand, simply uses the response value $y_i$ as is, since it's a continuous value. These are also summed to get the numerator of the fraction (again with $K$ as the denominator), although it makes more sense to think of this as the average of all the training observations in $\mathcal N_0$.
- On the left hand side of the equation, for the KNN classifier we estimate the conditional probability that a test observation $x_0$ belongs to a class $j$. We can then use this probability to assign $x_0$ to the class $j$ with the highest probability, if we so choose (although we don't have to, the probabilities are equally useful). For KNN regression we estimate the form of $f(x_0)$, which corresponds to our prediction point for our test observation $x_0$ since $\hat Y = \hat f(X)$.
Thus, the main difference between the two methods is that the KNN classifier method is used to predict what class an observation likely belongs to based on the classes of its $K$ nearest neighbours, whereas the KNN regression method is used to predict what value an observation will have on a response variable based on the response values of its $K$ nearest neighbours.
:::
::: exercise
Suppose we have a data set with five predictors, $X_1 =$ GPA, $X_2 =$ IQ, $X_3 =$ Level (1 for College and 0 for High School), $X_4 =$ Interaction between GPA and IQ, and $X_5 =$ Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get $\hat \beta_0 = 50$, $\hat \beta_1 = 20$, $\hat \beta_2 = 0.07$, $\hat \beta_3 = 35$, $\hat \beta_4 = 0.01$, $\hat \beta_5 = −10$.
(a) Which answer is correct, and why?
i. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.
ii. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.
iii. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.
iv. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.
*Answer*. The fourth description is correct. The coefficient for Level is positive, so when it is included in the model (i.e., when Level is not zero) then predictions for starting salary will be higher; however, because the coefficient for the interaction between GPA and level is negative, this will only be true provided that college graduate GPA is high enough.
(b) Predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.
*Answer*.
The predicted salary for this college graduate is $137,100.
$$
\begin{align}
\hat y &= 50 + 20(4) + 0.07(110) + 35(1) + 0.01(4*110) - 10(4*1) \\
&= 50 + 80 + 7.7 + 35 + 4.4 - 40 \\
&= 137.1
\end{align}
$$
(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
*Answer*. False. The size of the coefficient does not provide evidence for or against an interaction effect. We use the p-value associated with the coefficient to determine whether it is sufficiently far from zero that we can be confident it is non-zero, and thus associated with the response variable. If our estimates are precise enough we can find a small coefficient and be confident that it is non-zero. But we cannot make a decision about this based on the coefficient alone.
:::
::: exercise
I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon$.
(a) Suppose that the true relationship between $X$ and $Y$ is linear,
i.e. $Y = \beta_0 + \beta_1X + \epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
*Answer*.
Given that the true relationship between $X$ and $Y$ is linear we would expect the residual sum of squares to be lower for the linear regression than the cubic regression because:
- The least squares line minimizes RSS, and
- The linear regression matches the true form of $f(X)$, so
- The linear regression will fit better and thus have a smaller RSS.
(b) Answer (a) using test rather than training RSS.
*Answer*.
We would expect the same outcome with the test RSS as with the training RSS.
(c) Suppose that the true relationship between $X$ and $Y$ is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
*Answer*.
Given that we don't know how far the true relationship between $X$ and $Y$ is from linear, it's hard to say for sure. If the true relationship is far from linear we would expect training RSS to be lower for the cubic regression than the linear regression because the bias-variance trade off would be poorer for the linear regression. We would expect the same if the true relationship was cubic because the cubic regression would match the true form of $f(X)$. However, if the true relationship was only slightly nonlinear then it's hard to say which would have lower training RSS, if any. There are likely situations where one or the other is better and where they are the same; although in general the cubic regression might overfit and thus have a lower RSS.
(d) Answer (c) using test rather than training RSS.
*Answer*.
Again, it's hard to say for sure. If the true relationship between $X$ and $Y$ is slightly nonlinear, we might expect the cubic regression to have overfit the training data, so the bias of the linear regression may lead to better predictions (and a lower RSS) in the test data. If the true relationship is cubic then the cubic regression would have a lower test RSS since it would match the true form of $f(X)$. If the true relationship is far from linear then we would expect the cubic regression to have lower RSS since it would have a better bias-variance trade off than the linear regression.
:::
::: exercise
Consider the fitted values that result from performing linear regression without an intercept. In this setting, the $i$th fitted value takes the form
$$
\hat y_i = \hat \beta x_i,
$$
where
$$
\hat \beta = \bigg(\sum_{i = 1}^n x_i y_i \bigg) / \bigg(\sum_{i' = 1}^n x_{i'}^2 \bigg).
$$
Show that we can write
$$
\hat y_i = \sum_{i' = 1}^n a_{i'} y_{i'}.
$$
What is $a_{i'}$?
*Answer*.
We can get to the solution by:
1. Plugging the expression for $\hat \beta$ into the formula for the $i$th fitted value.
2. Moving the constant $x_i$ into the numerator summation expression and changing the numerator summation dummy variable to $i'$, since $x_i$ is independent of $i'$.
3. Moving the summation for $i'$ out of the numerator.
4. Moving $y_{i'}$ out of the numerator.
Which looks like
$$
\begin{align}
\hat y_i &= x_i \hat \beta \\
&= x_i \frac{\sum_{i = 1}^n x_i y_i}{\sum_{i' = 1}^n x_{i'}^2} \\
&= \frac{\sum_{i' = 1}^n x_i x_{i'} y_{i'}}{\sum_{i'' = 1}^n x_{i''}^2} \\
&= \sum_{i' = 1}^n \frac{x_i x_{i'} y_{i'}}{\sum_{i'' = 1}^n x_{i''}^2} \\
&= \sum_{i' = 1}^n \frac{x_i x_{i'}}{\sum_{i'' = 1}^n x_{i''}^2} y_{i'} \\
&= \sum_{i' = 1}^n a_{i'} y_{i'},
\end{align}
$$
where
$$
a_{i'} = \frac{x_i x_{i'}}{\sum_{i'' = 1}^n x_{i''}^2}.
$$
*Note*: We interpret this result by saying that the fitted values from linear regression are linear combinations of the response values.
:::
::: exercise
Using the least squares coefficient estimates equation, argue that in the case of simple linear regression, the least squares line always passes through the point $(\bar x, \bar y)$.
*Answer*.
The the least squares coefficient estimates are given by
$$
\begin{align}
\hat \beta_1 &= \frac{\sum_{i = 1}^n (x_i - \bar x)(y_i - \bar y)}
{\sum_{i = 1}^n (x_i - \bar x)^2}, \\
\hat \beta_0 &= \bar y - \hat \beta_1 \bar x.
\end{align}
$$
The least squares line is given by
$$
\hat y = \hat \beta_0 + \hat \beta_1 x.
$$
When $x = \bar x$ then $\hat y = \bar y$ on the least squares line. This can be demonstrated by substituting the least squares coefficient estimate equation for the intercept into the least squares line equation,
$$
\begin{align}
\hat y &= \hat \beta_0 + \hat \beta_1 \bar x \\
&= \bar y - \hat \beta_1 \bar x + \hat \beta_1 \bar x \\
&= \bar y.
\end{align}
$$
Since the least squares line passes through all values of $X$, it will always pass through the mean of the predictor $\bar x$, which exists somewhere between the smallest and largest values of $x$. As shown above, when $x = \bar x$ then $\hat y = \bar y$ on the least squares line, thus the least squares line always passes through the point $(\bar x, \bar y)$.
:::
::: exercise
It is claimed in the text that in the case of simple linear regression of $Y$ onto $X$, the $R^2$ statistic is equal to the square of the correlation between $X$ and $Y$. Prove that this is the case. For simplicity, you may assume that $\bar x = \bar y = 0$.
*Answer*.
Skipping this one.
<!--
The $R^2$ statistic is given by
$$
R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}}
= 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},
$$
where $\mathrm{TSS} = \sum_{i = 1}^n (y_i - \bar y)^2$ and $\mathrm{RSS} = \sum_{i = 1}^n (y_i - \hat y_i)^2$. So
$$
R^2 = \frac{
\sum_{i = 1}^n (y_i - \bar y)^2 - \sum_{i = 1}^n (y_i - \hat y_i)^2
}{
\sum_{i = 1}^n (y_i - \bar y)^2
} = 1 - \frac{
\sum_{i = 1}^n (y_i - \hat y_i)^2
}{
\sum_{i = 1}^n (y_i - \bar y)^2
}
$$
The correlation between $X$ and $Y$.
$$
\mathrm{Cor}(X,Y) = \frac{
\sum_{i = 1}^n (x_i - \bar x)(y_i - \bar y)
}{
\sqrt{\sum_{i = 1}^n (x_i - \bar x)^2} \sqrt{\sum_{i = 1}^n (y_i - \bar y)^2}
}
$$
Since $\bar x = \bar y = 0$ these can be simplified to
$$
R^2 = \frac{
\sum_{i = 1}^n y_i^2 - \sum_{i = 1}^n (y_i - \hat y_i)^2
}{
\sum_{i = 1}^n y_i^2
} = 1 - \frac{
\sum_{i = 1}^n (y_i - \hat y_i)^2
}{
\sum_{i = 1}^n y_i^2
},
$$
and
$$
\mathrm{Cor}(X,Y) = \frac{
\sum_{i = 1}^n x_i y_i
}{
\sqrt{\sum_{i = 1}^n x_i^2 y_i^2}
}.
$$
RSS can also be simplified using the least squares coefficient estimates
$$
\begin{align}
\mathrm{RSS} &= \sum_{i = 1}^n (y_i - \hat y_i)^2 \\
&= \sum_{i = 1}^n (y_i - \hat y_i)(y_i - \hat y_i) \\
&= \sum_{i = 1}^n y_i^2 - 2 y_i \hat y_i + \hat y_i^2 \\
&= \sum_{i = 1}^n y_i^2 - 2 y_i (\hat \beta_0 + \hat \beta_1 x_i) +
(\hat \beta_0 + \hat \beta_1 x_i)^2 \\
&= \sum_{i = 1}^n y_i^2 - 2 x_i y_i \hat \beta_1 + (\hat \beta_1 x_i)^2 \\
&= \sum_{i = 1}^n y_i^2 - 2 x_i y_i
\bigg(\frac{\sum_{i = 1}^n x_i y_i}{\sum_{i = 1}^n x_i^2}\bigg) +
x_i^2
\bigg(\frac{\sum_{i = 1}^n x_i y_i}{\sum_{i = 1}^n x_i^2}\bigg)^2 \\
&= \sum_{i = 1}^n y_i^2 - 2
(\sum_{i = 1}^n x_i y_i)
\bigg(\frac{\sum_{i = 1}^n x_i y_i}{\sum_{i = 1}^n x_i^2}\bigg) +
x_i^2
\bigg(\frac{\sum_{i = 1}^n x_i y_i}{\sum_{i = 1}^n x_i^2}\bigg)^2 \\
&= \sum_{i = 1}^n y_i^2 - 2
\frac{(\sum_{i = 1}^n x_i y_i)^2}{\sum_{i = 1}^n x_i^2} +
x_i^2
\bigg(\frac{\sum_{i = 1}^n x_i y_i}{\sum_{i = 1}^n x_i^2}\bigg)^2 \\
\end{align}
$$
-->
:::
### Applied {.unnumbered}
::: exercise
This question involves the use of simple linear regression on the `ISLR2::Auto`
data set.
(a) Use the `lm()` function to perform a simple linear regression with `mpg` as the response and `horsepower` as the predictor. Use the `summary()` function to print the results. Comment on the output.
*Answer*.
```{r}
auto_model <- lm(mpg ~ horsepower, data = Auto)
summary(auto_model)
```
From the output above we can see that:
- There is a relationship between horsepower and mpg such that a one horsepower increase is associated with an average decrease in fuel economy of 0.157845 miles per gallon.
- Horsepower explains 60.59% of the variability in mpg.
If we make some predictions based on a few common horsepower values for different classes of vehicles (small car, average car, SUV) we can see that this model is misspecified.
```{r}
auto_model %>%
predict(
tibble(horsepower = c(98, 180, 300)),
type = "response",
interval = "confidence"
)
```
When a vehicle has 300 horsepower our model predicts that it will have a fuel economy that falls somewhere between -9.94281 mpg and -4.892308 mpg. This implies that vehicles with high horsepower *produce* fuel during travel, rather than consuming it. It's a nonsense statistic that indicates this model is misspecified in a fairly severe way. Or perhaps that we have a data quality issue.
(b) Plot the response and the predictor. Display the least squares regression line using an reference line (abline).
*Answer*.
```{r}
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point(alpha = 0.5) +
geom_abline(
slope = coef(auto_model)["horsepower"],
intercept = coef(auto_model)["(Intercept)"],
colour = "blue"
) +
coord_cartesian(ylim = c(0, 50))
```
This plot gives some good evidence our model is misspecified. The relationship between horsepower and mpg is clearly nonlinear in our data. We also see that the range of horsepower in our data is limited in comparison to the amount of horsepower we might be interested in making predictions for.
(c) Produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
*Answer*.
```{r}
auto_model %>%
augment() %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_point()
```
The diagnostic plot clearly shows problems with the fit:
- The U-shape of the residuals gives a strong indication of nonlinearity (and we know there is nonlinearity in the data already).
- The non-constant variance of the residuals is a sign of heteroscedasticity.
- There are at least a couple outliers.
:::
::: exercise
This question involves the use of multiple linear regression on the `Auto` data set.
(a) Produce a scatterplot matrix which includes all of the variables
in the data set.
*Answer*.
```{r}
plot(Auto)
```
(b) Compute the matrix of correlations between the variables.
*Answer*.
```{r}
Auto %>%
select(-name) %>%
cor()
```
(c) Use the `lm()` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` and `origin` as the predictors. Use the summary() function to print the results. Comment on the output.
*Answer*.
```{r}
auto_model <- Auto %>%
mutate(
across(origin, as_factor),
across(
origin,
~ fct_recode(.x, American = "1", European = "2", Japanese = "3")
)
) %>%
lm(
mpg ~ cylinders + displacement + horsepower + weight +
acceleration + year + origin,
data = .
)
summary(auto_model)
```
From the output above we can see that:
- The F-statistic indicates at least one predictor is non-zero.
- American vehicles have an average fuel economy of -17.9546021 miles per gallon when all other variables are 0. This is a nonsense statistic, as are holding most of the other variables at 0. Centring might help here.
- European vehicles have an average fuel economy 2.63 miles per gallon higher than American vehicles while holding all other variables fixed.
- Japanese vehicles have an average fuel economy 2.85 miles per gallon higher than American vehicles while holding all other variables fixed.
- A one year increase in vehicle model is associated with an average increase in fuel economy of 0.78 miles per gallon for American vehicles. So newer American vehicles are expected to have higher fuel economy.
- All other predictors except for cylinders, horsepower, and acceleration, have a statistically significant relationship with mpg.
(d) Produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high
leverage?
*Answer*.
```{r}
auto_model %>%
augment() %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_point()
```
The diagnostic plot clearly shows problems with the fit:
- The U-shape of the residuals gives a strong indication of nonlinearity (and we know there is nonlinearity in the data already).
- The non-constant variance of the residuals is a sign of heteroscedasticity.
The leverage plot shows one or two high leverage points. We can also see some possible outliers whose standardized residual value is greater than three.
```{r}
auto_model %>%
augment() %>%
# The standardized residual returned by augment() is equivalent to the
# "internally studentized residuals", which the textbook refers to simply as
# "studentized residuals".
ggplot(aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_text(aes(label = 1:nrow(Auto)), nudge_x = -0.007) +
labs(x = "leverage")
```
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
*Answer*.
In a model with interactions between all predictors there are a number of statistically significant interactions (not shown because its output is really long).
Some simpler models show some interesting interactions. For example, here we see that the average effect of a one horsepower increase on fuel economy depends on a vehicle's country of origin (although not by a meaningful amount it seems).
```{r}
Auto %>%
mutate(
across(origin, as_factor),
across(
origin,
~ fct_recode(.x, American = "1", European = "2", Japanese = "3")
)
) %>%
lm(
mpg ~ horsepower*origin,
data = .
) %>%
summary()
```
(f) Try a few different transformations of the variables, such as $\mathrm{log}(X)$, $\sqrt X$, $X^2$. Comment on your findings.
*Answer*.
Here I'll fit the same model with the different transformations. I chose the predictors that have a nonlinear relationship with mpg. As can be seen from the output of the models, the transformations affect all aspects of the model fits and parameter estimates.
```{r}
auto_model_1 <- lm(
mpg ~ displacement + horsepower + weight,
data = Auto
)
auto_model_2 <- lm(
mpg ~ log(displacement) + log(horsepower) + log(weight),
data = Auto
)
auto_model_3 <- lm(
mpg ~ sqrt(displacement) + sqrt(horsepower) + sqrt(weight),
data = Auto
)
auto_model_4 <- lm(
mpg ~ I(displacement^2) + I(horsepower^2) + I(weight^2),
data = Auto
)
summary(auto_model_1)
summary(auto_model_2)
summary(auto_model_3)
summary(auto_model_4)
```
:::
::: exercise
This question should be answered using the `Carseats` data set.
(a) Fit a multiple regression model to predict `Sales` using `Price`, `Urban`, and `US`.
*Answer*.
```{r}
carseats <- Carseats %>%
as_tibble() %>%
mutate(
across(where(is.character), as_factor),
across(ShelveLoc, ~ fct_relevel(.x, "Medium", after = 1))
)
carseats_model <- lm(Sales ~ Price + Urban + US, data = carseats)
summary(carseats_model)
```
(b) Provide an interpretation of each coefficient in the model. Be careful---some of the variables in the model are qualitative!
*Answer*.
The coefficients can be interpreted as follows:
- The intercept does not make sense to interpret since the car seats are never sold for free.
- A one dollar increase in car seat price is associated with an average decrease in sales by 54 units at rural stores located outside the US.
- Urban stores located outside the US sell 22 fewer car seats on average compared to rural stores located outside the US while holding price fixed.
- Rural stores located in the US sell 1200 more car seats on average compared to rural stores located outside the US while holding price fixed.
(c) Write out the model in equation form, being careful to handle
the qualitative variables properly.
*Answer*.
$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i + \beta_3 x_i + \epsilon_i =
\begin{cases}
\beta_0 + \beta_1 x_i + \epsilon_i &
\text{if ith location is rural outside US} \\
\beta_0 + \beta_1 x_i + \beta_2 x_i + \epsilon_i &
\text{if ith location is urban outside US} \\
\beta_0 + \beta_1 x_i + \beta_3 x_i + \epsilon_i &
\text{if ith location is rural inside US},
\end{cases}
$$
where
$$
\begin{align}
\beta_0 &= 13.043469 \\
\beta_1 &= -0.054459 \\
\beta_2 &= -0.021916 \\
\beta_3 &= 1.200573.
\end{align}
$$
(d) For which of the predictors can you reject the null hypothesis $H_0: \beta_j = 0$?
*Answer*. Intercept, Price, and US.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
*Answer*.
```{r}
carseats_model_2 <- lm(Sales ~ Price + US, data = carseats)
summary(carseats_model_2)
```
(f) How well do the models in (a) and (e) fit the data?
*Answer*.
The residual standard error and $R^2$ of the models are, respectively, similar and the same. A likelihood ratio test between the two models also shows no difference in the compatibility of the models with the data.
```{r}
anova(carseats_model, carseats_model_2, test = "LRT")
```
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
*Answer*.
```{r}
confint(carseats_model_2)
```
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
*Answer*.
As can be seen in the diagnostic plot below, there do not appear to be outliers, but there are at least a few high leverage points.
```{r}
carseats_model_2 %>%
augment() %>%
# The standardized residual returned by augment() is equivalent to the
# "internally studentized residuals", which the textbook refers to simply as
# "studentized residuals".
ggplot(aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_text(aes(label = 1:nrow(carseats)), nudge_x = -0.001) +
labs(x = "leverage")
```
:::
::: exercise
In this problem we will investigate the $t$-statistic for the null hypothesis $H_0: \beta = 0$ in simple linear regression without an intercept. To
begin, we generate a predictor `x` and a response `y` as follows.
```{r}
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
```
(a) Perform a simple linear regression of `y` onto `x`, without an intercept. Report the coefficient estimate $\hat \beta$, the standard error of this coefficient estimate, and the $t$-statistic and $p$-value associated with the null hypothesis $H_0: \beta = 0$. Comment on these results.
*Answer*.
```{r}
lm(y ~ x + 0) %>%
summary()
```
From the output above we see that the slope of the regression line is 1.9939. The standard error is pretty small so there's good precision for the estimate. The large $t$-statistic and small $p$-value indicate statistical significance.
(b) Now perform a simple linear regression of `x` onto `y` without an intercept, and report the coefficient estimate, its standard error, and the corresponding $t$-statistic and $p$-value associated with the null hypothesis $H_0: \beta = 0$. Comment on these results.
*Answer*.
```{r}
lm(x ~ y + 0) %>%
summary()
```
Although the coefficient and standard error are different from the previous model, the $t$-statistic and $p$-value are the same. The reason for this is that it's just the test statistic for the correlation between `x` and `y`, so it doesn't matter which is the response or predictor.
```{r}
cor.test(x, y)
```
The slopes between the models are slightly different though since `x` and `y` aren't symmetric, so flipping them will change the slope.
```{r}
model_a <- ggplot(mapping = aes(x = x, y = y)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ x + 0)
model_b <- ggplot(mapping = aes(x = y, y = x)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ x + 0)
model_a + model_b
```
(c) What is the relationship between the results obtained in (a) and (b)?
*Answer*. See above.
(d) Skipping.
(e) Using the results from (d), argue that the t-statistic for the regression of `y` onto `x` is the same as the $t$-statistic for the regression of `x` onto `y`.
*Answer*.
As noted above, the test statistic comes from the correlation test, which is why they're the same.
(f) In R, show that when regression is performed with an intercept,
the $t$-statistic for $H_0: \beta_1 = 0$ is the same for the regression of `y`
onto `x` as it is for the regression of `x` onto `y`.
*Answer*.
```{r}
lm(y ~ x) %>%
summary()
lm(x ~ y) %>%
summary()
```
It's the same as in the model without an intercept too.
:::
TODO: Questions 12-15