-
Notifications
You must be signed in to change notification settings - Fork 2
/
55-regression-important-matrices.Rmd
258 lines (174 loc) · 8.12 KB
/
55-regression-important-matrices.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
# (PART\*) Regression Applications {-}
# Important Matrices in Regression
In this chapter, you will learn about important and useful matrices in regression applications. To illustrate these matrices, we will use the following toy data set to fit a regression model using SAT and Self-Esteem to predict variation in GPA.
```{r tab-14-01, echo=FALSE}
d = data.frame(
ID = 1:6,
SAT = c(560, 780, 620, 600, 720, 380),
GPA = c(3.0, 3.9, 2.9, 2.7, 3.7, 2.4),
Self = c(11, 10, 19, 7, 18, 13),
IQ = c(112, 143, 124, 129, 130, 82)
)
knitr::kable(
d,
col.names = c("ID", "SAT", "GPA", "Self-Esteem", "IQ"),
caption = "Example set of education data.",
align = "c",
table.attr = "style='width:60%;'",
escape = FALSE
) %>%
kable_classic()
```
<br />
Recall that the regression model is denoted as
$$
\mathbf{y} = \mathbf{Xb} + \mathbf{e}
$$
where **y** is the outcome vector, **X** is the design matrix, **e** is the vector of residuals, and **b** is the vector of coefficients estimated as,
$$
\mathbf{b} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}
$$
## Design Matrix
The design matrix, **X** plays a critical role in the regression model and in any estimation computed from the model. There is also a direct link from the model formula inputted in `lm()` and the design matrix. Namely, that each term "added" in the right-hand side of the formula (after the tilde) indicates a unique column in the design matrix. Below we use the toy data set to show the connection between different `lm()` formulations and the design matrix used.
$$
\begin{split}
&\mathtt{lm(GPA \sim 1 + SAT)} \\[2ex]
& \mathbf{X} = \begin{bmatrix}1 & 560 \\ 1 & 780 \\ 1 & 620\\ 1 & 600\\ 1 & 720\\ 1 & 380 \end{bmatrix}
\end{split}
$$
In this design matrix we have two columns, indicated by the two terms being added on the right-hand side of the formula. The `1` in the formula indicates that the design matrix should include a column of 1s. The second term (we added `SAT`) is comprised of the SAT scores.
$$
\begin{split}
&\mathtt{lm(GPA \sim 1 + SAT + IQ)} \\[2ex]
& \mathbf{X} = \begin{bmatrix}1 & 560 & 112 \\ 1 & 780 & 143 \\ 1 & 620 & 124\\ 1 & 600 & 129\\ 1 & 720 & 130\\ 1 & 380 & 82 \end{bmatrix}
\end{split}
$$
In this design matrix we have three columns, indicated by the three terms being added on the right-hand side of the formula. The first two columns are the same as in the previous design matrix, but now we also include a column of the IQ scores.
$$
\begin{split}
&\mathtt{lm(GPA \sim 1 + SAT + IQ + SAT:IQ)} \\[2ex]
& \mathbf{X} = \begin{bmatrix}1 & 560 & 112 & 62720 \\ 1 & 780 & 143 & 111540 \\ 1 & 620 & 124 & 76880\\ 1 & 600 & 129 & 77400\\ 1 & 720 & 130 & 93600\\ 1 & 380 & 82 & 31160 \end{bmatrix}
\end{split}
$$
In this design matrix we have four columns, indicated by the four terms being added on the right-hand side of the formula. The first three columns are the same as in the previous design matrix. The column of interaction terms is based on the product of the SAT and IQ scores.
$$
\begin{split}
&\mathtt{lm(GPA \sim 1 + SAT + I(SAT\widehat{}2)} \\[2ex]
& \mathbf{X} = \begin{bmatrix}1 & 560 & 313600 \\ 1 & 780 & 608400 \\ 1 & 620 & 384400\\ 1 & 600 & 360000\\ 1 & 720 & 518400\\ 1 & 380 & 144400 \end{bmatrix}
\end{split}
$$
In this design matrix we have three columns, indicated by the three terms being added on the right-hand side of the formula. The first two columns are the same as in the previous design matrix. The quadratic effect of SAT constitutes the elements in the third column.^[The `I()` function makes the quadratic a specific column in the design matrix. You need the `I()` function because `^` is a special character in formula syntax that indicates crossing. Without the `I()` term the design matrix would only have two columns; a column of 1s and a column of SAT scores. You can read more by accesing the help page for `formula`, namely `help(formula)`.]
<br />
## Vector of Fitted Values
We can compute the vector of fitted values using
$$
\hat{\mathbf{y}} = \mathbf{Xb}
$$
Here **X** has dimensions $n \times k$ and **b** has dimensions $k \times 1$. In both cases $k=p+1$ where *p* is equal to the number of predictors in the model. This product gives the $n\times 1$ vector of fitted values, $\hat{\mathbf{y}}$, which is,
$$
\begin{bmatrix}\hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \\ \vdots \\ \hat{y}_n\end{bmatrix}
$$
```{r}
# Create vector of outcomes
y = c(3.0, 3.9, 2.9, 2.7, 3.7, 2.4)
# Create design matrix
X = matrix(
data = c(
rep(1, 6),
560, 780, 620, 600, 720, 380,
11, 10, 19, 7, 18, 13
),
ncol = 3
)
# Estimate coefficients
b = solve(t(X) %*% X) %*% t(X) %*% y
# Obtain vector of fitted values
y_hat = X %*% b
# View y_hat
y_hat
```
<br />
## The H-Matrix (Hat Matrix)
One useful matrix in regression is the *hat matrix*, or the **H**-matrix. To obtain the **H**-matrix we substitute the matrix formulation of the coefficient vector, **b**, into the equation to compute the fitted values,
$$
\begin{split}
\mathbf{\hat{y}} &= \mathbf{Xb} \\[2ex]
&= \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{y}
\end{split}
$$
The set of products in this expression involving the **X** terms is referred to as **H**-matrix, that is:
$$
\begin{split}
\mathbf{\hat{y}} &= \underbrace{\mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal}}_\mathbf{H}\mathbf{y} \\[2ex]
&= \mathbf{Hy}
\end{split}
$$
where,
$$
\mathbf{H} = \mathbf{X}(\mathbf{X}^{\intercal}\mathbf{X})^{-1}\mathbf{X}^{\intercal}
$$
Note that the **H**-matrix can be created completely from the design matrix and its transpose.
```{r}
# Create H
H = X %*% solve(t(X) %*% X) %*% t(X)
# View H
H
```
It also has the following properties:
- **H** is a square $n \times n$ matrix
- **H** is a symmetric matrix.
- **H** is an idempotent matrix.
Idempotency is a mathematical property that implies certain operations can be repeatedly applied to a structure without changing it. An idempotent matrix is a matrix that can be post-multiplied by itself and the result is the originasl matrix. In our example, since **H** is idempotent,
$$
\mathbf{HH} = \mathbf{H}
$$
We can verify this using R
```{r}
# Verify H is idempotent
H %*% H
```
As shown previously, we can also use the **H**-matrix to compute the fitted values, by post-multiplying **H** by the vector of outcomes:
```{r}
# Compute fitted values
H %*% y
```
In this computation, we can see that the fitted values can be expressed as linear combinations of the response vector **Y** using coefficients found in **H**. (This is why **H** is often referred to as the *hat matrix*.) For example, the first fitted value is computed as:
$$
\begin{split}
\hat{y}_1 &= 0.2244(3.0) + 0.1377(3.9) + 0.0595(2.9) + 0.2738(2.7) + 0.0294(3.7) + 0.2751(2.4) \\[2ex]
&= 2.891
\end{split}
$$
The **H**-matrix has many uses in regression. Aside from using it to compute the vectors of fitted values and residuals, it is also used to obtain leverage measures for each of the observations. Additionally, the trace of the **H**-matrix indicates the number of parameters (including the intercept) that are being estimated in the regression model. For example, in our regression model, we are estimating three parameters: $\beta_0$, $\beta_{\mathrm{SAT}}$, and $\beta_{\mathrm{Self\mbox{-}Esteem}}$.
```{r}
# Compute trace of H
sum(diag(H))
```
<br />
## Vector of Residuals
The $n\times 1$ vector of *residuals*, **e**, can be computed as,
$$
\mathbf{e} = \begin{bmatrix}\epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n\end{bmatrix} = \mathbf{y} - \mathbf{\hat{y}}
$$
Doing a little substitution,
$$
\begin{split}
\mathbf{e} &= \mathbf{y} - \mathbf{\hat{y}} \\
&= \mathbf{y} - \mathbf{Hy} \\
&= (\mathbf{I} - \mathbf{H}) \mathbf{y}
\end{split}
$$
Thus the residuals can also be expressed as linear combinations of the response vector **y**. The matrix $(\mathbf{I}-\mathbf{H})$ has some of the same properties as **H**, namely, it is
- Square,
- Symmetric, and
- Idempotent
We can compute the residual vector using R
```{r}
# Create 6x6 identity matrix
I = diag(6)
# Compute the vector of residuals
residuals = (I - H) %*% y
# View vector of residuals
residuals
```
<br />