-
Notifications
You must be signed in to change notification settings - Fork 23
/
matrices-in-R.qmd
579 lines (394 loc) · 19.3 KB
/
matrices-in-R.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
# Matrices in R
When the number of variables associated with each observation is large and they can all be represented as a number, it is often more convenient to store them in a matrix and perform the analysis with linear algebra operations, rather than storing them in a data frame and performing the analysis with **tidyverse** or **data.table** functions. With matrices, variables for each observation are stored in a row, resulting in a matrix with as many columns as variables. In statistics, we refer to values represented in the rows of the matrix as the *covariates* or *predictors* and, in machine learning, we refer to them as the *features*.
In linear algebra, we have three types of objects: scalars, vectors, and matrices. We have already learned about vectors in R, and, although there is no data type for scalars, we can represent them as vectors of length 1. In this chapter, we learn how to work with matrices in R and relate them to linear algebra notation and concepts.
## Case study: MNIST {#sec-mnist}
The first step in handling mail received in the post office is to sort letters by zip code:
![](../ml/img/how-to-write-a-address-on-an-envelope-how-to-write-the-address-on-an-envelope-write-address-on-envelope-india-finishedenvelope-x69070.png){width="40%" style="display:block; margin:auto;"}
In the Machine Learning part of this book, we will describe how we can build computer algorithms to read handwritten digits, which robots then use to sort the letters. To do this, we first need to collect data, which in this case is a high-dimensional dataset and best stored in a matrix.
The MNIST dataset was generated by digitizing thousands of handwritten digits, already read and annotated by humans[^matrices-in-r-1]. Below are three images of written digits.
[^matrices-in-r-1]: http://yann.lecun.com/exdb/mnist/
```{r digit-images-example, echo=FALSE, cache=TRUE, message=FALSE}
library(tidyverse)
library(dslabs)
if (!exists("mnist")) mnist <- read_mnist()
tmp <- lapply( c(1,4,5), function(i){
expand.grid(Row = 1:28, Column = 1:28) |>
mutate(id = i, label = mnist$train$label[i],
value = unlist(mnist$train$images[i,]))
})
tmp <- Reduce(rbind, tmp)
tmp |> ggplot(aes(Row, Column, fill = value)) +
geom_raster(show.legend = FALSE) +
scale_y_reverse() +
scale_fill_gradient(low = "white", high = "black") +
facet_grid(.~label)
```
The images are converted into $28 \times 28 = 784$ pixels and, for each pixel, we obtain a grey scale intensity between 0 (white) and 255 (black). The following plot shows the individual features for each image:
```{r example-images, echo=FALSE}
tmp |> ggplot(aes(Row, Column, fill = value)) +
geom_point(pch = 21) +
scale_y_reverse() +
scale_fill_gradient(low = "white", high = "black") +
facet_grid(.~label)
```
For each digitized image, indexed by $i$, we are provided with 784 variables and a categorical outcome, or *label*, representing the digit among $0, 1, 2, 3, 4, 5, 6, 7 , 8,$ and $9$ that the image is representing. Let's load the data using the **dslabs** package:
```{r echo=FALSE}
library(tidyverse)
library(dslabs)
if (!exists("mnist")) mnist <- read_mnist()
```
```{r eval=FALSE}
library(tidyverse)
library(dslabs)
mnist <- read_mnist()
```
In these cases, the pixel intensities are saved in a matrix:
```{r}
class(mnist$train$images)
```
The labels associated with each image are included in a vector:
```{r}
table(mnist$train$labels)
```
### Motivating tasks
To motivate the use of matrices in R, we will pose six tasks related to the handwritten digits data and then show the fast and simple code that solves them.
1\. Visualize the original image. The pixel intensities are provided as rows in a matrix. We will show how to convert these to a matrix that we can visualize.
2\. Do some digits require more ink to write than others? We will study the distribution of the total pixel darkness and how it varies by digits.
3\. Are some pixels uninformative? We will study the variation of each pixel across digits and remove predictors (columns) associated with pixels that don't change much and thus can't provide much information for classification.
4\. Can we remove smudges? We will first look at the distribution of all pixel values. Next, we will use this to pick a cutoff to define unwritten space. Then, we set anything below that cutoff to 0.
5\. Binarize the data. First, we will look at the distribution of all pixel values. We will then use this to pick a cutoff to distinguish between writing and no writing. Afterward, we will convert all entries into either 1 or 0.
6\. Standardize the digits. We will scale each of the predictors in each entry to have the same average and standard deviation.
To complete these, we will have to perform mathematical operations involving several variables. The **tidyverse** or **data.table** are not developed to perform these types of mathematical operations. For this task, it is convenient to use matrices.
To simplify the code below, we will rename these `x` and `y` respectively:
```{r}
x <- mnist$train$images
y <- mnist$train$labels
```
## Dimensions of a matrix
The dimension of a matrix is an important characteristic needed to assure that certain linear algebra operations can be performed. The dimension is a two-number summary defined as the number of rows $\times$ the number of columns.
The `nrow` function tells us how many rows that matrix has:
```{r}
nrow(x)
```
and `ncol` tells us how many columns:
```{r}
ncol(x)
```
We learn that our dataset contains `r prettyNum(nrow(mnist$train$images), big.mark=",")` observations (images) and `r ncol(mnist$train$images)` features (pixels).
The `dim` function returns the rows and columns:
```{r}
dim(x)
```
## Creating a matrix
In R, we can create a matrix using the `matrix` function. The first argument is a vector containing the elements that will fill up the matrix. The second and third arguments determine the number of row and columns, respectively. So a typical way to create a matrix is to first obtain a vector of numbers containing the elements of the matrix and feeding it to the `matrix` function. For example, to create a $100 \times 2$ matrix of normally distributed random variables, we write:
```{r}
z <- matrix(rnorm(100*2), 100, 2)
```
Note that by default the matrix is filled in column by column:
```{r}
matrix(1:15, 3, 5)
```
To fill the matrix row by row, we can use the `byrow` argument:
```{r}
matrix(1:15, 3, 5, byrow = TRUE)
```
The function `as.vector` converts a matrix back into a vector:
```{r}
as.vector(matrix(1:15, 3, 5))
```
:::{callout-warning}
If the product of columns and rows does not match the length of the vector provided in the first argument, `matrix` recycles values. If the length of the vector is a sub-multiple or multiple of the number of rows, this happens **without warning**:
```{r}
matrix(1:3, 3, 5)
```
:::
## Subsetting
To extract a specific entry from a matrix, for example the 300th row of the 100th column, we write:
```{r, eval=FALSE}
x[300,100]
```
We can extract subsets of the matrices by using vectors of indexes. For example, we can extract the first 100 pixels from the first 300 observations like this:
```{r, eval=FALSE}
x[1:300,1:100]
```
To extract an entire row or subset of rows, we leave the column dimension blank. So the following code returns all the pixels for the first 300 observations:
```{r, eval=FALSE}
x[1:300,]
```
Similarly, we can subset any number of columns by keeping the first dimension blank. Here is the code to extract the first 100 pixels:
```{r, eval=FALSE}
x[,1:100]
```
:::{.callout-warning}
If we subset just one row or just one column, the resulting object is no longer a matrix. For example:
```{r}
dim(x[300,])
```
To avoid this, we can use the `drop` argument:
```{r}
dim(x[100,,drop = FALSE])
```
:::
### Task 1: Visualize the original image {.unnumbered}
For instance, let's try to visualize the third observation. From the label, we know this is a:
```{r}
mnist$train$label[3]
```
The third row of the matrix `x[3,]` contains the 784 pixel intensities. We can assume these were entered in order and convert them back to a $28 \times 28$ matrix using:
```{r}
grid <- matrix(x[3,], 28, 28)
```
To visualize the data, we can use `image` in the followin way:
```{r, eval=FALSE}
image(1:28, 1:28, grid)
```
However, because the y-axis in `image` goes bottom to top and `x` stores pixels top to bottom the code above shows shows a flipped image. To flip it back we can use:
```{r, eval=FALSE}
image(1:28, 1:28, grid[, 28:1])
```
```{r matrix-image, fig.width = 4, fig.height = 4, echo=FALSE}
rafalib::mypar(1,1)
image(1:28, 1:28, grid[, 28:1])
```
## Mathematical notation {#sec-matrix-notation}
Matrices are usually represented with bold upper case letters:
$$
\mathbf{X} =
\begin{bmatrix}
x_{1,1}&x_{1,2}&\dots & x_{1,p}\\
x_{2,1}&x_{2,2}&\dots & x_{2,p}\\
\vdots & \vdots & \ddots & \vdots\\
x_{n,1}&x_{n,2}&\dots&x_{n,p}\\
\end{bmatrix}
$$
with $x_{i,j}$ representing the $j$-the feature for the $i$-th observation.
We denote vectors with lower case bold letters and represent them as one column matrices, often referred to as *column vectors*. R follows this convention when converting a vector to a matrix:
```{r}
dim(matrix(x[300,]))
```
However, *column vectors* should not be confused with the columns of the matrix. They have this name simply because they have one column.
Mathematical descriptions of machine learning often make reference to vectors representing the $p$ features:
$$
\mathbf{x} =
\begin{bmatrix}
x_1\\\
x_2\\\
\vdots\\\
x_p
\end{bmatrix}
$$
To distinguish between features associated with the observations $i=1,\dots,n$, we add an index:
$$
\mathbf{x}_i = \begin{bmatrix}
x_{i,1}\\
x_{i,2}\\
\vdots\\
x_{i,p}
\end{bmatrix}
$$
::: {callout-warning}
Bold lower case letters are also commonly used to represent matrix columns rather than rows. This can be confusing because $\mathbf{x}_1$ can represent either the first row or the first column of $\mathbf{X}$. One way to distinguish is to use notation similar to computer code: using the colon $:$ to represent *all*. So $\mathbf{X}_{1,:}$ represents the first row and $\mathbf{X}_{:,1}$ is the first column. Another approach is to distinguish by the letter used to index, with $i$ used for rows and $j$ used for columns. So $\mathbf{x}_i$ is the $i$th row and $\mathbf{x}_j$ is the $j$th column. With this approach, it is important to clarify which dimension, row or column is being represented. Further confusion can arise because, as aforementioned, it is common to represent all vectors, including the rows of a matrix, as one-column matrices.
:::
## The transpose
A common operation when working with matrices is the *transpose*. We use the transpose to understand several concepts described in the next several sections. This operation simply converts the rows of a matrix into columns. We use the symbols $\top$ or $'$ next to the bold upper case letter to denote the transpose:
$$
\text{if } \,
\mathbf{X} =
\begin{bmatrix}
x_{1,1}&\dots & x_{1,p} \\
x_{2,1}&\dots & x_{2,p} \\
\vdots & \ddots & \vdots & \\
x_{n,1}&\dots & x_{n,p}
\end{bmatrix} \text{ then }\,
\mathbf{X}^\top =
\begin{bmatrix}
x_{1,1}&x_{2,1}&\dots & x_{n,1} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1,p}&x_{2,p}&\dots & x_{n,p}
\end{bmatrix}
$$
In R we compute the transpose using the function `t`
```{r}
dim(x)
dim(t(x))
```
One use of the transpose is that we can write the matrix $\mathbf{X}$ as rows of the column vectors representing the features for each individual observation in the following way:
$$
\mathbf{X} =
\begin{bmatrix}
\mathbf{x}_1^\top\\
\mathbf{x}_2^\top\\
\vdots\\
\mathbf{x}_n^\top
\end{bmatrix}
$$
## Row and column summaries
A common operation with matrices is to apply the same function to each row or to each column. For example, we may want to compute row averages and standard deviations. The `apply` function lets you do this. The first argument is the matrix, the second is the dimension, 1 for rows, 2 for columns, and the third is the function to be applied.
So, for example, to compute the averages and standard deviations of each row, we write:
```{r}
avgs <- apply(x, 1, mean)
sds <- apply(x, 1, sd)
```
To compute these for the columns, we simply change the 1 to a 2:
```{r}
avgs <- apply(x, 1, mean)
sds <- apply(x, 1, sd)
```
Because these operations are so common, special functions are available to perform them. So, for example, the functions `rowMeans` computes the average of each row:
```{r, message=FALSE}
avg <- rowMeans(x)
```
and the **matrixStats** function `rowSds` computes the standard deviations for each row:
```{r, message=FALSE}
library(matrixStats)
sds <- rowSds(x)
```
The functions `colMeans` and `colSds` provide the version for columns. For more fast implementations consider the functions available in **matrixStats**.
### Task 2: Do some digits require more ink to write than others? {.unnumbered}
For the second task, related to total pixel darkness, we want to see the average use of ink plotted against digit. We have already computed this average and can generate a boxplot to answer the question:
```{r boxplot-of-digit-averages}
avg <- rowMeans(x)
boxplot(avg ~ y)
```
From this plot we see that, not surprisingly, 1s use less ink than other digits.
## Conditional filtering
One of the advantages of matrices operations over **tidyverse** operations is that we can easily select columns based on summaries of the columns.
Note that logical filters can be used to subset matrices in a similar way in which they can be used to subset vectors. Here is a simple example subsetting columns with logicals:
```{r}
matrix(1:15, 3, 5)[,c(FALSE, TRUE, TRUE, FALSE, TRUE)]
```
This implies that we can select rows with conditional expression. In the following example we remove all observations containing at least one `NA`:
```{r}
#| eval: false
x[apply(!is.na(x), 1, all),]
```
This being a common operation, we have a **matrixStats** function to do it faster:
```{r}
#| eval: false
x[!rowAnyNAs(x),]
```
### Task 3: Are some pixels uninformative? {.unnumbered}
We can use these ideas to remove columns associated with pixels that don't change much and thus do not inform digit classification. We will quantify the variation of each pixel with its standard deviation across all entries. Since each column represents a pixel, we use the `colSds` function from the **matrixStats** package:
```{r}
sds <- colSds(x)
```
A quick look at the distribution of these values shows that some pixels have very low entry to entry variability:
```{r sds-histogram, echo=FALSE}
rafalib::mypar()
hist(sds, breaks = 30, main = "SDs")
```
```{r}
#| eval: false
hist(sds, breaks = 30, main = "SDs")
```
This makes sense since we don't write in some parts of the box. Here is the variance plotted by location:
```{r, eval=FALSE}
image(1:28, 1:28, matrix(sds, 28, 28)[, 28:1])
```
```{r pixel-variance, fig.width = 3, fig.height = 3, echo=FALSE, out.width="50%"}
rafalib::mypar()
image(1:28, 1:28, matrix(sds, 28, 28)[, 28:1])
```
We see that there is little variation in the corners.
We could remove features that have no variation since these can't help us predict.
So if we wanted to remove uninformative predictors from our matrix, we could write this one line of code:
```{r}
new_x <- x[,colSds(x) > 60]
dim(new_x)
```
Only the columns for which the standard deviation is above 60 are kept, which removes over half the predictors.
## Indexing with matrices
An operation that facilitates efficient coding is that we can change entries of a matrix based on conditionals applied to that same matrix. Here is a simple example:
```{r}
mat <- matrix(1:15, 3, 5)
mat[mat > 6 & mat < 12] <- 0
mat
```
A useful application of this approach is that we can change all the `NA` entries of a matrix to something else:
```{r}
#| eval: false
x[!is.na(x)] <- 0
```
### Task 4: Can we remove smudges?
A histogram of all our predictor data:
```{r histogram-all-pixels}
#| echo: false
rafalib::mypar()
hist(as.vector(x), breaks = 30, main = "Pixel intensities")
```
```{r}
#| eval: false
hist(as.vector(x), breaks = 30, main = "Pixel intensities")
```
shows a clear dichotomy which is explained as parts of the image with ink and parts without. If we think that values below, say, 50 are smudges, we can quickly make them zero using:
```{r, eval=FALSE}
new_x <- x
new_x[new_x < 50] <- 0
```
### Task 5: Binarizing the data {.unnumbered}
The histogram above seems to suggest that this data is mostly binary. A pixel either has ink or does not. Applying what we have learned, we can binarize the data using just matrix operations:
```{r}
bin_x <- x
bin_x[bin_x < 255/2] <- 0
bin_x[bin_x > 255/2] <- 1
```
We can also convert to a matrix of logicals and then coerce to numbers like this:
```{r}
bin_X <- (x > 255/2)*1
```
## Vectorization for matrices
In R, if we subtract a vector from a matrix, the first element of the vector is subtracted from the first row, the second element from the second row, and so on. Using mathematical notation, we would write it as follows:
$$
\begin{bmatrix}
X_{1,1}&\dots & X_{1,p} \\
X_{2,1}&\dots & X_{2,p} \\
& \vdots & \\
X_{n,1}&\dots & X_{n,p}
\end{bmatrix}
-
\begin{bmatrix}
a_1\\\
a_2\\\
\vdots\\\
a_n
\end{bmatrix}
=
\begin{bmatrix}
X_{1,1}-a_1&\dots & X_{1,p} -a_1\\
X_{2,1}-a_2&\dots & X_{2,p} -a_2\\
& \vdots & \\
X_{n,1}-a_n&\dots & X_{n,p} -a_n
\end{bmatrix}
$$
The same holds true for other arithmetic operations.
The function `sweep` facilitates this type of operation. It works similarly to `apply`. It takes each entry of a vector and applies an arithmetic operation to the corresponding row. Subtraction is the default arithmetic operation. So, for example, to center each row around the average, we can use:
```{r}
#| eval: false
sweep(x, 1, rowMeans(x))
```
### Task 6: Standardize the digits {.unnumbered}
The way R vectorizes arithmetic operations implies that we can scale each row of a matrix as follows:
```{r, eval=FALSE}
(x - rowMeans(x))/rowSds(x)
```
Yet this approach does not work for columns. For columns, we can `sweep`:
```{r, eval=FALSE}
x_mean_0 <- sweep(x, 2, colMeans(x))
```
To divide by the standard deviation, we change the default arithmetic operation to division as follows:
```{r, eval=FALSE}
x_standardized <- sweep(x_mean_0, 2, colSds(x), FUN = "/")
```
In R, if you add, subtract, multiple or divide two matrices, the operation is done elementwise. For example, if two matrices are stored in `x` and `y`, then:
```{r, eval=FALSE}
x*y
```
does not result in matrix multiplication. Instead, the entry in row $i$ and column $j$ of this product is the product of the entry in row $i$ and column $j$ of `x` and `y`, respectively.
## Exercises
1\. Create a 100 by 10 matrix of randomly generated normal numbers. Put the result in `x`.
2\. Apply the three R functions that give you the dimension of `x`, the number of rows of `x`, and the number of columns of `x`, respectively.
3\. Add the scalar 1 to row 1, the scalar 2 to row 2, and so on, to the matrix `x`.
4\. Add the scalar 1 to column 1, the scalar 2 to column 2, and so on, to the matrix `x`. Hint: Use `sweep` with `FUN = "+"`.
5\. Compute the average of each row of `x`.
6\. Compute the average of each column of `x`.
7\. For each digit in the MNIST training data, compute the proportion of pixels that are in a *grey area*, defined as values between 50 and 205. Make a boxplot by digit class. Hint: Use logical operators and `rowMeans`.