This repository has been archived by the owner on Jan 3, 2018. It is now read-only.
03-supp-loops-in-depth.Rmd
---
layout: lesson
root: ../..
---
```{r, include = FALSE}
source("chunk_options.R")
```
### To loop or not to loop...?
In R you have multiple options when repeating calculations: vectorized operations, `for` loops, and `apply` functions.
This lesson is an extension of [Analyzing Multiple Data Sets](03-loops-R.html).
In that lesson, we introduced how to run a custom function, `analyze`, over multiple data files:
```{r analyze-function}
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
```
```{r files}
filenames <- list.files(pattern = "csv")
```
#### Vectorized operations
A key difference between R and many other languages is a topic known as *vectorization*.
When you wrote the `total` function, we mentioned that R already has `sum` to do this; `sum` is *much* faster than the interpreted `for` loop because `sum` is coded in C to work with a vector of numbers.
Many of R's functions work this way; the loop is hidden from you in C.
Learning to use vectorized operations is a key skill in R.
For example, to add pairs of numbers contained in two vectors
```{r}
a <- 1:10
b <- 1:10
```
you could loop over the pairs adding each in turn, but that would be very inefficient in R.
```{r}
res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res
```
Instead, `+` is a *vectorized* function, which means it can operate on entire vectors at once:
```{r}
res2 <- a + b
all.equal(res, res2)
```
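With only ten elements the cost of the loop is invisible. The sketch below repeats the comparison on longer vectors so that `system.time` has something to measure. The names `n`, `x`, `y`, and `loop_add` are made up for this illustration, and the timings will vary from machine to machine.

```{r}
# Hypothetical larger inputs, chosen only to make the timing difference visible
n <- 1e6
x <- runif(n)
y <- runif(n)

# Explicit loop: one interpreted iteration per element
loop_add <- function(x, y) {
  out <- numeric(length(x))   # allocate the result once
  for (i in seq_along(x)) {
    out[i] <- x[i] + y[i]
  }
  out
}

system.time(res_loop <- loop_add(x, y))
system.time(res_vec <- x + y)   # the same work, with the loop done in C
all.equal(res_loop, res_vec)
```

Both versions should produce the same numbers; only the elapsed time differs.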
#### `for` or `apply`?
A `for` loop is used to apply the same function calls to a collection of objects.
R has a family of functions, the `apply` family, which can be used in much the same way.
You've already used one member of the family, `apply`, in the first [lesson](../01-starting-with-data.html).
The `apply` family members include
* `apply` - apply over the margins of an array (e.g. the rows or columns of a matrix)
* `lapply` - apply over an object and return a list
* `sapply` - apply over an object and return a simplified object (a vector or an array) if possible
* `vapply` - similar to `sapply` but you specify the type of object returned by the iterations
Each of these has an argument `FUN` which takes a function to apply to each element of the object.
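As a rough illustration of how the four functions differ in what they return, here is a small sketch on made-up data. The matrix `m` and the list `vals` are invented for this example and are not part of the inflammation data.

```{r}
# A small made-up matrix: 3 rows by 4 columns
m <- matrix(1:12, nrow = 3)

# apply: MARGIN = 1 works over rows, MARGIN = 2 over columns
row_means <- apply(m, 1, mean)
col_means <- apply(m, 2, mean)

# A made-up list of numeric vectors to iterate over
vals <- list(a = 1:3, b = 4:6, c = 7:9)

# lapply always returns a list, one element per input element
as_list <- lapply(vals, FUN = mean)

# sapply simplifies the result where it can; here, to a named numeric vector
as_vector <- sapply(vals, FUN = mean)

# vapply returns the same vector, but checks every result against FUN.VALUE
checked <- vapply(vals, FUN = mean, FUN.VALUE = numeric(1))

str(as_list)
as_vector
identical(as_vector, checked)
```

Because `vapply` stops with an error when a result does not match the declared template, it can be the safer choice inside functions, where a silent change in the shape of `sapply`'s output would be hard to spot.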
Instead of looping over `filenames` and calling `analyze`, as you did earlier, you could `sapply` over `filenames` with `FUN = analyze`:
```{r, eval=FALSE}
sapply(filenames, FUN = analyze)
```
Deciding whether to use `for` or one of the `apply` family is largely a matter of personal preference.
Using an `apply` family function forces you to encapsulate your operations in a function, rather than writing them as separate calls inside a `for` loop.
In some circumstances a `for` loop is more natural; for several related operations, a `for` loop saves you from having to pass a lot of extra arguments to your function.
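The trade-off around extra arguments can be sketched as follows. The helper `rescale` and its arguments `shift` and `width` are hypothetical names used only for this illustration.

```{r}
# Hypothetical helper: rescale a value using two extra settings
rescale <- function(x, shift, width) {
  (x - shift) / width
}

vals <- c(10, 20, 30)

# apply-style: extra arguments go after FUN and are forwarded to every call
res_apply <- sapply(vals, FUN = rescale, shift = 10, width = 5)

# for-style: the settings are ordinary variables visible inside the loop body
shift <- 10
width <- 5
res_for <- numeric(length(vals))
for (i in seq_along(vals)) {
  res_for[i] <- (vals[i] - shift) / width
}

all.equal(res_apply, res_for)
```

With two settings either form reads well. As the number of settings grows, the argument list of the `sapply` call grows with it, whereas the loop body simply uses whatever variables are in scope.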
#### Loops in R are slow
No, they are not! *If* you follow some golden rules:
1. Don't use a loop when a vectorized alternative exists
2. Don't grow objects (via `c`, `cbind`, etc.) during the loop - R has to create a new object and copy across the information just to add a new element or row/column
3. Allocate an object to hold the results and fill it in during the loop
As an example, we'll create a new version of `analyze` that will return the mean inflammation per day (column) of each file.
```{r}
analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}
system.time(avg2 <- analyze2(filenames))
```
Note how we add a new column to `out` at each iteration.
This is a cardinal sin of writing a `for` loop in R.
Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results.
Then we loop over the files but this time we fill in the `f`th column of our results matrix `out`.
This time there is no copying/growing for R to deal with.
```{r}
analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) ## assuming 40 here from files
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}
system.time(avg3 <- analyze3(filenames))
```
In this simple example there is little difference in the compute time of `analyze2` and `analyze3`.
This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations.
If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.
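To see the penalty without needing thousands of files, the sketch below simulates the per-file results. Everything in it is made up for illustration: `fake_means` stands in for the column means of `n_files` imaginary files, and `grow` and `prealloc` are hypothetical helpers mirroring the two strategies. Timings will vary by machine.

```{r}
# Simulated stand-in for many files: one vector of 40 daily means per "file"
n_files <- 2000
n_days <- 40
set.seed(1)
fake_means <- replicate(n_files, runif(n_days), simplify = FALSE)

# Strategy 1: grow the result with cbind (copies `out` at every iteration)
grow <- function(cols) {
  out <- NULL
  for (f in seq_along(cols)) {
    out <- cbind(out, cols[[f]])
  }
  out
}

# Strategy 2: allocate once, then fill in column by column
prealloc <- function(cols, nrow) {
  out <- matrix(NA_real_, nrow = nrow, ncol = length(cols))
  for (f in seq_along(cols)) {
    out[, f] <- cols[[f]]
  }
  out
}

system.time(m_grow <- grow(fake_means))
system.time(m_pre <- prealloc(fake_means, n_days))
all.equal(m_grow, m_pre, check.attributes = FALSE)
```

The two matrices should hold the same values; increasing `n_files` should widen the gap between the two timings, since each `cbind` call copies everything accumulated so far.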
Note that `apply` handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to `apply`.
At its heart, `apply` is just a `for` loop with extra convenience.
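One way to picture this is to write a stripped-down column-wise version by hand. The function `col_apply` below is a hypothetical, simplified stand-in and not how `apply` is actually implemented; it assumes `FUN` returns a single number for each column.

```{r}
# Simplified, hypothetical stand-in for apply(X, 2, FUN).
# The real apply also handles rows, higher dimensions, and non-scalar results.
col_apply <- function(X, FUN) {
  out <- numeric(ncol(X))        # allocate the result once
  for (j in seq_len(ncol(X))) {
    out[j] <- FUN(X[, j])        # fill in one element per column
  }
  out
}

m <- matrix(1:12, nrow = 3)
all.equal(col_apply(m, mean), apply(m, 2, mean))
```

The loop follows the golden rules above: the result is allocated once and filled in place, which is the bookkeeping `apply` spares you from writing.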