novice/r/03-supp-loops-in-depth.Rmd

---
layout: lesson
root: ../..
---

```{r, include = FALSE}
source("chunk_options.R")
```

### To loop or not to loop...?

In R you have multiple options when repeating calculations: vectorized operations, `for` loops, and `apply` functions.

This lesson is an extension of [Analyzing Multiple Data Sets](03-loops-R.html).
In that lesson, we introduced how to run a custom function, `analyze`, over multiple data files:

```{r analyze-function}
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
```

```{r files}
filenames <- list.files(pattern = "csv")
```

#### Vectorized operations

A key difference between R and many other languages is a topic known as *vectorization*.
When you wrote the `total` function, we mentioned that R already has `sum` to do this; `sum` is *much* faster than the interpreted `for` loop because `sum` is coded in C to work with a vector of numbers.
Many of R's functions work this way; the loop is hidden from you in C.
Learning to use vectorized operations is a key skill in R.

For example, to add pairs of numbers contained in two vectors

```{r}
a <- 1:10
b <- 1:10
```

you could loop over the pairs adding each in turn, but that would be very inefficient in R.

```{r}
res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res
```

Instead, `+` is a *vectorized* function which can operate on entire vectors at once

```{r}
res2 <- a + b
all.equal(res, res2)
```

#### `for` or `apply`?

A `for` loop is used to apply the same function calls to a collection of objects.
R has a family of functions, the `apply` family, which can be used in much the same way.
You've already used one of the family, `apply` in the first [lesson](../01-starting-with-data.html).
The `apply` family members include

 * `apply`  - apply over the margins of an array (e.g. the rows or columns of a matrix)
 * `lapply` - apply over an object and return list
 * `sapply` - apply over an object and return a simplified object (an array) if possible
 * `vapply` - similar to `sapply` but you specify the type of object returned by the iterations

Each of these has an argument `FUN` which takes a function to apply to each element of the object.
Instead of looping over `filenames` and calling `analyze`, as you did earlier, you could `sapply` over `filenames` with `FUN = analyze`:

```{r, eval=FALSE}
sapply(filenames, FUN = analyze)
```

Deciding whether to use `for` or one of the `apply` family is really personal preference.
Using an `apply` family function forces to you encapsulate your operations as a function rather than separate calls with `for`.
`for` loops are often more natural in some circumstances; for several related operations, a `for` loop will avoid you having to pass in a lot of extra arguments to your function.

#### Loops in R are slow

No, they are not! *If* you follow some golden rules:

 1. Don't use a loop when a vectorised alternative exists
 2. Don't grow objects (via `c`, `cbind`, etc) during the loop - R has to create a new object and copy across the information just to add a new element or row/column
 3. Allocate an object to hold the results and fill it in during the loop

As an example, we'll create a new version of `analyze` that will return the mean inflammation per day (column) of each file.

```{r}
analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}

system.time(avg2 <- analyze2(filenames))
```

Note how we add a new column to `out` at each iteration?
This is a cardinal sin of writing a `for` loop in R.

Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results.
Then we loop over the files but this time we fill in the `f`th column of our results matrix `out`.
This time there is no copying/growing for R to deal with.

```{r}
analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) ## assuming 40 here from files 
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}

system.time(avg3 <- analyze3(filenames))
```

In this simple example there is little difference in the compute time of `analyze2` and `analyze3`.
This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations.
If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.

Note that `apply` handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to `apply`.
At its heart, `apply` is just a `for` loop with extra convenience.