<a class="anchor" id="jump_to_top"></a>
# Iteration
---

### Table of Contents
* [For loops](#loops)
* [While loops](#while)
* [Iteration with the purrr package](#purr)



In the previous notebook we saw how we can use functions to reduce duplication in our code. Another great tool for reducing duplication is **iteration**, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.

In [1]:
# Attaching libraries
library(tidyverse)

# install.packages('nycflights13')  
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


<a class="anchor" id="loops"></a>
## For loops
Imagine we have this simple data frame made up of some random numbers:

In [2]:
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df

| a | b | c | d |
|---|---|---|---|
| -0.7845129 | 1.571392415 | -0.7681117 | -0.6014233 |
| 1.9015759 | 0.104698478 | 0.3141739 | -1.3604725 |
| 1.4321215 | -1.436483295 | 0.6771271 | 0.1640555 |
| -0.1132742 | 1.487543791 | 0.9274347 | 0.1612923 |
| -1.5108355 | 2.431599116 | -1.1042574 | -0.9364819 |
| 0.1368907 | 0.002127967 | 1.257678 | -0.1790018 |
| 0.8999673 | -0.853713939 | 1.1523855 | 0.6955371 |
| -1.1564681 | -0.34473183 | 0.4267772 | -1.7666327 |
| -2.5725805 | 0.158900992 | -2.7351831 | -1.1730102 |
| 0.8544514 | -0.562837593 | -0.31687 | 0.303955 |


We want to compute the median of each column. You could do this with copy-and-paste:

In [3]:
median(df$a)
median(df$b)
median(df$c)
median(df$d)

But that breaks our rule of thumb: never copy and paste more than twice. Instead, we could use a for loop:

In [4]:
for (i in 1:4) {
    print(paste0("Median for column ", colnames(df)[i], ": ", median(df[[i]])))
}

[1] "Median for column a: 0.0118082685156667"
[1] "Median for column b: 0.0534132221640036"
[1] "Median for column c: 0.370475535281854"
[1] "Median for column d: -0.390212546784606"


If we want to use these values again, it's good practice to store them:

In [5]:
output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output

Every for loop has three components:
1. **Output**: `output <- vector("double", ncol(df))`. Before you start the loop, allocate enough space for the output. This matters for efficiency: growing the output at each iteration (for example with `c()`) is much slower. A general way of creating an empty vector of a given length is the `vector()` function. It has two arguments: the type of the vector ("logical", "integer", "double", "character", etc.) and the length of the vector.
2. **Sequence**: `i in seq_along(df)`. This determines what to loop over: each run of the for loop will assign `i` to a different value from `seq_along(df)`.
3. **Body**: `output[[i]] <- median(df[[i]])`. This is the code that does the work. It's run repeatedly, each time with a different value for `i`. The first iteration will run `output[[1]] <- median(df[[1]])`, the second will run `output[[2]] <- median(df[[2]])`, and so on.
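
One detail of the sequence component is worth a closer look: the loop above uses `seq_along(df)` rather than the more familiar `1:length(df)`. The sketch below uses a hypothetical zero-length vector `y` (not part of the original example) to show the difference. With an empty input, `seq_along()` produces an empty sequence, while `1:length(y)` counts down from 1 to 0.

```r
# A hypothetical zero-length input, e.g. what is left after filtering everything out
y <- vector("double", 0)

seq_along(y)   # integer(0): a loop over this never runs
1:length(y)    # 1 0: a loop over this runs twice

# Count how many times each style of loop executes its body
n_safe <- 0
for (i in seq_along(y)) {
  n_safe <- n_safe + 1
}

n_unsafe <- 0
for (i in 1:length(y)) {
  n_unsafe <- n_unsafe + 1
}

c(safe = n_safe, unsafe = n_unsafe)
```

You are unlikely to create a zero-length vector on purpose, but it is easy to create one by accident, so `seq_along()` is the safer habit.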

---
### Exercise 1
Write for loops to:
1. Compute the mean of every column in `mtcars`.
2. Determine the type of each column in `nycflights13::flights`. (Note: you might need to install the nycflights13 package.)
3. Compute the number of unique values in each column of `iris`.

In [6]:
# Your answer goes here.

---
### Exercise 2
Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

In [7]:
out <- ""
for (x in letters) {
  out <- stringr::str_c(out, x)
}

In [8]:
x <- sample(100)
sd <- 0
for (i in seq_along(x)) {
  sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd / (length(x) - 1))

---
## For loop variations
There are four variations on the basic theme of the for loop:

1. Modifying an existing object
2. Looping Patterns
3. Unknown output length
4. Unknown sequence length, `while` loops

### 1. Modifying an existing object
Sometimes you want to use a for loop to modify an existing object, instead of creating a new one. For example, recall our code from the functions notebook, where we wanted to rescale every column in a data frame:

In [9]:
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
Rescale <- function(x) {
  min <- min(x, na.rm = TRUE)
  max <- max(x, na.rm = TRUE)
  (x - min) / (max - min)
}

df$a <- Rescale(df$a)
df$b <- Rescale(df$b)
df$c <- Rescale(df$c)
df$d <- Rescale(df$d)

To solve this with a for loop we again think about the three components:

1. **Output**: we already have the output - it's the same as the input!

2. **Sequence**: we can think about a data frame as a list of columns, so we can iterate over each column with `seq_along(df)`.

3. **Body**: apply `Rescale()`.

This gives us:

In [10]:
for (i in seq_along(df)) {
  df[[i]] <- Rescale(df[[i]])
}

Typically you'll be modifying a list or data frame with this sort of loop, so remember to use `[[`, not `[`. 
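
To make the difference between `[[` and `[` concrete, here is a small sketch. It uses a hypothetical base R data frame `d` (an assumption for illustration, so the block runs without any packages); with a tibble the idea is the same, although the class names printed would differ.

```r
# Small example data frame (hypothetical values, base R only)
d <- data.frame(a = c(1, 5, 9), b = c(2, 4, 6))

class(d["a"])    # "data.frame": `[` keeps the container
class(d[["a"]])  # "numeric": `[[` extracts the column itself

# Doubling every column in place, using `[[` on both sides of the assignment
for (i in seq_along(d)) {
  d[[i]] <- d[[i]] * 2
}
d
```

Using `[[` makes it clear that the loop works with one element (one column) at a time.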

### 2. Looping Patterns 
Looping over names or values, instead of indices.

There are three basic ways to loop over a vector. So far you've seen the most general: looping over the numeric indices with `for (i in seq_along(xs))`, and extracting the value with `xs[[i]]`. There are two other forms:

1. Loop over the elements: `for (x in xs)`. This is most useful if you only care about side-effects, like plotting or saving a file, because it's difficult to save the output efficiently.

In [11]:
xs <- c("a1", "b2", "c3")
for (x in xs) {
    print(x)
}

[1] "a1"
[1] "b2"
[1] "c3"


2. Loop over the names: `for (nm in names(xs))`. This gives you the name, which you can use to access the value with `xs[[nm]]`.

In [12]:
for (nm in names(df)) {
    print(nm)
}

[1] "a"
[1] "b"
[1] "c"
[1] "d"
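
The sketch below puts the two patterns side by side. It uses a hypothetical named vector `scores` (not part of the original notebook). Looping over names gives access to each value through its name, while looping over indices gives both the name and the value from the position, and makes it easy to fill a preallocated output.

```r
# Hypothetical named vector of scores
scores <- c(alice = 90, bob = 72, carol = 85)

# Looping over names: the name gives access to the value
for (nm in names(scores)) {
  cat(nm, ":", scores[[nm]], "\n")
}

# Looping over indices: the position gives both the name and the value,
# and lets us store results in a preallocated vector
labels <- vector("character", length(scores))
for (i in seq_along(scores)) {
  labels[[i]] <- paste0(names(scores)[[i]], "=", scores[[i]])
}
labels
```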


### 3. Unknown output length
Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:

In [13]:
means <- c(0, 1, 2)

output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)  # picking a number from 1 to 100
  output <- c(output, rnorm(n, means[[i]]))  # combining outputs of n random numbers around different means
}
str(output)

 num [1:131] -1.201 -0.321 0.933 0.315 -0.201 ...


But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations.

A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:

In [14]:
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)

List of 3
 $ : num [1:86] 0.0363 -0.9415 -0.1402 0.0898 0.4808 ...
 $ : num [1:91] 1.0903 1.3233 0.0911 2.8354 0.7703 ...
 $ : num [1:53] 2.732 0.604 2.135 4.874 -0.179 ...


In [15]:
# flatten a list of vectors into a single vector
str(unlist(out))

 num [1:230] 0.0363 -0.9415 -0.1402 0.0898 0.4808 ...
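
The same "collect first, combine once" pattern applies to other kinds of output. As a minimal sketch, assuming you are building up one long string, you could store the pieces in a character vector and join them after the loop with `paste(..., collapse = )`. The `words` vector here is hypothetical.

```r
# Hypothetical pieces of a message, processed one at a time
words <- c("iteration", "reduces", "duplication")

# Collect the pieces in a preallocated character vector...
pieces <- vector("character", length(words))
for (i in seq_along(words)) {
  pieces[[i]] <- toupper(words[[i]])
}

# ...and combine them once, after the loop
sentence <- paste(pieces, collapse = " ")
sentence
```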


<a class="anchor" id="while"></a>
### 4. Unknown sequence length, `while` loops
Sometimes you don't even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can't do that sort of iteration with a for loop. Instead, you can use a while loop. A while loop is simpler than a for loop because it only has two components, a condition and a body:

> `while (condition) {
  body
}`

A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can't rewrite every while loop as a for loop:

In [16]:
for (i in seq_along(x)) {
  # body
}

# Equivalent to
i <- 1
while (i <= length(x)) {
  # body
  i <- i + 1 
}

Here's how we could use a while loop to find how many tries it takes to get three heads in a row:

In [17]:
flip <- function() {
    sample(c("T", "H"), 1)
}

flips <- 0
nheads <- 0

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
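
A single run gives a different answer each time, so one possible next step is to wrap the loop in a function and repeat it. The sketch below is only an illustration: `flips_until_streak()` and its `target` argument are hypothetical names, and the seed is arbitrary.

```r
# Same coin flip as above
flip <- function() {
  sample(c("T", "H"), 1)
}

# Hypothetical helper: number of flips needed to see `target` heads in a row
flips_until_streak <- function(target = 3) {
  flips <- 0
  nheads <- 0
  while (nheads < target) {
    if (flip() == "H") {
      nheads <- nheads + 1
    } else {
      nheads <- 0
    }
    flips <- flips + 1
  }
  flips
}

set.seed(42)  # arbitrary seed, only so the run is repeatable
results <- replicate(1000, flips_until_streak())
mean(results)  # average number of flips over 1000 simulated runs
```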

I mention while loops only briefly, because I hardly ever use them. They're most often used for simulation, which is outside of the scope of these notebooks.

---
### Exercise 3
Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, `show_mean(iris)` would print:

> `Sepal.Length 5.84
Sepal.Width 3.06 
Petal.Length 3.76 
Petal.Width 1.20`

---
### For loops vs. functionals
For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it's possible to wrap up for loops in a function, and call that function instead of using the for loop directly.

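To see why this is important, consider this simple data frame, for which we want several per-column summaries (the median, the mean, and so on). Rather than writing one loop for each summary, we can write the loop once inside a function and pass the summary function in as an argument: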

In [18]:
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

In [19]:
# adding an argument that supplies the function to apply to each column
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)

In [20]:
col_summary(df, mean)

Passing a function to another function is an extremely powerful idea, and it's one of the behaviors that makes R a functional programming language. It might take you a while to wrap your head around the idea, but it's worth the investment.
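
The function you pass in does not have to be a named one. As a minimal sketch, the block below restates `col_summary()` so it runs on its own, and then passes anonymous functions to it. The data frame `d` and its missing value are hypothetical.

```r
# Restating col_summary() from above so this block is self-contained
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

# Hypothetical data with one missing value in column a
d <- data.frame(a = c(1, 2, NA), b = c(10, 20, 30))

# A named function: column a yields NA because of the missing value
col_summary(d, mean)

# An anonymous function that drops missing values before averaging
col_summary(d, function(x) mean(x, na.rm = TRUE))

# Any function that returns a single number works, e.g. the range width
col_summary(d, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```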

<a class="anchor" id="purr"></a>
## Iteration with the purrr package
The goal of using **purrr** functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces. Once you've solved the problem for a single element of the list, purrr takes care of generalizing your solution to every element in the list.

The purrr package provides functions that eliminate the need for many common *for loops*. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc.) solves a similar problem, but purrr is more consistent and thus easier to learn.
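
As a first taste, here is a minimal sketch comparing a for loop with purrr's `map_dbl()`, which applies a function to each element and returns a double vector. The data frame `d` is hypothetical, and a base R data frame is used so that only purrr needs to be attached.

```r
library(purrr)

# Hypothetical data (base data frame, so only purrr is required)
d <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# For-loop version: output, sequence, body
out <- vector("double", length(d))
for (i in seq_along(d)) {
  out[[i]] <- median(d[[i]])
}
out

# purrr version: a single call, and the result keeps the column names
medians <- map_dbl(d, median)
medians

# A base R counterpart from the apply family
vapply(d, median, numeric(1))
```

Compared with the loop, the `map_dbl()` call puts the focus on the operation being performed (`median`) rather than on the bookkeeping around it.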

<div style="text-align: right"> [[Jump to top]](#jump_to_top) </div>