<a class="anchor" id="jump_to_top"></a>
# Iteration
---

### Table of Contents
* [For loops](#loops)
* [while loops](#while)
* [Iteration with the purrr package](#purr)
    * [The map functions](#map)
    * [map2()](#map2)
    * [pmap()](#pmap)



In the previous notebook we saw how we can use functions to reduce duplication in our code. Another great tool for reducing duplication is **iteration**, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.

In [1]:
# Attaching libraries
library(tidyverse)

# install.packages('nycflights13')  
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


<a class="anchor" id="loops"></a>
## For loops
Imagine we have this simple data frame made up of some random numbers:

In [2]:
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df

| a | b | c | d |
|---|---|---|---|
| -2.44778641 | 0.2742707 | 3.173161764 | 0.6429304 |
| 1.22550298 | 0.5997888 | 0.166420797 | -1.2324496 |
| -0.79494404 | -0.4500257 | 0.331656586 | -1.6720384 |
| -0.52911698 | 0.4191066 | -0.389523249 | 1.4370108 |
| -0.2150005 | 0.1727371 | -1.649719824 | -0.4529674 |
| 0.03500575 | -0.6950085 | 0.434642459 | 0.8060227 |
| 1.099359 | 1.6846784 | -0.002950232 | -0.1198143 |
| -0.78294162 | 0.2770025 | 1.696926012 | -0.6933271 |
| -1.30525469 | 0.9257105 | 0.833516797 | -0.3819635 |
| 0.14797624 | -0.7796936 | 2.036830978 | 1.9090268 |


We want to compute the median of each column. You could do it with copy-and-paste:

In [3]:
median(df$a)
median(df$b)
median(df$c)
median(df$d)

But that breaks our rule of thumb: never copy and paste more than twice. Instead, we could use a for loop:

In [4]:
for (i in 1:4) {
    print(paste0("Median for column ", colnames(df)[i], ": ", median(df[[i]])))
}

[1] "Median for column a: -0.372058741418796"
[1] "Median for column b: 0.275636625128039"
[1] "Median for column c: 0.383149522438577"
[1] "Median for column d: -0.250888887312892"


If we want to use these values again, it's good practice to store them:

In [5]:
output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output

In [6]:
seq_along(df)

Every for loop has three components:
1. **Output**: `output <- vector("double", ncol(df))`. Before you start the loop, allocate enough space for the whole output. This matters for efficiency: growing the output at each iteration forces R to copy it again and again. A general way of creating an empty vector of a given length is the `vector()` function. It has two arguments: the type of the vector ("logical", "integer", "double", "character", etc.) and the length of the vector.
2. **Sequence**: `i in seq_along(df)`. This determines what to loop over: each run of the for loop will assign `i` to a different value from `seq_along(df)`.
3. **Body**: `output[[i]] <- median(df[[i]])`. This is the code that does the work. It's run repeatedly, each time with a different value for `i`. The first iteration will run `output[[1]] <- median(df[[1]])`, the second will run `output[[2]] <- median(df[[2]])`, and so on.
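
One detail about the sequence component is worth a closer look: `seq_along(x)` behaves like the familiar `1:length(x)`, except when the input is empty. The sketch below uses a hypothetical zero-length vector `y` (not part of the data above) to show the difference.

```r
# A zero-length vector (hypothetical example)
y <- vector("double", 0)

seq_along(y)   # integer(0): a loop over this runs zero times
1:length(y)    # c(1, 0): a loop over this runs twice

# Count how many iterations each form produces
n_safe <- 0
for (i in seq_along(y)) n_safe <- n_safe + 1

n_unsafe <- 0
for (i in 1:length(y)) n_unsafe <- n_unsafe + 1

c(safe = n_safe, unsafe = n_unsafe)
```

You are unlikely to create a zero-length vector on purpose, but it is easy to create one by accident, for example after filtering, which is why `seq_along()` is the safer habit.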

---
### Exercise 1
Write for loops to:
1. Compute the mean of every column in `mtcars`.
2. Determine the type of each column in `nycflights13::flights`. (Note: You might need to install the nycflights13 package.)
3. Compute the number of unique values in each column of `iris`.

In [7]:
# Your answer goes here.

---
### Exercise 2
Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

In [8]:
out <- ""
for (x in letters) {
  out <- stringr::str_c(out, x)
}

In [9]:
x <- sample(100)
sd <- 0
for (i in seq_along(x)) {
  sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd / (length(x) - 1))

---
## For loop variations
There are four variations on the basic theme of the for loop:

1. Modifying an existing object
2. Looping Patterns
3. Unknown output length
4. Unknown sequence length, `while` loops

### 1. Modifying an existing object
Sometimes you want to use a for loop to modify an existing object, instead of creating a new one. For example, recall our code from the functions notebook, where we wanted to rescale every column in a data frame:

In [10]:
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
Rescale <- function(x) {
  # Rescales each column to a range from 0 to 1
  #
  # Args:
  #   x: the vector that is being rescaled.
  #
  # Returns:
  #   The new rescaled vector.
    
  min <- min(x, na.rm = TRUE)
  max <- max(x, na.rm = TRUE)
  (x - min) / (max - min)
}

df$a <- Rescale(df$a)
df$b <- Rescale(df$b)
df$c <- Rescale(df$c)
df$d <- Rescale(df$d)

To solve this with a for loop we again think about the three components:

1. **Output**: we already have the output - it's the same as the input!

2. **Sequence**: we can think about a data frame as a list of columns, so we can iterate over each column with `seq_along(df)`.

3. **Body**: apply `Rescale()`.

This gives us:

In [11]:
for (i in seq_along(df)) {
  df[[i]] <- Rescale(df[[i]])
}

Typically you'll be modifying a list or data frame with this sort of loop, so remember to use `[[`, not `[`. 
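
To make the difference between `[[` and `[` concrete, here is a minimal sketch on a small hypothetical data frame `d`. It is only an illustration of the two operators, not part of the rescaling example.

```r
# Small data frame for illustration (hypothetical values)
d <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

class(d[1])    # "data.frame": `[` keeps the container
class(d[[1]])  # "numeric": `[[` extracts the column itself

# median() needs the bare vector, so `[[` is the right tool in a loop body
med <- vector("double", ncol(d))
for (i in seq_along(d)) {
  med[[i]] <- median(d[[i]])
}
med
```

Using `[[` everywhere, even for atomic vectors, makes it clear that the loop works with a single element at a time.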

### 2. Looping Patterns 
Looping over names or values, instead of indices.

There are three basic ways to loop over a vector. So far you've seen the most general: looping over the numeric indices with `for (i in seq_along(xs))`, and extracting the value with `xs[[i]]`. There are two other forms:

1. Loop over the elements: `for (x in xs)`. This is most useful if you only care about side-effects, like plotting or saving a file, because it's difficult to save the output efficiently.

In [12]:
xs <- c("a1", "b2", "c3")
for (x in xs) {
    print(x)
}

[1] "a1"
[1] "b2"
[1] "c3"


2. Loop over the names: `for (nm in names(xs))`. This gives you the name, which you can use to access the value with `xs[[nm]]`.

In [13]:
for (nm in names(df)) {
    print(nm)
}

[1] "a"
[1] "b"
[1] "c"
[1] "d"
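
Looping over the numeric indices remains the most general form, because the position gives access to both the name and the value. A minimal sketch, using a hypothetical named vector `scores`:

```r
# Hypothetical named vector
scores <- c(alice = 90, bob = 72, carol = 85)

labels <- vector("character", length(scores))
for (i in seq_along(scores)) {
  name  <- names(scores)[[i]]  # the name at position i
  value <- scores[[i]]         # the value at position i
  labels[[i]] <- paste0(name, ": ", value)
}
labels
```

Because the output is indexed by `i`, it can be pre-allocated and filled in place, which is harder to do when looping directly over elements.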


### 3. Unknown output length
Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:

In [14]:
means <- c(0, 1, 2)

output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)  # picking a number from 1 to 100
  output <- c(output, rnorm(n, means[[i]]))  # combining outputs of n random numbers around different means
}
str(output)

 num [1:244] -1.2 0.301 0.138 -0.351 -1.448 ...


But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations.

A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:

In [15]:
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)

List of 3
 $ : num [1:2] -0.163 0.603
 $ : num [1:82] -0.1419 2.8986 0.4163 1.0983 0.0961 ...
 $ : num [1:18] 1.25 1.84 1.71 4.55 3.27 ...


In [16]:
# flatten a list of vectors into a single vector
str(unlist(out))

 num [1:102] -0.163 0.603 -0.142 2.899 0.416 ...
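
The same "collect first, combine once" pattern also applies when building up a long string. The sketch below assumes a hypothetical character vector `words`; `toupper()` is already vectorised, so the loop is there only to show the pattern.

```r
# Hypothetical input
words <- c("iteration", "is", "useful")

# Save each piece in a pre-allocated character vector
pieces <- vector("character", length(words))
for (i in seq_along(words)) {
  pieces[[i]] <- toupper(words[[i]])
}

# Combine once, after the loop is done
sentence <- paste(pieces, collapse = " ")
sentence
```

Whenever a loop repeatedly appends to a growing object, it is worth checking whether the pieces can be stored separately and combined in a single step at the end.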


<a class="anchor" id="while"></a>
### 4. Unknown sequence length, `while` loops
Sometimes you don't even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can't do that sort of iteration with a for loop. Instead, you can use a while loop. A while loop is simpler than a for loop because it only has two components, a condition and a body:

> `while (condition) {
  body
}`

A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can't rewrite every while loop as a for loop:

In [17]:
for (i in seq_along(x)) {
  # body
}

# Equivalent to
i <- 1
while (i <= length(x)) {
  # body
  i <- i + 1 
}

Here's how we could use a while loop to find how many tries it takes to get three heads in a row:

In [18]:
flip <- function() {
    sample(c("T", "H"), 1)
}

flips <- 0
nheads <- 0

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
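
Because the result is random, a single run says little. One possible extension, sketched here with a hypothetical helper `flips_until_streak()`, wraps the while loop in a function so that it can be repeated many times and summarised. The seed value and the number of repetitions are arbitrary choices.

```r
# Hypothetical helper: number of flips needed to see `streak` heads in a row
flips_until_streak <- function(streak = 3) {
  flips <- 0
  nheads <- 0
  while (nheads < streak) {
    if (sample(c("T", "H"), 1) == "H") {
      nheads <- nheads + 1
    } else {
      nheads <- 0
    }
    flips <- flips + 1
  }
  flips
}

set.seed(42)  # arbitrary seed so the run is repeatable
results <- replicate(1000, flips_until_streak())
summary(results)
```

Every result is at least 3, since three heads in a row need at least three flips; the spread above that minimum is what the simulation reveals.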

I mention while loops only briefly, because I hardly ever use them. They're most often used for simulation, which is outside of the scope of these notebooks.

---
### Exercise 3
Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, `show_mean(iris)` would print:

> `Sepal.Length 5.84
Sepal.Width 3.06 
Petal.Length 3.76 
Petal.Width 1.20`

---
### For loops vs. functionals
For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it's possible to wrap up for loops in a function, and call that function instead of using the for loop directly.

To see why this is important, consider this simple data frame:

In [19]:
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

In [20]:
# adding an argument that supplies the function to apply to each column
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)

In [21]:
col_summary(df, mean)

The idea of passing a function to another function is extremely powerful, and it's one of the behaviors that makes R a functional programming language. It might take you a while to wrap your head around the idea, but it's worth the investment.
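
The function you pass in does not have to be a built-in one. As a small sketch, the block below repeats `col_summary()` so that it runs on its own, and applies it to a hypothetical data frame `d` with both a built-in function and an anonymous function.

```r
# Same helper as above, repeated so this block runs on its own
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

# Hypothetical data with known values
d <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

col_summary(d, max)                          # a built-in function
col_summary(d, function(x) max(x) - min(x))  # an anonymous function: width of the range
```

The loop is written once; only the summary function changes from call to call.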

---
### Exercise 4
Adapt `col_summary()` so that it only applies to numeric columns. You might want to start with an `is_numeric()` function that returns a logical vector that has a TRUE corresponding to each numeric column.

In [22]:
# Your code goes here

<div style="text-align: right"> [[Jump to top]](#jump_to_top) </div>

<a class="anchor" id="purr"></a>
## Iteration with the purrr package
The goal of using **purrr** functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces. Once you've solved the problem for a single element of the list, purrr takes care of generalizing your solution to every element in the list.

The purrr package provides functions that eliminate the need for many common *for loops*. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc) solve a similar problem, but purrr is more consistent and thus is easier to learn.

<a class="anchor" id="map"></a>
### The map functions
The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. There is one function for each type of output:

* `map()` makes a list.
* `map_lgl()` makes a logical vector.
* `map_int()` makes an integer vector.
* `map_dbl()` makes a double vector.
* `map_chr()` makes a character vector.

Each function takes a vector as input, applies a function to each piece, and then returns a new vector that's the same length (and has the same names) as the input. The type of the vector is determined by the suffix to the map function.
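
As a quick illustration of how the suffix decides the output type, the sketch below applies each variant to a hypothetical named list `v`. The anonymous functions are only there to produce a value of the matching type.

```r
library(purrr)

# Hypothetical list of numeric vectors
v <- list(a = c(1, 2, 3), b = c(4, 5), c = 6)

map(v, length)                         # list of integers
map_int(v, length)                     # integer vector: a = 3, b = 2, c = 1
map_dbl(v, mean)                       # double vector: a = 2, b = 4.5, c = 6
map_lgl(v, function(x) length(x) > 1)  # logical vector: TRUE, TRUE, FALSE
map_chr(v, function(x) class(x))       # character vector: "numeric" for each element
```

If the function returns something that does not fit the requested type, the typed variants raise an error rather than silently converting, which helps catch mistakes early.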

Once you master these functions, you'll find it takes much less time to solve iteration problems. But you should never feel bad about using a for loop instead of a map function. The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work. The important thing is that you solve the problem that you're working on, not write the most concise and elegant code (although that's definitely something you want to strive towards!).

The chief benefit of using functions like `map()` is not speed, but clarity: they make your code easier to write and to read.

We can use these functions to perform the same computations as the last for loop. Those summary functions returned doubles, so we need to use `map_dbl()`:

In [23]:
map_dbl(df, mean)

In [24]:
map_dbl(df, median)

The map functions preserve names:

In [25]:
z <- list(x = 1:3, y = 4:5)
map_int(z, length)
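
Besides preserving names, the map functions also pass any additional arguments on to the function at each call. A brief sketch, using a hypothetical data frame `d` with one missing value:

```r
library(purrr)

# Hypothetical data with one missing value
d <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

map_dbl(d, mean)                # a is NA because of the missing value
map_dbl(d, mean, na.rm = TRUE)  # na.rm = TRUE is passed on to mean()
```

This avoids writing a wrapper function just to fix one argument of the function being applied.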

---
### Exercise 5
Write code that uses one of the map functions to:
1. Compute the mean of every column in `mtcars`.
2. Determine the type of each column in `nycflights13::flights`. (Note: You might need to install the nycflights13 package.)
3. Compute the number of unique values in each column of `iris`.

In [26]:
# Your code goes here

---
### Exercise 6
How can you create a single vector that for each column in a data frame indicates whether or not it's a factor?

In [27]:
# Your code goes here

---
### Exercise 7
What happens when you use the map functions on vectors that aren't lists? What does `map(1:5, runif)` do?

In [28]:
# Your code goes here

---
### Exercise 8
What does `map(-2:2, rnorm, n = 5)` do? Why? What does `map_dbl(-2:2, rnorm, n = 5)` do? 

In [29]:
# Your code goes here

---
<div style="text-align: right"> [[Jump to top]](#jump_to_top) </div> 

<a class="anchor" id="map2"></a>
## map2()
So far we've mapped along a single input. But often you have multiple related inputs that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions.

In [30]:
mu <- c(5, 10, -3)
sigma <- c(1, 5, 10)
map2(mu, sigma, rnorm, n = 5)

<img src="../png/map2.png" width="600px" align="center">

Note that the arguments that vary for each call come before the function; arguments that are the same for every call come after.

Like `map()`, `map2()` is just a wrapper around a for loop:

> `map2 <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)
  }
  out
}`
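
To try the wrapper out, here is a runnable version of the same sketch. It is renamed `my_map2()` (a hypothetical name) so that it does not mask purrr's own `map2()`, and it is deliberately simplified: it assumes `x` and `y` have the same length and it does not carry names over.

```r
# A simplified stand-in for map2(), named my_map2 to avoid masking purrr's version
my_map2 <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)
  }
  out
}

# Pair up the elements of two vectors and add them
my_map2(c(1, 2, 3), c(10, 20, 30), function(a, b) a + b)

# Arguments after the function (here sep) are the same for every call
my_map2(c("a", "b"), c("x", "y"), paste, sep = "-")
```

The first call returns a list holding 11, 22 and 33; the second returns a list holding "a-x" and "b-y".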

<a class="anchor" id="pmap"></a>
## pmap()
You could also imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly. Instead, purrr provides `pmap()` which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:

In [31]:
n <- c(1, 3, 5)
pmap(list(n, mu, sigma), rnorm) 

If you don't name the elements of the list, `pmap()` will use positional matching when calling the function. That's a little fragile, and makes the code harder to read, so it's better to name the arguments:

In [32]:
args2 <- list(mean = mu, sd = sigma, n = n)
pmap(args2, rnorm)

<img src="../png/pmap.png" width="600px" align="center">
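
Since `rnorm()` returns random values, the effect of naming is easier to see with a deterministic function. The sketch below uses a hypothetical function `power()` and `pmap_dbl()`, the double-vector variant of `pmap()`, to compare positional and named matching.

```r
library(purrr)

# Hypothetical function with two named arguments
power <- function(base, exponent) base^exponent

# Positional matching: the list order decides which argument gets which values
pmap_dbl(list(c(2, 3), c(3, 2)), power)                    # 2^3 and 3^2

# Named matching: the order of the list elements no longer matters
pmap_dbl(list(exponent = c(3, 2), base = c(2, 3)), power)  # same result
```

Both calls return 8 and 9. Swapping the two unnamed vectors in the first call would silently change the result, which is the fragility that naming protects against.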

We can wrap up the arguments in a data frame since they are all the same length:

In [33]:
params <- tribble(
  ~mean, ~sd, ~n,
    5,     1,  1,
   10,     5,  3,
   -3,    10,  5
)
params
pmap(params, rnorm)

| mean | sd | n |
|---|---|---|
| 5 | 1 | 1 |
| 10 | 5 | 3 |
| -3 | 10 | 5 |


As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns.

<div style="text-align: right"> [[Jump to top]](#jump_to_top) </div> 