In [3]:
library(tidyverse)
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.3.4     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag()     masks stats::lag()


# Lecture 16: Vectors, lists, iteration & FP
In this lecture we'll learn about:
- [Atomic vectors](#Atomic-vectors), or what we have been calling vectors up to this point.
- [Lists](#Lists), a.k.a. recursive vectors.
- [Iteration](#Iteration): `for`/`while` loops.
- [Functional programming](#Functional-programming) (FP): functions that operate on other functions.
- [Error handling](#Error-handling): what to do when things go wrong.

## Atomic vectors
Vectors are sequences of data elements in R. So far we have exclusively studied *atomic* vectors, which are sequences of elements that all have the same type. The two most important properties of a vector are its *type* and its *length*:

In [5]:
(x = 1:3)  # atomic vector of integers
typeof(x)
length(x)

[1] 1 2 3

[1] "integer"

[1] 3

In [6]:
(x = c('a', 'b', 'c'))  # atomic vector of characters
typeof(x)
length(x)

[1] "a" "b" "c"

[1] "character"

[1] 3

A single data element is called a *scalar.* An important thing to realize is that, to R, there is no distinction between scalars and vectors -- a scalar is simply an atomic vector of length one.

In [8]:
1  # scalar
c(1)  # vector

[1] 1

[1] 1
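
A quick way to check this claim, using only base R functions (the variable names `x` and `y` are just for illustration):

```{r}
x = 1        # what we would informally call a "scalar"
y = c(1)     # an explicitly constructed vector

identical(x, y)   # TRUE: R cannot tell the two apart
length(x)         # 1
is.vector(x)      # TRUE
x[1]              # indexing works on a "scalar" too
```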

### Types of atomic vectors
The most important types of atomic vector are logical, numeric, and character.

Logical vectors hold the values `TRUE`, `FALSE` and `NA`.

In [10]:
x = c(TRUE, TRUE, FALSE, NA)
typeof(x)

[1] "logical"

Numeric vectors hold integers or doubles. By default, if you enter a number in R it is stored as a double:

In [54]:
typeof(1)

[1] "double"

If you want to explicitly store integers, attach a capital `L` to the number:

In [58]:
typeof(1L)

[1] "integer"

### Names and attributes
It is possible to assign names to each entry of a vector:

In [68]:
(v = c(a=1, b=2, c=3))
names(v)

a b c 
1 2 3 

[1] "a" "b" "c"

Each vector has *attributes*:

In [72]:
attributes(v)

$names
[1] "a" "b" "c"


You can assign your own attributes to a vector using the `attr` function:

In [84]:
attr(v, "myattr") = 1
attributes(v)

$names
[1] "a" "b" "c"

$myattr
[1] 1
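
As a small sketch of the other direction, the same `attr` function reads an attribute back, and assigning `NULL` removes it again. The attribute name `"myattr"` is the one used above; `"missing"` is simply a name that was never set.

```{r}
v = c(a = 1, b = 2, c = 3)
attr(v, "myattr") = 1       # set a custom attribute

attr(v, "myattr")           # read it back
attr(v, "missing")          # an attribute that was never set is NULL

attr(v, "myattr") = NULL    # assigning NULL removes the attribute
names(attributes(v))        # only "names" remains
```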


## Lists
Lists are another sequence data type found in R. Unlike atomic vectors, lists can hold objects of multiple types:

In [12]:
(x = list('a', 1L, FALSE, pi))

[[1]]
[1] "a"

[[2]]
[1] 1

[[3]]
[1] FALSE

[[4]]
[1] 3.141593


As the printout suggests, you can think of a list as a "vector of vectors". For this reason, they are sometimes referred to as "recursive vectors".

The `str` command will print out the **str**ucture of a vector:

In [15]:
str(x)

List of 4
 $ : chr "a"
 $ : int 1
 $ : logi FALSE
 $ : num 3.14


Just like atomic vectors, you can name each individual entry of a list:

In [17]:
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
names(x_named)

List of 3
 $ a: num 1
 $ b: num 2
 $ c: num 3


[1] "a" "b" "c"

### Subsetting lists
Subsetting lists is a little more complex than subsetting atomic vectors. We will use the following example list:

In [21]:
str(a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5)))

List of 4
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
 $ c: num 3.14
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5


#### `[]`
The `[]` operator extracts a sub-list. That is, the return type will always be a list:

In [26]:
a[1]
str(a[1])

$a
[1] 1 2 3


List of 1
 $ a: int [1:3] 1 2 3


As with atomic vectors, the single brackets accept integer, logical and character vectors:

In [34]:
str(a[c(1,2,4)])
str(a[c('a', 'd')])
str(a[c(T, F)])  # what happened here?

List of 3
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5
List of 2
 $ a: int [1:3] 1 2 3
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5
List of 2
 $ a: int [1:3] 1 2 3
 $ c: num 3.14
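
One way to see what happened in the last call: the logical index `c(T, F)` is shorter than the list, so R recycles it to `c(T, F, T, F)`, which keeps elements 1 and 3. The sketch below rebuilds the example list so that it runs on its own; `short` and `full` are illustrative names.

```{r}
a = list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

short = a[c(TRUE, FALSE)]                 # index is recycled to length 4
full  = a[c(TRUE, FALSE, TRUE, FALSE)]    # the recycling written out by hand

identical(short, full)                    # TRUE
names(short)                              # "a" "c"
```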


#### `[[]]`
The double-brackets will extract a single component from the list:

In [42]:
a[[1]]

[1] 1 2 3

You can also pass an integer vector to `[[]]`. This will index into successive levels of the list:

In [53]:
a[[4]]
a[[c(4,1,1)]]
a[[c(4,2,1)]]

[[1]]
[1] -1

[[2]]
[1] -5


[1] -1

[1] -5
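
To make the difference between the two operators concrete, here is a minimal sketch on the same example list (rebuilt so that the snippet is self-contained). It also shows that a vector index inside `[[]]` is just a shorthand for chaining `[[]]` one level at a time.

```{r}
a = list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

is.list(a[1])       # TRUE:  single brackets return a sub-list
is.list(a[[1]])     # FALSE: double brackets return the component itself

a[[c(4, 2)]]        # second element of the fourth element
a[[4]][[2]]         # the same thing, one level at a time
a[["d"]][[2]]       # names work inside double brackets too
```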

## Iteration
Iteration means, roughly, "running the same piece of code repeatedly". There are many ways to perform iteration in R. The one you have probably heard of is the *for loop*:
```{r}
for (<index> in <vector>) {
    [do something for each value of <index>]
}
```

For example, suppose we wanted to compute the median for each column of the following tibble:

In [87]:
df = tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

One option is to write out a separate call to `median` for each column:

In [88]:
median(df$a)
median(df$b)
median(df$c)
median(df$d)

[1] 0.2565755

[1] 0.4918723

[1] 0.009218122

[1] -0.05655922

But this involves too much repetition, and we argued last lecture that repetition is generally a bad idea when coding. Instead, we can use a for loop to "loop over" each column of `df` and grab the median:

In [90]:
output = vector("double", ncol(df))   # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] = median(df[[i]])       # 3. body
}
output

[1]  0.256575548  0.491872279  0.009218122 -0.056559219

The for loop should have three components:
1. The *output*, in this case a vector with one entry per column of `df`.
2. The *sequence* of values along which we will iterate. Here we are using `seq_along(df)`, which generates a sequence of numbers from one up to `ncol(df)`. (This relies on the fact that a `data.frame` is really a list with one entry per column of data.)
3. The *body*, which is the piece of code that gets executed in each iteration of the loop. In the example above, the body first runs `output[[1]] = median(df[[1]])`, then `output[[2]] = median(df[[2]])`, etc., on up to `i=4`.
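
The notes do not say why the sequence is built with `seq_along(df)` rather than `1:length(df)`. One commonly cited reason is that `seq_along` behaves correctly when the input is empty, as the following sketch illustrates (`empty` and `n_iter` are names made up for this example):

```{r}
empty = double()      # a vector of length zero

seq_along(empty)      # integer(0): a loop over this runs zero times
1:length(empty)       # 1 0: a loop over this would run twice!

n_iter = 0
for (i in seq_along(empty)) {
  n_iter = n_iter + 1
}
n_iter                # still 0, as it should be
```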

### Methods of iterating in for loops
In the example above we used `seq_along` to create a numeric vector, and then iterated over it in our for loop. This was useful because each entry of the vector corresponded to a column, so we could use `output[[i]]` to store the value of the median for each column. There are also a couple of other common ways to iterate:

We can iterate over the elements of a list or vector directly:

In [96]:
for (column in mpg) {  # iterate over the column vectors in mpg
    print(typeof(column))
}

[1] "character"
[1] "character"
[1] "double"
[1] "integer"
[1] "integer"
[1] "character"
[1] "character"
[1] "integer"
[1] "integer"
[1] "character"
[1] "character"


This is mainly useful for calling commands that have side-effects (like `print`) because there is no obvious way to store the output of each iteration.

Or, we can iterate over the names of a list or vector:

In [101]:
output = vector('list', ncol(mpg))
names(output) = names(mpg)
for (col_name in names(mpg)) {
    output[[col_name]] = typeof(mpg[[col_name]])
}
output

$manufacturer
[1] "character"

$model
[1] "character"

$displ
[1] "double"

$year
[1] "integer"

$cyl
[1] "integer"

$trans
[1] "character"

$drv
[1] "character"

$cty
[1] "integer"

$hwy
[1] "integer"

$fl
[1] "character"

$class
[1] "character"


### Unknown output length
In each of the examples above we "pre-allocated" the `output` vector before running the `for` loop. Sometimes you may not know in advance how much output will be generated. For example, the following code draws three random integers between 1 and 100, and for each one appends that many normally distributed random numbers to `output`:

In [106]:
means = c(0, 1, 2)
output = double()
for (i in seq_along(means)) {
  n = sample(100, 1)
  output = c(output, rnorm(n, means[[i]]))
}
length(output)

[1] 188

This code works perfectly well, but it turns out to be inefficient. The reason is that each time we append to `output` via the command `output = c(output, rnorm(n, means[[i]]))`, R ends up having to copy all of the data from the previous iterations. 

A more efficient option is to store the results of each iteration in a list, and then concatenate all the entries of the list together after the for loop terminates:

In [107]:
means = c(0, 1, 2)
out = vector("list", length(means))
for (i in seq_along(means)) {
  n = sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)

List of 3
 $ : num [1:41] 1.282 0.602 -0.307 -0.418 0.355 ...
 $ : num [1:3] 0.238 0.708 0.425
 $ : num [1:33] 3.58 3.68 2.49 2.88 1.86 ...


In [108]:
str(unlist(out))

 num [1:77] 1.282 0.602 -0.307 -0.418 0.355 ...


To convince you that this is actually more efficient, we will run a *benchmark* of the two methods. I will wrap the two approaches in functions called `f1` and `f2`, and then use the `microbenchmark` library to test which one runs faster.

In [119]:
f1 = function(n) { 
    means = 1:n
    output = double()
    for (i in seq_along(means)) {
        n = sample(100, 1)
        output = c(output, rnorm(n, means[[i]]))
    }
    output
}

f2 = function(n) {
    means = 1:n
    out = vector("list", length(means))
    for (i in seq_along(means)) {
        n = sample(100, 1)
        out[[i]] <- rnorm(n, means[[i]])
    }
    unlist(out)
}

In [123]:
library(microbenchmark)
microbenchmark(
    f1(500),
    f2(500)
) %>% print

Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval
 f1(500) 40.158469 45.298423 51.250111 47.656748 51.642540 111.02328   100
 f2(500)  4.914606  5.182897  6.103179  5.626745  6.746227  10.32502   100


### Unknown sequence length
In some cases you don't even know how long the sequence over which you are iterating will be. A `for` loop is not a natural fit here; instead, use a `while` loop:
```{r}
while (<condition>) {
    <body>
}
```
The `while` loop will continue running until `<condition>` returns `FALSE`.

Here's an example of how we would use a `while` loop. The following code counts the number of heads and tails encountered in tosses of a fair coin until the third head is encountered:

In [127]:
n_head = 0
n_tail = 0
while (n_head < 3) {
    if (runif(1) < .5) {
        n_head = n_head + 1
    } else {
        n_tail = n_tail + 1
    }
}
n_head

[1] 3

(Bonus question: what is the distribution of `n_head + n_tail`?)

As you might suspect, `while` loops are used mainly in random simulations. They don't come up a lot in data analysis. Still, it's useful to know about them.
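
One way to explore the bonus question empirically is to wrap the loop in a function and repeat the experiment many times. The sketch below does this; the helper name `tosses_until_heads`, the seed, and the number of repetitions are my own choices, not part of the lecture.

```{r}
# Hypothetical helper: total number of tosses needed to see `k` heads
tosses_until_heads = function(k = 3) {
  n_head = 0
  n_tail = 0
  while (n_head < k) {
    if (runif(1) < .5) {
      n_head = n_head + 1
    } else {
      n_tail = n_tail + 1
    }
  }
  n_head + n_tail
}

set.seed(1)                                        # for reproducibility
totals = replicate(10000, tosses_until_heads(3))

min(totals)          # never below 3, since three heads are required
mean(totals)         # should be close to 3 / 0.5 = 6
head(table(totals))  # empirical distribution of the smallest totals
```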

## Functional programming
R is a *functional programming language*, which means, loosely, that functions are treated just like any other data. In particular, they can be passed to other functions. As we will see, this means that most `for` loop type iterations can be replaced by cleaner, functional constructs.

### Example
In the following series of examples, we'll see how the need to write extensible code naturally leads to ideas from functional programming (FP). Above we've seen several examples of loops that apply the `mean` or `median` function to each column of a tibble:

In [129]:
df = tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]])
}
output

[1]  0.002576805  0.324640613  0.094145946 -0.531406959

As we have already used this code (or a close variant) on several occasions, it makes sense to extract it out to a function:

In [130]:
col_mean = function(df) {
  output = vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] = mean(df[[i]])
  }
  output
}
df %>% col_mean

[1]  0.002576805  0.324640613  0.094145946 -0.531406959

The function `col_mean` could just as easily be adapted to compute the `median` or `sd` of each column. Indeed, we would only need to change a single function call in the body of the for loop:
```{r}
output[i] = mean(df[[i]])
```
So it makes sense to generalize `col_mean` to a new function that takes as parameters a data frame `df` as well as a function `f` to apply to each column:

In [132]:
col_summary = function(df, f) {
  output = vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] = f(df[[i]])
  }
  output
}
df %>% col_summary(median)
df %>% col_summary(sd)

[1]  0.1070448  0.1780868  0.3062926 -0.3165558

[1] 0.9391513 1.5637266 0.9687816 0.9539779

Notice how much more elegant and readable `df %>% col_summary(median)` is compared to
```{r}
output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- median(df[[i]])
}
output
```
If you understand why the former is preferable, you understand the Zen of Functional Programming!
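
Because `col_summary` accepts any function that reduces a column to a single number, it also works with a function defined on the spot. The following sketch repeats the definition of `col_summary` from above and applies an anonymous function to a small data frame invented for this example:

```{r}
col_summary = function(df, f) {
  output = vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] = f(df[[i]])
  }
  output
}

# Made-up data, chosen so the answer is easy to verify by hand
small = data.frame(a = c(1, 2, 6), b = c(10, 20, 60))

# Spread (max minus min) of each column, via an anonymous function
spread = col_summary(small, function(x) max(x) - min(x))
spread   # 5 50
```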

## map functions
The pattern of looping over a sequence, doing something to each element and saving the results turns out to be extremely common in data analysis (and FP more generally). It even has a name: "map".

There is a set of functions in `tidyverse` (in the `purrr` package) designed to help you map over data as easily as possible:
- `map()` makes a list.
- `map_lgl()` makes a logical vector.
- `map_int()` makes an integer vector.
- `map_dbl()` makes a double vector.
- `map_chr()` makes a character vector.

In most cases we will be able to replace `for` loops with calls to these functions, leading to simpler and more readable code.

### Example
How would we write `col_summary` using one of the `map` functions?

In [136]:
# Code
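
One possible answer, assuming the goal is a double vector with one summary per column: `map_dbl` plays the role of `col_summary`, with the summary function passed as the second argument. The tibble is rebuilt here so that the snippet runs on its own.

```{r}
library(tibble)
library(purrr)

df = tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

map_dbl(df, median)   # same role as col_summary(df, median)
map_dbl(df, sd)       # any function returning a single double works
```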

Compared to `col_summary`, the `map_` functions have a few advantages. One, we can forward additional arguments to the called function:

In [138]:
df %>% map_dbl(mean, na.rm=T)

           a            b            c            d 
 0.002576805  0.324640613  0.094145946 -0.531406959 

Two, names are preserved:

In [142]:
x = list(a=1, b=2, c=c(2, 3))
map_dbl(x, mean)
col_summary(x, mean)

  a   b   c 
1.0 2.0 2.5 

[1] 1.0 2.0 2.5

Three, the `map_` functions allow for some handy shortcuts in addition to taking actual function values. If you pass a *formula* instead of a function, `purrr` will convert it into a function in which `.` refers to the current list element:

In [145]:
df[3,]
df %>% map(~ .[[3]])  # take the third element of each column, aka the third row

  a         b         c         d         
1 0.5410005 0.1726862 0.3203768 -0.2605813

$a
[1] 0.5410005

$b
[1] 0.1726862

$c
[1] 0.3203768

$d
[1] -0.2605813
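
Put differently, the formula is shorthand for an anonymous function. A minimal sketch of the equivalence, using a small list made up for this example:

```{r}
library(purrr)

x = list(a = 1:3, b = 4:6)

# Formula shortcut: `.` stands for the current element
via_formula = map_dbl(x, ~ sum(.) / length(.))

# The same computation written as an ordinary anonymous function
via_function = map_dbl(x, function(v) sum(v) / length(v))

via_formula                            # a = 2, b = 5
identical(via_formula, via_function)   # TRUE
```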


If you supply a string to a map function, it will extract the element with that name from each list element:

In [150]:
list(a=list(a=1, b=2), b=list(a=5, b=3), d=list(a=8, b=4)) %>% map("a")

$a
[1] 1

$b
[1] 5

$d
[1] 8


Similarly, an integer will extract the value at that index for each list element:

In [147]:
list(a=list(a=1, b=2), b=list(a=1, b=3), d=list(a=1, b=4)) %>% map(2)

$a
[1] 2

$b
[1] 3

$d
[1] 4


### map-like functions in base R
Base R has the `apply` functions which also perform mapping. The `map_` functions in `tidyverse` have a better interface and should generally be preferred. However, because the `apply` functions are so common, we will briefly go over them here.

The `lapply()` function behaves essentially like `map()`, but it does not support the convenience shortcuts that we reviewed above.

In [155]:
lst = list(a=list(a=1, b=2), b=list(a=1, b=3), d=list(a=1, b=4))
map(lst, "a")
# lapply(lst, "a")  # error
lapply(lst, function(x) x$a)

$a
[1] 1

$b
[1] 1

$d
[1] 1


$a
[1] 1

$b
[1] 1

$d
[1] 1


`sapply` is a wrapper around `lapply` that applies some simplifications to the output. I avoid `sapply` because I can never remember what its rules for simplifying are. Consider this example from the book:

In [157]:
x1 <- list(
  c(0.27, 0.37, 0.57, 0.91, 0.20),
  c(0.90, 0.94, 0.66, 0.63, 0.06), 
  c(0.21, 0.18, 0.69, 0.38, 0.77)
)
x2 <- list(
  c(0.50, 0.72, 0.99, 0.38, 0.78), 
  c(0.93, 0.21, 0.65, 0.13, 0.27), 
  c(0.39, 0.01, 0.38, 0.87, 0.34)
)

threshold <- function(x, cutoff = 0.8) x[x > cutoff]

Can somebody explain to me why 

In [158]:
x1 %>% sapply(threshold) %>% str()

List of 3
 $ : num 0.91
 $ : num [1:2] 0.9 0.94
 $ : num(0) 


but 

In [159]:
x2 %>% sapply(threshold) %>% str()

 num [1:3] 0.99 0.93 0.87


??
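
A plausible explanation, based on the documented behaviour of `sapply`: it simplifies the result to an atomic vector only when every element has length one; when the lengths differ, it returns the list unchanged. In `x1` the three results have lengths 1, 2 and 0, whereas in `x2` every result happens to have length 1. Base R's `vapply` avoids the surprise by making you declare the expected shape up front. The sketch below repeats the definitions so that it runs on its own; the name `attempt` and the message it stores are mine.

```{r}
x1 <- list(
  c(0.27, 0.37, 0.57, 0.91, 0.20),
  c(0.90, 0.94, 0.66, 0.63, 0.06),
  c(0.21, 0.18, 0.69, 0.38, 0.77)
)
x2 <- list(
  c(0.50, 0.72, 0.99, 0.38, 0.78),
  c(0.93, 0.21, 0.65, 0.13, 0.27),
  c(0.39, 0.01, 0.38, 0.87, 0.34)
)
threshold <- function(x, cutoff = 0.8) x[x > cutoff]

lengths(lapply(x1, threshold))   # 1 2 0 -> sapply cannot simplify
lengths(lapply(x2, threshold))   # 1 1 1 -> sapply returns a vector

# vapply() insists that every result is a single double ...
vapply(x2, threshold, FUN.VALUE = double(1))

# ... and fails loudly instead of silently changing the return type
attempt = tryCatch(
  vapply(x1, threshold, FUN.VALUE = double(1)),
  error = function(e) "error: results are not all of length one"
)
attempt
```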

## Error handling
All of the usages of `map` so far have been toy examples with no chance of failure. In real data you will encounter errors. If you do not handle them then your computation will return errors or fail:

In [10]:
# error
# map_dbl(list("a", -1, 2, 3, 4), log)

To handle this type of situation, `tidyverse` provides you with the `safely()` command. `safely()` is an adverb: it takes a function (a verb) and returns a modified version. The modified function will never throw an error; instead, it returns a list with two elements:
1. `result` is the original result. If there was an error, this will be `NULL`.
2. `error` is an error object. If the operation was successful, this will be `NULL`.



In [183]:
(res = map(list("a", -1, 2, 3, 4), safely(log))) %>% str

“NaNs produced”

List of 5
 $ :List of 2
  ..$ result: NULL
  ..$ error :List of 2
  .. ..$ message: chr "non-numeric argument to mathematical function"
  .. ..$ call   : language log(x = x, base = base)
  .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
 $ :List of 2
  ..$ result: num NaN
  ..$ error : NULL
 $ :List of 2
  ..$ result: num 0.693
  ..$ error : NULL
 $ :List of 2
  ..$ result: num 1.1
  ..$ error : NULL
 $ :List of 2
  ..$ result: num 1.39
  ..$ error : NULL


To get the results we need to extract the `result` element from each element of `res`. We already learned how to do this using `map`:

In [182]:
map(res, "result") %>% str

List of 5
 $ : NULL
 $ : num NaN
 $ : num 0.693
 $ : num 1.1
 $ : num 1.39


Alternatively, thinking of `res` as a matrix with two columns, we can transpose `res` and take the `result` element:

In [181]:
transpose(res)$result %>% str

List of 5
 $ : NULL
 $ : num NaN
 $ : num 0.693
 $ : num 1.1
 $ : num 1.39
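
Once the results and errors are separated, a common follow-up is to find out which inputs failed and to keep only the successful results. The sketch below is one way to do that with functions already introduced; the names `out` and `ok` are illustrative, and `suppressWarnings` is used only to silence the "NaNs produced" warning from `log(-1)`.

```{r}
library(purrr)

x = list("a", -1, 2, 3, 4)
res = suppressWarnings(map(x, safely(log)))

out = transpose(res)                # list(result = ..., error = ...)
ok = map_lgl(out$error, is.null)    # TRUE wherever log() did not fail

x[!ok]                              # the inputs that caused an error
unlist(out$result[ok])              # successful results as a double vector
```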


The related command `possibly` will return a default value wherever an error is encountered:

In [180]:
map(list('a', -1, 2, 3), possibly(log, NA_real_)) %>% str

“NaNs produced”

List of 4
 $ : num NA
 $ : num NaN
 $ : num 0.693
 $ : num 1.1


Finally, to capture and suppress the warning message we can use `quietly()`:

In [188]:
map(list(-1, 2, 3), quietly(log)) %>% str

List of 3
 $ :List of 4
  ..$ result  : num NaN
  ..$ output  : chr ""
  ..$ warnings: chr "NaNs produced"
  ..$ messages: chr(0) 
 $ :List of 4
  ..$ result  : num 0.693
  ..$ output  : chr ""
  ..$ warnings: chr(0) 
  ..$ messages: chr(0) 
 $ :List of 4
  ..$ result  : num 1.1
  ..$ output  : chr ""
  ..$ warnings: chr(0) 
  ..$ messages: chr(0) 


## Iterating over multiple sequences at once
Sometimes we want to iterate over multiple sequences. For example, suppose we had a vector `mu` of means and an equal-length vector `sigma` of standard deviations. For each pair `mu[[i]], sigma[[i]]` we would like to generate five normal random variables with that mean and standard deviation using `rnorm`.

Using `map`, we could accomplish this by

In [192]:
mu = list(5, 10, -3)
sigma = list(1, 5, 10)
seq_along(mu) %>% 
  map(~rnorm(5, mu[[.]], sigma[[.]])) %>% 
  str()

List of 3
 $ : num [1:5] 5.81 3.03 5.24 3.33 5.01
 $ : num [1:5] 9.92 7.6 10.6 12.18 11.31
 $ : num [1:5] -1.65 2.76 -9.31 -2.73 -2.87


This code could be improved -- because we don't yet know how to `map` over more than one sequence at a time, we are forced to "hack it" by iterating over `seq_along(mu)`. This hides the true intent of what we set out to accomplish.

To iterate over two sequences at once we have the `map2` command:
```{r}
map2(seq1, seq2, f, ...)
```
will call `f(seq1[[i]], seq2[[i]], ...)` for each value of `i`. Indeed, `map2` is equivalent to:
```{r}
map2 <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)
  }
  out
}
```

`map2` lets us succinctly rewrite the sampling code given above:

In [196]:
map2(mu, sigma, rnorm, n = 5)

[[1]]
[1] 5.447983 5.446291 5.066780 5.075793 4.059494

[[2]]
[1]  3.518528 19.366107  7.197257  5.081700 20.487066

[[3]]
[1] -9.655409  8.922227  4.592133  4.237675 -1.451280


We can map over arbitrarily many sequences using `pmap`. The first argument of `pmap` is a list of sequences, and the second is a function:

In [201]:
pmap(list(mu, sigma), rnorm, n = 5)

[[1]]
[1] 5.404549 5.931190 5.489289 6.243161 4.969922

[[2]]
[1] 11.36999 10.78532 16.00125 16.17708 12.38797

[[3]]
[1] 11.84802  2.68887 13.69661 11.51808 11.91021


```{r}
pmap(list(mu, sigma), rnorm, n = 5)
```
will call `rnorm(mu[[i]], sigma[[i]], n=5)`. This relies on the correct ordering of the `mean` and `sd` arguments to `rnorm`. To prevent errors, you can name each sequence in the first argument after the corresponding parameter of `rnorm`:
```{r}
pmap(list(mean=mu, sd=sigma, n=5), rnorm)
```
will call `rnorm(mean=mu[[i]], sd=sigma[[i]], n=5)` using named parameters. This is a bit safer so I recommend using this form.
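
A related pattern: since a data frame is a list of equal-length columns, the parameters can also be stored in a tibble whose column names match the argument names of `rnorm` (`n`, `mean`, `sd`). The sketch below assumes this layout; the name `params` and the particular values of `n` are invented for the example.

```{r}
library(tibble)
library(purrr)

# Each row holds the arguments for one call to rnorm()
params = tibble(
  mean = c(5, 10, -3),
  sd   = c(1, 5, 10),
  n    = c(1, 3, 5)
)

draws = pmap(params, rnorm)   # calls rnorm(mean = ..., sd = ..., n = ...) per row
lengths(draws)                # 1 3 5
```

Keeping the parameters in one table makes it harder for the sequences to drift out of sync as the code grows.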