In [3]:
library(tidyverse)
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.3.4     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ dplyr::lag()     masks stats::lag()


# Lecture 16: Vectors, lists, iteration & FP
In this lecture we'll learn about:
- [Atomic vectors](#Atomic-vectors), or what we have been calling vectors up to this point.
- [Lists](#Lists), a.k.a. recursive vectors.
- [Iteration](#Iteration): `for`/`while` loops.
- [Functional programming](#Functional-programming) (FP): functions that operate on other functions.

Most of this material can be found in chapters 19-21 of R4DS.

## Atomic vectors
Vectors are sequences of data elements in R. So far we have exclusively studied *atomic* vectors, which are sequences of elements that all have the same type. The two most important properties of a vector are its *type* and its *length*:

In [5]:
(x = 1:3)  # atomic vector of integers
typeof(x)
length(x)

[1] 1 2 3

[1] "integer"

[1] 3

In [6]:
(x = c('a', 'b', 'c'))  # atomic vector of characters
typeof(x)
length(x)

[1] "a" "b" "c"

[1] "character"

[1] 3

A single data element is called a *scalar.* An important thing to realize is that, to R, there is no distinction between scalars and vectors -- a scalar is simply an atomic vector of length one.

In [79]:
1     # scalar
c(1)  # vector

[1] 1

[1] 1

### Types of atomic vectors
The most important types of atomic vector are logical, numeric, and character.

Logical vectors hold the values `TRUE`, `FALSE` and `NA`.

In [81]:
(x = c(TRUE, TRUE, FALSE, NA))
typeof(x)
typeof(NA)

[1]  TRUE  TRUE FALSE    NA

[1] "logical"

[1] "logical"

Numeric vectors hold integers or doubles. By default, if you enter a number in R it is stored as a double:

In [82]:
typeof(1)

[1] "double"

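If you want to explicitly store an integer, append a capital `L` to a whole number; for example, `typeof(1L)` returns `"integer"`. The suffix does not convert a literal that has a fractional part: R issues a warning and stores the value as a double: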

In [84]:
typeof(100.00101L)

[1] "double"
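
To see the suffix working as intended, here is a minimal sketch (the variable names are illustrative and not from the lecture) that compares a few values. Whole numbers written with `L`, and sequences built with `:`, are stored as integers; mixing an integer with a double coerces the result to double.

```r
x_int <- 100L       # integer literal
x_dbl <- 100        # the same value, stored as a double
typeof(x_int)       # "integer"
typeof(x_dbl)       # "double"
typeof(1:3)         # "integer": the colon operator builds integer sequences
typeof(c(1L, 2.5))  # "double": combining integer and double coerces to double
x_int == x_dbl      # TRUE: the values compare equal even though the types differ
```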

### Names and attributes
It is possible to assign names to each entry of a vector:

In [86]:
(v = c(a=1, b=2, c=3))
names(v)

a b c 
1 2 3 

[1] "a" "b" "c"

Each vector has *attributes*:

In [87]:
attributes(v)

$names
[1] "a" "b" "c"


You can assign your own attributes to a vector using the `attr` function:

In [94]:
attr(v, "myattr") = 1
attr(v, "names") <- c(4:6)
attributes(v)

$names
[1] "4" "5" "6"

$myattr
[1] 1


## Lists
Lists are another sequence data type in R. Unlike atomic vectors, lists can hold objects of multiple types:

In [95]:
(x = list('a', 1L, FALSE, pi, list(1:3)))

[[1]]
[1] "a"

[[2]]
[1] 1

[[3]]
[1] FALSE

[[4]]
[1] 3.141593

[[5]]
[[5]][[1]]
[1] 1 2 3



As the printout suggests, you can think of a list as a "vector of vectors". For this reason, they are sometimes referred to as "recursive vectors".

The `str` command will print out the **str**ucture of a vector:

In [96]:
str(x)

List of 5
 $ : chr "a"
 $ : int 1
 $ : logi FALSE
 $ : num 3.14
 $ :List of 1
  ..$ : int [1:3] 1 2 3


Just like atomic vectors, you can name each individual entry of a list:

In [17]:
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
names(x_named)

List of 3
 $ a: num 1
 $ b: num 2
 $ c: num 3


[1] "a" "b" "c"

### Subsetting lists
Subsetting lists is a little more complex than subsetting atomic vectors. We will use the following example list:

In [98]:
str(a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5)))

List of 4
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
 $ c: num 3.14
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5


#### `[]`
The `[]` operator extracts a sub-list. That is, the return type will always be a list:

In [138]:
str(a)
str(a[1])  # single brackets return a sub-list, here a list of length one

List of 4
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
 $ c: num 3.14
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5


List of 1
 $ a: int [1:3] 1 2 3

As with atomic vectors, the single brackets accept integer, logical and character vectors:

In [107]:
# str(a[c(1,2,4)])
# str(a[c('a', 'd')])
str(a[c(TRUE, FALSE, TRUE)])  # what happened here?

List of 3
 $ a: int [1:3] 1 2 3
 $ c: num 3.14
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5
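
What happened is recycling: the logical index has three entries but the list has four, so R repeats the index from the start until it is long enough. The sketch below makes the recycled index explicit with `rep()`; the names `idx` and `recycled` are illustrative and not part of the lecture code.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

idx <- c(TRUE, FALSE, TRUE)
recycled <- rep(idx, length.out = length(a))  # TRUE FALSE TRUE TRUE

names(a[idx])                   # "a" "c" "d"
names(a[recycled])              # "a" "c" "d"
identical(a[idx], a[recycled])  # TRUE: both indices select the same sub-list
```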


#### `[[]]`
The double-brackets will extract a single component from the list:

In [108]:
str(a)
a[[1]]

List of 4
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
 $ c: num 3.14
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5


[1] 1 2 3

You can also pass an integer vector to `[[]]`. This will index into successive levels of the list:

In [114]:
# str(a)
# a[[4]]
str(a)
a[[c(4,2)]] 

List of 4
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
 $ c: num 3.14
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5


[1] -5
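
As a quick check of what "successive levels" means, the following sketch shows that the vector index is shorthand for chaining `[[]]` one level at a time. It rebuilds the example list so that it runs on its own.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

a[[c(4, 2)]]  # -5: fourth component, then its second element
a[[4]][[2]]   # -5: the same lookup written as two steps
a$d[[2]]      # -5: the same lookup using the component name
```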

### Data frames are lists
Many data types in R are actually lists plus some additional attributes. For example, tibbles and data frames are both lists:

In [115]:
typeof(tibble())
typeof(data.frame())

[1] "list"

[1] "list"

The `names()` of a tibble/data frame correspond to columns. This means we can use the list indexing methods shown above to access columns:

In [120]:
df = tibble(a = 1:3, b = c('a', 'b', 'c'))
str(df)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	3 obs. of  2 variables:
 $ a: int  1 2 3
 $ b: chr  "a" "b" "c"
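
Here is a short sketch of the list indexing methods applied to columns. It uses a base `data.frame` so that it runs without extra packages; to the best of my knowledge a tibble behaves the same way for these three forms, except that `df["a"]` then returns a tibble.

```r
df <- data.frame(a = 1:3, b = c("a", "b", "c"), stringsAsFactors = FALSE)

df[["a"]]     # double brackets extract the column itself: 1 2 3
df$b          # $ is shorthand for [[ with a name: "a" "b" "c"
str(df["a"])  # single brackets keep the container: a one-column data frame
length(df)    # 2: as a list, a data frame has one entry per column
```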


Note that the *class* of a tibble is different from the *type*:

In [123]:
# class(tibble())
attributes(tibble())

$names
character(0)

$class
[1] "tbl_df"     "tbl"        "data.frame"

$row.names
integer(0)


### Classes vs. types
Classes and types are not the same. Class is an attribute that R checks in order to know how to call certain functions (e.g. `print`) when presented with an object. Changing the class will change the way that R handles the object:

In [127]:
df %>% print
attributes(df)

# A tibble: 3 x 2
      a     b
  <int> <chr>
1     1     a
2     2     b
3     3     c


$names
[1] "a" "b"

$class
[1] "tbl_df"     "tbl"        "data.frame"

$row.names
[1] 1 2 3


In [131]:
attr(df, "class") = NULL  # strip the class; df now prints as a plain list
# attributes(df)
print(df)

$a
[1] 1 2 3

$b
[1] "a" "b" "c"

attr(,"row.names")
[1] 1 2 3


In [136]:
attr(df, "class") = c("tbl_df", "tbl", "data.frame")
df %>% print

# A tibble: 3 x 2
      a     b
  <int> <chr>
1     1     a
2     2     b
3     3     c
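
The same mechanism works for classes you define yourself. The sketch below is only an illustration: `temperature` is a made-up class name, and `print.temperature` is a method written for this example. It shows that `print` picks its behavior from the class attribute while the underlying type stays `"list"`.

```r
# "temperature" is a hypothetical class name invented for this sketch.
temp <- structure(list(value = 21.5, unit = "C"), class = "temperature")

# print() looks at the class attribute and finds this method.
print.temperature <- function(x, ...) {
  cat(x$value, "degrees", x$unit, "\n")
  invisible(x)
}

print(temp)    # uses print.temperature
typeof(temp)   # "list": the type is unaffected by the class
class(temp)    # "temperature"

plain <- unclass(temp)  # drop the class attribute
print(plain)            # falls back to the default printing for lists
typeof(plain)           # still "list"
```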


## Iteration
Iteration means, roughly, "running the same piece of code repeatedly". There are many ways to perform iteration in R. The one you have probably heard of is the *for loop*:
```{r}
for (<index> in <vector>) {
    [do something for each value of <index>]
}
```

For example, suppose we wanted to compute the median for each column of the following tibble:

In [140]:
df = tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

One option is to repeatedly write out the call `median` for each column:

In [141]:
median(df$a)
median(df$b)
median(df$c)
median(df$d)

[1] 0.2644509

[1] 0.5417678

[1] -0.5843742

[1] 0.5461129

But this involves too much repetition, and we argued last lecture that repetition is generally a bad idea when coding. Instead, we can use a for loop to "loop over" each column of `df` and grab the median:

In [160]:
output = vector("double", ncol(df))   # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] = median(df[[i]])       # 3. body
}
output

[1]  0.2644509  0.5417678 -0.5843742  0.5461129

The for loop should have three components:
1. The *output*, in this case a vector with one entry per column of `df`.
2. The *sequence* of values along which we will iterate. Here we are using `seq_along(df)`, which generates a sequence of numbers from one up to `ncol(df)`. (This relies on the fact that a `data.frame` is really a list with one entry per column of data.)
3. The *body*, which is the piece of code that gets executed in each iteration of the loop. In the example above, the body first runs `output[[1]] = median(df[[1]])`, then `output[[2]] = median(df[[2]])`, etc., on up to `i=4`.
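
One reason to build the sequence with `seq_along()` rather than `1:length(x)` is that it behaves sensibly when there is nothing to iterate over. A minimal sketch, with `y` and `count` as illustrative names:

```r
y <- double()   # a zero-length vector

1:length(y)     # 1 0: a loop over this would run twice
seq_along(y)    # integer(0): a loop over this never runs

count <- 0
for (i in seq_along(y)) {
  count <- count + 1
}
count           # 0: the body was never executed
```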

### Methods of iterating in for loops
In the example above we used `seq_along` to create a numeric vector, and then iterated over it in our for loop. This was useful because each entry of the vector corresponded to a column, so we could use `output[[i]]` to store the value of the median for each column. There are also a couple of other common ways to iterate:

We can iterate over the elements of a list or vector directly:

In [168]:
output = vector("double", 4)
i = 1
for (column in df) {  # iterate over the column vectors in df
    output[[i]] = sum(column)
    i = i + 1
}
output

[1]  0.7855311  0.5617025 -6.8963655  5.6473291

This is mainly useful for calling commands that have side effects (like `print`), because the loop gives you no index to use when storing the output of each iteration. That is why the example above has to maintain the counter `i` by hand.

Or, we can iterate over the names of a list or vector:

In [174]:
output = vector('list', ncol(df))
names(output) = names(df)  # name the slots so output[[col_name]] fills them instead of appending
for (col_name in names(df)) {
    output[[col_name]] = typeof(df[[col_name]])
}
output

$a
[1] "double"

$b
[1] "double"

$c
[1] "double"

$d
[1] "double"


### Unknown output length
In each of the examples above we "pre-allocated" the `output` vector before running the `for` loop. Sometimes you may not know in advance how much output will be generated. For example, the following code loops over three means, draws a random integer `n` between 1 and 100 for each one, and appends `n` normally distributed random values with that mean to `output`:

In [180]:
means = c(0, 1, 2)
output = double()
str(output)
for (i in seq_along(means)) {
  n = sample(100, 1)
  print(c(n=n, means=means[[i]], length=length(output)))
  output = c(output, rnorm(n, means[[i]]))
}
length(output)

 num(0) 
     n  means length 
    65      0      0 
     n  means length 
    92      1     65 
     n  means length 
   100      2    157 


[1] 257

This code works perfectly well, but it turns out to be inefficient. The reason is that each time we append to `output` via the command `output = c(output, rnorm(n, means[[i]]))`, R ends up having to copy all of the data from the previous iterations, so the total amount of copying grows roughly quadratically with the number of iterations.

A more efficient option is to store the results of each iteration in a list, and then concatenate all the entries of the list together after the for loop terminates:

In [184]:
means = c(0, 1, 2)
out = vector("list", length(means))
# str(out)
for (i in seq_along(means)) {
  n = sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)

List of 3
 $ : num [1:60] 1.181 -1.109 -0.151 0.673 -1.086 ...
 $ : num [1:9] 2.454 0.937 1.353 0.467 0.919 ...
 $ : num [1:98] 3.242 0.497 2.778 0.941 1.674 ...


In [186]:
str(unlist(out))

 num [1:167] 1.181 -1.109 -0.151 0.673 -1.086 ...


To convince you that this is actually more efficient, we will run a *benchmark* of the two methods. I will wrap the two approaches in functions called `f1` and `f2`, and then use the `microbenchmark` package to test which one runs faster.

In [198]:
f1 = function(n) { 
    means = 1:n
    output = double()
    for (i in seq_along(means)) {
        k = sample(100, 1)  # number of draws for this mean
        output = c(output, rnorm(k, means[[i]]))  # grows (and copies) output every time
    }
    output
}

f2 = function(n) {
    means = 1:n
    out = vector("list", length(means))  # pre-allocated list, one slot per mean
    for (i in seq_along(means)) {
        k = sample(100, 1)
        out[[i]] <- rnorm(k, means[[i]])
    }
    unlist(out)  # concatenate once at the end
}

In [200]:
library(microbenchmark)
microbenchmark(
    f1(1000),
    f2(1000)
) %>% print

Unit: milliseconds
     expr        min        lq      mean   median        uq       max neval
 f1(1000) 176.943549 184.36428 190.45303 188.3972 194.00933 258.13552   100
 f2(1000)   9.875242  10.56653  11.42021  10.8246  11.15312  22.33311   100


### Unknown sequence length
In some cases you don't even know how long the sequence you are iterating over will be. Here it is not possible to use a `for` loop; instead you must use a `while` loop:
```{r}
while (<condition>) {
    <body>
}
```
The `while` loop will continue running until `<condition>` returns `FALSE`.

Here's an example of how we would use a `while` loop. The following code counts the number of heads and tails encountered in tosses of a fair coin until the third head is encountered:

In [239]:
n_head = 0
n_tail = 0
while (n_head < 3) {
    if (runif(1) < .5) {
        n_head = n_head + 1
    } else {
        n_tail = n_tail + 1
    }
}
n_head + n_tail

[1] 3

(Bonus question: what is the distribution of `n_head + n_tail`?)

As you might suspect, `while` loops are used mainly in random simulations. They don't come up a lot in data analysis. Still, it's useful to know about them.

## Functional programming
R is a *functional programming language*, which means, loosely, that functions are treated just like any other data. In particular, they can be passed to other functions. As we will see, this means that most `for` loop type iterations can be replaced by cleaner, functional constructs.
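
As a small illustration of "functions are treated just like any other data", the sketch below assigns a function to a variable, stores functions in a list, and passes one as an argument. The names `f`, `summaries` and `apply_to_x` are invented for this example.

```r
x <- c(1, 2, 3, 10)

f <- median       # a function can be assigned to a variable
f(x)              # 2.5

summaries <- list(mean = mean, median = median, max = max)  # or stored in a list
summaries$max(x)  # 10

# Hypothetical helper: takes a function g and calls it on x.
apply_to_x <- function(g) g(x)
apply_to_x(mean)  # 4
```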

### Example
In the following series of examples, we'll see how the need to write extensible code naturally leads to ideas from functional programming (FP). Above we've seen several examples of loops that apply a summary function such as `median` to each column of a tibble. Here is the same pattern using `sd`:

In [242]:
df = tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- sd(df[[i]])
}
output

[1] 1.0616774 0.7446653 0.7113265 0.8258072

As we have already used this code (or a close variant) on several occasions, it makes sense to extract it into a function:

In [130]:
col_mean = function(df) {
  output = vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] = mean(df[[i]])
  }
  output
}
df %>% col_mean

[1]  0.002576805  0.324640613  0.094145946 -0.531406959

The body of `col_mean` could just as easily be used to compute the `median` or `sd` of each column. Indeed, we would only need to change a single function call in the body of the for loop:
```{r}
output[i] = mean(df[[i]])
```
So it makes sense to generalize `col_mean` to a new function that takes as parameters a data frame `df` as well as a function `f` to apply to each column:

In [132]:
col_summary = function(df, f) {
  output = vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] = f(df[[i]])
  }
  output
}
df %>% col_summary(median)
df %>% col_summary(sd)

[1]  0.1070448  0.1780868  0.3062926 -0.3165558

[1] 0.9391513 1.5637266 0.9687816 0.9539779

Notice how much more elegant and readable `df %>% col_summary(median)` is compared to
```{r}
output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- median(df[[i]])
}
output
```
If you understand why the former is preferable, you understand the Zen of Functional Programming!

## map functions
The pattern of looping over a sequence, doing something to each element and saving the results turns out to be extremely common in data analysis (and FP more generally). It even has a name: "map".

There is a set of functions in the `purrr` package (part of `tidyverse`) designed to help you map over data as easily as possible:
- `map()` makes a list.
- `map_lgl()` makes a logical vector.
- `map_int()` makes an integer vector.
- `map_dbl()` makes a double vector.
- `map_chr()` makes a character vector.

In most cases we will be able to replace `for` loops with calls to these functions, leading to simpler and more readable code.
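
To make the list above concrete, here is a minimal sketch that applies the typed variants to one small list. The list `x` is invented for the example, and the block assumes the `purrr` package is installed.

```r
library(purrr)

x <- list(a = 1:3, b = c(2.5, 3.5), c = letters[1:4])

map_int(x, length)      # integer vector: a = 3, b = 2, c = 4
map_chr(x, typeof)      # character vector: "integer" "double" "character"
map_lgl(x, is.numeric)  # logical vector: TRUE TRUE FALSE
map_dbl(x[1:2], mean)   # double vector: a = 2, b = 3
str(map(x, length))     # map() itself always returns a list
```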

### Example
How would we write `col_summary` using one of the `map` functions?

In [247]:
map(df, median)

$a
[1] -0.02254271

$b
[1] 0.6294009

$c
[1] -0.5443686

$d
[1] 0.3964494


Compared to `col_summary`, the `map_` functions have a few advantages. One, we can forward additional arguments to the called function:

In [250]:
df %>% map_dbl(mean, na.rm = TRUE)

         a          b          c          d 
-0.1294314  0.4249829 -0.3484361  0.4063167 

Two, names are preserved:

In [251]:
x = list(a=1, b=2, c=c(2, 3))
map_dbl(x, function(x) mean(x + 1))

  a   b   c 
2.0 3.0 3.5 

Three, the `map_` functions allow for some handy shortcuts in addition to taking actual function values. If you pass a *formula* instead of a function, `purrr` will turn it into a function in which every instance of `.` refers to the current list element:

In [254]:
df[3,]
df %>% map(~ mean(1 + .))  # add one to each column, then take its mean

  a         b         c       d        
1 0.3259983 0.6441628 -1.1072 0.7911347

$a
[1] 0.8705686

$b
[1] 1.424983

$c
[1] 0.6515639

$d
[1] 1.406317
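
To check that the formula shortcut is only a more compact way of writing an anonymous function, the following sketch computes the same result both ways. The list `x` repeats the earlier example, `r_formula` and `r_function` are illustrative names, and the block assumes `purrr` is installed.

```r
library(purrr)

x <- list(a = 1, b = 2, c = c(2, 3))

r_formula  <- map_dbl(x, ~ mean(. + 1))               # formula shortcut
r_function <- map_dbl(x, function(v) mean(v + 1))     # explicit anonymous function

r_formula                          # a = 2.0, b = 3.0, c = 3.5
identical(r_formula, r_function)   # TRUE
```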


If you supply a string to a map function, `purrr` will extract the element with that name from each list element:

In [150]:
list(a=list(a=1, b=2), b=list(a=5, b=3), d=list(a=8, b=4)) %>% map("a")

$a
[1] 1

$b
[1] 5

$d
[1] 8


Similarly, an integer will extract the value at that index for each list element:

In [147]:
list(a=list(a=1, b=2), b=list(a=1, b=3), d=list(a=1, b=4)) %>% map(2)

$a
[1] 2

$b
[1] 3

$d
[1] 4


### map-like functions in base R
Base R has the `apply` functions which also perform mapping. The `map_` functions in `tidyverse` have a better interface and should generally be preferred. However, because the `apply` functions are so common, we will briefly go over them here.

The `lapply()` function does the same job as `map()`, but it does not allow for some of the convenience shortcuts that we reviewed above.

In [155]:
lst = list(a=list(a=1, b=2), b=list(a=1, b=3), d=list(a=1, b=4))
map(lst, "a")
# lapply(lst, "a")  # error
lapply(lst, function(x) x$a)

$a
[1] 1

$b
[1] 1

$d
[1] 1


$a
[1] 1

$b
[1] 1

$d
[1] 1


`sapply` is a wrapper around `lapply` that applies some simplifications to the output. I avoid `sapply` because I can never remember what its rules for simplifying are. Consider this example from the book:

In [255]:
x1 <- list(
  c(0.27, 0.37, 0.57, 0.91, 0.20),
  c(0.90, 0.94, 0.66, 0.63, 0.06), 
  c(0.21, 0.18, 0.69, 0.38, 0.77)
)
x2 <- list(
  c(0.50, 0.72, 0.99, 0.38, 0.78), 
  c(0.93, 0.21, 0.65, 0.13, 0.27), 
  c(0.39, 0.01, 0.38, 0.87, 0.34)
)
 
threshold <- function(x, cutoff = 0.8) x[x > cutoff]

Can somebody explain to me why 

In [158]:
x1 %>% sapply(threshold) %>% str()

List of 3
 $ : num 0.91
 $ : num [1:2] 0.9 0.94
 $ : num(0) 


but 

In [159]:
x2 %>% sapply(threshold) %>% str()

 num [1:3] 0.99 0.93 0.87


??
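
One way to see what is going on, sketched with base R only: `sapply` collapses its result to an atomic vector when every element returned has length one, and otherwise keeps the list. Checking the lengths shows that `x2` happens to satisfy that condition while `x1` does not. The block redefines `x1`, `x2` and `threshold` so that it runs on its own, and it uses `vapply`, which makes you state the expected shape up front, as one possible alternative.

```r
x1 <- list(
  c(0.27, 0.37, 0.57, 0.91, 0.20),
  c(0.90, 0.94, 0.66, 0.63, 0.06),
  c(0.21, 0.18, 0.69, 0.38, 0.77)
)
x2 <- list(
  c(0.50, 0.72, 0.99, 0.38, 0.78),
  c(0.93, 0.21, 0.65, 0.13, 0.27),
  c(0.39, 0.01, 0.38, 0.87, 0.34)
)
threshold <- function(x, cutoff = 0.8) x[x > cutoff]

lengths(lapply(x1, threshold))  # 1 2 0: lengths differ, so sapply keeps a list
lengths(lapply(x2, threshold))  # 1 1 1: all length one, so sapply simplifies

# vapply() fails loudly instead of silently changing the return type.
vapply(x2, threshold, numeric(1))  # 0.99 0.93 0.87
ok <- tryCatch(
  { vapply(x1, threshold, numeric(1)); TRUE },
  error = function(e) FALSE
)
ok                                 # FALSE: x1 does not fit the declared shape
```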