2018-04-28-purrr/purrr_intro.Rmd

---
title: "An Intro to purrr"
author: ""
output: slidy_presentation

---


# purrr: Functional programming in R

This presentation was created by Jennifer Thompson in 2017 for R-Ladies Nashville, I've modified it for us today

# Setup

First we'll load the packages we need today

```{r libraries}
## -- Load R libraries ---------------------------------------------------------
suppressPackageStartupMessages(library(purrr)) ## obvs
suppressPackageStartupMessages(library(dplyr)) ## for data management
suppressPackageStartupMessages(library(tidyr)) ## for data management
  ## Note: If you have the tidyverse package installed, library(tidyverse) will
  ## load purrr along with several other core packages, including dplyr & tidyr
suppressPackageStartupMessages(library(stringr)) ## for string manipulation
suppressPackageStartupMessages(library(ggplot2)) ## for plotting
suppressPackageStartupMessages(library(viridis)) ## for lovely color scales

```

# Iteration: A Definition

Doing the same* thing to a bunch of things.

<i>*ish</i>


# But We Have Ways to Do That Already, Right?

Let's try for a toy data set with 100 samples, each with 6 recorded variables.
```{r toy}

load('toy_purrr.Rdata')
head(toy.data)

```

# How could we ...

- find the mean of each variable

Try it now!

# Try solutions in Rstudio

# Iteration methods

- Copying and pasting
- `for` loops
- `lapply()`
- `apply()`, `mapply()`, `sapply()`, `tapply()`, `vapply()`

Nothing wrong with any of them if they work for you and your use case! But `purrr` can have some advantages.

# Why You Might Use `purrr` vs copy and paste / for loops / `apply()`s

1. Consistent, readable syntax (compare to the `_apply()`s)
1. Efficient (compare to `for` loops)
1. Plays nicely with pipes `%>%`
1. Returns the output you expect (type-stable)
1. Reproducibility/ease of making changes
1. Uses either built-in functions (eg, `mean()`) OR build your own, either inline (anonymous) or separately (user-defined)
1. Particularly useful if you're working with [list-columns](https://jennybc.github.io/purrr-tutorial/ls13_list-columns.html), JSON data, other non-strictly-rectangular data formats

# Preamble: Stop Worrying and Learn to Love the List

You're probably already using lists even if you don't know it (for example, a data.frame is a special kind of list!). Generically, lists in R can have as many elements as you want, and each element can be of whatever type you want (including another list... it's [lists all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down)). For example, a totally valid list:

```{r list_demo}
list("a" = 1:10,          ## numeric vector of length 10
     "b" = list(1:10),    ## list of length 1; element 1 = vector of length 10
     "c" = LETTERS[1:10]) ## character vector of length 10

```

Other examples of lists include model fit objects (we'll see this with `lm` later), `ggplot2` objects - lots of functions return lists. R for Data Science has a great [intro to lists](http://r4ds.had.co.nz/lists.html) for more information.

Lists' flexibility can allow you lots of freedom once you get comfortable with them; that flexibility can also introduce some complexity. `purrr` is built in part to let you take advantage of lists' benefits as well as some help dealing with the potential pitfalls.


# `map`s Are Where It's At

`map()` and its variants are the workhorses of `purrr`. They let us do the same or similar things to a bunch of things, get the output we expect, and sometimes get the final result we want in one step.

# How `map` Works

There are several variants of `map`, but they all work in the same general way:

1. Over a set of arguments (called `.x` in `map()` classic),
1. Do a function (`.f`)

# `map` can work with three kinds of functions:

1. Built-in functions (`mean`, `subset`...)

## An example:

Try finding the mean of each of the variables with `map`

# Solution

```{r builtin}
toy.data %>% map(mean)
```

# `map` can work with three kinds of functions:

1. Built-in functions (`mean`, `subset`...)
2. User-defined functions

## An example:

Try writing your own function to find the two integers on either side of the mean 

Then find these bounds for each of the variables with `map`

# Solution

```{r userdef}
my.mean.bounds <- function(x){
  c(floor(mean(x)),ceiling(mean(x)))
}

toy.data %>% map(my.mean.bounds)
```

# `map` can work with three kinds of functions:

1. Built-in functions (`mean`, `subset`...)
2. User-defined functions
3. Anonymous in-line functions

## Anonymous functions

Also called lambda functions

- ~ lets R know the following stuff will be an anonymous function
- . is each item in the list

map ([list I am iterating over], ~ . )
This would do nothing!

map ([list I am iterating over], ~ ./2)
This would just divide each thing in the list in half

## An example:

Try to find the two integers on either side of the mean for each of the variables with `map`

Using only map, only one line of code!

# Solution

```{r anon}
toy.data %>% map(~ c(floor(mean(.)),ceiling(mean(.))) )
```

# Types of `map`

`map` in its purest form will always give you a list. But if you've ever written `do.call(rbind, lapply(...))`, you know that sometimes you don't actually *want* a list. 
`purrr` is HERE FOR YOU.
`map` has several type-specific variants:

1. `_df`: turns your result into a data.frame/tibble! Can do this via rows (default; also `map_dfr`) or columns (`map_dfc`)
1. `_chr`: results in a character vector
1. `_lgl`: results in a logical vector
1. `_int`: results in an integer vector
1. `_dbl`: results in a double vector

# Review

Let's take two vectors, both `1:10`, and see what happens if we map over both using `map` variants. This will also be a basic introduction to using anonymous functions.

```{r map_examples}
v1 <- 1:10

v1 %>% map(~ . * 3)
## Returns a list, because we used map()

v1 %>% map_dbl(~ . * 3)
## Same values, but returns a vector of doubles

v1 %>% map_chr(~ LETTERS[.])
## Character vector of LETTERS[1:10]

v1 %>% map_lgl(~ . < 5)
## Logical vector that indicates whether the number is less than 5

```

# "Amounts" of `map`

You might use a slightly different version of `map` depending on how many things you want to change for each iteration.

1. `map`: Do the exact same thing to a bunch of things (specifies one argument to a function)
1. `map2`: Do the exact same thing to a bunch of things, except for one thing (specifies two arguments to a function)
1. `pmap`: Do similar things to a bunch of things (specifies many arguments to a function)

Each of these has a match in the `walk` functions. While `map` returns an object, `walk` is called for "side effects" (eg, plots, printed text, etc) and returns nothing. We'll see examples of both later.

# Example Time!

We're going to try out some `map` uses, and some other fun surprises of `purrr`, by looking at some US National Parks Service data. [Happy 101st birthday, National Parks!](https://www.nps.gov/orgs/1207/08-23-2017-nps-birthday.htm) Specifically, we'll use iteration to:

1. Fit the same model to three different outcomes
1. Check assumptions for those models
1. If needed, update the model
1. Visualize our model results

(Our statistical example is purposely kept very simple so the focus can be on iteration)

# Data

We'll be using a few datasets:

1. [Annual Recreation Visits, 2007-2016](https://data.world/nps/annual-park-ranking-recreation-visits)
1. [Annual Backcountry Campers, 2007-2016](https://data.world/nps/annual-park-ranking-backcountry-campers)
1. [Annual Tent Campers, 2007-2016](https://data.world/nps/annual-park-ranking-tent-campers)
1. [NPS Data Glossary](https://data.world/nps/glossary)

```{r data}

load('purrr_data.Rdata')

length(datalist)
# total recreational visitors
head(datalist[[1]])
# tent campers
head(datalist[[2]])
# backcountry visitors
head(datalist[[3]])


```


# Run Models

Let's say we want to predict the number of a) total recreational, b) tent campers, and c) backcountry visitors per year using the year, the region, and an interaction between the two. You guessed it: We can use `map`! This seems like a good time for an anonymous function.

```{r fit_org_models}
## Fit the same model to each dataset
orgmod_list <- map(
  .x = datalist,
  .f = ~ lm(value ~ year * region, data = .)
)

orgmod_sum <- orgmod_list %>% map(summary)

orgmod_sum

```

Looks like everything went well, but lots of us are statisticians, after all. Do these models fit the usual assumptions? Let's quickly look at some residuals vs fitted plots using `purrr`'s `walk()` function, which you can call when you want the *side effects* of a function instead of returning an object.

```{r rpplots}
par(mfrow = c(1, 3))
walk(orgmod_list, ~ plot(resid(.) ~ predict(.)))

```

Hmm... some weirdness. What's the distribution of our outcome?

```{r histograms}
walk(datalist, ~ print(ggplot(data = ., aes(x = value)) + geom_histogram()))

```

Some skewness there! Let's log transform our outcome and refit the models.

```{r logtrans}
## Add log transformed value to each dataset
## One base way
# for(i in 1:length(datalist)){
#   datalist[[i]]$logvalue <- log(datalist[[i]]$value)
# }

## purrr + dplyr way: apply the log function to the value column in each dataset
datalist <- datalist %>%
  map(~ mutate_at(.x, "value", log))

## Refit linear model to each dataset, recheck RP plots
logmod_list <- map(datalist, ~ lm(value ~ year * region, data = .))

par(mfrow = c(1, 3))
walk(logmod_list, ~ plot(resid(.) ~ predict(.)))

```

Looking better. Just out of curiosity, what's our R^2^ on those models? `summary()` of an `lm` object returns a list, of which one element is the adjusted R^2^. We can extract that value for each of our models really quickly using `map_dbl`.

```{r r2}
## You can do this two ways, whichever you find more readable:
## All in one line:
round(map_dbl(logmod_list, ~ summary(.)$adj.r.squared), 2)

## In a pipe:
logmod_list %>%
  map(summary) %>%
  map_dbl(.f = "adj.r.squared") %>%
  ## Passing .f a quoted string means "get this element out of the object in .x"
  round(2)

```

Well, that's not great, but that's not really the point now is it. Moving on!

# Plot Results

Now let's say we want to generate separate plots for the predicted visitors over time by region for each dataset, and save each plot as a PDF. We're going to

1. Create a list of data.frames with predicted values for each region and year
2. Plot each
3. Save those plots

In this chunk of code, we use:

- `purrr::cross_df` to get all possible combinations of two vectors and put them in a data.frame *(this does essentially the same thing as `expand.grid`, but `cross` can also create lists, which can be really helpful for simulations, for example)*
- `purrr::pluck` to extract elements of a list - this can be helpful, since list notation can get confusing in its natural habitat, mixing `[[double brackets]][singlebrackets]$dollarsigns`
- `purrr::map` in a pipeline, starting with one list of elements and putting it through a process with multiple steps

```{r predvals}
## -- Create base data set with records for which we want predicted values -----
preddata <- cross_df(
  ## You can access the columns of one of our datasets using purrr::pluck() or
  ## base R; both ways shown here
  .l = list("year" = unique(pluck(datalist, 1, "year")),
            "region" = levels(datalist[[1]]$region))
)

## -- Get actual predicted values for each year, region ------------------------
pred_list <- logmod_list %>%
  ## Apply the predict function to each model
  map(predict, newdata = preddata, se.fit = TRUE) %>%
  ## predict() returns a list; extract the fit and se.fit elements
  ## Again, elements of our list are extracted two ways to compare
  map(~ data.frame(fit = pluck(., "fit"), se = .$se.fit) %>%
        ## Calculate confidence limits
        mutate(lcl = fit - qnorm(0.975) * se,
               ucl = fit + qnorm(0.975) * se)) %>%
  ## Add year and region onto each
  map(dplyr::bind_cols, preddata)

```

```{r plot_pred}
## -- Write a function to plot values for a given dataset ----------------------
plot_predicted <- function(df, vscale, maintitle){
  ## Make sure df has all the columns we need
  if(!all(c("fit", "se", "lcl", "ucl", "year", "region") %in% names(df))){
    stop("df should have columns fit, se, lcl, ucl, year, region")
  }
  
  ## Create a plot faceted by region
  p <- ggplot(data = df, aes(x = year, y = fit)) +
    facet_wrap(~ region, nrow = 2) +
    geom_ribbon(aes(ymin = lcl, ymax = ucl, fill = region), alpha = 0.4) +
    geom_line(aes(color = region), size = 2) +
    scale_fill_viridis(option = vscale, discrete = TRUE, end = 0.75) +
    scale_colour_viridis(option = vscale, discrete = TRUE, end = 0.75) +
    labs(title = maintitle,
         x = NULL, y = "Log(Visitors)") +
    theme(legend.position = "none")
  
  return(p)
  
}

```

Notice our function has three arguments, which means we can't use `map`. We need the big guns: `pmap`. The `p` stands for `parallel`, and we're going to iterate over a **list** of arguments in *parallel* to get the plots we want. First, we'll set up our named list of arguments.

```{r set_plot_args}
plot_args <- list(
  "df" = pred_list, ## list with three elements
  "vscale" = c("D", "A", "C"),
  "maintitle" = c("Total Recreational Visits",
                  "Tent Campers",
                  "Backcountry Campers")
)

```

Because we wrote our function already, once that list is done, it's one simple line to generate all of our plots:

```{r generate_plots}
nps_plots <- pmap(plot_args, plot_predicted)

```

Notice that nothing printed; `pmap` saved these three plots to a list, but now we need to do something with them. We could print them to our screen with `walk(nps_plots, print)`, OR we could save them to PDFs using `walk2`. Remember, `map2` and `walk2` iterate over *exactly two* arguments - here, it'll be our list of plots, and a list of file names.

```{r save_plots}
walk2(.x = c("rec.pdf", "tent.pdf", "backcountry.pdf"),
      .y = nps_plots,
      ggsave,
      width = 8, height = 6)

```

Thus ends our example!

# BUT WAIT! THERE'S MORE!

A few purrr features we haven't mentioned yet:

- `partial`, for when you want to create a partially specified version of a function (eg, `q25 <- partial(quantile, probs = 0.25, na.rm = TRUE)`)
- `flatten`, for removing hierarchies from a list
- `safely`, `quietly`, `possibly` - can be helpful especially when writing functions or packages
- `invoke`, `modify` - I haven't used these a ton yet
- List-columns can be your friend if you want to store complex data, results, etc in a tidy way; this is likely a whole other meetup, but `purrr` functions can be really helpful when working with these. Jenny Bryan's tutorial linked below is a great resource here.

# purrr resources

- [Official page on tidyverse.org](http://purrr.tidyverse.org/)
- [RStudio cheatsheet](https://www.rstudio.com/resources/cheatsheets/) (under "Apply Functions")
- [DataCamp: Writing Functions in R](https://www.datacamp.com/courses/writing-functions-in-r/)
- [Charlotte Wickham's purrr tutorial](https://github.com/cwickham/purrr-tutorial)
- [Jenny Bryan's purrr tutorial](https://jennybc.github.io/purrr-tutorial/): particularly great if you love the idea of list-columns
- [Hadley Wickham on purrr vs *apply](https://stackoverflow.com/questions/45101045/why-use-purrrmap-instead-of-lapply/47123420#47123420)
- Fun use cases:
    - A [roundup of blog posts](https://maraaverick.rbind.io/2017/09/purrr-ty-posts/) curated by Mara Averick
    - [Peter Kamerman on bootstrap CIs](https://www.painblogr.org/2017-10-18-purrring-through-bootstraps.html)
    - [Ken Butler on handling errors with safely/possibly](https://nxskok.github.io/blog/2017/09/07/safely-possibly/)