/
purrr_intro.Rmd
executable file
·433 lines (291 loc) · 15.4 KB
/
purrr_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
---
title: "An Intro to purrr"
author: ""
output: slidy_presentation
---
# purrr: Functional programming in R
This presentation was created by Jennifer Thompson in 2017 for R-Ladies Nashville, I've modified it for us today
# Setup
First we'll load the packages we need today
```{r libraries}
## -- Load R libraries ---------------------------------------------------------
suppressPackageStartupMessages(library(purrr)) ## obvs
suppressPackageStartupMessages(library(dplyr)) ## for data management
suppressPackageStartupMessages(library(tidyr)) ## for data management
## Note: If you have the tidyverse package installed, library(tidyverse) will
## load purrr along with several other core packages, including dplyr & tidyr
suppressPackageStartupMessages(library(stringr)) ## for string manipulation
suppressPackageStartupMessages(library(ggplot2)) ## for plotting
suppressPackageStartupMessages(library(viridis)) ## for lovely color scales
```
# Iteration: A Definition
Doing the same* thing to a bunch of things.
<i>*ish</i>
# But We Have Ways to Do That Already, Right?
Let's try for a toy data set with 100 samples, each with 6 recorded variables.
```{r toy}
load('toy_purrr.Rdata')
head(toy.data)
```
# How could we ...
- find the mean of each variable
Try it now!
# Try solutions in Rstudio
# Iteration methods
- Copying and pasting
- `for` loops
- `lapply()`
- `apply()`, `mapply()`, `sapply()`, `tapply()`, `vapply()`
Nothing wrong with any of them if they work for you and your use case! But `purrr` can have some advantages.
# Why You Might Use `purrr` vs copy and paste / for loops / `apply()`s
1. Consistent, readable syntax (compare to the `_apply()`s)
1. Efficient (compare to `for` loops)
1. Plays nicely with pipes `%>%`
1. Returns the output you expect (type-stable)
1. Reproducibility/ease of making changes
1. Uses either built-in functions (eg, `mean()`) OR build your own, either inline (anonymous) or separately (user-defined)
1. Particularly useful if you're working with [list-columns](https://jennybc.github.io/purrr-tutorial/ls13_list-columns.html), JSON data, other non-strictly-rectangular data formats
# Preamble: Stop Worrying and Learn to Love the List
You're probably already using lists even if you don't know it (for example, a data.frame is a special kind of list!). Generically, lists in R can have as many elements as you want, and each element can be of whatever type you want (including another list... it's [lists all the way down](https://en.wikipedia.org/wiki/Turtles_all_the_way_down)). For example, a totally valid list:
```{r list_demo}
list("a" = 1:10, ## numeric vector of length 10
"b" = list(1:10), ## list of length 1; element 1 = vector of length 10
"c" = LETTERS[1:10]) ## character vector of length 10
```
Other examples of lists include model fit objects (we'll see this with `lm` later), `ggplot2` objects - lots of functions return lists. R for Data Science has a great [intro to lists](http://r4ds.had.co.nz/lists.html) for more information.
Lists' flexibility can allow you lots of freedom once you get comfortable with them; that flexibility can also introduce some complexity. `purrr` is built in part to let you take advantage of lists' benefits as well as some help dealing with the potential pitfalls.
# `map`s Are Where It's At
`map()` and its variants are the workhorses of `purrr`. They let us do the same or similar things to a bunch of things, get the output we expect, and sometimes get the final result we want in one step.
# How `map` Works
There are several variants of `map`, but they all work in the same general way:
1. Over a set of arguments (called `.x` in `map()` classic),
1. Do a function (`.f`)
# `map` can work with three kinds of functions:
1. Built-in functions (`mean`, `subset`...)
## An example:
Try finding the mean of each of the variables with `map`
# Solution
```{r builtin}
toy.data %>% map(mean)
```
# `map` can work with three kinds of functions:
1. Built-in functions (`mean`, `subset`...)
2. User-defined functions
## An example:
Try writing your own function to find the two integers on either side of the mean
Then find these bounds for each of the variables with `map`
# Solution
```{r userdef}
my.mean.bounds <- function(x){
c(floor(mean(x)),ceiling(mean(x)))
}
toy.data %>% map(my.mean.bounds)
```
# `map` can work with three kinds of functions:
1. Built-in functions (`mean`, `subset`...)
2. User-defined functions
3. Anonymous in-line functions
## Anonymous functions
Also called lambda functions
- ~ lets R know the following stuff will be an anonymous function
- . is each item in the list
map ([list I am iterating over], ~ . )
This would do nothing!
map ([list I am iterating over], ~ ./2)
This would just divide each thing in the list in half
## An example:
Try to find the two integers on either side of the mean for each of the variables with `map`
Using only map, only one line of code!
# Solution
```{r anon}
toy.data %>% map(~ c(floor(mean(.)),ceiling(mean(.))) )
```
# Types of `map`
`map` in its purest form will always give you a list. But if you've ever written `do.call(rbind, lapply(...))`, you know that sometimes you don't actually *want* a list.
`purrr` is HERE FOR YOU.
`map` has several type-specific variants:
1. `_df`: turns your result into a data.frame/tibble! Can do this via rows (default; also `map_dfr`) or columns (`map_dfc`)
1. `_chr`: results in a character vector
1. `_lgl`: results in a logical vector
1. `_int`: results in an integer vector
1. `_dbl`: results in a double vector
# Review
Let's take two vectors, both `1:10`, and see what happens if we map over both using `map` variants. This will also be a basic introduction to using anonymous functions.
```{r map_examples}
v1 <- 1:10
v1 %>% map(~ . * 3)
## Returns a list, because we used map()
v1 %>% map_dbl(~ . * 3)
## Same values, but returns a vector of doubles
v1 %>% map_chr(~ LETTERS[.])
## Character vector of LETTERS[1:10]
v1 %>% map_lgl(~ . < 5)
## Logical vector that indicates whether the number is less than 5
```
# "Amounts" of `map`
You might use a slightly different version of `map` depending on how many things you want to change for each iteration.
1. `map`: Do the exact same thing to a bunch of things (specifies one argument to a function)
1. `map2`: Do the exact same thing to a bunch of things, except for one thing (specifies two arguments to a function)
1. `pmap`: Do similar things to a bunch of things (specifies many arguments to a function)
Each of these has a match in the `walk` functions. While `map` returns an object, `walk` is called for "side effects" (eg, plots, printed text, etc) and returns nothing. We'll see examples of both later.
# Example Time!
We're going to try out some `map` uses, and some other fun surprises of `purrr`, by looking at some US National Parks Service data. [Happy 101st birthday, National Parks!](https://www.nps.gov/orgs/1207/08-23-2017-nps-birthday.htm) Specifically, we'll use iteration to:
1. Fit the same model to three different outcomes
1. Check assumptions for those models
1. If needed, update the model
1. Visualize our model results
(Our statistical example is purposely kept very simple so the focus can be on iteration)
# Data
We'll be using a few datasets:
1. [Annual Recreation Visits, 2007-2016](https://data.world/nps/annual-park-ranking-recreation-visits)
1. [Annual Backcountry Campers, 2007-2016](https://data.world/nps/annual-park-ranking-backcountry-campers)
1. [Annual Tent Campers, 2007-2016](https://data.world/nps/annual-park-ranking-tent-campers)
1. [NPS Data Glossary](https://data.world/nps/glossary)
```{r data}
load('purrr_data.Rdata')
length(datalist)
# total recreational visitors
head(datalist[[1]])
# tent campers
head(datalist[[2]])
# backcountry visitors
head(datalist[[3]])
```
# Run Models
Let's say we want to predict the number of a) total recreational, b) tent campers, and c) backcountry visitors per year using the year, the region, and an interaction between the two. You guessed it: We can use `map`! This seems like a good time for an anonymous function.
```{r fit_org_models}
## Fit the same model to each dataset
orgmod_list <- map(
.x = datalist,
.f = ~ lm(value ~ year * region, data = .)
)
orgmod_sum <- orgmod_list %>% map(summary)
orgmod_sum
```
Looks like everything went well, but lots of us are statisticians, after all. Do these models fit the usual assumptions? Let's quickly look at some residuals vs fitted plots using `purrr`'s `walk()` function, which you can call when you want the *side effects* of a function instead of returning an object.
```{r rpplots}
par(mfrow = c(1, 3))
walk(orgmod_list, ~ plot(resid(.) ~ predict(.)))
```
Hmm... some weirdness. What's the distribution of our outcome?
```{r histograms}
walk(datalist, ~ print(ggplot(data = ., aes(x = value)) + geom_histogram()))
```
Some skewness there! Let's log transform our outcome and refit the models.
```{r logtrans}
## Add log transformed value to each dataset
## One base way
# for(i in 1:length(datalist)){
# datalist[[i]]$logvalue <- log(datalist[[i]]$value)
# }
## purrr + dplyr way: apply the log function to the value column in each dataset
datalist <- datalist %>%
map(~ mutate_at(.x, "value", log))
## Refit linear model to each dataset, recheck RP plots
logmod_list <- map(datalist, ~ lm(value ~ year * region, data = .))
par(mfrow = c(1, 3))
walk(logmod_list, ~ plot(resid(.) ~ predict(.)))
```
Looking better. Just out of curiosity, what's our R^2^ on those models? `summary()` of an `lm` object returns a list, of which one element is the adjusted R^2^. We can extract that value for each of our models really quickly using `map_dbl`.
```{r r2}
## You can do this two ways, whichever you find more readable:
## All in one line:
round(map_dbl(logmod_list, ~ summary(.)$adj.r.squared), 2)
## In a pipe:
logmod_list %>%
map(summary) %>%
map_dbl(.f = "adj.r.squared") %>%
## Passing .f a quoted string means "get this element out of the object in .x"
round(2)
```
Well, that's not great, but that's not really the point now is it. Moving on!
# Plot Results
Now let's say we want to generate separate plots for the predicted visitors over time by region for each dataset, and save each plot as a PDF. We're going to
1. Create a list of data.frames with predicted values for each region and year
2. Plot each
3. Save those plots
In this chunk of code, we use:
- `purrr::cross_df` to get all possible combinations of two vectors and put them in a data.frame *(this does essentially the same thing as `expand.grid`, but `cross` can also create lists, which can be really helpful for simulations, for example)*
- `purrr::pluck` to extract elements of a list - this can be helpful, since list notation can get confusing in its natural habitat, mixing `[[double brackets]][singlebrackets]$dollarsigns`
- `purrr::map` in a pipeline, starting with one list of elements and putting it through a process with multiple steps
```{r predvals}
## -- Create base data set with records for which we want predicted values -----
preddata <- cross_df(
## You can access the columns of one of our datasets using purrr::pluck() or
## base R; both ways shown here
.l = list("year" = unique(pluck(datalist, 1, "year")),
"region" = levels(datalist[[1]]$region))
)
## -- Get actual predicted values for each year, region ------------------------
pred_list <- logmod_list %>%
## Apply the predict function to each model
map(predict, newdata = preddata, se.fit = TRUE) %>%
## predict() returns a list; extract the fit and se.fit elements
## Again, elements of our list are extracted two ways to compare
map(~ data.frame(fit = pluck(., "fit"), se = .$se.fit) %>%
## Calculate confidence limits
mutate(lcl = fit - qnorm(0.975) * se,
ucl = fit + qnorm(0.975) * se)) %>%
## Add year and region onto each
map(dplyr::bind_cols, preddata)
```
```{r plot_pred}
## -- Write a function to plot values for a given dataset ----------------------
plot_predicted <- function(df, vscale, maintitle){
## Make sure df has all the columns we need
if(!all(c("fit", "se", "lcl", "ucl", "year", "region") %in% names(df))){
stop("df should have columns fit, se, lcl, ucl, year, region")
}
## Create a plot faceted by region
p <- ggplot(data = df, aes(x = year, y = fit)) +
facet_wrap(~ region, nrow = 2) +
geom_ribbon(aes(ymin = lcl, ymax = ucl, fill = region), alpha = 0.4) +
geom_line(aes(color = region), size = 2) +
scale_fill_viridis(option = vscale, discrete = TRUE, end = 0.75) +
scale_colour_viridis(option = vscale, discrete = TRUE, end = 0.75) +
labs(title = maintitle,
x = NULL, y = "Log(Visitors)") +
theme(legend.position = "none")
return(p)
}
```
Notice our function has three arguments, which means we can't use `map`. We need the big guns: `pmap`. The `p` stands for `parallel`, and we're going to iterate over a **list** of arguments in *parallel* to get the plots we want. First, we'll set up our named list of arguments.
```{r set_plot_args}
plot_args <- list(
"df" = pred_list, ## list with three elements
"vscale" = c("D", "A", "C"),
"maintitle" = c("Total Recreational Visits",
"Tent Campers",
"Backcountry Campers")
)
```
Because we wrote our function already, once that list is done, it's one simple line to generate all of our plots:
```{r generate_plots}
nps_plots <- pmap(plot_args, plot_predicted)
```
Notice that nothing printed; `pmap` saved these three plots to a list, but now we need to do something with them. We could print them to our screen with `walk(nps_plots, print)`, OR we could save them to PDFs using `walk2`. Remember, `map2` and `walk2` iterate over *exactly two* arguments - here, it'll be our list of plots, and a list of file names.
```{r save_plots}
walk2(.x = c("rec.pdf", "tent.pdf", "backcountry.pdf"),
.y = nps_plots,
ggsave,
width = 8, height = 6)
```
Thus ends our example!
# BUT WAIT! THERE'S MORE!
A few purrr features we haven't mentioned yet:
- `partial`, for when you want to create a partially specified version of a function (eg, `q25 <- partial(quantile, probs = 0.25, na.rm = TRUE)`)
- `flatten`, for removing hierarchies from a list
- `safely`, `quietly`, `possibly` - can be helpful especially when writing functions or packages
- `invoke`, `modify` - I haven't used these a ton yet
- List-columns can be your friend if you want to store complex data, results, etc in a tidy way; this is likely a whole other meetup, but `purrr` functions can be really helpful when working with these. Jenny Bryan's tutorial linked below is a great resource here.
# purrr resources
- [Official page on tidyverse.org](http://purrr.tidyverse.org/)
- [RStudio cheatsheet](https://www.rstudio.com/resources/cheatsheets/) (under "Apply Functions")
- [DataCamp: Writing Functions in R](https://www.datacamp.com/courses/writing-functions-in-r/)
- [Charlotte Wickham's purrr tutorial](https://github.com/cwickham/purrr-tutorial)
- [Jenny Bryan's purrr tutorial](https://jennybc.github.io/purrr-tutorial/): particularly great if you love the idea of list-columns
- [Hadley Wickham on purrr vs *apply](https://stackoverflow.com/questions/45101045/why-use-purrrmap-instead-of-lapply/47123420#47123420)
- Fun use cases:
- A [roundup of blog posts](https://maraaverick.rbind.io/2017/09/purrr-ty-posts/) curated by Mara Averick
- [Peter Kamerman on bootstrap CIs](https://www.painblogr.org/2017-10-18-purrring-through-bootstraps.html)
- [Ken Butler on handling errors with safely/possibly](https://nxskok.github.io/blog/2017/09/07/safely-possibly/)