Permalink
Find file Copy path
db6f7e8 Jun 28, 2018
3 contributors

Users who have contributed to this file

@alexpghayes @jrnold @nutterb
139 lines (101 sloc) 4.8 KB
---
title: "Tidy bootstrapping"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Tidy bootstrapping}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
message = FALSE
)
```
# Tidy bootstrapping
Another place where combining model fits in a tidy way becomes useful is when performing bootstrapping or permutation tests. These approaches have been explored before, for instance by [Andrew MacDonald here](http://rstudio-pubs-static.s3.amazonaws.com/19698_a4c472606e3c43e4b94720506e49bb7b.html), and [Hadley has explored efficient support for bootstrapping](https://github.com/hadley/dplyr/issues/269) as a potential enhancement to dplyr. broom fits naturally with dplyr in performing these analyses.
Bootstrapping consists of randomly sampling a dataset with replacement, then performing the analysis individually on each bootstrapped replicate. The variation in the resulting estimate is then a reasonable approximation of the variance in our estimate.
Let's say we want to fit a nonlinear model to the weight/mileage relationship in the `mtcars` dataset.
```{r}
library(ggplot2)
ggplot(mtcars, aes(mpg, wt)) +
geom_point()
```
We might use the method of nonlinear least squares (via the `nls` function) to fit a model.
```{r}
nlsfit <- nls(mpg ~ k / wt + b, mtcars, start = list(k = 1, b = 0))
summary(nlsfit)
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
geom_line(aes(y = predict(nlsfit)))
```
While this does provide a p-value and confidence intervals for the parameters, these are based on model assumptions that may not hold in real data. Bootstrapping is a popular method for providing confidence intervals and predictions that are more robust to the nature of the data.
We can use the `bootstraps` function in the **rsample** package to sample bootstrap replications. First, we construct 100 bootstrap replications of the data, each of which has been randomly sampled with replacement. The resulting object is an `rset`, which is a dataframe with a column of `rsplit` objects.
An `rsplit` object has two main components: an analysis dataset and an assessment dataset, accessible via `analysis(rsplit)` and `assessment(rsplit)` respectively. For bootstrap samples, the analysis dataset is the bootstrap sample itself, and the assessment dataset consists of all the out of bag samples.
```{r}
library(dplyr)
library(rsample)
library(broom)
library(purrr)
set.seed(27)
boots <- bootstraps(mtcars, times = 100)
boots
```
We create a helper function to fit an `nls` model on each bootstrap sample, and then use `purrr::map` to apply this function to all the bootstrap samples at once. Similarly, we create a column of tidy coefficient information by unnesting.
```{r}
fit_nls_on_bootstrap <- function(split) {
nls(mpg ~ k / wt + b, analysis(split), start = list(k = 1, b = 0))
}
boot_models <- boots %>%
mutate(model = map(splits, fit_nls_on_bootstrap),
coef_info = map(model, tidy))
boot_coefs <- boot_models %>%
unnest(coef_info)
```
The unnested coefficient information contains a summary of each replication combined in a single data frame:
```{r}
boot_coefs
```
We can then calculate confidence intervals (using what is called the [percentile method](https://www.uvm.edu/~dhowell/StatPages/Randomization%20Tests/ResamplingWithR/BootstMeans/bootstrapping_means.html)):
```{r}
alpha <- .05
boot_coefs %>%
group_by(term) %>%
summarize(low = quantile(estimate, alpha / 2),
high = quantile(estimate, 1 - alpha / 2))
```
Or we can use histograms to get a more detailed idea of the uncertainty in each estimate:
```{r}
ggplot(boot_coefs, aes(estimate)) +
geom_histogram(binwidth = 2) +
facet_wrap(~ term, scales = "free")
```
Or we can use `augment` to visualize the uncertainty in the curve:
```{r}
boot_aug <- boot_models %>%
mutate(augmented = map(model, augment)) %>%
unnest(augmented)
boot_aug
```
```{r}
ggplot(boot_aug, aes(wt, mpg)) +
geom_point() +
geom_line(aes(y = .fitted, group = id), alpha=.2)
```
With only a few small changes, we could easily perform bootstrapping with other kinds of predictive or hypothesis testing models, since the `tidy` and `augment` functions works for many statistical outputs. As another example, we could use `smooth.spline`, which fits a cubic smoothing spline to data:
```{r}
fit_spline_on_bootstrap <- function(split) {
data <- analysis(split)
smooth.spline(data$wt, data$mpg, df = 4)
}
boot_splines <- boots %>%
mutate(spline = map(splits, fit_spline_on_bootstrap),
aug_train = map(spline, augment))
splines_aug <- boot_splines %>%
unnest(aug_train)
ggplot(splines_aug, aes(x, y)) +
geom_point() +
geom_line(aes(y = .fitted, group = id), alpha = 0.2)
```