Find file Copy path
db6f7e8 Jun 28, 2018
3 contributors

Users who have contributed to this file

@alexpghayes @jrnold @nutterb
139 lines (101 sloc) 4.8 KB
title: "Tidy bootstrapping"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Tidy bootstrapping}
```{r setup, include = FALSE}
collapse = TRUE,
comment = "#>",
message = FALSE
# Tidy bootstrapping
Another place where combining model fits in a tidy way becomes useful is when performing bootstrapping or permutation tests. These approaches have been explored before, for instance by [Andrew MacDonald here](, and [Hadley has explored efficient support for bootstrapping]( as a potential enhancement to dplyr. broom fits naturally with dplyr in performing these analyses.
Bootstrapping consists of randomly sampling a dataset with replacement, then performing the analysis individually on each bootstrapped replicate. The variation in the resulting estimate is then a reasonable approximation of the variance in our estimate.
Let's say we want to fit a nonlinear model to the weight/mileage relationship in the `mtcars` dataset.
ggplot(mtcars, aes(mpg, wt)) +
We might use the method of nonlinear least squares (via the `nls` function) to fit a model.
nlsfit <- nls(mpg ~ k / wt + b, mtcars, start = list(k = 1, b = 0))
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
geom_line(aes(y = predict(nlsfit)))
While this does provide a p-value and confidence intervals for the parameters, these are based on model assumptions that may not hold in real data. Bootstrapping is a popular method for providing confidence intervals and predictions that are more robust to the nature of the data.
We can use the `bootstraps` function in the **rsample** package to sample bootstrap replications. First, we construct 100 bootstrap replications of the data, each of which has been randomly sampled with replacement. The resulting object is an `rset`, which is a dataframe with a column of `rsplit` objects.
An `rsplit` object has two main components: an analysis dataset and an assessment dataset, accessible via `analysis(rsplit)` and `assessment(rsplit)` respectively. For bootstrap samples, the analysis dataset is the bootstrap sample itself, and the assessment dataset consists of all the out of bag samples.
boots <- bootstraps(mtcars, times = 100)
We create a helper function to fit an `nls` model on each bootstrap sample, and then use `purrr::map` to apply this function to all the bootstrap samples at once. Similarly, we create a column of tidy coefficient information by unnesting.
fit_nls_on_bootstrap <- function(split) {
nls(mpg ~ k / wt + b, analysis(split), start = list(k = 1, b = 0))
boot_models <- boots %>%
mutate(model = map(splits, fit_nls_on_bootstrap),
coef_info = map(model, tidy))
boot_coefs <- boot_models %>%
The unnested coefficient information contains a summary of each replication combined in a single data frame:
We can then calculate confidence intervals (using what is called the [percentile method](
alpha <- .05
boot_coefs %>%
group_by(term) %>%
summarize(low = quantile(estimate, alpha / 2),
high = quantile(estimate, 1 - alpha / 2))
Or we can use histograms to get a more detailed idea of the uncertainty in each estimate:
ggplot(boot_coefs, aes(estimate)) +
geom_histogram(binwidth = 2) +
facet_wrap(~ term, scales = "free")
Or we can use `augment` to visualize the uncertainty in the curve:
boot_aug <- boot_models %>%
mutate(augmented = map(model, augment)) %>%
ggplot(boot_aug, aes(wt, mpg)) +
geom_point() +
geom_line(aes(y = .fitted, group = id), alpha=.2)
With only a few small changes, we could easily perform bootstrapping with other kinds of predictive or hypothesis testing models, since the `tidy` and `augment` functions works for many statistical outputs. As another example, we could use `smooth.spline`, which fits a cubic smoothing spline to data:
fit_spline_on_bootstrap <- function(split) {
data <- analysis(split)
smooth.spline(data$wt, data$mpg, df = 4)
boot_splines <- boots %>%
mutate(spline = map(splits, fit_spline_on_bootstrap),
aug_train = map(spline, augment))
splines_aug <- boot_splines %>%
ggplot(splines_aug, aes(x, y)) +
geom_point() +
geom_line(aes(y = .fitted, group = id), alpha = 0.2)