Thinking about bootstrap grouping #269

hadley · 2014-02-19T17:35:47Z

bootstrap <- function(df, m) {
  n <- nrow(df)

  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE), 
    simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")

  df
}

library(dplyr)
mboot <- bootstrap(mtcars, 10)

# Works
mboot %.% summarise(mean(cyl))

# Not obvious what mutate, filter, arrange should do.
# Need to duplicate data.

romainfrancois · 2014-02-19T17:58:37Z

Might be useful to have:

class(df) <- c("grouped_bootstrap_df", "grouped_df", "tbl_df", "tbl", "data.frame")

So that internally we can potentially do something different. Although I guess we can use the vars and labels attributes too.

For filter we could keep the data unchanged and change the indices, group_sizes and biggest_group_size attributes. This way, conceptually, we can chain bootstrap, filter and summarise, should this be of any interest.
If we go this way, I guess we will need a materialise verb.
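
As a rough illustration of the filter idea above, here is a minimal base R sketch. It assumes 1-based indices for readability, and `filter_indices()` is a hypothetical helper, not a dplyr function; it only shows how the three attributes could be recomputed without copying the data.

```r
# Hypothetical helper (not part of dplyr): filter a bootstrap "grouping"
# by rewriting the per-replicate indices instead of copying the data.
# Indices are 1-based here so they can be used directly in base R.
filter_indices <- function(data, indices, keep) {
  # keep: logical vector with one entry per row of `data`
  stopifnot(is.logical(keep), length(keep) == nrow(data))
  # Within each replicate, retain only the draws whose source row passes
  new_indices <- lapply(indices, function(idx) idx[keep[idx]])
  sizes <- lengths(new_indices)
  list(
    indices = new_indices,
    group_sizes = sizes,
    biggest_group_size = if (length(sizes)) max(sizes) else 0L
  )
}

set.seed(1)
n <- nrow(mtcars)
indices <- replicate(3, sample(n, replace = TRUE), simplify = FALSE)

# Keep only 4-cylinder cars in every replicate; mtcars itself is untouched
res <- filter_indices(mtcars, indices, mtcars$cyl == 4)
res$group_sizes
```

Note that the replicates no longer share a common size after filtering, which is why group_sizes and biggest_group_size have to be recomputed.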

Alternatively we could materialise directly, but then I'm not sure what the grouping would mean. We could materialise data for each group adjacently in memory, which can be interesting too.
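
One possible reading of "materialise", sketched in base R. The function name, the 1-based indices and the extra `replicate` column are all assumptions made for illustration; no such verb exists in dplyr.

```r
# Hypothetical materialise(): expands 1-based per-replicate indices into
# an ordinary data frame, with each replicate's rows stored adjacently.
materialise <- function(data, indices) {
  rows <- unlist(indices)
  out <- data[rows, , drop = FALSE]
  rownames(out) <- NULL  # drop the de-duplicated row names
  # Label every row with the replicate it was drawn for
  cbind(replicate = rep(seq_along(indices), lengths(indices)), out)
}

set.seed(1)
n <- nrow(mtcars)
indices <- replicate(3, sample(n, replace = TRUE), simplify = FALSE)

boot_df <- materialise(mtcars, indices)
dim(boot_df)  # 3 * 32 rows; the 11 original columns plus `replicate`
```

After this step the result is a plain data frame that could be grouped by `replicate` in the usual way, at the cost of holding m copies of the data.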

arrange already does not make a lot of sense with simple grouping. ATM, it ignores the grouping altogether.

mutate is I think the most difficult to make sense of in the bootstrap case. We have to assume that a column created by mutate in one group is independent from other groups. We can't have indices indicating positions of the new variable within a variable of the same size as the original data frame.
I think we'd have to store variables coming from mutate in a list, having as many elements as there are bootstrap samples. The problem is that this moves away from a data frame. Once again, we could have materialise transforming this structure into an actual data frame.
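
To make the mutate problem concrete, here is a small base R sketch under the same assumptions (1-based indices, illustrative variable names). The same original row can receive a different value in each replicate, so the new variable cannot live in a single column of length nrow(df); a list with one element per bootstrap sample is one way to hold it.

```r
set.seed(1)
n <- nrow(mtcars)
indices <- replicate(3, sample(n, replace = TRUE), simplify = FALSE)

# One list element per bootstrap sample. Each element holds the new
# column computed within that replicate only (here: mpg centred on the
# replicate's own mean).
mpg_centred <- lapply(indices, function(idx) {
  mpg <- mtcars$mpg[idx]
  mpg - mean(mpg)
})

length(mpg_centred)   # one element per replicate
lengths(mpg_centred)  # each as long as its replicate
```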

nograpes · 2014-09-12T13:54:05Z

I'm not certain about this, but in answering this Stack Overflow question, I believe that I have discovered a bug in this function. Here are the steps to reproduce the bug:

library(dplyr)
mboot <- bootstrap(mtcars, 10)
bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Error: index out of bounds

And here is the proposed fix:

bootstrap <- function(df, m) {
  n <- nrow(df)

  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE), 
                                   simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(replicate)) # Change
  #  attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")

  df
}

Which fixes the above case:

bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Source: local data frame [6 x 2]
# Groups: replicate

#   replicate x
# 1         1 1
# 2         1 2
# 3         2 1
# 4         2 2
# 5         3 1
# 6         3 2

dgrtwo · 2014-09-18T03:12:44Z

It looks like there's another small bug: since a grouped_df's indices are 0-indexed, not 1-indexed,

attr(df, "indices") <- replicate(m, sample(n, replace = TRUE), 
  simplify = FALSE)

should be

attr(df, "indices") <- replicate(m, sample(n, replace = TRUE) - 1, 
  simplify = FALSE)

Otherwise do leads to NAs in the output, as can be confirmed with

bootstrap(mtcars, 5) %>% do(data.frame(sum(is.na(.$mpg))))
Source: local data frame [5 x 2]
Groups: replicate

  replicate sum.is.na...mpg..
1         1                 0
2         2                 1
3         3                 1
4         4                 1
5         5                 1

and summarize returns 0 (or potentially arbitrary values), because the out-of-range index reads memory past the end of the column:

bootstrap(mtcars, 5) %>% summarize(min(mpg))
Source: local data frame [5 x 2]

  replicate min(mpg)
1         1      0.0
2         2     10.4
3         3      0.0
4         4      0.0
5         5     10.4
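
The off-by-one described above can be checked in plain R without touching dplyr internals. This is only a sketch of the index arithmetic; `range_ok()` is a hypothetical helper introduced for the check.

```r
set.seed(1)
n <- nrow(mtcars)

one_based  <- sample(n, replace = TRUE)  # values in 1..n, as sample() returns
zero_based <- one_based - 1              # values in 0..(n - 1), as stored

# Valid 0-based positions for n rows are 0..(n - 1)
range_ok <- function(idx, n) all(idx >= 0 & idx <= n - 1)

range_ok(zero_based, n)  # TRUE: shifted draws always stay in range

# Without the shift, any draw equal to n would point one past the last
# row when interpreted as a 0-based position.
any(one_based == n)

# Converting back gives ordinary R subscripts again
head(mtcars$mpg[zero_based + 1])
```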

hadley · 2016-03-01T19:34:51Z

I now think this should go in a separate partition package.

hadley added the enhancement label Mar 17, 2014

hadley added this to the 0.3 milestone Mar 17, 2014

hadley modified the milestones: 0.3, 0.3.1 Aug 1, 2014

hadley self-assigned this Aug 1, 2014

hadley modified the milestones: 0.3.1, 0.4 Nov 18, 2014

hadley closed this as completed Mar 1, 2016

lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018
