Thinking about bootstrap grouping #269

Closed
hadley opened this Issue Feb 19, 2014 · 4 comments

4 participants
@hadley
Member

hadley commented Feb 19, 2014

bootstrap <- function(df, m) {
  n <- nrow(df)

  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE), 
    simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")

  df
}

library(dplyr)
mboot <- bootstrap(mtcars, 10)

# Works
mboot %.% summarise(mean(cyl))

# Not obvious what mutate, filter, arrange should do.
# Need to duplicate data. 
@romainfrancois

Member

romainfrancois commented Feb 19, 2014

Might be useful to have:

class(df) <- c("grouped_bootstrap_df", "grouped_df", "tbl_df", "tbl", "data.frame")

That way we could potentially do something different internally, although I guess we could use the vars and labels attributes for that too.

For filter, we could keep the data unchanged and update only the indices, group_sizes and biggest_group_size attributes. That way we could, conceptually, chain bootstrap, filter and summarise, should that be of any interest.
If we go this way, I guess we will need a materialise verb.

Alternatively, we could materialise directly, but then I'm not sure what the grouping would mean. We could materialise the data for each group adjacently in memory, which could be interesting too.

arrange already does not make much sense with simple grouping: at the moment, it ignores the grouping altogether.

mutate is, I think, the most difficult to make sense of in the bootstrap case. We have to assume that a column created by mutate in one group is independent of the other groups. We can't have indices pointing to positions of the new variable within a column of the same size as the original data frame.
I think we'd have to store variables coming from mutate in a list with as many elements as there are bootstrap samples. The problem is that this moves away from a data frame. Once again, materialise could transform this structure into an actual data frame.
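
For illustration, a materialise step along these lines could be sketched in base R. This is only a sketch under assumptions: materialise is a hypothetical helper (it is not a dplyr verb), it takes the index list directly rather than reading it from the grouped_df attributes, and it treats the indices as ordinary 1-based R row positions.

```r
# Hypothetical helper, not part of dplyr: expand a list of bootstrap
# row indices into one ordinary data frame with a `replicate` column.
# Assumes the indices are 1-based R row positions.
materialise <- function(df, indices) {
  pieces <- lapply(seq_along(indices), function(i) {
    # Duplicate the sampled rows and tag them with their replicate id.
    data.frame(replicate = i,
               df[indices[[i]], , drop = FALSE],
               row.names = NULL)
  })
  do.call(rbind, pieces)
}

set.seed(1)
n <- nrow(mtcars)
idx <- replicate(3, sample(n, replace = TRUE), simplify = FALSE)
boot_df <- materialise(mtcars, idx)

# Once the data is duplicated, per-replicate work is a plain grouped operation.
aggregate(cyl ~ replicate, data = boot_df, FUN = mean)
```

The cost is memory: the result has m times as many rows as the input, which is the trade-off against keeping the data unchanged and only storing indices.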

@hadley hadley added the enhancement label Mar 17, 2014

@hadley hadley added this to the 0.3 milestone Mar 17, 2014

@hadley hadley modified the milestones: 0.3, 0.3.1 Aug 1, 2014

@hadley hadley self-assigned this Aug 1, 2014

@nograpes

nograpes commented Sep 12, 2014

I'm not certain about this, but in answering this Stack Overflow question, I believe that I have discovered a bug in this function. Here are the steps to reproduce the bug:

library(dplyr)
mboot <- bootstrap(mtcars, 10)
bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Error: index out of bounds

And here is the proposed fix:

bootstrap <- function(df, m) {
  n <- nrow(df)

  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE), 
                                   simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(replicate)) # Change
  #  attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")

  df
}

Which fixes the above case:

bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Source: local data frame [6 x 2]
# Groups: replicate

#   replicate x
# 1         1 1
# 2         1 2
# 3         2 1
# 4         2 2
# 5         3 1
# 6         3 2
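
The comment above does not say why the change works. One plausible reading is that every entry of vars has to name a column of labels. The check below is a base R sketch under that assumption; metadata_is_consistent and make_meta are hypothetical helpers rather than dplyr functions, and they only inspect the attributes without involving dplyr at all.

```r
# Hypothetical check: do the grouping attributes agree with each other?
metadata_is_consistent <- function(df) {
  vars <- vapply(attr(df, "vars"), as.character, character(1))
  labels <- attr(df, "labels")
  indices <- attr(df, "indices")
  all(vars %in% names(labels)) &&                      # vars name label columns
    nrow(labels) == length(indices) &&                 # one label row per group
    all(attr(df, "group_sizes") == lengths(indices))   # sizes match indices
}

# Hypothetical constructor: same attributes as bootstrap(), with the
# grouping variable name passed in so both variants can be compared.
make_meta <- function(var_name, m = 3) {
  df <- mtcars
  n <- nrow(df)
  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE),
                                   simplify = FALSE)
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(as.name(var_name))
  df
}

metadata_is_consistent(make_meta("boot"))       # FALSE: no `boot` column in labels
metadata_is_consistent(make_meta("replicate"))  # TRUE
```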
@dgrtwo

Member

dgrtwo commented Sep 18, 2014

It looks like there's another small bug: since a grouped_df's indices are 0-indexed, not 1-indexed,

attr(df, "indices") <- replicate(m, sample(n, replace = TRUE), 
  simplify = FALSE)

should be

attr(df, "indices") <- replicate(m, sample(n, replace = TRUE) - 1, 
  simplify = FALSE)

Otherwise do leads to NAs in the output, as can be confirmed with

bootstrap(mtcars, 5) %>% do(data.frame(sum(is.na(.$mpg))))
Source: local data frame [5 x 2]
Groups: replicate

  replicate sum.is.na...mpg..
1         1                 0
2         2                 1
3         3                 1
4         4                 1
5         5                 1

and summarize returns 0 (or potentially arbitrary values), because the out-of-range index reads past the end of the column:

bootstrap(mtcars, 5) %>% summarize(min(mpg))
Source: local data frame [5 x 2]

  replicate min(mpg)
1         1      0.0
2         2     10.4
3         3      0.0
4         4      0.0
5         5     10.4
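
The off-by-one can be reproduced without dplyr. The sketch below assumes a consumer that treats each stored index as 0-based, which is modelled here by the hypothetical helper read_zero_based; it is not how dplyr is implemented, only an illustration of the arithmetic.

```r
set.seed(42)
n <- nrow(mtcars)

one_based  <- sample(n, replace = TRUE)      # values in 1..n
zero_based <- sample(n, replace = TRUE) - 1  # values in 0..(n - 1)

# Assumed model of a 0-based consumer: stored index i means R row i + 1.
read_zero_based <- function(x, idx) x[idx + 1]

# A stored index equal to n points one past the end of the column.
# R returns NA here; compiled code could read arbitrary memory instead.
read_zero_based(mtcars$mpg, n)

# After subtracting 1, every index is in range and no NA can appear.
anyNA(read_zero_based(mtcars$mpg, zero_based))

# Chance that a replicate draws the value n at least once, and is
# therefore affected when the indices are left 1-based.
1 - (1 - 1 / n)^n
```

Only replicates that happen to draw the value n are affected, which would be consistent with some rows in the outputs above looking correct while others do not.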

@hadley hadley modified the milestones: 0.3.1, 0.4 Nov 18, 2014

@hadley

Member

hadley commented Mar 1, 2016

I now think this should go in a separate partition package.

@hadley hadley closed this Mar 1, 2016

@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018
