Need to be able to sample groups #361

Closed
hadley opened this issue Mar 28, 2014 · 9 comments

hadley commented Mar 28, 2014

As well as individuals within groups

@hadley hadley added this to the v0.2 milestone Mar 28, 2014

hadley commented Mar 28, 2014

# Draw 5 species with replacement, weighted by total sepal length
species <- iris %>%
  group_by(Species) %>%
  summarise(wt = sum(Sepal.Length)) %>%
  sample_n(5, replace = TRUE, weight = wt) %>%
  select(-wt)

inner_join(species, iris)

@hadley hadley modified the milestones: 0.3, v0.2 Apr 7, 2014
@hadley hadley closed this as completed Jul 28, 2014

rcorty commented Apr 21, 2015

I wonder why this was closed? It seems like a potentially useful feature, e.g. to pick all the data from a random species:

iris %>%
    group_by(Species) %>%
    sample_n(1)

MarcusWalz commented Jun 29, 2016

I don't think that sample_n's behavior should change for groups, because sampling within groups is its intuitive behavior. However, it's often handy to be able to sample groups as a whole. This should be a second function. Here is my implementation:

sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
   # regroup when done
   grps = tbl %>% groups %>% unlist %>% as.character
   # check length of groups non-zero
   keep = tbl %>% summarise() %>% sample_n(size, replace, weight)
   # keep only selected groups, regroup because joins change count.
   # regrouping may be unnecessary but joins do something funky to grouping variable
   tbl %>% semi_join(keep) %>% group_by_(grps) 
}

The example by @rcorty works just as expected:

iris %>% group_by(Species) %>% sample_n_groups(1)

kendonB commented Jul 4, 2016

+1

drhagen commented Aug 30, 2016

Edit: A change to dplyr broke this solution; scroll down for an updated version.


For those of you who arrived here via search engine looking for this functionality, the implementation by @MarcusWalz does not sample with replacement when replace = TRUE. The implementation needs to use right_join (or left_join or inner_join) to keep the duplicates:

sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
  # regroup when done
  grps = tbl %>% groups %>% unlist %>% as.character
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(grps) 
}
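
The difference between the two joins can be seen on a toy table. This is only a minimal sketch: `tbl` and `keep` are made-up data, with `keep` standing in for a sample in which group "a" was drawn twice. Recent dplyr releases may warn about a many-to-many relationship in the second join.

```r
library(dplyr)

# Toy data (hypothetical): two groups with two rows each.
tbl <- tibble(g = c("a", "a", "b", "b"), x = 1:4)

# Stand-in for a sample of groups in which "a" was drawn twice.
keep <- tibble(g = c("a", "a"))

# semi_join only filters tbl, so the duplicate draw is lost.
n_semi <- nrow(semi_join(tbl, keep, by = "g"))

# right_join returns the group's rows once per draw.
n_right <- nrow(right_join(tbl, keep, by = "g"))
```

Here `n_semi` is 2 and `n_right` is 4, which is why the join type matters when replace = TRUE.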

kendonB commented Dec 13, 2016

Cluster bootstrapping is a common use case for this feature.
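
For readers unfamiliar with the term, a cluster bootstrap resamples whole groups with replacement and recomputes a statistic on each resample. A rough sketch in base R follows; the data, the column names `cluster` and `y`, and the helper `boot_mean` are all made up for illustration and are not part of dplyr.

```r
set.seed(42)

# Hypothetical clustered data: 4 clusters of 5 observations each.
dat <- data.frame(cluster = rep(1:4, each = 5), y = rnorm(20))

# One bootstrap replicate: draw whole clusters with replacement,
# then compute the mean of y over all rows of the drawn clusters.
boot_mean <- function(data) {
  rows_by_cluster <- split(seq_len(nrow(data)), data$cluster)
  drawn <- sample(names(rows_by_cluster), replace = TRUE)
  idx <- unlist(rows_by_cluster[drawn], use.names = FALSE)
  mean(data$y[idx])
}

# Bootstrap standard error of the mean from 200 replicates.
boot_means <- replicate(200, boot_mean(dat))
se <- sd(boot_means)
```

A group-sampling function with replace = TRUE would replace the body of `boot_mean` with a single pipeline step.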

kendonB commented Dec 13, 2016

@drhagen, in your implementation, do you have any suggestions for how to generate a new unique group id?

kendonB commented Dec 13, 2016

Actually, this is quite easy:

sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
  # regroup when done
  grps = tbl %>% groups %>% unlist %>% as.character
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% sample_n(size, replace, weight) %>% 
    mutate(unique_id = 1:NROW(.))
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(grps) 
}

kendonB commented Mar 6, 2017

The answer above by @drhagen looks like it's out of date. This seems to work now:

sample_n_groups = function(tbl, size, replace = FALSE, weight = NULL) {
  # regroup when done
  grps = tbl %>% groups %>% lapply(as.character) %>% unlist
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% ungroup() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(.dots = grps)
}
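
For anyone landing here with a newer dplyr, where group_by_ is deprecated, something along these lines might work instead. It is an untested-in-production sketch that assumes dplyr 1.0.0 or later (for slice_sample and across), drops the weight argument for brevity, and borrows the unique-id idea from above under the made-up column name `draw_id`. When a group is drawn more than once, recent dplyr releases may warn about a many-to-many join.

```r
library(dplyr)

# Hypothetical modernised helper; `draw_id` is an illustrative column name.
sample_n_groups <- function(tbl, size, replace = FALSE) {
  # names of the grouping variables
  grps <- group_vars(tbl)
  stopifnot(length(grps) > 0)
  # one row per group, then sample whole groups and number the draws
  keep <- group_keys(tbl) %>%
    slice_sample(n = size, replace = replace) %>%
    mutate(draw_id = row_number())
  # pull in every row of each drawn group, once per draw, and regroup
  keep %>%
    inner_join(ungroup(tbl), by = grps) %>%
    group_by(across(all_of(c("draw_id", grps))))
}

set.seed(1)
one_species <- iris %>% group_by(Species) %>% sample_n_groups(1)
boot <- iris %>% group_by(Species) %>% sample_n_groups(5, replace = TRUE)
```

Grouping by `draw_id` as well as the original variables keeps repeated draws of the same group apart, which is usually what a bootstrap needs.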

@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018