more group-based splitting methods #207

Closed
topepo opened this issue Jan 14, 2021 · 16 comments · Fixed by #316
Labels
feature a feature request or enhancement

Comments

Member

@topepo topepo commented Jan 14, 2021

It would be good to have an initial_group_split(data, group, strata, prop) method that can split the data when there are groups (perhaps patients). The strata option might be difficult when the outcome (or other stratification variable) is not constant within each group. We could also use the median or mode on the stratification variable and use that.

Similarly, an mc_group_cv() function would also be a good idea (using the same splitting method as above).
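To make the request concrete, here is a minimal base R sketch of what splitting by group means: whole groups are sampled rather than individual rows. It is only an illustration under assumptions, not the rsample implementation; the toy data frame, the `patient` column, and the 3/4 proportion are all made up for the example.

```r
# Toy repeated-measures data: 10 hypothetical patients, 3 rows each
set.seed(207)
dat <- data.frame(
  patient = rep(sprintf("p%02d", 1:10), each = 3),
  y = rnorm(30)
)

prop <- 3/4
groups <- unique(dat$patient)

# Sample whole groups, not rows
n_train <- floor(prop * length(groups))
train_groups <- sample(groups, size = n_train)

train <- dat[dat$patient %in% train_groups, ]
test  <- dat[!dat$patient %in% train_groups, ]

# Number of patients that appear on both sides of the split
n_shared <- length(intersect(train$patient, test$patient))
n_shared
```

Because the sampling unit is the patient, every row for a given patient lands on the same side of the split, which is the property a grouped initial split would need to guarantee.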

@topepo topepo added the feature a feature request or enhancement label Jan 14, 2021

@hnagaty hnagaty commented Apr 4, 2021

I second this too. I came across a data set that has a grouping variable. I already used group_vfold_cv() for the cross validation resamples, and it would be helpful to have a similar approach to the initial split.
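For readers who have not used it, a short sketch of group_vfold_cv() on toy data follows. The data frame and the `subject` column are invented for illustration; only group_vfold_cv(), analysis(), and assessment() come from rsample.

```r
library(rsample)

# Toy data: 6 hypothetical subjects with 4 rows each
dat <- data.frame(
  subject = rep(letters[1:6], each = 4),
  x = seq_len(24)
)

set.seed(1)
folds <- group_vfold_cv(dat, group = subject, v = 3)

# For each fold, count subjects that appear in both analysis and assessment sets
overlaps <- vapply(folds$splits, function(s) {
  length(intersect(analysis(s)$subject, assessment(s)$subject))
}, integer(1))
overlaps
```

A grouped initial split would apply the same no-overlap guarantee to a single training/testing partition.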

Contributor

@rkb965 rkb965 commented Dec 3, 2021

This would be a great idea. :) I'm trying to bootstrap with repeated measures data and want all measurements from an individual to be in the same split. It seems like this grouped initial_split would enable this, correct? Do y'all have any ideas for work arounds in the meantime? Thanks for all of the incredible work on tidymodels!

Member

@juliasilge juliasilge commented Dec 3, 2021

Yes @rkb965 in the meantime, you can create a custom split with make_splits(). Check out:

for some examples on how to use it.
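As a rough sketch of that workaround (not necessarily what the linked examples show), one can sample group IDs by hand and pass the resulting row indices to make_splits(). The toy data, the `id` column, and the 0.75 proportion are assumptions made for illustration.

```r
library(rsample)

# Toy data: 8 hypothetical individuals with 5 measurements each
set.seed(42)
dat <- data.frame(
  id = rep(1:8, each = 5),
  value = rnorm(40)
)

# Choose which individuals go to the training side
ids <- unique(dat$id)
train_ids <- sample(ids, size = floor(0.75 * length(ids)))

# Build an rsplit from explicit row indices
custom_split <- make_splits(
  list(
    analysis = which(dat$id %in% train_ids),
    assessment = which(!dat$id %in% train_ids)
  ),
  data = dat
)

train <- analysis(custom_split)
test <- assessment(custom_split)
```

The resulting object can be used like any other split: analysis() and assessment() return the two partitions, and no individual is shared between them.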


@astamm astamm commented Jan 20, 2022

Assuming that the stratification variable is constant within each group, would something like this make sense:

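library(rsample)
library(tidyr)
library(rlang)

group_initial_split <- function(data, group, prop = 3/4, strata = NULL, breaks = 4, pool = 0.1, ...) {
  strata <- enquo(strata)
  group <- enquo(group)
  # Collapse to one row per group (and stratum) so that whole groups are sampled
  data <- nest(data, data = !c(!!group, !!strata))
  # Index the nested rows so training groups can be matched after unnesting
  data$.group_row <- seq_len(nrow(data))
  cell_split <- initial_split(data, prop = prop, strata = !!strata, breaks = breaks, pool = pool, ...)
  train_id <- data$.group_row[cell_split$in_id]
  # Expand back to the original rows and recompute the training indices
  cell_split$data <- unnest(cell_split$data, cols = data)
  cell_split$in_id <- which(cell_split$data$.group_row %in% train_id)
  cell_split
}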

Member

@juliasilge juliasilge commented Jan 25, 2022

We'll want to add group as an argument to mc_cv(), initial_split(), and vfold_cv(). (Nothing else, right?)

Since initial_split() uses mc_cv() internally, let's make the changes in mc_cv(), specifically mc_splits() (along with vfold_splits()).


@MSHelm MSHelm commented Mar 31, 2022

Having the group argument for validation_split() would also be very helpful for my case. Then I could use it directly for hyperparameter tuning with, e.g., tune_grid().

Contributor

@mattwarkentin mattwarkentin commented Apr 5, 2022

Related: #284

Contributor

@mikemahoney218 mikemahoney218 commented Jun 7, 2022

From a conversation with Julia (where we talked about this in reference to spatialsample), it seems like adding _group_ functions would be better than adding group arguments. You can't specify both group and strata (plus pool and breaks), which introduces some nasty dependencies between arguments. Separate functions avoid the issue.

Contributor

@mikemahoney218 mikemahoney218 commented Jun 23, 2022

> We'll want to add group as an argument to mc_cv(), initial_split(), and vfold_cv(). (Nothing else, right?)

Looks like there's a desire for bootstraps() as well:
https://community.rstudio.com/t/how-to-use-rsample-for-multilevel-resampling/140337

Contributor

@mattwarkentin mattwarkentin commented Jun 23, 2022

@mikemahoney218 Personally, I would love to see as many of the current resampling methods as possible gain the ability to respect the hierarchical structure while resampling. Based on a quick skim, this seems like it would include: initial_split(), validation_split(), bootstraps(), vfold_cv(), mc_cv(), and perhaps the rolling or sliding resampling methods.

I think group_vfold_cv() is already a group-aware version of loo_cv(), if my thinking is correct.
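A small sketch to check that reading: when v is left at its default, group_vfold_cv() should create one fold per group, so each assessment set holds exactly one group (leave-one-group-out). The toy data and the `subject` column are made up for illustration.

```r
library(rsample)

# Toy data: 6 hypothetical subjects with 4 rows each
dat <- data.frame(
  subject = rep(letters[1:6], each = 4),
  x = seq_len(24)
)

# v is not supplied, so it defaults to the number of unique groups
logo <- group_vfold_cv(dat, group = subject)

# Number of distinct subjects held out in each fold
held_out <- vapply(logo$splits, function(s) {
  length(unique(assessment(s)$subject))
}, integer(1))
held_out
```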

Contributor

@mikemahoney218 mikemahoney218 commented Jun 30, 2022

Howdy folks -- just to update the thread, we've implemented group_initial_split(), group_validation_split(), group_bootstraps(), group_vfold_cv() and group_mc_cv(). They're currently only in the development version of rsample (so the main branch of this repo), which will be the version after 1.0.0. If you have a use case for a grouped variant of another resampling function, please open a new issue with a description of what you'd expect that function to do!

As for stratification of grouped resamples, we've opened a new issue (#317) to try and collect some opinions on what people would expect stratification with groups to do. If you've got thoughts, please let us know over there!
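For anyone trying these out, a hedged usage sketch of three of the grouped functions named in this update follows. It assumes an rsample version newer than 1.0.0 is installed; the toy data frame and the `subject` column are invented for the example, and no stratification is used.

```r
library(rsample)  # assumes a version newer than 1.0.0

# Toy data: 20 hypothetical subjects with 3 rows each
set.seed(316)
dat <- data.frame(
  subject = rep(sprintf("s%02d", 1:20), each = 3),
  y = rnorm(60)
)

# Grouped initial split: all rows for a subject go to one side
split <- group_initial_split(dat, group = subject, prop = 3/4)
train <- training(split)
test <- testing(split)
n_shared <- length(intersect(train$subject, test$subject))
n_shared

# Grouped bootstrap and Monte Carlo resamples
boots <- group_bootstraps(dat, group = subject, times = 5)
mc <- group_mc_cv(dat, group = subject, prop = 3/4, times = 5)
```

The `prop` argument is applied with groups as the sampling unit, so the share of rows in the training set may differ somewhat from `prop` when group sizes vary.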


@Jeffrothschild Jeffrothschild commented Jul 13, 2022

@mikemahoney218 this is awesome and just what I need, thank you! One thing I noticed, though, is that set.seed() doesn't seem to be applied to the splits.

Is it possible to have set.seed() make group_initial_split() separate things in the same way each time?

Here is an example. It does seem to stay the same when repeating the process in the same session, but if you restart RStudio you'll see different colors each time.

library(tidyverse)
library(rsample)

df <- starwars %>%
  mutate(name = factor(name))

set.seed(3332)
group_split <- group_initial_split(df, group = name)
group_train <- training(group_split)
group_test <- testing(group_split)

group_train %>% select(mass, name) %>% mutate(group = "train") %>%
  bind_rows(group_test %>% select(mass, name) %>% mutate(group = "test")) %>%
  ggplot(aes(mass, name, color = group)) +
  geom_point()

Thanks

Contributor

@mikemahoney218 mikemahoney218 commented Jul 13, 2022

> Here is an example. It does seem to stay the same when repeating the process in the same session, but if you restart RStudio you'll see different colors each time.

Opened #342 (and #343) for this. Thanks for the report!


@github-actions github-actions bot commented Jul 28, 2022

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 28, 2022