[feature request] multiple strata in cross-validation #109

Closed
fusaroli opened this issue Aug 16, 2019 · 5 comments

@fusaroli

Let's imagine we have a classification problem (presence or absence of a diagnosis) and a dataset with repeated measures (multiple rows per participant).
It would be useful to be able to indicate both group and stratum (if I get the terminology right) in cross-validation. In other words, I want to make sure that all datapoints from a given participant are in the same fold and that each fold contains an approximately balanced number of participants with and without the diagnosis.
Am I missing an obvious way of doing this?

@juliasilge
Member

We believe the best way to create this kind of strata is to create them explicitly yourself, typically using mutate() and something like paste(). For example, you can paste your two groupings together and use the result as your strata:

library(rsample)
#> Loading required package: tidyr
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

mtcars %>%
  mutate(new_strata = paste(vs, gear, sep = "_")) %>%
  bootstraps(strata = new_strata)
#> # Bootstrap sampling using stratification 
#> # A tibble: 25 x 2
#>    splits          id         
#>    <named list>    <chr>      
#>  1 <split [32/8]>  Bootstrap01
#>  2 <split [32/10]> Bootstrap02
#>  3 <split [32/11]> Bootstrap03
#>  4 <split [32/11]> Bootstrap04
#>  5 <split [32/11]> Bootstrap05
#>  6 <split [32/14]> Bootstrap06
#>  7 <split [32/14]> Bootstrap07
#>  8 <split [32/14]> Bootstrap08
#>  9 <split [32/9]>  Bootstrap09
#> 10 <split [32/13]> Bootstrap10
#> # … with 15 more rows

Created on 2020-05-01 by the reprex package (v0.3.0)

@dhbrand

dhbrand commented May 7, 2020

I'm not sure I agree with the response. How would creating another stratification variable solve the issue with grouping as well? I feel like the OP is asking for vfold_cv and group_vfold_cv to be combined. I have run into this issue many times. Observations within a group are not independent and should not be split across folds, while the levels of a strata variable, which are independent, should be spread within each fold. Can you tell me what I'm missing?

@juliasilge
Member

In that case, have you tried out "double resampling" using nested_cv()?

library(rsample)
double_resampled <- nested_cv(mtcars,
                              group_vfold_cv(group = gear),
                              vfold_cv(strata = vs))

double_resampled$splits[[2]]
#> <Training/Validation/Total>
#> <20/12/32>
double_resampled$inner_resamples[[2]]
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 x 2
#>    splits         id    
#>    <named list>   <chr> 
#>  1 <split [17/3]> Fold01
#>  2 <split [17/3]> Fold02
#>  3 <split [17/3]> Fold03
#>  4 <split [17/3]> Fold04
#>  5 <split [18/2]> Fold05
#>  6 <split [18/2]> Fold06
#>  7 <split [19/1]> Fold07
#>  8 <split [19/1]> Fold08
#>  9 <split [19/1]> Fold09
#> 10 <split [19/1]> Fold10

Created on 2020-05-07 by the reprex package (v0.3.0)

You can read more about using nested resampling for model evaluation here.
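
As a rough check on the output above, the `<20/12/32>` outer split is consistent with one whole `gear` level being held out per outer fold. This sketch uses only base R and the built-in mtcars data; it assumes that group_vfold_cv() creates one fold per group when `v` is not supplied, and it does not call rsample itself.

```r
# Count the rows at each gear level in the built-in mtcars data.
gear_counts <- table(mtcars$gear)

# Rows left for analysis when each gear level is held out in turn
# (assumption: one outer fold per gear level).
analysis_rows <- nrow(mtcars) - gear_counts

print(rbind(held_out = gear_counts, analysis = analysis_rows))
```

Holding out gear == 4 leaves 20 rows for analysis and 12 for assessment, which matches the second outer split shown above. Note that the inner stratified folds are then built only from those 20 rows.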

@topepo
Member

topepo commented May 7, 2020

We've had requests to be able to pass multiple strata variables to the strata argument.

  • I'm opposed to doing this since that feature in caret has caused many errors, because people often end up with many strata that have very low frequency. It basically can't be done.

  • Our response has been "make a combined strata variable that is not sparse and use the usual tools".
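
One way to see whether a combined strata variable is sparse is to tabulate it before resampling. The sketch below uses base R on the built-in mtcars data with the same `vs` and `gear` columns as the earlier example; the threshold of 5 rows is an arbitrary assumption for illustration, not a rule from rsample.

```r
# Build the combined strata variable from two columns of mtcars.
combined <- paste(mtcars$vs, mtcars$gear, sep = "_")

# Count how many rows fall into each combined stratum.
counts <- table(combined)
print(counts)

# Flag strata with fewer rows than a chosen threshold
# (5 is an arbitrary assumption; pick what suits your data).
sparse <- names(counts)[counts < 5]
print(sparse)
```

If several strata are flagged, collapsing levels before pasting them together is one option for getting a variable that is not sparse.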

Stratification and grouping are two different things. Stratification is related to how to balance a split within a pre-defined resampling scheme (e.g. cross-validation). Grouped resampling uses the groups to define the scheme. "Groupings" here should correspond to your independent experimental unit.

It would be possible for us to try to balance some constant stratification column in the data. I think your concern is that, for some categorical outcome, you want to make sure that the X participants left out in a fold are not all from the same class (where class is constant across each participant's data points). Let me know if I got this right.

  • I say "possible" because it would only be helpful when you have a very large number of groupings. Say you have 10 participants and use v = 8. There will be two folds with two participants and the others will have one. Stratifying this is very difficult to do.

  • Since this computation is not always feasible, I'd rather you define your grouping variable (that may include multiple participants) and use the current tools. This might sound like a cop-out/punting, but I've learned not to implement features unless they are almost always possible.
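
To make the 10-participant, v = 8 case above concrete, here is a minimal base R sketch that assigns whole participants to folds while spreading classes across folds as evenly as the counts allow. The helper `assign_group_folds()` and the simulated data are hypothetical and are not part of rsample; the sketch assumes the class is constant within each participant.

```r
# Hypothetical helper, not part of rsample: assign whole participants to folds,
# spreading each class across folds as evenly as the counts allow.
assign_group_folds <- function(data, group, strata, v) {
  # One row per participant; assumes the class is constant within a participant.
  units <- unique(data[, c(group, strata)])
  stopifnot(!anyDuplicated(units[[group]]))
  # Shuffle, then order by class so consecutive participants share a class.
  units <- units[sample(nrow(units)), , drop = FALSE]
  units <- units[order(units[[strata]]), , drop = FALSE]
  # Deal participants into folds round-robin, like dealing cards.
  units$fold <- rep_len(seq_len(v), nrow(units))
  # Map the participant-level fold back to every row of the data.
  units$fold[match(data[[group]], units[[group]])]
}

set.seed(109)
# Simulated data: 10 participants, 3 rows each, 5 with and 5 without a diagnosis.
dat <- data.frame(
  id = rep(sprintf("p%02d", 1:10), each = 3),
  diagnosis = rep(rep(c("yes", "no"), each = 5), each = 3)
)
dat$fold <- assign_group_folds(dat, group = "id", strata = "diagnosis", v = 8)

# Number of distinct folds each participant appears in (should all be 1).
folds_per_id <- tapply(dat$fold, dat$id, function(f) length(unique(f)))
# Number of distinct participants held out in each fold.
ids_per_fold <- tapply(dat$id, dat$fold, function(i) length(unique(i)))
print(ids_per_fold)
```

With 10 participants and v = 8, two folds hold two participants and six hold one. A fold that holds a single participant can only contain one class, which is why stratifying is hard when the number of groups is small.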

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 21, 2021