[feature request] multiple strata in cross-validation #109

fusaroli · 2019-08-16T14:48:26Z

Let's imagine we have a classification problem (presence or absence of a diagnosis) and a dataset with repeated measures (multiple rows per participant).
It would be useful to be able to indicate both group and stratum (if I get the terminology right) in cross-validation. In other words, to make sure that all datapoints from a given participant are in the same fold and that an approximately balanced number of positive and negative diagnosis participants are in each fold.
Am I missing an obvious way of doing this?

juliasilge · 2020-05-01T15:40:52Z

We believe the best way to create this kind of strata is to explicitly create them yourself, typically using mutate() and something like paste(). For example, you can paste your two groupings together and use that as your strata:

library(rsample)
#> Loading required package: tidyr
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

mtcars %>%
  mutate(new_strata = paste(vs, gear, sep = "_")) %>%
  bootstraps(strata = new_strata)
#> # Bootstrap sampling using stratification 
#> # A tibble: 25 x 2
#>    splits          id         
#>    <named list>    <chr>      
#>  1 <split [32/8]>  Bootstrap01
#>  2 <split [32/10]> Bootstrap02
#>  3 <split [32/11]> Bootstrap03
#>  4 <split [32/11]> Bootstrap04
#>  5 <split [32/11]> Bootstrap05
#>  6 <split [32/14]> Bootstrap06
#>  7 <split [32/14]> Bootstrap07
#>  8 <split [32/14]> Bootstrap08
#>  9 <split [32/9]>  Bootstrap09
#> 10 <split [32/13]> Bootstrap10
#> # … with 15 more rows

Created on 2020-05-01 by the reprex package (v0.3.0)

dhbrand · 2020-05-07T20:24:51Z

I'm not sure I agree with the response. How would creating another stratification variable solve the issue with grouping as well? I feel like the OP is asking for vfold_cv and group_vfold_cv to be combined. I have run into this issue many times. Group values are non-independent and should not be split within each fold, while strata variables, which are independent, should be split within each fold. Can you tell me what I'm missing?

juliasilge · 2020-05-07T20:42:30Z

In that case, have you tried out "double resampling" using nested_cv()?

library(rsample)
double_resampled <- nested_cv(mtcars,
                              group_vfold_cv(group = gear),
                              vfold_cv(strata = vs))

double_resampled$splits[[2]]
#> <Training/Validation/Total>
#> <20/12/32>
double_resampled$inner_resamples[[2]]
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 x 2
#>    splits         id    
#>    <named list>   <chr> 
#>  1 <split [17/3]> Fold01
#>  2 <split [17/3]> Fold02
#>  3 <split [17/3]> Fold03
#>  4 <split [17/3]> Fold04
#>  5 <split [18/2]> Fold05
#>  6 <split [18/2]> Fold06
#>  7 <split [19/1]> Fold07
#>  8 <split [19/1]> Fold08
#>  9 <split [19/1]> Fold09
#> 10 <split [19/1]> Fold10

Created on 2020-05-07 by the reprex package (v0.3.0)

You can read more about using nested resampling for model evaluation here.

topepo · 2020-05-07T21:18:13Z

We've had requests to be able to pass multiple strata variables in the strata column.

I'm opposed to doing this since that feature in caret has caused many errors, because people often have many strata with very low frequency. It basically can't be done.
Our response has been "make a combined strata variable that is not sparse and use the usual tools".
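
To make the sparsity concern concrete, here is a minimal sketch in base R (an illustration added for readers, not part of the original comment). It counts the rows in each combined vs/gear stratum of mtcars and flags the strata with fewer rows than folds. The choice of v = 10 and the variable names are assumptions made for the example.

```r
# Combine two candidate stratification columns into one label per row.
combined <- paste(mtcars$vs, mtcars$gear, sep = "_")

# Count how many rows fall into each combined stratum.
strata_counts <- table(combined)
print(strata_counts)

# Flag strata with fewer rows than folds (v = 10 is an illustrative choice):
# these cannot contribute a row to every fold.
v <- 10
sparse <- names(strata_counts)[strata_counts < v]
print(sparse)
```

Several of the six combinations have only a handful of rows, which is the situation where stratifying on the combined variable stops being meaningful.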

Stratification and grouping are two different things. Stratification is related to how to balance a split within a pre-defined resampling scheme (e.g. cross-validation). Grouped resampling uses the groups to define the scheme. "Groupings" here should correspond to your independent experimental unit.

It would be possible for us to try to balance some constant stratification column in the data. I think your concern is that, for some categorical outcome, you want to make sure that the X participants left out in a fold are not all from the same class (where class is constant across each participant's data points). Let me know if I got this right.

I say "possible" because it would only be helpful when you have a very large number of groupings. Say you have 10 participants and use v = 8. There will be two folds with two participants and the others will have one. Stratifying this is very difficult to do.
Since this computation is not always feasible, I'd rather you define your grouping variable (that may include multiple participants) and use the current tools. This might sound like a cop-out/punting, but I've learned not to implement features unless they are almost always possible.
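
One way to read "define your grouping variable (that may include multiple participants)" is sketched below in base R. This is an illustration under stated assumptions, not a method given in the thread: the data frame, its columns (id, diagnosis, score), and v = 5 are all hypothetical. Participants are dealt out to folds within each diagnosis class, and the resulting fold label is the grouping variable.

```r
set.seed(109)

# Hypothetical repeated-measures data: 20 participants, 3 rows each,
# with a diagnosis that is constant within each participant.
participants <- data.frame(
  id = sprintf("p%02d", 1:20),
  diagnosis = rep(c("neg", "pos"), each = 10)
)
dat <- participants[rep(seq_len(nrow(participants)), each = 3), ]
dat$score <- rnorm(nrow(dat))

v <- 5  # illustrative number of folds

# Work at the participant level: shuffle within each diagnosis class,
# then deal participants out to folds round-robin so classes stay balanced.
participants$fold <- NA_integer_
for (cls in unique(participants$diagnosis)) {
  idx <- which(participants$diagnosis == cls)
  idx <- idx[sample.int(length(idx))]
  participants$fold[idx] <- rep(seq_len(v), length.out = length(idx))
}

# Carry the participant-level fold label back to every row.
dat$fold <- participants$fold[match(dat$id, participants$id)]

# Cross-tabulate participants by fold and diagnosis to check the balance.
print(table(participants$fold, participants$diagnosis))
```

The fold column could then be supplied as the grouping variable, for example group_vfold_cv(dat, group = fold), so that all rows from a participant stay together. With few participants per class the balance is only approximate, which is the feasibility concern raised above.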

github-actions · 2021-02-21T00:56:44Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

juliasilge closed this as completed May 1, 2020

github-actions bot locked and limited conversation to collaborators Feb 21, 2021
