[feature request] multiple strata in cross-validation #109
Comments
We believe the best way to create this kind of strata is to create them explicitly yourself, typically with mutate() before resampling:
library(rsample)
#> Loading required package: tidyr
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mtcars %>%
  # paste() takes sep; paste0() would append "_" as an extra string instead
  mutate(new_strata = paste(vs, gear, sep = "_")) %>%
  bootstraps(strata = new_strata)
#> # Bootstrap sampling using stratification
#> # A tibble: 25 x 2
#>    splits          id
#>    <named list>    <chr>
#>  1 <split [32/8]>  Bootstrap01
#>  2 <split [32/10]> Bootstrap02
#>  3 <split [32/11]> Bootstrap03
#>  4 <split [32/11]> Bootstrap04
#>  5 <split [32/11]> Bootstrap05
#>  6 <split [32/14]> Bootstrap06
#>  7 <split [32/14]> Bootstrap07
#>  8 <split [32/14]> Bootstrap08
#>  9 <split [32/9]>  Bootstrap09
#> 10 <split [32/13]> Bootstrap10
#> # … with 15 more rows
Created on 2020-05-01 by the reprex package (v0.3.0)
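Before stratifying on a combined column like this, it may help to check how many rows fall into each combination, since very small cells are hard to balance across resamples. A minimal sketch, assuming the same built-in mtcars data; the names cell_sizes and n_rows are introduced here for illustration only:

```r
library(dplyr)

# Count rows in each vs/gear combination of the built-in mtcars data.
cell_sizes <- mtcars %>%
  count(vs, gear, name = "n_rows")

# Sort so the smallest cells, which are hardest to spread evenly
# across resamples, appear first.
cell_sizes %>% arrange(n_rows)
```

If some combinations hold only one or two rows, collapsing rare levels before building the strata column is one option to consider.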
I'm not sure I agree with the response. How would creating another stratification variable solve the issue with grouping as well? I feel like the OP is asking for vfold_cv and group_vfold_cv to be combined. I have run into this issue many times. Observations within a group are not independent and should not be split across folds, whereas strata variables are independent and should be balanced across folds.
In that case, have you tried out "double resampling" using nested_cv()?
library(rsample)
double_resampled <- nested_cv(mtcars,
                              group_vfold_cv(group = gear),
                              vfold_cv(strata = vs))
double_resampled$splits[[2]]
#> <Training/Validation/Total>
#> <20/12/32>
double_resampled$inner_resamples[[2]]
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 x 2
#>    splits         id
#>    <named list>   <chr>
#>  1 <split [17/3]> Fold01
#>  2 <split [17/3]> Fold02
#>  3 <split [17/3]> Fold03
#>  4 <split [17/3]> Fold04
#>  5 <split [18/2]> Fold05
#>  6 <split [18/2]> Fold06
#>  7 <split [19/1]> Fold07
#>  8 <split [19/1]> Fold08
#>  9 <split [19/1]> Fold09
#> 10 <split [19/1]> Fold10
Created on 2020-05-07 by the reprex package (v0.3.0)
You can read more about using nested resampling for model evaluation here.
We've had requests to be able to pass multiple strata variables in the
Stratification and grouping are two different things. Stratification concerns how to balance a split within a pre-defined resampling scheme (e.g. cross-validation). Grouped resampling uses the groups to define the scheme. "Groupings" here should correspond to your independent experimental unit. It would be possible for us to try to balance some constant stratification column in the data. I think your concern is that, for some categorical outcome, you want to make sure that the X participants left out in a fold are not all from the same class (where class is constant across each participant's data points). Let me know if I got this right.
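To make the distinction concrete, here is a minimal sketch of balancing a constant stratification column while keeping groups intact. It does not use rsample and is not the package's implementation; it only uses dplyr to assign folds at the participant level within each class. The data set dat, its columns (participant, diagnosis, score), and the fold count v are all hypothetical and invented for this illustration.

```r
library(dplyr)

# Hypothetical repeated-measures data: 20 participants, 3 rows each,
# with a diagnosis that is constant within each participant.
set.seed(109)
dat <- tibble(
  participant = rep(sprintf("p%02d", 1:20), each = 3),
  diagnosis   = rep(rep(c("positive", "negative"), times = c(8, 12)), each = 3),
  score       = rnorm(60)
)

v <- 4  # number of folds (an arbitrary choice for this sketch)

# Assign folds to participants, not rows, and do it separately within
# each diagnosis class so every fold gets a similar share of each class.
fold_key <- dat %>%
  distinct(participant, diagnosis) %>%
  group_by(diagnosis) %>%
  mutate(fold = sample(rep(seq_len(v), length.out = n()))) %>%
  ungroup()

# Join the fold back onto the rows: all rows of a participant share one fold.
dat_folds <- left_join(dat, fold_key, by = c("participant", "diagnosis"))

# Number of participants per fold and class.
fold_key %>% count(fold, diagnosis)
```

Each value of fold can then serve as the held-out set in turn. The approach only works when the stratification column is constant within each group, which is the situation described above.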
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Let's imagine we have a classification problem (presence or absence of a diagnosis) and a dataset with repeated measures (multiple rows per participant).
It would be useful to be able to indicate both group and stratum (if I get the terminology right) in cross-validation. In other words, to make sure that all data points from a given participant are in the same fold and that each fold contains an approximately balanced number of participants with positive and negative diagnoses.
Am I missing an obvious way of doing this?