splits based on twinning #371

Open
topepo opened this issue Nov 8, 2022 · 1 comment
Labels
feature: a feature request or enhancement

Comments

topepo (Member) commented Nov 8, 2022

Twinning is a tool for creating splits in which the marginal distributions of the variables are as close as possible. See this paper (pdf) for more details.
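To make the idea concrete, here is a rough Python sketch of a twinning-style split. It is a simplified illustration, not the paper's reference implementation or the R package's API: the function name `twin_like_split`, its arguments, and the brute-force nearest-neighbour search are all assumptions made for this example. It also assumes the columns are numeric and already on comparable scales.

```python
import math
import random


def twin_like_split(rows, r, start=0):
    """Return sorted indices of the smaller subset (about len(rows) / r points).

    Hypothetical helper: repeatedly takes a point plus its r - 1 nearest
    remaining neighbours, sends the point to the smaller subset and the
    neighbours to the larger one.
    """
    if r < 2:
        raise ValueError("r must be an integer of at least 2")
    remaining = set(range(len(rows)))
    smaller = []
    current = start
    while remaining:
        # The r - 1 remaining points closest to the current point.
        others = sorted(
            remaining - {current},
            key=lambda i: math.dist(rows[i], rows[current]),
        )[: r - 1]
        smaller.append(current)
        remaining -= {current, *others}
        if remaining:
            # Continue from the remaining point nearest to the farthest
            # neighbour of the block that was just removed.
            anchor = others[-1]
            current = min(remaining, key=lambda i: math.dist(rows[i], rows[anchor]))
    return sorted(smaller)


random.seed(1)
rows = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
test_idx = twin_like_split(rows, r=5)
print(len(test_idx))  # 20: one point from each block of five
```

Because every point in the smaller subset is drawn from a tight local neighbourhood, both subsets end up covering the same regions of the data, which is what keeps the per-variable distributions close. The brute-force search here is quadratic in the number of rows, so it is only suitable for small illustrations.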

There is an R package that can be used.

We could use this in vfold_cv() as well as initial_split() and mc_cv().

hfrick (Member) commented Jul 21, 2023

Adding some notes from a Slack discussion:

  • If this also applies to the group_*() versions but not the time_*() versions, it would be better to expose it as an argument rather than as new functions.
  • We could use this as a multi-variable strata solution: if someone passes a single column to strata, the splits are made as they are now. If 2+ columns are passed, we could use twinning to do the stratified splits.
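The dispatch rule in the second note could look roughly like the Python sketch below. The function `pick_split_method` and the strategy labels it returns are hypothetical and exist only to illustrate the proposed behaviour; nothing here is part of the package.

```python
def pick_split_method(strata=None):
    """Map a `strata` argument to a strategy label (all names are hypothetical)."""
    if strata is None:
        columns = []
    elif isinstance(strata, str):
        columns = [strata]
    else:
        columns = list(strata)

    if not columns:
        return "random"                # no strata: plain random split
    if len(columns) == 1:
        return "single_column_strata"  # one column: behaviour stays as it is now
    return "twinning"                  # 2+ columns: proposed twinning-based split


print(pick_split_method())                # random
print(pick_split_method("species"))       # single_column_strata
print(pick_split_method(["age", "sex"]))  # twinning
```

Keeping the choice inside the existing strata argument means single-column users see no change, and no new function names are needed.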

hfrick added the "feature" label (a feature request or enhancement) Nov 1, 2023