feature request - manual split creation #158

Closed
DanOvando opened this issue Jun 1, 2020 · 5 comments
Labels
feature a feature request or enhancement

Comments

@DanOvando

I'm exploring using tidymodels for my predictive modeling workflows and loving it so far. I am running into one problem though. I nearly always want to do my initial split by something like year or country, but have many nested observations within those splits. I can't figure out a clean way of creating these splits using rsample. To use the gapminder dataset, suppose I want to train a model using only data from before the year 2000, and test the model on all data points greater than or equal to the year 2000.

I see how this could be done combining initial_time_split with nesting, but the resulting object would not (I assume) work if passed as the split argument of tune::last_fit, since each resample would be a nested tibble.

My current strategy is to arrange the data so that the training and testing rows are cleanly separated, find the split point, and pass it as the prop argument to initial_time_split. That feels pretty rough to me, and it makes me nervous with very large datasets, where I'm not sure how exactly rsample applies the proportion (e.g. if the correct split is 0.961234, will rsample be that precise?).

Would it be possible to add an argument to initial_split, something like training_split, where training_split is a logical vector that is TRUE for rows in the training set and FALSE for rows in the testing set? That way, users could make the definition of the testing and training splits as complex as needed, and simply pass the resulting logical vector to initial_split.

library(tidyverse)
library(tidymodels)
#> ── Attaching packages ──────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.6      ✓ rsample   0.0.6 
#> ✓ dials     0.0.6      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.1 
#> ✓ parsnip   0.1.1      ✓ yardstick 0.0.6 
#> ✓ recipes   0.1.12
#> ── Conflicts ─────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter()   masks stats::filter()
#> x recipes::fixed()  masks stringr::fixed()
#> x dplyr::lag()      masks stats::lag()
#> x dials::margin()   masks ggplot2::margin()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()
library(gapminder)

gapminder <- gapminder %>% 
  arrange(year)

split_prop <- last(which(gapminder$year < 2000)) / nrow(gapminder)

test <- initial_time_split(gapminder, prop = split_prop)

# gapminder is recorded at 5-year intervals, so the max year in the training set should be 1997
training(test)$year %>% unique()
#>  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997

testing(test)$year %>% unique()
#> [1] 2002 2007

Created on 2020-06-01 by the reprex package (v0.3.0)

@topepo topepo added the feature a feature request or enhancement label Jun 5, 2020
@topepo
Member

topepo commented Jun 5, 2020

We'll consider making an api to do this; no guarantees though.

In the meantime, you could make your own class of rset objects. The critical functions are exported and you can use vfold_cv() as a template to look at.

@oude-gao

oude-gao commented Jun 24, 2020

How would that rset object be read by last_fit()? I have a similar issue: I created my training/testing split using a very specific method and wanted to use last_fit(). Since I didn't have anything created by initial_split(), I had to reverse-engineer my own rsplit object from the training and testing sets I had created earlier, hoping last_fit() would accept it. When I passed it to last_fit(), I got an error message saying there is a different number of rows.

So I took a closer look at last_fit(), and it seems that, inside the function, the rsplit is converted into an rset object by recalculating prop and resampling the data with mc_cv(). If so, that rset object will not carry the exact training/testing split I originally created, even though the sizes would match. Am I reading the function wrong? Is there a way to solve this problem?

@DavisVaughan
Member

It sounds like you just want rsample::make_splits()? Those splits can be as custom as you need. If you need an rset object from them, you can then supply them to rsample::manual_rset(), which is available in the development version.
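
For anyone wanting to see the two functions together, here is a minimal sketch, assuming an rsample version that exports both make_splits() and manual_rset(). The toy data frame, the cutoff years, and the id labels are made up for illustration and are not from this thread.

```r
library(rsample)

# Toy panel data (illustrative only): 3 countries observed over 6 years.
dat <- expand.grid(country = c("A", "B", "C"), year = 2000:2005)
dat$y <- seq_len(nrow(dat))

# One custom split per cutoff year: analyse earlier years, assess that year.
cutoffs <- c(2003, 2004, 2005)
splits <- lapply(cutoffs, function(cut) {
  make_splits(
    list(
      analysis   = which(dat$year <  cut),
      assessment = which(dat$year == cut)
    ),
    dat
  )
})

# Bundle the hand-built splits into a single rset object.
folds <- manual_rset(splits, ids = paste0("cutoff_", cutoffs))
folds
```

Each element of folds$splits is an ordinary rsplit, so analysis() and assessment() return exactly the rows that were listed by hand.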

@topepo
Member

topepo commented Sep 14, 2020

Like this:

library(tidyverse)
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0          ✓ recipes   0.1.13    
#> ✓ dials     0.0.8.9001     ✓ rsample   0.0.7.9000
#> ✓ infer     0.5.2          ✓ tune      0.1.1.9000
#> ✓ modeldata 0.0.2          ✓ workflows 0.1.3.9000
#> ✓ parsnip   0.1.3          ✓ yardstick 0.0.7
#> ── Conflicts ──────────────────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter()   masks stats::filter()
#> x recipes::fixed()  masks stringr::fixed()
#> x dplyr::lag()      masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()
library(gapminder)

gapminder <- gapminder %>% 
  arrange(year) %>% 
  mutate(.row = row_number())

indices <-
  list(analysis   = gapminder$.row[gapminder$year <= 2000], 
       assessment = gapminder$.row[gapminder$year >  2000]
  )

split <- make_splits(indices, gapminder %>% select(-.row))
training(split)
#> # A tibble: 1,420 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Albania     Europe     1952    55.2  1282697     1601.
#>  3 Algeria     Africa     1952    43.1  9279525     2449.
#>  4 Angola      Africa     1952    30.0  4232095     3521.
#>  5 Argentina   Americas   1952    62.5 17876956     5911.
#>  6 Australia   Oceania    1952    69.1  8691212    10040.
#>  7 Austria     Europe     1952    66.8  6927772     6137.
#>  8 Bahrain     Asia       1952    50.9   120447     9867.
#>  9 Bangladesh  Asia       1952    37.5 46886859      684.
#> 10 Belgium     Europe     1952    68    8730405     8343.
#> # … with 1,410 more rows
testing(split)
#> # A tibble: 284 x 6
#>    country     continent  year lifeExp       pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
#>  1 Afghanistan Asia       2002    42.1  25268405      727.
#>  2 Albania     Europe     2002    75.7   3508512     4604.
#>  3 Algeria     Africa     2002    71.0  31287142     5288.
#>  4 Angola      Africa     2002    41.0  10866106     2773.
#>  5 Argentina   Americas   2002    74.3  38331121     8798.
#>  6 Australia   Oceania    2002    80.4  19546792    30688.
#>  7 Austria     Europe     2002    79.0   8148312    32418.
#>  8 Bahrain     Asia       2002    74.8    656397    23404.
#>  9 Bangladesh  Asia       2002    62.0 135656790     1136.
#> 10 Belgium     Europe     2002    78.3  10311970    30486.
#> # … with 274 more rows

Created on 2020-09-14 by the reprex package (v0.3.0)

You'll need the GitHub version of rsample (but not any of the other development versions that I have loaded right now).
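
The same pattern can be wrapped to match the logical-vector interface asked for in the original request. This is only a sketch: split_from_flag() is a hypothetical helper, not part of rsample, and the small data frame is invented for illustration. It assumes an rsample version that exports make_splits().

```r
library(rsample)

# Hypothetical helper (not an rsample function): build an rsplit from a
# logical vector where TRUE marks training rows and FALSE marks testing rows.
split_from_flag <- function(data, in_training) {
  stopifnot(
    is.logical(in_training),
    length(in_training) == nrow(data),
    !anyNA(in_training)
  )
  make_splits(
    list(
      analysis   = which(in_training),
      assessment = which(!in_training)
    ),
    data
  )
}

# Illustrative data: two rows for each of four survey years.
dat <- data.frame(year = rep(c(1992, 1997, 2002, 2007), each = 2), y = 1:8)

year_split <- split_from_flag(dat, dat$year < 2000)
unique(training(year_split)$year)
unique(testing(year_split)$year)
```

Because the rows are selected by index rather than by a proportion, the data do not need to be sorted first and no rounding of prop is involved.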

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 21, 2021
4 participants