feature request - manual split creation #158

DanOvando · 2020-06-01T21:56:13Z

I'm exploring using tidymodels for my predictive modeling workflows and loving it so far. I am running into one problem though. I nearly always want to do my initial split by something like year or country, but have many nested observations within those splits. I can't figure out a clean way of creating these splits using rsample. To use the gapminder dataset, suppose I want to train a model using only data from before the year 2000, and test the model on all data points greater than or equal to the year 2000.

I see how this could be done combining initial_time_split with nesting, but the resulting object would not (I assume) work if passed as the split argument of tune::last_fit, since each resample would be a nested tibble.

My current strategy is to arrange the data so that the training and testing rows are cleanly separated, find the split point, and pass it as the prop argument to initial_time_split. That feels pretty rough to me, and it makes me nervous with very large datasets, where I'm not sure how precisely rsample applies the proportion (e.g. if the correct split is 0.961234, will rsample be that precise?).
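The precision concern can be checked in base R. The sketch below assumes that initial_time_split() puts the first floor(nrow(data) * prop) rows in the training set; that is an assumption about rsample's internals and worth checking against the installed version. Under that assumption, adding half a row to the numerator keeps floating-point rounding from moving the cut by one row. The names prop_exact and prop_safe are illustrative only.

```r
# Assumption: the training set is the first floor(n * prop) rows.
n <- 1704L   # rows in gapminder
k <- 1420L   # intended number of training rows (years before 2000)

prop_exact <- k / n          # relies on k / n * n not rounding below k
prop_safe  <- (k + 0.5) / n  # half-row margin absorbs rounding error

floor(n * prop_exact)  # equals k only if the round trip is not rounded down
floor(n * prop_safe)   # k, because n * prop_safe is within rounding error of k + 0.5
```

The margin only works around the proportion-based interface; it does not remove the need to sort the data first.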

Would it be possible to add an argument to initial_split, something like training_split, where training_split is a logical vector that is TRUE for rows in the training set and FALSE otherwise? That way, users could make the definition of the testing and training splits as complex as needed, and simply pass the resulting logical vector to initial_split.

library(tidyverse)
library(tidymodels)
#> ── Attaching packages ──────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.6      ✓ rsample   0.0.6 
#> ✓ dials     0.0.6      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.1 
#> ✓ parsnip   0.1.1      ✓ yardstick 0.0.6 
#> ✓ recipes   0.1.12
#> ── Conflicts ─────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter()   masks stats::filter()
#> x recipes::fixed()  masks stringr::fixed()
#> x dplyr::lag()      masks stats::lag()
#> x dials::margin()   masks ggplot2::margin()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()
library(gapminder)

gapminder <- gapminder %>% 
  arrange(year)

split_prop <- (last(which(gapminder$year <= 2000))) / nrow(gapminder)

test <- initial_time_split(gapminder, prop = split_prop)

# gapminder works in 5 year intervals, so max year should be 1997 in training set
training(test)$year %>% unique()
#>  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997

testing(test)$year %>% unique()
#> [1] 2002 2007

Created on 2020-06-01 by the reprex package (v0.3.0)


topepo · 2020-06-05T13:34:49Z

We'll consider making an API to do this; no guarantees though.

In the meantime, you could make your own class of rset objects. The critical functions are exported and you can use vfold_cv() as a template to look at.

oude-gao · 2020-06-24T19:45:41Z

How would that rset object be read by last_fit()? I have a similar issue: I created my training/testing split using a very specific method and wanted to use last_fit(). Since I didn't have anything created by initial_split(), I had to reverse-engineer my own rsplit object from the training/testing sets I had created earlier, hoping last_fit() would accept it. When I passed it to last_fit(), I got an error message saying there is a different number of rows.

So I took a closer look at last_fit(), and it seems that inside the function the rsplit is transformed into an rset object by recalculating the prop and resampling the data with mc_cv(). So this rset object will not carry the exact training/testing split I originally created, even though the sizes would be the same. Am I reading the function wrong? Is there a way to solve this problem?

DavisVaughan · 2020-09-14T12:32:27Z

It sounds like you just want rsample::make_splits()? Those can be as custom as you need. Then you can supply those splits to rsample::manual_rset() in the dev version if you need an rset object from them.

topepo · 2020-09-14T23:49:35Z

Like this:

library(tidyverse)
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0          ✓ recipes   0.1.13    
#> ✓ dials     0.0.8.9001     ✓ rsample   0.0.7.9000
#> ✓ infer     0.5.2          ✓ tune      0.1.1.9000
#> ✓ modeldata 0.0.2          ✓ workflows 0.1.3.9000
#> ✓ parsnip   0.1.3          ✓ yardstick 0.0.7
#> ── Conflicts ──────────────────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter()   masks stats::filter()
#> x recipes::fixed()  masks stringr::fixed()
#> x dplyr::lag()      masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()
library(gapminder)

gapminder <- gapminder %>% 
  arrange(year) %>% 
  mutate(.row = row_number())

split_prop <- (last(which(gapminder$year <= 2000))) / nrow(gapminder)

indices <-
  list(analysis   = gapminder$.row[gapminder$year <= 2000], 
       assessment = gapminder$.row[gapminder$year >  2000]
  )

split <- make_splits(indices, gapminder %>% select(-.row))
training(split)
#> # A tibble: 1,420 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Albania     Europe     1952    55.2  1282697     1601.
#>  3 Algeria     Africa     1952    43.1  9279525     2449.
#>  4 Angola      Africa     1952    30.0  4232095     3521.
#>  5 Argentina   Americas   1952    62.5 17876956     5911.
#>  6 Australia   Oceania    1952    69.1  8691212    10040.
#>  7 Austria     Europe     1952    66.8  6927772     6137.
#>  8 Bahrain     Asia       1952    50.9   120447     9867.
#>  9 Bangladesh  Asia       1952    37.5 46886859      684.
#> 10 Belgium     Europe     1952    68    8730405     8343.
#> # … with 1,410 more rows
testing(split)
#> # A tibble: 284 x 6
#>    country     continent  year lifeExp       pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
#>  1 Afghanistan Asia       2002    42.1  25268405      727.
#>  2 Albania     Europe     2002    75.7   3508512     4604.
#>  3 Algeria     Africa     2002    71.0  31287142     5288.
#>  4 Angola      Africa     2002    41.0  10866106     2773.
#>  5 Argentina   Americas   2002    74.3  38331121     8798.
#>  6 Australia   Oceania    2002    80.4  19546792    30688.
#>  7 Austria     Europe     2002    79.0   8148312    32418.
#>  8 Bahrain     Asia       2002    74.8    656397    23404.
#>  9 Bangladesh  Asia       2002    62.0 135656790     1136.
#> 10 Belgium     Europe     2002    78.3  10311970    30486.
#> # … with 274 more rows

Created on 2020-09-14 by the reprex package (v0.3.0)

You'll need the GitHub version of rsample (but not any of the other devel versions that I have loaded right now).
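The original request was for a logical vector marking the training rows, and that can be layered on top of make_splits(). The sketch below is one way to do it, assuming an rsample version that exports make_splits() and manual_rset(). The helper split_from_logical() and the toy data frame are hypothetical and exist only for illustration; they are not part of rsample.

```r
library(rsample)

# Hypothetical helper (not part of rsample): build an rsplit from a logical
# vector that is TRUE for training rows and FALSE for testing rows.
split_from_logical <- function(data, in_training) {
  stopifnot(
    is.logical(in_training),
    length(in_training) == nrow(data),
    !anyNA(in_training)
  )
  make_splits(
    list(
      analysis   = which(in_training),
      assessment = which(!in_training)
    ),
    data
  )
}

# Small made-up data frame standing in for a real dataset.
toy <- data.frame(
  year = rep(c(1992, 1997, 2002, 2007), each = 3),
  y    = 1:12
)

# Train on years before 2000, test on everything else.
split <- split_from_logical(toy, toy$year < 2000)
nrow(training(split))
nrow(testing(split))

# Wrap the single split in an rset when a resampling object is needed.
holdout <- manual_rset(list(split), ids = "holdout")
```

Because the indices come straight from the logical vector, the data does not need to be sorted first and no proportion is involved.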

github-actions · 2021-02-21T00:56:25Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

topepo added the feature (a feature request or enhancement) label Jun 5, 2020

topepo mentioned this issue Sep 15, 2020

Why is stratification minimum based on percent rather than absolute number of observations? #162

Closed

topepo closed this as completed Sep 15, 2020

larry77 mentioned this issue Dec 3, 2020

Possible Bug in Tidymodels for an Unusual Split of the Data tidymodels/tidymodels.org-legacy#198

Closed

juliasilge mentioned this issue Feb 2, 2021

more group-based splitting methods #207

Closed

github-actions bot locked and limited conversation to collaborators Feb 21, 2021
