feature request - manual split creation #158
We'll consider making an api to do this; no guarantees though. In the meantime, you could make your own class of |
How would that So I went to take a better look at |
It sounds like you just want |
Like this:
library(tidyverse)
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom 0.7.0 ✓ recipes 0.1.13
#> ✓ dials 0.0.8.9001 ✓ rsample 0.0.7.9000
#> ✓ infer 0.5.2 ✓ tune 0.1.1.9000
#> ✓ modeldata 0.0.2 ✓ workflows 0.1.3.9000
#> ✓ parsnip 0.1.3 ✓ yardstick 0.0.7
#> ── Conflicts ──────────────────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter() masks stats::filter()
#> x recipes::fixed() masks stringr::fixed()
#> x dplyr::lag() masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step() masks stats::step()
library(gapminder)
gapminder <- gapminder %>%
  arrange(year) %>%
  mutate(.row = row_number())
split_prop <- (last(which(gapminder$year <= 2000))) / nrow(gapminder)
indices <-
  list(analysis   = gapminder$.row[gapminder$year <= 2000],
       assessment = gapminder$.row[gapminder$year > 2000]
  )
split <- make_splits(indices, gapminder %>% select(-.row))
training(split)
#> # A tibble: 1,420 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Albania Europe 1952 55.2 1282697 1601.
#> 3 Algeria Africa 1952 43.1 9279525 2449.
#> 4 Angola Africa 1952 30.0 4232095 3521.
#> 5 Argentina Americas 1952 62.5 17876956 5911.
#> 6 Australia Oceania 1952 69.1 8691212 10040.
#> 7 Austria Europe 1952 66.8 6927772 6137.
#> 8 Bahrain Asia 1952 50.9 120447 9867.
#> 9 Bangladesh Asia 1952 37.5 46886859 684.
#> 10 Belgium Europe 1952 68 8730405 8343.
#> # … with 1,410 more rows
testing(split)
#> # A tibble: 284 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 2002 42.1 25268405 727.
#> 2 Albania Europe 2002 75.7 3508512 4604.
#> 3 Algeria Africa 2002 71.0 31287142 5288.
#> 4 Angola Africa 2002 41.0 10866106 2773.
#> 5 Argentina Americas 2002 74.3 38331121 8798.
#> 6 Australia Oceania 2002 80.4 19546792 30688.
#> 7 Austria Europe 2002 79.0 8148312 32418.
#> 8 Bahrain Asia 2002 74.8 656397 23404.
#> 9 Bangladesh Asia 2002 62.0 135656790 1136.
#> 10 Belgium Europe 2002 78.3 10311970 30486.
#> # … with 274 more rows
Created on 2020-09-14 by the reprex package (v0.3.0)
You'll need the GH version of |
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue. |
I'm exploring using tidymodels for my predictive modeling workflows and loving it so far. I am running into one problem though. I nearly always want to do my initial split by something like year or country, but have many nested observations within those splits. I can't figure out a clean way of creating these splits using rsample. To use the gapminder dataset, suppose I want to train a model using only data from before the year 2000, and test the model on all data points greater than or equal to the year 2000.
I see how this could be done by combining initial_time_split with nesting, but the resulting object would not (I assume) work if passed as the split argument of tune::last_fit, since each resample would be a nested tibble.
My current strategy is to arrange the data such that my training and testing splits are cleanly separated, find the split point in the data, and then use that as the prop argument to initial_time_split, but that feels pretty rough to me, and makes me nervous with very large datasets, where I'm not sure how exact rsample is being in the splitting (e.g. if the correct split is 0.961234, will rsample be that precise?).
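The precision worry about passing a proportion can be checked directly. The following is a minimal base R sketch, assuming the split point is derived as floor(n * prop); that formula is an assumption about rsample's internals and is not confirmed in this thread. The helper rows_kept is hypothetical. The row counts 1,420 and 284 are taken from the reprex output elsewhere in this thread.

```r
# Hypothetical helper: the number of rows a proportion-based split would
# keep for training, ASSUMING the split point is floor(n * prop).
# This is an assumption, not rsample's documented behaviour.
rows_kept <- function(n, prop) floor(n * prop)

n <- 1704L   # total gapminder rows (1,420 + 284 in the thread's reprex)
k <- 1420L   # rows intended for the training set
prop <- k / n

# Does the proportion round-trip to the intended row count?
round_trips <- rows_kept(n, prop) == k

# Scan every possible split point for this n and collect the ones that
# would NOT round-trip because of floating-point rounding.
all_k <- seq_len(n)
bad <- all_k[rows_kept(n, all_k / n) != all_k]
```

If bad is non-empty for a given n, a proportion-based split can land one row short at those split points, which is one reason passing explicit row indices is the more robust route.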
Would it be possible to add something to initial_split like training_split, where training_split could be a logical vector that is TRUE for rows belonging to the training set and FALSE otherwise? That way, users could make the definition of the testing and training splits as complex as needed, and simply pass the resulting logical vector to initial_split.
Created on 2020-06-01 by the reprex package (v0.3.0)
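The logical-vector idea can be sketched in base R. The helper indices_from_logical below is hypothetical and is not part of rsample, and the toy data frame is made up for illustration. It turns a TRUE/FALSE training flag into a list of row indices with analysis and assessment elements, the same shape as the indices list in the reprex elsewhere in this thread.

```r
# Hypothetical helper (not part of rsample): convert a logical
# "is this row training data?" vector into a list of row indices.
indices_from_logical <- function(is_training) {
  stopifnot(is.logical(is_training), !anyNA(is_training))
  list(
    analysis   = which(is_training),   # rows used for training
    assessment = which(!is_training)   # rows held out for testing
  )
}

# Made-up toy data: several observations per year, in arbitrary order.
toy <- data.frame(
  year  = c(1992, 1997, 2002, 2007, 1992, 2002),
  value = c(1.2, 1.5, 1.9, 2.3, 0.8, 1.1)
)

# Any rule that yields a logical vector works; here, train on pre-2000 rows.
indices <- indices_from_logical(toy$year < 2000)

train <- toy[indices$analysis, ]
test  <- toy[indices$assessment, ]
```

Because the rule is just a logical expression, the data does not need to be sorted first and no proportion has to be computed.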