recipes has a different use case than dplyr: earlier steps may run filters that eliminate columns a later step would have selected had they still been there. In some cases the filtering is stochastic, so it is not obvious how to proceed. We want step selection to be very robust and not break the recipe when this occurs.
What would have to happen? I see three separate issues:
1. between-step checks
It is possible that a step removes all columns from the data. If that is the case, we should probably error out the recipe with an informative error message (the same is true for a zero-row outcome).
This check would happen in prep().
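A minimal sketch of such a between-step check, written in base R rather than against the recipes internals. The helper `check_not_empty()`, the toy `steps` list, and `run_steps()` are all hypothetical names used for illustration; the only idea taken from the text above is that something like this could run after each step inside prep().

```r
# Hypothetical helper (not part of recipes): stop with an informative
# message if a step has left the data with no columns or no rows.
check_not_empty <- function(data, step_name) {
  if (ncol(data) == 0) {
    stop("After `", step_name, "`, the data has zero columns.", call. = FALSE)
  }
  if (nrow(data) == 0) {
    stop("After `", step_name, "`, the data has zero rows.", call. = FALSE)
  }
  invisible(data)
}

# Toy stand-ins for steps: each is a function from data frame to data frame.
steps <- list(
  keep_numeric = function(d) d[, vapply(d, is.numeric, logical(1)), drop = FALSE],
  drop_all     = function(d) d[, integer(0), drop = FALSE]
)

# Apply the steps in order, checking the data after each one.
run_steps <- function(data, steps) {
  for (nm in names(steps)) {
    data <- steps[[nm]](data)
    check_not_empty(data, nm)
  }
  data
}

# The second toy step removes every column, so the check fires there.
result <- tryCatch(run_steps(iris, steps), error = function(e) conditionMessage(e))
```

Here the error names the step that emptied the data, which is the kind of informative message the check is meant to give.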
2. within-step checks
For each step type, we could write code that looks to see if zero columns were selected and have this act accordingly during prep() and bake().
Using step_medianimpute() as a test, I outlined how this should happen for this particular step (see commit 72b70ef). This is a simple step; others would be more complicated.
The results:
# devtools::install_github("tidymodels/recipes@no-selected-vars")
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.6          ✓ recipes   0.1.12.9000
#> ✓ dials     0.0.7.9000     ✓ rsample   0.0.7
#> ✓ dplyr     1.0.0          ✓ tibble    3.0.1
#> ✓ ggplot2   3.3.1          ✓ tune      0.1.0.9000
#> ✓ infer     0.5.1          ✓ workflows 0.1.1
#> ✓ parsnip   0.1.1          ✓ yardstick 0.0.6
#> ✓ purrr     0.3.4
#> ── Conflicts ──────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
data("credit_data", package="modeldata")
## missing data per column
vapply(credit_data, function(x) mean(is.na(x)), c(num=0))
#>       Status    Seniority         Home         Time          Age      Marital
#> 0.0000000000 0.0000000000 0.0013471037 0.0000000000 0.0000000000 0.0002245173
#>      Records          Job     Expenses       Income       Assets         Debt
#> 0.0000000000 0.0004490346 0.0000000000 0.0855410867 0.0105523125 0.0040413112
#>       Amount        Price
#> 0.0000000000 0.0000000000
set.seed(342)
in_training <- sample(1:nrow(credit_data), 2000)
credit_tr <- credit_data[ in_training, ]
credit_te <- credit_data[-in_training, ]
# Does not work but should:
recipe(Price ~ ., data = credit_tr) %>%
  step_medianimpute(Potato) %>%
  prep()
#> Error: Can't subset columns that don't exist.
#> x Column `Potato` doesn't exist.
# Does work:
rec <-
  recipe(Price ~ ., data = credit_tr) %>%
  step_medianimpute(any_of("Potato")) %>%
  prep()
#> Warning: `prep.step_medianimpute()` did not select any columns. This step will
#> not affect the data.
bake(rec, credit_te) %>% complete.cases() %>% mean()
#> [1] 0.906683
Some notes:
We should really automate the warning message from within a handler used by terms_select(). The user shouldn't have to do this manually each time.
Bare column names should be acceptable (i.e. the first example above with the unquoted Potato).
The helper function printer() would need to be modified to show that no columns were selected.
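The notes above can be sketched in base R with a toy median-imputation step. Everything here is an assumption for illustration: `warn_if_empty()`, `prep_median()`, `bake_median()`, and `describe_step()` are hypothetical names, not the functions in recipes or in commit 72b70ef. The sketch shows a shared warning helper, a step that becomes a no-op when nothing is selected, and a printed description that says so.

```r
# Hypothetical shared helper: warn in one uniform format when a step's
# selectors matched nothing. Returns TRUE if the selection is empty.
warn_if_empty <- function(cols, step_name) {
  empty <- length(cols) == 0
  if (empty) {
    warning("`", step_name, "` did not select any columns. ",
            "This step will not affect the data.", call. = FALSE)
  }
  empty
}

# Toy "prep": learn medians for the requested columns that still exist.
prep_median <- function(data, cols) {
  cols <- intersect(cols, names(data))  # tolerate columns that are gone
  warn_if_empty(cols, "prep_median")
  medians <- vapply(data[cols], stats::median, numeric(1), na.rm = TRUE)
  list(medians = medians)
}

# Toy "bake": fill missing values; with no learned medians the loop body
# never runs, so the data passes through unchanged.
bake_median <- function(object, new_data) {
  for (nm in names(object$medians)) {
    missing_rows <- is.na(new_data[[nm]])
    new_data[[nm]][missing_rows] <- object$medians[[nm]]
  }
  new_data
}

# Toy printer: state explicitly when no columns were selected.
describe_step <- function(object) {
  cols <- names(object$medians)
  if (length(cols) == 0) {
    "Median imputation for <no columns selected>"
  } else {
    paste("Median imputation for", paste(cols, collapse = ", "))
  }
}

df <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 4))
fit_none <- suppressWarnings(prep_median(df, "Potato"))  # warns, then no-op
fit_a    <- prep_median(df, c("a", "Potato"))            # only `a` survives
```

Because the warning lives in one helper, each step only has to call it rather than write its own message, which is the automation the first note asks for.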
3. terms_select() changes
We do have a handler called passover() that is used by step_dummy() that does not throw an error when columns are not selected. It is likely that other changes are needed (e.g. the strict argument) to terms_select() for this to work properly.
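One way to picture a tolerant handler is a wrapper that turns a failed selection into an empty one. This is a base R sketch under stated assumptions: `strict_select()` and `select_or_empty()` are hypothetical, the `strict` flag is only an illustration of the idea mentioned above, and none of this reflects the actual signatures of passover() or terms_select().

```r
# Stand-in for a strict selector: errors when any requested column is absent.
strict_select <- function(cols, data) {
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Can't subset columns that don't exist: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  cols
}

# Hypothetical tolerant wrapper: with strict = FALSE, a failed selection
# becomes an empty selection plus a warning instead of an error.
select_or_empty <- function(cols, data, strict = FALSE) {
  if (strict) {
    return(strict_select(cols, data))
  }
  tryCatch(
    strict_select(cols, data),
    error = function(e) {
      warning("No columns selected: ", conditionMessage(e), call. = FALSE)
      character(0)
    }
  )
}

ok      <- select_or_empty("Species", iris)                   # column exists
nothing <- suppressWarnings(select_or_empty("Potato", iris))  # empty, no error
```

Note the design choice in this sketch: any failure empties the whole selection, even if some requested columns do exist. A real handler might instead keep the columns that were found.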
What should happen when the terms given by the user select no columns? This has come up a few times (see #140).
The default, especially in the latest dplyr/tidyselect, is to fail.
Created on 2020-06-11 by the reprex package (v0.3.0)