steps that select no columns #531

Closed
topepo opened this issue Jun 11, 2020 · 3 comments
Labels
feature a feature request or enhancement

Comments

@topepo
Member

topepo commented Jun 11, 2020

What should happen when the terms given by the user select no columns? This has come up a few times (see #140).

The default, especially in the latest dplyr/tidyselect, is to fail.

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(dplyr)

# dplyr: 
mtcars %>% select(torque)
#> Error: Can't subset columns that don't exist.
#> x Column `torque` doesn't exist.

# recipes:
recipe(mpg ~ ., data = mtcars) %>% 
  step_center(torque) %>% 
  prep()
#> Error: Can't subset columns that don't exist.
#> x Column `torque` doesn't exist.

Created on 2020-06-11 by the reprex package (v0.3.0)

recipes has a different use case than dplyr: earlier steps may run filters that eliminate columns that a later step would have selected had they still been there. In some cases, the selection is stochastic, so it is not obvious how to proceed. We want step selection to be very robust and not break the recipe when this occurs.

What would have to happen? I see three separate issues:

1. between-step checks

It is possible that a step removes all columns from the data. If that is the case, we should probably stop the recipe with an informative error message (the same applies when a step leaves zero rows).

This check would happen in prep().
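
As a rough illustration of such a between-step check, here is a minimal base R sketch. The helper name `check_step_output()` and its `step_id` argument are hypothetical and are not part of recipes; the idea is only that prep() could call something like this on the data after each step runs.

```r
# Hypothetical helper (not a recipes function): validate the data that a
# step hands to the next step.
check_step_output <- function(data, step_id) {
  if (ncol(data) == 0) {
    stop("Step `", step_id, "` removed all columns from the data.",
         call. = FALSE)
  }
  if (nrow(data) == 0) {
    stop("Step `", step_id, "` removed all rows from the data.",
         call. = FALSE)
  }
  # Return the data unchanged so the check can sit inside a pipeline.
  invisible(data)
}

# Passes silently when the data still has rows and columns.
check_step_output(mtcars, "center_1")
```

Calling it with `mtcars[, 0]` or `mtcars[0, ]` would raise the corresponding error.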

2. within-step checks

For each step type, we could write code that checks whether zero columns were selected and handles that case appropriately during prep() and bake().
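
To make the idea concrete, here is a small base R sketch of a median-imputation step whose two halves tolerate an empty selection. `prep_median()` and `bake_median()` are hypothetical stand-ins, not the recipes implementation, and the selection is simplified to a character vector of column names.

```r
# Hypothetical stand-in for the prep() half: estimate medians on the
# training set, or record nothing when no requested column exists.
prep_median <- function(training, columns) {
  columns <- intersect(columns, names(training))
  if (length(columns) == 0) {
    warning("No columns were selected. This step will not affect the data.",
            call. = FALSE)
    return(list(medians = numeric(0)))
  }
  medians <- vapply(training[columns], stats::median, numeric(1), na.rm = TRUE)
  list(medians = medians)
}

# Hypothetical stand-in for the bake() half: fill missing values with the
# stored medians. With no stored medians the loop body never runs.
bake_median <- function(object, new_data) {
  for (col in names(object$medians)) {
    is_missing <- is.na(new_data[[col]])
    new_data[[col]][is_missing] <- object$medians[[col]]
  }
  new_data
}

df <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 6))
fit <- prep_median(df, "a")
bake_median(fit, df)
```

With a column that does not exist, `prep_median(df, "Potato")` warns and the resulting object leaves the data untouched when baked.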

Using step_medianimpute() as a test, I outlined how this should happen for this particular step (see commit 72b70ef). This is a simple step; others would be more complicated.

The results:

# devtools::install_github("tidymodels/recipes@no-selected-vars")
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.6           ✓ recipes   0.1.12.9000
#> ✓ dials     0.0.7.9000      ✓ rsample   0.0.7      
#> ✓ dplyr     1.0.0           ✓ tibble    3.0.1      
#> ✓ ggplot2   3.3.1           ✓ tune      0.1.0.9000 
#> ✓ infer     0.5.1           ✓ workflows 0.1.1      
#> ✓ parsnip   0.1.1           ✓ yardstick 0.0.6      
#> ✓ purrr     0.3.4
#> ── Conflicts ──────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
data("credit_data", package = "modeldata")

## missing data per column
vapply(credit_data, function(x) mean(is.na(x)), c(num = 0))
#>       Status    Seniority         Home         Time          Age      Marital 
#> 0.0000000000 0.0000000000 0.0013471037 0.0000000000 0.0000000000 0.0002245173 
#>      Records          Job     Expenses       Income       Assets         Debt 
#> 0.0000000000 0.0004490346 0.0000000000 0.0855410867 0.0105523125 0.0040413112 
#>       Amount        Price 
#> 0.0000000000 0.0000000000

set.seed(342)
in_training <- sample(1:nrow(credit_data), 2000)

credit_tr <- credit_data[ in_training, ]
credit_te <- credit_data[-in_training, ]

# Does not work but should: 
recipe(Price ~ ., data = credit_tr) %>%
  step_medianimpute(Potato) %>% 
  prep()
#> Error: Can't subset columns that don't exist.
#> x Column `Potato` doesn't exist.

# Does work:
rec <- 
  recipe(Price ~ ., data = credit_tr) %>%
  step_medianimpute(any_of("Potato")) %>% 
  prep()
#> Warning: `prep.step_medianimpute()` did not select any columns. This step will
#> not affect the data.

bake(rec, credit_te) %>% complete.cases() %>% mean()
#> [1] 0.906683

Created on 2020-06-11 by the reprex package (v0.3.0)

Some notes:

  • We should really automate the warning message from within a handler used by terms_select(). The user shouldn't have to do this manually each time.

  • Bare column names should be acceptable (i.e. the first example above with the unquoted Potato).

  • The helper function printer() would need to be modified to show that no columns were selected.
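
The first and third notes could look roughly like the following base R sketch. Both `warn_if_empty()` and `describe_selection()` are hypothetical names used only for illustration; they are not the actual terms_select() handler or the printer() helper, whose signatures are not shown in this issue.

```r
# Hypothetical central helper: every step passes its selection through this,
# so individual steps never have to write the warning themselves.
warn_if_empty <- function(selected, step_name) {
  if (length(selected) == 0) {
    warning("`", step_name, "()` did not select any columns. ",
            "This step will not affect the data.", call. = FALSE)
  }
  selected
}

# Hypothetical printing helper: make an empty selection visible in the
# step's printed summary instead of printing nothing.
describe_selection <- function(title, selected) {
  cols <- if (length(selected) == 0) "<none>" else paste(selected, collapse = ", ")
  paste0(title, " for ", cols)
}

describe_selection("Median imputation", character(0))
describe_selection("Centering", c("mpg", "disp"))
```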

3. terms_select() changes

We do have a handler called passover(), used by step_dummy(), that does not throw an error when no columns are selected. Other changes to terms_select() (e.g. the strict argument) are likely needed for this to work properly.
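
For illustration only, a lenient selection wrapper in that spirit might look like the base R sketch below. `select_lenient()` and `strict_select()` are hypothetical; this is not how passover() or terms_select() are actually written.

```r
# Hypothetical strict selector: errors when a requested column is absent,
# similar in spirit to the default behavior described above.
strict_select <- function(columns) {
  function(data) {
    absent <- setdiff(columns, names(data))
    if (length(absent) > 0) {
      stop("Column `", absent[1], "` doesn't exist.", call. = FALSE)
    }
    columns
  }
}

# Hypothetical lenient wrapper: turn a selection error into an empty
# selection so the calling step can decide what to do with it.
select_lenient <- function(selector, data) {
  tryCatch(selector(data), error = function(e) character(0))
}

select_lenient(strict_select("torque"), mtcars)  # empty character vector
select_lenient(strict_select("mpg"), mtcars)     # "mpg"
```

One caveat with this design is that it swallows every selection error, not only missing columns, so a real implementation would probably want to catch a narrower condition.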

@topepo topepo added the feature a feature request or enhancement label Jun 11, 2020
@rorynolan
Contributor

FWIW I think it would be cool for steps to have an option for the user to specify whether or not it's OK for that step to select no variables.

@juliasilge
Member

Closed in #813 🎉

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 13, 2021