step_rm(skip=TRUE) not working with bake() leading predict() to fail #239

Closed
AndrewKostandy opened this issue Nov 5, 2018 · 7 comments
Labels: bug (an unexpected problem or unintended behavior), reprex (needs a minimal reproducible example)

Comments

AndrewKostandy commented Nov 5, 2018

The skip = TRUE argument of step_rm() doesn't seem to work with bake(): the variable still gets removed from the baked dataset. Additionally, predict() throws an error even though the removed variable isn't really needed by the model, and the interaction involving that variable, which the model does need, is available in the dataset being predicted on.

library(tidyverse)
library(recipes)
library(caret)

rec_1 <- recipe(Species ~ ., data = iris) %>% 
  step_interact(terms = ~ Sepal.Length:Sepal.Width) %>% 
  step_rm(Sepal.Length, skip = TRUE)

train_ctrl <- trainControl(method = "none", classProbs = TRUE)

rf_1 <- train(rec_1, data=iris, method = "rf", trControl = train_ctrl)
#> Loading required namespace: randomForest

prep_rec <- prep(rec_1, iris, retain=TRUE)

iris_juiced <- juice(prep_rec)

iris_baked <- bake(prep_rec, newdata = iris)

colnames(iris_juiced)
#> [1] "Sepal.Width"                "Petal.Length"              
#> [3] "Petal.Width"                "Species"                   
#> [5] "Sepal.Length_x_Sepal.Width"

colnames(iris_baked)
#> [1] "Sepal.Width"                "Petal.Length"              
#> [3] "Petal.Width"                "Species"                   
#> [5] "Sepal.Length_x_Sepal.Width"

predict(rf_1, iris_juiced, type = "response")
#> Error in eval(predvars, data, env): object 'Sepal.Length' not found

predict(rf_1, iris_baked, type = "response")
#> Error in eval(predvars, data, env): object 'Sepal.Length' not found

Created on 2018-11-05 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS High Sierra 10.13.6   
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                                
#> 
#> ─ Packages ──────────────────────────────────────────────
#>  package      * version    date       lib
#>  tidyverse    * 1.2.1      2017-11-14 [1]
#>  recipes      * 0.1.3.9002 2018-10-26 [1]
#>  caret        * 6.0-80     2018-05-26 [1]
AndrewKostandy changed the title from "skip=TRUE not working with bake() leading predict() to fail" to "step_rm(skip=TRUE) not working with bake() leading predict() to fail" on Nov 5, 2018
topepo added the bug and reprex labels on Nov 5, 2018
Member

topepo commented Nov 5, 2018

Internally, the step is skipped. The issue is that the list of variables that should be retained after processing the steps is defined in bake.recipe as:

keepers <- terms_select(terms = terms, info = object$term_info)

where object$term_info holds the information about what variables were available at the end of prep.recipe. In this case:

> object$term_info
# A tibble: 1 x 4
  variable type    role      source  
  <chr>    <chr>   <chr>     <chr>   
1 x2       numeric predictor original

because that was all that was left after prepping the recipe. That is clearly wrong in this case, so we'll need to think this through.
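
The behaviour described above can be sketched with a base R toy. This is not the recipes internals: `vars_after_prep` is a hypothetical stand-in for `object$term_info$variable`, and `x1` is an assumed column that a `skip = TRUE` removal step dropped during prep.

```r
# Toy model of the problem; names here are illustrative assumptions.
new_data <- data.frame(x1 = 1:3, x2 = c(2.5, 3.5, 4.5))

# At the end of prep, the removal step had dropped x1, so only x2 was recorded.
vars_after_prep <- "x2"

# During bake the removal step is skipped, so x1 is still present here ...
processed <- new_data

# ... but the final column selection uses the training-time variable list,
# so x1 is dropped anyway.
baked <- processed[, intersect(names(processed), vars_after_prep), drop = FALSE]
names(baked)
#> [1] "x2"
```

The step itself never runs on the new data; the column disappears only because the set of columns to keep was fixed at training time.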

topepo added a commit that referenced this issue Nov 5, 2018
@topepo topepo added this to To do in Package: recipes Nov 5, 2018
Member

topepo commented Nov 6, 2018

I think that the line

keepers <- terms_select(terms = terms, info = object$term_info)

needs to occur after newdata is computed and should use newdata, rather than object$term_info, as the context.
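
A rough base R sketch of that ordering follows. It is only an illustration under assumptions: `toy_bake()` and the step list `rm_x1` are hypothetical, and the real change in recipes may look different.

```r
# Hypothetical toy step: a function plus a skip flag (not a real recipes step).
rm_x1 <- list(
  apply = function(d) d[, setdiff(names(d), "x1"), drop = FALSE],
  skip  = TRUE
)

toy_bake <- function(steps, newdata) {
  for (s in steps) {
    # Skipped steps leave newdata untouched.
    if (!s$skip) newdata <- s$apply(newdata)
  }
  # Choose the columns to keep only now, with the processed newdata as context.
  keepers <- names(newdata)
  newdata[, keepers, drop = FALSE]
}

baked_fixed <- toy_bake(list(rm_x1), data.frame(x1 = 1:3, x2 = c(2.5, 3.5, 4.5)))
names(baked_fixed)
#> [1] "x1" "x2"
```

Because the selection happens after the steps have run, a column that a skipped step would have removed survives in the result.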

Author

AndrewKostandy commented Nov 7, 2018

If my understanding is correct, this would keep the variable after using bake() if skip = TRUE. But why does predict() still need the variable to be present in the data even though it was removed at the end of the recipe and isn't used in the model?

Member

topepo commented Nov 12, 2018

> If my understanding is correct, this would keep the variable after using bake() if skip = TRUE. But why does predict() still need the variable to be present in the data even though it was removed at the end of the recipe and isn't used in the model?

It was used in the model since the skip only affects the processing of new data (via bake). To get the processed version of the training set, caret uses juice, so it will be there and not in the set of new data being predicted using bake. The skip option is intended for cases where you need to affect the outcome data but want to avoid errors when processing new samples (since you won't have the outcome in hand).
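
The intended use of skip can be illustrated with another base R toy. Everything here is an assumption made for illustration (`run_steps()`, `log_outcome`, and the columns `x` and `y` are hypothetical); it mimics the idea, not the recipes implementation.

```r
# Toy step that log-transforms the outcome y; skip = TRUE means "training only".
log_outcome <- list(
  apply = function(d) { d$y <- log(d$y); d },
  skip  = TRUE
)

run_steps <- function(steps, data, training) {
  for (s in steps) {
    # Every step runs on the training set; skipped steps are omitted for new data.
    if (training || !s$skip) data <- s$apply(data)
  }
  data
}

train_set   <- data.frame(x = c(1, 2, 3), y = c(1, 10, 100))
new_samples <- data.frame(x = c(4, 5))  # no outcome column in hand

trained <- run_steps(list(log_outcome), train_set, training = TRUE)
scored  <- run_steps(list(log_outcome), new_samples, training = FALSE)
```

Here `trained$y` holds the logged outcome, while `new_samples` passes through unchanged instead of failing on the missing `y` column.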

Member

topepo commented Nov 12, 2018

I'll merge in a fix for this in a minute. Here is what things look like after the changes:

library(tidyverse)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#> 
#>     fixed
#> The following object is masked from 'package:stats':
#> 
#>     step
library(caret)
#> Loading required package: lattice
#> 
#> Attaching package: 'caret'
#> The following object is masked from 'package:purrr':
#> 
#>     lift

rec_1 <- recipe(Species ~ ., data = iris) %>% 
  step_interact(terms = ~ Sepal.Length:Sepal.Width) %>% 
  step_rm(Sepal.Length, skip = TRUE)

train_ctrl <- trainControl(method = "none", classProbs = TRUE)

rf_1 <- train(rec_1, data=iris, method = "rf", trControl = train_ctrl)
#> Loading required namespace: randomForest

prep_rec <- prep(rec_1, iris, retain=TRUE)

iris_juiced <- juice(prep_rec)

iris_baked <- bake(prep_rec, newdata = iris)

colnames(iris_juiced)
#> [1] "Sepal.Width"                "Petal.Length"              
#> [3] "Petal.Width"                "Species"                   
#> [5] "Sepal.Length_x_Sepal.Width"

colnames(iris_baked)
#> [1] "Sepal.Length"               "Sepal.Width"               
#> [3] "Petal.Length"               "Petal.Width"               
#> [5] "Species"                    "Sepal.Length_x_Sepal.Width"

Created on 2018-11-12 by the reprex package (v0.2.1)

@topepo topepo closed this as completed in 90eed04 Nov 12, 2018
@topepo topepo moved this from To do to Done in Package: recipes Nov 13, 2018
@AndrewKostandy
Author

Works great! Thank you Max.

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 24, 2021