step_rm(skip=TRUE) not working with bake() leading predict() to fail #239

Closed
AndrewKostandy opened this issue Nov 5, 2018 · 6 comments

@AndrewKostandy commented Nov 5, 2018

The skip = TRUE argument of step_rm() doesn't seem to work with bake(): the variable still gets removed from the baked dataset. Additionally, predict() throws an error even though the removed variable isn't really needed by the model, and the interaction term that involves it, which the model does need, is available in the dataset being predicted on.

library(tidyverse)
library(recipes)
library(caret)

rec_1 <- recipe(Species ~ ., data = iris) %>% 
  step_interact(terms = ~ Sepal.Length:Sepal.Width) %>% 
  step_rm(Sepal.Length, skip = TRUE)

train_ctrl <- trainControl(method = "none", classProbs = TRUE)

rf_1 <- train(rec_1, data=iris, method = "rf", trControl = train_ctrl)
#> Loading required namespace: randomForest

prep_rec <- prep(rec_1, iris, retain=TRUE)

iris_juiced <- juice(prep_rec)

iris_baked <- bake(prep_rec, newdata = iris)

colnames(iris_juiced)
#> [1] "Sepal.Width"                "Petal.Length"              
#> [3] "Petal.Width"                "Species"                   
#> [5] "Sepal.Length_x_Sepal.Width"

colnames(iris_baked)
#> [1] "Sepal.Width"                "Petal.Length"              
#> [3] "Petal.Width"                "Species"                   
#> [5] "Sepal.Length_x_Sepal.Width"

predict(rf_1, iris_juiced, type = "response")
#> Error in eval(predvars, data, env): object 'Sepal.Length' not found

predict(rf_1, iris_baked, type = "response")
#> Error in eval(predvars, data, env): object 'Sepal.Length' not found

Created on 2018-11-05 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS High Sierra 10.13.6   
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                                
#> 
#> ─ Packages ──────────────────────────────────────────────
#>  package      * version    date       lib
#>  tidyverse    * 1.2.1      2017-11-14 [1]
#>  recipes      * 0.1.3.9002 2018-10-26 [1]
#>  caret        * 6.0-80     2018-05-26 [1]

@AndrewKostandy changed the title from "skip=TRUE not working with bake() leading predict() to fail" to "step_rm(skip=TRUE) not working with bake() leading predict() to fail" on Nov 5, 2018
@topepo added the bug and reprex labels on Nov 5, 2018

@topepo (Collaborator) commented Nov 5, 2018

Internally, the step is skipped. The issue is that the list of variables to retain after processing the steps is defined in bake.recipe as:

keepers <- terms_select(terms = terms, info = object$term_info)

where object$term_info holds the information about which variables were available at the end of prep.recipe. In this case:

> object$term_info
# A tibble: 1 x 4
  variable type    role      source  
  <chr>    <chr>   <chr>     <chr>   
1 x2       numeric predictor original

because that was all that was left after prepping the recipe. That is clearly wrong in this case, so we'll need to think this through.

@topepo added a commit that referenced this issue on Nov 5, 2018
@topepo added this to "To do" in Package: recipes on Nov 5, 2018

@topepo (Collaborator) commented Nov 6, 2018

I think that the line

keepers <- terms_select(terms = terms, info = object$term_info)

needs to run after newdata is computed and should use newdata, rather than object$term_info, as the context.
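
The ordering problem described above can be illustrated with a small base R toy. This is only a sketch under stated assumptions, not the recipes internals: the `steps` list and the `run_steps()`, `bake_old()`, and `bake_new()` helpers are hypothetical stand-ins, and the interaction is approximated by a plain product of the two columns.

```r
# Hypothetical toy steps (not recipes objects): each has a function and a skip flag.
steps <- list(
  list(name = "interact", skip = FALSE,
       fn = function(d) {
         d$Sepal.Length_x_Sepal.Width <- d$Sepal.Length * d$Sepal.Width
         d
       }),
  list(name = "rm", skip = TRUE,
       fn = function(d) {
         d$Sepal.Length <- NULL
         d
       })
)

# Variables left at the end of "prep": every step runs on the training set,
# including the ones flagged skip = TRUE.
prep_vars <- names(Reduce(function(d, s) s$fn(d), steps, iris))

# "bake" only runs the steps that are not flagged as skipped.
run_steps <- function(new_data) {
  for (s in steps) {
    if (!s$skip) new_data <- s$fn(new_data)
  }
  new_data
}

# Old behaviour: the kept columns are resolved against the training-time
# variable list, so Sepal.Length is dropped even though the step was skipped.
bake_old <- function(new_data) {
  out <- run_steps(new_data)
  out[, intersect(names(out), prep_vars), drop = FALSE]
}

# Proposed behaviour: the kept columns are resolved against the processed
# new data itself, so the skipped removal has no effect.
bake_new <- function(new_data) {
  out <- run_steps(new_data)
  keepers <- names(out)
  out[, keepers, drop = FALSE]
}

"Sepal.Length" %in% names(bake_old(iris))
"Sepal.Length" %in% names(bake_new(iris))
```

In this toy, `bake_old()` reproduces the five columns reported in the original reprex, while `bake_new()` keeps `Sepal.Length` alongside the interaction column.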

@AndrewKostandy (Author) commented Nov 7, 2018

If my understanding is correct, this would keep the variable after using bake if skip=TRUE. But why does predict() still need the variable to be present in the data even though it was removed at the end of the recipe and isn't used in the model?

@topepo (Collaborator) commented Nov 12, 2018

> If my understanding is correct, this would keep the variable after using bake if skip=TRUE. But why does predict() still need the variable to be present in the data even though it was removed at the end of the recipe and isn't used in the model?

It was used in the model since the skip only affects the processing of new data (via bake). To get the processed version of the training set, caret uses juice, so it will be there but not in the set of new data being predicted using bake. The skip option is intended for cases where you need to affect the outcome data but want to avoid errors when processing new samples (since you won't have the outcome in hand).
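
The intended use of skip described above can be sketched with an outcome transformation. This is a minimal example under assumptions that are not part of the thread: it uses mtcars and step_log() on the outcome, and it passes the new data to bake() positionally because the name of that argument differs between recipes releases.

```r
library(recipes)

# Log-transform the outcome during training only; skip = TRUE tells bake()
# not to apply this step when processing new data.
rec <- recipe(mpg ~ ., data = mtcars)
rec <- step_log(rec, mpg, skip = TRUE)

prepped <- prep(rec, training = mtcars, retain = TRUE)

# juice() returns the processed training set, where the skipped step WAS applied.
train_out <- juice(prepped)

# bake() processes new data, where the skipped step is NOT applied.
new_out <- bake(prepped, mtcars)

isTRUE(all.equal(train_out$mpg, log(mtcars$mpg)))  # outcome is logged in the training set
isTRUE(all.equal(new_out$mpg, mtcars$mpg))         # outcome is untouched in new data
```

The same asymmetry applies to step_rm(skip = TRUE) in the reprex: the removal takes effect in the juiced training set but not in data passed through bake().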

@topepo (Collaborator) commented Nov 12, 2018

I'll merge in a fix for this in a minute. Here is what things look like after the changes:

library(tidyverse)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#> 
#>     fixed
#> The following object is masked from 'package:stats':
#> 
#>     step
library(caret)
#> Loading required package: lattice
#> 
#> Attaching package: 'caret'
#> The following object is masked from 'package:purrr':
#> 
#>     lift

rec_1 <- recipe(Species ~ ., data = iris) %>% 
  step_interact(terms = ~ Sepal.Length:Sepal.Width) %>% 
  step_rm(Sepal.Length, skip = TRUE)

train_ctrl <- trainControl(method = "none", classProbs = TRUE)

rf_1 <- train(rec_1, data=iris, method = "rf", trControl = train_ctrl)
#> Loading required namespace: randomForest

prep_rec <- prep(rec_1, iris, retain=TRUE)

iris_juiced <- juice(prep_rec)

iris_baked <- bake(prep_rec, newdata = iris)

colnames(iris_juiced)
#> [1] "Sepal.Width"                "Petal.Length"              
#> [3] "Petal.Width"                "Species"                   
#> [5] "Sepal.Length_x_Sepal.Width"

colnames(iris_baked)
#> [1] "Sepal.Length"               "Sepal.Width"               
#> [3] "Petal.Length"               "Petal.Width"               
#> [5] "Species"                    "Sepal.Length_x_Sepal.Width"

Created on 2018-11-12 by the reprex package (v0.2.1)

@topepo closed this in 90eed04 on Nov 12, 2018
@topepo moved this from "To do" to "Done" in Package: recipes on Nov 13, 2018

@AndrewKostandy (Author) commented Nov 13, 2018

Works great! Thank you Max.
