
recipe returned by train != recipe trained for finalModel when indexFinal is specified #928

Closed
sheffe opened this issue Aug 17, 2018 · 6 comments



sheffe commented Aug 17, 2018

(First - thanks for caret, recipes, and the whole list of fun toys coming down the pipe. I use your tools every day!)

I'm a heavy user of custom resampling setups, specified in trainControl with index, indexOut, and indexFinal. I've switched most of my feature engineering logic to recipes run through train in (what I think is) the recommended way. Here's the issue I stumbled on:

train returns a recipe object. I have always expected that object to be a prepped recipe, prepped on the same training dataset as finalModel, so that it can be used for baking new data. This holds in general, but I think there's an exception when an indexFinal is specified in trainControl. It looks like (line 1003 in caret/train.default.R) the recipe prep is run directly on the data input, so it bypasses the indexFinal subsetting.
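
To make the mismatch concrete, here is a minimal sketch of the two preps involved (object names reuse the reprex below; this is an illustration of the behavior, not caret's literal internals):

# What train() stores in the returned object: a recipe prepped on the full
# 'data' argument
returned_recipe <- prep(iris_recipe, training = iris_noisy)

# What finalModel is actually fit against: a recipe prepped on the indexFinal
# subset
final_model_recipe <- prep(iris_recipe, training = iris_noisy[1:indexFinal_nrow, ])

# Any data-dependent step (step_pca, step_corr, ...) can disagree between the
# two preps, so baking new data with returned_recipe may yield columns that
# finalModel never saw.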

The consequence is that predict.train.recipe fails whenever the recipe prepped on the full data happens to retain a different set of terms than the recipe prepped on the indexFinal subset (the terms finalModel actually knew about). The problem stays invisible until the difference grows large enough to change which terms are retained (for example, which principal components fall above or below the retained-variance threshold). Looking back at prior code, I've found differences between the returned prepped recipe and what finalModel would have seen, and only when an indexFinal was specified, but it was only last night that the difference grew large enough to cause a mismatch in terms.

As far as I can tell, everything is correct in the recipe prep and training for finalModel. The reprex below demonstrates both of these cases.

library(dplyr)
library(caret)
library(ranger)
library(recipes)
data(iris)

devtools::session_info()

# Setup: we're going to fit a ranger model through caret::train, specifying an
# indexFinal in trainControl that covers far fewer rows than the full dataset
# we pass in. (This example is contrived for illustration.)
# After training, we'll show that (a) the recipe returned in the train object
# has a different set of predictors than a recipe manually prepped on the rows
# our indexFinal setting would have selected; and (b) predicting from the train
# object normally throws an error, while predicting through ranger's own
# generic with the manually-prepped data does not.

# Make many rows of randomly-sampled-with-replacement/randomly-ordered iris rows.
# Then add random noise to everything. The goal is to increase the number of 
# PCs we can work with in a step_pca, because it's easiest to see this issue
# when the recipe returned by train and finalModel have different predictors.

sample_nrow <- 1000 # how many extra rows we're creating from iris for training
indexFinal_nrow <- 50 # how many rows of the full set go to indexFinal

set.seed(1234)
iris_noisy <- iris %>%
  dplyr::sample_n(size = sample_nrow, replace = TRUE) %>%
  mutate_at(vars(starts_with("Sepal"), starts_with("Petal")),
            funs(plusrand_1 = . + rnorm(n = sample_nrow, mean = 0, sd = 10),
                 plusrand_2 = . + rnorm(n = sample_nrow, mean = 0, sd = 100),
                 plusrand_3 = . + rnorm(n = sample_nrow, mean = 0, sd = 1000)))

iris_recipe <- recipe(Species ~ .,
                      data = iris_noisy) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_pca(all_predictors(), threshold = 0.7)

iris_rf <- train(iris_recipe, 
                 data = iris_noisy,
                 method = "ranger", 
                 trControl = trainControl(method = "boot",
                                          number = 10,
                                          indexFinal = c(1:indexFinal_nrow)))

# > # This fails with error 'undefined columns selected'
# > predict(iris_rf, newdata = iris_noisy)
# Error in `[.data.frame`(data.selected, , (forest$dependent.varID + 1):ncol(data.selected)) : 
#   undefined columns selected 
# 
# > caret:::predict.train.recipe(iris_rf, newdata = iris_noisy)
# Error in `[.data.frame`(data.selected, , (forest$dependent.varID + 1):ncol(data.selected)) : 
#   undefined columns selected 

# NEXT: 
# Manually prep a recipe on the data specified in indexFinal, then
# predict through ranger on the baked full dataset
manually_prepped_recipe <- prep(iris_recipe, 
                                training = iris_noisy[1:indexFinal_nrow, ])

ranger:::predict.ranger(iris_rf$finalModel, 
                        data = bake(manually_prepped_recipe, 
                                    newdata = iris_noisy))
# ranger:::predict.ranger works just fine

# > manually_prepped_recipe <- prep(iris_recipe, 
# +                                 training = iris_noisy[1:indexFinal_nrow, ])
# > ranger:::predict.ranger(iris_rf$finalModel, 
# +                         data = bake(manually_prepped_recipe, 
# +                                     newdata = iris_noisy))
# Ranger prediction
# 
# Type:                             Classification 
# Sample size:                      1000 
# Number of independent variables:  7 
# > 

# NEXT
# Here, let's look at the variables that are in the prepped recipe from iris_rf
# and the recipe we prepped manually.
iris_rf$recipe$term_info$variable
manually_prepped_recipe$term_info$variable
# > iris_rf$recipe$term_info$variable
#  [1] "Species" "PC1"     "PC2"     "PC3"     "PC4"     "PC5"     "PC6"     "PC7"     "PC8"     "PC9"    
# > manually_prepped_recipe$term_info$variable
# [1] "Species" "PC1"     "PC2"     "PC3"     "PC4"     "PC5"     "PC6"     "PC7"    

# Pretty clear what happened - iris_rf$recipe has 9 PCs under 70% cumvar threshold
# while the manually prepped recipe (fewer rows, less variation) has 7 PCs

all.equal(iris_rf$recipe,
          manually_prepped_recipe)

# Check for equality of recipe prepped on full data and in iris_rf$recipe
manually_prepped_recipe_fulldata <- prep(iris_recipe, 
                                         training = iris_noisy, 
                                         retain = TRUE)

all.equal(iris_rf$recipe, manually_prepped_recipe_fulldata)

# > manually_prepped_recipe_fulldata <- prep(iris_recipe, 
# +                                          training = iris_noisy, 
# +                                          retain = TRUE)
# > all.equal(iris_rf$recipe,
# +           manually_prepped_recipe_fulldata)
# [1] TRUE
> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2018-08-17                  

Packages ---------------------------------------------------------------------------------------------------------------------------------
 package      * version    date       source         
 abind          1.4-5      2016-07-21 CRAN (R 3.4.0) 
 assertthat     0.2.0      2017-04-11 CRAN (R 3.4.0) 
 base         * 3.4.4      2018-03-15 local          
 bindr          0.1.1      2018-03-13 CRAN (R 3.4.4) 
 bindrcpp       0.2.2      2018-03-29 CRAN (R 3.4.4) 
 broom        * 0.4.4      2018-03-29 CRAN (R 3.4.4) 
 caret        * 6.0-80     2018-05-26 CRAN (R 3.4.4) 
 class          7.3-14     2015-08-30 CRAN (R 3.4.4) 
 codetools      0.2-15     2016-10-05 CRAN (R 3.4.4) 
 colorspace     1.3-2      2016-12-14 CRAN (R 3.4.0) 
 compiler       3.4.4      2018-03-15 local          
 CVST           0.2-2      2018-05-26 CRAN (R 3.4.4) 
 datasets     * 3.4.4      2018-03-15 local          
 ddalpha        1.3.4      2018-06-23 CRAN (R 3.4.4) 
 DEoptimR       1.0-8      2016-11-19 CRAN (R 3.4.0) 
 devtools       1.13.6     2018-06-27 CRAN (R 3.4.4) 
 digest         0.6.15     2018-01-28 CRAN (R 3.4.3) 
 dimRed         0.1.0      2017-05-04 CRAN (R 3.4.0) 
 dplyr        * 0.7.6      2018-06-29 CRAN (R 3.4.4) 
 DRR            0.0.3      2018-01-06 CRAN (R 3.4.3) 
 foreach        1.4.4      2017-12-12 CRAN (R 3.4.3) 
 foreign        0.8-70     2018-04-23 CRAN (R 3.4.4) 
 geometry       0.3-6      2015-09-09 CRAN (R 3.4.0) 
 ggplot2      * 3.0.0      2018-07-03 CRAN (R 3.4.4) 
 glue           1.3.0      2018-07-17 cran (@1.3.0)  
 gower          0.1.2      2017-02-23 CRAN (R 3.4.0) 
 graphics     * 3.4.4      2018-03-15 local          
 grDevices    * 3.4.4      2018-03-15 local          
 grid           3.4.4      2018-03-15 local          
 gtable         0.2.0      2016-02-26 CRAN (R 3.4.0) 
 ipred          0.9-6      2017-03-01 CRAN (R 3.4.0) 
 iterators      1.0.9      2017-12-12 CRAN (R 3.4.3) 
 kernlab        0.9-26     2018-04-30 CRAN (R 3.4.4) 
 lattice      * 0.20-35    2017-03-25 CRAN (R 3.4.4) 
 lava           1.6.1      2018-03-28 CRAN (R 3.4.4) 
 lazyeval       0.2.1      2017-10-29 CRAN (R 3.4.2) 
 lubridate      1.7.4      2018-04-11 CRAN (R 3.4.4) 
 magic          1.5-8      2018-01-26 CRAN (R 3.4.3) 
 magrittr       1.5        2014-11-22 CRAN (R 3.4.0) 
 MASS           7.3-50     2018-04-30 CRAN (R 3.4.4) 
 Matrix         1.2-14     2018-04-09 CRAN (R 3.4.4) 
 memoise        1.1.0      2017-04-21 CRAN (R 3.4.0) 
 methods      * 3.4.4      2018-03-15 local          
 mnormt         1.5-5      2016-10-15 CRAN (R 3.4.0) 
 ModelMetrics   1.1.0      2016-08-26 CRAN (R 3.4.0) 
 munsell        0.5.0      2018-06-12 CRAN (R 3.4.4) 
 nlme           3.1-137    2018-04-07 CRAN (R 3.4.4) 
 nnet           7.3-12     2016-02-02 CRAN (R 3.4.4) 
 parallel       3.4.4      2018-03-15 local          
 pillar         1.2.3      2018-05-25 CRAN (R 3.4.4) 
 pkgconfig      2.0.1      2017-03-21 CRAN (R 3.4.0) 
 pls            2.6-0      2016-12-18 CRAN (R 3.4.0) 
 plyr           1.8.4      2016-06-08 CRAN (R 3.4.0) 
 prodlim        2018.04.18 2018-04-18 CRAN (R 3.4.4) 
 psych          1.8.4      2018-05-06 CRAN (R 3.4.4) 
 purrr          0.2.5      2018-05-29 CRAN (R 3.4.4) 
 R6             2.2.2      2017-06-17 CRAN (R 3.4.0) 
 ranger       * 0.10.1     2018-06-04 CRAN (R 3.4.4) 
 Rcpp           0.12.18    2018-07-23 cran (@0.12.18)
 RcppRoll       0.3.0      2018-06-05 CRAN (R 3.4.4) 
 recipes      * 0.1.3      2018-06-16 CRAN (R 3.4.4) 
 reshape2       1.4.3      2017-12-11 CRAN (R 3.4.3) 
 rlang          0.2.1      2018-05-30 CRAN (R 3.4.4) 
 robustbase     0.93-1     2018-06-23 CRAN (R 3.4.4) 
 rpart          4.1-13     2018-02-23 CRAN (R 3.4.4) 
 rstudioapi     0.7        2017-09-07 CRAN (R 3.4.1) 
 scales         0.5.0      2017-08-24 CRAN (R 3.4.1) 
 sfsmisc        1.1-2      2018-03-05 CRAN (R 3.4.4) 
 splines        3.4.4      2018-03-15 local          
 stats        * 3.4.4      2018-03-15 local          
 stats4         3.4.4      2018-03-15 local          
 stringi        1.2.4      2018-07-20 cran (@1.2.4)  
 stringr        1.3.1      2018-05-10 CRAN (R 3.4.4) 
 survival       2.42-4     2018-06-30 CRAN (R 3.4.4) 
 tibble         1.4.2      2018-01-22 CRAN (R 3.4.3) 
 tidyr          0.8.1      2018-05-18 CRAN (R 3.4.4) 
 tidyselect     0.2.4      2018-02-26 CRAN (R 3.4.3) 
 timeDate       3043.102   2018-02-21 CRAN (R 3.4.3) 
 tools          3.4.4      2018-03-15 local          
 utils        * 3.4.4      2018-03-15 local          
 withr          2.1.2      2018-03-15 CRAN (R 3.4.4) 
 yaml           2.2.0      2018-07-25 cran (@2.2.0)  

topepo commented Aug 17, 2018

Definitely a bug (and thanks for the detailed diagnosis).

I'm looping back around to caret next week so I'll fix that soon.


sheffe commented Aug 17, 2018

@topepo thanks, as always! Appreciate the confirmation.


sheffe commented Aug 29, 2018

I just had a related issue come up -- I'm betting you're on top of this, but posting just in case. The original bug I reported is easy enough to sidestep by prepping (outside of caret) a second recipe on the same data rows finalModel would have seen, but there are related cases where the issue extends beyond specifying an indexFinal subset instead of the full dataset.

Everything works as expected if a recipe specifies (a) only steps with deterministic outputs, or (b) the user sets a seed inside any steps that involve random number generation. I was experimenting with a step_upsample that came before a step_pca and, with no obvious way to set a seed inside that step, realized there's no way to exactly reproduce what finalModel saw when I prep the recipe outside caret.

However, this also means the bug is larger than I previously realized and reported. The root issue is that the final recipe prep, where trained_rec is generated, happens outside the call that produces finalModel. So even if the input data is exactly the same for finalModel and trained_rec (because no indexFinal is specified), users will still be exposed to this. Imbalanced class sampling won't be reproduced, in particular, but the same goes for some new things I'd wanted to try (like a step_umap built on tkonopka's package). I think the class sampling steps are the most likely user-facing issue.
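
To illustrate (a minimal sketch reusing the reprex objects; step_upsample resamples rows at prep time, so repeated preps on identical data need not agree):

# Upsampling happens at prep time, so two preps of the same recipe on the same
# data can feed different rows into the downstream PCA:
up_rec <- recipe(Species ~ ., data = iris_noisy) %>%
  step_upsample(Species) %>%
  step_pca(all_predictors(), threshold = 0.7)

prep_1 <- prep(up_rec, training = iris_noisy, retain = TRUE)
prep_2 <- prep(up_rec, training = iris_noisy, retain = TRUE)

# The trained PCA loadings (and even the number of retained PCs) are not
# guaranteed to match between prep_1 and prep_2, and neither is guaranteed to
# match whatever finalModel saw inside train().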


topepo commented Aug 30, 2018

Can you open an issue in the recipes repo for this? We can add a seed option that isolates those random steps with a specific seed to ensure (or at least encourage) reproducibility.
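
(A hypothetical sketch of what such a seed option might look like; no such argument existed in the released recipes at the time of this comment:)

up_rec <- recipe(Species ~ ., data = iris_noisy) %>%
  step_upsample(Species, seed = 1234) %>%  # 'seed' is the hypothetical argument
  step_pca(all_predictors(), threshold = 0.7)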

<edit> and if you want to make step_umap that would be 👍


sheffe commented Aug 30, 2018

@topepo on it. (And I'm pretty excited to try step_umap asap! Tomasz built a completely-in-R parallel for transforming new values to an existing UMAP embedding, so there's not even a reticulate/scikit-learn dependency now.)


topepo commented Nov 15, 2018

It turns out that the final model was built on a recipe with the correct data but that recipe was not saved in the final train object.

Fixed:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(ranger)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
data(iris)

sample_nrow <- 1000 # how many extra rows we're creating from iris for training
indexFinal_nrow <- 50 # how many rows of the full set go to indexFinal

set.seed(1234)
iris_noisy <- iris %>%
  dplyr::sample_n(size = sample_nrow, replace = TRUE) %>%
  mutate_at(vars(starts_with("Sepal"), starts_with("Petal")),
            funs(plusrand_1 = . + rnorm(n = sample_nrow, mean = 0, sd = 10),
                 plusrand_2 = . + rnorm(n = sample_nrow, mean = 0, sd = 100),
                 plusrand_3 = . + rnorm(n = sample_nrow, mean = 0, sd = 1000)))

iris_recipe <- recipe(Species ~ .,
                      data = iris_noisy) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_pca(all_predictors(), threshold = 0.7)

iris_rf <- 
  train(iris_recipe, 
        data = iris_noisy,
        method = "ranger", 
        # I had to add this so ranger wouldn't fail
        tuneGrid = data.frame(mtry = 1:3, min.node.size = 5, splitrule = "gini"),
        trControl = trainControl(method = "boot",
                                 number = 10,
                                 indexFinal = c(1:indexFinal_nrow)))
#> Loading required namespace: e1071
iris_rf$recipe
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         16
#> 
#> Training data contained 50 data points and no missing data.
#> 
#> Operations:
#> 
#> Centering for Sepal.Length, Sepal.Width, ... [trained]
#> Scaling for Sepal.Length, Sepal.Width, ... [trained]
#> PCA extraction with Sepal.Length, Sepal.Width, ... [trained]

Created on 2018-11-14 by the reprex package (v0.2.1)

I'm merging in a ton of PRs so you might need to wait a few days before reinstalling to test this.
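
(A quick check of the fix, reusing the reprex objects: the returned recipe should now match one prepped manually on the indexFinal rows.)

# Expected TRUE after the fix; predict(iris_rf, newdata = iris_noisy) should
# also no longer error:
all.equal(iris_rf$recipe,
          prep(iris_recipe, training = iris_noisy[1:indexFinal_nrow, ]))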

topepo closed this as completed Nov 15, 2018