recipe returned by train != recipe trained for finalModel when indexFinal is specified #928
Definitely a bug (and thanks for the detailed diagnosis). I'm looping back around to it.
@topepo thanks, as always! Appreciate the confirmation.
I just had a related issue come up -- I'm betting you're on top of this, but posting just in case. The original bug I reported is easy enough to sidestep by prepping (outside of caret) a second recipe on the correct data rows. Everything works as expected if a recipe specifies (a) only steps that have deterministic outputs, or (b) the user sets a seed inside any steps that do involve some random number generation. However, this also means that the bug is larger than I previously realized/reported. The root issue is that the final recipe prep, where trained_rec is generated, happens outside the call that produces finalModel.
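For anyone hitting this before a fix lands, the sidestep described above can be sketched roughly like this. This is a hypothetical workaround, not part of caret's API; `my_recipe`, `full_data`, `final_rows`, and `new_df` are placeholder names for your own objects:

```r
library(recipes)

# Suppose trainControl() was given indexFinal = final_rows.
# Prep a second copy of the recipe on exactly those rows, outside caret,
# so it matches what finalModel actually saw during the final fit:
rec_final <- prep(my_recipe, training = full_data[final_rows, ])

# Bake new data with the manually prepped recipe instead of the one
# stored in the object returned by train():
new_baked <- bake(rec_final, new_data = new_df)
```

The key point is that `prep()` is run on the same rows that `indexFinal` selects, so the trained steps (centering means, scaling SDs, retained PCA components) line up with what the final model was trained on.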
Can you put in an issue to the recipes repo for this? We can add a fix on that side.
@topepo on it. (And I'm pretty excited to try it out.)
It turns out that the final model was built on a recipe with the correct data, but that recipe was not saved in the final object returned by train. Fixed:

```r
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(ranger)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

data(iris)

sample_nrow <- 1000    # how many extra rows we're creating from iris for training
indexFinal_nrow <- 50  # how many rows of the full set go to indexFinal

set.seed(1234)
iris_noisy <- iris %>%
  dplyr::sample_n(size = sample_nrow, replace = TRUE) %>%
  mutate_at(vars(starts_with("Sepal"), starts_with("Petal")),
            funs(plusrand_1 = . + rnorm(n = sample_nrow, mean = 0, sd = 10),
                 plusrand_2 = . + rnorm(n = sample_nrow, mean = 0, sd = 100),
                 plusrand_3 = . + rnorm(n = sample_nrow, mean = 0, sd = 1000)))

iris_recipe <- recipe(Species ~ ., data = iris_noisy) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_pca(all_predictors(), threshold = 0.7)

iris_rf <-
  train(iris_recipe,
        data = iris_noisy,
        method = "ranger",
        # I had to add this so ranger wouldn't fail
        tuneGrid = data.frame(mtry = 1:3, min.node.size = 5, splitrule = "gini"),
        trControl = trainControl(method = "boot",
                                 number = 10,
                                 indexFinal = c(1:indexFinal_nrow)))
#> Loading required namespace: e1071

iris_rf$recipe
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         16
#> 
#> Training data contained 50 data points and no missing data.
#> 
#> Operations:
#> 
#> Centering for Sepal.Length, Sepal.Width, ... [trained]
#> Scaling for Sepal.Length, Sepal.Width, ... [trained]
#> PCA extraction with Sepal.Length, Sepal.Width, ... [trained]
```

Created on 2018-11-14 by the reprex package (v0.2.1)

I'm merging in a ton of PRs, so you might need to wait a few days before reinstalling to test this.
(First - thanks for caret, recipes, and the whole list of fun toys coming down the pipe. I use your tools every day!)

I'm a heavy user of custom resampling setups, specified in `trainControl` with `index`, `indexOut`, and `indexFinal`. I've switched most of my feature engineering logic to `recipes` run through `train` in (what I think is) the recommended way. Here's the issue I stumbled on: `train` returns a `recipe` object. I have always expected that object to be a prepped recipe, with prep done using the same training dataset as `finalModel`, so that you can use this for baking new data. This is true generally, but I think there's an exception when an `indexFinal` is specified in `trainControl`. It looks like (line 1003 in caret/train.default.R) the recipe train is done directly on the `data` input, so it bypasses the `indexFinal` subsetting.

The consequence is that `predict.train.recipe` fails whenever (by coincidence) the recipe prepped on full data has a different set of terms than the recipe prepped on the `indexFinal` subset (which are the terms `finalModel` knew about). The consequence is invisible unless the difference grows large enough that it leads to different terms being retained (like which PC components fall above/below the retained variance threshold). I've looked back at prior code and found differences between the returned prepped recipe and what `finalModel` would have seen, and only when an `indexFinal` was specified, but it was only last night that I saw the difference grow large enough to cause a mismatch in terms.

As far as I can tell, everything is correct in the recipe prep and training for `finalModel`. The reprex below demonstrates both of these cases.