First call to createFolds() not reproducible... despite set.seed() call #452
I can reproduce this but initially had no idea what the issue could be. That function uses
This code doesn't have the same issue:
lapply(1:5, function(x) {
  set.seed(437234)
  mean(randomForest::randomForest(Ozone ~ ., data=airquality, mtry=3,
                                  importance=TRUE, na.action=na.omit)$mse)
})
However, just referencing a function via namespace will load it and the related dependencies. For example, with a clean R session, I have:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
If I just print the function by typing
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 magrittr_1.5 splines_3.3.1 MASS_7.3-45
[5] munsell_0.4.3 colorspace_1.2-6 lattice_0.20-33 foreach_1.4.3
[9] minqa_1.2.4 stringr_1.0.0 car_2.1-2 plyr_1.8.4
[13] tools_3.3.1 nnet_7.3-12 pbkrtest_0.4-6 parallel_3.3.1
[17] caret_6.0-70 grid_3.3.1 gtable_0.2.0 nlme_3.1-128
[21] mgcv_1.8-12 quantreg_5.26 MatrixModels_0.4-1 iterators_1.0.8
[25] lme4_1.1-12 Matrix_1.2-6 nloptr_1.0.4 reshape2_1.4.1
[29] ggplot2_2.1.0 codetools_0.2-14 stringi_1.1.1 scales_0.4.0
[33] stats4_3.3.1 SparseM_1.7
34 more packages! Surprise! My theory is that loading these packages via namespace is, at some point, using the random number stream in-between
To test this, let's do a clean boot and load them one at a time and see what happens:
pkgs <- c('Rcpp', 'magrittr', 'splines', 'MASS',
          'munsell', 'colorspace', 'lattice',
          'foreach', 'minqa', 'stringr', 'car',
          'plyr', 'tools', 'nnet', 'pbkrtest',
          'parallel', 'caret', 'grid', 'gtable',
          'nlme', 'mgcv', 'quantreg', 'MatrixModels',
          'iterators', 'lme4', 'Matrix', 'nloptr',
          'reshape2', 'ggplot2', 'codetools', 'stringi',
          'scales', 'stats4', 'SparseM')
expected <- c(31, 63, 1)
for(i in pkgs) {
  lapply(1:5, function(x) {
    set.seed(437234)
    loadNamespace(i)
    if(!all(sample(1:100, 3) == expected))
      stop(paste("failed after loading", i))
    #print(sessionInfo())
  })
}
It fails with
So I learned something today and....
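A more direct check than comparing sampled values is to snapshot the RNG state around the load. The following is only a sketch: the helper name `rng_touched_by_load` is made up for illustration (it is not part of caret or base R), and the result is only meaningful in a fresh session where the package is not yet loaded. Note that it calls `set.seed()`, so it resets the global RNG state as a side effect.

```r
# Hypothetical helper: TRUE if loading `pkg` changed the RNG state.
rng_touched_by_load <- function(pkg, seed = 437234) {
  set.seed(seed)                                      # creates/overwrites .Random.seed
  before <- get(".Random.seed", envir = globalenv())  # state right after seeding
  loadNamespace(pkg)                                  # no-op if already loaded
  after <- get(".Random.seed", envir = globalenv())   # state after loading
  !identical(before, after)
}

# 'base' is always loaded, so loading it again cannot consume random numbers.
already_loaded <- rng_touched_by_load("base")
```

In a clean session, calling this on each entry of `pkgs` in turn should point at the package (or packages) that draw random numbers while loading.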
Thanks, @topepo. Great explanation. I learned something as well (that I probably should have known). I'll add a comment to our local team coding standards about referencing functions in this way and the potential implications.
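One way such a coding standard could be expressed, assuming the explanation above is right, is to force the namespace to load before seeding so that nothing sits between `set.seed()` and the first random draw. This is a minimal sketch; `seeded_call` is a hypothetical helper, not something caret provides.

```r
# Hypothetical helper: load first, seed second, then call pkg::fun(...).
seeded_call <- function(pkg, seed, fun, ...) {
  loadNamespace(pkg)               # any load-time side effects happen here
  set.seed(seed)                   # seed only after loading has finished
  f <- getExportedValue(pkg, fun)  # same lookup that pkg::fun performs
  f(...)
}

# Demonstrated with a base package so it runs without extra installs.
r1 <- seeded_call("stats", 437234, "runif", 3)
r2 <- seeded_call("stats", 437234, "runif", 3)
identical(r1, r2)

# With caret installed, the equivalent call would look like (not run here):
# folds <- seeded_call("caret", 437234, "createFolds", y, k = 5)
```

Simply calling `requireNamespace("caret")` or `library(caret)` at the top of a script, before any `set.seed()`, should have the same effect with less ceremony.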
update.packages(oldPkgs="caret", ask=FALSE)
sessionInfo()
First, my many, many thanks for your wonderful contributions to the R community. caret has saved me many hours over the years.
The issue I've found occurs only on the first call to createFolds() after a fresh R session (or a restart). The second and all subsequent iterations of the call will behave consistently. This error has been reproduced on versions of MRO from 3.2.3 - 3.2.5 and GNU R from 3.2.3 - 3.3.1 using caret 6.0-52, 6.0-64, and 6.0-70 (current CRAN version). The error does not occur when the package is loaded via library(), but does occur after a fresh restart if the createFolds() function is called via caret::createFolds(). Unfortunately in this case, the latter convention has become a standard of mine. Again, the inconsistent result only happens on the very first call after a restart and only when using the :: calling convention.
I only found this error after banging my head against my desk reviewing some model results in a rather nasty nested CV that I was trying to step back through without a full restart. Of course, since I had not restarted, I got the normal/expected result on run 2, 3, ... , etc., so I was totally baffled why I could not reproduce my results on the batch run I'd just completed. With another restart, I saw the difference in my outer fold assignments on the first run only. Some colleagues that were still in the office have reproduced the error using the junk example below, but we each played with several variations (e.g., straight calls one after the other, apply(), for{}, etc.) and found the issue present in every setup. It's late, so perhaps I'm missing something...
Minimal, runnable code:
Example run:
Session Info: