First call to createFolds() not reproducible... despite set.seed() call #452

Closed
gangstR opened this Issue Jul 7, 2016 · 2 comments

gangstR commented Jul 7, 2016

First, my many, many thanks for your wonderful contributions to the R community. caret has saved me many hours over the years.

The issue I've found occurs only on the first call to createFolds() after a fresh R session (or a restart). The second and all subsequent iterations of the call will behave consistently. This error has been reproduced on versions of MRO from 3.2.3 - 3.2.5 and GNU R from 3.2.3 - 3.3.1 using caret 6.0-52, 6.0-64, and 6.0-70 (current CRAN version). The error does not occur when the package is loaded via library(), but does occur after a fresh restart if the createFolds() function is called via caret::createFolds(). Unfortunately in this case, the latter convention has become a standard of mine. Again, the inconsistent result only happens on the very first call after a restart and only when using the :: calling convention.

I only found this error after banging my head against my desk reviewing some model results in a rather nasty nested CV that I was trying to step back through without a full restart. Of course, since I had not restarted, I got the normal/expected result on run 2, 3, ... , etc., so I was totally baffled why I could not reproduce my results on the batch run I'd just completed. With another restart, I saw the difference in my outer fold assignments on the first run only. Some colleagues that were still in the office have reproduced the error using the junk example below, but we each played with several variations (e.g., straight calls one after the other, apply(), for{}, etc.) and found the issue present in every setup. It's late, so perhaps I'm missing something...

Minimal, runnable code:

# BAD: Restart and run this...
lapply(1:5, function(x) {
  set.seed(1234)
  head(caret::createFolds(mtcars$cyl, k=3, list=FALSE))
})
# Returns: 2 1 2 3 3 3 for the first iteration, then 3 2 3 1 3 2 for every
# later iteration, but only on the FIRST run after a restart.
# All subsequent resubmissions of the code return 3 2 3 1 3 2 throughout.

# GOOD: Now restart and run this
library(caret)
lapply(1:5, function(x) {
  set.seed(1234)
  head(createFolds(mtcars$cyl, k=3, list=FALSE))
})
# Returns: 3 2 3 1 3 2 for all iterations and all subsequent resubmissions of the code

Example run:

Restarting R session...

> # BAD: Restart and run this...
> lapply(1:5, function(x) {
+   set.seed(1234)
+   head(caret::createFolds(mtcars$cyl, k=3, list=FALSE))
+ })
[[1]]
[1] 2 1 2 3 3 3

[[2]]
[1] 3 2 3 1 3 2

[[3]]
[1] 3 2 3 1 3 2

[[4]]
[1] 3 2 3 1 3 2

[[5]]
[1] 3 2 3 1 3 2

Restarting R session...

> # GOOD: Now restart and run this
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> lapply(1:5, function(x) {
+   set.seed(1234)
+   head(createFolds(mtcars$cyl, k=3, list=FALSE))
+ })
[[1]]
[1] 3 2 3 1 3 2

[[2]]
[1] 3 2 3 1 3 2

[[3]]
[1] 3 2 3 1 3 2

[[4]]
[1] 3 2 3 1 3 2

[[5]]
[1] 3 2 3 1 3 2

Session Info:

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] RevoUtilsMath_3.2.5 Rfiglet_1.0         fortunes_1.5-2     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5        magrittr_1.5       splines_3.2.5      MASS_7.3-45        munsell_0.4.3     
 [6] colorspace_1.2-6   lattice_0.20-33    foreach_1.4.3      minqa_1.2.4        stringr_1.0.0     
[11] car_2.1-1          plyr_1.8.4         tools_3.2.5        parallel_3.2.5     nnet_7.3-12       
[16] pbkrtest_0.4-6     caret_6.0-70       grid_3.2.5         gtable_0.2.0       nlme_3.1-125      
[21] mgcv_1.8-12        quantreg_5.21      MatrixModels_0.4-1 iterators_1.0.8    lme4_1.1-11       
[26] Matrix_1.2-4       nloptr_1.0.4       reshape2_1.4.1     ggplot2_2.1.0      codetools_0.2-14  
[31] stringi_1.1.1      scales_0.4.0       stats4_3.2.5       SparseM_1.7 

topepo (Owner) commented Jul 14, 2016

I can reproduce this but initially had no idea what the issue could be. That function uses sample without any trickery.

This code doesn't have the same issue:

lapply(1:5, function(x) {
  set.seed(437234)
  mean(randomForest::randomForest(Ozone ~ ., data=airquality, mtry=3,
                                  importance=TRUE, na.action=na.omit)$mse)
})

However, just referencing a function via its namespace (pkg::fun) loads that package's namespace along with the namespaces of its dependencies. For example, with a clean R session, I have:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base 

If I just print the function by typing caret::createFolds then look at the session, I get

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5        magrittr_1.5       splines_3.3.1      MASS_7.3-45       
 [5] munsell_0.4.3      colorspace_1.2-6   lattice_0.20-33    foreach_1.4.3     
 [9] minqa_1.2.4        stringr_1.0.0      car_2.1-2          plyr_1.8.4        
[13] tools_3.3.1        nnet_7.3-12        pbkrtest_0.4-6     parallel_3.3.1    
[17] caret_6.0-70       grid_3.3.1         gtable_0.2.0       nlme_3.1-128      
[21] mgcv_1.8-12        quantreg_5.26      MatrixModels_0.4-1 iterators_1.0.8   
[25] lme4_1.1-12        Matrix_1.2-6       nloptr_1.0.4       reshape2_1.4.1    
[29] ggplot2_2.1.0      codetools_0.2-14   stringi_1.1.1      scales_0.4.0      
[33] stats4_3.3.1       SparseM_1.7   

34 more packages! Surprise!
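A quicker way to see this side effect than reading sessionInfo() output is to compare loadedNamespaces() before and after a `::` reference. This is only a minimal sketch: it uses the `tools` package as a stand-in (an assumption, chosen so the example runs without caret installed), and which namespaces appear in the difference depends on what the session has already loaded.

```r
# Namespaces loaded before any `::` reference
before <- loadedNamespaces()

# Merely evaluating pkg::name loads the package's namespace
# (it does not attach the package to the search path)
f <- tools::file_ext

after <- loadedNamespaces()

# Namespaces pulled in by the reference; empty if `tools` was already loaded
newly_loaded <- setdiff(after, before)
print(newly_loaded)

# The namespace is now loaded; whether the package is also attached
# depends on the rest of the session
isNamespaceLoaded("tools")
"package:tools" %in% search()
```

Swapping `tools::file_ext` for `caret::createFolds` in a fresh session should list the same kind of dependency cascade shown above.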

My theory is that loading these packages via namespace is, at some point, using the random number stream in-between set.seed(1234) and head(createFolds(mtcars$cyl, k=3, list=FALSE)).
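The mechanism behind this theory can be illustrated with base R alone. In the sketch below, `runif(1)` is a stand-in (an assumption, not what any particular package actually does) for code that draws from the random number stream while a namespace is loading. Any such draw between set.seed() and the sampling call advances the RNG state, so the sample that follows starts from a different point in the stream.

```r
# Baseline: seed, then sample immediately
set.seed(1234)
seed_clean <- .Random.seed
baseline <- sample(1:100, 3)

# Same seed, but something draws from the RNG before we sample.
# runif(1) is a stand-in for whatever load-time code might do.
set.seed(1234)
invisible(runif(1))
seed_after_draw <- .Random.seed
shifted <- sample(1:100, 3)

# Same seed with nothing in between reproduces the baseline
set.seed(1234)
repeated <- sample(1:100, 3)

identical(baseline, repeated)           # TRUE
identical(seed_clean, seed_after_draw)  # FALSE: the stream has advanced
print(rbind(baseline, shifted))
```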

To test this, let's start a clean R session, load the packages one at a time, and see what happens:

pkgs <- c('Rcpp', 'magrittr', 'splines', 'MASS', 
          'munsell', 'colorspace', 'lattice', 
          'foreach', 'minqa', 'stringr', 'car', 
          'plyr', 'tools', 'nnet', 'pbkrtest', 
          'parallel', 'caret', 'grid', 'gtable', 
          'nlme', 'mgcv', 'quantreg', 'MatrixModels', 
          'iterators', 'lme4', 'Matrix', 'nloptr', 
          'reshape2', 'ggplot2', 'codetools', 'stringi', 
          'scales', 'stats4', 'SparseM')

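# Expected result of sample(1:100, 3) immediately after set.seed(437234)
expected <- c(31, 63, 1)

for(i in pkgs) {
  lapply(1:5, function(x) {
    set.seed(437234)
    # If loading this namespace draws random numbers, the stream advances
    # and the sample below no longer matches `expected`
    loadNamespace(i)
    if(!all(sample(1:100, 3) == expected))
      stop(paste("failed after loading", i)) 
    #print(sessionInfo())
  })
}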

It fails with Error in FUN(X[[i]], ...) : failed after loading car
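The test above relies on hard-coded expected values, which are tied to the R version in use (the default algorithm behind sample() changed in R 3.6.0, so `c(31, 63, 1)` would not be expected on newer versions). A variant that avoids this is to compare the RNG state directly before and after loading. The helper `touches_rng` below is hypothetical, not part of caret or base R, and is only a sketch of that idea.

```r
# Hypothetical helper (not part of caret or base R): reports whether loading
# a namespace changes the RNG state. Only meaningful in a fresh session,
# because an already-loaded namespace is not loaded a second time.
touches_rng <- function(pkg, seed = 437234) {
  already_loaded <- isNamespaceLoaded(pkg)
  set.seed(seed)
  before <- get(".Random.seed", envir = globalenv())
  loadNamespace(pkg)
  after <- get(".Random.seed", envir = globalenv())
  c(already_loaded = already_loaded,
    rng_changed = !identical(before, after))
}

# `stats` is loaded by default in a standard session, so nothing changes here
touches_rng("stats")
```

Note that the helper reseeds the session as a side effect. Run in a fresh session over the `pkgs` vector, it would be expected to flag the same package as the loop above, though that has not been verified here.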

So I learned something today and....
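One practical consequence for code that uses the `::` convention: make sure the namespace is loaded before the seed is set, so that any load-time side effects happen first. The wrapper below is a hypothetical illustration (the name `seeded_call` and its arguments are assumptions, not an existing API); it is demonstrated with `stats::runif` so that it runs without caret installed.

```r
# Hypothetical helper: load the namespace (and its dependencies) BEFORE
# seeding, so load-time side effects cannot advance the random stream.
seeded_call <- function(seed, pkg, fun, ...) {
  loadNamespace(pkg)                # returns immediately if already loaded
  f <- getExportedValue(pkg, fun)   # the lookup that pkg::fun performs
  set.seed(seed)
  f(...)
}

# Runnable with base R only. With caret installed, the call from the report
# would be:
#   seeded_call(1234, "caret", "createFolds", mtcars$cyl, k = 3, list = FALSE)
first  <- seeded_call(1234, "stats", "runif", 3)
second <- seeded_call(1234, "stats", "runif", 3)
identical(first, second)
```

A lighter alternative is a single `requireNamespace("caret", quietly = TRUE)` (or `library(caret)`) at the top of the script, before any call to set.seed().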


gangstR commented Jul 15, 2016

Thanks, @topepo. Great explanation. I learned something as well (that I probably should have known). I'll add a comment to our local team coding standards about referencing functions in this way and the potential implications.
