Unreliable random numbers produced when using doFuture backend #377

mbac opened this issue May 2, 2021 · 9 comments · Fixed by #383

mbac commented May 2, 2021


I’m using doFuture as a foreach backend. This is an approximation of the code I’m trying to run:

Edit: forgot the doFuture code:


cores <- parallelly::availableCores()
# Only option available on Macs, as I understand it:

tr_te_split <- initial_split(cells %>% select(-case), prop = 3/4)
cell_train <- training(tr_te_split)
cell_test  <- testing(tr_te_split)

folds <- vfold_cv(cell_train, v = 10)

cell_rec <- recipe(
    class ~ .,
    data = cell_train
)

boost_forest_mod <- boost_tree(
    mtry = tune(),
    trees = tune(),
    min_n = tune(),
    learn_rate = tune(),
    tree_depth = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    stop_iter = tune()
) %>%
    set_engine("xgboost") %>%
    set_mode("classification")

workflow_cells <- workflow() %>%
    add_recipe(cell_rec) %>%
    add_model(boost_forest_mod)

workflow_cells_tuned <- workflow_cells %>%
    tune_grid(
        resamples = folds,
        grid = 20,
        metrics = metric_set(roc_auc, precision, recall)
    )

The tuning procedure seems to work, but I’m getting warnings for each iteration of the doFuture backend (I guess):

UNRELIABLE VALUE: One of the foreach() iterations (‘doFuture-7’) unexpectedly generated random numbers without
declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be
invalid. To fix this, use ‘%dorng%’ from the ‘doRNG’ package instead of ‘%dopar%’. This ensures that proper,
parallel-safe random numbers are produced via the L’Ecuyer-CMRG method. To disable this check, set option
‘future.rng.onMisuse’ to “ignore”.

My session info:

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] xgboost_1.4.1.1          rcompanion_2.4.0         doFuture_0.12.0-9000    
 [4] future_1.21.0            foreach_1.5.1            multilevelmod_0.0.0.9000
 [7] REDCapR_0.11.1.9004      dtplyr_1.1.0             readxl_1.3.1            
[10] yardstick_0.0.8          workflowsets_0.0.2       workflows_0.2.2         
[13] tune_0.1.5               rsample_0.0.9            recipes_0.1.16          
[16] parsnip_0.1.5.9002       modeldata_0.1.0          infer_0.5.4             
[19] dials_0.0.9.9000         scales_1.1.1             broom_0.7.6             
[22] tidymodels_0.1.3.9000    forcats_0.5.1            stringr_1.4.0           
[25] dplyr_1.0.5              purrr_0.3.4              readr_1.4.0             
[28] tidyr_1.1.3              tibble_3.1.1             ggplot2_3.3.3           
[31] tidyverse_1.3.1          pacman_0.5.1             devtools_2.4.0          
[34] usethis_2.0.1           

loaded via a namespace (and not attached):
  [1] backports_1.2.1    plyr_1.8.6         splines_4.0.4      listenv_0.8.0     
  [5] TH.data_1.0-10     digest_0.6.27      fansi_0.4.2        checkmate_2.0.0   
  [9] magrittr_2.0.1     memoise_2.0.0      remotes_2.3.0      globals_0.14.0    
 [13] modelr_0.1.8       gower_0.2.2        matrixStats_0.58.0 sandwich_3.0-0    
 [17] hardhat_0.1.5      prettyunits_1.1.1  colorspace_2.0-0   rvest_1.0.0       
 [21] haven_2.4.0        xfun_0.22          callr_3.7.0        crayon_1.4.1      
 [25] jsonlite_1.7.2     libcoin_1.0-8      Exact_2.1          zoo_1.8-9         
 [29] survival_3.2-10    iterators_1.0.13   glue_1.4.2         gtable_0.3.0      
 [33] ipred_0.9-11       pkgbuild_1.2.0     mvtnorm_1.1-1      DBI_1.1.1         
 [37] Rcpp_1.0.6         GPfit_1.0-8        proxy_0.4-25       stats4_4.0.4      
 [41] lava_1.6.9         prodlim_2019.11.13 httr_1.4.2         modeltools_0.2-23 
 [45] ellipsis_0.3.2     farver_2.1.0       pkgconfig_2.0.3    multcompView_0.1-8
 [49] nnet_7.3-15        dbplyr_2.1.1       utf8_1.2.1         labeling_0.4.2    
 [53] tidyselect_1.1.1   rlang_0.4.11       DiceDesign_1.9     munsell_0.5.0     
 [57] cellranger_1.1.0   tools_4.0.4        cachem_1.0.4       cli_2.5.0         
 [61] generics_0.1.0     EMT_1.1            fastmap_1.1.0      processx_3.5.2    
 [65] knitr_1.33         fs_1.5.0           coin_1.4-1         rootSolve_1.8.2.1 
 [69] tictoc_1.0         xml2_1.3.2         compiler_4.0.4     rstudioapi_0.13   
 [73] curl_4.3.1         e1071_1.7-6        testthat_3.0.2     reprex_2.0.0      
 [77] lhs_1.1.1          DescTools_0.99.41  stringi_1.5.3      ps_1.6.0          
 [81] desc_1.3.0         lattice_0.20-41    Matrix_1.3-2       conflicted_1.0.4  
 [85] vctrs_0.3.8        pillar_1.6.0       lifecycle_1.0.0    furrr_0.2.2       
 [89] lmtest_0.9-38      data.table_1.14.0  lmom_2.8           R6_2.5.0          
 [93] parallelly_1.25.0  gld_2.6.2          sessioninfo_1.1.1  codetools_0.2-18  
 [97] boot_1.3-27        MASS_7.3-53.1      assertthat_0.2.1   pkgload_1.2.1     
[101] rprojroot_2.0.2    nortest_1.0-4      withr_2.4.2        multcomp_1.4-17   
[105] expm_0.999-6       parallel_4.0.4     hms_1.0.0          grid_4.0.4        
[109] rpart_4.1-15       timeDate_3043.102  class_7.3-18       pROC_1.17.0.1     
[113] lubridate_1.7.10
In #349 we switched to generating seeds with L'Ecuyer-CMRG, which is parallel safe. I am pretty sure this is a false positive warning here, but is there a way for us to not trigger this warning?

It looks like doFuture hard-codes the seed argument of its future() call to FALSE, which triggers the warning whenever the expression it runs in parallel touches the RNG.
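
To make the mechanism concrete, here is a minimal sketch of how such a check can work in principle: compare .Random.seed before and after evaluating an expression. The helper name uses_rng() is hypothetical, and this is only an illustration of the idea, not the code future or doFuture actually run.

```r
# Hypothetical helper: reports whether evaluating `expr` changed the
# global RNG state. Illustrative only; not future's implementation.
uses_rng <- function(expr) {
  get_seed <- function() {
    get0(".Random.seed", envir = globalenv(), inherits = FALSE)
  }
  before <- get_seed()   # NULL if the RNG has not been used yet
  force(expr)            # evaluate the (lazily passed) expression
  after <- get_seed()
  !identical(before, after)
}

no_rng   <- uses_rng(sum(1:10))  # deterministic code
with_rng <- uses_rng(runif(1))   # draws a random number
```

A check of this kind cannot tell whether the RNG was seeded in a parallel-safe way beforehand; it only sees that the state moved, which is why correctly seeded code can still be flagged.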

If the user is using %dorng% through doRNG, it looks like an exception is made for that - and for BiocParallel?

Unfortunately it doesn't look like there is another way to say "hey we are already using parallel safe rng here", even though we are.

We generate L'Ecuyer-CMRG seeds in the main process here:

seeds <- generate_seeds(rng, n_resamples)

And assign them here:

assign(".Random.seed", seed, envir = globalenv())
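
As a rough sketch of that pattern (not tune's actual generate_seeds() implementation), one L'Ecuyer-CMRG stream per task can be derived in the main process with parallel::nextRNGStream() and then installed before each task runs. The names make_stream_seeds() and draw_with_seed() are hypothetical, and lapply() stands in for the parallel workers.

```r
# Hypothetical helper: build `n` independent L'Ecuyer-CMRG seeds in the
# main process, leaving the caller's RNG kind and state untouched.
make_stream_seeds <- function(n, seed = 123L) {
  has_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (has_seed) get(".Random.seed", envir = globalenv()) else NULL
  old_kind <- RNGkind()
  on.exit({
    do.call(RNGkind, as.list(old_kind))
    if (has_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  })

  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  s <- get(".Random.seed", envir = globalenv())
  seeds <- vector("list", n)
  for (i in seq_len(n)) {
    seeds[[i]] <- s
    s <- parallel::nextRNGStream(s)  # jump ahead to the next stream
  }
  seeds
}

# Hypothetical task: install the pre-generated seed, then draw numbers.
# In a real setup this body would run on a worker.
draw_with_seed <- function(seed) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(2)
}

seeds <- make_stream_seeds(3)
res1 <- lapply(seeds, draw_with_seed)
res2 <- lapply(seeds, draw_with_seed)
```

Because each task installs its own pre-computed stream, the draws depend only on the seed list and not on which worker ran which task or in what order.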

@HenrikBengtsson do you have any thoughts on how we can avoid the warning? We can't use doRNG directly, because we use nested foreach loops and those aren't supported by doRNG. Instead, we do what is suggested by doRNG's vignette in section 5.1 here

topepo commented May 4, 2021

Could we run the foreach blocks within a with_options(list(future.rng.onMisuse = "ignore"))?

> Could we run the foreach blocks within a with_options(list(future.rng.onMisuse = "ignore"))?

Yes, that was going to be my suggestion. Or, slightly better, set doFuture.rng.onMisuse = "ignore", which is more specific about what it targets.
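
For illustration, a scoped version of that could look like the sketch below. It uses only base R and mimics what a with_options()-style wrapper does: set the option, evaluate the code, restore the previous value. The wrapper name with_rng_check_ignored() is hypothetical, and the getOption() call stands in for the foreach block.

```r
# Hypothetical wrapper: evaluate `code` with the doFuture RNG check
# silenced, restoring the previous option value afterwards.
with_rng_check_ignored <- function(code) {
  old <- options(doFuture.rng.onMisuse = "ignore")
  on.exit(options(old), add = TRUE)
  force(code)  # `code` is evaluated lazily, i.e. after the option is set
}

# Placeholder for the foreach() block: just read the option back.
inside <- with_rng_check_ignored(getOption("doFuture.rng.onMisuse"))
```

Scoping the option this way keeps the check active for any other foreach() code the user runs in the same session.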

The long-term real solution for foreach and pRNG? It's in RevolutionAnalytics/foreach#6

JalalAl-Tamimi commented May 21, 2021

Hi all, I am following up on this to report a potential bug. I ran the code below with "registerDoRNG(123456)", and whether I set "parallel_over" to NULL, "resamples", or "everything", I get an error in the final step:

> wkfl_tidym_final <- last_fit(wkfl_tidym_best, split = tr_te_split)
Error in (function (obj, ex)  : 
  nested/conditional foreach loops are not supported yet.
See the package's vignette for a work around.

If instead, I use 
I do not get the error. However, this does not let me replicate the results across separate R sessions.

I thought one needs to use:

with the doRNG package, no?


Here is a working example:

doFuture.rng.onMisuse = "ignore"
ncores <- availableCores()
cat(paste0("Number of cores available for model calculations set to ", ncores, "."))
cl <- parallelly::makeClusterPSOCK(ncores)
plan(cluster, workers = cl)


tr_te_split <- initial_split(iris, strata = "Species", prop = 3/4)
iris_train <- training(tr_te_split)
iris_test  <- testing(tr_te_split)

folds <- vfold_cv(iris_train, v = 10)

iris_rec <- iris_train %>%
  recipe(Species ~ .)

engine_tidym <- rand_forest(
  mode = "classification",
  mtry = 2,
  trees = 500,
  min_n = 1
) %>% 
  set_engine("ranger", importance = "permutation", sample.fraction = 0.632,
             replace = FALSE, write.forest = T, splitrule = "extratrees",
             scale.permutation.importance = FALSE) # we add engine specific settings

workflow_iris <- workflow() %>%
  add_recipe(iris_rec) %>%
  add_model(engine_tidym)

workflow_iris_tuned <- workflow_iris %>%
  tune_grid(
    resamples = folds,
    grid = 2,
    metrics = metric_set(roc_auc, precision, recall),
    control = control_grid(save_pred = TRUE, parallel_over = "everything")
  )

grid_tidym_best <- select_best(workflow_iris_tuned, metric = "roc_auc")
wkfl_tidym_best <- finalize_workflow(workflow_iris, grid_tidym_best)
wkfl_tidym_final <- last_fit(wkfl_tidym_best, split = tr_te_split)

topepo commented May 28, 2021

You do not need to use doRNG; if you comment out the line calling registerDoRNG(), you will be fine.

The warning about unreliable values is a false positive; we manually set the seeds inside the workers so that we can get reproducible results.

And silence the warning with:

options(doFuture.rng.onMisuse = "ignore")

JalalAl-Tamimi commented May 28, 2021 via email

