
Unreliable random numbers produced when using doFuture backend #377

Closed
mbac opened this issue May 2, 2021 · 9 comments · Fixed by #383
Labels
upkeep maintenance, infrastructure, and similar

Comments

@mbac

mbac commented May 2, 2021

Hi,

I’m using doFuture as a foreach backend. This is an approximation of the code I’m trying to run (edited to include the doFuture setup):

library(tidymodels)
library(doFuture)

cores <- parallelly::availableCores()
registerDoFuture()
# Only option available on Macs, as I understand it:
plan("multisession")

data(cells)
set.seed(2369)
tr_te_split <- initial_split(cells %>% select(-case), prop = 3/4)
cell_train <- training(tr_te_split)
cell_test  <- testing(tr_te_split)

set.seed(1697)
folds <- vfold_cv(cell_train, v = 10)

cell_rec <- recipe(
    class ~ .,
    data = cell_train
)

boost_forest_mod <- boost_tree(
    mtry = tune(),
    trees = tune(),
    min_n = tune(),
    learn_rate = tune(),
    tree_depth = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    stop_iter = tune()
) %>%
    set_engine("xgboost") %>%
    set_mode("classification")

workflow_cells <- workflow() %>%
    add_recipe(cell_rec) %>%
    add_model(boost_forest_mod)

workflow_cells_tuned <- workflow_cells %>%
    tune_grid(
        folds,
        grid = 20,
        metrics = metric_set(roc_auc, precision, recall)
    )

The tuning procedure seems to work, but I’m getting this warning for each iteration of the doFuture backend:

UNRELIABLE VALUE: One of the foreach() iterations (‘doFuture-7’) unexpectedly generated random numbers without
declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be
invalid. To fix this, use ‘%dorng%’ from the ‘doRNG’ package instead of ‘%dopar%’. This ensures that proper,
parallel-safe random numbers are produced via the L’Ecuyer-CMRG method. To disable this check, set option
‘future.rng.onMisuse’ to “ignore”.

My session info:

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] xgboost_1.4.1.1          rcompanion_2.4.0         doFuture_0.12.0-9000    
 [4] future_1.21.0            foreach_1.5.1            multilevelmod_0.0.0.9000
 [7] REDCapR_0.11.1.9004      dtplyr_1.1.0             readxl_1.3.1            
[10] yardstick_0.0.8          workflowsets_0.0.2       workflows_0.2.2         
[13] tune_0.1.5               rsample_0.0.9            recipes_0.1.16          
[16] parsnip_0.1.5.9002       modeldata_0.1.0          infer_0.5.4             
[19] dials_0.0.9.9000         scales_1.1.1             broom_0.7.6             
[22] tidymodels_0.1.3.9000    forcats_0.5.1            stringr_1.4.0           
[25] dplyr_1.0.5              purrr_0.3.4              readr_1.4.0             
[28] tidyr_1.1.3              tibble_3.1.1             ggplot2_3.3.3           
[31] tidyverse_1.3.1          pacman_0.5.1             devtools_2.4.0          
[34] usethis_2.0.1           

loaded via a namespace (and not attached):
  [1] backports_1.2.1    plyr_1.8.6         splines_4.0.4      listenv_0.8.0     
  [5] TH.data_1.0-10     digest_0.6.27      fansi_0.4.2        checkmate_2.0.0   
  [9] magrittr_2.0.1     memoise_2.0.0      remotes_2.3.0      globals_0.14.0    
 [13] modelr_0.1.8       gower_0.2.2        matrixStats_0.58.0 sandwich_3.0-0    
 [17] hardhat_0.1.5      prettyunits_1.1.1  colorspace_2.0-0   rvest_1.0.0       
 [21] haven_2.4.0        xfun_0.22          callr_3.7.0        crayon_1.4.1      
 [25] jsonlite_1.7.2     libcoin_1.0-8      Exact_2.1          zoo_1.8-9         
 [29] survival_3.2-10    iterators_1.0.13   glue_1.4.2         gtable_0.3.0      
 [33] ipred_0.9-11       pkgbuild_1.2.0     mvtnorm_1.1-1      DBI_1.1.1         
 [37] Rcpp_1.0.6         GPfit_1.0-8        proxy_0.4-25       stats4_4.0.4      
 [41] lava_1.6.9         prodlim_2019.11.13 httr_1.4.2         modeltools_0.2-23 
 [45] ellipsis_0.3.2     farver_2.1.0       pkgconfig_2.0.3    multcompView_0.1-8
 [49] nnet_7.3-15        dbplyr_2.1.1       utf8_1.2.1         labeling_0.4.2    
 [53] tidyselect_1.1.1   rlang_0.4.11       DiceDesign_1.9     munsell_0.5.0     
 [57] cellranger_1.1.0   tools_4.0.4        cachem_1.0.4       cli_2.5.0         
 [61] generics_0.1.0     EMT_1.1            fastmap_1.1.0      processx_3.5.2    
 [65] knitr_1.33         fs_1.5.0           coin_1.4-1         rootSolve_1.8.2.1 
 [69] tictoc_1.0         xml2_1.3.2         compiler_4.0.4     rstudioapi_0.13   
 [73] curl_4.3.1         e1071_1.7-6        testthat_3.0.2     reprex_2.0.0      
 [77] lhs_1.1.1          DescTools_0.99.41  stringi_1.5.3      ps_1.6.0          
 [81] desc_1.3.0         lattice_0.20-41    Matrix_1.3-2       conflicted_1.0.4  
 [85] vctrs_0.3.8        pillar_1.6.0       lifecycle_1.0.0    furrr_0.2.2       
 [89] lmtest_0.9-38      data.table_1.14.0  lmom_2.8           R6_2.5.0          
 [93] parallelly_1.25.0  gld_2.6.2          sessioninfo_1.1.1  codetools_0.2-18  
 [97] boot_1.3-27        MASS_7.3-53.1      assertthat_0.2.1   pkgload_1.2.1     
[101] rprojroot_2.0.2    nortest_1.0-4      withr_2.4.2        multcomp_1.4-17   
[105] expm_0.999-6       parallel_4.0.4     hms_1.0.0          grid_4.0.4        
[109] rpart_4.1-15       timeDate_3043.102  class_7.3-18       pROC_1.17.0.1     
[113] lubridate_1.7.10
@juliasilge (Member)

In #349 we switched to generating seeds with L'Ecuyer-CMRG, which is parallel safe. I am pretty sure this is a false positive warning here, but is there a way for us to avoid triggering it?

@DavisVaughan (Member)

It looks like doFuture hard-codes the seed argument of the future() call it makes to FALSE, which triggers the warning if any RNG-manipulating code is run in the expression it runs in parallel.

If the user is using %dorng% through doRNG, it looks like an exception is made for that - and for BiocParallel?

https://github.com/HenrikBengtsson/doFuture/blob/e701db73afae56496b02c63b1a0cf3bd222bde46/R/doFuture.R#L214-L231

Unfortunately it doesn't look like there is another way to say "hey we are already using parallel safe rng here", even though we are.

We generate L'Ecuyer-CMRG seeds in the main process here:

seeds <- generate_seeds(rng, n_resamples)

And assign them here:

assign(".Random.seed", seed, envir = globalenv())

@HenrikBengtsson do you have any thoughts on how we can avoid the warning? We can't use doRNG directly, because we use nested foreach loops and those aren't supported by doRNG. Instead, we do what is suggested by doRNG's vignette in section 5.1 here https://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf
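A minimal sketch of the pattern described above (the doRNG vignette's section 5.1 approach): pre-generate one independent L'Ecuyer-CMRG stream per resample in the main process with base R's `parallel::nextRNGStream()`, then assign a stream to `.Random.seed` inside each worker before any random numbers are drawn. The `generate_seeds` helper here is illustrative, not tune's actual internal function:

```r
# Pre-generate one independent L'Ecuyer-CMRG stream per resample in the
# main process. `generate_seeds` is an illustrative stand-in for tune's
# internal helper of the same name, not its actual implementation.
RNGkind("L'Ecuyer-CMRG")
set.seed(123)

generate_seeds <- function(n) {
  seeds <- vector("list", n)
  s <- .Random.seed
  for (i in seq_len(n)) {
    s <- parallel::nextRNGStream(s)  # advance to the next independent stream
    seeds[[i]] <- s
  }
  seeds
}

seeds <- generate_seeds(10)

# Then, inside worker i, before any random-number generation:
# assign(".Random.seed", seeds[[i]], envir = globalenv())
```

Because each stream is derived with `nextRNGStream()`, the workers draw from statistically independent substreams even though the check in doFuture cannot see that.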

@topepo (Member)

topepo commented May 4, 2021

Could we run the foreach blocks within a with_options(list(future.rng.onMisuse = "ignore"))?

@HenrikBengtsson

Could we run the foreach blocks within a with_options(list(future.rng.onMisuse = "ignore"))?

Yes, that was going to be my suggestion. Or, slightly better, doFuture.rng.onMisuse = "ignore", which is more specific about what it targets.

The long-term real solution for foreach and pRNG? It's in RevolutionAnalytics/foreach#6
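A sketch of how that suggestion could be scoped to just the parallel block, using base `options()` with an `on.exit()` restore (the same effect as `withr::with_options()`); the option name `doFuture.rng.onMisuse` is the one suggested above, and `run_with_rng_check_off` is a hypothetical wrapper, not an existing tune function:

```r
# Temporarily silence the RNG misuse check only around the foreach block,
# restoring the caller's prior setting afterwards.
# `run_with_rng_check_off` is a hypothetical helper for illustration.
run_with_rng_check_off <- function(expr) {
  old <- options(doFuture.rng.onMisuse = "ignore")
  on.exit(options(old), add = TRUE)
  expr  # lazily evaluated here, with the option in effect
}

# Usage sketch:
# run_with_rng_check_off(foreach(i = 1:10) %dopar% { fit_resample(i) })
```

Scoping the option this way avoids globally disabling the check for the user's own foreach code.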

@juliasilge juliasilge added the upkeep maintenance, infrastructure, and similar label May 4, 2021
@JalalAl-Tamimi

JalalAl-Tamimi commented May 21, 2021

Hi all, I am following up on this to report a possible bug. When I run the code below with registerDoRNG(123456) and set parallel_over to NULL, "resamples", or "everything", I get an error in the final step:

> wkfl_tidym_final <- last_fit(wkfl_tidym_best, split = tr_te_split)
Error in (function (obj, ex)  : 
  nested/conditional foreach loops are not supported yet.
See the package's vignette for a work around.

If instead I use

set.seed(123456)

I do not get the error. However, that does not let me reproduce the results across separate R sessions.

I thought one needs to use

registerDoRNG(123456)

with the doRNG package, no?

thanks

Here is a working example:

library(tidymodels)
library(doFuture)
library(doRNG)
options(doFuture.rng.onMisuse = "ignore") # must be set via options(); a bare assignment has no effect
ncores <- availableCores()
cat(paste0("Number of cores available for model calculations set to ", ncores, "."))
registerDoFuture()
cl <- parallelly::makeClusterPSOCK(ncores)
plan(cluster, workers = cl)
ncores
cl

registerDoRNG(123456)

data(iris)
tr_te_split <- initial_split(iris, strata = "Species", prop = 3/4)
iris_train <- training(tr_te_split)
iris_test  <- testing(tr_te_split)

folds <- vfold_cv(iris_train, v = 10)

iris_rec <- iris_train %>% 
  recipe(
  Species ~ .) %>% 
  prep()

engine_tidym <- rand_forest(
  mode = "classification",
  mtry = 2,
  trees = 500,
  min_n = 1
) %>% 
  set_engine("ranger", importance = "permutation", sample.fraction = 0.632,
             replace = FALSE, write.forest = TRUE, splitrule = "extratrees",
             scale.permutation.importance = FALSE) # we add engine-specific settings

workflow_iris <- workflow() %>%
  add_recipe(iris_rec) %>%
  add_model(engine_tidym)

workflow_iris_tuned <- 
  tune_grid(
    workflow_iris,
    resamples = folds,
    grid = 2,
    metrics = metric_set(roc_auc, precision, recall),
    control = control_grid(save_pred = TRUE, parallel_over = "everything")
  )

collect_metrics(workflow_iris_tuned)
grid_tidym_best <- select_best(workflow_iris_tuned, metric = "roc_auc")
grid_tidym_best
wkfl_tidym_best <- finalize_workflow(workflow_iris, grid_tidym_best)
wkfl_tidym_final <- last_fit(wkfl_tidym_best, split = tr_te_split)

@topepo (Member)

topepo commented May 28, 2021

You do not need to use doRNG; if you comment out the line calling registerDoRNG(), you will be fine.

The warning about unreliable values is a false positive; we manually set the seeds inside the workers so that we can get reproducible results.
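In other words, an ordinary `set.seed()` in the main session is all the reproducibility setup needed, because tune derives the parallel-safe worker seeds deterministically from that session state. A minimal illustration of the session-level principle (base R only):

```r
# Seeding once in the main session pins the RNG state that all downstream
# draws (including tune's derived worker seeds) flow from deterministically.
set.seed(123456)
a <- runif(3)

set.seed(123456)
b <- runif(3)

identical(a, b)  # TRUE: the same seed reproduces the same stream
```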

@HenrikBengtsson

And silence the warning with:

options(doFuture.rng.onMisuse = "ignore")

@JalalAl-Tamimi

JalalAl-Tamimi commented May 28, 2021 via email

topepo added a commit that referenced this issue Jun 1, 2021
topepo added a commit that referenced this issue Jun 5, 2021
* Increment version number

* use future option for #377

* dev version bump

* news update

* one-line solution from Davis
@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 20, 2021