Explicit seed setting in worker processes #275

Merged (7 commits) on Sep 14, 2020

@topepo topepo commented Sep 14, 2020

For models that have seed arguments, parallel processing currently gives irreproducible results because the seeds in the worker processes are not controlled. This PR generates a set of seeds as soon as tune_grid() or fit_resamples() is called, then sets them inside the iter_*() functions. This is the same approach that caret uses. Unit tests will live in the extratests package.

For example:

library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

## -----------------------------------------------------------------------------

rf_model <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wflow <-
  workflow() %>%
  add_formula(
    Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
      Latitude + Longitude) %>%
  add_model(rf_model)

set.seed(55)
ames_folds <- vfold_cv(ames_train, v = 10)

keep_pred <- control_resamples(save_pred = TRUE)

## -----------------------------------------------------------------------------

set.seed(130)
rf_res_seq_1 <-
  rf_wflow %>%
  fit_resamples(resamples = ames_folds, control = keep_pred)

set.seed(130)
rf_res_seq_2 <-
  rf_wflow %>%
  fit_resamples(resamples = ames_folds, control = keep_pred)

## -----------------------------------------------------------------------------

library(doParallel)
cl <- makePSOCKcluster(10)
registerDoParallel(cl)

set.seed(130)
rf_res_par_1 <-
  rf_wflow %>%
  fit_resamples(resamples = ames_folds, control = keep_pred)

set.seed(130)
rf_res_par_2 <-
  rf_wflow %>%
  fit_resamples(resamples = ames_folds, control = keep_pred)

## -----------------------------------------------------------------------------

all.equal(bind_rows(rf_res_seq_1$.metrics),
          bind_rows(rf_res_seq_2$.metrics))
# CRAN tune results:
# [1] TRUE

# new results
# [1] TRUE

all.equal(bind_rows(rf_res_par_1$.metrics),
          bind_rows(rf_res_par_2$.metrics))
# CRAN tune results:
# [1] "Component “.estimate”: Mean relative difference: 0.001654902"

# new results
# [1] TRUE

all.equal(bind_rows(rf_res_seq_1$.metrics),
          bind_rows(rf_res_par_1$.metrics))

# CRAN tune results:
# [1] "Component “.estimate”: Mean relative difference: 0.001052047"

# new results
# [1] TRUE
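
The mechanism described at the top of this PR (draw the seeds once in the main process, then set one explicitly inside each worker iteration) can be sketched in base R as follows. This is only an illustration under stated assumptions, not tune's actual code: make_seeds(), fit_one() and run_resamples() are hypothetical names, the seed range of 10^5 is an assumed value, and rnorm() stands in for a model fit that uses random numbers.

```r
library(parallel)

# Hypothetical helper: draw one integer seed per resample in the main
# process, so the seeds depend only on the caller's set.seed().
# The upper bound of 10^5 is an assumption made for this sketch.
make_seeds <- function(n) {
  sample.int(10^5, n)
}

# Hypothetical stand-in for fitting a model on resample i.
# The seed is set explicitly, so the result does not depend on
# whatever RNG state the worker process happens to have.
fit_one <- function(i, seeds) {
  set.seed(seeds[i])
  mean(rnorm(100))
}

# Run all iterations, sequentially or on a cluster, with the same seeds.
run_resamples <- function(n, cl = NULL) {
  seeds <- make_seeds(n)
  if (is.null(cl)) {
    vapply(seq_len(n), fit_one, numeric(1), seeds = seeds)
  } else {
    unlist(parLapply(cl, seq_len(n), fit_one, seeds = seeds))
  }
}

set.seed(130)
seq_res <- run_resamples(5)

cl <- makePSOCKcluster(2)
set.seed(130)
par_res_1 <- run_resamples(5, cl)
set.seed(130)
par_res_2 <- run_resamples(5, cl)
stopCluster(cl)

# Both comparisons should hold when the workers run the same R version
# and RNG kind as the main process.
identical(par_res_1, par_res_2)
identical(seq_res, par_res_1)
```

Because every iteration receives its seed from the main process, the number of workers and the order in which they pick up iterations should not affect the results, which is the behaviour the all.equal() checks above are testing.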

topepo added a commit to tidymodels/extratests that referenced this pull request Sep 14, 2020
@topepo topepo merged commit beb7b14 into master Sep 14, 2020
@topepo topepo deleted the explicit-seed-setting branch September 14, 2020 23:26
github-actions bot commented Mar 6, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2021