
Problematic extra memory usage when tune_grid run in parallel compared to serial #384

Closed
dpanyard opened this issue Jun 6, 2021 · 3 comments · Fixed by #397

Comments

dpanyard commented Jun 6, 2021

The problem

I've been using tune_grid() for logistic regression with glmnet on a data set with many predictors (~50,000). When running tune_grid() without parallelization, the memory usage was within my expectations (~1 GB). However, when I ran the same code in parallel (doParallel with 20 cores), memory usage spiked: each core was using ~18 GB, so across 20 cores the total memory usage on the server was near 200 GB. Even with fewer cores, memory usage was too high when run in parallel, so for now I'm running my script without parallelization to keep memory usage manageable.

I'm not much of an expert on memory usage in R, and I'm new to tidymodels, so I may be misunderstanding something, but since the memory footprint of the non-parallel run was so much smaller on a per-core basis, it seemed like something was probably going wrong with the parallelization.

I've been able to create a reproducible example with a less extreme data set. Here, with 10,000 predictors, the serial run uses ~500 MB, but when run in parallel with 8 cores, each worker uses ~731 MB. This example seems to replicate what I was seeing with my real data set on a smaller scale; increasing the number of predictor variables shows larger discrepancies between the serial and parallel versions.

Any idea what might be going on? Thanks for your help!

Reproducible example

##### Libraries #####
require( "tidymodels" )
require( "foreach" )
require( "doParallel" )
require( "tictoc" )
require( "glmnet" )
require( "tidyverse" )


set.seed( 4235 )

# Create data set
predictor_names <- str_c( "pred", 1:10000 )
id <- 1:150
data <- expand_grid( id, predictor_names ) %>%
  mutate( value = rnorm( n = n() ) ) %>%
  pivot_wider( names_from = predictor_names, values_from = value ) %>%
  mutate( outcome = as.factor( as.character( rbernoulli( n(), p = 0.3 ) ) ) ) %>%
  select( id, outcome, everything() )

# Set up model training parameters
n_iters <- 8
prop_training <- 0.66
n_fold_hyperparameter <- 3
lambdas_range <- 10^seq( -2, 0, by = 0.5 )


##### Set model type and engine #####
glm_model <- logistic_reg( penalty = tune(), mixture = 0.5 ) %>%
  set_engine( "glmnet" )


##### Set data processing recipe #####
cur_recipe <- recipe( data ) %>%
  # General processing
  update_role( id, new_role = "ID" ) %>%
  
  # Specific outcome/predictor processing
  update_role( outcome, new_role = "outcome" ) %>%
  update_role( starts_with( "pred" ), new_role = "predictor" ) %>%
  step_zv( all_predictors() )


##### Set the workflow #####
cur_workflow <- workflow() %>%
  add_model( glm_model ) %>%
  add_recipe( cur_recipe )


##### Select training and testing data #####
cur_testing_training_splits <- initial_split( data, prop = prop_training, strata = outcome )

# Get testing data
cur_testing_data <- testing( cur_testing_training_splits )

# Get training data
cur_training_data <- training( cur_testing_training_splits )


##### Train the model #####
# Set training grid
cur_training_grid <- tibble( penalty = lambdas_range )

# Set CV
cur_cv_folds <- vfold_cv( cur_training_data, v = n_fold_hyperparameter, repeats = n_iters, strata = outcome )
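# Note: cur_cv_folds contains v * repeats = 3 * 8 = 24 resamples; each rsplit
# stores the full training data plus row indices, so within this single R
# session all of the splits reference the same underlying data object.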

# Train model on a single core
tic( "Train model without parallelization" )
cur_training_results <- cur_workflow %>%
  tune_grid( resamples = cur_cv_folds,
             grid = cur_training_grid,
             metrics = metric_set( roc_auc, pr_auc, accuracy, npv, ppv, yardstick::sensitivity, yardstick::specificity, yardstick::precision, yardstick::recall ) )
toc()
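
# One rough way to check R's own heap usage after the serial run (not necessarily
# how the ~500 MB figure above was measured): gc() reports only R-managed memory,
# so the OS-level process size will be somewhat larger. Column 2 of gc()'s output
# is the current usage in Mb; column 6 is the peak seen so far in this session.
c( current_mb = sum( gc()[, 2] ), peak_mb = sum( gc()[, 6] ) )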



##### Train the model in parallel #####
# Initialize cores for parallel processing
ncores <- 8
cl <- makeCluster( ncores )
registerDoParallel( cl )

# Train model with parallelization
tic( "Train model with parallelization" )
cur_training_results_parallel <- cur_workflow %>%
  tune_grid( resamples = cur_cv_folds,
             grid = cur_training_grid,
             metrics = metric_set( roc_auc, pr_auc, accuracy, npv, ppv, yardstick::sensitivity, yardstick::specificity, yardstick::precision, yardstick::recall ) )
toc()
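
# A similarly rough way to peek at each worker's R heap after the parallel run
# (again, not necessarily how the ~731 MB per-core figure was measured). With
# 8 tasks and 8 workers, each task should usually land on a different worker,
# though that is not strictly guaranteed, and gc() only sees R-managed memory.
worker_mem_mb <- foreach( i = seq_len( ncores ), .combine = rbind ) %dopar% {
  c( current_mb = sum( gc()[, 2] ), peak_mb = sum( gc()[, 6] ) )
}
worker_mem_mb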

# Clean up cores
stopCluster( cl )
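
One possibility that might reduce the per-worker footprint, sketched below under the assumption of a Linux/macOS machine and otherwise untested here, is a fork-based backend: forked workers share the parent session's memory copy-on-write instead of each receiving serialized copies of the objects they need. The object names reuse those from the reprex above.

##### Possible alternative: fork-based backend (Unix only; untested sketch) #####
# registerDoParallel( cores = ... ) uses forked worker processes on Linux/macOS
# (on Windows it falls back to a socket cluster, so it won't help there).
registerDoParallel( cores = ncores )

tic( "Train model with fork-based parallelization" )
cur_training_results_fork <- cur_workflow %>%
  tune_grid( resamples = cur_cv_folds,
             grid = cur_training_grid,
             metrics = metric_set( roc_auc, accuracy ) )  # abbreviated metric set, just for the sketch
toc()

# Return to sequential execution
registerDoSEQ()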

dpanyard commented Jun 7, 2021

Issue #376 may be related to what I'm encountering.

@juliasilge (Member)

Thanks for this report @dpanyard! 🙌

We do think that fixing #376 will go a long way toward solving problems like the one described here, since only the necessary data will be passed to the workers. It may not entirely eliminate what you're seeing, though: functions like rsample::vfold_cv() make resampling very cheap within a single R session, but much less so once the resamples have to be sent to multiple R sessions. Using parallel processing is generally going to be faster at a memory cost, and that cost looks especially exaggerated here because of how memory-efficient resampling is within one R session.
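
As a rough illustration of that memory cost, using the objects from the reprex above (lobstr is an extra package, used here only to measure object sizes): within one session the resamples share memory with the original data, but serializing even a single split, roughly what happens when it is shipped to a worker R session, copies approximately the entire training set.

# Rough illustration using the objects from the reprex above; lobstr is used
# only to measure object sizes and is not part of the original reprex.
library( lobstr )

obj_size( cur_training_data )   # size of the training data itself
obj_size( cur_cv_folds )        # about the same: in this session the splits reference the same data

# Serializing one split (roughly what happens when it is sent to a separate
# worker R session) copies approximately the entire training set.
length( serialize( cur_cv_folds$splits[[1]], NULL ) ) / 1e6   # approximate size in MB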

github-actions bot commented Aug 7, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators Aug 7, 2021