The problem
I've been using tune_grid() for logistic regression with glmnet on a data set with many predictors (~50,000). When running tune_grid() without parallelization, memory usage was in line with my expectations (~1 GB). However, when I ran the same code in parallel (doParallel with 20 cores), memory usage spiked: each core was using ~18 GB, and with 20 cores the total memory usage on the server was close to 200 GB. Even with fewer cores, the memory usage in parallel was too high, so for now I'm running my script without parallelization to keep memory usage under control.
I'm not much of an expert on memory usage in R, and I'm new to tidymodels, so I may be misunderstanding something. Still, since the non-parallel run had a much smaller per-process footprint, I figured something was probably going wrong with the parallelization.
I've been able to create a reproducible example with a less extreme data set. With 10,000 predictors, the serial run uses ~500 MB, but when run in parallel with 8 cores, the per-core usage increases to ~731 MB. This example seems to replicate what I was seeing with my real data set, just on a smaller scale; you can increase the number of predictor variables to see larger discrepancies between the serial and parallelized versions.
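In case it's useful for reproducing the measurements, here is a minimal sketch of how the resident memory of an R process can be checked from within R. It assumes a Linux/macOS system with the ps command available; rss_mb() is just an illustrative helper, not part of any package.

rss_mb <- function( pid = Sys.getpid() ) {
  # `ps -o rss=` reports the resident set size in kB; convert to MB
  as.numeric( system( paste( "ps -o rss= -p", pid ), intern = TRUE ) ) / 1024
}
# Call rss_mb() in the main session before/after tune_grid(), or inside a
# worker, to see how much memory each process is holding.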
Any idea what might be going on? Thanks for your help!
Reproducible example
##### Libraries #####
require( "tidymodels" )
require( "foreach" )
require( "doParallel" )
require( "tictoc" )
require( "glmnet" )
require( "tidyverse" )
set.seed( 4235 )
# Create data set
predictor_names <- str_c( "pred", 1:10000 )
id <- 1:150
data <- expand_grid( id, predictor_names ) %>%
mutate( value= rnorm( n= n() ) ) %>%
pivot_wider( names_from=predictor_names, values_from=value ) %>%
mutate( outcome= as.factor( as.character( rbernoulli( n(), p=0.3 ) ) ) ) %>%
select( id, outcome, everything() )
# Set up model training parameters
n_iters <- 8
prop_training <- 0.66
n_fold_hyperparameter <- 3
lambdas_range <- 10^seq( -2, 0, by=0.5 )
##### Set model type and engine #####
glm_model <- logistic_reg( penalty= tune(), mixture=0.5 ) %>%
set_engine( "glmnet" )
##### Set data processing recipe #####
cur_recipe <- recipe( data ) %>%
# General processing
update_role( id, new_role="ID" ) %>%
# Specific outcome/predictor processing
update_role( outcome, new_role="outcome" ) %>%
update_role( starts_with( "pred" ), new_role="predictor" ) %>%
step_zv( all_predictors() )
##### Set the workflow #####
cur_workflow <- workflow() %>%
add_model( glm_model ) %>%
add_recipe( cur_recipe )
##### Select training and testing data #####
cur_testing_training_splits <- initial_split( data, prop=prop_training, strata=outcome )
# Get testing data
cur_testing_data <- testing( cur_testing_training_splits )
# Get training data
cur_training_data <- training( cur_testing_training_splits )
##### Train the model #####
# Set training grid
cur_training_grid <- tibble( penalty=lambdas_range )
# Set CV
cur_cv_folds <- vfold_cv( cur_training_data, v=n_fold_hyperparameter, repeats=n_iters, strata=outcome )
# Train model on a single core
tic( "Train model without parallelization" )
cur_training_results <- cur_workflow %>%
tune_grid( resamples=cur_cv_folds,
grid=cur_training_grid,
metrics= metric_set( roc_auc, pr_auc, accuracy, npv, ppv, yardstick::sensitivity, yardstick::specificity, yardstick::precision, yardstick::recall ) )
toc()
##### Train the model in parallel #####
# Initialize cores for parallel processing
ncores <- 8
cl <- makeCluster( ncores )
registerDoParallel( cl )
# Train model with parallelization
tic( "Train model with parallelization" )
cur_training_results_parallel <- cur_workflow %>%
tune_grid( resamples=cur_cv_folds,
grid=cur_training_grid,
metrics= metric_set( roc_auc, pr_auc, accuracy, npv, ppv, yardstick::sensitivity, yardstick::specificity, yardstick::precision, yardstick::recall ) )
toc()
# Clean up cores
stopCluster( cl )
We do think that fixing #376 will go a long way toward solving problems like the one you describe here, since only the necessary data will be passed to the workers. It may not entirely eliminate what you're seeing, though: functions like rsample::vfold_cv() make resampling very cheap within a single R session, but that efficiency is lost once the resamples have to be sent to multiple R sessions. Parallel processing is generally a speed-for-memory trade-off, and the cost looks especially exaggerated here precisely because resampling is so memory-efficient in a single R session.
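To make the single-session efficiency concrete, here is a small sketch (the data frame dimensions and fold settings are illustrative, not taken from your example). lobstr::obj_size() accounts for memory shared between objects, so it shows that the resamples add very little on top of the original data as long as everything lives in one R session:

library( rsample )
library( lobstr )

df <- data.frame( matrix( rnorm( 150 * 1000 ), nrow = 150 ) )
folds <- vfold_cv( df, v = 3, repeats = 8 )

obj_size( df )          # size of the data alone
obj_size( df, folds )   # barely larger, since the splits reference the same data
# Once resamples are shipped to separate worker processes, that sharing is
# lost and each worker materializes its own copy of the data, which is where
# the per-worker memory cost comes from.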
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.