
Calling caret methods in parallel mode is extremely easy. The train function in caret automatically provides the functionality. The only user task is to register a parallel backend and to make sure to un-register it after training. Visit my R-parallel WIKI to see the different parallel backends for R. Once the sequential code is validated and running, all calculations should be carried out in parallel. To make sure all CPU power is being used, check the Task Manager (Ctrl-Alt-Del) under Windows, the "Activity Monitor" on the Mac (via Spotlight), or the "System Monitor" under Linux.

Sequential code (one CPU core):

# train caret regression with one CPU
require(caret); data(BloodBrain); 
fit1 <- train(bbbDescr, logBBB, "knn"); fit1

Parallel code: register 4 cores (adjust the number if your machine has fewer) and train in parallel. Observe the first line below, which registers the parallel backend, and the last line, which removes it again. The code in between stays the same as in the sequential example above. No fiddling around. Very easy.

# train caret regression with 4 CPUs
library(doParallel); cl <- makeCluster(4); registerDoParallel(cl) 
  require(caret); data(BloodBrain); 
  fit1 <- train(bbbDescr, logBBB, "knn"); fit1
stopCluster(cl); registerDoSEQ();

In both cases we will get the same result if a seed was set. To explain the above example: first register the parallel backend, second run the caret train function, and third un-register the parallel backend. There is of course a certain [overhead](https://github.com/tobigithub/R-parallel/wiki/R-parallel-Snippets) for setting up the parallel functions; it is usually a few seconds.

>   fit1 <- train(bbbDescr, logBBB, "knn"); fit1
k-Nearest Neighbors 

208 samples
134 predictors

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 208, 208, 208, 208, 208, 208, ... 
Resampling results across tuning parameters:

  k  RMSE       Rsquared   RMSE SD     Rsquared SD
  5  0.6732761  0.2717121  0.05888422  0.07915765 
  7  0.6565293  0.2857143  0.05542943  0.08616568 
  9  0.6557080  0.2805902  0.04537600  0.07066764 

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was k = 9. 
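
The setup overhead mentioned above can be measured directly. A minimal sketch that times only the creation and shutdown of the workers, without any training:

# time only the backend setup/teardown overhead, no training involved;
# this typically takes a few seconds and grows with the number of workers
library(doParallel)
system.time({
  cl <- makeCluster(4)     # spawn 4 worker processes
  registerDoParallel(cl)   # register them as the foreach backend
  stopCluster(cl)          # shut the workers down again
  registerDoSEQ()          # return to sequential execution
})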

Final words: Many of the older caret examples use doMC. That is quite frustrating for beginners who have not worked with R under Windows, because doMC is not supported on Windows. It is therefore better to use library(doParallel), which is a newer merger of doSNOW and doMC. Alternatively one can use the native parallel package, which is a merger of snow and multicore.
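
A minimal sketch of registering doParallel, which behaves the same way on Windows, macOS and Linux (doMC would fail at this step on Windows):

# register a doParallel backend; works on Windows, macOS and Linux
library(doParallel)
registerDoParallel(cores = 4)  # cluster on Windows, forking on Unix-alikes
getDoParWorkers()              # confirm the number of registered workers
registerDoSEQ()                # switch back to sequential execution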


The advantage of parallel code can be extreme; in many cases the scaling is nearly linear, so what takes one hour on a single CPU is finished in about 2 minutes with a 32-CPU setup. Running parallel code requires lots of memory, however: some packages allocate up to 4 GByte per parallel client, hence 32 CPUs would require 4 x 32 = 128 GByte RAM. That is normal for a workstation, but may be problematic for home computers.
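
If memory is the limiting factor, the number of workers can be capped accordingly. A small sketch, assuming roughly 4 GByte per worker; ramGB is a value the user has to supply for their own machine:

# cap the number of workers by available RAM (assumption: ~4 GByte per worker)
library(parallel)
ramGB    <- 32                                   # assumed total RAM of the machine
nWorkers <- min(detectCores(), floor(ramGB / 4)) # never start more workers than the RAM allows
cat("Using", nWorkers, "parallel workers\n")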

(Figure: caret-cubist-train-parallel, CPU utilization while training the cubist model in parallel)

That means for parallel processes there will be

  1. parallel setup/merge time overhead
  2. parallel RAM or memory overhead

If we choose another method (cubist) we can see that a substantial amount of time is needed just to set up the calculation, then to distribute all the data to the different CPUs or nodes, and, once the calculation is finished, to combine and merge everything again.

Below we can see that the cubist method took 91 seconds of user time in sequential code, but only 2 seconds of user time in parallel mode. Observing the thread activity in real time confirms that the CPUs are indeed only active at 100% for about 2 seconds. That would be a superficial 45-fold speedup if no data distribution and merging were needed; the elapsed times (roughly 45 seconds versus 91 seconds) show that the real gain here is only about 2-fold, the rest is parallel overhead.

It becomes clear that parallel use is not recommended for small 1-2 minute examples, but once the code runs for minutes, hours or days (hopefully not), parallelism is the only practical solution. Especially tuning methods that cover large parameter spaces using design of experiments require parallel CPU power: every parameter change requires another training run and RMSE check, hence 1000 parameter combinations would extend the training time quite substantially. Below the cubist method with 16 CPUs in parallel.

library(doParallel); cl <- makeCluster(16); registerDoParallel(cl) 
  require(caret); data(BloodBrain); 
  fit1 <- train(bbbDescr, logBBB, "cubist"); 
  fit1; fit1$times$everything
stopCluster(cl); registerDoSEQ();

# Cubist with one CPU [s]
# user  system elapsed 
# 91.20    0.04   91.27 

# Cubist with 16 CPUs [s]
# user  system elapsed 
# 2.00    0.03   44.68 
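
To illustrate the point about large parameter spaces: with an explicit tuning grid every additional grid point multiplies the number of model fits. A hedged sketch for cubist, where the grid values are chosen purely for illustration (3 x 3 = 9 candidate models, each trained on every resample):

# tune cubist over an explicit 3 x 3 grid in parallel (grid values for illustration only)
library(doParallel); cl <- makeCluster(4); registerDoParallel(cl)
  require(caret); data(BloodBrain);
  cubistGrid <- expand.grid(committees = c(1, 10, 50),
                            neighbors  = c(0, 5, 9))
  fit2 <- train(bbbDescr, logBBB, "cubist", tuneGrid = cubistGrid)
  fit2; fit2$times$everything
stopCluster(cl); registerDoSEQ();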

We can also invoke functions that automatically detect the CPU cores and threads of the system; that comes in handy when the code is shared between multiple systems. Here we can use the functionality of the different parallel packages. It is important to always clean up after parallel processing and to make sure that no Rscript.exe zombies are still lingering around; see the R-parallel WIKI for how to remove those.

require(caret)
data(BloodBrain)
set.seed(123)

# Library parallel() is a native R library, no CRAN required
library(parallel)
nCores <- detectCores(logical = FALSE)
nThreads <- detectCores(logical = TRUE)
cat("CPU with",nCores,"cores and",nThreads,"threads detected.\n")

# load the doParallel/doSNOW library for caret cluster use
library(doParallel)
cl <- makeCluster(nThreads)
registerDoParallel(cl)

# random forest regression
fit1 <- train(bbbDescr, logBBB, "rf")
fit1; 

stopCluster(cl)
registerDoSEQ()
### END

The use of parallelism is especially important when using multiple caret models. Although we can execute those in classical loops, we can also use lapply and sapply in R. The example code below applies the four models "qrf", "xgbTree", "knn" and "rf" and compares them in terms of regression goodness-of-fit (R^2) and RMSE. We can quickly see that "rf" is not the best method here; xgbTree (eXtreme Gradient Boosting) gives a better result for training. Of course we would need to validate the results with a split-set or hold-out set to come to a final conclusion (see the sketch after the results below). The disadvantage of using lapply/sapply is the somewhat complicated error handling in R. Using many different models will certainly bring up errors and warnings, and package updates may break code, so this is only given as a quick-and-dirty code snippet.

require(caret); data(BloodBrain); m <- c("qrf","xgbTree","knn","rf");
library(doParallel); cl <- makeCluster(8); registerDoParallel(cl)
  
  # seeds required for reproducible methods
  t2 <- lapply(m,function(x) {set.seed(123); 
  seeds <- vector(mode = "list", length = nrow(bbbDescr) + 1); 
  seeds <- lapply(seeds, function(x) 1:20); 
  
  # train the 4 methods from vector m
  t1 <- train(bbbDescr, logBBB, (x),
  	trControl = trainControl(method = "cv",seeds=seeds))})
  r2 <- lapply(1:length(t2), function(x) {
        cat(sprintf("%-10s",(m[x])));
        cat(t2[[x]]$results$Rsquared[which.min(t2[[x]]$results$RMSE)],"\t");
        cat(t2[[x]]$results$RMSE[which.min(t2[[x]]$results$RMSE)],"\n")})
        
stopCluster(cl); registerDoSEQ();


#model     R^2           RMSE
#qrf       0.5861108     0.5120318 
#xgbTree   0.6129255     0.4858211 
#knn       0.3736528     0.6185242 
#rf        0.6037442     0.493395 
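
As mentioned above, a final conclusion needs a hold-out set. A minimal sketch that splits the data once, trains on 80% and checks the fit on the unseen 20%; xgbTree is used here only because it scored best in the training comparison above (the xgboost package must be installed):

# hold-out validation sketch: train on 80% of the data, test on the remaining 20%
require(caret); data(BloodBrain); set.seed(123)
inTrain  <- createDataPartition(logBBB, p = 0.8, list = FALSE)
fitTrain <- train(bbbDescr[inTrain, ], logBBB[inTrain], "xgbTree")
predTest <- predict(fitTrain, bbbDescr[-inTrain, ])
postResample(predTest, logBBB[-inTrain])   # RMSE and R^2 on the hold-out set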

Additional material:

  • R parallel - available R parallel packages
  • caret WIKI - original caret parallel page
  • H2O - H2O.ai - Fast Scalable Machine Learning also exploits parallelism
  • Formulize - nutonian Formulize uses parallel genetic algorithms and cloud computing
  • cubist examples - parallel ensemble
  • Workstations - Workstations for machine learning

Source code: