
Optimising the train function in parallel from multiple splits generated by the createDataPartition function #192

Closed
acocac opened this issue Jul 25, 2015 · 13 comments

Comments

@acocac

acocac commented Jul 25, 2015

Dear all,

I tried to reproduce Max's response to the following question:
http://stats.stackexchange.com/questions/99315/train-validate-test-sets-in-caret

Using the createDataPartition function and its times argument, I am creating multiple train/test splits from my full training dataset. My aim is to assess the best model across these splits using the train function with 5-fold CV in parallel.

I implemented a foreach loop as suggested by Max's response. However, while running this foreach my CPU utilisation is less than 10% (option 1). In contrast, if I use a for loop, CPU utilisation exceeds 10% (option 2). The system.time results for these two options are as follows:

OPTION 1 (foreach and parallel)
   user  system elapsed
   6.77    4.42  351.99

OPTION 2 (for and parallel)
   user  system elapsed
  11.84    0.35   63.94

Is there any option or suggestion to optimise the following reproducible code using the iris dataset?

require(caret)
require(doParallel)

# dataset
data(iris)

# create multiple train/test splits (2 times in this example)
set.seed(40)
splits <- createDataPartition(iris$Species, p = 0.7, list = TRUE, times = 2)
results <- lapply(splits,
                  function(x, dat) {
                    holdout <- (1:nrow(dat))[-unique(x)]
                    data.frame(index = holdout,
                               obs = dat$Species[holdout])
                  },
                  dat = iris)
mods <- vector(mode = "list", length = length(splits))

# ANN parameters
decay.tune <- c(0.01)
size <- seq(2, 3, by = 1)

# tuning grid for caret's train function
my.grid <- expand.grid(.decay = decay.tune, .size = size)

# create a list of seeds; here we change the seed for each resampling
set.seed(123)
n.repeats <- 100
n.resampling <- 5
length.seeds <- (n.repeats * n.resampling) + 1
n.tune.parameters <- length(decay.tune) * length(size)
seeds <- vector(mode = "list", length = length.seeds) # length is (n.repeats * n.resampling) + 1
for(i in 1:length.seeds) seeds[[i]] <- sample.int(n = 1000, n.tune.parameters) # n.tune.parameters = number of tuning parameter combinations
seeds[[length.seeds]] <- sample.int(1000, 1) # for the last model

# create a control object for the models: 5-fold CV repeated 100 times
fitControl <- trainControl(
  method = "repeatedcv",
  number = n.resampling,  ## 5-fold CV
  repeats = n.repeats,    ## repeated 100 times
  classProbs = TRUE,
  savePredictions = TRUE,
  seeds = seeds
)

# OPTION 1: FOREACH AND PARALLEL

cl <- makeCluster(detectCores() - 2)  # create a cluster
registerDoParallel(cl)                # register the cluster

set.seed(40)
system.time(
  foreach(i = seq(along = splits), .packages = c("caret")) %dopar% {
    in_train <- unique(splits[[i]])
    set.seed(2)
    mod <- train(Species ~ ., data = iris[in_train, ],
                 preProcess = c("center", "scale"),
                 tuneGrid = my.grid,
                 trControl = fitControl,
                 method = "nnet",
                 trace = FALSE,
                 metric = "Kappa",
                 linout = FALSE)
    results[[i]]$pred <- predict(mod, iris[-in_train, ])
    mods[[i]] <- mod
  }
)

# OPTION 2: FOR AND PARALLEL

cl <- makeCluster(detectCores() - 2)  # create a cluster
registerDoParallel(cl)                # register the cluster

set.seed(40)
system.time(
  for(i in seq(along = splits)) {
    in_train <- unique(splits[[i]])
    set.seed(2)
    mod <- train(Species ~ ., data = iris[in_train, ],
                 preProcess = c("center", "scale"),
                 tuneGrid = my.grid,
                 trControl = fitControl,
                 method = "nnet",
                 trace = FALSE,
                 metric = "Kappa",
                 linout = FALSE)
    results[[i]]$pred <- predict(mod, iris[-in_train, ])
    mods[[i]] <- mod
  }
)

@zachmayer
Collaborator

I think you need to register a parallel cluster in order to run in parallel.



@acocac
Author

acocac commented Jul 25, 2015

Hi Zach,

I think I was already registering a parallel cluster before the foreach starts, with the following lines. Are they correct?

cl <- makeCluster(detectCores() - 2)  # create a cluster
registerDoParallel(cl)                # register the cluster

Please let me know.

@zachmayer
Collaborator

Yes, that looks correct. I'm on a phone without a laptop, so it's hard to edit code :-)



@topepo
Owner

topepo commented Jul 27, 2015

acocac,

Did this work?

@acocac
Author

acocac commented Jul 27, 2015

topepo, it did not work. It is still slower using the foreach than the for loop.

@topepo
Owner

topepo commented Jul 27, 2015

I'll try it on my machine. However, I should say that 100 repeats is like hitting a tack with a sledgehammer. Since we are just estimating means, 500 resamples are probably not needed; I've done 10 repeats at most.

Anyway, I'll run it within the next day.

@acocac
Author

acocac commented Jul 27, 2015

Hi,

I am using 100 repeats because my real dataset is small (15 samples per class). This number of repeats is based on the following publication:
http://www.sciencedirect.com/science/article/pii/S0003267012016479

@topepo
Owner

topepo commented Jul 27, 2015

I'll take a look, but I would file that under "bat shit crazy". I'll guarantee that there is very little reduction in variation well before 500 resamples.

Also, when tuning the model, the problem is not so much about sensitivity and specificity but is mostly about correctly rank-ordering the tuning parameters. In that context, the bar is much lower.

@topepo
Owner

topepo commented Jul 27, 2015

On my machine, detectCores()-2 = 10. The execution time for the first approach was 15.742s; the second took 7.503s.

A few things:

  • You should use allowParallel = FALSE when using foreach outside of train (see the sketch below). Some parallel processing backends will spawn (detectCores()-2)^2 workers, since you are using parallelism at two levels, and that can end badly.
  • The first approach is probably inefficient since it runs larger blocks in parallel. Once foreach has fewer than 10 things to do (on my machine), those cores are inactive. In the second approach, train has hundreds of tasks for each model fit, so the potential utilization of the workers is much higher for a longer period.
  • Using top, I watched the workers spawn and die. In each case, 10 workers were activated, so I know that I was getting what I asked for.

So, use the second approach to parallelism.
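
A minimal sketch of the two patterns, assuming the doParallel backend and the splits/mods objects from the code above (other names are illustrative, not part of the original post): the recommended version keeps the outer loop sequential and lets train parallelize its resamples; if you do parallelize the outer loop instead, disable train's internal parallelism via allowParallel.

library(caret)
library(doParallel)

cl <- makeCluster(detectCores() - 2)
registerDoParallel(cl)

## Recommended: sequential outer loop; train() runs its resamples in
## parallel because trainControl() defaults to allowParallel = TRUE
ctrl_par <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
for (i in seq_along(splits)) {
  in_train <- unique(splits[[i]])
  mods[[i]] <- train(Species ~ ., data = iris[in_train, ],
                     method = "nnet", trace = FALSE,
                     trControl = ctrl_par)
}

## If instead the outer foreach loop is run in parallel, turn off
## train's internal parallelism so workers are not spawned at two levels
ctrl_seq <- trainControl(method = "repeatedcv", number = 5, repeats = 10,
                         allowParallel = FALSE)

stopCluster(cl)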

@acocac
Author

acocac commented Jul 28, 2015

Thanks for your response! It is great to have this sort of tip for future parallel processing. BTW, regarding the repeats: how many do you suggest for training nnet models with a training set of 60 observations? The set has 4 outcome classes (15 samples per class).

@topepo
Owner

topepo commented Jul 29, 2015

I use, at most, 10 repeats of 10-fold CV.

That paper uses 5-fold CV, which is strange because they talk a lot about the bias problem of the bootstrap (and they are completely right about that). However, 5-fold CV has higher bias than 10-fold, so it seems like a contradiction.
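
For reference, a minimal sketch of that setup in caret (fitControl10 is an illustrative name; a seeds list, as in the original post, would then need length 10 * 10 + 1 = 101):

## 10-fold CV repeated 10 times -> 100 resamples per model
fitControl10 <- trainControl(method = "repeatedcv",
                             number  = 10,  # 10 folds
                             repeats = 10)  # 10 repeats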

@acocac
Author

acocac commented Jul 29, 2015

Hi Max, thanks for your feedback; it is very relevant to me!

@topepo
Owner

topepo commented Jul 29, 2015

Should we close this issue?

@acocac acocac closed this as completed Jul 29, 2015