
Restrict generation of confusion matrix for resamples when number of classes is large #356

Merged
1 commit merged into topepo:master on Mar 17, 2016

Conversation

bwilbertz
Contributor

Hi Max.

When running caret on classification problems with a huge number of classes (~1000), we saw terrible performance in the final aggregation of the resampling results.

It turned out that nominalTrainWorkflow adds a complete confusion matrix to thisResample, which results in a quadratic number of cells.

The real problem then occurs a few lines later (after the end of the foreach scope): rbind.fill has catastrophic runtime on these huge tables. For a problem with 1000 classes (and therefore a confusion matrix with 1 million elements), we stopped the rbind.fill after 12 hours...

I have attached a small commit which adds the confusion matrix only if length(lev) <= 50. Alternatively, one could add a parameter to trainControl() (with such a default).

Benedikt
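To make the quadratic growth concrete, here is a small sketch using the flatTable helper from caret's internals (quoted later in this thread); the numbers are illustrative:

```r
# caret's internal helper that flattens a confusion matrix into one row
flatTable <- function(pred, obs) {
  cells <- as.vector(table(pred, obs))
  if (length(cells) == 0) cells <- rep(NA, length(levels(obs))^2)
  names(cells) <- paste(".cell", seq(along = cells), sep = "")
  cells
}

lev <- factor(seq_len(1000))
# a single resample already produces a row with 1000^2 = 1,000,000 cells
length(flatTable(lev, lev))
```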

@topepo
Owner

topepo commented Jan 19, 2016

Before going there, it would be good to pursue optimization. Here is the code:

if(length(lev) > 1) {
  cells <- lapply(predicted,
                  function(x) flatTable(x$pred, x$obs))
  for(ind in seq(along = cells)) 
    thisResample[[ind]] <- c(thisResample[[ind]], cells[[ind]])
}
thisResample <- do.call("rbind", thisResample)          
thisResample <- cbind(allParam, thisResample)

where

flatTable <- function(pred, obs) {
    cells <- as.vector(table(pred, obs))
    if(length(cells) == 0) cells <- rep(NA, length(levels(obs))^2)
    names(cells) <- paste(".cell", seq(along= cells), sep = "")
    cells
  }

I don't know whether there is much to be done about flatTable, but there is a lot of room for improvement in the first block. Hadley's rbind.fill might be a simple solution for the rbind, the for loop is ridiculous (mea culpa), etc.

Can you give us an example of your predicted object to test with (or code to simulate one)?

@bwilbertz
Contributor Author

OK, there are actually two issues with these large-class problems:

  • the memory consumption of the confusion matrix (quadratic in the number of classes)
  • the runtime performance of the rbind.fill in line 326:

    list(resamples = thisResample, pred = tmpPred)
}

resamples <- rbind.fill(result[names(result) == "resamples"])
pred <- if(keep_pred) rbind.fill(result[names(result) == "pred"]) else NULL

Maybe there is some way to improve the rbind.fill(result[names(result) == "resamples"]) call; from what I saw in the thread dump in gdb, it was doing a lot of string copies...
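For reference, a sketch of one possible workaround (not what this PR does): when every resample produces the same set of .cell columns, the rows can be bound as a plain numeric matrix, sidestepping rbind.fill's per-column allocation. As noted later in the thread, the overall cost is still quadratic in the number of classes:

```r
# Illustrative only: rows with identical names bound as a matrix
# instead of via plyr::rbind.fill on data frames
rows <- replicate(20, setNames(runif(2500), paste0(".cell", 1:2500)),
                  simplify = FALSE)
resamples <- do.call(rbind, rows)  # 20 x 2500 numeric matrix
dim(resamples)
```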

I will try to compile a small example tomorrow.

@bwilbertz
Contributor Author

Here is a small script to reproduce the problem:

library(caret)

nClasses <- 1000

y <- factor(rep(seq(1:nClasses), 2))

df <- as.data.frame(diag(nClasses))
df <- rbind(df,df)

nBootStrapIters <- 2
trCtrl <-  trainControl(method = "boot",
                number = nBootStrapIters,
                verboseIter = T)

Rprof('/tmp/classes_profiling.out', interval = 0.5)
fit <- train(df, y, method="rpart", trControl = trCtrl, tuneLength = 1)
fit
Rprof(NULL)

After a while, the output of the profiling file looks like this:

"allocate_column" "output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
"output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
"output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
"output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
".subset2" "<Anonymous>" "[[.data.frame" "[[" "allocate_column" "output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
".subset2" "<Anonymous>" "[[.data.frame" "[[" "allocate_column" "output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 

@topepo
Owner

topepo commented Mar 17, 2016

I've looked at this for a while and came to the conclusion that this is the best way to proceed. I'll also check in some code so that you can generate the resampled confusion matrices from the holdout predictions in train_object$pred (if they are generated), irrespective of the number of classes.
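A hedged sketch of what that post hoc computation could look like, assuming predictions were saved (e.g. savePredictions = TRUE in trainControl); pred, obs, and Resample are caret's standard holdout-prediction columns, and `fit` is a hypothetical fitted train object:

```r
# Sketch: rebuild one confusion matrix per resample from the saved
# holdout predictions, regardless of the number of classes
conf_mats <- lapply(split(fit$pred, fit$pred$Resample),
                    function(d) table(pred = d$pred, obs = d$obs))
```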

topepo added a commit that referenced this pull request Mar 17, 2016
Restrict generation of confusion matrix for resamples when number of classes is large
@topepo topepo merged commit e651dc3 into topepo:master Mar 17, 2016
@topepo
Owner

topepo commented Mar 17, 2016

Thanks!

@bwilbertz
Contributor Author

Cool!

I had also tried to find a way of optimizing the slow rbind.fill call. Surprisingly, a do.call("rbind") was faster, but still had O(n^2) runtime...

With this PR we are now running caret on a problem with 30,000 classes - and that is pretty awesome :-)
