
Restrict generation of confusion matrix for resamples when number of classes is large #356

Merged
1 commit merged into topepo:master on Mar 17, 2016

Conversation

bwilbertz
Contributor

Hi Max.

When running caret on classification problems with a huge number of classes (~1000), we saw terrible performance in the final aggregation of the resampling results.

It turned out that nominalTrainWorkflow adds a complete confusion matrix to thisResample, which results in a quadratic number of cells.

The real problem then occurs a few lines later (after the end of the foreach scope): rbind.fill has catastrophic runtime on these huge tables. For a problem with 1000 classes (and therefore a confusion matrix with 1 million elements), we stopped the rbind.fill after 12 hours...

I have attached a small commit which adds the confusion matrix only if length(lev) <= 50. Alternatively, one could add a parameter to trainControl() (with such a default).

Benedikt
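To make the quadratic growth concrete, here is a small sketch using the flatTable helper from caret's internals (quoted later in this thread); the numbers are illustrative:

```r
# caret's internal helper that flattens a confusion matrix into one row
flatTable <- function(pred, obs) {
  cells <- as.vector(table(pred, obs))
  if (length(cells) == 0) cells <- rep(NA, length(levels(obs))^2)
  names(cells) <- paste(".cell", seq(along = cells), sep = "")
  cells
}

lev <- factor(seq_len(1000))
# a single resample already produces a row with 1000^2 = 1,000,000 cells
length(flatTable(lev, lev))
```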

@topepo
Owner

topepo commented Jan 19, 2016

Before going there, it would be good to pursue optimization. Here is the code:

if(length(lev) > 1) {
  cells <- lapply(predicted,
                  function(x) flatTable(x$pred, x$obs))
  for(ind in seq(along = cells)) 
    thisResample[[ind]] <- c(thisResample[[ind]], cells[[ind]])
}
thisResample <- do.call("rbind", thisResample)          
thisResample <- cbind(allParam, thisResample)

where

flatTable <- function(pred, obs) {
    cells <- as.vector(table(pred, obs))
    if(length(cells) == 0) cells <- rep(NA, length(levels(obs))^2)
    names(cells) <- paste(".cell", seq(along= cells), sep = "")
    cells
  }

I don't know whether there is much to be done about flatTable, but there is a lot of room for improvement in the first block. Hadley's rbind.fill might be a simple solution for the rbind, the for loop is ridiculous (mea culpa), etc.

Can you give us an example of your predicted object to test with (or code to simulate one)?

@bwilbertz
Contributor Author

OK, there are actually two issues with these large-class problems:

  • the memory consumption of the confusion matrix (quadratic in the number of classes)
  • the runtime performance of the rbind.fill in line 326:

    list(resamples = thisResample, pred = tmpPred)
}

resamples <- rbind.fill(result[names(result) == "resamples"])
pred <- if(keep_pred) rbind.fill(result[names(result) == "pred"]) else NULL

Maybe there is some way to improve the rbind.fill(result[names(result) == "resamples"]) call; from what I saw in the thread dump in gdb, it was doing a lot of string copies...
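For reference, a sketch of one possible workaround (not what this PR does): when every resample produces the same set of .cell columns, the rows can be bound as a plain numeric matrix, sidestepping rbind.fill's per-column allocation. As noted later in the thread, the overall cost is still quadratic in the number of classes:

```r
# Illustrative only: rows with identical names bound as a matrix
# instead of via plyr::rbind.fill on data frames
rows <- replicate(20, setNames(runif(2500), paste0(".cell", 1:2500)),
                  simplify = FALSE)
resamples <- do.call(rbind, rows)  # 20 x 2500 numeric matrix
dim(resamples)
```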

I will try to compile a small example tomorrow.

@bwilbertz
Contributor Author

Here is a small script to reproduce the problem:

library(caret)

nClasses <- 1000

y <- factor(rep(seq(1:nClasses), 2))

df <- as.data.frame(diag(nClasses))
df <- rbind(df,df)

nBootStrapIters <- 2
trCtrl <-  trainControl(method = "boot",
                number = nBootStrapIters,
                verboseIter = T)

Rprof('/tmp/classes_profiling.out', interval = 0.5)
fit <- train(df, y, method="rpart", trControl = trCtrl, tuneLength = 1)
fit
Rprof(NULL)

After a while, the output of the profiling file looks like this:

"allocate_column" "output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
"output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
"output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
"output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
".subset2" "<Anonymous>" "[[.data.frame" "[[" "allocate_column" "output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
".subset2" "<Anonymous>" "[[.data.frame" "[[" "allocate_column" "output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 

@topepo
Owner

topepo commented Mar 17, 2016

I've looked at this for a while and came to the conclusion that this is the best way to proceed. I'll also check in some code so that you can generate the resampled confusion matrices from the holdout predictions in train_object$pred (if they are generated), irrespective of the number of classes.
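A hedged sketch of what that post hoc computation could look like, assuming predictions were saved (e.g. savePredictions = TRUE in trainControl); pred, obs, and Resample are caret's standard holdout-prediction columns, and `fit` is a hypothetical fitted train object:

```r
# Sketch: rebuild one confusion matrix per resample from the saved
# holdout predictions, regardless of the number of classes
conf_mats <- lapply(split(fit$pred, fit$pred$Resample),
                    function(d) table(pred = d$pred, obs = d$obs))
```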

topepo added a commit that referenced this pull request Mar 17, 2016
Restrict generation of confusion matrix for resamples when number of classes is large
@topepo topepo merged commit e651dc3 into topepo:master Mar 17, 2016
@topepo
Owner

topepo commented Mar 17, 2016

Thanks!

@bwilbertz
Contributor Author

Cool!

I had also tried to find a way of optimizing the slow rbind.fill call. Surprisingly, a do.call("rbind") was faster, but still had O(n^2) runtime...

With this PR we are now running caret on a problem with 30,000 classes - and that is pretty awesome :-)
