
Restrict generation of confusion matrix for resamples when number of classes is large #356

Merged
merged 1 commit into topepo:master on Mar 17, 2016

Conversation

@bwilbertz (Contributor) commented Jan 17, 2016

Hi Max.

When running caret on classification problems with a huge number of classes (1000), we saw terrible performance in the final aggregation of the resample results.

It turned out that in nominalTrainWorkflow a complete confusion matrix is appended to thisResample, which results in a quadratic number of cells.

The real problem then occurs a few lines later (after the end of the foreach scope): rbind.fill has catastrophic runtime on these huge tables. For a problem with 1000 classes (and therefore a confusion matrix with 1 million elements), we stopped the rbind.fill after 12 hours...

I have attached a small commit which adds the confusion matrix only if length(lev) <= 50; the idea is sketched below. Alternatively, one could add a parameter to trainControl() (with such a default).
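For illustration only (a sketch of the idea, not the literal diff in the commit):

if (length(lev) > 1 && length(lev) <= 50) {   ## proposed 50-level cutoff, not an existing option
  cells <- lapply(predicted, function(x) flatTable(x$pred, x$obs))
  for (ind in seq(along = cells))
    thisResample[[ind]] <- c(thisResample[[ind]], cells[[ind]])
}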

Benedikt

@topepo (Owner) commented Jan 19, 2016

Before going there, it would be good to pursue optimization. Here is the code:

if(length(lev) > 1) {
  ## flatten each model's confusion matrix into a named vector of cells
  cells <- lapply(predicted,
                  function(x) flatTable(x$pred, x$obs))
  ## append those cells to the corresponding resample row
  for(ind in seq(along = cells))
    thisResample[[ind]] <- c(thisResample[[ind]], cells[[ind]])
}
thisResample <- do.call("rbind", thisResample)
thisResample <- cbind(allParam, thisResample)

where

flatTable <- function(pred, obs) {
  ## cells of the pred x obs table, flattened into a named vector
  cells <- as.vector(table(pred, obs))
  if(length(cells) == 0) cells <- rep(NA, length(levels(obs))^2)
  names(cells) <- paste(".cell", seq(along = cells), sep = "")
  cells
}
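For concreteness (an illustration, not code from the package), the length of that vector grows quadratically with the number of levels:

lv <- c("a", "b", "c")
flatTable(factor(c("a", "b", "b"), levels = lv),
          factor(c("a", "b", "c"), levels = lv))
## a named vector .cell1 ... .cell9 (the flattened 3 x 3 table);
## with 1000 levels this becomes 10^6 cells per resample row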

I don't know that there is a lot to do for flatTable, but there is a lot of room for improvement in the first block. Hadley's rbind.fill might be a simple solution for the rbind, the for loop is ridiculous (mea culpa), etc.
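As one illustration (a sketch only, not what was eventually committed), the loop could be collapsed with Map():

if (length(lev) > 1) {
  cells <- lapply(predicted, function(x) flatTable(x$pred, x$obs))
  ## element-wise concatenation replaces the explicit for loop
  thisResample <- Map(c, thisResample, cells)
}
thisResample <- do.call("rbind", thisResample)
thisResample <- cbind(allParam, thisResample)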

Can you give us an example of your predicted object to test with (or code to simulate one)?

@bwilbertz (Contributor, Author) commented Jan 19, 2016

OK, there are actually two issues with these large-class problems:

  • the memory consumption of a confusion matrix (quadratic in the number of classes)
  • the runtime performance of the rbind.fill in line 326:
   list(resamples = thisResample, pred = tmpPred)
}

  resamples <- rbind.fill(result[names(result) == "resamples"])
  pred <- if(keep_pred)  rbind.fill(result[names(result) == "pred"]) else NULL

Maybe there is some way to improve the 'rbind.fill(result[names(result) == "resamples"])' call; from what I saw in the thread dump in gdb, it was doing a lot of string copies...
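One direction that might avoid that path entirely (an untested sketch; data.table is not currently a caret dependency):

library(data.table)
## bind the per-resample rows without rbind.fill's column-by-column allocation
resamples <- as.data.frame(
  rbindlist(result[names(result) == "resamples"], fill = TRUE))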

I will try to compile a small example tomorrow.

@bwilbertz (Contributor, Author) commented Jan 20, 2016

Here is a small script to reproduce the problem:

library(caret)

nClasses <- 1000

## two observations per class
y <- factor(rep(seq_len(nClasses), 2))

## identity-matrix predictors, duplicated to match y
df <- as.data.frame(diag(nClasses))
df <- rbind(df, df)

nBootStrapIters <- 2
trCtrl <- trainControl(method = "boot",
                       number = nBootStrapIters,
                       verboseIter = TRUE)

Rprof('/tmp/classes_profiling.out', interval = 0.5)
fit <- train(df, y, method = "rpart", trControl = trCtrl, tuneLength = 1)
fit
Rprof(NULL)

After a while, the profiling output looks like this:

"allocate_column" "output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
"output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
"output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
"output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
".subset2" "<Anonymous>" "[[.data.frame" "[[" "allocate_column" "output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
".subset2" "<Anonymous>" "[[.data.frame" "[[" "allocate_column" "output_template" "rbind.fill" "nominalTrainWorkflow" "train.default" "train" 
@topepo (Owner) commented Mar 17, 2016

I've looked at this for a while and came to the conclusion that this is the best way to proceed. I'll also check in some code so that you can generate the resampled confusion matrices from the holdout predictions in train_object$pred (if it is generated), irrespective of the number of classes.
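In the meantime, something along these lines works from the saved predictions (a sketch, assuming savePredictions = TRUE was set in trainControl(); it is not the helper being checked in):

## split the hold-out predictions by resample and tabulate each one
by_resample <- split(fit$pred, fit$pred$Resample)
conf_mats   <- lapply(by_resample,
                      function(d) confusionMatrix(d$pred, d$obs))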

topepo added a commit that referenced this pull request Mar 17, 2016
Restrict generation of confusion matrix for resamples when number of classes is large
@topepo merged commit e651dc3 into topepo:master on Mar 17, 2016
2 checks passed: continuous-integration/travis-ci/pr (The Travis CI build passed); coverage/coveralls (Coverage remained the same at 13.814%)
@topepo (Owner) commented Mar 17, 2016

Thanks!

@bwilbertz (Contributor, Author) commented Mar 17, 2016

Cool!

I had also tried to find a way to optimize the slow rbind.fill call. Surprisingly, a plain do.call("rbind", ...) was faster, but still had n^2 runtime...
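A scaled-down sketch of that comparison (illustrative only; sizes reduced so it runs in seconds, and the gap grows quadratically with the number of classes):

library(plyr)

p    <- 200^2   # cells of a 200 x 200 confusion matrix
rows <- replicate(20, as.data.frame(matrix(rnorm(p), nrow = 1)),
                  simplify = FALSE)

system.time(do.call("rbind", rows))   # base rbind
system.time(rbind.fill(rows))         # what nominalTrainWorkflow used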

With this PR we are now running caret on a problem with 30,000 classes - and that is pretty awesome :-)
