Restrict generation of confusion matrix for resamples when number of classes is large #356
Conversation
Before going there, it would be good to pursue optimization. Here is the code:

```r
if(length(lev) > 1) {
  cells <- lapply(predicted,
                  function(x) flatTable(x$pred, x$obs))
  for(ind in seq(along = cells))
    thisResample[[ind]] <- c(thisResample[[ind]], cells[[ind]])
}
thisResample <- do.call("rbind", thisResample)
thisResample <- cbind(allParam, thisResample)
```

where

```r
flatTable <- function(pred, obs) {
  cells <- as.vector(table(pred, obs))
  if(length(cells) == 0) cells <- rep(NA, length(levels(obs))^2)
  names(cells) <- paste(".cell", seq(along = cells), sep = "")
  cells
}
```

I don't know that there is a lot to do there. Can you give us an example of your use case?
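To see why the per-resample payload grows quadratically in the number of classes, here is a small standalone illustration (only `flatTable` is taken from the caret snippet above; the sample data is made up): `table(pred, obs)` has one cell for every (predicted, observed) class pair.

```r
# flatTable as quoted from caret's internals above
flatTable <- function(pred, obs) {
  cells <- as.vector(table(pred, obs))
  if(length(cells) == 0) cells <- rep(NA, length(levels(obs))^2)
  names(cells) <- paste(".cell", seq(along = cells), sep = "")
  cells
}

# made-up data with 4 classes: the flattened confusion matrix has 4^2 cells
lev  <- as.character(1:4)
set.seed(1)
obs  <- factor(sample(lev, 100, replace = TRUE), levels = lev)
pred <- factor(sample(lev, 100, replace = TRUE), levels = lev)

length(flatTable(pred, obs))   # 16 cells: 4 classes squared
```

With 1000 classes, the same call would append a million `.cell` columns to every resample row.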
OK, there are actually two issues with these large-class problems:

Maybe there is some way to improve on the `rbind.fill(result[names(result) == "resamples"])` call; from what I saw in the thread dump in gdb, it was doing a lot of string copies. I will try to compile a small example tomorrow.
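The comparison mentioned later in the thread (`do.call("rbind")` reportedly beating `rbind.fill`) can be checked with a rough, scaled-down micro-benchmark. This is a sketch under assumptions: `plyr` must be installed, the sizes are much smaller than the real problem, and timings will vary by machine.

```r
library(plyr)   # for rbind.fill

# each data frame stands in for one resample row carrying nClasses^2
# confusion-matrix cells (scaled down from the real 1000-class case)
nClasses <- 100
nRows <- 10
row <- as.data.frame(matrix(0, nrow = 1, ncol = nClasses^2))
names(row) <- paste0(".cell", seq_len(nClasses^2))
rows <- replicate(nRows, row, simplify = FALSE)

t1 <- system.time(r1 <- rbind.fill(rows))["elapsed"]
t2 <- system.time(r2 <- do.call("rbind", rows))["elapsed"]

# both approaches must produce the same shape; compare t1 and t2 yourself
stopifnot(identical(dim(r1), dim(r2)))
c(rbind.fill = unname(t1), do.call = unname(t2))
</imports>
```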
Here is a small script to reproduce the problem:

```r
library(caret)

nClasses <- 1000
y <- factor(rep(seq(1:nClasses), 2))
df <- as.data.frame(diag(nClasses))
df <- rbind(df, df)

nBootStrapIters <- 2
trCtrl <- trainControl(method = "boot",
                       number = nBootStrapIters,
                       verboseIter = T)

Rprof('/tmp/classes_profiling.out', interval = 0.5)
fit <- train(df, y, method = "rpart", trControl = trCtrl, tuneLength = 1)
fit
Rprof(NULL)
```

After a while, the output of the profiling file looks like this:
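A back-of-the-envelope count shows why even this tiny reproduction is punishing: every resample row carries one column per confusion-matrix cell, i.e. `nClasses^2` of them (the variable names below simply mirror the script above).

```r
# cell counts implied by the reproduction script
nClasses <- 1000
nBootStrapIters <- 2

cellsPerRow <- nClasses^2                    # 1,000,000 cells per resample row
totalCells  <- cellsPerRow * nBootStrapIters # 2,000,000 cells to aggregate

cellsPerRow
totalCells
```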
I've looked at this for a while and came to the conclusion that this is the best way to proceed. I'll also check in some code so that you can generate the resampled confusion matrices from the holdout predictions.
Restrict generation of confusion matrix for resamples when number of classes is large
Thanks!
Cool! I had also tried to find a way of optimizing the slow `rbind.fill` call. Surprisingly, a `do.call("rbind")` was faster, but it was still slow. With this PR we are now running caret on a problem with 30,000 classes, and this is pretty awesome :-)
Hi Max.

When running caret on classification problems with a huge number of classes (1000), we had terrible performance in the final aggregation of the resample results.

It turned out that in `nominalTrainWorkflow` a complete confusion matrix is added to `thisResample`, which results in a quadratic number of cells.

The real problem then occurs a few lines later (after the end of the `foreach` scope): `rbind.fill` has catastrophic runtime on these huge tables. For a problem with 1000 classes (and therefore a confusion matrix with one million elements), we stopped `rbind.fill` after 12 hours.
I have attached a small commit which adds the confusion matrix only if `length(lev) <= 50`. Alternatively, one could add a parameter to `trainControl()` (with such a default).
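The guard from the commit can be sketched in isolation like this. This is a toy, not caret's actual code: `maxClassesForCells` is a hypothetical name for the 50 cutoff, and `lev`, `flatTable`, and the per-resample row only mirror the caret snippet quoted earlier in the thread.

```r
maxClassesForCells <- 50   # hypothetical name for the threshold in the commit

# tiny stand-ins for caret's internals
flatTable <- function(pred, obs) {
  cells <- as.vector(table(pred, obs))
  names(cells) <- paste0(".cell", seq_along(cells))
  cells
}

lev  <- as.character(1:3)
obs  <- factor(c(1, 2, 3, 1), levels = lev)
pred <- factor(c(1, 2, 2, 1), levels = lev)

resampleRow <- list(Accuracy = 0.75)

# only attach the flattened confusion matrix when the class count is small
if (length(lev) > 1 && length(lev) <= maxClassesForCells)
  resampleRow <- c(resampleRow, as.list(flatTable(pred, obs)))

length(resampleRow)   # 1 + 3^2 = 10 entries when the guard passes
```

With 1000 classes the condition fails and the resample row keeps only its performance metrics, which is what makes the final `rbind.fill` cheap again.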
Benedikt