Make mnLogLoss() more efficient #189

brentonk · 2015-07-17T19:14:39Z

I rewrote mnLogLoss() to use match() to retrieve the column index corresponding to the observed class of each sample, rather than building a matrix of dummies. I ran a basic test to confirm that (1) this doesn't change the results and (2) it yields a speedup — see below.

library("caret")
library("microbenchmark")

mnLogLossOld <- function(data, lev = NULL, model = NULL){
  if(is.null(lev)) stop("'lev' cannot be NULL")
  if(!all(lev %in% colnames(data)))
    stop("'data' should have columns consistent with 'lev'")
  if(!all(sort(lev) %in% sort(levels(data$obs))))
    stop("'data$obs' should have levels consistent with 'lev'")
  eps <- 1e-15
  probs <- as.matrix(data[, lev, drop = FALSE])
  probs[probs > 1 - eps] <- 1 - eps
  probs[probs < eps] <- eps
  inds <- getFromNamespace("class2ind", "caret")(data$obs)[, lev, drop = FALSE]
  c(logLoss = -mean(apply(inds*log(probs), 1, sum), na.rm = TRUE))
}

mnLogLossNew <- function(data, lev = NULL, model = NULL){
  if(is.null(lev)) stop("'lev' cannot be NULL")
  if(!all(lev %in% colnames(data)))
    stop("'data' should have columns consistent with 'lev'")
  if(!all(sort(lev) %in% sort(levels(data$obs))))
    stop("'data$obs' should have levels consistent with 'lev'")
  eps <- 1e-15
  probs <- as.matrix(data[, lev, drop = FALSE])
  probs[probs > 1 - eps] <- 1 - eps
  probs[probs < eps] <- eps
  inds <- match(data$obs, colnames(probs))
  probs <- probs[cbind(seq_len(nrow(probs)), inds)]  
  c(logLoss = -mean(log(probs), na.rm = TRUE))
}

## Compare performance using simulated ten-class data
n <- 1e5
lev <- LETTERS[1:10]
y <- factor(sample(lev, n, replace = TRUE))
probs <- matrix(rexp(n * 10), n)
probs <- sweep(probs, 1, rowSums(probs), "/")
probs <- data.frame(probs)
names(probs) <- levels(y)
probs$obs <- y

ll_old <- mnLogLossOld(data = probs, lev = lev)
ll_new <- mnLogLossNew(data = probs, lev = lev)
all.equal(ll_old, ll_new)
#> [1] TRUE

microbenchmark(old = mnLogLossOld(data = probs, lev = lev),
               new = mnLogLossNew(data = probs, lev = lev))
#> Unit: milliseconds
#>  expr     min      lq   mean  median     uq     max neval
#>   old 319.222 358.413 387.35 375.382 397.15 659.757   100
#>   new  28.055  30.193  36.24  34.034  37.89  71.906   100

Previously, the function constructed the predicted probability of the observed outcome by building up a matrix of dummy variables and multiplying with the matrix of predicted probabilities. Now it uses match() to identify the column index corresponding to the observed class for each observation, then uses the matrix form of `[` to extract the relevant probabilities. Results in a speedup of roughly 10x, per back-of-the-envelope tests.

topepo · 2015-07-29T17:24:58Z

Thanks a lot!

brentonk · 2015-08-04T16:55:34Z

Any reason why this wasn't merged?

zachmayer · 2015-08-04T17:00:36Z

@topepo This is good to merge by me. The only test failures are due to errors on the master branch.

Make mnLogLoss() more efficient

topepo closed this Jul 29, 2015

topepo reopened this Aug 5, 2015

topepo added a commit that referenced this pull request Aug 5, 2015

Merge pull request #189 from brentonk/feature-mnll-speedup

f4a269b

Make mnLogLoss() more efficient

topepo merged commit f4a269b into topepo:master Aug 5, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make mnLogLoss() more efficient #189

Make mnLogLoss() more efficient #189

brentonk commented Jul 17, 2015

topepo commented Jul 29, 2015

brentonk commented Aug 4, 2015

zachmayer commented Aug 4, 2015

Make mnLogLoss() more efficient #189

Make mnLogLoss() more efficient #189

Conversation

brentonk commented Jul 17, 2015

topepo commented Jul 29, 2015

brentonk commented Aug 4, 2015

zachmayer commented Aug 4, 2015