
Caret pls training hanging #610

Closed
filippis opened this issue Mar 6, 2017 · 5 comments
@filippis commented Mar 6, 2017

Hi,

I have a case that I cannot reliably reproduce. For specific seeds and datasets, caret training using pls simply hangs. Any of the following changes makes the training run:

  • Changing one value in the training set slightly
    mytest[1,2] <- mytest[1,2] + 1e-17
  • Writing the dataset out with dput() and reading it back in (this also slightly changes the double columns)
  • Increasing the number of seeds so there are more than the number of models tried per resample
    num_models <- 10
  • Changing the order of the factor levels
    mytest$Class <- factor(mytest$Class, levels = c("NEG","POS"), ordered = FALSE)

I am adding the code and the top 4 rows of the dataset below, but as I said, the hang cannot be reproduced from this excerpt alone.

Thank you

Minimal dataset:

dput(head(mytest,4))
structure(list(X.1 = c(0.735823513122674, -0.887010102457548, 
-0.0443431087349642, -0.0443826702257031), X.2 = c(0.023701002734732, 
-0.755939222654391, -0.25548273826641, 0.00869794821614311), 
    X.3 = c(0.268299436940403, -0.682887364036797, 0.135717501768831, 
    -0.044580530524179), X.4 = c(-0.436729803742583, -0.208053691275168, 
    -0.308582743126951, 0.0441622530607122), X.5 = c(-0.578571428571428, 
    0.530434782608695, 0.233333333333334, 0.520000000000001), 
    X.6 = c(-0.292857142857144, 0.255072463768114, -0.350000000000001, 
    -0.146666666666666), Class = structure(c(2L, 2L, 1L, 1L), .Label = c("POS", 
    "NEG"), class = "factor")), .Names = c("X.1", "X.2", "X.3", 
"X.4", "X.5", "X.6", "Class"), row.names = c(NA, 4L), class = "data.frame")

Minimal, runnable code:

library(caret)
seed<-788
set.seed(seed)
num_models <- 5
num_rs <- 50
seeds <- sample.int(n = 1000000L, size = num_rs*num_models + 1L)
seeds <- lapply(seq(from = 1L, to = length(seeds), by = num_models), function(x) { seeds[x:(x+num_models-1L)] })
seeds[[num_rs + 1L]] <- seeds[[num_rs + 1L]][1L]
set.seed(seed)
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = "final",
                     seeds = seeds)
ctrl$sampling <- "down"
set.seed(seed)
plsFit <- train(Class~., 
                data = mytest,
                method = "pls",
                tuneLength = 5,
                trControl = ctrl,
                metric = "ROC")
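The seeds construction above follows the shape trainControl() expects: one integer vector per resample (10-fold CV repeated 5 times gives 50 resamples), each holding one seed per model tuned in that resample, plus a final single-seed element for the last model fit. As a sanity check, that structure can be verified in base R alone (no caret needed); this is just a sketch restating the reporter's construction:

```r
# Rebuild the seeds list exactly as in the report and verify its shape.
num_models <- 5   # models tuned per resample (tuneLength = 5)
num_rs <- 50      # resamples: 10-fold CV repeated 5 times
set.seed(788)
seeds <- sample.int(n = 1000000L, size = num_rs * num_models + 1L)
seeds <- lapply(seq(from = 1L, to = length(seeds), by = num_models),
                function(x) seeds[x:(x + num_models - 1L)])
seeds[[num_rs + 1L]] <- seeds[[num_rs + 1L]][1L]

# trainControl(seeds = ...) wants num_rs + 1 elements:
stopifnot(length(seeds) == num_rs + 1L)
# ... each resample element has num_models seeds ...
stopifnot(all(lengths(seeds[1:num_rs]) == num_models))
# ... and the final element is a single seed for the last model fit.
stopifnot(length(seeds[[num_rs + 1L]]) == 1L)
```

Note that the fourth workaround above (num_models <- 10) makes each resample element longer than the number of models actually tuned, which trainControl() tolerates.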

@topepo (Owner) commented Mar 15, 2017

Can you run str on the entire data set prior to the modifications and send the results?

@filippis (Author) commented Mar 17, 2017

Here it is:

str(mytest)
'data.frame': 1000 obs. of 7 variables:
$ X.1 : num 0.7358 -0.887 -0.0443 -0.0444 0.0446 ...
$ X.2 : num 0.0237 -0.7559 -0.2555 0.0087 0.4803 ...
$ X.3 : num 0.2683 -0.6829 0.1357 -0.0446 0.0896 ...
$ X.4 : num -0.4367 -0.2081 -0.3086 0.0442 0.1039 ...
$ X.5 : num -0.579 0.53 0.233 0.52 0.299 ...
$ X.6 : num -0.293 0.255 -0.35 -0.147 -0.701 ...
$ Class: Factor w/ 2 levels "POS","NEG": 2 2 1 1 1 2 1 1 1 2 ...

@topepo (Owner) commented Mar 23, 2017

I can reproduce this in a simulated data set. It seems to be related to the type of PLS algorithm being used. The caret code uses "oscorespls" as the method when the outcome is categorical. I have a vague recollection that this was deliberate, but I'm going back through my notes to check.

In the meantime, you can use a custom model and change the PLS method to something else. For example, here is code to use SIMPLS:

modelInfo <- list(label = "Partial Least Squares",
                  library = "pls",
                  type = c("Regression", "Classification"),
                  parameters = data.frame(parameter = 'ncomp',
                                          class = "numeric",
                                          label = '#Components'),
                  grid = function(x, y, len = NULL, search = "grid") {
                    if(search == "grid") {
                      out <- data.frame(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
                    } else {
                      out <- data.frame(ncomp = unique(sample(1:(ncol(x)-1), size = len, replace = TRUE)))
                    }
                    out
                  },
                  loop = function(grid) {     
                    grid <- grid[order(grid$ncomp, decreasing = TRUE),, drop = FALSE]
                    loop <- grid[1,,drop = FALSE]
                    submodels <- list(grid[-1,,drop = FALSE])  
                    list(loop = loop, submodels = submodels)
                  },
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) {   
                    out <- if(is.factor(y))
                    {      
                      plsda(x, y, method = "simpls", ncomp = param$ncomp, ...)
                    } else {
                      dat <- if(is.data.frame(x)) x else as.data.frame(x)
                      dat$.outcome <- y
                      plsr(.outcome ~ ., data = dat, method = "simpls", ncomp = param$ncomp, ...)
                    }
                    out
                  },
                  predict = function(modelFit, newdata, submodels = NULL) {                    
                    out <- if(modelFit$problemType == "Classification")
                    {
                      if(!is.matrix(newdata)) newdata <- as.matrix(newdata)
                      out <- predict(modelFit, newdata, type="class")
                      
                    } else as.vector(pls:::predict.mvr(modelFit, newdata, ncomp = max(modelFit$ncomp)))
                    
                    if(!is.null(submodels))
                    {
                      ## We'll merge the first set of predictions below
                      tmp <- vector(mode = "list", length = nrow(submodels))
                      
                      if(modelFit$problemType == "Classification")
                      {
                        if(length(submodels$ncomp) > 1)
                        {
                          tmp <- as.list(predict(modelFit, newdata, ncomp = submodels$ncomp))
                        } else tmp <- list(predict(modelFit, newdata, ncomp = submodels$ncomp))
                        
                      } else {
                        tmp <- as.list(
                          as.data.frame(
                            apply(predict(modelFit, newdata, ncomp = submodels$ncomp), 3, function(x) list(x))))
                      }
                      
                      out <- c(list(out), tmp)   
                    }
                    out
                  },
                  prob = function(modelFit, newdata, submodels = NULL) {
                    if(!is.matrix(newdata)) newdata <- as.matrix(newdata)
                    out <- predict(modelFit, newdata, type = "prob",  ncomp = modelFit$tuneValue$ncomp)
                    if(length(dim(out)) == 3){
                      if(dim(out)[1] > 1) {
                        out <- out[,,1]
                      } else {
                        out <- as.data.frame(t(out[,,1]))
                      }
                    }
                    if(!is.null(submodels))
                    {
                      tmp <- vector(mode = "list", length = nrow(submodels) + 1)
                      tmp[[1]] <- out
                      
                      for(j in seq(along = submodels$ncomp))
                      {
                        tmpProb <- predict(modelFit, newdata, type = "prob",  ncomp = submodels$ncomp[j])
                        if(length(dim(tmpProb)) == 3){
                          if(dim(tmpProb)[1] > 1) {
                            tmpProb <- tmpProb[,,1]
                          } else {
                            tmpProb <- as.data.frame(t(tmpProb[,,1]))
                          }
                        } 
                        tmp[[j+1]] <- as.data.frame(tmpProb[, modelFit$obsLevels,drop = FALSE])
                      }
                      out <- tmp
                    }                        
                    out
                  },
                  varImp = function(object, estimate = NULL, ...) {
                    modelCoef <- coef(object, intercept = FALSE, comps = 1:object$ncomp)
                    perf <- MSEP(object)$val
                    
                    nms <- dimnames(perf)
                    if(length(nms$estimate) > 1) {
                      pIndex <- if(is.null(estimate)) 1 else which(nms$estimate == estimate)
                      perf <- perf[pIndex,,,drop = FALSE]
                    }  
                    numResp <- dim(modelCoef)[2]
                    
                    if(numResp <= 2) {
                      modelCoef <- modelCoef[,1,,drop = FALSE]     
                      perf <- perf[,1,]
                      delta <- -diff(perf)
                      delta <- delta/sum(delta)
                      out <- data.frame(Overall = apply(abs(modelCoef), 1, 
                                                        weighted.mean, w = delta))
                    } else {       
                      perf <- -t(apply(perf[1,,], 1, diff))
                      perf <- t(apply(perf, 1, function(u) u/sum(u)))
                      out <- matrix(NA, ncol = numResp, nrow = dim(modelCoef)[1])
                      
                      for(i in 1:numResp) {
                        tmp <- abs(modelCoef[,i,, drop = FALSE])
                        out[,i] <- apply(tmp, 1,  weighted.mean, w = perf[i,])
                      }
                      colnames(out) <- dimnames(modelCoef)[[2]]
                      rownames(out) <- dimnames(modelCoef)[[1]]
                    }
                    as.data.frame(out)
                  },
                  levels = function(x) x$obsLevels,
                  predictors = function(x, ...) rownames(x$projection),
                  tags = c("Partial Least Squares", "Feature Extraction", "Linear Classifier", "Linear Regression"),
                  sort = function(x) x[order(x[,1]),])

Then you can pass method = modelInfo when you call train.
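For instance, the original call could be rerun with the custom spec in place of the "pls" string. This sketch assumes the modelInfo list above, plus the mytest data and ctrl object from the original report, are already in the session:

```r
library(caret)

# Same call as in the report, but with the custom SIMPLS model spec
# substituted for method = "pls".
set.seed(788)
plsFit <- train(Class ~ .,
                data = mytest,
                method = modelInfo,  # the custom list, not a string
                tuneLength = 5,
                trControl = ctrl,
                metric = "ROC")
```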

@filippis (Author) commented Mar 23, 2017

Many thanks for the code. I will try this.

If I understand this correctly, no matter which method I use in train ("pls", "simpls", "kernelpls"), the existing code will always use "oscorespls" internally if the outcome is categorical. Is this correct?

@topepo (Owner) commented Apr 6, 2017

That is correct, but I'm changing it to be consistent. I went back through my notes to see whether this was intentional, and I think it was a copy/paste error.
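A quick way to see which PLS algorithm a caret version hard-codes is to inspect the built-in model spec via caret::getModelInfo() and look at the classification branch of its fit function (per the discussion above, versions affected by this issue call plsda() with method = "oscorespls" there):

```r
library(caret)

# Retrieve the built-in "pls" model specification.
pls_spec <- getModelInfo("pls", regex = FALSE)[[1]]

# Print the fit function's body; for affected versions the is.factor(y)
# branch fixes the algorithm rather than honoring a user-supplied method.
print(body(pls_spec$fit))
```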

topepo added a commit that referenced this issue Apr 6, 2017
@topepo topepo closed this Apr 6, 2017