
Change default arguments of ROC computation (pROC) in the RFE process (caret) #431

Closed
RafaOR opened this issue May 25, 2016 · 14 comments

@RafaOR commented May 25, 2016

I am computing a SVM-RFE model using "ROC" as the metric with the rfe function of the caret package and I have noticed that there is no way to change the default arguments of the roc function (pROC package). In my case, for example, I would like to set the direction argument to "<" instead of "auto" because in some cases the resulting AUC is computed in reverse. Would it be possible to consider this enhancement?

@xrobin (Contributor) commented May 25, 2016

This is actually an issue with weak classifiers whose AUC is close to 0.5. The AUC is systematically over-estimated with the default direction ("auto"), which attempts to keep the AUC above 0.5 at all times. In general one should never use the default direction in a resampling operation.

The way I handle this in pROC's bootstrapping operations is to compute the direction on the whole ROC curve and keep that direction fixed during the resampling.
An alternative would be to simply fix the direction to "<". That is reasonable most of the time, as positive classifications are typically assigned to higher values. I am not very familiar with caret, so I am somewhat unsure whether we can assume this will always be the case, but even if it is not, it is probably better than the current state.
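
To see the effect concretely, here is a minimal sketch (my own illustrative simulation, not code from caret or this thread): with a pure-noise predictor the true AUC is 0.5, but direction = "auto" re-picks the direction in every resample and therefore never reports an AUC below 0.5, inflating the resampled average.

library(pROC)

set.seed(42)
cls <- factor(rep(c("neg", "pos"), each = 50))
noise <- rnorm(100)  # no real signal, so the true AUC is 0.5

resampled_auc <- function(dir) {
  replicate(200, {
    idx <- sample(100, replace = TRUE)  # bootstrap resample
    as.numeric(pROC::roc(cls[idx], noise[idx],
                         levels = c("neg", "pos"), direction = dir)$auc)
  })
}

mean(resampled_auc("auto"))  # systematically above 0.5
mean(resampled_auc("<"))     # close to 0.5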

The roc function from pROC seems to be called in three different places, in aaa.R and filterVarImp.R. I will try to submit a pull request with the latter approach.

@topepo (Owner) commented May 25, 2016

We need to be careful here. Right now, we assume that the first factor level is the event of interest. That has been consistent in the documentation and teaching materials for some time. The changes should ensure that this design choice is not negated.

@xrobin (Contributor) commented May 25, 2016

And pROC assumes that the level of interest is the second level of the factor, so indeed we already have an inconsistency, which has relied on the direction = "auto" argument. This is probably the biggest design mistake I made with pROC.

Ideally it would be best to have calls like this:

levelOfInterest <- lev[1]
levelOfNegatives <- lev[2]
pROC::roc(obs, data[,class], direction = "<", levels = c(levelOfNegatives, levelOfInterest))

So two things are left:

  • The levels are not available in rocPerCol. Can I assume that I will get them correctly if I do lev <- levels(cls), or could they end up being reversed?

  • Do you always assume that the observations of interest will get higher predicted values, or could this be false in some cases?

@topepo (Owner) commented May 25, 2016

Here is how rocPerCol is used in filterVarImp (where the original outcome factor may have many levels):

# Pairwise comparisons: for each pair of outcome classes, subset the data to
# those two classes and apply rocPerCol to every predictor column.
for (i in 1:k) {
  for (j in i:k) {
    if (i != j) {
      classIndex[[i]] <- c(classIndex[[i]], counter)
      classIndex[[j]] <- c(classIndex[[j]], counter)
      index <- which(y %in% c(classLevels[i], classLevels[j]))
      tmpX <- x[index, , drop = FALSE]
      tmpY <- factor(as.character(y[index]), levels = c(classLevels[i], classLevels[j]))
      tmpStat[, counter] <- apply(tmpX, 2, rocPerCol, cls = tmpY)
      counter <- counter + 1
    }
  }
}

In this case, we might want to reorder the classes based on their average value, since there is no guarantee about the direction (and it is just being used to screen variables).
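
As a hypothetical sketch of that idea (not caret's actual internals; the real body of rocPerCol may differ), the two classes could be ordered by their mean predictor value before calling roc with a fixed direction:

rocPerColOrdered <- function(dat, cls) {
  # Put the class with the lower average predictor value first, so that
  # direction = "<" (controls < cases) is always the right orientation.
  ordered_lev <- names(sort(tapply(dat, cls, mean)))
  as.numeric(pROC::roc(cls, dat, levels = ordered_lev, direction = "<")$auc)
}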

@xrobin (Contributor) commented May 27, 2016

If it's used only for screening variables of unknown direction, then I guess it is safe to keep it on "auto" in rocPerCol.

@RafaOR (Author) commented May 27, 2016

I think that I am not understanding everything you say (I have only been using R for a few months...).

What I am doing now is to recalculate the average AUC values from the $pred data of the RFE model for every possible number of predictors, setting the direction argument to "<", and then update the RFE model (with the update function) to the set of predictors with the best average AUC value.

Is this a good, simple solution to the problem?
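
If it helps others, that workaround might look roughly like the sketch below. It assumes a fitted rfe object rfeMod that saved its resampled predictions, a first class named "Class1", and training data x/y; all of these names are placeholders, not code from the thread.

library(pROC)
# Average the fixed-direction AUC over the predictions for each subset size.
aucBySize <- sapply(split(rfeMod$pred, rfeMod$pred$Variables), function(d)
  as.numeric(pROC::roc(d$obs, d$Class1, direction = "<")$auc))
bestSize <- as.numeric(names(which.max(aucBySize)))
# Refit the RFE model at the subset size with the best recomputed AUC.
rfeMod2 <- update(rfeMod, x = x, y = y, size = bestSize)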

@topepo (Owner) commented May 27, 2016

The simplest short-term solution is to modify the summary components of the functions (as described here) and change how roc is called.

@RafaOR (Author) commented May 30, 2016

To this end, I have followed the instructions in your book Applied Predictive Modeling (very helpful, by the way!). The code I use is:

fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
svmFuncs <- caretFuncs
svmFuncs$summary <- fiveStats

but I do not know how to change how roc is called. What should I do to change it?

@topepo (Owner) commented May 31, 2016

You are using rfe? If that's the case, then pass it to the control function. This page has details under "helper functions". If you are passing the method to train, then use something like

ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats
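
Putting this together for rfe, a hypothetical summary function with a fixed ROC direction could look like the sketch below. It is modeled on the shape of twoClassSummary and on the call proposed earlier in this thread; the name fixedRocSummary and the exact body are illustrative, not caret's actual code.

fixedRocSummary <- function(data, lev = NULL, model = NULL) {
  # caret's convention: lev[1] is the event of interest and data[, lev[1]]
  # holds its class probability. Pass the negative class first and the event
  # second, as pROC expects, and fix the direction instead of using "auto".
  rocObject <- try(pROC::roc(data$obs, data[, lev[1]],
                             levels = c(lev[2], lev[1]), direction = "<"),
                   silent = TRUE)
  rocAUC <- if (inherits(rocObject, "try-error")) NA else as.numeric(rocObject$auc)
  c(ROC = rocAUC,
    Sens = sensitivity(data$pred, data$obs, lev[1]),
    Spec = specificity(data$pred, data$obs, lev[2]))
}

ctrl <- rfeControl(functions = caretFuncs, method = "cv")
ctrl$functions$summary <- fixedRocSummary
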
topepo added a commit that referenced this issue May 31, 2016
Fix direction argument for pROC (#431)

@topepo (Owner) commented Jun 11, 2016

Now that I've gotten around to running the regression tests, I think that twoClassSummary should use ">" as the direction.

It uses the class probability associated with the first level of the factor to compute the ROC curve, and that probability should have the higher value for that class (unless the model is somehow degenerate).

rocObject <- try(pROC::roc(data$obs, data[, lev[1]]), silent = TRUE)

For example:

> library(caret)
> library(pROC)
> 
> set.seed(346)
> dat <- twoClassSim(200)
> 
> set.seed(35)
> mod <- train(Class ~ ., data = dat,
+              method = "lda",
+              metric = "ROC",
+              trControl = trainControl(method = "cv",
+                                       savePredictions = TRUE,
+                                       summaryFunction = twoClassSummary, 
+                                       classProbs = TRUE))
> 
> 
> mod$finalModel$obsLevels[1]
[1] "Class1"
> 
> ggplot(mod$pred, aes_string(x = "obs", y = mod$finalModel$obsLevels[1])) +
+   geom_boxplot()
> 
> pROC::roc(mod$pred$obs, mod$pred$Class1, direction = "<")

Call:
roc.default(response = mod$pred$obs, predictor = mod$pred$Class1,     direction = "<")

Data: mod$pred$Class1 in 110 controls (mod$pred$obs Class1) < 90 cases (mod$pred$obs Class2).
Area under the curve: 0.09222
> pROC::roc(mod$pred$obs, mod$pred$Class1, direction = ">")

Call:
roc.default(response = mod$pred$obs, predictor = mod$pred$Class1,     direction = ">")

Data: mod$pred$Class1 in 110 controls (mod$pred$obs Class1) > 90 cases (mod$pred$obs Class2).
Area under the curve: 0.9078

@zijianding commented Jun 16, 2016

I've run into this problem when using random forest as well. Here are my concerns:

  1. The AUC is actually being optimized with respect to the negative (non-target) class.
    When training the best random forest model, I use the AUC of the ROC curve to find the best mtry parameter. This requires trainControl to call the roc function in pROC, whose parameters (including levels and direction) we cannot modify.
  2. The pROC package gives opposite AUC values when the positive and negative classes are switched.
    I am not quite sure why this happens, but one can show that, for fixed predicted scores, the AUC should stay the same when the positive and negative classes are switched.

I cannot fully understand the solutions mentioned above, so:

  1. If I want to change the direction parameter of roc when calling trainControl, how can I achieve this?
  2. Or is it possible to switch to another package that takes the first level as the target class, such as ROCR?

@topepo (Owner) commented Jun 16, 2016

@zijianding :

This requires trainControl to call the roc function in pROC, whose parameters (including levels and direction) we cannot modify.

You can change those parameters: modify the twoClassSummary function and pass it to summaryFunction.

Probably the better bet would be to change the ordering of your factor outcome; train assumes that the first level is the event of interest.
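
For example (hypothetical level names), making the event of interest the first level:

y <- relevel(y, ref = "Event")                   # move "Event" to the first position
# or, fully explicit:
y <- factor(y, levels = c("Event", "NonEvent"))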

For your second comment, we would need a reproducible example to understand and comment.

@zijianding commented Jun 17, 2016

@topepo:
Thank you!
I've followed your advice and here is the new code:
rocObject <- try(pROC::roc(response = data$obs, predictor = data[, lev[1]], levels = c(lev[2], lev[1]), direction = "<"), silent = TRUE)
which replaces the original code
rocObject <- try(pROC::roc(data$obs, data[, lev[1]]), silent = TRUE)

Here are the notes:

  1. The major modification is that I pass the target class lev[1] to the roc function as the second level, which is what the pROC package expects.
  2. Also, during the modification, a weird error occurred: R could not find the function requireNamespaceQuietStop (it also failed with requireNamespace), so I call the pROC package directly.

Anyhow, I think the code now runs properly. Thank you so much for your help!

@xrobin (Contributor) commented Jul 3, 2016

Sorry for the late answer.

I agree with this fix. Setting direction = ">" is exactly equivalent to switching positive and negative observations and will make no difference to the AUC. Unless the ROC curve is exposed to the user at some point, it is perfectly fine.
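
A quick illustrative check of that equivalence (simulated data, not from the thread):

library(pROC)
set.seed(1)
y <- factor(rep(c("a", "b"), each = 25))
x <- rnorm(50) + (y == "b")  # class "b" tends to score higher
pROC::roc(y, x, levels = c("a", "b"), direction = ">")$auc
pROC::roc(y, x, levels = c("b", "a"), direction = "<")$auc  # same AUC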

One additional note: it looks like my previous pull request was merged, so rocPerCol now uses the (wrong) direction. I can make a new pull request to revert this.

@topepo closed this Aug 19, 2020