Change default arguments of ROC computation (pROC) in the RFE process (caret) #431
Comments
|
This is actually an issue with poor classifiers whose AUC is close to 0.5. The AUC is systematically over-estimated with the default direction (auto), which attempts to keep the AUC above 0.5 at all times. In general, one should never use the default direction in a resampling operation. The way I handle this in the pROC bootstrapping operations is to compute the direction on the whole ROC curve and keep that direction fixed during the resampling. The ROC function from pROC seems to be called in 3 different places, in aaa.R and filterVarImp.R. I will try to submit a pull request with the latter approach. |
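The over-estimation described here can be reproduced without any package. What follows is a minimal sketch in base R, not the actual pROC or caret code: `auc_fixed` and `pick_direction` are hypothetical helpers, and the "auto" rule used (compare the group medians) is an assumption about how the default direction is chosen. A nearly uninformative predictor is bootstrapped twice over the same resamples, once re-picking the direction in every resample and once with the direction fixed from the full data.

```r
# Hypothetical helper: AUC for a fixed direction, via the Mann-Whitney statistic.
# direction "<" means controls are expected to score lower than cases.
auc_fixed <- function(controls, cases, direction = "<") {
  n0 <- length(controls)
  n1 <- length(cases)
  r <- rank(c(controls, cases))              # ties receive average ranks
  u <- sum(r[n0 + seq_len(n1)]) - n1 * (n1 + 1) / 2
  a <- u / (n0 * n1)                         # estimate of P(control < case)
  if (direction == "<") a else 1 - a
}

# Assumed "auto" rule: choose the direction from the group medians.
pick_direction <- function(controls, cases) {
  if (median(controls) <= median(cases)) "<" else ">"
}

# A predictor with almost no signal: the two classes interleave.
controls <- seq(1, 39, by = 2)
cases    <- seq(2, 40, by = 2)
full_dir <- pick_direction(controls, cases)  # decided once, on all the data

set.seed(431)
boot <- replicate(2000, {
  b0 <- sample(controls, replace = TRUE)
  b1 <- sample(cases, replace = TRUE)
  c(auto  = auc_fixed(b0, b1, pick_direction(b0, b1)),  # direction re-picked
    fixed = auc_fixed(b0, b1, full_dir))                # direction kept fixed
})
res <- rowMeans(boot)
print(round(res, 3))
```

The full-data AUC of this toy predictor is 0.525. The fixed-direction mean should stay close to that value, while the mean with a re-picked direction should come out noticeably higher.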
|
We need to be careful here. Right now, we assume that the first factor is the event of interest. That has been consistent in the documentation and teaching materials for some time. The changes should ensure that this design choice is not negated. |
|
And pROC assumes that the level of interest is the second level of the factor, so we do indeed already have an inconsistency, and have relied on the direction="auto" argument. This is probably the biggest design mistake I made with pROC. Ideally it would be best to have calls like this:
So two things are left:
|
|
Here is how:
for (i in 1:k) {
  for (j in i:k) {
    if (i != j) {
      classIndex[[i]] <- c(classIndex[[i]], counter)
      classIndex[[j]] <- c(classIndex[[j]], counter)
      index <- which(y %in% c(classLevels[i], classLevels[j]))
      tmpX <- x[index, , drop = FALSE]
      tmpY <- factor(as.character(y[index]), levels = c(classLevels[i], classLevels[j]))
      tmpStat[, counter] <- apply(tmpX, 2, rocPerCol, cls = tmpY)
      counter <- counter + 1
    }
  }
}
In this case, we might want to reorder the classes based on their average value, since there is no guarantee on the direction (and it is just being used to screen variables). |
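One way to read the reordering suggestion is sketched below in base R. It is only an illustration and not the code used in caret: `roc_by_mean` is a hypothetical helper that, for a single predictor and a two-class outcome, treats the class with the lower mean as the control group before computing the AUC. Ordering by the mean does not by itself guarantee an AUC above 0.5.

```r
# Hypothetical helper: AUC of one predictor for a two-class outcome, with the
# class that has the lower mean predictor value treated as the control group.
roc_by_mean <- function(x, y) {
  y <- factor(y)
  stopifnot(nlevels(y) == 2)
  means <- tapply(x, y, mean)
  ord <- names(means)[order(means)]          # lower-mean class first
  controls <- x[y == ord[1]]
  cases    <- x[y == ord[2]]
  n0 <- length(controls)
  n1 <- length(cases)
  r <- rank(c(controls, cases))              # ties receive average ranks
  # Mann-Whitney estimate of P(control < case)
  (sum(r[n0 + seq_len(n1)]) - n1 * (n1 + 1) / 2) / (n0 * n1)
}

x <- c(5, 6, 7, 1, 2, 3)
y <- factor(c("a", "a", "a", "b", "b", "b"))

print(roc_by_mean(x, y))       # class "b" has the lower mean, so it is the control
print(roc_by_mean(x, rev(y)))  # labels swapped: the result does not change
```

Because the control group is chosen per predictor, the result no longer depends on which class happens to be the first factor level, which seems acceptable when the statistic is only used to screen variables.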
|
If it's used only for screening variables of unknown direction, then I guess it is safe to keep it on "auto" in |
|
I think that I am not understanding everything you say (I have only been using R for a few months...) What I am doing now is to recalculate the average AUC values with the Is this a good simple solution to the problem? |
|
The simplest short-term solution is to modify the |
|
To this end, I have followed the instructions in your book Applied Predictive Modeling (very helpful by the way!). The code I use is:
but I do not know how to change how |
|
You are using
ctrl$functions <- caretFuncs
ctrl$functions$summary <- fiveStats |
Fix direction argument for pROC (#431)
|
Now that I've gotten around to running the regressions tests, I think that It chooses the class probability associated with the first level of the factor to use to compute the ROC curve and that should have the higher value (unless the model is degenerate somehow).
For example:
> library(caret)
> library(pROC)
>
> set.seed(346)
> dat <- twoClassSim(200)
>
> set.seed(35)
> mod <- train(Class ~ ., data = dat,
+ method = "lda",
+ metric = "ROC",
+ trControl = trainControl(method = "cv",
+ savePredictions = TRUE,
+ summaryFunction = twoClassSummary,
+ classProbs = TRUE))
>
>
> mod$finalModel$obsLevels[1]
[1] "Class1"
>
> ggplot(mod$pred, aes_string(x = "obs", y = mod$finalModel$obsLevels[1])) +
+ geom_boxplot()
>
> pROC::roc(mod$pred$obs, mod$pred$Class1, direction = "<")
Call:
roc.default(response = mod$pred$obs, predictor = mod$pred$Class1, direction = "<")
Data: mod$pred$Class1 in 110 controls (mod$pred$obs Class1) < 90 cases (mod$pred$obs Class2).
Area under the curve: 0.09222
> pROC::roc(mod$pred$obs, mod$pred$Class1, direction = ">")
Call:
roc.default(response = mod$pred$obs, predictor = mod$pred$Class1, direction = ">")
Data: mod$pred$Class1 in 110 controls (mod$pred$obs Class1) > 90 cases (mod$pred$obs Class2).
Area under the curve: 0.9078 |
|
I've run into this problem when using random forest as well. Here are the concerns:
I cannot fully understand the solutions mentioned above, so
|
You can change those parameters. You can modify the Probably the better bet would be to change the ordering of your factor outcome. For your second comment, we would need a reproducible example to understand and comment. |
|
@topepo : Here are the concerns:
Anyhow, I guess the code runs properly. Thank you so much for your help! |
|
Sorry for the late answer. I agree with this fix. Setting direction = ">" is exactly equivalent to switching positive and negative observations and will make no difference to the AUC. Unless the ROC curve is exposed to the user at some point, it is perfectly fine. One additional note: it looks like my previous pull request was merged, so |
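The equivalence can be checked numerically. The sketch below uses base R only and a hypothetical helper `auc_fixed` (a Mann-Whitney AUC with a pROC-like `direction` argument, which is an assumption and not pROC's implementation). It compares direction ">" on the original labels with direction "<" after swapping controls and cases, and also shows that the two directions give complementary values, in line with the two AUCs in the lda example summing to about 1.

```r
# Hypothetical helper: AUC for a fixed direction, via the Mann-Whitney statistic.
# direction "<" means controls are expected to score lower than cases.
auc_fixed <- function(controls, cases, direction = "<") {
  n0 <- length(controls)
  n1 <- length(cases)
  r <- rank(c(controls, cases))              # ties receive average ranks
  u <- sum(r[n0 + seq_len(n1)]) - n1 * (n1 + 1) / 2
  a <- u / (n0 * n1)                         # estimate of P(control < case)
  if (direction == "<") a else 1 - a
}

set.seed(1)
class1 <- rnorm(30, mean = 1)   # simulated scores for the first class
class2 <- rnorm(25, mean = 0)   # simulated scores for the second class

a_lt      <- auc_fixed(controls = class1, cases = class2, direction = "<")
a_gt      <- auc_fixed(controls = class1, cases = class2, direction = ">")
a_swapped <- auc_fixed(controls = class2, cases = class1, direction = "<")

print(c(a_lt = a_lt, a_gt = a_gt, a_swapped = a_swapped))
print(isTRUE(all.equal(a_gt, a_swapped)))  # ">" matches "<" with swapped classes
print(isTRUE(all.equal(a_lt + a_gt, 1)))   # the two directions are complementary
```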
I am computing an SVM-RFE model using "ROC" as the metric with the rfe function of the caret package, and I have noticed that there is no way to change the default arguments of the roc function (pROC package). In my case, for example, I would like to set the direction argument to "<" instead of "auto" because in some cases the resulting AUC is computed in reverse. Would it be possible to consider this enhancement?