Add ability to optimize probability thresholds for class imbalances without having to create a custom model #224
Comments
|
It is not an easy thing to automate and would require significant changes to almost every model's modules. I'll work on doing it for |
|
@topepo Here's a thought: we could do this as a post-processing step and write an "optimize threshold" function that chooses a cutoff based on the ROC curve (maybe the top-left-most point on the curve). Then we could add a This wouldn't let you cross-validate the threshold, but a simple heuristic based on the ROC curve (or maybe just the median predicted probability) would probably do a lot better in unbalanced cases. |
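To make the "top-left-most point" heuristic concrete, here is a minimal, language-agnostic sketch in plain Python. It is not caret's API: the function names `roc_points` and `closest_to_top_left` are hypothetical, and it assumes you already have true labels (1 for the rare class) and predicted probabilities from some held-out data. The cutoff chosen is the one whose ROC point lies closest to (FPR = 0, TPR = 1).

```python
from math import hypot


def roc_points(y_true, probs):
    """Return (threshold, fpr, tpr) for each distinct predicted probability.

    A case is predicted positive when its probability is >= threshold.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative case")
    points = []
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points


def closest_to_top_left(y_true, probs):
    """Pick the ROC point with the smallest distance to the corner (0, 1)."""
    return min(roc_points(y_true, probs),
               key=lambda r: hypot(r[1], 1.0 - r[2]))


# Hypothetical, made-up data: 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.35, 0.30, 0.45]
threshold, fpr, tpr = closest_to_top_left(y_true, probs)
```

On this toy data no case reaches 0.5, so the default cutoff would never predict the rare class, while the selected cutoff sits well below 0.5. As noted in the comment, a cutoff picked this way is not cross-validated.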
|
I've thought about doing that but have resisted for a few reasons:
So it is not a bad idea, but let's scope out how this would work and what modules would be affected. |
|
Sounds good! |
|
Thank you both for your consideration of this issue. |
|
Hi @topepo , I was wondering if you had made any progress on implementing this functionality into caret? |
|
No, it's fairly low on the list given the complexity it would generate to work across all of the models. I won't say never, but not in the short term. |
|
Actually, now that I think about it, it might be easier to create a separate function that can compute the required statistics after the model fit. I'll see what I can do in the short term in that way. |
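As a rough illustration of computing statistics after the fit, the sketch below sweeps a grid of candidate cutoffs over held-out predicted probabilities and reports sensitivity, specificity, and Youden's J for each. This is plain Python, not the function referred to in this thread: the name `threshold_stats`, the choice of statistics, and the data are all assumptions made for the example.

```python
def threshold_stats(y_true, probs, thresholds):
    """Compute sensitivity, specificity and Youden's J at each cutoff.

    y_true holds 1 for the rare (positive) class and 0 otherwise;
    a case is predicted positive when its probability is >= the cutoff.
    """
    rows = []
    for t in thresholds:
        tp = fn = tn = fp = 0
        for y, p in zip(y_true, probs):
            predicted_positive = p >= t
            if y == 1 and predicted_positive:
                tp += 1
            elif y == 1:
                fn += 1
            elif predicted_positive:
                fp += 1
            else:
                tn += 1
        # Guard against a class being absent from the held-out data.
        sens = tp / (tp + fn) if (tp + fn) else float("nan")
        spec = tn / (tn + fp) if (tn + fp) else float("nan")
        rows.append({"threshold": t,
                     "sensitivity": sens,
                     "specificity": spec,
                     "J": sens + spec - 1.0})
    return rows


# Hypothetical, made-up held-out predictions: 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.35, 0.30, 0.45]
for row in threshold_stats(y_true, probs, [0.5, 0.3, 0.1]):
    print(row)
```

If these probabilities come from resampling hold-outs rather than the training fit, the per-cutoff statistics are less optimistic, though the sketch itself does not enforce that.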
|
So this is basically what @zachmayer was advocating but a little more manual. Please take a look and let me know if you find any issues. The next step is to parameterize a I still have my reservations but give it a whirl and see how it works for you. |
|
We won't be implementing anything beyond |
I am interested in binary classification on an extremely unbalanced dataset. As is often the case with such data, the default probability threshold of 0.5 is inappropriate, and results in a lack of sensitivity to predict the rarer class.
I was pleased to see that this issue has been addressed through the use of custom models as described here: http://topepo.github.io/caret/custom_models.html#Illustration5
However, I have not been able to get this method working with the method I'd like to use (gbm), no doubt due to my lack of programming ability. I have seen questions on StackOverflow suggesting that other caret users have been having similar difficulties.
Would it then be possible to include functionality in caret making it easier for the user to optimize the probability threshold, without having to create a custom model?
Thanks.