Add ability to optimize probability thresholds for class imbalances without having to create a custom model #224

Closed
iron0012 opened this issue Aug 17, 2015 · 10 comments

Comments

@iron0012 commented Aug 17, 2015

I am interested in binary classification on an extremely unbalanced dataset. As is often the case with such data, the default probability threshold of 0.5 is inappropriate and results in very low sensitivity for the rarer class.
I was pleased to see that this issue has been addressed through the use of custom models, as described here: http://topepo.github.io/caret/custom_models.html#Illustration5
However, I have not been able to get this approach working with the model I'd like to use (gbm), no doubt due to my lack of programming ability. I have seen questions on StackOverflow suggesting that other caret users have had similar difficulties.
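
To illustrate what I'm after, the manual post-hoc version would be something like the sketch below (just a sketch; training, testing, and the Class1/Class2 outcome levels are placeholders for my data):

```r
library(caret)

## fit gbm through caret with class probabilities turned on
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(Class ~ ., data = training,
             method = "gbm", metric = "ROC",
             trControl = ctrl, verbose = FALSE)

## get probabilities for held-out data and re-threshold by hand
probs <- predict(fit, newdata = testing, type = "prob")

cutoff <- 0.2   # instead of the default 0.5; chosen by hand here
pred_class <- factor(ifelse(probs$Class1 >= cutoff, "Class1", "Class2"),
                     levels = levels(testing$Class))
confusionMatrix(pred_class, testing$Class)
```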

Would it then be possible to include functionality in caret making it easier for the user to optimize the probability threshold, without having to create a custom model?

Thanks.

@topepo (Owner) commented Aug 20, 2015

It is not an easy thing to automate and would require significant changes to almost every model's modules.

I'll work on doing it for gbm over the next few days.

@zachmayer (Collaborator) commented Aug 20, 2015

@topepo Here's a thought: we could do this as a post-processing step and write an "optimize threshold" function that chooses a cutoff based on the ROC curve (maybe the top-left-most point on the curve).

Then we could add a threshold argument to predict.train. If the threshold is set and the model supports class probabilities, predict.train would use the probabilities plus the threshold to predict the classes.

This wouldn't let you cross-validate the threshold, but a simple heuristic based on the ROC curve (or maybe just the median predicted probability) would probably do a lot better in unbalanced cases.
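
Something like this with pROC, for example (a rough sketch; obs and probs$Class1 stand in for held-out labels and event-class probabilities):

```r
library(pROC)

## controls listed first, cases (the event class) second
roc_obj <- roc(response = obs, predictor = probs$Class1,
               levels = c("Class2", "Class1"))

## cutoff closest to the top-left corner (0, 1) of the ROC curve
coords(roc_obj, x = "best", best.method = "closest.topleft",
       ret = c("threshold", "sensitivity", "specificity"))
```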

@topepo (Owner) commented Aug 20, 2015

I've thought about doing that but have resisted for a few reasons:

  • it is only for two-class problems and adding that functionality adds complexity for one particular (but very important) modeling case.
  • we would want to use the out-of-sample predictions to do this. I think doing it inside the resampling process (via the custom model) is probably best but it is worth doing some simulations to test this against a post hoc approach.
  • I would worry that we would overfit the threshold to the training data.
  • If we do something like this, we should maybe bundle it with a probability calibration step. My boss asked this morning why we have an optimized threshold of 0.001 (on a data set with a 1% event rate). Adjusting the threshold is mostly about trading off errors, and a lot of the time the need for it comes from class imbalance; the imbalance "whacks out" the resulting probability distribution. Just a thought.

So it is not a bad idea, but let's scope out how this would work and what modules would be affected.

@zachmayer (Collaborator) commented Aug 20, 2015

Sounds good!

@iron0012 (Author) commented Aug 20, 2015

Thank you both for your consideration of this issue.

@scbrown86 commented Jul 28, 2017

Hi @topepo, I was wondering if you had made any progress on implementing this functionality in caret?

@topepo (Owner) commented Jul 28, 2017

No, it's fairly low on the list given the complexity it would generate to make it work across all of the models. I won't say never, but not in the short term.

@topepo (Owner) commented Jul 28, 2017

Actually, now that I think about it, it might be easier to create a separate function that computes the required statistics after the model fit. I'll see what I can do along those lines in the short term.

topepo added a commit that referenced this issue Jul 28, 2017

@topepo (Owner) commented Jul 28, 2017

So this is basically what @zachmayer was advocating but a little more manual. Please take a look and let me know if you find any issues.

The next step is to parameterize a threshold argument for train's predict method.

I still have my reservations but give it a whirl and see how it works for you.
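
The basic usage of the new thresholder function looks something like this (a rough sketch; the data names, resampling setup, and threshold grid are just for illustration):

```r
library(caret)

## the train object needs class probabilities and saved resampling predictions
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = "all")

fit <- train(Class ~ ., data = training,
             method = "gbm", metric = "ROC",
             trControl = ctrl, verbose = FALSE)

## evaluate a grid of cutoffs on the resampled (out-of-sample) predictions
th <- thresholder(fit, threshold = seq(0.05, 0.95, by = 0.05), final = TRUE)
head(th)

## then pick the cutoff that maximizes whatever statistic matters here,
## e.g. Youden's J (column name taken from the default "all" statistics)
th[which.max(th$J), ]
```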

@topepo (Owner) commented Aug 19, 2020

We won't be implementing anything beyond thresholder.

@topepo closed this Aug 19, 2020