
Add wrapper class that changes threshold value for predict #8614

Open
amueller opened this issue Mar 19, 2017 · 51 comments · May be fixed by #10117

Comments


@amueller amueller commented Mar 19, 2017

This was discussed before, but I'm not sure there is an issue for it.
We should have a wrapper class that changes the decision threshold based on a cross-validation (or hold-out / pre-fit) estimate.
This is a very common need, so I think we should have a built-in solution.
Simple rules for selecting a new threshold are:

  • picking the point on the ROC curve that is closest to the ideal corner (a small sketch of this rule follows the list)
  • picking the point on the precision-recall curve that is closest to the ideal corner
  • optimizing one metric while holding another one constant: for example, find the threshold that yields the best recall subject to a precision of at least 10%. We could also make this slightly less flexible and just say "the median, over cross-validation folds, of the largest threshold that yields at least X precision".
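A minimal sketch of the first rule, assuming a fitted binary classifier with predict_proba and a held-out set; the helper name and variables are illustrative, not a proposed API:

import numpy as np
from sklearn.metrics import roc_curve

def roc_corner_threshold(clf, X_val, y_val):
    """Pick the threshold whose ROC point is closest to the ideal corner (FPR=0, TPR=1)."""
    scores = clf.predict_proba(X_val)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_val, scores)
    distances = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)
    return thresholds[np.argmin(distances)]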

@amueller amueller commented Mar 19, 2017

I wouldn't object to that.

@amueller amueller commented Mar 19, 2017

There was also a recent paper by some colleagues from NYU on how to properly control for different kinds of errors, but I need to search for that...

@glouppe glouppe commented Mar 20, 2017

Related to #6663 from @betatim ?

@amueller amueller commented Mar 26, 2017

Yes, definitely related to #6663. Though I think I would implement it as a meta-estimator rather than a transformer, and I would add cross-validation to adjust the threshold using the strategies mentioned above.

@amueller amueller added this to PR phase in Andy's pets Jul 21, 2017
@amueller amueller moved this from PR phase to needs pr in Andy's pets Jul 21, 2017
@PGryllos PGryllos commented Oct 2, 2017

Hey, it seems like no one is working on that; can I give it a try?


@PGryllos PGryllos commented Oct 3, 2017

@jnothman that sounds good. Indeed, I wouldn't try to commit the whole feature at once. I also see it as a way to become more familiar with the library. I first need to get a good grip on the requested feature, though. @amueller, did you maybe find the paper you mentioned?
Thanks in advance.

@PGryllos PGryllos commented Oct 19, 2017

To give an update: I went through the mentioned issues and now have a better understanding of the task. I think I can start implementing the API of the class, but I could still use some help on what exactly the threshold decision should be based on. Maybe, as @amueller said, provide all three options and let the user decide which one to use?


@PGryllos PGryllos commented Oct 23, 2017

OK, I have to admit I am still pretty stuck. I looked at CalibratedClassifierCV in the calibration module, which seems to tackle a very similar problem; #6663 and #4822 also work on calibrating probability thresholds. I am not sure how exactly the proposal fits in, and I find it difficult to find my way around the problem. I would like some guidance or some paper suggestions to help me understand what is being asked, but if that is too much, let me know if someone else should take over.

@amueller amueller commented Oct 24, 2017

Maybe start with implementing a wrapper that picks the point on the P/R curve that's closest to the top right. Try to understand what that would mean doing it on the training set (or using a "prefit" model).
Then, do the same, but inside a cross-validation loop, and take the average (?) of the thresholds.
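A rough sketch of that suggestion (purely illustrative; the helper names are made up): pick the precision-recall threshold closest to the top-right corner on a validation set, then repeat it inside a cross-validation loop and average the per-fold thresholds.

import numpy as np
from sklearn.base import clone
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold

def pr_corner_threshold(y_true, scores):
    """Threshold whose P/R point is closest to the top-right corner (precision=1, recall=1)."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the final
    # (precision=1, recall=0) point, which has no corresponding threshold.
    distances = np.sqrt((1 - precision[:-1]) ** 2 + (1 - recall[:-1]) ** 2)
    return thresholds[np.argmin(distances)]

def cv_pr_corner_threshold(estimator, X, y, n_splits=5):
    """Average the per-fold thresholds found on each validation split (X, y as arrays)."""
    per_fold = []
    for train_idx, val_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        est = clone(estimator).fit(X[train_idx], y[train_idx])
        scores = est.predict_proba(X[val_idx])[:, 1]
        per_fold.append(pr_corner_threshold(y[val_idx], scores))
    return np.mean(per_fold)  # averaging is only one option, as discussed below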

@amueller amueller commented Oct 24, 2017

This seems relevant: https://www.ncbi.nlm.nih.gov/pubmed/16828672 (via Max Kuhn's book) but I haven't really found more?

@amueller amueller commented Oct 24, 2017

There's some info here: http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf

and maybe looking at ROCR helps: http://rocr.bioinf.mpi-sb.mpg.de/

but I haven't actually found anything explicitly describing a methodology.


@PGryllos PGryllos commented Oct 25, 2017

@amueller thanks for taking the time.

Try to understand what that would mean doing it on the training set (or using a "prefit" model)

If I have understood correctly: if the base_estimator is not prefit, we should hold out part of the training set to calibrate the threshold and use the rest to fit the classifier; otherwise we can use the whole training set for calibration of the decision threshold. The same as in the CalibratedClassifierCV class.

What I am actually still not 100% sure about is how this wrapper is going to be different from CalibratedClassifierCV. That class already implements the CV-fold logic for calibrating. Should this new wrapper be a similar class that just adds the option to calibrate using the ROC curve?


@PGryllos PGryllos commented Oct 29, 2017

@jnothman @amueller I am thinking about how the calibrated predict should work in the case of multilabel classification.

i.e. we have 3 labels [1, 2, 3] and the thresholds we found after calibration are [.6, .3, .7]

  • for sample 1, predict_proba predicts confidence values [.5, .4, .6]. In this case only the confidence for label 2 is above its threshold, so the calibrated predict should return 2.
  • for sample 2, predict_proba predicts confidence values [.7, .5, .6]. In this case we have two confidence values above the corresponding label's threshold. The confidence / threshold ratio for label 2 is higher than for label 1. Do you think it would make sense to base the prediction on this ratio in this case? (A small sketch of this rule follows below.)
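For concreteness, a minimal sketch of the ratio rule asked about in the second bullet (all names are illustrative; this assumes a single predicted label per sample, with one calibrated threshold per label):

import numpy as np

def predict_with_thresholds(proba, thresholds, classes):
    """proba: (n_samples, n_classes); thresholds and classes: length n_classes."""
    # Predict the label whose confidence exceeds its threshold by the largest ratio.
    ratios = proba / np.asarray(thresholds, dtype=float)
    return np.asarray(classes)[np.argmax(ratios, axis=1)]

# The two samples from the comment above:
proba = np.array([[0.5, 0.4, 0.6],
                  [0.7, 0.5, 0.6]])
print(predict_with_thresholds(proba, [0.6, 0.3, 0.7], [1, 2, 3]))  # -> [2 2]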
@PGryllos PGryllos commented Oct 29, 2017

Also, the way I see it, the threshold calibration functionality should be offered through the CalibratedClassifierCV class (mainly because the name is broad enough to cover all calibration practices), which currently offers probability calibration; it should then be an option to choose threshold calibration, probability calibration, or both. What do you think?


@PGryllos PGryllos commented Oct 30, 2017

in multilabel you should calibrate each output column independently

do you mean that the threshold calibration should be used only for binary classification?

@PGryllos PGryllos referenced a pull request that will close this issue Nov 12, 2017
@PGryllos PGryllos commented Nov 20, 2017

@amueller @jnothman I have come to understand that a decision threshold / cut-off calibrated using the methods mentioned above (I have tried the first two) is not guaranteed to yield optimal classification accuracy. http://optimalprediction.com/files/pdf/V3A29.pdf

I am fairly new to the medical literature, but I understand that our point here is not to improve the classification accuracy but to provide a cut-off strategy aimed at improving specificity and sensitivity. From a quick look, there seem to be suggested ways to target optimal-accuracy cut-off points, but it's not clear to me yet exactly what that means (also maybe interesting: https://www.researchgate.net/publication/51496344_Performance_Measures_for_Prediction_Models_and_Markers_Evaluation_of_Predictions_and_Classifications).

So I believe our evaluation of the cut-off point calculation should focus on showing that the method provides a combination of specificity and sensitivity that is "optimal" in some sense. But I am still in the process of understanding what that implies.

let me know if you have any pointers.

@amueller amueller commented Nov 21, 2017

The paper you're citing seems like a marketing paper; I've never heard of ESS or UniODA. There are certainly several different sensible criteria. Why exactly you'd want the smallest distance to the ideal corner of the ROC curve is certainly debatable, but it is also a somewhat reasonable choice imho. And yes, we probably don't want accuracy, which is likely what the estimator is maximizing anyway.

@twolodzko twolodzko commented Nov 22, 2017

Guys, if I may: first of all, it would be great if there were a non-fixed cutoff value for predict methods, since 0.5 is arbitrary and in many cases simply wrong (you can actually see people saying "don't use it, use predict_proba instead").

So when a classifier has a predict_proba function, the predict function should also have a cutoff argument (by default equal to 0.5 for backward compatibility). I would imagine this as something like predict(X, cutoff = None) so it generalizes to multiclass and to algorithms that do not have predict_proba methods.

As for CalibratedClassifierCV, it is a great idea, but using only ROC, or precision and recall, is not the way to go; it should be much more flexible. As argued, for example, by Frank Harrell in many places, such as on the CrossValidated Q&A site (see a number of his other answers there and elsewhere) and in several places on his blog, the optimal cutoff has not much to do with precision or recall, but rather with making optimal decisions based on the estimated probabilities. So I would argue for a function that also enables the user to optimize some prespecified cost function, e.g. something like

from warnings import warn

from scipy.optimize import minimize_scalar

def OptimizeCutoff(target, pred_prob, cost_fun):
    """Find the cutoff in (0, 1) that minimizes cost_fun(target, predictions)."""
    def f(theta):
        pred = [1 if p > theta else 0 for p in pred_prob]
        return cost_fun(target, pred)

    opt = minimize_scalar(f, bounds=(0, 1), method='bounded')
    # A grid search over [1/n, 2/n, ..., 1] would work as well, since the
    # attainable cutoffs are limited by the data.

    if opt['success']:
        return {'cutoff': opt['x'], 'minimum': opt['fun']}
    else:
        warn(opt['message'])
        return {'cutoff': 0.5, 'minimum': None}

(Please notice that this is a quick and dirty, naive implementation.)

A possible ROC-like default cost function could be -1 * (sensitivity + specificity), which is what many people seem to use (e.g. in the quoted Ewald, 2006 paper).
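For illustration, such a cost function plugged into the sketch above (the helper name is made up, and y_val / p_val are assumed to be held-out labels and positive-class probabilities):

from sklearn.metrics import confusion_matrix

def neg_sens_plus_spec(target, pred):
    # -1 * (sensitivity + specificity); minimizing this maximizes their sum.
    tn, fp, fn, tp = confusion_matrix(target, pred).ravel()
    return -(tp / (tp + fn) + tn / (tn + fp))

# e.g. OptimizeCutoff(y_val, p_val, neg_sens_plus_spec)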

For some other ideas about seeking cutoff, check also this thread on CV and other similar threads.

@PGryllos PGryllos commented Nov 22, 2017

@twolodzko thanks a lot for the input.

then the predict function should have also the cutoff argument (by default equal to 0.5 for backward compatibility

I believe (at least looking at the linear classifiers in scikit-learn) predict does not use a cut-off; it picks the class with the higher probability. Of course, in the binary case the class with the higher probability will have a probability above 0.5.

the optimal cutoff has not much to do with precision or recall ...

I cannot look at the links now (though I will later), but from the literature it seems that the ROC curve is a fairly decent starting point for choosing a cut-off; more specifically, most of the cut-off metrics I've found use sensitivity and/or specificity (which relate to precision and recall) one way or another.

... but rather about making optimal decisions based on the estimated probabilities

What is hard to define here, in my understanding at least, is what actually constitutes an optimal decision; I don't think there is a simple answer to that, because it really depends on what you are willing to sacrifice to gain what. I believe the notion of optimizing one metric (e.g. specificity) while preserving a minimum amount of the other (sensitivity) could be a reasonable strategy (a small sketch of this follows below).
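A minimal sketch of that strategy, assuming a held-out set and made-up helper names: among all ROC thresholds whose sensitivity stays above a floor, pick the one with the best specificity (i.e. the lowest false positive rate).

import numpy as np
from sklearn.metrics import roc_curve

def best_specificity_at_min_sensitivity(y_true, scores, min_sensitivity=0.8):
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    eligible = tpr >= min_sensitivity
    if not eligible.any():
        raise ValueError("no threshold reaches the requested sensitivity")
    # Specificity = 1 - FPR, so maximizing specificity means minimizing FPR.
    return thresholds[np.argmin(np.where(eligible, fpr, np.inf))]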

@twolodzko twolodzko commented Nov 22, 2017

@PGryllos

I believe (at least looking at the linear classifiers in scikit-learn) predict does not use a cut-off; it picks the class with the higher probability. Of course, in the binary case the class with the higher probability will have a probability above 0.5.

But in the binary case this is exactly equivalent to using a 0.5 cutoff...

What is hard to define here, in my understanding at least, is what actually constitutes an optimal decision; I don't think there is a simple answer to that, because it really depends on what you are willing to sacrifice to gain what. I believe the notion of optimizing one metric (e.g. specificity) while preserving a minimum amount of the other (sensitivity) could be a reasonable strategy.

Say your daughter went to a doctor and got tested for some disease. The test says there is a 51% chance she is healthy and a 49% chance she is sick. If she is sick and undergoes a treatment that has no side effects she will be well; otherwise she dies. Would you stick with the choice that the greater probability wins? The same goes for the ROC-curve case: even if ROC analysis tells you that the optimal cutoff is X, you may still prefer a different optimality criterion if the cost of a false negative diagnosis is large compared to the cost of a false positive one. If you play Russian roulette with only one round loaded, the odds are in your favor, yet most people won't take the bet. Basically, that is why the method for finding the optimal cutoff should enable the user to define their own cost function (e.g. one that favors sensitivity more than specificity, to give a simple example).

@PGryllos PGryllos commented Nov 22, 2017

But in the binary case this is exactly equivalent to using a 0.5 cutoff...

I am also not sure about classifiers that produce un-thresholded scores instead of probabilities; in that case I don't think you can say that the decision threshold is 0.5.

For the second part of the comment, it seems to me that the same can be achieved by allowing the user to decide which of the two metrics to optimize, with what minimum value of the other (the third point described in the first comment). I am not confident enough to say whether allowing the user to specify custom cost functions would play nicely; I have to give it further thought, but I don't disagree with the idea per se. Maybe @amueller could give some feedback?

@PGryllos PGryllos commented Dec 1, 2017

Update: I plan on making progress during the weekend; specifically, I want to focus on the following:

  1. add the other mentioned methods for cut-off estimation
  2. create examples to showcase the implementation
@amueller amueller referenced this issue Dec 12, 2017
@PGryllos PGryllos commented Dec 18, 2017

@amueller @jnothman I extended the implementation with two more methods for picking optimal cutoffs, changed the naming, and updated the docstrings. In the following days I plan to add examples and tests. Do you think it will qualify for [MRG] review at that point?


@PGryllos PGryllos commented Dec 18, 2017

Okay, thanks for the prompt reply. I will also try to make it as clear and comprehensible as possible.

@amueller amueller commented Dec 18, 2017

@twolodzko @PGryllos
I agree that a cost function is one good way to deal with picking a threshold, though it's not necessarily natural in all cases. If you gamble (or do business, which is basically the same thing), defining the costs is easy. In a medical example, defining the costs is non-obvious: how much more does it cost you if your child dies compared to a misdiagnosis? In these settings I think precision and recall (and associated measures) are more natural.

I don't think we'll introduce a cutoff parameter to predict. I'm not sure what the benefit of that would be: the semantics would be unclear (given that not all models are probabilistic), and it would just move the burden of finding the cutoff to the user. And given that we don't have great mechanisms in place for this right now, people would probably tune it on the test set.

@twolodzko what I don't get from your implementation is what the interface of the cost function would be. That's the critical part, I think. You can just have a cost matrix, which is the simplest, and I think we should probably support that. Did you have a more general case in mind?
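One concrete reading of the cost-matrix idea (purely a sketch under assumed conventions, not a proposed interface): with a 2x2 cost matrix C where C[i, j] is the cost of predicting class j when the truth is class i, and reasonably calibrated probabilities, the cutoff follows in closed form from comparing expected costs.

import numpy as np

def cost_matrix_cutoff(C):
    C = np.asarray(C, dtype=float)
    # Predict positive when p*C[1,1] + (1-p)*C[0,1] <= p*C[1,0] + (1-p)*C[0,0],
    # i.e. when p >= the value returned below.
    return (C[0, 1] - C[0, 0]) / (C[0, 1] - C[0, 0] + C[1, 0] - C[1, 1])

# e.g. a false negative ten times as costly as a false positive:
print(cost_matrix_cutoff([[0, 1], [10, 0]]))  # ~0.091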

@amueller amueller commented Dec 18, 2017

@twolodzko btw do you have references for methods for tuning a cutoff using cross-validation? I didn't really find any.

@twolodzko twolodzko commented Dec 18, 2017

@amueller for example, the paper you quoted earlier in the thread talks about finding the optimal cutoff given the sensitivity + specificity criterion. I can't recall any specific references on that; in most cases they say that, in general, you should tune it based on a loss function specific to the given problem. Cross-validation is rarely discussed but seems like a natural choice.

My code basically takes a loss function as an argument (say metrics.mean_squared_error) and then runs an optimizer on it. It is just an example.

@amueller amueller commented Dec 18, 2017

@twolodzko The article doesn't talk about cross-validation, and while it might be "an obvious choice", it's not really obvious to me what to do. You could either cross-validate the actual threshold value, or you could keep all the models, apply their thresholds, and let them vote. I feel like averaging the threshold value over folds sounds a bit dangerous, but the other option doesn't really seem any better?

Why would you provide a callable for classification? If it's binary, we only need a 2x2 matrix, right? The point is that deciding how to create the interface is the hard part; optimizing it is the easy part ;)

@twolodzko twolodzko commented Dec 28, 2017

@PGryllos & @amueller FYI, see "AUC: a misleading measure of the performance of predictive distribution models" by Lobo et al.:

It has been assumed that in ROC plots the optimal classifier point is the one that maximizes the sum of sensitivity and specificity (Zweig & Campbell, 1993). However, Jiménez-Valverde & Lobo (2007) have found that a threshold that minimizes the difference between sensitivity and specificity performs slightly better than one that maximizes the sum if commission and omission errors are equally costly. When the threshold changes from 0 to 1, the rate of well-predicted presences diminishes while the rate of well-predicted absences increases. The point where both curves cross can be considered the appropriate threshold if both types of errors are equally weighted (Fig. 1a). In a ROC plot, this point lies at the intersection of the ROC curve and the line perpendicular to the diagonal of no discrimination (Fig. 1b), i.e., the ‘northwesternmost’ point of the ROC curve. The two thresholds can be easily computed without using the ROC curve. Both thresholds are highly correlated and, more importantly, they also correlate with prevalence (Liu et al., 2005; Jiménez-Valverde & Lobo, 2007).

As a general rule, a good classifier needs to minimize the false positive and negative rates or, similarly, to maximize the true negative and positive rates. Thus, if we place equal weight on presences and absences there is only one correct threshold. This optimal threshold, the one that minimizes the difference between sensitivity and specificity, achieves this objective and provides a balanced trade-off between commission and omission errors. Nevertheless, as pointed out before, if different costs are assigned to false negatives and false positives, and the prevalence bias is always taken into account, the threshold should be selected according to the required criteria. It is also necessary to underline that the transformation of continuous probabilities into binary maps is frequently necessary for many practical applications that rely on making decisions (e.g., reserve selection).

Check also the referred papers for more discussion.
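For concreteness, the two thresholds discussed in the quoted passage, computed from a ROC curve (an illustrative helper, not part of any proposed API):

import numpy as np
from sklearn.metrics import roc_curve

def max_sum_and_min_diff_thresholds(y_true, scores):
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    sensitivity, specificity = tpr, 1 - fpr
    # Zweig & Campbell: maximize sensitivity + specificity (equivalently, Youden's J).
    max_sum = thresholds[np.argmax(sensitivity + specificity)]
    # Jiménez-Valverde & Lobo: minimize |sensitivity - specificity|.
    min_diff = thresholds[np.argmin(np.abs(sensitivity - specificity))]
    return max_sum, min_diff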

@PGryllos PGryllos commented Jan 5, 2018

@twolodzko thanks a lot for the input; I will take a look in the coming days.

@amueller amueller moved this from needs pr to PR phase in Andy's pets Aug 21, 2018
@nizaevka nizaevka commented Nov 23, 2018

So, is there any solution for using GridSearch to tune the best threshold?
In my current implementation, I use CV inside the score function to find the best th_ across all folds, and then calculate the average score over the folds with that th_. It's extremely awkward and breaks the structural logic of sklearn.


@nizaevka nizaevka commented Nov 26, 2018

No, I mean my temporary implementation. Actually, I don't know how to do it right.
In CV I do:

  • bind predict_proba for all folds => a y_prob vector
  • get the list of possible thresholds from sklearn.metrics.roc_curve on y_prob
  • brute-force the one "th_" that maximizes the score of y_prob after the threshold is applied
  • then use that "th_" to calculate the score of every fold

The problem is that every fold should have the same th_, so before calculating the score for one fold, we need y_prob from the other folds.
Maybe I am wrong; I would appreciate any solution.

Another way is to use GridSearch (like in #6663), but it is hard to know (without roc_curve) the proper th_ range for the search; I assume th_ depends on the other hyperparameters of the estimator.
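A sketch of that out-of-fold procedure using cross_val_predict to pool the held-out probabilities before picking a single threshold (illustrative only; F1 stands in for whichever score is being maximized, and y is assumed to be encoded as 0/1):

import numpy as np
from sklearn.metrics import f1_score, roc_curve
from sklearn.model_selection import cross_val_predict

def pooled_oof_threshold(estimator, X, y, cv=5):
    y_prob = cross_val_predict(estimator, X, y, cv=cv, method='predict_proba')[:, 1]
    # Candidate thresholds from the pooled ROC curve; drop the sentinel roc_curve prepends.
    _, _, candidates = roc_curve(y, y_prob)
    candidates = candidates[1:]
    scores = [f1_score(y, (y_prob >= t).astype(int)) for t in candidates]
    return candidates[int(np.argmax(scores))]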


@nizaevka nizaevka commented Nov 27, 2018

What do you mean? The th_ range for the #6663 solution, or something else? Could you write more concretely what steps you propose?
GridSearch should tune the threshold and the other hyperparameters (hp) simultaneously; in general roc_curve depends on the estimator's hp, so I need ROC curves for each fold for all combinations of hp.

@amueller amueller commented Nov 27, 2018

@KNizaev this is not a forum for usage questions or how to implement something. Try stackoverflow.

@jmwoloso jmwoloso commented Mar 28, 2019

The statistic we're looking for here is Youden's J. The scenario arises quite often (all the time in my line of work it seems, lol?) when you have a highly imbalanced dataset.

Our team was looking into implementing something like this as well, and optimization via CV (using something like GridSearchCV on a previously unused portion of the dataset) seemed the natural way to proceed, as tuning by hand without CV (a.k.a. guessing) would introduce leakage. We also looked into Matthews' correlation coefficient as the metric to use for threshold optimization.

We ultimately never implemented it, as we needed to bin the probabilities and turn them into letter grades, so we opted for a quick and dirty method using Jenks natural breaks.

Seems like you'd want to GridSearch the hyper-params while optimizing for Youden's J and allowing a prob_threshold param that will be searched over as well.

Is this on hold for the time being? Thoughts?
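Youden's J is the maximize-sensitivity-plus-specificity criterion already sketched after the Lobo et al. quote above; for completeness, an MCC-based sweep might look like this (a sketch with made-up names and 0/1 labels assumed, not an existing API):

import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_curve

def mcc_threshold(y_true, scores):
    _, _, thresholds = roc_curve(y_true, scores)
    thresholds = thresholds[1:]  # drop the sentinel threshold roc_curve prepends
    mccs = [matthews_corrcoef(y_true, (scores >= t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmax(mccs))]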


@jmwoloso jmwoloso commented Mar 29, 2019

@jnothman exactly, sorry, was just trying to put a "name with a face". Looking forward to this and happy to help move it along if needed.
