
Use a threshold for prediction for pairs #131

Closed
1 task
wdevazelhes opened this issue Nov 13, 2018 · 5 comments · Fixed by #168

wdevazelhes commented Nov 13, 2018

A pairs predictor should be able to predict a binary label when given a pair, like a classifier, when the predict function is called. This could be done by comparing the pair's similarity score to a threshold.

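The idea could be sketched roughly as follows (a minimal sketch: the class name, the `threshold_` attribute, and the negative-euclidean-distance score are all illustrative assumptions, not the metric-learn API):

```python
import numpy as np

# Illustrative sketch only: names and the scoring rule are assumptions,
# not the actual metric-learn API.
class ThresholdedPairsPredictor:
    def __init__(self, threshold=0.0):
        self.threshold_ = threshold

    def decision_function(self, pairs):
        # Toy similarity score: negative euclidean distance between the
        # two points of each pair (higher = more similar).
        pairs = np.asarray(pairs, dtype=float)
        return -np.linalg.norm(pairs[:, 0] - pairs[:, 1], axis=1)

    def predict(self, pairs):
        # +1 for pairs scored as "similar", -1 otherwise.
        scores = self.decision_function(pairs)
        return np.where(scores >= self.threshold_, 1, -1)
```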


bellet commented Nov 13, 2018

This threshold could be tuned (e.g. on the training set) so as to achieve a given level of precision.
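One way to pick such a threshold, sketched under the assumptions that higher scores mean "more similar" and positive pairs are labeled +1 (the helper name is made up): among thresholds that reach the requested precision on the given scores, keep the lowest one, since it maximizes recall.

```python
import numpy as np

def threshold_for_precision(scores, y_true, min_precision):
    # Hypothetical helper: find the lowest threshold whose predicted
    # positives (score >= threshold) reach `min_precision`.
    order = np.argsort(scores)[::-1]          # most confident first
    scores, y_true = scores[order], y_true[order]
    best = None
    for i, t in enumerate(scores):
        # At threshold t, the predicted positives are the first i+1 pairs.
        precision = np.mean(y_true[: i + 1] == 1)
        if precision >= min_precision:
            best = t                          # lower t -> higher recall
    return best
```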

@bellet bellet added this to the v0.5.0 milestone Dec 20, 2018

wdevazelhes commented Jan 9, 2019

1/ There's this recent PR in scikit-learn that is related: scikit-learn/scikit-learn#10117

They use a meta-estimator that takes the estimator we want to threshold as an argument. It seems pretty close to what we want: it allows specifying a precision level, avoiding a refit of the model if needed, etc.

However, it has no option for setting the threshold that simply maximizes accuracy. Indeed, they wouldn't need one if every binary classifier in scikit-learn optimized a cost function whose accuracy is maximized at a known threshold (e.g. 0.5), i.e. if accuracy were the high-level metric optimized by default.

I'll try to investigate that, but maybe you already know the answer?

If we find a case in scikit-learn where one would want to choose a threshold that optimizes accuracy, we could mention it in the PR so it gets added there. Otherwise, I think we would need to implement it in metric-learn.

What is more, maybe it would be good to have this accuracy-maximizing thresholding by default in pairs metric learners, so that they directly have a predict function and we don't need a meta-estimator for it. For more sophisticated selections (e.g. targeting a precision level), they could then use the scikit-learn meta-estimator, which should be compatible.
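The accuracy-maximizing thresholding mentioned above could look like this (a sketch, not the eventual metric-learn implementation; labels are assumed to be in {-1, +1} and the function name is made up):

```python
import numpy as np

def threshold_for_accuracy(scores, y_true):
    # Try every observed score as a candidate cut-off and keep the one
    # giving the highest accuracy on (scores, y_true).
    best_t, best_acc = None, -1.0
    for t in np.unique(scores):
        acc = np.mean(np.where(scores >= t, 1, -1) == y_true)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```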

2/ Also related: the class sklearn.calibration.CalibratedClassifierCV calibrates predictions, providing a good predict_proba to estimators that lack one or have a poor one (I don't think we need to put it in the code, but we could mention in the docs that it can be used):

https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html#sklearn.calibration.CalibratedClassifierCV

The only problem is that I could only use it with a preprocessor: at fit time it calls check_array, which doesn't accept 3D arrays. I guess I could file an issue in scikit-learn about that, since it's a bit odd; GridSearchCV, for instance, is also a meta-estimator and doesn't do that.
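One workaround could be sketched like this (an assumption for illustration, not the eventual solution: it sidesteps the 3D-array issue entirely by hand-crafting a 2D feature per pair, here the absolute difference of the two points, and calibrating a plain scikit-learn classifier on it):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

rng = np.random.RandomState(42)
pairs = rng.randn(200, 2, 5)            # (n_pairs, 2, n_features): 3D
y = rng.choice([-1, 1], size=200)

# check_array would reject the 3D `pairs`, so build a 2D representation
# (absolute difference of the two points of each pair) instead.
X = np.abs(pairs[:, 0] - pairs[:, 1])

# CalibratedClassifierCV supplies predict_proba to LinearSVC, which
# only has a decision_function.
clf = CalibratedClassifierCV(LinearSVC(), cv=3).fit(X, y)
proba = clf.predict_proba(X)            # one probability row per pair
```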

@PGryllos

@wdevazelhes in the current implementation of scikit-learn/scikit-learn#10117, you can tune the threshold either for the optimal ROC curve point or for fbeta, where you get to choose the parameters. So with beta == 1 you are optimizing for f1. So far I didn't see any reason to add accuracy as well, as f1 is usually more useful, but it can easily be added; I plan to make some progress next week.
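The fbeta-style tuning with beta == 1 (i.e. F1) could be sketched with precision_recall_curve (the helper name is made up; labels assumed in {0, 1} with 1 as the positive class):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_f1(scores, y_true):
    # precision_recall_curve evaluates every candidate threshold; pick
    # the one with the best F1.
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    # The final (precision, recall) point has no threshold: drop it.
    return thresholds[np.argmax(f1[:-1])]
```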

@wdevazelhes

@PGryllos thanks for your comment, and great PR in scikit-learn! I would say that accuracy is the simpler/more natural default in our case for providing a predict function (in our setting of metric learning on pairs, there is no natural threshold, so if we don't set one we cannot predict). But what do you think @bellet? Maybe the f1 score would be a more natural default?

@wdevazelhes

I just raised an issue in scikit-learn about using CalibratedClassifierCV, see scikit-learn/scikit-learn#13077
