
Document and adapt isclose() usage #4864

Closed
Gerenuk opened this issue Jun 15, 2015 · 8 comments

Comments

@Gerenuk

Gerenuk commented Jun 15, 2015

sklearn.metrics introduces isclose() in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/ranking.py, which can leave the unaware data practitioner with hours of debugging.

In very unbalanced classification, probabilities/scores can be very small and yet meaningful. This, however, causes unexpectedly missing precision_recall points, because isclose treats values within 10e-6 of each other as equal.

I'd suggest placing a warning about isclose in the documentation, and also replacing the absolute epsilon with a relative closeness comparison, in order to avoid the problems with small probabilities in unbalanced classification.
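To illustrate the absolute-vs-relative distinction: a minimal sketch with numpy.isclose, using made-up scores of the magnitude described above (the rtol/atol values below are illustrative, not scikit-learn's actual settings):

```python
import numpy as np

# Two tiny but distinct scores, as produced by a very unbalanced classifier.
a, b = 1.2e-7, 1.21e-7

# With a purely absolute tolerance (a fixed epsilon), the scores collapse:
print(np.isclose(a, b, rtol=0.0, atol=1e-6))   # True: treated as equal

# With a purely relative tolerance, their ~1% difference is preserved:
print(np.isclose(a, b, rtol=1e-5, atol=0.0))   # False: kept distinct
```

Any comparison with a fixed atol will merge scores below that magnitude, regardless of how different they are relatively.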

@amueller
Member

Can you give an example where this is failing? Maybe we need to handle certain situations better?

@Gerenuk
Author

Gerenuk commented Jun 15, 2015

Whenever you have very small probabilities, for example from a very unbalanced and slightly noisy classification. These small probabilities can still be predictive (I had a real-life case). However, in precision_recall_curve points are wrongly aggregated due to isclose.

from sklearn import metrics
prec, recall, thresh = metrics.precision_recall_curve(
    [0, 1, 0, 1, 0, 1, 0],
    [0.1, 1.211111e-7, 1.21111e-7, 1.2111e-7, 1.211e-7, 1.21e-7, 1.2e-7])
print(prec)
# -> array([ 0.42857143,  0.        ,  1.        ])

Now that you say it, I agree this should be handled differently. I'd suggest not tampering with the raw data and just sorting the floats as they are.

On the other hand, I keep reimplementing aggregation procedures to compress millions of precision_recall_curve points into a number of points that is plottable. For that I usually look at min(precision) and max(precision) for each small recall range.

I can imagine that using something similar to this min/max approach for a given recall resolution (e.g. [r_i ... r_i + 0.01] ranges) would be more robust if you really want to combine presumably equal points.

So this aggregation is needed anyway, but should be performed in the appropriate way.
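The min/max-per-recall-bin idea above might look something like this sketch (aggregate_pr_points is a hypothetical helper name, not a scikit-learn API; the bin width default is illustrative):

```python
import numpy as np

def aggregate_pr_points(precision, recall, resolution=0.01):
    """Compress precision-recall points: for each recall bin of width
    `resolution`, keep only the min and max precision seen in that bin.
    Hypothetical sketch of the approach described above."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    bins = np.floor(recall / resolution).astype(int)
    out_p, out_r = [], []
    for b in np.unique(bins):
        mask = bins == b
        r_mid = (b + 0.5) * resolution          # bin midpoint as the x value
        out_r.extend([r_mid, r_mid])
        out_p.extend([precision[mask].min(), precision[mask].max()])
    return np.array(out_p), np.array(out_r)
```

This reduces millions of points to at most two per recall bin, which is enough to preserve the visual envelope of the curve when plotting.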

@arjoly
Member

arjoly commented Jun 15, 2015

One student ran into this issue recently when upgrading his scikit-learn version.

@vassilisp

Same here, using it for ROC curves: isclose groups all scores below a certain level, creating plots with missing values below a certain FPR (below 1% for me).
This can be seen in the picture below, looking at the NB curve.

[image: ROC curve plot (log-scale FPR); the NB curve shows missing values at low FPR]

A good way to overcome this for me was to use predict_log_proba() instead of predict_proba().
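The log-probability workaround might look like the following sketch (the data is synthetic; any classifier exposing predict_log_proba would do). Because the log is monotonic, the ranking of scores is unchanged, but tiny probabilities are spread far apart, so an absolute-tolerance grouping no longer collapses them:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (rng.rand(200) < 0.1).astype(int)   # unbalanced labels

clf = GaussianNB().fit(X, y)

# Log-probabilities: e.g. 1.2e-7 and 1.21e-7 map to roughly -15.94 and -15.93,
# a gap far larger than any small absolute epsilon.
log_scores = clf.predict_log_proba(X)[:, 1]
fpr, tpr, thresh = roc_curve(y, log_scores)
```

Plotting the resulting fpr/tpr then shows the low-FPR region that was previously missing.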

@arjoly
Member

arjoly commented Sep 14, 2015

I would be in favour of removing the isclose. Other opinions? (@jnothman?)

@vassilisp

Correct me if I am mistaken, but just dropping the isclose can create a mess when dealing with big data sets (the fpr and tpr arrays could become extremely large, making them almost certainly impossible to plot without some processing).

If people need to deal with such low score values, I can see three viable solutions, taking into consideration that only a small percentage of users would ever hit that limit. Good documentation is viable in any case.

1) Allow a user-defined tolerance for isclose (with a default value the same as the current one, which is good for 99% of the cases).

2) Implement a decision function that would return predict_log_proba() for estimators without a decision function.

3) Myself, I would be much more inclined towards an approach that aggregates the end result (output) rather than the input data. That is, group the fpr and tpr values (using a similar or even zero tolerance). I can't imagine a scenario where such precision (below 0.00001) in FPR or TPR would be needed.

As an example, when I used the default metrics.roc_curve on big data sets, it resulted in millions of points for FPR and TPR (most of them redundant, at least for plotting purposes). That is similar to not using isclose and getting an even bigger number of distinct thresholds, and consequently fpr and tpr values.

To solve this issue, given the nature of ROC curve plots (mostly vertical and horizontal lines that need only two sets of coordinates each), I used a very simple method that keeps only the values at the ends of those linear parts. Depending on the tolerance and the ROC plot itself, this can decrease the number of values to usually close to 1%, which makes plotting far easier and faster and outweighs any computation cost introduced by this method.

Actually, I would suggest that anyone use this method before plotting their fpr and tpr values. Even with a zero tolerance (no loss of information), it decreases the number of values to close to the aforementioned 1%.

code snippet here
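A sketch of the idea described above, dropping interior points of (near-)linear ROC segments (simplify_roc is a hypothetical name, my reconstruction of the described approach, not the poster's actual snippet):

```python
import numpy as np

def simplify_roc(fpr, tpr, tol=0.0):
    """Keep only the endpoints of (near-)linear segments of a ROC polyline.
    With tol=0.0, only exactly collinear interior points are dropped,
    so the plotted curve is unchanged."""
    fpr, tpr = np.asarray(fpr, dtype=float), np.asarray(tpr, dtype=float)
    if fpr.size < 3:
        return fpr, tpr
    keep = [0]
    for i in range(1, len(fpr) - 1):
        # Cross product of the segments around point i; zero means collinear.
        cross = ((fpr[i] - fpr[i - 1]) * (tpr[i + 1] - tpr[i])
                 - (tpr[i] - tpr[i - 1]) * (fpr[i + 1] - fpr[i]))
        if abs(cross) > tol:
            keep.append(i)
    keep.append(len(fpr) - 1)
    return fpr[keep], tpr[keep]
```

For a staircase-shaped ROC curve this collapses each vertical or horizontal run to its two endpoints, which is all a line plot needs.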

Cheers

@Gerenuk
Author

Gerenuk commented Sep 14, 2015

Certainly 3) seems to be the cleanest solution. If you want to solve the plotting issue, then solve the plotting issue; do not mess with the values beforehand. The plotting issue can be solved in several independent ways. I believe Bokeh can even handle it without any user adjustment.

@jnothman
Member

Appears to be a duplicate of #3864. Would someone like to summarise the discussion here over there?
