
Document and adapt isclose() usage #4864

Closed
Gerenuk opened this issue Jun 15, 2015 · 8 comments

Comments

@Gerenuk

Gerenuk commented Jun 15, 2015

sklearn.metrics introduces isclose() in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/ranking.py, which can leave the unaware data practitioner with hours of debugging.

In very unbalanced classification, probabilities/scores can be very small and yet meaningful. This, however, causes unexpectedly missing precision_recall points, because isclose treats values within 10e-6 of each other as equal.

I'd suggest placing a warning about isclose in the documentation, and also replacing the absolute epsilon with a relative closeness comparison, in order to avoid the problems with small probabilities in unbalanced classification.
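To illustrate the absolute-vs-relative distinction: a minimal sketch with numpy.isclose, using made-up scores of the magnitude described above (the rtol/atol values below are illustrative, not scikit-learn's actual settings):

```python
import numpy as np

# Two tiny but distinct scores, as produced by a very unbalanced classifier.
a, b = 1.2e-7, 1.21e-7

# With a purely absolute tolerance (a fixed epsilon), the scores collapse:
print(np.isclose(a, b, rtol=0.0, atol=1e-6))   # True: treated as equal

# With a purely relative tolerance, their ~1% difference is preserved:
print(np.isclose(a, b, rtol=1e-5, atol=0.0))   # False: kept distinct
```

Any comparison with a fixed atol will merge scores below that magnitude, regardless of how different they are relatively.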

@amueller
Member

Can you give an example where this is failing? Maybe we need to handle certain situations better?

@Gerenuk
Author

Gerenuk commented Jun 15, 2015

Whenever you have very small probabilities, for example from a very unbalanced and slightly noisy classification. These small probabilities can still be predictive (I had a real-life case). However, in precision_recall_curve points are wrongly aggregated due to isclose.

from sklearn import metrics
prec, recall, thresh = metrics.precision_recall_curve(
    [0, 1, 0, 1, 0, 1, 0],
    [0.1, 1.211111e-7, 1.21111e-7, 1.2111e-7, 1.211e-7, 1.21e-7, 1.2e-7])
print(prec)
# -> array([ 0.42857143,  0.        ,  1.        ])

Now that you say it, I agree this should be handled differently. I'd suggest not tampering with the raw data and just sorting the floats as they are.

On the other hand, I keep reimplementing aggregation procedures to compress millions of precision_recall_curve points into a number of points that is plottable. For that I usually look at min(precision) and max(precision) for each small recall range.

I can imagine that using something similar to this min/max approach for a given recall resolution (e.g. [r_i ... r_i + 0.01] ranges) would be more robust if you really want to combine presumably equal points.

So this aggregation is needed anyway, but should be performed in the appropriate way.
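The min/max-per-recall-bin idea above might look something like this sketch (aggregate_pr_points is a hypothetical helper name, not a scikit-learn API; the bin width default is illustrative):

```python
import numpy as np

def aggregate_pr_points(precision, recall, resolution=0.01):
    """Compress precision-recall points: for each recall bin of width
    `resolution`, keep only the min and max precision seen in that bin.
    Hypothetical sketch of the approach described above."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    bins = np.floor(recall / resolution).astype(int)
    out_p, out_r = [], []
    for b in np.unique(bins):
        mask = bins == b
        r_mid = (b + 0.5) * resolution          # bin midpoint as the x value
        out_r.extend([r_mid, r_mid])
        out_p.extend([precision[mask].min(), precision[mask].max()])
    return np.array(out_p), np.array(out_r)
```

This reduces millions of points to at most two per recall bin, which is enough to preserve the visual envelope of the curve when plotting.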

@arjoly
Member

arjoly commented Jun 15, 2015

One student ran into this issue recently when upgrading his scikit-learn version.

@vassilisp

Same here, using it for ROC curves: isclose groups all scores below a certain level, creating plots with missing values below a certain FPR (below 1% for me).
This can be seen in the picture below, looking at the NB curve.

[image: ROC curve plot (log-scale FPR); the NB curve shows missing values at low FPR]

A good way to overcome this for me was to use predict_log_proba() instead of predict_proba().
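The log-probability workaround might look like the following sketch (the data is synthetic; any classifier exposing predict_log_proba would do). Because the log is monotonic, the ranking of scores is unchanged, but tiny probabilities are spread far apart, so an absolute-tolerance grouping no longer collapses them:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (rng.rand(200) < 0.1).astype(int)   # unbalanced labels

clf = GaussianNB().fit(X, y)

# Log-probabilities: e.g. 1.2e-7 and 1.21e-7 map to roughly -15.94 and -15.93,
# a gap far larger than any small absolute epsilon.
log_scores = clf.predict_log_proba(X)[:, 1]
fpr, tpr, thresh = roc_curve(y, log_scores)
```

Plotting the resulting fpr/tpr then shows the low-FPR region that was previously missing.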

@arjoly
Member

arjoly commented Sep 14, 2015

I would be in favour of removing the isclose. Other opinions? (@jnothman?)

@vassilisp

Correct me if I am mistaken, but just dropping the isclose can create a mess when dealing with big data sets (the fpr and tpr arrays could become extremely large, making them almost certainly impossible to plot without some processing).

If people need to deal with such low score values, I can see three viable solutions, taking into consideration that only a small percentage of users would ever hit that limit. Good documentation is viable in any case.

1) Allow a user-defined tolerance for isclose (with a default value the same as the current one, which is good for 99% of the cases).

2) Implement a decision function that would return predict_log_proba() for estimators without a decision function.

3) Myself, I would be much more inclined towards an approach that aggregates the end result (output) rather than the input data. That is, group the fpr and tpr values (using a similar or even zero tolerance). I can't imagine a scenario where such precision (below 0.00001) in FPR or TPR would be needed.

As an example, when I used the default metrics.roc_curve on big data sets, it resulted in millions of points for FPR and TPR (most of them redundant, at least for plotting purposes). That is similar to not using isclose and getting an even bigger number of distinct thresholds, and consequently fpr and tpr values.

To solve this issue, given the nature of ROC curve plots (mostly vertical and horizontal lines that need only two sets of coordinates each), I used a very simple method that keeps only the values at the ends of those linear parts. Depending on the tolerance and the ROC plot itself, this can decrease the number of values to usually close to 1%, which makes plotting far easier and faster and outweighs any computation cost introduced by this method.

Actually, I would suggest that anyone use this method before plotting their fpr and tpr values. Even with a zero tolerance (no loss of information), it decreases the number of values to close to the aforementioned 1%.

code snippet here
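A sketch of the idea described above, dropping interior points of (near-)linear ROC segments (simplify_roc is a hypothetical name, my reconstruction of the described approach, not the poster's actual snippet):

```python
import numpy as np

def simplify_roc(fpr, tpr, tol=0.0):
    """Keep only the endpoints of (near-)linear segments of a ROC polyline.
    With tol=0.0, only exactly collinear interior points are dropped,
    so the plotted curve is unchanged."""
    fpr, tpr = np.asarray(fpr, dtype=float), np.asarray(tpr, dtype=float)
    if fpr.size < 3:
        return fpr, tpr
    keep = [0]
    for i in range(1, len(fpr) - 1):
        # Cross product of the segments around point i; zero means collinear.
        cross = ((fpr[i] - fpr[i - 1]) * (tpr[i + 1] - tpr[i])
                 - (tpr[i] - tpr[i - 1]) * (fpr[i + 1] - fpr[i]))
        if abs(cross) > tol:
            keep.append(i)
    keep.append(len(fpr) - 1)
    return fpr[keep], tpr[keep]
```

For a staircase-shaped ROC curve this collapses each vertical or horizontal run to its two endpoints, which is all a line plot needs.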

Cheers

@Gerenuk
Author

Gerenuk commented Sep 14, 2015

Certainly 3) seems to be the cleanest solution. If you want to solve the plotting issue, then solve the plotting issue; do not mess with the values beforehand. The plotting issue can be solved in several independent ways. I believe Bokeh can even handle it without any user adjustment.

@jnothman
Member

Appears to be a duplicate of #3864. Would someone like to summarise the discussion here over there?
