
[MRG+1] Bugfix for precision_recall_curve when all labels are negative #8280

Closed

Conversation

varunagrawal
Contributor

Reference Issue

Fixes #8245

What does this implement/fix? Explain your changes.

When all the y_true labels are negative, precision_recall_curve returns NaN because recall is set to NaN instead of 1. This happens because recall is computed by directly dividing the tps vector by tps[-1], which in that case is 0.

This fix checks whether tps[-1] is 0 and, if so, sets recall to 1 directly, since there are no true positives or false negatives; otherwise recall is calculated as normal.
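A minimal sketch of that guard, assuming recall is derived from a cumulative true-positive vector tps as in sklearn/metrics/ranking.py (the helper name recall_from_tps is illustrative, not the actual function):

```python
import numpy as np

def recall_from_tps(tps):
    # tps is the cumulative count of true positives at each threshold,
    # so tps[-1] equals TP + FN, the total number of positive labels.
    if tps[-1] == 0:
        # No positive ground-truth labels: define recall as 1 everywhere
        # instead of dividing by zero and producing NaN.
        return np.ones_like(tps, dtype=float)
    return tps / tps[-1]
```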

Any other comments?

I had to update test_precision_recall_curve_toydata, since that test expected a true-divide exception to be raised, which is no longer the case as a result of this fix. I added two test cases, one checking when all truth labels are negative and one checking when all truth labels are positive, to ensure the precision calculation is accurate.

@varunagrawal varunagrawal force-pushed the pr-curve-bugfix branch 2 times, most recently from b39523f to 92f129d Compare February 3, 2017 05:04
assert_raises(Exception, average_precision_score, y_true, y_score)
# y_true = [0, 0]
# y_score = [0.25, 0.75]
# assert_raises(Exception, precision_recall_curve, y_true, y_score)
Member

This is the behaviour change: it used to warn because of a division by zero, which was turned into an error by the np.seterr('raise') context manager. Remove the np.seterr('raise') and make sure that you test the result of precision_recall_curve.

You don't need to add more tests for the edge cases since they were already tested; you just need to change what we expect from the results.
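For illustration, a sketch of the kind of direct result check being asked for here (the final assertion reflects this PR's convention of recall being defined rather than NaN; it is illustrative, not the merged test):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# All-negative ground truth: this used to trigger a division-by-zero
# warning, turned into an error by np.seterr('raise') in the old test.
y_true = np.array([0, 0])
y_score = np.array([0.25, 0.75])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Check the result directly instead of expecting an exception.
assert not np.isnan(recall).any()
```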

# y_score = np.array([[0, 1], [0, 1]])
# assert_raises(Exception, average_precision_score, y_true, y_score,
# average="macro")
# assert_raises(Exception, average_precision_score, y_true, y_score,
Member

Hmm, I have to say I am not sure what should happen in the multi-label case.

@@ -978,3 +979,27 @@ def test_ranking_loss_ties_handling():
assert_almost_equal(label_ranking_loss([[1, 0, 0]], [[0.25, 0.5, 0.5]]), 1)
assert_almost_equal(label_ranking_loss([[1, 0, 1]], [[0.25, 0.5, 0.5]]), 1)
assert_almost_equal(label_ranking_loss([[1, 1, 0]], [[0.25, 0.5, 0.5]]), 1)


def test_precision_recall_curve_all_negatives():
Member

You can remove this test

assert_not_equal(recall[0], np.nan)


def test_precision_recall_curve_all_positives():
Member

You can remove this test

@varunagrawal
Contributor Author

@lesteve I've updated the tests. Will squash this PR once it gets the green light.

@varunagrawal
Contributor Author

From the Travis logs, it seems the tests all pass but something else is failing, which I can't quite figure out. Can someone help with this?

@jnothman
Member

jnothman commented Feb 5, 2017 via email

@jnothman
Member

jnothman commented Feb 6, 2017

The travis failure is a current bug in Travis, which we are waiting on.

@lesteve
Member

lesteve commented Feb 8, 2017

I restarted the Travis build. @varunagrawal note you can force a rebuild of all the CIs by pushing a new commit in this branch. My favourite way to do that is git commit --amend + git push -f.

@lesteve
Member

lesteve commented Feb 8, 2017

The Travis flake8 error is genuine. Setting up on-the-fly flake8 checks inside your favourite editor is a great investment to avoid this kind of thing.

@codecov

codecov bot commented Feb 8, 2017

Codecov Report

Merging #8280 into master will increase coverage by <.01%.

@@            Coverage Diff             @@
##           master    #8280      +/-   ##
==========================================
+ Coverage   94.73%   94.73%   +<.01%     
==========================================
  Files         342      342              
  Lines       60674    60675       +1     
==========================================
+ Hits        57482    57483       +1     
  Misses       3192     3192
Impacted Files | Coverage Δ
--- | ---
sklearn/metrics/ranking.py | 98.75% <100%> (ø)
sklearn/metrics/tests/test_ranking.py | 100% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@varunagrawal
Contributor Author

@lesteve gotcha! Fixed them and submitted the changes. Thank you for the review. 😃

@jnothman
Member

I might note that in precision_recall_fscore_support we take a different convention of making R = 0 when there are no true instances. I appreciate the need here to do otherwise for plotting. Are you certain it's the correct approach to take in terms of average precision?

Otherwise LGTM

@jnothman jnothman changed the title [MRG] Bugfix for precision_recall_curve when all labels are negative [MRG+1] Bugfix for precision_recall_curve when all labels are negative Feb 23, 2017
@varunagrawal
Contributor Author

@jnothman so as per this answer on StackOverflow, this does seem to be the correct approach to take. I had the same doubts as you so I went and looked up half a dozen resources and worked out the arguments in my head before posting an issue about this.
The issue is #8245 for your reference.

Thanks for your review.

@jnothman
Member

jnothman commented Feb 23, 2017 via email

@lesteve
Member

lesteve commented Feb 24, 2017

> @jnothman so as per this answer on StackOverflow, this does seem to be the correct approach to take

The more I think about this, the less sure I am about it. It would be great to find a more authoritative source clearly specifying what the recall value should be when TP + FN == 0; maybe NaN is actually the best we can do ...

@jnothman
Member

jnothman commented Feb 25, 2017 via email

@varunagrawal
Contributor Author

@jnothman @lesteve sorry for being AWOL. Paper deadlines have not been kind.
Is there still any confusion regarding this PR? I tried looking through the code that calculates the P-R curve, but it seems to be doing some fancy operations which aren't quite clear to me.

From an authoritative standpoint, I believe the original PASCAL paper is our best bet.

@varunagrawal
Contributor Author

@jnothman @lesteve any updates on this?

@jnothman
Member

Haven't had much time for reviewing, and I expect little availability for the next couple of weeks. Please be patient, and ping again if necessary.

@jnothman
Member

@lesteve, as opposed to in precision_recall_fscore_support, for a PR curve, we need to consider the limits, rather than just a sensible overall evaluation metric. At the limit, it should always be possible for a system to get perfect recall of the positive class, even if there are no instances of it in the sample.

@jnothman jnothman added this to the 0.19 milestone Jun 18, 2017
@varunagrawal
Contributor Author

@jnothman @lesteve a gentle reminder. :)

@amueller
Member

Potentially yes, though hopefully for real-world data the changes are less significant. I don't think a bug in software that you used is an ethics violation. Relative statements are likely to also hold with the fix, absolute statements depend a lot on what you compare against. Some people use "interpolated mAP" which is even more optimistic than our previous implementation.

@varunagrawal
Contributor Author

The changes are very significant in my case. My model went from being state-of-the-art to worse than the current state of the art by a significant margin. If I hope to publish these results, I'll have to refute the claims made in my citations, and that's going to be another mess altogether.

@varunagrawal
Contributor Author

Does the current implementation follow the Pascal VOC formulation like the previous one, or is it based more on the information retrieval side of things?
I saw the Stanford Information Retrieval reference and wanted to clarify that.

@amueller
Member

@varunagrawal I'm sorry the bug is creating so much trouble for you. Again, are you comparing against a number in the literature? Neither the previous implementation nor the current one is the PascalVOC one, which implements something that is not standard in the literature. The PascalVOC measure is much more optimistic than any other measure.

@amueller
Member

We discussed implementing the PascalVOC one, but the PascalVOC paper disagrees with the code in the submission tools (the paper quotes an 11-point interpolation, the matlab code does general interpolation). Neither version is found anywhere else in the literature as far as I can tell.

@amueller
Member

check out #4577
The computer vision community seems to think the interpolated variant is standard, though it has been shown to be theoretically flawed; see http://mark.goadrich.com/articles/davisgoadrichcamera2.pdf
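To make the distinction concrete, here is an illustrative comparison of the two conventions on toy data; this is neither sklearn's internal implementation nor the exact PascalVOC code, just a sketch:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy data, purely for illustration.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# Step-wise AP: precision weighted by the recall decrement at each threshold
# (recall is returned in decreasing order, hence the minus sign).
step_ap = -np.sum(np.diff(recall) * precision[:-1])

# PASCAL VOC-style 11-point interpolated AP: average, over recall levels
# r in {0.0, 0.1, ..., 1.0}, of the maximum precision at recall >= r.
eleven_point_ap = np.mean(
    [precision[recall >= r].max() for r in np.linspace(0, 1, 11)]
)

# The interpolated variant is the more optimistic of the two here.
print(step_ap, eleven_point_ap)
```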

@amueller
Member

I think we might be open to adding the VOC variant so that people can more easily compare to the computer vision literature.

@varunagrawal
Contributor Author

@amueller thank you for your help on this. I was just about to mention adding a separate PascalVOC version since I myself work in computer vision and all the papers I have referenced use that form of mAP calculation, thus making it standard from a peer review perspective.

Thank you for the link to the argument showing PascalVOC got it wrong. I'll be sure to inform my advisor and see what we can do. However at this time, I'll be glad to help contribute an implementation of the PascalVOC version, thus allowing a nice separation between the information retrieval community and the computer vision community. :)

@amueller
Member

@varunagrawal feel free to open a PR to add an argument to the average_precision function. I'd also welcome some additions to the user guide explaining the differences. And for comparing against other computer vision papers, you should not use the version currently implemented in sklearn, but the PascalVOC one. So I don't think you're in trouble.

@varunagrawal
Contributor Author

Yup, the moment you mentioned PascalVOC and the IR definitions being different, I let out a sigh of relief.

@varunagrawal
Contributor Author

varunagrawal commented Feb 19, 2018

@amueller @GaelVaroquaux can you please provide more information on what is going on in this commit?
It seems that this is the change that was supposed to fix the bug by using the step function instead of interpolation. I am guessing this line is the one that performs the step interpolation?

What I don't understand is that there seems to have been some effort to parameterize the interpolation type being used, but there are additional commits on top of those which remove those changes.

@jnothman
Member

That commit is the (poorly-titled) result of #9017. @vene and @GaelVaroquaux seem to have concluded in person that there were problems in providing averaged eleven-point precision.

@varunagrawal
Contributor Author

Alright, so it seems that I can update the average_precision_score function to take an interpolation parameter: if False, just use the step function, and if True, use @ndingwall's implementation of the 11-point interpolated version.
Does that seem like a good roadmap?
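A hypothetical sketch of what that could look like (not an existing sklearn API; the parameter name, default, and semantics are illustrative only):

```python
def average_precision_score(y_true, y_score, average="macro",
                            sample_weight=None, interpolation=False):
    """Compute average precision.

    interpolation=False : step-wise AP (the current behaviour).
    interpolation=True  : 11-point interpolated AP (PASCAL VOC style).
    """
    raise NotImplementedError  # interface sketch only
```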

@varunagrawal
Contributor Author

@amueller @qinhanmin2014 I know it's been a while since we touched upon this PR. I am more than happy to finish updating this sometime next week if we have a consensus on what is the right thing to do here.

@jnothman
Member

> I know it's been a while since we touched upon this PR. I am more than happy to finish updating this sometime next week if we have a consensus on what is the right thing to do here.

Do you recall the discussion well enough to summarise it?

@varunagrawal
Contributor Author

@jnothman this PR is a fix for #8245.

Basically, in the multilabel case, if there are no positive binary labels, then we end up getting NaNs. I have described a fix for it in the issue itself.

@varunagrawal varunagrawal deleted the pr-curve-bugfix branch August 9, 2019 23:28
@varunagrawal varunagrawal restored the pr-curve-bugfix branch August 9, 2019 23:29
@varunagrawal varunagrawal deleted the pr-curve-bugfix branch August 9, 2019 23:29
@varunagrawal
Contributor Author

@amueller @jnothman I am closing this PR and opening a new one.

@varunagrawal
Contributor Author

varunagrawal commented Jun 9, 2020

The whole PascalVOC debate should ideally be settled in a different PR. I can open an issue later if that works.

Successfully merging this pull request may close these issues.

average_precision_score does not return correct AP when all negative ground truth labels