
[MRG+1] Bugfix for precision_recall_curve when all labels are negative #8280

Closed

Conversation

varunagrawal
Contributor

Reference Issue

Fixes #8245

What does this implement/fix? Explain your changes.

When all the y_true labels are negative, precision_recall_curve returns NaN because recall is set to NaN instead of 1. This happens because recall is computed by directly dividing the tps vector by tps[-1], which in that case is 0.

This fix checks whether tps[-1] is 0 and, if so, sets recall to 1 directly, since there are no true positives or false negatives; otherwise recall is calculated as normal.
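A minimal sketch of that guard, assuming recall is derived from a cumulative true-positive vector tps as in sklearn/metrics/ranking.py (the helper name recall_from_tps is illustrative, not the actual function):

```python
import numpy as np

def recall_from_tps(tps):
    # tps is the cumulative count of true positives at each threshold,
    # so tps[-1] equals TP + FN, the total number of positive labels.
    if tps[-1] == 0:
        # No positive ground-truth labels: define recall as 1 everywhere
        # instead of dividing by zero and producing NaN.
        return np.ones_like(tps, dtype=float)
    return tps / tps[-1]
```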

Any other comments?

I had to update test_precision_recall_curve_toydata, since that test expected a true-divide exception to be raised, which is no longer the case as a result of this fix. I added two test cases, one checking when all truth labels are negative and one checking when all truth labels are positive, to ensure the precision calculation is accurate.

@varunagrawal varunagrawal force-pushed the pr-curve-bugfix branch 2 times, most recently from b39523f to 92f129d Compare February 3, 2017 05:04
assert_raises(Exception, average_precision_score, y_true, y_score)
# y_true = [0, 0]
# y_score = [0.25, 0.75]
# assert_raises(Exception, precision_recall_curve, y_true, y_score)
Member

This is the behaviour change: it used to warn because of a division by zero, which was turned into an error by the np.seterr('raise') context manager. Remove the np.seterr('raise') and make sure that you test the result of precision_recall_curve.

You don't need to add more tests for the edge cases since they were already tested; you just need to change what we expect from the results.
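For illustration, a sketch of the kind of direct result check being asked for here (the final assertion reflects this PR's convention of recall being defined rather than NaN; it is illustrative, not the merged test):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# All-negative ground truth: this used to trigger a division-by-zero
# warning, turned into an error by np.seterr('raise') in the old test.
y_true = np.array([0, 0])
y_score = np.array([0.25, 0.75])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Check the result directly instead of expecting an exception.
assert not np.isnan(recall).any()
```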

# y_score = np.array([[0, 1], [0, 1]])
# assert_raises(Exception, average_precision_score, y_true, y_score,
# average="macro")
# assert_raises(Exception, average_precision_score, y_true, y_score,
Member

Hmm, I have to say I am not sure what should happen in the multi-label case.

@@ -978,3 +979,27 @@ def test_ranking_loss_ties_handling():
assert_almost_equal(label_ranking_loss([[1, 0, 0]], [[0.25, 0.5, 0.5]]), 1)
assert_almost_equal(label_ranking_loss([[1, 0, 1]], [[0.25, 0.5, 0.5]]), 1)
assert_almost_equal(label_ranking_loss([[1, 1, 0]], [[0.25, 0.5, 0.5]]), 1)


def test_precision_recall_curve_all_negatives():
Member

You can remove this test

assert_not_equal(recall[0], np.nan)


def test_precision_recall_curve_all_positives():
Member

You can remove this test

@varunagrawal
Contributor Author

@lesteve I've updated the tests. Will squash this PR once it gets the green light.

@varunagrawal
Contributor Author

From the Travis logs, it seems the tests all pass but something else is failing, which I can't quite figure out. Can someone help with this?

@jnothman
Member

jnothman commented Feb 5, 2017 via email

@jnothman
Member

jnothman commented Feb 6, 2017

The travis failure is a current bug in Travis, which we are waiting on.

@lesteve
Member

lesteve commented Feb 8, 2017

I restarted the Travis build. @varunagrawal note you can force a rebuild of all the CIs by pushing a new commit in this branch. My favourite way to do that is git commit --amend + git push -f.

@lesteve
Member

lesteve commented Feb 8, 2017

The Travis flake8 error is genuine. Setting up on-the-fly flake8 checks inside your favourite editor is a great investment to avoid this kind of thing.

@codecov

codecov bot commented Feb 8, 2017

Codecov Report

Merging #8280 into master will increase coverage by <.01%.

@@            Coverage Diff             @@
##           master    #8280      +/-   ##
==========================================
+ Coverage   94.73%   94.73%   +<.01%     
==========================================
  Files         342      342              
  Lines       60674    60675       +1     
==========================================
+ Hits        57482    57483       +1     
  Misses       3192     3192
Impacted Files | Coverage Δ
--- | ---
sklearn/metrics/ranking.py | 98.75% <100%> (ø)
sklearn/metrics/tests/test_ranking.py | 100% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@varunagrawal
Contributor Author

@lesteve gotcha! Fixed them and submitted the changes. Thank you for the review. 😃

@jnothman
Member

I might note that in precision_recall_fscore_support we take a different convention of making R = 0 when there are no true instances. I appreciate the need here to do otherwise for plotting. Are you certain it's the correct approach to take in terms of average precision?

Otherwise LGTM

@jnothman jnothman changed the title [MRG] Bugfix for precision_recall_curve when all labels are negative [MRG+1] Bugfix for precision_recall_curve when all labels are negative Feb 23, 2017
@varunagrawal
Contributor Author

@jnothman so as per this answer on StackOverflow, this does seem to be the correct approach to take. I had the same doubts as you so I went and looked up half a dozen resources and worked out the arguments in my head before posting an issue about this.
The issue is #8245 for your reference.

Thanks for your review.

@jnothman
Member

jnothman commented Feb 23, 2017 via email

@lesteve
Member

lesteve commented Feb 24, 2017

> @jnothman so as per this answer on StackOverflow, this does seem to be the correct approach to take

The more I think about this, the less sure I am about it. It would be great to find a more authoritative source clearly specifying what the recall value should be when TP + FN == 0; maybe NaN is actually the best we can do ...

@jnothman
Member

jnothman commented Feb 25, 2017 via email

@varunagrawal
Contributor Author

@jnothman @lesteve sorry for being AWOL. Paper deadlines have not been kind.
Is there still any confusion regarding this PR? I tried looking through the code that calculates the P-R curve, but it seems to be doing some fancy operations which aren't quite clear to me.

From an authoritative standpoint, I believe the original PASCAL paper is our best bet.

@varunagrawal
Contributor Author

@jnothman @lesteve any updates on this?

@jnothman
Member

Haven't had much time for reviewing, and I expect little availability for the next couple of weeks. Please be patient, and ping again if necessary.

@jnothman
Member

@lesteve, as opposed to in precision_recall_fscore_support, for a PR curve, we need to consider the limits, rather than just a sensible overall evaluation metric. At the limit, it should always be possible for a system to get perfect recall of the positive class, even if there are no instances of it in the sample.

@jnothman jnothman added this to the 0.19 milestone Jun 18, 2017
@varunagrawal
Contributor Author

@jnothman @lesteve a gentle reminder. :)

@amueller
Member

Potentially yes, though hopefully for real-world data the changes are less significant. I don't think a bug in software that you used is an ethics violation. Relative statements are likely to also hold with the fix, absolute statements depend a lot on what you compare against. Some people use "interpolated mAP" which is even more optimistic than our previous implementation.

@varunagrawal
Contributor Author

The changes are very significant in my case. My model went from being state-of-the-art to worse than the current state of the art by a significant margin. If I hope to publish these results, I'll have to refute the claims made in my citations, and that's going to be another mess altogether.

@varunagrawal
Contributor Author

Does the current implementation follow the Pascal VOC formulation like the previous one, or is it based more on the information retrieval side of things?
I saw the Stanford Information Retrieval reference and wanted to clarify that.

@amueller
Member

@varunagrawal I'm sorry the bug is creating so much trouble for you. Again, are you comparing against a number in the literature? Neither the previous implementation nor the current one is the PascalVOC one, which implements something that is not standard in the literature. The PascalVOC measure is much more optimistic than any other measure.

@amueller
Member

We discussed implementing the PascalVOC one, but the PascalVOC paper disagrees with the code in the submission tools (the paper quotes an 11-point interpolation, the matlab code does general interpolation). Neither version is found anywhere else in the literature as far as I can tell.

@amueller
Member

check out #4577
The computer vision community seems to think the interpolated variant is standard, though it has been shown to be theoretically flawed; see http://mark.goadrich.com/articles/davisgoadrichcamera2.pdf
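To make the distinction concrete, here is an illustrative comparison of the two conventions on toy data; this is neither sklearn's internal implementation nor the exact PascalVOC code, just a sketch:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy data, purely for illustration.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# Step-wise AP: precision weighted by the recall decrement at each threshold
# (recall is returned in decreasing order, hence the minus sign).
step_ap = -np.sum(np.diff(recall) * precision[:-1])

# PASCAL VOC-style 11-point interpolated AP: average, over recall levels
# r in {0.0, 0.1, ..., 1.0}, of the maximum precision at recall >= r.
eleven_point_ap = np.mean(
    [precision[recall >= r].max() for r in np.linspace(0, 1, 11)]
)

# The interpolated variant is the more optimistic of the two here.
print(step_ap, eleven_point_ap)
```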

@amueller
Member

I think we might be open to adding the VOC variant so that people can more easily compare to the computer vision literature.

@varunagrawal
Contributor Author

@amueller thank you for your help on this. I was just about to mention adding a separate PascalVOC version since I myself work in computer vision and all the papers I have referenced use that form of mAP calculation, thus making it standard from a peer review perspective.

Thank you for the link to the argument showing PascalVOC got it wrong. I'll be sure to inform my advisor and see what we can do. However at this time, I'll be glad to help contribute an implementation of the PascalVOC version, thus allowing a nice separation between the information retrieval community and the computer vision community. :)

@amueller
Member

@varunagrawal feel free to open a PR to add an argument to the average_precision function. I'd also welcome some additions to the user guide explaining the differences. And for comparing against other computer vision papers, you should not use the version currently implemented in sklearn, but the PascalVOC one. So I don't think you're in trouble.

@varunagrawal
Contributor Author

Yup, the moment you mentioned PascalVOC and the IR definitions being different, I let out a sigh of relief.

@varunagrawal
Contributor Author

varunagrawal commented Feb 19, 2018

@amueller @GaelVaroquaux can you please provide more information on what is going on in this commit?
It seems that this is the change that was supposed to fix the bug by using the step function instead of interpolation. I am guessing this line is the one that performs the step interpolation?

What I don't understand is that there seems to have been some effort to parameterize the interpolation type being used, but there are additional commits on top of those which remove those changes.

@jnothman
Member

That commit is the (poorly-titled) result of #9017. @vene and @GaelVaroquaux seem to have concluded in person that there were problems in providing averaged eleven-point precision.

@varunagrawal
Contributor Author

Alright, so it seems that I can update the average_precision_score function to take an interpolation parameter: if False, just use the step function, and if True, use @ndingwall's implementation of the 11-point interpolated version.
Does that seem like a good roadmap?
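A hypothetical sketch of what that could look like (not an existing sklearn API; the parameter name, default, and semantics are illustrative only):

```python
def average_precision_score(y_true, y_score, average="macro",
                            sample_weight=None, interpolation=False):
    """Compute average precision.

    interpolation=False : step-wise AP (the current behaviour).
    interpolation=True  : 11-point interpolated AP (PASCAL VOC style).
    """
    raise NotImplementedError  # interface sketch only
```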

@varunagrawal
Contributor Author

@amueller @qinhanmin2014 I know it's been a while since we touched upon this PR. I am more than happy to finish updating this sometime next week if we have a consensus on what is the right thing to do here.

@jnothman
Member

> I know it's been a while since we touched upon this PR. I am more than happy to finish updating this sometime next week if we have a consensus on what is the right thing to do here.

Do you recall the discussion well enough to summarise it?

@varunagrawal
Contributor Author

@jnothman this PR is a fix for #8245.

Basically, in the multilabel case, if there are no positive binary labels, then we end up getting NaNs. I have described a fix for it in the issue itself.

@varunagrawal varunagrawal deleted the pr-curve-bugfix branch August 9, 2019 23:28
@varunagrawal varunagrawal restored the pr-curve-bugfix branch August 9, 2019 23:29
@varunagrawal varunagrawal deleted the pr-curve-bugfix branch August 9, 2019 23:29
@varunagrawal
Contributor Author

@amueller @jnothman I am closing this PR and opening a new one.

@varunagrawal
Contributor Author

varunagrawal commented Jun 9, 2020

The whole PascalVOC debate should ideally be settled in a different PR. I can open an issue later if that works.

Successfully merging this pull request may close these issues.

average_precision_score does not return correct AP when all negative ground truth labels