hammering home the point #1

amueller opened this issue Sep 8, 2016 · 6 comments

@amueller commented Sep 8, 2016

Just some comment on this here:
https://github.com/roaminsight/roamresearch/blob/master/BlogPosts/Average_precision/Average_precision_post.ipynb

It's true that linear interpolation for PR curves is overly optimistic and we should fix that. See this paper:
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2006_DavisG06.pdf

However, linear interpolation is totally fine for ROC curves, and your argument in "Hammering home the point" is actually wrong. The way you picked the rounding point, you ended up with a better classifier. You can achieve any classifier that lies on the line segment between two points by flipping a weighted coin to decide which of the two end-points to use (this holds for ROC, not PR).
There is a more subtle issue with how you do the interpolation. Choosing points for interpolation based on their P/R values means that you have already observed those points, so you can no longer use them as your test set. Interpolation that skips "bad" points is therefore only allowed when using a validation set.
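
To make the coin-flip argument concrete, here is a rough numpy sketch (synthetic scores and arbitrary thresholds, purely for illustration): randomly choosing between two thresholds lands, in expectation, exactly on the straight line between the two ROC operating points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
labels = rng.integers(0, 2, size=n)       # ground-truth binary labels
scores = rng.normal(size=n) + labels      # positives get higher scores

def roc_point(pred):
    """Return (FPR, TPR) for a vector of hard predictions."""
    return pred[labels == 0].mean(), pred[labels == 1].mean()

t_lo, t_hi = 0.0, 1.0                     # two arbitrary thresholds
f1, r1 = roc_point(scores >= t_lo)
f2, r2 = roc_point(scores >= t_hi)

lam = 0.3                                 # coin bias toward t_lo
coin = rng.random(n) < lam                # weighted coin flip per sample
mixed = np.where(coin, scores >= t_lo, scores >= t_hi)

print(roc_point(mixed))                   # the randomized classifier...
print((lam * f1 + (1 - lam) * f2,         # ...matches the linear
       lam * r1 + (1 - lam) * r2))        # interpolation of the endpoints
```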

@ndingwall (Contributor)

Oh, the "Hammering home the point" section is still about P-R curves. I now realize that it's confusing because we talk about ROC AUC and then go back to P-R without making the switch clear. I'll update the text to clarify this. Anyway, I think we agree: linear interpolation is fine for ROC AUC, but not P-R.

I'm not sure I follow your second point. Is there any reason to think that a better-than-chance classifier could be improved by randomly increasing or decreasing all of the scores? (Okay, not quite randomly, but the classifier doesn't know how rounding works, so from its point of view these changes are essentially random!)

Finally, I would assume that anyone using this function is computing the operating points on a test set, so now it's just a matter of how we convert the list of operating points into a single number. It makes sense for that number to represent the area under a curve defined by those operating points, and so we just need to choose how to interpolate. We agree that we shouldn't interpolate linearly. Step interpolation arises naturally in the same way that coin-flipping leads to linear interpolation for ROC, so that seems like the right choice. Another option is to return to the ROC space, compute the curve and transform it into the P-R space. But that would require a pretty big change to the API: lists containing precision and recall numbers aren't enough to compute ROC.
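
For concreteness, here's a little sketch (with made-up operating points) of the two ways of collapsing a list of (recall, precision) points into one number; the step version is the one I have in mind:

```python
import numpy as np

# Hypothetical operating points, sorted by increasing recall.
recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.9, 0.6, 0.7, 0.5, 0.4])

# Linear (trapezoidal) interpolation: the overly optimistic option.
ap_linear = np.sum(np.diff(recall) * (precision[:-1] + precision[1:]) / 2)

# Step interpolation: AP = sum_n (R_n - R_{n-1}) * P_n, i.e. hold each
# precision constant over the recall increment that reaches it.
ap_step = np.sum(np.diff(recall) * precision[1:])

print(ap_linear, ap_step)   # linear comes out higher here
```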

I might have missed your point entirely though, in which case please let me know!

Anyway, thanks for your feedback on this!

@amueller (Author) commented Sep 8, 2016

I think we agree on most things. I really do recommend reading the paper I cited, though ;) I commented on your PR with what I think would be the right thing to do.

@amueller (Author) commented Sep 8, 2016

Hm, but maybe the simple point that I'm not sure got across is: in the IR book, they remove the dips by computing a running maximum (on the test set!!). You don't do that in your code at all.
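
Concretely, the IR-book interpolation replaces each precision value with the maximum precision at any equal-or-greater recall, which is what removes the dips. A quick sketch with made-up numbers:

```python
import numpy as np

recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])   # increasing recall
precision = np.array([1.0, 0.9, 0.6, 0.7, 0.5, 0.4])

# p_interp(r) = max over r' >= r of p(r'): a running max from the right.
p_interp = np.maximum.accumulate(precision[::-1])[::-1]
print(p_interp)   # the dip at recall 0.4 (precision 0.6) is lifted to 0.7
```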

@ndingwall (Contributor)

Yes, I was confused by that as well:

"The justification is that almost anyone would be prepared to look at a few more documents if it would increase the percentage of the viewed set that were relevant (that is, if the precision of the larger set is higher)." (Source, for anyone else following this conversation: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html)

This seems wrong to me, because you wouldn't know whether the precision of the larger set is higher without peeking at the gold labels. My reference to the paper was just for their use of horizontal segments to interpolate between points; I'm not convinced by their overall strategy.

I'm about to get on a flight so I'll review the rest of your comments and read that paper. Thanks for all your feedback - super helpful!

@amueller (Author) commented Sep 8, 2016

Yeah, I just discussed exactly the same point with a colleague and we both have the same view. Also, thanks for your input; it totally helped me wrap my head around this. I think I'm good now ;)

@ndingwall (Contributor)

Great - happy to help!
