Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Bug fix and new feature: fix implementation of average precision score and add eleven-point interpolated option #7356

Closed
wants to merge 44 commits into from

Conversation

@ndingwall
Copy link
Contributor

@ndingwall ndingwall commented Sep 7, 2016

Reference Issue

Fixes #4577 and #6377

What does this implement/fix? Explain your changes.

This adds an optional interpolation parameter to both auc and average_precision_score. By default, the value is set to 'linear' which replicates the existing behavior, but there is also a 'step' option that implements the strategy described in Stanford's Introduction to Information Retrieval, and as discussed in this blog post. To generalize this, auc also offers a parameter to control the direction of the step-wise interpolation; I'm sure someone will find this useful at some point!

Any other comments?

At some point, I'd support switching the standard behavior to step-wise interpolation, but that will make the existing docstring tests fail. I don't want to change them without the support of the community!


This change is Reviewable

interpolation : string ['linear' (default), 'step']
Determines the kind of interpolation used when computed AUC. If there
are many repeated scores, 'step' is recommended to avoid under- or
over-estimating the AUC. See `Roam Analytics blog post

This comment has been minimized.

@amueller

amueller Sep 8, 2016
Member

This doesn't seem a good source to me. I'm happy with citing the IR book, though.

This comment has been minimized.

@ndingwall

ndingwall Sep 8, 2016
Author Contributor

No problem - I'll remove it and add the IR book to the references.

This comment has been minimized.

@ndingwall

ndingwall Sep 8, 2016
Author Contributor

@amueller Related question: in the references section, is it okay to include the link to my blog post? I'm never sure whether the references are to justify the implementation, or to provide useful guidance for users. I feel like my post might be helpful for the latter but not for the former!

@amueller
Copy link
Member

@amueller amueller commented Sep 8, 2016

You can also cite the Pascal VOC paper. Can you please also add a reference to the paper I cited earlier about the relation of ROC curve and PR curve?

@amueller
Copy link
Member

@amueller amueller commented Sep 8, 2016

Wait, so currently we are using linear interpolation? That is actually... wrong... see the paper I cited.
However, this looks like it's not the same as what the Pascal VOC computes, which cheats even more than we do.... hum...

@amueller
Copy link
Member

@amueller amueller commented Sep 8, 2016

Could you please do a comparison of the VOC measure, the measure you implemented, the measure in the IR book and our current measure?
This is actually a different issue for PR and ROC curve, as the paper that I mentioned explains. For ROC curve, linear interpolation is fine, for PR curve, it is not - which is what the blog post complains about.
You could do stepwise interpolation on the PR curve, which is a pessimistic boundary, or you could do the real interpolation, as explained in the paper, but that requires a validation set.

Also, can you maybe add an example that compares the different methods. This is pretty subtle.
So for AUC, there is "linear interpolation", "step-wise interpolation" and "VOC smoothing" (aka cheating). There could also be the correct smoothing using a validation set, which would be good.

For PR, there is "linear interpolation" (which we currently do and is wrong), "VOC smoothing" which is even worse, "step-wise interpolation" which is a lower bound and the correct interpolation as explained in the paper, which requires a validation set.

@amueller
Copy link
Member

@amueller amueller commented Sep 8, 2016

The IR book is also cheating (using the VOC method), the blog post is not cheating as far as I can see.

@amueller
Copy link
Member

@amueller amueller commented Sep 8, 2016

Oh the blog post is by you? Then never mind my comment there ;)

So interestingly AP as defined on wikipedia is again different from this. mean AP is the mean over recall@k for all k. That means some of the sawtooth are skipped over, but not all.

@@ -55,6 +57,25 @@ def auc(x, y, reorder=False):
If True, assume that the curve is ascending in the case of ties, as for
an ROC curve. If the curve is non-ascending, the result will be wrong.
interpolation : string ['trapezoid' (default), 'step']

This comment has been minimized.

@amueller

amueller Sep 8, 2016
Member

linear, not trapezoid ;)

This comment has been minimized.

@ndingwall

ndingwall Sep 8, 2016
Author Contributor

Good catch!

@amueller
Copy link
Member

@amueller amueller commented Sep 8, 2016

Ok so after thinking about it for another hour: you are implementing the wikipedia definition. We should not allow "linear" interpolation for average_precision as it is wrong. We should also allow "interpolated" average precision as many people use it.

So I'd add a parameter to average_precision that is ``interpolation="none"" and add the option "pascal_voc", and maybe also the "ir_book" only we should give them better names.

I wouldn't worry about AUC for now.

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Sep 8, 2016

Before I get on this flight, I want to clarify your last comment so I can spend some flight time working on it! I haven't read the paper yet, so apologies if some of this would be pre-empted by that.

Interpolation options
  1. None (default) Just return the mean of the precision values (and ignore recall).
  2. pascal_voc (Pg 11) This appears to be the same as the 11-point interpolated average precision (11PIAP from here on) mentioned in the IR book. For each of 11 recall values r, we choose the precision that corresponds to the smallest recall greater than or equal to r. Then we return the mean of these precisions.
  3. ir_book I don't think either of us like the approach that excludes operating points based on comparing their precision to the precision of the subsequent operating point. I'm therefore not keen to implement something that I wouldn't encourage people to use, especially because it's going to be fiddly!
Comments:

'None' falls into the same pitfalls as linear interpolation. If we only have two operating points, the arithmetic mean of the precision values is the midpoint of the straight line joining them. That's what I was trying to escape!

This problem doesn't apply to the 11PIAP because the recall values are specified, so although we're taking an arithmetic mean, we're at least comparing oranges and oranges.

But 11PIAP does introduce a different problem. Consider a model that assigns scores which permit recall values of 0.09, 0.19, 0.29, etc. This model is likely to appear to be much worse than one whose recall values are 0.11, 0.21, 0.31, etc. even if the first one has much better precision at those operating points. We could imagine a situation where including examples at random would actually improve the 11PIAP by bringing the recall above the arbitrary thresholds.

The step interpolation I proposed is really just an extension of the 11PIAP to arbitrarily many points. It doesn't introduce the bias that linear interpolation does, and it allows each operating point to fully contribute whatever its recall value is. So I'd still advocate for including that as an option!

@amueller
Copy link
Member

@amueller amueller commented Sep 9, 2016

By "none" I meant what you did, i.e. step-wise interpolation to the left. it's not really interpolation, though.
I'm not sure what you mean by the "ignoring recall". I meant the wikipedia definition of average precision. That's what you implemented, and that's what I think the default should be.

I hope I'll have some plots later today to illustrate.

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Sep 9, 2016

Oh, that makes sense. I call it 'interpolated' because I'm thinking of it from a graphical perspective, but I see that the wiki article comes to the same calculation without explicitly considering the curve.

I'll recode it so that the wikipedia definition ($\operatorname{AveP} = \sum_{k=1}^n P(k) \Delta r(k)$) is the default and I'll add an option to use interpolation='eleven_point' interpolation instead (and point to the IR book which has a good section on that).

Presumably I'll need to change some tests since we'll be changing the default behavior. Is there a best-practice way to do that?

@amueller
Copy link
Member

@amueller amueller commented Sep 9, 2016

check which tests fail, fix them. add a test that checks that the current behavior is correct, not like the previous one, possibly using one of the reported bugs.
then add a section to bugfixes in whatsnew that points out the change in the behavior.
Usually we don't make backward incompatible changes, but I would consider this a bugfix.
I would really love to also see a visual comparison of the different methods as an example (with how to do it and how not to do it). Not sure I find the time to do that.
Also see #7372.
Currently the plots of precision_recall_curve use plt.plot which basically plots linear interpolation, which is a bit odd.

@amueller
Copy link
Member

@amueller amueller commented Sep 14, 2016

please also check #4936 to see if you can reuse any of that.

@amueller amueller closed this Sep 14, 2016
@amueller amueller reopened this Sep 14, 2016
@amueller
Copy link
Member

@amueller amueller commented Sep 14, 2016

sorry accidental close.

Copy link
Member

@amueller amueller left a comment

Please add a non-regression test. Ideally it would be great if you could have a test against the output of the Pascal VOC code.

Uses a step function to interpolate between operating points.
interpolation : None (default), or string ['eleven_point']
``None``:

This comment has been minimized.

@amueller

amueller Sep 16, 2016
Member

Can you check how this renders / whether sphinx spits out warnings?

This comment has been minimized.

@ndingwall

ndingwall Sep 20, 2016
Author Contributor

Seems to be okay.
screen shot 2016-09-20 at 3 51 43 pm

Linearly interpolates between operating points.
``'step'``:
Uses a step function to interpolate between operating points.
interpolation : None (default), or string ['eleven_point']

This comment has been minimized.

@amueller

amueller Sep 16, 2016
Member

why a list?

This comment has been minimized.

@ndingwall

ndingwall Sep 16, 2016
Author Contributor

I was just copying the style from other places that includes acceptable strings in a list. But they had multiple options so I guess that made more sense. Should it read interpolation : None (default), or string 'eleven_point' or interpolation : None (default), or string ['eleven_point', ]?

This comment has been minimized.

@amueller

amueller Sep 16, 2016
Member

the first, I think.

This comment has been minimized.

@ndingwall

ndingwall Sep 20, 2016
Author Contributor

Done. See the screenshot below.

Returns
-------
average_precision : float
References
----------
.. [1] `Wikipedia entry for the Average precision
<https://en.wikipedia.org/wiki/Average_precision>`_
<http://en.wikipedia.org/wiki/Average_precision>`_
.. [2] `Stanford Information Retrieval book

This comment has been minimized.

@amueller

amueller Sep 16, 2016
Member

Maybe add the Pascal VOC paper?

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@ndingwall

ndingwall Sep 20, 2016
Author Contributor

Done.

``'eleven_point'``:
For each of the recall values, r in {0, 0.1, 0.2, ..., 1.0},
compute the arithmetic mean of the first precision value with a
corresponding recall ≥ r. See the `Stanford Information Retrieval

This comment has been minimized.

@amueller

amueller Sep 16, 2016
Member

the >= symbol is probably not a good idea here (you could declare an encoding for the file but I think just writing >= is better).

This comment has been minimized.

@ndingwall

ndingwall Sep 16, 2016
Author Contributor

Done.

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Sep 16, 2016

@amueller is there a python implementation of Pascal VOC? The Matlab implementation is in this comment, but I can't find a Python version. I'd write one, but it's likely to be identical to what the implementation here so that's probably not helpful for testing purposes!

@amueller
Copy link
Member

@amueller amueller commented Sep 16, 2016

There's no python version. Would you mind running the matlab version? There could be subtle differences.
Running it with octave should be fine. If that doesn't work / is too much work, we might find someone who has matlab, or not bother.

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Sep 16, 2016

I don't have either matlab or octave. If you/someone else can run it and paste the y_true and y_score they used into this thread, I'd be happy to add a test.

@@ -6,6 +6,34 @@
Release history
===============

Version 0.18.2

This comment has been minimized.

@amueller

amueller May 18, 2017
Member

This version will not happen, we'll do 0.19 soonish?

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented May 18, 2017

@amueller The CircleCI output seems to stop with this error so I'm not seeing outputs of tests, but I had a problem before that my local results don't match what CircleCI is reporting.

Exception occurred:
  File "/home/ubuntu/scikit-learn/doc/sphinxext/sphinx_gallery/docs_resolv.py", line 48, in _get_data
    with open(url, 'r') as fid:
IOError: [Errno 2] No such file or directory: '/home/ubuntu/scikit-learn/doc/_build/html/stable/modules/generated/sklearn.feature_selection.SelectKBest.rst.html'

For testing the 11-point average, I would probably like to test against actual outcomes of the Pascal VOC code, since that is de-facto the definition of this measure

Do you have an example? I don't have Matlab so can't generate any.

@jnothman
Copy link
Member

@jnothman jnothman commented May 24, 2017

I wasn't sure where to add the changes in whats_new.rst so I created a 0.18.2 section. I can move them if this will be merged into a different version.

If you merge with current master, you'll not only have an opportunity to fix current conflicts, but find that there is a section for 0.19

Copy link
Member

@jnothman jnothman left a comment

I've not reviewed it all, but I don't see how we can make a fundamental change to basically every evaluation of a metric without having a deprecation cycle or something even more drastic like giving the new implementations new function names. People have, no doubt, reported experimental results with the incumbent implementation and will find their measures suddenly and inexplicably changed on update. I don't think that's okay.

I'm pretty sure I recall an earlier version of this maintaining current behaviour, and I think we need to keep that, with a warning, for some time; or else warn loudly that it has changed but instruct the user how to silence the warning.

/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_. In the example
below, the eleven precision values are indicated with an arrow to pointing to
the best precision possible while meeting or exceeding the desired recall.
Note that it's possible that the same operating point might correspond to

This comment has been minimized.

@jnothman

jnothman May 24, 2017
Member

This is in contradiction to the definition of operating point above. Do you simply mean that precision may be identical for multiple recall values?

This comment has been minimized.

@ndingwall

ndingwall May 25, 2017
Author Contributor

No, it's that there's no requirement that the eleven operating points chosen must be distinct: e.g. the points corresponding to recalls of 0.2, 0.3 and 0.4 might all be the same operating point at e.g. (0.45, 0.6) if 0.6 is the best precision you can get with a recall of at least 0.2.

- :func:`metrics.ranking.average_precision_score` no longer linearly
interpolates between operating points, and instead weights precisions
by the change in recall since the last operating point, as per the
`Wikipedia entry <http://en.wikipedia.org/wiki/Average_precision>`_.

This comment has been minimized.

@jnothman

jnothman May 24, 2017
Member

I think you need to cite a more stable reference; at a minimum, a versioned Wikipedia page.

This comment has been minimized.

@ndingwall

ndingwall May 25, 2017
Author Contributor

👍🏻

@amueller
Copy link
Member

@amueller amueller commented Jun 6, 2017

Can you maybe fix the conflicts?

Copy link
Member

@amueller amueller left a comment

Looks good apart from the minor nitpick, I think. Trying to get @agramfort to dig up his matlab.

0.79...
0.83...
>>> yt = np.array([0, 0, 1, 1])

This comment has been minimized.

@amueller

amueller Jun 6, 2017
Member

you could also just use the y_true and y_score variables defined above, instead of redefining them here?

@amueller
Copy link
Member

@amueller amueller commented Jun 6, 2017

I didn't realize Pascal VOC uses the standard AP, not 11 point. Is there a reference implementation for the 11 point somewhere?

@agramfort
Copy link
Member

@agramfort agramfort commented Jun 6, 2017

I did:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
np.savetxt('X.txt', X[:, :1])
np.savetxt('y.txt', y)

then in matlab using VOC code:

load X.txt
load y.txt

mask = y < 2;
X = X(mask, :);
y = y(mask, :);
y(y == 0) = -1;
out = X;
gt = y;

[so,si]=sort(-out);
tp=gt(si)>0;
fp=gt(si)<0;

fp=cumsum(fp);
tp=cumsum(tp);
rec=tp/sum(gt>0);
prec=tp./(fp+tp);

ap=VOCap(rec,prec)

returns

ap =

    0.9355
@GaelVaroquaux
Copy link
Member

@GaelVaroquaux GaelVaroquaux commented Jun 6, 2017

In the interest of getting this merged while the heat of the sprint lasts, I am taking over this PR and finishing up the last details.

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Jun 6, 2017

@amueller The Pascal VOC documentation is very clear that they're using 11-point AP in their evaluation (top right of page 11). It's not dated so this might have changed - do you have the actual Matlab code used in their evaluation?

Will you or @GaelVaroquaux respond to @jnothman's comment above?

@GaelVaroquaux
Copy link
Member

@GaelVaroquaux GaelVaroquaux commented Jun 6, 2017

I want to heavily rewrite the example. It's gotten way too complicated for demoing the average precision. I want to try to break it down more (and maybe do slightly less pretty plots).

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Jun 6, 2017

@amueller, it looks like they're using interpolated average precision, just not 11-point interpolated. This loop propagates the best precision backwards iteratively from the highest recall value.

for i=numel(mpre)-1:-1:1
    mpre(i)=max(mpre(i),mpre(i+1));
end

But either way, this contradicts the Pascal VOC documentation I've linked to in the docstrings! Do you know if this was updated somewhere?

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Jun 6, 2017

@GaelVaroquaux: agreed. It's hard to find the balance between providing examples of all functionality, and avoiding unnecessary complication. You're better placed than me to strike that balance! In any case, it looks like we might be able to abandon 11-point interpolated AP, although we might have to replace it with interpolated AP which suffers the same weird peeking-at-the-test-set-performance problem that 11-point IAP does.

@GaelVaroquaux
Copy link
Member

@GaelVaroquaux GaelVaroquaux commented Jun 7, 2017

So, it seems that on different datasets, we do not reproduce well the pascal VOC measure. I am not sure why

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Jun 7, 2017

@GaelVaroquaux, that's probably because I based the code on this documentation, which doesn't correspond to their evaluation code. This calls for 11-point interpolated AP, but the code does straight interpolated AP (no fixing of recall values). I could probably amend the code to reflect their code, but it'd be good to also update the link to the documentation. I just tried searching for an updated version of the documentation but the host.robots.ox.ac.uk website seems to be down at the moment...

@GaelVaroquaux
Copy link
Member

@GaelVaroquaux GaelVaroquaux commented Jun 7, 2017

@ndingwall : do you mean that it should match our "interpolation=None"? That's not what I am observing :$

@amueller
Copy link
Member

@amueller amueller commented Jun 7, 2017

no, interpolation=None is pessimistic, while I think theirs is optimistic (aka cheating)

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Jun 7, 2017

@GaelVaroquaux - as @amueller said, there's does use interpolation, just not the 11-point variety. We don't have an implementation that matches their evaluation code but it should be easy to write.

@GaelVaroquaux
Copy link
Member

@GaelVaroquaux GaelVaroquaux commented Jun 7, 2017

@ndingwall : understood. I confirm looking at their code.

It doesn't seem to be useful to try to replicate it.

@ndingwall
Copy link
Contributor Author

@ndingwall ndingwall commented Jun 7, 2017

@GaelVaroquaux, the interpolated version is requested in #4577. I don't mind removing it (and the 11-point version) because, like @amueller, I consider them both to be cheating! But we could argue that the code should reflect what the community uses rather than what we think is best...

@GaelVaroquaux
Copy link
Member

@GaelVaroquaux GaelVaroquaux commented Jun 9, 2017

Closing in favor of #9091

@amueller amueller removed this from PR phase in Andy's pets Jul 21, 2017
@varunagrawal
Copy link
Contributor

@varunagrawal varunagrawal commented Feb 25, 2018

@GaelVaroquaux you have closed this issue and stated in #9091 that you won't be finishing that PR. Can we please re-open and resolve this issue?

@amueller

@jnothman
Copy link
Member

@jnothman jnothman commented Feb 25, 2018

Copy link
Contributor

@varunagrawal varunagrawal left a comment

Collation of related issues in order to better improve the PR.


else:
raise ValueError("interpolation has to be one of "
"(None, 'eleven_point').")

This comment has been minimized.

@varunagrawal

varunagrawal Feb 25, 2018
Contributor

  1. We also need an option for the VOC variant. @ndingwall you mentioned that you could write it easily. Can you please add it in?
  2. It would be nice to abstract the _binary_*_average_precision functions into a variable and have a single call to _average_binary_score. It would help make the code easier to read.
# We need the recall values in ascending order, and to ignore the first
# (precision, recall) pair with precision = 1.
precision = precision[-2::-1]
recall = recall[-2::-1]

This comment has been minimized.

@varunagrawal

varunagrawal Feb 25, 2018
Contributor

I think it would make sense to fix line 470 in precision_recall_curve instead of doing this. Please refer to #8280

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
scikit-learn 0.19
PRs that need reviews
Linked issues

Successfully merging this pull request may close these issues.

6 participants
You can’t perform that action at this time.