{{ message }}

# [MRG] Bug fix and new feature: fix implementation of average precision score and add eleven-point interpolated option #7356

Closed
wants to merge 44 commits into from
Closed

# [MRG] Bug fix and new feature: fix implementation of average precision score and add eleven-point interpolated option#7356

wants to merge 44 commits into from

## Conversation

### ndingwall commented Sep 7, 2016 • edited by lesteve

#### Reference Issue

Fixes #4577 and #6377

#### What does this implement/fix? Explain your changes.

This adds an optional interpolation parameter to both auc and average_precision_score. By default, the value is set to 'linear' which replicates the existing behavior, but there is also a 'step' option that implements the strategy described in Stanford's Introduction to Information Retrieval, and as discussed in this blog post. To generalize this, auc also offers a parameter to control the direction of the step-wise interpolation; I'm sure someone will find this useful at some point!

At some point, I'd support switching the standard behavior to step-wise interpolation, but that will make the existing docstring tests fail. I don't want to change them without the support of the community!

This change is

added 3 commits Sep 7, 2016
 Adds support for step-wise interpolation to auc and average_precision… 
 3673ee1 
…_score
 Adds support for step-wise interpolation to auc and average_precision… 
 62d7e18 
…_score
 Fixed line length 
 ee8bb5c 
mentioned this pull request Sep 7, 2016
 interpolation : string ['linear' (default), 'step'] Determines the kind of interpolation used when computed AUC. If there are many repeated scores, 'step' is recommended to avoid under- or over-estimating the AUC. See Roam Analytics blog post

#### amueller Sep 8, 2016 Member

This doesn't seem a good source to me. I'm happy with citing the IR book, though.

#### ndingwall Sep 8, 2016 Author Contributor

No problem - I'll remove it and add the IR book to the references.

#### ndingwall Sep 8, 2016 Author Contributor

@amueller Related question: in the references section, is it okay to include the link to my blog post? I'm never sure whether the references are to justify the implementation, or to provide useful guidance for users. I feel like my post might be helpful for the latter but not for the former!

### amueller commented Sep 8, 2016

 You can also cite the Pascal VOC paper. Can you please also add a reference to the paper I cited earlier about the relation of ROC curve and PR curve?

### amueller commented Sep 8, 2016

 Wait, so currently we are using linear interpolation? That is actually... wrong... see the paper I cited. However, this looks like it's not the same as what the Pascal VOC computes, which cheats even more than we do.... hum...

### amueller commented Sep 8, 2016

 Could you please do a comparison of the VOC measure, the measure you implemented, the measure in the IR book and our current measure? This is actually a different issue for PR and ROC curve, as the paper that I mentioned explains. For ROC curve, linear interpolation is fine, for PR curve, it is not - which is what the blog post complains about. You could do stepwise interpolation on the PR curve, which is a pessimistic boundary, or you could do the real interpolation, as explained in the paper, but that requires a validation set. Also, can you maybe add an example that compares the different methods. This is pretty subtle. So for AUC, there is "linear interpolation", "step-wise interpolation" and "VOC smoothing" (aka cheating). There could also be the correct smoothing using a validation set, which would be good. For PR, there is "linear interpolation" (which we currently do and is wrong), "VOC smoothing" which is even worse, "step-wise interpolation" which is a lower bound and the correct interpolation as explained in the paper, which requires a validation set.

### amueller commented Sep 8, 2016

 The IR book is also cheating (using the VOC method), the blog post is not cheating as far as I can see.

### amueller commented Sep 8, 2016

 Oh the blog post is by you? Then never mind my comment there ;) So interestingly AP as defined on wikipedia is again different from this. mean AP is the mean over recall@k for all k. That means some of the sawtooth are skipped over, but not all.
 @@ -55,6 +57,25 @@ def auc(x, y, reorder=False): If True, assume that the curve is ascending in the case of ties, as for an ROC curve. If the curve is non-ascending, the result will be wrong. interpolation : string ['trapezoid' (default), 'step']

#### amueller Sep 8, 2016 Member

linear, not trapezoid ;)

#### ndingwall Sep 8, 2016 Author Contributor

Good catch!

 Fix docstring from 'trapezoid' to 'linear' 
 4c46f53 

### amueller commented Sep 8, 2016

 Ok so after thinking about it for another hour: you are implementing the wikipedia definition. We should not allow "linear" interpolation for average_precision as it is wrong. We should also allow "interpolated" average precision as many people use it. So I'd add a parameter to average_precision that is interpolation="none"" and add the option "pascal_voc", and maybe also the "ir_book" only we should give them better names. I wouldn't worry about AUC for now.

### ndingwall commented Sep 8, 2016

Before I get on this flight, I want to clarify your last comment so I can spend some flight time working on it! I haven't read the paper yet, so apologies if some of this would be pre-empted by that.

##### Interpolation options
1. None (default) Just return the mean of the precision values (and ignore recall).
2. pascal_voc (Pg 11) This appears to be the same as the 11-point interpolated average precision (11PIAP from here on) mentioned in the IR book. For each of 11 recall values r, we choose the precision that corresponds to the smallest recall greater than or equal to r. Then we return the mean of these precisions.
3. ir_book I don't think either of us like the approach that excludes operating points based on comparing their precision to the precision of the subsequent operating point. I'm therefore not keen to implement something that I wouldn't encourage people to use, especially because it's going to be fiddly!

'None' falls into the same pitfalls as linear interpolation. If we only have two operating points, the arithmetic mean of the precision values is the midpoint of the straight line joining them. That's what I was trying to escape!

This problem doesn't apply to the 11PIAP because the recall values are specified, so although we're taking an arithmetic mean, we're at least comparing oranges and oranges.

But 11PIAP does introduce a different problem. Consider a model that assigns scores which permit recall values of 0.09, 0.19, 0.29, etc. This model is likely to appear to be much worse than one whose recall values are 0.11, 0.21, 0.31, etc. even if the first one has much better precision at those operating points. We could imagine a situation where including examples at random would actually improve the 11PIAP by bringing the recall above the arbitrary thresholds.

The step interpolation I proposed is really just an extension of the 11PIAP to arbitrarily many points. It doesn't introduce the bias that linear interpolation does, and it allows each operating point to fully contribute whatever its recall value is. So I'd still advocate for including that as an option!

 Remove reference to blog 
 10e80f3 

### amueller commented Sep 9, 2016

 By "none" I meant what you did, i.e. step-wise interpolation to the left. it's not really interpolation, though. I'm not sure what you mean by the "ignoring recall". I meant the wikipedia definition of average precision. That's what you implemented, and that's what I think the default should be. I hope I'll have some plots later today to illustrate.

### ndingwall commented Sep 9, 2016

 Oh, that makes sense. I call it 'interpolated' because I'm thinking of it from a graphical perspective, but I see that the wiki article comes to the same calculation without explicitly considering the curve. I'll recode it so that the wikipedia definition ($\operatorname{AveP} = \sum_{k=1}^n P(k) \Delta r(k)$) is the default and I'll add an option to use interpolation='eleven_point' interpolation instead (and point to the IR book which has a good section on that). Presumably I'll need to change some tests since we'll be changing the default behavior. Is there a best-practice way to do that?

### amueller commented Sep 9, 2016

 check which tests fail, fix them. add a test that checks that the current behavior is correct, not like the previous one, possibly using one of the reported bugs. then add a section to bugfixes in whatsnew that points out the change in the behavior. Usually we don't make backward incompatible changes, but I would consider this a bugfix. I would really love to also see a visual comparison of the different methods as an example (with how to do it and how not to do it). Not sure I find the time to do that. Also see #7372. Currently the plots of precision_recall_curve use plt.plot which basically plots linear interpolation, which is a bit odd.

### amueller commented Sep 14, 2016

 please also check #4936 to see if you can reuse any of that.
closed this Sep 14, 2016
reopened this Sep 14, 2016

### amueller commented Sep 14, 2016

 sorry accidental close.
mentioned this pull request Sep 14, 2016
added 2 commits Sep 16, 2016
 Change interpolation options to None and 'eleven_point' 
 c2371dd 
 Revert to original auc() function 
 da07e85 
requested changes

### amueller left a comment

 Please add a non-regression test. Ideally it would be great if you could have a test against the output of the Pascal VOC code.
 Uses a step function to interpolate between operating points. interpolation : None (default), or string ['eleven_point'] None:

#### amueller Sep 16, 2016 Member

Can you check how this renders / whether sphinx spits out warnings?

#### ndingwall Sep 20, 2016 Author Contributor

Seems to be okay.

 Linearly interpolates between operating points. 'step': Uses a step function to interpolate between operating points. interpolation : None (default), or string ['eleven_point']

why a list?

#### ndingwall Sep 16, 2016 Author Contributor

I was just copying the style from other places that includes acceptable strings in a list. But they had multiple options so I guess that made more sense. Should it read interpolation : None (default), or string 'eleven_point' or interpolation : None (default), or string ['eleven_point', ]?

#### amueller Sep 16, 2016 Member

the first, I think.

#### ndingwall Sep 20, 2016 Author Contributor

Done. See the screenshot below.

 Returns ------- average_precision : float References ---------- .. [1] Wikipedia entry for the Average precision _ _ .. [2] Stanford Information Retrieval book

#### amueller Sep 16, 2016 Member

Maybe add the Pascal VOC paper?

yes

#### ndingwall Sep 20, 2016 Author Contributor

Done.

 'eleven_point': For each of the recall values, r in {0, 0.1, 0.2, ..., 1.0}, compute the arithmetic mean of the first precision value with a corresponding recall ≥ r. See the Stanford Information Retrieval

#### amueller Sep 16, 2016 Member

the >= symbol is probably not a good idea here (you could declare an encoding for the file but I think just writing >= is better).

#### ndingwall Sep 16, 2016 Author Contributor

Done.

 Change >= symbol to '>=' string 
 59e805d 

### ndingwall commented Sep 16, 2016

 @amueller is there a python implementation of Pascal VOC? The Matlab implementation is in this comment, but I can't find a Python version. I'd write one, but it's likely to be identical to what the implementation here so that's probably not helpful for testing purposes!

### amueller commented Sep 16, 2016

 There's no python version. Would you mind running the matlab version? There could be subtle differences. Running it with octave should be fine. If that doesn't work / is too much work, we might find someone who has matlab, or not bother.

### ndingwall commented Sep 16, 2016

 I don't have either matlab or octave. If you/someone else can run it and paste the y_true and y_score they used into this thread, I'd be happy to add a test.
reviewed
 @@ -6,6 +6,34 @@ Release history =============== Version 0.18.2

#### amueller May 18, 2017 Member

This version will not happen, we'll do 0.19 soonish?

### ndingwall commented May 18, 2017

 @amueller The CircleCI output seems to stop with this error so I'm not seeing outputs of tests, but I had a problem before that my local results don't match what CircleCI is reporting. Exception occurred: File "/home/ubuntu/scikit-learn/doc/sphinxext/sphinx_gallery/docs_resolv.py", line 48, in _get_data with open(url, 'r') as fid: IOError: [Errno 2] No such file or directory: '/home/ubuntu/scikit-learn/doc/_build/html/stable/modules/generated/sklearn.feature_selection.SelectKBest.rst.html'  For testing the 11-point average, I would probably like to test against actual outcomes of the Pascal VOC code, since that is de-facto the definition of this measure Do you have an example? I don't have Matlab so can't generate any.

### jnothman commented May 24, 2017

 I wasn't sure where to add the changes in whats_new.rst so I created a 0.18.2 section. I can move them if this will be merged into a different version. If you merge with current master, you'll not only have an opportunity to fix current conflicts, but find that there is a section for 0.19
reviewed

### jnothman left a comment

 I've not reviewed it all, but I don't see how we can make a fundamental change to basically every evaluation of a metric without having a deprecation cycle or something even more drastic like giving the new implementations new function names. People have, no doubt, reported experimental results with the incumbent implementation and will find their measures suddenly and inexplicably changed on update. I don't think that's okay. I'm pretty sure I recall an earlier version of this maintaining current behaviour, and I think we need to keep that, with a warning, for some time; or else warn loudly that it has changed but instruct the user how to silence the warning.
 /viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf>_. In the example below, the eleven precision values are indicated with an arrow to pointing to the best precision possible while meeting or exceeding the desired recall. Note that it's possible that the same operating point might correspond to

#### jnothman May 24, 2017 Member

This is in contradiction to the definition of operating point above. Do you simply mean that precision may be identical for multiple recall values?

#### ndingwall May 25, 2017 Author Contributor

No, it's that there's no requirement that the eleven operating points chosen must be distinct: e.g. the points corresponding to recalls of 0.2, 0.3 and 0.4 might all be the same operating point at e.g. (0.45, 0.6) if 0.6 is the best precision you can get with a recall of at least 0.2.

 - :func:metrics.ranking.average_precision_score no longer linearly interpolates between operating points, and instead weights precisions by the change in recall since the last operating point, as per the Wikipedia entry _.

#### jnothman May 24, 2017 Member

I think you need to cite a more stable reference; at a minimum, a versioned Wikipedia page.

👍🏻

### amueller commented Jun 6, 2017

 Can you maybe fix the conflicts?
reviewed

### amueller left a comment

 Looks good apart from the minor nitpick, I think. Trying to get @agramfort to dig up his matlab.
 0.79... 0.83... >>> yt = np.array([0, 0, 1, 1])

#### amueller Jun 6, 2017 Member

you could also just use the y_true and y_score variables defined above, instead of redefining them here?

### amueller commented Jun 6, 2017

 I didn't realize Pascal VOC uses the standard AP, not 11 point. Is there a reference implementation for the 11 point somewhere?

### agramfort commented Jun 6, 2017 • edited

 I did: import numpy as np from sklearn.datasets import load_iris iris = load_iris() X, y = iris.data, iris.target np.savetxt('X.txt', X[:, :1]) np.savetxt('y.txt', y)  then in matlab using VOC code: load X.txt load y.txt mask = y < 2; X = X(mask, :); y = y(mask, :); y(y == 0) = -1; out = X; gt = y; [so,si]=sort(-out); tp=gt(si)>0; fp=gt(si)<0; fp=cumsum(fp); tp=cumsum(tp); rec=tp/sum(gt>0); prec=tp./(fp+tp); ap=VOCap(rec,prec)  returns ap = 0.9355 

### GaelVaroquaux commented Jun 6, 2017

 In the interest of getting this merged while the heat of the sprint lasts, I am taking over this PR and finishing up the last details.

### ndingwall commented Jun 6, 2017

 @amueller The Pascal VOC documentation is very clear that they're using 11-point AP in their evaluation (top right of page 11). It's not dated so this might have changed - do you have the actual Matlab code used in their evaluation? Will you or @GaelVaroquaux respond to @jnothman's comment above?
mentioned this pull request Jun 6, 2017

### GaelVaroquaux commented Jun 6, 2017

 I want to heavily rewrite the example. It's gotten way too complicated for demoing the average precision. I want to try to break it down more (and maybe do slightly less pretty plots).

### amueller commented Jun 6, 2017

 @ndingwall http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCdevkit_18-May-2011.tar is the matlab code... hm

### ndingwall commented Jun 6, 2017

 @amueller, it looks like they're using interpolated average precision, just not 11-point interpolated. This loop propagates the best precision backwards iteratively from the highest recall value. for i=numel(mpre)-1:-1:1 mpre(i)=max(mpre(i),mpre(i+1)); end  But either way, this contradicts the Pascal VOC documentation I've linked to in the docstrings! Do you know if this was updated somewhere?

### ndingwall commented Jun 6, 2017

 @GaelVaroquaux: agreed. It's hard to find the balance between providing examples of all functionality, and avoiding unnecessary complication. You're better placed than me to strike that balance! In any case, it looks like we might be able to abandon 11-point interpolated AP, although we might have to replace it with interpolated AP which suffers the same weird peeking-at-the-test-set-performance problem that 11-point IAP does.

### GaelVaroquaux commented Jun 7, 2017

 So, it seems that on different datasets, we do not reproduce well the pascal VOC measure. I am not sure why

### ndingwall commented Jun 7, 2017

 @GaelVaroquaux, that's probably because I based the code on this documentation, which doesn't correspond to their evaluation code. This calls for 11-point interpolated AP, but the code does straight interpolated AP (no fixing of recall values). I could probably amend the code to reflect their code, but it'd be good to also update the link to the documentation. I just tried searching for an updated version of the documentation but the host.robots.ox.ac.uk website seems to be down at the moment...

### GaelVaroquaux commented Jun 7, 2017

 @ndingwall : do you mean that it should match our "interpolation=None"? That's not what I am observing :\$

### amueller commented Jun 7, 2017

 no, interpolation=None is pessimistic, while I think theirs is optimistic (aka cheating)

### ndingwall commented Jun 7, 2017

 @GaelVaroquaux - as @amueller said, there's does use interpolation, just not the 11-point variety. We don't have an implementation that matches their evaluation code but it should be easy to write.

### GaelVaroquaux commented Jun 7, 2017

 @ndingwall : understood. I confirm looking at their code. It doesn't seem to be useful to try to replicate it.

### ndingwall commented Jun 7, 2017

 @GaelVaroquaux, the interpolated version is requested in #4577. I don't mind removing it (and the 11-point version) because, like @amueller, I consider them both to be cheating! But we could argue that the code should reflect what the community uses rather than what we think is best...
mentioned this pull request Jun 9, 2017

### GaelVaroquaux commented Jun 9, 2017

 Closing in favor of #9091
removed this from PR phase in Andy's pets Jul 21, 2017

### varunagrawal commented Feb 25, 2018 • edited

 @GaelVaroquaux you have closed this issue and stated in #9091 that you won't be finishing that PR. Can we please re-open and resolve this issue? @amueller

### jnothman commented Feb 25, 2018

 @vene, it seems you and @GaelVaroquaux discussed 11-point average precision at a sprint last year and thought it was problematic. Could you please enlighten us?
reviewed

### varunagrawal left a comment

 Collation of related issues in order to better improve the PR.
 else: raise ValueError("interpolation has to be one of " "(None, 'eleven_point').")

#### varunagrawal Feb 25, 2018 Contributor

1. We also need an option for the VOC variant. @ndingwall you mentioned that you could write it easily. Can you please add it in?
2. It would be nice to abstract the _binary_*_average_precision functions into a variable and have a single call to _average_binary_score. It would help make the code easier to read.
 # We need the recall values in ascending order, and to ignore the first # (precision, recall) pair with precision = 1. precision = precision[-2::-1] recall = recall[-2::-1]

#### varunagrawal Feb 25, 2018 Contributor

I think it would make sense to fix line 470 in precision_recall_curve` instead of doing this. Please refer to #8280

mentioned this pull request Jan 31, 2019