# [MRG] Bug fix and new feature: fix implementation of average precision score and add eleven-point interpolated option #7356

Status: Closed · 44 commits · +256 −52

## Conversation

6 participants
Contributor

### ndingwall commented Sep 7, 2016 (edited by lesteve, Sep 28, 2016)

#### Reference Issue

Fixes #4577 and #6377

#### What does this implement/fix? Explain your changes.

This adds an optional `interpolation` parameter to both `auc` and `average_precision_score`. By default, the value is set to 'linear', which replicates the existing behavior, but there is also a 'step' option that implements the strategy described in Stanford's Introduction to Information Retrieval, and as discussed in this blog post. To generalize this, `auc` also offers a parameter to control the direction of the step-wise interpolation; I'm sure someone will find this useful at some point!
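To make the difference concrete, here is a minimal standalone sketch of the two strategies on a toy precision-recall curve. `auc_linear` and `auc_step` are hypothetical helpers for illustration, not the functions added by this PR:

```python
import numpy as np

def auc_linear(x, y):
    """Trapezoidal rule: linear interpolation between operating points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

def auc_step(x, y):
    """Step-wise interpolation: hold each y value constant back to the
    previous x, the pessimistic choice for precision-recall curves."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.diff(x) * y[1:]))

# Toy curve: recall on x, precision on y.
recall = [0.0, 0.5, 1.0]
precision = [1.0, 1.0, 0.5]
print(auc_linear(recall, precision))  # 0.875
print(auc_step(recall, precision))    # 0.75
```

Linear interpolation credits the curve with area it never demonstrably achieved between operating points, which is why it tends to over-estimate the area for precision-recall curves.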

At some point, I'd support switching the standard behavior to step-wise interpolation, but that will make the existing docstring tests fail. I don't want to change them without the support of the community!


### ndingwall added some commits Sep 7, 2016

- `3673ee1` Adds support for step-wise interpolation to auc and average_precision_score
- `62d7e18` Adds support for step-wise interpolation to auc and average_precision_score
- `ee8bb5c` Fixed line length


### amueller reviewed Sep 8, 2016

```diff
+    interpolation : string ['linear' (default), 'step']
+        Determines the kind of interpolation used when computed AUC. If there
+        are many repeated scores, 'step' is recommended to avoid under- or
+        over-estimating the AUC. See Roam Analytics blog post
```

#### amueller Sep 8, 2016

Member

This doesn't seem a good source to me. I'm happy with citing the IR book, though.

#### ndingwall Sep 8, 2016

Contributor

No problem - I'll remove it and add the IR book to the references.

#### ndingwall Sep 8, 2016

Contributor

@amueller Related question: in the references section, is it okay to include the link to my blog post? I'm never sure whether the references are to justify the implementation, or to provide useful guidance for users. I feel like my post might be helpful for the latter but not for the former!

Member

### amueller commented Sep 8, 2016

 You can also cite the Pascal VOC paper. Can you please also add a reference to the paper I cited earlier about the relation of ROC curve and PR curve?
Member

### amueller commented Sep 8, 2016

 Wait, so currently we are using linear interpolation? That is actually... wrong... see the paper I cited. However, this looks like it's not the same as what the Pascal VOC computes, which cheats even more than we do.... hum...
Member

### amueller commented Sep 8, 2016

 Could you please do a comparison of the VOC measure, the measure you implemented, the measure in the IR book, and our current measure?

This is actually a different issue for the PR and ROC curves, as the paper that I mentioned explains. For the ROC curve, linear interpolation is fine; for the PR curve, it is not, which is what the blog post complains about. You could do step-wise interpolation on the PR curve, which is a pessimistic bound, or you could do the real interpolation, as explained in the paper, but that requires a validation set.

Also, can you maybe add an example that compares the different methods? This is pretty subtle. So for ROC AUC, there is "linear interpolation", "step-wise interpolation" and "VOC smoothing" (aka cheating). There could also be the correct smoothing using a validation set, which would be good. For PR, there is "linear interpolation" (which we currently do, and which is wrong), "VOC smoothing" (which is even worse), "step-wise interpolation" (which is a lower bound), and the correct interpolation as explained in the paper, which requires a validation set.
Member

### amueller commented Sep 8, 2016

 The IR book is also cheating (using the VOC method), the blog post is not cheating as far as I can see.
Member

### amueller commented Sep 8, 2016

 Oh, the blog post is by you? Then never mind my comment there ;) So, interestingly, AP as defined on Wikipedia is again different from this: mean AP is the mean over recall@k for all k. That means some of the sawtooth points are skipped over, but not all.

### amueller reviewed Sep 8, 2016

```diff
 @@ -55,6 +57,25 @@ def auc(x, y, reorder=False):
         If True, assume that the curve is ascending in the case of ties,
         as for an ROC curve. If the curve is non-ascending, the result
         will be wrong.
+    interpolation : string ['trapezoid' (default), 'step']
```

#### amueller Sep 8, 2016

Member

linear, not trapezoid ;)

#### ndingwall Sep 8, 2016

Contributor

Good catch!

 Fix docstring from 'trapezoid' to 'linear' 
 4c46f53 
Member

### amueller commented Sep 8, 2016

 Ok, so after thinking about it for another hour: you are implementing the Wikipedia definition. We should not allow "linear" interpolation for average_precision, as it is wrong. We should also allow "interpolated" average precision, as many people use it. So I'd add an interpolation parameter to average_precision that defaults to "none", and add the option "pascal_voc", and maybe also "ir_book", though we should give them better names. I wouldn't worry about AUC for now.
Contributor

### ndingwall commented Sep 8, 2016

Before I get on this flight, I want to clarify your last comment so I can spend some flight time working on it! I haven't read the paper yet, so apologies if some of this would be pre-empted by that.

##### Interpolation options
1. None (default): just return the mean of the precision values (and ignore recall).
2. pascal_voc (Pg 11): this appears to be the same as the 11-point interpolated average precision (11PIAP from here on) mentioned in the IR book. For each of 11 recall values r, we choose the precision that corresponds to the smallest recall greater than or equal to r. Then we return the mean of these precisions.
3. ir_book: I don't think either of us likes the approach that excludes operating points based on comparing their precision to the precision of the subsequent operating point. I'm therefore not keen to implement something that I wouldn't encourage people to use, especially because it's going to be fiddly!

'None' falls into the same pitfalls as linear interpolation. If we only have two operating points, the arithmetic mean of the precision values is the midpoint of the straight line joining them. That's what I was trying to escape!

This problem doesn't apply to the 11PIAP because the recall values are specified, so although we're taking an arithmetic mean, we're at least comparing oranges and oranges.

But 11PIAP does introduce a different problem. Consider a model that assigns scores which permit recall values of 0.09, 0.19, 0.29, etc. This model is likely to appear to be much worse than one whose recall values are 0.11, 0.21, 0.31, etc. even if the first one has much better precision at those operating points. We could imagine a situation where including examples at random would actually improve the 11PIAP by bringing the recall above the arbitrary thresholds.

The step interpolation I proposed is really just an extension of the 11PIAP to arbitrarily many points. It doesn't introduce the bias that linear interpolation does, and it allows each operating point to fully contribute whatever its recall value is. So I'd still advocate for including that as an option!
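The threshold sensitivity described above can be demonstrated numerically. The sketch below (an assumed helper using the IR-book "max precision at recall >= r" variant, not this PR's implementation) scores two models with 11-point interpolated AP: model A has precision 0.05 higher than model B at every operating point, but its recall values sit just below the fixed thresholds, so it scores worse:

```python
import numpy as np

def eleven_point_ap(precision, recall):
    """11-point interpolated AP: for each r in {0, 0.1, ..., 1.0}, take the
    best precision achieved at recall >= r, then average the 11 values."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    interpolated = []
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        interpolated.append(precision[mask].max() if mask.any() else 0.0)
    return float(np.mean(interpolated))

prec_a = np.linspace(1.0, 0.1, 10)     # model A: higher precision...
rec_a = np.linspace(0.09, 0.99, 10)    # ...but recalls just below 0.1 * k
prec_b = prec_a - 0.05                 # model B: uniformly worse precision...
rec_b = np.minimum(rec_a + 0.02, 1.0)  # ...but recalls just above 0.1 * k

print(eleven_point_ap(prec_a, rec_a))  # ~0.50
print(eleven_point_ap(prec_b, rec_b))  # ~0.54: the worse model scores higher
```

Because precision decreases with recall in this toy data, the "smallest recall >= r" and "best precision at recall >= r" variants pick the same operating points here.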

 Remove reference to blog 
 10e80f3 
Member

### amueller commented Sep 9, 2016

 By "none" I meant what you did, i.e. step-wise interpolation to the left. It's not really interpolation, though. I'm not sure what you mean by "ignoring recall": I meant the Wikipedia definition of average precision. That's what you implemented, and that's what I think the default should be. I hope I'll have some plots later today to illustrate.
Contributor

### ndingwall commented Sep 9, 2016

 Oh, that makes sense. I call it 'interpolated' because I'm thinking of it from a graphical perspective, but I see that the wiki article comes to the same calculation without explicitly considering the curve. I'll recode it so that the Wikipedia definition ($\operatorname{AveP} = \sum_{k=1}^n P(k) \Delta r(k)$) is the default, and I'll add an interpolation='eleven_point' option instead (pointing to the IR book, which has a good section on that). Presumably I'll need to change some tests since we'll be changing the default behavior. Is there a best-practice way to do that?
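As a sanity check, that formula can be computed directly from scores. The helper below is only an illustration of the Wikipedia definition, not the PR's code; on the doctest data `y_true = [0, 0, 1, 1]`, `y_score = [0.1, 0.4, 0.35, 0.8]` it gives 0.83..., matching the corrected doctest value rather than the old interpolated 0.79...:

```python
import numpy as np

def average_precision_wiki(y_true, y_score):
    """AveP = sum_k P(k) * dr(k): precision at each rank k, weighted by
    the recall gained at that rank (the non-interpolated definition)."""
    order = np.argsort(-np.asarray(y_score, dtype=float))
    y_true = np.asarray(y_true)[order]
    tp = np.cumsum(y_true)                          # true positives at rank k
    precision = tp / np.arange(1, len(y_true) + 1)  # P(k)
    recall = tp / y_true.sum()                      # r(k)
    delta_r = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(precision * delta_r))

print(average_precision_wiki([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.8333...
```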
Member

### amueller commented Sep 9, 2016

 Check which tests fail, and fix them. Add a test that checks that the current behavior is correct, unlike the previous one, possibly using one of the reported bugs. Then add a section to the bugfixes in whatsnew that points out the change in behavior. Usually we don't make backward-incompatible changes, but I would consider this a bugfix. I would really love to also see a visual comparison of the different methods as an example (with how to do it and how not to do it); not sure I'll find the time to do that. Also see #7372: currently the plots of precision_recall_curve use plt.plot, which basically plots linear interpolation, which is a bit odd.
Member

### amueller commented Sep 14, 2016

 please also check #4936 to see if you can reuse any of that.

Member

### amueller commented Sep 14, 2016

 sorry accidental close.


### ndingwall added some commits Sep 16, 2016

 Change interpolation options to None and 'eleven_point' 
 c2371dd 
 Revert to original auc() function 
 da07e85 

### amueller requested changes Sep 16, 2016

Please add a non-regression test. Ideally it would be great if you could have a test against the output of the Pascal VOC code.

```diff
-        Uses a step function to interpolate between operating points.
+    interpolation : None (default), or string ['eleven_point']
+
+        None:
```

#### amueller Sep 16, 2016

Member

Can you check how this renders / whether sphinx spits out warnings?

#### ndingwall Sep 20, 2016

Contributor

Seems to be okay.

```diff
-        Linearly interpolates between operating points.
-    'step':
-        Uses a step function to interpolate between operating points.
+    interpolation : None (default), or string ['eleven_point']
```

Member

why a list?

#### ndingwall Sep 16, 2016

Contributor

I was just copying the style from other places that include acceptable strings in a list. But they had multiple options, so I guess that made more sense. Should it read `interpolation : None (default), or string 'eleven_point'` or `interpolation : None (default), or string ['eleven_point', ]`?

#### amueller Sep 16, 2016

Member

the first, I think.

#### ndingwall Sep 20, 2016

Contributor

Done. See the screenshot below.

```diff
     Returns
     -------
     average_precision : float

     References
     ----------
     .. [1] Wikipedia entry for the Average precision
-       _
+       _
+    .. [2] Stanford Information Retrieval book
```

#### amueller Sep 16, 2016

Member

Maybe add the Pascal VOC paper?

Contributor

Member

yes

#### ndingwall Sep 20, 2016

Contributor

Done.

```diff
+    'eleven_point':
+        For each of the recall values, r in {0, 0.1, 0.2, ..., 1.0},
+        compute the arithmetic mean of the first precision value with a
+        corresponding recall ≥ r. See the Stanford Information Retrieval
```

#### amueller Sep 16, 2016

Member

the >= symbol is probably not a good idea here (you could declare an encoding for the file but I think just writing >= is better).

#### ndingwall Sep 16, 2016

Contributor

Done.

 Change >= symbol to '>=' string 
 59e805d 
Contributor

### ndingwall commented Sep 16, 2016

 @amueller is there a Python implementation of Pascal VOC? The Matlab implementation is in this comment, but I can't find a Python version. I'd write one, but it's likely to be identical to the implementation here, so that's probably not helpful for testing purposes!
Member

### amueller commented Sep 16, 2016

 There's no python version. Would you mind running the matlab version? There could be subtle differences. Running it with octave should be fine. If that doesn't work / is too much work, we might find someone who has matlab, or not bother.
Contributor

### ndingwall commented Sep 16, 2016

 I don't have either matlab or octave. If you/someone else can run it and paste the y_true and y_score they used into this thread, I'd be happy to add a test.
Contributor

### ndingwall commented Sep 16, 2016

 There are a couple of problems in the test_ranking.py file. `_test_precision_recall_curve` uses `auc` on the precision and recall values as the baseline for `average_precision_score`, which means my implementation fails the test. Is there a sensible alternative to `auc` that I should compare against? Perhaps a slow implementation that more obviously follows the wiki article:

```python
precision, recall, threshold = precision_recall_curve(y_true, y_score)
precision = list(reversed(precision))
recall = list(reversed(recall))
average_precision = 0
for i in range(1, len(precision)):
    average_precision += precision[i] * (recall[i] - recall[i - 1])
```

The alternative implementation of average precision suffers from a similar problem to the original `average_precision_score` implementation: in the edge case of identical predictions, the ordering of the documents matters. For example,

```python
_average_precision(np.array([1, 0, 0, 0]), np.ones(4))
```

results in an average precision of 0.25, while

```python
_average_precision(np.array([0, 0, 0, 1]), np.ones(4))
```

results in 1.0. This doesn't stop it passing the test, since it works whenever the elements of `y_score` are mostly unique, and that implementation is only used to test against `auc`, not `average_precision_score`, which is kind of weird in itself! But this raises a question: should the test implementation approximate the actual implementation for all inputs? If so, we should replace it with the (inefficient) implementation above. Or should it just work for a specified test case? In that case I'll leave it as it is.
Member

### amueller commented Sep 16, 2016

 slow and obvious implementation sounds good. If the test currently works, I'd say leave it and comment that the alternative implementation only works in this edge-case.
Member

### amueller commented Sep 16, 2016

 @ogrisel @jnothman @GaelVaroquaux I would kinda like this in the final version, as our current average precision is not great :-/ @ndingwall would you be interested in adding a plot of a PR curve comparing the 11-point interpolation, the IR "skip dips", the "linear interpolation" we had before, and the "left step interpolation" you're implementing here? I'd rather get this merged sooner and maybe backport to 0.18 than wait for an example, though if you have extra time on hand, I'd love to see the plot in the documentation.
Contributor

### ndingwall commented Sep 16, 2016

 @amueller I might be able to find some time over the weekend to amend the example. I started it (see #7372) so perhaps we should discuss it there? I'm reluctant to include the "skip dips" part because that's going to require a load of new code and I'm yet to see a justification for what seems to me (and to you as well I think) as cheating.

### ndingwall added some commits Sep 16, 2016

 Small change to docstring 
 3f64c6d 
 Add reference to Pascal VOC book 
 852e043 
 Replace auc with slow implementation of average precision for testing 
 a92d774 
Member

### amueller commented Sep 16, 2016

 Cool, let's see what the CIs say. I should do something else now, though, and might not get back to you on the Matlab thing soon (the final final version of my book is due on Monday and I have a keynote next week).

### ndingwall added some commits Sep 16, 2016

 Minor change to docstring 
 5cf87b0 
 Update example to include 'eleven_point' interpolation 
 bca5cb8 

### GaelVaroquaux reviewed Apr 24, 2017

```diff
+    recall = recall[-2::-1]
+
+    return np.mean([precision[i:].max() for i in
+                    np.searchsorted(recall, np.arange(0, 1.1, 0.1))])
```

#### GaelVaroquaux Apr 24, 2017

Member

It would be good to move those two functions outside the enclosing function. In general, functions defined inside functions is not great style, as it hinders reuse and testing.

#### ndingwall May 16, 2017 (edited)

Contributor

I put the second function there to reflect the existing code which has _binary_average_precision inside the average_precision_score function. I'm happy to move them out, but they're both very specific to average precision so I doubt they'd ever be reused! Would you prefer them at the module level (prefixed with _) anyway?

#### varunagrawal Feb 25, 2018

Contributor

@ndingwall I believe having them at the module level or at metrics.base would be preferred.

Member

### GaelVaroquaux commented Apr 24, 2017

 I've made a bunch of minor comments. Overall, I think that this PR is good and that it is very important to merge it (one of my students just hit the corresponding bug :D). The only aspect that I feel uneasy about is the example: it was just made much more complicated. There is also one failing test, due to numerical instability; it should be corrected. Finally (and that's quite important), a clear note should be made in the docs/whats_new.rst file to stress the change and explain it. Some users will be very surprised when they update scikit-learn and find that they get different scores. @ndingwall: do you think that you can address those comments? Then I'll merge your PR. Thanks a lot!
Contributor

### ndingwall commented May 3, 2017

 @GaelVaroquaux thanks for the comments! I should be able to get to them this week. I agree about the example: perhaps we could split it into two, one for the binary case and one for multi-class? I feel like the current one is complicated partly because it's trying to cram so much in! On the other hand, having multiple examples for a single function is probably confusing in itself. I can't remember the justification for reversing the precision and recall lists, but I'll try to rework it without that either way.
Contributor

### ndingwall commented May 3, 2017

 @amueller I realize I didn't respond to your question before: "So are they actually cheating by picking the best points?" That sums up how I feel about it, but there may be some subtlety to the justification that I haven't grasped. In the IR book they say:

> The justification is that almost anyone would be prepared to look at a few more documents if it would increase the percentage of the viewed set that were relevant (that is, if the precision of the larger set is higher).

But (as I think we discussed before), you'd only be prepared to look at them if you already knew they were relevant. If your classifier isn't giving you that information (and it wouldn't, since it must have assigned them lower relevance than the things it's already shown you), it doesn't seem reasonable to use that information when evaluating the classifier.
Member

### GaelVaroquaux commented May 3, 2017

> I agree about the example - perhaps we could split it into two: one for the binary case and one for multi-class? I feel like the current one is complicated partly because it's trying to cram so much in! On the other hand, having multiple examples for a single function is probably confusing in itself.

I think that one good way to go about this is to split the example in two, but in the same file, using the reStructuredText formatting, as in http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html

### ndingwall added some commits May 16, 2017

 Better description of 11-point average precision 
 4fae063 
 Separate eleven-point interpolated average precision in example 
 3cc067a 
 Fix typo 
 b769998 
 Add changes to average_precision_score 
 323a59c 
Contributor

### ndingwall commented May 16, 2017

 @GaelVaroquaux, I finally got to addressing your comments! I ran nosetests and everything seems to be passing: I get a load of warnings about metrics/classification.py and a couple of DeprecationWarnings, but the average precision tests pass for me. Is the failing test the one I refer to in this comment?

I wasn't sure where to add the changes in whats_new.rst, so I created a 0.18.2 section. I can move them if this will be merged into a different version.

I followed your advice to split the example into sections, but I decided to move eleven-point average precision entirely into its own section. This both avoids complicating the original average precision computation and provides a much clearer explanation of eleven-point AP. I printed out a string table, which seems a bit ugly but I think helps to explain the algorithm. Let me know if I should remove it or change it to something else.
Member

### amueller commented May 18, 2017

 @ndingwall I'm back! Tests are failing. For testing the 11-point average, I would probably like to test against actual outcomes of the Pascal VOC code, since that is the de-facto definition of this measure (because defining scientific evaluation measures with Matlab code is such a good idea!).

### amueller reviewed May 18, 2017

```diff
 @@ -6,6 +6,34 @@
 Release history
 ===============
+Version 0.18.2
```

#### amueller May 18, 2017

Member

This version will not happen, we'll do 0.19 soonish?

Contributor

### ndingwall commented May 18, 2017

 @amueller The CircleCI output seems to stop with this error, so I'm not seeing the outputs of tests, but I had a problem before where my local results didn't match what CircleCI was reporting.

```
Exception occurred:
  File "/home/ubuntu/scikit-learn/doc/sphinxext/sphinx_gallery/docs_resolv.py", line 48, in _get_data
    with open(url, 'r') as fid:
IOError: [Errno 2] No such file or directory: '/home/ubuntu/scikit-learn/doc/_build/html/stable/modules/generated/sklearn.feature_selection.SelectKBest.rst.html'
```

> For testing the 11-point average, I would probably like to test against actual outcomes of the Pascal VOC code, since that is de-facto the definition of this measure

Do you have an example? I don't have Matlab so can't generate any.
Member

### jnothman commented May 24, 2017

> I wasn't sure where to add the changes in whats_new.rst so I created a 0.18.2 section. I can move them if this will be merged into a different version.

If you merge with current master, you'll not only have an opportunity to fix the current conflicts, but also find that there is a section for 0.19.

### jnothman reviewed May 24, 2017

I've not reviewed it all, but I don't see how we can make a fundamental change to basically every evaluation of a metric without having a deprecation cycle or something even more drastic like giving the new implementations new function names. People have, no doubt, reported experimental results with the incumbent implementation and will find their measures suddenly and inexplicably changed on update. I don't think that's okay.

I'm pretty sure I recall an earlier version of this maintaining current behaviour, and I think we need to keep that, with a warning, for some time; or else warn loudly that it has changed but instruct the user how to silence the warning.

```diff
+/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_. In the example
+below, the eleven precision values are indicated with an arrow to pointing to
+the best precision possible while meeting or exceeding the desired recall.
+Note that it's possible that the same operating point might correspond to
```

#### jnothman May 24, 2017

Member

This is in contradiction to the definition of operating point above. Do you simply mean that precision may be identical for multiple recall values?

#### ndingwall May 25, 2017

Contributor

No, it's that there's no requirement that the eleven operating points chosen be distinct: the points corresponding to recalls of 0.2, 0.3 and 0.4 might all be the same operating point, say (0.45, 0.6), if 0.6 is the best precision you can get with a recall of at least 0.2.

```diff
+    - :func:`metrics.ranking.average_precision_score` no longer linearly
+      interpolates between operating points, and instead weights precisions
+      by the change in recall since the last operating point, as per the
+      Wikipedia entry _.
```

#### jnothman May 24, 2017

Member

I think you need to cite a more stable reference; at a minimum, a versioned Wikipedia page.

Contributor

👍🏻

Member

### amueller commented Jun 6, 2017

 Can you maybe fix the conflicts?

### amueller reviewed Jun 6, 2017

Looks good apart from the minor nitpick, I think. Trying to get @agramfort to dig up his matlab.

```diff
-    0.79...
+    0.83...
+
+    >>> yt = np.array([0, 0, 1, 1])
```

#### amueller Jun 6, 2017

Member

you could also just use the y_true and y_score variables defined above, instead of redefining them here?

Member

### amueller commented Jun 6, 2017

 I didn't realize Pascal VOC uses the standard AP, not 11 point. Is there a reference implementation for the 11 point somewhere?
Member

### agramfort commented Jun 6, 2017 (edited Jun 7, 2017)

 I did:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
np.savetxt('X.txt', X[:, :1])
np.savetxt('y.txt', y)
```

then in Matlab, using the VOC code:

```matlab
load X.txt
load y.txt
mask = y < 2;
X = X(mask, :);
y = y(mask, :);
y(y == 0) = -1;
out = X;
gt = y;
[so,si]=sort(-out);
tp=gt(si)>0;
fp=gt(si)<0;
fp=cumsum(fp);
tp=cumsum(tp);
rec=tp/sum(gt>0);
prec=tp./(fp+tp);
ap=VOCap(rec,prec)
```

which returns `ap = 0.9355`.
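For anyone without Matlab, here is a rough Python transliteration of the `VOCap.m` logic used above. This is my reading of the publicly distributed devkit code, not verified against the Octave output, so treat it as a sketch:

```python
import numpy as np

def voc_ap(rec, prec):
    """Pascal VOC-style AP: pad the curve, replace each precision with the
    best precision at any higher recall (the 'optimistic' smoothing), then
    sum the rectangle areas where recall changes."""
    mrec = np.concatenate(([0.0], np.asarray(rec, dtype=float), [1.0]))
    mpre = np.concatenate(([0.0], np.asarray(prec, dtype=float), [0.0]))
    # Equivalent of the backward max loop in VOCap.m.
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0] + 1
    return float(np.sum((mrec[idx] - mrec[idx - 1]) * mpre[idx]))

# Toy sawtooth curve: the final dip to precision 0.5 is kept, but any dip
# followed by a recovery would be smoothed away optimistically.
print(voc_ap([0.5, 1.0], [1.0, 0.5]))  # 0.75
```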
Member

### GaelVaroquaux commented Jun 6, 2017

 In the interest of getting this merged while the heat of the sprint lasts, I am taking over this PR and finishing up the last details.
Contributor

### ndingwall commented Jun 6, 2017

 @amueller The Pascal VOC documentation is very clear that they're using 11-point AP in their evaluation (top right of page 11). It's not dated so this might have changed - do you have the actual Matlab code used in their evaluation? Will you or @GaelVaroquaux respond to @jnothman's comment above?


Member

### GaelVaroquaux commented Jun 6, 2017

 I want to heavily rewrite the example. It's gotten way too complicated for demoing the average precision. I want to try to break it down more (and maybe do slightly less pretty plots).
Member

### amueller commented Jun 6, 2017

 @ndingwall http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCdevkit_18-May-2011.tar is the matlab code... hm
Contributor

### ndingwall commented Jun 6, 2017

 @amueller, it looks like they're using interpolated average precision, just not 11-point interpolated. This loop propagates the best precision backwards iteratively from the highest recall value:

```matlab
for i=numel(mpre)-1:-1:1
    mpre(i)=max(mpre(i),mpre(i+1));
end
```

But either way, this contradicts the Pascal VOC documentation I've linked to in the docstrings! Do you know if this was updated somewhere?
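That backward loop has a one-line NumPy equivalent, shown here on a toy precision vector (illustration only, not code from this PR):

```python
import numpy as np

mpre = np.array([0.0, 1.0, 0.4, 0.8, 0.3, 0.0])
# Same effect as the Matlab loop: propagate the best precision backwards
# from the highest recall value, removing the sawtooth dips.
smoothed = np.maximum.accumulate(mpre[::-1])[::-1]
print(smoothed)  # array holding [1.0, 1.0, 0.8, 0.8, 0.3, 0.0]
```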
Contributor

### ndingwall commented Jun 6, 2017

 @GaelVaroquaux: agreed. It's hard to find the balance between providing examples of all functionality, and avoiding unnecessary complication. You're better placed than me to strike that balance! In any case, it looks like we might be able to abandon 11-point interpolated AP, although we might have to replace it with interpolated AP which suffers the same weird peeking-at-the-test-set-performance problem that 11-point IAP does.
Member

### GaelVaroquaux commented Jun 7, 2017

 So, it seems that on different datasets we do not reproduce the Pascal VOC measure well. I am not sure why.
Contributor

### ndingwall commented Jun 7, 2017

 @GaelVaroquaux, that's probably because I based the code on this documentation, which doesn't correspond to their evaluation code. This calls for 11-point interpolated AP, but the code does straight interpolated AP (no fixing of recall values). I could probably amend the code to reflect their code, but it'd be good to also update the link to the documentation. I just tried searching for an updated version of the documentation but the host.robots.ox.ac.uk website seems to be down at the moment...
Member

### GaelVaroquaux commented Jun 7, 2017

 @ndingwall : do you mean that it should match our "interpolation=None"? That's not what I am observing :$
Member

### amueller commented Jun 7, 2017

 no, interpolation=None is pessimistic, while I think theirs is optimistic (aka cheating)
Contributor

### ndingwall commented Jun 7, 2017

 @GaelVaroquaux - as @amueller said, theirs does use interpolation, just not the 11-point variety. We don't have an implementation that matches their evaluation code, but it should be easy to write.
Member

### GaelVaroquaux commented Jun 7, 2017

 @ndingwall : understood. I confirm looking at their code. It doesn't seem to be useful to try to replicate it.
Contributor

### ndingwall commented Jun 7, 2017

 @GaelVaroquaux, the interpolated version is requested in #4577. I don't mind removing it (and the 11-point version) because, like @amueller, I consider them both to be cheating! But we could argue that the code should reflect what the community uses rather than what we think is best...


Member

### GaelVaroquaux commented Jun 9, 2017

 Closing in favor of #9091

Contributor

### varunagrawal commented Feb 25, 2018 (edited)

 @GaelVaroquaux you have closed this issue and stated in #9091 that you won't be finishing that PR. Can we please re-open and resolve this issue? @amueller
Member

### jnothman commented Feb 25, 2018

 @vene, it seems you and @GaelVaroquaux discussed 11-point average precision at a sprint last year and thought it was problematic. Could you please enlighten us?

### varunagrawal reviewed Feb 25, 2018

Collation of related issues in order to improve the PR.

```diff
+
+    else:
+        raise ValueError("interpolation has to be one of "
+                         "(None, 'eleven_point').")
```

#### varunagrawal Feb 25, 2018

Contributor
1. We also need an option for the VOC variant. @ndingwall you mentioned that you could write it easily. Can you please add it in?
2. It would be nice to abstract the _binary_*_average_precision functions into a variable and have a single call to _average_binary_score. It would help make the code easier to read.
```diff
+    # We need the recall values in ascending order, and to ignore the first
+    # (precision, recall) pair with precision = 1.
+    precision = precision[-2::-1]
+    recall = recall[-2::-1]
```

#### varunagrawal Feb 25, 2018

Contributor

I think it would make sense to fix line 470 in `precision_recall_curve` instead of doing this. Please refer to #8280.