error in average_precision_score #5379

Closed
tomsvds opened this issue Oct 9, 2015 · 18 comments

@tomsvds commented Oct 9, 2015

I believe there is an error in sklearn.metrics.average_precision_score. Here is a script to show the result:

from sklearn.metrics import average_precision_score

y_true = [
       1, 1,
       1, 1,
       1, 1,
       1, -1,      # Single negative here, pos 8
       1, 1
       ]
y_score = list(range(len(y_true)))

for average in ('micro', 'macro', 'weighted', 'samples'):
    print( "Average:", average)
    print(average_precision_score(y_true, y_score, average=average))

What we have are essentially ten documents, nine of which are positive; the
example at position eight is negative. According to my understanding (and
Wikipedia, see https://en.wikipedia.org/wiki/Information_retrieval#Average_precision,
the third equation for AveP on that page), the average precision score should be:

| Pos | Value | Precision | Note                                     |
| --- | ----- | --------- | ---------------------------------------- |
|   1 |     1 | 1/1       |                                           |
|   2 |     1 | 2/2       |                                           |
|   3 |     1 | 3/3       |                                           |
|   4 |     1 | 4/4       |                                           |
|   5 |     1 | 5/5       |                                           |
|   6 |     1 | 6/6       |                                           |
|   7 |     1 | 7/7       |                                           |
|   8 |     0 | 7/8       | Contributes zero since negative example   |
|   9 |     1 | 8/9       |                                           |
|  10 |     1 | 9/10      |                                           |

The number of true positives (TP) here is 9.
The average precision = (7×1 + 8/9 + 9/10) / TP ≈ 0.976.
The value produced by the script (for all averaging schemes) is 0.865.
I believe average_precision_score is producing this value because it is dividing by N instead of TP.

Regards,
-Tom

@amueller (Member) commented Oct 9, 2015

ping @arjoly @jnothman

@absolutelyNoWarranty (Contributor)

Your script works fine if you reverse y_score. In y_score, the greater the score, the higher the position in the ranking (like probabilities).
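
A minimal sketch (illustrative, not from this thread) of that fix, with the scores reversed so the highest score comes first:

from sklearn.metrics import average_precision_score

# Same labels as in the script above; the single negative is at position 8.
y_true = [1, 1, 1, 1, 1, 1, 1, -1, 1, 1]
# Reversed scores: position 1 now gets the highest score.
y_score = list(range(len(y_true), 0, -1))

# With higher-is-better scores the intended ranking is recovered; the
# result is close to the hand-computed 0.976 (the exact value depends
# on the interpolation discussed later in this thread).
print(average_precision_score(y_true, y_score))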

@tomsvds (Author) commented Oct 9, 2015

Ah, OK, that's most of the problem. Thanks! My bad.

But it exposes another thing that still looks wrong. Try this even simpler example:

y_true = [1, -1, 1]
y_score = [3, 2, 1]

Average precision should be (1/1 + 2/3) / 2 = 5/6 ≈ 0.83.
average_precision_score produces 0.79.

Regards,
-Tom

@absolutelyNoWarranty (Contributor)

The wikipedia page says:

That is the area under the precision-recall curve. This integral is in practice replaced with a finite sum over every position in the ranked sequence of documents:

That formula is more of an approximation than what sklearn is doing. In your example, the precision-recall graph has the points (precision, recall) = (1, 0), (1, 0.5), (0.5, 0.5), and (2/3, 1), and sklearn calculates the area under the curve formed by connecting those points with straight lines. Notice the upward slope going from (0.5, 0.5) to (2/3, 1), similar to what we see here: http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

The finite-sum formula is equivalent to calculating the area under a curve that has no upward slopes, only stairs going downward. (In your example, it'd be as if the point (0.5, 0.5) were magically moved up to (2/3, 0.5) to match the height of the next point.)
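
A minimal sketch (illustrative, not from this thread) contrasting the two conventions on the example above, using precision_recall_curve and auc from sklearn.metrics:

import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([1, -1, 1])
y_score = np.array([3, 2, 1])

# Linear interpolation: trapezoidal area under the PR points.
precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))  # ~0.79, the value reported above

# Finite-sum definition: average the precision at the rank of each
# relevant document (the Wikipedia formula).
order = np.argsort(-y_score)          # best score first
rel = y_true[order] == 1              # relevance at each rank
p_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
print(p_at_k[rel].mean())             # (1/1 + 2/3) / 2 = 5/6 ~ 0.83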

@amueller (Member)

@tomsvds more questions? Or can we close?

@tomsvds (Author) commented Oct 13, 2015

On Oct 12, 2015, at 2:53 PM, Andreas Mueller notifications@github.com wrote:
@tomsvds more questions? Or can we close?

Well, I can’t use the function, but I appreciate the answer and I have no more questions.

Regards,
-Tom

@amueller (Member)

Why can't you use the function? Because you don't want the standard definition of average precision?

@amueller (Member)

We could have options "interpolated", "triangle" and "quadrature"?

@glouppe changed the title from "error in" to "error in average_precision_score" Oct 19, 2015
@ndingwall (Contributor)

I just submitted a pull request to change the interpolation from linear to a sort of step function (for reasons I explain in some detail in the comments of the pull request!). I'd support making linear interpolation an option, but using it as the default is (IMHO!) fatally flawed.

@roadjiang

@tomsvds @absolutelyNoWarranty

I think this is an issue that should be fixed in the current average precision implementation.

If we follow the definition of NIST (National Institute of Standards and Technology),
http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf

They define two versions of average precision (interpolated and non-interpolated).

According to them, for the non-interpolated average precision:
"As an example, consider a query that has four relevant documents which are retrieved at ranks 1, 2, 4, and 7. The actual precision obtained when each relevant document is retrieved is 1, 1, 0.75, and 0.57, respectively, the mean of which is 0.83. Thus, the average precision over all relevant documents for this query is 0.83"

However, the current code yields 0.81101190476190477 rather than 0.83
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([1, 1, 0, 1, 0, 0, 1])
y_scores = np.array([7, 6, 5, 4, 3, 2, 1])
print(average_precision_score(y_true, y_scores))
# 0.81101190476190477

Besides, the current algorithm differs from the one on Wikipedia, which calculates

\operatorname{AveP} = \frac{\sum_{k=1}^n P(k) \times \operatorname{rel}(k)}{\text{number of relevant documents}}
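
A minimal sketch (illustrative; finite_sum_ap is a hypothetical helper, not a sklearn function) of that formula, reproducing NIST's 0.83 on the same data:

import numpy as np

def finite_sum_ap(y_true, y_score):
    """Non-interpolated AP: mean of P(k) over the relevant documents."""
    order = np.argsort(-y_score)                   # ranks, best score first
    rel = y_true[order] == 1                       # rel(k) in the formula
    p_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)  # P(k)
    return p_at_k[rel].mean()

print(finite_sum_ap(np.array([1, 1, 0, 1, 0, 0, 1]),
                    np.array([7, 6, 5, 4, 3, 2, 1])))
# (1 + 1 + 0.75 + 4/7) / 4 ~ 0.83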

@achalddave

Tentatively, it would be good to at least document this confusion about average precision. The overestimation of the true mAP as shown in #6377 is a rather critical bug.

@amueller (Member) commented Sep 6, 2016

@achalddave which confusion exactly? Whether it's interpolated? Also see #4577 and #6711, and check out the paper at http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf. If you want real interpolation, you need a validation set (or do it on the training set, which is possibly bad). Doing interpolation on the test set is methodologically flawed.

@amueller (Member) commented Sep 6, 2016

@roadjiang I'm actually not quite sure why our definition disagrees with the Wikipedia one. It shouldn't.

@ndingwall (Contributor)

@amueller @roadjiang Your comments here yesterday inspired me to finish a blog post and an update to average_precision_score that offers step-wise interpolation as an option. Step-wise interpolation is equivalent to randomly breaking ties, which removes a bias towards models that assign few discrete scores. This blog post discusses all this in more detail.
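
A minimal sketch (illustrative, not from the pull request) of step-wise interpolation: the PR curve is treated as downward stairs, so AP is the sum of precision times the recall increment at each threshold. On the [1, -1, 1] example it matches the finite-sum value 5/6:

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = [1, -1, 1]
y_score = [3, 2, 1]

precision, recall, _ = precision_recall_curve(y_true, y_score)
# recall comes back in decreasing order, so the diffs are negative.
ap = -np.sum(np.diff(recall) * precision[:-1])
print(ap)  # 5/6 ~ 0.83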

@bkj commented Jan 24, 2017

I agree it'd be a good idea to put a disclaimer on the docs page that this doesn't agree with the most straightforward definition of AP. Took a little bit of time to figure out what was going on.

@amueller (Member)

@bkj I'm for more documentation, but I disagree with "this doesn't agree with the most straightforward definition of AP". If you look at the definition of AP on Wikipedia, which is pretty straightforward, that's what we do. It's also the same in the IR book. "Area under the PR curve" is far less straightforward, very ambiguous, and mostly wrong: the PR curve is a series of points, and linear interpolation between those points is meaningless, so it's very unclear what "area under the curve" means.

@amueller (Member)

I think we should close this. Adding interpolation as an option is still possible, but we should be very clear with the documentation.

@jnothman (Member) commented Aug 21, 2017 via email
