[MRG] Bug fix and new feature: fix implementation of average precision score and add eleven-point interpolated option #7356

Closed
wants to merge 44 commits into
from

Conversation

@ndingwall
Contributor

ndingwall commented Sep 7, 2016

Reference Issue

Fixes #4577 and #6377

What does this implement/fix? Explain your changes.

This adds an optional interpolation parameter to both auc and average_precision_score. By default, the value is set to 'linear', which replicates the existing behavior, but there is also a 'step' option that implements the strategy described in Stanford's Introduction to Information Retrieval and discussed in this blog post. To generalize this, auc also offers a parameter to control the direction of the step-wise interpolation; I'm sure someone will find this useful at some point!
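
For illustration, here is roughly how the proposed parameter would be used (a sketch based on this PR's description; the interpolation argument is not part of released scikit-learn and the exact signature may change):

import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# 'linear' (the proposed default) replicates the existing behavior;
# 'step' uses the step-wise interpolation described above.
ap_linear = average_precision_score(y_true, y_score, interpolation='linear')
ap_step = average_precision_score(y_true, y_score, interpolation='step')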

Any other comments?

At some point, I'd support switching the standard behavior to step-wise interpolation, but that will make the existing docstring tests fail. I don't want to change them without the support of the community!


sklearn/metrics/ranking.py
+ interpolation : string ['linear' (default), 'step']
+ Determines the kind of interpolation used when computed AUC. If there
+ are many repeated scores, 'step' is recommended to avoid under- or
+ over-estimating the AUC. See `Roam Analytics blog post

@amueller

amueller Sep 8, 2016

Member

This doesn't seem a good source to me. I'm happy with citing the IR book, though.

@ndingwall

ndingwall Sep 8, 2016

Contributor

No problem - I'll remove it and add the IR book to the references.

@ndingwall

ndingwall Sep 8, 2016

Contributor

@amueller Related question: in the references section, is it okay to include the link to my blog post? I'm never sure whether the references are to justify the implementation, or to provide useful guidance for users. I feel like my post might be helpful for the latter but not for the former!

@amueller

Member

amueller commented Sep 8, 2016

You can also cite the Pascal VOC paper. Can you please also add a reference to the paper I cited earlier about the relation of ROC curve and PR curve?

@amueller

Member

amueller commented Sep 8, 2016

Wait, so currently we are using linear interpolation? That is actually... wrong... see the paper I cited.
However, this looks like it's not the same as what the Pascal VOC computes, which cheats even more than we do.... hum...

@amueller

Member

amueller commented Sep 8, 2016

Could you please do a comparison of the VOC measure, the measure you implemented, the measure in the IR book and our current measure?
This is actually a different issue for PR and ROC curve, as the paper that I mentioned explains. For ROC curve, linear interpolation is fine, for PR curve, it is not - which is what the blog post complains about.
You could do stepwise interpolation on the PR curve, which is a pessimistic boundary, or you could do the real interpolation, as explained in the paper, but that requires a validation set.

Also, can you maybe add an example that compares the different methods. This is pretty subtle.
So for AUC, there is "linear interpolation", "step-wise interpolation" and "VOC smoothing" (aka cheating). There could also be the correct smoothing using a validation set, which would be good.

For PR, there is "linear interpolation" (which we currently do and is wrong), "VOC smoothing" which is even worse, "step-wise interpolation" which is a lower bound and the correct interpolation as explained in the paper, which requires a validation set.

@amueller

Member

amueller commented Sep 8, 2016

The IR book is also cheating (using the VOC method), the blog post is not cheating as far as I can see.

@amueller

Member

amueller commented Sep 8, 2016

Oh the blog post is by you? Then never mind my comment there ;)

So interestingly AP as defined on wikipedia is again different from this. mean AP is the mean over recall@k for all k. That means some of the sawtooth are skipped over, but not all.

sklearn/metrics/ranking.py
@@ -55,6 +57,25 @@ def auc(x, y, reorder=False):
If True, assume that the curve is ascending in the case of ties, as for
an ROC curve. If the curve is non-ascending, the result will be wrong.
+ interpolation : string ['trapezoid' (default), 'step']

@amueller

amueller Sep 8, 2016

Member

linear, not trapezoid ;)

@ndingwall

ndingwall Sep 8, 2016

Contributor

Good catch!

@amueller

Member

amueller commented Sep 8, 2016

Ok so after thinking about it for another hour: you are implementing the wikipedia definition. We should not allow "linear" interpolation for average_precision as it is wrong. We should also allow "interpolated" average precision as many people use it.

So I'd add a parameter to average_precision that is ``interpolation="none"`` and add the option "pascal_voc", and maybe also "ir_book", only we should give them better names.

I wouldn't worry about AUC for now.

@ndingwall

Contributor

ndingwall commented Sep 8, 2016

Before I get on this flight, I want to clarify your last comment so I can spend some flight time working on it! I haven't read the paper yet, so apologies if some of this would be pre-empted by that.

Interpolation options
  1. None (default): Just return the mean of the precision values (and ignore recall).
  2. pascal_voc (p. 11): This appears to be the same as the 11-point interpolated average precision (11PIAP from here on) mentioned in the IR book. For each of 11 recall values r, we choose the precision that corresponds to the smallest recall greater than or equal to r. Then we return the mean of these precisions.
  3. ir_book: I don't think either of us likes the approach that excludes operating points based on comparing their precision to the precision of the subsequent operating point. I'm therefore not keen to implement something that I wouldn't encourage people to use, especially because it's going to be fiddly!
Comments:

'None' falls into the same pitfalls as linear interpolation. If we only have two operating points, the arithmetic mean of the precision values is the midpoint of the straight line joining them. That's what I was trying to escape!

This problem doesn't apply to the 11PIAP because the recall values are specified, so although we're taking an arithmetic mean, we're at least comparing oranges and oranges.

But 11PIAP does introduce a different problem. Consider a model that assigns scores which permit recall values of 0.09, 0.19, 0.29, etc. This model is likely to appear to be much worse than one whose recall values are 0.11, 0.21, 0.31, etc. even if the first one has much better precision at those operating points. We could imagine a situation where including examples at random would actually improve the 11PIAP by bringing the recall above the arbitrary thresholds.

The step interpolation I proposed is really just an extension of the 11PIAP to arbitrarily many points. It doesn't introduce the bias that linear interpolation does, and it allows each operating point to fully contribute whatever its recall value is. So I'd still advocate for including that as an option!
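
A minimal sketch of the eleven-point scheme described in option 2 above, for concreteness (the function name is made up and the details approximate what is being discussed here, not the PR's code):

import numpy as np
from sklearn.metrics import precision_recall_curve

def eleven_point_interpolated_ap(y_true, y_score):
    # For each r in {0.0, 0.1, ..., 1.0}, take the best precision achievable
    # at a recall of at least r, then average the eleven values.
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # precision_recall_curve returns decreasing recall, ending with the
    # (precision=1, recall=0) sentinel; reverse and drop that last point.
    precision, recall = precision[-2::-1], recall[-2::-1]
    return np.mean([precision[np.searchsorted(recall, i / 10):].max()
                    for i in range(11)])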

@amueller

Member

amueller commented Sep 9, 2016

By "none" I meant what you did, i.e. step-wise interpolation to the left. it's not really interpolation, though.
I'm not sure what you mean by the "ignoring recall". I meant the wikipedia definition of average precision. That's what you implemented, and that's what I think the default should be.

I hope I'll have some plots later today to illustrate.

@ndingwall

Contributor

ndingwall commented Sep 9, 2016

Oh, that makes sense. I call it 'interpolated' because I'm thinking of it from a graphical perspective, but I see that the wiki article comes to the same calculation without explicitly considering the curve.

I'll recode it so that the Wikipedia definition ($\operatorname{AveP} = \sum_{k=1}^n P(k) \Delta r(k)$) is the default, and I'll add an interpolation='eleven_point' option instead (and point to the IR book, which has a good section on that).

Presumably I'll need to change some tests since we'll be changing the default behavior. Is there a best-practice way to do that?
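
For reference, that default computation, $\operatorname{AveP} = \sum_{k=1}^n P(k) \Delta r(k)$, only takes a few lines (an illustrative sketch, not the code in this PR):

import numpy as np
from sklearn.metrics import precision_recall_curve

def step_average_precision(y_true, y_score):
    # Weight each precision value by the increase in recall since the
    # previous operating point (no interpolation between points).
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # recall is returned in decreasing order; reverse both arrays so that
    # recall is ascending, starting from the (precision=1, recall=0) point.
    precision, recall = precision[::-1], recall[::-1]
    return np.sum(np.diff(recall) * precision[1:])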

@amueller

Member

amueller commented Sep 9, 2016

check which tests fail, fix them. add a test that checks that the current behavior is correct, not like the previous one, possibly using one of the reported bugs.
then add a section to bugfixes in whatsnew that points out the change in the behavior.
Usually we don't make backward incompatible changes, but I would consider this a bugfix.
I would really love to also see a visual comparison of the different methods as an example (with how to do it and how not to do it). Not sure I find the time to do that.
Also see #7372.
Currently the plots of precision_recall_curve use plt.plot which basically plots linear interpolation, which is a bit odd.
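
On that last point, the difference is easy to see by plotting the same curve both ways (a sketch with toy data, not taken from the example under discussion):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, _ = precision_recall_curve(y_true, y_score)

# plt.plot draws straight segments between operating points (implied linear
# interpolation); plt.step shows the step-wise reading of the curve.
plt.plot(recall, precision, alpha=0.3, label='plt.plot (linear segments)')
plt.step(recall, precision, where='post', label='plt.step (step-wise)')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()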

@amueller

Member

amueller commented Sep 14, 2016

please also check #4936 to see if you can reuse any of that.

@amueller amueller closed this Sep 14, 2016

@amueller amueller reopened this Sep 14, 2016

@amueller

Member

amueller commented Sep 14, 2016

sorry accidental close.

@amueller

Please add a non-regression test. Ideally it would be great if you could have a test against the output of the Pascal VOC code.

sklearn/metrics/ranking.py
- Uses a step function to interpolate between operating points.
+ interpolation : None (default), or string ['eleven_point']
+
+ ``None``:

@amueller

amueller Sep 16, 2016

Member

Can you check how this renders / whether sphinx spits out warnings?

@ndingwall

ndingwall Sep 20, 2016

Contributor

Seems to be okay.
[screenshot of the rendered docstring, 2016-09-20]

sklearn/metrics/ranking.py
- Linearly interpolates between operating points.
- ``'step'``:
- Uses a step function to interpolate between operating points.
+ interpolation : None (default), or string ['eleven_point']

@amueller

amueller Sep 16, 2016

Member

why a list?

@ndingwall

ndingwall Sep 16, 2016

Contributor

I was just copying the style from other places that include acceptable strings in a list. But they had multiple options so I guess that made more sense. Should it read `interpolation : None (default), or string 'eleven_point'` or `interpolation : None (default), or string ['eleven_point', ]`?

@amueller

amueller Sep 16, 2016

Member

the first, I think.

@ndingwall

ndingwall Sep 20, 2016

Contributor

Done. See the screenshot below.

Returns
-------
average_precision : float
References
----------
.. [1] `Wikipedia entry for the Average precision
- <https://en.wikipedia.org/wiki/Average_precision>`_
+ <http://en.wikipedia.org/wiki/Average_precision>`_
+ .. [2] `Stanford Information Retrieval book

@amueller

amueller Sep 16, 2016

Member

Maybe add the Pascal VOC paper?

@amueller

amueller Sep 16, 2016

Member

yes

@ndingwall

ndingwall Sep 20, 2016

Contributor

Done.

sklearn/metrics/ranking.py
+ ``'eleven_point'``:
+ For each of the recall values, r in {0, 0.1, 0.2, ..., 1.0},
+ compute the arithmetic mean of the first precision value with a
+ corresponding recall ≥ r. See the `Stanford Information Retrieval

@amueller

amueller Sep 16, 2016

Member

the >= symbol is probably not a good idea here (you could declare an encoding for the file but I think just writing >= is better).

@ndingwall

ndingwall Sep 16, 2016

Contributor

Done.

@ndingwall

Contributor

ndingwall commented Sep 16, 2016

@amueller is there a Python implementation of Pascal VOC? The Matlab implementation is in this comment, but I can't find a Python version. I'd write one, but it's likely to be identical to the implementation here, so that's probably not helpful for testing purposes!

@amueller

Member

amueller commented Sep 16, 2016

There's no python version. Would you mind running the matlab version? There could be subtle differences.
Running it with octave should be fine. If that doesn't work / is too much work, we might find someone who has matlab, or not bother.

@ndingwall

Contributor

ndingwall commented Sep 16, 2016

I don't have either matlab or octave. If you/someone else can run it and paste the y_true and y_score they used into this thread, I'd be happy to add a test.

@ndingwall

Contributor

ndingwall commented Sep 16, 2016

There are a couple of problems in the test_ranking.py file.

  1. _test_precision_recall_curve uses auc on the precision & recall values as the baseline for average_precision_score. That means my implementation fails the test. Is there a sensible alternative to auc that I should compare against? Perhaps a slow implementation that more obviously follows the wiki article:

    precision, recall, threshold = precision_recall_curve(y_true, y_score)
    precision = list(reversed(precision))
    recall = list(reversed(recall))
    average_precision = 0
    for i in range(1, len(precision)):
        average_precision += precision[i] * (recall[i] - recall[i - 1])
    
  2. The alternative implementation of average precision suffers from a similar problem to the original average_precision_score implementation. In the edge case of identical predictions, the ordering of the documents matters.

    For example:

    _average_precision(np.array([1, 0, 0, 0]), 
                       np.ones(4))
    

    results in an average_precision of 0.25, while

    _average_precision(np.array([0,0,0,1]), 
                       np.ones(4))
    

    results in 1.0.

    This doesn't stop it passing the test, since it works whenever the elements of y_score are mostly unique, and that implementation is only used to test against auc, and not average_precision_score. Which is kind of weird in itself! But this raises a question: should the test implementation approximate the actual implementation for all inputs? If that's the case, then we should replace it with the (inefficient) implementation above. Or should it just work for a specified test case? In which case I'll leave it as it is.

@amueller

Member

amueller commented Sep 16, 2016

  1. slow and obvious implementation sounds good.
  2. If the test currently works, I'd say leave it and comment that the alternative implementation only works in this edge-case.
@amueller

Member

amueller commented Sep 16, 2016

@ogrisel @jnothman @GaelVaroquaux I would kinda like this in the final version as our current average precision is not great :-/

@ndingwall would you be interested in adding a plot of a PR curve comparing the 11-point interpolation, the IR "skip dips", the "linear interpolation" we had before and the "left step interpolation" you're implementing here?
I'd rather get this merged sooner and maybe backport to 0.18 than waiting for an example, though if you have extra time on hand, I'd love to see the plot in the documentation.

@ndingwall

Contributor

ndingwall commented Sep 16, 2016

@amueller I might be able to find some time over the weekend to amend the example. I started it (see #7372) so perhaps we should discuss it there? I'm reluctant to include the "skip dips" part because that's going to require a load of new code, and I'm yet to see a justification for what seems to me (and to you as well, I think) like cheating.

@amueller

Member

amueller commented Sep 16, 2016

cool let's see what the CIs say. I should do something else now, though and might not get back to you on the matlab thing soon (the final final version of my book is due on monday and I have a keynote next week)

+ recall = recall[-2::-1]
+
+ return np.mean([precision[i:].max() for i in
+ np.searchsorted(recall, np.arange(0, 1.1, 0.1))])

@GaelVaroquaux

GaelVaroquaux Apr 24, 2017

Member

It would be good to move those two functions outside the enclosing function. In general, functions defined inside functions are not great style, as this hinders reuse and testing.

@ndingwall

ndingwall May 16, 2017

Contributor

I put the second function there to reflect the existing code which has _binary_average_precision inside the average_precision_score function. I'm happy to move them out, but they're both very specific to average precision so I doubt they'd ever be reused! Would you prefer them at the module level (prefixed with _) anyway?

@varunagrawal

varunagrawal Feb 25, 2018

Contributor

@ndingwall I believe having them at the module level or at metrics.base would be preferred.

@GaelVaroquaux

Member

GaelVaroquaux commented Apr 24, 2017

I've made a bunch of minor comments. Overall, I think that this PR is good and that it is very important to merge it (one of my students just hit the corresponding bug :D ).

The only aspect that I feel uneasy about is the example: it was just made much more complicated.

There is also one failing test, due to numerical instability. It should be corrected.

Finally (and that's quite important), a clear note should be made in the docs/whats_new.rst file to stress the change and explain it. Some users will be very surprised when they update scikit-learn and find that they get different scores.

@ndingwall: do you think that you can address those comments? And I'll merge your PR. Thanks a lot!

@ndingwall

Contributor

ndingwall commented May 3, 2017

@GaelVaroquaux thanks for the comments! I should be able to get to them this week.

I agree about the example - perhaps we could split it into two: one for the binary case and one for multi-class? I feel like the current one is complicated partly because it's trying to cram so much in! On the other hand, having multiple examples for a single function is probably confusing in itself. I can't remember the justification for reversing the precision and recall lists, but I'll try and rework it without that either way.

@ndingwall

Contributor

ndingwall commented May 3, 2017

@amueller I realize I didn't respond to your question before:

So are they actually cheating by picking the best points?

That sums up how I feel about it, but there may be some subtlety to the justification that I haven't grasped. In the IR book they say

The justification is that almost anyone would be prepared to look at a few more documents if it would increase the percentage of the viewed set that were relevant (that is, if the precision of the larger set is higher).

but (as I think we discussed before), you'd only be prepared to look at them if you already knew they were relevant. If your classifier isn't giving you that information (and it wouldn't, since it must have assigned them lower relevance than the things it's already shown you), it doesn't seem reasonable to use that information in evaluating the classifier.

@GaelVaroquaux

Member

GaelVaroquaux commented May 3, 2017

@ndingwall

Contributor

ndingwall commented May 16, 2017

@GaelVaroquaux, I finally got to addressing your comments!

I ran nosetests and everything seems to be passing. I get a load of warnings about metrics/classification.py and a couple of DeprecationWarnings but the average precision tests are passing for me. Is the failing test the one I refer to in this comment?

I wasn't sure where to add the changes in whats_new.rst so I created a 0.18.2 section. I can move them if this will be merged into a different version.

I followed your advice to split the example into sections, but I decided to move eleven-point average precision entirely into its own section. This both avoids confusing it with the original average precision computation and provides a much clearer explanation of eleven-point AP. I printed out a string table which seems a bit ugly but I think helps to explain the algorithm. Let me know if I should remove it or change it to something else.

@amueller

Member

amueller commented May 18, 2017

@ndingwall I'm back! Tests are failing.

For testing the 11-point average, I would probably like to test against actual outcomes of the Pascal VOC code, since that is de-facto the definition of this measure (because defining scientific evaluation measures with matlab code is such a good idea!).

@@ -6,6 +6,34 @@
Release history
===============
+Version 0.18.2

@amueller

amueller May 18, 2017

Member

This version will not happen, we'll do 0.19 soonish?

@ndingwall

Contributor

ndingwall commented May 18, 2017

@amueller The CircleCI output seems to stop with this error, so I'm not seeing the output of the tests, but I've had a problem before where my local results didn't match what CircleCI was reporting.

Exception occurred:
  File "/home/ubuntu/scikit-learn/doc/sphinxext/sphinx_gallery/docs_resolv.py", line 48, in _get_data
    with open(url, 'r') as fid:
IOError: [Errno 2] No such file or directory: '/home/ubuntu/scikit-learn/doc/_build/html/stable/modules/generated/sklearn.feature_selection.SelectKBest.rst.html'

For testing the 11-point average, I would probably like to test against actual outcomes of the Pascal VOC code, since that is de-facto the definition of this measure

Do you have an example? I don't have Matlab so can't generate any.

@jnothman

Member

jnothman commented May 24, 2017

I wasn't sure where to add the changes in whats_new.rst so I created a 0.18.2 section. I can move them if this will be merged into a different version.

If you merge with current master, you'll not only have an opportunity to fix current conflicts, but find that there is a section for 0.19

@jnothman

I've not reviewed it all, but I don't see how we can make a fundamental change to basically every evaluation of a metric without having a deprecation cycle or something even more drastic like giving the new implementations new function names. People have, no doubt, reported experimental results with the incumbent implementation and will find their measures suddenly and inexplicably changed on update. I don't think that's okay.

I'm pretty sure I recall an earlier version of this maintaining current behaviour, and I think we need to keep that, with a warning, for some time; or else warn loudly that it has changed but instruct the user how to silence the warning.

+/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_. In the example
+below, the eleven precision values are indicated with an arrow to pointing to
+the best precision possible while meeting or exceeding the desired recall.
+Note that it's possible that the same operating point might correspond to

@jnothman

jnothman May 24, 2017

Member

This is in contradiction to the definition of operating point above. Do you simply mean that precision may be identical for multiple recall values?

@ndingwall

ndingwall May 25, 2017

Contributor

No, it's that there's no requirement that the eleven operating points chosen must be distinct: e.g. the points corresponding to recalls of 0.2, 0.3 and 0.4 might all be the same operating point at e.g. (0.45, 0.6) if 0.6 is the best precision you can get with a recall of at least 0.2.

+ - :func:`metrics.ranking.average_precision_score` no longer linearly
+ interpolates between operating points, and instead weights precisions
+ by the change in recall since the last operating point, as per the
+ `Wikipedia entry <http://en.wikipedia.org/wiki/Average_precision>`_.

@jnothman

jnothman May 24, 2017

Member

I think you need to cite a more stable reference; at a minimum, a versioned Wikipedia page.

@ndingwall

ndingwall May 25, 2017

Contributor

👍🏻

@amueller

Member

amueller commented Jun 6, 2017

Can you maybe fix the conflicts?

@amueller

Looks good apart from the minor nitpick, I think. Trying to get @agramfort to dig up his matlab.

- 0.79...
+ 0.83...
+
+ >>> yt = np.array([0, 0, 1, 1])

@amueller

amueller Jun 6, 2017

Member

you could also just use the y_true and y_score variables defined above, instead of redefining them here?

@amueller

Member

amueller commented Jun 6, 2017

I didn't realize Pascal VOC uses the standard AP, not 11 point. Is there a reference implementation for the 11 point somewhere?

@agramfort

Member

agramfort commented Jun 6, 2017

I did:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
np.savetxt('X.txt', X[:, :1])
np.savetxt('y.txt', y)

then in matlab using VOC code:

load X.txt
load y.txt

mask = y < 2;
X = X(mask, :);
y = y(mask, :);
y(y == 0) = -1;
out = X;
gt = y;

[so,si]=sort(-out);
tp=gt(si)>0;
fp=gt(si)<0;

fp=cumsum(fp);
tp=cumsum(tp);
rec=tp/sum(gt>0);
prec=tp./(fp+tp);

ap=VOCap(rec,prec)

returns

ap =

    0.9355
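
(For anyone without Matlab or Octave, the scikit-learn side of this comparison can be reproduced on the same data as follows; a sketch only, with no claim about whether the numbers should agree, which is exactly what is being checked here.)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import average_precision_score

iris = load_iris()
X, y = iris.data, iris.target
mask = y < 2
# same setup as above: first feature as the score, classes 0/1 only
print(average_precision_score(y[mask], X[mask, 0]))
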
@GaelVaroquaux

Member

GaelVaroquaux commented Jun 6, 2017

In the interest of getting this merged while the heat of the sprint lasts, I am taking over this PR and finishing up the last details.

@ndingwall

Contributor

ndingwall commented Jun 6, 2017

@amueller The Pascal VOC documentation is very clear that they're using 11-point AP in their evaluation (top right of page 11). It's not dated so this might have changed - do you have the actual Matlab code used in their evaluation?

Will you or @GaelVaroquaux respond to @jnothman's comment above?

@GaelVaroquaux

Member

GaelVaroquaux commented Jun 6, 2017

I want to heavily rewrite the example. It's gotten way too complicated for demoing the average precision. I want to try to break it down more (and maybe do slightly less pretty plots).

@ndingwall

Contributor

ndingwall commented Jun 6, 2017

@amueller, it looks like they're using interpolated average precision, just not 11-point interpolated. This loop propagates the best precision backwards iteratively from the highest recall value.

for i=numel(mpre)-1:-1:1
    mpre(i)=max(mpre(i),mpre(i+1));
end

But either way, this contradicts the Pascal VOC documentation I've linked to in the docstrings! Do you know if this was updated somewhere?
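
For reference, a rough NumPy transcription of that VOCap routine (written from the loop quoted above and the surrounding Matlab; treat it as an approximation, not a verified port):

import numpy as np

def voc_ap(recall, precision):
    # Pad the curve, propagate the best precision backwards from the highest
    # recall (the loop quoted above), then sum the areas of the rectangles
    # where recall changes.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0] + 1
    return np.sum((mrec[idx] - mrec[idx - 1]) * mpre[idx])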

@ndingwall

Contributor

ndingwall commented Jun 6, 2017

@GaelVaroquaux: agreed. It's hard to find the balance between providing examples of all functionality, and avoiding unnecessary complication. You're better placed than me to strike that balance! In any case, it looks like we might be able to abandon 11-point interpolated AP, although we might have to replace it with interpolated AP which suffers the same weird peeking-at-the-test-set-performance problem that 11-point IAP does.

@GaelVaroquaux

Member

GaelVaroquaux commented Jun 7, 2017

So, it seems that on different datasets we do not reproduce the Pascal VOC measure well. I am not sure why.

@ndingwall

Contributor

ndingwall commented Jun 7, 2017

@GaelVaroquaux, that's probably because I based the code on this documentation, which doesn't correspond to their evaluation code. This calls for 11-point interpolated AP, but the code does straight interpolated AP (no fixing of recall values). I could probably amend the code to reflect their code, but it'd be good to also update the link to the documentation. I just tried searching for an updated version of the documentation but the host.robots.ox.ac.uk website seems to be down at the moment...

@GaelVaroquaux

Member

GaelVaroquaux commented Jun 7, 2017

@ndingwall : do you mean that it should match our "interpolation=None"? That's not what I am observing :$

@amueller

Member

amueller commented Jun 7, 2017

no, interpolation=None is pessimistic, while I think theirs is optimistic (aka cheating)

@ndingwall

Contributor

ndingwall commented Jun 7, 2017

@GaelVaroquaux - as @amueller said, theirs does use interpolation, just not the 11-point variety. We don't have an implementation that matches their evaluation code, but it should be easy to write.

@GaelVaroquaux

Member

GaelVaroquaux commented Jun 7, 2017

@ndingwall : understood. I confirm looking at their code.

It doesn't seem to be useful to try to replicate it.

@ndingwall

Contributor

ndingwall commented Jun 7, 2017

@GaelVaroquaux, the interpolated version is requested in #4577. I don't mind removing it (and the 11-point version) because, like @amueller, I consider them both to be cheating! But we could argue that the code should reflect what the community uses rather than what we think is best...

@GaelVaroquaux

Member

GaelVaroquaux commented Jun 9, 2017

Closing in favor of #9091

@amueller amueller removed this from PR phase in Andy's pets Jul 21, 2017

@varunagrawal

Contributor

varunagrawal commented Feb 25, 2018

@GaelVaroquaux you have closed this issue and stated in #9091 that you won't be finishing that PR. Can we please re-open and resolve this issue?

@amueller

@jnothman

Member

jnothman commented Feb 25, 2018

@varunagrawal

Collation of related issues in order to better improve the PR.

+
+ else:
+ raise ValueError("interpolation has to be one of "
+ "(None, 'eleven_point').")

@varunagrawal

varunagrawal Feb 25, 2018

Contributor
  1. We also need an option for the VOC variant. @ndingwall you mentioned that you could write it easily. Can you please add it in?
  2. It would be nice to abstract the _binary_*_average_precision functions into a variable and have a single call to _average_binary_score. It would help make the code easier to read.
+ # We need the recall values in ascending order, and to ignore the first
+ # (precision, recall) pair with precision = 1.
+ precision = precision[-2::-1]
+ recall = recall[-2::-1]

@varunagrawal

varunagrawal Feb 25, 2018

Contributor

I think it would make sense to fix line 470 in precision_recall_curve instead of doing this. Please refer to #8280
