[WIP] enhance `labels` and deprecate `pos_label` in PRF metrics #2610

jnothman commented Nov 24, 2013

This intends to make the parameter labels clearly defined for the precision_recall_fscore_support family of metrics. This amounts to a (partial) fix for #1983, #1989, #2029, #2094, #3122. As implied by the comment I have committed, labels will:

  • determine the sort order when returning a result for each class
  • determine which labels will be included in an average (allowing one or more negative/ignored classes in multiclass classification)
  • by default, ~~all labels are used, with no special handling of binary~~ the current special handling of binary classification is preserved, but pos_label is determined implicitly.
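
The intended semantics above can be sketched with a stdlib-only toy (function names here are mine, not the eventual API): per-class results come back in the order given by `labels`, and the macro average is restricted to the listed labels, so any class left out of `labels` is effectively an ignored/negative class:

```python
from collections import Counter

def prf_per_label(y_true, y_pred, labels):
    """Per-label precision/recall/F1, reported in the order of `labels`."""
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    pred = Counter(y_pred)
    true = Counter(y_true)
    out = []
    for lab in labels:
        p = tp[lab] / pred[lab] if pred[lab] else 0.0
        r = tp[lab] / true[lab] if true[lab] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        out.append((p, r, f))
    return out

def macro(y_true, y_pred, labels):
    """Macro-average restricted to `labels`; omitted classes are ignored."""
    rows = prf_per_label(y_true, y_pred, labels)
    return tuple(sum(col) / len(rows) for col in zip(*rows))

y_true = ['cat', 'dog', 'bird', 'cat']
y_pred = ['cat', 'cat', 'bird', 'dog']
# averaging over a subset treats 'dog' as an ignored class
print(macro(y_true, y_pred, ['cat', 'bird']))
```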

There are some potential issues in the deprecation process, and in making binary classifier metrics available as Scorers...

  • Define and document intended behaviour
  • Tests for new labels functionality
  • Implement new labels functionality
  • Ensure labels is set correctly for legacy functionality
  • Deprecation warnings for pos_label
  • Some solution for scorers, at least handling the binary case
  • Copy documentation to derived functions
  • Update narrative documentation
  • Update what's new
  • Open issue for similar behaviour in average_precision_score, roc_auc_score and any other binary-averaged metrics.

This should not be merged until after the 0.15 release to allow at least one release with the warning merged from #2952

Coverage Status

Coverage remained the same when pulling a852787 on jnothman:deprecate_pos_label into f63e5fd on scikit-learn:master.


arjoly commented Nov 25, 2013

Thanks for working on those issues!!!


jnothman commented Nov 25, 2013

> Thanks for working on those issues!!!

np! The fiddliness of pos_label and average has annoyed me for a long time.

But there is at least one problem. Consider the following parameters in the current code:

  • a binary classification dataset (e.g. y_true=[-1, 1, 1], y_pred=[1, 1, -1])
  • pos_label=1 by default
  • average='macro' (or at least not None)
  • labels=[-1, 1] is set explicitly

This was treated as the special binary case, i.e. the metric was returned only for the positive label (P, R, F = 1/2). Because labels is set explicitly, this PR would instead want to return the macro-average over all labels (P, R, F = 1/4). However, it would be unusual for labels to have been set explicitly, because it does not affect the binary special case.
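A quick stdlib check of the two competing answers for this input (the helper is mine, not library code):

```python
def prf(y_true, y_pred, label):
    """Precision/recall/F1 for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
    pred = sum(1 for p in y_pred if p == label)
    true = sum(1 for t in y_true if t == label)
    prec = tp / pred if pred else 0.0
    rec = tp / true if true else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

y_true, y_pred = [-1, 1, 1], [1, 1, -1]
binary = prf(y_true, y_pred, 1)              # positive class only: (0.5, 0.5, 0.5)
rows = [prf(y_true, y_pred, lab) for lab in (-1, 1)]
macro = tuple(sum(col) / 2 for col in zip(*rows))  # over both labels: (0.25, 0.25, 0.25)
```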

I can think of the following deprecation options for this weird case:

  1. delay this PR for a couple of versions, in the meantime warning that behaviour will change in the future
  2. provide only the new functionality, with a warning that the behaviour has changed
  3. provide the old functionality for a couple of versions, but it's not clear how to allow the user to override this behaviour
  4. change the labels parameter name in the new version, and deprecate the old one.

Also, I'm considering changing the default to keep the current automatic handling of binary targets (but with an inferred pos_label). The main benefit is that we wouldn't need to change the current scorers, but it still has some problems with implicit behaviour.

Coverage Status

Coverage remained the same when pulling 31eafdd on jnothman:deprecate_pos_label into f63e5fd on scikit-learn:master.


jnothman commented Nov 25, 2013

If someone has a moment to review this -- as an idea if not code -- I am very eager to hear comment on the deprecation edge-case I described in my previous comment.

I'm also interested in whether we should maintain special handling of binary targets as a default (not just during deprecation), even if it provides confusing implicit behaviour (e.g. #2094), for convenience.


arjoly commented Dec 17, 2013

I don't have any strong opinion. My only fear is keeping the behaviour simple while staying compatible with both binary and multi-class scorers.

  1. delay this PR for a couple of versions, in the meantime warning that behaviour will change in the future

I have seen numpy using this strategy.
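
The warn-then-change strategy mentioned above could look roughly like this (a hypothetical sketch using the stdlib warnings module, not the actual scikit-learn code):

```python
import warnings

def macro_or_binary(y_true, y_pred, labels=None):
    """Sketch of the warn-then-change strategy: keep the old binary
    special case for now, but emit a FutureWarning when the inputs hit
    the case whose result will change in a later release."""
    classes = sorted(set(y_true) | set(y_pred))
    if labels is not None and len(classes) == 2:
        warnings.warn(
            "labels was passed with binary input; a future release will "
            "average over all given labels instead of reporting only the "
            "positive class.", FutureWarning)
    # ... old behaviour continues here unchanged ...
    return classes

# demo: the ambiguous call is flagged but still answered the old way
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    macro_or_binary([-1, 1, 1], [1, 1, -1], labels=[-1, 1])
assert caught[0].category is FutureWarning
```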


jnothman commented Dec 17, 2013

I've decided that to make things more explicit, and to enable some features of this PR, we should provide scorers named 'binary_f1', 'micro_f1', 'macro_f1' and 'weighted_f1'. I might throw together a PR shortly.
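
One stdlib-only sketch of what such named scorers could amount to (the real change would wrap f1_score via make_scorer; the implementation below is a minimal stand-in, and the scorer names are the proposal above, not an existing API):

```python
from functools import partial

def f1(y_true, y_pred, average):
    """Minimal stand-in for f1_score supporting 'micro' and 'macro'."""
    labels = sorted(set(y_true) | set(y_pred))

    def counts(lab):
        tp = sum(t == p == lab for t, p in zip(y_true, y_pred))
        return tp, sum(p == lab for p in y_pred), sum(t == lab for t in y_true)

    if average == 'micro':
        tp, pred, true = map(sum, zip(*(counts(lab) for lab in labels)))
        p = tp / pred if pred else 0.0
        r = tp / true if true else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0
    scores = []
    for lab in labels:
        tp, pred, true = counts(lab)
        p = tp / pred if pred else 0.0
        r = tp / true if true else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(scores) / len(scores)  # macro

# one named scorer per averaging mode, so nothing is inferred from y
SCORERS = {'micro_f1': partial(f1, average='micro'),
           'macro_f1': partial(f1, average='macro')}
```

With named scorers the averaging choice is explicit in the grid-search configuration, rather than switched by the number of classes in the data.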


jnothman added commits to jnothman/scikit-learn that referenced this pull request on Mar 9, 2014:

  • MAINT warn of future behaviour change proposed in #2610 (9e6e892)
  • MAINT warn of future behaviour change proposed in #2610 (63595d5)
  • MAINT warn of future behaviour change proposed in #2610 (28c8d0b)

jnothman added this to the 0.16 milestone Mar 11, 2014

maheshakya added commits (with jnothman) to maheshakya/scikit-learn that referenced this pull request on Apr 15, 2014:

  • MAINT warn of future behaviour change proposed in #2610 (c27e75f)
  • MAINT warn of future behaviour change proposed in #2610 (d9d0e76)

amueller commented Jan 27, 2015

Will this fix #3123? Also: can I help?


jnothman commented Jan 27, 2015

I'll try to get this to MRG soon; it shouldn't be a big deal. I'm not sure it will intentionally fix #3123, but I can make a point of doing so, so as not to create conflicts.


amueller commented Jan 27, 2015

Thanks, let me know if you feel ready for review.


jnothman commented Feb 1, 2015

@amueller, @arjoly and anyone else interested:

I've recalled that this was somewhat muddied by #2679 being altered to have an average='binary' mode. While that mode intends to raise an error on multiclass/multilabel data in the future, the opposite is not true: the number of distinct class labels in y_true/y_pred still switches the mode of operation (i.e. with pos_label is not None and average is not None, binary targets always get the measure of a single positive class), which I think is pretty bad. In the future, average=anything-other-than-binary should apply strictly, without reference to the number of input classes.

I think that stopping the special handling of binary data needs to happen before pos_label can be deprecated: I don't have a nice way of asking someone who provides binary targets with pos_label=None and average='macro' (with or without a list of labels) to correct their code to make it future-compatible.

So I propose that, at least for now, we keep pos_label. This should disrupt fewer users, although leave the API still a bit messy in the near future.

At the same time, recall this is not just an API fix, but supports an additional (useful) case wherein labels can specify a strict subset of present labels, thus treating some as an ignored class.

Rather, this PR or its successor should:

  • deprecate towards banning pos_label=None when average='binary'
  • deprecate towards stopping special handling of binary data when average != 'binary', pos_label != None
  • extend the functionality of labels when average != 'binary'

Thus in the future, pos_label will be exclusively used when average='binary', and labels will be used otherwise.
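
That contract could be sketched as follows (purely illustrative, not the eventual implementation; the function name is mine):

```python
def resolve_scored_labels(average, pos_label=1, labels=None):
    """Sketch of the proposed contract: pos_label matters only when
    average='binary'; labels drives every other averaging mode."""
    if average == 'binary':
        if pos_label is None:
            raise ValueError("average='binary' requires a pos_label")
        return [pos_label]   # score the positive class only
    # outside the binary case, pos_label is ignored; labels selects and
    # orders the classes to score, and may omit some to ignore them
    return labels            # None would mean "all labels present in y"
```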

(This has taken an alarming effort to reason through, so I hope I've come to correct conclusions! At some point soon I'll try to implement them, although I don't look forward to testing all these possibilities and their backwards-compatibilities.)


jnothman commented Feb 1, 2015

That leads me to split this into two PRs: one that basically finishes up the work of #2679 to handle the converse case; the other to extend and fix the handling of labels in the non-binary case.


jnothman commented Feb 1, 2015

(The other reason I abandoned this PR was because of the splitting of metrics.py which made the rebase a pain!)


jnothman commented Feb 1, 2015

Closing this incarnation.

jnothman closed this Feb 1, 2015
