
[MRG+2] Require explicit average arg for multiclass/label P/R/F metrics and scorers #2679

Merged: 1 commit from jnothman:prf_average_explicit into scikit-learn:master, Dec 9, 2014

Conversation

@jnothman (Member)

In order to avoid problems like #2094, and to avoid people unwittingly reporting weighted average, this goes towards making 'average' a required parameter for multiclass/multilabel precision, recall, f-score. Closely related to #2676.

After a deprecation cycle, we can turn the warning into an error, or make macro/micro default.

This PR also shards the builtin scorers to make the averaging explicit. This avoids users getting binary behaviour when they shouldn't (cf. #2094 where scoring isn't used). I think this is extra important because "weighted" F1 isn't especially common in the literature, and having people report it without realising that's what it is is unhelpful to the applied ML community. This helps, IMO, towards a more explicit and robust API for binary classification metrics (cf. #2610).

It also entails a deprecation procedure for scorers, and adds more public API there: get_scorer and list_scorers.
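For illustration, here is a minimal sketch of the explicit-average style this PR moves towards, for both a metric call and a suffixed scorer string; it is written against current scikit-learn, so the import paths differ from the 2014 layout discussed here.

```python
# Sketch only: explicit `average` for a multiclass metric, and a suffixed
# scorer string instead of a bare 'f1'. Paths follow current scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The averaging scheme is stated explicitly rather than relying on a default.
print(f1_score(y_test, clf.predict(X_test), average="macro"))

# The suffixed scorer name carries the same information for model selection.
print(cross_val_score(clf, X, y, scoring="f1_macro"))
```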

@coveralls

Coverage remained the same when pulling 26ac3cf on jnothman:prf_average_explicit into 6ec2c8b on scikit-learn:master.

@amueller (Member)

I think it is a bit weird that the 'compat' value is not documented and the current default behavior is not explained. I don't have a solution ready, though. Also, it looks like you added ignore_warnings to some tests because of the newly introduced behavior. Shouldn't the tests rather be adjusted to pass an explicit average method? Or did you want to test backward compatibility? I think we should rather try to test the new behavior (or both).

@amueller (Member)

Can you briefly explain why this change is necessary after #2610 is merged?

@jnothman (Member, Author)

Thanks for looking at this, Andy. Responses:

  • Scikit-learn promises sensible default parameters. average='weighted' is not a sensible default in terms of the literature, which is one reason this PR is needed independently of #2610 ([WIP] enhance labels and deprecate pos_label in PRF metrics). Indeed, given this PR, #2610 is less important as a solution for #2094 (f1_score and precision_recall_fscore_support throw an error for some class labels), but it still has other benefits: clearer, enhanced functionality of labels and removal of the confusing pos_label.
  • I'm not sure there's any neater way to do deprecation when you want to check whether someone passed an explicit value, hence 'compat'. But, sure, it can be documented. (A minimal sketch of this pattern follows the list.)
  • The need for ignore_warnings comes in part from the sophisticated invariance testing in the metrics tests, such as METRICS_WITH_AVERAGING relying on the metrics with no average kwarg set existing in ALL_METRICS. There's possibly a nicer way around it, but ignore_warnings seems sensible for invariance tests as long as the warning behaviour is tested elsewhere.
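As referenced in the second point above, here is a minimal sketch of the sentinel-default deprecation pattern; the wrapper name, warning text and wiring are illustrative only, not the PR's actual code.

```python
import warnings

from sklearn.metrics import f1_score


def f1_with_compat_default(y_true, y_pred, average="compat"):
    """Hypothetical wrapper showing the 'compat' sentinel default."""
    if average == "compat":
        # The caller did not choose an averaging scheme: keep the old
        # 'weighted' behaviour for now, but warn so the choice becomes
        # explicit before the default changes.
        warnings.warn(
            "The default average='weighted' is deprecated; pass "
            "average='macro', 'micro', 'weighted' or 'samples' explicitly.",
            DeprecationWarning,
        )
        average = "weighted"
    return f1_score(y_true, y_pred, average=average)


# Emits the DeprecationWarning; an explicit average=... call would not.
f1_with_compat_default([0, 1, 2, 1], [0, 2, 2, 1])
```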

@jnothman (Member, Author)

@arjoly, I'd like it if you could review or comment on this at some point.

@arjoly (Member)

arjoly commented Jan 2, 2014

Do you plan to move the averaging keyword to the third argument? Do you want to remove the default value or set the default value to None?

@jnothman (Member, Author)

jnothman commented Jan 2, 2014

It is essential that we stop the default being 'weighted', and anything that makes the special binary handling more explicit is also important (i.e. if I change my 2-class problem into a 3-class problem, the warning should give me some indication that I'm using a fundamentally different metric).

I don't mind changing average to None by default... but I don't think that needs to be decided now. One reason not to merely remove the default value is that if labels indicates there's only one class that counts, average is unnecessary. Given this, moving average up might be useful, but with pos_label gone, there aren't that many optional parameters anymore.
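A small illustration of that last point, using current scikit-learn: with average=None you get per-class scores, and once labels restricts evaluation to a single class the averaging choice no longer matters.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])

# One F-score per class, no averaging involved.
print(f1_score(y_true, y_pred, average=None))

# With a single label selected, every averaging scheme reduces to that class's score.
print(f1_score(y_true, y_pred, labels=[2], average="macro"))
```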


@arjoly (Member)

arjoly commented Jan 3, 2014

Looks good to merge!
Thanks for your hard work!

@jnothman (Member, Author)

jnothman commented Jan 3, 2014

Thanks for the review, @arjoly

jnothman mentioned this pull request Jan 6, 2014
@@ -20,7 +20,7 @@ (benchmarks/bench_multilabel_metrics.py)

 METRICS = {
-    'f1': f1_score,
+    'f1': partial(f1_score, average='micro'),
@GaelVaroquaux (Member)

I think that the docs (the part that describes the different scoring options http://scikit-learn.org/dev/modules/model_evaluation.html#common-cases-predefined-values ) should be updated to stress this.

@jnothman (Member, Author)

I think I should just merge this PR with #2676, which focuses on that specifically. It doesn't entirely make sense without it.


(Member)

Note to self (and other reviewer) this merge has been done.

@jnothman (Member, Author)

I've rebased this on #2676, so that both the metrics and scorers are explicit.

@jnothman (Member, Author)

And that rebase means @arjoly's LGTM no longer applies. If you'd like to review the whole PR, Arnaud, that would be nice ;)

@arjoly (Member)

arjoly commented Jan 21, 2014

Is there a need to make get_scorer and list_scorers public functions? Can we prefix those with an _?

There are also several new constants, such as SCORER_DEPRECATION and msg. By the way, I don't think we need to have every scorer object public, such as r2_scorer.

It would be nice to add an __all__ to the file.

"on your data; '{0}_macro', '{0}_micro' and '{0}_samples' provide "
"alternative multiclass/multilabel averaging.")
for name, metric in [('precision', precision_score),
('recall', recall_score), ('f1', f1_score)]:
(Member)

pep8?

@jnothman (Member, Author)

> Is there a need to make get_scorer and list_scorers public functions? Can we prefix those with an _?

Yes, IMO. If someone wrote their own CV utility, they should be using get_scorer. That's the point: it provides a formal abstraction over a dict lookup so that we can maintain it. list_scorers could be private, but that just means the only way to get the official list of scorers is to trigger an exception, whose message then needs parsing, etc.

I should note that get_scorer already exists, as of a876682 (this needs a rebase).
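For reference, a minimal sketch of how get_scorer is meant to be used by such a CV utility, written against the current public scikit-learn API:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Resolve the scoring string the same way the built-in CV helpers do,
# instead of poking at a private scorer registry.
scorer = get_scorer("f1_macro")
print(scorer(clf, X, y))
```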

> There are also several new constants, such as SCORER_DEPRECATION and msg. By the way, I don't think we need to have every scorer object public, such as r2_scorer.

I agree that there's a lot of unnecessary mess in the module namespace, but I think that's out of this PR's scope.

@jnothman (Member, Author)

Rebased and addressed @arjoly's comments.

@@ -185,6 +185,17 @@ API changes summary
of length greater than one.
By `Manoj Kumar`_.

- `scoring` parameter for cross validatiokn now accepts `'f1_binary'`,
(Member)

validation

@coveralls

Coverage remained the same when pulling f809cd1 on jnothman:prf_average_explicit into fb43369 on scikit-learn:master.

@arjoly (Member)

arjoly commented Jan 21, 2014

LGTM

@jnothman (Member, Author)

Thanks for the feedback! I hope overall that it's the right thing to be doing. Do you think we need pre-packaged scorers for all forms of average, for instance?


@coveralls

Coverage remained the same when pulling 3797cd9 on jnothman:prf_average_explicit into fb43369 on scikit-learn:master.

@coveralls

Coverage increased (+0.01%) when pulling 3797cd9 on jnothman:prf_average_explicit into fb43369 on scikit-learn:master.

@jnothman (Member, Author)

jnothman commented Dec 7, 2014

Rebased again.

I'd really like to see this merged, finally. @mblondel, you expressed distaste for the default 'weighted' scheme for precision/recall/f1. Do you mind taking a look at this API change? Or @ogrisel or @MechCoder or @amueller? I think it would be good to have this merged into master for a while before the next release.

@GaelVaroquaux (Member)

> updated it to conform to what I understood from Gaël. Most particularly, binary classification will continue to work with the un-suffixed f1, etc. scorers.

Thank you!

'well as `average` will average over those '
'labels. For now, please use `labels=None` with '
'`pos_label` to evaluate precision, recall and '
'well as `average!=\'binary\'` will average over '
(Member)

I find that it is more elegant to simply use a double tick (") when the string is defined with simple ticks, and vice versa, rather than protecting the ticks.

Not that it really matters, so don't change anything.

(Member)

Even I think the same way ;)

@GaelVaroquaux (Member)

Looks good to me! 👍 for merge.

Thanks a lot.

GaelVaroquaux changed the title from [MRG+1] Require explicit average arg for multiclass/label P/R/F metrics and scorers to [MRG+2] Require explicit average arg for multiclass/label P/R/F metrics and scorers, Dec 7, 2014
@MechCoder (Member)

I can have a look at this tomorrow (if it hasn't been merged by then already).

@jnothman (Member, Author)

jnothman commented Dec 8, 2014

Thanks Gaël! Now get back to your research proposals! ;)


the set of classes, each of which may be useful in some scenario.
Where available, you should select among these using the ``average`` parameter.

* ``"macro"`` simply calculates calculates the mean of the binary metrics,
(Member)

Why does it have to calculate twice? ;)

@MechCoder (Member)

@jnothman I'm still at a point where I learn more from Pull Requests than Pull Requests learn from me, so sorry if my comments caused more pain (in explaining) than pleasure. ;)

@jnothman (Member, Author)

jnothman commented Dec 8, 2014

There was no problem with the comments. Thanks for catching some small things that must have been hiding for a while. I've squashed those changes in.


@jnothman (Member, Author)

jnothman commented Dec 8, 2014

I'll merge after Travis confirms that I haven't done anything silly.

…ault

Scorers for different average parameters have been added.
jnothman added a commit that referenced this pull request Dec 9, 2014
[MRG] Require explicit average arg for multiclass/label P/R/F metrics and scorers
jnothman merged commit 56ee99c into scikit-learn:master Dec 9, 2014
@arjoly (Member)

arjoly commented Dec 9, 2014

Thanks @jnothman !

@mblondel (Member)

Sorry I am at a conference and didn't have time to review. Glad that this is finally in. Regarding 'weighted', my main concern was just that we should really try to avoid using non-standard stuff as default. On the other hand, there is the question of backward compatibility and it's hard to tell which would be a better default between micro and macro averaging. When the default is used (weighted), we could potentially raise a warning and tell the user to explicitly specify the averaging option. This way, users will be able to correctly report the results they got when writing a paper.
