[MRG+2] Require explicit average arg for multiclass/label P/R/F metrics and scorers #2679

Merged
merged 1 commit into scikit-learn:master from jnothman:prf_average_explicit on Dec 9, 2014


@jnothman
scikit-learn member

In order to avoid problems like #2094, and to avoid people unwittingly reporting a weighted average, this goes towards making 'average' a required parameter for multiclass/multilabel precision, recall and f-score. Closely related to #2676.

After a deprecation cycle, we can turn the warning into an error, or make macro/micro default.

This PR also splits the builtin scorers into per-average variants to make the averaging explicit. This avoids users getting binary behaviour when they shouldn't (cf. #2094, where scoring isn't used). I think this is extra important because "weighted" F1 isn't especially common in the literature, and having people report it without realising that's what it is does the applied ML community a disservice. This helps, IMO, towards a more explicit and robust API for binary classification metrics (cf. #2610).

It also entails a deprecation procedure for scorers, and adds more public API there: get_scorer and list_scorers.
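For concreteness, a minimal sketch of the usage this moves towards (toy data; the suffixed scorer names are the ones this PR adds):

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])

# Multiclass targets now require an explicit averaging choice instead of a
# silent 'weighted' default.
print(f1_score(y_true, y_pred, average='macro'))
print(f1_score(y_true, y_pred, average='micro'))

# Likewise, scorer names spell out the averaging, e.g. scoring='f1_macro',
# 'f1_micro' or 'f1_weighted' in cross-validation and grid search.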

@coveralls

Coverage Status

Coverage remained the same when pulling 26ac3cf on jnothman:prf_average_explicit into 6ec2c8b on scikit-learn:master.

@amueller
scikit-learn member

I think it is a bit weird that the 'compat' value is not documented and the current default behavior is not explained. I don't have a solution ready, though. Also, it looks like you added ignore_warnings to some tests because of the newly introduced behavior. Shouldn't the tests rather be adjusted to give an explicit average method? Or did you want to test the backward compatibility? I think we should rather try to test the new behavior (or both).

@amueller
scikit-learn member

Can you briefly explain why this change is necessary after #2610 is merged?

@jnothman
scikit-learn member

Thanks for looking at this, Andy. Responses:

  • Scikit-learn promises sensible default parameters. average='weighted' is not a sensible default in terms of the literature, which is one reason this PR is needed apart from #2610. Indeed, given this PR, #2610 is less important as a solution for #2094, but still has other benefits (clearer, enhanced functionality of labels and removing the confusing pos_label).
  • I'm not sure if there's any neater way to do deprecation where you want to check whether someone has passed an explicit value, hence 'compat' (see the sketch after this list). But, sure, it can be documented.
  • The need for ignore_warnings comes in part because of the sophisticated invariance testing in metrics, such as METRICS_WITH_AVERAGING relying on the metrics with no average kwarg set existing in ALL_METRICS. There's possibly a nicer way around it; but ignore_warnings seems sensible for invariance tests as long as the warning functionality is tested elsewhere.
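To illustrate the sentinel point above, a minimal sketch of the 'compat' pattern, assuming an f1_score-like function; the name and message are illustrative, not this PR's actual code:

import warnings

def some_prf_metric(y_true, y_pred, average='compat'):
    # 'compat' is only a sentinel to detect that the caller did not pass
    # `average` explicitly; it is never a meaningful averaging mode.
    if average == 'compat':
        warnings.warn("The default `average` will change; please pass an "
                      "explicit value such as 'macro', 'micro' or 'weighted'.",
                      DeprecationWarning)
        average = 'weighted'  # keep the old behaviour during deprecation
    # ... compute and return the score averaged according to `average` ...
    return average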
@jnothman
scikit-learn member

@arjoly, I'd like it if you could review or comment on this at some point.

@arjoly
scikit-learn member

Do you plan to move the averaging keyword to the third argument? Do you want to remove the default value or set the default value to None?

@jnothman
scikit-learn member
@arjoly
scikit-learn member

Looks good to merge!
Thanks for your hard work!

@jnothman
scikit-learn member

Thanks for the review, @arjoly

@jnothman jnothman referenced this pull request Jan 6, 2014
Closed

[MRG] Learning curves #2701

@GaelVaroquaux GaelVaroquaux commented on the diff Jan 18, 2014
benchmarks/bench_multilabel_metrics.py
@@ -20,7 +20,7 @@
METRICS = {
- 'f1': f1_score,
+ 'f1': partial(f1_score, average='micro'),
@GaelVaroquaux
GaelVaroquaux Jan 18, 2014

I think that the docs (the part that describes the different scoring options http://scikit-learn.org/dev/modules/model_evaluation.html#common-cases-predefined-values ) should be updated to stress this.

@jnothman
jnothman Jan 18, 2014
@GaelVaroquaux
GaelVaroquaux Jul 17, 2014

Note to self (and other reviewers): this merge has been done.

@jnothman
scikit-learn member

I've rebased this on #2676, so that both the metrics and scorers are explicit.

@jnothman
scikit-learn member

And that rebase means @arjoly's LGTM no longer applies. If you'd like to review the whole PR, Arnaud, that would be nice ;)

@arjoly
scikit-learn member

Is there a need to make get_scorer, list_scorers public functions? Can we prefix those by an _?

There are also several new constants such as SCORER_DEPRECATION and msg. By the way, I don't think we need to have all the scorer objects public, such as r2_scorer.

It would be nice to add an __all__ to the file.

@arjoly arjoly commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
@@ -287,3 +319,23 @@ def make_scorer(score_func, greater_is_better=True, needs_proba=False,
precision=precision_scorer, recall=recall_scorer,
log_loss=log_loss_scorer,
adjusted_rand_score=adjusted_rand_scorer)
+
+msg = ("The {0!r} scorer has been deprecated and will be removed in version "
+ "0.17. Please choose one of '{0}_binary' or '{0}_weighted' depending "
+ "on your data; '{0}_macro', '{0}_micro' and '{0}_samples' provide "
+ "alternative multiclass/multilabel averaging.")
+for name, metric in [('precision', precision_score),
+ ('recall', recall_score), ('f1', f1_score)]:
@arjoly arjoly and 2 others commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
+ "0.17. Please choose one of '{0}_binary' or '{0}_weighted' depending "
+ "on your data; '{0}_macro', '{0}_micro' and '{0}_samples' provide "
+ "alternative multiclass/multilabel averaging.")
+for name, metric in [('precision', precision_score),
+ ('recall', recall_score), ('f1', f1_score)]:
+ SCORERS.update({
+ name: make_scorer(metric),
+ '{0}_binary'.format(name): make_scorer(partial(metric)),
+ '{0}_macro'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='macro')),
+ '{0}_micro'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='micro')),
+ '{0}_samples'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='samples')),
+ '{0}_weighted'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='weighted')),
@arjoly
arjoly Jan 21, 2014

I would prefer something like "macro-{0}", "binary-{0}", ...

@jnothman
jnothman Jan 21, 2014

I'd originally done this, but for usability I'd rather see them listed together. I considered using - between words, but it would only create confusion given that scorers already exist with underscored names.

@GaelVaroquaux
GaelVaroquaux Jul 17, 2014

I prefer underscored names because they are also valid Python identifiers, which can come in handy at some point.

@arjoly arjoly and 1 other commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
+def get_scorer(scoring=None):
+ """Get a scorer by its name
+
+ Parameters
+ ----------
+ scoring : string or callable
+
+ Returns
+ -------
+ scorer : callable
+ Returns the scorer of the given name if scoring is a string, and
+ otherwise the object passed in.
+ """
+ if isinstance(scoring, six.string_types):
+ if scoring in SCORER_DEPRECATION:
+ warn(SCORER_DEPRECATION[scoring], DeprecationWarning)
@arjoly
arjoly Jan 21, 2014

Instead of having msg, would it be possible to handle all the deprecation stuff here?

SCORER_DEPRECATION could be simplified to a list of names.

@jnothman
jnothman Jan 21, 2014

As in a series of if statements? To what benefit? We still need a list of deprecated scorers to subtract when listing scorers, so that way we would duplicate the information in different places.

But I think what you're asking me is to encapsulate the bits and pieces so that there aren't these global names floating around. I guess I can sort that out.

@arjoly
arjoly Jan 21, 2014

But I think what you're asking me is to encapsulate the bits and pieces so that there aren't these global names floating around. I guess I can sort that out.

+1

@jnothman
jnothman Jan 21, 2014

I've pushed one attempt at this. It still has global SCORERS and SCORER_DEPRECATION, but I don't think a more encapsulated approach (using closures or a singleton class to define get_scorer) is in keeping with scikit-learn style.

@arjoly arjoly and 3 others commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
@@ -256,6 +276,17 @@ def make_scorer(score_func, greater_is_better=True, needs_proba=False,
return cls(score_func, sign, kwargs)
+def list_scorers():
+ """Lists the names of known scorers
+
+ Returns
+ -------
+ scorer_names : list of strings
+ """
+ return sorted(set(SCORERS) - set(SCORER_DEPRECATION))
@arjoly
arjoly Jan 21, 2014

Could you use set methods here instead of the operator? I find it clearer to have an explicit method call.

@jnothman
jnothman Jul 21, 2014

I just noticed this comment. I don't find difference a particularly good method name.

@vene
vene Jul 21, 2014

from set import difference; difference(...) is arguably not great.
import set; set.difference(...) is very readable however.

@jnothman
jnothman Jul 21, 2014

I mean that it's hard to appreciate that it's an asymmetric difference and that the argument is subtracted from the calling object.
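For concreteness, the two spellings being debated are equivalent here (a sketch; the toy dicts stand in for the module-level SCORERS and SCORER_DEPRECATION):

SCORERS = {'f1_macro': None, 'f1_micro': None, 'f1': None}
SCORER_DEPRECATION = {'f1': "deprecation message"}

# operator form, as in the PR
names_op = sorted(set(SCORERS) - set(SCORER_DEPRECATION))
# explicit method form, as suggested
names_method = sorted(set(SCORERS).difference(SCORER_DEPRECATION))
assert names_op == names_method == ['f1_macro', 'f1_micro']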

@jnothman
scikit-learn member

Is there a need to make get_scorer, list_scorers public functions? Can we prefix those by an _?

Yes, IMO. If someone wrote their own CV utility, they should be using get_scorer. That's the point: it provides a formal abstraction over a dict lookup so that we can maintain it. list_scorers could be private, but that just means the only way to get the official list of scorers is to trigger an exception, whose message then needs parsing, etc.

I should note that get_scorer already exists, as of a876682 (this needs a rebase).
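As an illustration of that use case, a sketch of a user-side helper built on get_scorer (assuming the API in this PR, where a scorer is called with the (estimator, X, y) signature):

from sklearn.metrics import get_scorer

def evaluate(estimator, X_test, y_test, scoring='f1_macro'):
    # Resolves a scoring name (or passes a callable through) the same way
    # the built-in cross-validation utilities do.
    scorer = get_scorer(scoring)
    return scorer(estimator, X_test, y_test)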

There are also several new constants such as SCORER_DEPRECATION and msg. By the way, I don't think we need to have all the scorer objects public, such as r2_scorer.

I agree that there's a lot of unnecessary mess in the module namespace, but I think that's out of this PR's scope.

@jnothman
scikit-learn member

Rebased and addressed @arjoly's comments.

@arjoly arjoly commented on an outdated diff Jan 21, 2014
doc/whats_new.rst
@@ -185,6 +185,17 @@ API changes summary
of length greater than one.
By `Manoj Kumar`_.
+ - `scoring` parameter for cross validatiokn now accepts `'f1_binary'`,
+ `'f1_micro'`, `'f1_macro'` or `'f1_weighted'`, deprecating the generic
+ `'f1'`. Similarly, `'precision'` and `'recall'` are deprecated.
@arjoly
arjoly Jan 21, 2014

deprecating ... are deprecated?

@arjoly arjoly commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
+ scorers['{0}_binary'.format(name)] = make_scorer(partial(metric))
+ for average in ['macro', 'micro', 'samples', 'weighted']:
+ averaged= partial(metric, pos_label=None, average=average)
+ scorers['{0}_{1}'.format(name, average)] = make_scorer(averaged)
+
+ # deprecated but available until version 0.17:
+ scorers[name] = make_scorer(metric)
+ deprecation_messages[name] = (msg.format(name))
+
+ return scorers, deprecation_messages
+
+
+SCORERS, SCORER_DEPRECATION = _build_scorers()
+
+
+__all__ = ['make_scorer', 'get_scorer', 'list_scorers', 'check_scoring']
@arjoly
arjoly Jan 21, 2014

Can you put this at the top?

@coveralls

Coverage Status

Coverage remained the same when pulling f809cd1 on jnothman:prf_average_explicit into fb43369 on scikit-learn:master.

@arjoly
scikit-learn member

LGTM

@jnothman
scikit-learn member
@coveralls

Coverage Status

Coverage remained the same when pulling 3797cd9 on jnothman:prf_average_explicit into fb43369 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 3797cd9 on jnothman:prf_average_explicit into fb43369 on scikit-learn:master.

@arjoly
scikit-learn member

Do you think we need pre-packaged scorers for all forms of average,
for instance?

I think that having all metrics as scorers is handy. However, it could overwhelm the user when they have to choose which metric to use. I'm fine with either option: providing only a subset, or all forms of averaging.

An opinion @mblondel ?

@arjoly
scikit-learn member

A rebase is apparently needed. Thanks @jnothman for your hard work!

@jnothman
scikit-learn member

Rebased. Hope Travis is still appeased.

@coveralls

Coverage Status

Coverage remained the same when pulling 8cbb1ca on jnothman:prf_average_explicit into fec2867 on scikit-learn:master.

@arjoly
scikit-learn member

@GaelVaroquaux, this is what we were talking about yesterday.

@GaelVaroquaux
scikit-learn member
@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 17, 2014
doc/modules/model_evaluation.rst
**Classification**
'accuracy' :func:`sklearn.metrics.accuracy_score`
'average_precision' :func:`sklearn.metrics.average_precision_score`
-'f1' :func:`sklearn.metrics.f1_score`
-'precision' :func:`sklearn.metrics.precision_score`
-'recall' :func:`sklearn.metrics.recall_score`
+'f1_binary' :func:`sklearn.metrics.f1_score` with `pos_label=1`
+'f1_micro' :func:`sklearn.metrics.f1_score` micro-averaged
+'f1_macro' :func:`sklearn.metrics.f1_score` macro-averaged
+'f1_weighted' :func:`sklearn.metrics.f1_score` weighted average
+'f1_samples' :func:`sklearn.metrics.f1_score` by multilabel sample
+'precision_...' :func:`sklearn.metrics.precision_score` likewise
+'recall_...' :func:`sklearn.metrics.recall_score` likewise
@GaelVaroquaux
GaelVaroquaux Jul 17, 2014

I don't think that it is a good idea to just remove 'f1', 'precision' and 'recall'. What we might do is keep them (as equivalent to f1_binary), but raise a useful error message if there are more than 2 classes.
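A rough sketch of that suggestion (an illustrative wrapper, not this PR's code): keep the plain 'f1' scorer working on binary targets but fail with a helpful message on anything else.

import numpy as np
from sklearn.metrics import f1_score

def binary_f1_scorer(estimator, X, y_true):
    if len(np.unique(y_true)) > 2:
        raise ValueError(
            "The 'f1' scorer only supports binary targets; use 'f1_macro', "
            "'f1_micro' or 'f1_weighted' for multiclass/multilabel data.")
    # Binary case: score the positive class as before.
    return f1_score(y_true, estimator.predict(X))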

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 17, 2014
sklearn/metrics/scorer.py
@@ -340,3 +366,23 @@ def make_scorer(score_func, greater_is_better=True, needs_proba=False,
precision=precision_scorer, recall=recall_scorer,
log_loss=log_loss_scorer,
adjusted_rand_score=adjusted_rand_scorer)
+
+msg = ("The {0!r} scorer has been deprecated and will be removed in version "
+ "0.17. Please choose one of '{0}_binary' or '{0}_weighted' depending "
@GaelVaroquaux
GaelVaroquaux Jul 17, 2014

Now that should be 0.18 :$

@GaelVaroquaux
scikit-learn member

So after discussing this a bit with @arjoly to get a big-picture view of the problems, here is my take on the PR:

  • In general I am very positive about the new, more explicit scorer names

  • I would really like things to work by default. Many people don't understand the refinements of the differences between various metrics, and don't want to have to make a choice.

  • In terms of defaults, it seems that the rough consensus is that 'macro' is better than 'weighted' for multi-class. The right behavior thus seems to be that if there are only 2 classes, binary is used, and if there are more than 2 classes, macro is used.

  • There should be a paragraph in the docs that gives intuition with regards to the difference between the various multiclass approaches. @arjoly was able to give me a fantastic set of intuitions while sitting on a couch. If we can get this in the docs, it would be great.

@GaelVaroquaux
scikit-learn member

And thanks for working on this!

@jnothman
scikit-learn member

I had rebased and updated given the release overnight without internet access, and hence without seeing your comments... But it now appears to need another rebase and an addressing of your comments.

@jnothman
scikit-learn member

I would really like things to work by default. Many people don't understand the refinements of the differences between various metrics, and don't want to have to make a choice.

I'm okay with the idea that the default average is 'binary', which will throw an error if the user provides non-binary targets. I very strongly object to a default (like the incumbent) that means the special-casing of binary data is done implicitly, such that changing to a 3-class problem results in a completely different metric. This is exactly what happened in some test cases that scikit-learn used to have, and resulted in us testing nonsense. But we would need to keep the current behaviour with a deprecation warning for two releases before making that error anyway. (Making it binary default also affects some decisions in #2610. One benefit of it is that it is easily consistent with ROC AUC score and average precision.)

In terms of defaults, it seems that the rough consensus is that 'macro' is better than 'weighted' for multi-class. The right behavior thus seems to be that if there are only 2 classes, binary is used, and if there are more than 2 classes, macro is used.

To my knowledge, 'weighted' is something no-one has ever heard of outside scikit-learn. However, it's also arguable that micro is the best way to go in multilabel cases, or in a multiclass setting where there is a majority class you want to ignore (at least that is where it has been useful to me). The latter is not yet supported, except by construing it as a multilabel problem where each instance has 0 or 1 labels, but will be if #2610 is merged.

One reason for not letting the function just work by default is that a user reporting this score needs to state which type of averaging was used, or else the reader has to guess. (You can often guess that someone has reported a macro average from their low score on an imbalanced-but-otherwise-easy problem.) And that's bad.

There should be a paragraph in the docs that gives intuition with regards to the difference between the various multiclass approaches.

http://scikit-learn.org/dev/modules/model_evaluation.html#multiclass-and-multilabel-classification attempts to give some of this intuition, but perhaps falls short of making it intuitive.

@jnothman
scikit-learn member

In short, I propose the following future (after deprecation) behaviour: without an explicit average argument, or an explicit suffixed scoring name, binary classification targets will be required for P/R/F.

WDYT?

@arjoly
scikit-learn member

One benefit of it is that it is easily consistent with ROC AUC score and average precision

In that case, we use macro-average by default. But the meaning is equivalent in the binary case since we don't support multi-class.

@jnothman
scikit-learn member
@arjoly
scikit-learn member

I am personally in favour of explicit behaviour. Any other opinion @ogrisel, @amueller, @vene, @mblondel ?

@vene vene and 1 other commented on an outdated diff Jul 20, 2014
sklearn/metrics/metrics.py
@@ -1342,7 +1342,8 @@ def f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted',
If ``average`` is not ``None`` and the classification target is binary,
only this class's scores will be returned.
- average : string, [None, 'micro', 'macro', 'samples', 'weighted' (default)]
+ average : string, [None, 'micro', 'macro', 'samples', 'weighted']
@vene
vene Jul 20, 2014

The default value is not documented for now. This is made particularly ugly by having a meaningful value for None here. Since we're changing the API here, how about deprecating average=None too and adding average='binary'?

@jnothman
jnothman Jul 21, 2014

If we're moving to the behaviour I have proposed, we will probably add average='binary'. In the current state of the PR, there is no default value: we want the user to provide one explicitly.

@vene vene commented on an outdated diff Jul 20, 2014
sklearn/metrics/tests/test_score_objects.py
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LinearSVC(random_state=0)
clf.fit(X_train, y_train)
- score1 = SCORERS['f1'](clf, X_test, y_test)
- score2 = f1_score(y_test, clf.predict(X_test))
- assert_almost_equal(score1, score2)
+
+ for prefix, metric in [('f1', f1_score), ('precision', precision_score),
+ ('recall', recall_score)]:
+
+ score1 = get_scorer('%s_weighted' % prefix)(clf, X_test, y_test)
+ score2 = metric(y_test, clf.predict(X_test), pos_label=None,
+ average='weighted')
@vene
vene Jul 20, 2014

strange indent here and in the similar lines below

@jnothman
scikit-learn member

@GaelVaroquaux wrote:

There should be a paragraph in the docs that gives intuition with regards to the difference between the various multiclass approaches. @arjoly was able to give me a fantastic set of intuitions while sitting on a couch. If we can get this in the docs, it would be great.

Well, scikit-learn.org isn't exactly a couch :) I've put together a gist which (perhaps with a figure) may help illustrate some of the differences between the averaging options, using a toy example. I calculate the metrics once with the majority class included, and again with it excluded by treating it as a 0-or-1 label classification evaluation (#2610 will make this possible without an explicit transformation of the data).

I haven't included average='samples' as it really requires at least one label per true and predicted sample; with the current example, samples-averaged evaluation with pred and true identical produces F1=0.5. And I am of a mind to remove 'weighted' unless someone can tell me what it's for and where it's used!
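In the same spirit as that gist, a small self-contained illustration of how the averaging choices diverge on an imbalanced toy problem (the values in the comments are approximate):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 8 samples of a majority class 0, one each of classes 1 and 2;
# the classifier predicts the majority class every time.
y_true = np.array([0] * 8 + [1, 2])
y_pred = np.zeros(10, dtype=int)

# Per-class F1: class 0 scores well, classes 1 and 2 score 0 (sklearn will
# warn that precision is ill-defined for classes that are never predicted).
print(f1_score(y_true, y_pred, average=None))        # ~[0.89, 0.0, 0.0]
print(f1_score(y_true, y_pred, average='macro'))     # ~0.30: classes weighted equally
print(f1_score(y_true, y_pred, average='weighted'))  # ~0.71: dominated by class 0's support
print(f1_score(y_true, y_pred, average='micro'))     # 0.80: equals accuracy here
print(accuracy_score(y_true, y_pred))                # 0.80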

@arjoly
scikit-learn member

I haven't included average='samples' as it really requires at least one label per true and predicted sample; with the current example, samples-averaged evaluation with pred and true identical produces F1=0.5.

Should we set a default value? Maybe as an option?

@jnothman
scikit-learn member
@arjoly
scikit-learn member

When you execute this code

In [6]: roc_auc_score(np.array([[0, 0], [1, 0]]), np.array([[0, 0], [1, 1]]), average="samples")

You get

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

What I proposed is that instead of raising an error, we should be able to set the score value for that sample to 0 or 1, or to skip this sample in the computation.

@GaelVaroquaux
scikit-learn member
@jnothman
scikit-learn member
@arjoly
scikit-learn member

Anyway, these edge cases are something we have debated before and, if we are to resolve them now, we should do so in another issue.

+1 for another issue

@jnothman
scikit-learn member

I have rebased this PR and updated it to conform to what I understood from @GaelVaroquaux. Most particularly, binary classification will continue to work with the un-suffixed f1, etc scorers.

@jnothman
scikit-learn member

@GaelVaroquaux, @arjoly I have added to this PR a rewrite of the discussion of averaging approaches for multiclass/multilabel calculations based on binary metrics. Please critique!

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 18d1446 on jnothman:prf_average_explicit into 0a7bef6 on scikit-learn:master.

@arjoly
scikit-learn member

Looks good to me, thanks Joel!

@jnothman
scikit-learn member

Rebased. It would be appreciated if @GaelVaroquaux (who last reviewed it before afe2d23) or someone else could give this a final review. And I'd suggest squashing for merge.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling bc4e1cc on jnothman:prf_average_explicit into f37618a on scikit-learn:master.

@jnothman
scikit-learn member

@GaelVaroquaux, even without a review of correctness, could I get your +/-1 on strategy, given your critique above? The strategy is basically to make the precision, recall and f1 scorers and their corresponding metric functions only work for binary problems (after a deprecation period); multiclass/multilabel problems need an explicit average argument or scorer name suffix.

@arjoly arjoly added this to the 0.16 milestone Sep 24, 2014
@jnothman
scikit-learn member

Squashing (commit history at bc4e1cc) and rebasing; and still awaiting a final review. (I'd like this to get into dev for a while so there is feedback before 0.16.)

@jnothman
scikit-learn member

Rebased again.

I'd really like to see this merged, finally. @mblondel, you expressed distaste for the default 'weighted' scheme for precision/recall/f1. Do you mind taking a look at this API change? Or @ogrisel or @MechCoder or @amueller? I think it would be good to have this merged into master for a while before the next release.

@GaelVaroquaux
scikit-learn member
@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Dec 7, 2014
sklearn/metrics/classification.py
warnings.warn('In the future, providing two `labels` values, as '
- 'well as `average` will average over those '
- 'labels. For now, please use `labels=None` with '
- '`pos_label` to evaluate precision, recall and '
+ 'well as `average!=\'binary\'` will average over '
@GaelVaroquaux
GaelVaroquaux Dec 7, 2014

I find that it is more elegant to simply use double quotes (") when the string is delimited with single quotes, and vice versa, rather than escaping the quotes.

Not that it really matters, so don't change anything.

@MechCoder
MechCoder Dec 8, 2014

Even I think the same way ;)

@GaelVaroquaux
scikit-learn member

Looks good to me! 👍 for merge.

Thanks a lot.

@GaelVaroquaux GaelVaroquaux changed the title from [MRG+1] Require explicit average arg for multiclass/label P/R/F metrics and scorers to [MRG+2] Require explicit average arg for multiclass/label P/R/F metrics and scorers Dec 7, 2014
@MechCoder
scikit-learn member

I can have a look at this tomorrow (if it hasn't already been merged by then).

@jnothman
scikit-learn member
@MechCoder MechCoder commented on an outdated diff Dec 8, 2014
doc/modules/model_evaluation.rst
+
+Some metrics are essentially defined for binary classification tasks (e.g.
+:func:`f1_score`, :func:`roc_auc_score`). In these cases, by default
+only the positive label is evaluated, assuming by default that the positive
+class is labelled ``1`` (though this may be configurable through the
+``pos_label`` parameter).
+
+.. _average:
+
+In extending a binary metric to multiclass or multilabel problems, the data
+is treated as a collection of binary problems, one for each class.
+There are then a number of ways to average binary metric calculations across
+the set of classes, each of which may be useful in some scenario.
+Where available, you should select among these using the ``average`` parameter.
+
+* ``"macro"`` simply calculates calculates the mean of the binary metrics,
@MechCoder
MechCoder Dec 8, 2014

Why does it have to calculate twice? ;)

@MechCoder MechCoder commented on an outdated diff Dec 8, 2014
doc/modules/model_evaluation.rst
+ are nonetheless important, macro-averaging may be a means of highlighting
+ their performance. On the other hand, the assumption that all classes are
+ equally important is often untrue, such that macro-averaging will
+ over-emphasise the typically low performance on an infrequent class.
+* ``"weighted"`` accounts for class imbalance by computing the average of
+ binary metrics in which each class's score is weighted by its presence in the
+ true data sample.
+* ``"micro"`` gives each sample-class pair an equal contribution to the overall
+ metric (except as a result of sample-weight). Rather than summing the
+ quotients (i.e. correct out of total) per class, this sums the dividends and
+ divisors that make up the the per-class metrics to calculate an overall
+ quotient. Micro-averaging may be preferred in multilabel settings, including
+ multiclass classification where a majority class is to be ignored.
+* ``"samples"`` does not calculate a per-class measure, instead calculating the
+ metric over the true and predicted classes for each sample in the evaluation
+ data, and returning their (``sample_weight``-weighted) average.
@MechCoder
MechCoder Dec 8, 2014

Is this valid only for multilabel data? (I'm not sure.) If it is, I think it's worth a mention here.

@MechCoder MechCoder commented on an outdated diff Dec 8, 2014
doc/modules/model_evaluation.rst
+Where available, you should select among these using the ``average`` parameter.
+
+* ``"macro"`` simply calculates calculates the mean of the binary metrics,
+ giving equal weight to each class. In problems where infrequent classes
+ are nonetheless important, macro-averaging may be a means of highlighting
+ their performance. On the other hand, the assumption that all classes are
+ equally important is often untrue, such that macro-averaging will
+ over-emphasise the typically low performance on an infrequent class.
+* ``"weighted"`` accounts for class imbalance by computing the average of
+ binary metrics in which each class's score is weighted by its presence in the
+ true data sample.
+* ``"micro"`` gives each sample-class pair an equal contribution to the overall
+ metric (except as a result of sample-weight). Rather than summing the
+ quotients (i.e. correct out of total) per class, this sums the dividends and
+ divisors that make up the the per-class metrics to calculate an overall
+ quotient. Micro-averaging may be preferred in multilabel settings, including
@MechCoder
MechCoder Dec 8, 2014

I'm not sure quotient is the right word here, but I can't think of anything better.

@MechCoder MechCoder and 1 other commented on an outdated diff Dec 8, 2014
examples/text/document_classification_20newsgroups.py
@@ -208,8 +208,8 @@ def benchmark(clf):
test_time = time() - t0
print("test time: %0.3fs" % test_time)
- score = metrics.f1_score(y_test, pred)
- print("f1-score: %0.3f" % score)
+ score = metrics.f1_score(y_test, pred, average='micro')
@MechCoder
MechCoder Dec 8, 2014

Is there any reason why this was explicitly changed from weighted to micro?

@jnothman
jnothman Dec 8, 2014

weighted f1 is not very commonly used in the field. We don't want to encourage it.

But as I presume this is a multiclass problem (with no majority class), it probably makes more sense to use macro, or else just to report accuracy, which should be equivalent to multiclass micro-F1.

So good catch!
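That equivalence is easy to check: for single-label multiclass data, micro-averaged precision, recall and F1 all reduce to accuracy. A quick sketch:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 4, size=200)   # 4-class, single-label targets
y_pred = rng.randint(0, 4, size=200)

assert np.isclose(accuracy_score(y_true, y_pred),
                  f1_score(y_true, y_pred, average='micro'))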

@MechCoder MechCoder commented on the diff Dec 8, 2014
sklearn/metrics/classification.py
@@ -475,7 +475,7 @@ def zero_one_loss(y_true, y_pred, normalize=True, sample_weight=None):
return n_samples - score
-def f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted',
+def f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary',
@MechCoder
MechCoder Dec 8, 2014

Sorry, but I did not look at the discussion: why was raising an error preferred to returning a metric per class (average=None)?

@jnothman
jnothman Dec 8, 2014

Which would you rather debug when upgrading scikit-learn?

@MechCoder MechCoder commented on the diff Dec 8, 2014
sklearn/metrics/classification.py
@@ -822,7 +824,7 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None,
"""
average_options = (None, 'micro', 'macro', 'weighted', 'samples')
- if average not in average_options:
+ if average not in average_options and average != 'binary':
@MechCoder
MechCoder Dec 8, 2014

Why not just add it to the average_options tuple, since it is technically an option now too?

@jnothman
jnothman Dec 8, 2014

No, because average_options is used in the warning message below where binary is inappropriate. However, perhaps binary should now be included in the docstring...

@MechCoder MechCoder and 1 other commented on an outdated diff Dec 8, 2014
sklearn/metrics/scorer.py
log_loss=log_loss_scorer,
adjusted_rand_score=adjusted_rand_scorer)
+
+for name, metric in [('precision', precision_score),
+ ('recall', recall_score), ('f1', f1_score)]:
+ SCORERS.update({
+ name: make_scorer(metric),
+ '{0}'.format(name): make_scorer(partial(metric)),
+ '{0}_macro'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='macro')),
+ '{0}_micro'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='micro')),
+ '{0}_samples'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='samples')),
+ '{0}_weighted'.format(name): make_scorer(partial(metric,
+ pos_label=None,
@MechCoder
MechCoder Dec 8, 2014

You can maybe use another for loop across ['macro', 'micro', 'samples', 'weighted'] to save 6-7 lines of code.

@jnothman
jnothman Dec 8, 2014

Yes, I think it was like that in some version. Can't remember why it changed.

@MechCoder MechCoder commented on the diff Dec 8, 2014
sklearn/metrics/tests/test_score_objects.py
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LinearSVC(random_state=0)
clf.fit(X_train, y_train)
- score1 = SCORERS['f1'](clf, X_test, y_test)
- score2 = f1_score(y_test, clf.predict(X_test))
- assert_almost_equal(score1, score2)
+
+ for prefix, metric in [('f1', f1_score), ('precision', precision_score),
+ ('recall', recall_score)]:
+
@MechCoder
MechCoder Dec 8, 2014

here also I think.

@jnothman
jnothman Dec 8, 2014

It's sometimes useful to leave test loops unrolled (when not using generators), so that the error message is as precise as possible.

@MechCoder
scikit-learn member

@jnothman I'm still at a point where I learn more from Pull Requests than Pull Requests learn from me, so sorry if my comments caused more pain (in explaining) than pleasure ;)

@jnothman
scikit-learn member
@jnothman
scikit-learn member

I'll merge after travis confirms that I haven't done anything silly.

@jnothman jnothman FIX P/R/F metrics and scorers are now for binary problems only by default

Scorers for different average parameters have been added.
081a554
@jnothman jnothman merged commit 56ee99c into scikit-learn:master Dec 9, 2014

1 check was pending: continuous-integration/travis-ci (The Travis CI build is in progress)
@arjoly
scikit-learn member

Thanks @jnothman !

@mblondel
scikit-learn member

Sorry I am at a conference and didn't have time to review. Glad that this is finally in. Regarding 'weighted', my main concern was just that we should really try to avoid using non-standard stuff as default. On the other hand, there is the question of backward compatibility and it's hard to tell which would be a better default between micro and macro averaging. When the default is used (weighted), we could potentially raise a warning and tell the user to explicitly specify the averaging option. This way, users will be able to correctly report the results they got when writing a paper.

@amueller

There is an s missing here, right? Also, the example seems pretty contrived ^^

scikit-learn member

I don't see where the s is missing. I can't say I looked at what the example was doing. Ideally, we shouldn't be showing '_weighted' in any of the examples. I guess it was a lazy solution to completing the PR.

@amueller

That line breaks ^^

scikit-learn member

Whoops.

scikit-learn member

Never mind, fixed it ;)
