[MRG+2] Require explicit average arg for multiclass/label P/R/F metrics and scorers #2679

Merged
merged 1 commit into scikit-learn:master from jnothman:prf_average_explicit on Dec 9, 2014


@jnothman
scikit-learn member

In order to avoid problems like #2094, and to avoid people unwittingly reporting a weighted average, this goes towards making 'average' a required parameter for multiclass/multilabel precision, recall and f-score. Closely related to #2676.

After a deprecation cycle, we can turn the warning into an error, or make macro/micro default.

This PR also splits the builtin scorers into per-average variants to make the averaging explicit. This avoids users getting binary behaviour when they shouldn't (cf. #2094, where scoring isn't used). I think this is extra important because "weighted" F1 isn't especially common in the literature, and having people report it without realising that's what it is does the applied ML community a disservice. This helps, IMO, towards a more explicit and robust API for binary classification metrics (cf. #2610).

It also entails a deprecation procedure for scorers, and adds more public API there: get_scorer and list_scorers.
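For concreteness, a minimal sketch of the usage this moves towards (toy data; the suffixed scorer names are the ones this PR adds):

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])

# Multiclass targets now require an explicit averaging choice instead of a
# silent 'weighted' default.
print(f1_score(y_true, y_pred, average='macro'))
print(f1_score(y_true, y_pred, average='micro'))

# Likewise, scorer names spell out the averaging, e.g. scoring='f1_macro',
# 'f1_micro' or 'f1_weighted' in cross-validation and grid search.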

@coveralls

Coverage Status

Coverage remained the same when pulling 26ac3cf on jnothman:prf_average_explicit into 6ec2c8b on scikit-learn:master.

@amueller
scikit-learn member

I think it is a bit weird that the 'compat' value is not documented and the current default behavior is not explained. I don't have a solution ready, though. Also, it looks like you added ignore_warnings to some tests because of the newly introduced behavior. Shouldn't the tests rather be adjusted to give an explicit average method? Or did you want to test the backward compatibility? I think we should rather try to test the new behavior (or both).

@amueller
scikit-learn member

Can you briefly explain why this change is necessary after #2610 is merged?

@jnothman
scikit-learn member

Thanks for looking at this, Andy. Responses:

  • Scikit-learn promises sensible default parameters. average='weighted' is not a sensible default in terms of the literature, which is one reason this PR is needed apart from #2610. Indeed, given this PR, #2610 is less important as a solution for #2094, but still has other benefits (clearer, enhanced functionality of labels and removing the confusing pos_label).
  • I'm not sure if there's any neater way to do deprecation where you want to check whether someone has passed an explicit value, hence 'compat' (see the sketch after this list). But, sure, it can be documented.
  • The need for ignore_warnings comes in part because of the sophisticated invariance testing in metrics, such as METRICS_WITH_AVERAGING relying on the metrics with no average kwarg set existing in ALL_METRICS. There's possibly a nicer way around it; but ignore_warnings seems sensible for invariance tests as long as the warning functionality is tested elsewhere.
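To illustrate the sentinel point above, a minimal sketch of the 'compat' pattern, assuming an f1_score-like function; the name and message are illustrative, not this PR's actual code:

import warnings

def some_prf_metric(y_true, y_pred, average='compat'):
    # 'compat' is only a sentinel to detect that the caller did not pass
    # `average` explicitly; it is never a meaningful averaging mode.
    if average == 'compat':
        warnings.warn("The default `average` will change; please pass an "
                      "explicit value such as 'macro', 'micro' or 'weighted'.",
                      DeprecationWarning)
        average = 'weighted'  # keep the old behaviour during deprecation
    # ... compute and return the score averaged according to `average` ...
    return average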
@jnothman
scikit-learn member

@arjoly, I'd like it if you could review or comment on this at some point.

@arjoly
scikit-learn member

Do you plan to move the averaging keyword to the third argument? Do you want to remove the default value or set the default value to None?

@jnothman
scikit-learn member
@arjoly
scikit-learn member

Looks good to merge!
Thanks for your hard work!

@jnothman
scikit-learn member

Thanks for the review, @arjoly

@jnothman jnothman referenced this pull request Jan 6, 2014
Closed

[MRG] Learning curves #2701

@GaelVaroquaux GaelVaroquaux commented on the diff Jan 18, 2014
benchmarks/bench_multilabel_metrics.py
@@ -20,7 +20,7 @@
METRICS = {
- 'f1': f1_score,
+ 'f1': partial(f1_score, average='micro'),
@GaelVaroquaux
GaelVaroquaux Jan 18, 2014

I think that the docs (the part that describes the different scoring options http://scikit-learn.org/dev/modules/model_evaluation.html#common-cases-predefined-values ) should be updated to stress this.

@jnothman
jnothman Jan 18, 2014
@GaelVaroquaux
GaelVaroquaux Jul 17, 2014

Note to self (and other reviewers): this merge has been done.

@jnothman
scikit-learn member

I've rebased this on #2676, so that both the metrics and scorers are explicit.

@jnothman
scikit-learn member

And that rebase means @arjoly's LGTM no longer applies. If you'd like to review the whole PR, Arnaud, that would be nice ;)

@arjoly
scikit-learn member

Is there a need to make get_scorer, list_scorers public functions? Can we prefix those by an _?

There are also several new constants such as SCORER_DEPRECATION and msg. By the way, I don't think we need to have all the scorer objects public, such as r2_scorer.

It would be nice to add an __all__ to the file.

@arjoly arjoly commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
@@ -287,3 +319,23 @@ def make_scorer(score_func, greater_is_better=True, needs_proba=False,
precision=precision_scorer, recall=recall_scorer,
log_loss=log_loss_scorer,
adjusted_rand_score=adjusted_rand_scorer)
+
+msg = ("The {0!r} scorer has been deprecated and will be removed in version "
+ "0.17. Please choose one of '{0}_binary' or '{0}_weighted' depending "
+ "on your data; '{0}_macro', '{0}_micro' and '{0}_samples' provide "
+ "alternative multiclass/multilabel averaging.")
+for name, metric in [('precision', precision_score),
+ ('recall', recall_score), ('f1', f1_score)]:
@arjoly arjoly and 2 others commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
+ "0.17. Please choose one of '{0}_binary' or '{0}_weighted' depending "
+ "on your data; '{0}_macro', '{0}_micro' and '{0}_samples' provide "
+ "alternative multiclass/multilabel averaging.")
+for name, metric in [('precision', precision_score),
+ ('recall', recall_score), ('f1', f1_score)]:
+ SCORERS.update({
+ name: make_scorer(metric),
+ '{0}_binary'.format(name): make_scorer(partial(metric)),
+ '{0}_macro'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='macro')),
+ '{0}_micro'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='micro')),
+ '{0}_samples'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='samples')),
+ '{0}_weighted'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='weighted')),
@arjoly
arjoly Jan 21, 2014

I would prefer something like "macro-{0}", "binary-{0}", ...

@jnothman
jnothman Jan 21, 2014

I'd originally done this, but for usability I'd rather see them listed together. I considered using - between words, but it would only create confusion given that scorers already exist with underscored names.

@GaelVaroquaux
GaelVaroquaux Jul 17, 2014

I prefer underscored names because they are also valid Python identifiers, which can come in handy at some point.

@arjoly arjoly and 1 other commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
+def get_scorer(scoring=None):
+ """Get a scorer by its name
+
+ Parameters
+ ----------
+ scoring : string or callable
+
+ Returns
+ -------
+ scorer : callable
+ Returns the scorer of the given name if scoring is a string, and
+ otherwise the object passed in.
+ """
+ if isinstance(scoring, six.string_types):
+ if scoring in SCORER_DEPRECATION:
+ warn(SCORER_DEPRECATION[scoring], DeprecationWarning)
@arjoly
arjoly Jan 21, 2014

Instead of having msg, would it be possible to handle all the deprecation stuff here?

SCORER_DEPRECATION could be simplified to a list of names.

@jnothman
jnothman Jan 21, 2014

As in a series of if statements? To what benefit? We still need a list of deprecated scorers to subtract when listing scorers, so that way we would duplicate the information in different places.

But I think what you're asking me is to encapsulate the bits and pieces so that there aren't these global names floating around. I guess I can sort that out.

@arjoly
arjoly Jan 21, 2014

But I think what you're asking me is to encapsulate the bits and pieces so that there aren't these global names floating around. I guess I can sort that out.

+1

@jnothman
jnothman Jan 21, 2014

I've pushed one attempt at this. It still has global SCORERS and SCORER_DEPRECATION, but I don't think a more encapsulated approach (using closures or a singleton class to define get_scorer) is in keeping with scikit-learn style.

@arjoly arjoly and 3 others commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
@@ -256,6 +276,17 @@ def make_scorer(score_func, greater_is_better=True, needs_proba=False,
return cls(score_func, sign, kwargs)
+def list_scorers():
+ """Lists the names of known scorers
+
+ Returns
+ -------
+ scorer_names : list of strings
+ """
+ return sorted(set(SCORERS) - set(SCORER_DEPRECATION))
@arjoly
arjoly Jan 21, 2014

Could you use set methods here instead of the operator? I find it clearer to have an explicit method call.

@jnothman
jnothman Jul 21, 2014

I just noticed this comment. I don't find difference a particularly good method name.

@vene
vene Jul 21, 2014

from set import difference; difference(...) is arguably not great.
import set; set.difference(...) is very readable however.

@jnothman
jnothman Jul 21, 2014

I mean that it's hard to appreciate that it's an asymmetric difference and that the argument is subtracted from the calling object.
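For concreteness, the two spellings being debated are equivalent here (a sketch; the toy dicts stand in for the module-level SCORERS and SCORER_DEPRECATION):

SCORERS = {'f1_macro': None, 'f1_micro': None, 'f1': None}
SCORER_DEPRECATION = {'f1': "deprecation message"}

# operator form, as in the PR
names_op = sorted(set(SCORERS) - set(SCORER_DEPRECATION))
# explicit method form, as suggested
names_method = sorted(set(SCORERS).difference(SCORER_DEPRECATION))
assert names_op == names_method == ['f1_macro', 'f1_micro']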

@jnothman
scikit-learn member

Is there a need to make get_scorer, list_scorers public functions? Can we prefix those by an _?

Yes, IMO. If someone wrote their own CV utility, they should be using get_scorer. That's the point: it provides a formal abstraction over a dict lookup so that we can maintain it. list_scorers could be private, but that just means the only way to get the official list of scorers is to trigger an exception, whose message then needs parsing, etc.

I should note that get_scorer already exists, as of a876682 (this needs a rebase).
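As an illustration of that use case, a sketch of a user-side helper built on get_scorer (assuming the API in this PR, where a scorer is called with the (estimator, X, y) signature):

from sklearn.metrics import get_scorer

def evaluate(estimator, X_test, y_test, scoring='f1_macro'):
    # Resolves a scoring name (or passes a callable through) the same way
    # the built-in cross-validation utilities do.
    scorer = get_scorer(scoring)
    return scorer(estimator, X_test, y_test)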

There are also several new constants such as SCORER_DEPRECATION and msg. By the way, I don't think we need to have all the scorer objects public, such as r2_scorer.

I agree that there's a lot of unnecessary mess in the module namespace, but I think that's out of this PR's scope.

@jnothman
scikit-learn member

Rebased and addressed @arjoly's comments.

@arjoly arjoly commented on an outdated diff Jan 21, 2014
doc/whats_new.rst
@@ -185,6 +185,17 @@ API changes summary
of length greater than one.
By `Manoj Kumar`_.
+ - `scoring` parameter for cross validatiokn now accepts `'f1_binary'`,
+ `'f1_micro'`, `'f1_macro'` or `'f1_weighted'`, deprecating the generic
+ `'f1'`. Similarly, `'precision'` and `'recall'` are deprecated.
@arjoly
arjoly Jan 21, 2014

deprecating ... are deprecated?

@arjoly arjoly commented on an outdated diff Jan 21, 2014
sklearn/metrics/scorer.py
+ scorers['{0}_binary'.format(name)] = make_scorer(partial(metric))
+ for average in ['macro', 'micro', 'samples', 'weighted']:
+ averaged= partial(metric, pos_label=None, average=average)
+ scorers['{0}_{1}'.format(name, average)] = make_scorer(averaged)
+
+ # deprecated but available until version 0.17:
+ scorers[name] = make_scorer(metric)
+ deprecation_messages[name] = (msg.format(name))
+
+ return scorers, deprecation_messages
+
+
+SCORERS, SCORER_DEPRECATION = _build_scorers()
+
+
+__all__ = ['make_scorer', 'get_scorer', 'list_scorers', 'check_scoring']
@arjoly
arjoly Jan 21, 2014

Can you put this at the top?

@coveralls

Coverage Status

Coverage remained the same when pulling f809cd1 on jnothman:prf_average_explicit into fb43369 on scikit-learn:master.

@arjoly
scikit-learn member

LGTM

@jnothman
scikit-learn member
@coveralls

Coverage Status

Coverage remained the same when pulling 3797cd9 on jnothman:prf_average_explicit into fb43369 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 3797cd9 on jnothman:prf_average_explicit into fb43369 on scikit-learn:master.

@arjoly
scikit-learn member

Do you think we need pre-packaged scorers for all forms of average,
for instance?

I think that having all metrics as scorers is handy. However, it could overwhelm the user when they have to choose which metric to use. I'm fine with either option: providing only a subset, or all forms of averaging.

An opinion @mblondel ?

@arjoly
scikit-learn member

A rebase is apparently needed. Thanks @jnothman for your hard work!

@jnothman
scikit-learn member

Rebased. Hope Travis is still appeased.

@coveralls

Coverage Status

Coverage remained the same when pulling 8cbb1ca on jnothman:prf_average_explicit into fec2867 on scikit-learn:master.

@arjoly
scikit-learn member

@GaelVaroquaux, this is what we were talking about yesterday.

@GaelVaroquaux
scikit-learn member
@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 17, 2014
doc/modules/model_evaluation.rst
**Classification**
'accuracy' :func:`sklearn.metrics.accuracy_score`
'average_precision' :func:`sklearn.metrics.average_precision_score`
-'f1' :func:`sklearn.metrics.f1_score`
-'precision' :func:`sklearn.metrics.precision_score`
-'recall' :func:`sklearn.metrics.recall_score`
+'f1_binary' :func:`sklearn.metrics.f1_score` with `pos_label=1`
+'f1_micro' :func:`sklearn.metrics.f1_score` micro-averaged
+'f1_macro' :func:`sklearn.metrics.f1_score` macro-averaged
+'f1_weighted' :func:`sklearn.metrics.f1_score` weighted average
+'f1_samples' :func:`sklearn.metrics.f1_score` by multilabel sample
+'precision_...' :func:`sklearn.metrics.precision_score` likewise
+'recall_...' :func:`sklearn.metrics.recall_score` likewise
@GaelVaroquaux
GaelVaroquaux Jul 17, 2014

I don't think that it is a good idea to just remove 'f1', 'precision' and 'recall'. What we might do is keep them (as equivalent to f1_binary), but raise a useful error message if there are more than 2 classes.
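A rough sketch of that suggestion (an illustrative wrapper, not this PR's code): keep the plain 'f1' scorer working on binary targets but fail with a helpful message on anything else.

import numpy as np
from sklearn.metrics import f1_score

def binary_f1_scorer(estimator, X, y_true):
    if len(np.unique(y_true)) > 2:
        raise ValueError(
            "The 'f1' scorer only supports binary targets; use 'f1_macro', "
            "'f1_micro' or 'f1_weighted' for multiclass/multilabel data.")
    # Binary case: score the positive class as before.
    return f1_score(y_true, estimator.predict(X))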

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 17, 2014
sklearn/metrics/scorer.py
@@ -340,3 +366,23 @@ def make_scorer(score_func, greater_is_better=True, needs_proba=False,
precision=precision_scorer, recall=recall_scorer,
log_loss=log_loss_scorer,
adjusted_rand_score=adjusted_rand_scorer)
+
+msg = ("The {0!r} scorer has been deprecated and will be removed in version "
+ "0.17. Please choose one of '{0}_binary' or '{0}_weighted' depending "
@GaelVaroquaux
GaelVaroquaux Jul 17, 2014

Now that should be 0.18 :$

@GaelVaroquaux
scikit-learn member

So after discussing this a bit with @arjoly to get a big-picture view of the problems, here is my take on the PR:

  • In general I am very positive about the new, more explicit scorer names

  • I would really like things to work by default. Many people don't understand the refinements of the differences between various metrics, and don't want to have to make a choice.

  • In terms of defaults, it seems that the rough consensus is that 'macro' is better than 'weighted' for multi-class. The right behavior thus seems to be that if there are only 2 classes, binary is used, and if there are more than 2 classes, macro is used.

  • There should be a paragraph in the docs that gives intuition with regards to the difference between the various multiclass approaches. @arjoly was able to give me a fantastic set of intuitions while sitting on a couch. If we can get this in the docs, it would be great.

@GaelVaroquaux
scikit-learn member

And thanks for working on this!

@jnothman
scikit-learn member

I had rebased and updated given the release overnight without internet access, and hence without seeing your comments... But it now appears to need another rebase and an addressing of your comments.

@jnothman
scikit-learn member

I would really like things to work by default. Many people don't understand the refinements of the differences between various metrics, and don't want to have to make a choice.

I'm okay with the idea that the default average is 'binary', which will throw an error if the user provides non-binary targets. I very strongly object to a default (like the incumbent) that means the special-casing of binary data is done implicitly, such that changing to a 3-class problem results in a completely different metric. This is exactly what happened in some test cases that scikit-learn used to have, and resulted in us testing nonsense. But we would need to keep the current behaviour with a deprecation warning for two releases before making that error anyway. (Making it binary default also affects some decisions in #2610. One benefit of it is that it is easily consistent with ROC AUC score and average precision.)

In terms of defaults, it seems that the rough consensus is that 'macro' is better than 'weighted' for multi-class. The right behavior thus seems to be that if there are only 2 classes, binary is used, and if there are more than 2 classes, macro is used.

To my knowledge, 'weighted' is something no-one has ever heard of outside scikit-learn. However, it's also arguable that micro is the best way to go in multilabel cases, or in a multiclass setting where there is a majority class you want to ignore (at least that is where it has been useful to me). The latter is not yet supported, except by construing it as a multilabel problem where each instance has 0 or 1 labels, but will be if #2610 is merged.

One reason for not letting the function just work by default is that a user reporting this score needs to state which type of averaging was used, or else the reader has to guess. (You can often guess that someone has reported a macro average from their low score on an imbalanced-but-otherwise-easy problem.) And that's bad.

There should be a paragraph in the docs that gives intuition with regards to the difference between the various multiclass approaches.

http://scikit-learn.org/dev/modules/model_evaluation.html#multiclass-and-multilabel-classification attempts to give some of this intuition, but perhaps falls short of making it intuitive.

@jnothman
scikit-learn member

In short, I propose the following future (after deprecation) behaviour: without an explicit average argument, or an explicit suffixed scoring name, binary classification targets will be required for P/R/F.

WDYT?

@arjoly
scikit-learn member

One benefit of it is that it is easily consistent with ROC AUC score and average precision

In that case, we use macro-average by default. But the meaning is equivalent in the binary case since we don't support multi-class.

@jnothman
scikit-learn member
@arjoly
scikit-learn member

I am personally in favour of explicit behaviour. Any other opinion @ogrisel, @amueller, @vene, @mblondel ?

@vene vene and 1 other commented on an outdated diff Jul 20, 2014
sklearn/metrics/metrics.py
@@ -1342,7 +1342,8 @@ def f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted',
If ``average`` is not ``None`` and the classification target is binary,
only this class's scores will be returned.
- average : string, [None, 'micro', 'macro', 'samples', 'weighted' (default)]
+ average : string, [None, 'micro', 'macro', 'samples', 'weighted']
@vene
vene Jul 20, 2014

The default value is not documented for now. This is made particularly ugly by having a meaningful value for None here. Since we're changing the API here, how about deprecating average=None too and adding average='binary'?

@jnothman
jnothman Jul 21, 2014

If we're moving to the behaviour I have proposed, we will probably add average='binary'. In the current state of the PR, there is no default value: we want the user to provide one explicitly.

@vene vene commented on an outdated diff Jul 20, 2014
sklearn/metrics/tests/test_score_objects.py
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LinearSVC(random_state=0)
clf.fit(X_train, y_train)
- score1 = SCORERS['f1'](clf, X_test, y_test)
- score2 = f1_score(y_test, clf.predict(X_test))
- assert_almost_equal(score1, score2)
+
+ for prefix, metric in [('f1', f1_score), ('precision', precision_score),
+ ('recall', recall_score)]:
+
+ score1 = get_scorer('%s_weighted' % prefix)(clf, X_test, y_test)
+ score2 = metric(y_test, clf.predict(X_test), pos_label=None,
+ average='weighted')
@vene
vene Jul 20, 2014

strange indent here and in the similar lines below

@jnothman
scikit-learn member

@GaelVaroquaux wrote:

There should be a paragraph in the docs that gives intuition with regards to the difference between the various multiclass approaches. @arjoly was able to give me a fantastic set of intuitions while sitting on a couch. If we can get this in the docs, it would be great.

Well, scikit-learn.org isn't exactly a couch :) I've put together a gist which (perhaps with a figure) may help illustrate some of the differences between the averaging options, using a toy example. I calculate the metrics once with the majority class included, and again with it excluded by treating it as a 0-or-1 label classification evaluation (#2610 will make this possible without an explicit transformation of the data).

I haven't included average='samples' as it really requires at least one label per true and predicted sample; with the current example, samples-averaged evaluation with pred and true identical produces F1=0.5. And I am of a mind to remove 'weighted' unless someone can tell me what it's for and where it's used!
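In the same spirit as that gist, a small self-contained illustration of how the averaging choices diverge on an imbalanced toy problem (the values in the comments are approximate):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 8 samples of a majority class 0, one each of classes 1 and 2;
# the classifier predicts the majority class every time.
y_true = np.array([0] * 8 + [1, 2])
y_pred = np.zeros(10, dtype=int)

# Per-class F1: class 0 scores well, classes 1 and 2 score 0 (sklearn will
# warn that precision is ill-defined for classes that are never predicted).
print(f1_score(y_true, y_pred, average=None))        # ~[0.89, 0.0, 0.0]
print(f1_score(y_true, y_pred, average='macro'))     # ~0.30: classes weighted equally
print(f1_score(y_true, y_pred, average='weighted'))  # ~0.71: dominated by class 0's support
print(f1_score(y_true, y_pred, average='micro'))     # 0.80: equals accuracy here
print(accuracy_score(y_true, y_pred))                # 0.80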

@arjoly
scikit-learn member

I haven't included average='samples' as it really requires at least one label per true and predicted sample; with the current example, samples-averaged evaluation with pred and true identical produces F1=0.5.

Should we set a default value? Maybe as an option?

@jnothman
scikit-learn member
@arjoly
scikit-learn member

When you execute this code

In [6]: roc_auc_score(np.array([[0, 0], [1, 0]]), np.array([[0, 0], [1, 1]]), average="samples")

You get

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

What I proposed is that instead of raising an error, we should be able to set the score value for that sample to 0 or 1, or to skip this sample in the computation.

@GaelVaroquaux
scikit-learn member
@jnothman
scikit-learn member
@arjoly
scikit-learn member

Anyway, these edge cases are something we have debated before and, if we are to resolve them now, we should do so in another issue.

+1 for another issue

@jnothman
scikit-learn member

I have rebased this PR and updated it to conform to what I understood from @GaelVaroquaux. Most particularly, binary classification will continue to work with the un-suffixed f1, etc scorers.

@jnothman
scikit-learn member

@GaelVaroquaux, @arjoly I have added to this PR a rewrite of the discussion of averaging approaches for multiclass/multilabel calculations based on binary metrics. Please critique!

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 18d1446 on jnothman:prf_average_explicit into 0a7bef6 on scikit-learn:master.

@arjoly
scikit-learn member

Looks good to me, thanks Joel!

@jnothman
scikit-learn member

Rebased. It would be appreciated if @GaelVaroquaux (who last reviewed it before afe2d23) or someone else could give this a final review. And I'd suggest squashing for merge.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling bc4e1cc on jnothman:prf_average_explicit into f37618a on scikit-learn:master.

@jnothman
scikit-learn member

@GaelVaroquaux, even without a review of correctness, could I get your +/-1 on strategy, given your critique above? The strategy is basically to make the precision, recall and f1 scorers and their corresponding metric functions only work for binary problems (after a deprecation period); multiclass/multilabel problems need an explicit average argument or scorer name suffix.

@arjoly arjoly added this to the 0.16 milestone Sep 24, 2014
@jnothman
scikit-learn member

Squashing (commit history at bc4e1cc) and rebasing; and still awaiting a final review. (I'd like this to get into dev for a while so there is feedback before 0.16.)

@jnothman
scikit-learn member

Rebased again.

I'd really like to see this merged, finally. @mblondel, you expressed distaste for the default 'weighted' scheme for precision/recall/f1. Do you mind taking a look at this API change? Or @ogrisel or @MechCoder or @amueller? I think it would be good to have this merged into master for a while before the next release.

@GaelVaroquaux
scikit-learn member
@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Dec 7, 2014
sklearn/metrics/classification.py
warnings.warn('In the future, providing two `labels` values, as '
- 'well as `average` will average over those '
- 'labels. For now, please use `labels=None` with '
- '`pos_label` to evaluate precision, recall and '
+ 'well as `average!=\'binary\'` will average over '
@GaelVaroquaux
GaelVaroquaux Dec 7, 2014

I find that it is more elegant to simply use double quotes (") when the string is delimited with single quotes, and vice versa, rather than escaping the quotes.

Not that it really matters, so don't change anything.

@MechCoder
MechCoder Dec 8, 2014

Even I think the same way ;)

@GaelVaroquaux
scikit-learn member

Looks good to me! 👍 for merge.

Thanks a lot.

@GaelVaroquaux GaelVaroquaux changed the title from [MRG+1] Require explicit average arg for multiclass/label P/R/F metrics and scorers to [MRG+2] Require explicit average arg for multiclass/label P/R/F metrics and scorers Dec 7, 2014
@MechCoder
scikit-learn member

I can have a look at this tomorrow (if it hasn't already been merged by then).

@jnothman
scikit-learn member
@MechCoder MechCoder commented on an outdated diff Dec 8, 2014
doc/modules/model_evaluation.rst
+
+Some metrics are essentially defined for binary classification tasks (e.g.
+:func:`f1_score`, :func:`roc_auc_score`). In these cases, by default
+only the positive label is evaluated, assuming by default that the positive
+class is labelled ``1`` (though this may be configurable through the
+``pos_label`` parameter).
+
+.. _average:
+
+In extending a binary metric to multiclass or multilabel problems, the data
+is treated as a collection of binary problems, one for each class.
+There are then a number of ways to average binary metric calculations across
+the set of classes, each of which may be useful in some scenario.
+Where available, you should select among these using the ``average`` parameter.
+
+* ``"macro"`` simply calculates calculates the mean of the binary metrics,
@MechCoder
MechCoder Dec 8, 2014

Why does it have to calculate twice? ;)

@MechCoder MechCoder commented on an outdated diff Dec 8, 2014
doc/modules/model_evaluation.rst
+ are nonetheless important, macro-averaging may be a means of highlighting
+ their performance. On the other hand, the assumption that all classes are
+ equally important is often untrue, such that macro-averaging will
+ over-emphasise the typically low performance on an infrequent class.
+* ``"weighted"`` accounts for class imbalance by computing the average of
+ binary metrics in which each class's score is weighted by its presence in the
+ true data sample.
+* ``"micro"`` gives each sample-class pair an equal contribution to the overall
+ metric (except as a result of sample-weight). Rather than summing the
+ quotients (i.e. correct out of total) per class, this sums the dividends and
+ divisors that make up the the per-class metrics to calculate an overall
+ quotient. Micro-averaging may be preferred in multilabel settings, including
+ multiclass classification where a majority class is to be ignored.
+* ``"samples"`` does not calculate a per-class measure, instead calculating the
+ metric over the true and predicted classes for each sample in the evaluation
+ data, and returning their (``sample_weight``-weighted) average.
@MechCoder
MechCoder Dec 8, 2014

Is this valid only for multilabel data? (I'm not sure.) If it is, I think it's worth a mention here.

@MechCoder MechCoder commented on an outdated diff Dec 8, 2014
doc/modules/model_evaluation.rst
+Where available, you should select among these using the ``average`` parameter.
+
+* ``"macro"`` simply calculates calculates the mean of the binary metrics,
+ giving equal weight to each class. In problems where infrequent classes
+ are nonetheless important, macro-averaging may be a means of highlighting
+ their performance. On the other hand, the assumption that all classes are
+ equally important is often untrue, such that macro-averaging will
+ over-emphasise the typically low performance on an infrequent class.
+* ``"weighted"`` accounts for class imbalance by computing the average of
+ binary metrics in which each class's score is weighted by its presence in the
+ true data sample.
+* ``"micro"`` gives each sample-class pair an equal contribution to the overall
+ metric (except as a result of sample-weight). Rather than summing the
+ quotients (i.e. correct out of total) per class, this sums the dividends and
+ divisors that make up the the per-class metrics to calculate an overall
+ quotient. Micro-averaging may be preferred in multilabel settings, including
@MechCoder
MechCoder Dec 8, 2014

I'm not sure quotient is the right word here, but I can't think of anything better.

@MechCoder MechCoder and 1 other commented on an outdated diff Dec 8, 2014
examples/text/document_classification_20newsgroups.py
@@ -208,8 +208,8 @@ def benchmark(clf):
test_time = time() - t0
print("test time: %0.3fs" % test_time)
- score = metrics.f1_score(y_test, pred)
- print("f1-score: %0.3f" % score)
+ score = metrics.f1_score(y_test, pred, average='micro')
@MechCoder
MechCoder Dec 8, 2014

Is there any reason why this was explicitly changed from weighted to micro?

@jnothman
jnothman Dec 8, 2014

weighted f1 is not very commonly used in the field. We don't want to encourage it.

But as I presume this is a multiclass problem (with no majority class), it probably makes more sense to use macro, or else just to report accuracy, which should be equivalent to multiclass micro-F1.

So good catch!
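That equivalence is easy to check: for single-label multiclass data, micro-averaged precision, recall and F1 all reduce to accuracy. A quick sketch:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 4, size=200)   # 4-class, single-label targets
y_pred = rng.randint(0, 4, size=200)

assert np.isclose(accuracy_score(y_true, y_pred),
                  f1_score(y_true, y_pred, average='micro'))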

@MechCoder MechCoder commented on the diff Dec 8, 2014
sklearn/metrics/classification.py
@@ -475,7 +475,7 @@ def zero_one_loss(y_true, y_pred, normalize=True, sample_weight=None):
return n_samples - score
-def f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted',
+def f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary',
@MechCoder
MechCoder Dec 8, 2014

Sorry, but I did not look at the discussion: why was raising an error preferred to returning a metric per class (average=None)?

@jnothman
jnothman Dec 8, 2014

Which would you rather debug when upgrading scikit-learn?

@MechCoder MechCoder commented on the diff Dec 8, 2014
sklearn/metrics/classification.py
@@ -822,7 +824,7 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None,
"""
average_options = (None, 'micro', 'macro', 'weighted', 'samples')
- if average not in average_options:
+ if average not in average_options and average != 'binary':
@MechCoder
MechCoder Dec 8, 2014

Why not just add it to the average_options tuple, since it is technically an option now too?

@jnothman
jnothman Dec 8, 2014

No, because average_options is used in the warning message below where binary is inappropriate. However, perhaps binary should now be included in the docstring...

@MechCoder MechCoder and 1 other commented on an outdated diff Dec 8, 2014
sklearn/metrics/scorer.py
log_loss=log_loss_scorer,
adjusted_rand_score=adjusted_rand_scorer)
+
+for name, metric in [('precision', precision_score),
+ ('recall', recall_score), ('f1', f1_score)]:
+ SCORERS.update({
+ name: make_scorer(metric),
+ '{0}'.format(name): make_scorer(partial(metric)),
+ '{0}_macro'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='macro')),
+ '{0}_micro'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='micro')),
+ '{0}_samples'.format(name): make_scorer(partial(metric, pos_label=None,
+ average='samples')),
+ '{0}_weighted'.format(name): make_scorer(partial(metric,
+ pos_label=None,
@MechCoder
MechCoder Dec 8, 2014

You can maybe use another for loop across ['macro', 'micro', 'samples', 'weighted'] to save 6-7 lines of code.

@jnothman
jnothman Dec 8, 2014

Yes, I think it was like that in some version. Can't remember why it changed.

@MechCoder MechCoder commented on the diff Dec 8, 2014
sklearn/metrics/tests/test_score_objects.py
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LinearSVC(random_state=0)
clf.fit(X_train, y_train)
- score1 = SCORERS['f1'](clf, X_test, y_test)
- score2 = f1_score(y_test, clf.predict(X_test))
- assert_almost_equal(score1, score2)
+
+ for prefix, metric in [('f1', f1_score), ('precision', precision_score),
+ ('recall', recall_score)]:
+
@MechCoder
MechCoder Dec 8, 2014

here also I think.

@jnothman
jnothman Dec 8, 2014

It's sometimes useful to leave test loops unrolled (when not using generators), so that the error message is as precise as possible.

@MechCoder
scikit-learn member

@jnothman I'm still at a point where I learn more from Pull Requests than Pull Requests learn from me, so sorry if my comments caused more pain (in explaining) than pleasure ;)

@jnothman
scikit-learn member
@jnothman
scikit-learn member

I'll merge after travis confirms that I haven't done anything silly.

@jnothman jnothman FIX P/R/F metrics and scorers are now for binary problems only by default

Scorers for different average parameters have been added.
081a554
@jnothman jnothman merged commit 56ee99c into scikit-learn:master Dec 9, 2014

1 check was pending: continuous-integration/travis-ci (The Travis CI build is in progress)
@arjoly
scikit-learn member

Thanks @jnothman !

@mblondel
scikit-learn member

Sorry I am at a conference and didn't have time to review. Glad that this is finally in. Regarding 'weighted', my main concern was just that we should really try to avoid using non-standard stuff as default. On the other hand, there is the question of backward compatibility and it's hard to tell which would be a better default between micro and macro averaging. When the default is used (weighted), we could potentially raise a warning and tell the user to explicitly specify the averaging option. This way, users will be able to correctly report the results they got when writing a paper.

@amueller

There is an s missing here, right? Also, the example seems pretty contrived ^^

scikit-learn member

I don't see where the s is missing. I can't say I looked at what the example was doing. Ideally, we shouldn't be showing '_weighted' in any of the examples. I guess it was a lazy solution to completing the PR.

@amueller

That line breaks ^^

scikit-learn member

Whoops.

scikit-learn member

Never mind, fixed it ;)
