
In macro-averaged multiclass precision and recall, the default set of labels should come only from the target outputs (y_true), not from the predicted outputs (y_pred) #10573

Open
oxinabox opened this issue Feb 2, 2018 · 3 comments

Comments


oxinabox commented Feb 2, 2018

Description

In macro-averaged multiclass precision and recall, the default set of labels should come only from the target outputs (y_true), not from the predicted outputs (y_pred).

The target outputs come from the test set.
I believe it is a solid assumption that the classes we are interested in (at all) are the classes present in the test set.
We don't want to average over classes that we are not interested in.

Trying to calculate recall on classes that occur only in the predicted output (and not in the test set) is, as the warning says, ill-defined anyway.
This would make the behavior consistent with average='weighted', which only averages over the true classes.

More considerations at the bottom, though they are a bit long-winded even by my standards.

Steps/Code to Reproduce / Actual Results

```python
In [1]: from sklearn.metrics import precision_recall_fscore_support

In [2]: precision_recall_fscore_support(list("aabbb"), list("aabbc"), average='macro')
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py:1137: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples.
Out[2]: (0.66666666666666663, 0.55555555555555547, 0.59999999999999998, None)
```

Expected Results

```python
In [1]: from sklearn.metrics import precision_recall_fscore_support

In [2]: precision_recall_fscore_support(list("aabbb"), list("aabbc"), average='macro') 
Out[2]: (1.0, 0.83333333333333326, 0.90000000000000002, None)
```

Possibly with a warning along the lines of: "Warning: label(s) ['c'] occur in predictions but not in targets. They are not counted. To count them, define the labels argument manually. However, this will cause recall and F-score to be ill-defined."

Also reasonable would be for it to throw an error and require the user to specify the labels one way or the other:
either as labels=unique_labels(y_true, y_pred) (the current behavior),
or as labels=unique_labels(y_true) (what I suggest is the more commonly desired behavior).

Or, as a final option, not changing the behavior but adding to the current warning message: "You may wish to only consider labels from y_true. To do this, use labels=unique_labels(y_true)."

Matching changes would follow for average=None as well.
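For reference, the behavior I am suggesting can already be obtained today by passing labels explicitly. A minimal sketch (unique_labels comes from sklearn.utils.multiclass):

```python
from sklearn.metrics import precision_recall_fscore_support
from sklearn.utils.multiclass import unique_labels

y_true = list("aabbb")
y_pred = list("aabbc")

# Current default: the label set is taken from both y_true and y_pred,
# so the spurious class 'c' drags the macro average down.
print(precision_recall_fscore_support(y_true, y_pred, average='macro'))
# -> (0.666..., 0.555..., 0.6, None), plus an UndefinedMetricWarning

# Suggested default: take the label set from y_true only.
print(precision_recall_fscore_support(y_true, y_pred,
                                      labels=unique_labels(y_true),
                                      average='macro'))
# -> (1.0, 0.833..., 0.9, None)
```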

Versions

I first encountered this in Scikit-Learn 0.14.1
(it turned out I was using an old version by mistake),
but the behavior is unchanged after updating to 0.19.1.

```python
In [1]: import platform; print(platform.platform())
Linux-3.16.0-4-amd64-x86_64-with-debian-8.10

In [2]: import sys; print("Python", sys.version)
('Python', '2.7.9 (default, Jun 29 2016, 13:08:31) \n[GCC 4.9.2]')

In [3]: import numpy; print("NumPy", numpy.__version__)
('NumPy', '1.9.3')

In [4]: import scipy; print("SciPy", scipy.__version__)
('SciPy', '0.15.1')

In [5]: import sklearn; print("Scikit-Learn", sklearn.__version__)
('Scikit-Learn', '0.19.1')
```

Extended Motivation

Now, in normal supervised multi-class ML circumstances you will not have a label that appears in the predicted output but not the target output.

There are times when you can (I am working with one in https://github.com/oxinabox/NovelPerspective), though the question then becomes whether macro-averaged P/R/F1 is a good metric for these. Maybe not.

Anyway, we must assume that we are in such a circumstance, as otherwise why would we bother searching both the target and predicted outputs for labels at all?

Consider: these measures come from information extraction.
They are binary measures.
Precision is "Of all the times I found LABEL, how many of them were where they should have been?": Prec = tp/(tp+fp)
Recall is "Of all the times I should have found LABEL, how often did I actually find it?": Recall = tp/(tp+fn)

When we define a test set (i.e. the target set), the classes in it must be the ones that matter. It is the ground truth.

We wouldn't make a test set that didn't have a class that matters (or would we?)

When we say macro averaging, we are thinking:
for each class, rewrite the outputs so that they are binary (Found / Not Found) for the class in question,
calculate the information-extraction precision/recall for each class,
and then average those.

In that binary view, you never see the classes that occur only in the predicted output; they just become "Not Found" when scoring each class from the target output.

It is the target output classes that you are asking questions about; that is why they are in the test set.
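A minimal sketch of that one-vs-rest view, using only scikit-learn's binary precision_score/recall_score on the example above (classes taken from the target output only):

```python
from sklearn.metrics import precision_score, recall_score

y_true = list("aabbb")
y_pred = list("aabbc")

per_class = []
for label in sorted(set(y_true)):          # classes from the target output only
    t = [int(v == label) for v in y_true]  # Found / Not Found for this class
    p = [int(v == label) for v in y_pred]  # the stray 'c' just becomes Not Found
    per_class.append((precision_score(t, p), recall_score(t, p)))

macro_precision = sum(p for p, _ in per_class) / float(len(per_class))
macro_recall = sum(r for _, r in per_class) / float(len(per_class))
print(macro_precision, macro_recall)  # 1.0 and 0.833..., as in "Expected Results"
```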

Now consider the math for a class that occurs only in the predicted output.

Let's say it occurs N times.
Then tp = 0 and fn = 0, since the class never occurs in the target output, so there are no true positives or false negatives.

And fp = N, since every time it occurs is a false positive.

Its (binary) precision will be 0: tp/(tp+fp) = 0/(0+N).

Its (binary) recall will be undefined: tp/(tp+fn) = 0/(0+0).
Its F1 will thus also be undefined.

Sklearn gives a warning and sets the binary recall and F1 to zero.

This in turn decreases the (overall) macro average since it adds a class that scores 0 for precision and recall.
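To make that arithmetic concrete for the example above, where the spurious class 'c' occurs N = 1 times (a quick hand check, not library code):

```python
y_true = list("aabbb")
y_pred = list("aabbc")

tp = sum(t == 'c' and p == 'c' for t, p in zip(y_true, y_pred))  # 0
fp = sum(t != 'c' and p == 'c' for t, p in zip(y_true, y_pred))  # 1 (= N)
fn = sum(t == 'c' and p != 'c' for t, p in zip(y_true, y_pred))  # 0

print(tp / float(tp + fp))  # binary precision for 'c': 0.0
# tp / float(tp + fn) would be 0/0, so recall (and hence F1) is undefined;
# scikit-learn warns, substitutes 0.0, and the macro average drops.
```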

jnothman (Member) commented Feb 3, 2018

Firstly, I agree with the general underlying principle that we need to be assisting the user towards best practices. (Although some have argued that F1 for classification tasks is never best practice, and I tend to agree, let's leave that alone for now; the issue applies to recall alone, which everyone agrees is a good thing to measure.)

I also agree with you that in an ideal world we would not be getting the set of labels from the prediction data, because then there is an arbitrarily large penalty for a system that spuriously proposes a class known in the training data but not in the test data. However, ignoring that class from the macro-average will introduce spurious variance into the scores of a cross-validation run. And it's hard to change what we currently do. So I'd rather add a warning than try to change the default behaviour.

Basically, for multiclass classification, labels should always be specified. I would certainly consider a PR that improved the warnings for cases where the user has an identifiable usage problem.

oxinabox (Author) commented Feb 5, 2018

(Although some have argued that F1 for classification tasks is never best practice, and I tend to agree, let's leave that alone for now; the issue applies to recall alone, which everyone agrees is a good thing to measure.)

While it is off-topic, I'm pretty interested in this; do you have any links?
The more I thought about the problem, the more something felt wrong about using this metric for my own problem.

there is an arbitrarily large penalty for a system that spuriously proposes a class known in the training data but not in the test data

It isn't even only the case of labels that appear in training but not in test.
There are other cases; for example, in my own case I have an information-extraction-ish problem where the output labels could come from an infinite set of possibilities.

Anyway back on topic:

So I'd rather add a warning than try to change the default behaviour.
Basically, for multiclass classification, labels should always be specified. I would certainly consider a PR that improved the warnings for cases where the user has an identifiable usage problem.

I feel warnings are too easily ignored (though a warning is better than nothing).

What about adding, in the next release, a "DeprecationWarning: Not specifying labels for multiclass classification is deprecated and will not be available in the next release. Update your code to explicitly define what your labels are."?
Then removing the default in the release after that.
(This is the semver-based process we follow for all Julia packages, but I don't know how it works in the scikit-learn/Python world.)

Possibly also adding extra options for labels, like labels='auto_true' (documented as recommended for most usages) to take labels from y_true only, and labels='auto_both' to take them from both.
Though both can still cause problems that being explicit would avoid.
(E.g., as you say, unlucky folds in KFold that miss some labels. For that case in particular, I feel this problem can be pushed up to the cross-validated scorer, which at least protects people who use it for cross-validated evaluations of normal classifiers.)
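One rough sketch of that idea: fix the label set from the whole dataset once, and pass it into the metric via make_scorer, so every fold is scored over the same classes (the estimator and dataset here are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.utils.multiclass import unique_labels

X, y = load_iris(return_X_y=True)

# Label set determined once, from all of y, not separately per CV fold.
macro_recall = make_scorer(recall_score, average='macro', labels=unique_labels(y))

print(cross_val_score(LogisticRegression(), X, y, cv=5, scoring=macro_recall))
```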

jnothman (Member) commented Feb 5, 2018

There are other cases; for example, in my own case I have an information-extraction-ish problem where the output labels could come from an infinite set of possibilities.

This sounds like a retrieval-like IE problem, as opposed to, say, NER, where the number of possible phrases is quadratic in the length of the text but the number of labels is few. I would not use a macro-average for this, unless it's only a macro-average of recall; it gives too much weight to rare classes. Scikit-learn's metrics are also not designed to handle this use-case efficiently.

F1 for classification is different from F1 for IR and F1 for IE. It's likely problematic in all of them (or so David Powers argues), because of the inappropriateness of the harmonic mean, and perhaps because Jaccard should be preferred over Dice coefficient (= F1) as a true metric and because it obviates the double penalty issue described by Manning for NER.

For classification it's mostly a problem because it lends too much weight to the bias of a system towards the positive class. Basically, precision's denominator messes up a lot of things.

I don't think we're yet ready to force the user to specify the set of labels. We've had plans to try to push the set of labels from the whole dataset into the evaluation metric when CV is used, but it's proven hard to implement.

I don't mind the labels='auto_true' approach and eventually making that the default.
