In macro-averaged multiclass precision and recall, the default set of labels should come only from the target outputs (y_true), not from the predicted outputs (y_pred) #10573
Comments
Firstly, I agree with the general underlying principle that we need to be assisting the user towards best practices. (Although some have argued that F1 for classification tasks is never best practice, and I tend to agree, let's leave that alone for now; the issue applies to recall alone, which everyone agrees is a good thing to measure.) I also agree with you that in an ideal world, we would not be getting the set of labels from the prediction data, because then there is an arbitrarily large penalty for a system that spuriously proposes a class known in the training but not in the test data. However, ignoring that class in the macro-average will introduce spurious variance into the scores of a cross-validation run. And it's hard to change what we currently do. So I'd rather add a warning than try to change the default behaviour. Basically, for multiclass classification, labels should always be specified. I would certainly consider a PR that improved the warnings for cases where the users had identifiable usage problems.
While it is off-topic, I'm pretty interested in this, do you have any links?
It isn't even limited to the case of labels that appear in training but not in test. Anyway, back on topic:
I feel warnings are too easily ignored (though a warning is better than nothing). What about, in the next release, adding a "DeprecationWarning: Not specifying labels for multiclass classification is deprecated and will not be available in the next release. Update your code to explicitly define what your labels are."? Possibly also adding extra options for labels, like
This sounds like a retrieval-like IE problem, as opposed to, say, NER, where the number of possible phrases is quadratic in the length of the text but the number of labels is few. I would not use macro-average for this, unless it's only a macro-average of recall. It gives too much weight to rare classes. Scikit-learn's metrics are also not designed to handle this use-case efficiently. F1 for classification is different from F1 for IR and F1 for IE. It's likely problematic in all of them (or so David Powers argues), because of the inappropriateness of the harmonic mean, and perhaps because Jaccard should be preferred over the Dice coefficient (= F1) as a true metric, and because it obviates the double-penalty issue described by Manning for NER. For classification it's mostly a problem because it lends too much weight to the bias of a system towards the positive class. Basically, precision's denominator messes up a lot of things. I don't think we're yet ready to force the user to specify the set of labels. We've had plans to try to push the set of labels from the whole dataset into the evaluation metric when CV is used, but it's proven hard to implement. I don't mind the
Description
In macro-averaged multiclass precision and recall, the default set of labels should come only from the target outputs (y_true), not from the predicted outputs (y_pred).
The target outputs come from the test set.
I believe it is a solid assumption that the classes we are interested in (at all) are the classes present in the test set.
We don't want to average over classes that we are not interested in.
In the case where a label occurs only in the predicted argument,
trying to calculate recall for it is (as the warning says) ill-defined anyway.
This would make the behavior consistent with `average='weighted'`, which only averages over true classes.
More consideration at the bottom, though it is a bit long-winded even by my standards.
Steps/Code to Reproduce / Actual Results
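The original reproduction code does not survive in this copy of the thread; a minimal sketch consistent with the description (the data values here are illustrative, not the reporter's) could be:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = ["a", "a", "b", "b"]  # test-set targets: only classes "a" and "b"
y_pred = ["a", "a", "b", "c"]  # predictions spuriously include class "c"

# Default behaviour: the label set is taken from both y_true and y_pred,
# so class "c" (absent from y_true) is included in the macro average.
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")

# "c" contributes precision 0 and an ill-defined recall (set to 0 with a
# warning), dragging every macro score down.
print(p, r, f)  # ~0.667, 0.5, ~0.556
```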
Possibly with a warning like:
"Warning: labels ['c'] occur in predictions but not in targets. They are not counted. To count them, define the labels argument manually. However, this will cause recall and F-score to be ill-defined."
Also reasonable would be for it to throw an error,
and require the user to specify the labels one way or the other.
Either as `labels=unique_labels(y_true, y_pred)` (current behavior),
or as `labels=unique_labels(y_true)` (what I suggest is the more commonly desired behavior),
or, as a final option, not changing the behavior but adding to the current warning message:
"you may wish to only consider labels from y_true. To do this, use `labels=unique_labels(y_true)`"
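For illustration, the first two options can be compared directly with `unique_labels` (the data values are made up for the example):

```python
from sklearn.metrics import recall_score
from sklearn.utils.multiclass import unique_labels

y_true = ["a", "a", "b", "b"]
y_pred = ["a", "a", "b", "c"]  # "c" never appears in y_true

# Current default: labels drawn from both arguments, so "c" is averaged in.
both = recall_score(y_true, y_pred,
                    labels=unique_labels(y_true, y_pred), average="macro")

# Suggested default: labels drawn from y_true only.
true_only = recall_score(y_true, y_pred,
                         labels=unique_labels(y_true), average="macro")

print(both, true_only)  # 0.5 vs 0.75
```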
There are matching changes that would come to `average=None` from this change as well.

Versions
I first encountered this in Scikit-Learn 0.14.1
(turned out I was using an old version by mistake),
but it is unchanged when I updated to 0.19.1.
Extended Motivation.
Now, in normal supervised multi-class ML circumstances you will not have a label that appears in the predicted output but not the target output.
There are times when you can (I am working with one in https://github.com/oxinabox/NovelPerspective)
(though the question then becomes: is macro-averaged P/R/F1 a good metric for these? Maybe not).
Anyway, we must assume that we are in such a circumstance; otherwise, why would we have bothered to search both the target and predicted outputs for labels?
Consider:
These measures come from information extraction.
They are binary measures.
The precision is "Of all the times I found LABEL, how many of them were where they should have been?"
Prec = tp/(tp+fp)
The recall is "Of all the times I should have found LABEL, how often did I actually find it?"
Recall = tp/(tp+fn)
When we define a test set (i.e. the target set), the classes in it must be the ones that matter. It is the ground truth.
We wouldn't make a test set that didn't have a class that matters (or would we?)
When we say macro averaging,
we are thinking:
For each class, rewrite the outputs so that they are binary (Found, Not Found) for the class in question.
And calculate the information extraction precision/recall for each class.
Then average those.
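The procedure above can be sketched in plain Python (a simplified illustration, not scikit-learn's actual implementation):

```python
def macro_precision_recall(y_true, y_pred, labels):
    """Macro-average precision and recall over the given label set."""
    precisions, recalls = [], []
    for label in labels:
        # Binarize: each position becomes Found/Not Found for this one class.
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        # 0/0 recall is ill-defined; treated as 0 here, as sklearn does (with a warning).
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

# Including the spurious class "c" lowers both macro scores:
p_all, r_all = macro_precision_recall(["a", "a", "b", "b"],
                                      ["a", "a", "b", "c"],
                                      labels=["a", "b", "c"])
p_true, r_true = macro_precision_recall(["a", "a", "b", "b"],
                                        ["a", "a", "b", "c"],
                                        labels=["a", "b"])
```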
In that binary view, you never see the classes that appear only in the predicted output.
They are just "Not Found" for every class in the target output.
Because it is the target output classes that you are asking questions about -- that is why they are in the test set.
Now consider the math
for some class that is only in the predicted output
let's say it occurs N times. Then:
tp = 0 and fn = 0, since it never occurs in the targets, so it has no true positives or false negatives; and
fp = N, since every time it occurs is a false positive.
Its (binary) precision will be 0:
tp/(tp+fp) = 0/(0+N) = 0
Its (binary) recall will be undefined:
tp/(tp+fn) = 0/(0+0)
Its F1 will thus also be undefined.
Sklearn gives a warning and sets the binary recall and F1 to zero.
This in turn decreases the (overall) macro average since it adds a class that scores 0 for precision and recall.
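This behaviour can be observed directly (a small sketch; the data are illustrative):

```python
import warnings
from sklearn.metrics import recall_score

y_true = ["a", "a", "b", "b"]
y_pred = ["a", "a", "b", "c"]  # "c" has no true samples

# Capture the UndefinedMetricWarning that sklearn emits for class "c".
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    macro_recall = recall_score(y_true, y_pred, average="macro")

# Per-class recalls: a = 1.0, b = 0.5, c = 0/0 -> substituted with 0,
# so the macro average drops to (1.0 + 0.5 + 0.0) / 3 = 0.5.
print(macro_recall)  # 0.5
```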