
ENH/FIX Replace jaccard_similarity_score by sane jaccard_score #13151

Merged: 129 commits merged into scikit-learn:master on Mar 13, 2019

Conversation

jnothman (Member) commented Feb 13, 2019

Fixes #7332. Alternative to #13092

Also simplifies the division warning logic, such that it fixes #10812 and fixes #10843 (with thanks to @qinhanmin2014 in #13143)

What does this implement/fix? Explain your changes.

The current Jaccard implementation is ridiculous for binary and multiclass problems, where it simply returns accuracy. This adds a new Jaccard function with an API comparable to precision, recall and F-score, which are also fundamentally set-wise metrics.

This also drops the normalize parameter.
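For orientation, a minimal sketch of the proposed API (the calls mirror precision_score / recall_score / f1_score; the example values are the ones worked out later in this thread):

>>> from sklearn.metrics import jaccard_score
>>> y_pred = [0, 2, 1, 2]
>>> y_true = [0, 1, 2, 2]
>>> jaccard_score(y_true, y_pred, average=None)     # per-class scores
array([1. , 0. , 0.33...])
>>> jaccard_score(y_true, y_pred, average='macro')  # unweighted mean over classes
0.44...
>>> jaccard_score(y_true, y_pred, average='micro')  # global TP / (TP + FP + FN)
0.33...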

gxyd added some commits Nov 5, 2017

show errors and warning before anything
this deals with both multilabel and multiclass problems
fix multilabel classification
labels, sample_weight seem to be working fine, though I haven't
fully tested them yet; will do in the next commit
qinhanmin2014 (Member) commented Mar 5, 2019

Hmm, pep8 error @jnothman :)

qinhanmin2014 (Member) commented Mar 7, 2019

Pinging reviewers from related issues/PRs; I think we need this in 0.21.
@adrinjalali @amueller @glemaitre

ogrisel (Member) approved these changes on Mar 8, 2019 and left a comment:

LGTM overall. I like the consistency with f1 / precision / recall.

However, what I find a bit suboptimal is that we have the combination of the following:

1- The Jaccard index is most useful with multilabel classification problems;
2- average="samples" is the most sensible averaging for this kind of problem (multilabel classification);
3- the default value of average (average="binary") yields an (informative) error for multilabel problems (it is only valid for binary classification problems, for which the Jaccard index is not a commonly used evaluation metric).

Which means that for 99% of the regular use cases for jaccard_score in scikit-learn, the user has to pass the extra verbose jaccard_score(y_multilabel_true, y_multilabel_pred, average="samples").

One alternative would be to use average="samples" by default and make it explicit in the docstring that Jaccard index is most useful for multilabel classification problems.

That would make jaccard_score slightly less consistent with the F/P/R metrics but maybe more intuitive for the users.

WDYT?

In any case I like this PR overall.
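To make the trade-off concrete, a hedged sketch of the multilabel call under discussion (the indicator matrices are invented for illustration and the error wording is approximate; the point is only that average must be passed explicitly under the proposed default):

>>> import numpy as np
>>> from sklearn.metrics import jaccard_score
>>> y_true = np.array([[1, 0, 1], [0, 1, 1]])  # multilabel indicator matrix
>>> y_pred = np.array([[1, 0, 0], [0, 1, 1]])
>>> jaccard_score(y_true, y_pred, average='samples')  # per-sample Jaccard, then averaged
0.75
>>> jaccard_score(y_true, y_pred)  # default average='binary' rejects multilabel input
Traceback (most recent call last):
    ...
ValueError: Target is multilabel-indicator but average='binary'. Please choose another average setting.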

In doc/modules/model_evaluation.rst (outdated):
>>> jaccard_score(y_true[0], y_pred[0]) # doctest: +ELLIPSIS
0.6666...
>>> jaccard_score(y_true, y_pred, average='macro') # doctest: +ELLIPSIS
0.6666...

ogrisel (Member), Mar 8, 2019:

I think it would make the example easier to follow if you inserted the jaccard_score(y_true, y_pred, average=None) case before the average='macro' case.

>>> jaccard_score(y_true, y_pred, average='micro')
0.33...
>>> jaccard_score(y_true, y_pred, average=None)
array([1., 0., 0., 1.])

ogrisel (Member), Mar 8, 2019:

I think this example is a bit weird because the per-class Jaccard indices are either 0. or 1., which is a bit of an edge case. Maybe make the example slightly less specific, e.g.:

>>> jaccard_score(np.array([0, 2, 1, 3]), np.array([0, 1, 1, 3]), average=None)
array([1. , 0.5, 0. , 1. ])

ogrisel (Member), Mar 8, 2019:

Also could you give an intuition as to when it's useful to use the Jaccard index to score a multiclass (or binary) classification problem instead of using accuracy, AUC or F/P/R? It does not seem to be a common practice.

If so maybe we should add a note such as:

The Jaccard index is most useful to score multilabel classification models (with average="samples"). The generalization to binary and multiclass classification problems is provided for the sake of consistency but is not a common practice. Those two kinds of tasks are more commonly evaluated using other metrics such as accuracy, ROC AUC or Precision/Recall/F-score.

jnothman (Author, Member), Mar 11, 2019:

You're right. It's not common practice for evaluation, although Jaccard does have some nice properties that F1 (= Dice) does not, such as being the complement of a true distance metric.

And in whatever context, @bthirion had assumed jaccard_similarity_score would work on binary problems, since many people first know this index for set comparison.

It is, I admit, quite unusual for multiclass... and I'd be happy just to disallow it, and even throw away this PR and just allow multilabel.
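For context on the "complement of a true distance metric" remark: 1 - Jaccard satisfies the triangle inequality, while 1 - Dice (i.e. 1 - F1) does not. A small self-contained sketch of the standard counterexample (the sets are invented for illustration and are not part of the PR):

def jaccard(a, b):
    # |A & B| / |A | B|
    return len(a & b) / len(a | b)

def dice(a, b):
    # 2 |A & B| / (|A| + |B|); equals F1 for set-valued predictions
    return 2 * len(a & b) / (len(a) + len(b))

A, B, C = {1}, {2}, {1, 2}

# Jaccard distance: d(A, B) <= d(A, C) + d(C, B) holds (here and in general)
print(1 - jaccard(A, B), (1 - jaccard(A, C)) + (1 - jaccard(C, B)))  # 1.0 vs 1.0

# Dice dissimilarity: the triangle inequality fails for these sets
print(1 - dice(A, B), (1 - dice(A, C)) + (1 - dice(C, B)))           # 1.0 vs 0.66...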

jnothman (Author, Member), Mar 11, 2019:

FWIW, I think P/R/F are pretty weird for multiclass in some ways too...

qinhanmin2014 (Member), Mar 11, 2019:

> And in whatever context, @bthirion had assumed jaccard_similarity_score would work on binary problems, since many people first know this index for set comparison.

Seems that the wiki page we cite also demonstrates the binary case? (See https://en.wikipedia.org/wiki/Jaccard_index, Similarity of asymmetric binary attributes)

> It is, I admit, quite unusual for multiclass...

Agree. I'm happy to remove, or keep it for the sake of consistency.

> FWIW, I think P/R/F are pretty weird for multiclass in some ways too...

What do you mean? I guess it's reasonable for P/R/F to support multiclass?

>>> jaccard_score(y_true, y_pred, average='weighted')
0.5
>>> jaccard_score(y_true, y_pred, average=None)
array([0., 0., 1.])

ogrisel (Member), Mar 8, 2019:

Here again, maybe better to illustrate with data that yield class-wise Jaccard values that are not all 0 or 1.

In sklearn/metrics/tests/test_classification.py (outdated):
assert_raise_message(ValueError, msg2, jaccard_score, y_true,
y_pred, average='binary')
msg3 = ("Samplewise metrics are not available outside of multilabel "
"classification.")

ogrisel (Member), Mar 8, 2019:

If it's easy enough, can you include the actual type of target that was inferred from y_true and y_pred in the error message?

ogrisel (Member), Mar 8, 2019:

Also maybe the phrasing "Samplewise averaging is not available outside...." would make more sense.

jnothman (Author, Member), Mar 11, 2019:

> Also maybe the phrasing "Samplewise averaging is not available outside...." would make more sense.

The problem with this is only that the error is being raised within multilabel_confusion_matrix which doesn't do the averaging...
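For readers following along, a sketch of the behaviour being discussed, using the error message quoted in the test snippet above (the target values are invented for illustration): requesting average='samples' on a non-multilabel target fails inside multilabel_confusion_matrix, which is why the message says "metrics" rather than "averaging":

>>> from sklearn.metrics import jaccard_score
>>> jaccard_score([0, 1, 2], [0, 2, 1], average='samples')
Traceback (most recent call last):
    ...
ValueError: Samplewise metrics are not available outside of multilabel classification.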

# size(y1 \inter y2) = [1, 2]
# size(y1 \union y2) = [2, 2]

jss = partial(assert_warns, DeprecationWarning, jaccard_similarity_score)

ogrisel (Member), Mar 8, 2019:

nice trick :)
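The trick being praised: functools.partial pre-binds the warning check, so every call of jss both asserts the DeprecationWarning and returns the score. A minimal self-contained sketch of the same pattern, with toy stand-ins for sklearn's assert_warns test helper and for the deprecated function:

import warnings
from functools import partial

def assert_warns(warning_cls, func, *args, **kwargs):
    # Stand-in for the sklearn test helper: call func, assert the warning, return the result.
    with warnings.catch_warnings(record=True) as records:
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    assert any(issubclass(r.category, warning_cls) for r in records), "expected warning not raised"
    return result

def deprecated_metric(y_true, y_pred):
    # Toy stand-in for jaccard_similarity_score.
    warnings.warn("use jaccard_score instead", DeprecationWarning)
    return 0.5

# Every call of `jss` now also checks that the deprecation warning is emitted.
jss = partial(assert_warns, DeprecationWarning, deprecated_metric)
print(jss([0, 1], [0, 1]))  # 0.5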

jnothman (Author, Member) commented Mar 10, 2019

qinhanmin2014 (Member) commented Mar 11, 2019

> Which means that for 99% of the regular use cases for jaccard_score in scikit-learn, the user has to pass the extra verbose jaccard_score(y_multilabel_true, y_multilabel_pred, average="samples").

This is actually the advantage of average="binary", right? (i.e., to make users aware of different average options).

jnothman (Author, Member) commented Mar 12, 2019

Apologies if this doesn't pass CI... I'm having trouble with my build environment.

@@ -1239,8 +1245,12 @@ def _check_set_wise_labels(y_true, y_pred, average, labels, pos_label):
"%r" % (pos_label, present_labels))
labels = [pos_label]
else:
applicable_options = list(average_options)
if y_type == 'multiclass':
applicable_options = applicable_options.remove('samples')

qinhanmin2014 (Member), Mar 12, 2019:

should be applicable_options.remove('samples') :)
I think we can use average_options directly because it's not used later.
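In other words (a sketch of the point being made, not the final diff): list.remove mutates in place and returns None, so the reassignment above discards the list; and since average_options is not reused afterwards, the intermediate copy can be dropped entirely:

# Buggy version from the diff: list.remove() returns None,
# so this reassignment replaces the list with None.
applicable_options = list(average_options)
if y_type == 'multiclass':
    applicable_options = applicable_options.remove('samples')  # -> None

# Sketch of the suggested fix: mutate in place, no extra copy needed.
average_options = list(average_options)  # ensure a mutable list
if y_type == 'multiclass':
    average_options.remove('samples')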

jnothman added some commits Mar 12, 2019

jnothman (Author, Member) commented Mar 12, 2019

What have I missed?

756     >>> y_pred = [0, 2, 1, 2]  # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
757     >>> y_true = [0, 1, 2, 2]
758     >>> jaccard_score(y_true, y_pred, average=None)
Expected:
    array([1. , 0. , 0.33...])
Got:
    array([1.        , 0.        , 0.33333333])
In the multiclass case:
>>> y_pred = [0, 2, 1, 2] # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS

qinhanmin2014 (Member), Mar 12, 2019:

these should be added in L760?

jnothman (Author, Member), Mar 12, 2019:

I keep being told by @adrinjalali that their scope is not just the line

jnothman added some commits Mar 12, 2019

>>> y_pred = [0, 2, 1, 2]
>>> y_true = [0, 1, 2, 2]
>>> jaccard_score(y_true, y_pred, average=None)
array([1., 0., 0.33...])

thomasjpfan (Contributor), Mar 13, 2019:
  >>> jaccard_score(y_true, y_pred, average=None)
  ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
  array([1. , 0. , 0.33...])
  >>> jaccard_score(y_true, y_pred, average='macro')
  0.44...
  >>> jaccard_score(y_true, y_pred, average='micro')
  0.33...

For the micro case, total true positives = 2, and total true positives + false positives + false negatives = 6, so micro Jaccard = 2/6 = 0.33....
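A quick check of those numbers (a small sketch, tallying per-class counts from the example above):

y_true = [0, 1, 2, 2]
y_pred = [0, 2, 1, 2]

tp = fp = fn = 0
for c in {0, 1, 2}:
    tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))

print(tp, tp + fp + fn)     # 2 6
print(tp / (tp + fp + fn))  # 0.333..., the micro-averaged Jaccard score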

jnothman (Author, Member), Mar 13, 2019:

Yes, I was lazy and fudged this while waiting for CI output, because I'm not actually able to compile and use the repo on my laptop atm :\

C compiler problems after migrating to this laptop, which I've not yet worked out how to resolve.

qinhanmin2014 (Member) commented Mar 13, 2019

@jnothman tests are still failing; @thomasjpfan's solution (#13151 (comment)) seems reasonable and works locally.

jnothman (Author, Member) commented Mar 13, 2019

So shall we merge?

qinhanmin2014 merged commit 19c8af6 into scikit-learn:master on Mar 13, 2019

9 checks passed

LGTM analysis: C/C++ - No code changes detected
LGTM analysis: JavaScript - No code changes detected
LGTM analysis: Python - No new or fixed alerts
ci/circleci: deploy - Your tests passed on CircleCI!
ci/circleci: doc - Your tests passed on CircleCI!
ci/circleci: doc-min-dependencies - Your tests passed on CircleCI!
ci/circleci: lint - Your tests passed on CircleCI!
codecov/patch - 99.35% of diff hit (target 91.85%)
codecov/project - 92.38% (+0.52%) compared to cd1cb30
jnothman (Author, Member) commented Mar 13, 2019

Sigh of relief. Thanks @gxyd for all your work on this!
