
FIX f1_score with zero_division=1 uses directly confusion matrix statistic #27577

Merged
merged 19 commits into scikit-learn:main on Dec 11, 2023

Conversation

glemaitre
Member

@glemaitre glemaitre commented Oct 12, 2023

Fixes #26965
Fixes #27189
Fixes #27165

I'm opening this PR because I was not able to fix the remaining issue in #27165.

Fix the behaviour of zero_division in f1_score by using the formulation based on confusion-matrix statistics instead of precision/recall, where a division by zero might already have happened.
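
For context, here is a minimal sketch (my own illustration, not scikit-learn's actual code) of why the count-based formulation matters: F1 can be written as 2*TP / (2*TP + FP + FN), which involves a single division, whereas going through precision and recall involves up to three divisions, each of which can hit 0/0 and have zero_division substituted before F1 is ever formed.

# Minimal sketch (not scikit-learn's implementation) of the two formulations.
# tp, fp, fn are per-class confusion-matrix counts.
def f1_from_precision_recall(tp, fp, fn, zero_division=1.0):
    # Precision/recall route: three places where a 0/0 can occur and
    # zero_division gets substituted before F1 is computed.
    precision = tp / (tp + fp) if (tp + fp) else zero_division
    recall = tp / (tp + fn) if (tp + fn) else zero_division
    denom = precision + recall
    return 2 * precision * recall / denom if denom else zero_division

def f1_from_counts(tp, fp, fn, zero_division=1.0):
    # Confusion-matrix route: F1 = 2*TP / (2*TP + FP + FN), a single division
    # that is only undefined when the class has no samples at all.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else zero_division

# One false positive and one false negative, no true positives:
# precision = recall = 0, so the precision/recall route hits 0/0 in the final
# division and zero_division=1 wrongly turns F1 into 1.0; the count-based
# formula correctly returns 0.0.
print(f1_from_precision_recall(tp=0, fp=1, fn=1))  # 1.0
print(f1_from_counts(tp=0, fp=1, fn=1))            # 0.0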

@github-actions

github-actions bot commented Oct 12, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6c77c9f.

@OmarManzoor
Contributor

@glemaitre Thank you for handling this.

@glemaitre
Member Author

I'll ping you when it's ready to be reviewed, @OmarManzoor :)

@glemaitre
Member Author

@OmarManzoor this is fine on my end. I think we can move forward with a review.
Since the F-score can now be defined even when the precision or recall are not, the warning message to show is simpler: we only need to raise an independent message for each metric.
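
To make that concrete, a small hand-worked example (my own, not taken from the PR): for a class with TP = 0, FP = 2 and FN = 0, the recall is 0/0 and undefined, but 2*TP + FP + FN = 2, so the F-score is well defined and equals 0; only the recall warning needs to be raised in that case.

# Hand-worked case: recall is undefined (0/0) but F1 is not.
tp, fp, fn = 0, 2, 0
# recall = tp / (tp + fn) -> 0/0, so zero_division would apply to recall,
# while F1 = 2*tp / (2*tp + fp + fn) = 0/2 = 0.0 needs no substitution.
print(2 * tp / (2 * tp + fp + fn))  # 0.0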

Contributor

@OmarManzoor OmarManzoor left a comment


Thanks @glemaitre. This looks good! The codecov patch check is failing, though. Do we need to check that?

@glemaitre
Copy link
Member Author

> Do we need to check that?

I checked and this is a false positive. I even added new cases, and I don't see any new lines that are not covered. I assume this is some kind of relative coverage.

@glevv
Contributor

glevv commented Nov 1, 2023

But why is codecov calculating coverage for the tests? Maybe there should be an omit line in the .coveragerc?

Codecov says that there was only a partial hit on the warn_for if. My guess is that not all combinations of average and metric were checked.

@glemaitre
Member Author

> Codecov says that there was only a partial hit on the warn_for if. My guess is that not all combinations of average and metric were checked.

The weird thing is that it should not be linked to the patch, because we cover at least as much as before. I would also find it strange that we don't have all combinations, but I will have a look.

@boyleconnor
Contributor

@glemaitre @OmarManzoor FYI this bug was not limited to binary classification cases, e.g. this:

from sklearn.metrics import classification_report

# Macro avg f-1 should be 1/3 (the "neutral" class has 100% precision & recall):
print(classification_report(
    y_true=[0, 2], y_pred=[2, 0],
    labels=[0, 1, 2], target_names=["negative", "neutral", "positive"],
    zero_division=1.0
))

gave this (erroneous) report (on version 1.3.2):

Broken output
              precision    recall  f1-score   support

    negative       0.00      0.00      1.00       1.0
    positive       0.00      0.00      1.00       1.0

    accuracy                           1.00       2.0
   macro avg       0.00      0.00      1.00       2.0
weighted avg       0.00      0.00      1.00       2.0

however, the same code gave the correct answer when I ran it on your updated branch (db791f7):

Fixed output
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       1.0
     neutral       1.00      1.00      1.00       0.0
    positive       0.00      0.00      0.00       1.0

   micro avg       0.00      0.00      0.00       2.0
   macro avg       0.33      0.33      0.33       2.0
weighted avg       0.00      0.00      0.00       2.0

so it might be worth changing the title of this PR to reflect that; I encountered this bug in my own work and almost opened a new issue for it.

@glemaitre
Member Author

Thanks. I will also change the entry in the changelog.

@glemaitre glemaitre changed the title FIX f1_score with zero_division=1 on binary classes FIX f1_score with zero_division=1 uses directly confusion matrix statistic Dec 8, 2023
@glemaitre glemaitre added this to the 1.4 milestone Dec 8, 2023
Review comment on doc/whats_new/v1.4.rst (outdated; resolved)
Member

@betatim betatim left a comment


Looks reasonable to me.

Maybe the coverage goes down because we removed lines that were covered and kept lines that weren't covered?

@boyleconnor
Contributor

This PR also appears to fix a similar issue that was occurring when zero_division=np.nan.

Here's the bad behavior on 1.3.2:

>>> import numpy as np
>>> from sklearn.metrics import classification_report
>>> print(classification_report([0, 0, 1, 1, 2, 2], [0, 0, 0, 3, 3, 1], labels=list(range(5)), zero_division=np.nan))
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       0.00      0.00       nan         2
           2        nan      0.00       nan         2
           3       0.00       nan       nan         0
           4        nan       nan       nan         0

   micro avg       0.33      0.33      0.33         6
   macro avg       0.22      0.33      0.80         6
weighted avg       0.33      0.33      0.80         6

and here's the fixed behavior as of commit db791f7:

>>> print(classification_report([0, 0, 1, 1, 2, 2], [0, 0, 0, 3, 3, 1], labels=list(range(5)), zero_division=np.nan))
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       0.00      0.00      0.00         2
           2        nan      0.00      0.00         2
           3       0.00       nan      0.00         0
           4        nan       nan       nan         0

   micro avg       0.33      0.33      0.33         6
   macro avg       0.22      0.33      0.20         6
weighted avg       0.33      0.33      0.27         6

glemaitre and others added 2 commits December 11, 2023 12:08
@glemaitre
Member Author

Looking at the codecov report, this is a false positive.

@glemaitre glemaitre merged commit 3b06962 into scikit-learn:main Dec 11, 2023
26 of 27 checks passed
@jnothman
Member

As this is a regression, should it be backported into a patch release for 1.3, @glemaitre?

@glemaitre
Member Author

We are going to release 1.4 this week or next week since we did an RC at the end of last year.

@boyleconnor
Contributor

@jnothman I agree a patch to 1.3 would be justified.

I also don't think I've seen anybody in the discussions on this repo point out that this is a particularly bad kind of error, in that it can be subtly wrong and go undetected, especially with highly multiclass classification, e.g.:

>>> sklearn.__version__
'1.3.0'
>>> sklearn.metrics.f1_score(y_true=list(range(104)), y_pred=list(range(100)) + [101, 102, 103, 104], average='macro', zero_division=1.0)
0.9809523809523809

vs

>>> sklearn.__version__
'1.2.2'
>>> sklearn.metrics.f1_score(y_true=list(range(104)), y_pred=list(range(100)) + [101, 102, 103, 104], average='macro', zero_division=1.0)
0.9523809523809523

@glemaitre
Member Author

> @jnothman I agree a patch to 1.3 would be justified.

I don't see the benefit of backporting this into a dedicated bug-fix release of 1.3 while we are releasing 1.4. One will have to upgrade anyway, and upgrading from 1.3 to 1.4 does not introduce any backward-incompatible changes. In addition, upgrading to 1.4 will bring other features and bug fixes.

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024
…istic (scikit-learn#27577)

Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Co-authored-by: Tim Head <betatim@gmail.com>

Successfully merging this pull request may close these issues.

F1 score not calculated properly
Wrong behaviour when calculating f1_score with zero_division=1
6 participants