geometric_mean_score inside classification_report_imbalanced is wrong? #394

Closed
klizter opened this Issue Jan 20, 2018 · 5 comments

@klizter
Contributor

klizter commented Jan 20, 2018

Hey,

I was getting different values for the geometric mean score when calling geometric_mean_score directly versus reading it from classification_report_imbalanced, and I was curious why. It looks like a few parameters are mixed up inside classification_report_imbalanced.

geometric_mean_score wants y_true before y_pred:

imblearn.metrics.geometric_mean_score(y_true, y_pred, labels=None, pos_label=1, average='multiclass', sample_weight=None, correction=0.0)
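For example, a minimal sketch of the documented call order on toy (made-up) data:

from imblearn.metrics import geometric_mean_score

# Hypothetical binary labels; y_true comes first, y_pred second,
# matching the documented signature above.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(geometric_mean_score(y_true, y_pred))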

However, inside classification_report_imbalanced:

geo = geometric_mean_score(
    y_pred,       # note: y_pred is passed where y_true is expected
    y_true,       # and y_true where y_pred is expected
    labels=labels,
    average=None,
    sample_weight=None)

And likewise for the index balanced geometric mean:

# Index balanced accuracy
iba_gmean = make_index_balanced_accuracy(
    alpha=alpha, squared=True)(geometric_mean_score)
iba = iba_gmean(
    y_pred,       # swapped here as well
    y_true,
    labels=labels,
    average=None,
    sample_weight=sample_weight)
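(For context, make_index_balanced_accuracy is a factory that wraps a scoring function and applies the index balanced accuracy correction to its output. A minimal standalone sketch with made-up data, mirroring the call above:)

from imblearn.metrics import geometric_mean_score, make_index_balanced_accuracy

# Made-up labels, just to show the wrapping pattern
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

# The returned callable accepts the same arguments as the wrapped metric
iba_gmean = make_index_balanced_accuracy(alpha=0.1, squared=True)(geometric_mean_score)
print(iba_gmean(y_true, y_pred, average=None))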

On average, the results are quite different:

PRE     ++ REC     ++ SPE     ++ F1      ++ GEO     ++ IBA
  0.496 ##   0.485 ##   0.904 ##   0.462 ##   0.637 ##   0.407
                   pre       rec       spe        f1       geo       iba       sup
...
avg / total       0.50      0.48      0.90      0.46      0.68      0.45      2800

After changing the parameter order, I was able to get consistent results.

Versions:

>>> ('Python', '2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) \n[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]')
>>> ('NumPy', '1.13.3')
>>> ('SciPy', '1.0.0')
>>> ('Scikit-Learn', '0.19.0')
>>> ('Imbalanced-Learn', '0.3.1')


@glemaitre

Member

glemaitre commented Jan 22, 2018

Can you provide a stand-alone script to reproduce the problem? Please read https://stackoverflow.com/help/mcve. I am confused by your bug report.

Are you sure that the same average was used?

@klizter

Contributor

klizter commented Jan 23, 2018

classification_report_imbalanced calls geometric_mean_score with y_pred in place of y_true and vice versa. The output is different when these two parameters are switched. In the documentation (and looking at how geometric_mean_score is implemented), y_true is the first parameter and y_pred the second.

classification_report_imbalanced should call geometric_mean_score with y_true as the first argument and y_pred as the second, since the API specifies that order and the results differ otherwise.

Edit: It seems the parameter order mismatch is not the main reason I'm getting inconsistent results (I was not able to reproduce the same problem in isolation). I'll report back when I know more. Thanks!

@klizter

Contributor

klizter commented Jan 23, 2018

Wait! This might confirm that the order of y_true and y_pred has to be correct:

import numpy as np
from imblearn.metrics import geometric_mean_score
from sklearn.utils.multiclass import unique_labels


# Random multiclass labels: if argument order didn't matter, both calls
# below would return identical per-class scores.
y_pred = np.random.randint(25, size=2000)
y_true = np.random.randint(25, size=2000)
labels = unique_labels(y_true, y_pred)

geo_pt = geometric_mean_score(y_pred, y_true, labels=labels, average=None, sample_weight=None)
geo_tp = geometric_mean_score(y_true, y_pred, labels=labels, average=None, sample_weight=None)

print(geo_pt)
print(geo_tp)

print(np.average(geo_pt))
print(np.average(geo_tp))
→ python script.py
[ 0.19193913  0.26226263  0.22981553  0.11717751  0.11349908  0.23257081
  0.16934315  0.22731255  0.23107233  0.17646221  0.16535631  0.14142755
  0.1565899   0.27222751  0.19445269  0.1824145   0.14627426  0.2151913
  0.11958237  0.17762588  0.25085307  0.23654274  0.11802481  0.11393099
  0.3001162 ]
[ 0.18524655  0.27342537  0.24588147  0.11263222  0.10280463  0.21830842
  0.15739718  0.24537299  0.22099909  0.17927076  0.15064496  0.15616334
  0.14944024  0.28286225  0.18534714  0.20246446  0.1616976   0.22034215
  0.10851487  0.18349999  0.28022071  0.24939906  0.11266257  0.11177659
  0.26637954]
0.189682600417
0.190510166464
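(The two per-class arrays differ because swapping the arguments changes which array is treated as ground truth, and per-class recall, which the G-mean is built from, is not symmetric in its inputs. A minimal sketch with made-up data using scikit-learn's recall_score:)

import numpy as np
from sklearn.metrics import recall_score

# Made-up labels: swapping y_true and y_pred swaps each class's
# denominator (the true members of that class), so recall changes.
y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0, 1, 1, 1])

print(recall_score(y_true, y_pred, average=None))  # [0.333..., 1.0]
print(recall_score(y_pred, y_true, average=None))  # [1.0, 0.333...]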
@glemaitre

Member

glemaitre commented Jan 23, 2018

classification_report_imbalanced calls geometric_mean_score with y_pred in place of y_true and vice versa. The output is different when these two parameters are switched. In the documentation (and looking at how geometric_mean_score is implemented), y_true is the first parameter and y_pred the second.

Oh, I see. Thanks for reporting. You are right, those two lines need to be exchanged:

geo_mean = geometric_mean_score(
    y_pred,
    y_true,

iba = iba_gmean(
    y_pred,
    y_true,
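For reference, a sketch of the corrected calls, keeping the other arguments from the report snippets quoted above:

geo_mean = geometric_mean_score(
    y_true,
    y_pred,
    labels=labels,
    average=None,
    sample_weight=None)

iba = iba_gmean(
    y_true,
    y_pred,
    labels=labels,
    average=None,
    sample_weight=sample_weight)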

Do you wish to make a PR to solve the issue?

@klizter

Contributor

klizter commented Jan 23, 2018

Great! I'll make a PR
