BUG: cross_val_score fails with scoring='neg_log_loss' when not all labels are predicted #9144

Closed
JohnNapier opened this Issue Jun 16, 2017 · 10 comments

JohnNapier commented Jun 16, 2017

Suppose we wish to classify some examples where one class is relatively rare.

import numpy as np
import pandas as pd
from random import choice
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 100 samples, 3 features; labels are -1/+1 except for a single rare 0
df0 = pd.DataFrame(np.random.normal(size=(100, 3)))
df1 = pd.Series([choice([-1, 1]) for i in range(df0.shape[0])])
df1.iloc[-1] = 0
df2 = pd.Series(np.ones(df0.shape[0]) / df0.shape[0])  # uniform sample weights

clf = RandomForestClassifier(n_estimators=10)
score = cross_val_score(clf, X=df0, y=df1, cv=10, scoring='neg_log_loss',
                        fit_params={'sample_weight': df2.values}).mean()

Running this raises the error:

*** ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [-1  0  1]

This is a known problem for log_loss, which was fixed by adding a labels argument to that function: #4033

The problem above is that cross_val_score invokes that function but does not allow us to pass the labels. Ideally, cross_val_score should be able to infer the full set of true labels from y=df1.

sklearn version '0.18.1'
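
For reference, the metric itself already handles a missing class once labels is supplied; a minimal sketch with made-up probabilities, assuming the probability columns correspond to classes [-1, 0, 1]:

import numpy as np
from sklearn.metrics import log_loss

y_true = [-1, 1, 1]                  # a fold in which class 0 never appears
prob = np.array([[0.7, 0.1, 0.2],    # columns correspond to classes [-1, 0, 1]
                 [0.1, 0.1, 0.8],
                 [0.2, 0.2, 0.6]])
print(log_loss(y_true, prob, labels=[-1, 0, 1]))  # works; omitting labels raises the ValueError above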

JohnNapier commented Jun 16, 2017

I think something like this might fix it:

def cross_val_negLogLoss(clf, X, y, cv, sample_weight=None):
    import numpy as np
    from sklearn.metrics import log_loss
    xHL = []
    for train, test in cv.split(X=X):
        w_train = None if sample_weight is None else sample_weight.iloc[train].values
        w_test = None if sample_weight is None else sample_weight.iloc[test].values
        fit = clf.fit(X=X.iloc[train, :], y=y.iloc[train], sample_weight=w_train)
        prob = fit.predict_proba(X.iloc[test, :])
        # pass the classes seen during fit, so log_loss does not have to
        # infer them from the (possibly incomplete) test fold
        xHL.append(log_loss(y.iloc[test], prob, sample_weight=w_test,
                            labels=clf.classes_))
    return -np.array(xHL)

We just need cross_val_score to pass labels=clf.classes_ as an argument to log_loss.

As a side comment, it seems dangerous for the code to assume labels. Labels should not be an optional argument, particularly within a cross-validation function, where we have no guarantee that the testing set contains the same classes as the training set. For example, suppose that the training set contains [-1,0] and the testing set contains [0,1]. The code will run, silently treating the probability column reported for "-1" as if it were the probability of "0".

Scary, isn't it?
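
A minimal sketch of that silent misalignment, with hypothetical numbers:

import numpy as np
from sklearn.metrics import log_loss

# classifier trained on [-1, 0]: predict_proba columns are [P(-1), P(0)]
prob = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
y_test = [0, 1]  # but the test fold contains classes [0, 1]

# without `labels`, log_loss infers the classes from y_test and silently
# treats column 0 as P(0) and column 1 as P(1) -- a mismatch
print(log_loss(y_test, prob))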

amueller (Member) commented Jun 18, 2017

We just need cross_val_score to pass labels=clf.classes_ as an argument to log_loss.

"just" ;) I think this is a pretty tricky issue, and I'm pretty sure I opened an issue to track it.
Right now I can only find #5097 that is only tangentially related. Basically the scorer objects should look at the classes_ attribute of the estimator. I don't think an argument to cross_val_score is the right way to go.
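
For illustration, a hedged sketch of what "look at the classes_ attribute" could mean in practice (not scikit-learn's actual implementation; NegLogLossScorer is a made-up name):

from sklearn.metrics import log_loss

class NegLogLossScorer:
    # a scorer callable with the (estimator, X, y) signature that reads the
    # fitted estimator's classes_ instead of inferring labels from the fold
    def __call__(self, estimator, X, y):
        prob = estimator.predict_proba(X)
        return -log_loss(y, prob, labels=estimator.classes_)

# usable wherever a scoring callable is accepted, e.g.
# cross_val_score(clf, df0, df1, cv=10, scoring=NegLogLossScorer())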

amueller (Member) commented Jun 18, 2017

Found the right issue: #6231. It wasn't by me after all!

jnothman (Member) commented Jun 18, 2017

amueller (Member) commented Jun 19, 2017

@jnothman hm, I guess that makes sense for the training set score, or when the CV doesn't use all the data. That's really tricky, though, right? How would that be possible with our API? At the creation of the scorer? And in the default case, do we do it at check_scorer?

amueller (Member) commented Jun 19, 2017

Close this as a duplicate of #6231 and discuss there?

jnothman (Member) commented Jun 19, 2017

Yes.

mlewis1729 (Contributor) commented Sep 21, 2017

Just wanted to follow up on this discussion: has there been any progress on solving the issue with log loss?

amueller (Member) commented Sep 21, 2017

@mlewis1729 check out #6231 and #9585. In the meantime, you can specify the labels explicitly.
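
For concreteness, a hedged sketch of that workaround using the data from the original report: make_scorer forwards extra keyword arguments to the metric, so the full label set can be fixed up front. Note this still assumes every training fold sees all classes, since predict_proba must return one column per label.

import numpy as np
from sklearn.metrics import log_loss, make_scorer

neg_log_loss_all = make_scorer(log_loss, greater_is_better=False,
                               needs_proba=True, labels=np.unique(df1))
score = cross_val_score(clf, X=df0, y=df1, cv=10, scoring=neg_log_loss_all,
                        fit_params={'sample_weight': df2.values}).mean()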

John-Almardeny commented Aug 8, 2018

What I ended up doing is providing the labels explicitly like this:

def loss_scorer(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None):
    # always score against the full label set, not just the labels in this fold
    return log_loss(y_true, y_pred, eps=eps, normalize=normalize,
                    sample_weight=sample_weight, labels=__LABELS__)

Then use it like this:

global __LABELS__
iris = datasets.load_iris()
__LABELS__ = list(set(iris.target))
.......
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5)
clf = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv,
                   scoring=make_scorer(loss_scorer, greater_is_better=False, needs_proba=True))

Notes:

  1. This is less likely to work with very high n_splits values.
  2. Each class should contain at least 2 examples.
  3. Preferably, use RepeatedStratifiedKFold to roughly ensure that each fold is representative of all strata of the data.