
MSE is negative when returned by cross_val_score #2439

Closed
tdomhan opened this issue Sep 12, 2013 · 55 comments · Fixed by #7261
Comments

@tdomhan tdomhan commented Sep 12, 2013

The mean squared error returned by sklearn.cross_validation.cross_val_score is always negative. While this is a deliberate design decision, so that the output of the function can be maximized given some hyperparameters, it's extremely confusing when using cross_val_score directly. At least I asked myself how the mean of a square can possibly be negative, and thought that cross_val_score was not working correctly or was not using the supplied metric. Only after digging into the sklearn source code did I realize that the sign was flipped.

This behavior is mentioned in make_scorer in scorer.py, but it's not mentioned in cross_val_score, and I think it should be, because otherwise it leads people to think that cross_val_score is not working correctly.
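To see the confusion concretely (a sketch assuming a current scikit-learn, where the scoring string itself was later renamed to make the flip explicit, and synthetic data standing in for a real dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The scorer negates the MSE so that "greater is better" holds uniformly.
scores = cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error", cv=5)
print(scores)          # every entry is <= 0
print(-scores.mean())  # flip the sign back to get the actual mean MSE
```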

@jaquesgrobler jaquesgrobler commented Sep 12, 2013

You're referring to

greater_is_better : boolean, default=True

Whether score_func is a score function (default), meaning high is good, 
or a loss function, meaning low is good. In the latter case, the scorer 
object will sign-flip the outcome of the score_func.

in http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html
? (just for reference's sake)
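The flip is easy to see in isolation (a small sketch with made-up data; mean_squared_error and make_scorer are the real sklearn functions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])

est = LinearRegression().fit(X, y)

# greater_is_better=False makes the scorer return -MSE, not MSE.
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
mse = mean_squared_error(y, est.predict(X))
print(mse)                        # positive, as expected from the metric
print(neg_mse_scorer(est, X, y))  # the same value, sign-flipped
```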

I agree that this could be made clearer in the cross_val_score docs.

Thanks for reporting

@ogrisel ogrisel commented Sep 12, 2013

Indeed we overlooked that issue when doing the Scorer refactoring. The following is very counter-intuitive:

>>> import numpy as np
>>> from sklearn.datasets import load_boston
>>> from sklearn.linear_model import RidgeCV
>>> from sklearn.cross_validation import cross_val_score

>>> boston = load_boston()
>>> np.mean(cross_val_score(RidgeCV(), boston.data, boston.target, scoring='mean_squared_error'))
-154.53681864311497

/cc @larsmans

@ogrisel ogrisel commented Sep 12, 2013

BTW I don't agree that it's a documentation issue: cross_val_score should return the value with the sign that matches the scoring name. Ideally GridSearchCV(*params).fit(X, y).best_score_ should be consistent too. Otherwise the API is very confusing.

@tdomhan tdomhan commented Sep 12, 2013

I also agree that changing this to return the actual MSE, without the sign switched, would be the better option.

The scorer object could just store the greater_is_better flag and whenever the scorer is used the sign could be flipped in case it's needed, e.g. in GridSearchCV.

@larsmans larsmans commented Sep 13, 2013

I agree that we have a usability issue here, but I don't fully agree with @ogrisel's solution that we should

return the value with the sign that matches the scoring name

because that's an unreliable hack in the long run. What if someone defines a custom scorer with a name such as mse? What if they do follow the naming pattern but wrap the scorer in a decorator that changes the name?

The scorer object could just store the greater_is_better flag and whenever the scorer is used the sign could be flipped in case it's needed, e.g. in GridSearchCV.

This is what scorers originally did, during development between the 0.13 and 0.14 releases, and it made their definition a lot harder. It also made the code hard to follow, because the greater_is_better attribute seemed to disappear in the scorer code, only to reappear in the middle of the grid search code. A special Scorer class was needed to do something that, ideally, a simple function would do.

I believe that if we want to optimize scores, then they should be maximized. For the sake of user-friendliness, I think we might introduce a parameter score_is_loss taking "auto", True, or False, which only changes the display of scores and can use a heuristic based on the built-in names.
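A rough sketch of how such a display-only heuristic could work (all names here are hypothetical, not sklearn API):

```python
# Built-in metrics where lower is better; used only by the "auto" heuristic.
KNOWN_LOSSES = {"mean_squared_error", "mean_absolute_error",
                "median_absolute_error", "log_loss"}

def reported_scores(raw_scores, scoring, score_is_loss="auto"):
    """Return scores for display; raw_scores are the maximizer-internal values."""
    if score_is_loss == "auto":
        score_is_loss = scoring in KNOWN_LOSSES
    return [-s for s in raw_scores] if score_is_loss else list(raw_scores)

print(reported_scores([-154.5, -160.2], "mean_squared_error"))  # [154.5, 160.2]
print(reported_scores([0.93, 0.95], "accuracy"))                # [0.93, 0.95]
```

Internally everything is still maximized; only what the user sees changes.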

@larsmans larsmans commented Sep 13, 2013

That was a hurried response because I had to get off the train. What I meant by "display" is really the return value from cross_val_score. I think scorers should be simple and uniform and the algorithms should always maximize.

This does introduce an asymmetry between built-in and custom scorers.

Ping @GaelVaroquaux.

@jaquesgrobler jaquesgrobler commented Sep 13, 2013

I like the score_is_loss solution, or something to that effect. The sign change to match the scoring name seems hard to maintain and could cause problems, as @larsmans mentioned.

@tdomhan tdomhan commented Sep 28, 2013

What's the conclusion? Which solution should we go for? :)

@amelio-vazquez-reina amelio-vazquez-reina commented Oct 23, 2013

@tdomhan @jaquesgrobler @larsmans Do you know if this applies to r2 as well? I am noticing that the r2 scores returned by GridSearchCV are also mostly negative for ElasticNet, Lasso and Ridge.

@larsmans larsmans commented Oct 23, 2013

R² can be either positive or negative, and negative simply means your model is performing very poorly.

@jnothman jnothman commented Jan 17, 2014

IIRC, @GaelVaroquaux was a proponent of returning a negative number when greater_is_better=False.

@larsmans larsmans commented Jan 17, 2014

r2 is a score function (greater is better), so that should be positive if your model is any good -- but it's one of the few performance metrics that can actually be negative, meaning worse than 0.
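Concretely, R² = 1 - SS_res/SS_tot, so any model whose squared error exceeds that of simply predicting the mean scores below zero. A quick numpy check:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0])
print(r2(y, y))                          # 1.0: perfect predictions
print(r2(y, np.full(3, y.mean())))       # 0.0: exactly as good as predicting the mean
print(r2(y, np.array([3.0, 2.0, 1.0])))  # -3.0: worse than predicting the mean
```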

@mblondel mblondel commented Feb 4, 2014

What is the consensus on this issue? In my opinion, cross_val_score is an evaluation tool, not a model selection one. It should thus return the original values.

I can fix it in my PR #2759, since the changes I made make it really easy to fix. The trick is to not flip the sign upfront but, instead, to access the greater_is_better attribute on the scorer when doing grid search.
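A minimal sketch of that trick (hypothetical names, not the actual PR code): the scorer reports the original value, and only the selection step consults the stored sign:

```python
class Scorer:
    """Wraps a metric; reports original values, remembers the direction."""
    def __init__(self, metric, greater_is_better=True):
        self._metric = metric
        self._sign = 1 if greater_is_better else -1

    def __call__(self, y_true, y_pred):
        return self._metric(y_true, y_pred)  # no sign flip here

def best_index(scores, scorer):
    # Grid search flips only when *comparing*, so reported scores stay intuitive.
    return max(range(len(scores)), key=lambda i: scorer._sign * scores[i])

mse = lambda yt, yp: sum((a - b) ** 2 for a, b in zip(yt, yp)) / len(yt)
mse_scorer = Scorer(mse, greater_is_better=False)
print(best_index([2.5, 0.7, 1.9], mse_scorer))  # 1: the smallest loss wins
```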

@GaelVaroquaux GaelVaroquaux commented Feb 4, 2014

What is the consensus on this issue? In my opinion, cross_val_score is
an evaluation tool, not a model selection one. It should thus return
the original values.

Special cases and varying behaviors are a source of problems in software.

I simply think that we should rename "mse" to "negated_mse" in the list
of acceptable scoring strings.

@mblondel mblondel commented Feb 4, 2014

What if someone defines a custom scorer with a name such as mse? What if they do follow the naming pattern but wrap the scorer in a decorator that changes the name?

I don't think that @ogrisel was suggesting to use name matching, just to be consistent with the original metric. Correct me if I'm wrong @ogrisel.

@mblondel mblondel commented Feb 4, 2014

I simply think that we should rename "mse" to "negated_mse" in the list of acceptable scoring strings.

That's completely unintuitive if you don't know the internals of scikit-learn. If you have to bend the system like that, I think it's a sign that there's a design problem.

@GaelVaroquaux GaelVaroquaux commented Feb 4, 2014

That's completely unintuitive if you don't know the internals of scikit-learn.
If you have to bend the system like that, I think it's a sign that there's a
design problem.

I disagree. Humans understand things with a lot of prior knowledge and
context. They are anything but systematic. Trying to embed this in software
gives a shopping-list-like set of special cases. Not only does it make the
software hard to maintain, it also means that people who do not have
those exceptions in mind run into surprising behaviors and write buggy
code using the library.

@mblondel mblondel commented Feb 4, 2014

What special case do you have in mind?

To be clear, I think that the cross-validation scores stored in the GridSearchCV object should also be the original values (not with sign flipped).

AFAIK, flipping the sign was introduced so as to make the grid search implementation a little simpler but was not supposed to affect usability.

@GaelVaroquaux GaelVaroquaux commented Feb 4, 2014

What special case do you have in mind?

Well, the fact that for some metrics bigger is better, whereas for others
it is the opposite.

AFAIK, flipping the sign was introduced so as to make the grid search
implementation a little simpler but was not supposed to affect
usability.

It's not about grid search; it's about separation of concerns: scores
need to be usable without knowing anything about them, or else code to
deal with their specificities will spread through the whole codebase. There is
already a lot of scoring code.

@mblondel mblondel commented Feb 4, 2014

But that's somewhat postponing the problem to user code. Nobody wants to plot "negated MSE" so users will have to flip signs back in their code. This is inconvenient, especially for multiple-metric cross-validation reports (PR #2759), as you need to handle each metric individually. I wonder if we can have the best of both worlds: generic code and intuitive results.

@GaelVaroquaux GaelVaroquaux commented Feb 4, 2014

But that's somewhat postponing the problem to user code. Nobody wants
to plot "negated MSE" so users will have to flip signs back in their
code.

Certainly not the end of the world. Note that when reading papers or
looking at presentations I have the same problem: when the graph is not
well done, I lose a little bit of time and mental bandwidth trying to
figure out whether bigger is better or not.

This is inconvenient, especially for multiple-metric cross-validation
reports (PR #2759), as you need to handle each metric individually.

Why? If you just accept that it's always "bigger is better", it makes
everything easier, including the interpretation of results.

I wonder if we can have the best of both worlds: generic code and
intuitive results.

The risk is to have very complex code that slows us down for maintainance
and development. Scikit-learn is picking up weight.

@mblondel mblondel commented Feb 4, 2014

If you just accept that it's always bigger is better

That's what she said :)

More seriously, I think one reason this is confusing people is because the output of cross_val_score is not consistent with the metrics. If we follow your logic, all metrics in sklearn.metrics should follow "bigger is better".

@GaelVaroquaux GaelVaroquaux commented Feb 4, 2014

That's what she said :)

Nice one!

More seriously, I think one reason this is confusing people is because
the output of cross_val_score is not consistent with the metrics. If we
follow your logic, all metrics in sklearn.metrics should follow "bigger
is better".

Agreed. That's why I like the idea of changing the name: it would
jump out at people.

@jnothman jnothman commented Feb 4, 2014

More seriously, I think one reason this is confusing people is because the output of cross_val_score is not consistent with the metrics.

And this in turn makes scoring seem more mysterious than it is.

@Huitzilo Huitzilo commented May 20, 2015

Got bitten by this today in 0.16.1 when trying to do linear regression. While the sign of the score is apparently not flipped anymore for classifiers, it is still flipped for linear regression. To add to the confusion, LinearRegression.score() returns a non-flipped version of the score.

I'd suggest to make it all consistent and return the non-sign-flipped score for linear models as well.

Example:

from sklearn import linear_model
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
from sklearn import datasets
iris = datasets.load_iris()
nb = GaussianNB()
scores = cross_validation.cross_val_score(nb, iris.data, iris.target)
print("NB score:\t  %0.3f" % scores.mean() )

iris_reg_data = iris.data[:,:3]
iris_reg_target = iris.data[:,3]
lr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(lr, iris_reg_data, iris_reg_target)
print("LR score:\t %0.3f" % scores.mean() )

lrf = lr.fit(iris_reg_data, iris_reg_target)
score = lrf.score(iris_reg_data, iris_reg_target)
print("LR.score():\t  %0.3f" % score )

This gives:

NB score:     0.934    # sign is not flipped
LR score:    -0.755    # sign is flipped
LR.score():   0.938    # sign is not flipped
@amueller amueller commented May 20, 2015

Cross-validation flips the sign for all scorers where greater is not better. I still disagree with this decision. I think the main proponents of it were @GaelVaroquaux and maybe @mblondel [I remember you refactoring the scorer code].

@amueller amueller commented May 20, 2015

Oh never mind, all the discussion is above.
I feel flipping the sign by default in mse and r2 is even less intuitive :-/

@ogrisel ogrisel commented Jun 2, 2015

r2 can be negative (for bad models). It cannot be larger than 1.

You are probably overfitting. Try:

from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# X and target as defined in the previous comment
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)

pred_train = model.predict(X_train)
print("train r2: %f" % r2_score(y_train, pred_train))

pred_test = model.predict(X_test)
print("test r2: %f" % r2_score(y_test, pred_test))

Try with different values for the random_state integer seed that controls the random split.

@GaelVaroquaux GaelVaroquaux commented Jun 3, 2015

@amueller amueller commented Jun 3, 2015

Does that solve all problems? Are there other scores where greater is not better?

@larsmans larsmans commented Jun 4, 2015

There are:

  • log_loss
  • mean_absolute_error
  • median_absolute_error

According to doc/modules/model_evaluation.rst, that should be all of them.

@mblondel mblondel commented Jun 4, 2015

And hinge_loss I guess?

@mblondel mblondel commented Jun 4, 2015

Adding the neg_ prefix to all those losses feels awkward.

An idea would be to return the original scores (without the sign flip) but, instead of returning an ndarray, return a class which extends ndarray with methods like best(), arg_best(), best_sorted(). This way the results are unsurprising and we have convenience methods for retrieving the best ones.
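A rough sketch of that idea (hypothetical class, not sklearn API):

```python
import numpy as np

class ScoreArray(np.ndarray):
    """ndarray of *original* metric values plus direction-aware helpers."""
    def __new__(cls, values, greater_is_better=True):
        obj = np.asarray(values, dtype=float).view(cls)
        obj.greater_is_better = greater_is_better
        return obj

    def __array_finalize__(self, obj):
        # Propagate the flag through views and slices.
        self.greater_is_better = getattr(obj, "greater_is_better", True)

    def best(self):
        return float(self.max() if self.greater_is_better else self.min())

    def arg_best(self):
        return int(self.argmax() if self.greater_is_better else self.argmin())

mse_scores = ScoreArray([12.3, 9.8, 11.1], greater_is_better=False)
print(mse_scores.best())      # 9.8: lower loss is better
print(mse_scores.arg_best())  # 1
```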

@larsmans larsmans commented Jun 4, 2015

There's no scorer for hinge loss (and I've never seen it being used for evaluation).

@amueller amueller commented Jun 4, 2015

The scorer doesn't return a numpy array; it returns a float, right?
We could return a score object that has a custom ">" but looks like a float.
That feels more contrived to me than the previous solution, which was tagging the scorer with a lower_is_better bool that was then used in GridSearchCV.

@mblondel mblondel commented Jun 4, 2015

cross_val_score returns an array.

@mblondel mblondel commented Jun 5, 2015

Actually the scores returned by cross_val_score usually don't need to be sorted, just averaged.

Another idea is to add a sorted method to _BaseScorer.

my_scorer = make_scorer(my_metric, greater_is_better=False)
scores = my_scorer.sorted(scores)  # takes into account my_scorer._sign
best = scores[0]
@amueller amueller commented Jun 5, 2015

cross_val_score returns an array, but the scorers return a float. I feel it would be odd to have specific logic in cross_val_score because you'd like to have the same behavior in GridSearchCV and in all other CV objects.

You'd also need an argsort method, because in GridSearchCV you want the best score and the best index.

@jenifferYingyiWu jenifferYingyiWu commented Mar 15, 2016

How would one implement "estimate the means and variances of the workers' errors from the control questions, then compute the weighted average after removing the estimated bias for the predictions" with scikit-learn?

@amueller amueller commented Aug 2, 2016

IIRC we discussed this at the sprint (last summer?!) and decided to go with neg_mse (or was it neg-mse?) and to deprecate all scorers/strings where we currently flip the sign.
Is this still the consensus? We should do that before 0.18, then.
Ping @GaelVaroquaux @agramfort @jnothman @ogrisel @raghavrv

@agramfort agramfort commented Aug 2, 2016

@raghavrv raghavrv commented Aug 2, 2016

It was neg_mse

@ogrisel ogrisel commented Aug 27, 2016

We also need:

  • neg_log_loss
  • neg_mean_absolute_error
  • neg_median_absolute_error
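For reference, with the renaming in place the loss-based strings behave like this (a sketch on synthetic data, assuming scikit-learn >= 0.18):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

results = {}
for scoring in ("neg_log_loss", "accuracy"):
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             scoring=scoring, cv=5)
    results[scoring] = scores
    # neg_log_loss means are <= 0 (the prefix says so); accuracy stays in [0, 1]
    print(scoring, scores.mean())
```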
@shreyassks shreyassks commented Oct 29, 2018

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU
from keras.optimizers import RMSprop
from keras import initializers, losses, regularizers

model = Sequential()
model.add(Dense(11, input_dim=3,
                kernel_initializer=initializers.he_normal(seed=2),
                kernel_regularizer=regularizers.l2(2)))
model.add(LeakyReLU(alpha=0.1))  # activation layers must be add()-ed, or they have no effect
model.add(Dense(8, kernel_initializer=initializers.he_normal(seed=2)))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(4, kernel_initializer=initializers.he_normal(seed=2)))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(1, kernel_initializer=initializers.he_normal(seed=2)))  # linear output for MSE regression

model.compile(loss=losses.mean_squared_error, optimizer=RMSprop(lr=0.0002))
history = model.fit(X_train, Y_train, epochs=2000, batch_size=20, shuffle=True)

How do I cross-validate the above code? I want to use leave-one-out cross-validation for this.

@jolespin jolespin commented May 14, 2019

@shreyassks this isn't the correct place for your question, but I would check this out: https://keras.io/scikit-learn-api . Wrap your network in a scikit-learn estimator, then use it with model_selection.cross_val_score.
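Leave-one-out itself is then just another cv argument; the mechanics, sketched with an ordinary sklearn estimator standing in for the wrapped network:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=30, n_features=3, noise=5.0, random_state=0)

# One fold per sample; any estimator with fit/predict works here,
# including a Keras model wrapped as a scikit-learn estimator.
scores = cross_val_score(Ridge(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(len(scores))     # 30: one score per left-out sample
print(-scores.mean())  # average MSE across the folds
```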

@TomGauss TomGauss commented Jun 3, 2019

Yes, I totally agree! The same thing happens with brier_score_loss: it works perfectly fine on its own, but it gets confusing when it comes from GridSearchCV, which returns a negative brier_score_loss. At the very least, the output should explain that because brier_score_loss is a loss (lower is better), the scoring function flips its sign to make it negative.

@Nisza25 Nisza25 commented Oct 6, 2019

The idea is that with cross_val_score you should focus entirely on the absolute value of the result. As far as I know, the negative sign (-) obtained for MSE (mean squared error) in cross_val_score carries no intrinsic meaning. Let's wait for an updated version of sklearn where this issue is taken care of.
