
# [MRG+1] Add new regression metric - Mean Squared Log Error #7655

Merged
merged 4 commits into scikit-learn:master from kdexd:msle-metric Nov 30, 2016

## Conversation

### kdexd commented Oct 12, 2016 • edited

#### What does this implement/fix? Explain your changes.

• This PR implements a new metric, "Mean Squared Logarithmic Error" (name shortened to mean_squared_log_error). I have added the function along with the other regression metrics in the sklearn.metrics.regression module.
• Accompanying the implementation, this PR is complete with User Guide Documentation and API docstring.

• The metric is similar to mean_squared_error, and MSLE can be computed with the MSE function by passing log-transformed arguments, but that always requires manual work outside the library.
• I felt that it would be a nice-to-have metric because it is needed so frequently.
• A lot of regression problems in various competitions, especially Kaggle, evaluate submissions based on this error metric or its square root. A Kaggle wiki page can be found here.
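The manual work described in the list above can be sketched as follows: apply log(1 + x) to both sequences and pass the results to an ordinary mean squared error. This is a pure-Python illustration only; `mse` and `msle_via_mse` are hypothetical helper names, not scikit-learn functions.

```python
import math


def mse(y_true, y_pred):
    """Plain mean squared error over two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def msle_via_mse(y_true, y_pred):
    """MSLE obtained by log1p-transforming both inputs before calling mse."""
    return mse([math.log1p(t) for t in y_true],
               [math.log1p(p) for p in y_pred])


y_true = [3, 5, 2.5, 7]
y_pred = [2.5, 5, 4, 8]
print(msle_via_mse(y_true, y_pred))
```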

### jnothman commented Oct 13, 2016

changed the title [MRG] Add new regression metric - Mean Squared Log Error [WIP] Add new regression metric - Mean Squared Log Error Oct 13, 2016
changed the title [WIP] Add new regression metric - Mean Squared Log Error [MRG] Add new regression metric - Mean Squared Log Error Oct 13, 2016
reviewed
 ---------------------- The :func:mean_squared_log_error function computes a risk metric corresponding to the expected value of the logarithmic squared (quadratic) error loss or loss.

#### amueller Oct 14, 2016 Member

error loss or loss? Do you mean "error or loss"?

#### kdexd Oct 15, 2016 Author

Oops, minor typo. Fixing it

#### kdexd Oct 15, 2016 Author

@amueller I fixed this one! The same typo appeared just above it as well, so I fixed that on the fly.

reviewed
    y_type, y_true, y_pred, multioutput = _check_reg_targets(
        y_true, y_pred, multioutput)

    if not (y_true >= 0).all() and not (y_pred >= 0).all():

#### amueller Oct 14, 2016 Member

It can be used with anything > -1, right?

#### kdexd Oct 15, 2016 Author

@amueller It can be, but log(1 + x) gives huge negative values that change erratically with small changes of x in (-1, 0). This will not make the score look sensible. Mathematically it is possible, but in practice this metric is used for non-negative targets. If you suggest it, though, I'll change it.

#### kdexd Oct 15, 2016 Author

Additionally, I just recalled reading somewhere that this metric is used for positive values only. The log(1 + x) is there to keep the argument of the log at least one, so that the log itself is never negative. Allowing values down to -1 would nullify this 😄

alright.

#### jnothman Nov 6, 2016 Member

Yes, my reading of the equation agrees that it's designed for non-negative values with an exponential trend.
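A small sketch of the behaviour discussed in this thread, using only the standard library: just above -1, log(1 + x) is strongly negative and very sensitive to x, while for non-negative x it is never negative. The sample values are chosen purely for illustration.

```python
import math

# Just above -1, small changes in x move log(1 + x) a lot.
for x in (-0.9, -0.99, -0.999):
    print(x, math.log1p(x))

# For non-negative x the transformed value is itself non-negative.
non_negative = [0.0, 0.5, 10.0, 1e6]
transformed = [math.log1p(x) for x in non_negative]
print(transformed)
```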

approved these changes

### kdexd commented Oct 18, 2016

 Hi @amueller and @jnothman, what more shall I do in this PR? Also, is @RPGOne a bot or a service?

### jnothman commented Oct 18, 2016 • edited

 RPGOne is spam, as far as I know.
suggested changes

### raghavrv left a comment

 The code is cleanly written. Thanks!
 Array-like value defines weights used to average errors. 'raw_values' : Returns a full set of errors in case of multioutput input.

#### raghavrv Oct 30, 2016 Member

I'd phrase it as when the input is of multioutput format.
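As a rough illustration of the `multioutput` options described in the docstring under review, here is a pure-Python sketch. `msle_multioutput` is a hypothetical helper, not the PR's implementation, and it assumes array-like weights are normalised by their sum.

```python
import math


def msle_multioutput(y_true, y_pred, multioutput="uniform_average"):
    """y_true and y_pred are lists of rows, one column per output."""
    n_outputs = len(y_true[0])
    # One MSLE value per output column.
    raw = [
        sum((math.log1p(t[j]) - math.log1p(p[j])) ** 2
            for t, p in zip(y_true, y_pred)) / len(y_true)
        for j in range(n_outputs)
    ]
    if multioutput == "raw_values":
        return raw
    if multioutput == "uniform_average":
        return sum(raw) / n_outputs
    # Otherwise treat multioutput as per-output weights.
    return sum(w * r for w, r in zip(multioutput, raw)) / sum(multioutput)


y_true = [[0.5, 1], [1, 2], [7, 6]]
y_pred = [[0.5, 2], [1, 2.5], [8, 8]]
print(msle_multioutput(y_true, y_pred, "raw_values"))
print(msle_multioutput(y_true, y_pred, [0.3, 0.7]))
```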

 Sample weights. multioutput : string in ['raw_values', 'uniform_average'] or array-like of shape (n_outputs)

#### raghavrv Oct 30, 2016 Member

Humm how does this render in the documentation?

Could you maybe leave a blank line after this, to visually separate the type from description?

    if not (y_true >= 0).all() and not (y_pred >= 0).all():
        raise ValueError("Mean Log Squared Error cannot be used when targets "
                         "contain negative values.")

#### raghavrv Oct 30, 2016 Member

After this validation I think we can reuse mean_squared_error by passing the log values?

(There will be an additional check on y, but it will save us 10 lines of code)...

@amueller WDYT?

#### kdexd Oct 30, 2016 Author

@raghavrv It will break the test of this function if mean_squared_error ever gets broken in the future. But I think your suggestion is more appropriate because:

1. It satisfies the DRY principle.
2. As this metric is adapted from mean_squared_error, its behavior can follow that function, so there is no issue if one test fails because the underlying function is broken.

I'm temporarily choosing the path which is consistent with Don't Repeat Yourself and which saves some lines of code. I'll amend my commit accordingly if @amueller thinks otherwise.
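The validate-then-delegate approach discussed here could look roughly like the sketch below. It is a pure-Python stand-in with hypothetical helpers (`mse`, `msle`), not the merged code. One assumption to flag: this sketch rejects a negative value in either input by joining the two checks with `or`, whereas the quoted condition joins them with `and` and therefore only raises when both inputs contain a negative value.

```python
import math


def mse(y_true, y_pred):
    """Plain mean squared error over two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def msle(y_true, y_pred):
    """Validate the inputs, then delegate to mse on log1p-transformed values."""
    # Assumption: a negative value in either input is an error.
    if any(t < 0 for t in y_true) or any(p < 0 for p in y_pred):
        raise ValueError("Mean Squared Logarithmic Error cannot be used "
                         "when targets contain negative values.")
    return mse([math.log1p(t) for t in y_true],
               [math.log1p(p) for p in y_pred])


print(msle([3, 5, 2.5, 7], [2.5, 5, 4, 8]))
```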

removed the label Oct 30, 2016
added this to the 0.19 milestone Oct 30, 2016

### kdexd commented Oct 31, 2016

 Documentation of mean_squared_error in current master renders like this: (screenshot not preserved). There are some inconsistencies; my build after the changes you suggested looks like this (mean_squared_log_error): (screenshot not preserved). To keep the diffs in this PR specific to only one metric, I am leaving the other docstrings untouched for now; I'll be taking them up in a separate documentation cleanup issue. I have rephrased the line you reviewed and reused mean_squared_error as well. Thanks!

### raghavrv commented Nov 1, 2016

 Thanks for the screenshot of the doc! "I'll be taking them up in a separate documentation cleanup issue." Much appreciated.

### raghavrv commented Nov 1, 2016

 I think it should also be added to the scorer so users can readily refer to it by neg_mean_squared_log_error...
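The `neg_` prefix follows the convention that a scorer returns values where greater is better, so a loss is negated. The sketch below illustrates only that convention in pure Python; `make_neg_scorer` and the toy candidate predictors are hypothetical and do not reproduce scikit-learn's scorer machinery.

```python
import math


def msle(y_true, y_pred):
    """Mean squared logarithmic error over two equal-length sequences."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


def make_neg_scorer(loss):
    """Wrap a loss so that greater return values mean a better fit."""
    def scorer(predict, X, y):
        return -loss(y, [predict(x) for x in X])
    return scorer


neg_msle = make_neg_scorer(msle)

X = [1, 2, 3, 4]
y = [2, 4, 8, 16]
candidates = {
    "exponential": lambda x: 2 ** x,
    "linear": lambda x: 4 * x,
}
# Model selection can now simply maximise the score.
best = max(candidates, key=lambda name: neg_msle(candidates[name], X, y))
print(best)
```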

### kdexd commented Nov 2, 2016

 @raghavrv It looks like there is a renaming scheduled for regression metrics similar to this one. For the sake of uniformity, I have added a deprecation message to mean_squared_log_error_scorer just like mean_squared_error_scorer and others. Let me know if I should not include that, and I will amend the commit accordingly, thanks !

### jnothman commented Nov 2, 2016

 No, don't add a deprecated version. That's only there for people using features in older versions.

### kdexd commented Nov 2, 2016

 Hello @jnothman, @raghavrv: I have made the necessary additions and modifications in the PR. I'm happy to add anything else that belongs here, so please have a look 😄
suggested changes
    assert_almost_equal(mean_absolute_error([0.], [0.]), 0.00, 2)
    assert_almost_equal(median_absolute_error([0.], [0.]), 0.00, 2)
    assert_almost_equal(explained_variance_score([0.], [0.]), 1.00, 2)
    assert_almost_equal(r2_score([0., 1], [0., 1]), 1.00, 2)
    assert_raises(ValueError, mean_squared_log_error, [-1.], [-1.])

#### raghavrv Nov 2, 2016 Member

Can you also check for the error message to be sure...

@raghavrv Done !
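Checking the message as well as the exception type might look like the following sketch. It uses the standard library's `unittest` rather than the project's own test helpers, and `msle` is a hypothetical pure-Python stand-in for the function under test.

```python
import math
import unittest


def msle(y_true, y_pred):
    """Stand-in metric that rejects negative values in either input."""
    if any(v < 0 for v in list(y_true) + list(y_pred)):
        raise ValueError("Mean Squared Logarithmic Error cannot be used "
                         "when targets contain negative values.")
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


class TestMsleValidation(unittest.TestCase):
    def test_negative_targets_message(self):
        # Match on the message, not only on the exception class.
        with self.assertRaisesRegex(ValueError, "contain negative values"):
            msle([-1.0], [-1.0])
```

Saved to a file, the test case can be run with `python -m unittest`.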

### jnothman commented Nov 6, 2016

 Kaggle calls this "[root] mean squared logarithmic error", not "[root] mean squared log error" which sounds like it's a function of the log of the error. I think this is an important distinction. I'm not sure if you need to rename the function and scorer to reflect this, but at least the documentation needs to be absolutely clear.
requested changes
 \text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (\log (1 + y_i) - \log (1 + \hat{y}_i) )^2. Here is a small example of usage of the :func:mean_squared_log_error

#### jnothman Nov 6, 2016 Member

Kaggle's note that "RMSLE penalizes an under-predicted estimate greater than an over-predicted estimate" may be valuable here.
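The asymmetry in Kaggle's note can be seen with a single target and two predictions that miss by the same absolute amount. A minimal sketch with illustrative numbers; `squared_log_error` is a hypothetical per-sample helper.

```python
import math


def squared_log_error(y_true, y_pred):
    """Squared difference of log(1 + value) for a single sample."""
    return (math.log1p(y_true) - math.log1p(y_pred)) ** 2


target = 100.0
under = squared_log_error(target, 50.0)   # prediction 50 below the target
over = squared_log_error(target, 150.0)   # prediction 50 above the target
print(under, over, under > over)
```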

 ---------------------- The :func:mean_squared_log_error function computes a risk metric corresponding to the expected value of the logarithmic squared (quadratic) error or loss.

#### jnothman Nov 6, 2016 Member

I think you want "squared logarithmic" rather than "logarithmic squared".

#### kdexd Nov 6, 2016 Author

Oops, thanks for pointing this out. I would have missed it completely ! Changing it.

 .. math:: \text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (\log (1 + y_i) - \log (1 +

#### jnothman Nov 6, 2016 Member

I presume this is meant to be applicable for non-negative regression targets? This should be stated. I think you should also give some sense of when this measure should be used, presumably for regressions over population counts and similar (i.e. targets with exponential growth).

#### kdexd Nov 6, 2016 Author

Yes, this would be good information to include in our user guide.

#### jnothman Nov 6, 2016 Member

Also would be good to be clear what base we use for the log.
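The base matters because it rescales the metric. Since ln(x) = ln(10) · log10(x), an MSLE computed with natural logarithms equals (ln 10)² times the one computed with base-10 logarithms. A small sketch; the `log` parameter is a hypothetical addition used only for this comparison.

```python
import math


def msle(y_true, y_pred, log=math.log):
    """MSLE with a pluggable logarithm; defaults to the natural log."""
    return sum((log(1 + t) - log(1 + p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3, 5, 2.5, 7]
y_pred = [2.5, 5, 4, 8]

natural = msle(y_true, y_pred)                 # natural logarithm
base10 = msle(y_true, y_pred, log=math.log10)  # base-10 logarithm

# The two values differ by the constant factor (ln 10) ** 2.
ratio = natural / base10
print(ratio, math.log(10) ** 2)
```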

### kdexd commented Nov 6, 2016

 @jnothman "Logarithmic" made the name too long, but if needed, I'll change the names. At the very least I should be clear about it in the docstrings and User Guide. I'll push the required changes soon. Also, the square roots of MSE and MAE are used quite frequently, but they are not included in the scorer, so I dropped [root]. Would it be a good idea to provide RMSLE in the scorer, or is it fine this way?
reviewed
 def mean_squared_log_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average'): """Mean squared log error regression loss

#### jnothman Nov 6, 2016 Member

For instance, here "log" -> "logarithmic"

requested changes
 \hat{y}_i) )^2. \text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (\log_e (1 + y_i) - \log_e (1 + \hat{y}_i) )^2. Where :math:\log_e (x) means the natural logarithm of :math:x. This metric is best to

#### jnothman Nov 28, 2016 Member

some of these lines are much longer than we usually try to keep to (80 chars)

### kdexd commented Nov 28, 2016

 I have addressed all of your review comments and cleaned up my commit history to reduce down the whole work into isolated sequential commits containing the implementation, tests and documentation one in each ! Please let me know if there's anything else I should do..

### jnothman commented Nov 29, 2016

 FWIW, cleaning up commit history is superfluous.
requested changes

### jnothman left a comment

 Otherwise LGTM
 y_true, y_pred, multioutput) if not (y_true >= 0).all() and not (y_pred >= 0).all(): raise ValueError("Mean Squared Log Error cannot be used when targets "

#### jnothman Nov 29, 2016 Member

Either "logarithmic" or "mean_squared_log_error"

    @@ -23,6 +24,7 @@ def test_regression_metrics(n_samples=50):
         y_pred = y_true + 1
         assert_almost_equal(mean_squared_error(y_true, y_pred), 1.)
         assert_almost_equal(mean_squared_log_error(y_true, y_pred), 0.01915163)

#### jnothman Nov 29, 2016 Member

I'd rather tests that explicitly check msle(x, y) = mse(ln(x), ln(y)) rather than checking against a hand-calculated number.

#### kdexd Nov 29, 2016 Author

Great, although I guess you mean ln(1+x)

yes, that

#### kdexd Nov 29, 2016 Author

Coming up in 5 minutes 😄

#### kdexd Nov 29, 2016 Author

@jnothman Done! I was skeptical because, if mean_squared_error ever becomes faulty, these tests will still pass, as we are doing the same thing internally.
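The identity requested in this thread, msle(x, y) = mse(ln(1 + x), ln(1 + y)), can be checked as in the sketch below. To sidestep the concern about both sides sharing one code path, the left-hand side here is written out from the formula instead of delegating; all helper names are hypothetical and the data is generated only for illustration.

```python
import math
import random


def mse(y_true, y_pred):
    """Plain mean squared error over two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def msle_direct(y_true, y_pred):
    """MSLE written out from the formula, without calling mse."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = math.log(1 + t) - math.log(1 + p)
        total += diff * diff
    return total / len(y_true)


def test_msle_matches_mse_of_log1p():
    rng = random.Random(0)  # fixed seed keeps the check reproducible
    y_true = [rng.uniform(0, 100) for _ in range(50)]
    y_pred = [t + 1 for t in y_true]
    expected = mse([math.log1p(t) for t in y_true],
                   [math.log1p(p) for p in y_pred])
    assert math.isclose(msle_direct(y_true, y_pred), expected)
```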

 ENH Implement mean squared log error in sklearn.metrics.regression 
 e00d9b3 
 ENH Add neg_mean_squared_log_error in metrics.scorer 
 62f1317 

### kdexd commented Nov 29, 2016

 Oh yes, if mean_squared_error were broken, then its own test would fail!
approved these changes

### jnothman commented Nov 29, 2016

 LGTM
changed the title [MRG] Add new regression metric - Mean Squared Log Error [MRG+1] Add new regression metric - Mean Squared Log Error Nov 29, 2016
approved these changes

### amueller left a comment

 LGTM apart from nitpick. Would you mind fixing that?
 Parameters ---------- y_true : array-like of shape = (n_samples) or (n_samples, n_outputs)

#### amueller Nov 29, 2016 Member

nitpick: you should write (n_samples,) because it's a tuple. (also everywhere below where there's a tuple with one element)
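The distinction behind this nitpick is plain Python syntax: parentheses alone only group an expression, and it is the trailing comma that makes a one-element tuple. The variable names below are illustrative only.

```python
n_samples = 3

not_a_tuple = (n_samples)   # parentheses only group: this is the int 3
one_tuple = (n_samples,)    # the trailing comma makes a 1-element tuple

print(type(not_a_tuple).__name__, type(one_tuple).__name__)
```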

#### kdexd Nov 30, 2016 Author

@amueller I read your comment after the PR was merged. I think I will handle this as part of a larger documentation-consistency cleanup; @raghavrv has already directed me to a matching issue. I will first work on one more PR which is already open before starting that.

approved these changes
merged commit cb6a366 into scikit-learn:master Nov 30, 2016
2 of 3 checks passed
continuous-integration/appveyor/pr AppVeyor build failed
ci/circleci Your tests passed on CircleCI!
continuous-integration/travis-ci/pr The Travis CI build passed

### jnothman commented Nov 30, 2016

 Thanks, @karandesai-96

### jnothman commented Nov 30, 2016

 Sorry I missed your comment before merging, @amueller :/

### kdexd commented Nov 30, 2016

 Feels good to contribute to the community, thanks @jnothman @amueller @raghavrv for the review ! 😄

### amueller commented Nov 30, 2016

 @jnothman no worries, it was a nitpick of the highest order ;)
deleted the kdexd:msle-metric branch Dec 2, 2016

### kdexd commented Dec 22, 2016

 Hi, I was wondering whether this should go in CHANGELOG for next release.

### jnothman commented Dec 22, 2016

 With apologies, we forgot to ask you to add a changelog entry here. Please submit a new PR with it. Thanks.

### kdexd commented Dec 22, 2016

 @jnothman: Sure, I'll do that in a moment, thanks for the heads-up.
mentioned this pull request Dec 22, 2016
added a commit that referenced this pull request Dec 23, 2016
 [MRG + 1] Add changelog entry for MSLE implemented in #7655. (#8104) 
 7adeed1 
added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
 [MRG+1] Add new regression metric - Mean Squared Log Error (scikit-le… 
 a8effcc 
…arn#7655)

* ENH Implement mean squared log error in sklearn.metrics.regression

* TST Add tests for mean squared log error.

* DOC Write user guide and docstring about mean squared log error.

* ENH Add neg_mean_squared_log_error in metrics.scorer
added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
 [MRG + 1] Add changelog entry for MSLE implemented in scikit-learn#7655… 
 2d72037 
…. (scikit-learn#8104)
mentioned this pull request Mar 17, 2017