
Mean Standardized Log Loss (MSLL) for uncertainty aware regression models #21665

Open
patel-zeel opened this issue Nov 14, 2021 · 14 comments

@patel-zeel
Contributor

patel-zeel commented Nov 14, 2021

Describe the workflow you want to enable

Why MSLL?

Traditional metrics such as mean_squared_error or r2_score cannot correctly evaluate uncertainty-aware models because they do not take the predictive standard deviation (y_std) into account.

Why sklearn?

To the best of my knowledge, no other widely used Python library has a well-organized, standardized set of metrics implemented for ML regression models.

Mean Standardized Log Loss (MSLL)

The standardized log loss of a single prediction is

SLL = 0.5 * log(2π * y_std²) + (y_true − y_pred)² / (2 * y_std²)

Averaging this quantity over all the values in y_pred and y_std gives the Mean Standardized Log Loss (MSLL). The equation is derived from Eq. 2.34, pp. 23 of the GPML book (cited by 23541).

Properties

  • MSLL is useful to evaluate probabilistic models (or any models that can output both y_pred and y_std, like GaussianProcessRegressor).
  • The lower the MSLL, the better the model.

Which algorithms from sklearn can use this metric?

A demo

  • An overfitted model (high MSLL)

  • An underfitted model (very high MSLL)

  • A well-fitted model (low MSLL)

  • A comparison between RMSE and MSLL

MSLL is high if the prediction intervals are not well calibrated; RMSE, on the other hand, cannot take this criterion into account.

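To make the RMSE vs. MSLL contrast concrete, here is a minimal sketch on synthetic data (the `msll` helper mirrors the formula above; all names are illustrative): two "models" make identical point predictions, so their RMSE is identical, but the one that reports a far too small `y_std` is heavily penalized by MSLL.

```python
import numpy as np

def msll(y_true, y_pred, y_std):
    """Mean standardized log loss: average negative Gaussian log density."""
    return np.mean(0.5 * np.log(2 * np.pi * y_std**2)
                   + (y_true - y_pred)**2 / (2 * y_std**2))

rng = np.random.default_rng(0)
y_true = rng.normal(0.0, 1.0, size=1000)   # targets drawn from N(0, 1)
y_pred = np.zeros_like(y_true)             # both "models" predict the mean

# Identical point predictions -> identical RMSE ...
rmse = np.sqrt(np.mean((y_true - y_pred)**2))

# ... but very different MSLL, depending on the claimed uncertainty.
overconfident = msll(y_true, y_pred, np.full_like(y_true, 0.1))  # y_std too small
calibrated = msll(y_true, y_pred, np.full_like(y_true, 1.0))     # y_std correct

print(f"RMSE={rmse:.3f}  MSLL overconfident={overconfident:.2f}  calibrated={calibrated:.2f}")
```

The overconfident model scores a much larger (worse) MSLL despite the identical RMSE.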

# Original Author: Vincent Dubourg <vincent.dubourg@gmail.com>
#         Jake Vanderplas <vanderplas@astro.washington.edu>
#         Jan Hendrik Metzen <jhm@informatik.uni-bremen.de>
# License: BSD 3 clause
#
# Modified by: Zeel B Patel <patel_zeel@iitgn.ac.in>

import numpy as np
from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

np.random.seed(1)

def f(x):
    """The function to predict."""
    return x * np.sin(x)

def msll(y_true, y_pred, y_std):
    first_term = 0.5 * np.log(2 * np.pi * y_std**2)
    second_term = ((y_true - y_pred)**2)/(2 * y_std**2)
    
    return np.mean(first_term + second_term)

X = np.linspace(0.1, 9.9, 20)
X = np.atleast_2d(X).T

# Mesh the input space for evaluations of the real function and the predictions
x = np.atleast_2d(np.linspace(0, 10, 1000)).T

# Observations and noise
y = f(X).ravel()
dy = 0.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise

# Instantiate a Gaussian Process model
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, alpha=dy ** 2, n_restarts_optimizer=1)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y)

# Make the prediction on the meshed x-axis (ask for the standard deviation as well)
y_pred, sigma = gp.predict(x, return_std=True)

# Plot the function, the prediction and the 95% confidence interval based on
# the predictive standard deviation
plt.figure()
plt.plot(x, f(x), "r:", label=r"$f(x) = x\,\sin(x)$")
plt.errorbar(X.ravel(), y, dy, fmt="r.", markersize=10, label="Observations")
plt.plot(x, y_pred, "b-", label="Prediction")
plt.fill(
    np.concatenate([x, x[::-1]]),
    np.concatenate([y_pred - 1.9600 * sigma, (y_pred + 1.9600 * sigma)[::-1]]),
    alpha=0.5,
    fc="b",
    ec="None",
    label="95% confidence interval",
)
plt.xlabel("$x$")
plt.ylabel("$f(x)$")
plt.ylim(-10, 20)
plt.legend(loc="upper left")
plt.title('MSLL: ' + str(msll(f(x).ravel(), y_pred, sigma)))

plt.show()

Describe your proposed solution

A PR to include mean_standardized_log_loss in sklearn.metrics.

Potential usage

y_pred, y_std = model.predict(x, return_std=True)
msll = mean_standardized_log_loss(y_true, y_pred, y_std)

A rough version of mean_standardized_log_loss function:

def mean_standardized_log_loss(
    y_true, y_pred, y_std, *, sample_weight=None, multioutput="uniform_average"
):
    """Mean standardized log loss.
    Read more in the :ref:`User Guide <mean_standardized_log_loss>`.
    Parameters
    ----------
    y_true : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Ground truth (correct) target values.
    y_pred : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Estimated target values.
    y_std : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Estimated standard deviation in predictions.
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights.
    multioutput : {'raw_values', 'uniform_average'} or array-like of shape \
            (n_outputs,), default='uniform_average'
        Defines aggregating of multiple output values.
        Array-like value defines weights used to average errors.
        'raw_values' :
            Returns a full set of errors in case of multioutput input.
        'uniform_average' :
            Errors of all outputs are averaged with uniform weight.

    Returns
    -------
    loss : float or ndarray of floats
        A floating point value (lower is better; the value can be negative),
        or an array of floating point values, one for each individual target.
    Examples
    --------
    >>> from sklearn.metrics import mean_standardized_log_loss
    >>> y_true = [3, -0.5, 2, 7]
    >>> y_pred = [2.5, 0.0, 2, 8]
    >>> y_std = [0.1, 0.2, 0.05, 0.3]
    >>> mean_standardized_log_loss(y_true, y_pred, y_std)
    4.186...
    """
    # y_type, y_true, y_pred, multioutput = _check_reg_targets(
    #   y_true, y_pred, multioutput
    # )
    # check_consistent_length(y_true, y_pred, sample_weight)
    
    ###########
    # Checks like the above ones to be implemented.
    ###########
    
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_std = np.asarray(y_std)

    first_term = 0.5 * np.log(2 * np.pi * y_std**2)
    second_term = (y_true - y_pred)**2 / (2 * y_std**2)
    
    output_errors = np.average(first_term + second_term, axis=0, weights=sample_weight)

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return output_errors
        elif multioutput == "uniform_average":
            # pass None as weights to np.average: uniform mean
            multioutput = None

    return np.average(output_errors, weights=multioutput)

Describe alternatives you've considered, if relevant

No response

Additional context

A set of publications that have used MSLL as a metric

| Publication | Venue | Citations |
| --- | --- | --- |
| Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning (pp. 63-71). Springer, Berlin, Heidelberg. | | 23541 |
| Alvarez, M. A., & Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12, 1459-1500. | JMLR '11 | 289 |
| Chalupka, K., Williams, C. K., & Murray, I. (2013). A framework for evaluating approximation methods for Gaussian process regression. Journal of Machine Learning Research, 14, 333-350. | JMLR '13 | 144 |
| Wilson, A. G., Knowles, D. A., & Ghahramani, Z. (2012). Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning (pp. 1139-1146). Omnipress. | ICML '12 | 143 |
| Wilson, A. G., Gilboa, E., Cunningham, J. P., & Nehorai, A. (2014). Fast kernel learning for multidimensional pattern extrapolation. In NIPS (pp. 3626-3634). | NIPS '14 | 141 |
| Chen, Z., & Wang, B. (2018). How priors of initial hyperparameters affect Gaussian process regression models. Neurocomputing, 275, 1702-1710. | Neurocomputing '18 | 44 |
| Nguyen, D. T., Filippone, M., & Michiardi, P. (2019). Exact Gaussian process regression with distributed computations. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (pp. 1286-1295). | | 8 |
| Lederer, A., Conejo, A. J. O., Maier, K. A., Xiao, W., Umlauft, J., & Hirche, S. (2021). Gaussian process-based real-time learning for safety critical applications. In International Conference on Machine Learning (pp. 6055-6064). PMLR. | ICML '21 | 2 |
| Gümbel, S., & Schmidt, T. (2020). Machine learning for multiple yield curve markets: fast calibration in the Gaussian affine framework. Risks, 8(2), 50. | | 1 |

Acknowledgement

I would like to thank @wesselb for suggesting this metric for the first time to me.

@adrinjalali
Member

I think @lorentzenchr would be very much interested in this one.

@lorentzenchr
Member

The proposed score is the negative log-likelihood of a Normal/Gaussian distribution. Two questions come to mind:

  1. Are scores for expectations and variance/std deviation in-scope?
  2. Is it ok to rely on the assumption of normality when scoring? There are more general scores for the combination of expectation and variance. I would need to look into references.

@patel-zeel
Contributor Author

Thank you for your response @lorentzenchr.

  1. Are scores for expectations and variance/std deviation in-scope?

Can you please elaborate a bit more on this question?

  1. Is it ok to rely on the assumption of normality when scoring? There are more general scores for the combination of expectation and variance. I would need to look into references.

A generalized score seems interesting. I was wondering if you could point me towards a reference. However, I think that the normality assumption holds true for GaussianProcessRegressor, BayesianRidge and ARDRegression.

@lorentzenchr
Member

We are talking about the pair of point forecasts (mean, variance). The question is: are pairs of point forecasts in-scope? Which API would we follow (metric(y_true, y_pred_mean, y_pred_var))? They seem very specific to the Gaussian process estimators.

If we stick to the assumption of normally distributed targets, the log-likelihood may be fine. There is, however, literature on scoring functions (aka loss functions or metrics) for pairs of point forecasts, see Eq. (3.8) of https://doi.org/10.1214/19-EJS1552, where x1 is a prediction for the mean and x2 a prediction for the variance.

@pyLP7

pyLP7 commented Jun 15, 2022

Hi @patel-zeel. Thank you for your suggestion. I am not a scikit-learn core-dev, but I landed here because I was also very interested in this metric found in Rasmussen, C. E. (2006). Comparing the formulas in that reference with your rough code, I would like to clarify a couple of doubts that have arisen:

  1. Shouldn't you include the noise variance in the first and second terms of the MSLL? That is, replace y_std**2 with y_std**2 + dy**2? Of course, I mean the dy associated with your test dataset.
  2. To standardize the score, shouldn't you subtract the loss calculated on your train dataset from the loss of the test dataset? It looks to me like you are computing only the latter here.
  3. In this reference, the MSLL is often compared with the Standardized Mean Squared Error (SMSE). Do you have a rough code version of the SMSE as well? If yes, and you have already tested it, do you have any remarks or findings with respect to the MSLL? I would really appreciate that.

Thank you in advance!

@patel-zeel
Contributor Author

patel-zeel commented Jun 16, 2022

Hi @pyLP7. Several probabilistic metrics are now available in the gpytorch library; you can find them here.

  1. Yes, ideally we should incorporate noise variance with predictive variance. For GPs, we include noise_variance in the predictive distribution itself. Some libraries like GPFlow make it very explicit with predict_y and predict_f functions. So, ideally, we should compute MSLL with predict_y.
  2. I am not aware of such a practice being followed by the community. I'd love to see references if you have some.
  3. I am not aware of this metric called SMSE either.
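The noise-variance point in item 1 can be sketched as follows (all arrays here are hypothetical; the only assumption is that the latent-function uncertainty and the observation noise are independent Gaussians, so their variances add — the `predict_f` vs. `predict_y` distinction):

```python
import numpy as np

def msll(y_true, y_pred, y_std):
    """Mean standardized log loss: average negative Gaussian log density."""
    return np.mean(0.5 * np.log(2 * np.pi * y_std**2)
                   + (y_true - y_pred)**2 / (2 * y_std**2))

# Hypothetical model outputs: std of the latent function ("predict_f" style)
# and a separately known observation-noise std.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.2])
f_std = np.array([0.2, 0.3, 0.25])   # latent-function uncertainty only
noise_std = 0.5                      # assumed observation-noise std

# For independent Gaussians the variances add, giving a "predict_y" style std.
y_std = np.sqrt(f_std**2 + noise_std**2)

print(msll(y_true, y_pred, y_std))
```

Evaluating MSLL with `y_std` (rather than `f_std` alone) scores the distribution of the observations, which is what the held-out `y_true` values are drawn from.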

@pyLP7

pyLP7 commented Jun 16, 2022

Hi @patel-zeel, and thank you for your reply. Thank you also for the gpytorch link with probabilistic metrics. I was just a bit confused because I thought we were in the scikit-learn module ;)

Anyway, my main reference is the old but gold Rasmussen & Williams, pp. 23, eq. 2.34, which you also mentioned immediately after presenting the MSLL formula. The screenshot you pasted looks different from the current reference though (please check the link). I therefore enclose a new picture below to answer your points 2 and 3.

[screenshot of eq. (2.34) and the surrounding text from Rasmussen & Williams]

Just take a look please, maybe I misunderstood something :)

@aaronlrees

aaronlrees commented Aug 14, 2023

From what I understand (and I could be wrong, please let me know if so), I believe there is a problem with what has been said above. People are referring to equation (2.34) as the MSLL. This is incorrect: it is simply the negative log (Gaussian) likelihood. As highlighted above, the SLL is calculated by subtracting the loss obtained under a trivial model. To find the trivial model, simply calculate the mean and std of the y_test you are trying to predict, then plug these values into the negative log likelihood (2.34) to get your trivial loss. Subtract the trivial loss from the actual loss and you have the SLL.
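A sketch of this standardization on synthetic data (all names are illustrative; here the trivial model is fitted to the training targets, as in GPML — fitting it to the test targets instead just shifts the baseline):

```python
import numpy as np

def nll_gaussian(y_true, y_pred, y_std):
    """Per-point negative log likelihood under independent Gaussians (eq. 2.34)."""
    return 0.5 * np.log(2 * np.pi * y_std**2) + (y_true - y_pred)**2 / (2 * y_std**2)

def msll_standardized(y_true, y_pred, y_std, y_train):
    # Trivial model: a single Gaussian with the mean/std of the training targets.
    mu0, sd0 = np.mean(y_train), np.std(y_train)
    trivial = nll_gaussian(y_true, mu0, sd0)
    return np.mean(nll_gaussian(y_true, y_pred, y_std) - trivial)

# Toy check: a model that matches the trivial predictor scores exactly 0,
# and a genuinely better model scores below 0.
rng = np.random.default_rng(0)
y_train = rng.normal(5.0, 2.0, 500)
y_test = rng.normal(5.0, 2.0, 500)

same = msll_standardized(y_test, np.full(500, y_train.mean()),
                         np.full(500, y_train.std()), y_train)
better = msll_standardized(y_test, y_test + rng.normal(0, 0.1, 500),
                           np.full(500, 0.1), y_train)
print(same, better)
```

With this convention, negative values mean the model beats the trivial Gaussian baseline, which is the interpretation used in the GPML experiments.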

@lorentzenchr
Member

lorentzenchr commented Aug 14, 2023

@scikit-learn/core-devs Are there other opinions whether or not to include (scoring) metrics for a pair of point predictions like (mean, variance) = $(\mu,\sigma^2)$ ?

There is good literature for it, see https://www.jstor.org/stable/43974729 (theory heavy) and https://doi.org/10.1214/19-EJS1552 Eq 4.13; e.g., we know when such a metric is statistically consistent in the large sample limit. The easiest metrics would probably be an extension of the squared error, either

  • $(y - \mu)^2 + y^2 \log\frac{y^2}{\mu^2+\sigma^2}+\mu^2+\sigma^2-y^2$ (squared error for $\mu$ plus poisson deviance for the second moment, such that the score is a homogeneous function)
  • or the Dawid-Sebastiani score $\frac{(y-\mu)^2}{\sigma^2} + 2\log\sigma$. (This is even a scoring rule and assesses the whole predictive distribution. Up to constants, this is the metric proposed by the OP in this issue.)

What would our API be?
metric(y_true, y_pred_mean, y_pred_var)?
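A minimal sketch of the Dawid-Sebastiani score under the API suggested above (the function name and test values are hypothetical; the score is parameterized by the predicted variance, using 2 log σ = log σ²). As noted, it coincides with twice the Gaussian log loss proposed in this issue, minus the constant log(2π):

```python
import numpy as np

def dawid_sebastiani_score(y_true, y_pred_mean, y_pred_var):
    """Dawid-Sebastiani score: mean of (y - mu)^2 / sigma^2 + log(sigma^2)."""
    return np.mean((y_true - y_pred_mean)**2 / y_pred_var + np.log(y_pred_var))

def msll(y_true, y_pred, y_std):
    """Gaussian log loss as proposed in this issue."""
    return np.mean(0.5 * np.log(2 * np.pi * y_std**2)
                   + (y_true - y_pred)**2 / (2 * y_std**2))

# Hypothetical predictions with per-point variances.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
mu = np.array([2.5, 0.0, 2.0, 8.0])
var = np.array([0.25, 0.09, 0.04, 0.36])

ds = dawid_sebastiani_score(y_true, mu, var)
# Up to the additive constant log(2*pi), DS is twice the Gaussian log loss:
assert np.isclose(ds, 2 * msll(y_true, mu, np.sqrt(var)) - np.log(2 * np.pi))
print(ds)
```

Because the two scores differ only by an affine transformation, they rank models identically; the choice between them is a matter of convention.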

@patel-zeel
Contributor Author

patel-zeel commented Aug 15, 2023

There are a bunch of distribution-free interval-based scikit-learn style metrics available here as well: mapie.metrics.

For the API, if it has to be consistent with taking two arguments only, one more option could be metric(y_true, (y_pred_mean, y_pred_var))

@glemaitre
Member

I don't think that we should restrain ourselves to the current metric API (i.e. only two parameters).

If I recall, the clustering metrics are already different.

What would our API be?
metric(y_true, y_pred_mean, y_pred_var)?

Our API for the "Bayesian" estimators has never been discussed much. Since all of those can in general return estimates of both the mean and the standard deviation, I would expect our metric to take those as input.

@glemaitre
Member

Then, if we consider MAPIE and think ahead to a new API that would return different estimates, we would probably want a different API for these metrics. But I am unsure that we should prevent adding such a metric for this reason.

@Micky774
Contributor

I'm okay with including metrics which consume expectation and variance, even if this changes the metrics API.

@lorentzenchr
Member

For me, it is more a question of whether or not we provide models that compute both. Currently we have Gaussian process and Bayesian ridge and, in a way, Bayesian mixture. So yes, we do.
