
Mean Standardized Log Loss (MSLL) for uncertainty aware regression models #21665

Open
patel-zeel opened this issue Nov 14, 2021 · 14 comments

@patel-zeel
Contributor

patel-zeel commented Nov 14, 2021

Describe the workflow you want to enable

Why MSLL?

Traditional metrics such as mean_squared_error or r2_score cannot correctly evaluate uncertainty-aware models because they do not take the predictive standard deviation (y_std) into account.

Why sklearn?

To the best of my knowledge, no other widely used Python library has a well-organized, standardized set of metrics implemented for ML regression models.

Mean Standardized Log Loss (MSLL)

The standardized log loss of a single prediction is

SLL = 0.5 * log(2π * y_std²) + (y_true − y_pred)² / (2 * y_std²)

Averaging this quantity over all the values in y_pred and y_std gives the Mean Standardized Log Loss (MSLL). The equation is derived from Eq. 2.34, pp. 23 of the GPML book (cited by 23541).

Properties

  • MSLL is useful to evaluate probabilistic models (or any models that can output both y_pred and y_std, like GaussianProcessRegressor).
  • The lower the MSLL, the better the model.

Which algorithms from sklearn can use this metric?

A demo

  • An overfitted model (high MSLL)

  • An underfitted model (very high MSLL)

  • A well-fitted model (low MSLL)

  • A comparison between RMSE and MSLL

MSLL is high if the prediction intervals are not well calibrated; RMSE, on the other hand, cannot take this criterion into account.

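To make the RMSE vs. MSLL contrast concrete, here is a minimal sketch on synthetic data (the `msll` helper mirrors the formula above; all names are illustrative): two "models" make identical point predictions, so their RMSE is identical, but the one that reports a far too small `y_std` is heavily penalized by MSLL.

```python
import numpy as np

def msll(y_true, y_pred, y_std):
    """Mean standardized log loss: average negative Gaussian log density."""
    return np.mean(0.5 * np.log(2 * np.pi * y_std**2)
                   + (y_true - y_pred)**2 / (2 * y_std**2))

rng = np.random.default_rng(0)
y_true = rng.normal(0.0, 1.0, size=1000)   # targets drawn from N(0, 1)
y_pred = np.zeros_like(y_true)             # both "models" predict the mean

# Identical point predictions -> identical RMSE ...
rmse = np.sqrt(np.mean((y_true - y_pred)**2))

# ... but very different MSLL, depending on the claimed uncertainty.
overconfident = msll(y_true, y_pred, np.full_like(y_true, 0.1))  # y_std too small
calibrated = msll(y_true, y_pred, np.full_like(y_true, 1.0))     # y_std correct

print(f"RMSE={rmse:.3f}  MSLL overconfident={overconfident:.2f}  calibrated={calibrated:.2f}")
```

The overconfident model scores a much larger (worse) MSLL despite the identical RMSE.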

# Original Author: Vincent Dubourg <vincent.dubourg@gmail.com>
#         Jake Vanderplas <vanderplas@astro.washington.edu>
#         Jan Hendrik Metzen <jhm@informatik.uni-bremen.de>
# License: BSD 3 clause
#
# Modified by: Zeel B Patel <patel_zeel@iitgn.ac.in>

import numpy as np
from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

np.random.seed(1)

def f(x):
    """The function to predict."""
    return x * np.sin(x)

def msll(y_true, y_pred, y_std):
    first_term = 0.5 * np.log(2 * np.pi * y_std**2)
    second_term = ((y_true - y_pred)**2)/(2 * y_std**2)
    
    return np.mean(first_term + second_term)

X = np.linspace(0.1, 9.9, 20)
X = np.atleast_2d(X).T

# Mesh the input space for evaluations of the real function and the predictions
x = np.atleast_2d(np.linspace(0, 10, 1000)).T

# Observations and noise
y = f(X).ravel()
dy = 0.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise

# Instantiate a Gaussian Process model
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, alpha=dy ** 2, n_restarts_optimizer=1)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y)

# Make the prediction on the meshed x-axis (ask for the standard deviation as well)
y_pred, sigma = gp.predict(x, return_std=True)

# Plot the function, the prediction and the 95% confidence interval based on
# the predictive standard deviation
plt.figure()
plt.plot(x, f(x), "r:", label=r"$f(x) = x\,\sin(x)$")
plt.errorbar(X.ravel(), y, dy, fmt="r.", markersize=10, label="Observations")
plt.plot(x, y_pred, "b-", label="Prediction")
plt.fill(
    np.concatenate([x, x[::-1]]),
    np.concatenate([y_pred - 1.9600 * sigma, (y_pred + 1.9600 * sigma)[::-1]]),
    alpha=0.5,
    fc="b",
    ec="None",
    label="95% confidence interval",
)
plt.xlabel("$x$")
plt.ylabel("$f(x)$")
plt.ylim(-10, 20)
plt.legend(loc="upper left")
plt.title('MSLL: ' + str(msll(f(x).ravel(), y_pred, sigma)))

plt.show()

Describe your proposed solution

A PR to include mean_standardized_log_loss in sklearn.metrics.

Potential usage

y_pred, y_std = model.predict(x, return_std=True)
msll = mean_standardized_log_loss(y_true, y_pred, y_std)

A rough version of mean_standardized_log_loss function:

def mean_standardized_log_loss(
    y_true, y_pred, y_std, *, sample_weight=None, multioutput="uniform_average"
):
    """Mean standardized log loss.
    Read more in the :ref:`User Guide <mean_standardized_log_loss>`.
    Parameters
    ----------
    y_true : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Ground truth (correct) target values.
    y_pred : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Estimated target values.
    y_std : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Estimated standard deviation in predictions.
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights.
    multioutput : {'raw_values', 'uniform_average'} or array-like of shape \
            (n_outputs,), default='uniform_average'
        Defines aggregating of multiple output values.
        Array-like value defines weights used to average errors.
        'raw_values' :
            Returns a full set of errors in case of multioutput input.
        'uniform_average' :
            Errors of all outputs are averaged with uniform weight.

    Returns
    -------
    loss : float or ndarray of floats
        A floating point value (lower is better; the value can be negative),
        or an array of floating point values, one for each individual target.
    Examples
    --------
    >>> from sklearn.metrics import mean_standardized_log_loss
    >>> y_true = [3, -0.5, 2, 7]
    >>> y_pred = [2.5, 0.0, 2, 8]
    >>> y_std = [0.1, 0.2, 0.05, 0.3]
    >>> mean_standardized_log_loss(y_true, y_pred, y_std)
    4.186...
    """
    # y_type, y_true, y_pred, multioutput = _check_reg_targets(
    #   y_true, y_pred, multioutput
    # )
    # check_consistent_length(y_true, y_pred, sample_weight)
    
    ###########
    # Checks like the above ones to be implemented.
    ###########
    
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_std = np.asarray(y_std)

    first_term = 0.5 * np.log(2 * np.pi * y_std**2)
    second_term = (y_true - y_pred)**2 / (2 * y_std**2)
    
    output_errors = np.average(first_term + second_term, axis=0, weights=sample_weight)

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return output_errors
        elif multioutput == "uniform_average":
            # pass None as weights to np.average: uniform mean
            multioutput = None

    return np.average(output_errors, weights=multioutput)

Describe alternatives you've considered, if relevant

No response

Additional context

A set of publications that have used MSLL as a metric

| Publication | Venue | Citations |
| --- | --- | --- |
| Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning (pp. 63-71). Springer, Berlin, Heidelberg. | | 23541 |
| Alvarez, M. A., & Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12, 1459-1500. | JMLR '11 | 289 |
| Chalupka, K., Williams, C. K., & Murray, I. (2013). A framework for evaluating approximation methods for Gaussian process regression. Journal of Machine Learning Research, 14, 333-350. | JMLR '13 | 144 |
| Wilson, A. G., Knowles, D. A., & Ghahramani, Z. (2012). Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning (pp. 1139-1146). Omnipress. | ICML '12 | 143 |
| Wilson, A. G., Gilboa, E., Cunningham, J. P., & Nehorai, A. (2014). Fast kernel learning for multidimensional pattern extrapolation. In NIPS (pp. 3626-3634). | NIPS '14 | 141 |
| Chen, Z., & Wang, B. (2018). How priors of initial hyperparameters affect Gaussian process regression models. Neurocomputing, 275, 1702-1710. | Neurocomputing '18 | 44 |
| Nguyen, D. T., Filippone, M., & Michiardi, P. (2019). Exact Gaussian process regression with distributed computations. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (pp. 1286-1295). | | 8 |
| Lederer, A., Conejo, A. J. O., Maier, K. A., Xiao, W., Umlauft, J., & Hirche, S. (2021). Gaussian process-based real-time learning for safety critical applications. In International Conference on Machine Learning (pp. 6055-6064). PMLR. | ICML '21 | 2 |
| Gümbel, S., & Schmidt, T. (2020). Machine learning for multiple yield curve markets: fast calibration in the Gaussian affine framework. Risks, 8(2), 50. | | 1 |

Acknowledgement

I would like to thank @wesselb for suggesting this metric for the first time to me.

@adrinjalali
Member

I think @lorentzenchr would be very much interested in this one.

@lorentzenchr
Member

The proposed score is the negative log-likelihood of a Normal/Gaussian distribution. Two questions come to mind:

  1. Are scores for expectations and variance/std deviation in-scope?
  2. Is it ok to rely on the assumption of normality when scoring? There are more general scores for the combination of expectation and variance. I would need to look into references.

@patel-zeel
Contributor Author

Thank you for your response @lorentzenchr.

  1. Are scores for expectations and variance/std deviation in-scope?

Can you please elaborate a bit more on this question?

  1. Is it ok to rely on the assumption of normality when scoring? There are more general scores for the combination of expectation and variance. I would need to look into references.

A generalized score seems interesting. I was wondering if you could point me towards a reference. However, I think that the normality assumption holds true for GaussianProcessRegressor, BayesianRidge and ARDRegression.

@lorentzenchr
Member

We are talking about the pair of point forecasts (mean, variance). The question is: are pairs of point forecasts in-scope? Which API would we follow (metric(y_true, y_pred_mean, y_pred_var))? They seem very specific to the Gaussian process estimators.

If we stick to the assumption of normally distributed targets, the log-likelihood may be fine. There is, however, literature on scoring functions (aka loss functions or metrics) for pairs of point forecasts, see Eq. (3.8) of https://doi.org/10.1214/19-EJS1552, where x1 is a prediction for the mean and x2 a prediction for the variance.

@pyLP7

pyLP7 commented Jun 15, 2022

Hi @patel-zeel. Thank you for your suggestion. I am not a scikit-learn core-dev, but I landed here because I was also very interested in this metric found in Rasmussen, C. E. (2006). Comparing the formulas in that reference with your rough code, I would like to clarify a couple of doubts that have arisen:

  1. Shouldn't you include the noise variance in the first and second terms of the MSLL? That is, replace y_std**2 with y_std**2 + dy**2? Of course, I mean the dy associated with your test dataset.
  2. To standardize the score, shouldn't you subtract the loss calculated on your train dataset from the loss of the test dataset? It looks to me like you are computing only the latter here.
  3. In this reference, the MSLL is often compared with the Standardized Mean Squared Error (SMSE). Do you have a rough code version of the SMSE as well? If yes, and you have already tested it, do you have any remarks or findings with respect to the MSLL? I would really appreciate that.

Thank you in advance!

@patel-zeel
Contributor Author

patel-zeel commented Jun 16, 2022

Hi @pyLP7. Several probabilistic metrics are now available in the gpytorch library; you can find them here.

  1. Yes, ideally we should incorporate noise variance with predictive variance. For GPs, we include noise_variance in the predictive distribution itself. Some libraries like GPFlow make it very explicit with predict_y and predict_f functions. So, ideally, we should compute MSLL with predict_y.
  2. I am not aware of such a practice being followed by the community. I'd love to see references if you have some.
  3. I am not aware of this metric called SMSE either.
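The noise-variance point in item 1 can be sketched as follows (all arrays here are hypothetical; the only assumption is that the latent-function uncertainty and the observation noise are independent Gaussians, so their variances add — the `predict_f` vs. `predict_y` distinction):

```python
import numpy as np

def msll(y_true, y_pred, y_std):
    """Mean standardized log loss: average negative Gaussian log density."""
    return np.mean(0.5 * np.log(2 * np.pi * y_std**2)
                   + (y_true - y_pred)**2 / (2 * y_std**2))

# Hypothetical model outputs: std of the latent function ("predict_f" style)
# and a separately known observation-noise std.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.2])
f_std = np.array([0.2, 0.3, 0.25])   # latent-function uncertainty only
noise_std = 0.5                      # assumed observation-noise std

# For independent Gaussians the variances add, giving a "predict_y" style std.
y_std = np.sqrt(f_std**2 + noise_std**2)

print(msll(y_true, y_pred, y_std))
```

Evaluating MSLL with `y_std` (rather than `f_std` alone) scores the distribution of the observations, which is what the held-out `y_true` values are drawn from.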

@pyLP7

pyLP7 commented Jun 16, 2022

Hi @patel-zeel, and thank you for your reply. Thank you also for the gpytorch link with probabilistic metrics. I was just a bit confused because I thought we were in the scikit-learn module ;)

Anyway, my main reference is the old but gold Rasmussen & Williams, pp. 23, eq. 2.34, which you also mentioned immediately after presenting the MSLL formula. The screenshot you pasted looks different from the current reference though (please check the link). I therefore enclose a new picture below to answer your points 2 and 3.

[screenshot of eq. (2.34) and the surrounding text from Rasmussen & Williams]

Just take a look please, maybe I misunderstood something :)

@aaronlrees

aaronlrees commented Aug 14, 2023

From what I understand (and I could be wrong, please let me know if so), I believe there is a problem with what has been said above. People are referring to equation (2.34) as the MSLL. This is incorrect: it is simply the negative log (Gaussian) likelihood. As highlighted above, the SLL is calculated by subtracting the loss obtained under a trivial model. To find the trivial model, simply calculate the mean and std of the y_test you are trying to predict, then plug these values into the negative log likelihood (2.34) to get your trivial loss. Subtract the trivial loss from the actual loss and you have the SLL.
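A sketch of this standardization on synthetic data (all names are illustrative; here the trivial model is fitted to the training targets, as in GPML — fitting it to the test targets instead just shifts the baseline):

```python
import numpy as np

def nll_gaussian(y_true, y_pred, y_std):
    """Per-point negative log likelihood under independent Gaussians (eq. 2.34)."""
    return 0.5 * np.log(2 * np.pi * y_std**2) + (y_true - y_pred)**2 / (2 * y_std**2)

def msll_standardized(y_true, y_pred, y_std, y_train):
    # Trivial model: a single Gaussian with the mean/std of the training targets.
    mu0, sd0 = np.mean(y_train), np.std(y_train)
    trivial = nll_gaussian(y_true, mu0, sd0)
    return np.mean(nll_gaussian(y_true, y_pred, y_std) - trivial)

# Toy check: a model that matches the trivial predictor scores exactly 0,
# and a genuinely better model scores below 0.
rng = np.random.default_rng(0)
y_train = rng.normal(5.0, 2.0, 500)
y_test = rng.normal(5.0, 2.0, 500)

same = msll_standardized(y_test, np.full(500, y_train.mean()),
                         np.full(500, y_train.std()), y_train)
better = msll_standardized(y_test, y_test + rng.normal(0, 0.1, 500),
                           np.full(500, 0.1), y_train)
print(same, better)
```

With this convention, negative values mean the model beats the trivial Gaussian baseline, which is the interpretation used in the GPML experiments.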

@lorentzenchr
Member

lorentzenchr commented Aug 14, 2023

@scikit-learn/core-devs Are there other opinions whether or not to include (scoring) metrics for a pair of point predictions like (mean, variance) = $(\mu,\sigma^2)$ ?

There is good literature for it, see https://www.jstor.org/stable/43974729 (theory heavy) and https://doi.org/10.1214/19-EJS1552 Eq 4.13; e.g., we know when such a metric is statistically consistent in the large sample limit. The easiest metrics would probably be an extension of the squared error, either

  • $(y - \mu)^2 + y^2 \log\frac{y^2}{\mu^2+\sigma^2}+\mu^2+\sigma^2-y^2$ (squared error for $\mu$ plus poisson deviance for the second moment, such that the score is a homogeneous function)
  • or the Dawid-Sebastiani score $\frac{(y-\mu)^2}{\sigma^2} + 2\log\sigma$. (This is even a scoring rule and assesses the whole predictive distribution. Up to constants, this is the metric proposed by the OP in this issue.)

What would our API be?
metric(y_true, y_pred_mean, y_pred_var)?
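A minimal sketch of the Dawid-Sebastiani score under the API suggested above (the function name and test values are hypothetical; the score is parameterized by the predicted variance, using 2 log σ = log σ²). As noted, it coincides with twice the Gaussian log loss proposed in this issue, minus the constant log(2π):

```python
import numpy as np

def dawid_sebastiani_score(y_true, y_pred_mean, y_pred_var):
    """Dawid-Sebastiani score: mean of (y - mu)^2 / sigma^2 + log(sigma^2)."""
    return np.mean((y_true - y_pred_mean)**2 / y_pred_var + np.log(y_pred_var))

def msll(y_true, y_pred, y_std):
    """Gaussian log loss as proposed in this issue."""
    return np.mean(0.5 * np.log(2 * np.pi * y_std**2)
                   + (y_true - y_pred)**2 / (2 * y_std**2))

# Hypothetical predictions with per-point variances.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
mu = np.array([2.5, 0.0, 2.0, 8.0])
var = np.array([0.25, 0.09, 0.04, 0.36])

ds = dawid_sebastiani_score(y_true, mu, var)
# Up to the additive constant log(2*pi), DS is twice the Gaussian log loss:
assert np.isclose(ds, 2 * msll(y_true, mu, np.sqrt(var)) - np.log(2 * np.pi))
print(ds)
```

Because the two scores differ only by an affine transformation, they rank models identically; the choice between them is a matter of convention.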

@patel-zeel
Contributor Author

patel-zeel commented Aug 15, 2023

There are a bunch of distribution-free interval-based scikit-learn style metrics available here as well: mapie.metrics.

For the API, if it has to be consistent with taking two arguments only, one more option could be metric(y_true, (y_pred_mean, y_pred_var))

@glemaitre
Member

I don't think that we should restrain ourselves to the current metric API (i.e. only two parameters).

If I recall, the clustering metrics are already different.

What would our API be?
metric(y_true, y_pred_mean, y_pred_var)?

Our API for the "Bayesian" estimators has never been discussed much. Since all of those can in general return estimates of both the mean and the standard deviation, I would expect our metric to take those as input.

@glemaitre
Member

Then, if we consider MAPIE and think ahead to a new API that would return different estimates, we would probably want a different API for these metrics. But I am unsure that we should prevent adding such a metric for this reason.

@Micky774
Contributor

I'm okay with including metrics which consume expectation and variance, even if this changes the metrics API.

@lorentzenchr
Member

For me, it is more a question of whether or not we provide models that compute both. Currently we have Gaussian process and Bayesian ridge and, in a way, Bayesian mixture. So yes, we do.
