
API to predict multiple quantiles at once #23334

Open
ogrisel opened this issue May 12, 2022 · 16 comments
Labels
API Needs Decision Requires decision

Comments

@ogrisel
Member

ogrisel commented May 12, 2022

Classifiers have a predict_proba method that makes it possible to quantify probabilistically the certainty in the predictions for a given input X_i.

Currently most regressors in scikit-learn only predict a conditional expectation E[Y|X], and some have a return_std option that also makes it possible to estimate sqrt(VAR[Y|X]), which can be used to quantify the certainty when assuming a Gaussian predictive distribution (typically for Gaussian processes, which estimate a Gaussian predictive posterior distribution).

We do have pointwise quantile estimators (linear models, gradient boosting, hist gradient boosting) whose predict method returns a single point estimate for the target quantile passed as a hyper-parameter, instead of estimating the expectation.

Several people have expressed the need for a more generic API that can return an array of quantile estimates for a given input X_i.

The goal of this issue is to centralize the discussion of an API extension to do this more uniformly in scikit-learn, either via a meta-estimator that wraps an array of point-wise quantile estimators to turn them into a quantile-array estimator, or by having the base estimators do this directly (and sometimes more efficiently).

A non-exhaustive list of related PRs and issues (feel free to add or suggest new ones):

Also related:

Furthermore, for models like Poisson regression that make a specific assumption about the conditional Y|X distribution, it would be possible to estimate inverse-CDF values of the estimated Y|X distribution, for instance. Those could probably also benefit from an expanded API.
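To illustrate (a sketch, not an existing scikit-learn API): for a fitted Poisson regressor, the predicted mean fully determines the assumed conditional distribution, so quantiles at arbitrary levels can be read off scipy's Poisson inverse CDF:

```python
import numpy as np
from scipy.stats import poisson
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 2))
y = rng.poisson(lam=np.exp(X @ np.array([1.0, 0.5])))

reg = PoissonRegressor().fit(X, y)
mu = reg.predict(X[:5])  # estimated conditional means E[Y|X_i]

quantile_levels = [0.05, 0.5, 0.95]
# one column per requested level, via the Poisson inverse CDF
y_quantiles = np.column_stack([poisson.ppf(q, mu) for q in quantile_levels])
```

Because the quantiles are derived at predict time from the estimated mean, any grid of levels can be requested without refitting.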

If we do this, then we have the side question of how to evaluate such multi-quantile models. We could probably extend the pinball_loss scorer to average the pinball scores over an array of quantiles, for instance.

/cc @GaelVaroquaux @amueller @lorentzenchr

@GaelVaroquaux
Member

GaelVaroquaux commented May 12, 2022 via email

@ogrisel
Member Author

ogrisel commented May 12, 2022

About names, MAPIE uses the term "prediction intervals" which is nice because it could be used either in a frequentist or a Bayesian context (instead of confidence vs credible intervals) and makes it explicit that this is for a specific prediction.

However "interval" is annoying because we might want to grid the inverse-CDF of the predictive distribution instead of using a pair of quantiles. So maybe we should never speak about intervals in our API (and maybe we should focus on names such as "prediction quantiles" or "grided inverse-CDF of the predictive distribution").

Then we have the problem of when to pass the quantiles: some estimators require them at fit time, and therefore the quantiles should be passed as constructor parameters. Others (e.g. PoissonRegressor, the MAPIE conformal prediction wrapper) could compute any quantile predictions at predict time without having to specify the grid of quantiles before fitting.

I think we should strive to find an API that can handle both cases.

regressor = FitTimeQuantileAwareRegressor(quantiles=[0.05, 0.5, 0.95])
regressor.fit(X_train, y_train_observations)

y_test_quantiles = regressor.predict(X)

# or with a dedicated method:

y_test_quantiles = regressor.predict_quantiles(X)

If the latter, what would predict return for this estimator?

regressor = PredictTimeQuantileAwareRegressor()
regressor.fit(X_train, y_train_observations)

y_pred, y_quantiles = regressor.predict(X_test, return_quantiles=[0.05, 0.5, 0.95])

# or with a dedicated method:

y_quantiles = regressor.predict_quantiles(X_test, quantiles=[0.05, 0.5, 0.95])

Here y_pred would typically be the usual conditional expectation.
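For reference, the fit-time variant above could be sketched as a small meta-estimator that fits one point-wise quantile estimator per level. The class name and the hard-coded GradientBoostingRegressor base are illustrative only, not a proposed implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import GradientBoostingRegressor

class MultiQuantileRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, quantiles=(0.05, 0.5, 0.95)):
        self.quantiles = quantiles

    def fit(self, X, y):
        # one fitted point-wise quantile estimator per level
        self.estimators_ = [
            GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
            for q in self.quantiles
        ]
        return self

    def predict_quantiles(self, X):
        # stack per-level predictions into shape (n_samples, n_quantiles)
        return np.column_stack([est.predict(X) for est in self.estimators_])

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = X.ravel() + rng.normal(scale=0.1, size=100)
y_quantiles = MultiQuantileRegressor().fit(X, y).predict_quantiles(X[:5])
```

Note that nothing in this sketch prevents quantile crossing between the independently fitted estimators, which is one of the issues a native implementation could address.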

@amueller
Member

amueller commented Jul 14, 2023

I was just going through this in my head again and came up with a very similar API. I agree with @ogrisel's suggestions, and I agree that doing it based on quantiles seems good. I think we should deprecate return_std and use this instead.

@lorentzenchr
Member

AFAIU, our API is general enough to support multiple outputs from predict, same as predict_proba for multiclass classifiers.

If a model predicts quantiles, I guess a keyword like regressor.predict(X_test, quantile_levels=[0.05, 0.5, 0.95]) -> np.array of shape (n_samples, 3) makes sense to me.

I would not, however, mix quantiles and point estimates of the expectation/mean, because if you specify two quantile levels, say 25% and 75%, the estimated expectation might still lie outside this interval. In that case, I would propose using expectiles at different probability levels instead.

I guess the problem is then how to pass this meta-info ("hi there, here come 3 quantiles for levels 5%, 50% and 95%") to scorers and other model diagnostic tools.

Classifiers have a predict_proba method that makes it possible to quantify probabilistically the certainty in the predictions for a given input X_i.

To be precise, predict_proba quantifies the likelihood of Y=class ... given X_i (not the certainty of the prediction itself, nor the certainty of predict in general, unless the 50% threshold is applied, which is a design flaw).

@joshdunnlime

Personally, it would be great to see the scikit-learn package take some direction on what a future multi-quantile API might look like. Scikit-learn has such great influence that it basically sets the standard for Python ML best practice.

Here are a few packages with differing APIs based on scikit learn:
https://github.com/StatMixedML/XGBoostLSS/blob/master/xgboostlss/model.py#L468C47-L468C47
https://github.com/StatMixedML/LightGBMLSS/blob/master/lightgbmlss/model.py#L447
https://xgboost.readthedocs.io/en/latest/parameter.html#parameter-for-using-quantile-loss-reg-quantileerror
https://catboost.ai/en/docs/concepts/loss-functions-regression#MultiQuantile

Other well-known packages using quantiles, such as MAPIE and LightGBM, often resort to looping over a list of quantiles. NGBoost, PGBM and XGBoost-Distribution could all easily add multi-quantile predictions to their scikit-learn APIs - my guess is that the main reason they haven't is that there is no standard API coming from scikit-learn on how best to do this!

I believe a .predict_quantile() method would be very nice, but equally something like XGBoostLSS's and LightGBMLSS's predict(X, pred_type="quantiles") works nicely.

I would love to hear your input on this, as I am looking to work on some of these projects and think it would be good to converge on a somewhat standard multi-quantile API.

@fkiraly

fkiraly commented Jan 26, 2024

skpro already implements a generic, composable, scikit-learn-like and fully compatible interface for probabilistic tabular regression predictions, which includes multiple quantile predictions.

It integrates with sktime, the scikit-learn-like time series toolbox (for probabilistic forecasts), and both are architected on top of scikit-base, which provides machinery for sklearn-like patterns (get_params, set_params, BaseEstimator, tags, configs, marketplace lookup of estimators, etc.).

The specific multiple quantiles interface that is proposed - and widely in use with sktime already - is the predict_quantiles interface which is shared by skpro and sktime. There is also predict_interval for interval predictions, and predict_proba for fully distributional predictions, which produces sklearn-like distribution objects (implemented in skpro).

skpro already has interfaces for MAPIE and cyclic_boosting, as well as native basic compositors such as the "two-step method" or "squaring residuals". Contributions would be very much appreciated to interface further tabular probabilistic regression methods such as the LSS-verse (StatMixedML/XGBoostLSS#69) or ngboost (sktime/skpro#135).

@fkiraly

fkiraly commented Jan 26, 2024

Perhaps to add: where sktime/skpro/skbase depart from sklearn is in adopting an architecture where estimators do not need to be contributed to, or maintained directly in, the core package - simply because the time series space, as well as the tabular probabilistic regression space, has a lot of individual, popular packages.

Instead, the architectural principle adopted is one of mini-packages, or dependency management at the level of the individual estimator. For instance, MAPIE proper is maintained in its own repository, and so is cyclic_boosting; what lives in skpro is a mini-package-plus-interface-class which internally imports the logic from 3rd- and 2nd-party vendors. The dependencies are not dependencies of skpro, the framework, but of the individual estimator. For the user, however, it looks, feels, and works just like sklearn (except that some classes may complain and not construct if the Python env does not provide a necessary package).

Of course, natively implemented estimators à la vanilla sklearn can also be contributed directly.

@joshdunnlime

joshdunnlime commented Jan 26, 2024

Thanks @fkiraly
The skpro API seems to conform with some of the suggestions here:

  • predict_proba
  • predict_quantile

As such, I think I will adopt this for my use case: XGBoostLSS and then LightGBMLSS.

It would be great if the sklearn core team adopted the skpro API as the longer-term solution.

@lorentzenchr
Member

Could you be more concrete? For which model/estimator do you propose which extension of its API?

@joshdunnlime

Updated my previous comment for added clarity. Essentially, skpro's API meets the requirements for quantile and/or distribution regression on top of the sklearn API.

@fkiraly Correct me if I am wrong, but it seems that the only thing skpro does not satisfy (that is mentioned in this issue) is something like the FitTimeQuantileAwareRegressor.

@fkiraly

fkiraly commented Jan 28, 2024

Correct me if I am wrong, but it seems that the only thing skpro does not satisfy (that is mentioned in the issue) is something like the FitTimeQuantileAwareRegressor

There was in fact one such case, namely the MultipleQuantileRegressor that @Ram0nB implemented.
https://github.com/sktime/skpro/blob/main/skpro/regression/multiquantile.py

The solution we adopted was:

  • quantiles at which predictions are computed are specified in the constructor
  • if other quantiles are requested, the prediction for the closest initially requested quantile is returned
  • this gives rise to a specific distribution as well, which is returned in predict_proba (see docs for its explicit form)
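The closest-quantile lookup in the second bullet can be sketched in a few lines of numpy (the data and the function name here are purely illustrative):

```python
import numpy as np

# levels that were specified in the constructor and fitted
fitted_levels = np.array([0.05, 0.5, 0.95])
# hypothetical fitted predictions, shape (n_samples, n_fitted_levels)
fitted_predictions = np.array([[0.1, 1.0, 1.9],
                               [0.2, 2.0, 3.8]])

def predict_at(requested_levels):
    # for each requested level, pick the column of the nearest fitted level
    cols = [int(np.argmin(np.abs(fitted_levels - q))) for q in requested_levels]
    return fitted_predictions[:, cols]

y_q = predict_at([0.1, 0.9])  # falls back to the 0.05 and 0.95 columns
```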

An alternative solution discussed was allowing parameters such as alpha and coverage to be passed in fit, similar to the forecasting horizon fh in sktime.

We decided against it, as there are multiple predict-like methods, so the overhead in boilerplate seemed risky.

@fkiraly

fkiraly commented Jan 30, 2024

Here's an interesting factoid which imo highlights the importance of having both (a) a clear interface definition and (b) stringent tests for it.

Some sklearn estimators have a variance prediction mode (predict with return_std=True), but this does not seem to be systematically tested. We discovered this when interfacing the few sklearn estimators with such a mode and running them through the battery of tests in skpro: #28310
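For context, this is the mode in question, as exposed today by e.g. GaussianProcessRegressor:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(30, 1))
y = np.sin(3 * X.ravel()) + rng.normal(scale=0.05, size=30)

gpr = GaussianProcessRegressor().fit(X, y)
# returns both the predictive mean and the predictive standard deviation
y_mean, y_std = gpr.predict(X[:5], return_std=True)
```

BayesianRidge and ARDRegression expose the same return_std keyword, but as noted above, the contract is not uniformly tested across estimators.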

@fkiraly

fkiraly commented Mar 24, 2024

just wondering - is this discussion stale, or what are the next steps?

@fkiraly

fkiraly commented Apr 18, 2024

Ping

In case it helps, if you were to make me a core dev, I'd be happy to devote substantial time to fold the API into sklearn, including:

  • transfer of existing probabilistic regressors in sklearn to the API, along the lines of the interface-consistent skpro adapters
    • with an option to include full distributional predictions or not
    • with full, careful deprecation cycle where interfaces are changing
  • homogenization with the current sklearn testing, tag, and base framework
  • optionally, integration of MAPIE and other relevant 2nd party packages in the closer halo of sklearn

@lorentzenchr
Member

The specific multiple quantiles interface that is proposed - and widely in use with sktime already - is the predict_quantiles interface which is shared by skpro and sktime.

I'm not 100% convinced of introducing such a method. For quantile regressors, predict already predicts a (single) quantile.

@ogrisel wrote in the initial statement:

The goal of this issue is to centralize the discussion of an API extension to be able to do this more uniformly in scikit-learn, either via a meta-estimator that wraps an array of point-wise quantile estimator to turn it into a quantile-array estimator or to directly have the base estimators able to do this directly (and sometimes more efficiently).

@scikit-learn/core-devs ping for API discussion.

@thomasjpfan
Member

@scikit-learn/core-devs ping for API discussion.

I'm +1 with the API proposed here: #23334 (comment). Now that we have metadata routing, we can properly support predict(X, quantile_levels=[0.05, 0.5, 0.95]).
