
API to predict multiple quantiles at once #23334

Open
ogrisel opened this issue May 12, 2022 · 16 comments
Labels
API Needs Decision Requires decision

Comments

@ogrisel
Member

ogrisel commented May 12, 2022

Classifiers have a predict_proba method that makes it possible to quantify probabilistically the certainty in the predictions for a given input X_i.

Currently most regressors in scikit-learn only predict a conditional expectation E[Y|X], and some have a return_std option that also makes it possible to estimate sqrt(VAR[Y|X]), which can be used to quantify the certainty when assuming a Gaussian predictive distribution (typically for Gaussian processes, which estimate a Gaussian predictive posterior distribution).

We do have pointwise quantile estimators (linear models, gradient boosting, hist gradient boosting) whose predict method returns a single point estimate for the target quantile passed as a hyper-parameter, instead of estimating the expectation.

Several people have expressed the need for a more generic API that can return an array of quantile estimates for a given input X_i.

The goal of this issue is to centralize the discussion of an API extension to do this more uniformly in scikit-learn, either via a meta-estimator that wraps an array of point-wise quantile estimators to turn them into a quantile-array estimator, or by having the base estimators do this directly (and sometimes more efficiently).

A non-exhaustive list of related PRs and issues (feel free to add or suggest new ones):

Also related:

Furthermore, for models like Poisson regression that make a specific assumption about the conditional Y|X distribution, it would be possible to estimate inverse-CDF values of the estimated Y|X distribution, for instance. Those could probably also benefit from an expanded API.
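To illustrate (a sketch, not an existing scikit-learn API): for a fitted Poisson regressor, the predicted mean fully determines the assumed conditional distribution, so quantiles at arbitrary levels can be read off scipy's Poisson inverse CDF:

```python
import numpy as np
from scipy.stats import poisson
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 2))
y = rng.poisson(lam=np.exp(X @ np.array([1.0, 0.5])))

reg = PoissonRegressor().fit(X, y)
mu = reg.predict(X[:5])  # estimated conditional means E[Y|X_i]

quantile_levels = [0.05, 0.5, 0.95]
# one column per requested level, via the Poisson inverse CDF
y_quantiles = np.column_stack([poisson.ppf(q, mu) for q in quantile_levels])
```

Because the quantiles are derived at predict time from the estimated mean, any grid of levels can be requested without refitting.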

If we do this, then we have the side question of how to evaluate such multi-quantile models. We could probably extend the pinball_loss scorer to average the pinball scores over an array of quantiles, for instance.

/cc @GaelVaroquaux @amueller @lorentzenchr

@GaelVaroquaux
Member

GaelVaroquaux commented May 12, 2022 via email

@ogrisel
Member Author

ogrisel commented May 12, 2022

About names, MAPIE uses the term "prediction intervals" which is nice because it could be used either in a frequentist or a Bayesian context (instead of confidence vs credible intervals) and makes it explicit that this is for a specific prediction.

However "interval" is annoying because we might want to grid the inverse-CDF of the predictive distribution instead of using a pair of quantiles. So maybe we should never speak about intervals in our API (and maybe we should focus on names such as "prediction quantiles" or "grided inverse-CDF of the predictive distribution").

Then we have the problem of when to pass the quantiles: some estimators require them at fit time, and therefore the quantiles should be passed as constructor parameters. Others (e.g. PoissonRegressor, the MAPIE conformal prediction wrapper) could compute any quantile predictions at predict time without having to specify the grid of quantiles before fitting.

I think we should strive to find an API that can handle both cases.

regressor = FitTimeQuantileAwareRegressor(quantiles=[0.05, 0.5, 0.95])
regressor.fit(X_train, y_train_observations)

y_test_quantiles = regressor.predict(X)

# or with a dedicated method:

y_test_quantiles = regressor.predict_quantiles(X)

If the latter, what would predict return for this estimator?

regressor = PredictTimeQuantileAwareRegressor()
regressor.fit(X_train, y_train_observations)

y_pred, y_quantiles = regressor.predict(X_test, return_quantiles=[0.05, 0.5, 0.95])

# or with a dedicated method:

y_quantiles = regressor.predict_quantiles(X_test, quantiles=[0.05, 0.5, 0.95])

Here y_pred would typically be the usual conditional expectation.
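For reference, the fit-time variant above could be sketched as a small meta-estimator that fits one point-wise quantile estimator per level. The class name and the hard-coded GradientBoostingRegressor base are illustrative only, not a proposed implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import GradientBoostingRegressor

class MultiQuantileRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, quantiles=(0.05, 0.5, 0.95)):
        self.quantiles = quantiles

    def fit(self, X, y):
        # one fitted point-wise quantile estimator per level
        self.estimators_ = [
            GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
            for q in self.quantiles
        ]
        return self

    def predict_quantiles(self, X):
        # stack per-level predictions into shape (n_samples, n_quantiles)
        return np.column_stack([est.predict(X) for est in self.estimators_])

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
y = X.ravel() + rng.normal(scale=0.1, size=100)
y_quantiles = MultiQuantileRegressor().fit(X, y).predict_quantiles(X[:5])
```

Note that nothing in this sketch prevents quantile crossing between the independently fitted estimators, which is one of the issues a native implementation could address.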

@amueller
Member

amueller commented Jul 14, 2023

I was just going through this in my head again and came up with a very similar API. I agree with @ogrisel's suggestions, and I agree that doing it based on quantiles seems good. I think we should deprecate return_std and use this instead.

@lorentzenchr
Member

AFAIU, our API is general enough to support multiple outputs from predict, same as predict_proba for multiclass classifiers.

If a model predicts quantiles, I guess a keyword like regressor.predict(X_test, quantile_levels=[0.05, 0.5, 0.95]) -> np.array of shape (n_samples, 3) makes sense to me.

I would not, however, mix quantiles and point estimates of the expectation/mean, because if you specify two quantile levels, say 25% and 75%, the estimated expectation might still lie outside this interval. In that case, I would propose using expectiles at different probability levels instead.

I guess the problem is then how to pass this meta-info ("hi there, here come 3 quantiles for levels 5%, 50% and 95%") to scorers and other model diagnostic tools.

Classifiers have a predict_proba method that makes it possible to quantify probabilistically the certainty in the predictions for a given input X_i.

To be precise, predict_proba quantifies the likelihood of Y=class ... given X_i (not the certainty of the prediction itself, nor the certainty of predict in general, unless the 50% threshold is applied, which is a design flaw).

@joshdunnlime

Personally, it would be great to see the scikit-learn package take some direction on what a future multi-quantile API might look like. Scikit-learn has such great influence that it basically sets the standard for Python ML best practice.

Here are a few packages with differing APIs based on scikit learn:
https://github.com/StatMixedML/XGBoostLSS/blob/master/xgboostlss/model.py#L468C47-L468C47
https://github.com/StatMixedML/LightGBMLSS/blob/master/lightgbmlss/model.py#L447
https://xgboost.readthedocs.io/en/latest/parameter.html#parameter-for-using-quantile-loss-reg-quantileerror
https://catboost.ai/en/docs/concepts/loss-functions-regression#MultiQuantile

Other well-known packages using quantiles, such as MAPIE and LightGBM, often resort to looping over a list of quantiles. NGBoost, PGBM and XGBoost-Distribution could all easily add multi-quantile predictions to their scikit-learn APIs - my guess is that the main reason they haven't is that there is no standard API coming from scikit-learn on how best to do this!

I believe a .predict_quantile() method would be very nice, but equally something like XGBoostLSS's and LightGBMLSS's predict(X, pred_type="quantiles") works nicely.

I would love to hear your input on this, as I am looking to work on some of these projects and think it would be good to converge on a somewhat standard multi-quantile API.

@fkiraly

fkiraly commented Jan 26, 2024

skpro already implements a generic, composable, scikit-learn-like and fully compatible interface for probabilistic tabular regression predictions, which includes multiple quantile predictions.

It integrates with sktime, the scikit-learn-like time series toolbox (for probabilistic forecasts), and both are architected on top of scikit-base, which provides machinery for sklearn-like patterns (get_params, set_params, BaseEstimator, tags, configs, marketplace lookup of estimators, etc.).

The specific multiple quantiles interface that is proposed - and widely in use with sktime already - is the predict_quantiles interface which is shared by skpro and sktime. There is also predict_interval for interval predictions, and predict_proba for fully distributional predictions, which produces sklearn-like distribution objects (implemented in skpro).

skpro already has interfaces for MAPIE and cyclic_boosting, as well as native basic compositors such as the "two-step method" or "squaring residuals". Contributions would be very much appreciated to interface further tabular probabilistic regression methods such as the LSS-verse (StatMixedML/XGBoostLSS#69) or ngboost (sktime/skpro#135).

@fkiraly

fkiraly commented Jan 26, 2024

Perhaps to add: where sktime/skpro/skbase depart from sklearn is in adopting an architecture where estimators do not need to be contributed to, or maintained directly in, the core package - simply because the time series space, as well as the tabular probabilistic regression space, has a lot of individual, popular packages.

Instead, the architectural principle adopted is one of mini-packages, or dependency management at the level of the individual estimator. For instance, MAPIE proper is maintained in its own repository, and so is cyclic_boosting; what lives in skpro is a mini-package-plus-interface-class which internally imports the logic from 3rd- and 2nd-party vendors. The dependencies are not dependencies of skpro, the framework, but of the individual estimator. For the user, however, it looks, feels, and works just like sklearn (except that some classes may complain and not construct if the Python env does not provide a necessary package).

Of course, natively implemented estimators à la vanilla sklearn can also be contributed directly.

@joshdunnlime

joshdunnlime commented Jan 26, 2024

Thanks @fkiraly
The skpro API seems to conform with some of the suggestions here:

  • predict_proba
  • predict_quantile

As such, I think I will adopt this for my use case: XGBoostLSS and then LightGBMLSS.

It would be great if the sklearn core team adopted the skpro API as the longer-term solution.

@lorentzenchr
Member

Could you be more concrete? For which model/estimator do you propose which extension of its API?

@joshdunnlime

Updated my previous comment for added clarity. Essentially, skpro's API meets the requirements for quantile and/or distribution regression on top of the sklearn API.

@fkiraly Correct me if I am wrong, but it seems that the only thing skpro does not satisfy (that is mentioned in this issue) is something like the FitTimeQuantileAwareRegressor.

@fkiraly

fkiraly commented Jan 28, 2024

Correct me if I am wrong, but it seems that the only thing skpro does not satisfy (that is mentioned in the issue) is something like the FitTimeQuantileAwareRegressor

There was in fact one such case, namely the MultipleQuantileRegressor that @Ram0nB implemented.
https://github.com/sktime/skpro/blob/main/skpro/regression/multiquantile.py

The solution we adopted was:

  • quantiles at which predictions are computed are specified in the constructor
  • if other quantiles are requested, the prediction for the closest initially requested quantile is returned
  • this gives rise to a specific distribution as well, which is returned in predict_proba (see docs for its explicit form)
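The closest-quantile lookup in the second bullet can be sketched in a few lines of numpy (the data and the function name here are purely illustrative):

```python
import numpy as np

# levels that were specified in the constructor and fitted
fitted_levels = np.array([0.05, 0.5, 0.95])
# hypothetical fitted predictions, shape (n_samples, n_fitted_levels)
fitted_predictions = np.array([[0.1, 1.0, 1.9],
                               [0.2, 2.0, 3.8]])

def predict_at(requested_levels):
    # for each requested level, pick the column of the nearest fitted level
    cols = [int(np.argmin(np.abs(fitted_levels - q))) for q in requested_levels]
    return fitted_predictions[:, cols]

y_q = predict_at([0.1, 0.9])  # falls back to the 0.05 and 0.95 columns
```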

An alternative solution discussed was allowing parameters such as alpha and coverage to be passed in fit, similar to the forecasting horizon fh in sktime.

We decided against it, as there are multiple predict-like methods, so the overhead in boilerplate seemed risky.

@fkiraly

fkiraly commented Jan 30, 2024

Here's an interesting factoid which imo highlights the importance of having both (a) a clear interface definition and (b) stringent tests for it.

Some sklearn estimators have a variance prediction mode (predict with return_std=True), but this does not seem to be systematically tested. We discovered this when interfacing the few sklearn estimators with such a mode and running them through the battery of tests in skpro: #28310
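For context, this is the mode in question, as exposed today by e.g. GaussianProcessRegressor:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(30, 1))
y = np.sin(3 * X.ravel()) + rng.normal(scale=0.05, size=30)

gpr = GaussianProcessRegressor().fit(X, y)
# returns both the predictive mean and the predictive standard deviation
y_mean, y_std = gpr.predict(X[:5], return_std=True)
```

BayesianRidge and ARDRegression expose the same return_std keyword, but as noted above, the contract is not uniformly tested across estimators.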

@fkiraly

fkiraly commented Mar 24, 2024

just wondering - is this discussion stale, or what are the next steps?

@fkiraly

fkiraly commented Apr 18, 2024

Ping

In case it helps, if you were to make me a core dev, I'd be happy to devote substantial time to fold the API into sklearn, including:

  • transfer of existing probabilistic regressors in sklearn to the API, along the lines of the interface-consistent skpro adapters
    • with an option to include full distributional predictions or not
    • with full, careful deprecation cycle where interfaces are changing
  • homogenization with the current sklearn testing, tag, and base framework
  • optionally, integration of MAPIE and other relevant 2nd party packages in the closer halo of sklearn

@lorentzenchr
Member

The specific multiple quantiles interface that is proposed - and widely in use with sktime already - is the predict_quantiles interface which is shared by skpro and sktime.

I'm not 100% convinced of introducing such a method. For quantile regressors, predict already predicts a (single) quantile.

@ogrisel wrote in the initial statement:

The goal of this issue is to centralize the discussion of an API extension to be able to do this more uniformly in scikit-learn, either via a meta-estimator that wraps an array of point-wise quantile estimator to turn it into a quantile-array estimator or to directly have the base estimators able to do this directly (and sometimes more efficiently).

@scikit-learn/core-devs ping for API discussion.

@thomasjpfan
Member

@scikit-learn/core-devs ping for API discussion.

I'm +1 with the API proposed here: #23334 (comment). Now that we have metadata routing, we can properly support predict(X, quantile_levels=[0.05, 0.5, 0.95]).
