[ENH] Access the actual features (including lagged y) used as input for the model inside a reduction forecaster #5644

Open
hliebert opened this issue Dec 19, 2023 · 10 comments
Labels
enhancement Adding new functionality module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting

Comments

@hliebert
Contributor

I would like to access the actual X features used as input for the model inside a reducer. It appears these are not currently stored (unless I've missed something).

Would it be possible to store them (or provide a method to recreate them)? One prominent use case is computing Shapley values, which requires the actual feature values passed to the model.

A simple example is given below. _y and _X are stored for the RecursiveTabularRegressionForecaster, but not the actual input passed to the nested estimator.

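import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sktime.datasets import load_longley
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import make_reduction

y, X = load_longley()

# Hold out the last three periods so that predict receives future X values
# rather than the X the forecaster was fitted on.
y_train, X_train, X_test = y.iloc[:-3], X.iloc[:-3], X.iloc[-3:]
horizon = ForecastingHorizon(np.arange(1, 4), is_relative=True)

random_forest = RandomForestRegressor()
reduction_forecaster = make_reduction(
    random_forest, window_length=5, strategy="recursive"
)

reduction_forecaster.fit(y_train, X_train, fh=horizon)
reduction_forecaster.predict(fh=horizon, X=X_test)

# The raw training data is stored on the forecaster ...
reduction_forecaster._X
reduction_forecaster._y
# ... but the windowed feature matrix passed to random_forest is not.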
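
To make concrete what "the actual input passed to the nested estimator" could look like, here is a minimal, library-independent sketch of a tabular reduction at training time. The helper `make_training_table` and its column layout (lagged y values followed by the exogenous values at the target time point) are illustrative assumptions; the layout sktime builds internally may differ, for instance in how exogenous columns are windowed.

```python
from typing import List, Sequence, Tuple


def make_training_table(
    y: Sequence[float],
    X: Sequence[Sequence[float]],
    window_length: int,
) -> Tuple[List[List[float]], List[float]]:
    """Build (features, targets) for a one-step-ahead tabular regressor.

    Each feature row holds the previous `window_length` values of y followed
    by the exogenous values at the target time point. This layout is an
    assumption for illustration, not sktime's documented behaviour.
    """
    if len(y) != len(X):
        raise ValueError("y and X must have the same length")
    if not 0 < window_length < len(y):
        raise ValueError("window_length must be in [1, len(y) - 1]")

    rows, targets = [], []
    for t in range(window_length, len(y)):
        lagged_y = list(y[t - window_length:t])  # y[t-w], ..., y[t-1]
        rows.append(lagged_y + list(X[t]))
        targets.append(y[t])
    return rows, targets


y_demo = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
X_demo = [[float(i)] for i in range(len(y_demo))]  # one exogenous column

features, targets = make_training_table(y_demo, X_demo, window_length=5)
for row, target in zip(features, targets):
    print(row, "->", target)
```

A matrix of this kind (`features`) is what a tool such as shap would need as background data, which is why access to it matters for the request above.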
@hliebert hliebert added the enhancement Adding new functionality label Dec 19, 2023
@yarnabrina
Collaborator

Just out of curiosity, may I ask how you are planning to use Shapley values here?

For example, strategy="direct" uses multiple models (one per forecast horizon), and with strategy="recursive" the "actuals" vary across forecast horizons, if I understand correctly: fh=1 uses only the last window_length training observations, fh=2 uses the last window_length - 1 training observations plus the prediction for fh=1, and so on.

So, what would you pass to shap as the "actual" values?

(As I said, this is not directly related to your issue; I'm asking out of curiosity, as I am also planning an explainability component for my office work.)
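
The way the recursive window shifts from observed values to earlier predictions can be sketched as follows. This is only an illustration of the description above; the function `window_composition` and its labels are hypothetical and not part of sktime.

```python
from typing import List


def window_composition(n_train: int, window_length: int, step: int) -> List[str]:
    """Label the inputs of the lag window used to forecast `step` steps ahead.

    "y[i]" is the observed training value at 0-based position i, and
    "pred[k]" is the model's own forecast for step k.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if not 0 < window_length <= n_train:
        raise ValueError("window_length must be in [1, n_train]")

    # Observed values come first, followed by forecasts for earlier steps.
    timeline = [f"y[{i}]" for i in range(n_train)]
    timeline += [f"pred[{k}]" for k in range(1, step)]
    return timeline[-window_length:]


for step in (1, 2, 3):
    print(step, window_composition(n_train=16, window_length=5, step=step))
```

With 16 training observations and window_length=5, the window for step 3 mixes three observed values with two earlier predictions, so any attribution computed on it partly explains the model's own forecasts rather than the data.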

@hliebert
Contributor Author

hliebert commented Dec 19, 2023

In my current application I care mostly about the first few horizons (1-3); later horizons are increasingly less relevant. For now I plan to look at the values for the first few horizons separately.

I'm aware of the issue with features varying by horizon in the recursive case, and I'm not sure what the best solution is. Maybe offering a method to recreate the input for a given horizon is easier than storing it?

@fkiraly fkiraly added the module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting label Dec 19, 2023
@fkiraly
Collaborator

fkiraly commented Dec 22, 2023

Related: this request by @yarnabrina to access the internal data in a two-step exogenous forecast:
#5598

I wonder whether there should be a programmatic way to access internal preprocessed data.

@fkiraly
Collaborator

fkiraly commented Dec 22, 2023

From a design perspective, we could "dump" the formatted data in a _X-like argument, although we should be cautious as this can blow up the pickle size etc.

A "nicer" way would be to also allow the forecaster to act as a transformer, which is now possible with the object_type tag that can have multiple types. This was introduced to allow polymorphism for the graphical pipeline, FYI @benHeid.

There used to be a transform method, so I wonder whether this can simply be reactivated.
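
As a rough illustration of the pickle-size concern, the sketch below compares the pickled size of a raw series with that of its sliding-window table. It uses plain Python lists rather than sktime's internal data structures, so the ratios are indicative only; `sliding_windows` is a hypothetical helper.

```python
import pickle
from typing import List


def sliding_windows(values: List[float], window_length: int) -> List[List[float]]:
    """Return every contiguous window of `window_length` values."""
    if not 0 < window_length <= len(values):
        raise ValueError("window_length must be in [1, len(values)]")
    last_start = len(values) - window_length
    return [values[i:i + window_length] for i in range(last_start + 1)]


series = [float(i) for i in range(1_000)]
raw_size = len(pickle.dumps(series))

table_sizes = {}
for window_length in (5, 50):
    table = sliding_windows(series, window_length)
    table_sizes[window_length] = len(pickle.dumps(table))
    ratio = table_sizes[window_length] / raw_size
    print(f"window_length={window_length}: about {ratio:.1f}x the raw series")
```

Because every observation is repeated in up to window_length rows, the stored table grows roughly in proportion to the window length, which is the trade-off to weigh against recreating the table on demand.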

@fkiraly
Collaborator

fkiraly commented Dec 22, 2023

On a related note, there is a transformer, the ReducerTransform (not much used, afaik), which also addresses this issue and could be used for Shapley values.

@hliebert
Contributor Author

> From a design perspective, we could "dump" the formatted data in a _X-like argument, although we should be cautious as this can blow up the pickle size etc.
>
> A "nicer" way would be to also allow the forecaster to act as a transformer, which is now possible with the object_type tag that can have multiple types. This was introduced to allow polymorphism for the graphical pipeline, FYI @benHeid.
>
> There used to be a transform method, so I wonder whether this can simply be reactivated.

I'd be fine with either, although I'm not sure what "dumping" would look like with the recursive reducer.

@hliebert
Contributor Author

hliebert commented Dec 29, 2023

> On a related note, there is a transformer, the ReducerTransform (not much used, afaik), which also addresses this issue and could be used for Shapley values.

Thanks for pointing this out, I'll have a look. Does this mean the API reference on the homepage is incomplete? I've looked through the list of transformers, and this one isn't listed.

@fkiraly
Collaborator

fkiraly commented Jan 2, 2024

> Does this mean the API reference on the homepage is incomplete

Yes, sorry.

The issue is that we are not sure how to test for "estimator is not present on the API reference page".
It should be picked up by the all_estimators utility though, since that is programmatic and crawls the package.

I've added it and the new direct reducer prototype to the API reference:
#5690

@fkiraly
Collaborator

fkiraly commented Jan 2, 2024

> although I'm not sure what "dumping" would look like with the recursive reducer.

I think that is precisely @yarnabrina's question, i.e., what should it even do.

@hliebert
Contributor Author

hliebert commented Jan 4, 2024

> > although I'm not sure what "dumping" would look like with the recursive reducer.
>
> I think that is precisely @yarnabrina's question, i.e., what should it even do.

Maybe just dump _X in a dictionary keyed by horizon? Or provide a method that takes the horizon as an argument and returns X after fitting.
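
One possible shape for such a per-horizon lookup is sketched below. It is not an existing sktime method: `recursive_inputs_by_horizon` and the `predict_one` callable (a stand-in for the fitted tabular regressor) are hypothetical, and exogenous features are left out for brevity.

```python
from typing import Callable, Dict, List, Sequence


def recursive_inputs_by_horizon(
    y_train: Sequence[float],
    predict_one: Callable[[List[float]], float],
    window_length: int,
    horizons: Sequence[int],
) -> Dict[int, List[float]]:
    """Return {horizon: lag window fed to the model} for a recursive strategy.

    `predict_one` is a hypothetical stand-in for the fitted regressor: it maps
    one lag window to a one-step-ahead prediction.
    """
    if not horizons or min(horizons) < 1:
        raise ValueError("horizons must be a non-empty sequence of ints >= 1")
    if not 0 < window_length <= len(y_train):
        raise ValueError("window_length must be in [1, len(y_train)]")

    history = list(y_train)
    wanted = set(horizons)
    inputs: Dict[int, List[float]] = {}
    for step in range(1, max(horizons) + 1):
        window = history[-window_length:]
        if step in wanted:
            inputs[step] = window
        history.append(predict_one(window))  # feed the forecast back in
    return inputs


def mean_model(window: List[float]) -> float:
    """Toy model: predict the mean of the lag window."""
    return sum(window) / len(window)


inputs = recursive_inputs_by_horizon(
    [1.0, 2.0, 3.0, 4.0], mean_model, window_length=2, horizons=[1, 2, 3]
)
print(inputs)
```

Recreating the windows on demand like this avoids storing them on the fitted forecaster, at the cost of re-running the recursive predictions up to the largest requested horizon.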

Thanks for updating the docs!


3 participants