ForecastX with mix of future known and future unknown predictors #4378

yarnabrina · 2023-03-22T05:06:10Z

yarnabrina
Mar 22, 2023
Collaborator

I've just started using sktime and am really interested in ForecastX, which makes it easy to use exogenous variables even though they are commonly missing for the forecast horizon. But in a few of my use cases, we know some of them in advance and need to forecast only the rest. For example, we may know the product prices for a sales or revenue forecast. How do we use ForecastX in this scenario?

Let's consider the official example in the documentation. Suppose that we are in 1960 and want to predict the next 2 years, and that ARMED and POP for 1961 and 1962 are somehow known in advance (just assuming this for the sake of an example). How do I fit a model so that all 5 exogenous features are used for training, 3 exogenous variables (GNPDEFL, GNP and UNEMP) are predicted for the forecast horizon, and predictions for y are made by combining those with the known values of the other 2 (ARMED and POP)?

I tried the following, but it failed.

Code

from sktime.datasets import load_longley
from sktime.forecasting.arima import ARIMA
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import ForecastX
from sktime.forecasting.var import VAR

y, X = load_longley()

y_tr = y[:"1960"]
y_ts = y["1960":]

X_tr = X[:"1960"]
X_ts = X["1960":]

fh = ForecastingHorizon([1, 2, 3])
pipe = ForecastX(
    forecaster_X=VAR(),
    forecaster_y=ARIMA(),
    columns=["ARMED", "POP"],
)
pipe = pipe.fit(y_tr, X=X_tr, fh=fh)
y_pred = pipe.predict(X=X_ts.drop(columns=["ARMED", "POP"]), fh=fh)

ValueError: Found non-finite values in dataframe

I'd appreciate some guidance on this problem. Thank you.

Answered by fkiraly

Mar 22, 2023

fkiraly · 2023-03-22T16:59:47Z

fkiraly
Mar 22, 2023
Maintainer

The problem in your example seems to be how pandas slices when you use a colon with date labels: label-based slices include both endpoints.
Unlike the very same indexing with integer positions (!), it produces overlapping folds.
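
To make the difference concrete, here is a minimal pandas-only sketch. The series and its values are placeholders (not the Longley data); only the index behaviour matters.

```python
import pandas as pd

# Yearly PeriodIndex similar to the Longley index; values are placeholders.
idx = pd.period_range("1957", "1962", freq="Y")
s = pd.Series(range(len(idx)), index=idx)

# Label-based slices include BOTH endpoints, so 1960 lands in both parts.
train_lbl, test_lbl = s.loc[:"1960"], s.loc["1960":]
overlap_lbl = sorted(set(train_lbl.index) & set(test_lbl.index))

# Positional slices exclude the stop position, so the parts are disjoint.
train_pos, test_pos = s.iloc[:4], s.iloc[4:]
overlap_pos = sorted(set(train_pos.index) & set(test_pos.index))

print(overlap_lbl)  # one shared period: 1960
print(overlap_pos)  # empty list
```

With date labels, ending the training slice one period earlier (here at "1959") gives the disjoint split that the positional version produces.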

The chain of cause/effect for the problem:

X_tr and X_ts both end up containing 1960.
So you're asking the forecaster to predict 1961, 1962, 1963 (fh=[1,2,3], counted from period 0 = 1960). I think you probably intended to ask it to predict 1960, 1961, 1962.
Either way, you're passing known data (X_ts) only for 1960, 1961, 1962.
So in the pooled X in predict, you end up with missing data in 1963 (pooling the X-forecast and the X-known, 1963 is present in X-forecast but missing in X-known).
But the y-forecaster ARIMA can't deal with missing data in X, so it breaks. The error message is odd too, but it is what pmdarima produces: "non-finite" (instead of, say, "missing").
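
The chain above can be imitated with plain pandas. This is only an illustration of the index mismatch, not ForecastX's internal code. It follows the code as posted, where ARMED and POP are the columns handed to forecaster_X and the remaining three columns are passed to predict. All numbers are placeholders.

```python
import pandas as pd

# X passed to predict covers 1960-1962; the forecasts cover cutoff 1960 + fh [1, 2, 3].
known_idx = pd.period_range("1960", "1962", freq="Y")
forecast_idx = pd.period_range("1961", "1963", freq="Y")

# Placeholder values; only which (year, column) cells exist matters here.
X_known = pd.DataFrame(1.0, index=known_idx, columns=["GNPDEFL", "GNP", "UNEMP"])
X_forecast = pd.DataFrame(2.0, index=forecast_idx, columns=["ARMED", "POP"])

# Pool column-wise, then keep the horizon the y-forecaster is asked to predict.
X_pooled = pd.concat([X_known, X_forecast], axis=1).reindex(forecast_idx)
rows_with_nan = X_pooled.isna().any(axis=1)

print(X_pooled)
print(rows_with_nan)  # only the 1963 row contains NaN
```

The 1963 row has forecasts for ARMED and POP but nothing for the three known columns, which is the gap the y-forecaster then trips over.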

If you want to avoid that, make sure to subset so that the past and future X and y are disjoint. Or, at least, ensure that the known values for X fill the NaN spots that matter for a y-forecaster that cannot deal with NaNs (on the understanding that the overlap above was probably accidental).

This code works:

from sktime.datasets import load_longley
from sktime.forecasting.arima import ARIMA
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import ForecastX
from sktime.forecasting.var import VAR

y, X = load_longley()

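# label-based slices include both endpoints, so training ends at 1959
# to keep the train and test sets disjoint
y_tr = y[:"1959"]
y_ts = y["1960":]

X_tr = X[:"1959"]
X_ts = X["1960":]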

fh = ForecastingHorizon([1, 2, 3])
pipe = ForecastX(
    forecaster_X=VAR(),
    forecaster_y=ARIMA(),
    columns=["ARMED", "POP"],
)
pipe = pipe.fit(y_tr, X=X_tr, fh=fh)
y_pred = pipe.predict(X=X_ts.drop(columns=["ARMED", "POP"]), fh=fh)

Side note: you could also avoid the "drop" in predict.
Hope this helps.

fkiraly · 2023-03-22T17:01:38Z

fkiraly
Mar 22, 2023
Maintainer

Interesting question: what should/could an informative error message have been here? And, when should it have been raised?

2 replies

yarnabrina Mar 23, 2023
Collaborator Author

Perhaps a check with X.isna() can be made here, assuming it's always Series/DataFrame (not sure)?

sktime/sktime/forecasting/compose/_pipeline.py

Lines 1338 to 1339 in 19500a7

    X = self._get_forecaster_X_prediction(fh=fh, X=X)
    y_pred = self.forecaster_y_.predict(fh=fh, X=X)

If there are any, and if forecaster_y_ does not support them (I don't know how to check that; I guess there are tags for it), maybe you could show a message similar to your explanation above?

y-forecaster can't deal with missing data in X

fkiraly Mar 24, 2023
Maintainer

Yes, indeed!
The tag is handles-missing-data, so we could do that.

Currently, I notice, the ForecastX forecaster does not set dependent tags; setting them would be the solution.
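
A rough sketch of what such a check could look like, assuming the pooled X is a DataFrame. The helper name check_pooled_X and its boolean argument handles_missing are hypothetical: the boolean stands in for the value of the y-forecaster's handles-missing-data tag, and none of this is existing sktime code.

```python
import pandas as pd


def check_pooled_X(X, handles_missing):
    """Raise an informative error if X has NaNs the y-forecaster cannot handle.

    Hypothetical helper: `handles_missing` stands in for the value of the
    y-forecaster's handles-missing-data tag.
    """
    if handles_missing or not X.isna().to_numpy().any():
        return X
    # Collect the index labels of the rows that contain at least one NaN.
    bad_rows = X.index[X.isna().any(axis=1).to_numpy()].astype(str).tolist()
    raise ValueError(
        "y-forecaster can't deal with missing data in X; "
        f"X has missing values at index {bad_rows}. Check that the X passed "
        "to predict covers the whole forecasting horizon."
    )


# Usage with a placeholder frame that lacks a GNP value for 1963.
idx = pd.period_range("1961", "1963", freq="Y")
X = pd.DataFrame({"GNP": [1.0, 2.0, float("nan")], "POP": [3.0, 4.0, 5.0]}, index=idx)

message = ""
try:
    check_pooled_X(X, handles_missing=False)
except ValueError as err:
    message = str(err)
print(message)
```

Raising at this point, right after pooling and before the y-forecaster is called, would name the offending index instead of surfacing the "non-finite" message from further down the stack.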

yarnabrina · 2023-03-23T05:12:02Z

yarnabrina
Mar 23, 2023
Collaborator Author

Another observation is that predict_quantiles for ForecastX does not seem to use the alpha input. Indeed, it's not even passed to the underlying call.

sktime/sktime/forecasting/compose/_pipeline.py

Line 1425 in 19500a7

y_pred = self.forecaster_y_.predict_quantiles(fh=fh, X=X)

Any reason for this? Given BaseForecaster supports alpha, I'd assume it's supported for all estimators, at least as a valid keyword argument, even if unused. Why is it not passed?

This leads to very unexpected results:

>>> pipe.predict_quantiles(X=X_ts, fh=fh, alpha=[0.3, 0.7])
         Quantiles              
              0.05          0.95
1960  69583.430473  70587.653223
1961  69569.814972  70576.864647
1962  72161.834476  73168.900067
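
The behaviour can be reproduced with a toy model of the call chain. All three classes below are hypothetical stand-ins, not sktime classes. The inner default of [0.05, 0.95] is chosen to match the column labels in the output above.

```python
class InnerForecaster:
    """Toy stand-in for a forecaster with default quantile levels."""

    def predict_quantiles(self, fh, X=None, alpha=None):
        if alpha is None:
            alpha = [0.05, 0.95]
        # Return only the quantile levels that would label the output columns.
        return list(alpha)


class WrapperDroppingAlpha:
    """Accepts alpha but never forwards it, like the line quoted above."""

    def __init__(self, inner):
        self.inner = inner

    def predict_quantiles(self, fh, X=None, alpha=None):
        return self.inner.predict_quantiles(fh=fh, X=X)


class WrapperForwardingAlpha(WrapperDroppingAlpha):
    """Same wrapper with alpha passed through to the inner call."""

    def predict_quantiles(self, fh, X=None, alpha=None):
        return self.inner.predict_quantiles(fh=fh, X=X, alpha=alpha)


fh = [1, 2, 3]
dropped = WrapperDroppingAlpha(InnerForecaster()).predict_quantiles(fh, alpha=[0.3, 0.7])
forwarded = WrapperForwardingAlpha(InnerForecaster()).predict_quantiles(fh, alpha=[0.3, 0.7])

print(dropped)    # [0.05, 0.95]
print(forwarded)  # [0.3, 0.7]
```

Because the wrapper still accepts alpha as a keyword, the call does not fail; the requested levels are silently replaced by the inner default.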

3 replies

fkiraly Mar 24, 2023
Maintainer

ah, no, that's just a bug!

fkiraly Mar 24, 2023
Maintainer

strange that the tests don't pick it up. I thought the predict_quantiles test checks for the expected column index.

fkiraly Mar 24, 2023
Maintainer

reported the bug here: #4386

geronimos · 2024-06-27T06:31:52Z

geronimos
Jun 27, 2024

I've been following this discussion and found the examples and solutions very insightful. However, I am encountering a specific challenge in this direction. It would be great if it could be addressed by ForecastX.

In my current project, I have a dataset where the availability of exogenous variables varies:

Some variables are known for the entire forecast horizon.
Some are known only partially.
Others are completely unknown for the forecast horizon.

From what I've gathered, ForecastX seems ideally used when the state of all exogenous variables is uniform—either all known or all forecasted. My understanding is that for handling non-uniform availability, one might need to forecast each exogenous variable separately as needed, merge these forecasts with the known data, and then proceed with forecasting the main series.

Here’s a simplified version of my problem using the Longley dataset, where I artificially introduce missing values to simulate my scenario:

import numpy as np
import pandas as pd
from sktime.datasets import load_longley

# Load the dataset
y, X = load_longley()

# Split the data to simulate a future prediction scenario
# (copy the test slice so the assignments below do not write into a view of X)
X_tr, X_ts = X.loc[:"1959"], X.loc["1960":].copy()

# Introducing missing data for the example
X_ts.loc[:, "UNEMP"] = np.nan  # Completely unknown (for 1960, 1961, 1962)
X_ts.loc["1962", "ARMED"] = np.nan  # Partly unknown (for 1962)
X_ts.loc["1961":"1962", "POP"] = np.nan  # Partly unknown (for 1961 and 1962)


print(X_ts)

Output:

        GNPDEFL       GNP  UNEMP   ARMED       POP
Period                                            
1960      114.2  502601.0    NaN  2514.0  125368.0
1961      115.7  518173.0    NaN  2572.0       NaN
1962      116.9  554894.0    NaN     NaN       NaN

Could you suggest how to best approach this scenario with sktime and ForecastX? Is my understanding correct that I need to handle each variable's forecasting separately, or is there a more integrated approach available within sktime that I might be overlooking?

I'd appreciate any insights. Thank you in advance!

4 replies

yarnabrina Jun 27, 2024
Collaborator Author

As far as I am aware, this is not possible with ForecastX as of now. I recently added a new argument, predict_behaviour, and a generalisation of that to each row may solve this problem.

Would you like to contribute this in sktime? That'd be very welcome.

fkiraly Jun 27, 2024
Maintainer

Is that really correct? There is the columns parameter, which allows selecting the columns to which the compositor is applied. These would be the future-unknown variables in the case of non-uniform availability.

yarnabrina Jun 27, 2024
Collaborator Author

@fkiraly does columns allow partially future-known exogenous columns? E.g. in the example @geronimos posted above, how would you pass POP or ARMED, which are known for the next 1 and 2 steps respectively, but not for the full horizon?

geronimos Jul 4, 2024

Thanks for your quick replies and help. @yarnabrina handling partially future-known exogenous columns is exactly the issue I am looking at. @fkiraly I haven't found a way to handle this at the moment.

Yes, I could dive into the details and try to contribute a solution to this :)
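
For reference, the manual workaround described earlier in the thread (forecast the unknown cells, then merge with the known values) can be sketched in plain pandas. This is an assumption-laden illustration, not a ForecastX feature: the data are placeholders rather than the Longley values, and the "forecast" simply repeats the last observed row as a stand-in for whatever forecaster is actually used.

```python
import numpy as np
import pandas as pd

idx_tr = pd.period_range("1957", "1959", freq="Y")
idx_ts = pd.period_range("1960", "1962", freq="Y")

# Placeholder history and partially known future values.
X_tr = pd.DataFrame(
    {"UNEMP": [2.9, 4.7, 3.8], "ARMED": [2.8, 2.6, 2.4], "POP": [118.7, 120.4, 122.0]},
    index=idx_tr,
)
X_ts = pd.DataFrame(
    {
        "UNEMP": [np.nan, np.nan, np.nan],  # completely unknown
        "ARMED": [2.5, 2.6, np.nan],        # unknown for 1962
        "POP": [123.4, np.nan, np.nan],     # unknown for 1961 and 1962
    },
    index=idx_ts,
)

# Stand-in forecast for every future cell: repeat the last observed row.
X_fc = pd.DataFrame(
    np.tile(X_tr.iloc[-1].to_numpy(), (len(idx_ts), 1)),
    index=idx_ts,
    columns=X_tr.columns,
)

# Known values take precedence; forecasts only fill the NaN cells.
X_filled = X_ts.fillna(X_fc)
print(X_filled)
```

The filled frame has no missing cells and can be passed as X to the y-forecaster's predict. Note that this stand-in forecast ignores the partially known future values, which a real forecaster might use to improve the remaining cells.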
