[BUG] StatsForecastAutoArima behaves differently when using sktime's evaluate vs. temporal_train_test_split #5894

Open · fkiraly opened this issue Feb 5, 2024 · Discussed in #5893 · 16 comments

Labels: bug (Something isn't working), module:forecasting (forecasting module, incl probabilistic and hierarchical forecasting), module:metrics&benchmarking (metrics and benchmarking modules)

Comments

@fkiraly (Collaborator) commented Feb 5, 2024

Bug report from discussion forum.

Discussed in #5893

Originally posted by markross1 February 5, 2024
Hello,

I've been experimenting with sktime's evaluate function to cross-validate forecasting models. I get different results from StatsForecast AutoARIMA when it is run through evaluate than when I fit it manually after temporal_train_test_split.

The discrepancy becomes more pronounced when I include an exogenous regressor. In my tests, sktime's own AutoARIMA behaves as expected, but StatsForecast AutoARIMA produces unexpected results under evaluate. When I instead use StatsForecast AutoARIMA with temporal_train_test_split, the results closely match sktime's AutoARIMA. The issue therefore seems specific to the combination of StatsForecast AutoARIMA, the evaluate function, and an exogenous regressor.

sktime AutoARIMA with evaluate:
[plot: autoarima_evaluate]

StatsForecast AutoARIMA with evaluate:
[plot: statsforecast_evaluate]

StatsForecast AutoARIMA with temporal_train_test_split:
[plot: statsforecast_train_test_split]

Does anyone have insight into why StatsForecast AutoARIMA behaves differently under evaluate when an exogenous regressor is used, as in this example?

Below is an example that should demonstrate this (unless this is computer-specific somehow):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# sktime imports
from sktime.split import SingleWindowSplitter, temporal_train_test_split
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.arima import AutoARIMA
from sktime.forecasting.statsforecast import StatsForecastAutoARIMA
from sktime.utils.plotting import plot_series
from sktime.forecasting.model_evaluation import evaluate
from sktime.datasets import load_airline


# generate exogenous variable: a synthetic seat-capacity series (SEATS),
# derived from passengers (PAX) via a random load factor, floored at 1.1 * PAX
def generate_seats(y):
    np.random.seed(42)
    base_seats = 1.1 * y
    load_factor = np.random.uniform(low=0.82, high=0.85, size=y.shape)
    seats = y / load_factor
    seats = np.maximum(seats, base_seats)
    seats = np.ceil(seats)
    df = pd.DataFrame({'PAX': y, 'SEATS': seats, 'LF': y / seats})
    return df

# variables
y = load_airline()
X = generate_seats(y)[['SEATS']]

sw = SingleWindowSplitter(fh=list(range(1, 37)))

# Evaluate SKTime AutoARIMA
results = evaluate(forecaster=AutoARIMA(sp=12), y=y, X=X, cv=sw, return_data=True);

y_train = results['y_train'][0]
y_test = results['y_test'][0]
y_pred = results['y_pred'][0]
plot_series(y_train, y_test, y_pred, labels=['train', 'test', 'pred']);
plt.show()


# Statsforecast AutoArima with evaluate (this should show different results)
results = evaluate(forecaster=StatsForecastAutoARIMA(sp=12), y=y, X=X, cv=sw, return_data=True);
y_train = results['y_train'][0]
y_test = results['y_test'][0]
y_pred = results['y_pred'][0]
plot_series(y_train, y_test, y_pred, labels=['train', 'test', 'pred']);
plt.show()

# StatsForecast AutoARIMA manually via temporal_train_test_split
# (this should closely match the sktime AutoARIMA results above)
y_train, y_test, X_train, X_test = temporal_train_test_split(y=y, X=X, fh=list(range(1, 37)))
y_pred = StatsForecastAutoARIMA(sp=12).fit_predict(
    y=y_train,
    X=X_train,
    X_pred=X_test,
    fh=ForecastingHorizon(y_test.index, is_relative=False),
)
plot_series(y_train, y_test, y_pred, labels=['train', 'test', 'pred']);
plt.show()

Thank you in advance for your help.
-Marko

fkiraly added the bug, module:forecasting, and module:metrics&benchmarking labels on Feb 5, 2024
@fkiraly (Collaborator, Author) commented Feb 5, 2024

Imo a bug, clearly - but not so clear where it is located.

To me, it looks like the statsforecast y_pred is identical with the first three years of y_train.

What is odd is that the bug depends on which estimator is used, i.e., it seems to require the combination of evaluate and StatsForecastAutoARIMA, which makes it hard to guess what is going on.
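
A quick way to quantify this observation, assuming the reproduction script above has been run (y_pred and y_train as returned by evaluate with StatsForecastAutoARIMA; a diagnostic sketch only):

# how close is the statsforecast y_pred from evaluate to the first
# len(y_pred) observations of y_train? (compares values, not indices)
import numpy as np

n = len(y_pred)
print(np.abs(y_pred.to_numpy() - y_train.iloc[:n].to_numpy()).max())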

@fkiraly (Collaborator, Author) commented Feb 5, 2024

A wild guess: it could be that something produces incorrect indices internally, either the fh or the adapter estimator. loc or integer based indexing would then retrieve the wrong values from the data, or request the wrong values from a prediction.
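
One way to start checking the index hypothesis, using the variables from the reproduction script above (an illustrative sanity check, not the evaluate internals):

# does the absolute fh derived from the split cutoff line up with the test
# window of y, and does X share the same index as y?
train_idx, test_idx = next(sw.split(y))
cutoff = y.index[train_idx[-1]]
fh_abs = ForecastingHorizon(list(range(1, 37)), is_relative=True).to_absolute(cutoff)
print(fh_abs.to_pandas().equals(y.index[test_idx]))  # expect True
print(y.index.equals(X.index))                       # expect True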

FYI @yarnabrina, any better or other guess?

@yarnabrina (Member) commented Feb 5, 2024 via email

@fkiraly (Collaborator, Author) commented Feb 5, 2024

@yarnabrina, evaluate is used inside all the forecasting tuners - so if it were broken, then the same problem would impact all tuners.

@Abhay-Lejith (Contributor) commented:

I'd like to work on this issue! I've found the cause of the problem, will make a fix and PR tomorrow perhaps.

@fkiraly (Collaborator, Author) commented Feb 27, 2024

nice!

@fkiraly (Collaborator, Author) commented Mar 24, 2024

@Abhay-Lejith, may I kindly ask where we left off with #6029? Did you abandon working on this? Which is fine if it were the case, just want to understand the status.

@Abhay-Lejith (Contributor) commented:

> @Abhay-Lejith, may I kindly ask where we left off with #6029? Did you abandon working on this? Which is fine if it were the case, just want to understand the status.

Yes, I am no longer working on that issue.

@fkiraly (Collaborator, Author) commented Mar 25, 2024

Did we understand the "deeper reason" for this? It looks like you had a fix, but it broke other parts of sktime - it would be good to understand the reasoning and record it here for the next person working on this. You will be credited, of course, if your diagnosis leads to the fix.

@Abhay-Lejith (Contributor) commented:

Sorry, I did not investigate the matter further. The fix was causing other errors because I changed the splitting logic in the evaluate method, and this seemed to be causing problems elsewhere. Since you had told me the issue was a lot deeper and harder to fix, I did not look into it after that.

@fkiraly (Collaborator, Author) commented Mar 26, 2024

Can you explain perhaps why you changed things in evaluate?

@Abhay-Lejith (Contributor) commented:

I modified how X_test is generated by the evaluate method so that it matches the X_test created by temporal_train_test_split. The difference in X_test generated by the two methods was the cause of the different behaviour initially.
All the details are there in the PR as well.
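
For reference, the manual path produces an exactly aligned split; a sketch of the check, using the variables from the reproduction script (not the PR code):

# temporal_train_test_split returns an X_test whose index matches y_test exactly
y_train, y_test, X_train, X_test = temporal_train_test_split(y=y, X=X, fh=list(range(1, 37)))
print(X_test.index.equals(y_test.index))  # True: X_test covers only the 36 horizon steps
print(len(X_test), len(y_test))           # 36 36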

@fkiraly (Collaborator, Author) commented Mar 27, 2024

Ahhhh, yes, thanks for explaining it again to me.

Summarizing for the next person looking at this:

the problem is that in evaluate, the X_test passed to the estimator is in fact all of X seen so far, so it does not exactly match y_test but also contains the X_train rows. This is not what StatsForecastAutoARIMA expects: it assumes the X passed at predict time covers exactly the forecasting horizon.

There are two approaches to fix this:

  • in StatsForecastAutoARIMA, subset X to the indices of fh in _predict, before doing anything else
  • modify the behaviour of evaluate, so that the X_test indices always match y_test

@Abhay-Lejith took approach no.2.

Changing evaluate so that the indices match exactly (as done manually by @Abhay-Lejith, upon correct diagnosis, in PR #6029) is equivalent to removing the TestPlusTrainSplitter, which fixes this bug but causes other places in the code to break.

I think that's where our investigation got stuck, i.e., why those specific other places in the code break - and that would be the entry point for further work on approach no.2. That is:

  • remove TestPlusTrainSplitter from the evaluate fold generation logic
  • check that this bug is still resolved
  • investigate the new failures
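
For orientation, a hedged illustration of what TestPlusTrainSplitter does to the test fold, which is the mechanism described above (import path and expected sizes are assumed from the description, not taken from the evaluate code):

# TestPlusTrainSplitter wraps a splitter so that each test fold also contains
# the corresponding training window; this is why the X handed to predict
# spans all of X rather than only the horizon
from sktime.split import TestPlusTrainSplitter

cv_x = TestPlusTrainSplitter(sw)
train_idx, test_idx = next(cv_x.split(y))
print(len(train_idx), len(test_idx))  # expected 108 144 if the fold is train + test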

@fkiraly (Collaborator, Author) commented Mar 27, 2024

@yarnabrina, for approach no.1, we would have to add the subsetting in the statsforecast adapter's _predict; a rough sketch is below.
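
A minimal sketch of what that subsetting could look like, written as a subclass override purely for illustration (the class name is made up; the real change would live inside the adapter's _predict, and details may differ):

# illustrative only: subset X to the rows of the forecasting horizon before
# the adapter hands X over to statsforecast
class PatchedStatsForecastAutoARIMA(StatsForecastAutoARIMA):

    def _predict(self, fh, X=None):
        if X is not None:
            # absolute horizon index, computed from the forecaster's cutoff
            fh_index = fh.to_absolute(self.cutoff).to_pandas()
            X = X.loc[X.index.isin(fh_index)]
        return super()._predict(fh=fh, X=X)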

@yarnabrina (Member) commented:

It's a bit difficult to filter specifically by fh, as statsforecast expects a horizon (number of steps ahead) as the predict input, but I can do that.

However, I am wondering whether it would make sense to do this subsetting in the base class itself, in case other forecasters do similar things and it was never caught. Maybe in predict of BaseForecaster? I'm not very confident, though, as that may trigger another set of new failures.

Also, after I make the change in the adapter, how do we test whether it's fixed? Should we add a test for the statsforecast models or for evaluate?

@fkiraly (Collaborator, Author) commented Mar 28, 2024

@yarnabrina, my attempt to do this in the base class, conditional on a tag, is here: #6044

It is incomplete, but you are welcome to have a look at it.

Indeed, I also suspect this may be happening elsewhere, though it is hard to check, since it primarily affects logic and is not caught by type checks.
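
Very roughly, the shape of such a tag-conditional step might look as follows; the tag name "X-subset-to-fh" is invented here purely for illustration and is not an actual sktime tag, and #6044 may do this differently:

# hypothetical helper, imagined as part of BaseForecaster.predict and gated
# on an invented tag; purely a sketch of the idea discussed above
def _maybe_subset_X_to_fh(self, X, fh):
    if X is None:
        return X
    if not self.get_tag("X-subset-to-fh", tag_value_default=False, raise_error=False):
        return X
    fh_index = fh.to_absolute(self.cutoff).to_pandas()
    return X.loc[X.index.isin(fh_index)]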

> Should we add a test for statsforecast models or for evaluate?

Good question - both, perhaps? We can also magic-mock the estimator.
The failure case that started this issue might be a good test case, too.
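
A sketch of what a regression test based on that failure case could look like (the test name is a placeholder, generate_seats is the helper from the report above, and the strict equality check is only indicative):

# hedged sketch: evaluate and a manual split should give the same
# StatsForecastAutoARIMA predictions for the same single window
def test_statsforecast_evaluate_matches_manual_split():
    y = load_airline()
    X = generate_seats(y)[["SEATS"]]
    fh = list(range(1, 37))

    res = evaluate(
        forecaster=StatsForecastAutoARIMA(sp=12),
        y=y, X=X, cv=SingleWindowSplitter(fh=fh), return_data=True,
    )
    y_pred_eval = res["y_pred"].iloc[0]

    y_train, y_test, X_train, X_test = temporal_train_test_split(y=y, X=X, fh=fh)
    y_pred_manual = StatsForecastAutoARIMA(sp=12).fit_predict(
        y=y_train, X=X_train, X_pred=X_test,
        fh=ForecastingHorizon(y_test.index, is_relative=False),
    )

    pd.testing.assert_series_equal(y_pred_eval, y_pred_manual, check_names=False)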
