[BUG] StatsForecastAutoArima behaves differently when using sktime's evaluate vs. temporal_train_test_split #5894

Open · fkiraly opened this issue Feb 5, 2024 · Discussed in #5893 · 16 comments

Labels: bug (Something isn't working), module:forecasting (forecasting module, incl probabilistic and hierarchical forecasting), module:metrics&benchmarking (metrics and benchmarking modules)

Comments

@fkiraly (Collaborator) commented Feb 5, 2024

Bug report from discussion forum.

Discussed in #5893

Originally posted by markross1 February 5, 2024
Hello,

I've been experimenting with sktime's evaluate function to cross-validate forecasting models. I get different results from StatsForecast AutoARIMA when it is run through evaluate than when I fit it manually after temporal_train_test_split.

The discrepancy becomes more pronounced when I include an exogenous regressor. In my tests, sktime's own AutoARIMA behaves as expected, but StatsForecast AutoARIMA produces unexpected results under evaluate. When I instead use StatsForecast AutoARIMA with temporal_train_test_split, the results closely match sktime's AutoARIMA. The issue therefore seems specific to the combination of StatsForecast AutoARIMA, the evaluate function, and an exogenous regressor.

sktime AutoARIMA with evaluate:
[plot: autoarima_evaluate]

StatsForecast AutoARIMA with evaluate:
[plot: statsforecast_evaluate]

StatsForecast AutoARIMA with temporal_train_test_split:
[plot: statsforecast_train_test_split]

Does anyone have insight into why StatsForecast AutoARIMA behaves differently under evaluate when an exogenous regressor is used, as in this example?

Below is an example that should demonstrate this (unless this is computer-specific somehow):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# sktime imports
from sktime.split import SingleWindowSplitter, temporal_train_test_split
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.arima import AutoARIMA
from sktime.forecasting.statsforecast import StatsForecastAutoARIMA
from sktime.utils.plotting import plot_series
from sktime.forecasting.model_evaluation import evaluate
from sktime.datasets import load_airline


# generate exogenous variable: a synthetic seat-capacity series (SEATS),
# derived from passengers (PAX) via a random load factor, floored at 1.1 * PAX
def generate_seats(y):
    np.random.seed(42)
    base_seats = 1.1 * y
    load_factor = np.random.uniform(low=0.82, high=0.85, size=y.shape)
    seats = y / load_factor
    seats = np.maximum(seats, base_seats)
    seats = np.ceil(seats)
    df = pd.DataFrame({'PAX': y, 'SEATS': seats, 'LF': y / seats})
    return df

# variables
y = load_airline()
X = generate_seats(y)[['SEATS']]

sw = SingleWindowSplitter(fh=list(range(1, 37)))

# Evaluate SKTime AutoARIMA
results = evaluate(forecaster=AutoARIMA(sp=12), y=y, X=X, cv=sw, return_data=True);

y_train = results['y_train'][0]
y_test = results['y_test'][0]
y_pred = results['y_pred'][0]
plot_series(y_train, y_test, y_pred, labels=['train', 'test', 'pred']);
plt.show()


# Statsforecast AutoArima with evaluate (this should show different results)
results = evaluate(forecaster=StatsForecastAutoARIMA(sp=12), y=y, X=X, cv=sw, return_data=True);
y_train = results['y_train'][0]
y_test = results['y_test'][0]
y_pred = results['y_pred'][0]
plot_series(y_train, y_test, y_pred, labels=['train', 'test', 'pred']);
plt.show()

# StatsForecast AutoARIMA manually via temporal_train_test_split
# (this should closely match the sktime AutoARIMA results above)
y_train, y_test, X_train, X_test = temporal_train_test_split(y=y, X=X, fh=list(range(1, 37)))
y_pred = StatsForecastAutoARIMA(sp=12).fit_predict(
    y=y_train,
    X=X_train,
    X_pred=X_test,
    fh=ForecastingHorizon(y_test.index, is_relative=False),
)
plot_series(y_train, y_test, y_pred, labels=['train', 'test', 'pred']);
plt.show()

Thank you in advance for your help.
-Marko

fkiraly added the bug, module:forecasting, and module:metrics&benchmarking labels on Feb 5, 2024
@fkiraly (Collaborator, Author) commented Feb 5, 2024

Imo a bug, clearly - but not so clear where it is located.

To me, it looks like the statsforecast y_pred is identical with the first three years of y_train.

What is odd is that the bug depends on which estimator is used, i.e., it seems to require the combination of evaluate and StatsForecastAutoARIMA, which makes it hard to guess what is going on.
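
A quick way to quantify this observation, assuming the reproduction script above has been run (y_pred and y_train as returned by evaluate with StatsForecastAutoARIMA; a diagnostic sketch only):

# how close is the statsforecast y_pred from evaluate to the first
# len(y_pred) observations of y_train? (compares values, not indices)
import numpy as np

n = len(y_pred)
print(np.abs(y_pred.to_numpy() - y_train.iloc[:n].to_numpy()).max())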

@fkiraly (Collaborator, Author) commented Feb 5, 2024

A wild guess: it could be that something produces incorrect indices internally, either the fh or the adapter estimator. loc or integer based indexing would then retrieve the wrong values from the data, or request the wrong values from a prediction.
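
One way to start checking the index hypothesis, using the variables from the reproduction script above (an illustrative sanity check, not the evaluate internals):

# does the absolute fh derived from the split cutoff line up with the test
# window of y, and does X share the same index as y?
train_idx, test_idx = next(sw.split(y))
cutoff = y.index[train_idx[-1]]
fh_abs = ForecastingHorizon(list(range(1, 37)), is_relative=True).to_absolute(cutoff)
print(fh_abs.to_pandas().equals(y.index[test_idx]))  # expect True
print(y.index.equals(X.index))                       # expect True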

FYI @yarnabrina, any better or other guess?

@yarnabrina (Member) commented Feb 5, 2024 via email

@fkiraly (Collaborator, Author) commented Feb 5, 2024

@yarnabrina, evaluate is used inside all the forecasting tuners - so if it were broken, then the same problem would impact all tuners.

@Abhay-Lejith (Contributor) commented:

I'd like to work on this issue! I've found the cause of the problem, will make a fix and PR tomorrow perhaps.

@fkiraly (Collaborator, Author) commented Feb 27, 2024

nice!

@fkiraly (Collaborator, Author) commented Mar 24, 2024

@Abhay-Lejith, may I kindly ask where we left off with #6029? Did you abandon working on this? Which is fine if it were the case, just want to understand the status.

@Abhay-Lejith (Contributor) commented:

> @Abhay-Lejith, may I kindly ask where we left off with #6029? Did you abandon working on this? Which is fine if it were the case, just want to understand the status.

Yes, I am no longer working on that issue.

@fkiraly (Collaborator, Author) commented Mar 25, 2024

Did we understand the "deeper reason" for this? It looks like you had a fix, but it broke other parts of sktime - it would be good to understand the reasoning and record it here for the next person working on this. You will be credited, of course, if your diagnosis leads to the fix.

@Abhay-Lejith (Contributor) commented:

Sorry, I did not investigate the matter further. The fix was causing other errors because I changed the splitting logic in the evaluate method, and this seemed to be causing problems elsewhere. Since you had told me the issue was a lot deeper and harder to fix, I did not look into it after that.

@fkiraly (Collaborator, Author) commented Mar 26, 2024

Can you explain perhaps why you changed things in evaluate?

@Abhay-Lejith (Contributor) commented:

I modified how X_test is generated by the evaluate method so that it matches the X_test created by temporal_train_test_split. The difference in X_test generated by the two methods was the cause of the different behaviour initially.
All the details are there in the PR as well.
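
For reference, the manual path produces an exactly aligned split; a sketch of the check, using the variables from the reproduction script (not the PR code):

# temporal_train_test_split returns an X_test whose index matches y_test exactly
y_train, y_test, X_train, X_test = temporal_train_test_split(y=y, X=X, fh=list(range(1, 37)))
print(X_test.index.equals(y_test.index))  # True: X_test covers only the 36 horizon steps
print(len(X_test), len(y_test))           # 36 36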

@fkiraly (Collaborator, Author) commented Mar 27, 2024

Ahhhh, yes, thanks for explaining it again to me.

Summarizing for the next person looking at this:

the problem is that in evaluate, the X_test passed to the estimator is in fact all of X seen so far, so it does not exactly match y_test but also contains the X_train rows. This is not what StatsForecastAutoARIMA expects: it assumes the X passed at predict time covers exactly the forecasting horizon.

There are two approaches to fix this:

  • in StatsForecastAutoARIMA, subset X to the indices of fh in _predict, before doing anything else
  • modify the behaviour of evaluate, so that the X_test indices always match y_test

@Abhay-Lejith took approach no.2.

Changing evaluate so that the indices match exactly (as done manually by @Abhay-Lejith, upon correct diagnosis, in PR #6029) is equivalent to removing the TestPlusTrainSplitter, which fixes this bug but causes other places in the code to break.

I think that's where our investigation got stuck, i.e., why those specific other places in the code break - and that would be the entry point for further work on approach no.2. That is:

  • remove TestPlusTrainSplitter from the evaluate fold generation logic
  • check that this bug is still resolved
  • investigate the new failures
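
For orientation, a hedged illustration of what TestPlusTrainSplitter does to the test fold, which is the mechanism described above (import path and expected sizes are assumed from the description, not taken from the evaluate code):

# TestPlusTrainSplitter wraps a splitter so that each test fold also contains
# the corresponding training window; this is why the X handed to predict
# spans all of X rather than only the horizon
from sktime.split import TestPlusTrainSplitter

cv_x = TestPlusTrainSplitter(sw)
train_idx, test_idx = next(cv_x.split(y))
print(len(train_idx), len(test_idx))  # expected 108 144 if the fold is train + test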

@fkiraly (Collaborator, Author) commented Mar 27, 2024

@yarnabrina, for approach no.1, we would have to add the subsetting in the statsforecast adapter's _predict; a rough sketch is below.
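
A minimal sketch of what that subsetting could look like, written as a subclass override purely for illustration (the class name is made up; the real change would live inside the adapter's _predict, and details may differ):

# illustrative only: subset X to the rows of the forecasting horizon before
# the adapter hands X over to statsforecast
class PatchedStatsForecastAutoARIMA(StatsForecastAutoARIMA):

    def _predict(self, fh, X=None):
        if X is not None:
            # absolute horizon index, computed from the forecaster's cutoff
            fh_index = fh.to_absolute(self.cutoff).to_pandas()
            X = X.loc[X.index.isin(fh_index)]
        return super()._predict(fh=fh, X=X)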

@yarnabrina (Member) commented:

It's a bit difficult to filter specifically by fh, as statsforecast expects a horizon (number of steps ahead) as the predict input, but I can do that.

However, I am wondering whether it would make sense to do this subsetting in the base class itself, in case other forecasters do similar things and it was never caught. Maybe in predict of BaseForecaster? I'm not very confident, though, as that may trigger another set of new failures.

Also, after I make the change in the adapter, how do we test whether it's fixed? Should we add a test for the statsforecast models or for evaluate?

@fkiraly (Collaborator, Author) commented Mar 28, 2024

@yarnabrina, my attempt to do this in the base class, conditional on a tag, is here: #6044

It is incomplete, but you are welcome to have a look at it.

Indeed, I also suspect this may be happening elsewhere, though it is hard to check, since it primarily affects logic and is not caught by type checks.
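
Very roughly, the shape of such a tag-conditional step might look as follows; the tag name "X-subset-to-fh" is invented here purely for illustration and is not an actual sktime tag, and #6044 may do this differently:

# hypothetical helper, imagined as part of BaseForecaster.predict and gated
# on an invented tag; purely a sketch of the idea discussed above
def _maybe_subset_X_to_fh(self, X, fh):
    if X is None:
        return X
    if not self.get_tag("X-subset-to-fh", tag_value_default=False, raise_error=False):
        return X
    fh_index = fh.to_absolute(self.cutoff).to_pandas()
    return X.loc[X.index.isin(fh_index)]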

> Should we add a test for statsforecast models or for evaluate?

Good question - both, perhaps? We can also magic-mock the estimator.
The failure case that started this issue might be a good test case, too.
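
A sketch of what a regression test based on that failure case could look like (the test name is a placeholder, generate_seats is the helper from the report above, and the strict equality check is only indicative):

# hedged sketch: evaluate and a manual split should give the same
# StatsForecastAutoARIMA predictions for the same single window
def test_statsforecast_evaluate_matches_manual_split():
    y = load_airline()
    X = generate_seats(y)[["SEATS"]]
    fh = list(range(1, 37))

    res = evaluate(
        forecaster=StatsForecastAutoARIMA(sp=12),
        y=y, X=X, cv=SingleWindowSplitter(fh=fh), return_data=True,
    )
    y_pred_eval = res["y_pred"].iloc[0]

    y_train, y_test, X_train, X_test = temporal_train_test_split(y=y, X=X, fh=fh)
    y_pred_manual = StatsForecastAutoARIMA(sp=12).fit_predict(
        y=y_train, X=X_train, X_pred=X_test,
        fh=ForecastingHorizon(y_test.index, is_relative=False),
    )

    pd.testing.assert_series_equal(y_pred_eval, y_pred_manual, check_names=False)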
