[ENH] panel forecasting should also work to forecast only a single time series (not all) #4209

Open
romanlutz opened this issue Feb 6, 2023 · 18 comments
Labels
API design API design & software architecture feature request New feature or request module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting

Comments

@romanlutz
Contributor

romanlutz commented Feb 6, 2023

Update by @fkiraly: summarizing the discussion below, I think this is a combination of a feature request ("forecasters should support forecasting on a subset of instances, if panel data") and a frustrated user expectation that this already works. Overall, an interesting idea; I am also labelling this as "API design" to discuss what it should look like.


Describe the bug

If I train a model on multiple time series but provide only a single one of the time series for predict I get a KeyError.

To Reproduce

import json
import numpy as np
import pandas as pd
from sktime.forecasting.arima import AutoARIMA
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
time_column_name = "WeekStarting"
time_series_id_column_names = ["Store", "Brand"]
dataset_location = "https://raw.githubusercontent.com/Azure/azureml-examples/2fe81643865e1f4591e7734bd1a729093cafb826/v1/python-sdk/tutorials/automl-with-azureml/forecasting-orange-juice-sales/dominicks_OJ.csv"
data = pd.read_csv(dataset_location, parse_dates=[time_column_name])

# Drop the column 'logQuantity' as it is a leaky feature.
data.drop("logQuantity", axis=1, inplace=True)

# Set up multi index with time series ID columns and time column.
data.set_index(time_series_id_column_names + [time_column_name], inplace=True, drop=True)
data = data.groupby(time_series_id_column_names).apply(lambda group: group.loc[group.name].asfreq("W-THU").interpolate())
data.sort_index(inplace=True, ascending=[True, True, True])

data.head(10)
use_stores = [2, 5, 8]
use_brands = ['tropicana', 'dominicks', 'minute.maid']
data_subset = data.loc[(use_stores, use_brands, slice(None)), :]
nseries = data_subset.groupby(time_series_id_column_names).ngroups
print(f"Data subset contains {nseries} individual time-series.")
target_column_name = "Quantity"

y = pd.DataFrame(data_subset[target_column_name])
X = data_subset.drop(columns=[target_column_name])
fh_dates = pd.DatetimeIndex(y.index.get_level_values(2).unique().sort_values().to_list()[-20:], freq='W-THU')
fh = ForecastingHorizon(fh_dates, is_relative=False)
y_train, y_test, X_train, X_test = \
    temporal_train_test_split(
        y=y,
        X=X,
        test_size=20)
# When using sktime directly we need to drop the time and time series ID columns.
model = AutoARIMA(suppress_warnings=True, error_action="ignore")
model.fit(y=y_train, X=X_train, fh=fh)
model.predict(fh=fh, X=X_test.iloc[:20]).head()

results in

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\sktime\lib\site-packages\pandas\core\indexes\base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'dominicks'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\sktime\forecasting\base\_base.py", line 407, in predict
    y_pred = self._vectorize("predict", X=X_inner, fh=fh)
  File "...\sktime\forecasting\base\_base.py", line 1770, in _vectorize
    y_preds = self._yvec.vectorize_est(
  File "...\sktime\datatypes\_vectorize.py", line 575, in vectorize_est
    args_i_rowvec = vec_dict(args_rowvec, i=i, vectorize_cols=False)
  File "...\sktime\datatypes\_vectorize.py", line 569, in vec_dict
    return {k: fun(v) for k, v in d.items()}
  File "...\sktime\datatypes\_vectorize.py", line 569, in <dictcomp>
    return {k: fun(v) for k, v in d.items()}
  File "...\sktime\datatypes\_vectorize.py", line 567, in fun
    return self._vectorize_slice(v, i=i, vectorize_cols=vectorize_cols)
  File "...\sktime\datatypes\_vectorize.py", line 457, in _vectorize_slice
    return self._get_X_at_index(row_ind=row_ind, col_ind=col_ind, X=other)
  File "...\sktime\datatypes\_vectorize.py", line 284, in _get_X_at_index
    res = X.loc[row_ind]
  File "...\pandas\core\indexing.py", line 1067, in __getitem__
    return self._getitem_tuple(key)
  File "...\pandas\core\indexing.py", line 1247, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "...\pandas\core\indexing.py", line 991, in _getitem_lowerdim
    return getattr(section, self.name)[new_key]
  File "...\pandas\core\indexing.py", line 1067, in __getitem__
    return self._getitem_tuple(key)
  File "...\pandas\core\indexing.py", line 1247, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "...\pandas\core\indexing.py", line 941, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "...\pandas\core\indexing.py", line 1047, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "...\pandas\core\indexing.py", line 1312, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "...\pandas\core\indexing.py", line 1260, in _get_label
    return self.obj.xs(label, axis=axis)
  File "...\pandas\core\generic.py", line 4041, in xs
    return self[key]
  File "...\pandas\core\frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "...\pandas\core\indexes\base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'dominicks'

X_test.iloc[:20] corresponds to the first time series ("tropicana"), but "dominicks" is not in this subset. It is very much part of the training data, though.
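The lookup pattern visible in the traceback (`X.loc[row_ind]` for every instance seen during fit) can be imitated with pandas alone. The following is only an illustrative sketch with made-up toy data, not sktime's actual vectorization code: it loops over the instances known from training and tries to look each one up in an `X` that contains a single instance.

```python
import pandas as pd

# Toy exogenous panel: two instances (brands), two weekly time points each.
idx = pd.MultiIndex.from_product(
    [["dominicks", "tropicana"], pd.date_range("2023-01-05", periods=2, freq="W-THU")],
    names=["Brand", "WeekStarting"],
)
X_full = pd.DataFrame({"Price": [1.5, 1.6, 2.0, 2.1]}, index=idx)

# Instances seen at fit time.
train_instances = X_full.index.get_level_values("Brand").unique()

# At predict time, only one instance is passed.
X_subset = X_full.loc[["tropicana"]]

missing = []
for instance in train_instances:
    try:
        X_subset.loc[instance]  # label lookup on the first index level
    except KeyError:
        missing.append(instance)

print(missing)
```

Any instance that was part of training but is absent from the `X` passed to predict ends up in `missing`, which mirrors the `KeyError: 'dominicks'` above.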

Expected behavior

I would hope that this is supported since I may sometimes need to predict just for one time series and sometimes for another or multiple.

Additional context

Versions

System:
    python: 3.8.16 (default, Jan 17 2023, 22:25:28) [MSC v.1916 64 bit (AMD64)]
executable: C:\ProgramData\Anaconda3\envs\sktime\python.exe
   machine: Windows-10-10.0.19044-SP0

Python dependencies:
          pip: 22.3.1
   setuptools: 65.6.3
      sklearn: 1.0.2
       sktime: 0.16.0
  statsmodels: 0.13.5
        numpy: 1.23.5
        scipy: 1.10.0
       pandas: 1.5.3
   matplotlib: 3.6.3
       joblib: 1.2.0
        numba: 0.56.4
     pmdarima: 2.0.2
      tsfresh: None
@romanlutz romanlutz added the bug Something isn't working label Feb 6, 2023
@fkiraly
Collaborator

fkiraly commented Feb 6, 2023

Hm, it would be much appreciated if you could check whether the problem also occurs with one of sktime's on-board datasets. That is easier to debug, as the only thing that needs to be set up is a copy of the code.

To generate datasets, you can use the generators from the sktime.datasets module, or _make_series from sktime.utils._testing.series.

(and if you cannot cause this with an onboard dataset, that would also be useful information, with code that you tried)
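If no suitable on-board dataset is at hand, a small panel can also be built by hand. The sketch below uses only pandas and numpy and assumes the MultiIndex layout with levels named `instances` and `timepoints`; the helper name `make_toy_panel` and the column names are hypothetical, chosen for illustration.

```python
import numpy as np
import pandas as pd


def make_toy_panel(n_instances=3, n_timepoints=5, seed=0):
    """Return (y, X) as MultiIndex DataFrames with levels (instances, timepoints)."""
    rng = np.random.default_rng(seed)
    index = pd.MultiIndex.from_product(
        [range(n_instances), range(n_timepoints)],
        names=["instances", "timepoints"],
    )
    # Target: one random value per (instance, time point) row.
    y = pd.DataFrame({"y": rng.normal(size=len(index))}, index=index)
    # Exogenous data: two random columns on the same index.
    X = pd.DataFrame(
        {"var_0": rng.normal(size=len(index)), "var_1": rng.normal(size=len(index))},
        index=index,
    )
    return y, X


y, X = make_toy_panel()
print(y.index.get_level_values("instances").nunique())
```

Because everything is generated from a seed, a bug report built on such data needs nothing but the code itself.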

@fkiraly fkiraly added the module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting label Feb 6, 2023
@romanlutz
Contributor Author

I couldn't quite find a panel dataset. Perhaps I missed something? In any case, I think I have quite the minimal example below with just 9 rows and 3 time series in one dataset.

My repro:

import pandas as pd
from sktime.forecasting.arima import AutoARIMA
from sktime.datatypes import get_examples
X = get_examples(mtype="pd-multiindex", as_scitype="Panel")[0]
y = pd.DataFrame(pd.Series([0,1,3,2,4,8,3,5,9]).set_axis(X.index))
f = AutoARIMA()
f.fit(X=X, fh=[0, 1, 2], y=y)
X_test = pd.DataFrame({"var_0": [4], "var_1": [6], "instances": [0], "timepoints": [3]})
X_test.set_index(["instances", "timepoints"], inplace=True)
f.predict(X=X_test, fh=[1])

results in

Traceback (most recent call last):
  File "...\pandas\core\indexes\base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 2263, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 2273, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\sktime\forecasting\base\_base.py", line 407, in predict
    y_pred = self._vectorize("predict", X=X_inner, fh=fh)
  File "...\sktime\forecasting\base\_base.py", line 1770, in _vectorize
    y_preds = self._yvec.vectorize_est(
  File "...\sktime\datatypes\_vectorize.py", line 575, in vectorize_est
    args_i_rowvec = vec_dict(args_rowvec, i=i, vectorize_cols=False)
  File "...\sktime\datatypes\_vectorize.py", line 569, in vec_dict
    return {k: fun(v) for k, v in d.items()}
  File "...\sktime\datatypes\_vectorize.py", line 569, in <dictcomp>
    return {k: fun(v) for k, v in d.items()}
  File "...\sktime\datatypes\_vectorize.py", line 567, in fun
    return self._vectorize_slice(v, i=i, vectorize_cols=vectorize_cols)
  File "...\sktime\datatypes\_vectorize.py", line 457, in _vectorize_slice
    return self._get_X_at_index(row_ind=row_ind, col_ind=col_ind, X=other)
  File "...\sktime\datatypes\_vectorize.py", line 284, in _get_X_at_index
    res = X.loc[row_ind]
  File "...\pandas\core\indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "...\pandas\core\indexing.py", line 1312, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "...\pandas\core\indexing.py", line 1260, in _get_label
    return self.obj.xs(label, axis=axis)
  File "...\pandas\core\generic.py", line 4049, in xs
    loc, new_index = index._get_loc_level(key, level=0)
  File "...\pandas\core\indexes\multi.py", line 3160, in _get_loc_level
    indexer = self._get_level_indexer(key, level=level)
  File "...\pandas\core\indexes\multi.py", line 3263, in _get_level_indexer
    idx = self._get_loc_single_level_index(level_index, key)
  File "...\pandas\core\indexes\multi.py", line 2849, in _get_loc_single_level_index
    return level_index.get_loc(key)
  File "...\pandas\core\indexes\base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 1

I believe the key here is the "instances" value of the second time series (1). Only the first time series (0) is actually present because I just so happen to be interested in only 0 this time. But I may have more queries at a later point. At that later point, I may just query for a few points of time series 1, or perhaps 2.

Or maybe I'm using sktime completely the wrong way here. In that case, I would very much appreciate your feedback, of course 🙂

@fkiraly
Collaborator

fkiraly commented Feb 11, 2023

@romanlutz, sorry for the late reply, was travelling.

I think the issue is a feature limitation of sktime, where something like fh=1 predicts one step/period ahead in all instances; therefore many estimators will also require X to be seen at those future time indices.

There is currently no way to tell sktime, "I want a forecast only for time series no.1".
I agree that might be a nice feature, but it's not supported by the framework. Also, I'm not sure how many estimators, if at all, would support that out of the box.

Would appreciate your thoughts.

@romanlutz
Contributor Author

Thanks, @fkiraly, no worries at all! We all live lives outside of open source 😄

I'm usually working with Azure ML forecasting models and they all support it. Hmm, this is kind of a blocker for my scenario. I am building a little interactive visualization and need to be able to query for individual time series forecasts rather than everything. I could have a model per time series, of course, but that seems a bit tedious. Alternatively, I could get all the forecasts and then discard what I don't need.

In any case, it seems like this is more of a feature request than a bug 🤣 Should we rephrase it or close and open a new item with a clearly stated feature request? I understand that it's not necessarily something that'll happen anytime soon, but capturing the request may be useful regardless.

@fkiraly
Collaborator

fkiraly commented Feb 13, 2023

> I'm usually working with Azure ML forecasting models and they all support it.

Hm, can you provide a code snippet in Azure ML forecasting, same scenario?
If it's not too much of a hassle, would be interesting to compare interface designs.

@fkiraly
Collaborator

fkiraly commented Feb 13, 2023

> Alternatively, I could get all the forecasts and then discard what I don't need.

Yes, if we were to build the interface, then this is probably what it would default to, for most forecasters.

I could also look into whether there's any quick way to enable that for global forecasters.
For non-global forecasters, the result will be the same if you only train on the instance that you want to forecast, in the first place.
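The point about non-global forecasters can be illustrated with a deliberately trivial stand-in. The sketch below assumes a hypothetical "local" forecaster that just predicts each instance's mean; it is not an sktime estimator. Since each instance is fitted independently, fitting on the whole panel and selecting one instance gives the same value as fitting on that instance alone.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[0, 1, 2], [0, 1, 2]], names=["instances", "timepoints"]
)
y = pd.DataFrame({"y": [0, 1, 3, 2, 4, 8, 3, 5, 9]}, index=index)


def fit_mean_per_instance(y):
    """Hypothetical local 'forecaster': one independent mean per instance."""
    return y.groupby(level="instances")["y"].mean()


# Fit on the whole panel, then keep only instance 0.
from_panel = fit_mean_per_instance(y).loc[0]

# Fit only on instance 0 in the first place.
from_single = fit_mean_per_instance(y.loc[[0]]).loc[0]

print(from_panel, from_single)
```

For a global forecaster this equivalence does not hold in general, because the other instances influence the fitted parameters.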

@fkiraly
Collaborator

fkiraly commented Feb 13, 2023

FYI @danbartl, @KishManani, @ilkersigirci - would this be a useful feature for reducers?

@danbartl
Collaborator

Absolutely, sounds like a good thing to have.

@ilkersigirci

I agree, it would be a good addition for reduced models.

@KishManani
Contributor

I think we'd certainly want the ability to train on multiple instances but then perform inference on a subset of instances. This is quite easy to do with global forecasting models (i.e, via reduction).
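To make the "train on many, infer on a subset" idea concrete, here is a minimal sketch of reduction with numpy only. It is an assumption-laden toy (lag windows pooled across series, ordinary least squares with an intercept), not sktime's reducer implementation, and all function names are hypothetical.

```python
import numpy as np


def make_windows(series, window):
    """Turn one series into (lag-window, next-value) training pairs."""
    X = np.array([series[i : i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y


def fit_global(panel, window=2):
    """Fit one linear model on windows pooled across all series (least squares)."""
    parts = [make_windows(s, window) for s in panel.values()]
    X = np.vstack([p[0] for p in parts])
    y = np.concatenate([p[1] for p in parts])
    X1 = np.hstack([X, np.ones((len(X), 1))])  # add intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict_one_step(coef, series, window=2):
    """One-step-ahead forecast for a single series using the shared coefficients."""
    last = np.append(np.asarray(series[-window:], dtype=float), 1.0)
    return float(last @ coef)


panel = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 12.0, 14.0, 16.0, 18.0],
    "c": [0.0, 3.0, 6.0, 9.0, 12.0],
}
coef = fit_global(panel)  # trained on all three series
forecast_b = predict_one_step(coef, panel["b"])  # inference for series "b" only
print(forecast_b)
```

Once the shared coefficients are fitted, prediction only needs the most recent window of the series of interest, which is why subset inference is cheap for this kind of model.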

@fkiraly
Collaborator

fkiraly commented Feb 14, 2023

I see - seems like a common enough scenario. I/we'll need to think about what exactly the interface should look like.

E.g., how do we tell the forecaster that we want a forecast only for series 0 and 1?
This can't be via the X, since we don't necessarily have an X.

I think it needs to be encoded in the fh somehow?
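One way to picture a selection "encoded in the fh" is a set of (instance, time point) pairs. The sketch below is purely hypothetical and uses pandas only; nothing in the thread suggests that sktime's `ForecastingHorizon` accepts such an object. It shows how such pairs could be used to pick rows out of a full panel forecast, here represented by a made-up stand-in DataFrame.

```python
import pandas as pd

# Hypothetical "instance-aware horizon": (instance, time point) pairs to forecast.
requested = pd.MultiIndex.from_product(
    [[0, 1], [3, 4]], names=["instances", "timepoints"]
)

# Stand-in for a forecast of the whole panel (three instances, two future steps).
full_index = pd.MultiIndex.from_product(
    [[0, 1, 2], [3, 4]], names=["instances", "timepoints"]
)
y_pred_all = pd.DataFrame({"y": range(6)}, index=full_index)

# Keep only the requested (instance, time point) rows.
y_pred_requested = y_pred_all.loc[requested]
print(y_pred_requested)
```

A design along these lines would not depend on `X` being present, which addresses the concern raised above.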

@fkiraly fkiraly added feature request New feature or request API design API design & software architecture and removed bug Something isn't working labels Feb 14, 2023
@fkiraly fkiraly changed the title [BUG] panel forecasting does not work if only a single time series is used [ENH] panel forecasting should also work to forecast only a single time series (not all) Feb 14, 2023
@fkiraly
Collaborator

fkiraly commented Feb 14, 2023

> Should we rephrase it or close and open a new item with a clearly stated feature request? I understand that it's not necessarily something that'll happen anytime soon, but capturing the request may be useful regardless.

@romanlutz, I rephrased this as a feature request. Would be interesting to see some code, hypothetical or real (e.g., Azure ML), showing what it could look like.

@romanlutz
Contributor Author

I think this is the closest to what you're asking for: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast#forecasting-with-a-trained-model
Note that the training is all done in the cloud, so you don't see a fit method anywhere. The code that uses forecast_quantiles (which is about the same as your predict_quantiles) is what I'm trying to do with sktime. If I pass a subset of time series (in a panel case) then it happily just gives me the forecasts for those. In the example used there they distinguish time series by store and brand in order to predict orange juice sales per store and brand.

@fkiraly
Collaborator

fkiraly commented Feb 15, 2023

hm, it is hard to read in terms of specification. Where is information being passed about the instance/subset that you want to forecast for?

@romanlutz
Contributor Author

We only pass the rows in the test data that we want forecasts for. The time series identifying columns are just normal columns in the dataset (unlike sktime where they are in the index).
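The difference between the two layouts can be sketched with pandas. The data values below are made up; the column names follow the orange juice example from the original report. With identifiers as ordinary columns, the set of requested series can be read directly off the rows that were passed in.

```python
import pandas as pd

# Layout with identifiers and time as ordinary columns (as described for Azure ML).
flat = pd.DataFrame(
    {
        "Store": [2, 2, 5],
        "Brand": ["tropicana", "tropicana", "dominicks"],
        "WeekStarting": pd.to_datetime(["2023-01-05", "2023-01-12", "2023-01-05"]),
        "Price": [2.5, 2.6, 1.8],
    }
)

# sktime-style layout: identifiers and time form a MultiIndex.
indexed = flat.set_index(["Store", "Brand", "WeekStarting"]).sort_index()

# The requested series are the distinct identifier combinations in the rows.
requested = flat[["Store", "Brand"]].drop_duplicates()
requested_ids = list(requested.itertuples(index=False, name=None))
print(requested_ids)
```

In the indexed layout the same information is available from the non-time index levels, for example via `indexed.index.droplevel(-1).unique()`.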

@fkiraly
Collaborator

fkiraly commented Feb 15, 2023

Ah, so it's basically a forecasting horizon equivalent that also has the instance identifier?
(I would be thankful if you could point to the code snippet or post a screenshot of it?)

@romanlutz
Contributor Author

Here's a notebook: https://github.com/Azure/azureml-examples/blob/main/v1/python-sdk/tutorials/automl-with-azureml/forecasting-forecast-function/auto-ml-forecasting-function.ipynb

Some snippets that may be helpful:

forecast_horizon = n_test_periods  # note: this is just an integer with value 6
forecasting_parameters = ForecastingParameters(
    time_column_name=TIME_COLUMN_NAME,
    forecast_horizon=forecast_horizon,
    time_series_id_column_names=[TIME_SERIES_ID_COLUMN_NAME],
    target_lags=lags,
    freq="H",  # Set the forecast frequency to be hourly,
    cv_step_size="auto",
)

This is just bundled into an AutoML config that is passed to AzureML which ... deals with it. When it's done you can download a fitted_model to run forecast or forecast_quantiles (same as sktime but switch forecast out for predict).

There's also a section about forecasting away from the data which I found interesting. Not related to this at all, but you might find that interesting, too 🙂

@romanlutz
Contributor Author

Getting back to your earlier comment regarding what this should look like @fkiraly

> I see - seems like a common enough scenario. I/we'll need to think about what exactly the interface should look like.
>
> E.g., how do we tell the forecaster that we want a forecast only for series 0 and 1? This can't be via the X, since we don't necessarily have an X.
>
> I think it needs to be encoded in the fh somehow?

The reason this works a little easier for Azure ML is that the time series ID columns are part of X, so in other words X always exists even if there are no features.

I don't really see how it would relate to fh, but I could see it be a separate (optional) arg, e.g., predict(fh=fh, time_series_ids=[("abc", "TUV"), ("def", "WXY"), ...]) or perhaps even a separate method (predict_select or something like that) if you don't want to pollute the args of predict. 🙂
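A separate-function variant of this proposal could be sketched as follows. Both `predict_select` and the stub forecaster are hypothetical and exist only for illustration; the sketch assumes that `predict(fh=..., X=...)` returns a DataFrame whose MultiIndex has the instance levels first and the time level last. It forecasts everything and then filters, so it shows the interface rather than an efficient implementation.

```python
import pandas as pd


def predict_select(forecaster, fh, time_series_ids, X=None):
    """Hypothetical helper: forecast the whole panel, keep only requested series."""
    y_pred = forecaster.predict(fh=fh, X=X)
    # Drop the time level so that only the instance identifiers remain.
    instance_index = y_pred.index.droplevel(-1)
    return y_pred[instance_index.isin(time_series_ids)]


class _StubPanelForecaster:
    """Stand-in for a fitted panel forecaster, for demonstration only."""

    def __init__(self, ids, last_time):
        self.ids = ids
        self.last_time = last_time

    def predict(self, fh, X=None):
        # One row per (series id, future step), with a constant dummy forecast.
        rows = [(*ts_id, self.last_time + h) for ts_id in self.ids for h in fh]
        index = pd.MultiIndex.from_tuples(rows, names=["Store", "Brand", "step"])
        return pd.DataFrame({"Quantity": 0.0}, index=index)


stub = _StubPanelForecaster(
    [(2, "tropicana"), (2, "dominicks"), (5, "tropicana")], last_time=100
)
subset = predict_select(stub, fh=[1, 2], time_series_ids=[(2, "dominicks")])
print(subset)
```

Keeping the selection outside `predict` leaves the existing signature untouched, at the cost of computing forecasts that are then discarded.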
In my own example I'm creating a workaround: I duplicate the passed X (which is for one time series only) n-times for each time series (with their respective identifiers in the index) and then throw away the duplicated outputs. It's obviously not nice and quite wasteful.
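The duplication workaround described above might look roughly like this with pandas. The helper name `tile_X_over_instances` is hypothetical, and the sketch assumes the single-series `X` is indexed by time only and that the panel uses index levels named `instances` and `timepoints`.

```python
import pandas as pd


def tile_X_over_instances(X_single, instance_ids, names=("instances", "timepoints")):
    """Copy a single-series X (indexed by time only) under every instance id."""
    tiled = pd.concat({i: X_single for i in instance_ids})
    tiled.index = tiled.index.set_names(list(names))
    return tiled


X_single = pd.DataFrame(
    {"var_0": [4], "var_1": [6]}, index=pd.Index([3], name="timepoints")
)
X_all = tile_X_over_instances(X_single, instance_ids=[0, 1, 2])
print(X_all)

# After predicting with X=X_all, a selection such as .loc[[0]] on the output
# would keep only the series of interest (assuming the same index layout).
wanted = X_all.loc[[0]]
```

As noted, this wastes computation on every series that is thrown away afterwards.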

5 participants