[BUG] forecaster.update fails on multi indexed data frame #5853
Reproducible example to show the following:
>>>
>>> import pandas
>>> from sktime.forecasting.naive import NaiveForecaster
>>> from sktime.split import temporal_train_test_split
>>>
>>> model = NaiveForecaster()
>>>
>>> # datetime index
>>> sample_data = pandas.DataFrame(
... data={"M1": [1, 2, 3, 4], "M2": [11, 12, 13, 14]},
... index=pandas.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
... )
>>>
>>> y_train, y_test = temporal_train_test_split(sample_data, test_size=1)
>>>
>>> # fails even with non-multi-index data
>>> model_1 = model.clone()
>>>
>>> model_1.fit(y_train)
NaiveForecaster()
>>> model_1.update(y_test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 886, in update
self._update_y_X(y_inner, X_inner)
File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1588, in _update_y_X
self._set_cutoff_from_y(y)
File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1643, in _set_cutoff_from_y
cutoff_idx = get_cutoff(y, self.cutoff, return_index=True)
File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 296, in get_cutoff
return sub_idx(obj.index, ix) if return_index else obj.index[ix]
File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 282, in sub_idx
res.freq = pd.infer_freq(idx)
File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 155, in infer_freq
inferer = _FrequencyInferer(index)
File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 189, in __init__
raise ValueError("Need at least 3 dates to infer frequency")
ValueError: Need at least 3 dates to infer frequency
>>>
>>> # fails even during fit
>>> model_2 = model.clone()
>>>
>>> model_2.fit(y_test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 369, in fit
self._update_y_X(y_inner, X_inner)
File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1588, in _update_y_X
self._set_cutoff_from_y(y)
File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1643, in _set_cutoff_from_y
cutoff_idx = get_cutoff(y, self.cutoff, return_index=True)
File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 296, in get_cutoff
return sub_idx(obj.index, ix) if return_index else obj.index[ix]
File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 282, in sub_idx
res.freq = pd.infer_freq(idx)
File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 155, in infer_freq
inferer = _FrequencyInferer(index)
File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 189, in __init__
raise ValueError("Need at least 3 dates to infer frequency")
ValueError: Need at least 3 dates to infer frequency
>>>
>>> # range index
>>> sample_data = pandas.DataFrame(data={"M1": [1, 2, 3, 4], "M2": [11, 12, 13, 14]})
>>>
>>> y_train, y_test = temporal_train_test_split(sample_data, test_size=1)
>>>
>>> # works
>>> model_3 = model.clone()
>>>
>>> model_3.fit(y_train)
NaiveForecaster()
>>> model_3.update(y_test)
/home/anirban/sktime-fork/sktime/forecasting/base/_base.py:1928: UserWarning: NotImplementedWarning: NaiveForecaster does not have a custom `update` method implemented. NaiveForecaster will be refit each time `update` is called with update_params=True. To refit less often, use the wrappers in the forecasting.stream module, e.g., UpdateEvery.
warn(
/home/anirban/sktime-fork/sktime/forecasting/base/_base.py:1928: UserWarning: NotImplementedWarning: NaiveForecaster does not have a custom `update` method implemented. NaiveForecaster will be refit each time `update` is called with update_params=True. To refit less often, use the wrappers in the forecasting.stream module, e.g., UpdateEvery.
warn(
NaiveForecaster()
>>>
>>> # works
>>> model_4 = model.clone()
>>>
>>> model_4.fit(y_test)
NaiveForecaster()
>>>
I also came across this issue. It can actually be replicated in just a couple of lines:

import pandas as pd
from sktime.forecasting.naive import NaiveForecaster
from sktime.utils._testing.hierarchical import _make_hierarchical

y = _make_hierarchical(hierarchy_levels=(2,), min_timepoints=2, max_timepoints=3)
forecaster = NaiveForecaster()
forecaster.fit(y)

and it persists even for single/flat series:

y = pd.DataFrame(
    data={"y": [1, 2]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)
forecaster.fit(y)

The latter can be fixed by setting a frequency on the series' index. However, the multi-index case is not so easy to fix. For instance, this doesn't work:

y_not_okay = pd.DataFrame(
    {"y": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [
            ("a", pd.Timestamp("2020-01-01")),
            ("a", pd.Timestamp("2020-01-02")),
            ("b", pd.Timestamp("2020-01-02")),
            ("b", pd.Timestamp("2020-01-03")),
        ],
        names=["instance-id", "time"],
    ),
)
forecaster.fit(y_not_okay)

while this does:

y_okay = pd.DataFrame(
    {"y": [1, 2, 3, 4]},
    index=pd.MultiIndex(
        levels=[["a", "b"], pd.date_range("2020-01-01", periods=3, freq="D")],
        codes=[[0, 0, 1, 1], [0, 1, 1, 2]],
        names=["instance-id", "time"],
    ),
)
forecaster.fit(y_okay)

The two DataFrames are (superficially) the same/equal:

>>> y_not_okay.equals(y_okay)
True
>>> pd.testing.assert_frame_equal(y_not_okay, y_okay)
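Although the two frames compare equal, the time level of each index can be inspected directly. The sketch below uses only pandas and rebuilds the two indexes from the snippets above. The variable names `idx_not_okay` and `idx_okay` are illustrative, and the behaviour described in the comments is what I would expect from recent pandas versions, not something the pandas API guarantees.

```python
import pandas as pd

# Built from tuples: pandas derives the time level from the raw values,
# so the level is expected to carry no frequency.
idx_not_okay = pd.MultiIndex.from_tuples(
    [
        ("a", pd.Timestamp("2020-01-01")),
        ("a", pd.Timestamp("2020-01-02")),
        ("b", pd.Timestamp("2020-01-02")),
        ("b", pd.Timestamp("2020-01-03")),
    ],
    names=["instance-id", "time"],
)

# Built from explicit levels and codes: the time level is the date_range
# itself, which was created with freq="D".
idx_okay = pd.MultiIndex(
    levels=[["a", "b"], pd.date_range("2020-01-01", periods=3, freq="D")],
    codes=[[0, 0, 1, 1], [0, 1, 1, 2]],
    names=["instance-id", "time"],
)

freq_not_okay = idx_not_okay.levels[1].freq
freq_okay = idx_okay.levels[1].freq
print("from_tuples  :", freq_not_okay)
print("levels/codes :", freq_okay)
```

Comparing the two printed values shows whether the frequency survived construction, which `equals` and `assert_frame_equal` do not check.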
However, the second one (`y_okay`) retains the frequency information in its time level. I'm not sure how to tackle this yet. Also, I'm going to throw a wild guess that most people create multi-index hierarchical DataFrames via something like:
The problem with this approach is that there is no simple way to define a frequency for the multi-indexed time index. The only way to do this would be via the
Solution proposal
What do you think of the idea of trying to convert the input multi-index to one that uses the levels-codes convention, which would retain the frequency information? This approach would also simplify all subsequent calls to
The other benefit is that this would solve this particular issue (#5853), of course...
A simpler solution...
Would it be possible to simply add a check here to fall back to `None`?
sktime/sktime/datatypes/_utilities.py Lines 275 to 286 in d945457
Something like:
 def sub_idx(idx, ix, return_index=True):
     """Like sub-setting pd.index, but preserves freq attribute."""
     if not return_index:
         return idx[ix]
     res = idx[[ix]]
     if hasattr(idx, "freq"):
         if idx.freq is None:
-            res.freq = pd.infer_freq(idx)
+            with contextlib.suppress(ValueError):
+                res.freq = pd.infer_freq(idx)
         else:
             if res.freq != idx.freq:
                 res.freq = idx.freq
     return res
If you don't see any issues with the "simpler solution" approach, I can open a quick PR and get testing.
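The "simpler solution" can be tried outside of sktime with a standalone sketch. The function name `sub_idx_safe` is hypothetical and the `return_index` branch is dropped for brevity, so this is an illustration of the idea and not the code in `sktime/datatypes/_utilities.py`.

```python
import contextlib

import pandas as pd


def sub_idx_safe(idx, ix):
    """Return idx[[ix]], carrying over or inferring freq when possible."""
    res = idx[[ix]]
    if hasattr(idx, "freq"):
        if idx.freq is None:
            # pd.infer_freq raises ValueError for fewer than 3 dates;
            # in that case res.freq is simply left as it was.
            with contextlib.suppress(ValueError):
                res.freq = pd.infer_freq(idx)
        elif res.freq != idx.freq:
            res.freq = idx.freq
    return res


short_idx = pd.to_datetime(["2020-01-01", "2020-01-02"])
long_idx = pd.to_datetime(
    ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
)

short_cutoff = sub_idx_safe(short_idx, -1)  # too short to infer a freq
long_cutoff = sub_idx_safe(long_idx, -1)    # long enough to infer a freq

print(short_cutoff.freq)
print(long_cutoff.freq)
```

With this variant the two-date index no longer raises, at the cost of returning a cutoff without frequency information.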
I already provided a simpler non-multi-index example and a workaround with a range index above and in the original discussion. There was also some discussion on Discord about using a range index in general to avoid all pandas freq-related issues. That being said, I think your suggestion is perfectly fine. I don't know whether it would affect anything elsewhere, but we'll find out in your PR. I do not know why frequency inference is needed in the first place. Do you know?
@yarnabrina a range index is not an option, as datetime information is sometimes needed to generate more informed forecasts. Think of calendar features such as Black Friday, Easter, etc. I'll get working on the simple fix I proposed. Any thoughts on my other proposal?
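To illustrate the point about calendar features, here is a minimal pandas-only sketch. The data and the column names (`day_of_week`, `month`, `is_weekend`) are made up for the example; holiday features such as Black Friday or Easter would need extra logic that is not shown.

```python
import pandas as pd

y = pd.DataFrame(
    {"y": [10, 12, 9, 15]},
    index=pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"]
    ),
)

# These attributes exist on a DatetimeIndex but not on a RangeIndex.
calendar = pd.DataFrame(
    {
        "day_of_week": y.index.dayofweek,  # Monday=0 ... Sunday=6
        "month": y.index.month,
        "is_weekend": y.index.dayofweek >= 5,
    },
    index=y.index,
)

print(calendar)
```

After switching to a range index, the same features would have to be carried along as separate exogenous columns.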
I think it is, for the main forecasting part. That's how I have been doing it at my office for more than 6 months now.
Agreed, but it affects the transformation step in the pipeline, if I am not wrong. A few specific forecasters, like Prophet, may want to use the dates themselves, but I don't think it's a general requirement.
I am not sure I understood it well. Is the suggestion to run infer_freq for each lowest-level series? I am not clear how that will help if each series has < 3 observations. (As asked above, I am not clear why freq inference is required at all, so I'm waiting for @fkiraly or you or anyone else to explain that.)
@benHeid you were working on that freq issue, can you please help as well? Why is it necessary to infer frequencies?
@yarnabrina the series' frequency is inferred to retain the frequency information on the new slice of data. This fails for the edge case where the original data itself has fewer than 3 observations and no frequency set.
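The edge case can be reproduced with pandas alone, without sktime. The sketch below calls `pd.infer_freq` on a two-date index and then builds the same index with an explicit frequency; the variable names are illustrative.

```python
import pandas as pd

two_dates = pd.to_datetime(["2020-01-01", "2020-01-02"])

# Two dates and no freq set: pandas refuses to infer a frequency.
try:
    pd.infer_freq(two_dates)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)

# With the frequency set explicitly, nothing has to be inferred.
with_freq = pd.DatetimeIndex(two_dates, freq="D")
print(with_freq.freq)
```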
@yarnabrina no, the other way around. That is what currently happens. My suggestion is to do it once. I'll try to look into this later if I get the time so I can give more details in my proposal.
You are right. I am just wondering whether, in this case, it would be better not to provide any frequency information if the original index has no frequency set. But I think trying to infer one and ignoring the error if that is not possible is a good solution. Alternatively, we could make this more explicit, using an if statement that checks whether an inference would be possible. This would probably be better for maintainability and understandability.
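The more explicit alternative could look roughly like the sketch below. The helper name `infer_freq_or_none` is hypothetical, and the threshold of 3 is taken from the pandas error message quoted in this thread; other failure modes of `pd.infer_freq` (for example non-datetime input) are deliberately not handled.

```python
import pandas as pd


def infer_freq_or_none(idx):
    """Infer the freq of a DatetimeIndex, or return None if it is too short."""
    # pandas needs at least 3 dates to infer a frequency.
    if len(idx) < 3:
        return None
    return pd.infer_freq(idx)


short_idx = pd.to_datetime(["2020-01-01", "2020-01-02"])
daily_idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])
irregular_idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])

print(infer_freq_or_none(short_idx))
print(infer_freq_or_none(daily_idx))
print(infer_freq_or_none(irregular_idx))
```

Compared with suppressing the `ValueError`, the explicit length check documents the condition at the call site, while still returning `None` when pandas finds no regular frequency.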
I understand that, of course :) My question is why we need to know the frequency at all. Do any of the algorithms use "frequency" itself in their forecasting logic? What cannot be done if the frequency is missing for a forecasting algorithm?
I don't think I am very clear about this. In the most common cases, different lowest-level series will have observations corresponding to the same dates/datetimes, so the number of distinct observations for the index will not change. Will it still satisfy the pandas requirements?
Maybe, but unfortunately sktime cannot make that assumption for all cases... The time indices of hierarchical DataFrames do not have to align. You can have series with >100 observations and series with only 2 observations. They also don't have to start and/or end at the same time. Question: does sktime support series with different frequencies within a hierarchical DataFrame? If not, my point is that the frequency does not have to be inferred for every subseries. It can be inferred once at the beginning only. As I show in the example above, this frequency info can be retained in the multi-index when done properly. (Whether this is a good idea is a whole different question 🐒)
I would have to do a little deep dive into the code, but off the top of my head, it's needed to resolve relative forecast horizons, and that is one of the reasons this
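As a rough illustration of why a frequency helps with relative forecast horizons, the pandas-only sketch below turns relative steps into absolute timestamps. It is not sktime's implementation; the names `cutoff`, `relative_fh` and `absolute_fh` are chosen for the example.

```python
import pandas as pd

# A cutoff that knows its frequency, as sub_idx tries to provide.
cutoff = pd.DatetimeIndex(["2024-01-03"], freq="D")
relative_fh = [1, 2, 3]  # "1, 2 and 3 steps after the cutoff"

# Each relative step is converted by adding that many freq units.
absolute_fh = pd.DatetimeIndex(
    [cutoff[0] + step * cutoff.freq for step in relative_fh]
)
print(absolute_fh)

# Without a freq there is no unit to multiply the steps by.
cutoff_no_freq = pd.DatetimeIndex(["2024-01-03"])
print(cutoff_no_freq.freq)
```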
@yarnabrina could you take a look at #6097? Thanks in advance!
…n 3 values (#6097)
#### Reference Issues/PRs
Solves #5853
#### What does this implement/fix? Explain your changes.
Avoid propagating a ValueError to the end-user when `pd.infer_freq` can't infer the frequency of the passed series. This happens when the series has fewer than 3 elements. This patch catches the exception and falls back to `None` for the series' frequency value. See examples discussed in #5853 and in the tests.
Discussed in #5833
Originally posted by ManuB68 January 25, 2024
I am trying to update a forecaster but it doesn't seem to work on my multi indexed data frame:
The code might look overcomplicated, because it was extracted from code that iterates over dates and because of my poor knowledge of Python.
The execution fails on line:
forecaster.update(y_update)
with this message:
ValueError: Need at least 3 dates to infer frequency