[BUG] `forecaster.update` fails on multi indexed data frame #5853

fkiraly · 2024-01-28T16:01:42Z

Discussed in #5833

^{Originally posted by ManuB68 January 25, 2024}
I am trying to update a forecaster but it doesn't seem to work on my multi indexed data frame:

python code:

df = pd.DataFrame
df = pd.DataFrame([['A', '2024-01-01', 1, 11], ['A', '2024-01-02', 2, 12], ['A', '2024-01-03', 3, 13], ['A', '2024-01-04', 4, 14],\
                   ['B', '2024-01-01', 101, 1], ['B', '2024-01-02', 102, 2], ['B', '2024-01-03', 103, 3], ['B', '2024-01-04', 104, 4],\
                    ['C', '2024-01-01', 10000, 10], ['C', '2024-01-02', 20000, 20], ['C', '2024-01-03', 30000, 30], ['C', '2024-01-04', 40000, 43]],\
                  columns=['Patients', 'Dates', 'M1', 'M2'])

df['Dates']= pd.to_datetime(df['Dates'])

df = df.reset_index(drop = True)
df = df.set_index(["Patients", "Dates"])

y_train, y_test = temporal_train_test_split(df, test_size=1)

fh = 1

forecaster = VAR() 

forecaster.fit(y_train)
y_pred = forecaster.predict(fh=fh)


Day = y_test.index.get_level_values("Dates")[-1]
y_update = y_test[y_test.index.get_level_values("Dates") == Day]

forecaster.update(y_update)

y_pred_updated = forecaster.predict(fh)

Code might look overcomplicated, due to extraction from a code iterating on dates and my poor knowledge of python
The execution fails on line:
forecaster.update(y_update)

with this message:
ValueError: Need at least 3 dates to infer frequency

The text was updated successfully, but these errors were encountered:

yarnabrina · 2024-01-29T16:07:13Z

Reproducible example to show the following:

not caused by multi-index
not caused by update
workaround by using range index

>>> 
>>> import pandas
>>> from sktime.forecasting.naive import NaiveForecaster
>>> from sktime.split import temporal_train_test_split
>>> 
>>> model = NaiveForecaster()
>>> 
>>> # datetime index
>>> sample_data = pandas.DataFrame(
...     data={"M1": [1, 2, 3, 4], "M2": [11, 12, 13, 14]},
...     index=pandas.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
... )
>>> 
>>> y_train, y_test = temporal_train_test_split(sample_data, test_size=1)
>>> 
>>> # fails even with non-multi-index data
>>> model_1 = model.clone()
>>> 
>>> model_1.fit(y_train)
NaiveForecaster()
>>> model_1.update(y_test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 886, in update
    self._update_y_X(y_inner, X_inner)
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1588, in _update_y_X
    self._set_cutoff_from_y(y)
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1643, in _set_cutoff_from_y
    cutoff_idx = get_cutoff(y, self.cutoff, return_index=True)
  File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 296, in get_cutoff
    return sub_idx(obj.index, ix) if return_index else obj.index[ix]
  File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 282, in sub_idx
    res.freq = pd.infer_freq(idx)
  File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 155, in infer_freq
    inferer = _FrequencyInferer(index)
  File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 189, in __init__
    raise ValueError("Need at least 3 dates to infer frequency")
ValueError: Need at least 3 dates to infer frequency
>>> 
>>> # fails even during fit
>>> model_2 = model.clone()
>>> 
>>> model_2.fit(y_test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 369, in fit
    self._update_y_X(y_inner, X_inner)
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1588, in _update_y_X
    self._set_cutoff_from_y(y)
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1643, in _set_cutoff_from_y
    cutoff_idx = get_cutoff(y, self.cutoff, return_index=True)
  File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 296, in get_cutoff
    return sub_idx(obj.index, ix) if return_index else obj.index[ix]
  File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 282, in sub_idx
    res.freq = pd.infer_freq(idx)
  File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 155, in infer_freq
    inferer = _FrequencyInferer(index)
  File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 189, in __init__
    raise ValueError("Need at least 3 dates to infer frequency")
ValueError: Need at least 3 dates to infer frequency
>>> 
>>> # range index
>>> sample_data = pandas.DataFrame(data={"M1": [1, 2, 3, 4], "M2": [11, 12, 13, 14]})
>>> 
>>> y_train, y_test = temporal_train_test_split(sample_data, test_size=1)
>>> 
>>> # works
>>> model_3 = model.clone()
>>> 
>>> model_3.fit(y_train)
NaiveForecaster()
>>> model_3.update(y_test)
/home/anirban/sktime-fork/sktime/forecasting/base/_base.py:1928: UserWarning: NotImplementedWarning: NaiveForecaster does not have a custom `update` method implemented. NaiveForecaster will be refit each time `update` is called with update_params=True. To refit less often, use the wrappers in the forecasting.stream module, e.g., UpdateEvery.
  warn(
/home/anirban/sktime-fork/sktime/forecasting/base/_base.py:1928: UserWarning: NotImplementedWarning: NaiveForecaster does not have a custom `update` method implemented. NaiveForecaster will be refit each time `update` is called with update_params=True. To refit less often, use the wrappers in the forecasting.stream module, e.g., UpdateEvery.
  warn(
NaiveForecaster()
>>> 
>>> # works
>>> model_4 = model.clone()
>>> 
>>> model_4.fit(y_test)
NaiveForecaster()
>>>

tpvasconcelos · 2024-03-09T00:51:38Z

I also came across this issue (ValueError: Need at least 3 dates to infer frequency) when using multi-indexed DataFrames containing multiple time-series instances.

The issue can actually be replicated in just a couple of lines:

from sktime.forecasting.naive import NaiveForecaster
from sktime.utils._testing.hierarchical import _make_hierarchical

y = _make_hierarchical(hierarchy_levels=(2,), min_timepoints=2, max_timepoints=3)

forecaster = NaiveForecaster()
forecaster.fit(y)

and it persists even for single/flat (non-MultiIndex) series:

y = pd.DataFrame(
    data={"y": [1, 2]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"])
)

forecaster.fit(y)

The latter can be fixed by setting a frequency to the series' index. However, the multi-index case is not so easy to fix.

For instance, this doesn't work:

y_not_okay = pd.DataFrame(
    {"y": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [
            ("a", pd.Timestamp("2020-01-01")),
            ("a", pd.Timestamp("2020-01-02")),
            ("b", pd.Timestamp("2020-01-02")),
            ("b", pd.Timestamp("2020-01-03")),
        ],
        names=["instance-id", "time"],
    ),
)

forecaster.fit(y_not_okay)

while this does:

y_okay = pd.DataFrame(
    {"y": [1, 2, 3, 4]},
    index=pd.MultiIndex(
        levels=[["a", "b"], pd.date_range("2020-01-01", periods=3, freq="D")],
        codes=[[0, 0, 1, 1], [0, 1, 1, 2]],
        names=["instance-id", "time"],
    ),
)

forecaster.fit(y_okay)

The two DataFrames are (superficially) the same/equal:

>>> y_not_okay.equals(y_okay)
True
>>> pd.testing.assert_frame_equal(y_not_okay, y_okay)

However, the second one (y_okay) stores the frequency-aware index internally which allows pandas to retain the frequency when returning slices of the multi-index.

I'm not sure how to tackle this yet because the y_okay method of using explicit levels and codes is not the most user-friendly way of defining hierarchical DataFrames, nor does it scale well for cases where the different instances' data do not overlap historically.

Also, Im going to throw a wild guess that most people create multi-index hierarchical DataFrames via something like:

>>> # Read data from some database...
>>> df = pd.DataFrame(
...     {
...         "instance-id": ["a", "a", "b", "b"],
...         "time": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
...         "y": [1, 2, 3, 4],
...     },
... )
  instance-id        time  y
0           a  2020-01-01  1
1           a  2020-01-02  2
2           b  2020-01-02  3
3           b  2020-01-03  4

>>> # Convert it to multi-index via .set_index()
>>> y = df.set_index(["instance-id", "time"])
                        y
instance-id time         
a           2020-01-01  1
            2020-01-02  2
b           2020-01-02  3
            2020-01-03  4

The problem with this approach is that there is no simple way to define a frequency for the multi-indexed time-index. The only way to do this would be via the y_okay method described above making use of the levels attribute.

Solution proposal

What do you think of the idea of trying to convert the input multi-index to one that uses the levels-codes convention which would retain the frequency information? This approach would also simplify all subsequent calls to infer_freq() that sktime does internally every time an estimator is fitted (forecaster.fit(y) -> self._update_y_X() -> self._set_cutoff_from_y(y) -> get_cutoff() -> pd.infer_freq). The overhead here is probably not that significant but if would be nice to just infer the frequency of a multi-indexed series once instead of having to do this for every instance.

The other benefit is that this would solve this particular issue (#5853) of course...

A simpler solution...

Would it be possible to simply add a check here to fallback to None if the frequency cannot be inferred?

sktime/sktime/datatypes/_utilities.py

Lines 275 to 286 in d945457

    
           def sub_idx(idx, ix, return_index=True): 
        
               """Like sub-setting pd.index, but preserves freq attribute.""" 
        
               if not return_index: 
        
                   return idx[ix] 
        
               res = idx[[ix]] 
        
               if hasattr(idx, "freq"): 
        
                   if idx.freq is None: 
        
                       res.freq = pd.infer_freq(idx) 
        
                   else: 
        
                       if res.freq != idx.freq: 
        
                           res.freq = idx.freq 
        
               return res

Something like:

    def sub_idx(idx, ix, return_index=True):
        """Like sub-setting pd.index, but preserves freq attribute."""
        if not return_index:
            return idx[ix]
        res = idx[[ix]]
        if hasattr(idx, "freq"):
            if idx.freq is None:
-               res.freq = pd.infer_freq(idx)
+               with contextlib.suppress(ValueError):
+                   res.freq = pd.infer_freq(idx)
            else:
                if res.freq != idx.freq:
                    res.freq = idx.freq
        return res

tpvasconcelos · 2024-03-09T10:21:39Z

If you don't see any issues with the "simpler solution" approach, I can open a quick PR and get testing.

yarnabrina · 2024-03-09T14:09:02Z

I already provided a simpler non-multiindex example and a workaround with range index above and in the original discussion. There were some discussion on Discord as well to use range index in general as well to avoid all pandas freq related issues.

That being said, I think your suggestion is perfectly fine. I don't know if it'd affect anything elsewhere though, but we'll know in your PR.

I do not know why frequency inference is needed in the first place, do you know? Can you please tell?

tpvasconcelos · 2024-03-09T15:44:55Z

@yarnabrina range index is not an option as datetime information is sometimes needed to generate more informed forecasts. Think calendar features such as Black Friday, Easter, etc.

I'll get working on the simple fix I proposed.

Any thoughts on my other proposal?

yarnabrina · 2024-03-09T16:17:28Z

range index is not an option as datetime information is sometimes needed to generate more informed forecasts.

I think it is, for the main forecasting part. That's how I'm doing it in my office for more than a 6 months now.

Think calendar features such as Black Friday, Easter, etc.

Agreed, but it affects the transformations step in the pipeline, if I am not wrong. Few specific forecasters, like Prophet, may want to use the dates itself, but I don't think it's a general requirement.

Any thoughts on my other proposal?

I am not sure if I understood it well. Is the suggestion to run infer_freq per each different lowest level series? I am not clear how will it help if each series has < 3 observations.

(As asked above, I am not clear why freq inference is required at all, so I'm waiting for @fkiraly or you or anyone else to explain that.)

yarnabrina · 2024-03-09T16:20:18Z

@benHeid you were working on that freq issue, can you please help as well? Why is it necessary to infer frequencies?

tpvasconcelos · 2024-03-09T16:29:45Z

@yarnabrina the series' frequency is inferred to retain the frequency information on the new slice of data. This fails for the edge case where the original data itself has fewer than 3 observations and no frequency set

tpvasconcelos · 2024-03-09T16:32:31Z

I am not sure if I understood it well. Is the suggestion to run infer_freq per each different lowest level series?

@yarnabrina no, the other way around. That is what currently happens. My suggestion is to do it once. I'll try to look into this later if I get the time so I can give more details in my proposal

benHeid · 2024-03-09T16:47:02Z

@yarnabrina the series' frequency is inferred to retain the frequency information on the new slice of data. This fails for the edge case where the original data itself has fewer than 3 observations and no frequency set

You are right. I am just thinking if in this case it would be better to not provide any frequency information if the original index has no frequency set..

But I think trying to infer one and ignore the warning if not possible is a good solution. Alternatively, we could do this more explicit, using an if statement that checks if an inference would be possible.. Probably, this would be better to regarding maintainability and understandability.

yarnabrina · 2024-03-09T16:59:25Z

the series' frequency is inferred to retain the frequency information on the new slice of data. This fails for the edge case where the original data itself has fewer than 3 observations and no frequency set

I understand that of course :) My question is why do we need to know frequency at all. Does any of the algorithm use "frequency" itself in their forecasting logic? What can not be done if frequency is missing for a forecasting algorithm?

yarnabrina · 2024-03-09T17:04:15Z

I am not sure if I understood it well. Is the suggestion to run infer_freq per each different lowest level series?

@yarnabrina no, the other way around. That is what currently happens. My suggestion is to do it once. I'll try to look into this later if I get the time so I can give more details in my proposal

I don't think I am very clear about this. In the most common cases, different lowest level series will have observations corresponding to same dates/datetimes, so number of distinct observations for index will not change, will it still satisfy pandas requirements?

tpvasconcelos · 2024-03-09T18:12:24Z

In the most common cases, different lowest level series will have observations corresponding to same dates/datetimes, so number of distinct observations for index will not change

Maybe, but unfortunately sktime cannot make that assumption for all cases... The time index of hierarchical DataFrames do not have to align. You can have series with >100 observations and series with only 2 observations. They also dont have to start and/or end at the same time.

Question: does sktime support series with different frequencies within a pd-multiindex mtype?

If not, my point is that the frequency does not have to be inferred for every subseries. It can be inferred once at the beginning only. As I show in the example above, this frequency info can be retained in the multi-index when done properly. (whether this is a good idea is a whole different question 🐒 )

tpvasconcelos · 2024-03-09T18:25:00Z

I understand that of course :) My question is why do we need to know frequency at all. Does any of the algorithm use "frequency" itself in their forecasting logic? What can not be done if frequency is missing for a forecasting algorithm?

I would have to do a little deep dive into the code but from top of mind, it's needed to resolve relative forecast horizons and that is one of the reasons this get_cutoff() utility is used for. One could argue that explicit if better than implicit so relative forecast horizons maybe should always have an explicit frequency set. But I have no idea if this would open a whole other can of worms. It also depends on my question above: "does sktime support series with different frequencies within a pd-multiindex mtype". Because if that is the case, then you do want to have implicit forecast horizons that infer the frequency later down the line.

tpvasconcelos · 2024-03-22T20:12:32Z

@yarnabrina could you take a look at #6097 ? Thanks in advance!

…n 3 values (#6097) #### Reference Issues/PRs Solves #5853 #### What does this implement/fix? Explain your changes. Avoid propagating a ValueError to the end-user when the `pd.infer_freq` can't infer the frequency of the passed series. This happens when the series has fewer than 3 elements. This patch catches the exception and falls back to `None` for the series' frequency value. See examples discussed in #5853 and in the tests.

fkiraly added bug Something isn't working module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting labels Jan 28, 2024

fkiraly added this to Needs triage & validation in Bugfixing via automation Jan 28, 2024

tpvasconcelos mentioned this issue Mar 9, 2024

[BUG] Ignore ValueError on pd.infer_freq when index has fewer than 3 values #6097

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] `forecaster.update` fails on multi indexed data frame #5853

[BUG] `forecaster.update` fails on multi indexed data frame #5853

fkiraly commented Jan 28, 2024

yarnabrina commented Jan 29, 2024

tpvasconcelos commented Mar 9, 2024 •

edited

Loading

tpvasconcelos commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

tpvasconcelos commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

tpvasconcelos commented Mar 9, 2024

tpvasconcelos commented Mar 9, 2024

benHeid commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

tpvasconcelos commented Mar 9, 2024 •

edited

Loading

tpvasconcelos commented Mar 9, 2024

tpvasconcelos commented Mar 22, 2024

[BUG] forecaster.update fails on multi indexed data frame #5853

[BUG] forecaster.update fails on multi indexed data frame #5853

Comments

fkiraly commented Jan 28, 2024

Discussed in #5833

yarnabrina commented Jan 29, 2024

tpvasconcelos commented Mar 9, 2024 • edited Loading

Solution proposal

A simpler solution...

tpvasconcelos commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

tpvasconcelos commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

tpvasconcelos commented Mar 9, 2024

tpvasconcelos commented Mar 9, 2024

benHeid commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

yarnabrina commented Mar 9, 2024

tpvasconcelos commented Mar 9, 2024 • edited Loading

tpvasconcelos commented Mar 9, 2024

tpvasconcelos commented Mar 22, 2024

[BUG] `forecaster.update` fails on multi indexed data frame #5853

[BUG] `forecaster.update` fails on multi indexed data frame #5853

tpvasconcelos commented Mar 9, 2024 •

edited

Loading

tpvasconcelos commented Mar 9, 2024 •

edited

Loading