Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] forecaster.update fails on multi indexed data frame #5853

Open
fkiraly opened this issue Jan 28, 2024 Discussed in #5833 · 15 comments
Open

[BUG] forecaster.update fails on multi indexed data frame #5853

fkiraly opened this issue Jan 28, 2024 Discussed in #5833 · 15 comments
Labels
bug Something isn't working module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting
Projects

Comments

@fkiraly
Copy link
Collaborator

fkiraly commented Jan 28, 2024

Discussed in #5833

Originally posted by ManuB68 January 25, 2024
I am trying to update a forecaster but it doesn't seem to work on my multi indexed data frame:

python code:

df = pd.DataFrame
df = pd.DataFrame([['A', '2024-01-01', 1, 11], ['A', '2024-01-02', 2, 12], ['A', '2024-01-03', 3, 13], ['A', '2024-01-04', 4, 14],\
                   ['B', '2024-01-01', 101, 1], ['B', '2024-01-02', 102, 2], ['B', '2024-01-03', 103, 3], ['B', '2024-01-04', 104, 4],\
                    ['C', '2024-01-01', 10000, 10], ['C', '2024-01-02', 20000, 20], ['C', '2024-01-03', 30000, 30], ['C', '2024-01-04', 40000, 43]],\
                  columns=['Patients', 'Dates', 'M1', 'M2'])

df['Dates']= pd.to_datetime(df['Dates'])

df = df.reset_index(drop = True)
df = df.set_index(["Patients", "Dates"])

y_train, y_test = temporal_train_test_split(df, test_size=1)

fh = 1

forecaster = VAR() 

forecaster.fit(y_train)
y_pred = forecaster.predict(fh=fh)


Day = y_test.index.get_level_values("Dates")[-1]
y_update = y_test[y_test.index.get_level_values("Dates") == Day]

forecaster.update(y_update)

y_pred_updated = forecaster.predict(fh)

Code might look overcomplicated, due to extraction from a code iterating on dates and my poor knowledge of python
The execution fails on line:
forecaster.update(y_update)

with this message:
ValueError: Need at least 3 dates to infer frequency

@fkiraly fkiraly added bug Something isn't working module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting labels Jan 28, 2024
@fkiraly fkiraly added this to Needs triage & validation in Bugfixing via automation Jan 28, 2024
@yarnabrina
Copy link
Collaborator

Reproducible example to show the following:

  1. not caused by multi-index
  2. not caused by update
  3. workaround by using range index
>>> 
>>> import pandas
>>> from sktime.forecasting.naive import NaiveForecaster
>>> from sktime.split import temporal_train_test_split
>>> 
>>> model = NaiveForecaster()
>>> 
>>> # datetime index
>>> sample_data = pandas.DataFrame(
...     data={"M1": [1, 2, 3, 4], "M2": [11, 12, 13, 14]},
...     index=pandas.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
... )
>>> 
>>> y_train, y_test = temporal_train_test_split(sample_data, test_size=1)
>>> 
>>> # fails even with non-multi-index data
>>> model_1 = model.clone()
>>> 
>>> model_1.fit(y_train)
NaiveForecaster()
>>> model_1.update(y_test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 886, in update
    self._update_y_X(y_inner, X_inner)
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1588, in _update_y_X
    self._set_cutoff_from_y(y)
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1643, in _set_cutoff_from_y
    cutoff_idx = get_cutoff(y, self.cutoff, return_index=True)
  File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 296, in get_cutoff
    return sub_idx(obj.index, ix) if return_index else obj.index[ix]
  File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 282, in sub_idx
    res.freq = pd.infer_freq(idx)
  File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 155, in infer_freq
    inferer = _FrequencyInferer(index)
  File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 189, in __init__
    raise ValueError("Need at least 3 dates to infer frequency")
ValueError: Need at least 3 dates to infer frequency
>>> 
>>> # fails even during fit
>>> model_2 = model.clone()
>>> 
>>> model_2.fit(y_test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 369, in fit
    self._update_y_X(y_inner, X_inner)
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1588, in _update_y_X
    self._set_cutoff_from_y(y)
  File "/home/anirban/sktime-fork/sktime/forecasting/base/_base.py", line 1643, in _set_cutoff_from_y
    cutoff_idx = get_cutoff(y, self.cutoff, return_index=True)
  File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 296, in get_cutoff
    return sub_idx(obj.index, ix) if return_index else obj.index[ix]
  File "/home/anirban/sktime-fork/sktime/datatypes/_utilities.py", line 282, in sub_idx
    res.freq = pd.infer_freq(idx)
  File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 155, in infer_freq
    inferer = _FrequencyInferer(index)
  File "/home/anirban/conda-environments/sktime/lib/python3.10/site-packages/pandas/tseries/frequencies.py", line 189, in __init__
    raise ValueError("Need at least 3 dates to infer frequency")
ValueError: Need at least 3 dates to infer frequency
>>> 
>>> # range index
>>> sample_data = pandas.DataFrame(data={"M1": [1, 2, 3, 4], "M2": [11, 12, 13, 14]})
>>> 
>>> y_train, y_test = temporal_train_test_split(sample_data, test_size=1)
>>> 
>>> # works
>>> model_3 = model.clone()
>>> 
>>> model_3.fit(y_train)
NaiveForecaster()
>>> model_3.update(y_test)
/home/anirban/sktime-fork/sktime/forecasting/base/_base.py:1928: UserWarning: NotImplementedWarning: NaiveForecaster does not have a custom `update` method implemented. NaiveForecaster will be refit each time `update` is called with update_params=True. To refit less often, use the wrappers in the forecasting.stream module, e.g., UpdateEvery.
  warn(
/home/anirban/sktime-fork/sktime/forecasting/base/_base.py:1928: UserWarning: NotImplementedWarning: NaiveForecaster does not have a custom `update` method implemented. NaiveForecaster will be refit each time `update` is called with update_params=True. To refit less often, use the wrappers in the forecasting.stream module, e.g., UpdateEvery.
  warn(
NaiveForecaster()
>>> 
>>> # works
>>> model_4 = model.clone()
>>> 
>>> model_4.fit(y_test)
NaiveForecaster()
>>> 

@tpvasconcelos
Copy link
Contributor

tpvasconcelos commented Mar 9, 2024

I also came across this issue (ValueError: Need at least 3 dates to infer frequency) when using multi-indexed DataFrames containing multiple time-series instances.

The issue can actually be replicated in just a couple of lines:

from sktime.forecasting.naive import NaiveForecaster
from sktime.utils._testing.hierarchical import _make_hierarchical

y = _make_hierarchical(hierarchy_levels=(2,), min_timepoints=2, max_timepoints=3)

forecaster = NaiveForecaster()
forecaster.fit(y)

and it persists even for single/flat (non-MultiIndex) series:

y = pd.DataFrame(
    data={"y": [1, 2]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"])
)

forecaster.fit(y)

The latter can be fixed by setting a frequency to the series' index. However, the multi-index case is not so easy to fix.

For instance, this doesn't work:

y_not_okay = pd.DataFrame(
    {"y": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [
            ("a", pd.Timestamp("2020-01-01")),
            ("a", pd.Timestamp("2020-01-02")),
            ("b", pd.Timestamp("2020-01-02")),
            ("b", pd.Timestamp("2020-01-03")),
        ],
        names=["instance-id", "time"],
    ),
)

forecaster.fit(y_not_okay)

while this does:

y_okay = pd.DataFrame(
    {"y": [1, 2, 3, 4]},
    index=pd.MultiIndex(
        levels=[["a", "b"], pd.date_range("2020-01-01", periods=3, freq="D")],
        codes=[[0, 0, 1, 1], [0, 1, 1, 2]],
        names=["instance-id", "time"],
    ),
)

forecaster.fit(y_okay)

The two DataFrames are (superficially) the same/equal:

>>> y_not_okay.equals(y_okay)
True
>>> pd.testing.assert_frame_equal(y_not_okay, y_okay)

However, the second one (y_okay) stores the frequency-aware index internally which allows pandas to retain the frequency when returning slices of the multi-index.

I'm not sure how to tackle this yet because the y_okay method of using explicit levels and codes is not the most user-friendly way of defining hierarchical DataFrames, nor does it scale well for cases where the different instances' data do not overlap historically.

Also, Im going to throw a wild guess that most people create multi-index hierarchical DataFrames via something like:

>>> # Read data from some database...
>>> df = pd.DataFrame(
...     {
...         "instance-id": ["a", "a", "b", "b"],
...         "time": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
...         "y": [1, 2, 3, 4],
...     },
... )
  instance-id        time  y
0           a  2020-01-01  1
1           a  2020-01-02  2
2           b  2020-01-02  3
3           b  2020-01-03  4

>>> # Convert it to multi-index via .set_index()
>>> y = df.set_index(["instance-id", "time"])
                        y
instance-id time         
a           2020-01-01  1
            2020-01-02  2
b           2020-01-02  3
            2020-01-03  4

The problem with this approach is that there is no simple way to define a frequency for the multi-indexed time-index. The only way to do this would be via the y_okay method described above making use of the levels attribute.

Solution proposal

What do you think of the idea of trying to convert the input multi-index to one that uses the levels-codes convention which would retain the frequency information? This approach would also simplify all subsequent calls to infer_freq() that sktime does internally every time an estimator is fitted (forecaster.fit(y) -> self._update_y_X() -> self._set_cutoff_from_y(y) -> get_cutoff() -> pd.infer_freq). The overhead here is probably not that significant but if would be nice to just infer the frequency of a multi-indexed series once instead of having to do this for every instance.

The other benefit is that this would solve this particular issue (#5853) of course...

A simpler solution...

Would it be possible to simply add a check here to fallback to None if the frequency cannot be inferred?

def sub_idx(idx, ix, return_index=True):
"""Like sub-setting pd.index, but preserves freq attribute."""
if not return_index:
return idx[ix]
res = idx[[ix]]
if hasattr(idx, "freq"):
if idx.freq is None:
res.freq = pd.infer_freq(idx)
else:
if res.freq != idx.freq:
res.freq = idx.freq
return res

Something like:

    def sub_idx(idx, ix, return_index=True):
        """Like sub-setting pd.index, but preserves freq attribute."""
        if not return_index:
            return idx[ix]
        res = idx[[ix]]
        if hasattr(idx, "freq"):
            if idx.freq is None:
-               res.freq = pd.infer_freq(idx)
+               with contextlib.suppress(ValueError):
+                   res.freq = pd.infer_freq(idx)
            else:
                if res.freq != idx.freq:
                    res.freq = idx.freq
        return res

@tpvasconcelos
Copy link
Contributor

If you don't see any issues with the "simpler solution" approach, I can open a quick PR and get testing.

@yarnabrina
Copy link
Collaborator

I already provided a simpler non-multiindex example and a workaround with range index above and in the original discussion. There were some discussion on Discord as well to use range index in general as well to avoid all pandas freq related issues.

That being said, I think your suggestion is perfectly fine. I don't know if it'd affect anything elsewhere though, but we'll know in your PR.


I do not know why frequency inference is needed in the first place, do you know? Can you please tell?

@tpvasconcelos
Copy link
Contributor

@yarnabrina range index is not an option as datetime information is sometimes needed to generate more informed forecasts. Think calendar features such as Black Friday, Easter, etc.

I'll get working on the simple fix I proposed.

Any thoughts on my other proposal?

@yarnabrina
Copy link
Collaborator

range index is not an option as datetime information is sometimes needed to generate more informed forecasts.

I think it is, for the main forecasting part. That's how I'm doing it in my office for more than a 6 months now.

Think calendar features such as Black Friday, Easter, etc.

Agreed, but it affects the transformations step in the pipeline, if I am not wrong. Few specific forecasters, like Prophet, may want to use the dates itself, but I don't think it's a general requirement.

Any thoughts on my other proposal?

I am not sure if I understood it well. Is the suggestion to run infer_freq per each different lowest level series? I am not clear how will it help if each series has < 3 observations.

(As asked above, I am not clear why freq inference is required at all, so I'm waiting for @fkiraly or you or anyone else to explain that.)

@yarnabrina
Copy link
Collaborator

@benHeid you were working on that freq issue, can you please help as well? Why is it necessary to infer frequencies?

@tpvasconcelos
Copy link
Contributor

@yarnabrina the series' frequency is inferred to retain the frequency information on the new slice of data. This fails for the edge case where the original data itself has fewer than 3 observations and no frequency set

@tpvasconcelos
Copy link
Contributor

I am not sure if I understood it well. Is the suggestion to run infer_freq per each different lowest level series?

@yarnabrina no, the other way around. That is what currently happens. My suggestion is to do it once. I'll try to look into this later if I get the time so I can give more details in my proposal

@benHeid
Copy link
Contributor

benHeid commented Mar 9, 2024

@yarnabrina the series' frequency is inferred to retain the frequency information on the new slice of data. This fails for the edge case where the original data itself has fewer than 3 observations and no frequency set

You are right. I am just thinking if in this case it would be better to not provide any frequency information if the original index has no frequency set..

But I think trying to infer one and ignore the warning if not possible is a good solution. Alternatively, we could do this more explicit, using an if statement that checks if an inference would be possible.. Probably, this would be better to regarding maintainability and understandability.

@yarnabrina
Copy link
Collaborator

the series' frequency is inferred to retain the frequency information on the new slice of data. This fails for the edge case where the original data itself has fewer than 3 observations and no frequency set

I understand that of course :) My question is why do we need to know frequency at all. Does any of the algorithm use "frequency" itself in their forecasting logic? What can not be done if frequency is missing for a forecasting algorithm?

@yarnabrina
Copy link
Collaborator

I am not sure if I understood it well. Is the suggestion to run infer_freq per each different lowest level series?

@yarnabrina no, the other way around. That is what currently happens. My suggestion is to do it once. I'll try to look into this later if I get the time so I can give more details in my proposal

I don't think I am very clear about this. In the most common cases, different lowest level series will have observations corresponding to same dates/datetimes, so number of distinct observations for index will not change, will it still satisfy pandas requirements?

@tpvasconcelos
Copy link
Contributor

tpvasconcelos commented Mar 9, 2024

In the most common cases, different lowest level series will have observations corresponding to same dates/datetimes, so number of distinct observations for index will not change

Maybe, but unfortunately sktime cannot make that assumption for all cases... The time index of hierarchical DataFrames do not have to align. You can have series with >100 observations and series with only 2 observations. They also dont have to start and/or end at the same time.

Question: does sktime support series with different frequencies within a pd-multiindex mtype?

If not, my point is that the frequency does not have to be inferred for every subseries. It can be inferred once at the beginning only. As I show in the example above, this frequency info can be retained in the multi-index when done properly. (whether this is a good idea is a whole different question 🐒 )

@tpvasconcelos
Copy link
Contributor

I understand that of course :) My question is why do we need to know frequency at all. Does any of the algorithm use "frequency" itself in their forecasting logic? What can not be done if frequency is missing for a forecasting algorithm?

I would have to do a little deep dive into the code but from top of mind, it's needed to resolve relative forecast horizons and that is one of the reasons this get_cutoff() utility is used for. One could argue that explicit if better than implicit so relative forecast horizons maybe should always have an explicit frequency set. But I have no idea if this would open a whole other can of worms. It also depends on my question above: "does sktime support series with different frequencies within a pd-multiindex mtype". Because if that is the case, then you do want to have implicit forecast horizons that infer the frequency later down the line.

@tpvasconcelos
Copy link
Contributor

@yarnabrina could you take a look at #6097 ? Thanks in advance!

fkiraly pushed a commit that referenced this issue Mar 22, 2024
…n 3 values (#6097)

#### Reference Issues/PRs

Solves #5853

#### What does this implement/fix? Explain your changes.

Avoid propagating a ValueError to the end-user when the `pd.infer_freq`
can't infer the frequency of the passed series. This happens when the
series has fewer than 3 elements. This patch catches the exception and
falls back to `None` for the series' frequency value.

See examples discussed in #5853 and in the tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting
Projects
Bugfixing
Needs triage & validation
Development

No branches or pull requests

4 participants