Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fh.to_relative() bug for DatetimeIndex #582

Merged
merged 11 commits into from
Jan 5, 2021
Merged

Fix fh.to_relative() bug for DatetimeIndex #582

merged 11 commits into from
Jan 5, 2021

Conversation

aiwalter
Copy link
Contributor

Reference Issues/PRs

Solved #534
Not solved #561 as this comes from AutoETS and is a different bug.

What does this implement/fix? Explain your changes.

Bug fix

@mloning
Copy link
Contributor

mloning commented Dec 29, 2020

@aiwalter thanks! We're working on the failing test on master #581

@aiwalter
Copy link
Contributor Author

Good to know, I was already wondering 😁

Copy link
Contributor

@mloning mloning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aiwalter, what about going from relative to absolute in to_absolute, do we also have to make changes there if the time index/cutoff is pd.DatetimeIndex/pd.Timestamp?

With this we should also be able to simplify _coerce_duration_to_int and other utility functions in utils/datetime which are a mess currently, but can do that in a separate PR.


# Bug fix for DatetimeIndex delta calculation (see #534)
if isinstance(index, pd.DatetimeIndex):
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the error that we need to catch here? Can we not check if either index or freq has a freq and raise an error otherwise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the tests, there are some indices with only one value and freq=None, this is causing the error. Also we have to try first in order to convert e.g. the MS from DatetimeIndex into M of PeriodIndex automatically. If doing y.index.to_period(freq="MS") will result in the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-3c5e56bae74c> in <module>
----> 1 y.index.to_period(freq="MS")

c:\users\martin\miniconda3\envs\sktime37\lib\site-packages\pandas\core\indexes\extension.py in method(self, *args, **kwargs)
     76 
     77         def method(self, *args, **kwargs):
---> 78             result = attr(self._data, *args, **kwargs)
     79             if wrap:
     80                 if isinstance(result, type(self._data)):

c:\users\martin\miniconda3\envs\sktime37\lib\site-packages\pandas\core\arrays\datetimes.py in to_period(self, freq)
   1109             freq = res
   1110 
-> 1111         return PeriodArray._from_datetime64(self._data, freq, tz=self.tz)
   1112 
   1113     def to_perioddelta(self, freq):

c:\users\martin\miniconda3\envs\sktime37\lib\site-packages\pandas\core\arrays\period.py in _from_datetime64(cls, data, freq, tz)
    228         PeriodArray[freq]
    229         """
--> 230         data, freq = dt64arr_to_periodarr(data, freq, tz)
    231         return cls(data, freq=freq)
    232 

c:\users\martin\miniconda3\envs\sktime37\lib\site-packages\pandas\core\arrays\period.py in dt64arr_to_periodarr(data, freq, tz)
    957         data = data._values
    958 
--> 959     base = freq._period_dtype_code
    960     return c_dt64arr_to_periodarr(data.view("i8"), base, tz), freq
    961 

AttributeError: 'pandas._libs.tslibs.offsets.MonthBegin' object has no attribute '_period_dtype_code'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not check if either index or freq has a freq and raise an error otherwise?

Then we have to rewrite some tests. And I think it is ok to accept also freq=None as e.g. an index like pd.DatetimeIndex(['2000-02-10 00:00:00']) is just resulting in DatetimeIndex(['2000-02-10'], dtype='datetime64[ns]', freq=None)

@aiwalter
Copy link
Contributor Author

@mloning I am having a look into to_absolute() to see if we can improve things as well. I already found sth related to the freq, so lets keep this PR open a bit..

@mloning
Copy link
Contributor

mloning commented Dec 30, 2020

Hi @aiwalter - I had a look at this as well and reworked your initial solution a little bit, shall I just push it to this branch or create a separate PR?

@aiwalter
Copy link
Contributor Author

Ok feel free to push it here :)

@mloning
Copy link
Contributor

mloning commented Dec 30, 2020

@aiwalter Sorry if I accidentally overwrote any of your most recent stuff! Feel free to revert my changes if you don't like them. There's another issue with pd.DatetimeIndex which I mentioned in #534

@aiwalter
Copy link
Contributor Author

@mloning no problem, not sure which commit is failing now?

@mloning
Copy link
Contributor

mloning commented Dec 31, 2020

So with these changes now, this one still fails but raises a more helpful error:

from sktime.forecasting.all import *
y = load_airline()
y.index = y.index.to_timestamp()
y_train, y_test = temporal_train_test_split(y, test_size=120)
fh = ForecastingHorizon(y_test.index[0:5], is_relative=False)
fh.to_relative(cutoff=y_train.index[-1])

And this one works:

from sktime.forecasting.all import *
y = load_airline()
y.index = y.index.to_timestamp("M")
y_train, y_test = temporal_train_test_split(y, test_size=120)
fh = ForecastingHorizon(y_test.index[0:5], is_relative=False)
fh.to_relative(cutoff=y_train.index[-1])

@mloning
Copy link
Contributor

mloning commented Dec 31, 2020

@aiwalter Any thoughts on these changes? I'm happy to merge them as they are, but agree that we need to further try to simplify things for the handling of time indices and the forecasting horizon.

@aiwalter
Copy link
Contributor Author

@mloning I am not sure if it is the best UX when excluding some indices based on some freq values. The problem for some values like freq="MS" is here, as the freq value is resulting in None when doing .to_period().to_timestamp():

>>> pd.date_range(start='1/1/2018 00:00:00', end='2/1/2018 00:01:00', freq="MS")
DatetimeIndex(['2018-01-01', '2018-02-01'], dtype='datetime64[ns]', freq='MS')

>>> pd.date_range(start='1/1/2018 00:00:00', end='2/1/2018 00:01:00', freq="MS").to_period().to_timestamp()
DatetimeIndex(['2018-01-01', '2018-02-01'], dtype='datetime64[ns]', freq=None)

I am posting this in the pandas gitter as well if there is any other way to solve this.

Actually my solution with try/except was working well because I was able to read the freq from cutoff and assign it back. We could just test all freq values from here to make sure it is working for all? What do you think?

Further there is now an issue here:

from sktime.forecasting.all import *
y = load_airline()
y.index = y.index.to_timestamp("M")
y_train, y_test = temporal_train_test_split(y, test_size=10)

from sktime.forecasting.ets import AutoETS
from sktime.forecasting.fbprophet import Prophet
from sktime.forecasting.arima import AutoARIMA

forecaster = EnsembleForecaster([
    ("autoARIMA", AutoARIMA(sp=12)),
    ("autoETS", AutoETS(auto=True, sp=12, n_jobs=-1)),
    ("fbprophet", Prophet(seasonality_mode="multiplicative", add_country_holidays={"country_name": "US"}))
])
%time forecaster.fit(y_train)

forecaster.predict([1,2,3])

which results in this error:
ValueError: Index type not supported. Please consider using pd.PeriodIndex.

So that is actually sth related to #561 .

@aiwalter
Copy link
Contributor Author

aiwalter commented Jan 1, 2021

@mloning : See pandas issue. I didnt know about freq="infer", not sure how reliable this works 🤔

@mloning
Copy link
Contributor

mloning commented Jan 4, 2021

which results in this error:
ValueError: Index type not supported. Please consider using pd.PeriodIndex.

So that is actually sth related to #561 .

I think this is fine for now because "M" cannot reliably be handled otherwise, so better to raise an informative error.

Re pandas, I think it shouldn't forget freq even if values are missing, freq to me is not the same as complete, but that's a different issue.

I'm not a fan of the try-except workflow if it can be avoided, I'd prefer to handle a known set of cases and raise informative errors otherwise.

I'd suggest to merge this PR first since it's already an improvement, even though it doesn't fix all the issues, and then you can have another go at including more cases if you like?

@aiwalter
Copy link
Contributor Author

aiwalter commented Jan 4, 2021

@mloning its fine for me to merge this already. I was also thinking about opening our fbprophet for other index types by wrapping them around DatetimeIndex, but might do that in a separate PR. This will however make the index stuff even a bit more complex

@mloning
Copy link
Contributor

mloning commented Jan 5, 2021

Alright great @aiwalter - I'll merge this one now and then we can tackle the remaining issues.

A lot of the complications for us seem to stem from incomplete pandas functionality - perhaps it's best to contribute to pandas, but that will take some more time to get it approved and merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants