Natural gaps in timeseries observations #284

Closed
rmk17 opened this issue Mar 5, 2021 · 15 comments
Labels
feature request, triage

Comments

@rmk17

rmk17 commented Mar 5, 2021

I have hourly energy observations taken on business days only, so 120 hours per week rather than 168. Saturdays and Sundays are always missing, and so are holidays. Seasonality is daily, weekly, and yearly.

I was trying to follow the samples and use TimeSeries.from_dataframe with default settings. I got a lot of NaNs inserted into the DatetimeIndex, which matches the behaviour of pandas asfreq('H'). So with the train/test split train, val = series.split_before(pd.Timestamp('20200101')) I get:

len(data[:'20200101']), len(train), len(data['20200101':]), len(val)
(11856, 17328, 5904, 9480)
data['20200101':]['load']
dt_iso
2020-01-09 00:00:00     801.0410
2020-01-09 01:00:00     790.4990
2020-01-09 02:00:00     770.1160
2020-01-09 03:00:00     770.8910
2020-01-09 04:00:00     774.4680
                         ...    
2021-01-29 19:00:00   1,026.1950
2021-01-29 20:00:00   1,007.2650
2021-01-29 21:00:00     990.8280
2021-01-29 22:00:00     953.9190
2021-01-29 23:00:00     904.5980
Name: load, Length: 5904, dtype: float64
val['load']
                          load
date                          
2020-01-01 00:00:00        nan
2020-01-01 01:00:00        nan
2020-01-01 02:00:00        nan
2020-01-01 03:00:00        nan
2020-01-01 04:00:00        nan
...                        ...
2021-01-29 19:00:00 1,026.1950
2021-01-29 20:00:00 1,007.2650
2021-01-29 21:00:00   990.8280
2021-01-29 22:00:00   953.9190
2021-01-29 23:00:00   904.5980

[9480 rows x 1 columns]
Freq: H

So one can see that data has exploded with NaNs.
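The inflation described above can be reproduced with pandas alone. The sketch below uses made-up data (two weeks of hourly values on business days, not the poster's dataset) and `pd.offsets.Hour()` in place of the `'H'` alias so it does not depend on the pandas version:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data: hourly values on business days only.
days = pd.bdate_range("2021-03-01", periods=10)  # Mon-Fri for two weeks
index = pd.DatetimeIndex(
    [d + pd.Timedelta(hours=h) for d in days for h in range(24)]
)
data = pd.DataFrame({"load": np.arange(len(index), dtype=float)}, index=index)

# Reindexing onto a gap-free hourly grid inserts a NaN row for every
# weekend hour between the first and the last observation.
full = data.asfreq(pd.offsets.Hour())
n_inserted = int(full["load"].isna().sum())

print(len(data), len(full), n_inserted)
```

The row count grows by exactly the number of missing hours, which is the same effect as the jump from 5904 to 9480 rows shown above.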

Obviously darts.utils.statistics.plot_acf() and darts.utils.statistics.check_seasonality() do not work with NaNs.
plot_acf() gives me a straight line at zero, where there should be AR lags up to 192.
check_seasonality() reports
[2021-03-05 17:39:16,120] INFO | darts.utils.statistics | The ACF has no local maximum for m < max_lag = 24.

If I supply fill_missing_dates=False for the TimeSeries.from_dataframe():
series = TimeSeries.from_dataframe(data, time_col='date', value_cols=['load'], fill_missing_dates=False)
I get Could not infer frequency. Are some dates missing? Try specifying 'fill_missing_dates=True'

Passing freq='H' to the function above does not help either. It is the same problem as with a pandas DatetimeIndex with freq='H'.

In statsmodels SARIMAX models I was able to get around the datetime frequency warning by converting the index to a PeriodIndex, which supports gaps.
data.index = pd.DatetimeIndex(data.index).to_period('H')

Please advise what my options are with Darts for working with time series data that has natural gaps.

Thanks a lot in advance for the attention.

P.S. Some pictures to illustrate time series and gaps
[Images: Capture-1, Capture]

@rmk17 added the feature request and triage labels Mar 5, 2021
@tneuer
Contributor

tneuer commented Mar 8, 2021

Hm, that is an extremely interesting problem! Thanks for mentioning it. A few questions and remarks:

  1. Maybe as a short term solution you could try using "Custom Business Hours" from pandas and first make the time index work with pure pandas. Then it should hopefully work with Darts when using TimeSeries.from_dataframe(your_df).

  2. When you use the PeriodIndex, do you get the intended behaviour? Does it fill in missing values or just ignore them?

  3. We have the implementation of dummy time indexing on our suggested TODO list for the next release. Now that we see how this could solve real problems, we will probably consider implementing it sooner rather than later! We'll keep you updated!

Let us know if any suggestions helped. We hope we can solve this issue somehow.

@hrzn
Contributor

hrzn commented Mar 8, 2021

At the moment, time series with holes are not supported. As @tneuer said, we have it on our roadmap to address this. In the meantime, if you'd like to train a forecasting model on each of the sub-segments, you can still create several TimeSeries instances and then fit a model on all of those time series.

@tneuer
Contributor

tneuer commented Mar 11, 2021

Hey @rmk17, were you able to solve your problem? I have two more suggestions:

  1. Try and give some dummy index yourself, something like
pd.date_range(start=data.index[0], freq="H", periods=len(data))

Then use darts to fit any model you like and back-transform in the end and introduce the gaps again. It sounds a little bit tedious (and it is) but the transformation needs only to be done once in the beginning and once in the end.

  2. Your data shows very strong seasonality with m=7 days and m=24 hours, so you could also try to impute the missing values for Saturday and Sunday by taking the mean at the same time of day over the previous 5 business days. With such strong seasonality this shouldn't introduce too much bias, and you would have more reliable results. This of course only works if you expect the time series to show the same behaviour on the weekend.
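The imputation idea in the second suggestion could look roughly like the sketch below. It is pandas-only and runs on made-up data; the helper name `impute_same_hour_mean` and its `n_days` parameter are hypothetical, and only gaps between the first and last observation are filled.

```python
import pandas as pd

def impute_same_hour_mean(observed, n_days=5):
    """Fill missing hours with the mean of the last n_days observed values
    at the same hour of day (hypothetical helper, not part of Darts)."""
    full = observed.asfreq(pd.offsets.Hour())  # NaN wherever an hour is missing
    filled = full.copy()
    for hour, obs_h in observed.groupby(observed.index.hour):
        # Mean of the most recent n_days observations at this hour.
        rolling = obs_h.rolling(n_days, min_periods=1).mean()
        same_hour = full.index[full.index.hour == hour]
        # Carry the latest available mean forward onto the missing days.
        candidate = rolling.reindex(same_hour).ffill()
        gaps = same_hour[full.loc[same_hour].isna().to_numpy()]
        if len(gaps):
            filled.loc[gaps] = candidate.loc[gaps].to_numpy()
    return filled

# Made-up data: value = hour of day + position of the business day (0..9).
days = pd.bdate_range("2021-03-01", periods=10)
index = pd.DatetimeIndex(
    [d + pd.Timedelta(hours=h) for d in days for h in range(24)]
)
values = [float(h + i) for i in range(len(days)) for h in range(24)]
load = pd.Series(values, index=index)

filled = impute_same_hour_mean(load)
print(filled.loc["2021-03-06"].head())
```

Every weekend hour receives the same-hour average of the five preceding business days, so the daily shape is kept while the weekend level is an assumption rather than an observation.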

@rmk17 rmk17 closed this as completed Mar 11, 2021
@rmk17
Author

rmk17 commented Mar 11, 2021

Hello @tneuer and @hrzn.

Thank you for expressing interest in my issue.

So far I have been able to build a custom index with the dates I need by passing a list of holidays to CustomBusinessDay(). When I try to use asfreq('H') on that index, the extra hours between the first and last date are inserted again. CustomBusinessHour, on the other hand, cannot use holidays, and I found no way to mix the two without losing the freq attribute.
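The situation described here can be sketched with pandas alone. The holiday list and date range below are made up for illustration; the point is that the daily index respects holidays, while the hourly index built from it has no `freq`:

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

holidays = ["2021-03-08"]  # hypothetical holiday list
business_days = pd.date_range(
    "2021-03-01", "2021-03-12", freq=CustomBusinessDay(holidays=holidays)
)

# Expand every remaining day to 24 hourly stamps. The result has exactly the
# wanted timestamps, but pandas cannot attach a fixed frequency to it.
hourly_index = pd.DatetimeIndex(
    [day + pd.Timedelta(hours=h) for day in business_days for h in range(24)]
)

print(len(business_days), len(hourly_index), hourly_index.freq)
```

An index like this is fine for storing the observations, but it lacks the frequency that a time-indexed TimeSeries needs according to the maintainers' replies in this thread.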

  1. As for your suggestion for dummy index.
new_index = pd.date_range(start=data.index[0], freq="H", periods=len(data))
print(len(new_index), new_index[-1])
print(len(data.index), data.index[-1])

46032 2018-07-01 23:00:00
46032 2021-01-29 23:00:00

So it does create an index with a freq attribute, but since a fixed-frequency pandas index cannot contain gaps, the index ends prematurely in time. Also, if gaps are introduced into the data at some later point, pandas drops the freq attribute.

  2. pandas.core.indexes.period.PeriodIndex, on the other hand, keeps the original data structure intact, both in size and in time:

type(data.index), len(data.index), data.index[-1] prints:
(pandas.core.indexes.datetimes.DatetimeIndex, 46032, Timestamp('2021-01-29 23:00:00'))

Whereas
data.index = pd.DatetimeIndex(data.index).to_period('H')
type(data.index), len(data.index), data.index[-1] prints:

(pandas.core.indexes.period.PeriodIndex, 46032, Period('2021-01-29 23:00', 'H'))

  3. As for imputation, I have already imputed the missing values (both with regression and with XGBoost), using the fact that I have hourly weather data for all observations.
    But since the data is highly auto-regressive in nature (with the most significant lag being 24 and those around it), this imputation introduces unnecessary errors into the Monday forecasts.

Many thanks for all the input you gave.

P.S. Sorry, accidentally pressed "Close issue" button.

@rmk17 rmk17 reopened this Mar 11, 2021
@StatMixedML

StatMixedML commented Sep 10, 2021

@rmk17 Not sure if the problem still persists, but what you can do is to impute the missing values with np.nan and then add an observed values indicator/feature that is 1 for the actual observed values/time steps and 0 for the NAs. With this, the model does not put any emphasis on the NAs, and you do not have to artificially impute the missing time steps.

Regarding the strong seasonality problem: The above suggested method wouldn't introduce any artificial data into the model so that the seasonality is preserved. However, depending on the choice of the model (e.g., RNN) you might want to add some seasonally lagged features (depending on the frequency and seasonality of your data) or add an attention block. As far as the TCN network is concerned, since it has access to the full history, it might pick it up itself.
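A minimal pandas sketch of the observed-value indicator described above, on made-up data. How such a feature is then passed to a particular Darts model is not settled in this thread, so that step is deliberately left out:

```python
import numpy as np
import pandas as pd

# Made-up data: hourly values on business days only, for two weeks.
days = pd.bdate_range("2021-03-01", periods=10)
index = pd.DatetimeIndex(
    [d + pd.Timedelta(hours=h) for d in days for h in range(24)]
)
rng = np.random.default_rng(0)
data = pd.DataFrame({"load": rng.normal(800.0, 50.0, len(index))}, index=index)

# Put the data on a gap-free hourly grid; missing hours become NaN.
full = data.asfreq(pd.offsets.Hour())

# 1 where a value was actually observed, 0 where the time step is a gap.
full["observed"] = full["load"].notna().astype(int)

print(full["observed"].value_counts())
```

The `observed` column can serve as a mask or an extra feature, so the gaps stay NaN instead of being replaced by invented values.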

@hrzn hrzn closed this as completed Oct 7, 2021
@montyhall

@StatMixedML I have implemented such a strategy when building my own models (often by masking the loss function). But can you share how you would do this through the Darts API?

thank you

@temiwale88

Any updates on the roadmap? I too have natural gaps in my data and am not sure how to proceed. Thanks!

@gabrielgcbs

I would also like to know if there are any updates on built-in ways to work with natural gaps. I'm trying to do stock price prediction, but the stock market only operates on weekdays, so I have a problem similar to the one described in this issue.

@hrzn
Contributor

hrzn commented Sep 27, 2022

Have you tried using a business day frequency ("B") ?
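For daily data, the business day frequency can be tried in pandas first. The prices and dates below are made up; note that "B" only skips weekends, so a weekday holiday still shows up as a missing value:

```python
import pandas as pd

# Hypothetical daily closing prices; Monday 2022-09-05 is absent (holiday).
prices = pd.Series(
    [100.0, 101.5, 99.8, 100.2],
    index=pd.to_datetime(
        ["2022-09-01", "2022-09-02", "2022-09-06", "2022-09-07"]
    ),
)

# Conform to business days: the weekend is skipped, the holiday becomes NaN.
on_business_days = prices.asfreq("B")

print(on_business_days)
```

This may be why "B" is not enough for some of the data discussed in this thread: any weekday without observations still has to be filled or handled in another way.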

@gabrielgcbs

Oh, I actually wasn't aware of that frequency type. That might solve my problem, thank you so much!

@temiwale88

@gabrielgcbs, can we chat? I'm trying to do the same! Probably a "fool's errand" to many, but it seems interesting.

@optionsraghu

Hi @hrzn @rmk17, has this issue been taken care of in the latest Darts? I am also finding it very difficult to read a dataframe that has natural gaps into a TimeSeries object. Imputation on weekends does not make business sense. Is there a way we can tell TimeSeries to ignore the gaps? Pandas is able to look past the gaps, so I am sure Darts can too? Using 'B' as the business day frequency does not help either, BTW. Many thanks for looking.

@hrzn
Contributor

hrzn commented Jan 4, 2023

@optionsraghu You can represent gaps or missing values in Darts: just fill missing dates with NaNs. However, that won't take you far, because handling gaps or missing dates is actually a modelling problem. How do you want a model to capture the fact that certain values are missing? There are several possible answers, which depend on what a missing value means and on how you want to capture it with a model. Currently in Darts all forecasting models assume that values are present (i.e. not NaN) for all dates, so feeding them series containing NaNs will result in NaNs in the output. In the future, some models might be developed and integrated in Darts that explicitly account for and model the phenomenon of missing values on certain dates, and that are also able to predict NaN values, but that is not the case today, and it would probably not change our requirements on TimeSeries.

In pandas, you can build a DataFrame with missing dates, but it won't be time-indexed and will not represent a time series. A TimeSeries must have a set frequency (or be integer-indexed) and contain all dates matching that frequency. This is obviously needed, e.g., to be able to generate future dates when forecasting.
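The point about generating future dates can be illustrated with pandas alone. In this sketch (made-up dates), an index with a frequency can be extended into the future, while an irregular index carries no rule for what comes next:

```python
import pandas as pd

# A regular business-day index knows its own frequency.
index = pd.date_range("2023-01-02", periods=5, freq="B")

# Future dates follow directly from the last timestamp and the frequency.
future = pd.date_range(index[-1] + index.freq, periods=3, freq=index.freq)

# An index with arbitrary gaps has no frequency to extrapolate from.
irregular = pd.DatetimeIndex(["2023-01-02", "2023-01-03", "2023-01-05"])

print(list(future.strftime("%Y-%m-%d")))
print(irregular.freq)
```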

Finally, note that you can still decide to see each of the series between the gaps as separate TimeSeries (see extract_subseries() for this) and train any global model on all these complete series jointly.
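The idea of treating each gap-free stretch as its own series can be sketched in pandas as follows. This is not Darts' `extract_subseries()` implementation; the helper `split_at_gaps` is hypothetical and only shows where the cuts would fall:

```python
import numpy as np
import pandas as pd

def split_at_gaps(df, step):
    """Split df wherever consecutive timestamps are more than `step` apart
    (hypothetical helper for illustration)."""
    starts_new_segment = df.index.to_series().diff() > step
    segment_id = starts_new_segment.cumsum().to_numpy()
    return [segment for _, segment in df.groupby(segment_id)]

# Made-up data: hourly values on business days only, for two weeks.
days = pd.bdate_range("2021-03-01", periods=10)
index = pd.DatetimeIndex(
    [d + pd.Timedelta(hours=h) for d in days for h in range(24)]
)
data = pd.DataFrame({"load": np.arange(len(index), dtype=float)}, index=index)

segments = split_at_gaps(data, pd.Timedelta(hours=1))
print([(s.index[0], s.index[-1], len(s)) for s in segments])
```

Each segment is a complete hourly series of its own (here one per working week), which is the shape needed to train a global model on several series jointly.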

@optionsraghu

@hrzn Many thanks for your kind help. As I mentioned, I was able to overcome this issue by simply converting the pandas time index into plain integers (a RangeIndex) and then feeding the data into Darts. That should solve the issue for the time being, as all I need is to keep the causal order intact when training the model. I presume here that a RangeIndex is sufficient for Darts to slice the appropriate training data and label data in the learning process. In that context, is there a way I can look at the individual training and label data points, just to confirm my assumption? Not a big deal if not. You guys have done an excellent job of abstracting away so many nuances. Being at the intersection of stats, domain knowledge, and coding (a jack of many trades, in other words!), I find your package extremely useful. The documentation could be a little more explanatory, but I understand it is a work in progress. Kudos to your team.
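The integer-index workaround described above can be sketched in pandas like this, on made-up data. Keeping the original timestamps aside makes it possible to map results back to real dates afterwards:

```python
import numpy as np
import pandas as pd

# Made-up data: hourly values on business days only, for two weeks.
days = pd.bdate_range("2021-03-01", periods=10)
timestamps = pd.DatetimeIndex(
    [d + pd.Timedelta(hours=h) for d in days for h in range(24)]
)
data = pd.DataFrame(
    {"load": np.arange(len(timestamps), dtype=float)}, index=timestamps
)

# Replace the gappy time index by consecutive integers (a RangeIndex).
int_indexed = data.reset_index(drop=True)

# ... build an integer-indexed series from int_indexed and fit a model ...

# Map the rows back to their real timestamps afterwards.
restored = int_indexed.copy()
restored.index = timestamps

print(type(int_indexed.index).__name__, restored.index[-1])
```

With an integer index the order of observations is preserved, but lags count observations rather than clock time: a lag of 24 on a Monday points at the previous Friday, not at Sunday.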

@hrzn
Contributor

hrzn commented Jan 4, 2023

Thanks for the kind words :)

8 participants