gaps in df #1433

optionsraghu · 2022-12-17T17:53:02Z

    Hi @hrzn @rmk17, has this issue been taken care of in the latest Darts, please? I am also finding it very difficult to read a dataframe that has natural gaps into a TimeSeries object. Imputation on weekends does not make business sense. Is there a way we can tell TimeSeries to ignore the gaps? Pandas is able to look past the gaps, so I am sure Darts can too? Using 'B' as the business-day frequency also does not help, BTW. Many thanks for looking.

Originally posted by @optionsraghu in #284 (comment)


optionsraghu · 2022-12-17T17:54:02Z

Hi @hrzn @rmk17, has this issue been taken care of in the latest Darts, please? I am also finding it very difficult to read a dataframe that has natural gaps into a TimeSeries object. Imputation on weekends does not make business sense. Is there a way we can tell TimeSeries to ignore the gaps? Pandas is able to look past the gaps, so I am sure Darts can too? Using 'B' as the business-day frequency also does not help, BTW. Many thanks for looking. Sorry for re-posting.

alexcolpitts96 · 2022-12-19T05:29:04Z

I am not one of the people you tagged, but I have used workarounds for similar problems.

As @hrzn mentioned in #284, darts doesn't support gaps in observations. However, it is a relatively easy problem to approach. Darts is really strong once you have continuous (or MOSTLY continuous) data, but doing these transformations in darts is tricky, so I suggest doing them in pandas.

Consider your pandas DataFrame with the observations and gaps. I am not sure whether you are looking at data from multiple entities, so I will explain as if you were. Group your dataframe by week and entity (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html). Convert the grouped dataframe into a list of dataframes. Now feed each new sub-dataframe into darts. If you need meta-data or static covariates, I suggest adding them before doing the grouping. If you do not, you will lose track of which entity a sub-dataframe (and thus sub-TimeSeries) belongs to.
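The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual code: the column names `time`, `entity`, and `value` and the sample data are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data: two entities observed on weekdays only, over two weeks.
dates = pd.bdate_range("2023-01-02", "2023-01-13")  # 10 business days
df = pd.DataFrame({
    "time": dates.tolist() * 2,
    "entity": ["A"] * len(dates) + ["B"] * len(dates),
    "value": list(range(2 * len(dates))),
})

# Group by entity and by calendar week, so that each group is a
# gap-free run of weekdays (the weekend falls between groups).
grouped = df.groupby(["entity", pd.Grouper(key="time", freq="W")])

# Keep the entity label next to each sub-dataframe so it is not lost.
chunks = [
    (entity, chunk.reset_index(drop=True))
    for (entity, _week_end), chunk in grouped
]

for entity, chunk in chunks:
    print(entity, chunk["time"].min().date(), chunk["time"].max().date(), len(chunk))
```

Each `chunk` could then be loaded separately, for example with `TimeSeries.from_dataframe(chunk, time_col="time", value_cols="value")`, giving one short series per entity and week.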

dennisbader · 2022-12-19T08:22:50Z

Hi @optionsraghu, have a look at #1420 and #1418, that might help solve this issue.

optionsraghu · 2022-12-19T17:29:15Z

Thank you @dennisbader . I will check them out.

optionsraghu · 2022-12-19T20:35:52Z

Hi @dennisbader, unfortunately my case seems more complicated than your solution covers: the dataset appears to contain custom holidays that I am unaware of, and there are thousands of rows. The workaround I am using is to sort the df by time with pandas, remove the date/time column, and then read the values into a series for modeling. That way there is no conflict between the frequency and the gaps in the data. I hope this does not change the modeling behaviour, since the df is sorted anyway...

hrzn · 2023-01-04T13:36:49Z

I've made a comment on #284 to justify why TimeSeries does not allow missing dates, hopefully that should help clarify things somewhat.

@optionsraghu you can try to do as you suggest and remove the date column from your DF before loading it as an (integer-indexed) TimeSeries. However, I suspect this could cause you issues afterwards. How will you know which future dates a given value corresponds to? Having the values sorted is not enough; the input must also have no missing dates. If your missing dates are sparse, you can try loading the TimeSeries specifying fill_missing_dates=True (which will place NaNs for missing dates), and then use fill_missing_values(). That assumes your missing values can be interpolated, though. Again, dealing with missing values is a modelling problem, not (only) a data representation problem :)
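To make the two steps concrete, here is a pandas-only sketch of the same idea: reindex onto a regular business-day frequency so the missing date becomes a NaN, then interpolate. It is an analogue of what `fill_missing_dates=True` and `fill_missing_values()` are described as doing above, not the Darts implementation itself; the dates and values are made up.

```python
import pandas as pd

# Hypothetical business-day data with one missing day (Monday 2023-01-09).
idx = pd.to_datetime([
    "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06",
    "2023-01-10", "2023-01-11",
])
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], index=idx)

# Step 1: put the data on a regular business-day grid.
# The missing Monday appears as a NaN row.
with_nans = s.asfreq("B")

# Step 2: fill the NaN by linear interpolation between its neighbours.
filled = with_nans.interpolate()

print(with_nans)
print(filled.loc[pd.Timestamp("2023-01-09")])  # 5.5
```

Whether 5.5 is a sensible value for the missing day depends on the data, which is the modelling question raised above.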

optionsraghu · 2023-01-04T18:49:29Z

@hrzn I am not sure I understand the question about the future dates. A lot of real-world data has breaks and missing values. Employee attendance, car park traffic, etc. all have weekend breaks. While the date is an interesting and important attribute, what matters is the order (causality) of the observations across the various features. Interpolation and imputation can skew the model in these cases.

hrzn · 2023-01-04T19:28:28Z

Maybe what I mean is best explained by an example. Assume you have the following date/value pairs corresponding to a business day frequency, with one missing entry (the second Monday), with some arbitrary values:

Mon 1.0
Tue 2.0
Wed 3.0
Thu 4.0
Fri 5.0
Tue 6.0
Wed 7.0

You basically have two choices now.

Go with datetimes, use a business day frequency (or something custom - see here) and build the following TimeSeries:

Mon 1.0
Tue 2.0
Wed 3.0
Thu 4.0
Fri 5.0
Mon NaN
Tue 6.0
Wed 7.0

The second Monday has been filled with a NaN, which will cause an issue with all forecasting models currently in Darts. A model handling NaNs would have to be a model specially designed not just for forecasting, but also to handle missing observations.
One way to get rid of this, which can sometimes be very reasonable for some problems, is to run fill_missing_values() to interpolate missing values (and e.g. assign here something like Mon 5.5).
Another way is to replace missing values by "placeholder" values, potentially even of higher dimensionality (in multivariate series). This could allow models having a strong enough representation power to potentially learn about the meaning of these missing values.

Go with integer indices (RangeIndex), and build the following TimeSeries:

0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0

Of course this way of doing things removes missing-value concerns when building the TimeSeries representation. And if the actual dates don't matter (only the order/causality of the values), then it's a perfectly reasonable way to go. But I feel this case where strictly only the order matters is rare in practice, and almost always the relative distance between values matters as well. To see this, imagine that you train a forecasting model on this series. It means that in training, the Tuesdays will sometimes be 5 time steps apart (when there's no hole), and sometimes 4 (when there's a hole, as in our example). So this approach works whenever the value to forecast is really not attached to the date, but only to the previous non-missing values.
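The "5 steps versus 4 steps" point can be checked with a small sketch. The weekday labels below are the ones from the example in this comment; the helper name `tuesday_spacing` is made up for illustration.

```python
import pandas as pd

def tuesday_spacing(day_labels):
    """Distance, in integer-index steps, between the first two Tuesdays."""
    positions = [i for i, day in enumerate(day_labels) if day == "Tue"]
    return positions[1] - positions[0]

# The example series: the second Monday is missing.
days_with_hole = ["Mon", "Tue", "Wed", "Thu", "Fri", "Tue", "Wed"]
# The same period with no missing day.
days_complete = ["Mon", "Tue", "Wed", "Thu", "Fri", "Mon", "Tue", "Wed"]

# Integer-indexed representation: the dates are dropped, only order remains.
series = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    index=pd.RangeIndex(len(days_with_hole)),
)

print(tuesday_spacing(days_with_hole))  # 4
print(tuesday_spacing(days_complete))   # 5
```

A model trained on the integer-indexed series therefore sees a weekly pattern whose period shifts whenever a day is missing.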

Furthermore, the point I was making about the future date is this. Assume that you forecast the 3 next values of this series. You'll get something like this:

7 8.0
8 9.0
9 10.0

How will you know what date 8 corresponds to, if you assume that your original series can have arbitrarily missing dates to begin with? Is it Friday? Or maybe Saturday, if the Friday was missing. I hope you see the point.
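As a rough illustration of the point about future dates: if the calendar is known to be regular (here, assumed to be plain business days), forecast indices can be mapped back to dates mechanically. The last observed date below is a made-up value matching the example.

```python
import pandas as pd

# Hypothetical: the last observation (integer index 6) was Wednesday 2023-01-11.
last_observed = pd.Timestamp("2023-01-11")
forecast_index = [7, 8, 9]

# Assumption: every future business day will be present, with no gaps.
future_dates = pd.bdate_range(
    start=last_observed + pd.offsets.BDay(1),
    periods=len(forecast_index),
)

index_to_date = dict(zip(forecast_index, future_dates))
for i, date in index_to_date.items():
    print(i, date.date(), date.day_name())
# Under this assumption, index 8 is Friday 2023-01-13.
```

If dates can be missing arbitrarily, this mapping no longer holds: index 8 could be Friday, or a later day if Friday is skipped, and nothing in the integer-indexed series says which.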

optionsraghu · 2023-01-05T10:01:55Z

@hrzn I agree with you on the date requirements. I think it also depends on the business case at hand. For critical processes, yes, it is needed. Thanks.

optionsraghu closed this as completed Jan 5, 2023
