gaps in df #1433

Closed
optionsraghu opened this issue Dec 17, 2022 · 9 comments
Comments

@optionsraghu

    Hi @hrzn @rmk17, has this issue been taken care of in the latest Darts, please? I am also finding it very difficult to read a dataframe that has natural gaps into a TimeSeries object. Imputation on weekends does not make business sense. Is there a way we can tell TimeSeries to ignore the gaps? Pandas is able to look past the gaps, so I am sure Darts can too? Using 'B' as the business-day frequency does not help either, BTW. Many thanks for looking.

Originally posted by @optionsraghu in #284 (comment)

@optionsraghu changed the title from the full text of the original comment to "gaps in df" on Dec 17, 2022
@optionsraghu
Author

Hi @hrzn @rmk17, has this issue been taken care of in the latest Darts, please? I am also finding it very difficult to read a dataframe that has natural gaps into a TimeSeries object. Imputation on weekends does not make business sense. Is there a way we can tell TimeSeries to ignore the gaps? Pandas is able to look past the gaps, so I am sure Darts can too? Using 'B' as the business-day frequency does not help either, BTW. Many thanks for looking. Sorry for re-posting.

@alexcolpitts96
Contributor

I am not one of the people you tagged, but I have used some similar workarounds.

As @hrzn mentioned in #284, Darts doesn't support gaps in observations. However, it is a relatively easy problem to approach. Darts is really strong once you have continuous (or MOSTLY continuous) data, but doing these transformations in Darts is tricky, so I suggest you do them in pandas.

Consider your pandas DataFrame with the observations and gaps. I am not sure if you are looking at data from multiple entities, so I will explain as if you were. Group your dataframe by week and entity (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html). Convert the grouped dataframe into a list of dataframes. Now feed each new sub-dataframe into Darts. If you need metadata or static covariates, I suggest you add them before grouping. If you do not, you will lose track of which entity a sub-dataframe (and thus sub-TimeSeries) belongs to.
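
A minimal pandas sketch of the grouping step described above. The column names `date`, `entity`, and `value` and the toy data are assumptions made up for illustration; each resulting sub-dataframe could then be loaded into Darts as its own TimeSeries.

```python
import pandas as pd

# Hypothetical data: two entities observed on business days over two weeks.
dates = pd.bdate_range("2023-01-02", periods=10)  # Mon 2 Jan .. Fri 13 Jan
df = pd.DataFrame({
    "date": list(dates) * 2,
    "entity": ["A"] * 10 + ["B"] * 10,
    "value": list(range(20)),
})

# Group by entity and by calendar week, so each group is one gap-free run.
grouped = df.groupby(["entity", pd.Grouper(key="date", freq="W")])

# Turn the groups into a list of dataframes. The entity column stays inside
# each piece, so it remains clear which entity a sub-dataframe belongs to.
sub_frames = [g.reset_index(drop=True) for _, g in grouped if not g.empty]
```

Splitting by week only yields gap-free pieces when the gaps are weekends; irregular holidays inside a week would still need separate handling.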

@dennisbader
Collaborator

Hi @optionsraghu, have a look at #1420 and #1418, that might help solve this issue.

@optionsraghu
Author

Thank you @dennisbader . I will check them out.

@optionsraghu
Author

Hi @dennisbader, unfortunately my case seems more complicated than your solution covers: the dataset appears to contain custom holidays that I am unaware of, and there are thousands of rows. The workaround I am using is to sort the df by time in pandas, remove the date/time column, and then read it into a series for modelling. That way there is no conflict between the frequency and the gaps in the data. I hope this does not change the modelling behaviour, as the df is sorted anyway...

@hrzn
Contributor

hrzn commented Jan 4, 2023

I've made a comment on #284 to justify why TimeSeries does not allow missing dates, hopefully that should help clarify things somewhat.

@optionsraghu you can try to do as you suggest and remove the date column from your DF before loading it as an (integer-indexed) TimeSeries. However, I suspect this could cause you issues afterwards. How will you know which future dates a given value corresponds to? Having the values sorted is not enough; you also need the input to have no missing dates. If your missing dates are sparse, you can try loading the TimeSeries with fill_missing_dates=True (which will place NaNs at the missing dates), and then use fill_missing_values(). That assumes your missing values can be interpolated, though. Again, dealing with missing values is a modelling problem, not (only) a data representation problem :)

@optionsraghu
Author

@hrzn I am not sure I understand the question about the future dates. A lot of real-world data has breaks and missing values. Employee attendance, car park traffic, etc. all have weekend breaks. While the date is an interesting and important attribute, what matters is the causality of the various features in their order. Interpolation and imputation can skew the model in these cases.

@hrzn
Contributor

hrzn commented Jan 4, 2023

Maybe what I mean is best explained by an example. Assume you have the following date/value pairs corresponding to a business day frequency, with one missing entry (the second Monday), with some arbitrary values:

Mon 1.0
Tues 2.0
Wed 3.0
Thu 4.0
Fri 5.0
Tue 6.0
Wed 7.0

You basically have two choices now.

  1. Go with datetimes, use a business day frequency (or something custom - see here) and build the following TimeSeries
Mon 1.0
Tues 2.0
Wed 3.0
Thu 4.0
Fri 5.0
Mon NaN
Tue 6.0
Wed 7.0

The second Monday has been filled with a NaN, which will cause an issue with all forecasting models currently in Darts. A model handling NaNs would have to be a model specially designed not just for forecasting, but also to handle missing observations.
One way to get rid of this, which can sometimes be very reasonable for some problems, is to run fill_missing_values() to interpolate missing values (and e.g. assign here something like Mon 5.5).
Another way is to replace missing values by "placeholder" values, potentially even of higher dimensionality (in multivariate series). This could allow models having a strong enough representation power to potentially learn about the meaning of these missing values.
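
To make option 1 concrete, here is a pandas-only sketch of the same idea (it is an analogue of the Darts steps, not the Darts API itself). The calendar dates are assumptions: Mon 2 Jan 2023 stands in for the first Monday, so Mon 9 Jan 2023 is the missing one. The `is_missing` column name and the placeholder value 0.0 are likewise illustrative choices.

```python
import pandas as pd

# The seven observed values from the example, with assumed dates.
dates = pd.to_datetime([
    "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06",
    "2023-01-10", "2023-01-11",
])
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], index=dates)

# Reindex onto a full business-day calendar: the missing Monday becomes NaN.
full_index = pd.bdate_range(s.index.min(), s.index.max())
with_nan = s.reindex(full_index)

# Way A: linear interpolation between the neighbouring values.
interpolated = with_nan.interpolate(method="linear")

# Way B: a placeholder value plus an extra indicator dimension that marks
# where the original observation was missing.
placeholder = pd.DataFrame({
    "value": with_nan.fillna(0.0),
    "is_missing": with_nan.isna().astype(float),
})
```

Here the interpolated Monday sits halfway between Friday's 5.0 and Tuesday's 6.0, matching the 5.5 mentioned above.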

  2. Go with integer indices (RangeIndex), and build the following TimeSeries
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0

Of course, this way of doing things removes missing-value concerns when building the TimeSeries representation. And if the actual dates don't matter (only the order/causality of the values), then it's a perfectly reasonable way to go. But I feel this case where strictly only the order matters is rare in practice, and almost always the relative distance between values matters as well. To see this, imagine that you train a forecasting model on this series. It means that during training the Tuesdays will sometimes be 5 time steps apart (when there's no hole), and sometimes 4 (when there's a hole, as in our example). So this approach works whenever the value to forecast is really not attached to the date, but only to the previous non-missing values.
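
The uneven spacing can be checked with a few lines of pandas. This is only an illustrative sketch: the dates are assumed (Mon 2 Jan 2023 as the first Monday) and `tuesday_spacing` is a hypothetical helper, not part of Darts or pandas.

```python
import pandas as pd

# Observed dates from the example; Mon 9 Jan 2023 is the hole.
observed = pd.to_datetime([
    "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06",
    "2023-01-10", "2023-01-11",
])

def tuesday_spacing(index):
    """Number of integer steps between the first two Tuesdays in `index`."""
    positions = [i for i, d in enumerate(index) if d.day_name() == "Tuesday"]
    return positions[1] - positions[0]

# Integer positions after dropping the dates (the RangeIndex approach).
spacing_with_hole = tuesday_spacing(observed)

# Integer positions on the complete business-day calendar.
spacing_no_hole = tuesday_spacing(pd.bdate_range(observed[0], observed[-1]))
```

With the hole, the two Tuesdays land at positions 1 and 5; on the full calendar they land at positions 1 and 6.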

Furthermore, the point I was making about the future date is this. Assume that you forecast the 3 next values of this series. You'll get something like this:

7 8.0
8 9.0
9 10.0

How will you know what date 8 corresponds to, if you assume that your original series can have arbitrarily missing dates to begin with? Is it Friday? Or maybe Saturday, if the Friday was missing. I hope you see the point.
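
A small pandas sketch of the contrast, assuming the last observed Wednesday is 11 Jan 2023 and a 3-step horizon (both are illustrative assumptions). With a business-day frequency the forecast dates follow from the last date; with an integer index only positions are available.

```python
import pandas as pd

last_observed = pd.Timestamp("2023-01-11")  # assumed date of the final Wednesday
horizon = 3

# Datetime index with a business-day frequency: the future dates are determined.
future_dates = pd.bdate_range(last_observed + pd.offsets.BDay(1), periods=horizon)
future_weekdays = list(future_dates.day_name())

# Integer index: the forecast only carries positions 7, 8, 9, and nothing in
# the series says which calendar dates those positions correspond to.
future_positions = list(pd.RangeIndex(start=7, stop=7 + horizon))
```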

@optionsraghu
Author

@hrzn I agree with you on the date requirements. I think it also depends on the business case at hand. For critical processes, yes, it is needed. Thanks.
