Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Features/indexing #150

Merged
merged 22 commits into from
Aug 3, 2020
Merged

Features/indexing #150

merged 22 commits into from
Aug 3, 2020

Conversation

grll
Copy link
Contributor

@grll grll commented Jul 16, 2020

Fixes #DARTS-95.

Summary

Add better indexing support and column names to TimeSeries.

Other Information

@grll grll requested review from hrzn and TheMP as code owners July 16, 2020 08:53
darts/timeseries.py Outdated Show resolved Hide resolved
darts/timeseries.py Outdated Show resolved Hide resolved
darts/timeseries.py Outdated Show resolved Hide resolved
darts/timeseries.py Outdated Show resolved Hide resolved
grll and others added 2 commits July 16, 2020 15:54
Co-authored-by: Julien Herzen <julien.herzen@unit8.co>
columns = [columns]
raise_if_not(len(columns) == df.shape[1], "`{}` columns specified when `{}` was expected from `df` "
"shape".format(len(columns), df.shape[1]))
df.columns = columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in line 64 we are changing the DataFrame provided by the user. I think we should not touch it, but rather do everything on the "internal" dataframe df_, which is a copy of df created on line 74.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah indeed I agree

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw relying on sort_index() to create a copy of the df might make it a bit unclear that it is a copy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes actually this made me notice a problem we currently have I think. In line 62 you change the columns of the original data frame right? This means that the original data frame that the user provides gets changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pennfranc You are right I modified this behavior in the latest commits, my question was more about using ??
sort_index() to return a new DF I was personally not sure it was returning a copy or a view in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point, I think a comment clarifying at which point we are creating a copy of the df might be nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for a comment to make it clear. An option could also be to split it in using both df.copy() first, and then df_.sort_index(inplace=True).

darts/timeseries.py Outdated Show resolved Hide resolved
darts/timeseries.py Show resolved Hide resolved
darts/timeseries.py Outdated Show resolved Hide resolved
darts/timeseries.py Outdated Show resolved Hide resolved
darts/timeseries.py Show resolved Hide resolved
@@ -541,7 +567,8 @@ def from_dataframe(df: pd.DataFrame,
def from_times_and_values(times: pd.DatetimeIndex,
values: np.ndarray,
freq: Optional[str] = None,
fill_missing_dates: Optional[bool] = True) -> 'TimeSeries':
fill_missing_dates: Optional[bool] = True,
column: Optional[str] = None,) -> 'TimeSeries':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This setup with the column parameter assumes that we are creating a univariate time series, right? But technically we can also use this function to create a multivariate time series by passing a 2-
dimensional numpy array as values. So we can just change the data type to also allow for a list of strings to support multivariate time series creation, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right I would have to check that everything works with 2d array

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll have to use an argument similar to the init method: columns: Optional[Union[List[str], str]].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to rely more on pandas by accepting a pd._typing.Axes as columns parameter type. It's up to the user to come with either a list of columns or a pandas index...

@@ -21,7 +21,8 @@ class TimeSeries:
def __init__(self,
df: pd.DataFrame,
freq: Optional[str] = None,
fill_missing_dates: Optional[bool] = True):
fill_missing_dates: Optional[bool] = True,
columns: Optional[Union[List[str], str]] = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not convinced that we need this argument to the init method. I think it's enough to simply re-use the column names of the dataframe, which would simplify everything. For instance the checks that column names are unique and of the correct size etc; Pandas can take care of that, and removing it for darts would improve separation of concerns IMO. It would still be good to keep this argument in from_times_and_values() method though, in order to build a DF with correct column names before building the TimeSeries.
WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grll what was teh final decision regarding this comment from @hrzn ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that separation of concerns is good but we still need to enforce a few more things that is not guaranteed by pd.DataFrame such as uniqueness and str format. I created a private method in the latest commit to do that as it is used in a couple of places.

@grll grll merged commit 0d9b394 into develop Aug 3, 2020
@LeoTafti LeoTafti deleted the features/indexing branch October 15, 2020 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants