-
Notifications
You must be signed in to change notification settings - Fork 880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Features/indexing #150
Features/indexing #150
Conversation
Co-authored-by: Julien Herzen <julien.herzen@unit8.co>
darts/timeseries.py
Outdated
columns = [columns] | ||
raise_if_not(len(columns) == df.shape[1], "`{}` columns specified when `{}` was expected from `df` " | ||
"shape".format(len(columns), df.shape[1])) | ||
df.columns = columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and in line 64 we are changing the DataFrame provided by the user. I think we should not touch it, but rather do everything on the "internal" dataframe df_
, which is a copy of df
created on line 74.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah indeed I agree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw relying on sort_index() to create a copy of the df might make it a bit unclear that it is a copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opinions @TheMP @pennfranc @Droxef ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes actually this made me notice a problem we currently have I think. In line 62 you change the columns of the original data frame right? This means that the original data frame that the user provides gets changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pennfranc You are right I modified this behavior in the latest commits, my question was more about using ??
sort_index()
to return a new DF I was personally not sure it was returning a copy or a view in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good point, I think a comment clarifying at which point we are creating a copy of the df might be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for a comment to make it clear. An option could also be to split it in using both df.copy()
first, and then df_.sort_index(inplace=True)
.
darts/timeseries.py
Outdated
@@ -541,7 +567,8 @@ def from_dataframe(df: pd.DataFrame, | |||
def from_times_and_values(times: pd.DatetimeIndex, | |||
values: np.ndarray, | |||
freq: Optional[str] = None, | |||
fill_missing_dates: Optional[bool] = True) -> 'TimeSeries': | |||
fill_missing_dates: Optional[bool] = True, | |||
column: Optional[str] = None,) -> 'TimeSeries': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This setup with the column
parameter assumes that we are creating a univariate time series, right? But technically we can also use this function to create a multivariate time series by passing a 2-
dimensional numpy array as values
. So we can just change the data type to also allow for a list of strings to support multivariate time series creation, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right I would have to check that everything works with 2d array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll have to use an argument similar to the init
method: columns: Optional[Union[List[str], str]]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to rely more on pandas by accepting a pd._typing.Axes
as columns parameter type. It's up to the user to come with either a list of columns or a pandas index...
darts/timeseries.py
Outdated
@@ -21,7 +21,8 @@ class TimeSeries: | |||
def __init__(self, | |||
df: pd.DataFrame, | |||
freq: Optional[str] = None, | |||
fill_missing_dates: Optional[bool] = True): | |||
fill_missing_dates: Optional[bool] = True, | |||
columns: Optional[Union[List[str], str]] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not convinced that we need this argument to the init method. I think it's enough to simply re-use the column names of the dataframe, which would simplify everything. For instance the checks that column names are unique and of the correct size etc; Pandas can take care of that, and removing it for darts would improve separation of concerns IMO. It would still be good to keep this argument in from_times_and_values()
method though, in order to build a DF with correct column names before building the TimeSeries.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that separation of concerns is good but we still need to enforce a few more things that is not guaranteed by pd.DataFrame such as uniqueness and str format. I created a private method in the latest commit to do that as it is used in a couple of places.
Fixes #DARTS-95.
Summary
Add better indexing support and column names to TimeSeries.
Other Information