Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/lagged features names #1679

Merged
merged 19 commits into from
Apr 11, 2023
Merged

Feat/lagged features names #1679

merged 19 commits into from
Apr 11, 2023

Conversation

madtoinou
Copy link
Collaborator

Fixes #1670.

Summary

Added helper functions in the tabularization and regression model in order to generate labels for the lagged features (and static covariates if applicable), which are stored in the RegressionModel.lagged_features_name_ attribute (List[List[str]). This enable the usage of the feature_importances attribute, available for some sklearn models. It was not possible before because the create_lagged_data method returns arrays and lose the name of the columns.

If the model was fit on a single TimeSeries, the attribute contains only one List[str], if trained on a Sequence of TimeSeries, the attribute contains several List[str].

The Lists containing the lagged features names are always nested, the API could eventually be simplified:

  • if the components names are identical across all the TimeSeries used during training, retain the first one only
  • wrapping the access to this attribute in a function, asking the user to provide a series (and mapping the training TimeSeries to their lagged features names)

Additional information

Added the corresponding tests.

madtoinou and others added 2 commits April 5, 2023 16:02
@madtoinou
Copy link
Collaborator Author

Updated the PR:

  • if argument is a single TimeSeries, its components' name are used
  • if argument is a Sequence[TimeSeries] and all the ts have exactly identical components names, these names are used
  • if argument is a Sequence[TimeSeries] and any of the ts contains a component with a different name, generic names are generated ('target', 'past_cov', 'future_cov').

Each "variate" (target, past_covariates and future_covariates) is processed independently: it's possible to have a mixture of generic names and "original names".

@codecov-commenter
Copy link

codecov-commenter commented Apr 5, 2023

Codecov Report

Patch coverage: 97.59% and project coverage change: -0.06 ⚠️

Comparison is base (ebb9eb6) 94.11% compared to head (8d2a03e) 94.05%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1679      +/-   ##
==========================================
- Coverage   94.11%   94.05%   -0.06%     
==========================================
  Files         125      125              
  Lines       11447    11491      +44     
==========================================
+ Hits        10773    10808      +35     
- Misses        674      683       +9     
Impacted Files Coverage Δ
darts/models/forecasting/baselines.py 100.00% <ø> (ø)
darts/models/forecasting/forecasting_model.py 94.77% <87.50%> (+0.23%) ⬆️
darts/utils/data/tabularization.py 98.99% <95.45%> (-1.01%) ⬇️
darts/models/forecasting/ensemble_model.py 95.45% <100.00%> (+4.88%) ⬆️
...ts/models/forecasting/regression_ensemble_model.py 100.00% <100.00%> (ø)
darts/models/forecasting/regression_model.py 97.27% <100.00%> (+0.12%) ⬆️
darts/models/forecasting/theta.py 90.27% <100.00%> (+0.70%) ⬆️
...arts/models/forecasting/torch_forecasting_model.py 90.01% <100.00%> (-0.21%) ⬇️

... and 8 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@madtoinou
Copy link
Collaborator Author

Updated to use the same naming conventions as the explainability module: {comp_name}_lag{lag_idx} for unique component names, comp{i}_{variate_type}_lag{lag_idx} for generic names.

Copy link
Collaborator

@dennisbader dennisbader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks a lot @madtoinou 🚀

Ready for release! 🥳

@@ -527,6 +534,120 @@ def create_lagged_prediction_data(
return X, times


def create_lagged_components_names(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we handle the static covariates in here as well?

@@ -358,6 +361,32 @@ def _create_lagged_data(

return training_samples, training_labels

def _create_lagged_components_name(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this method and put everything into the helper function create_lagged_component_names

)

# adding the static covariates on the right of each features_cols_name
features_cols_name = self._add_static_covariates_name(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can move this into the helper function create_lagged_component_names

@@ -445,6 +474,41 @@ def _add_static_covariates(
features = features[0]
return features

def _add_static_covariates_name(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be part of create_lagged_component_names in my opinion

) -> Union[np.array, Sequence[np.array]]:
"""
Add static covariates names to the features name for RegressionModels.
Accounts for series with potentially different static covariates to accomodate for the maximum
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't the number of static covariates guaranteed to be identical? The models should throw an error when using series with different static covariate numbers, no?

@@ -527,6 +534,120 @@ def create_lagged_prediction_data(
return X, times


def create_lagged_components_names(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def create_lagged_components_names(
def create_lagged_component_names(

past_covariates=past_covariates,
future_covariates=future_covariates,
)
self.model.lagged_features_name_ = lagged_features_names
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a naming convention lagged_feature_names similar to feature_importances_ from sklearn.

Also shouldn't we store this in the Darts model, rather than the actual one?
Would also require to define it in the model constructor

Suggested change
self.model.lagged_features_name_ = lagged_features_names
self.lagged_feature_names_ = lagged_feature_names

Comment on lines 585 to 597
target_series = (
[target_series] if not isinstance(target_series, Sequence) else target_series
)
past_covariates = (
[past_covariates]
if not isinstance(past_covariates, Sequence)
else past_covariates
)
future_covariates = (
[future_covariates]
if not isinstance(future_covariates, Sequence)
else future_covariates
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
target_series = (
[target_series] if not isinstance(target_series, Sequence) else target_series
)
past_covariates = (
[past_covariates]
if not isinstance(past_covariates, Sequence)
else past_covariates
)
future_covariates = (
[future_covariates]
if not isinstance(future_covariates, Sequence)
else future_covariates
)
target_series = series2seq(target_series)
past_covariates = series2seq(past_covariates)
future_covariates = series2seq(future_covariates)

[lags, lags_past_covariates, lags_future_covariates],
["target", "past_cov", "future_cov"],
):
unique_components_names = set(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can directly skip this iteration if variates is None (simplifies also later on)

Suggested change
unique_components_names = set(
if variate is None:
continue
unique_components_names = set(

@dennisbader dennisbader merged commit 65861d1 into master Apr 11, 2023
@dennisbader dennisbader deleted the feat/lagged_features_names branch April 11, 2023 22:15
alexcolpitts96 pushed a commit to alexcolpitts96/darts that referenced this pull request May 31, 2023
* feat: create and store the lagged features names in the regression models

* feat: adding corresponding tests in tabularization

* fix: support any kind of Sequence to generate the lagged features name

* feat: verify that the number of lagged feature names matches the feature_importances in the relevant regression models

* fix: if any of the variate is a sequence of ts with different components names, create generic name for the corresponding variate, updated the tests

* fix: using the same naming convention for the lagged components as the explainability module

* refactor and fix some type hint warnings

* simplified lagged feature name generation and moved out of regression model

* fix regr model tests

* fix create lagged data tests

* fix small bug in unit test

* fix bug in unittest from last PR

---------

Co-authored-by: dennisbader <dennis.bader@gmx.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

I can't see the importances of my variables
3 participants