Refactorised tabularisation + Jupyter notebook w/ experiments. #1399

Merged
merged 34 commits into unit8co:master from refactor/tabularization on Jan 23, 2023

Conversation

mabilton
Contributor

Addresses #1308 - Speedup creation of lagged data.

Summary

Hi there.

I've started some preliminary work on refactoring /darts/utils/data/tabularization.py.

Here’s a basic outline of the changes I’ve made:

  1. I’ve split up the original _create_lagged_data function into two: one which creates the X and y arrays for training/validation (create_lagged_features_and_labels), and another that creates the X array for predicting (create_lagged_features). The rationale behind this change is twofold:
    • In my personal opinion, clearly dividing the logic of constructing the training/validation arrays from the logic of constructing the prediction arrays makes these functions more easily maintainable, testable, and understandable to users.
    • There are some computational savings when constructing the prediction X array, since the y array is not constructed at all (note that in the current implementation of _create_lagged_data, y is still assembled even when is_training=False).
  2. Instead of taking the union of all the specified series and then dropping nan columns, I first compute the intersection of all the dates found in each series and then lag the index of these common dates to construct the features/labels. The notable benefit of this approach is that it completely eliminates the for loops over the lag values (see the sketch after this list).
  3. Unlike _create_lagged_data, I've made the functions I've implemented explicitly public, in case users wish to implement their own algorithms which rely on assembling lagged values.
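
To make the above a bit more concrete, here is a rough, simplified sketch of what the two proposed functions and the time-intersection idea could look like. The function names match the ones described above, but the signatures and the `_shared_times` helper are simplified stand-ins rather than the actual implementation:

```python
# Simplified sketch (not the actual Darts code) of the proposed split, plus the
# vectorized gathering of lagged values described in point 2.
import numpy as np
import pandas as pd


def _shared_times(*indexes: pd.Index) -> pd.Index:
    # Temporal intersection of all of the provided time indexes.
    shared = indexes[0]
    for idx in indexes[1:]:
        shared = shared.intersection(idx)
    return shared


def create_lagged_features(target: pd.Series, lags: list) -> np.ndarray:
    # Prediction-time path: build X only; no labels are assembled.
    # Only times for which *all* requested (negative) lags exist give a feature row.
    feature_times = target.index[max(-min(lags), 0):]
    positions = target.index.get_indexer(feature_times)
    # Broadcasted gather: one fancy-indexing call replaces a Python loop over lags.
    return target.to_numpy()[positions[:, None] + np.asarray(lags)]


def create_lagged_features_and_labels(target: pd.Series, lags: list):
    # Training-time path: X is built exactly as above, and y is simply the
    # target value at each feature time (i.e. lag 0 of the target).
    X = create_lagged_features(target, lags)
    feature_times = target.index[max(-min(lags), 0):]
    y = target.loc[feature_times].to_numpy()
    return X, y
```

For example, with `pd.Series(range(5))` and `lags=[-2, -1]`, this would produce `X = [[0, 1], [1, 2], [2, 3]]` and `y = [2, 3, 4]`.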

Other Information

From my very informal experiments, I’ve observed a ~10-fold speed-up on 'simple' problems involving a couple of lag values, and a ~40-fold speed-up on 'larger' problems involving many lag values (i.e. more than 10). If you’d like to run these experiments yourself, feel free to check out the tabularization_experiments.ipynb notebook. For reference, these benchmarks were performed on a ~4 year old Dell XPS 15 laptop.
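
For anyone wanting a rough idea of how such timings can be taken outside of the notebook, a minimal sketch is below; `old_create_lagged_data` and `new_create_lagged_data` are placeholders for the original and refactored implementations, not real function names:

```python
# Minimal timing sketch; old_create_lagged_data / new_create_lagged_data are
# placeholders for the implementations being compared, not actual Darts names.
import time

import numpy as np
import pandas as pd


def best_time(fn, *args, repeats: int = 10) -> float:
    # Best-of-n wall-clock time (in seconds) for fn(*args).
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


target = pd.Series(
    np.random.randn(10_000),
    index=pd.date_range("2000-01-01", periods=10_000, freq="h"),
)
lags = list(range(-12, 0))  # a 'larger' problem with many lag values

# speedup = best_time(old_create_lagged_data, target, lags) / \
#           best_time(new_create_lagged_data, target, lags)
```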

In saying all this, what I've done is still very much a work in progress. Notably:

  1. I still need to write tests, although I do run 'correctness tests' in tabularization_experiments.ipynb over a very large number of input parameter combinations (10k+ combos), so I'm pretty confident my implementation here is correct.
  2. I still haven't quite got the prediction data generation working. More specifically, I'm a bit confused about the intended behaviour of _create_lagged_data when is_training=False. To see what I mean by this, please refer to the 'Understanding Behaviour of _create_lagged_data when is_training=False' section of tabularization_experiments.ipynb. For that example, why is X = [15., 32., 61.] and Ts = [6]? In particular, why is the 61 value contributed by future_series to X only -1 lag away from the Ts = 6 value, rather than -3 lags away?
  3. I notice that _create_lagged_data has an outer for loop over all of the specified target_series. What's the precise reason for this? Is there any assumption we can make about the shapes of the timeseries it's iterating over? In particular, would it be possible to first concatenate all the specified timeseries together, create the lagged variables, and then np.split the result at the end?
  4. I believe that in special situations where past_series, future_series, and target_series are all sampled at the same frequency, using something like numpy.lib.stride_tricks.sliding_window_view would probably be even faster than what I've written here, so that might be worth looking at (see the sketch after this list).
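
On point 4, here is a small, self-contained illustration (assuming equal-frequency data and consecutive lags) of how `numpy.lib.stride_tricks.sliding_window_view` could be used to build lagged features; the alignment here is illustrative rather than a drop-in replacement for `_create_lagged_data`:

```python
# Illustrative sketch of the sliding-window idea for equal-frequency series.
import numpy as np

values = np.arange(10.0)   # toy target series sampled at a single frequency
max_lag = 3                # use lags -3, -2, -1

# windows[i] = values[i : i + max_lag]; the window ending just before time t
# provides the lagged features for predicting values[t].
windows = np.lib.stride_tricks.sliding_window_view(values, max_lag)
X = windows[:-1]            # features: values at t-3, t-2, t-1
y = values[max_lag:]        # labels: value at time t

assert X.shape == (len(values) - max_lag, max_lag)
assert np.allclose(X[0], values[0:3]) and y[0] == values[3]
```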

Any comments/feedback on what I've done here would be very welcome - thanks in advance for any help.

Cheers,
Matt.

@codecov-commenter

codecov-commenter commented Nov 29, 2022

Codecov Report

Base: 93.97% // Head: 94.06% // Increases project coverage by +0.08% 🎉

Coverage data is based on head (557d100) compared to base (5483e2f).
Patch coverage: 99.37% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1399      +/-   ##
==========================================
+ Coverage   93.97%   94.06%   +0.08%     
==========================================
  Files         122      122              
  Lines       10744    10960     +216     
==========================================
+ Hits        10097    10309     +212     
- Misses        647      651       +4     
| Impacted Files | Coverage Δ |
|---|---|
| darts/models/forecasting/regression_model.py | 97.06% <95.65%> (-0.26%) ⬇️ |
| darts/explainability/shap_explainer.py | 87.98% <100.00%> (+0.05%) ⬆️ |
| darts/utils/data/tabularization.py | 100.00% <100.00%> (ø) |
| darts/timeseries.py | 92.12% <0.00%> (-0.23%) ⬇️ |
| darts/ad/anomaly_model/filtering_am.py | 91.93% <0.00%> (-0.13%) ⬇️ |
| ...arts/models/forecasting/torch_forecasting_model.py | 89.52% <0.00%> (-0.05%) ⬇️ |
| darts/models/forecasting/block_rnn_model.py | 98.24% <0.00%> (-0.04%) ⬇️ |
| darts/models/forecasting/nhits.py | 99.27% <0.00%> (-0.01%) ⬇️ |
| darts/datasets/__init__.py | 100.00% <0.00%> (ø) |

@eliane-maalouf
Contributor

Thank you for this submission. We will look at it shortly and get back to you as soon as possible.

@mabilton
Contributor Author

mabilton commented Dec 2, 2022

Hey @eliane-maalouf , thanks for that : )

Contributor

@eliane-maalouf eliane-maalouf left a comment

I think this refactoring could be a nice improvement to the tabularization function.
My main propositions/comments would be the following:

  • I would recommend keeping the calls to the tabularization code under the function name _create_lagged_data and allowing the specification of whether the call is being made from fit() or from predict(). IMO this is less risky in terms of breaking code that depends on this function (at least the existing unit tests should still pass).
  • Currently, only fit() uses the tabularization code being refactored. predict() redoes a similar process (without calling _create_lagged_data) and allows adding previous predictions when the prediction horizon n > output_chunk_length. It might be nice to unify feature generation across the different functions; it would indeed make the code easier to maintain. As for the is_training argument, it is currently used in shap_explainer.py, which also needs to build features for inference.

@mabilton
Contributor Author

mabilton commented Dec 4, 2022

Hey @eliane-maalouf - thanks for your insightful comments : ). Just a few comments/questions from me:

  1. It seems to me that _create_lagged_data is being treated as a de-facto public method (i.e. other parts of the code base are treating _create_lagged_data as a function with a stable interface which is meant to be called outside of the file it is defined in). In that case, would it be worth just 'biting the bullet' and making create_lagged_data an explicitly public method, since:
    • This would clearly indicate to future users/contributors that create_lagged_data is meant to be called by other parts of the code base and, therefore, that changes made to that function may be breaking.
    • From my understanding, it's generally considered a 'code smell' if a private method/function is being explicitly tested/called outside of the file it's defined in; here's a pretty good discussion about this point on StackExchange.
    • Searching over the code base, it appears that only shap_explainer.py, catboost_model.py, gradient_boost_model.py, regression_model.py, and test_regression_models.py explicitly call _create_lagged_data, so I don't suspect it would be a huge job to change these references.
  2. I didn't realise that RegressionModel.predict was basically 're-implementing' prediction feature generation; when I have time, I'll look over this and get back to you.
  3. Fair point about 'unifying' training and prediction feature generation under a single function using the is_training argument. One option which gives us the 'best of both worlds' might be to have an additional helper function that calls either create_lagged_features_and_labels or create_lagged_features depending on the value of is_training (see the sketch below).
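
A minimal sketch of what that helper could look like (the wrapper name and keyword are hypothetical; only the two inner functions are the ones proposed in this PR):

```python
# Rough sketch of the 'best of both worlds' helper described in point 3.
# create_lagged_features_and_labels and create_lagged_features are the
# functions proposed in this PR, assumed to be defined/imported in this module.
def create_lagged_data(target_series, lags, is_training: bool = True, **kwargs):
    # Single entry point kept for existing callers (e.g. RegressionModel),
    # dispatching to the training or prediction path based on is_training.
    if is_training:
        return create_lagged_features_and_labels(target_series, lags, **kwargs)
    return create_lagged_features(target_series, lags, **kwargs)
```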

@hrzn hrzn added this to In review in darts via automation Dec 6, 2022
@eliane-maalouf
Contributor

Hello @mabilton, to follow up on this PR:

  • we propose to leave out the predict part for another PR; that way, this PR can remain limited in scope to the tabularization function. We are creating a separate issue for the unification of fit() and predict() feature generation.
  • Indeed, only a few classes use create_lagged_data. So maybe the best way forward is for you to supplement this PR with the changes using your functions and make sure that the current unit tests pass, so we can validate that the proposed changes do not break the current state of the library.
  • Concerning making create_lagged_data a public method, I think I will ask @hrzn for his opinion on this design choice.
    In the unit tests, if I am not mistaken, only the output of create_lagged_data is explicitly tested against an expected output, but the inner workings of the function are not explicitly tested. In the case of the modifications you are proposing, I see that the core change is in _find_intersecting_times, so explicitly testing the expected output of this function would still be useful, IMO. WDYT?

@hrzn
Contributor

hrzn commented Dec 8, 2022

Hi @mabilton and thanks a lot for this proposal! As far as I can tell it looks good and could be included in an upcoming release. We will need a bit more time to review thoroughly :) A few points already:

  • As proposed by @eliane-maalouf, I would be in favour of you dropping the old function in this PR and replacing it with yours :) This way we can truly assess what the change would look like and what the current unit tests tell us.
  • I agree with you that it probably makes more sense to have create_lagged_data public at this point. It's probably not something that we would expect users to call often, but we can expose it. You can make changes in RegressionModels to update the name like you propose.

Thanks again!

@mabilton
Contributor Author

Hi @eliane-maalouf, @hrzn - just a quick update from me.

I've managed to implement the 'sliding window' method for equal frequency series I mentioned in my initial PR; as expected, it tends to be 2 - 3 times faster than using the 'time intersection' method when all of the specified series are of the same frequency. In implementing this, I also refactored what I previously had. Notably:

  • I've split up what was _find_time_intersection into two new functions: get_feature_times, which returns the times in each series which could be used to create features, and get_shared_times, which just finds the temporal intersection of an arbitrary number of time indices / TimeSeries (a rough sketch of get_shared_times is shown after this list). Splitting up the logic of these two functions hopefully makes the code more understandable, and should make it easier to explicitly test behaviors.
  • I've added a check_inputs argument, so that classes that have already done input checks don't need to repeat these inside of create_lagged_data.
  • I've updated tabularization_experiments.ipynb to cover more test cases. As a summary, the greatest computational gains compared to _create_lagged_data are for unequal-frequency data when max_samples_per_ts is set to a small value (~200+ times faster). This is because, in its current form, _create_lagged_data computes all of the possible features and then takes the last max_samples_per_ts samples, whereas the newly implemented create_lagged_data avoids computing features it doesn't need. Also, just to be clear, I will remove tabularization_experiments.ipynb and _create_lagged_data from my PR once everything is ready - they're just there in the meantime for comparison purposes.
  • Where possible, I've factored in your comments on my previous code - thanks for the help : )
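
For reference, a simplified sketch of the kind of logic get_shared_times encapsulates is below; this is not the actual implementation, and the handling of TimeSeries-like objects via a `.time_index` attribute is illustrative:

```python
# Simplified sketch of a get_shared_times-style helper: the temporal
# intersection of any number of time indexes and/or TimeSeries-like objects
# exposing a .time_index attribute (illustrative, not the Darts implementation).
from functools import reduce

import pandas as pd


def get_shared_times(*series_or_indexes) -> pd.Index:
    indexes = [
        ts if isinstance(ts, pd.Index) else ts.time_index
        for ts in series_or_indexes
    ]
    return reduce(lambda a, b: a.intersection(b), indexes)


# Example: two hourly indexes overlapping between 06:00 and 12:00.
idx_a = pd.date_range("2000-01-01 00:00", "2000-01-01 12:00", freq="h")
idx_b = pd.date_range("2000-01-01 06:00", "2000-01-01 18:00", freq="h")
print(get_shared_times(idx_a, idx_b))  # 06:00 ... 12:00
```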

In its current form, create_lagged_data still doesn't fully comply with what's expected by other parts of the library, so tests won't currently pass. One difficulty I'm having is that I don't quite understand how _create_lagged_data is meant to behave when is_training = False. To understand what I mean by this, it would be appreciated if you could have a quick look over the 'Unexpected Behaviour of _create_lagged_data when is_training=False' section of tabularization_experiments.ipynb, where I go over a simple example in which _create_lagged_data shows some (perhaps?) unexpected behaviour when is_training = False. Once I understand what's supposed to happen when is_training = False, I should be ready to alter what I have to comply with the tests.

Hopefully what I've written here makes sense - let me know if you have any questions. Once again, thanks for all your help : ).

Cheers,
Matt.

@eliane-maalouf
Contributor

@mabilton thanks for the update, I will have a look next week.

@hrzn
Contributor

hrzn commented Dec 22, 2022

Thanks a lot @mabilton ! This looks really promising. Please bear with us (we are slow to review atm - busy preparing v0.23). I'm hopeful we can add your refactoring and improvements into v0.24.

@mabilton
Contributor Author

mabilton commented Dec 22, 2022

Hey @hrzn - no worries! All of this can definitely wait until after Christmas and the New Year; in any case, I'm currently in the process of writing up tests for what I have (and fixing bugs that I discover while doing this), so it's probably for the best that you hold off on reviewing what I have for the time being. Thanks for all your hard work @hrzn and @eliane-maalouf - have a Merry Christmas and Happy New Year : ).

Cheers,
Matt.

@hrzn
Contributor

hrzn commented Dec 22, 2022

Thank you @mabilton , happy new year to you too!

@mabilton
Contributor Author

Hey @hrzn, @eliane-maalouf - thank you both for all your comments. I'll try to work through them ASAP.

@mabilton
Contributor Author

Hey @hrzn, @eliane-maalouf - just letting you both know that I've read over all your very useful comments and that I'm in the process of implementing the suggested changes. In particular, I've had to make a few adjustments to the tabularization and testing code to account for the case where lags_future_covariates is positive. I've been a bit busy in my day-to-day life recently, so these changes may take a few more days to complete - apologies for the delay.

@mabilton
Contributor Author

Hey @hrzn, @eliane-maalouf . So (I think) I've managed to address pretty much all of your comments in my latest push - if I've missed something, please let me know and apologies in advance.

There are two outstanding issues I haven't really addressed yet:

  1. I don't think we've come to an agreement about where _add_static_covariates should be placed. In my personal opinion, it would be ideal if we could somehow refactor _add_static_covariates so that it doesn't need access to the underlying RegressionModel object. That way, we could think about adding static covariates to each block inside of create_lagged_data as soon as each block is formed (a rough sketch of this idea is shown after this list). What do you guys reckon?
  2. I still haven't had a chance to add an additional static covariates test, just to make sure that the actual values being appended are correct - I'll try to squeeze that in before this PR is merged if I find time.
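
To illustrate the idea in point 1, here's a rough sketch of how static covariates could be appended to a feature block without needing the RegressionModel object; the function name and shapes are purely illustrative:

```python
# Illustrative sketch only: append static covariate values to a single
# feature-matrix block. Names and shapes are hypothetical.
import numpy as np


def add_static_covariates_to_block(X_block: np.ndarray,
                                   static_covs: np.ndarray) -> np.ndarray:
    # X_block: (n_samples, n_lagged_features) block for a single series.
    # static_covs: (n_static_covariates,) values for that same series; they are
    # constant over time, so they are simply tiled across all rows.
    tiled = np.tile(static_covs, (X_block.shape[0], 1))
    return np.hstack([X_block, tiled])


X_block = np.arange(6.0).reshape(3, 2)   # 3 samples, 2 lagged features
static_covs = np.array([10.0, 20.0])     # 2 static covariates
print(add_static_covariates_to_block(X_block, static_covs).shape)  # (3, 4)
```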

Finally, after mulling over it some more, I think that the 'moving window' method can actually be used for series of different frequencies by correctly selecting the stride parameter. I think that's a bit outside the scope of this PR, though; I can perhaps look to implement that alongside #1487.

Once again, any and all comments are welcome, and thanks in advance for any help - apologies for the delay in making these amendments.

Cheers,
Matt.

@mabilton
Contributor Author

Looks like the 04-RNN-examples.ipynb notebook is failing since tensorboardX can't be imported - any ideas what's going on here?

@hrzn
Contributor

hrzn commented Jan 20, 2023

Hi @mabilton and many thanks for revising your PR. Will check your updates very soon.
Regarding the issues with the CI tests, they are due to a recent release of PyTorch Lightning which was breaking a few things in our tests. I have now fixed this in master, so the issues should be resolved ✅
Thanks!

@mabilton
Contributor Author

Hey @hrzn, thanks for the update + bug fix.

I've finished amending one of the static covariates tests so that the static covariate values appended to the feature matrix are explicitly checked (as opposed to just checking that the shape of the feature matrix returned by RegressionModel._create_lagged_data is correct).

To perhaps clarify what I mentioned in my previous comment about appending the static covariate columns as soon as each feature matrix block is constructed, it might pay to have a look at helper_get_static_covs_expected_X in test_regression_models.py - this is basically a simplified implementation of the strategy I'm suggesting. Thoughts on this?

Cheers,
Matt.

Contributor

@hrzn hrzn left a comment

Great work @mabilton !
Only got one small comment left regarding the docstring for preventing lag=0 for past covariates.

Regarding your comment about the alternative way of adding the static covariates to the features array, as far as I can tell it makes sense 👍 I would suggest we wait for another PR and close this one first though.

Many thanks again, great stuff!


The `lags` specified for the `target_series` must all be less than or equal to -1 (i.e. one can't use the value
of the target series at time `t` to predict the target series at the same time `t`). Conversely, the values in
`lags_past_covariates` and/or `lags_future_covariates` must be less than or equal to 0 (i.e. we *are* able to [...]

I think this part of the docstring still requires adaptation, right?

@mabilton
Contributor Author

Hey @hrzn, thanks for your kind feedback. Good spot with the docstring - I've fixed that now. Thanks to both you and @eliane-maalouf for your help throughout this PR.

@hrzn
Contributor

hrzn commented Jan 23, 2023

Great thanks @mabilton :) merging now 🚀

@hrzn hrzn merged commit 864d190 into unit8co:master Jan 23, 2023
darts automation moved this from In review to Done Jan 23, 2023
@mabilton mabilton deleted the refactor/tabularization branch January 24, 2023 03:52
@dennisbader dennisbader moved this from Done to Released in darts May 23, 2023