Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] end-to-end dask integration #4013

Open
fkiraly opened this issue Dec 29, 2022 · 5 comments
Open

[ENH] end-to-end dask integration #4013

fkiraly opened this issue Dec 29, 2022 · 5 comments
Labels
API design API design & software architecture enhancement Adding new functionality

Comments

@fkiraly
Copy link
Collaborator

fkiraly commented Dec 29, 2022

This issue collects a roadmap for achieving end-to-end integration with dask data containers at the framework (not estimator) level. For discussion and design.

  • dask based mtypes, especially for panel and hierarchical data
  • base classes, esp forecasters and transformers, recognize dask data frames and avoid calls to compute. Possibly:
    • option: pass to VectorizeDF, which recognizes dask and handles vectorization via dask groupby
    • option: if forecaster/transformer does not support dask natively, it is wrapped in a delayed call
    • option: if transformer is instance-wise, always uses dask groupby for vectorization
  • dask support in key transformers: lag, windowing, aggregation/summarization, temporal features
@fkiraly fkiraly added API design API design & software architecture enhancement Adding new functionality labels Dec 29, 2022
@topher-lo
Copy link
Contributor

Another option (same spirit as option 1): pass to VectorizeDF and handle parallelisation via dask map_partitions. This is what nixtla's mlforecast does!

@fkiraly
Copy link
Collaborator Author

fkiraly commented Jan 5, 2023

Another option (same spirit as option 1): pass to VectorizeDF and handle parallelisation via dask map_partitions. This is what nixtla's mlforecast does!

hm, indeed. Just from naive inspection, this seems to be two ways to achieve the same end.

Do you have a good grip on which of groupby vs map_partitions would be "better", for a yet-to-specify multi-faceted definition of "better"?

@fkiraly
Copy link
Collaborator Author

fkiraly commented Jan 5, 2023

@topher-lo, I would be keen to hear how you would design the dispatch inside VectorizeDF (if you have the time to think about it)

@topher-lo
Copy link
Contributor

topher-lo commented Jan 9, 2023

Definitely map_partitions. Dask maintainers recommend using map_partitions AKA "map dataframe chunks" over groupby for general functions. It's generally a lot more performant as well: e.g. in this blog post (https://targomo.medium.com/how-we-learned-to-love-dask-and-achieved-a-40x-speedup-aa14e72d99c0). groupby is optimized for operations under the Pandas groupby context and performs poorly for general functions.

@topher-lo
Copy link
Contributor

topher-lo commented Jan 9, 2023

@fkiraly FYI this is how nixtla/mlforecast implemented a distributed time series data structure: https://github.com/Nixtla/mlforecast/blob/main/nbs/distributed.core.ipynb

fkiraly added a commit that referenced this issue Jan 28, 2023
This PR moves functionality for broadcast/vectorized application of estimator methods to `VectorizedDF`, to a new method called `vectorize_est`. It also serves to explore refactor end states.

Reasons to move broadcast/vectorized application to `VectorizedDF`:

* duplication between transformers and forecasters which have very similar logic implemented
* current state can be seen as a violation of the law of Demeter; target state improves cohesion, reduces coupling between base classes and `VectorizedDF`
* localized logic in `VectorizedDF` seems to be a more natural starting point for adding `dask` and scaling/distributed features, see #4013

Also contains a direct unit test for `vectorize_est`.
Note that the new method `vectorize_est` is also indirectly covered by tests for vectorization functionality in transformers and forecasters, which also covers the external points of the refactor in `BaseForecaster` and `BaseTransformer`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design API design & software architecture enhancement Adding new functionality
Projects
None yet
Development

No branches or pull requests

2 participants