[ENH] Improve performance of pandas multi-index data type operations with many groups #4139
Comments
Indeed this has been on our minds lately! At the root of the problems you describe are, I think, three things:
We also need imo a more principled way of profiling runtime and memory use of the boilerplate layer - I think @danbartl and @KishManani have created their own ad-hoc profilers, but it might be nice to move longer term to, say, a CI step which monitors critical layers and raises the alarm if inefficiencies are found. For context why things are as they currently are: |
Thanks for the great initiative. I think the biggest improvement after fixing the checks is cutting down on the number of checks. In the simple screen I posted in your PR #4140 the multiindex check is run a total of 22 times... |
I am experiencing the same issue for the same use case! The checks during fitting and transforming can add a lot of latency. The changes in #4140 are a great step forward! I think the more we can do to reduce any redundancy in checks that @danbartl mentioned in #4140 (comment) the better! |
https://github.com/benchmark-action/github-action-benchmark/tree/master/examples/pytest @fkiraly @KishManani I think this would be a nice start to integrate, maybe we can talk about that in the future? |
that looks great, @danbartl! |
@fkiraly could you possibly give me an intro to how github actions are handled currently for sktime? just 5 to 10 minutes to get an idea where to look into would be awesome :) |
Hi all. Thanks for all the replies and the positive feedback! I was quite busy last week, so sorry for the delayed response. Regarding the checks (@danbartl): VectorizedDF as a placeholder (@fkiraly): automated vectorization (@fkiraly): profiling runtime and memory use (@danbartl, @fkiraly, @KishManani): |
No worries. Most active contributors have day jobs and are volunteers. VectorizedDF as a placeholder (@fkiraly):
Sure - do you have any ideas?
Absolutely! That's the idea, the change in
Fully agreed! |
Sure! Can do that in-person in one of the community meetings, or perhaps the following in writing helps. This is the developer documentation on CI: The most important files are As a human readable tutorial, this is quite good: https://py-pkgs.org/08-ci-cd.html, by @TomasBeuzen and @ttimbers (kudos) |
Thanks for the shout-out @fkiraly - was quite surprised to see a notification from |
Yes! Totally appreciate the shout out 😊 |
Well, it would have been so great to have this years ago when we set up the package. Nice book, which I'm also planning to use at work :-) |
This is already part of the PR I opened. I modified
Agreed. I know that my PR is already quite large, but all updates have a common goal, which is performance improvement. The problem is that panel iteration is so ubiquitous. I could split it in two: changes to the core and updated transformers. |
Split out of #4140 Contributes to #4139 This PR primarily improves instance iteration performance in `VectorizedDF`, by replacing pandas loc-iteration with groupby. It also fixes a memory leak I observed when `_get_X_at_index` is called multiple times, by returning a copy of a slice (I still don't fully understand the root cause, but this solved it).
…F.__getitem__ and VectorizedDF.get_iloc_indexer (#4228) Followup to #4195 Contributes to #4139 This PR implements `BaseForecastingErrorMetric._evaluate_vectorized` using `VectorizedDF.vectorize_est`. Removes the last reference to `VectorizedDF.__getitem__`. Random access is not needed, and developers should use `__iter__` for iteration instead (implemented in #4195). Also, unused method `get_iloc_indexer` is marked as deprecated and should be removed in a future version.
Is your feature request related to a problem? Please describe.
The current implementation is showing poor performance with panel or hierarchical datasets in the pandas multi-index mtype format if the number of instances is substantial.
I was observing this with a retail dataset: daily sales data for thousands of articles across hundreds of stores and a number of exogenous features.
I am trying to create a forecasting pipeline for each article, which includes different series transformations (e.g. ColumnSelect, Lag, Impute, ...) and a regression model. So any performance issue is amplified by the number of models and probably only striking because of that.
The root cause, I found, is the iterative indexing approach of sktime to apply functions to each instance (series) in the data frame, even in cases where a more performant solution exists. Pandas grouped operations are very slow if you iterate over the group indices (`df.loc[idx] for idx in groups`). Groupby/apply with a Python function is a good first step to improve things, but the most performant way is utilizing grouped operations which are delegated to compiled code.
As an example, consider imputing with a mean function. Currently, the Imputer is basically doing this:
A first step would be to rewrite it as
An even more performant implementation would not use any iteration at all, but apply an optimized pandas operation on the whole multi-index frame:
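For illustration, here is a minimal sketch of the three variants described above. It is not the sktime `Imputer` code: the frame `X`, its `store`/`date` index levels, the `price` column and the three function names are all hypothetical, chosen only to make the comparison concrete.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: 3 stores x 4 days, with some missing prices.
idx = pd.MultiIndex.from_product(
    [["s1", "s2", "s3"], pd.date_range("2023-01-01", periods=4)],
    names=["store", "date"],
)
X = pd.DataFrame(
    {"price": [1.0, np.nan, 3.0, 2.0,
               np.nan, 10.0, 20.0, np.nan,
               5.0, 5.0, np.nan, 5.0]},
    index=idx,
)


def impute_loc_iteration(df):
    # Variant 1: slice each instance with .loc and fill it separately.
    parts = []
    for key in df.index.get_level_values("store").unique():
        part = df.loc[[key]]  # list selector keeps the "store" level
        parts.append(part.fillna(part.mean()))
    return pd.concat(parts)


def impute_groupby_apply(df):
    # Variant 2: groupby does the splitting, but there is still
    # one Python-level function call per group.
    return df.groupby(level="store", group_keys=False).apply(
        lambda g: g.fillna(g.mean())
    )


def impute_groupby_transform(df):
    # Variant 3: the grouped mean is computed by pandas' built-in
    # aggregation and broadcast back to the original index,
    # followed by a single fillna on the whole frame.
    group_means = df.groupby(level="store").transform("mean")
    return df.fillna(group_means)
```

All three functions return the same imputed frame on this toy data; they differ only in how much work happens in Python-level loops, which is where the runtime gap on large panels would be expected to come from.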
To get a feeling for the impact, I timed `Imputer(method="mean").fit_transform(X[['price']])` for a dataset with 1000 stores, each series consisting of around 2000 observations. The difference I measured was 60.7s vs 2.4s. Remember that in my case this will be multiplied by the number of transformations in the pipeline, stores, models, hyperparameter optimization runs, etc.
Describe the solution you'd like
Any series wise operation on multi-index frames should use optimized pandas methods, if possible.
Core operations, like `VectorizedDF.as_list` or `check_pdmultiindex_panel`, should be refactored (partially covered by #3827). Unnecessary calls should be avoided.
Forecasters and series transformers should be re-evaluated if they can accept multi-index frames internally (`"X_inner_mtype": ["pd.DataFrame", "pd-multiindex", "pd_multiindex_hier"]`). Implementations should only fall back to an iterative strategy if no optimized solution exists. In the case of the Imputer, for example, this is the case for some of the imputation strategies:
There are even cases where we don't need any grouped operation at all, like the constant imputation strategy or the ColumnSelect transformer, for example.
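As a rough sketch of the "no grouped operation at all" case: constant imputation and column selection do not depend on which instance a row belongs to, so plain pandas calls on the whole multi-index frame are enough. The frame `X` and its `price`/`promo` columns are hypothetical, and this is not how sktime's `Imputer` or `ColumnSelect` are implemented internally.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: 2 stores x 3 days, two columns.
idx = pd.MultiIndex.from_product(
    [["s1", "s2"], pd.date_range("2023-01-01", periods=3)],
    names=["store", "date"],
)
X = pd.DataFrame(
    {
        "price": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
        "promo": [0, 1, 0, 0, 1, 1],
    },
    index=idx,
)

# Constant imputation: the fill value is the same for every instance,
# so a single fillna over the whole frame is sufficient.
X_const = X.fillna(0.0)

# Column selection: ordinary column indexing; the row multi-index
# is carried over unchanged.
X_price = X[["price"]]
```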
Describe alternatives you've considered
Additional context
I'll open a draft PR with the changes I made so far. Hopefully this will help to get this started.