
[ENH] Improve performance of pandas multi-index data type operations with many groups #4139

Open
hoesler opened this issue Jan 23, 2023 · 13 comments
Assignees
Labels
enhancement Adding new functionality module:base-framework BaseObject, registry, base framework module:datatypes datatypes module: data containers, checkers & converters

Comments

@hoesler
Contributor

hoesler commented Jan 23, 2023

Is your feature request related to a problem? Please describe.
The current implementation shows poor performance on panel or hierarchical datasets in the pandas multi-index mtype format when the number of instances is large.
I was observing this with a retail dataset: daily sales data for thousands of articles across hundreds of stores and a number of exogenous features.
I am trying to create a forecasting pipeline for each article, which includes different series transformations (e.g. ColumnSelect, Lag, Impute, ...) and a regression model. So any performance issue is amplified by the number of models and probably only striking because of that.

The root cause, I found, is sktime's iterative indexing approach for applying functions to each instance (series) in the data frame, even in cases where a more performant solution exists. Pandas grouped operations are very slow if you iterate over the group indices (`df.loc[idx] for idx in groups`). Groupby/apply with a Python function is a good first step to improve things, but the most performant way is to use grouped operations that are delegated to compiled code.

As an example, consider imputing with a mean function. Currently, the Imputer is basically doing this:

indices = get_indices(panel, 'store')
means = [mean(panel.loc[idx]) for idx in indices]
[impute(panel.loc[idx], m) for idx, m in zip(indices, means)]

A first step would be to rewrite it as

means = [series.mean() for _, series in panel.groupby('store')]
[series.fillna(m) for (_, series), m in zip(panel.groupby('store'), means)]

An even more performant implementation would not use any iteration at all, but apply an optimized pandas operation on the whole multi-index frame:

means = panel.groupby('store').transform('mean')
panel.fillna(means)
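To make the fully vectorized variant concrete, here is a minimal runnable sketch on a toy panel (the 'store'/'date' index and the `price` column are illustrative, not sktime API). The grouped mean is computed with `transform`, which returns a frame aligned to the original index, so a single `fillna` completes the imputation with no Python-level loop:

```python
import numpy as np
import pandas as pd

# toy panel: 3 stores x 4 days, one "price" column with missing values
idx = pd.MultiIndex.from_product(
    [["A", "B", "C"], pd.date_range("2023-01-01", periods=4)],
    names=["store", "date"],
)
panel = pd.DataFrame({"price": [1.0, np.nan, 3.0, np.nan] * 3}, index=idx)

# fully vectorized mean imputation: one grouped transform, no Python loop
means = panel.groupby(level="store").transform("mean")
imputed = panel.fillna(means)
```

For store "A" the series is [1.0, NaN, 3.0, NaN], so both gaps are filled with the per-store mean 2.0.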

To get a feeling for the impact, I timed Imputer(method="mean").fit_transform(X[['price']]) for a dataset with 1000 stores, each series consisting of around 2000 observations; the difference I measured was 60.7s vs 2.4s. Remember that in my case this is multiplied by the number of transformations in the pipeline, stores, models, hyperparameter optimization runs, etc.
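A smaller-scale version of this comparison can be reproduced with the standard-library `timeit`. This sketch uses a synthetic 100-store panel (not the author's retail data) and times the loc-iteration anti-pattern against the grouped variant; absolute numbers will vary by machine, but the gap should be clearly visible:

```python
import timeit

import numpy as np
import pandas as pd

# synthetic panel: 100 stores x 200 days, with ~10% missing prices
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [range(100), pd.date_range("2020-01-01", periods=200)],
    names=["store", "date"],
)
panel = pd.DataFrame({"price": rng.normal(size=len(idx))}, index=idx)
panel.loc[panel.sample(frac=0.1, random_state=0).index, "price"] = np.nan

def impute_loc_iteration(df):
    # anti-pattern: per-instance .loc indexing in a Python loop
    stores = df.index.get_level_values("store").unique()
    parts = []
    for store in stores:
        series = df.loc[store]
        parts.append(series.fillna(series.mean()))
    return pd.concat(parts, keys=stores, names=["store"])

def impute_grouped(df):
    # vectorized: one grouped transform delegated to compiled code
    return df.fillna(df.groupby(level="store").transform("mean"))

t_loop = timeit.timeit(lambda: impute_loc_iteration(panel), number=3)
t_grouped = timeit.timeit(lambda: impute_grouped(panel), number=3)
print(f"loc-iteration: {t_loop:.3f}s, grouped: {t_grouped:.3f}s")
```

Both functions produce the same imputed values; only the execution strategy differs.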

Describe the solution you'd like
Any series-wise operation on multi-index frames should use optimized pandas methods where possible.

Core operations, like VectorizedDF.as_list or check_pdmultiindex_panel, should be refactored (partially covered by #3827), and unnecessary calls should be avoided.

Forecasters and series transformers should be re-evaluated to see whether they can accept multi-index frames internally ("X_inner_mtype": ["pd.DataFrame", "pd-multiindex", "pd_multiindex_hier"]). Implementations should only fall back to an iterative strategy if no optimized solution exists. In the case of the Imputer, for example, this is possible for some of the imputation strategies:

def transform(self, X, y=None):
    if self.method == "forecaster":
        return self.vectorize("transform", VectorizedDF(y), VectorizedDF(X))
    elif self.method == "mean":
        # one grouped operation over the whole multi-index frame
        return X.fillna(X.groupby(level=X.index.names[:-1]).transform("mean"))

There are even cases where we don't need any grouped operation at all, such as the constant imputation strategy or the ColumnSelect transformer.
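A quick sketch of why these two cases are group-agnostic (toy data, illustrative names): filling with a constant is identical for every instance, and column selection never looks at the row index at all, so both reduce to a single plain pandas operation:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["A", "B"], pd.date_range("2023-01-01", periods=3)],
    names=["store", "date"],
)
panel = pd.DataFrame(
    {"price": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0], "units": range(6)},
    index=idx,
)

# constant imputation: identical for every instance, so no groupby is needed
filled = panel.fillna({"price": 0.0})

# column selection: plain column indexing, also group-agnostic
selected = panel[["price"]]
```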

Describe alternatives you've considered

Additional context

I'll open a draft PR with the changes I made so far. Hopefully this will help to get this started.

@hoesler hoesler added the enhancement Adding new functionality label Jan 23, 2023
@fkiraly
Collaborator

fkiraly commented Jan 23, 2023

Indeed this has been on our minds lately!

At the root of the problems you describe are, I think, three things:

  • checks and conversions are inefficient and sometimes badly implemented. There's an element of "mea culpa" here, and some of it is also old legacy code that has been moved into the checks system without much change. @danbartl has done a lot of excellent work recently to improve performance!
  • the vectorization/broadcasting functionality via VectorizedDF is currently naive - but it's meant as a placeholder. The design, from a layers perspective, should allow us to slowly supplant this with more efficient vectorization, up to the use of delayed computation via dask. This may need a refactor first (see here for current work: [ENH] refactor - localize broadcasting in VectorizedDF #4132)
  • frequently used transformers are not implemented for hierarchical data but could be implemented more efficiently. Currently the default is automated vectorization, so the lack of an efficient hierarchical implementation is compounded by the naivete of the vectorization functionality. @KishManani has worked on some transformers recently; of course, there are a lot of transformers...

We also need, imo, a more principled way of profiling runtime and memory use of the boilerplate layer. I think @danbartl and @KishManani have created their own ad-hoc profilers, but it might be nice to move, longer term, to, say, a CI step which monitors critical layers and raises the alarm if inefficiencies are found.

For context on why things are as they currently are:
historically, we were to some extent following the maxim "first make it work, then make it good".
sktime has elements of design research mapping new horizons in the time series space, and a way to experiment is to implement some features and see what users like - for instance, automated broadcasting/vectorization, or the support for unequal-length time series or pandas-based data containers.
Implementing a mockup that works for small to medium data allows us to see whether it's "the right interface", by seeing which way it is being used.

@fkiraly fkiraly added module:datatypes datatypes module: data containers, checkers & converters module:base-framework BaseObject, registry, base framework labels Jan 23, 2023
@danbartl
Collaborator

Thanks for the great initiative. I think the biggest improvement after fixing the checks is cutting down on the number of checks. In the simple example I posted in your PR #4140, the multi-index check is run a total of 22 times...

@KishManani
Contributor

I was observing this with a retail dataset: daily sales data for thousands of articles across hundreds of stores and a number of exogenous features.
I am trying to create a forecasting pipeline for each article, which includes different series transformations (e.g. ColumnSelect, Lag, Impute, ...) and a regression model. So any performance issue is amplified by the number of models and probably only striking because of that.

I am experiencing the same issue for the same use case! The checks during fitting and transforming can add a lot of latency. The changes in #4140 are a great step forward! I think the more we can do to reduce any redundancy in checks that @danbartl mentioned in #4140 (comment) the better!

@danbartl
Collaborator

https://github.com/benchmark-action/github-action-benchmark/tree/master/examples/pytest

@fkiraly @KishManani I think this would be a nice start to integrate, maybe we can talk about that in the future?

@fkiraly
Collaborator

fkiraly commented Jan 30, 2023

that looks great, @danbartl!
Would you like to open an issue with a rough spec? I assume we may like this as a CI element.

@danbartl danbartl self-assigned this Jan 30, 2023
@danbartl
Collaborator

@fkiraly could you possibly give me an intro to how GitHub Actions are currently handled for sktime? Just 5 to 10 minutes to get an idea of where to look would be awesome :)
Thanks!

@hoesler
Contributor Author

hoesler commented Jan 30, 2023

Hi all. Thanks for all the replies and the positive feedback! I was quite busy last week, so sorry for the delayed response.

Regarding the checks (@danbartl):
Indeed, speeding up checks and reducing the number of calls is a necessary step. My PR does that both explicitly and implicitly, because working on panel data directly prevents redundant checks for each series instance.

VectorizedDF as a placeholder (@fkiraly):
Thanks for the explanation. Now that #4132 is merged, do you think VectorizedDF is ready to get some improvements from my PR?
Also, a small comment on supporting large data processing frameworks like dask or spark: I would consider inverting the vectorization logic for those. As these frameworks are meant to bring compute to the data instead of data to the compute, they know best how to do this optimally. So instead of `for series in panel.as_list(): ...` (where panel might be a VectorizedDF), I would use `panel.apply(udf)` (in Spark, this would translate into `data.groupby().apply()`, for example).
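The inverted pattern can be sketched in pandas terms. `panel_apply` is a hypothetical helper, not existing sktime API; the point is that the caller hands over a per-series UDF and the backend picks the execution plan (a Spark backend would translate the same call into `groupBy(...).applyInPandas(...)`):

```python
import numpy as np
import pandas as pd

def panel_apply(panel, udf, level="store"):
    # hypothetical inverted API: instead of iterating instances ourselves,
    # delegate to the backend's grouped apply and let it schedule the work
    return panel.groupby(level=level, group_keys=False).apply(udf)

idx = pd.MultiIndex.from_product(
    [["A", "B"], pd.date_range("2023-01-01", periods=3)],
    names=["store", "date"],
)
panel = pd.DataFrame({"price": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0]}, index=idx)

# per-series mean imputation expressed as a UDF, executed by the backend
result = panel_apply(panel, lambda s: s.fillna(s.mean()))
```

With the pandas backend this is just groupby/apply; the value of the inversion is that swapping the backend does not change caller code.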

automated vectorization (@fkiraly):
My optimizations cover both: the generic vectorization, by improving VectorizedDF.as_list, and some transformers, by working on panels directly. So I think my PR could make a good start.

profiling runtime and memory use (@danbartl, @fkiraly, @KishManani):
Having an automated way of measuring improvements would be awesome!

@fkiraly
Collaborator

fkiraly commented Jan 30, 2023

Hi all. Thanks for all the replies and the positive feedback! I was quite busy last week, so sorry for the delayed response.

No worries. Most active contributors have day jobs and are volunteers.

VectorizedDF as a placeholder (@fkiraly):

Thanks for the explanation. Now that #4132 is merged, do you think VectorizedDF is ready to get some improvements from my PR?

Sure - do you have any ideas?
I would just recommend keeping PRs small, with a small scope of change or a small file footprint. So, if independent changes are made, they are preferably in separate PRs.

Also, a small comment on supporting large data processing frameworks like dask or spark: I would consider inverting the vectorization logic for those.

Absolutely!

That's the idea; the change in VectorizedDF is a partial inversion.
The next step would be giving VectorizedDF the responsibility to "bring the data to the compute" (or to handle, say, dask containers).

Having an automated way of measuring improvements would be awesome!

Fully agreed!

@fkiraly
Collaborator

fkiraly commented Jan 30, 2023

@fkiraly could you possibly give me an intro to how github actions are handled currently for sktime? just 5 to 10 minutes to get an idea where to look into would be awesome :)

Sure! Can do that in-person in one of the community meetings, or perhaps the following in writing helps.

This is the developer documentation on CI:
https://www.sktime.org/en/stable/developer_guide/continuous_integration.html
Oddly, the table at the end does not render, but it contains all the code locations; I have opened an issue to fix this:
#4179

The most important file is .github/workflows/test.yml, which contains the PR CI tests; the things to understand are the format of the GitHub Actions being run, and the "run" commands (command line).

As a human readable tutorial, this is quite good: https://py-pkgs.org/08-ci-cd.html, by @TomasBeuzen and @ttimbers (kudos)

@TomasBeuzen

Thanks for the shout-out @fkiraly - was quite surprised to see a notification from sktime appear in my inbox as I use the package semi-regularly in my work. So, thanks to you and the team for the great work 😃

@ttimbers

Yes! Totally appreciate the shout out 😊

@fkiraly
Collaborator

fkiraly commented Jan 30, 2023

Well, it would have been so great to have this years ago when we set up the package.
@mloning did most of this at the start and then @lmmentel, but I also had a lot of headache reading official documentation that primarily gives indigestion and information only as a side effect.

Nice book I'm also planning to use at work :-)

@hoesler
Contributor Author

hoesler commented Jan 31, 2023

@fkiraly

Thanks for the explanation. Now that #4132 is merged, do you think VectorizedDF is ready to get some improvements from my PR?

Sure - do you have any ideas?

This is already part of the PR I opened. I modified VectorizedDF.to_list.

I would just recommend keeping PRs small, with a small scope of change or a small file footprint. So, if independent changes are made, they are preferably in separate PRs.

Agreed. I know that my PR is already quite large, but all updates share a common goal: performance improvements. The problem is that panel iteration is so ubiquitous. I could split it in two: changes to the core, and updated transformers.
Let me know in the PR.

fkiraly pushed a commit that referenced this issue Feb 3, 2023
…rmer (#4193)

Split out of #4140
Contributes to #4139

This PR allows `ColumnSelect` to transform multi-index dataframes directly, instead of using an expensive iterative instance-wise transformation.
fkiraly pushed a commit that referenced this issue Feb 11, 2023
…4194)

Split out of #4140
Contributes to #4139

This PR adds native multi-index/hierarchical support to `Imputer` for some imputation strategies, making it much more efficient than expensive iteration.
fkiraly pushed a commit that referenced this issue Feb 12, 2023
Split out of #4140
Contributes to #4139

This PR primarily improves instance iteration performance in `VectorizedDF`, by replacing pandas loc-iteration with groupby. It also fixes a memory leak I observed when `_get_X_at_index` is called multiple times, by returning a copy of a slice (I still don't fully understand the root cause, but this solved it).
fkiraly pushed a commit that referenced this issue Feb 13, 2023
Split out of #4140
Contributes to #4139
Overlaps with #3827, #3991

This PR improves the performance of `check_pdmultiindex_panel`.

It was developed in parallel to the work of @danbartl, so some improvements might be outdated or just implemented differently.
fkiraly pushed a commit that referenced this issue Mar 7, 2023
…F.__getitem__ and VectorizedDF.get_iloc_indexer (#4228)

Followup to #4195
Contributes to #4139

This PR implements `BaseForecastingErrorMetric._evaluate_vectorized` using `VectorizedDF.vectorize_est`.
Removes the last reference to `VectorizedDF.__getitem__`.
Random access is not needed, and developers should use `__iter__` for iteration instead (implemented in #4195).
Also, unused method `get_iloc_indexer` is marked as deprecated and should be removed in a future version.
6 participants