
[ENH] Improve performance of pandas multi-index data type operations with many groups #4139

Open
hoesler opened this issue Jan 23, 2023 · 13 comments
Assignees
Labels
enhancement Adding new functionality module:base-framework BaseObject, registry, base framework module:datatypes datatypes module: data containers, checkers & converters

Comments

@hoesler
Contributor

hoesler commented Jan 23, 2023

Is your feature request related to a problem? Please describe.
The current implementation shows poor performance on panel or hierarchical datasets in the pandas multi-index mtype format when the number of instances is large.
I was observing this with a retail dataset: daily sales data for thousands of articles across hundreds of stores and a number of exogenous features.
I am trying to create a forecasting pipeline for each article, which includes different series transformations (e.g. ColumnSelect, Lag, Impute, ...) and a regression model. So any performance issue is amplified by the number of models and probably only striking because of that.

The root cause, I found, is sktime's iterative indexing approach for applying functions to each instance (series) in the data frame, even in cases where a more performant solution exists. Pandas grouped operations are very slow if you iterate over the group indices (`df.loc[idx] for idx in groups`). Groupby/apply with a Python function is a good first step to improve things, but the most performant way is to use grouped operations that are delegated to compiled code.

As an example, consider imputing with a mean function. Currently, the Imputer is basically doing this:

indices = get_indices(panel, 'store')
means = [mean(panel.loc[idx]) for idx in indices]
[impute(panel.loc[idx], m) for idx, m in zip(indices, means)]

A first step would be to rewrite it as

means = [series.mean() for _, series in panel.groupby('store')]
[series.fillna(m) for (_, series), m in zip(panel.groupby('store'), means)]

An even more performant implementation would not use any iteration at all, but apply an optimized pandas operation on the whole multi-index frame:

means = panel.groupby('store').transform('mean')
panel.fillna(means)
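To make the fully vectorized variant concrete, here is a minimal runnable sketch on a toy panel (the 'store'/'date' index and the `price` column are illustrative, not sktime API). The grouped mean is computed with `transform`, which returns a frame aligned to the original index, so a single `fillna` completes the imputation with no Python-level loop:

```python
import numpy as np
import pandas as pd

# toy panel: 3 stores x 4 days, one "price" column with missing values
idx = pd.MultiIndex.from_product(
    [["A", "B", "C"], pd.date_range("2023-01-01", periods=4)],
    names=["store", "date"],
)
panel = pd.DataFrame({"price": [1.0, np.nan, 3.0, np.nan] * 3}, index=idx)

# fully vectorized mean imputation: one grouped transform, no Python loop
means = panel.groupby(level="store").transform("mean")
imputed = panel.fillna(means)
```

For store "A" the series is [1.0, NaN, 3.0, NaN], so both gaps are filled with the per-store mean 2.0.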

To get a feeling for the impact, I timed Imputer(method="mean").fit_transform(X[['price']]) for a dataset with 1000 stores, each series consisting of around 2000 observations; the difference I measured was 60.7s vs 2.4s. Remember that in my case this is multiplied by the number of transformations in the pipeline, stores, models, hyperparameter optimization runs, etc.
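A smaller-scale version of this comparison can be reproduced with the standard-library `timeit`. This sketch uses a synthetic 100-store panel (not the author's retail data) and times the loc-iteration anti-pattern against the grouped variant; absolute numbers will vary by machine, but the gap should be clearly visible:

```python
import timeit

import numpy as np
import pandas as pd

# synthetic panel: 100 stores x 200 days, with ~10% missing prices
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [range(100), pd.date_range("2020-01-01", periods=200)],
    names=["store", "date"],
)
panel = pd.DataFrame({"price": rng.normal(size=len(idx))}, index=idx)
panel.loc[panel.sample(frac=0.1, random_state=0).index, "price"] = np.nan

def impute_loc_iteration(df):
    # anti-pattern: per-instance .loc indexing in a Python loop
    stores = df.index.get_level_values("store").unique()
    parts = []
    for store in stores:
        series = df.loc[store]
        parts.append(series.fillna(series.mean()))
    return pd.concat(parts, keys=stores, names=["store"])

def impute_grouped(df):
    # vectorized: one grouped transform delegated to compiled code
    return df.fillna(df.groupby(level="store").transform("mean"))

t_loop = timeit.timeit(lambda: impute_loc_iteration(panel), number=3)
t_grouped = timeit.timeit(lambda: impute_grouped(panel), number=3)
print(f"loc-iteration: {t_loop:.3f}s, grouped: {t_grouped:.3f}s")
```

Both functions produce the same imputed values; only the execution strategy differs.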

Describe the solution you'd like
Any series-wise operation on multi-index frames should use optimized pandas methods where possible.

Core operations, like VectorizedDF.as_list or check_pdmultiindex_panel, should be refactored (partially covered by #3827), and unnecessary calls should be avoided.

Forecasters and series transformers should be re-evaluated to see whether they can accept multi-index frames internally ("X_inner_mtype": ["pd.DataFrame", "pd-multiindex", "pd_multiindex_hier"]). Implementations should only fall back to an iterative strategy if no optimized solution exists. In the case of the Imputer, for example, this is possible for some of the imputation strategies:

def transform(self, X, y=None):
    if self.method == "forecaster":
        return self.vectorize("transform", VectorizedDF(y), VectorizedDF(X))
    elif self.method == "mean":
        # one grouped operation over the whole multi-index frame
        return X.fillna(X.groupby(level=X.index.names[:-1]).transform("mean"))

There are even cases where we don't need any grouped operation at all, such as the constant imputation strategy or the ColumnSelect transformer.
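A quick sketch of why these two cases are group-agnostic (toy data, illustrative names): filling with a constant is identical for every instance, and column selection never looks at the row index at all, so both reduce to a single plain pandas operation:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["A", "B"], pd.date_range("2023-01-01", periods=3)],
    names=["store", "date"],
)
panel = pd.DataFrame(
    {"price": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0], "units": range(6)},
    index=idx,
)

# constant imputation: identical for every instance, so no groupby is needed
filled = panel.fillna({"price": 0.0})

# column selection: plain column indexing, also group-agnostic
selected = panel[["price"]]
```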

Describe alternatives you've considered

Additional context

I'll open a draft PR with the changes I made so far. Hopefully this will help to get this started.

@hoesler hoesler added the enhancement Adding new functionality label Jan 23, 2023
@fkiraly
Collaborator

fkiraly commented Jan 23, 2023

Indeed this has been on our minds lately!

At the root of the problems you describe are, I think, three things:

  • checks and conversions are inefficient and sometimes badly implemented. There's an element of "mea culpa" here, and some of it is also old legacy code that has been moved into the checks system without much change. @danbartl has done a lot of excellent work recently to improve performance!
  • the vectorization/broadcasting functionality via VectorizedDF is currently naive - but it's meant as a placeholder. The design, from a layers perspective, should allow us to slowly supplant this with more efficient vectorization, up to the use of delayed computation via dask. This may need a refactor first (see here for current work: [ENH] refactor - localize broadcasting in VectorizedDF #4132)
  • frequently used transformers are not implemented for hierarchical data but could be implemented more efficiently. Currently the default is automated vectorization, so the lack of an efficient hierarchical implementation is compounded by the naivete of the vectorization functionality. @KishManani has worked on some transformers recently; of course, there are a lot of transformers...

We also need, imo, a more principled way of profiling runtime and memory use of the boilerplate layer. I think @danbartl and @KishManani have created their own ad-hoc profilers, but it might be nice to move, longer term, to, say, a CI step which monitors critical layers and raises the alarm if inefficiencies are found.

For context on why things are as they currently are:
historically, we were to some extent following the maxim "first make it work, then make it good".
sktime has elements of design research mapping new horizons in the time series space, and a way to experiment is to implement some features and see what users like - for instance, automated broadcasting/vectorization, or the support for unequal-length time series or pandas-based data containers.
Implementing a mockup that works for small to medium data allows us to see whether it's "the right interface", by seeing which way it is being used.

@fkiraly fkiraly added module:datatypes datatypes module: data containers, checkers & converters module:base-framework BaseObject, registry, base framework labels Jan 23, 2023
@danbartl
Collaborator

Thanks for the great initiative. I think the biggest improvement after fixing the checks is cutting down on the number of checks. In the simple example I posted in your PR #4140, the multi-index check is run a total of 22 times...

@KishManani
Contributor

I was observing this with a retail dataset: daily sales data for thousands of articles across hundreds of stores and a number of exogenous features.
I am trying to create a forecasting pipeline for each article, which includes different series transformations (e.g. ColumnSelect, Lag, Impute, ...) and a regression model. So any performance issue is amplified by the number of models and probably only striking because of that.

I am experiencing the same issue for the same use case! The checks during fitting and transforming can add a lot of latency. The changes in #4140 are a great step forward! I think the more we can do to reduce any redundancy in checks that @danbartl mentioned in #4140 (comment) the better!

@danbartl
Collaborator

https://github.com/benchmark-action/github-action-benchmark/tree/master/examples/pytest

@fkiraly @KishManani I think this would be a nice start to integrate, maybe we can talk about that in the future?

@fkiraly
Collaborator

fkiraly commented Jan 30, 2023

that looks great, @danbartl!
Would you like to open an issue with a rough spec? I assume we may like this as a CI element.

@danbartl danbartl self-assigned this Jan 30, 2023
@danbartl
Collaborator

@fkiraly could you possibly give me an intro to how GitHub Actions are currently handled for sktime? Just 5 to 10 minutes to get an idea of where to look would be awesome :)
Thanks!

@hoesler
Contributor Author

hoesler commented Jan 30, 2023

Hi all. Thanks for all the replies and the positive feedback! I was quite busy last week, so sorry for the delayed response.

Regarding the checks (@danbartl):
Indeed, speeding up checks and reducing the number of calls is a necessary step. My PR does that both explicitly and implicitly, because working on panel data directly prevents redundant checks for each series instance.

VectorizedDF as a placeholder (@fkiraly):
Thanks for the explanation. Now that #4132 is merged, do you think VectorizedDF is ready to get some improvements from my PR?
Also, a small comment on supporting large data processing frameworks like dask or spark: I would consider inverting the vectorization logic for those. As these frameworks are meant to bring compute to the data instead of data to the compute, they know best how to do this optimally. So instead of `for series in panel.as_list(): ...` (where panel might be a VectorizedDF), I would use `panel.apply(udf)` (in Spark, this would translate into `data.groupby().apply()`, for example).
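The inverted pattern can be sketched in pandas terms. `panel_apply` is a hypothetical helper, not existing sktime API; the point is that the caller hands over a per-series UDF and the backend picks the execution plan (a Spark backend would translate the same call into `groupBy(...).applyInPandas(...)`):

```python
import numpy as np
import pandas as pd

def panel_apply(panel, udf, level="store"):
    # hypothetical inverted API: instead of iterating instances ourselves,
    # delegate to the backend's grouped apply and let it schedule the work
    return panel.groupby(level=level, group_keys=False).apply(udf)

idx = pd.MultiIndex.from_product(
    [["A", "B"], pd.date_range("2023-01-01", periods=3)],
    names=["store", "date"],
)
panel = pd.DataFrame({"price": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0]}, index=idx)

# per-series mean imputation expressed as a UDF, executed by the backend
result = panel_apply(panel, lambda s: s.fillna(s.mean()))
```

With the pandas backend this is just groupby/apply; the value of the inversion is that swapping the backend does not change caller code.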

automated vectorization (@fkiraly):
My optimizations cover both: the generic vectorization, by improving VectorizedDF.as_list, and some transformers, by working on panels directly. So I think my PR could make a good start.

profiling runtime and memory use (@danbartl, @fkiraly, @KishManani):
Having an automated way of measuring improvements would be awesome!

@fkiraly
Collaborator

fkiraly commented Jan 30, 2023

Hi all. Thanks for all the replies and the positive feedback! I was quite busy last week, so sorry for the delayed response.

No worries. Most active contributors have day jobs and are volunteers.

VectorizedDF as a placeholder (@fkiraly):

Thanks for the explanation. Now that #4132 is merged, do you think VectorizedDF is ready to get some improvements from my PR?

Sure - do you have any ideas?
I would just recommend keeping PRs small, with a small scope of change or a small file footprint. So, if independent changes are made, they are preferably in separate PRs.

Also, a small comment on supporting large data processing frameworks like dask or spark: I would consider inverting the vectorization logic for those.

Absolutely!

That's the idea; the change in VectorizedDF is a partial inversion.
The next step would be giving VectorizedDF the responsibility to "bring the data to the compute" (or to handle, say, dask containers).

Having an automated way of measuring improvements would be awesome!

Fully agreed!

@fkiraly
Collaborator

fkiraly commented Jan 30, 2023

@fkiraly could you possibly give me an intro to how github actions are handled currently for sktime? just 5 to 10 minutes to get an idea where to look into would be awesome :)

Sure! Can do that in-person in one of the community meetings, or perhaps the following in writing helps.

This is the developer documentation on CI:
https://www.sktime.org/en/stable/developer_guide/continuous_integration.html
Oddly, the table at the end does not render, but it contains all the code locations; I have opened an issue to fix this:
#4179

The most important file is .github/workflows/test.yml, which contains the PR CI tests; the things to understand are the format of the GitHub Actions being run, and the "run" commands (command line).

As a human readable tutorial, this is quite good: https://py-pkgs.org/08-ci-cd.html, by @TomasBeuzen and @ttimbers (kudos)

@TomasBeuzen

Thanks for the shout-out @fkiraly - was quite surprised to see a notification from sktime appear in my inbox as I use the package semi-regularly in my work. So, thanks to you and the team for the great work 😃

@ttimbers

Yes! Totally appreciate the shout out 😊

@fkiraly
Collaborator

fkiraly commented Jan 30, 2023

Well, it would have been so great to have this years ago when we set up the package.
@mloning did most of this at the start and then @lmmentel, but I also had a lot of headache reading official documentation that primarily gives indigestion and information only as a side effect.

Nice book I'm also planning to use at work :-)

@hoesler
Contributor Author

hoesler commented Jan 31, 2023

@fkiraly

Thanks for the explanation. Now that #4132 is merged, do you think VectorizedDF is ready to get some improvements from my PR?

Sure - do you have any ideas?

This is already part of the PR I opened. I modified VectorizedDF.to_list.

I would just recommend keeping PRs small, with a small scope of change or a small file footprint. So, if independent changes are made, they are preferably in separate PRs.

Agreed. I know that my PR is already quite large, but all updates share a common goal: performance improvements. The problem is that panel iteration is so ubiquitous. I could split it in two: changes to the core, and updated transformers.
Let me know in the PR.

fkiraly pushed a commit that referenced this issue Feb 3, 2023
…rmer (#4193)

Split out of #4140
Contributes to #4139

This PR allows `ColumnSelect` to transform multi-index dataframes directly, instead of using an expensive iterative instance-wise transformation.
fkiraly pushed a commit that referenced this issue Feb 11, 2023
…4194)

Split out of #4140
Contributes to #4139

This PR adds native multi-index/hierarchical support to `Imputer` for some imputation strategies, making it much more efficient than expensive iteration.
fkiraly pushed a commit that referenced this issue Feb 12, 2023
Split out of #4140
Contributes to #4139

This PR primarily improves instance iteration performance in `VectorizedDF`, by replacing pandas loc-iteration with groupby. It also fixes a memory leak I observed when `_get_X_at_index` is called multiple times, by returning a copy of a slice (I still don't fully understand the root cause, but this solved it).
fkiraly pushed a commit that referenced this issue Feb 13, 2023
Split out of #4140
Contributes to #4139
Overlaps with #3827, #3991

This PR improves the performance of `check_pdmultiindex_panel`.

It was developed in parallel to the work of @danbartl, so some improvements might be outdated or just implemented differently.
fkiraly pushed a commit that referenced this issue Mar 7, 2023
…F.__getitem__ and VectorizedDF.get_iloc_indexer (#4228)

Followup to #4195
Contributes to #4139

This PR implements `BaseForecastingErrorMetric._evaluate_vectorized` using `VectorizedDF.vectorize_est`.
Removes the last reference to `VectorizedDF.__getitem__`.
Random access is not needed, and developers should use `__iter__` for iteration instead (implemented in #4195).
Also, unused method `get_iloc_indexer` is marked as deprecated and should be removed in a future version.
6 participants