Speed up columns slices: `etna.datasets.utils.select_columns` #775

Mr-Geekman · 2022-06-28T07:10:17Z

🚀 Feature Request

In a lot of places we use `df.loc[:, pd.IndexSlice[segments, column]]` to select a column from all the segments. This turns out to be very slow when there are many segments.

We should find the places where we use it and make sure that it can safely be replaced with `df.loc[:, pd.IndexSlice[:, column]]`.

There was a problem with the second option: #188. We should investigate whether it still exists and under which conditions:

Does it apply when only one column is selected? (SklearnTransform selects many.)
Can it be avoided by some trick when taking slices (sorting the columns, for example)?
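
A minimal sketch of the two slice forms on a toy wide dataframe, assuming the usual two-level column index (segment, feature). The frame, segment names, and feature names are made up for illustration. For a scalar column, both forms are expected to select the same columns.

```python
import numpy as np
import pandas as pd

# Toy wide dataframe: 3 segments x 2 features, 4 timestamps (illustrative data only).
segments = ["segment_0", "segment_1", "segment_2"]
features = ["target", "exog_0"]
columns = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
df = pd.DataFrame(
    np.arange(4 * len(columns), dtype=float).reshape(4, len(columns)),
    index=pd.date_range("2022-01-01", periods=4, freq="D"),
    columns=columns,
)

# Slow form: every segment is listed explicitly in the slice.
slow = df.loc[:, pd.IndexSlice[segments, "target"]]

# Fast form: a null slice over the segment level.
fast = df.loc[:, pd.IndexSlice[:, "target"]]

print(list(slow.columns))
print(list(fast.columns))
```

Both results keep the two-level column index, so downstream code that expects (segment, feature) columns should not need changes.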

Proposal

Find all places with the slow slice `df.loc[:, pd.IndexSlice[segments, column]]` where column is a scalar. Replace them with a function (you can add it to etna.datasets.utils). Try to replace the slow slice inside the function with the fast slice `df.loc[:, pd.IndexSlice[:, column]]`. Make sure that in this case columns are not reordered across different pandas versions.
Do the same for a list of values in column (e.g. SklearnTransform) and investigate the reordering issue during testing. We want to avoid it without putting all the segments into the slice.
Benchmark the changed transforms (or other calls) to confirm that they become faster. Add the benchmarking code and its results in the comments of the PR. E.g. you can take a dataframe with 50000 segments, 100 timestamps, 5 additional int columns, 5 additional float columns, and 5 additional category columns.
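
A rough benchmarking sketch for the scalar case. The helper names (`make_wide_df`, `slow_select`, `fast_select`) are hypothetical and not part of etna. The frame is also much smaller than the 50000 segments suggested above so that it runs quickly; the sizes can be raised for a real measurement.

```python
import timeit

import numpy as np
import pandas as pd


def make_wide_df(n_segments, n_timestamps, features):
    """Build a wide dataframe with (segment, feature) columns filled with random floats."""
    segments = [f"segment_{i}" for i in range(n_segments)]
    columns = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
    values = np.random.default_rng(0).normal(size=(n_timestamps, len(columns)))
    index = pd.date_range("2022-01-01", periods=n_timestamps, freq="D")
    return pd.DataFrame(values, index=index, columns=columns)


def slow_select(df, column):
    """Slow form: pass every segment explicitly."""
    segments = df.columns.get_level_values("segment").unique().tolist()
    return df.loc[:, pd.IndexSlice[segments, column]]


def fast_select(df, column):
    """Fast form: null slice over the segment level."""
    return df.loc[:, pd.IndexSlice[:, column]]


df = make_wide_df(n_segments=1_000, n_timestamps=100, features=["target", "exog_0", "exog_1"])

# Take the best of a few runs to reduce noise.
slow_time = min(timeit.repeat(lambda: slow_select(df, "target"), number=1, repeat=3))
fast_time = min(timeit.repeat(lambda: fast_select(df, "target"), number=1, repeat=3))
print(f"slow slice: {slow_time:.4f} s, fast slice: {fast_time:.4f} s")
```

Timings depend on the machine and the pandas version, so no numbers are given here.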

Test cases

Make sure that the current tests pass for the scalar case.
Make sure that the current tests pass for the list case.
Add tests for the function on selection of one column.
Add tests for the function on selection of multiple columns (in SklearnTransform we had some tests on reordering, they can be useful).
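
A sketch of how such tests could look, written so that any selection function can be plugged in. All names here are hypothetical. `select_by_pairs` is only a reference implementation that lists every (segment, column) pair explicitly, which is the slow approach the proposal wants to avoid; it stands in for the function under test.

```python
import numpy as np
import pandas as pd

SEGMENTS = ["segment_2", "segment_1", "segment_0"]
FEATURES = ["target", "exog_2", "exog_1", "exog_0"]


def make_df(n_timestamps=4):
    """Build a small wide frame with (segment, feature) columns in the order given above."""
    columns = pd.MultiIndex.from_product([SEGMENTS, FEATURES], names=["segment", "feature"])
    values = np.arange(n_timestamps * len(columns), dtype=float).reshape(n_timestamps, -1)
    index = pd.date_range("2022-01-01", periods=n_timestamps, freq="D")
    return pd.DataFrame(values, index=index, columns=columns)


def select_by_pairs(df, columns):
    """Reference selection (hypothetical): request every (segment, column) pair explicitly."""
    if isinstance(columns, str):
        columns = [columns]
    segments = df.columns.get_level_values(0).unique()
    pairs = pd.MultiIndex.from_product([segments, columns], names=df.columns.names)
    return df.loc[:, pairs]


def check_select(select_fn, columns):
    """Check that columns come out segment by segment, in the requested feature order."""
    df = make_df()
    result = select_fn(df, columns)
    wanted = [columns] if isinstance(columns, str) else list(columns)
    expected = [(segment, column) for segment in SEGMENTS for column in wanted]
    assert list(result.columns) == expected
    for pair in expected:
        # Values must follow their column names, not just the names themselves.
        assert result[pair].equals(df[pair])


def test_select_one_column():
    check_select(select_by_pairs, "target")


def test_select_multiple_columns_keeps_requested_order():
    for columns in (["exog_1", "exog_2"], ["exog_2", "exog_1"]):
        check_select(select_by_pairs, columns)
```

The functions follow the pytest naming convention but use plain `assert`, so they can also be called directly.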

Additional context

No response


alex-hse-repository · 2022-07-22T05:49:27Z

Make sure that you do not forget to fix this, this, and this place in TSDataset.

alex-hse-repository · 2022-07-29T06:00:18Z

And here

alex-hse-repository · 2022-08-17T14:14:08Z

And here, here

Mr-Geekman · 2022-08-25T09:12:54Z

I'll try to explain the core of the issue.
We have a wide dataframe df. We want to select a few columns: [column_1, column_2]:

res = df.loc[:, pd.IndexSlice[:, [column_1, column_2]]]

In pandas 1.1*: we get a dataframe where, at the last level of the column index, column_1 and column_2 are ordered by their position in df. So, if column_2 comes before column_1 in df, we get them in an unexpected order: first the values from column_2 and then the values from column_1. The column names are ordered the same way as the values.

In pandas 1.1.* and >= 1.2: we get the columns in the order that we passed to loc.

If we make selection like:

res = df.loc[:, pd.IndexSlice[segments, [column_1, column_2]]]

then in both cases we get the order that we passed to loc.
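
A small reproduction sketch for checking this on whatever pandas version is installed. The frame and its names are made up for illustration, with column_2 stored before column_1 so that any reordering becomes visible. No expected output is shown because it depends on the pandas version.

```python
import numpy as np
import pandas as pd

segments = ["segment_0", "segment_1"]
# column_2 is deliberately stored before column_1 in the frame.
features = ["target", "column_2", "column_1"]
columns = pd.MultiIndex.from_product([segments, features])
df = pd.DataFrame(np.zeros((3, len(columns))), columns=columns)

requested = ["column_1", "column_2"]

# Selection without segments: the resulting column order may depend on the pandas version.
without_segments = df.loc[:, pd.IndexSlice[:, requested]]

# Selection with explicit segments.
with_segments = df.loc[:, pd.IndexSlice[segments, requested]]

print("pandas", pd.__version__)
print("without segments:", list(without_segments.columns))
print("with segments:   ", list(with_segments.columns))
```

Both selections contain the same set of columns; only their order is in question.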

Mr-Geekman · 2022-08-29T16:01:53Z

More detailed results. Imagine we have a df_wide with segments: ["segment_2", "segment_1", "segment_0"] and with features: ["target", "exog_2", "exog_1", "exog_0"].

Calling df_wide.loc[:, pd.IndexSlice[:, ["exog_1", "exog_2"]]] gives the following order of columns:

pandas=1.1.5:

segment_2/exog_2
segment_2/exog_1
segment_0/exog_2
segment_0/exog_1
segment_1/exog_2
segment_1/exog_1

pandas=1.3.5:

segment_2/exog_1
segment_0/exog_1
segment_1/exog_1
segment_2/exog_2
segment_0/exog_2
segment_1/exog_2

Calling df_wide.loc[:, pd.IndexSlice[["segment_2", "segment_0", "segment_1"], ["exog_1", "exog_2"]]] gives the following order of columns:

pandas=1.1.5:

segment_2/exog_1
segment_2/exog_2
segment_0/exog_1
segment_0/exog_2
segment_1/exog_1
segment_1/exog_2

pandas=1.3.5:

segment_2/exog_1
segment_2/exog_2
segment_0/exog_1
segment_0/exog_2
segment_1/exog_1
segment_1/exog_2

If we don't pass segments, the results differ between pandas versions.
If we pass segments, the results are the same.
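
One possible way to get a version-independent order without passing the segments, sketched under assumptions: `select_columns_stable` is a hypothetical helper, not the function merged in etna. It finds the matching columns by position and orders them explicitly, so it does not rely on how loc orders the result. Segments are kept in the order they appear in the frame, and features follow the requested order within each segment.

```python
import numpy as np
import pandas as pd


def select_columns_stable(df, columns):
    """Hypothetical helper: select features for all segments with a deterministic column order."""
    segment_values = df.columns.get_level_values(0)
    feature_values = df.columns.get_level_values(1)

    # Integer positions of all columns whose feature is requested.
    positions = np.flatnonzero(feature_values.isin(columns))

    # Rank segments by first appearance in the frame, features by position in the request.
    segment_rank = pd.factorize(segment_values[positions])[0]
    feature_rank = pd.Index(columns).get_indexer(feature_values[positions])

    # np.lexsort uses the last key as the primary one: segment first, then feature.
    order = np.lexsort((feature_rank, segment_rank))
    return df.iloc[:, positions[order]]


segments = ["segment_2", "segment_1", "segment_0"]
features = ["target", "exog_2", "exog_1", "exog_0"]
wide_columns = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
df_wide = pd.DataFrame(
    np.arange(3 * len(wide_columns), dtype=float).reshape(3, len(wide_columns)),
    columns=wide_columns,
)

result = select_columns_stable(df_wide, ["exog_1", "exog_2"])
print(list(result.columns))
```

Because the final selection goes through iloc with explicit positions, the order is fixed by the helper itself rather than by the installed pandas version. Whether this is fast enough on tens of thousands of segments would still need to be benchmarked.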

Mr-Geekman added the enhancement New feature or request label Jun 28, 2022

martins0n changed the title ~~Speed up columns slices~~ Speed up columns slices: etna.datasets.utils.select_columns Jun 28, 2022

martins0n added this to the Optimization milestone Jul 5, 2022

Mr-Geekman mentioned this issue Jul 21, 2022

New data access methods in TSDataset #809

Merged

4 tasks

Mr-Geekman mentioned this issue Aug 1, 2022

Create DirectEnsemble #824

Merged

4 tasks

alex-hse-repository added the important label Aug 17, 2022

Mr-Geekman self-assigned this Aug 25, 2022

Mr-Geekman mentioned this issue Sep 1, 2022

Speed up columns slices #900

Merged

4 tasks

Mr-Geekman closed this as completed in #900 Sep 7, 2022
