Speed up columns slices: etna.datasets.utils.select_columns #775

Closed
Mr-Geekman opened this issue Jun 28, 2022 · 5 comments · Fixed by #900

@Mr-Geekman
Contributor

Mr-Geekman commented Jun 28, 2022

🚀 Feature Request

In a lot of places we use df.loc[:, pd.IndexSlice[segments, column]] to select a column from all the segments. It turns out to be very slow when there are many segments.

We should find places where we use it and make sure that it can be replaced with df.loc[:, pd.IndexSlice[:, column]] without problems.

There was a problem with the second option: #188. We should investigate whether it still exists and under which conditions:

  1. Does it occur when selecting only one column? (SklearnTransform selects many.)
  2. Can it be avoided by some trick when taking slices (sorting the columns, for example)?

Proposal

  1. Find all places with the slow slice df.loc[:, pd.IndexSlice[segments, column]] where column is a scalar. Replace them with a function (you can add it to etna.datasets.utils). Try to replace the slow slice inside the function with the fast slice: df.loc[:, pd.IndexSlice[:, column]]. Make sure that in this case the columns are not reordered in different pandas versions.
  2. Do the same with a list of values in column (e.g. SklearnTransform) and investigate the reordering issue during testing. We want to avoid it without putting all the segments into the slice.
  3. Benchmark the changed transforms (or other calls) to confirm that they become faster. Add the benchmarking code and its results to the comments of the PR. E.g. you can take a dataframe with 50000 segments, 100 timestamps, 5 additional int columns, 5 additional float columns, and 5 additional category columns.
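
A minimal benchmarking sketch for item 3, assuming a synthetic wide dataframe whose columns are a (segment, feature) MultiIndex. The helper names make_wide_df and time_slices are hypothetical and not part of etna, and the sizes are scaled down from the 50000-segment suggestion so that it runs quickly:

```python
import timeit

import numpy as np
import pandas as pd


def make_wide_df(n_segments: int, n_timestamps: int, features: list) -> pd.DataFrame:
    """Build a synthetic wide dataframe with (segment, feature) columns (hypothetical helper)."""
    segments = [f"segment_{i}" for i in range(n_segments)]
    columns = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
    index = pd.date_range("2020-01-01", periods=n_timestamps, freq="D")
    values = np.random.default_rng(0).normal(size=(n_timestamps, len(columns)))
    # Sort the columns so that the MultiIndex is lexsorted.
    return pd.DataFrame(values, index=index, columns=columns).sort_index(axis=1)


def time_slices(df: pd.DataFrame, column: str, number: int = 3) -> dict:
    """Return the average time in seconds of the slow and the fast slice."""
    segments = df.columns.get_level_values("segment").unique().tolist()
    slow = timeit.timeit(lambda: df.loc[:, pd.IndexSlice[segments, column]], number=number)
    fast = timeit.timeit(lambda: df.loc[:, pd.IndexSlice[:, column]], number=number)
    return {"slow": slow / number, "fast": fast / number}


bench_df = make_wide_df(n_segments=200, n_timestamps=100, features=["target", "exog_0", "exog_1"])
timings = time_slices(bench_df, "target")
print(timings)
```

The absolute numbers depend on the machine and the pandas version, so only the ratio between the two timings is worth comparing.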

Test cases

  1. Make sure that current tests pass for scalar case.
  2. Make sure that current tests pass for list case.
  3. Add tests on function for selection of one column.
  4. Add tests on function for selection of multiple columns (in SklearnTransform we had some tests on reordering, it can be useful).

Additional context

No response

@Mr-Geekman Mr-Geekman added the enhancement New feature or request label Jun 28, 2022
@martins0n martins0n changed the title Speed up columns slices Speed up columns slices: etna.datasets.utils.select_columns Jun 28, 2022
@martins0n martins0n added this to the Optimization milestone Jul 5, 2022
@alex-hse-repository
Collaborator

alex-hse-repository commented Jul 22, 2022

Make sure that you do not forget to fix this, this and this places in TSDataset

@alex-hse-repository
Collaborator

And here

@alex-hse-repository
Collaborator

And here, here

@Mr-Geekman Mr-Geekman self-assigned this Aug 25, 2022
@Mr-Geekman
Contributor Author

Mr-Geekman commented Aug 25, 2022

I'll try to explain the core of the issue.
We have a wide dataframe df. We want to select a few columns: [column_1, column_2]:

res = df.loc[:, pd.IndexSlice[:, [column_1, column_2]]]

In pandas 1.1.*: in the resulting dataframe, column_1 and column_2 at the last index level are ordered by their position in df. So, if column_2 comes before column_1 in df, we get them in an unexpected order: first the values from column_2 and then the values from column_1. The column names are ordered in the same way as the values themselves.

In pandas >= 1.2: we get the columns in the order that we gave to loc.

If we make selection like:

res = df.loc[:, pd.IndexSlice[segments, [column_1, column_2]]]

then in both cases we get the order that we gave to loc.
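
A small script to check the behaviour on a given pandas installation. It is only a sketch: the dataframe contents and the segment and column names are made up for illustration, and the printed order is deliberately not asserted because it is the thing that varies between versions:

```python
import numpy as np
import pandas as pd

segments = ["segment_1", "segment_0"]
# column_2 goes before column_1 in df, as in the scenario described above.
features = ["target", "column_2", "column_1"]
columns = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
df = pd.DataFrame(np.arange(3 * len(columns)).reshape(3, -1), columns=columns)

# Selection without segments: the order may depend on the pandas version.
without_segments = df.loc[:, pd.IndexSlice[:, ["column_1", "column_2"]]]
# Selection with explicit segments.
with_segments = df.loc[:, pd.IndexSlice[segments, ["column_1", "column_2"]]]

print("pandas", pd.__version__)
print("without segments:", list(without_segments.columns))
print("with segments:   ", list(with_segments.columns))
```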

@Mr-Geekman
Contributor Author

More detailed results. Imagine we have a df_wide with segments: ["segment_2", "segment_1", "segment_0"] and with features: ["target", "exog_2", "exog_1", "exog_0"].

Calling df_wide.loc[:, pd.IndexSlice[:, ["exog_1", "exog_2"]]] gives order of columns:

  1. pandas=1.1.5:
  • segment_2/exog_2
  • segment_2/exog_1
  • segment_0/exog_2
  • segment_0/exog_1
  • segment_1/exog_2
  • segment_1/exog_1
  2. pandas=1.3.5:
  • segment_2/exog_1
  • segment_0/exog_1
  • segment_1/exog_1
  • segment_2/exog_2
  • segment_0/exog_2
  • segment_1/exog_2

Calling df_wide.loc[:, pd.IndexSlice[["segment_2", "segment_0", "segment_1"], ["exog_1", "exog_2"]]] gives order of columns:

  1. pandas=1.1.5:
  • segment_2/exog_1
  • segment_2/exog_2
  • segment_0/exog_1
  • segment_0/exog_2
  • segment_1/exog_1
  • segment_1/exog_2
  2. pandas=1.3.5:
  • segment_2/exog_1
  • segment_2/exog_2
  • segment_0/exog_1
  • segment_0/exog_2
  • segment_1/exog_1
  • segment_1/exog_2

To summarize:

  1. If we don't pass the segments, the results differ between pandas versions.
  2. If we pass the segments, the results are the same.
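
One possible way to keep the fast slice and still get a stable order is to reorder the result explicitly afterwards. The sketch below is an assumption for illustration, not the implementation that was merged: select_columns_fixed_order is a hypothetical name, and it assumes that the segment is the first column level and that each (segment, feature) pair is unique:

```python
import numpy as np
import pandas as pd


def select_columns_fixed_order(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: fast slice followed by an explicit column order."""
    # Fast slice without listing the segments; its column order may vary by pandas version.
    sliced = df.loc[:, pd.IndexSlice[:, columns]]
    # Segments in order of first appearance in df.
    segments = df.columns.get_level_values(0).unique()
    # Target order: for each segment, the features in the requested order.
    target = pd.MultiIndex.from_product([segments, columns], names=df.columns.names)
    return sliced.reindex(columns=target)


segments = ["segment_2", "segment_0", "segment_1"]
features = ["target", "exog_2", "exog_1", "exog_0"]
cols = pd.MultiIndex.from_product([segments, features], names=["segment", "feature"])
df_wide = pd.DataFrame(np.arange(2 * len(cols)).reshape(2, -1), columns=cols)

result = select_columns_fixed_order(df_wide, ["exog_1", "exog_2"])
print(list(result.columns))
```

The explicit reindex makes the result independent of what loc returns, at the cost of one extra pass over the selected columns. Whether it is still faster than passing all the segments should be checked with a benchmark.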
