[ENH] investigating & improving runtime performance of data type checks (input, output) #3827

danbartl · 2022-11-22T12:20:49Z

This issue investigates why data type checks can take very long.
A description of all checks can be found here:
https://docs.google.com/spreadsheets/d/1cDx_yg2HhFcPfNR1Trc1Vr1FDuSOTkXwid5GTKIS2aE/edit?usp=sharing

Most of the compute time is spent in `_index_equally_spaced` when the index is a `pd.PeriodIndex`.

Further investigation showed that `groupby` reduces the time needed slightly, but `pd.PeriodIndex` should still be avoided.

Findings are included here:
https://github.com/danbartl/sktime/blob/check_test/perftest.ipynb

Conclusion:

- Most checks are very fast; only `_index_equally_spaced` can take very long.
- I will create a pull request that changes the method to `groupby`, which improves performance.
- `pd.PeriodIndex` should be avoided for large datasets.
- Unfortunately, only `pd.PeriodIndex` supports the `freq` argument for a multiindex, which further supports the idea that `freq` should be an outside argument.
- The `polars` library performs 5 times as fast as the fastest `pandas` method.

fkiraly · 2022-11-22T12:33:35Z

Amazing and impressive!

Based on this, I wonder whether we can integrate a slightly more polished version of the performance testing into sktime itself, as a developer utility.
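
A minimal sketch of what such a developer utility could look like, assuming nothing about sktime internals. The names `best_runtime`, `profile_check` and `diff_based_check` are hypothetical, and the stand-in check is deliberately cheap (it works on the integer representation `asi8`), so swap in whichever check you actually want to profile.

```python
import time

import numpy as np
import pandas as pd


def best_runtime(func, *args, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def profile_check(check, indices, repeats=5):
    """Time `check` on every named index and return {name: seconds}."""
    return {
        name: best_runtime(check, index, repeats=repeats)
        for name, index in indices.items()
    }


def diff_based_check(index):
    """Hypothetical stand-in check: equal spacing on the integer representation."""
    diffs = np.diff(index.asi8)
    return diffs.size == 0 or bool((diffs == diffs[0]).all())


# Same time span, two index types, so that timings are comparable.
indices = {
    "DatetimeIndex": pd.date_range("2000-01-01", periods=10_000, freq="D"),
    "PeriodIndex": pd.period_range("2000-01-01", periods=10_000, freq="D"),
}

results = profile_check(diff_based_check, indices)
for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s")
```

Taking the minimum over several repeats reduces noise from other processes; absolute numbers will still depend on the machine and the `pandas` version.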

fkiraly · 2022-11-22T12:36:13Z

Regarding some of these checks:

- Some changes can be made by accident by transformers, but this can be checked in the test suite (as opposed to in an output check).

Regarding what transformers can change:

- In principle, transformers could return an empty data frame, e.g., when subsetting or selecting.
- Transformers can also change the number of time points, the number of samples in a panel, and the number of variables.

@KishManani

…ndex` (#3991) This PR fixes #3990; related PR: #3827. FYI @KishManani, @danbartl. Combining information from @danbartl and @KishManani, I understood where the issue must have been coming from - from a line of `np.diff` on the checked index, in `_index_equally_spaced`, but only in the case of `pd.PeriodIndex` (the latter condition I only understood from @KishManani's analysis). I have used this insight to replace the time and memory consuming check by a much quicker one, exploiting mathematical assumptions on the `pd.PeriodIndex` (unique and sorted index).
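
To illustrate the kind of shortcut described here, the sketch below works on the integer ordinals of a `pd.PeriodIndex` (`asi8`) instead of on `Period` objects. It is an assumption about how such a check could look, not the code merged in #3991. Both function names are hypothetical, and the first one relies on the index being sorted, unique and free of `NaT`.

```python
import numpy as np
import pandas as pd


def period_index_is_contiguous(index: pd.PeriodIndex) -> bool:
    """True if a sorted, unique PeriodIndex has no gaps (step of one period).

    Assumption: for a sorted, unique index the ordinals are strictly
    increasing integers, so "no gaps" is equivalent to
    last - first == number of elements - 1. No diff array is needed.
    """
    if len(index) < 2:
        return True
    ordinals = index.asi8  # int64 ordinals in units of the index frequency
    return int(ordinals[-1] - ordinals[0]) == len(index) - 1


def period_index_equally_spaced(index: pd.PeriodIndex) -> bool:
    """True if all steps are equal, computed on int64 ordinals."""
    diffs = np.diff(index.asi8)
    return diffs.size == 0 or bool((diffs == diffs[0]).all())


full = pd.period_range("2020-01", periods=12, freq="M")
gappy = full.delete(5)  # drop one month to create a gap

print(period_index_is_contiguous(full))
print(period_index_is_contiguous(gappy))
print(period_index_equally_spaced(full[::2]))
```

The first function runs in constant time but only answers "is the step exactly one period"; the second allows any constant step and is linear in the index length, while still avoiding object arrays.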

…pe checks (#3935) Increases speed of `pandas` based panel and hierachical mtype checks by using `groupby`. See #3827
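
As a rough illustration of the `groupby` idea, the sketch below checks that the time index is sorted within every instance of a panel, once with a Python loop over instances and once with a single `groupby`. The function names and the toy data are hypothetical; the actual checks changed in #3935 may look different.

```python
import pandas as pd


def instances_sorted_loop(df: pd.DataFrame) -> bool:
    """Check per-instance time ordering with an explicit loop over instances."""
    instance_keys = df.index.get_level_values(0).unique()
    for key in instance_keys:
        # df.loc[key] selects one instance; its index is the time level
        if not df.loc[key].index.is_monotonic_increasing:
            return False
    return True


def instances_sorted_groupby(df: pd.DataFrame) -> bool:
    """Same check, expressed as one groupby over the instance level."""
    time_values = df.index.get_level_values(-1).to_series(index=df.index)
    per_instance = time_values.groupby(level=0, sort=False).apply(
        lambda s: s.is_monotonic_increasing
    )
    return bool(per_instance.all())


index = pd.MultiIndex.from_product(
    [["a", "b"], pd.date_range("2020-01-01", periods=3)],
    names=["instance", "time"],
)
panel = pd.DataFrame({"y": range(6)}, index=index)
shuffled = panel.iloc[[0, 2, 1, 3, 4, 5]]  # instance "a" is out of order

print(instances_sorted_groupby(panel))
print(instances_sorted_groupby(shuffled))
```

The loop pays the cost of a `.loc` lookup per instance, which adds up when there are many instances; the `groupby` variant splits the data once.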

…or non-nested data (#4130) Check modules take very long to run, see also #3827. This PR changes the checks for nested data frames by adding a quick check of whether all columns are of dtype `object`; only then does it make sense to run the much slower `are_columns_nested` check. This greatly speeds up checks on large hierarchical / panel data sets, since the check completes much more quickly.
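
The early-exit pattern described here can be sketched as follows. `all_columns_object`, `slow_nested_check` and `check_nested` are hypothetical names; in particular, `slow_nested_check` is only a stand-in for an expensive cell-by-cell inspection and is not sktime's `are_columns_nested`.

```python
import pandas as pd


def all_columns_object(df: pd.DataFrame) -> bool:
    """Cheap pre-check: nested cells can only live in object-dtype columns."""
    return bool((df.dtypes == object).all())


def slow_nested_check(df: pd.DataFrame) -> bool:
    """Hypothetical expensive check: every cell must be a pd.Series."""
    return all(
        isinstance(cell, pd.Series)
        for _, column in df.items()
        for cell in column
    )


def check_nested(df: pd.DataFrame) -> bool:
    """Run the expensive check only if the cheap dtype check passes."""
    return all_columns_object(df) and slow_nested_check(df)


flat = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
nested = pd.DataFrame({"a": [pd.Series([1, 2]), pd.Series([3, 4])]})

print(check_nested(flat))
print(check_nested(nested))
```

For a large numeric panel the dtype comparison touches only one value per column, so the cell-by-cell loop is never entered.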

@danbartl

Split out of #4140. Contributes to #4139. Overlaps with #3827, #3991. This PR improves the performance of `check_pdmultiindex_panel`. It was developed in parallel to the work of @danbartl, so some improvements might be outdated or just implemented differently.

danbartl added the enhancement Adding new functionality label Nov 22, 2022

fkiraly changed the title ~~[ENH] Change logic of checks~~ [ENH] invesgitating & improving runtime performance of data input checks Nov 22, 2022

fkiraly changed the title ~~[ENH] invesgitating & improving runtime performance of data input checks~~ [ENH] invesgitating & improving runtime performance of data type checks (input, output) Nov 22, 2022

fkiraly changed the title ~~[ENH] invesgitating & improving runtime performance of data type checks (input, output)~~ [ENH] investigating & improving runtime performance of data type checks (input, output) Nov 22, 2022

fkiraly added module:datatypes datatypes module: data containers, checkers & converters module:base-framework BaseObject, registry, base framework labels Nov 22, 2022

danbartl mentioned this issue Dec 14, 2022

[ENH] improve performance of pandas based panel and hierachical mtype checks #3935

Merged

This was referenced Dec 24, 2022

[ENH] Improve space and time performance during transform when using PeriodIndex for time related transformers #3990

Closed

[ENH] speed up mtype check for pandas based mtypes with pd.PeriodIndex #3991

Merged

fkiraly pushed a commit that referenced this issue Jan 2, 2023

[ENH] improve performance of pandas based panel and hierachical mty…

1257ec2

…pe checks (#3935) Increases speed of `pandas` based panel and hierachical mtype checks by using `groupby`. See #3827

danbartl mentioned this issue Jan 20, 2023

[ENH] significantly speed up nested_univ (nested dataframe) check for non-nested data #4130

Merged

hoesler mentioned this issue Jan 23, 2023

[ENH] Improve performance of pandas multi-index data type operations with many groups #4139

Open

hoesler mentioned this issue Feb 3, 2023

[ENH] Improve panel mtype check performance #4196

Merged

5 tasks
