[ENH] investigating & improving runtime performance of data type checks (input, output) #3827

Open
danbartl opened this issue Nov 22, 2022 · 2 comments
Labels
enhancement Adding new functionality module:base-framework BaseObject, registry, base framework module:datatypes datatypes module: data containers, checkers & converters

Comments

@danbartl
Collaborator

danbartl commented Nov 22, 2022

This issue investigates why data type checks can take very long.
A description of all checks can be found here:
https://docs.google.com/spreadsheets/d/1cDx_yg2HhFcPfNR1Trc1Vr1FDuSOTkXwid5GTKIS2aE/edit?usp=sharing

Most of the computing time was spent in `_index_equally_spaced` when the index is a `pd.PeriodIndex`.

Further investigation showed that `groupby` can reduce the time needed slightly, but `PeriodIndex` should be avoided.

Findings are included here:
https://github.com/danbartl/sktime/blob/check_test/perftest.ipynb

Conclusion:

  • most checks are very fast; only `_index_equally_spaced` can take very long
  • I will create a pull request that changes the method to `groupby`, which improves performance
  • `PeriodIndex` should be avoided for large datasets
  • unfortunately, only `PeriodIndex` supports the `freq` argument for a multiindex, which further supports the idea that `freq` should be an outside argument
  • the polars library is 5 times as fast as the fastest pandas method
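
To illustrate the `groupby` idea from the conclusion, here is a minimal sketch of a per-instance equal-spacing check on a panel `pd.MultiIndex`. It is not the sktime implementation: the function name `panel_equally_spaced` and the level names `instance` and `time` are assumptions made for this example, and a `DatetimeIndex` time level is used rather than a `PeriodIndex`.

```python
import pandas as pd


def panel_equally_spaced(index: pd.MultiIndex) -> bool:
    """Hypothetical helper: is every instance (first level) equally spaced in time (last level)?"""
    frame = index.to_frame(index=False)
    inst_col, time_col = frame.columns[0], frame.columns[-1]
    # One vectorised diff per instance, instead of a Python loop over instances.
    steps = frame.groupby(inst_col, sort=False)[time_col].diff()
    # The first row of each instance has no predecessor; its missing step is ignored by nunique.
    n_distinct = steps.groupby(frame[inst_col], sort=False).nunique()
    return bool((n_distinct <= 1).all())


regular = pd.MultiIndex.from_product(
    [["a", "b"], pd.date_range("2020-01-01", periods=4, freq="D")],
    names=["instance", "time"],
)
irregular = pd.MultiIndex.from_tuples(
    [
        ("a", pd.Timestamp("2020-01-01")),
        ("a", pd.Timestamp("2020-01-02")),
        ("a", pd.Timestamp("2020-01-04")),  # gap of two days
    ],
    names=["instance", "time"],
)
print(panel_equally_spaced(regular), panel_equally_spaced(irregular))
```

Note that this sketch only compares spacing within each instance; two instances with different step sizes would still pass.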
@danbartl danbartl added the enhancement Adding new functionality label Nov 22, 2022
@fkiraly
Collaborator

fkiraly commented Nov 22, 2022

Amazing and impressive!

Based on this, I wonder whether we can integrate a slightly more polished version of the performance testing into sktime itself, as a developer utility.
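
As a rough sketch of what such a developer utility might look like, the snippet below times a few checks on the same object and reports them from slowest to fastest. All names here (`time_check`, `no_missing`, `equally_spaced`) are hypothetical and written for this example; they are not existing sktime functions.

```python
import time

import numpy as np
import pandas as pd


def time_check(check, obj, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of check(obj)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        check(obj)
        best = min(best, time.perf_counter() - start)
    return best


def no_missing(df):
    # Illustrative check: no NaN values anywhere in the frame.
    return not df.isna().to_numpy().any()


def equally_spaced(df):
    # Illustrative check: all consecutive index differences are identical.
    steps = np.diff(df.index.to_numpy())
    return bool((steps == steps[0]).all())


df = pd.DataFrame(
    {"y": np.arange(100_000, dtype=float)},
    index=pd.date_range("2000-01-01", periods=100_000, freq="h"),
)
checks = {"no_missing": no_missing, "equally_spaced": equally_spaced}
timings = {name: time_check(fn, df) for name, fn in checks.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds:.6f} s")
```

Taking the best of several repeats reduces noise from other processes; a more polished version could also vary the index type and data size.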

@fkiraly
Collaborator

fkiraly commented Nov 22, 2022

Re some of these checks:

Some changes can be made by accident by transformers, but this can be certified in the test suite (as opposed to in an output check).

Regarding what transformers can change:

  • in principle, transformers could return an empty data frame, e.g., when subsetting or selecting
  • transformers can also change the number of time points, number of samples in a panel, number of variables

@fkiraly fkiraly changed the title [ENH] Change logic of checks [ENH] invesgitating & improving runtime performance of data input checks Nov 22, 2022
@fkiraly fkiraly changed the title [ENH] invesgitating & improving runtime performance of data input checks [ENH] invesgitating & improving runtime performance of data type checks (input, output) Nov 22, 2022
@fkiraly fkiraly changed the title [ENH] invesgitating & improving runtime performance of data type checks (input, output) [ENH] investigating & improving runtime performance of data type checks (input, output) Nov 22, 2022
@fkiraly fkiraly added module:datatypes datatypes module: data containers, checkers & converters module:base-framework BaseObject, registry, base framework labels Nov 22, 2022
fkiraly added a commit that referenced this issue Dec 30, 2022
…ndex` (#3991)

This PR fixes #3990; related PR: #3827. FYI @KishManani, @danbartl.

Combining information from @danbartl and @KishManani, I understood where the issue must have been coming from - from a line of `np.diff` on the checked index, in `_index_equally_spaced`, but only in the case of `pd.PeriodIndex` (the latter condition I only understood from @KishManani's analysis).

I have used this insight to replace the time and memory consuming check by a much quicker one, exploiting mathematical assumptions on the `pd.PeriodIndex` (unique and sorted index).
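
One way such a shortcut could look is sketched below. This is an assumption about the general idea, not the code from the PR: the function name is hypothetical, and "equally spaced" is read here as "consecutive periods with no gaps". If the index is unique and sorted, that holds exactly when the ordinal distance between the first and last period equals the number of elements minus one, so no elementwise difference is needed.

```python
import pandas as pd


def period_index_has_no_gaps(index: pd.PeriodIndex) -> bool:
    """Hypothetical O(1) check; assumes `index` is unique and sorted."""
    if len(index) <= 1:
        return True
    # Period.ordinal counts periods at the index frequency.
    span = index[-1].ordinal - index[0].ordinal
    return bool(span == len(index) - 1)


full = pd.period_range("2020-01", periods=5, freq="M")
gappy = pd.PeriodIndex(["2020-01", "2020-02", "2020-04"], freq="M")
print(period_index_has_no_gaps(full), period_index_has_no_gaps(gappy))
```

The sketch does not verify uniqueness or sortedness itself, and it would reject an index that is regularly spaced with a step larger than one period.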
fkiraly pushed a commit that referenced this issue Jan 2, 2023
…pe checks (#3935)

Increases speed of `pandas` based panel and hierarchical mtype checks by using `groupby`.

See #3827
fkiraly pushed a commit that referenced this issue Jan 29, 2023
…or non-nested data (#4130)

Check modules take very long to run, see also #3827

This PR changes the checks for nested data frames by adding a quick check of whether all columns have `object` dtype - only then does it make sense to run the much slower `are_columns_nested` check.

This greatly speeds up checks on large hierarchical / panel data sets, since the check passes much more quickly.
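
The early-exit pattern described here can be sketched as follows. The helper `all_cells_are_series` is a hypothetical stand-in for the slow cell-by-cell scan, not sktime's `are_columns_nested`, and `is_nested_frame` is likewise a name invented for this example.

```python
import pandas as pd


def all_cells_are_series(df: pd.DataFrame) -> bool:
    # Slow path: inspects every cell in Python.
    return all(isinstance(cell, pd.Series) for col in df.columns for cell in df[col])


def is_nested_frame(df: pd.DataFrame) -> bool:
    # Fast path: a column holding pd.Series cells must have object dtype,
    # so any non-object column rules out a nested frame without scanning cells.
    if not bool((df.dtypes == object).all()):
        return False
    return all_cells_are_series(df)


nested = pd.DataFrame({"var_0": [pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0])]})
flat = pd.DataFrame({"var_0": [1.0, 2.0], "var_1": [3.0, 4.0]})
print(is_nested_frame(nested), is_nested_frame(flat))
```

For a large numeric panel in `pd.MultiIndex` format, the dtype comparison touches one value per column, so the expensive scan is never reached.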
fkiraly pushed a commit that referenced this issue Feb 13, 2023
Split out of #4140
Contributes to #4139
Overlaps with #3827, #3991

This PR improves the performance of `check_pdmultiindex_panel`.

It was developed in parallel to the work of @danbartl, so some improvements might be outdated or just implemented differently.