[ENH] investigating & improving runtime performance of data type checks (input, output) #3827
Labels
enhancement
Adding new functionality
module:base-framework
BaseObject, registry, base framework
module:datatypes
datatypes module: data containers, checkers & converters
Comments
Amazing and impressive! Based on this, I wonder whether we can integrate a slightly more polished version of the performance testing into |
Re some of these checks: some changes can be made by accident by transformers, but this can be certified in the test suite (as opposed to in an output check). Regarding whether transformers can change:
|
fkiraly
changed the title
[ENH] Change logic of checks
[ENH] invesgitating & improving runtime performance of data input checks
Nov 22, 2022
fkiraly
changed the title
[ENH] invesgitating & improving runtime performance of data input checks
[ENH] invesgitating & improving runtime performance of data type checks (input, output)
Nov 22, 2022
fkiraly
changed the title
[ENH] invesgitating & improving runtime performance of data type checks (input, output)
[ENH] investigating & improving runtime performance of data type checks (input, output)
Nov 22, 2022
fkiraly
added
module:datatypes
datatypes module: data containers, checkers & converters
module:base-framework
BaseObject, registry, base framework
labels
Nov 22, 2022
This was referenced Dec 24, 2022
fkiraly
added a commit
that referenced
this issue
Dec 30, 2022
…ndex` (#3991) This PR fixes #3990; related PR: #3827. FYI @KishManani, @danbartl. Combining information from @danbartl and @KishManani, I understood where the issue must have been coming from - from a line of `np.diff` on the checked index, in `_index_equally_spaced`, but only in the case of `pd.PeriodIndex` (the latter condition I only understood from @KishManani's analysis). I have used this insight to replace the time and memory consuming check by a much quicker one, exploiting mathematical assumptions on the `pd.PeriodIndex` (unique and sorted index).
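To illustrate the idea in this commit message, here is a minimal sketch of how a sorted and unique `pd.PeriodIndex` can be checked for gaps without calling `np.diff` on the Period objects. It is not the code from the PR: the function name `period_index_has_no_gaps` is hypothetical, and the sketch assumes the caller has already established that the index is sorted and unique. Under that assumption, the index has no gaps exactly when the distance between its first and last ordinal equals `len(index) - 1`.

```python
import pandas as pd


def period_index_has_no_gaps(index: pd.PeriodIndex) -> bool:
    """Hypothetical helper: True if a sorted, unique PeriodIndex has no gaps.

    Assumption: the caller has already checked that `index` is sorted and unique.
    """
    if len(index) < 3:
        # fewer than three points are trivially equally spaced
        return True
    # int64 ordinals behind the index; no Period objects are created
    ordinals = index.asi8
    # sorted + unique + (last - first == n - 1) implies every step equals 1
    return bool(ordinals[-1] - ordinals[0] == len(index) - 1)


full = pd.period_range("2020-01", periods=5, freq="M")
gappy = full.delete(2)  # drop 2020-03, leaving a gap

print(period_index_has_no_gaps(full))
print(period_index_has_no_gaps(gappy))
```

The check touches only two ordinals, so its cost does not grow with the length of the index, whereas an elementwise difference has to visit every entry.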
fkiraly
pushed a commit
that referenced
this issue
Jan 29, 2023
…or non-nested data (#4130) Check modules take very long to run, see also #3827. This PR changes the checks for nested data frames: a quick check first tests whether all columns are of `object` dtype, and only then is the much slower `are_columns_nested` check run. This greatly speeds up checks on large hierarchical / panel data sets, since non-nested data is rejected much more quickly.
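The short-circuit described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the code from the PR: `all_columns_are_object`, `slow_cellwise_nested_check` and `is_nested_frame` are hypothetical names, and the slow check is a simple stand-in for `are_columns_nested`, whose actual implementation is not shown in this thread.

```python
import numpy as np
import pandas as pd


def all_columns_are_object(df: pd.DataFrame) -> bool:
    """Cheap test: looks at one dtype per column, never at the cells."""
    return bool((df.dtypes == object).all())


def slow_cellwise_nested_check(df: pd.DataFrame) -> bool:
    """Hypothetical stand-in for the slow check: inspects every cell."""
    return all(
        isinstance(cell, pd.Series) for _, column in df.items() for cell in column
    )


def is_nested_frame(df: pd.DataFrame) -> bool:
    """Run the slow cell-wise check only if the cheap dtype test passes."""
    if not all_columns_are_object(df):
        return False
    return slow_cellwise_nested_check(df)


# a flat numeric frame is rejected by the dtype test alone
flat = pd.DataFrame({"var_0": [1.0, 2.0], "var_1": [3.0, 4.0]})

# a nested frame: every cell of the object column holds a pd.Series
cells = np.empty(2, dtype=object)
cells[0] = pd.Series([1.0, 2.0])
cells[1] = pd.Series([3.0, 4.0])
nested = pd.DataFrame({"var_0": cells})

print(is_nested_frame(flat), is_nested_frame(nested))
```

For a large numeric panel, the dtype test costs one comparison per column, so the cell-by-cell loop is skipped entirely.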
Investigation into why data type checks can take very long.
A description of all checks can be found here:
https://docs.google.com/spreadsheets/d/1cDx_yg2HhFcPfNR1Trc1Vr1FDuSOTkXwid5GTKIS2aE/edit?usp=sharing
After some investigation, most of the computing time turned out to be spent in `_index_equally_spaced` when the index is a `pd.PeriodIndex`.
Further investigation showed that groupby can slightly reduce the time needed, but `pd.PeriodIndex` should be avoided.
Findings are included here:
https://github.com/danbartl/sktime/blob/check_test/perftest.ipynb
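For readers who want to reproduce the `PeriodIndex` observation without opening the notebook, a small timing sketch follows. It is not taken from the notebook: the function names are hypothetical, and the "ordinal" variant is only one possible alternative, assuming that working on the int64 ordinals exposed by `asi8` is acceptable. Both functions assume an index with at least two entries.

```python
import time

import numpy as np
import pandas as pd


def equally_spaced_object_diff(index: pd.PeriodIndex) -> bool:
    """np.diff over Period objects: object-dtype arithmetic per element."""
    diffs = np.diff(index)
    return bool(np.all(diffs == diffs[0]))


def equally_spaced_ordinal_diff(index: pd.PeriodIndex) -> bool:
    """np.diff over the int64 ordinals behind the PeriodIndex."""
    diffs = np.diff(index.asi8)
    return bool(np.all(diffs == diffs[0]))


def time_call(func, index, repeats: int = 3) -> float:
    """Best wall-clock time in seconds over `repeats` calls of func(index)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(index)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    index = pd.period_range("2000-01-01", periods=20_000, freq="D")
    for func in (equally_spaced_object_diff, equally_spaced_ordinal_diff):
        print(func.__name__, f"{time_call(func, index):.4f} s")
```

Timings depend on the machine and the pandas version, so no expected numbers are given here.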
Conclusion: