[ENH] Improve panel mtype check performance #4196

hoesler · 2023-02-03T10:07:14Z

Reference Issues/PRs

Split out of #4140
Contributes to #4139
Overlaps with #3827, #3991

What does this implement/fix? Explain your changes.

This PR improves the performance of check_pdmultiindex_panel.

It was developed in parallel to the work of @danbartl, so some improvements might be outdated or implemented differently there. I would be grateful if he could help with the review here.

danbartl · 2023-02-08T22:05:23Z

Looks great to me. I think the only "collision" is with my approach to changing get_cutoff, but I believe I have since removed that.

Sorry, I am a bit caught up at the moment; I will give this a closer look in the next few days.

danbartl · 2023-02-09T09:54:16Z

Hi, I am really confused now: to me, performance seems to go down with the latest PR?

hoesler · 2023-02-09T13:02:19Z

> Hi, I am really confused now: to me, performance seems to go down with the latest PR?

What are you referring to by "the latest PR"?
I am still getting a much better performance for check_pdmultiindex_panel here than on main.

0.09s vs 12.4s using this test case:

import timeit

import numpy as np
import pandas as pd

from sktime.datatypes._panel._check import check_pdmultiindex_panel

index = pd.MultiIndex.from_product(
    [np.arange(1000), pd.period_range(end="2022-12-31", periods=365 * 5, freq="D")],
    names=["store", "date"],
)

data = np.random.default_rng().standard_normal((len(index), 50), dtype=np.float32)

df = pd.DataFrame(data=data, index=index)

print(timeit.timeit(lambda: check_pdmultiindex_panel(df), number=10) / 10)

danbartl · 2023-02-09T22:23:31Z

Hi, sorry, I didn't have the time to test thoroughly. I use the following code:

# -*- coding: utf-8 -*-
import cProfile
import pstats

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

from sktime.datatypes._utilities import get_time_index
from sktime.forecasting.compose import ForecastingPipeline, make_reduction
from sktime.forecasting.compose._reduce import _RecursiveReducer
from sktime.transformations.series.date import DateTimeFeatures
from sktime.transformations.series.summarize import WindowSummarizer
from sktime.utils._testing.hierarchical import _make_hierarchical

regressor = make_pipeline(
    LinearRegression(),
)

kwargs = {
    "lag_feature": {
        "lag": [1],
    }
}


# global recursive reduction forecaster with a single lag feature
forecaster_global = make_reduction(
    regressor,
    scitype="tabular-regressor",
    transformers=[WindowSummarizer(**kwargs, n_jobs=1, truncate="bfill")],
    window_length=None,
    strategy="recursive",
    pooling="global",
)

# 10,000 series with 1,000 time points each
y = _make_hierarchical(
    hierarchy_levels=(10000,), min_timepoints=1000, max_timepoints=1000
)

y_no_freq = get_time_index(y.reset_index().set_index(["h0", "time"]))


def Main():

    # from time import perf_counter
    # t1_start = perf_counter()

    _ = forecaster_global.fit(y)

    y_pred_global = forecaster_global.predict(fh=[1, 2])

    # t1_stop = perf_counter()
    # print("Elapsed time during the whole program in seconds:", t1_stop-t1_start)


# profile the full fit/predict run and write the stats to a file
cProfile.run("Main()", "output_newmulti_adjust.dat")

When I inspected the cProfile output, this showed slower performance for me. I hope I can dig in a bit more tomorrow.

hoesler · 2023-02-10T10:14:31Z

Ok, so you are testing more than just check_pdmultiindex_panel. I think that can be misleading, or were you comparing CPU times for the function within the profiles?
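
To compare a single function inside two profiles, the stats file written by cProfile can be filtered by function name with pstats. A minimal sketch, assuming a stand-in workload: `check_stub` and `workload` are hypothetical placeholders for check_pdmultiindex_panel and the fit/predict run, and the file name is arbitrary.

```python
import cProfile
import io
import os
import pstats
import tempfile


def check_stub(n):
    # hypothetical stand-in for the function of interest
    return sum(i * i for i in range(n))


def workload():
    # hypothetical stand-in for a larger run that calls the check repeatedly
    return [check_stub(10_000) for _ in range(20)]


# profile the workload and dump the raw stats to a file
profile_path = os.path.join(tempfile.mkdtemp(), "profile.dat")
profiler = cProfile.Profile()
profiler.runcall(workload)
profiler.dump_stats(profile_path)

# load the stats and print only the rows whose name matches "check_stub"
buffer = io.StringIO()
stats = pstats.Stats(profile_path, stream=buffer)
stats.sort_stats("cumulative").print_stats("check_stub")
report = buffer.getvalue()
print(report)
```

Running this once per branch and comparing the cumulative time of the matching row isolates the check from the rest of the pipeline.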

danbartl · 2023-02-10T13:55:58Z

> Ok, so you are testing more than just check_pdmultiindex_panel. I think that can be misleading, or were you comparing CPU times for the function within the profiles?

I broke it down to your example:

import timeit

import numpy as np
import pandas as pd

from sktime.datatypes._panel._check import check_pdmultiindex_panel
from sktime.utils._testing.hierarchical import _make_hierarchical

# index = pd.MultiIndex.from_product(
#     [np.arange(1000), pd.period_range(end="2022-12-31", periods=365 * 5, freq="D").to_timestamp()],
#     names=["store", "date"],
# )
y = _make_hierarchical(
    hierarchy_levels=(10000,), min_timepoints=1000, max_timepoints=1000
)

# not needed when timing y; these lines depend on the commented-out index above
# data = np.random.default_rng().standard_normal((len(index), 50), dtype=np.float32)
# df = pd.DataFrame(data=data, index=index)

print(timeit.timeit(lambda: check_pdmultiindex_panel(y), number=10) / 10)

I think the key point is that you consider PeriodIndex, while I consider DatetimeIndex. With DatetimeIndex and many groups, the code in this PR takes twice as long on my PC. Maybe we need to implement separate logic to check for PeriodIndex.

The great boost for your PeriodIndex case comes from this line, I guess?

    if isinstance(index, pd.PeriodIndex):
        return index.is_full

Great find, we definitely need that.
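
For context on the shortcut quoted above: `PeriodIndex.is_full` is an existing pandas property. As far as I know, it reports whether every period between the first and last entry is present, and it expects a monotonically increasing index. A minimal sketch on toy data (not the PR's code):

```python
import pandas as pd

# five consecutive daily periods: no gaps
full = pd.period_range("2022-01-01", periods=5, freq="D")

# the same index with the third period removed: one gap
gappy = full.delete(2)

print(full.is_full)
print(gappy.is_full)
```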

hoesler · 2023-02-10T15:53:23Z

Interesting. I will investigate further with your code.
And yes, I was focusing on PeriodIndex. The main finding here was that operations are fast if you use its numeric representation, like with is_full.
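
To illustrate what "numeric representation" means here, a rough sketch that checks equal spacing on the int64 values behind the index. `equally_spaced_ordinals` is a hypothetical helper written for this example, not the `_index_equally_spaced` function from sktime; it assumes a PeriodIndex or DatetimeIndex, both of which expose `asi8`.

```python
import numpy as np
import pandas as pd


def equally_spaced_ordinals(index):
    """Hypothetical helper: True if consecutive int64 values differ by a constant."""
    # asi8 is the int64 view: period ordinals or timestamps since the epoch
    steps = np.diff(index.asi8)
    return bool(len(steps) == 0 or (steps == steps[0]).all())


periods = pd.period_range(end="2022-12-31", periods=365 * 5, freq="D")
dates = pd.date_range(end="2022-12-31", periods=365 * 5, freq="D")

print(equally_spaced_ordinals(periods))
print(equally_spaced_ordinals(dates))
print(equally_spaced_ordinals(periods.delete(10)))
```

Working on plain integers avoids constructing Period or Timestamp objects element by element, which is presumably where much of the time goes otherwise.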

hoesler · 2023-02-10T17:59:56Z

I did a quick comparison of three scenarios: my example with two different index types, and yours.

This PR@0e13388:
panel_period: 0.090
panel_datetime: 0.098
hier_datetime: 1.1557

main@8a14adb:
panel_period: 12.308
panel_datetime: 0.628
hier_datetime: 1.472

My code performed consistently better. For your example the difference is small, but it's still better. For the other two it's significant, and not only for the period index.

Very strange that we observe such different outcomes.

import timeit

import numpy as np
import pandas as pd

from sktime.datatypes._panel._check import check_pdmultiindex_panel
from sktime.utils._testing.hierarchical import _make_hierarchical


def create_panel(time_range_fun):
    index = pd.MultiIndex.from_product(
        [np.arange(1000), time_range_fun(end="2022-12-31", periods=365 * 5, freq="D")],
        names=["store", "date"],
    )

    data = np.random.default_rng().standard_normal((len(index), 50), dtype=np.float32)

    return pd.DataFrame(data=data, index=index)


for key, val in {
    "panel_period": create_panel(pd.period_range),
    "panel_datetime": create_panel(pd.date_range),
    "hier_datetime": _make_hierarchical(
        hierarchy_levels=(10000,), min_timepoints=1000, max_timepoints=1000
    ),
}.items():

    def func():
        check_pdmultiindex_panel(val, return_metadata=False)

    n = 10
    avg_time = timeit.timeit(func, number=n) / n

    print(f"{key}: {avg_time}")

hoesler · 2023-02-12T13:33:39Z

I found some further improvements.
Latest benchmark (minimum runtime, now including the metadata calculation, for the scenarios above; yours now uses a sorted index):

panel_period: 0.145111
panel_datetime: 0.172680
hier_datetime: 0.991199

vs

panel_period: 12.283228
panel_datetime: 0.544896
hier_datetime: 1.357230

The slowest part is now _index_equally_spaced, but I don't see how we could improve that, except maybe by storing its result as metadata in the index to prevent repeated calculation. I don't know if that is feasible, but it is worth investigating.
And of course, prevent unnecessary metadata calculation (#4191)!
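
A "min runtime" figure like the ones above can be obtained with timeit.repeat, taking the smallest of several repeats instead of the average. A minimal sketch, assuming a stand-in workload: `min_runtime` and `workload` are hypothetical names, and `workload` replaces the call to check_pdmultiindex_panel.

```python
import timeit


def min_runtime(func, repeat=5, number=1):
    """Hypothetical helper: smallest per-call time over several repeats."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def workload():
    # hypothetical stand-in for the function being benchmarked
    return sorted(range(100_000), reverse=True)


best = min_runtime(workload)
print(f"workload: {best:.6f}")
```

The minimum is usually less sensitive to background load on the machine than the mean, which may matter when results differ between machines.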

fkiraly · 2023-02-12T15:48:11Z

FYI, I'm currently working on using the config mechanisms to turn off checks, metadata etc - I suppose that's orthogonal to this PR.

hoesler · 2023-02-12T16:27:18Z

Nice, thanks. I haven't started on an implementation yet, but my idea was to split check and metadata into two methods and, as a first step, call the existing functions as a fallback somehow. But I guess this is indeed orthogonal.

fkiraly · 2023-02-12T19:50:20Z

> but my idea was to split check and metadata into two methods and, as a first step, call the existing functions as a fallback somehow. But I guess this is indeed orthogonal.

Hm, good idea. What would it look like, in an example?
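
One possible reading of the proposed split, purely as an illustration: a cheap validity check and a separate metadata function that callers invoke only when needed. The names `check_panel` and `panel_metadata`, the checks performed, and the metadata keys are assumptions made for this sketch, not sktime's actual interface.

```python
import pandas as pd


def check_panel(obj):
    """Validity only: return (valid, message) without computing metadata."""
    if not isinstance(obj, pd.DataFrame):
        return False, "obj must be a pd.DataFrame"
    if not isinstance(obj.index, pd.MultiIndex) or obj.index.nlevels != 2:
        return False, "obj must have a 2-level MultiIndex"
    return True, ""


def panel_metadata(obj):
    """Metadata only: assumes check_panel(obj) has already passed."""
    # number of rows per instance (first index level)
    sizes = obj.groupby(level=0, sort=False).size()
    return {
        "n_instances": len(sizes),
        "is_equal_length": bool(sizes.nunique() == 1),
    }


index = pd.MultiIndex.from_product(
    [range(3), pd.period_range("2022-01-01", periods=4, freq="D")],
    names=["store", "date"],
)
df = pd.DataFrame({"y": range(len(index))}, index=index)

valid, msg = check_panel(df)
print(valid, panel_metadata(df) if valid else msg)
```

With this shape, code that only needs validation never pays for the groupby in the metadata step.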

fkiraly · 2023-02-12T19:51:38Z

Here are two relevant issues:

[ENH] refactor datatypes mtype related functionality into classes #3512 - I think the mtypes should be turned into classes, they are a bit all over the place now.
[ENH] Investigate visions for use in datatypes module #2337 - we also thought of using visions in a refactor

danbartl · 2023-02-12T21:13:15Z

@fkiraly @hoesler
I still get reduced performance for the 10,000 groups example. Still, in my view this can be merged, since performance is really good for fewer groups, and the increase in runtime is not too high and apparently hardware dependent. @fkiraly, if you have time to run the code on your machine, that would be nice; otherwise it is good to go from my side.

Thanks a lot!

fkiraly · 2023-02-12T22:58:53Z

Performance statistics

main@c41efee7f - note that includes #4195

panel_period: 37.70587424999103
panel_datetime: 1.4263513099984266
hier_datetime: 2.135691480000969

this@0e1338875

panel_period: 0.2671109699993394
panel_datetime: 0.28593878000974654
hier_datetime: 2.568562889995519

after merge of main (including #4195) into this PR:

panel_period: 0.002741659991443157
panel_datetime: 0.0015702700009569525
hier_datetime: 2.3584587200079112

i.e., I also see hier_datetime getting slower

fkiraly

Thanks a lot! Great improvement!

I would also approve it.

fkiraly · 2023-02-13T00:24:32Z

> Still, in my view this can be merged, since performance is really good for fewer groups, and the increase in runtime is not too high and apparently hardware dependent. @fkiraly, if you have time to run the code on your machine, that would be nice; otherwise it is good to go from my side.

I'll take this as approval, @danbartl - did you want to press the "approve" button?

hoesler added 11 commits February 2, 2023 13:54

improve nested panel check

a4118b0

improve _index_equally_spaced performance for PeriodIndex

4f435e5

improve get_cutoff performance

3a12bf3

improve panel check

c082e19

fix test errors

9922eaa

fix is_monotonic_increasing call on empty series

7f00630

fix rebase

d5bbb2a

fix black formatting errors

fc23551

add myself to the list of contributors

9c7efb0

fix is_equal_length metadata calc in panel check

d8a108d

remove duplicate check

8c374a1

hoesler requested a review from fkiraly as a code owner February 3, 2023 10:07

fkiraly added the labels module:datatypes (datatypes module: data containers, checkers & converters) and enhancement (Adding new functionality) Feb 3, 2023

fkiraly added 5 commits February 3, 2023 20:35

Merge branch 'main' into pr/4196

35b2e40

Merge branch 'main' into pr/4196

a552663

fix merge accident

f2c8067

Update .all-contributorsrc

7662e60

Merge branch 'main' into pr/4196

0e13388

danbartl self-assigned this Feb 9, 2023

add shortcut for monotonic increasing multiindex

3b48901

fkiraly approved these changes Feb 12, 2023

fkiraly changed the title from "[ENH] Improve panel mtype check performance" to "&danbartl [ENH] Improve panel mtype check performance" Feb 13, 2023

fkiraly changed the title from "&danbartl [ENH] Improve panel mtype check performance" to "[ENH] Improve panel mtype check performance" Feb 13, 2023

fkiraly merged commit 04c81c8 into sktime:main Feb 13, 2023

hoesler deleted the improve-panel-mtype-check branch February 13, 2023 11:35
