[ENH] Improve panel mtype check performance #4196

Merged
merged 17 commits into from Feb 13, 2023


hoesler
Contributor

@hoesler hoesler commented Feb 3, 2023

Reference Issues/PRs

Split out of #4140
Contributes to #4139
Overlaps with #3827, #3991

What does this implement/fix? Explain your changes.

This PR improves the performance of check_pdmultiindex_panel.

It was developed in parallel to the work of @danbartl, so some improvements might be outdated or just implemented differently. I would be grateful if he could support here.

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG] indicating whether the PR topic is related to enhancement, maintenance, documentation, or bug.
For new estimators
  • I've added the estimator to the online documentation.
  • I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

@hoesler hoesler requested a review from fkiraly as a code owner February 3, 2023 10:07
@fkiraly fkiraly added the module:datatypes (datatypes module: data containers, checkers & converters) and enhancement (Adding new functionality) labels Feb 3, 2023
@danbartl
Collaborator

danbartl commented Feb 8, 2023

looks great to me, I think the only "collision" is with my approach to change get_cutoff, but I removed that since?

Sorry, I am currently a bit caught up. I will give this a closer look in the next few days.

@danbartl
Collaborator

danbartl commented Feb 9, 2023

Hi I am really confused now, to me performance seems to go down with the latest PR?

@hoesler
Contributor Author

hoesler commented Feb 9, 2023

Hi I am really confused now, to me performance seems to go down with the latest PR?

What are you referring to with "with the latest PR"?
I am still getting a much better performance for check_pdmultiindex_panel here than on main.

0.09s vs 12.4s using this test case:

import timeit

import numpy as np
import pandas as pd

from sktime.datatypes._panel._check import check_pdmultiindex_panel

index = pd.MultiIndex.from_product(
    [np.arange(1000), pd.period_range(end="2022-12-31", periods=365 * 5, freq="D")],
    names=["store", "date"],
)

data = np.random.default_rng().standard_normal((len(index), 50), dtype=np.float32)

df = pd.DataFrame(data=data, index=index)

print(timeit.timeit(lambda: check_pdmultiindex_panel(df), number=10) / 10)

@danbartl danbartl self-assigned this Feb 9, 2023
@danbartl
Collaborator

danbartl commented Feb 9, 2023

Hi, sorry, I didn't have the time to test thoroughly. I use the following code:

# -*- coding: utf-8 -*-
import cProfile

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

from sktime.datatypes._utilities import get_time_index
from sktime.forecasting.compose import make_reduction
from sktime.transformations.series.summarize import WindowSummarizer
from sktime.utils._testing.hierarchical import _make_hierarchical

regressor = make_pipeline(
    LinearRegression(),
)

# a single lag-1 feature
kwargs = {
    "lag_feature": {
        "lag": [1],
    }
}

forecaster_global = make_reduction(
    regressor,
    scitype="tabular-regressor",
    transformers=[WindowSummarizer(**kwargs, n_jobs=1, truncate="bfill")],
    window_length=None,
    strategy="recursive",
    pooling="global",
)

# 10,000 series with 1,000 time points each
y = _make_hierarchical(
    hierarchy_levels=(10000,), min_timepoints=1000, max_timepoints=1000
)

y_no_freq = get_time_index(y.reset_index().set_index(["h0", "time"]))


def main():
    # fit and predict are profiled together
    _ = forecaster_global.fit(y)
    y_pred_global = forecaster_global.predict(fh=[1, 2])


# write the profile to a file for later inspection
cProfile.run("main()", "output_newmulti_adjust.dat")

This resulted in slower performance for me when I inspected the cProfile output. I hope I can dig in a bit more tomorrow.

@hoesler
Contributor Author

hoesler commented Feb 10, 2023

Ok, so you are testing more than just check_pdmultiindex_panel. I think that can be misleading, or were you comparing CPU times for the function within the profiles?
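To compare a single function across two profiles, its cumulative time can be read directly from the profile data. The following is a minimal sketch using only the standard library. The functions `cheap`, `expensive`, and `main` and the helper `cumulative_time` are hypothetical placeholders, not part of sktime or of this PR.

```python
import cProfile
import pstats


def cheap():
    return sum(range(1_000))


def expensive():
    return sum(i * i for i in range(200_000))


def main():
    cheap()
    expensive()


# Profile one call of main() in-process instead of writing a file.
profiler = cProfile.Profile()
profiler.runcall(main)
stats = pstats.Stats(profiler)


def cumulative_time(stats, func_name):
    """Sum the cumulative time of all profiled functions named func_name."""
    return sum(
        ct
        for (_file, _line, name), (_cc, _nc, _tt, ct, _callers) in stats.stats.items()
        if name == func_name
    )


expensive_time = cumulative_time(stats, "expensive")
main_time = cumulative_time(stats, "main")
print(f"expensive: {expensive_time:.4f} s of {main_time:.4f} s in main")
```

For a profile saved to disk, `pstats.Stats("output_newmulti_adjust.dat")` loads the same structure, so the helper can be applied to both runs and only the function of interest compared.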

@danbartl
Collaborator

danbartl commented Feb 10, 2023

Ok, so you are testing more than just check_pdmultiindex_panel. I think that can be misleading, or were you comparing CPU times for the function within the profiles?

I broke it down to your example

import timeit

import numpy as np
import pandas as pd

from sktime.datatypes._panel._check import check_pdmultiindex_panel
from sktime.utils._testing.hierarchical import _make_hierarchical

# The from_product panel of the original example is disabled here;
# the hierarchical test data below is checked instead.
# index = pd.MultiIndex.from_product(
#     [np.arange(1000), pd.period_range(end="2022-12-31", periods=365 * 5, freq="D").to_timestamp()],
#     names=["store", "date"],
# )
# data = np.random.default_rng().standard_normal((len(index), 50), dtype=np.float32)
# df = pd.DataFrame(data=data, index=index)

y = _make_hierarchical(
    hierarchy_levels=(10000,), min_timepoints=1000, max_timepoints=1000
)

print(timeit.timeit(lambda: check_pdmultiindex_panel(y), number=10) / 10)

I think the key point is that you consider PeriodIndex, while I consider DatetimeIndex. With DatetimeIndex and many groups, the PR code presented here takes twice as long on my PC. Maybe we need dedicated logic for checking PeriodIndex.

The great boost for your PeriodIndex comes from this line, I guess?

    if isinstance(index, pd.PeriodIndex):
        return index.is_full

Great find, we definitely need that.

@hoesler
Contributor Author

hoesler commented Feb 10, 2023

Interesting. I will investigate further with your code.
And yes, I was focusing on PeriodIndex. The main finding here was that operations are fast if you use its numeric representation, like with is_full.
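As a small illustration of that point, the sketch below compares pandas' `PeriodIndex.is_full` with an equivalent check on the integer ordinals exposed by `asi8`. The helper `is_full_via_ordinals` is a hypothetical name introduced only for this example, and it assumes the index is sorted in ascending order.

```python
import numpy as np
import pandas as pd

# A gap-free daily PeriodIndex and a copy with one day removed.
full = pd.period_range(end="2022-12-31", periods=10, freq="D")
gappy = full[[0, 1, 2, 4, 5, 6, 7, 8, 9]]


def is_full_via_ordinals(index: pd.PeriodIndex) -> bool:
    """Return True if consecutive periods are at most one ordinal step apart.

    Assumes `index` is sorted in ascending order.
    """
    ordinals = index.asi8  # one int64 ordinal per period
    return bool((np.diff(ordinals) < 2).all())


print(full.is_full, is_full_via_ordinals(full))
print(gappy.is_full, is_full_via_ordinals(gappy))
```

Both variants work on plain integer arrays, which avoids constructing Period objects element by element.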

@hoesler
Contributor Author

hoesler commented Feb 10, 2023

Did a quick comparison of three scenarios (average seconds per call):
my example with two different index types, and yours.

This PR@0e13388:
panel_period: 0.090
panel_datetime: 0.098
hier_datetime: 1.1557

main@8a14adb:
panel_period: 12.308
panel_datetime: 0.628
hier_datetime: 1.472

My code performed consistently better. For your example the difference is small, but it's still better. For the other two it's significant, and not only for the period index.

Very strange that we observe such different outcomes.

import timeit

import numpy as np
import pandas as pd

from sktime.datatypes._panel._check import check_pdmultiindex_panel
from sktime.utils._testing.hierarchical import _make_hierarchical


def create_panel(time_range_fun):
    index = pd.MultiIndex.from_product(
        [np.arange(1000), time_range_fun(end="2022-12-31", periods=365 * 5, freq="D")],
        names=["store", "date"],
    )

    data = np.random.default_rng().standard_normal((len(index), 50), dtype=np.float32)

    return pd.DataFrame(data=data, index=index)


for key, val in {
    "panel_period": create_panel(pd.period_range),
    "panel_datetime": create_panel(pd.date_range),
    "hier_datetime": _make_hierarchical(
        hierarchy_levels=(10000,), min_timepoints=1000, max_timepoints=1000
    ),
}.items():

    def func():
        check_pdmultiindex_panel(val, return_metadata=False)

    n = 10
    avg_time = timeit.timeit(func, number=n) / n

    print(f"{key}: {avg_time}")

@hoesler
Contributor Author

hoesler commented Feb 12, 2023

Found some further improvements.
Latest benchmark (minimum runtime in seconds, now including metadata calculation, for the scenarios above, except that yours now uses a sorted index):

panel_period: 0.145111
panel_datetime: 0.172680
hier_datetime: 0.991199

vs

panel_period: 12.283228
panel_datetime: 0.544896
hier_datetime: 1.357230
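For reference, a minimum runtime of this kind can be obtained with `timeit.repeat`. The sketch below is only an illustration of the measurement approach: `workload` is a hypothetical placeholder for the function under test, and the repeat counts are arbitrary assumptions, not the settings used for the numbers above.

```python
import timeit


def workload():
    """Placeholder for the function under test (an assumption)."""
    return sum(i * i for i in range(10_000))


# Five independent runs of 10 calls each; every entry is the total for one run.
totals = timeit.repeat(workload, number=10, repeat=5)
per_call = [total / 10 for total in totals]

min_time = min(per_call)
mean_time = sum(per_call) / len(per_call)

print(f"min:  {min_time:.6f} s per call")
print(f"mean: {mean_time:.6f} s per call")
```

The minimum is usually less sensitive to background load than the mean, which matters when comparing results across machines.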

The slowest part is now _index_equally_spaced, but I don't see how we could improve that, except maybe by storing its result as metadata in the index to prevent repeated calculation. I don't know if that is feasible, but it is worth investigating.
And of course, prevent unnecessary metadata calculation (#4191)!
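To make the cost of such a check concrete, here is a rough sketch of an equal-spacing test on a DatetimeIndex. It is not sktime's `_index_equally_spaced`; the function `equally_spaced` is a hypothetical stand-in that assumes a sorted index and compares the integer gaps between consecutive time stamps.

```python
import numpy as np
import pandas as pd


def equally_spaced(index: pd.DatetimeIndex) -> bool:
    """Return True if all consecutive gaps in the index are identical."""
    if len(index) < 3:
        return True  # zero or one gap is trivially regular
    gaps = np.diff(index.asi8)  # gaps in integer time units
    return bool((gaps == gaps[0]).all())


daily = pd.date_range(end="2022-12-31", periods=10, freq="D")
irregular = daily[[0, 1, 2, 4, 5, 6, 7, 8, 9]]

print(equally_spaced(daily))
print(equally_spaced(irregular))
```

Even in this vectorized form the check has to touch every time stamp, so running it once per instance in a large panel adds up, which is why reusing a stored result looks attractive.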

@fkiraly
Collaborator

fkiraly commented Feb 12, 2023

FYI, I'm currently working on using the config mechanisms to turn off checks, metadata etc - I suppose that's orthogonal to this PR.

@hoesler
Contributor Author

hoesler commented Feb 12, 2023

Nice. Thanks. I didn't start yet with an implementation, but my idea was to split check and metadata into two methods and, as a first step, call the existing functions as fallback somehow. But I guess this is indeed orthogonal.

@fkiraly
Collaborator

fkiraly commented Feb 12, 2023

but my idea was to split check and metadata into two methods and, as a first step, call the existing functions as fallback somehow. But I guess this is indeed orthogonal.

Hm, good idea. What would it look like, in an example?
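One possible shape of such a split, purely as a hedged sketch: the names `legacy_check` and `SplitChecker`, as well as the `(valid, message, metadata)` return shape, are assumptions made for illustration and are not taken from sktime or from this PR.

```python
from typing import Any, Callable, Dict, Tuple


def legacy_check(obj: Any) -> Tuple[bool, str, Dict[str, Any]]:
    """Stand-in for an existing combined checker (hypothetical)."""
    valid = isinstance(obj, list)
    msg = "" if valid else "obj must be a list"
    metadata = {"n_items": len(obj)} if valid else {}
    return valid, msg, metadata


class SplitChecker:
    """Expose the validity check and the metadata as two separate methods."""

    def __init__(self, fallback: Callable[[Any], Tuple[bool, str, Dict[str, Any]]]):
        self._fallback = fallback

    def check(self, obj: Any) -> bool:
        # First step: delegate to the combined function and drop the metadata.
        valid, _msg, _metadata = self._fallback(obj)
        return valid

    def metadata(self, obj: Any) -> Dict[str, Any]:
        # Only callers that need metadata pay for computing it.
        valid, msg, metadata = self._fallback(obj)
        if not valid:
            raise TypeError(msg)
        return metadata


checker = SplitChecker(legacy_check)
print(checker.check([1, 2, 3]))
print(checker.metadata([1, 2, 3]))
```

With the fallback in place, each method could later be replaced by a faster, dedicated implementation without changing the callers.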

@fkiraly
Collaborator

fkiraly commented Feb 12, 2023

Here are two relevant issues:

@danbartl
Collaborator

@fkiraly @hoesler
I still get a reduced performance for the 10,000 groups example. Still, in my view this can be merged, since performance is really good for fewer groups, and the increase is not too high and apparently hardware dependent. @fkiraly, if you have time to run the code on your machine, that would be nice; otherwise it is good to go from my side.

Thanks a lot!

@fkiraly
Collaborator

fkiraly commented Feb 12, 2023

Performance statistics

main@c41efee7f (note that this includes #4195)

panel_period: 37.70587424999103
panel_datetime: 1.4263513099984266
hier_datetime: 2.135691480000969

this@0e1338875

panel_period: 0.2671109699993394
panel_datetime: 0.28593878000974654
hier_datetime: 2.568562889995519

after merge of main (including #4195) into this PR:

panel_period: 0.002741659991443157
panel_datetime: 0.0015702700009569525
hier_datetime: 2.3584587200079112

i.e., I also notice hier_datetime taking longer

Collaborator

@fkiraly fkiraly left a comment


Thanks a lot! Great improvement!

I would also approve it.

@fkiraly
Collaborator

fkiraly commented Feb 13, 2023

Still, in my view this can be merged, since performance is really good for fewer groups, and the increase is not too high and apparently hardware dependent. @fkiraly, if you have time to run the code on your machine, that would be nice; otherwise it is good to go from my side.

I'll take this as approval, @danbartl - did you want to press the "approve" button?

@fkiraly fkiraly changed the title from "[ENH] Improve panel mtype check performance" to "&danbartl [ENH] Improve panel mtype check performance" Feb 13, 2023
@fkiraly fkiraly changed the title from "&danbartl [ENH] Improve panel mtype check performance" to "[ENH] Improve panel mtype check performance" Feb 13, 2023
@fkiraly fkiraly merged commit 04c81c8 into sktime:main Feb 13, 2023
@hoesler hoesler deleted the improve-panel-mtype-check branch February 13, 2023 11:35
Labels
enhancement Adding new functionality module:datatypes datatypes module: data containers, checkers & converters

3 participants