[ENH] Improve the handling of CategoricalIndex type when handling panel and hierarchical data #4062
Comments
Can you kindly post code with a dummy example? I can't quite visualize the failure case and the expected outcome.
Of course! Here is a dummy example of how this could affect the performance of transformers that use `groupby`:

```python
import pandas as pd
from sklearn import clone
from sktime.transformations.series.summarize import WindowSummarizer
from sktime.utils._testing.hierarchical import _make_hierarchical

# Make hierarchical data
df = _make_hierarchical(hierarchy_levels=(10000,), min_timepoints=1000, max_timepoints=1000)

# Create the same panel dataframe with the index type being categorical
df_cat = df.reset_index()
df_cat["h0"] = df_cat["h0"].astype("category").cat.as_ordered()
df_cat = df_cat.set_index(["h0", "time"])

# Take a sample of the time series instances,
# something we might want to do for our own application.
idx_cat = df_cat.index.levels[0][:3]
idx_obj = df.index.levels[0][:3]
df_sample_cat = df_cat.loc[idx_cat]
df_sample = df.loc[idx_obj]

# Create the same panel dataframe, but drop the unused categories
index_cat = df_sample_cat.index.get_level_values(0).remove_unused_categories()
time_index = df_sample_cat.index.get_level_values(-1)
df_sample_cat_slim = df_sample_cat.set_index(keys=[index_cat, time_index])

# Make transformer
trafo = WindowSummarizer(
    lag_feature={
        "lag": [1, 2, 3],  # Lag features.
        "mean": [[1, 3]],
    },
    target_cols=["c0"],
    truncate="bfill",  # Backfill missing values from lagging and windowing.
)
```

First level index has object type:

```python
%%timeit
trafo.fit(df_sample)
# 21.6 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
%%timeit
result = trafo.transform(df_sample)
# 49.6 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

First level index has categorical type:

```python
%%timeit
trafo.fit(df_sample_cat)
# 36.5 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
%%timeit
result = trafo.transform(df_sample_cat)
# 81.5 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

First level index has categorical type, but with the unused categories dropped from the index (i.e., the time series instances no longer present):

```python
%%timeit
trafo.fit(df_sample_cat_slim)
# 23.5 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
%%timeit
trafo.transform(df_sample_cat_slim)
# 39.8 ms ± 434 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Two observations:
One comment:
I still think the 3 possible solutions are the ones I've listed above.
Ah, thanks for the explanation! I have some comments and questions.

Question: do we know how, in your original example, a lot of categories ended up in the index? Maybe there was something that should not have happened in the first place.

Comment: regarding solutions, I think I also see a fourth solution: in input checks, always coerce input series so that
So in my case it was the following. I'm working with the M5 dataset. To save on memory I converted any categorical column to the
Your understanding is correct.
This is also a neat solution. I have a nagging doubt though and would love to get your opinion. Do you think, however, that we're dealing with a peculiarity of
Hm, I suppose there are multiple things that come together here.
Regarding a clear answer - at the moment I can't see one, but I will keep thinking.
Related to: #4061 & #3935.

In these checks we are doing operations such as:

```python
time_grp = time_obj.groupby(level=0, group_keys=True, as_index=True)
```

I've encountered an edge case: when the outer index is of type `CategoricalIndex`, a `groupby` operation will return a result for all categories, even those categories which are not present in the dataframe. I noticed a slowdown when using `WindowSummarizer` on a panel dataframe where I had a `CategoricalIndex`, despite sampling down to a few time series instances. I believe this is caused by the fact that there was a very large number of categories (i.e., time series instances). The latency disappeared when I converted the `CategoricalIndex`'s categories to just an integer index. This means that any `groupby` operations in the checks, and also in `transform` (in the case of `WindowSummarizer`), are actually returning a much larger number of rows than expected (`WindowSummarizer` filters these rows out because it does `Xt_return = Xt_return.loc[idx]`, so the user doesn't know that a lot of unnecessary rows were computed). It is common to use the `categorical` type in `pandas` to be more memory efficient.

Potential fixes are:

- Drop unused categories from the `CategoricalIndex` before using any `sktime` transformers.
- Pass the `observed=True` argument to `groupby`, so that the `groupby` does not return values for categories that are not in the dataframe.
does not return values for categories that are not in the dataframe.The text was updated successfully, but these errors were encountered: