
[ENH] Improve handling of the CategoricalIndex type for panel and hierarchical data #4062

Open
KishManani opened this issue Jan 4, 2023 · 6 comments
Labels: enhancement, module:transformations

Comments


KishManani commented Jan 4, 2023

Related to #4061 and #3935. In those checks we do operations such as time_grp = time_obj.groupby(level=0, group_keys=True, as_index=True). I've encountered an edge case: when the outer index is of type CategoricalIndex, a groupby operation returns a result for all categories, even those that are not present in the dataframe.

I noticed a slowdown when using WindowSummarizer on a panel dataframe with a CategoricalIndex, despite sampling down to a few time series instances. I believe this is caused by the very large number of categories (i.e., time series instances); the latency disappeared when I converted the CategoricalIndex to a plain integer index.

This means that any groupby operations in the checks, and also in transform (in the case of WindowSummarizer), actually return far more rows than expected. WindowSummarizer filters these rows out (it does Xt_return = Xt_return.loc[idx]), so the user does not see that many unnecessary rows were computed. Since using the categorical dtype in pandas to save memory is common, this edge case is likely to come up in practice.
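[Editor's note: to make the failure mode concrete, here is a minimal, self-contained sketch; the toy data and names are illustrative, not from the issue itself.]

import pandas as pd

# Toy index: categories "a" and "b" occur, "c" is in the universe but unused.
idx = pd.CategoricalIndex(["a", "a", "b"], categories=["a", "b", "c"], name="h0")
df = pd.DataFrame({"c0": [1.0, 2.0, 3.0]}, index=idx)

# With the pandas default (observed=False at the time of writing),
# groupby materialises a group for "c" even though it has no rows.
print(df.groupby(level=0).sum())
#      c0
# h0
# a   3.0
# b   3.0
# c   0.0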

Potential fixes are:

  1. Do nothing. Users should introduce a pre-processing step that removes unused categories from the CategoricalIndex before using any sktime transformers.
  2. Use the observed=True argument in groupby, so that the groupby does not return values for categories that are not present in the dataframe (see the sketch after this list).
  3. Perform some checks on the index and warn the user.
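[Editor's note: continuing the sketch above (same toy frame, illustrative names), observed=True restricts the groupby to categories that actually occur.]

import pandas as pd

idx = pd.CategoricalIndex(["a", "a", "b"], categories=["a", "b", "c"], name="h0")
df = pd.DataFrame({"c0": [1.0, 2.0, 3.0]}, index=idx)

# observed=True: no group is materialised for the unused category "c".
print(df.groupby(level=0, observed=True).sum())
#      c0
# h0
# a   3.0
# b   3.0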
@KishManani added the documentation label Jan 4, 2023
@KishManani

Update: actually, this might be isolated to WindowSummarizer, as the groupby operations in #3935 and #4061 don't act on the non-time index.


fkiraly commented Jan 4, 2023

Can you kindly post code with a dummy example? I can't quite visualize the failure case and the expected outcome.

@fkiraly added the module:transformations and enhancement labels and removed the documentation label Jan 4, 2023

KishManani commented Jan 7, 2023

Of course! Check out this issue, which illustrates how columns with the category dtype behave during groupby.

Here is a dummy example of how this can affect the performance of transformers that use groupby:

import pandas as pd
from sktime.transformations.series.summarize import WindowSummarizer
from sktime.utils._testing.hierarchical import _make_hierarchical

# Make hierarchical data
df = _make_hierarchical(hierarchy_levels=(10000,), min_timepoints=1000, max_timepoints=1000)

# Create the same panel dataframe with
# the index type being categorical
df_cat = df.reset_index()
df_cat["h0"] = df_cat["h0"].astype("category").cat.as_ordered()
df_cat = df_cat.set_index(["h0", "time"])

# Take a sample of the time series instances.
# Something we might want to do for our own application.
idx_cat = df_cat.index.levels[0][:3]
idx_obj = df.index.levels[0][:3]

df_sample_cat = df_cat.loc[idx_cat]
df_sample = df.loc[idx_obj]

# Create the same panel dataframe
# but drop the unused categories
index_cat = df_sample_cat.index.get_level_values(0).remove_unused_categories()
time_index = df_sample_cat.index.get_level_values(-1)
df_sample_cat_slim = df_sample_cat.set_index(keys=[index_cat, time_index])

# Make transformer
trafo = WindowSummarizer(
    lag_feature={
        "lag": [1, 2, 3],  # Lag features at lags 1, 2, 3.
        "mean": [[1, 3]],  # Mean over a window of length 3 at lag 1.
    },
    target_cols=["c0"],
    truncate="bfill",  # Backfill missing values from lagging and windowing.
)

# First level index has object type
%%timeit
trafo.fit(df_sample)
# 21.6 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
result = trafo.transform(df_sample)
# 49.6 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# First level index has categorical type 
%%timeit
trafo.fit(df_sample_cat)
# 36.5 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
result = trafo.transform(df_sample_cat)
# 81.5 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# First level index has categorical type 
# but drop the unused categories in the index
# (i.e., the time series instances no longer present)
%%timeit
trafo.fit(df_sample_cat_slim)
# 23.5 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
trafo.transform(df_sample_cat_slim)
# 39.8 ms ± 434 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Two observations:

  1. The dataframe whose categorical index still contains all the categories of the original dataframe takes about twice as long to run.
  2. Stochastically, the first time I run transform it can take up to 1-2 seconds. I'm not sure why; even after re-creating the transformer I can't systematically reproduce this.

One comment:

  1. If a transformer does not know it is dealing with a CategoricalIndex, then after doing any groupby operations internally we may end up with a dataframe containing more instances than were provided to the transformer.

I still think the three possible solutions are the ones I listed above.


fkiraly commented Jan 10, 2023

Ah, thanks for the explanation!

I have some comments and questions.

Question: do we know how, in your original example, a lot of categories ended up in the index? Maybe there was something that should not have happened in the first place.

Comment:
I think I understand the problem now. May I rephrase to check whether I do:

  • a CategoricalIndex carries with it the "universe of categories". This may be much larger than the actually occurring categories, and typically arises when row-subsetting or slicing a frame with a CategoricalIndex. Let's call this "situation A"
  • vanilla groupby is inefficient when applied to a CategoricalIndex in situation A, compared to a groupby over only the occurring categories (which is what happens with observed=True)
  • that's the problem in this case, right?

Regarding solutions, I think observed=True is indeed the most straightforward one, and it seems to address the issue.

I also see a fourth solution: in input checks, always coerce the input series so that the CategoricalIndex has a minimal universe. I don't see where we would want to carry along information about a larger universe that the data was subsampled from - but I might be overlooking something here.
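[Editor's note: for illustration, a minimal sketch of what such a coercion could look like; the helper name _coerce_minimal_categories is hypothetical and not part of sktime's actual input checks.]

import pandas as pd

def _coerce_minimal_categories(obj: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: shrink every categorical index level of obj
    # to the categories that actually occur in it. Sketch only.
    if isinstance(obj.index, pd.MultiIndex):
        levels = []
        for i in range(obj.index.nlevels):
            lvl = obj.index.get_level_values(i)
            if isinstance(lvl, pd.CategoricalIndex):
                lvl = lvl.remove_unused_categories()
            levels.append(lvl)
        return obj.set_index(levels)
    if isinstance(obj.index, pd.CategoricalIndex):
        return obj.set_axis(obj.index.remove_unused_categories())
    return obj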


KishManani commented Jan 11, 2023

Question: do we know how, in your original example, a lot of categories ended up in the index? Maybe there was something that should not have happened in the first place.

So in my case it was the following: I'm working with the M5 dataset. To save memory, I convert the categorical columns to the categorical dtype and store the data as a partitioned parquet file. I then load the segment of the data that I want to train a model on. At this point, however, the columns remember all the categories, not just the ones in the segment I loaded. After loading the data I then have two choices: 1) use remove_unused_categories as a preprocessing step, or 2) specify observed=True in any groupby I do.
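[Editor's note: a toy illustration of the mechanism described above, with illustrative names; the parquet round-trip is omitted, but plain subsetting shows the same behaviour.]

import pandas as pd

# Stand-in for the workflow above: cast an id column to "category"
# to save memory, then select a segment of the data.
full = pd.DataFrame({"item_id": ["i1", "i2", "i3"] * 2, "sales": [0, 1, 2, 3, 4, 5]})
full["item_id"] = full["item_id"].astype("category")

segment = full[full["item_id"] == "i1"]

# The segment still remembers the full universe of categories ...
print(segment["item_id"].cat.categories)  # Index(['i1', 'i2', 'i3'], dtype='object')

# ... so either drop the unused categories once, as a preprocessing step:
segment = segment.assign(item_id=segment["item_id"].cat.remove_unused_categories())

# ... or pass observed=True to every groupby:
print(segment.groupby("item_id", observed=True)["sales"].sum())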

that's the problem in this case, right?

Your understanding is correct.

I also see a fourth solution: in input checks, always coerce the input series so that the CategoricalIndex has a minimal universe. I don't see where we would want to carry along information about a larger universe that the data was subsampled from - but I might be overlooking something here.

This is also a neat solution.

I have a nagging doubt, though, and would love to get your opinion. Do you think we're dealing with a peculiarity of pandas, something that perhaps the user should handle instead? Perhaps I, as the user, should be aware that I'm carrying around the universe of categories when I only need a subset, and should choose to get rid of them myself?


fkiraly commented Jan 12, 2023

I have a nagging doubt, though, and would love to get your opinion. Do you think we're dealing with a peculiarity of pandas, something that perhaps the user should handle instead? Perhaps I, as the user, should be aware that I'm carrying around the universe of categories when I only need a subset, and should choose to get rid of them myself?

Hm, I suppose there are multiple things that come together here.

  • as a user, I think I should be aware of the data model. A categorical column, in common modern formalism, does come with its own universe. So, if I prepare my data, I should prepare the universe as appropriate.
  • Having said that, it is probably fair that in re-sampling, the universe of the "whole" is retained.
  • Also, when re-sampling happens inside an estimator rather than at the hands of a user, there is a separate decision to make - "should the user be aware" cannot apply there, because it happens layers removed from the user. I'm not sure when and where that situation occurs, as opposed to the one where a user handles the data manually, but it's worth noting, I think.

Regarding a clear answer - at the moment I can't see one, but I will keep thinking.
