Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] ENH Adds warning with pandas category does not match lexicon ordering #15050

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Sep 21, 2019

Reference Issues/PRs

Resolves #14953

What does this implement/fix? Explain your changes.

Gets the dtypes in _fit and checks only when it has "categories".

Any other comments?

The fun part begins when we try to get the encoders to respect pandas categories (when they are numerical) :)

@thomasjpfan
Copy link
Member Author

@jnothman @amueller This would be good in the next release if we want to support pd.Categorical in the encoders in 0.24.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm

"""Checks cats is lexicographic consistent with categories in fitted X.
"""
msg = ("'auto' categories is used, but the Categorical dtype provided "
"is not consistent with the automatic lexicographic ordering")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should the user do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth listing or diffing the category list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should the user do?

Right now...nothing good. The user would need order their categorical dtype to be lexicographic ordering to avoid this warning, which I think is bad. This should most likely be a FutureWarning suggesting that in the future we will respect the Categorical dtype. We can add support for a categories='dtype' which will respect the Categorical dtype.

Is it worth listing or diffing the category list?

Sometimes the only difference is the order. What kind of diff would be good to have?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just print the two orders then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also added another sentence regarding passing a list to the categories parameter.

@jnothman
Copy link
Member

Please add what's new

@jnothman
Copy link
Member

Maybe it's worth solving for LabelEncoder too (see #12086, #13351)

@thomasjpfan
Copy link
Member Author

I agree with #12086 (comment)

For OneHotEncoder and OrdinalEncoder, we can have a category=dtype'` option to allow users to use the categorical dtype.

The LabelEncoder would be slightly more harder to do, since it is used internally in some estimators. If we were to warn in in those situations, there is nothing a user can do.

If we want to resolve both issues, we may need a sklearn.set_config(use_pandas_categories_order) to configure it globally.

@adrinjalali adrinjalali added this to the 0.22 milestone Oct 28, 2019
@adrinjalali adrinjalali added this to In progress in Categorical via automation Oct 28, 2019
@adrinjalali adrinjalali moved this from In progress to Review in progress in Categorical Oct 28, 2019
@adrinjalali adrinjalali added this to In progress in Meeting Issues via automation Oct 29, 2019
@adrinjalali adrinjalali moved this from In progress to Review in progress in Meeting Issues Oct 29, 2019
@rth
Copy link
Member

rth commented Nov 6, 2019

So is the goal of this overall to decrease number of versions needed to change the behavior for category="auto" from the current one to that of categories="dtype" for categorical data by 1 version?

If I understand correctly, the user is still getting the OHE they wanted with some categories in the OneHotEncoder estimator. Ideally these should match those of the categorical dtype. Currently they sometimes don't, but at the same time I'm not sure there was too much expectation that they would. Getting a warning one can do nothing about is not necessarily very helpful. Personally I currently have a pipeline with categorical dtypes + OHE and I would probably just ignore this warning, if it was raised. Or am I missing something?

@thomasjpfan
Copy link
Member Author

Currently they sometimes don't, but at the same time I'm not sure there was too much expectation that they would. Getting a warning one can do nothing about is not necessarily very helpful.

Yes this warning is not helpful without the other PR on categories="dtypes" (#15396)

@jnothman
Copy link
Member

jnothman commented Nov 6, 2019 via email

@thomasjpfan
Copy link
Member Author

It's helpful in the sense that the user can manually set the categories in
the encoder to match the dtype, and in the sense that they are now aware
that the dtype is not bring respected. What's wrong with that?

Should that be the error message in this case? "Please set categories as a list to set the order of your categories explicitly?"

@jnothman
Copy link
Member

jnothman commented Nov 7, 2019 via email

Categorical automation moved this from Review in progress to Reviewer approved Nov 7, 2019
msg = ("'auto' categories is used, but the Categorical dtype "
"provided is not consistent with the automatic "
"lexicographic ordering, lexicon order: {}, dtype order: "
"{}. Please pass a custom list of categories to the "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a bit weaker "Consider passing" rather than "please pass"

Meeting Issues automation moved this from Review in progress to Reviewer approved Nov 7, 2019
@rth
Copy link
Member

rth commented Nov 7, 2019

Consider the case, when you have a dataset split with train_test_split, with a significant number of categories. Some categories might then differ between train and test, and one would use OneHotEncoder(handle_unknown="ignore") to handle unkown categories in train. To silence this warning the user would then need not only to sort categories, but also to remove unknown categories from the categorical dtype in train set. I'm just saying, I would not spend time doing that, I prefer having train and test use the same categories in dtype in my present project, and seeing repeated warning for this is annoying.

Users only need to be told once that the categorical dtype is not respected (not each time they fit a pipeline) and maybe the documentation could be a better place to document this?

@jnothman
Copy link
Member

jnothman commented Nov 7, 2019 via email

@rth
Copy link
Member

rth commented Nov 7, 2019

Why do they need to remove unknown categories from the dtype?

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder
>>> from sklearn.model_selection import train_test_split
>>> X = pd.DataFrame({'z': pd.Categorical(['b', 'c', 'c', 'a', 'b'])})
>>> X_train, X_test = train_test_split(X, shuffle=False)
>>> X_train['z']
0    b
1    c
2    c
Name: z, dtype: category
Categories (3, object): [a, b, c]
>>> ohe = OneHotEncoder(categories="auto", handle_unknown="ignore").fit(X_train)
>>> ohe.categories_
[array(['b', 'c'], dtype=object)]
>>> X_train['z'].cat.categories
Index(['a', 'b', 'c'], dtype='object')

This warning is going to be raised, currently with this PR I believe? Sorting categories is not going to help...

@thomasjpfan
Copy link
Member Author

I am happy with moving this to 0.23.

@qinhanmin2014
Copy link
Member

I vote +1 because I agree with Thomas's comment #14953 (comment) (i.e., users won't get unexpected warning), perhaps I'm wrong.
Let's move this to 0.23.

@qinhanmin2014 qinhanmin2014 modified the milestones: 0.22, 0.23 Nov 17, 2019
@NicolasHug
Copy link
Member

I don't think we should rely on some arbitrary pandas internal behavior that may change tomorrow

@qinhanmin2014
Copy link
Member

I don't think we should rely on some arbitrary pandas internal behavior that may change tomorrow

arbitrary pandas internal behavior? If so, I agree that we should reconsider this PR. I admit that I lack experience outside scikit-learn.

Actually I come up with another question: should we raise warning in OneHotEncoder where the order doesn't matter?

@NicolasHug
Copy link
Member

NicolasHug commented Nov 17, 2019

arbitrary pandas internal behavior?

I'm referring to the fact that pandas happens to use the lexicographic ordering by default. If we start relying on this, we're making pandas a de-facto dependency.

should we raise warning in OneHotEncoder where the order doesn't matter?

I have the same concern, again the discussion is tracked in #14953 (comment)

@qinhanmin2014
Copy link
Member

I'm referring to the fact that pandas happens to use the lexicographic ordering by default

I'll trust you but note that this behavior is actually documented.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html

categories : Index-like (unique), optional
The unique categories for this categorical. If not given,
the categories are assumed to be the unique values of values
(sorted, if possible, otherwise in the order in which they appear).

So will pandas change it without a deprecation cycle? I don't know.

@qinhanmin2014
Copy link
Member

So let my clarify my point: since this feature is designed for pandas, I guess it's reasonable to rely on some documented behaviors in pandas.

@adrinjalali
Copy link
Member

removing from the milestone. We can put it back when we find a way forward.

@adrinjalali adrinjalali removed this from the 0.23 milestone Apr 22, 2020
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still would rather have this than not, at least in OrdinalEncoder, and maybe in OneHotEncoder when the dtype is ordered.

@thomasjpfan
Copy link
Member Author

Updated to only warn in OrdinalEncoder

@cmarmo cmarmo modified the milestone: 0.24 Oct 26, 2020
@cmarmo cmarmo added this to the 1.0 milestone Dec 18, 2020
Base automatically changed from master to main January 22, 2021 10:51
@adrinjalali
Copy link
Member

Any reason this is flagged to be in v1.0? Hasn't been active since December.

@thomasjpfan
Copy link
Member Author

This PR feels like a "needs decision" as we need to decide what the desired behavior of OrdinalEncoder + pandas Categroicals is. REF: #14953 (comment)

I'll push this milestone till 1.1, since I do not think we can decide before the 1.0 release.

Looking at this again, I think I would like an IntegerEncoder and the order is an implementation detail:

  1. Pass in a numpy array with something that can be ordered -> use the lexicon ordering
  2. Pass in a pandas categorical -> use the values already encoded by the CategoricalDtype

@thomasjpfan thomasjpfan modified the milestones: 1.0, V 1.1 Aug 21, 2021
@adrinjalali
Copy link
Member

Removing the milestone since there's been no activity since last release.

@adrinjalali adrinjalali removed this from the 1.1 milestone Apr 7, 2022
@jeremiedbb jeremiedbb added the Needs Decision Requires decision label Apr 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Categorical
  
Review in progress
Meeting Issues
  
Review in progress
Development

Successfully merging this pull request may close these issues.

Handle pd.Categorical in encoders
9 participants