Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle pd.Categorical in encoders #14953

Open
jnothman opened this issue Sep 11, 2019 · 36 comments · May be fixed by #15050
Open

Handle pd.Categorical in encoders #14953

jnothman opened this issue Sep 11, 2019 · 36 comments · May be fixed by #15050
Labels
Breaking Change Issue resolution would not be easily handled by the usual deprecation cycle. Moderate Anything that requires some knowledge of conventions and best practices module:preprocessing
Milestone

Comments

@jnothman
Copy link
Member

In sklearn.preprocessing._encoders._BaseEncoder, columns with pd.Categorical dtype are converted to arrays.

Xi = check_array(Xi, ensure_2d=False, dtype=None,

If the categories ordering is explicitly specified by the user to the constructor of OneHotEncoder or OrdinalEncoder, then this is fine... but if 'auto' is used, lexicographic ordering will be assumed, disregarding the encoding order determined by the Categorical dtype.

I propose that we raise a warning if:

  • a feature passed to OneHotEncoder or OrdinalEncoder has a Pandas Categorical dtype (is there a way to duck type this?)
  • the categories for that feature are 'auto'
  • the lexicographically sorted features do not match the feature's dtype.categories ordering.

The warning might be something like UserWarning("'auto' categories is used, but the Categorical dtype provided is not consistent with the automatic lexicographic ordering")... or else something more intelligible.

We may change this warning to a FutureWarning with "From version 0.24 the category ordering specified by a Categorical dtype will be respected in encoders."

@jnothman jnothman added Easy Well-defined and straightforward way to resolve Moderate Anything that requires some knowledge of conventions and best practices help wanted labels Sep 11, 2019
@harsh020
Copy link
Contributor

@jnothman Hi, I am a bit new to open source and would like to work on this issue, if available.

Let me know if I got things correct,

  • a feature passed to OneHotEncoder or OrdinalEncoder has a Pandas Categorical dtype (is there a way to duck type this?)
  • the categories for that feature are 'auto'
  • the lexicographically sorted features do not match the feature's dtype.categories ordering.

if any of the above point hold true we need to raise warning.

Few queries,

  • should we raise the warning in the __init__ , or during fit or the _fit of the base class.
  • could you please explain a bit on the third point, i.e,
  • the lexicographically sorted features do not match the feature's dtype.categories ordering.

Few queries in general, if you don't mind answering them -

  • the check_array function in validation.py also used in this mentioned module, does it return the encoded i.e, converts the values to float? If yes, then what about the strings?
  • if the check_array does the main job for us then why use _encode? Why does _encode gives the list of features the dtype of object rather than 'str' or 'int'? Is it just bcs its a list?

Thanks.

@jnothman
Copy link
Member Author

jnothman commented Sep 16, 2019 via email

@harsh020
Copy link
Contributor

harsh020 commented Sep 17, 2019

@jnothman Thanks.
Just one more thing the last point

the lexicographically sorted features do not match the feature's dtype.categories ordering.

Do we have lexicographically sorted feature list? And what is feature's dtype.categories (is it the self.categories_ list that we are finding?)

Thanks.

@jnothman
Copy link
Member Author

jnothman commented Sep 18, 2019 via email

@harsh020
Copy link
Contributor

if self.categories == 'auto':

@jnothman Hey, I thought of doing it as -
In the lines above in line 85, within the if condition we can just check whether the other two conditions are true, that is whether X_list[i] is a categorical dtype and then we can check cats variable has same order as that of X_list[i]'s dtype.category.

Is it a feasible approach?

Thanks.

@amueller
Copy link
Member

I agree with the proposed behavior, but I doubt this issue will be easy ;) [also this issue is tagged easy and moderate?]

@harsh020
Copy link
Contributor

harsh020 commented Sep 19, 2019

@amueller Thanks for the heads up, I won't be opening any PR until I make sure that the approach is correct and get it reviewed by you guys.

In the lines above in line 85, within the if condition we can just check whether the other two conditions are true, that is whether X_list[i] is a categorical dtype and then we can check cats variable has same order as that of X_list[i]'s dtype.category.

And from your reply I think that this approach won't be any good. Still working on it.

Thanks.

@harsh020
Copy link
Contributor

@jnothman I was going through the check_array function, please correct me if wrong, the function converts any type of object passed to it to an array of features with string values.

If, so I think we need to verify the three conditions to raise the warning in the check_array function itself.

Thanks.

@jnothman
Copy link
Member Author

jnothman commented Sep 21, 2019 via email

@thomasjpfan
Copy link
Member

It would be nice to have support for this now and not wait till 0.24. Is adding option in categories a good idea? (I do not have a good idea what this new option would be called. categories=use-pandas-categories-if-avaliable?)

@jnothman
Copy link
Member Author

jnothman commented Oct 23, 2019 via email

@jorisvandenbossche
Copy link
Member

This duplicates #12086 somewhat, there is some discussion there as well (but basically what @jnothman is proposing here to introduce a warning for it). Will close the other issue.

I agree with @thomasjpfan that it would be nice to already have the new behaviour now, but I was wondering if we cannot combine that we the changes in OrdinalEncoder for strings with categories='sort' / 'auto' ? For example, if people specify categories='auto', we use the order of the categorical dtype, if categories='sort', we lexicographically sort them (the current behaviour).
That way it could also be possible to raise a warning in the default case, but having a way for users to both have the new behaviour or keep the old behaviour and silence the warning.

BTW, I don't think that the actual implementation to use the categorical dtype's categories should be very hard (there is actually a PR for this: #13351), as the required preparatory work to handle a DataFrame column by column is already done (#13253).

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 28, 2019

The question is also: what default behaviour do we want in the long run?
Personally, I think the default should become to use the categorical's dtype ordering/categories. If we want that, having this behaviour for categories='auto' (instead of a new categories='dtype') might be nicer.

EDIT: hmm, of course what I am forgetting is that users can already explicitly do categories='auto' right now. So eg changing the actual default to categories=None (meaning the old default), so we can raise a warning if needed and point users towards specifying categories='auto'/'sort' to choose explicit behaviour will still introduce a breaking change for those who already explicitly passed that keyword ..

@thomasjpfan
Copy link
Member

thomasjpfan commented Oct 28, 2019

In the long run I would want auto to mean dtype. The question is how we get there. If we want to maintain backward compatibility, we need to warn that 'auto' will now respect categorical dtypes in 0.24. With this warning, it would be nice to say that "you can use the categorical='dtype' option now to enable this behavior now. When 'dtype' becomes the new 'auto', we would need to deprecated 'dtype' because it means the same as 'auto'.

(This is a fairly long deprecation path)

@jorisvandenbossche
Copy link
Member

Yes, I was mainly trying to think if we can't do it with a shorter path, without having a new option that afterwards becomes obsolete. But yeah, as edited my comment above, it's difficult to do that in a fully backwards compatible way ..

@NicolasHug
Copy link
Member

I propose that we raise a warning if a feature passed to OneHotEncoder or OrdinalEncoder has a Pandas Categorical dtype

We only want to warn when the Categorical dtype is ordered right? Categories aren't necessarily ordered and this is actually pandas' default.

The current PR #15396 warns when categories aren't ordered which seems wrong to me.

(Sorry if this has been previously discussed)

@thomasjpfan
Copy link
Member

pandas uses the lexicon order for its encoding when the categorical dtype is unordered, so it so happens that this is okay. Although I agree we should not rely on this, and warn when the category is ordered.

@jnothman
Copy link
Member Author

jnothman commented Nov 10, 2019 via email

@NicolasHug
Copy link
Member

Ordinal categories are far less common than pure nominal categories, so IMHO the pandas default makes sense, and we would be warning for no good reason in most cases.

Why would you want to warn something about the order when there is no order in the first place?

@jnothman
Copy link
Member Author

jnothman commented Nov 10, 2019 via email

@NicolasHug
Copy link
Member

Sure, but then:

  • it's not really an issue for the one hot encoder so we should probably not warn there
  • does it even make sense to allow the ordinal encoder for unordered categories?

@jnothman
Copy link
Member Author

jnothman commented Nov 10, 2019 via email

@NicolasHug
Copy link
Member

... which makes absolutely no sense unless those strings are ordered

@glemaitre
Copy link
Member

I think that #14984, #15050, and #15396 might not be blockers for 0.22 and I would move them for 0.23.

I think that it could be great to have a single issue (superseded #14953, #14954) to discuss the overall behaviour for categories in OneHotEncoder and OrdinalEncoder and from there having several PRs which follows the discussed proposals.

@agramfort
Copy link
Member

... which makes absolutely no sense unless those strings are ordered

@NicolasHug I would not be so sure about this. Trees can cope with
random orders provided you make them deep enough.

@NicolasHug
Copy link
Member

True but trees aren't always deep, typically in GB

To reproduce a OHE split using OE, you would need in the worst case C - 1 splits. That's not negligible when the OHE splits multiple time on the same feature during fitting. And because of the arbitrary order, you might just not ever consider such a split because the gain is too low.

The right way to handle nominal categories in trees is still to use a OHE, or to natively support categories like the nocats PRs.

In any case, going back to the original issue, my concern here is that the current proposal is to raise an order-related warning even when there is actually no order, which I think will just confuse / frustrate users.

@agramfort
Copy link
Member

agramfort commented Nov 17, 2019 via email

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 18, 2019

Also note that you can have a pandas Categorical with a specific order without having it "ordered" (it the sense of the ordered=False/True attribute.

Eg:

In [12]: cat = pd.Categorical(['high', 'low', 'medium', 'low'], categories=['low', 'medium', 'high']) 

In [13]: cat  
Out[13]: 
[high, low, medium, low]
Categories (3, object): [low, medium, high]

So when people are passing this to a OrdinalEncoder, they might actually be doing it "correctly" already, even though it is not an "ordered" categorical, it just happens to have its categories in the sensible (non lexico) order.

@cmarmo
Copy link
Member

cmarmo commented Oct 26, 2020

@thomasjpfan two PRs of yours are meant to close this issue: #15050 has two approvals, so I have milestoned it against #15396. I'm unable to understand if they are both necessary. Do you mind clarifying and fixing conflicts (in one or both), if you think the milestone is still relevant? Thanks a lot for your collaboration.

@thomasjpfan
Copy link
Member

I think the end goal is to have "auto" == use the encoded provided by pandas, at least for OridinalEncoder.

#15050 is a warning to tell the user that the order we use does not match the pandas categorical. I can update this to say that "auto" will use the pandas ordering in 0.26. This warning can be a little annoying because the only way to avoid it is to use a python warnings filter.

#15396 would need to be updated to adjust 'auto' and than not merged until 0.26, because it will contain the implementation for using the pandas categorical.

@jnothman
Copy link
Member Author

jnothman commented Nov 1, 2020 via email

@cmarmo cmarmo added this to the 1.0 milestone Nov 2, 2020
@cmarmo cmarmo added the Breaking Change Issue resolution would not be easily handled by the usual deprecation cycle. label Nov 2, 2020
@adrinjalali
Copy link
Member

Doesn't seem like we'd get this in time. Moving to 2.0

@amueller
Copy link
Member

is there a current/ recent issue tracking the support for pd.categorical? So it was added to hgb, not not anywhere else, right?
I feel like there should be a consistent story, and a lot of conversation has happened over the last .. 5 years? but I can't finda current status or summary.
I saw #24967, and some related thoughts in #27947, but I haven't seen anything more recent?

@glemaitre
Copy link
Member

We have a related PR that wants to leverage this feature: #27911

@amueller
Copy link
Member

Thanks, that's quite interesting, but also somewhat orthogonal. I was thinking more about the API surface. Currently a user doesn't know whether categorical features are treated correctly or not in a model without reading the documentation of each model and default hyper-parameters. Passing categoricals encoded as integers and with categorical dtype to LogisticRegression, RandomForestClassifier and HistogramGradientBoostingClassifier has very different results.

@glemaitre
Copy link
Member

Indeed, we did not got the discussion. HistGradientBoosting is the first model that handle categorical feature in a "native manner". I assume that we could have something similar for all tree-based approach avoiding. You get missing values and categorical handling without the need of preprocessing.

For linear model, we got kind of backward by deprecating normalize parameter and totally rely on the ColumnTransformer. The fact that you need to handle both numerical + categorical + missing values make those models trickier.

The pattern of using the TableVectorizer from skrubs (https://skrub-data.org/stable/generated/skrub.TableVectorizer.html#skrub.TableVectorizer) could be a way to simplify such pipeline and the same pattern could also potentially used with tree-based models but I'm not sure if this necessary when one deals only with numerical and categorical (when dates come to play then it could be useful).

But I rather think that we should have the discussion on that topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking Change Issue resolution would not be easily handled by the usual deprecation cycle. Moderate Anything that requires some knowledge of conventions and best practices module:preprocessing
Projects
No open projects