Handle pd.Categorical in encoders #14953

jnothman · 2019-09-11T11:06:13Z

In sklearn.preprocessing._encoders._BaseEncoder, columns with pd.Categorical dtype are converted to arrays.

scikit-learn/sklearn/preprocessing/_encoders.py

Line 60 in 03ea20d

Xi = check_array(Xi, ensure_2d=False, dtype=None,

If the categories ordering is explicitly specified by the user to the constructor of OneHotEncoder or OrdinalEncoder, then this is fine... but if 'auto' is used, lexicographic ordering will be assumed, disregarding the encoding order determined by the Categorical dtype.

I propose that we raise a warning if:

a feature passed to OneHotEncoder or OrdinalEncoder has a Pandas Categorical dtype (is there a way to duck type this?)
the categories for that feature are 'auto'
the lexicographically sorted features do not match the feature's dtype.categories ordering.

The warning might be something like UserWarning("'auto' categories is used, but the Categorical dtype provided is not consistent with the automatic lexicographic ordering")... or else something more intelligible.

We may change this warning to a FutureWarning with "From version 0.24 the category ordering specified by a Categorical dtype will be respected in encoders."
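A minimal sketch of the three-condition check proposed above. The helper name `warn_if_dtype_order_ignored` is hypothetical and is not part of scikit-learn. It duck-types the Categorical dtype through its `categories` attribute, and it assumes that comparing the sorted categories against the declared categories is an acceptable stand-in for the third condition.

```python
import warnings


def warn_if_dtype_order_ignored(column, categories="auto"):
    """Hypothetical helper: warn only when all three proposed conditions hold."""
    # Condition 1: duck-type the pandas Categorical dtype via `dtype.categories`.
    dtype_categories = getattr(getattr(column, "dtype", None), "categories", None)
    if dtype_categories is None:
        return False
    # Condition 2: the user left category detection to 'auto'.
    if not (isinstance(categories, str) and categories == "auto"):
        return False
    # Condition 3: lexicographic sorting would reorder the declared categories.
    declared = list(dtype_categories)
    if sorted(declared) == declared:
        return False
    warnings.warn(
        "'auto' categories is used, but the Categorical dtype provided is "
        "not consistent with the automatic lexicographic ordering",
        UserWarning,
    )
    return True
```

Called on a column built as `pd.Series(pd.Categorical(values, categories=["low", "medium", "high"]))`, this sketch would warn, since sorting gives `high, low, medium`.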


harsh020 · 2019-09-16T15:30:34Z

@jnothman Hi, I am a bit new to open source and would like to work on this issue, if available.

Let me know if I got things correct,

a feature passed to OneHotEncoder or OrdinalEncoder has a Pandas Categorical dtype (is there a way to duck type this?)

the categories for that feature are 'auto'

the lexicographically sorted features do not match the feature's dtype.categories ordering.

If any of the above points holds true, we need to raise a warning.

Few queries,

Should we raise the warning in __init__, during fit, or in the _fit of the base class?
Could you please explain the third point a bit, i.e.,

the lexicographically sorted features do not match the feature's dtype.categories ordering.

Few queries in general, if you don't mind answering them -

The check_array function in validation.py is also used in this module: does it return encoded values, i.e., does it convert the values to float? If yes, then what about strings?
If check_array does the main job for us, then why use _encode? Why does _encode give the list of features the dtype object rather than 'str' or 'int'? Is it just because it's a list?

Thanks.

jnothman · 2019-09-16T21:02:27Z

I think check_array treats pandas categoricals as strings for now. You can try! And the condition for warning is that all of the above hold, not any.

harsh020 · 2019-09-17T19:22:13Z

@jnothman Thanks.
Just one more thing about the last point:

the lexicographically sorted features do not match the feature's dtype.categories ordering.

Do we have lexicographically sorted feature list? And what is feature's dtype.categories (is it the self.categories_ list that we are finding?)

Thanks.

jnothman · 2019-09-18T01:04:21Z

I mean that if we use the current encoding logic, we sort the feature's values. But if the feature has a categorical dtype, it also specifies an ordering (x.dtype.categories where x is a feature column).
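The two orderings can be compared directly. This is only a sketch: `np.unique(np.asarray(x))` is used here as an approximation of what sorting the feature's values amounts to once the Categorical dtype has been dropped; it is not the encoder's actual code path.

```python
import numpy as np
import pandas as pd

# A feature column whose dtype declares a non-lexicographic category order.
x = pd.Series(
    pd.Categorical(
        ["high", "low", "medium", "low"],
        categories=["low", "medium", "high"],
    )
)

# Ordering declared by the Categorical dtype.
dtype_order = list(x.dtype.categories)

# Ordering obtained after converting to a plain array and sorting the
# unique values (an approximation of the 'auto' behaviour).
sorted_order = np.unique(np.asarray(x)).tolist()

print(dtype_order)   # ['low', 'medium', 'high']
print(sorted_order)  # ['high', 'low', 'medium']
```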

harsh020 · 2019-09-18T17:55:41Z

scikit-learn/sklearn/preprocessing/_encoders.py

Line 85 in 03ea20d

if self.categories == 'auto':

@jnothman Hey, I thought of doing it as -
At line 85 above, within the if condition, we can check whether the other two conditions hold: whether X_list[i] has a categorical dtype, and whether the cats variable has the same order as X_list[i]'s dtype.categories.

Is it a feasible approach?

Thanks.

amueller · 2019-09-18T19:29:03Z

I agree with the proposed behavior, but I doubt this issue will be easy ;) [also this issue is tagged easy and moderate?]

harsh020 · 2019-09-19T14:37:40Z

@amueller Thanks for the heads up, I won't be opening any PR until I make sure that the approach is correct and get it reviewed by you guys.

At line 85 above, within the if condition, we can check whether the other two conditions hold: whether X_list[i] has a categorical dtype, and whether the cats variable has the same order as X_list[i]'s dtype.categories.

And from your reply I think that this approach won't be any good. Still working on it.

Thanks.

harsh020 · 2019-09-19T15:31:19Z

@jnothman I was going through the check_array function. Please correct me if I'm wrong: the function converts any type of object passed to it to an array of features with string values.

If so, I think we need to verify the three conditions for raising the warning in the check_array function itself.

Thanks.

jnothman · 2019-09-21T14:05:21Z

No, modifying check_array sounds like a bad idea. You might rather need to do some checking (or setting flags) before the call to check_array.
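One way to read "checking (or setting flags) before the call to check_array" is to record each column's declared categories while the dtype is still available. The function name `record_dtype_categories` is hypothetical; the sketch only shows the recording step, with the conversion assumed to happen afterwards.

```python
def record_dtype_categories(columns):
    """Hypothetical pre-check: remember declared categories before conversion.

    Returns one entry per column: the list of categories declared by the
    column's dtype, or None when the dtype declares none.
    """
    recorded = []
    for column in columns:
        # Duck-typed lookup, so plain lists and NumPy arrays yield None.
        cats = getattr(getattr(column, "dtype", None), "categories", None)
        recorded.append(None if cats is None else list(cats))
    return recorded
```

With a DataFrame `df`, the columns could be passed as `[df[name] for name in df.columns]`, and the returned list consulted once the converted arrays have lost their dtype.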

thomasjpfan · 2019-10-23T18:02:36Z

It would be nice to have support for this now and not wait till 0.24. Is adding an option to categories a good idea? (I do not have a good idea what this new option would be called. categories=use-pandas-categories-if-available?)

jnothman · 2019-10-23T21:03:32Z

categories='dtype'?

jorisvandenbossche · 2019-10-28T13:29:16Z

This duplicates #12086 somewhat, there is some discussion there as well (but basically what @jnothman is proposing here to introduce a warning for it). Will close the other issue.

I agree with @thomasjpfan that it would be nice to already have the new behaviour now, but I was wondering if we could combine that with the changes in OrdinalEncoder for strings with categories='sort' / 'auto'? For example, if people specify categories='auto', we use the order of the categorical dtype; if categories='sort', we lexicographically sort them (the current behaviour).
That way it could also be possible to raise a warning in the default case, but having a way for users to both have the new behaviour or keep the old behaviour and silence the warning.
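A rough sketch of how the 'auto' / 'sort' split described above could resolve a column's categories. `resolve_categories` is a hypothetical function written for illustration; neither it nor the 'sort' option is confirmed here as existing scikit-learn API.

```python
def resolve_categories(column, categories="auto"):
    """Hypothetical resolution of the proposed 'auto' and 'sort' options."""
    if isinstance(categories, str) and categories == "sort":
        # Proposed 'sort': always sort the observed values lexicographically.
        return sorted(set(column))
    if isinstance(categories, str) and categories == "auto":
        # Proposed 'auto': prefer the order declared by a Categorical dtype.
        dtype_cats = getattr(getattr(column, "dtype", None), "categories", None)
        if dtype_cats is not None:
            return list(dtype_cats)
        # No Categorical dtype: fall back to sorting the observed values.
        return sorted(set(column))
    # Anything else is treated as an explicit list of categories.
    return list(categories)
```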

BTW, I don't think that the actual implementation to use the categorical dtype's categories should be very hard (there is actually a PR for this: #13351), as the required preparatory work to handle a DataFrame column by column is already done (#13253).

jorisvandenbossche · 2019-10-28T13:34:49Z

The question is also: what default behaviour do we want in the long run?
Personally, I think the default should become to use the categorical's dtype ordering/categories. If we want that, having this behaviour for categories='auto' (instead of a new categories='dtype') might be nicer.

EDIT: hmm, of course what I am forgetting is that users can already explicitly do categories='auto' right now. So eg changing the actual default to categories=None (meaning the old default), so we can raise a warning if needed and point users towards specifying categories='auto'/'sort' to choose explicit behaviour will still introduce a breaking change for those who already explicitly passed that keyword ..

thomasjpfan · 2019-10-28T13:41:07Z

In the long run I would want auto to mean dtype. The question is how we get there. If we want to maintain backward compatibility, we need to warn that 'auto' will respect categorical dtypes starting in 0.24. With this warning, it would be nice to say "you can use the categories='dtype' option to enable this behavior now". When 'dtype' becomes the new 'auto', we would need to deprecate 'dtype' because it means the same as 'auto'.

(This is a fairly long deprecation path)

jorisvandenbossche · 2019-10-28T13:49:11Z

Yes, I was mainly trying to think if we can't do it with a shorter path, without having a new option that afterwards becomes obsolete. But yeah, as I noted in the edit to my comment above, it's difficult to do that in a fully backwards compatible way.

NicolasHug · 2019-11-08T16:08:29Z

I propose that we raise a warning if a feature passed to OneHotEncoder or OrdinalEncoder has a Pandas Categorical dtype

We only want to warn when the Categorical dtype is ordered right? Categories aren't necessarily ordered and this is actually pandas' default.

The current PR #15396 warns when categories aren't ordered which seems wrong to me.

(Sorry if this has been previously discussed)

thomasjpfan · 2019-11-08T16:49:34Z

pandas uses lexicographic order for its encoding when the categorical dtype is unordered, so it so happens that this is okay. Although I agree we should not rely on this, and should warn when the category is ordered.
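For reference, a small sketch of the pandas defaults being discussed: when no `categories` argument is given, pandas infers the categories from the sorted unique values, and the `ordered` flag defaults to `False`.

```python
import pandas as pd

# No explicit categories: pandas infers them from the sorted unique values.
inferred = pd.Categorical(["medium", "low", "high", "low"])

inferred_categories = list(inferred.categories)
inferred_codes = inferred.codes.tolist()
is_ordered = bool(inferred.ordered)

print(inferred_categories)  # ['high', 'low', 'medium']
print(inferred_codes)       # [2, 1, 0, 1]
print(is_ordered)           # False
```

The lexicographic match only holds when the categories are inferred; categories passed explicitly keep the order the user gave.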

jnothman · 2019-11-10T11:55:53Z

As I noted in the pr, I would be more comfortable warning in any case precisely because the categoricals do not have the ordered flag set by default.

NicolasHug · 2019-11-10T14:07:41Z

Ordinal categories are far less common than pure nominal categories, so IMHO the pandas default makes sense, and we would be warning for no good reason in most cases.

Why would you want to warn something about the order when there is no order in the first place?

jnothman · 2019-11-10T21:22:58Z

This comes back to what someone is using an OrdinalEncoder for... The ordering obviously has an effect.

NicolasHug · 2019-11-10T21:38:12Z

Sure, but then:

it's not really an issue for the one hot encoder so we should probably not warn there
does it even make sense to allow the ordinal encoder for unordered categories?

jnothman · 2019-11-10T22:03:25Z

Sigh. Yes, people use the OrdinalEncoder just to turn strings into ints so they can be fed to a forest.

NicolasHug · 2019-11-10T22:30:33Z

... which makes absolutely no sense unless those strings are ordered

glemaitre · 2019-11-15T14:23:18Z

I think that #14984, #15050, and #15396 might not be blockers for 0.22 and I would move them for 0.23.

I think that it could be great to have a single issue (superseding #14953, #14954) to discuss the overall behaviour for categories in OneHotEncoder and OrdinalEncoder, and from there have several PRs which follow the discussed proposals.

agramfort · 2019-11-17T09:04:33Z

... which makes absolutely no sense unless those strings are ordered

@NicolasHug I would not be so sure about this. Trees can cope with
random orders provided you make them deep enough.

NicolasHug · 2019-11-17T13:38:51Z

True but trees aren't always deep, typically in GB

To reproduce an OHE split using OE, you would need in the worst case C - 1 splits. That's not negligible when the OHE splits multiple times on the same feature during fitting. And because of the arbitrary order, you might just not ever consider such a split because the gain is too low.
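One possible reading of the worst-case figure, as a sketch: count how many threshold splits on the ordinal codes are needed to separate a chosen subset of categories from the rest. The function `thresholds_needed` is hypothetical, and treating the split as an arbitrary subset of categories is an assumption of this illustration rather than something stated above.

```python
def thresholds_needed(codes_in_subset, n_categories):
    """Count threshold splits needed to isolate a subset of ordinal codes.

    A threshold is needed between codes i and i + 1 whenever exactly one of
    them belongs to the subset.
    """
    member = [code in codes_in_subset for code in range(n_categories)]
    return sum(member[i] != member[i + 1] for i in range(n_categories - 1))


C = 6
edge_category = thresholds_needed({0}, C)            # one category at the edge
middle_category = thresholds_needed({2}, C)          # one category in the middle
alternating_subset = thresholds_needed({0, 2, 4}, C)  # every other category

print(edge_category, middle_category, alternating_subset)  # 1 2 5
```

Under this reading, the alternating subset reaches C - 1 thresholds, while a favourable ordering needs only one.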

The right way to handle nominal categories in trees is still to use an OHE, or to natively support categories like the nocats PRs.

In any case, going back to the original issue, my concern here is that the current proposal is to raise an order-related warning even when there is actually no order, which I think will just confuse / frustrate users.

agramfort · 2019-11-17T16:04:01Z

I would not raise a warning but maybe I assume that people know what they are doing...

…

jorisvandenbossche · 2019-11-18T12:25:53Z

Also note that you can have a pandas Categorical with a specific order without having it "ordered" (in the sense of the ordered=False/True attribute).

Eg:

In [12]: cat = pd.Categorical(['high', 'low', 'medium', 'low'], categories=['low', 'medium', 'high']) 

In [13]: cat  
Out[13]: 
[high, low, medium, low]
Categories (3, object): [low, medium, high]

So when people are passing this to an OrdinalEncoder, they might actually be doing it "correctly" already: even though it is not an "ordered" categorical, it just happens to have its categories in the sensible (non-lexicographic) order.
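As a sketch of the explicit route mentioned at the top of the thread, the dtype's categories can be handed to the encoder's `categories` parameter so that the declared order is used. The column name `level` is made up for this example, and the default 'auto' behaviour is not asserted here since it may differ across versions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(
    {
        "level": pd.Categorical(
            ["high", "low", "medium", "low"],
            categories=["low", "medium", "high"],
        )
    }
)

# Pass the declared categories explicitly, one list per feature.
explicit_enc = OrdinalEncoder(categories=[list(df["level"].cat.categories)])
explicit_codes = explicit_enc.fit_transform(df).ravel().tolist()

print(explicit_codes)  # [2.0, 0.0, 1.0, 0.0]
```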

cmarmo · 2020-10-26T15:00:13Z

@thomasjpfan two PRs of yours are meant to close this issue: #15050 has two approvals, so I have milestoned it against #15396. I'm unable to understand if they are both necessary. Do you mind clarifying and fixing conflicts (in one or both), if you think the milestone is still relevant? Thanks a lot for your collaboration.

thomasjpfan · 2020-10-30T19:01:42Z

I think the end goal is to have "auto" == use the encoding provided by pandas, at least for OrdinalEncoder.

#15050 is a warning to tell the user that the order we use does not match the pandas categorical. I can update this to say that "auto" will use the pandas ordering in 0.26. This warning can be a little annoying because the only way to avoid it is to use a python warnings filter.

#15396 would need to be updated to adjust 'auto' and then not merged until 0.26, because it will contain the implementation for using the pandas categorical.
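For comparison, the encoding that pandas itself provides is exposed through `Series.cat.codes`. This is a small sketch of what that encoding looks like, not of how either PR implements it.

```python
import pandas as pd

levels = pd.Series(
    pd.Categorical(
        ["high", "low", "medium", "low"],
        categories=["low", "medium", "high"],
    )
)

# Integer codes follow the position of each value in dtype.categories.
codes = levels.cat.codes.tolist()

# Missing values are encoded as -1 by pandas.
with_missing = pd.Series(
    pd.Categorical(["low", None], categories=["low", "medium", "high"])
)
codes_with_missing = with_missing.cat.codes.tolist()

print(codes)               # [2, 0, 1, 0]
print(codes_with_missing)  # [0, -1]
```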

jnothman · 2020-11-01T12:52:01Z

Is this something where we should just break auto behaviour for pandas categoricals in version 1.0??

adrinjalali · 2021-08-22T09:38:26Z

Doesn't seem like we'd get this in time. Moving to 2.0

amueller · 2024-01-22T19:15:44Z

Is there a current/recent issue tracking the support for pd.Categorical? So it was added to HGB, but not anywhere else, right?
I feel like there should be a consistent story, and a lot of conversation has happened over the last... 5 years? But I can't find a current status or summary.
I saw #24967, and some related thoughts in #27947, but I haven't seen anything more recent?

glemaitre · 2024-01-22T19:36:39Z

We have a related PR that wants to leverage this feature: #27911

amueller · 2024-01-22T19:49:56Z

Thanks, that's quite interesting, but also somewhat orthogonal. I was thinking more about the API surface. Currently a user doesn't know whether categorical features are treated correctly or not in a model without reading the documentation of each model and its default hyper-parameters. Passing categoricals encoded as integers and with categorical dtype to LogisticRegression, RandomForestClassifier and HistGradientBoostingClassifier gives very different results.

glemaitre · 2024-01-23T23:02:15Z

Indeed, we did not have that discussion. HistGradientBoosting is the first model that handles categorical features in a "native manner". I assume that we could have something similar for all tree-based approaches: you get missing-value and categorical handling without the need for preprocessing.

For linear models, we went kind of backward by deprecating the normalize parameter and relying entirely on the ColumnTransformer. The fact that you need to handle numerical features, categorical features, and missing values together makes those models trickier.

The pattern of using the TableVectorizer from skrub (https://skrub-data.org/stable/generated/skrub.TableVectorizer.html#skrub.TableVectorizer) could be a way to simplify such a pipeline, and the same pattern could potentially also be used with tree-based models, but I'm not sure this is necessary when one deals only with numerical and categorical features (when dates come into play, it could be useful).

But I rather think that we should have the discussion on that topic.

jnothman added the "Easy" (well-defined and straightforward way to resolve), "Moderate" (anything that requires some knowledge of conventions and best practices), and "help wanted" labels Sep 11, 2019

jnothman mentioned this issue Sep 11, 2019

OrdinalEncoder with string categories should force user to specify the order #14563

Closed

jnothman added this to In progress in Big picture ideas for 0.22,0.23 Sep 11, 2019

thomasjpfan linked a pull request Sep 21, 2019 that will close this issue

[MRG] ENH Adds warning with pandas category does not match lexicon ordering #15050

Open

jorisvandenbossche mentioned this issue Oct 28, 2019

LabelEncoder ignores pandas CategoricalDtype order #12086

Closed

thomasjpfan mentioned this issue Oct 29, 2019

[MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder #15396

Closed

glemaitre mentioned this issue Nov 15, 2019

[MRG] Sorting ordering option in OrdinalEncoder #14984

Closed

glemaitre mentioned this issue Nov 15, 2019

OrdinalEncoder: Deprecate automatically assuming lexicographic ordering #14954

Open

adrinjalali added this to To do in Categorical Nov 16, 2019

rth mentioned this issue Nov 22, 2019

Add "other" / min_frequency option to OneHotEncoder #12153

Closed

rth removed the "Easy" (well-defined and straightforward way to resolve) label Dec 4, 2019

rth mentioned this issue Dec 4, 2019

META OHE / OrdinalEncoder: NaN support, unfrequent cat. and pd.Categorical #15796

Open

cmarmo removed the help wanted label Jul 23, 2020

cmarmo added this to the 1.0 milestone Nov 2, 2020

cmarmo added the "Breaking Change" (issue resolution would not be easily handled by the usual deprecation cycle) label Nov 2, 2020

adrinjalali modified the milestones: 1.0, 2.0 Aug 22, 2021

cmarmo added the module:preprocessing label Mar 23, 2022

lorentzenchr mentioned this issue Jun 2, 2023

ENH Adds native pandas categorical support to gradient boosting #26411

Merged
