Handle Error Policy in OrdinalEncoder #13488

Closed
daskol opened this issue Mar 21, 2019 · 26 comments · Fixed by #17406

Comments

@daskol commented Mar 21, 2019

The OneHotEncoder preprocessing class allows transformation even when unknown values are encountered. It would be great to introduce the same option in OrdinalEncoder. It seems simple to do, since OrdinalEncoder (like OneHotEncoder) is derived from _BaseEncoder, which already implements the error-handling policy.
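A minimal reproduction of the behavior being requested against (assuming scikit-learn as of the time of this issue): OrdinalEncoder raises on any category unseen during fit.

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
enc.fit([["cat"], ["dog"]])

try:
    enc.transform([["fish"]])  # "fish" was never seen during fit
    raised = False
except ValueError as exc:
    raised = True
    print("raised:", exc)
```

There is no constructor option to avoid this; the exception is the only behavior.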

@jnothman (Member) commented Mar 21, 2019 via email

@daskol (Author) commented Mar 21, 2019

No, #12264 is not my case, although that would be desirable too. I would like OrdinalEncoder not to throw an exception when it meets an unexpected value, that is, a value which categories does not contain.

@jnothman (Member) commented Mar 22, 2019 via email

@daskol (Author) commented Mar 22, 2019

handle_unknown : ‘error’ or ‘ignore’, default=’error’.

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

We could ignore unknown category values as OneHotEncoder does. Another possible scenario (say, sentinel) would be to replace an unknown value with a default one specified in OrdinalEncoder's constructor.
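The OneHotEncoder behavior quoted above can be seen in a small sketch: an unknown category becomes an all-zeros row, and inverse_transform maps that row back to None.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["cat"], ["dog"]])

# An unknown category encodes as an all-zeros row ...
row = enc.transform([["fish"]]).toarray()
print(row)  # [[0. 0.]]

# ... and inverse_transform maps that row back to None.
print(enc.inverse_transform(row))  # [[None]]
```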

@NicolasHug (Member)

The policy of OneHotEncoder for unknown categories is to set all the columns to 0. That wouldn't be great for OrdinalEncoder, since unknown categories and the first one (i.e. zero) would be mixed up.

Allowing a fallback category value seems reasonable to me.

@daskol could you please describe your particular use-case that motivates this change?

@ogrisel (Member) commented Mar 27, 2019

We discussed the issue with @jorisvandenbossche and I think the sanest strategy would be to have:

  • min_frequency=5 (5 is an example; the default could be 1) to set the threshold for collapsing all categories that appear fewer than 5 times in the training set into a virtual category
  • rare_category="rare_value" as a parameter to control the name of the virtual category used to map all the rare values. This will mostly be useful for inverse_transform, and for consistency if we implement a similar category-collapsing option in other categorical variable preprocessors such as OneHotEncoder and ImpactEncoder / TargetEncoder (whatever its name).
  • handle_unknown="treat_as_rare" that would map any inference-time unknown category to the integer mapped to the virtual rare_category (even when it is never seen in the training set)

IMO handle_unknown="treat_as_rare" is the sanest way to handle inference-time unknown values from a statistical point of view.
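A hypothetical pure-Python sketch of the proposal above. The names min_frequency, rare_category, and "treat_as_rare" are the proposed parameter names from this comment, not an existing scikit-learn API, and fit_categories/transform are illustrative helpers:

```python
from collections import Counter

def fit_categories(column, min_frequency=1, rare_category="rare_value"):
    """Map each frequent category to an integer; reserve one code for rares."""
    counts = Counter(column)
    frequent = sorted(c for c, n in counts.items() if n >= min_frequency)
    mapping = {c: i for i, c in enumerate(frequent)}
    mapping[rare_category] = len(mapping)  # virtual category, always present
    return mapping

def transform(column, mapping, rare_category="rare_value"):
    rare_code = mapping[rare_category]
    # Collapsed-rare and inference-time unknown categories share the rare code.
    return [mapping.get(c, rare_code) for c in column]

train = ["a", "a", "a", "b", "b", "c"]
mapping = fit_categories(train, min_frequency=2)
print(transform(["a", "b", "zzz"], mapping))  # "c" is rare, "zzz" is unknown
```

Because the virtual category always exists, transform never has to raise.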

@ogrisel (Member) commented Mar 27, 2019

Personally I find this issue really annoying. At the moment we cannot use OrdinalEncoder on data with a long-tailed distribution of categorical variable frequencies in a cross-validation loop without triggering the unknown-category exception at prediction time.

@amueller (Member) commented Apr 2, 2019

Also see #12153 for something related. I think min_frequency is good, but I also want max_levels or something like that. Basically we could reimplement all the different pruning options we have in CountVectorizer...

@thomasjpfan (Member)

If we add a “rare” category to OrdinalEncoder it seems like it goes against the “Ordinal” part of the encoder. It assumes the rare or unknown category has the highest (or lowest) value. If we do introduce this, it would be good to document this behavior.

Ideally, if OrdinalEncoder can handle most of the logic that deals with unknown and infrequent categories, OneHotEncoder would need to do less. (I am thinking of composition, i.e. OneHotEncoder containing an OrdinalEncoder.)
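A toy illustration of the composition idea: one-hot output as a thin layer over ordinal codes. This is a sketch only, not how scikit-learn is actually structured; the helper names are made up.

```python
import numpy as np

def ordinal_fit(column):
    # Lexically sorted categories, mirroring the encoders' default ordering.
    return {c: i for i, c in enumerate(sorted(set(column)))}

def ordinal_transform(column, mapping):
    return np.array([mapping[c] for c in column])

def one_hot_transform(column, mapping):
    # One-hot is just row selection from an identity matrix by ordinal code.
    codes = ordinal_transform(column, mapping)
    return np.eye(len(mapping), dtype=int)[codes]

mapping = ordinal_fit(["cat", "dog", "cat"])
print(one_hot_transform(["dog", "cat"], mapping))
```

With that structure, unknown-category handling implemented once in the ordinal layer would automatically serve the one-hot layer.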

@nathanielmhld

Hi! I wanted to add that fitting the encoder on data that contains some categories and then applying it to data that contains an additional category or two seems like a common use case for all kinds of categorical data. Even if no category is especially uncommon, if you're doing a random split for training and validation, it's only a matter of time before this error comes up.

I like the min_frequency solution for its generality, but to (naive) me it seems too complicated. To me, the default behavior should be to send all categories not present in the original fit to a single virtual category, or maybe there should be a create_virtual_category=True option. If this is amenable, I'd be happy to take a crack at making it; I'm trying to spend more time working on open source code!

@jnothman (Member) commented Jul 31, 2019 via email

@nathanielmhld

I didn't pick that up from the docs; my understanding was that the encoder assigned essentially arbitrary integer values to each category, but you're saying it assigns them based on frequency or something else, which means that order matters? If that's the case, I'd suggest an update to the docs, because I didn't understand that from them. If it's not, then couldn't we just stick the virtual category at the end of the original ordinal categories?

@jnothman (Member) commented Jul 31, 2019 via email

@amueller (Member) commented Aug 2, 2019

I think the OrdinalEncoder is weird because it is indeed the intention that the order matters; that's why it's called OrdinalEncoder. But the order is always lexical, which rarely makes sense. I have a hard time coming up with usage scenarios for OrdinalEncoder because of that.
I guess the user could always rename their categories so that lexical ordering is useful. But then what's the point of the encoder?
If we say "it's ordered but the order is arbitrary", would that make it easier to reason about?

@ogrisel iirc you were one of the people that wanted this encoder. Can you give examples of your motivation?
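The lexical ordering discussed above is easy to observe: with the default categories='auto', the fitted categories come out sorted, so the integer codes follow alphabetical order rather than, say, an intended low/medium/high order.

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
enc.fit([["medium"], ["low"], ["high"]])
print(enc.categories_)           # sorted: 'high', 'low', 'medium'
print(enc.transform([["low"]]))  # [[1.]]
```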

@nathanielmhld

I totally agree, @amueller. I was using it for preprocessing data before LightGBM, which requires integer data but lets you flag certain columns as categorical, so the order does NOT matter. I am confused about what "order matters" means, because by its very nature there is no correct order of the classes. An "order matters" OrdinalEncoder doesn't make sense to me when no order is specified.

@jnothman (Member) commented Aug 3, 2019 via email

@jnothman (Member) commented Aug 3, 2019 via email

@NicolasHug (Member)

I think we should force the user to pass the order of the categories when they are strings. I opened #14563

@Sandy4321
Copy link

Friends, could you please speed up this fix? In real life there are often many differences between the train and test data. As @ogrisel stated above: "Personally I find this issue really annoying."

Could you at least do what @daskol wrote above: ignore unknown categories as OneHotEncoder does, or, as another scenario (say, sentinel), replace an unknown value with a default one specified in OrdinalEncoder's constructor.

Yes, just ignore unknown categories during transform: replace an unknown value with a default one specified in OrdinalEncoder's constructor, or set it to None. But please do something!

@jnothman (Member) commented Feb 12, 2020 via email

@Sandy4321

Order is not important. The idea is to make the encoder practical, because right now it only handles an idealized case. In real data there are always new categories in the test data (categories unseen in the training data), and then OrdinalEncoder just crashes.
Please add the capability to handle cases where the test data contains categories that are not present in the train data.

@chutcheson

I think that this would be a really helpful change. Given the choice of a default virtual encoding for unseen values, I would rather it be the first value as opposed to the last value.

I feel that this is intuitive because you always know where the first value is without looking (e.g. if the default encoding were -1 or 0).

As a practical use case, I am converting categorical data to numeric data to use with the RandomForestRegressor.

Some of my categories are quite small (~5 types) while some are larger (~5000 types).

I would like to use the OneHotEncoder on my smaller types because it will make my features more interpretable when I look at their permutation importance.

I would like to use the OrdinalEncoder on my larger categories because it will make actions like permutation importance less computationally expensive and my model choice will be robust to the effect of the OrdinalEncoder's choice of ordering the features.

However, I will no doubt encounter a number of examples in my test data for the larger categories that are not present in my training dataset, so if I use the OrdinalEncoder it will throw an error.

My other alternative is to do something hacky with Pandas or the DictVectorizer.
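One version of the "hacky" pandas alternative mentioned above: pandas.Categorical assigns code -1 to any value outside the declared categories, which gives unknown test-time values a dedicated sentinel code for free.

```python
import pandas as pd

# Categories learned from the training data (illustrative values).
train_categories = ["a", "b", "c"]
test_values = ["a", "zzz", "c"]

# Values not in the declared categories get code -1 instead of raising.
codes = pd.Categorical(test_values, categories=train_categories).codes
print(list(codes))  # unseen "zzz" becomes -1
```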

@nickcorona commented Jun 7, 2020

"No, I just mean that if you want an ordinal encoding, rather than one-hot, it will often be because order matters."

That's not true, though. Often people use OrdinalEncoder because OneHotEncoder would expand the feature array too much, so filling in a variable with arbitrary numbers is a more space-efficient option that doesn't make much of a difference in tree ensembles.

@jdraines

I encountered a similar need. I was hoping to use sklearn to encode categorical features prior to passing them to an Embedding layer in keras. In this case, the order of the categorical features does not matter, and it would be helpful to have the ability to make any unknown categories encountered at transform time be encoded by the same integer.

My solution for myself was to make a new encoder subclass I called CardinalEncoder. Its __init__ has a handle_unknown (bool) parameter. If True, then fit will fit a copy of X in which the final element is replaced with 'Unknown'. During transform, the original form of X is encoded. If the final element in X is unique, then no data loss will occur on a fit_transform call, but some data loss may occur on subsequent transforms if the data includes instances of the unique final fitted element as well as other unknown elements. That was an acceptable trade-off for me, as the likelihood of X's final element being unique was very low for my data.

My solution is at:
https://github.com/jdraines/cardinal_encoder/blob/master/cardinal_encoder.py
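A pure-Python sketch of the trick described above, reconstructed from the description rather than copied from the linked code; the helper names and sentinel value are illustrative.

```python
def fit(column, sentinel="Unknown"):
    # Fit on a copy whose final element is replaced by the sentinel, so the
    # sentinel is guaranteed to receive a code (at the cost of possibly
    # losing the last element if it was unique in the data).
    fitted = list(column[:-1]) + [sentinel]
    return {c: i for i, c in enumerate(sorted(set(fitted)))}

def transform(column, mapping, sentinel="Unknown"):
    # Anything unseen at fit time falls back to the sentinel's code.
    return [mapping.get(c, mapping[sentinel]) for c in column]

mapping = fit(["a", "b", "c", "d"])      # "d" is sacrificed for "Unknown"
print(transform(["a", "zzz"], mapping))  # "zzz" gets the sentinel code
```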

@Sandy4321

Great idea.
But why are the scikit-learn people resistant to doing this? Why be against it?

@Sandy4321

jdraines/cardinal_encoder (implements a scikit-learn CardinalEncoder which differs from OrdinalEncoder in that it handles unknowns) is great.
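As noted in the issue header, this was eventually closed by #17406, which added handle_unknown="use_encoded_value" together with an unknown_value parameter (available in scikit-learn 0.24 and later):

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit([["cat"], ["dog"]])
print(enc.transform([["fish"]]))  # unknown category mapped to -1
```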
