Handle Error Policy in OrdinalEncoder #13488

Closed
daskol opened this issue Mar 21, 2019 · 26 comments · Fixed by #17406

Comments

@daskol commented Mar 21, 2019

The OneHotEncoder preprocessing class allows transformation even when unknown values are encountered. It would be great to introduce the same option in OrdinalEncoder. It seems simple to do, since OrdinalEncoder (like OneHotEncoder) is derived from _BaseEncoder, which already implements the error-handling policy.
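A minimal reproduction of the behavior being requested against (assuming scikit-learn as of the time of this issue): OrdinalEncoder raises on any category unseen during fit.

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
enc.fit([["cat"], ["dog"]])

try:
    enc.transform([["fish"]])  # "fish" was never seen during fit
    raised = False
except ValueError as exc:
    raised = True
    print("raised:", exc)
```

There is no constructor option to avoid this; the exception is the only behavior.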

@jnothman (Member) commented Mar 21, 2019 via email

@daskol (Author) commented Mar 21, 2019

No, #12264 is not my case, although that would be desirable too. I would like OrdinalEncoder not to throw an exception when it meets an unexpected value, that is, a value which categories does not contain.

@jnothman (Member) commented Mar 22, 2019 via email

@daskol (Author) commented Mar 22, 2019

handle_unknown : ‘error’ or ‘ignore’, default=’error’.

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

We could ignore unknown category values as OneHotEncoder does. Another possible scenario (say, sentinel) would be to replace an unknown value with a default one specified in OrdinalEncoder's constructor.
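The OneHotEncoder behavior quoted above can be seen in a small sketch: an unknown category becomes an all-zeros row, and inverse_transform maps that row back to None.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["cat"], ["dog"]])

# An unknown category encodes as an all-zeros row ...
row = enc.transform([["fish"]]).toarray()
print(row)  # [[0. 0.]]

# ... and inverse_transform maps that row back to None.
print(enc.inverse_transform(row))  # [[None]]
```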

@NicolasHug (Member)

The policy of OneHotEncoder for unknown categories is to set all the columns to 0. That wouldn't be great for OrdinalEncoder, since unknown categories and the first one (i.e. zero) would be mixed up.

Allowing a fallback category value seems reasonable to me.

@daskol could you please describe your particular use-case that motivates this change?

@ogrisel (Member) commented Mar 27, 2019

We discussed the issue with @jorisvandenbossche and I think the sanest strategy would be to have:

  • min_frequency=5 (5 is an example; the default could be 1) to set the threshold for collapsing all categories that appear fewer than 5 times in the training set into a virtual category
  • rare_category="rare_value" as a parameter to control the name of the virtual category used to map all the rare values. This will mostly be useful for inverse_transform, and for consistency if we implement a similar category-collapsing option in other categorical variable preprocessors such as OneHotEncoder and ImpactEncoder / TargetEncoder (whatever its name).
  • handle_unknown="treat_as_rare" that would map any inference-time unknown category to the integer mapped to the virtual rare_category (even when it is never seen in the training set)

IMO handle_unknown="treat_as_rare" is the sanest way to handle inference-time unknown values from a statistical point of view.
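A hypothetical pure-Python sketch of the proposal above. The names min_frequency, rare_category, and "treat_as_rare" are the proposed parameter names from this comment, not an existing scikit-learn API, and fit_categories/transform are illustrative helpers:

```python
from collections import Counter

def fit_categories(column, min_frequency=1, rare_category="rare_value"):
    """Map each frequent category to an integer; reserve one code for rares."""
    counts = Counter(column)
    frequent = sorted(c for c, n in counts.items() if n >= min_frequency)
    mapping = {c: i for i, c in enumerate(frequent)}
    mapping[rare_category] = len(mapping)  # virtual category, always present
    return mapping

def transform(column, mapping, rare_category="rare_value"):
    rare_code = mapping[rare_category]
    # Collapsed-rare and inference-time unknown categories share the rare code.
    return [mapping.get(c, rare_code) for c in column]

train = ["a", "a", "a", "b", "b", "c"]
mapping = fit_categories(train, min_frequency=2)
print(transform(["a", "b", "zzz"], mapping))  # "c" is rare, "zzz" is unknown
```

Because the virtual category always exists, transform never has to raise.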

@ogrisel (Member) commented Mar 27, 2019

Personally I find this issue really annoying. At the moment we cannot use OrdinalEncoder on data with a long-tailed distribution of categorical variable frequencies in a cross-validation loop without triggering the unknown-category exception at prediction time.

@amueller (Member) commented Apr 2, 2019

Also see #12153 for something related. I think min_frequency is good, but I also want max_levels or something like that. Basically we could reimplement all the different pruning options we have in CountVectorizer...

@thomasjpfan (Member)

If we add a “rare” category to OrdinalEncoder it seems like it goes against the “Ordinal” part of the encoder. It assumes the rare or unknown category has the highest (or lowest) value. If we do introduce this, it would be good to document this behavior.

Ideally, if OrdinalEncoder can handle most of the logic that deals with unknown and infrequent categories, OneHotEncoder would need to do less. (I am thinking of composition, i.e. OneHotEncoder containing an OrdinalEncoder.)
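A toy illustration of the composition idea: one-hot output as a thin layer over ordinal codes. This is a sketch only, not how scikit-learn is actually structured; the helper names are made up.

```python
import numpy as np

def ordinal_fit(column):
    # Lexically sorted categories, mirroring the encoders' default ordering.
    return {c: i for i, c in enumerate(sorted(set(column)))}

def ordinal_transform(column, mapping):
    return np.array([mapping[c] for c in column])

def one_hot_transform(column, mapping):
    # One-hot is just row selection from an identity matrix by ordinal code.
    codes = ordinal_transform(column, mapping)
    return np.eye(len(mapping), dtype=int)[codes]

mapping = ordinal_fit(["cat", "dog", "cat"])
print(one_hot_transform(["dog", "cat"], mapping))
```

With that structure, unknown-category handling implemented once in the ordinal layer would automatically serve the one-hot layer.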

@nathanielmhld

Hi! I wanted to add that fitting the encoder on data that contains some categories and then applying it to data that contains an additional category or two seems like a common use case for all kinds of categorical data. Even if no category is especially uncommon, if you're doing a random split for training and validation, it's only a matter of time before this error comes up.

I like the min_frequency solution for its generality, but to (naive) me it seems too complicated. To me, the default behavior should be to send all categories not present in the original fit to a single virtual category, or maybe there should be a create_virtual_category=True option. If this is amenable, I'd be happy to take a crack at making it; I'm trying to spend more time working on open source code!

@jnothman (Member) commented Jul 31, 2019 via email

@nathanielmhld

I didn't pick that up from the docs; my understanding was that the encoder assigned essentially arbitrary integer values to each category, but you're saying it assigns them based on frequency or something else, which means that order matters? If that's the case, I'd suggest an update to the docs, because I didn't understand that from them. If it's not, then couldn't we just stick the virtual category at the end of the original ordinal categories?

@jnothman (Member) commented Jul 31, 2019 via email

@amueller (Member) commented Aug 2, 2019

I think the OrdinalEncoder is weird because it is indeed the intention that the order matters; that's why it's called OrdinalEncoder. But the order is always lexical, which rarely makes sense. I have a hard time coming up with usage scenarios for OrdinalEncoder because of that.
I guess the user could always rename their categories so that lexical ordering is useful. But then what's the point of the encoder?
If we say "it's ordered but the order is arbitrary", would that make it easier to reason about?

@ogrisel iirc you were one of the people that wanted this encoder. Can you give examples of your motivation?
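The lexical ordering discussed above is easy to observe: with the default categories='auto', the fitted categories come out sorted, so the integer codes follow alphabetical order rather than, say, an intended low/medium/high order.

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
enc.fit([["medium"], ["low"], ["high"]])
print(enc.categories_)           # sorted: 'high', 'low', 'medium'
print(enc.transform([["low"]]))  # [[1.]]
```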

@nathanielmhld

I totally agree, @amueller. I was using it for preprocessing data before LightGBM, which requires integer data but lets you flag certain columns as categorical, so the order does NOT matter. I am confused about what "order matters" means, because by its very nature there is no correct order of the classes. An "order matters" OrdinalEncoder doesn't make sense to me when no order is specified.

@jnothman (Member) commented Aug 3, 2019 via email

@jnothman (Member) commented Aug 3, 2019 via email

@NicolasHug (Member)

I think we should force the user to pass the order of the categories when they are strings. I opened #14563

@Sandy4321
Copy link

Friends, could you please speed up this fix? In real life there are often many differences between the train and test data. As @ogrisel stated above: "Personally I find this issue really annoying."

Could you at least do what @daskol wrote above: ignore unknown categories as OneHotEncoder does, or, as another scenario (say, sentinel), replace an unknown value with a default one specified in OrdinalEncoder's constructor.

Yes, just ignore unknown categories during transform: replace an unknown value with a default one specified in OrdinalEncoder's constructor, or set it to None. But please do something!

@jnothman (Member) commented Feb 12, 2020 via email

@Sandy4321

Order is not important. The idea is to make the encoder practical, because right now it only handles an idealized case. In real data there are always new categories in the test data (categories unseen in the training data), and then OrdinalEncoder just crashes.
Please add the capability to handle cases where the test data contains categories that are not present in the train data.

@chutcheson

I think that this would be a really helpful change. Given the choice of a default virtual encoding for unseen values, I would rather it be the first value as opposed to the last value.

I feel that this is intuitive because you always know where the first value is without looking (e.g. if the default encoding were -1 or 0).

As a practical use case, I am converting categorical data to numeric data to use with the RandomForestRegressor.

Some of my categories are quite small (~5 types) while some are larger (~5000 types).

I would like to use the OneHotEncoder on my smaller types because it will make my features more interpretable when I look at their permutation importance.

I would like to use the OrdinalEncoder on my larger categories because it will make actions like permutation importance less computationally expensive and my model choice will be robust to the effect of the OrdinalEncoder's choice of ordering the features.

However, I will no doubt encounter a number of examples in my test data for the larger categories that are not present in my training dataset, so if I use the OrdinalEncoder it will throw an error.

My other alternative is to do something hacky with Pandas or the DictVectorizer.
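One version of the "hacky" pandas alternative mentioned above: pandas.Categorical assigns code -1 to any value outside the declared categories, which gives unknown test-time values a dedicated sentinel code for free.

```python
import pandas as pd

# Categories learned from the training data (illustrative values).
train_categories = ["a", "b", "c"]
test_values = ["a", "zzz", "c"]

# Values not in the declared categories get code -1 instead of raising.
codes = pd.Categorical(test_values, categories=train_categories).codes
print(list(codes))  # unseen "zzz" becomes -1
```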

@nickcorona commented Jun 7, 2020

"No, I just mean that if you want an ordinal encoding, rather than one-hot, it will often be because order matters."

That's not true, though. Often people use OrdinalEncoder because OneHotEncoder would expand the feature array too much, so filling in a variable with arbitrary numbers is a more space-efficient option that doesn't make much of a difference in tree ensembles.

@jdraines

I encountered a similar need. I was hoping to use sklearn to encode categorical features prior to passing them to an Embedding layer in keras. In this case, the order of the categorical features does not matter, and it would be helpful to have the ability to make any unknown categories encountered at transform time be encoded by the same integer.

My solution for myself was to make a new encoder subclass I called CardinalEncoder. Its __init__ has a handle_unknown (bool) parameter. If True, then fit will fit a copy of X in which the final element is replaced with 'Unknown'. During transform, the original form of X is encoded. If the final element in X is unique, then no data loss will occur on a fit_transform call, but some data loss may occur on subsequent transforms if the data includes instances of the unique final fitted element as well as other unknown elements. That was an acceptable trade-off for me, as the likelihood of X's final element being unique was very low for my data.

My solution is at:
https://github.com/jdraines/cardinal_encoder/blob/master/cardinal_encoder.py
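A pure-Python sketch of the trick described above, reconstructed from the description rather than copied from the linked code; the helper names and sentinel value are illustrative.

```python
def fit(column, sentinel="Unknown"):
    # Fit on a copy whose final element is replaced by the sentinel, so the
    # sentinel is guaranteed to receive a code (at the cost of possibly
    # losing the last element if it was unique in the data).
    fitted = list(column[:-1]) + [sentinel]
    return {c: i for i, c in enumerate(sorted(set(fitted)))}

def transform(column, mapping, sentinel="Unknown"):
    # Anything unseen at fit time falls back to the sentinel's code.
    return [mapping.get(c, mapping[sentinel]) for c in column]

mapping = fit(["a", "b", "c", "d"])      # "d" is sacrificed for "Unknown"
print(transform(["a", "zzz"], mapping))  # "zzz" gets the sentinel code
```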

@Sandy4321

Great idea.
But why are the scikit-learn people resistant to doing this? Why be against it?

@Sandy4321

jdraines/cardinal_encoder (implements a scikit-learn CardinalEncoder which differs from OrdinalEncoder in that it handles unknowns) is great.
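As noted in the issue header, this was eventually closed by #17406, which added handle_unknown="use_encoded_value" together with an unknown_value parameter (available in scikit-learn 0.24 and later):

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit([["cat"], ["dog"]])
print(enc.transform([["fish"]]))  # unknown category mapped to -1
```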
