Rethinking the CategoricalEncoder API? #10521

Closed
jorisvandenbossche opened this Issue Jan 23, 2018 · 63 comments

jorisvandenbossche commented Jan 23, 2018

Based on some discussions we are having here and in opened issues, we have some doubts about whether CategoricalEncoder (#9151) was the right choice of name (and since it is not released yet, we still have room for change).

A summary of how it is now:

  • The class name CategoricalEncoder says what type of data it accepts (categorical data)
  • The keyword argument encoding specifies how to encode those data

Currently we already have encoding='onehot'|'onehot-dense'|'ordinal'.
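For readers new to the distinction, a minimal pure-Python sketch (not scikit-learn's implementation) of what the 'ordinal' and 'onehot' options produce for a single feature:

```python
# Minimal sketch of the two encodings, independent of scikit-learn.
def ordinal_encode(values, categories):
    """Map each value to the index of its category."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]

def onehot_encode(values, categories):
    """Map each value to a 0/1 indicator row, one column per category."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows

cats = ['a', 'b', 'c']
print(ordinal_encode(['b', 'a', 'c'], cats))  # [1, 0, 2]
print(onehot_encode(['b', 'a', 'c'], cats))   # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```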

But what to do in the following cases:

  • We want to add more encoding options (e.g. binary encoding, mean target encoding, unary encoding, ...). Do we keep adding those as new values for the encoding kwarg in the one big CategoricalEncoder class?
  • We want to add an option specific to one of the encodings (e.g. for 'onehot' encoding to drop the first (redundant) column, or for 'ordinal' encoding to base the order of the categories on the frequency, ...). The problem here is that we then need to add additional keyword arguments to CategoricalEncoder that are or are not active depending on what you passed for the encoding kwarg, which is not the nicest API design.

For that last problem, we already had this with the sparse=True/False option, which was only relevant for 'onehot' and not for 'ordinal', and which we solved with having both 'onehot' and 'onehot-dense' encoding options and not a sparse keyword. But such an approach also does not scale.

Related to this, there is a PR to add a UnaryEncoder (#8652). There was a related discussion on the naming in that PR, as currently the name says how it encodes, not what type of data it gets (in the current design, it accepts already encoded integers, not actual categorical data. In that regard, to be consistent with CategoricalEncoder, it might better be named OrdinalEncoder because it needs ordinal data as input).


The options going forward:

  1. Keep things as we have them now in master, and be OK with adding some new options to the single class (an important question, which is hard to answer now, is how many new features we will want to add in the future).
  2. Switch the naming scheme and have a bunch of 'categorical encoders' where the name says how it encodes (OneHotEncoder, OrdinalEncoder, and later maybe BinaryEncoder, UnaryEncoder, ...)

So it is a bit of a trade-off between a potential build-up of classes and a build-up of keyword arguments in a single class.


One problem with the second approach (and one of the reasons we went with CategoricalEncoder in the first place, even before we added the multiple encoding options) is that there is already a OneHotEncoder, which has a different API than the CategoricalEncoder. And there is not really another good name we could use for the encoder that does one-hot encoding.
However, I think that, with some temporary ugly hacks, we could reuse the name, if we are OK with deprecating the current attributes (and I think we agree they are not the most useful attributes). The idea would be that if you fit the class with string data, you get the new behaviour, and if you fit the class with integer data, you get a deprecation warning indicating the default behaviour will change (and indicating which keyword to specify to get rid of the warning).

cc @jnothman @amueller @GaelVaroquaux @rth

ogrisel commented Jan 23, 2018

Thanks for the summary @jorisvandenbossche. I think I am in favor of option 2: reuse the OneHotEncoder class, deprecate the weird attributes, and add a constructor parameter to select the behavior, with a future warning saying that the default behavior will change and that the warning can be silenced just by passing a value for that option.

jnothman commented Jan 23, 2018

GaelVaroquaux commented Jan 23, 2018

jorisvandenbossche commented Jan 23, 2018

> The idea of reverting CategoricalEncoder makes me quite sad

To be clear, it would not be a revert, it would be a refactor / rename that keeps all functionality!
But I also like the "CategoricalEncoder" name, that would indeed be sad.

That said, I will quickly try to do the changes to get an idea of how feasible it is to integrate this in OneHotEncoder.

jorisvandenbossche commented Jan 23, 2018

OK, I opened a PR with a proof of concept: #10523.
It's not yet complete (no deprecation warnings yet, and the new attributes are not yet calculated in the old behaviour).

The main API question is about the format of the input data.
So as a recap, there are two different ways we currently process the categorical data:

  1. As actual, not yet encoded (integer or string), categorical data (how it is done in CategoricalEncoder) -> infer categories from the unique values in the training data
  2. As integer, already encoded data (how it is done in the current OneHotEncoder) -> infer categories from the maximum value in the training data
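To make the difference concrete, a small sketch of the two inference strategies for a single feature:

```python
x = [2, 3, 5, 2]  # one integer feature from the training data

# 1. CategoricalEncoder-style: the categories are the unique training values
cats_unique = sorted(set(x))            # [2, 3, 5]

# 2. Old OneHotEncoder-style: the categories span 0..max(x)
cats_max = list(range(max(x) + 1))      # [0, 1, 2, 3, 4, 5]

print(cats_unique, cats_max)
```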

The question is: do we find both cases worth supporting? Thus, in the potentially merged OneHotEncoder, do we keep the ability to do both, or do we fully deprecate and then remove the ability to process ordinal input?

If we want the ability to process both, we can add a boolean keyword to specify the input data type (for now I use encoded_input=False/True, but other ideas are ordinal_input, ...)

For the deprecation period, we have to support both anyway, and also have to introduce a keyword to choose the behaviour (to be able to silence the warning and choose the new behaviour).
So in principle we could just keep the keyword afterwards.

Given that we want to handle both, an overview of how OneHotEncoder would work:

  • for now encoded_input=None, and we infer the default based on the data
  • if the data is int-like (handled before by OneHotEncoder), encoded_input is internally set to True and a deprecation warning is raised. If the user wants to keep the current behaviour, they can specify it manually as OneHotEncoder(encoded_input=True) to silence the warning.
  • if the input is not int-like, we set encoded_input internally to False and without warning use the new behaviour (= the current CategoricalEncoder behaviour)
  • in the future we change the default of encoded_input from None to False (by default the new behaviour, also for int-like data)
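The four bullets above could be sketched roughly as follows (using the tentatively named encoded_input keyword from the comment; this is an illustration, not the PR's code):

```python
import warnings
import numpy as np

def resolve_encoded_input(X, encoded_input=None):
    """Sketch of the proposed default resolution for `encoded_input`."""
    if encoded_input is not None:
        return encoded_input            # user chose explicitly: no warning
    X = np.asarray(X)
    if np.issubdtype(X.dtype, np.integer):
        warnings.warn("integer data: assuming encoded_input=True; this "
                      "default will change to False in a future release. "
                      "Pass encoded_input=True to silence this warning.",
                      FutureWarning)
        return True                     # legacy OneHotEncoder behaviour
    return False                        # new CategoricalEncoder behaviour
```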
jnothman commented Jan 24, 2018

I'm still not sure what practical difference you're suggesting results from inferring categories from the max value.

jorisvandenbossche commented Jan 24, 2018

@jnothman I suppose you acknowledge there can be a difference in practice? (the output you get depending on the data you have)

But whether this difference is important in practice, I don't know. That's where I would like to see feedback. Whether anybody actually wants this "max value"-based method, or whether we are fine with (in the future, after deprecation) only having the "unique values"-based method.

I think I personally would never need this max-value based method, but the OneHotEncoder has been like that for many years (for good reason or not?).

Actually deprecating the max-value based categorization would certainly make the implementation (after deprecation) simpler.
And if we choose that route, I agree the option should be legacy_mode=True/False rather than encoded_input/ordinal_input.

jnothman commented Jan 24, 2018

jorisvandenbossche commented Jan 24, 2018

Aha, that clarifies our misunderstanding :-)
I misunderstood how the current OneHotEncoder is actually working. Suppose you have one feature with values [2, 3, 5, 2]. I thought the current OneHotEncoder would have categories [0, 1, 2, 3, 4, 5] (while current CategoricalEncoder would have categories [2, 3, 5]). But you are right that the active_features_ is also only [2, 3, 5], essentially making them the same with the default value of n_values='auto'.

So it is only the case where you pass an integer to n_values (like n_values=6 for categories=[0, 1, 2, 3, 4, 5] in the above case) to specify the number of categories that will actually be an API change (deprecated / removed).
And that will easily be replaceable by the user with categories=range(6)

Sorry for the confusion.
In that light, I think we don't even need the legacy_mode option. We can just translate n_values=6 to categories=range(6) internally and raise a warning for that (but I need to check this against the actual tests).
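That translation could be sketched as (illustrative helper, not the actual implementation):

```python
import warnings

def translate_n_values(n_values):
    """Sketch: map a deprecated integer n_values to the equivalent
    `categories` specification for a single feature."""
    if isinstance(n_values, int):
        warnings.warn("n_values is deprecated; pass "
                      "categories=[range(n_values)] instead.",
                      DeprecationWarning)
        return [list(range(n_values))]
    return None  # n_values='auto': infer the categories from the data

print(translate_n_values(6))  # [[0, 1, 2, 3, 4, 5]]
```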

The other difference is the handling of unseen categories. With the current behaviour of the OneHotEncoder, if the unseen values are within range(0, max), it will not raise an error even if handle_unknown='error' (the default). But that too can be solved separately, by raising a warning in such a case that the user should set handle_unknown='ignore' manually to keep the existing behaviour.

The only feature we would lose is the distinction between unknown categories that are within range(0, max) (not regarded as 'unknown' by the current OneHotEncoder) and those that are bigger than that (> max; those are currently already regarded as unknown by the OneHotEncoder).

jnothman commented Jan 24, 2018

jorisvandenbossche commented Jan 24, 2018

> no, that is the sort of thing we have tried before and it's just too finicky.

Can you clarify to which aspect this "no" refers?
To the fact that I think a legacy_mode is not needed?

jnothman commented Jan 24, 2018

jorisvandenbossche commented Jan 24, 2018

> yes, to the idea that you can just make something that is both backwards compatible and what we want going forward

That was not what I tried to suggest. I wanted to make clear that I think it is possible to not have a legacy_mode keyword, not by having it magically be both backwards compatible and what we want in the future, but by deprecating the behaviour of the existing keywords.

So to be concrete: a non-default value of n_values can be deprecated and has to be replaced by a categories specification. handle_unknown in the case of integer data should be set explicitly by the user to choose either full ignoring or full erroring instead of the current mix (otherwise a deprecation warning is raised).

jnothman commented Jan 25, 2018

jnothman commented Jan 25, 2018

jorisvandenbossche commented Jan 28, 2018

> can we just make it so that during deprecation, categories must be set explicitly, and legacy mode with warnings is otherwise in effect? Is that what you are suggesting?

Yes, I might still be missing a case, but I think this is possible (I will check by actually coding it next week).

The different 'legacy' cases:

  • n_values='auto' (the default)
    • handle_unknown='ignore' -> fine, no change in behaviour
    • handle_unknown='error' -> problem: values within the range are still ignored, values above the range error
      • Possible solution:
        • in fit, if the range is consecutive => fine, no change in behaviour (covers everyone who now combines LabelEncoder with it, which I think is a typical use case)
        • if this is not the case: raise a deprecation warning that they have to set categories explicitly to keep this behaviour (and internally use legacy mode)
  • n_values=value
    • this can be translated to categories=[range(value)] internally, raising a deprecation warning that the user should do that themselves in the future
    • in this case handle_unknown='error' / 'ignore' work as expected

The deprecation warning in the case of n_values='auto' will only be raised in fit and not upon construction (which is not really ideal), but it is only in fit that we know the user is passing numeric data and not string data.
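The case analysis above can be condensed into a small decision function (a sketch; the names and messages are illustrative, not the PR's actual code):

```python
def legacy_decision(n_values, handle_unknown, consecutive):
    """Sketch of the deprecation decision tree described above.

    `consecutive` stands for "the categories inferred in fit form a
    consecutive range", which can only be checked during fit.
    """
    if n_values != 'auto':
        # n_values=<int> translates to categories=[range(n_values)]
        return 'deprecate: replace n_values with categories'
    if handle_unknown == 'ignore':
        return 'fine: no change in behaviour'
    if consecutive:
        return 'fine: no change in behaviour'
    return 'deprecate: set categories explicitly (use legacy mode internally)'
```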

jnothman commented Jan 28, 2018

jorisvandenbossche commented Feb 1, 2018

> You basically want it to be: legacy mode is active if categories is not set and if the data is all integers?

Yes indeed (in practice it will more or less be the same)

> One question: if categories and n_values parameters are their default, do we publish categories_? If n_values is set explicitly, do we publish categories_?

I personally would already provide the attributes of the new interface as much as possible, even in legacy mode. So in both cases I would calculate categories_ (even if it is a bit more work)


So I tried to put the above logic in code (I will push some updates to the PR), and I have one more question for the case of integer data when n_values or categories is not set (the typical case for 'legacy_mode'). The problem lies in the fact that if the inferred categories are simply a consecutive range (0, 1, 2, 3, ..., max), there is no difference between the new and old (legacy) behaviour, and we don't necessarily need to raise a deprecation warning.
Some possibilities to do in this specific case:

  1. Detect this case (that the inferred categories are a consecutive range), and in that case don't raise a warning.
    • This is possible to detect (with a little bit of extra code complexity), as we are already in fit anyhow
    • I think this will be a common case when using OneHotEncoder with integer data, and a case where the user does not actually need to worry about our refactoring, so it would be nice not to bother them with a warning
  2. Always raise a warning, and indicate in the warning message what to do if you are in such a case (in addition to an explanation of what to do if you don't have a consecutive range):
    • If they know they have only consecutive ranges as categories, they will want to ignore the warning, so we can add to the warning message an explanation of how to do this (a code sample with filterwarnings they can copy-paste)
    • A potential advantage of this is that we can also mention in the warning message that if they used the LabelEncoder to create the integers, they can now use OneHotEncoder directly (I think this is currently a typical usage pattern). That way, the warning will also go away
  3. Always raise a warning but provide a keyword to silence it (e.g. legacy_mode=False)
    • If we find the advice to use a filterwarnings statement (see point 2 above) too cumbersome, we could also add a keyword to obtain the same result
    • Disadvantage of this is introducing a keyword that will no longer be needed in a few releases, once the deprecations are cleaned up.

I am personally in favor of option 1 or 2. Using the LabelEncoder before OneHotEncoder seems to be a typical pattern (from a quick github search), and in those cases you always have consecutive ranges and there will never be a change in behaviour with the new implementation, so we shouldn't warn for it. On the other hand, if we warn, we can point out that if they used LabelEncoder, they no longer need it. It would be nice to actually give this advice explicitly.
The question is how frequently users have such consecutive integers as categories without having used LabelEncoder as the previous step...

jorisvandenbossche commented Feb 1, 2018

Hmm, one case I forgot is when you have inferred integer categories that are not consecutive (let's say [1,3,5]), but you want the new behaviour and not the legacy behaviour (so in that case you cannot just ignore the warning, as that would handle unseen values differently in the transform step, i.e. values within the range (e.g. 2) will not raise an error).
In case we don't provide the legacy_mode=False keyword, the only way to obtain the new behaviour is by manually passing categories=[1,3,5], which can be a slight inconvenience. That might be a reason to favor option 3 and give up my objection to introducing a temporary keyword legacy_mode=False (but I am also not fully sure it is worth it, as this would be the only case* where such a keyword is actually needed).

* this only case = integer data with inferred categories that are not a consecutive range, and where you cannot / don't want to set the categories manually or set handle_unknown to ignore.
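A concrete sketch of this case (the values are the [1,3,5] example from above; "ignored" here follows the behaviour described earlier, where in-range unseen values do not error):

```python
# Feature trained on [1, 3, 5]; the unseen value 2 appears at transform time.
train = [1, 3, 5]

# Legacy OneHotEncoder: anything within range(0, max+1) is in range, so 2 is
# not treated as unknown (no error raised; the value is silently ignored).
legacy_in_range = 2 in set(range(max(train) + 1))

# New behaviour: only the seen values are categories, so 2 is unknown and
# would error under handle_unknown='error'.
new_known = 2 in set(train)

print(legacy_in_range, new_known)  # True False
```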

Sorry for all the long text, but it's quite complex :)

jnothman commented Feb 1, 2018

jorisvandenbossche commented Feb 1, 2018

> We're only talking about the case where n_values is unset, right?

Yes (the other case can easily be translated into its equivalent categories value, with a nice deprecation warning, and without any difference between the new and legacy behaviour)

> a variant of 3. that was just "OneHotEncoder running in legacy mode. Set categories='auto' for slightly different behaviour without a warning."

Ah, that sounds like a good idea! (regardless of whether we detect the consecutive-categories case or not). So in the code we set the default of categories to None (without changing the semantics of its default), so we know whether the user set it explicitly; that is a nice way to indicate legacy_mode=False without needing that extra keyword.
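A sketch of that sentinel trick (illustrative class, not the PR's code; the warning text is the one quoted above):

```python
import warnings

class OneHotEncoderSketch:
    """Sketch: use categories=None to detect 'not set by the user'."""
    def __init__(self, categories=None):
        # None means "not set by the user"; semantically it acts as 'auto'
        self.categories = categories

    def fit(self, X):
        all_int = all(isinstance(v, int) for row in X for v in row)
        if self.categories is None and all_int:
            warnings.warn("OneHotEncoder running in legacy mode. Set "
                          "categories='auto' for slightly different "
                          "behaviour without a warning.", FutureWarning)
            self._legacy = True
        else:
            self._legacy = False
        return self
```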

jnothman commented Feb 1, 2018

amueller commented Feb 7, 2018

What fresh hell is this :-/

amueller commented Feb 7, 2018

OR we could name the new one DummyEncoder ;) (though that is a bit conflicting with the DummyClassifier)

jorisvandenbossche commented Feb 7, 2018

@amueller Don't read all of the above!
I was just planning to make a nice summary for new readers of the issue. The above discussion is overly complicated (also because I still did not fully understand the complex behaviour of the current OneHotEncoder ... :-))

jorisvandenbossche commented Feb 7, 2018

> OR we could name the new one DummyEncoder ;)

I think @GaelVaroquaux was against that because "one-hot" is the name known in more fields (and we already use 'Dummy' for other things in scikit-learn ...)

amueller commented Feb 7, 2018

hrm. I guess we kept OneHotEncoder because it's more efficient when it can be used.... Ideally we would get rid of all the weird behaviors. I kinda had wanted to deprecate it but then we didn't...

jorisvandenbossche commented Feb 7, 2018

> I kinda had wanted to deprecate it but then we didn't...

In my POC PR (#10523), I deprecated almost everything of OneHotEncoder, except its name ...

jnothman commented Feb 7, 2018

jnothman commented Feb 11, 2018

@amueller, are you persuaded by the issue that we ultimately want different additional parameters (e.g. drop_first, nan handling) depending on the encoding, and that justifies having a different discrete encoder for each encoding format?

amueller commented Feb 23, 2018

I'll try to look at this in the spring break in two weeks, ok? not sure if I'll have time before that :-/

lesshaste commented Feb 24, 2018

I hope this isn't the wrong place to ask, but what does the current implementation do with tables that are mixed categorical and non-categorical within one column? Taking the example from pandas-dev/pandas#17418

Consider the dataframe df = pd.DataFrame([{'apple': 1, 'pear':'a', 'carrot': 1}, {'apple':'a', 'pear':2, 'carrot':3}, {'apple': 2, 'pear':3, 'carrot':1}, {'apple': 3, 'pear':'b', 'carrot': 1}, {'apple': 4, 'pear':4, 'carrot': 1}]) which equals:

  apple  carrot pear
0     1       1    a
1     a       3    2
2     2       1    3
3     3       1    b
4     4       1    4

DictVectorizer gives exactly what I need in this case.

    from sklearn.feature_extraction import DictVectorizer
    enc = DictVectorizer(sparse=False)
    enc.fit_transform(df.to_dict(orient='records'))

This gives:

array([[ 1.,  0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  3.,  2.,  0.,  0.],
       [ 2.,  0.,  1.,  3.,  0.,  0.],
       [ 3.,  0.,  1.,  0.,  0.,  1.],
       [ 4.,  0.,  1.,  4.,  0.,  0.]])

We can see the feature names of the columns with:

    enc.feature_names_
    ['apple', 'apple=a', 'carrot', 'pear', 'pear=a', 'pear=b']

It would be great if the new CategoricalEncoder had an option to do the same.

jnothman commented Feb 24, 2018

lesshaste commented Feb 24, 2018

That’s a shame. One simple sub case is where a column is numerical but has some missing values. A simple solution is to convert the NaNs into empty strings and then use DictVectorizer as in my example above. This effectively creates a new feature for when the value is missing but leaves the numerical values unchanged otherwise. I have found this a very useful technique.

Will the new CategoricalEncoder be able to do something similar?
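The trick can be sketched without pandas (hypothetical helper; the real workflow would feed the result to DictVectorizer as in the example above):

```python
import math

def nan_to_marker(records):
    """Replace NaN values with '' so a dict-based encoder creates a
    separate 'missing' indicator feature while leaving the numerical
    values unchanged."""
    out = []
    for rec in records:
        out.append({k: ('' if isinstance(v, float) and math.isnan(v) else v)
                    for k, v in rec.items()})
    return out

print(nan_to_marker([{'x': 1.0}, {'x': float('nan')}]))
# [{'x': 1.0}, {'x': ''}]
```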

jnothman commented Feb 25, 2018

lesshaste commented Feb 25, 2018

That sounds good.

You are right there are two use cases. Let me explain a particular example of where treating numeric values as different from strings has been useful for me. It may be that there is a better solution.

Say you have an integer numeric feature which takes a large range of values. However you suspect that for some small values, the precise value is significant. For larger values you suspect this isn’t the case. A simple thing to do is to convert all small values to strings, run DictVectorizer as above and then perform feature selection or just use your favorite classifier directly.
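A sketch of this trick (the threshold of 1024 follows the example below; the helper name is made up):

```python
def stringify_small(values, threshold=1024):
    """Treat small integers as categories (strings) so a dict-based
    encoder one-hot encodes them, while keeping large values numeric."""
    return [str(v) if v <= threshold else v for v in values]

print(stringify_small([7, 22, 5000]))  # ['7', '22', 5000]
```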

jnothman commented Feb 25, 2018

lesshaste commented Feb 25, 2018

@jnothman Yes, in a sense, but with a twist. Say I suspect that some of the values from 1...1024 are meaningful. That is, 22 indicates something specific which is quite different from 21 or 23. Taking logs won't help here. But I want to leave all the values over 1024 as numerical, as I don't think those specific values mean much.

jnothman commented Feb 25, 2018

lesshaste commented Feb 25, 2018

@jnothman To be a little clearer, I don't know that 22 is significant. I just suspect that some values are but I don't know which ones or how many there are. I have found the "convert to a string" and then DictVectorizer method to be very useful for discovering which these are.

jorisvandenbossche commented Feb 26, 2018

@lesshaste For the issue about NaNs as separate category, see #10465
If you want to further discuss the specific non-linear discretization or mixed numeric/string encoding, feel free to open a new issue. But I would like to keep this one focused on the original topic, i.e. the naming and organisation into different classes of the CategoricalEncoder/OneHotEncoder.

> I'll try to look at this in the spring break in two weeks, ok? not sure if I'll have time before that :-/

@amueller that's fine. I won't have time the coming two weeks to work on the PR that is blocked by this anyway. After that I should also have time again to work on it.

jorisvandenbossche commented Mar 20, 2018

@amueller did you have time to give this a look?

jorisvandenbossche commented Mar 29, 2018

@amueller are you OK with me going ahead with the PR to split CategoricalEncoder into OrdinalEncoder and OneHotEncoder (and with deprecating the current arguments of OneHotEncoder)?

amueller commented Apr 26, 2018

Sorry for being absent. Seems ok, but can you maybe give me two weeks so I can actually review? Thanks!

jorisvandenbossche commented May 14, 2018

@amueller no problem, for me the same :-)
But I am now planning to look at this again, so if you could give it a look, that would be welcome. I have some work to do on the PR (#10523), so don't review that in detail yet (though you can look at it to get an idea of what we propose).
I think the main question I want to see answered before I put a lot of time into it is whether you are OK with splitting up CategoricalEncoder into multiple classes, and in that case, whether you are OK with re-using OneHotEncoder (which means deprecating some of its current (strange) features). Those questions are summarized in #10521 (comment) and #10521 (comment).

(and once we agree on that part, there is still a lot to discuss about the actual implementation in the PR :))

jorisvandenbossche commented May 17, 2018

I updated the PR #10523, ready for review

amueller commented May 19, 2018

I'll cautiously say I'm back ;)

kuraga commented May 28, 2018

IMHO the most important thing is a universal API (i.e. parameters and behavior patterns) for all of the encoders we discuss.

P.S. https://github.com/scikit-learn-contrib/categorical-encoding ?

jorisvandenbossche commented May 28, 2018

In the category_encoders package, all encoders have a cols argument, similar to the categorical_features in the old OneHotEncoder (although it does not accept exactly the same kinds of values). See e.g. http://contrib.scikit-learn.org/categorical-encoding/onehot.html
So that is related to the current discussion we are having in #10523 about deprecating categorical_features or not.

For the rest, I think there are no real keyword conflicts (they have some others specific to dataframes, which we won't add to sklearn at this point). The naming of OneHotEncoder and OrdinalEncoder at least is consistent with the category_encoders package.
