Rethinking the CategoricalEncoder API ? #10521
Comments
Thanks for the summary @jorisvandenbossche. I think I am in favor of option 2: reuse OneHotEncoder. |
The idea of reverting CategoricalEncoder makes me quite sad, but I think
you're right that future users would be less mystified by option 2. My main
concern is that we have tried implementing this as a change to OHE for a
long time and it never flew. Perhaps it would be good to attempt the
modifications to the OneHotEncoder docstring according to that proposed
change, so we can see if it looks sane.
|
+1 to what Joel said
Sent from my phone. Please forgive typos and briefness.
|
To be clear, it would not be a revert, it would be a refactor / rename that keeps all functionality! That said, I will quickly try to do the changes to get an idea of how feasible it is to integrate this in OneHotEncoder. |
OK, I opened a PR with a proof of concept: #10523. The main API question is about the format of the input data.
The question is: do we find both cases worth supporting? That is, in the potentially merged OneHotEncoder, do we keep the ability to do both, or do we fully deprecate and then remove the ability to process ordinal (integer-encoded) input? If we want the ability to process both, we can add a boolean keyword to specify the input data type. For the deprecation period we have to support both anyway, and we also have to introduce a keyword to choose the behaviour (to be able to silence the warning and choose the new behaviour). Given that we want to handle both, here is a rough overview of how OneHotEncoder would work (sketch below):
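A minimal sketch, assuming the merged OneHotEncoder infers categories from the unique values seen in fit (parameter names follow this thread's proposal and are assumptions, not a description of a released API):

```python
# Hedged sketch only: how a merged OneHotEncoder could behave on string input,
# with categories inferred from the unique values seen in fit.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(categories='auto', handle_unknown='error')
X = np.array([['cat'], ['dog'], ['cat']], dtype=object)
enc.fit(X)
print(enc.categories_)                      # [array(['cat', 'dog'], dtype=object)]
print(enc.transform([['dog']]).toarray())   # [[0. 1.]]
```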
|
I'm still not sure what you're suggesting is the practical difference due to inferring categories from the max value. |
@jnothman I suppose you acknowledge there can be a difference in practice? (the output you get depends on the data you have; see the sketch below) But whether this difference is important in practice, I don't know. That's where I would like to see feedback: whether anybody actually wants this "max value"-based method, or whether we are fine with (in the future, after deprecation) only having the "unique values"-based method. I think I personally would never need this max-value based method, but the OneHotEncoder has been like that for many years (for good reason or not?). Actually deprecating the max-value based categorization would certainly make the implementation (after deprecation) simpler.
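A minimal numpy sketch of the two categorization strategies being compared (illustrative only, not the scikit-learn code):

```python
# Illustrative comparison of the two ways to infer categories for one integer feature.
import numpy as np

X = np.array([0, 2, 5])

# "unique values"-based (CategoricalEncoder / proposed behaviour): 3 output columns.
categories_unique = np.unique(X)            # array([0, 2, 5])

# "max value"-based (historical OneHotEncoder with n_values='auto'): columns for
# 0..max are allocated first, then unseen ones are dropped via active_features_.
categories_max = np.arange(X.max() + 1)     # array([0, 1, 2, 3, 4, 5])
```
|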
Remind me what the actual difference in output is, when n_values='auto',
please? I had thought the active_features_ thing made them basically
identical, but I'm probably forgetting something.
|
Aha, that clarifies our misunderstanding :-) So it is only the case where you pass an integer to n_values that differs. Sorry for the confusion. The other difference is the handling of unseen categories: with the current behaviour of the OneHotEncoder, if the unseen values are within range(0, max), it will not raise an error even if handle_unknown='error' (see the sketch below). The only feature we would lose is the distinction between unknown categories that are within range(0, max) (by the current OneHotEncoder not regarded as 'unknown') and those that are bigger than that (> max; those are currently already regarded as unknown by the OneHotEncoder).
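A sketch of that unseen-value behaviour, assuming a scikit-learn version where the legacy integer-based OneHotEncoder (with n_values) is still available:

```python
# Legacy behaviour described above: unseen values inside range(0, max) pass
# silently (all-zero row), values above the fitted maximum raise an error.
from sklearn.preprocessing import OneHotEncoder

legacy = OneHotEncoder(n_values='auto', handle_unknown='error')
legacy.fit([[0], [2]])                       # seen values: 0 and 2
print(legacy.transform([[1]]).toarray())     # no error, row of zeros
try:
    legacy.transform([[5]])                  # 5 > max seen in fit
except ValueError:
    print("unseen value above the fitted range raises")
```
|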
no, that is the sort of thing we have tried before and it's just too
finicky. unless there is good reason to maintain current behaviour, we
should just have a legacy_mode to slowly bring us to the future.
|
Can you clarify to which aspect this "no" refers? |
yes, to the idea that you can just make something that is both backwards
compatible and what we want going forward
|
That was not what I tried to suggest. I wanted to make clear that I think it is possible to not have a legacy_mode keyword, not by having it magically be both backwards compatible and what we want in the future, but by deprecating the behaviour of the existing keywords. So to be concrete: a non-default value of n_values can be deprecated and has to be replaced by a categories specification. handle_unknown in the case of integer data should be set explicitly by the user to choose either full ignoring or full erroring instead of the current mix (and otherwise a deprecation warning is raised). |
So if I do .fit([[5]]).transform([[4]]), for which values of n_values, categories and handle_unknown will that raise an error?
|
can we just make it so that during deprecation, categories must be set
explicitly, and legacy mode with warnings is otherwise in effect? Is that
what you are suggesting?
|
Yes, there might still be a missing case, but I think this is possible (will check by actually coding it next week). The different 'legacy' cases:
- n_values='auto' (the default)
  - handle_unknown='ignore' -> fine, no change in behaviour
  - handle_unknown='error' -> problem: values within the range are still ignored, values above the range error
    - possible solution:
      - in fit, if the range is consecutive => fine, no change in behaviour (for all people that now combine LabelEncoder with it, which is a typical use case I think)
      - if this is not the case: raise a deprecation warning that they have to set categories explicitly to keep this behaviour (and internally use legacy mode)
- n_values=value
  - this can be translated to categories=[range(value)] internally, and raise a deprecation warning that the user should do that themselves in the future (see the sketch below)
  - in this case handle_unknown='error' / 'ignore' work as expected
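A minimal sketch of the n_values -> categories translation mentioned in the last bullet (the helper name and exact shape are assumptions, not the actual implementation):

```python
# Hypothetical helper: map the legacy n_values specification onto the new
# categories parameter during the deprecation period.
import numpy as np

def n_values_to_categories(n_values, n_features):
    if isinstance(n_values, int):                 # n_values=k: every feature gets range(k)
        return [np.arange(n_values)] * n_features
    return [np.arange(k) for k in n_values]       # per-feature list of maxima

print(n_values_to_categories(3, 2))   # [array([0, 1, 2]), array([0, 1, 2])]
```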
The deprecation warning in the case of n_values='auto' will only be raised in fit and not upon construction (which is not really ideal), but it is only in fit that we know that the user is passing numeric data and not string data. |
we don't usually raise warnings until fit in any case so don't worry about
that.
that strategy sounds mostly good.
I'm not actually sure if we should be sniffing for strings in the data,
though. You basically want it to be: legacy mode is active if categories is
not set *and* if the data is all integers?
One question: if categories and n_values parameters are their default, do
we publish categories_? If n_values is set explicitly, do we publish
categories_?
|
Yes indeed (in practice it will be more or less the same).
I personally would already, as much as possible, provide the attributes of the new interface, even in legacy mode. So in both cases I would calculate categories_. So I tried to put the above logic in code (will push some updates to the PR), and I have one more question for the case of integer data when n_values is left at its default: when exactly do we raise the deprecation warning?
I am personally in favor of option 1 or 2. Using the LabelEncoder before OneHotEncoder seems to be a typical pattern (from a quick GitHub search), and in those cases you always have consecutive ranges, so there will never be a change in behaviour with the new implementation and we shouldn't warn for it. On the other hand, if we warn we can point out that if they used LabelEncoder, they no longer need to do it, which would be nice advice to give explicitly. |
Hmm, one case I forgot is when you have inferred integer categories that are not consecutive (let's say [1, 3, 5]), but you want the new behaviour and not the legacy behaviour (so in that case you cannot just ignore the warning, as that would handle unseen values differently in the transform step, i.e. values inside the range (e.g. 2) will not raise an error).
* the only problematic case = integer data with inferred categories that are not a consecutive range, and where you cannot / don't want to set the categories manually or set handle_unknown to ignore (a sketch of the consecutive-range check follows below).
Sorry for all the long text, but it's quite complex :)
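A minimal sketch of the "consecutive range" check discussed here (illustrative only, not the actual scikit-learn implementation):

```python
# True only if the unique integer values are exactly 0, 1, ..., n-1,
# i.e. the typical LabelEncoder output where behaviour does not change.
import numpy as np

def is_consecutive_from_zero(values):
    uniq = np.unique(values)
    return bool(np.array_equal(uniq, np.arange(len(uniq))))

print(is_consecutive_from_zero([0, 1, 2, 1]))   # True  -> no change in behaviour, no warning
print(is_consecutive_from_zero([1, 3, 5]))      # False -> warn / require explicit categories
```
|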
We're only talking about the case where n_values is unset, right?
I'm fine with 1., and it would not be any more expensive, since auto
already needs to examine the set of labels. I could also accept, for
simplicity, a variant of 3. that was just "OneHotEncoder running in legacy
mode. Set categories='auto' for slightly different behaviour without a
warning."
|
Yes (the other case can easily be translated into its equivalent categories specification).
Ah, that sounds like a good idea! (regardless of whether we detect the consecutive-categories case or not). So we set in the code the default of categories to None, so that explicitly passing categories='auto' selects the new behaviour without a warning. |
Yes, but only if we want to warn every time someone uses it without passing
categories. It's the cheap implementation approach, but it might be
unnecessarily verbose for the users, which is why I would prefer 1 if it
can be done simply.
|
What fresh hell is this :-/ |
OR we could name the new one DummyEncoder. |
@amueller Don't read all of the above! |
I think @GaelVaroquaux was against that because "one-hot" is the name known in more fields (and we already use 'Dummy' for other things in scikit-learn ...) |
hrm. I guess we kept OneHotEncoder because it's more efficient when it can be used.... Ideally we would get rid of all the weird behaviors. I kinda had wanted to deprecate it but then we didn't... |
In my POC PR (#10523), I deprecated almost everything of OneHotEncoder, except its name ... |
It's not much more efficient. And if LabelEncoder had fast paths for ints
in range [0, n_values-1], if justified, that would be good enough.
|
@amueller, are you persuaded by the issue that we ultimately want different additional parameters (e.g. drop_first, nan handling) depending on the encoding, and that justifies having a different discrete encoder for each encoding format? |
I'll try to look at this in the spring break in two weeks, ok? not sure if I'll have time before that :-/ |
I hope this isn't the wrong place to ask, but what does the current implementation do with tables that are mixed categorical and non-categorical within one column (taking the example from pandas-dev/pandas#17418)?
DictVectorizer gives exactly what I need in this case: the numeric values are left unchanged, each string value gets its own indicator column, and the feature names show which columns came from which values.
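An illustrative reconstruction of that DictVectorizer approach (the example data is an assumption, not the dataframe from the pandas issue):

```python
# A column that mixes numbers and strings, encoded with DictVectorizer:
# numeric values pass through, string values become indicator columns.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'A': [1, 2, 'hello', 3]})   # mixed numeric / string column

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(df.to_dict('records'))
print(vec.feature_names_)   # ['A', 'A=hello']
print(X)
# [[1. 0.]
#  [2. 0.]
#  [0. 1.]
#  [3. 0.]]
```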
It would be great if the new CategoricalEncoder had an option to do the same. |
I don't think we intend to handle that kind of mixed case
|
That’s a shame. One simple sub case is where a column is numerical but has some missing values. A simple solution is to convert the NaNs into empty strings and then use DictVectorizer as in my example above. This effectively creates a new feature for when the value is missing but leaves the numerical values unchanged otherwise. I have found this a very useful technique (see the sketch below). Will the new CategoricalEncoder be able to do something similar?
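A small illustration of that technique, assuming pandas and DictVectorizer (the column name and data are made up):

```python
# Replace missing values in a numeric column with empty strings; DictVectorizer
# then keeps the numbers as-is and adds an indicator column for the missing ones.
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'age': [31.0, np.nan, 45.0]})
records = df.fillna('').to_dict('records')

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)
print(vec.feature_names_)   # ['age', 'age=']
print(X)
# [[31.  0.]
#  [ 0.  1.]
#  [45.  0.]]
```
|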
we've considered allowing users to have NaN treated as a separate category
or similar. but that's not the same as handling arbitrary numeric values as
different from strings.
|
That sounds good. You are right there are two use cases. Let me explain a particular example of where treating numeric values as different from strings has been useful for me. It may be that there is a better solution. Say you have an integer numeric feature which takes a large range of values. However you suspect that for some small values, the precise value is significant. For larger values you suspect this isn’t the case. A simple thing to do is to convert all small values to strings, run DictVectorizer as above and then perform feature selection or just use your favorite classifier directly (see the sketch below).
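A compact sketch of that trick (the threshold and data are made-up assumptions):

```python
# Convert "small" integer values to strings so DictVectorizer gives each of them
# its own indicator column, while larger values stay numeric.
from sklearn.feature_extraction import DictVectorizer

values = [3, 22, 4096, 22, 700000]
records = [{'x': str(v) if v <= 1024 else v} for v in values]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)
print(vec.feature_names_)   # ['x', 'x=22', 'x=3']
```
|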
So you're using it for a non-linear discretisation? The next release is
likely to include a fixed-width discretizer, but following on from a log
transform or a quantile transform it should act quite similar to what you
want... But the log transform might alone be sufficient in your setting.
|
@jnothman Yes in a sense except with a twist. Say I suspect that some of the values from 1...1024 are meaningful. That is 22 indicates something specific which is quite different from 21 or 23. Taking logs won't help here. But I want to leave all the values over 1024 as numerical as I don't think those specific values mean much. |
It sounds like you know too much about your variable for a generic
transform to be the sort of thing you need.
|
@jnothman To be a little clearer, I don't know that 22 is significant. I just suspect that some values are but I don't know which ones or how many there are. I have found the "convert to a string" and then DictVectorizer method to be very useful for discovering which these are. |
@lesshaste For the issue about NaNs as separate category, see #10465
@amueller that's fine. I won't have time the coming two weeks to work on the PR that is blocked by this anyway. After that I should also have time again to work on it. |
@amueller did you have time to give this a look? |
@amueller are you OK with me going ahead with the PR to split CategoricalEncoder into OrdinalEncoder and OneHotEncoder (and with deprecating the current arguments of OneHotEncoder)? |
Sorry for being absent. Seems ok, but can you maybe give me two weeks so I can actually review? Thanks! |
@amueller no problem, for me the same :-) (and once we agree on that part, there is still a lot to discuss about the actual implementation in the PR :)) |
I updated the PR #10523, ready for review |
I'll cautiously say I'm back ;) |
IMHO the most important thing is a universal API (i.e. parameters and behaviour patterns) for all of the encoders we discuss. P.S. https://github.com/scikit-learn-contrib/categorical-encoding ? |
In the categorical-encoding package I think there are not really conflicting keywords (they have some others specific to dataframes which we won't add to sklearn at this point). The naming for OneHotEncoder and OrdinalEncoder at least is consistent with that package. |
Based on some discussions we are having here and issues that are opened, we are having some doubts that CategoricalEncoder (#9151) was the good choice of name (and since it is not released yet we have some room for change).

So a summary of how it is now: CategoricalEncoder says what type of data it accepts (categorical data), and encoding specifies how to encode those data. Currently we already have encoding='onehot'|'onehot-dense'|'ordinal'.

But what to do in the following cases:
- if we want to add more encodings later, do we keep adding options to the encoding kwarg in the one big CategoricalEncoder class?
- you get keyword arguments of CategoricalEncoder that are or are not active depending on what you passed for the encoding kwarg, which is not the nicest API design.

For that last problem, we already had this with the sparse=True/False option, which was only relevant for 'onehot' and not for 'ordinal', and which we solved by having both 'onehot' and 'onehot-dense' encoding options instead of a sparse keyword. But such an approach also does not scale.

Related to this, there is a PR to add a UnaryEncoder (#8652). There was a related discussion on the naming in that PR, as currently the name says how it encodes, not what type of data it gets (in the current design, it accepts already encoded integers, not actual categorical data; in that regard, to be consistent with CategoricalEncoder, it might better be named OrdinalEncoder because it needs ordinal data as input).

What are the options forward:
1. keep the single CategoricalEncoder class and keep adding encoding options to it;
2. split it into separate classes per encoding, reusing the OneHotEncoder name and adding e.g. an OrdinalEncoder.

So it is a bit of a trade-off between a potential build-up of the number of classes and the number of keyword arguments in a single class.

One problem with the second approach (and one of the reasons we went with CategoricalEncoder in the first place, even before we added the multiple encoding options) is that there is already a OneHotEncoder, which has a different API than the CategoricalEncoder. And there is not really a good other name we could use for the encoder that does one-hot encoding.

However, I think that, with some temporary ugly hacks, we could reuse the name, if we are OK with deprecating the current attributes (and I think we agree they are not the most useful attributes). The idea would be that if you fit the class with string data, you get the new behaviour, and if you fit the class with integer data, you get a deprecation warning indicating that the default behaviour will change (and indicating which keyword to specify to get rid of the warning). A sketch of that idea follows below.
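A rough, purely illustrative sketch of that fit-time sniffing (the helper name, parameter handling and message are assumptions, not the implementation in the PR):

```python
# Hypothetical helper deciding, at fit time, whether to run in legacy mode
# and whether to emit the deprecation warning described above.
import warnings
import numpy as np

def _decide_legacy_mode(X, categories=None):
    X = np.asarray(X)
    if categories is not None:
        return False                      # user opted in to the new behaviour
    if X.dtype.kind in 'OUS':
        return False                      # string/object data: new behaviour only
    warnings.warn(
        "Passing integer data without setting 'categories' keeps the legacy "
        "behaviour; set categories explicitly to adopt the new behaviour.",
        DeprecationWarning,
    )
    return True
```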
cc @jnothman @amueller @GaelVaroquaux @rth