Description
Based on some discussions we are having here and on issues that have been opened, we have some doubts whether CategoricalEncoder (#9151) was a good choice of name (and since it is not released yet, we have some room for change).
So a summary of how it is now:
- The class name CategoricalEncoder says what type of data it accepts (categorical data)
- The keyword argument encoding specifies how to encode those data

Currently we already have encoding='onehot'|'onehot-dense'|'ordinal'.
But what to do in the following cases:
- We want to add more encoding options (e.g. binary encoding, mean target encoding, unary encoding, ...). Do we keep adding those as new values for the encoding kwarg in the one big CategoricalEncoder class?
- We want to add an option specific to one of the encodings (e.g. for 'onehot' encoding to drop the first (redundant) column, or for 'ordinal' encoding to base the order of the categories on the frequency, ...). The problem here is that we then need to add additional keyword arguments to CategoricalEncoder that are or are not active depending on what you passed for the encoding kwarg, which is not the nicest API design.
For that last problem, we already had this with the sparse=True/False option, which was only relevant for 'onehot' and not for 'ordinal', and which we solved by having both 'onehot' and 'onehot-dense' encoding options instead of a sparse keyword. But such an approach also does not scale.
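To make the keyword-argument problem concrete, here is a minimal sketch in plain Python. ToyCategoricalEncoder and its drop_first keyword are hypothetical stand-ins for illustration only; this is not the CategoricalEncoder from #9151, and it only handles a single column of values.

```python
class ToyCategoricalEncoder:
    """Hypothetical single-class design: one 'encoding' kwarg selects the mode."""

    def __init__(self, encoding="onehot", drop_first=False):
        self.encoding = encoding
        # drop_first is an assumed option; it only has a meaning for 'onehot'.
        self.drop_first = drop_first

    def fit(self, values):
        if self.encoding not in ("onehot", "ordinal"):
            raise ValueError(f"unknown encoding: {self.encoding!r}")
        # The class has to police which keyword combinations are valid.
        if self.drop_first and self.encoding != "onehot":
            raise ValueError("drop_first is only valid with encoding='onehot'")
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        index = {cat: i for i, cat in enumerate(self.categories_)}
        if self.encoding == "ordinal":
            return [index[v] for v in values]
        # One-hot: optionally skip the first (redundant) column.
        start = 1 if self.drop_first else 0
        return [
            [1 if index[v] == j else 0 for j in range(start, len(index))]
            for v in values
        ]
```

Every encoding-specific option added this way needs another cross-check like the one in fit, which is the part that does not scale.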
Related to this, there is a PR to add a UnaryEncoder (#8652). There was a related discussion on the naming in that PR, as currently the name says how it encodes, not what type of data it gets (in the current design, it accepts already encoded integers, not actual categorical data. In that regard, to be consistent with CategoricalEncoder, it might better be named OrdinalEncoder because it needs ordinal data as input).
What are the options going forward:
- Keep things as we have them now in master, and be OK with adding some new options to the single class (an important question, which is hard to answer now, is how many new features we will want to add in the future).
- Switch the naming scheme and have a bunch of 'categorical encoders' where the name says how it encodes (OnehotEncoder, OrdinalEncoder, and later maybe BinaryEncoder, UnaryEncoder, ...)
So it is a bit of a trade-off between a potential build-up of the number of classes and of the number of keyword arguments in a single class.
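For comparison, the second option could look roughly like the sketch below. The Toy* classes and the drop_first keyword are again hypothetical and only meant to show the shape of the API, not how the real encoders would be implemented.

```python
class _ToyBaseEncoder:
    """Shared fitting logic for the hypothetical per-encoding classes."""

    def fit(self, values):
        self.categories_ = sorted(set(values))
        return self

    def _index(self):
        return {cat: i for i, cat in enumerate(self.categories_)}


class ToyOrdinalEncoder(_ToyBaseEncoder):
    """Name says how it encodes; it has no one-hot specific options."""

    def transform(self, values):
        index = self._index()
        return [index[v] for v in values]


class ToyOneHotEncoder(_ToyBaseEncoder):
    """One-hot specific options (assumed: drop_first) live only here."""

    def __init__(self, drop_first=False):
        self.drop_first = drop_first

    def transform(self, values):
        index = self._index()
        start = 1 if self.drop_first else 0
        return [
            [1 if index[v] == j else 0 for j in range(start, len(index))]
            for v in values
        ]
```

Here an invalid combination such as ToyOrdinalEncoder(drop_first=True) simply cannot be expressed (it fails with a TypeError at construction), at the cost of one more class per encoding.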
One problem with the second approach (and one of the reasons we went with CategoricalEncoder in the first place, even before we added the multiple encoding options), is that there is already a OnehotEncoder, which has a different API than the CategoricalEncoder. And, there is not really a good other name we could use for the encoder that does one-hot encoding.
However, I think that, with some temporary ugly hacks, we could reuse the name, if we are OK with deprecating the current attributes (and I think we agree they are not the most useful attributes). The idea would be that if you fit the class with string data, you get the new behaviour, and if you fit the class with integer data, you get a deprecation warning indicating the default behaviour will change (and indicating which keyword to specify to get rid of the warning).
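A rough sketch of that transition logic, in plain Python, could look as follows. The class name ToyTransitionEncoder, the legacy_mode keyword, and the uses_legacy_ attribute are all made up for this illustration; the actual keyword and warning text would still need to be decided.

```python
import warnings


class ToyTransitionEncoder:
    """Hypothetical sketch of reusing a class name during a deprecation cycle."""

    def __init__(self, legacy_mode=None):
        # legacy_mode is an assumed keyword: None means "not specified".
        self.legacy_mode = legacy_mode

    def fit(self, values):
        all_ints = all(
            isinstance(v, int) and not isinstance(v, bool) for v in values
        )
        if self.legacy_mode is None:
            if all_ints:
                # Integer input without an explicit choice: keep the old
                # behaviour for now, but warn that the default will change.
                warnings.warn(
                    "The handling of integer data will change in a future "
                    "version. Pass legacy_mode explicitly to silence this "
                    "warning.",
                    DeprecationWarning,
                    stacklevel=2,
                )
                self.uses_legacy_ = True
            else:
                # String (or other non-integer) data gets the new behaviour.
                self.uses_legacy_ = False
        else:
            self.uses_legacy_ = bool(self.legacy_mode)
        self.categories_ = sorted(set(values))
        return self
```

With this, fitting on ["a", "b"] silently uses the new behaviour, fitting on [0, 1, 2] warns, and passing the keyword explicitly removes the warning in both directions.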