Rethinking the CategoricalEncoder API? #10521

Closed
jorisvandenbossche opened this Issue Jan 23, 2018 · 63 comments

jorisvandenbossche commented Jan 23, 2018

Based on some discussions we are having here and in opened issues, we have some doubts about whether CategoricalEncoder (#9151) was the right choice of name (and since it is not released yet, we still have room for change).

A summary of how it is now:

  • The class name CategoricalEncoder says what type of data it accepts (categorical data)
  • The keyword argument encoding specifies how to encode those data

Currently we already have encoding='onehot'|'onehot-dense'|'ordinal'.
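For readers new to the distinction, a minimal pure-Python sketch (not scikit-learn's implementation) of what the 'ordinal' and 'onehot' options produce for a single feature:

```python
# Minimal sketch of the two encodings, independent of scikit-learn.
def ordinal_encode(values, categories):
    """Map each value to the index of its category."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]

def onehot_encode(values, categories):
    """Map each value to a 0/1 indicator row, one column per category."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows

cats = ['a', 'b', 'c']
print(ordinal_encode(['b', 'a', 'c'], cats))  # [1, 0, 2]
print(onehot_encode(['b', 'a', 'c'], cats))   # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```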

But what to do in the following cases:

  • We want to add more encoding options (e.g. binary encoding, mean target encoding, unary encoding, ...). Do we keep adding those as new values for the encoding kwarg in the one big CategoricalEncoder class?
  • We want to add an option specific to one of the encodings (e.g. for 'onehot' encoding to drop the first (redundant) column, or for 'ordinal' encoding to base the order of the categories on the frequency, ...). The problem here is that we then need to add additional keyword arguments to CategoricalEncoder that are or are not active depending on what you passed for the encoding kwarg, which is not the nicest API design.

For that last problem, we already had this with the sparse=True/False option, which was only relevant for 'onehot' and not for 'ordinal', and which we solved with having both 'onehot' and 'onehot-dense' encoding options and not a sparse keyword. But such an approach also does not scale.

Related to this, there is a PR to add a UnaryEncoder (#8652). There was a related discussion on the naming in that PR, as currently the name says how it encodes, not what type of data it gets (in the current design, it accepts already encoded integers, not actual categorical data. In that regard, to be consistent with CategoricalEncoder, it might better be named OrdinalEncoder because it needs ordinal data as input).


The options going forward:

  1. Keep things as we have them now in master, and be OK with adding some new options to the single class (an important question, which is hard to answer now, is how many new features we will want to add in the future).
  2. Switch the naming scheme and have a bunch of 'categorical encoders' where the name says how it encodes (OneHotEncoder, OrdinalEncoder, and later maybe BinaryEncoder, UnaryEncoder, ...)

So it is a bit of a trade-off between a potential build-up of classes and a build-up of keyword arguments in a single class.


One problem with the second approach (and one of the reasons we went with CategoricalEncoder in the first place, even before we added the multiple encoding options) is that there is already a OneHotEncoder, which has a different API than the CategoricalEncoder. And there is not really another good name we could use for the encoder that does one-hot encoding.
However, I think that, with some temporary ugly hacks, we could reuse the name, if we are OK with deprecating the current attributes (and I think we agree they are not the most useful attributes). The idea would be that if you fit the class with string data, you get the new behaviour, and if you fit the class with integer data, you get a deprecation warning indicating the default behaviour will change (and indicating which keyword to specify to get rid of the warning).

cc @jnothman @amueller @GaelVaroquaux @rth

ogrisel commented Jan 23, 2018

Thanks for the summary @jorisvandenbossche. I think I am in favor of option 2: reuse the OneHotEncoder class, deprecate the weird attributes, and add a constructor parameter to select the behavior, with a future warning saying that the default behavior will change and that the warning can be silenced just by passing a value for that option.

jnothman commented Jan 23, 2018

GaelVaroquaux commented Jan 23, 2018

jorisvandenbossche commented Jan 23, 2018

> The idea of reverting CategoricalEncoder makes me quite sad

To be clear, it would not be a revert, it would be a refactor / rename that keeps all functionality!
But I also like the "CategoricalEncoder" name, that would indeed be sad.

That said, I will quickly try to do the changes to get an idea of how feasible it is to integrate this in OneHotEncoder.

jorisvandenbossche commented Jan 23, 2018

OK, I opened a PR with a proof of concept: #10523.
It's not yet complete (no deprecation warnings yet, and the new attributes are not yet calculated in the old behaviour).

The main API question is about the format of the input data.
So as a recap, there are two different ways we currently process the categorical data:

  1. As actual, not yet encoded (integer or string), categorical data (how it is done in CategoricalEncoder) -> infer categories from the unique values in the training data
  2. As integer, already encoded data (how it is done in the current OneHotEncoder) -> infer categories from the maximum value in the training data
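To make the difference concrete, a small sketch of the two inference strategies for a single feature:

```python
x = [2, 3, 5, 2]  # one integer feature from the training data

# 1. CategoricalEncoder-style: the categories are the unique training values
cats_unique = sorted(set(x))            # [2, 3, 5]

# 2. Old OneHotEncoder-style: the categories span 0..max(x)
cats_max = list(range(max(x) + 1))      # [0, 1, 2, 3, 4, 5]

print(cats_unique, cats_max)
```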

The question is: do we find both cases worth supporting? Thus, in the potentially merged OneHotEncoder, do we keep the ability to do both, or do we fully deprecate and then remove the ability to process ordinal input?

If we want the ability to process both, we can add a boolean keyword to specify the input data type (for now I use encoded_input=False/True, but other ideas are ordinal_input, ...)

For the deprecation period, we have to support both anyway, and also have to introduce a keyword to choose the behaviour (to be able to silence the warning and choose the new behaviour).
So in principle we could just keep the keyword afterwards.

Given that we want to handle both, an overview of how OneHotEncoder would work:

  • for now encoded_input=None, and we infer the default based on the data
  • if the data is int-like (handled before by OneHotEncoder), encoded_input is internally set to True and a deprecation warning is raised. If the user wants to keep the current behaviour, they can specify it manually as OneHotEncoder(encoded_input=True) to silence the warning.
  • if the input is not int-like, we set encoded_input internally to False and without warning use the new behaviour (= the current CategoricalEncoder behaviour)
  • in the future we change the default of encoded_input from None to False (by default the new behaviour, also for int-like data)
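The four bullets above could be sketched roughly as follows (using the tentatively named encoded_input keyword from the comment; this is an illustration, not the PR's code):

```python
import warnings
import numpy as np

def resolve_encoded_input(X, encoded_input=None):
    """Sketch of the proposed default resolution for `encoded_input`."""
    if encoded_input is not None:
        return encoded_input            # user chose explicitly: no warning
    X = np.asarray(X)
    if np.issubdtype(X.dtype, np.integer):
        warnings.warn("integer data: assuming encoded_input=True; this "
                      "default will change to False in a future release. "
                      "Pass encoded_input=True to silence this warning.",
                      FutureWarning)
        return True                     # legacy OneHotEncoder behaviour
    return False                        # new CategoricalEncoder behaviour
```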
jnothman commented Jan 24, 2018

I'm still not sure what practical difference you're suggesting results from inferring categories from the max value.

jorisvandenbossche commented Jan 24, 2018

@jnothman I suppose you acknowledge there can be a difference in practice? (the output you get depending on the data you have)

But whether this difference is important in practice, I don't know. That's where I would like to see feedback. Whether anybody actually wants this "max value"-based method, or whether we are fine with (in the future, after deprecation) only having the "unique values"-based method.

I think I personally would never need this max-value based method, but the OneHotEncoder has been like that for many years (for good reason or not?).

Actually deprecating the max-value based categorization would certainly make the implementation (after deprecation) simpler.
And if we choose that route, I agree the option should be legacy_mode=True/False rather than encoded_input/ordinal_input.

jnothman commented Jan 24, 2018

jorisvandenbossche commented Jan 24, 2018

Aha, that clarifies our misunderstanding :-)
I misunderstood how the current OneHotEncoder is actually working. Suppose you have one feature with values [2, 3, 5, 2]. I thought the current OneHotEncoder would have categories [0, 1, 2, 3, 4, 5] (while current CategoricalEncoder would have categories [2, 3, 5]). But you are right that the active_features_ is also only [2, 3, 5], essentially making them the same with the default value of n_values='auto'.

So it is only the case where you pass an integer to n_values (like n_values=6 for categories=[0, 1, 2, 3, 4, 5] in the above case) to specify the number of categories that will actually be an API change (deprecated / removed).
And that will easily be replaceable by the user with categories=range(6)

Sorry for the confusion.
In that light, I think we don't even need the legacy_mode option. We can just translate n_values=6 to categories=range(6) internally and raise a warning for that (but I need to check this against the actual tests).
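That translation could be sketched as (illustrative helper, not the actual implementation):

```python
import warnings

def translate_n_values(n_values):
    """Sketch: map a deprecated integer n_values to the equivalent
    `categories` specification for a single feature."""
    if isinstance(n_values, int):
        warnings.warn("n_values is deprecated; pass "
                      "categories=[range(n_values)] instead.",
                      DeprecationWarning)
        return [list(range(n_values))]
    return None  # n_values='auto': infer the categories from the data

print(translate_n_values(6))  # [[0, 1, 2, 3, 4, 5]]
```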

The other difference is the handling of unseen categories. With the current behaviour of the OneHotEncoder, if the unseen values are within range(0, max), it will not raise an error even if handle_unknown='error' (the default). But that too can be solved separately, by raising a warning in such a case that the user should set handle_unknown='ignore' manually to keep the existing behaviour.

The only feature we would lose is the distinction between unknown categories that are within range(0, max) (not regarded as 'unknown' by the current OneHotEncoder) and those that are bigger than that (> max; those are currently already regarded as unknown by the OneHotEncoder).

jnothman commented Jan 24, 2018

jorisvandenbossche commented Jan 24, 2018

> no, that is the sort of thing we have tried before and it's just too finicky.

Can you clarify to which aspect this "no" refers?
To the fact that I think a legacy_mode is not needed?

jnothman commented Jan 24, 2018

jorisvandenbossche commented Jan 24, 2018

> yes, to the idea that you can just make something that is both backwards compatible and what we want going forward

That was not what I tried to suggest. I wanted to make clear that I think it is possible to not have a legacy_mode keyword, not by having it magically be both backwards compatible and what we want in the future, but by deprecating the behaviour of the existing keywords.

So to be concrete: a non-default value of n_values can be deprecated and has to be replaced by a categories specification. handle_unknown in the case of integer data should be set explicitly by the user to choose either full ignoring or full erroring instead of the current mix (otherwise a deprecation warning is raised).

jnothman commented Jan 25, 2018

jnothman commented Jan 25, 2018

jorisvandenbossche commented Jan 28, 2018

> can we just make it so that during deprecation, categories must be set explicitly, and legacy mode with warnings is otherwise in effect? Is that what you are suggesting?

Yes, I might still be missing a case, but I think this is possible (I will check by actually coding it next week).

The different 'legacy' cases:

  • n_values='auto' (the default)
    • handle_unknown='ignore' -> fine, no change in behaviour
    • handle_unknown='error' -> problem: values within the range are still ignored, values above the range error
      • Possible solution:
        • in fit, if the range is consecutive => fine, no change in behaviour (covers everyone who now combines LabelEncoder with it, which I think is a typical use case)
        • if this is not the case: raise a deprecation warning that they have to set categories explicitly to keep this behaviour (and internally use legacy mode)
  • n_values=value
    • this can be translated to categories=[range(value)] internally, raising a deprecation warning that the user should do that themselves in the future
    • in this case handle_unknown='error' / 'ignore' work as expected

The deprecation warning in the case of n_values='auto' will only be raised in fit and not upon construction (which is not really ideal), but it is only in fit that we know the user is passing numeric data and not string data.
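The case analysis above can be condensed into a small decision function (a sketch; the names and messages are illustrative, not the PR's actual code):

```python
def legacy_decision(n_values, handle_unknown, consecutive):
    """Sketch of the deprecation decision tree described above.

    `consecutive` stands for "the categories inferred in fit form a
    consecutive range", which can only be checked during fit.
    """
    if n_values != 'auto':
        # n_values=<int> translates to categories=[range(n_values)]
        return 'deprecate: replace n_values with categories'
    if handle_unknown == 'ignore':
        return 'fine: no change in behaviour'
    if consecutive:
        return 'fine: no change in behaviour'
    return 'deprecate: set categories explicitly (use legacy mode internally)'
```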

jnothman commented Jan 28, 2018

jorisvandenbossche commented Feb 1, 2018

> You basically want it to be: legacy mode is active if categories is not set and if the data is all integers?

Yes indeed (in practice it will more or less be the same)

> One question: if categories and n_values parameters are their default, do we publish categories_? If n_values is set explicitly, do we publish categories_?

I personally would already provide the attributes of the new interface as much as possible, even in legacy mode. So in both cases I would calculate categories_ (even if it is a bit more work)


So I tried to put the above logic in code (I will push some updates to the PR), and I have one more question for the case of integer data when n_values or categories is not set (the typical case for 'legacy_mode'). The problem lies in the fact that if the inferred categories are simply a consecutive range (0, 1, 2, 3, ..., max), there is no difference between the new and old (legacy) behaviour, and we don't necessarily need to raise a deprecation warning.
Some possibilities to do in this specific case:

  1. Detect this case (that the inferred categories are a consecutive range), and in that case don't raise a warning.
    • This is possible to detect (with a little bit of extra code complexity), as we are already in fit anyhow
    • I think this will be a common case when using OneHotEncoder with integer data, and a case where the user does not actually need to worry about our refactoring, so it would be nice not to bother them with a warning
  2. Always raise a warning, and indicate in the warning message what to do if you are in such a case (in addition to an explanation of what to do if you don't have a consecutive range):
    • If they know they have only consecutive ranges as categories, they will want to ignore the warning, so we can add to the warning message an explanation of how to do this (a code sample with filterwarnings they can copy-paste)
    • A potential advantage of this is that we can also mention in the warning message that if they used the LabelEncoder to create the integers, they can now use OneHotEncoder directly (I think this is currently a typical usage pattern). That way, the warning will also go away
  3. Always raise a warning but provide a keyword to silence it (e.g. legacy_mode=False)
    • If we find the advice to use a filterwarnings statement (see point 2 above) too cumbersome, we could also add a keyword to obtain the same result
    • Disadvantage of this is introducing a keyword that will no longer be needed in a few releases, once the deprecations are cleaned up.

I am personally in favor of option 1 or 2. Using the LabelEncoder before OneHotEncoder seems to be a typical pattern (from a quick github search), and in those cases you always have consecutive ranges and there will never be a change in behaviour with the new implementation, so we shouldn't warn for it. On the other hand, if we warn, we can point out that if they used LabelEncoder, they no longer need it. It would be nice to actually give this advice explicitly.
The question is how frequently users have such consecutive integers as categories without having used LabelEncoder as the previous step...

jorisvandenbossche commented Feb 1, 2018

Hmm, one case I forgot is when you have inferred integer categories that are not consecutive (let's say [1,3,5]), but you want the new behaviour and not the legacy behaviour (so in that case you cannot just ignore the warning, as that would handle unseen values differently in the transform step, i.e. values within the range (e.g. 2) will not raise an error).
In case we don't provide the legacy_mode=False keyword, the only way to obtain the new behaviour is by manually passing categories=[1,3,5], which can be a slight inconvenience. That might be a reason to favor option 3 and give up my objection to introducing a temporary keyword legacy_mode=False (but I am also not fully sure it is worth it, as this would be the only case* where such a keyword is actually needed).

* this only case = integer data with inferred categories that are not a consecutive range, and where you cannot / don't want to set the categories manually or set handle_unknown to ignore.
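A concrete sketch of this case (the values are the [1,3,5] example from above; "ignored" here follows the behaviour described earlier, where in-range unseen values do not error):

```python
# Feature trained on [1, 3, 5]; the unseen value 2 appears at transform time.
train = [1, 3, 5]

# Legacy OneHotEncoder: anything within range(0, max+1) is in range, so 2 is
# not treated as unknown (no error raised; the value is silently ignored).
legacy_in_range = 2 in set(range(max(train) + 1))

# New behaviour: only the seen values are categories, so 2 is unknown and
# would error under handle_unknown='error'.
new_known = 2 in set(train)

print(legacy_in_range, new_known)  # True False
```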

Sorry for all the long text, but it's quite complex :)

jnothman commented Feb 1, 2018

jorisvandenbossche commented Feb 1, 2018

> We're only talking about the case where n_values is unset, right?

Yes (the other case can easily be translated into its equivalent categories value, with a nice deprecation warning, and without any difference between the new and legacy behaviour)

> a variant of 3. that was just "OneHotEncoder running in legacy mode. Set categories='auto' for slightly different behaviour without a warning."

Ah, that sounds like a good idea! (regardless of whether we detect the consecutive-categories case or not). So in the code we set the default of categories to None (without changing the semantics of its default), so we know whether the user set it explicitly; that is a nice way to indicate legacy_mode=False without needing that extra keyword.
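A sketch of that sentinel trick (illustrative class, not the PR's code; the warning text is the one quoted above):

```python
import warnings

class OneHotEncoderSketch:
    """Sketch: use categories=None to detect 'not set by the user'."""
    def __init__(self, categories=None):
        # None means "not set by the user"; semantically it acts as 'auto'
        self.categories = categories

    def fit(self, X):
        all_int = all(isinstance(v, int) for row in X for v in row)
        if self.categories is None and all_int:
            warnings.warn("OneHotEncoder running in legacy mode. Set "
                          "categories='auto' for slightly different "
                          "behaviour without a warning.", FutureWarning)
            self._legacy = True
        else:
            self._legacy = False
        return self
```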

jnothman commented Feb 1, 2018

amueller commented Feb 7, 2018

What fresh hell is this :-/

amueller commented Feb 7, 2018

OR we could name the new one DummyEncoder ;) (though that is a bit conflicting with the DummyClassifier)

jorisvandenbossche commented Feb 7, 2018

@amueller Don't read all of the above!
I was just planning to make a nice summary for new readers of the issue. The above discussion is overly complicated (also because I still did not fully understand the complex behaviour of the current OneHotEncoder ... :-))

jorisvandenbossche commented Feb 7, 2018

> OR we could name the new one DummyEncoder ;)

I think @GaelVaroquaux was against that because "one-hot" is the name known in more fields (and we already use 'Dummy' for other things in scikit-learn ...)

amueller commented Feb 7, 2018

hrm. I guess we kept OneHotEncoder because it's more efficient when it can be used.... Ideally we would get rid of all the weird behaviors. I kinda had wanted to deprecate it but then we didn't...

jorisvandenbossche commented Feb 7, 2018

> I kinda had wanted to deprecate it but then we didn't...

In my POC PR (#10523), I deprecated almost everything of OneHotEncoder, except its name ...

jnothman commented Feb 7, 2018

jnothman commented Feb 11, 2018

@amueller, are you persuaded by the issue that we ultimately want different additional parameters (e.g. drop_first, nan handling) depending on the encoding, and that justifies having a different discrete encoder for each encoding format?

amueller commented Feb 23, 2018

I'll try to look at this in the spring break in two weeks, ok? not sure if I'll have time before that :-/

lesshaste commented Feb 24, 2018

I hope this isn't the wrong place to ask, but what does the current implementation do with tables that are mixed categorical and non-categorical within one column? Taking the example from pandas-dev/pandas#17418

Consider the dataframe df = pd.DataFrame([{'apple': 1, 'pear':'a', 'carrot': 1}, {'apple':'a', 'pear':2, 'carrot':3}, {'apple': 2, 'pear':3, 'carrot':1}, {'apple': 3, 'pear':'b', 'carrot': 1}, {'apple': 4, 'pear':4, 'carrot': 1}]) which equals:

  apple  carrot pear
0     1       1    a
1     a       3    2
2     2       1    3
3     3       1    b
4     4       1    4

DictVectorizer gives exactly what I need in this case.

    from sklearn.feature_extraction import DictVectorizer
    enc = DictVectorizer(sparse=False)
    enc.fit_transform(df.to_dict(orient='records'))

This gives:

array([[ 1.,  0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  3.,  2.,  0.,  0.],
       [ 2.,  0.,  1.,  3.,  0.,  0.],
       [ 3.,  0.,  1.,  0.,  0.,  1.],
       [ 4.,  0.,  1.,  4.,  0.,  0.]])

We can see the feature names of the columns with:

    enc.feature_names_
    ['apple', 'apple=a', 'carrot', 'pear', 'pear=a', 'pear=b']

It would be great if the new CategoricalEncoder had an option to do the same.

jnothman commented Feb 24, 2018

lesshaste commented Feb 24, 2018

That’s a shame. One simple sub case is where a column is numerical but has some missing values. A simple solution is to convert the NaNs into empty strings and then use DictVectorizer as in my example above. This effectively creates a new feature for when the value is missing but leaves the numerical values unchanged otherwise. I have found this a very useful technique.

Will the new CategoricalEncoder be able to do something similar?
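The trick can be sketched without pandas (hypothetical helper; the real workflow would feed the result to DictVectorizer as in the example above):

```python
import math

def nan_to_marker(records):
    """Replace NaN values with '' so a dict-based encoder creates a
    separate 'missing' indicator feature while leaving the numerical
    values unchanged."""
    out = []
    for rec in records:
        out.append({k: ('' if isinstance(v, float) and math.isnan(v) else v)
                    for k, v in rec.items()})
    return out

print(nan_to_marker([{'x': 1.0}, {'x': float('nan')}]))
# [{'x': 1.0}, {'x': ''}]
```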

jnothman commented Feb 25, 2018

lesshaste commented Feb 25, 2018

That sounds good.

You are right there are two use cases. Let me explain a particular example of where treating numeric values as different from strings has been useful for me. It may be that there is a better solution.

Say you have an integer numeric feature which takes a large range of values. However you suspect that for some small values, the precise value is significant. For larger values you suspect this isn’t the case. A simple thing to do is to convert all small values to strings, run DictVectorizer as above and then perform feature selection or just use your favorite classifier directly.
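A sketch of this trick (the threshold of 1024 follows the example below; the helper name is made up):

```python
def stringify_small(values, threshold=1024):
    """Treat small integers as categories (strings) so a dict-based
    encoder one-hot encodes them, while keeping large values numeric."""
    return [str(v) if v <= threshold else v for v in values]

print(stringify_small([7, 22, 5000]))  # ['7', '22', 5000]
```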

jnothman commented Feb 25, 2018

lesshaste commented Feb 25, 2018

@jnothman Yes, in a sense, but with a twist. Say I suspect that some of the values from 1...1024 are meaningful. That is, 22 indicates something specific which is quite different from 21 or 23. Taking logs won't help here. But I want to leave all the values over 1024 as numerical, as I don't think those specific values mean much.

jnothman commented Feb 25, 2018

lesshaste commented Feb 25, 2018

@jnothman To be a little clearer, I don't know that 22 is significant. I just suspect that some values are but I don't know which ones or how many there are. I have found the "convert to a string" and then DictVectorizer method to be very useful for discovering which these are.

jorisvandenbossche commented Feb 26, 2018

@lesshaste For the issue about NaNs as separate category, see #10465
If you want to further discuss the specific non-linear discretization or mixed numeric/string encoding, feel free to open a new issue. But I would like to keep this one focused on the original topic, i.e. the naming and organisation into different classes of the CategoricalEncoder/OneHotEncoder.

> I'll try to look at this in the spring break in two weeks, ok? not sure if I'll have time before that :-/

@amueller that's fine. I won't have time the coming two weeks to work on the PR that is blocked by this anyway. After that I should also have time again to work on it.

jorisvandenbossche commented Mar 20, 2018

@amueller did you have time to give this a look?

jorisvandenbossche commented Mar 29, 2018

@amueller are you OK with me going ahead with the PR to split CategoricalEncoder into OrdinalEncoder and OneHotEncoder (and with deprecating the current arguments of OneHotEncoder)?

amueller commented Apr 26, 2018

Sorry for being absent. Seems ok, but can you maybe give me two weeks so I can actually review? Thanks!

jorisvandenbossche commented May 14, 2018

@amueller no problem, for me the same :-)
But I am now planning to look at this again, so if you could give it a look, that would be welcome. I have some work to do on the PR (#10523), so don't review that in detail yet (though you can look at it to get an idea of what we propose).
I think the main question I want to see answered before I put a lot of time into it is whether you are OK with splitting up CategoricalEncoder into multiple classes, and in that case, whether you are OK with re-using OneHotEncoder (which means deprecating some of its current (strange) features). Those questions are summarized in #10521 (comment) and #10521 (comment).

(and once we agree on that part, there is still a lot to discuss about the actual implementation in the PR :))

jorisvandenbossche commented May 17, 2018

I updated the PR #10523, ready for review

amueller commented May 19, 2018

I'll cautiously say I'm back ;)

kuraga commented May 28, 2018

IMHO the most important thing is a universal API (i.e. parameters and behavior patterns) for all of the encoders we discuss.

P.S. https://github.com/scikit-learn-contrib/categorical-encoding ?

jorisvandenbossche commented May 28, 2018

In the category_encoders package, all encoders have a cols argument, similar to the categorical_features in the old OneHotEncoder (although it does not accept exactly the same kinds of values). See e.g. http://contrib.scikit-learn.org/categorical-encoding/onehot.html
So that is related to the current discussion we are having in #10523 about deprecating categorical_features or not.

For the rest, I think there are no real keyword conflicts (they have some others specific to dataframes, which we won't add to sklearn at this point). The naming of OneHotEncoder and OrdinalEncoder at least is consistent with the category_encoders package.
