Rethinking the CategoricalEncoder API? #10521

@jorisvandenbossche

Based on some discussions we are having here and on issues that have been opened, we have some doubts whether CategoricalEncoder (#9151) was a good choice of name (and since it is not released yet, we have some room for change).

So, a summary of how it is now:

  • The class name CategoricalEncoder says what type of data it accepts (categorical data)
  • The keyword argument encoding specifies how to encode those data

Currently we already have encoding='onehot'|'onehot-dense'|'ordinal'.
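
To make the difference between these encodings concrete, here is a minimal pure-Python sketch for a single column. It is not the CategoricalEncoder implementation: the helper names are hypothetical, and using the sorted unique values as categories is an assumption of the sketch. The sparse 'onehot' variant would hold the same values as 'onehot-dense', only in a sparse container.

```python
def fit_categories(column):
    """Learn the categories of one column (assumed here: sorted unique values)."""
    return sorted(set(column))


def encode_ordinal(column, categories):
    """'ordinal': replace each value by the integer position of its category."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[value] for value in column]


def encode_onehot_dense(column, categories):
    """'onehot-dense': one 0/1 column per category, returned as dense rows."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


colors = ["red", "green", "blue", "green"]
cats = fit_categories(colors)               # ['blue', 'green', 'red']
ordinal = encode_ordinal(colors, cats)      # [2, 1, 0, 1]
onehot = encode_onehot_dense(colors, cats)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```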

But what to do in the following cases:

  • We want to add more encoding options (e.g. binary encoding, mean target encoding, unary encoding, ...). Do we keep adding those as new values for the encoding kwarg in the one big CategoricalEncoder class?
  • We want to add an option specific to one of the encodings (e.g. for 'onehot' encoding, to drop the first (redundant) column, or for 'ordinal' encoding, to base the order of the categories on their frequency, ...). The problem here is that we then need to add keyword arguments to CategoricalEncoder that are or are not active depending on what you passed for the encoding kwarg, which is not the nicest API design.

For that last problem, we already ran into this with the sparse=True/False option, which was only relevant for 'onehot' and not for 'ordinal', and which we solved by having both 'onehot' and 'onehot-dense' encoding options instead of a sparse keyword. But such an approach does not scale either.
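
The coupling between the encoding kwarg and encoding-specific options could look roughly like the sketch below. Everything in it is hypothetical: the class name SingleEncoderSketch and the option names drop_first and order_by_frequency are made up for illustration and are not part of CategoricalEncoder. The point is only that the class needs a table of which option is active for which encoding, and some check for combinations that make no sense.

```python
class SingleEncoderSketch:
    """Hypothetical single-class design; not the scikit-learn implementation."""

    # Which extra options are meaningful for which encoding (illustrative only).
    _VALID_OPTIONS = {
        "onehot": {"drop_first"},
        "ordinal": {"order_by_frequency"},
    }

    def __init__(self, encoding="onehot", drop_first=False,
                 order_by_frequency=False):
        self.encoding = encoding
        self.drop_first = drop_first
        self.order_by_frequency = order_by_frequency

    def check_params(self):
        """Raise if an option was enabled that the chosen encoding ignores."""
        if self.encoding not in self._VALID_OPTIONS:
            raise ValueError("unknown encoding: %r" % (self.encoding,))
        enabled = {name for name in ("drop_first", "order_by_frequency")
                   if getattr(self, name)}
        ignored = enabled - self._VALID_OPTIONS[self.encoding]
        if ignored:
            raise ValueError("%s has no effect with encoding=%r"
                             % (sorted(ignored), self.encoding))


# Fine: drop_first belongs to the 'onehot' encoding.
SingleEncoderSketch(encoding="onehot", drop_first=True).check_params()

# Not fine: the same keyword is meaningless for 'ordinal'.
try:
    SingleEncoderSketch(encoding="ordinal", drop_first=True).check_params()
    message = None
except ValueError as exc:
    message = str(exc)
```

Every new encoding-specific option adds another entry to that table and another keyword to the signature, which is the scaling problem described above.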

Related to this, there is a PR to add a UnaryEncoder (#8652). There was a related discussion on the naming in that PR, as currently the name says how it encodes, not what type of data it gets. In the current design, it accepts already encoded integers, not actual categorical data; in that regard, to be consistent with CategoricalEncoder, it might better be named OrdinalEncoder, because it needs ordinal data as input.


What are the options going forward:

  1. Keep things as we have them now in master, and be OK with adding some new options to the single class (an important question, which is hard to answer now, is how many new features we will want to add in the future).
  2. Switch the naming scheme and have a bunch of 'categorical encoders' where the name says how it encodes (OneHotEncoder, OrdinalEncoder, and later maybe BinaryEncoder, UnaryEncoder, ...)

So it is a bit of a trade-off between a potential build-up in the number of classes and in the number of keyword arguments in a single class.
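
For comparison, the second option could be sketched as follows, again for a single column and in pure Python. The class names carry a Sketch suffix because they are illustrations, not the proposed implementation, and the options drop_first and order_by_frequency are hypothetical names for the examples mentioned earlier. Ordering by frequency is assumed here to mean "most frequent category first", with ties broken alphabetically.

```python
from collections import Counter


class OrdinalEncoderSketch:
    """Hypothetical encoder that only knows ordinal-specific options."""

    def __init__(self, order_by_frequency=False):
        self.order_by_frequency = order_by_frequency

    def fit(self, column):
        if self.order_by_frequency:
            counts = Counter(column)
            # Assumption: most frequent first, ties broken alphabetically.
            self.categories_ = sorted(counts, key=lambda c: (-counts[c], c))
        else:
            self.categories_ = sorted(set(column))
        return self

    def transform(self, column):
        index = {cat: i for i, cat in enumerate(self.categories_)}
        return [index[value] for value in column]


class OneHotEncoderSketch:
    """Hypothetical encoder that only knows one-hot-specific options."""

    def __init__(self, drop_first=False):
        self.drop_first = drop_first

    def fit(self, column):
        self.categories_ = sorted(set(column))
        return self

    def transform(self, column):
        index = {cat: i for i, cat in enumerate(self.categories_)}
        start = 1 if self.drop_first else 0  # skip the redundant first column
        rows = []
        for value in column:
            row = [0] * len(self.categories_)
            row[index[value]] = 1
            rows.append(row[start:])
        return rows


sizes = ["S", "M", "M", "L", "M", "S"]
by_freq = OrdinalEncoderSketch(order_by_frequency=True).fit(sizes).transform(sizes)
reduced = OneHotEncoderSketch(drop_first=True).fit(sizes).transform(sizes)
```

Each class only exposes keywords that are always active, at the cost of one class per encoding.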


One problem with the second approach (and one of the reasons we went with CategoricalEncoder in the first place, even before we added the multiple encoding options) is that there is already a OneHotEncoder, which has a different API than the CategoricalEncoder. And there is not really a good alternative name we could use for the encoder that does one-hot encoding.
However, I think that, with some temporary ugly hacks, we could reuse the name, if we are OK with deprecating the current attributes (and I think we agree they are not the most useful attributes). The idea would be that if you fit the class with string data, you get the new behaviour, and if you fit the class with integer data, you get a deprecation warning indicating that the default behaviour will change (and indicating which keyword to specify to get rid of the warning).
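
A rough sketch of that transition logic, under stated assumptions: the class name, the keyword legacy_integer_mode, and the behaviour_ attribute are all hypothetical placeholders (the keyword that would actually silence the warning is not decided here), and only the dispatch on the input type is shown, not any encoding.

```python
import warnings


class OneHotEncoderTransitionSketch:
    """Hypothetical dispatch between old and new behaviour during deprecation."""

    def __init__(self, legacy_integer_mode=None):
        # None means "the user did not choose"; True/False is an explicit choice.
        self.legacy_integer_mode = legacy_integer_mode

    def fit(self, column):
        all_ints = all(isinstance(v, int) and not isinstance(v, bool)
                       for v in column)
        if not all_ints:
            # String (or other non-integer) data: new behaviour, no warning.
            self.behaviour_ = "new"
        elif self.legacy_integer_mode is None:
            # Integer data without an explicit choice: keep the old behaviour
            # for now, but announce that the default will change.
            warnings.warn(
                "The default handling of integer data will change; pass "
                "legacy_integer_mode=True or False to silence this warning.",
                DeprecationWarning,
                stacklevel=2,
            )
            self.behaviour_ = "legacy"
        else:
            self.behaviour_ = "legacy" if self.legacy_integer_mode else "new"
        return self


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    on_strings = OneHotEncoderTransitionSketch().fit(["a", "b"])
    on_integers = OneHotEncoderTransitionSketch().fit([0, 1, 2])
    explicit = OneHotEncoderTransitionSketch(legacy_integer_mode=False).fit([0, 1, 2])
```

Only the second fit call emits a warning: string input goes straight to the new behaviour, and an explicit keyword silences the warning for integer input.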

cc @jnothman @amueller @GaelVaroquaux @rth
