
Example of TargetEncoder w/ categorical target #182

Open
bkj opened this issue Apr 23, 2019 · 8 comments

bkj commented Apr 23, 2019

Is there an example of using TargetEncoder w/ a categorical target variable? The docstring suggests that it should be possible, but I don't see how the code is determining that y is continuous vs. categorical.

Am I supposed to pass categorical y as a vector of strings? As an (n_obs, n_classes) array of one-hot encoded labels? I tried a few things, but they don't seem to work.

The code takes the mean of y -- which seems weird when y is categorical.

@janmotl janmotl added the bug label Apr 23, 2019

janmotl commented Apr 23, 2019

All encoders should accept pd.Categorical:

y_categorical = pd.Categorical(y[0])

But TargetEncoder does not accept that. Hence, that is a bug.
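Until that is fixed, one workaround (my own sketch, not an official API) is to convert the categorical target to its integer codes before fitting, so that `y.mean()` works:

```python
import pandas as pd

# Hypothetical workaround (not the eventual fix): map a binary categorical
# target to its integer codes before passing it to an encoder that calls
# y.mean() internally.
y = pd.Series(["no", "yes", "yes", "yes"], dtype="category")
y_numeric = pd.Series(y.cat.codes, index=y.index)  # 'no' -> 0, 'yes' -> 1
print(y_numeric.mean())  # 0.75
```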


Also, I am not sure that TargetEncoder currently handles polynomial targets correctly. The way it should handle them is described in the article "A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems", in the section "Extension to Multi-Valued Categorical Targets". In short, for each encoded feature, it should create m-1 columns, where m is the count of unique values in the target.
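
A minimal pandas sketch of that scheme (illustrative only, not the library's code): with m = 3 target classes, drop one class and compute, per encoded feature, the category-wise mean of each remaining class indicator, giving m-1 = 2 new columns:

```python
import pandas as pd

# Toy data: one nominal feature and a 3-class target.
df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "b"],
    "y":   ["x", "y", "x", "z", "z"],
})

classes = sorted(df["y"].unique())  # ['x', 'y', 'z'], so m = 3
for cls in classes[:-1]:            # m - 1 = 2 columns per encoded feature
    indicator = (df["y"] == cls).astype(float)
    df[f"cat_te_{cls}"] = indicator.groupby(df["cat"]).transform("mean")

print(df[["cat", "cat_te_x", "cat_te_y"]])
```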

There are four things to do:

  1. Write parameterized unit tests for binomial pd.Categorical targets. A good start is to clone and modify test_classification() in test_encoders.py.
  2. Write parameterized unit tests for polynomial pd.Categorical targets. It is OK to skip, in the tests, encoders that you are not interested in or that do not work out of the box.
  3. Fix TargetEncoder.
  4. Write a pull request.


bkj commented Apr 23, 2019

OK thanks. What do you mean by polynomial targets?


janmotl commented Apr 23, 2019

In unit tests, we currently test only targets with {True, False}. That's insufficient.

Since each supervised encoder (e.g. TargetEncoder, WOEEncoder, ...) should support binary targets, we should test each encoder with the following targets:

  1. Strings like: {'Apple', 'Banana'}
  2. Integers like: {-1000, 2000}. I used "ugly" numbers since if it works for them, it should also work for "nice" numbers like {0, 1}.
  3. Booleans, just like we already do.
  4. Pandas Categoricals with two unique values.

The encoders should always encode the datasets the same way regardless of the used target representation.
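That invariance could be checked with a toy mean-target encoder (a sketch, not the library's implementation) that first maps any binary target to {0, 1} and then verifies that every representation yields the same encoding:

```python
import pandas as pd

def encode_with_binary_target(x, y):
    """Toy mean-target encoding: map the two target values to 0/1, then
    replace each category of x with the mean of that 0/1 target."""
    y = pd.Series(list(y))
    lo, hi = sorted(y.unique().tolist())
    y01 = y.map({lo: 0.0, hi: 1.0})
    return y01.groupby(pd.Series(list(x))).transform("mean")

x = ["a", "a", "b", "b"]
representations = [
    ["Apple", "Banana", "Banana", "Banana"],          # strings
    [-1000, 2000, 2000, 2000],                        # "ugly" integers
    [False, True, True, True],                        # booleans
    pd.Categorical(["Apple", "Banana", "Banana", "Banana"]),
]
results = [encode_with_binary_target(x, y).tolist() for y in representations]
assert all(r == results[0] for r in results)  # same encoding every time
print(results[0])  # [0.5, 0.5, 1.0, 1.0]
```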


TargetEncoder should also be tested on polynomial targets (categorical targets with more than 2 unique values) like:

  1. Strings: {'Apple', 'Banana', 'Cinnamon'}
  2. Integers: {-1000, 2000, 4000}
  3. Pandas Categoricals with three unique values.

Finally, TargetEncoder should also be tested on continuous targets, like doubles, since it should support regression tasks (see Continuous Targets in the article).


bkj commented Apr 23, 2019

Ah ok -- I was confused by the term "polynomial". I think a more standard term is "multiclass classification" for classification w/ more than 2 unique target values. "Polynomial" makes me think of f(x) = a * x ** 2 + b * x + c...


willrazen commented May 15, 2019

I observed the same lack of functionality in TargetEncoder and LeaveOneOutEncoder. Then I went to Barreca's paper to find out what was missing, and basically what it proposes is:

  1. One-hot-encode the categorical target variable, except one category (so they are linearly independent)
  2. For each new binary target, encode the categorical independent variable with the proposed technique for binary targets (i.e. several new predictors will be created instead of only one)

This is only useful when the categorical target has much lower cardinality than the independent variable you're trying to encode.

A better alternative might be the one described in "Encoding Categorical Variables with Conjugate Bayesian Models for WeWork Lead Scoring Engine", available on arXiv.


janmotl commented May 15, 2019

The implementation of the referenced article is at: https://github.com/aslakey/CBM_Encoding

In short: instead of returning just "the average of the target", as TargetEncoder does, it also returns "the variance of the target". And that's definitely interesting.

However, when we execute https://github.com/aslakey/CBM_Encoding/blob/master/run_dirichlet_experiments.py on the car dataset, which has a label with 4 classes and 6 nominal features, and calculate both avg() and var(), we end up with 48 features (4*6*2 = 48). Hence, I don't think it really solves the issue with a high-cardinality dependent variable...
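
That arithmetic can be reproduced with a toy stand-in (plain groupby statistics, not the actual Dirichlet encoder): 6 nominal features × 4 class indicators × 2 statistics gives 48 columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_classes, n_features = 4, 6
X = pd.DataFrame(rng.integers(0, 3, size=(100, n_features)))
y = pd.Series(rng.integers(0, n_classes, size=100))

# One mean and one variance column per (feature, class) pair.
encoded = {}
for col in X.columns:
    for cls in range(n_classes):
        indicator = (y == cls).astype(float)
        grouped = indicator.groupby(X[col])
        encoded[f"f{col}_c{cls}_avg"] = grouped.transform("mean")
        encoded[f"f{col}_c{cls}_var"] = grouped.transform("var")
out = pd.DataFrame(encoded)
print(out.shape)  # (100, 48)
```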

PGijsbers commented

It seems to be broken with any categorical target (even if it is binary):

from category_encoders import TargetEncoder
import numpy as np
import pandas as pd

if __name__ == '__main__':
    enc = TargetEncoder
    x = np.random.randint(low=0, high=5, size=(150, 4))
    y = np.random.randint(low=0, high=2, size=(150,))

    x_cat = pd.DataFrame(x)
    for col in x_cat.columns:
        x_cat[col] = x_cat[col].astype('category')
    y_cat = pd.Series(y, dtype='category')
    enc().fit(x_cat, y_cat)

produces

Traceback (most recent call last):
  File ".../mwece.py", line 14, in <module>
    enc().fit(x_cat, y_cat)
  File "...\venv\lib\site-packages\category_encoders\target_encoder.py", line 142, in fit
    self.mapping = self.fit_target_encoding(X_ordinal, y)
  File "...\lib\site-packages\category_encoders\target_encoder.py", line 168, in fit_target_encoding
    prior = self._mean = y.mean()
  File "...\lib\site-packages\pandas\core\generic.py", line 11214, in stat_func
    return self._reduce(
  File "...\venv\lib\site-packages\pandas\core\series.py", line 3872, in _reduce
    return delegate._reduce(name, skipna=skipna, **kwds)
  File "...\venv\lib\site-packages\pandas\core\arrays\categorical.py", line 2124, in _reduce
    raise TypeError(f"Categorical cannot perform the operation {name}")
TypeError: Categorical cannot perform the operation mean
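
One possible shape of a fix (a sketch under my own assumptions, not the patch that was merged): coerce a binary categorical target to its 0/1 codes before taking the prior mean:

```python
import pandas as pd

def target_prior(y: pd.Series) -> float:
    """Prior mean of the target; tolerates a binary pd.Categorical."""
    if isinstance(y.dtype, pd.CategoricalDtype):
        if len(y.cat.categories) != 2:
            raise ValueError("expected a binary categorical target")
        y = pd.Series(y.cat.codes, index=y.index)
    return float(y.mean())

y_cat = pd.Series([0, 1, 1, 0], dtype="category")
print(target_prior(y_cat))  # 0.5, instead of the TypeError above
```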

PGijsbers added a commit to openml-labs/gama that referenced this issue Jan 13, 2021
There are a few bugs that prevent the TargetEncoder from working as
intended (see
scikit-learn-contrib/category_encoders#182):
 - It does not work with a `categorical` series as target
 - It does not work for multi-class classification
Shellcat-Zero commented
Is anyone working on this currently? Is there a difference between this TargetEncoder and the SKLearn LabelEncoder, or does this library implement LabelEncoder elsewhere?

Tangentially, are the unit tests set up here to ensure that the encoders yield the same results as their SKLearn counterparts, since this library touts full compatibility with sklearn pipelines? This would be particularly important for those wanting to use this library as a drop-in replacement within their SKLearn workflows.
