
Example of TargetEncoder w/ categorical target #182

Open
bkj opened this issue Apr 23, 2019 · 8 comments

bkj commented Apr 23, 2019

Is there an example of using TargetEncoder w/ a categorical target variable? The docstring suggests that it should be possible, but I don't see how the code is determining that y is continuous vs. categorical.

Am I supposed to pass categorical y as a vector of strings? As an (n_obs, n_classes) array of one-hot encoded labels? I tried a few things, but they don't seem to work.

The code takes the mean of y -- which seems weird when y is categorical.

@janmotl janmotl added the bug label Apr 23, 2019

janmotl commented Apr 23, 2019

All encoders should accept pd.Categorical:

y_categorical = pd.Categorical(y[0])

But TargetEncoder does not accept that. Hence, that is a bug.
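Until that is fixed, one workaround (my own sketch, not an official API) is to convert the categorical target to its integer codes before fitting, so that `y.mean()` works:

```python
import pandas as pd

# Hypothetical workaround (not the eventual fix): map a binary categorical
# target to its integer codes before passing it to an encoder that calls
# y.mean() internally.
y = pd.Series(["no", "yes", "yes", "yes"], dtype="category")
y_numeric = pd.Series(y.cat.codes, index=y.index)  # 'no' -> 0, 'yes' -> 1
print(y_numeric.mean())  # 0.75
```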


Also, I am not sure that TargetEncoder currently handles polynomial targets correctly. The way it should handle them is described in the article "A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems", in the section "Extension to Multi-Valued Categorical Targets". In short, for each encoded feature, it should create m-1 columns, where m is the count of unique values in the target.
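
A minimal pandas sketch of that scheme (illustrative only, not the library's code): with m = 3 target classes, drop one class and compute, per encoded feature, the category-wise mean of each remaining class indicator, giving m-1 = 2 new columns:

```python
import pandas as pd

# Toy data: one nominal feature and a 3-class target.
df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "b"],
    "y":   ["x", "y", "x", "z", "z"],
})

classes = sorted(df["y"].unique())  # ['x', 'y', 'z'], so m = 3
for cls in classes[:-1]:            # m - 1 = 2 columns per encoded feature
    indicator = (df["y"] == cls).astype(float)
    df[f"cat_te_{cls}"] = indicator.groupby(df["cat"]).transform("mean")

print(df[["cat", "cat_te_x", "cat_te_y"]])
```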

There are four things to do:

  1. Write parameterized unit tests for binomial pd.Categorical targets. A good start is to clone and modify test_classification() in test_encoders.py.
  2. Write parameterized unit tests for polynomial pd.Categorical targets. It is OK to skip, in the tests, encoders that you are not interested in or that do not work out of the box.
  3. Fix TargetEncoder.
  4. Write a pull request.


bkj commented Apr 23, 2019

OK thanks. What do you mean by polynomial targets?


janmotl commented Apr 23, 2019

In unit tests, we currently test only targets with {True, False}. That's insufficient.

Since each supervised encoder (e.g. TargetEncoder, WOEEncoder, ...) should support binary targets, we should test each encoder with the following targets:

  1. Strings like: {'Apple', 'Banana'}
  2. Integers like: {-1000, 2000}. I used "ugly" numbers since if it works for them, it should also work for "nice" numbers like {0, 1}.
  3. Booleans, just like we already do.
  4. Pandas Categoricals with two unique values.

The encoders should always encode the datasets the same way regardless of the used target representation.
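That invariance could be checked with a toy mean-target encoder (a sketch, not the library's implementation) that first maps any binary target to {0, 1} and then verifies that every representation yields the same encoding:

```python
import pandas as pd

def encode_with_binary_target(x, y):
    """Toy mean-target encoding: map the two target values to 0/1, then
    replace each category of x with the mean of that 0/1 target."""
    y = pd.Series(list(y))
    lo, hi = sorted(y.unique().tolist())
    y01 = y.map({lo: 0.0, hi: 1.0})
    return y01.groupby(pd.Series(list(x))).transform("mean")

x = ["a", "a", "b", "b"]
representations = [
    ["Apple", "Banana", "Banana", "Banana"],          # strings
    [-1000, 2000, 2000, 2000],                        # "ugly" integers
    [False, True, True, True],                        # booleans
    pd.Categorical(["Apple", "Banana", "Banana", "Banana"]),
]
results = [encode_with_binary_target(x, y).tolist() for y in representations]
assert all(r == results[0] for r in results)  # same encoding every time
print(results[0])  # [0.5, 0.5, 1.0, 1.0]
```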


TargetEncoder should also be tested on polynomial targets (categorical targets with more than 2 unique values) like:

  1. Strings: {'Apple', 'Banana', 'Cinnamon'}
  2. Integers: {-1000, 2000, 4000}
  3. Pandas Categoricals with three unique values.

Finally, TargetEncoder should also be tested on continuous targets, like doubles, since it should support regression tasks (see Continuous Targets in the article).


bkj commented Apr 23, 2019

Ah ok -- I was confused by the term "polynomial". I think a more standard term is "multiclass classification" for classification w/ more than 2 unique target values. "Polynomial" makes me think of f(x) = a * x ** 2 + b * x + c...


willrazen commented May 15, 2019

I observed the same lack of functionality in TargetEncoder and LeaveOneOutEncoder. Then I went to Barreca's paper to find out what was missing, and basically what it proposes is:

  1. One-hot-encode the categorical target variable, except one category (so they are linearly independent)
  2. For each new binary target, encode the categorical independent variable with the proposed technique for binary targets (i.e. several new predictors will be created instead of only one)

This is only useful when the categorical target has much lower cardinality than the independent variable you're trying to encode.

A better alternative might be the one described in "Encoding Categorical Variables with Conjugate Bayesian Models for WeWork Lead Scoring Engine", available on arXiv.


janmotl commented May 15, 2019

The implementation of the referenced article is at: https://github.com/aslakey/CBM_Encoding

In short: instead of returning just "the average of the target", as TargetEncoder does, it also returns "the variance of the target". And that's definitely interesting.

However, when we execute https://github.com/aslakey/CBM_Encoding/blob/master/run_dirichlet_experiments.py on the car dataset, which has a label with 4 classes and 6 nominal features, and calculate both avg() and var(), we end up with 48 features (4*6*2 = 48). Hence, I don't think it really solves the issue with a high-cardinality dependent variable...
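
That arithmetic can be reproduced with a toy stand-in (plain groupby statistics, not the actual Dirichlet encoder): 6 nominal features × 4 class indicators × 2 statistics gives 48 columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_classes, n_features = 4, 6
X = pd.DataFrame(rng.integers(0, 3, size=(100, n_features)))
y = pd.Series(rng.integers(0, n_classes, size=100))

# One mean and one variance column per (feature, class) pair.
encoded = {}
for col in X.columns:
    for cls in range(n_classes):
        indicator = (y == cls).astype(float)
        grouped = indicator.groupby(X[col])
        encoded[f"f{col}_c{cls}_avg"] = grouped.transform("mean")
        encoded[f"f{col}_c{cls}_var"] = grouped.transform("var")
out = pd.DataFrame(encoded)
print(out.shape)  # (100, 48)
```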

PGijsbers commented

It seems to be broken with any categorical target (even if it is binary):

from category_encoders import TargetEncoder
import numpy as np
import pandas as pd

if __name__ == '__main__':
    enc = TargetEncoder
    x = np.random.randint(low=0, high=5, size=(150, 4))
    y = np.random.randint(low=0, high=2, size=(150,))

    x_cat = pd.DataFrame(x)
    for col in x_cat.columns:
        x_cat[col] = x_cat[col].astype('category')
    y_cat = pd.Series(y, dtype='category')
    enc().fit(x_cat, y_cat)

produces

Traceback (most recent call last):
  File ".../mwece.py", line 14, in <module>
    enc().fit(x_cat, y_cat)
  File "...\venv\lib\site-packages\category_encoders\target_encoder.py", line 142, in fit
    self.mapping = self.fit_target_encoding(X_ordinal, y)
  File "...\lib\site-packages\category_encoders\target_encoder.py", line 168, in fit_target_encoding
    prior = self._mean = y.mean()
  File "...\lib\site-packages\pandas\core\generic.py", line 11214, in stat_func
    return self._reduce(
  File "...\venv\lib\site-packages\pandas\core\series.py", line 3872, in _reduce
    return delegate._reduce(name, skipna=skipna, **kwds)
  File "...\venv\lib\site-packages\pandas\core\arrays\categorical.py", line 2124, in _reduce
    raise TypeError(f"Categorical cannot perform the operation {name}")
TypeError: Categorical cannot perform the operation mean
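
One possible shape of a fix (a sketch under my own assumptions, not the patch that was merged): coerce a binary categorical target to its 0/1 codes before taking the prior mean:

```python
import pandas as pd

def target_prior(y: pd.Series) -> float:
    """Prior mean of the target; tolerates a binary pd.Categorical."""
    if isinstance(y.dtype, pd.CategoricalDtype):
        if len(y.cat.categories) != 2:
            raise ValueError("expected a binary categorical target")
        y = pd.Series(y.cat.codes, index=y.index)
    return float(y.mean())

y_cat = pd.Series([0, 1, 1, 0], dtype="category")
print(target_prior(y_cat))  # 0.5, instead of the TypeError above
```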

PGijsbers added a commit to openml-labs/gama that referenced this issue Jan 13, 2021
There are a few bugs that prevent the TargetEncoder from working as
intended (see
scikit-learn-contrib/category_encoders#182):
 - It does not work with a `categorical` series as target
 - It does not work for multi-class classification
Shellcat-Zero commented
Is anyone working on this currently? Is there a difference between this TargetEncoder and the SKLearn LabelEncoder, or does this library implement LabelEncoder elsewhere?

Tangentially, are the unit tests set up here to ensure that the encoders yield the same results as their SKLearn counterparts, since this library touts full compatibility with sklearn pipelines? This would be particularly important for those wanting to use this library as a drop-in replacement within their SKLearn workflows.
