Example of TargetEncoder w/ categorical target #182
All encoders should accept `y_categorical = pd.Categorical(y[0])`. But I am not sure that they all do. There are four things to do:
OK thanks. What do you mean by polynomial targets?
In unit tests, we currently test only a limited set of target representations. Since each supervised encoder uses the target during fitting, the encoders should always encode the datasets the same way regardless of the target representation used.
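The invariance property above can be sketched as a small test. The snippet below uses a hand-rolled mean target encoder (a toy stand-in, not the library's implementation) to show the same nominal feature being encoded identically whether the binary target arrives as ints, strings, or a pandas categorical:

```python
import pandas as pd

def mean_target_encode(x: pd.Series, y: pd.Series) -> pd.Series:
    # Toy stand-in for a supervised encoder: map each category to the
    # mean of a numeric view of the target (no smoothing).
    y_numeric = pd.Series(pd.factorize(y, sort=True)[0], index=y.index)
    return x.map(y_numeric.groupby(x).mean())

x = pd.Series(['a', 'b', 'a', 'b', 'a'])

# The same binary target in three representations.
y_int = pd.Series([0, 1, 1, 0, 1])
y_str = y_int.map({0: 'no', 1: 'yes'})
y_cat = y_str.astype('category')

enc_int = mean_target_encode(x, y_int)
enc_str = mean_target_encode(x, y_str)
enc_cat = mean_target_encode(x, y_cat)
# All three encodings should be element-wise identical.
```

A unit test along these lines would catch any encoder whose output drifts when the target dtype changes.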
Ah ok -- I was confused by the term "polynomial". I think the more standard term is "multiclass classification" for classification with more than 2 unique target values. "Polynomial" makes me think of …
I observed the same lack of functionality in TargetEncoder and LeaveOneOutEncoder. I then went through Barreca's paper to find out what was missing; basically, what it proposes is:
This is only useful when the categorical target has much lower cardinality than the independent variable you're trying to encode. A better alternative might be the one described in *Encoding Categorical Variables with Conjugate Bayesian Models*.
The implementation of the referenced article is at https://github.com/aslakey/CBM_Encoding. In short: instead of returning just the average of the target, as TargetEncoder does, it also returns the variance of the target, which is definitely interesting. However, when we execute https://github.com/aslakey/CBM_Encoding/blob/master/run_dirichlet_experiments.py on the car data set, which has a label with 4 classes and 6 nominal features, and calculate both avg() and var(), we end up with 48 features (6 features × 4 classes × 2 statistics).
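The feature-count arithmetic can be checked with a quick sketch. The `per_class_mean_var` helper below is hypothetical (it is not the CBM_Encoding code), but it mirrors the idea on data shaped like the car dataset: a one-vs-rest indicator per class, with avg() and var() computed per category of each feature:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Shape mirrors the car dataset: 6 nominal features, a 4-class label.
X = pd.DataFrame({f'f{i}': rng.integers(0, 3, n) for i in range(6)}).astype(str)
y = pd.Series(rng.integers(0, 4, n)).astype(str)

def per_class_mean_var(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    # For each (class, feature) pair, emit the mean and variance of the
    # one-vs-rest indicator per category: 4 * 6 * 2 = 48 output columns.
    out = {}
    for cls in sorted(y.unique()):
        indicator = (y == cls).astype(float)
        for col in X.columns:
            stats = indicator.groupby(X[col]).agg(['mean', 'var'])
            out[f'{col}_{cls}_mean'] = X[col].map(stats['mean'])
            out[f'{col}_{cls}_var'] = X[col].map(stats['var'])
    return pd.DataFrame(out)

encoded = per_class_mean_var(X, y)
# encoded has 48 columns: 6 features x 4 classes x 2 statistics.
```

Whether 48 derived columns is acceptable depends on the downstream model; it is the price of keeping per-class information instead of a single target mean.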
It seems to be broken with any categorical target. Running

```python
from category_encoders import TargetEncoder

import numpy as np
import pandas as pd

if __name__ == '__main__':
    enc = TargetEncoder
    # Four random nominal features and a binary target.
    x = np.random.randint(low=0, high=5, size=(150, 4))
    y = np.random.randint(low=0, high=2, size=(150,))
    x_cat = pd.DataFrame(x)
    for col in x_cat.columns:
        x_cat[col] = x_cat[col].astype('category')
    y_cat = pd.Series(y, dtype='category')
    enc().fit(x_cat, y_cat)
```

produces an error.
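Until the dtype handling is fixed, one workaround (a usage assumption on my part, not an official fix) is to hand the encoder a numeric view of the categorical target via `Series.cat.codes`:

```python
import pandas as pd

# A binary target stored with the category dtype the encoder chokes on.
y_cat = pd.Series([0, 1, 1, 0], dtype='category')

# .cat.codes returns the integer index of each category (int8), which
# behaves like an ordinary numeric target for downstream code.
y_numeric = y_cat.cat.codes.astype(int)
```

Note that this silently imposes an ordering on the categories, which is harmless for a binary target but is exactly the ambiguity a proper multiclass fix would need to resolve.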
There are a few bugs that prevent the TargetEncoder from working as intended (see scikit-learn-contrib/category_encoders#182):

- It does not work with a `categorical` series as target
- It does not work for multi-class classification
Is anyone working on this currently? Is there a difference between this TargetEncoder and the scikit-learn LabelEncoder, or does this library implement LabelEncoder elsewhere? Tangentially, are the unit tests here set up to ensure that the encoders yield the same results as their scikit-learn counterparts, since this library touts full compatibility with sklearn pipelines? This would be particularly important for those wanting to use this library as a drop-in replacement in their scikit-learn workflows.
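The two transforms answer different questions, which a toy comparison makes concrete. The sketch below is pandas-only: `pd.factorize(sort=True)` stands in for sklearn's `LabelEncoder`, and a plain group mean stands in for target encoding (without the smoothing the library applies):

```python
import pandas as pd

x = pd.Series(['red', 'blue', 'red', 'green'])
y = pd.Series([1, 0, 0, 1])

# LabelEncoder-style: an arbitrary integer id per category; y is ignored.
label_style = pd.Series(pd.factorize(x, sort=True)[0])

# TargetEncoder-style: each category becomes the mean of y observed
# alongside it, so the encoding is supervised.
target_style = x.map(y.groupby(x).mean())
```

So LabelEncoder is an unsupervised relabeling (and sklearn intends it for targets, not features), while TargetEncoder leaks information from `y` into the features and therefore needs the leave-one-out/smoothing machinery this library provides.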
Is there an example of using `TargetEncoder` with a categorical target variable? The docstring suggests that it should be possible, but I don't see how the code determines whether `y` is continuous or categorical.

Am I supposed to pass a categorical `y` as a vector of strings? As an `(n_obs, n_classes)` array of one-hot encoded labels? I tried a few things, but they don't seem to work. The code takes the mean of `y`, which seems weird when `y` is categorical.
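The "mean of `y`" concern is easy to reproduce in isolation: pandas refuses to average a categorical series, so any code path that calls `y.mean()` on the raw target will fail (sketch below; the exact exception message varies by pandas version):

```python
import pandas as pd

y = pd.Series(['a', 'b', 'a'], dtype='category')

# Averaging a categorical series is rejected by pandas, so code that
# blindly takes the mean of the target blows up on categorical y.
try:
    y.mean()
    raised = False
except TypeError:
    raised = True
```

This matches the traceback reported above: the encoder assumes a numeric target and never branches on the dtype of `y`.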