Binary encoding for features with cardinality <= 2 #300

Merged
GaelVaroquaux merged 17 commits into skrub-data:master from the binary_enc branch on Sep 9, 2022

Conversation

LilianBoulard
Member

Fixes #246
Also refactored some tests for the SuperVectorizer to make the code cleaner.

@LilianBoulard LilianBoulard added the enhancement New feature or request label Aug 8, 2022
@LilianBoulard LilianBoulard added this to the 0.3.0 release milestone Aug 8, 2022
@LilianBoulard LilianBoulard self-assigned this Aug 8, 2022
@LilianBoulard
Member Author

sklearn's OrdinalEncoder does not include a get_feature_names method, and only gained get_feature_names_out in version 1.1.
How should we deal with this issue?

One option would be to implement a BinaryEncoder that inherits from OrdinalEncoder and only implements what we need.

@GaelVaroquaux
Member

I think that using the option 'drop="if_binary"' of the OneHotEncoder would be a cleaner way of doing things. I realize that it is available only since sklearn 0.23, but it makes code more future-proof.
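
For readers unfamiliar with the option, here is a minimal sketch of what drop="if_binary" does in scikit-learn's OneHotEncoder. The toy data and variable names are illustrative assumptions and are not taken from the dirty_cat code base.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): the first column is binary,
# the second column has three categories.
X = np.array(
    [
        ["yes", "red"],
        ["no", "green"],
        ["yes", "blue"],
        ["no", "red"],
    ],
    dtype=object,
)

# With drop="if_binary", a feature with exactly two categories is encoded
# as a single 0/1 column; features with more categories keep one column each.
encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(X).toarray()

# One column for the binary feature plus three for the ternary feature.
n_output_columns = encoded.shape[1]
```

The binary feature contributes one output column instead of two, while the three-category feature is still fully one-hot encoded.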

Member

@jovan-stojanovic jovan-stojanovic left a comment

A few suggestions for the docs

dirty_cat/super_vectorizer.py (4 outdated review comments, resolved)
Member

@GaelVaroquaux GaelVaroquaux left a comment

Using the option 'drop="if_binary"' of the OneHotEncoder would be a cleaner way of doing things.

It has only been available since sklearn 0.23, but it makes the code more future-proof.

…binary_enc

Conflicts:
	dirty_cat/super_vectorizer.py
	dirty_cat/test/test_super_vectorizer.py
…binary_enc

Conflicts:
	dirty_cat/super_vectorizer.py
	dirty_cat/test/test_super_vectorizer.py
@LilianBoulard
Member Author

This feature requires a fix introduced in #303.

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 5, 2022 via email

Member

@GaelVaroquaux GaelVaroquaux left a comment

I think it would be much simpler not to introduce a new category of transformers, and simply to modify the default one for low_card_cat.

In addition, we need to clearly document this change, as it is backward incompatible and might change predictive pipelines.

dirty_cat/gap_encoder.py (resolved)
dirty_cat/super_vectorizer.py (outdated, resolved)
dirty_cat/super_vectorizer.py (outdated, resolved)
_nunique_values = {
    col: X[col].nunique() for col in categorical_columns
}
binary_cat_columns = [
    col for col in categorical_columns if _nunique_values[col] <= 2
]
Member

Should this apply only to columns that are actually binary?

Suggested change
- col for col in categorical_columns if _nunique_values[col] <= 2
+ col for col in categorical_columns if _nunique_values[col] == 2

I guess there is no point in encoding if _nunique_values[col] <= 1.

Member Author

True, but what do you think we should do with the features with cardinality==1? I guess we could drop them?

Member

Yes, dropping them would be best, to avoid confusion.

Member Author

For now, Gaël and I decided not to drop them (given that we want to release very soon), but that's something we can discuss in the future.
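
To make the cardinality discussion in this thread concrete, here is a small sketch, assuming a toy pandas DataFrame with hypothetical column names. It shows how columns can be grouped by number of unique values, and what scikit-learn's OneHotEncoder(drop="if_binary") produces for a cardinality-1 column that is not dropped. It is an illustration only, not the SuperVectorizer implementation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame (hypothetical column names) with cardinalities 1, 2 and 3.
X = pd.DataFrame(
    {
        "constant": ["a", "a", "a", "a"],
        "binary": ["x", "y", "x", "y"],
        "ternary": ["p", "q", "r", "p"],
    }
)
categorical_columns = list(X.columns)

# Count unique values per column, as in the snippet under review.
nunique_values = {col: X[col].nunique() for col in categorical_columns}

# "== 2" selects strictly binary columns; "<= 1" isolates constant columns.
binary_columns = [c for c in categorical_columns if nunique_values[c] == 2]
constant_columns = [c for c in categorical_columns if nunique_values[c] <= 1]

# A constant column that is kept is left intact by drop="if_binary":
# it becomes a single column filled with ones.
encoded_constant = (
    OneHotEncoder(drop="if_binary").fit_transform(X[["constant"]]).toarray()
)
```

The constant column carries no information once encoded, which is why dropping such columns was suggested above.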

@LilianBoulard
Member Author

LilianBoulard commented Sep 9, 2022

The reason I made an independent "category" for the binary features is that, while OneHotEncoder has a parameter to avoid producing two columns for a binary feature, other encoders might not.
For example, if we wanted to use SimilarityEncoder for the low_card_cat transformer, the dimensionality would explode for features with cardinality==2, as it doesn't have the same kind of mechanism OneHotEncoder does (AFAIK).

@GaelVaroquaux
Member

Hum, maybe we should actually use drop="first", which would solve all problems in a single step.

@amy12xx, what do you think?

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 9, 2022 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 9, 2022 via email

@GaelVaroquaux
Member

> Hum, maybe we should actually use drop="first", which would solve all problems in a single step.

After many discussions, including with the scikit-learn team, the consensus is that drop="first" raises real interpretability issues and that we should rather go for drop="if_binary".
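
As a rough illustration of the difference between the two options, the sketch below encodes a single three-category feature (toy data, assumed for the example). With drop="first", the dropped category is represented by an all-zero row, which is one commonly cited reason such encodings are harder to read; this is not necessarily the exact argument made in the discussions mentioned above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One feature with three categories (hypothetical values).
X = np.array([["blue"], ["green"], ["red"]], dtype=object)

# drop="first" removes the first category of every feature,
# so "blue" is encoded implicitly as a row of zeros.
encoded_first = OneHotEncoder(drop="first").fit_transform(X).toarray()

# drop="if_binary" only drops a column for two-category features,
# so this three-category feature keeps one column per category.
encoded_if_binary = OneHotEncoder(drop="if_binary").fit_transform(X).toarray()
```

With drop="if_binary", every category of a non-binary feature still has its own named output column.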

Member

@GaelVaroquaux GaelVaroquaux left a comment

Looks great.

One tiny additional suggestion and we're ready to merge

CHANGES.rst Outdated Show resolved Hide resolved
@GaelVaroquaux
Member

I committed the suggestion myself to be able to merge ASAP.

Will merge once the tests pass

Member Author

@LilianBoulard LilianBoulard left a comment

Okay, I had just one tiny change to the fit_transform docstring (I committed but hadn't finished writing it :p). LGTM too!

@GaelVaroquaux GaelVaroquaux merged commit abb7911 into skrub-data:master Sep 9, 2022
@GaelVaroquaux
Member

Merged! Yey!

@LilianBoulard LilianBoulard deleted the binary_enc branch December 15, 2022 13:50
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

One hot encoding of Supervectorizer for columns with 2 values
3 participants