Binary encoding for features with cardinality <= 2 #300

LilianBoulard · 2022-08-08T12:14:01Z

Fixes #246
Also refactored some tests for the SuperVectorizer to make the code cleaner.

…binary_enc

LilianBoulard · 2022-08-08T12:33:21Z

sklearn's OrdinalEncoder does not include a get_feature_names method, and only implemented get_feature_names_out in version 1.1.
How should we deal with this issue?

One option would be to implement a BinaryEncoder that would inherit from the OrdinalEncoder, and only implement what we need.
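A minimal sketch of what such a subclass could look like, purely for illustration. The class body, the `input_columns_` attribute and the example column are assumptions, not code from this PR; the only sklearn behaviour relied on is that OrdinalEncoder produces one output column per input column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder


class BinaryEncoder(OrdinalEncoder):
    """Hypothetical encoder: OrdinalEncoder plus a get_feature_names method."""

    def fit(self, X, y=None):
        # Remember the input column names so they can be reported later.
        X = pd.DataFrame(X)
        self.input_columns_ = [str(col) for col in X.columns]
        return super().fit(X, y)

    def get_feature_names(self, input_features=None):
        # An ordinal encoding yields one output column per input column,
        # so the output names are simply the input names.
        if input_features is not None:
            return list(input_features)
        return list(self.input_columns_)


df = pd.DataFrame({"smoker": ["no", "yes", "no"]})
encoder = BinaryEncoder().fit(df)
encoded = encoder.transform(df)      # categories are sorted: "no" -> 0, "yes" -> 1
names = encoder.get_feature_names()  # ["smoker"]
```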

GaelVaroquaux · 2022-08-08T19:07:43Z

I think that using the option 'drop="if_binary"' of the OneHotEncoder would be a cleaner way of doing things. I realize that it is available only since sklearn 0.23, but it makes code more future-proof.
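For reference, a small sketch of how this option behaves on a toy frame (the column names and values are made up for the example). It needs scikit-learn 0.23 or later, as noted above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: one binary feature and one feature with three categories.
X = pd.DataFrame({
    "smoker": ["no", "yes", "no", "yes"],
    "city": ["Paris", "Rome", "Oslo", "Paris"],
})

encoder = OneHotEncoder(drop="if_binary")
# The default output is a sparse matrix; convert it for inspection.
encoded = encoder.fit_transform(X).toarray()

# "smoker" is encoded as a single 0/1 column (its first category is dropped),
# while "city" keeps one column per category: 1 + 3 = 4 columns in total.
print(encoded.shape)
# drop_idx_ holds the dropped category index per feature, or None if none is dropped.
print(encoder.drop_idx_)
```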

jovan-stojanovic

A few suggestions for the docs

dirty_cat/super_vectorizer.py

GaelVaroquaux

Using the option 'drop="if_binary"' of the OneHotEncoder would be a cleaner way of doing things.

It is available only since sklearn 0.23, but it makes code more future-proof.

…binary_enc Conflicts: dirty_cat/super_vectorizer.py, dirty_cat/test/test_super_vectorizer.py

LilianBoulard · 2022-09-05T12:23:13Z

This feature requires a fix introduced in #303.

GaelVaroquaux · 2022-09-05T12:28:34Z

> This feature requires a fix introduced in #303.

#303 is currently not mergeable because of failing tests and conflicts.

GaelVaroquaux

I think that it would be much simpler not to introduce a new category of transformers, but simply to modify the default one for low_card_cat.

In addition, we need to clearly document this change, as it is backward-incompatible and might change predictive pipelines.
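To make the backward-incompatibility concrete, here is a hedged sketch using plain scikit-learn encoders rather than the SuperVectorizer itself (the toy data is an assumption). It only shows that the number of output columns changes when the default encoder gains drop="if_binary".

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],      # cardinality 2
    "grade": ["A", "B", "C", "A"],    # cardinality 3
})

# Plain one-hot encoding: one column per category (2 + 3).
width_before = OneHotEncoder().fit_transform(X).shape[1]
# With drop="if_binary": the binary feature collapses to one column (1 + 3).
width_after = OneHotEncoder(drop="if_binary").fit_transform(X).shape[1]

print(width_before, width_after)
```

A downstream estimator fitted on the wider layout would not accept the narrower one, which is why the change needs a changelog entry.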

dirty_cat/gap_encoder.py

dirty_cat/super_vectorizer.py

jovan-stojanovic · 2022-09-09T08:25:57Z

dirty_cat/super_vectorizer.py

+            col: X[col].nunique() for col in categorical_columns
+        }
+        binary_cat_columns = [
+            col for col in categorical_columns if _nunique_values[col] <= 2


Should this apply only to columns that are actually binary?

Suggested change

- col for col in categorical_columns if _nunique_values[col] <= 2
+ col for col in categorical_columns if _nunique_values[col] == 2

I guess there is no point in encoding a column if _nunique_values[col] <= 1.

True, but what do you think we should do with the features with cardinality == 1? I guess maybe drop them?

Yes, dropping them would be best, to avoid creating confusion.

So, for now, we decided with Gaël not to drop them (given that we want to release very soon), but that's something we can discuss in the future.
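The difference between the two conditions discussed in this thread can be sketched with pandas alone. The frame and the variable names below are illustrative assumptions; only the `nunique` comparison mirrors the diff.

```python
import pandas as pd

X = pd.DataFrame({
    "constant": ["a", "a", "a", "a"],       # cardinality 1
    "binary": ["yes", "no", "yes", "no"],   # cardinality 2
    "ternary": ["x", "y", "z", "x"],        # cardinality 3
})
categorical_columns = list(X.columns)

# Series.nunique() ignores missing values by default.
nunique_values = {col: X[col].nunique() for col in categorical_columns}

# "<= 2" also selects constant columns, "== 2" selects strictly binary ones.
at_most_two = [col for col in categorical_columns if nunique_values[col] <= 2]
exactly_two = [col for col in categorical_columns if nunique_values[col] == 2]
constant_columns = [col for col in categorical_columns if nunique_values[col] == 1]

print(at_most_two)       # ['constant', 'binary']
print(exactly_two)       # ['binary']
print(constant_columns)  # ['constant']
```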

LilianBoulard · 2022-09-09T08:29:30Z

So the reason I made an independent "category" for the binary features is that, while OneHot has a parameter to avoid this, other encoders might not.
For example, if we wanted to use the SimilarityEncoder for the low_card_cat transformer, then the dimensionality would explode for features with cardinality == 2, as it doesn't have the same kind of mechanism OneHot does (AFAIK).
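A rough sketch of the "independent category" idea, written with a plain scikit-learn ColumnTransformer rather than the SuperVectorizer. The transformer names, the toy data and the choice of OrdinalEncoder for the binary columns are assumptions for illustration, not the PR's implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({
    "smoker": ["no", "yes", "no", "yes"],
    "city": ["Paris", "Rome", "Oslo", "Paris"],
})

nunique_values = {col: X[col].nunique() for col in X.columns}
binary_columns = [col for col in X.columns if nunique_values[col] == 2]
low_card_columns = [col for col in X.columns if nunique_values[col] > 2]

# Binary columns get their own transformer, so whatever encoder is used for
# the low-cardinality columns never sees them.
transformer = ColumnTransformer(
    [
        ("binary", OrdinalEncoder(), binary_columns),
        ("low_card", OneHotEncoder(), low_card_columns),
    ],
    sparse_threshold=0,  # always return a dense array
)
encoded = transformer.fit_transform(X)
print(encoded.shape)  # one column for "smoker", three for "city"
```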

GaelVaroquaux · 2022-09-09T09:59:54Z

Hum, maybe we should actually use drop="first", which would solve all problems in a single step.

@amy12xx, what do you think?

GaelVaroquaux · 2022-09-09T10:05:05Z

Hum, maybe we should actually use drop="first", which would solve all problems in a single step.

GaelVaroquaux · 2022-09-09T10:10:09Z

> So the reason I did an independent "category" for the binary features is because while OneHot has a parameter to avoid this, other encoders might not.

But the original issue only applies to the OneHotEncoder.

GaelVaroquaux · 2022-09-09T11:35:12Z

> Hum, maybe we should actually use drop="first", which would solve all problems in a single step.

After many discussions, including with the scikit-learn team, the consensus is that drop="first" raises real interpretability issues and that we should rather go for if_binary.
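A small comparison of the two options on a non-binary feature, as one plausible illustration of the interpretability concern (this reading is an assumption, not necessarily the exact argument made in those discussions). The toy data is made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

first = OneHotEncoder(drop="first")
first_encoded = first.fit_transform(X).toarray()

if_binary = OneHotEncoder(drop="if_binary")
if_binary_encoded = if_binary.fit_transform(X).toarray()

# drop="first" removes the first sorted category ("blue"), so the "blue" row
# is encoded as all zeros and only exists as an implicit reference level.
print(first_encoded[2])
# drop="if_binary" leaves a three-category feature untouched:
# every category keeps its own explicit column.
print(if_binary_encoded.shape)
```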

GaelVaroquaux

Looks great.

One tiny additional suggestion and we're ready to merge

CHANGES.rst

GaelVaroquaux · 2022-09-09T12:09:52Z

I committed the suggestion myself to be able to merge ASAP.

Will merge once the tests pass

LilianBoulard

Okay, I had just one tiny change to the fit_transform docstring (I committed but hadn't finished writing it :p). LGTM too!

GaelVaroquaux · 2022-09-09T12:22:55Z

Merged! Yey!

LilianBoulard added 4 commits August 5, 2022 18:17

Add binary encoding to the SuperVectorizer

05ef600

Merge branch 'master' of https://github.com/dirty-cat/dirty_cat into …

496cd62

…binary_enc

[WIP] Adapt the tests

0b8a185

Remove debug

64f09f5

LilianBoulard added the enhancement New feature or request label Aug 8, 2022

LilianBoulard added this to the 0.3.0 release milestone Aug 8, 2022

LilianBoulard self-assigned this Aug 8, 2022

Replace OrdinalEncoder with OneHotEncoder

edf722e

jovan-stojanovic reviewed Aug 17, 2022

dirty_cat/super_vectorizer.py (outdated, resolved)

dirty_cat/super_vectorizer.py (outdated, resolved)

dirty_cat/super_vectorizer.py (outdated, resolved)

dirty_cat/super_vectorizer.py (outdated, resolved)

GaelVaroquaux requested changes Sep 2, 2022

LilianBoulard added 2 commits September 5, 2022 13:51

Merge branch 'master' of https://github.com/dirty-cat/dirty_cat into …

56330e9

…binary_enc Conflicts: dirty_cat/super_vectorizer.py, dirty_cat/test/test_super_vectorizer.py

Merge branch 'master' of https://github.com/dirty-cat/dirty_cat into …

42fbb5a

…binary_enc Conflicts: dirty_cat/super_vectorizer.py, dirty_cat/test/test_super_vectorizer.py

LilianBoulard added 4 commits September 8, 2022 14:53

Merge master

8d12d66

Update tests, standardize implementation

56f4895

[skip ci] Update docstring with suggestions

03cb54b

Bump to sklearn 0.23

ca3095f

jovan-stojanovic mentioned this pull request Sep 8, 2022

MAINT Preparing for beta release 0.3.0 #333

Merged

LilianBoulard requested review from GaelVaroquaux and jovan-stojanovic September 8, 2022 15:21

[skip ci] Also update setup

6a16301

GaelVaroquaux requested changes Sep 8, 2022

dirty_cat/gap_encoder.py (resolved)

dirty_cat/super_vectorizer.py (outdated, resolved)

dirty_cat/super_vectorizer.py (outdated, resolved)

jovan-stojanovic reviewed Sep 9, 2022

LilianBoulard added 2 commits September 9, 2022 14:01

Simplify implementation

82072e1

Updat Changes

4840e13

GaelVaroquaux reviewed Sep 9, 2022

CHANGES.rst (outdated, resolved)

Update CHANGES.rst

1ce364f

GaelVaroquaux approved these changes Sep 9, 2022

LilianBoulard added 2 commits September 9, 2022 14:14

Update docstring

5930a1b

Merge remote-tracking branch 'fork/binary_enc' into binary_enc

4602bf2

LilianBoulard commented Sep 9, 2022

jovan-stojanovic approved these changes Sep 9, 2022

GaelVaroquaux merged commit abb7911 into skrub-data:master Sep 9, 2022

LilianBoulard deleted the binary_enc branch December 15, 2022 13:50
