-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binary encoding for features with cardinality <= 2 #300
Conversation
sklearn's One option would be to implement a |
I think that using the option 'drop="if_binary"' of the OneHotEncoder would be a cleaner way of doing things. I realize that it is available only since sklearn 0.23, but it makes code more future-proof. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few suggestions for the docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the option 'drop="if_binary"' of the OneHotEncoder would be a cleaner way of doing things.
It is available only since sklearn 0.23, but it makes code more future-proof.
…binary_enc � Conflicts: � dirty_cat/super_vectorizer.py � dirty_cat/test/test_super_vectorizer.py
…binary_enc � Conflicts: � dirty_cat/super_vectorizer.py � dirty_cat/test/test_super_vectorizer.py
This feature requires a fix introduced in #303. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it would be much simpler not to introduce a new category of transformers but simply modify the default one for low_card_cat.
In addition, we need to document clearly this change as it is a backward incompatible change that might change predictive pipelines.
dirty_cat/super_vectorizer.py
Outdated
col: X[col].nunique() for col in categorical_columns | ||
} | ||
binary_cat_columns = [ | ||
col for col in categorical_columns if _nunique_values[col] <= 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be only if there are binary categorization?
col for col in categorical_columns if _nunique_values[col] <= 2 | |
col for col in categorical_columns if _nunique_values[col] == 2 |
I guess there is no point in encoding if _nunique_values[col] <= 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but what do you think we should do with the features with cardinality==1 ? I guess maybe drop them ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes drop would be the best not to create confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, for now, we decided with Gaël to not drop them (given we want to release very soon), but that's something we can discuss in the future.
So the reason I did an independent "category" for the binary features is because while OneHot has a parameter to avoid this, other encoders might not. |
Hum, maybe we should actually use drop="first", which we solve all problems in a single step. @amy12xx, what do you think? |
Hum, maybe we should actually use drop="first", which we solve all problems in a single step.
|
So the reason I did an independent "category" for the binary features is because while OneHot has a parameter to avoid this, other encoders might not.
But the original issue only applies to the OneHotEncoder
|
After many discussions, including with the scikit-learn team, the consensus is that "drop='first'" raises real interpretability issues and that we should rather go for if_binary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
One tiny additional suggestion and we're ready to merge
I committed the suggestion myself to be able to merge ASAP. Will merge once the tests pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I had just one tiny change to the fit_transform
docstring (I committed but hadn't finished writing it :p). LGTM too!
Merged! Yey! |
Fixes #246
Also refactored some tests for the
SuperVectorizer
to make the code cleaner.