Use handle_unknown=ignore in SuperVectorizer #473
Hmm it seems that we can't both specify drop='if_binary' and handle_unknown="ignore", because it would create some ambiguity about what a zero vector means. I would say it's better to keep what we have now, that is drop='if_binary' and handle_unknown="error", but I'd like to hear other opinions.
I would rather have a pipeline that runs and does not raise error, from a practical point of view.
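To make the zero-vector ambiguity concrete, here is a minimal pure-Python sketch. The `one_hot` function is a hypothetical toy written for illustration, not scikit-learn's implementation; it assumes that unknown values are encoded as all zeros and that the first category of a binary column is dropped.

```python
def one_hot(value, categories, drop_if_binary=True):
    """Toy one-hot encoder for a single column (illustration only).

    Unknown values are "ignored", i.e. encoded as all zeros.
    """
    kept = list(categories)
    if drop_if_binary and len(kept) == 2:
        # Binary column: drop the first category, keep a single output column
        kept = kept[1:]
    return [1 if value == cat else 0 for cat in kept]


categories = ["no", "yes"]
print(one_hot("yes", categories))    # [1]
print(one_hot("no", categories))     # [0], the dropped category
print(one_hot("maybe", categories))  # [0], unknown value, same encoding as "no"
```

With three or more categories nothing is dropped, so an all-zero vector can only mean "unknown"; the clash only shows up for binary columns.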
Thanks, just a typo to correct I think.
CHANGES.rst
Outdated
@@ -38,6 +38,9 @@ Minor changes
which can be used to specify where to save and load from datasets.
:pr:`432` by :user:`Lilian Boulard <LilianBoulard>`

* The :class:`SuperVectorizer`'s default `OneHotEncoder` for low cardinality categorical variables now defaults to `handle_unknown="ignore"` and `drop=None` instead of `handle_unknown="error"` and `drop=if_binary`. This means
that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error, and that no category will be dropped. :pr:`473` by :user:`Leo Grinsztajn <LeoGrin>`
Why did we get rid of "drop=if_binary"? It had been added as a specific request to avoid having too big encoded matrices.
I understand that the semantics of the two options somewhat clash (maybe something to discuss with the scikit-learn team) and bad things will happen if we have binary and unknown in the same column, but it is quite unlikely.
Potential solutions to keep using drop="if_binary":
- keep handle_unknown="ignore" for non-binary columns, but use handle_unknown="error" for binary columns (@GaelVaroquaux's preferred solution, I think)
- create a subclass of OneHotEncoder which maps new categories encountered during transform to "infrequent" (quite similar to handle_unknown="infrequent_if_exist", but without needing sklearn >= 1.1).
Tagging some scikit-learn people to have other opinions: @glemaitre @Vincent-Maladiere
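The first option in the list above could be sketched roughly as follows. The helper name `encoder_kwargs_per_column` and its input format are assumptions made for illustration; it only decides which `OneHotEncoder` keyword arguments each column would get and does not build any encoder itself.

```python
def encoder_kwargs_per_column(columns):
    """Hypothetical helper: choose OneHotEncoder keyword arguments per column.

    `columns` maps a column name to the list of values seen during fit.
    """
    kwargs = {}
    for name, values in columns.items():
        n_categories = len(set(values))
        if n_categories == 2:
            # Binary column: drop one category and raise on unknown values,
            # so an all-zero row is never ambiguous
            kwargs[name] = {"drop": "if_binary", "handle_unknown": "error"}
        else:
            # Non-binary column: keep every category, encode unknowns as zeros
            kwargs[name] = {"drop": None, "handle_unknown": "ignore"}
    return kwargs


train = {
    "smoker": ["yes", "no", "no", "yes"],
    "city": ["Paris", "Lyon", "Nice", "Paris"],
}
print(encoder_kwargs_per_column(train))
```

In practice the two groups of columns would probably be routed to two separately configured encoders, which is the extra complexity this option brings.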
I think this is related to: scikit-learn/scikit-learn#18072 and resolved in scikit-learn/scikit-learn#19041.
I think that the argument taken was the following: scikit-learn/scikit-learn#18072 (comment)
In the end, we are lenient, and we are not capable of properly inverting the array.
I think that being lenient is fine for our purposes (the case is going to be very infrequent) and thus we should just turn on both options.
Thanks @glemaitre !
Thanks @glemaitre ! I also think that that's the best solution. The only issue is that sklearn is lenient from 0.24.2, so we get an error for earlier versions (from 0.23.0). I think we can just catch it.
What I finally chose is to just turn on both options, but to use handle_unknown="error" if the sklearn version is < 0.24.2, while issuing a warning. I know that we excluded using infrequent_if_exist because changing behaviour for different sklearn versions might be confusing for users, but I figured it would be okay in this case, as it only concerns sklearn versions between 0.23 and 0.24.2.
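A minimal sketch of the version-dependent fallback described above. The helper `choose_handle_unknown`, its warning text, and the simple version parsing are assumptions made for illustration, not the code merged in this PR.

```python
import re
import warnings


def choose_handle_unknown(sklearn_version):
    """Hypothetical helper: pick handle_unknown for a scikit-learn version string."""
    # Keep the leading numeric components, e.g. "0.24.2" -> (0, 24, 2)
    parts = tuple(int(p) for p in re.findall(r"\d+", sklearn_version)[:3])
    if parts < (0, 24, 2):
        warnings.warn(
            "scikit-learn < 0.24.2 cannot combine drop='if_binary' with "
            "handle_unknown='ignore'; falling back to handle_unknown='error'.",
            UserWarning,
        )
        return "error"
    return "ignore"


print(choose_handle_unknown("1.1.0"))   # ignore
print(choose_handle_unknown("0.23.0"))  # error (and emits a UserWarning)
```

The regex-based parsing is deliberately naive; a real implementation would more likely compare versions with `packaging.version.parse`, which handles pre-release suffixes properly.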
Actually I've deleted this and converted every categorical column to object dtype,
Often, in pandas things are faster (and consume less memory) with categorical dtype, rather than object dtype. It is in general a good idea to stick to categorical dtype.
What I finally chose is to just turn on both options, but to use handle_unknown="error" if sklearn version is < 0.24.2, while issuing a warning.
Might lead to hacky code to issue the warning only for the right case.
Change default `low_card_cat_transformer` in SuperVectorizer to use handle_unknown="ignore"
Pandas `category` dtype conversion converts new categories to nans, so we now update the list of categories before converting.
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
This avoids dealing with the categories attached to the dtype.
And use handle_unknown="error" for sklearn < 0.24.2.
…ectorizer" This reverts commit 34ed05f.
e726d92 to ab27660
Ah ok, I've reverted the change to keep categorical columns. If the speed gain is significant, would it make sense to convert columns with string or object dtypes to categorical? (maybe if the number of categories is not too high)
Right now I issue a warning in |
and change the warning message to be more informative.
A couple of small changes are necessary, but the overall strategy is the right one. Thanks!!!
I'd like a bit more name change :(
Excellent! Thank you!!
Merging!
Change default low_card_cat_transformer in SuperVectorizer to use handle_unknown="ignore". Solves #455