Skip to content

Use handle_unknown=ignore in SuperVectorizer #473

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Feb 16, 2023

Conversation

LeoGrin
Copy link
Contributor

@LeoGrin LeoGrin commented Jan 23, 2023

Change default low_card_cat_transformer in SuperVectorizer to use handle_unknown="ignore".

Solves #455

@LeoGrin
Copy link
Contributor Author

LeoGrin commented Jan 23, 2023

Hmm it seems that we can't both specify drop='if_binary, and handle_unknown="ignore", because it would create some ambiguity on what a zero vector means. I would say it's better to keep what we have now, that is drop='if_binary and handle_unknown="error", but I'd like to hear other opinions.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jan 23, 2023 via email

Copy link
Member

@jovan-stojanovic jovan-stojanovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just a typo to correct I think.

CHANGES.rst Outdated
@@ -38,6 +38,9 @@ Minor changes
which can be used to specify where to save and load from datasets.
:pr:`432` by :user:`Lilian Boulard <LilianBoulard>`

* The :class:`SuperVectorizer`'s default `OneHotEncoder` for low cardinality categorical variables now defaults to `handle_unknown="ignore"` and `drop=None` instead of `handle_unknown="error"` and `drop=if_binary`. This means
that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error, and that no category will be dropped. :pr:`473` by :user:`Leo Grinsztajn <LeoGrin>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we get rid of "drop=if_binary"? It had been added as a specific request to avoid having too big encoded matrices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the semantics of the two options somewhat clash (maybe something to discuss with the scikit-learn team) and bad things will happen if we have binary and unknown in the same column, but it is quite unlikely.

Copy link
Contributor Author

@LeoGrin LeoGrin Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential solutions to keep using drop="binary":

  • keep handle_unknown="ignore" for non-binary columns, but use handle_unknown="error" for binary columns (@GaelVaroquaux prefered solution I think)
  • create a subclass of OneHotEncoder which maps new columns encountered during transform to "infrequent" (quite similar to infrequent_if_exists but without needing sklearn > 1.1).

Tagging some scikit-learn people to have other opinions: @glemaitre @Vincent-Maladiere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is related to: scikit-learn/scikit-learn#18072 and resolved in scikit-learn/scikit-learn#19041.

I think that the argument taken was the following: scikit-learn/scikit-learn#18072 (comment)

In the end, we are lenient and we are not capable of inverting properly the array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that being lenient is fine for our purposes (the case is going to be very infrequent) and thus we should just turn on both options.

Thanks @glemaitre !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @glemaitre ! I also think that that's the best solution. The only issue is that sklearn is lenient from 0.24.2, so we get an error for earlier version (from 0.23.0). I think we can just catch it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I finally chose is to just turn on both options, but to use handle_ignore="error" if sklearn version is < 0.24.2, while issuing a warning. I know that we excluded using infrequent_if_exists because changing behaviour for different sklearn version might be confusing for users, but I figured it would be okay in this case, as it only concerns sklearn versions between 0.23 and 0.24.2.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jan 30, 2023 via email

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jan 30, 2023 via email

@jovan-stojanovic jovan-stojanovic added this to the 0.4.0 release milestone Feb 1, 2023
LeoGrin and others added 10 commits February 13, 2023 11:18
Change default `low_card_cat_transformer` in SuperVectorizer to use handle_unknown="ignore"
Pandas `category` dtype conversion converts new categories to nans, so we now update the list of categories before converting.
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
This avoids dealing with the categories attached to the dtype.
And use handle_unknown="error" for sklearn < 0.24.2.
@LeoGrin
Copy link
Contributor Author

LeoGrin commented Feb 13, 2023

Often, in pandas things are faster (and consume less memory) with categorical dtype, rather than object dtype. It is in general a good idea to stick to categorical dtype.

Ah ok, I've reverted the change to keep categorical columns. If the speed gain is significant, would it make sense to convert columns with string or object dtypes to categorical? (maybe if the number of category is not too high)

@LeoGrin
Copy link
Contributor Author

LeoGrin commented Feb 13, 2023

What I finally chose is to just turn on both options, but to use handle_ignore= "error" if sklearn version is < 0.24.2, while issuing a warning.
Might lead to hacky code to issue the warning only for the right case.

Right now I issue a warning in _clone_transformers (called at the beginning of fit) if low_card_cat_encoder is None (default). So there may be a warning a bit more often what we could have (if we catched the OneHotEncoder errors for earlier sklearn versions), but on the other hand we avoid hacky code.

and change the warning message to be more informative.
Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of small changes are necessary, but the overall strategy is the right one. Thanks!!!

@LeoGrin LeoGrin requested review from GaelVaroquaux and removed request for jovan-stojanovic February 15, 2023 10:18
Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like a bit more name change :(

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Thank you!!

@GaelVaroquaux
Copy link
Member

Merging!

@GaelVaroquaux GaelVaroquaux merged commit 9b15dd2 into skrub-data:main Feb 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants