Add 2D matrices support for GapEncoder #185

Merged 11 commits on Jul 19, 2021

LilianBoulard
Member

Fixes #165

Member

@GaelVaroquaux GaelVaroquaux left a comment

Thanks for the super fast implementation of this feature. I did a first full review (sorry, not as thorough as I would like, but it's late). I added some inline comments, and here are a few more general ones:

We are changing the public interface of GapEncoder in a backward-incompatible way: previously it accepted a 1D array, and now it will raise an error. While I think that this is absolutely the right thing to do, we need to indicate it very clearly in our changelog, for instance by writing in bold at the beginning "Backward incompatible change to GapEncoder" and describing the change.
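
To make the incompatibility concrete, here is a minimal sketch of the kind of check involved and of how a caller could adapt. It is not the dirty_cat implementation: `require_2d` and `as_single_column` are hypothetical helpers, plain Python lists stand in for arrays, and the error messages are made up.

```python
def require_2d(X):
    """Raise ValueError unless X is a non-empty list of equal-length rows.

    Hypothetical stand-in for the validation an encoder might run;
    the real GapEncoder check may differ.
    """
    rows = list(X)
    # A row must itself be a sequence (and not a bare string) to count as 2D.
    if not rows or any(
        isinstance(r, str) or not hasattr(r, "__len__") for r in rows
    ):
        raise ValueError("Expected 2D input (rows of columns), got 1D input.")
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("All rows must have the same number of columns.")
    return rows


def as_single_column(values):
    """Turn a 1D sequence of strings into a 2D, one-column table."""
    return [[v] for v in values]


old_style = ["london", "paris", "londn"]  # 1D: accepted before, rejected now
new_style = as_single_column(old_style)   # 2D: three rows, one column

require_2d(new_style)  # passes silently
try:
    require_2d(old_style)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Callers that used to pass a single column would only need the reshaping step; with NumPy arrays the equivalent would be a reshape to one column.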

I think that we need to test all the public methods with 2D inputs: score, get_feature_names, transform. Just to make sure that we have no hidden problems.
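
A smoke test along these lines could look like the sketch below. `ToyEncoder` is a made-up stand-in that only borrows the method names mentioned above (it one-hot encodes each column and is not GapEncoder), and `smoke_check_2d` is a hypothetical helper; in the real test suite the stand-in would be replaced by the actual encoder.

```python
class ToyEncoder:
    """Stand-in exposing fit, transform, score and get_feature_names.

    It one-hot encodes each column independently; it is NOT GapEncoder.
    """

    def fit(self, X):
        n_cols = len(X[0])
        # One sorted list of observed categories per column.
        self.categories_ = [
            sorted({row[j] for row in X}) for j in range(n_cols)
        ]
        return self

    def transform(self, X):
        return [
            [
                float(row[j] == cat)
                for j, cats in enumerate(self.categories_)
                for cat in cats
            ]
            for row in X
        ]

    def score(self, X):
        # Fraction of cells whose value was seen during fit.
        cells = [(j, v) for row in X for j, v in enumerate(row)]
        return sum(v in self.categories_[j] for j, v in cells) / len(cells)

    def get_feature_names(self):
        return [
            f"col{j}: {cat}"
            for j, cats in enumerate(self.categories_)
            for cat in cats
        ]


def smoke_check_2d(encoder, X):
    """Call every public method on 2D input and check the outputs line up."""
    encoder.fit(X)
    Xt = encoder.transform(X)
    names = encoder.get_feature_names()
    assert len(Xt) == len(X), "one output row per input row"
    assert all(len(row) == len(names) for row in Xt), "one name per feature"
    assert isinstance(encoder.score(X), float)
    return Xt, names


X = [["london", "engineer"], ["paris", "teacher"], ["london", "teacher"]]
Xt, names = smoke_check_2d(ToyEncoder(), X)
```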

Finally, I would suggest changing the default of SuperVectorizer to use the GapEncoder (which is the motivation for this change). It will stress-test this implementation heavily, and will help us convince ourselves that it is functional.

@alexis-cvetkov
Contributor

I have opened a pull request against your branch, @LilianBoulard, with many updates to this PR:

  • docstrings
  • made the tests more exhaustive, and replaced all 1D test inputs with 2D inputs with several columns
  • implemented the score method
  • updated the get_feature_names method to add column names, given manually or generated automatically, as prefixes before the labels
  • changed the SuperVectorizer to use the GapEncoder as its default encoder (instead of the SimilarityEncoder)
  • slightly changed the tests for the SuperVectorizer to make them work with the GapEncoder
  • checked that example 03 with the SuperVectorizer works well (we reach similar conclusions from the feature importances)
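
As a rough illustration of the column-name prefixing mentioned in the list above, the sketch below flattens per-column labels into a single list of feature names. It is not the code from the PR: `prefixed_feature_names` is a hypothetical helper, and both the `"name: label"` format and the auto-generated `col0`, `col1`, ... names are assumptions.

```python
def prefixed_feature_names(labels_per_column, col_names=None):
    """Build one flat list of feature names from per-column labels.

    labels_per_column: one list of labels per input column.
    col_names: explicit column names, or None to generate "col0", "col1", ...
    (the auto-generated naming scheme is an assumption of this sketch).
    """
    if col_names is None:
        col_names = [f"col{i}" for i in range(len(labels_per_column))]
    if len(col_names) != len(labels_per_column):
        raise ValueError("Need exactly one name per column.")
    # Prefix every label with the name of the column it came from.
    return [
        f"{name}: {label}"
        for name, labels in zip(col_names, labels_per_column)
        for label in labels
    ]


labels = [["london, paris", "berlin"], ["engineer", "teacher"]]
print(prefixed_feature_names(labels, col_names=["city", "job"]))
print(prefixed_feature_names(labels))
```

Prefixing keeps names unique when two columns produce the same label, which matters once several columns are encoded in one call.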

@GaelVaroquaux
Member

Absolutely awesome. @LilianBoulard, when you have time, can you merge @alexis-cvetkov's branch into yours?

We'll need the note in CHANGES.rst (unless Alexis's branch already includes it), and then we should be good to go.

Contributor

@alexis-cvetkov alexis-cvetkov left a comment

Looks good to me!

@GaelVaroquaux
Member

OK, I committed a few changes directly in the PR and will merge once CI has run.

Thanks a lot!!

@GaelVaroquaux
Member

Merging! Thanks!!

@GaelVaroquaux GaelVaroquaux merged commit 89ef822 into skrub-data:master Jul 19, 2021
@LilianBoulard LilianBoulard deleted the fix_gap_encoder branch October 12, 2021 14:35

Successfully merging this pull request may close these issues.

GapEncoder only supports 1D matrices