-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA Joiner add many-to-many joins #674
FEA Joiner add many-to-many joins #674
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the nice implem!
I've added a few minor comments, otherwise it's all good :)
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments.
Do we have an exemple for the new functionality? I think that we would need a dedicated example, for instance with GPS coordinates.
It should mention the following words (for search engine discovery): "spatial join", and also "join keys across multiple columns"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This is exciting!!
A bunch of comments, but mostly cosmetics
# on imprecise and multiple-key correspondences. | ||
# This is made easy by skrub's |Joiner| transformer. | ||
# | ||
# Our final cross-validated accuracy score is 0.6. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to suggest to do a GridSearch here to set the match_score
hyper parameter of the Joiner. But unfortunately this example is 11mn long, so we can't really make it much longer.
Do you know what takes time? Maybe you can run a profiler on it. It would be very beneficial to decrease run time in general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 Germany Germany 84000000 Germany 4223 Germany Berlin | ||
2 Italy Italy 59000000 Italy 2099 Italia Rome | ||
1 Germany Germany 84000000 Germany 4223 Germany Berlin | ||
2 Italy Italy 59000000 Italy 2099 Italia Rome |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a small example of something that looks like a spatial join here, so that people find out that it is possible.
@@ -291,7 +291,7 @@ def fuzzy_join( | |||
|
|||
See Also | |||
-------- | |||
FeatureAugmenter | |||
Joiner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also:
- document in fuzzy_join that it works with multiple keys (this is non trivial)
- add a tiny example in the docstring that demos this
- add a test that asserts that it works
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
The doc building timed out :( |
It seems that the docs are still timing out: the example is taking too long. @jovan-stojanovic : Can you run it on your computer to find out what is making the example run so long. Maybe at verbosity or a bit of timer, so that we understand |
It seems that the problem is now downloading the data from figshare (which was not this morning) :( |
It seems that the problem is now downloading the data from figshare (which was not this morning) :(
OK. In a sense, this is good news, because it means that there is no fundamental problem with the example.
To make examples more robust to data download difficulties we should:
1. be writing fetching functions that cache to disk
2. implement dataset caching on the doc building, as in nilearn https://github.com/nilearn/nilearn/blob/main/.github/workflows/README.md#dataset-caching
Item 2 above is certainly outside of the scope of this PR. Do you think that you can do item 1?
|
Yes, definitely, will do it, I hope this helps for now. But item 2 would actually solve this problem in all our examples. |
I've added an issue so that we don't forget about item 2: #693 |
The multiple join feature is probably niche, now that I think about it. Most users won't need it, but every user will benefit from a simpler API :) If we'd really need it, my intuition is that we could create another transformer later, like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once the dataset downloading issue is cleared, I really like the API simplification :)
Note that we need to add the cache for example data and change the name of example 4 in follow-up PR's. |
* rename to Joiner * complete renaming * add many to many join support * update changelog * update docstring * update docstring * fix test * fix init * fix changelog * fix example * modify test * Update CHANGES.rst Co-authored-by: Lilian <lilian@boulard.fr> * Update examples/04_fuzzy_joining.py Co-authored-by: Lilian <lilian@boulard.fr> * Update skrub/_joiner.py Co-authored-by: Lilian <lilian@boulard.fr> * Update skrub/_joiner.py Co-authored-by: Lilian <lilian@boulard.fr> * Update skrub/_joiner.py Co-authored-by: Lilian <lilian@boulard.fr> * fix tests * pre-commit * fix docstring * renaming * update changelog * apply suggestions * fix index * add new example * new example * add flight example * update width * add figure * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <lilian@boulard.fr> * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <lilian@boulard.fr> * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <lilian@boulard.fr> * apply suggestions * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <lilian@boulard.fr> * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <lilian@boulard.fr> * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <lilian@boulard.fr> * simplify text * update figure display * remove figure * Remove leftover blank lines * add conclusion * fix conclusion * Update examples/07_multiple_key_join.py Co-authored-by: Lilian <lilian@boulard.fr> * remove attribute * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> * Update examples/07_multiple_key_join.py Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> * apply suggestions * drop id cols * add list flexibility to joiner * add fetching function * temp raise timeout time * simplify Joiner signature * revert name temporarily for euroscipy * update joiner to accept tuples --------- Co-authored-by: Lilian <lilian@boulard.fr> Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Fix #629 and fix #628