FEA Joiner add many-to-many joins #674

Merged
LilianBoulard merged 65 commits into skrub-data:main from joiner_many_keys on Aug 16, 2023

Conversation

jovan-stojanovic
Member

@jovan-stojanovic jovan-stojanovic commented Jul 21, 2023

Fix #629 and fix #628

Member

@LilianBoulard LilianBoulard left a comment

Thanks for the nice implem!
I've added a few minor comments, otherwise it's all good :)

CHANGES.rst (outdated, resolved)
examples/04_fuzzy_joining.py (outdated, resolved)
skrub/_joiner.py (outdated, resolved)
skrub/_joiner.py (outdated, resolved)
skrub/_joiner.py (outdated, resolved)
jovan-stojanovic and others added 10 commits July 21, 2023 14:14
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
doc/assembling.rst (outdated, resolved)
Member

@GaelVaroquaux GaelVaroquaux left a comment

A few comments.

Do we have an example for the new functionality? I think that we would need a dedicated example, for instance with GPS coordinates.

It should mention the following phrases (for search engine discovery): "spatial join" and "join keys across multiple columns".

doc/assembling.rst (outdated, resolved)
CHANGES.rst (outdated, resolved)
Member

@GaelVaroquaux GaelVaroquaux left a comment

Thanks. This is exciting!!

A bunch of comments, but they are mostly cosmetic.

examples/07_multiple_key_join.py (outdated, resolved)
examples/07_multiple_key_join.py (outdated, resolved)
examples/07_multiple_key_join.py (outdated, resolved)
examples/07_multiple_key_join.py (resolved)
examples/07_multiple_key_join.py (outdated, resolved)
# on imprecise and multiple-key correspondences.
# This is made easy by skrub's |Joiner| transformer.
#
# Our final cross-validated accuracy score is 0.6.
Member

I wanted to suggest doing a GridSearch here to set the match_score hyperparameter of the Joiner. Unfortunately, this example already takes 11 minutes to run, so we can't really make it much longer.

Do you know which step takes the time? Maybe you can run a profiler on it. Decreasing the run time would be very beneficial in general.
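
For reference, a minimal sketch of how such a profiling run could look, using only the standard library's `cProfile` and `pstats`. The functions `run_example` and `slow_step` are hypothetical stand-ins for the body of the example script; they are not part of skrub.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def slow_step():
    # Hypothetical stand-in for an expensive step (e.g. fitting an encoder).
    return sum(i * i for i in range(200_000))


def run_example():
    # Hypothetical stand-in for the whole example script.
    return slow_step()


value, report = profile_call(run_example)
print(report)
```

Sorting by cumulative time puts the outermost expensive calls first, which is usually enough to see which step of a long example dominates.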

Member Author

Yes, it's mostly the GapEncoder, especially when it works on the ID columns, which is why I dropped them (#585).
I hope this will get much better after the speedups in #680.

1 Germany Germany 84000000 Germany 4223 Germany Berlin
2 Italy Italy 59000000 Italy 2099 Italia Rome
1 Germany Germany 84000000 Germany 4223 Germany Berlin
2 Italy Italy 59000000 Italy 2099 Italia Rome
Member

Maybe a small example of something that looks like a spatial join here, so that people find out that it is possible.
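
To make the idea concrete, here is a rough pure-Python sketch of a join that looks like a spatial join: each left row is matched to the closest right row on two numeric key columns taken together. This is not skrub's implementation; `nearest_join`, its `max_dist` parameter, and the toy airport and station tables are assumptions made up for illustration.

```python
import math


def nearest_join(left, right, keys, max_dist=None):
    """Left-join each row of `left` to the closest row of `right`.

    Rows are dicts; `keys` lists the numeric columns that together
    form the join key. Rows farther than `max_dist` stay unmatched.
    """
    joined = []
    for lrow in left:
        best, best_dist = None, math.inf
        for rrow in right:
            dist = math.dist(
                [lrow[k] for k in keys], [rrow[k] for k in keys]
            )
            if dist < best_dist:
                best, best_dist = rrow, dist
        if best is None or (max_dist is not None and best_dist > max_dist):
            joined.append(dict(lrow))  # left join: keep the unmatched row
            continue
        # Suffix right-hand columns whose names clash with the left row.
        extra = {
            (f"{k}_right" if k in lrow else k): v for k, v in best.items()
        }
        joined.append({**lrow, **extra})
    return joined


# Toy data (made up): approximate coordinates in degrees.
airports = [
    {"airport": "CDG", "lat": 49.01, "lon": 2.55},
    {"airport": "FCO", "lat": 41.80, "lon": 12.25},
]
stations = [
    {"station": "Paris", "lat": 48.86, "lon": 2.35},
    {"station": "Rome", "lat": 41.90, "lon": 12.50},
]

result = nearest_join(airports, stations, keys=["lat", "lon"])
for row in result:
    print(row["airport"], "->", row["station"])
```

Euclidean distance on raw latitude and longitude is only a rough approximation; over larger areas a great-circle distance would be more appropriate.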

examples/06_ken_embeddings.py (outdated, resolved)
examples/07_multiple_key_join.py (outdated, resolved)
@@ -291,7 +291,7 @@ def fuzzy_join(

See Also
--------
FeatureAugmenter
Joiner
Member

Can we also:

  • document in fuzzy_join that it works with multiple keys (this is non trivial)
  • add a tiny example in the docstring that demos this
  • add a test that asserts that it works
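
As an illustration of what a tiny multiple-key example and its test could demonstrate, here is a standard-library sketch. It is not skrub's `fuzzy_join`: the helper `fuzzy_join_rows` is hypothetical, it scores matches with `difflib.SequenceMatcher` on the concatenated key columns, and the `match_score` threshold only borrows its name from the thread. The data is made up.

```python
from difflib import SequenceMatcher


def fuzzy_join_rows(left, right, left_on, right_on, match_score=0.0):
    """Left-join dict rows on several key columns at once.

    The key columns are concatenated and compared as one string, so a
    match has to agree on all of them together.
    """
    joined = []
    for lrow in left:
        lkey = " ".join(str(lrow[c]).lower() for c in left_on)
        best, best_score = None, -1.0
        for rrow in right:
            rkey = " ".join(str(rrow[c]).lower() for c in right_on)
            score = SequenceMatcher(None, lkey, rkey).ratio()
            if score > best_score:
                best, best_score = rrow, score
        if best is not None and best_score >= match_score:
            joined.append({**lrow, **best})
        else:
            joined.append(dict(lrow))  # keep the left row unmatched
    return joined


# "Paris" alone is ambiguous; the second key column disambiguates it.
left = [
    {"city": "Paris", "country": "France"},
    {"city": "Paris", "country": "United States"},
]
right = [
    {"town": "paris", "nation": "france", "population": 2_100_000},
    {"town": "paris", "nation": "united-states", "population": 25_000},
]

result = fuzzy_join_rows(
    left, right, left_on=["city", "country"], right_on=["town", "nation"]
)
for row in result:
    print(row["city"], row["country"], "->", row["population"])
```

A test for the real function could follow the same shape: two rows that collide on one key, and an assertion that the second key picks the right match.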

jovan-stojanovic and others added 7 commits August 4, 2023 07:59
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
@GaelVaroquaux
Member

The doc building timed out :(

@GaelVaroquaux
Member

It seems that the docs are still timing out: the example is taking too long.

@jovan-stojanovic: Can you run it on your computer to find out what is making the example run so long? Maybe add some verbosity or a timer, so that we understand where the time goes.

@jovan-stojanovic
Member Author

It seems that the problem is now the download of the data from figshare (which was not an issue this morning) :(

@GaelVaroquaux
Member

GaelVaroquaux commented Aug 4, 2023 via email

@jovan-stojanovic
Member Author

  1. be writing fetching functions that cache to disk
  2. implement dataset caching on the doc building, as in nilearn https://github.com/nilearn/nilearn/blob/main/.github/workflows/README.md#dataset-caching

Item 2 above is certainly outside of the scope of this PR. Do you think that you can do item 1?

Yes, definitely, I will do it. I hope this helps for now. But item 2 would actually solve this problem in all our examples.
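
A fetching function that caches to disk (item 1) could be sketched roughly as follows, using only the standard library. The name `fetch_cached`, the default cache directory, and the hashed file naming are assumptions for illustration; this is not the fetcher that was added to skrub.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def fetch_cached(url, cache_dir="~/example_data_cache"):
    """Download `url` once and reuse the local copy afterwards.

    Returns the path of the cached file. The cache key is a hash of
    the URL, so different URLs never collide.
    """
    cache_dir = Path(cache_dir).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    target = cache_dir / name
    if target.exists():
        return target  # cache hit: no network access
    # Download to a temporary name first, so that an interrupted
    # download never leaves a truncated file in the cache.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

With something like this, a second run of an example reads the file from disk instead of hitting the remote server again, which helps locally but not on a fresh CI machine unless the cache directory is itself persisted.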

@GaelVaroquaux
Member

But item 2 would actually solve this problem in all our examples.

I've added an issue so that we don't forget about item 2: #693

@Vincent-Maladiere
Member

Vincent-Maladiere commented Aug 7, 2023

I agree with @Vincent-Maladiere that we need to harmonize the API. It's important.

@Vincent-Maladiere : how often do you think that we will need multiple joins, in particular with the same set of tables?

The dilemma is the following: I wouldn't be surprised if, in the long run, we could implement optimized versions of multiple joins/aggregations by passing them to the underlying engine and letting it optimize the whole. On the other hand, using a list of column names for multi-key joins greatly simplifies multi-key joins, which are bound to be needed sometimes. I don't know which way to go.

Opinions?

The multiple join feature is probably niche, now that I think about it. Most users won't need it, but every user will benefit from a simpler API :)

If we really need it, my intuition is that we could create another transformer later, like MultiJoinAgg.

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

LGTM once the dataset downloading issue is cleared. I really like the API simplification :)

skrub/_joiner.py (outdated, resolved)
skrub/_joiner.py (outdated, resolved)
skrub/_joiner.py (outdated, resolved)
@jovan-stojanovic
Member Author

Note that we need to add the cache for the example data and change the name of example 4 in follow-up PRs.

@LilianBoulard LilianBoulard merged commit 2473cd8 into skrub-data:main Aug 16, 2023
24 checks passed
LeoGrin pushed a commit to LeoGrin/skrub that referenced this pull request Aug 24, 2023
* rename to Joiner

* complete renaming

* add many to many join support

* update changelog

* update docstring

* update docstring

* fix test

* fix init

* fix changelog

* fix example

* modify test

* Update CHANGES.rst

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/04_fuzzy_joining.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update skrub/_joiner.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update skrub/_joiner.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update skrub/_joiner.py

Co-authored-by: Lilian <lilian@boulard.fr>

* fix tests

* pre-commit

* fix docstring

* renaming

* update changelog

* apply suggestions

* fix index

* add new example

* new example

* add flight example

* update width

* add figure

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <lilian@boulard.fr>

* apply suggestions

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <lilian@boulard.fr>

* simplify text

* update figure display

* remove figure

* Remove leftover blank lines

* add conclusion

* fix conclusion

* Update examples/07_multiple_key_join.py

Co-authored-by: Lilian <lilian@boulard.fr>

* remove attribute

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

* Update examples/07_multiple_key_join.py

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

* apply suggestions

* drop id cols

* add list flexibility to joiner

* add fetching function

* temp raise timeout time

* simplify Joiner signature

* revert name temporarily for euroscipy

* update joiner to accept tuples

---------

Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
@jovan-stojanovic jovan-stojanovic deleted the joiner_many_keys branch September 11, 2023 13:22
Development

Successfully merging this pull request may close these issues.

Many to many (multiple key) joins in FeatureAugmenter
Rename FeatureAugmenter
4 participants