Joiner store state in `fit` + add other distance scaling strategies #821

jeromedockes · 2023-11-13T14:20:53Z

fixes #762, fixes #760, fixes #758

jeromedockes · 2023-11-14T09:19:44Z

ATM, grid-searching the threshold for the distance between matched rows is very inefficient: we redo the full vectorization, nearest-neighbor search and joining just to apply a different threshold to the same column. Some options could be:

- add caching to the Joiner itself
- separate the joining step from the step that fills rows that are too far apart with NaNs, to take advantage of the Pipeline's existing caching mechanism
- investigate further in which situations the thresholding is necessary. Maybe a good learner with enough data, if provided with the distance (or score) column, can learn to disregard the untrustworthy features when the distance is too large
- ... ?
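The second option can be sketched in plain Python, assuming the expensive step returns, for each main row, its best match and a distance. The helper names (`match_rows`, `apply_threshold`) and the character-trigram Jaccard distance are illustrative assumptions, not skrub's implementation; the point is only that the matching runs once and each candidate threshold is then a cheap pass over the stored distances.

```python
from functools import lru_cache


def trigrams(text):
    """Return the set of character 3-grams of a padded, lowercased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| on trigram sets; 0 means identical sets."""
    ta, tb = trigrams(a), trigrams(b)
    return 1.0 - len(ta & tb) / len(ta | tb)


@lru_cache(maxsize=None)
def match_rows(main_keys, aux_keys):
    """Expensive step: nearest aux key and its distance for each main key.

    Hypothetical helper. Arguments are tuples so the result can be cached.
    """
    matches = []
    for key in main_keys:
        best = min(aux_keys, key=lambda aux: jaccard_distance(key, aux))
        matches.append((key, best, jaccard_distance(key, best)))
    return tuple(matches)


def apply_threshold(matches, max_dist):
    """Cheap step: replace matches farther than max_dist with None."""
    return [
        (key, aux if dist <= max_dist else None, dist)
        for key, aux, dist in matches
    ]


main = ("Egypt", "Lesotho*", "Atlantis")
aux = ("Egypt, Arab Rep.", "Lesotho", "France")

matches = match_rows(main, aux)  # computed once
for max_dist in (0.5, 0.8, 1.0):
    # Only the thresholding is repeated for each candidate value.
    print(max_dist, apply_threshold(matches, max_dist))
```

With the two steps separated like this, a grid search over `max_dist` only repeats `apply_threshold`; in a scikit-learn `Pipeline` the same effect would presumably come from caching the fitted matching step.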

jeromedockes · 2023-12-01T11:20:54Z

ok @Vincent-Maladiere @jovan-stojanovic I think I've addressed the main comments if you want to have another look

Vincent-Maladiere

Some additional comments

skrub/_fuzzy_join.py

skrub/_join_utils.py

skrub/_joiner.py

Vincent-Maladiere

Great work, thank you, @jeromedockes, LGTM!

jeromedockes · 2023-12-05T09:39:26Z

@jovan-stojanovic (and others @LeoGrin @GaelVaroquaux if you have the time) would you like to have another look? I think we are converging on this one.

jovan-stojanovic

Looks really good! Here is a final review @jeromedockes, some small things to change and I guess we are ready for the release! 🚀

jovan-stojanovic · 2023-12-11T14:25:37Z

examples/04_fuzzy_joining.py

-#    score, that we will use later to show what are the worst matches.
+#    We set the ``add_match_info`` parameter to `True` to show distances
+#    between the rows that have been matched, that we will use later to show
+#    what are the worst matches.

 ###############################################################################
 #


I can't comment below but L146-147:
"Czechia"/"Czech Republic" and "Luxembourg*"/"Luxembourg" should be replaced by "Egypt"/"Egypt, Arab Rep." and "Lesotho*"/"Lesotho" to reflect well what was printed above.

examples/04_fuzzy_joining.py

jovan-stojanovic · 2023-12-11T14:37:05Z

examples/04_fuzzy_joining.py

 # We create a selector that we will insert at the end of our pipeline, to
 # select the relevant columns before fitting the regressor

 pipeline = make_pipeline(


IMO it may be better to do it in two steps:

- create the selector
- add it to the pipeline

Just to help the user grasp it more easily.

Suggested change

-# We create a selector that we will insert at the end of our pipeline, to
-# select the relevant columns before fitting the regressor
-pipeline = make_pipeline(
+# We create a selector that we will insert at the end of our pipeline, to
+# select the relevant columns before fitting the regressor
+selector = SelectCols(
+    [
+        "GDP per capita (current US$)",
+        "Life expectancy at birth, total (years)",
+        "Strength of legal rights index (0=weak to 12=strong)",
+        "GDP per capita (current US$) gdp",
+        "Life expectancy at birth, total (years) life_exp",
+        "Strength of legal rights index (0=weak to 12=strong) legal_rights",
+    ]
+)
+# We create our pipeline
+pipeline = make_pipeline(

skrub/_fuzzy_join.py

jovan-stojanovic · 2023-12-11T14:52:20Z

skrub/_join_utils.py

+def check_column_name_duplicates(
+    main_table,
+    aux_table,
+    suffix,
+    main_table_name="main_table",
+    aux_table_name="aux_table",
+):
+    """Check that there are no duplicate column names after applying a suffix.
+
+    The suffix is applied to (a copy of) `aux_columns` before checking for


This is super useful! 🎉
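A minimal sketch of the kind of check the docstring above describes, working on plain lists of column names rather than dataframes. The function name, signature and error message are assumptions made for illustration and are not the helper added in `skrub/_join_utils.py`.

```python
from collections import Counter


def check_duplicates_after_suffix(main_columns, aux_columns, suffix):
    """Raise ValueError if suffixing the aux columns leaves duplicate names.

    Hypothetical helper: the suffix is applied to every aux column name,
    then the combined list of names is checked for repeats.
    """
    suffixed = [f"{name}{suffix}" for name in aux_columns]
    counts = Counter(list(main_columns) + suffixed)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            f"Column names are duplicated after applying suffix {suffix!r}: "
            f"{duplicates}"
        )
    return suffixed


# No clash: the suffix separates the two "Country" columns.
print(check_duplicates_after_suffix(["Country", "GDP"], ["Country"], "_aux"))

# Clash: "Country_aux" already exists in the main table.
try:
    check_duplicates_after_suffix(["Country", "Country_aux"], ["Country"], "_aux")
except ValueError as error:
    print(error)
```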

jovan-stojanovic · 2023-12-11T15:11:01Z

skrub/_joiner.py

    suffix : str, default=""
        Suffix to append to the ``aux_table``'s column names. You can use it
        to avoid duplicate column names in the join.


WDYT, shouldn't the default suffix be something like `_aux`?
In any case, this is applied only if there are duplicate columns (same remark for `fuzzy_join`).

jeromedockes · 2023-12-11

It is always applied: we decided not to apply it only when there are duplicates (at least for now). Note that the pandas and polars approach does not work 100%, because they add the suffix only if there are duplicates but then don't check whether there are duplicates after adding the suffix. We also thought it is useful to be able to easily know what the output column names will be. However, in a later PR we want to add an option for generating an automatic suffix.

Re what the default should be: `_aux` does make sense, although many users will want no suffix (if they don't have duplicated column names), and at the same time `_aux` might be too short to prevent duplicates in some cases.
So I'm not really sure what's best; I guess in many cases users will have to provide their own suffix.
WDYT @Vincent-Maladiere and @skrub-data/devs?

jovan-stojanovic · 2023-12-11

OK, thanks. Ah yes, you are right: checking for duplicates after applying the suffix (a great asset of the Joiner) changes the logic here.
I guess this is not a blocking issue for this PR anyway; I'm OK with resolving it in future issues.
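To make the concern in this thread concrete, here is a simplified model of a "suffix only on clash" strategy. It is a hypothetical illustration written for this discussion, not pandas or polars code, and it only shows how a name that is suffixed conditionally can still collide with an existing column.

```python
def suffix_only_on_clash(main_columns, aux_columns, suffix):
    """Add the suffix only to aux names that already appear in the main table.

    Hypothetical model of conditional suffixing; no check is made afterwards.
    """
    main = set(main_columns)
    return [f"{name}{suffix}" if name in main else name for name in aux_columns]


main_columns = ["Country", "Country_aux"]
aux_columns = ["Country", "Population"]

joined = main_columns + suffix_only_on_clash(main_columns, aux_columns, "_aux")
print(joined)

# The suffixed "Country" now collides with the existing "Country_aux".
has_duplicates = len(joined) != len(set(joined))
print(has_duplicates)
```

Note also that "Population" keeps its name here only because the main table has no such column, so the output names depend on both tables; always applying the suffix makes them predictable from the aux table alone.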

skrub/_joiner.py

jeromedockes · 2023-12-11T16:15:13Z

Thanks a lot for the review, @jovan-stojanovic! I think the last outstanding question is what the default for `suffix` should be. (This also applies to the other joiners, AggJoiner and InterpolationJoiner.)

Vincent-Maladiere · 2023-12-12T08:08:23Z

Let's discuss the suffix strategy outside of this PR and move forward :)

jeromedockes added 18 commits November 8, 2023 18:16

start adding matching strategies

255a184

iter

7c58a6f

Merge remote-tracking branch 'upstream/main' into refactor_joiner

80e9a6a

add to Joiner, pass aux to fit in matchers

5ee25e8

use hashing vectorizer + better handling of sparsity

1af0103

add actual join

f5c8c0d

use pd merge rather than concat to let pandas handle rows without matches

1663a75

update example

c0e0e4b

update fuzzy_join

ee189db

better names in fuzzy_join key checks

e968a79

add distance rescaling and max_dist

b13fedd

update joiner docstring

c02bc74

update fuzzy_join docstring

be23bc8

allow None or "inf" as max_dist

dd73ada

update example

d6167d0

unused import

d74391f

add note

3f87007

iter

4cec389

jeromedockes marked this pull request as draft November 13, 2023 14:21

jeromedockes added 5 commits November 13, 2023 17:24

select matching as string + use 2nd neighbor as default

c275329

rename matching -> ref_dist

46cbceb

outdated comments

079286f

update example

402bda9

update tests

0d3f079

Merge remote-tracking branch 'upstream/main' into refactor_joiner

94b332d

jeromedockes mentioned this pull request Nov 14, 2023

Better threshold metric for fuzzy_join #470

Open

jeromedockes added 3 commits November 20, 2023 18:14

iter

3a62108

add rescaling with percentile of aux-aux distances

bf747ca

improve docstring

5cf7c0a

jeromedockes added 8 commits December 1, 2023 11:01

remove worst match rescaling option

a51dfe4

fix docstring

effaa54

capitalize param description

4530fa7

add tests

4ad0147

update fuzzy_join docstring

c869b67

rename aux_quartile, Percentile

ea1b5cb

full stop at end of param description

a132d73

detail

3131039

jeromedockes changed the title ~~[WIP] Joiner store state in fit + add other distance scaling strategies~~ Joiner store state in fit + add other distance scaling strategies Dec 1, 2023

Vincent-Maladiere reviewed Dec 1, 2023

jeromedockes added 5 commits December 1, 2023 16:21

fix Joiner bug when aux table index is not range(shape[0])

2f672ab

simpler reset index

1429309

duplicate column name checking

3dc1ad2

better way of passing table names

24f0498

type hints

19b92c9

Vincent-Maladiere approved these changes Dec 4, 2023

jeromedockes added 2 commits December 8, 2023 12:46

convert polars dataframes to pandas until we have actual polars support

f37856f

details

aa79283

jovan-stojanovic reviewed Dec 11, 2023

jeromedockes and others added 4 commits December 11, 2023 16:40

Apply suggestions from code review

86fbb65

Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>

apply suggestions from review

f24bc4b

Merge branch 'refactor_joiner' of github.com:jeromedockes/skrub into refactor_joiner

cc7b63c

Merge remote-tracking branch 'upstream/main' into refactor_joiner

771464d

jovan-stojanovic approved these changes Dec 12, 2023

Vincent-Maladiere merged commit 9fa3bd3 into skrub-data:main Dec 12, 2023
26 checks passed
