FEA Improving fuzzy_join: numerical columns and multiple keys#530
LilianBoulard merged 38 commits into skrub-data:main
Conversation
You should merge with master. You have a conflict.
GaelVaroquaux
left a comment
It seems to me that the code works only when the columns are either all numerical or all non-numerical. We'd like code that works for a mix.
I would see a structure of the code where each column gets separately encoded into a numerical representation (scaled by its 2-norm), and these are then combined. This approach can later be expanded to account for datetimes (we will need to do so at some point).
Thanks, it is now possible to join on mixed types of columns.
This way, it will be easier to add new types of joins, such as datetime, with separate encoding methods.
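The separate-encoding structure described above (each column type encoded into its own numerical block, then the blocks combined) could be sketched roughly as follows. This is purely illustrative: the function names are hypothetical, and a toy bag-of-characters encoding stands in for the real n-gram string vectorizer.

```python
import numpy as np

def encode_numeric(col):
    """Scale a numeric key column by its 2-norm, as suggested above."""
    col = np.asarray(col, dtype=float).reshape(-1, 1)
    norm = np.linalg.norm(col)
    return col / norm if norm else col

def encode_string(col, vocab="abcdefghijklmnopqrstuvwxyz"):
    """Toy bag-of-characters encoding, standing in for an n-gram vectorizer."""
    mat = np.zeros((len(col), len(vocab)))
    for i, s in enumerate(col):
        for ch in s.lower():
            j = vocab.find(ch)
            if j >= 0:
                mat[i, j] += 1
    # L2-normalize rows so each block contributes on a comparable scale
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return mat / norms

# Each key column is encoded separately, then the blocks are combined
num_block = encode_numeric([48.85, 45.76, 43.30])
str_block = encode_string(["Paris", "Lyon", "Marseille"])
combined = np.hstack([num_block, str_block])
```

Adding a new key type (e.g. datetime) then only requires a new `encode_*` block.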
I also have an idea on how to join on datetime values, but I think it should be added in a follow-up PR; this one already adds a lot to the function.
LilianBoulard
left a comment
Thanks for the PR! I've submitted a small commit to fix a few things, and left a few comments.
Small note: when writing a string spanning multiple lines, the convention (at least in this project) is to put the space at the end of the line.
So instead of writing

```python
print(
    "This is a sentence"
    " that's spanning multiple lines."
)
```

we'd write

```python
print(
    "This is a sentence "
    "that's spanning multiple lines. "  # Also notice the trailing space!
)
```

(small, insignificant detail, I agree ^^)
Vincent-Maladiere
left a comment
Hey @jovan-stojanovic, thanks for the PR! fuzzy_join looks great :) I have some comments that might help make things easier for the user.
Also, there's a quick fix I wanted to address, but I couldn't comment on it because it wasn't changed during this PR:
```python
if return_score:
    df_joined = pd.concat(
        [df_joined, pd.DataFrame(norm_distance, columns=["matching_score"])], axis=1
    )
```

This could be replaced by:

```python
if return_score:
    df_joined["matching_score"] = norm_distance
```
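A quick sanity check on toy data that the two variants produce the same frame. Note that pd.concat aligns on the index, so the one-line assignment is arguably also safer when df_joined carries a non-default index:

```python
import numpy as np
import pandas as pd

df_joined = pd.DataFrame({"key": ["a", "b"]})
norm_distance = np.array([0.1, 0.9])

# Current version: concat with a freshly built one-column DataFrame
via_concat = pd.concat(
    [df_joined, pd.DataFrame(norm_distance, columns=["matching_score"])], axis=1
)

# Proposed version: plain column assignment
via_assign = df_joined.copy()
via_assign["matching_score"] = norm_distance
```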
dirty_cat/_fuzzy_join.py (outdated)

```python
        "Specify numerical_match as 'string' "
        "or 'number'. "
    )
elif numerical_match in ["number"] and any_numeric and not mixed_types:
```
I think this design is a bit tricky because we are required to choose numerical_match="number" to use two or more columns of mixed types. I think introducing another category like "mixed" could bring more transparency about what is being used.
In this scenario, choosing numerical_match="number" and having mixed types would raise an error, same for numerical_match="string". This would lead to the following design:
```python
if (
    (numerical_match == "number" and not only_numerical)
    or (numerical_match == "string" and not only_string)
):
    raise ValueError(...)

if numerical_match in ["number", "mixed"]:
    main_num_enc, aux_num_enc = _numeric_encoding(
        main_table, main_num_cols, aux_table, aux_num_cols
    )
if numerical_match in ["string", "mixed"]:
    main_str_enc, aux_str_enc = _string_encoding(
        main_table,
        main_str_cols,
        aux_table,
        aux_str_cols,
        encoder=encoder,
        analyzer=analyzer,
        ngram_range=ngram_range,
    )

if numerical_match == "mixed":
    main_enc = hstack((main_num_enc, main_str_enc), format="csr")
    aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
elif numerical_match == "number":
    main_enc = main_num_enc
    aux_enc = aux_num_enc
else:
    main_enc = main_str_enc
    aux_enc = aux_str_enc

idx_closest, norm_distance = _nearest_matches(main_enc, aux_enc)
```

WDYT?
In this scenario, choosing numerical_match="number" and having mixed types would raise an error, same for numerical_match="string".

I think it won't, as there is the elif numerical_match in ["number"] and any_numeric and mixed_types condition.
I agree that we need to improve this design, but I am still not sure of the way to go. Adding numerical_match="mixed" would be problematic, because the idea is to have something like numerical_match in ["number", "date", "geo"], whereas "mixed" would suppose that we infer the numerical type rather than it being given explicitly.
Either way, I think I will start working on adding joins on dates shortly after this PR, so we might discuss this further then.
Tell me what you think.
Vincent-Maladiere
left a comment
There was a problem hiding this comment.
Some additional remarks following our IRL discussion :)
dirty_cat/_fuzzy_join.py (outdated)

```python
    )
    main_enc = hstack((main_num_enc, main_str_enc), format="csr")
    aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
    idx_closest, norm_distance = _nearest_matches(main_enc, aux_enc)
```
Following my comment above, I've just realized that we stack the mixed encodings together before computing the nearest neighbors. Isn't this what we want to avoid since the string encoding will be overrepresented by its number of columns?
Yes, you are right.
We need to find some benchmarks for this and see what the best solution is. Do you mind keeping it this way for now, or would you rather sum the distances so that each encoding has an equal weight?
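To make the overrepresentation concern concrete: with stacked encodings the overall Euclidean distance is sqrt(d_num² + d_str²), so a wide string block dwarfs a two-column numeric block, whereas summing per-block distances weights each block equally. A small random sketch (illustrative only, not library code):

```python
import numpy as np

rng = np.random.default_rng(0)
num = rng.normal(size=(5, 2))     # numeric encoding: 2 columns
strs = rng.normal(size=(5, 100))  # string n-gram encoding: 100 columns

# Stacked: one big distance, dominated by the wide string block
stacked = np.hstack([num, strs])
d_stacked = np.linalg.norm(stacked[0] - stacked[1])

# Per-block distances, summed with equal weight instead of stacked
d_num = np.linalg.norm(num[0] - num[1])
d_str = np.linalg.norm(strs[0] - strs[1])
d_sum = d_num + d_str
```

Here d_stacked equals sqrt(d_num² + d_str²) exactly, so when d_str is an order of magnitude larger than d_num, the numeric columns barely influence the match.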
LilianBoulard
left a comment
Hey, I've submitted a commit, which fixes some things.
Otherwise, it looks good :)
I have one small change left and it's ready.
dirty_cat/_fuzzy_join.py (outdated)

```python
    )
    all_cats = pd.concat([main_cols_clean, aux_cols_clean], axis=0).unique()

if isinstance(encoder, str) and encoder == "hashing":
```
Overall I'm -1 for introducing encoder = "hashing" by default instead of None because it introduces unnecessary complexity and a mixed argument signature.
You can open an issue on this and we can discuss it more. I think this was done so that people don't think that we use the CountVectorizer.
In that scenario, we could document in the docstring that the default None choice will fall back on HashingVectorizer. This is what is done in scikit-learn's ensemble estimator StackingRegressor for example:
https://github.com/scikit-learn/scikit-learn/blob/559609fe98ec2145788133687e64a6e87766bc77/sklearn/ensemble/_stacking.py#L788-L790
I'm okay to do it in a subsequent PR if we want this to be merged quickly.
Oh I see the point; actually, this makes much more sense. The docstring is there to inform the user of what the function does internally, not the parameter name. I think I will change this now then.
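The StackingRegressor-style pattern discussed here (a None default whose fallback is spelled out in the docstring, rather than a magic "hashing" string) could be sketched like this; the stub class and helper name are hypothetical, standing in for scikit-learn's HashingVectorizer and the internal resolution logic:

```python
class HashingVectorizerStub:
    """Hypothetical stand-in for sklearn.feature_extraction.text.HashingVectorizer."""


def resolve_encoder(encoder=None):
    """Return the encoder to use for string columns.

    Parameters
    ----------
    encoder : vectorizer instance, default=None
        If None, falls back to a HashingVectorizer. The fallback is
        documented here instead of being encoded as the string "hashing".
    """
    if encoder is None:
        return HashingVectorizerStub()
    return encoder
```

This keeps the argument signature uniform (always None or a vectorizer instance) while still telling the user exactly which default is used.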
Vincent-Maladiere
left a comment
The error messages that have been changed break the CI.
Vincent-Maladiere
left a comment
Hi @jovan-stojanovic and @LilianBoulard we're almost done, here's a last review before it LGTM :)
dirty_cat/_fuzzy_join.py (outdated)

```python
main_enc = hstack((main_num_enc, main_str_enc), format="csr")
aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
idx_closest, matching_score = _nearest_matches(main_enc, aux_enc)
mm_scaler = MinMaxScaler(feature_range=(0, 1))
```
We should move MinMaxScaler into _nearest_matches, because the other branch conditions would still need to have matching_score between 0 and 1.
Yes, but I am afraid doing this will give a matching_score of 1 to the best match rather than to an exact one, right? This is something we really don't want with strings: it is useful to have perfect string matches score 1 (which easily brings us back to what pandas.merge does).
You're right, but this gets tricky with numerical and mixed values. A perfect match on numerical values will have a distance of 0, thus a matching_score of 1. However, if the distance is higher than 1, the matching_score becomes negative, which is weird.
Besides, the current implementation for mixed values gives 1 to the best match rather than the exact match, since we use MinMaxScaler, right? So the matching_score behaves differently for string and mixed values.
I think we can solve that by removing MinMaxScaler and dividing the distance by its maximum in the _nearest_matches function:
- For perfect matches, the distance is 0 => the similarity is 1
- For best matches, the distance is > 0 => the similarity is < 1
- For the worst case, the normalized distance is 1 => the similarity is 0
WDYT? :)
Great, indeed, this is the best solution I think!
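The agreed-upon normalization (divide each distance by the maximum distance, then take 1 minus the result) might look like this; a sketch only, the function name is illustrative rather than the actual _nearest_matches code:

```python
import numpy as np


def distances_to_scores(distances):
    """Map distances to matching scores in [0, 1].

    Exact match: distance 0 -> score 1.
    Worst match: maximum distance -> score 0.
    """
    distances = np.asarray(distances, dtype=float)
    max_dist = distances.max()
    if max_dist == 0:  # every match is exact
        return np.ones_like(distances)
    return 1 - distances / max_dist


scores = distances_to_scores([0.0, 0.5, 1.0])
```

Unlike MinMaxScaler, this keeps a score of exactly 1 reserved for zero-distance (exact) matches.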
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
Ok, I think this is ready to be merged when the tests pass @Vincent-Maladiere @LilianBoulard
This PR adds important new features to fuzzy_join:
Joining on numerical columns is now possible. The distance used between columns is the Euclidean distance.
Joining on multiple keys is also now possible, for both string and numerical keys. For instance, joining on both of the string columns ["City", "Country"], or on the numerical columns ["latitude", "longitude"], was the motivation for adding this.
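For the numerical case, the core of such a join can be sketched in a few lines of NumPy/pandas: for every row of the left table, find the Euclidean-nearest row of the right table on the key columns. This is an illustrative sketch with made-up data, not the fuzzy_join implementation:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(
    {"city": ["Paris", "Lyon"], "latitude": [48.86, 45.76], "longitude": [2.35, 4.84]}
)
right = pd.DataFrame(
    {
        "latitude": [48.85, 45.75, 43.30],
        "longitude": [2.34, 4.85, 5.37],
        "population": [2_100_000, 520_000, 870_000],
    }
)

keys = ["latitude", "longitude"]
L = left[keys].to_numpy()
R = right[keys].to_numpy()

# Pairwise Euclidean distances between every left row and every right row
dists = np.linalg.norm(L[:, None, :] - R[None, :, :], axis=2)
idx_closest = dists.argmin(axis=1)

# Attach the nearest right-table row to each left-table row
joined = pd.concat(
    [left.reset_index(drop=True), right.iloc[idx_closest].reset_index(drop=True)],
    axis=1,
)
```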