
FEA Improving fuzzy_join: numerical columns and multiple keys #530

Merged: 38 commits into skrub-data:main on Apr 18, 2023

Conversation

@jovan-stojanovic (Member) commented on Mar 15, 2023

This PR adds important new features to fuzzy_join:

  • Joining on numerical columns is now possible. The distance used between numerical columns is the Euclidean distance.

  • Many-to-many joins (joining on multiple keys) are now also possible, for both string and numerical keys. For instance, joining on the string columns ["City", "Country"] or on the numerical columns ["latitude", "longitude"] was the motivation for adding this; a usage sketch follows below.
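
For illustration, a minimal usage sketch of what these features enable (the tables and column names here are made up, and the exact signature of fuzzy_join may differ slightly from the merged version):

import pandas as pd
from dirty_cat import fuzzy_join

# Two tables describing the same cities, with noisy names and slightly
# different coordinates.
cities = pd.DataFrame({
    "City": ["Paris", "Lisbon", "Berlin"],
    "Country": ["France", "Portugal", "Germany"],
    "latitude": [48.85, 38.72, 52.52],
    "longitude": [2.35, -9.14, 13.40],
})
weather = pd.DataFrame({
    "city_name": ["paris", "lisboa", "berlin"],
    "country": ["france", "portugal", "germany"],
    "lat": [48.86, 38.71, 52.52],
    "lon": [2.35, -9.13, 13.41],
    "avg_temp": [12.3, 17.4, 10.1],
})

# Fuzzy join on two string keys at once...
joined_on_names = fuzzy_join(
    cities, weather,
    left_on=["City", "Country"], right_on=["city_name", "country"],
)

# ...or on two numerical keys, matched by Euclidean distance
# (the numerical_match parameter follows the PR discussion below).
joined_on_coords = fuzzy_join(
    cities, weather,
    left_on=["latitude", "longitude"], right_on=["lat", "lon"],
    numerical_match="number",
)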

@GaelVaroquaux (Member)

You should merge with master. You have a conflict.

@GaelVaroquaux (Member) left a comment

It seems to me that the code works only when the columns are either all numerical or all non-numerical. We'd like code that works for a mix.

I would structure the code so that every column is separately injected into a numerical representation (scaled by its 2-norm), and these representations are then combined; a small illustration follows below. This approach can later be expanded to account for datetime columns (we will need to do so at some point).
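
For illustration only, a tiny sketch of the per-column 2-norm scaling idea (hypothetical code, not from this PR):

import numpy as np

# Scale each numerical key column by its 2-norm so that columns with large
# magnitudes (e.g. longitude vs. a 0-1 score) do not dominate the combined
# distance once the columns are stacked together.
X = np.array([[48.85, 2.35], [38.72, -9.14], [52.52, 13.40]])
X_scaled = X / np.linalg.norm(X, axis=0)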

@jovan-stojanovic (Member Author) commented on Mar 22, 2023

It seems to me that the code works only when the columns are either all numerical or all non-numerical. We'd like code that works for a mix.

I would structure the code so that every column is separately injected into a numerical representation (scaled by its 2-norm), and these representations are then combined. This approach can later be expanded to account for datetime columns (we will need to do so at some point).

Thanks, it is now possible to join on mixed types of columns.
I worked along the lines of your idea, so the code structure is the following:

  • if all key columns are numerical, do a numerical join;
  • if all key columns are strings, do a join on strings;
  • if the key columns are of mixed types, encode the columns separately based on their type, then build the "full" encoding array of all columns and find the nearest neighbors (see the sketch below).

This way, it will be easier to add new types of joins, such as datetime, with separate encoding methods.
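
A rough, self-contained sketch of that structure (illustrative only; the steps mirror the ones discussed in this thread, but this is not the PR's actual code):

import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import NearestNeighbors

main_table = pd.DataFrame(
    {"City": ["Paris", "Lisbon"], "lat": [48.85, 38.72], "lon": [2.35, -9.14]}
)
aux_table = pd.DataFrame(
    {"city_name": ["paris", "lisboa"], "lat": [48.86, 38.71], "lon": [2.35, -9.13]}
)

# Encode the numerical key columns as arrays...
main_num_enc = csr_matrix(main_table[["lat", "lon"]].to_numpy(float))
aux_num_enc = csr_matrix(aux_table[["lat", "lon"]].to_numpy(float))

# ...encode the string key columns with character n-grams...
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4))
main_str_enc = vectorizer.transform(main_table["City"])
aux_str_enc = vectorizer.transform(aux_table["city_name"])

# ...stack the per-type encodings into one "full" array per table...
main_enc = hstack((main_num_enc, main_str_enc), format="csr")
aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")

# ...and match each main row to its nearest neighbor in the auxiliary table.
neigh = NearestNeighbors(n_neighbors=1).fit(aux_enc)
distance, idx_closest = neigh.kneighbors(main_enc)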

@jovan-stojanovic (Member Author)

I also have an idea for joining on datetime values, but I think it should be added in a follow-up PR; this one already adds a lot to the function.

@LilianBoulard (Member) left a comment

Thanks for the PR. I've submitted a small commit to fix a few things, and left a few comments!

Small note: when writing a string spanning multiple lines, the convention (at least in this project) is to have the space at the end of the line.
So instead of writing

print(
    "This is a sentence"
    " that's spanning multiple lines."
)

we'd write

print(
    "This is a sentence "
    "that's spanning multiple lines. "  # Also notice the trailing space!
)

(small, insignificant detail, I agree ^^)

@Vincent-Maladiere (Member) left a comment

Hey @jovan-stojanovic, thanks for the PR! fuzzy_join looks great :) I have some comments that might help make things easier for the user.

Also, there's a quick fix I wanted to address, but I couldn't comment on it because it wasn't changed during this PR:

    if return_score:
        df_joined = pd.concat(
            [df_joined, pd.DataFrame(norm_distance, columns=["matching_score"])], axis=1
        )

This could be replaced by:

    if return_score:
        df_joined["matching_score"] = norm_distance

"Specify numerical_match as 'string' "
"or 'number'. "
)
elif numerical_match in ["number"] and any_numeric and not mixed_types:
Member

I think this design is a bit tricky because we are required to choose numerical_match="number" to use two or more columns of mixed types. I think introducing another category like "mixed" could bring more transparency about what is being used.

In this scenario, choosing numerical_match="number" and having mixed types would raise an error, same for numerical_match="string". This would lead to the following design:

if (
    (numerical_match == "numerical" and not only_numerical)
    or (numerical_match == "string" and not only_string)
):
    raise ValueError(...)

if numerical_match in ["number", "mixed"]:
    main_num_enc, aux_num_enc = _numeric_encoding(
        main_table, main_num_cols, aux_table, aux_num_cols
    )

if numerical_match in ["string", "mixed"]:
    main_str_enc, aux_str_enc = _string_encoding(
        main_table,
        main_str_cols,
        aux_table,
        aux_str_cols,
        encoder=encoder,
        analyzer=analyzer,
        ngram_range=ngram_range,
    )

if numerical_match == "mixed":
    main_enc = hstack((main_num_enc, main_str_enc), format="csr")
    aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
elif numerical_match == "numerical":
    main_enc = main_num_enc
    aux_enc = aux_num_enc
else:
    main_enc = main_str_enc
    aux_enc = aux_str_enc

idx_closest, norm_distance = _nearest_matches(main_enc, aux_enc)

WDYT?

Member Author

In this scenario, choosing numerical_match="number" and having mixed types would raise an error, same for numerical_match="string".

I think it won't, as there is the elif numerical_match in ["number"] and any_numeric and mixed_types condition.

I think we need to improve this design, but I am still not sure of the best way to go. Adding numerical_match="mixed" would be problematic, because the idea is to eventually have something like numerical_match in ["number", "date", "geo"], and "mixed" would suppose that we infer the numerical type rather than having it given explicitly.

Either way, I think I will start working on adding joins on dates shortly after this PR, so we can discuss this further then.

Tell me what you think.

@Vincent-Maladiere (Member) left a comment

Some additional remarks following our IRL discussion :)

)
main_enc = hstack((main_num_enc, main_str_enc), format="csr")
aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
idx_closest, norm_distance = _nearest_matches(main_enc, aux_enc)
Member

Following my comment above, I've just realized that we stack the mixed encodings together before computing the nearest neighbors. Isn't this what we want to avoid since the string encoding will be overrepresented by its number of columns?

Member Author

Yes, you are right.
We need to find some benchmarks for this and see what the best solution is. Do you mind keeping it this way for now, or would you rather sum the distances so that each type has an equivalent weight?
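
For reference, a hedged sketch of that "equivalent weight" alternative (brute-force and purely illustrative; match_with_equal_weights is a hypothetical helper, not part of the PR): compute distances separately per column type, rescale each block, and sum them before picking the match.

import numpy as np
from sklearn.metrics import pairwise_distances


def match_with_equal_weights(main_blocks, aux_blocks):
    """Give each column type an equal weight when matching.

    main_blocks and aux_blocks are lists of 2D arrays (or sparse matrices),
    one per column type, e.g. [numeric_encoding, string_encoding].
    """
    n_main = main_blocks[0].shape[0]
    n_aux = aux_blocks[0].shape[0]
    total = np.zeros((n_main, n_aux))
    for main_enc, aux_enc in zip(main_blocks, aux_blocks):
        dist = pairwise_distances(main_enc, aux_enc)
        if dist.max() > 0:
            dist = dist / dist.max()  # rescale so no block dominates
        total += dist
    idx_closest = total.argmin(axis=1)
    summed_distance = total[np.arange(n_main), idx_closest]
    return idx_closest, summed_distance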

@LilianBoulard (Member) left a comment

Hey, I've submitted a commit, which fixes some things.
Otherwise, it looks good :)

@jovan-stojanovic (Member Author)

I have one small change left and it's ready.

)
all_cats = pd.concat([main_cols_clean, aux_cols_clean], axis=0).unique()

if isinstance(encoder, str) and encoder == "hashing":
Member

Overall, I'm -1 on introducing encoder="hashing" by default instead of None, because it introduces unnecessary complexity and a mixed argument signature.

Member Author

You can open an issue on this and we can discuss it more. I think this was done so that people don't think that we use the CountVectorizer.

Member

In that scenario, we could document in the docstring that the default None choice will fall back on HashingVectorizer. This is what is done in scikit-learn's ensemble estimator StackingRegressor for example:
https://github.com/scikit-learn/scikit-learn/blob/559609fe98ec2145788133687e64a6e87766bc77/sklearn/ensemble/_stacking.py#L788-L790

I'm okay to do it in a subsequent PR if we want this to be merged quickly.

Member Author

Oh, I see the point; actually, this makes much more sense. The docstring is there to inform the user of what the function does, not the parameter name. I think I will change this now, then.
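
For instance, the parameter documentation could look something like this (wording and signature are illustrative only):

def fuzzy_join(left, right, *, encoder=None, **kwargs):
    """Join two tables by fuzzy matching of the join keys.

    Parameters
    ----------
    encoder : vectorizer instance, optional
        Encoder used to transform the string key columns before matching.
        If None (the default), a HashingVectorizer is used.
    """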

@Vincent-Maladiere (Member) left a comment

The error messages that have been changed break the CI.

@Vincent-Maladiere (Member) left a comment

Hi @jovan-stojanovic and @LilianBoulard, we're almost done. Here's a last review before it LGTM :)


main_enc = hstack((main_num_enc, main_str_enc), format="csr")
aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
idx_closest, matching_score = _nearest_matches(main_enc, aux_enc)
mm_scaler = MinMaxScaler(feature_range=(0, 1))
Member

We should move the MinMaxScaler into _nearest_matches, because the other branches would still need matching_score to be between 0 and 1.

Member Author

Yes, but I am afraid doing this will give us a matching_score of 1 for the best match, not only for exact matches, right? This is something we really don't want with strings, because I find it useful to have perfect string matches at 1 (it also brings us easily back to what pandas.merge does).

Member

You're right, but this gets tricky with numerical and mixed values. A perfect match with numerical values will have a distance of 0, thus a matching_score of 1. However, if the distance is higher than 1, matching_score becomes negative, which is weird.

Besides, the current implementation for mixed values has 1 for the best match and not the exact match since we use MinMaxScaler, right? So, the matching_score parameter behaves differently for string and mixed values.

I think we can solve that by removing MinMaxScaler and dividing the distance by its maximum in the _nearest_matches function (see the sketch after this list):

  • For perfect matches, the distance is 0 => the similarity is 1
  • For best matches, the distance is > 0 => the similarity is < 1
  • For the worst case, the normalized distance is 1 => the similarity is 0

WDYT? :)
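
A minimal sketch of that normalization inside _nearest_matches (assuming the helper returns one Euclidean distance per row of the main table; illustrative, not the merged implementation):

import numpy as np
from sklearn.neighbors import NearestNeighbors


def _nearest_matches(main_enc, aux_enc):
    """Return, for each main row, the closest aux row and a [0, 1] matching score."""
    neigh = NearestNeighbors(n_neighbors=1).fit(aux_enc)
    distance, idx_closest = neigh.kneighbors(main_enc)
    distance, idx_closest = distance.ravel(), idx_closest.ravel()
    max_dist = distance.max()
    if max_dist > 0:
        # exact match: distance 0 -> score 1; worst match: max distance -> score 0
        matching_score = 1 - distance / max_dist
    else:
        # all matches are exact
        matching_score = np.ones_like(distance)
    return idx_closest, matching_score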

Member Author

Great, indeed, this is the best solution I think!

jovan-stojanovic and others added 2 commits April 18, 2023 11:31
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
@jovan-stojanovic (Member Author)

OK, I think this is ready to be merged when the tests pass @Vincent-Maladiere @LilianBoulard

@LilianBoulard merged commit c552053 into skrub-data:main on Apr 18, 2023
16 checks passed
@jovan-stojanovic deleted the fj_numerical branch on May 16, 2023