FEA Improving fuzzy_join: numerical columns and multiple keys#530
LilianBoulard merged 38 commits into skrub-data:main
Conversation
You should merge with master. You have a conflict.
GaelVaroquaux
left a comment
It seems to me that the code works only when the columns are either all numerical or all non-numerical. We'd like code that works for a mix.
I would see a structure of the code where each column gets separately encoded into a numerical representation (scaled by its 2-norm), and these are then combined. This approach can later be expanded to account for datetimes (we will need to do so at some point).
Thanks, it is now possible to join on mixed types of columns.
This way, it will be easier to add new types of joins, such as datetime, with separate encoding methods.
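The separate-encoding structure described above (each column type encoded into its own numerical block, then the blocks combined) could be sketched roughly as follows. This is purely illustrative: the function names are hypothetical, and a toy bag-of-characters encoding stands in for the real n-gram string vectorizer.

```python
import numpy as np

def encode_numeric(col):
    """Scale a numeric key column by its 2-norm, as suggested above."""
    col = np.asarray(col, dtype=float).reshape(-1, 1)
    norm = np.linalg.norm(col)
    return col / norm if norm else col

def encode_string(col, vocab="abcdefghijklmnopqrstuvwxyz"):
    """Toy bag-of-characters encoding, standing in for an n-gram vectorizer."""
    mat = np.zeros((len(col), len(vocab)))
    for i, s in enumerate(col):
        for ch in s.lower():
            j = vocab.find(ch)
            if j >= 0:
                mat[i, j] += 1
    # L2-normalize rows so each block contributes on a comparable scale
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return mat / norms

# Each key column is encoded separately, then the blocks are combined
num_block = encode_numeric([48.85, 45.76, 43.30])
str_block = encode_string(["Paris", "Lyon", "Marseille"])
combined = np.hstack([num_block, str_block])
```

Adding a new key type (e.g. datetime) then only requires a new `encode_*` block.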
I also have an idea on how to join on datetime values, but I think it should be added in a follow-up PR; this one already adds a lot to the function.
LilianBoulard
left a comment
Thanks for the PR! I've submitted a small commit to fix a few things, and left a few comments.
Small note: when writing a string spanning multiple lines, the convention (at least in this project) is to put the space at the end of the line.
So instead of writing

```python
print(
    "This is a sentence"
    " that's spanning multiple lines."
)
```

we'd write

```python
print(
    "This is a sentence "
    "that's spanning multiple lines. "  # Also notice the trailing space!
)
```

(small, insignificant detail, I agree ^^)
Vincent-Maladiere
left a comment
Hey @jovan-stojanovic, thanks for the PR! fuzzy_join looks great :) I have some comments that might help make things easier for the user.
Also, there's a quick fix I wanted to address, but I couldn't comment on it because it wasn't changed during this PR:
```python
if return_score:
    df_joined = pd.concat(
        [df_joined, pd.DataFrame(norm_distance, columns=["matching_score"])], axis=1
    )
```

This could be replaced by:

```python
if return_score:
    df_joined["matching_score"] = norm_distance
```
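A quick sanity check on toy data that the two variants produce the same frame. Note that pd.concat aligns on the index, so the one-line assignment is arguably also safer when df_joined carries a non-default index:

```python
import numpy as np
import pandas as pd

df_joined = pd.DataFrame({"key": ["a", "b"]})
norm_distance = np.array([0.1, 0.9])

# Current version: concat with a freshly built one-column DataFrame
via_concat = pd.concat(
    [df_joined, pd.DataFrame(norm_distance, columns=["matching_score"])], axis=1
)

# Proposed version: plain column assignment
via_assign = df_joined.copy()
via_assign["matching_score"] = norm_distance
```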
dirty_cat/_fuzzy_join.py (outdated)

```python
        "Specify numerical_match as 'string' "
        "or 'number'. "
    )
elif numerical_match in ["number"] and any_numeric and not mixed_types:
```
I think this design is a bit tricky because we are required to choose numerical_match="number" to use two or more columns of mixed types. I think introducing another category like "mixed" could bring more transparency about what is being used.
In this scenario, choosing numerical_match="number" and having mixed types would raise an error, same for numerical_match="string". This would lead to the following design:
```python
if (
    (numerical_match == "number" and not only_numerical)
    or (numerical_match == "string" and not only_string)
):
    raise ValueError(...)

if numerical_match in ["number", "mixed"]:
    main_num_enc, aux_num_enc = _numeric_encoding(
        main_table, main_num_cols, aux_table, aux_num_cols
    )
if numerical_match in ["string", "mixed"]:
    main_str_enc, aux_str_enc = _string_encoding(
        main_table,
        main_str_cols,
        aux_table,
        aux_str_cols,
        encoder=encoder,
        analyzer=analyzer,
        ngram_range=ngram_range,
    )

if numerical_match == "mixed":
    main_enc = hstack((main_num_enc, main_str_enc), format="csr")
    aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
elif numerical_match == "number":
    main_enc = main_num_enc
    aux_enc = aux_num_enc
else:
    main_enc = main_str_enc
    aux_enc = aux_str_enc

idx_closest, norm_distance = _nearest_matches(main_enc, aux_enc)
```

WDYT?
In this scenario, choosing numerical_match="number" and having mixed types would raise an error, same for numerical_match="string".

I think it won't, as there is the elif numerical_match in ["number"] and any_numeric and mixed_types condition.
I agree that we need to improve this design, but I am still not sure of the way to go. Adding numerical_match="mixed" would be problematic, because the idea is to have something like numerical_match in ["number", "date", "geo"], whereas "mixed" would suppose that we infer the numerical type rather than it being given explicitly.
Either way, I think I will start working on adding joins on dates shortly after this PR, so we might discuss this further then.
Tell me what you think.
Vincent-Maladiere
left a comment
There was a problem hiding this comment.
Some additional remarks following our IRL discussion :)
dirty_cat/_fuzzy_join.py (outdated)

```python
    )
    main_enc = hstack((main_num_enc, main_str_enc), format="csr")
    aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
    idx_closest, norm_distance = _nearest_matches(main_enc, aux_enc)
```
Following my comment above, I've just realized that we stack the mixed encodings together before computing the nearest neighbors. Isn't this what we want to avoid since the string encoding will be overrepresented by its number of columns?
Yes, you are right.
We need to find some benchmarks for this and see what the best solution is. Do you mind keeping it this way for now, or would you rather sum the distances so that each encoding has an equal weight?
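To make the overrepresentation concern concrete: with stacked encodings the overall Euclidean distance is sqrt(d_num² + d_str²), so a wide string block dwarfs a two-column numeric block, whereas summing per-block distances weights each block equally. A small random sketch (illustrative only, not library code):

```python
import numpy as np

rng = np.random.default_rng(0)
num = rng.normal(size=(5, 2))     # numeric encoding: 2 columns
strs = rng.normal(size=(5, 100))  # string n-gram encoding: 100 columns

# Stacked: one big distance, dominated by the wide string block
stacked = np.hstack([num, strs])
d_stacked = np.linalg.norm(stacked[0] - stacked[1])

# Per-block distances, summed with equal weight instead of stacked
d_num = np.linalg.norm(num[0] - num[1])
d_str = np.linalg.norm(strs[0] - strs[1])
d_sum = d_num + d_str
```

Here d_stacked equals sqrt(d_num² + d_str²) exactly, so when d_str is an order of magnitude larger than d_num, the numeric columns barely influence the match.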
LilianBoulard
left a comment
Hey, I've submitted a commit, which fixes some things.
Otherwise, it looks good :)
I have one small change left and it's ready.
dirty_cat/_fuzzy_join.py (outdated)

```python
    )
    all_cats = pd.concat([main_cols_clean, aux_cols_clean], axis=0).unique()

if isinstance(encoder, str) and encoder == "hashing":
```
Overall I'm -1 for introducing encoder = "hashing" by default instead of None because it introduces unnecessary complexity and a mixed argument signature.
You can open an issue on this and we can discuss it more. I think this was done so that people don't think that we use the CountVectorizer.
In that scenario, we could document in the docstring that the default None choice will fall back on HashingVectorizer. This is what is done in scikit-learn's ensemble estimator StackingRegressor for example:
https://github.com/scikit-learn/scikit-learn/blob/559609fe98ec2145788133687e64a6e87766bc77/sklearn/ensemble/_stacking.py#L788-L790
I'm okay to do it in a subsequent PR if we want this to be merged quickly.
Oh I see the point; actually, this makes much more sense. The docstring is there to inform the user of what the function does internally, not the parameter name. I think I will change this now then.
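The StackingRegressor-style pattern discussed here (a None default whose fallback is spelled out in the docstring, rather than a magic "hashing" string) could be sketched like this; the stub class and helper name are hypothetical, standing in for scikit-learn's HashingVectorizer and the internal resolution logic:

```python
class HashingVectorizerStub:
    """Hypothetical stand-in for sklearn.feature_extraction.text.HashingVectorizer."""


def resolve_encoder(encoder=None):
    """Return the encoder to use for string columns.

    Parameters
    ----------
    encoder : vectorizer instance, default=None
        If None, falls back to a HashingVectorizer. The fallback is
        documented here instead of being encoded as the string "hashing".
    """
    if encoder is None:
        return HashingVectorizerStub()
    return encoder
```

This keeps the argument signature uniform (always None or a vectorizer instance) while still telling the user exactly which default is used.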
Vincent-Maladiere
left a comment
The error messages that have been changed break the CI.
Vincent-Maladiere
left a comment
Hi @jovan-stojanovic and @LilianBoulard we're almost done, here's a last review before it LGTM :)
dirty_cat/_fuzzy_join.py (outdated)

```python
main_enc = hstack((main_num_enc, main_str_enc), format="csr")
aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
idx_closest, matching_score = _nearest_matches(main_enc, aux_enc)
mm_scaler = MinMaxScaler(feature_range=(0, 1))
```
We should move MinMaxScaler into _nearest_matches, because the other branch conditions would still need to have matching_score between 0 and 1.
Yes, but I am afraid doing this will give a matching_score of 1 to the best match rather than to an exact one, right? This is something we really don't want with strings: it is useful to have perfect string matches score 1 (which easily brings us back to what pandas.merge does).
You're right, but this gets tricky with numerical and mixed values. A perfect match on numerical values will have a distance of 0, thus a matching_score of 1. However, if the distance is higher than 1, the matching_score becomes negative, which is weird.
Besides, the current implementation for mixed values gives 1 to the best match rather than the exact match, since we use MinMaxScaler, right? So the matching_score behaves differently for string and mixed values.
I think we can solve that by removing MinMaxScaler and dividing the distance by its maximum in the _nearest_matches function:
- For perfect matches, the distance is 0 => the similarity is 1
- For best matches, the distance is > 0 => the similarity is < 1
- For the worst case, the normalized distance is 1 => the similarity is 0
WDYT? :)
Great, indeed, this is the best solution I think!
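The agreed-upon normalization (divide each distance by the maximum distance, then take 1 minus the result) might look like this; a sketch only, the function name is illustrative rather than the actual _nearest_matches code:

```python
import numpy as np


def distances_to_scores(distances):
    """Map distances to matching scores in [0, 1].

    Exact match: distance 0 -> score 1.
    Worst match: maximum distance -> score 0.
    """
    distances = np.asarray(distances, dtype=float)
    max_dist = distances.max()
    if max_dist == 0:  # every match is exact
        return np.ones_like(distances)
    return 1 - distances / max_dist


scores = distances_to_scores([0.0, 0.5, 1.0])
```

Unlike MinMaxScaler, this keeps a score of exactly 1 reserved for zero-distance (exact) matches.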
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
Ok, I think this is ready to be merged when the tests pass @Vincent-Maladiere @LilianBoulard
This PR adds important new features to fuzzy_join:
Joining on numerical columns is now possible. The distance used between columns is the Euclidean distance.
Joining on multiple keys is also now possible, for both string and numerical keys. For instance, joining on both of the string columns ["City", "Country"], or on the numerical columns ["latitude", "longitude"], was the motivation for adding this.
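For the numerical case, the core of such a join can be sketched in a few lines of NumPy/pandas: for every row of the left table, find the Euclidean-nearest row of the right table on the key columns. This is an illustrative sketch with made-up data, not the fuzzy_join implementation:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(
    {"city": ["Paris", "Lyon"], "latitude": [48.86, 45.76], "longitude": [2.35, 4.84]}
)
right = pd.DataFrame(
    {
        "latitude": [48.85, 45.75, 43.30],
        "longitude": [2.34, 4.85, 5.37],
        "population": [2_100_000, 520_000, 870_000],
    }
)

keys = ["latitude", "longitude"]
L = left[keys].to_numpy()
R = right[keys].to_numpy()

# Pairwise Euclidean distances between every left row and every right row
dists = np.linalg.norm(L[:, None, :] - R[None, :, :], axis=2)
idx_closest = dists.argmin(axis=1)

# Attach the nearest right-table row to each left-table row
joined = pd.concat(
    [left.reset_index(drop=True), right.iloc[idx_closest].reset_index(drop=True)],
    axis=1,
)
```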