FEA Improving fuzzy_join: numerical columns and multiple keys #530
Conversation
You should merge with master. You have a conflict.
It seems to me that the code works only when the columns are either all numerical or all non-numerical. We'd like code that works for a mix.
I would structure the code so that each column is separately encoded into a numerical representation (scaled by its 2-norm), and the representations are then combined. This approach can later be expanded to account for datetimes (we will need to do so at some point).
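A minimal sketch of that structure (the toy data and vectorizer parameters are only illustrative, not from the PR):

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import HashingVectorizer

# Toy table mixing a string column and a numerical column.
df = pd.DataFrame({"city": ["Paris", "Berlin", "Rome"], "latitude": [48.9, 52.5, 41.9]})

# Encode each column separately into a numerical representation:
# n-gram vectors for strings, raw values for numbers.
str_enc = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4)).transform(df["city"])
num_enc = csr_matrix(df[["latitude"]].to_numpy())

# Scale each block by its 2-norm so that neither block dominates...
str_enc = str_enc / np.sqrt(str_enc.multiply(str_enc).sum())
num_enc = num_enc / np.sqrt(num_enc.multiply(num_enc).sum())

# ...then combine the blocks into a single matrix.
combined = hstack((str_enc, num_enc), format="csr")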
Thanks, it is now possible to join on mixed types of columns. This way, it will be easier to add new types of joins, such as datetime, with separate encoding methods.
I also have an idea on how to join on datetime values, but I think it should be added in a follow-up PR; this one already adds a lot to the function.
Thanks for the PR. I've submitted a small commit to fix a few things, and I have a few comments!
Small note: when writing a string spanning multiple lines, the convention (at least in this project) is to have the space at the end of the line.
So instead of writing
print(
    "This is a sentence"
    " that's spanning multiple lines."
)
we'd write
print(
    "This is a sentence "
    "that's spanning multiple lines. "  # Also notice the trailing space!
)
(small, insignificant detail, I agree ^^)
Hey @jovan-stojanovic, thanks for the PR! fuzzy_join looks great :) I have some comments that might help make things easier for the user.
Also, there's a quick fix I wanted to address, but I couldn't comment on it because it wasn't changed during this PR:
if return_score:
    df_joined = pd.concat(
        [df_joined, pd.DataFrame(norm_distance, columns=["matching_score"])], axis=1
    )
This could be replaced by:
if return_score:
    df_joined["matching_score"] = norm_distance
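Note that the one-liner assigns the array positionally, whereas pd.concat aligns on the index, so it would also avoid a subtle misalignment when df_joined does not carry a default RangeIndex.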
dirty_cat/_fuzzy_join.py (Outdated)

        "Specify numerical_match as 'string' "
        "or 'number'. "
    )
elif numerical_match in ["number"] and any_numeric and not mixed_types:
I think this design is a bit tricky because we are required to choose numerical_match="number" to use two or more columns of mixed types. I think introducing another category like "mixed" could bring more transparency about what is being used.
In this scenario, choosing numerical_match="number" and having mixed types would raise an error, same for numerical_match="string". This would lead to the following design:
if (
    (numerical_match == "number" and not only_numerical)
    or (numerical_match == "string" and not only_string)
):
    raise ValueError(...)

if numerical_match in ["number", "mixed"]:
    main_num_enc, aux_num_enc = _numeric_encoding(
        main_table, main_num_cols, aux_table, aux_num_cols
    )
if numerical_match in ["string", "mixed"]:
    main_str_enc, aux_str_enc = _string_encoding(
        main_table,
        main_str_cols,
        aux_table,
        aux_str_cols,
        encoder=encoder,
        analyzer=analyzer,
        ngram_range=ngram_range,
    )

if numerical_match == "mixed":
    main_enc = hstack((main_num_enc, main_str_enc), format="csr")
    aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
elif numerical_match == "number":
    main_enc = main_num_enc
    aux_enc = aux_num_enc
else:
    main_enc = main_str_enc
    aux_enc = aux_str_enc

idx_closest, norm_distance = _nearest_matches(main_enc, aux_enc)
WDYT?
> In this scenario, choosing numerical_match="number" and having mixed types would raise an error, same for numerical_match="string".

I think it won't, as there is the elif numerical_match in ["number"] and any_numeric and mixed_types condition.
I think we need to improve this design, but I am still not sure of the way to go. Adding numerical_match="mixed" would be problematic, because the idea is to have something like numerical_match in ["number", "date", "geo"], and "mixed" would suppose that you infer the numerical type rather than explicitly giving it.
Either way, I think I will start working on adding joins on dates shortly after this PR, so we might discuss this further then. Tell me what you think.
Some additional remarks following our IRL discussion :)
dirty_cat/_fuzzy_join.py (Outdated)

)
main_enc = hstack((main_num_enc, main_str_enc), format="csr")
aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
idx_closest, norm_distance = _nearest_matches(main_enc, aux_enc)
Following my comment above, I've just realized that we stack the mixed encodings together before computing the nearest neighbors. Isn't this what we want to avoid since the string encoding will be overrepresented by its number of columns?
Yes, you are right.
We need to find some benchmarks for this and see what the best solution is. Do you mind keeping it this way for now, or would you rather sum the distances so that each has an equivalent weight?
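For reference, the "sum the distances" alternative could look roughly like this (a sketch; nearest_with_summed_distances is a hypothetical helper, not code from this PR):

import numpy as np
from sklearn.metrics import pairwise_distances

def nearest_with_summed_distances(main_blocks, aux_blocks):
    # Compute one pairwise distance matrix per encoding block, normalize
    # each so the blocks weigh equally, sum them, and only then pick the
    # closest auxiliary row for each main row.
    total = None
    for main_enc, aux_enc in zip(main_blocks, aux_blocks):
        dist = pairwise_distances(main_enc, aux_enc)
        if dist.max() > 0:
            dist = dist / dist.max()
        total = dist if total is None else total + dist
    idx_closest = total.argmin(axis=1)
    norm_distance = total[np.arange(total.shape[0]), idx_closest]
    return idx_closest, norm_distance

Each block would then contribute at most 1 to the total, regardless of how many columns it spans.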
Hey, I've submitted a commit which fixes some things.
Otherwise, it looks good :)
I have one small change left and then it's ready.
dirty_cat/_fuzzy_join.py (Outdated)

)
all_cats = pd.concat([main_cols_clean, aux_cols_clean], axis=0).unique()

if isinstance(encoder, str) and encoder == "hashing":
Overall I'm -1 for introducing encoder="hashing" by default instead of None, because it introduces unnecessary complexity and a mixed argument signature.
You can open an issue on this and we can discuss it more. I think this was done so that people don't think that we use the CountVectorizer.
In that scenario, we could document in the docstring that the default None choice will fall back on HashingVectorizer. This is what is done in scikit-learn's ensemble estimator StackingRegressor, for example:
https://github.com/scikit-learn/scikit-learn/blob/559609fe98ec2145788133687e64a6e87766bc77/sklearn/ensemble/_stacking.py#L788-L790
I'm okay to do it in a subsequent PR if we want this to be merged quickly.
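For illustration, the pattern could look like this (_resolve_encoder and its default parameters are hypothetical, just to show the idea):

from sklearn.feature_extraction.text import HashingVectorizer

def _resolve_encoder(encoder=None, analyzer="char_wb", ngram_range=(2, 4)):
    # None is the documented default: the docstring, rather than a magic
    # "hashing" string, tells the user it falls back on HashingVectorizer.
    if encoder is None:
        return HashingVectorizer(analyzer=analyzer, ngram_range=ngram_range)
    return encoder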
Oh, I see the point; actually, this makes much more sense. The docstring is there to inform the user of what's in the function, not the parameter name. I think I will change this now, then.
The error messages that have been changed break the CI.
Hi @jovan-stojanovic and @LilianBoulard, we're almost done; here's a last review before it's an LGTM :)
dirty_cat/_fuzzy_join.py (Outdated)

main_enc = hstack((main_num_enc, main_str_enc), format="csr")
aux_enc = hstack((aux_num_enc, aux_str_enc), format="csr")
idx_closest, matching_score = _nearest_matches(main_enc, aux_enc)
mm_scaler = MinMaxScaler(feature_range=(0, 1))
We should move MinMaxScaler into _nearest_matches, because the other branch conditions would still need matching_score to be between 0 and 1.
Yes, but I am afraid doing this will give us a matching_score of 1 for the best match, not the exact one, right? This is something we really don't want with strings, because I find it useful to have perfect string matches scored as 1 (which brings us easily back to what pandas.merge does).
You're right, but this gets tricky with numerical and mixed values. A perfect match between numerical values will have a distance of 0, and thus a matching_score of 1. However, if the distance is higher than 1, the matching_score becomes negative, which is weird.
Besides, the current implementation for mixed values gives 1 to the best match rather than the exact match, since we use MinMaxScaler, right? So the matching_score parameter behaves differently for string and mixed values.
I think we can solve that by removing MinMaxScaler and dividing the distance by its maximum in the _nearest_matches function:
- For perfect matches, the distance is 0 => the similarity is 1
- For best matches, the distance is > 0 => the similarity is < 1
- In the worst case, the normalized distance is 1 => the similarity is 0
WDYT? :)
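In code, the proposal could look like this (a sketch; the helper name is made up):

import numpy as np

def _similarity_from_distance(distance):
    # Divide by the maximum distance instead of min-max scaling, so an
    # exact match (distance 0) keeps a similarity of exactly 1.
    distance = np.asarray(distance, dtype=float)
    if distance.max() > 0:
        distance = distance / distance.max()
    return 1 - distance

This keeps exact matches at 1 while bounding the score in [0, 1].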
Great, indeed, this is the best solution I think!
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
OK, I think this is ready to be merged when the tests pass @Vincent-Maladiere @LilianBoulard
This PR adds important new features to fuzzy_join (see the usage sketch below):
- Joining on numerical columns is now possible. The distance used between numerical columns is the Euclidean distance.
- Many-to-many joins (joining on multiple keys) are also now possible, for both string and numerical keys. For instance, joining on both string columns ["City", "Country"], or on numerical columns ["latitude", "longitude"], was the motivation for adding this.
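A usage sketch of these additions (the tables and column names are made up, and the exact signature may differ slightly from the merged version):

import pandas as pd
from dirty_cat import fuzzy_join

left = pd.DataFrame({"City": ["Paris", "Berln"], "Country": ["France", "Germany"]})
right = pd.DataFrame({"city": ["Paris", "Berlin"], "country": ["France", "Germany"]})

# Joining on multiple string keys at once:
joined = fuzzy_join(left, right, left_on=["City", "Country"],
                    right_on=["city", "country"], return_score=True)

coords_a = pd.DataFrame({"latitude": [48.85, 52.52], "longitude": [2.35, 13.40]})
coords_b = pd.DataFrame({"latitude": [48.86, 52.52], "longitude": [2.34, 13.41]})

# Joining on multiple numerical keys (Euclidean distance):
joined_num = fuzzy_join(coords_a, coords_b,
                        left_on=["latitude", "longitude"],
                        right_on=["latitude", "longitude"])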