-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA Fuzzy joining on datetime #552
FEA Fuzzy joining on datetime #552
Conversation
I applied a I also added an What remains is to differentiate this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Jovan, here are some feedbacks on this PR :)
My main remark is that I don't understand the point of numerical_match
anymore: we extract embeddings from numerical and times by using the StandardScaler, and from string by using TFIDF.
Therefore we use the Euclidean Distance in the NearestNeighbor
in all cases.
If users want to encode numerical values as strings, shouldn't they transform their numerical values to string before running the fuzzy_join
? Otherwise, I feel it's very error-prone.
Also, removing the numerical_match
would simplify the logic a lot.
Am I missing something? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
I have nothing of significance to add, as I'm not super familiar with the workings of fuzzy_join 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last comment before it LGTM :)
skrub/_fuzzy_join.py
Outdated
@@ -430,8 +430,8 @@ def fuzzy_join( | |||
main_time_cols = main_table[main_cols].select_dtypes(include="datetime").columns | |||
aux_time_cols = aux_table[aux_cols].select_dtypes(include="datetime").columns | |||
|
|||
main_str_cols = list(set(main_cols) - set(main_num_cols) - set(main_time_cols)) | |||
aux_str_cols = list(set(aux_cols) - set(aux_num_cols) - set(main_time_cols)) | |||
main_str_cols = main_table[main_cols].select_dtypes(include="object").columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also "string" and "category", so include=["string", "category", "object"]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the test passed, I though object was sufficient. But if we want to be sure, I'll add all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, waiting for the CI to be green :)
Adds support for fuzzy joining tables on datetime columns.
The datetime columns in the table must be recognizable with pandas.DataFrame.select_dtypes('datetime').