FEA Fuzzy joining on datetime #552

Merged: 18 commits, Jun 14, 2023

jovan-stojanovic
Member

Adds support for fuzzy joining tables on datetime columns.

The datetime columns in the table must be recognizable with pandas.DataFrame.select_dtypes('datetime').
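As a minimal sketch of that requirement (the table and column names here are made up for illustration), a column holding dates as text is not picked up by `select_dtypes` until it is parsed with `pandas.to_datetime`:

```python
import pandas as pd

# Hypothetical table: one real datetime column, one column of dates stored as text.
df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2023-06-14 10:00", "2023-06-15 12:30"]),
        "when_as_text": ["2023-06-14 10:00", "2023-06-15 12:30"],
        "value": [1.0, 2.0],
    }
)

# Only the datetime64 column is recognized.
time_cols = list(df.select_dtypes(include="datetime").columns)

# Parsing the text column makes it recognizable as well.
df["when_parsed"] = pd.to_datetime(df["when_as_text"])
time_cols_after = list(df.select_dtypes(include="datetime").columns)

print(time_cols, time_cols_after)
```

Note that, per the pandas documentation, timezone-aware columns are selected with `"datetimetz"` rather than `"datetime"`.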

@jovan-stojanovic
Member Author

I applied a StandardScaler on datetime values to avoid issue #547.
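A rough sketch of what scaling datetime values can look like. The conversion to seconds since the epoch is an assumption made for this illustration; the PR text does not say how the datetimes are turned into numbers before scaling.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

times = pd.Series(pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]))

# Assumption: represent each timestamp as seconds since the Unix epoch.
seconds = (times - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
X = seconds.to_numpy().reshape(-1, 1)  # scikit-learn expects a 2D array

# After scaling, the values have zero mean and unit variance,
# so their magnitude no longer depends on the raw timestamp size.
scaled = StandardScaler().fit_transform(X)
print(scaled.ravel())
```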

I also added an 'auto' option for the numerical_match parameter: in this case, we use the column types to decide how to encode each column.

What remains is to differentiate this numerical_match='auto' approach from the 'numbers' and 'time' options, where I am unsure what to do. I guess the idea would be to require the selected columns to have a type suitable for encoding (int, float or datetime), and raise an error otherwise?
Or should we just check that a column of the expected type (numeric or datetime) is present, and raise an error if it's missing? WDYT?
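To make the 'auto' idea concrete, here is a small sketch of dispatching on column types. The helper `group_columns_by_dtype` and the example table are hypothetical and are not part of skrub; they only illustrate the kind of type-based split described above.

```python
import pandas as pd


def group_columns_by_dtype(table, cols):
    """Hypothetical helper: split `cols` into numeric, datetime and string-like."""
    subset = table[cols]
    num_cols = list(subset.select_dtypes(include="number").columns)
    time_cols = list(subset.select_dtypes(include="datetime").columns)
    # Everything that is neither numeric nor datetime is treated as text here.
    str_cols = [c for c in cols if c not in num_cols and c not in time_cols]
    return {"numeric": num_cols, "datetime": time_cols, "string": str_cols}


table = pd.DataFrame(
    {
        "name": ["Paris", "Lyon"],
        "population": [2_100_000, 520_000],
        "founded": pd.to_datetime(["1900-01-01", "1950-06-01"]),
    }
)
groups = group_columns_by_dtype(table, ["name", "population", "founded"])
print(groups)
```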

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Hi Jovan, here is some feedback on this PR :)

My main remark is that I don't understand the point of numerical_match anymore: we extract embeddings from numerical and datetime values by using the StandardScaler, and from strings by using TF-IDF.

Therefore we use the Euclidean distance in NearestNeighbors in all cases.
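For readers unfamiliar with this pipeline, the following is a simplified sketch of matching on scaled datetimes with a Euclidean nearest-neighbor search. It is not skrub's implementation: the tables, the `to_seconds` helper, and the choice to fit the scaler on the auxiliary table are all assumptions made for this illustration.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

main = pd.DataFrame(
    {"event_time": pd.to_datetime(["2023-06-14 09:58", "2023-06-14 15:02"])}
)
aux = pd.DataFrame(
    {
        "log_time": pd.to_datetime(
            ["2023-06-14 15:00", "2023-06-14 10:00", "2023-06-13 10:00"]
        ),
        "label": ["b", "a", "c"],
    }
)


def to_seconds(col):
    # Hypothetical helper: seconds since the Unix epoch, as a 2D float array.
    seconds = (col - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
    return seconds.to_numpy().reshape(-1, 1)


# Assumption: fit the scaler on the auxiliary table, reuse it for the main table.
scaler = StandardScaler().fit(to_seconds(aux["log_time"]))
aux_enc = scaler.transform(to_seconds(aux["log_time"]))
main_enc = scaler.transform(to_seconds(main["event_time"]))

# For each main row, find the closest auxiliary row in Euclidean distance.
nn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(aux_enc)
distances, indices = nn.kneighbors(main_enc)

matched = aux.iloc[indices.ravel()].reset_index(drop=True)
joined = pd.concat([main, matched], axis=1)
print(joined)
```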

If users want to encode numerical values as strings, shouldn't they transform their numerical values to string before running the fuzzy_join? Otherwise, I feel it's very error-prone.
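The explicit conversion suggested here could look like the sketch below; the table and column names are invented for the example. After the cast, the column is no longer numeric and would be handled like any other text column.

```python
import pandas as pd

# Hypothetical table where a numeric identifier should be matched as text.
df = pd.DataFrame({"zip_code": [75001, 69002], "city": ["Paris", "Lyon"]})

# Cast explicitly before joining, so the intent is visible in user code.
df["zip_code"] = df["zip_code"].astype(str)

print(df["zip_code"].tolist())
print(pd.api.types.is_numeric_dtype(df["zip_code"]))
```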

Also, removing the numerical_match would simplify the logic a lot.

Am I missing something? WDYT?

Member

@LilianBoulard LilianBoulard left a comment

Thanks for the PR!
I have nothing of significance to add, as I'm not super familiar with the workings of fuzzy_join 😅

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

One last comment before it LGTM :)

@@ -430,8 +430,8 @@ def fuzzy_join(
     main_time_cols = main_table[main_cols].select_dtypes(include="datetime").columns
     aux_time_cols = aux_table[aux_cols].select_dtypes(include="datetime").columns

-    main_str_cols = list(set(main_cols) - set(main_num_cols) - set(main_time_cols))
-    aux_str_cols = list(set(aux_cols) - set(aux_num_cols) - set(main_time_cols))
+    main_str_cols = main_table[main_cols].select_dtypes(include="object").columns
Member

also "string" and "category", so include=["string", "category", "object"] ?
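A small sketch of why the broader `include` list can matter. The column names are made up, and the dtypes are set explicitly so the example does not depend on how a given pandas version infers string columns. In the pandas versions I am aware of, `"object"` alone does not select categorical columns, and it may not select columns using the dedicated string dtype either.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "obj_col": pd.Series(["a", "b"], dtype=object),
        "string_col": pd.Series(["a", "b"], dtype="string"),
        "cat_col": pd.Series(["a", "b"], dtype="category"),
        "num_col": [1, 2],
    }
)

# Selecting on "object" only: the categorical column is left out.
object_only = list(df.select_dtypes(include="object").columns)

# Selecting on all three text-like dtypes picks up every text column.
all_text = list(
    df.select_dtypes(include=["string", "category", "object"]).columns
)

print(object_only)
print(all_text)
```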

Member Author

As the test passed, I thought object was sufficient. But if we want to be sure, I'll add all of them.

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

LGTM, waiting for the CI to be green :)

@jovan-stojanovic jovan-stojanovic merged commit 5167963 into skrub-data:main Jun 14, 2023
22 checks passed
@jovan-stojanovic jovan-stojanovic deleted the fuzzy_join_datetime branch September 11, 2023 13:23