division by 0 in fuzzy_join when there are only perfect matches #764

Closed
jeromedockes opened this issue Sep 28, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@jeromedockes
Member

Describe the bug

If we join two tables on columns for which every row has an exact match, all nearest-neighbor distances are 0, so the rescaling of those distances divides by 0; we get a warning and the matching scores are NaN.

Steps/Code to Reproduce

>>> import warnings
>>> warnings.filterwarnings("default")
>>> import pandas as pd
>>> from skrub import fuzzy_join
>>> df1 = pd.DataFrame({"A": [0, 1]})
>>> fuzzy_join(df1, df1, on="A", return_score=True)
/home/jerome/workspace/backedup_repositories/skrub/skrub/_fuzzy_join.py:359: UserWarning: This feature is still experimental.
  warnings.warn("This feature is still experimental.")
/home/jerome/workspace/backedup_repositories/skrub/skrub/_fuzzy_join.py:189: RuntimeWarning: invalid value encountered in divide
  distance = distance / np.max(distance)
   A_x  A_y  matching_score
0    0    0             NaN
1    1    1             NaN
>>> 

Expected Results

No division by 0; the scores are finite numbers (probably equal to 1).
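
One way to get that behaviour, sketched with NumPy only, is to skip the rescaling when the largest distance is 0. The helper name `rescale_distances` and the `1 - distance` score convention are assumptions made for illustration; this is not skrub's actual code.

```python
import numpy as np


def rescale_distances(distance):
    """Scale distances to [0, 1], leaving them unchanged if they are all 0."""
    distance = np.asarray(distance, dtype=float)
    max_distance = np.max(distance)
    if max_distance == 0:
        # Every row has an exact match: there is nothing to rescale, and
        # dividing would compute 0 / 0 = NaN.
        return distance
    return distance / max_distance


# Assumed convention: score = 1 - rescaled distance, so exact matches score 1.
scores = 1 - rescale_distances([0.0, 0.0])
print(scores)  # [1. 1.]
```

The guard only covers the all-zero case; an empty input would still need separate handling.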

Actual Results

Division by 0; the scores are NaN.

Versions

main branch
jeromedockes added the bug label Sep 28, 2023
@GaelVaroquaux
Member

Good catch, this is an important special case.

Ideally, when there are only exact matches, we shouldn't even go down that code path and only use standard matching from the backend (pandas, polars...), as it will be much faster.
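
A rough sketch of that idea with pandas: try a plain exact merge first, and only go to fuzzy matching when some rows are left unmatched. The helper `exact_join_or_none` is hypothetical (not part of skrub); it assumes a single key column with the same name in both tables, unique keys in the right table, and a score of 1 for exact matches.

```python
import pandas as pd


def exact_join_or_none(left, right, on):
    """Return the exact left join if every left row matched, else None."""
    # indicator=True adds a "_merge" column telling where each row came from.
    merged = left.merge(right, on=on, how="left", indicator=True)
    if (merged["_merge"] == "both").all():
        # All rows matched exactly: no distance computation is needed.
        return merged.drop(columns="_merge").assign(matching_score=1.0)
    # Some rows have no exact match: the caller would fall back to fuzzy matching.
    return None


df1 = pd.DataFrame({"A": [0, 1]})
result = exact_join_or_none(df1, df1, on="A")
print(result)
```

The output columns differ from what `fuzzy_join` returns (one `A` column instead of `A_x` and `A_y`); the sketch only shows the fast-path check.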

@jovan-stojanovic
Member

jovan-stojanovic commented Sep 29, 2023

Yes, indeed, you are right. This is IMO equivalent to #730 if we agree that we should rely on the backend for perfect matches.

@jeromedockes
Member Author

This is not the same as when the user requires perfect matches by setting match_score=1. This division by 0 is triggered when all rows happen to have a perfect match in the joined dataframe. It will be fixed by #802.

jeromedockes added a commit to jeromedockes/skrub that referenced this issue Oct 20, 2023
@jeromedockes
Member Author

fixed by #802
