Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEA Adding FuzzyJoin #291

Merged
merged 156 commits into from
Oct 10, 2022
Merged

Conversation

jovan-stojanovic
Copy link
Member

This PR adds the new FuzzyJoin class that allows joining tables with dirty columns.

Co-authored-by: @LeoGrin

Co-authored-by: LeoGrin <leo.grinsztajn@polytechnique.edu>
@jovan-stojanovic jovan-stojanovic changed the title Adding joining feature FEA Adding FuzzyJoin Jul 28, 2022
dirty_cat/fuzzy_join.py Outdated Show resolved Hide resolved
dirty_cat/fuzzy_join.py Outdated Show resolved Hide resolved
dirty_cat/fuzzy_join.py Outdated Show resolved Hide resolved
dirty_cat/test/test_fuzzy_join.py Outdated Show resolved Hide resolved
@LeoGrin
Copy link
Contributor

LeoGrin commented Jul 28, 2022

Awesome! Some additional things which could be useful:

  • benchmark on datasets with typos, on abbreviations (@alexis-cvetkov was saying that we could use the is_abbrevation tag in YAGO for this), and maybe on many to many joins.
  • add a parameter to control for recall / precision tradeoff. It would probably be hard to set so I'm not sure. An idea: print the worst matches so the user knows if he needs more precision.
  • allow custom match dictionary
  • (maybe) add a option to require 100% for certain columns (when we have an id column but it's missing in some cases)

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jul 28, 2022 via email

@GaelVaroquaux
Copy link
Member

Thanks for this exciting PR. Two TODOs:

  • Fix the tests
  • Add an example. I have a hard time reviewing a PR without an example, because I cannot get a feeling on the user experience.

@mjboos
Copy link
Contributor

mjboos commented Sep 2, 2022

Hi @jovan-stojanovic I'm curious why you went for Countvectorizer (e.g. with character n-grams) instead of, for example, something like the SimilarityEncoder.

I guess the "fuzzy" part comes from matching (for example) char n-grams which might still give close distance if there are some misspellings in fields/columns - but dirty_cat also covers this use case with the SimilarityEncoder, no?

(background is that we're thinking about how to do deduplication and how to use fuzzyjoin for that, but for this using string distances would be cool)

Thanks!

@jovan-stojanovic
Copy link
Member Author

jovan-stojanovic commented Sep 2, 2022

Hi @jovan-stojanovic I'm curious why you went for Countvectorizer (e.g. with character n-grams) instead of, for example, something like the SimilarityEncoder.

I guess the "fuzzy" part comes from matching (for example) char n-grams which might still give close distance if there are some misspellings in fields/columns - but dirty_cat also covers this use case with the SimilarityEncoder, no?

(background is that we're thinking about how to do deduplication and how to use fuzzyjoin for that, but for this using string distances would be cool)

Thanks!

Hi @mjboos , thanks for your comment. Both the SimilarityEncoder and FuzzyJoin are based on CountVectorizer, so they are similar in that way. However, we did not use the SimilarityEncoder directly because it can be very slow, because you need to compute for every category distances between them. From the user perspective, this is important. This is why we used Nearest Neighbors as it is much faster.

Here, there is an option return_distance=True that may help for deduplication.

However, you are right that the next challenge would be to add a better precision metric so as to distinguish between good and bad matches. It links to this discussion.

Maybe applying the SimilarityEncoder to n neairest neighbors would be an option then? Certainly better than applying it directly to all categories. This would be added as an option with precision='sim_enc' and a threshold.

In any case, thanks, you can create an issue, to be discussed as an additional feature in future PR's!

@GaelVaroquaux
Copy link
Member

You might new a "plt.tight_layout()" before each plt.show() in the example, as the xlabel and ylabel do not appear

@GaelVaroquaux
Copy link
Member

We're having failing tests that seem unrelated to this PR: https://github.com/dirty-cat/dirty_cat/actions/runs/3175021096/jobs/5176907256

I don't like this: if we have a fragile test suite, we will be chasing more and more failing tests as we add features.

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments.

The thing that actually worries me is that our tests are fragile.

It seems that the test failures that we are witnessing are independent from the PR. @jovan-stojanovic : can you confirm?

dirty_cat/datasets/tests/test_fetching.py Outdated Show resolved Hide resolved
examples/07_joining_tables_with_FuzzyJoin.py Outdated Show resolved Hide resolved
examples/07_joining_tables_with_FuzzyJoin.py Outdated Show resolved Hide resolved
examples/07_joining_tables_with_FuzzyJoin.py Outdated Show resolved Hide resolved
examples/07_joining_tables_with_FuzzyJoin.py Outdated Show resolved Hide resolved
@jovan-stojanovic
Copy link
Member Author

jovan-stojanovic commented Oct 4, 2022

It seems that the test failures that we are witnessing are independent from the PR. @jovan-stojanovic : can you confirm?

Yes, this is something that happened for the first time yesterday. With #371, I will try and resolve the min_hash_encoder (and gap_encoder) test so that we don't fetch data online.
It is definitely linked to the fact that we need to download data from the internet for some tests..
( See #376 and #379 )

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Oct 10, 2022 via email

@GaelVaroquaux
Copy link
Member

You didn't add the dependency to statsmodels in the right place, I fear: https://app.circleci.com/pipelines/github/dirty-cat/dirty_cat/1032/workflows/83c23fe0-dfaf-4b27-aab6-f6a9f3f87cae/jobs/2300

@GaelVaroquaux
Copy link
Member

You need to add the dependency here: https://github.com/dirty-cat/dirty_cat/blob/master/build_tools/circle/build_doc.sh#L126

You should remove it to were you have added it previously.

@GaelVaroquaux
Copy link
Member

Merging! This is a great addition.

@GaelVaroquaux GaelVaroquaux merged commit 142ae1b into skrub-data:master Oct 10, 2022
@LilianBoulard
Copy link
Member

Congrats!!

@jovan-stojanovic jovan-stojanovic deleted the fuzzy_join branch November 17, 2022 07:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants