FEA Adding FuzzyJoin #291
Co-authored-by: LeoGrin <leo.grinsztajn@polytechnique.edu>
Awesome! Some additional things which could be useful:
• allow custom match dictionary
• (maybe) add an option to require a 100% match for certain columns (when we have an id column but it's missing in some cases)
Let's do these two later, in a second PR.
Thanks for the active back and forth, this is great!
Thanks for this exciting PR. Two TODOs:
Hi @jovan-stojanovic, I'm curious why you went for CountVectorizer (e.g. with character n-grams) instead of, for example, something like the
I guess the "fuzzy" part comes from matching (for example) char n-grams, which might still give a close distance if there are some misspellings in fields/columns, but dirty_cat also covers this use case with the SimilarityEncoder, no?
(Background: we're thinking about how to do deduplication and how to use fuzzyjoin for that, and for this, using string distances would be cool.)
Thanks!
Hi @mjboos, thanks for your comment. Both the
Here, there is an option
However, you are right that the next challenge would be to add a better precision metric so as to distinguish between good and bad matches. It links to this discussion. Maybe applying the
In any case, thanks, you can create an issue, to be discussed as an additional feature in future PRs!
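To make the idea in the exchange above concrete, here is a minimal, standard-library-only sketch of why character n-gram counts tolerate misspellings. It is not the code from this PR and does not use CountVectorizer or the FuzzyJoin API; the helper names (`char_ngrams`, `cosine_similarity`, `best_match`), the trigram size, the space padding, and the example strings are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    # Strings too short to produce any n-gram get similarity 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, candidates, n=3):
    """Return (candidate, similarity) for the candidate closest to the query."""
    query_vec = char_ngrams(query, n)
    scored = [(c, cosine_similarity(query_vec, char_ngrams(c, n))) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical dirty value and clean reference values.
    match, score = best_match("Untied States", ["United States", "United Kingdom", "Germany"])
    print(match, round(score, 2))
```

A transposed pair of letters only changes the few n-grams that overlap the typo, so most of the count vector is shared with the correct spelling. The returned similarity can also serve as a crude match-quality score, which relates to the precision question raised in the reply, although choosing a good threshold is a separate problem.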
You might need a "plt.tight_layout()" before each plt.show() in the example, as the xlabel and ylabel do not appear.
We're having failing tests that seem unrelated to this PR: https://github.com/dirty-cat/dirty_cat/actions/runs/3175021096/jobs/5176907256
I don't like this: if we have a fragile test suite, we will be chasing more and more failing tests as we add features.
A few minor comments.
The thing that actually worries me is that our tests are fragile.
It seems that the test failures that we are witnessing are independent from the PR. @jovan-stojanovic: can you confirm?
Yes, this is something that happened for the first time yesterday. With #371, I will try and resolve the min_hash_encoder (and gap_encoder) test so that we don't fetch data online.
I tried using it, but the problem is that it will not work with our current minimal requirements. It requires the statsmodels package along with seaborn to work.
Let's add statsmodels to our doc dependencies
You didn't add the dependency on statsmodels in the right place, I fear: https://app.circleci.com/pipelines/github/dirty-cat/dirty_cat/1032/workflows/83c23fe0-dfaf-4b27-aab6-f6a9f3f87cae/jobs/2300
You need to add the dependency here: https://github.com/dirty-cat/dirty_cat/blob/master/build_tools/circle/build_doc.sh#L126 You should also remove it from where you added it previously.
Merging! This is a great addition.
Congrats!!
This PR adds the new FuzzyJoin class that allows joining tables with dirty columns.
Co-authored-by: @LeoGrin