FEA Adding FuzzyJoin #291

jovan-stojanovic · 2022-07-28T12:53:43Z

This PR adds the new FuzzyJoin class that allows joining tables with dirty columns.

Co-authored-by: @LeoGrin

Co-authored-by: LeoGrin <leo.grinsztajn@polytechnique.edu>

dirty_cat/fuzzy_join.py

dirty_cat/test/test_fuzzy_join.py

LeoGrin · 2022-07-28T18:11:48Z

Awesome! Some additional things which could be useful:

• benchmark on datasets with typos and on abbreviations (@alexis-cvetkov suggested using the is_abbrevation tag in YAGO for this), and maybe on many-to-many joins.
• add a parameter to control the recall/precision tradeoff. It would probably be hard to set, so I'm not sure. An idea: print the worst matches so the user knows whether they need more precision.
• allow a custom match dictionary
• (maybe) add an option to require a 100% match for certain columns (when we have an id column but it's missing in some cases)
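
To make the "print the worst matches" idea concrete, here is a minimal sketch. It is not the PR's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity, and the names `best_match` and `threshold` as well as the toy data are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(name, candidates):
    """Return (candidate, similarity in [0, 1]) for the closest candidate."""
    scored = [
        (cand, SequenceMatcher(None, name.lower(), cand.lower()).ratio())
        for cand in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


# Toy data (hypothetical): "Atlantis" has no real counterpart on the right.
left = ["Untied States", "Frnace", "Atlantis"]
right = ["France", "Germany", "United States", "Italy"]

matches = [(name, *best_match(name, right)) for name in left]

# Show the weakest matches first so a user can judge whether the
# threshold needs to be stricter (more precision) or looser (more recall).
worst_first = sorted(matches, key=lambda m: m[2])
for name, cand, score in worst_first:
    print(f"{name!r} -> {cand!r} (similarity {score:.2f})")

# Hypothetical cut-off: pairs below it are treated as "no match".
threshold = 0.6
kept = [m for m in matches if m[2] >= threshold]
```

Raising `threshold` trades recall for precision; inspecting `worst_first` is one way to pick a value.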

GaelVaroquaux · 2022-07-28T20:12:38Z

• allow custom match dictionary
• (maybe) add an option to require 100% for certain columns (when we have an id column but it's missing in some cases)

Let's do these two later, in a second PR. Thanks for the active back and forth, this is great!

GaelVaroquaux · 2022-08-29T09:40:33Z

Thanks for this exciting PR. Two TODOs:

Fix the tests
Add an example. I have a hard time reviewing a PR without an example, because I cannot get a feeling on the user experience.

mjboos · 2022-09-02T09:00:31Z

Hi @jovan-stojanovic, I'm curious why you went for CountVectorizer (e.g. with character n-grams) instead of, for example, something like the SimilarityEncoder.

I guess the "fuzzy" part comes from matching (for example) char n-grams which might still give close distance if there are some misspellings in fields/columns - but dirty_cat also covers this use case with the SimilarityEncoder, no?

(background is that we're thinking about how to do deduplication and how to use fuzzyjoin for that, but for this using string distances would be cool)

Thanks!

jovan-stojanovic · 2022-09-02T10:06:40Z

Hi @mjboos, thanks for your comment. Both the SimilarityEncoder and FuzzyJoin are based on CountVectorizer, so they are similar in that respect. However, we did not use the SimilarityEncoder directly because it can be very slow: it has to compute the distances between every pair of categories. From the user's perspective, speed matters, which is why we used nearest neighbors instead; it is much faster.

Here, there is an option return_distance=True that may help for deduplication.
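
As a rough illustration of the approach described above (character n-gram counts followed by a nearest-neighbor search), here is a minimal sketch using scikit-learn directly. It is not the code from this PR: the toy data, the `char_wb` analyzer, the n-gram range, and the cosine metric are all assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy data (hypothetical): the left column contains typos.
left = ["Untied States", "Frnace", "Germany"]
right = ["France", "Germany", "United States", "Italy"]

# Character n-grams keep misspelled strings close to their clean version.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
right_vecs = vectorizer.fit_transform(right)
left_vecs = vectorizer.transform(left)

# One nearest-neighbor query per left row, instead of scoring every
# left/right pair with a string similarity.
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(right_vecs)
distances, indices = nn.kneighbors(left_vecs, return_distance=True)

matches = [
    (name, right[idx[0]], float(dist[0]))
    for name, idx, dist in zip(left, indices, distances)
]
for name, match, dist in matches:
    print(f"{name!r} -> {match!r} (cosine distance {dist:.3f})")
```

The returned distances are what a deduplication step could threshold on: an exact duplicate has a distance of (almost) zero.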

However, you are right that the next challenge would be to add a better precision metric so as to distinguish between good and bad matches. It links to this discussion.

Maybe applying the SimilarityEncoder to the n nearest neighbors would be an option then? It would certainly be better than applying it directly to all categories. This could be added as an option with precision='sim_enc' and a threshold.
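
A sketch of what such a two-stage scheme could look like: a cheap nearest-neighbor search proposes a few candidates, and a finer string similarity re-ranks only those candidates. This is an assumption-laden illustration, not an existing option: `difflib.SequenceMatcher` stands in for the SimilarityEncoder, and `threshold`, `similarity`, and the toy data are hypothetical.

```python
from difflib import SequenceMatcher

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

left = ["Untied States", "Frnace"]
right = ["France", "Germany", "United States", "United Kingdom", "Italy"]

# Stage 1: cheap candidate search on character n-gram counts.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
right_vecs = vectorizer.fit_transform(right)
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(right_vecs)
_, candidate_idx = nn.kneighbors(vectorizer.transform(left))


def similarity(a, b):
    """Finer (and slower) string similarity in [0, 1]; a stand-in only."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Stage 2: re-rank only the few candidates, then apply a threshold.
threshold = 0.8  # hypothetical value
refined = {}
for name, idx in zip(left, candidate_idx):
    best = max((right[j] for j in idx), key=lambda cand: similarity(name, cand))
    refined[name] = best if similarity(name, best) >= threshold else None
```

The expensive similarity is computed for 3 candidates per row here rather than for every category.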

In any case, thanks! Feel free to create an issue so that this can be discussed as an additional feature in future PRs.

dirty_cat/_fuzzy_join.py

benchmarks/fuzzy_join_benchmark.py

examples/07_joining_tables_with_FuzzyJoin.py

GaelVaroquaux · 2022-10-03T14:31:57Z

You might need a "plt.tight_layout()" before each plt.show() in the example, as the xlabel and ylabel do not appear.
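
For reference, a minimal sketch of the pattern; the data and axis labels are made up, and the Agg backend is selected only so the snippet also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(3, 2))
ax.scatter([1, 2, 3], [2, 4, 5])
ax.set_xlabel("GDP per capita (log)")  # hypothetical labels
ax.set_ylabel("Happiness score")

# Recompute the subplot margins so the axis labels fit inside the figure;
# on small figures they can otherwise be clipped.
plt.tight_layout()
plt.show()
```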

GaelVaroquaux · 2022-10-03T19:52:56Z

We're having failing tests that seem unrelated to this PR: https://github.com/dirty-cat/dirty_cat/actions/runs/3175021096/jobs/5176907256

I don't like this: if we have a fragile test suite, we will be chasing more and more failing tests as we add features.

GaelVaroquaux

A few minor comments.

The thing that actually worries me is that our tests are fragile.

It seems that the test failures that we are witnessing are independent from the PR. @jovan-stojanovic : can you confirm?

dirty_cat/datasets/tests/test_fetching.py

examples/07_joining_tables_with_FuzzyJoin.py

jovan-stojanovic · 2022-10-04T07:46:09Z

It seems that the test failures that we are witnessing are independent from the PR. @jovan-stojanovic : can you confirm?

Yes, this happened for the first time yesterday. With #371, I will try to fix the min_hash_encoder (and gap_encoder) tests so that we don't fetch data online.
It is definitely linked to the fact that some tests need to download data from the internet.
(See #376 and #379.)
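
One common way to keep such tests offline is to replace the download function with a stub that returns a small in-memory dataset. The sketch below is generic and does not reflect the actual fix in the referenced PRs: `fetch_names` and `count_unique_names` are hypothetical, and the stub is swapped in through `mock.patch.dict(globals(), ...)` only so that the snippet is self-contained.

```python
import urllib.request
from unittest import mock


def fetch_names(url):
    """Hypothetical fetcher: downloads one name per line from `url`."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8").splitlines()


def count_unique_names(url):
    """Code under test: it depends on the fetcher, not on the network."""
    return len({name.lower() for name in fetch_names(url)})


def test_count_unique_names_offline():
    # Small in-memory dataset instead of a real download.
    stub = mock.Mock(return_value=["Paris", "paris", "London"])
    with mock.patch.dict(globals(), {"fetch_names": stub}):
        result = count_unique_names("https://example.invalid/names.txt")
    assert result == 2
    stub.assert_called_once()
    return result


offline_result = test_count_unique_names_offline()
```

In a real test suite the same idea is usually written as `mock.patch("package.module.fetch_names")` or with a pytest fixture; the point is that the test no longer fails when the remote server is unreachable.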

GaelVaroquaux · 2022-10-10T11:24:04Z

I tried using it, but the problem is that it will not work with our current minimal requirements. It requires the statsmodels package along with seaborn to work.

Let's add statsmodels to our doc dependencies

examples/07_joining_tables_with_FuzzyJoin.py

GaelVaroquaux · 2022-10-10T12:59:04Z

You didn't add the dependency to statsmodels in the right place, I fear: https://app.circleci.com/pipelines/github/dirty-cat/dirty_cat/1032/workflows/83c23fe0-dfaf-4b27-aab6-f6a9f3f87cae/jobs/2300

GaelVaroquaux · 2022-10-10T13:04:02Z

You need to add the dependency here: https://github.com/dirty-cat/dirty_cat/blob/master/build_tools/circle/build_doc.sh#L126

You should remove it from where you added it previously.

GaelVaroquaux · 2022-10-10T13:24:06Z

Merging! This is a great addition.

LilianBoulard · 2022-10-11T18:24:51Z

Congrats!!

Adding joining feature

58a0a59

Co-authored-by: LeoGrin <leo.grinsztajn@polytechnique.edu>

jovan-stojanovic changed the title ~~Adding joining feature~~ FEA Adding FuzzyJoin Jul 28, 2022

improve test

ba556d3

LeoGrin reviewed Jul 28, 2022

dirty_cat/fuzzy_join.py Outdated

dirty_cat/fuzzy_join.py Outdated

dirty_cat/fuzzy_join.py Outdated

dirty_cat/test/test_fuzzy_join.py Outdated

jovan-stojanovic added 13 commits July 29, 2022 10:23

improve docs and params

997fb92

improve test

c6e66af

update test

4ad1b84

use tfidfvectorizer and add todo

aa5807c

add 2dball precision measure

97f0d69

modify threshold

8212b67

change param name

31c12ba

Add suffixes to overlaping column names

532167b

Add id matching

c884fe4

remove exact matching (for now)

57c120a

fuzzy_join as a function

fe5d340

class to function

7c5a6e8

class to function 2

fbf2864

jovan-stojanovic added 3 commits August 29, 2022 11:45

fix tests

70a5fe9

fix test 2

446caa3

improve join

550bde4

add example

faf2cff

GaelVaroquaux reviewed Sep 2, 2022

dirty_cat/_fuzzy_join.py Outdated

GaelVaroquaux reviewed Sep 2, 2022

benchmarks/fuzzy_join_benchmark.py Outdated

GaelVaroquaux reviewed Sep 2, 2022

examples/07_joining_tables_with_FuzzyJoin.py Outdated

GaelVaroquaux reviewed Sep 2, 2022

examples/07_joining_tables_with_FuzzyJoin.py

jovan-stojanovic added 6 commits October 3, 2022 11:31

typo in example

9bbc0a9

typo in example

8e47b9a

improve plots and rendering

5eb2d79

correct plots

69107da

standardize example

3736d26

improve docs

8d651a6

fix rendering

860ca23

GaelVaroquaux reviewed Oct 3, 2022

improve example

b9e0cb6

jovan-stojanovic added 2 commits October 10, 2022 11:40

improve docs

af40a80

merge with master

b690fa3

improve test coverage

b4a92e5

GaelVaroquaux reviewed Oct 10, 2022

examples/07_joining_tables_with_FuzzyJoin.py Outdated

add statsmodel dependancy

f92ed3c

GaelVaroquaux reviewed Oct 10, 2022

examples/07_joining_tables_with_FuzzyJoin.py Outdated

GaelVaroquaux reviewed Oct 10, 2022

examples/07_joining_tables_with_FuzzyJoin.py Outdated

GaelVaroquaux reviewed Oct 10, 2022

examples/07_joining_tables_with_FuzzyJoin.py Outdated

improve test and example

a4c48e2

jovan-stojanovic added 2 commits October 10, 2022 15:05

add statsmodels dependancy

5e89e87

remove dependancy

480806a

GaelVaroquaux merged commit 142ae1b into skrub-data:master Oct 10, 2022

jovan-stojanovic deleted the fuzzy_join branch November 17, 2022 07:56
