
FEA Deduplication #339

Merged
merged 23 commits into from
Jan 17, 2023
Conversation

@mjboos (Contributor) commented Sep 11, 2022

This PR implements a basic deduplicate function (see the discussion in #306).
It is based on computing n-gram TF-IDF distances between the unique words in a column (the same representation as for fuzzy_join), hierarchically clustering the resulting distance matrix, and then deduplicating by picking the most frequent exemplar in each cluster as the "correct" spelling.
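
The pipeline described in this PR can be sketched roughly as follows. This is a minimal illustration using scikit-learn and SciPy, not the actual PR code; `deduplicate_sketch` and its parameters are made-up names for the sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer


def deduplicate_sketch(data, n_clusters):
    unique_words, counts = np.unique(data, return_counts=True)
    # character n-gram TF-IDF vectors for each unique string
    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(
        unique_words
    )
    # hierarchically cluster the unique strings on their TF-IDF distances
    Z = linkage(vectors.toarray(), method="average", metric="euclidean")
    clusters = fcluster(Z, t=n_clusters, criterion="maxclust")
    # pick the most frequent exemplar per cluster as the "correct" spelling
    translation = {}
    for c in np.unique(clusters):
        mask = clusters == c
        exemplar = unique_words[mask][np.argmax(counts[mask])]
        for word in unique_words[mask]:
            translation[word] = exemplar
    return [translation[w] for w in data]
```

Here the number of clusters is passed explicitly; the actual function may choose it differently (e.g., from the dendrogram), but the exemplar-picking step is the key idea.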

Open design question: should this be a function or a class? It clearly has state (and could be useful when "trained" on a large set of dirty categories and then applied to new incoming data), but saving a translation table in a class would be costly. An alternative would be to go from a clustered distance matrix to a K-means-like representation based on string distances (where each centroid is the most frequent cluster exemplar).

It's currently WIP because I still want to add an example/documentation.

FWIW, this function was immediately useful in my work, but its usefulness depends on the structure of your data:
if there are few underlying categories, which might be corrupted by typos or alternative spellings, then it should work well out of the box.

@jovan-stojanovic (Member)

Hi @mjboos, thanks for opening this PR, exciting work ahead.
Just so you know, we plan to have a sprint on Wednesday during which we will take a closer look at the problem.
We may come up with some comments and answers to your questions.

@mjboos (Contributor, Author) commented Sep 13, 2022

Thanks for the update, @jovan-stojanovic!
I added a bare-bones example (examples/07_deduplication.py) to show a use case for this work; let me know if you have any questions or ideas for future directions.

@mjboos changed the title from "WIP: Deduplication" to "Deduplication" on Oct 10, 2022
@mjboos (Contributor, Author) commented Oct 10, 2022

@jovan-stojanovic @GaelVaroquaux Initially I wanted to extend the example, but I think it's better to do that only after you've given some feedback on the PR and the design questions I raised above.

The tests seem to be failing due to some missing approval/workflow configuration; let me know if it's something on my end.

@GaelVaroquaux (Member)

I approved running the tests.

@GaelVaroquaux (Member)

Trivial comment: the file should be named "_deduplicate.py", to signal that it is private and that people should not be importing from it.

@GaelVaroquaux (Member)

I had a quick look, and overall things look good. I made a bunch of minor comments, but nothing on the overall design. Let me try to think about it more. Yet, so far, so good!

@mjboos (Contributor, Author) commented Oct 18, 2022

Thanks for the feedback, @GaelVaroquaux! I implemented the suggestions.

@jovan-stojanovic (Member) left a comment

Thanks for the great work! Some comments from my side as well.
On your open question, I believe this should be a function, so I agree with what you did.

Comment on lines 99 to 102
if return_translation_table:
    return unrolled_corrections, translation_table
else:
    return unrolled_corrections
Member

I had the same idea for fuzzy_join, but some users pointed out that the simplest option was to return one table, with an additional column containing the "non-translated" values. It may be best to do that here as well. From experience, the user can then more easily inspect the results and correct them if needed. What do you think?

@mjboos (Contributor, Author)

I'm not sure I understand: do you mean returning the translated data as a table with suggested translations (or non-translated values)?
Like translation_table, just for all the data instead of only the unique examples (translation_table contains the unique strings/categories plus suggested translations, or the original value if no translation was suggested)?

Member

Ok, actually, I think the best option is for this function to only return the unrolled_corrections table.
The main problem is that the output format changes depending on parameter values, which is usually not recommended.
A translation table can then easily be extracted by dropping duplicates in the dirty data column.
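
The extraction suggested here is a one-liner with pandas. A minimal sketch, with illustrative column names and toy data (not the PR's code):

```python
import pandas as pd

# toy data: the original dirty column and the deduplicated output
dirty = ["london", "londun", "london", "paris", "parris"]
corrected = ["london", "london", "london", "paris", "paris"]

df = pd.DataFrame({"dirty": dirty, "corrected": corrected})
# dropping duplicates on the dirty column recovers the translation table
translation_table = df.drop_duplicates(subset="dirty").reset_index(drop=True)
print(translation_table)
```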

@mjboos (Contributor, Author)

I agree, it should always return the same output type.
Okay, I will return the table you described above, with translated/corrected and uncorrected values (let me know if I misunderstood).

@jovan-stojanovic (Member) left a comment

Some more comments. This is going to be a great new feature, thanks.

@jovan-stojanovic changed the title from "Deduplication" to "FEA Deduplication" on Nov 16, 2022
@mjboos (Contributor, Author) commented Nov 17, 2022

Thanks for the comments @jovan-stojanovic !
I implemented your suggestions, but I had some additional questions (it took some time to get back to this, life happens).

When I used the function over the past month, I stumbled across another use case:
I often notice dirty categorical data in a database/data warehouse that is too large to download and pass to the function directly. So I first create a table of unique values and their counts and use that as the input (instead of the raw data).
To fix typos, I create the translation table and use it to join in the database.
I'm not sure if this is a workflow you want to support in dirty_cat, but I could adapt the function to accept already-unique strings plus their counts (instead of the raw data/list of strings).
The trade-off is that we overload the parameter (it can be either unique values plus counts, or just a list of potentially duplicated strings), and I'm not sure how often dirty_cat is used with data too large to fit into memory.

@jovan-stojanovic (Member)

Thanks! Interesting use case but there is something I don't understand:

So I already create a table of unique values and their counts and then use this as input (instead of the raw data). To fix typos, I create the translation table and use it to join in the database.

Isn't this exactly what the deduplicate function does? So the only difference would be that you would manually provide the unique words and counts instead of relying on what is in the function's first line, unique_words, counts = np.unique(data, return_counts=True).
But then again, you would need to do this to get the counts and uniques anyway, only outside of the function?

@mjboos (Contributor, Author) commented Nov 21, 2022

Thanks! Interesting use case but there is something I don't understand:

So I already create a table of unique values and their counts and then use this as input (instead of the raw data). To fix typos, I create the translation table and use it to join in the database.

Isn't this exactly what the deduplicate function does? So the only difference would be that you would manually provide the unique words and counts instead of relying on what is in the function's first line, unique_words, counts = np.unique(data, return_counts=True). But then again, you would need to do this to get the counts and uniques anyway, only outside of the function?

Exactly: locally, I allow counts as an optional argument; if it is present, I assume data already refers to unique strings.
I get the counts and uniques via a cloud database (e.g., Google Cloud) and download only these. For me, this is helpful because I have very large datasets in the cloud and don't want to fetch the full raw data; the correction then also happens in the cloud, by uploading the translation_table as a literal table and joining on it.
It's quite useful for me, but I'm not sure how many dirty_cat users follow a similar workflow (hence the trade-off of overloading the parameters vs. supporting this workflow).
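
The aggregated-input workflow described here can be sketched like this. It is a toy stand-in: `np.unique` plays the role of the warehouse-side GROUP BY, the translation table is hard-coded where the real deduplication would run, and the final join is done with pandas instead of SQL:

```python
import numpy as np
import pandas as pd

raw = np.array(["london"] * 5 + ["londun"] + ["paris"] * 3)

# in practice this aggregation runs in the warehouse (SELECT value, COUNT(*) ...)
unique_words, counts = np.unique(raw, return_counts=True)

# suppose deduplicating only the uniques produced this translation table
translation_table = pd.DataFrame(
    {"dirty": unique_words, "corrected": ["london", "london", "paris"]}
)

# applying the fix is then just a join on the dirty column
fixed = pd.DataFrame({"dirty": raw}).merge(
    translation_table, on="dirty", how="left"
)["corrected"]
print(fixed.tolist())
```

The point of the design is that only the (small) unique-value table and the translation table ever leave the warehouse, never the raw rows.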

@jovan-stojanovic (Member)

I see, then the best thing to do, in my opinion, is the following:

  • finish this version of deduplication;
  • add an issue mentioning this, to keep track of it and see if it affects others.

@mjboos (Contributor, Author) commented Dec 11, 2022

I see, then the best thing to do, in my opinion, is the following:

  • finish this version of deduplication;
  • add an issue mentioning this, to keep track of it and see if it affects others.

Sounds good, @jovan-stojanovic.
I changed the function to always return the deduplicated data and adapted the example to show how to re-create the translation table; that should make it easier to use.
After it's merged, I'll create an issue as you suggested.

Let me know if that's enough.

@jovan-stojanovic (Member) left a comment

Thanks, LGTM! I will wait for another review before merging (maybe @GaelVaroquaux).

@LilianBoulard (Member) left a comment

Excellent PR, thanks a lot!

@mjboos (Contributor, Author) commented Dec 16, 2022

Thanks, @jovan-stojanovic and @LilianBoulard!
I accepted all the suggestions apart from the one where I had an additional question :)

@jovan-stojanovic (Member) left a comment

A few remarks came up on our side just before closing this:

Hope this is it this time.

Comment on lines 41 to 64
def generate_example_data(examples, entries_per_example, prob_mistake_per_letter):
    """Helper function to generate data consisting of multiple entries per example.
    Characters are misspelled with probability `prob_mistake_per_letter`."""
    import string

    data = []
    for example, n_ex in zip(examples, entries_per_example):
        len_ex = len(example)
        # generate a 2D array of chars of size (n_ex, len_ex)
        str_as_list = np.array([list(example)] * n_ex)
        # randomly choose which characters are misspelled
        idxes = np.where(
            np.random.random(len_ex * n_ex) < prob_mistake_per_letter
        )[0]
        # and randomly pick with which character to replace them
        replacements = [
            string.ascii_lowercase[i]
            for i in np.random.choice(np.arange(26), len(idxes)).astype(int)
        ]
        # introduce spelling mistakes at the right examples and char locations
        str_as_list[idxes // len_ex, idxes % len_ex] = replacements
        # go back to a 1D array of strings
        data.append(np.ascontiguousarray(str_as_list).view(f"U{len_ex}").ravel())
    return np.concatenate(data)
Member

The best option is to move this function into datasets._fetching and rename it to something like make_deduplication, so that we can use it whenever we want to generate an example for this.
Having a one-line import rather than a full function definition also makes the important features of this example much more visible.

Note: I see you used the same function in the tests, so maybe import it there as well.

Comment on lines +18 to +40
def generate_example_data(examples, entries_per_example, prob_mistake_per_letter, rng):
    """Helper function to generate data consisting of multiple entries per example.
    Characters are misspelled with probability `prob_mistake_per_letter`."""

    data = []
    for example, n_ex in zip(examples, entries_per_example):
        len_ex = len(example)
        # generate a 2D array of chars of size (n_ex, len_ex)
        str_as_list = np.array([list(example)] * n_ex)
        # randomly choose which characters are misspelled
        idxes = np.where(
            rng.random_sample(len_ex * n_ex) < prob_mistake_per_letter
        )[0]
        # and randomly pick with which character to replace them
        replacements = [
            string.ascii_lowercase[i]
            for i in rng.choice(np.arange(26), len(idxes)).astype(int)
        ]
        # introduce spelling mistakes at the right examples and char locations
        str_as_list[idxes // len_ex, idxes % len_ex] = replacements
        # go back to a 1D array of strings
        data.append(np.ascontiguousarray(str_as_list).view(f"U{len_ex}").ravel())
    return np.concatenate(data)
Member

Import this function from datasets, see comment above.

@mjboos (Contributor, Author)

I left the function as it is, because it's slightly different: for ease of testing, I pass the RandomState explicitly here, but for the user-facing function in datasets this makes it unnecessarily complicated IMO.
If you prefer to make it DRY-er, I can add the RandomState argument to the function in datasets as well.

Member

Fine by me. I think the most important thing was to make the example lighter. If this one is different, you can keep it in the tests.

Moritz Boos and others added 21 commits January 17, 2023 11:00
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
@mjboos (Contributor, Author) commented Jan 17, 2023

Sure, I accepted the suggestions and made the function public. I kept the separate function for testing though, @jovan-stojanovic, and explained why above (but I can change it).
I also added a changelog entry, @LilianBoulard; let me know if you want it another way, or if you want to add attribution to the suggesters of the changes as well.

@jovan-stojanovic (Member) left a comment

Thanks, will merge when the tests pass!

@jovan-stojanovic (Member)

The test_docstrings.py tests are failing (you should update the docstrings to meet the numpydoc standard).

@mjboos (Contributor, Author) commented Jan 17, 2023

The test_docstrings.py tests are failing (you should update the docstrings to meet the numpydoc standard).

Oops, fixed it.

@jovan-stojanovic (Member)

It's green, merging. Thanks!

@jovan-stojanovic jovan-stojanovic merged commit c545ed9 into skrub-data:main Jan 17, 2023
LeoGrin pushed a commit to LeoGrin/skrub that referenced this pull request Jan 17, 2023
* added deduplication functions

* add docstrings

* variable renames

* fix type hints to support older numpy versions

* fix naming and add tests

* formatting and docstrings

* simple example

* new example number

* code changes and example rst formatting

* feedback on PR

* fix import in doc

* always return unrolled corrections

* Update dirty_cat/_deduplicate.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/06_deduplication.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/06_deduplication.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/06_deduplication.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/06_deduplication.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update dirty_cat/_deduplicate.py

Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>

* Update examples/06_deduplication.py

Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>

* Update examples/06_deduplication.py

Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>

* requested changes

* change-log entry

* make docstring numpy style compliant

Co-authored-by: Moritz Boos <m.boos@eyeo.com>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>