FEA Deduplication #339
Conversation
Hi @mjboos, thanks for opening this PR, exciting work ahead.

Thanks for the update @jovan-stojanovic!

@jovan-stojanovic @GaelVaroquaux Initially I wanted to extend the example, but I think it's better to do that only after you have given some feedback on the PR/design questions I raised above. It seems the test is failing due to some missing approval/workflow configuration; let me know if it's something on my end.

I approved running the tests.

Trivial comment: the file should be named "_deduplicate.py", to signal that it is private and that people should not be importing from it.

I had a quick look, and overall things look good. I made a bunch of minor comments, but nothing on the overall design. Let me try to think about it more. Yet, so far, so good!

Thanks for the feedback @GaelVaroquaux, I implemented the suggestions.
Thanks for the great work! Some comments from my side as well.
On your open question, I believe this should be a function, so I agree with what you did.
dirty_cat/_deduplicate.py
```python
if return_translation_table:
    return unrolled_corrections, translation_table
else:
    return unrolled_corrections
```
I had the same idea for fuzzy_join, but some users pointed out that the simplest option is to return one table, with an additional column containing the "non-translated" values. It may be best to do that here as well. From experience, the user can then more easily inspect the results and correct them if needed. What do you think?
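The single-table output suggested here could look like the following sketch; the column names (`dirty`, `corrected`) and the toy translation table are invented for illustration, not dirty_cat's actual API:

```python
import pandas as pd

dirty = pd.Series(["blck", "black", "whte", "black"], name="dirty")
corrections = {"blck": "black", "whte": "white"}  # toy translation table

# one table for all rows: the original value, plus the corrected spelling
# (unchanged where no correction was suggested)
result = pd.DataFrame({
    "dirty": dirty,
    "corrected": dirty.map(lambda s: corrections.get(s, s)),
})
```

This way the output shape is the same regardless of parameters, and the user can eyeball rows where `dirty != corrected`.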
I'm not sure I understand: you mean returning the translated data as a table with suggested translations (or non-translated values)? Like `translation_table`, just for all the data instead of only the unique examples (`translation_table` contains the unique strings/categories plus suggested translations, or the original string if no translation was suggested)?
OK, actually, I think the best option is for this function to only return the `unrolled_corrections` table. The main problem is that the output format changes depending on parameter values, which is usually not recommended. A translation table can then easily be extracted by dropping duplicates in the dirty-data column.
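A minimal sketch of that extraction, assuming pandas and invented column names (`dirty`, `corrected`): the translation table is recovered from the full corrected output by dropping duplicates in the dirty-data column.

```python
import pandas as pd

# toy stand-in for the full per-row output of the deduplication function
unrolled_corrections = pd.DataFrame({
    "dirty": ["blck", "black", "whte", "white", "blck"],
    "corrected": ["black", "black", "white", "white", "black"],
})

# one row per unique dirty value -> its suggested correction
translation_table = (
    unrolled_corrections.drop_duplicates(subset="dirty").reset_index(drop=True)
)
```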
I agree, it should always return the same output type.
Okay, I will return the table you described above, with translated/corrected and uncorrected values (let me know if I misunderstood).
Some more comments. This is going to be a great new feature, thanks!
Thanks for the comments @jovan-stojanovic! When I used the function in the past month, I stumbled across another use case:

Thanks! Interesting use case, but there is something I don't understand:
Isn't this exactly what the

Exactly, (locally) I allow

I see; the best thing to do, in my opinion, is the following:

Sounds good @jovan-stojanovic. Let me know if that's enough.
Thanks, LGTM! I will wait for another review before merging (maybe @GaelVaroquaux).
Excellent PR, thanks a lot!
Thanks @jovan-stojanovic @LilianBoulard
A few remarks came up on our side just before closing this:

Hope this is it this time!
examples/06_deduplication.py
```python
import string

import numpy as np


def generate_example_data(examples, entries_per_example, prob_mistake_per_letter):
    """Helper function to generate data consisting of multiple entries per example.

    Characters are misspelled with probability `prob_mistake_per_letter`.
    """
    data = []
    for example, n_ex in zip(examples, entries_per_example):
        len_ex = len(example)
        # generate a 2D array of chars of size (n_ex, len_ex)
        str_as_list = np.array([list(example)] * n_ex)
        # randomly choose which characters are misspelled
        # (the flat index space covers all n_ex * len_ex characters)
        idxes = np.where(
            np.random.random(len_ex * n_ex) < prob_mistake_per_letter
        )[0]
        # and randomly pick which character to replace them with
        replacements = [
            string.ascii_lowercase[i]
            for i in np.random.choice(np.arange(26), len(idxes)).astype(int)
        ]
        # introduce spelling mistakes at the right examples and char locations
        str_as_list[idxes // len_ex, idxes % len_ex] = replacements
        # go back to a 1D array of strings
        data.append(np.ascontiguousarray(str_as_list).view(f"U{len_ex}").ravel())
    return np.concatenate(data)
```
The best option is to move this function into `datasets._fetching` and rename it to something like `make_deduplication`, so that we can use it whenever we want to generate an example for this. Having a one-line import rather than a full function definition also makes the important parts of this example much more visible.

Note: I see you used the same function in the tests, so maybe import it there as well.
```python
import string

import numpy as np


def generate_example_data(examples, entries_per_example, prob_mistake_per_letter, rng):
    """Helper function to generate data consisting of multiple entries per example.

    Characters are misspelled with probability `prob_mistake_per_letter`.
    """
    data = []
    for example, n_ex in zip(examples, entries_per_example):
        len_ex = len(example)
        # generate a 2D array of chars of size (n_ex, len_ex)
        str_as_list = np.array([list(example)] * n_ex)
        # randomly choose which characters are misspelled
        # (the flat index space covers all n_ex * len_ex characters)
        idxes = np.where(
            rng.random_sample(len_ex * n_ex) < prob_mistake_per_letter
        )[0]
        # and randomly pick which character to replace them with
        replacements = [
            string.ascii_lowercase[i]
            for i in rng.choice(np.arange(26), len(idxes)).astype(int)
        ]
        # introduce spelling mistakes at the right examples and char locations
        str_as_list[idxes // len_ex, idxes % len_ex] = replacements
        # go back to a 1D array of strings
        data.append(np.ascontiguousarray(str_as_list).view(f"U{len_ex}").ravel())
    return np.concatenate(data)
```
Import this function from `datasets`; see the comment above.
I left the function in, because it's slightly different: for ease of testing, I pass the RandomState explicitly here, but for the user-facing function in `datasets` this makes things unnecessarily complicated IMO. If you prefer to make it DRY-er, I can add the RandomState argument to the function in `datasets` as well.
Fine for me. I think the most important thing was to make the example lighter. If this one is different, you can keep it in the tests.
Sure, I accepted the suggestions and made the function public. I kept the separate function for testing, though, @jovan-stojanovic, and detailed why above (but I can change it).
Thanks, will merge when the tests pass!
The test_docstrings.py tests are failing (you should update the docstrings to meet the numpydoc standard).
Oops, fixed it.

It's green, merging. Thanks!
* added deduplication functions
* add docstrings
* variable renames
* fix type hints to support older numpy versions
* fix naming and add tests
* formatting and docstrings
* simple example
* new example number
* code changes and example rst formatting
* feedback on PR
* fix import in doc
* always return unrolled corrections
* Update dirty_cat/_deduplicate.py
* Update examples/06_deduplication.py
* Update examples/06_deduplication.py
* Update examples/06_deduplication.py
* Update examples/06_deduplication.py
* Update dirty_cat/_deduplicate.py
* Update examples/06_deduplication.py
* Update examples/06_deduplication.py
* requested changes
* change-log entry
* make docstring numpy style compliant

Co-authored-by: Moritz Boos <m.boos@eyeo.com>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
This PR implements a basic `deduplicate` function (see the discussion in #306). It is based on computing n-gram TF-IDF distances between the unique words in a column (the same as for `fuzzy_join`), hierarchically clustering the distance matrix, and then deduplicating by picking the most frequent exemplar in each cluster as the "correct" spelling.

Open design question: should this be a function or a class? It clearly has state (and could be useful when "trained" on a large set of dirty categories and then applied to new incoming data), but saving a translation table in a class would be costly. An alternative would be to go from a clustered distance matrix to a K-means-like representation based on string distances (where each centroid is the most frequent cluster exemplar).

It's currently WIP because I still want to add an example/documentation.

FWIW, this function was immediately useful in my work, but its usefulness depends on the structure of your data: if there are few underlying categories, which might be corrupted by typos or alternative spellings, then it should work well out of the box.
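The approach described above can be sketched end to end as follows. This is a rough illustration, not dirty_cat's actual implementation: the function name, parameters, and n-gram range are invented, and the exact distance/linkage choices differ from the library's.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer


def deduplicate_sketch(data, n_clusters):
    """Map each string in `data` to the most frequent exemplar of its cluster."""
    unique, counts = np.unique(data, return_counts=True)
    # character n-gram TF-IDF vectors for the unique strings
    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(unique)
    # hierarchically cluster the vectors and cut the tree into n_clusters clusters
    Z = linkage(vectors.toarray(), method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    translation = {}
    for label in np.unique(labels):
        members = unique[labels == label]
        # the most frequent spelling in each cluster is taken as "correct"
        correct = members[np.argmax(counts[labels == label])]
        for member in members:
            translation[member] = correct
    return [translation[x] for x in data]
```

For example, on `["black"] * 5 + ["blck"] * 2 + ["white"] * 4 + ["whte"]` with `n_clusters=2`, the misspellings cluster with their frequent variants and get mapped to "black" and "white".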