
FEA Deduplication #339

Merged
merged 23 commits into from
Jan 17, 2023
Conversation

@mjboos (Contributor) commented Sep 11, 2022

This PR implements a basic deduplicate function (see the discussion in #306).
It is based on computing n-gram TF-IDF distances between the unique words in a column (the same representation as for fuzzy_join), hierarchically clustering the resulting distance matrix, and then deduplicating by picking the most frequent exemplar in each cluster as the "correct" spelling.
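
The pipeline described in this PR can be sketched roughly as follows. This is a minimal illustration using scikit-learn and SciPy, not the actual PR code; `deduplicate_sketch` and its parameters are made-up names for the sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer


def deduplicate_sketch(data, n_clusters):
    unique_words, counts = np.unique(data, return_counts=True)
    # character n-gram TF-IDF vectors for each unique string
    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(
        unique_words
    )
    # hierarchically cluster the unique strings on their TF-IDF distances
    Z = linkage(vectors.toarray(), method="average", metric="euclidean")
    clusters = fcluster(Z, t=n_clusters, criterion="maxclust")
    # pick the most frequent exemplar per cluster as the "correct" spelling
    translation = {}
    for c in np.unique(clusters):
        mask = clusters == c
        exemplar = unique_words[mask][np.argmax(counts[mask])]
        for word in unique_words[mask]:
            translation[word] = exemplar
    return [translation[w] for w in data]
```

Here the number of clusters is passed explicitly; the actual function may choose it differently (e.g., from the dendrogram), but the exemplar-picking step is the key idea.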

Open design question: should this be a function or a class? It clearly has state (and could be useful when "trained" on a large set of dirty categories and then applied to new incoming data), but saving a translation table in a class would be costly. An alternative would be to go from a clustered distance matrix to a K-means-like representation based on string distances (where each centroid is the most frequent cluster exemplar).

It's currently WIP because I still want to add an example/documentation.

FWIW, this function was immediately useful in my work, but its usefulness depends on the structure of your data:
if there are few underlying categories, which might be corrupted by typos or alternative spellings, then it should work well out of the box.

@jovan-stojanovic (Member)

Hi @mjboos, thanks for opening this PR, exciting work ahead.
Just so you know, we plan to have a sprint on Wednesday during which we will take a closer look at the problem.
We may come up with some comments and answers to your questions.

@mjboos (Contributor, Author) commented Sep 13, 2022

Thanks for the update, @jovan-stojanovic!
I added a bare-bones example (examples/07_deduplication.py) to show a use case for this work; let me know if you have any questions or ideas for future directions.

@mjboos changed the title from "WIP: Deduplication" to "Deduplication" on Oct 10, 2022
@mjboos (Contributor, Author) commented Oct 10, 2022

@jovan-stojanovic @GaelVaroquaux Initially I wanted to extend the example, but I think it's better to do that only after you've given some feedback on the PR and the design questions I raised above.

The tests seem to be failing due to some missing approval/workflow configuration; let me know if it's something on my end.

@GaelVaroquaux (Member)

I approved running the tests.

@GaelVaroquaux (Member)

Trivial comment: the file should be named "_deduplicate.py", to signal that it is private and that people should not be importing from it.

@GaelVaroquaux (Member)

I had a quick look, and overall things look good. I made a bunch of minor comments, but nothing on the overall design. Let me try to think about it more. Yet, so far, so good!

@mjboos (Contributor, Author) commented Oct 18, 2022

Thanks for the feedback, @GaelVaroquaux! I implemented the suggestions.

@jovan-stojanovic (Member) left a comment

Thanks for the great work! Some comments from my side as well.
On your open question, I believe this should be a function, so I agree with what you did.

Comment on lines 99 to 102
if return_translation_table:
    return unrolled_corrections, translation_table
else:
    return unrolled_corrections
Member

I had the same idea for fuzzy_join, but some users pointed out that the simplest option was to return one table, with an additional column containing the "non-translated" values. It may be best to do that here as well. From experience, the user can then more easily inspect the results and correct them if needed. What do you think?

@mjboos (Contributor, Author)

I'm not sure I understand: do you mean returning the translated data as a table with suggested translations (or non-translated values)?
Like translation_table, just for all the data instead of only the unique examples (translation_table contains the unique strings/categories plus suggested translations, or the original value if no translation was suggested)?

Member

Ok, actually, I think the best option is for this function to only return the unrolled_corrections table.
The main problem is that the output format changes depending on parameter values, which is usually not recommended.
A translation table can then easily be extracted by dropping duplicates in the dirty data column.
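
The extraction suggested here is a one-liner with pandas. A minimal sketch, with illustrative column names and toy data (not the PR's code):

```python
import pandas as pd

# toy data: the original dirty column and the deduplicated output
dirty = ["london", "londun", "london", "paris", "parris"]
corrected = ["london", "london", "london", "paris", "paris"]

df = pd.DataFrame({"dirty": dirty, "corrected": corrected})
# dropping duplicates on the dirty column recovers the translation table
translation_table = df.drop_duplicates(subset="dirty").reset_index(drop=True)
print(translation_table)
```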

@mjboos (Contributor, Author)

I agree, it should always return the same output type.
Okay, I will return the table you described above, with translated/corrected and uncorrected values (let me know if I misunderstood).

@jovan-stojanovic (Member) left a comment

Some more comments. This is going to be a great new feature, thanks.

@jovan-stojanovic changed the title from "Deduplication" to "FEA Deduplication" on Nov 16, 2022
@mjboos (Contributor, Author) commented Nov 17, 2022

Thanks for the comments @jovan-stojanovic !
I implemented your suggestions, but I had some additional questions (it took some time to get back to this, life happens).

When I used the function over the past month, I stumbled across another use case:
I often notice dirty categorical data in a database/data warehouse that is too large to download and pass to the function directly. So I first create a table of unique values and their counts and use that as the input (instead of the raw data).
To fix typos, I create the translation table and use it to join in the database.
I'm not sure if this is a workflow you want to support in dirty_cat, but I could adapt the function to accept already-unique strings plus their counts (instead of the raw data/list of strings).
The trade-off is that we overload the parameter (it can be either unique values plus counts, or just a list of potentially duplicated strings), and I'm not sure how often dirty_cat is used with data too large to fit into memory.

@jovan-stojanovic (Member)

Thanks! Interesting use case but there is something I don't understand:

So I already create a table of unique values and their counts and then use this as input (instead of the raw data). To fix typos, I create the translation table and use it to join in the database.

Isn't this exactly what the deduplicate function does? So the only difference would be that you would manually provide the unique words and counts instead of relying on what is in the function's first line, unique_words, counts = np.unique(data, return_counts=True).
But then again, you would need to do this to get the counts and uniques anyway, only outside of the function?

@mjboos (Contributor, Author) commented Nov 21, 2022

Thanks! Interesting use case but there is something I don't understand:

So I already create a table of unique values and their counts and then use this as input (instead of the raw data). To fix typos, I create the translation table and use it to join in the database.

Isn't this exactly what the deduplicate function does? So the only difference would be that you would manually provide the unique words and counts instead of relying on what is in the function's first line, unique_words, counts = np.unique(data, return_counts=True). But then again, you would need to do this to get the counts and uniques anyway, only outside of the function?

Exactly: locally, I allow counts as an optional argument; if it is present, I assume data already refers to unique strings.
I get the counts and uniques via a cloud database (e.g., Google Cloud) and download only these. For me, this is helpful because I have very large datasets in the cloud and don't want to fetch the full raw data; the correction then also happens in the cloud, by uploading the translation_table as a literal table and joining on it.
It's quite useful for me, but I'm not sure how many dirty_cat users follow a similar workflow (hence the trade-off of overloading the parameters vs. supporting this workflow).
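
The aggregated-input workflow described here can be sketched like this. It is a toy stand-in: `np.unique` plays the role of the warehouse-side GROUP BY, the translation table is hard-coded where the real deduplication would run, and the final join is done with pandas instead of SQL:

```python
import numpy as np
import pandas as pd

raw = np.array(["london"] * 5 + ["londun"] + ["paris"] * 3)

# in practice this aggregation runs in the warehouse (SELECT value, COUNT(*) ...)
unique_words, counts = np.unique(raw, return_counts=True)

# suppose deduplicating only the uniques produced this translation table
translation_table = pd.DataFrame(
    {"dirty": unique_words, "corrected": ["london", "london", "paris"]}
)

# applying the fix is then just a join on the dirty column
fixed = pd.DataFrame({"dirty": raw}).merge(
    translation_table, on="dirty", how="left"
)["corrected"]
print(fixed.tolist())
```

The point of the design is that only the (small) unique-value table and the translation table ever leave the warehouse, never the raw rows.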

@jovan-stojanovic (Member)

I see, then the best thing to do, in my opinion, is the following:

  • finish this version of deduplication;
  • add an issue mentioning this, to keep track of it and see if it affects others.

@mjboos (Contributor, Author) commented Dec 11, 2022

I see, then the best thing to do, in my opinion, is the following:

  • finish this version of deduplication;
  • add an issue mentioning this, to keep track of it and see if it affects others.

Sounds good, @jovan-stojanovic.
I changed the function to always return the deduplicated data and adapted the example to show how to re-create the translation table; that should make it easier to use.
After it's merged, I'll create an issue as you suggested.

Let me know if that's enough.

@jovan-stojanovic (Member) left a comment

Thanks, LGTM! I will wait for another review before merging (maybe @GaelVaroquaux).

@LilianBoulard (Member) left a comment

Excellent PR, thanks a lot!

@mjboos (Contributor, Author) commented Dec 16, 2022

Thanks, @jovan-stojanovic and @LilianBoulard!
I accepted all the suggestions apart from the one where I had an additional question :)

@jovan-stojanovic (Member) left a comment

A few remarks came up on our side just before closing this:

Hope this is it this time.

Comment on lines 41 to 64
def generate_example_data(examples, entries_per_example, prob_mistake_per_letter):
    """Helper function to generate data consisting of multiple entries per example.
    Characters are misspelled with probability `prob_mistake_per_letter`."""
    import string

    data = []
    for example, n_ex in zip(examples, entries_per_example):
        len_ex = len(example)
        # generate a 2D array of chars of size (n_ex, len_ex)
        str_as_list = np.array([list(example)] * n_ex)
        # randomly choose which characters are misspelled
        idxes = np.where(
            np.random.random(len_ex * n_ex) < prob_mistake_per_letter
        )[0]
        # and randomly pick with which character to replace them
        replacements = [
            string.ascii_lowercase[i]
            for i in np.random.choice(np.arange(26), len(idxes)).astype(int)
        ]
        # introduce spelling mistakes at the right examples and char locations
        str_as_list[idxes // len_ex, idxes % len_ex] = replacements
        # go back to a 1D array of strings
        data.append(np.ascontiguousarray(str_as_list).view(f"U{len_ex}").ravel())
    return np.concatenate(data)
Member

The best option is to move this function into datasets._fetching and rename it to something like make_deduplication, so that we can use it whenever we want to generate an example for this.
Having a one-line import rather than a full function definition also makes the important features of this example much more visible.

Note: I see you used the same function in the tests, so maybe import it there as well.

Comment on lines +18 to +40
def generate_example_data(examples, entries_per_example, prob_mistake_per_letter, rng):
    """Helper function to generate data consisting of multiple entries per example.
    Characters are misspelled with probability `prob_mistake_per_letter`."""

    data = []
    for example, n_ex in zip(examples, entries_per_example):
        len_ex = len(example)
        # generate a 2D array of chars of size (n_ex, len_ex)
        str_as_list = np.array([list(example)] * n_ex)
        # randomly choose which characters are misspelled
        idxes = np.where(
            rng.random_sample(len_ex * n_ex) < prob_mistake_per_letter
        )[0]
        # and randomly pick with which character to replace them
        replacements = [
            string.ascii_lowercase[i]
            for i in rng.choice(np.arange(26), len(idxes)).astype(int)
        ]
        # introduce spelling mistakes at the right examples and char locations
        str_as_list[idxes // len_ex, idxes % len_ex] = replacements
        # go back to a 1D array of strings
        data.append(np.ascontiguousarray(str_as_list).view(f"U{len_ex}").ravel())
    return np.concatenate(data)
Member

Import this function from datasets, see comment above.

@mjboos (Contributor, Author)

I left the function as it is, because it's slightly different: for ease of testing, I pass the RandomState explicitly here, but for the user-facing function in datasets this makes it unnecessarily complicated IMO.
If you prefer to make it DRY-er, I can add the RandomState argument to the function in datasets as well.

Member

Fine by me. I think the most important thing was to make the example lighter. If this one is different, you can keep it in the tests.

Moritz Boos and others added 21 commits January 17, 2023 11:00
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
@mjboos (Contributor, Author) commented Jan 17, 2023

Sure, I accepted the suggestions and made the function public. I kept the separate function for testing though, @jovan-stojanovic, and explained why above (but I can change it).
I also added a changelog entry, @LilianBoulard; let me know if you want it another way, or if you want to add attribution to the suggesters of the changes as well.

@jovan-stojanovic (Member) left a comment

Thanks, will merge when the tests pass!

@jovan-stojanovic (Member)

The test_docstrings.py tests are failing (you should update the docstrings to meet the numpydoc standard).

@mjboos (Contributor, Author) commented Jan 17, 2023

The test_docstrings.py tests are failing (you should update the docstrings to meet the numpydoc standard).

Oops, fixed it.

@jovan-stojanovic (Member)

It's green, merging. Thanks!

@jovan-stojanovic jovan-stojanovic merged commit c545ed9 into skrub-data:main Jan 17, 2023
LeoGrin pushed a commit to LeoGrin/skrub that referenced this pull request Jan 17, 2023
* added deduplication functions

* add docstrings

* variable renames

* fix type hints to support older numpy versions

* fix naming and add tests

* formatting and docstrings

* simple example

* new example number

* code changes and example rst formatting

* feedback on PR

* fix import in doc

* always return unrolled corrections

* Update dirty_cat/_deduplicate.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/06_deduplication.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/06_deduplication.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/06_deduplication.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update examples/06_deduplication.py

Co-authored-by: Lilian <lilian@boulard.fr>

* Update dirty_cat/_deduplicate.py

Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>

* Update examples/06_deduplication.py

Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>

* Update examples/06_deduplication.py

Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>

* requested changes

* change-log entry

* make docstring numpy style compliant

Co-authored-by: Moritz Boos <m.boos@eyeo.com>
Co-authored-by: Lilian <lilian@boulard.fr>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>