Exclude a pseudonym if the name exists in the actual data #97
Codecov Report
✅ All modified and coverable lines are covered by tests.

| Coverage Diff | main | #97 | +/- |
|---|---|---|---|
| Coverage | 99.63% | 99.57% | -0.06% |
| Files | 13 | 13 | |
| Lines | 2191 | 2364 | +173 |
| Hits | 2183 | 2354 | +171 |
| Misses | 8 | 10 | +2 |
This is all on code that is executed in the tests. Addressing this would require refactoring the tests and functions some more; however, I think it is OK for the upcoming release (it is quite normal that tests repeat code, especially when you test functionality that is interdependent). I believe a refactor would make more sense at a later stage, after mailcom has been tested by users other than us.
Pull Request Overview
This PR implements a feature to exclude pseudonyms that match actual names found in the text being processed. The system now detects when a pseudonym matches a real person's name in the content and removes that pseudonym from the available list to prevent accidental preservation of real names.
- Adds logic to check if pseudonyms match actual detected person names and exclude them
- Implements re-processing capability when pseudonyms need to be excluded
- Updates method signatures to return exclusion status alongside pseudonymized content
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| mailcom/parse.py | Core implementation of pseudonym checking and exclusion logic with new helper methods |
| mailcom/main.py | Integration of pseudonym exclusion checking into the main processing workflow |
| mailcom/test/test_parse.py | Unit tests for new pseudonym checking functionality and updated method signatures |
| mailcom/test/test_main.py | Integration tests covering various scenarios of matching pseudonyms |
    print("Found matching name(s) from pseudonyms to actual person names.")
    print(f"Names found: {names}")
    print(f"Pseudonyms provided: {self.pseudo_first_names.get(lang, [])}")
Debug print statements should be replaced with proper logging using the logging module. Print statements in production code can clutter output and make debugging harder.
        for pseudo in self.pseudo_first_names[lang]
        if pseudo not in names
    ]
    print(f"Updated pseudonyms: {self.pseudo_first_names.get(lang, [])}")
Debug print statements should be replaced with proper logging using the logging module. Print statements in production code can clutter output and make debugging harder.
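A minimal sketch of what this suggestion could look like, assuming a module-level logger. The helper name `report_pseudonym_matches`, its parameters, and the chosen log levels are hypothetical and not taken from mailcom.

```python
import logging

logger = logging.getLogger(__name__)


def report_pseudonym_matches(names, pseudonyms):
    """Log pseudonyms that collide with detected names and return the rest.

    Hypothetical helper for illustration only; not part of mailcom.
    """
    matches = [pseudo for pseudo in pseudonyms if pseudo in names]
    if matches:
        # Lazy %-formatting: the message is only built if the level is enabled.
        logger.info("Pseudonym(s) matching actual person names: %s", matches)
    remaining = [pseudo for pseudo in pseudonyms if pseudo not in names]
    logger.debug("Updated pseudonyms: %s", remaining)
    return remaining
```

With this approach the caller decides whether the messages are shown, for example via `logging.basicConfig(level=logging.INFO)`, instead of always writing to standard output.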
    """Please provide a different list of pseudonyms via the
    workflow settings file. The current list of pseudonyms
    is too short and contains only names that already
    exist in the actual data."""
The error message contains unnecessary whitespace and formatting that makes it less readable. Consider using a cleaner format without extra indentation and line breaks.
Suggested change:
    "Please provide a different list of pseudonyms via the workflow settings file. The current list of pseudonyms is too short and contains only names that already exist in the actual data."
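One way to keep the message free of stray whitespace without a very long source line is implicit concatenation of adjacent string literals. The sketch below is an illustration only: the constant name, the helper `require_pseudonyms`, and the use of `ValueError` are assumptions, not what mailcom does.

```python
# Adjacent string literals inside parentheses are joined into one string,
# so the message contains no indentation or newlines from the source layout.
PSEUDONYM_LIST_ERROR = (
    "Please provide a different list of pseudonyms via the workflow "
    "settings file. The current list of pseudonyms is too short and "
    "contains only names that already exist in the actual data."
)


def require_pseudonyms(pseudonyms):
    """Raise if no usable pseudonym is left (hypothetical helper)."""
    if not pseudonyms:
        # ValueError is an assumed exception type for this sketch.
        raise ValueError(PSEUDONYM_LIST_ERROR)
    return pseudonyms
```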
    (
        names.extend([name, name.lower(), name.title()])
        if name not in names
        else None
    )
This ternary expression with side effects is hard to read and understand. Consider using a simple if statement for better clarity.
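The same behavior written as a plain `if` statement could look like the sketch below. The wrapper function `add_name_variants` is hypothetical and only there to make the snippet self-contained; the body mirrors the expression quoted above.

```python
def add_name_variants(names, name):
    """Append name and its lowercase and title-case variants to names.

    Nothing is added if name is already in the list. Hypothetical wrapper
    for illustration only.
    """
    if name not in names:
        names.extend([name, name.lower(), name.title()])
    return names
```

As in the quoted expression, the variants are appended without deduplication, so a name that is already title-cased is added twice.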
kimlee87 left a comment
Thank you for resolving this issue. It looks good to me.
I only have one thought on the last note of your PR's description. Please see details in my comment.
mailcom/main.py
Outdated
        pseudo_ne=pseudo_ne,
        pseudo_numbers=pseudo_numbers,
    )
    while exclude_pseudonym:
I wonder whether the while loop here is needed, since we call the pseudonymize_with_updated_ne() method with ne_sent_dict set to None, which means the dict is created from the previously detected self.ne_list and self.ne_sent after pseudonymize() has run.
The self._check_pseudonyms_in_content() method within pseudonymize() makes sure that no person NE from self.ne_list appears in the self.pseudo_first_names list.
Repetition might be needed if we called pseudonymize() again (after the initial call) instead of pseudonymize_with_updated_ne(), because pseudonymize() gets the NER list from a transformer model, which might not be identical on every call.
If we still want to keep this while loop, I think we can also call pseudonymize() within it.
Maybe there are cases where this iteration is necessary and I have not figured them out yet.
Yes, true, that was why in the end I decided to put the re-checking for pseudonyms in the loop. I did forget though that pseudonymize_with_updated_ne() does not call the transformers pipeline again. So I will amend the code.
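The point discussed here can be illustrated with a small sketch: when the list of detected names is fixed (not re-detected by a model), filtering the pseudonyms once is enough, and a second check finds nothing left to exclude. The function `exclude_matching_pseudonyms` and the sample data are hypothetical and do not reproduce mailcom's methods.

```python
def exclude_matching_pseudonyms(pseudonyms, names):
    """Return the remaining pseudonyms and whether any were excluded."""
    remaining = [pseudo for pseudo in pseudonyms if pseudo not in names]
    return remaining, len(remaining) != len(pseudonyms)


# Names are detected once and then reused, so they do not change between checks.
names = ["Agathe", "Pierre"]
pseudonyms = ["Pierre", "Marcel", "Aisha"]

pseudonyms, excluded_first = exclude_matching_pseudonyms(pseudonyms, names)
pseudonyms, excluded_second = exclude_matching_pseudonyms(pseudonyms, names)
```

Here the first check excludes a pseudonym and the second does not, so a loop over the check would stop after one iteration. A loop only becomes relevant if the names themselves can change between passes.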
Exclude a pseudonym from the list if it matches the actual data. For example, `Agathe is happy` cannot use the pseudonym `Agathe`. This closes #76.
I tried several different ways; this is what I converged on in the end:
In all subsequent texts (eml files or rows of a csv), the pseudonyms used will differ from those used prior to the duplication. For example, for the three texts
and given pseudonyms `Pierre`, `Marcel`, `Aisha`, the pseudonymized text then becomes
It is thus possible that the entries at the top match actual names at the bottom of the data.
I decided to take that risk here, since the data in between the csv rows/eml files is not related, i.e. comes from different sources.
In general, there is probably not a high risk in using the same pseudonym as an actual name, but it is a risk nonetheless, so we need to ensure that names are not preserved accidentally.
I tried different implementations, but went for checking pseudonyms after the first pass of `pseudonymize` and then re-pseudonymizing if a pseudonym matches a name.
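The check-then-redo flow described above can be sketched with a toy replacement function. This is an illustration under stated assumptions only: `replace_names`, `pseudonymize_text`, the plain string replacement, and the `ValueError` are hypothetical and stand in for mailcom's actual NER-based implementation.

```python
def replace_names(text, names, pseudonyms):
    """Toy replacement: map each name to a pseudonym, cycling through the list."""
    for index, name in enumerate(names):
        text = text.replace(name, pseudonyms[index % len(pseudonyms)])
    return text


def pseudonymize_text(text, names, pseudonyms):
    """Pseudonymize once, check for collisions, and redo the pass if needed."""
    first_pass = replace_names(text, names, pseudonyms)
    remaining = [pseudo for pseudo in pseudonyms if pseudo not in names]
    if len(remaining) == len(pseudonyms):
        # No pseudonym matches a real name, so the first pass stands.
        return first_pass, pseudonyms
    if not remaining:
        # Assumed error handling: every pseudonym collides with a real name.
        raise ValueError("All pseudonyms match names in the actual data.")
    # Redo the pass on the original text with the reduced pseudonym list.
    return replace_names(text, names, remaining), remaining


result, used = pseudonymize_text("Agathe is happy", ["Agathe"], ["Agathe", "Pierre"])
```

In this example the first pass would leave `Agathe` in place, so the pseudonym `Agathe` is dropped and the text is processed again with the remaining list.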