Skip to content

Exclude a pseudonym if the name exists in the actual data#97

Merged
iulusoy merged 3 commits intomainfrom
exclude_pseudonyms
Sep 5, 2025
Merged

Exclude a pseudonym if the name exists in the actual data#97
iulusoy merged 3 commits intomainfrom
exclude_pseudonyms

Conversation

@iulusoy
Copy link
Copy Markdown
Member

@iulusoy iulusoy commented Sep 4, 2025

Exclude a pseudonym from the list if it matches the actual data. For example, Agathe is happy cannot use the pseudonym Agathe. This closes #76 .

I tried several different ways, this is what I converged at in the end:

  • pseudonymize the current text
  • check if the name(s) that is (are) detected exists in the dictionary of pseudonyms
  • if the name exists: remove if from the dictionary of pseudonyms
  • pseudonymize with the correct pseudonyms

In all subsequent texts (eml files or row of csv), the used pseudonyms will differ to those that are used prior to the duplication. For example, for the three texts

"Agathe is happy"
"Pierre is happy"
"Aisha is happy" 

and given pseudonyms Pierre, Marcel, Aisha, the pseudonymized text then becomes

"Pierre is happy"
"Marcel is happy"
"Marcel is happy" 

It is thus possible that the entries in the top match actual names in the bottom of the data.

I decided to take that risk here, since the data in between the csv rows/eml files is not related, ie. comes from different sources.

In general, there is probably not a high risk by using the same pseudonym as actual name, but a risk nonetheless, so we need to ensure that names are not preserved accidentally.

I tried different implementations, but went for checking pseudonyms after the first pass of pseudonymize and then re-pseudonymize if a pseudonym matches a name.

@codecov
Copy link
Copy Markdown

codecov bot commented Sep 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.57%. Comparing base (faf6624) to head (13de51d).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #97      +/-   ##
==========================================
- Coverage   99.63%   99.57%   -0.06%     
==========================================
  Files          13       13              
  Lines        2191     2364     +173     
==========================================
+ Hits         2183     2354     +171     
- Misses          8       10       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@iulusoy iulusoy requested a review from Copilot September 4, 2025 10:46
@iulusoy
Copy link
Copy Markdown
Member Author

iulusoy commented Sep 4, 2025

Quality Gate Failed Quality Gate failed

Failed conditions 10.3% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

This is all on code that is executed in the tests. For this we would need to refactor the tests and functions some more, however I think for the currently upcoming release it is ok (quite normal that tests repeat code, especially if you test functionality that is interdependent).

I believe a refactor would make more sense at a later stage, after mailcom has been tested by users other than us.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a feature to exclude pseudonyms that match actual names found in the text being processed. The system now detects when a pseudonym matches a real person's name in the content and removes that pseudonym from the available list to prevent accidental preservation of real names.

  • Adds logic to check if pseudonyms match actual detected person names and exclude them
  • Implements re-processing capability when pseudonyms need to be excluded
  • Updates method signatures to return exclusion status alongside pseudonymized content

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
mailcom/parse.py Core implementation of pseudonym checking and exclusion logic with new helper methods
mailcom/main.py Integration of pseudonym exclusion checking into the main processing workflow
mailcom/test/test_parse.py Unit tests for new pseudonym checking functionality and updated method signatures
mailcom/test/test_main.py Integration tests covering various scenarios of matching pseudonyms

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +127 to +129
print("Found matching name(s) from pseudonyms to actual person names.")
print(f"Names found: {names}")
print(f"Pseudonyms provided: {self.pseudo_first_names.get(lang, [])}")
Copy link

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug print statements should be replaced with proper logging using the logging module. Print statements in production code can clutter output and make debugging harder.

Copilot uses AI. Check for mistakes.
for pseudo in self.pseudo_first_names[lang]
if pseudo not in names
]
print(f"Updated pseudonyms: {self.pseudo_first_names.get(lang, [])}")
Copy link

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug print statements should be replaced with proper logging using the logging module. Print statements in production code can clutter output and make debugging harder.

Copilot uses AI. Check for mistakes.
Comment on lines +142 to +145
"""Please provide a different list of pseudonyms via the
workflow settings file. The current list of pseudonyms
is too short and contains only names that already
exist in the actual data."""
Copy link

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message contains unnecessary whitespace and formatting that makes it less readable. Consider using a cleaner format without extra indentation and line breaks.

Suggested change
"""Please provide a different list of pseudonyms via the
workflow settings file. The current list of pseudonyms
is too short and contains only names that already
exist in the actual data."""
"Please provide a different list of pseudonyms via the workflow settings file. The current list of pseudonyms is too short and contains only names that already exist in the actual data."

Copilot uses AI. Check for mistakes.
Comment on lines +120 to +124
(
names.extend([name, name.lower(), name.title()])
if name not in names
else None
)
Copy link

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ternary expression with side effects is hard to read and understand. Consider using a simple if statement for better clarity.

Copilot uses AI. Check for mistakes.
@iulusoy iulusoy requested a review from kimlee87 September 4, 2025 11:07
Copy link
Copy Markdown
Contributor

@kimlee87 kimlee87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for resolving this issue. It looks good to me.

I only have one thought on the last note of your PR's description. Please see details in my comment.

mailcom/main.py Outdated
pseudo_ne=pseudo_ne,
pseudo_numbers=pseudo_numbers,
)
while exclude_pseudonym:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking if the while loop here is needed as we called the pseudonymize_with_updated_ne() method with the ne_sent_dict as None, which means the dict would be created from the previosly detected self.ne_list and self.ne_sent, after passing pseudonymize() method.

The self._check_pseudonyms_in_content() method within pseudonymize() makes sure that no person NE from self.ne_list appears in the self.pseudo_first_names list.

The repetition might happen if we call pseudonymize() again (after the initial time), instead of pseudonymize_with_updated_ne(). This is because pseudonymize() gets NER list from a transformer model, which might be not identical after every call.

If we still to keep this while loop, I think we can also call pseudonymize() within the while.

Maybe there are cases that this iteration is necessary and I have not figured them out yet.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, true, that was why in the end I decided to put the re-checking for pseudonyms in the loop. I did forget though that pseudonymize_with_updated_ne() does not call the transformers pipeline again. So I will amend the code.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Sep 5, 2025

Quality Gate Failed Quality Gate failed

Failed conditions
10.0% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@iulusoy iulusoy merged commit adb09dc into main Sep 5, 2025
12 of 13 checks passed
@iulusoy iulusoy deleted the exclude_pseudonyms branch September 5, 2025 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Check the person's name and its replacement before pseudonymization

3 participants