Exclude a pseudonym if the name exists in the actual data by iulusoy · Pull Request #97 · ssciwr/mailcom

iulusoy · 2025-09-04T09:29:11Z

Exclude a pseudonym from the list if it matches the actual data. For example, Agathe is happy cannot use the pseudonym Agathe. This closes #76 .

I tried several different ways, this is what I converged at in the end:

pseudonymize the current text
check if the name(s) that is (are) detected exists in the dictionary of pseudonyms
if the name exists: remove if from the dictionary of pseudonyms
pseudonymize with the correct pseudonyms

In all subsequent texts (eml files or row of csv), the used pseudonyms will differ to those that are used prior to the duplication. For example, for the three texts

"Agathe is happy"
"Pierre is happy"
"Aisha is happy"

and given pseudonyms Pierre, Marcel, Aisha, the pseudonymized text then becomes

"Pierre is happy"
"Marcel is happy"
"Marcel is happy"

It is thus possible that the entries in the top match actual names in the bottom of the data.

I decided to take that risk here, since the data in between the csv rows/eml files is not related, ie. comes from different sources.

In general, there is probably not a high risk by using the same pseudonym as actual name, but a risk nonetheless, so we need to ensure that names are not preserved accidentally.

I tried different implementations, but went for checking pseudonyms after the first pass of pseudonymize and then re-pseudonymize if a pseudonym matches a name.

codecov · 2025-09-04T09:42:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.57%. Comparing base (faf6624) to head (13de51d).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #97      +/-   ##
==========================================
- Coverage   99.63%   99.57%   -0.06%     
==========================================
  Files          13       13              
  Lines        2191     2364     +173     
==========================================
+ Hits         2183     2354     +171     
- Misses          8       10       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…seudos

iulusoy · 2025-09-04T10:50:57Z

Quality Gate failed

Failed conditions 10.3% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

This is all on code that is executed in the tests. For this we would need to refactor the tests and functions some more, however I think for the currently upcoming release it is ok (quite normal that tests repeat code, especially if you test functionality that is interdependent).

I believe a refactor would make more sense at a later stage, after mailcom has been tested by users other than us.

Copilot

Pull Request Overview

This PR implements a feature to exclude pseudonyms that match actual names found in the text being processed. The system now detects when a pseudonym matches a real person's name in the content and removes that pseudonym from the available list to prevent accidental preservation of real names.

Adds logic to check if pseudonyms match actual detected person names and exclude them
Implements re-processing capability when pseudonyms need to be excluded
Updates method signatures to return exclusion status alongside pseudonymized content

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
mailcom/parse.py	Core implementation of pseudonym checking and exclusion logic with new helper methods
mailcom/main.py	Integration of pseudonym exclusion checking into the main processing workflow
mailcom/test/test_parse.py	Unit tests for new pseudonym checking functionality and updated method signatures
mailcom/test/test_main.py	Integration tests covering various scenarios of matching pseudonyms

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-09-04T11:02:41Z

mailcom/parse.py

+            print("Found matching name(s) from pseudonyms to actual person names.")
+            print(f"Names found: {names}")
+            print(f"Pseudonyms provided: {self.pseudo_first_names.get(lang, [])}")


Debug print statements should be replaced with proper logging using the logging module. Print statements in production code can clutter output and make debugging harder.

Copilot · 2025-09-04T11:02:42Z

mailcom/parse.py

+                for pseudo in self.pseudo_first_names[lang]
+                if pseudo not in names
+            ]
+            print(f"Updated pseudonyms: {self.pseudo_first_names.get(lang, [])}")


Debug print statements should be replaced with proper logging using the logging module. Print statements in production code can clutter output and make debugging harder.

Copilot · 2025-09-04T11:02:42Z

mailcom/parse.py

+                """Please provide a different list of pseudonyms via the
+                             workflow settings file. The current list of pseudonyms
+                             is too short and contains only names that already
+                             exist in the actual data."""


The error message contains unnecessary whitespace and formatting that makes it less readable. Consider using a cleaner format without extra indentation and line breaks.

Suggested change

"""Please provide a different list of pseudonyms via the

workflow settings file. The current list of pseudonyms

is too short and contains only names that already

exist in the actual data."""

"Please provide a different list of pseudonyms via the workflow settings file. The current list of pseudonyms is too short and contains only names that already exist in the actual data."

Copilot · 2025-09-04T11:02:42Z

mailcom/parse.py

+                (
+                    names.extend([name, name.lower(), name.title()])
+                    if name not in names
+                    else None
+                )


This ternary expression with side effects is hard to read and understand. Consider using a simple if statement for better clarity.

kimlee87

Thank you for resolving this issue. It looks good to me.

I only have one thought on the last note of your PR's description. Please see details in my comment.

kimlee87 · 2025-09-05T07:50:07Z

mailcom/main.py

            pseudo_ne=pseudo_ne,
            pseudo_numbers=pseudo_numbers,
        )
+        while exclude_pseudonym:


I am thinking if the while loop here is needed as we called the pseudonymize_with_updated_ne() method with the ne_sent_dict as None, which means the dict would be created from the previosly detected self.ne_list and self.ne_sent, after passing pseudonymize() method.

The self._check_pseudonyms_in_content() method within pseudonymize() makes sure that no person NE from self.ne_list appears in the self.pseudo_first_names list.

The repetition might happen if we call pseudonymize() again (after the initial time), instead of pseudonymize_with_updated_ne(). This is because pseudonymize() gets NER list from a transformer model, which might be not identical after every call.

If we still to keep this while loop, I think we can also call pseudonymize() within the while.

Maybe there are cases that this iteration is necessary and I have not figured them out yet.

Yes, true, that was why in the end I decided to put the re-checking for pseudonyms in the loop. I did forget though that pseudonymize_with_updated_ne() does not call the transformers pipeline again. So I will amend the code.

… targets

sonarqubecloud · 2025-09-05T09:15:58Z

Quality Gate failed

Failed conditions
10.0% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

feat: exclude a pseudonym if existing name

612be5f

fix&test: forgot about python 3.9 compatibility, add test for check_p…

c44992f

…seudos

iulusoy requested a review from Copilot September 4, 2025 10:46

Copilot AI reviewed Sep 4, 2025

View reviewed changes

iulusoy requested a review from kimlee87 September 4, 2025 11:07

kimlee87 approved these changes Sep 5, 2025

View reviewed changes

fix: remove while loop since ner is not reevaluated, sneak in codecov…

13de51d

… targets

iulusoy merged commit adb09dc into main Sep 5, 2025
12 of 13 checks passed

iulusoy deleted the exclude_pseudonyms branch September 5, 2025 11:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exclude a pseudonym if the name exists in the actual data#97

Exclude a pseudonym if the name exists in the actual data#97
iulusoy merged 3 commits intomainfrom
exclude_pseudonyms

iulusoy commented Sep 4, 2025 •

edited

Loading

Uh oh!

codecov bot commented Sep 4, 2025 •

edited

Loading

Uh oh!

iulusoy commented Sep 4, 2025

Quality Gate failed

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Sep 4, 2025

Uh oh!

Copilot AI Sep 4, 2025

Uh oh!

Copilot AI Sep 4, 2025

Uh oh!

Copilot AI Sep 4, 2025

Uh oh!

kimlee87 left a comment

Uh oh!

kimlee87 Sep 5, 2025

Uh oh!

iulusoy Sep 5, 2025

Uh oh!

sonarqubecloud bot commented Sep 5, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

iulusoy commented Sep 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Sep 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

iulusoy commented Sep 4, 2025

Quality Gate failed

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 4, 2025

Choose a reason for hiding this comment

Uh oh!

kimlee87 left a comment

Choose a reason for hiding this comment

Uh oh!

kimlee87 Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

iulusoy Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud bot commented Sep 5, 2025

Quality Gate failed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

iulusoy commented Sep 4, 2025 •

edited

Loading

codecov bot commented Sep 4, 2025 •

edited

Loading