# Demonstration notebook for the mailcom package
*Scientific Software Center, University of Heidelberg, December 2024*

The `mailcom` package is used to anonymize/pseudonymize textual data, i.e. email content. It takes an `eml` or `html` file as input and extracts the content of the email body together with information about the attachments (their number and type). The email body is then parsed with [`spaCy`](https://spacy.io/) and divided into sentences. The sentences are fed to a [`transformers`](https://huggingface.co/docs/transformers/en/index) named entity recognition (NER) [pipeline](https://huggingface.co/docs/transformers/v4.46.3/en/main_classes/pipelines), which detects person names, locations, organizations and miscellaneous entities. Names are replaced by pseudonyms, while locations, organizations and miscellaneous entities are replaced by `[location]`, `[organization]` and `[misc]`, respectively. The text is further processed using string methods to replace any numbers with `[number]` and email addresses with `[email]`. The processed text and metadata can then be written to an `xml` file or into a pandas dataframe.
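
To make the last replacement step more concrete, the following is a minimal sketch of how numbers and email addresses could be masked. It is not the code used inside `mailcom`: the function name `mask_emails_and_numbers` is hypothetical, and a regular expression is used for the digits here instead of plain string methods.

```python
import re


def mask_emails_and_numbers(text: str) -> str:
    """Hypothetical helper (not part of mailcom) that masks addresses and numbers."""
    masked_words = []
    # splitting on whitespace also normalizes runs of spaces and line breaks
    for word in text.split():
        if "@" in word:
            # treat any whitespace-delimited word containing "@" as an email address
            masked_words.append("[email]")
        else:
            # replace each run of digits with a single placeholder
            masked_words.append(re.sub(r"\d+", "[number]", word))
    return " ".join(masked_words)


example = "Contact alice@example.org before 15 May 2024."
print(mask_emails_and_numbers(example))
# prints: Contact [email] before [number] May [number].
```

Note that this simple approach replaces the whole word around an `@`, so punctuation attached to an address is lost.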

Please note that 100% accuracy is not possible with this task. Any output needs to be further checked by a human to ensure the text has been anonymized completely.

The current set-up is for Romance languages; however, [other language models](https://spacy.io/usage/models) can also be loaded into the spaCy pipeline. By default, the transformers pipeline uses the `xlm-roberta-large-finetuned-conll03-english` model at revision `18f95e9`, but other models can also be passed (see below).

Before using the `mailcom` package, please install it into your conda environment using
```
pip install mailcom
```
After that, select the appropriate kernel for your Jupyter notebook and execute the cell below to import the package. The package is currently under active development, and any function calls are subject to change.

In [None]:
import mailcom.inout
import mailcom.parse
import pandas as pd
from IPython.display import display, HTML

The cell below defines a function that displays the result and highlights all named entities found in the text. It is used for demonstration purposes only in this example.

In [None]:
# a dictionary matching colors to the different entity types
colors = {
    "LOC": "green",
    "ORG": "blue",
    "MISC": "yellow",
    "PER": "red"
}

# function for displaying the result using HTML
def highlight_ne(text, ne_list):
    # create a list of all entities with their positions
    entities = []
    for ne in ne_list:
        entities.append((ne, colors.get(ne["entity_group"])))
    
    # sort entities by their positions in the text in reverse order
    entities = sorted(entities, key=lambda x: x[0]["start"], reverse=True)
    
    # wrap every occurrence of each entity's text in a highlighted span
    # (str.replace matches by text, not by position)
    for entity, color in entities:
        ent_word = entity["word"]
        text = text.replace(ent_word, f'<span style="background-color:{color}">{ent_word}</span>')
    
    return text
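
Because `highlight_ne` relies on `str.replace`, an entity text that occurs several times is highlighted at every occurrence. As an alternative, the sketch below wraps only the detected spans by using the `start` and `end` offsets. It assumes that these offsets index into exactly the string that is passed in; whether this holds for `ps.ne_list` (for example, if offsets are relative to single sentences) is not confirmed here. The function name `highlight_by_offset` and the sample entities are made up for illustration.

```python
# same colour mapping as in the demonstration cell
COLORS = {"LOC": "green", "ORG": "blue", "MISC": "yellow", "PER": "red"}


def highlight_by_offset(text, ne_list):
    """Wrap each entity span in a coloured <span>, using character offsets."""
    # process entities from the end of the text so that earlier offsets stay valid
    for ne in sorted(ne_list, key=lambda e: e["start"], reverse=True):
        color = COLORS.get(ne["entity_group"], "lightgray")
        start, end = ne["start"], ne["end"]
        span = f'<span style="background-color:{color}">{text[start:end]}</span>'
        text = text[:start] + span + text[end:]
    return text


# invented sample: "Paris" appears once inside a person name and once as a location
sample = "Paris Hilton visited Paris."
entities = [
    {"entity_group": "PER", "word": "Paris Hilton", "start": 0, "end": 12},
    {"entity_group": "LOC", "word": "Paris", "start": 21, "end": 26},
]
print(highlight_by_offset(sample, entities))
```

With offsets, the person name and the location are each highlighted once, in their own colour.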

Below, the input files are loaded from the given `input_dir` directory. You can provide relative or absolute paths to the directory that contains your `eml` or `html` files. All files of the `eml` or `html` file type in that directory will be considered input files.

In [None]:
# import files from input_dir - change this to your own directory
input_dir = "../mailcom/test/data"

io = mailcom.inout.InoutHandler(directory_name=input_dir)

# collect the input files found in input_dir
io.list_of_files()

# create pseudonymization object and load spacy and transformers
# set the spacy language for sentence splitting
spacy_language = "fr"
# you may also set a specific model, e.g. `spacy_model = "fr_core_news_md"`
spacy_model = "default"
# set the model for transformers, here using the default model
transformers_model = "xlm-roberta-large-finetuned-conll03-english"
# set the revision number for transformers, here using the default revision number
transformers_revision_number = "18f95e9"
ps = mailcom.parse.Pseudonymize()
ps.init_spacy(language=spacy_language, model=spacy_model)
ps.init_transformers(model=transformers_model, model_revision_number=transformers_revision_number)

In the cell below, the emails are looped over and the text is extracted. The text is then split into sentences and the sentences are pseudonymized. The pseudonymized sentences are then joined back into a text, which is collected in a list together with the email metadata.

The input text is displayed and the named entities that were found are highlighted for demonstration. Note that email addresses (all words containing '@') are filtered out separately and thus not highlighted here.

In [None]:
# loop over mails and pseudonymize them
out_list = []
for file in io.email_list:
    print(f"Parsing input file {file}")
    text = io.get_text(file)
    # skip files without any text content
    if not text:
        continue
    # after get_text has been called, the email metadata can be accessed via io.email_content
    # the dict already has the entries content, date, attachments, attachment type
    email_dict = io.email_content.copy()
    html_text = io.get_html_text(text)
    email_dict["html_text"] = html_text
    # pseudonymize the extracted text
    output_text = ps.pseudonymize(html_text)

    # display original text and highlight found and replaced NEs
    highlighted_html = highlight_ne(html_text, ps.ne_list)
    display(HTML(highlighted_html))

    # add pseudonymized text to dict
    email_dict["pseudo_content"] = output_text
    out_list.append(email_dict)
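
For orientation, the replacement of named entities in a single sentence can be pictured roughly as in the sketch below. This is a simplified illustration under stated assumptions, not the implementation inside `mailcom`: the function `replace_entities`, the pseudonym pool and the sample entities are hypothetical, and the entity offsets are assumed to refer to the sentence that is passed in.

```python
# placeholder labels as described in the introduction
PLACEHOLDERS = {"LOC": "[location]", "ORG": "[organization]", "MISC": "[misc]"}
# hypothetical pseudonym pool; the names actually used by mailcom are not shown here
PSEUDONYMS = ["Claude", "Dominique", "Camille"]


def replace_entities(sentence, ne_list):
    """Replace person names with pseudonyms and other entities with labels."""
    source = sentence  # keep the original for slicing by offset
    name_map = {}  # original name -> pseudonym, so repeated names stay consistent
    # assign pseudonyms in reading order
    for ne in sorted(ne_list, key=lambda e: e["start"]):
        if ne["entity_group"] == "PER":
            name = source[ne["start"]:ne["end"]]
            if name not in name_map:
                name_map[name] = PSEUDONYMS[len(name_map) % len(PSEUDONYMS)]
    # substitute from right to left so that earlier offsets remain valid
    for ne in sorted(ne_list, key=lambda e: e["start"], reverse=True):
        if ne["entity_group"] == "PER":
            new = name_map[source[ne["start"]:ne["end"]]]
        else:
            new = PLACEHOLDERS.get(ne["entity_group"], "[misc]")
        sentence = sentence[:ne["start"]] + new + sentence[ne["end"]:]
    return sentence


# invented sample sentence with hand-written entity offsets
sentence = "Marie Dupont travaille chez Renault à Lyon."
entities = [
    {"entity_group": "PER", "start": 0, "end": 12},
    {"entity_group": "ORG", "start": 28, "end": 35},
    {"entity_group": "LOC", "start": 38, "end": 42},
]
print(replace_entities(sentence, entities))
# prints: Claude travaille chez [organization] à [location].
```

Mapping each original name to one pseudonym keeps the text readable when the same person is mentioned more than once.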

After this, the output can be written to a file or processed further. The output is a list of dictionaries, each containing the metadata of the email and the pseudonymized content. In the cell below, the output is saved in a pandas dataframe.

In [None]:
# write output to pandas df
df = pd.DataFrame(out_list)
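
If a dataframe is not needed downstream, a list of dictionaries like `out_list` can also be serialized with the Python standard library. The sketch below writes such a list to a JSON file and reads it back. The records, the file name `pseudonymized.json` and the temporary output directory are stand-ins chosen for illustration; this is not the `xml` writer provided by the package.

```python
import json
import tempfile
from pathlib import Path

# stand-in for out_list from the loop above; the values are invented
records = [
    {
        "content": "Bonjour Marie",
        "date": "2024-12-01",
        "html_text": "Bonjour Marie",
        "pseudo_content": "Bonjour Claude",
    },
]

# hypothetical output location; a temporary directory avoids touching your data
out_path = Path(tempfile.mkdtemp()) / "pseudonymized.json"
with out_path.open("w", encoding="utf-8") as f:
    # ensure_ascii=False keeps accented characters readable in the file,
    # default=str converts values json cannot serialize natively (e.g. dates)
    json.dump(records, f, ensure_ascii=False, indent=2, default=str)

# read the file back to check the round trip
restored = json.loads(out_path.read_text(encoding="utf-8"))
print(restored == records)
```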

You may print the output for inspection in the notebook as per the cell below.

In [None]:
# print results
for idx, mail in df.iterrows():
    print("Email", idx)
    print("Original Text:\n", mail["html_text"])
    print("Pseudonymized Text:\n", mail["pseudo_content"])