# Demonstration notebook for the mailcom package
*Scientific Software Center, University of Heidelberg, December 2024*  
The `mailcom` package is used to anonymize/pseudonymize textual data such as email content. It takes an `eml` or `html` file as input and extracts information about the attachments (their number and types) and the content of the email body. The body is then parsed with [`spaCy`](https://spacy.io/) and divided into sentences. The sentences are fed to a [`transformers`](https://huggingface.co/docs/transformers/en/index) named entity recognition (NER) [pipeline](https://huggingface.co/docs/transformers/v4.46.3/en/main_classes/pipelines), which detects person names, locations, organizations and miscellaneous entities. Names are replaced by pseudonyms, while locations, organizations and miscellaneous entities are replaced by `[location]`, `[organization]` and `[misc]`. The text is then processed with string methods to replace numbers with `[number]` and email addresses with `[email]`. The processed text and metadata can be written to an `xml` file or loaded into a pandas dataframe.
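
As a rough illustration of the last string-level step, masking email addresses and numbers could look like the sketch below. It is not the `mailcom` implementation: the function name `mask_emails_and_numbers` and both regular expressions are assumptions chosen for this example.

```python
import re

# Hypothetical patterns for illustration; mailcom's own rules may differ.
EMAIL_PATTERN = re.compile(r"\S+@\S+")
NUMBER_PATTERN = re.compile(r"\d+")


def mask_emails_and_numbers(text):
    """Replace email-like tokens with [email] and digit runs with [number]."""
    # mask email addresses first so digits inside an address are not handled separately
    text = EMAIL_PATTERN.sub("[email]", text)
    return NUMBER_PATTERN.sub("[number]", text)


print(mask_emails_and_numbers("Write to ana.perez99@example.org before 15 May."))
# prints: Write to [email] before [number] May.
```

Note that this simple email pattern also swallows punctuation attached directly to an address, which is one reason the output should still be checked by a human.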

Please note that 100% accuracy is not possible with this task. Any output needs to be further checked by a human to ensure the text has been anonymized completely.

The current setup targets Romance languages, but [other language models](https://spacy.io/usage/models) can also be loaded into the spaCy pipeline. The transformers pipeline uses the `xlm-roberta-large-finetuned-conll03-english` model (revision `18f95e9`) by default, but other models can also be passed (see below).

Before using the `mailcom` package, please install it into your conda environment using
```
pip install mailcom
```
After that, select the appropriate kernel for your Jupyter notebook and execute the cell below to import the package. The package is under active development, so function calls are subject to change.

In [None]:
import mailcom
import pandas as pd
from IPython.display import display, HTML

The cell below defines a function used to display the result in the end, and highlight all named entities found in the text. It is used for demonstration purposes in this example.

This is a simple approach to highlighting the replaced entities. The method itself is error-prone and should be used with care when assessing whether the text has been anonymized correctly.

In [None]:
# a dictionary matching colors to the different entity types
colors = {
    "LOC": "green",
    "ORG": "blue",
    "MISC": "yellow",
    "PER": "red"
}

# function for displaying the result using HTML
def highlight_ne(text, ne_list):
    # collect each distinct entity word once, together with its color
    entities = []
    seen_words = set()
    for ne in ne_list:
        # avoid substituting the same entity multiple times
        if ne["word"] not in seen_words and ne["entity_group"] in colors:
            seen_words.add(ne["word"])
            entities.append((ne, colors[ne["entity_group"]]))
    
    # sort entities by their positions in the text in reverse order
    # is this necessary?
    # entities = sorted(entities, key=lambda x: x[0]["start"], reverse=True)

    # replace all "<" and ">" which may mess up spans
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    # replace entities with highlighted spans
    for entity, color in entities:
        ent_word = entity["word"]
        # note: str.replace also matches text inside spans inserted earlier,
        # so an entity that is a substring of another one may be wrapped twice
        text = text.replace(ent_word, f"<span style=\"background-color:{color}\">{ent_word}</span>")
    
    return text
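
One way to sidestep the repeated-replacement problem is to work with character offsets instead of `str.replace`. The sketch below assumes that every entry of `ne_list` carries `start` and `end` offsets into the same text that is being displayed, as the Hugging Face token-classification pipeline returns them when an aggregation strategy is set. Whether the entries produced by `mailcom` keep such offsets relative to the full email content is an assumption to verify. The name `highlight_by_offsets` is hypothetical.

```python
import html

COLORS = {"LOC": "green", "ORG": "blue", "MISC": "yellow", "PER": "red"}


def highlight_by_offsets(text, ne_list):
    """Wrap each entity span in a colored <span>, using character offsets."""
    pieces = []
    cursor = 0
    # walk through the entities from left to right
    for ne in sorted(ne_list, key=lambda item: item["start"]):
        color = COLORS.get(ne["entity_group"])
        # skip unknown entity types and spans overlapping an earlier one
        if color is None or ne["start"] < cursor:
            continue
        # escape the plain text between the previous entity and this one
        pieces.append(html.escape(text[cursor:ne["start"]]))
        word = html.escape(text[ne["start"]:ne["end"]])
        pieces.append(f'<span style="background-color:{color}">{word}</span>')
        cursor = ne["end"]
    # escape whatever follows the last entity
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)


sample = "Ana lives in Madrid."
sample_entities = [
    {"entity_group": "PER", "start": 0, "end": 3},
    {"entity_group": "LOC", "start": 13, "end": 19},
]
print(highlight_by_offsets(sample, sample_entities))
```

Because every character of the input is emitted exactly once, markup inserted for one entity can never be matched again by a later one.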

All settings for the pseudonymization process are stored in the file `mailcom/default_settings.json`. You can customize them by:

* Modifying `mailcom/default_settings.json` directly, or
* Creating a new settings file, or
* Updating specific fields when loading the settings

The function `mailcom.get_workflow_settings()` loads the workflow settings, as follows:

In [None]:
# get workflow settings from a setting file
setting_path = "../mailcom/default_settings.json"
workflow_settings = mailcom.get_workflow_settings(setting_path=setting_path)

# update some fields while loading the settings
new_settings = {"default_lang": "es"}
setting_dir = "../mailcom/"
workflow_settings = mailcom.get_workflow_settings(new_settings=new_settings,
                                                  updated_setting_dir=setting_dir,
                                                  save_updated_settings=True)

In the last example of the cell above, the updated settings are saved to a file. If `updated_setting_dir` is not provided, the file is saved in the current directory. To skip saving, set `save_updated_settings` to `False`.
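
Conceptually, updating specific fields amounts to overlaying a few new values on the defaults and optionally writing the result to disk. The sketch below illustrates that idea with plain dictionaries and `json`. It does not show how `mailcom.get_workflow_settings()` is implemented: the helper `merge_settings`, the file name `updated_settings.json`, the key `pseudo_numbers` and the default values are all made up for the example.

```python
import json
import tempfile
from pathlib import Path


def merge_settings(defaults, new_settings, save_dir=None):
    """Return a copy of defaults with new_settings applied on top."""
    merged = {**defaults, **new_settings}
    if save_dir is not None:
        # hypothetical file name, chosen only for this illustration
        out_path = Path(save_dir) / "updated_settings.json"
        out_path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged


# made-up default values; the real settings file contains other fields
defaults = {"default_lang": "fr", "pseudo_numbers": True}
with tempfile.TemporaryDirectory() as tmp_dir:
    merged = merge_settings(defaults, {"default_lang": "es"}, save_dir=tmp_dir)
    saved_path = Path(tmp_dir) / "updated_settings.json"
    saved = json.loads(saved_path.read_text(encoding="utf-8"))

print(merged["default_lang"])
```

The defaults are left untouched, so the same dictionary can be reused for several variants.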

For this demo, we will use the default workflow settings:

In [None]:
# get default workflow settings
workflow_settings = mailcom.get_workflow_settings()

We currently support two types of input data: (1) a `csv` file and (2) a directory of `eml` and `html` files.

Each row of the `csv` file, and each `eml` or `html` file, is stored in an email dictionary with the pre-defined keys `content`, `date`, `attachment`, and `attachement type`.

When loading a `csv` file as input, provide the list of columns in the file and the pre-defined keys they should be mapped to. For example:

In [None]:
# import data from a csv file - change this to your own file
input_csv = "../../../data/mails_lb_sg.csv"
unmatched_keyword = workflow_settings.get("csv_col_unmatched_keyword")
input_handler = mailcom.get_input_handler(in_path=input_csv, in_type="csv", 
                                          col_names=["message"], 
                                          init_data_fields=["content"], 
                                          unmatched_keyword=unmatched_keyword)

In the cell above, the `message` column from the `csv` file is mapped to the `content` key in the email dictionary, while other keys have `None` as their values.

If the `csv` file lacks the `message` column, the value of `content` is set to `unmatched_keyword`.
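
The mapping described above can be pictured with the standard `csv` module. This is only a sketch of the idea, not the `mailcom` input handler: the function `rows_to_email_dicts`, the sample data and the placeholder value `"unmatched"` are assumptions for illustration.

```python
import csv
import io

EMAIL_KEYS = ["content", "date", "attachment", "attachement type"]


def rows_to_email_dicts(csv_text, col_names, init_data_fields, unmatched_keyword):
    """Map selected csv columns onto the pre-defined email keys."""
    emails = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # keys without a mapped column stay None
        email = dict.fromkeys(EMAIL_KEYS)
        for column, key in zip(col_names, init_data_fields):
            # fall back to the placeholder when the column is missing
            email[key] = row.get(column, unmatched_keyword)
        emails.append(email)
    return emails


csv_text = "message,sender\nHola Ana,x\nBonjour Luc,y\n"
emails = rows_to_email_dicts(csv_text, ["message"], ["content"], "unmatched")
print(emails[0])
```

Passing a column name that does not exist in the file, for example `["body"]`, fills `content` with the placeholder instead.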

Below, the input files are loaded from the given `input_dir` directory into an input handler. You can provide relative or absolute paths to the directory that contains your `eml` or `html` files. All files of the `eml` or `html` file type in that directory will be considered input files.

In [None]:
# import files from input_dir - change this to your own directory
input_dir = "../../../data/in/"
input_handler = mailcom.get_input_handler(in_path=input_dir, in_type="dir")

In the cell below, the emails are looped over and the email content is processed. Depending on the settings, the content of each email goes through the following steps:
1. language detection (optional)
2. date and time detection (optional)
3. email address pseudonymization (optional)
4. named entity pseudonymization
5. number pseudonymization (optional)

For steps 3-5, the email content is divided into sentences, which are then pseudonymized. The modified sentences are recombined into a text and stored in the email dictionary under the key `"pseudo_content"`.
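
The split, process and recombine pattern can be sketched without any models. In the example below, a naive regular expression stands in for the spaCy sentence splitter, and the steps are toy functions; the named entity step is left out because it needs the NER pipeline. All function names are hypothetical and not part of `mailcom`.

```python
import re


def split_sentences(text):
    """Naive stand-in for spaCy: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def process_sentences(text, steps):
    """Apply every enabled step to each sentence, then join the sentences."""
    processed = []
    for sentence in split_sentences(text):
        for step in steps:
            sentence = step(sentence)
        processed.append(sentence)
    return " ".join(processed)


# toy steps standing in for steps 3 and 5
def mask_emails(sentence):
    return re.sub(r"\S+@\S+", "[email]", sentence)


def mask_numbers(sentence):
    return re.sub(r"\d+", "[number]", sentence)


email = {"content": "Call me at 555 tomorrow. My address is ana@example.org today."}
email["pseudo_content"] = process_sentences(email["content"], [mask_emails, mask_numbers])
print(email["pseudo_content"])
# prints: Call me at [number] tomorrow. My address is [email] today.
```

Leaving a function out of the `steps` list corresponds to disabling an optional step.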

In [None]:
# process the input data
processed_data = mailcom.process_data(input_handler.get_email_list(), workflow_settings)

The input text is displayed and the named entities that were found are highlighted for demonstration. Note that email addresses (all words containing '@') are filtered out separately and thus not highlighted here.

In [None]:
# loop over mails and display the highlights
for email in input_handler.get_email_list():
    # display original text and highlight found and replaced NEs
    highlighted_html = highlight_ne(email["content"], email["ne_list"])
    display(HTML(highlighted_html))

After this, the output can be written to a file or processed further. The output is a list of dictionaries, each containing the metadata of an email and its pseudonymized content. In the cell below, the output is loaded into a `pandas` dataframe.

In [None]:
# write output to pandas df
df = pd.DataFrame(input_handler.get_email_list())

The output can also be saved as a `csv` file.

In [None]:
# set overwrite to True to overwrite the existing file
mailcom.write_output_data(input_handler, "../../../data/out/out_demo.csv", overwrite=True)