# Notebook for testing Corpus 200 emails
*Scientific Software Center, University of Heidelberg, April 2025*

The dataset `Corpus 200 emails` contains 200 multilingual emails (Spanish, English, and Portuguese) formatted in accordance with the RFC 2822 specification. Download the dataset [here](https://figshare.com/articles/dataset/Corpus_200_Emails/1326662?file=1936502).

This notebook is based on the `demo.ipynb` and `performance_demo.ipynb` notebooks.
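
As a quick illustration of the RFC 2822 layout mentioned above (header lines, a blank line, then the body), the following sketch parses a message with Python's standard `email` package. The message text is a hypothetical example, not taken from the corpus, and this is not how `mailcom` itself reads the files.

```python
from email import message_from_string
from email.policy import default

# hypothetical minimal message; real corpus files will carry more headers
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hola\n"
    "\n"
    "Nos vemos en Madrid.\n"
)

# headers and body are separated by the first blank line
msg = message_from_string(raw_message, policy=default)
subject = msg["Subject"]
body = msg.get_content().strip()
print(subject, "|", body)
```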

In [None]:
import mailcom
import pandas as pd
from IPython.display import display, HTML

The following helper function highlights all named entities found in the text (simple version).

In [None]:
# a dictionary matching colors to the different entity types
colors = {
    "LOC": "green",
    "ORG": "blue",
    "MISC": "yellow",
    "PER": "red"
}

# function for displaying the result using HTML
def highlight_ne(text, ne_list):
    if not ne_list:
        return text
    
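    # create a list of all entities with their colors
    entities = []
    seen_spans = set()
    for ne in ne_list:
        span = (ne["start"], ne["end"])
        # avoid substituting the same span multiple times
        if span not in seen_spans and ne["entity_group"] in colors:
            seen_spans.add(span)
            entities.append((ne, colors[ne["entity_group"]]))
    # process entities from left to right so the slices below stay in order
    entities.sort(key=lambda item: item[0]["start"])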

    # replace entities with highlighted spans
    text_chunks = []
    last_idx = 0
    for entity, color in entities:
        ent_word = entity["word"]
        s_idx = entity["start"]
        e_idx = entity["end"]
        # add text before the entity
        text_chunks.append(text[last_idx:s_idx].replace("<", "&lt;").replace(">", "&gt;"))
        # add the entity with a span
        # assume that the entity does not have any HTML tags
        replacement = f"<span style=\"background-color:{color}\">{ent_word}</span>"
        text_chunks.append(replacement)
        last_idx = e_idx
    # add the remaining text
    text_chunks.append(text[last_idx:].replace("<", "&lt;").replace(">", "&gt;"))
    # join all text chunks
    result = "".join(text_chunks)
    
    return result
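
To see what such a highlighter produces, here is a minimal, self-contained sketch of the same slice-and-wrap idea applied to a hand-written sentence. The helper `wrap_spans`, the sample sentence, and the entity dictionaries are hypothetical; only the keys `word`, `entity_group`, `start`, and `end` mirror the fields used above. Unlike the function above, it escapes text with `html.escape`, which also covers `&`.

```python
import html

# hypothetical sample data; the keys mirror the entity fields used above
sample_text = "Alice <alice> works at ACME in Madrid."
sample_entities = [
    {"word": "Alice", "entity_group": "PER", "start": 0, "end": 5},
    {"word": "ACME", "entity_group": "ORG", "start": 23, "end": 27},
    {"word": "Madrid", "entity_group": "LOC", "start": 31, "end": 37},
]
sample_colors = {"LOC": "green", "ORG": "blue", "MISC": "yellow", "PER": "red"}


def wrap_spans(text, entities, palette):
    """Wrap each entity span in a colored <span>, escaping everything else."""
    chunks = []
    last_idx = 0
    # walk through the entities from left to right
    for ent in sorted(entities, key=lambda e: e["start"]):
        color = palette.get(ent["entity_group"])
        if color is None or ent["start"] < last_idx:
            # skip unknown entity types and overlapping spans
            continue
        chunks.append(html.escape(text[last_idx:ent["start"]]))
        inner = html.escape(text[ent["start"]:ent["end"]])
        chunks.append(f'<span style="background-color:{color}">{inner}</span>')
        last_idx = ent["end"]
    # add the remaining text after the last entity
    chunks.append(html.escape(text[last_idx:]))
    return "".join(chunks)


highlighted = wrap_spans(sample_text, sample_entities, sample_colors)
print(highlighted)
```

The angle brackets around `alice` are escaped to `&lt;` and `&gt;`, so they show up as literal text instead of being interpreted as an HTML tag.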

Load the default workflow settings, but leave the default language empty to activate the language detection feature. For simplicity, the updated settings are not saved.

In [None]:
# activate language detection
new_settings = {"default_lang": ""}
workflow_settings = mailcom.get_workflow_settings(new_settings=new_settings, 
                                                  save_updated_settings=False)

Import the eml files from the corpus into an input handler.

In [None]:
# import files from input_dir - change this to your own directory
input_dir = "../../../mailcom/test/data_extended/200_eml"
input_handler = mailcom.get_input_handler(in_path=input_dir, in_type="dir")
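
Before handing a directory to the input handler, it can help to confirm that the path exists and actually contains eml files. The sketch below uses only the standard library. The helper name `count_eml_files` is hypothetical and not part of `mailcom`, and the assumption that the corpus files carry an `.eml` suffix should be checked against your download.

```python
from pathlib import Path


def count_eml_files(directory):
    """Return the number of .eml files directly inside directory."""
    path = Path(directory)
    if not path.is_dir():
        raise FileNotFoundError(f"Input directory not found: {path}")
    # case-insensitive suffix check, non-recursive
    return sum(
        1 for p in path.iterdir() if p.is_file() and p.suffix.lower() == ".eml"
    )


# example call (adjust the path to your own directory):
# count_eml_files("../../../mailcom/test/data_extended/200_eml")
```

For the full corpus, one would expect the count to be 200 if every email is stored as a separate file in that directory.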

Process eml files. By default, all processing steps are enabled.

In [None]:
# process the input data
mailcom.process_data(input_handler.get_email_list(), workflow_settings)

The input text is displayed with the detected named entities highlighted for demonstration.

Note that email addresses (all words containing '@') are filtered out separately and are therefore not highlighted here.

In [None]:
# loop over mails and display the highlights
for email in input_handler.get_email_list():
    # get NE for each sentence in the email
    ne_sent_dict = {}
    for sent_idx, ne in zip(email["ne_sent"], email["ne_list"]):
        if str(sent_idx) not in ne_sent_dict:
            ne_sent_dict[str(sent_idx)] = []
        ne_sent_dict[str(sent_idx)].append(ne)

    # display original text and highlight found and replaced NEs
    html_content = []
    for sent_idx, sentence in enumerate(email["sentences"]):
        ne_list = ne_sent_dict.get(str(sent_idx), [])
        highlighted_html = highlight_ne(sentence, ne_list)
        html_content.append(highlighted_html)
    display(HTML(" ".join(html_content)))
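
The grouping step in the loop above pairs each entity with the index of the sentence it was found in. Here is a minimal sketch of that step on toy data, assuming `ne_sent` and `ne_list` are parallel lists, as the `zip` call above suggests. The sample values are invented for illustration.

```python
from collections import defaultdict

# hypothetical sample mirroring the "ne_sent" and "ne_list" fields
toy_email = {
    "ne_sent": [0, 0, 2],
    "ne_list": [
        {"word": "Alice", "entity_group": "PER"},
        {"word": "Madrid", "entity_group": "LOC"},
        {"word": "ACME", "entity_group": "ORG"},
    ],
}

# group entity words by the index of the sentence they belong to
ne_by_sentence = defaultdict(list)
for sent_idx, ne in zip(toy_email["ne_sent"], toy_email["ne_list"]):
    ne_by_sentence[sent_idx].append(ne["word"])

print(dict(ne_by_sentence))
```

Sentence 1 has no entities and therefore no key in the mapping, which is why the display loop falls back to an empty list via `.get(...)`.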

Manually check important fields, including `lang`, `detected_datetime`, `pseudo_content`, and `ne_list`.

In [None]:
for email in input_handler.get_email_list():
    print("= Email language =======\n", email["lang"])
    print("= Detected dates =======\n", email["detected_datetime"])
    print("= Pseudo content =======\n", email["pseudo_content"])
    print("= NE list =======")
    for ne in email["ne_list"]:
        print("  -", ne["word"], " - ", ne["entity_group"], " - ", ne["start"], " - ", ne["end"])
    print("= Sentences =======\n")
    for idx, sent in enumerate(email["sentences"]):
        print(f"  {idx}- {sent}")
    print("\n")
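
With 200 emails, reading the printed output line by line is slow, so a small aggregate can complement the manual check. The sketch below counts languages and entity types with `collections.Counter`. The helper `summarize` and the toy records are hypothetical, and the language codes shown are assumed rather than taken from actual `mailcom` output.

```python
from collections import Counter


def summarize(emails):
    """Count detected languages and entity groups across processed emails."""
    lang_counts = Counter(e.get("lang") for e in emails)
    group_counts = Counter(
        ne["entity_group"] for e in emails for ne in e.get("ne_list", [])
    )
    return lang_counts, group_counts


# hypothetical records with the same keys as the processed emails
toy_emails = [
    {"lang": "es", "ne_list": [{"entity_group": "PER"}, {"entity_group": "LOC"}]},
    {"lang": "pt", "ne_list": [{"entity_group": "PER"}]},
    {"lang": "es", "ne_list": []},
]
lang_counts, group_counts = summarize(toy_emails)
print(lang_counts.most_common())
print(group_counts.most_common())
```

In this notebook, the same helper could be called as `summarize(input_handler.get_email_list())` once processing has finished.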

Store the output in a `pandas` DataFrame.

In [None]:
# write output to pandas df
df = pd.DataFrame(input_handler.get_email_list())

The output can also be saved as a CSV file.

In [None]:
# set overwrite to True to overwrite the existing file
mailcom.write_output_data(input_handler, "../../../data/out/200_eml.csv", overwrite=True)