# Notebook for testing Corpus 200 emails
*Scientific Software Center, University of Heidelberg, April 2025*

The dataset `Corpus 200 emails` contains 200 multilingual emails (Spanish, English, and Portuguese/Galician) formatted in accordance with the RFC 2822 specification. The dataset can be downloaded [here](https://figshare.com/articles/dataset/Corpus_200_Emails/1326662?file=1936502).

This notebook creates an evaluation dataset for `mailcom` from 30 emails of `Corpus 200 emails`: 10 Spanish, 10 English, and 10 Portuguese/Galician (7 Galician and 3 Portuguese).

For each email in the dataset, we record:
* email content
* email language
* detected dates in the email
* list of named entities (NE)
* pseudonymized content
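
As a rough illustration of the fields listed above, one record could be modelled as a small dataclass. This is only a sketch: the class name `EvalRecord`, the field names, and the example values are made up for this note and are not taken from `mailcom` or from the corpus.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class EvalRecord:
    """One row of the evaluation dataset (hypothetical field names)."""

    content: str                                              # email content
    lang: str                                                 # email language
    detected_dates: List[str] = field(default_factory=list)   # dates found in the email
    named_entities: List[str] = field(default_factory=list)   # list of named entities (NE)
    pseudo_content: str = ""                                  # pseudonymized content


# Toy example with invented text, only to show the shape of a record
record = EvalRecord(
    content="Hello Alice, see you on 2025-04-01.",
    lang="en",
    detected_dates=["2025-04-01"],
    named_entities=["Alice"],
    pseudo_content="Hello [name], see you on 2025-04-01.",
)
row = asdict(record)  # plain dict, e.g. for building a table row
```

Keeping one flat dictionary per email makes it straightforward to load the records into a table later on.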

In [None]:
# Email numbers selected for each language.
# File names are zero-padded to two digits (01.eml, 02.eml, ...).
gl_emails = [1, 2, 3, 4, 10, 12, 15]  # Galician
gl_files = [f"{i:02d}.eml" for i in gl_emails]
pt_emails = [30, 36, 66]  # Portuguese
pt_files = [f"{i:02d}.eml" for i in pt_emails]
es_emails = [5, 6, 7, 9, 11, 23, 28, 31, 33, 34]  # Spanish
es_files = [f"{i:02d}.eml" for i in es_emails]
en_emails = [13, 14, 19, 20, 22, 24, 32, 35, 37, 38]  # English
en_files = [f"{i:02d}.eml" for i in en_emails]
chosen_files = gl_files + pt_files + es_files + en_files
# sanity check: the selection contains 30 distinct files
assert len(set(chosen_files)) == 30
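
The same selection can also be kept as a mapping from file name to expected language, which makes the per-language counts easy to verify and can serve later as a reference for language detection. The following is a minimal sketch; the names `selection`, `eml_name`, and `file_lang` are introduced here for illustration and do not appear elsewhere in the notebook.

```python
from collections import Counter


def eml_name(number: int) -> str:
    """Return the zero-padded file name for an email number, e.g. 1 -> '01.eml'."""
    return f"{number:02d}.eml"


# Email numbers per language code, copied from the selection cell
selection = {
    "gl": [1, 2, 3, 4, 10, 12, 15],
    "pt": [30, 36, 66],
    "es": [5, 6, 7, 9, 11, 23, 28, 31, 33, 34],
    "en": [13, 14, 19, 20, 22, 24, 32, 35, 37, 38],
}

# Map every chosen file to the language it was selected for
file_lang = {
    eml_name(n): lang for lang, numbers in selection.items() for n in numbers
}

# Count files per language; Galician and Portuguese together form one group of 10
counts = Counter(file_lang.values())
```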

In [None]:
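# directory that holds the downloaded .eml files
source_dir = "../../../../eval_data_mailcom"
# directory that mailcom reads the selected emails from
input_dir = "../../../mailcom/test/data_extended/200_eml"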

In [None]:
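# Copy the selected files from source_dir to input_dir.
# Run this cell only when input_dir needs to be (re)populated.
from pathlib import Path
import shutil

# Make sure the target directory exists; otherwise shutil.copy
# would treat input_dir as a file name.
Path(input_dir).mkdir(parents=True, exist_ok=True)
for source_file in sorted(Path(source_dir).glob("*.eml")):
    if source_file.name in chosen_files:
        shutil.copy(source_file, input_dir)
        print(f"Copied {source_file.name} to {input_dir}")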
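
After copying, it can be useful to confirm that every selected file actually arrived in the input directory. The helper below is a small sketch of such a check; `missing_files` is a hypothetical name, and the demonstration uses a temporary directory with dummy files rather than the real corpus.

```python
from pathlib import Path
import tempfile


def missing_files(directory, expected_names):
    """Return the expected .eml file names that are not present in directory."""
    present = {p.name for p in Path(directory).glob("*.eml")}
    return sorted(set(expected_names) - present)


# Demonstration on a temporary directory containing a single dummy file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "01.eml").write_text("dummy")
    missing = missing_files(tmp, ["01.eml", "02.eml"])
```

In the notebook, the call would presumably be `missing_files(input_dir, chosen_files)`, and an empty list would indicate that all 30 files are in place.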

First, we use the language detection, date detection, and pseudonymization features of `mailcom` to build a draft version of the dataset. Each email is then checked manually to validate the draft.

In [None]:
import mailcom
import pandas as pd

In [None]:
# an empty default_lang activates language detection
new_settings = {"default_lang": ""}
workflow_settings = mailcom.get_workflow_settings(new_settings=new_settings,
                                                  save_updated_settings=False)

In [None]:
# import files from input_dir
input_handler = mailcom.get_input_handler(in_path=input_dir, in_type="dir")

In [None]:
# process the input data
mailcom.process_data(input_handler.get_email_list(), workflow_settings)

In [None]:
# load the processed emails into a pandas DataFrame
df = pd.DataFrame(input_handler.get_email_list())

In [None]:
# save the draft dataset as a CSV file, replacing any existing file
mailcom.write_output_data(input_handler, "../../../data/eval_data_200_eml.csv", overwrite=True)
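
The manual check of the draft can be prioritized by flagging emails whose detected language differs from the language they were selected for. The sketch below works on plain dictionaries; the keys `"file"` and `"lang"` are assumptions made for illustration and may not match the keys `mailcom` actually produces, and the records are toy data rather than real results.

```python
def flag_language_mismatches(records, expected_lang):
    """Return (file, expected, detected) for records whose language differs."""
    flagged = []
    for rec in records:
        expected = expected_lang.get(rec["file"])
        # skip files without a reference language
        if expected is not None and rec["lang"] != expected:
            flagged.append((rec["file"], expected, rec["lang"]))
    return flagged


# Toy draft records (invented, not output of mailcom)
draft = [
    {"file": "13.eml", "lang": "en"},
    {"file": "30.eml", "lang": "es"},
]
# Reference languages for the same files, taken from the selection
expected_lang = {"13.eml": "en", "30.eml": "pt"}

mismatches = flag_language_mismatches(draft, expected_lang)
```

Flagged emails are only candidates for a closer look; the manual review remains the source of truth for the final dataset.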