# 👤 Personal Information
Name: **Sinem Ertem**

StudentID: **14616068**

Email: [**sinem.ertem@student.uva.nl**](mailto:sinem.ertem@student.uva.nl)

Submitted on: **DD.MM.YYYY**

# 📊 Data Context
**The Nationaal Archief <a name="NA"></a>[<sup>[1]</sup>](#NA) is the official national archive of the Netherlands, responsible for preserving the country's historical documents and records. It is the largest archive in the country, with holdings dating back to the Middle Ages. Among these holdings are the records of the Dutch United East India Company (Verenigde Oostindische Compagnie, or VOC). Founded in 1602, the VOC was the most successful of all trading companies in the seventeenth and eighteenth centuries <a name="Gaastra2007"></a>[<sup>[2]</sup>](#Gaastra2007).**


**The VOC's archives provide unique insight into the history of early modern Europe and the wider world. They contain records of business activities, such as trade contracts, shipping records, and financial accounts, as well as detailed accounts of the day-to-day life of the VOC and its employees. This wealth of information helps us better understand the dynamics of early modern global trade and colonialism. The archives also shed light on the experiences of the people who lived and worked in the Dutch East Indies, including the effects of colonization on the local population and their interactions with Dutch merchants and traders.**

**Luthra et al. <a name="luthra"></a>[<sup>[3]</sup>](#luthra) released a corpus of nearly 70,000 annotations as a shared task, for which they provide strong baselines using state-of-the-art neural network models. The Data Card <a name="dc"></a>[<sup>[4]</sup>](#dc) provides a synopsis of the dataset, motivations and uses. The dataset will be further analyzed in this notebook.**

## NLP

**One way to better understand these historical archives is through Natural Language Processing (NLP). NLP, also known as computational linguistics, builds computational models and processes to address practical problems in understanding human language.**

**This research paper focuses on extracting useful information from text, specifically through named entity recognition (NER). NER is the identification of entities of interest in text, typically of the types Person, Organisation and Location. Such entities serve as reference points that anchor the meaning of a text and assist in its interpretation.**

**The image below shows the typology used by Luthra et al.**

![image](img/typology.png)

### Entity [Person] with Attribute [Gender]

**The values of the Gender attribute fall into four groups: Man, Woman, Group and Unspecified, each with its own color, as shown in the image below.**

![image](img/attribute_gender.png)

**If we apply the above to an example VOC Testament we get the following annotation:**

![image](img/testament1.png)

**In the image above, 'Poedak' is annotated as Unspecified, 'Neeff Adam Domingo' as Man, and 'vrijechristen Vrouw: Agar Jacobs' as Woman. Now that you have a better understanding of what NER is, let's analyze the dataset.**
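
**As a minimal sketch, the Person entity with its Gender attribute can be modelled as a small data structure. The class name `PersonMention` and the constant `GENDER_VALUES` are hypothetical helpers introduced here for illustration; they are not part of the dataset or of the code by Luthra et al. Only the three mentions and the four Gender values come from the text above.**

```python
from collections import Counter
from dataclasses import dataclass

# The four Gender values named in the typology above.
GENDER_VALUES = ("Man", "Woman", "Group", "Unspecified")


@dataclass(frozen=True)
class PersonMention:
    """A Person mention with its Gender attribute (hypothetical helper)."""

    text: str
    gender: str

    def __post_init__(self):
        # Reject any value outside the four groups of the typology.
        if self.gender not in GENDER_VALUES:
            raise ValueError(f"unknown Gender value: {self.gender!r}")


# The three mentions discussed for the example testament.
mentions = [
    PersonMention("Poedak", "Unspecified"),
    PersonMention("Neeff Adam Domingo", "Man"),
    PersonMention("vrijechristen Vrouw: Agar Jacobs", "Woman"),
]

# Tally how often each Gender value occurs among the mentions.
gender_counts = Counter(m.gender for m in mentions)
```

**Validating the value at construction time makes typos in attribute values fail early, before they end up as separate categories in later counts.**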

**[1] https://www.nationaalarchief.nl/en \
[2] https://www.nationaalarchief.nl/sites/default/files/afbeeldingen/toegangen/NL-HaNA_1.04.02_introduction-VOC.pdf \
[3] https://www.emerald.com/insight/content/doi/10.1108/JD-02-2022-0038/full/html \
[4] https://github.com/budh333/UnSilence_VOC/blob/main/Datacard.pdf**

# 📄 Data Description

### Import libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

from tqdm import tqdm
import os
from glob import glob

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from preprocessing import clean_text

import spacy
from spacy.lang.nl.examples import sentences 
nlp = spacy.load("nl_core_news_sm") 
# nlp.pipe_names ['tok2vec', 'morphologizer', 'tagger', 'parser', 'lemmatizer', 'attribute_ruler','ner']



### Data Loading

In [2]:
# The annotated_data directory holds five subdirectories (A to E), each containing both .txt and .ann files
data_path = '../UnSilence_VOC-main/data/annotated_data/'

In [3]:
# In total there are 2199 annotation files and text files 
for i in ['A','B','C','D','E']:
    file_list = os.listdir(data_path+i)
    num_ann_files = len([filename for filename in file_list if filename.endswith('.ann')])
    print(num_ann_files)

333
267
624
463
512
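
The counts above cover only the `.ann` files. A minimal sketch of how one might also check that every annotation file has a matching text file is shown here. The function name `count_annotation_pairs` is hypothetical, and the sketch assumes that a `.txt` file and its `.ann` file share the same file stem, as the file names used later in this notebook suggest.

```python
from pathlib import Path


def count_annotation_pairs(root, subdirs=("A", "B", "C", "D", "E")):
    """Count .ann and .txt files per subdirectory and list unmatched stems."""
    summary = {}
    for name in subdirs:
        folder = Path(root) / name
        # Compare file stems so that 'x.ann' is matched with 'x.txt'.
        ann_stems = {p.stem for p in folder.glob("*.ann")}
        txt_stems = {p.stem for p in folder.glob("*.txt")}
        summary[name] = {
            "ann": len(ann_stems),
            "txt": len(txt_stems),
            # Stems that occur with only one of the two extensions.
            "unpaired": sorted(ann_stems ^ txt_stems),
        }
    return summary
```

Calling `count_annotation_pairs(data_path)` would return one entry per subdirectory; an empty `unpaired` list for all five would indicate that the text and annotation files line up one to one.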


In [4]:
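def read_file(path):
    """Print every line of a text file with surrounding whitespace stripped."""
    # The pages contain non-ASCII characters (e.g. „); UTF-8 is assumed here
    with open(path, encoding="utf-8") as f:
        for line in f:
            print(line.strip())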

#### VOC testament 1

In [5]:
read_file(data_path+'A/NL-HaNA_1.04.02_6848_0031.txt')

11
„
„
35
in beneven
zijn Gemelde aanbehumde vader de voorneemde
_
5
tames va
e2
25
6
welken hy testateur betuigde in dienst aangenemen en uitgevaren
te zyn ten Eynde door haar hoog Edele groot agtb=r uytgekeerd
te werden aan de geene da tot dies ontfangst zal ef zullen
Wesen gequalificeerd
Behuygende den toetateur op myne gedane verage dat zyn beidel
beneden de Twee duivend rd:s was bedragen de
voorts heb ik Not:s den Poetaleur behoorlyk g'informeerd van de
Jongste besluyten door welmelde hunne hoog Edelhedens ten op„
Bigt van de Lijsteygenen die gedoopt in de christelyke leere onderwesen
zijn Successive genamen.
Al T Gunt voorsz: staat den testateur ure en duydelyk voorge„
lesen en voorgehouden in by zin E: zoo hy betuigde wel verstaan zynde
begeerde hy dat dat instrument zijn volkomen efect mogte ge„
pielen t By als testaaren Codicell gifte ter zake des doods chte
Eenige andere makinge van uytferste Wille zulx ’t zelve best
na regten zal konnen bestaan, schoon de nodige Solemnityten

#### VOC testament 1 annotations

In [6]:
read_file(data_path+'A/NL-HaNA_1.04.02_6848_0031.ann')

T1	Person 413 432	Not:s den Poetaleur
A1	Gender T1 Man
A2	LegalStatus T1 Unspecified
T2	ProperName 419 431	den Poetaleu
T4	ProperName 1147 1160	Lucas Andries
T7	Person 1316 1336	J: N: Bestbier not:s
A10	Gender T7 Man
A11	LegalStatus T7 Unspecified
T8	ProperName 1316 1330	J: N: Bestbier
T10	ProperName 1464 1471	Batavia
T12	ProperName 1517 1528	H: N: Lacle
T13	Person 1633 1660;1661 1678	Nicolaas van Bergen van der Grijp Not:o publ:
A16	Gender T13 Man
A17	LegalStatus T13 Unspecified
T14	ProperName 1633 1660;1661 1666	Nicolaas van Bergen van der Grijp
T16	Place 1712 1728	nederlands indra
T17	ProperName 1712 1728	nederlands indra
T18	ProperName 1692 1728	hooge regeringe van nederlands indra
T19	Place 1742 1764	binnen de stad Batavia
T20	ProperName 1757 1764	Batavia
T22	ProperName 1817 1834;1835 1840	Nicolaas Iohannes Ouman
T26	Person 1517 1540	H: N: Lacle Secretaris.
A25	Gender T26 Man
A26	LegalStatus T26 Unspecified
A3	Role T1 Notary
T3	Person 1147 1160	Lucas Andries
A4	Gender T3 Man
A5	Le
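
The output above follows the brat standoff layout: `T` lines carry an entity type, one or more character spans (separated by `;` when the mention is discontinuous) and the covered text, while `A` lines attach an attribute value to a `T` identifier. A minimal sketch of a parser for these two line types is given below. The names `Entity` and `parse_ann` are hypothetical, and the sketch deliberately skips any line it cannot interpret, such as a truncated one.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One T line plus the attributes from the A lines that point to it."""

    id: str
    type: str
    spans: list  # list of (start, end) character offsets
    text: str
    attributes: dict = field(default_factory=dict)


def parse_ann(lines):
    """Parse T and A lines of an .ann file into a dict of Entity objects."""
    entities = {}
    pending_attributes = []
    for raw in lines:
        parts = raw.rstrip("\n").split("\t")
        if parts[0].startswith("T") and len(parts) == 3:
            ann_id, header, text = parts
            ent_type, offsets = header.split(" ", 1)
            # Discontinuous mentions list several 'start end' pairs joined by ';'.
            spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
            entities[ann_id] = Entity(ann_id, ent_type, spans, text)
        elif parts[0].startswith("A") and len(parts) == 2:
            fields = parts[1].split()
            if len(fields) == 3:  # attribute name, target id, value
                pending_attributes.append(tuple(fields))
    # Attach attributes last, because an A line may precede its T line.
    for name, target, value in pending_attributes:
        if target in entities:
            entities[target].attributes[name] = value
    return entities
```

With the file read via `f.readlines()`, `parse_ann(lines)["T1"]` would hold the Person mention together with its Gender, LegalStatus and Role attributes, which is easier to analyze than the raw lines.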

### SpaCy NER

In [29]:
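with open(data_path+'A/NL-HaNA_1.04.02_6848_0031.txt') as f:
    lines = f.readlines()

# clean_text comes from the local preprocessing module
text = clean_text(lines)
doc = nlp(text)
# Print each entity found by the Dutch spaCy model, with its label and the label's description
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))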

twee | CARDINAL | Numerals that do not fall under another type
november | DATE | Absolute or relative dates or periods
o | NORP | Nationalities or religious or political groups
nederlands | NORP | Nationalities or religious or political groups
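
The spaCy model returns labels such as CARDINAL, DATE and NORP, whereas the annotations for the same page use Person, Place and ProperName. One way to quantify that gap is an exact-match comparison of spans, sketched below. The names `LABEL_MAP`, `to_gold_scheme` and `span_scores` are hypothetical, the label mapping is an assumption made for illustration, and the comparison is only meaningful if both sets of character offsets refer to the same text (which may not hold once `clean_text` has altered the page).

```python
# Assumed mapping from spaCy labels to the annotation scheme (illustrative only).
LABEL_MAP = {"PERSON": "Person", "GPE": "Place", "LOC": "Place"}


def to_gold_scheme(predicted):
    """Keep predicted (start, end, label) spans whose label can be mapped."""
    return {
        (start, end, LABEL_MAP[label])
        for start, end, label in predicted
        if label in LABEL_MAP
    }


def span_scores(predicted, gold):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

The predicted spans could be collected as `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc.ents`. Exact matching is strict; a partial-overlap criterion would be a more lenient alternative.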


In [36]:
with open(data_path+'A/NL-HaNA_1.04.02_6848_0031.ann') as f:
    lines = f.readlines()
print(lines)

['T1\tPerson 413 432\tNot:s den Poetaleur\n', 'A1\tGender T1 Man\n', 'A2\tLegalStatus T1 Unspecified\n', 'T2\tProperName 419 431\tden Poetaleu\n', 'T4\tProperName 1147 1160\tLucas Andries\n', 'T7\tPerson 1316 1336\tJ: N: Bestbier not:s\n', 'A10\tGender T7 Man\n', 'A11\tLegalStatus T7 Unspecified\n', 'T8\tProperName 1316 1330\tJ: N: Bestbier\n', 'T10\tProperName 1464 1471\tBatavia\n', 'T12\tProperName 1517 1528\tH: N: Lacle\n', 'T13\tPerson 1633 1660;1661 1678\tNicolaas van Bergen van der Grijp Not:o publ:\n', 'A16\tGender T13 Man\n', 'A17\tLegalStatus T13 Unspecified\n', 'T14\tProperName 1633 1660;1661 1666\tNicolaas van Bergen van der Grijp\n', 'T16\tPlace 1712 1728\tnederlands indra\n', 'T17\tProperName 1712 1728\tnederlands indra\n', 'T18\tProperName 1692 1728\thooge regeringe van nederlands indra\n', 'T19\tPlace 1742 1764\tbinnen de stad Batavia\n', 'T20\tProperName 1757 1764\tBatavia\n', 'T22\tProperName 1817 1834;1835 1840\tNicolaas Iohannes Ouman\n', 'T26\tPerson 1517 1540\tH: N