# Prepare Data for Training
In this notebook, we prepare a dataset we can use for training a lemmatizer with lemma.

We use two datasets, both publicly available. The first is the word list from Dansk Sprognævn (DSN). This dataset is freely available, but you have to sign a contract with DSN to obtain the file. Please see [www.dsn.dk](https://www.dsn.dk) for more information. The other is the Danish part of Universal Dependencies (UD). This dataset is open source and available from the [UD repo](https://github.com/UniversalDependencies/UD_Danish) on GitHub.

The notebook assumes you have the datasets stored in a subfolder called *data*.

In [1]:
from IPython.display import display, HTML
import logging
from bs4 import BeautifulSoup
from collections import defaultdict, Counter
import unicodecsv as csv
from tqdm import tqdm
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s : %(message)s")

In [2]:
UD_TRAIN_FILE = "./data/UD_Danish/da-ud-train.conllu"
DSN_XML_FILE = "./data/DSN/RO.iLexdump.m.fuldformer.til.aftagere.xml"
PREPARED_FILE = "./data/prepared.csv"
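
Since the notebook assumes the datasets are already in the *data* subfolder, it can help to check for them before running the remaining cells. The following is a minimal sketch and not part of the original notebook; `missing_inputs` is a hypothetical helper name.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Assumed usage with the constants defined in the cell above:
# missing_inputs([UD_TRAIN_FILE, DSN_XML_FILE])
```

An empty list means both input files were found.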

## Parse DSN XML data
We will now read the DSN data. Because the word classes used in the DSN data do not map 1-to-1 to UD POS tags, we need to do some manual mapping. The `CLASS_LOOKUP` dictionary specifies the mapping.

In [3]:
CLASS_LOOKUP = {"sb": ["NOUN"],
                "adj": ["ADJ"],
                "adv": ["ADV"],
                "vb": ["VERB"],
                "proprium": ["PROPN"],
                "præp": ["ADP"],
                "udråbsord": ["INTJ"],
                "pron": ["PRON"],
                "talord": ["NUM"],
                "konj": ["CONJ"],
                "romertal": ["NUM"],
                "kolon": ["NOUN"],
                "lydord": ["NOUN"],
                "art": ["PRON"]}

def _build_dsn_forms(soup):
    unknown_classes = defaultdict(int)
    forms = set()
    logging.debug("finding 'hom' tags")
    homograph_groups = soup.find_all('hom', recursive=True)
    for hom_group in tqdm(homograph_groups, leave=False):
        for article in hom_group.find_all(recursive=False):
            word_class_temp = article.name.split('-')[0]
            word_classes = CLASS_LOOKUP.get(word_class_temp, None)

            if not word_classes:
                unknown_classes[word_class_temp] += 1
                continue

            head_node = article.find('hoved')
            lemma = head_node.find('opslagsord').get_text()
            full_forms = article.find('fuldformer')
            if full_forms is None:
                continue

            form_of = head_node.find('form.af')
            if form_of:
                # The lookup word ('artikel') itself is not the baseform.
                continue

            for full_form_tag in full_forms.find_all('ff', recursive=False):
                full_form = full_form_tag.get_text()
                for word_class in word_classes:
                    forms.add((word_class, full_form, lemma))
    return sorted(forms, key = lambda x: x[1:]), unknown_classes

def _write_form(word_class, full_form, lemma):
    # Relies on the global `writer`, which is created in the cell that writes the CSV file.
    writer.writerow([word_class, full_form, lemma])
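
To illustrate how `_build_dsn_forms` derives a word class, the sketch below takes the part of a tag name before the first hyphen, looks it up, and counts anything unmapped instead of processing it. The tag names are made up for the example (the real DSN element names may differ), and the lookup table is a two-entry excerpt of `CLASS_LOOKUP`.

```python
from collections import defaultdict

# Two-entry excerpt of CLASS_LOOKUP, repeated so the example is self-contained.
lookup_excerpt = {"sb": ["NOUN"], "vb": ["VERB"]}

# Hypothetical tag names; the real DSN element names may differ.
tag_names = ["sb-artikel", "vb-artikel", "fork-artikel"]

unknown_classes = defaultdict(int)
mapped = []
for name in tag_names:
    word_class = name.split("-")[0]
    pos_tags = lookup_excerpt.get(word_class)
    if not pos_tags:
        # No mapping to a UD POS tag: count it and skip the entry.
        unknown_classes[word_class] += 1
        continue
    mapped.append((word_class, pos_tags))

print(mapped)                 # [('sb', ['NOUN']), ('vb', ['VERB'])]
print(dict(unknown_classes))  # {'fork': 1}
```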

In [4]:
logging.debug("reading XML...")
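# Open in binary mode so the XML parser can detect the encoding itself,
# and use a context manager so the file is closed after parsing.
with open(DSN_XML_FILE, 'rb') as xml_file:
    soup = BeautifulSoup(xml_file, 'xml')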

DEBUG : reading XML...


In [5]:
dsn_forms, unknown = _build_dsn_forms(soup)

DEBUG : finding 'hom' tags
                                                       

## Parse UD data
We want to learn from both the DSN and UD data. While DSN is the authoritative source, UD does contain words not found in DSN. In case of inconsistencies between DSN and UD, we choose DSN over UD.

Universal Dependencies (UD) and spaCy use POS-tags such as *DET* and *AUX* which are not used in the DSN word lists. So, we use UD to learn these.

For adjectives (*ADJ*), the DSN word lists are incomplete. They do not contain the various *degrees* of the adjectives, for example the forms *hurtigere* (faster) and *hurtigst* (fastest). We try to remedy this by learning from UD.

UD contains a large number of proper nouns (*PROPN*) not found in DSN, specifically personal names. We might as well learn from these too, so we read the entire UD train file.

Unfortunately, there is some amount of inconsistency between DSN and UD. To avoid introducing ambiguity because of this, we will only add data from UD if the POS + full form combination does not exist in DSN.

In [6]:
def _parse_ud_line(line):
    return line.split("\t")[1:4]

def _ud_forms(ud_file, min_freq=1):
    counts = {}
    pos_prev = ""
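    # CoNLL-U files are UTF-8 encoded; the context manager closes the file.
    with open(ud_file, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                # Skip sentence-level comment lines.
                continue
            if line.strip() == "":
                # A blank line marks a sentence boundary: reset the history.
                pos_prev = ""
                continue

            orth, lemma, pos = _parse_ud_line(line)
            orth = orth.lower()
            lemma = lemma.lower()
            key = (pos_prev, pos, orth, lemma)
            counts[key] = counts.get(key, 0) + 1
            pos_prev = pos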
    
    return [key for key in counts if counts[key] >= min_freq]

ud_forms = _ud_forms(UD_TRAIN_FILE)

In [7]:
ud_forms[:5]

[('', 'ADP', 'på', 'på'),
 ('ADP', 'NOUN', 'fredag', 'fredag'),
 ('NOUN', 'AUX', 'har', 'have'),
 ('AUX', 'PROPN', 'sid', 'sid'),
 ('PROPN', 'VERB', 'inviteret', 'invitere')]
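
To make the column selection in `_parse_ud_line` concrete: a CoNLL-U token line has ten tab-separated columns, of which the second to fourth are FORM, LEMMA and UPOS. The sketch below applies the same slice to a single made-up token line (not taken from the UD file).

```python
def parse_ud_line(line):
    # Columns 2-4 of a CoNLL-U token line: FORM, LEMMA, UPOS.
    return line.split("\t")[1:4]


# Made-up token line with the ten tab-separated CoNLL-U columns.
sample = "3\thar\thave\tAUX\t_\t_\t5\taux\t_\t_\n"

orth, lemma, pos = parse_ud_line(sample)
print(orth, lemma, pos)  # har have AUX
```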

## Filter UD data
We will now filter the word forms read from UD. We do this to avoid introducing ambiguity due to spelling errors and typos in UD.

We want to include only the following:
1. Any POS + full form combination *not* found in DSN.
2. Any POS_PREV + POS + full form combination for which the POS + full form is *ambiguous* (maps to more than one lemma) in the combined data from DSN and Step 1.
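
The two steps can be sketched on toy data as follows. The tuples are invented for illustration and are not taken from DSN or UD, and the variable names are chosen for this example only.

```python
from collections import Counter

# Toy DSN entries: (pos, full_form, lemma). "så" is ambiguous as a verb.
dsn = [("NOUN", "hus", "hus"), ("VERB", "så", "se"), ("VERB", "så", "så")]

# Toy UD entries: (pos_prev, pos, full_form, lemma).
ud = [("PRON", "VERB", "så", "se"),
      ("ADP", "PROPN", "sid", "sid"),
      ("DET", "NOUN", "hus", "hus")]

# Step 1: keep UD entries whose (pos, full_form) is not in DSN.
dsn_keys = {(pos, form) for pos, form, _lemma in dsn}
ud_new = sorted({(pos, form, lemma) for _prev, pos, form, lemma in ud
                 if (pos, form) not in dsn_keys})
combined = dsn + ud_new

# Step 2: for (pos, full_form) pairs with more than one lemma, add UD
# entries whose word class is prefixed with the previous POS tag.
counter = Counter((pos, form) for pos, form, _lemma in combined)
ambiguous = {key for key, count in counter.items() if count > 1}
with_history = [(f"{prev}_{pos}", form, lemma) for prev, pos, form, lemma in ud
                if (pos, form) in ambiguous]

print(ud_new)        # [('PROPN', 'sid', 'sid')]
print(ambiguous)     # {('VERB', 'så')}
print(with_history)  # [('PRON_VERB', 'så', 'se')]
```

The history-prefixed entries supplement the plain entries rather than replacing them.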


In [8]:
dsn_full_forms = set((pos, full_form) for pos, full_form, _lemma in dsn_forms)

In [9]:
ud_forms_unique = [(pos, full_form, lemma) for (_pos_prev, pos, full_form, lemma) in ud_forms if (pos, full_form) not in dsn_full_forms]
ud_forms_unique = list(set(ud_forms_unique))

In [10]:
dsn_ud_no_history = dsn_forms + ud_forms_unique

In [11]:
# Find ambiguous ones
counter = Counter(t[:2] for t in dsn_ud_no_history)
ambiguous = {key for key, count in counter.items() if count > 1}
len(ambiguous)

3210

In [12]:
dsn_ud_with_history = [(f'{f[0]}_{f[1]}',) + f[2:] for f in ud_forms if f[1:3] in ambiguous]
len(dsn_ud_with_history)

504

In [13]:
with open(PREPARED_FILE, 'wb') as csvfile:
    writer = csv.writer(csvfile,
                        delimiter=",",
                        quotechar='"',
                        quoting=csv.QUOTE_MINIMAL,
                        encoding='utf-8',
                        lineterminator='\n')
    
    writer.writerow(['word_class', 'full_form', 'lemma'])
    
    for pos, full_form, lemma in sorted(dsn_ud_no_history, key = lambda x: x[1:]):
        _write_form(pos, full_form, lemma)
    for pos, full_form, lemma in sorted(dsn_ud_with_history, key = lambda x: x[1:]):
        _write_form(pos, full_form, lemma)
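
As a rough picture of the resulting file layout, the sketch below writes two invented rows with the same CSV settings and reads them back. It uses the standard-library `csv` module and an in-memory buffer instead of `unicodecsv` and a file on disk, so it is an approximation of the cell above rather than the notebook's own code.

```python
import csv
import io

# Invented rows: one plain entry and one with a POS_PREV prefix.
rows = [("NOUN", "hus", "hus"), ("PRON_VERB", "så", "se")]

buffer = io.StringIO()
writer = csv.writer(buffer,
                    delimiter=",",
                    quotechar='"',
                    quoting=csv.QUOTE_MINIMAL,
                    lineterminator="\n")
writer.writerow(["word_class", "full_form", "lemma"])
writer.writerows(rows)

# Read the data back, using the header row for the field names.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[1]["word_class"])  # PRON_VERB
```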

## Unknown classes from DSN
Finally, we list the word classes found in DSN for which we do not have a mapping to UD POS tags, along with the number of entries skipped for each. Further investigation of these is an area for future work.

In [14]:
for key, value in unknown.items():
    print(key, value)

fork 361
flerord.forb. 196
præfiks 58
formelt 2
