# Prepare Data for Training
In this notebook, we prepare a dataset which can be used for training a Swedish lemmatizer with Lemmy.

**NOTE**: You do *not* need to run this notebook to use Lemmy. The lemmatizer comes trained and ready to use! This notebook is only needed if you want to train the lemmatizer yourself, for example because you want it trained on a specific dataset.

We use two datasets which are both publicly available. One is [LEXIN](https://spraakbanken.gu.se/swe/resurser/nerladdning) and the other is one of the Swedish parts of the Universal Dependencies (UD). The UD dataset is open source and available from the [UD repo](https://github.com/UniversalDependencies/UD_Swedish-Talbanken) on GitHub.

The notebook assumes you have the datasets stored in a subfolder called *data*.

In [6]:
from IPython.display import display, HTML
import logging
from bs4 import BeautifulSoup
from collections import defaultdict, Counter
import unicodecsv as csv
from tqdm import tqdm
import regex
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s : %(message)s")

In [7]:
UD_TRAIN_FILE = "./data/UD_Swedish-Talbanken/sv_talbanken-ud-train.conllu"
LEXIN_XML_FILE = "./data/LEXIN/LEXIN.xml"
SALDOM_XML_FILE = "./data/Saldom/saldom.xml"
PREPARED_FILE = "./data/prepared.csv"
NORMS_FILE = "./data/norms.csv"

## Parse Lexin XML data
Our first step is reading the LEXIN data. We train the lemmatizer to use POS tags to help predict the lemma, and we use the UD set of POS tags. Because the word classes used in the LEXIN data differ from the UD POS tags, we need to do some manual mapping. The `CLASS_LOOKUP` dictionary specifies the mapping.

In [9]:
CLASS_LOOKUP = {"subst.": "NOUN",
                "verb": "VERB",
                "adj.": "ADJ",
                "adv.": "ADV",
                "namn": "PROPN",
                "interj.": "INTJ",
                "räkn.": "NUM",
                "pron.": "PRON",
                "konj.": "CONJ",
                "prep.": "ADP"
               }

def _build_lexin_tuples(soup):
    unknown_classes = defaultdict(int)
    forms = set()
    entries = soup.find_all('lemma-entry', recursive=True)
    for entry in tqdm(entries):
        lemma = entry.form.get_text()
        if " " in lemma:
            continue
            
        # Some entries in LEXIN are too far from being UD compatible.
        # We simply skip them and expect to learn them from UD data.
        if lemma == 'är':
            continue
        

        if "~" in lemma:
            temp = lemma.split("~")
            prefix = temp[0]
            lemma = "".join(temp)
        else:
            prefix = None

        if not entry.pos:
            continue

        word_class = entry.pos.get_text()
        if word_class not in CLASS_LOOKUP:
            unknown_classes[word_class] += 1
            continue
        word_class = CLASS_LOOKUP[word_class]
           

        full_forms = entry.inflection.get_text().strip()
        full_forms = full_forms.replace("(!)", "")
        if full_forms == "explicit a":
            full_forms = ["explicit"]
        elif full_forms:
            full_forms = _parse_inflections(full_forms)
        elif lemma.endswith("(s)") or lemma.endswith("(a)"):            
            # For some pronouns, the lemma ends with '(s)' or '(a)'.
            lemma, temp = _process_optional(lemma)            
            full_forms = [temp]
        else:
            full_forms = []


        temp = []
        for form in full_forms:
            if form.startswith("-"):
                if prefix:
                    temp.append(prefix + form[1:])
                else:
                    applied = _with_suffix(lemma, form[1:])
                    if applied:
                        temp.append(applied)
                    else: 
                        temp.append(lemma + form[1:])
            else:
                if prefix:
                    applied = _with_suffix(lemma, form)
                    if applied:
                        temp.append(applied)
                    else: 
                        temp.append(lemma + form)

                else:
                    temp.append(form)
        full_forms = temp

        if word_class == "VERB":
            # For verbs, the LEXIN data do not contain the lemma as the lookup word ('form')
            # Instead, it's the last of the inflections, so we must do some swapping            
            full_forms = [lemma] + full_forms
            lemma = full_forms[-1]
            full_forms = full_forms[:-1]
        else:
            full_forms.append(lemma)       
        
        for full_form in full_forms:
            forms.add((word_class, full_form, lemma))
    return sorted(forms, key = lambda x: x[1:]), unknown_classes

def _parse_inflections(inflections):
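    # Drop the "(el. ...)" / "(eller ...)" / "(vard. även ...)" wrapper and keep the alternative form inside it.
    temp = regex.sub(r"(\((el\.|eller|vard. även|el))(.+?)(\))", r'\g<3>', inflections)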
    result = []
    for form in temp.split():
        result.extend(_process_optional(form))
    return result

def _process_optional(form):
    # Match a single parenthesised optional part, e.g. 'stem(s)'.
    m = regex.match(r"(.*)\((.*)\)(.*)", form)
    if not m:
        return [form]
    return [m.group(1) + m.group(3), m.group(1) + m.group(2) + m.group(3)]

def _with_suffix(lemma, suffix, allow_recursions=3):
    temp_suffix = suffix
    while len(temp_suffix):
        # print(temp_suffix)
        if lemma.endswith(temp_suffix):
            return lemma[:-len(temp_suffix)] + suffix
        temp_suffix = temp_suffix[:-1]
    
    if allow_recursions > 0:
        return _with_suffix(lemma[:-1], suffix, allow_recursions-1)
    
    return None
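
# The two helpers above are easiest to follow on concrete inputs. The sketch
# below re-implements them with the standard-library `re` module so that it
# runs on its own. The input words are illustrative assumptions and are not
# taken from LEXIN.

```python
import re


def process_optional(form):
    """Expand one parenthesised optional part into both variants."""
    m = re.match(r"(.*)\((.*)\)(.*)", form)
    if not m:
        return [form]
    return [m.group(1) + m.group(3), m.group(1) + m.group(2) + m.group(3)]


def with_suffix(lemma, suffix, allow_recursions=3):
    """Attach `suffix` where its beginning overlaps the end of `lemma`."""
    temp_suffix = suffix
    while temp_suffix:
        # Try ever shorter prefixes of the suffix against the end of the lemma.
        if lemma.endswith(temp_suffix):
            return lemma[:-len(temp_suffix)] + suffix
        temp_suffix = temp_suffix[:-1]

    # No overlap found: drop the last letter of the lemma and try again.
    if allow_recursions > 0:
        return with_suffix(lemma[:-1], suffix, allow_recursions - 1)
    return None


print(process_optional("vilken(s)"))    # ['vilken', 'vilkens']
print(with_suffix("bil", "len"))        # bilen
print(with_suffix("fönster", "stret"))  # fönstret
print(with_suffix("gata", "or"))        # None
```

# When no overlap is found (the last case), `_build_lexin_tuples` falls back
# to plain concatenation of lemma and suffix.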

In [8]:
# Load the XML and parse it using Beautiful Soup
soup = BeautifulSoup(open(LEXIN_XML_FILE, 'rb'), 'xml', from_encoding='iso-8859-1')

In [10]:
# Build tuples of POS, full form and lemma
lexin_tuples, unknown = _build_lexin_tuples(soup)

100%|██████████| 19718/19718 [00:02<00:00, 7232.31it/s]


In [12]:
lexin_tuples[:5]

[('NOUN', 'A-inkomst', 'A-inkomst'),
 ('NOUN', 'A-inkomsten', 'A-inkomst'),
 ('NOUN', 'A-inkomster', 'A-inkomst'),
 ('NOUN', 'A-kassa', 'A-kassa'),
 ('NOUN', 'A-kassan', 'A-kassa')]

## Parse Saldom Data

In [208]:
# Load the XML and parse it using Beautiful Soup
soup = BeautifulSoup(open(SALDOM_XML_FILE), 'xml')

In [209]:
CLASS_LOOKUP = {"vb": "VERB",
                "nn": "NOUN",
                "av": "ADJ",
                "pm": "PROPN",
                "ab": "ADV",
                "nna": "NOUN",
                "pp": "ADP",
                "pma": "PROPN",
                "in": "INTJ",
                "nl": "NUM",
                "pn": "PRON",
                "sn": "CCONJ",
                "aba": "ADV",
                "ava": "ADJ",
                "ppa": "ADP",
                "kn": "CONJ",
                "kna": "CONJ",
                "vba": "VERB",
                "al": "DET",
               }

def _build_saldom_tuples(soup):
    unknown_classes = defaultdict(int)
    forms = set()
    # LexicalEntry elements are direct children of LexicalResource > Lexicon.
    entries = soup.find("LexicalResource").find("Lexicon").find_all('LexicalEntry', recursive=False)

    for entry in tqdm(entries):
        lemma_feats = entry.find("Lemma").find("FormRepresentation").find_all("feat")
        lookup = { feat['att']: feat['val'] for feat in lemma_feats }
        lemma = lookup['writtenForm']
        if " " in lemma:
            continue
        word_class = lookup['partOfSpeech']
        if word_class not in CLASS_LOOKUP:
            unknown_classes[word_class] += 1
            continue
        word_class = CLASS_LOOKUP[word_class]

        #if lemma == "mången" and word_class == "PRON":
        #    word_class = "ADJ"
        
        full_forms = [feat["val"] for form in entry.find_all("WordForm") for feat in form.find_all("feat") if feat["att"] == "writtenForm"]
        for full_form in full_forms:
            forms.add((word_class, full_form, lemma))
    return sorted(forms, key = lambda x: x[1:]), unknown_classes

In [210]:
# Build tuples of POS, full form and lemma
saldom_tuples, unknown = _build_saldom_tuples(soup)

100%|██████████| 128036/128036 [00:27<00:00, 4696.63it/s]


In [211]:
saldom_tuples[:5]
#[t for t in saldom_tuples if t[0] == 'CCONJ'][:20]

[('NOUN', '%', '%'),
 ('NOUN', '%-', '%'),
 ('CONJ', '&', '&'),
 ('ADP', '+', '+'),
 ('ADP', '-', '-')]

In [212]:
[t for t in saldom_tuples if t[0] == 'ADJ' and t[1] == 'många']

[]

In [213]:
#sorted(unknown.keys())
sorted(unknown.items(), key=lambda x: -1*x[1])

[('sxc', 169), ('mxc', 24), ('avh', 15), ('nnh', 12), ('abh', 1)]

## Parse UD data
We will now read the UD data.

Some of the UD POS tags, such as *DET* and *AUX*, cannot be mapped 1-to-1 to the word classes used in the lexicon data. Consequently, we learn the words with those POS tags from UD.

Since the UD data is not just a word list but actual sentences annotated with lemmas and POS tags (and more), we have the benefit of having not only the POS tag of the word we want to lemmatize, but also the POS tag of the previous word. We can use this to improve the accuracy of our lemmatizer, so when building the list of tuples from UD, we include the POS tag of the previous word of the sentence. This is set to the empty string when the current word is the first word of the sentence.
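
As a minimal sketch of this idea, the snippet below builds the same kind of tuples from a few in-memory CoNLL-U rows instead of the real file. The rows in `SAMPLE` and the helper name `build_tuples` are assumptions made up for illustration; only columns 2 to 4 (form, lemma, UPOS) are filled in.

```python
from collections import Counter

# Hypothetical CoNLL-U rows: tab-separated, 10 columns, unused columns set to "_".
SAMPLE = "\n".join([
    "# text = Individuell beskattning av arbetsinkomster",
    "1\tIndividuell\tindividuell\tADJ\t_\t_\t_\t_\t_\t_",
    "2\tbeskattning\tbeskattning\tNOUN\t_\t_\t_\t_\t_\t_",
    "3\tav\tav\tADP\t_\t_\t_\t_\t_\t_",
    "4\tarbetsinkomster\tarbetsinkomst\tNOUN\t_\t_\t_\t_\t_\t_",
    "",
    "1\tGenom\tgenom\tADP\t_\t_\t_\t_\t_\t_",
    "",
])


def build_tuples(lines):
    """Return (pos_prev, pos, full_form, lemma) tuples in order of first appearance."""
    counts = Counter()
    pos_prev = ""
    for line in lines:
        if line.startswith("#"):
            continue
        if not line.strip():
            # Sentence boundary: the next word has no previous POS tag.
            pos_prev = ""
            continue
        orth, lemma, pos = line.split("\t")[1:4]
        counts[(pos_prev, pos, orth.lower(), lemma.lower())] += 1
        pos_prev = pos
    return list(counts)


tuples = build_tuples(SAMPLE.split("\n"))
for t in tuples:
    print(t)
```

The first word of each sentence gets the empty string as its previous POS tag, which is why both `individuell` and `genom` start with `''`.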

In [214]:
def _parse_ud_line(line):
    return line.split("\t")[1:4]

def _build_ud_tuples(ud_file, min_freq=1):
    counts = {}
    pos_prev = ""
    # Read the CoNLL-U file line by line; the context manager closes it afterwards.
    with open(ud_file, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                # Sentence-level comment lines carry no tokens.
                continue
            if line.strip() == "":
                # A blank line ends the sentence, so reset the history.
                pos_prev = ""
                continue

            orth, lemma, pos = _parse_ud_line(line)
            orth = orth.lower()
            lemma = lemma.lower()
            if pos == 'ADJ' and orth == 'många' and lemma == 'mången':
                lemma = 'många'
            key = (pos_prev, pos, orth, lemma)
            counts[key] = counts.get(key, 0) + 1
            pos_prev = pos

    return [key for key in counts if counts[key] >= min_freq]

ud_tuples = _build_ud_tuples(UD_TRAIN_FILE)

In [215]:
ud_tuples[:5]

[('', 'ADJ', 'individuell', 'individuell'),
 ('ADJ', 'NOUN', 'beskattning', 'beskattning'),
 ('NOUN', 'ADP', 'av', 'av'),
 ('ADP', 'NOUN', 'arbetsinkomster', 'arbetsinkomst'),
 ('', 'ADP', 'genom', 'genom')]

In [216]:
[t for t in ud_tuples if t[1] == 'ADJ' and t[2] == 'många' and t[3] == 'mången']

[]

## Filter UD data
We will now filter the word forms read from UD. We do this to avoid introducing ambiguity due to spelling errors and typos in UD.

We want to include the following only:
1. Any POS + full form combination *not* found in Saldom.
2. Any POS_PREV + POS + full form combination for which the POS + full form is *ambiguous* in Saldom + Step 1.

By *ambiguous* we mean full forms (or combinations of POS tags and full forms) which have more than one lemma associated with them, so that the lemmatizer cannot know which of the lemmas to choose.
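
Step 1 can be sketched on toy data as follows. The tuples below, including the misspelled lemma, are made up for illustration and do not come from Saldom or UD.

```python
# Toy lexicon tuples: (POS, full form, lemma).
lexicon = [
    ("NOUN", "bil", "bil"),
    ("NOUN", "bilar", "bil"),
]

# Toy UD tuples: (POS_PREV, POS, full form, lemma).
ud = [
    ("", "NOUN", "bilar", "bill"),  # hypothetical typo in the lemma
    ("ADP", "NOUN", "arbetsinkomster", "arbetsinkomst"),
]

# POS + full form combinations that the lexicon already covers.
known = {(pos, form) for pos, form, _lemma in lexicon}

# Keep only UD entries whose POS + full form is new, dropping the history.
ud_new = {(pos, form, lemma)
          for _prev, pos, form, lemma in ud
          if (pos, form) not in known}

combined = lexicon + sorted(ud_new)
print(combined)
```

The misspelled UD entry is dropped because the lexicon already has a lemma for `bilar`, while `arbetsinkomster` is added because the lexicon does not cover it.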

In [217]:
# Create a set for looking up POS + full form combinations found in LEXIN.
#lexin_full_forms = set((pos, full_form) for pos, full_form, _lemma in lexin_tuples)

# Create a list of POS + full form + lemma tuples from UD for which the POS + full form combination
# is *not* found in LEXIN.
#ud_tuples_unique = [(pos, full_form, lemma) for (_pos_prev, pos, full_form, lemma) in ud_tuples if (pos, full_form) not in lexin_full_forms]

# Create a new list of tuples consisting of the ones from LEXIN and the new ones just found in UD.
#lexin_ud_no_history = lexin_tuples + list(set(ud_tuples_unique))

In [218]:
# Create a set for looking up POS + full form combinations found in Saldom.
saldom_full_forms = set((pos, full_form) for pos, full_form, _lemma in saldom_tuples)

# Create a list of POS + full form + lemma tuples from UD for which the POS + full form combination
# is *not* found in Saldom.
ud_tuples_unique = [(pos, full_form, lemma) for (_pos_prev, pos, full_form, lemma) in ud_tuples if (pos, full_form) not in saldom_full_forms]

# Create a new list of tuples consisting of the ones from Saldom and the new ones just found in UD.
saldom_ud_no_history = saldom_tuples + list(set(ud_tuples_unique))

In [219]:
saldom_ud_no_history[:5]

[('NOUN', '%', '%'),
 ('NOUN', '%-', '%'),
 ('CONJ', '&', '&'),
 ('ADP', '+', '+'),
 ('ADP', '-', '-')]

In [220]:
[t for t in saldom_ud_no_history if t[0] == 'ADJ' and t[1] == 'många']

[('ADJ', 'många', 'många')]

## Other Ambiguity
We have now removed words with two accepted spellings. Unfortunately, we have at least one more kind of ambiguity left in the data, namely distinct words which share one or more forms. For example, the Danish word "se" means *see*. The past tense of "se" is "så" (somewhat similar to *saw*). But the word "så" also has another meaning in Danish, namely *sow*. Consequently, if we are to lemmatize the word "så" and do not have any other information, we cannot tell whether the lemma is "se" or "så". In these situations, it helps if we know the POS tag of the previous word of the sentence. Therefore, we now identify the ambiguous words which are still present after the above cleaning. For these ambiguous words, we then build a list of tuples which include the POS tag of the previous word.
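
A minimal sketch of this step on the "så" example is shown below. The previous-word POS tags (`PRON`, `PART`, `DET`) are assumptions chosen for illustration and are not taken from the UD data.

```python
from collections import Counter

# Toy (POS, full form, lemma) tuples; "så" as a verb form has two possible lemmas.
forms = [
    ("VERB", "så", "se"),
    ("VERB", "så", "så"),
    ("NOUN", "hus", "hus"),
]

# Toy (POS_PREV, POS, full form, lemma) tuples with hypothetical contexts.
ud = [
    ("PRON", "VERB", "så", "se"),
    ("PART", "VERB", "så", "så"),
    ("DET", "NOUN", "hus", "hus"),
]

# A POS + full form pair is ambiguous when it occurs with more than one lemma.
counter = Counter(t[:2] for t in forms)
ambiguous = {key for key, n in counter.items() if n > 1}

# For ambiguous pairs only, fold the previous POS tag into the word class.
with_history = [(f"{prev}_{pos}", form, lemma)
                for prev, pos, form, lemma in ud
                if (pos, form) in ambiguous]
print(with_history)
```

The unambiguous `hus` entry is left out, so the history-aware tuples are only added where they can actually help the lemmatizer choose.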

In [221]:
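def find_ambiguous_lemmas(forms):
    # Count how many distinct tuples share the same POS + full form.
    counter = Counter(t[:2] for t in forms)
    # Return a set so that the membership tests below are fast.
    return {key for key, count in counter.items() if count > 1}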

#ambiguous = find_ambiguous_lemmas(clean_dsn_ud_no_history)
ambiguous = find_ambiguous_lemmas(saldom_ud_no_history)
saldom_ud_with_history = [(f'{f[0]}_{f[1]}',) + f[2:] for f in ud_tuples if f[1:3] in ambiguous]
len(saldom_ud_with_history)

620

## Write Tuples To Disk

In [222]:
def _write_form(word_class, full_form, lemma):
    writer.writerow([word_class, full_form, lemma])

with open(PREPARED_FILE, 'wb') as csvfile:
    writer = csv.writer(csvfile,
                        delimiter=",",
                        quotechar='"',
                        quoting=csv.QUOTE_MINIMAL,
                        encoding='utf-8',
                        lineterminator='\n')
    
    writer.writerow(['word_class', 'full_form', 'lemma'])
    
    for pos, full_form, lemma in sorted(saldom_ud_no_history, key = lambda x: (x[1:], x[0])):
        _write_form(pos, full_form, lemma)
    for pos, full_form, lemma in sorted(saldom_ud_with_history, key = lambda x: (x[1:], x[0])):
        _write_form(pos, full_form, lemma)
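
On Python 3, the standard-library `csv` module handles Unicode text directly, so the same layout can be written without `unicodecsv`. The following is a sketch under that assumption; the helper name `write_rows` and the sample rows are hypothetical, and it writes to an in-memory buffer so it can be run without touching `PREPARED_FILE`.

```python
import csv
import io

# Illustrative (word_class, full_form, lemma) rows, not real output of this notebook.
rows = [
    ("NOUN", "bilar", "bil"),
    ("ADJ", "många", "många"),
    ("NOUN", "bil", "bil"),
]


def write_rows(handle, rows):
    """Write a header plus rows sorted by (full form, lemma), then word class."""
    writer = csv.writer(handle,
                        delimiter=",",
                        quotechar='"',
                        quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    writer.writerow(["word_class", "full_form", "lemma"])
    for row in sorted(rows, key=lambda x: (x[1:], x[0])):
        writer.writerow(row)


buffer = io.StringIO()
write_rows(buffer, rows)
output = buffer.getvalue()
print(output)
```

To write a real file instead of the buffer, the handle would be opened in text mode, for example `open(path, "w", encoding="utf-8", newline="")`.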