# Train a lemmatizer with lemma
In this notebook, you will see how to train a lemmatizer using lemma. It assumes you already have a CSV file with the columns *pos*, *full_form*, *lemma*. The previous notebook, *01 prepare*, explains how to create such a file using data from Dansk Sprognævn and the Universal Dependencies data.

We first create a train/test split, train on the training data only, and evaluate on the training and test sets respectively. We then train again on the entire dataset and save the trained rules.

In [1]:
import logging
import random
import pandas as pd
from lemma import Lemmatizer
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s : %(message)s")

In [2]:
PREPARED_FILE = "./data/prepared.csv"
TRAINED_RULES_FILE = "./data/rules.py"

In [3]:
def load_data(filename):
    # keep_default_na=False keeps strings such as "nan" or "null" as text instead of NaN.
    df = pd.read_csv(filename, usecols=[0, 1, 2], keep_default_na=False)
    # Columns are, in order: word class (pos), full form, lemma.
    X = list(zip(df.iloc[:, 0], df.iloc[:, 1]))
    y = df.iloc[:, 2].tolist()
    return X, y

def split_data(X, y):
    mask = [False] * len(y)
    test_indices = random.sample(range(len(y)), len(y) // 500)
    for index in test_indices:
        mask[index] = True

    X_train = []
    y_train = []
    X_test = []
    y_test = []
    for index, test in enumerate(mask):
        if test:
            X_test += [X[index]]
            y_test += [y[index]]
        else:
            X_train += [X[index]]
            y_train += [y[index]]
    
    return X_train, y_train, X_test, y_test

def print_examples(lemmatizer):
    examples = [["VERB", "drak"], ["NOUN", "kattene"], ["NOUN", "ukrudtet"], ["NOUN", "slaraffenlandet"],
                ["NOUN", "alen"], ["NOUN", "skaber"], ["NOUN", "venskaber"], ["NOUN", "tilbageførelser"],
                ["NOUN", "aftenbønnerne"], ["NOUN", "altankassepassere"]]
    for word_class, full_form in examples:
        lemma = lemmatizer.lemmatize(word_class, full_form)
        print("(%s, %s) -> %s" % (word_class, full_form, lemma))

def calculate_accuracy(lemmatizer, X, y):
    total = 0
    correct = 0
    ambiguous = 0

    for index in range(len(y)):
        word_class, full_form = X[index]
        target = y[index]
        predicted = lemmatizer.lemmatize(word_class, full_form)
        total += 1
        if len(predicted) > 1:
            ambiguous += 1
        elif predicted[0] == target:
            correct += 1
        else:            
            #print("(%s, %s) -> %s (expected: %s)" % (word_class, full_form, predicted, target))
            pass

    print("correct:", correct)
    print("ambiguous:", ambiguous)
    print("total:", total)
    print("accuracy:", correct/total)
    print("ambiguous%:", ambiguous/total)
    print("ambiguous + accuracy:", (ambiguous+correct)/total)

## Load Data

In [4]:
X, y = load_data(PREPARED_FILE)

## Split Data

In [5]:
random.seed(0)
X_train, y_train, X_test, y_test = split_data(X, y)

In [6]:
print(f"Complete set: {len(X):10}")
print(f"Train set:    {len(X_train):10}")
print(f"Test set:     {len(X_test):10}")

Complete set:     402189
Train set:        401385
Test set:            804
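
The sizes above follow from `split_data` holding out `len(y) // 500` items, roughly 0.2% of the data (402189 // 500 = 804). The sketch below reproduces that arithmetic on toy data and checks that the two parts are disjoint. The helper name `split_by_indices`, its `holdout_every` parameter, and the toy values are made up for illustration and are not part of lemma.

```python
import random


def split_by_indices(X, y, holdout_every=500):
    """Hold out len(y) // holdout_every randomly chosen items as the test set."""
    test_indices = set(random.sample(range(len(y)), len(y) // holdout_every))
    train = [(X[i], y[i]) for i in range(len(y)) if i not in test_indices]
    test = [(X[i], y[i]) for i in range(len(y)) if i in test_indices]
    return train, test


# Toy data standing in for the prepared CSV (hypothetical values).
X_toy = [("NOUN", f"word{i}") for i in range(2000)]
y_toy = [f"lemma{i}" for i in range(2000)]

random.seed(0)
train, test = split_by_indices(X_toy, y_toy)

print(len(train), len(test))
```

Because the test set is this small, the test accuracy reported further down rests on only a few hundred items and will vary somewhat with the seed.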


## Train lemmatizer - training set only

In [7]:
lemmatizer = Lemmatizer()
lemmatizer.fit(X_train, y_train)

DEBUG : epoch #1: 46905 rules (46905 new) in 1.87s
DEBUG : epoch #2: 62807 rules (15902 new) in 1.75s
DEBUG : epoch #3: 66231 rules (3424 new) in 1.65s
DEBUG : epoch #4: 67281 rules (1050 new) in 1.73s
DEBUG : epoch #5: 67680 rules (399 new) in 1.59s
DEBUG : epoch #6: 67784 rules (104 new) in 1.80s
DEBUG : epoch #7: 67808 rules (24 new) in 1.96s
DEBUG : epoch #8: 67824 rules (16 new) in 1.73s
DEBUG : epoch #9: 67824 rules (0 new) in 2.02s
DEBUG : training complete: 67824 rules in 16.18s
DEBUG : rules before pruning: 67824
DEBUG : used rules: 59962
DEBUG : rules after pruning: 59962 (7862 removed)


In [8]:
calculate_accuracy(lemmatizer, X_train, y_train)

correct: 394911
ambiguous: 6474
total: 401385
accuracy: 0.9838708471915991
ambiguous%: 0.016129152808400913
ambiguous + accuracy: 1.0


In [9]:
calculate_accuracy(lemmatizer, X_test, y_test)

correct: 745
ambiguous: 9
total: 804
accuracy: 0.9266169154228856
ambiguous%: 0.011194029850746268
ambiguous + accuracy: 0.9378109452736318
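
To see which test items lower the accuracy, the commented-out `print` in `calculate_accuracy` can be turned into a helper that collects the misses. A minimal sketch follows; `StubLemmatizer` and `collect_errors` are hypothetical names, and the stub only imitates the `lemmatize(word_class, full_form)` call shape used in this notebook so that the block runs without trained rules.

```python
class StubLemmatizer:
    """Hypothetical stand-in: looks up lemmas in a fixed table."""

    def __init__(self, table):
        self.table = table

    def lemmatize(self, word_class, full_form):
        # Mirror the notebook's usage: always return a list of candidate lemmas.
        return self.table.get((word_class, full_form), [full_form])


def collect_errors(lemmatizer, X, y):
    """Return (wrong, ambiguous) lists of (word_class, full_form, predicted, target)."""
    wrong, ambiguous = [], []
    for (word_class, full_form), target in zip(X, y):
        predicted = lemmatizer.lemmatize(word_class, full_form)
        record = (word_class, full_form, predicted, target)
        if len(predicted) > 1:
            ambiguous.append(record)
        elif predicted[0] != target:
            wrong.append(record)
    return wrong, ambiguous


stub = StubLemmatizer({
    ("NOUN", "kattene"): ["kat"],
    ("NOUN", "alen"): ["al", "ale", "alen"],
})
X_demo = [("NOUN", "kattene"), ("NOUN", "alen"), ("VERB", "drak")]
y_demo = ["kat", "alen", "drikke"]

wrong, ambiguous = collect_errors(stub, X_demo, y_demo)
for word_class, full_form, predicted, target in wrong:
    print("(%s, %s) -> %s (expected: %s)" % (word_class, full_form, predicted, target))
```

With the objects from this notebook, the equivalent call would be `collect_errors(lemmatizer, X_test, y_test)`.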


In [10]:
print_examples(lemmatizer)

(VERB, drak) -> ['drikke']
(NOUN, kattene) -> ['kat']
(NOUN, ukrudtet) -> ['ukrudt']
(NOUN, slaraffenlandet) -> ['slaraffenland']
(NOUN, alen) -> ['al', 'ale', 'alen']
(NOUN, skaber) -> ['skaber']
(NOUN, venskaber) -> ['venskab']
(NOUN, tilbageførelser) -> ['tilbageførelse']
(NOUN, aftenbønnerne) -> ['aftenbøn']
(NOUN, altankassepassere) -> ['altankassepasser']


## Train lemmatizer - full dataset

In [11]:
lemmatizer = Lemmatizer()
lemmatizer.fit(X, y)

DEBUG : epoch #1: 46954 rules (46954 new) in 1.93s
DEBUG : epoch #2: 62892 rules (15938 new) in 1.92s
DEBUG : epoch #3: 66324 rules (3432 new) in 1.74s
DEBUG : epoch #4: 67372 rules (1048 new) in 2.03s
DEBUG : epoch #5: 67771 rules (399 new) in 2.12s
DEBUG : epoch #6: 67875 rules (104 new) in 1.95s
DEBUG : epoch #7: 67899 rules (24 new) in 1.69s
DEBUG : epoch #8: 67915 rules (16 new) in 1.70s
DEBUG : epoch #9: 67915 rules (0 new) in 1.67s
DEBUG : training complete: 67915 rules in 16.85s
DEBUG : rules before pruning: 67915
DEBUG : used rules: 60045
DEBUG : rules after pruning: 60045 (7870 removed)


In [12]:
calculate_accuracy(lemmatizer, X, y)

correct: 395697
ambiguous: 6492
total: 402189
accuracy: 0.9838583352602881
ambiguous%: 0.016141664739711927
ambiguous + accuracy: 1.0


## Save Learned Rules
We now save the learned rules to a Python file, which can be copied into the lemmatizer source code.

In [13]:
def _to_dict(lemmatizer):
    """Convert the internal defaultdict to a standard dict."""
    temp = {}
    for pos, rules_ in lemmatizer.rules.items():
        if pos not in temp:
            temp[pos] = {}

        for full_form_suffix, lemma_suffixes_ in rules_.items():
            temp[pos][full_form_suffix] = lemma_suffixes_
    return temp

In [14]:
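# Use a context manager so the file is flushed and closed; write() returns the character count.
with open(TRAINED_RULES_FILE, "w", encoding="utf-8") as f:
    n_chars = f.write("rules = " + str(_to_dict(lemmatizer)))
n_chars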

1801484
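
Since the file contains a single assignment of the form `rules = {...}`, it can be read back as Python to check that it round-trips. A minimal sketch using the standard library's `runpy` follows; the toy rules dictionary and the temporary path are illustrative assumptions, and the real value types inside `lemmatizer.rules` may differ.

```python
import os
import runpy
import tempfile

# Toy rules in the pos -> full-form suffix -> lemma suffixes shape produced by _to_dict.
# The concrete entries are invented for illustration.
toy_rules = {"NOUN": {"ene": ["", "e"], "et": [""]}, "VERB": {"drak": ["drikke"]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rules.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write("rules = " + str(toy_rules))

    # run_path executes the file and returns its global namespace as a dict.
    loaded = runpy.run_path(path)["rules"]

print(loaded == toy_rules)
```

Note that `runpy.run_path` executes the file, so it should only be pointed at rule files you generated yourself.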