# Train a lemmatizer with Lemmy
This notebook shows how to train a lemmatizer using Lemmy. It assumes you already have a CSV file with the columns *pos*, *full_form*, *lemma*. The previous notebook, *01 prepare*, explains how to create such a file using data from Dansk Sprognævn (DSN) and the Universal Dependencies (UD) data.

We first split the data into a training set and a small test set, train on the training set only, and evaluate on both sets. We then retrain on the entire dataset and save the learned rules.
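
To make the expected input concrete, the sketch below parses a few rows in the *pos*, *full_form*, *lemma* format into lists of the same shape as the `X` and `y` used throughout this notebook. The sample rows and the header row are illustrative assumptions, not taken from the real prepared file, and the standard `csv` module stands in for pandas to keep the example self-contained.

```python
import csv
import io

# Hypothetical sample in the pos, full_form, lemma format (not real DSN/UD data).
SAMPLE_CSV = """pos,full_form,lemma
VERB,drak,drikke
NOUN,kattene,kat
NOUN,venskaber,venskab
"""


def parse_rows(text):
    """Return (X, y): X holds (pos, full_form) pairs, y the matching lemmas."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the assumed header row
    X, y = [], []
    for pos, full_form, lemma in reader:
        X.append((pos, full_form))
        y.append(lemma)
    return X, y


X_sample, y_sample = parse_rows(SAMPLE_CSV)
print(X_sample)
print(y_sample)
```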

In [1]:
import logging
import random
from pprint import pformat
import pandas as pd
from lemmy import Lemmatizer
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s : %(message)s")

In [2]:
PREPARED_FILE = "./data/prepared.csv"
TRAINED_RULES_FILE = "./data/rules.py"

In [3]:
def print_examples(lemmatizer):
    examples = [["VERB", "drak"], ["NOUN", "kattene"], ["NOUN", "ukrudtet"], ["NOUN", "slaraffenlandet"],
                ["NOUN", "alen"], ["NOUN", "skaber"], ["NOUN", "venskaber"], ["NOUN", "tilbageførelser"],
                ["NOUN", "aftenbønnerne"], ["NOUN", "altankassepassere"]]
    for word_class, full_form in examples:
        lemma = lemmatizer.lemmatize(word_class, full_form)
        print("(%s, %s) -> %s" % (word_class, full_form, lemma))

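def calculate_accuracy(lemmatizer, X, y):
    """Print how many predictions are correct, ambiguous (more than one candidate), or neither."""
    total = 0
    correct = 0
    ambiguous = 0

    for (word_class, full_form), target in zip(X, y):
        predicted = lemmatizer.lemmatize(word_class, full_form)
        total += 1
        if len(predicted) > 1:
            # Several candidate lemmas: counted separately, not as correct.
            ambiguous += 1
        elif len(predicted) == 1 and predicted[0] == target:
            correct += 1
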
    print("correct:", correct)
    print("ambiguous:", ambiguous)
    print("total:", total)
    print("accuracy:", correct/total)
    print("ambiguous%:", ambiguous/total)
    print("ambiguous + accuracy:", (ambiguous+correct)/total)

## Load Data

In [4]:
def load_data(filename):
    # keep_default_na=False keeps words such as "null" or "NA" as text instead of NaN.
    df = pd.read_csv(filename, usecols=[0, 1, 2], keep_default_na=False)
    df = df.sample(frac=1, random_state=42)  # shuffle rows
    # Columns are, in order: pos, full_form, lemma.
    X = list(zip(df.iloc[:, 0], df.iloc[:, 1]))
    y = df.iloc[:, 2].tolist()
    return X, y

X, y = load_data(PREPARED_FILE)

## Split Data

In [5]:
def split_data(X, y):
    """Hold out a random 1/500 (0.2%) of the examples as the test set."""
    mask = [False] * len(y)
    test_indices = random.sample(range(len(y)), len(y) // 500)
    for index in test_indices:
        mask[index] = True

    X_train = []
    y_train = []
    X_test = []
    y_test = []
    for index, test in enumerate(mask):
        if test:
            X_test += [X[index]]
            y_test += [y[index]]
        else:
            X_train += [X[index]]
            y_train += [y[index]]
    
    return X_train, y_train, X_test, y_test

random.seed(42)
X_train, y_train, X_test, y_test = split_data(X, y)

In [6]:
print(f"Complete set: {len(X):10}")
print(f"Train set:    {len(X_train):10}")
print(f"Test set:     {len(X_test):10}")

Complete set:    1051863
Train set:       1049760
Test set:           2103


## Train lemmatizer - training set only

In [7]:
lemmatizer = Lemmatizer()
lemmatizer.fit(X_train, y_train)

DEBUG : epoch #1: 77567 rules (77567 new) in 4.17s
DEBUG : epoch #2: 103334 rules (25767 new) in 4.05s
DEBUG : epoch #3: 108183 rules (4849 new) in 4.09s
DEBUG : epoch #4: 109247 rules (1064 new) in 4.06s
DEBUG : epoch #5: 109566 rules (319 new) in 4.00s
DEBUG : epoch #6: 109671 rules (105 new) in 3.93s
DEBUG : epoch #7: 109695 rules (24 new) in 3.93s
DEBUG : epoch #8: 109703 rules (8 new) in 3.98s
DEBUG : epoch #9: 109705 rules (2 new) in 3.96s
DEBUG : epoch #10: 109705 rules (0 new) in 4.03s
DEBUG : training complete: 109705 rules in 40.36s
DEBUG : rules before pruning: 109705
DEBUG : used rules: 101430
DEBUG : rules after pruning: 101430 (8275 removed)


In [8]:
calculate_accuracy(lemmatizer, X_train, y_train)

correct: 1041250
ambiguous: 8510
total: 1049760
accuracy: 0.991893385154702
ambiguous%: 0.008106614845297972
ambiguous + accuracy: 1.0


In [9]:
calculate_accuracy(lemmatizer, X_test, y_test)

correct: 1987
ambiguous: 21
total: 2103
accuracy: 0.9448407037565383
ambiguous%: 0.009985734664764621
ambiguous + accuracy: 0.9548264384213029
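
The test-set figures above can be cross-checked by hand. This sketch recomputes the ratios from the printed counts and derives the number of unambiguous but wrong predictions, which the notebook does not print. The variable names are illustrative. Note that `calculate_accuracy` counts an ambiguous prediction as neither correct nor wrong, so it never checks whether the target is among the candidates.

```python
# Counts copied from the test-set evaluation output.
correct, ambiguous, total = 1987, 21, 2103

# Exactly one lemma was predicted, but it was not the target.
wrong = total - correct - ambiguous

accuracy = correct / total
ambiguous_share = ambiguous / total
error_rate = wrong / total

print("wrong:", wrong)
print("accuracy:", accuracy)
print("ambiguous share:", ambiguous_share)
print("error rate:", error_rate)
```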


In [10]:
print_examples(lemmatizer)

(VERB, drak) -> ['draka']
(NOUN, kattene) -> ['kattene']
(NOUN, ukrudtet) -> ['ukrudte']
(NOUN, slaraffenlandet) -> ['slaraffenland']
(NOUN, alen) -> ['al']
(NOUN, skaber) -> ['skab']
(NOUN, venskaber) -> ['venskab']
(NOUN, tilbageførelser) -> ['tilbageførelse']
(NOUN, aftenbønnerne) -> ['aftenbønnerne']
(NOUN, altankassepassere) -> ['altankassepassera']
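
As the output shows, `lemmatize` returns a list of candidate lemmas rather than a single string. Downstream code that needs exactly one lemma has to decide what to do with ambiguous results. The helper below is a hypothetical sketch, not part of Lemmy: it returns the lemma when there is exactly one candidate and otherwise falls back to the unchanged full form, which is only one of several reasonable policies.

```python
def single_lemma(candidates, full_form):
    """Return the only candidate, or the full form if the result is ambiguous or empty."""
    if len(candidates) == 1:
        return candidates[0]
    return full_form


# One candidate: use it.
print(single_lemma(["venskab"], "venskaber"))
# Placeholder candidates standing in for an ambiguous result: fall back to the full form.
print(single_lemma(["lemma_a", "lemma_b"], "full_form"))
```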


## Train lemmatizer - full dataset

In [11]:
lemmatizer = Lemmatizer()
lemmatizer.fit(X, y)

DEBUG : epoch #1: 77730 rules (77730 new) in 4.39s
DEBUG : epoch #2: 103564 rules (25834 new) in 4.59s
DEBUG : epoch #3: 108442 rules (4878 new) in 4.39s
DEBUG : epoch #4: 109506 rules (1064 new) in 4.40s
DEBUG : epoch #5: 109826 rules (320 new) in 4.16s
DEBUG : epoch #6: 109931 rules (105 new) in 4.15s
DEBUG : epoch #7: 109955 rules (24 new) in 4.09s
DEBUG : epoch #8: 109963 rules (8 new) in 4.20s
DEBUG : epoch #9: 109965 rules (2 new) in 4.32s
DEBUG : epoch #10: 109965 rules (0 new) in 4.15s
DEBUG : training complete: 109965 rules in 43.01s
DEBUG : rules before pruning: 109965
DEBUG : used rules: 101668
DEBUG : rules after pruning: 101668 (8297 removed)


In [12]:
calculate_accuracy(lemmatizer, X, y)

correct: 1043300
ambiguous: 8563
total: 1051863
accuracy: 0.9918592059992604
ambiguous%: 0.00814079400073964
ambiguous + accuracy: 1.0


## Save Learned Rules
We now save the learned rules to a Python file, which can be copied into the lemmatizer source code.

In [13]:
def _to_dict(lemmatizer):
    """Convert the internal defaultdict to a standard dict."""
    temp = {}
    for pos, rules_ in lemmatizer.rules.items():
        if pos not in temp:
            temp[pos] = {}

        for full_form_suffix, lemma_suffixes_ in rules_.items():
            temp[pos][full_form_suffix] = lemma_suffixes_
    return temp

In [14]:
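# Write as UTF-8 explicitly: the rules contain Danish characters and the file declares this encoding.
with open(TRAINED_RULES_FILE, 'w', encoding='utf-8') as file: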
    file.write("# coding: utf-8\n")
    file.write("from __future__ import unicode_literals\n")
    file.write("\n\n")
    file.write("rules = " + pformat(_to_dict(lemmatizer), width=120))
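
To check that a saved rules file is valid Python, it can be executed and its `rules` variable read back. The sketch below uses `runpy.run_path` from the standard library. The tiny rules dictionary and the temporary file are stand-ins so the example runs on its own. Their shape (pos, then full-form suffix, then a list of lemma suffixes) is an assumption based on `_to_dict` above, not a description of the real rules.

```python
import os
import runpy
import tempfile
from pprint import pformat

# Hypothetical miniature rules dict; the real one comes from _to_dict(lemmatizer).
sample_rules = {"NOUN": {"øer": ["ø"]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rules.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write("# coding: utf-8\n")
        f.write("from __future__ import unicode_literals\n")
        f.write("\n\n")
        f.write("rules = " + pformat(sample_rules, width=120) + "\n")

    # run_path executes the file and returns its module globals as a dict.
    loaded = runpy.run_path(path)["rules"]

print(loaded == sample_rules)
```

If the comparison prints `True`, the file round-trips cleanly, including non-ASCII characters.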