#### This program loads the training data created by `generate.ipynb` and builds a model from it using a semi-supervised approach based on common word endings. It then saves the model as a list of rules using `pickle` so that `test.ipynb` can load it quickly.


In [245]:
import pickle
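
`longest_substring` is a utility function that returns the longest prefix of `s2` that is also a prefix of `s1`, i.e. the leading part that the unmapped word (`s1`) and its lemma (`s2`) share. (Assumption: a lemma can never be longer than its unmapped form, so `s2` is never longer than `s1`.)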

**note: `i==0` is an edge case because `s2[:-0] != s2`**

In [239]:
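def longest_substring(s1, s2):
    # try prefixes of s2 from longest to shortest; return the first one s1 starts with
    for i in range(len(s2)):
        if i==0:
            if s1.startswith(s2):                  # check all of s2 first, since s2[:-0] is ''
                return s2
        if s1.startswith(s2[:-(i+1)]):
            return s2[:-(i+1)]
    return ''                                      # only reached when s2 is empty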
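
As a quick cross-check, the standard library's `os.path.commonprefix` compares strings character by character, so for non-empty inputs it should give the same shared prefix as `longest_substring`. A minimal sketch, using word/lemma pairs invented purely for illustration:

```python
from os.path import commonprefix

# Hypothetical word/lemma pairs, not taken from the training data
pairs = [("walked", "walk"), ("studies", "study"), ("better", "good")]

shared_prefixes = []
for word, lemma in pairs:
    shared = commonprefix([word, lemma])   # longest leading run both strings share
    shared_prefixes.append(shared)
    print(word, lemma, repr(shared))
```

The last pair shares no leading characters, so its prefix is the empty string.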

In [240]:
with open("lemma_mappings.txt") as file:
    items = file.readlines()
items = [item.lstrip('\t').rstrip(',\n').split(':') for item in items if len(item) > 1]

for item in items:
    item[0] = item[0].replace('\'','').lstrip('(').rstrip(')').split(', ')
    item[1] = item[1].replace('\'','')
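
The parsing steps can be traced on a single line. The sample below assumes each line of `lemma_mappings.txt` is a tab-indented `('word', 'POS'):'lemma',` entry. The real file is not shown here, so treat this format as a guess inferred from the strip and split calls.

```python
# Assumed line format; the actual contents of lemma_mappings.txt may differ
line = "\t('walked', 'VBD'):'walk',\n"

# Same steps as the cell above, applied to one line
item = line.lstrip('\t').rstrip(',\n').split(':')
item[0] = item[0].replace('\'', '').lstrip('(').rstrip(')').split(', ')
item[1] = item[1].replace('\'', '')

print(item)   # [['walked', 'VBD'], 'walk']
```

Under this assumption, `item[0]` holds the unmapped word and its POS tag, and `item[1]` holds the lemma.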

In [241]:
# test cell - make sure slicing is working as intended
#min_end_len = 2
#print(items[0][1])
#print(items[0][1][-2:])
#print()

#for i in range(len(items[0][1])-(min_end_len-1)):
#    print(items[0][1][-min_end_len-i:]+'/'+items[0][0][1])

Create `rules`, a dict of lemma rules based on the word endings in the training data set. Each key has the form [word_ending]/[POS_tag], and each value is a list [lemma mapping, count], where "count" is the number of times that key occurs in the training data set. The lemma mapping stored for a key is the one from its first occurrence; later occurrences only increment the count.

In [242]:
rules = {}
min_end_len = 2
for item in items:
    unmapped = item[0][0]                          # word form as it appears in the training data
    mapped = item[1]                               # its lemma
    if len(mapped) < min_end_len:
        continue                                   # lemma too short to yield an ending
    if unmapped == mapped:
        # word is already its own lemma: each ending maps to itself
        for i in range(len(mapped)-(min_end_len-1)):
            ending = mapped[-min_end_len-i:]
            key = ending+'/'+item[0][1]
            if key not in rules:
                rules[key] = [ending, 1]
            else:
                rules[key][1] += 1
    else:
        diff = len(unmapped) - len(mapped)
        ss = longest_substring(unmapped, mapped)
        for i in range(len(ss)-(min_end_len-1)):
            if i==0:
                continue                                   # edge case - val would be full word
            orig_ending = unmapped[-diff-i:]       # changed suffix plus i shared characters
            key = orig_ending+'/'+item[0][1]
            val = mapped[-(len(orig_ending)-diff):]
            if key not in rules:
                rules[key] = [val, 1]
            else:
                rules[key][1] += 1
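
To see which rules a single training item produces, the sketch below repeats the loop logic in a hypothetical helper, `rules_for` (not part of this notebook), and uses `os.path.commonprefix` in place of `longest_substring`. The example words and the Penn Treebank style tags are assumptions for illustration only.

```python
from os.path import commonprefix

def rules_for(word, pos, lemma, min_end_len=2):
    """Hypothetical helper: list the (key, value) pairs one training item yields."""
    out = []
    if len(lemma) < min_end_len:
        return out
    if word == lemma:
        # unchanged word: every ending of length >= min_end_len maps to itself
        for i in range(len(lemma) - (min_end_len - 1)):
            ending = lemma[-min_end_len - i:]
            out.append((ending + '/' + pos, ending))
    else:
        diff = len(word) - len(lemma)
        shared = commonprefix([word, lemma])
        # start at 1: with i == 0 the value would be the full lemma
        for i in range(1, len(shared) - (min_end_len - 1)):
            orig_ending = word[-diff - i:]
            out.append((orig_ending + '/' + pos,
                        lemma[-(len(orig_ending) - diff):]))
    return out

print(rules_for("walked", "VBD", "walk"))  # [('ked/VBD', 'k'), ('lked/VBD', 'lk')]
print(rules_for("cat", "NN", "cat"))       # [('at/NN', 'at'), ('cat/NN', 'cat')]
```

Each changed word contributes progressively longer endings, so a short, general rule such as `ked` to `k` is recorded alongside longer, more specific ones.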
                

Sort the rules first by number of hits (i.e. give more weight to rules that are validated more often in the training set) and second by key length, as a tiebreaker. The assumption is that, all else equal, a longer and more specific match is more likely to be accurate than a shorter one, which could be more heavily influenced by unrelated words in the training set. The result is a list of (key, [lemma mapping, count]) tuples in descending order.

In [243]:
rules = sorted(rules.items(), key=lambda x:(x[1][1], len(x[0])), reverse=True)
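
The effect of the two-part sort key can be checked on a toy rule dict. The entries below are made up for illustration and only follow the key and value layout described above.

```python
# Toy rules: key -> [lemma mapping, count]; values are invented
toy_rules = {
    "ked/VBD": ["k", 3],
    "lked/VBD": ["lk", 3],
    "ies/NNS": ["y", 5],
}

# Same sort as the cell above: count first, then key length, both descending
ranked = sorted(toy_rules.items(), key=lambda x: (x[1][1], len(x[0])), reverse=True)

for key, (val, count) in ranked:
    print(key, val, count)
# ies/NNS y 5
# lked/VBD lk 3
# ked/VBD k 3
```

Note that `len(x[0])` measures the whole key, including the `/POS_tag` suffix, so tags of different lengths also influence the tiebreak.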

In [246]:
#rules

Last, write the generated model to `model.txt` for use in `test.ipynb`. Despite the `.txt` extension, the file is a binary pickle, so it has to be opened in binary mode.

In [247]:
with open('model.txt', 'wb') as out_file:
    pickle.dump(rules, out_file)
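
How `test.ipynb` reads the model is not shown here; presumably it mirrors the dump with `pickle.load` on a file opened in `'rb'` mode. A minimal round-trip sketch, using toy rules and a temporary directory so that the real `model.txt` is left untouched:

```python
import os
import pickle
import tempfile

# Toy rule list in the sorted (key, [lemma mapping, count]) layout
rules = [("ies/NNS", ["y", 5]), ("ked/VBD", ["k", 3])]

# Write to a throwaway location rather than the notebook's model.txt
path = os.path.join(tempfile.mkdtemp(), "model.txt")
with open(path, "wb") as out_file:
    pickle.dump(rules, out_file)

# Read it back in binary mode, as a loader would have to
with open(path, "rb") as in_file:
    loaded = pickle.load(in_file)

print(loaded == rules)   # True
```

Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code.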