# Making a Trie Language Model
## A Trie Language Model is an efficient data structure that organizes a large baseline sample of known correct words in the target language.
### Whenever possible, it helps to have a Trie Language Model to assist you in diagnosing a corpus and potentially fixing problems or auto-correcting bad data.
### Prior to starting, we have examined some of the CLTK data structures and found a source of some known correct words that we will use to build our tree.
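
For readers new to the structure, here is a minimal sketch of a trie built from nested dictionaries. It is an illustration only: `SimpleTrie` and its `END` marker are hypothetical names, and the `WordTrie` class used later in this notebook may well be implemented differently.

```python
class SimpleTrie:
    """A toy trie: each node is a dict mapping a letter to a child node."""

    END = "$"  # hypothetical marker key meaning "a word ends at this node"

    def __init__(self):
        self.root = {}

    def add(self, word):
        """Walk down the trie, creating nodes as needed, then mark the end."""
        node = self.root
        for letter in word:
            node = node.setdefault(letter, {})
        node[self.END] = True

    def has_word(self, word):
        """Return True only if the exact word was added, not just a prefix."""
        node = self.root
        for letter in word:
            node = node.get(letter)
            if node is None:
                return False
        return self.END in node


toy_trie = SimpleTrie()
for known_word in ["et", "etiam", "est"]:
    toy_trie.add(known_word)

print(toy_trie.has_word("et"))
print(toy_trie.has_word("eti"))
```

Words that share a prefix share nodes, which is what keeps lookups fast and the structure compact for a large vocabulary.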

#### First some standard imports

In [1]:
import site
import pickle
import random
import os
from collections import defaultdict, Counter
from tqdm import tqdm

from cltk.stem.latin.j_v import JVReplacer
from cltk.corpus.readers import get_corpus_reader
from cltk.tokenize.word import WordTokenizer

### Add parent directory to path so we can access our common code

In [3]:
import os, sys, inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, parentdir) 

In [4]:
from mlyoucanuse.word_trie import WordTrie
from mlyoucanuse.aeoe_replacer import AEOEReplacer
from mlyoucanuse.corpus_analysis_fun import get_split_words

### The CLTK Lemmata package is outside of the regular CLTK library path, but we would like to work with the Lemmata dictionary to access a large collection of good inflected Latin words. We will therefore add the directory to our Python path programmatically.

In [3]:
site.addsitedir(os.path.expanduser('~/cltk_data/latin/lemma/latin_pos_lemmata_cltk'))
# Now we may reference the dictionary in code
from latin_unambiguous_lemmata_cltk import LEMMATA
print(f'Number of distinct root word forms: {len(LEMMATA.keys()):,}')

Number of distinct root word forms: 239,321


### Let's examine the character content of the Lemmata keys to make sure there are no odd characters

In [4]:
chars = Counter()
for word in tqdm(LEMMATA.keys()):
    chars.update(word)  # count every letter of the word
print(chars)

100%|██████████| 239321/239321 [00:04<00:00, 53293.90it/s]

Counter({'e': 258882, 'i': 237115, 'a': 205356, 't': 178453, 'u': 175885, 'r': 174307, 's': 163438, 'n': 148734, 'o': 113542, 'm': 93943, 'c': 92743, 'l': 68276, 'd': 67361, 'p': 61526, 'b': 39004, 'q': 27267, 'g': 26816, 'v': 23259, 'f': 21091, 'h': 13181, 'x': 11709, 'y': 4720, 'A': 3471, 'C': 3174, 'P': 2811, 'S': 2162, 'T': 1804, 'M': 1740, 'L': 1361, 'H': 1103, 'B': 872, 'E': 867, 'I': 834, 'D': 817, 'N': 806, 'V': 708, 'G': 703, 'O': 623, 'F': 569, 'z': 409, 'R': 405, 'U': 258, 'Q': 104, 'Z': 93, 'X': 50, 'K': 21, 'k': 5, '-': 3})
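
The same check can be automated rather than read off by eye. The sketch below flags any character outside the basic Latin alphabet and the hyphen. The `allowed` set and the sample words are illustrative assumptions and are not taken from the CLTK data.

```python
import string
from collections import Counter

# Assumption: only ASCII letters and the hyphen are expected in the word list.
allowed = set(string.ascii_letters) | {"-"}

# Illustrative sample; in the notebook this would be LEMMATA.keys().
sample_words = ["amor", "Caesar", "λόγος", "uir-tus"]

char_counts = Counter()
for word in sample_words:
    char_counts.update(word)

# Keep only the characters that fall outside the allowed set.
unexpected = {char: count for char, count in char_counts.items() if char not in allowed}
print(unexpected)
```

An empty result means the word list is clean under that assumption.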





### There are no Greek or odd characters. However, the letter V is present. In Latin, J is the same letter as I and V is the same as U, so we'll normalize the forms with the CLTK JVReplacer.

In [5]:
jv_replacer = JVReplacer()
distinct_words = [jv_replacer.replace(word) for word in tqdm(LEMMATA.keys())]
distinct_words.sort(key=len, reverse=True)
print(f'Max length in known lemma corpus: {len(distinct_words[0])} for: {distinct_words[0]}')
distinct_words[:5]

100%|██████████| 239321/239321 [00:01<00:00, 123688.41it/s]

Max length in known lemma corpus: 28 for: Thensaurochrysonicochrysides





['Thensaurochrysonicochrysides',
 'honorificentissimisque',
 'incomprehensibilemque',
 'incomprehensibilisque',
 'reconciliationibusque']
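
To make the normalization step concrete, here is a rough stand-in written with `str.translate`. It assumes the normalization is a plain character substitution (j to i, v to u, in both cases). `normalize_jv` is a hypothetical helper, not the CLTK implementation.

```python
# Assumption: J/V normalization is a plain one-to-one character substitution.
JV_TABLE = str.maketrans({"j": "i", "v": "u", "J": "I", "V": "U"})


def normalize_jv(text):
    """Replace j/J with i/I and v/V with u/U (hypothetical helper)."""
    return text.translate(JV_TABLE)


for form in ["Julius", "vir", "iuvenis"]:
    print(form, "->", normalize_jv(form))
```

Normalizing both the reference word list and the corpus in the same way keeps spelling variants such as `vir` and `uir` from being treated as different words.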

## The prominence of the -que enclitic ending indicates that we should strip off those enclitics. We can do this by tokenizing the words and taking only the first token:

In [6]:
toker = WordTokenizer(language='latin')
toker.tokenize('incomprehensibilemque')

['incomprehensibilem', '-que']

In [7]:
distinct_tokenized = [toker.tokenize(tmp)[0] for tmp in tqdm(distinct_words)]
print(f'Absolute maximum length {len(distinct_tokenized[0])}')
print(f'Greatest common maximum length {len(distinct_tokenized[2])}')

100%|██████████| 239321/239321 [00:33<00:00, 7088.98it/s]

Absolute maximum length 28
Greatest common maximum length 18
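
Tokenizing every word is the slowest step so far (about 33 seconds). If only the -que enclitic matters, a plain suffix check is a possible lightweight alternative. The sketch below is a simplification and not how the CLTK tokenizer works: `strip_que` and its short exception list are hypothetical, and a real exception list would need to be much longer.

```python
# Hypothetical, deliberately small exception list: words that end in "que"
# but are not a stem plus the enclitic.
QUE_EXCEPTIONS = {"atque", "quoque", "neque", "itaque", "usque", "denique"}


def strip_que(word):
    """Drop a trailing -que unless the word is a known exception."""
    if word.endswith("que") and len(word) > 3 and word not in QUE_EXCEPTIONS:
        return word[:-3]
    return word


for form in ["incomprehensibilemque", "atque", "amor"]:
    print(form, "->", strip_que(form))
```

The trade-off is speed against coverage: the tokenizer also handles other enclitics and many more exceptions than this sketch does.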





### Based on our previous word length analyses, any word longer than 18 characters is likely two or more words accidentally joined, usually through improper formatting or a botched data import. In practice it is common to use a lower cutoff, as the 98th percentile starts with words 12 letters long. Adjusting the cutoff is much like tuning a hyperparameter: the proper value depends on the data and your needs.
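
One way to choose a cutoff from the data is to compute a percentile of the word lengths. The sketch below uses the nearest-rank method on a tiny illustrative sample. `length_percentile` is a hypothetical helper, and both the sample and the 80th percentile are chosen only to keep the example small.

```python
import math


def length_percentile(words, percentile):
    """Return the word length at the given percentile (nearest-rank method)."""
    lengths = sorted(len(word) for word in words)
    rank = max(1, math.ceil(percentile / 100 * len(lengths)))
    return lengths[rank - 1]


# Illustrative sample; in practice this would be every token in the corpus.
sample = ["et", "in", "est", "amor", "uirtus", "consuetudinem", "maturitatemperueniunt"]

cutoff = length_percentile(sample, 80)
too_long = [word for word in sample if len(word) > cutoff]
print(cutoff, too_long)
```

Words longer than the cutoff become the candidates that are handed to the trie for splitting.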

### Now let's build the WordTrie

In [8]:
latin_trie = WordTrie()
for word in tqdm(distinct_tokenized):
    latin_trie.add(word)

100%|██████████| 239321/239321 [00:01<00:00, 168323.04it/s]


### Let's show how the WordTrie can be used to split improperly joined words

In [9]:
bads = [
['maturitatemperueniunt'],
['radicibussubministres'],
['peregrinationeshabere'],
['uersibusdisertissimis'],
['crudelitatisconsuetudinem'],
['adiciebatcontrahendam'],
['translationesinprobas'],
]
for item in bads:
    print(latin_trie.extract_word_pair(item[0]))

['maturitatem', 'perueniunt']
['radicibus', 'subministres']
['peregrinationes', 'habere']
['uersibus', 'disertissimis']
['crudelitatis', 'consuetudinem']
['adiciebat', 'contrahendam']
['translationes', 'inprobas']
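
How `extract_word_pair` works internally is not shown in this notebook. One simple approach that produces this kind of result is to try every split point and keep the first one where both halves are known words. The sketch below does that against a plain set. `split_joined_word`, its `min_part_length` parameter, and the small vocabulary are hypothetical.

```python
def split_joined_word(joined, vocabulary, min_part_length=2):
    """Return [left, right] if both halves are known words, else None.

    Split points are tried from right to left, so the longest known
    left-hand word wins.
    """
    for idx in range(len(joined) - min_part_length, min_part_length - 1, -1):
        left, right = joined[:idx], joined[idx:]
        if left in vocabulary and right in vocabulary:
            return [left, right]
    return None


# Illustrative vocabulary; the notebook uses the full lemmata word list.
vocabulary = {"maturitatem", "perueniunt", "radicibus", "subministres"}

print(split_joined_word("maturitatemperueniunt", vocabulary))
print(split_joined_word("radicibussubministres", vocabulary))
print(split_joined_word("xyzxyz", vocabulary))
```

A trie can answer the same question without slicing and hashing every candidate prefix, which matters when scanning a whole corpus.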


## Testing the word trie on a large corpus

In [10]:
latin_library_reader = get_corpus_reader(corpus_name='latin_text_latin_library', language='latin')
latin_split_words = get_split_words(latin_library_reader, latin_trie)
print(f'{len(latin_split_words)} files with oddly joined words out of {len(latin_library_reader.fileids())} corpus files')
list(latin_split_words.items())[:10]

100%|██████████| 2141/2141 [19:11<00:00,  1.60files/s]

229 files with oddly joined words out of 2141 corpus files





[('1644.txt',
  [['captiuorum', 'quispiam'],
   ['quomodo', 'comparantur'],
   ['ingrediuntur', 'concubinae'],
   ['tauros', 'postulauere'],
   ['astantibus', 'manifestat'],
   ['pluribus', 'necasset'],
   ['latera', 'confodiunt'],
   ['arbores', 'radicitus'],
   ['frondibus', 'eiusdem'],
   ['nauigabiles', 'habet'],
   ['ceterum', 'breuissimo'],
   ['uocalem', 'pronuntiant'],
   ['fidelium', 'sacerdos'],
   ['sepulcris', 'infundunt'],
   ['perdiderat', 'quidam'],
   ['Senam', 'propagandae'],
   ['inseruiebat', 'aegris'],
   ['confessiones', 'excipit'],
   ['salutationem', 'exhibent'],
   ['aliquando', 'continet']]),
 ('abelard/dialogus.txt', [['circumcide', 'rentur']]),
 ('alanus/alanus1.txt',
  [['contra', 'positionem'],
   ['prae', 'conceptionis'],
   ['inter', 'familiaritatis']]),
 ('albertanus/albertanus.arsloquendi.txt',
  [['passionibus', 'alienus'],
   ['mendacio', 'redimere'],
   ['mendacium', 'penitus'],
   ['psalterium', 'suauem'],
   ['iniuriam', 'cohibere'],
   ['propulsan

In [11]:
perseus_latin_reader = get_corpus_reader(corpus_name='latin_text_perseus', language='latin')
perseus_split_words = get_split_words(perseus_latin_reader, latin_trie) 
print(f'{len(perseus_split_words)} files with oddly joined words out of {len(perseus_latin_reader.fileids())} corpus files')
list(perseus_split_words.items())[:10]

100%|██████████| 293/293 [03:21<00:00,  5.72files/s]

112 files with oddly joined words out of 293 corpus files





[('ammianus-marcellinus__rerum-gestarum__latin.json',
  [['constr', 'ingerentur'],
   ['inter', 'clamantibus'],
   ['incorruptis', 'simum']]),
 ('apuleius__apologia__latin.json',
  [['paupertatem', 'philosopho'], ['ostendis', 'humanissimo']]),
 ('ausonius-decimus-magnus__caesares__latin.json',
  [['sequentes', 'expediam']]),
 ('ausonius-decimus-magnus__commemoratio-professorum-burdigalensium__latin.json',
  [['etenim', 'commemorare'],
   ['cathedrae', 'perdidit'],
   ['noster', 'commemorauit'],
   ['tamen', 'grammatices'],
   ['dignus', 'grammaticos'],
   ['commemoratus', 'Urbice'],
   ['magistrum', 'collegam'],
   ['nuncupant', 'Apollinares'],
   ['carminum', 'orationem'],
   ['genitori', 'conlatus'],
   ['solstitialis', 'uelut'],
   ['disciplinis', 'adpulit']]),
 ('ausonius-decimus-magnus__eclogarum-liber__latin.json',
  [['Prometheus', 'testatur'],
   ['curis', 'sollicitudo'],
   ['addens', 'quadrantem'],
   ['quattuor', 'feruidis'],
   ['uoltu', 'perstrictus'],
   ['quadrigis', 'iu

## Evaluation
### The word splitting is effective and the data generally looks convincing; however, it is most useful when paired with some supervision. There are a few edge cases, such as prepositions sometimes being split away needlessly from compound verbs (compounding is a regular linguistic trend, and variety is expected). Depending on your requirements, needless splits may be acceptable. For example, if you're looking to build a high-quality embedding, splitting may help cluster meanings in a word vector representation. Auto-splitting is a tool for your toolbox and a suggestion to be considered, depending on your needs and use of the corpus.
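
If needless prefix splits are not acceptable for your use case, one possible form of light supervision is to discard any suggested split whose first part is a preposition that commonly acts as a prefix. The sketch below assumes a hand-picked prefix list. `PREFIXES` and `drop_prefix_splits` are hypothetical names, and the list is far from complete.

```python
# Hypothetical list of prepositions that commonly act as prefixes.
PREFIXES = {
    "ab", "ad", "circum", "contra", "de", "ex", "in", "inter",
    "ob", "per", "prae", "pro", "sub", "trans",
}


def drop_prefix_splits(pairs, prefixes=PREFIXES):
    """Keep only the suggested splits whose first part is not a bare prefix."""
    return [pair for pair in pairs if pair[0].lower() not in prefixes]


# Suggested splits taken from the corpus output shown earlier.
pairs = [
    ["contra", "positionem"],
    ["inter", "familiaritatis"],
    ["passionibus", "alienus"],
]

kept = drop_prefix_splits(pairs)
print(kept)
```

This filter errs on the side of leaving words joined, so a genuine missing space after a preposition would be skipped as well.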

## Saving & Restoring the Word Trie for later use

In [12]:
with open('latin.word_trie.pkl', 'wb') as writer:
    pickle.dump(latin_trie, writer)

In [13]:
my_new_trie = None
with open('latin.word_trie.pkl', 'rb') as reader:
    my_new_trie = pickle.load(reader)

In [14]:
# prove that the reconstituted trie can be used:
my_new_trie.has_word('et')

True

### This word_trie will be used in other notebooks.
## That's all for now, folks!