# Lemmatization

## Latin ##

Let's assign a text to a variable (`cato_agri_praef`). This time, let's use the ...

In [58]:
cato_agri_praef = "Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiverunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem civem existimarint foeneratorem quam furem, hinc licet existimare. Et virum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, verum, ut supra dixi, periculosum et calamitosum. At ex agricolis et viri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque invidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit."

In order to lemmatize this text, we must first import the CLTK data models for Latin lemmatization.

To do so, we first import the corpus importer (`CorpusImporter`) from the appropriate CLTK module (`cltk.corpus.utils.importer`).

In [59]:
from cltk.corpus.utils.importer import CorpusImporter

Next, we use `CorpusImporter` to fetch the CLTK data models (`'latin_models_cltk'`) for Latin lemmatization.

In [60]:
corpus_importer = CorpusImporter('latin')
corpus_importer.import_corpus('latin_models_cltk')

At this point, we must ensure that the text matches the CLTK data by converting uppercase to lowercase, J to I, and V to U, and by removing punctuation.

To do so, we import the CLTK J/V replacement tool (`JVReplacer`) as well as the word tokenization tool (`WordTokenizer`).

In [61]:
from cltk.stem.latin.j_v import JVReplacer
from cltk.tokenize.word import WordTokenizer

Apply the J/V replacement tool to the lowercased text:

In [62]:
jv_replacer = JVReplacer()
cato_agri_praef = jv_replacer.replace(cato_agri_praef.lower())
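
To make the normalization step concrete, here is a minimal pure-Python sketch of lowercasing plus J/V replacement. It is an illustration only and not CLTK's implementation; the names `JV_TABLE` and `normalize_latin` are hypothetical.

```python
# Hypothetical stand-in for the lowercasing + JVReplacer step (not CLTK code).
JV_TABLE = str.maketrans({"j": "i", "v": "u"})


def normalize_latin(text: str) -> str:
    """Lowercase the text, then map j -> i and v -> u."""
    return text.lower().translate(JV_TABLE)


print(normalize_latin("Quanto peiorem civem existimarint"))
# quanto peiorem ciuem existimarint
```

This is why forms such as `ciuem` and `uirum` appear in the token list further down instead of `civem` and `virum`.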

Apply the word tokenizer to split the text into tokens, then filter out the punctuation tokens:

In [63]:
word_tokenizer = WordTokenizer('latin')
cato_word_tokens = word_tokenizer.tokenize(cato_agri_praef.lower())
cato_word_tokens = [token for token in cato_word_tokens if token not in ['.', ',', ':', ';']]
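
The filter above lists each punctuation mark by hand, so any mark missing from the list stays in the token list. A more general alternative, sketched here under the assumption that punctuation arrives as separate tokens, is to keep only tokens that contain at least one letter. The helper name `is_word_token` and the sample tokens are hypothetical.

```python
def is_word_token(token: str) -> bool:
    """Return True if the token contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in token)


# Hypothetical sample: words, an enclitic, and stray punctuation tokens.
sample_tokens = ["bonum", "-que", "colonum", ";", "quis", "?", '"']
word_tokens = [token for token in sample_tokens if is_word_token(token)]
print(word_tokens)
# ['bonum', '-que', 'colonum', 'quis']
```

Enclitic tokens such as `-que` survive because they contain letters, while pure punctuation tokens are dropped without being listed one by one.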

This is what our cleaned text looks like:

In [64]:
print(cato_word_tokens)

['est', 'interdum', 'praestare', 'mercaturis', 'rem', 'quaerere', 'nisi', 'tam', 'periculosum', 'sit', 'et', 'item', 'foenerari', 'si', 'tam', 'honestum', 'maiores', 'nostri', 'sic', 'habuerunt', 'et', 'ita', 'in', 'legibus', 'posiuerunt', 'furem', 'dupli', 'condemnari', 'foeneratorem', 'quadrupli', 'quanto', 'peiorem', 'ciuem', 'existimarint', 'foeneratorem', 'quam', 'furem', 'hinc', 'licet', 'existimare', 'et', 'uirum', 'bonum', 'quom', 'laudabant', 'ita', 'laudabant', 'bonum', 'agricolam', 'bonum', '-que', 'colonum', 'amplissime', 'laudari', 'existimabatur', 'qui', 'ita', 'laudabatur', 'mercatorem', 'autem', 'strenuum', 'studiosum', '-que', 'rei', 'quaerendae', 'existimo', 'uerum', 'ut', 'supra', 'dixi', 'periculosum', 'et', 'calamitosum', 'at', 'ex', 'agricolis', 'et', 'uiri', 'fortissimi', 'et', 'milites', 'strenuissimi', 'gignuntur', 'maxime', '-que', 'pius', 'quaestus', 'stabilissimus', '-que', 'consequitur', 'minime', '-que', 'inuidiosus', 'minime', '-que', 'male', 'cogitantes'

Now that we have cleaned the text, we can lemmatize it with CLTK's backoff lemmatizer (`BackoffLatinLemmatizer`).

In [86]:
from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer

In [87]:
lemmatizer = BackoffLatinLemmatizer()
lemmata = lemmatizer.lemmatize(cato_word_tokens)

In [88]:
print(lemmata)

[('est', 'sum'), ('interdum', 'interdum'), ('praestare', 'praesto'), ('mercaturis', 'mercor'), ('rem', 'res'), ('quaerere', 'quaero'), ('nisi', 'nisi'), ('tam', 'tam'), ('periculosum', 'periculosus'), ('sit', 'sum'), ('et', 'et'), ('item', 'item'), ('foenerari', 'foeneraris'), ('si', 'si'), ('tam', 'tam'), ('honestum', 'honestus'), ('maiores', 'magnus'), ('nostri', 'noster'), ('sic', 'sic'), ('habuerunt', 'habeo'), ('et', 'et'), ('ita', 'ita'), ('in', 'in'), ('legibus', 'lex'), ('posiuerunt', 'posiuerunt'), ('furem', 'fur'), ('dupli', 'duplum'), ('condemnari', 'condemnaris'), ('foeneratorem', 'foenerator'), ('quadrupli', 'quadruplus'), ('quanto', 'quantus'), ('peiorem', 'malus'), ('ciuem', 'ciuis'), ('existimarint', 'existimo'), ('foeneratorem', 'foenerator'), ('quam', 'quam'), ('furem', 'fur'), ('hinc', 'hinc'), ('licet', 'licet'), ('existimare', 'existimaris'), ('et', 'et'), ('uirum', 'uir'), ('bonum', 'bonus'), ('quom', 'cum2'), ('laudabant', 'laudo'), ('ita', 'ita'), ('laudabant', 

Now, we can not only count (`len`) all the words...

In [89]:
print(len(lemmata))

115


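...but also all *unique* (`set`) words. Note that each item in `lemmata` is a `(token, lemma)` pair, so `set(lemmata)` counts unique pairs rather than unique lemmas: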

In [90]:
print(len(set(lemmata)))

90


After lemmatizing, we can analyze things like lexical diversity by dividing (`/`) the number of unique words by the total number of words.

In [91]:
print(len(set(lemmata)) / len(lemmata))

0.782608695652174
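
Because `lemmata` holds `(token, lemma)` tuples, the ratio above treats two different forms of the same lemma as two different items. If the goal is diversity over lemmas, one option is to extract the second element of each pair first. The sketch below uses a handful of pairs copied from the lemmatizer output above; the function name `type_token_ratio` is hypothetical.

```python
def type_token_ratio(items):
    """Number of distinct items divided by the total number of items."""
    return len(set(items)) / len(items)


# A few (token, lemma) pairs copied from the lemmatizer output above.
sample_lemmata = [
    ("est", "sum"),
    ("sit", "sum"),
    ("tam", "tam"),
    ("tam", "tam"),
    ("et", "et"),
]

# Keep only the lemma from each pair.
lemmas_only = [lemma for _token, lemma in sample_lemmata]

print(type_token_ratio(sample_lemmata))  # 0.8: 'est' and 'sit' count separately
print(type_token_ratio(lemmas_only))     # 0.6: both collapse into 'sum'
```

On the full text, the same list comprehension applied to `lemmata` would give a ratio over lemmas instead of over pairs.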


## Greek ##

Let's see how lemmatization works with Greek.

First, let's assign a text to a variable.

In [92]:
athenaeus_incipit = "Ἀθήναιος μὲν ὁ τῆς βίβλου πατήρ· ποιεῖται δὲ τὸν λόγον πρὸς Τιμοκράτην· Δειπνοσοφιστὴς δὲ ταύτῃ τὸ ὄνομα. Ὑπόκειται δὲ τῷ λόγῳ Λαρήνσιος Ῥωμαῖος, ἀνὴρ τῇ τύχῃ περιφανής, τοὺς κατὰ πᾶσαν παιδείαν ἐμπειροτάτους ἐν αὑτοῦ δαιτυμόνας ποιούμενος· ἐν οἷς οὐκ ἔσθ᾽ οὗτινος τῶν καλλίστων οὐκ ἐμνημόνευσεν. Ἰχθῦς τε γὰρ τῇ βίβλῳ ἐνέθετο καὶ τὰς τούτων χρείας καὶ τὰς τῶν ὀνομάτων ἀναπτύξεις καὶ λαχάνων γένη παντοῖα καὶ ζῴων παντοδαπῶν καὶ ἄνδρας ἱστορίας συγγεγραφότας καὶ ποιητὰς καὶ φιλοσόφους καὶ ὄργανα μουσικὰ καὶ σκωμμάτων εἴδη μυρία καὶ ἐκπωμάτων διαφορὰς καὶ πλούτους βασιλέων διηγήσατο καὶ νηῶν μεγέθη καὶ ὅσα ἄλλα οὐδ᾽ ἂν εὐχερῶς ἀπομνημονεύσαιμι, ἢ ἐπιλίποι μ᾽ ἂν ἡ ἡμέρα κατ᾽ εἶδος διεξερχόμενον. Καί ἐστιν ἡ τοῦ λόγου οἰκονομία μίμημα τῆς τοῦ δείπνου πολυτελείας καὶ ἡ τῆς βίβλου διασκευὴ τῆς ἐν τῷ δείπνῳ παρασκευῆς. Τοιοῦτον ὁ θαυμαστὸς οὗτος τοῦ λόγου οἰκονόμος Ἀθήναιος ἥδιστον λογόδειπνον εἰσηγεῖται κρείττων τε αὐτὸς ἑαυτοῦ γινόμενος, ὥσπερ οἱ Ἀθήνησι ῥήτορες, ὑπὸ τῆς ἐν τῷ λέγειν θερμότητος πρὸς τὰ ἑπόμενα τῆς βίβλου βαθμηδὸν ὑπεράλλεται."

In order to lemmatize this text, we must first import the CLTK data models for Greek lemmatization.

To do so, we first import the corpus importer (`CorpusImporter`) from the appropriate CLTK module (`cltk.corpus.utils.importer`).

In [93]:
from cltk.corpus.utils.importer import CorpusImporter

Next, we use `CorpusImporter` to fetch the CLTK data models (`'greek_models_cltk'`) for Greek lemmatization.

In [94]:
corpus_importer = CorpusImporter('greek')
corpus_importer.import_corpus('greek_models_cltk')

At this point, we must ensure that the text matches the CLTK data by converting uppercase to lowercase and removing punctuation. Unlike Latin, Greek needs no J/V replacement.

To do so, we must import the word tokenization tool (`WordTokenizer`).

In [95]:
from cltk.tokenize.word import WordTokenizer

...and filter out punctuation tokens:

In [96]:
word_tokenizer = WordTokenizer('greek')
athenaeus_word_tokens = word_tokenizer.tokenize(athenaeus_incipit.lower())
athenaeus_word_tokens = [token for token in athenaeus_word_tokens if token not in ['.', ',', ':', ';']]

Our cleaned text now looks like this: 

In [97]:
print(athenaeus_word_tokens)

['ἀθήναιος', 'μὲν', 'ὁ', 'τῆς', 'βίβλου', 'πατήρ·', 'ποιεῖται', 'δὲ', 'τὸν', 'λόγον', 'πρὸς', 'τιμοκράτην·', 'δειπνοσοφιστὴς', 'δὲ', 'ταύτῃ', 'τὸ', 'ὄνομα', 'ὑπόκειται', 'δὲ', 'τῷ', 'λόγῳ', 'λαρήνσιος', 'ῥωμαῖος', 'ἀνὴρ', 'τῇ', 'τύχῃ', 'περιφανής', 'τοὺς', 'κατὰ', 'πᾶσαν', 'παιδείαν', 'ἐμπειροτάτους', 'ἐν', 'αὑτοῦ', 'δαιτυμόνας', 'ποιούμενος·', 'ἐν', 'οἷς', 'οὐκ', 'ἔσθ᾽', 'οὗτινος', 'τῶν', 'καλλίστων', 'οὐκ', 'ἐμνημόνευσεν', 'ἰχθῦς', 'τε', 'γὰρ', 'τῇ', 'βίβλῳ', 'ἐνέθετο', 'καὶ', 'τὰς', 'τούτων', 'χρείας', 'καὶ', 'τὰς', 'τῶν', 'ὀνομάτων', 'ἀναπτύξεις', 'καὶ', 'λαχάνων', 'γένη', 'παντοῖα', 'καὶ', 'ζῴων', 'παντοδαπῶν', 'καὶ', 'ἄνδρας', 'ἱστορίας', 'συγγεγραφότας', 'καὶ', 'ποιητὰς', 'καὶ', 'φιλοσόφους', 'καὶ', 'ὄργανα', 'μουσικὰ', 'καὶ', 'σκωμμάτων', 'εἴδη', 'μυρία', 'καὶ', 'ἐκπωμάτων', 'διαφορὰς', 'καὶ', 'πλούτους', 'βασιλέων', 'διηγήσατο', 'καὶ', 'νηῶν', 'μεγέθη', 'καὶ', 'ὅσα', 'ἄλλα', 'οὐδ᾽', 'ἂν', 'εὐχερῶς', 'ἀπομνημονεύσαιμι', 'ἢ', 'ἐπιλίποι', 'μ᾽', 'ἂν', 'ἡ', 'ἡμέρα', 'κατ᾽', 'εἶδος'
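
The output above still contains tokens such as `'πατήρ·'`, because the Greek raised dot (ano teleia) is not in the filter list and stays attached to the preceding word. A possible fix, sketched here in plain Python, is to strip punctuation characters from inside each token. The names `GREEK_PUNCTUATION` and `strip_punctuation` are hypothetical, and both common encodings of the raised dot (U+00B7 and U+0387) are covered as a precaution.

```python
# Characters to strip; the raised dot may be encoded as U+00B7 or U+0387.
GREEK_PUNCTUATION = {"\u00b7", "\u0387", ".", ",", ";", ":"}


def strip_punctuation(token: str) -> str:
    """Remove punctuation characters wherever they occur in the token."""
    return "".join(ch for ch in token if ch not in GREEK_PUNCTUATION)


# Sample tokens taken from the output above, plus a bare full stop.
sample_tokens = ["πατήρ·", "ποιεῖται", "δὲ", "τιμοκράτην·", "."]
cleaned = [strip_punctuation(token) for token in sample_tokens]
cleaned = [token for token in cleaned if token]  # drop tokens that became empty
print(cleaned)
```

With the dot removed, forms like `πατήρ` have a chance of matching an entry in the lemmatizer's data instead of passing through unchanged.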

Now that we have cleaned the text, we can lemmatize it.

First, import CLTK's Greek backoff lemmatizer (`BackoffGreekLemmatizer`):

In [104]:
from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer

Apply the lemmatizer to our cleaned Greek text

In [105]:
lemmatizer = BackoffGreekLemmatizer()
lemmata = lemmatizer.lemmatize(athenaeus_word_tokens)
print(lemmata)

[('ἀθήναιος', 'ἀθήναιος'), ('μὲν', 'μέν'), ('ὁ', 'ὁ'), ('τῆς', 'ὁ'), ('βίβλου', 'βίβλος'), ('πατήρ·', 'πατήρ·'), ('ποιεῖται', 'ποιέω'), ('δὲ', 'δέ'), ('τὸν', 'ὁ'), ('λόγον', 'λόγος'), ('πρὸς', 'πρός'), ('τιμοκράτην·', 'τιμοκράτην·'), ('δειπνοσοφιστὴς', 'δειπνοσοφιστὴς'), ('δὲ', 'δέ'), ('ταύτῃ', 'αὐτός'), ('τὸ', 'ὁ'), ('ὄνομα', 'ὄνομα'), ('ὑπόκειται', 'ὑπόκειμαι'), ('δὲ', 'δέ'), ('τῷ', 'ὁ'), ('λόγῳ', 'λόγος'), ('λαρήνσιος', 'λαρήνσιος'), ('ῥωμαῖος', 'ῥωμαῖος'), ('ἀνὴρ', 'ἀνήρ'), ('τῇ', 'ὁ'), ('τύχῃ', 'τύχη'), ('περιφανής', 'περιφανής'), ('τοὺς', 'ὁ'), ('κατὰ', 'κατά'), ('πᾶσαν', 'πᾶς'), ('παιδείαν', 'παιδεία'), ('ἐμπειροτάτους', 'ἔμπειρος'), ('ἐν', 'ἐν'), ('αὑτοῦ', 'ἑαυτοῦ'), ('δαιτυμόνας', 'δαιτυμών'), ('ποιούμενος·', 'ποιούμενος·'), ('ἐν', 'ἐν'), ('οἷς', 'ὅς'), ('οὐκ', 'οὐ'), ('ἔσθ᾽', 'ἔσθ᾽'), ('οὗτινος', 'ὅστις'), ('τῶν', 'ὁ'), ('καλλίστων', 'καλός'), ('οὐκ', 'οὐ'), ('ἐμνημόνευσεν', 'μνημονεύω'), ('ἰχθῦς', 'ἰχθύς'), ('τε', 'τε'), ('γὰρ', 'γάρ'), ('τῇ', 'ὁ'), ('βίβλῳ', 'βίβλος'), ('ἐν

Now, we can not only count (`len`) all the words...

In [106]:
print(len(lemmata))

162


...but also all *unique* (`set`) words:

In [107]:
print(len(set(lemmata)))

121


After lemmatizing, we can analyze things like lexical diversity by dividing (`/`) the number of unique words by the total number of words.

In [102]:
print(len(set(lemmata)) / len(lemmata))

0.7469135802469136


Let's compare these results with the lexical diversity of the Prologue of Apuleius's *Metamorphoses*, using the same code as in the Latin section.

In [103]:
LatinText = 'At ego tibi sermone isto Milesio varias fabulas conseram auresque tuas benivolas lepido susurro permulceam — modo si papyrum Aegyptiam argutia Nilotici calami inscriptam non spreveris inspicere — , figuras fortunasque hominum in alias imagines conversas et in se rursus mutuo nexu refectas ut mireris. Exordior. "Quis ille?" Paucis accipe. Hymettos Attica et Isthmos Ephyrea et Taenaros Spartiatica, glebae felices aeternum libris felicioribus conditae, mea vetus prosapia est; ibi linguam Atthidem primis pueritiae stipendiis merui. Mox in urbe Latia advena studiorum Quiritium indigenam sermonem aerumnabili labore nullo magistro praeeunte aggressus excolui. En ecce praefamur veniam, siquid exotici ac forensis sermonis rudis locutor offendero. Iam haec equidem ipsa vocis immutatio desultoriae scientiae stilo quem accessimus respondet. Fabulam Graecanicam incipimus. Lector intende: laetaberis.'

In [108]:
from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')
corpus_importer.import_corpus('latin_models_cltk')

from cltk.stem.latin.j_v import JVReplacer
from cltk.tokenize.word import WordTokenizer

jv_replacer = JVReplacer()
LatinText = jv_replacer.replace(LatinText.lower())

word_tokenizer = WordTokenizer('latin')
LatinTextTokens = word_tokenizer.tokenize(LatinText.lower())
LatinTextTokens = [token for token in LatinTextTokens if token not in ['.', ',', ':', ';', '?', '—']]

from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
lemmatizer = BackoffLatinLemmatizer()
lemmata = lemmatizer.lemmatize(LatinTextTokens)

print(len(set(lemmata)) / len(lemmata))

0.9508196721311475
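
One caveat when comparing these ratios: the type-token ratio tends to fall as a text gets longer, since common words repeat more often. A simple precaution, sketched below with hypothetical lemma lists and a hypothetical helper `ttr_on_prefix`, is to compute the ratio over the same number of items from each text.

```python
def ttr_on_prefix(items, n):
    """Type-token ratio computed over the first n items only."""
    prefix = items[:n]
    return len(set(prefix)) / len(prefix)


# Hypothetical lemma lists of unequal length, for illustration only.
text_a = ["sum", "et", "sum", "res", "et", "fur"]
text_b = ["ego", "tu", "sermo", "fabula"]

# Compare both texts over the length of the shorter one.
n = min(len(text_a), len(text_b))
print(ttr_on_prefix(text_a, n))  # 0.75 (3 distinct items among the first 4)
print(ttr_on_prefix(text_b, n))  # 1.0
```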
