# Demonstration of the Classical Language Toolkit (CLTK)
Source: http://cltk.org

The Classical Language Toolkit (CLTK) is a Python library offering natural language processing (NLP) for the languages of pre–modern Eurasia. Pre-configured pipelines are available for 19 languages.

For a quick introduction to the motivation and intention behind CLTK, please refer to the paper from the reading list.

For the docs please check out: https://docs.cltk.org/en/latest/index.html

## Installation

There are two ways to install CLTK. Either install it from PyPI with pip:
```$ pip install cltk```

Or clone it from the project's GitHub repository (https://github.com/cltk/cltk) and install it from the command line:
```
$ git clone https://github.com/cltk/cltk.git
$ cd path_to_cltk_dir # (e.g. /user/cltk)
$ make install
```
Note: This requires Poetry. If you have not installed the poetry package yet, execute
```$ pip install poetry```

Making some imports...

In [1]:
from collections import Counter

# Download and process corpora
from cltk.data.fetch import FetchCorpus
from cltk.core.data_types import Doc, Word
from cltk.corpora.grc.tei import onekgreek_tei_xml_to_text

# Sample texts
from cltk.languages.example_texts import get_example_text

# Text normalisation tools
from cltk.alphabet.text_normalization import cltk_normalize
from cltk.alphabet.lat import JVReplacer, LigatureReplacer, normalize_lat
from cltk.alphabet.grc import expand_iota_subscript

# Phonetics/Phonology
from cltk.phonology.lat.phonology import LatinTranscription, LatinSyllabifier
from cltk.phonology.grc.phonology import GreekTranscription # GreekSyllabifier not working?

# Prosody (Scansion)
from cltk.prosody.lat.hexameter_scanner import HexameterScanner
from cltk.prosody.lat.metrical_validator import MetricalValidator
from cltk.prosody.lat.macronizer import Macronizer
from cltk.prosody.grc import Scansion

# Sentence tokenisation
from cltk.sentence.grc import GreekRegexSentenceTokenizer
from cltk.sentence.lat import LatinPunktSentenceTokenizer

# Word tokenisation
from cltk.tokenizers.lat.lat import LatinWordTokenizer
from cltk.tokenizers import GreekTokenizationProcess

# Lemmatisation
from cltk.lemmatize.lat import LatinBackoffLemmatizer
from cltk.lemmatize.grc import GreekBackoffLemmatizer

# Stemming
from cltk.stem.lat import stem

# POS tagging
from cltk.tag.pos import POSTag

# Named entity recognition
from cltk.ner.ner import tag_ner

# Stop words
from cltk.stops.lat import STOPS as STOPS_lat
from cltk.stops.grc import STOPS as STOPS_grc

# Syntactic parsing
from cltk.dependency.processes import StanzaProcess
from cltk.dependency.tree import DependencyTree

# Lexicon
from cltk.lexicon.lat import LatinLewisLexicon

# Decliner
from cltk.morphology.lat import CollatinusDecliner

# Modules

## Corpora
How to check which corpora are available and how to download/clone them from the GitHub repository:

### Latin:

In [2]:
# Check for available corpora
fetcher_lat = FetchCorpus('lat', testing=False)
available_corpora_lat = fetcher_lat.list_corpora
# print(available_corpora_lat)

In [3]:
# Download corpora
# fetcher_lat.import_corpus('lat_models_cltk')
# fetcher_lat.import_corpus('lat_text_latin_library')

### Greek:

In [4]:
fetcher_grc = FetchCorpus('grc', testing=False)
available_corpora_grc = fetcher_grc.list_corpora
# print(available_corpora_grc)

In [5]:
# Download corpora
# fetcher_grc.import_corpus('grc_models_cltk')
# fetcher_grc.import_corpus('grc_text_first1kgreek')
# onekgreek_tei_xml_to_text()

## Load sample texts

### Latin:
Introduction to Caesar's Bellum Gallicum (translation available here: https://www.gottwein.de/Lat/caes/bg1001.php)

In [6]:
bellum_gallicum_ceasar = get_example_text('lat')
# print(bellum_gallicum_ceasar)

### Greek:
Introduction to Plato's Apology (17a-19a) (translation available here: https://www.gottwein.de/Grie/plat/apol17a.php)

In [7]:
apologia_platon = get_example_text('grc')
# print(apologia_platon)

## Text normalisation:

Let's now perform a little bit of text normalisation.

### Latin:
- **Latin alphabet:** Replace J/V with I/U. The classical Latin alphabet does not distinguish between J/j and I/i or between V/v and U/u, yet many texts bear the influence of later editors and the predilections of other languages. In practical terms, the JV substitution is recommended for all Latin text preprocessing because it helps to collapse the search space.

In [8]:
repl = JVReplacer()
normalised_lat_sample = repl.replace('In vino veritas.')
print(normalised_lat_sample)

In uino ueritas.


- **Ligatures in Latin:** Replace the ligatures 'æ'/'Æ' and 'œ'/'Œ' with the separate letters 'ae' and 'oe'. Classical Latin wrote the o and e separately (as has today again become the general practice), but the ligature was used in medieval and early modern writings, in part because the diphthongal sound had, by Late Latin, merged into the sound [e].

In [9]:
lig_repl = LigatureReplacer()
normalised_lat_sample = lig_repl.replace("prœil")
print(normalised_lat_sample)

proeil
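
The two replacements above can be approximated in plain Python without CLTK. This is a minimal sketch and not CLTK's implementation: the function names and the mapping of the capital ligatures to 'Ae'/'Oe' are assumptions made for illustration.

```python
# Plain-Python approximation of J/V and ligature replacement.
# Not CLTK's implementation; names and the capital-ligature mapping are assumptions.
JV_TABLE = str.maketrans({"j": "i", "J": "I", "v": "u", "V": "U"})
LIGATURES = {"æ": "ae", "Æ": "Ae", "œ": "oe", "Œ": "Oe"}


def replace_jv(text: str) -> str:
    """Map j/v to i/u, preserving case."""
    return text.translate(JV_TABLE)


def replace_ligatures(text: str) -> str:
    """Write each ligature as two separate letters."""
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    return text


print(replace_jv("In vino veritas."))  # In uino ueritas.
print(replace_ligatures("mæd prœil"))  # maed proeil
```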


- **normalize_lat()** can apply all of these normalisation options in a single call

In [10]:
norm_sample = "canō Īuliī suspensám quăm aegérrume ĭndignu îs óccidentem frúges Julius Caesar. " \
              "In vino veritas. mæd prœil"
print(normalize_lat(norm_sample, drop_accents=True, drop_macrons=False,
                    jv_replacement=False, ligature_replacement=False))      # all options set to False by default

canō Īuliī suspensam quăm aegerrume ĭndignu is óccidentem frúges Julius Caesar. In vino veritas. mæd prœil


Normalise the excerpt from Caesar's Bellum Gallicum using the normalize_lat() function.

In [11]:
# Normalise bellum gallicum sample
normalised_bg = normalize_lat(bellum_gallicum_ceasar, jv_replacement=True)

### Greek:
Dealing with a different language and alphabet makes it necessary to perform different types of normalisation. With Greek we have to pay attention to:
- **Iota subscriptum:**

In [12]:
str_iota_subscript = "ἐν τῇ νῦν Ἑλλάδι καλεομένῃ χωρῇ οὕτω δ᾽ εἶπε τερᾴζων"
print(expand_iota_subscript(str_iota_subscript,
                            lowercase=True))    # lowercase=True additionally lowercases the whole string

ἐν τῆι νῦν ἑλλάδι καλεομένηι χωρῆι οὕτω δ᾽ εἶπε τεράιζων
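
To see what the expansion does at the character level, here is a minimal standard-library sketch. It is not CLTK's implementation: the function name is hypothetical, and it always writes a lowercase iota. It decomposes the text, swaps the combining iota subscript (U+0345) for a full iota, and recomposes.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # combining Greek ypogegrammeni (iota subscript)
IOTA = "\u03b9"           # Greek small letter iota


def expand_iota_subscript_sketch(text: str) -> str:
    """Replace each iota subscript with a full iota written after its vowel."""
    decomposed = unicodedata.normalize("NFD", text)
    # In canonical order the subscript follows the accent and breathing marks,
    # so replacing it places the iota directly after the accented vowel.
    expanded = decomposed.replace(YPOGEGRAMMENI, IOTA)
    return unicodedata.normalize("NFC", expanded)


print(expand_iota_subscript_sketch("ἐν τῇ νῦν"))
```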


- **Unicode encoding:** In some cases it is relevant to consider encodings, especially when dealing with non-ASCII characters. The cltk_normalize() function builds on unicodedata.normalize() from Python's standard library, which returns the chosen normal form of a Unicode string.

In [13]:
# Unicode normalisation
word = cltk_normalize('διοτρεφές')
print(word) # changes are not visible for us if we print the word

διοτρεφές
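
Although the printed word looks unchanged, normalisation can alter the underlying code points. The following sketch uses only the standard library (not cltk_normalize() itself) to show that the composed (NFC) and decomposed (NFD) forms of the same word render identically but do not compare equal.

```python
import unicodedata

word = "διοτρεφές"
composed = unicodedata.normalize("NFC", word)    # accented epsilon as one code point
decomposed = unicodedata.normalize("NFD", word)  # epsilon + combining acute accent

print(composed, decomposed)            # both render the same on screen
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 9 10
```

This is why it is safer to normalise texts to one form before comparing strings, counting tokens or looking words up in a lexicon.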


Again we perform the above operations on our Greek sample.

In [14]:
expanded_apol = expand_iota_subscript(apologia_platon, lowercase=False)
normalised_apol = cltk_normalize(expanded_apol)

## Sentence Tokenisation

### Latin:

In [15]:
sent_tokeniser_lat = LatinPunktSentenceTokenizer()
sentences_bg = sent_tokeniser_lat.tokenize(normalised_bg)
print(sentences_bg[0])

Gallia est omnis diuisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.


### Greek:

In [16]:
sent_tokeniser_grc = GreekRegexSentenceTokenizer()
sentences_grc = sent_tokeniser_grc.tokenize(normalised_apol)
print(sentences_grc[0])

ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ τῶν ἐμῶν κατηγόρων, οὐκ οἶδα: ἐγὼ δ ̓ οὖν καὶ αὐτὸς ὑπ ̓ αὐτῶν ὀλίγου ἐμαυτοῦ ἐπελαθόμην, οὕτω πιθανῶς ἔλεγον.


## Word Tokenisation

### Latin:

In [17]:
tokeniser_lat = LatinWordTokenizer()
tokens_bg = tokeniser_lat.tokenize(normalised_bg)
print(tokens_bg[0:10])

['Gallia', 'est', 'omnis', 'diuisa', 'in', 'partes', 'tres', ',', 'quarum', 'unam']


### Greek:

In [18]:
tokenizer_process = GreekTokenizationProcess()
output_doc = tokenizer_process.run(input_doc=Doc(raw=normalised_apol))
tokens_apol = output_doc.tokens
print(tokens_apol[0:10])

['ὅτι', 'μὲν', 'ὑμεῖς', ',', 'ὦ', 'ἄνδρες', 'Ἀθηναῖοι', ',', 'πεπόνθατε', 'ὑπὸ']


## Lemmatisation:

### Latin:

In [19]:
lemmatiser_lat = LatinBackoffLemmatizer()     # needs a list input
lemmatised_bg = lemmatiser_lat.lemmatize(tokens_bg)
lemmas_bg = [lemma for token, lemma in lemmatised_bg]
print(lemmatised_bg[0:10])

[('Gallia', 'Gallia'), ('est', 'sum'), ('omnis', 'omnis'), ('diuisa', 'diuido'), ('in', 'in'), ('partes', 'pars'), ('tres', 'tres'), (',', 'punc'), ('quarum', 'qui'), ('unam', 'unus')]


### Greek:

In [20]:
lemmatiser_grc = GreekBackoffLemmatizer()   # needs a list input
lemmatised_apol = lemmatiser_grc.lemmatize(tokens_apol)
lemmas_apol = [lemma for token, lemma in lemmatised_apol]
print(lemmatised_apol[0:10])

[('ὅτι', 'ὅτι'), ('μὲν', 'μέν'), ('ὑμεῖς', 'σύ'), (',', 'punc'), ('ὦ', 'ὦ'), ('ἄνδρες', 'ἀνήρ'), ('Ἀθηναῖοι', 'Ἀθηναῖος'), (',', 'punc'), ('πεπόνθατε', 'πάσχω'), ('ὑπὸ', 'ὑπό')]


## Stemming

### Latin:

In [21]:
# Get the stems for all tokens in Bellum Gallicum
stemmed_tokens_bg = [stem(token) for token in tokens_bg]
print(stemmed_tokens_bg[0:])

# or: get the stems for the lemmas in Bellum Gallicum
# stemmed_lemmas_bg = [stem(lemma) for lemma in lemmas_bg]
# print(stemmed_lemmas_bg[0:10])

['Gall', 'est', 'omn', 'diuis', 'in', 'part', 'tr', ',', 'quar', 'un', 'incolu', 'Belg', ',', 'ali', 'Aquitan', ',', 'terti', 'qui', 'ipsor', 'lingu', 'Celt', ',', 'nostr', 'Gall', 'appella', '.', 'H', 'omn', 'lingu', ',', 'institut', ',', 'leg', 'inter', 's', 'differu', '.', 'Gall', 'ab', 'Aquitan', 'Garumn', 'flumen', ',', '', 'Belg', 'Matron', 'et', 'Sequan', 'diuidi', '.', 'Hor', 'omni', 'fortissim', 'su', 'Belg', ',', 'proptere', 'quod', '', 'cult', 'atque', 'humanitat', 'prouinci', 'longissim', 'absu', ',', 'minim', '-', 'ad', 'e', 'mercator', 'saep', 'commea', 'atque', 'e', 'quae', 'ad', 'effeminand', 'anim', 'pertine', 'importa', ',', 'proxim', '-', 'su', 'German', ',', 'qui', 'trans', 'Rhen', 'incolu', ',', 'cum', 'qu', 'continente', 'bell', 'geru', '.', 'Qu', 'de', 'caus', 'Helueti', 'quoque', 'reliqu', 'Gall', 'uirtut', 'praecedu', ',', 'quod', 'fer', 'cotidian', 'proeli', 'cum', 'German', 'contendu', ',', 'cum', 'aut', 'su', 'fin', 'e', 'prohibe', 'aut', 'ips', 'in', 'eor',

### Greek:

In [22]:
# missing in action (at least currently)

## Stopwords
Stopword lists are available for almost all languages. We can use them, for example, to get a rough estimate of how many (unique) lexical items are in our text, or of the frequency of function words.

### Latin:

In [23]:
# (Unique) lexical items
lexemes_bg = [lemma for lemma in lemmas_bg if lemma not in STOPS_lat]
print(f'Number of lexemes: {len(lexemes_bg)}')
print(f'Unique lexemes: {set(lexemes_bg)}')
print(f'Number of unique lexemes: {len(set(lexemes_bg))}')

Number of lexemes: 159
Unique lexemes: {'bellum', 'obtineo', 'effemino', 'proelium', 'Gallus', 'exter', 'mercator', 'Oceanus', 'Horus', 'septentrio', 'prohibeo', 'Aquitanus', 'saepe', 'Hispania', 'Heluetiis', 'praecedo', 'quoque', 'importo', 'propior', 'contineo', 'sol', 'Rhenus', 'longus', 'vergo', '-que', 'lex', 'alius', 'Gallia', 'noster', 'occasus', 'initium', 'flumen', 'capio', 'Belgaris', 'Galli', 'Hi', 'cultus', 'diuido', 'Heluetii', 'humanitas', 'animus', 'attingo', 'fortis', 'Celtae', 'cum2', 'punc', 'Pyrenaeus', 'pertineo', 'Aquitania', 'con-meo', 'Garumna', 'appello', 'inferus', 'uirtus', 'differo', 'tres', 'prouincia', 'Germanus', 'contendo', 'paruus', 'fere', 'causa', 'Matrona2', 'mons', 'omnis', 'tertius', 'specto', 'finis', 'Sequanus', 'Eos', 'institus', 'cottidianus', 'orior', 'Qua', 'pars', 'incolo1', 'reliquus', 'Rhodanus', 'propterea', 'lingua', 'dico', 'gero1', 'Sequani', 'Belga', 'absum'}
Number of unique lexemes: 85


In [24]:
# Frequency count of stopwords
func_words_bg = [lemma for lemma in lemmas_bg if lemma in STOPS_lat]
func_words_bg_freq = Counter(func_words_bg)
print(f'Frequency counts of function words: {func_words_bg_freq}')

Frequency counts of function words: Counter({'qui': 8, 'ab': 7, 'ad': 6, 'sum': 5, 'et': 5, 'is': 5, 'in': 3, 'unus': 2, 'ipse': 2, 'inter': 2, 'atque': 2, 'aut': 2, 'sui': 1, 'trans': 1, 'de': 1, 'suus': 1, 'quam': 1, 'etiam': 1})
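
The filtering logic of the two cells above can be tried out on a toy example that does not need CLTK. The stop list and lemma list below are hypothetical and far shorter than the real ones. The sketch also drops the 'punc' placeholder that the lemmatiser assigns to punctuation, which otherwise ends up among the lexemes.

```python
from collections import Counter

# Hypothetical miniature data for illustration; CLTK's STOPS lists are much longer.
toy_stops = {"sum", "in", "qui", "et"}
toy_lemmas = ["Gallia", "sum", "omnis", "diuido", "in", "pars", "tres",
              "punc", "qui", "unus", "incolo", "sum"]

# Content words: neither a stopword nor the punctuation placeholder.
content_lemmas = [lemma for lemma in toy_lemmas
                  if lemma not in toy_stops and lemma != "punc"]

# Function words: count how often each stopword occurs.
function_counts = Counter(lemma for lemma in toy_lemmas if lemma in toy_stops)

print(content_lemmas)
print(function_counts.most_common())
```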


### Greek:

In [25]:
# (Unique) lexical items
lexemes_grc = [lemma for lemma in lemmas_apol if lemma not in STOPS_grc]
print(f'Number of lexemes: {len(lexemes_grc)}')
print(f'Unique lexemes: {set(lexemes_grc)}')
print(f'Number of unique lexemes: {len(set(lexemes_grc))}')

Number of lexemes: 287
Unique lexemes: {'εὐλαβέομαι', 'μειρακίωΙ', 'μή', 'πιθανός', 'εἰρήκασιν.', 'σύ', 'αἰσχύνω', 'ἄλλως', 'δήπου', 'τραπεζόω', 'τῆΙ', 'προσέχω', 'ἐμαυτοῦ', 'ὀνόμασιν—πιστεύω', 'δ', 'συγγιγνώσκω', 'πλάσσω', 'εἰκῆΙ', 'μά', 'ἐξελέγχω', 'οὐδείς', 'τρέφω', 'ἑβδομήκοντα', 'γάρ', 'τρόπος', 'ψεύδω', 'ῥήμασί', 'ὥσπερ', 'λέξις', 'οὑτωσί.', 'ἡλικίαΙ', 'πρέπω', 'ξενόω', 'εἴη—αὐτὸ', 'εἷς', 'ἔθω', 'ἀρετή', 'ἐρῶ', 'ἀλλ', 'οἶδα', 'τῆΙδε', 'καλέω', 'ἐγώ', 'πᾶς', 'αὐτίκα', 'ῥήτωρ', 'θαυμάζω', 'μηδ', 'ἐπιλανθάνομαι', 'μήτε', 'καί', 'ἐπιτυγχάνω', 'ἴσως', 'ἕνεκα.', 'βελτίων', 'ἀναίσχυντος', 'οὐδέ', 'αὐτός', 'διά', 'δίκαιος', 'ἔχω', 'δι', 'ὁμολογέω', 'ἄλλοθι', 'ἀγορᾶΙ', 'ἐμός', 'προσδοκάω', 'punc', 'ἀπολογέομαι', 'φαίνω', 'ἔτος', 'καίτοι', 'λόγος', 'κατά', 'τυγχάνω', 'ἐπειδάν', 'νῦν', 'θορυβέω', 'λέγειν.', 'εἰμί', 'ἀκούητέ', 'ἐᾶν—ἴσως', 'μάλιστα', 'λέγω', 'ἄτεχνος', 'ἐνθάδε', 'δοκέω', 'ὀλίγος', 'ὧΙ', 'ἀλήθειαν—οὐ', 'Ζεύς', 'ἀνήρ', 'δέομαι', 'ῥήτωρ.', 'ὑπό', 'ξένος', 'ἀνά', 'χείρων', 'γίγν

In [26]:
# Frequency of function words
func_words_apol = [lemma for lemma in lemmas_apol if lemma in STOPS_grc]
func_words_apol_freq = Counter(func_words_apol)
print(f'Frequency counts of function words: {func_words_apol_freq}')

Frequency counts of function words: Counter({'ὁ': 13, 'οὗτος': 12, 'εἰ': 5, 'ὦ': 4, 'οὖν': 4, 'ὡς': 4, 'ὅς': 4, 'ἐν': 4, 'γε': 3, 'ἄν': 3, 'ἤ': 3, 'ὅτι': 2, 'οὐ': 2, 'τε': 2, 'οὕτως': 1, 'ἄρα': 1, 'τις': 1, 'εἰς': 1})


## POS tagging:
There are multiple taggers available for both Latin and Greek. I chose the 'tag_ngram_123_backoff' tagger for demonstration purposes.

### Latin:

In [27]:
pos_tagger_lat = POSTag('lat')
tagged_bg = pos_tagger_lat.tag_ngram_123_backoff(normalised_bg)
print(tagged_bg[0:10])

[('Gallia', None), ('est', 'V3SPIA---'), ('omnis', 'A-S---MN-'), ('diuisa', None), ('in', 'R--------'), ('partes', 'N-P---FA-'), ('tres', 'M--------'), (',', 'U--------'), ('quarum', 'P-P---FG-'), ('unam', 'A-S---FA-')]
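
The tag strings are compact but hard to read. The sketch below decodes them under the assumption that they follow the nine-position scheme used in the Perseus treebanks (part of speech, person, number, tense, mood, voice, gender, case, degree). The helper name and the code tables are illustrative and deliberately partial; unknown codes are passed through unchanged.

```python
# Assumed meaning of the nine tag positions (Perseus-treebank-style convention).
POSITIONS = ["pos", "person", "number", "tense", "mood",
             "voice", "gender", "case", "degree"]

# Partial code tables, enough for the examples here; extend as needed.
CODES = {
    "pos": {"N": "noun", "V": "verb", "A": "adjective", "P": "pronoun",
            "R": "preposition", "M": "numeral", "U": "punctuation"},
    "person": {"1": "first", "2": "second", "3": "third"},
    "number": {"S": "singular", "P": "plural"},
    "tense": {"P": "present", "I": "imperfect"},
    "mood": {"I": "indicative"},
    "voice": {"A": "active", "P": "passive"},
    "gender": {"M": "masculine", "F": "feminine", "N": "neuter"},
    "case": {"N": "nominative", "G": "genitive", "D": "dative", "A": "accusative"},
    "degree": {},
}


def decode_tag(tag):
    """Turn a positional tag such as 'V3SPIA---' into a readable dict."""
    if tag is None:  # the tagger returns None for unknown tokens
        return {}
    decoded = {}
    for name, code in zip(POSITIONS, tag.upper()):
        if code != "-":  # '-' marks a category that does not apply
            decoded[name] = CODES[name].get(code, code)
    return decoded


print(decode_tag("V3SPIA---"))
print(decode_tag("N-P---FA-"))
```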


### Greek:

In [28]:
pos_tagger_grc = POSTag('grc')
tagged_apol = pos_tagger_grc.tag_ngram_123_backoff(normalised_apol)
print(tagged_apol[0:10])

[('ὅτι', 'C--------'), ('μὲν', 'G--------'), ('ὑμεῖς', 'P-P----N-'), (',', 'U--------'), ('ὦ', 'E--------'), ('ἄνδρες', 'N-P---MN-'), ('Ἀθηναῖοι', None), (',', 'U--------'), ('πεπόνθατε', None), ('ὑπὸ', 'R--------')]


## Named Entity Recognition (NER):

If we are interested in whether our text contains information about locations, persons, etc., we can make use of the named entity tagging module. The function tag_ner(), which requires the language code, returns a mask with one entry per token: a tag (such as 'LOCATION') if the token is a named entity, and False otherwise.

### Latin:
Using our Bellum Gallicum snippet, we get the following results.

In [29]:
ner_lat = tag_ner(iso_code='lat', input_tokens=tokens_bg) # returns a mask that tells you if a token is a NE or not

# print(f'Mask:\n{ner_lat}\n')
print(f'Identified NEs:\n{[(token, mask_val) for token, mask_val in list(zip(tokens_bg, ner_lat)) if mask_val]}')

Identified NEs:
[('Gallia', 'LOCATION'), ('Belgae', 'LOCATION'), ('Aquitani', 'LOCATION'), ('Celtae', 'LOCATION'), ('Galli', 'LOCATION'), ('Belgae', 'LOCATION'), ('Heluetii', 'PERSON'), ('Rhodano', 'LOCATION'), ('Oceano', 'LOCATION'), ('Belgae', 'LOCATION'), ('Galliae', 'LOCATION'), ('Aquitania', 'LOCATION'), ('Oceani', 'LOCATION')]
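
Because the mask is aligned with the token list, it is easy to post-process. As a small sketch with hypothetical data in the shape described above (a label string or False per token), the entities can be grouped by label:

```python
from collections import defaultdict

# Hypothetical tokens and mask, one mask entry per token.
tokens = ["Gallia", "est", "omnis", "diuisa", "in", "partes", "tres"]
mask = ["LOCATION", False, False, False, False, False, False]

entities_by_label = defaultdict(list)
for token, label in zip(tokens, mask):
    if label:  # False marks tokens that are not named entities
        entities_by_label[label].append(token)

print(dict(entities_by_label))  # {'LOCATION': ['Gallia']}
```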


### Greek:
Doing the same with our Greek Apologia sample text returns only one NE:

In [30]:
ner_grc = tag_ner(iso_code='grc', input_tokens=tokens_apol)

# print(f'Mask:\n{ner_grc}\n')
print(f'Identified NEs:\n{[(token, mask_val) for token, mask_val in list(zip(tokens_apol, ner_grc)) if mask_val]}')

Identified NEs:
[('Ἀθηναῖοι', 'LOCATION')]


## Syntactic parsing
For the syntactic parsing CLTK calls on available Stanza models.
Note: To execute the next cells you will need to install Stanza.

In [31]:
# Uncomment next line in case you don't have stanza installed
# !pip install stanza

### Latin:

In [32]:
process_stanza_lat = StanzaProcess(language="lat")
stanza_out_bg = process_stanza_lat.run(Doc(raw=normalised_bg))

In [33]:
# We can now either decide to extract features of the individual words in our text...
words_bg = stanza_out_bg.words
isinstance(words_bg[0], Word)
print(words_bg[0])

Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='Gallia', pos=noun, lemma='mallis', stem=None, scansion=None, xpos='A1|grn1|casA|gen2', upos='NOUN', dependency_relation='nsubj', governor=3, features={Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular]}, category={F: [neg], N: [pos], V: [neg]}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)


In [34]:
# ... or decide to print a dependency tree
sentences_stanza_bg = stanza_out_bg.sentences
t0_bg = DependencyTree.to_tree(sentences_stanza_bg[0])
t0_bg.print_tree()

root | diuisa_3/verb
    └─ nsubj | Gallia_0/noun
        └─ det | omnis_2/pronoun
    └─ cop | est_1/auxiliary
    └─ obl:arg | partes_5/noun
        └─ case | in_4/adposition
        └─ nummod | tres_6/numeral
        └─ acl:relcl | incolunt_10/verb
            └─ punct | ,_7/punctuation
            └─ obj | unam_9/numeral
                └─ nmod | quarum_8/pronoun
            └─ nsubj | Belgae_11/noun
                └─ nmod | Aquitani_14/noun
                    └─ punct | ,_12/punctuation
                    └─ acl:relcl | Celtae_20/adjective
                        └─ nsubj | qui_17/pronoun
                        └─ obl | lingua_19/noun
                            └─ nmod | ipsorum_18/pronoun
                        └─ punct | ,_21/punctuation
        └─ acl:relcl | appellantur_24/verb
            └─ obj | aliam_13/pronoun
                └─ amod | tertiam_16/adjective
            └─ punct | ,_15/punctuation
            └─ xcomp | Galli_23/adjective
                └─ amod | nos

### Greek:

In [35]:
process_stanza_grc = StanzaProcess(language="grc")
stanza_out_apol = process_stanza_grc.run(Doc(raw=normalised_apol))

In [36]:
words_apol = stanza_out_apol.words
isinstance(words_apol[0], Word)
print(words_apol[0])

Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='ὅτι', pos=adverb, lemma='ὅτι', stem=None, scansion=None, xpos='Df', upos='ADV', dependency_relation='advmod', governor=6, features={}, category={F: [neg], N: [pos], V: [pos]}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)


In [37]:
sentences_stanza_apol = stanza_out_apol.sentences
t0_apol = DependencyTree.to_tree(sentences_stanza_apol[0])
t0_apol.print_tree()

root | πεπόνθατε_6/verb
    └─ advmod | ὅτι_0/adverb
    └─ discourse | μὲν_1/adverb
    └─ advmod | ὑμεῖς,_2/adverb
    └─ vocative | ἄνδρες_4/noun
        └─ discourse | ὦ_3/interjection
    └─ iobj | Ἀθηναῖοι,_5/adjective
    └─ conj | οἶδα:_12/verb
        └─ obl | κατηγόρων,_10/noun
            └─ case | ὑπὸ_7/adposition
            └─ det | τῶν_8/determiner
            └─ nmod | ἐμῶν_9/adjective
        └─ advmod | οὐκ_11/adverb


## Phonology
Unlike with a modern spoken language, we cannot simply watch videos or listen to radio podcasts to get an impression of the pronunciation. It gets even worse if we are unable to read a text because we cannot read the script that was used. The phonology (and prosody) sections therefore show which tools can help us.

### Latin:
- **Transcription:**

In [38]:
transcriber_lat = LatinTranscription()
# print(transcriber_lat.transcribe('veritas'))
# print('\n')

# Full sample text
transcribed_bg = transcriber_lat.transcribe(normalised_bg)
# print(transcribed_bg)

- **Syllabification:**

In [39]:
syllabifier_lat = LatinSyllabifier()
# print(syllabifier_lat.syllabify('veritas'))

In [40]:
# We can see that this function is meant for single word strings...
# Full sample text
syllabified_bg = syllabifier_lat.syllabify(normalised_bg)
# print(syllabified_bg[0:10])

In [41]:
# Thus, we would have to iterate over each word:
# Full sample text
syllabified_bg = [syllabifier_lat.syllabify(token) for token in tokens_bg]
# print(syllabified_bg[0:5])

### Greek:
- **Transcription:**

In [42]:
transcriber_grc = GreekTranscription()
# print(transcriber_grc.transcribe(word))
# print(transcriber_grc.transcribe(normalised_apol))

- **Syllabification:**

In [43]:
# Appears not to be working correctly, but we can get it through the Prosody module (see below)

## Prosody

### Latin:
The Latin module for prosody is well developed. It contains classes that analyse or verify the metre of a text string (e.g., ```HexameterScanner()``` and ```MetricalValidator()```).

In [44]:
# Excerpt from the opening of Vergil's Aeneid, which is written in hexameters:
# -uu | -uu | -uu | -uu | -uu | -x (each dactyl -uu can alternatively be replaced by a spondee --)
aeneis = ["Arma virumque cano, Troiae qui primus ab oris ",
         "Italiam, fato profugus, Laviniaque venit ",
         "litora, multum ille et terris iactatus et alto ",
         "vi superum saevae memorem Iunonis ob iram; ",
         "Multa quoque et bello passus, dum conderet urbem, ",
         "inferretque deos Latio, genus unde Latinum, ",
         "Albanique patres, atque altae moenia Romae."]

# Compute the scansion pattern
hexameter_scan = HexameterScanner()
aeneis_hexa = []
for line in aeneis:
        hexa = hexameter_scan.scan(line)
        aeneis_hexa.append(hexa)
        # print(line)
        # print(hexa.scansion)

If we uncomment the print statements above, we can see that HexameterScanner() does not work perfectly (yet). We can optionally use MetricalValidator() to demonstrate this.

In [45]:
metric_veri = MetricalValidator()
# for hexa in aeneis_hexa:
#         print(metric_veri.is_valid_hexameter(hexa.scansion))
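
To make the idea of validating a scansion concrete, here is a minimal regex sketch of the hexameter pattern noted in the cell above. It is not how MetricalValidator() works internally. It assumes the pattern is written with '-' for long, 'u' for short and 'x' for the final anceps, and the scanner's real output format may differ.

```python
import re

# Five feet that are each a dactyl (-uu) or a spondee (--),
# followed by a final foot of one long syllable plus one anceps.
HEXAMETER = re.compile(r"(?:-uu|--){5}-[-ux]")


def is_hexameter(pattern: str) -> bool:
    """Check whether a scansion pattern has the shape of a dactylic hexameter."""
    compact = re.sub(r"[\s|]", "", pattern).lower()  # drop spaces and foot dividers
    return HEXAMETER.fullmatch(compact) is not None


print(is_hexameter("-uu | -uu | -- | -- | -uu | -x"))  # True
print(is_hexameter("-uu | -uu | -uu | -uu | -x"))      # False (only five feet)
```

The sketch checks only the shape of the pattern. It does not check whether the quantities were assigned to the right syllables.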

We can also find something called ```Macronizer()```, which places a macron over naturally long Latin vowels.

In [46]:
macroniser = Macronizer('tag_ngram_123_backoff')

# return only the macronised text
print(macroniser.macronize_text(' '.join(aeneis)))

# return each word with its POS tag and its macronised form
# print(macroniser.macronize_tags(' '.join(aeneis)))

arma virumque cano , trojae quī prīmus ab ōrīs i_taliam , fato profugus , laviniaque venit lītora , multum ille et terrīs iactatus et altō vī superum saevae memorem jūnōnis ob īram ; multa quoque et bellō passūs , dum conderet urbem , inferretque deōs lātiō , genus unde latinum , albanique patrēs , atque altae moenia Rōmae .


### Greek:
For Greek only scansion is available, i.e. it marks long and short syllables but does not take the metre of a text into account. This becomes clear when we look at the first seven lines of the Iliad. The next code cell breaks down the scansion process for Greek.

In [47]:
ilias = "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος " \
        "οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε, " \
        "πολλὰς δ᾽ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν " \
        "ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν " \
        "οἰωνοῖσί τε πᾶσι, Διὸς δ᾽ ἐτελείετο βουλή, " \
        "ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε " \
        "Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς."
scanner_grc = Scansion()

# After some initial cleaning and tokenisation of the text, we can extract the syllables.
scanned_syl_ilias = [syllable for word in scanner_grc._make_syllables(ilias)[0] for syllable in word]

# Using the output from the _make_syllables() function we are now able to perform the scansion:
scansion_syl_ilias = scanner_grc._scansion([scanned_syl_ilias])

# Here we can see which syllable was marked as long and which one as short.
# print(list(zip(scanned_syl_ilias, list(scansion_syl_ilias[0]))))

# Note: The scansion does not reflect the dactylic hexameter used by Homer which would typically be:
#            -uu | -uu | -- | -uu | -uu | --
#       for each of the lines in the excerpt.

In [48]:
# Execute scansion using the scan_text() function
print(scanner_grc.scan_text(ilias))

['¯˘˘¯˘˘˘¯¯˘˘¯˘˘¯˘¯˘˘¯¯˘˘˘¯¯˘˘˘¯˘˘˘¯˘¯˘˘˘˘˘¯˘¯¯¯¯¯˘˘¯˘˘¯˘˘¯˘¯¯¯˘˘¯˘˘˘˘˘¯˘˘¯¯˘¯¯˘¯˘˘˘¯¯˘˘¯˘˘˘¯˘˘˘˘¯¯¯˘˘˘x']


## Lexicon

For some languages, we are able to look up a lemma in a lexicon.

### Latin:
For Latin, CLTK accesses a digital form of Charlton T. Lewis’s An Elementary Latin Dictionary (1890) to match a lemma against the headwords of an entry. If more than one match is found, it returns the concatenated entries.

In [49]:
lex_lat = LatinLewisLexicon(interactive=False)      # by default: interactive=True
print(lex_lat.lookup(lemmas_bg[1]))

sum

 (2d pers. es, or old ēs; old subj praes. siem,
            siēs, siet, sient, for sim, etc., T.; fuat for sit, T., V., L.;
            imperf. often forem, forēs, foret, forent, for essem, etc.;
                fut. escunt for erunt, C.), fuī (fūvimus for
            fuimus, Enn. ap. C.), futūrus (inf fut. fore or futūrum
            esse, C.), esse 
ES-; FEV-. —
I. As a predicate, asserting existence, 
to be, exist, live
: ut id aut esse dicamus aut non esse: flumen est Arar,
            quod, etc., Cs.: homo nequissimus omnium qui
                sunt, qui fuerunt: arbitrari, me nusquam aut nullum fore: fuimus Troes, fuit
            Ilium, V.—Of place, 
to be, be present, be found, stay, live
: cum non liceret Romae quemquam esse: cum essemus in
                castris: deinceps in lege est, ut, etc.: erat nemo, quicum essem libentius
            quam tecum: sub uno tecto esse, L.—Of circumstances or condition, 
to be, be found, be situated, be placed
: Sive erit in Tyriis, Ty

### Greek:

In [50]:
# missing in action (at least currently)

## Declension/Conjugation
CLTK also offers a way to decline or conjugate a lemma using the cltk.morphology module. Again, this is not available for all languages, but we can use Latin to demonstrate it.
 
### Latin:

In [51]:
decliner = CollatinusDecliner()

# Print the conjugation of 'esse' ('to be')
# Full information
print(decliner.decline(lemmas_bg[1], flatten=False, collatinus_dict=False)[0:10]) # both parameters default to False

[('sum', 'v1spia---'), ('es', 'v2spia---'), ('est', 'v3spia---'), ('sumus', 'v1ppia---'), ('estis', 'v2ppia---'), ('sunt', 'v3ppia---'), ('eram', 'v1siia---'), ('eras', 'v2siia---'), ('erat', 'v3siia---'), ('eramus', 'v1piia---')]


In [52]:
# Print only conjugated forms
print(decliner.decline(lemmas_bg[1], flatten=True, collatinus_dict=False)[0:10])

['sum', 'es', 'est', 'sumus', 'estis', 'sunt', 'eram', 'eras', 'erat', 'eramus']
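
The (form, tag) pairs from the unflattened output can be filtered by grammatical category. The sketch below reuses the ten pairs printed above and assumes that the second, third and fourth tag characters encode person, number and tense. The helper name is hypothetical.

```python
# (form, tag) pairs copied from the decliner output shown above.
forms = [("sum", "v1spia---"), ("es", "v2spia---"), ("est", "v3spia---"),
         ("sumus", "v1ppia---"), ("estis", "v2ppia---"), ("sunt", "v3ppia---"),
         ("eram", "v1siia---"), ("eras", "v2siia---"), ("erat", "v3siia---"),
         ("eramus", "v1piia---")]


def select_forms(pairs, person=None, number=None, tense=None):
    """Keep the forms whose tag matches every criterion that is given."""
    selected = []
    for form, tag in pairs:
        if person is not None and tag[1] != person:
            continue
        if number is not None and tag[2] != number:
            continue
        if tense is not None and tag[3] != tense:
            continue
        selected.append(form)
    return selected


print(select_forms(forms, tense="p"))               # present-tense forms
print(select_forms(forms, person="3", number="s"))  # third person singular forms
```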


### Greek:

In [53]:
# missing in action (at least currently)