## Importing corpus readers

**NB** *I have found many errors in these corpus reader modules. Use them with caution.*

First we will need to import the corpus reader.

To do so, you must `import` the function (`get_corpus_reader`) `from` its module (`cltk.corpus.readers`).

In [4]:
from cltk.corpus.readers import get_corpus_reader

Let's say you want to open a Latin text...

...first, you might want a detailed (`-l`) listing (`ls`) of the corpora in your Latin directory (`~/cltk_data/latin/text/`).

In [5]:
ls -l ~/cltk_data/latin/text/

total 0
drwxr-xr-x    5 rjbarnes  staff    160 Dec 12 10:29 latin_text_antique_digiliblt/
drwxr-xr-x   46 rjbarnes  staff   1472 Dec 12 10:29 latin_text_corpus_grammaticorum_latinorum/
drwxr-xr-x  623 rjbarnes  staff  19936 Dec 11 17:10 latin_text_latin_library/
drwxr-xr-x   55 rjbarnes  staff   1760 Dec 11 17:33 latin_text_perseus/
drwxr-xr-x   20 rjbarnes  staff    640 Dec 12 10:28 latin_text_poeti_ditalia/
drwxr-xr-x    8 rjbarnes  staff    256 Dec 12 10:27 latin_text_tesserae/
drwxr-xr-x    3 rjbarnes  staff     96 Feb 21 12:34 phi5/
drwxr-xr-x    3 rjbarnes  staff     96 Jan 18 17:50 phi7/
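
The same listing can be produced from Python, which may be useful where the `ls` shell command is not available. This is only a minimal sketch using the standard library: `list_corpora` is a hypothetical helper name (not part of CLTK), and the path is the data location used in this notebook.

```python
from pathlib import Path


def list_corpora(base_dir):
    """Return the sorted names of the sub-directories of base_dir.

    Returns an empty list if base_dir does not exist.
    """
    base = Path(base_dir).expanduser()
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.iterdir() if p.is_dir())


# Data location used in this notebook; prints nothing if it is missing
for name in list_corpora("~/cltk_data/latin/text"):
    print(name)
```

Each printed name can then be passed as `corpus_name` to `get_corpus_reader`, provided CLTK has a reader for that corpus.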


The CLTK has metadata filtering modules for the Latin Library, Perseus, and PHI5.

## The Latin Library ##

Let's see how to filter the Latin Library with CLTK.

First, choose a variable name (`latin_corpus`) and assign (`=`) to it the corpus reader for the Latin Library (`get_corpus_reader(corpus_name = 'latin_text_latin_library', language = 'latin')`).

In [3]:
latin_corpus = get_corpus_reader(corpus_name = 'latin_text_latin_library', language = 'latin')

You will now be able to read the metadata for corpus filtration...

### Filtering by time period, author, text ###

CLTK has ways of filtering the Latin Library by time period, author, and text.

To filter by time period and author, you must import the dictionary (`corpus_directories_by_type`).

In [4]:
from cltk.corpus.latin.latin_library_corpus_types import corpus_directories_by_type

Then, you may filter by **time period** (`keys`)...

In [5]:
list(corpus_directories_by_type.keys())

['republican',
 'augustan',
 'early_silver',
 'late_silver',
 'old',
 'christian',
 'medieval',
 'renaissance',
 'neo_latin',
 'misc',
 'early']

... and by **author** (`values`)

In [7]:
list(corpus_directories_by_type.values())

[['./caesar', './lucretius', './nepos', './cicero'],
 ['./livy', './ovid', './horace', './vergil', './hyginus'],
 ['./martial',
  './juvenal',
  './tacitus',
  './lucan',
  './quintilian',
  './sen',
  './statius',
  './silius',
  './columella'],
 ['./suetonius',
  './gellius',
  './apuleius',
  './justin',
  './apicius',
  './fulgentius',
  './orosius'],
 ['./plautus'],
 ['./ambrose',
  './abelard',
  './alcuin',
  './augustine',
  './bede',
  './bible',
  './cassiodorus',
  './commodianus',
  './gregorytours',
  './hugo',
  './isidore',
  './jerome',
  './prudentius',
  './tertullian',
  './kempis',
  './leothegreat'],
 ['./boethiusdacia', './dante'],
 [],
 ['./addison',
  './bacon',
  './bultelius',
  './descartes',
  './erasmus',
  './galileo',
  './kepler',
  './may',
  './melanchthon',
  './xylander',
  './campion'],
 ['./alanus',
  './albertanus',
  './albertofaix',
  './aquinas',
  './ammianus',
  './arnobius',
  './capellanus',
  './cato',
  './claudian',
  './curtius',
  './e
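
Because `keys()` and `values()` come back in the same order, each period lines up with one list of author directories, and it is usually simpler to index the dictionary directly. The sketch below uses a small excerpt copied from the output above rather than the real dictionary; `authors_for_period` and `period_for_author` are hypothetical helper names, not CLTK functions.

```python
# A small excerpt of the mapping, copied from the output above
# (the real dictionary is imported from cltk and is much larger).
sample_directories_by_type = {
    "republican": ["./caesar", "./lucretius", "./nepos", "./cicero"],
    "augustan": ["./livy", "./ovid", "./horace", "./vergil", "./hyginus"],
    "old": ["./plautus"],
}


def authors_for_period(mapping, period):
    """Return author directory names for a period, without the './' prefix."""
    return [d[2:] if d.startswith("./") else d for d in mapping.get(period, [])]


def period_for_author(mapping, author):
    """Return the first period whose list contains the author, or None."""
    for period, directories in mapping.items():
        if "./" + author in directories:
            return period
    return None


print(authors_for_period(sample_directories_by_type, "augustan"))
print(period_for_author(sample_directories_by_type, "cicero"))
```

With CLTK installed, the same helpers should work when passed `corpus_directories_by_type` in place of the excerpt.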

To filter by **text**, you must import the dictionary (`corpus_texts_by_type`).

In [6]:
from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type

Then, you may filter by individual **text** (`values`).

In [8]:
list(corpus_texts_by_type.values())

[['sall.1.txt',
  'sall.2.txt',
  'sall.cotta.txt',
  'sall.ep1.txt',
  'sall.ep2.txt',
  'sall.frag.txt',
  'sall.invectiva.txt',
  'sall.lep.txt',
  'sall.macer.txt',
  'sall.mithr.txt',
  'sall.phil.txt',
  'sall.pomp.txt',
  'varro.frag.txt',
  'varro.ll10.txt',
  'varro.ll5.txt',
  'varro.ll6.txt',
  'varro.ll7.txt',
  'varro.ll8.txt',
  'varro.ll9.txt',
  'varro.rr1.txt',
  'varro.rr2.txt',
  'varro.rr3.txt',
  'sulpicia.txt'],
 ['resgestae.txt',
  'resgestae1.txt',
  'manilius1.txt',
  'manilius2.txt',
  'manilius3.txt',
  'manilius4.txt',
  'manilius5.txt',
  'catullus.txt',
  'vitruvius1.txt',
  'vitruvius10.txt',
  'vitruvius2.txt',
  'vitruvius3.txt',
  'vitruvius4.txt',
  'vitruvius5.txt',
  'vitruvius6.txt',
  'vitruvius7.txt',
  'vitruvius8.txt',
  'vitruvius9.txt',
  'propertius1.txt',
  'tibullus1.txt',
  'tibullus2.txt',
  'tibullus3.txt'],
 ['pliny.ep1.txt',
  'pliny.ep10.txt',
  'pliny.ep2.txt',
  'pliny.ep3.txt',
  'pliny.ep4.txt',
  'pliny.ep5.txt',
  'pliny.ep6.tx
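
The nested lists are long, so in practice you may want to search them rather than read them. The following is a sketch under the assumption that file names begin with an author or work abbreviation, as they do in the output above; `find_texts` is a hypothetical helper, and the sample data is a shortened copy of that output.

```python
from itertools import chain

# Excerpt of list(corpus_texts_by_type.values()), copied from the output above
sample_text_groups = [
    ["sall.1.txt", "sall.2.txt", "varro.ll5.txt", "varro.rr1.txt", "sulpicia.txt"],
    ["resgestae.txt", "catullus.txt", "vitruvius1.txt", "tibullus1.txt"],
]


def find_texts(groups, prefix):
    """Return every file name in the nested lists that starts with prefix."""
    return [name for name in chain.from_iterable(groups) if name.startswith(prefix)]


print(find_texts(sample_text_groups, "varro"))
```

Any file name found this way can be passed to the reader methods shown in the next section.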

### Interacting with the library ###

At this point you can count (`len`) the number of documents (`docs`) listed (`list`) in the Latin Library (`latin_corpus`).

In [9]:
len(list(latin_corpus.docs()))

2141

In addition to counting documents (`docs`), you can also count the paragraphs (`paras`), sentences (`sents`), and words (`words`) of an individual text.

Let's use, as an example, Seneca the Elder's *Suasoriae* (`'seneca.suasoriae.txt'`).

In [10]:
len(list(latin_corpus.paras('seneca.suasoriae.txt')))

176

In [11]:
len(list(latin_corpus.sents('seneca.suasoriae.txt')))

783

In [12]:
len(list(latin_corpus.words('seneca.suasoriae.txt')))

13382
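
The three counts above can be gathered in one call. This is a sketch, not a CLTK feature: `count_items` and `text_stats` are hypothetical helpers, and `ToyReader` is a made-up stand-in so the example runs without CLTK or the corpus installed. The only assumption about a real reader is the one the cells above already rely on, namely that `paras`, `sents`, and `words` accept a file name and return something iterable.

```python
def count_items(iterable):
    """Count items one at a time instead of building a full list."""
    return sum(1 for _ in iterable)


def text_stats(reader, fileid):
    """Return paragraph, sentence, and word counts for one file."""
    return {
        "paras": count_items(reader.paras(fileid)),
        "sents": count_items(reader.sents(fileid)),
        "words": count_items(reader.words(fileid)),
    }


class ToyReader:
    """Hypothetical stand-in for a corpus reader, for demonstration only."""

    def __init__(self, paragraphs):
        self._paragraphs = paragraphs

    def paras(self, fileid=None):
        yield from self._paragraphs

    def sents(self, fileid=None):
        for para in self._paragraphs:
            # Naive split on full stops; real readers use a proper tokenizer
            yield from (s.strip() + "." for s in para.split(".") if s.strip())

    def words(self, fileid=None):
        for para in self._paragraphs:
            yield from para.replace(".", " ").split()


toy = ToyReader([
    "Gallia est omnis divisa. In partes tres.",
    "Horum omnium fortissimi sunt Belgae.",
])
print(text_stats(toy, "toy.txt"))
```

With the real reader, the equivalent call would be `text_stats(latin_corpus, 'seneca.suasoriae.txt')`.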

## Perseus Library ##

OK, we've seen how to deal with the Latin Library...

Let's try with the Perseus library.

First, let's come up with a variable (`perseus_corpus`) and assign (`=`) to it the corpus reader for the Perseus library (`get_corpus_reader(language='latin', corpus_name='latin_text_perseus')`).

In [13]:
perseus_corpus = get_corpus_reader(language='latin', corpus_name='latin_text_perseus')

### Filtering by time period, author/text ###

Let's filter the documents in the library by, first, importing the dictionary (`perseus_corpus_texts_by_type`)

In [14]:
from cltk.corpus.latin.perseus_corpus_types import perseus_corpus_texts_by_type

...and, then, by breaking the corpus up by **time period** (`keys`)...

In [15]:
list(perseus_corpus_texts_by_type.keys())

['augustan',
 'christian',
 'early_silver',
 'late_silver',
 'misc',
 'neo_latin',
 'old',
 'republican']

...and by **text/author** (`values`)

In [16]:
list(perseus_corpus_texts_by_type.values())

[['propertius-sextus__elegies__latin.json',
  'horace__ars-poetica__latin.json',
  'horace__carmen-saeculare__latin.json',
  'horace__odes__latin.json',
  'horace__satires__latin.json',
  'horace__epistulae__latin.json',
  'horace__epodi__latin.json',
  'ovid__amores__latin.json',
  'ovid__art-of-love__latin.json',
  'ovid__book-of-days__latin.json',
  'ovid__epistulae__latin.json',
  'ovid__ibis__latin.json',
  'ovid__letters-from-the-black-sea__latin.json',
  'ovid__metamorphoses__latin.json',
  'ovid__remedy-of-love__latin.json',
  'ovid__sorrows__latin.json',
  'ovid__art-of-beauty__latin.json',
  'virgil__aeneid__latin.json',
  'virgil__eclogues__latin.json',
  'virgil__georgics__latin.json',
  'tibullus__elegiae__latin.json',
  'vitruvius-pollio__on-architecture__latin.json'],
 ['ausonius-decimus-magnus__epistulae__latin.json',
  'boethius-d-524__de-consolatione-philosophiae__latin.json',
  'boethius-d-524__de-fide-catholica__latin.json',
  'boethius-d-524__quomodo-trinitas-unus-

- *NB: Perseus differs from the Latin Library insofar as it is two-tiered (keys, values) rather than three-tiered (keys, values, and docs).*
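
Since Perseus has no separate author tier, the author has to be read off the file name. The file names in the output above appear to follow the pattern `author__title__language.json`; the sketch below relies on that observed pattern, which is an assumption rather than a documented guarantee. `parse_perseus_fileid` is a hypothetical helper.

```python
def parse_perseus_fileid(fileid):
    """Split 'author__title__language.json' into its three parts."""
    stem = fileid[:-len(".json")] if fileid.endswith(".json") else fileid
    parts = stem.split("__")
    if len(parts) != 3:
        # Fail loudly rather than guess when a name does not fit the pattern
        raise ValueError(f"unexpected file name pattern: {fileid!r}")
    author, title, language = parts
    return {"author": author, "title": title, "language": language}


print(parse_perseus_fileid("horace__odes__latin.json"))
```

Applied to every name in `perseus_corpus_texts_by_type.values()`, this would let you group texts by author as well as by period.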

### Interacting with the library ###

As with the Latin Library, we can count the documents (`docs`) in the library.

In [17]:
len(list(perseus_corpus.docs()))

293

We can also count the paragraphs (`paras`), sentences (`sents`), and words (`words`) of an individual text.

In [18]:
len(list(perseus_corpus.paras('seneca-lucius-annaeus-55-bc-ca-39-ad__suasoriae__latin.json')))

100

In [19]:
len(list(perseus_corpus.sents('seneca-lucius-annaeus-55-bc-ca-39-ad__suasoriae__latin.json')))

623

In [20]:
len(list(perseus_corpus.words('seneca-lucius-annaeus-55-bc-ca-39-ad__suasoriae__latin.json')))

12393

### Opening and navigating a text from the library ###

Now, let's pull up (`list`) an individual document from the Perseus library (`perseus_corpus.docs`), such as Seneca's *Suasoriae* (`'seneca-lucius-annaeus-55-bc-ca-39-ad__suasoriae__latin.json'`).

In [21]:
list(perseus_corpus.docs('seneca-lucius-annaeus-55-bc-ca-39-ad__suasoriae__latin.json'))

[{'meta': 'chapter-section',
  'author': 'seneca, lucius annaeus, 55 b.c.-ca. 39 a.d.',
  'text': {'0': {'0': '\nCuicumque rei magnitudinem natura\ndederat dedit et modum: nihil infinitum est nisi\nOceanus. Aiunt fertiles in Oceano iacere terras\nultraque Oceanum rursus alia litora, alium nasci orbem,\nnec usquam rerum naturam desinere, sed semper\ninde ubi desisse uideatur nouam exsurgere. facile\nista finguntur quia Oceanus nauigari non potest.\nSatis sit hactenus Alexandro uicisse qua mundo\nlucere satis est. Intra has terras caelum Hercules\nmeruit. Stat immotum mare et quasi deficientis in suo\nfine naturae pigra moles: nouae ac terribiles figurae,\nmagna etiam Oceano portenta quae profunda ista\nuastitas nutrit, confusa lux alta caligine et\ninterceptus tenebris dies, ipsum uero graue et defixum\nmare et aut nulla aut ignota sidera. Ita est,\n \nAlexander, rerum natura: post omnia Oceanus, post\n Oceanum nihil.\n',
    '1': ' ARGENTARI. Resiste, orbis te tuus\nreuocat: uicimus qu
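
The document comes back as a dictionary whose `'text'` entry is itself nested, with string keys such as `'0'` and `'1'`. One way to walk it in reading order is sketched below. `sample_doc` is an abbreviated stand-in for the output above, `iter_sections` is a hypothetical helper, and the code assumes every key under `'text'` is a numeric string, which is all the output above shows.

```python
# Shortened stand-in for one item of perseus_corpus.docs(...);
# the field names follow the output above, the text is abbreviated.
sample_doc = {
    "meta": "chapter-section",
    "author": "seneca, lucius annaeus, 55 b.c.-ca. 39 a.d.",
    "text": {
        "0": {
            "0": "\nCuicumque rei magnitudinem natura\ndederat dedit et modum",
            "1": " ARGENTARI. Resiste",
        },
    },
}


def iter_sections(node, path=()):
    """Yield (path, text) pairs from nested dicts, in numeric key order."""
    if isinstance(node, str):
        yield path, node
        return
    # Sort as integers so that '10' comes after '2'
    for key in sorted(node, key=int):
        yield from iter_sections(node[key], path + (key,))


for path, text in iter_sections(sample_doc["text"]):
    print(".".join(path), "->", text.strip()[:30])
```

Sorting the keys as integers matters once a text has ten or more sections, because plain string sorting would place `'10'` before `'2'`.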

To navigate to a particular paragraph (`paras`) or sentence (`sents`), we provide its number, beginning with 0.

The first paragraph (`paras`) is `[0]`

In [22]:
list(perseus_corpus.paras('seneca-lucius-annaeus-55-bc-ca-39-ad__suasoriae__latin.json'))[0]

'Cuicumque rei magnitudinem natura\ndederat dedit et modum: nihil infinitum est nisi\nOceanus. Aiunt fertiles in Oceano iacere terras\nultraque Oceanum rursus alia litora, alium nasci orbem,\nnec usquam rerum naturam desinere, sed semper\ninde ubi desisse uideatur nouam exsurgere. facile\nista finguntur quia Oceanus nauigari non potest.\nSatis sit hactenus Alexandro uicisse qua mundo\nlucere satis est. Intra has terras caelum Hercules\nmeruit. Stat immotum mare et quasi deficientis in suo\nfine naturae pigra moles: nouae ac terribiles figurae,\nmagna etiam Oceano portenta quae profunda ista\nuastitas nutrit, confusa lux alta caligine et\ninterceptus tenebris dies, ipsum uero graue et defixum\nmare et aut nulla aut ignota sidera. Ita est,\n \nAlexander, rerum natura: post omnia Oceanus, post\n Oceanum nihil.'

The first sentence (`sents`) is `[0]`

In [23]:
list(perseus_corpus.sents('seneca-lucius-annaeus-55-bc-ca-39-ad__suasoriae__latin.json'))[0]

'Cuicumque rei magnitudinem natura\ndederat dedit et modum: nihil infinitum est nisi\nOceanus.'
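
Wrapping the reader output in `list(...)` builds every paragraph or sentence before the index is applied. If only one item is needed, it can be fetched without materializing the rest. The following is a sketch using the standard library; `nth` is a hypothetical helper name, and the generator here is a stand-in for the reader output.

```python
from itertools import islice


def nth(iterable, n, default=None):
    """Return item number n (counting from 0), or default if there are fewer items."""
    return next(islice(iterable, n, None), default)


# Works on any iterable, including generators
sentences = (s for s in ["Prima sententia.", "Secunda sententia.", "Tertia sententia."])
print(nth(sentences, 1))
```

With the reader from this notebook, the call would look like `nth(perseus_corpus.sents(fileid), 0)`, where `fileid` is the JSON file name used above.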

As you may have noted, the text is messy.

### Cleaning opened text ###

If you want to clean a text, you must import the function (`remove_non_latin`).

In [24]:
from cltk.corpus.utils.formatter import remove_non_latin

Assign (`=`) the text you want to clean to a variable (`test_text`). For our purposes, this is the first paragraph of Seneca's *Suasoriae* (`list(perseus_corpus.paras('seneca-lucius-annaeus-55-bc-ca-39-ad__suasoriae__latin.json'))[0]`).

In [25]:
test_text = list(perseus_corpus.paras('seneca-lucius-annaeus-55-bc-ca-39-ad__suasoriae__latin.json'))[0]

Now, clean (`remove_non_latin`) the paragraph (`test_text`)

In [26]:
remove_non_latin(test_text)

'Cuicumque rei magnitudinem naturadederat dedit et modum nihil infinitum est nisiOceanus Aiunt fertiles in Oceano iacere terrasultraque Oceanum rursus alia litora alium nasci orbemnec usquam rerum naturam desinere sed semperinde ubi desisse uideatur nouam exsurgere facileista finguntur quia Oceanus nauigari non potestSatis sit hactenus Alexandro uicisse qua mundolucere satis est Intra has terras caelum Herculesmeruit Stat immotum mare et quasi deficientis in suofine naturae pigra moles nouae ac terribiles figuraemagna etiam Oceano portenta quae profunda istauastitas nutrit confusa lux alta caligine etinterceptus tenebris dies ipsum uero graue et defixummare et aut nulla aut ignota sidera Ita est Alexander rerum natura post omnia Oceanus post Oceanum nihil'

But there's a problem...

Since that function removes all the periods and commas, we have to tell it to keep them as well (`also_keep=['.', ',']`).

In [27]:
remove_non_latin(test_text, also_keep=['.', ','])

'Cuicumque rei magnitudinem naturadederat dedit et modum nihil infinitum est nisiOceanus. Aiunt fertiles in Oceano iacere terrasultraque Oceanum rursus alia litora, alium nasci orbem,nec usquam rerum naturam desinere, sed semperinde ubi desisse uideatur nouam exsurgere. facileista finguntur quia Oceanus nauigari non potest.Satis sit hactenus Alexandro uicisse qua mundolucere satis est. Intra has terras caelum Herculesmeruit. Stat immotum mare et quasi deficientis in suofine naturae pigra moles nouae ac terribiles figurae,magna etiam Oceano portenta quae profunda istauastitas nutrit, confusa lux alta caligine etinterceptus tenebris dies, ipsum uero graue et defixummare et aut nulla aut ignota sidera. Ita est, Alexander, rerum natura post omnia Oceanus, post Oceanum nihil.'
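
In both cleaned outputs above, the words on either side of a line break are run together (`naturadederat`, `nisiOceanus`), apparently because the newline characters are removed without being replaced by spaces. One possible workaround is to collapse the whitespace before cleaning. This is a sketch in plain Python; `collapse_whitespace` is a hypothetical helper, not part of CLTK.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return " ".join(text.split())


raw = "\nCuicumque rei magnitudinem natura\ndederat dedit et modum: nihil infinitum est nisi\nOceanus."
print(collapse_whitespace(raw))
```

The result can then be cleaned as before, for example with `remove_non_latin(collapse_whitespace(test_text), also_keep=['.', ','])`.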