# Tesserae v5 Demo

This demo walks through the basics of Tesserae v5 as developed through October 11, 2018.

In [6]:
import json

from tesserae.db import TessMongoConnection
from tesserae.db.entities import Frequency, Match, Text, Token, Unit
from tesserae.utils import TessFile
from tesserae.tokenizers import GreekTokenizer, LatinTokenizer
from tesserae.unitizer import Unitizer
from tesserae.matchers import DefaultMatcher

# Set up the connection to the test database
connection = TessMongoConnection('45.55.219.221', 27017, None, None, 'tesstest')

# Clean up the previous demo
connection.connection['frequencies'].delete_many({})
connection.connection['matches'].delete_many({})
connection.connection['texts'].delete_many({})
connection.connection['tokens'].delete_many({})
connection.connection['units'].delete_many({})

<pymongo.results.DeleteResult at 0x7fb79b462648>

## Loading and Storing New Texts

The Tesserae database catalogs metadata for each text, including the title, author, and year of publication, as well as integrity information such as the file path, MD5 hash, and CTS URN.
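
As a rough illustration of the integrity information mentioned above, an MD5 hash of a file can be computed with Python's standard `hashlib` module. This is a minimal sketch; the helper name `md5_of_file` is hypothetical, and how Tesserae itself computes or stores the hash is not shown in this demo.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`.

    Hypothetical helper for illustration; not part of the Tesserae API.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read in fixed-size chunks so large texts need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing a freshly computed digest against the stored one is a cheap way to detect that a file has changed since it was cataloged.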

We start by loading in some metadata from `text_metadata.json`.

In [7]:
with open('text_metadata.json', 'r') as f:
    text_meta = json.load(f)

print('{}{}{}{}'.format('Title'.ljust(15), 'Author'.ljust(15), 'Language'.ljust(15), 'Year'))
print('{}{}{}{}'.format('-----'.ljust(15), '------'.ljust(15), '--------'.ljust(15), '----'))
for t in text_meta:
    print('{}{}{}{}'.format(t['title'].ljust(15), t['author'].ljust(15), t['language'].ljust(15), str(t['year']).ljust(15)))

Title          Author         Language       Year
-----          ------         --------       ----
aeneid         vergil         latin          19             
de oratore     cicero         latin          38             
heracles       euripides      greek          -416           
epistles       plato          greek          -280           


Then insert the new texts with `connection.insert`.

In [8]:
texts = []
for t in text_meta:
    texts.append(Text.json_decode(t))
result = connection.insert(texts)
print(result.inserted_count)
print(result.inserted_id)

AttributeError: 'list' object has no attribute 'collection'

We can retrieve the inserted texts with `tesserae.text_access.retrieve_text_list`. These texts will be converted to objects representing the database entries. The returned text list can be filtered by any valid field in the text database.

In [None]:
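# Assumption: `client` is the database handle wrapped by the connection,
# matching how collections are accessed in the first cell.
from tesserae.text_access import retrieve_text_list
client = connection.connection
texts = retrieve_text_list(client)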

print('{}{}{}{}'.format('Title'.ljust(15), 'Author'.ljust(15), 'Language'.ljust(15), 'Year'))
for t in texts:
    print('{}{}{}{}'.format(t.title.ljust(15), t.author.ljust(15), t.language.ljust(15), t.year))
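
To make the idea of filtering by field concrete, here is a minimal in-memory sketch. It does not use the Tesserae API or MongoDB; `filter_texts` is a hypothetical helper, and the sample records only mirror the titles, authors, and languages printed earlier in this demo.

```python
def filter_texts(texts, **filters):
    """Return the records whose fields equal every given filter value.

    Hypothetical in-memory stand-in; not how Tesserae queries the database.
    """
    return [
        t for t in texts
        if all(t.get(field) == value for field, value in filters.items())
    ]


# Records mirroring the table printed earlier in this demo.
sample = [
    {'title': 'aeneid', 'author': 'vergil', 'language': 'latin'},
    {'title': 'de oratore', 'author': 'cicero', 'language': 'latin'},
    {'title': 'heracles', 'author': 'euripides', 'language': 'greek'},
    {'title': 'epistles', 'author': 'plato', 'language': 'greek'},
]

latin_titles = [t['title'] for t in filter_texts(sample, language='latin')]
print(latin_titles)  # ['aeneid', 'de oratore']
```

Passing several keyword arguments narrows the result further, since a record must match every filter.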

## Loading .tess Files

Text metadata includes the path to the .tess file on the local filesystem. Using a Text retrieved from the database, the file can be loaded for further processing.

In [None]:
tessfile = load_text(client, texts[1].cts_urn)

print(tessfile.path)
print(len(tessfile))
print(tessfile[270])

We can iterate through the file line-by-line.

In [None]:
lines = tessfile.readlines()
for i in range(10):
    print(next(lines))

We can also iterate token-by-token.

In [None]:
tokens = tessfile.read_tokens()
for i in range(10):
    print(next(tokens))
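
The lazy, token-by-token pattern used above can be sketched with a plain generator. This is only an illustration under simple assumptions (tokens are whitespace-separated, and the lines are held in memory); `iter_tokens` and `sample_lines` are hypothetical names, and `TessFile` may split tokens differently.

```python
from itertools import islice


def iter_tokens(lines):
    """Lazily yield whitespace-separated tokens from an iterable of lines."""
    for line in lines:
        for token in line.split():
            yield token


# Opening lines of the Aeneid, used here as sample input.
sample_lines = [
    'arma virumque cano, Troiae qui primus ab oris',
    'Italiam, fato profugus, Laviniaque venit',
]

# islice takes the first few items without exhausting the generator's source.
first_five = list(islice(iter_tokens(sample_lines), 5))
print(first_five)  # ['arma', 'virumque', 'cano,', 'Troiae', 'qui']
```

Using `islice` instead of repeated `next` calls also avoids a `StopIteration` when the source has fewer items than requested.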

## Tokenizing a Text

Texts can be tokenized with `tesserae.tokenizers.get_token_info`. This function takes a raw token and the language of the text, which selects language-specific processing such as lemmatization.

In [None]:
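# `tokens` is the generator created in the previous cell.
from tesserae.tokenizers import get_token_info
tokenized = []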
print('{}{}{}'.format('Raw'.ljust(15), 'Normalized'.ljust(15), 'Lemmata'))
for i in range(10):
    token = get_token_info(next(tokens), tessfile.metadata.language)
    print('{}{}{}'.format(token.raw.ljust(15), token.token_type.ljust(15), token.lemmata))
    tokenized.append(token)
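
To give a feel for what the "Normalized" column might involve, here is a small sketch of Latin token normalization. It assumes a common convention in Latin text processing (lowercasing, dropping non-letters, and merging j/i and v/u); `normalize_latin` is a hypothetical helper, and the exact steps performed by `get_token_info` are not shown in this demo.

```python
import re
import unicodedata


def normalize_latin(raw):
    """Lowercase a raw token, drop non-letters, and merge j/i and v/u.

    Hypothetical helper for illustration; not the Tesserae implementation.
    """
    # Decompose accented characters so diacritics can be stripped.
    token = unicodedata.normalize('NFKD', raw).lower()
    # Keep only the basic Latin letters.
    token = re.sub(r'[^a-z]', '', token)
    # Treat consonantal and vocalic spellings as the same letter.
    return token.replace('j', 'i').replace('v', 'u')


print(normalize_latin('Virumque,'))  # uirumque
```

Normalizing before lookup means that spelling variants of one word share a single normalized form.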

Processed tokens can then be stored in and retrieved from the database, similar to text metadata. Note that the retrieved list can be shorter than the original token list: during insertion, Tesserae removes duplicate tokens to prevent database bloat.

In [None]:
client['tokens'].delete_many({})
result = insert_tokens(client, tokenized)

tokens = retrieve_token_list(client)
for token in tokens:
    print('{}{}{}'.format(token.raw.ljust(15), token.token_type.ljust(15), token.lemmata))
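
The deduplication described above can be sketched as follows. This is an assumption-laden illustration: `SketchToken` is a hypothetical stand-in for the Tesserae `Token` entity, and keying on the normalized form (`token_type`) is a guess at what counts as a duplicate.

```python
from collections import namedtuple

# Hypothetical stand-in for a processed token; not the Tesserae Token entity.
SketchToken = namedtuple('SketchToken', ['raw', 'token_type'])


def deduplicate(tokens):
    """Keep the first token seen for each normalized form, preserving order."""
    unique = {}
    for token in tokens:
        # setdefault only stores a token if its key is not already present.
        unique.setdefault(token.token_type, token)
    return list(unique.values())


sample_tokens = [
    SketchToken('Arma', 'arma'),
    SketchToken('arma', 'arma'),
    SketchToken('cano', 'cano'),
]
print(len(deduplicate(sample_tokens)))  # 2
```

Because dictionaries preserve insertion order in Python 3.7 and later, the result keeps tokens in the order they were first seen.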

## Unitizing a Text

Texts can be unitized into lines (poetry only) or phrases (poetry and prose). Intertext matches are then found between these units of text.

In [None]:
# Unitizing lines of a poem
if tessfile.metadata.author in ['vergil', 'euripides']:
    units = poetry.split_line_units(tessfile)
    for i in range(10):
        print(units[i].raw)

In [None]:
# Unitizing phrases of a poem or prose
if tessfile.metadata.author in ['vergil', 'euripides']:
    units = poetry.split_phrase_units(tessfile)
else:
    units = prose.split_phrase_units(tessfile)
for i in range(10):
    print(units[i].raw)
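
The difference between line units and phrase units can be sketched on a plain string. This is a minimal illustration, not the Tesserae unitizer: the function names are hypothetical, and treating `.`, `;`, `:`, `?`, and `!` as phrase boundaries is an assumption.

```python
import re


def sketch_line_units(text):
    """Split a poem into one unit per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def sketch_phrase_units(text):
    """Split a text into phrases ending at sentence-level punctuation."""
    # Collapse line breaks first, since a phrase may span several lines.
    joined = ' '.join(text.split())
    parts = re.split(r'(?<=[.;:?!])\s+', joined)
    return [p for p in parts if p]


# Aeneid 1.8-11, used here as sample input.
passage = (
    'Musa, mihi causas memora, quo numine laeso,\n'
    'quidve dolens, regina deum tot volvere casus\n'
    'insignem pietate virum, tot adire labores\n'
    'impulerit. Tantaene animis caelestibus irae?'
)

print(len(sketch_line_units(passage)))    # 4
print(len(sketch_phrase_units(passage)))  # 2
```

The same four lines yield only two phrases, which shows why phrase units apply to prose as well: they depend on punctuation rather than on line breaks.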