# Requirements

Before you run this notebook, download the latest version of the dataset from Zenodo. Since version 4.0, the LAGT dataset consists of two parts:
* The main tabular dataset, containing all metadata as well as the lemmatized, filtered sentences. It is offered here as a parquet file that can be loaded in Python directly as a pandas DataFrame.
* Morphological data for each document in the corpus, with one JSON file per document. Each file is a list of sentences, and each sentence is accompanied by a simplified morphological annotation containing the token, the lemma, a simplified POS tag, and the positional index of the token. The directory with these files has to be downloaded and unzipped, e.g. into a `data/large_files/` subdirectory of your repository. Below we demonstrate a potential usage of this data.
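
To make the per-document JSON layout described above concrete, here is a minimal sketch that parses one sentence record. The record is copied from the sample output shown further down in this notebook; the variable names are ours, and the exact layout should be checked against the downloaded files.

```python
import json

# Assumed layout of one document file: a JSON list of sentence records,
# each shaped as [doc_id, sentence_number, sentence_text, token_annotations],
# where every token annotation is [token, lemma, postag, [start, end]].
raw = json.dumps([
    ["tlg0317.tlg001", 352, "ὀφθαλμός ἐχθρῶν ἔπληξέ με·",
     [["ὀφθαλμός", "ὀφθαλμός", "n", [0, 8]],
      ["ἐχθρῶν", "ἐχθρός", "a", [9, 15]],
      ["ἔπληξέ", "πλήσσω", "v", [16, 22]],
      ["με", "ἐγώ", "p", [23, 25]],
      ["·", "·", "u", [25, 26]]]],
], ensure_ascii=False)

# Walk over the records and pull the lemma out of each token annotation.
for doc_id, sent_n, sent_text, tokens in json.loads(raw):
    lemmata = [lemma for _token, lemma, _pos, _span in tokens]
    print(doc_id, sent_n, lemmata)
```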

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import os
import json

### Working with the tabular dataset

In [2]:
# read the dataset directly from zenodo (if you have a good internet connection...)
LAGT = pd.read_parquet("https://zenodo.org/records/13889714/files/LAGT_v4-1.parquet?download=1")
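
If the connection is slow, it may be more convenient to download the file once and read the local copy afterwards. The following is a minimal sketch of that idea: `ensure_local_copy` is a hypothetical helper (not part of the LAGT code), and the local path in the usage comment is only an example.

```python
import os
import urllib.request

def ensure_local_copy(url, local_path, fetch=urllib.request.urlretrieve):
    """Download url to local_path unless the file already exists; return local_path."""
    if not os.path.exists(local_path):
        parent = os.path.dirname(local_path)
        if parent:
            # create the target directory if it is missing
            os.makedirs(parent, exist_ok=True)
        fetch(url, local_path)
    return local_path

# Usage (requires network access on the first call; the local path is hypothetical):
# path = ensure_local_copy(
#     "https://zenodo.org/records/13889714/files/LAGT_v4-1.parquet?download=1",
#     "data/large_files/LAGT_v4-1.parquet",
# )
# LAGT = pd.read_parquet(path)
```

The `fetch` parameter exists only so that the download step can be swapped out, for example in tests.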

In [3]:
# explore the first 5 rows of the dataframe
LAGT.head(5)

| Unnamed: 0 | author_id | doc_id | filename | author | title | string | wordcount | source | lemmatized_sentences | lemmata_source | tlg_date | not_before | not_after | date_uncertain | tlg_epithet | provenience | lemmatacount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | ogl0001 | ogl0001.ogl001 | ogl0001.ogl001.1st1K-grc1.xml | Pinytus | De Epistola Pinyti ad Dionysium | FRAGMENTUM BEATI PINYTI, CNOSSI IN CRETA EPISC... | 180 | 1Kgr | [[Πινυτός, ἀντιγράφω, θαυμάζω, ἀποδέχω, Διονύσ... | grecy | | 101.0 | 200.0 | | [] | christian | 34 |
| 8 | tlg0005 | tlg0005.tlg003 | tlg0005.tlg003.1st1K-grc1.xml | Theocritus | Syrinx | Οὐδενὸς εὐνάτειρα Μακροπτολέμοιο δὲ μάτηρ μαί... | 77 | 1Kgr | [[οὐδενός, εὐνητήρ], [μακροπτολέμοιο, μήτηρ, μ... | grecy | 4-3 B.C. | -400.0 | -201.0 | False | [Bucolici] | pagan | 61 |
| 9 | tlg0006 | tlg0006.tlg020 | tlg0006.tlg020.1st1K-grc1.xml | Euripides | Fragmenta | ποίαν σε φῶμεν γαῖαν ἐκλελοιπότα πόλει ξενοῦσθ... | 17708 | 1Kgr | [[φημί, γῆ, ἐκλείπω, πόλις, ξενοῦσθαι], [πάτρα... | grecy | 5 B.C. | -500.0 | -401.0 | False | [Tragici] | pagan | 10277 |
| 10 | tlg0007 | tlg0007.tlg146 | tlg0007.tlg146.1st1K-grc1.xml | Plutarch | Παροιμίαι αἷς Ἀλεξανδρεῖς ἐχρῶντο | Οἴκοι τὰ Μιλήσια: ἐπὶ τῶν ὅποι μὴ προςήκει τὴν... | 2685 | 1Kgr | [[Μιλήσιος], [προςήκω, τρυφή, ἐπιδείκνυμι], [Ἀ... | grecy | A.D. 1-2 | 1.0 | 200.0 | False | [Biographi, Philosophici/-ae] | pagan | 1488 |
| 11 | tlg0007 | tlg0007.tlg147 | tlg0007.tlg147.1st1K-grc1.xml | Plutarch | Ἐκλογὴ περὶ τῶν ἀδυνάτων | Κατὰ πετρῶν σπείρεις. Πλίνθον πλύνεις. Δικτύῳ ... | 143 | 1Kgr | [[πέτρα, σπείρω], [Πλίνθος, πλύνω, Δίκτυον, ἄν... | grecy | A.D. 1-2 | 1.0 | 200.0 | False | [Biographi, Philosophici/-ae] | pagan | 125 |


### Basic overview

In [31]:
# total word count across all documents
LAGT["wordcount"].sum()

35809325

In [33]:
# number of distinct authors
len(LAGT["author_id"].unique())

475

In [34]:
# list the available columns
LAGT.columns

Index(['author_id', 'doc_id', 'filename', 'author', 'title', 'sentences',
       'lemmatized_sentences', 'source', 'lemmata_source', 'not_before',
       'not_after', 'tlg_epithet', 'genre', 'provenience', 'wordcount',
       'lemmatacount'],
      dtype='object')
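
The same kind of overview can also be broken down by group. The following is a minimal sketch on a toy DataFrame with invented values: only the column names `author_id`, `provenience`, and `wordcount` come from the dataset, and the numbers are not real.

```python
import pandas as pd

# Toy stand-in for LAGT; all values are invented for illustration.
toy = pd.DataFrame({
    "author_id": ["a1", "a1", "a2", "a3"],
    "provenience": ["christian", "christian", "pagan", "pagan"],
    "wordcount": [100, 50, 200, 25],
})

# Per-provenience number of documents, distinct authors, and total words.
overview = toy.groupby("provenience").agg(
    documents=("wordcount", "count"),
    authors=("author_id", "nunique"),
    words=("wordcount", "sum"),
)
print(overview)
```

Replacing `toy` with `LAGT` should give the corresponding breakdown for the real data, assuming the columns are populated as shown above.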

### Basic text analysis

In [5]:
# extract lemmatized sentences for a subset of documents, based on their provenience
val = "christian"
lemmatized_sentences_subset = [sent for work in LAGT[LAGT["provenience"] == val]["lemmatized_sentences"] for sent in work]

In [9]:
# filter for sentences containing a certain lemma
lemma = "ἐχθρός"
filtered_sentences = [sent for sent in lemmatized_sentences_subset if lemma in sent]
len(filtered_sentences)

1646

In [10]:
# calculate frequencies within these sentences
# (1) flatten the sentences into one list of words
lemmata_flat = [l for s in filtered_sentences for l in s]
# (2) count the frequencies
freqs = Counter(lemmata_flat)
# (3) select the most frequent words
freqs_most_common = freqs.most_common(100)
freqs_most_common[:10]

[('ἐχθρός', 1722),
 ('θεός', 331),
 ('λέγω', 232),
 ('γίγνομαι', 226),
 ('οὗτος', 202),
 ('φημί', 145),
 ('εἰμί', 131),
 ('κύριος', 126),
 ('πούς', 123),
 ('ἄνθρωπος', 113)]
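
Note that the target lemma itself tops the list, and that its count (1722) exceeds the number of sentences (1646) because it can occur more than once per sentence. A possible variant, sketched below on toy sentences, counts in how many sentences each other lemma co-occurs with the target. The function name `cooccurring_lemmata` and the toy data are ours.

```python
from collections import Counter

def cooccurring_lemmata(sentences, target, top_n=10):
    """Count in how many sentences each lemma co-occurs with target."""
    counts = Counter()
    for sent in sentences:
        if target in sent:
            # set() counts each lemma once per sentence; the target is dropped
            counts.update(set(sent) - {target})
    return counts.most_common(top_n)

# Toy sentences, invented for illustration.
toy_sentences = [
    ["ἐχθρός", "θεός", "ἐχθρός"],
    ["θεός", "λέγω"],
    ["ἐχθρός", "θεός", "λέγω"],
]
result = cooccurring_lemmata(toy_sentences, "ἐχθρός")
print(result)
```

With the real data, the call would be `cooccurring_lemmata(lemmatized_sentences_subset, lemma)`.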

### Working with the morphological data

The morphological data allow us to move back and forth between the raw sentences and the lemmatized data. Since these files are named with the same IDs that we use in the `doc_id` column of our metadata, the mapping between the two is straightforward.
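
As a minimal sketch of this mapping, the helper below checks which document IDs have a matching `<doc_id>.json` file in a directory. The helper name `split_by_availability` and the ID `tlg0000.tlg000` are hypothetical, and the demo runs against a temporary directory rather than the real data.

```python
import os
import tempfile

def split_by_availability(doc_ids, source_dir):
    """Split doc_ids into those with and without a '<doc_id>.json' file in source_dir."""
    available = {
        os.path.splitext(name)[0]
        for name in os.listdir(source_dir)
        if name.endswith(".json")
    }
    found = [d for d in doc_ids if d in available]
    missing = [d for d in doc_ids if d not in available]
    return found, missing

# Demo with a temporary directory holding a single empty JSON file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "tlg0317.tlg001.json"), "w") as f:
        f.write("[]")
    found, missing = split_by_availability(["tlg0317.tlg001", "tlg0000.tlg000"], tmp)
    print(found, missing)
```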

In [16]:
# point to the directory with the morphological data:
source_dir = "../../LAGT/data/large_files/sents_data_jsons/"
# check how many files it contains
len(os.listdir(source_dir))

1958

In [11]:
# define a subset of documents on the basis of provenience
# get doc IDs for all documents from this subset
ids = LAGT[(LAGT["provenience"]=="christian")]["doc_id"].tolist()

In [22]:
# define a function that uses the IDs to load morphological data either
# (1) for all sentences from all documents in the subset (if target is None), or
# (2) only for sentences containing the target lemma
def load_sentence_data(ids, target=None, source_dir=source_dir):
    sents_data = []
    for doc_id in ids:
        path = os.path.join(source_dir, doc_id + ".json")
        try:
            # each file holds a list of (doc_id, sent_n, sent_text, sent_data) records
            with open(path, "rb") as f:
                file_sents_data = json.load(f)
        except FileNotFoundError:
            print("data for " + doc_id + " not found")
            continue
        for sent_doc_id, sent_n, sent_text, sent_data in file_sents_data:
            # the lemma is the second element of each token annotation
            lemmata = [tup[1] for tup in sent_data]
            if target is None or target in lemmata:
                sents_data.append((sent_doc_id, sent_n, sent_text, sent_data))
    return sents_data

In [28]:
# load all sentence data from a subset of documents:
sents_data_all = load_sentence_data(ids)
# print the number of sentences:
print(len(sents_data_all))
# look at data for a couple of sentences, to get an idea of the overall shape of the data...
sents_data_all[1000:1003]

799515


[('tlg0317.tlg001',
  983,
  'ἐκ πολλοῦ καί τοῦ ἀνδρός κεχωρισμένης αὐτῆς διά θεοσέβειαν',
  [['ἐκ', 'ἐκ', 'r', [0, 2]],
   ['πολλοῦ', 'πολύς', 'a', [3, 9]],
   ['καί', 'καί', 'c', [10, 13]],
   ['τοῦ', 'ὁ', 'l', [14, 17]],
   ['ἀνδρός', 'ἀνήρ', 'n', [18, 24]],
   ['κεχωρισμένης', 'κεχωρινέσμαι', 'v', [25, 37]],
   ['αὐτῆς', 'αὐτός', 'p', [38, 43]],
   ['διά', 'διά', 'r', [44, 47]],
   ['θεοσέβειαν', 'θεοσέβεια', 'n', [48, 58]]]),
 ('tlg0317.tlg001', 984, '.', [['.', '.', 'u', [0, 1]]]),
 ('tlg0317.tlg001',
  985,
  'σύ μόνος ἀγνοεῖς ὅτι μή πρίν ὤν ὁ οἴκου ‖ οὗ εἰλ.',
  [['σύ', 'σύ', 'p', [0, 2]],
   ['μόνος', 'μόνος', 'a', [3, 8]],
   ['ἀγνοεῖς', 'ἀγνοέω', 'v', [9, 16]],
   ['ὅτι', 'ὅτι', 'c', [17, 20]],
   ['μή', 'μή', 'r', [21, 23]],
   ['πρίν', 'πρίν', 'c', [24, 28]],
   ['ὤν', 'οὖν', 'x', [29, 31]],
   ['ὁ', 'ὁ', 'l', [32, 33]],
   ['οἴκου', 'οἶκος', 'n', [34, 39]],
   ['‖', '‖', 'v', [40, 41]],
   ['οὗ', 'ὅς', 'r', [42, 44]],
   ['εἰλ', 'εἰλ', 'v', [45, 48]],
   ['.', '.', 'u', [
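
The simplified POS tags in each token annotation can be tallied or used as a filter. The sketch below reuses the first sentence from the output above. Which tags count as content words (`n`, `v`, `a` here) is our assumption for illustration, not the filtering rule used to build `lemmatized_sentences`.

```python
from collections import Counter

# Token annotations copied from the first sentence in the output above:
# [token, lemma, simplified POS tag, [start, end]]
sent_data = [
    ["ἐκ", "ἐκ", "r", [0, 2]],
    ["πολλοῦ", "πολύς", "a", [3, 9]],
    ["καί", "καί", "c", [10, 13]],
    ["τοῦ", "ὁ", "l", [14, 17]],
    ["ἀνδρός", "ἀνήρ", "n", [18, 24]],
    ["κεχωρισμένης", "κεχωρινέσμαι", "v", [25, 37]],
    ["αὐτῆς", "αὐτός", "p", [38, 43]],
    ["διά", "διά", "r", [44, 47]],
    ["θεοσέβειαν", "θεοσέβεια", "n", [48, 58]],
]

# Tally the POS tags in the sentence.
tag_counts = Counter(pos for _token, _lemma, pos, _span in sent_data)

# Keep lemmata of tokens whose tag is in an assumed set of content-word tags.
CONTENT_TAGS = {"n", "v", "a"}  # assumption for illustration only
content_lemmata = [lemma for _token, lemma, pos, _span in sent_data if pos in CONTENT_TAGS]

print(tag_counts.most_common())
print(content_lemmata)
```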

In [29]:
# load only the sentences containing the target lemma from the same subset of documents:
sents_data_all = load_sentence_data(ids, "ἐχθρός")
# print the number of sentences:
print(len(sents_data_all))
# look at data for a couple of sentences, to get an idea of the overall shape of the data...
sents_data_all[:3]

1648


[('tlg0317.tlg001',
  352,
  'ὀφθαλμός ἐχθρῶν ἔπληξέ με·',
  [['ὀφθαλμός', 'ὀφθαλμός', 'n', [0, 8]],
   ['ἐχθρῶν', 'ἐχθρός', 'a', [9, 15]],
   ['ἔπληξέ', 'πλήσσω', 'v', [16, 22]],
   ['με', 'ἐγώ', 'p', [23, 25]],
   ['·', '·', 'u', [25, 26]]]),
 ('tlg0317.tlg001',
  2231,
  'ὑμέρας ‖ μέρος ‖ ἤ ‖ ἐν παῤ αἰτῶ Δ ἀλλά πυρί αἰωνίω παραδοθήσεσθε ἐν δέ φυλάξησθε καί πῦρ αἰώ- νιον καί τόν ἐχθρόν σατάν καί τά αὐτοῦ ἔνεδρα ἐκφεύξησθε καί ζωήν αἰώνιον ἐν οὐρανοῖς λήψεσθε ‖ ταῦτα εἰπών ‖ πρ.',
  [['ὑμέρας', 'ὑμέρας', 'p', [0, 6]],
   ['‖', '‖', 'r', [7, 8]],
   ['μέρος', 'μέρος', 'n', [9, 14]],
   ['‖', '‖', 'r', [15, 16]],
   ['ἤ', 'ἤ', 'c', [17, 18]],
   ['‖', '‖', 'r', [19, 20]],
   ['ἐν', 'ἐν', 'x', [21, 23]],
   ['παῤ', 'παῤ', 'r', [24, 27]],
   ['αἰτῶ', 'αἰτέω', 'v', [28, 32]],
   ['Δ', 'δ', 'r', [33, 34]],
   ['ἀλλά', 'ἀλλά', 'c', [35, 39]],
   ['πυρί', 'πῦρ', 'n', [40, 44]],
   ['αἰωνίω', 'αἰωνέω', 'a', [45, 51]],
   ['παραδοθήσεσθε', 'παραδοθήσομαι', 'v', [52, 65]],
   ['ἐν', 'ἐν', 'r', [
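
The positional index makes it possible to go from a lemma back to its surface form in the raw sentence. The following is a minimal keyword-in-context sketch using the first sentence from the output above; it assumes the two numbers are character offsets `[start, end)` into the sentence text, which is consistent with the samples shown but should be verified on the full data. The function name `keyword_in_context` is ours.

```python
def keyword_in_context(sent_text, sent_data, target):
    """Return (left, keyword, right) for each token whose lemma equals target."""
    hits = []
    for _token, lemma, _pos, (start, end) in sent_data:
        if lemma == target:
            # slice the raw sentence around the token's character span
            hits.append((sent_text[:start], sent_text[start:end], sent_text[end:]))
    return hits

# Sentence record copied from the output above.
sent_text = "ὀφθαλμός ἐχθρῶν ἔπληξέ με·"
sent_data = [
    ["ὀφθαλμός", "ὀφθαλμός", "n", [0, 8]],
    ["ἐχθρῶν", "ἐχθρός", "a", [9, 15]],
    ["ἔπληξέ", "πλήσσω", "v", [16, 22]],
    ["με", "ἐγώ", "p", [23, 25]],
    ["·", "·", "u", [25, 26]],
]

hits = keyword_in_context(sent_text, sent_data, "ἐχθρός")
for left, keyword, right in hits:
    print(left + "[" + keyword + "]" + right)
```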