### Small Project

Dispersion analysis project
Comparing Pausanias and Herodotus

### Assembling a Corpus

The first step in this project is assembling a small corpus for analysis. Since we are interested in the linguistic differences between Pausanias and Herodotus, we need the URN for Herodotus's Histories: urn:cts:greekLit:tlg0016.tlg001.perseus-grc2

Once we have the URN, we can fetch the file we need. 
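As a rough sketch of the fetch step, the URN can be split into its parts and mapped onto a file URL. The base URL below is an assumption (the raw-file layout of the Perseus canonical-greekLit repository on GitHub), and `cts_urn_to_url` and `fetch_tei` are hypothetical helper names, not part of MyCapytain.

```python
from urllib.request import urlopen

# Assumed location of the Perseus canonical-greekLit files; adjust if the layout differs.
BASE_URL = "https://raw.githubusercontent.com/PerseusDL/canonical-greekLit/master/data"


def cts_urn_to_url(urn: str, base_url: str = BASE_URL) -> str:
    """Map a work-level CTS URN (no passage reference) to the URL of its XML file."""
    prefix, protocol, namespace, work_id = urn.split(":")
    if (prefix, protocol) != ("urn", "cts"):
        raise ValueError(f"not a CTS URN: {urn}")
    # work_id looks like "tlg0016.tlg001.perseus-grc2": textgroup, work, version
    textgroup, work, version = work_id.split(".")
    return f"{base_url}/{textgroup}/{work}/{textgroup}.{work}.{version}.xml"


def fetch_tei(urn: str, destination: str) -> None:
    """Download the XML file for `urn` and save it to `destination` (needs network access)."""
    with urlopen(cts_urn_to_url(urn)) as response:
        data = response.read()
    with open(destination, "wb") as f:
        f.write(data)


herodotus_urn = "urn:cts:greekLit:tlg0016.tlg001.perseus-grc2"
print(cts_urn_to_url(herodotus_urn))
```

Calling `fetch_tei(herodotus_urn, "./tlg0016.tlg001.perseus-grc2.xml")` would then save the file under the name the next cell opens.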

Let's start with some exploration into each dataset. 

As we did in Week 02, we'll use MyCapytain and Pandas to open each dataset and turn it into a format we can work with.

In [2]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText

with open("../tei/tlg0525.tlg001.perseus-grc2.xml") as f:
    pausanias_text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-grc2", resource=f)

with open("./tlg0016.tlg001.perseus-grc2.xml") as f:
    herodotus_text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0016.tlg001.perseus-grc2", resource=f)

In [3]:
from lxml import etree
from MyCapytain.common.constants import Mimetypes


#pausanias
urns = []
raw_xmls = []
unannotated_strings = []

for ref in pausanias_text.getReffs(level=len(pausanias_text.citation)):
    urn = f"{pausanias_text.urn}:{ref}"
    node = pausanias_text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)

import pandas as pd

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
pausanias_df = pd.DataFrame(d)

#herodotus
urns = []
raw_xmls = []
unannotated_strings = []

for ref in herodotus_text.getReffs(level=len(herodotus_text.citation)):
    urn = f"{herodotus_text.urn}:{ref}"
    node = herodotus_text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
herodotus_df = pd.DataFrame(d)
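To see what the `method="text"` serialization in the loops above does, here is a minimal sketch on a toy TEI-like fragment. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and the fragment is made up for illustration rather than taken from either file.

```python
import xml.etree.ElementTree as ET

# Toy fragment for illustration only; the real passages come from the TEI files.
tei_fragment = "<p>Ἡροδότου <name>Ἁλικαρνησσέος</name> ἱστορίης ἀπόδεξις ἥδε</p>"

root = ET.fromstring(tei_fragment)

# method="text" drops the tags and keeps only the text content, in document order.
text = ET.tostring(root, encoding="unicode", method="text")
print(text)
```

Any text nested inside child elements (editorial notes included) ends up in the string, which is worth keeping in mind when the token counts look high.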



### Counting tokens, types, etc

With dataframes for each work created, let's get an idea of how many tokens are in each corpus.

In [4]:
pausanias_df['whitespaced_tokens'] = pausanias_df['unannotated_strings'].str.split()
pausanias_df['whitespaced_tokens'].explode().count()

herodotus_df['whitespaced_tokens'] = herodotus_df['unannotated_strings'].str.split()
herodotus_df['whitespaced_tokens'].explode().count()


184838

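Okay, so we have an idea of the number of tokens in each corpus. Because a notebook cell displays only the value of its last expression, the 184838 above is the whitespace-token count for Herodotus; the count for Pausanias is computed but not displayed.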
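Since this section is about types as well as tokens, here is a small sketch of counting both from whitespace-split text. The `passages` list is toy data standing in for the `unannotated_strings` column, not real corpus text.

```python
from collections import Counter

# Toy passages for illustration only (not taken from either corpus).
passages = ["ὁ ἀνὴρ καὶ ἡ γυνή", "ὁ παῖς καὶ ὁ ἀνήρ"]

# Tokens: every whitespace-separated string, repeats included.
tokens = [tok for passage in passages for tok in passage.split()]

# Types: the distinct strings among those tokens.
type_counts = Counter(tokens)

n_tokens = len(tokens)
n_types = len(type_counts)
type_token_ratio = n_types / n_tokens

print(n_tokens, n_types, round(type_token_ratio, 2))
print(type_counts.most_common(3))
```

Note that with whitespace tokenization, ἀνὴρ and ἀνήρ count as two different types because their accents differ, which is one reason to count lemmas instead.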
In Week 3, we counted the occurrences of the definite article in Pausanias using an NLP pipeline, then determined its relative frequency. Let's repeat these steps for both Pausanias and Herodotus, just to explore the data a little further.

In [5]:
import spacy

nlp = spacy.load("grc_proiel_sm", disable=["ner"])

#pausanias
raw_texts_p = [t for t in pausanias_df['unannotated_strings']]
annotated_texts_p = nlp.pipe(raw_texts_p, batch_size=100)


#herodotus
raw_texts_h= [t for t in herodotus_df['unannotated_strings']]
annotated_texts_h = nlp.pipe(raw_texts_h, batch_size=100)





In [6]:
#pausanias
pausanias_df['nlp_docs'] = list(annotated_texts_p)
definite_article_p = [t for t in pausanias_df['nlp_docs'].explode() if t.lemma_ == "ὁ"]
n_def_article_p = len(definite_article_p)
n_tokens_p = len([t for t in pausanias_df['nlp_docs'].explode()])

#herodotus
herodotus_df['nlp_docs'] = list(annotated_texts_h)
definite_article_h = [t for t in herodotus_df['nlp_docs'].explode() if t.lemma_ == "ὁ"]
n_def_article_h = len(definite_article_h)
n_tokens_h = len([t for t in herodotus_df['nlp_docs'].explode()])

basis = 10_000

rf_definite_article_in_pausanias = (n_def_article_p / n_tokens_p) * basis
rf_definite_article_in_herodotus = (n_def_article_h / n_tokens_h) * basis

print(rf_definite_article_in_pausanias)
print(rf_definite_article_in_herodotus)

1250.8047232159593
1170.7162996372285
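The two numbers above are occurrences per 10,000 tokens. As a small worked sketch of that normalization, with made-up counts and a hypothetical helper name `relative_frequency`:

```python
def relative_frequency(count: int, total: int, basis: int = 10_000) -> float:
    """Return how many times an item occurs per `basis` tokens."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total * basis


# Made-up numbers: 125 articles in a 1,000-token text.
print(relative_frequency(125, 1_000))
```

Normalizing to a shared basis is what lets us compare two corpora of different sizes directly.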


Hm... why the difference? Surely we can't expect the relative frequency of the definite article to be exactly the same in two different works. But Pausanias does use the definite article more often, so I'm curious whether the difference is statistically significant.
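One common way to approach that question is a Pearson chi-square test on a 2x2 table of article versus non-article tokens in each corpus. The sketch below is only one possible choice of test, written with the standard library alone. The counts in it are hypothetical placeholders that mirror the per-10,000 rates; the real `n_def_article_*` and `n_tokens_*` values should be used instead.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Returns the test statistic and its p-value for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts, NOT the real corpus totals: articles vs. all other tokens.
articles_p, other_p = 1_250, 8_750   # stand-in for Pausanias
articles_h, other_h = 1_170, 8_830   # stand-in for Herodotus

statistic, p_value = chi_square_2x2(articles_p, other_p, articles_h, other_h)
print(statistic, p_value)
```

With corpora this large, even a small difference in rates can come out as significant, so the p-value alone says little about how evenly the article is spread through each work.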