### Small Project

Dispersion analysis project: comparing Pausanias and Herodotus

### Assembling a Corpus

The first step in this project is to assemble a small corpus for analysis. Since we are interested in the linguistic differences between Pausanias and Herodotus, we need the URN for Herodotus's Histories: urn:cts:greekLit:tlg0016.tlg001.perseus-grc2

Once we have the URN, we can fetch the file we need. 
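One way to fetch the file is to turn the URN into a download URL. The sketch below assumes the edition is hosted in the PerseusDL `canonical-greekLit` repository on GitHub, on the `master` branch, with files laid out as `data/<textgroup>/<work>/<textgroup>.<work>.<version>.xml`. The names `BASE_URL`, `urn_to_url`, and `fetch_text` are hypothetical helpers for illustration, not part of MyCapytain.

```python
from urllib.request import urlretrieve

# Assumption: repository, branch, and directory layout as described above.
BASE_URL = "https://raw.githubusercontent.com/PerseusDL/canonical-greekLit/master/data"


def urn_to_url(urn, base_url=BASE_URL):
    """Build a download URL from a version-level CTS URN (hypothetical helper)."""
    parts = urn.split(":")
    # A CTS URN looks like urn:cts:<namespace>:<textgroup>.<work>.<version>
    if len(parts) < 4 or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    work_id = parts[3]
    pieces = work_id.split(".")
    if len(pieces) != 3:
        raise ValueError(f"expected textgroup.work.version, got {work_id!r}")
    textgroup, work, _version = pieces
    return f"{base_url}/{textgroup}/{work}/{work_id}.xml"


def fetch_text(urn, dest=None):
    """Download the TEI XML file for a URN and return the local filename."""
    url = urn_to_url(urn)
    # Default to the file's own name, saved in the working directory
    filename = dest or url.rsplit("/", 1)[-1]
    urlretrieve(url, filename)
    return filename
```

Under these assumptions, calling `fetch_text("urn:cts:greekLit:tlg0016.tlg001.perseus-grc2")` would save `tlg0016.tlg001.perseus-grc2.xml` in the current working directory.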

Let's start by exploring each dataset.

As we did in Week 02, we'll use MyCapytain and pandas to load each text and convert it into a format we can work with.

In [1]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText

with open("../tei/tlg0525.tlg001.perseus-grc2.xml") as f:
    pausanias_text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-grc2", resource=f)

with open("./tlg0016.tlg001.perseus-grc2.xml") as f:
    herodotus_text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0016.tlg001.perseus-grc2", resource=f)

In [2]:
import pandas as pd
from lxml import etree
from MyCapytain.common.constants import Mimetypes


def text_to_dataframe(text):
    """Build a DataFrame with one row per passage at the deepest citation level."""
    urns = []
    raw_xmls = []
    unannotated_strings = []

    # Walk every reference at the deepest level of the citation scheme
    for ref in text.getReffs(level=len(text.citation)):
        urn = f"{text.urn}:{ref}"
        node = text.getTextualNode(ref)
        raw_xml = node.export(Mimetypes.XML.TEI)
        tree = node.export(Mimetypes.PYTHON.ETREE)
        # Serialize only the text content, dropping the TEI markup
        s = etree.tostring(tree, encoding="unicode", method="text")

        urns.append(urn)
        raw_xmls.append(raw_xml)
        unannotated_strings.append(s)

    d = {
        "urn": pd.Series(urns, dtype="string"),
        "raw_xml": raw_xmls,
        "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
    }
    return pd.DataFrame(d)


pausanias_df = text_to_dataframe(pausanias_text)
herodotus_df = text_to_dataframe(herodotus_text)



### Counting tokens, types, etc.

With dataframes for each work created, things start to get more interesting. 