-
Notifications
You must be signed in to change notification settings - Fork 0
Recipes
End-to-end scholarly workflows — each one goes from a question to a citable
result, in Python and (where natural) from the command line. Every
Python snippet on this page has been run against the shipped corpora; outputs
shown are real. The thread running through all of them: finish with cite(),
so the exact data you used is in your paper's references.
Do the stated totals (KU-RO) on the Haghia Triada tablets match the sums of their entries? Reconcile every account, keep the failures, and cite the exact subset:
import json
import aegean
from aegean.analysis import balance_check
ht = aegean.load("lineara").filter(site="Haghia Triada")
discrepancies = []
for doc in ht:
for chk in balance_check(doc):
if not chk.balances:
discrepancies.append({"doc": doc.id, "marker": chk.marker,
"stated": chk.stated_total, "computed": chk.computed_sum})
with open("discrepancies.json", "w", encoding="utf-8") as f:
json.dump(discrepancies, f, ensure_ascii=False, indent=2)
print(len(discrepancies), "discrepancies") # 29
print(ht.cite()) # GORILA … [subset: filter(site='Haghia Triada') …]aegean balance lineara --json | jq '[.[] | select(.balances | not)]' > discrepancies.json
aegean plot balance lineara -o balance.png # the same picture: stated vs computedThe reconciliation is heuristic (section boundaries are inferred) — a discrepancy is a lead to inspect, not a verdict. See Linear A → Accounting.
Where does KU-RO occur? Count attestations by find-site, then place them — every aligned site carries a Pleiades id:
from collections import Counter
import aegean
from aegean import geo
corpus = aegean.load("lineara")
sites = Counter(d.meta.site for d in corpus
if any(t.text == "KU-RO" for t in d.words))
print(sites.most_common(3)) # [('Haghia Triada', 32), ('Phaistos', 1), ('Zakros', 1)]
coord = geo.site_coordinates()["Haghia Triada"]
print(coord.lat, coord.lon, coord.pleiades_uri)
# 35.06 24.79 https://pleiades.stoa.org/places/589672With the [geo] extra (pip install "pyaegean[geo]"), geo.word_distribution(corpus, "KU-RO") returns the same thing as a GeoDataFrame (EPSG:4326) ready for any
mapping stack, and aegean geo lineara --output sites.geojson writes GeoJSON
from the shell. See Geography.
Lemmatize a passage of the Iliad and cite the edition. load_work fetches the
complete work (commit-pinned, cached after the first fetch), pipeline runs the
whole stack per token:
import aegean
iliad = aegean.greek.load_work("tlg0012.tlg001") # the Iliad, Perseus edition
doc = iliad.documents[0] # Book 1
text = " ".join(t.text for t in doc.tokens[:40])
for r in aegean.greek.pipeline(text)[:3]:
print(r.text, "→", r.lemma)
# μῆνιν → μῆνις · ἄειδε → ἀείδω · θεὰ → θεά
print(iliad.cite())
# Homer. Ἰλιάς. Digitized by the Perseus Digital Library / Open Greek and Latin. — …aegean greek work tlg0012.tlg001 -o iliad.json # the work as a corpus file
aegean greek pipeline "μῆνιν ἄειδε θεά" --json # per-token recordsThe offline pipeline is the honest baseline; activate --treebank or
--neural (the [neural] extra) for attested-gold or state-of-the-art lemmas —
see Greek NLP for the measured accuracy of each tier.
What makes the Pylos tablets different from the rest of the Mycenaean corpus? Load the full Linear B corpus (DAMOS, ~5,900 tablets, fetched ~2 MB), split Pylos against everything else, and ask:
import aegean
from aegean.analysis import keyness
damos = aegean.load("damos")
pylos = damos.filter(site="Pylos")
rest = [d for d in damos.documents if d.meta.site != "Pylos"]
for r in keyness(pylos, rest)[:3]:
print(f"{r.item:10} G²={r.log_likelihood:.0f} log-ratio={r.log_ratio:+.1f}")
# pe-mo G²=254 log-ratio=+8.5 ('seed' — the land-tenure series)
# o-na-to G²=226 log-ratio=+8.4 ('lease plot')
# to-so-de G²=210 log-ratio=+8.3
print(pylos.cite()) # Aurora (2015), DAMOS … [subset: filter(site='Pylos') …]aegean keyness damos --site Pylos --top 10
aegean plot keyness damos --site Pylos -o pylos.pngThe textbook Ventris & Chadwick land-tenure result surfaces immediately. Read
G² (significance) together with the log-ratio (effect size) and
aegean dispersion — see Analysis → Corpus statistics.
The same split works by scribal hand (v2 of the corpus carries the
DAMOS-curated hand on meta.scribe — 3,945 documents have one): what does the
most prolific Knossos scribe write about?
h117 = damos.filter(scribe="117") # Hand 117: 684 tablets
rest117 = [d for d in damos.documents if d.meta.scribe != "117"]
keyness(h117, rest117)[:5] # the hand's characteristic vocabularyaegean keyness damos --scribe 117 --top 10Which Greek word does Linear B qa-si-re-u sound like? Cross-script
comparison romanizes both sides to one phoneme alphabet and ranks by weighted
distance:
from aegean.analysis import nearest, phonetic_compare
cmp = phonetic_compare("qa-si-re-u", "linearb", "βασιλεύς", "greek")
print(round(cmp.similarity, 2)) # 0.69
print([(c.a, c.b, c.op) for c in cmp.alignment][0]) # ('q', 'b', 'sub-far') — the qʷ→b reflex
candidates = ["ποιμήν", "βασιλεύς", "πατήρ", "θεός", "δοῦλος"]
print(nearest("qa-si-re-u", "linearb", candidates, "greek", top=2, fold_aspiration=True))
# [('βασιλεύς', 0.31), ('πατήρ', 0.61)] — the true cognate first, by a clear marginaegean analyze compare qa-si-re-u βασιλεύς
aegean analyze nearest po-me greek --top 5The ranking is the signal — defective syllabic spelling inflates absolute distances (see the caution in Analysis → Cross-script comparison). The candidate list matters: rank against a lexicon or wordlist relevant to your question, not just whatever corpus is at hand.
Which Linear A words share a stem with a productive suffix? Morphological clustering is exploratory (no known grammar) but a strong lead-generator — and the slowest analysis here, so switch on the opt-in cache if you'll rerun it:
import aegean
from aegean.analysis import find_morphological_clusters
aegean.cache.enable() # optional; or PYAEGEAN_ANALYSIS_CACHE=1
clusters = find_morphological_clusters(aegean.load("lineara").word_frequencies())
print(len(clusters), "clusters") # 81
print(clusters[0].stem, "→", [m.word for m in clusters[0].members[:4]])aegean analyze clusters lineara --top 10
aegean cache # see what's cached; --clear to wipe(Key-gated: needs a provider extra, e.g. pip install "pyaegean[anthropic]",
and its API key.) Ground the model in corpus facts, then audit what it was
given with the provenance trace:
import aegean
from aegean import ai
corpus = aegean.load("lineara")
grounding = ai.corpus_context(corpus, limit=10) + ai.cooccurrence_evidence(corpus, "KU-RO")
r = ai.ask("What can be said about the function of KU-RO?", grounding=grounding)
print(r.labeled()) # [EXPLORATORY · ask · …] — the unmissable tag
print(r.trace()) # every corpus/analysis fact the answer rested on, by sourceaegean ai ask "What is KU-RO?" --corpus lineara --trace
aegean ai eval # grounding-fidelity scores for your providerEvery generative result is a labeled hypothesis, never a reading — see AI Layer and, if you can judge the answer, the validation issue form.
Start here
Aegean scripts
Greek
Capabilities
Reference