-
Notifications
You must be signed in to change notification settings - Fork 0
Glossary
A plain-language dictionary of the terms you'll meet across this documentation — the scripts, the linguistics, the file formats, and the source projects. Each entry is one or two sentences, with a runnable example where pyaegean actually does the thing, and a link to the page that covers it in depth. If you've ever hit a word like lemmatize, scansion, ideogram, or EpiDoc and weren't sure what it meant here, this is the page to keep open.
New to the toolkit entirely? Start with Getting Started. For the full command reference see the CLI page; for every Greek function see Greek NLP.
The examples below were run against pyaegean 0.8.1 (the aegean command-line tool
and the aegean Python package). Where a feature has both a Python call and a
shell command, both are shown.
pyaegean knows five writing systems. Four are Aegean scripts; the fifth is the Greek alphabet itself. You can list them at any time:
import aegean
aegean.registered_scripts()
# ['cypriot', 'cyprominoan', 'greek', 'lineara', 'linearb']| Script id | Human name | Status | Where it's covered |
|---|---|---|---|
lineara |
Linear A | Undeciphered | Linear A |
linearb |
Linear B | Deciphered (Mycenaean Greek) | Linear B |
cypriot |
Cypriot syllabary | Deciphered (Arcadocypriot Greek) | Cypriot |
cyprominoan |
Cypro-Minoan | Undeciphered | Cypro-Minoan |
greek |
Greek alphabet | — | Greek NLP |
The Bronze Age script of Minoan Crete (roughly 1800–1450 BCE), used for the still-undeciphered Minoan language. pyaegean ships the full GORILA-derived Linear A corpus offline (1,721 documents, a 344-sign inventory). See Linear A.
aegean info lineara
# documents: 1721 · words: 1381 · signs_in_inventory: 344
# source: GORILA (Godart & Olivier 1976–1985)The script Linear A evolved into, used at Knossos, Pylos, and elsewhere to write an early form of Greek (Mycenaean). Famously deciphered by Michael Ventris in 1952 — which is why pyaegean can read Linear B words as Greek (see the bridge entry below). See Linear B.
An Iron Age syllabary used on Cyprus to write a Greek dialect (Arcadocypriot), and also the unrelated Eteocypriot language. Deciphered in the 1870s. See Cypriot.
A cluster of undeciphered scripts used on Bronze Age Cyprus, descended from (or cousins of) Linear A. pyaegean carries an illustrative sample plus the 99-sign Unicode inventory, not a full transcribed corpus. See Cypro-Minoan and Limitations.
A writing system where each sign stands for a whole syllable (like da, ku,
pa) rather than a single consonant or vowel the way an alphabet does. Linear A,
Linear B, the Cypriot syllabary, and Cypro-Minoan are all syllabaries.
A single syllabary sign — one glyph carrying one syllable's worth of sound. You can look one up in any script's inventory:
aegean sign linearb DA --json
# { "label": "DA", "glyph": "𐀅", "phonetic": "da",
# "attrs": { "signClass": "syllabogram", ... } }aegean sign cypriot PA --json
# { "label": "PA", "glyph": "𐠞", "phonetic": "pa",
# "attrs": { "signClass": "syllabogram" } }A sign that stands for a thing or commodity (a word or idea) rather than a sound — think of the picture-sign for "wine," "oil," or "sheep" on an accounting tablet. The two terms are used interchangeably here. In a parsed Linear A document these are kept separate from the spelled-out words:
import aegean
doc = next(d for d in aegean.load("lineara").iter_documents() if d.logograms)
doc.id # 'HT2'
[t.text for t in doc.logograms][:5] # ['OLE+U', 'OLE+A', 'OLE+E', 'OLE+U', 'OLE+A']
# OLE = the olive-oil ideogramA sign's value is its identity in the inventory — its label (DA, AB01), its
glyph, and its Unicode codepoint. Its sound value (the phonetic field above)
is the syllable it's read as where that's known. For undeciphered scripts the sign
value is solid but the sound value is provisional or absent — that distinction
matters a lot; see Limitations.
Aegean tablets carry numbers (quantities of a commodity) alongside words and ideograms. pyaegean keeps these as their own token kind so a count never gets mistaken for a word:
doc = next(aegean.load("lineara").iter_documents()) # HT1
[t.text for t in doc.numerals][:4] # ['197', '70', '52', '109']For the deciphered syllabaries (Linear B and Cypriot), the bridge takes a transliterated word and gives you its attested Greek reading and meaning — the payoff of decipherment, in one call.
aegean bridge linearb po-me
# po-me → ποιμήν (shepherd)
aegean bridge cypriot pa-si-le-u-se
# pa-si-le-u-se → βασιλεύς (king)If a word has no attested reading in the bundled lexicon, the bridge says so plainly rather than guessing. Covered on Linear B and Cypriot.
A way to type polytonic Greek using only plain ASCII keys — lo/gos for λόγος,
mh=nin for μῆνιν — so you never need a Greek keyboard. pyaegean converts both
directions.
from aegean import greek
greek.betacode_to_unicode("mh=nin") # 'μῆνιν'
greek.unicode_to_betacode("μῆνιν") # 'mh=nin'aegean greek betacode "mh=nin" # μῆνιν
aegean greek betacode "μῆνιν" --reverse # mh=ninPutting Greek text into one canonical Unicode form (so two visually identical words
compare equal), optionally repairing OCR/Beta-Code artifacts with --lenient. See
Greek NLP.
Diacritics are the accents, breathings, and subscripts on Greek letters. "Stripping" removes them — useful for loose matching:
greek.strip_diacritics("μῆνιν") # 'μηνιν'The smallest unit a text is cut into — usually a word or a punctuation mark. In a syllabic corpus a token is one word, ideogram, or numeral; in Greek it's a word or mark. (See tokenize on the Greek NLP page.)
Splitting a Greek word into its spoken syllables, using rules plus a curated exception lexicon for compounds.
greek.syllabify("ἄνθρωπος") # ['ἄν', 'θρω', 'πος']Where a Greek word's accent falls and what that pattern is called (oxytone, paroxytone, proparoxytone, and so on).
greek.accentuation("λόγος").classification # 'paroxytone'The International Phonetic Alphabet — a reconstructed pronunciation written in standard phonetic symbols.
greek.to_ipa("λόγος") # 'loɡos'The dictionary headword for an inflected form — the entry you'd look up. λόγοι
("words") and λόγος ("word") share the lemma λόγος.
The act of reducing each word to its lemma. With no extra backends installed pyaegean uses its rule-based lemmatizer; the opt-in neural pipeline is more accurate (see Greek NLP).
greek.lemmatize("λόγοι") # 'λόγος'aegean greek lemmatize "λόγοι"A word's grammatical category — noun, verb, adjective — tagged with the Universal Dependencies coarse tag set (UPOS = Universal POS).
greek.pos_tag("θεός") # 'NOUN'
greek.pos_tags("θεὸς ἀγαθός")
# [('θεὸς', 'NOUN'), ('ἀγαθός', 'NOUN')]The full grammatical breakdown of a word form — case, number, gender, tense, voice, mood, person, degree — often with more than one candidate parse:
greek.analyze("λόγοι")[0]
# Analysis(lemma='λόγος', pos='NOUN', case='nom', number='pl', gender='masc', ...)aegean greek morph "λόγοι"A treebank is a corpus where every sentence is annotated with its grammatical tree (which word depends on which). AGDT is the Ancient Greek Dependency Treebank (from the Perseus project) — pyaegean's main source of gold parses, glosses, and the lexicon that the opt-in Greek backends are trained from. See Greek NLP and Data & Provenance.
Working out the grammatical structure of a sentence — which word is the subject, which the object, and so on — as a tree of relations. pyaegean can emit UD relations (with the neural parser) or AGDT-style relations. See Greek NLP.
The one-call convenience that runs normalize → tokenize → tag → lemmatize → morph in sequence and hands you one record per token. See Greek NLP.
A short dictionary meaning for a word. pyaegean has two sources: a Koine (New Testament) gloss from the bundled Dodson lexicon that works offline, and a fuller LSJ gloss that fetches a ~270 MB index on first use.
aegean greek gloss-nt λόγος
# a word, speech, divine utterance, analogyWorking out the rhythm of a line of verse — marking each syllable heavy (long,
—) or light (short, ⏑) and matching the pattern to a known metre.
greek.scan_hexameter("ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ").pattern
# '—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—×'aegean greek scan "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ"
# —⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—×
# hexameter: dactyl, dactyl, dactyl, dactyl, dactyl, final; caesura: trochaicThe "weight" of a single syllable — heavy (long), light (short), or common (can be either). This is the raw material scansion is built from.
aegean greek quantities "μῆνιν"
# μῆνιν → μῆ:heavy | νιν:heavyThe metres pyaegean can scan (the --meter option of aegean greek scan):
| Meter name | What it is |
|---|---|
hexameter |
Dactylic hexameter (epic verse — Homer, Hesiod) |
pentameter |
Elegiac pentameter (paired with hexameter in elegiac couplets) |
trimeter |
Iambic trimeter (the spoken line of drama) |
glyconic |
Aeolic lyric line (fixed template) |
pherecratean |
Aeolic lyric line |
sapphic_hendecasyllable |
Aeolic lyric line |
adonean |
Aeolic lyric line |
alcaic_hendecasyllable |
Aeolic lyric line |
alcaic_enneasyllable |
Aeolic lyric line |
alcaic_decasyllable |
Aeolic lyric line |
Scansion is conservative: a line that only scans via synizesis (two vowels read as one) on a word outside the curated lexicon is reported as not scanning, rather than the tool guessing. See Greek NLP and Limitations.
A whole collection of documents in one script — what you get back from
aegean.load(...). It knows its size, its provenance, and how to filter and export.
corpus = aegean.load("lineara")
len(corpus) # 1721One inscription, tablet, or text within a corpus — with an id, its lines, and its tokens (words, ideograms, numerals).
doc = next(aegean.load("lineara").iter_documents())
doc.id # 'HT1'
doc.words[0].text # 'QE-RA₂-U'The bundled, offline discovery index of every Greek work pyaegean can fetch — 1,778
works that have a Greek edition in the Perseus canonical-greekLit and First1KGreek
collections, each with its id, author, English and Greek title, and source. It is
pure metadata, so searching it needs no network; anything it lists, load_work can
load. (Distinct from popular_works(), the curated 25-work short list.)
from aegean import greek
len(greek.catalog()) # 1778
greek.catalog(title="Ἰλιάς") # search by Greek title
# [{'id': 'tlg0012.tlg001', 'author': 'Homer', 'title': 'Iliad',
# 'greek_title': 'Ἰλιάς', 'source': 'perseus'}]aegean greek catalog --author plato # 39 matches| Term | What it is | More |
|---|---|---|
| GORILA | Recueil des inscriptions en linéaire A (Godart & Olivier 1976–1985) — the standard Linear A edition; pyaegean's bundled Linear A corpus is derived from it. | Linear A, Data & Provenance |
| DAMOS | Database of Mycenaean at Oslo — a Linear B corpus (~5,900 tablets) with scribal hands, find context, and object class; fetched on demand. | Linear B, Data & Provenance |
| SigLA | The Signs of Linear A database (Salgarella & Castellan) — a Linear A dataset (781 documents) with its own word division and commodity ideograms; fetched on demand. | Linear A, Data & Provenance |
| AGDT | Ancient Greek Dependency Treebank (Perseus) — gold grammatical annotation behind the Greek backends. | Greek NLP |
| EpiDoc | A community standard (a flavour of TEI XML) for encoding ancient inscriptions; pyaegean can export a corpus to EpiDoc TEI. | CLI, Data & Provenance |
| Pleiades | The open gazetteer of ancient places; pyaegean tags located sites with their Pleiades id where one exists. | Geography |
aegean geo lineara
# site lat lon pleiades
# Apodoulou 35.16 24.73 119143959
# Gournia 35.11 25.79 771100776
# Haghia Triada 35.06 24.79 589672
# ... (52 located sites)You can export to EpiDoc TEI (and other formats) from the CLI:
aegean export lineara --format epidoc --output lineara_tei.xmlA label this project uses for anything generative or speculative — most of all the AI Layer. An exploratory result is evidence for a human expert to weigh, never a reading or a translation presented as fact. Every AI command is marked exploratory for exactly this reason:
aegean ai --help
# Generative (exploratory, key-gated): translate, gloss, hypotheses, ask, ...The factual material a generative answer is tied to — the local lexicon, the
transliteration, the corpus context — so the model is reasoning from the evidence
rather than free-associating. The aegean ai ask command answers strictly from
the provided grounding; the AI layer translates by grounding first, then calling a
model. See AI Layer.
- Undeciphered means undeciphered. Linear A and Cypro-Minoan have solid sign values but no agreed language; nothing here "translates" them, and the AI layer's hypotheses are explicitly speculative.
- Sound values are provisional for undeciphered scripts, and even for deciphered ones the bridge only returns attested readings — it won't invent one.
- The Cypriot and Cypro-Minoan corpora are illustrative samples, not complete transcribed corpora; the sign inventories, however, are full.
- The richer Greek backends and gold datasets are opt-in downloads, not bundled — the core install works offline.
For the full, honest list of what pyaegean does and does not claim, read Limitations.
Start here
Aegean scripts
Greek
Capabilities
Reference