-
Notifications
You must be signed in to change notification settings - Fork 0
Greek NLP
aegean.greek is the Ancient Greek NLP pipeline. It's a set of small, independent
steps: each one is a plain function you can call on its own, and you can chain
them into your own pipeline. The core pipeline runs fully offline with no API key;
the opt-in treebank, LSJ, parser, and neural-pipeline backends fetch their data over
the network on first use, then cache it.
The core ships normalization, tokenization, syllabification, accent and prosody analysis, reconstructed IPA, metrical scansion (dactylic hexameter, elegiac pentameter, and iambic trimeter), POS tagging, a baseline lemmatizer, and a rule-based morphological analyzer. On top, opt-in backends add attested accented lemmas + gold POS/morphology (treebank), dictionary glosses (LSJ), dependency trees (parser), and the neural pipeline — joint tagging, morphology, UD parsing, and lemmatization at state-of-the-art accuracy. Not yet covered: lyric metres.
Every example below is real, runnable output. Import the module once:
from aegean import greekWhere this fits. The zero-dependency core optimizes for portability, an instant import, transparent leakage-free evaluation, metrical scansion, and a scriptable data layer. For maximum accuracy, the opt-in neural pipeline (
use_neural_pipeline, the[neural]extra) is state of the art on the UD Ancient Greek benchmarks — measured end-to-end through this package (the full protocol and comparison tables live indocs/benchmarks.md).
Every stage below is independently callable, but you don't have to compose them:
pipeline runs tokenize → sentence split → POS-tag → lemmatize (→ parse) over a
text and returns one record per token (punctuation included — nothing is dropped):
records = greek.pipeline("ἐν ἀρχῇ ἦν ὁ λόγος.")
[(r.text, r.upos, r.lemma) for r in records]
# [('ἐν','ADP','ἐν'), ('ἀρχῇ','NOUN','ἀρχή'), ('ἦν','VERB','εἰμί'),
# ('ὁ','DET','ὁ'), ('λόγος','NOUN','λόγος'), ('.','PUNCT','.')]Each TokenRecord carries sentence, index, text, upos, lemma,
lemma_known, and — when parsed — head/relation (plus xpos/feats from the
neural pipeline). pipeline uses whatever backends are active: with none, the
zero-dependency baseline; after use_treebank()/use_tagger() etc., their better
answers; after use_neural_pipeline(), one model pass fills every field of every
record. parse=True without the neural pipeline requires use_parser().
One jointly-trained model — a GreBerta encoder with tagging heads, a biaffine dependency parser decoded by a single-root MST (non-projectivity handled natively), and an edit-script lemmatizer — serving UPOS, full morphology (UD FEATS), UD dependency trees, and lemmas from a single forward pass. Trained leakage-clean on the AGDT + Gorman + Pedalion treebanks (1.41M tokens, with the evaluation folds' sentences excluded from training).
pip install "pyaegean[neural]" # onnxruntime + tokenizers + numpy; no torchgreek.use_neural_pipeline() # fetches the model bundle (~518 MB, one-time) to the cache
ana = greek.analyze_sentence(["ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"])
list(zip(ana.tokens, ana.upos, ana.deprel, ana.lemma))
# [('ἐν','ADP','case','ἐν'), ('ἀρχῇ','NOUN','root','ἀρχή'), ('ἦν','VERB','cop','εἰμί'),
# ('ὁ','DET','det','ὁ'), ('λόγος','NOUN','nsubj','λόγος')]
ana.feats[1] # 'Case=Dat|Gender=Fem|Number=Sing'Once active, the standard functions use it: pos_tags/pos_tag, lemmatize, and
parse — which then returns UD relations (nsubj, obj, advcl, …) with the
predicted 9-character morphological tag on each token. disable_neural_pipeline()
restores the cascades above.
Measured — UD Ancient Greek test folds, official CoNLL 2018 evaluator, through the shipped package, end-to-end from raw text (tokens F1 99.97):
| UD Perseus test | UPOS | UFeats | Lemma | UAS | LAS |
|---|---|---|---|---|---|
| neural pipeline | 96.9 | 96.1 | 94.4 | 89.2 | 84.4 |
Out-of-domain (UD PROIEL test, a source no pyaegean model trains on): lemma 90.6,
UAS 82.5, UPOS 87.2. Inference is torch-free (int8 quantization failed its accuracy
gate, so the artifact ships fp32) at roughly 450 words/second on a plain CPU. The
model bundle is CC BY-SA 4.0, fetched to the cache, never bundled; training data,
leakage controls, and the comparison tables are documented in
docs/benchmarks.md.
Beta Code is the ASCII transliteration of polytonic Greek used by the TLG and Perseus. Conversion is round-trip-safe and emits precomposed NFC.
greek.betacode_to_unicode("mh=nin") # 'μῆνιν'
greek.betacode_to_unicode("lo/gos") # 'λόγος' (context-sensitive final ς)
greek.betacode_to_unicode("tw=|") # 'τῷ' (iota subscript)
greek.unicode_to_betacode("Ἀχιλῆος") # '*a)xilh=os'
greek.normalize("ό") # 'ό' (NFC by default)
greek.strip_diacritics("ἄνθρωπος") # 'ανθρωπος'Supported Beta Code: the 24 letters (* marks capitals, s1/s2/s3 sigma
variants) and the diacritics — smooth ) / rough ( breathings, acute /,
grave \, circumflex =, diaeresis +, iota subscript |.
Lenient mode for OCR'd or messy text. normalize(..., lenient=True) repairs —
and warns about (a NormalizationWarning per repair class) — the common artifacts
of scanned editions and half-converted files, instead of letting them silently break
tokenization downstream:
greek.normalize("λόγoς", lenient=True) # 'λόγος' (Latin o inside a Greek word)
greek.normalize("μη=νιν", lenient=True) # 'μῆνιν' (Beta-Code remnant diacritic)Three repair classes: Latin letters embedded in Greek-containing words (only letters
where the visual lookalike and the Beta-Code letter agree — ambiguous ones like p
are reported but left alone), Beta-Code diacritics still attached to Greek letters
(converted only where the mark is phonologically possible: breathings on vowels/ρ,
diaeresis on ι/υ, …), and stray combining marks with no base letter (dropped).
Pure-Latin words pass through untouched, and the default strict mode is unchanged.
greek.tokenize_words("ἐν ἀρχῇ ἦν ὁ λόγος, καὶ θεός.")
# ['ἐν', 'ἀρχῇ', 'ἦν', 'ὁ', 'λόγος', 'καὶ', 'θεός']
greek.tokenize("λόγος, καί") # [Token('λόγος', WORD), Token(',', PUNCT), Token('καί', WORD)]
greek.sentences("ἐν ἀρχῇ ἦν ὁ λόγος. καὶ θεός ἦν;")
# ['ἐν ἀρχῇ ἦν ὁ λόγος', 'καὶ θεός ἦν']Elision apostrophes are kept inside a single token (ποικιλόθρον’).
Rule-based: diphthong nuclei, "muta cum liquida" clusters that stay together, doubled-consonant splits, and valid Greek onsets — plus a curated exception lexicon for lexicalised compounds, which divide at the point of union (Smyth §140) where pure phonotactics would missplit.
greek.syllabify("λόγος") # ['λό', 'γος']
greek.syllabify("ἄνθρωπος") # ['ἄν', 'θρω', 'πος']
greek.syllabify("θάλασσα") # ['θά', 'λασ', 'σα']
greek.syllabify("ποικιλόθρον") # ['ποι', 'κι', 'λό', 'θρον']
greek.syllabify("εἰσφέρω") # ['εἰσ', 'φέ', 'ρω'] (compound: εἰσ + φέρω,
# where the rules alone would give εἰ-σφέ-ρω)The lexicon lists dictionary forms (inflected variants fall back to the rules);
adding an entry is a welcome one-line contribution — see CONTRIBUTING.md, which
also explains the test that makes every entry prove it differs from the rules.
info = greek.accentuation("λόγος")
info.accent_type # 'acute'
info.position_from_end # 2 (1=ultima, 2=penult, 3=antepenult)
info.classification # 'paroxytone'
info.syllables # ('λό', 'γος')Classifications: oxytone / paroxytone / proparoxytone (acute) ·
perispomenon / properispomenon (circumflex) · barytone (grave).
Classifies each syllable as heavy / light / common — the metrical foundation of meter. A syllable is heavy if it's closed (long by position) or has a long nucleus (η, ω, a circumflex, an iota-subscript vowel, or a diphthong); light if open with a short nucleus (ε, ο); common if open with a dichronon (α, ι, υ), whose length isn't determinable from spelling.
greek.syllable_quantities("λόγος") # ['light', 'heavy']
greek.syllable_quantities("ἄνθρωπος") # ['heavy', 'heavy', 'heavy']
greek.syllable_quantities("μῆνιν") # ['heavy', 'heavy']
greek.scan("θάλασσα") # [('θά','common'), ('λασ','heavy'), ('σα','common')]Baseline scope: these quantities are computed within a single word. To resolve a syllable's quantity in metrical context — across word boundaries, with the caesura and the ambiguities a verse line allows — use the metrical scansion below, which builds on this word-level view.
Scan a line of verse into its feet. It covers dactylic hexameter (the metre of Homer), elegiac pentameter (the second line of an elegiac couplet), and iambic trimeter (the metre of tragic and comic dialogue). The scanner resolves each syllable's quantity in context — applying correptio (a long vowel shortened before another vowel), treating muta-cum-liquida clusters as the ambiguity they are, and counting position across word boundaries.
The result is glyph notation you'll recognise from any commentary: — heavy (long), ⏑ light (short), × anceps (the "either" final syllable).
sc = greek.scan_hexameter("ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ")
sc.pattern # '—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—×' (Odyssey 1.1 — five dactyls)
sc.meter # 'hexameter'
[f.name for f in sc.feet]
# ['dactyl', 'dactyl', 'dactyl', 'dactyl', 'dactyl', 'final']
sc.caesura # 'trochaic' (the main word-break in the third foot)A line with spondees, and its caesura located by syllable index:
sc = greek.scan_hexameter("πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν")
sc.pattern # '—⏑⏑|——|—⏑⏑|—⏑⏑|—⏑⏑|—×' (Odyssey 1.2)
sc.caesura # 'penthemimeral'
sc.syllables[sc.caesura_index] # 'ἱ' (the line breaks just before this syllable)Elegiac pentameter (here Simonides' epitaph for the Spartan dead):
sc = greek.scan_pentameter("κείμεθα τοῖς κείνων ῥήμασι πειθόμενοι.")
sc.pattern # '—⏑⏑|——|—|—⏑⏑|—⏑⏑|×'
[f.name for f in sc.feet]
# ['dactyl', 'spondee', 'longum', 'dactyl', 'dactyl', 'longum']Iambic trimeter (the metre of tragic and comic dialogue) — three metra of
× — ⏑ —, with resolution of a long element into two shorts:
sc = greek.scan_trimeter("ὦ κοινὸν αὐτάδελφον Ἰσμήνης κάρα") # Antigone 1
sc.pattern # '×—⏑—|×—⏑—|×—⏑×'
sc.caesura # 'hephthemimeral' (penthemimeral or hephthemimeral)
greek.scan_trimeter("Διόνυσον, ὃν τίκτει ποθ' ἡ Κάδμου κόρη").pattern # Bacchae 2
# '×⏑⏑⏑—|×—⏑—|×—⏑×' — the first long is resolved (Διό- = ⏑⏑)scan_line(line, meter) dispatches by name ("hexameter" / "pentameter" /
"trimeter"), and a LineScansion carries .line, .meter, .feet,
.syllables, .quantities, .caesura, .caesura_index, and .ambiguous
(whether more than one scansion fit).
To inspect the possible quantities of each syllable before a metre is imposed —
useful for seeing where a line is genuinely ambiguous — use syllable_options:
greek.syllable_options("πατρός")
# [('πα', ['heavy', 'light']), ('τρός', ['light'])] ← πα is muta-cum-liquida: eitherSynizesis is lexical, never inferred. When a line only scans if two written
vowels are read as one syllable (e.g. Iliad 1.1, where Πηληϊάδεω reads its
final -εω as one syllable), the scanner applies it only for words in a
curated _SYNIZESIS lexicon — each entry test-enforced to be required by a real
line that otherwise fails, and to actually coalesce two vowels (the same
contribution-friendly, no-dead-entries discipline as the
syllabification exceptions):
greek.scan_hexameter("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος").pattern
# '—⏑⏑|—⏑⏑|——|—⏑⏑|—⏑⏑|—×' — Πηληϊάδεω is in the lexicon, so the line scansA line needing synizesis on a word not in the lexicon still raises
ScansionError rather than guessing. Lyric metres remain out of scope for now.
Transcribe Greek to IPA for two periods — "attic" (Classical, default) and
"koine" (Hellenistic/Imperial).
greek.to_ipa("θεός") # 'tʰeos' (Attic: aspirated θ)
greek.to_ipa("ὁ") # 'ho' (rough breathing → /h/)
greek.to_ipa("ἄγγελος") # 'aŋɡelos' (γγ → velar nasal)
greek.to_ipa("θεός", "koine") # 'θeos' (Koine: θ is a fricative)
greek.to_ipa("καί", "koine") # 'ke' (iotacism: αι → /e/)Attic uses aspirated φ θ χ = /pʰ tʰ kʰ/, voiced stops β γ δ = /b ɡ d/, ζ = /zd/, υ = /y/, distinctive vowel length, and rough breathing = /h/. Koine fricativizes (φ θ χ = /f θ x/; β γ δ = /v ɣ ð/), is mid-iotacism (η, ει → /i/; αι → /e/; οι → /y/), and drops length and the breathings.
Reconstructed and approximate — several values (ε/η quality, the long diphthongs, the date of iotacism) are scholarly judgement calls.
Coarse part-of-speech tags (Universal Dependencies inventory). Closed classes — article, prepositions, conjunctions, particles, pronouns, and the εἰμί copula — are tagged reliably from a lexicon; open-class words get a light suffix heuristic (a few verb endings, else NOUN).
greek.pos_tag("ὁ") # 'DET'
greek.pos_tag("πρὸς") # 'ADP' (grave folded to acute for lookup)
greek.pos_tag("ἦν") # 'VERB' (copula)
greek.pos_tag("λόγος") # 'NOUN'
greek.pos_tags("ἐν ἀρχῇ ἦν ὁ λόγος, καὶ θεός.")
# [('ἐν','ADP'), ('ἀρχῇ','NOUN'), ('ἦν','VERB'), ('ὁ','DET'),
# ('λόγος','NOUN'), (',','PUNCT'), ('καὶ','CCONJ'), ('θεός','NOUN'), ('.','PUNCT')]Baseline scope: closed classes are reliable; open-class precision is limited
(an open-class verb like ἄειδε falls back to NOUN). To fix this for attested forms,
switch on the treebank backend — with
greek.use_treebank() active, pos_tag/pos_tags return the gold AGDT tag for a
known form (e.g. ἔφη → VERB) before falling back to the heuristic. Tags:
DET ADP CCONJ SCONJ PART PRON ADV NUM NOUN VERB ADJ PUNCT X (treebank mode may
also emit INTJ). The treebank only covers attested forms, though — to tag an
unseen form well, switch on the
generalizing tagger below.
The baseline heuristic and the treebank lookup both fall down on an unseen open-class
form — the heuristic just guesses NOUN, and the lookup has no entry for it. use_tagger()
switches on a trained averaged-perceptron sequence tagger (pure Python, no heavy deps)
that predicts a tag from suffix/prefix/shape/accent features plus left-to-right sentence
context — so it generalizes to forms it has never seen.
greek.use_tagger() # one-time fetch of the prebuilt model (or local train as fallback), then cached
greek.pos_tags("ἐν ἀρχῇ ἦν ὁ λόγος") # every token tagged, in context
greek.disable_tagger() # back to the lookup/heuristicIt composes with the cascade: the closed-class lexicon and (when active) the treebank lookup still take precedence per token for the forms they cover; the tagger fills in everything else, including words neither has seen.
Measured — held-out AGDT, leakage-free. Trained on a 90% sentence split and scored on
the disjoint 10% (≈54k tokens, via greek.evaluate_tagger()), it reaches 84.4% POS
overall and 83.6% on unseen forms — forms absent from the training split. For contrast,
on the same tokens the lookup scores 0% on unseen (no entry) and the suffix heuristic only
~50%. The cached model is ~2.2 MB and import aegean stays instant — the model arrives on
first use_tagger() (prebuilt fetch, or trained locally as the fallback), never bundled.
greek.evaluate_tagger(holdout=0.1)
# {'pos_all': 0.844, 'pos_unseen': 0.836, 'n_all': 54036, 'n_seen': 45138, 'n_unseen': 8898}This is a generalizing tagger with zero heavy dependencies, an instant import, and a ~2 MB model — a deliberate point on the trade-off curve, favouring pure-Python portability over the absolute accuracy of a full neural pipeline. The benchmark harness lets you score any tagger on a gold set of your choosing.
Given an inflected form, analyze returns the morphological readings its ending
implies — part of speech plus the relevant features (case/number/gender for
nouns; tense/voice/mood/person/number for verbs) — each with a reconstructed
lemma. Greek inflection is richly ambiguous, so a single form legitimately yields
several candidate readings; you disambiguate with context.
for a in greek.analyze("λόγον"):
print(a)
# λόγος [NOUN acc sg masc]
# λόγος [NOUN acc sg fem]
# λόγος [NOUN nom sg neut]
# λόγος [NOUN acc sg neut]
# λόγος [NOUN voc sg neut]Each reading is an Analysis with the lemma, the POS, and the individual feature
fields; .features() gives just the ones that apply:
a = greek.analyze("λύεις")[0]
a.lemma, a.pos # ('λυω', 'VERB')
a.features() # {'number': 'sg', 'tense': 'pres', 'voice': 'act', 'mood': 'ind', 'person': '2'}
a.lemma_certain # False ← see "how far to trust the lemma" belowClosed-class words (the article, prepositions, conjunctions, particles, pronouns) come back as a single, confident reading:
greek.analyze("ὁ") # (Analysis(lemma='ὁ', pos='DET'),)
greek.analyze("καί")[0] # κaí → CCONJTwo convenience shortcuts when you don't need the full feature set:
greek.lemmas("ἀνθρώπων") # ['ἄνθρωπος'] (the distinct lemmas a form could belong to)
greek.best_pos("λύεις") # 'VERB' (the single most likely part of speech)Analysis.lemma_certain tells you how much to trust the lemma. When the bundled
seed lexicon knows the form, you get the correctly accented lemma and
lemma_certain=True. When the form is regular but out-of-vocabulary, the lemma is
reconstructed from the ending — unaccented (accent recession can't be
derived from the ending alone) and flagged lemma_certain=False:
[a for a in greek.analyze("ἀνθρώπων") if a.pos == "NOUN"][0].lemma # 'ἄνθρωπος' (seed, certain)
[a for a in greek.analyze("ἵππον") if a.pos == "NOUN"][0].lemma # 'ιππος' (reconstructed, uncertain)This is a baseline engine — high-precision on the regular paradigms it
encodes (the article and pronouns, the first and second declensions and common
third-declension endings, and thematic verbs in the present, imperfect, future
and sigmatic aorist indicative, plus common infinitives and the mediopassive
participle). Past tenses are augment-gated, and a dative singular is detected from
its iota subscript. Athematic, contract, irregular and suppletive forms (εἶπον →
λέγω) are beyond a purely rule-based reach; for those, switch on the
treebank-derived lexicon below. For ambiguous forms the
feature analyses are exploratory: trust the
closed classes and the feature set; treat a single auto-picked reading with care.
The baseline above is rule-based and fully offline. For attested forms you can switch on a treebank-derived lexicon built from the Perseus Ancient Greek Dependency Treebank (AGDT v2.1). It supplies correctly-accented lemmas and full features — including the irregular, contract, athematic and third-declension forms the rule engine can't reach:
greek.use_treebank() # one-time fetch of the ~15 MB prebuilt lexicon, cached; then instant
greek.lemmatize("ἄνδρα") # 'ἀνήρ' (3rd declension; the rule engine gives a bare stem)
greek.lemmatize("ἔφη") # 'φημί' (suppletive athematic verb)
greek.lemmatize("γυναικός") # 'γυνή'
greek.lemmatize("πόλεως") # 'πόλις'
greek.analyze("ἀνθρώπων")[0] # ἄνθρωπος [NOUN gen pl masc] (lemma_certain=True)Once active, lemmatize/analyze prefer the treebank for known forms and fall
back to the rule/seed engine for the rest; greek.disable_treebank() restores the
default. Network is needed only on the first call: it fetches the prebuilt
AGDT-derived lexicon (part of one shared ~15 MB bundle), falling back to
downloading the treebank itself (~75 MB) and building locally if the asset is
unreachable. The data is CC BY-SA 3.0 (derived from the AGDT), fetched to your
cache and never bundled — see
Data & Provenance.
aegean.greek.benchmark scores the pipeline against a small bundled gold set, so
you can track how its Greek coverage is doing over time. The gold is hand-authored
and independent — correct answers stated from scholarship, never read off any
engine — which is what makes the comparison below fair.
from aegean.greek import benchmark
for stage, s in benchmark.run_benchmark().items():
print(s)
# tokenize: 100% (5/5)
# syllabify: 100% (6/6)
# accent: 100% (6/6)
# scansion: 100% (5/5)
# lemma: 28% (5/18) ← seed table only; misses irregular / 3rd-declension forms
# pos: 50% (10/20) ← suffix heuristic misses open-class words
# morphology: 73% (8/11)compare_modes() scores lemma + POS with the
treebank backend off vs on (it activates
use_treebank() for you, fetching the prebuilt lexicon on first use):
benchmark.compare_modes()
# baseline : lemma 28% (5/18) · pos 50% (10/20)
# treebank : lemma 100% (18/18) · pos 100% (20/20)On this gold set the treebank lifts lemma 28% → 100% and POS 50% → 100% (morphology recall 73% → 100%). The set is deliberately weighted toward the irregular, third-declension and open-class forms that separate the engines, and each item is attested in the AGDT — so it measures the win where it applies; on genuinely unattested forms the treebank falls back to the baseline.
compare_lemmatizers and compare_pos_taggers take any lemma-or-POS callable you supply
and score it on the same bundled gold set, so you can measure an external pipeline on identical
items. The gold set is small (18 lemma / 20 POS items) and weighted toward attested forms, so
it measures lexical coverage, not generalization to unseen text — for which the held-out
evaluations below are the relevant measure.
The generalizing tagger is measured on a leakage-free 90/10 AGDT sentence split, scored in context on ≈54k tokens, with the unseen-form subset (forms absent from training) called out separately:
| POS — held-out AGDT | overall | unseen forms |
|---|---|---|
| pyaegean tagger (pure Python) | 84.4% | 83.6% |
The AGDT is the tagger's own training source, so the unseen-form column is the honest generalization measure — 83.6% from a zero-dependency, pure-Python model. A fully neutral check, on text pyaegean never trained on, is the out-of-AGDT evaluation below.
The same evaluation for lemmatization (the generalizing lemmatizer, scored with predicted POS):
| lemma — held-out AGDT | overall | unseen forms |
|---|---|---|
| pyaegean lemmatizer (pure Python, edit-tree) | 84.5% | 40.3% |
pyaegean [neural] (GreTa seq2seq, opt-in) |
~92% | 76.3% |
The pure-Python lemmatizer is solid overall but trails on unseen forms, where recovering a
lemma (often an accent/stem change, not just a suffix swap) is hardest. The opt-in
[neural] backend reaches 76.3% on unseen forms with a GreTa seq2seq that generates the
lemma, and ships as a hybrid (the gold lookup answers seen forms, the seq2seq the rest), so
overall lemma accuracy lands around 92%. It is a fetched-to-cache ONNX model behind the
[neural] extra (onnxruntime, no torch); the pure-Python edit-tree stays the zero-dependency
default. See Neural lemmatizer (opt-in) below.
The held-out numbers above are leakage-free within the AGDT — but pyaegean's backends are
all built from the AGDT, so they don't show how the system fares on text from a different
source. greek.evaluate_on_proiel() scores the active pipeline (lemmatize + pos_tag)
against the PROIEL treebank — the Greek New Testament and Herodotus — which none of
pyaegean's models have ever seen, so every form is a genuine generalization test.
from aegean import greek
greek.use_treebank(); greek.use_neural_lemmatizer() # measure the full pipeline
greek.evaluate_on_proiel() # {'lemma': …, 'pos': …, 'n': …} over the PROIEL goldPROIEL is fetched to the cache on first use (CC BY-NC-SA 3.0 — evaluation only, never
bundled, like the AGDT). Lemma accuracy is the clean metric (lemmas compared after Unicode
normalization and dropping PROIEL's #N homograph suffix); POS is compared under a reconciled
tagset (PROIEL's PROPN/SCONJ collapse to pyaegean's NOUN/CCONJ, so the figure reflects real
errors, not convention gaps). This is a neutral test for pyaegean specifically — PROIEL is
in-training for some other systems, so it is not a level field for cross-tool comparison; it
answers "how well does pyaegean read Greek it never trained on."
Pass your own gold (same schema as the bundled benchmark_gold.json) to any
scorer — score_lemmatizer, score_pos, compare_lemmatizers,
compare_pos_taggers, or compare_modes.
greek.evaluate_on_ud(treebank, split) scores the active pipeline on the Universal
Dependencies Ancient Greek test folds (Perseus / PROIEL) with the official CoNLL 2018
evaluator — the protocol the field's published numbers use. The folds are CC BY-NC-SA,
fetched to the cache for evaluation only (never trained on); greek.agdt_ud_overlap()
builds the manifest of AGDT sentences that appear in the UD folds, which pyaegean's model
training excludes. The full protocol, leakage controls, and measured numbers live in
docs/benchmarks.md.
greek.use_treebank(); greek.use_tagger(); greek.use_lemmatizer(); greek.use_parser()
greek.evaluate_on_ud("proiel", "test") # {'upos': …, 'lemma': …, 'uas': …, …}A small bundled form→lemma seed table with an identity fallback. This is the always-offline baseline; for attested forms the treebank backend supplies real, accented lemmas, and the rule-based morphological analyzer is documented above.
greek.lemmatize("λόγου") # 'λόγος'
greek.lemmatize("ἦν") # 'εἰμί'
greek.lemmatize_verbose("ξενικον") # ('ξενικον', False) ← not in the seed tableTo lemmatize unseen forms, switch on the generalizing lemmatizer below.
The seed table and the treebank lookup only lemmatize attested forms; an unseen form comes
back unchanged. use_lemmatizer() switches on a trained lemmatizer that generalizes: from
each (form, lemma) pair it learns a Chrupała-style edit tree — a recursive transform that
keeps the shared stem and rewrites the differing prefix/suffix — so a rule learned from one
word (-ου → -ος) applies to unseen words (νόμου → νόμος), and edit trees capture accent
shifts and capitalization too. An averaged-perceptron reranker, conditioned on POS, picks the
right tree for each form.
greek.use_tagger() # recommended — the lemmatizer conditions on the tagger's POS
greek.use_lemmatizer() # one-time fetch of the prebuilt model (or local train as fallback), then cached
greek.lemmatize("ἀνθρώπων") # 'ἄνθρωπος', even if the form was never attested
greek.disable_lemmatizer()It slots into the cascade after the treebank lookup: an attested form still gets its gold lemma; everything else goes to the model.
Measured — held-out AGDT, leakage-free. Trained on a 90% sentence split and scored on the
disjoint 10% (via greek.evaluate_lemmatizer(), with predicted POS), it reaches 84.5%
overall and 40.3% on unseen forms — versus the lookup's 0% on unseen. The cached model is
~7 MB (fetched prebuilt on first use — or trained locally if the asset is unreachable — never
bundled).
This is real generalization from a zero-dependency model (0% → 40% on unseen, competitive on attested forms). Recovering an unseen Greek lemma often means an internal stem/accent change rather than a suffix swap, which is where a pure-Python edit-tree reranker reaches its limit. For higher unseen accuracy, switch on the neural backend below, which reaches 76.3% on unseen forms.
The [neural] backend generates the lemma with a fine-tuned GreTa (Ancient-Greek
T5) seq2seq, composing novel stem and accent changes rather than classifying a form into a
known transformation. On unseen forms it reaches 76.3%.
pip install "pyaegean[neural]" # onnxruntime + tokenizers; no torchgreek.use_neural_lemmatizer() # fetches the model (~232 MB, one-time) to the cache
greek.lemmatize("θήσονται") # 'τίθημι' — generated, never attested in this form
greek.lemmatize("λάθωσι") # 'λανθάνω'
greek.disable_neural_lemmatizer()It is a hybrid: a bundled gold lookup answers attested (seen) forms exactly — so the model
only generates for genuinely unseen forms — and it slots into the cascade just after the
treebank lookup, ahead of the edit-tree reranker. Inference is torch-free (a numpy greedy
decode over the int8 ONNX encoder/decoder via onnxruntime); the model is fetched to the cache,
never bundled, so import aegean stays instant. The weights derive from CC BY-SA treebanks
(see Data & Provenance); the wheel stays Apache-2.0 because the model is
fetched, not bundled.
What does a word mean? use_lsj() switches on the full Perseus Liddell-Scott-Jones
lexicon — it fetches the prebuilt ~15 MB index (one-time; or, if that asset is
unreachable, downloads the ~270 MB TEI and builds the index locally), then
gloss/lookup resolve a Greek word to its dictionary entry. Looking up an inflected
form works: it tries the form, then lemmatizes (using the treebank backend
if active) and retries — so it composes with everything above.
greek.use_treebank() # optional, but lets inflected/irregular forms resolve
greek.use_lsj() # one-time fetch of the ~15 MB prebuilt index, cached; then instant
greek.gloss("ἀνδρός") # 'ἀνήρ: man, opp. god, …' (lemmatized ἀνδρός → ἀνήρ)
greek.gloss("γυναικός") # 'γυνή: wife, spouse, …'
greek.gloss("βάλλω") # 'βάλλω: Act., throw: …'
entry = greek.lookup("λόγος") # the full structured entry
entry.headword # 'λόγος'
len(entry.senses) # 64
entry.senses[0].marker, entry.senses[0].text[:40] # ('I', 'computation, reckoning …')lookup returns an LSJEntry (headword, senses of Sense(marker, level, text),
lead, short); gloss is the concise one-liner (headword: <first English sense>).
Beta Code in the source is converted to Unicode, and citations are compacted into the
sense text. The short gloss is best-effort — for a few entries (e.g. cross-reference
headwords) it can still lead with a variant; use lookup for the full picture.
The LSJ is CC BY-SA 4.0 (Perseus Digital Library), fetched to your cache and never
bundled — see Data & Provenance.
greek.disable_lsj() turns it back off.
use_parser() activates (on first use it fetches the prebuilt model from the shared
AGDT-derived bundle; if that's unreachable it downloads the AGDT and trains locally —
a few minutes) a transition-based arc-eager parser with an averaged-perceptron
classifier (pure Python, no heavy deps); then parse() turns a sentence into a
dependency tree with the gold AGDT/Prague labels (SBJ, OBJ, ATR, ADV, PRED, Aux*…).
greek.use_treebank() # optional — improves the POS/lemmas the parser feeds on
greek.use_parser() # one-time train (~2–3 min) from the cached AGDT, then cached
tree = greek.parse("ἐν ἀρχῇ ἦν ὁ λόγος")
print(tree)
# 1 ἐν ADP AuxP ->3(ἦν)
# 2 ἀρχῇ NOUN ADV ->1(ἐν)
# 3 ἦν VERB PRED ->0(ROOT)
# 4 ὁ DET ATR ->5(λόγος)
# 5 λόγος NOUN SBJ ->3(ἦν)
tree.root().form # 'ἦν'
[t.form for t in tree.children(3)] # ['ἐν', 'λόγος']A DepTree is a tuple of DepToken(id, form, lemma, upos, head, relation) with
root(), head_of(id), children(id), and is_projective(). You can also read the
treebank's gold trees directly: from aegean.greek.syntax import load_gold_trees.
This is an honest baseline. Ancient Greek is richly non-projective (only ~31%
of AGDT sentences are projective), and arc-eager can build only projective trees — so
non-projective gold structures are out of reach and are skipped in training (a known
limitation, not a bug). Measured on held-out AGDT with gold POS:
~0.67 UAS / 0.57 LAS on projective sentences, ~0.51 / 0.42 across all text
(greek.evaluate_parser() reproduces these). It produces clean, correct trees for
main-clause syntax (as above), but it is not a research-grade parser. The model is
derived from the AGDT (CC BY-SA 3.0), cached locally (~4 MB), never bundled;
greek.disable_parser() turns it off.
aegean.load("greek") loads a handful of public-domain Archaic→Koine passages
(Homer, Herodotus, Heraclitus, Sappho, John 1:1) to exercise the pipeline.
import aegean
g = aegean.load("greek")
len(g) # 5
iliad = g.get("iliad-1.1")
[t.text for t in iliad.words]
iliad.meta.scribe, iliad.meta.period # ('Homer', 'Archaic (epic)')
dict(g.word_frequencies())["λόγος"] # 2 (John 1:1 sample)The Greek Script also exposes the pipeline as a capability:
script = aegean.get_script("greek")
script.nlp.syllabify("ἄνθρωπος") # ['ἄν', 'θρω', 'πος']Start here
Aegean scripts
Greek
Capabilities
Reference