Architecture

pyaegean is built in strict, downward-only layers. Higher layers import lower ones; the core never imports a script.

L6  ai (aegean.ai)        provider-agnostic LLM clients + grounded capabilities
L5  translate             hybrid: lexicon/morphology grounding → LLM
L5  greek (aegean.greek)  Greek NLP pipeline (normalize/tokenize/syllabify/…)
L4  io · geo · data      aegean.io (EpiDoc/CSV/Parquet export) · aegean.geo (GeoDataFrame) · bundled registry + cache
L3  analysis              distance · align · morphology · collocation · patterns
                          · query · accounting · structure
L2  scripts (plugins)     lineara · linearb · cypriot · cyprominoan · greek
L1  core                  Corpus · Document · Token · Sign · SignInventory ·
                          Numeral · Script(ABC) · Registry · Provenance

The Greek layer (L5) also hosts the opt-in backends — the AGDT treebank (treebank.py), LSJ glossing (lexicon.py), the dependency parser (syntax.py), the POS tagger (tagger.py), the lemmatizers (lemmatizer.py and the neural seq2seq backend neural_lemmatizer.py), and the neural joint pipeline (joint.py, with mst.py arc decoding and udfeats.py FEATS rendering) — which fetch and build their artifacts through the L4 data/cache layer (a greek → data edge); the strict downward-only layering still holds.

The core model (`aegean.core`)

Frozen @dataclass(slots=True) value objects; numpy/pandas lazy.

Sign — one graphic unit (syllabogram, letter, logogram): label, glyph, codepoint, phonetic, attrs.
Token — one unit in a document's text stream, with a TokenKind (WORD, LOGOGRAM, NUMERAL, SEPARATOR, PUNCT, UNKNOWN), decomposed signs, optional line_no/position.
Document — one inscription/tablet/text: id, script_id, tokens, physical lines, meta (DocumentMeta: site/support/scribe/findspot/period/ name/images). Properties: .words, .numerals, .logograms, .line_tokens.
Corpus — the hub: Corpus.load(script_id) / aegean.load(...), .filter(**meta), .get(id), .word_frequencies(), .to_dataframe(level="document"|"token"|"word"), .to_dict(), .query(filters), .to_json() / .from_json() (lossless round-trip), .provenance.
SignInventory — signs indexed by label / glyph / codepoint.
Provenance — source/license/citation that travels with every corpus and stamps exports; .cite() returns a one-line citation.

Scripts are plugins

A writing system is a plugin the core knows only by interface:

from aegean.core.script import Script, register

class MyScript(Script):
    id = "myscript"
    name = "My Script"

    @property
    def sign_inventory(self): ...
    def tokenize(self, raw: str): ...

register(MyScript())

A corpus loader is registered separately via aegean.core.corpus.register_loader(script_id, fn) so aegean.load(script_id) works. The core never imports scripts (no cycles); aegean/__init__ imports scripts to register the built-ins (Linear A, Greek).

Access registered scripts:

import aegean
aegean.registered_scripts()        # ['cypriot', 'cyprominoan', 'greek', 'lineara', 'linearb']
script = aegean.get_script("greek")
script.sign_inventory              # the Greek alphabet
script.tokenize("ἐν ἀρχῇ")          # [Token(...), ...]

Conventions

The core has zero hard third-party deps. pandas (optional [data] extra) and provider SDKs are lazy-imported inside functions; collocation stats are pure stdlib, so import aegean is instant and loads nothing heavy.
No large/binary assets are bundled — that's what the download-to-cache layer is for. CI's scripts/check_footprint.py enforces import-clean, import-fast, and a code+JSON-only wheel.
Every exploratory method (cross-linguistic distance, morphology clustering, accounting reconciliation, decipherment, AI readings) carries its caveat and is labeled unverified at point of use. The Linear A material is undeciphered — analysis is never presented as ground truth.

pyaegean

Home

Start here

Aegean scripts

Greek

Greek NLP

Capabilities

Reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture

Architecture

The core model (`aegean.core`)

Scripts are plugins

Conventions

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pyaegean

Clone this wiki locally

Architecture

Architecture

The core model (aegean.core)

Scripts are plugins

Conventions

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pyaegean

Clone this wiki locally

The core model (`aegean.core`)