Architecture

pyaegean is built in strict, downward-only layers. Higher layers import lower ones; the core never imports a script.

L6  ai (aegean.ai)        provider-agnostic LLM clients + grounded capabilities
L5  translate             hybrid: lexicon/morphology grounding → LLM
L5  greek (aegean.greek)  Greek NLP pipeline (normalize/tokenize/syllabify/…)
L4  io · data · adapters  JSON/EpiDoc/CSV · bundled registry + downloader/cache
L3  analysis              distance · align · morphology · collocation · patterns
                          · query · accounting · structure
L2  scripts (plugins)     lineara · greek  (linearb / cypriot later)
L1  core                  Corpus · Document · Token · Sign · SignInventory ·
                          Numeral · Script(ABC) · Registry · Provenance
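
The downward-only rule is simple enough to check mechanically. What follows is a minimal sketch, not code from pyaegean: the layer numbers are copied from the diagram, while the LAYERS table, the violations helper, and the sample import edges are hypothetical names and data chosen for illustration. Treating an import between two modules on the same layer as a violation is also an assumption of this sketch.

```python
# Layer numbers taken from the diagram; everything else here is illustrative.
LAYERS = {
    "core": 1,
    "scripts": 2,
    "analysis": 3,
    "io": 4,
    "greek": 5,
    "translate": 5,
    "ai": 6,
}


def violations(imports):
    """Return the (importer, imported) pairs that break the downward-only rule.

    An import is accepted only when the importer sits on a strictly higher
    layer than the module it imports (same-layer imports are rejected here,
    which is an assumption of this sketch).
    """
    return [(src, dst) for src, dst in imports if LAYERS[src] <= LAYERS[dst]]


# Hypothetical import edges, not read from the real package.
sample = [
    ("ai", "translate"),   # layer 6 -> layer 5: allowed
    ("analysis", "core"),  # layer 3 -> layer 1: allowed
    ("core", "scripts"),   # layer 1 -> layer 2: forbidden
]

print(violations(sample))  # [('core', 'scripts')]
```

A check of this kind could run in CI over the import graph extracted from the package, so that a cycle through the core is caught before it is merged.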

The core model (`aegean.core`)

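The core types are frozen @dataclass(slots=True) value objects: immutable once constructed and without a per-instance __dict__. numpy and pandas are imported lazily.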

Sign — one graphic unit (syllabogram, letter, logogram): label, glyph, codepoint, phonetic, attrs.
Token — one unit in a document's text stream, with a TokenKind (WORD, LOGOGRAM, NUMERAL, SEPARATOR, PUNCT, UNKNOWN), decomposed signs, optional line_no/position.
Document — one inscription/tablet/text: id, script_id, tokens, physical lines, meta (DocumentMeta: site/support/scribe/findspot/period/name/images). Properties: .words, .numerals, .logograms, .line_tokens.
Corpus — the hub: Corpus.load(script_id) / aegean.load(...), .filter(**meta), .get(id), .word_frequencies(), .to_dataframe(level="document"|"token"|"word"), .to_dict(), .provenance.
SignInventory — signs indexed by label / glyph / codepoint.
Provenance — source/license/citation that travels with every corpus and stamps exports; .cite() returns a one-line citation.

Scripts are plugins

A writing system is a plugin the core knows only by interface:

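from aegean.core.script import Script, register


class MyScript(Script):
    # Unique id: the key later passed to aegean.get_script(...)
    id = "myscript"
    name = "My Script"

    @property
    def sign_inventory(self): ...  # return this script's sign inventory

    def tokenize(self, raw: str): ...  # turn raw text into tokens


# Register an instance so the script can be looked up by its id
register(MyScript())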

A corpus loader is registered separately via aegean.core.corpus.register_loader(script_id, fn) so that aegean.load(script_id) works. Because the core never imports scripts, there are no import cycles; instead, aegean/__init__ imports the script packages to register the built-ins (Linear A, Greek).
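
The registration mechanism can be pictured as a small registry keyed by script id. The sketch below is a self-contained illustration of that pattern, not pyaegean's implementation: the _SCRIPTS dictionary, the duplicate-id check, the sorted listing, and DemoScript are all assumptions, and only the names Script, register, get_script, and registered_scripts mirror the ones used on this page.

```python
from abc import ABC, abstractmethod

_SCRIPTS = {}  # hypothetical registry: script id -> Script instance


class Script(ABC):
    """Interface the core relies on; concrete scripts live elsewhere."""

    id: str
    name: str

    @property
    @abstractmethod
    def sign_inventory(self): ...

    @abstractmethod
    def tokenize(self, raw: str): ...


def register(script):
    # Rejecting duplicate ids is an assumption of this sketch.
    if script.id in _SCRIPTS:
        raise ValueError(f"script {script.id!r} is already registered")
    _SCRIPTS[script.id] = script
    return script


def get_script(script_id):
    return _SCRIPTS[script_id]


def registered_scripts():
    return sorted(_SCRIPTS)


class DemoScript(Script):
    id = "demo"
    name = "Demo Script"

    @property
    def sign_inventory(self):
        return ("a", "b")  # stand-in for a real sign inventory

    def tokenize(self, raw: str):
        return raw.split()  # stand-in for real tokenization


register(DemoScript())
print(registered_scripts())                  # ['demo']
print(get_script("demo").tokenize("a b a"))  # ['a', 'b', 'a']
```

The registry module in this sketch never imports DemoScript; the plugin imports the registry and registers itself. That direction of dependency is what keeps the core free of import cycles.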

Access registered scripts:

import aegean
aegean.registered_scripts()        # ['greek', 'lineara']
script = aegean.get_script("greek")
script.sign_inventory              # the Greek alphabet
script.tokenize("ἐν ἀρχῇ")          # [Token(...), ...]

Conventions

Heavy deps (numpy/pandas/scipy) and provider SDKs are lazy-imported inside functions; import aegean stays instant.
No large/binary assets are bundled — that's what the download-to-cache layer is for. The wheel stays < 3 MB (CI guards it).
Every exploratory method (cross-linguistic distance, morphology clustering, accounting reconciliation, decipherment, AI readings) carries its caveat and is labeled unverified at point of use. The Linear A material is undeciphered — analysis is never presented as ground truth.
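
The lazy-import convention in the first item can be sketched as follows. This is an illustration of the general technique rather than code from pyaegean: the standard-library statistics module stands in for a heavy dependency such as numpy, and the mean_length and require helpers, as well as the idea of naming an installable extra in the error message, are hypothetical.

```python
import importlib


def mean_length(words):
    """Average word length; the dependency is imported on first call only."""
    import statistics  # deferred import: stands in for numpy/pandas/scipy

    return statistics.mean(len(w) for w in words)


def require(module_name, extra):
    """Import an optional dependency, or fail with an actionable message.

    `extra` is a hypothetical name for an optional install group.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is needed for this feature; "
            f"install the {extra!r} extra"
        ) from exc


print(mean_length(["ab", "abcd"]))  # 3
```

With this layout, importing the package pays only for the function definitions; the cost of loading the dependency is deferred to the first call that actually needs it.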
