-
Notifications
You must be signed in to change notification settings - Fork 0
Architecture
github-actions[bot] edited this page Jun 8, 2026
·
14 revisions
pyaegean is built in strict, downward-only layers. Higher layers import lower ones; the core never imports a script.
L6 ai (aegean.ai) provider-agnostic LLM clients + grounded capabilities
L5 translate hybrid: lexicon/morphology grounding → LLM
L5 greek (aegean.greek) Greek NLP pipeline (normalize/tokenize/syllabify/…)
L4 io · data · adapters JSON/EpiDoc/CSV · bundled registry + downloader/cache
L3 analysis distance · align · morphology · collocation · patterns
· query · accounting · structure
L2 scripts (plugins) lineara · greek (linearb / cypriot later)
L1 core Corpus · Document · Token · Sign · SignInventory ·
Numeral · Script(ABC) · Registry · Provenance
Frozen @dataclass(slots=True) value objects; numpy/pandas lazy.
-
Sign— one graphic unit (syllabogram, letter, logogram):label,glyph,codepoint,phonetic,attrs. -
Token— one unit in a document's text stream, with aTokenKind(WORD,LOGOGRAM,NUMERAL,SEPARATOR,PUNCT,UNKNOWN), decomposedsigns, optionalline_no/position. -
Document— one inscription/tablet/text:id,script_id,tokens, physicallines,meta(DocumentMeta: site/support/scribe/findspot/period/ name/images). Properties:.words,.numerals,.logograms,.line_tokens. -
Corpus— the hub:Corpus.load(script_id)/aegean.load(...),.filter(**meta),.get(id),.word_frequencies(),.to_dataframe(level="document"|"token"|"word"),.to_dict(),.provenance. -
SignInventory— signs indexed by label / glyph / codepoint. -
Provenance— source/license/citation that travels with every corpus and stamps exports;.cite()returns a one-line citation.
A writing system is a plugin the core knows only by interface:
from aegean.core.script import Script, register
class MyScript(Script):
id = "myscript"
name = "My Script"
@property
def sign_inventory(self): ...
def tokenize(self, raw: str): ...
register(MyScript())A corpus loader is registered separately via
aegean.core.corpus.register_loader(script_id, fn) so aegean.load(script_id)
works. The core never imports scripts (no cycles); aegean/__init__ imports
scripts to register the built-ins (Linear A, Greek).
Access registered scripts:
import aegean
aegean.registered_scripts() # ['greek', 'lineara']
script = aegean.get_script("greek")
script.sign_inventory # the Greek alphabet
script.tokenize("ἐν ἀρχῇ") # [Token(...), ...]- Heavy deps (
numpy/pandas/scipy) and provider SDKs are lazy-imported inside functions;import aegeanstays instant. - No large/binary assets are bundled — that's what the download-to-cache layer is for. The wheel stays < 3 MB (CI guards it).
- Every exploratory method (cross-linguistic distance, morphology clustering, accounting reconciliation, decipherment, AI readings) carries its caveat and is labeled unverified at point of use. The Linear A material is undeciphered — analysis is never presented as ground truth.
Start here
Aegean scripts
Greek
Capabilities
Reference