Home

pyaegean

A specialist Python toolkit for Ancient Greek — alphabetic Greek and the Aegean syllabic scripts (Linear A, Linear B, Cypriot, and Cypro-Minoan). pyaegean focuses narrowly and deeply on Greek and the Aegean world: a script-agnostic corpus data layer, the analytical methods from the Linear A Research Workbench, translation, and a pluggable multi-provider AI layer.

Status: v0.8.0 (beta). The API is close to stable, but a 1.0 awaits outside use and a short methods write-up. The script-agnostic core, Linear A, Linear B (Mycenaean Greek), the Cypriot syllabary (Arcado-Cypriot Greek), and the undeciphered Cypro-Minoan script complete the Aegean set — each deciphered script with a sign inventory, transliteration, and a Greek-reading bridge; Cypro-Minoan, undeciphered, ships its sign inventory only. The Greek NLP track is a full pipeline: a zero-dependency core (tokenize → scansion → IPA → POS → morphology → lemmas), opt-in backends (Perseus AGDT treebank, LSJ glossing, generalizing taggers/lemmatizers), and an opt-in neural pipeline (use_neural_pipeline) — one jointly-trained torch-free model for tagging, morphology, UD dependency parsing, and lemmatization that is state of the art on the UD Ancient Greek benchmarks (measured: 96.9 UPOS / 96.1 UFeats / 94.4 lemma / 89.2 UAS / 84.4 LAS on the Perseus test fold, end-to-end from raw text) — plus real works on demand (load_work("tlg0012.tlg001") → the Iliad), a benchmark harness, and a neutral out-of-AGDT (PROIEL) evaluator. The multi-provider AI layer + hybrid translation are implemented, over a corpus data layer with a lossless JSON round-trip (to_json/from_json) and a compound query(), plus EpiDoc/CSV/Parquet export. Analytical and generative output on the undeciphered Linear A material is exploratory — see Limitations for the full picture of what pyaegean can and cannot claim, and Data & Provenance for where every dataset comes from.

New here?

Never used Python? Start with Getting Started — it walks you from "I have nothing installed" to your first result, no prior programming assumed.
Want to learn by doing? The Tutorial answers two real research questions end to end — one in Linear A, one in Greek.
Something not working? See the FAQ & Troubleshooting.

Quick start

import aegean

corpus = aegean.load("lineara")          # 1,721 inscriptions, bundled, offline
print(len(corpus))                       # 1721

ht = corpus.filter(site="Haghia Triada") # filter by metadata (site name)
df = corpus.to_dataframe(level="word")   # pandas-native, one row per word

from aegean.analysis import balance_check, word_matches_sign_pattern
checks = balance_check(corpus.get("HT13"))          # KU-RO accounting reconciliation
hits = [w for w, _ in corpus.word_frequencies()
        if word_matches_sign_pattern(w, "KU-*-RO")] # wildcard sign search

from aegean import greek
greek.betacode_to_unicode("mh=nin")          # 'μῆνιν'
greek.syllabify("ἄνθρωπος")                  # ['ἄν', 'θρω', 'πος']
greek.accentuation("λόγος").classification    # 'paroxytone'

What's here

Module	What it does
`aegean.core`	Script-agnostic model: `Corpus`, `Document`, `Token`, `Sign`, `SignInventory`, `Numeral`, the `Script` plugin registry, provenance, a lossless JSON round-trip, and a compound `query()`
Linear A	Bundled 1,721-inscription corpus, the full Unicode Linear A repertoire (344 signs; 84 carry conventional sound values), sign→sound map, transliteration
Linear B	Mycenaean Greek: 211-sign Unicode inventory, transliteration, a Greek-reading bridge (`po-me → ποιμήν`), accounting, the full DAMOS corpus on demand (`aegean.load("damos")`, ~5,900 tablets)
Cypriot	Arcado-Cypriot Greek: 55-sign Unicode syllabary, transliteration, a Greek-reading bridge (`pa-si-le-u-se → βασιλεύς`)
Cypro-Minoan	Undeciphered Bronze Age Cyprus: 99-sign Unicode inventory + sign-sequence tokenization (no phonetics or bridge — the script is undeciphered)
Analysis	Accounting reconciliation, sign-pattern search, phonetic distance/alignment, cross-script comparison (Linear B ↔ Greek by sound), morphology clustering, collocation stats, corpus statistics (dispersion, keyness, bootstrap), query engine, structure detection, an opt-in analysis cache
Greek NLP	Beta Code↔Unicode, tokenize, syllabify, accent & prosody, metrical scansion, reconstructed IPA, POS tagging, morphological analysis, lemmatize; opt-in Perseus-treebank lemmas/POS (`use_treebank`), LSJ glossing (`use_lsj`), generalizing pure-Python taggers/lemmatizers/parser, and the neural pipeline (`use_neural_pipeline`) — joint tagging + morphology + UD parsing + lemmatization, state of the art on the UD Ancient Greek benchmarks (96.9 UPOS / 96.1 UFeats / 94.4 lemma / 89.2 UAS / 84.4 LAS, Perseus test) — plus real works on demand (`load_work`) and a benchmark harness
`aegean.io`	Export adapters: EpiDoc (TEI) write — the inverse of the bring-your-own reader — plus CSV and Parquet
CLI	The `aegean` command line (`[cli]` extra): the whole toolkit without writing Python — corpus commands, the full Greek NLP pipeline, analysis, data fetching, and the (exploratory) AI layer; `--json` everywhere, stdin-pipeable
Geography	`aegean.geo`: corpus → geopandas GeoDataFrame (per-inscription or per-site points) from a bundled, Pleiades-aligned Aegean gazetteer, for mapping/spatial analysis
`aegean.viz` (Analysis)	One-line plots (the `[viz]` extra): frequency bars, dispersion/keyness charts, co-occurrence networks, accounting diagonals, scansion grids — and `aegean plot` from the shell
AI Layer	Multi-provider clients (Anthropic/OpenAI/Grok/Gemini), grounding, caching, exploratory-labeled capabilities, hybrid translation
Data & Provenance	Bundled data, download-to-cache, citation/licensing

The API reference documents every public module, class, and function, generated from the docstrings and type hints — complementing the guides above.

Install

pip install pyaegean            # core + Linear A + Greek
pip install "pyaegean[cli]"     # + the `aegean` command line
pip install "pyaegean[ai]"      # + Anthropic / OpenAI / Grok / Gemini clients
pip install "pyaegean[all]"     # everything

See Installation for the full extras matrix, and Development to build from source and run the test suite.

Roadmap

Shipped (through v0.8): all four Aegean scripts (Linear A, Linear B, Cypriot, Cypro-Minoan); a deep Greek NLP track — treebank lemmas/POS, LSJ glossing, generalizing pure-Python taggers/lemmatizers/parser, the neural joint pipeline (state of the art on the UD Ancient Greek benchmarks: 96.9 UPOS / 96.1 UFeats / 94.4 lemma / 89.2 UAS / 84.4 LAS, Perseus test, end-to-end from raw text), real works on demand (load_work — Perseus/First1KGreek), and a benchmark harness; the aegean command line covering the whole toolkit; the multi-provider AI layer and hybrid translation; the corpus data layer with a lossless JSON round-trip (to_json/from_json), a compound query(), schema-valid EpiDoc/CSV/Parquet export, citation automation (cite() down to the exact subset), and a data-versioning manifest (data.versions()); geographic analysis with Pleiades alignment; editorial-status + variant-reading round-trips (ReadingStatus, Token.alt ↔ EpiDoc); the full Unicode Linear A sign repertoire; the corpus-data expansion — the full DAMOS Linear B corpus (aegean.load("damos"), ~5,900 tablets) and the SigLA Linear A dataset (aegean.load("sigla")) fetched on demand, with prebuilt LSJ/AGDT artifacts for a fast first use; corpus statistics (dispersion/keyness/bootstrap), visualization one-liners, and cross-script phonetic comparison; traceable, measurable AI grounding; the opt-in analysis cache; and the extended Pleiades alignment (33/56, coordinate-verified).

On the list next:

DAMOS scribal-hand analysis and SigLA apparatus decoding
Richer load_work addressing across more of the Perseus / First1KGreek canon
A smaller neural model (selective quantization, optional GPU execution), held to the same accuracy gate
Morpheus-backed tables for the offline morphology tier
Wider gazetteer / Pleiades coverage

A 1.0 stable waits on outside use and a short methods write-up.

For specialists

Aegean epigrapher, Mycenologist, philologist, or historical linguist? See For Specialists for what's established vs. exploratory and how to file a correction, validate an exploratory result, or contribute a sourced fact — your judgement is part of how the toolkit stays trustworthy.

License

Apache-2.0. Corpus data is GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz; facsimile imagery © École Française d'Athènes (referenced, not redistributed). See Data & Provenance.

pyaegean

Home

Start here

Aegean scripts

Greek

Greek NLP

Capabilities

Reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Home

pyaegean

New here?

Quick start

What's here

Install

Roadmap

For specialists

License

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pyaegean

Clone this wiki locally