-
Notifications
You must be signed in to change notification settings - Fork 0
FAQ
pyaegean is a free Python toolkit for Ancient Greek and the Aegean scripts (Linear A, Linear B, Cypriot, Cypro-Minoan): you give it Greek text or an inscription, and it gives back syllables, accents, metre, morphology, statistics, corpus queries, and more. Use it to explore a corpus, scan verse, tag and lemmatise Greek, or run honest, citeable analysis over undeciphered scripts.
This is a big Q&A for the snags that come up — install, Greek text, finding works to load, the web demo, offline use, the AI layer, and citing your results. If your question isn't here, please open an issue. New to all this? Start with Getting Started.
Python can't find the library. Almost always one of:
-
Your virtual environment isn't active. Re-activate it (you do this every
new terminal session) —
.venv\Scripts\Activate.ps1on Windows,source .venv/bin/activateon macOS/Linux — then try again. See Getting Started, Step 3. -
It isn't installed in this environment. Run
pip install pyaegean. - You installed with one Python and are running another. Check with
pip show pyaegeanandpython --version.
Don't install system-wide or use sudo. Make a
virtual environment
and install into it — no special permissions needed.
Run this once, then activate again:
Set-ExecutionPolicy -Scope CurrentUser RemoteSignedThe core library has no hard third-party dependencies. pandas is an optional extra
for DataFrame output — install it with pip install "pyaegean[data]". It's imported
only when you call a DataFrame feature, which is part of why import aegean stays
fast.
pip install --upgrade pyaegeanThat fetches the latest release from PyPI and installs it over the old one (no need to uninstall first). A few tips:
-
Update an extra the same way — keep the bracket so the optional dependencies upgrade too:
pip install --upgrade "pyaegean[cli]"(or[neural],[all], …). -
Check what you'll get / what you have:
pip index versions pyaegean # versions available on PyPI pip show pyaegean # the version currently installed python -c "import aegean; print(aegean.__version__)"
-
Pin a specific version if you need reproducibility:
pip install pyaegean==0.8.5. -
Cached datasets survive an upgrade. Updating the package never re-downloads the corpora or models you've already fetched — they live in a separate cache (see Where are downloaded/fetched files stored?), keyed by an immutable version, so they're reused as-is.
Everything past the offline core is opt-in, grouped into extras you add in brackets.
Install one (or several) with, e.g., pip install "pyaegean[cli]" or
pip install "pyaegean[neural,viz]".
| Extra | What it adds |
|---|---|
data |
pandas, for DataFrame output (to_dataframe) |
parquet |
Parquet export (pyarrow) |
epidoc |
EpiDoc TEI import/export |
geo |
GeoJSON / geographic output |
viz |
plotting (matplotlib) — figures and the scansion grid |
neural |
the neural pipeline + neural lemmatizer (most accurate Greek NLP) |
cli |
the aegean command-line interface |
mcp |
the MCP server, to drive pyaegean from an MCP client (e.g. Claude Code) |
anthropic |
the AI Layer via Anthropic |
openai |
the AI Layer via OpenAI |
gemini |
the AI Layer via Google Gemini |
grok |
the AI Layer via xAI Grok |
ai |
the AI Layer core (provider-agnostic) |
dev |
the test/lint/type toolchain (contributors) |
docs |
the documentation toolchain (contributors) |
all |
everything above |
See Installation for the full breakdown.
Python 3.10 or newer. Check with python --version.
aegean.load("…") opens a built-in corpus by id. The ids:
import aegean
from aegean.core.corpus import _LOADERS
sorted(_LOADERS)
# ['cypriot', 'cyprominoan', 'damos', 'greek', 'lineara', 'linearb', 'nt', 'sigla']| id | What it is | Offline? |
|---|---|---|
lineara |
the full Linear A corpus (1,721 documents) | yes, bundled |
linearb |
the bundled Linear B sample | yes, bundled |
greek |
a 5-passage Ancient Greek sample | yes, bundled |
cypriot |
the Cypriot Syllabary corpus | yes, bundled |
cyprominoan |
the Cypro-Minoan corpus | yes, bundled |
damos |
the full DAMOS Linear B corpus (~5,900 tablets, ~2 MB) | fetched on first use |
sigla |
the SigLA Linear A dataset (~1 MB) | fetched on first use |
nt |
the Greek New Testament (Nestle 1904); one book bundled offline, the rest fetched | mostly fetched |
corpus = aegean.load("lineara")
len(corpus) # 1721registered_scripts() lists the writing systems the package understands (each can
have its own corpus loader):
aegean.registered_scripts()
# ['cypriot', 'cyprominoan', 'greek', 'lineara', 'linearb']For real Greek literature beyond the bundled sample there are two on-demand loaders, each with a discovery helper so you never have to guess an id. See Greek Works and Books for the full guide.
Classical / literary works come from Perseus (canonical-greekLit) and
First1KGreek via greek.load_work(...). To browse a curated, verified starter list:
from aegean import greek
ws = greek.popular_works()
len(ws) # 25
ws[0]
# {'id': 'tlg0012.tlg001', 'author': 'Homer', 'title': 'Iliad'}That's a curated short list. For the whole reachable canon, greek.catalog() is an
offline, instant index of every work with a Greek edition in the open Perseus +
First1KGreek repos — ~1,800 of them — searchable by author, title (English or Greek), or
free text:
len(greek.catalog()) # 1778
hits = greek.catalog(author="plato")
len(hits) # 39
hits[0]
# {'id': 'tlg0059.tlg001', 'author': 'Plato', 'title': 'Euthyphro',
# 'greek_title': 'Εὐθύφρων', 'source': 'perseus'}Each entry's id goes straight to load_work. The catalogue is honest about coverage —
it lists exactly what the open repos hold at the pinned commit, so some authors that
aren't online upstream (Sappho, for instance) genuinely aren't in it; that's correct, not
a gap in pyaegean.
Same lists from the CLI (pip install "pyaegean[cli]"):
aegean greek works # the 25 curated highlights
aegean greek catalog --author plato # search the full ~1,800-work catalogue
aegean greek catalog sappho # a no-match is reported plainlyThe curated 25, as a table:
aegean greek works
# Popular Greek works
# ┌────────────────┬──────────────┬──────────────────────────────────┐
# │ id │ author │ title │
# ├────────────────┼──────────────┼──────────────────────────────────┤
# │ tlg0012.tlg001 │ Homer │ Iliad │
# │ tlg0012.tlg002 │ Homer │ Odyssey │
# │ tlg0020.tlg001 │ Hesiod │ Theogony │
# │ … │ … │ … │
# │ tlg0086.tlg010 │ Aristotle │ Nicomachean Ethics │
# └────────────────┴──────────────┴──────────────────────────────────┘
# Load one with, e.g.: aegean greek work tlg0012.tlg001 --ref 1.1-1.10
# This is a curated subset — search the full ~1,800-work canon with `aegean greek catalog`A sample of the 25 curated works (the list is a starting point, not the whole
canon — load_work takes any Perseus canonical-greekLit / First1KGreek CTS id;
browse them all at scaife.perseus.org):
| id | author | title |
|---|---|---|
tlg0012.tlg001 |
Homer | Iliad |
tlg0012.tlg002 |
Homer | Odyssey |
tlg0020.tlg002 |
Hesiod | Works and Days |
tlg0085.tlg005 |
Aeschylus | Agamemnon |
tlg0011.tlg004 |
Sophocles | Oedipus Tyrannus |
tlg0006.tlg003 |
Euripides | Medea |
tlg0016.tlg001 |
Herodotus | Histories |
tlg0003.tlg001 |
Thucydides | History of the Peloponnesian War |
tlg0032.tlg006 |
Xenophon | Anabasis |
tlg0059.tlg030 |
Plato | Republic |
tlg0086.tlg010 |
Aristotle | Nicomachean Ethics |
Then load one (network on first fetch only; pinned to a commit, so it's
reproducible). ref selects a sub-section — a book number, a book.chapter, or a
verse line-range:
# minimal, network on first use — shape shown, not run here:
iliad = greek.load_work("tlg0012.tlg001", ref="1.1-1.10") # Iliad, book 1, lines 1–10
len(iliad) # one Document per addressed textpart
iliad.provenance.license # 'CC BY-SA 4.0 (Perseus Digital Library)'CLI equivalent:
aegean greek work tlg0012.tlg001 --ref 1.1-1.10
work flag |
Meaning |
|---|---|
--ref |
section to load: 1 (book), 1.2 (chapter), 1.1-1.50 (lines) |
--source |
auto (default), perseus, or first1k
|
--edition |
pick a specific edition file when a work has several |
--output / -o
|
write the corpus to a JSON file |
--json |
machine-readable JSON on stdout |
The Greek New Testament has its own loader, greek.load_nt(book, ref=...),
which returns a Koine corpus with per-token gold lemma, morph parse, Strong's
number, and UD POS. To list the 27 books and every name each accepts:
books = greek.nt_books()
len(books) # 27
books[3]
# {'name': 'John', 'aliases': ['john', 'jn', 'jhn']}aegean greek nt-books
# New Testament books (Nestle 1904)
# ┌────────┬─────────────────────────────────┐
# │ book │ accepted names │
# ├────────┼─────────────────────────────────┤
# │ Matt │ matthew, matt, mt │
# │ John │ john, jn, jhn │
# │ Rev │ revelation, rev, rv, apocalypse │
# │ … │ … │
# └────────┴─────────────────────────────────┘
# Load one in Python: greek.load_nt('John', ref='1.1-18')Any name or alias works as the book argument; ref mirrors load_work
("3" = chapter 3, "3.16" = a verse, "3.16-18" = a verse range, "3-5" = a
chapter range). All 27 books and their aliases:
| Book | Accepted names |
|---|---|
| Matt | matthew, matt, mt |
| Mark | mark, mk, mrk |
| Luke | luke, lk, luk |
| John | john, jn, jhn |
| Acts | acts, act |
| Rom | romans, rom, rm |
| 1Cor | 1corinthians, 1cor, 1co |
| 2Cor | 2corinthians, 2cor, 2co |
| Gal | galatians, gal, ga |
| Eph | ephesians, eph |
| Phil | philippians, phil, php |
| Col | colossians, col |
| 1Thess | 1thessalonians, 1thess, 1th |
| 2Thess | 2thessalonians, 2thess, 2th |
| 1Tim | 1timothy, 1tim, 1ti |
| 2Tim | 2timothy, 2tim, 2ti |
| Titus | titus, tit |
| Phlm | philemon, phlm, phm |
| Heb | hebrews, heb |
| Jas | james, jas, jms |
| 1Pet | 1peter, 1pet, 1pe |
| 2Pet | 2peter, 2pet, 2pe |
| 1John | 1john, 1jn, 1jhn |
| 2John | 2john, 2jn, 2jhn |
| 3John | 3john, 3jn, 3jhn |
| Jude | jude, jud |
| Rev | revelation, rev, rv, apocalypse |
For a quick Koine gloss (no download — the bundled Dodson lexicon is CC0):
greek.use_dodson()
greek.gloss_nt("λόγος") # 'a word, speech, divine utterance, analogy'aegean greek gloss-nt "λόγος"
# a word, speech, divine utterance, analogyYes. aegean.io turns a string, a .txt, a folder of .txt files, or a .csv into a
real Corpus — with the full filter / search / analyse / export API — so you don't have
to write any Corpus boilerplate. It's all offline stdlib:
from aegean import io
io.from_text("λόγος δὲ καὶ ἀριθμός", doc_id="note") # from a string
io.from_text_file("essay.txt") # from one file
io.from_text_dir("poems/") # one corpus from a folder
io.from_csv("rows.csv", text_col="line", id_col="id") # one document per rowGreek (and nt) text is run through the Greek word tokenizer (punctuation stripped);
other scripts split on whitespace — pass script_id="lineara" etc. for those. split
controls how a file becomes documents: "whole" (default, one doc), "paragraph"
(blank-line blocks), or "line".
From the shell, aegean import writes the result to a .json or .db that then works
anywhere a corpus is accepted:
aegean import myplato.txt -o myplato.json # wrote 1 document(s) to myplato.json
aegean stats myplato.json --top 5 # …now analyse it like any corpusNote that aegean.load(...) and the CLI's corpus argument still take only a .json/.db
corpus — a raw .txt/.csv has to be imported first, as above. (The error you'd get
from skipping that step says so, and prints the exact aegean import line to run.)
The pure-Python core (the Greek pipeline plus the bundled Linear A corpus) runs in your browser — nothing to install, no server: the web demo. It's Pyodide (CPython compiled to WebAssembly) running locally in the page.
When the page opens it shows "loading…" for a few seconds. That's expected — on first load the browser has to:
- download and start the Pyodide WebAssembly runtime,
- load
micropipand thesqlite3stdlib module, then -
micropip.install("pyaegean")— fetch the package wheel and unpack it.
When that finishes the status flips to "ready" and the tools respond instantly.
It's slower the very first time because everything is being fetched over the
network; subsequent runs are quicker as the browser caches the pieces. If it gets
stuck on "failed to load," it's usually a flaky network or a strict
content-blocker — reload, or just pip install pyaegean locally instead.
Only the offline core runs client-side: Beta Code, syllabification, accents, scansion, and the bundled Linear A corpus. The neural and AI layers don't run in the browser — those need a local install (and, for AI, a provider key). For anything heavy, install locally per Getting Started.
That's a display/font issue, not a data problem — the text is correct underneath.
- Best fix: use Jupyter or a modern editor (VS Code), which render polytonic Greek cleanly.
-
In a Windows terminal: run
chcp 65001to switch it to UTF-8 first, and use a font with Greek coverage. - If you write Greek to a file, open it as UTF-8.
You don't need one. Type Beta Code — the standard ASCII transliteration used by the TLG and Perseus — and convert:
from aegean import greek
greek.betacode_to_unicode("mh=nin") # 'μῆνιν'aegean greek betacode "mh=nin"
# μῆνινSee Greek NLP → Beta Code for the full key.
Unicode has more than one way to encode the same accented letter. Normalise first:
greek.normalize("ό") # canonical NFC formYes — entirely offline, no extras needed:
greek.syllabify("ἄνθρωπος") # ['ἄν', 'θρω', 'πος']
greek.accentuation("λόγος").classification # 'paroxytone'
greek.scan_hexameter("ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ").pattern
# '—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—×'The scanner covers dactylic hexameter, elegiac pentameter, and iambic trimeter — see Metrical scansion.
No. The core library, the full Linear A corpus, and the Greek pipeline all work
offline. A few opt-in things touch the network on first use, then cache:
the fetched corpora — aegean.load("damos") (the full ~5,900-tablet Linear B corpus,
~2 MB) and aegean.load("sigla") (the SigLA Linear A dataset, ~1 MB) —
greek.load_work(...) / greek.load_nt(...) (real Greek texts, pinned to a commit),
data.fetch(...) for large extra assets (the facsimile images), the optional AI layer,
and the opt-in Greek backends. The treebank/LSJ/tagger/lemmatizer/parser backends now
fetch small prebuilt artifacts — greek.use_lsj() a ~15 MB index (not 270 MB of
Perseus TEI), and greek.use_treebank() / use_tagger() / use_lemmatizer() /
use_parser() one shared ~15 MB AGDT-derived bundle (no 75 MB download or local
training) — falling back to building from source if an asset is unreachable. The
[neural] models are larger: greek.use_neural_lemmatizer() (~232 MB) and
greek.use_neural_pipeline() (~518 MB). Everything else, including the rule-based
pipeline, works fully offline.
Only for the AI Layer (translation, glossing, decipherment
hypotheses). Everything else — analysis, scansion, morphology, statistics — needs
no key and no account. To use AI, install a provider extra and set its key, e.g.
pip install "pyaegean[anthropic]" and ANTHROPIC_API_KEY.
from aegean import data
data.cache_dir() # the cache location (override with the PYAEGEAN_CACHE env var)From the CLI you can see the location and what's in it, list what's fetchable, and record exact versions for reproducibility:
aegean data cache # cache location + per-entry sizes
aegean data list # the fetchable datasets (name, size note, license)
aegean data versions # the reproducibility manifest: each dataset's version + sha256Upgrading the package never wipes this cache, so you don't re-download after a
pip install --upgrade. To reclaim space, clear the analysis cache with
aegean cache --clear (that's the computed-results cache, separate from the
fetched datasets).
No — and the library is built to keep you honest about this. Linear A is undeciphered. The phonetic values come from Linear B as a working convention, and every analytical or AI method is labeled exploratory: evidence to weigh, never a translation. Treat results as leads for a human expert, not answers. See Limitations.
It's a hypothesis from a language model, returned as an ExploratoryResult with
provenance and an unmistakable exploratory label. Useful for ideas; never citable
as fact. Always verify against primary scholarship. See AI Layer and
Limitations.
The default rule/seed engines are an offline baseline: high-precision on closed
classes (article, prepositions, pronouns…) and regular paradigms, but they miss
irregular, third-declension, contract, and most open-class forms — and they tell you
when a result is reconstructed (lemma_certain=False).
Several opt-in backends raise accuracy well past that baseline. The strongest is the
neural pipeline — greek.use_neural_pipeline() (the [neural] extra): one joint
model for POS, morphology, UD dependency parsing, and lemmatization, state of the art
on the UD Ancient Greek benchmarks (96.9 UPOS / 96.1 UFeats / 94.4 lemma / 89.2 UAS /
84.4 LAS on the Perseus test fold, measured end-to-end from raw text — see
the neural pipeline). The lighter tiers:
greek.use_treebank() supplies attested, correctly-accented lemmas, full morphology,
and gold POS for forms attested in the AGDT; greek.use_tagger() generalizes POS at
~84% on unseen forms; greek.use_neural_lemmatizer() (a GreTa seq2seq) reaches 76.3%
on unseen forms, while the zero-dependency greek.use_lemmatizer() (edit-trees +
perceptron) reaches ~40%; greek.use_parser() is a pure-Python dependency parser
(~0.67 UAS / 0.57 LAS on projective AGDT). Quantify any combination on your own gold
set with greek.benchmark.compare_modes(). For meaning, opt into greek.use_lsj() (LSJ
glossing). See Treebank-backed mode
and Morphological analysis.
Every corpus carries its citation, and the repo ships a CITATION.cff:
corpus = aegean.load("lineara")
corpus.cite() # one line; also corpus.cite("bibtex") / corpus.cite("apa")
# Godart, L. & Olivier, J.-P. (1976–1985). Recueil des inscriptions en linéaire A. — https://github.com/mwenge/lineara.xyzThe citation follows the exact subset you used: a filtered corpus records what was filtered, and query results record the query —
from aegean.analysis import FilterRow
corpus.filter(site="Haghia Triada").cite()
# … — https://github.com/mwenge/lineara.xyz [subset: filter(site='Haghia Triada') → 1110 of 1721 documents]
results = corpus.query([FilterRow("word-prefix", "KU")], output="words")
results.cite() # … [query: Word starts with: KU → N words]Fetched Greek works carry their own provenance and license, too — load_work
pins the source commit and records the upstream CC BY-SA attribution:
iliad = greek.load_work("tlg0012.tlg001", ref="1") # network on first use
iliad.provenance.license # 'CC BY-SA 4.0 (Perseus Digital Library)'
iliad.provenance.data_version # the pinned commit, for reproducibilitySee Data & Provenance for full licensing and attribution.
-
Bugs / feature requests: GitHub Issues
— please include your pyaegean version (
python -c "import aegean; print(aegean.__version__)"). - How a function behaves: the per-domain reference pages — Linear A, Analysis, Greek NLP, AI Layer, Greek Works and Books, CLI.
- What the tools can and can't do: Limitations.
- Contributing a fix or a script plugin: Development.
Start here
Aegean scripts
Greek
Capabilities
Reference