CLI
aegean is the whole toolkit from your terminal — corpora, Greek NLP, surface
analysis, the fetch-to-cache data layer, SQLite, plots, and the (exploratory) AI
layer — without writing a line of Python. If you've never used a command line
before, start with Getting Started (it shows you how to open a
terminal); then come back here. Everything below is something you can copy, paste,
and run.
In a hurry? The CLI Cheatsheet is the dense one-page index of every command and flag. This page is the guided tour: it explains each group and shows a worked example with real output.
pip install "pyaegean[cli]" # adds typer + rich; the core library stays zero-dependency
aegean --help
If you only ran pip install pyaegean, the library works but the aegean
command isn't installed yet. The [cli] extra adds it. (If you run aegean
without it, you get one line telling you exactly that.)
Learn these once and every command behaves predictably.
| Convention | What it does | Example |
|---|---|---|
| --json | Print one machine-readable JSON document to stdout and nothing else, so results pipe into jq, files, or other programs. Greek stays readable (ensure_ascii=False). | aegean info lineara --json |
| - reads stdin | Anywhere a command takes a TEXT argument, passing - reads the text from standard input, so commands compose in pipelines. | echo "μῆνιν" \| aegean greek lemmatize - |
| Exit codes | 0 success · 1 a domain error (one line on stderr, prefixed aegean:) · 2 a usage error (typer's default). balance --strict exits 1 when any total fails to balance. | see below |
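A script can lean on these conventions directly. The sketch below is a minimal, hedged illustration: `run_json` is a hypothetical helper (not part of pyaegean), and the child processes are stand-ins so the example runs without the CLI installed. In real use the argument list would be something like `["aegean", "info", "lineara", "--json"]`.

```python
import json
import subprocess
import sys


def run_json(argv):
    """Run a command that follows the conventions above and return its parsed JSON.

    Exit 0 -> parse stdout; any other exit -> raise with the stderr message.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, encoding="utf-8")
    if proc.returncode == 0:
        return json.loads(proc.stdout)
    kind = {1: "domain error", 2: "usage error"}.get(proc.returncode, "unexpected failure")
    raise RuntimeError(f"{kind} (exit {proc.returncode}): {proc.stderr.strip()}")


# Stand-in child processes (assumption: they only imitate the documented behaviour).
ok = [sys.executable, "-c", "import json; print(json.dumps({'corpus': 'lineara'}))"]
bad = [sys.executable, "-c",
       "import sys; sys.stderr.write('aegean: unknown corpus\\n'); sys.exit(1)"]

print(run_json(ok))
try:
    run_json(bad)
except RuntimeError as err:
    print(err)
```

Because --json prints one document and nothing else, `json.loads` on the whole of stdout is enough; no log lines need stripping first.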
Here are those exit codes, actually demonstrated:
aegean info lineara --json > /dev/null ; echo "exit=$?" # exit=0 (success)
aegean info bogus # aegean: unknown corpus 'bogus'; available: …
# exit=1 (domain error, message on stderr)
aegean info # exit=2 (usage error: missing argument)
aegean balance lineara HT13 --strict ; echo "exit=$?" # exit=1 (a total didn't balance)
A help summary is one -h/--help away on every command and group:
aegean --help
aegean greek --help
aegean greek scan --help
Windows note: if polytonic Greek shows up as boxes or ?, that's the terminal font, not pyaegean. Set PYTHONUTF8=1 and run chcp 65001 once to switch the console to UTF-8, or just use the --json output, which is always correct, and view it in an editor. See Getting Started.
aegean --version # pyaegean 0.8.5

| Group | What's in it |
|---|---|
| (top level) | repl info load show search query stats dispersion keyness cache balance cite export combine import geo sign bridge plot workbench |
| aegean greek … | normalize → tokenize → syllabify → accent → scan → tag → lemmatize → morph → parse, plus pipeline, gloss/gloss-nt, work/works/catalog/nt-books, and eval |
| aegean analyze … | distance align compare nearest assoc cooccur clusters structure hands |
| aegean data … | list fetch versions cache |
| aegean db … | build add search (SQLite + FTS5) |
| aegean ai … | providers translate gloss hypotheses ask extract eval (exploratory, key-gated) |
| aegean-mcp | a separate console script: serve the tools to AI agents over MCP |
If you're running several commands in a row, aegean repl opens an interactive
shell so you don't retype aegean each time. Inside it you type the subcommand
directly, with Tab-completion of commands and options and an arrow-key
history:
$ aegean repl
aegean interactive shell — commands without the 'aegean' prefix.
Tab completes, :help lists commands, :exit or Ctrl-D quits.
aegean> info lineara
…the same table aegean info lineara prints…
aegean> greek syllabify Ποσειδῶνι
Ποσειδῶνι → Πο-σει-δῶ-νι
aegean> stats lineara --top 3
…
aegean> :exit
Every line is dispatched through the same command tree, so a command behaves
exactly as it does on the regular command line — --json, -o, corpus files and
work ids, all of it. A mistyped command just prints its error and leaves the shell
open. :help (or help) prints the command list; :exit, quit, or Ctrl-D
leaves. The shell needs the [cli] extra (it ships prompt_toolkit).
When standard input isn't a terminal, the shell reads one command per line instead of prompting, so you can script it:
printf 'info lineara\nstats lineara --top 5\n' | aegean repl
Every corpus command takes a corpus id as its first argument. The bundled,
offline-from-install corpora are lineara, linearb, cypriot, cyprominoan,
and greek. Three more download to your cache on first use: damos (the full
~5,900-tablet DAMOS Linear B corpus), sigla (the SigLA Linear A dataset), and
nt (the Greek New Testament). Pass an unknown id and the error lists the valid
ones:
aegean info bogus
# aegean: unknown corpus 'bogus'; available: cypriot, cyprominoan, damos, greek, lineara, linearb, nt, sigla
Any corpus argument is more than just an id now. Wherever a command takes a corpus (and wherever aegean.read_corpus(spec) does in Python), you can pass: a registered id (lineara), a Greek work id (tlg0012.tlg001 → fetches the Iliad like aegean greek work), a path to a saved corpus (.json or .db you wrote earlier), or - to read corpus JSON from stdin. So these all work with no Python:
aegean db build tlg0012.tlg001 -o iliad.db # build a DB straight from a work id
aegean stats iliad.json # run stats on a corpus file you saved
aegean export tlg0012.tlg002 -f csv -o odyssey.csv # export a work to CSV
Work ids and saved files share one resolver, so anything you can build or export you can also stats, query, keyness, and so on.
For the meaning of document ids like HT13 and work ids like tlg0012.tlg001,
see Greek Works and Books and the Linear A /
Linear B pages.
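To make the four kinds of corpus argument concrete, here is a rough sketch of how a spec could be sorted into them. It is an illustration only: `classify_spec` is a hypothetical name, the work-id pattern is inferred from the ids shown on this page, and the order of the checks is an assumption rather than pyaegean's actual resolver.

```python
import re

# Ids copied from the "unknown corpus" error message above.
REGISTERED = {"cypriot", "cyprominoan", "damos", "greek",
              "lineara", "linearb", "nt", "sigla"}
# Assumption: work ids look like the examples on this page (tlg0012.tlg001).
WORK_ID = re.compile(r"^tlg\d{4}\.tlg\d{3}$")


def classify_spec(spec):
    """Return which of the four documented kinds a corpus argument is."""
    if spec == "-":
        return "stdin"
    if spec in REGISTERED:
        return "registered id"
    if WORK_ID.match(spec):
        return "work id"
    if spec.lower().endswith((".json", ".db")):
        return "corpus file"
    raise ValueError(f"unknown corpus {spec!r}")


for spec in ["lineara", "tlg0012.tlg001", "iliad.db", "-"]:
    print(spec, "->", classify_spec(spec))
```

A plain-text path such as john.txt falls through every branch and raises, which matches the documented behaviour that raw text has to be imported first.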
Size, provenance, license, and the one-line citation.
aegean info lineara --json
{
"corpus": "lineara",
"documents": 1721,
"words": 1381,
"tokens": 6406,
"signs_in_inventory": 344,
"source": "GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz",
"license": "Apache-2.0 (corpus JSON); facsimile imagery © École Française d'Athènes, not redistributed",
"citation": "Godart, L. & Olivier, J.-P. (1976–1985). Recueil des inscriptions en linéaire A. — https://github.com/mwenge/lineara.xyz"
}
Drop --json for a human-readable table. The same in Python:
import aegean
c = aegean.load("lineara")
len(c) # 1721
c.provenance.license # 'Apache-2.0 (corpus JSON); …'
Filter on --site, --period, --scribe, --support; without -o it lists
the matches (capped by --limit, default 20), with -o it writes a
round-trippable corpus JSON file.
aegean load lineara --site "Haghia Triada" # list the first 20 matches
aegean load lineara --site "Haghia Triada" -o ht.json # → "wrote 1110 documents to ht.json"
aegean show lineara HT13
HT13 site=Haghia Triada period=LMIB scribe=HT Scribe 8 support=Tablet
1: KA-U-DE-TA VIN 𐄁 TE 𐄁
2: RE-ZA 5 ¹⁄₂
3: TE-TU 56
4: TE-KI 27 ¹⁄₂
5: KU-ZU-NI 18
6: DA-SI-*118 19
7: I-DU-NE-SI 5
8: KU-RO 130 ¹⁄₂
--json gives the full metadata block plus lines as nested token lists.
* matches exactly one sign. Returns matching words with their frequencies.
aegean search lineara "KU-*-RO"
'KU-*-RO': 1 word(s)
┌──────────┬───────┐
│ word │ count │
├──────────┼───────┤
│ KU-MA-RO │ 1 │
└──────────┴───────┘
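The "exactly one sign" rule is easy to state in code. This is a small sketch of the matching rule as described, not the library's implementation; `matches` is a hypothetical helper that compares hyphen-separated signs position by position.

```python
def matches(pattern, word):
    """True when every sign matches and each bare * stands for exactly one sign."""
    p_signs = pattern.split("-")
    w_signs = word.split("-")
    if len(p_signs) != len(w_signs):
        return False  # * never absorbs zero signs or several signs
    return all(p == "*" or p == w for p, w in zip(p_signs, w_signs))


print(matches("KU-*-RO", "KU-MA-RO"))     # True
print(matches("KU-*-RO", "KU-RO"))        # False
print(matches("KU-*-RO", "KU-MA-RE-RO"))  # False
```

Only a bare * is treated as the wildcard here, so numbered sign labels such as *118 still compare literally.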
Build a query from repeated --where field=value rows. Rows AND together by
default; prefix the field with or: to OR a row, or ! to negate it.
--output-kind is inscriptions (default) or words.
aegean query lineara --where "site-is=Haghia Triada" --where "or:id-contains=ZA" \
--output-kind words --json
The result carries a description of the query and a citation for the exact
subset — so the precise result set behind a figure is one --json | jq .citation
away. List the queryable fields with --fields:
aegean query lineara --fields

| field | scope | kind |
|---|---|---|
| id-contains | inscription | text |
| site-is | inscription | site |
| scribe-is | inscription | scribe |
| period-is | inscription | period |
| support-is | inscription | support |
| has-image | inscription | boolean |
| has-annotation | inscription | boolean |
| ins-contains-word | inscription | word |
| word-contains | word | text |
| word-prefix | word | text |
| word-suffix | word | text |
| word-min-syllables | word | number |
| word-max-syllables | word | number |
| word-contains-sign | word | sign |
| word-cooccurs-with | word | word |
| word-sign-pattern | word | text |
Save the matched subset as a reusable corpus. Add --output/-o (with a
.json or .db extension) and query writes the matching inscriptions out as a
corpus you can feed straight back into any other command:
aegean query lineara --where "site-is=Zakros" -o zakros.json
# wrote 53 inscriptions to zakros.json
aegean stats zakros.json --top 3 # then analyse the saved subset
The saved file records a subset: query(…) → N documents provenance note, so the
exact filter behind it travels with the data. (-o only writes inscriptions —
use --output-kind words --json if you want the word list instead.)
Note: --limit only trims the human-readable table; --json always emits the full result set (so a pipeline never silently loses rows). Trim JSON with jq instead, e.g. … --json | jq '.words[:5]'.
Word frequencies by default; --signs counts individual signs.
aegean stats lineara --signs --top 5
┌──────┬───────┐
│ item │ count │
├──────┼───────┤
│ │ 552 │
│ 𐄁 │ 468 │
│ 1 │ 310 │
│ KU │ 307 │
│ KA │ 284 │
└──────┴───────┘
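The difference between the two counting modes can be sketched with the standard library. The word list is a toy sample made up for illustration, not corpus data, and splitting on hyphens is an assumption about how a word breaks into signs.

```python
from collections import Counter

words = ["KU-RO", "KI-RO", "KU-RO", "SA-RA₂"]  # toy sample, not corpus data

# Default mode: one count per whole word.
word_counts = Counter(words)

# --signs mode: one count per individual sign inside each word.
sign_counts = Counter(sign for word in words for sign in word.split("-"))

print(word_counts.most_common(1))  # [('KU-RO', 2)]
print(sign_counts.most_common(1))  # [('RO', 3)]
```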
Gries' DP: 0 = perfectly even across documents, 1 = concentrated in a few.
Give one item, or omit it to rank the corpus.
aegean dispersion lineara --top 5
┌───────────┬──────┬─────────────┬───────┬────────┐
│ item │ freq │ range/parts │ DP │ DPnorm │
├───────────┼──────┼─────────────┼───────┼────────┤
│ KU-RO │ 37 │ 34/559 │ 0.850 │ 0.851 │
│ KI-RO │ 16 │ 12/559 │ 0.938 │ 0.938 │
│ KU-PA₃-NU │ 8 │ 7/559 │ 0.948 │ 0.949 │
│ SA-RA₂ │ 20 │ 20/559 │ 0.948 │ 0.949 │
│ A-DU │ 10 │ 10/559 │ 0.963 │ 0.964 │
└───────────┴──────┴─────────────┴───────┴────────┘
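For readers who want the arithmetic behind the DP and DPnorm columns, here is a minimal sketch of the standard formulas: DP is half the summed absolute difference between each part's share of the corpus and its share of the item's occurrences, and DPnorm divides by one minus the smallest part share. How pyaegean delimits the "parts" is not stated here, so the inputs below are toy numbers.

```python
def gries_dp(part_sizes, part_freqs):
    """Return (DP, DPnorm) for one item.

    part_sizes: token count of each corpus part.
    part_freqs: the item's frequency in each part (same order).
    """
    total_size = sum(part_sizes)
    total_freq = sum(part_freqs)
    expected = [size / total_size for size in part_sizes]
    observed = [freq / total_freq for freq in part_freqs]
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    dp_norm = dp / (1 - min(expected))
    return dp, dp_norm


print(gries_dp([100, 100, 100, 100], [5, 5, 5, 5]))  # (0.0, 0.0): perfectly even
print(gries_dp([100, 100, 100, 100], [8, 0, 0, 0]))  # (0.75, 1.0): all in one part
```

The second call shows why the normalised value exists: with only four parts, plain DP cannot exceed 0.75, while DPnorm reaches 1.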
Compares either a metadata subset against the rest of the same corpus, or one
corpus against another (--reference). Reports log-likelihood (G²) and log-ratio
with a p-value.
aegean keyness lineara --site "Zakros" --top 5
┌────────────────────┬────────┬───────────┬───────┬───────────┬─────────┐
│ item │ target │ reference │ G2 │ log-ratio │ p │
├────────────────────┼────────┼───────────┼───────┼───────────┼─────────┤
│ *28B-NU-MA-RE │ 3/132 │ 0/1249 │ 14.15 │ +6.05 │ 0.00017 │
│ DU-RE-ZA-SE │ 3/132 │ 0/1249 │ 14.15 │ +6.05 │ 0.00017 │
│ SI-PI-KI │ 3/132 │ 0/1249 │ 14.15 │ +6.05 │ 0.00017 │
│ A-TI-KA-A-DU-KO-MI │ 2/132 │ 0/1249 │ 9.42 │ +5.56 │ 0.0021 │
│ DA-I-PI-TA │ 2/132 │ 0/1249 │ 9.42 │ +5.56 │ 0.0021 │
└────────────────────┴────────┴───────────┴───────┴───────────┴─────────┘
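The three statistics can be sketched with the standard library. This is a hedged illustration, not the library's code: G² is computed over the full 2×2 table, the p-value is the chi-square tail with one degree of freedom, and the 0.5 added to each count in the log-ratio is an assumption (the page does not state the smoothing). With those choices the sketch reproduces the first row of the table above.

```python
import math


def keyness(a, n1, b, n2):
    """Return (G2, log_ratio, p) for an item seen a times in n1 target words
    and b times in n2 reference words."""
    total = n1 + n2
    hits = a + b
    observed = [a, b, n1 - a, n2 - b]
    expected = [
        n1 * hits / total,
        n2 * hits / total,
        n1 * (total - hits) / total,
        n2 * (total - hits) / total,
    ]
    # Cells with an observed count of 0 contribute nothing to the sum.
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    g2 = max(g2, 0.0)  # guard against tiny negative rounding error
    # Assumption: add 0.5 to each count so a zero reference count stays finite.
    log_ratio = math.log2(((a + 0.5) / n1) / ((b + 0.5) / n2))
    p = math.erfc(math.sqrt(g2 / 2))  # chi-square survival function, 1 d.f.
    return g2, log_ratio, p


g2, log_ratio, p = keyness(3, 132, 0, 1249)
print(f"{g2:.2f} {log_ratio:+.2f} {p:.5f}")  # 14.15 +6.05 0.00017
```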
Save a result straight to a file.
stats, keyness, dispersion, and search all take --output/-o, and the format follows the extension: .json (the same document as --json), .csv (a plain table; stdlib only, no pandas), or .txt (the human view). It writes silently and prints nothing else:
aegean stats lineara --top 3 -o freq.csv
# freq.csv:
# item,count
# KU-RO,37
# SA-RA₂,20
# KI-RO,16
Checks stated totals (KU-RO in Linear A, TO-SO in Linear B) against the sum
of the listed items. Give one document, or omit it to sweep the whole corpus.
aegean balance lineara HT13
┌──────┬────────┬────────┬──────────┬──────┬──────────┐
│ doc │ marker │ stated │ computed │ diff │ balances │
├──────┼────────┼────────┼──────────┼──────┼──────────┤
│ HT13 │ KU-RO │ 130.5 │ 131.0 │ 0.5 │ NO │
└──────┴────────┴────────┴──────────┴──────┴──────────┘
--strict makes the command exit 1 whenever any checked total fails — handy in
a script. See Linear A for what KU-RO discrepancies actually mean.
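The HT13 row can be checked by hand. The sketch below redoes the sum with exact fractions, using the quantities transcribed from the HT13 listing shown earlier (lines 2 to 7) and its KU-RO line; it illustrates the check itself, not how pyaegean parses numerals.

```python
from fractions import Fraction

# Quantities from HT13 lines 2-7: 5 ½, 56, 27 ½, 18, 19, 5.
items = [Fraction(11, 2), Fraction(56), Fraction(55, 2),
         Fraction(18), Fraction(19), Fraction(5)]
stated = Fraction(261, 2)  # line 8: KU-RO 130 ½

computed = sum(items)
diff = abs(computed - stated)
balances = diff == 0

print(float(stated), float(computed), float(diff), balances)
# 130.5 131.0 0.5 False
```

Exact fractions avoid any float rounding on the half-units, so a reported difference of 0.5 is a real discrepancy on the tablet rather than an artefact of the arithmetic.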
aegean cite lineara --site "Haghia Triada"
# Godart, L. & Olivier, J.-P. (1976–1985). Recueil des inscriptions en linéaire A.
# — https://… [subset: filter(site='Haghia Triada') → 1110 of 1721 documents]
--style is plain (default), bibtex, or apa. Append a BibTeX entry to your
bibliography with aegean cite lineara --site Zakros --style bibtex >> paper.bib.
aegean export lineara -f csv -o lineara.csv # → "wrote 1721 documents to lineara.csv (csv)"
aegean export greek -f epidoc -o greek.xml # EpiDoc TEI
aegean export lineara -f sqlite -o lineara.db # same DB as `aegean db build`

| --format | output | needs |
|---|---|---|
| json | lossless, round-trippable corpus | core |
| csv | one row per document/token/word (--level) | core |
| parquet | same, columnar | [parquet] extra |
| epidoc | EpiDoc TEI XML | core |
| sqlite | queryable DB with FTS5 | core |
--level token (csv/parquet) emits one row per token and spreads per-token
annotations — the Greek NT's lemma / morph / Strong's / gloss — into columns.
Filters (--site etc.) apply before export.
Give two or more sources and one --output/-o (a .json or .db) and combine
merges them into a single saved corpus. Each source is resolved like any corpus
argument — an id, a saved .json/.db, a Greek work id, or - — so you can
stitch works, subsets, and bundled corpora together in one go:
aegean combine tlg0012.tlg001 tlg0012.tlg002 -o homer.db # all of Homer in one database
# wrote … documents to homer.db (merged 2 sources)
A run you can try offline, against the bundled corpora:
aegean combine lineara cypriot -o aegean-mix.json
# wrote 1723 documents to aegean-mix.json (merged 2 sources)
The merged corpus keeps a provenance that names every source: its citation
reads Merged corpus of: … listing each one. If two sources share a document id,
--on-conflict decides what happens: error (the default — stop and tell you),
first (keep the earliest), last (keep the latest), or suffix (keep both,
appending #2, #3, … to the later ids). The same in Python:
import aegean
merged = aegean.combine([aegean.load("lineara"), aegean.load("cypriot")])
# or from an existing corpus:
both = aegean.load("lineara").merge(aegean.load("cypriot"), dedupe="suffix")
just_a_few = aegean.load("lineara").subset(["HT13", "HT9a", "HT11a"])
Corpus.merge(*others, dedupe=…) takes the same four dedupe values as
--on-conflict; Corpus.subset(ids) pulls out a named slice. See
Greek Works and Books for the work ids you can combine.
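The four conflict policies are easiest to compare side by side. The following is a toy model over plain (id, document) pairs; `merge_docs` is a hypothetical function written for illustration and says nothing about how pyaegean stores documents internally.

```python
def merge_docs(sources, on_conflict="error"):
    """Merge lists of (doc_id, doc) pairs under one of the four policies."""
    merged = {}
    for source in sources:
        for doc_id, doc in source:
            if doc_id not in merged:
                merged[doc_id] = doc
            elif on_conflict == "error":
                raise ValueError(f"duplicate document id {doc_id!r}")
            elif on_conflict == "first":
                continue  # keep the earliest copy
            elif on_conflict == "last":
                merged[doc_id] = doc  # keep the latest copy
            elif on_conflict == "suffix":
                n = 2
                while f"{doc_id}#{n}" in merged:
                    n += 1
                merged[f"{doc_id}#{n}"] = doc  # keep both
            else:
                raise ValueError(f"unknown policy {on_conflict!r}")
    return merged


a = [("HT13", "copy A")]
b = [("HT13", "copy B")]
print(merge_docs([a, b], "first"))   # {'HT13': 'copy A'}
print(merge_docs([a, b], "last"))    # {'HT13': 'copy B'}
print(merge_docs([a, b], "suffix"))  # {'HT13': 'copy A', 'HT13#2': 'copy B'}
```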
Everything above analyses corpora that pyaegean already knows about. import turns
your own material — a plain-text file, a folder of text files, or a CSV — into a
real corpus you can then stats, search, query, export, and so on. It always
writes to --output/-o (a .json or .db), and the result works anywhere a corpus
is accepted. (Greek/Koine text is run through the Greek tokenizer, which strips
punctuation; any other --script splits on whitespace.)
aegean import john.txt -o john.json --script nt # one plain-text file → a corpus
# wrote 1 document(s) to john.json
aegean stats john.json --top 5 # then analyse it like any corpus
john.json: top 5 words
┌───────┬───────┐
│ item │ count │
├───────┼───────┤
│ ἦν │ 4 │
│ λόγος │ 3 │
│ ὁ │ 3 │
│ θεόν │ 2 │
│ καὶ │ 2 │
└───────┴───────┘
--split decides how a text becomes documents — whole (the default, one
document for the whole file), paragraph (one per blank-line-separated block), or
line (one per non-empty line). With more than one block the ids are numbered
<base>:1, <base>:2, …; the base id is the file's stem unless you override it with
--id:
aegean import john.txt -o john-lines.json --script nt --split line
# wrote 2 document(s) to john-lines.json
A folder imports every matching file into one corpus (each file's stem becomes a
document id, de-duplicated with a #2, #3, … suffix on collision). --glob
chooses which files; --split applies per file:
aegean import poems/ -o poems.db --split line # a directory of *.txt → a database
# wrote 2 document(s) to poems.db
aegean db search poems.db θεά
A CSV treats each row as a document: --text-col names the column holding the
text (default text), and --id-col names the column holding the id (otherwise ids
are <stem>:<row>):
aegean import verses.csv -o verses.json --script nt --text-col line --id-col id
# wrote 2 document(s) to verses.json
aegean show verses.json v2
# v2
# 1: καὶ ὁ λόγος ἦν πρὸς τὸν θεόν
--encoding (default utf-8) reads non-UTF-8 files. The same lives on aegean.io
in Python — from_text, from_text_file, from_text_dir, and from_csv (the CSV
one also takes meta_cols= to carry columns into document metadata):
from aegean import io
c = io.from_text("μῆνιν ἄειδε θεά", script_id="nt") # a raw string
c = io.from_text_file("john.txt", script_id="nt", split="line")
c = io.from_csv("verses.csv", text_col="line", id_col="id", script_id="nt")
import is the only way plain text enters a corpus: read_corpus and every
corpus argument still load only .json/.db files (and work ids), so feeding a raw
.txt straight to a command fails with a message telling you to import it first:
aegean stats john.txt --top 3
# aegean: unknown corpus 'john.txt'; expected a registered id (…), a Greek work id …,
# a path to a .json or .db corpus, or '-' …. To load plain text, import it first:
# `aegean import john.txt -o corpus.json` (or aegean.io.from_text_file / from_csv …)
# [stderr, exit 1]
aegean geo lineara
lineara: 52 located site(s) of 52
┌──────────────────┬───────┬───────┬───────────┐
│ site │ lat │ lon │ pleiades │
├──────────────────┼───────┼───────┼───────────┤
│ Haghia Triada │ 35.06 │ 24.79 │ 589672 │
│ Gournia │ 35.11 │ 25.79 │ 771100776 │
│ … │ │ │ │
└──────────────────┴───────┴───────┴───────────┘
Add -o sites.geojson to write GeoJSON instead of printing a table (that path
needs the [geo] extra). More on the map data in Geography.
Glyph, Unicode codepoint, sound value, and the raw attributes for a single sign in a script's inventory.
aegean sign lineara KU --json
{
"label": "KU",
"glyph": "𐙂",
"codepoint": "U+10642",
"phonetic": "ku",
"attrs": { "sharedWithLinearB": true, "total": 16, "confidence": 1, "altGlyphs": [] }
}
For the deciphered scripts (linearb, cypriot): the attested Greek reading plus
a gloss.
aegean bridge linearb po-me
# po-me → ποιμήν (shepherd)
This is the analysis memoization cache (distinct from the data download cache
under aegean data cache). It's off unless you enable it for the shell:
aegean cache
# analysis cache: off — set PYAEGEAN_ANALYSIS_CACHE=1 (or a path) to enable
Set PYAEGEAN_ANALYSIS_CACHE=1 (or a directory path) and expensive analyses
(dispersion, keyness, clustering) are reused across runs; aegean cache --clear
wipes it.
Draws a single figure and writes it to --output (.png/.svg/.pdf). Needs
the [viz] extra. The first argument is the figure kind:
| kind | what it draws |
|---|---|
| freq | top-N sign or word frequencies |
| dispersion | DP scatter (annotate the top N) |
| keyness | keyness bars (subset vs rest, or vs --reference) |
| network | co-occurrence network (--word for one word's ego network) |
| balance | accounting reconciliation chart |
| scansion | a metrical scansion grid for one Greek line |
pip install "pyaegean[viz]"
aegean plot keyness lineara --site Zakros -o zakros.png # → "wrote zakros.png"
aegean plot scansion "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον" -o scan.svg --meter hexameter
For scansion the second argument is the Greek line itself (- reads stdin); for
every other kind it's a corpus name.
aegean workbench # fetch the build (~3 MB, first use) and open it in your browser
aegean workbench --port 9000 # choose a port (default 8000); --no-browser to not open one
Fetches the sha256-pinned static build to the cache, then serves the browser UI
(the corpus, maps, and analysis modules) at http://localhost:8000/ until you
press Ctrl+C. If the Linear A facsimile imagery is already cached
(aegean data fetch lineara-images), the picture browser works too.
The full Ancient Greek pipeline from the shell. The zero-dependency stages run the moment you install; the heavier backends are opt-in flags (next section). Full explanations live on Greek NLP; metre is on Meters.
Every text argument accepts - for stdin, and every command takes --json.
aegean greek betacode "mh=nin a)/eide qea/" # μῆνιν ἄειδε θεά
aegean greek betacode "μῆνιν" --reverse # mh=nin (Unicode → Beta Code)
aegean greek normalize "λόγoς kai" --lenient # repairs OCR artifacts; warns on stderr
aegean greek strip "μῆνιν" # μηνιν (drop all diacritics)
aegean greek tokenize "ἐν ἀρχῇ ἦν ὁ λόγος." # one token per line (--sentences to split sentences)
aegean greek syllabify εἰσφέρω # εἰσ-φέ-ρω
aegean greek accent λόγος # acute, paroxytone
aegean greek quantities πατρός # πα:common | τρός:heavy
aegean greek scan "ἄνδρα μοι ἔννεπε, Μοῦσα, …" # dactylic hexameter
aegean greek ipa "λόγος" --period koine # loɣos (--period attic|koine)
aegean greek gloss-nt λόγος # Koine gloss, bundled Dodson lexicon (no download)
Real runs:
aegean greek betacode "mh=nin a)/eide qea/"
# μῆνιν ἄειδε θεά
aegean greek syllabify εἰσφέρω
# εἰσφέρω → εἰσ-φέ-ρω
aegean greek quantities πατρός
# πατρός → πα:common | τρός:heavy
aegean greek normalize "λόγoς kai" --lenient
# aegean: lenient normalize: repaired 1 Latin letter(s) in Greek words (o→ο) [stderr]
# λόγος kai
aegean greek ipa "λόγος" --period koine
# loɣos
accent prints a small table; the Python equivalent of the same fact:
from aegean import greek
greek.accentuation("λόγος").classification # 'paroxytone'
greek.betacode_to_unicode("mh=nin") # 'μῆνιν'
scan checks a line against a fixed metrical template and prints the pattern,
the feet, and the caesura — or exits 1 with the reason if the line declines.
Synizesis is lexical, not guessed: a line that only scans via synizesis on a word
outside the curated lexicon declines rather than inventing a fit.
aegean greek scan "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ"
# —⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—×
# hexameter: dactyl, dactyl, dactyl, dactyl, dactyl, final; caesura: trochaic
aegean greek scan "ὦ κοινὸν αὐτάδελφον Ἰσμήνης κάρα" --meter trimeter
# ×—⏑—|×—⏑—|×—⏑×
# trimeter: metron, metron, metron; caesura: hephthemimeral
aegean greek scan "λόγος"
# aegean: line does not scan as dactylic hexameter (2 syllables): 'λόγος' [stderr, exit 1]
--meter accepts:

| name | metre |
|---|---|
| hexameter | dactylic hexameter (Homer); the default |
| pentameter | elegiac pentameter (the second line of an elegiac couplet) |
| trimeter | iambic trimeter (tragic/comic dialogue) |
| glyconic · pherecratean · adonean | aeolic cola |
| sapphic_hendecasyllable | the Sapphic eleven-syllable line |
| alcaic_hendecasyllable · alcaic_enneasyllable · alcaic_decasyllable | the Alcaic stanza lines |
--json adds feet, syllables, quantities, caesura, and an ambiguous
flag. See Meters for what's in scope and what isn't.
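The pattern line printed by scan can be read back into foot names with a few lines of code. This is a sketch over the text output only: it assumes the three symbols are U+2014 (long), U+23D1 (short), and U+00D7 (anceps), and the spondee entry is an assumption about how a spondaic foot would be printed, since the hexameter example above contains only dactyls.

```python
LONG, SHORT, ANCEPS = "\u2014", "\u23d1", "\u00d7"  # symbols used in the pattern line

FEET = {
    LONG + SHORT + SHORT: "dactyl",
    LONG + LONG: "spondee",  # assumption: not shown in the example output
    LONG + ANCEPS: "final",
}


def name_feet(pattern):
    """Split a hexameter pattern on | and name each foot."""
    return [FEET.get(foot, "unknown") for foot in pattern.split("|")]


# The pattern from the first hexameter example: five dactyls and the final foot.
line = "|".join([LONG + SHORT + SHORT] * 5 + [LONG + ANCEPS])
print(name_feet(line))
# ['dactyl', 'dactyl', 'dactyl', 'dactyl', 'dactyl', 'final']
```

For anything beyond a quick check, the feet array in the --json output is the more reliable source, since it does not depend on parsing display characters.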
echo "μῆνιν ἄειδε θεά" | aegean greek lemmatize -
# μῆνιν μῆνις
# ἄειδε ἀείδω
# θεά θεά
aegean greek morph λόγον
# λόγος [NOUN acc sg masc]
# λόγος [NOUN acc sg fem]
# λόγος [NOUN nom sg neut]
# λόγος [NOUN acc sg neut]
# λόγος [NOUN voc sg neut]
aegean greek tag "ἐν ἀρχῇ ἦν ὁ λόγος." # UPOS per token
aegean greek pipeline "ἐν ἀρχῇ ἦν ὁ λόγος." --json # per-token records in one call
A lemma that the lexicon doesn't know is still returned, marked (fallback) (and
"known": false in JSON), so you can tell a real hit from a heuristic guess.
aegean greek gloss-nt λόγος
# a word, speech, divine utterance, analogy
aegean greek gloss-nt λόγος --full
# λόγος (G3056): a word, speech, divine utterance, analogy.
aegean greek gloss-nt 3056 --strongs # look up by Strong's number
gloss-nt uses the bundled CC0 Dodson lexicon, so there is no download. The classical
gloss command uses the larger LSJ index instead and activates it on first use
(~270 MB, or ~15 MB if lsj-index is fetched). See the backend section below.
Each flag stands in for a use_*() activation in the Python API. The first time
you use one, it may download a model or build an index to the cache (a note goes
to stderr); after that, everything is offline.
| flag | activates | first-use cost |
|---|---|---|
| --treebank | the Perseus AGDT lexicon | ~75 MB fetch |
| --tagger | the generalizing POS tagger | trains from the AGDT |
| --lemmatizer | the edit-tree lemmatizer | trains from the AGDT |
| --parser | the pure-Python arc-eager dependency parser | trains from the AGDT |
| --neural-lemmatizer | the GreTa seq2seq lemmatizer ([neural]) | ~232 MB model |
| --neural | the joint neural pipeline: best tagger/parser/lemmatizer ([neural]) | ~518 MB model |
| --lsj | LSJ glossing (also set by greek gloss) | ~270 MB (or ~15 MB index) |
# heavy — fetches the model on first use, then offline:
aegean greek pipeline "ἐν ἀρχῇ ἦν ὁ λόγος." --neural
aegean greek parse "ἐν ἀρχῇ ἦν ὁ λόγος" --neural # UD dependency tree
aegean greek tag "…" --treebank --tagger # AGDT lookup + perceptron tagger
work fetches a real text from Perseus canonical-greekLit / First1KGreek
(CC BY-SA, commit-pinned, cached once) and parses it into a corpus. works lists
a curated, verified catalog of 25 ids; catalog searches the full ~1,800-work
discovery index (offline metadata); nt-books lists the 27 NT books and the names
the loaders accept. The full id reference is on
Greek Works and Books.
aegean greek works
# id author title
# tlg0012.tlg001 Homer Iliad
# tlg0012.tlg002 Homer Odyssey
# tlg0011.tlg002 Sophocles Antigone
# tlg0059.tlg030 Plato Republic
# … (curated subset — the full canon is at https://scaife.perseus.org)
# heavy (network on first use):
aegean greek work tlg0012.tlg001 # the Iliad: 24 books, ~127k tokens
aegean greek work tlg0012.tlg001 --ref 1.1-1.50 # just book 1, lines 1–50
aegean greek work tlg0012.tlg001 -o iliad.json # save as a corpus file
--ref selects a section: 1 (book), 1.2 (chapter), or 1.1-1.50 (line
range). --source is auto/perseus/first1k; --edition picks a specific
edition file.
catalog is the full discovery index behind works. Where works lists 25
curated highlights, catalog searches the complete bundled metadata for every
work with a Greek edition in Perseus canonical-greekLit + First1KGreek — 1,778 works
in all (768 from perseus, 1,010 from first1k). It's offline and instant: just
metadata, no fetch. Any id it prints goes straight to aegean greek work.
aegean greek catalog --author plato --limit 8
Greek works (39 matches)
┌────────────────┬────────┬────────────┬────────────────────┬─────────┐
│ id │ author │ title │ greek │ src │
├────────────────┼────────┼────────────┼────────────────────┼─────────┤
│ tlg0059.tlg001 │ Plato │ Euthyphro │ Εὐθύφρων │ perseus │
│ tlg0059.tlg002 │ Plato │ Apology │ Ἀπολογία Σωκράτους │ perseus │
│ tlg0059.tlg003 │ Plato │ Crito │ Κρίτων │ perseus │
│ tlg0059.tlg004 │ Plato │ Phaedo │ Φαίδων │ perseus │
│ tlg0059.tlg005 │ Plato │ Cratylus │ Κρατύλος │ perseus │
│ tlg0059.tlg006 │ Plato │ Theaetetus │ Θεαίτητος │ perseus │
│ tlg0059.tlg007 │ Plato │ Sophist │ Σοφιστής │ perseus │
│ tlg0059.tlg008 │ Plato │ Statesman │ Πολιτικός │ perseus │
└────────────────┴────────┴────────────┴────────────────────┴─────────┘
… and 31 more — narrow with --author/--title, or --limit 0 to list all (-o to save).
Load one with, e.g.: aegean greek work tlg0012.tlg001 --ref 1.1-1.10
The bare QUERY argument is a catch-all substring over id, author, English title,
and Greek title; --author/-a, --title/-t (matches English or Greek), and
--source perseus|first1k are the targeted filters (all case-insensitive, all
combine with AND). --limit/-n caps the table (0 = all), --json emits the full
result set, and --output/-o saves it (.json/.csv/.txt by extension):
aegean greek catalog herodotus --json
[
{
"id": "tlg0016.tlg001",
"author": "Herodotus",
"title": "Histories",
"greek_title": "Ἱστορίαι",
"source": "perseus"
},
{
"id": "tlg0062.tlg056",
"author": "Lucian of Samosata",
"title": "Herodotus",
"greek_title": "Ἡρόδοτος ἢ Ἀετίων",
"source": "perseus"
}
]
aegean greek catalog --author aristophanes --source perseus -o aristophanes.csv
# wrote 11 works to aristophanes.csv (id,author,title,greek_title,source; one row per work)
Coverage is exactly what those open repositories hold at the pinned commit, so some
authors are genuinely absent upstream and therefore here too: aegean greek catalog sappho honestly returns nothing rather than inventing an entry:
aegean greek catalog sappho
# No works match. Try a looser filter, or browse https://scaife.perseus.org
The same in Python is greek.catalog(query=None, *, author=None, title=None, source=None), returning a list of {id, author, title, greek_title, source} dicts;
greek.popular_works() stays the curated 25.
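The filter rules (a catch-all query, targeted author/title/source filters, case-insensitive, combined with AND) can be modelled over the two records from the herodotus example above. `filter_catalog` is a hypothetical stand-in that mirrors the documented signature; it is a sketch of the described semantics, not the library's code.

```python
RECORDS = [  # the two records from the herodotus example above
    {"id": "tlg0016.tlg001", "author": "Herodotus", "title": "Histories",
     "greek_title": "Ἱστορίαι", "source": "perseus"},
    {"id": "tlg0062.tlg056", "author": "Lucian of Samosata", "title": "Herodotus",
     "greek_title": "Ἡρόδοτος ἢ Ἀετίων", "source": "perseus"},
]


def filter_catalog(records, query=None, *, author=None, title=None, source=None):
    """Keep the records that satisfy every filter that was given."""
    def has(needle, *fields):
        return any(needle.casefold() in field.casefold() for field in fields)

    out = []
    for r in records:
        if query and not has(query, r["id"], r["author"], r["title"], r["greek_title"]):
            continue
        if author and not has(author, r["author"]):
            continue
        if title and not has(title, r["title"], r["greek_title"]):
            continue
        if source and r["source"] != source.casefold():
            continue
        out.append(r)
    return out


print([r["id"] for r in filter_catalog(RECORDS, "herodotus")])
# ['tlg0016.tlg001', 'tlg0062.tlg056']
print([r["id"] for r in filter_catalog(RECORDS, author="herodotus")])
# ['tlg0016.tlg001']
```

The contrast between the two calls is the reason the targeted filters exist: the bare query also matches Lucian's work titled Herodotus, while --author does not.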
aegean greek eval TARGET runs the official evaluators against fetched gold data —
heavy, but it reproduces pyaegean's measured accuracy. Targets: ud, proiel,
nt, tagger, lemmatizer, parser.
# heavy: fetches gold data and the model
aegean greek eval ud --treebank perseus --split test --neural
The exact figures and how they were measured are on Greek NLP and Limitations.
Exploratory surface analyses over the (largely undeciphered) Aegean material: evidence to weigh, not conclusions. Full method notes are on Analysis.
aegean analyze distance KU-RO KI-RO
# KU-RO ↔ KI-RO: 0.200
aegean analyze align KU-RO KI-RO # per-position match / vowel / same-class / far / gap
compare romanizes two words from possibly different scripts and aligns them by
sound; nearest ranks a corpus's words by closeness to a query word.
aegean analyze compare po-me ποιμήν
# po-me [linearb] → pome ποιμήν [greek] → poimēn
# similarity 0.62 (distance 0.383)
# a b op
# p p match
# o o match
# · i ins
# m m match
# e ē sub-vowel
# · n ins
aegean analyze nearest qa-si-re-u greek --top 5 --json
# [{"candidate": "ἱστορίης", "distance": 0.525}, {"candidate": "ἄειδε", "distance": 0.571}, …]
--script-a/--script-b choose the scripts (greek · lineara · linearb ·
cypriot); --fold-aspiration maps θ/φ/χ → t/p/k, which is fairer against
defective syllabic spelling. These numbers are exploratory — read the alignment
and the ranking, not the absolute distance.
aegean analyze assoc lineara KU-RO KI-RO # χ², log-likelihood, Fisher, PMI over shared documents
aegean analyze cooccur lineara KU-RO # what shares a tablet with KU-RO, ranked
aegean analyze clusters lineara # stem + productive-suffix clusters (exploratory)
aegean analyze structure lineara # accounting / libation / list / text / other census
aegean analyze structure lineara HT13 # classify one document
aegean analyze hands damos # scribal-hand profiles (needs a hand per document)
aegean analyze hands damos --hand "Knossos 103" # one hand's characteristic vocabulary (keyness)
hands needs a corpus that records a scribe per document. DAMOS does, so it
fetches on first use; the bundled lineara records HT scribes too.
Save any of these to a file.
assoc, cooccur, clusters, and hands all take --output/-o, with the format set by the extension: .json, .csv (stdlib, no pandas), or .txt:
aegean analyze cooccur lineara KU-RO -o ku-ro-neighbours.json
aegean analyze clusters lineara -o clusters.csv
The fetch-to-cache layer: list what can be downloaded, fetch it (sha256-verified), pin versions for a paper, and inspect the cache.
aegean data list # the fetchable datasets (sizes, licenses)
aegean data fetch grc-joint # pre-fetch (e.g. before going offline)
aegean data versions --json > data-versions.json # pin every dataset's sha256 for reproducibility
aegean data cache # cache location + contents (override: PYAEGEAN_CACHE)
aegean data list shows the full registry. The fetchable datasets (all
downloaded on demand, never bundled):

| name | what | license |
|---|---|---|
| agdt-derived | prebuilt AGDT lexicon + tagger/lemmatizer/parser models | CC BY-SA 3.0 (Perseus AGDT) |
| grc-joint | the joint tagger-parser-lemmatizer model (~518 MB; the [neural] extra) | CC BY-SA 4.0 |
| grc-lemma-neural | the GreTa seq2seq lemmatizer (~232 MB; the [neural] extra) | CC BY-SA 4.0 |
| lsj-index | prebuilt LSJ lemma→entry index (~15 MB) | CC BY-SA 4.0 (Perseus) |
| damos-corpus | DAMOS Linear B corpus (~5,900 tablets); aegean.load('damos') | CC BY-NC-SA 4.0 |
| sigla-corpus | SigLA Linear A dataset (781 docs); aegean.load('sigla') | CC BY-NC-SA 4.0 |
| nt-corpus | Greek New Testament (Nestle 1904; ~137,800 tokens); aegean.load('nt') | CC0-1.0 |
| lineara-images | 3,368 facsimile/photo files (~116 MB) | academic reference only |
| linearb-corpus | a bring-your-own Linear B export (no default source) | per your source |
| workbench-app | the prebuilt workbench web app (~3 MB); served by aegean workbench | Apache-2.0 |
aegean data versions --json is the reproducibility manifest — every bundled and
fetched dataset with its sha256. See Data & Provenance for
the licensing details and why nothing non-redistributable is bundled.
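If you want to re-check pinned digests yourself, the idea can be sketched with the standard library alone. This is a minimal sketch, not pyaegean's own verification code: the helper names `sha256_of` and `mismatches` are hypothetical, and the `pinned` mapping (file name to hex digest) is an assumed layout, since the exact shape of data-versions.json is not shown on this page.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatches(pinned: dict[str, str], root: Path) -> list[str]:
    """Names under `root` whose current sha256 differs from the pinned one.

    `pinned` is a hypothetical {file name: hex digest} mapping; adapt it to
    whatever structure your saved manifest actually has.
    """
    return [name for name, want in pinned.items()
            if sha256_of(root / name) != want]
```

An empty list from `mismatches` means every listed file still matches the digest you pinned.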
Build a queryable SQLite database from any corpus (documents + tokens + an FTS5 full-text index) and search it.
aegean db build lineara -o lineara.db # → "wrote 1721 documents to lineara.db"
aegean db search lineara.db KU-RO --limit 3
'KU-RO' in lineara.db
┌───────┬─────┬───────┐
│ doc   │ pos │ text  │
├───────┼─────┼───────┤
│ HT9a  │ 25  │ KU-RO │
│ HT9b  │ 20  │ KU-RO │
│ HT11a │ 7   │ KU-RO │
└───────┴─────┴───────┘
db build resolves its corpus like anything else — so aegean db build tlg0012.tlg001 -o iliad.db builds a database straight from a Greek work id.
--no-fts skips the full-text index. aegean export CORPUS -f sqlite -o file.db
writes the same database. Load it back in Python with Corpus.from_sql(path), or
stream it with aegean.db.stream(path).
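Because the output is an ordinary SQLite file, it can also be opened with Python's built-in sqlite3 module. The sketch below only lists the tables and deliberately assumes nothing about the schema, which this page does not spell out; `table_names` is a hypothetical helper, not part of pyaegean.

```python
import sqlite3
from contextlib import closing


def table_names(db_path: str) -> list[str]:
    """List the tables in a SQLite file without assuming a schema."""
    # closing() guarantees the connection is closed; sqlite3's own context
    # manager only manages transactions.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Calling `table_names("lineara.db")` on a database built as above is a quick way to see what is inside before writing queries. An FTS5 index usually appears as several extra tables, because SQLite stores its shadow tables alongside the virtual table.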
db add upserts documents into a database you already built: a document whose id
already exists is replaced, new ids are added, and the FTS5 index is refreshed.
The source resolves like any corpus argument (id, .json/.db, work id, or -):
aegean db build lineara -o aegean.db # → "wrote 1721 documents to aegean.db"
aegean db add cypriot -o aegean.db # → "added/updated 2 documents in aegean.db"
Mixing scripts is allowed and noted on stderr (the database's script id becomes
mixed). The Python equivalents take an append=True flag:
corpus.to_sql("aegean.db", append=True) # or aegean.db.to_sqlite(corpus, "aegean.db", append=True)
The generative layer. Every result here is exploratory: a labeled model
hypothesis carrying its grounding evidence, never a citable fact, and never a
"decipherment." It needs a provider SDK (an extra such as
pip install "pyaegean[anthropic]") and that provider's API key in your
environment. Without a key, the command exits 1 with a clear message — it never
silently calls out.
aegean ai providers
# anthropic
# gemini
# grok
# openai
The commands (each takes --provider / --model, and most take --trace):
aegean ai translate "ἐν ἀρχῇ ἦν ὁ λόγος" # grounded hybrid translation
aegean ai translate "KU-RO 130" --script lineara # exploratory (undeciphered!)
aegean ai gloss "μῆνιν ἄειδε θεά" # interlinear word-by-word gloss
aegean ai hypotheses "A-TA-I-*301-WA-JA" --corpus lineara # cautious decipherment hypotheses
aegean ai ask "What is KU-RO?" --corpus lineara --trace # answer strictly from grounding
aegean ai extract "OLE S 1" --fields commodity,amount # structured JSON, ready for jq
aegean ai eval --provider anthropic # grounding-fidelity eval
--corpus NAME grounds the answer on that corpus's frequent words. --trace
prints the grounding provenance under the answer — the local corpus / lexicon /
analysis facts the model was given, grouped by source — so you can audit exactly
what it was (and wasn't) told. extract always prints JSON, so it pipes straight
into jq.
Save the output, label and all. translate, gloss, hypotheses, ask, and
extract take --output/-o. A .json file carries the text plus its provenance
and grounding evidence; a .txt file is the labeled text. The exploratory label
stays attached on disk — a saved result never loses the "this is a hypothesis, not
a finding" framing:
aegean ai gloss "μῆνιν ἄειδε θεά" -o gloss.json # text + provenance + grounding
aegean ai ask "What is KU-RO?" --corpus lineara -o answer.txt
In Python the same lives on ExploratoryResult: .to_dict(), .to_json(path),
and ExploratoryResult.from_dict(data) round-trip a result through disk with its
label and grounding intact. The full design and the meaning of "grounded" are on
AI Layer; the hard limits are on Limitations.
A separate console script (the [mcp] extra) that exposes the toolkit to AI
agents — Claude Code and other MCP clients — over stdio, so an agent can use
pyaegean without writing Python.
pip install "pyaegean[mcp]"
aegean-mcp # serve the read/analysis tools over stdio
It offers a small set of read/analysis tools: list and inspect corpora, wildcard sign search, accounting reconciliation, the Greek pipeline, verse scansion, and Koine glossing.
Reconcile every Linear A account and keep only the failures:
aegean balance lineara --json | jq '[.[] | select(.balances | not)]' > discrepancies.json
Lemmatize a file of Greek, one lemma per line:
cat chapter.txt | aegean greek lemmatize - --json | jq -r '.[].lemma'
Scan a poem line by line, keeping only the lines that scan:
while read -r line; do aegean greek scan "$line" --json 2>/dev/null | jq -r .pattern; done < poem.txt
Map a word's distribution and cite the subset you used:
aegean geo lineara --output sites.geojson
aegean cite lineara --site "Zakros" --style bibtex >> paper.bib
Build one searchable database of all of Homer straight from work ids, then keep growing it, with no Python anywhere:
aegean combine tlg0012.tlg001 tlg0012.tlg002 -o homer.db # Iliad + Odyssey (see Greek Works and Books for ids)
aegean db add tlg0011.tlg002 -o homer.db # add Sophocles' Antigone later
aegean db search homer.db μῆνιν --limit 3
Save a query subset once, reuse it everywhere:
aegean query lineara --where "site-is=Zakros" -o zakros.json
aegean keyness zakros.json --reference lineara --top 5 -o zakros-keyness.csv
More worked pipelines are on Recipes.
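If jq is not available, the filter from the first recipe can be approximated in Python. This is a sketch under one assumption taken from that jq expression: `aegean balance lineara --json` prints a JSON array of objects, each with a `balances` field. Every other field name, and the helper `failing_accounts` itself, is hypothetical.

```python
import json


def failing_accounts(balance_json: str) -> list[dict]:
    """Mirror jq's `select(.balances | not)`: keep false or null only."""
    records = json.loads(balance_json)
    kept = []
    for rec in records:
        value = rec.get("balances")
        # jq's `not` is true only for false and null, so test identity
        # rather than Python truthiness (which would also catch 0 or "").
        if value is False or value is None:
            kept.append(rec)
    return kept
```

Feed it the text that the command wrote to stdout, then dump the result with `json.dumps(..., ensure_ascii=False)` so that any Greek or sign names stay readable.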
- The AI layer is exploratory. Translations, glosses, and "hypotheses" for undeciphered material are labeled model output with grounding, not findings. The Aegean scripts remain undeciphered. See Limitations.
- Heavy commands download on first use. Anything marked heavy here (--neural, greek work, greek eval, gloss, the fetched corpora) pulls data to the cache the first time, with a note on stderr; afterwards it's offline. Pre-fetch with aegean data fetch before going offline.
- --json is the contract; the table view is for humans. Don't parse the rich tables; pass --json and use jq. --limit trims only the human view.
- Metre and accuracy are bounded. Lyric metres beyond the fixed aeolic templates are out of scope, and the trainable backends have measured ceilings, both documented on Meters and Limitations.
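The --json contract and the exit codes described at the top of this page also make the CLI easy to drive from a script. The wrapper below is a minimal sketch, not part of pyaegean: `run_json` and `CommandFailed` are hypothetical names, and it assumes only what this page states, namely one JSON document on stdout and a non-zero exit code with a message on stderr on failure.

```python
import json
import subprocess


class CommandFailed(RuntimeError):
    """Raised when the wrapped command exits non-zero."""


def run_json(argv: list[str]):
    """Run a command that prints one JSON document on stdout and parse it."""
    proc = subprocess.run(argv, capture_output=True, text=True, encoding="utf-8")
    if proc.returncode != 0:
        # Exit 1 is a domain error and exit 2 a usage error; the one-line
        # explanation is on stderr, so surface it in the exception.
        raise CommandFailed(f"exit {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


# Example call (requires the aegean command on PATH):
# info = run_json(["aegean", "info", "lineara", "--json"])
```

Passing the arguments as a list avoids shell quoting problems with Greek text or hyphenated sign sequences.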
For the terse one-page index of every command and flag, see the CLI Cheatsheet.