CLI Cheatsheet

This is the dense, one-page index of every aegean command and its key flags, each with a copy-pasteable example. It's the lookup card you keep open while you work; the CLI page is the guided tour that explains each group with prose. If you've never used a terminal, start with Getting Started.

pip install "pyaegean[cli]"     # adds typer + rich; the core library stays zero-dependency
aegean --help                   # the command map
aegean --version                # pyaegean 0.8.1

If you only ran pip install pyaegean, the library works but the aegean command isn't installed until you add the [cli] extra.

Conventions that hold everywhere

Convention	What it does	Example
`--json`	Print one machine-readable JSON document and nothing else, so results pipe into `jq`, files, or programs. Greek stays readable (`ensure_ascii=False`).	`aegean info lineara --json`
`-` reads stdin	Anywhere a command takes a `TEXT` argument, `-` reads it from standard input, so commands compose.	`echo "μῆνιν" \| aegean greek lemmatize -`
A corpus arg is flexible	Every corpus argument resolves the same way: a registered id, a Greek work id (`tlg0012.tlg001` → fetched & parsed), a path to a saved `.json` or `.db` corpus, or `-` for a `Corpus.to_json()` document on stdin. So `aegean stats iliad.json` and `aegean export tlg0012.tlg002 -f csv -o odyssey.csv` work with no Python.	`aegean stats lineara.json`
Exit codes	`0` success · `1` a domain error (one line on stderr, prefixed `aegean:`) · `2` a usage error.	`aegean greek scan "λόγος"` → exit `1`

Every command and group answers -h / --help. The bundled, offline-from-install corpora are lineara, linearb, cypriot, cyprominoan, greek; three more download to the cache on first use: damos, sigla, nt. The same read_corpus(spec) in Python resolves any of those forms: aegean.read_corpus("iliad.json").

At a glance

Group	Commands
(top level)	`info` `load` `show` `search` `query` `stats` `dispersion` `keyness` `cache` `balance` `cite` `export` `combine` `import` `geo` `sign` `bridge` `plot` `workbench`
`greek`	`normalize` `betacode` `strip` `tokenize` `syllabify` `accent` `quantities` `scan` `ipa` `tag` `lemmatize` `morph` `parse` `gloss` `gloss-nt` `pipeline` `work` `works` `catalog` `nt-books` `eval`
`analyze`	`distance` `align` `compare` `nearest` `assoc` `cooccur` `clusters` `structure` `hands`
`data`	`list` `fetch` `versions` `cache`
`db`	`build` `add` `search`
`geo`	(top-level command — coordinates / GeoJSON)
`ai`	`translate` `gloss` `hypotheses` `ask` `extract` `eval` `providers`
`workbench`	(top-level command — serve the web UI)
`aegean-mcp`	separate console script — serve the tools to AI agents over MCP

Corpus & analysis (top level)

Every corpus command takes a corpus as its first argument (an id, a Greek work id, a .json/.db file, or -) and accepts --json. The analysis commands (stats, dispersion, keyness, search) also take -o/--output to write the result to a file — .json / .csv (stdlib, no pandas) / .txt by extension.

Command	What it does	Key flags	One-line example
`info`	Corpus overview: size, provenance, license, citation	`--json`	`aegean info lineara`
`load`	Filter by metadata; list matches or export them	`--site --period --scribe --support -o/--output --limit`	`aegean load lineara --site "Haghia Triada"`
`show`	One document: metadata + line-by-line tokens	`--json`	`aegean show lineara HT13`
`search`	Words matching a wildcard sign pattern (`*` = one sign)	`-o/--output --json`	`aegean search lineara "KU-*-RO"`
`query`	Compound-query engine (AND/OR/NOT predicates)	`--where --output-kind --fields -o/--output --limit`	`aegean query lineara --where "site-is=Zakros"`
`stats`	Frequency table of words (or `--signs`)	`--signs --top -o/--output`	`aegean stats lineara --signs --top 5`
`dispersion`	How evenly an item spreads (Gries' DP)	`--signs --top --min-frequency -o/--output`	`aegean dispersion lineara KU-RO`
`keyness`	Characteristic items vs a reference (G² + log-ratio)	`--reference --site/... --signs --top --min-target -o/--output`	`aegean keyness lineara --site Zakros`
`balance`	Accounting reconciliation (KU-RO / TO-SO vs items)	`--strict --json`	`aegean balance lineara HT13`
`cite`	Cite the corpus, or the exact filtered subset	`--style --site/...`	`aegean cite lineara --site Zakros --style bibtex`
`export`	Export to JSON / CSV / Parquet / EpiDoc / SQLite	`-f/--format -o/--output --level --site/...`	`aegean export lineara -f csv -o lineara.csv`
`combine`	Merge several corpora into one and save it	`-o/--output --on-conflict`	`aegean combine tlg0012.tlg001 tlg0012.tlg002 -o homer.db`
`import`	Import your own text (`.txt` / folder / `.csv`) into a corpus	`-o/--output --script --split --id --glob --text-col --id-col --encoding`	`aegean import john.txt -o john.json --script nt`
`geo`	Find-site coordinates; GeoJSON with `-o`	`--level -o/--output --json`	`aegean geo lineara`
`sign`	Look up one sign: glyph, codepoint, sound value	`--json`	`aegean sign lineara KU --json`
`bridge`	Read a deciphered syllabic word as Greek	`--json`	`aegean bridge linearb po-me`
`cache`	Inspect (or `--clear`) the opt-in analysis cache	`--clear --json`	`aegean cache`
`plot`	Draw one figure to a file (`[viz]` extra)	`-o/--output --signs --top --word --meter --dpi …`	`aegean plot keyness lineara --site Zakros -o k.png`
`workbench`	Serve the Linear A Research Workbench locally	`-p/--port --no-browser --force`	`aegean workbench`

Verified examples

aegean info lineara --json

{
  "corpus": "lineara",
  "documents": 1721,
  "words": 1381,
  "tokens": 6406,
  "signs_in_inventory": 344,
  "source": "GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz",
  "license": "Apache-2.0 (corpus JSON); facsimile imagery © École Française d'Athènes, not redistributed",
  "citation": "Godart, L. & Olivier, J.-P. (1976–1985). Recueil des inscriptions en linéaire A. — https://github.com/mwenge/lineara.xyz"
}

Same fact in Python (every corpus command has an API equivalent):

import aegean
c = aegean.load("lineara")
len(c)                  # 1721
c.provenance.license    # 'Apache-2.0 (corpus JSON); …'

aegean search lineara "KU-*-RO"
# 'KU-*-RO': 1 word(s)   →   KU-MA-RO  (count 1)

aegean stats lineara --signs --top 5
# 𐝫 552 · 𐄁 468 · 1 310 · KU 307 · KA 284

aegean dispersion lineara --top 3
# KU-RO  freq 37  range 34/559  DP 0.850 DPnorm 0.851
# KI-RO  freq 16  range 12/559  DP 0.938 DPnorm 0.938

aegean keyness lineara --site Zakros --top 3
# *28B-NU-MA-RE  3/132 vs 0/1249  G2 14.15  log-ratio +6.05  p 0.00017
# (or compare two corpora: aegean keyness nt --reference greek)

aegean balance lineara HT13
# HT13  KU-RO  stated 130.5  computed 131.0  diff 0.5  balances NO

aegean cite lineara --site "Haghia Triada"
# Godart, L. & Olivier, J.-P. (1976–1985). Recueil… [subset: filter(site='Haghia Triada') → 1110 of 1721 documents]

aegean sign lineara KU --json
# {"label":"KU","glyph":"𐙂","codepoint":"U+10642","phonetic":"ku","attrs":{…}}

aegean bridge linearb po-me
# po-me → ποιμήν   (shepherd)

Add -o FILE to stats, dispersion, keyness, or search to write the result straight to disk (the format follows the extension; the table still prints unless you also pass --json):

aegean stats lineara --top 3 -o counts.csv        # item,count
aegean keyness lineara --site Zakros -o key.json  # full result set, as JSON
aegean dispersion lineara --top 5 -o dp.txt       # tab-separated rows

`query` — queryable fields

Build a query from repeated --where field=value rows. Rows AND by default; prefix the field with or: to OR a row, or ! to negate it. List the fields with aegean query CORPUS --fields. --limit trims only the human table — --json always emits the full result set, so a pipeline never silently loses rows.

aegean query lineara --where "site-is=Haghia Triada" --where "or:word-prefix=KU" \
       --output-kind words --json

query -o out.json (or .db) saves the matched inscriptions as a reusable corpus — a subset: provenance note records that it's a slice. (-o works on the inscription output only.) So a query becomes the input to anything else:

aegean query lineara --where "site-is=Zakros" -o zakros.json
#   → wrote 53 inscriptions to zakros.json
aegean stats zakros.json --top 3            # …then analyse just that subset

In Python: QueryResults.to_corpus(source).

field	scope	kind
`id-contains`	inscription	text
`site-is`	inscription	site
`scribe-is`	inscription	scribe
`period-is`	inscription	period
`support-is`	inscription	support
`has-image`	inscription	boolean
`has-annotation`	inscription	boolean
`ins-contains-word`	inscription	word
`word-contains`	word	text
`word-prefix`	word	text
`word-suffix`	word	text
`word-min-syllables`	word	number
`word-max-syllables`	word	number
`word-contains-sign`	word	sign
`word-cooccurs-with`	word	word
`word-sign-pattern`	word	text

`export` — formats

aegean export lineara -f csv -o lineara.csv       # → "wrote 1721 documents to lineara.csv (csv)"
aegean export greek   -f epidoc -o greek.xml      # EpiDoc TEI XML
aegean export lineara -f sqlite -o lineara.db     # same DB as `aegean db build`

`--format`	output	needs
`json`	lossless, round-trippable corpus	core
`csv`	one row per document / token / word (`--level`)	core
`parquet`	same, columnar	`[parquet]` extra
`epidoc`	EpiDoc TEI XML	core
`sqlite`	queryable DB with FTS5	core

--level token (csv/parquet) emits one row per token and spreads per-token annotations — the Greek NT's lemma / morph / Strong's / gloss — into columns.

`combine` — merge corpora into one

combine SRC... -o out.json|.db stitches two or more corpora into a single saved corpus. Each source resolves like any corpus argument (id, work id, .json/.db file, or -), so you can fold fetched Greek works and your own files together. The merged provenance names every source.

aegean combine lineara linearb -o aegean.json
#   → wrote 1739 documents to aegean.json (merged 2 sources)

# all of Homer in one queryable database, no Python:
aegean combine tlg0012.tlg001 tlg0012.tlg002 -o homer.db
#   → wrote … documents to homer.db (merged 2 sources)

--on-conflict decides what happens when an id appears in more than one source:

`--on-conflict`	duplicate id behaviour
`error` (default)	stop with a clear message — nothing is written
`first`	keep the first source's document
`last`	keep the last source's document
`suffix`	keep both, suffixing the later id so it stays unique

In Python: aegean.combine([c1, c2], dedupe="error"), Corpus.merge(*others, dedupe="first"), and Corpus.subset(ids) for the inverse (pull a named slice out).

`import` — your own text → a corpus

import SRC -o out.json|.db turns a plain-text file, a folder of text files, or a CSV into a real corpus you can then stats/search/query/export. Greek/Koine text goes through the Greek tokenizer (punctuation stripped); other --scripts split on whitespace. End to end:

aegean import john.txt -o john.json --script nt   # → wrote 1 document(s) to john.json
aegean stats john.json --top 3                    # ἦν 4 · λόγος 3 · ὁ 3

input	command	notes
one `.txt`	`aegean import john.txt -o john.json --script nt`	id = file stem (override with `--id`)
a folder	`aegean import poems/ -o poems.db --glob "*.txt"`	one corpus from many files; ids = stems (`#2` on clash)
a `.csv`	`aegean import rows.csv -o rows.json --text-col line --id-col id`	one document per row

--split controls how a text becomes documents — whole (default, one doc), paragraph (blank-line blocks), or line (one per line); multi-block ids are numbered <base>:1, <base>:2, …. --encoding (default utf-8) reads non-UTF-8 files. import is the only door for plain text: read_corpus and every corpus argument still load only .json/.db (and work ids), so a bare .txt fed to a command exits 1 and tells you to import it first.

In Python: aegean.io.from_text(s), from_text_file(path), from_text_dir(path), from_csv(path, text_col=…, id_col=…, meta_cols=…) — each takes script_id= and (for text) split=.

`plot` — figure kinds (`[viz]` extra)

pip install "pyaegean[viz]"
aegean plot keyness lineara --site Zakros -o zakros.png
aegean plot scansion "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον" -o scan.svg --meter hexameter

`KIND`	what it draws	relevant flags
`freq`	top-N sign or word frequencies	`--signs --top`
`dispersion`	DP scatter (annotate top N)	`--signs --top`
`keyness`	keyness bars (subset vs rest, or `--reference`)	`--reference --site/... --signs --top`
`network`	co-occurrence network	`--word` (ego network) `--min-count`
`balance`	accounting reconciliation chart	—
`scansion`	metrical scansion grid for one Greek line	`--meter` (second arg is the line; `-` = stdin)

--output -o (required) takes .png / .svg / .pdf; --dpi sets raster resolution for PNG (default 150).

Greek NLP — `aegean greek …`

The full Ancient Greek pipeline from the shell. Zero-dependency stages run the moment you install; the heavier backends are opt-in flags (below). Every text argument accepts - for stdin, and every command takes --json. Full prose lives on Greek NLP.

Command	What it does	Key flags	One-line example
`normalize`	Unicode-normalize; `--lenient` repairs OCR/Beta-Code	`--form --lenient`	`aegean greek normalize "λόγoς" --lenient`
`betacode`	Beta Code → polytonic Greek (`--reverse` for back)	`--reverse`	`aegean greek betacode "mh=nin"`
`strip`	Strip all diacritics	—	`aegean greek strip "μῆνιν"`
`tokenize`	Words + punctuation (or `--sentences`)	`--sentences --json`	`aegean greek tokenize "ἐν ἀρχῇ ἦν ὁ λόγος."`
`syllabify`	Split word(s) into syllables	`--json`	`aegean greek syllabify εἰσφέρω`
`accent`	Accent type, position, classification	`--json`	`aegean greek accent λόγος`
`quantities`	Per-syllable metrical quantity	`--json`	`aegean greek quantities πατρός`
`scan`	Metrical scansion against a fixed template	`--meter --json`	`aegean greek scan "…" --meter hexameter`
`ipa`	Reconstructed IPA pronunciation	`--period`	`aegean greek ipa "λόγος" --period koine`
`tag`	POS-tag (UD coarse tags)	`--treebank --tagger --neural --json`	`aegean greek tag "ἐν ἀρχῇ ἦν ὁ λόγος."`
`lemmatize`	Lemmatize every word	`--treebank --lemmatizer --neural-lemmatizer --neural --json`	`aegean greek lemmatize "μῆνιν ἄειδε θεά"`
`morph`	Candidate morphological parses	`--treebank --json`	`aegean greek morph λόγον`
`parse`	Dependency-parse a sentence	`--neural --parser --json`	`aegean greek parse "…" --neural`
`gloss`	Short LSJ gloss (~270 MB fetch first use)	`--json`	`aegean greek gloss λόγος`
`gloss-nt`	Koine gloss from bundled Dodson lexicon (no download)	`--strongs --full --json`	`aegean greek gloss-nt λόγος --full`
`pipeline`	The one-call pipeline: per-token records	`--parse --parser --treebank --tagger --lemmatizer --neural-lemmatizer --neural --json`	`aegean greek pipeline "ἐν ἀρχῇ" --json`
`work`	Fetch a real Greek work (Perseus / First1KGreek)	`--ref --source --edition -o --json`	`aegean greek work tlg0012.tlg001 --ref 1.1-1.50`
`works`	List the curated catalog of 25 well-known works	`--json`	`aegean greek works`
`catalog`	Search the full ~1,800-work discovery index (offline metadata)	`--author/-a --title/-t --source --limit/-n -o/--output --json`	`aegean greek catalog --author plato`
`nt-books`	List the 27 NT books + names the loaders accept	`--json`	`aegean greek nt-books`
`eval`	Reproduce the published numbers (heavy)	`--treebank --split --neural --tagger --lemmatizer --neural-lemmatizer --json`	`aegean greek eval ud --neural`

Stages that work immediately

aegean greek betacode "mh=nin a)/eide qea/"      # μῆνιν ἄειδε θεά
aegean greek betacode "μῆνιν" --reverse          # mh=nin
aegean greek strip "μῆνιν"                        # μηνιν
aegean greek syllabify εἰσφέρω                    # εἰσφέρω → εἰσ-φέ-ρω
aegean greek quantities πατρός                    # πατρός → πα:common | τρός:heavy
aegean greek ipa "λόγος" --period koine          # loɣos

aegean greek normalize "λόγoς kai" --lenient
# aegean: lenient normalize: repaired 1 Latin letter(s) in Greek words (o→ο)   [stderr]
# λόγος kai

aegean greek tokenize "ἐν ἀρχῇ ἦν ὁ λόγος."
# ἐν / ἀρχῇ / ἦν / ὁ / λόγος / .   (one token per line)

accent prints a small table; the same fact in Python:

from aegean import greek
greek.accentuation("λόγος").classification     # 'paroxytone'
greek.betacode_to_unicode("mh=nin")            # 'μῆνιν'

Scansion (`scan`)

aegean greek scan "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ"
# —⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—×
# hexameter: dactyl, dactyl, dactyl, dactyl, dactyl, final; caesura: trochaic

aegean greek scan "ὦ κοινὸν αὐτάδελφον Ἰσμήνης κάρα" --meter trimeter
# ×—⏑—|×—⏑—|×—⏑×
# trimeter: metron, metron, metron; caesura: hephthemimeral

aegean greek scan "λόγος"
# aegean: line does not scan as dactylic hexameter (2 syllables): 'λόγος'   [stderr, exit 1]

Synizesis is lexical, not guessed: a line that only fits via synizesis on a word outside the curated lexicon exits 1 with the reason rather than inventing a fit. --meter accepts:

name	metre
`hexameter`	dactylic hexameter (Homer) — the default
`pentameter`	elegiac pentameter (second line of the couplet)
`trimeter`	iambic trimeter (tragic/comic dialogue)
`glyconic` · `pherecratean` · `adonean`	aeolic cola
`sapphic_hendecasyllable`	the Sapphic eleven-syllable line
`alcaic_hendecasyllable` · `alcaic_enneasyllable` · `alcaic_decasyllable`	the Alcaic stanza lines

scan --json adds feet, syllables, quantities, caesura, meter, and an ambiguous flag. The same in Python: greek.scan_hexameter(line).pattern.

Tagging, lemmatizing, parsing, morphology

echo "μῆνιν ἄειδε θεά" | aegean greek lemmatize -
# μῆνιν   μῆνις
# ἄειδε   ἀείδω
# θεά     θεά

aegean greek tag "ἐν ἀρχῇ ἦν ὁ λόγος."
# ἐν ADP · ἀρχῇ NOUN · ἦν VERB · ὁ DET · λόγος NOUN · . PUNCT

aegean greek morph λόγον
# λόγος [NOUN acc sg masc]   λόγος [NOUN acc sg fem]   λόγος [NOUN nom sg neut]   …

aegean greek pipeline "ἐν ἀρχῇ" --json
# [{"sentence":0,"index":1,"text":"ἐν","upos":"ADP","lemma":"ἐν","lemma_known":true,…}, …]

A lemma the lexicon doesn't know is still returned, marked (fallback) (and "known": false in JSON), so you can tell a real hit from a heuristic guess.

Glossing

aegean greek gloss-nt λόγος               # a word, speech, divine utterance, analogy
aegean greek gloss-nt λόγος --full        # λόγος (G3056): a word, speech, divine utterance, analogy.
aegean greek gloss-nt 3056 --strongs      # look up by Strong's number → same gloss

gloss-nt uses the bundled CC0 Dodson lexicon (no download). The classical gloss command uses the larger LSJ index, activated on first use (~270 MB, or ~15 MB if the lsj-index dataset is fetched).

Backend flags (download/build on first use)

Each flag stands in for a use_*() activation in the Python API. The first time you use one it may download a model or build an index to the cache (a note goes to stderr); after that it's offline.

flag	activates	first-use cost
`--treebank`	the Perseus AGDT lexicon	~75 MB fetch
`--tagger`	the generalizing POS tagger	trains from the AGDT
`--lemmatizer`	the edit-tree lemmatizer	trains from the AGDT
`--parser`	the pure-Python arc-eager dependency parser	trains from the AGDT
`--neural-lemmatizer`	the GreTa seq2seq lemmatizer (`[neural]`)	~232 MB model
`--neural`	the joint neural pipeline — best tagger/parser/lemmatizer (`[neural]`)	~518 MB model

# heavy — fetches on first use, then offline:
aegean greek pipeline "ἐν ἀρχῇ ἦν ὁ λόγος." --neural
aegean greek parse "ἐν ἀρχῇ ἦν ὁ λόγος" --neural     # UD dependency tree
aegean greek tag "…" --treebank --tagger             # AGDT lookup + perceptron tagger

Loading real Greek works

work fetches a CC BY-SA text from Perseus canonical-greekLit / First1KGreek (commit-pinned, cached once) and parses it into a corpus. --ref selects a section: 1 (book), 1.2 (chapter), 1.1-1.50 (line range); --source is auto / perseus / first1k; --edition picks a specific edition file. Browse ids at scaife.perseus.org. Full reference: Greek Works and Books.

aegean greek works                                # 25 curated highlights (table below)
aegean greek catalog --author plato               # search the full ~1,800-work index (offline)
aegean greek work tlg0012.tlg001                  # heavy: the Iliad, one doc per book
aegean greek work tlg0012.tlg001 --ref 1.1-1.50   # just book 1, lines 1–50
aegean greek work tlg0012.tlg001 -o iliad.json    # save as a corpus file

The curated catalog (aegean greek works):

id	author	title
`tlg0012.tlg001`	Homer	Iliad
`tlg0012.tlg002`	Homer	Odyssey
`tlg0020.tlg001`	Hesiod	Theogony
`tlg0020.tlg002`	Hesiod	Works and Days
`tlg0085.tlg004`	Aeschylus	Seven Against Thebes
`tlg0085.tlg005`	Aeschylus	Agamemnon
`tlg0085.tlg006`	Aeschylus	Libation Bearers
`tlg0011.tlg001`	Sophocles	Trachiniae
`tlg0011.tlg002`	Sophocles	Antigone
`tlg0011.tlg003`	Sophocles	Ajax
`tlg0011.tlg004`	Sophocles	Oedipus Tyrannus
`tlg0006.tlg001`	Euripides	Cyclops
`tlg0006.tlg002`	Euripides	Alcestis
`tlg0006.tlg003`	Euripides	Medea
`tlg0019.tlg002`	Aristophanes	Knights
`tlg0019.tlg003`	Aristophanes	Clouds
`tlg0016.tlg001`	Herodotus	Histories
`tlg0003.tlg001`	Thucydides	History of the Peloponnesian War
`tlg0032.tlg002`	Xenophon	Memorabilia
`tlg0032.tlg006`	Xenophon	Anabasis
`tlg0059.tlg002`	Plato	Apology
`tlg0059.tlg003`	Plato	Crito
`tlg0059.tlg004`	Plato	Phaedo
`tlg0059.tlg030`	Plato	Republic
`tlg0086.tlg010`	Aristotle	Nicomachean Ethics

This is a starting point, not the whole canon — work accepts any valid Perseus / First1KGreek id.

`catalog` — the full discovery index

catalog searches the complete bundled metadata behind works — 1,778 works (768 perseus + 1,010 first1k), every text with a Greek edition in the two pinned repos. Offline, instant, no fetch; any id it returns goes straight to aegean greek work. The bare QUERY is a catch-all over id/author/English-title/Greek-title; --author/-a, --title/-t, and --source perseus|first1k are targeted filters (case-insensitive, AND). --limit/-n 0 lists all; --json emits everything; --output/-o saves (.json/.csv/.txt).

aegean greek catalog --author plato --limit 8
#   → 39 matches: tlg0059.tlg001 Euthyphro · tlg0059.tlg002 Apology · … (table)

aegean greek catalog herodotus --json
# [{"id":"tlg0016.tlg001","author":"Herodotus","title":"Histories",
#   "greek_title":"Ἱστορίαι","source":"perseus"}, … ]

aegean greek catalog --author aristophanes --source perseus -o aristophanes.csv
#   → wrote 11 works to aristophanes.csv   (id,author,title,greek_title,source)

Coverage is exactly what the upstream repos hold, so genuinely-absent authors return nothing rather than a fabricated row:

aegean greek catalog sappho
# No works match. Try a looser filter, or browse https://scaife.perseus.org

In Python: greek.catalog(query=None, *, author=None, title=None, source=None) → {id, author, title, greek_title, source} dicts; greek.popular_works() stays the curated 25.

The 27 NT books (aegean greek nt-books) and the names gloss-nt / load_nt accept:

book	accepted names	book	accepted names
Matt	matthew, matt, mt	1Tim	1timothy, 1tim, 1ti
Mark	mark, mk, mrk	2Tim	2timothy, 2tim, 2ti
Luke	luke, lk, luk	Titus	titus, tit
John	john, jn, jhn	Phlm	philemon, phlm, phm
Acts	acts, act	Heb	hebrews, heb
Rom	romans, rom, rm	Jas	james, jas, jms
1Cor	1corinthians, 1cor, 1co	1Pet	1peter, 1pet, 1pe
2Cor	2corinthians, 2cor, 2co	2Pet	2peter, 2pet, 2pe
Gal	galatians, gal, ga	1John	1john, 1jn, 1jhn
Eph	ephesians, eph	2John	2john, 2jn, 2jhn
Phil	philippians, phil, php	3John	3john, 3jn, 3jhn
Col	colossians, col	Jude	jude, jud
1Thess	1thessalonians, 1thess, 1th	Rev	revelation, rev, rv, apocalypse
2Thess	2thessalonians, 2thess, 2th

In Python: from aegean import greek; greek.load_nt("John", ref="1.1-18") and greek.load_work("tlg0012.tlg001", ref="1.1-1.50").

Reproducing the published numbers (`eval`)

aegean greek eval TARGET runs the official evaluators against fetched gold data (heavy). Targets: ud, proiel, nt, tagger, lemmatizer, parser. For ud, --treebank is perseus / proiel and --split is dev / test.

aegean greek eval ud --treebank perseus --split test --neural   # heavy

The exact figures and how they were measured are on Greek NLP and Limitations.

Analysis — `aegean analyze …`

Exploratory surface analyses over the (largely undeciphered) Aegean material: evidence to weigh, not conclusions. Method notes are on Analysis.

Command	What it does	Key flags	One-line example
`distance`	Weighted phonetic distance in [0,1] (0 = identical)	`--json`	`aegean analyze distance KU-RO KI-RO`
`align`	Per-position phonetic alignment	`--json`	`aegean analyze align KU-RO KI-RO`
`compare`	Compare two words across scripts by sound	`--script-a --script-b --fold-aspiration --json`	`aegean analyze compare po-me ποιμήν`
`nearest`	Rank a corpus's words by closeness to WORD	`--script-a --fold-aspiration --top --json`	`aegean analyze nearest qa-si-re-u greek`
`assoc`	Doc-level association: χ², G², Fisher, PMI	`-o/--output --json`	`aegean analyze assoc lineara KU-RO KI-RO`
`cooccur`	Words sharing a document with WORD, ranked	`--top -o/--output --json`	`aegean analyze cooccur lineara KU-RO`
`clusters`	Stems with productive-suffix derivations	`--min-size --top -o/--output --json`	`aegean analyze clusters lineara`
`structure`	Heuristic doc categories (accounting/libation/…)	`--json`	`aegean analyze structure lineara`
`hands`	Scribal-hand profiles / keyness (e.g. DAMOS)	`--hand --top --min-docs --signs -o/--output --json`	`aegean analyze hands damos`

Verified examples

aegean analyze distance KU-RO KI-RO
# KU-RO ↔ KI-RO: 0.200

aegean analyze align KU-RO KI-RO
# K K match · U I sub-far · - - match · R R match · O O match

aegean analyze compare po-me ποιμήν
# po-me [linearb] → pome    ποιμήν [greek] → poimēn
# similarity 0.62  (distance 0.383)   [+ a per-position alignment table]

aegean analyze nearest qa-si-re-u greek --top 3 --json
# [{"candidate":"ἱστορίης","distance":0.525},{"candidate":"ἄειδε","distance":0.571},
#  {"candidate":"Ἡροδότου","distance":0.612}]

aegean analyze assoc lineara KU-RO KI-RO
# joint/w1/w2/docs 5/34/12/1721 · chi_squared 78.75 · log_likelihood 23.94 · fisher_p 1.6e-06

aegean analyze cooccur lineara KU-RO --top 5
# KI-RO 5 · *306-TU 4 · KU-PA₃-NU 4 · SA-RA₂ 4 · PA-DE 3

aegean analyze structure lineara
# accounting 134 · libation 15 · list 7 · text 2 · other 1563   (heuristic census)

aegean analyze cooccur lineara KU-RO --top 5 -o cooccur.csv   # save the ranking

assoc, cooccur, clusters, and hands take the same -o FILE (.json / .csv / .txt) as the top-level analysis commands.

compare / nearest script options (--script-a / --script-b): greek, lineara, linearb, cypriot. --fold-aspiration maps θ/φ/χ → t/p/k, which is fairer against defective syllabic spelling. These numbers are exploratory — read the alignment and the ranking, not the absolute distance. The Python equivalent of distance:

from aegean import analysis
analysis.phonetic_distance("KU-RO", "KI-RO")    # 0.2

hands needs a corpus that records a scribe per document (DAMOS does; the bundled lineara records HT scribes too):

aegean analyze hands damos                       # profile every hand
aegean analyze hands damos --hand "Knossos 103"  # one hand's characteristic words (keyness)

Data — `aegean data …`

The fetch-to-cache layer. Nothing here is bundled; everything downloads on demand and is sha256-verified.

Command	What it does	Key flags	One-line example
`list`	The fetchable datasets (name, size note, license)	`--json`	`aegean data list`
`fetch`	Fetch a dataset into the cache (idempotent)	`--force`	`aegean data fetch grc-joint`
`versions`	Reproducibility manifest: version + sha256	`--json`	`aegean data versions --json > data-versions.json`
`cache`	Cache location + current contents	`--json`	`aegean data cache`

The fetchable datasets (aegean data list):

name	what	license
`agdt-derived`	prebuilt AGDT lexicon + tagger/lemmatizer/parser models	CC BY-SA 3.0 (Perseus AGDT)
`grc-joint`	joint tagger-parser-lemmatizer, ~518 MB (the `[neural]` extra)	CC BY-SA 4.0
`grc-lemma-neural`	GreTa seq2seq lemmatizer, ~232 MB (the `[neural]` extra)	CC BY-SA 4.0
`lsj-index`	prebuilt LSJ lemma→entry index (~15 MB)	CC BY-SA 4.0 (Perseus)
`damos-corpus`	DAMOS Linear B corpus, ~5,900 tablets — `load('damos')`	CC BY-NC-SA 4.0
`sigla-corpus`	SigLA Linear A dataset, 781 docs — `load('sigla')`	CC BY-NC-SA 4.0
`nt-corpus`	Greek NT (Nestle 1904), 260 chapters / ~137,800 tokens — `load('nt')`	CC0-1.0
`lineara-images`	3,368 facsimile/photo files (~116 MB)	academic reference only
`linearb-corpus`	bring-your-own Linear B export (no default source)	per your source
`workbench-app`	prebuilt workbench web app (~3 MB) — served by `aegean workbench`	Apache-2.0

aegean data fetch grc-joint                       # pre-fetch before going offline
aegean data versions --json > data-versions.json  # pin every dataset's sha256 for a paper
aegean data cache                                 # cache path + contents (override: PYAEGEAN_CACHE)

There are two caches: this data download cache (aegean data cache, override with PYAEGEAN_CACHE) and the opt-in analysis memoization cache (aegean cache, enabled with PYAEGEAN_ANALYSIS_CACHE=1). Licensing details are on Data & Provenance.

SQLite — `aegean db …`

Build a queryable SQLite database from any corpus (documents + tokens + an FTS5 full-text index) and search it.

Command	What it does	Key flags	One-line example
`build`	Write a corpus to a SQLite DB	`-o/--output --no-fts`	`aegean db build lineara -o lineara.db`
`add`	Upsert another corpus into an existing DB	`-o/--output`	`aegean db add cypriot -o lineara.db`
`search`	Full-text search a corpus DB's tokens	`--limit --json`	`aegean db search lineara.db KU-RO`

aegean db build lineara -o lineara.db        # → "wrote 1721 documents to lineara.db"
aegean db search lineara.db KU-RO --limit 3
#   doc    pos  text
#   HT9a   25   KU-RO
#   HT9b   20   KU-RO
#   HT11a  7    KU-RO

build's corpus argument is any of the usual forms (id, work id, .json/.db file, -), so aegean db build tlg0012.tlg001 -o iliad.db builds straight from a Greek work.

db add SRC -o existing.db grows a database in place: it upserts by document id — a matching id is replaced, new ids are added, and the FTS index is refreshed. It's the incremental sibling of combine:

aegean db build lineara -o aegean.db    # start the database
aegean db add cypriot   -o aegean.db    # → "added/updated 2 documents in aegean.db"

--no-fts skips the full-text index. aegean export CORPUS -f sqlite -o file.db writes the same database. In Python the same upsert is corpus.to_sql(path, append=True) (or aegean.db.to_sqlite(corpus, path, append=True)); load a DB back with Corpus.from_sql(path).

AI — `aegean ai …` (exploratory, key-gated)

The generative layer. Every result is exploratory — a labeled model hypothesis carrying its grounding, never a citable fact and never a "decipherment." It needs a provider SDK (an extra such as pip install "pyaegean[anthropic]") and that provider's API key in your environment. Without a key the command exits 1 with a clear message — it never silently calls out. Design notes: AI Layer; hard limits: Limitations.

Command	What it does	Key flags	One-line example
`providers`	List the registered AI providers	`--json`	`aegean ai providers`
`translate`	Hybrid translation (local grounding → LLM)	`--script --target --provider --model --trace -o/--output --json`	`aegean ai translate "ἐν ἀρχῇ ἦν ὁ λόγος"`
`gloss`	Interlinear word-by-word gloss	`--source --provider --model --trace -o/--output --json`	`aegean ai gloss "μῆνιν ἄειδε θεά"`
`hypotheses`	Cautious decipherment hypotheses (strictly exploratory)	`--corpus --provider --model --trace -o/--output --json`	`aegean ai hypotheses "A-TA-I-*301-WA-JA" --corpus lineara`
`ask`	Answer strictly from the supplied grounding	`--corpus --provider --model --trace -o/--output --json`	`aegean ai ask "What is KU-RO?" --corpus lineara`
`extract`	Structured (JSON) extraction, ready to pipe	`--fields --instruction --corpus --provider --model -o/--output --json`	`aegean ai extract "OLE S 1" --fields commodity,amount`
`eval`	Grounded-generation fidelity eval	`--provider --model --json`	`aegean ai eval --provider anthropic`

aegean ai providers
# anthropic / gemini / grok / openai

# heavy (needs the provider extra + API key in your environment):
aegean ai translate "KU-RO 130" --script lineara          # exploratory (undeciphered!)
aegean ai ask "What is KU-RO?" --corpus lineara --trace   # answer + grounding provenance
aegean ai extract "OLE S 1" --fields commodity,amount     # → {"commodity":"OLE","amount":…}

--provider is anthropic (default) / openai / grok / gemini; --model overrides the model. --corpus NAME grounds the answer on that corpus's frequent words. --trace prints the grounding provenance under the answer, so you can audit exactly what the model was (and wasn't) told. extract always prints JSON.

-o FILE saves the run for later: .json keeps the text plus its provenance and grounding (and the exploratory label is preserved on disk), while .txt writes the labeled text. In Python the same record round-trips via ExploratoryResult.to_dict() / to_json() / from_dict().

# heavy (needs a provider + key); save the audit trail next to the answer:
aegean ai ask "What is KU-RO?" --corpus lineara -o answer.json   # text + grounding
aegean ai translate "KU-RO 130" --script lineara -o draft.txt    # labeled text

MCP server — `aegean-mcp`

A separate console script (the [mcp] extra) that exposes the toolkit to AI agents — Claude Code and other MCP clients — over stdio, so an agent can use pyaegean without writing Python.

pip install "pyaegean[mcp]"
aegean-mcp                # serve the read/analysis tools over stdio

It offers a small set of read/analysis tools: list and inspect corpora, wildcard sign search, accounting reconciliation, the Greek pipeline, verse scansion, and Koine glossing.

Quick recipes

# Keep only the Linear A accounts that DON'T balance
aegean balance lineara --json | jq '[.[] | select(.balances | not)]' > discrepancies.json

# Lemmatize a file, one lemma per line
cat chapter.txt | aegean greek lemmatize - --json | jq -r '.[].lemma'

# Pin every dataset's sha256 for a paper's reproducibility appendix
aegean data versions --json > data-versions.json

# Map a word's distribution and cite the exact subset you used
aegean geo lineara --output sites.geojson
aegean cite lineara --site "Zakros" --style bibtex >> paper.bib

# Save a query subset, then analyse just that slice — no Python in between
aegean query lineara --where "site-is=Zakros" -o zakros.json
aegean stats zakros.json --top 10 -o zakros-counts.csv

# Build one database from several sources, growing it as you go
aegean combine lineara linearb -o aegean.db   # both scripts in one DB
aegean db add cypriot -o aegean.db            # add a third later (upsert by id)

More worked pipelines are on Recipes.

Notes & limits

--json is the contract; the rich tables are for humans. Don't parse the tables — pass --json and use jq. --limit trims only the human view.
Heavy commands download on first use. --neural, greek work, greek eval, gloss, and the fetched corpora pull data to the cache the first time, with a note on stderr; afterwards they're offline. Pre-fetch with aegean data fetch.
The AI layer is exploratory. Translations, glosses, and "hypotheses" for undeciphered material are labeled model output with grounding, not findings. The Aegean scripts remain undeciphered.
Metre and accuracy are bounded. Lyric metres beyond the fixed aeolic templates are out of scope, and the trainable backends have measured ceilings — both documented on Limitations.

For the guided tour with full prose and more worked output, see CLI.

pyaegean

Home

Start here

Aegean scripts

Greek

Capabilities

Reference

CLI Cheatsheet

CLI Cheatsheet

Conventions that hold everywhere

At a glance

Corpus & analysis (top level)

Verified examples

query — queryable fields

export — formats

combine — merge corpora into one

import — your own text → a corpus

plot — figure kinds ([viz] extra)

Greek NLP — aegean greek …

Stages that work immediately

Scansion (scan)

Tagging, lemmatizing, parsing, morphology

Glossing

Backend flags (download/build on first use)

Loading real Greek works

catalog — the full discovery index

Reproducing the published numbers (eval)

Analysis — aegean analyze …

Verified examples

Data — aegean data …

SQLite — aegean db …

AI — aegean ai … (exploratory, key-gated)

MCP server — aegean-mcp

Quick recipes

Notes & limits

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pyaegean

Clone this wiki locally

`query` — queryable fields

`export` — formats

`combine` — merge corpora into one

`import` — your own text → a corpus

`plot` — figure kinds (`[viz]` extra)

Greek NLP — `aegean greek …`

Scansion (`scan`)

`catalog` — the full discovery index

Reproducing the published numbers (`eval`)

Analysis — `aegean analyze …`

Data — `aegean data …`

SQLite — `aegean db …`

AI — `aegean ai …` (exploratory, key-gated)

MCP server — `aegean-mcp`