Skip to content
github-actions[bot] edited this page Jun 16, 2026 · 21 revisions

The aegean command line

aegean is the whole toolkit from your terminal — corpora, Greek NLP, surface analysis, the fetch-to-cache data layer, SQLite, plots, and the (exploratory) AI layer — without writing a line of Python. If you've never used a command line before, start with Getting Started (it shows you how to open a terminal); then come back here. Everything below is something you can copy, paste, and run.

In a hurry? The CLI Cheatsheet is the dense one-page index of every command and flag. This page is the guided tour: it explains each group and shows a worked example with real output.

pip install "pyaegean[cli]"     # adds typer + rich; the core library stays zero-dependency
aegean --help

If you only ran pip install pyaegean, the library works but the aegean command isn't installed yet. The [cli] extra adds it. (If you run aegean without it, you get one line telling you exactly that.)

Three conventions that hold everywhere

Learn these once and every command behaves predictably.

Convention What it does Example
--json Print one machine-readable JSON document to stdout and nothing else, so results pipe into jq, files, or other programs. Greek stays readable (ensure_ascii=False). aegean info lineara --json
- reads stdin Anywhere a command takes a TEXT argument, passing - reads the text from standard input, so commands compose in pipelines. echo "μῆνιν" | aegean greek lemmatize -
Exit codes 0 success · 1 a domain error (one line on stderr, prefixed aegean:) · 2 a usage error (typer's default). balance --strict exits 1 when any total fails to balance. see below

Here are those exit codes, actually demonstrated:

aegean info lineara --json > /dev/null ; echo "exit=$?"      # exit=0   (success)
aegean info bogus                                            # aegean: unknown corpus 'bogus'; available: …
                                                            # exit=1   (domain error, message on stderr)
aegean info                                                  # exit=2   (usage error: missing argument)
aegean balance lineara HT13 --strict ; echo "exit=$?"        # exit=1   (a total didn't balance)

A help summary is one -h/--help away on every command and group:

aegean --help
aegean greek --help
aegean greek scan --help

Windows note: if polytonic Greek shows up as boxes or ?, that's the terminal font, not pyaegean. Set PYTHONUTF8=1 and run chcp 65001 once to switch the console to UTF-8 — or just use the --json output, which is always correct, and view it in an editor. See Getting Started.

The command map

aegean --version          # pyaegean 0.8.5
Group What's in it
(top level) repl info load show search query stats dispersion keyness cache balance cite export combine import geo sign bridge plot workbench
aegean greek … normalize → tokenize → syllabify → accent → scan → tag → lemmatize → morph → parse, plus pipeline, gloss/gloss-nt, work/works/catalog/nt-books, and eval
aegean analyze … distance align compare nearest assoc cooccur clusters structure hands
aegean data … list fetch versions cache
aegean db … build add search (SQLite + FTS5)
aegean ai … providers translate gloss hypotheses ask extract eval (exploratory, key-gated)
aegean-mcp a separate console script: serve the tools to AI agents over MCP

Interactive shell (aegean repl)

If you're running several commands in a row, aegean repl opens an interactive shell so you don't retype aegean each time. Inside it you type the subcommand directly, with Tab-completion of commands and options and an arrow-key history:

$ aegean repl
aegean interactive shell — commands without the 'aegean' prefix.
Tab completes, :help lists commands, :exit or Ctrl-D quits.
aegean> info lineara
…the same table aegean info lineara prints…
aegean> greek syllabify Ποσειδῶνι
Ποσειδῶνι → Πο-σει-δῶ-νι
aegean> stats lineara --top 3
…
aegean> :exit

Every line is dispatched through the same command tree, so a command behaves exactly as it does on the regular command line — --json, -o, corpus files and work ids, all of it. A mistyped command just prints its error and leaves the shell open. :help (or help) prints the command list; :exit, quit, or Ctrl-D leaves. The shell needs the [cli] extra (it ships prompt_toolkit).

When standard input isn't a terminal, the shell reads one command per line instead of prompting, so you can script it:

printf 'info lineara\nstats lineara --top 5\n' | aegean repl

Corpus commands (top level)

Every corpus command takes a corpus id as its first argument. The bundled, offline-from-install corpora are lineara, linearb, cypriot, cyprominoan, and greek. Three more download to your cache on first use: damos (the full ~5,900-tablet DAMOS Linear B corpus), sigla (the SigLA Linear A dataset), and nt (the Greek New Testament). Pass an unknown id and the error lists the valid ones:

aegean info bogus
# aegean: unknown corpus 'bogus'; available: cypriot, cyprominoan, damos, greek, lineara, linearb, nt, sigla

Any corpus argument is more than just an id now. Wherever a command takes a corpus (and wherever aegean.read_corpus(spec) does in Python), you can pass: a registered id (lineara), a Greek work id (tlg0012.tlg001 → fetches the Iliad like aegean greek work), a path to a saved corpus (.json or .db you wrote earlier), or - to read corpus JSON from stdin. So these all work with no Python:

aegean db build tlg0012.tlg001 -o iliad.db        # build a DB straight from a work id
aegean stats iliad.json                            # run stats on a corpus file you saved
aegean export tlg0012.tlg002 -f csv -o odyssey.csv # export a work to CSV

Work ids and saved files share one resolver, so anything you can build or export you can also stats, query, keyness, and so on.

For the meaning of document ids like HT13 and work ids like tlg0012.tlg001, see Greek Works and Books and the Linear A / Linear B pages.

info — what's in a corpus

Size, provenance, license, and the one-line citation.

aegean info lineara --json
{
  "corpus": "lineara",
  "documents": 1721,
  "words": 1381,
  "tokens": 6406,
  "signs_in_inventory": 344,
  "source": "GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz",
  "license": "Apache-2.0 (corpus JSON); facsimile imagery © École Française d'Athènes, not redistributed",
  "citation": "Godart, L. & Olivier, J.-P. (1976–1985). Recueil des inscriptions en linéaire A. — https://github.com/mwenge/lineara.xyz"
}

Drop --json for a human-readable table. The same in Python:

import aegean
c = aegean.load("lineara")
len(c)                 # 1721
c.provenance.license   # 'Apache-2.0 (corpus JSON); …'

load — filter by metadata, list or export

Filter on --site, --period, --scribe, --support; without -o it lists the matches (capped by --limit, default 20), with -o it writes a round-trippable corpus JSON file.

aegean load lineara --site "Haghia Triada"               # list the first 20 matches
aegean load lineara --site "Haghia Triada" -o ht.json    # → "wrote 1110 documents to ht.json"

show — one document, line by line

aegean show lineara HT13
HT13  site=Haghia Triada  period=LMIB  scribe=HT Scribe 8  support=Tablet
  1: KA-U-DE-TA VIN 𐄁 TE 𐄁
  2: RE-ZA 5 ¹⁄₂
  3: TE-TU 56
  4: TE-KI 27 ¹⁄₂
  5: KU-ZU-NI 18
  6: DA-SI-*118 19
  7: I-DU-NE-SI 5
  8: KU-RO 130 ¹⁄₂

--json gives the full metadata block plus lines as nested token lists.

search — wildcard sign-pattern word search

* matches exactly one sign. Returns matching words with their frequencies.

aegean search lineara "KU-*-RO"
'KU-*-RO': 1 word(s)
┌──────────┬───────┐
│ word     │ count │
├──────────┼───────┤
│ KU-MA-RO │ 1     │
└──────────┴───────┘

query — the compound-query engine

Build a query from repeated --where field=value rows. Rows AND together by default; prefix the field with or: to OR a row, or ! to negate it. --output-kind is inscriptions (default) or words.

aegean query lineara --where "site-is=Haghia Triada" --where "or:id-contains=ZA" \
       --output-kind words --json

The result carries a description of the query and a citation for the exact subset — so the precise result set behind a figure is one --json | jq .citation away. List the queryable fields with --fields:

aegean query lineara --fields
field scope kind
id-contains inscription text
site-is inscription site
scribe-is inscription scribe
period-is inscription period
support-is inscription support
has-image inscription boolean
has-annotation inscription boolean
ins-contains-word inscription word
word-contains word text
word-prefix word text
word-suffix word text
word-min-syllables word number
word-max-syllables word number
word-contains-sign word sign
word-cooccurs-with word word
word-sign-pattern word text

Save the matched subset as a reusable corpus. Add --output/-o (with a .json or .db extension) and query writes the matching inscriptions out as a corpus you can feed straight back into any other command:

aegean query lineara --where "site-is=Zakros" -o zakros.json
# wrote 53 inscriptions to zakros.json
aegean stats zakros.json --top 3                 # then analyse the saved subset

The saved file records a subset: query(…) → N documents provenance note, so the exact filter behind it travels with the data. (-o only writes inscriptions — use --output-kind words --json if you want the word list instead.)

Note: --limit only trims the human-readable table; --json always emits the full result set (so a pipeline never silently loses rows). Trim JSON with jq instead, e.g. … --json | jq '.words[:5]'.

stats — frequency tables

Word frequencies by default; --signs counts individual signs.

aegean stats lineara --signs --top 5
┌──────┬───────┐
│ item │ count │
├──────┼───────┤
│ 𐝫    │ 552   │
│ 𐄁    │ 468   │
│ 1    │ 310   │
│ KU   │ 307   │
│ KA   │ 284   │
└──────┴───────┘

dispersion — how evenly an item is spread

Gries' DP: 0 = perfectly even across documents, 1 = concentrated in a few. Give one item, or omit it to rank the corpus.

aegean dispersion lineara --top 5
┌───────────┬──────┬─────────────┬───────┬────────┐
│ item      │ freq │ range/parts │ DP    │ DPnorm │
├───────────┼──────┼─────────────┼───────┼────────┤
│ KU-RO     │ 37   │ 34/559      │ 0.850 │ 0.851  │
│ KI-RO     │ 16   │ 12/559      │ 0.938 │ 0.938  │
│ KU-PA₃-NU │ 8    │ 7/559       │ 0.948 │ 0.949  │
│ SA-RA₂    │ 20   │ 20/559      │ 0.948 │ 0.949  │
│ A-DU      │ 10   │ 10/559      │ 0.963 │ 0.964  │
└───────────┴──────┴─────────────┴───────┴────────┘

keyness — characteristic vocabulary of a subset

Compares either a metadata subset against the rest of the same corpus, or one corpus against another (--reference). Reports log-likelihood (G²) and log-ratio with a p-value.

aegean keyness lineara --site "Zakros" --top 5
┌────────────────────┬────────┬───────────┬───────┬───────────┬─────────┐
│ item               │ target │ reference │ G2    │ log-ratio │ p       │
├────────────────────┼────────┼───────────┼───────┼───────────┼─────────┤
│ *28B-NU-MA-RE      │ 3/132  │ 0/1249    │ 14.15 │ +6.05     │ 0.00017 │
│ DU-RE-ZA-SE        │ 3/132  │ 0/1249    │ 14.15 │ +6.05     │ 0.00017 │
│ SI-PI-KI           │ 3/132  │ 0/1249    │ 14.15 │ +6.05     │ 0.00017 │
│ A-TI-KA-A-DU-KO-MI │ 2/132  │ 0/1249    │ 9.42  │ +5.56     │ 0.0021  │
│ DA-I-PI-TA         │ 2/132  │ 0/1249    │ 9.42  │ +5.56     │ 0.0021  │
└────────────────────┴────────┴───────────┴───────┴───────────┴─────────┘

Save a result straight to a file. stats, keyness, dispersion, and search all take --output/-o, and the format follows the extension: .json (the same document as --json), .csv (a plain table — stdlib only, no pandas), or .txt (the human view). It writes silently and prints nothing else:

aegean stats lineara --top 3 -o freq.csv
# freq.csv:
# item,count
# KU-RO,37
# SA-RA₂,20
# KI-RO,16

balance — accounting reconciliation

Checks stated totals (KU-RO in Linear A, TO-SO in Linear B) against the sum of the listed items. Give one document, or omit it to sweep the whole corpus.

aegean balance lineara HT13
┌──────┬────────┬────────┬──────────┬──────┬──────────┐
│ doc  │ marker │ stated │ computed │ diff │ balances │
├──────┼────────┼────────┼──────────┼──────┼──────────┤
│ HT13 │ KU-RO  │ 130.5  │ 131.0    │ 0.5  │ NO       │
└──────┴────────┴────────┴──────────┴──────┴──────────┘

--strict makes the command exit 1 whenever any checked total fails — handy in a script. See Linear A for what KU-RO discrepancies actually mean.

cite — cite a corpus or the exact subset

aegean cite lineara --site "Haghia Triada"
# Godart, L. & Olivier, J.-P. (1976–1985). Recueil des inscriptions en linéaire A.
#   — https://… [subset: filter(site='Haghia Triada') → 1110 of 1721 documents]

--style is plain (default), bibtex, or apa. Append a BibTeX entry to your bibliography with aegean cite lineara --site Zakros --style bibtex >> paper.bib.

export — JSON, CSV, Parquet, EpiDoc, SQLite

aegean export lineara -f csv -o lineara.csv               # → "wrote 1721 documents to lineara.csv (csv)"
aegean export greek -f epidoc -o greek.xml                # EpiDoc TEI
aegean export lineara -f sqlite -o lineara.db             # same DB as `aegean db build`
--format output needs
json lossless, round-trippable corpus core
csv one row per document/token/word (--level) core
parquet same, columnar [parquet] extra
epidoc EpiDoc TEI XML core
sqlite queryable DB with FTS5 core

--level token (csv/parquet) emits one row per token and spreads per-token annotations — the Greek NT's lemma / morph / Strong's / gloss — into columns. Filters (--site etc.) apply before export.

combine — merge several corpora into one file

Give two or more sources and one --output/-o (a .json or .db) and combine merges them into a single saved corpus. Each source is resolved like any corpus argument — an id, a saved .json/.db, a Greek work id, or - — so you can stitch works, subsets, and bundled corpora together in one go:

aegean combine tlg0012.tlg001 tlg0012.tlg002 -o homer.db    # all of Homer in one database
# wrote … documents to homer.db (merged 2 sources)

A run you can try offline, against the bundled corpora:

aegean combine lineara cypriot -o aegean-mix.json
# wrote 1723 documents to aegean-mix.json (merged 2 sources)

The merged corpus keeps a provenance that names every source — its citation reads Merged corpus of: … listing each one. If two sources share a document id, --on-conflict decides what happens: error (the default — stop and tell you), first (keep the earliest), last (keep the latest), or suffix (keep both, appending #2, #3, … to the later ids). The same in Python:

import aegean
merged = aegean.combine([aegean.load("lineara"), aegean.load("cypriot")])
# or from an existing corpus:
both = aegean.load("lineara").merge(aegean.load("cypriot"), dedupe="suffix")
just_a_few = aegean.load("lineara").subset(["HT13", "HT9a", "HT11a"])

Corpus.merge(*others, dedupe=…) takes the same four dedupe values as --on-conflict; Corpus.subset(ids) pulls out a named slice. See Greek Works and Books for the work ids you can combine.

import — bring your own text into a corpus

Everything above analyses corpora that pyaegean already knows about. import turns your own material — a plain-text file, a folder of text files, or a CSV — into a real corpus you can then stats, search, query, export, and so on. It always writes to --output/-o (a .json or .db), and the result works anywhere a corpus is accepted. (Greek/Koine text is run through the Greek tokenizer, which strips punctuation; any other --script splits on whitespace.)

aegean import john.txt -o john.json --script nt        # one plain-text file → a corpus
# wrote 1 document(s) to john.json
aegean stats john.json --top 5                          # then analyse it like any corpus
 john.json: top 5
      words
┌───────┬───────┐
│ item  │ count │
├───────┼───────┤
│ ἦν    │ 4     │
│ λόγος │ 3     │
│ ὁ     │ 3     │
│ θεόν  │ 2     │
│ καὶ   │ 2     │
└───────┴───────┘

--split decides how a text becomes documentswhole (the default, one document for the whole file), paragraph (one per blank-line-separated block), or line (one per non-empty line). With more than one block the ids are numbered <base>:1, <base>:2, …; the base id is the file's stem unless you override it with --id:

aegean import john.txt -o john-lines.json --script nt --split line
# wrote 2 document(s) to john-lines.json

A folder imports every matching file into one corpus (each file's stem becomes a document id, de-duplicated with a #2, #3, … suffix on collision). --glob chooses which files; --split applies per file:

aegean import poems/ -o poems.db --split line          # a directory of *.txt → a database
# wrote 2 document(s) to poems.db
aegean db search poems.db θεά

A CSV treats each row as a document: --text-col names the column holding the text (default text), and --id-col names the column holding the id (otherwise ids are <stem>:<row>):

aegean import verses.csv -o verses.json --script nt --text-col line --id-col id
# wrote 2 document(s) to verses.json
aegean show verses.json v2
# v2
#   1: καὶ ὁ λόγος ἦν πρὸς τὸν θεόν

--encoding (default utf-8) reads non-UTF-8 files. The same lives on aegean.io in Python — from_text, from_text_file, from_text_dir, and from_csv (the CSV one also takes meta_cols= to carry columns into document metadata):

from aegean import io
c = io.from_text("μῆνιν ἄειδε θεά", script_id="nt")           # a raw string
c = io.from_text_file("john.txt", script_id="nt", split="line")
c = io.from_csv("verses.csv", text_col="line", id_col="id", script_id="nt")

import is the only way plain text enters a corpus: read_corpus and every corpus argument still load only .json/.db files (and work ids), so feeding a raw .txt straight to a command fails with a message telling you to import it first:

aegean stats john.txt --top 3
# aegean: unknown corpus 'john.txt'; expected a registered id (…), a Greek work id …,
#   a path to a .json or .db corpus, or '-' …. To load plain text, import it first:
#   `aegean import john.txt -o corpus.json` (or aegean.io.from_text_file / from_csv …)
#   [stderr, exit 1]

geo — find-site coordinates

aegean geo lineara
       lineara: 52 located site(s) of 52
┌──────────────────┬───────┬───────┬───────────┐
│ site             │ lat   │ lon   │ pleiades  │
├──────────────────┼───────┼───────┼───────────┤
│ Haghia Triada    │ 35.06 │ 24.79 │ 589672    │
│ Gournia          │ 35.11 │ 25.79 │ 771100776 │
│ …                │       │       │           │
└──────────────────┴───────┴───────┴───────────┘

Add -o sites.geojson to write GeoJSON instead of printing a table (that path needs the [geo] extra). More on the map data in Geography.

sign — look up one sign

Glyph, Unicode codepoint, sound value, and the raw attributes for a single sign in a script's inventory.

aegean sign lineara KU --json
{
  "label": "KU",
  "glyph": "𐙂",
  "codepoint": "U+10642",
  "phonetic": "ku",
  "attrs": { "sharedWithLinearB": true, "total": 16, "confidence": 1, "altGlyphs": [] }
}

bridge — read a deciphered syllabic word as Greek

For the deciphered scripts (linearb, cypriot): the attested Greek reading plus a gloss.

aegean bridge linearb po-me
# po-me → ποιμήν   (shepherd)

cache — the opt-in analysis cache

This is the analysis memoization cache (distinct from the data download cache under aegean data cache). It's off unless you enable it for the shell:

aegean cache
# analysis cache: off — set PYAEGEAN_ANALYSIS_CACHE=1 (or a path) to enable

Set PYAEGEAN_ANALYSIS_CACHE=1 (or a directory path) and expensive analyses (dispersion, keyness, clustering) are reused across runs; aegean cache --clear wipes it.

plot — one figure to a file

Draws a single figure and writes it to --output (.png/.svg/.pdf). Needs the [viz] extra. The first argument is the figure kind:

kind what it draws
freq top-N sign or word frequencies
dispersion DP scatter (annotate the top N)
keyness keyness bars (subset vs rest, or vs --reference)
network co-occurrence network (--word for one word's ego network)
balance accounting reconciliation chart
scansion a metrical scansion grid for one Greek line
pip install "pyaegean[viz]"
aegean plot keyness lineara --site Zakros -o zakros.png      # → "wrote zakros.png"
aegean plot scansion "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον" -o scan.svg --meter hexameter

For scansion the second argument is the Greek line itself (- reads stdin); for every other kind it's a corpus name.

workbench — serve the Linear A Research Workbench locally

aegean workbench                 # fetch the build (~3 MB, first use) and open it in your browser
aegean workbench --port 9000     # choose a port (default 8000); --no-browser to not open one

Fetches the sha256-pinned static build to the cache, then serves the browser UI — the corpus, maps, and analysis modules — at http://localhost:8000/ until you press Ctrl+C. If the Linear A facsimile imagery is already cached (aegean data fetch lineara-images), the picture browser works too.


Greek NLP — aegean greek …

The full Ancient Greek pipeline from the shell. The zero-dependency stages run the moment you install; the heavier backends are opt-in flags (next section). Full explanations live on Greek NLP; metre is on Meters.

Every text argument accepts - for stdin, and every command takes --json.

The stages that work immediately

aegean greek betacode "mh=nin a)/eide qea/"      # μῆνιν ἄειδε θεά
aegean greek betacode "μῆνιν" --reverse          # mh=nin   (Unicode → Beta Code)
aegean greek normalize "λόγoς kai" --lenient     # repairs OCR artifacts; warns on stderr
aegean greek strip "μῆνιν"                        # μηνιν   (drop all diacritics)
aegean greek tokenize "ἐν ἀρχῇ ἦν ὁ λόγος."       # one token per line (--sentences to split sentences)
aegean greek syllabify εἰσφέρω                    # εἰσ-φέ-ρω
aegean greek accent λόγος                         # acute, paroxytone
aegean greek quantities πατρός                    # πα:common | τρός:heavy
aegean greek scan "ἄνδρα μοι ἔννεπε, Μοῦσα, …"     # dactylic hexameter
aegean greek ipa "λόγος" --period koine          # loɣos  (--period attic|koine)
aegean greek gloss-nt λόγος                        # Koine gloss, bundled Dodson lexicon (no download)

Real runs:

aegean greek betacode "mh=nin a)/eide qea/"
# μῆνιν ἄειδε θεά

aegean greek syllabify εἰσφέρω
# εἰσφέρω → εἰσ-φέ-ρω

aegean greek quantities πατρός
# πατρός → πα:common | τρός:heavy

aegean greek normalize "λόγoς kai" --lenient
# aegean: lenient normalize: repaired 1 Latin letter(s) in Greek words (o→ο)   [stderr]
# λόγος kai

aegean greek ipa "λόγος" --period koine
# loɣos

accent prints a small table; the Python equivalent of the same fact:

from aegean import greek
greek.accentuation("λόγος").classification     # 'paroxytone'
greek.betacode_to_unicode("mh=nin")            # 'μῆνιν'

Scansion (scan)

scan checks a line against a fixed metrical template and prints the pattern, the feet, and the caesura — or exits 1 with the reason if the line declines. Synizesis is lexical, not guessed: a line that only scans via synizesis on a word outside the curated lexicon declines rather than inventing a fit.

aegean greek scan "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ"
# —⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—⏑⏑|—×
# hexameter: dactyl, dactyl, dactyl, dactyl, dactyl, final; caesura: trochaic

aegean greek scan "ὦ κοινὸν αὐτάδελφον Ἰσμήνης κάρα" --meter trimeter
# ×—⏑—|×—⏑—|×—⏑×
# trimeter: metron, metron, metron; caesura: hephthemimeral

aegean greek scan "λόγος"
# aegean: line does not scan as dactylic hexameter (2 syllables): 'λόγος'   [stderr, exit 1]

--meter accepts:

name metre
hexameter dactylic hexameter (Homer) — the default
pentameter elegiac pentameter (the second line of an elegiac couplet)
trimeter iambic trimeter (tragic/comic dialogue)
glyconic · pherecratean · adonean aeolic cola
sapphic_hendecasyllable the Sapphic eleven-syllable line
alcaic_hendecasyllable · alcaic_enneasyllable · alcaic_decasyllable the Alcaic stanza lines

--json adds feet, syllables, quantities, caesura, and an ambiguous flag. See Meters for what's in scope and what isn't.

Tagging, lemmatizing, parsing

echo "μῆνιν ἄειδε θεά" | aegean greek lemmatize -
# μῆνιν	μῆνις
# ἄειδε	ἀείδω
# θεά	θεά

aegean greek morph λόγον
# λόγος [NOUN acc sg masc]
# λόγος [NOUN acc sg fem]
# λόγος [NOUN nom sg neut]
# λόγος [NOUN acc sg neut]
# λόγος [NOUN voc sg neut]

aegean greek tag "ἐν ἀρχῇ ἦν ὁ λόγος."          # UPOS per token
aegean greek pipeline "ἐν ἀρχῇ ἦν ὁ λόγος." --json   # per-token records in one call

A lemma that the lexicon doesn't know is still returned, marked (fallback) (and "known": false in JSON), so you can tell a real hit from a heuristic guess.

Glossing

aegean greek gloss-nt λόγος
# a word, speech, divine utterance, analogy

aegean greek gloss-nt λόγος --full
# λόγος (G3056): a word, speech, divine utterance, analogy.

aegean greek gloss-nt 3056 --strongs        # look up by Strong's number

gloss-nt uses the bundled CC0 Dodson lexicon — no download. The classical gloss command uses the larger LSJ index instead and activates it on first use (~270 MB, or ~15 MB if lsj-index is fetched). See the backend section below.

Backend flags (download/build on first use)

Each flag stands in for a use_*() activation in the Python API. The first time you use one, it may download a model or build an index to the cache (a note goes to stderr); after that, everything is offline.

flag activates first-use cost
--treebank the Perseus AGDT lexicon ~75 MB fetch
--tagger the generalizing POS tagger trains from the AGDT
--lemmatizer the edit-tree lemmatizer trains from the AGDT
--parser the pure-Python arc-eager dependency parser trains from the AGDT
--neural-lemmatizer the GreTa seq2seq lemmatizer ([neural]) ~232 MB model
--neural the joint neural pipeline — best tagger/parser/lemmatizer ([neural]) ~518 MB model
--lsj LSJ glossing (also set by greek gloss) ~270 MB (or ~15 MB index)
# heavy — fetches the model on first use, then offline:
aegean greek pipeline "ἐν ἀρχῇ ἦν ὁ λόγος." --neural
aegean greek parse "ἐν ἀρχῇ ἦν ὁ λόγος" --neural          # UD dependency tree
aegean greek tag "" --treebank --tagger                  # AGDT lookup + perceptron tagger

Loading real Greek works

work fetches a real text from Perseus canonical-greekLit / First1KGreek (CC BY-SA, commit-pinned, cached once) and parses it into a corpus. works lists a curated, verified catalog of 25 ids; catalog searches the full ~1,800-work discovery index (offline metadata); nt-books lists the 27 NT books and the names the loaders accept. The full id reference is on Greek Works and Books.

aegean greek works
# id              author        title
# tlg0012.tlg001  Homer         Iliad
# tlg0012.tlg002  Homer         Odyssey
# tlg0011.tlg002  Sophocles     Antigone
# tlg0059.tlg030  Plato         Republic
# …  (curated subset — the full canon is at https://scaife.perseus.org)

# heavy (network on first use):
aegean greek work tlg0012.tlg001                 # the Iliad: 24 books, ~127k tokens
aegean greek work tlg0012.tlg001 --ref 1.1-1.50  # just book 1, lines 1–50
aegean greek work tlg0012.tlg001 -o iliad.json   # save as a corpus file

--ref selects a section: 1 (book), 1.2 (chapter), or 1.1-1.50 (line range). --source is auto/perseus/first1k; --edition picks a specific edition file.

catalog is the full discovery index behind works. Where works lists 25 curated highlights, catalog searches the complete bundled metadata for every work with a Greek edition in Perseus canonical-greekLit + First1KGreek — 1,778 works in all (768 from perseus, 1,010 from first1k). It's offline and instant: just metadata, no fetch. Any id it prints goes straight to aegean greek work.

aegean greek catalog --author plato --limit 8
                       Greek works (39 matches)
┌────────────────┬────────┬────────────┬────────────────────┬─────────┐
│ id             │ author │ title      │ greek              │ src     │
├────────────────┼────────┼────────────┼────────────────────┼─────────┤
│ tlg0059.tlg001 │ Plato  │ Euthyphro  │ Εὐθύφρων           │ perseus │
│ tlg0059.tlg002 │ Plato  │ Apology    │ Ἀπολογία Σωκράτους │ perseus │
│ tlg0059.tlg003 │ Plato  │ Crito      │ Κρίτων             │ perseus │
│ tlg0059.tlg004 │ Plato  │ Phaedo     │ Φαίδων             │ perseus │
│ tlg0059.tlg005 │ Plato  │ Cratylus   │ Κρατύλος           │ perseus │
│ tlg0059.tlg006 │ Plato  │ Theaetetus │ Θεαίτητος          │ perseus │
│ tlg0059.tlg007 │ Plato  │ Sophist    │ Σοφιστής           │ perseus │
│ tlg0059.tlg008 │ Plato  │ Statesman  │ Πολιτικός          │ perseus │
└────────────────┴────────┴────────────┴────────────────────┴─────────┘

… and 31 more — narrow with --author/--title, or --limit 0 to list all (-o to save).
Load one with, e.g.:  aegean greek work tlg0012.tlg001 --ref 1.1-1.10

The bare QUERY argument is a catch-all substring over id, author, English title, and Greek title; --author/-a, --title/-t (matches English or Greek), and --source perseus|first1k are the targeted filters (all case-insensitive, all combine with AND). --limit/-n caps the table (0 = all), --json emits the full result set, and --output/-o saves it (.json/.csv/.txt by extension):

aegean greek catalog herodotus --json
[
  {
    "id": "tlg0016.tlg001",
    "author": "Herodotus",
    "title": "Histories",
    "greek_title": "Ἱστορίαι",
    "source": "perseus"
  },
  {
    "id": "tlg0062.tlg056",
    "author": "Lucian of Samosata",
    "title": "Herodotus",
    "greek_title": "Ἡρόδοτος ἢ Ἀετίων",
    "source": "perseus"
  }
]
aegean greek catalog --author aristophanes --source perseus -o aristophanes.csv
# wrote 11 works to aristophanes.csv     (id,author,title,greek_title,source — one row per work)

Coverage is exactly what those open repositories hold at the pinned commit, so some authors are genuinely absent upstream and therefore here too — aegean greek catalog sappho honestly returns nothing rather than inventing an entry:

aegean greek catalog sappho
# No works match. Try a looser filter, or browse https://scaife.perseus.org

The same in Python is greek.catalog(query=None, *, author=None, title=None, source=None), returning a list of {id, author, title, greek_title, source} dicts; greek.popular_works() stays the curated 25.

Reproducing the published numbers (eval)

aegean greek eval TARGET runs the official evaluators against fetched gold data — heavy, but it reproduces pyaegean's measured accuracy. Targets: ud, proiel, nt, tagger, lemmatizer, parser.

# heavy: fetches gold data and the model
aegean greek eval ud --treebank perseus --split test --neural

The exact figures and how they were measured are on Greek NLP and Limitations.


Analysis — aegean analyze …

Exploratory surface analyses over the (largely undeciphered) Aegean material: evidence to weigh, not conclusions. Full method notes are on Analysis.

Phonetic distance and alignment

aegean analyze distance KU-RO KI-RO
# KU-RO ↔ KI-RO: 0.200

aegean analyze align KU-RO KI-RO        # per-position match / vowel / same-class / far / gap

Cross-script comparison

compare romanizes two words from possibly different scripts and aligns them by sound; nearest ranks a corpus's words by closeness to a query word.

aegean analyze compare po-me ποιμήν
# po-me [linearb] → pome    ποιμήν [greek] → poimēn
# similarity 0.62  (distance 0.383)
#   a  b  op
#   p  p  match
#   o  o  match
#   ·  i  ins
#   m  m  match
#   e  ē  sub-vowel
#   ·  n  ins
aegean analyze nearest qa-si-re-u greek --top 5 --json
# [{"candidate": "ἱστορίης", "distance": 0.525}, {"candidate": "ἄειδε", "distance": 0.571}, …]

--script-a/--script-b choose the scripts (greek · lineara · linearb · cypriot); --fold-aspiration maps θ/φ/χ → t/p/k, which is fairer against defective syllabic spelling. These numbers are exploratory — read the alignment and the ranking, not the absolute distance.

Association and co-occurrence

aegean analyze assoc lineara KU-RO KI-RO    # χ², log-likelihood, Fisher, PMI over shared documents
aegean analyze cooccur lineara KU-RO        # what shares a tablet with KU-RO, ranked

Morphology, structure, scribal hands

aegean analyze clusters lineara             # stem + productive-suffix clusters (exploratory)
aegean analyze structure lineara            # accounting / libation / list / text / other census
aegean analyze structure lineara HT13       # classify one document
aegean analyze hands damos                  # scribal-hand profiles (needs a hand per document)
aegean analyze hands damos --hand "Knossos 103"   # one hand's characteristic vocabulary (keyness)

hands needs a corpus that records a scribe per document — DAMOS does, so it fetches on first use; the bundled lineara records HT scribes too.

Save any of these to a file. assoc, cooccur, clusters, and hands all take --output/-o, with the format set by the extension — .json, .csv (stdlib, no pandas), or .txt:

aegean analyze cooccur lineara KU-RO -o ku-ro-neighbours.json
aegean analyze clusters lineara -o clusters.csv

Data — aegean data …

The fetch-to-cache layer: list what can be downloaded, fetch it (sha256-verified), pin versions for a paper, and inspect the cache.

aegean data list                                   # the fetchable datasets (sizes, licenses)
aegean data fetch grc-joint                         # pre-fetch (e.g. before going offline)
aegean data versions --json > data-versions.json    # pin every dataset's sha256 for reproducibility
aegean data cache                                   # cache location + contents (override: PYAEGEAN_CACHE)

aegean data list shows the full registry. The fetchable datasets (all downloaded on demand, never bundled):

name what license
agdt-derived prebuilt AGDT lexicon + tagger/lemmatizer/parser models CC BY-SA 3.0 (Perseus AGDT)
grc-joint the joint tagger-parser-lemmatizer model (~518 MB; the [neural] extra) CC BY-SA 4.0
grc-lemma-neural the GreTa seq2seq lemmatizer (~232 MB; the [neural] extra) CC BY-SA 4.0
lsj-index prebuilt LSJ lemma→entry index (~15 MB) CC BY-SA 4.0 (Perseus)
damos-corpus DAMOS Linear B corpus (~5,900 tablets) — aegean.load('damos') CC BY-NC-SA 4.0
sigla-corpus SigLA Linear A dataset (781 docs) — aegean.load('sigla') CC BY-NC-SA 4.0
nt-corpus Greek New Testament (Nestle 1904; ~137,800 tokens) — aegean.load('nt') CC0-1.0
lineara-images 3,368 facsimile/photo files (~116 MB) academic reference only
linearb-corpus a bring-your-own Linear B export (no default source) per your source
workbench-app the prebuilt workbench web app (~3 MB) — served by aegean workbench Apache-2.0

aegean data versions --json is the reproducibility manifest — every bundled and fetched dataset with its sha256. See Data & Provenance for the licensing details and why nothing non-redistributable is bundled.


SQLite — aegean db …

Build a queryable SQLite database from any corpus (documents + tokens + an FTS5 full-text index) and search it.

aegean db build lineara -o lineara.db        # → "wrote 1721 documents to lineara.db"
aegean db search lineara.db KU-RO --limit 3
   'KU-RO' in lineara.db
┌───────┬─────┬───────┐
│ doc   │ pos │ text  │
├───────┼─────┼───────┤
│ HT9a  │ 25  │ KU-RO │
│ HT9b  │ 20  │ KU-RO │
│ HT11a │ 7   │ KU-RO │
└───────┴─────┴───────┘

db build resolves its corpus like anything else — so aegean db build tlg0012.tlg001 -o iliad.db builds a database straight from a Greek work id. --no-fts skips the full-text index. aegean export CORPUS -f sqlite -o file.db writes the same database. Load it back in Python with Corpus.from_sql(path), or stream it with aegean.db.stream(path).

db add — grow an existing database

db add upserts documents into a database you already built: a document whose id already exists is replaced, new ids are added, and the FTS5 index is refreshed. The source resolves like any corpus argument (id, .json/.db, work id, or -):

aegean db build lineara -o aegean.db         # → "wrote 1721 documents to aegean.db"
aegean db add cypriot -o aegean.db           # → "added/updated 2 documents in aegean.db"

Mixing scripts is allowed and noted on stderr (the database's script id becomes mixed). The Python equivalents take an append=True flag:

corpus.to_sql("aegean.db", append=True)      # or aegean.db.to_sqlite(corpus, "aegean.db", append=True)

AI — aegean ai … (exploratory, key-gated)

The generative layer. Every result here is exploratory — a labeled model hypothesis carrying its grounding evidence, never a citable fact, and never a "decipherment." It needs a provider SDK (an extra such as pip install "pyaegean[anthropic]") and that provider's API key in your environment. Without a key, the command exits 1 with a clear message — it never silently calls out.

aegean ai providers
# anthropic
# gemini
# grok
# openai

The commands (each takes --provider / --model, and most take --trace):

aegean ai translate "ἐν ἀρχῇ ἦν ὁ λόγος"                      # grounded hybrid translation
aegean ai translate "KU-RO 130" --script lineara              # exploratory (undeciphered!)
aegean ai gloss "μῆνιν ἄειδε θεά"                             # interlinear word-by-word gloss
aegean ai hypotheses "A-TA-I-*301-WA-JA" --corpus lineara     # cautious decipherment hypotheses
aegean ai ask "What is KU-RO?" --corpus lineara --trace       # answer strictly from grounding
aegean ai extract "OLE S 1" --fields commodity,amount         # structured JSON, ready for jq
aegean ai eval --provider anthropic                           # grounding-fidelity eval

--corpus NAME grounds the answer on that corpus's frequent words. --trace prints the grounding provenance under the answer — the local corpus / lexicon / analysis facts the model was given, grouped by source — so you can audit exactly what it was (and wasn't) told. extract always prints JSON, so it pipes straight into jq.

Save the output, label and all. translate, gloss, hypotheses, ask, and extract take --output/-o. A .json file carries the text plus its provenance and grounding evidence; a .txt file is the labeled text. The exploratory label stays attached on disk — a saved result never loses the "this is a hypothesis, not a finding" framing:

aegean ai gloss "μῆνιν ἄειδε θεά" -o gloss.json        # text + provenance + grounding
aegean ai ask "What is KU-RO?" --corpus lineara -o answer.txt

In Python the same lives on ExploratoryResult: .to_dict(), .to_json(path), and ExploratoryResult.from_dict(data) round-trip a result through disk with its label and grounding intact. The full design and the meaning of "grounded" are on AI Layer; the hard limits are on Limitations.


MCP server — aegean-mcp

A separate console script (the [mcp] extra) that exposes the toolkit to AI agents — Claude Code and other MCP clients — over stdio, so an agent can use pyaegean without writing Python.

pip install "pyaegean[mcp]"
aegean-mcp                # serve the read/analysis tools over stdio

It offers a small set of read/analysis tools: list and inspect corpora, wildcard sign search, accounting reconciliation, the Greek pipeline, verse scansion, and Koine glossing.


Recipes

Reconcile every Linear A account and keep only the failures:

aegean balance lineara --json | jq '[.[] | select(.balances | not)]' > discrepancies.json

Lemmatize a file of Greek, one lemma per line:

cat chapter.txt | aegean greek lemmatize - --json | jq -r '.[].lemma'

Scan a poem line by line, keeping only the lines that scan:

while read -r line; do aegean greek scan "$line" --json 2>/dev/null | jq -r .pattern; done < poem.txt

Map a word's distribution and cite the subset you used:

aegean geo lineara --output sites.geojson
aegean cite lineara --site "Zakros" --style bibtex >> paper.bib

Build one searchable database of all of Homer straight from work ids, then keep growing it — no Python anywhere:

aegean combine tlg0012.tlg001 tlg0012.tlg002 -o homer.db     # Iliad + Odyssey (see Greek Works and Books for ids)
aegean db add tlg0011.tlg002 -o homer.db                     # add Sophocles' Antigone later
aegean db search homer.db μῆνιν --limit 3

Save a query subset once, reuse it everywhere:

aegean query lineara --where "site-is=Zakros" -o zakros.json
aegean keyness zakros.json --reference lineara --top 5 -o zakros-keyness.csv

More worked pipelines are on Recipes.


Notes and limits

  • The AI layer is exploratory. Translations, glosses, and "hypotheses" for undeciphered material are labeled model output with grounding, not findings. The Aegean scripts remain undeciphered. See Limitations.
  • Heavy commands download on first use. Anything marked heavy here (--neural, greek work, greek eval, gloss, the fetched corpora) pulls data to the cache the first time, with a note on stderr; afterwards it's offline. Pre-fetch with aegean data fetch before going offline.
  • --json is the contract; the table view is for humans. Don't parse the rich tables — pass --json and use jq. --limit trims only the human view.
  • Metre and accuracy are bounded. Lyric metres beyond the fixed aeolic templates are out of scope, and the trainable backends have measured ceilings — both documented on Meters and Limitations.

For the terse one-page index of every command and flag, see the CLI Cheatsheet.

Clone this wiki locally