Skip to content
github-actions[bot] edited this page Jun 12, 2026 · 13 revisions

The aegean CLI

The whole toolkit from the command line — corpora, Greek NLP, analysis, data, and the (exploratory) AI layer — without writing Python.

pip install "pyaegean[cli]"     # typer + rich; the core stays zero-dependency
aegean --help

Three conventions hold everywhere:

  • --json on every command prints one machine-readable JSON document, so results pipe into jq, files, or other tools.
  • - reads stdin wherever a command takes a TEXT argument, so commands compose in shell pipelines.
  • Exit codes script cleanly: 0 success, 1 domain error (one line on stderr), 2 usage error. balance --strict exits 1 on any unbalanced total.

On Windows, set PYTHONUTF8=1 (and a UTF-8 console) so Greek renders correctly.

Corpus commands

aegean info lineara                         # size, provenance, license, citation
aegean load lineara --site "Haghia Triada" -o ht.json   # filter → lossless JSON
aegean show lineara HT13                    # one document, line by line
aegean search lineara "KU-*-RO"             # wildcard sign-pattern word search
aegean stats lineara --top 10               # word frequencies (--signs for signs)
aegean dispersion damos --top 10            # Gries' DP: evenly spread vs concentrated
aegean keyness damos --site Pylos           # key vocabulary of a subset vs the rest
aegean cache                                # the opt-in analysis cache (--clear to wipe)
aegean plot keyness damos --site Pylos -o pylos.png   # any of: freq | dispersion |
aegean plot scansion "ἄνδρα μοι ἔννεπε…" -o scan.svg  # keyness | network | balance | scansion
aegean balance lineara HT13                 # KU-RO reconciliation (--strict exits 1)
aegean cite lineara --site "Haghia Triada"  # cite the exact subset (--style bibtex|apa)
aegean export lineara -f csv -o lineara.csv # json | csv | parquet | epidoc
aegean geo lineara                          # located find-sites (+ --output sites.geojson)
aegean sign lineara KU                      # one sign: glyph, codepoint, sound value
aegean bridge linearb po-me                 # po-me → ποιμήν (shepherd)

Every command takes a corpus name: the bundled lineara / linearb / cypriot / cyprominoan / greek, or the fetched-on-demand damos (the full ~5,900-tablet Linear B corpus) and sigla (the SigLA Linear A dataset) — both CC BY-NC-SA, downloaded to your cache on first use. aegean stats damos --top 10 works exactly like its lineara counterpart.

The compound query engine takes repeated --where field=value rows (prefix or: to OR a row, ! to negate it; --fields lists the field registry):

aegean query lineara --where "site-is=Haghia Triada" --where "or:id-contains=ZA" \
       --output-kind words --json

Query results and filtered subsets print their citation, so the exact result set used in a paper is one --json | jq .citation away.

Greek NLP (aegean greek …)

The zero-dependency stages work immediately:

aegean greek betacode "mh=nin a)/eide qea/"          # μῆνιν ἄειδε θεά
aegean greek normalize "λόγoς kai μh=νιν" --lenient  # repairs OCR artifacts, warns on stderr
aegean greek tokenize "ἐν ἀρχῇ ἦν ὁ λόγος." [--sentences]
aegean greek syllabify εἰσφέρω                       # εἰσ-φέ-ρω (compound exception)
aegean greek accent λόγος                            # paroxytone
aegean greek quantities πατρός                       # syllable quantities
aegean greek scan "ἄνδρα μοι ἔννεπε, Μοῦσα, πολύτροπον, ὃς μάλα πολλὰ"
aegean greek scan "ὦ κοινὸν αὐτάδελφον Ἰσμήνης κάρα" --meter trimeter  # iambic dialogue
aegean greek ipa "λόγος" --period koine              # reconstructed pronunciation
aegean greek tag "ἐν ἀρχῇ ἦν ὁ λόγος."               # UPOS per token
aegean greek lemmatize "μῆνιν ἄειδε θεά"             # lemma per word
aegean greek morph λόγον                             # candidate morphological parses
aegean greek pipeline "ἐν ἀρχῇ ἦν ὁ λόγος." --json   # per-token records, one call

Backend flags stand in for the use_*() activations — each may download/build to the cache on first use (a note goes to stderr), then everything is offline:

aegean greek pipeline "ἐν ἀρχῇ ἦν ὁ λόγος." --neural   # the joint neural pipeline
aegean greek parse "ἐν ἀρχῇ ἦν ὁ λόγος" --neural       # UD dependency tree
aegean greek tag "" --treebank --tagger               # AGDT lookup + perceptron tagger
aegean greek gloss λόγος                               # LSJ gloss (~270 MB first use)

Real Greek works load on demand (Perseus canonical-greekLit / First1KGreek, CC BY-SA, commit-pinned, cached):

aegean greek work tlg0012.tlg001                 # the Iliad: 24 books, ~127k tokens
aegean greek work tlg0012.tlg001 -o iliad.json   # as a round-trippable corpus file
aegean greek work tlg0012.tlg001 --ref 1.1-1.50  # just book 1, lines 1–50

aegean greek eval ud --treebank perseus --split test --neural reproduces the published numbers through the official CoNLL 2018 evaluator (heavy: fetches gold data and the model); eval proiel|tagger|lemmatizer|parser cover the other measured evaluations.

Analysis (aegean analyze …)

Exploratory surface analyses over the undeciphered material — evidence to weigh, not conclusions:

aegean analyze distance KU-RO KI-RO          # weighted phonetic distance
aegean analyze align KU-RO KI-RO             # per-position alignment
aegean analyze compare po-me ποιμήν          # cross-script: Linear B vs Greek by sound
aegean analyze nearest qa-si-re-u greek      # rank a corpus's words by sound (→ βασιλεύς)
aegean analyze assoc lineara KU-RO KI-RO     # χ², G², Fisher, PMI over shared documents
aegean analyze cooccur lineara KU-RO         # what shares a tablet with KU-RO
aegean analyze clusters lineara              # stem + productive-suffix clusters
aegean analyze structure lineara [HT13]      # accounting/libation/list/text census

compare/nearest take --script-a/--script-b (greek · lineara · linearb · cypriot) and --fold-aspiration (θ/φ/χ → t/p/k, fairer against syllabic spelling).

Data (aegean data …)

aegean data list      # the fetchable datasets (sizes, licenses)
aegean data fetch grc-joint    # pre-fetch (e.g. before going offline); sha256-verified
aegean data cache     # cache location + contents (override with PYAEGEAN_CACHE)
aegean data versions --json > data-versions.json   # pin every dataset's sha256 for a paper

AI (aegean ai …) — exploratory, key-gated

Generative commands need a provider extra (pip install "pyaegean[anthropic]") and its API key in the environment; every result is labeled exploratory and carries its grounding:

aegean ai providers
aegean ai translate "ἐν ἀρχῇ ἦν ὁ λόγος"             # grounded hybrid translation
aegean ai translate "KU-RO 130" --script lineara      # exploratory (undeciphered!)
aegean ai gloss "μῆνιν ἄειδε θεά"                     # interlinear gloss
aegean ai hypotheses "A-TA-I-*301-WA-JA" --corpus lineara   # cautious decipherment hypotheses
aegean ai ask "What is KU-RO?" --corpus lineara --trace      # --trace audits the grounding
aegean ai extract "OLE S 1" --fields commodity,amount        # structured JSON, ready for jq
aegean ai eval --provider anthropic                          # grounding-fidelity eval

Add --trace to any of these to print the grounding provenance under the answer — the local corpus/lexicon/analysis facts the model was given, grouped by source.

Recipes

Reconcile every Haghia Triada account and keep the failures:

aegean balance lineara --json | jq '[.[] | select(.balances | not)]' > discrepancies.json

Lemmatize a file of Greek, one lemma per line:

cat chapter.txt | aegean greek lemmatize - --json | jq -r '.[].lemma'

Scan a poem line-by-line, keeping only the lines that scan:

while read -r line; do aegean greek scan "$line" --json 2>/dev/null | jq -r .pattern; done < poem.txt

Map a word's distribution and cite the subset you used:

aegean geo lineara --output sites.geojson
aegean cite lineara --site "Zakros" --style bibtex >> paper.bib

Clone this wiki locally