-
Notifications
You must be signed in to change notification settings - Fork 0
Limitations
pyaegean's rule is that it tells you where it's confident and where it's guessing. This page is the complete picture in one place: what the toolkit cannot do and why, what it does not yet do and plans to, and what it deliberately won't do. Use it the way you'd use a methods section — to know exactly how far a result can be pushed before you reach for a human expert.
It is a living register. Found something it gets wrong, or a limit worth recording? See For Specialists for how to file a correction or a validation. Three kinds of limits behave very differently, and the rest of this page is organized around them:
- Limits of the evidence — undeciphered scripts, fragmentary tablets, contested readings. No amount of code removes these; the toolkit's job is to keep them visible.
- Limits of licensing — data the project may use but not redistribute, so it is fetched on demand rather than bundled.
- Engineering limits — things code can fix, tracked on the roadmap, each paired with the plan.
Then two reference sections: the measured accuracy boundaries (the weak numbers, published alongside the strong ones) and the by-design trade-offs (decisions made on purpose, not bugs and not on the roadmap).
Related pages: Greek NLP · Meters · Linear A · Analysis · CLI · Data & Provenance.
| Area | The honest limit | Where it's covered |
|---|---|---|
| Linear A / Cypro-Minoan readings | Undeciphered; results are exploratory, never translations | Evidence |
| Linear A accounting | Only ~35 of 1,721 tablets carry a checkable total | Evidence |
| Full Linear B corpus | NonCommercial → fetched, not bundled (load("damos")) |
Licensing |
| SigLA / UD / PROIEL data | NonCommercial → fetched for research/eval only | Licensing |
| Offline morphology | Misses irregular/3rd-decl./contract paradigms | Engineering |
| Neural model size | ~518 MB fp32 download | Engineering |
| Out-of-domain NLP | Accuracy drops on held-out treebanks | Accuracy |
| Zero-dep baselines | Honest floors, not the accuracy story | Accuracy |
| AI layer | Exploratory by construction — never a reading | By design |
These are properties of the material itself. They will not go away with a better release.
-
Linear A and Cypro-Minoan are undeciphered. Linear A's phonetic transcription uses Linear B sound values as a working convention — in the bundled inventory 47 of the 344 signs carry an empirical sound value (drawn from the 67 signs shared with the Linear B grid), each with a
confidence; the rest have no agreed reading. Cypro-Minoan goes further: of its 99 catalogued signs, none carries a settled sound value, so pyaegean offers no transliteration or lexicon for it. Every analytical or generative result over this material is labeled exploratory — evidence for a human expert to weigh, never a reading or a translation.import aegean inv = aegean.get_script("lineara").sign_inventory read = [s for s in inv if s.phonetic] print(len(list(inv)), len(read)) # 344 47
# The same fact from the CLI — a read sign vs. an unread one: aegean sign lineara KU --json # {"label": "KU", "glyph": "𐙂", "phonetic": "ku", ...} aegean sign cyprominoan CM001 --json # {"label": "CM001", "glyph": "𒾐", "phonetic": "", ...}
-
The corpora are fragmentary. Only about 35 of the 1,721 bundled Linear A inscriptions carry a stated, checkable accounting total (
KU-RO/TO-SO); most tablets are too damaged to balance. Stricter still, only a handful are intact and balancing —accounting.checkable_accounts(corpus)returns the clean drill set. That is the nature of the evidence, not a parser gap.import aegean from aegean.analysis import accounting la = aegean.load("lineara") with_total = [d for d in la if accounting.balance_check(d)] print(len(with_total)) # 35 print(len(accounting.checkable_accounts(la))) # 7 (intact AND balancing)
-
Damage is recorded, not hidden. Where the upstream edition marked a sign as erased or as a damaged/bracketed reading, pyaegean keeps it as a token status rather than dropping it: 552 tokens load as
LOST(across 326 documents) and 120 asUNCLEAR(across 91 documents). In total 366 documents carry at least one such mark — so any analysis can choose to trust, weight, or exclude them.import aegean from aegean import ReadingStatus la = aegean.load("lineara") lost = sum(t.status is ReadingStatus.LOST for d in la for t in d.tokens) unclear = sum(t.status is ReadingStatus.UNCLEAR for d in la for t in d.tokens) print(lost, unclear) # 552 120
-
Linear A metrology and section boundaries are contested.
balance_check's line-item sections are heuristic, and the default 10% tolerance is a lenient cutoff chosen because Aegean fractional metrology is imperfectly understood. A reported "discrepancy" is a lead, not a verdict on the scribe. -
Annotation conventions differ across treebanks. On the out-of-domain PROIEL evaluation, morphological-feature scores are capped by convention mismatches (PROIEL annotates feature types the Perseus scheme lacks, and the 9-character AGDT positional tag doesn't exist there at all) — a measurement boundary, not a model defect. Protocol notes:
docs/benchmarks.md.
The four undeciphered/partly-read scripts at a glance:
| Script |
aegean.load(...) id |
Signs | Signs with a sound value | Lexicon | Status |
|---|---|---|---|---|---|
| Linear A | lineara |
344 | 47 (working convention) | — | Undeciphered |
| Linear B | linearb |
(Linear B grid) | full grid | 150 entries | Deciphered |
| Cypriot syllabary | cypriot |
(ICS grid) | full grid | 17 entries | Deciphered |
| Cypro-Minoan | cyprominoan |
99 | 0 | — | Undeciphered |
Some of the most useful data is NonCommercial or all-rights-reserved. pyaegean's wheel is Apache-2.0, so NonCommercial data can't live in the wheel — it is fetched to your local cache on demand, sha256-pinned, with the upstream obligations passing through to you. The bundled, offline defaults stay small and license-clean.
-
The full Linear B corpus is NonCommercial, so it is fetched, not bundled. The most complete edition, DAMOS (Oslo), is CC BY-NC-SA 4.0 — usable and redistributable, but NonCommercial.
aegean.load("damos")fetches it (~5,900 tablets: Knossos, Pylos, Thebes, …, with site/chronology, scribal hands, find context, and object class) to your cache on demand, and the NC + ShareAlike obligations pass to you. The bundled 18-tablet sample remains the offline, zero-network default.import aegean lb = aegean.load("linearb") # bundled sample, offline print(len(lb)) # 18 damos = aegean.load("damos") # fetched on first use (network), ~5,900 tablets
-
SigLA's sign-level data (Salgarella & Castellan) is CC BY-NC-SA 4.0 and is integrated on the same fetch-to-cache, research-use, never-bundled pattern.
aegean.load("sigla")fetches the decoded v2 dataset (781 documents) with typology, dimensions, periods, SigLA's own word division (1,376 words grouped intoWORDtokens) and commodity ideograms (LOGOGRAMtokens). SigLA is a palaeographic database — it records sign occurrences and word division, not the cardinal-number quantities of the accounts — so it carries no numeral values (use GORILA, the bundled corpus, for accounting), and its word division and complex-sign notation differ editorially from GORILA. The drawings remain at sigla.phis.me (referenced, never redistributed). -
The UD and PROIEL treebanks are CC BY-NC-SA — pyaegean fetches them for evaluation only and never trains on them. (The models are trained on the CC BY-SA AGDT / Gorman / Pedalion treebanks, which permit it.)
-
Linear A facsimile imagery is © École Française d'Athènes and other rightsholders: referenced and fetched for academic use, never redistributed (~116 MB mirror, opt-in).
The fetched, never-bundled assets, and why:
load id / asset |
What it is | License | Why fetched, not bundled |
|---|---|---|---|
damos |
Full Linear B, ~5,900 tablets | CC BY-NC-SA 4.0 | NonCommercial |
sigla |
SigLA Linear A palaeography, 781 docs | CC BY-NC-SA 4.0 | NonCommercial |
nt |
Greek NT (Nestle 1904), ~137,800 tokens | CC0 / public domain | Size — redistributable, one book bundled offline |
| UD / PROIEL folds | Evaluation treebanks | CC BY-NC-SA | NonCommercial; eval only, never trained on |
| neural model | Joint tagger/parser/lemmatizer | CC BY-SA (from treebanks) | Size (~518 MB); the [neural] extra |
| facsimile mirror | Linear A photos/drawings | © rightsholders | Not licensed for redistribution |
Bring-your-own corpora are supported too: point PYAEGEAN_LINEARB_CORPUS at a
local edition (e.g. LiBER, which is all-rights-reserved and stays
bring-your-own). See Data & Provenance for the cache layout and
the integrity-check workflow.
Each of these is something code can fix, and each is on the roadmap.
| Limitation today | Plan |
|---|---|
The bundled Linear A corpus is a normalized transcription — the full Leiden apparatus (restorations, dotted readings) was dropped upstream. What survived is now interpreted (erased marks → LOST, bracketed readings → UNCLEAR), and SigLA is loadable (aegean.load("sigla"): 781 documents with typology, dimensions, periods, SigLA's own word division, and commodity ideograms) |
Largely done. Remaining by design: SigLA carries no cardinal-numeral values (it is a palaeographic sign database) — use GORILA for accounting |
| Linear B bundles an 18-tablet illustrative sample and a 150-entry Greek-bridge lexicon — every entry source-attested (curated core + Wiktionary-stated equations). 150 is close to the natural ceiling of stated Ancient Greek equations; many Mycenaean words have no alphabetic descendant to bridge to | The full ~5,900-tablet corpus is one call away via aegean.load("damos"); lexicon growth is contribution-driven and per-entry verified |
| The Cypriot lexicon is small (17 entries, Idalion-centred) | Grows by verified contribution from published ICS facts |
greek.load_work reads top-level textparts only and drops <note>/<bibl> silently |
Done: load_work(work, ref=…) addresses a textpart / nested div / verse line-range, and <note>/<bibl> are carried in DocumentMeta.notes
|
Done: greek.popular_works() / aegean greek works (a curated, source-verified catalog of 25 well-known works) and greek.nt_books() / aegean greek nt-books (the 27 NT books); any Perseus / First1KGreek CTS id still works |
|
| Scansion covers dactylic hexameter, elegiac pentameter, iambic trimeter (with resolution), and the aeolic lyric lines (glyconic, pherecratean, sapphic, alcaic). Synizesis is applied only for words in a curated, test-enforced lexicon — a line needing it on an un-listed word declines rather than guesses | Done. Remaining by design: non-aeolic lyric (dactylo-epitrite, free astrophic) — a research project. See Meters |
| The offline rule morphology misses irregular, third-declension, and contract paradigms, and doesn't restore accents on reconstructed lemmas | By design — the treebank and neural tiers cover these forms; a redistributable offline paradigm table would need a license-clean source (the Morpheus audit did not clear) |
| The neural pipeline's model is a 518 MB fp32 download (int8 quantization failed its accuracy gate and was rejected) | Planned: selective quantization under the same ≤0.3-point gate, and optional GPU execution providers |
| 33 of 56 gazetteer find-sites carry Pleiades ids; the rest are mostly minor findspots / peak sanctuaries not yet in Pleiades | The alignment was extended (each coordinate-verified); the remaining sites are listed as upstream-contribution candidates (docs/pleiades-candidates.md) |
| The syllabification exception lexicon lists dictionary forms; inflected compounds fall back to the phonotactic rules | Grows by contribution — adding an entry is a one-line, test-enforced change (CONTRIBUTING) |
The discovery helpers above are pure offline metadata — handy when you don't know the CTS id you need:
from aegean import greek
print(len(greek.popular_works())) # 25 (Homer, the tragedians, Plato, …)
print(len(greek.nt_books())) # 27aegean greek works # curated catalog with CTS ids
aegean greek nt-books # the 27 NT books load_nt acceptsEvery accuracy claim is measured and the protocol published — which means the weak numbers are public too. Two boundaries matter most.
The zero-dependency baselines are honest floors, not the accuracy story. They exist for the zero-install path; the opt-in tiers, ending in the neural pipeline, are where the accuracy lives.
| Baseline (pure-Python, zero deps) | Roughly | Use it for |
|---|---|---|
| Rule POS tagger | High precision on closed classes only | Quick, offline first pass |
| Edit-tree lemmatizer | ~40% on unseen forms | Offline fallback |
| Arc-eager parser | ~0.67 UAS (projective) | Offline structural sketch |
Turn on the neural pipeline (greek.use_neural_pipeline(), the [neural] extra)
and the same functions consult it instead.
Out-of-domain performance drops, as it does for every system. The neural pipeline's lemma accuracy is 94.40 on the UD Perseus test fold and 90.57 on UD PROIEL — a treebank none of pyaegean's models train on. Both numbers are always reported together.
| Metric | UD Perseus (in family) | UD PROIEL (out of domain) |
|---|---|---|
| Lemma | 94.40 | 90.57 |
| UPOS | 96.94 | 87.16 |
| UFeats | 96.12 | 59.49 (convention-capped) |
| UAS | 89.16 | 82.52 |
| LAS | 84.38 | 63.51 (scheme-capped) |
The PROIEL UFeats/LAS figures are measurement boundaries, not model failures:
PROIEL's UD conversion uses feature types and a relation scheme the Perseus
training data doesn't share. Full tables and protocol:
docs/benchmarks.md.
See Greek NLP for how to switch tiers.
These are deliberate. They're listed here so you can judge them, not so they get "fixed."
-
The core stays zero-dependency — no pydantic, no database backend, instant import. Heavy capability lives behind extras and the fetch-to-cache layer. Models and corpora are never bundled in the wheel. The opt-in extras:
Extra Adds Powers clityper, rich the aegeancommandneuralonnxruntime, tokenizers, numpy the joint NLP pipeline (torch-free) datapandas tabular export / DataFrames parquetpyarrow Parquet export epidoclxml EpiDoc TEI in/out geogeopandas, shapely GeoJSON / GeoDataFrames vizmatplotlib the plotfiguresmcpmcp the Model Context Protocol server (e.g. for Claude Code) anthropic/openai/gemini/grokone provider SDK the AI layer aiall providers the AI layer, any provider allai,epidoc,geo,data,cli,viz,mcpeverything except neural -
to_dict()is a lossy summary on purpose (words + metadata, for quick interop); the lossless, reversible path isCorpus.to_json()/Corpus.from_json(). Reach for the latter whenever round-tripping matters. -
The gazetteer is mapping-grade (~1 km site-level coordinates), for distribution maps — not survey work.
-
The AI layer is exploratory by construction. Every generative result is a labeled, provenanced hypothesis built on local deterministic grounding — by policy it is never presented as a reading, whatever the model's confidence. It needs your own API key and the matching extra; nothing is sent anywhere unless you turn it on.
-
Corpora are held in memory — fine for everything pyaegean ships and fetches (the largest, DAMOS, is ~5,900 documents).
Corpus.iter_documents/iter_tokens/iter_wordsstream token-by-token where you need it, and an opt-in analysis cache reuses heavy results; true streaming load is deferred until a corpus needs it (design note: docs/large-corpora.md). -
Try it without installing anything — the core pipeline runs client-side via Pyodide at the web demo; the CLI is the full no-Python path; and the linearaworkbench web app covers the visual Linear A use case.
Spotted a limit that's missing, stale, or wrong? That's exactly what this page is for — see For Specialists to file it.
Start here
Aegean scripts
Greek
Capabilities
Reference