Limitations

pyaegean's rule is that it tells you where it's confident and where it's guessing. This page is the complete picture in one place: what the toolkit cannot do and why, what it does not yet do and plans to, and what it deliberately won't do. Use it the way you'd use a methods section — to know exactly how far a result can be pushed before you reach for a human expert.

It is a living register. Found something it gets wrong, or a limit worth recording? See For Specialists for how to file a correction or a validation. Three kinds of limits behave very differently, and the rest of this page is organized around them:

Limits of the evidence — undeciphered scripts, fragmentary tablets, contested readings. No amount of code removes these; the toolkit's job is to keep them visible.
Limits of licensing — data the project may use but not redistribute, so it is fetched on demand rather than bundled.
Engineering limits — things code can fix, tracked on the roadmap, each paired with the plan.

Then two reference sections: the measured accuracy boundaries (the weak numbers, published alongside the strong ones) and the by-design trade-offs (decisions made on purpose, not bugs and not on the roadmap).

Related pages: Greek NLP · Meters · Linear A · Analysis · CLI · Data & Provenance.

Quick map of the limits

Area	The honest limit	Where it's covered
Linear A / Cypro-Minoan readings	Undeciphered; results are exploratory, never translations	Evidence
Linear A accounting	Only ~35 of 1,721 tablets carry a checkable total	Evidence
Full Linear B corpus	NonCommercial → fetched, not bundled (`load("damos")`)	Licensing
SigLA / UD / PROIEL data	NonCommercial → fetched for research/eval only	Licensing
Offline morphology	Misses irregular/3rd-decl./contract paradigms	Engineering
Neural model size	~518 MB fp32 download	Engineering
Out-of-domain NLP	Accuracy drops on held-out treebanks	Accuracy
Zero-dep baselines	Honest floors, not the accuracy story	Accuracy
AI layer	Exploratory by construction — never a reading	By design

Limits of the evidence (not fixable by code)

These are properties of the material itself. They will not go away with a better release.

Linear A and Cypro-Minoan are undeciphered. Linear A's phonetic transcription uses Linear B sound values as a working convention — in the bundled inventory 47 of the 344 signs carry an empirical sound value (drawn from the 67 signs shared with the Linear B grid), each with a confidence; the rest have no agreed reading. Cypro-Minoan goes further: of its 99 catalogued signs, none carries a settled sound value, so pyaegean offers no transliteration or lexicon for it. Every analytical or generative result over this material is labeled exploratory — evidence for a human expert to weigh, never a reading or a translation.
```
import aegean
inv = aegean.get_script("lineara").sign_inventory
read = [s for s in inv if s.phonetic]
print(len(list(inv)), len(read))     # 344 47
```
```
# The same fact from the CLI — a read sign vs. an unread one:
aegean sign lineara KU --json
# {"label": "KU", "glyph": "𐙂", "phonetic": "ku", ...}
aegean sign cyprominoan CM001 --json
# {"label": "CM001", "glyph": "𒾐", "phonetic": "", ...}
```
The corpora are fragmentary. Only about 35 of the 1,721 bundled Linear A inscriptions carry a stated, checkable accounting total (KU-RO/TO-SO); most tablets are too damaged to balance. Stricter still, only a handful are intact and balancing — accounting.checkable_accounts(corpus) returns the clean drill set. That is the nature of the evidence, not a parser gap.
```
import aegean
from aegean.analysis import accounting
la = aegean.load("lineara")
with_total = [d for d in la if accounting.balance_check(d)]
print(len(with_total))                       # 35
print(len(accounting.checkable_accounts(la))) # 7  (intact AND balancing)
```
Damage is recorded, not hidden. Where the upstream edition marked a sign as erased or as a damaged/bracketed reading, pyaegean keeps it as a token status rather than dropping it: 552 tokens load as LOST (across 326 documents) and 120 as UNCLEAR (across 91 documents). In total 366 documents carry at least one such mark — so any analysis can choose to trust, weight, or exclude them.
```
import aegean
from aegean import ReadingStatus
la = aegean.load("lineara")
lost = sum(t.status is ReadingStatus.LOST for d in la for t in d.tokens)
unclear = sum(t.status is ReadingStatus.UNCLEAR for d in la for t in d.tokens)
print(lost, unclear)                          # 552 120
```
Linear A metrology and section boundaries are contested. balance_check's line-item sections are heuristic, and the default 10% tolerance is a lenient cutoff chosen because Aegean fractional metrology is imperfectly understood. A reported "discrepancy" is a lead, not a verdict on the scribe.
Annotation conventions differ across treebanks. On the out-of-domain PROIEL evaluation, morphological-feature scores are capped by convention mismatches (PROIEL annotates feature types the Perseus scheme lacks, and the 9-character AGDT positional tag doesn't exist there at all) — a measurement boundary, not a model defect. Protocol notes: docs/benchmarks.md.

The four undeciphered/partly-read scripts at a glance:

Script	`aegean.load(...)` id	Signs	Signs with a sound value	Lexicon	Status
Linear A	`lineara`	344	47 (working convention)	—	Undeciphered
Linear B	`linearb`	(Linear B grid)	full grid	150 entries	Deciphered
Cypriot syllabary	`cypriot`	(ICS grid)	full grid	17 entries	Deciphered
Cypro-Minoan	`cyprominoan`	99	0	—	Undeciphered

Limits of licensing (fixable only by permission)

Some of the most useful data is NonCommercial or all-rights-reserved. pyaegean's wheel is Apache-2.0, so NonCommercial data can't live in the wheel — it is fetched to your local cache on demand, sha256-pinned, with the upstream obligations passing through to you. The bundled, offline defaults stay small and license-clean.

The full Linear B corpus is NonCommercial, so it is fetched, not bundled. The most complete edition, DAMOS (Oslo), is CC BY-NC-SA 4.0 — usable and redistributable, but NonCommercial. aegean.load("damos") fetches it (~5,900 tablets: Knossos, Pylos, Thebes, …, with site/chronology, scribal hands, find context, and object class) to your cache on demand, and the NC + ShareAlike obligations pass to you. The bundled 18-tablet sample remains the offline, zero-network default.
```
import aegean
lb = aegean.load("linearb")          # bundled sample, offline
print(len(lb))                        # 18
damos = aegean.load("damos")          # fetched on first use (network), ~5,900 tablets
```
SigLA's sign-level data (Salgarella & Castellan) is CC BY-NC-SA 4.0 and is integrated on the same fetch-to-cache, research-use, never-bundled pattern. aegean.load("sigla") fetches the decoded v2 dataset (781 documents) with typology, dimensions, periods, SigLA's own word division (1,376 words grouped into WORD tokens) and commodity ideograms (LOGOGRAM tokens). SigLA is a palaeographic database — it records sign occurrences and word division, not the cardinal-number quantities of the accounts — so it carries no numeral values (use GORILA, the bundled corpus, for accounting), and its word division and complex-sign notation differ editorially from GORILA. The drawings remain at sigla.phis.me (referenced, never redistributed).
The UD and PROIEL treebanks are CC BY-NC-SA — pyaegean fetches them for evaluation only and never trains on them. (The models are trained on the CC BY-SA AGDT / Gorman / Pedalion treebanks, which permit it.)
Linear A facsimile imagery is © École Française d'Athènes and other rightsholders: referenced and fetched for academic use, never redistributed (~116 MB mirror, opt-in).

The fetched, never-bundled assets, and why:

`load` id / asset	What it is	License	Why fetched, not bundled
`damos`	Full Linear B, ~5,900 tablets	CC BY-NC-SA 4.0	NonCommercial
`sigla`	SigLA Linear A palaeography, 781 docs	CC BY-NC-SA 4.0	NonCommercial
`nt`	Greek NT (Nestle 1904), ~137,800 tokens	CC0 / public domain	Size — redistributable, one book bundled offline
UD / PROIEL folds	Evaluation treebanks	CC BY-NC-SA	NonCommercial; eval only, never trained on
neural model	Joint tagger/parser/lemmatizer	CC BY-SA (from treebanks)	Size (~518 MB); the `[neural]` extra
facsimile mirror	Linear A photos/drawings	© rightsholders	Not licensed for redistribution

Bring-your-own corpora are supported too: point PYAEGEAN_LINEARB_CORPUS at a local edition (e.g. LiBER, which is all-rights-reserved and stays bring-your-own). See Data & Provenance for the cache layout and the integrity-check workflow.

Engineering limits we plan to lift

Each of these is something code can fix, and each is on the roadmap.

Limitation today	Plan
The bundled Linear A corpus is a normalized transcription — the full Leiden apparatus (restorations, dotted readings) was dropped upstream. What survived is now interpreted (erased marks → `LOST`, bracketed readings → `UNCLEAR`), and SigLA is loadable (`aegean.load("sigla")`: 781 documents with typology, dimensions, periods, SigLA's own word division, and commodity ideograms)	Largely done. Remaining by design: SigLA carries no cardinal-numeral values (it is a palaeographic sign database) — use GORILA for accounting
Linear B bundles an 18-tablet illustrative sample and a 150-entry Greek-bridge lexicon — every entry source-attested (curated core + Wiktionary-stated equations). 150 is close to the natural ceiling of stated Ancient Greek equations; many Mycenaean words have no alphabetic descendant to bridge to	The full ~5,900-tablet corpus is one call away via `aegean.load("damos")`; lexicon growth is contribution-driven and per-entry verified
The Cypriot lexicon is small (17 entries, Idalion-centred)	Grows by verified contribution from published ICS facts
~~`greek.load_work` reads top-level textparts only and drops `<note>`/`<bibl>` silently~~	Done: `load_work(work, ref=…)` addresses a textpart / nested div / verse line-range, and `<note>`/`<bibl>` are carried in `DocumentMeta.notes`
~~No way to discover which Greek works or NT books are loadable~~	Done: `greek.popular_works()` / `aegean greek works` (a curated, source-verified catalog of 25 well-known works) and `greek.nt_books()` / `aegean greek nt-books` (the 27 NT books); any Perseus / First1KGreek CTS id still works
Scansion covers dactylic hexameter, elegiac pentameter, iambic trimeter (with resolution), and the aeolic lyric lines (glyconic, pherecratean, sapphic, alcaic). Synizesis is applied only for words in a curated, test-enforced lexicon — a line needing it on an un-listed word declines rather than guesses	Done. Remaining by design: non-aeolic lyric (dactylo-epitrite, free astrophic) — a research project. See Meters
The offline rule morphology misses irregular, third-declension, and contract paradigms, and doesn't restore accents on reconstructed lemmas	By design — the treebank and neural tiers cover these forms; a redistributable offline paradigm table would need a license-clean source (the Morpheus audit did not clear)
The neural pipeline's model is a 518 MB fp32 download (int8 quantization failed its accuracy gate and was rejected)	Planned: selective quantization under the same ≤0.3-point gate, and optional GPU execution providers
33 of 56 gazetteer find-sites carry Pleiades ids; the rest are mostly minor findspots / peak sanctuaries not yet in Pleiades	The alignment was extended (each coordinate-verified); the remaining sites are listed as upstream-contribution candidates (`docs/pleiades-candidates.md`)
The syllabification exception lexicon lists dictionary forms; inflected compounds fall back to the phonotactic rules	Grows by contribution — adding an entry is a one-line, test-enforced change (CONTRIBUTING)

The discovery helpers above are pure offline metadata — handy when you don't know the CTS id you need:

from aegean import greek
print(len(greek.popular_works()))   # 25  (Homer, the tragedians, Plato, …)
print(len(greek.nt_books()))        # 27

aegean greek works        # curated catalog with CTS ids
aegean greek nt-books     # the 27 NT books load_nt accepts

Measured accuracy boundaries

Every accuracy claim is measured and the protocol published — which means the weak numbers are public too. Two boundaries matter most.

The zero-dependency baselines are honest floors, not the accuracy story. They exist for the zero-install path; the opt-in tiers, ending in the neural pipeline, are where the accuracy lives.

Baseline (pure-Python, zero deps)	Roughly	Use it for
Rule POS tagger	High precision on closed classes only	Quick, offline first pass
Edit-tree lemmatizer	~40% on unseen forms	Offline fallback
Arc-eager parser	~0.67 UAS (projective)	Offline structural sketch

Turn on the neural pipeline (greek.use_neural_pipeline(), the [neural] extra) and the same functions consult it instead.

Out-of-domain performance drops, as it does for every system. The neural pipeline's lemma accuracy is 94.40 on the UD Perseus test fold and 90.57 on UD PROIEL — a treebank none of pyaegean's models train on. Both numbers are always reported together.

Metric	UD Perseus (in family)	UD PROIEL (out of domain)
Lemma	94.40	90.57
UPOS	96.94	87.16
UFeats	96.12	59.49 (convention-capped)
UAS	89.16	82.52
LAS	84.38	63.51 (scheme-capped)

The PROIEL UFeats/LAS figures are measurement boundaries, not model failures: PROIEL's UD conversion uses feature types and a relation scheme the Perseus training data doesn't share. Full tables and protocol: docs/benchmarks.md. See Greek NLP for how to switch tiers.

By design (documented trade-offs, not on the roadmap)

These are deliberate. They're listed here so you can judge them, not so they get "fixed."

The core stays zero-dependency — no pydantic, no database backend, instant import. Heavy capability lives behind extras and the fetch-to-cache layer. Models and corpora are never bundled in the wheel. The opt-in extras:

Extra	Adds	Powers
`cli`	typer, rich	the `aegean` command
`neural`	onnxruntime, tokenizers, numpy	the joint NLP pipeline (torch-free)
`data`	pandas	tabular export / DataFrames
`parquet`	pyarrow	Parquet export
`epidoc`	lxml	EpiDoc TEI in/out
`geo`	geopandas, shapely	GeoJSON / GeoDataFrames
`viz`	matplotlib	the `plot` figures
`mcp`	mcp	the Model Context Protocol server (e.g. for Claude Code)
`anthropic` / `openai` / `gemini` / `grok`	one provider SDK	the AI layer
`ai`	all providers	the AI layer, any provider
`all`	`ai,epidoc,geo,data,cli,viz,mcp`	everything except `neural`

to_dict() is a lossy summary on purpose (words + metadata, for quick interop); the lossless, reversible path is Corpus.to_json() / Corpus.from_json(). Reach for the latter whenever round-tripping matters.
The gazetteer is mapping-grade (~1 km site-level coordinates), for distribution maps — not survey work.
The AI layer is exploratory by construction. Every generative result is a labeled, provenanced hypothesis built on local deterministic grounding — by policy it is never presented as a reading, whatever the model's confidence. It needs your own API key and the matching extra; nothing is sent anywhere unless you turn it on.
Corpora are held in memory — fine for everything pyaegean ships and fetches (the largest, DAMOS, is ~5,900 documents). Corpus.iter_documents / iter_tokens / iter_words stream token-by-token where you need it, and an opt-in analysis cache reuses heavy results; true streaming load is deferred until a corpus needs it (design note: docs/large-corpora.md).
Try it without installing anything — the core pipeline runs client-side via Pyodide at the web demo; the CLI is the full no-Python path; and the linearaworkbench web app covers the visual Linear A use case.

Spotted a limit that's missing, stale, or wrong? That's exactly what this page is for — see For Specialists to file it.

pyaegean

Home

Start here

Aegean scripts

Greek

Capabilities

Reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limitations

Limitations

Quick map of the limits

Limits of the evidence (not fixable by code)

Limits of licensing (fixable only by permission)

Engineering limits we plan to lift

Measured accuracy boundaries

By design (documented trade-offs, not on the roadmap)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pyaegean

Clone this wiki locally