Limitations

pyaegean's rule is that it tells you where it's confident and where it's guessing. This page is the complete picture in one place: what the toolkit cannot do and why, what it does not yet do and plans to, and what it deliberately won't do. Three kinds of limits behave very differently:

Limits of the evidence — undeciphered scripts, fragmentary tablets, contested readings. No amount of code removes these; the toolkit's job is to keep them visible.
Engineering limits — things code can fix, tracked on the roadmap.
Design decisions — trade-offs made on purpose, documented so you can judge them.

Limits of the evidence (not fixable by code)

Linear A and Cypro-Minoan are undeciphered. Linear A's phonetic transcription uses Linear B sound values as a working convention — 84 of the 344 signs carry one; the rest have no agreed reading. Cypro-Minoan has no settled sound values at all, so pyaegean offers no transliteration or lexicon for it. Every analytical or generative result over this material is labeled exploratory: evidence for a human expert to weigh, never a reading or a translation.
The corpora are fragmentary. Only ≈40 of the 1,721 bundled Linear A inscriptions carry a checkable accounting total; most tablets are too damaged. That is the nature of the evidence, not a parser gap.
Linear A metrology and section boundaries are contested. balance_check's line-item sections are heuristic; a "discrepancy" is a lead, not a verdict on the scribe.
Annotation conventions differ across treebanks. On the out-of-domain PROIEL evaluation, morphological-feature scores are capped by convention mismatches (and the 9-character positional tag doesn't exist there at all) — a measurement boundary, not a model defect. The protocol notes are in docs/benchmarks.md.

Limits of licensing (fixable only by permission)

No openly-licensed Linear B corpus exists. DAMOS (Oslo) is CC BY-NC-SA; LiBER (CNR) is all-rights-reserved. pyaegean bundles a small illustrative sample and parses your own licensed export locally (PYAEGEAN_LINEARB_CORPUS); licensing inquiries with both projects are a roadmap item, and the outcome will be documented either way.
The UD and PROIEL treebanks are CC BY-NC-SA — pyaegean fetches them for evaluation only and never trains on them.
Linear A facsimile imagery is © École Française d'Athènes and other rightsholders: referenced and fetched for academic use, never redistributed.
SigLA's apparatus-grade sign data (Salgarella & Castellan) would lift the Linear A corpus toward edition grade — and its dataset and drawings are published CC BY-NC-SA 4.0, so integration is a build item, not a permission question: the same fetch-to-cache, research-use, never-bundled pattern as PROIEL and the UD treebanks (the NC obligation passes to you, and NC data stays out of the Apache-2.0 wheel). On the roadmap under WP4.

Engineering limits we plan to lift

Each of these is on the roadmap; the pointer says where.

Limitation today	Plan
The bundled Linear A corpus is a normalized transcription — the full Leiden apparatus (restorations, dotted readings) was dropped upstream. The audit is done and what survived is now interpreted: erased-sign marks load as `LOST` (552 tokens) and damaged/bracketed readings as `UNCLEAR` (120 tokens, 366 documents). The SigLA corpus is now loadable (`aegean.load("sigla")`): 781 documents with typology, dimensions, and periods, sign-level granularity	Remaining: decode SigLA's numerals/erasure flags and word grouping (preserved as `raw_flags` in the dataset), and the complex-sign refinements
Linear B bundles an 18-tablet illustrative sample and a 150-entry Greek-bridge lexicon — every entry source-attested (curated core + Wiktionary-stated equations). 150 is close to the natural ceiling of stated Ancient Greek equations at the source; many Mycenaean words have no alphabetic descendant to bridge to	Further growth is contribution-driven and per-entry verified; the real-corpus route stays bring-your-own EpiDoc pending the DAMOS/LiBER inquiries
The Cypriot lexicon is small (17 entries, Idalion-centred)	Grows by verified contribution from published ICS facts
`greek.load_work` reads top-level textparts only, drops `<note>`/`<bibl>` silently, and discovers editions via an unauthenticated GitHub listing (rate-limited at scale)	WP4: loader hardening — cached listings/auth token, deeper textpart addressing
Scansion covers dactylic hexameter + elegiac pentameter only, and synizesis is declined, never inferred (a line that needs it raises rather than guesses)	Post-0.8.0: iambic trimeter and lyric metres, plus a curated synizesis lexicon on the same pattern as the syllabification exceptions
The offline rule morphology misses irregular, third-declension, and contract paradigms, and doesn't restore accents on reconstructed lemmas (the treebank and neural tiers cover these)	Post-0.8.0: Morpheus-backed morphological tables for the offline tier
The neural pipeline's model is a 518 MB fp32 download (int8 quantization failed its accuracy gate and was rejected)	Post-0.8.0: selective quantization under the same ≤0.3-point gate, and optional GPU execution providers
Only 26 of 56 gazetteer find-sites carry Pleiades ids	WP8: extend the alignment and contribute the missing Bronze Age sites upstream
The syllabification exception lexicon lists dictionary forms; inflected compounds fall back to the phonotactic rules	Grows by contribution — adding an entry is a one-line, test-enforced change (CONTRIBUTING)

Measured accuracy boundaries

Every accuracy claim is measured and the protocol published — which also means the weak numbers are public:

The zero-dependency baselines are honest floors: the rule POS tagger is high-precision only on closed classes; the pure-Python edit-tree lemmatizer reaches ~40% on unseen forms; the arc-eager parser is a ~0.67 UAS (projective) baseline. Each exists for the zero-install path — the opt-in tiers, ending in the neural pipeline, are the accuracy story.
Out-of-domain performance drops, as it does for every system: the neural pipeline's lemma accuracy goes from 94.4 (UD Perseus test) to 90.6 on PROIEL, a source none of pyaegean's models train on. Both numbers are always reported together; docs/benchmarks.md holds the full tables.

By design (documented trade-offs, not on the roadmap)

The core stays zero-dependency — no pydantic, no database backend, instant import; heavy capability lives behind extras and the fetch-to-cache layer. Models and corpora are never bundled in the wheel.
to_dict() is a lossy summary on purpose; the lossless path is to_json()/from_json().
The gazetteer is mapping-grade (~1 km site-level coordinates), for distribution maps — not survey work.
The AI layer is exploratory by construction. Every generative result is a labeled, provenanced hypothesis built on local deterministic grounding — by policy it is never presented as a reading, whatever the model's confidence.
A web demo is parked — the CLI is the no-Python path; the linearaworkbench web app covers the visual Linear A use case.

pyaegean

Home

Start here

Aegean scripts

Greek

Capabilities

Reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limitations

Limitations

Limits of the evidence (not fixable by code)

Limits of licensing (fixable only by permission)

Engineering limits we plan to lift

Measured accuracy boundaries

By design (documented trade-offs, not on the roadmap)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pyaegean

Clone this wiki locally