Skip to content

Limitations

github-actions[bot] edited this page Jun 11, 2026 · 17 revisions

Limitations

pyaegean's rule is that it tells you where it's confident and where it's guessing. This page is the complete picture in one place: what the toolkit cannot do and why, what it does not yet do and plans to, and what it deliberately won't do. Three kinds of limits behave very differently:

  1. Limits of the evidence — undeciphered scripts, fragmentary tablets, contested readings. No amount of code removes these; the toolkit's job is to keep them visible.
  2. Engineering limits — things code can fix, tracked on the roadmap.
  3. Design decisions — trade-offs made on purpose, documented so you can judge them.

Limits of the evidence (not fixable by code)

  • Linear A and Cypro-Minoan are undeciphered. Linear A's phonetic transcription uses Linear B sound values as a working convention — 84 of the 344 signs carry one; the rest have no agreed reading. Cypro-Minoan has no settled sound values at all, so pyaegean offers no transliteration or lexicon for it. Every analytical or generative result over this material is labeled exploratory: evidence for a human expert to weigh, never a reading or a translation.
  • The corpora are fragmentary. Only ≈40 of the 1,721 bundled Linear A inscriptions carry a checkable accounting total; most tablets are too damaged. That is the nature of the evidence, not a parser gap.
  • Linear A metrology and section boundaries are contested. balance_check's line-item sections are heuristic; a "discrepancy" is a lead, not a verdict on the scribe.
  • Annotation conventions differ across treebanks. On the out-of-domain PROIEL evaluation, morphological-feature scores are capped by convention mismatches (and the 9-character positional tag doesn't exist there at all) — a measurement boundary, not a model defect. The protocol notes are in docs/benchmarks.md.

Limits of licensing (fixable only by permission)

  • No openly-licensed Linear B corpus exists. DAMOS (Oslo) is CC BY-NC-SA; LiBER (CNR) is all-rights-reserved. pyaegean bundles a small illustrative sample and parses your own licensed export locally (PYAEGEAN_LINEARB_CORPUS); licensing inquiries with both projects are a roadmap item, and the outcome will be documented either way.
  • The UD and PROIEL treebanks are CC BY-NC-SA — pyaegean fetches them for evaluation only and never trains on them.
  • Linear A facsimile imagery is © École Française d'Athènes and other rightsholders: referenced and fetched for academic use, never redistributed.
  • SigLA's apparatus-grade sign data (Salgarella & Castellan) would lift the Linear A corpus toward edition grade — and its dataset and drawings are published CC BY-NC-SA 4.0, so integration is a build item, not a permission question: the same fetch-to-cache, research-use, never-bundled pattern as PROIEL and the UD treebanks (the NC obligation passes to you, and NC data stays out of the Apache-2.0 wheel). On the roadmap under WP4.

Engineering limits we plan to lift

Each of these is on the roadmap; the pointer says where.

Limitation today Plan
The bundled Linear A corpus is a normalized transcription — the full Leiden apparatus (restorations, dotted readings) was dropped upstream. The audit is done and what survived is now interpreted: erased-sign marks load as LOST (552 tokens) and damaged/bracketed readings as UNCLEAR (120 tokens, 366 documents). The SigLA corpus is now loadable (aegean.load("sigla")): 781 documents with typology, dimensions, and periods, sign-level granularity Remaining: decode SigLA's numerals/erasure flags and word grouping (preserved as raw_flags in the dataset), and the complex-sign refinements
Linear B bundles an 18-tablet illustrative sample and a 150-entry Greek-bridge lexicon — every entry source-attested (curated core + Wiktionary-stated equations). 150 is close to the natural ceiling of stated Ancient Greek equations at the source; many Mycenaean words have no alphabetic descendant to bridge to Further growth is contribution-driven and per-entry verified; the real-corpus route stays bring-your-own EpiDoc pending the DAMOS/LiBER inquiries
The Cypriot lexicon is small (17 entries, Idalion-centred) Grows by verified contribution from published ICS facts
greek.load_work reads top-level textparts only, drops <note>/<bibl> silently, and discovers editions via an unauthenticated GitHub listing (rate-limited at scale) WP4: loader hardening — cached listings/auth token, deeper textpart addressing
Scansion covers dactylic hexameter + elegiac pentameter only, and synizesis is declined, never inferred (a line that needs it raises rather than guesses) Post-0.8.0: iambic trimeter and lyric metres, plus a curated synizesis lexicon on the same pattern as the syllabification exceptions
The offline rule morphology misses irregular, third-declension, and contract paradigms, and doesn't restore accents on reconstructed lemmas (the treebank and neural tiers cover these) Post-0.8.0: Morpheus-backed morphological tables for the offline tier
The neural pipeline's model is a 518 MB fp32 download (int8 quantization failed its accuracy gate and was rejected) Post-0.8.0: selective quantization under the same ≤0.3-point gate, and optional GPU execution providers
Only 26 of 56 gazetteer find-sites carry Pleiades ids WP8: extend the alignment and contribute the missing Bronze Age sites upstream
The syllabification exception lexicon lists dictionary forms; inflected compounds fall back to the phonotactic rules Grows by contribution — adding an entry is a one-line, test-enforced change (CONTRIBUTING)

Measured accuracy boundaries

Every accuracy claim is measured and the protocol published — which also means the weak numbers are public:

  • The zero-dependency baselines are honest floors: the rule POS tagger is high-precision only on closed classes; the pure-Python edit-tree lemmatizer reaches ~40% on unseen forms; the arc-eager parser is a ~0.67 UAS (projective) baseline. Each exists for the zero-install path — the opt-in tiers, ending in the neural pipeline, are the accuracy story.
  • Out-of-domain performance drops, as it does for every system: the neural pipeline's lemma accuracy goes from 94.4 (UD Perseus test) to 90.6 on PROIEL, a source none of pyaegean's models train on. Both numbers are always reported together; docs/benchmarks.md holds the full tables.

By design (documented trade-offs, not on the roadmap)

  • The core stays zero-dependency — no pydantic, no database backend, instant import; heavy capability lives behind extras and the fetch-to-cache layer. Models and corpora are never bundled in the wheel.
  • to_dict() is a lossy summary on purpose; the lossless path is to_json()/from_json().
  • The gazetteer is mapping-grade (~1 km site-level coordinates), for distribution maps — not survey work.
  • The AI layer is exploratory by construction. Every generative result is a labeled, provenanced hypothesis built on local deterministic grounding — by policy it is never presented as a reading, whatever the model's confidence.
  • A web demo is parked — the CLI is the no-Python path; the linearaworkbench web app covers the visual Linear A use case.

Clone this wiki locally