Linear A
Linear A is fully wired as a script plugin. The bundled corpus is 1,721 inscriptions with the full Unicode Linear A sign repertoire — 344 signs, of which 84 carry the conventional sound values Linear A shares with the deciphered Linear B (the rest have no agreed reading).
Exploratory material. Linear A is undeciphered: the phonetic transcription uses Linear B sound values as a working convention, and every analytical method here surfaces evidence to weigh, never readings or translations. See Analysis for the methods and the Limitations page for the full picture.
The bundled corpus is a normalized transcription, but the apparatus it does carry is now interpreted: the upstream's erased-sign marks load as ReadingStatus.LOST (552 tokens), damaged-at-break words and bracketed uncertain readings as UNCLEAR (120 tokens); 366 of the 1,721 documents carry editorial status. The full Leiden apparatus (restorations, dotted readings) is still absent because the upstream digitization dropped it, so for edition-grade work consult GORILA and SigLA (see below). The EpiDoc reader/writer round-trip status as <unclear>/<supplied>/<gap>.
A second, independent Linear A corpus: SigLA, the paleographical database of Salgarella & Castellan (dataset published CC BY-NC-SA 4.0; its paper invites use outside the interface). pyaegean hosts the decoded dataset as a sha256-pinned release asset and loads it on demand — the NonCommercial obligation passes to you, and nothing ships in the wheel:
import aegean

sigla = aegean.load("sigla")   # ~1.2 MB fetch on first use, then cached
len(sigla)                     # 781 documents
doc = sigla.get("HT 13")
doc.meta.name                  # 'HT 13 (6.1×10.5×0.8 cm)', physical dimensions included
" ".join(t.text for t in doc.tokens)
# 'KA-U-DE-TA VIN TE RE-ZA TE-TU TE-KI KU-*79-NI DA-SI-*118 I-DU-NE-SI KU-RO'

What it adds over the bundled corpus: document typology, find-site,
physical dimensions, period, and EFA plate references — and a fully
independent reading of each tablet, useful for cross-checking (the two corpora
agree on 602 of 646 shared documents at ≥60% sign overlap once notation
differences like *120↔GRA are normalized; the rest is genuine scholarly
variation, e.g. SigLA's *79 where GORILA reads ZU).
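To make the cross-check above concrete, here is a minimal, library-independent sketch of a sign-overlap score between two readings of the same tablet. It rests on assumptions: the metric (shared signs divided by the length of the longer reading) is only one plausible definition of "sign overlap", the names `NOTATION`, `signs`, and `sign_overlap` are hypothetical, and the normalization table holds just the one equivalence mentioned in the text.

```python
from collections import Counter

# Notation equivalences folded before comparing. Only *120 <-> GRA comes from
# the text; a real table would be longer (assumption).
NOTATION = {"*120": "GRA"}


def signs(transcription: str) -> list[str]:
    """Split 'KA-U-DE-TA VIN' into a flat list of normalized sign labels."""
    labels = []
    for word in transcription.split():
        for label in word.split("-"):
            labels.append(NOTATION.get(label, label))
    return labels


def sign_overlap(reading_a: str, reading_b: str) -> float:
    """Shared signs (as a multiset) over the length of the longer reading."""
    count_a, count_b = Counter(signs(reading_a)), Counter(signs(reading_b))
    shared = sum((count_a & count_b).values())  # multiset intersection
    longest = max(sum(count_a.values()), sum(count_b.values()))
    return shared / longest if longest else 1.0


# One differing sign out of three: 2/3 overlap, above a 0.6 threshold.
score = sign_overlap("KU-*79-NI", "KU-ZU-NI")
agrees = score >= 0.6
```

Under this definition a purely notational difference such as `*120` versus `GRA` costs nothing, while a genuine disagreement such as `*79` versus `ZU` lowers the score by one sign.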
Since v2 the corpus uses SigLA's own word division: signs group into
WORD tokens (KA-U-DE-TA, with the syllabograms in Token.signs), commodity
ideograms are LOGOGRAM tokens (VIN, the composite *100+*77 = VIR+KA),
and an unresolved sign inside a word shows as *?. One honest limit remains:
SigLA is a palaeographic sign database, so it records sign occurrences and
word division but not the cardinal-number quantities of the accounts — there
are no numeral values here (use the bundled GORILA corpus for accounting). Cite
SigLA in academic work
(Limitations · Data & Provenance).
import aegean
corpus = aegean.load("lineara")
len(corpus) # 1721
doc = corpus.get("HT13") # one Document by id
ht = corpus.filter(site="Haghia Triada") # AND-combine any DocumentMeta fields
ht_lmib = corpus.filter(site="Haghia Triada", period="LMIB")  # period codes are un-spaced (LMIA/LMIB/MMII…)

meta.site holds the full site name ("Haghia Triada", "Khania", …); the
familiar two-letter site codes (HT, ZA, KH) are the prefix of each
document's id. To select "all HT tablets", query the id (see the
query engine):
from aegean.analysis import FilterRow, run_query
res = run_query(corpus, [FilterRow("id-contains", "HT")], output="inscriptions")
len(res.inscriptions)

corpus.word_frequencies()[:5]          # [(word, count), ...] desc by count
corpus.to_dataframe() # one row per document
corpus.to_dataframe(level="word") # one row per WORD token
corpus.to_dataframe(level="token")     # every token (words, numerals, …)

The to_dataframe() views need pandas, which ships as the optional [data]
extra (pip install 'pyaegean[data]'); everything else here runs on the
dependency-free core.
A document's tokens carry their role:
doc = corpus.get("HT13")
[t.text for t in doc.words] # multi-sign lexical words
[t.text for t in doc.numerals] # numerals / metrological fractions
[t.text for t in doc.logograms] # commodity / ideogram signs
doc.line_tokens                  # tokens regrouped by physical line

inv = aegean.get_script("lineara").sign_inventory
len(inv) # 344 — the full Unicode Linear A repertoire
[s for s in inv if s.phonetic] # the 84 signs with assigned sound values
sign = inv.by_label("KU")
sign.phonetic # 'ku'
inv.to_dataframe()               # pandas view of the inventory

from aegean.scripts.lineara.phonetic import word_to_phonetic
word_to_phonetic("KU-RO") # 'kuro'
word_to_phonetic("PA-I-TO") # 'paito'
word_to_phonetic("KU-RO", {"KU": "gu"}) # 'guro' (hypothesis override)

KU-RO ("total") and PO-TO-KU-RO ("grand total") let you check a tablet's
arithmetic against its line items. Exploratory: section boundaries are
heuristic and the metrology is contested. Only 39 of the 1,721 tablets
carry a stated KU-RO total and are checkable at all; most are too
fragmentary, which reflects the nature of the corpus, not a tool limit.
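The arithmetic behind such a check can be sketched without the library: sum the quantities listed before each total line and compare the sum with the stated figure. This is a minimal illustration under stated assumptions, not how `balance_check` is implemented. The `check_total` name and the `(label, quantity)` input shape are hypothetical, the section boundary is simply "everything since the previous total", and the figures are made up rather than read from any tablet.

```python
from fractions import Fraction


def check_total(entries, total_word="KU-RO"):
    """Compare each stated total with the sum of the items listed before it.

    `entries` is a list of (label, quantity) pairs in reading order.
    Returns one (stated, computed, balanced) tuple per total line.
    """
    results = []
    running = Fraction(0)
    for label, quantity in entries:
        quantity = Fraction(quantity)  # exact arithmetic for fractional amounts
        if label == total_word:
            results.append((quantity, running, quantity == running))
            running = Fraction(0)  # assumption: a total closes its section
        else:
            running += quantity
    return results


# Illustrative figures only, not a transcription of any tablet.
section = [
    ("item-1", Fraction(5, 2)),
    ("item-2", 10),
    ("item-3", Fraction(1, 2)),
    ("KU-RO", 13),
]
report = check_total(section)
```

`Fraction` keeps metrological fractions exact, so a mismatch signals a real discrepancy in the inputs rather than floating-point rounding.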
from aegean.analysis import balance_check
for chk in balance_check(corpus.get("HT13")):
    print(chk) # each total line vs the summed items it governs

Dash-separated sign labels with wildcards: * = exactly one sign, ** = zero
or more. Case-insensitive after subscript folding (RA₂ ≡ RA2).
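The wildcard semantics just described can be sketched as a small recursive matcher. This is an illustrative reimplementation under stated assumptions, not the library's code: the names `fold` and `matches` are hypothetical, and a label such as `*79` is treated as a literal sign because only a bare `*` or `**` counts as a wildcard.

```python
# Map Unicode subscript digits to ASCII digits so RA₂ and RA2 compare equal.
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def fold(label: str) -> str:
    """Normalize a sign label: fold subscripts, then ignore case."""
    return label.translate(SUBSCRIPTS).upper()


def matches(word: str, pattern: str) -> bool:
    """True if the dash-separated word fits the dash-separated pattern."""
    signs = [fold(s) for s in word.split("-")]
    parts = [fold(p) for p in pattern.split("-")]

    def walk(i: int, j: int) -> bool:
        if j == len(parts):
            return i == len(signs)  # pattern used up: word must be too
        if parts[j] == "**":
            # Let ** absorb zero, one, two, ... of the remaining signs.
            return any(walk(k, j + 1) for k in range(i, len(signs) + 1))
        if i == len(signs):
            return False  # pattern still needs a sign, word has none left
        if parts[j] == "*" or parts[j] == signs[i]:
            return walk(i + 1, j + 1)
        return False

    return walk(0, 0)


three_signs = matches("KU-NE-RO", "KU-*-RO")   # * stands for exactly one sign
two_signs = matches("KU-RO", "KU-**-RO")       # ** may stand for no sign at all
```

Recursion keeps the sketch short; for the word lengths found in the corpus the backtracking cost is negligible.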
from aegean.analysis import word_matches_sign_pattern
word_matches_sign_pattern("KU-NE-RO", "KU-*-RO") # True
word_matches_sign_pattern("KU-RO", "KU-*-RO") # False
[w for w, _ in corpus.word_frequencies()
 if word_matches_sign_pattern(w, "KU-*-RO")]

The full analytical toolkit (phonetic distance, alignment, morphology clustering, collocation statistics, the query engine, and tablet-structure classification) is on the Analysis page.
print(corpus.provenance.cite())
# Godart, L. & Olivier, J.-P. (1976–1985). Recueil des inscriptions en linéaire A. — https://github.com/mwenge/lineara.xyz