Linear A
Linear A is the script of Bronze-Age Crete, still undeciphered. pyaegean ships the whole corpus offline and gives you tools to explore it: load and filter 1,721 inscriptions, look up signs, search for words by their sign shape (with wildcards), check the accounting tablets' arithmetic, and sort tablets by what they look like they're for. You'd reach for this page to answer questions like "which words end in -RO?", "does this tablet's KU-RO total actually add up?", or "what does sign KU look like?" — all without writing much Python, and with a matching command line for every method.
Read this first — it's exploratory material. Linear A is undeciphered. The phonetic transcription here uses Linear B sound values as a working convention (the two scripts share many signs), and every method on this page surfaces evidence to weigh, never readings or translations. Numbers, totals, and structure labels are heuristics on a damaged, contested corpus. See the Limitations page for the honest, full picture, and Analysis for the methods in depth.
Everything below runs on the dependency-free core unless a step is explicitly
marked as needing the [data] extra (pandas) or a one-time fetch (SigLA).
What the bundled corpus contains. The first four counts plus source and license
come straight from aegean info lineara; the sound-value, shared-with-Linear-B,
editorial-status, and checkable-KU-RO rows are from the Python introspection
shown later on this page:
| Quantity | Value |
|---|---|
| Inscriptions (documents) | 1,721 |
| Word tokens (all multi-sign; 995 distinct) | 1,381 |
| Tokens (words, numerals, logograms, separators) | 6,406 |
| Signs in the inventory | 344 (the full Unicode Linear A block) |
| Signs with an assigned sound value | 47 |
| Signs marked shared with Linear B | 67 |
| Documents carrying editorial status | 366 (552 LOST + 120 UNCLEAR tokens) |
| Tablets with a checkable KU-RO total | 35 (39 total lines) |
| Source | GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz |
| License | Apache-2.0 (corpus JSON); facsimile imagery © École Française d'Athènes, not redistributed |
aegean info lineara
# documents 1721
# words 1381
# tokens 6406
# signs_in_inventory 344
# source GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz
# license Apache-2.0 (corpus JSON); facsimile imagery © École Française d'Athènes, not redistributed

The corpus is bundled in the wheel: no download, works offline.
import aegean
corpus = aegean.load("lineara")
len(corpus) # 1721
doc = corpus.get("HT13") # one Document by id (note: no space, "HT13")

There is no separate CLI "load the corpus" step: every aegean command takes
the corpus id (lineara) as its first argument and loads it for you.
corpus.filter(...) AND-combines any DocumentMeta fields. The
period codes are written un-spaced (LMIA, LMIB, MMII, …), and
meta.site holds the full site name:
ht = corpus.filter(site="Haghia Triada")
len(ht) # 1110
ht_lmib = corpus.filter(site="Haghia Triada", period="LMIB")
len(ht_lmib) # 1110

The same on the command line, which also lists the matches:

aegean load lineara --site "Haghia Triada" --period LMIB --limit 5

| Filter field | corpus.filter(...) keyword | aegean load flag |
|---|---|---|
| Find-site (full name) | site= | --site |
| Period / context code | period= | --period |
| Scribal hand | scribe= | --scribe |
| Support (object type) | support= | --support |
| (export the filtered set) | n/a | --output / -o FILE.json |
| (rows shown) | n/a | --limit N (default 20) |
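The AND-combination of filter keywords can be pictured with a few lines of plain Python. This is only an illustrative sketch, not pyaegean's implementation: the `Meta` class and `filter_docs` function are hypothetical stand-ins, and the second row is invented for the example (only the HT13 values come from this page).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    """Hypothetical stand-in for a document's metadata (not pyaegean's DocumentMeta)."""
    id: str
    site: str
    period: str
    scribe: str
    support: str


def filter_docs(docs, **criteria):
    """Keep the documents whose metadata equals every given keyword (logical AND)."""
    return [
        d for d in docs
        if all(getattr(d, field) == value for field, value in criteria.items())
    ]


docs = [
    # Values as printed by `aegean show lineara HT13` on this page.
    Meta("HT13", "Haghia Triada", "LMIB", "HT Scribe 8", "Tablet"),
    # Invented row, present only so the filter has something to exclude.
    Meta("DEMO1", "Khania", "LMIB", "", "Tablet"),
]

both = filter_docs(docs, site="Haghia Triada", period="LMIB")
print([d.id for d in both])  # ['HT13']
```

Each extra keyword can only narrow the result, which is why a site-plus-period filter never returns more documents than the site filter alone.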
Site codes vs. site names.
meta.site is the full name ("Haghia Triada", "Khania", …). The familiar two-letter codes (HT, ZA, KH) are the prefix of each document's id, not a metadata field. To select "all HT tablets," query the id with the query engine:
from aegean.analysis import FilterRow, run_query
res = run_query(corpus, [FilterRow("id-contains", "HT")], output="inscriptions")
len(res.inscriptions) # 1110

aegean query lineara --where id-contains=HT

aegean show lineara HT13
# HT13 site=Haghia Triada period=LMIB scribe=HT Scribe 8 support=Tablet
# 1: KA-U-DE-TA VIN 𐄁 TE 𐄁
# 2: RE-ZA 5 ¹⁄₂
# 3: TE-TU 56
# 4: TE-KI 27 ¹⁄₂
# 5: KU-ZU-NI 18
# 6: DA-SI-*118 19
# 7: I-DU-NE-SI 5
# 8: KU-RO 130 ¹⁄₂

Every token carries its role, so you can pull just words, numerals, or commodity signs:
doc = corpus.get("HT13")
[t.text for t in doc.words]
# ['KA-U-DE-TA', 'RE-ZA', 'TE-TU', 'TE-KI', 'KU-ZU-NI', 'DA-SI-*118', 'I-DU-NE-SI', 'KU-RO']
[t.text for t in doc.numerals]
# ['5', '¹⁄₂', '56', '27', '¹⁄₂', '18', '19', '5', '130', '¹⁄₂']
[t.text for t in doc.logograms]
# ['VIN', 'TE']
doc.line_tokens # tokens regrouped by physical line

corpus.word_frequencies()[:5]
# [('KU-RO', 37), ('SA-RA₂', 20), ('KI-RO', 16), ('*411-VS', 15), ('A-TA-I-*301-WA-JA', 11)]

The DataFrame views need pandas, shipped as the optional [data] extra
(pip install 'pyaegean[data]'); everything else on this page runs on the core.
corpus.to_dataframe() # (1721, 10) — one row per document
corpus.to_dataframe(level="word") # one row per WORD token
corpus.to_dataframe(level="token") # every token (words, numerals, …)

The document-level frame's columns are: id, script_id, site, support,
scribe, findspot, period, name, n_tokens, n_words.
The inventory is the full Unicode Linear A repertoire — 344 signs. Of those,
47 carry an assigned sound value (phonetic); the rest are carried from the
Unicode Character Database with attrs["source"] == "ucd" and no reading,
because Linear A is undeciphered and most of its repertoire has no agreed value.
inv = aegean.get_script("lineara").sign_inventory
len(inv) # 344
len([s for s in inv if s.phonetic]) # 47 — the read signs
len([s for s in inv if s.attrs.get("sharedWithLinearB")]) # 67 — shared glyphs
sign = inv.by_label("KU")
sign.phonetic # 'ku'
sign.glyph # '𐙂'
sign.codepoint # 67138 (the int; the CLI/JSON render it as 'U+10642')
inv.to_dataframe() # (344, 10); needs [data]

The inventory DataFrame's columns are: label, glyph, codepoint,
phonetic, sharedWithLinearB, linearAOnly, total, confidence,
altGlyphs, source.
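The integer codepoint held in Python and the U+ form shown by the CLI and JSON are two renderings of the same value. A small sketch of converting between them, using only the standard library; the helper names `format_codepoint` and `parse_codepoint` are made up for this example and are not part of pyaegean:

```python
def format_codepoint(cp: int) -> str:
    """Render an integer code point in the conventional 'U+XXXX' form."""
    return f"U+{cp:04X}"


def parse_codepoint(text: str) -> int:
    """Inverse of format_codepoint: 'U+10642' -> 67138."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not a U+ code point: {text!r}")
    return int(text[2:], 16)


print(format_codepoint(67138))      # U+10642
print(parse_codepoint("U+10642"))   # 67138
glyph = chr(67138)                  # the one-character string for that code point
```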
Look up a single sign from the shell:
aegean sign lineara KU
# label KU
# glyph 𐙂
# codepoint U+10642
# phonetic ku
# attrs.sharedWithLinearB True
# attrs.linearAOnly False
# attrs.total 16
# attrs.confidence 1

aegean sign lineara KU --json
# {"label": "KU", "glyph": "𐙂", "codepoint": "U+10642", "phonetic": "ku",
# "attrs": {"sharedWithLinearB": true, "linearAOnly": false,
# "total": 16, "confidence": 1, "altGlyphs": []}}

The sign argument accepts either a label (KU, *301) or a single glyph
character. The attrs.confidence field rates how secure the empirical
sign→sound alignment is — treat it as evidence, not canon.
The Linear-B-value convention, with an optional hypothesis override so you can
test "what if KU were really gu?":
from aegean.scripts.lineara.phonetic import word_to_phonetic
word_to_phonetic("KU-RO") # 'kuro'
word_to_phonetic("PA-I-TO") # 'paito'
word_to_phonetic("KU-RO", {"KU": "gu"}) # 'guro' (hypothesis override)

This is the workhorse for "find every word shaped like X." A pattern is a dash-separated list of sign labels with two wildcards:

| Wildcard | Meaning |
|---|---|
| * | exactly one sign (any value) |
| ** | zero or more signs |
| (a label, e.g. KU) | that exact sign |
Matching is case-insensitive after subscript folding, so RA₂ and RA2 are the
same sign. The key thing to internalize: * is one whole sign, not one
letter. KU-*-RO means "KU, then any single sign, then RO" — a three-sign
word.
from aegean.analysis import word_matches_sign_pattern
word_matches_sign_pattern("KU-NE-RO", "KU-*-RO") # True — three signs, middle is anything
word_matches_sign_pattern("KU-RO", "KU-*-RO") # False — only two signs, * needs one in between
word_matches_sign_pattern("KU-RO", "**-RO") # True — ** allows zero-or-more before RO
word_matches_sign_pattern("A-TA-I-*301-WA-JA", "A-**-JA") # True: A … JA, anything between

aegean search lineara "*-RO"
# '*-RO': 6 word(s)
# KU-RO 37
# KI-RO 16
# *86-RO 4
# SA-RO 4
# NU-RO 2
# RE-RO 1

aegean search lineara "KU-*"
# 'KU-*': 12 word(s)
# KU-RO 37
# KU-PA 4
# KU-RA 2
# KU-RE 2
# KU-*305 1
# KU-*321 1
# KU-DA 1 (… KU-KA, KU-NI, KU-PA₃, KU-PI, KU-TA each 1)

aegean search lineara "KU-*-RO"
# 'KU-*-RO': 1 word(s)
# KU-MA-RO 1

The same searches in Python, ranked by frequency:
[(w, n) for w, n in corpus.word_frequencies()
if word_matches_sign_pattern(w, "**-RO")]
# [('KU-RO', 37), ('KI-RO', 16), ('*86-RO', 4), ('SA-RO', 4), ('KI-DA-RO', 2), ('NU-RO', 2), ...] (18 words)

If you need the compiled form (e.g. to reuse one pattern over many words):
from aegean.analysis import compile_sign_pattern
from aegean.analysis.patterns import match_sign_pattern
pat = compile_sign_pattern("KU-*-RO")
match_sign_pattern(["KU", "MA", "RO"], pat) # True

Sign-pattern search is also one predicate inside the larger
query engine (word-sign-pattern), which lets you
combine it with site/period/co-occurrence filters.
Single-sign words never match a pattern (there's no dash to split), and an empty pattern matches nothing.
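To make the wildcard semantics concrete, here is a small independent sketch of the matching rules as this page states them. It is not pyaegean's code: the names `fold`, `matches`, and `_match` are invented for the example, and the recursion is just one straightforward way to honour "one sign" versus "zero or more signs."

```python
# Map subscript digits (U+2080..U+2089) to plain digits, so RA₂ folds to RA2.
_SUBSCRIPTS = {0x2080 + i: str(i) for i in range(10)}


def fold(label: str) -> str:
    """Subscript-fold and upper-case one sign label."""
    return label.translate(_SUBSCRIPTS).upper()


def _match(signs, parts):
    """Recursively match a list of signs against a list of pattern parts."""
    if not parts:
        return not signs                      # pattern used up: word must be too
    head, rest = parts[0], parts[1:]
    if head == "**":                          # zero or more signs
        return any(_match(signs[i:], rest) for i in range(len(signs) + 1))
    if not signs:
        return False
    if head == "*" or head == signs[0]:       # exactly one sign, or that exact sign
        return _match(signs[1:], rest)
    return False


def matches(word: str, pattern: str) -> bool:
    """True if the dash-separated word fits the dash-separated pattern."""
    if not pattern or "-" not in word:        # empty pattern / single-sign word
        return False
    return _match([fold(s) for s in word.split("-")],
                  [fold(p) for p in pattern.split("-")])


print(matches("KU-NE-RO", "KU-*-RO"))   # True
print(matches("KU-RO", "KU-*-RO"))      # False
print(matches("KU-RO", "**-RO"))        # True
```

Note that a numbered label such as *301 is compared as a whole label, so it is never confused with the bare * wildcard.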
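KU-RO is conventionally understood as "total" and PO-TO-KU-RO as "grand total," a reading inferred from where the words sit on accounting tablets rather than from decipherment. On such a tablet you can check the stated total against the line items above it, which is a concrete, falsifiable thing to do with an undeciphered script.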
Exploratory. Section boundaries are heuristic and Aegean metrology is contested, so a "balance" is evidence, not proof. Only 35 of the 1,721 documents carry a KU-RO total that's checkable at all (39 total lines, since a few tablets state more than one); the rest are too fragmentary, which reflects the state of the corpus rather than a limit of the tool. Of the 39 checked lines, 8 balance exactly.
Check one tablet:
aegean balance lineara HT13
# doc marker stated computed diff balances
# HT13 KU-RO 130.5 131.0 0.5 NO

The stated total is 130½ but the six items above it sum to 131, off by ½. The Python form gives you the full record per total line:
from aegean.analysis import balance_check
for chk in balance_check(corpus.get("HT13")):
print(chk)
# BalanceCheck(stated_total=130.5, computed_sum=131.0, item_count=6,
# difference=0.5, balances=False, marker='KU-RO', total_line_index=7)

BalanceCheck fields: stated_total, computed_sum, item_count,
difference, balances, marker, total_line_index.
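The HT13 arithmetic can be reproduced by hand with exact fractions. The sketch below is independent of pyaegean: it re-types the six quantities and the stated total from the HT13 transcription on this page, and `parse_quantity` is a hypothetical helper written for the example. The difference is taken as computed minus stated, which matches the sign of the diff column in the CLI output shown here.

```python
from fractions import Fraction
import unicodedata

FRACTION_SLASH = "\u2044"        # the slash used inside fractions such as 1/2
HALF = "\u00b9\u2044\u2082"      # superscript 1, fraction slash, subscript 2


def _to_int(digits: str) -> int:
    """Convert superscript, subscript, or plain digits to an int."""
    return int("".join(str(unicodedata.digit(ch)) for ch in digits))


def parse_quantity(text: str) -> Fraction:
    """Sum the space-separated parts of a quantity, e.g. '27' plus a half."""
    total = Fraction(0)
    for part in text.split():
        if FRACTION_SLASH in part:
            numerator, denominator = part.split(FRACTION_SLASH)
            total += Fraction(_to_int(numerator), _to_int(denominator))
        else:
            total += int(part)
    return total


# Lines 2-7 of HT13 as transcribed on this page, then the KU-RO line.
items = [f"5 {HALF}", "56", f"27 {HALF}", "18", "19", "5"]
stated = parse_quantity(f"130 {HALF}")
computed = sum((parse_quantity(q) for q in items), Fraction(0))
difference = computed - stated
balances = difference == 0

print(float(stated), float(computed), float(difference), balances)
# 130.5 131.0 0.5 False
```

Using Fraction rather than float keeps halves and quarters exact, so a reported mismatch is never a rounding artefact.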
Omit the document id to check every total line at once:
aegean balance lineara
# lineara: 39 total line(s) checked
# HT9a KU-RO 31.75 31.0 -0.75 NO
# HT9b KU-RO 24.0 24.0 0.0 yes
# HT11b KU-RO 180.0 180.0 0.0 yes
# HT13 KU-RO 130.5 131.0 0.5 NO
# HT25b KU-RO 52.0 52.0 0.0 yes
# ... (39 lines across 35 documents)

Add --strict to exit non-zero if any checked total fails to balance, which is handy in
a script or CI step. --json emits the rows as machine-readable JSON.
A stricter filter than "balances": every token securely read (no lacuna, no bracketed restoration) and the arithmetic holds within a tolerance. These are the clean teaching/drill candidates.
from aegean.analysis.accounting import checkable_accounts, is_checkable_account
clean = checkable_accounts(corpus) # default tolerance 10%
[d.id for d in clean] # ['HT9a', 'HT9b', 'HT11b', 'HT13', 'HT89', ...]
is_checkable_account(corpus.get("HT11b")) # True

The default tolerance=0.10 is lenient on purpose, because Aegean metrology is
imperfectly understood; raise or lower it to taste.
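One way to read a 10% tolerance is as a relative error against the stated total. That reading is an assumption made for this sketch: the page does not say whether pyaegean measures against the stated or the computed figure, and `within_tolerance` is a hypothetical helper, not a library function.

```python
def within_tolerance(stated: float, computed: float, tolerance: float = 0.10) -> bool:
    """True if computed is within `tolerance` of stated, as a fraction of stated.

    Assumption: the error is relative to the stated total.
    """
    if stated == 0:
        return computed == 0
    return abs(computed - stated) / abs(stated) <= tolerance


# Figures taken from the `aegean balance lineara` output on this page.
print(within_tolerance(130.5, 131.0))                   # True  (HT13, off by about 0.4%)
print(within_tolerance(31.75, 31.0))                    # True  (HT9a, off by about 2.4%)
print(within_tolerance(130.5, 131.0, tolerance=0.001))  # False (too strict for HT13)
```

Under this reading HT13 fails an exact balance but passes the lenient check, which is consistent with it appearing in the clean list above.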
Sort inscriptions by what their token stream looks like — a quick way to find the accounts, the libation formulas, or the running text. These are content-shape heuristics, not genre attributions; you're expected to override individual calls.
| Category key | Label | Signal |
|---|---|---|
| accounting | Accounting | Has KU-RO, or numerals plus several multi-sign words |
| libation | Libation | Contains a known libation-formula word |
| list | Lists | Many separator marks, no numerals |
| text | Text / Other | Extended hyphenated text, no numerals |
| other | Unclassified | Short or ambiguous |
The known libation-formula words are A-TA-I-*301-WA-JA, JA-SA-SA-RA-ME, and
A-DI-KI-TE-TE-DU. The precedence is exactly the order above (accounting wins
over libation, etc.).
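The precedence rule is easiest to see as a chain of checks where the first hit wins. The sketch below is an illustration only, not pyaegean's classifier: tokens are modelled as hypothetical (text, role) pairs, and the thresholds for "several," "many," and "extended" are assumed values picked for the example, since the page does not give the real ones.

```python
LIBATION_WORDS = {"A-TA-I-*301-WA-JA", "JA-SA-SA-RA-ME", "A-DI-KI-TE-TE-DU"}


def classify(tokens, several=3, many=3, extended=5):
    """Return a category key; checks run in precedence order, first match wins.

    `tokens` is a list of (text, role) pairs with role in
    {"word", "numeral", "logogram", "separator"} (a made-up representation).
    The three thresholds are assumptions, not the library's values.
    """
    words = [text for text, role in tokens if role == "word" and "-" in text]
    has_numerals = any(role == "numeral" for _, role in tokens)
    separators = sum(1 for _, role in tokens if role == "separator")

    if "KU-RO" in words or (has_numerals and len(words) >= several):
        return "accounting"
    if LIBATION_WORDS & set(words):
        return "libation"
    if separators >= many and not has_numerals:
        return "list"
    if len(words) >= extended and not has_numerals:
        return "text"
    return "other"


# A shortened HT13: a heading word, one entry, and the KU-RO line.
ht13 = [("KA-U-DE-TA", "word"), ("VIN", "logogram"),
        ("RE-ZA", "word"), ("5", "numeral"),
        ("KU-RO", "word"), ("130", "numeral")]
print(classify(ht13))  # accounting
```

Because the accounting check comes first, a document containing both KU-RO and a libation-formula word would be labelled accounting, which is the kind of call you may want to override by hand.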
Census the whole corpus:
aegean analyze structure lineara
# lineara: structure census (heuristic)
# accounting 134
# libation 15
# list 7
# text 2
# other 1563

Classify one document:

aegean analyze structure lineara HT13
# HT13: accounting

In Python:
from aegean.analysis.structure import classify_structure, classify_corpus
classify_structure(corpus.get("HT13")) # 'accounting'
buckets = classify_corpus(corpus) # {'accounting': [...ids...], 'libation': [...], ...}
len(buckets["accounting"]) # 134

The accounting tablets count commodities marked by ideograms (grain, oil,
sheep, people …). pyaegean ships a curated catalog of 21 commodity heads with
their standard GORILA/Younger glosses and a broad category, plus helpers to
identify them in a token stream. (The glosses are standard; the syllabic
values of the underlying signs are a separate, open question, and the numbered
*NNN logograms are genuinely undeciphered as to referent.)
from aegean.scripts.lineara.commodities import COMMODITIES, commodity_head, is_lexical_word
len(COMMODITIES) # 21
commodity_head("GRA") # 'GRA'
commodity_head("OLE+U") # 'OLE' — strips the ligature modifier
commodity_head("OVISm") # 'OVIS' — strips the sex marker
commodity_head("KU-RO") # None — hyphenated → a syllabic word, never a logogram
is_lexical_word("KU-RO") # True: a real syllabic word, not a logogram chain

The most frequent commodities across the corpus (oil, grain, cyperus, figs, wine, people):
from collections import Counter
from aegean.scripts.lineara.commodities import commodity_head
counts = Counter(commodity_head(t.text)
for d in corpus for t in d.tokens
if commodity_head(t.text))
counts.most_common(6)
# [('OLE', 128), ('GRA', 108), ('CYP', 85), ('NI', 76), ('VIN', 62), ('VIR', 48)]

| Category | Heads |
|---|---|
| agricultural | GRA, HORD, OLE, OLIV, VIN, FIC, NI, CYP, AROM, GRA_PA |
| livestock | OVIS, CAP, SUS, BOS |
| people | VIR, MUL |
| material | TELA, LANA, AES, AUR, ARG |
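The category table can be turned into a lookup to roll head counts up by category. The sketch below hard-codes the table and the six counts quoted on this page, so it runs without pyaegean; the names `CATEGORY_HEADS` and `HEAD_TO_CATEGORY` are invented for the example and are not the library's data structures.

```python
from collections import Counter

# The category table from this page, typed in by hand.
CATEGORY_HEADS = {
    "agricultural": ["GRA", "HORD", "OLE", "OLIV", "VIN", "FIC", "NI", "CYP", "AROM", "GRA_PA"],
    "livestock": ["OVIS", "CAP", "SUS", "BOS"],
    "people": ["VIR", "MUL"],
    "material": ["TELA", "LANA", "AES", "AUR", "ARG"],
}
HEAD_TO_CATEGORY = {head: cat for cat, heads in CATEGORY_HEADS.items() for head in heads}

# The six most frequent heads and their counts, as quoted on this page.
top_six = [("OLE", 128), ("GRA", 108), ("CYP", 85), ("NI", 76), ("VIN", 62), ("VIR", 48)]

by_category = Counter()
for head, n in top_six:
    by_category[HEAD_TO_CATEGORY[head]] += n

print(len(HEAD_TO_CATEGORY))  # 21
print(dict(by_category))      # {'agricultural': 459, 'people': 48}
```

These totals cover only the six heads listed, so they are a lower bound per category rather than a corpus-wide census.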
The bundled corpus is a normalized transcription, but the apparatus it does
carry is interpreted on load. The upstream's erased-sign marks become
ReadingStatus.LOST (where the text isn't preserved), and damaged-at-a-break
words plus bracketed uncertain readings become UNCLEAR:
from aegean.core.model import ReadingStatus
lost = sum(1 for d in corpus for t in d.tokens if t.status is ReadingStatus.LOST)
unclear = sum(1 for d in corpus for t in d.tokens if t.status is ReadingStatus.UNCLEAR)
docs = sum(1 for d in corpus if any(t.status is not ReadingStatus.CERTAIN for t in d.tokens))
lost, unclear, docs # (552, 120, 366)

So 366 of 1,721 documents carry editorial status (552 LOST tokens, 120
UNCLEAR). The full Leiden apparatus (restorations, dotted readings) is
still absent — the upstream digitization dropped it — so for edition-grade work
consult GORILA and SigLA (below). The EpiDoc reader/writer round-trips
these as <unclear> / <supplied> / <gap>.
A second, independent Linear A corpus: SigLA, the paleographical database of Salgarella & Castellan (dataset published CC BY-NC-SA 4.0; its paper invites use outside the interface). pyaegean hosts the decoded dataset as a sha256-pinned release asset and loads it on demand — the NonCommercial obligation passes to you, and nothing ships in the wheel.
sigla = aegean.load("sigla") # ~1.2 MB fetch on first use, then cached
len(sigla) # 781
doc = sigla.get("HT 13") # note the space — SigLA ids are spaced
doc.meta.name # 'HT 13 (6.1×10.5×0.8 cm)' — physical dimensions!
" ".join(t.text for t in doc.tokens)
# 'KA-U-DE-TA VIN TE RE-ZA TE-TU TE-KI KU-*79-NI DA-SI-*118 I-DU-NE-SI KU-RO'

What SigLA adds over the bundled corpus: document typology, find-site,
physical dimensions, period, EFA plate references, SigLA's own word
division, and a fully independent reading of each tablet (note KU-*79-NI where
GORILA reads KU-ZU-NI) — useful for cross-checking. The two corpora broadly
agree on shared documents once notation differences (like *120↔GRA) are
normalized; where they diverge it's genuine scholarly variation.
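A cross-check of the two readings of HT13 can be sketched with the standard library alone. The word lists below are copied from the outputs shown on this page, so nothing is fetched; `normalize_id` is a hypothetical helper, and treating every hyphenated SigLA token as a word is an assumption made for the example.

```python
import difflib


def normalize_id(doc_id: str) -> str:
    """Bridge the two id styles: SigLA's 'HT 13' versus the bundled 'HT13'."""
    return doc_id.replace(" ", "")


# HT13 words from the bundled (GORILA-derived) corpus, as listed on this page.
gorila_words = ["KA-U-DE-TA", "RE-ZA", "TE-TU", "TE-KI",
                "KU-ZU-NI", "DA-SI-*118", "I-DU-NE-SI", "KU-RO"]

# HT 13 token stream from SigLA, as listed on this page; keep hyphenated tokens.
sigla_tokens = ("KA-U-DE-TA VIN TE RE-ZA TE-TU TE-KI KU-*79-NI "
                "DA-SI-*118 I-DU-NE-SI KU-RO").split()
sigla_words = [t for t in sigla_tokens if "-" in t]

matcher = difflib.SequenceMatcher(a=gorila_words, b=sigla_words, autojunk=False)
diffs = [(gorila_words[i1:i2], sigla_words[j1:j2])
         for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

print(normalize_id("HT 13"))  # HT13
print(diffs)                  # [(['KU-ZU-NI'], ['KU-*79-NI'])]
```

A sequence diff is used instead of a position-by-position comparison so that a difference in word division between the two editions shows up as an insertion or deletion rather than misaligning everything after it.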
One honest limit: SigLA is a palaeographic sign database, so it records sign occurrences and word division but not the cardinal-number quantities of the accounts — there are no numeral values here. Use the bundled GORILA corpus for accounting. Cite SigLA in academic work.
All the same commands work on it (aegean info sigla, aegean search sigla "*-RO", aegean stats sigla, …). See Data & Provenance.
Word and sign frequency tables straight from the shell:
aegean stats lineara --top 5
# KU-RO 37
# SA-RA₂ 20
# KI-RO 16
# *411-VS 15
# A-TA-I-*301-WA-JA 11
aegean stats lineara --signs --top 5
# 552 (the erased-sign marker)
# 𐄁 468 (the word divider)
# 1 310
# KU 307
# KA 284

The full analytical toolkit (phonetic distance and alignment, cross-script nearest-neighbour, association statistics (χ², log-likelihood, Fisher, PMI), collocation, morphology clustering, the compound query engine, keyness, and dispersion) is documented on the Analysis page. The Linear B and Cypriot corpora use the same model and the same commands, and the Greek side lives under Greek NLP / Meters.
| Command | What it does | Key flags |
|---|---|---|
| aegean info lineara | Corpus size, provenance, license | --json |
| aegean load lineara | Filter by metadata; list or export | --site --period --scribe --support -o --limit --json |
| aegean show lineara HT13 | One document, line by line | --json |
| aegean search lineara "KU-*-RO" | Wildcard sign-pattern search | --json |
| aegean query lineara | Compound query engine | --where --output-kind --fields --limit --json |
| aegean stats lineara | Word / sign frequency table | --signs --top --json |
| aegean sign lineara KU | Look up one sign | --json |
| aegean balance lineara [HT13] | KU-RO reconciliation | --strict --json |
| aegean analyze structure lineara [HT13] | Heuristic categories | --json |
| aegean cite lineara | Cite the corpus (or a subset) | --style --site --period --scribe --support |
| aegean export lineara | Export JSON / CSV / Parquet / EpiDoc / SQLite | (see CLI) |

print(corpus.provenance.cite())
# Godart, L. & Olivier, J.-P. (1976–1985). Recueil des inscriptions en linéaire A. — https://github.com/mwenge/lineara.xyz

aegean cite lineara --style bibtex # or plain / apa
aegean cite lineara --site "Haghia Triada" # cite the exact filtered subset
# … [subset: filter(site='Haghia Triada') → 1110 of 1721 documents]

- Undeciphered. Sound values are a Linear B convention; transliterations, "totals," structure labels, and commodity glosses are evidence to weigh, never ground truth.
- Damaged corpus. Only 35 tablets carry a checkable total; section boundaries for reconciliation are heuristic; the full Leiden apparatus isn't in the bundled data.
- Heuristic structure. The accounting/libation/list/text labels are content-shape rules, not genre attributions — override them freely.
- SigLA carries no quantities and is NonCommercial; the obligation passes to you.
The full picture, with citations and caveats, is on the Limitations page. For the methods behind the numbers, see Analysis; to get from nothing installed to your first result, see Getting Started.