-
Notifications
You must be signed in to change notification settings - Fork 0
Analysis
aegean.analysis ports the Linear A Research Workbench's analytical methods to
Python, faithfully — each is checked against shared golden fixtures
(tests/fixtures/golden/algorithms.json) so the port can't silently diverge,
and property tests mirror the workbench's invariants.
All cross-linguistic and decipherment-adjacent methods are exploratory. The Linear A corpus is undeciphered.
A weighted Levenshtein over phonetic strings, normalized to [0, 1]. Vowel↔vowel
substitutions are cheap (0.3), same articulatory-class consonants moderate (0.5),
everything else full cost (1).
from aegean.analysis import phonetic_distance, extract_root
phonetic_distance("kuro", "kuro") # 0.0
phonetic_distance("kuro", "karo") # 0.075 (one vowel swap / 4)
phonetic_distance("kuro", "kulo") # 0.125 (r↔l, same class / 4)
extract_root("KU-RO") # 'kr' (consonant skeleton)Which phonemes count as "same class" is a linguistic judgement, exposed as a configurable scheme:
from aegean.analysis import (
build_phonetic_classes, DEFAULT_PHONETIC_SCHEME,
CONSERVATIVE_PHONETIC_SCHEME, describe_phonetic_scheme,
)
describe_phonetic_scheme(DEFAULT_PHONETIC_SCHEME)
# 'interdentals=dental, ḥ=velar, ž=sibilant, strip-notation=on'
cl = build_phonetic_classes(CONSERVATIVE_PHONETIC_SCHEME)Per-phoneme alignment classifies each position as match / vowel-sub / class-sub / far-sub / insertion / deletion:
from aegean.analysis import align_phonetic
[c.op for c in align_phonetic("ka", "ko")] # ['match', 'sub-vowel']
[c.op for c in align_phonetic("pa", "ba")] # ['sub-class', 'match']Word-level multiple-sequence alignment (progressive Needleman–Wunsch) lines up whole inscriptions:
from aegean.analysis import align_sequences
align_sequences([["A-B", "X-Y", "C-D"], ["A-B", "Z-Z", "C-D"]])
# [['A-B','A-B'], ['X-Y','Z-Z'], ['C-D','C-D']]Heuristic lemmatization for an undeciphered script: find suffixes productive across many words, then group words sharing a stem via a productive suffix.
from aegean.analysis import find_morphological_clusters
clusters = find_morphological_clusters(
corpus.word_frequencies(),
min_suffix_productivity=5, min_cluster_size=2, max_suffix_len=2,
)
c = clusters[0]
c.stem, c.total_count, c.suffixes
[(m.word, m.count, m.suffix) for m in c.members]For a word pair across N documents (joint, countA, countB, total). SciPy provides the exact special functions (lazy import).
from aegean.analysis import (
chi_squared_2x2, log_likelihood_ratio_2x2, chi_squared_p_value,
fishers_exact, wilson_interval, pmi_interval,
)
chi_squared_2x2(5, 10, 10, 100) # ≈ 15.123 (Yates-corrected)
log_likelihood_ratio_2x2(5, 10, 10, 100) # ≈ 12.533 (G², Dunning 1993)
chi_squared_p_value(3.841) # ≈ 0.05
fishers_exact(5, 5, 5, 10) # ≈ 0.007937 (two-sided)
wilson_interval(5, 10) # (low, high) 95% CIA compound predicate engine over the corpus: an inscription/word field registry, AND/OR/NOT combination, and inscription- or word-output modes.
from aegean.analysis import FilterRow, run_query
res = run_query(corpus, [
FilterRow("id-contains", "HT"), # HT tablets (site code is the id prefix)
FilterRow("word-suffix", "RO", connector="and"), # …with a word ending in -RO
], output="inscriptions")
[d.id for d in res.inscriptions][:5] # ['HT1', 'HT9a', 'HT9b', ...]
words = run_query(corpus, [FilterRow("word-sign-pattern", "KU-*-RO")],
output="words").words # [(word, count), ...]Fields include site-is, scribe-is, period-is, support-is, id-contains,
has-image, ins-contains-word, and word-scope word-contains/-prefix/
-suffix/-min-syllables/-max-syllables/-contains-sign/-cooccurs-with/
-sign-pattern. summarize_filters(...) renders a one-line label.
Heuristic genre classification by content shape — accounting / libation / list / text / other.
from aegean.analysis import classify_structure, classify_corpus
classify_structure(corpus.get("HT13")) # e.g. 'accounting'
buckets = classify_corpus(corpus) # {category: [doc_id, ...]}
{k: len(v) for k, v in buckets.items()}Start here
Aegean scripts
Greek
Capabilities
Reference