Analysis

aegean.analysis is a faithful Python port of the Linear A Research Workbench's analytical methods. Each method is checked against shared golden fixtures (tests/fixtures/golden/algorithms.json) so the port can't silently diverge, and property tests mirror the workbench's invariants.

All cross-linguistic and decipherment-adjacent methods are exploratory. The Linear A corpus is undeciphered.

Phonetic distance & classes

A weighted Levenshtein distance over phonetic strings, normalized to [0, 1]. Vowel↔vowel substitutions are cheap (0.3), substitutions between consonants of the same articulatory class cost a moderate 0.5, and everything else costs the full 1.

from aegean.analysis import phonetic_distance, extract_root
phonetic_distance("kuro", "kuro")     # 0.0
phonetic_distance("kuro", "karo")     # 0.075   (one vowel swap / 4)
phonetic_distance("kuro", "kulo")     # 0.125   (r↔l, same class / 4)
extract_root("KU-RO")                 # 'kr'     (consonant skeleton)
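
The cost scheme above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the vowel set, the articulatory classes in CLASSES, the insertion/deletion cost of 1, and normalization by the longer string's length are all assumptions chosen to reproduce the example values shown.

```python
VOWELS = set("aeiou")
# Hypothetical articulatory classes, for illustration only. The library
# derives its real classes from a configurable phonetic scheme.
CLASSES = [set("rl"), set("pb"), set("td"), set("kg"), set("mn"), set("sz")]


def substitution_cost(x, y):
    """Cost of replacing phoneme x with phoneme y."""
    if x == y:
        return 0.0
    if x in VOWELS and y in VOWELS:
        return 0.3  # vowel <-> vowel
    if any(x in c and y in c for c in CLASSES):
        return 0.5  # same articulatory class
    return 1.0      # everything else


def weighted_distance(a, b):
    """Weighted Levenshtein distance, divided by the longer length (assumed)."""
    if not a and not b:
        return 0.0
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1.0,                          # deletion (assumed cost 1)
                cur[j - 1] + 1.0,                       # insertion (assumed cost 1)
                prev[j - 1] + substitution_cost(x, y),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(a), len(b))
```

Because no single edit costs more than 1, dividing by the longer length keeps the result in [0, 1].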

Which phonemes count as "same class" is a linguistic judgement, exposed as a configurable scheme:

from aegean.analysis import (
    build_phonetic_classes, DEFAULT_PHONETIC_SCHEME,
    CONSERVATIVE_PHONETIC_SCHEME, describe_phonetic_scheme,
)
describe_phonetic_scheme(DEFAULT_PHONETIC_SCHEME)
# 'interdentals=dental, ḥ=velar, ž=sibilant, strip-notation=on'
cl = build_phonetic_classes(CONSERVATIVE_PHONETIC_SCHEME)

Alignment

Per-phoneme alignment classifies each position as a match, a vowel substitution, a same-class substitution, a far substitution, an insertion, or a deletion:

from aegean.analysis import align_phonetic
[c.op for c in align_phonetic("ka", "ko")]    # ['match', 'sub-vowel']
[c.op for c in align_phonetic("pa", "ba")]    # ['sub-class', 'match']

Word-level multiple-sequence alignment (progressive Needleman–Wunsch) lines up whole inscriptions:

from aegean.analysis import align_sequences
align_sequences([["A-B", "X-Y", "C-D"], ["A-B", "Z-Z", "C-D"]])
# [['A-B','A-B'], ['X-Y','Z-Z'], ['C-D','C-D']]
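
For intuition, the pairwise step that a progressive alignment builds on can be sketched as a plain Needleman–Wunsch over word lists. This is an illustrative sketch only: the function name needleman_wunsch and the match/mismatch/gap scores are assumptions, and it covers just two sequences rather than the progressive multi-sequence case.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two word lists; return columns, with None marking a gap."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the score matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner, preferring the diagonal.
    columns = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            columns.append([a[i - 1], b[j - 1]])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append([a[i - 1], None])
            i -= 1
        else:
            columns.append([None, b[j - 1]])
            j -= 1
    columns.reverse()
    return columns
```

With these assumed scores, aligning the two inscriptions from the example yields the same three columns, and a word missing from one side appears as a None gap.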

Morphological clustering

Heuristic lemmatization for an undeciphered script: find suffixes productive across many words, then group words sharing a stem via a productive suffix.

from aegean.analysis import find_morphological_clusters
clusters = find_morphological_clusters(
    corpus.word_frequencies(),
    min_suffix_productivity=5, min_cluster_size=2, max_suffix_len=2,
)
c = clusters[0]
c.stem, c.total_count, c.suffixes
[(m.word, m.count, m.suffix) for m in c.members]
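
The two-pass idea (find productive suffixes, then group by stem) can be sketched as follows. This is a rough illustration under stated assumptions, not the library's algorithm: cluster_by_suffix is a hypothetical helper, "productivity" is taken to mean the number of distinct stems a suffix attaches to, words are assumed to be hyphen-separated sign sequences, and the toy frequencies are invented rather than drawn from the corpus.

```python
from collections import defaultdict


def cluster_by_suffix(word_counts, min_suffix_productivity=2,
                      min_cluster_size=2, max_suffix_len=1):
    """Group words that share a stem via a productive suffix (sketch)."""
    splits = []                          # (word, count, stem, suffix)
    stems_per_suffix = defaultdict(set)  # suffix -> distinct stems
    for word, count in word_counts.items():
        signs = word.split("-")
        for k in range(1, max_suffix_len + 1):
            if len(signs) > k:           # keep at least one sign in the stem
                stem = "-".join(signs[:-k])
                suffix = "-".join(signs[-k:])
                splits.append((word, count, stem, suffix))
                stems_per_suffix[suffix].add(stem)

    # Pass 1: a suffix is productive if it attaches to enough distinct stems.
    productive = {s for s, stems in stems_per_suffix.items()
                  if len(stems) >= min_suffix_productivity}

    # Pass 2: group words by stem, keeping only productive-suffix splits.
    clusters = defaultdict(list)
    for word, count, stem, suffix in splits:
        if suffix in productive:
            clusters[stem].append((word, count, suffix))
    return {stem: members for stem, members in clusters.items()
            if len(members) >= min_cluster_size}


# Invented toy frequencies, for illustration only.
toy = {"PA-TA": 4, "PA-TE": 2, "KA-TA": 3, "KA-TE": 1, "MI-TA": 1}
toy_clusters = cluster_by_suffix(toy)
```

Here both -TA and -TE attach to at least two stems, so PA and KA each form a two-member cluster, while MI has a single member and is dropped.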

Collocation statistics

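Association measures for a word pair across N documents. The 2×2 tests take four counts: joint (documents containing both words), countA and countB (documents containing each word), and total (N). SciPy provides the exact special functions and is imported lazily.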

from aegean.analysis import (
    chi_squared_2x2, log_likelihood_ratio_2x2, chi_squared_p_value,
    fishers_exact, wilson_interval, pmi_interval,
)
chi_squared_2x2(5, 10, 10, 100)          # ≈ 15.123  (Yates-corrected)
log_likelihood_ratio_2x2(5, 10, 10, 100) # ≈ 12.533  (G², Dunning 1993)
chi_squared_p_value(3.841)               # ≈ 0.05
fishers_exact(5, 5, 5, 10)               # ≈ 0.007937 (two-sided)
wilson_interval(5, 10)                    # (low, high) 95% CI
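
To make the four-count convention concrete, the sketch below expands (joint, countA, countB, total) into a 2×2 contingency table and computes the Yates-corrected chi-squared and G² statistics with the standard textbook formulas. The helper names are hypothetical and the code uses only the standard library; it is not the library's implementation, though it reproduces the two values quoted above.

```python
import math


def contingency(joint, count_a, count_b, total):
    """Expand the four counts into the cells (a, b, c, d) of a 2x2 table."""
    a = joint                                # documents with both A and B
    b = count_a - joint                      # A without B
    c = count_b - joint                      # B without A
    d = total - count_a - count_b + joint    # neither
    return a, b, c, d


def chi_squared_yates(joint, count_a, count_b, total):
    """Pearson chi-squared with Yates' continuity correction."""
    a, b, c, d = contingency(joint, count_a, count_b, total)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    diff = max(abs(a * d - b * c) - total / 2, 0.0)
    return total * diff ** 2 / denom


def g_squared(joint, count_a, count_b, total):
    """Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)) over non-empty cells."""
    a, b, c, d = contingency(joint, count_a, count_b, total)
    observed = ((a, b), (c, d))
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing
                expected = rows[i] * cols[j] / total
                g += o * math.log(o / expected)
    return 2 * g
```

For the counts (5, 10, 10, 100) the table is a=5, b=5, c=5, d=85, which gives roughly 15.123 and 12.533 respectively.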

Query engine

A compound predicate engine over the corpus: an inscription/word field registry, AND/OR/NOT combination, and inscription- or word-output modes.

from aegean.analysis import FilterRow, run_query
res = run_query(corpus, [
    FilterRow("id-contains", "HT"),                      # HT tablets (site code is the id prefix)
    FilterRow("word-suffix", "RO", connector="and"),     # …with a word ending in -RO
], output="inscriptions")
[d.id for d in res.inscriptions][:5]                      # ['HT1', 'HT9a', 'HT9b', ...]

words = run_query(corpus, [FilterRow("word-sign-pattern", "KU-*-RO")],
                  output="words").words   # [(word, count), ...]

Fields include site-is, scribe-is, period-is, support-is, id-contains, has-image, ins-contains-word, and the word-scope fields word-contains, word-prefix, word-suffix, word-min-syllables, word-max-syllables, word-contains-sign, word-cooccurs-with, and word-sign-pattern. summarize_filters(...) renders a one-line label.
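
As an illustration of what a sign-pattern filter such as "KU-*-RO" might do, here is a small matcher. It rests on an assumption the text above does not confirm: that "*" stands for exactly one sign and that the pattern must cover the whole word. The function name is hypothetical and the test words are invented.

```python
def matches_sign_pattern(word, pattern):
    """Return True if word fits pattern, with '*' as a one-sign wildcard (assumed)."""
    signs = word.split("-")
    wanted = pattern.split("-")
    if len(signs) != len(wanted):
        return False  # assumed: the pattern must span the entire word
    return all(w == "*" or w == s for s, w in zip(signs, wanted))
```

Under that reading, "KU-MA-RO" matches "KU-*-RO", while "KU-RO" (too short) and "KU-MA-RA" (wrong final sign) do not.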

Tablet-structure classification

Heuristic genre classification by content shape — accounting / libation / list / text / other.

from aegean.analysis import classify_structure, classify_corpus
classify_structure(corpus.get("HT13"))   # e.g. 'accounting'
buckets = classify_corpus(corpus)         # {category: [doc_id, ...]}
{k: len(v) for k, v in buckets.items()}
