Data and Provenance

Data & Provenance

Bundled vs fetched

Compact text data ships inside the wheel and works offline:

Linear A: inscriptions.json, signs.json, phonetic_map.json
Linear B / Cypriot: signs.json, phonetic_map.json, lexicon.json, sample_inscriptions.json (Unicode UCD)
Cypro-Minoan: signs.json, sample_inscriptions.json (undeciphered — no phonetic map or lexicon)
Greek: sample_texts.json, lemmata.json, benchmark_gold.json

Large or license-restricted assets are never bundled — they are fetched on demand into a user cache. The wheel ships only code + tiny JSON (CI's scripts/check_footprint.py enforces that, plus an instant, heavy-dep-free import).

from aegean.data import load_bundled_json
load_bundled_json("lineara", "signs.json")

Download-to-cache: `fetch()`

fetch(name) downloads a registered remote dataset into the cache and returns its path. Downloads are sha256-verified (when a checksum is pinned), atomic (written to a .part file then renamed), and idempotent (a present, valid cache entry is a no-op). Archive datasets (extract=True, e.g. lineara-images) are unpacked into a cache directory — safely (members that escape the directory are rejected) — and fetch() returns that directory.

from aegean import data
data.cache_dir()                 # where datasets are cached (override: PYAEGEAN_CACHE)
path = data.fetch("lineara-images")

Errors are explicit and never block import:

unknown dataset → DataNotAvailableError
no pinned URL → DataNotAvailableError naming the env override to set
checksum mismatch → DataNotAvailableError (the bad download is removed)

The Linear A imagery (`lineara-images`)

The facsimile/photo set (3,368 files, ~116 MB download, ~125 MB unpacked) is fetched (never re-hosted) from a release on the ryanpavlicek/linearaworkbench repo. fetch downloads the tar.gz and unpacks it into a cache directory of images. Its copyright is a patchwork — most images are © École Française d'Athènes (the GORILA volumes), others are held by named scholars, publications, and photographers (see the corpus's per-image imageRights); that attribution is unaffected by fetching, and pyaegean does not redistribute the images itself.

The release asset's URL and sha256 are pinned (and verified), so a plain call just works and is integrity-checked:

data.fetch("lineara-images")     # downloads the pinned asset, sha256-verified, unpacks, caches

To fetch from your own mirror instead, set an env override (the pinned sha256 is not enforced against an override):

export PYAEGEAN_LINEARA_IMAGES_URL="https://example.org/lineara-images.tar.gz"

The override pattern is general: PYAEGEAN_<NAME>_URL (uppercased, -→_) overrides any dataset's URL.

The Greek treebank lexicon (`use_treebank`)

aegean.greek.use_treebank() downloads the Perseus Ancient Greek Dependency Treebank (AGDT v2.1, Greek) — 33 .tb.xml files, ~75 MB, pinned to a fixed commit — into the cache, then builds a derived form→lemma/morphology lexicon there (agdt-greek-lexicon.json); use_parser() trains a dependency-parser model (agdt-parser-model.json.gz), use_tagger() trains a POS-tagger model (agdt-postagger.json.gz), and use_lemmatizer() trains an edit-tree lemmatizer model (agdt-lemmatizer.json.gz) from the same files. The treebank is CC BY-SA 3.0; it is fetched (never re-hosted), and the derived lexicon stays in the local cache — pyaegean neither bundles nor redistributes it, so the ShareAlike terms don't reach the Apache-2.0 package. Cite the AGDT in work that relies on it. Network is needed only on the first call; the build is idempotent thereafter. See Greek NLP → Treebank-backed mode.

The Greek lexicon (LSJ, `use_lsj`)

aegean.greek.use_lsj() downloads the Perseus Liddell-Scott-Jones lexicon (the TEI A Greek-English Lexicon — 27 files, ~270 MB, pinned to a fixed commit) into the cache and builds a derived, gzipped lemma→entry index there (lsj-perseus-index.json.gz, ~15 MB). The LSJ is CC BY-SA 4.0 (Perseus Digital Library, with NEH funding); it is fetched (never re-hosted) and the index stays in the local cache — pyaegean neither bundles nor redistributes it. Attribute Perseus per the statement in NOTICE. Network is needed only on the first call. See Greek NLP → Lexicon (LSJ).

The Greek neural lemmatizer model (`use_neural_lemmatizer`, `[neural]`)

aegean.greek.use_neural_lemmatizer() activates a seq2seq lemmatizer that generates the lemma for a form, reaching 76.3% on unseen forms. It pairs a bundled gold lemma lookup (which answers attested forms) with the neural model (which handles the rest); the model is fetched to the cache (~232 MB), never bundled, and runs torch-free on numpy + onnxruntime, loaded only on activation.

Model card: the base model is bowphs/GreTa, an Ancient-Greek T5 released under Apache-2.0. pyaegean fine-tunes it into a form→lemma seq2seq on the AGDT (CC BY-SA 3.0), Pedalion (CC BY-SA 4.0), and Gorman (CC BY-SA 4.0) treebanks, then exports the result to int8 ONNX. The released model is CC BY-SA 4.0, fetched to the user cache and never bundled, so the wheel stays Apache-2.0. See Greek NLP → Neural lemmatizer.

The Greek neural joint pipeline model (`use_neural_pipeline`, `[neural]`)

aegean.greek.use_neural_pipeline() activates one jointly-trained model serving POS, full morphology (UD FEATS), UD dependency trees, and lemmas from a single forward pass — state of the art on the UD Ancient Greek benchmarks (see Greek NLP → The neural pipeline for the measured numbers). The model bundle (fp32 ONNX + tokenizer + label maps + lemma scripts/lookup, ~518 MB, sha256-pinned) is fetched to the cache, never bundled, and runs torch-free on numpy + onnxruntime, loaded only on activation.

Model card: the base encoder is bowphs/GreBerta (Riemenschneider & Frank, Apache-2.0). pyaegean fine-tunes it — tagging heads, a biaffine dependency parser, and an edit-script lemma head — on the AGDT (CC BY-SA 3.0), Gorman (CC BY-SA 4.0), and Pedalion (CC BY-SA 4.0) treebanks, with every sentence of the UD-Perseus dev/test folds and all PROIEL evaluation texts excluded from training (the leakage manifest is built by agdt_ud_overlap(); the protocol is documented in docs/benchmarks.md). The released bundle is CC BY-SA 4.0, fetched to the user cache and never bundled, so the wheel stays Apache-2.0.

The PROIEL evaluation set (`evaluate_on_proiel`)

aegean.greek.evaluate_on_proiel() scores the Greek lemmatizer/tagger against the PROIEL treebank (Greek New Testament + Herodotus) — a source none of pyaegean's models trained on — for a neutral, out-of-AGDT generalization number. PROIEL is CC BY-NC-SA 3.0; it is fetched to the cache for evaluation only, read locally, and never bundled or re-hosted (NonCommercial + ShareAlike). Cite Haug & Jøhndal (2008). See Greek NLP → Neutral evaluation.

Data versioning — pinning for papers

Every dataset pyaegean can touch is versioned and hashable:

from aegean import data
manifest = data.versions()
# {"package": "0.8.0",
#  "bundled": {"lineara/inscriptions.json": {"sha256": "…", "bytes": …}, …},
#  "fetched": {"grc-joint": {"url": "…", "sha256": "…", "cached": True}, …}}

Bundled data ships inside the wheel, so its version is the package version (also stamped on every bundled corpus as Provenance.data_version); fetched assets are sha256-pinned release files, verified on download. To pin an analysis for a paper: record aegean.__version__ and dump the manifest (aegean data versions --json > data-versions.json from the CLI) alongside your results — matching sha256s mean byte-identical data.

Your own corpus

A scholar's own inscriptions get the full API (filter, query, DataFrames, citation, export) without writing a loader:

corpus = aegean.Corpus.from_records([
    {"id": "X1", "text": "KU-RO 10", "meta": {"site": "My site"}},
    {"id": "X2", "lines": [["A-DU", {"text": "5", "status": "unclear"}]]},
], script_id="myfind",
   provenance=aegean.Provenance(source="My dig notebook", citation="Me (2026)."))

Tokens may be plain strings (kinds inferred: parseable numerals vs words, hyphenated tokens get their signs split) or dicts carrying kind, status (editorial certainty), and alt (variant readings). Make it loadable by name with aegean.core.corpus.register_loader("myfind", lambda: corpus); for EpiDoc sources, the bring-your-own reader (see Linear B) covers the same model including <unclear>/<supplied> status and <app>/<rdg> variants.

Variant readings

Token.alt carries alternate readings alongside the editorial status. The EpiDoc writer emits them as a critical apparatus — <app><lem><w>PO-ME</w></lem><rdg><w>PO-MA</w></rdg></app> (validated against the official EpiDoc schema) — and the reader folds them back to one token with its alt tuple, so variants survive the EpiDoc and JSON round-trips.

Provenance & citation

Every Corpus carries a Provenance that stamps exports and gives a citation:

corpus = aegean.load("lineara")
corpus.provenance.source      # 'GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz'
corpus.provenance.license
corpus.provenance.cite()      # one-line citation for papers/logs

corpus.to_dict()["_meta"]      # tool, schemaVersion, scriptId, documentCount, source, license, citation

A note on the Linear A corpus: the bundled transcription is normalized — it does not carry the full Leiden apparatus (lacunae, restorations, uncertain readings), because the upstream digitization dropped it. For edition-grade readings, consult GORILA and SigLA. The data model can still record editorial status — aegean.ReadingStatus (CERTAIN / UNCLEAR / RESTORED / LOST), which the EpiDoc reader/writer round-trip as <unclear>/<supplied>/<gap> — so a bring-your-own EpiDoc corpus keeps its apparatus through a load/export cycle.

Licensing summary

Code — Apache-2.0.
Linear A corpus JSON — GORILA via mwenge/lineara.xyz (Apache-2.0).
Greek sample corpus — public-domain ancient texts (seed only).
Greek treebank lexicon (opt-in) — Perseus AGDT v2.1, CC BY-SA 3.0; fetched and built in the user cache, never bundled or redistributed.
Greek lexicon / LSJ (opt-in) — Perseus Liddell-Scott-Jones, CC BY-SA 4.0; fetched and indexed in the user cache, never bundled or redistributed.
Greek neural lemmatizer (opt-in [neural]) — a GreTa seq2seq (Apache-2.0 base) fine-tuned on the AGDT (CC BY-SA 3.0), Pedalion (CC BY-SA 4.0), and Gorman (CC BY-SA 4.0) treebanks. The model — int8 ONNX weights plus a derived gold lemma lookup — is CC BY-SA 4.0, fetched to the user cache (~232 MB), never bundled; the wheel stays Apache-2.0.
Greek neural joint pipeline (opt-in [neural]) — a GreBerta-based joint model (Apache-2.0 base) fine-tuned on the AGDT (CC BY-SA 3.0), Gorman (CC BY-SA 4.0), and Pedalion (CC BY-SA 4.0) treebanks, evaluation folds excluded from training. The model bundle is CC BY-SA 4.0, fetched to the user cache (~518 MB), never bundled; the wheel stays Apache-2.0.
PROIEL evaluation set (opt-in) — the PROIEL treebank (Greek NT + Herodotus), CC BY-NC-SA 3.0; fetched to the user cache for evaluate_on_proiel only, never bundled or redistributed (NonCommercial + ShareAlike).

See the repository NOTICE and CITATION.cff for full attribution.

pyaegean

Home

Start here

Aegean scripts

Greek

Capabilities

Reference

Data and Provenance

Data & Provenance

Bundled vs fetched

Download-to-cache: fetch()

The Linear A imagery (lineara-images)

The Greek treebank lexicon (use_treebank)

The Greek lexicon (LSJ, use_lsj)

The Greek neural lemmatizer model (use_neural_lemmatizer, [neural])

The Greek neural joint pipeline model (use_neural_pipeline, [neural])

The PROIEL evaluation set (evaluate_on_proiel)

Data versioning — pinning for papers

Your own corpus

Variant readings

Provenance & citation

Licensing summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pyaegean

Clone this wiki locally

Download-to-cache: `fetch()`

The Linear A imagery (`lineara-images`)

The Greek treebank lexicon (`use_treebank`)

The Greek lexicon (LSJ, `use_lsj`)

The Greek neural lemmatizer model (`use_neural_lemmatizer`, `[neural]`)

The Greek neural joint pipeline model (`use_neural_pipeline`, `[neural]`)

The PROIEL evaluation set (`evaluate_on_proiel`)