Data and Provenance
This page is the honest accounting of where every byte comes from: what ships inside the wheel, what gets fetched on demand, the license and size of each dataset, how the cache works, and how to pin an exact data snapshot so a paper is reproducible. If you ever need to answer "what edition is this, and may I redistribute it?", the answer is here.
The short version: code and tiny text JSON are bundled and work offline; large or license-restricted assets are never bundled — they download into a local cache on first use, sha256-verified. Nothing is re-hosted that the license forbids re-hosting.
Compact text data ships inside the wheel and works offline with zero third-party dependencies:
- Linear A: inscriptions.json, signs.json, phonetic_map.json, manifest.json
- Linear B: signs.json, phonetic_map.json, lexicon.json, sample_inscriptions.json (Unicode UCD)
- Cypriot: signs.json, phonetic_map.json, lexicon.json, sample_inscriptions.json (Unicode UCD)
- Cypro-Minoan: signs.json, sample_inscriptions.json (undeciphered, so no phonetic map or lexicon)
- Greek: sample_texts.json, lemmata.json, benchmark_gold.json, nt_sample.json (one NT book), dodson.json (Koine lexicon), works_catalogue.json (the offline greek.catalog discovery index; metadata only, no texts)
- Geo: site_coordinates.json (approximate find-site lat/long)
Large or license-restricted assets are never bundled — they are fetched on
demand into a user cache. The wheel ships only code + tiny JSON (CI's
scripts/check_footprint.py enforces that, plus an instant, heavy-dep-free
import).
from aegean.data import load_bundled_json
signs = load_bundled_json("lineara", "signs.json")
len(signs)
# 344
The table below is exactly what aegean.data.versions() reports as bundled, each
file hashed straight out of the installed wheel. Sizes are in bytes.
| File | Bytes | Source |
|---|---|---|
| cypriot/lexicon.json | 1,377 | Unicode UCD + scholarly excerpts |
| cypriot/phonetic_map.json | 712 | Unicode UCD |
| cypriot/sample_inscriptions.json | 465 | scholarly excerpts (Masson; Chadwick) |
| cypriot/signs.json | 7,032 | Unicode UCD (Cypriot Syllabary) |
| cyprominoan/sample_inscriptions.json | 389 | scholarly excerpts (Ferrara) |
| cyprominoan/signs.json | 12,681 | Unicode UCD (Cypro-Minoan block) |
| geo/site_coordinates.json | 5,872 | GORILA / Younger / public gazetteers |
| greek/benchmark_gold.json | 5,974 | gold benchmark fixtures |
| greek/dodson.json | 712,301 | Dodson Greek Lexicon (CC0) |
| greek/lemmata.json | 1,545 | bundled gold lemma seed |
| greek/nt_sample.json | 38,846 | Nestle 1904, one book (CC0) |
| greek/sample_texts.json | 1,054 | public-domain Greek snippets |
| greek/works_catalogue.json | 293,563 | Perseus + First1KGreek work index (metadata only; built by scripts/build_greek_catalogue.py) |
| lineara/inscriptions.json | 720,766 | GORILA via mwenge/lineara.xyz |
| lineara/manifest.json | 454 | corpus manifest |
| lineara/phonetic_map.json | 648 | Linear A sound values |
| lineara/signs.json | 66,391 | Linear A sign inventory |
| linearb/lexicon.json | 12,058 | Unicode UCD + Wiktionary/kaikki excerpts |
| linearb/phonetic_map.json | 986 | Unicode UCD |
| linearb/sample_inscriptions.json | 7,198 | scholarly excerpts (Ventris & Chadwick) |
| linearb/signs.json | 36,030 | Unicode UCD (Linear B Syllabary + Ideograms) |
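Each row above pairs a file with its byte size, and the manifest adds a sha256 per file. As a rough illustration of how such an entry could be computed with the standard library alone, the sketch below hashes a file in chunks and returns a dict with `sha256` and `bytes` keys. The helper name `file_fingerprint` is hypothetical and is not part of pyaegean.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path: Path, chunk_size: int = 65536) -> dict:
    """Return {'sha256': hex digest, 'bytes': size} for one file.

    Hypothetical helper for illustration; not a pyaegean API.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large file never sits fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"sha256": digest.hexdigest(), "bytes": size}
```

Running it over a locally held copy of a data file and comparing the result with a recorded manifest entry is one way to confirm that two copies are byte-identical.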
fetch(name) downloads a registered remote dataset into the cache and returns
its path. Downloads are sha256-verified (when a checksum is pinned),
atomic (written to a .part file then renamed), and idempotent (a
present, valid cache entry is a no-op). Archive datasets (extract=True, e.g.
lineara-images) are unpacked into a cache directory — safely (members that
escape the directory are rejected) — and fetch() returns that directory.
from aegean import data
data.cache_dir() # where datasets are cached (override: PYAEGEAN_CACHE)
path = data.fetch("nt-corpus") # returns the cached file path; no-op if already valid
path.exists(), path.is_file()
# (True, True)
Errors are explicit and never block import:
| Situation | Raised | Message detail |
|---|---|---|
| unknown dataset | DataNotAvailableError | unknown dataset 'nope'; known: [...] |
| no pinned URL | DataNotAvailableError | names the PYAEGEAN_<NAME>_URL to set |
| checksum mismatch | DataNotAvailableError | expected vs got; the bad download is removed |
| unsafe archive member | DataNotAvailableError | unsafe path in archive: ... |
| network failure | DataNotAvailableError | could not fetch '<name>' from <url>: ... |
from aegean import data
try:
    data.fetch("linearb-corpus") # has no default URL
except data.DataNotAvailableError as e:
    print(str(e)[:70])
# dataset 'linearb-corpus' has no pinned download URL yet (A user-suppli...
fetch(name, force=True) re-downloads even when a valid copy is cached.
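The verified, atomic, idempotent behavior described for fetch can be sketched with the standard library. This is a minimal illustration under stated assumptions, not pyaegean's implementation: `store_verified` and `ChecksumMismatch` are hypothetical names, and the `payload` bytes stand in for a finished download so the example needs no network.

```python
import hashlib
import os
from pathlib import Path
from typing import Optional


class ChecksumMismatch(Exception):
    """Hypothetical error: the bytes do not match the pinned sha256."""


def _sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def store_verified(payload: bytes, dest: Path,
                   expected_sha256: Optional[str] = None,
                   force: bool = False) -> Path:
    """Write payload to dest via a .part file, verifying before the rename."""
    if dest.exists() and not force:
        # Idempotent: a present, valid entry is returned untouched.
        if expected_sha256 is None or _sha256_of(dest) == expected_sha256:
            return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")
    part.write_bytes(payload)
    if expected_sha256 is not None:
        got = _sha256_of(part)
        if got != expected_sha256:
            part.unlink()  # remove the bad download
            raise ChecksumMismatch(f"expected {expected_sha256}, got {got}")
    # Readers see either the old file or the new one, never a partial write.
    os.replace(part, dest)
    return dest
```

Verifying the `.part` file before the rename means a failed check never leaves a half-trusted file at the final path.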
These are the registered remote datasets (aegean data list). Every one is
fetched to the user cache on demand and never bundled in the Apache-2.0
wheel. Each URL and sha256 is pinned in the code; an env override
(PYAEGEAN_<NAME>_URL, see below) points any of them at your own licensed copy.
| Dataset (name) | What it is | Size | License | Provenance |
|---|---|---|---|---|
| lineara-images | 3,368 facsimile/photo files (archive) | ~116 MB tar.gz, ~119 MB unpacked | © École Française d'Athènes + other rightsholders; academic reference only | Fetched from the ryanpavlicek/linearaworkbench release; never re-hosted |
| agdt-derived | Prebuilt AGDT lexicon + tagger/lemmatizer/parser models | ~15 MB | CC BY-SA 3.0 (derived from Perseus AGDT) | Project-hosted derivative of the AGDT |
| lsj-index | Prebuilt LSJ lemma→entry index | ~15 MB | CC BY-SA 4.0 (Perseus Digital Library) | Project-hosted derivative of the Perseus LSJ |
| grc-lemma-neural | GreTa seq2seq lemmatizer (int8 ONNX + tokenizer + gold lookup) | ~232 MB tar.gz | CC BY-SA 4.0; derived from AGDT (3.0) + Pedalion (4.0) + Gorman (4.0) | [neural] extra; fine-tuned from bowphs/GreTa (Apache-2.0 base) |
| grc-joint | Joint tagger-parser-lemmatizer (fp32 ONNX + tokenizer + label maps + lemma scripts/lookup) | ~518 MB tar.gz | CC BY-SA 4.0; derived from AGDT (3.0) + Gorman (4.0) + Pedalion (4.0) | [neural] extra; GreBerta-based (Apache-2.0 base), eval folds excluded |
| sigla-corpus | SigLA-derived Linear A dataset v2: 781 docs, SigLA's word division (1,376 words) + commodity ideograms | ~1.2 MB JSON | CC BY-NC-SA 4.0 (SigLA: Salgarella & Castellan) | Decoded from the SigLA web-app payload; drawings stay at sigla.phis.me |
| damos-corpus | DAMOS Linear B corpus v2: ~5,900 tablets, transliterations + metadata | ~3 MB JSON | CC BY-NC-SA 4.0 (DAMOS: F. Aurora) | Decoded from the DAMOS public API; no imagery |
| nt-corpus | Greek NT (Nestle 1904): 260 chapters / ~137,800 tokens, gold lemma + Robinson morph + Strong's + UD UPOS | ~16 MB JSON | CC0-1.0 (morphology/lemmas/Strong's); base text public domain | From biblicalhumanities/Nestle1904; may be redistributed (CC0) |
| workbench-app | Prebuilt Linear A Research Workbench static web app (archive) | ~3 MB tar.gz | Apache-2.0 (build); embedded Linear A data is GORILA-derived | Served locally by aegean workbench |
| linearb-corpus | A user-supplied Linear B export (bring-your-own) | n/a (no default source) | bring-your-own; DAMOS is CC BY-NC-SA 4.0, LiBER all-rights-reserved | Set PYAEGEAN_LINEARB_CORPUS_URL to your own licensed copy |
Why two licenses keep appearing. "Project-hosted" derivatives (DAMOS, SigLA, the LSJ index, the AGDT-derived models, the neural models) are republished under the same ShareAlike terms as their source, clearly labeled, and kept out of the Apache-2.0 wheel. NonCommercial obligations (DAMOS, SigLA, PROIEL) pass through to you. CC0 assets (the NT corpus, the Dodson lexicon) carry no such obligation, which is exactly why one NT book can be bundled.
The facsimile/photo set (3,368 files, ~116 MB download, ~119 MB unpacked) is
fetched (never re-hosted) from a release on the ryanpavlicek/linearaworkbench
repo. fetch downloads the tar.gz and unpacks it into a cache directory of
images. Its copyright is a patchwork: most images are © École Française
d'Athènes (the GORILA volumes, digitized in the École's CEFAEL library), while
others are held by named scholars, publications, and photographers (see the
corpus's per-image imageRights). That attribution is unaffected by fetching, and
pyaegean does not redistribute the images itself.
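Unpacking an archive safely, in the sense that members escaping the target directory are rejected, can be sketched as follows. This is an illustration under assumptions rather than pyaegean's code: `safe_extract` and `UnsafeArchiveError` are hypothetical names, and the sketch also refuses links and special files, which is a stricter policy than the page states.

```python
import tarfile
from pathlib import Path


class UnsafeArchiveError(Exception):
    """Hypothetical error for members that would land outside the target."""


def safe_extract(archive: tarfile.TarFile, target: Path) -> None:
    """Extract regular files and directories, only if every member stays inside target."""
    root = target.resolve()
    members = archive.getmembers()
    # Pass 1: validate everything before writing anything.
    for member in members:
        dest = (root / member.name).resolve()
        inside = dest == root or root in dest.parents
        if not inside or not (member.isfile() or member.isdir()):
            raise UnsafeArchiveError(f"unsafe path in archive: {member.name}")
    # Pass 2: write directories and file contents by hand.
    for member in members:
        dest = (root / member.name).resolve()
        if member.isdir():
            dest.mkdir(parents=True, exist_ok=True)
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        source = archive.extractfile(member)
        dest.write_bytes(source.read())
```

Validating the whole member list first means a hostile archive is rejected before a single file is written.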
The release asset's URL and sha256 are pinned (and verified), so a plain call just works and is integrity-checked:
from aegean import data
data.fetch("lineara-images") # downloads the pinned asset, sha256-verified, unpacks, caches → a directory
To fetch from your own mirror instead, set an env override (the pinned sha256 is not enforced against an override):
export PYAEGEAN_LINEARA_IMAGES_URL="https://example.org/lineara-images.tar.gz"
aegean.greek.use_treebank() activates the lexicon derived from the Perseus
Ancient Greek Dependency Treebank (AGDT v2.1, Greek); use_parser() /
use_tagger() / use_lemmatizer() activate the models trained from the same
files. On first use each fetches the small prebuilt artifact from the
project-hosted agdt-derived release asset (one ~15 MB bundle: the
form→lemma/morphology lexicon agdt-greek-lexicon.json plus the three trained
models; sha256-pinned). If that asset is ever unreachable, the original path
still works: download the AGDT itself (33 .tb.xml files, ~75 MB, pinned to a
fixed commit) and build/train locally. The treebank is CC BY-SA 3.0: the
source treebank is never re-hosted, the derived artifacts are published under the
same ShareAlike terms (clearly labeled), and everything is fetched to the cache —
never bundled in the Apache-2.0 wheel. Cite the AGDT in work that relies on it.
Network is needed only on the first call. See
Greek NLP → Treebank-backed mode.
aegean.greek.use_lsj() activates a lemma→entry index derived from the Perseus
Liddell-Scott-Jones lexicon. On first use it fetches the prebuilt index
(lsj-perseus-index.json.gz, ~15 MB, sha256-pinned) from the project-hosted
lsj-index release asset; if that is unreachable it falls back to the original
path — downloading the TEI A Greek-English Lexicon itself (27 files, ~270 MB,
pinned to a fixed commit) and building the index locally. The LSJ is CC BY-SA
4.0 (Perseus Digital Library, with NEH funding): the source TEI is never
re-hosted, the derived index is published under the same ShareAlike terms (clearly
labeled), and both are fetched to the cache — never bundled in the Apache-2.0
wheel. Attribute Perseus per the statement in NOTICE. Network is needed only on
the first call. See Greek NLP → Lexicon (LSJ).
aegean.greek.use_neural_lemmatizer() activates a seq2seq lemmatizer that
generates the lemma for a form, reaching 76.3% on unseen forms. It pairs a
bundled gold lemma lookup (which answers attested forms) with the neural model
(which handles the rest); the model is fetched to the cache (~232 MB), never
bundled, and runs torch-free on numpy + onnxruntime, loaded only on activation.
Model card: the base model is bowphs/GreTa, an Ancient-Greek T5 released under Apache-2.0. pyaegean fine-tunes it into a form→lemma seq2seq on the AGDT (CC BY-SA 3.0), Pedalion (CC BY-SA 4.0), and Gorman (CC BY-SA 4.0) treebanks, then exports the result to int8 ONNX. The released model is CC BY-SA 4.0, fetched to the user cache and never bundled, so the wheel stays Apache-2.0. See Greek NLP → Neural lemmatizer.
aegean.greek.use_neural_pipeline() activates one jointly-trained model serving
POS, full morphology (UD FEATS), UD dependency trees, and lemmas from a single
forward pass — state of the art on the UD Ancient Greek benchmarks (see
Greek NLP → The neural pipeline for the
measured numbers). The model bundle (fp32 ONNX + tokenizer + label maps + lemma
scripts/lookup, ~518 MB, sha256-pinned) is fetched to the cache, never bundled,
and runs torch-free on numpy + onnxruntime, loaded only on activation.
Model card: the base encoder is bowphs/GreBerta (Riemenschneider & Frank,
Apache-2.0). pyaegean fine-tunes it — tagging heads, a biaffine dependency parser,
and an edit-script lemma head — on the AGDT (CC BY-SA 3.0), Gorman
(CC BY-SA 4.0), and Pedalion (CC BY-SA 4.0) treebanks, with every sentence of
the UD-Perseus dev/test folds and all PROIEL evaluation texts excluded from
training (the leakage manifest is built by agdt_ud_overlap(); the protocol is
documented in
docs/benchmarks.md).
The released bundle is CC BY-SA 4.0, fetched to the user cache and never
bundled, so the wheel stays Apache-2.0.
aegean.greek.evaluate_on_proiel() scores the Greek lemmatizer/tagger against the
PROIEL treebank (Greek New Testament + Herodotus) — a source none of pyaegean's
models trained on — for a neutral, out-of-AGDT generalization number. PROIEL is
CC BY-NC-SA 3.0; it is fetched to the cache for evaluation only, read
locally, and never bundled or re-hosted (NonCommercial + ShareAlike). Cite Haug &
Jøhndal (2008). See Greek NLP → Neutral evaluation.
The Universal Dependencies Ancient Greek treebanks (Perseus + PROIEL, CC BY-NC-SA 3.0) and the CoNLL-2018 evaluator (MPL-2.0) are likewise fetched for
evaluate_on_ud() only, never bundled, never trained on. See NOTICE.
The SigLA paleographical database (Salgarella & Castellan,
https://sigla.phis.me) publishes its dataset and drawings under
CC BY-NC-SA 4.0, and its paper invites use "outside the interface" and notes
copies can be hosted. pyaegean hosts the decoded dataset (the JSON form the paper
describes, reconstructed from the published web-app payload by
scripts/build_sigla_corpus.py) as the sha256-pinned sigla-corpus release
asset — 781 documents with SigLA's own word division (1,376 words) and
commodity ideograms (~1.2 MB), fetched on demand, never bundled (NonCommercial
data stays out of the Apache-2.0 wheel; the NC + ShareAlike obligations pass
through to you). Attribution, citation, source sha256, and generation date are
inside the file's _meta; drawings are not included and remain at
sigla.phis.me. Cite SigLA in academic work.
import aegean
s = aegean.load("sigla")
len(s) # 781
s.provenance.license # 'CC BY-NC-SA 4.0 (as published by SigLA; ...)'
DAMOS, the Database of Mycenaean at Oslo (F. Aurora, https://damos.hf.uio.no),
is the most complete edition of the Mycenaean (Linear B) corpus, published under
CC BY-NC-SA 4.0. pyaegean hosts the transliterations and core metadata (site,
series, chronology, Trismegistos id, scribal hands, find context, object class)
for ~5,900 tablets, decoded from the DAMOS public web API into compact JSON
(scripts/build_damos_corpus.py) as the sha256-pinned damos-corpus release
asset — fetched on demand, never bundled (NonCommercial data stays out of the
Apache-2.0 wheel; the NC + ShareAlike obligations pass through to you).
Attribution, citation, source URL, and generation date are inside the file's
_meta; no imagery is included. This is the openly-licensed full corpus the
bundled Linear B sample stands in for. Cite DAMOS (Aurora 2015) in academic work.
import aegean
d = aegean.load("damos")
len(d) # 5932 documents
d.provenance.source # 'DAMOS — Database of Mycenaean at Oslo (F. Aurora), ...'
The Nestle 1904 Greek NT base text is public domain; its per-token morphology,
lemmas, and Strong's numbers (from biblicalhumanities/Nestle1904) are dedicated to
the public domain under CC0. Because CC0 imposes no restriction, one book is
bundled in the wheel (greek/nt_sample.json, ~38 KB) as an offline sample, and
the full 27-book corpus (260 chapters / ~137,800 tokens) is hosted as the
nt-corpus release asset, fetched on demand by greek.load_nt /
aegean.load("nt"). Koine glossing uses the Dodson Greek Lexicon (J. J.
Dodson; CC0), which is small enough to bundle in the wheel (~712 KB,
greek.use_dodson). Cite the Nestle 1904 edition and Dodson in academic work.
import aegean
from aegean import greek
nt = aegean.load("nt") # full corpus (fetched on first use; cached after)
len(nt) # 260 chapters
greek.use_dodson() # activate the bundled Koine lexicon (offline)
greek.gloss_strongs("G3056") # 'a word, speech, divine utterance, ...' (λόγος)
greek.load_nt(book, ref=...) loads one book or a sub-reference rather than the
whole corpus; its signature is load_nt(book=None, *, ref=None, force=False).
aegean.greek.load_work("tlg0012.tlg001") fetches one work's Greek TEI edition
from Perseus canonical-greekLit or First1KGreek (both CC BY-SA; tried in
that order, or pick with source=) into the cache and returns a standard
Corpus — one Document per book/chapter, verse lines or paragraphs as the
physical lines. Its signature is
load_work(work, *, ref=None, source="auto", edition=None, force=False). The
ref= argument addresses a sub-section instead of the whole work, matching
the work's citation structure:
from aegean import greek
greek.load_work("tlg0012.tlg001", ref="1") # Iliad book 1
greek.load_work("tlg0012.tlg001", ref="1.1-1.50") # book 1, lines 1–50
greek.load_work("tlg0016.tlg001", ref="1.2") # Herodotus book 1, chapter 2
These three are real work/ref ids, but each first call hits the network and parses TEI, so run them when you actually need the text, not as a smoke test.
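The three ref shapes used above (a single level, a dotted point, and a dash-separated range) follow a simple grammar. The sketch below parses them into tuples purely to make that grammar concrete. `parse_ref` is a hypothetical helper, not how load_work handles refs internally, and it assumes every component is a plain integer, which will not hold for works cited by non-numeric sections.

```python
from typing import Optional, Tuple

Citation = Tuple[int, ...]


def parse_ref(ref: str) -> Tuple[Citation, Optional[Citation]]:
    """Split 'a', 'a.b' or 'a.b-c.d' into (start, end); end is None for a single point.

    Hypothetical helper; assumes purely numeric citation components.
    """
    start_text, dash, end_text = ref.partition("-")
    start = tuple(int(part) for part in start_text.split("."))
    end = tuple(int(part) for part in end_text.split(".")) if dash else None
    return start, end


print(parse_ref("1"))         # book 1 as a whole
print(parse_ref("1.1-1.50"))  # a range: book 1, lines 1 to 50
print(parse_ref("1.2"))       # book 1, chapter 2
```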
To discover which work ids exist without any download, use the bundled
discovery index greek.catalog() (the works_catalogue.json listed above) —
metadata only (id, author, English + Greek title, source), so it works offline
and instantly. It covers every work with a Greek (-grc) edition in the two
upstream repos at the pinned commits — 1,778 works (768 Perseus + 1,010
First1KGreek). The texts themselves are still fetched on demand; only the index
is bundled.
from aegean import greek
len(greek.catalog()) # 1778 (all works)
len(greek.catalog(source="perseus")) # 768
greek.catalog(author="plato")[1]
# {'id': 'tlg0059.tlg002', 'author': 'Plato', 'title': 'Apology',
# 'greek_title': 'Ἀπολογία Σωκράτους', 'source': 'perseus'}
aegean greek catalog --author homer --source perseus --limit 2 --json
# [{"id": "tlg0012.tlg001", "author": "Homer", "title": "Iliad",
# "greek_title": "Ἰλιάς", "source": "perseus"}, ...]
Coverage is exactly what the open repos hold at the pinned commit, so some
authors are genuinely absent upstream (e.g. Sappho, tlg0009) and thus absent
here — that is honest, not a gap in pyaegean. The curated greek.popular_works()
(25 well-known works) is the small hand-picked counterpart. See
Greek Works & Books.
Editorial <note> and <bibl> are excluded from the running text but kept in
Document.meta.notes (and they survive the JSON round-trip). The download is
pinned to an upstream commit (recorded as Provenance.data_version, e.g.
PerseusDL/canonical-greekLit@d4fab69a2c26), so a loaded work is reproducible;
override the ref with PYAEGEAN_GREEKLIT_REF / PYAEGEAN_FIRST1K_REF. Nothing is
re-hosted; cite the Perseus Digital Library / Open Greek and Latin and the
underlying edition (each file's TEI header names it).
workbench-app is the prebuilt Linear A Research Workbench static web app (~3 MB
tar.gz, Apache-2.0 build; the embedded Linear A data is GORILA-derived). It is
fetched and unpacked on demand and served locally by aegean workbench.
linearb-corpus is a bring-your-own slot with no default source — it
exists so you can point pyaegean at a local licensed Linear B export (e.g. a DAMOS
EpiDoc download, or a LiBER selection) without a code change. DAMOS itself is now
loadable directly via aegean.load("damos"); LiBER is © CNR Edizioni, all rights
reserved, and is neither bundled nor fetched. Set
PYAEGEAN_LINEARB_CORPUS_URL (or PYAEGEAN_LINEARB_CORPUS) to your own copy;
calling fetch("linearb-corpus") with nothing set raises a clear error.
Every data operation has a CLI mirror (pip install "pyaegean[cli]"). The
subcommands:
| Command | What it does | Flags |
|---|---|---|
| aegean data list | List the fetchable datasets (name, size note, license) | --json (machine-readable on stdout), -h/--help |
| aegean data fetch NAME | Fetch a dataset into the cache (sha256-verified); idempotent when cached | --force (re-download even if cached), -h/--help |
| aegean data versions | The reproducibility manifest: every dataset's version + sha256 | --json (machine-readable on stdout), -h/--help |
| aegean data cache | Show the cache location and its current contents | --json (machine-readable on stdout), -h/--help |
aegean data cache
# cache:
# C:\Users\you\.cache\pyaegean (override with PYAEGEAN_CACHE)
# ┌───────────────────────────┬───────┐
# │ entry │ MB │
# ├───────────────────────────┼───────┤
# │ damos-corpus │ 3.1 │
# │ nt-corpus │ 15.8 │
# │ ... │ ... │
# └───────────────────────────┴───────┘
aegean data fetch nt-corpus # downloads + verifies; a no-op if already cached
aegean data fetch lineara-images --force # re-download even if cached
Three environment variables control where data lives and where it comes from.
| Variable | Effect |
|---|---|
| PYAEGEAN_CACHE | The cache root. Falls back to XDG_CACHE_HOME, then ~/.cache. The package always writes under <base>/pyaegean. |
| PYAEGEAN_<NAME>_URL | Override one dataset's download URL with your own mirror/licensed copy. Uppercase the name and turn - into _. When set, the pinned sha256 is not enforced (it describes the pinned URL only). |
| PYAEGEAN_GREEKLIT_REF / PYAEGEAN_FIRST1K_REF | Override the upstream commit load_work pins to. |
The PYAEGEAN_<NAME>_URL pattern is mechanical — here is the exact name for each
dataset (verified):
| Dataset | Override variable |
|---|---|
| lineara-images | PYAEGEAN_LINEARA_IMAGES_URL |
| agdt-derived | PYAEGEAN_AGDT_DERIVED_URL |
| lsj-index | PYAEGEAN_LSJ_INDEX_URL |
| grc-lemma-neural | PYAEGEAN_GRC_LEMMA_NEURAL_URL |
| grc-joint | PYAEGEAN_GRC_JOINT_URL |
| sigla-corpus | PYAEGEAN_SIGLA_CORPUS_URL |
| damos-corpus | PYAEGEAN_DAMOS_CORPUS_URL |
| nt-corpus | PYAEGEAN_NT_CORPUS_URL |
| workbench-app | PYAEGEAN_WORKBENCH_APP_URL |
| linearb-corpus | PYAEGEAN_LINEARB_CORPUS_URL |
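The naming rule behind the table (uppercase the dataset name, turn `-` into `_`, wrap it in `PYAEGEAN_..._URL`) and the rule that an override switches off the pinned checksum can be written down in a few lines. The function names `override_var` and `resolve_source` are hypothetical, chosen for this sketch, and do not claim to mirror pyaegean's internals.

```python
import os
from typing import Mapping, Optional, Tuple


def override_var(name: str) -> str:
    """Environment variable that overrides one dataset's URL (hypothetical helper)."""
    return "PYAEGEAN_" + name.upper().replace("-", "_") + "_URL"


def resolve_source(name: str, pinned_url: Optional[str],
                   env: Mapping[str, str] = os.environ) -> Tuple[Optional[str], bool]:
    """Return (url, enforce_pinned_sha256) for a dataset.

    An override wins over the pinned URL, and the pinned sha256 is then
    not enforced because it only describes the pinned asset.
    """
    override = env.get(override_var(name))
    if override:
        return override, False
    return pinned_url, pinned_url is not None


print(override_var("grc-lemma-neural"))  # PYAEGEAN_GRC_LEMMA_NEURAL_URL
```

Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.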
# point a dataset at your own mirror (sha256 not enforced against an override)
export PYAEGEAN_LINEARA_IMAGES_URL="https://example.org/lineara-images.tar.gz"
# keep all cached data on a big external drive
export PYAEGEAN_CACHE="/mnt/data/pyaegean-cache"
Find the cache from code or the CLI at any time:
from aegean import data
data.cache_dir() # e.g. WindowsPath('C:/Users/you/.cache/pyaegean')
Every dataset pyaegean can touch is versioned and hashable. data.versions()
returns a reproducibility manifest with three keys — package, bundled,
fetched:
from aegean import data
v = data.versions()
v["package"] # '0.8.5' (your installed version)
v["bundled"]["lineara/inscriptions.json"] # {'sha256': '4705b2b2…', 'bytes': 720766}
v["fetched"]["nt-corpus"]
# {'url': 'https://github.com/ryanpavlicek/pyaegean/releases/download/nt-corpus-v1/nt-corpus.json',
# 'sha256': 'e7aa5dcad729eb91f77018abbef71304d13e200f29dabe1260b79fa37b153949',
# 'license': "CC0-1.0 (morphology, lemmas, Strong's); base Greek text public domain",
# 'cached': True}
Each bundled entry is a JSON file hashed straight from the installed wheel; each
fetched entry carries the pinned URL, the pinned sha256, the license, and whether
it is present in your local cache. Bundled data ships inside the wheel, so its
version is the package version (also stamped on every bundled corpus as
Provenance.data_version); fetched assets are sha256-pinned release files,
verified on download.
To pin an analysis for a paper: record aegean.__version__ and dump the
manifest alongside your results — matching sha256s mean byte-identical data.
aegean data versions --json > data-versions.json
import json, aegean
from aegean import data
with open("data-versions.json", "w", encoding="utf-8") as f:
    json.dump({"package": aegean.__version__, "data": data.versions()}, f, indent=2)
The human-readable aegean data versions (no --json) prints the same content as
a table — package, every bundled/... file with its sha256 and byte size, then
every fetched/... asset with its sha256 and cached / not cached / (unpinned)
status.
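Once a manifest has been recorded next to a paper's results, checking a later environment against it comes down to comparing sha256 values entry by entry. The sketch below assumes two dicts in the shape shown above (keys `package`, `bundled`, `fetched`, each entry carrying a `sha256`). `manifest_drift` is a hypothetical helper, not a pyaegean function.

```python
from typing import List


def manifest_drift(recorded: dict, current: dict) -> List[str]:
    """List every entry whose sha256 differs, or which exists on one side only."""
    drift = []
    if recorded.get("package") != current.get("package"):
        drift.append("package")
    for section in ("bundled", "fetched"):
        old = recorded.get(section, {})
        new = current.get(section, {})
        for name in sorted(set(old) | set(new)):
            old_hash = old.get(name, {}).get("sha256")
            new_hash = new.get(name, {}).get("sha256")
            if old_hash != new_hash:
                drift.append(f"{section}/{name}")
    return drift
```

An empty list means the package version and every hash match. If the manifest was saved with the Python snippet above, pass its `"data"` value rather than the whole file.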
Every Corpus carries a Provenance that stamps exports and gives a citation:
import aegean
corpus = aegean.load("lineara")
corpus.provenance.source
# 'GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz'
corpus.provenance.license
# 'Apache-2.0 (corpus JSON); facsimile imagery © École Française d'Athènes, not redistributed'
corpus.provenance.cite()
# 'Godart, L. & Olivier, J.-P. (1976–1985). Recueil des inscriptions en linéaire A. — https://github.com/mwenge/lineara.xyz'
corpus.provenance.data_version
# '0.8.5'
corpus.to_dict()["_meta"]
# tool, schemaVersion, scriptId, documentCount, source, license, citation
A note on the Linear A corpus: the bundled transcription is normalized, and
the apparatus the upstream data does carry is interpreted on load — its
erased-sign marks become ReadingStatus.LOST (552 tokens) and damaged or
bracketed-uncertain readings become UNCLEAR (120 tokens, across 91 documents);
the two statuses together touch 366 documents.
The full Leiden apparatus (restorations, dotted readings) was dropped by the
upstream digitization and remains absent; for edition-grade readings consult
GORILA and SigLA. aegean.ReadingStatus round-trips through JSON and
EpiDoc (<unclear>/<supplied>/<gap>), so bring-your-own corpora keep their
apparatus through a load/export cycle.
If your text is in a plain .txt file, a folder of text files, or a CSV, import it
in one step — no Python required — and the result works with every corpus command:
aegean import myplato.txt -o myplato.json # then: aegean stats myplato.json
aegean import poems/ -o corpus.db --split line # a folder, one doc per line
aegean import rows.csv -o corpus.json --text-col line --id-col id
--split controls how a text file becomes documents: whole (default, one document,
line breaks preserved), paragraph (one per blank-line block), or line (one per line).
Greek/Koine text (--script greek, the default) is run through the Greek word tokenizer;
other scripts split on whitespace. The same paths exist in Python:
from aegean import io
io.from_text_file("myplato.txt") # → Corpus
io.from_text("ἐν ἀρχῇ ἦν ὁ λόγος", doc_id="john")
io.from_text_dir("poems/", split="line")
io.from_csv("rows.csv", text_col="line", id_col="id", meta_cols=["period"])
read_corpus / the CORPUS argument deliberately load only pyaegean's own .json/.db formats. A .txt/.csv is imported into that format first (the error message says so); after aegean import … -o corpus.json, corpus.json is a first-class corpus everywhere.
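The three --split modes described above can be pictured with a small standalone function. This is only a sketch of the stated behavior: `split_documents` is a hypothetical name, and skipping blank lines in `line` mode is an assumption the page does not confirm.

```python
import re
from typing import List


def split_documents(text: str, mode: str = "whole") -> List[str]:
    """Turn one text into document strings according to a split mode."""
    if mode == "whole":
        # One document, line breaks preserved.
        return [text]
    if mode == "paragraph":
        # One document per block separated by blank lines.
        blocks = re.split(r"\n\s*\n", text)
        return [block.strip("\n") for block in blocks if block.strip()]
    if mode == "line":
        # One document per line (assumption: blank lines are skipped).
        return [line for line in text.splitlines() if line.strip()]
    raise ValueError(f"unknown split mode: {mode!r}")


sample = "μῆνιν ἄειδε θεά\nΠηληϊάδεω Ἀχιλῆος\n\nἄνδρα μοι ἔννεπε\n"
print(len(split_documents(sample, "paragraph")))  # 2
print(len(split_documents(sample, "line")))       # 3
```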
For full control — explicit token kinds, editorial status, variant readings — build from dict records:
import aegean
corpus = aegean.Corpus.from_records([
{"id": "X1", "text": "KU-RO 10", "meta": {"site": "My site"}},
{"id": "X2", "lines": [["A-DU", {"text": "5", "status": "unclear"}]]},
], script_id="myfind",
provenance=aegean.Provenance(source="My dig notebook", citation="Me (2026)."))
Tokens may be plain strings (kinds inferred: parseable numerals vs words,
hyphenated tokens get their signs split) or dicts carrying kind, status
(editorial certainty), and alt (variant readings). Make it loadable by name with
aegean.core.corpus.register_loader("myfind", lambda: corpus); for EpiDoc
sources, the bring-your-own reader (see Linear B) covers the same model
including <unclear>/<supplied> status and <app>/<rdg> variants.
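To make the inference rule for plain-string tokens concrete, here is a rough sketch: a token that parses as an integer is treated as a numeral, and anything else as a word whose signs come from splitting on hyphens. The function name `infer_token`, the kind labels, and the output dict layout are all assumptions made for illustration; pyaegean's actual token model may differ.

```python
def infer_token(text: str) -> dict:
    """Guess a token's kind from its surface form (hypothetical sketch)."""
    try:
        value = int(text)
    except ValueError:
        # Not a parseable numeral: treat it as a word and split its signs.
        return {"text": text, "kind": "word", "signs": text.split("-")}
    return {"text": text, "kind": "numeral", "value": value}


# The record text "KU-RO 10" from the example above, split on whitespace.
for token in "KU-RO 10".split():
    print(infer_token(token))
```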
Token.alt carries alternate readings alongside the editorial status. The
EpiDoc writer emits them as a critical apparatus —
<app><lem><w>PO-ME</w></lem><rdg><w>PO-MA</w></rdg></app> (validated against the
official EpiDoc schema) — and the reader folds them back to one token with its
alt tuple, so variants survive the EpiDoc and JSON round-trips.
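The apparatus shape quoted above is small enough to build and read back with the standard library, which shows why a variant reading can survive a round-trip. The sketch uses `xml.etree.ElementTree` and omits the TEI namespace that real EpiDoc files carry. `write_app` and `read_app` are hypothetical names, not pyaegean's writer or reader.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def write_app(reading: str, alternates: Tuple[str, ...]) -> str:
    """Serialize a preferred reading plus its variants as <app><lem>/<rdg>."""
    app = ET.Element("app")
    lem = ET.SubElement(app, "lem")
    ET.SubElement(lem, "w").text = reading
    for alternate in alternates:
        rdg = ET.SubElement(app, "rdg")
        ET.SubElement(rdg, "w").text = alternate
    return ET.tostring(app, encoding="unicode")


def read_app(xml_text: str) -> Tuple[str, Tuple[str, ...]]:
    """Fold an <app> element back into (reading, alternates)."""
    app = ET.fromstring(xml_text)
    reading = app.findtext("lem/w")
    alternates = tuple(w.text for w in app.findall("rdg/w"))
    return reading, alternates


xml_text = write_app("PO-ME", ("PO-MA",))
print(xml_text)
print(read_app(xml_text))
```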
- Code: Apache-2.0.
- Linear A corpus JSON: GORILA via mwenge/lineara.xyz (Apache-2.0).
- Linear A facsimile imagery (lineara-images): © École Française d'Athènes and other rightsholders; referenced, not redistributed.
- Aegean sign data (Linear B / Cypriot / Cypro-Minoan, bundled): Unicode Character Database, Unicode License v3 (retain the notice).
- Greek sample corpus: public-domain ancient texts (seed only).
- Greek treebank lexicon + models (opt-in, agdt-derived): Perseus AGDT v2.1, CC BY-SA 3.0; fetched and built/used in the user cache, never bundled. The source treebank is never re-hosted; the derived artifacts are project-hosted under the same ShareAlike terms.
- Greek lexicon / LSJ (opt-in, lsj-index): Perseus Liddell-Scott-Jones, CC BY-SA 4.0; fetched and indexed in the user cache, never bundled. The source TEI is never re-hosted; the derived index is project-hosted under the same ShareAlike terms.
- Greek neural lemmatizer (opt-in [neural], grc-lemma-neural): a GreTa seq2seq (Apache-2.0 base) fine-tuned on the AGDT (CC BY-SA 3.0), Pedalion (CC BY-SA 4.0), and Gorman (CC BY-SA 4.0) treebanks. The model (int8 ONNX weights plus a derived gold lemma lookup) is CC BY-SA 4.0, fetched to the user cache (~232 MB), never bundled; the wheel stays Apache-2.0.
- Greek neural joint pipeline (opt-in [neural], grc-joint): a GreBerta-based joint model (Apache-2.0 base) fine-tuned on the AGDT (CC BY-SA 3.0), Gorman (CC BY-SA 4.0), and Pedalion (CC BY-SA 4.0) treebanks, evaluation folds excluded from training. The model bundle is CC BY-SA 4.0, fetched to the user cache (~518 MB), never bundled; the wheel stays Apache-2.0.
- PROIEL / UD evaluation sets (opt-in): PROIEL treebank and the UD Ancient Greek treebanks, CC BY-NC-SA 3.0; fetched to the user cache for evaluation only, never bundled, never trained on (NonCommercial + ShareAlike).
- SigLA corpus (sigla-corpus): Salgarella & Castellan, CC BY-NC-SA 4.0; fetched, never bundled; NC + ShareAlike pass through to you.
- DAMOS corpus (damos-corpus): F. Aurora, CC BY-NC-SA 4.0; fetched, never bundled; NC + ShareAlike pass through to you.
- Greek New Testament (nt-corpus) + Dodson lexicon (bundled): Nestle 1904 base text public domain; morphology/lemmas/Strong's and the Dodson glosses are CC0, so one NT book and the Dodson lexicon are bundled and the full NT corpus may be redistributed.
- Linear A Workbench app (workbench-app): Apache-2.0 build; embedded data is GORILA-derived.
- Linear B bring-your-own (linearb-corpus): no default source; DAMOS is CC BY-NC-SA 4.0 and LiBER is all-rights-reserved (© CNR Edizioni); neither redistributed.
See the repository NOTICE and CITATION.cff for full attribution.
- The full Linear A apparatus is not bundled. Restorations and dotted readings were dropped upstream; only LOST/UNCLEAR survive. For edition-grade readings, go to GORILA and SigLA.
- NonCommercial data is NonCommercial for you too. DAMOS, SigLA, PROIEL, and the UD treebanks carry CC BY-NC-SA obligations that pass through: you may not use them commercially, and you must share alike.
- Override URLs skip checksum verification. Setting PYAEGEAN_<NAME>_URL means you vouch for the bytes; the pinned sha256 only describes the pinned release asset.
- Imagery is fetched, never re-hosted. lineara-images copyright is a patchwork; check each file's imageRights before reuse.
- First neural/treebank/LSJ/work call needs the network; everything is cached afterward and runs offline.
See Limitations for the project-wide caveats, and Greek NLP, Linear A, Linear B, Analysis, and Cypriot for the features that consume this data.