Data and Provenance
Compact text data ships inside the wheel and works offline:
- Linear A: inscriptions.json, signs.json, phonetic_map.json
- Greek: sample_texts.json, lemmata.json
Large or license-restricted assets are never bundled; they are fetched on demand into a user cache. This keeps the wheel under 3 MB, a limit that CI enforces.
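A size guard of this kind can be sketched in a few lines. This is only an illustration, not the project's actual CI step: the function name `oversized_wheels`, the `dist` layout, and the reading of "3 MB" as 3 * 1024 * 1024 bytes are all assumptions.

```python
from pathlib import Path

# Assumption: "3 MB" is read as 3 * 1024 * 1024 bytes; the real CI may define it differently.
MAX_WHEEL_BYTES = 3 * 1024 * 1024


def oversized_wheels(dist_dir, limit=MAX_WHEEL_BYTES):
    """Return (file name, size in bytes) for every wheel at or above the limit."""
    offenders = []
    for wheel in sorted(Path(dist_dir).glob("*.whl")):
        size = wheel.stat().st_size
        if size >= limit:  # the wheel must stay strictly below the limit
            offenders.append((wheel.name, size))
    return offenders
```

A CI job could call `oversized_wheels("dist")` after building and fail the run if the returned list is not empty.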
from aegean.data import load_bundled_json
load_bundled_json("lineara", "signs.json")

fetch(name) downloads a registered remote dataset into the cache and returns
its path. Downloads are sha256-verified (when a checksum is pinned),
atomic (written to a .part file, then renamed), and idempotent (a
present, valid cache entry is a no-op). Archive datasets (extract=True, e.g.
lineara-images) are unpacked safely into a cache directory (members that
escape the directory are rejected), and fetch() then returns that directory.
from aegean import data
data.cache_dir() # where datasets are cached (override: PYAEGEAN_CACHE)
path = data.fetch("lineara-images")

Errors are explicit and never block import:
- unknown dataset → DataNotAvailableError
- no pinned URL → DataNotAvailableError naming the env override to set
- checksum mismatch → DataNotAvailableError (the bad download is removed)
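The download guarantees described above (checksum verification, an atomic rename from a .part file, idempotence, and removal of a bad download) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not pyaegean's implementation: `fetch_file`, its parameters, and the `DataNotAvailableError` class defined here are hypothetical stand-ins.

```python
import hashlib
import os
import urllib.request
from pathlib import Path


class DataNotAvailableError(RuntimeError):
    """Stand-in for the library's error type (hypothetical definition)."""


def _sha256_of(path):
    """Hash a file in 1 MiB chunks so large downloads are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_file(url, dest, sha256=None):
    """Download url to dest: verified if sha256 is given, atomic, idempotent."""
    dest = Path(dest)
    # Idempotent: a present entry that passes the check (or has no pinned
    # checksum to check against) is reused without touching the network.
    if dest.exists() and (sha256 is None or _sha256_of(dest) == sha256):
        return dest

    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(part, "wb") as out:
        for chunk in iter(lambda: response.read(1 << 20), b""):
            out.write(chunk)

    if sha256 is not None and _sha256_of(part) != sha256:
        part.unlink()  # remove the bad download before reporting the error
        raise DataNotAvailableError(f"checksum mismatch for {url}")

    os.replace(part, dest)  # atomic when both paths are on the same filesystem
    return dest
```

Writing to a sibling .part file means an interrupted download never leaves a half-written file under the final name, so a later call can trust whatever it finds there.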
The ~500 MB facsimile/photo set is fetched (never re-hosted) from a release
on the ryanpavlicek/linearaworkbench repo, where it is already hosted. fetch
downloads the tar.gz and unpacks it into a cache directory of images. Its
copyright is a patchwork — most images are © École Française d'Athènes (the
GORILA volumes), others are held by named scholars, publications, and
photographers (see the corpus's per-image imageRights); that attribution is
unaffected by fetching, and pyaegean does not redistribute the images itself.
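The unpacking step, with the rule that members escaping the target directory are rejected, might look roughly like the sketch below. It is an illustration only: `safe_extract` is a hypothetical name, and refusing symlink and hardlink members is an extra precaution assumed here, not something the text above states.

```python
import tarfile
from pathlib import Path


def safe_extract(archive_path, target_dir):
    """Extract a tar archive, rejecting members that would land outside target_dir."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        # Validate every member before extracting anything.
        for member in members:
            resolved = (target / member.name).resolve()
            if resolved != target and target not in resolved.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            # Assumption: links are refused outright, since a link can point
            # outside the directory even when its own name looks harmless.
            if member.issym() or member.islnk():
                raise ValueError(f"links are not allowed: {member.name}")
        tar.extractall(target, members=members)
    return target
```

Recent Python versions also provide built-in extraction filters for `tarfile` (for example `filter="data"`), which cover similar ground and may be preferable where available.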
Until the release asset's URL+sha256 are pinned, point the fetcher at a copy you are licensed to use with an env override:
export PYAEGEAN_LINEARA_IMAGES_URL="https://example.org/lineara-images.tar"

data.fetch("lineara-images") # downloads from the override, sha-checked if pinned

The override pattern is general: PYAEGEAN_<NAME>_URL (uppercased, - → _)
overrides any dataset's URL.
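The naming rule is mechanical enough to show in code. The sketch below derives the variable name and resolves a URL from it; `override_env_name`, `resolve_url`, and the error class are hypothetical stand-ins, and the precedence (override first, then the pinned URL) is an assumption consistent with the description rather than a confirmed detail.

```python
import os


class DataNotAvailableError(RuntimeError):
    """Stand-in for the library's error type (hypothetical definition)."""


def override_env_name(dataset):
    """Map a dataset name to its override variable: uppercase, '-' becomes '_'."""
    return "PYAEGEAN_" + dataset.upper().replace("-", "_") + "_URL"


def resolve_url(dataset, pinned_url=None, environ=None):
    """Prefer the env override, fall back to the pinned URL, otherwise raise."""
    environ = os.environ if environ is None else environ
    name = override_env_name(dataset)
    url = environ.get(name) or pinned_url
    if not url:
        # The message names the variable so the user knows what to set.
        raise DataNotAvailableError(
            f"no URL pinned for {dataset!r}; set {name} to a copy you may use"
        )
    return url
```

For `"lineara-images"` the derived name is `PYAEGEAN_LINEARA_IMAGES_URL`, matching the export shown above.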
Every Corpus carries a Provenance that stamps exports and gives a citation:
corpus = aegean.load("lineara")
corpus.provenance.source # 'GORILA (Godart & Olivier 1976–1985) via mwenge/lineara.xyz'
corpus.provenance.license
corpus.provenance.cite() # one-line citation for papers/logs
corpus.to_dict()["_meta"] # tool, schemaVersion, scriptId, source, license, citation

- Code: Apache-2.0.
- Linear A corpus JSON: GORILA via mwenge/lineara.xyz (Apache-2.0).
- Linear A facsimile imagery: largely © École Française d'Athènes, with per-image rights recorded in the corpus; referenced, not redistributed.
- Greek sample corpus: public-domain ancient texts (seed only).
See the repository NOTICE and CITATION.cff for full attribution.
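To make the provenance stamp on exports concrete, here is a rough sketch of how a `_meta` block with the six keys listed earlier could be assembled. Everything beyond those key names is an assumption for illustration: the `Provenance` dataclass shown here, the format returned by `cite()`, the `stamp_export` helper, the `records` key, and the default values are hypothetical and may not match pyaegean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical stand-in; the real class may carry more fields."""
    source: str
    license: str

    def cite(self):
        # Assumed format; the library's one-line citation may differ.
        return f"{self.source} ({self.license})"


def stamp_export(records, provenance, script_id, tool="pyaegean", schema_version=1):
    """Wrap records in a dict whose _meta block records where the data came from."""
    return {
        "_meta": {
            "tool": tool,
            "schemaVersion": schema_version,  # assumed default
            "scriptId": script_id,
            "source": provenance.source,
            "license": provenance.license,
            "citation": provenance.cite(),
        },
        "records": list(records),  # hypothetical key for the payload
    }
```

Carrying the citation inside the export means a file that is copied away from the repository still states its source and license.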