

wittkensis edited this page May 19, 2026 · 2 revisions

Data Sources

Glintstone federates cuneiform research data from multiple open-access projects. This document describes each source, what it provides, how it is accessed, and its licensing terms.

Every record imported into Glintstone carries an annotation_run_id linking it to the source that produced it. This provenance chain is the backbone of the trust infrastructure; see data-quality.md for details.


CDLI (Cuneiform Digital Library Initiative)

The largest single source of cuneiform metadata. CDLI provides the universal artifact identifier (P-number) that serves as Glintstone's primary join key.

License: CC0 (public domain)

Catalog

  • File: cdli_cat.csv (bulk export from cdli.earth)
  • Access: Manual download (no live API for bulk catalog)
  • Records: 353,283 artifacts with 64 metadata fields
  • Key fields: P-number, designation, museum number, excavation number, period, provenience, genre, language, material, object type, dimensions, primary publication, collection, dates referenced
  • Update frequency: Last bulk dump August 2022. CDLI's data pipeline has been frozen upstream.
  • Null rates (from live database audit):
    • museum_no: 9.2%, excavation_no: 68%, width/height: 82%, thickness: 91%
    • period: 1.8%, provenience: 3.1%, genre: 1.4%, language: 5.2%
  • Schema target: artifacts table (primary), artifact_identifiers (museum_no, excavation_no, primary_publication seeded as separate rows)

ATF (ASCII Transliteration Format)

  • File: cdliatf_unblocked.atf (86 MB)
  • Access: Bulk download
  • Records: 135,200 transliterated texts (~3.5M lines)
  • Structure: Line-oriented notation with @surface markers, #tr.XX: translation markers, >>Q composite references
  • Adds ~33k stub records: P-numbers present in ATF but absent from catalog CSV
  • Schema targets: text_lines (line-level decomposition), surfaces (from @markers), translations (from #tr lines), composites and artifact_composites (from >>Q markers)
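The line-oriented structure described above can be sketched as a small classifier. This is a minimal illustration, not Glintstone's ATF parser: the `ParsedText` container, the `SURFACES` set, and the regular expressions are assumptions made for the example, and real ATF has many more line types (state lines, rulings, nested objects) than are handled here.

```python
import re
from dataclasses import dataclass, field

# Assumed subset of surface names; real ATF allows more (and flags such as "?").
SURFACES = {"obverse", "reverse", "left", "right", "top", "bottom", "edge"}

HEADER = re.compile(r"^&(P\d+)")                          # &P123456 = designation
TRANSLATION = re.compile(r"^#tr\.([A-Za-z]+):\s*(.*)$")   # #tr.en: ...
COMPOSITE = re.compile(r"^>>\s*(Q\d+)")                   # >>Q000001 ...
TEXT_LINE = re.compile(r"^(\S+?)\.\s+(.*)$")              # 1. a-na ...


@dataclass
class ParsedText:
    """Hypothetical container; field names are illustrative only."""
    p_number: str
    lines: list = field(default_factory=list)         # (surface, label, text)
    translations: list = field(default_factory=list)  # (label, lang, text)
    composites: list = field(default_factory=list)    # (label, q_number)


def parse_atf(source: str) -> list:
    texts, current, surface, label = [], None, None, None
    for raw in source.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # A new &P header starts a new text and resets the context.
            current = ParsedText(header.group(1))
            texts.append(current)
            surface = label = None
        elif current is None or not line:
            continue
        elif line.startswith("@"):
            words = line[1:].split()
            if words and words[0] in SURFACES:
                surface = words[0]
        elif (m := TRANSLATION.match(line)):
            # Translations attach to the most recent text line.
            current.translations.append((label, m.group(1), m.group(2)))
        elif (m := COMPOSITE.match(line)):
            current.composites.append((label, m.group(1)))
        elif line[0] not in "#$" and (m := TEXT_LINE.match(line)):
            label = m.group(1)
            current.lines.append((surface, label, m.group(2)))
    return texts
```

Keeping the most recent surface and line label as running state is what lets each `#tr.XX:` and `>>Q` line attach to the text line it follows.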

Translations

  • Extracted from: Inline #tr.XX: markers in ATF file
  • Records: 5,599 translations across 9 languages (en, de, ts, it, fr, es, dk, ca, fa)
  • Schema target: translations table

Images

  • Access: On-demand fetch from cdli.earth/dl, cached locally
  • Format: JPEG photographs and line drawings
  • Coverage: Varies by collection; many tablets have no published images
  • IIIF: Not yet implemented. CDLI has partial IIIF support; integration is planned.

Publications API

  • Access: REST API (cdli.earth), no authentication, CC0
  • Records: 16,725 publications, ~390k artifact-publication links
  • Format: Structured bibliographic data with BibTeX keys
  • Schema targets: publications, publication_authors, artifact_editions
  • Import order: API data imported first (confidence 1.0), CSV fallback for gaps

CSV Supplementary Data

  • Extracted from: cdli_cat.csv freetext fields
  • Provides: ~20k citation strings, ~4k collation records, ~7k fragment join references
  • Schema targets: scholarly_annotations, fragment_joins
  • Confidence: Lower than API data (0.3-0.7 depending on regex parse quality)

ORACC (Open Richly Annotated Cuneiform Corpus)

The richest source of linguistic annotation for cuneiform texts. ORACC is organized as independent scholarly projects, each covering a specific corpus.

License: CC BY-SA 3.0 (per project; some may vary)

Projects integrated

Glintstone integrates ~115 ORACC projects across 6 connectors. All project data is downloaded as zip archives and processed by the ingestion framework. The full project list lives in each connector's ORACC_PROJECTS constant; representative projects:

| Family | Projects | Corpus |
| --- | --- | --- |
| dcclt + subprojects | dcclt, dcclt/ebla, dcclt/jena, dcclt/nineveh, dcclt/signlists | Cuneiform lexical texts and school texts |
| epsd2 | epsd2 | Electronic Pennsylvania Sumerian Dictionary |
| saao + subprojects | saao, saa01–saa21, saas2, aebp, knpp | State Archives of Assyria (Neo-Assyrian letters, treaties) |
| rinap + subprojects | rinap, rinap1–rinap5 | Royal Inscriptions of the Neo-Assyrian Period |
| riao / ribo + subprojects | riao, ribo, ribo/babylon2–10 | Royal inscriptions of Assyria and Babylonia |
| etcsri | etcsri | Sumerian Royal Inscriptions |
| blms | blms | Babylonian Literary and Mythological Sources |
| cams + subprojects | cams, cams/gkab, cams/anzu, cams/barutu, cams/etana, cams/ludlul, cams/selbi, cams/akno, cams/ntlab | Corpus of Ancient Mesopotamian Scholarship |
| atae + subprojects | atae/assur, atae/kalhu, atae/nineveh, atae/durszarrukin, + 12 others | Neo-Assyrian archival texts by site |
| adsd + subprojects | adsd, adsd/adart1–adart6 | Astronomical Diaries and Sundry Diaries |
| cmawro + subprojects | cmawro, cmawro/cmawr1–3, cmawro/maqlu | Anti-witchcraft rituals |
| asbp + subprojects | asbp, asbp/ninmed, asbp/rlasb | Ashurbanipal Library Project |
| Other top-level | akklove, ario, armep, babcity, balt, borsippa, btmao, btto, ckst, ctij, dsst, ecut, edlex, eisl, glass, hbtin, dccmt, lacost, nere, nimrud, obel, obmc, obta, oimea, pnao, suhu, tcma, tsae, urap, aemw/amarna, … | Various specialized corpora |

Access method

  • Download: https://build-oracc.museum.upenn.edu/json/<slug>.zip for most projects, where subproject slugs use hyphens (cams/gkab → cams-gkab.zip). A small number of projects use https://oracc.museum.upenn.edu/<project>/json.zip instead (the build server returns 500 for these; both URLs are tried by scripts/download-oracc.sh).
  • Contents per project:
    • catalogue.json — project-specific artifact metadata with Pleiades geographic IDs
    • corpusjson/P*.json — per-text CDL (Chunk-Delimiter-Lemma) trees with sign-level annotation
    • gloss-{lang}.json — dictionary entries, variant forms, senses
    • cat.geojson — archaeological site coordinates (most projects)
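The two-URL download strategy can be sketched in Python as follows. This is an illustration of the pattern only, not a port of scripts/download-oracc.sh: the function names are hypothetical, and keeping the slash in the fallback path for subprojects is an assumption the text does not confirm.

```python
import urllib.request

BUILD_BASE = "https://build-oracc.museum.upenn.edu/json"
PUBLIC_BASE = "https://oracc.museum.upenn.edu"


def candidate_urls(project: str) -> list:
    """Return the build-server URL first, then the fallback URL."""
    path = project.strip("/")
    slug = path.replace("/", "-")  # cams/gkab -> cams-gkab
    return [
        f"{BUILD_BASE}/{slug}.zip",
        # Assumption: the fallback keeps the project path unchanged.
        f"{PUBLIC_BASE}/{path}/json.zip",
    ]


def fetch_project_zip(project: str, timeout: float = 60.0) -> bytes:
    """Try each candidate URL in order and return the first archive found."""
    last_error = None
    for url in candidate_urls(project):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except OSError as exc:  # covers URLError, HTTPError (e.g. 500), timeouts
            last_error = exc
    raise RuntimeError(f"no archive available for {project}") from last_error
```

For example, `candidate_urls("cams/gkab")` yields the build-server URL ending in `cams-gkab.zip` followed by the fallback ending in `cams/gkab/json.zip`.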

Lemmatization

  • Records: 416,911 lemmatization rows (as of 2026-05-19)
  • Coverage: ~2% of artifacts have ORACC linguistic annotation; coverage is deep where it exists (lemma + norm + morph per token)
  • Structure: CDL trees decompose each text into nested nodes: chunk > lemma > grapheme. Position within a line is encoded in hex in the ref field.
  • Language codes: 14 language codes — sux, akk, akk-x-stdbab, akk-x-oldbab, akk-x-neoass, akk-x-neobab, akk-x-ltebab, qpn, sux-x-emesal, and others
  • Schema targets: lemmatizations (token-level), lexical_norms (normalized forms), lexical_norm_forms
  • Dead-letter note: ~2.4M ORACC lemma tokens could not be matched to existing text_lines or tokens because those texts have not been imported through the ATF parser. These are tracked in dead_letters and can be replayed once ATF coverage expands.
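A recursive walk is one way to flatten a CDL tree into token-level rows. The sketch below assumes each node is a dict with a `node` key (`"l"` for lemma nodes), children nested under `cdl`, and lemma fields under `f`; these key names should be checked against real corpusjson files, and the output keys are illustrative rather than the lemmatizations schema. The hex position inside `ref` is passed through undecoded.

```python
from typing import Any, Iterator


def iter_lemmas(node: dict) -> Iterator[dict]:
    """Yield one flat row per lemma node found anywhere under `node`."""
    if node.get("node") == "l":
        f: dict = node.get("f", {})
        yield {
            "ref": node.get("ref"),   # text.line.word reference, left as-is
            "lang": f.get("lang"),
            "form": f.get("form"),
            "cf": f.get("cf"),        # citation form
            "gw": f.get("gw"),        # guide word
            "pos": f.get("pos"),
            "norm": f.get("norm"),
            "morph": f.get("morph"),
        }
    # Chunks (and the document root) nest their children under "cdl".
    for child in node.get("cdl", []):
        yield from iter_lemmas(child)
```

Because the function is a generator, a caller can stream rows into a batch insert without holding a whole project in memory.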

Glossaries

  • Glossary entries: 220,349 across 103 ORACC projects (glossary_entries table)
  • Glossary forms: 1,469,845 written/orthographic variants (glossary_forms table)
  • Lexical lemmas: 346,480 normalized dictionary entries across 14 languages (lexical_lemmas table)
  • Lexical norms: 834,235 normalized forms; 381,460 norm-form spellings (lexical_norms, lexical_norm_forms)
  • Lexical senses: 860,940 meaning distinctions (lexical_senses table)
  • Sources: gloss-sux.json (Sumerian), gloss-akk.json (Akkadian), gloss-akk-x-stdbab.json (Standard Babylonian), gloss-akk-x-oldbab.json (Old Babylonian), gloss-qpn.json (proper nouns), and language-specific variants per project
  • Schema targets: glossary_entries, glossary_forms, lexical_lemmas, lexical_senses, lexical_norms, lexical_norm_forms

Credits

  • Records: 35,609 per-text editorial credit rows (artifact_credits table) across 61 ORACC projects
  • Schema target: artifact_credits (p_number, oracc_project, credits_text)

OGSL (ORACC Global Sign List)

The canonical cuneiform sign inventory used as Glintstone's primary sign identification system.

License: Part of ORACC (CC BY-SA 3.0)
Access: ogsl-sl.json from ORACC zip

  • Signs: 3,367 entries with Unicode codepoints and all known readings
  • Sign values: ~15,000 reading values with sub-indices
  • GDL definitions: JSON structural definitions for compound signs (e.g., |A.AN|)
  • Sign types: simple, compound (|X.Y|), modified (@g gunu, @t tenu, @s sheshig)
  • Schema targets: signs, sign_values, sign_variants
  • Concordance role: OGSL sign_id is the canonical identifier. MZL and ABZ numbers are mapped via Unicode codepoints as a bridge. See data-quality.md for concordance gap details.
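The three sign types listed above can be told apart from the sign name alone. The following is a rough sketch under that assumption; `classify_sign` is a hypothetical helper, only the three modifiers named in the list are expanded, and OGSL defines further modifiers and operators that are not covered.

```python
import re

# Only the modifiers named in the list above; others pass through as their letter.
MODIFIERS = {"g": "gunu", "t": "tenu", "s": "sheshig"}
MODIFIER_RE = re.compile(r"@([a-z])")


def classify_sign(name: str) -> tuple:
    """Return (sign_type, modifiers) for an OGSL-style sign name."""
    # Compound signs are wrapped in pipes, e.g. |A.AN|.
    if len(name) > 2 and name.startswith("|") and name.endswith("|"):
        return "compound", []
    modifiers = [MODIFIERS.get(code, code) for code in MODIFIER_RE.findall(name)]
    if modifiers:
        return "modified", modifiers
    return "simple", []
```

The compound check runs first so that a compound containing a modified component is still reported as compound.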

CompVis (Cuneiform Sign Detection Dataset)

Sign bounding-box annotations for machine learning training.

License: MIT
Access: GitHub sub-repo (CompVis/cuneiform-sign-detection-dataset)

  • Tablets: 81 Neo-Assyrian tablets
  • Annotations: 8,109 sign bounding boxes (11,070 including metadata rows)
  • Labels: MZL integer numbers (Borger's Mesopotamisches Zeichenlexikon)
  • Damage codes: 0=background, 1=intact, 2=broken/uncertain
  • Coordinates: Pixel-absolute, converted to percentage-based for resolution independence
  • Schema target: sign_annotations
  • Concordance requirement: MZL labels must be resolved to OGSL sign_ids. ~200-400 remain unresolved after auto-matching via Unicode bridge.
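The pixel-to-percentage conversion is simple arithmetic. The sketch below assumes boxes arrive as corner coordinates `(x1, y1, x2, y2)` in pixels together with the image size; the dataset's actual column layout is not described here, and `PercentBox` is a hypothetical type rather than the sign_annotations schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PercentBox:
    x: float       # left edge, percent of image width
    y: float       # top edge, percent of image height
    width: float   # percent of image width
    height: float  # percent of image height


def to_percent(x1: float, y1: float, x2: float, y2: float,
               image_width: int, image_height: int) -> PercentBox:
    """Convert a pixel-absolute corner box into percentage-based coordinates."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    return PercentBox(
        x=100.0 * x1 / image_width,
        y=100.0 * y1 / image_height,
        width=100.0 * (x2 - x1) / image_width,
        height=100.0 * (y2 - y1) / image_height,
    )
```

For a 1000 x 500 image, the box `(100, 50, 300, 150)` becomes `x=10, y=10, width=20, height=20`, which stays valid if the image is later served at a different resolution.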

eBL (Electronic Babylonian Literature)

Fragment-level scholarly editions with OCR training data and bibliographic citations.

License: Research use (varies by component)
Access: GitHub sub-repos + REST API

OCR Training Data

  • Source: cuneiform-ocr-data (GitHub)
  • Content: Annotated tablet images for DETR sign detection model training
  • Classes: 173 sign classes
  • Schema target: sign_annotations (future)

Sign Concordance

  • Files: ebl.txt, mzl.txt
  • Content: eBL internal names to ABZ numbers to Unicode
  • Role: Bridge data for MZL/ABZ to OGSL concordance mapping

Bibliography API

  • Access: REST API (ebl.lmu.de), may require authentication token
  • Format: CSL-JSON (Citation Style Language)
  • Content: Fragment-level citations
  • Schema target: publications, artifact_editions
  • Attribution requirement: "eBL - Electronic Babylonian Literature" must appear when displaying eBL-sourced data
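As a rough illustration of how a CSL-JSON entry could be flattened before loading, the sketch below reads the standard CSL-JSON keys (`id`, `title`, `author`, `issued.date-parts`, `DOI`). The output keys and the `csl_to_publication` name are hypothetical and do not reflect the actual publications columns; the attribution string is the one required above.

```python
EBL_ATTRIBUTION = "eBL - Electronic Babylonian Literature"


def csl_to_publication(entry: dict) -> dict:
    """Flatten one CSL-JSON item into a simple publication record."""
    # CSL-JSON stores dates as {"date-parts": [[year, month, day]]}.
    date_parts = entry.get("issued", {}).get("date-parts") or [[]]
    year = date_parts[0][0] if date_parts[0] else None

    authors = []
    for person in entry.get("author", []):
        name = " ".join(p for p in (person.get("given"), person.get("family")) if p)
        # Institutional authors use "literal" instead of given/family.
        authors.append(name or person.get("literal", ""))

    return {
        "source_id": entry.get("id"),
        "title": entry.get("title"),
        "year": int(year) if year is not None else None,  # years may be strings
        "doi": entry.get("DOI"),
        "authors": authors,
        "attribution": EBL_ATTRIBUTION,
    }
```

Carrying the attribution string on every record makes it harder for a display layer to drop it by accident.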

ePSD2 (electronic Pennsylvania Sumerian Dictionary)

Comprehensive Sumerian dictionary, hosted as an ORACC project (epsd2). Integrated via the dedicated epsd2.py connector (lexical schema) and also covered by the ORACC family of connectors (glossaries, norms).

License: Part of ORACC (CC BY-SA 3.0)

  • Lexical lemmas: 15,940 from source epsd2 (Sumerian headwords with full sense/norm data)
  • Schema targets: lexical_lemmas, lexical_senses, lexical_norms, lexical_norm_forms

Citation Enrichment Sources

These sources backfill bibliographic metadata and scholar identification. See citation-pipeline-summary.md for the full pipeline.

OpenAlex

  • Access: REST API, no auth, CC0
  • Provides: DOI backfill for publications (15-30% coverage), ORCID identification for scholars (10-20%)
  • Schema targets: Enriches publications.doi and scholars.orcid

Semantic Scholar

  • Access: REST API, rate-limited
  • Provides: Citation graphs for publications with DOIs
  • Schema target: entity_relationships (partially)

KeiBi (Keilschriftbibliographie)

  • Access: No API. Manual BibTeX export required (contact Tübingen)
  • Records: ~90,000 bibliography entries
  • Schema target: publications
  • Note: Largest single bibliography source, but acquisition requires manual effort

Scholar Directories

  • Who's Who in Cuneiform Studies: ~500-1000 scholar records (web scrape)
  • Wikipedia Assyriologists category: ~200 scholars (web scrape)
  • Schema target: scholars

Not Yet Integrated

CAD (Chicago Assyrian Dictionary)

  • Status: Digitization in progress at ISAC (Institute for the Study of Ancient Cultures)
  • Content: 21 volumes covering the complete Akkadian lexicon (A-Z)
  • License: Public domain (copyright expired)
  • Priority: Highest-value single Akkadian source (comprehensive, scholarly consensus)
  • Planned approach: Await structured digitization format from ISAC; PDF extraction as fallback
  • Schema targets: glossary_entries, glossary_senses, interpretations (competing scholarly readings)

CHD (Chicago Hittite Dictionary)

  • Status: Partial digitization available
  • Content: Hittite lexicon (incomplete coverage)
  • License: Check ISAC terms
  • Priority: Primary Hittite dictionary source
  • Planned approach: Import from digitized volumes when available
  • Schema targets: glossary_entries, glossary_forms, glossary_senses

HPM (Hittite Parsed Morphology)

  • Status: Marquette University project, format TBD
  • Content: Morphological decomposition data for Hittite
  • License: Research use (confirm)
  • Priority: Core for Hittite morphological analysis
  • Schema targets: morphology, lemmatizations

Persepolis Fortification Archive

  • Status: Elamite administrative texts, transliterations available
  • Content: Elamite text corpus with transliterations
  • License: Research use (ISAC)
  • Priority: Primary Elamite corpus (Elamite is thinnest language layer)
  • Planned approach: Import transliterations, manual curation for lexicon
  • Schema targets: text_lines, glossary_entries (Elamite lemmas)

ETCSL (Electronic Text Corpus of Sumerian Literature)

  • Status: Connector wired; zip download returns empty archive from build server. Data not yet loaded.
  • Content: XML/HTML transliterations of Sumerian literary texts with lemmatization
  • License: CC BY-SA 3.0 (ORACC)
  • Priority: Complements ePSD2 for Sumerian literary corpus
  • Schema targets: text_lines, lemmatizations, translations

BabyLemmatizer Output

  • Status: Model available, import pathway designed but not yet run
  • Content: Automated POS-tagging and lemmatization in CoNLL-U format
  • Schema targets: lemmatizations, morphology (via annotation_runs with source_type='model')
  • See: ml-integration.md for full details
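Since the import pathway has not yet been run, the following is only a sketch of what reading CoNLL-U output might look like. It relies on the standard ten tab-separated CoNLL-U columns; the row keys produced by `to_rows` are hypothetical, apart from `source_type='model'`, which comes from the schema note above.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text: str) -> list:
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, columns))
        # "_" marks an unspecified value everywhere except id and form.
        for key in CONLLU_FIELDS[2:]:
            if token[key] == "_":
                token[key] = None
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def to_rows(sentences: list) -> list:
    """Map parsed tokens to illustrative lemmatization rows."""
    return [
        {"form": t["form"], "lemma": t["lemma"], "pos": t["upos"], "source_type": "model"}
        for sentence in sentences
        for t in sentence
    ]
```

Tagging every row with `source_type` at parse time keeps model output distinguishable from human annotation further down the pipeline.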

Source Priority and Conflict Resolution

When sources disagree on the same field:

  • Identity metadata (period, provenience, genre, language): CDLI is authoritative. ORACC enriches with subgenre, supergenre, geographic coordinates, and project membership.
  • Linguistic annotation: ORACC human annotations carry the highest confidence. BabyLemmatizer output is stored as a competing interpretation with lower confidence. Multiple analyses coexist via is_consensus flags.
  • Bibliographic data: CDLI API data is imported first (confidence 1.0). CSV-derived data fills gaps at lower confidence. Cross-source deduplication uses a cascading match: DOI exact (1.0) > bibtex_key (0.95) > title+year (0.8) > short_title+volume (0.9). Matches scoring below 0.7 are staged for manual review.
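The cascading match for bibliographic deduplication can be sketched as an ordered rule table. The rule order and confidences are taken from the list above; the field names, the case-insensitive normalization, and the helper names are assumptions. The text does not say how scores below 0.7 arise, so `needs_review` only encodes the threshold itself.

```python
from typing import Optional

# (rule name, fields that must all match, confidence), in cascade order.
MATCH_RULES = [
    ("doi", ("doi",), 1.0),
    ("bibtex_key", ("bibtex_key",), 0.95),
    ("title_year", ("title", "year"), 0.8),
    ("short_title_volume", ("short_title", "volume"), 0.9),
]
REVIEW_THRESHOLD = 0.7


def _key(record: dict, fields: tuple) -> Optional[tuple]:
    """Build a normalized comparison key, or None if any field is missing."""
    values = []
    for name in fields:
        value = record.get(name)
        if value is None or value == "":
            return None
        values.append(str(value).strip().casefold())
    return tuple(values)


def match_publication(candidate: dict, existing: list) -> Optional[tuple]:
    """Return (rule, confidence, matched_record) for the first rule that hits."""
    for rule, fields, confidence in MATCH_RULES:
        key = _key(candidate, fields)
        if key is None:
            continue  # candidate lacks these fields; fall through to next rule
        for record in existing:
            if _key(record, fields) == key:
                return rule, confidence, record
    return None


def needs_review(confidence: float) -> bool:
    return confidence < REVIEW_THRESHOLD
```

A candidate that carries a DOI is matched on it before any weaker rule is tried, so a title collision cannot override an exact identifier match.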

For field-level source mappings, see source-mapping.yaml.
