Data Sources

Glintstone federates cuneiform research data from multiple open-access projects. This document describes each source, what it provides, how it is accessed, and its licensing terms.

Every record imported into Glintstone carries an annotation_run_id linking it to the source that produced it. This provenance chain is the backbone of the trust infrastructure; see data-quality.md for details.

CDLI (Cuneiform Digital Library Initiative)

The largest single source of cuneiform metadata. CDLI provides the universal artifact identifier (P-number) that serves as Glintstone's primary join key.

License: CC0 (public domain)

Catalog

File: cdli_cat.csv (bulk export from cdli.earth)
Access: Manual download (no live API for bulk catalog)
Records: 353,283 artifacts with 64 metadata fields
Key fields: P-number, designation, museum number, excavation number, period, provenience, genre, language, material, object type, dimensions, primary publication, collection, dates referenced
Update frequency: Last bulk dump August 2022. CDLI's data pipeline has been frozen upstream.
Null rates (from live database audit):
- museum_no: 9.2%, excavation_no: 68%, width/height: 82%, thickness: 91%
- period: 1.8%, provenience: 3.1%, genre: 1.4%, language: 5.2%
Schema target: artifacts table (primary), artifact_identifiers (museum_no, excavation_no, primary_publication seeded as separate rows)
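
The split between the artifacts table and the seeded artifact_identifiers rows can be sketched roughly as follows. This is a minimal illustration, not the project's importer: the column names (id_text, museum_no, excavation_no, primary_publication), the output dict keys, and the sample row are all assumptions made for the example, and the real cdli_cat.csv headers may differ.

```python
import csv
import io

# Assumed column names; the real cdli_cat.csv headers may differ.
IDENTIFIER_COLUMNS = ("museum_no", "excavation_no", "primary_publication")


def split_catalog_row(row):
    """Split one catalog row into an artifact dict and identifier rows."""
    p_number = row["id_text"].strip()  # assumed name of the P-number column
    artifact = {
        "p_number": p_number,
        "period": (row.get("period") or "").strip() or None,
        "provenience": (row.get("provenience") or "").strip() or None,
    }
    # One identifier row per non-empty identifier column.
    identifiers = [
        {"p_number": p_number, "identifier_type": col, "value": row[col].strip()}
        for col in IDENTIFIER_COLUMNS
        if (row.get(col) or "").strip()
    ]
    return artifact, identifiers


# Invented sample row, for illustration only.
SAMPLE_CSV = io.StringIO(
    "id_text,museum_no,excavation_no,primary_publication,period,provenience\n"
    "P999999,MUSEUM 1,,Example Publication 1,,Example Site\n"
)
rows = [split_catalog_row(row) for row in csv.DictReader(SAMPLE_CSV)]
```

Empty cells become None on the artifact and produce no identifier row, which is one way to keep the high null rates listed above from creating empty identifiers.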

ATF (ASCII Transliteration Format)

File: cdliatf_unblocked.atf (86 MB)
Access: Bulk download
Records: 135,200 transliterated texts (~3.5M lines)
Structure: Line-oriented notation with @surface markers, #tr.XX: translation markers, >>Q composite references
Stub records: ~33k added for P-numbers present in the ATF file but absent from the catalog CSV
Schema targets: text_lines (line-level decomposition), surfaces (from @markers), translations (from #tr lines), composites and artifact_composites (from >>Q markers)
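
The line-oriented structure described above lends itself to a single-pass parser. The following is a simplified sketch under stated assumptions, not Glintstone's ATF parser: the record shapes, the list of surface names, and the sample text are invented for illustration, and real ATF has many more line types than are handled here.

```python
import re

TEXT_HEADER = re.compile(r"^&(P\d+)")              # e.g. "&P999999 = designation"
TRANSLATION = re.compile(r"^#tr\.(\w+):\s*(.*)$")  # e.g. "#tr.en: ..."
COMPOSITE = re.compile(r"^>>(Q\d+)")               # e.g. ">>Q999999 001"
TEXT_LINE = re.compile(r"^(\S+?)\.\s+(.*)$")       # e.g. "1. lugal-e ..."
# Assumed subset of surface markers; extend as needed.
SURFACES = {"obverse", "reverse", "left", "right", "top", "bottom", "edge"}


def parse_atf(text):
    """Return a flat list of records for text lines, translations and composite refs."""
    records = []
    p_number = surface = last_line = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if (m := TEXT_HEADER.match(line)):
            p_number, surface, last_line = m.group(1), None, None
        elif line.startswith("@"):
            parts = line[1:].split()
            if parts and parts[0] in SURFACES:
                surface = parts[0]
        elif (m := TRANSLATION.match(line)):
            records.append({"kind": "translation", "p_number": p_number,
                            "line_no": last_line, "lang": m.group(1),
                            "text": m.group(2)})
        elif (m := COMPOSITE.match(line)):
            records.append({"kind": "composite_ref", "p_number": p_number,
                            "line_no": last_line, "q_number": m.group(1)})
        elif line.startswith(("#", "$")):
            continue  # other comments and state lines are ignored in this sketch
        elif (m := TEXT_LINE.match(line)):
            last_line = m.group(1)
            records.append({"kind": "text_line", "p_number": p_number,
                            "surface": surface, "line_no": last_line,
                            "text": m.group(2)})
    return records


# Invented sample text, for illustration only.
SAMPLE = """\
&P999999 = Hypothetical Text 1
#atf: lang sux
@tablet
@obverse
1. lugal-e e2 mu-du3
#tr.en: The king built the house.
>>Q999999 001
@reverse
1'. [...]
"""
records = parse_atf(SAMPLE)
```

Translations and composite references are attached to the most recent text line, which matches how the inline markers follow the line they annotate.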

Translations

Extracted from: Inline #tr.XX: markers in ATF file
Records: 5,599 translations across 9 languages (en, de, ts, it, fr, es, dk, ca, fa)
Schema target: translations table

Images

Access: On-demand fetch from cdli.earth/dl, cached locally
Format: JPEG photographs and line drawings
Coverage: Varies by collection; many tablets have no published images
IIIF: Not yet implemented. CDLI has partial IIIF support; integration is planned.

Publications API

Access: REST API (cdli.earth), no authentication, CC0
Records: 16,725 publications, ~390k artifact-publication links
Format: Structured bibliographic data with BibTeX keys
Schema targets: publications, publication_authors, artifact_editions
Import order: API data imported first (confidence 1.0), CSV fallback for gaps

CSV Supplementary Data

Extracted from: cdli_cat.csv freetext fields
Provides: ~20k citation strings, ~4k collation records, ~7k fragment join references
Schema targets: scholarly_annotations, fragment_joins
Confidence: Lower than API data (0.3-0.7 depending on regex parse quality)

ORACC (Open Richly Annotated Cuneiform Corpus)

The richest source of linguistic annotation for cuneiform texts. ORACC is organized as independent scholarly projects, each covering a specific corpus.

License: CC BY-SA 3.0 (per project; some may vary)

Projects integrated

Glintstone integrates ~115 ORACC projects across 6 connectors. All project data is downloaded as zip archives and processed by the ingestion framework. The full project list lives in each connector's ORACC_PROJECTS constant; representative projects:

Family	Projects	Corpus
dcclt + subprojects	dcclt, dcclt/ebla, dcclt/jena, dcclt/nineveh, dcclt/signlists	Cuneiform lexical texts and school texts
epsd2	epsd2	Electronic Pennsylvania Sumerian Dictionary
saao + subprojects	saao, saa01–saa21, saas2, aebp, knpp	State Archives of Assyria (Neo-Assyrian letters, treaties)
rinap + subprojects	rinap, rinap1–rinap5	Royal Inscriptions of the Neo-Assyrian Period
riao / ribo + subprojects	riao, ribo, ribo/babylon2–10	Royal inscriptions of Assyria and Babylonia
etcsri	etcsri	Sumerian Royal Inscriptions
blms	blms	Bilinguals in Late Mesopotamian Scholarship
cams + subprojects	cams, cams/gkab, cams/anzu, cams/barutu, cams/etana, cams/ludlul, cams/selbi, cams/akno, cams/ntlab	Corpus of Ancient Mesopotamian Scholarship
atae + subprojects	atae/assur, atae/kalhu, atae/nineveh, atae/durszarrukin, + 12 others	Neo-Assyrian archival texts by site
adsd + subprojects	adsd, adsd/adart1–adart6	Astronomical Diaries Digital
cmawro + subprojects	cmawro, cmawro/cmawr1–3, cmawro/maqlu	Anti-witchcraft rituals
asbp + subprojects	asbp, asbp/ninmed, asbp/rlasb	Ashurbanipal Library Project
Other top-level	akklove, ario, armep, babcity, balt, borsippa, btmao, btto, ckst, ctij, dsst, ecut, edlex, eisl, glass, hbtin, dccmt, lacost, nere, nimrud, obel, obmc, obta, oimea, pnao, suhu, tcma, tsae, urap, aemw/amarna, …	Various specialized corpora

Access method

Download: https://build-oracc.museum.upenn.edu/json/<slug>.zip for most projects, where subproject slugs use hyphens (cams/gkab → cams-gkab.zip). A small number of projects use https://oracc.museum.upenn.edu/<project>/json.zip instead (the build server returns 500 for these; both URLs are tried by scripts/download-oracc.sh).
Contents per project:
- catalogue.json — project-specific artifact metadata with Pleiades geographic IDs
- corpusjson/P*.json — per-text CDL (Chunk-Delimiter-Lemma) trees with sign-level annotation
- gloss-{lang}.json — dictionary entries, variant forms, senses
- cat.geojson — archaeological site coordinates (most projects)
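
The two-URL fallback described above can be sketched in Python. The project's actual logic lives in scripts/download-oracc.sh; the function names here are hypothetical, and the sketch assumes the fallback URL keeps the slash form of the project path.

```python
import urllib.request

BUILD_BASE = "https://build-oracc.museum.upenn.edu/json"
MAIN_BASE = "https://oracc.museum.upenn.edu"


def candidate_zip_urls(project):
    """Return download URLs to try, in order, for an ORACC project path."""
    slug = project.replace("/", "-")  # cams/gkab -> cams-gkab
    return [f"{BUILD_BASE}/{slug}.zip", f"{MAIN_BASE}/{project}/json.zip"]


def _http_get(url):
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def download_project_zip(project, fetch=_http_get):
    """Try each candidate URL and return (url, bytes) for the first success."""
    errors = []
    for url in candidate_zip_urls(project):
        try:
            return url, fetch(url)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all ORACC download URLs failed: " + "; ".join(errors))
```

Passing fetch as a parameter keeps the URL logic testable without network access; a caller would normally rely on the default.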

Lemmatization

Records: 416,911 lemmatization rows (as of 2026-05-19)
Coverage: ~2% of artifacts have ORACC linguistic annotation; coverage is deep where it exists (lemma + norm + morph per token)
Structure: CDL trees decompose each text into nested nodes: chunk > lemma > grapheme. Position within a line is encoded in hex in the ref field.
Language codes: 14 language codes — sux, akk, akk-x-stdbab, akk-x-oldbab, akk-x-neoass, akk-x-neobab, akk-x-ltebab, qpn, sux-x-emesal, and others
Schema targets: lemmatizations (token-level), lexical_norms (normalized forms), lexical_norm_forms
Dead-letter note: ~2.4M ORACC lemma tokens could not be matched to existing text_lines or tokens because those texts have not been imported through the ATF parser. These are tracked in dead_letters and can be replayed once ATF coverage expands.
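
Extracting token-level rows from a CDL tree amounts to a depth-first walk that collects lemma nodes. The sketch below assumes the JSON uses a "node" key ("c", "d", "l"), nested "cdl" arrays, and an "f" object carrying lang, form, cf, gw, pos and norm; these field names and the sample document are assumptions for illustration and should be checked against real corpusjson files. Decoding the position encoded in ref is left out.

```python
def iter_lemmas(node):
    """Yield lemma ('l') nodes from a CDL tree, depth first."""
    if node.get("node") == "l":
        yield node
    for child in node.get("cdl", []):
        yield from iter_lemmas(child)


def lemma_rows(text_json):
    """Flatten one per-text JSON document into lemmatization rows."""
    rows = []
    for lemma in iter_lemmas(text_json):
        f = lemma.get("f", {})
        rows.append({
            "ref": lemma.get("ref"),
            "lang": f.get("lang"),
            "form": f.get("form"),
            "citation_form": f.get("cf"),
            "guide_word": f.get("gw"),
            "pos": f.get("pos"),
            "norm": f.get("norm"),
        })
    return rows


# Invented sample document, for illustration only.
SAMPLE_TEXT = {
    "textid": "P999999",
    "cdl": [{
        "node": "c",
        "cdl": [
            {"node": "d", "type": "line-start"},
            {"node": "l", "ref": "P999999.1.1",
             "f": {"lang": "sux", "form": "lugal", "cf": "lugal",
                   "gw": "king", "pos": "N"}},
            {"node": "c", "cdl": [
                {"node": "l", "ref": "P999999.1.2",
                 "f": {"lang": "sux", "form": "e2", "cf": "e",
                       "gw": "house", "pos": "N"}},
            ]},
        ],
    }],
}
rows = lemma_rows(SAMPLE_TEXT)
```

Rows whose ref cannot be matched to an imported text line would then be candidates for the dead_letters table rather than being dropped.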

Glossaries

Glossary entries: 220,349 across 103 ORACC projects (glossary_entries table)
Glossary forms: 1,469,845 written/orthographic variants (glossary_forms table)
Lexical lemmas: 346,480 normalized dictionary entries across 14 languages (lexical_lemmas table)
Lexical norms: 834,235 normalized forms; 381,460 norm-form spellings (lexical_norms, lexical_norm_forms)
Lexical senses: 860,940 meaning distinctions (lexical_senses table)
Sources: gloss-sux.json (Sumerian), gloss-akk.json (Akkadian), gloss-akk-x-stdbab.json (Standard Babylonian), gloss-akk-x-oldbab.json (Old Babylonian), gloss-qpn.json (proper nouns), and language-specific variants per project
Schema targets: glossary_entries, glossary_forms, lexical_lemmas, lexical_senses, lexical_norms, lexical_norm_forms

Credits

Records: 35,609 per-text editorial credit rows (artifact_credits table) across 61 ORACC projects
Schema target: artifact_credits (p_number, oracc_project, credits_text)

OGSL (ORACC Global Sign List)

The canonical cuneiform sign inventory used as Glintstone's primary sign identification system.

License: Part of ORACC (CC BY-SA 3.0)
Access: ogsl-sl.json from ORACC zip

Signs: 3,367 entries with Unicode codepoints and all known readings
Sign values: ~15,000 reading values with sub-indices
GDL definitions: JSON structural definitions for compound signs (e.g., |A.AN|)
Sign types: simple, compound (|X.Y|), modified (@g gunu, @t tenu, @s sheshig)
Schema targets: signs, sign_values, sign_variants
Concordance role: OGSL sign_id is the canonical identifier. MZL and ABZ numbers are mapped via Unicode codepoints as a bridge. See data-quality.md for concordance gap details.
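
The Unicode bridge can be pictured as a two-step lookup: external number to codepoint, then codepoint to OGSL sign_id. This is a minimal sketch, assuming both mappings are already loaded as plain dicts; the function name, the sign_id strings, and every number in the sample are placeholders and should not be read as real concordance data.

```python
def bridge_via_unicode(external_to_codepoint, codepoint_to_sign_id):
    """Resolve external sign numbers (e.g. MZL or ABZ) to OGSL sign_ids.

    Returns (resolved, unresolved): a dict of matches and a sorted list of
    external numbers with no OGSL sign at the same codepoint.
    """
    resolved, unresolved = {}, []
    for number, codepoint in sorted(external_to_codepoint.items()):
        sign_id = codepoint_to_sign_id.get(codepoint)
        if sign_id is None:
            unresolved.append(number)
        else:
            resolved[number] = sign_id
    return resolved, unresolved


# Placeholder data, for illustration only.
mzl_to_codepoint = {1: 0x12000, 2: 0x1202D, 3: 0x12FFF}
ogsl_by_codepoint = {0x12000: "sign-a", 0x1202D: "sign-an"}
resolved, unresolved = bridge_via_unicode(mzl_to_codepoint, ogsl_by_codepoint)
```

Numbers left in unresolved are the ones that need manual concordance work, for example signs without a Unicode codepoint.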

CompVis (Cuneiform Sign Detection Dataset)

Sign bounding-box annotations for machine learning training.

License: MIT
Access: GitHub sub-repo (CompVis/cuneiform-sign-detection-dataset)

Tablets: 81 Neo-Assyrian tablets
Annotations: 8,109 sign bounding boxes (11,070 including metadata rows)
Labels: MZL integer numbers (Borger's Mesopotamisches Zeichenlexikon)
Damage codes: 0=background, 1=intact, 2=broken/uncertain
Coordinates: Pixel-absolute, converted to percentage-based for resolution independence
Schema target: sign_annotations
Concordance requirement: MZL labels must be resolved to OGSL sign_ids. ~200-400 remain unresolved after auto-matching via Unicode bridge.
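
The conversion from pixel-absolute to percentage-based coordinates noted above is simple arithmetic. The sketch below assumes boxes arrive as (x1, y1, x2, y2) corner coordinates and that the image dimensions are known; the function name and output keys are hypothetical.

```python
def bbox_to_percent(x1, y1, x2, y2, image_width, image_height):
    """Convert a pixel bounding box to percentages of the image size."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    return {
        "x_pct": 100.0 * x1 / image_width,
        "y_pct": 100.0 * y1 / image_height,
        "w_pct": 100.0 * (x2 - x1) / image_width,
        "h_pct": 100.0 * (y2 - y1) / image_height,
    }


box = bbox_to_percent(100, 50, 300, 150, image_width=1000, image_height=500)
```

Storing percentages means the same annotation can be drawn over a thumbnail or a full-resolution image without rescaling.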

eBL (Electronic Babylonian Literature)

Fragment-level scholarly editions with OCR training data and bibliographic citations.

License: Research use (varies by component)
Access: GitHub sub-repos + REST API

OCR Training Data

Source: cuneiform-ocr-data (GitHub)
Content: Annotated tablet images for DETR sign detection model training
Classes: 173 sign classes
Schema target: sign_annotations (future)

Sign Concordance

Files: ebl.txt, mzl.txt
Content: eBL internal names to ABZ numbers to Unicode
Role: Bridge data for MZL/ABZ to OGSL concordance mapping

Bibliography API

Access: REST API (ebl.lmu.de), may require authentication token
Format: CSL-JSON (Citation Style Language)
Content: Fragment-level citations
Schema target: publications, artifact_editions
Attribution requirement: "eBL - Electronic Babylonian Literature" must appear when displaying eBL-sourced data

ePSD2 (electronic Pennsylvania Sumerian Dictionary)

Comprehensive Sumerian dictionary, hosted as an ORACC project (epsd2). Integrated via the dedicated epsd2.py connector (lexical schema) and also covered by the ORACC family of connectors (glossaries, norms).

License: Part of ORACC (CC BY-SA 3.0)

Lexical lemmas: 15,940 from source epsd2 (Sumerian headwords with full sense/norm data)
Schema targets: lexical_lemmas, lexical_senses, lexical_norms, lexical_norm_forms

Citation Enrichment Sources

These sources backfill bibliographic metadata and scholar identification. See citation-pipeline-summary.md for the full pipeline.

OpenAlex

Access: REST API, no auth, CC0
Provides: DOI backfill for publications (15-30% coverage), ORCID identification for scholars (10-20%)
Schema targets: Enriches publications.doi and scholars.orcid

Semantic Scholar

Access: REST API, rate-limited
Provides: Citation graphs for publications with DOIs
Schema target: entity_relationships (partially)

KeiBi (Keilschriftbibliographie)

Access: No API. Manual BibTeX export required (contact Tübingen)
Records: ~90,000 bibliography entries
Schema target: publications
Note: Largest single bibliography source, but acquisition requires manual effort

Scholar Directories

Who's Who in Cuneiform Studies: ~500-1000 scholar records (web scrape)
Wikipedia Assyriologists category: ~200 scholars (web scrape)
Schema target: scholars

Not Yet Integrated

CAD (Chicago Assyrian Dictionary)

Status: Digitization in progress at ISAC (Institute for the Study of Ancient Cultures)
Content: 21 volumes covering the complete Akkadian lexicon (A-Z)
License: Public domain (copyright expired)
Priority: Highest-value single Akkadian source (comprehensive, scholarly consensus)
Planned approach: Await structured digitization format from ISAC; PDF extraction as fallback
Schema targets: glossary_entries, glossary_senses, interpretations (competing scholarly readings)

CHD (Chicago Hittite Dictionary)

Status: Partial digitization available
Content: Hittite lexicon (incomplete coverage)
License: Check ISAC terms
Priority: Primary Hittite dictionary source
Planned approach: Import from digitized volumes when available
Schema targets: glossary_entries, glossary_forms, glossary_senses

HPM (Hittite Parsed Morphology)

Status: Marquette University project, format TBD
Content: Morphological decomposition data for Hittite
License: Research use (confirm)
Priority: Core for Hittite morphological analysis
Schema targets: morphology, lemmatizations

Persepolis Fortification Archive

Status: Elamite administrative texts, transliterations available
Content: Elamite text corpus with transliterations
License: Research use (ISAC)
Priority: Primary Elamite corpus (Elamite is thinnest language layer)
Planned approach: Import transliterations, manual curation for lexicon
Schema targets: text_lines, glossary_entries (Elamite lemmas)

ETCSL (Electronic Text Corpus of Sumerian Literature)

Status: Connector wired; zip download returns empty archive from build server. Data not yet loaded.
Content: XML/HTML transliterations of Sumerian literary texts with lemmatization
License: CC BY-SA 3.0 (ORACC)
Priority: Complements ePSD2 for Sumerian literary corpus
Schema targets: text_lines, lemmatizations, translations

BabyLemmatizer Output

Status: Model available, import pathway designed but not yet run
Content: Automated POS-tagging and lemmatization in CoNLL-U format
Schema targets: lemmatizations, morphology (via annotation_runs with source_type='model')
See: ml-integration.md for full details

Source Priority and Conflict Resolution

When sources disagree on the same field:

Identity metadata (period, provenience, genre, language): CDLI is authoritative. ORACC enriches with subgenre, supergenre, geographic coordinates, and project membership.
Linguistic annotation: ORACC human annotations are stored at the highest confidence. BabyLemmatizer output is stored as a competing interpretation with lower confidence. Multiple analyses coexist via is_consensus flags.
Bibliographic data: CDLI API data is imported first (confidence 1.0). CSV-derived data fills gaps at lower confidence. Cross-source deduplication uses a cascading match: DOI exact (1.0) > bibtex_key (0.95) > title+year (0.8) > short_title+volume (0.9). Matches scoring below 0.7 are staged for manual review.
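
The cascading match for bibliographic deduplication can be sketched as an ordered list of rules, each with its own confidence. This is an illustration under assumptions, not the pipeline's code: the record field names, the normalization (trim and lowercase), and the helper names are hypothetical. The rules are tried in the order the text lists them, with the confidences it states.

```python
REVIEW_THRESHOLD = 0.7


def _key(*fields):
    """Build a key function over the given fields; None if any field is empty."""
    def get(record):
        values = tuple(str(record.get(f) or "").strip().lower() for f in fields)
        return values if all(values) else None
    return get


# (rule name, confidence, key function), in the order given in the text.
MATCH_RULES = [
    ("doi", 1.0, _key("doi")),
    ("bibtex_key", 0.95, _key("bibtex_key")),
    ("title_year", 0.8, _key("title", "year")),
    ("short_title_volume", 0.9, _key("short_title", "volume")),
]


def match_publication(candidate, existing):
    """Return (record, rule name, confidence) for the first rule that matches."""
    for name, confidence, key_fn in MATCH_RULES:
        key = key_fn(candidate)
        if key is None:
            continue
        for record in existing:
            if key_fn(record) == key:
                return record, name, confidence
    return None, None, 0.0


def needs_manual_review(confidence):
    return confidence < REVIEW_THRESHOLD
```

For example, a candidate carrying only a title and year would fall through the DOI and bibtex_key rules and match at 0.8, while a candidate with no match scores 0.0 and would be staged for review in this sketch.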

For field-level source mappings, see source-mapping.yaml.

Source: github.com/wittkensis/glintstone