Data Model Data Sources
Glintstone federates cuneiform research data from multiple open-access projects. This document describes each source, what it provides, how it is accessed, and its licensing terms.
Every record imported into Glintstone carries an annotation_run_id linking it to the source that produced it. This provenance chain is the backbone of the trust infrastructure -- see data-quality.md for details.
The largest single source of cuneiform metadata. CDLI provides the universal artifact identifier (P-number) that serves as Glintstone's primary join key.
License: CC0 (public domain)
- File: `cdli_cat.csv` (bulk export from cdli.earth)
- Access: Manual download (no live API for bulk catalog)
- Records: 353,283 artifacts with 64 metadata fields
- Key fields: P-number, designation, museum number, excavation number, period, provenience, genre, language, material, object type, dimensions, primary publication, collection, dates referenced
- Update frequency: Last bulk dump August 2022. CDLI's data pipeline has been frozen upstream.
- Null rates (from live database audit):
  - museum_no: 9.2%, excavation_no: 68%, width/height: 82%, thickness: 91%
  - period: 1.8%, provenience: 3.1%, genre: 1.4%, language: 5.2%
- Schema target: `artifacts` table (primary), `artifact_identifiers` (museum_no, excavation_no, primary_publication seeded as separate rows)
- File: `cdliatf_unblocked.atf` (86 MB)
- Access: Bulk download
- Records: 135,200 transliterated texts (~3.5M lines)
- Structure: Line-oriented notation with `@surface` markers, `#tr.XX:` translation markers, `>>Q` composite references
- Adds ~33k stub records: P-numbers present in ATF but absent from catalog CSV
- Schema targets: `text_lines` (line-level decomposition), `surfaces` (from @markers), `translations` (from #tr lines), `composites` and `artifact_composites` (from >>Q markers)
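The line-oriented structure described above can be sketched as a small parser. This is a minimal illustration, not Glintstone's actual ATF parser: the function and class names are hypothetical, the `&P...` header and `1.` style line labels follow common ATF conventions rather than anything stated in this document, and every `@` line is treated as a surface marker (a simplification, since ATF also uses `@` for objects and columns).

```python
import re
from dataclasses import dataclass, field

# Patterns are assumptions based on common ATF conventions.
TEXT_HEADER = re.compile(r"^&(P\d+)")                   # &P000001 = designation
TRANSLATION = re.compile(r"^#tr\.([A-Za-z]+):\s*(.*)$")  # #tr.en: text
COMPOSITE = re.compile(r"^>>(Q\d+)")                     # >>Q000002 001
TEXT_LINE = re.compile(r"^(\S+?)\.\s+(.*)$")             # 1. a-na ...


@dataclass
class ParsedText:
    p_number: str
    surfaces: list = field(default_factory=list)
    lines: list = field(default_factory=list)         # (surface, label, text)
    translations: list = field(default_factory=list)  # (label, lang, text)
    composites: list = field(default_factory=list)    # (label, q_number)


def parse_atf(source: str) -> list:
    """Split ATF source into per-text records keyed by P-number."""
    texts = []
    current = None
    surface = None
    label = None
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = TEXT_HEADER.match(line)
        if header:
            current = ParsedText(p_number=header.group(1))
            texts.append(current)
            surface = label = None
            continue
        if current is None:
            continue  # ignore anything before the first text header
        if line.startswith("@"):
            surface = line[1:]
            current.surfaces.append(surface)
            continue
        tr = TRANSLATION.match(line)
        if tr:
            # A translation is attached to the most recent text line.
            current.translations.append((label, tr.group(1), tr.group(2)))
            continue
        comp = COMPOSITE.match(line)
        if comp:
            current.composites.append((label, comp.group(1)))
            continue
        if line.startswith("#") or line.startswith("$"):
            continue  # other comment or state lines are skipped here
        text = TEXT_LINE.match(line)
        if text:
            label = text.group(1)
            current.lines.append((surface, label, text.group(2)))
    return texts
```

Each tuple list corresponds loosely to one schema target: `lines` to `text_lines`, `surfaces` to `surfaces`, `translations` to `translations`, and `composites` to `artifact_composites`.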
- Extracted from: Inline `#tr.XX:` markers in ATF file
- Records: 5,599 translations across 9 languages (en, de, ts, it, fr, es, dk, ca, fa)
- Schema target: `translations` table
- Access: On-demand fetch from cdli.earth/dl, cached locally
- Format: JPEG photographs and line drawings
- Coverage: Varies by collection; many tablets have no published images
- IIIF: Not yet implemented. CDLI has partial IIIF support; integration is planned.
- Access: REST API (cdli.earth), no authentication, CC0
- Records: 16,725 publications, ~390k artifact-publication links
- Format: Structured bibliographic data with BibTeX keys
- Schema targets: `publications`, `publication_authors`, `artifact_editions`
- Import order: API data imported first (confidence 1.0), CSV fallback for gaps
- Extracted from: `cdli_cat.csv` freetext fields
- Provides: ~20k citation strings, ~4k collation records, ~7k fragment join references
- Schema targets: `scholarly_annotations`, `fragment_joins`
- Confidence: Lower than API data (0.3-0.7 depending on regex parse quality)
The richest source of linguistic annotation for cuneiform texts. ORACC is organized as independent scholarly projects, each covering a specific corpus.
License: CC BY-SA 3.0 (per project; some may vary)
Glintstone integrates ~115 ORACC projects across 6 connectors. All project data is downloaded as zip archives and processed by the ingestion framework. The full project list lives in each connector's ORACC_PROJECTS constant; representative projects:
| Family | Projects | Corpus |
|---|---|---|
| dcclt + subprojects | dcclt, dcclt/ebla, dcclt/jena, dcclt/nineveh, dcclt/signlists | Cuneiform lexical texts and school texts |
| epsd2 | epsd2 | Electronic Pennsylvania Sumerian Dictionary |
| saao + subprojects | saao, saa01–saa21, saas2, aebp, knpp | State Archives of Assyria (Neo-Assyrian letters, treaties) |
| rinap + subprojects | rinap, rinap1–rinap5 | Royal Inscriptions of the Neo-Assyrian Period |
| riao / ribo + subprojects | riao, ribo, ribo/babylon2–10 | Royal inscriptions of Assyria and Babylonia |
| etcsri | etcsri | Sumerian Royal Inscriptions |
| blms | blms | Babylonian Literary and Mythological Sources |
| cams + subprojects | cams, cams/gkab, cams/anzu, cams/barutu, cams/etana, cams/ludlul, cams/selbi, cams/akno, cams/ntlab | Corpus of Ancient Mesopotamian Scholarship |
| atae + subprojects | atae/assur, atae/kalhu, atae/nineveh, atae/durszarrukin, + 12 others | Neo-Assyrian archival texts by site |
| adsd + subprojects | adsd, adsd/adart1–adart6 | Astronomical Diaries and Sundry Diaries |
| cmawro + subprojects | cmawro, cmawro/cmawr1–3, cmawro/maqlu | Anti-witchcraft rituals |
| asbp + subprojects | asbp, asbp/ninmed, asbp/rlasb | Ashurbanipal Library Project |
| Other top-level | akklove, ario, armep, babcity, balt, borsippa, btmao, btto, ckst, ctij, dsst, ecut, edlex, eisl, glass, hbtin, dccmt, lacost, nere, nimrud, obel, obmc, obta, oimea, pnao, suhu, tcma, tsae, urap, aemw/amarna, … | Various specialized corpora |
- Download: `https://build-oracc.museum.upenn.edu/json/<slug>.zip` for most projects, where subproject slugs use hyphens (`cams/gkab` → `cams-gkab.zip`). A small number of projects use `https://oracc.museum.upenn.edu/<project>/json.zip` instead (the build server returns 500 for these; both URLs are tried by `scripts/download-oracc.sh`).
- Contents per project:
  - `catalogue.json`: project-specific artifact metadata with Pleiades geographic IDs
  - `corpusjson/P*.json`: per-text CDL (Chunk-Delimiter-Lemma) trees with sign-level annotation
  - `gloss-{lang}.json`: dictionary entries, variant forms, senses
  - `cat.geojson`: archaeological site coordinates (most projects)
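The two-URL fallback described above can be sketched in Python. This is an illustration only, not the contents of `scripts/download-oracc.sh`; the function names and the injectable `fetch` parameter are hypothetical, while the URL patterns are the ones given in this section.

```python
import urllib.error
import urllib.request

BUILD_BASE = "https://build-oracc.museum.upenn.edu/json"
PUBLIC_BASE = "https://oracc.museum.upenn.edu"


def candidate_urls(project: str) -> list:
    """Return download URLs for a project, in the order they should be tried."""
    project = project.strip("/")
    slug = project.replace("/", "-")  # cams/gkab -> cams-gkab
    return [
        f"{BUILD_BASE}/{slug}.zip",
        f"{PUBLIC_BASE}/{project}/json.zip",
    ]


def fetch_project_zip(project: str, fetch=None):
    """Return (url, payload) from the first candidate URL that responds.

    `fetch` is a hypothetical hook so the logic can be exercised without
    network access; by default it reads the URL with urllib.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=60) as resp:
                return resp.read()
    last_error = None
    for url in candidate_urls(project):
        try:
            return url, fetch(url)
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # e.g. HTTP 500 from the build server
    raise RuntimeError(f"no archive found for {project}") from last_error
```

Passing a stub as `fetch` keeps the fallback order testable offline; in normal use the argument is simply omitted.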
- Records: 416,911 lemmatization rows (as of 2026-05-19)
- Coverage: ~2% of artifacts have ORACC linguistic annotation; coverage is deep where it exists (lemma + norm + morph per token)
- Structure: CDL trees decompose each text into nested nodes: chunk > lemma > grapheme. Position within a line is encoded in hex in the `ref` field.
- Language codes: 14 language codes, including sux, akk, akk-x-stdbab, akk-x-oldbab, akk-x-neoass, akk-x-neobab, akk-x-ltebab, qpn, sux-x-emesal, and others
- Schema targets: `lemmatizations` (token-level), `lexical_norms` (normalized forms), `lexical_norm_forms`
- Dead-letter note: ~2.4M ORACC lemma tokens could not be matched to existing `text_lines` or `tokens` because those texts have not been imported through the ATF parser. These are tracked in `dead_letters` and can be replayed once ATF coverage expands.
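A recursive walk is one way to flatten the nested CDL structure into token-level rows. The sketch below assumes the layout commonly seen in ORACC `corpusjson` files (a `cdl` list of nodes, with `node` set to `"c"`, `"d"`, or `"l"`, and lemma features under `f`); those key names are assumptions here, the function name is hypothetical, and the `ref` values in the sample are invented placeholders rather than real positions.

```python
def iter_lemmas(nodes):
    """Yield (ref, features) for every lemma node in a CDL node list."""
    for node in nodes:
        if node.get("node") == "l":
            yield node.get("ref"), node.get("f", {})
        # Chunk nodes nest further nodes under their own "cdl" key.
        yield from iter_lemmas(node.get("cdl", []))


# Invented sample shaped like a per-text CDL tree.
sample_text = {
    "cdl": [
        {
            "node": "c",
            "cdl": [
                {"node": "d", "type": "line-start"},
                {"node": "l", "ref": "P000001.1.1",
                 "f": {"lang": "akk", "form": "a-na", "cf": "ana"}},
                {
                    "node": "c",
                    "cdl": [
                        {"node": "l", "ref": "P000001.1.2",
                         "f": {"lang": "akk", "form": "be-li2-ia", "cf": "bēlu"}},
                    ],
                },
            ],
        }
    ]
}

rows = [
    {"ref": ref, "lang": f.get("lang"), "form": f.get("form"), "cf": f.get("cf")}
    for ref, f in iter_lemmas(sample_text["cdl"])
]
```

Rows whose `ref` cannot be matched to an imported text line would be the ones routed to `dead_letters` rather than dropped.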
- Glossary entries: 220,349 across 103 ORACC projects (`glossary_entries` table)
- Glossary forms: 1,469,845 written/orthographic variants (`glossary_forms` table)
- Lexical lemmas: 346,480 normalized dictionary entries across 14 languages (`lexical_lemmas` table)
- Lexical norms: 834,235 normalized forms; 381,460 norm-form spellings (`lexical_norms`, `lexical_norm_forms`)
- Lexical senses: 860,940 meaning distinctions (`lexical_senses` table)
- Sources: gloss-sux.json (Sumerian), gloss-akk.json (Akkadian), gloss-akk-x-stdbab.json (Standard Babylonian), gloss-akk-x-oldbab.json (Old Babylonian), gloss-qpn.json (proper nouns), and language-specific variants per project
- Schema targets: `glossary_entries`, `glossary_forms`, `lexical_lemmas`, `lexical_senses`, `lexical_norms`, `lexical_norm_forms`
- Records: 35,609 per-text editorial credit rows (`artifact_credits` table) across 61 ORACC projects
- Schema target: `artifact_credits` (p_number, oracc_project, credits_text)
The canonical cuneiform sign inventory used as Glintstone's primary sign identification system.
License: Part of ORACC (CC BY-SA 3.0)
Access: ogsl-sl.json from ORACC zip
- Signs: 3,367 entries with Unicode codepoints and all known readings
- Sign values: ~15,000 reading values with sub-indices
- GDL definitions: JSON structural definitions for compound signs (e.g., |A.AN|)
- Sign types: simple, compound (|X.Y|), modified (@g gunu, @t tenu, @s sheshig)
- Schema targets: `signs`, `sign_values`, `sign_variants`
- Concordance role: OGSL sign_id is the canonical identifier. MZL and ABZ numbers are mapped via Unicode codepoints as a bridge. See data-quality.md for concordance gap details.
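The Unicode bridge amounts to joining two lookup tables on codepoint. A minimal sketch follows, assuming each table is available as a plain dictionary; the function name is hypothetical, and the sample numbers and sign ids are illustrative placeholders, not values taken from Glintstone's tables.

```python
def resolve_via_unicode(mzl_to_codepoint, codepoint_to_ogsl):
    """Map MZL numbers to OGSL sign ids using Unicode codepoints as the join key.

    Returns (resolved, unresolved): a dict of MZL number -> OGSL sign id,
    and a sorted list of MZL numbers with no usable bridge.
    """
    resolved = {}
    unresolved = []
    for mzl, codepoint in sorted(mzl_to_codepoint.items()):
        sign_id = codepoint_to_ogsl.get(codepoint) if codepoint else None
        if sign_id is None:
            unresolved.append(mzl)  # left for manual concordance work
        else:
            resolved[mzl] = sign_id
    return resolved, unresolved


# Illustrative placeholder data only.
mzl_to_codepoint = {839: "U+12000", 10: "U+1202D", 9999: None}
codepoint_to_ogsl = {"U+12000": "A", "U+1202D": "AN"}

resolved, unresolved = resolve_via_unicode(mzl_to_codepoint, codepoint_to_ogsl)
```

Signs without a Unicode codepoint, or with a codepoint that OGSL does not list, fall through to `unresolved`, which is where concordance gaps would surface.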
Sign bounding-box annotations for machine learning training.
License: MIT
Access: GitHub sub-repo (CompVis/cuneiform-sign-detection-dataset)
- Tablets: 81 Neo-Assyrian tablets
- Annotations: 8,109 sign bounding boxes (11,070 including metadata rows)
- Labels: MZL integer numbers (Borger's Mesopotamisches Zeichenlexikon)
- Damage codes: 0=background, 1=intact, 2=broken/uncertain
- Coordinates: Pixel-absolute, converted to percentage-based for resolution independence
- Schema target: `sign_annotations`
- Concordance requirement: MZL labels must be resolved to OGSL sign_ids. ~200-400 remain unresolved after auto-matching via Unicode bridge.
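The pixel-to-percentage conversion mentioned above is simple arithmetic. The sketch below assumes boxes arrive as corner coordinates `(x1, y1, x2, y2)` in pixels and that the image dimensions are known; the function name and the output keys are hypothetical, not the `sign_annotations` column names.

```python
def bbox_to_percent(x1, y1, x2, y2, image_width, image_height):
    """Convert a pixel-absolute box to percentages of the image size."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    return {
        "x": 100.0 * x1 / image_width,
        "y": 100.0 * y1 / image_height,
        "width": 100.0 * (x2 - x1) / image_width,
        "height": 100.0 * (y2 - y1) / image_height,
    }


# A 200x100 px box on a 1000x500 px image.
box = bbox_to_percent(100, 50, 300, 150, image_width=1000, image_height=500)
```

Because the result is relative to the image size, the same annotation can be drawn on a thumbnail or a full-resolution scan without rescaling.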
Fragment-level scholarly editions with OCR training data and bibliographic citations.
License: Research use (varies by component)
Access: GitHub sub-repos + REST API
- Source: cuneiform-ocr-data (GitHub)
- Content: Annotated tablet images for DETR sign detection model training
- Classes: 173 sign classes
- Schema target: `sign_annotations` (future)
- Files: ebl.txt, mzl.txt
- Content: eBL internal names to ABZ numbers to Unicode
- Role: Bridge data for MZL/ABZ to OGSL concordance mapping
- Access: REST API (ebl.lmu.de), may require authentication token
- Format: CSL-JSON (Citation Style Language)
- Content: Fragment-level citations
-
- Schema target: `publications`, `artifact_editions`
- Attribution requirement: "eBL - Electronic Babylonian Literature" must appear when displaying eBL-sourced data
Comprehensive Sumerian dictionary, hosted as an ORACC project (epsd2). Integrated via the dedicated epsd2.py connector (lexical schema) and also covered by the ORACC family of connectors (glossaries, norms).
License: Part of ORACC (CC BY-SA 3.0)
- Lexical lemmas: 15,940 from source `epsd2` (Sumerian headwords with full sense/norm data)
- Schema targets: `lexical_lemmas`, `lexical_senses`, `lexical_norms`, `lexical_norm_forms`
These sources backfill bibliographic metadata and scholar identification. See citation-pipeline-summary.md for the full pipeline.
- Access: REST API, no auth, CC0
- Provides: DOI backfill for publications (15-30% coverage), ORCID identification for scholars (10-20%)
- Schema targets: Enriches `publications.doi` and `scholars.orcid`
- Access: REST API, rate-limited
- Provides: Citation graphs for publications with DOIs
- Schema target: `entity_relationships` (partially)
- Access: No API. Manual BibTeX export required (contact Tubingen)
- Records: ~90,000 bibliography entries
- Schema target: `publications`
- Note: Largest single bibliography source, but acquisition requires manual effort
- Who's Who in Cuneiform Studies: ~500-1000 scholar records (web scrape)
- Wikipedia Assyriologists category: ~200 scholars (web scrape)
- Schema target: `scholars`
- Status: Digitization in progress at ISAC (Institute for the Study of Ancient Cultures)
- Content: 21 volumes covering the complete Akkadian lexicon (A-Z)
- License: Public domain (copyright expired)
- Priority: Highest-value single Akkadian source (comprehensive, scholarly consensus)
- Planned approach: Await structured digitization format from ISAC; PDF extraction as fallback
- Schema targets: `glossary_entries`, `glossary_senses`, `interpretations` (competing scholarly readings)
- Status: Partial digitization available
- Content: Hittite lexicon (incomplete coverage)
- License: Check ISAC terms
- Priority: Primary Hittite dictionary source
- Planned approach: Import from digitized volumes when available
- Schema targets: `glossary_entries`, `glossary_forms`, `glossary_senses`
- Status: Marquette University project, format TBD
- Content: Morphological decomposition data for Hittite
- License: Research use (confirm)
- Priority: Core for Hittite morphological analysis
- Schema targets: `morphology`, `lemmatizations`
- Status: Elamite administrative texts, transliterations available
- Content: Elamite text corpus with transliterations
- License: Research use (ISAC)
- Priority: Primary Elamite corpus (Elamite is thinnest language layer)
- Planned approach: Import transliterations, manual curation for lexicon
- Schema targets: `text_lines`, `glossary_entries` (Elamite lemmas)
- Status: Connector wired; zip download returns empty archive from build server. Data not yet loaded.
- Content: XML/HTML transliterations of Sumerian literary texts with lemmatization
- License: CC BY-SA 3.0 (ORACC)
- Priority: Complements ePSD2 for Sumerian literary corpus
- Schema targets: `text_lines`, `lemmatizations`, `translations`
- Status: Model available, import pathway designed but not yet run
- Content: Automated POS-tagging and lemmatization in CoNLL-U format
- Schema targets: `lemmatizations`, `morphology` (via annotation_runs with source_type='model')
- See: ml-integration.md for full details
When sources disagree on the same field:
- Identity metadata (period, provenience, genre, language): CDLI is authoritative. ORACC enriches with subgenre, supergenre, geographic coordinates, and project membership.
- Linguistic annotation: ORACC human annotations are stored at the highest confidence. BabyLemmatizer output is stored as a competing interpretation with lower confidence. Multiple analyses coexist via `is_consensus` flags.
- Bibliographic data: CDLI API data is imported first (confidence 1.0). CSV-derived data fills gaps at lower confidence. Cross-source deduplication uses a cascading match: DOI exact (1.0) > bibtex_key (0.95) > title+year (0.8) > short_title+volume (0.9). Matches below 0.7 are staged for manual review.
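The cascading match for bibliographic deduplication can be sketched as an ordered list of rules where the first rule that fires decides the confidence. This is a hedged illustration, not the pipeline's code: the function names, the record field names, and the normalization step are assumptions, and the rule order simply follows the order listed above. How scores below 0.7 arise is not described here, so the threshold is shown as a separate check.

```python
# (rule name, fields that must all match, confidence), tried in order.
CASCADE = [
    ("doi", ("doi",), 1.0),
    ("bibtex_key", ("bibtex_key",), 0.95),
    ("title+year", ("title", "year"), 0.8),
    ("short_title+volume", ("short_title", "volume"), 0.9),
]
REVIEW_THRESHOLD = 0.7


def _norm(value):
    """Normalize a field for comparison; empty values never match."""
    if value in (None, ""):
        return None
    return str(value).strip().lower()


def match_confidence(a: dict, b: dict):
    """Return (rule, confidence) for the first rule that matches, else None."""
    for name, fields, confidence in CASCADE:
        keys_a = [_norm(a.get(f)) for f in fields]
        keys_b = [_norm(b.get(f)) for f in fields]
        if None not in keys_a and keys_a == keys_b:
            return name, confidence
    return None


def disposition(confidence: float) -> str:
    """Decide what to do with a match at the given confidence."""
    return "merge" if confidence >= REVIEW_THRESHOLD else "manual_review"
```

For example, two records sharing a DOI match at 1.0 even if their titles differ, while records with no DOI or BibTeX key fall through to the title-based rules.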
For field-level source mappings, see source-mapping.yaml.
Source: github.com/wittkensis/glintstone