Data Model Data Sources
Glintstone federates cuneiform research data from multiple open-access projects. This document describes each source, what it provides, how it is accessed, and its licensing terms.
Every record imported into Glintstone carries an `annotation_run_id` linking it to the source that produced it. This provenance chain is the backbone of the trust infrastructure; see data-quality.md for details.
CDLI is the largest single source of cuneiform metadata. It provides the universal artifact identifier (P-number) that serves as Glintstone's primary join key.
License: CC0 (public domain)
- File: `cdli_cat.csv` (bulk export from cdli.earth)
- Access: Manual download (no live API for bulk catalog)
- Records: 353,283 artifacts with 64 metadata fields
- Key fields: P-number, designation, museum number, excavation number, period, provenience, genre, language, material, object type, dimensions, primary publication, collection, dates referenced
- Update frequency: Last bulk dump August 2022. CDLI's data pipeline has been frozen upstream.
- Null rates (from live database audit):
  - museum_no: 9.2%, excavation_no: 68%, width/height: 82%, thickness: 91%
  - period: 1.8%, provenience: 3.1%, genre: 1.4%, language: 5.2%
- Schema target: `artifacts` table (primary), `artifact_identifiers` (museum_no, excavation_no, primary_publication seeded as separate rows)
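
Null rates like those listed above can be recomputed from the bulk export with a short script. The following is a minimal sketch, not the project's audit code: it assumes the CSV has a header row, and the column names (`id_text`, `museum_no`, `excavation_no`, `period`) and the inline sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def null_rates(rows, fields):
    """Return the fraction of rows in which each field is empty or missing."""
    empty = Counter()
    total = 0
    for row in rows:
        total += 1
        for name in fields:
            if not (row.get(name) or "").strip():
                empty[name] += 1
    return {name: (empty[name] / total if total else 0.0) for name in fields}


# Made-up sample standing in for cdli_cat.csv; column names are assumptions.
SAMPLE = """id_text,museum_no,excavation_no,period
P000001,X 1,,Period A
P000002,,,Period A
P000003,X 2,Y 3,
P000004,X 5,,Period B
"""

rates = null_rates(
    csv.DictReader(io.StringIO(SAMPLE)),
    ["museum_no", "excavation_no", "period"],
)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

For the real file, replace the `io.StringIO` sample with an opened file handle and the field list with the actual header names.
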
- File: `cdliatf_unblocked.atf` (86 MB)
- Access: Bulk download
- Records: 135,200 transliterated texts (~3.5M lines)
- Structure: Line-oriented notation with `@surface` markers, `#tr.XX:` translation markers, `>>Q` composite references
- Adds ~33k stub records: P-numbers present in ATF but absent from catalog CSV
- Schema targets: `text_lines` (line-level decomposition), `surfaces` (from @ markers), `translations` (from #tr lines), `composites` and `artifact_composites` (from >>Q markers)
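
The line-oriented structure described above lends itself to a single-pass parser. This is a hedged sketch rather than Glintstone's importer: the `AtfText` container, the regular expressions, the list of non-surface `@` markers, and the sample text are all assumptions made for illustration, and real ATF has many more line types than are handled here.

```python
import re
from dataclasses import dataclass, field

HEADER_RE = re.compile(r"^&(P\d{6})\b")              # text header, e.g. "&P000001 = ..."
TRANSLATION_RE = re.compile(r"^#tr\.([a-z]{2}):\s*(.*)$")
COMPOSITE_RE = re.compile(r"^>>(Q\d{6})\b")
LINE_RE = re.compile(r"^(\S+?)\.\s+(.*)$")            # "1. text" or "1'. text"
# "@" markers assumed to describe the object rather than a surface.
NON_SURFACE = {"tablet", "object", "envelope", "bulla"}


@dataclass
class AtfText:
    p_number: str
    surfaces: list = field(default_factory=list)
    lines: list = field(default_factory=list)          # (surface, line_no, text)
    translations: list = field(default_factory=list)   # (line_no, lang, text)
    composites: list = field(default_factory=list)     # (line_no, q_number)


def parse_atf(source):
    texts = []
    current = surface = line_no = None
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            current = AtfText(header.group(1))
            texts.append(current)
            surface = line_no = None
            continue
        if current is None:
            continue
        if line.startswith("@"):
            parts = line[1:].split()
            if parts and parts[0] not in NON_SURFACE:
                surface = parts[0]
                current.surfaces.append(surface)
            continue
        translation = TRANSLATION_RE.match(line)
        if translation:
            # A translation attaches to the most recent transliteration line.
            current.translations.append((line_no, translation.group(1), translation.group(2)))
            continue
        composite = COMPOSITE_RE.match(line)
        if composite:
            current.composites.append((line_no, composite.group(1)))
            continue
        if line[0] in "#$":
            continue  # other comment and state lines are ignored in this sketch
        text_line = LINE_RE.match(line)
        if text_line:
            line_no = text_line.group(1)
            current.lines.append((surface, line_no, text_line.group(2)))
    return texts


# Made-up sample text, not taken from the CDLI dump.
SAMPLE = """&P000001 = Made-up example
#atf: lang sux
@tablet
@obverse
1. lugal e2
#tr.en: the king, the house
>>Q000002 001
@reverse
1'. dumu
"""

parsed = parse_atf(SAMPLE)
print(parsed[0])
```

Each list on `AtfText` corresponds to one of the schema targets named above, so a loader could write them out table by table.
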
- Extracted from: Inline `#tr.XX:` markers in ATF file
- Records: 5,599 translations across 9 languages (en, de, ts, it, fr, es, dk, ca, fa)
- Schema target: `translations` table
- Access: On-demand fetch from cdli.earth/dl, cached locally
- Format: JPEG photographs and line drawings
- Coverage: Varies by collection; many tablets have no published images
- IIIF: Not yet implemented. CDLI has partial IIIF support; integration is planned.
- Access: REST API (cdli.earth), no authentication, CC0
- Records: 16,725 publications, ~390k artifact-publication links
- Format: Structured bibliographic data with BibTeX keys
- Schema targets: `publications`, `publication_authors`, `artifact_editions`
- Import order: API data imported first (confidence 1.0), CSV fallback for gaps
- Extracted from: `cdli_cat.csv` free-text fields
- Provides: ~20k citation strings, ~4k collation records, ~7k fragment join references
- Schema targets: `scholarly_annotations`, `fragment_joins`
- Confidence: Lower than API data (0.3-0.7 depending on regex parse quality)
ORACC is the richest source of linguistic annotation for cuneiform texts. It is organized as independent scholarly projects, each covering a specific corpus.
License: CC BY-SA 3.0 (per project; some may vary)
Glintstone downloads all available ORACC project zip archives. Currently integrated or available:
| Project | Corpus | Coverage |
|---|---|---|
| dcclt | Digital Corpus of Cuneiform Lexical Texts | Lexical lists, school texts |
| epsd2 | electronic Pennsylvania Sumerian Dictionary | Sumerian dictionary (1.8 GB) |
| saao (saa01-saa20) | State Archives of Assyria Online | Neo-Assyrian state correspondence |
| rinap (rinap1-5) | Royal Inscriptions of the Neo-Assyrian Period | Royal inscriptions |
| riao (riao1-3) | Royal Inscriptions of Assyria Online | Earlier Assyrian royal inscriptions |
| etcsri | Electronic Text Corpus of Sumerian Royal Inscriptions | Sumerian royal texts |
| blms | Bilinguals in Late Mesopotamian Scholarship | Bilingual Sumerian-Akkadian scholarly texts |
| hbtin | Hellenistic Babylonia: Texts, Iconography, Names | Late Babylonian texts |
| dccmt | Digital Corpus of Cuneiform Mathematical Texts | Mathematical texts |
| ribo | Royal Inscriptions of Babylonia Online | Babylonian royal inscriptions |
| amgg | Ancient Mesopotamian Gods and Goddesses | Reference entries on deities |
| ogsl | ORACC Global Sign List | Sign inventory (~3,367 signs with readings, see below) |
| etcsl | Electronic Text Corpus of Sumerian Literature | Sumerian literary texts with lemmatization |
Several additional projects (rime, cams, ctij) are known to exist but return server errors when fetched.
- API: Each project publishes a zip archive at `http://oracc.org/{project}/json`
- Contents per project:
  - `catalogue.json`: project-specific artifact metadata with Pleiades geographic IDs
  - `corpus/*.json`: per-text CDL (Chunk-Delimiter-Lemma) trees with sign-level linguistic annotation
  - `glossary/gloss-{lang}.json`: dictionary entries, variant forms, senses
  - `geojson/*.geojson`: archaeological site coordinates (6 projects)
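
A project archive can be handled with the standard library alone. The sketch below builds the archive URL from the pattern above and reads `catalogue.json` out of the zip bytes. The helper names are hypothetical, the `members` key and the directory prefix inside the archive are assumptions about the file layout, and the demo uses an in-memory zip instead of a real download.

```python
import io
import json
import zipfile

ORACC_JSON_URL = "http://oracc.org/{project}/json"


def project_zip_url(project):
    """Build the archive URL for a project such as 'dcclt' or 'saao/saa01'."""
    return ORACC_JSON_URL.format(project=project)


def read_catalogue(zip_bytes):
    """Return the parsed catalogue.json from a project archive.

    The file is matched by base name because the directory prefix inside
    the archive is assumed to vary by project.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.rsplit("/", 1)[-1] == "catalogue.json":
                with archive.open(name) as handle:
                    return json.load(handle)
    raise FileNotFoundError("catalogue.json not found in archive")


# Build a tiny in-memory archive with made-up content for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr(
        "dcclt/catalogue.json",
        json.dumps({"members": {"P000001": {"period": "Period A"}}}),
    )

catalogue = read_catalogue(buffer.getvalue())
print(project_zip_url("dcclt"))
print(sorted(catalogue["members"]))
```

In a real run the bytes would come from an HTTP client of your choice; fetching is left out so the sketch stays runnable offline.
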
- Records: ~309k lemma tokens from corpus JSON across all projects
- Coverage: ~7,500 texts lemmatized out of 389k total artifacts (approximately 2% coverage)
- Structure: CDL trees decompose each text into nested nodes: chunk > lemma > grapheme
- Language codes: ISO-style (sux, akk, akk-x-stdbab, akk-x-oldbab, qpn, akk-x-neoass, etc.)
- Schema targets: `tokens`, `token_readings`, `lemmatizations`, `morphology`
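
Because CDL trees are nested, extracting lemma tokens is a recursive walk. The sketch below assumes that each node carries a `node` type (`c` for chunk, `d` for delimiter, `l` for lemma), that children live under a `cdl` key, and that lemma data sits in an `f` object with `lang`, `form`, `cf`, `gw`, and `pos`. These field names, the output row keys, and the sample tree are assumptions for illustration and should be checked against the actual corpus files.

```python
def iter_lemmas(node):
    """Yield every lemma node in a CDL tree, depth first."""
    if node.get("node") == "l":
        yield node
    for child in node.get("cdl", []):
        yield from iter_lemmas(child)


def to_token_rows(text):
    """Flatten one text's CDL tree into rows shaped for a tokens-style table."""
    rows = []
    for position, lemma in enumerate(iter_lemmas(text)):
        data = lemma.get("f", {})
        rows.append({
            "text_id": text.get("textid"),
            "position": position,
            "lang": data.get("lang"),
            "form": data.get("form"),
            "citation_form": data.get("cf"),
            "guide_word": data.get("gw"),
            "pos": data.get("pos"),
        })
    return rows


# Made-up tree following the assumed structure.
SAMPLE_TEXT = {
    "textid": "P000001",
    "cdl": [
        {"node": "c", "type": "text", "cdl": [
            {"node": "d", "type": "line-start", "label": "o 1"},
            {"node": "l", "frag": "lugal",
             "f": {"lang": "sux", "form": "lugal", "cf": "lugal", "gw": "king", "pos": "N"}},
            {"node": "l", "frag": "e2",
             "f": {"lang": "sux", "form": "e2", "cf": "e", "gw": "house", "pos": "N"}},
        ]},
    ],
}

rows = to_token_rows(SAMPLE_TEXT)
for row in rows:
    print(row["position"], row["form"], row["guide_word"])
```
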
- Records: ~21k dictionary entries, ~40k variant forms, ~5k entries with sense data
- Sources: gloss-sux.json (Sumerian), gloss-akk.json (Akkadian), gloss-akk-x-stdbab.json (Standard Babylonian), gloss-akk-x-oldbab.json (Old Babylonian), gloss-qpn.json (proper nouns)
- Schema targets: `glossary_entries`, `glossary_forms`, `glossary_senses`
OGSL (ORACC Global Sign List) is the canonical cuneiform sign inventory used as Glintstone's primary sign identification system.
License: Part of ORACC (CC BY-SA 3.0)
Access: ogsl-sl.json from ORACC zip
- Signs: 3,367 entries with Unicode codepoints and all known readings
- Sign values: ~15,000 reading values with sub-indices
- GDL definitions: JSON structural definitions for compound signs (e.g., |A.AN|)
- Sign types: simple, compound (|X.Y|), modified (@g gunu, @t tenu, @s sheshig)
- Schema targets: `signs`, `sign_values`, `sign_variants`
- Concordance role: OGSL sign_id is the canonical identifier. MZL and ABZ numbers are mapped via Unicode codepoints as a bridge. See data-quality.md for concordance gap details.
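
The Unicode bridge amounts to a join on codepoint: an external number resolves to an OGSL sign only if both lists agree on the same codepoint. A minimal sketch of that idea follows. The function name, the input shapes, and every value in the sample dictionaries are placeholders for illustration, not entries taken from the real lists.

```python
def bridge_via_codepoint(ogsl_by_codepoint, external_to_codepoint):
    """Resolve external sign numbers (MZL or ABZ) to OGSL sign_ids.

    ogsl_by_codepoint: dict mapping a codepoint string to an OGSL sign_id.
    external_to_codepoint: dict mapping an external number to a codepoint
        string, or to None when the list gives no codepoint.
    Returns (resolved, unresolved), where unresolved is a sorted list.
    """
    resolved = {}
    unresolved = []
    for number, codepoint in external_to_codepoint.items():
        sign_id = ogsl_by_codepoint.get(codepoint) if codepoint else None
        if sign_id is None:
            unresolved.append(number)
        else:
            resolved[number] = sign_id
    return resolved, sorted(unresolved)


# Placeholder data; identifiers and numbers are illustrative only.
ogsl = {"U+12000": "sign-a", "U+1202D": "sign-an"}
mzl = {839: "U+12000", 10: "U+1202D", 9001: None, 9002: "U+1FFFF"}

resolved, unresolved = bridge_via_codepoint(ogsl, mzl)
print(resolved)
print(unresolved)
```

Numbers that land in `unresolved` are the ones that would need manual concordance work.
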
The CompVis cuneiform sign detection dataset provides sign bounding-box annotations for machine learning training.
License: MIT
Access: GitHub sub-repo (CompVis/cuneiform-sign-detection-dataset)
- Tablets: 81 Neo-Assyrian tablets
- Annotations: 8,109 sign bounding boxes (11,070 including metadata rows)
- Labels: MZL integer numbers (Borger's Mesopotamisches Zeichenlexikon)
- Damage codes: 0=background, 1=intact, 2=broken/uncertain
- Coordinates: Pixel-absolute, converted to percentage-based for resolution independence
- Schema target: `sign_annotations`
- Concordance requirement: MZL labels must be resolved to OGSL sign_ids. ~200-400 remain unresolved after auto-matching via Unicode bridge.
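
The pixel-to-percentage conversion mentioned above is plain arithmetic: each coordinate is divided by the image dimension on its axis and scaled to 100. The sketch below assumes boxes are stored as corner coordinates (`x_min`, `y_min`, `x_max`, `y_max`); the dataset's actual column layout may differ, and the function name is hypothetical.

```python
def to_percent_box(x_min, y_min, x_max, y_max, image_width, image_height):
    """Convert a pixel-absolute box to percentages of the image size."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    return (
        100 * x_min / image_width,
        100 * y_min / image_height,
        100 * x_max / image_width,
        100 * y_max / image_height,
    )


box = to_percent_box(100, 50, 300, 150, image_width=1000, image_height=500)
print(box)  # (10.0, 10.0, 30.0, 30.0)
```

Percentage boxes stay valid when an image is served at a different resolution, which is the resolution independence the list above refers to.
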
eBL (Electronic Babylonian Literature) provides fragment-level scholarly editions with OCR training data and bibliographic citations.
License: Research use (varies by component)
Access: GitHub sub-repos + REST API
- Source: cuneiform-ocr-data (GitHub)
- Content: Annotated tablet images for DETR sign detection model training
- Classes: 173 sign classes
- Schema target: `sign_annotations` (future)
- Files: ebl.txt, mzl.txt
- Content: eBL internal names to ABZ numbers to Unicode
- Role: Bridge data for MZL/ABZ to OGSL concordance mapping
- Access: REST API (ebl.lmu.de), may require authentication token
- Format: CSL-JSON (Citation Style Language)
- Content: Fragment-level citations
- Schema target: `publications`, `artifact_editions`
- Attribution requirement: "eBL - Electronic Babylonian Literature" must appear when displaying eBL-sourced data
ePSD2 is a comprehensive Sumerian dictionary, hosted as an ORACC project.
License: Part of ORACC (CC BY-SA 3.0)
Access: ORACC project zip (epsd2)
- Size: 1.8 GB of dictionary data
- Coverage: Comprehensive Sumerian lexicon with attestation counts, forms, and senses
- Schema target: `glossary_entries`, `glossary_forms`, `glossary_senses`
These sources backfill bibliographic metadata and scholar identification. See citation-pipeline-summary.md for the full pipeline.
- Access: REST API, no auth, CC0
- Provides: DOI backfill for publications (15-30% coverage), ORCID identification for scholars (10-20%)
- Schema targets: Enriches `publications.doi` and `scholars.orcid`
- Access: REST API, rate-limited
- Provides: Citation graphs for publications with DOIs
- Schema target: `entity_relationships` (partially)
- Access: No API. Manual BibTeX export required (contact Tubingen)
- Records: ~90,000 bibliography entries
- Schema target: `publications`
- Note: Largest single bibliography source, but acquisition requires manual effort
- Who's Who in Cuneiform Studies: ~500-1000 scholar records (web scrape)
- Wikipedia Assyriologists category: ~200 scholars (web scrape)
- Schema target: `scholars`
- Status: Digitization in progress at ISAC (Institute for the Study of Ancient Cultures)
- Content: 21 volumes covering the complete Akkadian lexicon (A-Z)
- License: Public domain (copyright expired)
- Priority: Highest-value single Akkadian source (comprehensive, scholarly consensus)
- Planned approach: Await structured digitization format from ISAC; PDF extraction as fallback
- Schema targets: `glossary_entries`, `glossary_senses`, `interpretations` (competing scholarly readings)
- Status: Partial digitization available
- Content: Hittite lexicon (incomplete coverage)
- License: Check ISAC terms
- Priority: Primary Hittite dictionary source
- Planned approach: Import from digitized volumes when available
- Schema targets: `glossary_entries`, `glossary_forms`, `glossary_senses`
- Status: Marquette University project, format TBD
- Content: Morphological decomposition data for Hittite
- License: Research use (confirm)
- Priority: Core for Hittite morphological analysis
- Schema targets: `morphology`, `lemmatizations`
- Status: Elamite administrative texts, transliterations available
- Content: Elamite text corpus with transliterations
- License: Research use (ISAC)
- Priority: Primary Elamite corpus (Elamite is thinnest language layer)
- Planned approach: Import transliterations, manual curation for lexicon
- Schema targets: `text_lines`, `glossary_entries` (Elamite lemmas)
- Status: Available as ORACC project (partially integrated)
- Content: XML/HTML transliterations of Sumerian literary texts with lemmatization
- License: CC BY-SA 3.0 (ORACC)
- Priority: Complements ePSD2 for Sumerian literary corpus
- Schema targets: `text_lines`, `lemmatizations`, `translations`
- Status: Model available, import pathway designed but not yet run
- Content: Automated POS-tagging and lemmatization in CoNLL-U format
- Schema targets: `lemmatizations`, `morphology` (via annotation_runs with source_type='model')
- See: ml-integration.md for full details
When sources disagree on the same field:
- Identity metadata (period, provenience, genre, language): CDLI is authoritative. ORACC enriches with subgenre, supergenre, geographic coordinates, and project membership.
- Linguistic annotation: ORACC human annotations carry the highest confidence. BabyLemmatizer output is stored as a competing interpretation with lower confidence. Multiple analyses coexist via `is_consensus` flags.
- Bibliographic data: CDLI API data is imported first (confidence 1.0). CSV-derived data fills gaps at lower confidence. Cross-source deduplication uses a cascading match: DOI exact (1.0) > bibtex_key (0.95) > title+year (0.8) > short_title+volume (0.9). Matches below 0.7 are staged for manual review.
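
The cascading match for bibliographic deduplication can be read as a chain of tests where the first one that succeeds assigns the confidence. The sketch below tries the tiers in the order listed above and reuses the stated confidence values. The `Pub` record, the normalization rule, and the helper names are assumptions for illustration; the actual pipeline may compare fields differently.

```python
from dataclasses import dataclass
from typing import Optional

REVIEW_THRESHOLD = 0.7  # matches below this are staged for manual review


@dataclass
class Pub:
    title: str
    year: Optional[int] = None
    doi: Optional[str] = None
    bibtex_key: Optional[str] = None
    short_title: Optional[str] = None
    volume: Optional[str] = None


def _norm(text):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def match_confidence(a, b):
    """Return (rule, confidence) for the first matching tier, else None."""
    if a.doi and b.doi and a.doi.lower() == b.doi.lower():
        return ("doi", 1.0)
    if a.bibtex_key and a.bibtex_key == b.bibtex_key:
        return ("bibtex_key", 0.95)
    if a.year and a.year == b.year and _norm(a.title) == _norm(b.title):
        return ("title+year", 0.8)
    if (a.short_title and b.short_title and a.volume and a.volume == b.volume
            and _norm(a.short_title) == _norm(b.short_title)):
        return ("short_title+volume", 0.9)
    return None


def needs_review(confidence):
    return confidence < REVIEW_THRESHOLD


# Hypothetical records for demonstration.
first = Pub(title="Example Title", year=1990, doi="10.1000/ABC")
second = Pub(title="example  title", year=1990, doi="10.1000/abc")
third = Pub(title="example  title", year=1990)

print(match_confidence(first, second))  # ('doi', 1.0)
print(match_confidence(first, third))   # ('title+year', 0.8)
```
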
For field-level source mappings, see source-mapping.yaml.
Source: github.com/wittkensis/glintstone