-
Notifications
You must be signed in to change notification settings - Fork 0
Data Model Data Quality
Glintstone's data model is designed around a core principle: every claim about a cuneiform text -- its reading, lemmatization, translation, or identification -- should be traceable to the scholar, tool, or import process that produced it. This document describes how the schema enforces provenance, supports competing interpretations, and validates data integrity.
Every record in the database carries an annotation_run_id foreign key linking it to an annotation_runs record. This single mechanism handles all provenance:
| Field | Purpose |
|---|---|
source_type |
human, model, hybrid, or import |
source_name |
Provider attribution string (e.g., "ORACC/dcclt", "BabyLemmatizer v2.2", "CDLI - cdli.earth") |
scholar_id |
FK to scholars table (NULL for ML/import runs) |
method |
How the annotation was produced: hand_copy, photograph, collation, autopsy, RTI, 3D_scan, ML_model, import |
publication_id |
Structured link to the publication where this annotation appeared |
created_at |
When the annotation was recorded |
This design means a single JOIN from any data row to annotation_runs answers: who said this, when, how, and where was it published?
Each data source has a canonical attribution string stored in annotation_runs.source_name. These strings must appear in the UI when displaying source-specific data:
CDLI - cdli.earth-
ORACC/{project_name}(e.g., ORACC/dcclt, ORACC/saao) eBL - Electronic Babylonian LiteratureCompVisOpenAlexKeiBi - Keilschriftbibliographie
Verification: verify_provider_attribution.py checks that no records exist without a valid annotation_run_id and that all provider strings match the canonical list.
The scholars table tracks individual contributors with:
- Normalized name (diacritics handled, particles like "von" and "de" standardized)
- ORCID persistent identifier (backfilled from OpenAlex where available)
- Expertise metadata: period specializations, language competencies
- Active-since year (from Who's Who in Cuneiform Studies)
Name deduplication uses cascading match: ORCID exact > surname+initials normalized > full name fuzzy. Ambiguous matches are flagged for manual review.
A single token position on a tablet may have multiple valid readings or lemmatizations proposed by different scholars or tools. The schema preserves all of them.
token_readings stores multiple transliterations for the same token_id:
Token P000001, line 3, position 5:
Reading A: form="du₃" source=ORACC/dcclt confidence=0.95 is_consensus=1
Reading B: form="KAK" source=BabyLemmatizer confidence=0.72 is_consensus=0
The is_consensus flag marks the editorially chosen or highest-confidence reading. At most one reading per token carries this flag.
lemmatizations stores multiple linguistic analyses for the same token:
Token P000001, line 3, position 5:
Analysis A: cf="banû" gw="build" pos=V source=ORACC/dcclt confidence=0.95 is_consensus=1
Analysis B: cf="banû" gw="build" pos=V source=BabyLemmatizer confidence=0.88 is_consensus=0
When sources agree, this strengthens confidence. When they disagree, the competing analyses are preserved for scholarly review.
Three decision tables record how consensus was reached:
token_reading_decisionslemmatization_decisionstranslation_decisions
Each decision records: who decided, the method (editorial, vote, algorithm, import_default), rationale, timestamp, and a supersedes_id chain linking to the previous decision. This creates an immutable audit trail -- decisions are never deleted, only superseded.
Four evidence tables link annotations to their supporting material:
| Table | Links To | Evidence Types |
|---|---|---|
token_reading_evidence |
token_readings | photograph, hand_copy, collation, RTI, 3D_scan |
lemmatization_evidence |
lemmatizations | publication, glossary_match, parallel_text |
sign_annotation_evidence |
sign_annotations | photograph, 3D_scan, dataset_ground_truth |
translation_evidence |
translations | publication, parallel_text, personal_communication |
Each evidence row contains:
-
evidence_type: What kind of evidence (catalog_entry, museum_record, publication, photograph, collation, autopsy, database_export, personal_communication) -
evidence_ref: URL, DOI, catalog page, or freetext citation -
added_by: FK to scholars table -
note: Contextual explanation
This enables questions like: What evidence supports Reading A over Reading B for this token?
A three-tier system links artifacts to their publication history:
artifact_identifiers maps every known identifier (museum number, excavation number, publication designation, project ID) to its P-number. Normalized for fast lookup.
publications stores structured bibliographic records with:
- BibTeX keys prefixed by authority (cdli:, oracc:, ebl:, openalex:, keibi:)
- Series membership (RIME, SAA, RINAP, VAB)
- Supersession chains:
supersedes_idtracks when a newer publication replaces an older one
artifact_editions links individual artifacts to publications with granular metadata:
- Edition type: full_edition, hand_copy, photograph_only, catalog_entry, collation, translation_only, commentary
-
is_current_edition: Algorithmically set (newest full_edition wins), at most one per artifact, manually overridable viaartifact_edition_decisions - Per-artifact supersession (more granular than publication-level -- one publication may supersede another for some artifacts but not all)
scholarly_annotations supports export to W3C Web Annotation Data Model format for interoperability with annotation federation tools. Each annotation carries strict FK targeting (exactly one non-NULL target FK per row, enforced by CHECK constraint).
12 critical data issues were identified by testing the schema against the live 314 MB database (389,715 artifacts). All are addressed in the design. See data-issues.md for the full catalog.
Key issues and their fixes:
| Issue | Severity | Fix |
|---|---|---|
| Period string chaos (46+ formats) | High |
period_canon lookup + period_normalized column |
| Language code mismatch (CDLI vs ORACC) | High |
language_map lookup + languages JSON array |
| ORACC lemma dead-letter backlog (~2.4M) | Medium | Tokens in ORACC CDL files that cannot be matched to text_lines because those texts lack ATF. Tracked in dead_letters, replayable when ATF coverage expands. |
| Sign annotations use MZL, no OGSL concordance | High | Auto-match via Unicode bridge, flag ~200-400 unresolved |
| Glossary senses sparsely populated | Medium | senses[] arrays parsed from ORACC glossary JSON; 860k rows loaded |
| Glossary entry dedup across projects | Medium | UNIQUE(headword, language, project), keep all with source attribution |
Every import step runs validation checks:
- Row count: Expected vs actual, warning if deviation > 5%
- NULL audit: Critical columns checked against threshold (e.g., p_number must be 0% NULL)
- FK integrity: All foreign keys verified post-import
- annotation_run_id NOT NULL: Enforced -- no orphan records allowed
- Deduplication: Cross-source duplicates caught by cascading match (DOI > bibtex_key > title+year > short_title+volume)
Imports below confidence threshold (0.7) are staged in _dedup_candidates for manual review rather than auto-merged.
The ImportCheckpoint class provides:
- SIGINT/SIGTERM signal handling with graceful state save
- Atomic state file writes (JSON in
_progress/directory) - Source checksums to detect upstream changes since last run
- Progress reporting: processed/inserted/updated/skipped/errors per step
Glintstone surfaces data gaps honestly rather than hiding them. Key coverage metrics:
- Lemmatization: ~2% of artifacts have ORACC linguistic annotation (416k lemmatization rows across ~8k texts). The remaining 98% exist at the identity or text layer only. The dead-letter queue holds ~2.4M unmatched ORACC lemma tokens for texts not yet ATF-parsed.
- Geographic coordinates: ~3% of artifacts have Pleiades-linked coordinates (via ORACC geojson).
- Physical dimensions: Width/height available for ~18% of artifacts.
- Images: Coverage varies widely by collection. Many tablets have no published photographs.
- Sign concordance: ~200-400 MZL numbers remain unresolved to OGSL after auto-matching.
The pipeline_status table tracks 5-stage completeness (Image > OCR > ATF > Lemmas > Translation) per artifact, making gaps visible in every view.
Sign-level Grapheme Description Language (GDL) data is stored as JSON on tokens.gdl_json rather than in a normalized sign_instances table. JSON is simpler to import and query for display. Normalized tables would enable queries like "find all tokens containing sign AN" but add significant complexity. Leaning JSON for v2.0 with the option to normalize later.
Sumerian verbal morphology is actively debated. The morphology table uses a slot-based model with columns for conjugation_prefix, dimensional_prefixes, stem_form, pronominal_suffix, aspect, and transitivity. Different scholarly frameworks (Jagersma, Zolyomi, Edzard) analyze these slots differently. The schema supports storing multiple morphological analyses as competing annotations via the same provenance mechanism used for lemmatizations.
CDLI and ORACC sometimes disagree on period, genre, or provenience for the same artifact. Resolution: CDLI is authoritative for identity fields. ORACC-specific enrichment (subgenre, supergenre, geographic coordinates) stored in separate columns. Both raw and normalized values preserved.
The CDLI bulk catalog export has not been updated since August 2022. New tablets added to CDLI after this date are not reflected in Glintstone. The API provides more recent data for some fields (publications, artifact-edition links) but not the full catalog.
Source: github.com/wittkensis/glintstone · Issues · Edit this wiki
Start here
Getting Started
Overview
Data Model
- Data Sources
- Data Quality
- Data Issues
- Import Pipeline Guide
- ML Integration
- Citation Pipeline Summary
Reference — Data Model
Reference — API
Reference — MCP
Opportunities
Personas
Project
Research