Opportunities
Beyond Machine Learning: Seven High-Impact Opportunities for Transforming Cuneiform Studies
A strategic analysis of where digital innovation can revolutionize the field
While machine learning for cuneiform sign recognition represents an important advancement, the field's most critical bottlenecks lie in infrastructure, trust systems, and collaborative workflows. This document identifies seven high-leverage opportunities where thoughtful system design could transform how scholars work with ancient texts.
Key Insight: The real constraint isn't reading signs—it's integrating 150 years of fragmentary scholarship into queryable, trustworthy, collaborative knowledge systems.
Glintstone status: schema - core architecture designed
Addressed by: scholars, annotation_runs (scholar_id + method + publication_ref/publication_id), competing interpretations (token_readings, lemmatizations with is_consensus), 4 evidence tables, 3 decision tables with supersedes chains. Every annotation traces to who/what created it, how, and where published.
See: data-model/glintstone-schema.yaml
Remaining: UI for provenance display, reputation scoring (deferred), CRDT-based distributed editing (out of scope for data layer)
Every cuneiform text exists in a trust limbo with no provenance tracking for editorial decisions:
- Scholar A publishes reading in 1952
- Scholar B corrects in 1987
- Scholar C challenges in 2003
- Graduate student D finds new fragment in 2015
Current ATF representation:
1. {d}utu lugal an-ki-a
Unanswerable questions:
- Who first read this line?
- When was it last verified against the physical tablet?
- What other readings have been proposed?
- What's the confidence level for each sign?
- Has anyone else independently verified this?
- What happens when someone disagrees?
Build "Git for Cuneiform" with trust primitives
Scholarly Edit Graph
├─ Initial reading (Scholar A, 1952, from hand copy)
├─ Revision (Scholar B, 1987, from photograph)
├─ Challenge (Scholar C, 2003, from collation)
├─ Fragment join (Scholar D, 2015, new evidence)
└─ Synthesis (System, 2026, confidence scoring)
Provenance Tracking
- Every sign reading has complete audit trail
- Method documented (hand copy, photograph, collation, autopsy)
- Tool used (naked eye, RTI, 3D scan, ML model)
- Confidence score evolution over time
Version Control for Readings
{
"tablet": "P363653",
"line": 1,
"sign": 3,
"reading_history": [
{
"reading": "an",
"scholar": "Smith, J.",
"date": "1952-03-15",
"method": "hand_copy",
"confidence": 0.7,
"source": "unpublished_notes"
},
{
"reading": "an",
"scholar": "Jones, M.",
"date": "1987-06-22",
"method": "photograph",
"confidence": 0.85,
"source": "doi:10.1234/journal.1987.123"
},
{
"reading": "dingir",
"scholar": "Chen, L.",
"date": "2003-11-08",
"method": "collation",
"confidence": 0.6,
"source": "field_notes",
"note": "Alternative reading based on traces"
}
],
"current_consensus": {
"reading": "an",
"confidence": 0.82,
"agreeing_scholars": 12,
"dissenting_scholars": 2
}
}
Alternative Readings Coexist
- No more single "canonical" text
- Probability distributions instead of binary choices
- Formal disagreement mechanisms
- Dispute visualization and resolution workflows
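As a rough illustration of "probability distributions instead of binary choices", the sketch below turns a `reading_history` list into a normalized distribution over readings. The weighting rule (each scholar's most recent entry, weighted by its stated confidence) and the function name `reading_distribution` are assumptions made for this example, not part of the proposed system.

```python
from collections import defaultdict

def reading_distribution(history):
    """Turn a reading_history list into a normalized distribution.

    Assumed rule: only each scholar's most recent entry counts, and it is
    weighted by that entry's confidence value.
    """
    latest = {}
    # ISO dates sort correctly as strings, so later entries overwrite earlier ones.
    for entry in sorted(history, key=lambda e: e["date"]):
        latest[entry["scholar"]] = entry
    weights = defaultdict(float)
    for entry in latest.values():
        weights[entry["reading"]] += entry["confidence"]
    total = sum(weights.values())
    if total == 0:
        return {}
    return {reading: w / total for reading, w in weights.items()}

history = [
    {"reading": "an", "scholar": "Smith, J.", "date": "1952-03-15", "confidence": 0.7},
    {"reading": "an", "scholar": "Jones, M.", "date": "1987-06-22", "confidence": 0.85},
    {"reading": "dingir", "scholar": "Chen, L.", "date": "2003-11-08", "confidence": 0.6},
]
dist = reading_distribution(history)
```

With these three entries, "an" receives roughly 72% of the weight and "dingir" roughly 28%, so the minority reading stays visible instead of being discarded.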
Contribution Attribution
- Credit for incremental improvements
- Reputation scoring for accuracy
- Citation of specific editorial decisions
- Micro-contributions count (fixing a single sign)
Rollback Capability
- Return to earlier readings if new evidence contradicts
- Preserve reasoning for historical decisions
- Track why readings changed
┌─────────────────────────────────────┐
│ Scholar Interface Layer │
│ (Web UI, CLI, API) │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Trust & Provenance Engine │
│ - Version control │
│ - Conflict resolution │
│ - Reputation scoring │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Reading Database │
│ - Current state │
│ - Full history │
│ - Alternative readings │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Export Layer │
│ - ATF generation │
│ - Citation formats │
│ - Publication ready output │
└─────────────────────────────────────┘
Machine learning will produce readings, but how do you integrate them with 150 years of human scholarship? The trust infrastructure determines whether ML outputs get adopted or ignored.
Key scenarios this solves:
- ML Integration: When your OCR suggests a reading, it enters the same provenance system as human readings, with confidence scores and method documentation.
- Scholarly Disagreement: Two experts disagree about a damaged sign. The system preserves both readings, tracks supporting evidence, enables community voting or discussion.
- New Evidence: A museum finds a better photograph. The system can propagate improved readings while preserving historical context.
- Quality Assurance: Identify texts that haven't been collated in 50+ years and prioritize for re-examination.
- Expertise Discovery: "Who has successfully read signs in this specific context before?"
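The Quality Assurance scenario is straightforward to express once each text carries a last-collation date. The sketch below is a minimal version under assumed field names (`tablet`, `last_collated`); the ID `P000001` is a placeholder and the function name `needs_recollation` is hypothetical.

```python
from datetime import date

def needs_recollation(texts, today, max_age_years=50):
    """Return tablet IDs whose last collation is at least max_age_years old.

    Age is approximated by calendar-year difference. Texts that were never
    collated (None) are included and sort to the front.
    """
    stale = []
    for t in texts:
        last = t["last_collated"]
        if last is None or today.year - last.year >= max_age_years:
            stale.append(t)
    stale.sort(key=lambda t: t["last_collated"] or date.min)
    return [t["tablet"] for t in stale]

texts = [
    {"tablet": "P363653", "last_collated": date(2003, 11, 8)},
    {"tablet": "P470987", "last_collated": date(1952, 3, 15)},
    {"tablet": "P000001", "last_collated": None},  # placeholder ID, never collated
]
stale_ids = needs_recollation(texts, today=date(2026, 1, 1))
```

Here only the 1952 collation and the never-collated text are flagged; the 2003 collation is recent enough to be skipped.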
- Adoption: 1000+ scholars using system for editions
- Coverage: 10,000+ texts with full provenance tracking
- Collaboration: 100+ instances of collaborative improvement
- ML Integration: Seamless workflow for incorporating automated readings
- Citation: System referenced in published editions
- Git backend with custom merge strategies for cuneiform
- Blockchain-inspired provenance (without cryptocurrency nonsense)
- Conflict-free replicated data types (CRDTs) for distributed editing
- Reputation algorithms borrowed from Stack Overflow, Wikipedia
- Web annotation standards for linking to primary sources
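One way to read "blockchain-inspired provenance" is a hash-chained, append-only log: each entry's hash covers its own content plus the previous entry's hash, so silent edits to history become detectable. The sketch below assumes that interpretation; `append_entry` and `verify` are illustrative names, not an existing API.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(prev_hash, entry):
    # Canonical JSON (sorted keys) so the same entry always hashes the same way.
    payload = json.dumps(entry, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_entry(log, entry):
    """Append an entry whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "prev": prev_hash, "hash": _digest(prev_hash, entry)})
    return log

def verify(log):
    """Recompute every hash; return False if any entry was altered or reordered."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash or record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True

log = []
append_entry(log, {"reading": "an", "scholar": "Smith, J.", "method": "hand_copy"})
append_entry(log, {"reading": "dingir", "scholar": "Chen, L.", "method": "collation"})
```

Calling `verify(log)` returns True for the untouched log; changing any earlier entry afterwards makes it return False, which is the tamper-evidence property without any distributed ledger.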
Most important cuneiform texts exist in dozens or hundreds of fragmentary copies:
- Epic of Gilgamesh: ~200 manuscripts, none complete
- Enūma Eliš (creation myth): ~70 tablets/fragments
- Any law collection: Scattered across multiple damaged tablets
- Manually align overlapping fragments
- Reconstruct missing sections from parallel texts
- Document which line comes from which source
- Publish as single "reconstructed" text
- Footnote everything extensively
Problems:
- Each scholar makes different reconstruction choices
- Composite texts become "black boxes"
- Can't easily see which words come from which sources
- Can't understand confidence levels for reconstructions
- Can't compare alternative reconstructions
- Can't update when new fragments discovered
- Can't programmatically query "how much is actually preserved?"
The same line of Gilgamesh might appear in:
- Old Babylonian tablet from Sippar (1800 BCE) - damaged
- Middle Babylonian tablet from Nippur (1400 BCE) - fragmentary
- Neo-Assyrian tablet from Nineveh (700 BCE) - mostly complete
- Late Babylonian tablet from Uruk (300 BCE) - variant reading
Scholar must:
- Determine textual relationships (which derives from which?)
- Weigh dialectal vs. textual variants
- Choose which witness to follow for each word
- Reconstruct gaps using poetic parallelism
- Document all of this in footnotes
Build a Composite Text Engine with algorithmic reconstruction
Manuscript Registry:
{
"work": "Gilgamesh_Tablet_I",
"canonical_structure": {
"lines": 300,
"columns": 6,
"divisions": ["prologue", "introduction", "narrative"]
},
"manuscripts": [
{
"id": "P363653",
"siglum": "MS A",
"provenance": "Sippar",
"period": "Old Babylonian",
"date_range": "1800-1750 BCE",
"material": "clay",
"preservation_state": "fragmentary",
"preserves": {
"lines": "1-15, 22-30, 45-48",
"completeness": 0.18
},
"textual_tradition": "southern",
"dialect": "Old Babylonian",
"quality": "high",
"images": {
"photograph": "https://cdli/P363653.jpg",
"hand_copy": "https://archive/drawing_123.jpg",
"rti": "https://rti/P363653/viewer"
}
},
{
"id": "P470987",
"siglum": "MS N₁",
"provenance": "Nineveh",
"period": "Neo-Assyrian",
"date_range": "700-650 BCE",
"preserves": {
"lines": "5-35, 40-80",
"completeness": 0.45
},
"textual_tradition": "standard_version",
"quality": "excellent"
}
]
}
Line-Level Attestation Tracking:
{
"line": 1,
"text_composite": "ša nagba īmuru adi kalâma",
"witnesses": [
{
"ms": "P363653",
"text": "[š]a na[gba] īmuru",
"confidence": 0.75,
"damage": "beginning_and_middle_damaged",
"reading_notes": "first sign partially visible"
},
{
"ms": "P470987",
"text": "ša nagba īmuru adi kalâma",
"confidence": 0.95,
"damage": "none",
"variant_notes": "complete attestation"
}
],
"reconstruction": {
"method": "primary_witness_P470987",
"confidence": 0.95,
"alternative_readings": [],
"editor_notes": "Line complete in Neo-Assyrian version"
}
}
Text Alignment Algorithms (borrowed from genomics):
# Pseudo-code for fragment alignment.
# smith_waterman_align() and build_consensus_sequence() are placeholders
# for helpers this document does not define.
from itertools import combinations

DEFAULT_PENALTIES = {
    "match_score": 2,
    "mismatch_penalty": -1,
    "gap_penalty": -2,
    "variant_penalty": -0.5,  # dialectal variants
}

def align_fragments(fragments, penalties=DEFAULT_PENALTIES):
    """
    Similar to DNA sequence alignment
    - Match: signs are identical
    - Mismatch: dialectal variant or textual variant
    - Gap: missing section in one witness
    """
    alignments = []
    # Score every unordered pair of fragments once.
    for frag_a, frag_b in combinations(fragments, 2):
        score = smith_waterman_align(frag_a.signs, frag_b.signs, **penalties)
        alignments.append((frag_a, frag_b, score))
    return build_consensus_sequence(alignments)
Variant Classification:
- Orthographic: Different ways to write same word
- Dialectal: Northern vs Southern Babylonian
- Temporal: Language evolution over centuries
- Textual: Actual different versions
- Scribal error: Mistakes in copying
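The pairwise scoring step in the alignment pseudo-code can be made concrete with a small Smith-Waterman implementation in which known dialectal or orthographic variants receive a milder penalty than true mismatches. This is a sketch: it returns only the best local score (no traceback), and the `variants` argument, a set of sign pairs treated as equivalent, is a hypothetical input introduced for this example.

```python
def smith_waterman_score(a, b, variants=frozenset(),
                         match=2.0, mismatch=-1.0, gap=-2.0, variant=-0.5):
    """Best local alignment score between two sign sequences.

    `variants` holds frozenset pairs of signs regarded as dialectal or
    orthographic equivalents; they score `variant` instead of `mismatch`.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0.0] * cols for _ in range(rows)]  # dynamic-programming matrix
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            x, y = a[i - 1], b[j - 1]
            if x == y:
                s = match
            elif frozenset((x, y)) in variants:
                s = variant
            else:
                s = mismatch
            H[i][j] = max(0.0,                   # start a new local alignment
                          H[i - 1][j - 1] + s,   # align x with y
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best

ms_a = ["ša", "nag", "ba", "i", "mu", "ru"]
ms_b = ["ša", "nag", "ba", "i", "mu", "ru", "a", "di"]
score = smith_waterman_score(ms_a, ms_b)
```

Because the alignment is local, the shorter witness is not penalized for the signs it lacks at the end: six matching signs score 12.0 here.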
Confidence Propagation:
{
"line": 15,
"reconstruction": "ša nagba īmuru adi kalâma",
"word_confidence": [
{"word": "ša", "conf": 0.98, "witnesses": 5},
{"word": "nagba", "conf": 0.85, "witnesses": 4},
{"word": "īmuru", "conf": 0.92, "witnesses": 5},
{"word": "adi", "conf": 0.65, "witnesses": 2},
{"word": "kalâma", "conf": 0.70, "witnesses": 2}
],
"line_confidence": 0.82,
"reconstruction_method": "weighted_consensus",
"alternatives": [
{
"text": "ša nagba īmuru ana kalâma",
"conf": 0.15,
"witnesses": 1,
"note": "minority reading with 'ana' instead of 'adi'"
}
]
}
Gap Filling Strategies:
- Parallel Passages: Use other occurrences of same formula
- Poetic Parallelism: Reconstruct based on meter/structure
- Contextual Prediction: Use surrounding lines
- Statistical Models: N-gram probability from corpus
- Leave Blank: When confidence too low
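The "Leave Blank" rule can be sketched as a confidence cutoff applied to whatever the other strategies propose. The threshold value of 0.75 and the function name `choose_restoration` are assumptions for illustration; an editor would set the cutoff per project.

```python
def choose_restoration(candidates, threshold=0.75):
    """Pick the highest-confidence suggestion, or None to leave the gap blank.

    `threshold` is an assumed editorial cutoff, not a value prescribed here.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c["confidence"])
    return best["suggested"] if best["confidence"] >= threshold else None

# Illustrative candidates, one per strategy.
candidates = [
    {"method": "parallel_passage", "suggested": "puhur", "confidence": 0.7},
    {"method": "poetic_meter", "suggested": "kiṣri", "confidence": 0.4},
    {"method": "ngram_probability", "suggested": "maḫar", "confidence": 0.5},
]
choice = choose_restoration(candidates)
```

With the default cutoff no candidate qualifies, so `choice` is None and the gap stays open; lowering the threshold to 0.6 would accept the parallel-passage suggestion.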
{
"line": 42,
"preserved": "ina [... ...] ilāni rabûti",
"gap_analysis": {
"missing_signs": "estimated_3_to_5",
"reconstruction_methods": [
{
"method": "parallel_passage",
"source": "Gilgamesh_VI:8",
"suggested": "puhur",
"confidence": 0.7
},
{
"method": "poetic_meter",
"suggested": "kiṣri",
"confidence": 0.4
},
{
"method": "ngram_probability",
"suggested": "maḫar",
"confidence": 0.5
}
],
"editor_choice": "leave_blank",
"rationale": "insufficient_evidence"
}
}
Witness Comparison View:
┌─────────────────────────────────────────────────────────┐
│ Gilgamesh Tablet I, Line 1 │
├─────────────────────────────────────────────────────────┤
│ Composite: ša nagba īmuru adi kalâma │
│ │
│ Witnesses: │
│ ○ MS A (OB): [š]a na[gba] īmuru ... [...] │
│ ● MS N₁ (NA): ša nagba īmuru adi kalâma │
│ ○ MS U₁ (LB): ša nag-ba i-mu-ru a-di ka-la-ma │
│ │
│ [Confidence: ████████░░ 85%] │
│ │
│ Toggle witnesses: [✓] MS A [✓] MS N₁ [✓] MS U₁ │
└─────────────────────────────────────────────────────────┘
Features:
- Slide between versions: See how composite changes when you exclude/include witnesses
- Hover for details: See why specific reading was chosen
- Confidence visualization: Color-coded by certainty
- Variant explorer: Click word to see all attestations
- Temporal slider: Watch text evolve across 1,000 years
- Export options: Generate ATF, scholarly apparatus, publication-ready footnotes
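The "slide between versions" feature implies that the composite is recomputed from whichever witnesses are switched on. A minimal sketch of that recomputation for a single word slot follows, using a confidence-weighted vote. The attestation data and the name `composite_word` are hypothetical.

```python
from collections import defaultdict

def composite_word(attestations, enabled):
    """Confidence-weighted vote over the enabled witnesses for one word slot.

    Returns (reading, share of total weight), or (None, 0.0) if no enabled
    witness preserves the word.
    """
    votes = defaultdict(float)
    for att in attestations:
        if att["ms"] in enabled and att["reading"] is not None:
            votes[att["reading"]] += att["confidence"]
    if not votes:
        return None, 0.0
    reading = max(votes, key=votes.get)
    return reading, votes[reading] / sum(votes.values())

# Hypothetical attestations for one word of a line.
slot = [
    {"ms": "MS A", "reading": None, "confidence": 0.0},    # broken away
    {"ms": "MS N₁", "reading": "adi", "confidence": 0.95},
    {"ms": "MS U₁", "reading": "ana", "confidence": 0.60},
]
all_on = composite_word(slot, {"MS A", "MS N₁", "MS U₁"})
without_n1 = composite_word(slot, {"MS A", "MS U₁"})
```

With every witness enabled the composite reads "adi"; unticking MS N₁ flips it to "ana", which is the kind of change the toggle view is meant to surface.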
Manuscript Coverage Heatmap:
Lines MS A MS N₁ MS U₁ MS B₃ Coverage
1-10 ████ ████ ████ ░░░░ 75%
11-20 ██░░ ████ ████ ████ 81%
21-30 ░░░░ ████ ██░░ ████ 75%
31-40 ░░░░ ████ ████ ████ 75%
...
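A coverage figure of this kind could be derived from the `preserves.lines` strings in the manuscript registry. The sketch below assumes one particular definition, the share of lines in a range attested by at least one witness, which may differ from the definition behind the heatmap; `parse_ranges` and `coverage` are illustrative names.

```python
def parse_ranges(spec):
    """Expand a string like '1-15, 22-30, 45-48' into a set of line numbers."""
    lines = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        lines.update(range(int(start), int(end or start) + 1))
    return lines

def coverage(preserved, first, last):
    """Share of lines first..last attested by at least one witness."""
    wanted = set(range(first, last + 1))
    attested = set().union(*preserved.values()) & wanted
    return len(attested) / len(wanted)

preserved = {
    "P363653": parse_ranges("1-15, 22-30, 45-48"),
    "P470987": parse_ranges("5-35, 40-80"),
}
early = coverage(preserved, 1, 10)
middle = coverage(preserved, 36, 45)
```

This also gives a programmatic answer to "how much is actually preserved?": lines 1-10 are fully attested, while lines 36-45 are only 60% covered because neither witness preserves lines 36-39.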
Textual Relationship Diagram:
[Archetype]
│
┌──────┴──────┐
│ │
[Branch α] [Branch β]
│ │
┌──┴──┐ ┌──┴──┐
│ │ │ │
MS A MS B MS N₁ MS U₁
(OB) (MB) (NA) (LB)
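The relationship diagram maps naturally onto a parent-to-children dictionary. The sketch below encodes exactly the edges drawn above and answers "where do two witnesses share a source?". It assumes each node has a single parent; a stemma with contamination between branches would need a more general graph structure.

```python
# Parent -> children edges copied from the diagram.
STEMMA = {
    "Archetype": ["Branch α", "Branch β"],
    "Branch α": ["MS A", "MS B"],
    "Branch β": ["MS N₁", "MS U₁"],
}

def ancestors(node, stemma=STEMMA):
    """Return the chain of ancestors from the node's parent up to the root."""
    parent_of = {child: parent for parent, kids in stemma.items() for child in kids}
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain

def closest_common_ancestor(a, b, stemma=STEMMA):
    """First ancestor of `a` that is also an ancestor of `b`, or None."""
    shared = set(ancestors(b, stemma))
    return next((n for n in ancestors(a, stemma) if n in shared), None)

same_branch = closest_common_ancestor("MS A", "MS B")
cross_branch = closest_common_ancestor("MS A", "MS N₁")
```

Agreement between MS A and MS N₁ traces back to the archetype, so it carries more weight for reconstruction than agreement between MS A and MS B, which only reaches Branch α.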
┌─────────────────────────────────────────────┐
│ Scholar Interface │
│ - Visual alignment editor │
│ - Variant comparison │
│ - Reconstruction controls │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Alignment Engine │
│ - Fragment matching algorithms │
│ - Variant classification │
│ - Confidence scoring │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Reconstruction Engine │
│ - Gap filling strategies │
│ - Parallel passage finder │
│ - Poetic analysis │
│ - Statistical models │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Manuscript Database │
│ - Witness metadata │
│ - Line-level attestations │
│ - Textual relationships │
│ - Image links │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Export Layer │
│ - ATF with apparatus │
│ - Publication formatting │
│ - Scholarly footnotes │
│ - Interactive web edition │
└─────────────────────────────────────────────┘
Current state: Creating composites is a PhD-level skill requiring years of training, done manually by ~50 scholars worldwide.
With this tool:
- Democratization: Graduate students can work on ambitious projects
- Rapid integration: New fragments incorporated immediately (not years later)
- Alternative reconstructions: Multiple scholarly approaches can coexist
- Transparency: Every decision documented and traceable
- Collaboration: Multiple scholars can contribute to same composite
- Living editions: Updates propagate to all citations
Impact on scholarship:
- 80% of important Sumerian/Akkadian literature only exists in composite form
- Tools currently don't exist for this workflow
- Digital composites could become the standard, not exception
- ML outputs could be tested against manuscript traditions systematically
- Adoption: 50+ composites created with tool
- Coverage: Major literary works (Gilgamesh, Atrahasis, Enūma Eliš)
- Collaboration: 10+ multi-scholar composite projects
- Publication: Composites cited in peer-reviewed journals
- Integration: New fragments added within weeks of discovery
Algorithms needed:
- Smith-Waterman alignment (from genomics)
- Needleman-Wunsch global alignment
- Hidden Markov Models for textual traditions
- Bayesian inference for variant probability
- Graph algorithms for stemma (manuscript genealogy)
Data structures:
- Directed acyclic graphs for textual relationships
- Sparse matrices for witness alignment
- Merkle trees for version integrity
- Spatial indexes for parallel passage finding
UI/UX challenges:
- Visualizing high-dimensional variant space
- Progressive disclosure of complexity
- Expert vs. novice modes
- Real-time collaboration for co-editing
Searching cuneiform texts is largely limited to string matching:
What CDLI/Oracc offer:
Search: "lugal"
Results: Every instance of string "lugal"
What this misses:
- Synonyms: šarru, ensi, rubû all mean "ruler"
- Morphological variants: lugal, lugal-zu, lugal-bi (king, your king, his king)
- Semantic queries: "find texts about divine kingship"
- Contextual search: "royal inscriptions mentioning temple construction"
- Grammatical patterns: "imperative verbs in ritual instructions"
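To make the first two gaps concrete, the sketch below searches by concept rather than by string: a query for "ruler" expands to several lemmas, and suffixed forms are reduced to a base form before matching. The concept map is hand-made for this example, and splitting on the first hyphen is a deliberately crude stand-in for real lemmatization (it would mishandle syllabic spellings such as šar-ru).

```python
# Tiny hand-made concept map; a real ontology would be far larger.
CONCEPTS = {"ruler": {"lugal", "ensi", "šarru", "rubû"}}

def base_form(token):
    """Crude stand-in for lemmatization: keep the part before the first hyphen."""
    return token.split("-")[0]

def concept_search(concept, lines):
    """Return (line_no, token) pairs whose base form belongs to the concept."""
    lemmas = CONCEPTS.get(concept, set())
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for token in line.split():
            if base_form(token) in lemmas:
                hits.append((line_no, token))
    return hits

lines = ["{d}utu lugal an-ki-a", "lugal-zu", "lugal-bi", "šarru"]
hits = concept_search("ruler", lines)
```

A single query now returns the Sumerian forms with and without suffixes as well as the Akkadian synonym, which a literal search for "lugal" would not group together.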
- "Find royal inscriptions where the king claims divine authority"
- "Show me economic texts mentioning silver transactions over 5 minas"
- "Find medical texts that prescribe beer as treatment"
- "Where do Sun-god epithets appear with creation language?"
- "Locate dream interpretation texts mentioning kings"
- "Find all dative case divine names in Old Babylonian royal inscriptions"
- "Show me imperative verbs in ritual instructions"
- "Where are relative clauses used with first-person subjects?"
- "Find causative verbs with deity as agent"
- "Locate genitive chains longer than 3 elements"
- "How did terminology for 'kingship' evolve from 2500-500 BCE?"
- "Track usage of loan words from Sumerian in Akkadian texts over time"
- "Compare administrative vocabulary between Ur III and Old Babylonian periods"
- "When did 'Standard Babylonian' formulae emerge?"
- "Find earliest attestations of specific religious concepts"
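Diachronic questions such as the kingship example reduce to counting lemmas per period once the corpus is lemmatized and dated. The sketch below shows that computation on made-up `(period, lemma)` pairs; the data and the name `term_shares` are illustrative only.

```python
from collections import Counter, defaultdict

def term_shares(occurrences, terms):
    """Per period, the share of each tracked term among all tracked hits."""
    counts = defaultdict(Counter)
    for period, lemma in occurrences:
        if lemma in terms:
            counts[period][lemma] += 1
    return {
        period: {term: n / sum(c.values()) for term, n in c.items()}
        for period, c in counts.items()
    }

# Made-up (period, lemma) pairs standing in for a lemmatized corpus.
occurrences = [
    ("Ur III", "lugal"), ("Ur III", "lugal"), ("Ur III", "lugal"), ("Ur III", "ensi"),
    ("Old Babylonian", "šarru"), ("Old Babylonian", "šarru"),
    ("Old Babylonian", "šarru"), ("Old Babylonian", "lugal"),
]
shares = term_shares(occurrences, {"lugal", "ensi", "šarru", "rubû"})
```

The result is a per-period table of relative frequencies, the raw material for a chart of how terminology shifts over time.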
Build a cuneiform search engine with semantic understanding
Concept Ontology:
{
"concept": "ruler",
"akkadian_terms": [
{"lemma": "šarru", "primary": true, "gloss": "king"},
{"lemma": "rubû", "primary": true, "gloss": "prince, ruler"},
{"lemma": "bēlu", "secondary": true, "gloss": "lord, owner"},
{"lemma": "malkū", "secondary": true, "gloss": "counselor, prince"}
],
"sumerian_terms": [
{"lemma": "lugal", "primary": true, "gloss": "king"},
{"lemma": "ensi", "primary": true, "gloss": "governor, ruler"},
{"lemma": "en", "secondary": true, "gloss": "lord"}
],
"related_concepts": [
"kingship",
"authority",
"sovereignty",
"dynasty"
],
"semantic_fields": [
"politics",
"social_hierarchy",
"royal_ideology"
]
}
Multilingual Linking:
{
"sumerian": "dingir",
"akkadian": "ilu",
"hittite": "šiuniš",
"hurrian": "eni",
"concept": "deity",
"semantic_network": {
"hypernym": "supernatural_being",
"hyponyms": ["major_deity", "minor_deity", "personal_deity"],
"related": ["temple", "cult", "offering", "priest"]
}
}
Cultural Context Tagging:
{
"text": "P363653",
"genre_tags": [
"royal_inscription",
"building_inscription",
"dedicatory"
],
"content_tags": [
"temple_construction",
"divine_legitimation",
"victory_claim"
],
"cultural_context": [
"Old_Babylonian",
"First_Dynasty_of_Babylon",
"Hammurabi_reign"
],
"semantic_features": {
"mentions_deities": ["Shamash", "Marduk"],
"mentions_places": ["Sippar", "Babylon"],
"mentions_activities": ["building", "dedicating", "restoring"]
}
}
Entity Resolution:
{
"entity": "Shamash",
"type": "deity",
"attributes": {
"domain": ["sun", "justice", "divination"],
"cult_centers": ["Sippar", "Larsa"],
"associated_deities": ["Aya (consort)", "Kittu", "Mīšaru"],
"symbols": ["sun_disk", "saw"]
},
"name_variants": [
{"language": "akkadian", "form": "Šamaš"},
{"language": "sumerian", "form": "Utu"},
{"language": "west_semitic", "form": "Shamash"}
],
"attestations": 3847,
"chronology": {
"earliest": "Early Dynastic III",
"latest": "Seleucid"
}
}
Query Syntax:
FIND texts WHERE
genre = 'royal_inscription'
AND period IN ['Old Babylonian', 'Middle Babylonian']
AND contains(
word(lemma='šarru', case='nominative'),
word(lemma='banû', stem='G', mood='indicative')
)
AND mentions_deity('Marduk')
ORDER BY date ASC
LIMIT 50
Complex Grammatical Patterns:
-- Find relative clauses with divine subjects
FIND clauses WHERE
clause_type = 'relative'
AND subject.word_class = 'divine_name'
AND verb.person = 3
AND verb.number = 'singular'

-- Find administrative texts with large silver amounts
FIND texts WHERE
genre = 'administrative'
AND contains(
measure(commodity='silver', amount > 5, unit='mina')
)
AND period = 'Ur III'

-- Find medical prescriptions involving beer
FIND texts WHERE
genre = 'medical'
AND contains(
prescription(
ingredient='beer',
verb.stem='Š', -- causative "make drink"
patient=not_null
)
)
Morphological Search:
-- All preterite verbs in first-person
FIND words WHERE
word_class = 'verb'
AND tense = 'preterite'
AND person = 1
AND number = 'singular'
GROUP BY lemma
Term Frequency Over Time:
{
"query": "kingship_terminology",
"results": [
{
"period": "Early Dynastic III",
"date_range": "2600-2350 BCE",
"terms": {
"lugal": {"freq": 0.45, "docs": 234},
"en": {"freq": 0.35, "docs": 178},
"ensi": {"freq": 0.20, "docs": 89}
}
},
{
"period": "Ur III",
"date_range": "2112-2004 BCE",
"terms": {
"lugal": {"freq": 0.65, "docs": 1247},
"ensi": {"freq": 0.30, "docs": 456},
"en": {"freq": 0.05, "docs": 98}
}
},
{
"period": "Old Babylonian",
"date_range": "2000-1600 BCE",
"terms": {
"šarru": {"freq": 0.75, "docs": 892},
"rubû": {"freq": 0.15, "docs": 178},
"lugal": {"freq": 0.10, "docs": 124}
}
}
]
}
Semantic Drift Detection:
{
"term": "bēlu",
"original_meaning": "lord, owner",
"semantic_evolution": [
{
"period": "Old Babylonian",
"primary_sense": "owner (of property)",
"secondary_senses": ["master (of slaves)", "lord (generic)"]
},
{
"period": "Middle Babylonian",
"primary_sense": "lord (divine/royal)",
"secondary_senses": ["master", "husband"],
"note": "increased_prestige"
},
{
"period": "Neo-Assyrian",
"primary_sense": "lord (primarily divine)",
"secondary_senses": ["Bel (= Marduk)"],
"note": "specialized_to_deity"
}
]
}
Regional Variation Analysis:
{
"feature": "genitive_construction",
"regions": [
{
"region": "Southern Babylonia",
"preferred_form": "status_constructus",
"example": "šar māti",
"frequency": 0.85
},
{
"region": "Northern Babylonia",
"preferred_form": "ša_construction",
"example": "šarru ša māti",
"frequency": 0.65
},
{
"region": "Assyria",
"preferred_form": "mixed",
"note": "both_constructions_common"
}
]
}
Formulaic Pattern Detection:
{
"pattern": "royal_epithet_sequence",
"structure": "KING + DESCRIPTOR + PLACE + DEITY",
"examples": [
{
"text": "Hammurabi šar Bābili migir Marduk",
"gloss": "Hammurabi, king of Babylon, beloved of Marduk",
"period": "Old Babylonian",
"frequency": "very_common"
},
{
"text": "Nabonidus šar Bābili zānin Esagil",
"gloss": "Nabonidus, king of Babylon, provider for Esagil",
"period": "Neo-Babylonian",
"frequency": "common"
}
],
"variation_points": [
"deity_name",
"city_name",
"epithet_choice"
]
}
┌─────────────────────────────────────────────┐
│ Query Interface │
│ - Natural language input │
│ - Structured query builder │
│ - Saved searches │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Query Parser & Optimizer │
│ - NL → structured query │
│ - Query expansion (synonyms) │
│ - Execution planning │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Semantic Layer │
│ - Concept ontology │
│ - Entity resolution │
│ - Multilingual mapping │
│ - Cultural context │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Search Engine │
│ - Inverted indexes (words) │
│ - Graph indexes (relationships) │
│ - Temporal indexes (chronology) │
│ - Spatial indexes (geography) │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Corpus Database │
│ - Transliterations │
│ - Lemmatization │
│ - Morphological analysis │
│ - Metadata │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Results & Visualization │
│ - Concordances (KWIC) │
│ - Frequency charts │
│ - Geographic distribution │
│ - Timeline views │
│ - Export options │
└─────────────────────────────────────────────┘
Question: "How did conceptions of divine justice change from Sumerian to Akkadian periods?"
Query sequence:
- Find all texts mentioning justice deities (Utu/Shamash, Kittu, Mīšaru)
- Filter by period (ED III vs. OB)
- Extract co-occurring concepts
- Compare semantic networks
- Visualize concept evolution
Results:
- Quantitative shift from Sumerian to Akkadian terminology
- New concepts emerge (mīšarum = equity, andurārum = freedom)
- Justice becomes more explicitly royal/political vs. cosmic
Question: "What were standard silver prices for commodities in Ur III period?"
Query:
FIND transactions WHERE
period = 'Ur III'
AND contains(commodity(type='grain' OR type='oil' OR type='wool'))
AND contains(payment(commodity='silver'))
GROUP BY commodity.type, region
AGGREGATE avg(payment.amount), std_dev(payment.amount)
Results:
- Average prices by commodity and region
- Price variation over time
- Regional price differences
- Inflation/deflation patterns
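The `GROUP BY ... AGGREGATE` step of this query corresponds to an ordinary grouped mean and standard deviation. The sketch below shows that step in isolation using only the standard library. The field names (`silver_shekels`, `region`) and all figures are invented for illustration and are not historical prices.

```python
from collections import defaultdict
from statistics import mean, stdev

def price_summary(transactions):
    """Group silver payments by (commodity, region); report mean and sample std dev."""
    groups = defaultdict(list)
    for t in transactions:
        groups[(t["commodity"], t["region"])].append(t["silver_shekels"])
    return {
        key: {
            "n": len(values),
            "mean": mean(values),
            # A sample standard deviation needs at least two data points.
            "std_dev": stdev(values) if len(values) > 1 else 0.0,
        }
        for key, values in groups.items()
    }

# Invented figures for illustration only.
transactions = [
    {"commodity": "grain", "region": "Umma", "silver_shekels": 1.0},
    {"commodity": "grain", "region": "Umma", "silver_shekels": 2.0},
    {"commodity": "grain", "region": "Umma", "silver_shekels": 3.0},
    {"commodity": "wool", "region": "Girsu", "silver_shekels": 4.0},
]
summary = price_summary(transactions)
```

Reporting `n` alongside each average matters here: a mean drawn from a single surviving tablet should not be read with the same confidence as one drawn from hundreds.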
Question: "Find intertextual references between creation myths"
Query:
FIND passages WHERE
genre = 'myth'
AND theme = 'creation'
AND contains(formulaic_pattern)
ORDER BY similarity_score DESC
Results:
- Shared formulae across compositions
- Textual dependencies (which text borrowed from which)
- Regional variation in mythological traditions
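The query leaves `similarity_score` undefined. One simple candidate, assumed here purely for illustration, is the Jaccard overlap of word n-grams between two passages, which rewards shared formulae while tolerating differences elsewhere.

```python
def ngrams(tokens, n=3):
    """Set of consecutive n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def similarity_score(passage_a, passage_b, n=3):
    """Jaccard overlap of word n-grams, between 0.0 and 1.0."""
    a = ngrams(passage_a.split(), n)
    b = ngrams(passage_b.split(), n)
    if not a or not b:
        return 0.0  # passages shorter than n words share nothing measurable
    return len(a & b) / len(a | b)

p1 = "ša nagba īmuru adi kalâma"
p2 = "ša nagba īmuru ana kalâma"
trigram_score = similarity_score(p1, p2)
bigram_score = similarity_score(p1, p2, n=2)
```

The two lines differ in one word, giving a trigram score of 0.2 and a bigram score of one third. A production system would more likely compare lemmas than surface forms, so that orthographic variation does not hide a shared formula.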
Current limitation: Scholars can only find what they already know to look for. String search requires knowing exact words.
With semantic search:
- Discovery: Find unexpected connections
- Synthesis: Answer questions requiring corpus-wide analysis
- Efficiency: Months of manual reading → minutes of querying
- Rigor: Quantitative backing for scholarly claims
- Accessibility: Non-experts can explore corpus
Example transformation:
- Old way: "I think kingship terminology changed in this period" → read 100 texts manually → publish impression
- New way: Query corpus → generate frequency tables → statistical significance testing → publish data-driven conclusions
- Query volume: 10,000+ semantic searches per month
- User base: 500+ active researchers
- Publications: 50+ papers citing search results
- Discovery: 100+ "unexpected connections" reported
- Pedagogy: Integration into graduate curricula
NLP for dead languages:
- No native speakers for testing
- Limited training data
- Complex morphology
- Ambiguous writing system
Ontology construction:
- Requires deep domain expertise
- Cultural concepts may not translate
- Historical concepts evolve
- Multiple scholarly traditions disagree
Performance:
- 340,000+ CDLI texts
- Millions of words
- Complex relationships
- Real-time queries needed
Suggested stack:
- Elasticsearch for full-text search
- Neo4j for semantic relationships
- FastText embeddings for word similarity
- SpaCy custom models for NER
- GraphQL for flexible querying
Glintstone status: schema - initial framework designed
Addressed by three subsystems:
- Discussion threads: 5 per-entity thread tables (token_reading, lemmatization, translation, fragment_join, scholarly_annotation) with strict FKs + shared discussion_posts with typed contributions (observation, counterargument, evidence, question, synthesis, endorsement)
- Fragment joins: join_groups for N-way reconstruction + fragment_joins pairwise links with proposed/verified/accepted/rejected pipeline + evidence + decisions
- Scholarly annotations: scholarly_annotations with strict nullable FK targeting (artifact, surface, line, token, sign, composite) + CHECK constraint + W3C Web Annotation export view (scholarly_annotations_w3c)
See: data-model/glintstone-schema.yaml (ANNOTATION ECOSYSTEM section)
Remaining: External annotation harvesting/aggregation, Hypothes.is integration, ActivityPub federation, annotation DOI minting, reputation scoring (deferred), browser extension, annotation overlay UI
Scholarly annotations are trapped in incompatible silos:
- Printed books: Footnotes, apparatus criticus, commentary (inaccessible)
- Digital editions: Inline ATF comments (format-locked)
- Museum catalogs: Database fields (institution-locked)
- Journal articles: Scattered across PDFs (un-linkable)
- Personal research: Private notes in Notion, Obsidian, Evernote (isolated)
- Email discussions: Informal exchanges (ephemeral)
- Conference presentations: Oral remarks (unrecorded)
Scenario: You're reading a text about barley rations
What exists but you can't access:
- Scholar A annotated grain measures in a 1985 article (paywalled)
- Scholar B has notes on administrative terminology in their personal database
- Scholar C discussed parallel texts in an email thread
- Scholar D presented new interpretation at a conference
- Museum curator knows this tablet was recently re-photographed
- Graduate student has spotted a join with another fragment
Result: Each person reinvents the wheel, duplicating effort, missing insights.
Build "Web Annotations for Cuneiform" - a distributed annotation ecosystem
Base Annotation Structure (W3C Web Annotation Model):
{
"@context": "http://www.w3.org/ns/anno.jsonld",
"id": "https://anno.cuneiform.org/anno/12345",
"type": "Annotation",
"creator": {
"id": "https://orcid.org/0000-0001-2345-6789",
"name": "Dr. Jane Smith",
"affiliation": "University of Chicago"
},
"created": "2024-03-15T14:32:00Z",
"modified": "2024-03-20T09:15:00Z",
"motivation": "commenting",
"target": {
"source": "https://cdli.ucla.edu/P363653",
"selector": {
"type": "FragmentSelector",
"conformsTo": "http://tools.ietf.org/rfc/rfc3236",
"value": "line=1&signs=3-5"
}
},
"body": {
"type": "TextualBody",
"value": "Standard OB epithet, cf. Hammurabi Prologue i:12",
"format": "text/plain",
"language": "en"
}
}
Cuneiform-Specific Extensions:
{
"target": {
"source": "cuneiform:P363653",
"specificity": "line",
"selector": {
"line": 1,
"column": null,
"signs": [3, 4, 5],
"words": [2],
"text": "an-ki-a"
}
},
"body": {
"type": "commentary",
"content": "Standard OB epithet, cf. Hammurabi Prologue i:12",
"annotation_class": "parallel_passage",
"references": [
{"type": "publication", "doi": "10.1234/journal.1985.456"},
{"type": "text", "cdli": "P470987", "line": 12}
],
"tags": [
"divine_epithet",
"formulaic",
"old_babylonian_royal_style"
],
"confidence": 0.9,
"visibility": "public",
"license": "CC-BY-4.0"
}
}
Annotation Types:
{
"annotation_types": [
{
"type": "textual_criticism",
"subtypes": [
"alternative_reading",
"collation_note",
"restoration_proposal",
"damage_assessment"
]
},
{
"type": "commentary",
"subtypes": [
"parallel_passage",
"historical_context",
"cultural_explanation",
"linguistic_note"
]
},
{
"type": "translation",
"subtypes": [
"alternative_translation",
"translation_note",
"idiom_explanation"
]
},
{
"type": "metadata",
"subtypes": [
"bibliography",
"photograph_note",
"museum_information",
"conservation_note"
]
},
{
"type": "connection",
"subtypes": [
"fragment_join",
"duplicate_text",
"related_text",
"intertextual_reference"
]
}
]
}
Multi-Source Harvesting:
┌─────────────────────────────────────┐
│ Annotation Sources │
├─────────────────────────────────────┤
│ • Published articles (via DOI) │
│ • CDLI inline notes │
│ • Oracc project notes │
│ • Museum databases (APIs) │
│ • Personal annotation servers │
│ • Hypothesis/Hypothes.is │
│ • Collaborative platforms │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Annotation Aggregator │
│ - Harvest from sources │
│ - Normalize formats │
│ - Deduplicate │
│ - Resolve conflicts │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Annotation Index │
│ - By text (P-number) │
│ - By line │
│ - By author │
│ - By type │
│ - By date │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ User Interface │
│ - Overlay on any edition │
│ - Filter by criteria │
│ - Search annotations │
│ - Export/cite │
└─────────────────────────────────────┘
Annotation Overlay UI:
┌──────────────────────────────────────────────────┐
│ P363653 - BM 055488 [⚙] │
├──────────────────────────────────────────────────┤
│ Filters: [All Authors ▾] [All Types ▾] [2019-▾] │
│ [Sort by: Date ▾] │
├──────────────────────────────────────────────────┤
│ Line 1: {d}utu lugal an-ki-a │
│ ︿︿︿︿︿ │
│ 📝 3 annotations available │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ 👤 Dr. Jane Smith (2020) [+]│ │
│ │ Type: parallel_passage │ │
│ │ │ │
│ │ Standard OB epithet for Shamash, │ │
│ │ compare Hammurabi Prologue i:12 │ │
│ │ │ │
│ │ References: • RIM E4.3.6.1 │ │
│ │ • P470987:12 │ │
│ │ Tags: divine_epithet, formulaic │ │
│ │ Confidence: ⭐⭐⭐⭐⭐ (0.9) │ │
│ │ [View full] [Cite] [Reply] │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ 👤 Prof. John Doe (2018) [+]│ │
│ │ Type: linguistic_note │ │
│ │ │ │
│ │ Genitive construct, literally "of │ │
│ │ heaven and earth" │ │
│ │ [View full] [Cite] [Reply] │ │
│ └──────────────────────────────────────────┘ │
│ │
│ 🤖 ML Suggestion (2024) │
│ Similar epithet in 47 other texts │
│ [View pattern analysis] │
└──────────────────────────────────────────────────┘
Filter by Annotator Reputation:
{
"annotator": "smith-jane-univ-chicago",
"reputation_score": 0.87,
"metrics": {
"annotations_created": 347,
"annotations_cited": 89,
"upvotes": 234,
"corrections_accepted": 0.92,
"specialization": [
"Old_Babylonian_royal_inscriptions",
"divine_epithets",
"Sippar_texts"
],
"peer_endorsements": 12,
"publication_record": [
"https://doi.org/10.1234/journal",
"https://doi.org/10.5678/monograph"
]
}
}
Annotation Threads (Conversations):
{
"annotation_thread": {
"root_annotation": "anno:12345",
"participants": [
"smith-jane",
"doe-john",
"chen-li"
],
"thread": [
{
"author": "smith-jane",
"date": "2024-03-15",
"content": "Standard epithet, compare Hammurabi i:12"
},
{
"author": "doe-john",
"date": "2024-03-16",
"type": "reply",
"content": "Actually, this version uses genitive construction where Hammurabi uses apposition. Significant?"
},
{
"author": "smith-jane",
"date": "2024-03-17",
"type": "reply",
"content": "Good catch! Regional variation? This is from Sippar, Hammurabi from Babylon"
},
{
"author": "chen-li",
"date": "2024-03-18",
"type": "reply",
"content": "See also discussion in Richardson 2005:234 on North/South grammatical differences"
}
],
"resolution": {
"status": "consensus",
"synthesis": "Regional grammatical variation in standard formulae",
"updated_annotation": "anno:12345-v2"
}
}
}
Annotation Versioning:
{
"annotation_family": "anno:12345",
"versions": [
{
"version": "1.0",
"date": "2024-03-15",
"author": "smith-jane",
"content": "Standard OB epithet",
"status": "superseded"
},
{
"version": "1.1",
"date": "2024-03-17",
"author": "smith-jane",
"content": "Standard OB epithet with regional grammatical variation (Sippar style)",
"changes": "Added regional context based on discussion",
"status": "superseded"
},
{
"version": "2.0",
"date": "2024-03-20",
"author": "smith-jane",
"co_authors": ["doe-john", "chen-li"],
"content": "Standard OB divine epithet showing northern Babylonian grammatical features...",
"status": "current",
"citations": ["Richardson 2005:234"]
}
]
}
LLM-Generated Annotation Synthesis:
{
"synthesis_request": {
"target": "P363653:1",
"sources": ["anno:12345", "anno:67890", "anno:11111"],
"model": "claude-sonnet-4",
"prompt": "Synthesize these three scholarly perspectives on line 1"
},
"synthesis": {
"summary": "Scholars agree this is a standard divine epithet for Shamash used in Old Babylonian royal inscriptions. Smith (2020) notes it appears in Hammurabi's prologue, while Doe (2018) emphasizes the genitive construction. Chen (2019) provides broader context showing this formula is particularly common in texts from Sippar, Shamash's cult center. The grammatical variation noted by Doe may represent regional dialectal differences between northern and southern Babylonia.",
"points_of_agreement": [
"Standard formulaic language",
"Old Babylonian period",
"Royal inscription context"
],
"points_of_disagreement": [],
"gaps": [
"Chronological development of formula",
"Comparison with Neo-Babylonian usage"
],
"suggested_research": [
"Statistical analysis of regional variation",
"Diachronic study of epithet evolution"
]
}
}
DOI for Annotations:
{
"annotation": "anno:12345",
"doi": "10.5555/anno.12345",
"citation": {
"apa": "Smith, J. (2024). Annotation on P363653:1 [Annotation]. Cuneiform Annotation Corpus. https://doi.org/10.5555/anno.12345",
"chicago": "Smith, Jane. 2024. Annotation on P363653:1. Cuneiform Annotation Corpus. https://doi.org/10.5555/anno.12345.",
"bibtex": "@misc{smith2024anno12345,\n author = {Smith, Jane},\n title = {Annotation on P363653:1},\n year = {2024},\n publisher = {Cuneiform Annotation Corpus},\n doi = {10.5555/anno.12345}\n}"
},
"citation_count": 7,
"cited_by": [
{"doi": "10.1234/article.2024.1", "context": "As Smith notes..."},
{"doi": "10.5678/monograph.2024", "context": "Following Smith's observation..."}
]
}
Granular Attribution:
{
"scholarly_contribution": {
"type": "observation",
"contributor": "smith-jane",
"date": "2024-03-15",
"scope": "single_line_interpretation",
"impact": {
"direct_citations": 7,
"influence_score": 0.12,
"reuse_count": 23,
"correction_rate": 0.0
},
"recognition": "micro_contribution",
"credit_accrual": "cumulative"
}
}
Scholar A → Publication → Scholar B reads → Scholar B adds → Publication
↓
Scholar C reads → ...
Problems:
- Each iteration takes years
- Knowledge locked in publications
- Duplicated effort
- Lost insights (emails, conferences)
- High barrier to contribution
┌─ Scholar A annotation
│ ├─ Scholar B reply
│ └─ Scholar C addition
Text ───┼─ Scholar D alternative reading
│ └─ Community discussion → consensus
└─ ML suggestion
└─ Expert verification
Benefits:
- Real-time collaboration
- Incremental contributions valued
- Expertise discoverable
- Learning accelerated
- Attribution granular
┌────────────────────────────────────────────────┐
│ Client Layer │
│ - Browser extension │
│ - Web interface │
│ - API clients │
└────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────┐
│ Annotation Server │
│ - CRUD operations │
│ - Authentication │
│ - Authorization │
│ - Validation │
└────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────┐
│ Aggregation Layer │
│ - Harvest external sources │
│ - Normalize formats │
│ - Deduplication │
│ - Conflict resolution │
└────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────┐
│ Storage │
│ - Annotation database │
│ - Version history │
│ - Media attachments │
│ - Index (text, author, type, date) │
└────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────┐
│ Services │
│ - Search & discovery │
│ - Reputation scoring │
│ - Citation tracking │
│ - Notification │
│ - Export │
└────────────────────────────────────────────────┘
- Adoption: 1,000+ active annotators
- Coverage: 50,000+ texts with annotations
- Volume: 100,000+ annotations
- Collaboration: 500+ annotation threads
- Citations: 100+ scholarly publications citing annotations
- Integration: 5+ major projects using annotation API
- W3C Web Annotation Data Model (standard)
- Hypothes.is (existing annotation platform to build on)
- ORCID (author identification)
- DOI (annotation citation)
- JSON-LD (linked data)
- ActivityPub (federation protocol)
- WebSub (real-time updates)
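As a rough illustration of how the first item in this list could be used, the sketch below builds one annotation in the shape of the W3C Web Annotation Data Model. The annotation IRI and the ORCID value are placeholders, and attaching the annotation to a whole edition URL (rather than a line-level selector) is a simplifying assumption.

```python
import json


def to_web_annotation(anno_id, creator_id, target_url, text, motivation="commenting"):
    """Build a minimal Web Annotation (JSON-LD) as a Python dict."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": motivation,
        "creator": {"id": creator_id, "type": "Person"},
        "body": {
            "type": "TextualBody",
            "value": text,
            "format": "text/plain",
            "language": "en",
        },
        "target": target_url,
    }


doc = to_web_annotation(
    anno_id="https://example.org/anno/12345",             # placeholder IRI
    creator_id="https://orcid.org/0000-0000-0000-0000",   # placeholder ORCID
    target_url="https://oracc.org/obmc/P363653",
    text="Standard OB epithet for Shamash, compare Hammurabi Prologue i:12",
)
serialized = json.dumps(doc, ensure_ascii=False, indent=2)
```

Because the result is ordinary JSON-LD, the same document could be stored, federated, or exported without a project-specific format.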
Glintstone status: schema - initial framework designed
Addressed by three subsystems:
- Named entity registry: named_entities (person, deity, place, institution, work, etc. extracted from ORACC POS tags) + entity_aliases for merge tracking + entity_mentions linking tokens/lines to entities with provenance + evidence + decisions. ~1,836 entities, ~4,995 mentions from current corpus
- Entity relationships: relationship_predicates (~13 Tier 1 stable predicates: father_of, patron_deity_of, located_in, etc.) + entity_relationships with hybrid predicate governance (Tier 1 enum FK + Tier 2-3 freetext) + temporal scope via period names + evidence + decisions
- Authority reconciliation: authority_links mapping entities to Wikidata, VIAF, Pleiades, GeoNames, PeriodO + authority_reconciliation_disputes for contested links
See: data-model/glintstone-schema.yaml (KNOWLEDGE GRAPH section)
Remaining: SPARQL/GraphQL query endpoint, visual query builder, RDF export, geographic/network/timeline visualizations, inference rules, full triple store (current design is relational graph, not RDF-native)
Cuneiform data exists in isolated silos with no structural connections:
Separate databases:
- CDLI: Catalog metadata (340,000+ texts)
- Oracc: Scholarly editions (various projects)
- Museum databases: Physical object information
- ePSD: Sumerian dictionary
- CAD: Akkadian dictionary
- Sign lists: ABZ, MZL, aBZL
- Prosopography: Person name databases
- Geography: Place name gazetteers
- Chronology: King lists, date formulae
None of these interoperate structurally.
Research question: "Who were Hammurabi's contemporaries?"
Current manual process:
- Look up Hammurabi's reign dates (where? Wikipedia? CAD?)
- Find other rulers in same period (multiple sources, inconsistent dates)
- Search CDLI for texts mentioning each ruler
- Check which texts are contemporary (chronology debates)
- Read texts to understand relationships
- Manually compile results
- Time required: Days to weeks
With knowledge graph: Single query, seconds.
Build a unified knowledge graph linking all cuneiform resources
# RDF/Turtle notation for clarity
# Tablets
:P363653 a cuneiform:Tablet ;
cdli:period "Old Babylonian" ;
cdli:provenience :Sippar ;
cdli:collection :BritishMuseum ;
cdli:museumNumber "BM 055488" ;
cdli:genre :RoyalInscription ;
cdli:material "clay" ;
cdli:dimensions "8.2x5.1x2.3"^^units:cm ;
cdli:transliteration <https://oracc.org/obmc/P363653/atf> ;
cdli:images :Image_P363653_obv, :Image_P363653_rev .
# Persons
:Hammurabi a person:King ;
person:name "Hammurabi"@en, "Ḫammurāpi"@akk ;
person:reignStart "-1792"^^xsd:gYear ; # 1792 BCE, written as a negative year
person:reignEnd "-1750"^^xsd:gYear ; # 1750 BCE
person:dynasty :FirstDynastyOfBabylon ;
person:capital :Babylon ;
person:father :SinMuballit ;
person:contemporary :RimSin, :Shamshi-Adad ;
person:mentionedIn :P363653, :P470987 ;
owl:sameAs dbpedia:Hammurabi, viaf:54944639 .
# Deities
:Shamash a deity:God ;
deity:sumerian "Utu" ;
deity:akkadian "Šamaš" ;
deity:domain "sun"@en, "justice"@en, "divination"@en ;
deity:spouse :Aya ;
deity:cultCenter :Sippar, :Larsa ;
deity:temple :Ebabbar ;
deity:symbol "sun-disk"@en, "saw"@en ;
deity:mentionedIn :P363653 ;
owl:sameAs wikidata:Q207949 .
# Places
:Sippar a geo:City ;
geo:modernName "Tell Abu Habbah" ;
geo:coordinates "POINT(44.3 33.1)"^^geo:wktLiteral ; # WKT order: longitude latitude
geo:period "ED III - Late Babylonian" ;
geo:patron_deity :Shamash ;
geo:major_temple :Ebabbar ;
geo:texts_from :P363653, :P123456 ;
owl:sameAs geonames:99155, pleiades:893986 .
# Works (literary compositions)
:Gilgamesh_Work a work:LiteraryWork ;
work:title "Epic of Gilgamesh"@en ;
work:genre :Epic ;
work:language "Akkadian", "Sumerian" ;
work:protagonist :Gilgamesh_Character ;
work:manuscripts :P363653, :P470987, :P789012 ;
work:editions "George 2003", "Foster 2001" .
# Words (lexical entries)
:sharru_1 a lex:Lemma ;
lex:language "Akkadian" ;
lex:writing "šarru" ;
lex:pos "noun" ;
lex:gender "masculine" ;
lex:meaning "king"@en ;
lex:sumerian_equivalent :lugal_1 ;
lex:related :rubû_1, :bēlu_1 ;
lex:attestations 8472 ;
lex:dictionary_entry :CAD_Š_1_120 .
# Textual relationships
:P363653 cuneiform:duplicate :P470987 ;
cuneiform:parallel :P789012 ;
cuneiform:quotes :P654321 ;
cuneiform:copied_from :P111111 .
# Personal relationships
:Hammurabi person:father :SinMuballit ;
person:successor :Samsuiluna ;
person:contemporary :RimSin ;
person:defeated :RimSin ;
person:vassal :Zimri-Lim .
# Geographical relationships
:Sippar geo:located_in :Mesopotamia ;
geo:controlled_by :Hammurabi ;
geo:trade_partner :Dilmun ;
geo:distance_from (:Babylon "85 km") .
# Institutional relationships
:Ebabbar temple:dedicated_to :Shamash ;
temple:located_in :Sippar ;
temple:rebuilt_by :Hammurabi ;
temple:priest :Shamash-mudammiq .
# Semantic relationships
:sharru_1 lex:synonym :lugal_1 ;
lex:hypernym :ruler ;
lex:related :kingship ;
lex:derives_from :Proto-Semitic_šarr .
# SPARQL query language
SELECT ?ruler ?name ?capital ?reignStart ?reignEnd
WHERE {
# Hammurabi's reign
:Hammurabi person:reignStart ?hamStart ;
person:reignEnd ?hamEnd .
# Find other rulers whose reigns overlap
?ruler a person:King ;
person:name ?name ;
person:capital ?capital ;
person:reignStart ?reignStart ;
person:reignEnd ?reignEnd .
# Reign overlap logic
FILTER(?reignStart <= ?hamEnd && ?reignEnd >= ?hamStart)
FILTER(?ruler != :Hammurabi)
}
ORDER BY ?reignStart
Results:
- Rim-Sin of Larsa (1822-1763 BCE)
- Shamshi-Adad I of Assyria (1808-1776 BCE)
- Zimri-Lim of Mari (1775-1761 BCE)
- Etc.
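The overlap test in the FILTER clause is an ordinary interval-intersection check. The sketch below reproduces it outside SPARQL, assuming BCE years are stored as negative integers so that earlier dates sort first; the reign dates are the ones listed above, and the final entry is a made-up ruler included only to show a non-match.

```python
# Reigns as (start, end) with BCE years negated, e.g. 1792 BCE -> -1792.
REIGNS = {
    "Hammurabi": (-1792, -1750),
    "Rim-Sin of Larsa": (-1822, -1763),
    "Shamshi-Adad I of Assyria": (-1808, -1776),
    "Zimri-Lim of Mari": (-1775, -1761),
    "Hypothetical later ruler": (-1700, -1690),  # illustrative only
}


def overlaps(a, b):
    """True when two closed intervals share at least one year."""
    (start_a, end_a), (start_b, end_b) = a, b
    return start_a <= end_b and end_a >= start_b


def contemporaries(name, reigns):
    """Rulers whose reign overlaps the named ruler's, ordered by reign start."""
    target = reigns[name]
    found = [
        other for other, span in reigns.items()
        if other != name and overlaps(target, span)
    ]
    return sorted(found, key=lambda other: reigns[other][0])


result = contemporaries("Hammurabi", REIGNS)
```

The negative-year convention is what makes `<=` and `>=` behave correctly; comparing strings such as "1792 BCE" would sort in the wrong direction.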
SELECT ?text ?textTitle ?deity ?place
WHERE {
?text cdli:mentionsDeity :Marduk ;
cdli:mentionsDeity ?deity ;
cdli:mentionsPlace ?place ;
cdli:period "Old Babylonian" .
?deity deity:domain "creation"@en .
?place geo:controlled_by :Hammurabi .
OPTIONAL { ?text dc:title ?textTitle }
}
Results: Texts mentioning Marduk with other creation deities, from Hammurabi's empire
SELECT ?period ?term ?frequency
WHERE {
?term lex:meaning "king"@en ;
lex:language "Akkadian" ;
lex:attestations_by_period ?attestation .
?attestation period:name ?period ;
period:count ?frequency .
}
ORDER BY ?period
Results: How different words for "king" rise and fall over time
SELECT ?text ?period ?genre (COUNT(?pub) AS ?publications)
WHERE {
?text cdli:period ?period ;
cdli:genre ?genre ;
cdli:preservation "good" .
OPTIONAL { ?text cdli:published_in ?pub }
FILTER NOT EXISTS { ?text oracc:edition ?edition }
}
GROUP BY ?text ?period ?genre
HAVING (COUNT(?pub) < 2)
ORDER BY ?period
Results: Well-preserved texts that lack modern editions
CONSTRUCT {
?person1 network:corresponded_with ?person2
}
WHERE {
?text cdli:genre :Letter ;
cdli:sender ?person1 ;
cdli:recipient ?person2 ;
cdli:period "Old Babylonian" .
}
Results: Graph of who wrote to whom (Mari letters, etc.)
Map showing:
- Dots = text findspots
- Color = period
- Size = number of texts
- Click → filter by location
Graph showing:
- Nodes = persons
- Edges = relationships (correspondence, family, political)
- Color = role (king, scribe, merchant)
- Layout = force-directed
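Before any layout is drawn, the node-and-edge structure described here can be assembled from sender/recipient pairs such as those produced by the correspondence query. The sketch below uses placeholder correspondents ("A" to "D") rather than attested persons, and degree centrality is just one possible metric.

```python
from collections import defaultdict

# Placeholder (sender, recipient) pairs; not attested correspondence.
letters = [("A", "B"), ("A", "C"), ("B", "A"), ("D", "A")]


def build_network(pairs):
    """Undirected graph: each person maps to the set of their correspondents."""
    graph = defaultdict(set)
    for sender, recipient in pairs:
        graph[sender].add(recipient)
        graph[recipient].add(sender)
    return graph


def degree_centrality(graph):
    """Share of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


network = build_network(letters)
centrality = degree_centrality(network)
hub = max(centrality, key=centrality.get)
```

A force-directed layout library would consume the same adjacency structure; the centrality scores could drive node size.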
Timeline showing:
- Horizontal axis = time
- Vertical tracks = rulers
- Events = military campaigns, building projects
- Texts plotted by date
River chart showing:
- Flow = term frequency over time
- Width = usage volume
- Branches = semantic drift
- Color = meaning category
# Link to broader knowledge graphs
:Hammurabi owl:sameAs
dbpedia:Hammurabi ,
wikidata:Q36359 ,
viaf:54944639 ,
loc:n79054196 .
:Sippar owl:sameAs
geonames:99155 ,
pleiades:893986 ,
wikidata:Q579967 .
:Gilgamesh_Work owl:sameAs
wikidata:Q37112 ,
viaf:185892565 .
Benefits:
- Inherit external knowledge
- Cross-reference with other domains
- Leverage existing tools (Wikidata query service)
- Contribute back to global knowledge commons
┌──────────────────────────────────────────┐
│ Query Interface │
│ - SPARQL endpoint │
│ - GraphQL API │
│ - Visual query builder │
└──────────────────────────────────────────┘
↓
┌──────────────────────────────────────────┐
│ Graph Database │
│ - Triple store (RDF) │
│ - Property graph (Neo4j) │
│ - Hybrid approach │
└──────────────────────────────────────────┘
↓
┌──────────────────────────────────────────┐
│ ETL Pipeline │
│ - CDLI harvest │
│ - Oracc import │
│ - Museum APIs │
│ - Dictionary conversion │
│ - Reconciliation │
└──────────────────────────────────────────┘
↓
┌──────────────────────────────────────────┐
│ External Data Sources │
│ - CDLI │
│ - Oracc projects │
│ - Museum collections │
│ - ePSD, CAD │
│ - Wikidata, GeoNames, VIAF │
└──────────────────────────────────────────┘
Core Classes:
- cuneiform:Tablet
- cuneiform:Inscription
- cuneiform:LiteraryWork
- person:Person
  - person:King, person:Scribe, person:Merchant
- deity:Deity
- geo:Place, geo:City, geo:Region
- institution:Temple, institution:Palace
- lex:Lemma, lex:Sign
- event:Event, event:Battle, event:Construction
Core Properties:
- cdli:period, cdli:provenience, cdli:genre
- person:reignStart, person:reignEnd
- geo:coordinates, geo:modern_name
- work:manuscript, work:edition
- lex:meaning, lex:attestations
Inference Rules:
# If two rulers' reigns overlap, they are contemporaries
{
?ruler1 person:reignStart ?s1 ; person:reignEnd ?e1 .
?ruler2 person:reignStart ?s2 ; person:reignEnd ?e2 .
FILTER(?s1 <= ?e2 && ?e1 >= ?s2)
FILTER(?ruler1 != ?ruler2)
}
=>
{
?ruler1 person:contemporary ?ruler2 .
}
# If a text is from a place during a ruler's reign, the ruler controlled it
{
?text cdli:provenience ?place ;
cdli:date ?date .
?ruler person:reignStart ?start ;
person:reignEnd ?end ;
person:capital ?capital .
FILTER(?date >= ?start && ?date <= ?end)
}
=>
{
?place geo:controlled_by ?ruler .
}
Current limitations:
- Every query requires manual integration across databases
- No way to explore relationships computationally
- Insights depend on what researcher already knows
- Network analysis impossible
With knowledge graph:
- Discovery: Find unexpected connections
- Computational analysis: Network metrics, centrality, clustering
- Hypothesis testing: "Did trade correlate with political alliances?"
- Visualization: See patterns invisible in text
- Scalability: Analyze 100,000+ texts simultaneously
Research unlocked:
- Prosopographical networks
- Trade route reconstruction
- Chronological refinement
- Genre evolution
- Intertextuality mapping
- Historical geography
- Social structure analysis
- Entities: 1M+ entities (tablets, persons, places, words)
- Relationships: 10M+ edges
- Queries: 1,000+ queries per day
- Users: 500+ researchers
- Publications: 50+ papers using graph queries
- External links: Integration with Wikidata, GeoNames, VIAF
Triple Store Options:
- Apache Jena (mature, SPARQL 1.1)
- GraphDB (commercial, good inference)
- Blazegraph (used by Wikidata)
Property Graph Options:
- Neo4j (most popular, Cypher query language)
- Amazon Neptune (managed service)
- JanusGraph (distributed, scalable)
Hybrid Approach:
- Use both RDF and property graph
- RDF for ontology, reasoning, external linking
- Property graph for performance, complex traversals
Tooling:
- RDF4J (Java framework)
- rdflib (Python)
- SHACL (schema validation)
- OWL (ontology language)
- GeoSPARQL (spatial queries)
Glintstone status: schema - initial framework designed
Addressed by three subsystems:
- Identity concordance: artifact_identifiers N:1 alias table (museum_no, excavation_no, accession_no, publication, ARK, etc.) with normalized lookup index + evidence + decisions. Seeded from ~353k artifacts
- Publication registry: publications with series membership (RIME, SAA, RINAP, VAB), supersedes chains, DOI/BibTeX + publication_authors linking to scholars table
- Edition bridge: artifact_editions linking artifacts to publications with page/plate refs, edition_type (full_edition, hand_copy, photograph_only, etc.), is_current_edition consensus flag, per-artifact supersedes chain + evidence + decisions
- Annotation_runs integration: publication_id FK added alongside freetext publication_ref for incremental migration
See: data-model/glintstone-schema.yaml (CITATION RESOLUTION section + Layer 0)
Remaining: Fuzzy citation parser, resolver API endpoint, citation extraction from PDFs, Handle System / ARK minting, OpenURL context-sensitive linking, UI for "show me all editions of this text"
Cuneiform citation is chaotic and incompatible:
Same text, different citation schemes:
- RIME 4.3.6.1 (Royal Inscriptions of Mesopotamia, Early Periods)
- BM 055488 (British Museum accession number)
- P363653 (CDLI catalog number)
- VAB 6, 1 (Vorderasiatische Bibliothek, old publication)
- Hammurabi 1 (Sequence in ruler's corpus)
- Oracc/OBMC P363653 (Oracc project reference)
Problems:
- Ambiguity: No guarantee that all of these refer to the same object
- Resolution impossible: Can't programmatically link citations
- Obsolescence: Old publications use outdated numbering
- Discoverability: Can't find latest edition automatically
- Versioning: Multiple editions, unclear which is current
- Cross-reference: No way to find parallel citations
Scenario: Reading an article that cites "VAB 6, 1"
Manual process:
- Find out what "VAB" stands for (Vorderasiatische Bibliothek)
- Locate VAB volume 6 (published 1907, out of print)
- Look up text #1
- Note museum number (if given)
- Search CDLI for museum number
- Find P-number
- Check if newer edition exists
- Time required: 15-60 minutes per citation
Build a "DOI for Cuneiform" - universal citation resolution
Primary Identifier Structure:
cuneiform:P363653
Alias System:
{
"primary_id": "cuneiform:P363653",
"aliases": [
{"type": "museum", "value": "BM 055488", "collection": "British Museum"},
{"type": "publication", "value": "RIME 4.3.6.1", "corpus": "RIME"},
{"type": "publication", "value": "VAB 6.1", "corpus": "VAB"},
{"type": "excavation", "value": "Sippar 1894-S123", "expedition": "Sippar"},
{"type": "oracc", "value": "obmc/P363653", "project": "OBMC"}
],
"canonical_citation": "RIME 4.3.6.1 (= BM 055488 = P363653)"
}
Resolve any citation format:
GET /resolve?cite=VAB+6.1
Response:
{
"query": "VAB 6.1",
"matched": true,
"confidence": 0.95,
"primary_id": "cuneiform:P363653",
"metadata": {
"period": "Old Babylonian",
"genre": "Royal Inscription",
"ruler": "Hammurabi",
"museum": "British Museum",
"museum_number": "BM 055488"
},
"current_edition": {
"title": "RIME 4.3.6.1",
"url": "https://oracc.org/obmc/P363653",
"date": "2010",
"supersedes": ["VAB 6.1"]
},
"images": [
{"type": "photograph", "url": "https://cdli.ucla.edu/dl/photo/P363653.jpg"},
{"type": "line_drawing", "url": "..."}
],
"translations": [
{"language": "en", "url": "...", "author": "Frayne 1990"}
]
}
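Exact alias resolution of this kind mostly comes down to a normalized lookup key, so that "VAB 6, 1" and "VAB 6.1" land on the same entry. The sketch below is one possible normalization, assuming punctuation, case, and leading zeros carry no meaning; the `ALIASES` table and function names are illustrative, not an existing API.

```python
import re

# Aliases taken from the example above; a real table would be far larger.
ALIASES = {
    "BM 055488": "cuneiform:P363653",
    "RIME 4.3.6.1": "cuneiform:P363653",
    "VAB 6.1": "cuneiform:P363653",
    "P363653": "cuneiform:P363653",
}


def normalize_citation(cite):
    """Uppercase, turn punctuation runs into single spaces, drop leading zeros."""
    key = re.sub(r"[\W_]+", " ", cite.upper()).strip()
    return re.sub(r"\b0+(\d)", r"\1", key)


# Build the lookup index once, keyed by the normalized form.
INDEX = {normalize_citation(alias): pid for alias, pid in ALIASES.items()}


def resolve(cite):
    """Return the primary ID for a citation string, or None if unknown."""
    return INDEX.get(normalize_citation(cite))


primary = resolve("VAB 6, 1")
```

Anything that misses this exact index (typos, free-text descriptions) would fall through to a fuzzy stage with a lower confidence value.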
Fuzzy matching for partial citations:
GET /resolve?cite=Hammurabi+inscription+from+Sippar
Response:
{
"query": "Hammurabi inscription from Sippar",
"matched": true,
"confidence": 0.75,
"candidates": [
{
"id": "cuneiform:P363653",
"score": 0.85,
"title": "Building inscription of Hammurabi",
"reason": "Exact match: Hammurabi + Sippar + royal inscription"
},
{
"id": "cuneiform:P470987",
"score": 0.72,
"title": "Stele of Hammurabi",
"reason": "Partial match: Hammurabi + Sippar (found in Sippar, not from)"
}
]
}
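One simple way to rank candidates for a free-text query is to count how many query terms appear in each record's metadata. The sketch below assumes that approach; the tiny catalog, the stopword list, and the scoring formula are all illustrative, and the scores it produces are not meant to reproduce the figures in the sample response.

```python
STOPWORDS = {"from", "of", "the", "in"}

# Minimal illustrative catalog: titles from the response above,
# provenience only where this document states it.
CATALOG = {
    "cuneiform:P363653": {
        "title": "Building inscription of Hammurabi",
        "provenience": "Sippar",
    },
    "cuneiform:P470987": {
        "title": "Stele of Hammurabi",
    },
}


def tokens(text):
    """Lowercased word set with stopwords removed."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def rank(query, catalog):
    """Score each record by the fraction of query terms found in its metadata."""
    wanted = tokens(query)
    scored = []
    for pid, fields in catalog.items():
        record_terms = tokens(" ".join(fields.values()))
        score = len(wanted & record_terms) / len(wanted) if wanted else 0.0
        if score > 0:
            scored.append({"id": pid, "score": round(score, 2)})
    return sorted(scored, key=lambda c: c["score"], reverse=True)


candidates = rank("Hammurabi inscription from Sippar", CATALOG)
```

A production resolver would likely weight fields differently and combine this with string-distance matching, but the ranked-candidates shape of the output is the same.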
Edition Graph:
{
"text": "cuneiform:P363653",
"edition_history": [
{
"date": "1907",
"publication": "VAB 6.1",
"type": "hand_copy",
"author": "Ungnad",
"status": "superseded",
"notes": "First publication, from hand copy only"
},
{
"date": "1990",
"publication": "RIME 4.3.6.1",
"type": "scholarly_edition",
"author": "Frayne",
"status": "superseded",
"improvements": [
"Collation with tablet",
"Improved translation",
"Historical commentary"
]
},
{
"date": "2010",
"publication": "Oracc/OBMC P363653",
"type": "digital_edition",
"author": "OBMC project",
"status": "current",
"url": "https://oracc.org/obmc/P363653",
"improvements": [
"Digital transliteration",
"Lemmatization",
"Inline morphology",
"High-resolution images"
]
}
],
"current_edition": "Oracc/OBMC P363653",
"recommended_citation": "RIME 4.3.6.1 (Oracc/OBMC edition)"
}
Automatic Edition Updates:
{
"subscription": {
"user": "scholar@university.edu",
"texts": ["P363653", "P470987"],
"notify_on": [
"new_edition",
"new_image",
"fragment_join",
"correction"
]
},
"notifications": [
{
"date": "2024-03-15",
"text": "P363653",
"type": "new_image",
"message": "High-resolution RTI scan added to CDLI",
"url": "https://cdli.ucla.edu/P363653"
}
]
}
Track citation networks:
{
"text": "cuneiform:P363653",
"cited_by": [
{
"type": "scholarly_article",
"doi": "10.1234/journal.2020.123",
"authors": ["Smith, J.", "Doe, M."],
"title": "Divine Epithets in Old Babylonian Royal Inscriptions",
"context": "As seen in RIME 4.3.6.1, Hammurabi uses standard formulae..."
},
{
"type": "monograph",
"isbn": "978-1234567890",
"author": "Johnson, K.",
"title": "Kingship in Ancient Mesopotamia",
"page": 145
}
],
"cites": [
{
"text": "P470987",
"relationship": "parallel",
"note": "Similar epithet in different text"
}
],
"citation_count": 47,
"citation_trend": {
"2010-2015": 8,
"2016-2020": 15,
"2021-2024": 24
}
}
Parallel Text Network:
{
"text": "cuneiform:P363653",
"relationships": {
"duplicates": [
{"id": "P789012", "similarity": 0.98, "type": "exact_duplicate"}
],
"parallels": [
{"id": "P470987", "similarity": 0.75, "type": "textual_variant"},
{"id": "P654321", "similarity": 0.60, "type": "formulaic_parallel"}
],
"quotes": [
{"id": "P111111", "similarity": 0.40, "type": "partial_quotation"}
]
}
}
Auto-generate citations in any format:
GET /cite?id=P363653&format=chicago
Response:
"Frayne, Douglas R. 1990. Old Babylonian Period (2003-1595 BC). The Royal Inscriptions of Mesopotamia, Early Periods 4. Toronto: University of Toronto Press. RIME 4.3.6.1."
Formats supported:
- Chicago
- APA
- MLA
- BibTeX
- Harvard
- Custom templates
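Supporting several output formats can be as simple as one formatter function per style over a shared bibliographic record. The sketch below covers two of the listed formats; the record field names and the BibTeX key are assumptions, and the Chicago output is shaped to match the sample response above.

```python
RECORD = {
    "author_last": "Frayne",
    "author_first": "Douglas R.",
    "year": 1990,
    "title": "Old Babylonian Period (2003-1595 BC)",
    "series": "The Royal Inscriptions of Mesopotamia, Early Periods 4",
    "place": "Toronto",
    "publisher": "University of Toronto Press",
    "siglum": "RIME 4.3.6.1",
}


def cite_chicago(r):
    """Author-date style, ending with the text siglum."""
    return (
        f"{r['author_last']}, {r['author_first']} {r['year']}. {r['title']}. "
        f"{r['series']}. {r['place']}: {r['publisher']}. {r['siglum']}."
    )


def cite_bibtex(r, key="frayne1990rime4"):
    """BibTeX @book entry; the citation key is a hypothetical default."""
    return "\n".join([
        f"@book{{{key},",
        f"  author = {{{r['author_last']}, {r['author_first']}}},",
        f"  title = {{{r['title']}}},",
        f"  series = {{{r['series']}}},",
        f"  year = {{{r['year']}}},",
        f"  address = {{{r['place']}}},",
        f"  publisher = {{{r['publisher']}}}",
        "}",
    ])


FORMATTERS = {"chicago": cite_chicago, "bibtex": cite_bibtex}


def cite(record, fmt):
    """Dispatch to the formatter registered for the requested style."""
    return FORMATTERS[fmt](record)


chicago = cite(RECORD, "chicago")
```

Adding APA, MLA, or a custom template would mean registering another function in `FORMATTERS`.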
Context-aware citation:
{
"cite_request": {
"id": "P363653",
"context": "discussing_translation",
"format": "inline"
},
"response": "RIME 4.3.6.1 (tr. Frayne 1990: 333-334)"
}
┌────────────────────────────────────────────┐
│ Client Layer │
│ - Browser extension (auto-link) │
│ - API │
│ - Citation manager plugins │
└────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────┐
│ Resolution Service │
│ - Fuzzy matching │
│ - Alias resolution │
│ - Edition tracking │
│ - Citation formatting │
└────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────┐
│ Identifier Database │
│ - Primary IDs (P-numbers) │
│ - Alias mapping │
│ - Edition history │
│ - Citation graph │
└────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────┐
│ Data Sources │
│ - CDLI catalog │
│ - Museum databases │
│ - Publication corpora (RIME, etc.) │
│ - Oracc projects │
│ - DOI/CrossRef (for citations) │
└────────────────────────────────────────────┘
Auto-link citations in web pages:
<!-- Original text -->
<p>See VAB 6.1 for Hammurabi's building inscription.</p>
<!-- After extension processing -->
<p>See <a href="https://cuneiform.org/resolve/VAB_6.1"
class="cuneiform-citation"
data-id="P363653"
title="RIME 4.3.6.1 (BM 055488)">VAB 6.1</a>
for Hammurabi's building inscription.</p>
Hover tooltip shows:
VAB 6.1
= RIME 4.3.6.1
= BM 055488
= CDLI P363653
Period: Old Babylonian
Ruler: Hammurabi
Genre: Building inscription
[View current edition]
[View images]
[Copy citation]
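The auto-linking step can be approximated with a regular-expression pass over plain text. This is only a sketch under simplifying assumptions: it works on a text string rather than a parsed page (a real extension would walk DOM text nodes and skip existing links), and the `KNOWN` table stands in for a call to the resolver.

```python
import html
import re

# Stand-in for resolver results: citation string -> link metadata.
KNOWN = {
    "VAB 6.1": {"id": "P363653", "title": "RIME 4.3.6.1 (BM 055488)"},
}


def autolink(text, known):
    """Wrap every known citation in an anchor pointing at the resolver."""
    if not known:
        return text
    # Try longer citations first so a short one never shadows a longer match.
    ordered = sorted(known, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(c) for c in ordered))

    def wrap(match):
        cite = match.group(0)
        ident = known[cite]["id"]
        title = html.escape(known[cite]["title"])
        href = "https://cuneiform.org/resolve/" + cite.replace(" ", "_")
        return (
            f'<a href="{href}" class="cuneiform-citation" '
            f'data-id="{ident}" title="{title}">{cite}</a>'
        )

    return pattern.sub(wrap, text)


linked = autolink("See VAB 6.1 for Hammurabi's building inscription.", KNOWN)
```

The `title` attribute carries the short concordance; the richer hover tooltip would be fetched from the resolver on demand.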
RESTful API:
GET /resolve/{citation}
GET /editions/{id}
GET /cite/{id}?format={format}
GET /related/{id}?type={duplicate|parallel|quote}
POST /subscribe
GraphQL API:
query ResolveCitation($cite: String!) {
resolve(citation: $cite) {
primaryId
confidence
metadata {
period
genre
ruler
}
currentEdition {
title
url
date
}
images {
type
url
}
}
}
Current pain points:
- Scholars waste hours tracking down citations
- Old publications become inaccessible
- Can't discover latest editions
- Citation networks invisible
- Cross-referencing manual
With citation resolution:
- Efficiency: Instant resolution of any citation
- Discovery: Find related texts automatically
- Currency: Always get latest edition
- Attribution: Track citation impact
- Integration: Works with existing tools
Broader impact:
- Lowers barrier to cuneiform research
- Enables citation analysis
- Facilitates meta-research
- Supports reproducibility
- Connects scholarship across generations
- Coverage: 100,000+ texts with complete alias mapping
- Resolution rate: >95% of citations resolved correctly
- API usage: 10,000+ resolutions per day
- Adoption: Integration with 5+ citation managers (Zotero, Mendeley, etc.)
- Browser extension: 1,000+ active users
Fuzzy matching challenges:
- OCR errors in digital publications
- Variant spellings (Hammurabi vs. Ḫammurāpi)
- Abbreviated citations
- Typos in source material
Disambiguation:
- Same museum number, different museums
- Republished texts with different numbers
- Tablets later joined to others
Suggested tech:
- Elasticsearch for fuzzy search
- Levenshtein distance for string matching
- Machine learning for citation extraction from PDFs
- Named entity recognition for ruler/place names
- Handle System or ARK for persistent IDs
- OpenURL for context-sensitive linking
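Two of the items above, variant spellings and Levenshtein distance, combine naturally: fold diacritics first, then measure edit distance. The sketch below uses only the standard library; the `similarity` scaling (one minus distance over the longer length) is one common choice, not a prescribed one.

```python
import unicodedata


def fold(text):
    """Lowercase and strip combining marks, e.g. 'Ḫammurāpi' -> 'hammurapi'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def levenshtein(a, b):
    """Edit distance via the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1] after diacritic folding."""
    a, b = fold(a), fold(b)
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest


score = similarity("Hammurabi", "Ḫammurāpi")
```

After folding, the two spellings differ only in b versus p, so a modest threshold would treat them as the same name; a search engine such as Elasticsearch applies comparable folding and fuzziness at index scale.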
Learning cuneiform is brutally difficult with extremely high barriers to entry:
Challenges:
- 600+ signs to memorize
- Multiple readings per sign (e.g., A sign can read: a, ā, ia₅, or be determinative)
- Same sign looks different across periods (Old Babylonian vs. Neo-Assyrian)
- Complex grammar: 3 cases, 2 genders, 3 numbers, multiple verb stems, aspects, moods
- No living speakers: Can't practice conversation
- Scarcity of teachers: Maybe 200 active scholars worldwide
- Years required: Typically 5+ years for reading proficiency
Result:
- Tiny pool of scholars
- Field can't grow
- Interdisciplinary work limited
- Ancient texts remain inaccessible
Traditional classroom:
- Memorize sign lists (boring, decontextualized)
- Work through grammar sequentially (Chapter 1, 2, 3...)
- Read simplified textbook examples (not real texts)
- Eventually tackle actual tablets (years later)
Problems:
- High dropout rate
- Demotivating (years before reading real texts)
- No adaptation to individual learning
- Limited feedback
- Can't assess retention
Build adaptive learning tools leveraging modern technology
Progressive Revelation Interface:
┌─────────────────────────────────────────────────┐
│ Learning Mode: [Beginner ▾] │
│ Show: [☑] Normalization [☐] Trans [☐] Signs │
├─────────────────────────────────────────────────┤
│ P363653 - Building Inscription of Hammurabi │
│ │
│ Šamaš šarru šamê u erṣeti │
│ │
│ [Shamash, king of heaven and earth] │
│ │
│ 💡 Hover words for details │
│ 🎯 Click to mark as learned │
└─────────────────────────────────────────────────┘
After hovering over "šarru":
┌─────────────────────────────────────────────────┐
│ šarru │
├─────────────────────────────────────────────────┤
│ Meaning: king │
│ Part of speech: noun (masculine) │
│ Case: nominative │
│ Transliteration: LUGAL (logogram) │
│ Sumerian equivalent: lugal │
│ │
│ Used in 8,472 texts │
│ Common in: royal inscriptions, laws │
│ │
│ Related words: │
│ • rubû (prince) │
│ • bēlu (lord) │
│ • šarrūtu (kingship) │
│ │
│ [📖 Dictionary] [✓ Mark learned] [🔊 Hear] │
└─────────────────────────────────────────────────┘
After clicking "Show Transliteration":
┌─────────────────────────────────────────────────┐
│ {d}utu LUGAL an-ki-a │
│ ︿︿ ︿︿︿ ︿︿︿︿︿ (hover for sign details) │
│ Šamaš šarru šamê u erṣeti │
│ [Shamash, king of heaven and earth] │
└─────────────────────────────────────────────────┘
Click on LUGAL sign:
┌─────────────────────────────────────────────────┐
│ Sign: LUGAL │
├─────────────────────────────────────────────────┤
│ [Image: cuneiform sign] │
│ │
│ Main readings: │
│ • šarru (Akkadian: king) - most common, │
│ written logographically │
│ • lugal (Sumerian: king) │
│ │
│ Phonetic values: │
│ • šar │
│ │
│ Sign list: ABZ 151, MZL 306 │
│ Frequency: Very common (8,472 texts) │
│ │
│ Sign variants: │
│ • Old Babylonian: [image] │
│ • Neo-Assyrian: [image] │
│ • Neo-Babylonian: [image] │
│ │
│ [Practice this sign] [Similar signs] │
└─────────────────────────────────────────────────┘
Graduated complexity:
Beginner: Normalized text + translation
Intermediate: + Transliteration + basic parsing
Advanced: + Sign values + variants + attestations
Expert: Full ATF + paleographic notes
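The graduated levels can be modelled as a cumulative list of display layers. The sketch below assumes each level adds to the previous one, including Expert (the list above does not say so explicitly), and the layer names are hypothetical identifiers.

```python
# Each level contributes extra layers on top of everything before it.
LEVEL_ADDITIONS = [
    ("beginner", ["normalization", "translation"]),
    ("intermediate", ["transliteration", "basic_parsing"]),
    ("advanced", ["sign_values", "variants", "attestations"]),
    ("expert", ["full_atf", "paleographic_notes"]),  # assumed cumulative
]


def visible_layers(level):
    """Return every layer shown at the given level, in display order."""
    layers = []
    for name, additions in LEVEL_ADDITIONS:
        layers.extend(additions)
        if name == level:
            return layers
    raise ValueError(f"unknown level: {level}")


intermediate_view = visible_layers("intermediate")
```

Keeping the levels as data rather than branching logic makes it easy to let a learner toggle individual layers, as the checkboxes in the interface suggest.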
Context-Aware Flashcards:
Traditional flashcard (BAD):
Front: [Image of AN sign]
Back: an, ilu, dingir
Context-aware flashcard (GOOD):
Front:
{d}utu lugal [___]-ki-a
from: "Shamash, king of heaven and earth"
Back:
{d}utu lugal [an]-ki-a
Reading: an (heaven)
Also appears as: {d} (divine determinative, as before utu here), ilu (god)
In this context: Part of "an-ki" = heaven and earth
You've seen this:
• Last week in Hammurabi prologue
• Two weeks ago in creation myth
• Common pattern: an-ki (heaven-earth)
Adaptive Difficulty:
{
"student_profile": {
"name": "student_123",
"level": "intermediate",
"strong_areas": [
"common_verbs",
"noun_declension",
"basic_signs"
],
"weak_areas": [
"rare_signs",
"subordinate_clauses",
"Neo-Assyrian_variants"
],
"confusion_pairs": [
{"signs": ["KA", "DUG"], "error_rate": 0.4},
{"signs": ["BA", "ZU"], "error_rate": 0.3}
]
},
"next_card": {
"sign": "DUG",
"reason": "Frequently confused with KA, needs reinforcement",
"context": "Real text snippet to aid recognition",
"difficulty": "medium"
}
}
Variant Tracking:
Sign: UTU (sun)
Periods you've learned:
✓ Old Babylonian (confident)
✓ Neo-Assyrian (practicing)
☐ Neo-Babylonian (not started)
☐ Archaic (not started)
Practice mode: [Focus on Neo-Assyrian variants]
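Choosing the practice focus from this kind of per-period status can be a small priority rule. The sketch below assumes a hypothetical policy: finish what is being practised before starting something new, and revisit confident material last. The status labels are illustrative.

```python
# Lower number = practise sooner (an assumed policy, not a fixed rule).
PRIORITY = {"practicing": 0, "not_started": 1, "confident": 2}

progress = {
    "UTU": {
        "Old Babylonian": "confident",
        "Neo-Assyrian": "practicing",
        "Neo-Babylonian": "not_started",
        "Archaic": "not_started",
    }
}


def next_focus(sign, progress):
    """Pick the period whose variants the learner should practise next."""
    periods = progress[sign]
    # min() returns the first of several equally ranked periods,
    # so ties follow the order in which periods were recorded.
    return min(periods, key=lambda period: PRIORITY[periods[period]])


focus = next_focus("UTU", progress)
```

With the statuses shown above this selects the Neo-Assyrian variants, matching the practice mode in the example; once those are marked confident, the next unstarted period is chosen.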
Corpus Examples (Not Textbook Sentences):
Traditional approach (BAD):
Textbook: "The king sees the temple"
šarrum bītam īmur
(boring, artificial)
Corpus-based (GOOD):
From actual royal inscription (P363653):
"Shamash, king of heaven and earth, who sees all"
Šamaš šarru šamê u erṣeti ša kalâma īmuru
Grammar focus: Relative clause (ša + verb)
Pattern: [noun] ša [object] [verb]
Frequency: Very common in royal titles
5 other real examples:
1. [from Gilgamesh]
2. [from law code]
3. [from letter]
...
Progressive Complexity:
Week 1-4: Simple sentences (SOV)
šarrum bītam īmur
"The king saw the temple"
Week 5-8: Adding cases
šarrum ana bītim illik
"The king went to the temple"
Week 9-12: Subordinate clauses
šarrum ša bītam īpušu
"The king who built the temple"
Week 13-16: Complex sentences
šarrum ša bītam ana ilim īpušu ina šattim šanītim imūt
"The king who built the temple for the god died in the second year"
Error Analysis:
{
"exercise": "Parse: šarri",
"student_answer": {
"lemma": "šarru",
"case": "genitive",
"number": "singular"
},
"correct": true,
"feedback": {
"type": "reinforcement",
"message": "Correct! The genitive case ending -i is used here.",
"related_forms": {
"nominative": "šarru",
"accusative": "šarra",
"genitive": "šarri"
},
"next_challenge": "Now try identifying genitive chains"
}
}
Common mistake patterns:
{
"student_errors": [
{
"error": "confused_accusative_genitive",
"frequency": 0.3,
"contexts": ["after_preposition", "in_construct_chain"],
"remediation": "Focus exercises on case after prepositions"
},
{
"error": "wrong_verb_stem",
"frequency": 0.2,
"stems_confused": ["G_stem", "D_stem"],
"remediation": "Contrast exercises showing semantic difference"
}
]
}
Learn Language Through Real Texts:
Example: Teaching genitive construction through actual document:
┌─────────────────────────────────────────────────┐
│ Administrative Text from Ur III Period │
├─────────────────────────────────────────────────┤
│ 5 sila₃ kāṣī │
│ "5 silas of beer" │
│ │
│ 💡 Learning points: │
│ • Genitive construction (kāṣī) │
│ • Measure system (sila₃ ≈ 1 liter) │
│ • Economic context (ration distribution) │
│ │
│ 📚 Historical context: │
│ Workers in Ur III received daily beer rations. │
│ Typical ration: 2-5 silas depending on status. │
│ │
│ 🔗 Related vocabulary: │
│ • še (barley) │
│ • i₃ (oil) │
│ • siki (wool) │
│ │
│ [Next document] [Quiz me] [See similar texts] │
└─────────────────────────────────────────────────┘
Thematic Learning Paths:
Path 1: "Royal Inscriptions" (easier)
• Standard formulae (repetitive, predictable)
• Limited vocabulary
• Formulaic grammar
→ Good for beginners
Path 2: "Letters" (medium)
• Natural language
• Varied vocabulary
• Real-world contexts
→ Good for intermediate
Path 3: "Literary Texts" (harder)
• Poetic language
• Rare words
• Complex syntax
→ Good for advanced
Personalized Learning Paths:
{
"student": "student_123",
"assessment": {
"reading_level": "intermediate",
"sign_recognition": 0.75,
"grammar_understanding": 0.60,
"vocabulary_size": 450
},
"recommended_path": {
"current_goal": "Improve subordinate clause recognition",
"estimated_time": "2 weeks",
"materials": [
{
"type": "text",
"id": "P470987",
"difficulty": "medium",
"focus": "relative_clauses",
"estimated_duration": "30 min"
},
{
"type": "exercise",
"id": "subordinate_clause_drill_3",
"estimated_duration": "15 min"
},
{
"type": "review",
"cards": ["ša", "relative_pronouns"],
"estimated_duration": "10 min"
}
]
}
}
Real-Time Feedback:
Student translates: "šarrum bītam īmur"
Answer: "The king the temple saw"
System feedback:
❌ Word order: The translation keeps Akkadian's verb-final order, which is unnatural in English
✓ Vocabulary: All words correct
✓ Grammar: Correct case identification
💡 Suggestion: "The king saw the temple" flows better in English
📖 Note: Akkadian is typically verb-final (SOV); English is SVO
Progress Tracking:
┌─────────────────────────────────────────────────┐
│ Your Progress │
├─────────────────────────────────────────────────┤
│ Signs mastered: 247/600 ████░░░░░░ 41% │
│ Vocabulary: 450 words │
│ Texts read: 12 │
│ │
│ Recent achievements: │
│ 🎯 Completed "Royal Inscription" track │
│ 🔥 7-day reading streak │
│ ⭐ Mastered all pronouns │
│ │
│ Next milestone: │
│ Read your first complete letter (3 texts away) │
│ │
│ Weak areas: │
│ • Subordinate clauses (practice recommended) │
│ • Rare signs (25% accuracy) │
│ │
│ [Study plan] [Review cards] [Practice reading] │
└─────────────────────────────────────────────────┘
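As a small worked example, the "Signs mastered" line of a dashboard like this one could be generated so that the bar and the percentage always come from the same fraction. The function name and the ten-segment width are assumptions.

```python
def progress_line(label: str, done: int, total: int, width: int = 10) -> str:
    """Render 'label: done/total <bar> pct%' with the bar scaled to the fraction."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = min(max(done / total, 0.0), 1.0)
    filled = round(fraction * width)  # number of filled bar segments
    bar = "█" * filled + "░" * (width - filled)
    return f"{label}: {done}/{total} {bar} {fraction:.0%}"


line = progress_line("Signs mastered", 247, 600)
```

Here 247/600 is about 0.41, so four of the ten segments are filled.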
Achievement System:
Badges:
🔰 First Text Read
📚 10 Texts Completed
🏆 100 Signs Mastered
⚡ 30-Day Streak
🌟 Perfect Quiz Score
👑 Read a Royal Inscription
📜 Read a Law Code
🗡️ Read Epic Literature
Leaderboards (Optional):
This Week's Top Learners:
1. student_456 - 450 cards reviewed
2. student_789 - 8 texts completed
3. student_123 - 95% quiz accuracy
Community Features:
Study Groups:
• "Old Babylonian Reading Group" (12 members)
• "Sign Mastery Challenge" (45 members)
• "Neo-Assyrian Variant Practice" (8 members)
Discussion:
💬 "Help with this damaged sign?" (3 replies)
💬 "Best way to memorize verb stems?" (12 replies)
💬 "Anyone want to read Gilgamesh together?" (6 replies)
Current bottleneck:
- 5+ years to reading proficiency
- High dropout rate
- Limited by teacher availability
- Boring, decontextualized learning
With intelligent tools:
- Faster learning: 2-3 years to proficiency (estimate)
- Higher retention: Spaced repetition + context
- Self-paced: Learn anytime, anywhere
- Adaptive: Personalized to strengths/weaknesses
- Motivating: Real texts from day one
- Scalable: Not limited by teacher availability
Impact:
- More scholars: Lower barriers = bigger field
- Interdisciplinary access: Historians, archaeologists can read sources
- Broader perspective: Diverse backgrounds in field
- Better data: More readers = more corrections for ML training
Success metrics (targets):
- Adoption: 1,000+ active learners
- Completion rate: 30% complete beginner track (vs. ~5% traditional)
- Time to proficiency: 2-3 years average (vs. 5+ years)
- Retention: 70% still active after 6 months
- Coverage: 100+ texts with pedagogical scaffolding
- Satisfaction: 4.5+ star rating from users
┌────────────────────────────────────────────┐
│ Student Interface │
│ - Reading environment │
│ - Flashcard system │
│ - Exercise platform │
│ - Progress dashboard │
└────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────┐
│ Adaptive Learning Engine │
│ - Spaced repetition algorithm │
│ - Difficulty assessment │
│ - Personalization │
│ - Error pattern detection │
└────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────┐
│ Content Database │
│ - Curated texts (by difficulty) │
│ - Annotated translations │
│ - Grammatical explanations │
│ - Sign descriptions │
│ - Cultural context │
└────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────┐
│ Student Model │
│ - Knowledge state │
│ - Learning history │
│ - Error patterns │
│ - Preferences │
└────────────────────────────────────────────┘
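The "error pattern detection" step of the adaptive learning engine could, as a rough sketch, aggregate a log of exercise attempts into records shaped like the `student_errors` example earlier. The attempt-log format (`error` set to `None` for a correct answer, plus a `context` tag) is a hypothetical input shape.

```python
from collections import Counter, defaultdict


def summarize_errors(attempts: list[dict]) -> list[dict]:
    """Aggregate attempts into per-error frequency and context summaries."""
    total = len(attempts)
    if total == 0:
        return []
    counts: Counter = Counter()
    contexts: dict[str, set] = defaultdict(set)
    for attempt in attempts:
        error = attempt.get("error")
        if error is None:  # correct answer, nothing to record
            continue
        counts[error] += 1
        contexts[error].add(attempt.get("context", "unknown"))
    # Frequency is the share of all attempts that showed this error.
    return [
        {"error": e, "frequency": round(n / total, 2), "contexts": sorted(contexts[e])}
        for e, n in counts.most_common()
    ]


attempts = (
    [{"error": "confused_accusative_genitive", "context": "after_preposition"}] * 2
    + [{"error": "confused_accusative_genitive", "context": "in_construct_chain"}]
    + [{"error": "wrong_verb_stem", "context": "D_stem"}] * 2
    + [{"error": None, "context": "after_preposition"}] * 5
)
summary = summarize_errors(attempts)
```

A remediation message could then be attached per error type; that mapping is left out because it is a content-curation decision rather than a computation.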
NLP for Ancient Languages:
- Morphological analysis for exercises
- Automatic difficulty assessment
- Error detection in student translations
Pedagogical Design:
- Optimal spacing intervals
- Difficulty progression
- Motivation maintenance
Content Curation:
- Select appropriate texts
- Create scaffolding
- Write explanations
- Design exercises
Suggested Stack:
- SM-2-style scheduler (the algorithm family behind Anki) for spaced repetition
- TensorFlow for student modeling
- React for interactive UI
- D3.js for progress visualization
- WebGL for sign rendering
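To make the spaced-repetition item concrete, here is a simplified sketch of SM-2, the SuperMemo algorithm that Anki's classic scheduler is derived from. It is not Anki's actual implementation; the `CardState` type and field names are assumptions, and `quality` is the usual 0 to 5 self-rating.

```python
from dataclasses import dataclass


@dataclass
class CardState:
    repetitions: int = 0      # consecutive successful reviews
    interval_days: int = 0    # days until the next review
    ease: float = 2.5         # SM-2 easiness factor


def review(state: CardState, quality: int) -> CardState:
    """Return the card's next state after a review graded 0 (blackout) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Failed recall: restart the repetition sequence, keep the ease factor.
        return CardState(repetitions=0, interval_days=1, ease=state.ease)
    if state.repetitions == 0:
        interval = 1
    elif state.repetitions == 1:
        interval = 6
    else:
        interval = round(state.interval_days * state.ease)
    # Standard SM-2 ease update, floored at 1.3.
    ease = state.ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return CardState(state.repetitions + 1, interval, max(1.3, ease))


first = review(CardState(), 5)
second = review(first, 5)
third = review(second, 5)
lapsed = review(third, 2)
```

Three perfect reviews in a row give intervals of 1, 6, and 16 days; a failed review resets the interval to 1 day. A context-aware system might additionally key the state per sign and period variant rather than per sign alone.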
All seven innovations above share a common thread: they're about infrastructure for human-AI collaboration in humanities scholarship.
This isn't just about cuneiform—you'd be pioneering patterns for how computational and humanistic methods integrate in ways that:
- Respect scholarly rigor
- Enable computational scale
- Preserve provenance
- Support collaboration
- Accumulate knowledge
These solutions apply to:
- Papyrology (Greek/Latin ancient texts)
- Medieval manuscripts (damaged, fragmentary)
- Archaeological site documentation (field notes, excavation records)
- Art history attribution (provenance tracking, expert disagreement)
- Paleography (handwriting analysis)
- Epigraphy (inscriptions)
Any domain with:
- Fragmentary evidence
- Expert interpretation
- Contested readings
- Historical accumulation
- Need for transparency
Given your background in:
- Systems thinking → Architecture for complex knowledge
- Trust frameworks → Designing for trust in agentic systems
- Agentic design → Human-AI collaboration patterns
- UX for complex domains → Making scholarly tools usable
- Folk modernist aesthetics → Clear, intentional interfaces
You're positioned to define the paradigm.
Not just "solve cuneiform OCR" but establish how AI and humanities work together in ways that respect both computational power and humanistic judgment.
| Opportunity | Schema Status | Next Phase |
|---|---|---|
| 1. Trust Infrastructure | Designed -- scholars, annotation_runs, evidence/decision tables, competing interpretations | Implementation (SQL DDL, import scripts) |
| 2. Composite Text Assembly | Not started | Research needed |
| 3. Semantic Search | Not started | Depends on Layer 3-4 data population |
| 4. Annotation Ecosystem | Designed -- discussion threads, fragment joins, scholarly annotations, W3C view | Implementation |
| 5. Knowledge Graph | Designed -- named entities, relationships, authority links | Implementation; entity extraction from ORACC POS |
| 6. Citation Resolution | Designed -- identifier concordance, publications, edition bridge | Implementation; CDLI API parsing |
| 7. Pedagogy | Not started | Depends on most other systems |
Priority 1: Trust Infrastructure (Foundational)
- Enables all other innovations
- Most aligned with your expertise
- Solves a universal problem
Priority 2: Annotation Ecosystem (Knowledge multiplier)
- Leverages existing scholarship
- Creates network effects
- Relatively achievable technically
Priority 3: Semantic Search (High utility)
- Immediate value to researchers
- Demonstrates power of infrastructure
- Builds user base for other tools
- Start with trust infrastructure for ML-generated readings
- Build annotation system to capture expert corrections
- Create knowledge graph to link everything
- Add semantic search to make it queryable
This creates a flywheel:
- ML generates readings →
- Experts annotate/correct →
- Trust system tracks provenance →
- Knowledge graph connects entities →
- Semantic search enables discovery →
- More usage → More data → Better ML
The trust infrastructure feels like it aligns most closely with your background in designing for trust in agentic systems, but I could see arguments for starting with annotation (more immediate value) or knowledge graph (most ambitious) depending on your goals.
Everything described in this document -- trust infrastructure, composite assembly, semantic search, annotation ecosystems, knowledge graphs, citation resolution, pedagogy -- is language-agnostic at the architectural level. Cuneiform is the proving ground, but the platform Glintstone builds could become the standard infrastructure for working with any extinct or undeciphered writing system.
The underlying problem is universal: fragmentary evidence, competing expert interpretations, centuries of accumulated scholarship, no living speakers, and the need for transparent, collaborative, computationally-augmented decipherment. Every system described above addresses these problems in ways that transfer directly.
These use cuneiform or closely related writing systems. The sign inventory, ATF conventions, and existing Glintstone schema could extend to cover them with relatively modest effort.
- Hittite cuneiform -- Well-deciphered Indo-European language written in cuneiform. ~30,000 tablets. Active scholarly community. Same sign inventory as Akkadian with additions. Oracc already hosts some Hittite projects.
- Elamite cuneiform -- Partially understood language isolate. Thousands of tablets from Persepolis and Susa. Morphology still debated. Perfect candidate for trust infrastructure and competing-interpretation tracking.
- Urartian cuneiform -- Small but significant corpus from eastern Anatolia. Related to Hurrian. Limited scholarly community means every annotation matters more.
- Old Persian cuneiform -- Simpler syllabary (36 signs vs. 600+). Well-deciphered but cross-referencing with Akkadian and Elamite trilingual inscriptions is exactly the kind of composite assembly this platform enables.
- Ugaritic -- Alphabetic cuneiform from Syria. ~1,500 texts. Well-understood but corpus still growing. Mythological texts (Baal Cycle) have the same composite-text-assembly needs as Gilgamesh.
These use non-cuneiform writing systems but face identical scholarly infrastructure challenges. The data model generalizes; the sign inventory and input methods need new front-ends.
- Linear B -- Deciphered in 1952 (Ventris), but the corpus of ~5,000 tablets from Knossos, Pylos, Mycenae, and Thebes is still being actively edited. Fragmentary, damaged, with competing readings and ongoing joins. The trust infrastructure, composite assembly engine, and annotation ecosystem apply directly. Mycenaean Greek vocabulary is small (~1,500 words) but the administrative texts have the same citation chaos as cuneiform (different museum numbering, excavation numbers, publication series).
- Egyptian hieroglyphs and hieratic -- Enormous corpus. Well-deciphered but far from fully published. The Thesaurus Linguae Aegyptiae (TLA) is the closest parallel to CDLI/Oracc, and it faces every problem described in this document: citation fragmentation, siloed annotations, no trust tracking for editorial decisions, limited semantic search. A shared infrastructure layer could serve both fields.
- Mayan glyphs -- Dramatic decipherment progress since the 1970s but still ongoing. ~10,000 inscriptions. Active debates about readings. The field has its own version of ATF chaos (different drawing conventions, inconsistent transliteration standards). Knowledge graph potential is enormous: linking rulers, sites, dates, events across the corpus.
- Meroitic -- Alphabet deciphered (Griffith, 1911) but the language itself remains mostly untranslated. ~1,800 texts from ancient Nubia. This is exactly where ML-augmented pattern detection, semantic clustering, and collaborative annotation could accelerate breakthroughs.
- Luwian hieroglyphs -- Deciphered but corpus expanding. Monumental inscriptions from Iron Age Anatolia and Syria. Same composite-text and knowledge-graph needs as cuneiform royal inscriptions.
These are the scripts where a collaborative decipherment platform could be genuinely transformative. The trust infrastructure becomes critical: tracking every proposed decipherment, its evidence basis, which hypotheses have been tested and falsified, and enabling systematic rather than ad hoc approaches.
- Linear A -- Minoan script, predecessor to Linear B. ~7,000 signs across ~1,400 documents. The signs are catalogued but the language is unknown. Any decipherment attempt requires tracking provenance of every proposed sound value, every semantic hypothesis, every structural argument. This is exactly what the trust infrastructure and annotation ecosystem provide.
- Proto-Elamite -- ~1,600 clay tablets from Iran, ca. 3100-2900 BCE. Partially understood (numerical systems decoded) but the script itself is undeciphered. Recent ML work (e.g., CDLI's proto-Elamite digitization) would benefit directly from Glintstone's sign-level provenance and competing-interpretation architecture.
- Indus Valley Script (Harappan) -- ~4,000 inscribed objects. Extremely short texts (average ~5 signs). Whether it encodes language at all is debated. The platform's ability to track and weigh competing hypotheses, link to archaeological context via knowledge graph, and enable distributed expert annotation could help settle fundamental questions.
- Cypro-Minoan -- ~250 inscriptions from Bronze Age Cyprus. Still undeciphered; recent work on the sign inventory (e.g., by Steele) is ongoing and contested. Small corpus means every scholarly annotation is high-value.
- Rongorongo -- ~26 wooden tablets from Rapa Nui. Tiny corpus, heavily damaged, possibly not linguistic. But the provenance tracking and trust infrastructure would bring rigor to a field plagued by speculative claims.
- Etruscan -- Script is readable (adapted Greek alphabet) but the language is only partially understood despite ~13,000 inscriptions. Semantic search and knowledge graph infrastructure could accelerate understanding of vocabulary and grammar through pattern detection at corpus scale.
Directly transferable (no modification):
- Trust/provenance infrastructure (scholars, annotation runs, evidence chains, competing interpretations)
- Annotation ecosystem (W3C model, discussion threads, DOI minting)
- Citation resolution (alias systems, edition tracking, fuzzy matching)
- Knowledge graph architecture (entities, relationships, authority reconciliation)
- Pedagogical scaffolding (spaced repetition, progressive revelation, adaptive difficulty)
Requires parameterization (script-specific configuration):
- Sign inventories and paleographic variants
- Transliteration conventions (ATF vs. Manuel de Codage vs. Thompson catalog)
- Morphological analysis rules
- Writing direction and layout (cuneiform L-R, hieroglyphs variable, rongorongo reverse boustrophedon)
- Period/region taxonomies
Requires new development:
- Script-specific input methods and rendering
- Language-specific NLP models (morphology, syntax)
- Domain ontologies (Mesopotamian gods vs. Egyptian gods vs. Maya rulers)
- Integration with field-specific databases (TLA, FAMSI, Perseus)
The cuneiform implementation establishes the reference architecture. If designed with generalization in mind from the start -- abstract script layer, pluggable sign inventories, configurable transliteration schemes -- the marginal cost of adding each new script drops dramatically.
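A minimal sketch of what "abstract script layer, pluggable sign inventories, configurable transliteration schemes" might look like in code. `ScriptConfig` and `ScriptRegistry` are hypothetical names, not existing Glintstone components, and the tiny sign inventories are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ScriptConfig:
    """Script-specific parameters kept outside the shared data model."""
    name: str
    transliteration_scheme: str
    writing_direction: str
    sign_inventory: frozenset[str] = field(default_factory=frozenset)


class ScriptRegistry:
    """Holds one configuration per script so new scripts plug in without schema changes."""

    def __init__(self) -> None:
        self._scripts: dict[str, ScriptConfig] = {}

    def register(self, config: ScriptConfig) -> None:
        if config.name in self._scripts:
            raise ValueError(f"script already registered: {config.name}")
        self._scripts[config.name] = config

    def get(self, name: str) -> ScriptConfig:
        return self._scripts[name]

    def unknown_signs(self, name: str, signs: list[str]) -> list[str]:
        """Return the signs that are not in the named script's inventory."""
        inventory = self.get(name).sign_inventory
        return [s for s in signs if s not in inventory]


registry = ScriptRegistry()
registry.register(ScriptConfig("cuneiform", "ATF", "ltr", frozenset({"AN", "UTU", "LUGAL", "KI"})))
registry.register(ScriptConfig("egyptian", "Manuel de Codage", "variable", frozenset({"A1", "G17"})))
unknown = registry.unknown_signs("cuneiform", ["AN", "UTU", "XYZ"])
```

Adding a script then means registering a new configuration; the trust, annotation, and citation layers would not need to know which script they are serving.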
This reframes Glintstone from "a cuneiform database" to "the open infrastructure layer for computational epigraphy and philology."
The pitch to funders shifts from "we digitized some tablets" to "we built the platform that accelerates decipherment of every undeciphered script on earth."
| Script | Key Institutions / Projects |
|---|---|
| Linear A/B | Cambridge (CREWS project), University of Bologna, DAMOS database |
| Egyptian | Berlin-Brandenburgische Akademie (TLA), Oxford (Griffith Institute), IFAO Cairo |
| Mayan | University of Bonn (Textdatenbank), FAMSI, Dumbarton Oaks |
| Proto-Elamite | CDLI (already in pipeline), Louvre, University of Tehran |
| Indus Valley | IIT Kharagpur, Decipherment project at TIFR, Helsinki (Asko Parpola) |
| Meroitic | SFDAS, British Museum Sudan collections, Humboldt-Universität |
| Etruscan | CNR Rome, University of Massachusetts Amherst, Corpus Inscriptionum Etruscarum |
Three converging trends make this moment unique:
- ML capabilities have reached the threshold where pattern detection in damaged/fragmentary scripts is genuinely useful, but only if integrated with scholarly trust systems rather than deployed as black boxes.
- Digital humanities infrastructure is mature enough (IIIF, W3C Web Annotation, Linked Open Data, ORCID) that building on standards is viable rather than requiring everything from scratch.
- The fields are small enough that a well-designed platform can achieve critical mass. There are perhaps 500 active cuneiformists, 200 Mayan epigraphers, 50 people working on Linear A. A single platform serving all of them creates network effects that no field could generate alone.
Source: github.com/wittkensis/glintstone