-
Notifications
You must be signed in to change notification settings - Fork 0
Research Cuneiform Linguistics Synthesis
A reference for scholars and engineers. Synthesized from three independent agent outputs that answered the same comparative-typology question from different angles. Read this when you want depth; the skill version (gs-expert-assyriology/comparative-typology.md) is the operational cheat sheet.
Cuneiform was invented by Sumerian speakers in southern Mesopotamia around 3200 BCE, originally as proto-pictographic accounting tokens on clay. Over ~3,000 years it was adapted to record at least four genetically unrelated languages: Sumerian (isolate), Akkadian (East Semitic), Hittite (Indo-European, Anatolian branch), and Elamite (isolate). The script outlived its progenitor language: Sumerian died as a spoken language ~1800 BCE but persisted as a scholarly and liturgical medium until ~100 CE, much as Latin did in medieval Europe.
The analogy to Latin alphabet usage across English, Turkish, Vietnamese, and Swahili is apt: the script is shared, but the underlying linguistic machinery is radically different in each case. The script's polyvalency (one sign, many readings) is itself a record of this multilingualism — Sumerian readings, Akkadian readings, syllabic values, and logographic uses coexist within a single signary.
Agglutinative. Morphemes attach in chains; the verbal template is roughly:
[modal] – [conjugation] – [ventive/ientive] – [dimensional] – [person] – STEM – [person] – [pronominal]
A Sumerian verb such as mu-n-na-ab-sum "he gave it to him" parses slot-by-slot:
- mu- — ventive (motion toward speaker / discourse anchor)
- n- — 3sg.IO dative
- na- — 3sg.human.IO allative
- b- — 3sg.neuter.DO
- sum — give (stem)
Noun phrases are head-first: lugal kalam-ma-k = "king-of-the-land", possessor follows possessed (analogous to Japanese の constructions).
Ergative-absolutive, the most consequential structural fact. The agent of a transitive verb takes the ergative marker -e; the patient of a transitive AND the subject of an intransitive are both absolutive (unmarked):
- lugal-e e₂-Ø mu-n-du₃ — "the king(ERG) built the house(ABS)"
- lugal-Ø dur₂-ru-na — "the king(ABS) sat down"
The king is marked differently depending on whether he is doing something to something or simply existing/acting. This is structurally identical to ergative patterns in Basque, Inuit, and Yup'ik.
Personal vs impersonal (human/divine vs animals/objects). Plural marker -ene for personal class.
Strong but contested. Jagersma 2010 is the modern reference grammar; Zolyomi and Edzard offer alternative analyses of the prefix slot system. The disagreement is genuine and unresolved — multiple framework analyses can validly coexist for the same Sumerian text.
Don't compress the verbal chain into a single morph string. Store slot-by-slot, or preserve the ORACC string and parse on render. The morphology table must support multiple competing parses keyed to annotation_run_id.
Triconsonantal roots with internal vowel ablaut and stem augments. The root b-n-y "build" surfaces as:
- ibnī — G-stem preterite "he built"
- ibtanī — Gtn iterative "he kept building"
- ubnī — Š causative "he had built"
- bītum — noun "house" derived from the same root
- bāniyum — agent noun "builder"
This is structurally identical to Arabic b-n-y → banā "he built", binā' "building", mabnā "structure", and to the Hebrew b-n-h → banah.
G (basic), D (intensive), Š (causative), N (passive), plus t-infixed variants Gt, Gtn, Dt, etc. The system maps almost directly onto the Arabic Forms I–X. Akkadian scribes catalogued the stem system in bilingual paradigms.
Nominative-accusative (unlike Sumerian). Three cases: nominative -um, accusative -am, genitive -im. Case endings (and the case system itself) erode in Neo-Babylonian and Neo-Assyrian — paralleling the loss of case in spoken Arabic.
Predominantly SOV, an adoption from Sumerian. Subordinate clauses are consistently verb-final. Some poetic registers show VSO tendencies.
Identical to Arabic idāfa: bīt šarrim "house of the king" = house-CONSTRUCT king-GEN, where the head noun is in the construct state (no case ending) and the dependent in the genitive.
A first-class schema dimension, not a footnote:
| Dialect | Period | Region | Notes |
|---|---|---|---|
| Old Akkadian | ~2500–1950 BCE | Akkad | Earliest stage |
| Old Babylonian | ~1950–1530 BCE | South | Literary classical standard |
| Old Assyrian | ~1950–1750 BCE | North | Trader colonies in Anatolia |
| Middle Babylonian | ~1530–1000 BCE | South | Kassite period |
| Middle Assyrian | ~1500–1000 BCE | North | Empire expansion |
| Neo-Babylonian | ~1000–600 BCE | South | Chaldean / Neo-Babylonian empire |
| Neo-Assyrian | ~1000–600 BCE | North | Empire administration |
| Late Babylonian | ~600 BCE – 100 CE | South | Last living phase |
Neo-Assyrian features vowel harmony, loss of case endings, and analytical constructions — these aren't dialect-accent differences but full grammatical drift. Track dialect on every lemma and every annotation.
Fusional / inflecting, like German. Two genders (common and neuter — archaic IE without the masculine/feminine split). Extensive case system: nominative, accusative, genitive, dative-locative, ablative, instrumental, allative, vocative.
Two verb conjugations:
- -mi conjugation (the "normal" IE one)
- -ḫi conjugation (unique to Hittite; reflects a very archaic stative/perfect; cognate possibly to the IE perfect)
SOV, similar to German subordinate clauses. The defining clause-level feature is Wackernagel's Position: enclitic particles (pronouns, reflexives, local particles) stack in a fixed order after the first stressed word of the clause. The introductory particle nu often serves as the anchor:
nu=ššan LUGAL-uš É.GAL=an iyat
nu=PTC king-NOM palace=ACC build.3SG.PRET
"And the king built the palace."
The clitic chain =ššan attaches to nu, not to the verb.
A single Hittite text routinely mixes:
- Hittite phonetic spellings (in italics in transliteration)
-
Sumerograms — Sumerian words used as logograms for Hittite words (UPPERCASE:
LUGALforḫassuš"king") -
Akkadograms — Akkadian words used as logograms (ITALIC UPPERCASE:
ANAfor "to",ŠAfor "of") -
Determinatives — silent classifiers (
{d},{m},{ki}) -
Phonetic complements — small syllabic signs attached to logograms to mark the underlying Hittite ending (
LUGAL-uš= nominative singular ḫassus)
Example: LUGAL-uš ANA DINGIR-LIM SISKUR iyazi
-
LUGAL-uš= Sumerogram + Hittite NOM ending → ḫassuš "the king" -
ANA= Akkadian preposition used logographically -
DINGIR-LIM= Sumerogram + Akkadogram complement → šiwannī "to the god" -
SISKUR= Sumerogram → some Hittite ritual term -
iyazi= Hittite verb "makes"
Only iyazi (and the Hittite endings on the logograms) is phonetically Hittite. This is the rule, not the exception in Hittite texts.
Hittite is the language where the graphemic / linguistic split is widest. The schema's three-layer model (graphemic / linguistic / semantic) must be honored fully. tokens.surface_form (what was written) is separate from lemmatizations.lemma_id (what was meant) is separate from glossary_senses (the conceptual content). A token's surface might be Sumerian, its lemma Hittite, its sense English.
Agglutinative. The defining feature is a pervasive noun-class system with three classes based on the relationship to the speech act:
-
Locutive (1st person) — marked
-k -
Allocutive (2nd person) — marked
-t -
Delocutive (3rd person) —
-r(animate sg),-p(animate pl),-me(inanimate)
Adjectives and genitive modifiers take the class marker of their head noun. In sunki-k GN-k = "I, the king of GN", the locutive marker -k is doubled onto both the noun sunki "king" AND the city name, because the speaker (1st person) is the subject of the apposition.
This is analogous to Caucasian and Hurro-Urartian patterns but operates on a deeper class system.
| Period | Range | Notes |
|---|---|---|
| Old Elamite | ~2600–1500 BCE | Very sparsely attested |
| Middle Elamite | ~1500–1000 BCE | Royal inscriptions; best-attested phase |
| Neo-Elamite | ~1000–550 BCE | Transitional |
| Achaemenid Elamite | ~550–330 BCE | Administrative language of the Persian Empire; Persepolis Fortification Archive; heavily Old-Persian-influenced, possibly a bureaucratic pidgin rather than natural speech |
Improving. Recent breakthroughs in Linear Elamite (a parallel non-cuneiform script, ~2000 BCE; François Desset et al.) are starting to back-fill Cuneiform Elamite understanding. The corpus is small, dependencies on Old Persian are heavy in the Achaemenid period, and many readings remain genuinely uncertain.
Confidence scores aren't a nicety — they're load-bearing. The morphology table must support partial / uncertain readings. Class markers should be stored as a separate feature, not baked into the lemma form.
A single cuneiform sign typically encodes:
- Multiple logographic readings (whole words)
- Multiple syllabic values (CV, VC, CVC sequences)
- Functions as a determinative (silent classifier)
And these vary by language, period, and context.
| Sign | Sumerian | Akkadian | Hittite | Elamite |
|---|---|---|---|---|
| AN |
an "sky", dingir "god" |
šamû "heaven", ilu "god", syllable an
|
divine determinative | syllable an
|
| KUR (𒆳) |
kur "mountain, foreign land" |
šadû "mountain", mātu "land", syllables kur/mad/šad
|
"country / land" (with phonetic complement for case) | — |
| UD (𒌓) |
ud "day, sun" |
ūmu "day", syllables ud/tam/tú/par/laḫ
|
"day" | syllable ut
|
| DU (𒁺) |
du "go", gub "stand" |
syllables du/rá/gub
|
"go" | syllable du
|
This polyvalency drives the schema's separation between tokens (a particular wedge group on a tablet), token_readings (the candidate phonological realizations), and lemmatizations (the dictionary entries those readings can map to).
Compare to Chinese characters in Japanese (kanji + kana) or Arabic script in Persian: highly adaptable but polyvalent and context-dependent.
Estimated 20–30% of Akkadian vocabulary in administrative, religious, and technical domains. Examples:
| Sumerian | Akkadian | Meaning |
|---|---|---|
dubsar (DUB.SAR) |
ṭupšarru |
scribe (lit. "tablet-writer") |
engar |
ikkaru |
farmer |
nagar |
naggāru |
carpenter |
é-gal |
ekallu |
palace (lit. "great house") |
abzu |
apsû |
abyss |
nam-tar |
namtaru |
fate; fate demon |
gal |
gallu |
"great" → "demon" (semantic shift in Akkadian) |
saĝĝa |
šangû |
temple administrator |
The direction tells you something about cultural prestige: Sumerian was the "high culture" language, so technical and administrative vocabulary flowed up the prestige gradient.
The Amarna Letters (14th c. BCE) document Akkadian as the diplomatic language of Egypt, Hatti, Babylon, Assyria, and their vassals. From this period:
- ṭuppu "tablet" → Hittite tuppi-, Elamite tup-pi
- ḫazannu "mayor / town leader" → Hittite ḫazannu-
But most "Akkadian influence" in Hittite is logographic, not phonetic — Hittite scribes wrote Akkadian words as logograms and pronounced them in Hittite.
Pre-Sumerian / pre-Akkadian personal names with reduplicated final syllables — Bunene, Zababa, Inana — suggest an unidentifiable substrate language possibly shared between Sumerian and Elamite. The hypothesis explains why certain etymologies are flatly unrecoverable.
Some words appear across multiple languages without clear directionality. Examples include certain plant, animal, and technology names; some divine names; the root ḫul/ḫūlu "evil/bad" appears in Sumerian, Akkadian, and possibly Elamite forms. These may be areal Sprachbund features.
An etymology / cognate layer is essential. A lemma entry should be able to link to cognates and loans across all four languages, with directionality (if known), semantic shift, and confidence level. Many etymologies are genuinely disputed; the model must not pretend otherwise.
Determinatives are signs that classify but aren't pronounced. ATF transcription puts them in {...}:
| Determinative | Class | Example |
|---|---|---|
{d} (DINGIR) |
divine |
{d}UTU = the god Shamash |
{m} / {lu2}
|
male personal name |
{m}ŠARRU-KĪN = Sargon |
{f} / {munus}
|
female personal name | {f}IŠTAR-... |
{ki} |
place name (city) |
Nibru{ki} = Nippur |
{kur} |
foreign land / mountain |
{kur}elam{ki} = Elam |
{giš} |
wooden object |
{giš}gigir "chariot" |
{kuš} |
leather | |
{na4} |
stone | |
{uruda} |
copper | |
{tug2} |
textile |
Determinatives are crucial for named-entity recognition and disambiguation. A search for "place names containing X" is a query over determinative-tagged tokens. Schema-wise, they belong as metadata on a token, not as separate tokens.
Cuneiform was preserved through millennia in part because the scribes built their own scholarly apparatus. Standardized lists like Ea and Diri provided four-column mappings:
- Phonetic gloss of the Sumerian word
- Sign form
- Sign name
- Akkadian translation
These are essentially the world's first dictionaries. They are also the foundation of every modern decipherment: bilingual Sumerian-Akkadian lexical lists are how Sumerian was deciphered (after Akkadian had been read by analogy with Hebrew and Arabic).
The Edubba ("House of Tablets"), the scribal school, drilled students for ~12 years on copying lexical lists and literary works. Scribes were a privileged elite, often more powerful than the illiterate monarchs they served.
| Phase | Date | Characteristics |
|---|---|---|
| Proto-literate (Uruk IV) | ~3100–3000 BCE | Pictographs on tokens; accounting primarily |
| Archaic Sumerian | ~3000–2500 BCE | Rotated 90° (around 2800 BCE); wedge-shaped marks emerge |
| Sumero-Akkadian | ~2500–2000 BCE | Phoneticization via the rebus principle to accommodate Akkadian |
| Regional diversification | ~2000–1000 BCE | Divergent ductus (handwriting styles); Assyrian linear, Hittite Old-Babylonian-derived |
| Achaemenid Elamite | ~550–330 BCE | Reduced to ~131 signs, near-alphabetic, almost no polyvalency |
| Late survival | ~330 BCE – 100 CE | Astronomical and ritual texts in Babylonian |
Sign count over time: Uruk IV ~1,500 (purely logographic); Ur III ~600 (balanced logo-syllabic); Neo-Assyrian ~570-800 (standardized polyphony); Hittite ~375 (high Sumerogram density); Achaemenid Elamite ~131 (simplified).
Cuneiform decipherment proceeded via trilingual inscriptions:
- Bisitun inscription (Old Persian + Elamite + Babylonian) — the Rosetta Stone equivalent for cuneiform. Old Persian was deciphered first (simpler script, known descendants). Akkadian came next via the Babylonian column. Elamite came last.
- Boğazköy archives — ~30,000 Hittite tablets discovered at the capital Hattusa; Bedřich Hrozný confirmed Hittite as Indo-European in 1915.
- Sumerian-Akkadian bilinguals — the lexical-list tradition let scholars work backward from Akkadian to Sumerian.
This history is why the schema has to treat bilingual / trilingual texts as first-class: they ARE the foundation of the field, not edge cases.
- Akkadian — excellent. Von Soden's GAG is canonical; the corpus is huge. Challenges: dialect variation, damaged tablets.
- Sumerian — strong (Jagersma 2010), but the verbal prefix system has genuine open questions. Different analyses (Jagersma, Zolyomi, Edzard) coexist.
- Hittite — strong (Hoffner, Melchert). Challenges: clitic chain analysis at clause level; logogram-heavy texts hide Hittite phonology.
- Elamite — improving. Recent Linear Elamite decipherment (Desset and others) is beginning to back-fill cuneiform Elamite. Small corpus; Achaemenid Old Persian influence distorts.
Distilled implications for the schema and the parser:
-
Multi-source provenance is structural. Every linguistic claim is contestable.
annotation_run_idcarries who/when/what version. - The graphemic / linguistic / semantic split is real. Hittite forces it; the schema benefits from honoring it for every language.
- Polyvalency means tokens ≠ readings. A token can have many candidate readings; readings can map to many lemmas; lemmas can have many senses.
-
Slot-based morphology for Sumerian; root-pattern for Akkadian; inflecting for Hittite; class-based for Elamite. The
morphologytable must support all four shapes. - Dialect is a first-class dimension. Especially for Akkadian.
- Determinatives are metadata. Token-level, never independent words.
- Bilingual / trilingual texts are foundational, not edge cases. Token-level language switching is normal.
- Confidence and disagreement are first-class data. Especially for Elamite and for the contested parts of Sumerian morphology.
- The loanword / cognate / etymology layer needs to exist. Sumerian → Akkadian alone is ~20–30% of administrative vocabulary.
- The Sumero-Akkadian lexical tradition is the field's foundation. Lexical-list ingestion is part of the data model, not a side feature.
- Jagersma, A. H. A Descriptive Grammar of Sumerian. PhD diss., Leiden 2010.
- Huehnergard, J. A Grammar of Akkadian. 3rd ed. Eisenbrauns 2011.
- Hoffner, H. A. & H. C. Melchert. A Grammar of the Hittite Language. Eisenbrauns 2008.
- Stolper, M. W. "Elamite". In Woodard, R. D., ed. The Cambridge Encyclopedia of the World's Ancient Languages. CUP 2004.
- ETCSL — Electronic Text Corpus of Sumerian Literature. http://etcsl.orinst.ox.ac.uk
- ORACC — http://oracc.museum.upenn.edu
- ePSD2 — http://oracc.org/epsd2
This file is a consensus distillation from three independently-commissioned research outputs (Claude, Gemini, Grok) answering the same question — "What do the four cuneiform languages share, where do they differ, and what should our schema track?" Each angle contributed:
- The pedagogical multilingual-analogy framing (German → Hittite, Japanese → Sumerian, Arabic → Akkadian, Mandarin → cuneiform-as-script)
- The detailed comparative-typology tables and the "Sign-Topology Folding" framing
- The "AlphaFold for cuneiform" framing and the concise schema-action items
Where the three diverged: minor differences in transliteration conventions (handled by preferring the ATF / OGSL standard), and confidence on speculative Wanderwörter (this synthesis flags speculation explicitly).
The originals lived in PLAN/Linguistics Schema/v2/Phase 1/ and are archived (committed to git history then removed) as part of the 2026-05-11 knowledge architecture overhaul.
Source: github.com/wittkensis/glintstone · Issues · Edit this wiki
Start here
Getting Started
Overview
Data Model
- Data Sources
- Data Quality
- Data Issues
- Import Pipeline Guide
- ML Integration
- Citation Pipeline Summary
Reference — Data Model
Reference — API
Reference — MCP
Opportunities
Personas
Project
Research