Research BabyLemmatizer Integration

BabyLemmatizer & Lemmatization: Integration with the Cuneiform Data Schema

1. What BabyLemmatizer Actually Does

BabyLemmatizer (Sahala & Lindén, 2023) is the current state-of-the-art tool for two tasks that sit at the critical juncture between your schema's reading layer and linguistic layer:

POS-tagging: Assigning a part-of-speech tag (N, V, AJ, PRP, etc.) to each token in transliterated cuneiform text
Lemmatization: Mapping an inflected surface form to its dictionary headword (citation form)

It approaches both as machine translation (sequence-to-sequence) problems using OpenNMT encoder-decoder networks, not as classification or lookup tasks. This architectural choice matters: the model can generate lemmata for forms it has never seen (out-of-vocabulary, or OOV, words), reaching 68–84% accuracy on unseen forms depending on language and dialect.

The Pipeline Flow

Transliterated text (CoNLL-U)
    → POS-tagger (seq2seq: form → tag)
        → Lemmatizer (seq2seq: form + POS context → lemma)
            → Post-correction (dictionary-based heuristics + confidence scoring)
                → Annotated output (CoNLL-U)

The POS tag predicted by the tagger feeds into the lemmatizer as contextual input: the lemmatizer sees not just the form but also its surrounding POS tags. Because the stages are chained, POS errors propagate into lemmatization errors, which your schema's confidence tracking needs to account for.

2. The Tokenization Problem: Where BabyLemmatizer Makes Its Most Schema-Relevant Decision

BabyLemmatizer defines three tokenization strategies, selectable per model:

Mode	ID	Languages	Description
Logo-syllabic (unindexed)	0	Akkadian, Elamite, Hittite, Urartian, Hurrian	Syllabic signs → character sequences; logograms → preserved as tokens
Logo-syllabic (indexed)	1	Sumerian	Preserves sign indices (subscript numbers) because they carry meaning in Sumerian
Character sequences	2	Non-cuneiform (Greek, Latin)	Standard character-level tokenization

This is the single most important design decision for your schema's relationship with BabyLemmatizer, and here's why:

The Akkadian Example (Mode 0)

Consider the Neo-Assyrian form IMIN{+et} (meaning "seven"):

BabyLemmatizer tokenizes this as something like:

Source: IMIN { + e t }       (logogram kept as one token, per the mode table above; phonetic complement split into characters)
Target: s e b e              (the lemma: sebe, "seven")

The sign IMIN is a Sumerogram: the Sumerian word for "seven" used logographically in an Akkadian text. The phonetic complement {+et} tells you this is the Akkadian word sebe(t) with the feminine ending. BabyLemmatizer's unindexed mode strips the index numbers from syllabic signs (e.g., du₃ becomes du) because in Akkadian the index only identifies which cuneiform sign was written; it does not change the phonological value.

The Sumerian Example (Mode 1)

For Sumerian, subscripts do matter. The signs du (to go), du₃ (to build), and du₇ (to be perfect) are completely different words that happen to share similar phonetic values. Stripping subscripts would destroy critical lexical information. So mode 1 preserves them.
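
The difference between the modes can be sketched in a few lines of Python. This is an illustrative approximation, not BabyLemmatizer's own tokenizer: the function name `tokenize_form`, the rule that an all-uppercase sign counts as a logogram, and the decision to drop sign boundaries are assumptions made for this example.

```python
import re

# Trailing sign index, written with subscript or plain digits (du₃, du3).
_INDEX = re.compile(r"[₀-₉0-9]+$")


def tokenize_form(form: str, mode: int) -> list[str]:
    """Hypothetical tokenizer illustrating the three modes.

    mode 0: logo-syllabic, indices stripped from syllabic signs
    mode 1: logo-syllabic, indices preserved
    mode 2: plain character sequence
    """
    if mode == 2:
        return list(form)
    tokens: list[str] = []
    for sign in form.split("-"):
        if sign.isupper():
            # Assumption: an all-uppercase sign is a logogram and stays whole.
            tokens.append(sign)
        elif mode == 0:
            tokens.extend(_INDEX.sub("", sign))
        else:
            tokens.extend(sign)
    return tokens


print(tokenize_form("mu-na-du₃", 0))  # index dropped
print(tokenize_form("mu-na-du₃", 1))  # index kept as its own symbol
print(tokenize_form("LUGAL-uš", 0))   # logogram preserved as one token
```

In mode 0 the two Sumerian verbs du₃ and du₇ would collapse into the same symbols, which is exactly why Sumerian models use mode 1.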

Schema Implication

Your TOKEN entity needs to track:

token:
  raw_form: "IMIN{+et}"              # exactly as transliterated
  tokenization_mode: 0                # which strategy was applied
  tokenized_form: ["IMIN","{","+","e","t","}"]  # illustrative model input; logogram kept as one token
  predicted_pos: "NU"                 # number
  predicted_lemma: "sebe"             # dictionary headword
  confidence: 0.94                    # from post-correction scoring
  is_logogram: true                   # IMIN is a Sumerogram
  logogram_language: "Sumerian"       # the writing system origin
  target_language: "Akkadian"         # what language is being expressed

The tokenization mode is effectively a pre-processing contract between your data and any ML model. Different models may expect different tokenizations of the same raw form. Your schema should store the raw form as canonical and derive tokenized representations on demand, or store them as a computed field with provenance.

3. CoNLL-U as the Interchange Format

BabyLemmatizer uses CoNLL-U (and an extended CoNLL-U Plus variant) as its input/output format. This is also the format used by CDLI, MTAAC, and increasingly by other cuneiform NLP projects. Understanding this format is essential for your schema design because it's the closest thing the field has to a standard.

Standard CoNLL-U Fields

ID    FORM       LEMMA    UPOS  XPOS  FEATS              HEAD  DEPREL  DEPS  MISC
1     a-di       adi      PRP   PRP   _                  _     _       _     _
2     IMIN{+et}  sebe     NU    NU    _                  _     _       _     _
3     a-ra-an-šu arnu     N     N     Gender=Masc|...    _     _       _     _
4     pu-uṭ-ri   paṭāru   V     V     Stem=G|Tense=Imp   _     _       _     _

Field	What it maps to in your schema	Notes
ID	Token position	Integer, 1-indexed per sentence
FORM	`raw_form` in your reading_layer	The transliterated surface form
LEMMA	`lemma` → DICTIONARY_ENTRY	The citation/dictionary form
UPOS	Universal POS tag	Cross-linguistic tag set (NOUN, VERB, ADJ, etc.); the example above carries ORACC tags in this column
XPOS	Language-specific POS tag	ORACC POS tags (more granular)
FEATS	Morphological features	Key=Value pairs (Gender, Number, Case, Stem, etc.)
HEAD	Syntactic head	Dependency parse (often empty for cuneiform)
DEPREL	Dependency relation	(ditto)
DEPS	Enhanced dependencies	(rarely used)
MISC	Overflow	Damage info, confidence scores, etc.
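
As a rough illustration of the field layout above, the following sketch reads one CoNLL-U token line into a dictionary. It assumes tab-separated columns and simple integer IDs (no multiword ranges such as `1-2`); the names `parse_conllu_line` and `parse_feats` are made up for this example and do not come from BabyLemmatizer.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_feats(feats: str) -> dict[str, str]:
    """Turn 'Stem=G|Tense=Imp' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conllu_line(line: str) -> dict:
    """Parse one tab-separated CoNLL-U token line (not a '#' comment line)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    token = dict(zip(CONLLU_FIELDS, columns))
    token["id"] = int(token["id"])
    token["feats"] = parse_feats(token["feats"])
    # Underscore is the CoNLL-U placeholder for an empty field.
    for key in ("lemma", "upos", "xpos", "head", "deprel", "deps", "misc"):
        if token[key] == "_":
            token[key] = None
    return token


token = parse_conllu_line("2\tIMIN{+et}\tsebe\tNU\tNU\t_\t_\t_\t_\t_")
print(token["form"], token["lemma"], token["xpos"])
```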

What CoNLL-U Doesn't Capture (But Your Schema Must)

This is where the gap between BabyLemmatizer's output and your full schema becomes critical:

Sign-level information: CoNLL-U operates at the word level. It doesn't track individual cuneiform signs within a word, their visual forms, damage states, or polyvalent readings. Your schema's graphemic layer sits below CoNLL-U's granularity.
Mixed-language writing: CoNLL-U has a single FORM field. It doesn't natively represent that LUGAL-uš is a Sumerian logogram + Hittite phonetic complement. Your schema needs the sign_origins[] and functions[] arrays from the earlier tutorial.
Alternative readings: CoNLL-U gives one LEMMA. BabyLemmatizer's post-correction produces confidence scores and can flag uncertain analyses, but the format doesn't natively support "this might be lemma A (0.7) or lemma B (0.25)." Your schema needs alternative_readings[].
Tablet/text metadata: CoNLL-U has limited metadata capability via # comment lines. It doesn't model the physical object, provenance, or the full textual hierarchy (tablet → surface → column → line → token).
Cross-references: No mechanism for linking a token to parallel texts, lexical list entries, or sign list references.

Where CoNLL-U Fits in Your Architecture

                        Your Full Schema
                    ┌─────────────────────┐
                    │  TABLET (physical)   │
                    │    └── SURFACE       │
                    │        └── LINE      │
                    │            └── TOKEN │ ◄── This is what CoNLL-U represents
                    │                │     │     (partially)
                    │       ┌────────┼─────┤
                    │  Graphemic  Reading  Linguistic
                    │  Layer      Layer    Layer
                    │  (signs)    (translit) (lemma,POS)
                    └─────────────────────┘
                              ▲
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         ORACC ATF      CoNLL-U          ORACC JSON
         (source)    (BabyLemmatizer     (richest
                      I/O format)        structured
                                         output)

Design recommendation: Use CoNLL-U (or CoNLL-U Plus) as an import/export interchange format, not as your internal data model. Your schema should be richer than CoNLL-U can express, but should be able to serialize to CoNLL-U for interoperability with BabyLemmatizer, spaCy pipelines, and the broader NLP ecosystem.
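
A minimal export sketch, assuming the internal token is a plain dictionary: fields CoNLL-U cannot express are dropped, and the confidence score is carried in MISC. The record layout, the `sign_sequence` key, and the `Confidence` MISC key are assumptions for illustration, not an established convention.

```python
def to_conllu_line(position: int, token: dict) -> str:
    """Flatten a richer token record into one tab-separated CoNLL-U line."""
    feats = token.get("feats") or {}
    feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    confidence = token.get("confidence")
    misc = f"Confidence={confidence:.2f}" if confidence is not None else "_"
    columns = [
        str(position),
        token["form"],
        token.get("lemma") or "_",
        token.get("upos") or "_",
        token.get("xpos") or "_",
        feats_str,
        "_",  # HEAD: no dependency parse available
        "_",  # DEPREL
        "_",  # DEPS
        misc,
    ]
    return "\t".join(columns)


record = {
    "form": "a-di", "lemma": "adi", "upos": "PRP", "xpos": "PRP",
    "confidence": 0.94,
    "sign_sequence": ["a", "di"],  # richer data with no CoNLL-U column
}
print(to_conllu_line(1, record))
```

The export is lossy by design: the round trip back from CoNLL-U cannot restore sign-level data, so the internal record remains the source of truth.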

4. The ORACC Lemmatization System: The Upstream Data Source

BabyLemmatizer was trained on data extracted from ORACC (Open Richly Annotated Cuneiform Corpus), which has its own lemmatization conventions that predate and differ from CoNLL-U. Understanding ORACC's system is essential because it's the largest source of human-annotated cuneiform text.

ORACC Lemmatization Format (ATF #lem lines)

In ORACC ATF files, lemmatization is encoded inline:

1. a-na be-lí-ia qí-bí-ma
#lem: ana[to]PRP; bēlu[lord]N; qabû[say]V

Each lemma annotation has the structure:

CF[GW]POS

Where:

CF (Citation Form): The dictionary headword (e.g., bēlu)
GW (Guide Word): A basic English meaning used as a disambiguator (e.g., lord)
POS: Part of speech tag (e.g., N)

Extended form (when adding new information):

+CF[GW//SENSE]POS'EPOS$NORM0

Where:

SENSE: Contextual meaning (may differ from GW)
EPOS: Effective POS (when a word functions as a different POS than its base, e.g., a noun used as a preposition)
NORM0: Normalized/transcribed form (the actual pronunciation, e.g., bēlīya for "my lord")
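
The two annotation shapes above can be picked apart with a regular expression. The sketch below is a simplification under stated assumptions: it handles only `CF[GW]POS` and `+CF[GW//SENSE]POS'EPOS$NORM0`, not the other notations real #lem lines may contain, and the names `LEM_PATTERN`, `parse_lem_entry`, and `parse_lem_line` are hypothetical.

```python
import re

LEM_PATTERN = re.compile(
    r"""
    ^\+?                          # optional plus sign on extended entries
    (?P<cf>[^\[]+)                # citation form, up to the bracket
    \[(?P<gw>.*?)                 # guide word
    (?://(?P<sense>.*?))?         # optional //SENSE
    \]
    (?P<pos>[^'$]+)               # part of speech
    (?:'(?P<epos>[^$]+))?         # optional 'EPOS
    (?:\$(?P<norm0>.+))?          # optional $NORM0
    $
    """,
    re.VERBOSE,
)


def parse_lem_entry(entry: str) -> dict:
    """Parse one lemma annotation into its named parts (missing parts are None)."""
    match = LEM_PATTERN.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognised lemma annotation: {entry!r}")
    return match.groupdict()


def parse_lem_line(line: str) -> list[dict]:
    """Split a '#lem:' line on ';' and parse each entry."""
    body = line.removeprefix("#lem:")
    return [parse_lem_entry(part) for part in body.split(";") if part.strip()]


for entry in parse_lem_line("#lem: ana[to]PRP; bēlu[lord]N; qabû[say]V"):
    print(entry["cf"], entry["gw"], entry["pos"])
```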

How ORACC Lemmatization Maps to Your Schema

ORACC Field	Schema Entity	Notes
CF (Citation Form)	DICTIONARY_ENTRY.lemma_form	The canonical headword
GW (Guide Word)	DICTIONARY_ENTRY.meanings[0]	Primary disambiguating meaning
SENSE	TOKEN.semantic_layer.translation	Context-specific meaning
POS	TOKEN.linguistic_layer.part_of_speech (base)
EPOS	TOKEN.linguistic_layer.part_of_speech (effective)	Captures re-categorization
NORM0	TOKEN.reading_layer.phonetic_reading	The actual pronunciation
FORM (transliteration)	TOKEN.reading_layer.transliteration	What's on the tablet

Critical insight: ORACC's distinction between CF and NORM0 maps directly to the split between your schema's DICTIONARY_ENTRY (abstract lexical item) and TOKEN (specific instance). bēlu is the dictionary entry; bēlīya ("my lord") is the normalized token-level form. BabyLemmatizer predicts the CF — it doesn't currently produce NORM0, though this is on the roadmap.

The L2 Lemmatizer (ORACC's Native Tool)

Before BabyLemmatizer, ORACC's own lemmatizer L2 (by Steve Tinney) was the primary tool. L2 is fundamentally dictionary-based — it looks up forms in a glossary and suggests matches. It cannot handle OOV words (it just flags them for manual annotation). BabyLemmatizer's neural approach fills exactly this gap.

Your pipeline should ideally combine both:

Input text → L2 (dictionary lookup, high-precision for known forms)
           → BabyLemmatizer (neural prediction, handles OOV)
           → Merge (prefer L2 when confident, fall back to BabyLemmatizer)
           → Human review (flagged uncertain cases)
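
One way to read the merge step is sketched below. It assumes each tool's result has already been reduced to a small dictionary; the function name `merge_annotations`, the record keys, and the 0.8 review threshold are illustrative choices, not values taken from either tool.

```python
from typing import Optional


def merge_annotations(l2: Optional[dict], neural: dict,
                      review_threshold: float = 0.8) -> dict:
    """Prefer a dictionary hit from L2; otherwise fall back to the neural result."""
    if l2 is not None:
        agrees = (l2["lemma"], l2["pos"]) == (neural["lemma"], neural["pos"])
        return {
            "lemma": l2["lemma"],
            "pos": l2["pos"],
            "source": "L2",
            # A disagreement between the two tools is worth a human look.
            "needs_review": not agrees,
        }
    return {
        "lemma": neural["lemma"],
        "pos": neural["pos"],
        "source": "BabyLemmatizer",
        # No dictionary support: review anything below the threshold.
        "needs_review": neural["confidence"] < review_threshold,
    }


neural = {"lemma": "arnu", "pos": "N", "confidence": 0.93}
print(merge_annotations({"lemma": "arnu", "pos": "N"}, neural))
print(merge_annotations(None, neural))
```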

5. What "Lemma" Actually Means Across Languages

This is where the four languages in your schema diverge significantly, and where BabyLemmatizer's language-specific models become essential.

Akkadian Lemmatization

The lemma is the citation form: typically the nominative singular for nouns and the G-stem infinitive for verbs:

Surface form: iš-pur → Lemma: šapāru [send]V
Surface form: šar-ra-ti → Lemma: šarratu [queen]N
Surface form: LUGAL → Lemma: šarru [king]N (logographic writing, lemmatized to the Akkadian word)

The mapping from surface to lemma involves:

Stripping inflectional morphology (case, number, person, tense)
Resolving logographic writing to the Akkadian word
Normalizing phonological spelling variants
Identifying the verbal stem (G, D, Š, N, etc.)

BabyLemmatizer achieves 94–96% accuracy on this task for in-vocabulary forms and 68–84% for OOV forms.

Sumerian Lemmatization

Sumerian lemmatization is structured differently because of the agglutinative morphology:

Surface form: mu-na-du₃ → Lemma: du₃ [build]V (the verbal root)
Surface form: lugal-e → Lemma: lugal [king]N (strip case marker)
Surface form: e₂-gal → Lemma: egal [palace]N (compound treated as single lemma)

Key differences from Akkadian:

No root-and-pattern system to reverse-engineer
Compound words may be treated as single lemmata or decomposed (scholarly convention varies)
The same form can be verbal or nominal depending on context (Sumerian word-class flexibility)
Sign indices (du₃ vs. du₇) are critical and must be preserved

BabyLemmatizer uses separate models for literary and administrative Sumerian (different vocabulary distributions, different formulaic patterns).

Hittite Lemmatization

Hittite lemmatization must solve the Sumerogram/Akkadogram problem:

Surface form: LUGAL-uš → Lemma: ḫassus [king] (Sumerogram resolved to Hittite word)
Surface form: ar-ḫa → Lemma: arḫa [away] (phonetically written Hittite)

BabyLemmatizer does not currently have a pretrained Hittite model (it lists Akkadian dialects, Sumerian, and Urartian). The tokenization mode 0 is flagged as applicable to Hittite, but no trained model exists. This is a gap your pipeline would need to fill, likely by training on the Hethitologie Portal or Chicago Hittite Dictionary data.

Elamite Lemmatization

Similarly, no pretrained Elamite model exists. The tokenization mode 0 covers it in principle, but the training data is extremely sparse. Elamite lemmatization would likely require:

A small hand-annotated corpus (from Persepolis texts)
Heavy use of transfer learning from Akkadian models
Much lower accuracy expectations
Mandatory human review for all outputs

Schema Implication: Language-Polymorphic Lemma Definitions

Your DICTIONARY_ENTRY entity needs language-specific structure:

DICTIONARY_ENTRY:
  lemma_form: "šapāru"
  language: "Akkadian"
  dialect: "Neo-Assyrian"
  pos: "V"

  # Akkadian-specific:
  akkadian_root: "š-p-r"
  verbal_stems_attested: ["G", "Š", "N"]

  # Meanings with ORACC-style GW:
  guide_word: "send"
  senses:
    - "to send (a message, person)"
    - "to write (a letter)"
    - "to dispatch"

  # Cross-linguistic links:
  sumerian_logogram: "KIN"  # Sumerogram used for this word
  cognates:
    - {language: "Hebrew", form: "spr", meaning: "to count, write"}
    - {language: "Arabic", form: "sfr", meaning: "to write, travel"}

vs.

DICTIONARY_ENTRY:
  lemma_form: "du₃"
  language: "Sumerian"
  pos: "V"

  # Sumerian-specific:
  sign_name: "KAK"           # the sign whose reading is du₃
  sign_unicode: "U+12195"    # CUNEIFORM SIGN KAK
  compound_analysis: null  # not a compound

  guide_word: "build"
  senses:
    - "to build, construct"
    - "to erect"

  # Usage as Sumerogram:
  used_logographically_in:
    - {language: "Akkadian", read_as: "banû"}
    - {language: "Hittite", read_as: "unknown (always logographic)"}

6. Confidence Scoring and the Human-in-the-Loop

BabyLemmatizer's post-correction module assigns confidence scores by comparing the neural network's output against dictionary-based heuristics. These scores are the natural input to your schema's trust framework.

How Confidence Works in BabyLemmatizer

The post-correction step:

Takes the neural network's predicted lemma + POS
Checks it against a known glossary/dictionary
If the form-lemma pair is attested in the training data → high confidence
If the lemma is in the dictionary but this specific form isn't attested → medium confidence
If the lemma is entirely novel (not in dictionary) → low confidence, flagged for review
An override lexicon can force specific corrections
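
The tiers above can be expressed as a small decision function. This is a sketch of the logic as described here, not BabyLemmatizer's actual scoring code: the name `confidence_tier`, the tier labels, and the data structures are assumptions.

```python
def confidence_tier(form, lemma, attested_pairs, dictionary, overrides=None):
    """Return (final_lemma, tier) for one predicted form-lemma pair."""
    overrides = overrides or {}
    if form in overrides:
        # An override lexicon wins over whatever the network predicted.
        return overrides[form], "override"
    if (form, lemma) in attested_pairs:
        return lemma, "high"      # exact pair seen in training data
    if lemma in dictionary:
        return lemma, "medium"    # known lemma, unattested spelling
    return lemma, "low"           # novel lemma, flag for review


attested = {("a-ra-an-šu", "arnu")}
lexicon = {"arnu", "šapāru"}
print(confidence_tier("a-ra-an-šu", "arnu", attested, lexicon))
print(confidence_tier("iš-pur", "šapāru", attested, lexicon))
```

Storing the tier alongside the numeric score keeps the reason for a confidence value visible to reviewers.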

Schema Integration

TOKEN.annotation_provenance:
  method: "BabyLemmatizer_v2.2"
  model: "neo-assyrian"
  model_version: "2024-06-07"

  pos_prediction:
    value: "V"
    confidence: 0.97
    alternatives: [{value: "N", confidence: 0.02}]
    source: "neural"

  lemma_prediction:
    value: "šapāru"
    confidence: 0.91
    alternatives: [{value: "šāpiru", confidence: 0.06}]
    source: "neural+dictionary_confirmed"

  human_review:
    status: "unreviewed" | "confirmed" | "corrected"
    reviewer: null
    correction: null
    review_date: null

  flags:
    is_oov: false
    is_logographic: false
    damage_affects_reading: false

This maps directly to the trust acquisition → growth → maintenance cycle:

Acquisition: BabyLemmatizer produces an initial annotation with confidence score
Growth: Human reviewer confirms or corrects, which can be fed back into retraining
Maintenance: As the dictionary grows and models improve, confidence scores on previously-annotated tokens can be recalculated

7. The EvaCun 2025 Frontier: LLMs vs. Specialized Models

The field is actively moving. The EvaCun 2025 Shared Task (co-located with NAACL 2025) benchmarked LLMs and transformer models against BabyLemmatizer on Akkadian and Sumerian lemmatization and token prediction, using new datasets from the Electronic Babylonian Library (eBL) and Archibab.

Key findings relevant to your pipeline:

BabyLemmatizer remains the baseline to beat for lemmatization
LLM-based approaches (few-shot prompting) achieved ~90% on in-vocabulary but only ~9% on OOV — dramatically worse than BabyLemmatizer's 68–84%
The token prediction task (filling in missing/damaged text) is a different problem where LLMs may have more promise
New data sources (eBL for first-millennium literature, Archibab for Old Babylonian) are expanding what's available for training

Schema Implication

Your pipeline architecture should be model-agnostic at the annotation layer. The TOKEN.annotation_provenance should be able to record whether an annotation came from:

BabyLemmatizer (specialized seq2seq)
An LLM (prompted or fine-tuned)
L2 (dictionary lookup)
A human annotator
A finite-state transducer (BabyFST)
Future tools not yet built

This means your schema should not bake in assumptions about which tool produced an annotation. The provenance metadata should be rich enough to evaluate and compare different methods retrospectively.

8. The Full Pipeline: From Tablet to Lemmatized Text

Here's how everything connects, with your schema as the backbone:

┌─────────────────────────────────────────────────────────────────┐
│ PHYSICAL LAYER                                                   │
│                                                                   │
│  3D scan / photograph of tablet                                  │
│       │                                                           │
│       ▼                                                           │
│  Sign identification (DeepScribe, CNN/ViT models)                │
│       │                                                           │
│       ▼                                                           │
│  ┌─────────────────────────────────────┐                         │
│  │ SCHEMA: SIGN entities              │                         │
│  │   sign_id, unicode, visual_form,   │                         │
│  │   period, region, damage_state      │                         │
│  └─────────────┬───────────────────────┘                         │
│                │                                                   │
├────────────────┼─────────────────────────────────────────────────┤
│ READING LAYER  │                                                  │
│                ▼                                                   │
│  Reading assignment (polyvalence resolution)                      │
│  → Determine language, resolve sign → syllable/logogram          │
│       │                                                           │
│       ▼                                                           │
│  Transliteration (ATF format)                                     │
│  e.g., "a-na be-lí-ia qí-bí-ma"                                 │
│       │                                                           │
│       ▼                                                           │
│  ┌─────────────────────────────────────┐                         │
│  │ SCHEMA: TOKEN.reading_layer        │                         │
│  │   transliteration, sign_sequence,   │                         │
│  │   function_per_sign, phonetic_read  │                         │
│  └─────────────┬───────────────────────┘                         │
│                │                                                   │
├────────────────┼─────────────────────────────────────────────────┤
│ LINGUISTIC LAYER (where BabyLemmatizer operates)                 │
│                ▼                                                   │
│  ┌───────────────────────────┐                                   │
│  │ Convert to CoNLL-U        │ ◄── ATF-to-CoNLL-U converter     │
│  │ (form per line, blank     │     (e.g., Pagé-Perron scripts)   │
│  │  metadata fields)         │                                   │
│  └─────────────┬─────────────┘                                   │
│                │                                                   │
│                ▼                                                   │
│  ┌───────────────────────────┐    ┌─────────────────────────┐   │
│  │ BabyLemmatizer            │    │ L2 (ORACC dictionary    │   │
│  │  → POS-tagger             │    │  lemmatizer)            │   │
│  │  → Lemmatizer             │    │  → glossary lookup      │   │
│  │  → Post-correction        │    │  → exact match only     │   │
│  └─────────────┬─────────────┘    └──────────┬──────────────┘   │
│                │                              │                    │
│                └──────────┬───────────────────┘                   │
│                           ▼                                       │
│                  Merge / reconcile                                │
│                  (prefer dictionary match;                        │
│                   fall back to neural;                            │
│                   flag disagreements)                             │
│                           │                                       │
│                           ▼                                       │
│  ┌─────────────────────────────────────┐                         │
│  │ SCHEMA: TOKEN.linguistic_layer     │                         │
│  │   lemma → DICTIONARY_ENTRY          │                         │
│  │   part_of_speech (UPOS + XPOS)     │                         │
│  │   morphological_analysis            │                         │
│  │   annotation_provenance             │                         │
│  │   confidence_score                  │                         │
│  │   human_review_status               │                         │
│  └─────────────┬───────────────────────┘                         │
│                │                                                   │
├────────────────┼─────────────────────────────────────────────────┤
│ SEMANTIC LAYER │                                                  │
│                ▼                                                   │
│  Translation / interpretation                                     │
│  (currently largely manual; some MT work emerging)                │
│       │                                                           │
│       ▼                                                           │
│  ┌─────────────────────────────────────┐                         │
│  │ SCHEMA: TOKEN.semantic_layer       │                         │
│  │   translation, semantic_domain,     │                         │
│  │   cross_references                  │                         │
│  └─────────────────────────────────────┘                         │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

9. Practical Schema Additions Based on BabyLemmatizer Integration

Based on this analysis, here are the specific additions to the schema from the earlier tutorial:

9.1 New Entity: ANNOTATION_RUN

Track each time a tool processes a text:

ANNOTATION_RUN:
  run_id: UUID
  tool: "BabyLemmatizer" | "L2" | "human" | "LLM" | ...
  tool_version: "2.2"
  model_name: "neo-assyrian"
  model_config:
    tokenizer_mode: 0
    tagger_context: 2
    lemmatizer_context: 1
  timestamp: ISO8601
  input_format: "CoNLL-U"
  corpus_scope: [list of tablet_ids processed]
  aggregate_metrics:
    pos_accuracy_estimate: 0.97
    lemma_accuracy_estimate: 0.95
    oov_rate: 0.09
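
Of the aggregate metrics, `oov_rate` is the one that can be computed without gold annotations. A minimal sketch, assuming it is defined as the share of input forms absent from the model's training vocabulary (the function name and that definition are assumptions):

```python
def oov_rate(forms, training_vocabulary):
    """Fraction of token forms that the model never saw during training."""
    if not forms:
        return 0.0
    unseen = sum(1 for form in forms if form not in training_vocabulary)
    return unseen / len(forms)


forms = ["a-na", "be-lí-ia", "qí-bí-ma", "a-na"]
vocabulary = {"a-na", "be-lí-ia"}
print(oov_rate(forms, vocabulary))
```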

9.2 Extended TOKEN Entity

TOKEN (additions to original schema):

  # CoNLL-U compatible fields:
  conllu_id: 3                        # position in sentence
  form: "a-ra-an-šu"                 # FORM field
  lemma: "arnu"                       # LEMMA field → links to DICTIONARY_ENTRY
  upos: "NOUN"                        # Universal POS
  xpos: "N"                           # ORACC-specific POS
  feats: "Gender=Masc|Number=Sing|Case=Acc"  # Morphological features
  head: null                          # Dependency head (if parsed)
  deprel: null                        # Dependency relation

  # BabyLemmatizer-specific metadata:
  tokenization_mode: 0
  tokenized_input: "a r a a n š u"   # Illustrative: the characters of a-ra-an-šu as the model would see them

  # Multi-source annotation tracking:
  annotations[]:
    - source: "BabyLemmatizer"
      run_id: UUID → ANNOTATION_RUN
      lemma: "arnu"
      pos: "N"
      confidence: 0.93
      is_oov: false
    - source: "L2"
      run_id: UUID → ANNOTATION_RUN
      lemma: "arnu"
      pos: "N"
      confidence: 1.0    # exact glossary match (homographs can still be ambiguous)
      is_oov: false
    - source: "human_reviewer"
      reviewer_id: "..."
      lemma: "arnu"       # confirmed
      pos: "N"
      review_date: ISO8601
      notes: null

  # Consensus annotation (derived):
  consensus_lemma: "arnu"
  consensus_pos: "N"
  consensus_confidence: 0.98
  consensus_method: "dictionary_confirmed + neural_confirmed + human_confirmed"
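
The derived consensus fields could be computed from the `annotations[]` list along these lines. The source ranking (human over dictionary over neural) and the function name `derive_consensus` are assumptions for this sketch; it deliberately does not try to reproduce the numeric `consensus_confidence` shown above.

```python
# Assumed ranking: a human decision outranks a dictionary hit,
# which outranks a neural prediction.
SOURCE_PRIORITY = {"human_reviewer": 3, "L2": 2, "BabyLemmatizer": 1}


def derive_consensus(annotations):
    """Pick the annotation from the most trusted source and report agreement."""
    if not annotations:
        raise ValueError("no annotations to reconcile")
    best = max(annotations, key=lambda a: SOURCE_PRIORITY.get(a["source"], 0))
    agreeing = [a["source"] for a in annotations
                if (a["lemma"], a["pos"]) == (best["lemma"], best["pos"])]
    return {
        "consensus_lemma": best["lemma"],
        "consensus_pos": best["pos"],
        "agreeing_sources": agreeing,
        "unanimous": len(agreeing) == len(annotations),
    }


annotations = [
    {"source": "BabyLemmatizer", "lemma": "arnu", "pos": "N"},
    {"source": "L2", "lemma": "arnu", "pos": "N"},
    {"source": "human_reviewer", "lemma": "arnu", "pos": "N"},
]
print(derive_consensus(annotations))
```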

9.3 New Entity: LEMMATIZATION_AMBIGUITY

For cases where multiple valid lemmatizations exist:

LEMMATIZATION_AMBIGUITY:
  token_id: UUID → TOKEN
  candidates[]:
    - lemma: "pānu"        # front, face
      pos: "N"
      sense: "front"
      probability: 0.45
      evidence: "common in prepositional phrases"
    - lemma: "amāru"       # to see
      pos: "V"
      sense: "see"
      probability: 0.35
      evidence: "IGI can be logographic for amāru"
    - lemma: "naplastu"    # to look
      pos: "V"
      sense: "appear"
      probability: 0.15
      evidence: "less common reading"
  resolution_status: "unresolved" | "human_resolved" | "auto_resolved"
  resolved_to: "pānu"
  resolution_rationale: "prepositional context (ina pān) strongly favors nominal reading"

This is especially important for Sumerograms like IGI, which ORACC's documentation explicitly mentions can have readings like pān[front]N | immar[see]V | innamir[appear]V | igi[reciprocal]N.
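
A sketch of how `resolution_status` might be set automatically: resolve only when the top candidate leads the runner-up by a clear margin, and otherwise leave the token for a human. The function name `resolve_candidates` and the 0.2 margin are arbitrary assumptions.

```python
def resolve_candidates(candidates, min_margin=0.2):
    """Auto-resolve only if the best candidate clearly beats the second best."""
    ranked = sorted(candidates, key=lambda c: c["probability"], reverse=True)
    top = ranked[0]
    runner_up = ranked[1]["probability"] if len(ranked) > 1 else 0.0
    if top["probability"] - runner_up >= min_margin:
        return {"resolution_status": "auto_resolved", "resolved_to": top["lemma"]}
    return {"resolution_status": "unresolved", "resolved_to": None}


igi_candidates = [
    {"lemma": "pānu", "probability": 0.45},
    {"lemma": "amāru", "probability": 0.35},
    {"lemma": "naplastu", "probability": 0.15},
]
print(resolve_candidates(igi_candidates))
```

With the probabilities from the IGI example, the margin between the top two candidates is too small under this threshold, so the token stays unresolved until a reviewer supplies the rationale.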

9.4 Training Data Provenance

For the end-to-end "AlphaFold-equivalent" system discussed in Section 10, tracking where training data came from is critical:

TRAINING_CORPUS:
  corpus_id: UUID
  name: "ORACC SAAo Neo-Assyrian"
  source_project: "saao/saa08"
  language: "Akkadian"
  dialect: "Neo-Assyrian"
  genre: "state_correspondence"
  period: "Neo-Assyrian"
  token_count: 150000
  lemmatized_by: "human (ORACC project team)"
  license: "CC-BY-SA"

  # Quality metrics:
  inter_annotator_agreement: null  # often not measured
  known_biases:
    - "Predominantly royal correspondence"
    - "Limited colloquial vocabulary"
    - "Sumerogram-heavy (many underlying Akkadian forms unknown)"

  # Downstream model usage:
  used_to_train: [ANNOTATION_RUN_ids]

10. Gaps and Opportunities

What BabyLemmatizer Doesn't Do (Yet) That Your Schema Should Plan For

Capability	BabyLemmatizer Status	Your Schema Should...
Morphological analysis	On roadmap, not implemented	Reserve FEATS field; design language-specific morph templates
Phonological transcription	On roadmap	Include NORM0 field (ORACC convention)
Named entity recognition	On roadmap	Include NER tags (PN, GN, DN, etc.)
Syntactic parsing	Not planned	Support HEAD/DEPREL but don't require them
Hittite models	Tokenization supported, no trained model	Plan for it; Hittite Sumerogram resolution is the hardest test
Elamite models	Tokenization supported, no trained model	Plan for extreme uncertainty; confidence thresholds must be lower
Sign-level OCR	Out of scope (operates on transliteration)	Your graphemic layer sits below BabyLemmatizer's input
Fragment joining	Out of scope	Separate subsystem, but should link to same TOKEN entities

The "AlphaFold" Opportunity

The gap between sign identification (images → signs) and lemmatization (transliteration → lemmata) is currently bridged by human transliteration. An end-to-end system would:

Take a tablet image
Identify signs (computer vision)
Assign readings (contextual language model)
Lemmatize (BabyLemmatizer or successor)
Translate (MT)

Steps 2–3 are the least automated and arguably the hardest. Your schema, by explicitly modeling the graphemic and reading layers as separate from the linguistic layer, provides the data architecture needed to train models for these intermediate steps. The key training data would be cases where all layers are human-annotated — effectively, every published cuneiform edition is a training example linking image → signs → reading → lemma → translation.

BabyLemmatizer is the best tool for step 4. Your job is to build the schema that makes steps 1–5 interoperable, with clear provenance and confidence tracking at every stage.

Summary: How BabyLemmatizer Anchors Your Schema

BabyLemmatizer is not just a tool to plug into your pipeline — it's the existence proof that neural lemmatization works for cuneiform, and its design decisions (tokenization modes, CoNLL-U format, ORACC-derived training data, confidence scoring) establish the practical constraints your schema must accommodate. Specifically:

CoNLL-U is your interchange format — design your schema to be richer, but always serializable to it
Tokenization is language-specific — your schema must track which tokenization strategy was applied
Lemmatization is probabilistic — always store confidence scores and alternatives, never just "the answer"
The ORACC ecosystem is your primary data source — align your DICTIONARY_ENTRY model with ORACC's CF/GW/SENSE/POS conventions
Multiple annotation sources must coexist — your schema needs to support BabyLemmatizer, L2, LLM, and human annotations on the same tokens without conflict
Hittite and Elamite are unsolved — your schema must be designed for these from day one even though models don't exist yet, because the data model is harder to change than the models

The scholarly workflow goes: collation → transliteration → lemmatization → translation. BabyLemmatizer automates the third step. Your schema connects all four.

Source: github.com/wittkensis/glintstone
