Research BabyLemmatizer Integration
BabyLemmatizer (Sahala & Lindén, 2023) is the current state-of-the-art tool for two tasks that sit at the critical juncture between your schema's reading layer and linguistic layer:
- POS-tagging: Assigning a part-of-speech tag (N, V, AJ, PRP, etc.) to each token in transliterated cuneiform text
- Lemmatization: Mapping an inflected surface form to its dictionary headword (citation form)
It approaches both as machine translation (sequence-to-sequence) problems using OpenNMT encoder-decoder networks, not as classification or lookup tasks. This is a crucial architectural choice — it means the model can generate lemmata for forms it has never seen before (OOV words), achieving 68–84% accuracy on unseen forms depending on language/dialect.
Transliterated text (CoNLL-U)
→ POS-tagger (seq2seq: form → tag)
→ Lemmatizer (seq2seq: form + POS context → lemma)
→ Post-correction (dictionary-based heuristics + confidence scoring)
→ Annotated output (CoNLL-U)
The POS tags predicted by the tagger feed into the lemmatizer as contextual input: the lemmatizer sees not just the form but also its surrounding POS tags. Because the two stages are chained, POS errors propagate into lemmatization errors, which is a key design consideration for your schema's confidence tracking.
BabyLemmatizer defines three tokenization strategies, selectable per model:
| Mode | ID | Languages | Description |
|---|---|---|---|
| Logo-syllabic (unindexed) | 0 | Akkadian, Elamite, Hittite, Urartian, Hurrian | Syllabic signs → character sequences; logograms → preserved as tokens |
| Logo-syllabic (indexed) | 1 | Sumerian | Preserves sign indices (subscript numbers) because they carry meaning in Sumerian |
| Character sequences | 2 | Non-cuneiform (Greek, Latin) | Standard character-level tokenization |
This is the single most important design decision for your schema's relationship with BabyLemmatizer, and here's why:
Consider the Neo-Assyrian form IMIN{+et} (meaning "seven"):
BabyLemmatizer tokenizes this as something like:
Source: IMIN { + e t } (logogram kept as a single token, phonetic complement split into characters, consistent with mode 0 in the table above)
Target: s e b e (the lemma: sebe, "seven")
The sign IMIN is a Sumerogram: the Sumerian word for "seven" used logographically in an Akkadian text. The phonetic complement {+et} tells you this is the Akkadian word sebe(t) with the feminine ending. BabyLemmatizer's unindexed mode strips the index numbers from syllabic signs (e.g., du₃, also written du3, becomes du) because in Akkadian the index only disambiguates which cuneiform sign is meant; it does not affect the phonological value.
For Sumerian, subscripts do matter. The signs du (to go), du₃ (to build), and du₇ (to be perfect) are completely different words that happen to share similar phonetic values. Stripping subscripts would destroy critical lexical information. So mode 1 preserves them.
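The difference between the three modes can be sketched in a few lines of Python. This is an illustration only, not BabyLemmatizer's actual preprocessing code: the function name `tokenize_form`, the "all uppercase means logogram" heuristic, and the choice to keep delimiters as separate tokens are assumptions made for the example.

```python
import re

# Subscript digits used as sign indices in transliteration (e.g. du₃).
SUBSCRIPTS = "₀₁₂₃₄₅₆₇₈₉"
# Sign delimiters; the capture group keeps them as tokens when splitting.
_DELIMITERS = re.compile(r"([-.{}+])")


def strip_indices(sign: str) -> str:
    """Drop subscript indices and trailing ASCII index digits (du₃/du3 -> du)."""
    sign = sign.translate({ord(c): None for c in SUBSCRIPTS})
    return re.sub(r"(?<=\D)\d+$", "", sign)


def tokenize_form(form: str, mode: int) -> list[str]:
    """Hypothetical tokenizer illustrating modes 0, 1 and 2."""
    if mode == 2:
        return list(form)                      # plain character sequence
    tokens: list[str] = []
    for part in _DELIMITERS.split(form):
        if not part:
            continue
        if _DELIMITERS.fullmatch(part):
            tokens.append(part)                # delimiter token
        elif part.isupper():
            tokens.append(part)                # logogram kept whole (assumed heuristic)
        elif mode == 0:
            tokens.extend(strip_indices(part)) # characters, index removed
        else:
            tokens.extend(part)                # mode 1: characters, index kept
    return tokens


print(tokenize_form("IMIN{+et}", 0))
print(tokenize_form("mu-na-du₃", 1))
```

Running the same raw form through different modes makes the point concrete: the raw transliteration is the stable record, and each tokenized form is a derived view tied to one mode.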
Your TOKEN entity needs to track:
token:
  raw_form: "IMIN{+et}"                          # exactly as transliterated
  tokenization_mode: 0                           # which strategy was applied
  tokenized_form: ["IMIN","{","+","e","t","}"]   # model input: logogram kept whole, complement split
  predicted_pos: "NU"                            # number
  predicted_lemma: "sebe"                        # dictionary headword
  confidence: 0.94                               # from post-correction scoring
  is_logogram: true                              # IMIN is a Sumerogram
  logogram_language: "Sumerian"                  # the writing system origin
  target_language: "Akkadian"                    # what language is being expressed
The tokenization mode is effectively a pre-processing contract between your data and any ML model. Different models may expect different tokenizations of the same raw form. Your schema should store the raw form as canonical and derive tokenized representations on demand, or store them as a computed field with provenance.
BabyLemmatizer uses CoNLL-U (and an extended CoNLL-U Plus variant) as its input/output format. This is also the format used by CDLI, MTAAC, and increasingly by other cuneiform NLP projects. Understanding this format is essential for your schema design because it's the closest thing the field has to a standard.
ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 a-di adi PRP PRP _ _ _ _ _
2 IMIN{+et} sebe NU NU _ _ _ _ _
3 a-ra-an-šu arnu N N Gender=Masc|... _ _ _ _
4 pu-uṭ-ri paṭāru V V Stem=G|Tense=Imp _ _ _ _
| Field | What it maps to in your schema | Notes |
|---|---|---|
| ID | Token position | Integer, 1-indexed per sentence |
| FORM | `raw_form` in your `reading_layer` | The transliterated surface form |
| LEMMA | `lemma` → DICTIONARY_ENTRY | The citation/dictionary form |
| UPOS | Universal POS tag | Cross-linguistic (NOUN, VERB, ADJ, etc.) |
| XPOS | Language-specific POS tag | ORACC POS tags (more granular) |
| FEATS | Morphological features | Key=Value pairs (Gender, Number, Case, Stem, etc.) |
| HEAD | Syntactic head | Dependency parse (often empty for cuneiform) |
| DEPREL | Dependency relation | (ditto) |
| DEPS | Enhanced dependencies | (rarely used) |
| MISC | Overflow | Damage info, confidence scores, etc. |
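As a minimal sketch of how the ten columns in the table could be read into a record, the following parses a single token line. The helper names `parse_conllu_line` and `parse_feats` are hypothetical; the only facts relied on are that CoNLL-U token lines are tab-separated, that `_` marks an empty field, and that FEATS uses `Key=Value` pairs joined by `|`.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_feats(value: str) -> dict[str, str]:
    """Turn 'Stem=G|Tense=Imp' into a dict; '_' means no features."""
    if value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def parse_conllu_line(line: str) -> dict:
    """Parse one tab-separated CoNLL-U token line (hypothetical helper)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    token = dict(zip(CONLLU_FIELDS, columns))
    token["feats"] = parse_feats(token["feats"])
    # '_' is the CoNLL-U placeholder for an empty field.
    for key in ("lemma", "upos", "xpos", "head", "deprel", "deps", "misc"):
        if token[key] == "_":
            token[key] = None
    return token


line = "4\tpu-uṭ-ri\tpaṭāru\tV\tV\tStem=G|Tense=Imp\t_\t_\t_\t_"
token = parse_conllu_line(line)
print(token["form"], token["lemma"], token["feats"])
```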
This is where the gap between BabyLemmatizer's output and your full schema becomes critical:
- Sign-level information: CoNLL-U operates at the word level. It doesn't track individual cuneiform signs within a word, their visual forms, damage states, or polyvalent readings. Your schema's graphemic layer sits below CoNLL-U's granularity.
- Mixed-language writing: CoNLL-U has a single FORM field. It doesn't natively represent that `LUGAL-uš` is a Sumerian logogram + Hittite phonetic complement. Your schema needs the `sign_origins[]` and `functions[]` arrays from the earlier tutorial.
- Alternative readings: CoNLL-U gives one LEMMA. BabyLemmatizer's post-correction produces confidence scores and can flag uncertain analyses, but the format doesn't natively support "this might be lemma A (0.7) or lemma B (0.25)." Your schema needs `alternative_readings[]`.
- Tablet/text metadata: CoNLL-U has limited metadata capability via `# comment` lines. It doesn't model the physical object, provenance, or the full textual hierarchy (tablet → surface → column → line → token).
- Cross-references: No mechanism for linking a token to parallel texts, lexical list entries, or sign list references.
Your Full Schema
┌─────────────────────┐
│ TABLET (physical) │
│ └── SURFACE │
│ └── LINE │
│ └── TOKEN │ ◄── This is what CoNLL-U represents
│ │ │ (partially)
│ ┌────────┼─────┤
│ Graphemic Reading Linguistic
│ Layer Layer Layer
│ (signs) (translit) (lemma,POS)
└─────────────────────┘
▲
│
┌───────────────┼───────────────┐
│ │ │
ORACC ATF CoNLL-U ORACC JSON
(source) (BabyLemmatizer (richest
I/O format) structured
output)
Design recommendation: Use CoNLL-U (or CoNLL-U Plus) as an import/export interchange format, not as your internal data model. Your schema should be richer than CoNLL-U can express, but should be able to serialize to CoNLL-U for interoperability with BabyLemmatizer, spaCy pipelines, and the broader NLP ecosystem.
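One way to picture the "richer internally, CoNLL-U at the boundary" recommendation is a lossy export function. The sketch below is an assumption-laden illustration: the function name `to_conllu_line`, the record keys, and the `Confidence`/`Source` keys placed in MISC are all invented for this example and are not a BabyLemmatizer or CoNLL-U convention.

```python
def to_conllu_line(position: int, token: dict) -> str:
    """Serialize a rich token record to one CoNLL-U line.

    Values CoNLL-U has no column for (confidence, source) are packed into
    MISC as Key=Value pairs; fields such as is_logogram are simply dropped.
    """
    feats = token.get("feats") or {}
    feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"

    misc = {}
    if token.get("confidence") is not None:
        misc["Confidence"] = f"{token['confidence']:.2f}"
    if token.get("source"):
        misc["Source"] = token["source"]
    misc_str = "|".join(f"{k}={v}" for k, v in misc.items()) or "_"

    columns = [
        str(position),
        token["raw_form"],
        token.get("lemma") or "_",
        token.get("upos") or "_",
        token.get("xpos") or "_",
        feats_str,
        "_", "_", "_",          # HEAD, DEPREL, DEPS: no parse available
        misc_str,
    ]
    return "\t".join(columns)


rich_token = {
    "raw_form": "IMIN{+et}",
    "lemma": "sebe",
    "upos": "NUM",
    "xpos": "NU",
    "confidence": 0.94,
    "source": "BabyLemmatizer",
    "is_logogram": True,        # has no CoNLL-U column, lost on export
}
print(to_conllu_line(2, rich_token))
```

The export is deliberately one-way: anything without a CoNLL-U column stays in the internal schema, which is why the internal record, not the exported file, should be treated as canonical.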
BabyLemmatizer was trained on data extracted from ORACC (Open Richly Annotated Cuneiform Corpus), which has its own lemmatization conventions that predate and differ from CoNLL-U. Understanding ORACC's system is essential because it's the largest source of human-annotated cuneiform text.
In ORACC ATF files, lemmatization is encoded inline:
1. a-na be-lí-ia qí-bí-ma
#lem: ana[to]PRP; bēlu[lord]N; qabû[say]V
Each lemma annotation has the structure:
CF[GW]POS
Where:
- CF (Citation Form): The dictionary headword (e.g., `bēlu`)
- GW (Guide Word): A basic English meaning used as a disambiguator (e.g., `lord`)
- POS: Part of speech tag (e.g., `N`)
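A small parser makes the `CF[GW]POS` structure explicit. This is a sketch covering only the basic form shown above; the names `LEM_PATTERN` and `parse_lem_line` are hypothetical, and anything the pattern does not recognize is kept as raw text rather than guessed at.

```python
import re

# Basic CF[GW]POS only, e.g. "bēlu[lord]N". Annotations starting with "+"
# or carrying extra suffixes are intentionally left unparsed.
LEM_PATTERN = re.compile(
    r"^(?!\+)(?P<cf>[^\[\]]+)\[(?P<gw>[^\[\]]*)\](?P<pos>[A-Z]+)$"
)


def parse_lem_line(line: str) -> list[dict]:
    """Split an ORACC '#lem:' line into one record per word."""
    prefix = "#lem:"
    if not line.startswith(prefix):
        raise ValueError("not a #lem: line")
    entries = []
    for chunk in line[len(prefix):].split(";"):
        chunk = chunk.strip()
        match = LEM_PATTERN.match(chunk)
        if match is None:
            # Unrecognized annotation: keep the raw text for manual review.
            entries.append({"raw": chunk, "cf": None, "gw": None, "pos": None})
        else:
            entries.append({"raw": chunk, **match.groupdict()})
    return entries


for entry in parse_lem_line("#lem: ana[to]PRP; bēlu[lord]N; qabû[say]V"):
    print(entry["cf"], entry["gw"], entry["pos"])
```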
Extended form (when adding new information):
+CF[GW//SENSE]POS'EPOS$NORM0
Where:
- SENSE: Contextual meaning (may differ from GW)
- EPOS: Effective POS (when a word functions as a different POS than its base, e.g., a noun used as a preposition)
- NORM0: Normalized/transcribed form (the actual pronunciation, e.g., `bēlīya` for "my lord")
| ORACC Field | Schema Entity | Notes |
|---|---|---|
| CF (Citation Form) | DICTIONARY_ENTRY.lemma_form | The canonical headword |
| GW (Guide Word) | DICTIONARY_ENTRY.meanings[0] | Primary disambiguating meaning |
| SENSE | TOKEN.semantic_layer.translation | Context-specific meaning |
| POS | TOKEN.linguistic_layer.part_of_speech (base) | |
| EPOS | TOKEN.linguistic_layer.part_of_speech (effective) | Captures re-categorization |
| NORM0 | TOKEN.reading_layer.phonetic_reading | The actual pronunciation |
| FORM (transliteration) | TOKEN.reading_layer.transliteration | What's on the tablet |
Critical insight: ORACC's distinction between CF and NORM0 maps directly to the split between your schema's DICTIONARY_ENTRY (abstract lexical item) and TOKEN (specific instance). bēlu is the dictionary entry; bēlīya ("my lord") is the normalized token-level form. BabyLemmatizer predicts the CF — it doesn't currently produce NORM0, though this is on the roadmap.
Before BabyLemmatizer, ORACC's own lemmatizer L2 (by Steve Tinney) was the primary tool. L2 is fundamentally dictionary-based — it looks up forms in a glossary and suggests matches. It cannot handle OOV words (it just flags them for manual annotation). BabyLemmatizer's neural approach fills exactly this gap.
Your pipeline should ideally combine both:
Input text → L2 (dictionary lookup, high-precision for known forms)
→ BabyLemmatizer (neural prediction, handles OOV)
→ Merge (prefer L2 when confident, fall back to BabyLemmatizer)
→ Human review (flagged uncertain cases)
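The merge step in this pipeline can be sketched as a small decision function. The policy below is one possible reading of "prefer L2 when confident, fall back to BabyLemmatizer": the `Annotation` class, the `merge_annotations` name, and the 0.8 review threshold are assumptions for illustration, and "L2 is confident" is modeled simply as L2 having returned a single match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    lemma: str
    pos: str
    confidence: float
    source: str


def merge_annotations(l2: Optional[Annotation],
                      neural: Optional[Annotation],
                      review_threshold: float = 0.8):
    """Return (chosen_annotation, needs_human_review).

    l2 is None when the dictionary lookup found no unambiguous match.
    """
    if l2 is not None:
        # Dictionary match wins, but a disagreeing neural analysis is flagged.
        disagree = neural is not None and (
            (neural.lemma, neural.pos) != (l2.lemma, l2.pos)
        )
        return l2, disagree
    if neural is not None:
        # OOV for the dictionary: trust the model only above the threshold.
        return neural, neural.confidence < review_threshold
    return None, True


l2_hit = Annotation("arnu", "N", 1.0, "L2")
neural_hit = Annotation("arnu", "N", 0.93, "BabyLemmatizer")
chosen, needs_review = merge_annotations(l2_hit, neural_hit)
print(chosen.source, needs_review)
```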
This is where the four languages in your schema diverge significantly, and where BabyLemmatizer's language-specific models become essential.
The lemma is the citation form, typically the nominative singular for nouns and the G-stem infinitive for verbs:
- Surface form: `iš-pur` → Lemma: `šapāru[send]V`
- Surface form: `šar-ra-ti` → Lemma: `šarratu[queen]N`
- Surface form: `LUGAL` → Lemma: `šarru[king]N` (logographic writing, lemmatized to the Akkadian word)
The mapping from surface to lemma involves:
- Stripping inflectional morphology (case, number, person, tense)
- Resolving logographic writing to the Akkadian word
- Normalizing phonological spelling variants
- Identifying the verbal stem (G, D, Š, N, etc.)
BabyLemmatizer achieves 94–96% accuracy on this for in-vocabulary forms, 68–84% for OOV.
Sumerian lemmatization is structured differently because of the agglutinative morphology:
- Surface form: `mu-na-du₃` → Lemma: `du₃[build]V` (the verbal root)
- Surface form: `lugal-e` → Lemma: `lugal[king]N` (strip case marker)
- Surface form: `e₂-gal` → Lemma: `egal[palace]N` (compound treated as single lemma)
Key differences from Akkadian:
- No root-and-pattern system to reverse-engineer
- Compound words may be treated as single lemmata or decomposed (scholarly convention varies)
- The same form can be verbal or nominal depending on context (Sumerian word-class flexibility)
- Sign indices (du₃ vs. du₇) are critical and must be preserved
BabyLemmatizer uses separate models for literary and administrative Sumerian (different vocabulary distributions, different formulaic patterns).
Hittite lemmatization must solve the Sumerogram/Akkadogram problem:
- Surface form: `LUGAL-uš` → Lemma: `ḫaššuš[king]` (Sumerogram resolved to Hittite word)
- Surface form: `ar-ḫa` → Lemma: `arḫa[away]` (phonetically written Hittite)
BabyLemmatizer does not currently have a pretrained Hittite model (it lists Akkadian dialects, Sumerian, and Urartian). The tokenization mode 0 is flagged as applicable to Hittite, but no trained model exists. This is a gap your pipeline would need to fill, likely by training on the Hethitologie Portal or Chicago Hittite Dictionary data.
Similarly, no pretrained Elamite model exists. The tokenization mode 0 covers it in principle, but the training data is extremely sparse. Elamite lemmatization would likely require:
- A small hand-annotated corpus (from Persepolis texts)
- Heavy use of transfer learning from Akkadian models
- Much lower accuracy expectations
- Mandatory human review for all outputs
Your DICTIONARY_ENTRY entity needs language-specific structure:
DICTIONARY_ENTRY:
  lemma_form: "šapāru"
  language: "Akkadian"
  dialect: "Neo-Assyrian"
  pos: "V"
  # Akkadian-specific:
  akkadian_root: "š-p-r"
  verbal_stems_attested: ["G", "Š", "N"]
  # Meanings with ORACC-style GW:
  guide_word: "send"
  senses:
    - "to send (a message, person)"
    - "to write (a letter)"
    - "to dispatch"
  # Cross-linguistic links:
  sumerian_logogram: "KIN"   # Sumerogram used for this word
  cognates:
    - {language: "Hebrew", form: "spr", meaning: "to count, write"}
    - {language: "Arabic", form: "sfr", meaning: "to write, travel"}
vs.
DICTIONARY_ENTRY:
  lemma_form: "du₃"
  language: "Sumerian"
  pos: "V"
  # Sumerian-specific:
  sign_name: "KAK"            # du₃ is a reading of the sign KAK
  sign_unicode: "U+12195"     # CUNEIFORM SIGN KAK
  compound_analysis: null     # not a compound
  guide_word: "build"
  senses:
    - "to build, construct"
    - "to erect"
  # Usage as Sumerogram:
  used_logographically_in:
    - {language: "Akkadian", read_as: "banû"}
    - {language: "Hittite", read_as: "unknown (always logographic)"}
BabyLemmatizer's post-correction module assigns confidence scores by comparing the neural network's output against dictionary-based heuristics. This is directly relevant to your trust framework thinking.
The post-correction step:
- Takes the neural network's predicted lemma + POS
- Checks it against a known glossary/dictionary
- If the form-lemma pair is attested in the training data → high confidence
- If the lemma is in the dictionary but this specific form isn't attested → medium confidence
- If the lemma is entirely novel (not in dictionary) → low confidence, flagged for review
- An override lexicon can force specific corrections
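The tiering described in this list can be expressed as a short function. This is a sketch of the logic as summarized here, not BabyLemmatizer's actual scoring code: the function name `confidence_tier`, the tier labels, and the data structures (a set of attested pairs, a set of dictionary lemmata, a dict of overrides) are assumptions.

```python
def confidence_tier(form, lemma, attested_pairs, dictionary, overrides=None):
    """Return (final_lemma, tier) for one predicted form-lemma pair.

    attested_pairs: set of (form, lemma) tuples seen in training data
    dictionary:     set of known lemmata
    overrides:      optional {form: forced_lemma} override lexicon
    """
    overrides = overrides or {}
    if form in overrides:
        return overrides[form], "override"   # manual correction always wins
    if (form, lemma) in attested_pairs:
        return lemma, "high"                 # exact pair attested
    if lemma in dictionary:
        return lemma, "medium"               # known lemma, new spelling
    return lemma, "low"                      # novel lemma, flag for review


attested = {("iš-pur", "šapāru")}
lemmata = {"šapāru", "šarru"}
print(confidence_tier("iš-pur", "šapāru", attested, lemmata))
print(confidence_tier("iš-pu-ra", "šapāru", attested, lemmata))
print(confidence_tier("x-y-z", "xyzu", attested, lemmata))
```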
TOKEN.annotation_provenance:
  method: "BabyLemmatizer_v2.2"
  model: "neo-assyrian"
  model_version: "2024-06-07"
  pos_prediction:
    value: "V"
    confidence: 0.97
    alternatives: [{value: "N", confidence: 0.02}]
    source: "neural"
  lemma_prediction:
    value: "šapāru"
    confidence: 0.91
    alternatives: [{value: "šāpiru", confidence: 0.06}]
    source: "neural+dictionary_confirmed"
  human_review:
    status: "unreviewed" | "confirmed" | "corrected"
    reviewer: null
    correction: null
    review_date: null
  flags:
    is_oov: false
    is_logographic: false
    damage_affects_reading: false
This maps directly to the trust acquisition → growth → maintenance cycle:
- Acquisition: BabyLemmatizer produces an initial annotation with confidence score
- Growth: Human reviewer confirms or corrects, which can be fed back into retraining
- Maintenance: As the dictionary grows and models improve, confidence scores on previously-annotated tokens can be recalculated
The field is actively moving. The EvaCun 2025 Shared Task (co-located with NAACL 2025) benchmarked LLMs and transformer models against BabyLemmatizer on Akkadian and Sumerian lemmatization and token prediction, using new datasets from the Electronic Babylonian Library (eBL) and Archibab.
Key findings relevant to your pipeline:
- BabyLemmatizer remains the baseline to beat for lemmatization
- LLM-based approaches (few-shot prompting) achieved ~90% on in-vocabulary but only ~9% on OOV — dramatically worse than BabyLemmatizer's 68–84%
- The token prediction task (filling in missing/damaged text) is a different problem where LLMs may have more promise
- New data sources (eBL for first-millennium literature, Archibab for Old Babylonian) are expanding what's available for training
Your pipeline architecture should be model-agnostic at the annotation layer. The TOKEN.annotation_provenance should be able to record whether an annotation came from:
- BabyLemmatizer (specialized seq2seq)
- An LLM (prompted or fine-tuned)
- L2 (dictionary lookup)
- A human annotator
- A finite-state transducer (BabyFST)
- Future tools not yet built
This means your schema should not bake in assumptions about which tool produced an annotation. The provenance metadata should be rich enough to evaluate and compare different methods retrospectively.
Here's how everything connects, with your schema as the backbone:
┌─────────────────────────────────────────────────────────────────┐
│ PHYSICAL LAYER │
│ │
│ 3D scan / photograph of tablet │
│ │ │
│ ▼ │
│ Sign identification (DeepScribe, CNN/ViT models) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ SCHEMA: SIGN entities │ │
│ │ sign_id, unicode, visual_form, │ │
│ │ period, region, damage_state │ │
│ └─────────────┬───────────────────────┘ │
│ │ │
├────────────────┼─────────────────────────────────────────────────┤
│ READING LAYER │ │
│ ▼ │
│ Reading assignment (polyvalence resolution) │
│ → Determine language, resolve sign → syllable/logogram │
│ │ │
│ ▼ │
│ Transliteration (ATF format) │
│ e.g., "a-na be-lí-ia qí-bí-ma" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ SCHEMA: TOKEN.reading_layer │ │
│ │ transliteration, sign_sequence, │ │
│ │ function_per_sign, phonetic_read │ │
│ └─────────────┬───────────────────────┘ │
│ │ │
├────────────────┼─────────────────────────────────────────────────┤
│ LINGUISTIC LAYER (where BabyLemmatizer operates) │
│ ▼ │
│ ┌───────────────────────────┐ │
│ │ Convert to CoNLL-U │ ◄── ATF-to-CoNLL-U converter │
│ │ (form per line, blank │ (e.g., Pagé-Perron scripts) │
│ │ metadata fields) │ │
│ └─────────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────┐ ┌─────────────────────────┐ │
│ │ BabyLemmatizer │ │ L2 (ORACC dictionary │ │
│ │ → POS-tagger │ │ lemmatizer) │ │
│ │ → Lemmatizer │ │ → glossary lookup │ │
│ │ → Post-correction │ │ → exact match only │ │
│ └─────────────┬─────────────┘ └──────────┬──────────────┘ │
│ │ │ │
│ └──────────┬───────────────────┘ │
│ ▼ │
│ Merge / reconcile │
│ (prefer dictionary match; │
│ fall back to neural; │
│ flag disagreements) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ SCHEMA: TOKEN.linguistic_layer │ │
│ │ lemma → DICTIONARY_ENTRY │ │
│ │ part_of_speech (UPOS + XPOS) │ │
│ │ morphological_analysis │ │
│ │ annotation_provenance │ │
│ │ confidence_score │ │
│ │ human_review_status │ │
│ └─────────────┬───────────────────────┘ │
│ │ │
├────────────────┼─────────────────────────────────────────────────┤
│ SEMANTIC LAYER │ │
│ ▼ │
│ Translation / interpretation │
│ (currently largely manual; some MT work emerging) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ SCHEMA: TOKEN.semantic_layer │ │
│ │ translation, semantic_domain, │ │
│ │ cross_references │ │
│ └─────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────┘
Based on this analysis, here are the specific additions to the schema from the earlier tutorial:
Track each time a tool processes a text:
ANNOTATION_RUN:
  run_id: UUID
  tool: "BabyLemmatizer" | "L2" | "human" | "LLM" | ...
  tool_version: "2.2"
  model_name: "neo-assyrian"
  model_config:
    tokenizer_mode: 0
    tagger_context: 2
    lemmatizer_context: 1
  timestamp: ISO8601
  input_format: "CoNLL-U"
  corpus_scope: [list of tablet_ids processed]
  aggregate_metrics:
    pos_accuracy_estimate: 0.97
    lemma_accuracy_estimate: 0.95
    oov_rate: 0.09
TOKEN (additions to original schema):
  # CoNLL-U compatible fields:
  conllu_id: 3                                # position in sentence
  form: "a-ra-an-šu"                          # FORM field
  lemma: "arnu"                               # LEMMA field → links to DICTIONARY_ENTRY
  upos: "NOUN"                                # Universal POS
  xpos: "N"                                   # ORACC-specific POS
  feats: "Gender=Masc|Number=Sing|Case=Acc"   # Morphological features
  head: null                                  # Dependency head (if parsed)
  deprel: null                                # Dependency relation
  # BabyLemmatizer-specific metadata:
  tokenization_mode: 0
  tokenized_input: "a - r a - a n - š u"      # What the model actually saw (illustrative)
  # Multi-source annotation tracking:
  annotations[]:
    - source: "BabyLemmatizer"
      run_id: UUID → ANNOTATION_RUN
      lemma: "arnu"
      pos: "N"
      confidence: 0.93
      is_oov: false
    - source: "L2"
      run_id: UUID → ANNOTATION_RUN
      lemma: "arnu"
      pos: "N"
      confidence: 1.0                         # dictionary match = certain
      is_oov: false
    - source: "human_reviewer"
      reviewer_id: "..."
      lemma: "arnu"                           # confirmed
      pos: "N"
      review_date: ISO8601
      notes: null
  # Consensus annotation (derived):
  consensus_lemma: "arnu"
  consensus_pos: "N"
  consensus_confidence: 0.98
  consensus_method: "dictionary_confirmed + neural_confirmed + human_confirmed"
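Because the consensus fields are derived, it helps to see one way they could be computed from `annotations[]`. The policy in this sketch (a human review always wins, otherwise the analysis with the largest summed confidence wins) is an assumption chosen for illustration, as are the function name `derive_consensus` and the output keys. It does not attempt to reproduce the `consensus_confidence` value shown above, for which no formula is given.

```python
from collections import defaultdict


def derive_consensus(annotations):
    """Pick a consensus (lemma, pos) from per-source annotation records."""
    if not annotations:
        return None
    humans = [a for a in annotations if a["source"] == "human_reviewer"]
    if humans:
        # Assumed policy: the most recent human decision overrides tools.
        winner = (humans[-1]["lemma"], humans[-1]["pos"])
        method = "human"
    else:
        # Otherwise sum confidence per analysis and take the heaviest.
        weight = defaultdict(float)
        for a in annotations:
            weight[(a["lemma"], a["pos"])] += a.get("confidence", 0.0)
        winner = max(weight, key=weight.get)
        method = "confidence_vote"
    agreeing = sorted({a["source"] for a in annotations
                       if (a["lemma"], a["pos"]) == winner})
    return {"lemma": winner[0], "pos": winner[1],
            "method": method, "agreeing_sources": agreeing}


annotations = [
    {"source": "BabyLemmatizer", "lemma": "arnu", "pos": "N", "confidence": 0.93},
    {"source": "L2", "lemma": "arnu", "pos": "N", "confidence": 1.0},
    {"source": "human_reviewer", "lemma": "arnu", "pos": "N"},
]
print(derive_consensus(annotations))
```

Keeping the per-source records and recomputing the consensus on demand means the policy can change later without rewriting stored annotations.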
For cases where multiple valid lemmatizations exist:
LEMMATIZATION_AMBIGUITY:
  token_id: UUID → TOKEN
  candidates[]:
    - lemma: "pānu"       # front, face
      pos: "N"
      sense: "front"
      probability: 0.45
      evidence: "common in prepositional phrases"
    - lemma: "amāru"      # to see
      pos: "V"
      sense: "see"
      probability: 0.35
      evidence: "IGI can be logographic for amāru"
    - lemma: "naplusu"    # to look at
      pos: "V"
      sense: "look"
      probability: 0.15
      evidence: "less common reading"
  resolution_status: "unresolved" | "human_resolved" | "auto_resolved"
  resolved_to: "pānu"
  resolution_rationale: "prepositional context (ina pān) strongly favors nominal reading"
This is especially important for Sumerograms like IGI, which ORACC's documentation explicitly mentions can have readings like pān[front]N | immar[see]V | innamir[appear]V | igi[reciprocal]N.
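A rule for when an ambiguity may be closed automatically could look like the sketch below. The function name `auto_resolve` and both thresholds (`min_probability`, `min_margin`) are assumptions for illustration; the point is only that auto-resolution should require a clear winner, and everything else should stay open for a human.

```python
def auto_resolve(candidates, min_probability=0.6, min_margin=0.2):
    """Resolve automatically only when the top candidate clearly dominates."""
    ranked = sorted(candidates, key=lambda c: c["probability"], reverse=True)
    if not ranked:
        return {"resolution_status": "unresolved", "resolved_to": None}
    top = ranked[0]
    runner_up = ranked[1]["probability"] if len(ranked) > 1 else 0.0
    clear_winner = (top["probability"] >= min_probability
                    and top["probability"] - runner_up >= min_margin)
    if clear_winner:
        return {"resolution_status": "auto_resolved", "resolved_to": top["lemma"]}
    return {"resolution_status": "unresolved", "resolved_to": None}


igi_candidates = [
    {"lemma": "pānu", "probability": 0.45},
    {"lemma": "amāru", "probability": 0.35},
]
print(auto_resolve(igi_candidates))
```

With the probabilities from the IGI example, no candidate clears the bar, so the token remains unresolved until a reviewer supplies a rationale.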
For your AlphaFold-equivalent, tracking where training data came from is critical:
TRAINING_CORPUS:
  corpus_id: UUID
  name: "ORACC SAAo Neo-Assyrian"
  source_project: "saao/saa08"
  language: "Akkadian"
  dialect: "Neo-Assyrian"
  genre: "state_correspondence"
  period: "Neo-Assyrian"
  token_count: 150000
  lemmatized_by: "human (ORACC project team)"
  license: "CC-BY-SA"
  # Quality metrics:
  inter_annotator_agreement: null   # often not measured
  known_biases:
    - "Predominantly royal correspondence"
    - "Limited colloquial vocabulary"
    - "Sumerogram-heavy (many underlying Akkadian forms unknown)"
  # Downstream model usage:
  used_to_train: [ANNOTATION_RUN_ids]
| Capability | BabyLemmatizer Status | Your Schema Should... |
|---|---|---|
| Morphological analysis | On roadmap, not implemented | Reserve FEATS field; design language-specific morph templates |
| Phonological transcription | On roadmap | Include NORM0 field (ORACC convention) |
| Named entity recognition | On roadmap | Include NER tags (PN, GN, DN, etc.) |
| Syntactic parsing | Not planned | Support HEAD/DEPREL but don't require them |
| Hittite models | Tokenization supported, no trained model | Plan for it; Hittite Sumerogram resolution is the hardest test |
| Elamite models | Tokenization supported, no trained model | Plan for extreme uncertainty; confidence thresholds must be lower |
| Sign-level OCR | Out of scope (operates on transliteration) | Your graphemic layer sits below BabyLemmatizer's input |
| Fragment joining | Out of scope | Separate subsystem, but should link to same TOKEN entities |
The gap between sign identification (images → signs) and lemmatization (transliteration → lemmata) is currently bridged by human transliteration. An end-to-end system would:
1. Take a tablet image
2. Identify signs (computer vision)
3. Assign readings (contextual language model)
4. Lemmatize (BabyLemmatizer or successor)
5. Translate (MT)
Steps 2–3 are the least automated and arguably the hardest. Your schema, by explicitly modeling the graphemic and reading layers as separate from the linguistic layer, provides the data architecture needed to train models for these intermediate steps. The key training data would be cases where all layers are human-annotated — effectively, every published cuneiform edition is a training example linking image → signs → reading → lemma → translation.
BabyLemmatizer is the best tool for step 4. Your job is to build the schema that makes steps 1–5 interoperable, with clear provenance and confidence tracking at every stage.
BabyLemmatizer is not just a tool to plug into your pipeline — it's the existence proof that neural lemmatization works for cuneiform, and its design decisions (tokenization modes, CoNLL-U format, ORACC-derived training data, confidence scoring) establish the practical constraints your schema must accommodate. Specifically:
- CoNLL-U is your interchange format — design your schema to be richer, but always serializable to it
- Tokenization is language-specific — your schema must track which tokenization strategy was applied
- Lemmatization is probabilistic — always store confidence scores and alternatives, never just "the answer"
- The ORACC ecosystem is your primary data source — align your DICTIONARY_ENTRY model with ORACC's CF/GW/SENSE/POS conventions
- Multiple annotation sources must coexist — your schema needs to support BabyLemmatizer, L2, LLM, and human annotations on the same tokens without conflict
- Hittite and Elamite are unsolved — your schema must be designed for these from day one even though models don't exist yet, because the data model is harder to change than the models
The scholarly workflow goes: collation → transliteration → lemmatization → translation. BabyLemmatizer automates the third step. Your schema connects all four.
Source: github.com/wittkensis/glintstone