-
Notifications
You must be signed in to change notification settings - Fork 0
Data Model ML Integration
Glintstone treats machine learning models as annotation sources, not oracles. Every ML-generated record carries the same provenance infrastructure as human-produced annotations: an annotation_run_id linking to the model version, configuration, and confidence score. This means ML output and human scholarship coexist in the same tables, are queryable by the same mechanisms, and can be compared directly.
What it does: POS-tagging and lemmatization of transliterated cuneiform text using sequence-to-sequence neural networks (OpenNMT encoder-decoder). The current state-of-the-art tool for automated cuneiform lemmatization.
Reference: Sahala & Linden, 2023
BabyLemmatizer operates between the Reading layer (transliteration) and the Linguistic layer (lemma + POS):
Image > OCR > ATF > [BabyLemmatizer] > Lemmas > Translation
Reading layer Linguistic layer
It takes transliterated text as input and produces lemmatization (dictionary headword mapping) and POS tags as output. It does not operate on images or produce translations.
BabyLemmatizer defines three tokenization strategies, each designed for different writing systems:
| Mode | Languages | Behavior |
|---|---|---|
| 0: Logo-syllabic unindexed | Akkadian, Elamite, Hittite, Urartian, Hurrian | Syllabic signs become character sequences; logograms preserved as tokens. Subscript numbers stripped (du3 becomes du) because subscripts are disambiguation devices, not phonological. |
| 1: Logo-syllabic indexed | Sumerian | Preserves subscript numbers because they carry lexical meaning. du (go), du3 (build), and du7 (perfect) are different words. |
| 2: Character sequences | Non-cuneiform (Greek, Latin) | Standard character-level tokenization. |
The tokenization mode is a pre-processing contract between the data and the model. Glintstone stores the raw transliterated form as canonical on token_readings.form and can derive tokenized representations on demand.
BabyLemmatizer uses CoNLL-U as its input/output format. This is the closest thing cuneiform NLP has to a standard, also used by CDLI and MTAAC.
Mapping from CoNLL-U fields to Glintstone schema:
| CoNLL-U Field | Schema Target | Notes |
|---|---|---|
| ID | tokens.position |
1-indexed position in line |
| FORM | token_readings.form |
Surface form as transliterated |
| LEMMA | lemmatizations.citation_form |
Dictionary headword |
| UPOS | lemmatizations.pos |
Universal POS tag (N, V, ADJ, etc.) |
| XPOS | lemmatizations.epos |
ORACC-specific POS (more granular) |
| FEATS |
morphology table |
Key=Value pairs decomposed into language-specific slots |
| MISC | Various | Damage info, confidence scores |
The schema is deliberately richer than CoNLL-U:
- Sign-level data: CoNLL-U operates at word level. The graphemic layer (individual signs, visual forms, damage states) sits below CoNLL-U's granularity.
-
Mixed-language writing: CoNLL-U has one FORM field. It does not represent that
LUGAL-usis a Sumerian logogram + Hittite phonetic complement. The schema tracks this viatoken_readings.sign_function. -
Alternative analyses: CoNLL-U gives one LEMMA. The schema stores competing lemmatizations with confidence scores via
lemmatizations(multiple rows per token_id). - Physical context: CoNLL-U has no concept of the tablet, surface, or image. The schema's full hierarchy (artifact > surface > text_line > token) provides this.
CoNLL-U is the interchange format for import/export. The schema is the internal data model.
- In-vocabulary forms: 94-96% accuracy (Akkadian)
- Out-of-vocabulary forms: 68-84% accuracy depending on language/dialect
- LLM comparison (EvaCun 2025): LLMs achieved ~90% in-vocabulary but only ~9% OOV, far worse than BabyLemmatizer
BabyLemmatizer's post-correction module assigns confidence by checking neural output against a dictionary:
- Form-lemma pair attested in training data: high confidence
- Lemma in dictionary but this form not attested: medium confidence
- Entirely novel lemma: low confidence, flagged for review
POS errors propagate into lemmatization errors (chained architecture), so confidence scoring accounts for both stages.
BabyLemmatizer is trained on ORACC corpus exports -- the same human-annotated data that Glintstone imports. This creates a virtuous cycle: better ORACC annotations produce better BabyLemmatizer models, which can then annotate the 98% of texts that ORACC has not yet reached.
Separate models exist for:
- Neo-Assyrian Akkadian (richest training data)
- Other Akkadian dialects (Standard Babylonian, Old Babylonian)
- Literary Sumerian vs. administrative Sumerian (different vocabulary distributions)
No pretrained models exist for Hittite or Elamite. The tokenization modes support these languages, but training corpora are insufficient.
BabyLemmatizer output enters Glintstone through:
-
annotation_runs: A record with
source_type='model',source_name='BabyLemmatizer v{version}',method='ML_model', andconfig_snapshotcontaining tokenization mode, tagger context window, and model name. -
lemmatizations: One row per token with
citation_form,pos,confidence, andannotation_run_id. Stored alongside any existing ORACC human lemmatization as a competing analysis. - morphology: If FEATS are produced, decomposed into language-specific slots (root, stem, tense for Akkadian; conjugation_prefix, dimensional_prefixes for Sumerian).
The is_consensus flag is NOT automatically set on BabyLemmatizer output. Human ORACC annotations retain consensus by default. BabyLemmatizer becomes consensus only for tokens with no human annotation, or when explicitly promoted by editorial decision.
What it does: Detects and classifies cuneiform signs in tablet photographs using a Deformable DETR + ResNet-50 architecture.
Training: 173 sign classes, 1000 epochs on eBL cuneiform-ocr-data annotations Performance: mAP@50 = 43.1% Current status: Trained model available. Runs in mock mode on macOS ARM64 due to mmcv library incompatibility.
- Output: Bounding box coordinates + predicted sign class + confidence score
-
Schema target:
sign_annotationstable - Coordinate system: Pixel-absolute from model, converted to percentage-based (0-100%) for resolution independence
- Sign resolution: Model outputs CompVis/eBL class names, resolved to OGSL sign_id via MZL/ABZ concordance
-
Provenance:
annotation_runswithsource_type='model',source_name='eBL DETR v{version}',method='ML_model'
The ML service runs as a FastAPI application at ml/service/:
-
/health-- service status -
/classes-- available sign classes -
/detect-signs-- detect signs from uploaded image -
/detect-signs-by-path-- detect signs from local file path
Sign class to OGSL mapping defined in ml/service/sign_mapping.json (173 entries with OGSL names and Unicode codepoints).
What it does: Akkadian transliteration using BiLSTM (also HMM, MEMM variants). Accuracy: 96.7% on RINAP evaluation set Training data: RINAP corpora via ORACC License: Apache 2.0
-
Schema target:
token_readingslayer (transliteration form + reading) -
Current status: Model available in
ml/models/akkademia/, not yet wired to UI or import pipeline -
Provenance: Would use
annotation_runswithsource_type='model',source_name='Akkademia v{version}'
What it does: Elamite cuneiform OCR using CNN/ResNet architecture. Training data: Persepolis Fortification Archive (100k+ annotations) License: MIT Current status: Code only, no trained weights available.
-
Schema target:
sign_annotations(future, when weights become available) - Note: Elamite is an underrepresented language in cuneiform NLP. Training data exists (PFA) but no production model is available.
The schema is deliberately model-agnostic. The annotation_runs table can record output from any tool:
annotation_runs:
source_type: model | human | hybrid | import
source_name: "BabyLemmatizer v2.2" | "L2" | "GPT-4" | "human_collation" | ...
method: ML_model | hand_copy | collation | import | ...
config_snapshot: { tokenizer_mode: 0, model: "neo-assyrian", ... }This means new models (LLMs, future specialized tools, BabyFST finite-state transducers) can be added without schema changes. The provenance mechanism is the same regardless of what produced the annotation.
The EvaCun 2025 shared task demonstrated that different tools excel at different subtasks. The schema supports storing and comparing all of them on the same tokens, enabling retrospective evaluation as the field evolves.
Source: github.com/wittkensis/glintstone · Issues · Edit this wiki
Start here
Getting Started
Overview
Data Model
- Data Sources
- Data Quality
- Data Issues
- Import Pipeline Guide
- ML Integration
- Citation Pipeline Summary
Reference — Data Model
Reference — API
Reference — MCP
Opportunities
Personas
Project
Research