Data Model Data Issues
12 critical issues found by pressure-testing the schema against the live Glintstone database (314 MB, 389,715 artifacts). Each issue would break or corrupt imports if not addressed.
Related docs: glintstone-schema.yaml (schema fixes), data-quality.md (trust architecture), import-pipeline-guide.md (validation per step).
Severity: High — breaks filtering, aggregation, and sorting.
The period field holds 46+ distinct formats. The same period appears with and without date ranges and uncertainty markers.
| Format | Count | Example |
|---|---|---|
| Canonical only | 194,556 | Old Babylonian |
| With date range | 73,211 | Old Babylonian (ca. 1900-1600 BC) |
| With uncertainty | 1,247 | Neo-Babylonian (ca. 626-539 BC) ? |
Fix (addressed in schema): period_normalized column + period_canon lookup table.
- Regex strip (ca. ...) and ? suffix
- Map to canonical via lookup
- Preserve raw in period column
- Schema ref: artifacts.period_normalized, period_canon table
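The strip-then-lookup steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the PERIOD_CANON dictionary stands in for the period_canon table, and the returned uncertainty flag is an assumption about how the ? marker might be kept.

```python
import re

# Hypothetical excerpt of the period_canon lookup: lowercased name -> canonical name.
PERIOD_CANON = {
    "old babylonian": "Old Babylonian",
    "neo-babylonian": "Neo-Babylonian",
}

# Matches a "(ca. ...)" date range together with the whitespace before it.
_DATE_RANGE = re.compile(r"\s*\(ca\.[^)]*\)")
# Matches a trailing "?" uncertainty marker.
_UNCERTAIN = re.compile(r"\s*\?\s*$")


def normalize_period(raw):
    """Return (period_normalized, is_uncertain) for a raw period string.

    The raw string itself is not modified; the caller keeps it in the period column.
    """
    uncertain = bool(_UNCERTAIN.search(raw))
    stripped = _UNCERTAIN.sub("", raw)
    stripped = _DATE_RANGE.sub("", stripped).strip()
    # Unmapped values return None so they can be reviewed instead of guessed.
    return PERIOD_CANON.get(stripped.lower()), uncertain
```

Returning None for values missing from the lookup keeps unknown formats visible for review rather than silently passing them through.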
Severity: High — breaks language filtering and ORACC cross-referencing.
- Multi-language artifacts: Sumerian; Akkadian (2,854), Akkadian; Persian; Elamite (138)
- CDLI uses English names, ORACC uses ISO codes (sux, akk, akk-x-stdbab)
- ORACC lemmas have -949 suffixes (akk-949) = uncertain language attribution
Fix (addressed in schema): language_map lookup table + languages JSON array column. -949 suffix becomes uncertain flag on lemmatization.
- Schema ref: artifacts.languages, language_map table
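A rough sketch of both halves of this fix, assuming the language_map table can be loaded as a plain dictionary. Only the two codes named above are mapped here; the function names are hypothetical.

```python
import json

# Hypothetical excerpt of language_map: CDLI English name -> ORACC/ISO code.
LANGUAGE_MAP = {
    "Sumerian": "sux",
    "Akkadian": "akk",
}


def parse_cdli_languages(raw):
    """Split a CDLI language string on ';' and return a JSON array of codes."""
    names = [part.strip() for part in raw.split(";") if part.strip()]
    # Names missing from the map are kept verbatim so they can be reviewed later.
    return json.dumps([LANGUAGE_MAP.get(name, name) for name in names])


def split_oracc_language(code):
    """Separate the '-949' suffix from an ORACC lemma language code.

    Returns (language_code, is_uncertain).
    """
    suffix = "-949"
    if code.endswith(suffix):
        return code[: -len(suffix)], True
    return code, False
```

For example, split_oracc_language("akk-949") yields the base code akk with the uncertain flag set, while akk-x-stdbab passes through unchanged.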
Severity: Medium — causes duplicate filter facets.
Case variants split one genre into two facets: Administrative (194,556) vs administrative (725). Multi-genre values also occur: Lexical; Mathematical (109).
Fix: genre_canon lookup table normalizing to title-case. genre = primary, genres JSON array for multi-genre.
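One possible shape for this normalization, assuming genre_canon is keyed by the lowercased variant and that the first genre listed is the primary one (the source does not say how the primary genre is chosen).

```python
# Hypothetical excerpt of genre_canon: lowercased variant -> canonical title-case name.
GENRE_CANON = {
    "administrative": "Administrative",
    "lexical": "Lexical",
    "mathematical": "Mathematical",
}


def normalize_genres(raw):
    """Return (genre, genres): the primary genre and the full de-duplicated list."""
    genres = []
    for part in raw.split(";"):
        name = part.strip()
        if not name:
            continue
        # Fall back to the raw name when the lookup has no entry for it.
        canonical = GENRE_CANON.get(name.lower(), name)
        if canonical not in genres:
            genres.append(canonical)
    # Assumption: the first listed genre is treated as primary.
    primary = genres[0] if genres else None
    return primary, genres
```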
Severity: High — migrations must correctly separate identified words from damaged/unread tokens.
| Category | Count | % |
|---|---|---|
| Fully identified words | 86,659 | 28.1% |
| Damaged/unread tokens | 221,912 | 71.9% |
| Partially read names | 39 | 0.01% |
Top unidentified forms: x (38,833 completely illegible), geš (4,777 determinative for wood), 1(N01) (4,678 archaic number tokens).
Fix: All ~309k rows become tokens. Only the 86,659 where cf IS NOT NULL, cf != '', and cf != 'X' get lemmatizations. Unidentified tokens exist at the reading layer with damage info.
Import rule (addressed in schema): IF cf IS NOT NULL AND cf != '' AND cf != 'X' THEN token + lemmatization ELSE token only
- Schema ref: tokens table (all ~309k), lemmatizations table (86k with real identifications)
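The import rule translates directly into a small predicate. In the sketch below, the row layout (a form key alongside cf) and the output dictionaries are assumptions made for illustration; only the cf test itself comes from the rule above.

```python
def has_identification(cf):
    """Apply the import rule: cf must be non-NULL, non-empty, and not 'X'."""
    return cf is not None and cf != "" and cf != "X"


def split_lemmas(rows):
    """Turn every source row into a token; add a lemmatization only when identified.

    Each row is assumed to be a dict with a 'form' key and an optional 'cf' key.
    """
    tokens, lemmatizations = [], []
    for token_id, row in enumerate(rows, start=1):
        tokens.append({"id": token_id, "form": row["form"]})
        if has_identification(row.get("cf")):
            lemmatizations.append({"token_id": token_id, "cf": row["cf"]})
    return tokens, lemmatizations
```

Keeping the predicate separate makes it easy to assert during migration that the lemmatization count matches the number of identified rows.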
Severity: High — 11,070 sign bounding box annotations unresolvable to canonical sign system.
CompVis annotations use MZL integers (839, 748, 10). Database uses OGSL names (A, AN, |A.AN|). No mapping exists.
216 annotations have empty surface values.
Fix:
- Auto-match MZL -> OGSL via Unicode bridge (eBL ebl.txt + OGSL)
- Auto-match via shared reading values
- Flag ~200-400 unresolved for manual curation
- Store in signs table: new mzl_number, abz_number columns
- Populate sign_id FK on sign_annotations via concordance
- Empty surfaces -> unknown
- Schema ref: signs.mzl_number, signs.abz_number, sign_annotations.sign_id
- Status: Addressed in schema; concordance.py implementation pending
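Since concordance.py is still pending, the following is only a sketch of the Unicode-bridge step. It assumes two mappings have already been parsed from the eBL and OGSL sources (MZL number to code point, OGSL name to code point); the parsing itself and the reading-value fallback are not shown.

```python
def build_concordance(mzl_to_unicode, ogsl_to_unicode):
    """Join MZL numbers to OGSL names through shared Unicode code points.

    Returns (matched, unresolved): a dict of MZL number -> OGSL name, and a
    sorted list of MZL numbers that need manual curation.
    """
    unicode_to_ogsl = {}
    for name, codepoint in ogsl_to_unicode.items():
        unicode_to_ogsl.setdefault(codepoint, []).append(name)

    matched, unresolved = {}, []
    for mzl, codepoint in mzl_to_unicode.items():
        names = unicode_to_ogsl.get(codepoint, [])
        if len(names) == 1:
            matched[mzl] = names[0]
        else:
            # No match, or more than one candidate: leave for manual curation.
            unresolved.append(mzl)
    return matched, sorted(unresolved)


def normalize_surface(value):
    """Map empty or missing surface values to the literal 'unknown'."""
    return value if value else "unknown"
```

Treating ambiguous matches as unresolved, rather than picking the first candidate, keeps the automatic pass conservative.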
Severity: Medium — breaks geographic aggregation and Pleiades linking.
Mixed formats: Nippur (mod. Nuffar), Sippar-Yahrurum (mod. Tell Abu Habbah) ?, uncertain (mod. Babylonia).
Fix: provenience_canon lookup table. provenience_normalized = ancient name only.
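As an illustration of extracting the ancient name before the provenience_canon lookup, here is a minimal sketch. Treating the literal placeholder "uncertain" as having no ancient name is an assumption, as is the returned uncertainty flag.

```python
import re

# Matches "(mod. ...)" modern-name annotations and the whitespace before them.
_MODERN = re.compile(r"\s*\(mod\.[^)]*\)")


def normalize_provenience(raw):
    """Return (ancient_name, is_uncertain) for a raw provenience string."""
    text = raw.strip()
    uncertain = text.endswith("?")
    text = text.rstrip("?").strip()
    ancient = _MODERN.sub("", text).strip()
    # Assumption: a literal "uncertain" placeholder carries no ancient name.
    if ancient.lower() == "uncertain":
        return None, True
    return ancient, uncertain
```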
Severity: Low — but will cause validation errors if uncaught.
Most P-numbers are 7 characters: P plus six digits (P000001-P999999). Some exceed that range: P1273754, P2757983.
Fix: p_number as TEXT. Validation regex: P\d{6,7}. Update zero-pad logic.
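A short sketch of the validation and padding logic, using the regex given above. The function names are hypothetical.

```python
import re

# P followed by six or seven digits, matched against the whole string.
_P_NUMBER = re.compile(r"P\d{6,7}")


def is_valid_p_number(value):
    """Check a P-number string against the P\\d{6,7} validation pattern."""
    return _P_NUMBER.fullmatch(value) is not None


def format_p_number(number):
    """Zero-pad to at least six digits without truncating seven-digit IDs."""
    return f"P{number:06d}"
```

Using fullmatch rather than match avoids accepting strings with trailing characters, and the :06d format sets a minimum width only, so seven-digit IDs pass through intact.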
Severity: Medium — these texts have lemmatization but no intermediate text structure.
ORACC has linguistic analysis for ~7,500 texts. 2,215 of these have no corresponding CDLI ATF.
Top orphans by lemma count: P507554 (4,266), Q000055 (2,633), P282465 (2,595).
Several are Q-numbers (composite texts) — scholarly reconstructions without a single physical ATF source.
Fix: Build text_lines from ORACC CDL data (which contains line structure). Set source=oracc. Create tokens from CDL nodes as normal.
Severity: High — users cannot see the full range of word meanings.
glossary_senses table schema exists but has 0 rows. Data IS available in ORACC glossary JSON (5,271 Sumerian entries have senses).
Example: Sumerian word "a" has 9 senses — arm (86%), strap (6%), horn (1%), etc.
Fix: Parse senses[] array from every ORACC glossary entry. Each sense has: mng (meaning), icount (frequency), pos, oid, nested forms[].
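A minimal sketch of flattening one glossary entry into rows for glossary_senses. The sense fields (mng, icount, pos, oid, forms) are the ones listed above; the output column names and the presence of an oid on the entry itself are assumptions.

```python
def extract_senses(entry):
    """Flatten one glossary entry's senses[] array into a list of row dicts."""
    rows = []
    for sense in entry.get("senses", []):
        rows.append({
            # Assumption: the parent entry carries its own oid.
            "entry_oid": entry.get("oid"),
            "sense_oid": sense.get("oid"),
            "meaning": sense.get("mng"),
            "pos": sense.get("pos"),
            # icount may arrive as a string or be missing; coerce to int.
            "icount": int(sense.get("icount") or 0),
            "form_count": len(sense.get("forms", [])),
        })
    return rows
```

Entries without a senses key simply yield no rows, so the same function can be run over every entry in a glossary file.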
Severity: Medium — comma-separated cache fields are fragile and duplicate data.
Composites store derived metadata as comma-separated strings (periods_cache, proveniences_cache, genres_cache). Period names themselves contain commas.
Fix: Remove cache columns. Derive via SQL view from artifact_composites JOIN artifacts. Composites keep only: q_number, designation, exemplar_count.
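To make the view-based approach concrete, here is a sketch using Python's built-in sqlite3 module as a stand-in for the real database. The artifact_composites column names (q_number, p_number) and the view name composite_periods are assumptions; only the composites columns come from the fix above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE artifacts (p_number TEXT PRIMARY KEY, period_normalized TEXT);
CREATE TABLE composites (
    q_number TEXT PRIMARY KEY, designation TEXT, exemplar_count INTEGER
);
CREATE TABLE artifact_composites (q_number TEXT, p_number TEXT);

-- One row per (composite, period): no delimiter, so commas in names are safe.
CREATE VIEW composite_periods AS
SELECT DISTINCT ac.q_number, a.period_normalized AS period
FROM artifact_composites AS ac
JOIN artifacts AS a ON a.p_number = ac.p_number
WHERE a.period_normalized IS NOT NULL;
"""


def create_schema(conn):
    """Create the illustrative tables and the derived view."""
    conn.executescript(SCHEMA)


def composite_periods(conn, q_number):
    """Return the distinct periods of a composite's exemplars, sorted by name."""
    cursor = conn.execute(
        "SELECT period FROM composite_periods WHERE q_number = ? ORDER BY period",
        (q_number,),
    )
    return [row[0] for row in cursor]
```

The same pattern would apply to proveniences and genres, each as its own view over the join.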
Severity: Low — but must not lose source attribution.
All 5,597 translations have source=cdli. The source column is being replaced with an annotation_run_id FK.
Resolution: No data loss. Create annotation_runs record for cdli:atf import. Every translation row gets annotation_run_id pointing to that record. Join to annotation_runs returns source_name='cdli:atf' with richer provenance (date, method, scope).
Severity: Medium — must preserve source attribution while avoiding confusion.
1,726 duplicate pairs across import methods. All share same guide_word and POS. Attestation counts and spelling variants differ.
Current project values (extracted, json) are import-method labels, not actual ORACC project names.
Fix:
- Keep ALL entries from all projects
- Correct project to actual ORACC project names (dcclt, saao, rinap, etc.)
- UNIQUE(headword, language, project) constraint
- For named_entities: dedup by headword+language, link back to ALL source entries
- API includes source attribution per project
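The named_entities step could look roughly like this. Glossary entries are assumed to be dicts with id, headword, language, and project keys; the output shape is hypothetical.

```python
from collections import defaultdict


def dedupe_named_entities(entries):
    """Group glossary entries by (headword, language), keeping every source entry.

    Nothing is dropped: each output record lists the ids and projects of all
    glossary entries that share its headword and language.
    """
    grouped = defaultdict(list)
    for entry in entries:
        grouped[(entry["headword"], entry["language"])].append(entry)

    return [
        {
            "headword": headword,
            "language": language,
            "source_entry_ids": [e["id"] for e in group],
            "projects": sorted({e["project"] for e in group}),
        }
        for (headword, language), group in grouped.items()
    ]
```

Because the grouping only builds links, the per-project glossary rows stay intact and the UNIQUE(headword, language, project) constraint is unaffected.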
Source: github.com/wittkensis/glintstone