Reference Data Model Artifacts
Layer 1 is the physical object: what it is, where it came from, what catalog records describe it.
artifacts — One row per cuneiform object. Key fields:
| Field | Notes |
|---|---|
| p_number | Primary key. CDLI universal identifier. |
| designation | Human-readable label (e.g., "CDLI Seals 001234") |
| museum_number | Museum accession number |
| excavation_number | Field excavation ID |
| period | Normalized period string (e.g., "Ur III (ca. 2112-2004 BC)") |
| provenience | Find site (e.g., "Nippur (mod. Nuffar)") |
| genre | Text genre (Administrative, Literary, Legal, etc.) |
| language | Primary language |
| object_type | Tablet, prism, cone, seal, etc. |
| material | Clay, stone, metal, etc. |
| pipeline_stage | One of: captured, recognized, transcribed, lemmatized, translated |
| collection | Holding institution |
surfaces — Named surfaces of the physical object: obverse, reverse, left/right/top/bottom edges, seal. One tablet can have multiple surfaces.
artifact_identifiers — Alternative identifiers for the same artifact: museum number, excavation number, primary publication. Multiple identifier types per artifact.
composites — Q-number composite texts (abstract literary works).
artifact_composites — Junction table linking P-numbers to Q-numbers.
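The relationship between artifacts, composites, and the junction table can be sketched as follows. This is a minimal illustration, not the project's actual DDL: only the table names and p_number / designation come from the description above, while q_number, title, the helper witnesses_of, and the sample identifiers are assumptions. SQLite is used only to keep the sketch self-contained.

```python
import sqlite3

# Hypothetical minimal schema. Table names, p_number and designation come
# from the text above; q_number and title are assumed column names.
SCHEMA = """
CREATE TABLE artifacts (
    p_number    TEXT PRIMARY KEY,
    designation TEXT
);
CREATE TABLE composites (
    q_number TEXT PRIMARY KEY,
    title    TEXT
);
CREATE TABLE artifact_composites (
    p_number TEXT NOT NULL REFERENCES artifacts(p_number),
    q_number TEXT NOT NULL REFERENCES composites(q_number),
    PRIMARY KEY (p_number, q_number)
);
"""


def witnesses_of(conn, q_number):
    """Return the P-numbers of all artifacts linked to one composite."""
    rows = conn.execute(
        "SELECT a.p_number FROM artifacts AS a "
        "JOIN artifact_composites AS ac ON ac.p_number = a.p_number "
        "WHERE ac.q_number = ? ORDER BY a.p_number",
        (q_number,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder identifiers, not real catalog entries.
conn.executemany(
    "INSERT INTO artifacts VALUES (?, ?)",
    [("P000001", "Example tablet A"), ("P000002", "Example tablet B")],
)
conn.execute(
    "INSERT INTO composites VALUES (?, ?)", ("Q000001", "Example composite")
)
conn.executemany(
    "INSERT INTO artifact_composites VALUES (?, ?)",
    [("P000001", "Q000001"), ("P000002", "Q000001")],
)

print(witnesses_of(conn, "Q000001"))
```

The composite primary key on the junction table prevents the same artifact from being linked to the same composite twice, while still allowing many artifacts per composite and many composites per artifact.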
Almost all Layer 1 data comes from the CDLI catalog (cdli_cat.csv), which provides 353,283 records with 64 metadata fields per artifact. CDLI is CC0 — public domain.
The CDLI bulk export was last updated in August 2022. Approximately 33,000 additional stub records exist for P-numbers present in the ATF file but absent from the catalog CSV — these have minimal metadata.
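One way to identify the stub candidates described above is a set difference between the P-numbers found in the ATF file and those in the catalog CSV. The sketch below assumes both ID lists have already been extracted; the function names and the shape of the stub record are hypothetical, not the project's import code.

```python
def stub_p_numbers(catalog_p_numbers, atf_p_numbers):
    """Return P-numbers present in the ATF file but absent from the catalog."""
    return sorted(set(atf_p_numbers) - set(catalog_p_numbers))


def make_stub(p_number):
    """Build a minimal record: only the key is known, metadata stays None.

    The field selection here is illustrative, not the full artifacts schema.
    """
    return {"p_number": p_number, "designation": None, "period": None}


# Placeholder identifiers, not real catalog entries.
catalog_ids = ["P000001", "P000002"]
atf_ids = ["P000002", "P000003", "P000003"]

stubs = [make_stub(p) for p in stub_p_numbers(catalog_ids, atf_ids)]
print(stubs)
```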
From a live database audit of the CDLI catalog:
| Field | Null rate |
|---|---|
| museum_number | 9.2% |
| excavation_number | 68% |
| width / height | 82% |
| thickness | 91% |
| period | 1.8% |
| provenience | 3.1% |
| genre | 1.4% |
| language | 5.2% |
Expect NULLs. Physical measurement data is especially sparse.
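A null rate like those in the table can be computed per field as the share of records with no value. The following is a minimal sketch over in-memory rows, not the audit query that produced the figures above; treating an empty string as missing is an assumption, and the sample rows are placeholders.

```python
def null_rate(rows, field):
    """Fraction of rows where `field` is absent, None, or an empty string."""
    if not rows:
        return 0.0
    # Assumption: an empty string counts as missing, as it would in a CSV export.
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


# Placeholder rows, not real catalog data.
sample = [
    {"p_number": "P000001", "thickness": None},
    {"p_number": "P000002", "thickness": 21.0},
    {"p_number": "P000003"},
    {"p_number": "P000004", "thickness": ""},
]

print(f"{null_rate(sample, 'thickness'):.0%}")
```

Code that reads measurement fields should check for None before doing arithmetic or formatting, since most records will not have them.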
Raw source data uses inconsistent strings for period, provenience, language, and genre. Canon tables normalize these to consistent values used throughout the app:
- period_canon: maps period strings to canonical forms and groups (e.g., "Third Millennium")
- provenience_canon: maps raw provenience strings to short canonical names
- language_map: maps ISO-style language codes to display labels
- genre_canon: maps raw genre strings to canonical forms
The filter system operates on canonical values.
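Canon lookup can be pictured as a dictionary from raw string to canonical value. The sketch below is illustrative only: the real canon tables live in the database, the two mappings shown are assumed examples built from strings quoted elsewhere on this page, and the whitespace and case folding in _key is an assumption about how raw values might be matched.

```python
# Illustrative entries only; the canonical values on the right are assumed.
PERIOD_CANON = {
    # raw string (folded) -> (canonical form, group)
    "ur iii (ca. 2112-2004 bc)": ("Ur III", "Third Millennium"),
}
PROVENIENCE_CANON = {
    # raw string (folded) -> short canonical name
    "nippur (mod. nuffar)": "Nippur",
}


def _key(raw):
    """Collapse runs of whitespace and lowercase the raw string."""
    return " ".join(raw.split()).lower()


def canonicalize(raw, table):
    """Look up a raw value in a canon table; None if missing or unmapped."""
    if raw is None:
        return None
    return table.get(_key(raw))


print(canonicalize("Ur III  (ca. 2112-2004 BC)", PERIOD_CANON))
print(canonicalize("Nippur (mod. Nuffar)", PROVENIENCE_CANON))
```

Returning None for unmapped strings keeps unknown raw values visible as gaps instead of silently passing them through to filters.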
pipeline_stage is a string enum on artifacts tracking the highest stage the artifact has reached: captured, recognized, transcribed, lemmatized, translated. It is set during ingestion and updated as new data arrives.
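Because the field records the highest stage reached, an update should never move an artifact backwards. A minimal sketch of that rule follows, assuming the stages are ordered as listed above; advance_stage is a hypothetical helper, not the project's ingestion code.

```python
# Assumption: the order in which the stages are listed is the pipeline order.
PIPELINE_STAGES = (
    "captured",
    "recognized",
    "transcribed",
    "lemmatized",
    "translated",
)


def advance_stage(current, reached):
    """Return the later of the two stages; `current` may be None."""
    if reached not in PIPELINE_STAGES:
        raise ValueError(f"unknown pipeline stage: {reached!r}")
    if current is None:
        return reached
    # max() with key=index compares stages by their pipeline position.
    return max(current, reached, key=PIPELINE_STAGES.index)


print(advance_stage("captured", "lemmatized"))
print(advance_stage("transcribed", "captured"))
```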
Source: github.com/wittkensis/glintstone