-
Notifications
You must be signed in to change notification settings - Fork 0
Data Model Import Pipeline Guide
This document summarizes the ETL process for populating the schema from source data. For the full technical specification (dependencies, row estimates, validation thresholds), see import-pipeline.yaml.
The pipeline follows four stages:
- Extract — source-specific parsers read raw data files (CSV, JSON, ATF, BibTeX)
- Transform — normalization via lookup tables and sign concordance mapping
-
Load — upsert into PostgreSQL with
annotation_run_idprovenance on every record -
Derive — computed tables, materialized views,
pipeline_statusrefresh
All connectors live in ingestion/connectors/ and are subclasses of SourceConnector (or ModelConnector for ML runs). The connector framework is documented in .claude/skills/gs-expert-integrations/. Run via python -m ingestion.cli.
Canonical mapping functions for the five normalization lookup tables:
-
normalize_period("Old Babylonian (ca. 1900-1600 BC)")->"Old Babylonian" -
normalize_provenience("Nippur (mod. Nuffar)")->"Nippur" -
normalize_genre("administrative")->"Administrative" -
normalize_language("Sumerian")->"sux" -
normalize_surface("@obverse")->"obverse"
Sign system bridge functions:
-
mzl_to_ogsl(839)->"KA"(via Unicode codepoint matching) -
abz_to_ogsl("ABZ123")-> OGSL sign_id -
ogsl_to_unicode("KA")->"U+12157"
Auto-match via Unicode covers ~90% of signs. ~200-400 require manual curation.
ImportCheckpoint class for resumable imports:
- SIGINT/SIGTERM signal handling with graceful state save
- Atomic state file writes (JSON in
_progress/directory) - Source checksums detect upstream changes since last run
- Progress reporting: processed / inserted / updated / skipped / errors
Every import step creates an annotation_runs record before writing data. All rows carry the annotation_run_id FK. Fields: source_name, source_version, tool_name, tool_version, started_at, completed_at, config_snapshot, row_count.
| Step | Target Tables | Source | Estimated Rows |
|---|---|---|---|
| 1. Seed lookup tables | period_canon, language_map, genre_canon, provenience_canon, surface_canon | Manual curation from live data audit | ~500 total |
| 2. Create retroactive annotation_runs | annotation_runs | Manual (13 records for existing v1 data sources) | 13 |
| 3. Seed scholars | scholars | ORACC project metadata, CDLI credits | 200-500 |
| 4. Import signs + concordance | signs, sign_values, sign_variants | OGSL sign list + MZL/ABZ bridge data | 3,367 signs, ~15k values |
| Step | Target Tables | Source | Estimated Rows |
|---|---|---|---|
| 5. Import artifacts | artifacts | CDLI catalog CSV + ORACC enrichment | 389,715 |
| 6. Seed artifact identifiers | artifact_identifiers | artifacts.museum_no, .excavation_no, etc. | ~500k |
| 7. Parse surfaces | surfaces | ATF @surface markers + CompVis view_desc | ~300k |
| 8. Parse text_lines | text_lines | ATF line-by-line decomposition + ORACC CDL | ~2M |
| 9. Import tokens | tokens | ORACC CDL nodes + ATF tokenizer | ~5M |
| 10. Import token_readings | token_readings | ORACC CDL grapheme-level data | ~8M |
| Step | Target Tables | Source | Estimated Rows |
|---|---|---|---|
| 11. Import lemmatizations | lemmatizations | ORACC corpus (86k real identifications) | ~86k |
| 12. Parse morphology | morphology | ORACC morph strings, structured decomposition | ~50k |
| 13. Import glossary | glossary_entries, glossary_forms, glossary_senses | ORACC glossary JSON (all projects) | ~30k entries, ~50k forms, ~5k senses |
| 14. Import translations | translations | ATF #tr.XX: markers | ~8k |
| Step | Target Tables | Source | Estimated Rows |
|---|---|---|---|
| 15. Import sign annotations | sign_annotations | CompVis (MZL -> OGSL concordance resolution) | 11,070 |
| 16. Derive named entities | named_entities, entity_mentions | glossary_entries POS tags | ~5k entities, ~20k mentions |
| 17. Import composites | composites, artifact_composites | ATF >>Q markers | 857 composites, 3,771 links |
| Step | Target Tables | Source | Status |
|---|---|---|---|
| 18. Citation pipeline | publications, scholars, artifact_editions, fragment_joins, scholarly_annotations | 9 external sources |
BUILT (10 scripts in ingestion/connectors/citations/) |
| 19. Refresh views | pipeline_status, materialized views | Computed from all tables | Design only |
run_full_import.sh runs all steps in dependency order:
# Full import (all steps)
./ingestion/connectors/run_full_import.sh
# Resume from step N
./ingestion/connectors/run_full_import.sh --from 8
# Dry run (validate sources, no writes)
./ingestion/connectors/run_full_import.sh --dry-run
# Reset database and reimport
./ingestion/connectors/run_full_import.sh --resetSteps must run in order. Key dependencies:
- Signs (step 4) before sign_annotations (step 15) -- concordance needed
- Artifacts (step 5) before surfaces (step 7), text_lines (step 8)
- Tokens (step 9) before lemmatizations (step 11)
- Publications (step 18) before annotation_runs.publication_id backfill
- All data steps before pipeline_status refresh (step 19)
Every step runs post-import checks:
- Row count vs expected (warning if deviation > 5%)
- NULL audit on critical columns
- FK integrity verification
- annotation_run_id NOT NULL enforcement
-
Step 18 (citation pipeline): Built and tested. 10 scripts + shared libraries + verification scripts in
ingestion/connectors/citations/. - Steps 1-17, 19: Designed in import-pipeline.yaml. Not yet implemented.
-
Shared infrastructure: Specified but not coded (normalize.py, concordance.py, checkpoint.py -- pattern established by citation pipeline's
lib/directory).
For field-level source mappings and transform rules, see source-mapping.yaml.
Source: github.com/wittkensis/glintstone · Issues · Edit this wiki
Start here
Getting Started
Overview
Data Model
- Data Sources
- Data Quality
- Data Issues
- Import Pipeline Guide
- ML Integration
- Citation Pipeline Summary
Reference — Data Model
Reference — API
Reference — MCP
Opportunities
Personas
Project
Research